The synthesis method is set using the Parameter command. This command takes the argument Synth_Method, followed by an atomic name for the synthesis type. Although simply calling this function changes the synthesis method, each method may require further setting up. This section describes those dependencies.
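For example, to select the unit-database synthesizer described later in this section:

```lisp
(Parameter Synth_Method UDB)
```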
A number of waveform synthesis methods are already available. The number should increase, and the current ones be improved. See section CHATR Commands, subsection `Parameter Synth_Method', for a reasonably up-to-date list. That section is generated from a source which can be found in file `$CHATR_ROOT/src/chatr/commands.c' (and may even be up to date). Most modules are optional. To determine what is installed, look for the existence of the relevantly-named directory under the path `$CHATR_ROOT/src/'. No directory present, no module installed.
This is selected using the command
(Parameter Synth_Method FORMANT_SYN)
No other setup is required. This method uses a public domain version of a formant synthesizer based on techniques described in Allen 87. The output is surprisingly bad, perhaps partly due to a mismatch in phoneme names (phonemes are mapped ad hoc in this method), or to actual bugs in the code (probably porting bugs). In any case, the result is difficult to understand.
This is selected using the command
(Parameter Synth_Method ISARD)
Before this method may be used, it is necessary to load the diphone index and the LPC representation of the diphones themselves. This is achieved using the Load_Isard command, which takes two arguments, both filenames. The first is the index file and the second is the diphone data. The given files are accessed via the library load-path, so if they exist in a directory named in the value of the variable load-path, no absolute path names are required. In fact the values for the two arguments will almost always be the same.
(Load_Isard "../dbs/isard/diphlocs.txt" "../dbs/isard/engcdn.stf")
This synthesizer was originally written by Steve Isard of Edinburgh University. It should be stressed that this is NOT the `CSTR synthesizer'. The diphones are LPC encoded, allowing easy modification of pitch and duration at concatenation time. The diphones are British English (RP) and hence will not always sound good with American English pronunciation.
A second diphone synthesizer is also included. It is the waveform synthesizer developed at the Centre for Speech Technology Research, University of Edinburgh. The system is designed to use a number of different diphone sets, though currently only one is available. It allows use of these sets in different encodings.
Before this method can be used it is necessary to load the diphones. The main function to do this is
(Load_Taylor)
This function acts on the values of certain Lisp variables. If set, the values should be

`T_Index_Name'
`T_Dictionary_Name'
`T_Vox_Path'
`T_Pm_Path'
`T_Sample_Rate'
`T_Diphone_Storage'
Either GROUPED, indicating all diphone waveforms and pitch-marks are compiled into a single dictionary, or SEPARATE, indicating there is one waveform and pitch-mark file per nonsense word.
`T_Diphone_Type'
One of WAVEFORM, SHORTWAVEFORM, FRAMES, LPC, CODED_4, CODED_5, CODED_6, CODED_ALAW, PITCH_LPC, RES_LPC, MAX_DIPHONES or AVAILABLE_DIPHONES.
Not all of the above variables need to be set. A typical setting in the ATR-ITL environment would be
(set T_Dictionary_Path "/usr/pi/data/diphones/gw/group/gw.vox.diph")
(set T_Index_Path "/usr/pi/data/diphones/gw/dictionary/diphdic.grp")
(set T_Sample_Rate "20000")
(set T_Diphone_Type "WAVEFORM")
(set T_Diphone_Storage "GROUPED")
(Load_Taylor)
(Parameter Synth_Method TAYLOR)
This module is the most developed synthesis system within CHATR. Waveform synthesis is achieved by concatenating labeled units from a database of natural speech. Only general aspects are covered in this section, for a full description see section Unit Databases.
The UDB (Unit DataBase) module tries to deal with speech databases in a uniform abstract way. Once a database is described and loaded, it may be selected as the synthesis method using the command
(Parameter Synth_Method UDB)
Unit selection strategy is usually set up at database definition time. It may be changed using the command
(Database Set Strategy Simple)
There are several strategies available, though the two most usable are Simple and Generic.
After selection, units may be concatenated by a number of methods. See section Unit Concatenation Methods, for details.
An initial port of the non-uniform unit concatenative synthesizer developed previously at ATR is also included in this version of CHATR. Note that the port is still a little buggy but is beginning to be functional. Note this synthesizer is completely different in code (though not so much in spirit) from the unit selection system described above. This system only supports Japanese, but different databases may be loaded.
Different databases may be selected at run time. Currently there are two available, one male (MHT) and one female (FKN). The high level NUUTALK module has MHT intonation statistics hardwired, so it does not synthesize a female voice with appropriate intonation, but it may be used to synthesize female voices from lower levels of input (e.g. `segF0').
The example speaker `nuu_mht' sets up synthesis for MHT using the NUUTALK system
(speaker_nuu_mht)
A typical romaji input for this synthesis method is
(Utterance Nuutalk ((Ninput arayuru geNjituwo, subete, jibuNnohouhe nejimagetanoda)))
More examples are available in files `$CHATR_ROOT/lib/utterance/jpex**.utt'. (Substitute numbers for **.) File `~/jpex02.utt' is a `segf0'-input example of an original MHT spoken phrase. Files `~/jpex03.utt' and `~/jpex04.utt' contain FKN `segf0' examples. There may still be problems with the CHATR port of female speech, as FKN does not sound as good as MHT.
After selection, the same concatenation methods as for standard unit selection (see section Unit Concatenation Methods) can be used, namely NUUCEP, PS_PSOLA, DUMB, DUMB+ and NULL. These are set through the Parameter Concat_Method command.
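For example (a sketch; any of the methods listed above may be substituted):

```lisp
(Parameter Concat_Method PS_PSOLA)
```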
Some parameter settings affect the process as follows

cep_dist/vq_dist
The command
(set NT_cost_type 'cep_dist)
causes each candidate unit to be checked against possible matches with a distance measure for whole cepstrum vectors. This typically causes the system to be slow, as many cepstrum files must be accessed. The second option is selected using
(set NT_cost_type 'vq_dist)
This uses a set of vector quantizations for the cepstrum vectors (actually MFCC), allowing a much faster selection process. So far not much experimentation has been done in this area, but even this attempt produces similar results in half the processing time. In order to use this, a vector quantization table must be included in the data at build time.
garbage collection
The cepstrum cache flushing strategy is controlled by the variable NT_cep_gc_strategy, as in
(set NT_cep_gc_strategy 'NONE)
(set NT_cep_gc_strategy '500)
If set to NONE, the cache is never flushed, which will cause you to run out of space after some time. If set to a number, after that number of cepstrum files have been loaded the cache will be completely flushed at the end of the following utterance. If set to any other value, the cache is fully flushed at the end of each utterance.
When running with vq_dist, the cep cache is mostly useless. Only a few files are actually read, so the default is perfectly adequate. When the cep_dist strategy is used, the cache becomes more useful, but has to be pretty large (hundreds of entries) before it has any effect.
Although CHATR can synthesize an utterance in less time than it takes to say it, if the utterance contains 30 seconds of speech you still need to wait around 20 seconds before the first word is heard. As utterances are typically `sentences', their size can vary drastically from a single word to a whole paragraph. A more practical method is to synthesize prosodic phrase by prosodic phrase rather than sentence by sentence. Prosodic phrases (assuming we can predict them adequately) do have an upper limit (based on the size of a speaker's lungs), so in general should not last for tens of seconds.
CHATR has an option (at waveform synthesis time) to synthesize the utterance in parts rather than as a whole, thus reducing the time until the first waveform is generated. It does not actually do this by prosodic phrase, but sections the utterance into parts separated by silence, as predicted by higher levels of the system. Moreover, the silences themselves may be generated by a number of options--although it would be nice to select natural pauses from a database, in practice our databases do not contain a good distribution of natural inter-sentential pauses.
There are disadvantages, however. When using the DATLINK as standard audio output (as many researchers do), this technique fails to produce natural sounding speech. This is because the DATLINK always introduces a substantial pause between waveforms, typically over a second and often longer. This length of pause may be acceptable at utterance major phrase boundaries, but not at those of minor phrases. The second disadvantage is that although the main utterance is split into sub-utterances, the information that would normally be available in an utterance after synthesis is not copied from the sub-utterances back into the main utterance--in particular, the units selected and the unit/target costs. This second problem is less important, as phrase-by-phrase synthesis will normally only be used in time-critical applications (such as text-to-speech), when investigating the details of the synthesis is not of interest.
Phrase-by-phrase synthesis is controlled through the parameter variable syn_params. As with other parameter variables, it takes a list of pairs as a value. Each pair consists of a parameter name followed by a parameter value. The parameter names are

phrase_by_phrase
When `on', waveform synthesis (through the selected Synth_Method) happens phrase by phrase. Default is `off'.
whole_wave
silence_method
If set to zeros, the silences will not be synthesized by selecting units from the database, but by creating small waveforms of zeros.
hardware_silence
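Following the pair format described above, a setting that enables phrase-by-phrase synthesis with zero-waveform silences might look like the following sketch (that parameter variables are set with set, and the exact values `on' and `zeros', are assumptions based on the descriptions above):

```lisp
;; Assumed sketch: enable phrase-by-phrase synthesis and synthesize
;; silences as small waveforms of zeros.
(set syn_params '((phrase_by_phrase on)
                  (silence_method zeros)))
```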
An example use of phrase-by-phrase synthesis is given in the Lisp function ntts, defined in `$CHATR_ROOT/lib/data/tts.ch'. If synth_hook is set, use of this method is a little more complicated: defined functions will be applied to the synthesized sub-utterances rather than the whole utterance.
After synthesis, the output waveform may be passed through a number of filters. One of the most common filters is one that changes the volume. When multiple speakers are used in the same session, different inherent volumes in the database may make one speaker sound much quieter than another, so a change is desired.
There are other filters, including high and low pass.
The filter selection command is
(Filter_Wave UTT FILTERNAME [optional arguments])
Calling Filter_Wave with no arguments gives a list of available filters and their arguments.
(set utt1 (Utterance Text "Good morning"))
(Synth utt1)
(Say (Filter_Wave utt1 'Chorus))
(Say (Filter_Wave utt1 'Backwards))
These filters destructively modify the waveform in their utterance argument.
Two utterances may also be combined using
(set utt3 (Merge_Waves utt1 utt2))
Different sample rates are catered for automatically.
For volume control there is a specific function which will modify the volume between maximum and minimum
(Regain_Wave utt1 '0.9)
Maximum volume is 1.0, minimum 0.0.
Waveforms may be changed to a different sample rate using the function
(Resamp_Wave utt1 12000)
Note that all of these functions may be called on every utterance by using the synth_hook variable. If this contains a list of functions, they will be automatically applied to the utterance after waveform synthesis.
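For instance, to normalize volume automatically after every synthesis, one might add Regain_Wave to the hook. Since Regain_Wave takes an extra argument, a wrapper is needed; the lambda form here is an assumption about the Lisp dialect:

```lisp
;; Sketch: apply volume normalization to every synthesized utterance.
;; The lambda wrapper is an assumption; Regain_Wave is described above.
(set synth_hook
     (list (lambda (utt) (Regain_Wave utt '0.9))))
```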