This chapter runs through an example of gathering and characterizing the files necessary for CHATR to build a synthesizer based on a speech corpus database. The process is long and requires much disk space and CPU time. Although it is mostly automatic, there are a number of stages where informed decisions need to be made. A familiarity with the operation will greatly aid you in successfully building a usable synthesis database.
Apart from the main building and training scripts, various `awk' and `sed' one-liners have evolved during the development and use of CHATR; since these are heavily environment-specific and developers may prefer to generate their own, they have not been included in the main text but gathered together in an appendix. See section Various Short Useful(?) Scripts, for ideas.
Before proceeding further there will be a short explanation of some database-building terminology as specifically used in this chapter.
Each waveform file is identified by a short identifier, a file-id. This will typically be the name of the file minus any extension. For example, if the files are called
sc001.wav sc002.wav sc003.wav ...
then the file-ids are
sc001 sc002 sc003 ...
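One simple way to derive file-ids (assuming the extension is `.wav') is a shell loop such as

```shell
# strip the .wav extension from each filename to get its file-id
for f in sc001.wav sc002.wav sc003.wav; do
  basename "$f" .wav
done
```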
The following are just brief descriptions of the contents of sub-directories created while making a synthesis database. See section Preparing the Database, for details on how to acquire the initial files.
`wav/'
     The waveform files.
`lab/'
     The phoneme label files.
`db_utils/'
     A link to the database-building scripts and binaries.
`stats/'
     Statistics generated during training.
`units/'
     The unit description files.
`f0/'
     The F0 files.
`pm/'
     The pitch mark files.
`cep/'
     The cepstrum files.
`vq/'
     The vector quantization files.
`chatr/seg/'
     CHATR utterance files used for natural-target testing.
`index/'
     The index files created during training.
This procedure will eventually create a fully trained database with index files. To use the resulting database within CHATR, only one definition command needs to be executed. See section Defining a Speaker, for details.
The construction process requires access to the following software packages
Choose a short name for your database and create a directory for it. All files will be generated in that directory by default. It will be referred to as the `Speaker Top Directory' for the rest of this chapter, and should be selected as the working directory when issuing any shell commands.
Only one place in the ultimate database definition refers to a speaker directory, so it may be easily moved afterwards.
In your newly created Speaker Top Directory, you will need
Copy the waveform and phoneme label files into the `wav/' and `lab/' directories respectively. Ensure byte-order is correct and headers are removed. See section Various Short Useful(?) Scripts, for some simple scripts which may help automate this task.
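The exact clean-up commands depend on the source format. As one hedged example, if the incoming waveforms carry a standard 44-byte RIFF header (the directory name `incoming/' here is purely illustrative), the header can be stripped with a one-liner:

```shell
# skip the first 44 bytes (a hypothetical RIFF header); tail -c +45
# starts output at byte 45, leaving only the raw samples
for f in incoming/*.wav; do
  tail -c +45 "$f" > "wav/$(basename "$f")"
done
```

Headers are not always 44 bytes (RIFF files may carry extra chunks), so verify with a waveform viewer before batch-converting.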
Create a file called `files' to contain a list of all the waveform files in the database. Assuming only the waveform files are called `*.wav', use the shell command
ls wav/*.wav > files
This file is used by the database training shell scripts to determine which files are to be processed. If (when?) things go wrong, it may be edited to remove the names of the files causing errors, i.e., include only the files that are to be used.
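For example, a file that causes errors (the file-id `sc042' here is hypothetical) can be dropped from the list with:

```shell
# remove one problem file from the list of files to be processed
grep -v 'sc042' files > files.tmp && mv files.tmp files
```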
Symbolically link `db_utils/' with `$CHATR_ROOT/chatr/db_utils/' using the shell command
ln -s $CHATR_ROOT/chatr/db_utils/ db_utils
These files contain the scripts and binaries used to build a database.
Create the database to hold index files created during training. Use the shell command
mkdir index
Three files need to be copied and characterized before proceeding. They are
db_description
index/DBNAME_synth.ch
index/DBNAME_train.ch
Templates are available in `$CHATR_ROOT/db_utils/'.
See section Database Description File, for details of the required editing to the Database Description file. See section Database Parameters File, for details of the required editing to the Database Parameters file. See section Training Parameters File, for details of the required editing to the Training Parameters file. Finally, don't see section Advanced Features, unless you're a researcher who wants to experiment with some of the finer points of the system.
The Database Description file `db_description' is used to select initial database parameters.
It is recommended to copy the template from the `$CHATR_ROOT/db_utils/' directory.
Copy into the Speaker Top Directory using the shell command
cp $CHATR_ROOT/db_utils/db_description db_description
Edit that file to specify the database name, the phoneme set, and the sample rate for the database.
It may be necessary to modify the variable GET_F0_PARAMS, which specifies the minimum and maximum expected F0 for the speaker. It helps to set the range to the likely limits for the particular speaker database, so the defaults may not necessarily be suitable. This is especially so for male speakers, where the lowest value in the database may be below the default minimum. If the subject is male, at least un-comment this line.
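How the line looks depends on the template, but assuming it is simply commented out with a leading `#', it can be enabled in place with a one-liner like:

```shell
# un-comment the GET_F0_PARAMS line in db_description
# (assumes the template marks it with a leading '#')
sed 's/^#\(GET_F0_PARAMS\)/\1/' db_description > db_description.tmp \
  && mv db_description.tmp db_description
```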
Some values in the middle section may also need to change depending on the environment. Run with the defaults as a start; they should suffice.
All parameters of a speech database are described in the Database Parameters file called `DBNAME_synth.ch'. This file describes both the database itself and other aspects of the voice, such as lexicon, intonation, and duration parameters.
It is recommended to copy the template in `$CHATR_ROOT/db_utils/DBNAME_synth.ch' to `index/', replacing `DBNAME' with the name by which you wish the speaker to be known.
Use the shell command
cp $CHATR_ROOT/db_utils/DBNAME_synth.ch index/NEW-NAME_synth.ch
This file must be edited. In many parts the required editing is essential to enable the building of a speech database; other areas contain advanced features only of interest to researchers. For that reason this section is further divided into two parts. See section Essential Editing, for the necessary. See section Advanced Features, for the heavy stuff.
The database parameters file can be viewed in two parts: initialization and selection. When loaded, this file should initialize and load all necessary parameters for the use of the database as a CHATR speaker. The function speaker_DBNAME (at the end of the file) should select the actual database and auxiliary parameters required. The idea is that users will call that function when changing between alternate speakers.
Every occurrence of `<>' marks a part that requires editing. In general, change occurrences of DBNAME to the database name, PHONESET to the phoneme set name, and DICTNAME to the appropriate dictionary name.
Make a note of the directory where the data is defined, for use when the function defspeaker is eventually called. Be aware that the variable DBNAME_data_dir will be set before the file `DBNAME_synth.ch' is loaded.
Each section of the template will now be examined in detail.
First, decide on the phoneme set you wish to use and ensure it is loaded. If the phoneme set is a standard set (i.e. `radio2', `mrpa', `BEEP', or `nuuph') you may simply require the definition file. If not, you must define your phoneme set in a file in the `index/' directory. Note that all unit names in the database must be members of this phoneme set. A commented-out example of loading a definition specific to a database is available. See section Defining a New Phoneme Set, for more details.
The main database declaration is next. It defines the name of the index file, and the format of the waveform, pitch mark and cepstrum files. If you use the standard database as your platform, the example names will be satisfactory--but remember to remove the `<>' marks. This section also defines the wave file type and sample rate (this must be set), as well as the phoneme set.
The Silence definition is used in unit selection as a context for units which come at the start or end of a file. Ensure that the example silence entry has a reasonable value for all fields that exist in your database. It is assumed that there is an effective silence before and after each file in the database; however, good database design should ensure there are actually some silence units in the waveform as well. As is commented in the template (but still easily overlooked!), make sure the phoneme-name for space (the character(s) between the double quotes) corresponds to that used in the chosen phoneme set.
Depending on the phoneme set chosen, the next section may not require editing. It is a definition of clusters of phonemes which share discrete distance functions. The names used are of course phoneme-set dependent. The template lists phonemes belonging to the `mrpa' set. If a different set is selected, the phonemes in this section must be altered to match those in that set.
The names and number of clusters are arbitrary and may be altered to suit. Groups with similar articulatory characteristics work well. Note that all unit names in the database must be in at least one class. As an example, using the mrpa phoneme set, a possible clustering is
(set DBNAME_PhoneSets
  '( (lowv (a o oo aa ai oi au ou u@))
     (midv (uh e uu @@ ei e@))
     (highv (i u ii i@ @))
     (plosive (p t h b d g k ch jh))
     (fricative (s z sh zh f v th dh))
     (nasal_n (n ny))
     (glide (l r w y))
     (nasal (m n ng))
     (misc (#)) ))
It is important that the groups have a reasonable number of members. If there are too few, training will not be possible. Likewise, if there are too many occurrences within a group, it may require too much disk or swap space to calculate.
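One way to check the occurrence counts before settling on a clustering (assuming xlabel-style label files where the phone name is the last whitespace-separated field) is:

```shell
# count occurrences of each phone label across the database,
# most frequent first
cat lab/*.lab | awk '{print $NF}' | sort | uniq -c | sort -rn
```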
The next section of the template defines which distance functions are to be trained and used. No editing is essential here. See section Advanced Features, if you wish to experiment.
The next significant section contains nus_DBNAME_params, which defines some general parameters for the unit selection process. Their current values are probably acceptable, although the beam and candidate widths could possibly be reduced. See section Variable Index, for details of their values.
The following section is not executed during training mode, as it is then that the files to be loaded are generated. Defaults are already selected. See section Advanced Features, if automatic `pruning' of decision-trees is required.
If (and only if) training is not possible for some reason, a weights file should be created to name the distance functions that are to be used. Reasonable guesses for weights are possible. The format is a list of weights for each phoneme class. Each weight consists of a single phone or list of phones in the class, followed by a list of distance function and weight pairs. A special phone named any may be used to cover all phonemes not otherwise specified. One suitable default weights file might contain
(quote
 ((any (p_phone_ident 0.3)
       (n_phone_ident 0.3)
       (duration 0.5)
       (pitch 1.0)
       (p_pitch 0.5) )))
Next the lexicon must be defined and selected. A number of lexicons are already built into CHATR. Presently they are
cmu
beep
mrpa
japanese
The file `$CHATR_ROOT/lib/data/lexicons.ch' contains definition functions for the above lexicons.
Remove the comment characters from the line containing the language required and insert the name of the lexicon you wish to use. Ideally the lexicon base dialect should match that of the speech corpus about to be trained. For example, using a `beep' lexicon to supply words to an American English corpus results in somewhat odd speech. There are more appropriate ways to transform speakers; see section Phoneme Maps, for details.
See section Lexicon, for more details on building your own lexicons.
For duration set-up, a number of choices are available. A linear regression model can be used which is trainable for multiple languages. Examples built from f2b (American English) and MHT (Japanese) exist and can be used with other databases of the same language. A neural-net based model is also available but does not train well, though the example included from f2b is acceptable. For Japanese, a built-in model exists. It has no external parameters and requires no set-up. See section Duration, for more details, including training new duration models. Specifically, for an example of tuning a linear regression model, see section Training a New Duration Model.
The second part of the duration definition defines pause durations at phrase boundaries. An appropriate definition must be loaded.
There are a number of intonation systems built into CHATR. The most stable ones are based around ToBI, and hence for building working speech synthesis voices ToBI is recommended. The appropriate parameters should be set for that speaker. See section Variable Index, for the appropriate values for ToBI_params and mb_params. For more details on building intonation models see section Intonation. One stable method for predicting F0 from ToBI labels is a model using linear regression. Two models have been included with the system, one for English (from f2b) and one for Japanese (from MHT). These models can be mapped to another speaker's F0 range given the target speaker's F0 mean and standard deviation. These speaker-specific parameters are generated by the script make_tobif0_params during the building of a database.
The final part of this file defines a function that when called will select the appropriate parameters to cause the synthesizer to use that voice. For efficiency you should try to ensure everything is loaded into CHATR and this function need only set some variables. Again follow the comments and modify (comment out/uncomment) the sections appropriate to the database you are building.
Note that if you set any other variables in your speaker_DBNAME function, you have to ensure that the values are reset when the synthesizer switches to another speaker. In order to do this without editing all other speaker synth files, you can redefine the speaker_reset function. See section Speaker Reset Function, for an example.
This section describes the features of the Database Parameters file which do not necessarily require editing for the building of a new speech database, but will be of interest to those involved in speech synthesis research.
Distance functions fall into two classes: shared and table.
Shared distance functions are in general continuous. The default set is
(set DBNAME_SharedDFs
  '( (p_phone_ident -1 phone ident eql)
     (n_phone_ident 1 phone ident eql)
     (duration 0 dur_z ident abs)
     (pitch 0 pitch_z ident abs)
     (p_pitch -1 pitch_z ident abs)
     (n_pitch 1 pitch_z ident abs) ))
Table distance functions are discrete fields which will be trained in the phonetic groups defined above. The default set is
(set DBNAME_TableDFs
  '( (p_vc -1 phone ph_vc 2)
     (p_height -1 phone ph_height 4)
     (p_length -1 phone ph_length 6)
     (p_front -1 phone ph_front 4)
     (p_v_rnd -1 phone ph_v_rnd 2)
     (p_c_type -1 phone ph_c_type 7)
     (p_c_place -1 phone ph_c_place 7)
     (p_c_vox -1 phone ph_c_vox 2)
     (n_vc 1 phone ph_vc 2)
     (n_height 1 phone ph_height 4)
     (n_length 1 phone ph_length 6)
     (n_front 1 phone ph_front 4)
     (n_v_rnd 1 phone ph_v_rnd 2)
     (n_c_type 1 phone ph_c_type 7)
     (n_c_place 1 phone ph_c_place 7)
     (n_c_vox 1 phone ph_c_vox 2) ))
The above lists are automatically expanded into actual distance function definitions. See section Distance Functions, for a full description. The fields are: distance name; offset (-1 previous phone, 0 current, 1 next phone); the field name in the database to apply it to; the mapping function (ident, log, or a map name for table functions); and the distance type or table size to use.
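As a reading aid (this annotation is not part of the template itself), the first entry of the default table set breaks down as:

```lisp
;;  name    offset  field    mapping   table size
;; (p_vc    -1      phone    ph_vc     2)
;; i.e. the vowel/consonant feature of the previous phone,
;; a discrete table with 2 possible values
```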
The next section of the Database Parameters file defines parameters used in unit selection. Values (continuity weights in particular) may be changed as wished to facilitate hand-tuning. Two values are relevant to training but ignored during normal synthesis. They are
(dur_penalty 1.0)
(endpoint_weight 0.0)
Training produces a number of files which need to be loaded at synthesis time but not during training. This is handled by the `DBNAME_synth.ch' file, where there is a set of commands which are only executed in non-training mode. All have defaults, but researchers will surely wish to explore other options.
The first command is mandatory. It is necessary to select which set of weights to include when this file is actually used. The line is
(set DBNAME_DiscTables (load (strcat DBNAME_data_dir "index/DiscreteTables.ch")))
The next line selects the level of pruning. If no pruning is to be done, only the 0-level weights are needed. The template-default statement is
(set DBNAME_Weights (load (strcat DBNAME_data_dir "index/weights0.ch")))
If pruning is required, instead select the weights appropriate for the pruning level you desire. For example
(set DBNAME_Weights (load (strcat DBNAME_data_dir "index/weights2.ch")))
Note: select only one level; comment out the default if another is chosen. See section Pruning, for more information.
The next functions set up the trained weights for synthesis. They are
(SetTableDFs DBNAME_PhoneSets DBNAME_TableDFs DBNAME_DiscTables)
(SetSharedDFs DBNAME_SharedDFs)
(Database Set Weights DBNAME_Weights)
The Training Parameters file is used to train the unit selection weights.
It is recommended to copy the template in `$CHATR_ROOT/db_utils/DBNAME_train.ch' to `index/', replacing `DBNAME' with the name by which you wish the speaker to be known.
Use the shell command
cp $CHATR_ROOT/db_utils/DBNAME_train.ch index/NEW-NAME_train.ch
This file must be edited. All occurrences of DBNAME should be replaced with the speaker database name chosen above.
This is the only required editing. There are a large number of configuration parameters which will be of interest to researchers but are beyond the scope of this manual; general comments are given throughout the script.
Databases are large and therefore very likely to contain errors. A number of specific tests are provided to try to detect the most likely problems. You are strongly advised to run these tests and study the results. Remember, database errors are probably the most common cause of bad synthesis.
db_utils/check_labs
This function checks the unit labels, identifying the number of occurrences of each, and finding any with unusually short durations.
Do fix any problems before continuing.
db_utils/check_phoneset
This function checks that the labels are all in the defined phoneme set. This only works when using a CHATR standard phoneme set. See section Defining a New Phoneme Set, if you are defining your own phoneme set. Again, do fix any problems before continuing.
db_utils/check_align
This checks the location of labels within waveforms. It is intended to detect offsets in label files, mis-matches of label files to waveforms, and possibly mistaken sample rates.
Once more, do fix any problems before continuing.
db_utils/check_labwav FILE-ID
Strictly speaking it is not actually a test, but uses XWAVES to display an example waveform and label file as described in the database. Check that the labels match the waveform, and that the waveform itself is of the right byte order. You should check all files using this method, but of course, you de-bugged the database before you started this process, didn't you?
At least check three files--one near the beginning, one in the middle, and one near the end.
Before any processing occurs it is also wise to check that the waveform files in the database are of similar quality. Different waveform files often have quite different mean power, and it may be useful to normalize them. The CHATR function to do this is
db_utils/normal_power
It may also be necessary to exclude some waveform files because of extraneous noise--background music, etc. In that case, either delete or comment-out those file names from the `files' file.
Training of a database is a computationally expensive process. It can take from 20 minutes for a small database (e.g. gsw200 with 14 minutes of speech) to over 10 hours (e.g. f3a with 2.5 hours of speech). The most CPU intensive process is the calculation of the acoustic distance tables (or phoneme tables). These are calculated in the first major training step.
Database build disk space requirements are about 2.5 to 3 times the disk space of the `wav/' directory, plus space for the training of the acoustic distance files. Presently these can require anything from 2Mb to 1.5Gb, depending on the size of the database.
The DISTFILE_FILEBASE variable defines where the copies of the tables will be stored on disk. By default this is `dist/DBNAME_'. There should be a lot of free space in that partition. The size is most closely related to the square of the number of occurrences of each phoneme. For example
gsw: approximately 8700 units  ==> 12.7Mb
f2b: approximately 41000 units ==> 243Mb
f3a: approximately 97200 units ==> 1300Mb
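These figures are consistent with roughly quadratic growth, which can be sanity-checked against the gsw and f2b numbers:

```shell
# gsw -> f2b: units grow ~4.7x, so quadratic growth predicts ~22x
# disk use, close to the observed 243/12.7 ~ 19x
awk 'BEGIN { r = 41000 / 8700; printf "predicted %.0fx, observed %.0fx\n", r * r, 243 / 12.7 }'
```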
Once the data is stored on disk, it can be reloaded quickly to speed up multiple training runs and the multiple stages in the training scripts.
Note: do NOT use the `/tmp' directory--it is not big enough.
Setting the clean_up parameter in the udb_train_params LISP variable will cause the memory copy of the distance tables to be deleted after each time it is used. If this variable is not set, the training procedure will keep a copy of all the distance tables in memory. The internal distance tables are twice the size of those stored on disk (e.g. 2.6Gb for `f3a'), so you may need lots of swap space. Except for the smallest databases, the clean_up parameter should be set. When the distance table is next required it can be loaded from the disk copy. This is strongly recommended, since calculating it again from scratch is very slow.
First make all the directories that are used in the process. Use the command
db_utils/make_alldirs
This creates a directory called `dist/' to contain the unit distances used in training. For larger databases it is a good idea to change this to a symbolic link pointing to another partition with lots of free space.
The script in file `$CHATR_ROOT/db_utils/make_db' lists the main sub-processes involved in the process of database building. If everything is set up properly this script will build a fully trained database. It is called in BASH or SH using the command
db_utils/make_db >make.log 2>&1 &
This will run the process in the background and send a commentary to the file `make.log'. It is recommended to open that file in an editor and periodically monitor the contents as a progress check. Note that as with most software, once an error occurs there is little use in continuing, even if that error was minor. Stop the process, find and fix the problem, then start again.
Look out for comments like `Cannot communicate with server after 100 tries.'. Several sub-processes require the automatic issue of various site software licenses. This warning message usually means none are presently free. No further progress can be made until a license becomes available. To determine current allocation, use the command
elmadmin -l
Go to the most soft-hearted person on the list and ask them to type the commands
hfree
efree
Be quick to re-start the training before someone else snags the now free license!
Ideally the entire database build process should be fully automatic. However, there are sometimes problems (especially the first time a database is built) and it may be necessary to go through each stage by hand. The following is a description of what each stage is trying to achieve and some of the problems that may occur.
Note that in general stage order is significant unless otherwise stated.
The first stage is to do the basic signal processing of the database: pitch extraction and mel-cepstrum parameter (MFCC) calculation. These could be run in parallel on different machines. The scripts used are
db_utils/make_melcep
db_utils/make_f0s
Vector-quantized MFCC parameters, pitch, and power are generated for 10 ms frames over the whole database. The script used is
db_utils/make_acoustic_params
Pitch-marks are generated for each file. The script used is
db_utils/make_pitchmarks
Depending on the method used for generating pitch marks (fz_track or other), this may be run in parallel with the creation of the F0 and MFCC files.
Warnings of the form `No peak found: N N' may be generated, but they can be disregarded. Similarly, the message `sqrt: DOMAIN error' is an internal script issue and may be ignored.
Note that fz_track may crash if the pitch of the waveform being tracked moves outside the specified range. This can happen particularly with male speech, where the default range is 70Hz to 228Hz. It is uncommon but not impossible for male speech pitch to go as low as 30Hz. You should specify an operations file appropriate for the speaker.
The MFCC files are now merged with F0 information for training. The script used is
db_utils/make_traincep
Next, the label files may be processed to produce unit description files. There will be one line per unit, with all fields specified. The script used is
db_utils/make_units
If new fields are to be added to a database they should be added to the files in the `units/' directory at this point. See section Adding a New Feature to a Database for details.
If the ToBI F0 prediction by linear regression is desired but a full training is not possible (i.e. the database does not have ToBI labels), mapping parameters are required. They are generated using the script
db_utils/make_tobif0_params
These must be edited into the `DBNAME_synth.ch' file.
If the linear regression model is to be used to predict durations, parameters are required to map the model durations to the range of those of the target. These are generated using the script
db_utils/make_lrdurstats
Now that the unit descriptions are available, a CHATR representation of them is made. The script used is
db_utils/make_unitindex
All information has now been collected together so a binary representation of the full database index, pitch marks, acoustic parameters etc. etc. may be created. The script used is
db_utils/make_indexout
For the testing of a database with natural targets, a CHATR representation of each utterance is required for the test_seg function. This is achieved using the script
The final stage is the training of weights for unit selection. There is a pre-requirement that both the `index/DBNAME_synth.ch' and `index/DBNAME_train.ch' files be created and edited. Training can take some time and may use lots of database space. See section System Requirements, for details. The script used is
db_utils/make_training
Note: If training fails (whether due to an out-of-memory error or anything else) during the making of the distance tables, the last-made table in the `dist/' directory must be deleted. It may be incomplete, and hence reloading it later will cause an error. On re-commencement of this script, the last-created table will automatically be located and training will continue from there.
A fully trained and described database should now exist. Before it can be used by CHATR it must be defined. See section Defining a Speaker, for details.
The newly created database may be selected using the function
(speaker_DBNAME)
This will auto-load your `DBNAME_synth.ch' file and execute the speaker_DBNAME function defined in that file.
Select the newly created database using the function
(speaker_DBNAME)
Initial tests of the database are best made using natural targets. The `test_seg' files generated during database creation and training are ideal for this. Use the command
(Say (test_seg "FILE-ID"))
where FILE-ID is a file-id from your newly created database. A synthesized version of the file-id.wav file will now be played. Of course, ideally it should sound like the original!
Once a database is proven to be stable, its defspeaker definition may be added to the file `$CHATR_ROOT/lib/data/itlspeakers.ch' in the CHATR distribution so others may access it.
Note that initial tests should be done directly in a user's own installation of CHATR, i.e. from the `.chatrrc' file or directly at the command line.
It may be desired to define a new phoneme set particular to a new database. This has been considered and some support is given. First you must create a CHATR file in the `index/' directory defining the phoneme set, called `PHONESET_def.ch'. See section Phoneme Sets, for details about how to define a phoneme set.
The desired phoneme set must be loaded in `DBNAME_synth.ch'. A line, commented out, shows the format.
Note that when a new phoneme set is used, that database will not work with the higher levels of the system directly. A new lexicon and possibly a new intonation module and duration module will be required, especially if this is a new language. Of course natural target resynthesis will work without any of these higher levels. In this case simply do not define any lexicon, intonation or duration in the `DBNAME_synth.ch' file.
Alternatively, a phoneme map may be defined between an existing phoneme set and the new phoneme set. The Phoneme Internal set can be an existing one, and a mapping will occur automatically. Although this will work, the mapping system is probably not powerful enough to get the best results; it should therefore only be used as an intermediate step.
This method of building a speech synthesis database allows for the pruning of units from the database which are found to be unpredictable. There are two reasons for pruning: first, to reduce the size of the database so synthesis will be faster; and second, to remove units whose properties do not reflect the features they are labeled with. Pruning is still very much in its initial stages; this area deserves much more work before it can improve databases as much as we feel is possible.
The training algorithm provides options for levels of pruning. See the setting of train_level near the top of `DBNAME_synth.ch'. Setting the variable to non-nil will cause training to do levels of pruning. Pruning parameters are set in the variable udb_train_params further down the training file.
Once a set of units to be pruned is generated (they will be saved in `index/DBNAME_prune*.ch'), the index must be rebuilt without the pruned units. This is done via the following command
db_utils/make_pruning LEVEL
Note that the pruned units are only removed from the index; the actual entries themselves still exist within the database but will never be selected. They must remain because their neighbors may require information about their context and hence have to refer to these pruned units.
More serious pruning, e.g. removal of whole bad files, should really be done before CHATR processes the data.
Pruning does not happen by default during the building of databases as currently we feel the advantage from it is minimal and more experimentation is required.
It is possible to reduce the size of a database significantly by resampling the waveform files. For example, changing the waveform files from a 16kHz 16-bit linear database to 8K ulaw would result in a 75% space saving. Very little difference in sound quality will be noticed if the eventual output is to be played on a low-level audio system such as the Sun /dev/audio.
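The 75% figure follows directly from the data rates (16kHz at 2 bytes per sample versus 8kHz at 1 byte per sample):

```shell
# bytes per second: 16kHz 16-bit linear = 32000; 8kHz 8-bit ulaw = 8000
echo "$(( (32000 - 8000) * 100 / 32000 ))% saved"
```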
Similarly, further space could be saved if a lower sample rate version of the waveform files were used.
The format of the waveform files may be changed without recompilation of any part of the database index. All information is time-based rather than sample-based, even pitch mark files.
Given the example template of `DBNAME_synth.ch' for a 16kHz, 16-bit linear waveform, we would have a declaration such as
(Database Set WaveFileType raw)
(Database Set WaveSampleRate 16000)
(Database Set WaveEncoding lin16MSB)
(Database Set WaveFileSkeleton (strcat DBNAME_data_dir "wav/%s.wav"))
To change to a database of 8K ulaw first convert all the files in `wav/' to 8K ulaw. An external program or CHATR may be used to do this. Then edit the above lines in `DBNAME_synth.ch' to become
(Database Set WaveFileType raw)   ;; i.e. unheaded
(Database Set WaveSampleRate 8000)
(Database Set WaveEncoding ulaw)
(Database Set WaveFileSkeleton (strcat DBNAME_data_dir "wav/%s.au"))
See section Command Index, command Database, for details of the formats supported.
All speaker functions defined in `DBNAME_synth.ch' call the function speaker_reset. Currently that function is defined but does nothing. Its purpose is to reset any variables set for a particular speaker before another speaker is selected. Of course, all of the speaker description files could be edited, but that would be a lot of work. Instead you should change the speaker_reset function defined in `$CHATR_ROOT/lib/data/speakutils.ch'.
If you don't have access to that file or don't wish to modify it, you can still get the same effect by redefining the function speaker_reset in your own `DBNAME_synth.ch' file. In case someone else has already done that, the following method is recommended: define a new version of speaker_reset which calls the existing definition and also includes your own reset information. If everyone uses this technique, resets will happen properly.
Suppose your new speaker `zaphod' requires the variable spareheads to be set to one, but that needs to be nil for all other speakers. In `zaphod_synth.ch', after the definition of speaker_zaphod (which sets spareheads to one), you should add
(set zaphod_previous_speaker_reset speaker_reset)
(define speaker_reset ()
   "New speaker reset that adds resets for speakers after calling zaphod"
   (zaphod_previous_speaker_reset) ;; previously defined speaker_reset
   (set spareheads nil)
)
A common requirement may be the addition of a new feature to an existing database. This would be most likely within our own research group where we wish to test the suitability of some new feature in the selection process. This section offers a walk through of what needs to be changed in an existing database to achieve this.
A new field may be added, trained and tested without any change to the CHATR C source code. However, if this field is to be added to the full synthesis process, you must of course modify the C source code in order to be able to predict this field.
The first stage is to generate the values for the new field(s) for each unit in the database. This unfortunately is not quite as easy as it sounds. You must ensure that the fields generated align with the unit labels in the `lab/*.lab' files. You should take an adequate amount of time to ensure this is the case.
Note the following process is destructive, in that it modifies the database already existing in a database directory. This only modifies the files in directories `units/', `chatr/seg/' and `index/', so a set of shadow links can be set up if desired. You should of course not be experimenting with a database that others may currently be using.
First create files in `units/', one for each file-id in the system, using a new file extension. The files may contain more than one new field. These fields can be pasted onto the end of the existing unit files in that directory using the command
db_utils/add_newfields <newfield_fileextension>
This will modify all `.units' files in that directory, appending the new fields.
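The pasting step is line-by-line column appending. The sketch below shows what such a script is described as doing (the real `db_utils/add_newfields' may differ in detail; file names and layout here are illustrative only):

```python
from pathlib import Path

def add_newfields(units_dir: str, new_ext: str) -> None:
    """Append the columns in each `file-id.<new_ext>` file to the
    matching `file-id.units` file, line by line.  A sketch only, not
    the actual db_utils script."""
    for new_file in Path(units_dir).glob(f"*.{new_ext}"):
        units_file = new_file.with_suffix(".units")
        units = units_file.read_text().splitlines()
        extra = new_file.read_text().splitlines()
        # the new field values must align one-to-one with the unit lines
        assert len(units) == len(extra), f"misaligned: {new_file}"
        merged = [f"{u} {e}" for u, e in zip(units, extra)]
        units_file.write_text("\n".join(merged) + "\n")
```

The alignment assertion matters: if a new-field file has a different number of lines from its `.units' file, the values would silently attach to the wrong units.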
Now create the file `index/DBNAME_extrafields' containing the field declaration for the new fields you wish to add to the database. Fields can be floats, ints, or categories. For example, if two new fields are added, one for ToBI accents and one for ToBI ending tones, the file `DBNAME_extrafields' may look like
 (tobi_accent (NONE H* !H* L+H* L+!H* L* L*+H OTHER))
 (tobi_tone (NONE L-L% L-H% H- L- H-H% OTHER))
Note it is necessary (for a later shell script) to have leading spaces on the above lines.
Now a new index can be created with those new fields using the command
db_utils/make_unitindex
Then compile it using
db_utils/make_indexout
If problems occur in making this index you must fix them before continuing.
Next a CHATR utterance representation of the database entries should be created for use with test_seg. The format of these files includes all fields in a database entry, even if there is currently no way to predict the value of a particular field during text-to-speech. Use the command

db_utils/make_segs
It will also be necessary to amend the silence entry definition in `DBNAME_synth.ch' to define values for the new fields created. For example
(Database Set Silence ("pau" 0 67 0.0 120 0.0 0.210 0.0 5.369 0.0 NONE NONE 0))
Note that the values for your new fields (here NONE NONE) come immediately before the final field.
Training of the new fields is automatic. You need to edit `index/DBNAME_synth.ch' to define new distance functions for the new fields (and possibly delete existing distance functions you no longer want). Note that a database may contain more fields than are actually used in selection; therefore, when comparing competing fields, the same compiled index may be used, and different training (and hence different weights files) is all that need change.
For full details of distance functions see section Distance Functions.
Here we will only deal with a limited form of customization. There are two major classes of distance functions: continuous (float or int) and categorical. These are trained differently. New continuous distance functions should be listed in the variable DBNAME_SharedDFs, while categorical distance functions are listed in DBNAME_TableDFs. These lists are expanded automatically into full distance function definitions during training.
A continuous listing consists of 5 fields as follows

distance name
position offset
field name
    as declared in `DBNAME_extrafields'.
mapping
    either ident, i.e. no mapping, or log, for logarithm.
difference measure
    eql returns 0 if the two values are equal, and 1 otherwise (this is only reasonable for int valued fields); abs means the absolute difference between the two values, and sqr means the squared difference.
A categorical listing also consists of 5 fields as follows

distance name
position offset
field name
    as declared in `DBNAME_extrafields'.
mapping
    normally ident, but if some further quantization is desired it may be achieved by defining a new Discrete and Map. See section Discretes and Maps, for detailed information.
size
    if the mapping is ident, this is the number of members in the field declaration; if the mapping is something other than ident, this is the number of items in the category being mapped to.
Note that categorical distance functions will be trained in phone groups. This will rarely be wrong, but may sometimes give more differentiation than is necessary.
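Unlike the continuous case, a categorical distance is a table lookup: training produces a cost table over the category's members. A sketch (the member list is a truncated ToBI accent set and the costs are invented purely for illustration):

```python
# Hypothetical trained cost table for a categorical field.
ACCENTS = ["NONE", "H*", "!H*"]
COSTS = [
    [0.0, 1.0, 0.9],
    [1.0, 0.0, 0.3],
    [0.9, 0.3, 0.0],
]

def table_distance(a: str, b: str) -> float:
    """Distance between two category values by table lookup."""
    return COSTS[ACCENTS.index(a)][ACCENTS.index(b)]
```

Training in phone groups, as noted above, means a separate such table may be produced per group.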
In our example we have two new fields we wish to train. Both are category fields, so we have to add their descriptions to the variable DBNAME_TableDFs. The additions should look like
(tobi_accent 0 tobi_accent ident 8)
(tobi_tone 0 tobi_tone ident 7)
That is (for the first line), the new distance function is called tobi_accent; it applies to the current phone (offset 0), uses the field name tobi_accent, with no mapping (ident), and has 8 members.
If we wished to have a distance function on not only the current phone but also on the context, we could add distance functions that include the ToBI accents of the left and right context as in
(p_tobi_accent -1 tobi_accent ident 8)
(n_tobi_accent 1 tobi_accent ident 8)
Thus the distance name and the position offset change, but the field name of course remains the same.
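The effect of the position offset can be sketched as an index shift over the unit sequence (the list-of-dicts layout and the function name are purely illustrative):

```python
# Offset 0 reads the current phone's field, -1 the left neighbour,
# +1 the right neighbour; out-of-range context falls back to a default.
def field_at(units, i, field, offset=0, default="NONE"):
    j = i + offset
    if 0 <= j < len(units):
        return units[j][field]
    return default
```

So p_tobi_accent compares the tobi_accent value one phone to the left, and n_tobi_accent one phone to the right.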
Once the new distance measures have been defined, you can train the new weights. The standard database script can be used, which is
db_utils/make_training
The calculated distance measures in the directory `dist/' are re-used by default if they exist, so keeping them is useful when training on different fields, as it makes re-training much faster. Note that if you change the phoneme set you must delete the old distance files and re-create them.
After training you should check the training log file in `index/DBNAME_train.log' to see the contribution of the new fields you have introduced.
Once it is decided that a new field is worth predicting, CHATR will need to be modified to actually predict it. Functions in the file `$CHATR_ROOT/src/udb/udb_targfuncs.c' are used to generate all target fields. These functions return a Lisp cell (to generically deal with the appropriate type: float, int or categorical) from a segment stream cell. An entry should be added to the table df_targ_val_name2func relating the new field name to a function. The function may simply access a field in the segment stream cell (or one related to it), or do some calculation. It may be that a feature function already exists to generate the appropriate value, and hence may simply be called from a thin wrapper function (cf. udb_tf_sylpos).
The object of training is to find the weighting that minimizes the distance of the selected units from the original. We do not yet know the ideal distance measure. However, it will probably be a signal processing value that directly follows human perception of good or bad synthesis. Approximations of this measure are possible, and CHATR supports a mechanism for selecting what to use. The distance method used is defined through the variable cep_dist_parms. This name was chosen as it will most likely involve manipulation of cepstrum parameters.
The assumption is that a set of parameters is defined for each frame (at some increment) in the database. This is specified through the CoefFileSkeleton setting of the Database command. The format of these files may vary, but HTK-headed and ATR improved cepstrum files are currently supported. Remember to add new formats to `file/cep_io.c'.
The distance measures themselves are defined in `chatr/cep_dist.c'. Currently supported are Euclidean and weighted Euclidean. Two alignment options are also provided: naive, which does no time alignment between the selected units' cepstrum vectors and the original (just taking the shortest), and tw, which linearly interpolates the selected units' cepstrum vectors to the original. Note that after selection, the individual units' cepstrum vectors are each time aligned, so overall alignment should not be a problem.
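The two alignment options can be sketched as follows. This is a simplified re-implementation over plain Python lists of cepstrum frames, not the code in `chatr/cep_dist.c'; function names are invented:

```python
import math

def euclidean(f1, f2, weights=None):
    """(Weighted) Euclidean distance between two cepstrum frames."""
    w = weights or [1.0] * len(f1)
    return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, f1, f2)))

def frame_distance(orig, cand, align="naive"):
    """Mean frame distance between original and candidate sequences."""
    if align == "naive":
        # no time alignment: pair frames up to the shorter sequence
        pairs = list(zip(orig, cand))
    else:  # "tw": resample candidate frames onto the original's time axis
        pairs = []
        for i in range(len(orig)):
            t = i * (len(cand) - 1) / max(len(orig) - 1, 1)
            lo, frac = int(t), t - int(t)
            hi = min(lo + 1, len(cand) - 1)
            interp = [(1 - frac) * a + frac * b
                      for a, b in zip(cand[lo], cand[hi])]
            pairs.append((orig[i], interp))
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)
```

With a candidate that follows the same trajectory at a different length, tw reports a small distance where naive reports a large one, which is the point of the interpolation.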