

Creating & Training a Speech Synthesizer Database

This chapter works through an example of gathering and characterizing the files necessary for CHATR to build a synthesizer based on a speech corpus database. The process is long and requires considerable disk space and CPU time. Although it is mostly automatic, there are a number of stages where informed decisions need to be made. Familiarity with the whole operation will greatly aid you in successfully building a usable synthesis database.

Apart from the main building and training scripts, various `awk' and `sed' one-liners have evolved during the development and use of CHATR; since these are heavily environment-specific and developers may prefer to generate their own, they have not been included in the main text but gathered together in an appendix. See section Various Short Useful(?) Scripts, for ideas.

Before proceeding further there will be a short explanation of some database-building terminology as specifically used in this chapter.

Each waveform file is identified by a short identifier, a file-id. This will typically be the name of the file minus any extension. For example, if the files are called

     sc001.wav
     sc002.wav
     sc003.wav
     ...

then the file-id's are

     sc001
     sc002
     sc003
     ...
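In shell terms, a file-id is simply the waveform file's name with the directory and extension removed. A minimal sketch (the `sc001.wav' names follow the example above):

```shell
# A file-id is the waveform file name minus directory and extension.
fileid() {
  basename "$1" .wav
}

fileid wav/sc001.wav   # prints: sc001
```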

Directory Structure

The following are just brief descriptions of the contents of sub-directories created while making a synthesis database. See section Preparing the Database, for details on how to acquire the initial files.

`wav/'
The waveform files. These are headerless, in native byte order, with `.wav' extensions.
`lab/'
The phoneme labels. These are in XWAVES-label format, with `.lab' extensions. The file-id's must match those of the waveform files.
`db_utils/'
Shell scripts and programs used in creation of the following files. It is recommended that a symbolic link be made to this directory within CHATR. A copy of the entire directory could be made, but the files take up a lot of space and are only of use during the building and training of the speech database.
`stats/'
Unit statistics files. These files contain duration, mean pitch, mean power and mean voicing for each unit in the database.
`units/'
Unit descriptions containing all features to be used in a database.
`f0/'
F0 files.
`pm/'
Pitch mark files.
`cep/'
Cepstrum parameter files.
`vq/'
Various vector quantization setup files, and vector quantization for all files in the database.
`chatr/seg/'
CHATR representations of the utterances in the database. These are used for resynthesis tests.
`index/'
Where all the final generated files are gathered together and the eventual compiled CHATR index is made.

Preparing the Database

This procedure will eventually create a fully trained database with index files. To use the resulting database within CHATR, only one definition command needs to be executed. See section Defining a Speaker, for details.

The construction process requires access to the following software packages

Choose a short name for your database and create a directory for it. All files will be generated in that directory by default. It will be referred to as the `Speaker Top Directory' for the rest of this chapter, and should be selected as the working directory when issuing any shell commands.

Only one place in the ultimate database definition refers to a speaker directory, so it may be easily moved afterwards.

In your newly created Speaker Top Directory, you will need

Copy the waveform and phoneme label files into the `wav/' and `lab/' directories respectively. Ensure byte-order is correct and headers are removed. See section Various Short Useful(?) Scripts, for some simple scripts which may help automate this task.
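As a sketch of one way to do this with standard tools, the fragment below strips a fixed-size header and swaps the byte order of 16-bit samples. The header size is hypothetical; check the format your recordings actually use:

```shell
# Strip a fixed-size header, then swap each pair of bytes (16-bit
# samples) using dd's conv=swab. The header size varies by format.
strip_and_swap() {   # usage: strip_and_swap in.raw out.wav header-bytes
  tail -c +$(( $3 + 1 )) "$1" | dd conv=swab of="$2" 2>/dev/null
}
```

Whether a swap is needed at all depends on the byte order of the machine that recorded the data versus the machine building the database.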

Create a file called `files' to contain a list of all the waveform files in the database. Assuming only the waveform files are called `*.wav', use the shell command

     ls wav/*.wav > files

This file is used by the database training shell scripts to determine which files are to be processed. If (when?) things go wrong, it may be edited to remove the names of the files causing errors, i.e., include only the files that are to be used.
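The two steps above can be sketched as one small function; the bad file-ids passed to it here are of course hypothetical:

```shell
# Build the `files' list, then drop any file-ids known to cause errors.
make_files_list() {   # usage: make_files_list [bad-id ...]
  ls wav/*.wav > files
  for id in "$@"; do
    grep -v "$id" files > files.tmp && mv files.tmp files
  done
}
```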

Symbolically link `db_utils/' with `$CHATR_ROOT/chatr/db_utils/' using the shell command

     ln -s $CHATR_ROOT/chatr/db_utils/ db_utils

These files contain the scripts and binaries used to build a database.

Create the directory to hold the index files created during training. Use the shell command

     mkdir index

Three files need to be copied and characterized before proceeding. They are

db_description
The Database Description file.
index/DBNAME_synth.ch
The Database Parameters file.
index/DBNAME_train.ch
The Training Parameters file.

Templates are available in `$CHATR_ROOT/db_utils/'.

See section Database Description File, for details of the required editing to the Database Description file. See section Database Parameters File, for details of the required editing to the Database Parameters file. See section Training Parameters File, for details of the required editing to the Training Parameters file. Finally, don't see section Advanced Features, unless you're a researcher who wants to experiment with some of the finer points of the system.

Database Description File

The Database Description file `db_description' is used to select initial database parameters.

It is recommended to copy the template from the `$CHATR_ROOT/db_utils/' directory.

Copy into the Speaker Top Directory using the shell command

     cp $CHATR_ROOT/db_utils/db_description db_description

Edit that file to specify the database name, the phoneme set, and the sample rate for the database.

You may find it necessary to modify the variable GET_F0_PARAMS. This specifies the minimum and maximum expected F0 for the speaker. It helps to set the range to the likely limits for a particular speaker database, so the defaults may not be suitable. This is especially so for male speakers, where the lowest value in the database may be below the default minimum. If the subject is male, at least un-comment this line.

Some values in the middle section may also need to change depending on the environment. Run with the defaults to start with; they should suffice.

Database Parameters File

All parameters of a speech database are described in the Database Parameters file called `DBNAME_synth.ch'. This file describes both the database itself and other aspects of the voice, such as lexicon, intonation and duration parameters.

It is recommended to copy the template in `$CHATR_ROOT/db_utils/DBNAME_synth.ch' to `index/', replacing `DBNAME' with the name by which you wish the speaker to be known.

Use the shell command

     cp $CHATR_ROOT/db_utils/DBNAME_synth.ch index/NEW-NAME_synth.ch

This file must be edited. In many parts the required editing is essential to enable the building of a speech database; other areas contain advanced features only of interest to researchers. For that reason this section is further divided into two parts. See section Essential Editing, for the necessary. See section Advanced Features, for the heavy stuff.

Essential Editing

The database parameters file can be viewed in two parts, initialization and selection. When loaded, this file should initialize and load all necessary parameters for the use of the database as a CHATR speaker. The function speaker_DBNAME (at the end of the file) should select the actual database and auxiliary parameters required. The idea is that users will call that function when changing between alternate speakers.

Every occurrence of `<>' marks a part that requires editing. In general, change occurrences of DBNAME to the database name, PHONESET to the phoneme set name, and DICTNAME to the appropriate dictionary name.

Make a note of the directory where the data is defined for use when the function defspeaker is eventually called. Be aware that the variable DBNAME_data_dir will be set before the file `DBNAME_synth.ch' is loaded.

Each section of the template will now be examined in detail.

First, decide on the phoneme set you wish to use and ensure it is loaded. If the phoneme set is a standard set (i.e. `radio2', `mrpa', `BEEP', or `nuuph') you may simply require the definition file. If not, you must define your phoneme set in a file in the `/index' directory. Note that all unit names in the database must be members of this phoneme set. A commented-out example of loading a definition specific to a database is available. See section Defining a New Phoneme Set for more details.

The main database declaration is next. It defines the name of the index file, and the format of the waveform, pitch mark and cepstrum files. If you use the standard database as your platform, the example names will be satisfactory--but remember to remove the `<>' marks. This section also defines the wave file type and sample rate (this must be set), as well as the phoneme set.

The Silence definition is used in unit selection as a context for units which come at the start or end of a file. Ensure that the example silence entry has a reasonable value for all fields that exist in your database. It is assumed that there is an effective silence before and after each file in the database; however, good database design should ensure there are actually some silence units in the waveform as well. As is commented in the template (but still easily overlooked!), make sure the phoneme-name for space (the character(s) between the double quotes) corresponds to that used in the chosen phoneme set.

Depending on the phoneme set chosen, the next section may not require editing. It is a definition of clusters of phonemes which share discrete distance functions. The names used are of course phoneme-set dependent. The template lists phonemes belonging to the `mrpa' set. If a different set is selected, the phonemes in this section must be altered to match those in that set.

The names and number of clusters are arbitrary and may be altered to suit. Groups with similar articulatory characteristics work well. Note that all unit names in the database must be in at least one class. As an example, using the mrpa phoneme set, a possible clustering is

     (set DBNAME_PhoneSets
         '(
           (lowv (a o oo aa ai oi au ou u@))
           (midv (uh e uu @@ ei e@))
           (highv (i u ii i@ @))
           (plosive (p t h b d g k ch jh))
           (fricative (s z sh zh f v th dh))
           (nasal_n (n ny))
           (glide (l r w y))
           (nasal (m n ng))
           (misc (#))
          ))

It is important that the groups have a reasonable number of members. If there are too few, training will not be possible. Likewise, if there are too many occurrences within a group, it may require too much disk or swap space to calculate.
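A quick way to judge group sizes is to count how often each label occurs in the database. This sketch assumes the XWAVES convention that a `#' line ends each label file's header and that the label is the last field of every subsequent line:

```shell
# Count occurrences of each label across all label files, most
# frequent first. Lines before the `#' separator are header.
count_labels() {
  awk 'FNR == 1 { body = 0 }
       body     { print $NF }
       /^#/     { body = 1 }' lab/*.lab |
  sort | uniq -c | sort -rn
}
```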

The next section of the template defines which distance functions are to be trained and used. No editing is essential here. See section Advanced Features, if you wish to experiment.

The next significant section contains nus_DBNAME_params, which defines some general parameters for the unit selection process. Their current values are probably acceptable, although the beam and candidate widths could possibly be reduced. See section Variable Index for details of their values.

The following section is not executed in training mode, since the files it loads are themselves generated during training. Defaults are already selected. See section Advanced Features, if automatic `pruning' of decision-trees is required.

If (and only if) training is not possible for some reason, a weights file should be created to name the distance functions that are to be used. Reasonable guesses for weights are possible. The format is a list of weights for each phoneme class. Each weight consists of a single phone or list of phones in the class, followed by a list of distance function and weight pairs. A special phone named any may be used to cover all phonemes not otherwise specified. One suitable default weights file might contain

     (quote
       ((any
        (p_phone_ident 0.3)
        (n_phone_ident 0.3)
        (duration 0.5)
        (pitch 1.0)
        (p_pitch 0.5)
     )))

Next the lexicon must be defined and selected. A number of lexicons are already built into CHATR. Presently they are

cmu
An American English lexicon, based on the CMU lexicon (0.1) containing about 100,000 words. It also uses the US Naval Research letter-to-sound rules for words not explicitly listed.
beep
A British English lexicon, based on the BEEP lexicon (0.5) containing about 160,000 words. It also uses the US Naval Research letter-to-sound rules for words not explicitly listed.
mrpa
A British English lexicon, developed at CSTR containing about 23,000 entries. It also uses the US Naval Research letter-to-sound rules for words not explicitly listed.
japanese
A lexicon which contains no explicit words but depends completely on a set of letter-to-phoneme rules for changing romaji into nuuph phonemes.

The file `$CHATR_ROOT/lib/data/lexicons.ch' contains definition functions for the above lexicons.

Remove the comment characters from the line containing the language required and insert the name of the lexicon you wish to use. Ideally the lexicon base dialect should match that of the speech corpus about to be trained. For example, using a `beep' lexicon to supply words to an American English corpus results in somewhat odd speech. There are more appropriate ways to transform speakers; see section Phoneme Maps, for details.

See section Lexicon, for more details on building your own lexicons.

For duration set-up, a number of choices are available. A linear regression model can be used which is trainable for multiple languages. Examples built from f2b (American English) and MHT (Japanese) exist and can be used with other databases of the same language. A neural-net based model is also available but does not train well, though the example included from f2b is acceptable. For Japanese, a built-in model exists. It has no external parameters and requires no set-up. See section Duration, for more details, including training new duration models. Specifically, for an example of tuning a linear regression model, see section Training a New Duration Model.

The second part of the duration definition defines pause durations at phrase boundaries. An appropriate definition must be loaded.

There are a number of intonation systems built into CHATR. The most stable ones are based around ToBI, and hence for building working speech synthesis voices ToBI is recommended. The appropriate parameters should be set for that speaker. See section Variable Index, for the appropriate values for ToBI_params and mb_params. For more details on building intonation models see section Intonation. One stable method for predicting F0 from ToBI labels is a model using linear regression. Two models have been included with the system, one for English (from f2b) and one for Japanese (from MHT). These models can be mapped to another speaker's F0 range given the target speaker's F0 mean and standard deviation. These speaker-specific parameters are generated by the script make_tobif0_params during the building of a database.

The final part of this file defines a function that when called will select the appropriate parameters to cause the synthesizer to use that voice. For efficiency you should try to ensure everything is loaded into CHATR and this function need only set some variables. Again follow the comments and modify (comment out/uncomment) the sections appropriate to the database you are building.

Note that if you set any other variables in your speaker_DBNAME function you have to ensure that the values are reset when the synthesizer switches to another speaker. In order to do this without editing all other speaker synth files, you can redefine the speaker_reset function. See section Speaker Reset Function, for an example.

Advanced Features

This section describes the features of the Database Parameters file which do not necessarily require editing for the building of a new speech database, but will be of interest to those involved in speech synthesis research.

Distance functions fall into two classes: shared and table.

Shared distance functions are in general continuous. The default set is

     (set DBNAME_SharedDFs
        '(
          (p_phone_ident -1 phone   ident eql)
          (n_phone_ident  1 phone   ident eql)
          (duration       0 dur_z   ident abs)
          (pitch          0 pitch_z ident abs)
          (p_pitch       -1 pitch_z ident abs)
          (n_pitch        1 pitch_z ident abs)
          ))

Table distance functions are discrete fields which will be trained in the phonetic groups defined above. The default set is

     (set DBNAME_TableDFs
       '(
         (p_vc      -1 phone ph_vc      2)
         (p_height  -1 phone ph_height  4)
         (p_length  -1 phone ph_length  6)
         (p_front   -1 phone ph_front   4)
         (p_v_rnd   -1 phone ph_v_rnd   2)
         (p_c_type  -1 phone ph_c_type  7)
         (p_c_place -1 phone ph_c_place 7)
         (p_c_vox   -1 phone ph_c_vox   2)
 
         (n_vc       1 phone ph_vc      2)
         (n_height   1 phone ph_height  4)
         (n_length   1 phone ph_length  6)
         (n_front    1 phone ph_front   4)
         (n_v_rnd    1 phone ph_v_rnd   2)
         (n_c_type   1 phone ph_c_type  7)
         (n_c_place  1 phone ph_c_place 7)
         (n_c_vox    1 phone ph_c_vox   2)
         ))

The above lists are automatically expanded into actual distance function definitions. See section Distance Functions, for a full description. The fields are: distance name, offset (-1 previous phone, 0 current, 1 next phone), the field name in the database to apply it to, the mapping function (ident, log, or a map name for table functions), and the distance type or table size to use.

The next section of the Database Parameters file defines parameters used in unit selection. Values (continuity weights in particular) may be changed as wished to facilitate hand-tuning. Two values are relevant to training but ignored during normal synthesis. They are

     (dur_penalty 1.0)
     (endpoint_weight 0.0)

Training produces a number of files which need to be loaded at synthesis time but not during training. This is handled by the `DBNAME_synth.ch' file, where there is a set of commands which are only executed in non-training mode. All have defaults, but researchers will surely wish to explore other options.

The first command is mandatory. It is necessary to select which set of weights to include when this file is actually used. The line is

     (set DBNAME_DiscTables (load (strcat DBNAME_data_dir 
                                   "index/DiscreteTables.ch")))

The next line selects the level of pruning. If no pruning is to be done, only the 0-level weights are needed. The template-default statement is

     (set DBNAME_Weights (load (strcat DBNAME_data_dir 
                                   "index/weights0.ch")))

If pruning is required, instead select the weights appropriate for the pruning level you desire. For example

     (set DBNAME_Weights (load (strcat DBNAME_data_dir 
                                   "index/weights2.ch")))

Note: select only one level; comment out the default if another is chosen. See section Pruning, for more information.

The next functions set up the trained weights for synthesis. They are

     (SetTableDFs DBNAME_PhoneSets DBNAME_TableDFs DBNAME_DiscTables)
     (SetSharedDFs DBNAME_SharedDFs)
     (Database Set Weights DBNAME_Weights)

Training Parameters File

The Training Parameters file is used to train the unit selection weights.

It is recommended to copy the template in `$CHATR_ROOT/db_utils/DBNAME_train.ch' to `index/', replacing `DBNAME' with the name by which you wish the speaker to be known.

Use the shell command

     cp $CHATR_ROOT/db_utils/DBNAME_train.ch index/NEW-NAME_train.ch

This file must be edited. All occurrences of DBNAME should be replaced with the speaker database name chosen above.

This is the only required editing. There are a large number of configuration parameters which will be of interest to researchers but are beyond the scope of this manual; general comments are given throughout the script.

Checking the Database

Databases are large and therefore very likely to contain errors. A number of specific tests are provided to try to detect the most likely problems. You are strongly advised to run these tests and study the results. Remember, database errors are probably the most common cause of bad synthesis.

The first test is

     db_utils/check_labs

This function checks the unit labels, identifying the number of occurrences of each, and finding any with unusually short durations.
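A rough stand-in for part of this check, should the script itself need debugging, is the sketch below. It assumes ASCII XWAVES label files (header ended by a `#' line, end time in the first field, label in the last) and uses an arbitrary 15 ms threshold:

```shell
# Report units shorter than 15 ms (the threshold is an arbitrary
# example; tune it to the phoneme set and speaking rate).
find_short_units() {
  awk -v min=0.015 '
    FNR == 1 { body = 0; prev = 0 }
    body && ($1 - prev) < min {
      printf "%s:%d %s %.3f\n", FILENAME, FNR, $NF, $1 - prev
    }
    body     { prev = $1 }
    /^#/     { body = 1; prev = 0 }' lab/*.lab
}
```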

Do fix any problems before continuing.

The second test is

     db_utils/check_phoneset

This function checks that the labels are all in the defined phoneme set. This only works when using a CHATR standard phoneme set. See section Defining a New Phoneme Set, if you are defining your own phoneme set. Again, do fix any problems before continuing.

The third test is

     db_utils/check_align

This checks the location of labels within waveforms. It is intended to detect offsets in label files, mis-matches of label files to waveforms, and possibly mistaken sample rates.

Once more, do fix any problems before continuing.
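One cheap alignment check that can be scripted directly is to compare each label file's final time stamp with the duration implied by the raw waveform's size. This sketch assumes headerless 16-bit mono waveforms and ASCII label files with the time in the first field:

```shell
# Compare a label file's last time stamp with the waveform duration
# (bytes / 2 bytes-per-sample / rate). A large difference suggests a
# wrong sample rate or a label/waveform mismatch.
lab_vs_wav() {   # usage: lab_vs_wav file-id sample-rate
  end=$(awk 'END { print $1 }' "lab/$1.lab")
  bytes=$(wc -c < "wav/$1.wav")
  awk -v id="$1" -v e="$end" -v b="$bytes" -v r="$2" \
      'BEGIN { printf "%s lab=%s wav=%.3f\n", id, e, b / 2 / r }'
}
```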

The fourth test is

     db_utils/check_labwav FILE-ID

Strictly speaking it is not actually a test, but uses XWAVES to display an example waveform and label file as described in the database. Check that the labels match the waveform, and that the waveform itself is of the right byte order. You should check all files using this method, but of course, you debugged the database before you started this process, didn't you?

At least check three files--one near the beginning, one in the middle, and one near the end.
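Picking the first, middle and last entries from the `files' list can be automated; a minimal sketch:

```shell
# Print the first, middle and last entries of the `files' list,
# suitable for spot checks with check_labwav.
spot_check_ids() {
  awk '{ line[NR] = $0 }
       END { print line[1]
             print line[int((NR + 1) / 2)]
             print line[NR] }' files
}
```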

Before any processing occurs it is also wise to check that the waveform files in the database are of similar quality. Different waveform files often have quite different mean power, and it may be useful to normalize them. The CHATR function to do this is

     db_utils/normal_power

It may also be necessary to exclude some waveform files because of extraneous noise--background music, etc. In that case, either delete or comment-out those file names from the `files' file.

System Requirements

Training of a database is a computationally expensive process. It can take from 20 minutes for a small database (e.g. gsw200 with 14 minutes of speech) to over 10 hours (e.g. f3a with 2.5 hours of speech). The most CPU intensive process is the calculation of the acoustic distance tables (or phoneme tables). These are calculated in the first major training step.

Database build disk space requirements are about 2.5 to 3 times the size of the `wav/' directory, plus space for the training of the acoustic distance files. Presently these can require anything from 2Mb to 1.5Gb, depending on the size of the database.

The DISTFILE_FILEBASE variable defines where the copies of the tables will be stored on disk. This is by default `dist/DBNAME_'. There should be a lot of free space in that partition. The size is most closely related to the square of the number of occurrences of each phoneme. For example

     gsw: approximately 8700  units ==> 12.7Mb
     f2b: approximately 41000 units ==> 243Mb
     f3a: approximately 97200 units ==> 1300Mb
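Since table size grows with the square of the per-phoneme occurrence counts, the sum of squared label counts gives a rough relative yardstick before training starts. A sketch, again assuming ASCII XWAVES label files with a `#' header separator and the label in the last field:

```shell
# Sum of (occurrences of each label)^2 -- a unitless proxy for how
# large the distance tables will be relative to other databases.
dist_size_proxy() {
  awk 'FNR == 1 { body = 0 }
       body     { n[$NF]++ }
       /^#/     { body = 1 }
       END { for (p in n) total += n[p] * n[p]
             print total }' lab/*.lab
}
```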

Once the data is stored on disk, it can be reloaded quickly to speed up multiple training runs and the multiple stages in the training scripts.

Note: do NOT use the `/tmp' directory--it is not big enough.

Setting the clean_up parameter in the udb_train_params LISP variable will cause the memory copy of the distance tables to be deleted after each time it is used. If this variable is not set, the training procedure will keep a copy of all the distance tables in memory. The internal distance tables are twice the size of those stored on disk (e.g. 2.6Gb for `f3a'), so you may need lots of swap space. Except for the smallest databases, the clean_up parameter should be set. When the distance table is next required it can be loaded from the disk copy. This is strongly recommended, since calculating it again from scratch is very slow.

Making the Database

First make all the directories that are used in the process. Use the command

     db_utils/make_alldirs

This creates a directory called `dist/' to contain the unit distances used in training. For larger databases it is a good idea to change this to a symbolic link pointing to another partition with lots of free space.

The script in file $CHATR_ROOT/db_utils/make_db lists the main sub-processes involved in building a database. If everything is set up properly this script will build a fully trained database. It is run under bash or sh using the command

     db_utils/make_db >make.log 2>&1 &

This will run the process in the background and send a commentary to the file `make.log'. It is recommended to open that file in an editor and periodically monitor the contents as a progress check. Note that as with most software, once an error occurs there is little use in continuing, even if that error was minor. Stop the process, find and fix the problem, then start again.

Look out for comments like `Cannot communicate with server after 100 tries.'. Several sub-processes require the automatic issue of various site software licenses. This warning message usually means none are presently free. No further progress can be made until a license becomes available. To determine current allocation, use the command

     elmadmin -l

Go to the most soft-hearted person on the list and ask them to type the commands

     hfree
     efree

Be quick to re-start the training before someone else snags the now free license!

Ideally the entire database build process should be fully automatic. However, there are sometimes problems (especially the first time a database is built) and it may be necessary to go through each stage by hand. The following is a description of what each stage is trying to achieve and some of the problems that may occur.

Note that in general stage order is significant unless otherwise stated.

The first stage is to do the basic signal processing of the database: pitch extraction and mel-cepstrum parameter (MFCC) calculation. These could be run in parallel on different machines. The scripts used are

     db_utils/make_melcep
     db_utils/make_f0s

Vector quantizations of the MFCC parameters, pitch and power are generated for 10 ms frames across the whole database. The script used is

     db_utils/make_acoustic_params

Pitch-marks are generated for each file. The script used is

     db_utils/make_pitchmarks

Depending on the method used for generating pitch marks (fz_track or other), this may be run in parallel with the creation of the F0 and MFCC files.

Warnings of the form `No peak found: N N' may be generated, but they can be disregarded. Similarly, the message `sqrt: DOMAIN error' is an internal script issue and may be ignored.

Note that fz_track may crash if the pitch of the waveform being tracked moves outside the specified range. This can happen particularly with male speech, where the default is 70Hz to 228Hz. It is uncommon but not impossible for the male speech pitch to go as low as 30Hz. You should specify an operations file appropriate for the speaker.

The MFCC files are now merged with F0 information for training. The script used is

     db_utils/make_traincep

Next, the label files may be processed to produce unit description files. There will be one line per unit, with all fields specified. The script used is

     db_utils/make_units

If new fields are to be added to a database they should be added to the files in the `units/' directory at this point. See section Adding a New Feature to a Database for details.

If the ToBI F0 prediction by linear regression is desired but a full training is not possible (i.e. the database does not have ToBI labels), mapping parameters are required. They are generated using the script

     db_utils/make_tobif0_params

These must be edited into the `DBNAME_synth.ch' file.

If the linear regression model is to be used to predict durations, parameters are required to map the model durations to the range of those of the target. These are generated using the script

     db_utils/make_lrdurstats

Now that the unit descriptions are available, a CHATR representation of them is made. The script used is

     db_utils/make_unitindex

All information has now been collected together so a binary representation of the full database index, pitch marks, acoustic parameters etc. etc. may be created. The script used is

     db_utils/make_indexout

For the testing of a database with natural targets, a CHATR representation of each utterance is required for the test_seg function. This is achieved using the script

     db_utils/make_segs

The final stage is the training of weights for unit selection. There is a pre-requirement that both the `index/DBNAME_synth.ch' and `index/DBNAME_train.ch' files be created and edited. Training can take some time and may use lots of database space. See section System Requirements, for details. The script used is

     db_utils/make_training

Note: If training fails (whether due to an out-of-memory error or anything else) during the making of the distance tables, the last table made in the `dist/' directory must be deleted. It may be incomplete, and reloading it later would cause an error. When this script is re-run, the last-created table will automatically be located and training will continue from there.

A fully trained and described database should now exist. Before it can be used by CHATR it must be defined. See section Defining a Speaker, for details.

The newly created database may be selected using the function

     (speaker_DBNAME)

This will auto-load your `DBNAME_synth.ch' file and execute the speaker_DBNAME function defined in that file.

Testing a New Database

Select the newly created database using the function

     (speaker_DBNAME)

Initial tests of the database are best made using natural targets. The `test_seg' files generated during database creation and training are ideal for this. Use the command

     (Say (test_seg "FILE-ID"))

where FILE-ID is a file-id from your newly created database. A synthesized version of the FILE-ID.wav file will then be played. Ideally, of course, it should sound like the original!

Once a database is proven to be stable, its defspeaker definition may be added to the file `$CHATR_ROOT/lib/data/itlspeakers.ch' in the CHATR distribution so others may access it.

Note that initial tests should be done directly in a user's own installation of CHATR, i.e. from the `.chatrrc' file or directly at the command line.

Minor Customization

Defining a New Phoneme Set

You may wish to define a new phoneme set particular to a new database; some support is provided for this. First, create a CHATR file called `PHONESET_def.ch' in the `index/' directory defining the phoneme set. See section Phoneme Sets, for details about how to define a phoneme set.

The desired phoneme set must be loaded in `DBNAME_synth.ch'. A line, commented out, shows the format.

Note that when a new phoneme set is used, that database will not work with the higher levels of the system directly. A new lexicon and possibly a new intonation module and duration module will be required, especially if this is a new language. Of course natural target resynthesis will work without any of these higher levels. In this case simply do not define any lexicon, intonation or duration in the `DBNAME_synth.ch' file.

Alternatively, a phoneme map may be defined between an existing phoneme set and the new phoneme set. The Phoneme Internal set can be an existing one and a mapping will occur automatically. Although this will work, the mapping system is probably not powerful enough to get the best results, so it should only be used as an intermediate step.

Pruning

This method of building a speech synthesis database allows units found to be unpredictable to be pruned from the database. There are two reasons for pruning: first, to reduce the size of the database so synthesis will be faster; second, to remove units whose properties do not reflect the features they are labeled with. Pruning is still very much in its initial stages; this area deserves much more work before it can improve databases as much as we believe is possible.

The training algorithm provides options for levels of pruning. See the setting of train_level near the top of `DBNAME_synth.ch'. Setting this variable to a non-nil value will cause training to perform pruning. Pruning parameters are set in the variable udb_train_params, further down the training file.
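As a sketch (the actual parameter values belong in the template file and are not reproduced here), enabling pruning near the top of `DBNAME_synth.ch' might look like

     (set train_level t)   ;; any non-nil value causes training to do pruning
     ;; pruning parameters are set further down, in udb_train_params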

Once a set of units to be pruned has been generated (it will be saved in `index/DBNAME_prune*.ch'), the index must be rebuilt without the pruned units. This is done via the following command

     db_utils/make_pruning LEVEL

Note that the pruned units are only removed from the index; the actual entries themselves still exist within the database but will never be selected. They must remain because their neighbors may require information about their context and hence have to refer to these pruned units.

More serious pruning, e.g. removal of whole bad files, should really be done before CHATR processes the data.

Pruning does not happen by default during the building of databases, as we currently feel its advantage is minimal and more experimentation is required.

Changing Format of Waveform Files

It is possible to reduce the size of a database significantly by resampling the waveform files. For example, changing the waveform files from 16kHz 16-bit linear to 8kHz ulaw saves 75% of the space (32,000 bytes per second down to 8,000). Very little difference in sound quality will be noticed if the eventual output is played on a low-level audio system such as the Sun /dev/audio.

Conversely, sound quality could be improved if a higher sampling rate version of the waveform files were used.

The format of the waveform files may be changed without recompilation of any part of the database index. All information is time-based rather than sample-based, even pitch mark files.

Given the example template of `DBNAME_synth.ch' for a 16kHz, 16-bit linear waveform, we would have a declaration such as

     (Database Set WaveFileType raw)
     (Database Set WaveSampleRate 16000)
     (Database Set WaveEncoding lin16MSB)
     (Database Set WaveFileSkeleton (strcat DBNAME_data_dir "wav/%s.wav"))

To change to an 8kHz ulaw database, first convert all the files in `wav/' to 8kHz ulaw; either an external program or CHATR itself may be used to do this. Then edit the above lines in `DBNAME_synth.ch' to become

     (Database Set WaveFileType raw)    ;; i.e. unheaded
     (Database Set WaveSampleRate 8000)
     (Database Set WaveEncoding ulaw)
     (Database Set WaveFileSkeleton (strcat DBNAME_data_dir "wav/%s.au"))

See section Command Index, command Database for details of the formats supported.

Speaker Reset Function

All speaker functions defined in `DBNAME_synth.ch' call the function speaker_reset. Currently that function is defined but does nothing. Its purpose is to reset any variables set for a particular speaker before another speaker is selected. All of the speakers' description files could of course be edited, but that would be a lot of work. Instead you should change the speaker_reset function defined in `$CHATR_ROOT/lib/data/speakutils.ch'.

If you don't have access to that file or don't wish to modify it, you can achieve the same effect by redefining the function speaker_reset in your own `DBNAME_synth.ch' file. Since someone else may already have done the same, the following method is recommended: define a new version of speaker_reset which calls the existing definition and then adds your own reset code. If everyone uses this technique, all resets will happen properly.

Suppose your new speaker `zaphod' requires the variable spareheads to be set to one, while it must be nil for all other speakers. In `zaphod_synth.ch', after the definition of speaker_zaphod (which sets spareheads to one), you should add

     (set zaphod_previous_speaker_reset speaker_reset)

     (define speaker_reset ()
        "New speaker_reset that adds zaphod's resets after
     calling the previous definition"
     (zaphod_previous_speaker_reset) ;; previously defined speaker_reset
     (set spareheads nil)
     )

Adding a New Feature to a Database

A common requirement is the addition of a new feature to an existing database. This is most likely to arise within our own research group, where we wish to test the suitability of some new feature in the selection process. This section walks through what needs to be changed in an existing database to achieve this.

A new field may be added, trained and tested without any change to the CHATR C source code. However, if this field is to be added to the full synthesis process, you must of course modify the C source code in order to be able to predict this field.

The first stage is to generate the values of the new field(s) for each unit in the database. Unfortunately this is not quite as easy as it sounds: you must ensure that the generated fields align with the unit labels in the `lab/*.lab' files. Take adequate time to ensure this is the case.

Note that the following process is destructive, in that it modifies the database already existing in the database directory. Only the files in the directories `units/', `chatr/seg/' and `index/' are modified, so a set of shadow links can be set up if desired. You should of course not experiment with a database that others may currently be using.

First create files in `units/', one for each file-id in the system, with a new extension. The files may contain more than one new field. These fields can be pasted onto the end of the existing unit files in that directory using the command

     db_utils/add_newfields <newfield_file_extension>

This will modify all `.units' files in that directory, appending the new fields.

Now create the file `index/DBNAME_extrafields' containing the field declaration for the new fields you wish to add to the database. Fields can be floats, ints, or categories. For example, if two new fields are added, one for ToBI accents and one for ToBI ending tones, the file `DBNAME_extrafields' may look like

     (tobi_accent (NONE H* !H* L+H* L+!H* L* L*+H OTHER))
     (tobi_tone (NONE L-L% L-H% H- L- H-H% OTHER))

Note it is necessary (for a later shell script) to have leading spaces on the above lines.

Now a new index can be created with those new fields using the command

     db_utils/make_unitindex

Then compiled using

     db_utils/make_indexout

If problems occur in making this index you must fix them before continuing.

Next a CHATR utterance representation of the database entries should be created for use with test_seg. The format of these files includes all fields in a database entry, even if there is currently no way to predict the value of a particular field during text-to-speech. Use the command

     db_utils/make_segs

It will also be necessary to amend the silence entry definition in `DBNAME_synth.ch' to define values for the new fields created. For example

     (Database Set Silence 
        ("pau" 0 67 0.0 120 0.0 0.210 0.0 5.369 0.0 NONE NONE 0))

Note that your new fields occur one position before the end.

Training of the new fields is automatic. You need to edit `index/DBNAME_synth.ch' to define new distance functions for the new fields (and possibly delete existing distance functions you do not want). Note that a database may contain more fields than are actually used in selection; therefore, when comparing competing fields, the same compiled index may be used and only the training (and hence the weights files) need change.

For full details of distance functions see section Distance Functions. Here we will only deal with a limited form of customization. There are two major classes of distance functions: continuous (float or int) and categorical. These are trained differently. New continuous distance functions should be listed in the variable DBNAME_SharedDFs, while categorical distance functions are listed in DBNAME_TableDFs. These lists are expanded automatically into full distance function definitions during training.

A continuous listing consists of five fields, as follows

distance name
Must be unique.
position offset
-1 means previous, 0 means current and 1 means next unit.
field name
The field name. For new fields it is the name introduced in DBNAME_extrafields.
mapping
The mapping function to use. For continuous field names this can currently be ident, i.e. no mapping, or log, for logarithm.
difference measure
This defines the function used to give the distance between the target and database unit fields. eql returns 0 if the two values are equal and 1 otherwise (this is only reasonable for int-valued fields); abs means the absolute difference between the two values; and sqr means the squared difference.
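For example, a continuous listing for a hypothetical float field named power (the field name is purely illustrative; a real one would be introduced in DBNAME_extrafields) might look like

     (power      0 power ident abs)
     (p_power   -1 power log   sqr)

That is, a distance on the current unit's power using the absolute difference, and one on the previous unit's log power using the squared difference.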

A categorical listing also consists of five fields, as follows

distance name
Must be unique.
position offset
-1 means previous, 0 means current and 1 means next unit.
field name
The field name. For new fields it is the name introduced in DBNAME_extrafields.
mapping
The mapping function to use. For new fields this is probably ident, but if some further quantization is desired it may be achieved by defining a new Discrete and Map. See section Discretes and Maps, for detailed information.
size
The number of members in the category. If the mapping is ident, this is the number of members in the field declaration. If the mapping is something other than ident, it is the number of items in the category being mapped to.

Note that categorical distance functions will be trained in phone groups. This will rarely be wrong, but may sometimes give more differentiation than is necessary.

In our example we have two new fields we wish to train. Both are categorical fields, so we add their descriptions to the variable DBNAME_TableDFs. The additions should look like

     (tobi_accent   0 tobi_accent ident 8)
     (tobi_tone     0 tobi_tone ident 7)

That is (for the first line): the new distance function is called tobi_accent; it applies to the current phone (offset 0), uses the field name tobi_accent with no mapping (ident), and has 8 members.

If we wished to have a distance function not only on the current phone but also on its context, we could add distance functions that include the ToBI accents of the left and right context, as in

     (p_tobi_accent   -1 tobi_accent ident 8)
     (n_tobi_accent    1 tobi_accent ident 8)

Thus the distance name and the position offset change, but the field name of course remains the same.

Once the new distance measures have been defined, you can train the new weights. The standard database script can be used:

     db_utils/make_training

The calculated distance measures in the directory `dist/' are re-used by default if they exist, so keeping them is useful when training on different fields, as it makes re-training much faster. Note that if you change the phoneme set you must delete the old distance files so that they are re-created.

After training you should check the training log file in `index/DBNAME_train.log' to see the contribution of the new fields you have introduced.

Modifying CHATR to Predict a New Field

Once it is decided that a new field is worth predicting, CHATR will need to be modified to actually predict it. Functions in the file `$CHATR_ROOT/src/udb/udb_targfuncs.c' are used to generate all target fields. These functions take a segment stream cell and return a Lisp cell (to deal generically with the appropriate type: float, int or categorical). An entry should be added to the table df_targ_val_name2func relating the new field name to a function. The function may simply access a field in the segment stream cell (or one related to it), or do some calculation. A feature function may already exist that generates the appropriate value, in which case it may simply be called through a simple wrapper function (cf. udb_tf_sylpos).

Objective Distance Measure

The object of training is to find the weighting that minimizes the distance of the selected units from the original. We do not yet know the ideal distance measure; it will probably be a signal processing value that closely follows human perception of good and bad synthesis. Approximations of this measure are possible, and CHATR supports a mechanism for selecting which to use. The distance method used is defined through the variable cep_dist_parms. This name was chosen as it will most likely involve manipulation of cepstrum parameters.

The assumption is that a set of parameters is defined for each frame (at some increment) in the database. This is specified through the CoefFileSkeleton setting of the Database command. The format of these files may vary; HTK-headed and ATR improved cepstrum files are currently supported. New formats should be added to `file/cep_io.c'.

The distance measures themselves are defined in `chatr/cep_dist.c'. Currently supported are Euclidean and weighted Euclidean. Two alignment options are also provided: naive, which does no time alignment between the selected units' cepstrum vectors and the original (just taking the shortest), and tw, which linearly interpolates the selected units' cepstrum vectors to the original. Note that after selection each individual unit's cepstrum vectors are time aligned, so overall alignment should not be a problem.
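The exact form of cep_dist_parms is not reproduced here; as a purely hypothetical sketch, selecting the Euclidean measure with tw alignment might look something like

     (set cep_dist_parms '(euclidean tw))   ;; hypothetical form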


Go to the first, previous, next, last section, table of contents.