

Tuning Prosodic Models

This chapter describes how speaker-specific duration and intonation models may be created from labeled databases. These techniques are still very experimental and appropriate only to some databases--discretion is advised in their use.

Hopefully the following is not just a step-by-step set of instructions for training models, but also gives some insight into the sort of investigations which are possible with the CHATR system.

PhonoForm Utterance Types

There are many types of information which are pertinent to the generation of prosody in speech synthesis. As anyone who has tried to build models from data will be deeply aware, getting the appropriate information out of a database in the right format is a time-consuming and error-prone task. To combat that, the following models all extract their information from a common structure built from information in a speech database. Building that structure is still non-trivial, but once it is built all systems may access it in a well-defined, uniform manner, thus reducing effort and errors.

PhonoForm utterances have been built and supplied for speakers f2b and MHT. Files can be found under `$ROOT/f2b/chatr/pf/' and `$ROOT/MHT/chatr/pf/' respectively. For the local path-names to these databases, check the file `$CHATR_ROOT/lib/data/itlspeakers.ch'.

Building PhonoForm Utterance Types

The object is to create a CHATR utterance which contains all the information that you wish to use in generating training and test data for building models. The PhonoForm utterance type is designed for exactly that: it allows explicit specification of the segmental, syllabic, pitch, power, intonation and phrasing information described below.

A PhonoForm structure may be created for each utterance in the database from a set of XWAVES-labeled files of the form described below. Some of these files already exist (having been made during the speech synthesis database creation process), and others can easily be produced from that information. There are some, however, that require significant effort to obtain, e.g. hand labeling. The files required are

phoneme labels
These are directly available from the `$SPEAKER_ROOT/lab/' directory.
pitch
These can be created from the information generated during the building of a speech synthesis database, using the script `$CHATR_ROOT/db_utils/make_pf_pitch'.
power
These can be created from the information generated during the building of a speech synthesis database, using the script `$CHATR_ROOT/db_utils/make_pf_power'.
syllables
These identify the ends of syllables and also specify whether each syllable is `stressed' or not. Note that the definition of `stressed' is database dependent. This value is used in the PhonoForm utterance in the same field as lexical stress, so the two should at least approximately match. Although there are algorithms that can syllabify phonemes from any source, it is much more useful if this syllabification matches that produced by the lexicon that is to be used in text-to-speech. It is up to the user to provide this file. In Japanese it is easier: a phoneme-label to syllable (actually mora) script is provided in `$CHATR_ROOT/db_utils/make_pf_jsyl'. `Stressed' and `unstressed' in Japanese should correspond to lexical accents (i.e. what we mark with a single quote); getting that information into the syllable file is the responsibility of the user.
tones
ToBI tones. Although named ToBI, this does not necessarily need to be any particular ToBI labeling system--it currently works well with either English or Japanese ToBI--and any appropriate intonation labeling should do. Note that these labels will be syllable-aligned in the PhonoForm structure. Any fine marking of position within a syllable (e.g. the `<' in JToBI) or fine positioning (e.g. the `HiF0' in ToBI) is currently ignored. Another problem is the accuracy of the labeling, especially of starting and ending tones. Ending tone positions sometimes fall within the syllable boundary (i.e. on the last phoneme in the syllable) and sometimes just beyond it; a similar, but reversed, situation happens with starting tones. An attempt to compensate for this error is made in the final construction, but it is not guaranteed to work in all cases.
breaks
ToBI break indices. These should be values between 0 and 4.
words
These identify word boundaries. In Japanese this is not so clear, but these boundaries should reflect what word boundaries would occur in actual synthesis, which is probably `bunsetsu'.

Assuming the above files exist in the speaker database, they should be collected together into one file for each file-id and placed under the directory `$SPEAKER_ROOT/others/'. Each file should take the form of an XWAVES-labeled file, but with no header and with the `color' field set to the label type (word, phone, tones, etc.). The files are named `file-id.labs'.

At this time small adjustments can be made to some labels to try to ensure they appear on the correct side of boundaries. It may be necessary to change these `fiddle factors' to get the right results.

Once created, the `*.labs' files may be converted into CHATR utterance input form, i.e. a bracketed structure. The following script attempts to do this automatically

     db_utils/make_pfs

The result should be a set of files in `$SPEAKER_ROOT/chatr/pf/' that describe an utterance in the database, one for each file-id.

Using PhonoForm Utterances

The simplest way to use PhonoForm utterances is as follows

     (test_pf FILE-ID)

Like the other test function, `test_seg', this excludes the named file from the database and then synthesizes it from the remaining database units. In this form the result should be the same as a simple `test_seg', as the segments, durations, power and pitch are the same. See the file `$CHATR_ROOT/lib/data/udb.ch' for the definition of `test_pf'. This file also contains some other examples of general functions for testing udb databases.

By default a PhonoForm utterance is simply loaded and then the waveform synthesizer is called. A more interesting way of using such an utterance is to load it, then call your own module, then call the waveform synthesis routine. This way you can find out how well (or badly) your module affects the synthesis. The easiest way to do this is to define your own synthesis routine. Suppose we wish to check a new duration module and hence call only duration prediction, in the context of natural phrasing, accents, pitch and segments. We could define

     (define dur_synth (utt)
        (Input utt)     ;; do all the loading of the utterance
        (Duration utt)  ;; just run my new duration module
        (Synthesis utt) ;; generate a new waveform
     )

Now we can test our duration module with everything else natural.

     (dur_synth (lpf FILE-ID))

The function lpf simply loads the PhonoForm utterance.

NOTE: the above example is a little simplistic, as the pitch target points may be adversely affected by the fact that the durations have changed; you may need to call Int_target again to regenerate the intonation pitch targets. Also, lpf does not exclude the named file, `FILE-ID', from the unit selection search. Note too that it may be better to modify the original waveform if you are testing duration prediction: if you use PSOLA directly on the original waveform, you hear only the distortion introduced by the duration module, rather than the distortion introduced by the duration module plus that introduced by unit selection.
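
For instance, a slightly safer version of `dur_synth' above would regenerate the pitch targets after the new durations have been assigned. The following is a minimal sketch only, assuming that Int_target takes the utterance as its argument in the same way as the other modules used above:

     (define dur_synth2 (utt)
        (Input utt)       ;; load the PhonoForm utterance information
        (Duration utt)    ;; run the new duration module
        (Int_target utt)  ;; regenerate pitch targets for the new durations
        (Synthesis utt)   ;; generate a new waveform
     )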

Extracting Features

Another use of the PhonoForm utterance structure is to extract data for building prediction models. The function Feats_Out will dump a set of features for a given utterance. For example, suppose we wish to determine the mid-pitch of all syllables, and believe that features such as lexical stress, ToBI accent, the break index after the syllable, the number of syllables in from the start of the phrase and the number of syllables to the end of the phrase can be used to predict that value. For a PhonoForm utterance, simply run

     (test_pf FILE-ID)
     (Feats_Out utt_pf 'Syl 
      '(syl_f0 stress tobi_accent bi syl_in syl_out)
      (strcat "feats/" FILE-ID ".sylinfo"))

This will dump the information to a file called `feats/FILE-ID.sylinfo', one line per vector. Of course, this information is really required for the whole database. So, assuming that the Lisp variable files is set to a list of all file-ids in the database, that can be achieved by

     (Parameter Synth_Method NONE)
     (set required_feats '(syl_f0 stress tobi_accent ...))
     (define get_feats (name)
        (print name)   ;; so we can see the progress
        (test_pf name)
        (Feats_Out utt_pf 'Syl required_feats 
                 (strcat "feats/" FILE-ID ".sylinfo")))
     (mapc get_feats files)

The first line means that no waveform synthesis occurs, making this dump substantially faster. See section Training a New Duration Model, for a large example of using this technique to collect information from a database for the building of models.

Available features are defined in the file `$CHATR_ROOT/src/chatr/feats.c', and may be listed in Lisp by calling Feats_Out with no arguments.
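
For example, to list the available feature names before deciding which to dump, simply enter

     (Feats_Out)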

Training a New Duration Model

This section describes how to extract information from a database (using PhonoForm utterances) and build a linear regression model for predicting duration. A duration model like this is already included in the CHATR distribution, in `$CHATR_ROOT/lib/data/f2b_lrdur.ch', built from the BU FM Radio data corpus speaker `f2b'. This section shows how to produce the same for a new speaker.

See the file `$CHATR_ROOT/lib/examples/train_lrdur.ch' for the example lisp code discussed here.

The first stage is to extract data from the database in the form of features. In this instance we are going to build two models, one for vowels and one for consonants. As these two models use different features, we have to create two sets of feature files. So two feature lists are defined, one for consonants in the variable C_durfields and a second for vowels in V_durfields. Features available are defined in `$CHATR_ROOT/src/chatr/feats.c'.
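
The exact feature sets depend on the speaker and on what is available in `feats.c'. As a purely illustrative sketch (the duration feature name seg_zdur, and any other names not already used in this chapter, are hypothetical and should be checked against `feats.c'):

     ;; Hypothetical feature lists: the first feature is the value to be
     ;; predicted (a z-scored segment duration), the rest are predictors.
     (set C_durfields '(seg_zdur stress bi syl_in syl_out))
     (set V_durfields '(seg_zdur stress tobi_accent bi syl_in syl_out))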

After the loading and setup of the database, the variable files contains a list of all file-ids in the database. This entire list will be used first and the split of training and testing data done after.

The dumping functions must be defined next. Since we cannot very easily dump the features for vowels and consonants separately, we will dump them all and split them later. At the same time we dump the individual phonemes to a separate file. The function dumpfeats will map this dumping over the whole database. Enter

     (dumpfeats)

For each file-id in the database this function will dump three files under a directory called `$SPEAKER_ROOT/feats/'. These files contain the vowel features (as defined by V_durfields), the consonant features (as defined by C_durfields), and the phonemes.

Note that if the features of a single file-id are required, the function get_feats_utt can be used.
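
The real definitions of these functions are in `$CHATR_ROOT/lib/examples/train_lrdur.ch'. A rough sketch of their shape, following the pattern of get_feats above, might look like the following; the 'Seg stream name, the seg_name feature and the output file suffixes here are assumptions, not the actual definitions:

     ;; Sketch only -- see train_lrdur.ch for the actual definitions.
     (define get_feats_utt (name)
        (print name)                      ;; show progress
        (test_pf name)                    ;; load the PhonoForm utterance
        (Feats_Out utt_pf 'Seg V_durfields
                 (strcat "feats/" name ".vowdur"))
        (Feats_Out utt_pf 'Seg C_durfields
                 (strcat "feats/" name ".condur"))
        (Feats_Out utt_pf 'Seg '(seg_name)
                 (strcat "feats/" name ".phones")))
     (define dumpfeats ()
        (mapc get_feats_utt files))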

The next task is to collate the fields into training and test data files for the linear regression software. An example shell script to do this is given in `$CHATR_ROOT/lib/data/examples/make_lrdat'. The script first collates all the consonant data together and removes all vectors that are actually for pauses, breaths and vowels. It then splits the feature sets into training and test sets, simply adding parentheses around the vectors. Finally it does the same for the vowel data.

The third stage is to build the linear regression models from the feature vectors. This should be done in CHATR with the file `$CHATR_ROOT/lib/examples/train_lrdur.ch' loaded, from the `$SPEAKER_ROOT/dur/' directory. The function dolrall takes a file name and a list of features and builds a linear regression model, then tests that model with the training and test data. Two calls are necessary, one for the consonants and a second for the vowels. The commands are

     (dolrall "datC01" C_durfields)
     (dolrall "datV01" V_durfields)

Note that the mean errors are in z-scores, so they are not immediately recognizable as durations.
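
If the errors need to be read as actual durations, they can be converted back; the following assumes the usual per-phoneme z-score normalization by mean and standard deviation (the exact normalization used is an assumption here):

     duration = mean_dur(phoneme) + z * stddev_dur(phoneme)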

The files `datC01.info' and `datV01.info' contain the detailed results of the linear regression and should be studied carefully. Points of interest are of course the correlation, the stepwise model showing the relative contribution of each feature, and the `dropped' section showing features which make no contribution. Often this is because they are completely predictable from some other feature or set of features.

Once you are happy with the prediction capability, you can build an actual duration model from this data and test it in the system. This is done by a call to the function save_lrmodel with arguments of consonant model name, vowel model name and output file name. For example

     (save_lrmodel "datC01" "datV01" "lrmodel01.ch")

It is recommended to edit the created file `lrmodel01.ch' to add a comment about where this model came from.

Running the above function also sets the variable dur_lr_model by default. It is more likely that some other variable should be set and that dur_lr_model only be set when this model is actually selected. See the file `$CHATR_ROOT/db_utils/DBNAME_synth.ch' for a typical use of dur_lr_model.
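
One possible arrangement (the variable and function names here are purely illustrative) is to stash the newly built model in a speaker-specific variable and assign dur_lr_model only when that speaker's duration setup is selected:

     ;; Illustrative only: save_lrmodel has just set dur_lr_model, so keep
     ;; a copy under a speaker-specific name and restore it on selection.
     (set f2b_lrdur_model dur_lr_model)

     (define use_f2b_lrdur ()
        (set dur_lr_model f2b_lrdur_model))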

Now that a duration model is built we can actually use it to predict durations. Again we wish to run through the whole database, loading the `pf' utterances, saving the actual durations, then running our new duration module on each utterance and saving the predicted durations. We can do this with the function test_new_dur_model defined in `$CHATR_ROOT/lib/examples/train_lrdur.ch'. From the main database directory enter

     (test_new_dur_model "dur/lrmodel01.ch")

There will now be files `*.accdur' and `*.preddur', under directory `feats/', one of each for each file-id. It is left as an exercise for the reader to use these files to find the mean error and correlation for vowels and consonants and the overall model.

Training a ToBI-Based F0 Prediction Model

Here we will give an example of building an F0 prediction model using linear regression with ToBI labels as input parameters. First it is assumed that your database from which the model is to be built is labeled with ToBI (or similar) labels, and a set of PhonoForm utterances has been created. See section PhonoForm Utterance Types, for more information.

We can build a new F0 prediction model in much the same way as a new duration model. See section Training a New Duration Model, for an example. Again, you need to decide on the features. In this case three models are required; however, as they are for syllables rather than phonemes, the amount of data is much smaller than in the duration case.
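
As a purely illustrative sketch, assuming the three models predict the F0 value at the start, mid and end of each syllable (this split, and the feature names other than those already used in this chapter, are assumptions):

     ;; Illustrative only -- check feats.c for the real feature names.
     (Feats_Out utt_pf 'Syl '(syl_startf0 stress tobi_accent bi syl_in syl_out)
              (strcat "feats/" FILE-ID ".startf0"))
     (Feats_Out utt_pf 'Syl '(syl_f0 stress tobi_accent bi syl_in syl_out)
              (strcat "feats/" FILE-ID ".midf0"))
     (Feats_Out utt_pf 'Syl '(syl_endf0 stress tobi_accent bi syl_in syl_out)
              (strcat "feats/" FILE-ID ".endf0"))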

Training a New Reduction Model

The reduction model used for speaker f2b and some other English speakers was trained from the f2b database. Again the PhonoForm utterances were used. The feature pf_reduced was used as the value to be predicted. This value is calculated for each syllable: the word the syllable is in is looked up in the lexicon, and an attempt is made to align the syllables in the lexicon version with those in the actual version. If they do line up, a check is made to see whether the syllable's vowel is the same as the vowel in the corresponding syllable of the lexical entry. If it is different, a list of schwa pairs is checked, and if the actual vowel is listed as a schwa version of the lexical vowel, the syllable is marked as reduced.

A set of vectors was collected, including this reduced value plus other pertinent features. From that information a CART decision tree was built to try to predict reduction. The result is the small tree now in the file `$CHATR_SOURCE/lib/data/reduce.ch'. The tree seems sensible, and the resulting speech sounds reasonable.
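
The vector collection itself is analogous to the earlier Feats_Out examples. A hedged sketch (pf_reduced is the feature named above; the predictor features listed with it are merely illustrative):

     (Feats_Out utt_pf 'Syl '(pf_reduced stress tobi_accent bi syl_in syl_out)
              (strcat "feats/" FILE-ID ".reduce"))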

This technique does effectively train to a particular lexicon, as various lexicons often make different decisions about the amount of vowel reduction in their lexical entries. This particular example was trained using the CMU lexicon, and if the same reduction tree is used with a speaker which uses the BEEP lexicon, the results are not so good.

