A number of mechanisms exist within CHATR to predict the duration of segments in synthesis. This chapter discusses each in turn.
Different duration methods are selected using the Parameter
command. For example
(Parameter Duration_Method KLATT_DUR)
A global stretch parameter is available to modify the overall speed of predicted durations. Note that it is simply a factor by which the duration of each segment is multiplied; no segment reduction takes place. It is set using the command
(Parameter Duration_Stretch 1.2)
The default value is 1.0. A value of 0.0 or less is not allowed. The value is automatically reset to 1.0 whenever a new duration method is selected.
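For example, the following slows speech by roughly 30% (a sketch only; note that the stretch is set after the method is selected, since selecting a method resets the stretch to 1.0)
(Parameter Duration_Method KLATT_DUR)
(Parameter Duration_Stretch 1.3)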
Note that in all cases pause/silence durations are predicted with a different mechanism than non-pause phonemes. A better pause duration prediction system is probably required, but it is already separated from the various existing duration modules. Pause durations are based on the prosodic boundary level of the word ending with the segment immediately preceding the pause. A table of pause lengths based on boundary level may be given through the Stats command.
A typical example (as defined in the library file
`$CHATR_ROOT/lib/data/rp_pause.ch') is
(Stats Pause ( (discourse 400) (sentence 250) (clause 100) (phrase 50) ) )
Predicting the duration of a pause at the beginning of an utterance is a problem hardly worth consideration. In general we do not know what has gone before (except in the text-to-speech case), so we cannot predict how much pause is required. It seems fair to assume that utterances consist of a complete prosodic phrase, so a small pause is not unreasonable. Currently a pause of 50 ms is always generated when a duration module is called.
When using a DATLINK as an audio output device, there is a significant delay before the playing of the waveform starts. Hence the `Stats Pause' values may need to be reduced. For text-to-speech (synthesizing in sentence-sized chunks), a smaller value for the `sentence' pause is recommended, as the pause generated within the DATLINK between playing waves can be as much as 750 milliseconds. Of course there may be ways to stop the DATLINK from doing this.
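For example, a reduced table along these lines might be used with a DATLINK (the values here are hypothetical, not taken from the library files)
(Stats Pause ( (discourse 400) (sentence 50) (clause 100) (phrase 50) ) )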
The Klatt duration method is an implementation of the Klatt duration rule system as described in Allen 87 [Ch. 9]. It follows the 10 rules as closely as possible. The module requires initialization using the Stats function, which takes the form
(Stats Klatt_dur PHONEME-SET-NAME ( (phone_0_stats) (phone_1_stats) ... (phone_n_stats) ) )
The PHONEME-SET-NAME is optional. If specified, it must be the name of a currently defined phoneme set. If omitted, the current input phoneme set is assumed. Individual phoneme statistics consist of a triplet: a phoneme name, an inherent duration (in milliseconds), and a minimum duration (in milliseconds).
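For reference, the Klatt rules combine these two values per phoneme roughly as follows (this is the standard formulation from Allen 87, not a description of the CHATR source)
DUR = ((INHDUR - MINDUR) * PRCNT) / 100 + MINDUR
where PRCNT is the percentage lengthening or shortening accumulated from the applicable rules.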
As an example, a partial description for the `mrpa' phoneme set is
(Stats Klatt_dur mrpa
   ((@  120  60)   ; AX
    (@@ 180  80)   ; ER
    (a  230  80)   ; AE
    (aa 240 100)   ; AA
    (ai 250 150)   ; AY
    (au 240 100)   ; AO
    (b   85  60)   ; BB
    ...))
A full `mrpa' definition is listed in the library file `$CHATR_ROOT/lib/data/rp_dur.ch'.
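A minimal setup might thus look like the following (a sketch; this assumes the interpreter provides a standard Lisp-style load for reading the library file)
(load "$CHATR_ROOT/lib/data/rp_dur.ch")
(Parameter Duration_Method KLATT_DUR)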
This has been extracted from the NUUTALK code as its own stand-alone CHATR duration module. It is specific to the Japanese phoneme set `nuuph' (and not very robust with alternatives). It is viewed as a stop-gap allowing Japanese to be synthesized in a more general way, without depending on internal NUUTALK code. No parameters are available for modification.
The Kaiki duration method is selected by the following command
(Parameter Duration_Method KAIKI_DUR)
Using some of the ideas from Campbell 92, this duration method breaks the task into two levels. First, syllable durations are predicted. Then, based on those values, the durations of the phonemes within each syllable are predicted. In this implementation, both the syllable durations and the phoneme durations are predicted using one or two neural nets. See section Neural Nets, for a description of how to use the neural net system within CHATR.
Note that silence and pause durations are not predicted using these nets; a separate pause duration mechanism based on phrase-break levels is employed. See section Pause Durations, for details.
The nets and a description of their inputs are given to this module through the Lisp variable nnd_nets. Its value may be of length 2 or 4. A net is described by two items: a list of atomic input features and a net itself (as generated by the function NN_Train). If two nets are given (length=4), the first net is used to predict syllable durations while the second is used to generate phoneme durations. If only one net is given (length=2), it is used to predict phoneme durations directly.
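Schematically, the two-net (length=4) case has the following shape (a sketch only; the names here are placeholders, see the library files for real definitions)
( (syl-feat-1 ... syl-feat-n)       ; input features for the syllable net
  SYLLABLE-NET                      ; net object built by NN_Train
  (phone-feat-1 ... phone-feat-m)   ; input features for the phoneme net
  PHONEME-NET )                     ; net object built by NN_Train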
Features used as input to the neural net are obtained via the feature function mechanism. See section Feature Functions, for a full description.
The library file `$CHATR_ROOT/lib/data/f2b_dur_nnet.ch' contains an example of NNet duration data. This has been trained from the BU FM Radio database using the female radio announcer f2b. The syllable net inputs are
(ppblvl pblvl blvl nblvl nnblvl pcoda coda paccented accented naccented ppbprom pbprom bprom nbprom nnbprom ppstress pstress stress nstress nnstress ppvtypeN pvtypeN vtypeN nvtypeN nnvtypeN onset nonset foot remssyl remssylsent psyl_type syl_type nsyl_type )
Note that all these features return character strings of digits. When concatenated together, they form the input to the net. The best way to find the definitions of these features is to look at the code in the file `$CHATR_ROOT/src/chatr/feats.c'.
Parameters to this module are set in the Lisp variable nnd_params.
Two examples are included. The file `$CHATR_ROOT/lib/data/f2b_dur_nnet.ch' offers a syllable net and a phoneme net, while `$CHATR_ROOT/lib/data/f2b_phnet.ch' offers direct phoneme prediction.
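A typical selection sequence might therefore be (a sketch, again assuming a standard Lisp-style load)
(load "$CHATR_ROOT/lib/data/f2b_dur_nnet.ch")
(Parameter Duration_Method KAIKI_DUR)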
A method for using linear regression for duration prediction is also included. The CHATR Lisp function Linear_Regression may be used to build linear regression models. Once created, they may be used in duration prediction as described below. The method is selected using the command
(Parameter Duration_Method LR_DUR)
Once set, the module takes its input from the Lisp variable dur_lr_model. Its value should be a pair of linear regression models, each consisting of a list of pairs giving a feature name and a weight. The first value in the model should be the intercept. The model should predict z-scores of absolute durations in milliseconds (this may change).
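Schematically, and with invented feature names and weights purely for illustration, a dur_lr_model value looks like
( ((Intercept 0.27) (pblvl 0.06) (stress -0.11) ...)     ; first model
  ((Intercept 0.25) (nblvl 0.04) (accented 0.09) ...) )  ; second model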
A second variable, dur_lr_targ_stats, should contain a list of phonemes together with means and standard deviations (in milliseconds) for the target speaker.
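Schematically, each entry gives a phoneme name, a mean, and a standard deviation in milliseconds (the numbers here are invented)
( (a 230 45) (aa 250 50) (b 85 20) ... )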
Thus this method allows a degree of speaker independence, though no formal tests have been made of how well cross-prediction works.
See the file `$CHATR_ROOT/lib/data/f2b_lrdur.ch' for some English examples. The file `$CHATR_ROOT/lib/data/mht_lrdur.ch' contains Japanese examples.
The target statistics may be created for a database at database creation time using the script `$CHATR_ROOT/db_utils/make_lrdurstats'.
A second linear regression module exists that follows the Campbell method more closely, but it has yet to be fully tested. It is selected using
(Parameter Duration_Method LR_DUR_SYL)
In this case dur_lr_model should contain a single linear regression model for predicting syllable durations (from syllable cells). The second variable, dur_lr_targ_stats, should contain each phone with its mean and standard deviation of log duration.
The syllable duration is predicted first; the durations of the phonemes within it are then summed, and a factor is found that modifies each of them by a fraction of its standard deviation so that they fit the predicted syllable duration. Again this should have a degree of speaker independence, following Campbell's original work Campbell 92, but full tests have not yet been made. However, in this case we are not using neural nets, which may cause the success of the method to differ.
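In outline, the elasticity scheme of Campbell 92 works as follows (a sketch of the published method, not a description of the CHATR source). Given a predicted syllable duration S and, for each phone i in the syllable, a mean mu_i and standard deviation sigma_i of log duration, a single factor k is found such that
sum_i exp(mu_i + k * sigma_i) = S
Each phone is then given the duration exp(mu_i + k * sigma_i), so all phones in the syllable are stretched or compressed by the same number of standard deviations.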