A number of mechanisms exist within CHATR to predict the duration of segments in synthesis. This chapter discusses each in turn.
Different duration methods are selected using the Parameter
command. For example
(Parameter Duration_Method KLATT_DUR)
A global stretch parameter is available to modify the overall speed of predicted durations. Note that it is simply a factor by which the duration of each segment is multiplied; no segment reduction takes place. It is set using the command
(Parameter Duration_Stretch 1.2)
The default value is 1.0. A value of 0.0 or less is not allowed. The value is automatically reset to 1.0 whenever a new duration method is selected.
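For example, the following slows speech by roughly 30% (a sketch only; note that the stretch is set after the method is selected, since selecting a method resets the stretch to 1.0)
(Parameter Duration_Method KLATT_DUR)
(Parameter Duration_Stretch 1.3)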
Note that in all cases pause/silence durations are predicted with a different mechanism than non-pause phonemes. A better pause duration prediction system is probably required, but it is already separated from the various existing duration modules. Pause durations are based on the prosodic boundary level of the word ending with the segment immediately preceding the pause. A table of pause lengths based on boundary level may be given through the Stats command.
A typical example (as defined in the library file
`$CHATR_ROOT/lib/data/rp_pause.ch') is
(Stats Pause ( (discourse 400) (sentence 250) (clause 100) (phrase 50) ) )
Predicting the duration of a pause at the beginning of an utterance is a problem hardly worth consideration. In general we do not know what has gone before (except in the text-to-speech case), so we cannot predict how much pause is required. It seems fair to assume that utterances consist of a complete prosodic phrase, so a small pause is not unreasonable. Currently a pause of 50 ms is always generated when a duration module is called.
When using a DATLINK as an audio output device, there is a significant delay before the playing of the waveform starts. Hence the `Stats Pause' values may need to be reduced. For text-to-speech (synthesizing in sentence-sized chunks), a smaller value for the `sentence' pause is recommended, as the pause generated within the DATLINK between playing waves can be as much as 750 milliseconds. Of course there may be ways to stop the DATLINK from doing this.
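For example, a reduced table along these lines might be used with a DATLINK (the values here are hypothetical, not taken from the library files)
(Stats Pause ( (discourse 400) (sentence 50) (clause 100) (phrase 50) ) )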
The Klatt duration method is an implementation of the Klatt duration rule system as described in Allen 87 [Ch. 9]. It follows the 10 rules as closely as possible. The module requires initialization using the Stats function, which takes the form
(Stats Klatt_dur PHONEME-SET-NAME ( (phone_0_stats) (phone_1_stats) ... (phone_n_stats) ) )
The PHONEME-SET-NAME is optional. If specified, it must be the name of a currently defined phoneme set. If omitted, the current input phoneme set is assumed. Individual phoneme statistics consist of a triplet: a phoneme name, an inherent duration (in milliseconds), and a minimum duration (in milliseconds).
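For reference, the Klatt rules combine these two values per phoneme roughly as follows (this is the standard formulation from Allen 87, not a description of the CHATR source)
DUR = ((INHDUR - MINDUR) * PRCNT) / 100 + MINDUR
where PRCNT is the percentage lengthening or shortening accumulated from the applicable rules.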
As an example, a partial description for the `mrpa' phoneme set is
(Stats Klatt_dur mrpa
   ((@  120  60)   ; AX
    (@@ 180  80)   ; ER
    (a  230  80)   ; AE
    (aa 240 100)   ; AA
    (ai 250 150)   ; AY
    (au 240 100)   ; AO
    (b   85  60)   ; BB
    ...))
A full `mrpa' definition is listed in the library file `$CHATR_ROOT/lib/data/rp_dur.ch'.
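A minimal setup might thus look like the following (a sketch; this assumes the interpreter provides a standard Lisp-style load for reading the library file)
(load "$CHATR_ROOT/lib/data/rp_dur.ch")
(Parameter Duration_Method KLATT_DUR)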
This has been extracted from the NUUTALK code as its own stand-alone CHATR duration module. It is specific to the Japanese phoneme set `nuuph' (and not very robust with alternatives). It is viewed as a stop-gap allowing Japanese to be synthesized in a more general way, without depending on internal NUUTALK code. No parameters are available for modification.
The Kaiki duration method is selected by the following command
(Parameter Duration_Method KAIKI_DUR)
Using some of the ideas from Campbell 92, this duration method breaks the task into two levels. First, syllable durations are predicted. Then, based on those values, the durations of the phonemes within each syllable are predicted. In this implementation, both the syllable durations and the phoneme durations are predicted using one or two neural nets. See section Neural Nets, for a description of how to use the neural net system within CHATR.
Note that silence and pause durations are not predicted using these nets; a separate pause duration mechanism based on phrase-break levels is employed. See section Pause Durations, for details.
The nets and a description of their inputs are given to this module through the Lisp variable nnd_nets. Its value may be of length 2 or 4. A net is described by two items: a list of atomic input features and a net itself (as generated by the function NN_Train). If two nets are given (length=4), the first net is used to predict syllable durations while the second is used to generate phoneme durations. If only one net is given (length=2), it is used to predict phoneme durations directly.
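Schematically, the two-net (length=4) case has the following shape (a sketch only; the names here are placeholders, see the library files for real definitions)
( (syl-feat-1 ... syl-feat-n)       ; input features for the syllable net
  SYLLABLE-NET                      ; net object built by NN_Train
  (phone-feat-1 ... phone-feat-m)   ; input features for the phoneme net
  PHONEME-NET )                     ; net object built by NN_Train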
Features used as input to the neural net are obtained via the feature function mechanism. See section Feature Functions, for a full description.
The library file `$CHATR_ROOT/lib/data/f2b_dur_nnet.ch' contains an example of NNet duration data. This has been trained from the BU FM Radio database using the female radio announcer f2b. The syllable net inputs are
(ppblvl pblvl blvl nblvl nnblvl pcoda coda paccented accented naccented ppbprom pbprom bprom nbprom nnbprom ppstress pstress stress nstress nnstress ppvtypeN pvtypeN vtypeN nvtypeN nnvtypeN onset nonset foot remssyl remssylsent psyl_type syl_type nsyl_type )
Note that all these features return character strings of digits. When concatenated together, they form the input to the net. The best way to find the definitions of these features is to look at the code in the file `$CHATR_ROOT/src/chatr/feats.c'.
Parameters to this module are set in the Lisp variable nnd_params.
Two examples are included. The file `$CHATR_ROOT/lib/data/f2b_dur_nnet.ch' offers a syllable net and a phoneme net, while `$CHATR_ROOT/lib/data/f2b_phnet.ch' offers direct phoneme prediction.
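A typical selection sequence might therefore be (a sketch, again assuming a standard Lisp-style load)
(load "$CHATR_ROOT/lib/data/f2b_dur_nnet.ch")
(Parameter Duration_Method KAIKI_DUR)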
A method for using linear regression for duration prediction is also included. The CHATR Lisp function Linear_Regression may be used to build linear regression models. Once created, they may be used in duration prediction as described below. The method is selected using the command
(Parameter Duration_Method LR_DUR)
Once set, the module takes its input from the Lisp variable dur_lr_model. Its value should be a pair of linear regression models, each consisting of a list of pairs giving a feature name and a weight. The first value in the model should be the intercept. The model should predict z-scores of absolute durations in milliseconds (this may change).
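Schematically, and with invented feature names and weights purely for illustration, a dur_lr_model value looks like
( ((Intercept 0.27) (pblvl 0.06) (stress -0.11) ...)     ; first model
  ((Intercept 0.25) (nblvl 0.04) (accented 0.09) ...) )  ; second model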
A second variable, dur_lr_targ_stats, should contain a list of phonemes together with means and standard deviations (in milliseconds) for the target speaker.
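Schematically, each entry gives a phoneme name, a mean, and a standard deviation in milliseconds (the numbers here are invented)
( (a 230 45) (aa 250 50) (b 85 20) ... )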
Thus this method allows a degree of speaker independence, though no formal tests have been made of how well cross-prediction works.
See the file `$CHATR_ROOT/lib/data/f2b_lrdur.ch' for some English examples. The file `$CHATR_ROOT/lib/data/mht_lrdur.ch' contains Japanese examples.
The target statistics may be created for a database at database creation time using the script `$CHATR_ROOT/db_utils/make_lrdurstats'.
A second linear regression module exists that follows the Campbell method more closely, but it has yet to be fully tested. It is selected using
(Parameter Duration_Method LR_DUR_SYL)
In this case dur_lr_model should contain a single linear regression model for predicting syllable durations (from syllable cells). The second variable, dur_lr_targ_stats, should contain each phone with its mean and standard deviation of log duration.
The syllable duration is predicted first; the durations of the phonemes within it are then summed, and a factor is found that modifies each of them by a fraction of its standard deviation so that they fit the predicted syllable duration. Again this should have a degree of speaker independence, following Campbell's original work Campbell 92, but full tests have not yet been made. However, in this case we are not using neural nets, which may cause the success of the method to differ.
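In outline, the elasticity scheme of Campbell 92 works as follows (a sketch of the published method, not a description of the CHATR source). Given a predicted syllable duration S and, for each phone i in the syllable, a mean mu_i and standard deviation sigma_i of log duration, a single factor k is found such that
sum_i exp(mu_i + k * sigma_i) = S
Each phone is then given the duration exp(mu_i + k * sigma_i), so all phones in the syllable are stretched or compressed by the same number of standard deviations.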