A number of different intonation theories have been implemented within CHATR. By `intonation theory' we mean some symbolic representation (possibly with continuous parameters) that can be used to generate an underlying F0 in synthesis.
There are other ways to specify F0 within CHATR apart from an intonation system, such as by specifying values for frames throughout the utterance.
In an intonation system, information is contained within the `Intone' stream. This is primarily related to syllables in English and morae in Japanese. Intonation parameters must be of the same type for a whole utterance. They may be specified directly in some input methods, or predicted by some higher-level part of CHATR, typically the HLP rules.
The HLCB intonation method is selected using the command
(Parameter Int_Method CSTR)
This was the first intonation system to be implemented within CHATR and hence is both the simplest and probably the most stable. It is, however, rather limited. The work is based on that described in Taylor 92. Basically, syllables may be marked with one of four elements: H (high), L (low), C (continuation), or B (boundary). In addition, these elements may be followed by features. The features (like the elements) may be individually defined, but in our examples the defined features are

H  early, late, downstep
L  early, late, downstep
C  rise
B  initial
Elements and features define values and modifications of values for a fixed number of continuous parameters. They are used in the prediction of the RFC (rise/fall/connection) description, a lower-level, more explicit representation of the F0 contour. These definitions may be tuned for a particular speaker's pitch range.
Definitions should be made using the Stats Intonation
command.
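Purely as an illustration, such a definition might look something like the Tilt speaker table shown later in this section. Everything below apart from the element names H, L, C and B is an assumption: the rfc space name, the parameter names and the numeric values are placeholders, not a working speaker definition:

;; Placeholder sketch only -- not taken from a real speaker file.
(Stats Intonation
 ( (Element H (def rfc H) ( (amp = 40 Hz) (dur = 250 mS) ))
   (Element L (def rfc L) ( (amp = -20 Hz) (dur = 250 mS) ))
   (Element C (def rfc C) ( (amp = 0 Hz) ))
   (Element B (def rfc B) ( (amp = -10 Hz) ))
 ))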
The ToBI intonation method is selected using the command
(Parameter Int_Method ToBI)
This is an implementation of the English ToBI system described in Silverman 92. As with the other intonation systems included within CHATR, it consists of three sub-parts:
Stage 1 is performed before any duration information is available. This is because duration prediction methods need to know accent information, so accents and tones must be predicted before the duration module is run.
Stage 2 is called after durations have been predicted and hence can deal with absolute positioning.
ToBI parameters are set using the variable ToBI_params. This should be a Lisp a-list of name and value pairs. Currently supported names are

pitch_accents
  (pitch_accents H* !H* L* L+H* L*+H L+!H* H+!H* HiF0 X*? *?)
  A subset of these is also acceptable. For reference, the above are presently the actual accents that appear in speaker f2b of the Boston University Radio News corpus. (See Ostendorf 95.)
phrase_accents
  (phrase_accents H- L-)
boundary_tones
  (boundary_tones H-H% L-H% L-L% H-L%)
target_method
  Two target prediction methods are available. The first is APL (see Anderson 84), which predicts target values for syllables that are accented or toned. The second is LR, which uses linear regression to predict start, mid-vowel, and end target points for all syllables. APL is the default; it uses the large number of parameters defined below to tune the predicted value. The results of LR are closer to the natural F0, but at the cost of not being as general. The database building mechanism uses LR. An LR model may be mapped to a different speaker's pitch range using the following two parameters:
target_f0mean
  The mean F0 of the speaker to whose pitch range the LR F0 model should be mapped.
target_f0std
  The F0 standard deviation of that speaker; used together with target_f0mean.
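Putting the names introduced so far together, a ToBI_params a-list might look like the following. The accent, phrase accent and boundary tone inventories are the defaults listed above; the target_f0mean and target_f0std values are placeholders chosen purely for illustration:

((pitch_accents H* !H* L* L+H* L*+H L+!H* H+!H* HiF0 X*? *?)
 (phrase_accents H- L-)
 (boundary_tones H-H% L-H% L-L% H-L%)
 (target_method APL)
 (target_f0mean 170)
 (target_f0std 30))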
If the target method is LR, a list of three linear regression models should be set in the variable tobi_lrf0_model. These predict the start, mid-vowel and end values for a syllable. The feature name-weight `pairs' may optionally have a third argument specifying a feature map. Feature maps allow category-valued features to be mapped to binary ones: if the value returned by a feature is in the named feature map, the value is 1, otherwise it is 0. Example linear regression models can be found in the files `$CHATR_ROOT/lib/data/f2b_lrf0.ch' for English and `$CHATR_ROOT/lib/data/mht_lrf0.ch' for Japanese (JToBI).
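To make that format concrete, the sketch below shows what one of the three models (say, the start-value model) could contain. The feature names, the weights and the intercept entry are all hypothetical; only the shape of each entry, a feature name, a weight and an optional feature map, comes from the description above. See `$CHATR_ROOT/lib/data/f2b_lrf0.ch' for a real model:

( (Intercept 160.0)                 ; hypothetical intercept entry
  (syl_accent 12.5 (H* !H* L+H*))   ; hypothetical feature with a feature map
  (syl_position -8.0) )             ; hypothetical feature without a map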
The following parameters are only used if the target_method
is
APL
. Currently no mechanism is available to automatically
tune these parameters.
topval
  refval for maximum sized accents. Speaker-dependent.
baseval
  refval for minimum sized accents. Speaker-dependent.
refval
h1
  Value by which topval is multiplied to position the step before H accents. Speaker-dependent in some cases.
l1
  Value by which baseval is multiplied to position the step before L accents. Speaker-dependent in some cases.
prom1
  Value by which topval is multiplied to position the top of H accents.
prom2
  Value by which topval or baseval is multiplied to position the top of !H accents, H accents in compound accents, and H and L in phrase accents.
prom3
  Value by which topval or baseval is multiplied to position the end of phrase accents.
HiF0_factor
decline_range
hamwin_size
The actual method used in the implementation was strongly influenced by example code (incomplete) from AT&T Bell Labs, with significant input from Mary Beckman. Hence it follows their model (and parameter names) very closely. The APL technique is also described in Anderson 84.
The JToBI (Japanese-ToBI) intonation method is selected using the command
(Parameter Int_Method JToBI)
This is an implementation (in conjunction with Mary Beckman) of the work described in Pierrehumbert 88b.
Parameters may be set using the variable mb_params
.
Although many parameters are available for controlling the prediction of F0 target points, the same linear regression method used by the English ToBI system produces better results and, more importantly, can be trained.
A linear regression model consists of three separate models for predicting the start, mid-vowel and end target points for a syllable. A fourth item in the variable tobi_lrf0_model is the source mean F0 and standard deviation, which allows F0 pitch mapping between speakers. The format is exactly the same as that used for the English ToBI. An example JToBI LR model can be found in the file `$CHATR_ROOT/lib/data/mht_lrf0.ch'.
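Schematically, then, tobi_lrf0_model holds the three LR models followed by a fourth item giving the source speaker's mean F0 and standard deviation. The one-entry models, the feature name and all the numbers below are placeholders, and the exact layout of the fourth item is an assumption; see `$CHATR_ROOT/lib/data/mht_lrf0.ch' for the real thing:

( ( (hyp_feature 10.0) )   ; start model (placeholder)
  ( (hyp_feature 8.0) )    ; mid-vowel model (placeholder)
  ( (hyp_feature 6.0) )    ; end model (placeholder)
  (120 25) )               ; source mean F0 and standard deviation (placeholders)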
The Fujisaki intonation method is selected using the command
(Parameter Int_Method Fujisaki)
An implementation of the Fujisaki model (Fujisaki 83) is available for Japanese. It is still experimental, but does produce F0 contours. Parameters are set using the variable fujisaki_model. Details of the parameters and their values may be determined by looking at the actual code in the file `$CHATR_ROOT/src/intonation/fujisaki.c'.
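For reference, the standard formulation of the model (as described in Fujisaki 83) superimposes phrase and accent components on a baseline in the log-F0 domain. How these quantities map onto the entries of fujisaki_model is not documented here and should be checked against fujisaki.c:

\ln F_0(t) = \ln F_b
             + \sum_{i=1}^{I} A_{pi}\, G_p(t - T_{0i})
             + \sum_{j=1}^{J} A_{aj}\, \bigl[ G_a(t - T_{1j}) - G_a(t - T_{2j}) \bigr]

G_p(t) = \alpha^2 t\, e^{-\alpha t}                                  \quad (t \ge 0,\ \text{else } 0)
G_a(t) = \min\bigl[ 1 - (1 + \beta t) e^{-\beta t},\ \gamma \bigr]   \quad (t \ge 0,\ \text{else } 0)

Here F_b is the baseline F0, A_{pi} and T_{0i} are the magnitudes and times of the phrase commands, A_{aj}, T_{1j} and T_{2j} are the amplitudes, onsets and offsets of the accent commands, and \alpha, \beta and \gamma are constants of the phrase and accent control mechanisms.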
The Tilt intonation method is selected using the command
(Parameter Int_Method Tilt)
Using the work described in Taylor 93b, this model offers a labeling system which may be automatically derived from waveforms or phoneme labels.
The most difficult part about adding a new speaker is labeling the data. Once the data is in the form that CHATR requires, everything else is simple.
CHATR requires a syllable utterance type description for each utterance. This comprises a list of phrases, each with a start F0. Within each phrase is a list of syllables, each of which may have one or more events marked. An example is given below
(Utterance
 (Syllable (space rfc) (format feature) (dimen num))
 (
  (:C ()
      ((hh 60) (eh 65) ((E)))
      ((l 33) (ow 207) ()))
  (:C ()
      ((dh 27) (ih 56) ((E)))
      ((s 75) (ih 56) (z 44) ())
      ((dh 42) (ax 36) ())
      ((k 95) (aa 129) (n 44) ((E)))
      ((f 77) (r 36) (en 57) ())
      ((s 77) (ao 156) ())
      ((f 83) (eh 105) (s 203) ()))
 ))
In this type of description, only the presence of an event need be marked.
In addition, an RFC input description is required. An example is given below
(Utterance RFC
 ((sil 303 ((sil 0 166)))
  (hh 60 ())
  (eh 65 ((fall 21 166)))
  (l 33 ())
  (ow 207 ((conn 67 125) (sil 197 120)))
  (sil 155 ())
  (dh 27 ((rise 0 149)))
  (ih 56 ())
  (s 75 ((fall 60 173)))
  (ih 56 ())
  (z 44 ())
  (dh 42 ((conn 4 151)))
  (ax 36 ())
  (k 95 ())
  (aa 129 ())
  (n 44 ((fall 5 142)))
  (f 77 ())
  (r 36 ())
  (en 57 ())
  (s 77 ((conn 74 95)))
  (ao 156 ())
  (f 83 ())
  (eh 105 ())
  (s 203 ((sil 91 91)))
  (sil 524 ())))
The CHATR user function train_input
takes these two
utterance descriptions and produces a syllable description in the RFC
event space
(Utterance
 (Syllable (space rfc) (format num) (dimen linear))
 (
  (:C ((Start 166))
      ((hh 60) (eh 65) ((C 0.00) (E 0.00 0.00 -41.00 144.00 21.00)))
      ((l 33) (ow 207) ((C 0.00))))
  (:C ((Start 149))
      ((dh 27) (ih 56) ((C 0.00) (E 24.00 143.00 -22.00 119.00 116.00)))
      ((s 75) (ih 56) (z 44) ((C -9.00)))
      ((dh 42) (ax 36) ())
      ((k 95) (aa 129) (n 44) ((E 0.00 0.00 -47.00 283.00 134.00)))
      ((f 77) (r 36) (en 57) ((C -4.00)))
      ((s 77) (ao 156) ())
      ((f 83) (eh 105) (s 203) ()))
 ))
Next, the function Rfc_to_Tilt is called, which transforms this into tilt space. With a sufficient number of utterances in tilt space, statistics can be collected on each of the four tilt parameters and the phrase start F0 parameter. The means and standard deviations need to be calculated, which can be done using S or any other utility. The tilt descriptions can be derived from the utterance file. Alternatively, the Int_Stats function returns a list of all the events or connections (or both) for an utterance; by calling it with mapc, one can obtain the statistics for a whole database.
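The sketch below illustrates that last step. It assumes that Int_Stats takes a single utterance as its argument and that CHATR's Lisp provides lambda and a list of loaded utterances (here called utterance_list); none of these details are documented above, so treat it purely as an outline of the idea:

;; Outline only: Int_Stats's exact arguments and the existence of
;; utterance_list are assumptions, not documented behaviour.
(mapc
 (lambda (utt) (Int_Stats utt))   ; gather event/connection values per utterance
 utterance_list)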
Once the statistics have been collected, a speaker table can be constructed by entering the mean and standard deviations in the appropriate places. A typical speaker file is given below
(Stats Intonation
 ( (Element E (def tilt E)
    ( (amp = 47 Hz) (dur = 291 mS) (tilt = 0.0 rel) (peak_pos = 59 mS) ))
   (Element E (var tilt E)
    ( (amp = 31 Hz) (dur = 141 mS) (tilt = 0.75 rel) (peak_pos = 136 mS) ))
   (Element C (def any C) ( (amp = 0.0 Hz) ))
   (Element C (var any C) ( (amp = 10 Hz) ))
   (Element P (def any P) ( (amp = 151.0 Hz) ))
   (Element P (var any P) ( (amp = 20 Hz) ))
 ))
A little care needs to be taken here as the system will accept inappropriate feature sets but become confused by them.
An example feature set is
(Stats Intonation
 ( (Feature rise (binary tilt C) ( (amp += 10 rel) ))
   (Feature fall (binary tilt C) ( (amp -= 10 rel) ))
   (Feature amp (scalar tilt E) ( (amp += 1 rel) (dur += 1 rel) ))
   (Feature early (binary tilt E) ( (peak_pos -= 1.1 rel) ))
   (Feature late (binary tilt E) ( (peak_pos += 1.1 rel) ))
   (Feature rise (binary tilt E) ( (tilt += +1 rel) ))
   (Feature fall (binary tilt E) ( (tilt += -1 rel) ))
 ))
Feature headers are defined in the form

<name> (<type> <space> <element>)

The name need only be unique within the space and element, so connection and event features can have the same name without confusion. type refers to scalar or binary. space refers to rfc or tilt, though presently only tilt is fully implemented. element refers to whether the feature operates on an event or a connection.
Feature bodies are defined in the form

(variable operator value dimension)

variable specifies which tilt variables are to be affected. operator should always be += or -=. Note that value is measured in standard deviations, so very large values are inadvisable. dimension is not presently used.