A number of different intonation theories have been implemented within CHATR. By `intonation theory' we mean some symbolic representation (possibly with continuous parameters) that can be used to generate an underlying F0 in synthesis.
There are other ways to specify F0 within CHATR apart from an intonation system, such as by specifying values for frames throughout the utterance.
In an intonation system, information is contained within the `Intone' stream. This is primarily related to syllables in English and morae in Japanese. Intonation parameters must be of the same type for a whole utterance. They may be specified directly in some input methods, or predicted by some higher-level part of CHATR, typically the HLP rules.
The HLCB intonation method is selected using the command
(Parameter Int_Method CSTR)
This was the first intonation system to be implemented within CHATR and hence is both the simplest and probably the most stable. It is, however, rather limited. The work is based on that described in Taylor 92. Basically, syllables may be marked with one of four elements: H (high), L (low), C (continuation), or B (boundary). In addition, these elements may be followed by features. The features (like the elements) may be individually defined, but in our examples the defined features are

H  early, late, downstep
L  early, late, downstep
C  rise
B  initial
Elements and features define values and modifications of values for a fixed number of continuous parameters. They are used in the prediction of the RFC (rise/fall/connection) description, a lower-level, more explicit representation of the F0 contour. These definitions may be tuned for a particular speaker's pitch range.
Definitions should be made using the Stats Intonation
command.
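Purely as an illustration, such a definition might look something like the Tilt speaker table shown later in this section. Everything below apart from the element names H, L, C and B is an assumption: the rfc space name, the parameter names and the numeric values are placeholders, not a working speaker definition:

;; Placeholder sketch only -- not taken from a real speaker file.
(Stats Intonation
 ( (Element H (def rfc H) ( (amp = 40 Hz) (dur = 250 mS) ))
   (Element L (def rfc L) ( (amp = -20 Hz) (dur = 250 mS) ))
   (Element C (def rfc C) ( (amp = 0 Hz) ))
   (Element B (def rfc B) ( (amp = -10 Hz) ))
 ))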
The ToBI intonation method is selected using the command
(Parameter Int_Method ToBI)
This is an implementation of the English ToBI system described in Silverman 92. As with the other intonation systems included within CHATR, it consists of three sub-parts:
Stage 1 is performed before any duration information is available. This is because duration prediction methods need to know accent information, so accents and tones must be predicted before the duration module is run.
Stage 2 is called after durations have been predicted and hence can deal with absolute positioning.
ToBI parameters are set using the variable ToBI_params. This should be a Lisp a-list of name and value pairs. Currently supported names are

pitch_accents
  (pitch_accents H* !H* L* L+H* L*+H L+!H* H+!H* HiF0 X*? *?)
  A subset of these is also acceptable. For reference, the above are presently the actual accents that appear in speaker f2b of the Boston University Radio News corpus. (See Ostendorf 95.)
phrase_accents
  (phrase_accents H- L-)
boundary_tones
  (boundary_tones H-H% L-H% L-L% H-L%)
target_method
  Two target prediction methods are available. The first is APL (see Anderson 84), which predicts target values for syllables that are accented or toned. The second is LR, which uses linear regression to predict start, mid-vowel, and end target points for all syllables. APL is the default; it uses the large number of parameters defined below to tune the predicted value. The results of LR are closer to the natural F0, but at the cost of not being as general. The database building mechanism uses LR. An LR model may be mapped to a different speaker's pitch range using the following two parameters:
target_f0mean
  The mean F0 of the speaker to whose pitch range the LR F0 model should be mapped.
target_f0std
  The F0 standard deviation of that speaker; used together with target_f0mean.
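Putting the names introduced so far together, a ToBI_params a-list might look like the following. The accent, phrase accent and boundary tone inventories are the defaults listed above; the target_f0mean and target_f0std values are placeholders chosen purely for illustration:

((pitch_accents H* !H* L* L+H* L*+H L+!H* H+!H* HiF0 X*? *?)
 (phrase_accents H- L-)
 (boundary_tones H-H% L-H% L-L% H-L%)
 (target_method APL)
 (target_f0mean 170)
 (target_f0std 30))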
If the target method is LR, a list of three linear regression models should be set in the variable tobi_lrf0_model. These predict the start, mid-vowel and end values for a syllable. The feature name-weight `pairs' may optionally have a third argument specifying a feature map. Feature maps allow category-valued features to be mapped to binary ones: if the value returned by a feature is in the named feature map, the value is 1, otherwise it is 0. Example linear regression models can be found in the files `$CHATR_ROOT/lib/data/f2b_lrf0.ch' for English and `$CHATR_ROOT/lib/data/mht_lrf0.ch' for Japanese (JToBI).
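To make that format concrete, the sketch below shows what one of the three models (say, the start-value model) could contain. The feature names, the weights and the intercept entry are all hypothetical; only the shape of each entry, a feature name, a weight and an optional feature map, comes from the description above. See `$CHATR_ROOT/lib/data/f2b_lrf0.ch' for a real model:

( (Intercept 160.0)                 ; hypothetical intercept entry
  (syl_accent 12.5 (H* !H* L+H*))   ; hypothetical feature with a feature map
  (syl_position -8.0) )             ; hypothetical feature without a map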
The following parameters are only used if the target_method
is
APL
. Currently no mechanism is available to automatically
tune these parameters.
topval
  refval for maximum sized accents. Speaker-dependent.
baseval
  refval for minimum sized accents. Speaker-dependent.
refval
h1
  Value by which topval is multiplied to position the step before H accents. Speaker-dependent in some cases.
l1
  Value by which baseval is multiplied to position the step before L accents. Speaker-dependent in some cases.
prom1
  Value by which topval is multiplied to position the top of H accents.
prom2
  Value by which topval or baseval is multiplied to position the top of !H accents, H accents in compound accents, and H and L in phrase accents.
prom3
  Value by which topval or baseval is multiplied to position the end of phrase accents.
HiF0_factor
decline_range
hamwin_size
The actual method used in the implementation was strongly influenced by example code (incomplete) from AT&T Bell Labs, with significant input from Mary Beckman. Hence it follows their model (and parameter names) very closely. The APL technique is also described in Anderson 84.
The JToBI (Japanese-ToBI) intonation method is selected using the command
(Parameter Int_Method JToBI)
This is an implementation (in conjunction with Mary Beckman) of the work described in Pierrehumbert 88b.
Parameters may be set using the variable mb_params
.
Although many parameters are available for controlling the prediction of F0 target points, the same linear regression method used by the English ToBI system produces better results and, more importantly, can be trained.
A linear regression model consists of three separate models for predicting the start, mid-vowel and end target points for a syllable. A fourth item in the variable tobi_lrf0_model is the source mean F0 and standard deviation, which allows F0 pitch mapping between speakers. The format is exactly the same as that used for the English ToBI. An example JToBI LR model can be found in the file `$CHATR_ROOT/lib/data/mht_lrf0.ch'.
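Schematically, then, tobi_lrf0_model holds the three LR models followed by a fourth item giving the source speaker's mean F0 and standard deviation. The one-entry models, the feature name and all the numbers below are placeholders, and the exact layout of the fourth item is an assumption; see `$CHATR_ROOT/lib/data/mht_lrf0.ch' for the real thing:

( ( (hyp_feature 10.0) )   ; start model (placeholder)
  ( (hyp_feature 8.0) )    ; mid-vowel model (placeholder)
  ( (hyp_feature 6.0) )    ; end model (placeholder)
  (120 25) )               ; source mean F0 and standard deviation (placeholders)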
The Fujisaki intonation method is selected using the command
(Parameter Int_Method Fujisaki)
An implementation of the Fujisaki model (Fujisaki 83) is available for Japanese. It is still experimental, but does produce F0 contours. Parameters are set using the variable fujisaki_model. Details of the parameters and their values may be determined by looking at the actual code in the file `$CHATR_ROOT/src/intonation/fujisaki.c'.
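For reference, the standard formulation of the model (as described in Fujisaki 83) superimposes phrase and accent components on a baseline in the log-F0 domain. How these quantities map onto the entries of fujisaki_model is not documented here and should be checked against fujisaki.c:

\ln F_0(t) = \ln F_b
             + \sum_{i=1}^{I} A_{pi}\, G_p(t - T_{0i})
             + \sum_{j=1}^{J} A_{aj}\, \bigl[ G_a(t - T_{1j}) - G_a(t - T_{2j}) \bigr]

G_p(t) = \alpha^2 t\, e^{-\alpha t}                                  \quad (t \ge 0,\ \text{else } 0)
G_a(t) = \min\bigl[ 1 - (1 + \beta t) e^{-\beta t},\ \gamma \bigr]   \quad (t \ge 0,\ \text{else } 0)

Here F_b is the baseline F0, A_{pi} and T_{0i} are the magnitudes and times of the phrase commands, A_{aj}, T_{1j} and T_{2j} are the amplitudes, onsets and offsets of the accent commands, and \alpha, \beta and \gamma are constants of the phrase and accent control mechanisms.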
The Tilt intonation method is selected using the command
(Parameter Int_Method Tilt)
Using the work described in Taylor 93b, this model offers a labeling system which may be automatically derived from waveforms or phoneme labels.
The most difficult part about adding a new speaker is labeling the data. Once the data is in the form that CHATR requires, everything else is simple.
CHATR requires a syllable utterance type description for each utterance. This comprises a list of phrases, each with a start F0. Within each phrase is a list of syllables, each of which may have one or more events marked. An example is given below
(Utterance
 (Syllable (space rfc) (format feature) (dimen num))
 (
  (:C ()
      ((hh 60) (eh 65) ((E)))
      ((l 33) (ow 207) ()))
  (:C ()
      ((dh 27) (ih 56) ((E)))
      ((s 75) (ih 56) (z 44) ())
      ((dh 42) (ax 36) ())
      ((k 95) (aa 129) (n 44) ((E)))
      ((f 77) (r 36) (en 57) ())
      ((s 77) (ao 156) ())
      ((f 83) (eh 105) (s 203) ()))
 ))
In this type of description, only the presence of an event need be marked.
In addition, an RFC input description is required. An example is given below
(Utterance RFC
 ((sil 303 ((sil 0 166)))
  (hh 60 ())
  (eh 65 ((fall 21 166)))
  (l 33 ())
  (ow 207 ((conn 67 125) (sil 197 120)))
  (sil 155 ())
  (dh 27 ((rise 0 149)))
  (ih 56 ())
  (s 75 ((fall 60 173)))
  (ih 56 ())
  (z 44 ())
  (dh 42 ((conn 4 151)))
  (ax 36 ())
  (k 95 ())
  (aa 129 ())
  (n 44 ((fall 5 142)))
  (f 77 ())
  (r 36 ())
  (en 57 ())
  (s 77 ((conn 74 95)))
  (ao 156 ())
  (f 83 ())
  (eh 105 ())
  (s 203 ((sil 91 91)))
  (sil 524 ())))
The CHATR user function train_input
takes these two
utterance descriptions and produces a syllable description in the RFC
event space
(Utterance
 (Syllable (space rfc) (format num) (dimen linear))
 (
  (:C ((Start 166))
      ((hh 60) (eh 65) ((C 0.00) (E 0.00 0.00 -41.00 144.00 21.00)))
      ((l 33) (ow 207) ((C 0.00))))
  (:C ((Start 149))
      ((dh 27) (ih 56) ((C 0.00) (E 24.00 143.00 -22.00 119.00 116.00)))
      ((s 75) (ih 56) (z 44) ((C -9.00)))
      ((dh 42) (ax 36) ())
      ((k 95) (aa 129) (n 44) ((E 0.00 0.00 -47.00 283.00 134.00)))
      ((f 77) (r 36) (en 57) ((C -4.00)))
      ((s 77) (ao 156) ())
      ((f 83) (eh 105) (s 203) ()))
 ))
Next, the function Rfc_to_Tilt is called, which transforms this into tilt space. With a sufficient number of utterances in tilt space, statistics can be collected on each of the four tilt parameters and the phrase start F0 parameter. The means and standard deviations need to be calculated, which can be done using S or any other utility. The tilt descriptions can be derived from the utterance file. Alternatively, the Int_Stats function returns a list of all the events or connections (or both) for an utterance; by calling it with mapc, one can obtain the statistics for a whole database.
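The sketch below illustrates that last step. It assumes that Int_Stats takes a single utterance as its argument and that CHATR's Lisp provides lambda and a list of loaded utterances (here called utterance_list); none of these details are documented above, so treat it purely as an outline of the idea:

;; Outline only: Int_Stats's exact arguments and the existence of
;; utterance_list are assumptions, not documented behaviour.
(mapc
 (lambda (utt) (Int_Stats utt))   ; gather event/connection values per utterance
 utterance_list)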
Once the statistics have been collected, a speaker table can be constructed by entering the mean and standard deviations in the appropriate places. A typical speaker file is given below
(Stats Intonation
 ( (Element E (def tilt E)
    ( (amp = 47 Hz) (dur = 291 mS) (tilt = 0.0 rel) (peak_pos = 59 mS) ))
   (Element E (var tilt E)
    ( (amp = 31 Hz) (dur = 141 mS) (tilt = 0.75 rel) (peak_pos = 136 mS) ))
   (Element C (def any C) ( (amp = 0.0 Hz) ))
   (Element C (var any C) ( (amp = 10 Hz) ))
   (Element P (def any P) ( (amp = 151.0 Hz) ))
   (Element P (var any P) ( (amp = 20 Hz) ))
 ))
A little care needs to be taken here as the system will accept inappropriate feature sets but become confused by them.
An example feature set is
(Stats Intonation
 ( (Feature rise (binary tilt C) ( (amp += 10 rel) ))
   (Feature fall (binary tilt C) ( (amp -= 10 rel) ))
   (Feature amp (scalar tilt E) ( (amp += 1 rel) (dur += 1 rel) ))
   (Feature early (binary tilt E) ( (peak_pos -= 1.1 rel) ))
   (Feature late (binary tilt E) ( (peak_pos += 1.1 rel) ))
   (Feature rise (binary tilt E) ( (tilt += +1 rel) ))
   (Feature fall (binary tilt E) ( (tilt += -1 rel) ))
 ))
Feature headers are defined in the form

<name> (<type> <space> <element>)

The name need only be unique within the space and element, so connection and event features can have the same name without confusion. type refers to scalar or binary. space refers to rfc or tilt, though presently only tilt is fully implemented. element refers to whether the feature operates on an event or a connection.
Feature bodies are defined in the form

(variable operator value dimension)

variable specifies which tilt variables are to be affected. operator should always be += or -=. Note that value is measured in standard deviations, so very large values are inadvisable. dimension is not presently used.