This chapter describes one of the major subsystems developed within CHATR: the use of speech databases as data for making new voices. Given a set of waveform files and phoneme label files, we can automatically build a speech synthesizer voice. The method is designed as a much more general form of the synthesis method pioneered by ATR's Nuutalk system (Nuutalk 92), following the direction described in Campbell 92b.
The chapter describes the basic concept behind a database description as well as the actual details of the commands for declaring a database that CHATR can use, and the options available for tuning unit selection and training. The philosophy behind this method of synthesis is discussed in Campbell 94 (with a larger description in Campbell 95), while details of the selection methods themselves are described in Black 95d and Hunt 96.
This chapter concentrates on giving a full explanation of the CHATR commands used in the definition, declaration and use of a unit database. See section Creating & Training a Speech Synthesizer Database, for information on how to initially build a database.
A database is viewed as a set of ordered strings of units. For our purposes the units are always phonemes. Actually they need not necessarily be classical phonemes but the symbols do need to be declared as a phoneme set. Each unit is declared with a number of `features'. Features can be of any of four types
str
int
flt
cat
Any fields may exist but four are mandatory: a file-id; a phoneme name; a start position in milliseconds; and a duration in milliseconds. Other fields are possible, such as pitch, power, accentedness, phase of the moon, etc.
A database unit description consists of the field declarations and the units themselves.
Field declarations identify the type of each field in a unit description. The first three are mandatory. An example is
((name str)
 (start int)
 (duration int)
 (dur_z flt)
 (pitch flt)
 (pitch_z flt)
 (voice flt)
 (voice_z flt)
 (power flt)
 (power_z flt)
 (accent (high low none))
 (filenumber int))
Note that accent, a category feature, is declared with a set of discrete values. Also, name is special: although declared as a string, it is actually discrete, consisting of a subset of the phonemes within the database's declared phoneme set.
The unit descriptions themselves are split into groups, a group for each file in the database. Units within a group are taken to be contiguous. The general format is
(file-id (unit~0 start~0 duration~0 ...) (unit~1 start~1 duration~1 ...) (unit~2 start~2 duration~2 ...) ... )
The file-id will be described below. It should not be a full pathname but only the short identifier for the file. This is so that the database may be moved to (or accessed from) a different place in the file system without changing the unit description file. If units are in the same file but are not contiguous, they should be specified in a separate group with the same file-id (and appropriate start times).
A unit database description is declared using the Database Units command. It is followed by a filename in which the compiled index built from the unit description will be saved. Also, an optional phoneme set for the units may be specified. If no phoneme set is explicitly given, the current internal phoneme set is used. The phoneme set of the units must be defined before units may be compiled. Unit names must be members of the phoneme set, otherwise an error is given. If some phonemes in the phoneme set are not found in the set of unit descriptions, a warning message is given. A summary of the contents of the database is always given after compilation.
Note that the unit database is compiled straight into a binary file. However, this file is checked at load time and byte swapping performed if required. This means compiled unit files may be used across different architectures.
At compile time, other information may be included if available, such as pitch-marks (see section Pitch Marks) or vector quantization (see section Acoustic Frame Parameters) for each file.
A very small unit database declaration is shown in the following example
(Database Units "index-5.out" mrpa
  ((name str) (start int) (duration int))
  (w0176 (# 0 133) (h 133 50) (i 184 22) (m 206 94) (s 301 127)
         (e 429 103) (l 532 103) (f 636 198))
  (w0252 (# 0 98) (i 98 76) (m 175 61) (p 236 98) (oo 334 136)
         (t 471 68) ( 539 56) (n 595 73) (th 669 120))
  (w0481 (# 0 75) (i 75 59) (n 134 99) (d 234 37) ( 273 63)
         (s 336 96) (th 432 58) (r 492 42) (i 534 136))
  (w0482 (# 0 38) (b 38 60) (i 99 110) (l 210 77) (d 287 35)
         (i 322 93) (ng 417 85))
  (w0352 (# 0 68) (k 68 130) (w 199 29) (e 229 71) (s 301 101)
         (ch 403 57) ( 460 82) (n 543 68))
)
The above units only contain name, start and duration. Any other features could follow them (though the same number of features must appear in each unit description). The order of fields after the first three must match the declaration, but need not be the same across different databases.
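The compile-time checks described above (every unit matches the field declaration, every unit name belongs to the phoneme set, unused phonemes only produce a warning) can be sketched in Python. This is an illustration only: the function `check_units`, the data layout and the tiny stand-in phoneme set are assumptions and not part of CHATR. The two groups are copied from the example declaration.

```python
# Hypothetical sketch of the checks made when a unit description is compiled.
# Nothing here is CHATR code; names and data layout are assumptions.

FIELDS = [("name", "str"), ("start", "int"), ("duration", "int")]

GROUPS = {
    "w0176": [("#", 0, 133), ("h", 133, 50), ("i", 184, 22), ("m", 206, 94),
              ("s", 301, 127), ("e", 429, 103), ("l", 532, 103), ("f", 636, 198)],
    "w0482": [("#", 0, 38), ("b", 38, 60), ("i", 99, 110), ("l", 210, 77),
              ("d", 287, 35), ("i", 322, 93), ("ng", 417, 85)],
}

# Assumed stand-in for a declared phoneme set (a real set is much larger).
PHONEME_SET = {"#", "h", "i", "m", "s", "e", "l", "f", "b", "d", "ng", "zh"}


def check_units(fields, groups, phoneme_set):
    """Return (unit_count, unused_phonemes); raise ValueError on a bad unit."""
    seen = set()
    count = 0
    for file_id, units in groups.items():
        for unit in units:
            if len(unit) != len(fields):
                raise ValueError(f"{file_id}: {unit} does not match the declaration")
            name = unit[0]
            if name not in phoneme_set:
                raise ValueError(f"{file_id}: {name!r} is not in the phoneme set")
            seen.add(name)
            count += 1
    # Phonemes that never occur merit only a warning, not an error.
    return count, sorted(phoneme_set - seen)


count, unused = check_units(FIELDS, GROUPS, PHONEME_SET)
print(count, "units; phonemes without units:", unused)
```

The sketch returns the unused phonemes instead of printing a warning so that a caller can decide how to report them.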
A binary index is created by the Database Units command. In addition to the unit descriptions themselves, pitch marks, acoustic frame parameters (typically cepstrum vq distances, local power and F0 for each file), and a phoneme table with means and standard deviations for pitch, power and duration may also be included. See section Creating & Training a Speech Synthesizer Database, for full details of building a database from waveform files and phoneme label files.
Pitch marks may be created for waveforms in a number of different ways. The format of a pitch marks file is a position in milliseconds (for historical reasons we are stuck with specifying in milliseconds but forced to allow a number of places after the point), and a digit (0 for unvoiced and 1 for voiced). In one method of using pitch marks, unvoiced sections are marked with `fake' pitch marks. The distinction is marked by the digit that follows the position.
Since reading in pitch marks files at run time can take over 10% of the overall time to synthesize an utterance, a method of preloading the pitch marks is offered. If at unit description compile-time the variable udb_pm_table has a non-NIL value, that is taken as the pitch marks. It should consist of a list, with a member for each file in the unit database. Each member should consist of a list of pairs, each consisting of a float (the position of the mark in milliseconds) and a digit (0 for unvoiced, 1 for voiced). Alternatively, udb_pm_table may simply be a list of atoms, where each atom is a path name to a file containing the pitch marks. This second technique is recommended, as much less memory is required in compiling a unit description.
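As an illustration of the pitch mark layout just described, the following Python sketch reads marks given one per line as a millisecond position, whitespace, and a voicing digit. The function name `parse_pitch_marks` and the sample values are assumptions made for this example; it is not the reader used inside CHATR.

```python
# Sketch of reading pitch marks in the layout described above: one mark per
# line, a position in milliseconds, whitespace, then a flag (1 voiced,
# 0 unvoiced). The function name and sample values are assumptions.

def parse_pitch_marks(text):
    """Return a list of (position_ms, voiced) pairs from pitch-mark text."""
    marks = []
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        position, flag = line.split()
        if flag not in ("0", "1"):
            raise ValueError(f"bad voicing flag: {flag!r}")
        marks.append((float(position), flag == "1"))
    return marks


sample = "12.5 1\n20.75 1\n29.0 0\n37.25 0\n"
marks = parse_pitch_marks(sample)
voiced = [pos for pos, is_voiced in marks if is_voiced]
print(len(marks), "marks,", len(voiced), "voiced")
```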
In order to find good join points for units, parameters are included for each frame in the database. Currently three such parameters are normally included: a vector quantization of cepstrum vectors, local power and local F0. A number of variables are used to set the aspects of these parameters.
udb_vq_frame_size
udb_vq_codebook_size
udb_vq_table
filenumber field in unit entries (as with the pitch marks). A member of this list may be a list of quantizations or the name of a file containing one value per line. Using a list of file names is more memory-efficient at index compile time.
udb_vq_distab
udb_vq_frame_params
If the database is to be used with the Generic Selection Strategy, the variable udb_nus_phones should be set to a list. One row is required for each phoneme in the phoneme set. Each entry should consist of: the phone name, the mean duration for that phone, the duration standard deviation, the F0 mean and standard deviation, the voicing mean and standard deviation, and the power mean and standard deviation. An example is
(set udb_nus_phones
  '( (N   80.855  25.717 124.450 30.525 0.900 0.209 6.864 0.701 )
     (PAU 368.030 196.260 120.690 33.982 0.041 0.095 3.490 0.692 )
     (a   90.485  28.937 107.080 40.955 0.795 0.290 6.728 1.004 )
     (b   52.201  17.826 114.320 28.462 0.980 0.106 6.753 0.663 )
     (by  117.500 26.671 107.080 26.716 1.000 0.100 7.008 0.623 )
     (cch 158.330 35.377 138.080 24.945 0.164 0.157 5.329 1.186 )
     ... ))
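The row layout above can be made concrete with a small Python sketch that names each column and uses the mean and standard deviation to normalize a value. Treating normalized fields such as dur_z as z-scores computed from this table is an assumption suggested by the field names, not something this chapter states; `PhoneStats` and `z_score` are hypothetical names.

```python
# Sketch: naming the columns of a phone statistics row and normalizing a
# value. Row layout follows the description above; the z-score use is an
# assumption, and the names here are hypothetical.
from collections import namedtuple

PhoneStats = namedtuple(
    "PhoneStats",
    "name dur_mean dur_sd f0_mean f0_sd voice_mean voice_sd pow_mean pow_sd",
)

ROWS = [
    ("N", 80.855, 25.717, 124.450, 30.525, 0.900, 0.209, 6.864, 0.701),
    ("a", 90.485, 28.937, 107.080, 40.955, 0.795, 0.290, 6.728, 1.004),
]
TABLE = {row[0]: PhoneStats(*row) for row in ROWS}


def z_score(value, mean, sd):
    """Express value in standard deviations from the mean."""
    return (value - mean) / sd


stats = TABLE["a"]
# A duration one standard deviation above the mean for phone a.
dur_z = z_score(stats.dur_mean + stats.dur_sd, stats.dur_mean, stats.dur_sd)
print(round(dur_z, 3))
```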
It is the intention of this model for unit databases that unit descriptions be separated from database use. We wish to allow easy use of multiple databases (even multiple sets of units from the same database). Also we may wish to change the waveform files to different sampling rates or encodings (e.g. ulaw), without changing the unit index.
CHATR has the notion of a unit database. This requires a number of parameters. Multiple unit databases may exist within a single CHATR session and can be easily switched between.
Each database has a name (a string) which can be used to identify the database. There is the notion of a current database. The following commands set values in the current database.
(Database Set Name name)
(Database Set IndexFile filename)
Database Units command.
(Database Set PhonemeSet name)
(Database Set WaveSampleRate rate)
(Database Set WaveFileType ftype)
(Database Set WaveEncoding etype)
(Database Set WaveFileSkeleton string)
printf-format statement (in conjunction with the file-id from a unit) to form a full path-name for the file containing the waveform. The file-id may contain local sub-directory information, i.e. the wave files need not all be in the same directory. A typical example might be as follows, assuming in this case all the wave files are in the same directory. Each file is called `w*.wav' where `*' are numbers. For file-ids of `w0001', `w0002', `w0003', etc. the wave-file skeleton would be
(Database Set WaveFileSkeleton "/usr/home/data/cmu/smalldata/wav/%s.wav")
Another example is where the waveform files are called `C*.wav', but are split over a number of subdirectories. Then we could use file-ids of the form `C01/C01', `C01/C02', etc. and have a wave file skeleton such as
(Database Set WaveFileSkeleton "/usr/home/data/bigdata/wav/%s.wav")
(Database Set PitchMarkFileSkeleton string)
printf-format string containing one `%s'. This is used with the file-id of a unit to find the pitch mark file. Note that the same file-id is used for both the wave file and the pitch-mark file. Hence the pitch-mark files must be in the same directory structure as the wave files, though typically not in the same directory. An example might be
(Database Set PitchMarkFileSkeleton "/usr/home/data/cmu/moredata/pm/%s.pm")
Note that pitch-marks may be preloaded in a database. If this is so, the value must still be set, but is not used. Pitch-mark files are generated from waveform files using various techniques, either pitch tracking (using get_f0 or fz_track) or from EGG files.
(Database Set PitchMarkType string)
fz_track will produce pitch-marks in this form. In both cases the pitch-mark file should contain one pitch-mark per line, consisting of: a float position in the waveform file in milliseconds (yes, it is weird to have a floating point value for milliseconds, but that is what is required), followed by whitespace, followed by a flag: 1 if the position marks a voiced pitch mark, or 0 if it marks an unvoiced (or `fake') pitch mark. When PitchMarkType is set to voice_only, all flags should be 1.
A typical description of a database ready for use is
(Database Set IndexFile "/data/gsw/index/index-5.out")
(Database Set Name "gordon-5")
(Database Set PhonemeSet mrpa)
(Database Set WaveSampleRate 12000)
(Database Set WaveFileType nist)
(Database Set WaveFileSkeleton "/data/gsw/wav/%s.wav")
(Database Set PitchMarkFileSkeleton "/data/gsw/pm/%s.pm")
The settings can be made in any order, but all must be done before a database can be used.
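How a skeleton and a unit's file-id combine into a path can be sketched with Python's printf-style `%` operator. The helper `expand_skeleton` is hypothetical and only mirrors the substitution described for the wave file and pitch mark skeletons; the skeleton strings are taken from the examples in this chapter.

```python
# Sketch of how a printf-style skeleton and a unit's file-id could combine
# into a path. The helper name is hypothetical; the skeletons are from the text.

def expand_skeleton(skeleton, file_id):
    """Substitute file_id for the single %s in skeleton."""
    if skeleton.count("%s") != 1:
        raise ValueError("skeleton must contain exactly one %s")
    return skeleton % file_id


wave_skeleton = "/data/gsw/wav/%s.wav"
pm_skeleton = "/data/gsw/pm/%s.pm"

# The same file-id is used for both the wave file and the pitch-mark file.
wave_path = expand_skeleton(wave_skeleton, "w0176")
pm_path = expand_skeleton(pm_skeleton, "w0176")

# A file-id may carry sub-directory information.
nested_path = expand_skeleton("/usr/home/data/bigdata/wav/%s.wav", "C01/C02")
print(wave_path, pm_path, nested_path)
```

Because the file-id is shared, moving a database only requires changing the skeleton strings, not the unit description file.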
Once declared, a database can be used. Issue the command
(Database Use "gordon-5")
This will check the internal consistency of the declaration and load the unit index. If Database Use is given an argument, it should be a database name. That database will be selected as the current one, and the unit index will be loaded if not already done.
In order to save a database for later reference, it is necessary to use the command
(Database Keep)
This keeps the current database on a list of available databases.
Thus given two databases that are loaded in the system, the following would allow swapping between them
(Database use "gordon-5")
(Say (Synth utt1))
(Database use "sally-5")
(Say (Synth utt1))
At each swap, no new data is required to be loaded. Note however, the above is a little oversimplified. The intonation statistics would probably need to change, as Sally speaks in a different pitch range than Gordon.
Finally, now that a database has been declared, it is necessary to select `UDB' as the synthesis method before it can be used. This is achieved with the command
(Parameter Synth_Method UDB)
The first selection strategy is: starting at the start of the segment stream, find the longest connected stream of units that match. Then repeat this process until the end of the utterance is reached. This doesn't guarantee the fewest breaks, but is a simple and robust strategy. It may be selected using the command
(Database Set Strategy Simple)
Note this strategy respects the values for exclude_list in nus_params.
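The simple strategy can be sketched in Python as a greedy longest-match loop. This is a minimal illustration under assumptions: the database is reduced to lists of unit names per file, the function name `select_simple` is hypothetical, ties go to the first run found, and exclude_list is not modelled.

```python
# Sketch of the simple strategy: repeatedly take the longest run of
# contiguous database units matching the next target segments.
# Data layout and function name are assumptions for illustration.

def select_simple(target, groups):
    """Return a list of (file_id, start_index, length) runs covering target."""
    runs = []
    pos = 0
    while pos < len(target):
        best = None  # (length, file_id, start_index)
        for file_id, names in groups.items():
            for start in range(len(names)):
                length = 0
                while (pos + length < len(target)
                       and start + length < len(names)
                       and names[start + length] == target[pos + length]):
                    length += 1
                if length and (best is None or length > best[0]):
                    best = (length, file_id, start)
        if best is None:
            raise LookupError(f"no unit for {target[pos]!r}")
        length, file_id, start = best
        runs.append((file_id, start, length))
        pos += length
    return runs


GROUPS = {
    "w0176": ["#", "h", "i", "m", "s", "e", "l", "f"],
    "w0482": ["#", "b", "i", "l", "d", "i", "ng"],
}
runs = select_simple(["h", "i", "l", "d", "i", "m"], GROUPS)
print(runs)
```

Each run after the first corresponds to a break (a join between non-contiguous units), which is why a greedy choice can end up with more breaks than strictly necessary.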
This strategy can be useful if the original phoneme string is desired, for example, in testing duration and pitch modification by PS_PSOLA.
The selection algorithm can be effectively by-passed by specifying which units have to be chosen. This is used typically when an existing selection is to be changed in a minor way.
An example use of this form is given in `$CHATR_ROOT/lib/data/resynth.ch'.
This strategy allows any features in the database entry and also allows arbitrary distance functions to be defined for those features. High quality selection is possible based on minimizing both distance between target segment and selected unit, and distance between selected unit and previous selected unit, i.e. the join. The method is fully described in Black 95d and Hunt 96.
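The idea of minimizing target distance and join distance together can be sketched as a dynamic programming search in Python. This is a generic Viterbi-style illustration, not the algorithm as implemented in CHATR (see Black 95d and Hunt 96 for that); the unit tuples, the toy cost functions and the name `select_units` are all assumptions, and pruning such as a beam width is left out.

```python
# Sketch of selection by minimizing target cost plus join cost with dynamic
# programming. The cost functions below are toy assumptions, not CHATR's.

def select_units(targets, candidates, target_cost, join_cost):
    """Pick one candidate per target minimizing total target + join cost.

    candidates[i] is the list of candidate units for targets[i].
    """
    # best[i][j] = (cost of best path ending in candidates[i][j], back pointer)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(p, c), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(targets[i], c), back))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path)), total


# Units are (file_id, index, dur_z); targets are desired dur_z values.
def toy_target_cost(target, unit):
    return abs(target - unit[2])


def toy_join_cost(prev, unit):
    # Units that were adjacent in the same file join for free.
    return 0.0 if prev[0] == unit[0] and prev[1] + 1 == unit[1] else 1.0


targets = [0.0, 0.0]
candidates = [
    [("a", 0, 0.1), ("b", 0, 0.0)],
    [("a", 1, 0.2), ("c", 5, 0.0)],
]
path, total = select_units(targets, candidates, toy_target_cost, toy_join_cost)
print(path, total)
```

In this toy case the pair of contiguous units wins even though each is a slightly worse match to its target, because avoiding the join outweighs the target mismatch.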
Distance functions for measuring the match of a target to database candidates may be defined externally. Although not fully general, they cover most of the types of distance function we will probably need for some time. Except for one place, all distances are defined in Lisp and hence require no change to the C code.
A distance function definition consists of 5 parts
name
offset
fieldname
preprocessor function
distance function
abs
sqr
eql
For example, the following defines three distance functions for the previous duration, current duration and next duration
(Database DistFunc p_duration -1 dur_z ident abs)
(Database DistFunc duration 0 dur_z ident abs)
(Database DistFunc n_duration 1 dur_z ident abs)
A more complex example defines a distance table for a field named accent whose values are `(high low none)'
(Database DistFunc accent -1 accent ident
  ((0.0 0.5 1.0)
   (0.5 0.0 1.0)
   (1.0 1.0 1.0)))
Specifically, this allows distances for discrete features (or derived discrete features) to be specified externally, and hence makes it easy to use external procedures to train distances.
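To make the five parts of a distance function definition concrete, here is a hedged Python sketch of how such a definition might be evaluated. The tuple layout, the `evaluate` helper, the handling of a missing neighbour, and the reading of the table as indexed by (target value, candidate value) are assumptions made for illustration; only the definitions themselves come from the examples above.

```python
# Sketch of evaluating a distance function with the five parts listed above:
# name, offset, field name, preprocessor and distance. Illustrative only.

ACCENT_VALUES = ["high", "low", "none"]
ACCENT_TABLE = [[0.0, 0.5, 1.0],
                [0.5, 0.0, 1.0],
                [1.0, 1.0, 1.0]]


def ident(value):
    """Preprocessor that leaves the field value unchanged."""
    return value


def abs_distance(a, b):
    return abs(a - b)


def make_table_distance(values, table):
    """Distance looked up by position of each value in the discrete type."""
    index = {v: i for i, v in enumerate(values)}
    return lambda a, b: table[index[a]][index[b]]


def evaluate(distfunc, targets, t_pos, units, u_pos):
    """Apply one distance function to a target/candidate pair of positions."""
    name, offset, field, pre, dist = distfunc
    t, u = t_pos + offset, u_pos + offset
    if not (0 <= t < len(targets) and 0 <= u < len(units)):
        return 0.0  # assumption: a missing neighbour contributes nothing
    return dist(pre(targets[t][field]), pre(units[u][field]))


p_duration = ("p_duration", -1, "dur_z", ident, abs_distance)
duration = ("duration", 0, "dur_z", ident, abs_distance)
accent = ("accent", -1, "accent", ident,
          make_table_distance(ACCENT_VALUES, ACCENT_TABLE))

targets = [{"dur_z": 0.5, "accent": "high"}, {"dur_z": -0.2, "accent": "none"}]
units = [{"dur_z": 1.0, "accent": "low"}, {"dur_z": 0.0, "accent": "none"}]

d_prev = evaluate(p_duration, targets, 1, units, 1)
d_cur = evaluate(duration, targets, 1, units, 1)
d_acc = evaluate(accent, targets, 1, units, 1)
print(d_prev, d_cur, d_acc)
```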
Each database may have a number of discrete types defined. By default all category type fields in a database are automatically defined as discrete types. In addition, a discrete called phone is defined, whose type consists of all unit names in the database. Other discretes may be defined by the following command
(Database DefDiscrete name (value~0 value~1 value~2 ... ))
Also, again automatically, discrete types are created for fields in the phoneme set definition, as if the following were executed for each database
(Database DefDiscrete vc (+ -))
(Database DefDiscrete v_length (a s l d g -))
(Database DefDiscrete v_height (1 2 3 0))  ;; 1=high 2=mid 3=low
(Database DefDiscrete v_front (1 2 3 0))   ;; 1=front 2=mid 3=back
(Database DefDiscrete v_rnd (+ -))
(Database DefDiscrete c_type (s f a n l c 0))
(Database DefDiscrete c_place (l a p b d v 0))
(Database DefDiscrete c_vox (+ -))
The names are, unfortunately, again due to history and hence not as meaningful as they could be. Note that the order of values in a discrete definition is significant, and the distance tables in a distance function definition must reflect that order.
Discrete maps can be defined between any two defined discrete types. This allows values to be derived from existing fields in the database. Again, a number of mappings are pre-defined for any database, such as
ph_vc
ph_length
ph_height
ph_front
ph_rnd
ph_c_type
ph_c_place
ph_c_vox
Other database-specific maps may be explicitly specified using the command
(Database DefMap name from-type to-type (pair~0, pair~1, pair~2 ... ))
A pair must appear for all values of from-type, though the order in which the pairs are specified is not significant. For example, suppose we have a discrete unit field entry called tobi_accent which takes the values
(H* L* L+H* L*+H !H* none)
We can define a discrete map to reduce these values to three (high low and none) and then define a distance function for reduced values. For example
(Database DefDiscrete accent3 (high low none))
(Database DefMap tobi_reduce tobi_accent accent3
  ((H* high) (L* low) (L+H* low) (L*+H low) (!H* high) (none none)))
(Database DistFunc raccent 0 tobi_accent tobi_reduce
  ((0.0 0.5 1.0)
   (0.5 0.0 1.0)
   (1.0 1.0 1.0)))
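The behaviour of a discrete map used as a preprocessor can be sketched in Python. The helper `def_map` and the function `raccent` are hypothetical stand-ins that mirror the definitions above: the map must give a pair for every value of the from-type, and the distance table is indexed in the order the reduced values were declared.

```python
# Sketch of a discrete map that reduces one discrete type to another, with
# the check that every source value is covered. Names mirror the example
# above; the helper functions themselves are assumptions.

TOBI_ACCENT = ["H*", "L*", "L+H*", "L*+H", "!H*", "none"]
ACCENT3 = ["high", "low", "none"]
TOBI_REDUCE = [("H*", "high"), ("L*", "low"), ("L+H*", "low"),
               ("L*+H", "low"), ("!H*", "high"), ("none", "none")]
TABLE = [[0.0, 0.5, 1.0],
         [0.5, 0.0, 1.0],
         [1.0, 1.0, 1.0]]


def def_map(from_values, to_values, pairs):
    """Build a mapping, insisting on a pair for every from-type value."""
    mapping = dict(pairs)
    missing = [v for v in from_values if v not in mapping]
    if missing:
        raise ValueError(f"no pair for: {missing}")
    bad = [t for t in mapping.values() if t not in to_values]
    if bad:
        raise ValueError(f"not in to-type: {bad}")
    return mapping


reduce_accent = def_map(TOBI_ACCENT, ACCENT3, TOBI_REDUCE)


def raccent(target_accent, unit_accent):
    """Distance between two accents after reduction to (high low none)."""
    i = ACCENT3.index(reduce_accent[target_accent])
    j = ACCENT3.index(reduce_accent[unit_accent])
    return TABLE[i][j]


print(raccent("L+H*", "!H*"), raccent("H*", "!H*"))
```

Reordering the values of accent3 without reordering the table would silently change the distances, which is why the order of values in a discrete definition matters.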
For each distance function, a weight may be specified for its contribution to the distance between a target unit and a candidate unit. Weights are specified using the Database Set Weights command. The value following may be either Clear, to remove all existing weights, or a list of weights for a phoneme, a set of phonemes, or any. The specific weights are set first (in order of specification), then any remaining phonemes which do not have weights are given the weights specified by any. A weights specification consists of a list of sets. Examples may be found in any database directory in `$SPEAKER_ROOT/index/DBNAME_weights*.ch'. Note that weights can only be specified for defined distance functions.
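The order in which weights are resolved can be sketched in Python. The layout of `spec` below is purely hypothetical (the real syntax is shown in the weights files mentioned above, not here), as is the function `resolve_weights`; the sketch also assumes that a later specific entry overrides an earlier one for the same phoneme. What it illustrates is only the rule that specific weights are applied first and any fills in the remaining phonemes.

```python
# Sketch of resolving per-phoneme weights: specific entries first, in order,
# then "any" for phonemes still without weights. The spec layout is a
# hypothetical stand-in, not CHATR's weights syntax.

def resolve_weights(spec, phoneme_set, defined_distfuncs):
    """Return {phoneme: {distance_function: weight}}."""
    weights = {}
    default = None
    for phones, w in spec:
        unknown = [name for name in w if name not in defined_distfuncs]
        if unknown:
            raise ValueError(f"weights for undefined distance functions: {unknown}")
        if phones == "any":
            default = w
            continue
        for p in ([phones] if isinstance(phones, str) else phones):
            weights[p] = dict(w)  # assumption: later entries override earlier
    if default is not None:
        for p in phoneme_set:
            weights.setdefault(p, dict(default))
    return weights


spec = [
    ("any", {"duration": 1.0, "accent": 0.5}),
    (["a", "i"], {"duration": 2.0, "accent": 0.5}),
    ("a", {"duration": 3.0, "accent": 0.0}),
]
weights = resolve_weights(spec, ["a", "i", "b"], {"duration", "accent"})
print(weights)
```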
There are other specific parameters to the generic unit selection strategy which may be set through the variable nus_params. These allow setting of beam width and weighting for various parameters in the search process. A full setting might be
cand_width
beam_width
exclude_list
join_wt
unit_wt
cand_thresh
vq_wt
vq_vq_wt
vq_pow_wt
vq_f0_wt
See section Creating & Training a Speech Synthesizer Database, for details on how to actually build a database.