Unit Databases

This chapter describes one of the major subsystems developed within CHATR: the use of speech databases as data for making new voices. Given a set of waveform files and phoneme label files, we can automatically build a speech synthesizer voice. The method is designed as a much more general form of the synthesis method pioneered by ATR's Nuutalk system (Nuutalk 92), following the direction described in Campbell 92b.

The chapter describes the basic concept behind a database description as well as the actual details of the commands for declaring a database that CHATR can use, and the options available for tuning unit selection and training. The philosophy behind this method of synthesis is discussed in Campbell 94 (with a larger description in Campbell 95), while details of the selection methods themselves are described in Black 95d and Hunt 96.

This chapter concentrates on giving a full explanation of the CHATR commands used in the definition, declaration and use of a unit database. See section Creating & Training a Speech Synthesizer Database, for information on how to initially build a database.

Unit Descriptions

A database is viewed as a set of ordered strings of units. For our purposes the units are always phonemes. They need not be classical phonemes, but the symbols do need to be declared as a phoneme set. Each unit is declared with a number of `features'. Features can be of any of four types:

str: Any character string.
int: Any integer value.
flt: Any float value.
cat: Any set of discrete tokens.

Any fields may exist but four are mandatory: a file-id; a phoneme name; a start position in milliseconds; and a duration in milliseconds. Other fields are possible, such as pitch, power, accentedness, phase of the moon, etc.

A database unit description consists of the field declarations and the units themselves.

Field declarations identify the type of each field in a unit description. The first three (name, start and duration) are mandatory. An example is

     ((name str)
      (start int)
      (duration int)
      (dur_z flt)
      (pitch flt)
      (pitch_z flt)
      (voice flt)
      (voice_z flt)
      (power flt)
      (power_z flt)
      (accent (high low none))
      (filenumber int))

Note that accent, a category feature, is declared with a set of discrete values. Also, name is special: although declared as a string, it is actually discrete, consisting of a subset of the phonemes within the database's declared phoneme set.

The unit descriptions themselves are split into groups, a group for each file in the database. Units within a group are taken to be contiguous. The general format is

     (file-id
             (unit~0 start~0 duration~0 ...)
             (unit~1 start~1 duration~1 ...)
             (unit~2 start~2 duration~2 ...)
             ...
     )

The file-id will be described below. It should not be a full pathname but only the short identifier for the file. This is so that the database may be moved to (or accessed from) a different place in the file system without changing the unit description file. If units are in the same file but are not contiguous, they should be specified in a separate group with the same file-id (and appropriate start times).
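As an illustration of the group layout above, the following Python sketch renders one file's units as a group. It is not part of CHATR; the helper name `format_group` and the tuple layout of its input are hypothetical.

```python
def format_group(file_id, units):
    """Render one file's contiguous units as a unit description group.

    `units` is a list of (name, start_ms, duration_ms, *extra) tuples;
    every tuple is expected to carry the same number of fields.
    """
    lines = ["(" + file_id]
    for unit in units:
        fields = " ".join(str(field) for field in unit)
        lines.append("    (" + fields + ")")
    lines.append(")")
    return "\n".join(lines)


group = format_group("w0176", [("#", 0, 133), ("h", 133, 50), ("i", 184, 22)])
print(group)
```

Non-contiguous stretches of the same file would simply be passed as separate calls with the same file-id.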

A unit database description is declared using the Database Units command. The command name is followed by a filename in which the compiled index built from the unit description will be saved. An optional phoneme set for the units may also be specified. If no phoneme set is explicitly given, the current internal phoneme set is used. The phoneme set of the units must be defined before units may be compiled. Each unit name must be a member of the phoneme set, otherwise an error is given. If not all phonemes in the phoneme set are found in the set of unit descriptions, a warning message is given. A summary of the contents of the database is always given after compilation.

Note that the unit database is compiled straight into a binary file. However, this file is checked at load time and byte swapping is performed if required. This means compiled unit files may be used across different architectures.
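The general idea of checking byte order at load time can be sketched in Python as follows. This is only an illustration of the technique: the magic number and the layout (a header word followed by 32-bit integers) are invented for the example and do not describe CHATR's actual index format.

```python
import struct

MAGIC = 0x55444201  # hypothetical marker, not CHATR's real header value


def read_ints(blob):
    """Decode 32-bit unsigned ints, detecting byte order from the marker."""
    for order in ("<", ">"):  # try little-endian, then big-endian
        (magic,) = struct.unpack_from(order + "I", blob, 0)
        if magic == MAGIC:
            count = (len(blob) - 4) // 4
            return list(struct.unpack_from(order + "%dI" % count, blob, 4))
    raise ValueError("not a recognised index file")


# The same values written on a big-endian and a little-endian machine.
big = struct.pack(">4I", MAGIC, 133, 50, 184)
little = struct.pack("<4I", MAGIC, 133, 50, 184)
print(read_ints(big), read_ints(little))
```

Either file decodes to the same values, which is the property that lets a compiled index be shared between architectures.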

At compile time, other information may be included if available, such as pitch-marks (see section Pitch Marks) or vector quantization (see section Acoustic Frame Parameters) for each file.

A very small unit database declaration is shown in the following example

     (Database Units "index-5.out" mrpa
       ((name str) 
        (start int)
        (duration int))
       (w0176 (# 0 133) (h 133 50) (i 184 22) (m 206 94) (s 301 127)
          (e 429 103) (l 532 103) (f 636 198))
       (w0252 (# 0 98) (i 98 76) (m 175 61) (p 236 98) (oo 334 136)
          (t 471 68) (@ 539 56) (n 595 73) (th 669 120))
       (w0481 (# 0 75) (i 75 59) (n 134 99) (d 234 37) (@ 273 63)
          (s 336 96) (th 432 58) (r 492 42) (i 534 136))
       (w0482 (# 0 38) (b 38 60) (i 99 110) (l 210 77) (d 287 35)
          (i 322 93) (ng 417 85))
       (w0352 (# 0 68) (k 68 130) (w 199 29) (e 229 71) (s 301 101)
          (ch 403 57) (@ 460 82) (n 543 68))
     )

The above units contain only name, start and duration. Any other features could follow them (though the same number of features must appear in each unit description). The order of fields after the first three must match the declaration, but need not be the same across different databases.

Compiling a Binary Index

A binary index is created by the Database Units command. In addition to the unit descriptions themselves, the index may also include pitch marks, acoustic frame parameters (typically cepstrum vq distances, local power and F0 for each file), and a phoneme table with means and standard deviations for pitch, power and duration. See section Creating & Training a Speech Synthesizer Database, for full details of building a database from waveform files and phoneme label files.

Pitch Marks

Pitch marks may be created for waveforms in a number of different ways. Each line of a pitch mark file gives a position in milliseconds (for historical reasons we are stuck with specifying positions in milliseconds, but are forced to allow a number of places after the decimal point) and a digit (0 for unvoiced and 1 for voiced). In one method of using pitch marks, unvoiced sections are marked with `fake' pitch marks; the digit distinguishes these from real marks.

Since reading in pitch mark files at run time can take over 10% of the overall time to synthesize an utterance, a method of preloading the pitch marks is offered. If at unit description compile time the variable udb_pm_table has a non-NIL value, that value is taken as the pitch marks. It should consist of a list with a member for each file in the unit database. Each member should be a list of pairs, each consisting of a float (the position of the mark in milliseconds) and a digit (0 for unvoiced, 1 for voiced).

Alternatively, udb_pm_table may simply be a list of atoms, each atom being the path name of a file containing the pitch marks. This second technique is recommended, as much less memory is required when compiling a unit description.

Acoustic Frame Parameters

In order to find good join points for units, parameters are included for each frame in the database. Currently three such parameters are normally included: a vector quantization of cepstrum vectors, local power and local F0. A number of variables are used to set the aspects of these parameters.

udb_vq_frame_size: Specifies the frame advance in milliseconds (typically 10 milliseconds).
udb_vq_codebook_size: Number of clusters. As vqs are encoded as single bytes this must be less than 256. So far 128 is the most common value.
udb_vq_table: The quantization itself. A list with a member for each file in the database. The order must match the filenumber field in unit entries (as with the pitch marks). A member in this list may be a list of quantizations or the name of a file containing one value per line. Using a list of file names is more memory-efficient at index compile time.
udb_vq_distab: A table of Euclidean distances between each quantized number. This table should be a list of lists, `n' by `n' (where `n' is the number of clusters). Each distance is given as a number between 0 and 1.
udb_vq_frame_params: The number of parameters in each frame. First is cepstrum vq, second is power (quantized) and third is F0 (zeroed for unvoiced).
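A table of the shape expected by udb_vq_distab could be derived from codebook centroids along the following lines. This Python sketch is an illustration only: the function name is hypothetical, and scaling by the largest distance is just one assumed way of bringing values into the range 0 to 1.

```python
import math


def vq_distance_table(codebook):
    """Return an n-by-n table of Euclidean distances scaled into [0, 1]."""
    n = len(codebook)
    raw = [[math.dist(codebook[i], codebook[j]) for j in range(n)]
           for i in range(n)]
    largest = max(max(row) for row in raw) or 1.0  # avoid dividing by zero
    return [[d / largest for d in row] for row in raw]


codebook = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]  # toy 2-D centroids
distab = vq_distance_table(codebook)
for row in distab:
    print(row)
```

A real codebook would have udb_vq_codebook_size entries of cepstral dimension rather than three 2-D points.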

Phone Table

If the database is to be used with the Generic Selection Strategy, the variable udb_nus_phones should be set to a list with one row for each phoneme in the phoneme set. Each row should consist of: the phone name, the duration mean and standard deviation, the F0 mean and standard deviation, the voicing mean and standard deviation, and the power mean and standard deviation. An example is

     (set udb_nus_phones '(
      (N    80.855   25.717  124.450  30.525  0.900  0.209  6.864  0.701)
      (PAU 368.030  196.260  120.690  33.982  0.041  0.095  3.490  0.692)
      (a    90.485   28.937  107.080  40.955  0.795  0.290  6.728  1.004)
      (b    52.201   17.826  114.320  28.462  0.980  0.106  6.753  0.663)
      (by  117.500   26.671  107.080  26.716  1.000  0.100  7.008  0.623)
      (cch 158.330   35.377  138.080  24.945  0.164  0.157  5.329  1.186)
      ...
      ))
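One plausible use of such per-phone statistics is to normalise raw values into z-scores, which is what the `_z' fields in the earlier field declaration example appear to hold; that reading is an assumption, not something stated in this chapter. A minimal Python sketch, with a hypothetical helper name:

```python
PHONES = {
    # name: (dur_mean, dur_sd, f0_mean, f0_sd,
    #        voice_mean, voice_sd, power_mean, power_sd)
    "a": (90.485, 28.937, 107.080, 40.955, 0.795, 0.290, 6.728, 1.004),
    "b": (52.201, 17.826, 114.320, 28.462, 0.980, 0.106, 6.753, 0.663),
}


def z_score(phone, column, value):
    """Normalise `value` with the mean/sd pair that starts at `column`."""
    mean, sd = PHONES[phone][column], PHONES[phone][column + 1]
    return (value - mean) / sd


# A 119.422 ms /a/ lies one standard deviation above that phone's mean.
dur_z = z_score("a", 0, 119.422)
print(round(dur_z, 3))
```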

Declaring Unit Databases

It is the intention of this model for unit databases that unit descriptions be separated from database use. We wish to allow easy use of multiple databases (even multiple sets of units from the same database). Also we may wish to change the waveform files to different sampling rates or encodings (e.g. ulaw), without changing the unit index.

CHATR has the notion of a unit database. This requires a number of parameters. Multiple unit databases may exist within a single CHATR session and can easily be switched between.

Each database has a name (a string) which can be used to identify the database. There is the notion of a current database. The following commands set values in the current database.

(Database Set Name name)

The database name. This is used for selecting the unit database from among others in the system.

(Database Set IndexFile filename)

The name of a compiled binary unit description. The same as the filename argument to the Database Units command.

(Database Set PhonemeSet name)

The phoneme set for the units. The synthesis method will map between the internal phoneme set and the unit phoneme set if a mapping is provided. See section Phoneme Maps.

(Database Set WaveSampleRate rate)

The sample rate must always be specified, in Hz. All wave files in a unit database must be in the same format and at the same sample rate.

(Database Set WaveFileType ftype)

The file type for the waveform files in the database. This may be any of the waveform file types that CHATR knows about (e.g. `nist', `sunau' etc). If it is `raw', an encoding and byte order must also be specified.

(Database Set WaveEncoding etype)

The encoding for the file. It may be any of `ulaw', `lin16MSB', `lin16LSB' or `lin16' (the native byte order of the machine that CHATR is currently running on). It is ignored unless the wave file type is `raw'. Note that `lin16MSB' is Sun, M68000 and HP byte order, while `lin16LSB' is i386, VAX, Alpha and DEC MIPS byte order.

(Database Set WaveFileSkeleton string)

The given string should contain a `%s'. It is used as a C printf-format statement (in conjunction with the file-id from a unit) to form a full path-name for the file containing the waveform. The file-id may contain local sub-directory information, i.e. the wave files need not all be in the same directory. A typical example might be as follows, assuming in this case all the wave files are in the same directory. Each file is called `w*.wav' where `*' are numbers. For file-ids of `w0001', `w0002', `w0003', etc. the wave-file skeleton would be

     (Database Set WaveFileSkeleton
        "/usr/home/data/cmu/smalldata/wav/%s.wav")

Another example is where the waveform files are called `C*.wav', but are split over a number of subdirectories. Then we could use file-ids of the form `C01/C01', `C01/C02', etc. and have a wave file skeleton such as

     (Database Set WaveFileSkeleton 
        "/usr/home/data/bigdata/wav/%s.wav")

(Database Set PitchMarkFileSkeleton string)

The given string should be a C printf-format string containing one `%s'. This is used with the file-id of a unit to find the pitch mark file. Note that the same file-id is used for both the wave file and the pitch-mark file. Hence the pitch-mark files must be in the same directory structure as the wave files, though typically not in the same directory. An example might be

     (Database Set PitchMarkFileSkeleton 
        "/usr/home/data/cmu/moredata/pm/%s.pm")

Note that pitch-marks may be preloaded in a database. If this is so, the value must still be set, but is not used. Pitch-mark files are generated from waveform files using various techniques, either by pitch tracking (using get_f0 or fz_track) or from EGG files.
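The way a skeleton and a file-id combine into a path can be mimicked with Python's printf-style string formatting. The helper name below is hypothetical and the check for exactly one `%s' is an added safeguard, not documented CHATR behaviour.

```python
def expand_skeleton(skeleton, file_id):
    """Expand a printf-style skeleton containing a single %s."""
    if skeleton.count("%s") != 1:
        raise ValueError("skeleton must contain exactly one %s")
    return skeleton % (file_id,)


wave = expand_skeleton("/data/gsw/wav/%s.wav", "w0176")
marks = expand_skeleton("/data/gsw/pm/%s.pm", "w0176")
nested = expand_skeleton("/usr/home/data/bigdata/wav/%s.wav", "C01/C02")
print(wave, marks, nested, sep="\n")
```

Because the same file-id feeds both skeletons, a file-id with a sub-directory part such as `C01/C02' requires the pitch-mark tree to mirror the wave-file tree.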

(Database Set PitchMarkType string)

Two types of pitch-mark are accepted. If the argument is `voiced_only', then marks exist only at actual pitch-marks. If the value is `all_marked' (default), then marks exist not only in voiced parts; `fake' ones also exist throughout unvoiced portions. The program fz_track will produce pitch-marks in this form. In both cases the pitch-mark file should contain one pitch-mark per line, consisting of: a float position in the waveform file in milliseconds (yes, it is weird to have a floating point value for milliseconds, but that is what is required), followed by whitespace, followed by a flag, 1 if the position marks a voiced pitch mark, or 0 if it marks an unvoiced (or `fake') pitch mark. When PitchMarkType is set to `voiced_only', all flags should be 1.
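The file format just described is simple enough to read with a few lines of Python. The sketch below is an illustration with a hypothetical function name and made-up sample data; it also enforces the rule that a `voiced_only' file carries only flag 1.

```python
def parse_pitch_marks(text, mark_type="all_marked"):
    """Parse 'position_ms flag' lines into (float, int) pairs."""
    marks = []
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        position, flag = line.split()
        marks.append((float(position), int(flag)))
    if mark_type == "voiced_only" and any(flag != 1 for _, flag in marks):
        raise ValueError("voiced_only files may only contain flag 1")
    return marks


sample = "0.000 0\n5.000 0\n12.250 1\n20.375 1\n"
marks = parse_pitch_marks(sample)
print(marks)
```

The resulting list of pairs has the same shape as one member of a preloaded pitch mark table.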

A typical description of a database ready for use is

     (Database Set IndexFile "/data/gsw/index/index-5.out")
     (Database Set Name "gordon-5")
     (Database Set PhonemeSet mrpa)
     (Database Set WaveSampleRate 12000)
     (Database Set WaveFileType nist)
     (Database Set WaveFileSkeleton "/data/gsw/wav/%s.wav")
     (Database Set PitchMarkFileSkeleton "/data/gsw/pm/%s.pm")

The settings can be made in any order, but all must be made before a database can be used.

Using a Database

Once declared, a database can be used. Issue the command

     (Database Use "gordon-5")

This will check the internal consistency of the declaration and load the unit index. If Database Use is given an argument, it should be a database name. That database will be selected as the current one, and the unit index will be loaded if not already done.

In order to save a database for later reference, it is necessary to use the command

     (Database Keep)

This keeps the current database on a list of available databases.

Thus given two databases that are loaded in the system, the following would allow swapping between them

     (Database Use "gordon-5")
     (Say (Synth utt1))
     (Database Use "sally-5")
     (Say (Synth utt1))

At each swap, no new data is required to be loaded. Note however, the above is a little oversimplified. The intonation statistics would probably need to change, as Sally speaks in a different pitch range than Gordon.

Finally, now that a database has been declared, it is necessary to select `UDB' as the synthesis method before it can be used. This is achieved with the command

     (Parameter Synth_Method UDB)

Simple Selection Strategy

The first selection strategy is: starting at the start of the segment stream, find the longest connected stream of units that match. Then repeat this process until the end of the utterance is found. This doesn't guarantee the least number of breaks, but is a simple and robust strategy. It may be selected using the command

     (Database Set Strategy Simple)
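The greedy longest-match idea can be sketched in Python as follows. This is an illustration of the strategy as described, not CHATR's implementation; the function name and the dictionary representation of the database are hypothetical.

```python
def simple_select(target, database):
    """Greedy selection: repeatedly take the longest contiguous match.

    `database` maps a file-id to its ordered list of unit names.
    Returns a list of (file_id, first_index, length) runs.
    """
    runs, pos = [], 0
    while pos < len(target):
        best = None
        for file_id, units in database.items():
            for start in range(len(units)):
                length = 0
                while (pos + length < len(target)
                       and start + length < len(units)
                       and units[start + length] == target[pos + length]):
                    length += 1
                if length and (best is None or length > best[2]):
                    best = (file_id, start, length)
        if best is None:
            raise LookupError("no unit for " + target[pos])
        runs.append(best)
        pos += best[2]  # continue after the run just selected
    return runs


database = {
    "w0176": ["#", "h", "i", "m", "s", "e", "l", "f"],
    "w0482": ["#", "b", "i", "l", "d", "i", "ng"],
}
runs = simple_select(["b", "i", "l", "s", "e", "l", "f"], database)
print(runs)
```

Each run is locally the longest available, which is why the total number of breaks is not guaranteed to be minimal.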

Note this strategy respects the values for exclude_list in nus_params.

This strategy can be useful if the original phoneme string is desired, for example, in testing duration and pitch modification by PS_PSOLA.

Hand Selection

The selection algorithm can be effectively by-passed by specifying which units have to be chosen. This is used typically when an existing selection is to be changed in a minor way.

An example use of this form is given in `$CHATR_ROOT/lib/data/resynth.ch'.

Generic Selection Strategy

This strategy allows any features in the database entry and also allows arbitrary distance functions to be defined for those features. High quality selection is possible based on minimizing both distance between target segment and selected unit, and distance between selected unit and previous selected unit, i.e. the join. The method is fully described in Black 95d and Hunt 96.

Distance Functions

Distance functions for measuring the match of a target to database candidates may be defined externally. Although not fully general, they cover most of the types of distance function we will probably need for some time. Except for one place, all distances are defined in Lisp and hence require no change to the C code.

A distance function definition consists of 5 parts

name

The name identifies the function; later, weights may be set with respect to this name.

offset

A number -1, 0, 1 indicating which pair of units this function should apply to. -1 means the previous target unit and the database unit previous to the current candidate unit. 0 means the current target unit and current candidate unit. 1 means the next target unit and database unit next to the current candidate.

fieldname

Identifies which field name is to be selected from the database unit description. It also identifies the C function used to obtain this value for the target utterance. See the file `udb/udb_targfuncs.c' for those functions.

preprocessor function

This specifies a function or mapping for the value identified by the field-name. The value of this field may be `ident', when no function is required. If the type of the field name is float, this may be `log' causing an absolute value to be converted to its log value. If the type of the field is category, the preprocessor function may be a `discrete_map' (see below). This option is designed for obtaining derived values from a database, in particular, things like voicing, place of articulation, or vowel type from phoneme information. Both the unit entry and target values are mapped through this function.

distance function

This calculates the distance between the target and database entry values. For field types `flt' and `int', predefined functions are given with names

abs: Absolute difference
sqr: Squared difference
eql: Returns 0.0 if equal and 1.0 otherwise. (This is probably not of much use in the `flt' case.)

If the field is `str' then only `eql' is currently defined. If the field is type `cat', then in addition to `eql', a table of distances may be specified. This must be specified as a square table of floats, with each line bracketed.

For example, the following defines three distance functions for the previous duration, current duration and next duration

     (Database DistFunc p_duration -1 dur_z ident abs)
     (Database DistFunc duration    0 dur_z ident abs)
     (Database DistFunc n_duration  1 dur_z ident abs)

A more complex example defines a distance table for a field named accent whose values are `(high low none)'

     (Database DistFunc accent -1 accent ident ((0.0 0.5 1.0)
                                                (0.5 0.0 1.0)
                                                (1.0 1.0 1.0)))

Specifically this allows distances for discrete features (or derived discrete features) to be specified externally and hence easily allow external procedures to be used to train distances.
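The predefined distances and the table form can be paraphrased in Python as below. This is a sketch for illustration only; the function names are hypothetical, and treating the table row as the target value and the column as the candidate value is an assumption (the table shown above is symmetric, so the choice does not affect it).

```python
ACCENT_VALUES = ["high", "low", "none"]  # order fixes the table layout
ACCENT_TABLE = [[0.0, 0.5, 1.0],
                [0.5, 0.0, 1.0],
                [1.0, 1.0, 1.0]]


def abs_dist(target, candidate):
    """Absolute difference, for flt and int fields."""
    return abs(target - candidate)


def sqr_dist(target, candidate):
    """Squared difference, for flt and int fields."""
    return (target - candidate) ** 2


def eql_dist(target, candidate):
    """0.0 when the two values are equal, 1.0 otherwise."""
    return 0.0 if target == candidate else 1.0


def table_dist(target, candidate, values=ACCENT_VALUES, table=ACCENT_TABLE):
    """Look up a category distance; row = target, column = candidate."""
    return table[values.index(target)][values.index(candidate)]


print(abs_dist(1.5, -0.5), sqr_dist(1.5, -0.5), table_dist("high", "low"))
```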

Discretes and Maps

Each database may have a number of discrete types defined. By default all category type fields in a database are automatically defined as discrete types. In addition, a discrete called phone is defined, whose type consists of all unit names in the database.

Other discretes may be defined by the following command

     (Database DefDiscrete name (value~0
      value~1 value~2 ... ))

Also, again automatically, discrete types are created for fields in the phoneme set definition, as if the following were executed for each database

     (Database DefDiscrete vc (+ -))
     (Database DefDiscrete v_length (a s l d g -))
     (Database DefDiscrete v_height (1 2 3 0)) ;; 1=high 2=mid 3=low
     (Database DefDiscrete v_front (1 2 3 0))  ;; 1=front 2=mid 3=back
     (Database DefDiscrete v_rnd (+ -))
     (Database DefDiscrete c_type (s f a n l c 0))
     (Database DefDiscrete c_place (l a p b d v 0))
     (Database DefDiscrete c_vox (+ -))

The names are, unfortunately, again due to history and hence not as meaningful as they could be. Note that the order of values in a discrete definition is significant, and the distance tables in distance function definitions must reflect that order.

Discrete maps can be defined between any two defined discrete types. This allows values to be derived from existing fields in the database. Again, a number of mappings are pre-defined for any database, such as

ph_vc: Mapping phone to vowel/consonant.
ph_length: Mapping phone to vowel length (or undefined).
ph_height: Mapping phone to vowel height (or undefined).
ph_front: Mapping phone to vowel frontedness (or undefined).
ph_rnd: Mapping phone to vowel rounding (or undefined).
ph_c_type: Mapping phone to consonant type (or undefined).
ph_c_place: Mapping phone to consonant place of articulation (or undefined).
ph_c_vox: Mapping phone to consonant voicing (or undefined).

Other database-specific maps may be explicitly specified using the command

     (Database DefMap name from-type to-type
     (pair~0, pair~1, pair~2 ... ))

A pair must appear for all values of from-type, though the order in which the pairs are specified is not significant. For example, suppose we have a discrete unit field entry called tobi_accent which takes the values

     (H* L* L+H* L*+H !H* none)

We can define a discrete map to reduce these values to three (high low and none) and then define a distance function for reduced values. For example

     (Database DefDiscrete accent3 (high low none))
     (Database DefMap tobi_reduce tobi_accent accent3
              ((H* high) (L* low) (L+H* low) (L*+H low) (!H* high)
               (none none)))
     (Database DistFunc raccent 0 tobi_accent tobi_reduce 
                ((0.0 0.5 1.0)
                 (0.5 0.0 1.0)
                 (1.0 1.0 1.0)))
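The effect of mapping both values before the table lookup can be traced with a small Python sketch. The names mirror the example above, but the code is only an illustration of the idea and not how CHATR evaluates it.

```python
TOBI_ACCENT = ["H*", "L*", "L+H*", "L*+H", "!H*", "none"]
ACCENT3 = ["high", "low", "none"]
TOBI_REDUCE = {"H*": "high", "L*": "low", "L+H*": "low",
               "L*+H": "low", "!H*": "high", "none": "none"}
TABLE = [[0.0, 0.5, 1.0],
         [0.5, 0.0, 1.0],
         [1.0, 1.0, 1.0]]


def check_map(mapping, from_type, to_type):
    """A pair must exist for every from-type value and land in to-type."""
    return (set(mapping) == set(from_type)
            and set(mapping.values()) <= set(to_type))


def raccent(target, candidate):
    """Map both values through tobi_reduce, then look up the distance."""
    row = ACCENT3.index(TOBI_REDUCE[target])
    column = ACCENT3.index(TOBI_REDUCE[candidate])
    return TABLE[row][column]


print(check_map(TOBI_REDUCE, TOBI_ACCENT, ACCENT3), raccent("L+H*", "H*"))
```

Two different accents that reduce to the same class, such as H* and !H*, end up at distance 0.0.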

Unit Distance Weights

For each distance function, a weight may be specified for its contribution to the distance between a target unit and a candidate unit. Weights are specified using the Database Set Weights command. The value following may be either Clear, to remove all existing weights, or a list of weights, each given for a phoneme, a set of phonemes, or any. The specific weights are set first (in order of specification), then any remaining phonemes which do not have weights are given the weights specified by any. A weights specification consists of a list of sets. Examples may be found in any database directory in `$SPEAKER_ROOT/index/DBNAME_weights*.ch'. Note that weights can only be specified for defined distance functions.
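The resolution order described above can be illustrated with a Python sketch. The data layout and function names are hypothetical, and two points are assumptions rather than documented behaviour: that a later specific entry overrides an earlier one for the same phoneme, and that the weighted distances are combined by simple summation.

```python
def resolve_weights(spec, phonemes):
    """Expand (phoneme-set, weights) entries into a per-phoneme dict."""
    resolved, default = {}, None
    for who, weights in spec:
        if who == "any":
            default = weights
        else:
            for phoneme in who:  # specific entries, in the order given
                resolved[phoneme] = weights
    if default is not None:
        for phoneme in phonemes:  # fill in everything still unset
            resolved.setdefault(phoneme, default)
    return resolved


def unit_distance(weights, distances):
    """Weighted sum of named distance-function values."""
    return sum(weights.get(name, 0.0) * value
               for name, value in distances.items())


spec = [(("a", "i"), {"duration": 2.0, "accent": 1.0}),
        ("any", {"duration": 1.0})]
weights = resolve_weights(spec, ["a", "i", "b"])
score = unit_distance(weights["a"], {"duration": 0.5, "accent": 0.5})
print(weights["b"], score)
```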

Other Parameters

There are other parameters specific to the generic unit selection strategy which may be set through the variable nus_params. These allow setting of beam width and weighting for various parameters in the search process. The parameters are

cand_width: The number of candidates on which to compute the full distance at each target segment. 20 is a reasonable value, though it can probably be smaller.
beam_width: The number of candidate paths to carry forward to the next target unit. 10 is a reasonable value.
exclude_list: A list of file-ids of files that should be omitted during unit selection. This is a quick and easy way to exclude a file (to ensure none of the original units are selected) while synthesizing it. It is not suitable however, to include large numbers of files in this list as it becomes very inefficient. If large numbers of files must be excluded, build a new index file.
join_wt: Value by which to multiply continuity distance when combining unit distance and continuity distance.
unit_wt: Value by which to multiply unit distance when combining unit distance and continuity distance.
cand_thresh: Weight used for threshold when pre-pruning candidates for full distance measure. This value is important and can make a big difference to the quality of selection. 1.0 or 2.0 are reasonable figures.
vq_wt: Weight by which to multiply the overall acoustic frame parameter distance (note this should be given a more general name).
vq_vq_wt: Weight by which to multiply the vq parameter (param 0) in acoustic parameter distance.
vq_pow_wt: Weight by which to multiply the power parameter (param 1) in acoustic parameter distance.
vq_f0_wt: Weight by which to multiply the F0 parameter (param 2) in acoustic parameter distance.
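How cand_width, beam_width, unit_wt and join_wt might interact in a search can be sketched in Python. This is a generic beam search written for illustration, not the algorithm of Black 95d or Hunt 96; the function signature is hypothetical, cand_thresh and the vq weights are not modelled, and ranking candidates by unit distance stands in for the real pre-pruning step.

```python
def select_units(targets, candidates_for, unit_dist, join_dist,
                 cand_width=20, beam_width=10, unit_wt=1.0, join_wt=1.0):
    """Beam search for a cheap sequence of candidate units."""
    beam = [(0.0, [])]  # (accumulated cost, chosen units)
    for target in targets:
        # Keep only the cand_width candidates closest to the target.
        cands = sorted(candidates_for(target),
                       key=lambda c: unit_dist(target, c))[:cand_width]
        extended = []
        for cost, path in beam:
            for cand in cands:
                step = unit_wt * unit_dist(target, cand)
                if path:  # the first unit has nothing to join to
                    step += join_wt * join_dist(path[-1], cand)
                extended.append((cost + step, path + [cand]))
        # Carry only the beam_width cheapest partial paths forward.
        beam = sorted(extended, key=lambda e: e[0])[:beam_width]
    return beam[0]


# Toy data: units are (file_id, index, duration_ms); targets are durations.
UNITS = [("f1", 0, 100), ("f1", 1, 60), ("f2", 0, 95), ("f2", 1, 50)]


def toy_unit_dist(target, unit):
    return abs(target - unit[2]) / 100.0


def toy_join_dist(left, right):
    contiguous = left[0] == right[0] and right[1] == left[1] + 1
    return 0.0 if contiguous else 1.0


cost, path = select_units([100, 50], lambda t: UNITS,
                          toy_unit_dist, toy_join_dist)
print(cost, path)
```

In the toy data the search prefers two contiguous units from one file over the individually closest units, because the join cost outweighs the small gain in unit distance.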

See section Creating & Training a Speech Synthesizer Database, for details on how to actually build a database.
