The CHATR system tries to be many things. It tries to be modular, portable and efficient, and even to produce good synthesis. These goals are not always complementary. However, it is strongly felt by the authors that in order to ultimately improve on the classic `pipeline' TTS structure, it is necessary to introduce a new, more flexible architecture.
CHATR can best be thought of as a general system in which modules act on utterances. Each utterance has a complex internal structure. A module may access (read or write) any part of an utterance, though typically it reads many parts but writes only one type of information. A typical module may be a waveform synthesizer: something that takes a description of phonemes, durations and fundamental frequency, and generates a waveform. Or it may be something that takes words, looks them up in a lexicon and returns their pronunciation. Each module may have its own internal structure if necessary; however, its communication with the rest of the system is via an utterance object.
There are many references in this manual to modules and functions, so the meaning of these titles as used in CHATR will now be defined.
A function is a piece of code which performs a small low-level task, such as byte-swapping, file-to-file transfer of phonemes or words (stream building), or removing unneeded phrase parts etc. Functions are called by modules. Functions can (and often do) call other functions.
A module is a collection of functions, each of which may be called by many other modules. It is a stand-alone unit which takes an input, performs a major task and supplies a complete output, ready for use by the user. Modules are invoked in the form of Lisp functions. A module does not call another module.
Since modules are defined in the form of Lisp functions, the full flow of control for synthesis may be specified within a Lisp function. Similarly, some functions may be tested directly by a user without the need to recompile the system.
By default the Synth command decides which modules are to be called based on the utterance type. See function chatr() in file `$CHATR_ROOT/src/chatr/chatr.c' for the actual mapping.
There are various levels at which a system like CHATR can be used. At one extreme CHATR can simply be used as a black box that generates speech. Thus it can be used as a speech synthesizer for a general natural language processing system. At the other extreme, a user can add and change modules in the system, adding new features to the synthesizer. Other levels exist; for instance, redefining the HLP rules or adding a new speech database is possible without recompiling the system. CHATR is designed to be both a speech synthesizer and a tool for researching speech synthesis.
In order to offer a uniform environment between the internals of CHATR and the outside world, almost all data and command i/o is done via Lisp s-expressions (bracketed structures). S-expressions offer a very simple but uniform representation for complex objects. This means we need only define one main function for reading and writing data. No special `read-intonation-stats' or `print-segment-stream' function (or syntax) is required. All non-binary data is conventionally represented in this form.
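The benefit of a single uniform reader can be sketched as follows. This is an illustrative Python model only (CHATR itself is written in C, and the function names here are invented for the example): one small parser suffices for every bracketed data type in the system.

```python
# Minimal s-expression reader: one parser covers every data type,
# so no special per-type read function is needed.
# Illustrative sketch only, not CHATR's actual reader.

def tokenize(text):
    """Split an s-expression string into '(', ')' and atom tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    """Turn a token list into nested Python lists, one list per bracket."""
    tok = tokens.pop(0)
    if tok == "(":
        node = []
        while tokens[0] != ")":
            node.append(parse(tokens))
        tokens.pop(0)  # consume the closing ')'
        return node
    return tok  # an atom

expr = parse(tokenize("(set Sampling_Rate 16000)"))
```

The same `parse` handles a command, an intonation description or a segment stream alike; only the interpretation of the resulting nested lists differs.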
As in Lisp, commands have the generic form
(command_name arg1 arg2 ... argn)
Commands are interpreted by an evaluator. A table of commands relates the Lisp-level command name to a C function which interprets that command. Again, this means there is a uniform method for specifying actions (playing data, saving data, setting parameters etc) within the system.
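The command-table idea can be sketched as follows. This is a hedged Python illustration of the dispatch pattern, not the real CHATR internals; the handler names and the `PARAMS` store are invented for the example.

```python
# Sketch of an evaluator's command table: Lisp-level command names
# mapped to handler functions, as CHATR maps names to C functions.
# Illustrative only; names are invented for this example.

PARAMS = {}

def cmd_set(args):
    """Handler for (set name value): store and return the value."""
    name, value = args
    PARAMS[name] = value
    return value

def cmd_show(args):
    """Handler for (show name): return the stored value."""
    return PARAMS.get(args[0])

COMMAND_TABLE = {
    "set": cmd_set,
    "show": cmd_show,
}

def evaluate(expr):
    """expr is a parsed s-expression: [command_name, arg1, ..., argn]."""
    name, args = expr[0], expr[1:]
    return COMMAND_TABLE[name](args)
```

Adding a new command then means adding one entry to the table, with no change to the reader or the evaluator loop.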
The processing sequence of CHATR is as follows
    Text Input  (or Voice Input)
         |
         v
    Phoneme Conversion
         |
         v
    Prosody Prediction  (or direct Prosody Input)
         |
         v
    Unit Selection
         |
         v
    Waveform Processing  <-->  Speech Data
         |
         v
    Audio Output / Text Display    "Hello, I am CHATR."
There are several different methods for performing each process. Users may use system defaults (except for Audio Output (see section Audio Setup - Software)), or, using CHATR commands, select a particular preferred method. The processes concerned are

Text Input
Phoneme Conversion
Prosody Prediction
Unit Selection
Waveform Processing
Audio Output (set in the user's `.chatrrc' file; see section Audio Setup - Software, for details)
There are many forms in which text can be presented to the system.
Voice input may come from several sources.
Each utterance is represented internally as an object, many of which may exist in the system at once. The synthesis of the utterance (ultimately a waveform) is generated with respect to various parameters set up beforehand. An utterance consists of a number of levels, called streams. The number of streams may vary depending on the type of synthesis being used. There is a method for declaring which currently defined streams are to be used for utterances. Each stream can be viewed as a series of ordered cells.(3) Each stream cell has contents which are dependent on the type of stream.
Streams can easily be added to the system, but in the version this manual describes the following are defined
Input
Phrase
Word
Syllable
Phoneme
Intones
RFC
Segment
Wave
Unit
The basic architecture enables each cell on each level to be linked to any number of cells on any other level. As a result it is easy to find, for instance from a phoneme, which syllable it is part of, and through that (or even directly) which word it is within. Likewise, which phonemes are in each word is also available by following a pointer. Importantly, these levels are not in a simple hierarchy. Although there is an obvious hierarchical relationship between words, syllables and phonemes, there is no such obvious relationship between intones and phonemes. Therefore such a strict hierarchy is not built in; any level may be related to any other level as required.
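The cross-linking just described can be sketched as follows. This is an illustrative Python model, not CHATR code; the class and method names are invented for the example. The key point is that a cell's links are keyed by stream name rather than by a fixed parent/child slot, so any level may relate to any other.

```python
# Sketch of stream cells with arbitrary cross-links between levels.
# Relations need not form a strict hierarchy: a phoneme may link to
# its syllable, its word, or both. Illustrative only.

class Cell:
    def __init__(self, stream, contents):
        self.stream = stream        # e.g. "Word", "Syllable", "Phoneme"
        self.contents = contents
        self.links = {}             # stream name -> list of related cells

    def link(self, other):
        """Create a symmetric relation between two cells."""
        self.links.setdefault(other.stream, []).append(other)
        other.links.setdefault(self.stream, []).append(self)

word = Cell("Word", "sister")
syl = Cell("Syllable", "s.i")
ph = Cell("Phoneme", "s")
word.link(syl)
syl.link(ph)
word.link(ph)   # a direct word<->phoneme link is equally legitimate
```

From `ph`, the enclosing word is reachable either directly (`ph.links["Word"]`) or via the syllable, mirroring the "through that (or even directly)" access described above.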
It is very important to state that the above levels are not strict. More could be added, or the existing ones ignored. Currently we have not fixed any, though it does appear the segment stream can be thought of as an important level between the high level aspects of synthesis and the underlying waveform generating synthesis method. We have not followed other systems, which are defined in a strict pipeline of processes, where each module feeds data to the next module in the pipe. Such a model would mean that one module can fix what information is available for later modules in the pipe. A requirement of more data in a later module might require changes in all previous modules so that the necessary information is available. Here all modules can access all levels (though typically do not), without any dependency on other modules.
Simply put, the overall synthesizer system takes an utterance object of which most levels are not yet filled in. Various modules are called (depending on parameters) that fill in these other levels of the utterance, eventually leading to a waveform (if requested) that can be played by several mechanisms.
Input to CHATR is in the form of an utterance created by
the Utterance
command. Several types of input may be
specified at quite different levels, varying from raw text to a
simple waveform. The current possibilities are
Text
(Utterance Text "You can pay for the hotel with a credit card.")

Of course, with such a high level input, little control may be exercised over the prosodic form. This is, however, the simplest input type.
HLP
(Utterance HLP
 (((CAT S) (IFT Statement))
  (((CAT NP) (LEX you)))
  (((CAT VP))
   (((CAT Aux) (LEX can)))
   (((CAT V) (LEX pay)))
   (((CAT PP))
    (((CAT Prep) (LEX for)))
    (((CAT NP))
     (((CAT Det) (LEX the)))
     (((CAT N) (LEX hotel)))))
   (((CAT PP))
    (((CAT Prep) (LEX with)))
    (((CAT NP))
     (((CAT Det) (LEX a)))
     (((CAT Adj) (LEX credit) (Focus +)))
     (((CAT N) (LEX card))))))))
PhonoWord
Phrase groups may also be given features, such as the PitchRange feature. A typical example is
(Utterance PhonoWord
 (:D ()
  (:S ()
   (:C ()
    (my (B (i)) )
    (sister (H (l)))
    (who)
    ((lives (CAT V)))
    (in)
    (edinburgh (H(d))(B ())))
   (:C ((PitchRange one))
    (knows (B(i)))
    (an)
    (electrician (H (d)) )) ) ) ))

Each word may also be labeled with intonational features. Such an example is
(Utterance PhonoWord
 (:D ()
  (:S ()
   (:C ()
    (marianna (H*))
    (made)
    (the)
    (marmalade (H*) (L-L%))))))

If the intonation method is set to ToBI then it is possible to specify ToBI-like utterances in this form. No direct representation of break levels is currently possible in this mode, but the bracketed four-level structure offers a form of numbered break levels. Note that in order to stop the ToBI (and JToBI) modules from ignoring your specification, the variable HLP_realise_strategy must be set to Simple_Rules. Use the command

(set HLP_realise_strategy 'Simple_Rules)
PhonoForm
(Utterance PhonoForm
 (:D nil
  (:S ((PauseLength 65))
   (Word Attorney nil
    (Syl ax () (Phoneme ax 70 8.5100 ((187.0000 35))))
    (Syl t.er ((Stress 1) (Intones HiF0 H*))
     (Phoneme t 110 7.1200 ((242.0000 55)))
     (Phoneme er 80 8.7500 ((255.0000 40))))
    (Syl n.iy nil
     (Phoneme n 50 8.6700 ((233.0000 25)))
     (Phoneme iy 60 8.3400 ((193.0000 30)))))
   (Word General ((Break 1))
    (Syl d.jh.eh.n ((Stress 1) (Intones !H*))
     (Phoneme d 60 7.7300 ((173.0000 30)))
     (Phoneme jh 40 7.4100 ((226.0000 20)))
     (Phoneme eh 110 8.4300 ((205.0000 55)))
     (Phoneme n 30 8.2700 ((196.0000 15))))
    (Syl axr () (Phoneme axr 130 8.2600 ((158.0000 65))))
    (Syl el ((Intones L-H%)) (Phoneme el 110 7.9000 ((180.0000 55))))))
  (:S nil
   (Word James nil
    (Syl d.jh.ey.m.z ((Stress 1) (Intones H*))
     (Phoneme d 70 7.0900 ((182.0000 35)))
     (Phoneme jh 50 7.0800 ((184.0000 25)))
     (Phoneme ey 150 8.2200 ((154.0000 75)))
     (Phoneme m 100 7.7600 ((143.0000 50)))
     (Phoneme z 30 6.7700 ((200.0000 15)))))
   (Word Shannon ((Break 1))
    (Syl sh.ae.n ((Stress 1) (Intones HiF0 H*))
     (Phoneme sh 90 6.9900 ((200.0000 45)))
     (Phoneme ae 150 8.3700 ((172.0000 75)))
     (Phoneme n 80 8.1000 ((144.0000 40))))
    (Syl ax.n ((Intones L-L%))
     (Phoneme ax 30 7.4400 ((104.0000 15)))
     (Phoneme n 50 7.1100 ((145.0000 25))))))))
Segment
Input of this type is in the form produced by the save segment command. This allows fine control over what is actually to be synthesized.
The fields in each segment are: segment (phoneme) name, duration in
milliseconds, power, and a list of F0 targets. Each F0 target
consists of a frequency in Hz, followed by an index in milliseconds
into the segment at which that target frequency is desired. An
example is
(Utterance Segment
 ( ( # 50 0 ((80 0)))
   ( m 58 0 ((80 0)))
   ( ai 148 0 ((140 42) (135 101)))
   ( s 105 0 ())
   ( i 94 0 ((205 18)))
   ( s 61 0 ((145 44)))
   ( t 45 0 ())
   ( 60 0 ())
   ( h 59 0 ())
   ( uu 140 0 ())
   ( l 80 0 ())
   ( i 97 0 ())
   ( v 51 0 ())
   ( z 60 0 ())
   ( i 97 0 ((80 78)))
   ( n 58 0 ())
   ( e 115 0 ((130 23)))
   ( d 54 0 ((80 28)))
   ( i 43 0 ())
   ( n 39 0 ())
   ( b 74 0 ())
   ( uh 100 0 ())
   ( r 22 0 ((80 2)))
   ( 80 0 ())
   ( # 200 0 ((130 0)))))
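Since each F0 target's index is relative to the start of its own segment, absolute target times follow from a running total of the durations. The following Python sketch (illustrative only, not CHATR code) shows this arithmetic for the segment format just described.

```python
# Each segment: (name, duration_ms, power, [(hz, offset_ms), ...]).
# F0 target offsets are measured from the segment start, so absolute
# target times come from the running sum of durations.
# Illustrative sketch only.

def f0_contour(segments):
    """Return [(absolute_time_ms, hz)] for all F0 targets."""
    points, start = [], 0
    for name, dur, power, targets in segments:
        for hz, offset in targets:
            points.append((start + offset, hz))
        start += dur
    return points

segs = [("#", 50, 0, [(80, 0)]),
        ("m", 58, 0, [(80, 0)]),
        ("ai", 148, 0, [(140, 42), (135, 101)])]

total = sum(dur for _, dur, _, _ in segs)   # total duration = 256 ms
```

For the three segments above, the "ai" segment starts at 50 + 58 = 108 ms, so its targets fall at 150 ms and 209 ms absolute time.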
SegF0
(Utterance SegF0
 ("MHT01.f0"
  ( ( PAU 255 0 () ) ( a 85 0 () ) ( r 10 0 () ) ( a 95 0 () )
    ( y 45 0 () ) ( u 105 0 () ) ( r 20 0 () ) ( u 95 0 () )
    ( g 55 0 () ) ( e 85 0 () ) ( N 100 0 () ) ( j 45 0 () )
    ( i 75 0 () ) ( ts 125 0 () ) ( u 90 0 () ) ( o 177.5 0 () )
    ( PAU 447 0 () ) ( s 117 0 () ) ( u 42 0 () )
    ... ) ) )

In this case the F0 is specified in a separate file, with F0 points given one per line. Each line should consist of two numbers: a position in milliseconds from the start of the utterance, and the desired F0 value in Hz. Each may optionally be surrounded by parentheses. Alternatively, the F0 may be specified directly in-line in the utterance. Instead of a file name, that part may be a list of bracketed pairs of positions in milliseconds and Hz values. The pairs in an explicit list or a file need not be at regular intervals, but should be in order.
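A reader for the F0 file format just described can be sketched as follows. This Python sketch is illustrative only and is not the actual CHATR reader; it simply shows that the optional parentheses make no difference once they are treated as whitespace.

```python
# Sketch of reading an F0 file: one point per line,
# "position_ms value_hz", each number optionally parenthesized.
# Illustrative only, not CHATR's actual reader.

def parse_f0_lines(lines):
    """Return [(position_ms, hz)] from lines of an F0 file."""
    points = []
    for line in lines:
        # treat parentheses as whitespace, then split into fields
        fields = line.replace("(", " ").replace(")", " ").split()
        if len(fields) == 2:
            points.append((float(fields[0]), float(fields[1])))
    return points

points = parse_f0_lines(["( 0 120 )", "55 131.5", "", "( 110 128 )"])
```

Parenthesized and bare lines yield identical points, and blank lines are skipped.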
RFC
(Utterance RFC
 ( (sil 335 ( ( sil 0 135 ) ))
   (hh 48 ( ( conn 28 135 )))
   (ax 23 ())
   (l 30 ( ( fall 21 138 )))
   (ow 224 ( ( conn 192 82 )))
   (sil 327 ( ( sil 56 87 )))
   (ls 77 ( ( conn 74 121 )))
   (ih 90 ( ( rise 47 119 )))
   (z 42 ()) (dh 29 ()) (ih 56 ())
   (s 72 ( ( fall 58 163 )))
   (dh 32 ()) (iy 54 ())
   (h# 103 ( ( conn 54 111 )))
   (ao 22 ()) (f 54 ()) (ax 35 ()) (s 66 ()) (f 53 ()) (er 45 ())
   (dh 23 ()) (ax 32 ()) (k 90 ()) (aa 87 ()) (n 26 ()) (f 59 ())
   (r 45 ()) (ax 44 ())
   (n 45 ( ( sil 16 115 )))
   (s 134 ())
   (sil 542 ()) ))
Syllable
(Utterance (Syllable (space rfc) (format feature))
 ( (:C ((PitchRange one) (start top))
    ((hh 48) (ax 23) ())
    ((l 30) (ow 224) ((H (ds)))) )
   (:C ((PitchRange one))
    ((ih 90) (z 42) ())
    ((dh 29) (ih 56) (s 72) ((H (us))))
    ((dh 32) (iy 54) ())
    ((ao 126) (H (ds)))
    ((f 54) (ax 35) (s 66) ((C (r))))
    ((f 53) (er 45) ())
    ((dh 23) (ax 32) ())
    ((k 90) (aa 87)(n 26) ((H (ds))))
    ((f 59) (r 45) (ax 44) (n 45) (s 134) ((B ()))) ) ))

This example also shows how the utterance type may include other features identifying sub-type information. If the type is non-atomic, it may include a feature list which may be accessed later during synthesis.
Wave

A waveform file may be given as input directly. Optional features are file_type, sample_rate, and coding. If no features are specified, the value of the global wave file-type (set by the command Wave_Filetype) is used. An example is
(Utterance Wave ("$ROOT/usr/home/data/cmu/maem/wav/C01.03.wav" (file_type "nist")))

If file_type is `raw', a sample rate and coding type should be specified. If no sample rate is given, the global rate (set by the command Sampling_Rate) is used. No default is available for coding, so use one of either lin16MSB or lin16LSB:
(Utterance Wave ("$ROOT/usr/home/data/cmu/maem/wav/C01.03.raw" (file_type "raw") (sample_rate 16000) (coding lin16LSB)))
Utterances are created by the Utterance
command. They may be
saved in variables (using the set
command) and then given
to other commands as arguments. Thus one can do
(set utt1 (Utterance HLP ...))
#<Utt 349078>
(Synth utt1)
#<Utt 349078>
(Say utt1)
Or you can pass the result of the Utterance command directly as an argument to another function, thus
(Say (Synth (Utterance HLP ...)))
#<Utt 349078>
However, it is often useful to save the utterance in a variable so it may be referenced later. A common form is
(set utt1 (Utterance HLP ...))
#<Utt 349078>
(Say (Synth utt1))
#<Utt 349078>
There is a notion of a current utterance. Many commands that
take an utterance as an argument will use the current utterance if no
argument is actually given. The current utterance is the utterance
generated by the most recent Utterance
command--irrespective
of any other utterances that have been referenced in between.
There is a system which allows an utterance to be synthesized and
played, and any other arbitrary function to be called with the newly
created utterance as an argument. This follows the ideas of EMACS by
offering hooks. If the variable utt_hook
is set to
either a function name or a list of function names, these functions
are called, in order, with the new utterance as an argument. For
example, if you wish all new utterances to be synthesized and played
at the time they are created, you may use the command
(set utt_hook (list Synth Say))
A similar hook (synth_hook
) also exists for use after the full
waveform is synthesized by the Synth
or Synthesize
commands. This is intended for low-level waveform manipulations to
be specified, such as altering the gain or sample frequency.
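The hook mechanism can be sketched as follows. This Python sketch is illustrative only; the function `run_hooks` is invented for the example, and in CHATR the hook entries would be commands such as Synth and Say.

```python
# Sketch of the hook mechanism: a hook may be a single function or a
# list of functions, each applied in order to the new utterance.
# Illustrative only; names are invented for this example.

def run_hooks(hook, utterance):
    """Apply each function in the hook to the utterance, in order."""
    if hook is None:
        return utterance
    if not isinstance(hook, list):
        hook = [hook]        # a single function is treated as a list of one
    for fn in hook:
        fn(utterance)
    return utterance

log = []
utt_hook = [lambda u: log.append(("synth", u)),
            lambda u: log.append(("say", u))]
run_hooks(utt_hook, "utt1")
```

The same machinery serves both utt_hook and synth_hook; only the point at which the hook is run differs.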
Another utterance input method is by a very simple text-to-speech
CHATR module. Text in files (or from `standard in') may be
directly synthesized via the Textfile
command. It takes one
argument, a filename. The file is assumed to be a text file. This
will read in `sentences' and build HLP utterances from them (but a
little crudely!). Sentences are defined as a string of tokens
terminated by a full stop, question mark, exclamation mark or blank
line. This input is really too low level for normal use, and hence a
number of wrap-around functions are offered. The functions
tts
, jtts
and mtts
offer English, Japanese and
mixed text to speech.
All tts functions take a single file name as an argument. If the file name `-' is given, CHATR will read from `standard in'. When in this interactive sub-mode, a different prompt is used. See section The Command-line Prompt, for an example. To exit from keyboard tts mode, enter an empty sentence. That is, after finishing a sentence, enter a single full stop.
The CHATR tts system synthesizes on a sentence-by-sentence basis. Ends of sentences are identified by blank lines, full stops, question or exclamation marks. Note that CHATR does not yet support Japanese input in Kana or Kanji form while in interactive mode. Romaji may be used. Japanese is fully supported from files, however, and also by using the EMACS interface.
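The sentence-boundary rule just stated can be sketched as follows. This Python sketch is illustrative only, not the CHATR tokenizer; it shows how a terminator character or a blank line ends the current sentence while line breaks inside a sentence are ignored.

```python
# Sketch of the tts sentence rule: a sentence ends at a full stop,
# question mark, exclamation mark, or a blank line.
# Illustrative only, not CHATR's actual text reader.

def split_sentences(lines):
    """Group input lines into sentences by the rules above."""
    sentences, current = [], []
    for line in lines:
        stripped = line.strip()
        if stripped == "":                  # blank line ends a sentence
            if current:
                sentences.append(" ".join(current))
                current = []
            continue
        current.append(stripped)
        if stripped[-1] in ".?!":           # terminator ends a sentence
            sentences.append(" ".join(current))
            current = []
    if current:                             # end of input flushes the rest
        sentences.append(" ".join(current))
    return sentences
```

A sentence broken over two lines is joined before synthesis, exactly as in the interactive example below.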
A simple example is
(tts "-")
Hello, this is a speech synthesis system. Sentences may be
broken over lines, but will not be synthesized until the end
of the actual sentence.
.
(tts "War-and-Peace")
Note that this sub-system is still pretty minimal. There is work to be done in adding a parser to the text-to-CHATR sub-system, and later work will add better treatment of numbers, acronyms etc. At present, although it works, there are still many things that could be done to improve it.
In text mode, audio output is asynchronous, allowing synthesis of the next utterance while the previous one is still being played. This goes some way to reduce the pauses between utterances. This incremental form of synthesis is still a little crude, but is quite adequate.
Text made up of multiple languages may be spoken by any one of the CHATR voices available. The process is as follows
[Diagram: the input text is split by language-detecting filters into single-language sections L1 ... Ln. Each section passes through its language's front end to give a (multi-lingual) word sequence, which is then mapped through speaker-specific phone mappings into a single (omni-lingual) phone sequence for the current speaker.]
First the source text is split into sections, each containing one language. There may be multiple parts of the text in each section if the original text switches from one language to another and back again. Detection depends on machine-recognizable characteristics of each language. For instance, Japanese is detected by the presence of double-byte characters; German by the existence of umlaut characters.(4)
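The detection heuristic just described can be sketched as follows. This Python sketch is illustrative only and deliberately crude, like any such filter; the two-language test here (double-byte characters for Japanese, umlauts for German, English otherwise) stands in for the real set of filters.

```python
# Sketch of the language-detection heuristic described above:
# double-byte characters suggest Japanese, umlaut characters
# suggest German, otherwise assume English. Illustrative only.

UMLAUTS = set("äöüÄÖÜß")

def guess_language(text):
    """Return a guessed language name for a section of text."""
    if any(ord(ch) > 0xFF for ch in text):   # double-byte character
        return "japanese"
    if any(ch in UMLAUTS for ch in text):
        return "german"
    return "english"
```

A real filter set would need many more such cues, and misclassification is always possible; the heuristic only has to be good enough to route each section to a plausible front end.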
Each part of the text can then be processed differently by CHATR. The text in the original language of the current speaker follows the conventional route through to unit selection. Other languages have their phonemes mapped to suitable-sounding phonemes from that speaker's set. See the `README' file under $CHATR_ROOT/src/conv_phoneme for more details.
Next a sequencer builds the individual strings of phonemes into correct order. Finally, unit selection can take place.
By this method any speaker may be made to speak any language, albeit with the accent of the original language.
During synthesis, several modules are called, each taking an utterance and additional information as parameters; which modules are called depends on the utterance itself and various global settings. The modules are called from a high-level function initiated when the synthesis of an utterance is requested. There are many modules, but each is designed to be self-contained (though typically depending on various lower-level architecture access routines). Some of the modules are
HLP
Lexicon
Phoneme-to-Segment
Intonation
Duration
Synthesis
Audio
There are other modules too. The point to be made here is that they can vary, and synthesis researchers may wish to add their own. See section Developing New Modules for CHATR, for how to modify the system and add your own modules.