The purpose of a lexicon is to translate `words' (arbitrary atomic tokens) into syllables with pronunciation and stress. The lexicon system within CHATR, like many of the other parts of the system, is designed to be very powerful but currently only offers minimal functionality. The system is designed such that we can later replace this module with a more sophisticated one. This also allows users to incorporate a lexicon of their own choice if they so wish.
Lexicons may be switched (even within a session if required), thereby affecting language or dialect.
In the standard CHATR installation, lexicons are kept in the directory `$CHATR_ROOT/lib/dic/'. This may be changed if desired. See section Calling Customization at Initialization, for details.
There are currently five lexicons utilized within CHATR. The names are
mrpa
beep
cmu
japanese
celex
This list is extended each time CHATR is made capable in a new language. Depending on intended CHATR use, not all the above lexicons may be loaded. Check the definition file `$CHATR_ROOT/lib/data/lexicons.ch' to find the latest availability.
A lexicon contains a compiled set of entries and/or a (usually) small set of addenda items. There is also a flag to define what should be done if a word cannot be found in the lexicon--either fail or apply some form of letter-to-sound rules. Although a lexicon has a specific phoneme set, mapping may be performed between a lexicon's phoneme set and the currently selected CHATR internal phoneme set, if such a map is defined.
The basic form of a lexical entry is
(word (syllable~1... syllable~n) [ features ])
where those elements are defined as
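word
    the atomic token being defined, such as `beautiful' in the example below
syllable~1 - ~n
    each syllable has the form ((phoneme~1 ... phoneme~n) (1 | 0))
phoneme~1 - phoneme~n
    the phonemes making up the syllable, taken from the lexicon's phoneme set
(1 | 0)
    the stress flag for the syllable; in the examples in this section 1 marks a stressed syllable
features
    an optional list of feature-pairs
feature-pair
    of the form (feature-name feature-value), for example (CAT N)
feature-name
    the name of the feature, such as CAT or PLU in the examples below
feature-value
    the value of that feature, such as N, V or + in the examples below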
A typical example is
(beautiful (((b y uu) (1)) ((t i) (0)) ((f u l) (0))))
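The nesting of such an entry can be made explicit with a small reader. The following Python fragment is only an illustrative sketch and is not part of CHATR; the names `parse_sexpr' and `decode_entry' are hypothetical, and it assumes an entry is a single balanced s-expression of plain atoms.

```python
def parse_sexpr(text):
    """Parse one parenthesised s-expression into nested lists of strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1  # consume the closing ")"
            return items
        if tok == ")":
            raise ValueError("unexpected ')'")
        return tok

    expr = read()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the first s-expression")
    return expr


def decode_entry(text):
    """Return (word, [(phonemes, stressed)], features) for one lexical entry."""
    expr = parse_sexpr(text)
    word, syllables = expr[0], expr[1]
    features = expr[2] if len(expr) > 2 else []  # the features part is optional
    decoded = [(syl[0], syl[1][0] == "1") for syl in syllables]
    return word, decoded, features


word, syllables, features = decode_entry(
    "(beautiful (((b y uu) (1)) ((t i) (0)) ((f u l) (0))))"
)
print(word, syllables, features)
```

Each syllable is decoded as a list of phonemes paired with a boolean stress flag, so only the first syllable of `beautiful' is marked as stressed.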
The `feature-pair' part of a lexical entry allows the specification of homographs (words with different pronunciation but same spelling). For example, consider the phonetic difference in the word `lives' between the sentences `Cats have nine lives' and `He lives in Japan'. In the lexicon the word is represented in two variants thus
(lives (((l ai v z) (1))) ((CAT N) (PLU +)))
(lives (((l i v z) (1))) ((CAT V)))
Of course such entries could equally be distinguished with different citation forms, for example
(lives-n (((l ai v z) (1))))
(lives-v (((l i v z) (1))))
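How a caller might choose between homograph variants can be sketched as follows. This Python fragment is an assumption-laden illustration rather than a description of CHATR's own selection mechanism, which is not specified here; `pick_variant' and the dictionary layout are hypothetical.

```python
# The two variants of `lives', written as Python data for illustration.
LIVES = [
    {"syllables": [(["l", "ai", "v", "z"], 1)], "features": {"CAT": "N", "PLU": "+"}},
    {"syllables": [(["l", "i", "v", "z"], 1)], "features": {"CAT": "V"}},
]


def pick_variant(variants, wanted):
    """Return the first variant whose features include every wanted pair."""
    for variant in variants:
        feats = variant["features"]
        if all(feats.get(name) == value for name, value in wanted.items()):
            return variant
    return None  # no variant carries the requested features


verb = pick_variant(LIVES, {"CAT": "V"})
print(verb["syllables"])
```

With no features requested, the sketch simply returns the first variant listed; that default is a design choice of the example, not documented CHATR behaviour.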
Currently there is no morphological analysis, which means all words and their inflections (and derivations) need to be explicitly included in the lexicon. This is tedious and is an area for future improvement.
A distinct lexicon may be created using the command
(Lexicon Select name)
If the named lexicon already exists it is selected; if it does not, a new (empty) one is created.
Three or four items must be defined in `$CHATR_ROOT/lib/data/lexicons.ch' for a lexicon to be accessible to CHATR. This is achieved after creation and compilation (see section Lexicon Compilation) using the following commands
(Lexicon Phone_Set phoneset-name)
(Lexicon Use file-name)
    `file-name' is a file generated by the Lexicon Compile command. Optional, depending on whether the lexicon has been compiled.
(Lexicon Fail fail_action)
    `fail_action' is one of Error, LTS or JLTS.
(Lexicon Add PHONEMESET-NAME entries...)
    PHONEMESET-NAME need not actually be the same as the current lexicon phoneme-set name, if a mapping is available.
As an example, the current entry for the `beep' lexicon is
(define setup_beep_lex ()
  (require 'beep_def)
  (Lexicon Select beep)
  (Lexicon Phone_Set beep)
  (Lexicon Use beep_compiled_lex)
  (Lexicon Fail LTS)
  (Lexicon Add beep
    (chatr (((ch ae) (1)) ((t ax) (0))))
    (sally (((s ae) (1)) ((l iy) (0))))
    (today (((t ax) (0)) ((d ey) (1))))
    (ATR (((ey) (1)) ((t iy) (0)) ((aa) (1))))
    (the (((dh ax) (0))))
    (synthesis (((s ih n) (1)) ((th ax) (0)) ((s ih s) (0))))
    (dog (((d oh g) (0))))
  )
)
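The interplay between compiled entries, addenda and the fail flag can be pictured with a toy model. The Python sketch below is hypothetical throughout: `ToyLexicon' and `naive_lts' are invented for illustration, the letter-to-sound rule is a placeholder, and the assumption that addenda are consulted before the compiled entries is inferred from their use for overriding pronunciations rather than stated by the manual.

```python
class LexiconError(KeyError):
    """Raised when a word is missing and the fail action is Error."""


def naive_lts(word):
    """Placeholder letter-to-sound rule: one unstressed syllable, one symbol per letter."""
    return [(list(word.lower()), 0)]


class ToyLexicon:
    def __init__(self, compiled, fail_action="Error"):
        self.compiled = dict(compiled)   # stands in for the compiled entries
        self.addenda = {}                # stands in for entries given via Lexicon Add
        self.fail_action = fail_action   # "Error" or "LTS" in this sketch

    def add(self, word, syllables):
        self.addenda[word] = syllables

    def lookup(self, word):
        if word in self.addenda:         # assumption: addenda take precedence
            return self.addenda[word]
        if word in self.compiled:
            return self.compiled[word]
        if self.fail_action == "LTS":
            return naive_lts(word)
        raise LexiconError(word)


lex = ToyLexicon({"dog": [(["d", "oh", "g"], 0)]}, fail_action="LTS")
lex.add("chatr", [(["ch", "ae"], 1), (["t", "ax"], 0)])
print(lex.lookup("chatr"))
print(lex.lookup("xyz"))
```

A lexicon created with the fail action Error raises `LexiconError' for an unknown word instead of guessing a pronunciation.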
A given set of entries may be used in one of two ways: compiled or directly. For large lists of entries, compiling is highly recommended, as access will be significantly faster than by the direct specification method. Although each access takes time, loading a full lexicon (tens of thousands of entries) is a far bigger cost.
Before proceeding with compilation, the phoneme set used by the lexicon must be selected. The command is
(Phoneme Internal_Set PHONEME-SET-NAME)
The lexicon may now be compiled using the function Lexicon Compile. This takes a file containing the entire lexicon and generates a file suitable as an argument to the function Lexicon Use. Basically it checks the format of the entries and sorts them, ensuring a binary search will be possible. Note that the format of the compiled form may be quite different than that of the source.
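The idea of checking, sorting and then searching by bisection can be sketched in a few lines of Python. This is not CHATR's compiled file format, which is not described here; `compile_entries' and `lookup' are hypothetical names, and the entries are held in memory as (word, syllables) pairs purely for illustration.

```python
import bisect


def compile_entries(entries):
    """Check each (word, syllables) entry, then return (sorted keys, sorted entries)."""
    checked = []
    for word, syllables in entries:
        if not isinstance(word, str) or not word:
            raise ValueError(f"bad word: {word!r}")
        for phonemes, stress in syllables:
            if not phonemes or stress not in (0, 1):
                raise ValueError(f"bad syllable in entry {word!r}")
        checked.append((word, syllables))
    checked.sort(key=lambda entry: entry[0])  # sorted order makes binary search possible
    keys = [word for word, _ in checked]
    return keys, checked


def lookup(compiled, word):
    """Binary-search the sorted keys; return the syllables or None if absent."""
    keys, entries = compiled
    i = bisect.bisect_left(keys, word)
    if i < len(keys) and keys[i] == word:
        return entries[i][1]
    return None


compiled = compile_entries([
    ("the", [(["dh", "ax"], 0)]),
    ("dog", [(["d", "oh", "g"], 0)]),
    ("chatr", [(["ch", "ae"], 1), (["t", "ax"], 0)]),
])
print(lookup(compiled, "dog"))
```

Because the sort is stable and the search returns the leftmost match, homographs with the same spelling would resolve to the variant listed first in the source; handling of features is left out of this sketch.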
Two file-names are required as arguments to Lexicon Compile. The form is
(Lexicon Compile "IN-FILE" "OUT-FILE")
`OUT-FILE' is the name of the file in which the sorted lexicon is to be stored.
The `IN-FILE' file should contain a call to the function Lexicon in the form of one s-expression. The syntax is
(Lexicon entry~1 entry~2 ... entry~n )
If the Lisp variable lexicon_syllabify is set, the entries can be in a different format and CHATR will attempt to syllabify them automatically. The result will not be perfect, but it does offer a way to deal automatically with large imported lexicons where we have little control over the input form. The format required for the phonemes is not bracketed syllables but simply a list of phonemes (the format in which the CMU and BEEP lexicons are distributed). If the digits 1 or 2 are appended to vowels, they are removed and the syllable they were located against is marked as stressed. As an example, an input entry like this
("abductive" (ae b d ah1 k t ih v))
would automatically be converted to
("abductive" (((ae b) (0)) ((d ah k) (1)) ((t ih v) (0))))
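The conversion above can be imitated with a very simple heuristic. The Python sketch below is not CHATR's syllabification algorithm, which is not described here. It assumes a hypothetical vowel inventory (`VOWELS') and a made-up rule: each vowel is a syllable nucleus, and a consonant cluster between two vowels gives its first half (rounded down) to the preceding syllable and the rest to the following one.

```python
# Assumed vowel symbols for illustration only; a real phoneme set defines its own.
VOWELS = {"aa", "ae", "ah", "ao", "aw", "ax", "ay", "eh", "er", "ey",
          "ih", "iy", "ow", "oy", "uh", "uw"}


def syllabify(phones):
    """Split a flat phoneme list into [(phonemes, stress)] syllables."""
    bases, marks = [], []
    for p in phones:
        if len(p) > 1 and p[-1] in "12":   # stress digit 1 or 2: strip and remember it
            bases.append(p[:-1])
            marks.append(True)
        else:
            bases.append(p)
            marks.append(False)

    nuclei = [i for i, b in enumerate(bases) if b in VOWELS]
    if not nuclei:                          # no vowel: keep everything as one syllable
        return [(bases, 0)] if bases else []

    starts = [0]
    for prev, nxt in zip(nuclei, nuclei[1:]):
        cluster = nxt - prev - 1            # consonants between two nuclei
        starts.append(prev + 1 + cluster // 2)
    ends = starts[1:] + [len(bases)]
    return [(bases[s:e], int(any(marks[s:e]))) for s, e in zip(starts, ends)]


print(syllabify("ae b d ah1 k t ih v".split()))
```

For the `abductive' example this heuristic happens to reproduce the grouping shown above; other words may well be split differently from CHATR's own output.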
Once compiled the lexicon must be defined. See section Creating and Defining a Lexicon, for details.
It is possible to directly access a lexicon without creating an utterance. The command is
(Lexicon Lookup word)
The system returns the word in syllable-groups with stress marked, using the phoneme-set of the currently selected speaker. As an example, if the current speaker uses the `mrpa' phoneset, the above command will cause the system to respond with
(word (((w @@ d) (1))))
Of course, replacing `word' in the above example with another word will cause the details for that word to be returned.
It may be that a particular phonetic rendering of a word doesn't suit an application. Features may need to change to represent a dialect or speech manner. This may be achieved using the command
(Lexicon Add [PHONESET-NAME] entry~1 entry~2... entry~n)
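where the terms are
PHONESET-NAME
    the optional name of the phoneme set in which the entries are written
entry~1 - entry~n
    lexical entries of the basic form (word (syllable~1... syllable~n) [ features ])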
Referring to the previous example, some might prefer a stronger sounding of the `r' in `word'. Such a new entry would be
(Lexicon Add mrpa (word (((w @@ r d) (1)))))
Note that modifications must be entered using phonetic symbols from the phoneme-set used by the currently selected speaker. For instance, entry of `@@' (used by `mrpa' but not by `beep') when a `beep'-coded speaker is selected will result in an error. A mapping will be utilized if one exists. See section Phoneme Set Definitions, for information on obtaining phoneme lists.
Modifications stay in effect until they are changed again, a different speaker is selected, or the current session of CHATR is quit.