Lexicon

The purpose of a lexicon is to translate `words' (arbitrary atomic tokens) into syllables with pronunciation and stress. Like many other parts of CHATR, the lexicon system is designed to be very powerful but currently offers only minimal functionality. It is built so that this module can later be replaced with a more sophisticated one, which also allows users to incorporate a lexicon of their own choice if they so wish.

Lexicons may be switched (even within a session if required), thereby affecting language or dialect.

In the standard CHATR installation, lexicons are kept in the directory `$CHATR_ROOT/lib/dic/'. This may be changed if desired. See section Calling Customization at Initialization, for details.

Current CHATR Lexicons

There are currently five lexicons utilized within CHATR. The names are

mrpa: The CSTR created mrpa-phoneme lexicon of around 23000 entries.
beep: The Cambridge University beep-phoneme lexicon consisting of around 163000 entries.
cmu: The CMU darpa-phoneme lexicon (converted to the radio2 phoneme set) consisting of around 99000 entries.
japanese: A lexicon with no entries but a letter-to-sound function to convert from romaji to the nuuph phoneme set.
celex: The Kiel University celex-phoneme lexicon consisting of around 313000 entries.

This list is extended each time CHATR is made capable in a new language. Depending on intended CHATR use, not all the above lexicons may be loaded. Check the definition file `$CHATR_ROOT/lib/data/lexicons.ch' to find the latest availability.

Lexicon Entries

A lexicon contains a compiled set of entries and/or a (usually) small set of addenda items. There is also a flag to define what should be done if a word cannot be found in the lexicon--either fail or apply some form of letter-to-sound rules. Although a lexicon has a specific phoneme set, mapping may be performed between a lexicon's phoneme set and the currently selected CHATR internal phoneme set, if such a map is defined.

The basic form of a lexical entry is

     (word (syllable~1... syllable~n) [ features ])

where those elements are defined as

word: An atom - the word to be defined.
syllable~1 - ~n: ((phoneme~1... phoneme~n) (1 | 0))
phoneme~1 - phoneme~n: A series of atoms describing the phonetic sound of that portion of the word.
(1 | 0): An atom indicating whether the syllable formed by the preceding phonemes should be stressed (1) or not (0).
features: (feature-pair...). Word category information to facilitate discernment of homographs.
feature-pair: (feature-name feature-value)
feature-name: An atom naming the feature to be defined, such as the grammatical category or numeric value of the word.
feature-value: An atom defining the value or status of the named category, such as `N' for `Noun' or `+' for `affirmative'.

A typical example is

     (beautiful (((b y uu) (1)) 
                 ((t i) (0))
                 ((f u l) (0))))
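
The entry structure can be modelled outside CHATR to make the nesting explicit. The following is a minimal Python sketch, not CHATR code: the function name `format_entry` and the tuple layout are assumptions made for illustration. It renders a word, its syllables and optional feature pairs in the bracketed form described above.

```python
from typing import List, Optional, Tuple

# One syllable: a list of phoneme atoms plus a stress flag (1 stressed, 0 not).
Syllable = Tuple[List[str], int]
FeaturePair = Tuple[str, str]


def format_entry(word: str, syllables: List[Syllable],
                 features: Optional[List[FeaturePair]] = None) -> str:
    """Render an entry in the (word (syllable...) [features]) form."""
    parts = []
    for phonemes, stress in syllables:
        if stress not in (0, 1):
            raise ValueError(f"stress must be 0 or 1, got {stress!r}")
        parts.append(f"(({' '.join(phonemes)}) ({stress}))")
    text = f"({word} ({' '.join(parts)})"
    if features:
        # Each feature pair is written as (feature-name feature-value).
        pairs = " ".join(f"({name} {value})" for name, value in features)
        text += f" ({pairs})"
    return text + ")"


beautiful = format_entry(
    "beautiful",
    [(["b", "y", "uu"], 1), (["t", "i"], 0), (["f", "u", "l"], 0)],
)
print(beautiful)
```

Printing `beautiful` reproduces the example entry on a single line, which is a quick way to check that the brackets balance.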

The `feature-pair' part of a lexical entry allows the specification of homographs (words with different pronunciation but same spelling). For example, consider the phonetic difference in the word `lives' between the sentences `Cats have nine lives' and `He lives in Japan'. In the lexicon the word is represented in two variants thus

     (lives (((l ai v z) (1))) ((CAT N) (PLU +)))
     (lives (((l i v z) (1))) ((CAT V)))

Of course such entries could equally be distinguished with different citation forms, for example

     (lives-n (((l ai v z) (1))))
     (lives-v (((l i v z) (1))))
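
To illustrate how feature pairs can separate homographs, here is a small Python sketch. It is an assumption about one plausible selection rule (take the first variant whose features include every requested pair), not a description of how CHATR itself chooses. The names `LEXICON` and `select_variant` are hypothetical.

```python
# Two variants of `lives', stored as (syllables, features) under one spelling.
LEXICON = {
    "lives": [
        ([(["l", "ai", "v", "z"], 1)], {"CAT": "N", "PLU": "+"}),
        ([(["l", "i", "v", "z"], 1)], {"CAT": "V"}),
    ],
}


def select_variant(word, wanted):
    """Return the syllables of the first variant whose features include `wanted`.

    Returns None when the word is unknown or no variant matches.
    """
    for syllables, features in LEXICON.get(word, []):
        if all(features.get(name) == value for name, value in wanted.items()):
            return syllables
    return None


noun = select_variant("lives", {"CAT": "N"})  # `Cats have nine lives'
verb = select_variant("lives", {"CAT": "V"})  # `He lives in Japan'
print(noun)
print(verb)
```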

Currently there is no morphological analysis, which means all words and their inflections (and derivations) need to be explicitly included in the lexicon. This is tedious and is an area for future improvement.

Creating and Defining a Lexicon

A distinct lexicon may be created using the command

     (Lexicon Select name)

If the named lexicon already exists it is selected; if it does not, a new (empty) one is created.

Three or four items must be defined in `$CHATR_ROOT/lib/data/lexicons.ch' for a lexicon to be accessible to CHATR. They are

A phoneme set.
The compiled lexicon.
An instruction of what to do if a word is not found in the lexicon.
A set of addenda (optional).

This is achieved after creation and compilation (see section Lexicon Compilation) using the following commands

(Lexicon Phone_Set phoneset-name)

Define the phoneme set for the lexicon.

(Lexicon Use file-name)

Identify a file compiled using the Lexicon Compile command. Optional, depending on whether the lexicon has been compiled.

(Lexicon Fail fail_action)

This identifies what will happen if a given word is not found in the lexicon. Possible actions are

Error: Signal an error (the default).
LTS: Use letter-to-sound rules to provide a pronunciation. The rules used by CHATR are those developed by the US Naval Research Laboratory, Washington DC.
JLTS: Use Japanese letter-to-sound rules. This assumes the word is in romaji.
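
The effect of the fail action can be pictured with a short Python sketch. This is a model of the behaviour described above, not CHATR's implementation. `lookup` and `toy_lts` are hypothetical names, and `toy_lts` is a deliberately trivial stand-in for real letter-to-sound rules.

```python
def lookup(word, entries, fail_action="Error", lts=None):
    """Return the pronunciation for `word`, applying the fail action on a miss."""
    if word in entries:
        return entries[word]
    if fail_action == "Error":
        # The default: a missing word is reported as an error.
        raise KeyError(f"word not in lexicon: {word}")
    if fail_action in ("LTS", "JLTS"):
        if lts is None:
            raise ValueError(f"{fail_action} selected but no rule function supplied")
        return lts(word)
    raise ValueError(f"unknown fail action: {fail_action}")


def toy_lts(word):
    """Hypothetical stand-in for letter-to-sound rules: one symbol per letter."""
    return [(list(word), 1)]


entries = {"dog": [(["d", "oh", "g"], 0)]}
print(lookup("dog", entries))
print(lookup("cat", entries, fail_action="LTS", lts=toy_lts))
```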

(Lexicon Add PHONEMESET-NAME entries...)

Add one or more entries to the lexicon. Optional - if you don't have anything to add, don't use it! Note that PHONEMESET-NAME need not be the same as the current lexicon's phoneme-set name, provided a mapping is available.

As an example, the current entry for the `beep' lexicon is

     (define setup_beep_lex ()
       (require 'beep_def)
       (Lexicon Select beep)
       (Lexicon Phone_Set beep)
       (Lexicon Use beep_compiled_lex)
       (Lexicon Fail LTS)
       (Lexicon Add beep
         (chatr     (((ch ae) (1)) ((t ax) (0))))
         (sally     (((s ae) (1)) ((l iy) (0))))
         (today     (((t ax) (0)) ((d ey) (1))))
         (ATR       (((ey) (1)) ((t iy) (0)) ((aa) (1))))
         (the       (((dh ax) (0))))
         (synthesis (((s ih n) (1)) ((th ax) (0)) ((s ih s) (0))))
         (dog       (((d oh g) (0))))
         )
     )

Lexicon Compilation

A given set of entries may be used in one of two different ways: compiled or directly. For large lists of entries, compiling is highly recommended, as access will be significantly faster than by the direct specification method. Access time is not the only consideration; loading a full lexicon (tens of thousands of entries) directly is a far bigger cost.

Before proceeding with compilation, the phoneme set used by the lexicon must be selected. The command is

     (Phoneme Internal_Set PHONEME-SET-NAME)

The lexicon may now be compiled using the function Lexicon Compile. This takes a file containing the entire lexicon and generates a file suitable as an argument to the function Lexicon Use.(7) Basically it checks the format of the entries and sorts them, so that a binary search is possible. Note that the format of the compiled form may be quite different from that of the source.
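
The check-and-sort idea, and why sorting makes lookup fast, can be sketched in Python. This is only a model under stated assumptions: it says nothing about CHATR's actual compiled file format, and the names `compile_entries` and `binary_lookup` are hypothetical. The sample entries are taken from the `beep' addenda shown earlier.

```python
import bisect


def compile_entries(entries):
    """Check each (word, syllables) entry and return the entries sorted by word."""
    for entry in entries:
        word, syllables = entry[0], entry[1]
        if not isinstance(word, str) or not syllables:
            raise ValueError(f"malformed entry: {entry!r}")
        for phonemes, stress in syllables:
            if not phonemes or stress not in (0, 1):
                raise ValueError(f"malformed syllable in {word!r}")
    return sorted(entries, key=lambda entry: entry[0])


def binary_lookup(compiled, word):
    """Find `word` in a sorted entry list by binary search; None if absent."""
    # (word,) sorts just before any (word, syllables) tuple, so bisect_left
    # lands on the first entry for that word without comparing syllables.
    index = bisect.bisect_left(compiled, (word,))
    if index < len(compiled) and compiled[index][0] == word:
        return compiled[index]
    return None


source = [
    ("chatr", [(["ch", "ae"], 1), (["t", "ax"], 0)]),
    ("sally", [(["s", "ae"], 1), (["l", "iy"], 0)]),
    ("today", [(["t", "ax"], 0), (["d", "ey"], 1)]),
    ("the", [(["dh", "ax"], 0)]),
    ("dog", [(["d", "oh", "g"], 0)]),
]
compiled = compile_entries(source)
print([entry[0] for entry in compiled])
print(binary_lookup(compiled, "dog"))
```

A binary search over n sorted entries needs about log2(n) comparisons, which is why sorting once at compile time pays off for lexicons with tens of thousands of entries.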

Two file-names are required as arguments to Lexicon Compile. The form is

     (Lexicon Compile "IN-FILE" "OUT-FILE")

`OUT-FILE' names the file in which the sorted lexicon will be stored.

The `IN-FILE' file should contain a call to the function Lexicon in the form of one s-expression. The syntax is

     (Lexicon 
       entry~1
       entry~2
      ...
       entry~n
       )

If the Lisp variable lexicon_syllabify is set, the entries can be in a different format and CHATR will attempt to syllabify them automatically. The result will not be perfect, but it does offer a way to deal automatically with large imported lexicons where we have little control over the input form. In this format the phonemes are given not as bracketed syllables but simply as a flat list of phonemes (the format in which the CMU and BEEP lexicons are distributed). If the digits 1 or 2 are appended to vowels, they are removed and the syllable containing that vowel is marked as stressed. As an example, an input entry like this

     ("abductive" (ae b d ah1 k t ih v))

would automatically be converted to

     ("abductive" (((ae b) (0)) ((d ah k) (1)) ((t ih v) (0))))
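
The conversion above can be imitated with a short Python sketch. The stress-digit handling follows the description in the text. The boundary rule, however, is an assumption chosen only because it reproduces this one example (the consonant directly before a vowel starts that vowel's syllable, and earlier consonants close the previous one). CHATR's real syllabification rules are not documented here and may differ. `VOWELS` and `syllabify` are hypothetical names.

```python
# Hypothetical vowel inventory covering only the example; a real phoneme-set
# definition would supply the full list.
VOWELS = {"ae", "ah", "ih"}


def syllabify(phonemes):
    """Split a flat phoneme list into (phonemes, stress) syllables."""
    # Strip stress digits, remembering which vowel positions carried one.
    clean, stressed = [], set()
    for p in phonemes:
        if p[-1] in "12" and p[:-1] in VOWELS:
            stressed.add(len(clean))
            p = p[:-1]
        clean.append(p)
    nuclei = [i for i, p in enumerate(clean) if p in VOWELS]
    if not nuclei:
        return [(clean, 0)]
    # Assumed boundary rule: one consonant goes to the following vowel,
    # any others stay with the preceding syllable.
    starts = [0]
    for prev, cur in zip(nuclei, nuclei[1:]):
        starts.append(cur - 1 if cur - prev > 1 else cur)
    bounds = starts + [len(clean)]
    syllables = []
    for begin, end in zip(bounds, bounds[1:]):
        stress = 1 if any(i in stressed for i in range(begin, end)) else 0
        syllables.append((clean[begin:end], stress))
    return syllables


result = syllabify(["ae", "b", "d", "ah1", "k", "t", "ih", "v"])
print(result)
```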

Once compiled the lexicon must be defined. See section Creating and Defining a Lexicon, for details.

Accessing a Lexicon

Lexicon Interrogation

It is possible to directly access a lexicon without creating an utterance. The command is

     (Lexicon Lookup word)

The system returns the word in syllable-groups with stress marked, using the phoneme-set of the currently selected speaker. As an example, if the current speaker uses the `mrpa' phoneset, the above command will cause the system to respond with

     (word (((w @@ d) (1))))

Of course, replacing `word' in the above example with another word will cause the details of that word to be returned.

Lexicon Modification

It may be that a particular phonetic rendering of a word doesn't suit an application. Features may need to change to represent a dialect or speech manner. This may be achieved using the command

     (Lexicon Add [PHONESET-NAME] entry~1 entry~2... entry~n)

where the terms are

PHONESET-NAME: Optional. If not specified, assumes the phoneme-set of the currently selected speaker.
entry~1 - entry~n: Lexical entries, each giving a word with its phonemes and stress levels (see section Lexicon Entries). Phonemes must be from the phoneme-set used by the currently selected speaker.

Referring to the previous example, some might prefer a stronger sounding of the `r' in `word'. Such a new entry would be

     (Lexicon Add mrpa (word (((w @@ r d) (1)))))

Note that modifications must be entered using phonetic symbols from the phoneme-set used by the currently selected speaker. For instance, entry of `@@' (used by `mrpa' but not by `beep') when a `beep'-coded speaker is selected will result in an error unless a mapping between the two sets exists, in which case the mapping is used. See section Phoneme Set Definitions, for information on obtaining phoneme lists.

Modifications stay in effect until they are changed again, a different speaker is selected, or the current CHATR session ends.
