

Various Short Useful(?) Scripts

During the development, testing and use of CHATR, many `awk' and `sed' one-liners evolved. Since these are heavily environment-specific, and users may in any case prefer to write their own, they are not included in the main text of the manual but are gathered together in this appendix.

Preparing an English Database

The following example shows how simple scripts were generated and used to prepare an English speech database for training.

For ease of explanation the following assumptions are made: the original waveform and phoneme files were located in database `$ROOT/Communal_db/A/' as `.d' and `.phones' files respectively; the speech database was to be created under `$ROOT/Speech_dbs/'. In keeping with local naming conventions, the speech database name `MAA' was chosen. All these names will of course vary from database to database depending on personal preferences.

To copy the wave files while removing headers, go to `$ROOT/Communal_db/A/' and use

     for i in *.d
       do n=`basename $i .d`
       bhd $i >$ROOT/Speech_dbs/MAA/wav/$n.wav
       done     

To copy the label files, go to `$ROOT/Communal_db/A/' and use

     for i in *.phones
       do n=`basename $i .phones`
       cp $i $ROOT/Speech_dbs/MAA/lab/$n.lab
       done     
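After both copy loops have run, it can be worth confirming that every waveform has a matching label file. The following is a minimal sketch, not part of CHATR: the function name `missing_labels' is hypothetical, and it assumes the `wav' and `lab' directories sit side by side under the directory given as its argument.

```shell
# Hypothetical helper (not part of CHATR): print the basename of every
# wav/NAME.wav under directory $1 that has no matching lab/NAME.lab.
missing_labels() {
  for w in "$1"/wav/*.wav; do
    [ -e "$w" ] || continue            # glob matched nothing: wav/ is empty
    n=`basename "$w" .wav`
    [ -f "$1/lab/$n.lab" ] || echo "$n"
  done
}
```

For the database in this example it would be called as `missing_labels $ROOT/Speech_dbs/MAA'; no output means that every waveform has a label file.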

The following script may not be necessary in all cases; the problem was that in this database `silence' was labeled as `SIL' rather than `pau' as defined in the specified phoneme-set. To replace them, in the Speaker Top Directory use

     mv lab tmp_lab
     mkdir lab
     for i in tmp_lab/*.lab
       do n=`basename $i .lab`
       sed 's/SIL/pau/g' $i >lab/$n.lab
       done
     cd tmp_lab
     rm *.lab
     cd ..
     rmdir tmp_lab
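The same sequence (move `lab' aside, filter every file into a fresh `lab', then delete the temporary copy) recurs in the other label-fixing scripts of this appendix. As a sketch only, it could be wrapped in a shell function; `filter_labs' is a hypothetical name, not a CHATR command, and unlike the script above it keeps `tmp_lab' if any filter command fails.

```shell
# Hypothetical helper (not part of CHATR).
# Usage: filter_labs COMMAND [ARGS...]   (run in the Speaker Top Directory)
# Pipes each lab/NAME.lab through COMMAND and writes the result back
# under the same name. The originals are kept in tmp_lab on any failure.
filter_labs() {
  mv lab tmp_lab || return 1
  mkdir lab || return 1
  status=0
  for i in tmp_lab/*.lab; do
    [ -e "$i" ] || continue            # no label files at all
    n=`basename "$i" .lab`
    "$@" <"$i" >"lab/$n.lab" || status=1
  done
  if [ "$status" -eq 0 ]; then rm -r tmp_lab; fi
  return "$status"
}
```

With this in place, the silence relabelling above would read `filter_labs sed 's/SIL/pau/g''.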

The following script may not be necessary in all cases; the problem was that some phonemes were stress-marked (e.g. `aa+1' instead of just `aa') and hence not recognized as part of the specified phoneme set. To remove these marks, in the Speaker Top Directory use

     mv lab tmp_lab
     mkdir lab
     for i in tmp_lab/*.lab
       do n=`basename $i .lab`
       sed 's/+1//g' $i >lab/$n.lab
       done
     cd tmp_lab
     rm *.lab
     cd ..
     rmdir tmp_lab
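The `sed' expression above removes only the mark `+1'. If a database uses other stress levels as well, a slightly more general filter may be needed. The sketch below assumes (this is not stated for the database described here) that every stress mark is a `+' followed by a single digit, and that no other text in the label files has that form; a timing written in exponent notation such as `1e+05' would be damaged by it. The function name `strip_stress' is hypothetical.

```shell
# Hypothetical filter (not part of CHATR): delete every "+" followed by
# one digit. Reads the named files, or standard input if none are given.
# In a basic regular expression "+" is an ordinary character.
strip_stress() {
  sed 's/+[0-9]//g' "$@"
}
```

It can replace the `sed' command inside the loop, as in `strip_stress $i >lab/$n.lab'.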

The following script may not be necessary in all cases; the problem was that some labels were tagged with stress information (in this instance `;*' and `; *') which was not required. These tags occupy the fourth and fifth fields of a label line, so the script blanks both fields. To remove them, in the Speaker Top Directory use

     mv lab tmp_lab
     mkdir lab
     for i in tmp_lab/*.lab
       do n=`basename $i .lab`
       awk '{$4="";$5="";print}' $i >lab/$n.lab
       done
     cd tmp_lab
     rm *.lab
     cd ..
     rmdir tmp_lab
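Blanking fields in `awk' leaves the field separators behind, so each line written by the script above ends in blanks (and a one-field header line gains four of them). Where that matters, the trailing blanks can be trimmed in the same pass. This is a sketch under the same assumption as the script, namely that the tags are fields four and five; the function name `strip_tags' is hypothetical.

```shell
# Hypothetical filter (not part of CHATR): blank fields 4 and 5, then
# remove the trailing blanks that awk leaves when it rebuilds the line.
# Reads the named files, or standard input if none are given.
strip_tags() {
  awk '{ $4 = ""; $5 = ""; sub(/ +$/, ""); print }' "$@"
}
```

It can replace the `awk' command inside the loop, as in `strip_tags $i >lab/$n.lab'.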

The following script may not be necessary in all cases; the problem was that the timings in each label-file did not start from zero but were offset sequentially from the beginning of the original speech corpus.

Correction is in two parts. Before performing the first part, create a directory below the Speaker Top Directory to contain the start-time files using

     mkdir start_times

To extract the start-times from the wave (`.d') files, go to `$ROOT/Communal_db/A/' and use

     for i in *.d
       do n=`basename $i .d`
       hditem -i start_time $i >$ROOT/Speech_dbs/MAA/start_times/$n.st
       done

To zero-reference label-file start-times, in the Speaker Top Directory use

     mv lab tmp_lab
     mkdir lab
     for i in tmp_lab/*.lab
       do n=`basename $i .lab`
       awk '{if(NF!=1)$1-='`cat start_times/$n.st`';print}' $i >lab/$n.lab
       done
     cd tmp_lab
     rm *.lab
     cd ..
     rmdir tmp_lab
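The subtraction can also be written with the offset passed in as an `awk' variable rather than spliced into the program text, which is easier to read and to try out on a single file. The sketch below assumes an `awk' that supports the `-v' option; the function name `zero_ref' and the sample values in the comments are illustrative only.

```shell
# Hypothetical helper (not part of CHATR).
# Usage: zero_ref START_TIME LABEL_FILE
# Subtracts START_TIME from the first field of every line that does not
# consist of exactly one field (such lines are copied unchanged).
zero_ref() {
  awk -v st="$1" '{ if (NF != 1) $1 -= st; print }' "$2"
}

# Illustration with made-up values: given a start time of 12.25, a line
#   12.5 121 pau
# is written as
#   0.25 121 pau
```

Inside the loop it would be called as `zero_ref `cat start_times/$n.st` $i >lab/$n.lab'. Note that `awk' converts the result back to text with its default numeric format, which keeps six significant digits, so long timings may be rounded.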

Preparing a German Database

See the papers `Teaching CHATR German Intonation' (Striegnitz 97) and `German in Eight Weeks' (Brinkmann 97) for details.

