During the development, testing and use of CHATR, many `awk' and `sed' one-liners have evolved. Because they are heavily environment-specific, and because users may prefer to write their own, they are not included in the main text of the manual but are gathered together in this appendix.
The following example shows how simple scripts were generated and used to prepare an English speech database for training.
For ease of explanation the following assumptions are made: the original waveform and phoneme files were located in database `$ROOT/Communal_db/A/' as `.d' and `.phones' files respectively; the speech database was to be created under `$ROOT/Speech_dbs/'. In keeping with local naming conventions, the speech database name `MAA' was chosen. All these names will of course vary from database to database depending on personal preferences.
To copy the wave files while removing headers, go to `$ROOT/Communal_db/A/' and use
    for i in *.d
    do
        n=`basename $i .d`
        bhd $i >$ROOT/Speech_dbs/MAA/wav/$n.wav
    done
To copy the label files, go to `$ROOT/Communal_db/A/' and use
    for i in *.phones
    do
        n=`basename $i .phones`
        cp $i $ROOT/Speech_dbs/MAA/lab/$n.lab
    done
The following script may not be necessary in all cases; the problem was that in this database `silence' was labeled as `SIL' rather than `pau' as defined in the specified phoneme-set. To replace them, in the Speaker Top Directory use
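    # Keep the originals in tmp_lab while the corrected files are written to lab
    mv lab tmp_lab
    mkdir lab
    for i in tmp_lab/*.lab
    do
        n=`basename $i .lab`
        sed 's/SIL/pau/g' $i >lab/$n.lab
    done
    # Remove the originals, then the emptied directory
    cd tmp_lab
    rm *.lab
    cd ..
    rmdir tmp_lab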
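Note that `s/SIL/pau/g' replaces the string `SIL' wherever it occurs on a line, including inside a longer label. The following is a minimal sketch of a stricter alternative, assuming each label line holds whitespace-separated fields with the phoneme in the third field; that layout, and the sample label `SILK', are assumptions made for illustration only.

```shell
# Hypothetical sample lines: time, colour code, label (assumed layout).
sample='0.25 121 SIL
0.50 121 SILK
0.75 121 aa'

# Plain substitution, as in the script above: also alters "SILK".
plain=$(printf '%s\n' "$sample" | sed 's/SIL/pau/g')

# Substitution restricted to an exact match on the third field.
exact=$(printf '%s\n' "$sample" | awk '$3 == "SIL" { $3 = "pau" } { print }')

printf '%s\n' "$plain"
printf '%s\n' "$exact"
```

If no label in the database contains `SIL' as part of a longer name, the two forms give the same labels and the simpler `sed' version is sufficient.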
The following script may not be necessary in all cases; the problem was that some phonemes were stress-marked (e.g. `aa+1' instead of just `aa') and hence not recognized as part of the specified phoneme-set. To remove these marks, in the Speaker Top Directory use
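    # Keep the originals in tmp_lab while the corrected files are written to lab
    mv lab tmp_lab
    mkdir lab
    for i in tmp_lab/*.lab
    do
        n=`basename $i .lab`
        sed 's/+1//g' $i >lab/$n.lab
    done
    # Remove the originals, then the emptied directory
    cd tmp_lab
    rm *.lab
    cd ..
    rmdir tmp_lab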
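The label-fixing scripts in this appendix all follow the same pattern: move `lab' aside, filter every file into a fresh `lab', then delete the originals. As a rough sketch only, that pattern could be wrapped in a shell function. The name `filter_labels' and the sample file `utt1.lab' are hypothetical and not part of CHATR; the function must be called from the Speaker Top Directory.

```shell
# filter_labels CMD [ARGS...]  (hypothetical helper, not part of CHATR)
# Runs CMD ARGS FILE on every lab/*.lab and replaces lab/ with the output.
filter_labels() {
    mv lab tmp_lab || return 1
    mkdir lab || return 1
    for i in tmp_lab/*.lab
    do
        n=`basename "$i" .lab`
        # On failure, stop and leave the originals in tmp_lab.
        "$@" "$i" >"lab/$n.lab" || return 1
    done
    rm -r tmp_lab
}

# Demonstration on a throw-away directory with one invented label file.
here=`pwd`
work=`mktemp -d`
cd "$work" || exit 1
mkdir lab
printf '0.10 121 aa+1\n0.20 121 SIL\n' >lab/utt1.lab

filter_labels sed 's/+1//g'
filter_labels sed 's/SIL/pau/g'

result=`cat lab/utt1.lab`
printf '%s\n' "$result"

cd "$here" || exit 1
rm -r "$work"
```

Because the originals are deleted once every file has been filtered, it is worth keeping a separate copy of `lab' until the results have been checked.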
The following script may not be necessary in all cases; the problem was that some labels were tagged with stress information (in this instance `;*' and `; *') which were not required. To remove them, in the Speaker Top Directory use
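    # Keep the originals in tmp_lab while the corrected files are written to lab
    mv lab tmp_lab
    mkdir lab
    for i in tmp_lab/*.lab
    do
        n=`basename $i .lab`
        # Blank the fourth and fifth fields, which hold the stress tags
        awk '{$4="";$5="";print}' $i >lab/$n.lab
    done
    # Remove the originals, then the emptied directory
    cd tmp_lab
    rm *.lab
    cd ..
    rmdir tmp_lab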
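Blanking a field in `awk' does not remove its separator, so the lines written by the script above end in trailing spaces. The sketch below compares this with printing only the first three fields. The sample line is invented, and the three-field form assumes that every line has at least three fields and that nothing after the third field is wanted.

```shell
# Hypothetical label line with a stress tag in fields 4 and 5.
line='0.35 121 aa ; *'

# As in the script above: fields 4 and 5 are emptied, separators remain.
blanked=$(printf '%s\n' "$line" | awk '{$4="";$5="";print}')

# Alternative: print only the first three fields.
trimmed=$(printf '%s\n' "$line" | awk '{print $1, $2, $3}')

# Brackets make any trailing spaces visible.
printf '[%s]\n' "$blanked"
printf '[%s]\n' "$trimmed"
```

Whether the trailing spaces matter depends on the program that reads the label files; if it splits lines on white space, both forms are equivalent.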
The following script may not be necessary in all cases; the problem was that the timings in each label-file did not start from zero but were offset sequentially from the beginning of the original speech corpus.
Correction is in two parts. Before performing the first part, create a directory below the Speaker Top Directory to contain the start-time files using
mkdir start_times
To extract the start-times from the wave (`.d') files, go to `$ROOT/Communal_db/A/' and use
    for i in *.d
    do
        n=`basename $i .d`
        hditem -i start_time $i >$ROOT/Speech_dbs/MAA/start_times/$n.st
    done
To zero-reference label-file start-times, in the Speaker Top Directory use
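    # Keep the originals in tmp_lab while the corrected files are written to lab
    mv lab tmp_lab
    mkdir lab
    for i in tmp_lab/*.lab
    do
        n=`basename $i .lab`
        # Subtract the file's start-time from field 1 of every multi-field line
        awk '{if(NF!=1)$1-='`cat start_times/$n.st`';print}' $i >lab/$n.lab
    done
    # Remove the originals, then the emptied directory
    cd tmp_lab
    rm *.lab
    cd ..
    rmdir tmp_lab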
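The `awk' command above splices the start-time into the program text by closing and reopening the single quotes. A minimal sketch of the same subtraction, passing the value with `-v' instead, is given below. The file names and contents are invented, and it is assumed that each `.st' file holds a single number.

```shell
# Throw-away directory with one invented start-time file and label file.
work=`mktemp -d`
printf '12.5\n' >"$work/utt1.st"
printf '#\n12.75 121 pau\n13.5 121 aa\n' >"$work/utt1.lab"

# Read the offset, then subtract it from field 1 of every line that has
# more than one field; single-field lines such as "#" pass through as-is.
offset=`cat "$work/utt1.st"`
zeroed=$(awk -v offset="$offset" 'NF != 1 { $1 -= offset } { print }' "$work/utt1.lab")

printf '%s\n' "$zeroed"
rm -r "$work"
```

With `-v' the awk program stays inside one pair of quotes, and an empty or missing start-time file no longer produces a syntactically broken awk program.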
See the papers Teaching CHATR German Intonation - Striegnitz 97 and German in Eight Weeks - Brinkmann 97 for details.