The process involves creating an index of phones and their prosodic characteristics for each utterance in the corpus. The re-sequencing synthesiser doesn't necessarily produce any sounds; it merely determines an optimal sequence for random-access replay from the original speech, giving the best approximation to a desired utterance from the segments available in a given speech corpus. The synthesis method is independent of language and speaker, but requires a sufficient source database that represents a balanced sample of the language.
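As an illustration of the indexing step, here is a minimal sketch in Python. The record layout and the names `Segment` and `build_index` are hypothetical, chosen only for this example; the particular prosodic features stored (mean pitch and power) are likewise an assumption, not a description of the actual index format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One phone-sized stretch of an utterance in the corpus (hypothetical layout)."""
    utterance_id: str
    start: float   # seconds from the start of the utterance
    end: float     # seconds from the start of the utterance
    phone: str     # phone label
    f0: float      # assumed prosodic feature: mean pitch in Hz
    power: float   # assumed prosodic feature: mean power

    @property
    def duration(self) -> float:
        return self.end - self.start


def build_index(segments):
    """Group segments by phone label so candidates can be looked up directly."""
    index = defaultdict(list)
    for seg in segments:
        index[seg.phone].append(seg)
    return dict(index)


corpus = [
    Segment("utt001", 0.00, 0.08, "k", 0.0, 52.0),
    Segment("utt001", 0.08, 0.21, "a", 118.0, 68.0),
    Segment("utt002", 0.30, 0.45, "a", 142.0, 71.0),
]
index = build_index(corpus)
```

Because each entry keeps only an utterance identifier and time offsets, the index points back into the original recordings rather than holding audio, which matches the random-access replay described above.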
To find the optimal sequence of segments for concatenation, the synthesiser selects from among the candidates in the database using a weighted combination of their acoustic and prosodic features, maximising continuity between segments while at the same time minimising the distance of each from its prosodic target. Optimal performance is achieved by under-specification of prosody, so that only key points in the utterance have targets and the remainder are considered prosodically neutral. In conjunction with loose selection of units from a continuous-speech corpus, prosodic under-specification maximises the number of candidate segments and uses the redundancy of information in natural speech to reduce or eliminate distortions in the output synthesis.
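The selection described above can be sketched as a search over candidate sequences. The example below is a rough illustration only: it uses a standard dynamic-programming (Viterbi-style) search, which is an assumption, since the text does not say how the optimum is found. The function names, the dictionary keys (`utt`, `pos`, `f0`), the absolute-difference distances and the weights are all hypothetical. A target of `None` stands for a prosodically neutral position, and segments that were adjacent in the original corpus are assumed to join at no cost.

```python
def target_cost(candidate, target, weights):
    """Weighted distance from the prosodic target; neutral positions cost nothing."""
    if target is None:
        return 0.0
    return sum(weights[k] * abs(candidate[k] - v) for k, v in target.items())


def join_cost(prev, cur, weights):
    """Weighted discontinuity between two consecutive segments."""
    if prev["utt"] == cur["utt"] and prev["pos"] + 1 == cur["pos"]:
        return 0.0  # assumed: naturally adjacent segments join perfectly
    return sum(w * abs(prev[k] - cur[k]) for k, w in weights.items())


def select_units(candidates, targets, target_w, join_w):
    """Return the lowest-cost sequence, one candidate per position."""
    best = [[target_cost(c, targets[0], target_w) for c in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for i in range(1, len(candidates)):
        costs, ptrs = [], []
        for c in candidates[i]:
            # Cheapest way to reach this candidate from the previous position.
            j, total = min(
                ((j, best[i - 1][j] + join_cost(p, c, join_w))
                 for j, p in enumerate(candidates[i - 1])),
                key=lambda pair: pair[1],
            )
            costs.append(total + target_cost(c, targets[i], target_w))
            ptrs.append(j)
        best.append(costs)
        back.append(ptrs)
    # Trace back from the cheapest final candidate.
    k = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][k])
        k = back[i][k]
    return path[::-1]


a = {"utt": "u1", "pos": 3, "f0": 120.0}
b = {"utt": "u2", "pos": 0, "f0": 200.0}
c = {"utt": "u1", "pos": 4, "f0": 125.0}
d = {"utt": "u2", "pos": 7, "f0": 119.0}
weights = {"f0": 1.0}

# No targets at all: the naturally adjacent pair from utterance u1 is chosen.
neutral = select_units([[a, b], [c, d]], [None, None], weights, weights)

# A pitch target on the first position only pulls the choice towards b.
targeted = select_units([[a, b], [c, d]], [{"f0": 200.0}, None], weights, weights)
```

With every position neutral, the search falls back on continuity alone and reuses a contiguous stretch of the original speech; adding a single target changes only what is needed to meet it.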
Page last updated: 23rd January 1998.