Applied Speech and Audio Processing: With MATLAB Examples
7.6. Speech synthesis
can be stored along with a pitch parameter to enable recreation of speech. Furthermore, advances in the processing of speech can allow for post-processing of a stitched-together sentence to improve naturalness – for example by imposing an overall pitch contour (with an end-of-sentence downward tail – or upward for a fake Australian accent).

At a basic technology level, stored voice playback is an example of a concatenative system, which concatenates, or strings together, sequences of sounds to synthesise an output [20]. Word-level or sentence-level concatenation is a far simpler, but less general, speech synthesis solution than phoneme concatenation, which we will discuss next.

7.6.2 Text-to-speech systems

Text-to-speech describes the process of turning written words into audible speech. A readable overview of these systems is provided by Dutoit [21]. At their simplest, these systems can operate with only single words, and although they can use voice storage and playback, most normally use either stored or generated phonemes. More complicated systems handle entire sentences at a time.

What is needed at a word level, then, is a dictionary or heuristic which relates each written word to a sequence of phonemes: in effect, a rule to pronounce each word. In English, this is a non-trivial task, because spelling is often not phonetic: there are many words which must be pronounced in a way contrary to a phonetic reading of their spelling. In Chinese, the task is far easier on the one hand because, apart from a few characters having dual pronunciation (normally distinguishable through context), there is a straightforward mapping between the character to be read and a pronunciation. However, the exact pronunciation must be stored somewhere for whatever characters are supported, which would be at least 3000 for a basic newspaper, rising to 13 000 for scholarly works.
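The word-level mapping described above, a dictionary or heuristic relating each written word to a phoneme sequence, might be sketched in MATLAB as follows. This is only an illustration under stated assumptions: the variable names (`exceptions`, `letter2sound`), the phoneme symbols, and the one-letter-one-sound rule are all hypothetical and far cruder than anything a practical synthesiser would use.

```matlab
% Minimal sketch (hypothetical): word-to-phoneme conversion using a small
% exceptions dictionary, with a naive letter-to-sound rule as the fallback.
% The phoneme symbols and all table entries are illustrative assumptions.

% Exceptions dictionary for words whose spelling is not phonetic
exceptions = containers.Map();
exceptions('one') = 'w ah n';
exceptions('two') = 't uw';

% Naive heuristic: one assumed sound per letter
letters = {'a','b','c','d','e','f','g','h','i','j','k','l','m', ...
           'n','o','p','q','r','s','t','u','v','w','x','y','z'};
sounds  = {'ae','b','k','d','eh','f','g','hh','ih','jh','k','l','m', ...
           'n','aa','p','k','r','s','t','ah','v','w','k s','y','z'};
letter2sound = containers.Map(letters, sounds);

words = {'cat', 'one', 'two'};
phonemes = cell(size(words));

for i = 1:numel(words)
    w = lower(words{i});
    if isKey(exceptions, w)
        % Irregular word: the stored pronunciation is used directly
        phonemes{i} = exceptions(w);
    else
        % Regular word: guess by reading the spelling letter by letter
        parts = {};
        for k = 1:numel(w)
            if isKey(letter2sound, w(k))   % skip anything outside a-z
                parts{end+1} = letter2sound(w(k));
            end
        end
        phonemes{i} = strjoin(parts, ' ');
    end
    fprintf('%-6s -> %s\n', words{i}, phonemes{i});
end
```

With these assumed tables, 'cat' is read letter by letter as `k ae t`, whereas 'one' is taken from the dictionary as `w ah n`; the letter-by-letter rule alone would have produced `aa n eh`, which shows why English needs the stored exceptions.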
Moving back to English, it is common in TTS systems, including most commercial speech synthesisers, for there to be procedures which guess at a phonetic spelling, with a dictionary to override this guess in irregular cases. In fact, early-years school children tend to learn their English in a similar way: if in doubt, pronounce phonetically, but learn any exceptions to the rule. The primary difference is that TTS systems do not learn new pronunciations when listeners laugh at their incorrect attempts.

7.6.3 Linguistic transcription systems

Text-to-speech systems are all very well, but humans do not simply read words in isolation, and that is one reason why the word playback systems of Section 7.6.1 tend to sound unnatural. Humans modulate their speech over a sentence, based upon the syntax of what is being said. Most speakers also add stress and intonation differences to particular words, either to change the meaning of what is being said (see the sentences reproduced with different stressed words in Section 7.5.4), or at least to convey emotional information