Applied Speech and Audio Processing: With MATLAB Examples

7.6. Speech synthesis
can be stored along with a pitch parameter to enable recreation of speech. Furthermore,
advances in the processing of speech can allow for post-processing of a stitched-together
sentence to improve naturalness – for example by imposing an overall pitch contour (with
an end-of-sentence downward tail – or upward for a fake Australian accent).
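
As a rough illustration of imposing a pitch contour on a stitched-together sentence, the following Python sketch scales a list of per-frame pitch values so that the final part of the sentence falls away. The function name `impose_pitch_contour`, the frame-based pitch representation, and the parameters `tail_fraction` and `tail_drop` are assumptions made for this example and are not taken from the text; a real system would also need to resynthesise the audio from the modified pitch track.

```python
def impose_pitch_contour(frame_pitches_hz, tail_fraction=0.25, tail_drop=0.2):
    """Scale per-frame pitch values so the end of the sentence tails off.

    frame_pitches_hz: pitch value (Hz) for each frame of the stitched sentence.
    tail_fraction:    portion of the sentence (0..1) covered by the tail.
    tail_drop:        relative pitch reduction reached at the final frame
                      (a negative value gives an upward tail instead).
    All three parameters are hypothetical choices for this sketch.
    """
    n = len(frame_pitches_hz)
    tail_start = int(round(n * (1.0 - tail_fraction)))
    tail_len = n - tail_start
    shaped = []
    for i, f0 in enumerate(frame_pitches_hz):
        if i < tail_start or tail_len == 0:
            scale = 1.0  # leave the body of the sentence unchanged
        else:
            # Linear ramp from just below 1.0 down to (1 - tail_drop)
            # at the last frame.
            progress = (i - tail_start + 1) / tail_len
            scale = 1.0 - tail_drop * progress
        shaped.append(f0 * scale)
    return shaped


if __name__ == "__main__":
    flat = [100.0] * 8          # a monotone sentence of eight frames
    print(impose_pitch_contour(flat))
```

A linear ramp is used only because it is the simplest shape to show; natural pitch contours are considerably more varied.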
At a basic technology level, stored voice playback is an example of a concatenative
system, which concatenates, or strings together, sequences of sounds to synthesise an
output [20]. Word-level or sentence-level concatenation is a far simpler, but less general,
speech synthesis solution than phoneme concatenation, which we will discuss next.
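
Word-level concatenation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function `concatenate_words`, the `word_store` dictionary of pre-recorded sample lists, and the optional `gap_samples` of silence between words are all hypothetical, and the tiny sample lists stand in for real recordings.

```python
def concatenate_words(sentence, word_store, gap_samples=0):
    """Build an output waveform by stringing together stored word recordings.

    sentence:    text whose words are looked up one by one.
    word_store:  hypothetical dict mapping a lower-case word to its samples.
    gap_samples: number of zero-valued (silent) samples between words.
    """
    output = []
    for word in sentence.lower().split():
        if word not in word_store:
            # The vocabulary is limited to whatever was recorded in advance.
            raise KeyError(f"no stored recording for {word!r}")
        if output and gap_samples:
            output.extend([0.0] * gap_samples)
        output.extend(word_store[word])
    return output


if __name__ == "__main__":
    # Placeholder "recordings": real ones would hold thousands of samples.
    store = {"turn": [0.1, 0.2], "left": [0.3]}
    print(concatenate_words("Turn left", store, gap_samples=2))
```

The `KeyError` branch shows why the approach is less general: any word that has not been recorded simply cannot be spoken.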
7.6.2 Text-to-speech systems
Text-to-speech describes the process of turning written words into audible speech. A
readable overview of these systems is provided by Dutoit [21]. At their simplest, these
systems can operate with only single words, and although they can use voice storage and
playback, most normally use either stored or generated phonemes. More complicated
systems handle entire sentences at a time.
What is needed at a word level then is a dictionary or heuristic which relates each
written word to a sequence of phonemes, in effect a rule to pronounce each word.
In English, this is a non-trivial task, because spelling is often not phonetic: there are
many words which must be pronounced in a way contrary to a phonetic reading of
their spelling. In Chinese, on the one hand, the task is far easier because, apart from a
few characters having dual pronunciation (normally distinguishable through context),
there is a straightforward mapping between the character to be read and its pronunciation.
On the other hand, the exact pronunciation must be stored somewhere for every character that is
supported – at least 3000 for a basic newspaper, rising to 13 000 for
scholarly works.
Moving back to English, it is common in TTS systems, including most commercial
speech synthesisers, for there to be procedures which guess at a phonetic spelling, with
a dictionary to override this guess in irregular cases. In fact early-years school children
tend to learn their English in a similar way: if in doubt pronounce phonetically, but learn
any exceptions to the rule. The primary difference is that TTS systems do not learn
new pronunciations when listeners laugh at their incorrect attempts.
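
The guess-then-override arrangement can be sketched as follows in Python. Everything here is a toy assumption made for illustration: the phoneme labels are loosely ARPAbet-like, the letter and digraph tables are deliberately crude, and the names `EXCEPTIONS`, `guess_phonemes` and `pronounce` are hypothetical rather than taken from any real synthesiser.

```python
# Hypothetical exception dictionary for irregular words.
EXCEPTIONS = {
    "one": ["W", "AH", "N"],
    "two": ["T", "UW"],
}

# Crude letter-to-sound tables (illustrative only, not a real rule set).
DIGRAPHS = {"sh": ["SH"], "ch": ["CH"], "th": ["TH"]}
LETTERS = {
    "a": ["AE"], "b": ["B"], "c": ["K"], "d": ["D"], "e": ["EH"],
    "f": ["F"], "g": ["G"], "h": ["HH"], "i": ["IH"], "j": ["JH"],
    "k": ["K"], "l": ["L"], "m": ["M"], "n": ["N"], "o": ["AA"],
    "p": ["P"], "q": ["K"], "r": ["R"], "s": ["S"], "t": ["T"],
    "u": ["AH"], "v": ["V"], "w": ["W"], "x": ["K", "S"], "y": ["Y"],
    "z": ["Z"],
}


def guess_phonemes(word):
    """Guess a pronunciation by reading the spelling phonetically."""
    word = word.lower()
    phonemes = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            phonemes.extend(DIGRAPHS[pair])
            i += 2
        else:
            # Characters without a rule (e.g. apostrophes) are skipped.
            phonemes.extend(LETTERS.get(word[i], []))
            i += 1
    return phonemes


def pronounce(word):
    """Use the exception dictionary if possible, else fall back to the guess."""
    key = word.lower()
    if key in EXCEPTIONS:
        return list(EXCEPTIONS[key])
    return guess_phonemes(key)


if __name__ == "__main__":
    print(pronounce("ship"))        # regular word: phonetic guess is used
    print(guess_phonemes("one"))    # what the guess alone would produce
    print(pronounce("one"))         # irregular word: dictionary overrides
```

Checking the dictionary first keeps the rule set small: the rules only have to be right for regular words, and each irregular word costs one dictionary entry.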
7.6.3 Linguistic transcription systems
Text-to-speech systems are all very well, but humans do not simply read words in isolation
– and that is one reason why the word playback systems of Section 7.6.1 tend to sound
unnatural. Humans modulate their speech over a sentence, based upon the syntax of what
is being said. Most speakers also add stress and intonation differences to particular words,
either to change the meaning of what is being said (see the sentences reproduced with
different stressed words in Section 7.5.4), or at least to convey emotional information
