Applied Speech and Audio Processing: With MATLAB Examples
7.5. Speech recognition
7.5.3 Practical speech recognition

Practical ASR systems share a generic structure, although the details vary quite widely. In the generic system shown in Figure 7.5, input speech is first cleaned up by a pre-processing stage before a feature vector is extracted. The pre-processing may take the form of filtering, probably windowing and normalisation, and some method of segmentation.

Following pre-processing, features are extracted from the speech. Many possible features can be used, including LPCs, LSPs, cepstral coefficients, spectral coefficients, and so on, although Mel-frequency cepstral coefficients (MFCCs) are probably the most popular at present, and there is of course no reason why the vector needs to contain just one type of feature. Each feature may comprise several tens of coefficients, and be updated every 20 ms.

In the simplest of systems, these features can then be compared, in turn, to a large set of stored features (an acoustic model). A distance measure (perhaps the Euclidean distance, but more often a weighted distance measure, and very commonly these days this role is taken over by a hidden Markov model) is computed for each of the stored features, and a probability assigned to each one. This probability identifies how well the current speech segment matches the stored features, and naturally the highest-probability match is the best one.

However, there is another level of refinement possible beyond this one: applying a language model (also shown in Figure 7.5) to weight the probabilities of the top few matches from the acoustic comparison according to their adherence to language rules. For example, if the highest-matching feature vector is found to be something disallowed in the language being spoken, then it probably should be rejected in favour of the second-highest matching feature vector.
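The acoustic matching step described above can be sketched in a few lines of code. This is a minimal illustration in Python rather than the book's MATLAB; the template labels and feature values are invented for the example, and a softmax over negative Euclidean distances stands in for the weighted distance measures or hidden Markov models that a real recogniser would use.

```python
import math

# Hypothetical acoustic model: each stored template is a tiny feature
# vector. A real system would use ~13-39 MFCCs per 20 ms frame; these
# labels and numbers are illustrative only.
templates = {
    "s":  [1.2, 0.4, -0.3],
    "sh": [1.0, 0.9, -0.1],
    "f":  [0.2, 0.5,  0.8],
}

def euclidean(a, b):
    """Plain Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_probabilities(frame):
    """Score an incoming feature vector against every stored template,
    turning distances into probabilities (softmax over -distance), so
    the closest template receives the highest probability."""
    scores = {label: math.exp(-euclidean(frame, t))
              for label, t in templates.items()}
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

frame = [1.1, 0.5, -0.2]            # one incoming speech frame
probs = match_probabilities(frame)  # probability per stored template
best = max(probs, key=probs.get)    # highest-probability match
```

In a full system the per-frame probabilities would be accumulated over a whole segment (for example by an HMM) rather than decided frame by frame, but the principle of converting a distance into a match probability is the same.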
A dictionary can be used to refine the matching further: only phonetic combinations found in the dictionary are allowed. Evidently, with the possibility of several hundred stored feature vector templates in the acoustic model, a similar number in the language model, and perhaps more in the dictionary, this whole matching process can be very slow. This is one reason why the vocabulary should be restricted, but also why the size of the feature vector should be minimised where possible. Much research has been done on restricting the amount of searching necessary during the matching process.

The language model, as described, considers the probability that the current speech is correctly matched given knowledge of the previous unit of matched speech. In general this history can extend back further than just the previous sound. An n-gram language model looks back at the past n speech units, and uses these to compute the probability of the next unit out of a pre-selected set of a few best matches from the acoustic model. Of course, this again increases computational complexity, but significantly improves performance (especially in more regular languages such as Mandarin Chinese). The units under consideration in the n-gram language model could be phonemes, words, or similar, depending upon the application, vocabulary size, and so on. In a non-regular language
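The n-gram re-scoring described above can be sketched with a toy bigram (n = 2) model. The word pairs, probabilities, and fallback value below are invented for illustration, and Python is used in place of the book's MATLAB; the point is only how a language model can overturn a narrow acoustic ranking.

```python
# Toy bigram (n = 2) language model: probability of a word given the
# single previous word. All pairs and values are invented.
bigram = {
    ("recognise", "speech"): 0.6,
    ("recognise", "beach"):  0.05,
}

def rescore(history, candidates, fallback=1e-4):
    """Combine each candidate's acoustic probability with the bigram
    probability given the previously matched word. Word pairs the
    model has never seen receive a small fallback probability."""
    return {word: acoustic_p * bigram.get((history, word), fallback)
            for word, acoustic_p in candidates.items()}

# Acoustically, 'beach' slightly outscores 'speech', but given the
# history 'recognise' the language model overturns the ranking.
candidates = {"speech": 0.45, "beach": 0.55}
scores = rescore("recognise", candidates)
best = max(scores, key=scores.get)
```

Extending this to a general n-gram simply means keying the table on the last n-1 units instead of one, which is where the extra computational cost mentioned above comes from.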