Applied Speech and Audio Processing: With MATLAB Examples
7.5. Speech recognition
7.5.3 Practical speech recognition

Practical ASR systems share a generic structure, although the details vary quite widely. In the generic system shown in Figure 7.5, input speech is first cleaned up by a pre-processing stage before a feature vector is extracted. The pre-processing may take the form of filtering, probably windowing and normalisation, and some method of segmentation.

Following pre-processing, features are extracted from the speech. Many possible features can be used, including LPCs, LSPs, cepstral coefficients, spectral coefficients, and so on, although Mel-frequency cepstral coefficients (MFCCs) are probably the most popular at present, and there is of course no reason why the vector needs to contain just one type of feature. Each feature may comprise several tens of coefficients, and be updated every 20 ms.

In the simplest of systems, these features can then be compared, in turn, to a large set of stored features (an acoustic model). A distance measure (perhaps the Euclidean distance, but more often a weighted distance measure, and very commonly these days this role is taken over by a hidden Markov model) is computed for each of the stored features, and a probability assigned to each one. This probability identifies how well the current speech segment matches the stored features, and naturally the highest-probability match is the best one.

However, there is another level of refinement possible beyond this one: applying a language model (also shown in Figure 7.5) to weight the probabilities of the top few matches from the acoustic comparison according to their adherence to language rules. For example, if the highest-matching feature vector is found to be something disallowed in the language being spoken, then it probably should be rejected in favour of the second-highest matching feature vector.
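The acoustic matching step described above can be sketched in a few lines of code. This is a minimal illustration in Python rather than the book's MATLAB; the template labels and feature values are invented for the example, and a softmax over negative Euclidean distances stands in for the weighted distance measures or hidden Markov models that a real recogniser would use.

```python
import math

# Hypothetical acoustic model: each stored template is a tiny feature
# vector. A real system would use ~13-39 MFCCs per 20 ms frame; these
# labels and numbers are illustrative only.
templates = {
    "s":  [1.2, 0.4, -0.3],
    "sh": [1.0, 0.9, -0.1],
    "f":  [0.2, 0.5,  0.8],
}

def euclidean(a, b):
    """Plain Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_probabilities(frame):
    """Score an incoming feature vector against every stored template,
    turning distances into probabilities (softmax over -distance), so
    the closest template receives the highest probability."""
    scores = {label: math.exp(-euclidean(frame, t))
              for label, t in templates.items()}
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

frame = [1.1, 0.5, -0.2]            # one incoming speech frame
probs = match_probabilities(frame)  # probability per stored template
best = max(probs, key=probs.get)    # highest-probability match
```

In a full system the per-frame probabilities would be accumulated over a whole segment (for example by an HMM) rather than decided frame by frame, but the principle of converting a distance into a match probability is the same.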
A dictionary can be used to refine the matching further: only phonetic combinations found in the dictionary are allowed. Evidently, with the possibility of several hundred stored feature vector templates in the acoustic model, a similar number in the language model, and perhaps more in the dictionary, this whole matching process can be very slow. This is one reason why the vocabulary should be restricted, but also why the size of the feature vector should be minimised where possible. Much research has been done on restricting the amount of searching necessary during the matching process.

The language model, as described, considers the probability that the current speech is correctly matched given knowledge of the previous unit of matched speech. In general this history can extend back further than just the previous sound. An n-gram language model looks back at the past n speech units, and uses these to compute the probability of the next unit out of a pre-selected set of a few best matches from the acoustic model. Of course, this again increases computational complexity, but significantly improves performance (especially in more regular languages such as Mandarin Chinese). The units under consideration in the n-gram language model could be phonemes, words, or similar, depending upon the application, vocabulary size, and so on. In a non-regular language
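The n-gram re-scoring described above can be sketched with a toy bigram (n = 2) model. The word pairs, probabilities, and fallback value below are invented for illustration, and Python is used in place of the book's MATLAB; the point is only how a language model can overturn a narrow acoustic ranking.

```python
# Toy bigram (n = 2) language model: probability of a word given the
# single previous word. All pairs and values are invented.
bigram = {
    ("recognise", "speech"): 0.6,
    ("recognise", "beach"):  0.05,
}

def rescore(history, candidates, fallback=1e-4):
    """Combine each candidate's acoustic probability with the bigram
    probability given the previously matched word. Word pairs the
    model has never seen receive a small fallback probability."""
    return {word: acoustic_p * bigram.get((history, word), fallback)
            for word, acoustic_p in candidates.items()}

# Acoustically, 'beach' slightly outscores 'speech', but given the
# history 'recognise' the language model overturns the ranking.
candidates = {"speech": 0.45, "beach": 0.55}
scores = rescore("recognise", candidates)
best = max(scores, key=scores.get)
```

Extending this to a general n-gram simply means keying the table on the last n-1 units instead of one, which is where the extra computational cost mentioned above comes from.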