Applied Speech and Audio Processing: With MATLAB Examples
7.5. Speech recognition
Sphinx-4, in common with most current ASR implementations, relies upon hidden Markov models to match speech features to stored patterns. It is highly configurable and flexible: the actual feature used can be selected as required, although the one most commonly extracted and used for pattern matching is the Mel-frequency cepstral coefficient (MFCC) [19]. This flexibility extends to the pre-processing section, the ‘FrontEnd’, where several different filters and operations can be applied singly or chained together, and also to the so-called ‘Linguist’, a configurable module containing a language model, an acoustic model and a dictionary. The Linguist is responsible for consulting these for a particular feature vector, and for determining which subset of stored patterns is compared with the feature vector under analysis.

Sphinx-4 has been tested extensively using industry-standard databases of recorded speech, which are commonly used by ASR researchers to compare the performance of systems. Accuracy rates of over 98% are possible for very small vocabularies (with a response time of 20 ms), over 97% for a 1000-word vocabulary (in 400 ms), and approximately 81% for a 64 000-word vocabulary (below 4 s) [19]. These figures are assumed to be for high SNR cases.

7.5.4 Some basic difficulties

Although we have looked at the main parameters related to speech recognition, there are several issues that speech recognition systems in general need to cope with. These may include:

Voice activity detection (VAD): a voice activity detector, also known as a voice operated switch (VOS), is a device able to detect the presence of speech. It would serve no purpose for an ASR system to attempt the computationally intensive task of recognising what is being said when no speech is present, and thus the ability to detect speech accurately is required. However, this is not a trivial task, and is in fact a research area in its own right.
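As a rough illustration of the simplest kind of VAD, the sketch below flags a frame as speech when its energy exceeds a multiple of an estimated noise floor. It is a baseline under stated assumptions, not the detector used by Sphinx-4 or by any particular ASR system: the 8 kHz sample rate, the 20 ms frame length, the threshold ratio of 4 and the assumption that the first five frames contain only background noise are all choices made for this example, and the function names are hypothetical. The test signal is synthetic (noise, then a tone, then noise).

```python
import math
import random


def frame_energies(signal, frame_len):
    """Mean-square energy of consecutive non-overlapping frames."""
    n_frames = len(signal) // frame_len
    return [
        sum(x * x for x in signal[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]


def energy_vad(signal, frame_len=160, threshold_ratio=4.0, noise_frames=5):
    """Return one boolean per frame: True where the frame looks active.

    Assumption: the first `noise_frames` frames hold background noise only,
    so their average energy can serve as the noise floor estimate.
    """
    energies = frame_energies(signal, frame_len)
    noise_floor = sum(energies[:noise_frames]) / noise_frames
    return [e > threshold_ratio * noise_floor for e in energies]


def gaussian_noise(n, sigma=0.01):
    """Low-level white noise standing in for a quiet background."""
    return [random.gauss(0.0, sigma) for _ in range(n)]


# Synthetic test signal at an assumed 8 kHz sample rate:
# 0.2 s of noise, 0.2 s of a 200 Hz tone plus noise, 0.2 s of noise.
random.seed(0)
fs = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * t / fs) for t in range(1600)]
burst = [s + w for s, w in zip(tone, gaussian_noise(1600))]
signal = gaussian_noise(1600) + burst + gaussian_noise(1600)

flags = energy_vad(signal)  # 30 frames of 20 ms each
print("".join("1" if f else "0" for f in flags))
```

A fixed energy threshold of this kind works on a clean synthetic signal but degrades quickly at low SNR or with non-stationary noise, which is one reason VAD is regarded as a difficult problem in practice.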
Segmentation of speech into smaller units is often required in processing systems. Whilst this is generally based on fixed-size analysis frames when performing general audio processing (see Section 2.4), in ASR systems segmentation into words, or even into phonemes, may be required. Again, this is non-trivial, and is not simply a matter of searching for gaps within continuous speech, since the gaps between words or sentences may on occasion be shorter than the gaps within words.

Word stress can be very important in determining the meaning of a sentence, and although it is not captured in the written word, it is widely used during vocal communication. As an example, note the written sentence ‘He said he did not eat this’ and consider the variations in meaning represented by stressing different words: