Applied Speech and Audio Processing: With MATLAB Examples
Higher order statistics
6.4 Higher order statistics

When applied to speech, the measures described in this section do not provide a definitive indication of the presence or absence of speech. Generally, a definitive classification result is only practical under certain tightly controlled conditions (such as same user, same microphone, small vocabulary, etc.). In the real world, things become more difficult than in the lab.

The author can recall undertaking a very lucrative contract to develop a commercial speech classification engine over several months. All went well initially: algorithms were implemented, background noises and speech samples collected, and a working system rapidly developed. However, problems arose during testing: one person, for no apparent reason, had the ability to ‘break’ the system every time with his voice. Despite little discernible difference from other speakers, this voice confused the combined measures used for classification. A large amount of adjustment and rework was required to modify the system to correct for this.

The moral of this story is that, even when developing with a large number of representative samples of speech and background noise, the sheer variability of voices and sounds in the real world spells trouble for the speech processing engineer.

Speech classification is a matter of statistics. A system which classifies unconstrained speech 100% correctly is impossible. However, scores approaching this are feasible. Most important is where and when any system goes wrong.
The developer needs to bear in mind four conditions regarding the accuracy of a binary classification (positive-negative match to some criteria):

• True-Positive classification accuracy: the proportion of positive matches classified correctly;
• True-Negative classification accuracy: the proportion of negative matches classified correctly;
• False-Positive classification accuracy: the proportion of negative matches incorrectly classified as positive;
• False-Negative classification accuracy: the proportion of positive matches incorrectly classified as negative.

In general, accuracy can be improved by ensuring a larger sample size. However, the way that this is analysed is important. For example, although we can say that speech occupies certain frequency ranges, at certain volumes, there are many types of sound that are similar. Dog growls, chairs scraping across floors and doors slamming could appear like speech if spectral measures alone are used.

By and large, we would therefore need to use more than one basic measure, perhaps spectral distribution and amplitude distribution (AMDF measure). Unfortunately, those measures would be confused by music, which has a similar frequency range to speech and similar amplitude changes. There are thus other aspects that we need to look for.

Specifically for speech, there are several higher order statistics that we can turn to, and which are seldom present in generalised audio. These relate to the usage, generation, and content of the speech signal itself.

First is the occupancy rate of the channel. Most speakers do not utter long continuous monologues. For telephone systems, there are generally pauses in one person’s speech, which become occupied by the other party. These to-and-fro flows can be detected, and used to indicate the presence of speech. For this, however, we may require analysis of several minutes’ worth of speech before a picture emerges of occupancy rate.
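As a rough illustration of the occupancy idea, the sketch below estimates the fraction of frames in which a channel is active. It is not the author's method: the function names, the 20 ms frame length (160 samples at an assumed 8 kHz sampling rate), the fixed energy threshold and the synthetic test tones are all assumptions chosen for illustration. A practical system would need a noise-adaptive threshold and, as noted above, several minutes of audio.

```python
import math

def frame_energies(samples, frame_len):
    """Mean-square energy of consecutive non-overlapping frames."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies

def occupancy_rate(samples, frame_len=160, threshold=1e-3):
    """Fraction of frames whose energy exceeds a fixed threshold.

    frame_len and threshold are illustrative assumptions, not values
    taken from the text.
    """
    energies = frame_energies(samples, frame_len)
    if not energies:
        return 0.0
    active = sum(1 for e in energies if e > threshold)
    return active / len(energies)

def tone(n, amp, fs=8000, freq=200.0):
    """Synthetic stand-in for speech: n samples of a sine wave."""
    return [amp * math.sin(2 * math.pi * freq * i / fs) for i in range(n)]

# Hypothetical two-party call: party A talks for one second while B is
# silent, then the roles swap.
fs = 8000
channel_a = tone(fs, 0.5) + [0.0] * fs
channel_b = [0.0] * fs + tone(fs, 0.5)

print(occupancy_rate(channel_a), occupancy_rate(channel_b))
```

With this synthetic input each channel is active in exactly half of its frames. Comparing the two channels frame by frame would additionally show that their activity alternates, which is the to-and-fro pattern described above.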
The syllabic rate is the speed at which syllables are formed and uttered. To some extent, this is a function of language and speaker: for example, native Indian speakers have a far higher syllabic rate than native Maori speakers, and irrespective of origin, most people do not exhibit high syllabic rate when woken in the early hours of the morning. However, the vocal production mechanisms are muscle-controlled, and are only capable of a certain range of syllabic rate. This can be detected, and the fact used to classify speech.

Most languages have a certain ratio of voiced and unvoiced speech. Degree of voicing can be affected by several conditions such as sore throat, and speaking environment (think of speaking on a cellphone in a library), but there is still a pattern of voiced and unvoiced speech in many languages. In Chinese, all words are either totally voiced (V),