Applied Speech and Audio Processing: With MATLAB Examples
3.3. Speech understanding
In terms of implementing such an analysis in Matlab, we note first that analysing from −∞ to +∞ is unrealistic, so we would normally choose a segment of N samples of audio, window it, and perform a fast Fourier transform to obtain both power spectra, P and S. In discrete sampled versions, we follow the same method that we used in Chapter 2 to visualise signals in the frequency domain:

S=fft(s.*hamming(N));
S=20*log10(abs(S(1:N/2)));
P=fft(p.*hamming(N));
P=20*log10(abs(P(1:N/2)));

and then proceed with the SD measure:

SD=mean((S-P).^2);

SD is a perceptually relevant difference measure for speech and audio, but it can be enhanced further by the additional step of A-weighting the spectra, so that differences in frequency regions that are more audible are weighted more heavily than those in regions that are inaudible. This yields a perceptually-weighted spectral distortion, which is used in practical systems that perform high-quality speech and audio signal analysis.

3.3.3 Measurement of speech intelligibility

Intelligibility is also best measured by a panel of listeners, and relates to the ability of listeners to correctly identify words, phrases or sentences. An articulation test is similar, but applies to the understanding of individual phonemes (vowels or consonants) in monosyllabic or polysyllabic real or artificial words. Several common methods of evaluation exist, but those standardised by ANSI (in standard S3.2-1989) dominate.
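Returning to the perceptually weighted spectral distortion described before Section 3.3.3, the following is a minimal sketch of one way it might be computed. Note that adding the same A-weighting curve (in dB) to both S and P would cancel in the difference S-P, so this sketch instead uses the A-weighting gains as weights on the squared dB differences. That weighting scheme, the sample rate fs, the frame length and the two test signals are all assumptions made for illustration, and are not taken from the text. The Hamming window is written out explicitly so that no toolbox is required.

```matlab
% Sketch: A-weighted spectral distortion between two N-sample frames.
% Assumed (not from the text): fs, N, the test signals s and p, and the
% use of linear A-weighting power gains as weights on (S-P).^2.
fs = 8000;                       % sample rate in Hz (assumed)
N  = 256;                        % analysis frame length (assumed)
n  = (0:N-1)';
s  = sin(2*pi*500*n/fs) + 0.5*sin(2*pi*3000*n/fs);  % hypothetical original
p  = sin(2*pi*500*n/fs) + 0.4*sin(2*pi*3000*n/fs);  % hypothetical processed

% Symmetric Hamming window, equivalent to hamming(N)
w = 0.54 - 0.46*cos(2*pi*n/(N-1));

% Log-magnitude spectra; eps guards against log10(0)
S = fft(s.*w);  S = 20*log10(abs(S(1:N/2)) + eps);
P = fft(p.*w);  P = 20*log10(abs(P(1:N/2)) + eps);

% Unweighted spectral distortion, as in the text
SD = mean((S-P).^2);

% A-weighting curve (in dB) evaluated at each bin frequency
f   = (0:N/2-1)' * fs/N;
f2  = f.^2;
Ra  = (12194^2 * f2.^2) ./ ((f2 + 20.6^2) ...
      .* sqrt((f2 + 107.7^2) .* (f2 + 737.9^2)) .* (f2 + 12194^2));
AdB = 20*log10(Ra + eps) + 2.00;  % approximately 0 dB at 1 kHz

% Weighted mean of the squared differences using linear power gains
g   = 10.^(AdB/10);
SDw = sum(g .* (S-P).^2) / sum(g);

fprintf('SD = %.3f, A-weighted SD = %.3f\n', SD, SDw);
```

Because the weights are normalised by their sum, SDw stays in the same units as SD (dB squared), which makes the two values directly comparable for a given pair of frames.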
Some example evaluative procedures are listed here along with references that provide more information (unless noted, see [16] for further details):

• diagnostic rhyme test (DRT) [17] – asking listeners to distinguish between two words rhyming by initial, such as {freak, leak};
• modified rhyme test (MRT) – asking listeners to select one of six words, half differing by initial and half by final, such as {cap, tap, rap, cat, tan, rat};
• phonetically balanced word lists – presenting listeners with 50 sentences of 20 words each, and asking them to write down the words they hear;
• diagnostic medial consonant test;
• diagnostic alliteration test;
• ICAO spelling alphabet test;
• two-alternative forced choice [18] – a general test category that includes the DRT;
• six-alternative rhyme test [18] – a general test category that includes the MRT;
• four-alternative auditory feature test [17] – asking listeners to select one of four words, chosen to highlight the intelligibility of the given auditory feature;
• consonant-vowel-consonant test [19,20,11] – test of vowel syllable sandwiched between two identical consonants, with the recognition of the vowel being the listeners’ task. For example {tAt}, {bOb};
• general sentence test [11] – similar to the phonetically balanced word list test, but using self-selected sentences that may be more realistic in content (and in context of what the test is trying to determine);
• general word test [5] – asking listeners to write down each of a set of (usually 100) spoken words, possibly containing realistic words.

Clearly intelligibility may be tested in terms of phonemes, syllables, words, phrases, sentences, paragraph meaning, and any other arbitrarily grouped, measured recognition rate. In general we can say that the smaller the unit tested, the more able we are to relate the results to individual parts of speech.
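The forced-choice tests listed above are typically reported as a percentage score. Since a listener choosing among n alternatives would be right about 1/n of the time by guessing alone, a correction for guessing is commonly applied. The sketch below uses the standard correction, score = 100(R − W/(n−1))/T, where R and W are the numbers of right and wrong responses and T = R + W. The formula and the example tallies are assumptions added here for illustration and do not come from the text.

```matlab
% Chance-corrected percentage score for an n-alternative forced-choice
% test. R = right answers, W = wrong answers, n = number of alternatives.
% The correction formula and the tallies below are illustrative
% assumptions, not taken from the text.
score = @(R, W, n) 100 * (R - W./(n-1)) ./ (R + W);

drt = score(170, 30, 2);   % two alternatives (DRT-style): 70
mrt = score(170, 30, 6);   % six alternatives (MRT-style): 82

% A listener who only guesses scores zero after correction
guess2 = score(100, 100, 2);
guess6 = score(100, 500, 6);

fprintf('DRT-style: %.1f, MRT-style: %.1f\n', drt, mrt);
```

The same raw tallies give a lower corrected score in the two-alternative case, because each wrong answer there implies more lucky guesses among the right answers.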
Unfortunately no reliable method has so far been developed of extrapolating from, for example, the results of a phoneme test to determine the effectiveness on sentence recognition (although if you know the cause of intelligibility loss in a particular system, you could make a good guess).

3.3.4 Contextual information, redundancy and vocabulary size

Everyday experience indicates that contextual information plays an important role in the understanding of speech, often compensating for an extreme lack of original information. For example, the sentence ‘He likes to xxxxx brandy’ can easily be understood through guessing, even though a complete word is missing (‘drink’).

The construction of sentences is such that the importance of missing words is very difficult to predict. It is hard to know in advance whether the start, middle or end of a sentence will be more critical to its understanding. For example, the missing word ‘stop’ differs in both importance and predictability in the two sentences ‘She waited in the long queue at the bus xxxx’ and ‘As the car sped towards him he shouted xxxx!’

Contextual information may be regarded as being provided by surrounding words which constrain the choice of the enclosed word or, on a smaller scale, by the surrounding syllables which constrain the choice of a missing or obscured syllable (as certain combinations appear very infrequently, or not at all, in the English language). Vocabulary size reduction causes a similar constraint, and it is noticeable that most people will restrict their vocabulary to simple words when communications are impaired: eloquence is uncommon in highly noisy environments.
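The constraining effect of vocabulary size can be illustrated with a toy example. Suppose a listener hears a four-letter word whose first and last sounds are clear but whose middle is obscured, written here as the pattern s..p. The sketch below counts how many candidate words remain in a larger and a smaller vocabulary. Both word lists and the use of spelling as a stand-in for sound are hypothetical simplifications, not taken from the text.

```matlab
% Toy illustration: a smaller vocabulary leaves fewer candidates for a
% partly obscured word. Word lists are hypothetical, and letters stand in
% for sounds purely for simplicity.
pattern = '^s..p$';   % 's', two unknown letters, then 'p'

largeVocab = {'stop','step','shop','ship','slip','snap','soup', ...
              'stay','bus','queue'};
smallVocab = {'stop','go','wait','left','right'};

% Count the words in a cell array that match the pattern
countMatches = @(vocab) sum(~cellfun(@isempty, ...
                            regexp(vocab, pattern, 'match', 'once')));

nLarge = countMatches(largeVocab);   % 7 candidates remain
nSmall = countMatches(smallVocab);   % 1 candidate remains

fprintf('Large vocabulary: %d candidates, small: %d\n', nLarge, nSmall);
```

With the restricted vocabulary the obscured word is identified unambiguously, whereas the larger vocabulary leaves the listener to rely on sentence context to choose among the remaining candidates.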