Advanced topics
Figure 7.4
Plot of articulation index (a measure of recognition accuracy) versus the logarithm of
vocabulary size for speech recognised at various levels of additive white noise, with SNR
ranging from −18 dB to +12 dB.
The process of actually capturing speech input for a recogniser is extremely important,
affecting not only the signal-to-noise ratio, but also the characteristics of the sound.
Telephone-based systems must cater for the low-bandwidth, low-quality telephone
signal, whereas the performance of microphone-based systems depends in part upon
the distance between the microphone and the mouth of the speaker.
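As a rough illustration of what a given signal-to-noise ratio means in practice, the
sketch below mixes white Gaussian noise into a signal at a chosen SNR, using the usual
definition SNR = 10 log10(Ps/Pn) in dB. It is a minimal sketch under stated assumptions,
not the procedure used to produce the figure: the function names add_white_noise and
measured_snr_db are hypothetical, a 440 Hz tone sampled at 8 kHz stands in for recorded
speech, and the input is assumed to be non-empty and non-silent.

```python
import math
import random


def add_white_noise(speech, snr_db, seed=0):
    """Return speech plus white Gaussian noise scaled to the requested SNR (dB).

    Hypothetical helper for illustration; assumes a non-empty, non-silent input.
    """
    rng = random.Random(seed)
    signal_power = sum(x * x for x in speech) / len(speech)
    # SNR(dB) = 10*log10(Ps / Pn)  =>  Pn = Ps / 10**(SNR/10)
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    noise = [rng.gauss(0.0, 1.0) for _ in speech]
    noise_power = sum(n * n for n in noise) / len(noise)
    # Scale the noise so that its mean power matches the target exactly.
    scale = math.sqrt(target_noise_power / noise_power)
    return [x + scale * n for x, n in zip(speech, noise)]


def measured_snr_db(clean, noisy):
    """Estimate the SNR in dB from a clean signal and its noisy version."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((y - x) ** 2 for x, y in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# Assumption: a one-second 440 Hz tone at 8 kHz stands in for a speech recording.
fs = 8000
tone = [math.sin(2 * math.pi * 440 * n / fs) for n in range(fs)]

# The two extremes of the range quoted in the caption, plus the midpoint at 0 dB.
for snr in (-18, 0, 12):
    noisy = add_white_noise(tone, snr)
    print(snr, round(measured_snr_db(tone, noisy), 2))
```

At −18 dB the noise power is roughly 63 times the signal power, whereas at +12 dB the
signal power is roughly 16 times the noise power, which gives some feel for how wide a
range of conditions these two extremes span.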
Most users of commercial ASR systems will have faced the problems of background
noise, microphone placement, gain, and so on. Even after a system has been trained
to a particular voice so that accuracy levels of over 90% can be achieved, an issue as
simple as changing microphone placement, or the addition of even quite low levels of
background noise, could reduce accuracy by as much as 20%. The presence of music,
speech or loud background noises may well degrade performance far more.
The presence or absence of background noise is a critical operational factor, and
recognition systems designed to work with headset-mounted microphones or similar
will naturally perform better than those capturing speech from a transducer located far
away from the mouth of a speaker. In the latter case, directional microphones can ‘steer’
themselves to some extent in order to avoid background sounds. Other recognition systems
have made use of non-speech cues to improve recognition performance, including:
video images of a speaking mouth, ultrasonic echoes of the mouth, body gestures, facial
expressions, and even nerve impulses in the neck. Clearly the speech recognition research
field is both very active and very diverse.