Applied Speech and Audio Processing: With MATLAB Examples


Date: 18.10.2023



Audio analysis

6.4 Higher order statistics
When applied to speech, the measures described in this section do not provide a
definitive indication of the presence or absence of speech. Generally, a definitive
classification result is only practical under certain tightly controlled conditions
(such as same user, same microphone, small vocabulary, etc.). In the real world,
things become more difficult than in the lab. The author can recall undertaking a
very lucrative contract to develop a commercial speech classification engine over
several months. All went well initially: algorithms were implemented, background
noises and speech samples collected, and a working system rapidly developed.
However, problems arose during testing: one person, for no apparent reason, had the
ability to ‘break’ the system every time with his voice. Despite little discernible
difference from other speakers, this voice confused the combined measures used for
classification. A large amount of adjustment and rework was required to modify the
system to correct for this. The moral of this story is that, even when developing
with a large number of representative samples of speech and background noise, the
sheer variability of voices and sounds in the real world spells trouble for the
speech processing engineer.
Speech classification is a matter of statistics. A system which classifies
unconstrained speech 100% correctly is impossible. However, scores approaching this
are feasible. Most important is where and when any system goes wrong. The developer
needs to bear in mind four conditions regarding the accuracy of a binary
classification (a positive or negative match to some criteria):
• True-Positive classification accuracy: the proportion of positive matches
  classified correctly;
• True-Negative classification accuracy: the proportion of negative matches
  classified correctly;
• False-Positive classification accuracy: the proportion of negative matches
  incorrectly classified as positive;
• False-Negative classification accuracy: the proportion of positive matches
  incorrectly classified as negative.
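
The four proportions listed above can be computed from paired lists of predicted and true labels. The following is a minimal Python sketch, not the author's method: the function name `binary_rates` and the ten per-frame labels are hypothetical, chosen only to illustrate the definitions.

```python
def binary_rates(predicted, actual):
    """Return the four classification proportions for paired boolean labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # speech called speech
    tn = sum(1 for p, a in pairs if not p and not a)  # non-speech called non-speech
    fp = sum(1 for p, a in pairs if p and not a)      # non-speech called speech
    fn = sum(1 for p, a in pairs if not p and a)      # speech called non-speech
    positives = tp + fn  # all truly positive items
    negatives = tn + fp  # all truly negative items
    return {
        "true_positive": tp / positives if positives else 0.0,
        "false_negative": fn / positives if positives else 0.0,
        "true_negative": tn / negatives if negatives else 0.0,
        "false_positive": fp / negatives if negatives else 0.0,
    }

# Hypothetical per-frame labels: True = speech, False = non-speech.
actual    = [True, True, True, True, False, False, False, False, False, False]
predicted = [True, True, True, False, False, False, False, False, True, True]
rates = binary_rates(predicted, actual)
```

Each pair of proportions sums to one (true-positive with false-negative, true-negative with false-positive), so reporting one from each pair is enough; which errors matter most depends on where and when the system is allowed to go wrong.
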
In general, accuracy can be improved by ensuring a larger sample size. However, the
way in which the samples are analysed is important. For example, although we can
say that speech occupies certain frequency ranges, at certain volumes, there are
many types of sound that are similar. Dog growls, chairs scraping across floors and
doors slamming could all resemble speech if spectral measures alone are used. By
and large, we would therefore need to use more than one basic measure, perhaps
spectral distribution and amplitude distribution (AMDF measure). Unfortunately,
those measures would be confused by music, which has a similar frequency range to
speech and similar amplitude changes. There are thus other aspects that we need to
look for.
Specifically for speech, there are several higher order statistics that we can turn to,
and which are seldom present in generalised audio. These relate to the usage, generation,
and content of the speech signal itself.
First is the occupancy rate of the channel. Most speakers do not utter long
continuous monologues. For telephone systems, there are generally pauses in one
person’s speech, which become occupied by the other party. These to-and-fro flows
can be detected, and used to indicate the presence of speech. For this, however, we
may require analysis of several minutes’ worth of speech before a picture of the
occupancy rate emerges.
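
One simple way to estimate occupancy is to measure the fraction of short frames whose energy exceeds a threshold. The Python sketch below is an illustration under stated assumptions rather than the book's algorithm: the 8 kHz sample rate, 20 ms frame length, threshold of 0.01 and the synthetic tone-and-silence signal are all hypothetical choices, and a real system would need a noise-adaptive threshold and far longer recordings.

```python
import math

def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def occupancy_rate(samples, frame_len, threshold):
    """Fraction of frames whose energy exceeds the threshold."""
    energies = frame_energies(samples, frame_len)
    if not energies:
        return 0.0
    active = sum(1 for e in energies if e > threshold)
    return active / len(energies)

# Synthetic one-party channel (assumed 8 kHz): 1 s of tone standing in for
# talk, then 1 s of silence standing in for the other party's turn, repeated.
fs = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / fs) for n in range(fs)]
silence = [0.0] * fs
signal = (tone + silence) * 3

# 160 samples = 20 ms frames; the threshold value is an arbitrary assumption.
rate = occupancy_rate(signal, frame_len=160, threshold=0.01)
```

Here the channel is active for half of the frames. Tracking the same figure on each direction of a telephone call would show the alternating pattern described above.
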
The syllabic rate is the speed at which syllables are formed and uttered. To some
extent, this is a function of language and speaker: for example, native Indian
speakers have a far higher syllabic rate than native Maori speakers and,
irrespective of origin, most people do not exhibit a high syllabic rate when woken
in the early hours of the morning. However, the vocal production mechanisms are
muscle-controlled, and are only capable of a certain range of syllabic rates. This
range can be detected, and the fact used to classify speech.
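
A rough syllabic-rate estimate can be obtained by counting bursts in the short-term energy envelope. The Python sketch below is one possible approach, not the author's: it counts upward crossings of half the peak frame energy, and the test signal (a 200 Hz tone amplitude-modulated at 4 Hz), the 8 kHz sample rate and the 1 to 10 syllables-per-second plausibility bounds are all assumptions made for illustration.

```python
import math

def energy_envelope(samples, frame_len):
    """Short-term energy of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def count_onsets(envelope, threshold):
    """Count upward crossings of the threshold (one per energy burst)."""
    onsets = 0
    above = False
    for e in envelope:
        if e > threshold and not above:
            onsets += 1
        above = e > threshold
    return onsets

def syllabic_rate(samples, fs, frame_len):
    """Energy bursts per second, using half the peak energy as threshold."""
    envelope = energy_envelope(samples, frame_len)
    if not envelope:
        return 0.0
    onsets = count_onsets(envelope, 0.5 * max(envelope))
    return onsets / (len(samples) / fs)

# Synthetic "speech": 2 s of a 200 Hz tone whose amplitude pulses 4 times
# per second, standing in for four syllables per second (assumed values).
fs = 8000
signal = [
    0.5 * (1 - math.cos(2 * math.pi * 4 * n / fs))
    * math.sin(2 * math.pi * 200 * n / fs)
    for n in range(2 * fs)
]
rate = syllabic_rate(signal, fs, frame_len=160)

# Illustrative bounds only; the text gives no numeric range.
plausible_for_speech = 1.0 <= rate <= 10.0
```

Real speech envelopes are far less regular than this test signal, so a practical detector would smooth the envelope and add hysteresis before counting bursts.
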
Most languages have a certain ratio of voiced to unvoiced speech. The degree of
voicing can be affected by several conditions, such as a sore throat, and by the
speaking environment (think of speaking on a cellphone in a library), but there is
still a pattern of voiced and unvoiced speech in many languages. In Chinese, all
words are either totally voiced (V),
