Applied Speech and Audio Processing: With MATLAB Examples


3.3. Speech understanding
In recent years, several objective quality algorithms have been developed. These are
computer programs that, when fed an original audio recording plus a degraded one,
estimate the MOS of the degraded audio (or produce a figure which can be converted
to a MOS). These programs can be tricked, and at the extremes they track actual MOS
values very poorly, but for normal speech processed in standard ways they have been
shown to produce respectable results. They are used primarily because they are
cheaper and quicker than forming a panel of listeners. Most importantly, they allow
automated testing to take place. Some of the more prominent algorithms are:
• PESQ (perceptual evaluation of speech quality);
• PSQM (perceptual speech quality measure);
• MNB (measuring normalising blocks).
Although these are commercially supported algorithms, it has been possible in the past
to download working versions from the Internet for non-commercial research use.
A far cruder measure of quality between a processed audio vector p and an
original audio vector s is the mean-squared error (MSE) E. This is simply calculated on
a sample-by-sample basis as the average squared difference between those vectors:
$$E = \frac{1}{N} \sum_{i=0}^{N-1} \{ s[i] - p[i] \}^2 \tag{3.1}$$
In Matlab, without using library functions, that would be:
mse=mean((s-p).^2)
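As a quick check of the expression above, the following sketch applies it to made-up data. The sample rate, the 440 Hz tone and the constant offset of 0.1 are illustrative assumptions, not values from the text.

```matlab
% Hypothetical test data: a tone and a copy shifted by a constant offset
fs = 8000;                      % assumed sample rate in Hz
t = (0:fs-1)/fs;                % one second of sample times
s = sin(2*pi*440*t);            % "original" audio vector
p = s + 0.1;                    % "processed" vector, every sample off by 0.1

mse = mean((s-p).^2);           % Equation (3.1): average squared difference
fprintf('MSE = %.4f\n', mse);   % each squared difference is about 0.01
```

Because every sample differs by the same amount, the average squared difference equals the square of that offset, which makes the result easy to verify by hand.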
For long recordings, this measure would smear together all features of speech which
change over time into one average. It is often more useful to know the mean-squared
error on a segment-by-segment basis to see how this varies (think of a typical speech
system you are developing: it would be more useful to know that it works very well for
voiced speech but not well for unvoiced speech rather than know the overall average).
The reader may remember the same argument used previously in Section 2.4 for a
similar analysis example.
The segmental mean-squared error is a measure of the time-varying MSE. Usually
segments are 20–30 ms in size, and are sequential analysis frames, sometimes with
overlap. For a frame size of N samples and no overlap, for the jth segment this would
be:
$$E(j) = \frac{1}{N} \sum_{i=jN}^{(j+1)N-1} \{ s[i] - p[i] \}^2 \tag{3.2}$$
mse(j+1)=mean((s(j*N+1:(j+1)*N)-p(j*N+1:(j+1)*N)).^2);

Remember that Matlab indexes arrays from element 1 and not element 0, hence the
slight difference in indexing between the given equation and the Matlab expression.
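To see how the segmental measure exposes time variation that a single average hides, the sketch below loops over every whole frame of a made-up signal pair. The sample rate, the 200 Hz tone, the 20 ms frame length and the variable names segmse and nframes are all illustrative assumptions.

```matlab
fs = 8000;                          % assumed sample rate in Hz
N = round(0.02*fs);                 % 20 ms frames, i.e. 160 samples
s = sin(2*pi*200*(0:fs-1)/fs);      % hypothetical original: 1 s tone
p = s;                              % processed copy ...
p(fs/2+1:end) = 0.5*p(fs/2+1:end);  % ... attenuated in the second half only

nframes = floor(length(s)/N);       % whole frames only; any tail is ignored
segmse = zeros(1, nframes);         % one MSE value per frame
for j = 0:nframes-1                 % j counts from 0, as in Equation (3.2)
    idx = j*N+1:(j+1)*N;            % Matlab (1-based) indices of frame j
    segmse(j+1) = mean((s(idx)-p(idx)).^2);
end

overall = mean((s-p).^2);           % single-figure MSE for comparison
```

Here the first 25 frames give zero error and the last 25 give 0.125, whereas the overall figure of 0.0625 describes neither half of the recording.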
When the signals of interest are not being compared for likeness, but rather one
signal is corrupted by another, the ratio of the signals themselves can be useful,
expressed as the signal-to-noise ratio (SNR). Note that SNR does not measure the
degree of difference between signals; it is simply the base-10 logarithm of the ratio
between a wanted signal s and an interfering noise vector n, measured in decibels, dB.
$$\mathrm{SNR} = 20 \log_{10} \left\{ \frac{1}{N} \sum_{i=0}^{N-1} \left| \frac{s[i]}{n[i]} \right| \right\} \tag{3.3}$$
snr=20*log10(mean(abs(s./n)))
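A per-sample ratio becomes unstable wherever a noise sample is close to zero, so in practice SNR is very often computed from the ratio of total signal energy to total noise energy instead. The following is a minimal sketch of that energy-based form, not the expression given above. The two sinusoids, their amplitudes and the name snr_db are illustrative assumptions.

```matlab
fs = 8000;                        % assumed sample rate in Hz
t = (0:fs-1)/fs;                  % one second of sample times
s = sin(2*pi*400*t);              % hypothetical wanted signal, amplitude 1
n = 0.1*sin(2*pi*1000*t);         % hypothetical interference, amplitude 0.1

% Ratio of total energies in dB; 10*log10 is used because these are
% power-like (squared) quantities rather than amplitudes.
snr_db = 10*log10(sum(s.^2)/sum(n.^2));
fprintf('SNR = %.2f dB\n', snr_db);
```

An amplitude ratio of 10 corresponds to an energy ratio of 100, so this example yields 20 dB. The name snr_db is used rather than snr to avoid shadowing the toolbox function of that name.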
Segmental signal-to-noise ratio (SEGSNR) is a measure of the signal-to-noise ratio of
segments of speech, in the same way that we segment other measures to see how things
change over time. Again, segments are typically 20–30 ms in size, perhaps with some
overlap. For a frame size of N samples and no overlap, for the jth segment this would
be as shown in Equation (3.4):
$$\mathrm{SEGSNR}(j) = 20 \log_{10} \left\{ \frac{1}{N} \sum_{i=jN}^{(j+1)N-1} \left| \frac{s[i]}{n[i]} \right| \right\} \tag{3.4}$$
segsnr(j+1)=20*log10(mean(abs(s(j*N+1:(j+1)*N)./n(j*N+1:(j+1)*N))))
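The same framing idea can be sketched with the energy-based SNR, which avoids dividing by individual noise samples. This is an illustrative variant rather than the expression given above. The test signals, the 160-sample frame and the names segsnr_db and nframes are assumptions.

```matlab
fs = 8000;                           % assumed sample rate in Hz
N = 160;                             % 20 ms frames at 8 kHz
t = (0:fs-1)/fs;
s = sin(2*pi*400*t);                 % hypothetical wanted signal
s(fs/2+1:end) = 0.1*s(fs/2+1:end);   % much quieter in the second half
n = 0.1*sin(2*pi*1000*t);            % hypothetical steady interference

nframes = floor(length(s)/N);        % whole frames only
segsnr_db = zeros(1, nframes);       % one SNR value per frame, in dB
for j = 0:nframes-1                  % j counts from 0, as in Equation (3.4)
    idx = j*N+1:(j+1)*N;             % Matlab (1-based) indices of frame j
    segsnr_db(j+1) = 10*log10(sum(s(idx).^2)/sum(n(idx).^2));
end
```

The per-frame values drop from 20 dB in the first half to 0 dB in the second, a change that a single whole-recording SNR figure would average away.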
However, on the basis that hearing is not conducted in the time domain, but in the
frequency domain, such measures are likely to be only minimally perceptually relevant.
Better measures – or at least those more related to a real-world subjective analysis –
can be obtained in the frequency domain. The primary measure of frequency-domain
difference is called spectral distortion, shown in Equation (3.7), measured in dB², as a
comparison between signals p(t) and s(t) for a frame of size N.
First, however, since the equation is in the frequency domain, we convert the time-
domain signals to be compared, into the frequency domain using the Fourier transform:
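$$S(\omega) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} s(t)\, e^{-j\omega t}\, dt \tag{3.5}$$

$$P(\omega) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} p(t)\, e^{-j\omega t}\, dt \tag{3.6}$$

$$SD = \frac{1}{\pi} \int_{0}^{\pi} \left( \log(S(\omega)) - \log(P(\omega)) \right)^2 d\omega \tag{3.7}$$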
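For sampled frames, Equation (3.7) can be approximated by replacing the Fourier transforms with FFTs and the integral with a mean over the bins from 0 to π. The sketch below is one possible discretisation under stated assumptions: the log spectra are taken as magnitudes in dB (20*log10), no analysis window is applied, and the test frame and the names half, d and sd are illustrative.

```matlab
% Sketch only: discrete approximation of Equation (3.7) for one frame.
N = 256;                         % assumed frame size in samples
k = 0:N-1;
s = 0.9.^k;                      % hypothetical original frame (broadband decay)
p = 0.5*s;                       % hypothetical processed frame: gain of 0.5

S = abs(fft(s));                 % magnitude spectrum of the original
P = abs(fft(p));                 % magnitude spectrum of the processed frame
half = 1:N/2+1;                  % FFT bins spanning 0 to pi rad/sample

% Assumption: log magnitudes are expressed in dB via 20*log10;
% eps guards against taking the log of an exactly zero bin.
d = 20*log10(S(half)+eps) - 20*log10(P(half)+eps);
sd = mean(d.^2);                 % mean over bins stands in for (1/pi)*integral
```

Because the processed frame here is simply the original scaled by 0.5, every bin differs by about 6.02 dB and sd comes out close to 36.2 dB². With real speech, frames would usually be windowed before the FFT.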
