Applied Speech and Audio Processing: With MATLAB Examples
3.3. Speech understanding
In recent years, several objective quality algorithms have been developed. These are computer programs that, when fed an original audio recording, plus a degraded one, will estimate the MOS score of the degraded audio (or will produce a figure which can be converted to a MOS score). These programs can be tricked, and at the extremes will track actual MOS scores very poorly, but for normal speech processed in standard ways, have been shown to produce respectable results. They are used primarily because they are cheaper and quicker than forming a panel of listeners. Most importantly they allow automated testing to take place. Some of the more prominent algorithms are:

• PESQ (perceptual evaluation of speech quality);
• PSQM (perceptual speech quality measure);
• MNB (measuring normalised blocks).

Although these are commercially supported algorithms, it has been possible in the past to download working versions from the Internet for non-commercial research use.

A far more crude measure of quality between a processed audio vector p and an original audio vector s is the mean-squared error (MSE) E. This is simply calculated on a sample-by-sample basis as the average squared difference between those vectors:

$$E = \frac{1}{N} \sum_{i=0}^{N-1} \{s[i] - p[i]\}^2 \tag{3.1}$$

In Matlab, without using library functions, that would be:

    mse=mean((s-p).^2)

For long recordings, this measure would smear together all features of speech which change over time into one average. It is often more useful to know the mean-squared error on a segment-by-segment basis to see how this varies (think of a typical speech system you are developing: it would be more useful to know that it works very well for voiced speech but not well for unvoiced speech rather than know the overall average). The reader may remember the same argument used previously in Section 2.4 for a similar analysis example.

The segmental mean-squared error is a measure of the time-varying MSE.
Usually segments are 20–30 ms in size, and are sequential analysis frames, sometimes with overlap. For a frame size of N samples and no overlap, for the jth segment (counting from j = 0) this would be:

$$E(j) = \frac{1}{N} \sum_{i=jN}^{(j+1)N-1} \{s[i] - p[i]\}^2 \tag{3.2}$$

    mse(j+1)=mean((s(j*N+1:(j+1)*N)-p(j*N+1:(j+1)*N)).^2);

Remember that Matlab indexes arrays from element 1 and not element 0, hence the slight difference in indexing terms between the given equation and the Matlab expression, and the result for segment j being stored in element j+1.

For cases when signals of interest are not really being compared for likeness, but rather one signal is corrupted by another one, the ratio of the signals themselves can be useful, as the signal-to-noise ratio (SNR). Note that SNR is not used to measure the degree of difference between signals, but is simply the base-10 logarithm of the ratio between a wanted signal s and an interfering noise vector n, measured in decibels (dB). Sample magnitudes are taken so that the logarithm has a positive argument:

$$\mathrm{SNR} = 20 \log_{10} \left\{ \frac{1}{N} \sum_{i=0}^{N-1} \frac{|s[i]|}{|n[i]|} \right\} \tag{3.3}$$

    snr=20*log10(mean(abs(s)./abs(n)))

Segmental signal-to-noise ratio (SEGSNR) is a measure of the signal-to-noise ratio of segments of speech, in the same way that we segment other measures to see how things change over time. Again, segments are typically 20–30 ms in size, perhaps with some overlap. For a frame size of N samples and no overlap, for the jth segment this would be as shown in Equation (3.4):

$$\mathrm{SEGSNR}(j) = 20 \log_{10} \left\{ \frac{1}{N} \sum_{i=jN}^{(j+1)N-1} \frac{|s[i]|}{|n[i]|} \right\} \tag{3.4}$$

    segsnr(j+1)=20*log10(mean(abs(s(j*N+1:(j+1)*N))./abs(n(j*N+1:(j+1)*N))));

However, on the basis that hearing is not conducted in the time domain, but in the frequency domain, such measures are likely to be only minimally perceptually relevant. Better measures (or at least those more related to a real-world subjective analysis) can be obtained in the frequency domain. The primary measure of frequency-domain difference is called spectral distortion, shown in Equation (3.7), measured in dB², as a comparison between signals p(t) and s(t) for a frame of size N.
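The segmental time-domain measures described above can be sketched over a whole recording as follows. This is a minimal illustration, not code from the text: the sample rate, the 20 ms frame length and the synthetic signals are all assumed placeholders, and the SNR is computed in the widely used power-ratio form, 10 log10 of signal energy over noise energy, which is a substitution for the per-sample ratio and avoids dividing by individual noise samples that may be close to zero.

```matlab
% Sketch: segmental MSE and segmental SNR over non-overlapping frames.
% All signals and parameters here are illustrative assumptions.
fs = 8000;                      % assumed sample rate, Hz
N  = round(0.020 * fs);         % 20 ms frame -> 160 samples
t  = (0:fs-1)' / fs;            % one second of sample times
s  = sin(2*pi*200*t);           % placeholder for the original signal
n  = 0.1 * cos(2*pi*1300*t);    % placeholder for the interfering noise
p  = s + n;                     % "processed" signal = original plus noise

nframes = floor(length(s) / N); % whole frames only; any tail is dropped
segmse  = zeros(nframes, 1);
segsnr  = zeros(nframes, 1);
for j = 0:nframes-1
    idx = j*N+1 : (j+1)*N;      % Matlab indices of the jth frame
    segmse(j+1) = mean((s(idx) - p(idx)).^2);
    % Power-ratio SNR in dB (assumed form, see the paragraph above)
    segsnr(j+1) = 10*log10(sum(s(idx).^2) / sum(n(idx).^2));
end
overall_mse = mean((s - p).^2); % single figure for the whole recording
```

With these stationary placeholder signals every frame gives the same result (an MSE of 0.005 and an SNR of 20 dB), so the overall MSE equals the mean of the segmental values. For real speech, plotting segmse or segsnr against frame number would show where in the recording the processing performs well or badly.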
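Because spectral distortion is defined in the frequency domain, the time-domain signals to be compared must first be converted into the frequency domain using the Fourier transform:

$$S(\omega) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} s(t)\, e^{-j\omega t}\, dt \tag{3.5}$$

$$P(\omega) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} p(t)\, e^{-j\omega t}\, dt \tag{3.6}$$

The spectral distortion is then:

$$SD = \frac{1}{\pi} \int_{0}^{\pi} \left( \log(S(\omega)) - \log(P(\omega)) \right)^2 d\omega \tag{3.7}$$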
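For sampled signals, Equation (3.7) can be approximated on a single frame with the FFT. The sketch below rests on several assumptions that are not stated in the text: the log spectra are taken as 20 log10 of the magnitude spectrum (so that the result is in dB²), the integral from 0 to π is replaced by a mean over the FFT bins from DC to the Nyquist frequency, a Hann window is applied to the frame, and the test signals are synthetic placeholders.

```matlab
% Sketch: discrete approximation of spectral distortion for one frame.
% Frame length, window and signals are illustrative assumptions.
fs = 8000;                      % assumed sample rate, Hz
N  = 256;                       % assumed frame length, samples
m  = (0:N-1)';                  % sample index within the frame
s  = sin(2*pi*440*m/fs) + 0.5*sin(2*pi*1200*m/fs);  % placeholder original
p  = 0.8 * s;                   % placeholder processed frame (gain change only)

w  = 0.5 - 0.5*cos(2*pi*m/(N-1));   % Hann window, written out explicitly
S  = fft(s .* w);
P  = fft(p .* w);

k    = 1:N/2+1;                 % bins covering 0..pi (DC to Nyquist)
eps0 = 1e-12;                   % small guard against log of zero
LS = 20*log10(abs(S(k)) + eps0);    % log-magnitude spectrum of s, dB
LP = 20*log10(abs(P(k)) + eps0);    % log-magnitude spectrum of p, dB
sd = mean((LS - LP).^2);        % spectral distortion, dB^2
```

Because the placeholder p differs from s only by a gain of 0.8, every bin differs by about 1.94 dB and sd comes out at about 3.76 dB². A processed signal whose spectral shape differs from the original would give bin-dependent differences, which is the case the measure is intended to capture.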