Chapter · July 012 citation reads 9,926 author


Starting from the capture of the audio signal, feature extraction consists of the following steps, as shown in the block diagram below:


Fig. (2.5): Pre-Processing and Feature Extraction 
 
2.3.1.1 | Capture 
 
The first step in processing speech is to convert the analog representation
(first air pressure, then an analog electrical signal in the microphone) into a
digital signal x[n], where n is an index over time. Analysis of the audio spectrum
shows that nearly all of the energy lies in the band between DC and 4 kHz, and
that beyond 10 kHz there is virtually no energy whatsoever.
Sound format used:
22050 Hz sampling rate
16-bit, signed
Little endian
Mono channel
Uncompressed PCM

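As an illustration of the format listed above, the following is a minimal sketch of how such a recording could be loaded and checked in Python using only the standard library. It assumes the audio is stored in a WAV container, which the text does not state; the function name `read_pcm16_mono` and the constants are hypothetical and chosen only for this example.

```python
import struct
import wave

# Expected format, taken from the list above.
EXPECTED_RATE = 22050      # samples per second
EXPECTED_WIDTH = 2         # bytes per sample (16 bits)
EXPECTED_CHANNELS = 1      # mono


def read_pcm16_mono(path):
    """Return (sample_rate, samples) for a 16-bit mono PCM WAV file.

    Assumption: the recording is a WAV file. Raises ValueError if the
    file does not match the expected capture format.
    """
    with wave.open(path, "rb") as wav:
        if wav.getcomptype() != "NONE":
            raise ValueError("expected uncompressed PCM")
        if wav.getnchannels() != EXPECTED_CHANNELS:
            raise ValueError("expected a mono recording")
        if wav.getsampwidth() != EXPECTED_WIDTH:
            raise ValueError("expected 16-bit samples")
        if wav.getframerate() != EXPECTED_RATE:
            raise ValueError("expected a 22050 Hz sampling rate")
        raw = wav.readframes(wav.getnframes())
    # "<h" means little-endian signed 16-bit integer.
    samples = [value[0] for value in struct.iter_unpack("<h", raw)]
    return EXPECTED_RATE, samples
```

At a sampling rate of 22050 Hz the Nyquist frequency is 11025 Hz, so the band up to 10 kHz described above is fully represented.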
2.3.1.2 | End point detection and Silence removal 
The captured audio signal may contain silence at different positions, such as
the beginning of the signal, between the words of a sentence, and the end of the
signal. If silent frames are included, modeling resources are spent on parts of
the signal that do not contribute to the identification, so the silence must be
removed before further processing. There are several ways of doing this; the most
popular are Short-Time Energy and Zero-Crossing Rate. Both, however, are limited
by the need to set thresholds on an ad hoc basis. The algorithm we used relies on
the statistical properties of the background noise as well as on physiological
aspects of speech production, and does not assume any ad hoc threshold.

It assumes that the background noise present in the utterances is Gaussian in
nature. Usually the first 200 ms or more of a speech recording (we used 4410
samples at a sampling rate of 22050 samples/sec) corresponds to silence (or
background noise), because the speaker takes some time to begin reading after
recording starts.

Blocks shown in Fig. (2.5): Speech Signal, Silence removal, Pre-emphasis, Framing, Windowing, DFT, Mel Filter Bank, Log, IDFT, CMS, Delta, Energy. Outputs: 12 MFCC, 12 ΔMFCC, 12 ΔΔMFCC, 1 energy, 1 Δ energy, 1 ΔΔ energy.

Chapter 2 | Speech Recognition
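For comparison, the two threshold-based measures named in this section, Short-Time Energy and Zero-Crossing Rate, can be sketched per frame as follows. This is not the algorithm used in this chapter; it only illustrates the quantities on which an ad hoc threshold would have to be set. The function names, the non-overlapping framing, and the convention of treating a zero sample as non-negative are assumptions made for this example.

```python
def short_time_energy(frame):
    """Sum of squared sample values in one frame."""
    return sum(x * x for x in frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ.

    Assumption: a sample equal to zero is treated as non-negative.
    """
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def frames(samples, frame_len):
    """Split samples into consecutive non-overlapping frames.

    Any trailing samples that do not fill a whole frame are dropped.
    """
    return [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]
```

A threshold-based detector would label a frame as silence when these values fall below chosen limits, and choosing those limits is exactly the ad hoc step that the approach described in this section avoids.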
Endpoint Detection Algorithm:
Step 1:
Calculate the mean (μ) and standard deviation (σ) of the samples in the first
200 ms of the given utterance. The background noise is characterized by this μ
and σ.
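Step 1 can be sketched in Python as follows, under a few stated assumptions: the samples are already available as a list of numbers, the function name `estimate_noise` is hypothetical, and the population standard deviation is used because the text does not say whether the population or the sample form is intended (with 4410 samples the difference is negligible).

```python
import statistics

SAMPLE_RATE = 22050        # samples per second, from the capture format
NOISE_DURATION_S = 0.2     # first 200 ms, assumed to be background noise


def estimate_noise(samples, sample_rate=SAMPLE_RATE,
                   duration_s=NOISE_DURATION_S):
    """Return (mu, sigma) of the leading noise-only part of the utterance."""
    n = int(round(sample_rate * duration_s))   # 4410 samples at 22050 Hz
    head = samples[:n]
    if len(head) < 2:
        raise ValueError("recording is too short to estimate the noise")
    mu = statistics.fmean(head)
    # Assumption: population standard deviation (divides by N, not N - 1).
    sigma = statistics.pstdev(head, mu)
    return mu, sigma
```

With the default arguments the function uses the first 4410 samples, which matches the 200 ms window at 22050 samples/sec described above.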
