Chapter · July 012 citation reads 9,926 author



Starting from the capture of the audio signal, feature extraction consists of the
steps shown in the block diagram below:

Fig. (2.5): Pre-Processing and Feature Extraction

2.3.1.1 | Capture

The first step in processing speech is to convert the analog representation
(first air pressure, and then analog electric signals in a microphone) into a digital
signal x[n], where n is an index over time. Analysis of the audio spectrum shows
that nearly all energy resides in the band between DC and 4 kHz, and that beyond
10 kHz there is virtually no energy whatsoever.
Sound format used:
22050 Hz
16-bit, signed
Little endian
Mono channel
Uncompressed PCM
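
As an illustration only, the format listed above can be checked on a recorded WAV file before further processing. The sketch below uses Python's standard `wave` module; the function names `matches_capture_format` and `read_samples` are hypothetical helpers introduced here, not part of the original system, and the assumption is that recordings are stored as WAV files (which hold 16-bit PCM as signed little-endian integers).

```python
import struct
import wave

# Format values taken from the list above.
EXPECTED_RATE_HZ = 22050
EXPECTED_SAMPLE_WIDTH_BYTES = 2   # 16-bit samples
EXPECTED_CHANNELS = 1             # mono


def matches_capture_format(wav_file):
    """Return True if the WAV file (path or binary file object) is
    22050 Hz, 16-bit, mono, uncompressed PCM."""
    with wave.open(wav_file, "rb") as wav:
        return (
            wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
            and wav.getnchannels() == EXPECTED_CHANNELS
            and wav.getcomptype() == "NONE"  # "NONE" means uncompressed
        )


def read_samples(wav_file):
    """Read every frame of a 16-bit mono WAV file as a list of signed ints."""
    with wave.open(wav_file, "rb") as wav:
        raw = wav.readframes(wav.getnframes())
    count = len(raw) // EXPECTED_SAMPLE_WIDTH_BYTES
    # "<" = little endian, "h" = signed 16-bit integer
    return list(struct.unpack("<%dh" % count, raw))
```

The resulting list plays the role of the digital signal x[n] described above, with the list index acting as n.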
2.3.1.2 | End point detection and Silence removal
The captured audio signal may contain silence at several positions, such as the
beginning of the signal, between the words of a sentence, and the end of the
signal. If silent frames are included, modeling resources are spent on parts of the
signal that do not contribute to identification, so the silence must be removed
before further processing. There are several ways of doing this; the most popular
are Short-Time Energy and Zero-Crossing Rate. Both, however, are limited by the
need to set thresholds on an ad hoc basis. The algorithm we used relies on the
statistical properties of background noise as well as physiological aspects of
speech production, and does not assume any ad hoc threshold.

[Fig. (2.5) block diagram. Input: Speech Signal. Blocks: Silence removal,
Pre-emphasis, Framing, Windowing, DFT, Mel Filter Bank, Log, IDFT, CMS, Delta,
Energy. Output features: 12 MFCC, 12 ΔMFCC, 12 ΔΔMFCC, 1 energy, 1 Δ energy,
1 ΔΔ energy.]

It assumes that the background noise present in the utterances is Gaussian in
nature. Usually the first 200 ms or more of a speech recording (we used 4410
samples at a sampling rate of 22050 samples/s) corresponds to silence (or
background noise), because the speaker takes some time to start reading after
recording starts.
Endpoint Detection Algorithm:
Step 1:
Calculate the mean (μ) and standard deviation (σ) of the samples in the first
200 ms of the given utterance. The background noise is characterized by this μ and σ.
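
Step 1 can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the function name `noise_statistics` and its parameters are hypothetical, and the use of the population standard deviation (dividing by N rather than N - 1) is an assumption, since the text does not say which estimator was used. At 22050 samples/s, 200 ms corresponds to 22050 × 0.2 = 4410 samples.

```python
import statistics

SAMPLE_RATE_HZ = 22050   # sampling rate stated in the text
NOISE_WINDOW_S = 0.2     # first 200 ms, assumed to contain only background noise


def noise_statistics(samples, sample_rate_hz=SAMPLE_RATE_HZ,
                     window_s=NOISE_WINDOW_S):
    """Return (mu, sigma) of the leading noise-only window of an utterance."""
    n = int(round(sample_rate_hz * window_s))   # 4410 samples at 22050 Hz
    if len(samples) < n:
        raise ValueError("utterance is shorter than the noise window")
    window = samples[:n]
    mu = statistics.fmean(window)
    # Assumption: population standard deviation (divide by N, not N - 1).
    sigma = statistics.pstdev(window, mu)
    return mu, sigma
```

For example, calling `noise_statistics(x)` on a list `x` of samples returns the pair (μ, σ) that characterizes the background noise in this step.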
