
Voice DSP Processing IV

  • Yaakov J. Stein

  • Chief Scientist RAD Data Communications


Voice DSP

  • Part 1 Speech biology and what we can learn from it

  • Part 2 Speech DSP (AGC, VAD, features, echo cancellation)

  • Part 3 Speech compression techniques

  • Part 4 Speech Recognition



Voice DSP - Part 4



Speech Recognition Tasks

  • ASR - Automatic Speech Recognition (speech-to-text)

    • Language/Accent Identification
    • Keyword-spotting/Gisting/Task-related
    • Isolated-word/Connected-word/Continuous-speech
    • Speaker-dependent/Speaker-adaptive/Speaker-independent
    • Small-vocabulary/Large-vocabulary/Arbitrary-input
    • Constrained-grammar/Free-speech
  • Speaker Recognition

    • Gender-only
    • Text-dependent/Text-independent
    • Speaker-spotting/Speaker-verification/Forensic
    • Closed-set/open-set
  • Misc

    • Emotional-state/Voice-polygraph
    • Translating-telephone
    • Singing


Basic ASR Engine



Acoustic Processing

  • … All the processing we have learned before

  • AGC

  • VAD

  • Filtering

  • Echo/noise cancellation

  • Pre-emphasis

  • Mel/Bark scale spectral analysis

  • Filter banks

  • LPC

  • Cepstrum

  • ...
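As a concrete example of one of the front-end steps listed above, here is a minimal sketch of pre-emphasis; the coefficient 0.97 is a common textbook choice, not one specified in these slides:

```python
# Pre-emphasis: first-order high-pass filter y[n] = x[n] - a*x[n-1].
# It boosts high frequencies to flatten the typical speech spectrum
# before spectral analysis. a = 0.97 is a conventional value (assumed).

def pre_emphasis(x, a=0.97):
    y = [x[0]]                       # first sample passes through unchanged
    for n in range(1, len(x)):
        y.append(x[n] - a * x[n - 1])
    return y

signal = [1.0, 1.0, 1.0, 1.0]        # a flat (DC) input ...
print(pre_emphasis(signal))          # ... is almost entirely suppressed
```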



Phonetic Labeling

  • Some ASR systems attempt to label phonemes

  • Others don’t label at all, or label other pseudo-acoustic entities

  • Labeling simplifies overall engine architecture



Phonetic Labeling - cont.

  • Peterson - Barney data - an attempt at labeling in formant space



Phonetic Labeling - cont.

  • Phonetic labeling is a classical Pattern Recognition task

  • Need independence of, or adaptation to, channel, speaker, speed, etc

  • Pattern recognition can be computationally complex

  • Commonly used features:

    • LPC
    • LPC cepstrum (shown to be optimal in some sense)
    • (mel/Bark scale) filter-bank representation
    • RASTA (good for cross-channel applications)
    • Cochlea model features (high dimensionality)
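The mel scale used by the filter-bank features above maps frequency to perceptual pitch. A small sketch using the common 2595·log10(1 + f/700) formula; these constants are the usual textbook values, assumed here rather than taken from the slides:

```python
import math

# Hz -> mel mapping (a common formulation); near-linear below ~1 kHz and
# logarithmic above, mimicking the cochlea's frequency resolution.
def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# By construction 1000 Hz maps to roughly 1000 mel.
print(round(hz_to_mel(1000.0), 1))   # 1000.0
```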


Pattern Recognition Quick Review

  • What is Pattern Recognition ?

    • classification of real-world objects
  • Not a unified field (more ART than SCIENCE)

  • Not trivial (or even well-defined)

  • 1, 2, 3, 4, 5, 6, 7,

  • and the answer is ...



Pattern Recognition - approaches

  • Approaches to Pattern Recognition

    • Statistical (Decision Theoretic)
    • Syntactical (Linguistic)
  • Syntactical Method

    • Describe classes by rules (grammar) in a formal language
    • Identify pattern's class by grammar it obeys
    • Reduces classification to string parsing
    • Applications: Fingerprinting, Scene analysis, Chinese OCR
  • Statistical Method (here concentrate on this)

    • Describe patterns as points in n dimensional vector space
    • Describe classes as hypervolumes or statistical clouds
    • Reduces classification to geometry or function evaluation
    • Applications: Signal classification, Speech, Latin OCR


PR Approaches - examples

                   Class A           Class B           Class C
  Statistical      ( 1, 0, 0 )       ( 0, 1, 0 )       ( 0, 0, 1 )
                   (0.9, 0.1, 0)     (0.1, 1, -0.1)    (-0.1, 0.1, 1)
                   (1.1, 0, 0.1)     (0, 0.9, -0.1)    (0.1, 0, 1.1)

  Syntactic        ( 1, 1, 1 )       ( 1, 2, 3 )       ( 1, 2, 4 )
                   ( 2, 2, 2 )       ( 2, 3, 4 )       ( 2, 4, 8 )
                   ( 3, 3, 3 )       ( 3, 4, 5 )       ( 3, 6, 12 )



Classifier types

  • Decision theoretic pattern recognizers come in three types:

  • Direct probability estimation

    • 1NN, kNN, Parzen, LBG, LVQ
  • Hypersphere

    • potentials, Mahalanobis (MAP for Gauss), RBF, RCE
  • Hyperplane



Learning Theory

  • Decision theoretic pattern recognizer is usually trained

  • Training (learning) algorithms come in three types

    • Supervised (learning by examples, query learning)
    • Reinforcement (good-dog/bad-dog, TD)
    • Unsupervised (clustering, VQ)
  • Cross Validation:

  • In order not to fall into the generalization trap we need

    • training set
    • validation set
    • test set (untainted, fair estimate of generalization)
  • Probably Approximately Correct Learning



Why ASR is not pattern recognition

  • pneumonoultramicroscopicsilicovolcanoconiosis

  • A word is a variable-length, time-warped acoustic sequence, not a fixed-dimension feature vector, so static classification alone is not enough



Time Warping

  • The Real Problem in ASR - we have to correct for the time warping

  • Note that since the distortion is time-variant it is not a filter!

  • One way to deal with such warping is to use Dynamic Programming

  • The main DP algorithm has many names

    • Viterbi algorithm
    • Levenshtein distance
    • DTW
  • but they are all basically the same thing!

  • The algorithm(s) are computationally efficient

    • since they find a global minimum based on local decisions


Levenshtein Distance

  • Easy case to demonstrate - distance between two strings

    • Useful in spelling checkers, OCR/ASR post-processors, etc
  • There are three possible errors

    • Deletion digital - digtal
    • Insertion signal - signall
    • Substitution processing - prosessing
  • The Levenshtein distance is the minimal number of errors

    • distance between dijitl and digital is 2
  • How do we find the minimal number of errors?

  • Algorithm is best understood graphically



Levenshtein distance - cont.

  • What is the Levenshtein distance between prossesing and processing ?



Levenshtein distance - cont.



Levenshtein distance - cont.

  • Continue filling in table

  9
  8
  7
  6
  5
  4 3 2 2 2
  3 2 1 1 2
  2 1 0 1 2
  1 0 1 2 3
  0 1 2 3 4 5 6 7 8 9



Levenshtein distance - cont.

  • Finish filling in table

  9 8 7 7 6 5 5 5 4 3
  8 7 6 6 5 4 4 4 3 4
  7 6 5 5 4 3 3 3 4 5
  6 5 4 4 3 2 3 4 4 5
  5 4 3 3 2 3 3 3 4 5
  4 3 2 2 2 2 2 3 4 5
  3 2 1 1 2 2 3 4 5 6
  2 1 0 1 2 3 4 5 6 7
  1 0 1 2 3 4 5 6 7 8
  0 1 2 3 4 5 6 7 8 9



Levenshtein distance - end

  • Backtrack to find path actually taken

  9 8 7 7 6 5 5 5 4 3
  8 7 6 6 5 4 4 4 3 4
  7 6 5 5 4 3 3 3 4 5
  6 5 4 4 3 2 3 4 4 5
  5 4 3 3 2 3 3 3 4 5
  4 3 2 2 2 2 2 3 4 5
  3 2 1 1 2 2 3 4 5 6
  2 1 0 1 2 3 4 5 6 7
  1 0 1 2 3 4 5 6 7 8
  0 1 2 3 4 5 6 7 8 9



Generalization to DP

  • What if not all substitutions are equally probable?

    • Then we add a cost function instead of 1
  • We can also have costs for insertions and deletions

  • Di,j = min ( Di-1,j + Cdel(Ii) ,  Di-1,j-1 + Csub(Ii, Jj) ,  Di,j-1 + Cins(Jj) )

  • (Ii and Jj are the i-th and j-th symbols of the two strings being aligned)

  • Even more general rules are often used
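A sketch of the generalized recursion with per-operation cost functions; the cheap 's'/'c' confusion cost below is purely illustrative:

```python
# Generalized DP distance: per-operation cost functions replace the unit
# costs of plain Levenshtein distance. All cost values here are invented.

def weighted_edit_distance(a, b, ins_cost, del_cost, sub_cost):
    d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = d[i - 1][0] + del_cost(a[i - 1])
    for j in range(1, len(b) + 1):
        d[0][j] = d[0][j - 1] + ins_cost(b[j - 1])
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + del_cost(a[i - 1]),
                          d[i][j - 1] + ins_cost(b[j - 1]),
                          d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]))
    return d[len(a)][len(b)]

# Example: 's' <-> 'c' confusions cost 0.5, any other substitution costs 1.
cheap_sc = lambda x, y: 0.0 if x == y else (0.5 if {x, y} == {"s", "c"} else 1.0)
print(weighted_edit_distance("prosessing", "processing",
                             lambda c: 1.0, lambda c: 1.0, cheap_sc))  # 0.5
```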



DTW

  • DTW uses the same technique for matching spoken words

  • The input is separately matched to each dictionary word

    • and the word with the least distortion is the winner!
  • When waveforms are used the comparison measure is:

    • correlation, segmental SNR, Itakura-Saito, etc
  • When (N) features are used the comparison is

    • (N-dimensional) Euclidean distance
  • With DTW there is no phonetic labeling: alignment and dictionary matching are performed together
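A minimal DTW sketch along these lines, with per-frame Euclidean distance and none of the endpoint or slope constraints a production recognizer would add:

```python
import math

# DTW between two feature sequences (lists of equal-length frame vectors),
# using Euclidean distance per frame and the same DP recursion as the
# Levenshtein table, but with no insertion/deletion penalty terms.

def frame_dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dtw(x, y):
    INF = float("inf")
    d = [[INF] * (len(y) + 1) for _ in range(len(x) + 1)]
    d[0][0] = 0.0
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            d[i][j] = frame_dist(x[i - 1], y[j - 1]) + min(
                d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[len(x)][len(y)]

# A sequence aligns to a time-stretched copy of itself at zero cost,
# while a genuinely different sequence scores much higher.
fast = [[0.0], [1.0], [2.0]]
slow = [[0.0], [0.0], [1.0], [1.0], [2.0]]
other = [[5.0], [5.0], [5.0]]
print(dtw(fast, slow))    # 0.0
print(dtw(fast, other))   # 12.0
```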



DTW - cont.

  • Some more details:

  • In isolated word recognition systems

    • energy contours are used to cut the words
    • linear time warping is then used to normalize the utterance
    • special mechanisms are used for endpoint location flexibility
      • so there are endpoint and path constraints
  • In connected word recognition systems

  • In speaker-independent recognition systems

    • we need multiple templates for each reference word
    • the number of templates can be reduced by VQ


Markov Models



Markov Models - cont.

  • General DT Markov Model

  • Model jumps from state to state with given probabilities

  • e.g. 1 1 1 1 2 2 3 3 3 3 3 3 3 3 4 4 4

  • or 1 1 2 2 2 2 2 2 2 2 2 4 4 4 (left-to-right (LR) models)
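State sequences like those above can be sampled from a model in a few lines; the 4-state left-to-right transition matrix below is invented for illustration (each state either stays put or advances by one):

```python
import random

# A 4-state left-to-right (LR) transition matrix A[s][s'] (invented values).
A = [[0.7, 0.3, 0.0, 0.0],
     [0.0, 0.6, 0.4, 0.0],
     [0.0, 0.0, 0.8, 0.2],
     [0.0, 0.0, 0.0, 1.0]]

def sample_states(A, start, steps, rng):
    s, seq = start, [start]
    for _ in range(steps):
        r, acc = rng.random(), 0.0
        for nxt, p in enumerate(A[s]):   # inverse-CDF draw of the next state
            acc += p
            if r < acc:
                s = nxt
                break
        seq.append(s)
    return seq

rng = random.Random(0)
# Prints a non-decreasing 1-based state sequence, e.g. runs of 1s, 2s, ...
print([s + 1 for s in sample_states(A, 0, 16, rng)])
```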



Markov Models - cont.

  • Why use Markov models for speech recognition?

    • States can represent phonemes (or whatever)
    • Different phoneme durations (but exponentially decaying)
    • Phoneme deletions using 2nd or higher order
  • So time warping is automatic !

  • We build a Markov model for each word

    • given an utterance, select the most probable word


HMM



HMM - cont.

  • For a given state sequence S1 S2 S3 … ST

  • the probability of an observation sequence O1 O2 O3 … OT

  • is P(O|S) = bS1O1 bS2O2 bS3O3 … bSTOT

  • For a given hidden Markov model M = { π, a, b }

  • the probability of the state sequence S1 S2 S3 … ST

  • is (where πS1 is the initial probability of state S1)

  • P(S|M) = πS1 aS1S2 aS2S3 aS3S4 … aST-1ST

  • So, for a given hidden Markov model M

  • the probability of an observation sequence O1 O2 O3 … OT

  • is obtained by summing over all possible state sequences



HMM - cont.

  • P(O|M) = ΣS P(O|S) P(S|M)

  • = ΣS πS1 bS1O1 aS1S2 bS2O2 aS2S3 bS3O3 …

  • So for an observation sequence we can find

    • the probability of any word model
  • (but this can be made more efficient using the forward-backward algorithm)

  • How do we train HMM models?

  • The full algorithm (Baum-Welch) is an EM algorithm

  • An approximate algorithm is based on the Viterbi (DP) algorithm

  • (Sorry, but beyond our scope here)
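The sum over all state sequences, and its efficient forward-algorithm equivalent, can be checked numerically on a toy model; all probabilities below are invented for illustration:

```python
from itertools import product

# P(O|M) = sum over all state sequences S of pi(S1) * prod(a) * prod(b).
# Brute force enumerates all N^T sequences; the forward recursion computes
# the same quantity in O(N^2 T). Toy 2-state model (invented numbers):

pi = [0.6, 0.4]                    # initial state probabilities
a = [[0.7, 0.3], [0.4, 0.6]]       # transition probabilities a[s][s']
b = [[0.9, 0.1], [0.2, 0.8]]       # observation probabilities b[s][o]
O = [0, 1, 1, 0]                   # an observation sequence

def brute_force(pi, a, b, O):
    N, T = len(pi), len(O)
    total = 0.0
    for S in product(range(N), repeat=T):   # every possible state sequence
        p = pi[S[0]] * b[S[0]][O[0]]
        for t in range(1, T):
            p *= a[S[t - 1]][S[t]] * b[S[t]][O[t]]
        total += p
    return total

def forward(pi, a, b, O):
    N = len(pi)
    alpha = [pi[s] * b[s][O[0]] for s in range(N)]
    for o in O[1:]:                          # one DP step per observation
        alpha = [sum(alpha[s] * a[s][s2] for s in range(N)) * b[s2][o]
                 for s2 in range(N)]
    return sum(alpha)

print(abs(brute_force(pi, a, b, O) - forward(pi, a, b, O)) < 1e-12)  # True
```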



State-of-the-Art

               Isolated digits   Isolated words   Constrained   Free speech
  Spkr dep         100%              98%              95%           90%
  Spkr ind          98%              95%              85%           70%

  • NB 95% on words is about 50% on sentences (0.95^13 ≈ 0.5 for a typical 13-word sentence)

    • only > 97% is considered useful for transcription
    • (otherwise more time consuming to correct than to type)


