Yaakov J. Stein Chief Scientist RAD Data Communications
Voice DSP Part 1 Speech biology and what we can learn from it Part 2 Speech DSP (AGC, VAD, features, echo cancellation) Part 3 Speech compression techniques Part 4 Speech Recognition
Voice DSP - Part 4 Speech Recognition tasks ASR Engine Pattern Recognition and Phonetic labeling DTW HMM State-of-the-Art
Speech Recognition Tasks ASR - Automatic Speech Recognition (speech-to-text) - Language/Accent Identification
- Keyword-spotting/Gisting/Task-related
- Isolated-word/Connected-word/Continuous-speech
- Speaker-dependent/Speaker-adaptive/Speaker-independent
- Small-vocabulary/Large-vocabulary/Arbitrary-input
- Constrained-grammar/Free-speech
Speaker Recognition - Gender-only
- Text-dependent/Text-independent
- Speaker-spotting/Speaker-verification/Forensic
- Closed-set/open-set
Misc - Emotional-state/Voice-polygraph
- Translating-telephone
- Singing
Basic ASR Engine
Acoustic Processing AGC VAD Filtering Echo/noise cancellation Pre-emphasis Mel/Bark scale spectral analysis Filter banks LPC Cepstrum ...
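As an illustration of two of the front-end steps above, here is a minimal Python/NumPy sketch of pre-emphasis and Hamming-windowed framing; the 0.97 coefficient, 8 kHz rate, 25 ms frame length and 10 ms hop are common illustrative choices, not values taken from these slides.

import numpy as np

def preemphasize(x, alpha=0.97):
    """First-order pre-emphasis filter y[n] = x[n] - alpha * x[n-1]."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_signal(x, fs=8000, frame_ms=25, hop_ms=10):
    """Split a speech signal into overlapping Hamming-windowed frames.
    Assumes len(x) is at least one frame long."""
    frame_len = int(fs * frame_ms / 1000)
    hop_len = int(fs * hop_ms / 1000)
    n_frames = 1 + (len(x) - frame_len) // hop_len
    window = np.hamming(frame_len)
    frames = np.stack([x[i * hop_len : i * hop_len + frame_len] * window
                       for i in range(n_frames)])
    return frames   # shape (n_frames, frame_len), ready for spectral analysis

# usage: frames = frame_signal(preemphasize(speech))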
Phonetic labeling Some ASR systems attempt to label phonemes Others don’t label at all, or label other pseudo-acoustic entities Labeling simplifies overall engine architecture - Changing speaker/language/etc. has less system impact
- Later stages are DSP-free
Phonetic Labeling - cont. Peterson - Barney data - an attempt at labeling in formant space
Phonetic Labeling - cont. Phonetic labeling is a classical Pattern Recognition task - Need independence of channel, speaker, speed, etc. Pattern recognition can be computationally complex - so feature extraction is often performed for
- data dimensionality reduction (but always loses information)
Commonly used features: - LPC
- LPC cepstrum (shown to be optimal in some sense)
- (mel/Bark scale) filter-bank representation
- RASTA (good for cross-channel applications)
- Cochlea model features (high dimensionality)
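To make the LPC-cepstrum item concrete, here is a sketch of the standard recursion that converts LPC coefficients into cepstral coefficients; the sign convention (all-pole model 1 / (1 - sum_k a_k z^-k)) is an assumption and must match the LPC analysis actually used.

import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """LPC coefficients a[1..p] of an all-pole model 1/(1 - sum_k a_k z^-k)
    -> cepstral coefficients c[1..n_ceps], via the recursion
    c_n = a_n + sum_{k=1}^{n-1} (k/n) c_k a_{n-k}   (a_m = 0 for m > p)."""
    p = len(a)
    a_ext = np.concatenate(([0.0], np.asarray(a, dtype=float)))  # 1-based: a_ext[1..p]
    c = np.zeros(n_ceps + 1)                                     # 1-based: c[1..n_ceps]
    for n in range(1, n_ceps + 1):
        c[n] = a_ext[n] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            c[n] += (k / n) * c[k] * a_ext[n - k]
    return c[1:]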
Pattern Recognition Quick Review What is Pattern Recognition ? - classification of real-world objects
Not a unified field (more ART than SCIENCE) Not trivial (or even well-defined) 1, 2, 3, 4, 5, 6, 7, and the answer is ...
Pattern Recognition - approaches Approaches to Pattern Recognition - Statistical (Decision Theoretic)
- Syntactical (Linguistic)
Syntactical Method - Describe classes by rules (grammar) in a formal language
- Identify pattern's class by grammar it obeys
- Reduces classification to string parsing
- Applications: Fingerprinting, Scene analysis, Chinese OCR
Statistical Method (we concentrate on this here) - Describe patterns as points in n dimensional vector space
- Describe classes as hypervolumes or statistical clouds
- Reduces classification to geometry or function evaluation
- Applications: Signal classification, Speech, Latin OCR
PR Approaches - examples
Statistical examples
  Class A: ( 1, 0, 0 )   (0.9, 0.1, 0)    (1.1, 0, 0.1)
  Class B: ( 0, 1, 0 )   (0.1, 1, -0.1)   (0, 0.9, -0.1)
  Class C: ( 0, 0, 1 )   (-0.1, 0.1, 1)   (0.1, 0, 1.1)
Syntactic examples
  Class A: ( 1, 1, 1 )   ( 2, 2, 2 )   ( 3, 3, 3 )
  Class B: ( 1, 2, 3 )   ( 2, 3, 4 )   ( 3, 4, 5 )
  Class C: ( 1, 2, 4 )   ( 2, 4, 8 )   ( 3, 6, 12 )
Classifier types Decision theoretic pattern recognizers come in three types:
Direct probability estimation - 1NN, kNN, Parzen, LBG, LVQ
Hypersphere - potentials, Mahalanobis (MAP for Gauss), RBF, RCE
Hyperplane - Karhunen-Loeve, Fisher discriminant, Gaussian mixture classifiers, CART, MLP
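A minimal sketch of the first family - a k-nearest-neighbour classifier over feature vectors (plain NumPy; the value of k and the training data are whatever the application supplies):

import numpy as np

def knn_classify(x, train_X, train_y, k=3):
    """Classify feature vector x by majority vote among the k
    training vectors closest to it in Euclidean distance."""
    dists = np.linalg.norm(train_X - x, axis=1)      # distance to every training point
    nearest = np.argsort(dists)[:k]                  # indices of the k closest
    votes = train_y[nearest]
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]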
Learning Theory A decision theoretic pattern recognizer is usually trained Training (learning) algorithms come in three types - Supervised (learning by examples, query learning)
- Reinforcement (good-dog/bad-dog, TD)
- Unsupervised (clustering, VQ)
Cross Validation: In order not to fall into the generalization trap we need - training set
- validation set
- test set (untainted, fair estimate of generalization)
Probably Approximately Correct Learning - teacher and student
- VC dimension - strength of classifier
- Limit on generalization error
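A small sketch of the three-way data split described above; the 60/20/20 proportions and the fixed random seed are illustrative assumptions:

import numpy as np

def three_way_split(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the data once, then carve off validation and test sets.
    The test set is held out untouched for a fair generalization estimate."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_frac)
    n_val = int(len(X) * val_frac)
    test, val, train = idx[:n_test], idx[n_test:n_test + n_val], idx[n_test + n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])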
Why ASR is not pattern recognition pneumonoultramicroscopicsilicovolcanoconiosis
Time Warping The Real Problem in ASR - we have to correct for the time warping Note that since the distortion is time-variant it is not a filter! One way to deal with such warping is to use Dynamic Programming The main DP algorithm has many names - Viterbi algorithm
- Levenshtein distance
- DTW
but they are all basically the same thing! The algorithm(s) are computationally efficient
Levenshtein Distance Easy case to demonstrate - distance between two strings - Useful in spelling checkers, OCR/ASR post-processors, etc
There are three possible errors - Deletion digital - digtal
- Insertion signal - signall
- Substitution processing - prosessing
The Levenshtein distance is the minimal number of errors - distance between dijitl and digital is 2
How do we find the minimal number of errors? Algorithm is best understood graphically
Levenshtein distance - cont. What is the Levenshtein distance between prossesing and processing ?
Levenshtein distance - cont. Start with 0 in the bottom left corner, filling the first column and the bottom row:
9
8
7
6
5
4
3
2
1
0 1 2 3 4 5 6 7 8 9
Levenshtein distance - cont. Fill the table cell by cell: each entry is the smallest of its left neighbour plus 1, its lower neighbour plus 1, and its lower-left diagonal neighbour plus 0 for matching letters or plus 1 for a mismatch [partially filled table omitted]
Levenshtein distance - cont. Finish filling in the table:
9 8 7 7 6 5 5 5 4 3
8 7 6 6 5 4 4 4 3 4
7 6 5 5 4 3 3 3 4 5
6 5 4 4 3 2 3 4 4 5
5 4 3 3 2 3 3 3 4 5
4 3 2 2 2 2 2 3 4 5
3 2 1 1 2 2 3 4 5 6
2 1 0 1 2 3 4 5 6 7
1 0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8 9
The value in the top right corner, 3, is the Levenshtein distance.
Levenshtein distance - end Backtrack to find the path actually taken:
9 8 7 7 6 5 5 5 4 3
8 7 6 6 5 4 4 4 3 4
7 6 5 5 4 3 3 3 4 5
6 5 4 4 3 2 3 4 4 5
5 4 3 3 2 3 3 3 4 5
4 3 2 2 2 2 2 3 4 5
3 2 1 1 2 2 3 4 5 6
2 1 0 1 2 3 4 5 6 7
1 0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8 9
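The table-filling procedure above is only a few lines of code; a minimal sketch in Python with unit costs for insertion, deletion and substitution:

def levenshtein(a, b):
    """Edit distance between strings a and b by dynamic programming.
    D[i][j] = distance between the first i chars of a and the first j chars of b."""
    D = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        D[i][0] = i                                   # i deletions
    for j in range(len(b) + 1):
        D[0][j] = j                                   # j insertions
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,            # deletion
                          D[i][j - 1] + 1,            # insertion
                          D[i - 1][j - 1] + sub)      # match / substitution
    return D[len(a)][len(b)]

# levenshtein("dijitl", "digital")       == 2
# levenshtein("prossesing", "processing") == 3, the corner value of the table above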
Generalization to DP What if not all substitutions are equally probable? - Then we add a cost function instead of 1
We can also have costs for insertions and deletions
D(i,j) = min( D(i-1,j) + C[(i-1,j)→(i,j)], D(i-1,j-1) + C[(i-1,j-1)→(i,j)], D(i,j-1) + C[(i,j-1)→(i,j)] )
where C[(i',j')→(i,j)] is the cost of moving from cell (i',j') to cell (i,j)
Even more general rules are often used
DTW DTW uses the same technique for matching spoken words The input is separately matched to each dictionary word - and the word with the least distortion is the winner!
When waveforms are used the comparison measure is: - correlation, segmental SNR, Itakura-Saito, etc
When (N) features are used the comparison is - (N-dimensional) Euclidean distance
With DTW there is no phonetic labeling stage - the input is matched directly against whole-word templates
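A minimal DTW sketch along these lines, using Euclidean distance between feature frames and no path or endpoint constraints (the refinements on the next slide are omitted); the function names and the template dictionary shape are illustrative assumptions:

import numpy as np

def dtw_distance(X, Y):
    """Dynamic time warping distance between two feature-vector sequences
    X (Tx x N) and Y (Ty x N), using Euclidean distance between frames."""
    Tx, Ty = len(X), len(Y)
    D = np.full((Tx + 1, Ty + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Tx + 1):
        for j in range(1, Ty + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])
            D[i, j] = cost + min(D[i - 1, j],        # stretch X
                                 D[i, j - 1],        # stretch Y
                                 D[i - 1, j - 1])    # advance both
    return D[Tx, Ty]

def recognize(utterance, templates):
    """templates: dict word -> reference feature sequence.
    Return the word whose template warps to the utterance with least distortion."""
    return min(templates, key=lambda w: dtw_distance(utterance, templates[w]))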
DTW - cont. Some more details: In isolated word recognition systems - energy contours are used to cut the words
- linear time warping is then used to normalize the utterance
- special mechanisms are used for endpoint location flexibility
- so there are endpoint and path constraints
In connected word recognition systems - the endpoint of each recognized utterance is used
- as a starting point for searching for the next word
In speaker-independent recognition systems - we need multiple templates for each reference word
- the number of templates can be reduced by VQ
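A sketch of the VQ template-reduction idea - plain batch k-means, which is the core of the LBG procedure; the codebook size and iteration count are illustrative:

import numpy as np

def vq_codebook(vectors, codebook_size=16, n_iter=20, seed=0):
    """Cluster reference feature vectors into a small VQ codebook."""
    rng = np.random.default_rng(seed)
    codebook = vectors[rng.choice(len(vectors), codebook_size, replace=False)].astype(float)
    for _ in range(n_iter):
        # assign every vector to its nearest codeword
        dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
        nearest = np.argmin(dists, axis=1)
        # move each codeword to the centroid of its cell
        for c in range(codebook_size):
            members = vectors[nearest == c]
            if len(members):
                codebook[c] = members.mean(axis=0)
    return codebook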
Markov Models An alternative to DTW is based on Markov Models A discrete-time (DT), left-to-right (LR), second order Markov model
Markov Models - cont. General DT Markov Model Model jumps from state to state with given probabilities e.g. 1 1 1 1 2 2 3 3 3 3 3 3 3 3 4 4 4 or 1 1 2 2 2 2 2 2 2 2 2 4 4 4 (LR models)
Markov Models - cont. Why use Markov models for speech recognition? - States can represent phonemes (or whatever)
- Different phoneme durations (but exponentially decaying)
- Phoneme deletions using 2nd or higher order
So time warping is automatic ! We build a Markov model for each word - given an utterance, select the most probable word
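A sketch of word selection with (non-hidden) Markov models, assuming each utterance has already been labeled as a sequence of state (phoneme) indices, 0-based here, and each word has its own initial distribution pi and transition matrix A; zero-probability transitions would need flooring in practice:

import numpy as np

def sequence_log_prob(states, pi, A):
    """Log-probability of an observed state sequence under a Markov model
    with initial distribution pi and transition matrix A."""
    logp = np.log(pi[states[0]])
    for s_prev, s_next in zip(states[:-1], states[1:]):
        logp += np.log(A[s_prev, s_next])
    return logp

def most_probable_word(states, word_models):
    """word_models: dict word -> (pi, A); return the best-scoring word."""
    return max(word_models,
               key=lambda w: sequence_log_prob(states, *word_models[w]))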
HMM But the same phoneme can be said in different ways - so we need a Hidden Markov Model
HMM - cont. For a given state sequence S1 S2 S3 … ST the probability of an observation sequence O1 O2 O3 … OT is
P(O|S) = b_S1(O1) b_S2(O2) b_S3(O3) … b_ST(OT)
For a given hidden Markov model M = { π, a, b } the probability of the state sequence S1 S2 S3 … ST is
P(S|M) = π_S1 a_S1S2 a_S2S3 a_S3S4 … a_ST-1ST
So, for a given hidden Markov model M the probability of an observation sequence O1 O2 O3 … OT is obtained by summing over all possible state sequences
HMM - cont. P(O|M) = Σ_S P(O|S) P(S|M) = Σ_S π_S1 b_S1(O1) a_S1S2 b_S2(O2) a_S2S3 b_S3(O3) … So for an observation sequence we can find - the probability of any word model
(but this can be made more efficient using the forward-backward algorithm) How do we train HMM models? The full algorithm (Baum-Welch) is an EM algorithm An approximate algorithm is based on the Viterbi (DP) algorithm (Sorry, but beyond our scope here)
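For completeness, a sketch of the forward recursion that computes P(O|M) without enumerating all state sequences, for a discrete-observation HMM; in practice the probabilities are scaled or kept in the log domain to avoid underflow:

import numpy as np

def forward_prob(obs, pi, a, b):
    """P(observation sequence | HMM) via the forward algorithm.
    obs: list of observation symbol indices
    pi:  initial state probabilities, shape (N,)
    a:   state transition probabilities, shape (N, N)
    b:   emission probabilities, shape (N, M)"""
    alpha = pi * b[:, obs[0]]            # alpha_1(i) = pi_i * b_i(O_1)
    for o in obs[1:]:
        alpha = (alpha @ a) * b[:, o]    # alpha_t(j) = [sum_i alpha_{t-1}(i) a_ij] * b_j(O_t)
    return alpha.sum()                   # P(O|M) = sum_i alpha_T(i)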
State-of-the-Art
                Isolated digits   Isolated words   Constrained   Free speech
Spkr dep             100%               98%             95%           90%
Spkr ind              98%               95%             85%           70%
NB 95% on words is about 50% on sentences - only > 97% is considered useful for transcription
- (otherwise more time consuming to correct than to type)