*Yaakov J. Stein*
## Chief Scientist RAD Data Communications
**Voice DSP** ## Part 1 Speech biology and what we can learn from it ## Part 2 Speech DSP (AGC, VAD, features, echo cancellation) ## Part 3 Speech compression techiques ## Part 4 Speech Recognition
**Voice DSP - Part 4** ## Speech Recognition tasks ## ASR Engine ## DTW ## HMM ## State-of-the-Art
**Speech Recognition Tasks** ## ASR - Automatic Speech Recognition (speech-to-text) - Language/Accent Identification
- Keyword-spotting/Gisting/Task-related
- Isolated-word/Connected-word/Continuous-speech
- Speaker-dependent/Speaker-adaptive/Speaker-independent
- Small-vocabulary/Large-vocabulary/Arbitrary-input
- Constrained-grammar/Free-speech
## Speaker Recognition - Gender-only
- Text-dependent/Text-independent
- Speaker-spotting/Speaker-verification/Forensic
- Closed-set/open-set
## Misc - Emotional-state/Voice-polygraph
- Translating-telephone
- Singing
**Basic ASR Engine**
**Acoustic Processing** ## … All the processing we have learned before ## AGC ## VAD ## Filtering ## Pre-emphasis ## Mel/Bark scale spectral analysis ## Filter banks ## LPC ## Cepstrum ## ...
**Phonetic labeling** ## Some ASR systems attempt to label phonemes ## Others don’t label at all, or label other pseudo-acoustic entities ## Labeling simplifies overall engine architecture
**Phonetic Labeling - cont.** ## Peterson - Barney data - an attempt at labeling in formant space
**Phonetic Labeling - cont.** ## Phonetic labeling is a classical Pattern Recognition task ## independence of ## Need channel, speaker, speed, etc ## adaptation to ## Pattern recognition can be computationally complex ## Commonly used features: - LPC
- LPC cepstrum (shown to be optimal in some sense)
- (mel/Bark scale) filter-bank representation
- RASTA (good for cross-channel applications)
- Cochlea model features (high dimensionality)
**Pattern Recognition Quick Review ** ## What is Pattern Recognition ? - classification of real-world objects
## Not a unified field (more ART than SCIENCE) ## Not trivial (or even well-defined) ## 1, 2, 3, 4, 5, 6, 7, ## and the answer is ...
**Pattern Recognition - approaches** ## Approaches to Pattern Recognition - Statistical (Decision Theoretic)
- Syntactical (Linguistic)
## Syntactical Method - Describe classes by rules (grammar) in a formal language
- Identify pattern's class by grammar it obeys
- Reduces classification to string parsing
- Applications: Fingerprinting, Scene analysis, Chinese OCR
## Statistical Method (here concentrate on this) - Describe patterns as points in n dimensional vector space
- Describe classes as hypervolumes or statistical clouds
- Reduces classification to geometry or function evaluation
- Applications: Signal classification, Speech, Latin OCR
**PR Approaches - examples** ## Class A Class B Class C ## Statistical ( 1, 0, 0 ) ( 0, 1, 0 ) ( 0, 0, 1 ) ## (0.9, 0.1, 0) (0.1, 1, -0.1) (-0.1, 0.1, 1) ## (1.1, 0, 0.1) (0, 0.9, -0.1) (0.1, 0, 1.1) ## Syntactic ( 1, 1, 1 ) ( 1, 2, 3 ) ( 1, 2, 4 ) ## ( 2, 2, 2 ) ( 2, 3, 4 ) ( 2, 4, 8 ) ## ( 3, 3, 3 ) ( 3, 4, 5 ) ( 3, 6, 12)
**Classifier types** ## Decision theoretic pattern recognizers come in three types: ## Direct probability estimation - 1NN, kNN,
- Parzen, LBG, LVQ
## Hypersphere - potentials, Mahalanobis (MAP for Gauss),
- RBF, RCE
## Hyperplane
**Learning Theory** ## Decision theoretic pattern recognizer is usually trained ## Training (learning) algorithms come in three types - Supervised (learning by examples, query learning)
- Reinforcement (good-dog/bad-dog, TD)
- Unsupervised (clustering, VQ)
## Cross Validation: ## In order not to fall into the generalization trap we need - training set
- validation set
- test set (untainted, fair estimate of generalization)
## Probably Approximately Correct Learning
**Why ASR is not pattern recognition** ## pneumonoultramicroscopicsilicovolcanoconiosis
**Time Warping** ## The Real Problem in ASR - we have to correct for the time warping ## Note that since the distortion is time-variant it is not a filter! ## One way to deal with such warping is to use Dynamic Programming ## The main DP algorithm has many names - Viterbi algorithm
- Levenshtein distance
- DTW
## but they are all basically the same thing! ## The algorithm(s) are computationally efficient - since they find a global minimum based on local decisions
**Levenshtein Distance** ## Easy case to demonstrate - distance between two strings - Useful in spelling checkers, OCR/ASR post-processors, etc
- Deletion digital - digtal
- Insertion signal - signall
- Substitution processing - prosessing
## The Levenshtein distance is the minimal number of errors - distance between dijitl and digital is 2
## How do we find the minimal number of errors? ## Algorithm is best understood graphically
**Levenshtein distance - cont.** ## What is the Levenshtein distance between **prossesing** and **processing** ?
**Levenshtein distance - cont.** ## 9 ## 8 ## 7 ## 6 ## 5 ## 4 ## 3 ## 2 ## 1 ## 0 1 2 3 4 5 6 7 8 9
**Levenshtein distance - cont.** ## Continue filling in table ## 9 ## 8 ## 7 ## 6 ## 5 ## 4 3 2 2 2 ## 3 2 1 1 2 ## 2 1 0 1 2 ## 1 0 1 2 3 ## 0 1 2 3 4 5 6 7 8 9
**Levenshtein distance - cont.** ## Finish filling in table ## 9 8 7 7 6 5 5 5 4 3 ## 8 7 6 6 5 4 4 4 3 4 ## 7 6 5 5 4 3 3 3 5 5 ## 6 5 4 4 3 2 3 4 4 5 ## 5 4 3 3 2 3 3 3 4 5 ## 4 3 2 2 2 2 2 3 4 5 ## 3 2 1 1 2 2 3 4 5 6 ## 2 1 0 1 2 3 4 5 6 7 ## 1 0 1 2 3 4 5 6 7 8 ## 0 1 2 3 4 5 6 7 8 9
**Levenshtein distance - end** ## Backtrack to find path actually taken ## 9 8 7 7 6 5 5 5 4 ** 3** ## 8 7 6 6 5 4 4 4 **3** 4 ## 7 6 5 5 4 3 3 **3** 4 5 ## 6 5 4 4 3 **2** **3 ** 4 4 5 ## 5 4 3 3 **2** ** 3** 3 3 4 5 ## 4 3 2 **2** **2** **2** 2 3 4 5 ## 3 2 **1** ** 1** **2** 2 3 4 5 6 ## 2 1 **0 ** **1** 2 3 4 5 6 7 ## 1 **0** 1 2 3 4 5 6 7 8 ## **0** 1 2 3 4 5 6 7 8 9
**Generalization to DP** ## What if not all substitutions are equally probable? - Then we add a cost function instead of 1
## We can also have costs for insertions and deletions ## D**i j **= min ( D**i-1 j **+ C**i-1 j; I j **; ## D**i-1 j-1 **+ C**i-1 j-1; I j **; ## D**i j-1 **+ C**i-1 j; I j **) ## Even more general rules are often used
**DTW** ## The input is separately matched to each dictionary word - and the word with the least distortion is the winner!
## When waveforms are used the comparison measure is: - correlation, segmental SNR, Itakura-Saito, etc
## When (N) features are used the comparison is - (N-dimensional) Euclidean distance
## With DTW there is no labeling,
**DTW - cont.** ## Some more details: ## In isolated word recognition systems - energy contours are used to cut the words
- linear time warping is then used to normalize the utterance
- special mechanisms are used for endpoint location flexibility
- so there are endpoint and path constraints
## In connected word recognition systems ## In speaker-independent recognition systems - we need multiple templates for each reference word
- the number of templates can be reduced by VQ
**Markov Models** ## An alternative to DTW is based on Markov Models ## A discrete-time left-to-right first order Markov model
**Markov Models - cont.** ## General DT Markov Model ## Model jumps from state to state with given probabilities ## e.g. 1 1 1 1 2 2 3 3 3 3 3 3 3 3 4 4 4 ## or 1 1 2 2 2 2 2 2 2 2 2 4 4 4 (LR models)
**Markov Models - cont.** ## Why use Markov models for speech recognition? - States can represent phonemes (or whatever)
- Different phoneme durations (but exponentially decaying)
- Phoneme deletions using 2
**nd **or higher order
## So time warping is automatic ! ## We build a Markov model for each word - given an utterance, select the most probable word
**HMM** - so we need a
**H**idden **M**arkov **M**odel
**HMM - cont.** ## For a given state sequence S**1** S**2** S**3** … S**T** ## is P(O|S) = bS**1**O**1 **bS**2**O**2 **bS**3**O**3 … **bS**T**O**T** ## For a given hidden Markov model M = { a, b } ## the probability of the state sequence S**1** S**2** S**3** … S**T** ## is (the initial probability of S**1 **is taken to be S**1**) ## P(S|M) = S**1** aS**1**S**2 **aS**2**S**3 **aS**3**S**4 … **aS**T-1**S**T** ## So, for a given hidden Markov model M ## the probability of an observation sequence O**1** O**2** O**3** … O**T** ## is obtained by summing over all possible state sequences
**HMM - cont.** ## P(O| M) = P(O|S) P(S|M) ## = S**1** bS**1**O**1 **aS**1**S**2 **bS**2**O**2 **aS**2**S**3 **bS**2**O**2 …** ## So for an observation sequence we can find - the probability of any word model
## (but this can be made more efficient using the forward-backward algorithm) ## How do we train HMM models? ## The full algorithm (Baum-Welch) is an EM algorithm ## An approximate algorithm is based on the Viterbi (DP) algorithm ## (Sorry, but beyond our scope here)
**State-of-the-Art** ## Isolated digits Isolated words Constrained Free speech ## Spkr dep ****100% 98% 95% 90% ## Spkr ind 98% 95% 85% 70% ## NB 95% on words is about 50% on sentences - only > 97% is considered useful for transcription
- (otherwise more time consuming to correct than to type)
**Do'stlaringiz bilan baham:** |