Automatic speech recognition (ASR): a pattern recognition task
- Review: relevant aspects of human speech production and perception
- Acoustic-phonetic principles
- Digital analysis methods
- Parameterization and feature extraction
- Training and adaptation of models
- Overview of ASR approaches
- Practical techniques: Hidden Markov Models, Deep Neural Networks
- Acoustic and Language Models
- Cognitive and statistical ASR
Need to map a data point in N-dimensional space to a label
- Input data: samples in time
- Output: text for a word/sentence
- Assumption: signals for similar words cluster in the space
- Problem: how to process the speech signal into suitable features
Then, just need a table look-up
Will Moore's Law solve the ASR problem?
- storage doubling every year
- computation power doubling every 1.5 years
Short utterance of 1 s and coding rate 1 kbps (kilobit/second):
- 25 frames/s, 10 coefficients per frame, 4 bits/coefficient -> 1000 bits, i.e., 2^1000 possible signals
- Suppose each person spoke 1 word every second for 1000 hours: about 10^17 short utterances
- Well beyond near-term capability (see the sketch below)
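A quick sanity check of these counts (a minimal sketch using only the slide's own figures):

```python
# Back-of-the-envelope numbers behind the table look-up argument.
frames_per_sec = 25
coeffs_per_frame = 10
bits_per_coeff = 4

bits_per_utterance = frames_per_sec * coeffs_per_frame * bits_per_coeff
print(bits_per_utterance)          # 1000 bits for a 1 s utterance at 1 kbps

distinct_signals = 2 ** bits_per_utterance
print(len(str(distinct_signals)))  # 302 digits: about 10^301 possible signals

seconds_in_1000_hours = 1000 * 3600
print(seconds_in_1000_hours)       # 3.6 million one-word utterances per speaker
```

Even 10^17 stored utterances would cover a vanishing fraction of the 2^1000 possible inputs, so exhaustive table look-up cannot work.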
ASR assigns a label (text) to each input utterance
- Similar speech is assumed to cluster in the feature space
- Often use a simple distance to measure similarity between the input and the centroids of trained models
- This assumes that perceptual and/or production similarity correlates with distance in the feature space – often not the case unless features are well chosen
- not only to reduce costs
- mostly to focus analysis on important aspects of the signal (thus raising accuracy)
- use the same analysis to create a model and to test it
- not done in some recent end-to-end ASR; however, most ASR uses either MFCC or log spectral filter-bank energies
- otherwise, the feature space is far too complex
- emulate how humans interpret speech
- treat simply as a pattern recognition problem
- exploit the power of computers
- expert-system methods
- stochastic methods
- inadequate training data (in speaker-dependent systems: user fatigue)
- memory limitations
- computation (searching among many possible texts)
- inadequate models (poor assumptions made to reduce computation and memory, at the cost of reduced accuracy)
- hard to train model parameters
Speech: not an arbitrary signal
- source of input to ASR: the human vocal tract
- data compression should take account of the human source
- precision of representation: need not exceed the speaker's ability to control speech
Aspects of speech that speakers do not directly control are free variation
- Can be treated as distortion (noise, other sounds, reverberation)
- Puts limits on the accuracy needed
- Creates mismatch between trained models and any new input
- Intra-speaker: people never say the exact same utterance twice
- Inter-speaker: everyone is different (size, gender, dialect, …)
- Compare to vision PR: changes in lighting, shadows, obscuring objects, viewing angle, focus
- Vocal-tract length normalization (VTLN); noise suppression
Speaker controls: amplitude, pitch, formants, voicing, speaking rate
- Mapping from word (and phoneme) concepts in the brain to the acoustic output is complex
- Trying to decipher speech is more complex than identifying objects in a visual scene
- Vision: edges, texture, coloring, orientation, motion
- Speech: indirect; not observing the vocal tract
Auditory system sensitive to: dynamic positions of spectral peaks, durations (relative to speaking rate), fundamental frequency (F0) patterns
- Important: where and when energy occurs
- Less relevant: overall spectral slope, bandwidths, absence of energy
- Formant tracking: algorithms err in transitions; not directly used in ASR for many years
distribution of speech energy in frequency (spectral amplitude)
- pitch period estimation
- sampling rate typically: 8 000/s for telephone speech; 10 000-16 000/s otherwise
- usually 16 bits/sample
- 8-bit mu-law log PCM in the telephone network (see the sketch below)
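A minimal sketch of the mu-law companding curve behind 8-bit telephone log PCM (this is the continuous formula; the deployed G.711 codec uses a piecewise-linear approximation of it, and the exact 8-bit code mapping here is a simplification):

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compress samples x in [-1, 1] with the mu-law curve, then
    quantize to 8 bits; loud and quiet samples get similar relative
    precision, which is why 8 bits suffice for telephone speech."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)  # still in [-1, 1]
    return np.round((y + 1) / 2 * 255).astype(np.uint8)

def mu_law_decode(code, mu=255):
    """Invert the 8-bit code back to a sample in [-1, 1]."""
    y = code.astype(np.float64) / 255 * 2 - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu
```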
Feature determination (e.g., formant frequencies, F0) requires error-prone methods
So, automatic methods (parameters) are preferred:
- FFT (fast Fourier transform)
- LPC (linear predictive coding)
- MFCC (mel-frequency cepstral coefficients)
- RASTA-PLP
- Log spectral (filter-bank) energies
Often, errors in weak speech and in transitions between voiced and unvoiced speech (e.g., doubling or halving F0)
- peak-pick the time signal (look for an energy increase at each closure of the vocal cords)
- usually first filter out energy above 1000 Hz (retain strong harmonics in the F1 region)
- often use autocorrelation to eliminate phase effects (see the sketch below)
- often not done in ASR, due to the difficulty of exploiting F0 in its complex role of signaling different aspects of speech communication
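A minimal sketch of an autocorrelation F0 estimator for one voiced frame, following the recipe above (the 50-400 Hz search range and filter order are illustrative assumptions):

```python
import numpy as np
from scipy.signal import butter, lfilter

def estimate_f0(frame, fs=8000, f0_min=50.0, f0_max=400.0):
    """Estimate F0 by picking the autocorrelation peak in a plausible
    pitch-lag range, after low-pass filtering at 1000 Hz to keep the
    strong harmonics in the F1 region."""
    b, a = butter(4, 1000 / (fs / 2))                  # low-pass at 1 kHz
    x = lfilter(b, a, frame - np.mean(frame))
    r = np.correlate(x, x, mode="full")[len(x) - 1:]   # lags 0, 1, 2, ...
    lo, hi = int(fs / f0_max), int(fs / f0_min)        # candidate pitch lags
    lag = lo + np.argmax(r[lo:hi])
    return fs / lag                                    # F0 in Hz
```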
Objective: model the speech spectral envelope with few (8-16) coefficients
1) Linear predictive coding (LPC) analysis: standard spectral method for low-rate speech coding (see the sketch below)
2) Cepstral processing: common in ASR; also can exploit some auditory effects
3) Vector quantization (VQ): reduces transmission rate (but also ASR accuracy)
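A minimal sketch of autocorrelation-method LPC using the Levinson-Durbin recursion (the Hamming window and order 10 are illustrative choices):

```python
import numpy as np

def lpc(frame, order=10):
    """Fit LPC coefficients a[1..order] so each sample is predicted
    from a weighted sum of the previous `order` samples; the recursion
    solves the autocorrelation normal equations in O(order^2)."""
    x = frame * np.hamming(len(frame))
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err  # reflection coeff.
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]                # update predictor
        a[i] = k
        err *= 1.0 - k * k                                 # prediction error falls
    return a, err
```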
Cepstrum: inverse FFT of the log-amplitude FFT of the speech
- small set of parameters (often 10-13), as with LPC, but allows warping of frequency to match hearing
- inverse DFT orthogonalizes: gross spectral detail in low-order values, finer detail in higher coefficients
- C0: total speech energy (often discarded)
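That definition translates almost directly into code; a minimal sketch for one windowed frame (the small floor inside the log is an assumption to avoid log of zero):

```python
import numpy as np

def real_cepstrum(frame, n_coeffs=13):
    """Inverse FFT of the log-magnitude FFT of a frame; the first
    coefficient (C0) is overall log energy, and low orders describe
    the smooth spectral envelope."""
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)))
    log_mag = np.log(np.abs(spectrum) + 1e-10)
    return np.fft.irfft(log_mag)[:n_coeffs]
```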
C1: balance of energy (low vs. high frequency)
- C2, ..., C13 encode increasingly fine details about the spectrum (e.g., resolution to 100 Hz)
- Mel-frequency cepstral coefficients (MFCCs) model low frequencies linearly; above 1000 Hz, logarithmically (see the sketch below)
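One widely used form of the mel mapping (constants vary slightly across toolkits): roughly linear below 1000 Hz and logarithmic above, as in this sketch:

```python
import numpy as np

def hz_to_mel(f):
    """Mel scale: near-linear below about 1000 Hz, logarithmic above."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping, used to place triangular filter-bank edges."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Filter edges spaced evenly in mel, hence bunched at low frequencies
edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(4000.0), 12))
print(np.round(edges))
```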
Linear discriminant analysis (LDA)
- As in analysis of variance (ANOVA), regression analysis, and principal component analysis (PCA), LDA finds a linear combination of features to separate pattern classes
- Maximum likelihood linear transforms
- Speaker-adaptive transforms
- Map sets of features (e.g., MFCC, spectral energies) to a smaller, more efficient set (see the sketch below)
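A minimal sketch of such a dimension-reducing map using scikit-learn's LDA; the 39-dimensional features, 40 phone classes, and random data are made-up illustration values:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 39))         # e.g., MFCCs + deltas per frame
phone_labels = rng.integers(0, 40, size=1000)  # one class label per frame

# Learn a projection that maximizes between-class vs. within-class scatter.
lda = LinearDiscriminantAnalysis(n_components=32)
compact = lda.fit_transform(features, phone_labels)
print(compact.shape)  # (1000, 32): smaller, more class-separable features
```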
Segmenting speech spoken without pauses (continuous speech):
- speech unit boundaries are not easily found automatically (vs., e.g., text-to-speech)
- Variability in speech: different speakers, contexts, styles, channels
- Factors: real-time; telephone; hesitations; restarts; filled pauses; other sounds (noise, etc.)
- speaker dependence
- vocabulary size: small (< 100 words), medium (100-1000 words), large (1-20 K words), very large (> 20 K words)
- complexity of vocabulary words: alphabet (difficult), digits (easy)
- allowed sequences of words
  - perplexity: mean # of words to consider (see the sketch below)
  - language models
- style of speech: isolated words or continuous speech; how many words/utterance?
- recording environment
  - quiet (> 30 dB SNR)
  - noisy (< 15 dB)
  - noise-cancelling microphone
  - telephone
- real-time? feedback (rejection)?
- type of error criterion; costs of errors
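A minimal sketch of perplexity as a geometric-mean branching factor (the per-word probabilities are made-up numbers, as from some language model):

```python
import numpy as np

def perplexity(word_probs):
    """Perplexity = 2**(mean negative log2 probability): roughly the
    average number of equally likely word choices at each point."""
    return 2.0 ** (-np.mean(np.log2(word_probs)))

# Model probabilities assigned to the successive words of one sentence
probs = [0.2, 0.05, 0.1, 0.5, 0.02]
print(perplexity(probs))  # 10.0, i.e., about 10 words to consider on average
```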
Very popular in the 1970s
- Compares exemplar patterns, with timing flexibility to handle speaking-rate variations (see the sketch below)
- No assumption of similarity across patterns for the same utterance
- Thus, no way to generalize
- A very poor way to formulate efficient models
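This is template matching, classically done with dynamic time warping (DTW); a minimal sketch of the alignment recursion, assuming each pattern is a NumPy array of frames by feature dimensions:

```python
import numpy as np

def dtw_distance(ref, test):
    """Total cost of the best monotonic alignment of two feature
    sequences, so differences in speaking rate are absorbed by
    stretching either pattern locally."""
    n, m = len(ref), len(test)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(ref[i - 1] - test[j - 1])  # frame distance
            D[i, j] = cost + min(D[i - 1, j],      # stretch the reference
                                 D[i, j - 1],      # stretch the test input
                                 D[i - 1, j - 1])  # advance both together
    return D[n, m]

# Recognition: pick the stored word template with the smallest distance.
```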
For observation probabilities, usually Gaussian pdfs, due to the simplicity of the model: only a mean and a variance (in M dimensions, a mean for each parameter and an MxM covariance matrix capturing the correlations between parameters); see the sketch below
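A minimal sketch of one such observation likelihood; for simplicity it assumes a diagonal covariance (a common practical restriction, not stated above):

```python
import numpy as np

def gaussian_log_likelihood(x, mean, var):
    """Log pdf of an M-dimensional Gaussian with diagonal covariance:
    one mean and one variance per feature dimension."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

# One HMM state's model for 13-dimensional frames (made-up parameters)
mean, var = np.zeros(13), np.ones(13)
frame = np.random.default_rng(0).normal(size=13)
print(gaussian_log_likelihood(frame, mean, var))
```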
major difficulty: first-order frame-independence assumption
- use of delta coefficients over several frames (e.g., 50 ms) helps to include timing information, but is inefficient (see the sketch below)
- stochastic trajectory models and trended HMMs are examples of ways to improve timing modeling
- higher-order Markov models are too computationally complex
- incorporate more information about speech production and perception into the HMM architecture?
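A minimal sketch of the usual regression formula for delta coefficients, assuming a context of ±2 frames (about 50 ms at 100 frames/s):

```python
import numpy as np

def delta(features, width=2):
    """Delta coefficients: least-squares slope of each feature over a
    window of +/- `width` frames, appended to the static features to
    inject some timing information into a frame-wise model."""
    pad = np.pad(features, ((width, width), (0, 0)), mode="edge")
    T = len(features)
    num = sum(k * (pad[width + k:T + width + k] - pad[width - k:T + width - k])
              for k in range(1, width + 1))
    return num / (2.0 * sum(k * k for k in range(1, width + 1)))
```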
Can handle unaligned speech input
- Adds an extra layer to the RNN to allow “end-to-end” ASR
fast matrix and vector multiplications
Recurrent artificial NNs are now dominating ASR
- Basic perceptron: each node in a layer outputs 0 or 1 as input to the next layer, based on a weighted linear combination from the layer beneath (see the sketch below)
- 3 layers: adequate to handle all volumes in N-dimensional space
- Steepest-gradient back-propagation training
- Advances due to big data and massive computing power
- End-to-end DNN or DNN/GMM ASR
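A minimal sketch of that layer-by-layer computation with hard 0/1 outputs (the weights are random placeholders; trained networks replace the hard threshold with differentiable activations so back propagation can work):

```python
import numpy as np

def layer_forward(x, W, b):
    """One perceptron layer: weighted linear combination of the layer
    beneath, thresholded to output 0 or 1 per node."""
    return (W @ x + b > 0).astype(float)

rng = np.random.default_rng(0)
x = rng.normal(size=13)                                         # one input frame
h1 = layer_forward(x, rng.normal(size=(32, 13)), np.zeros(32))
h2 = layer_forward(h1, rng.normal(size=(16, 32)), np.zeros(16))
y = layer_forward(h2, rng.normal(size=(10, 16)), np.zeros(10))  # 3 layers
```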
Still using stochastic gradient as the main way to set the net weights
- Set the learning step size for efficient updates
- Momentum: to avoid traps in local minima (see the sketch below)
- Same criterion as in codebook design for speech coding
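A minimal sketch of a stochastic-gradient step with momentum (the toy quadratic loss and the learning-rate and momentum values are illustrative):

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """Keep a running velocity so updates build up along consistent
    gradient directions and can coast past shallow local minima."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

w, v = np.zeros(5), np.zeros(5)
for _ in range(200):
    grad = 2.0 * (w - 1.0)        # gradient of ||w - 1||^2, a stand-in loss
    w, v = sgd_momentum_step(w, grad, v)
print(np.round(w, 3))             # approaches the minimum at 1.0
```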
Conditional random field acoustic models
- Boosting probabilities
- Support vector machines (SVM): good for binary classifications
- Machine learning