Automatic speech recognition (ASR) - a pattern recognition task


Date: 04.11.2017




  • Automatic speech recognition (ASR) - a pattern recognition task

  • Review: relevant aspects of human speech production and perception

  • Acoustic-phonetic principles

  • Digital analysis methods

  • Parameterization and feature extraction

  • Training and adaptation of models

  • Overview of ASR approaches

  • Practical techniques: Hidden Markov Models, Deep Neural Networks

  • Acoustic and Language Models

  • Cognitive and statistical ASR




  • Need to map a data point in N-dimensional space to a label

  • Input data: samples in time

  • Output: Text for a word/sentence

  • Assumption: signals for similar words cluster in the space

  • Problem: how to process speech signal into suitable features



  • Store all possible speech signals with their corresponding texts

  • Then, just need a table look-up

  • Will Moore's Law solve the ASR problem?

  • storage doubling every year

  • computation power doubling every 1.5 years




  • Short utterance of 1 s and coding rate 1 kbps (kilobit/second):

  • 25 frames/s, 10 coefficients a frame, 4 bits/coefficient -> 2^1000 signals

  • Suppose each person spoke 1 word every second for 1000 hours:

  • about 10^17 short utterances

  • Well beyond near-term capability
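The arithmetic behind these figures can be checked directly. The following is a minimal sketch; the world population of 7 billion is an assumption that the source does not state, and it gives a count within an order of magnitude of the figure quoted above.

```python
# Bits needed to code a 1 s utterance at the rates quoted above
frames_per_second = 25
coeffs_per_frame = 10
bits_per_coeff = 4
bits_per_utterance = frames_per_second * coeffs_per_frame * bits_per_coeff

# Number of distinct 1 s signals a look-up table would have to cover
distinct_signals = 2 ** bits_per_utterance
digits_in_table_size = len(str(distinct_signals))

# Utterances collected if everyone spoke one word per second for 1000 hours
population = 7_000_000_000  # assumption: not given in the source
utterances = population * 1000 * 3600

print(bits_per_utterance)       # bits per 1 s utterance
print(digits_in_table_size)     # decimal digits in 2^1000
print(f"{utterances:.1e}")      # recorded utterances
```

Even under this generous recording scenario, the collected utterances cover a vanishing fraction of the possible coded signals, which is why a table look-up is not feasible.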




  • ASR assigns a label (text) to each input utterance

  • Similar speech is assumed to cluster in the feature space

  • Often use simple distance to measure similarity between input and centroids of trained models

  • This assumes that perceptual and/or production similarity correlates with distance in the feature space; often this is not the case unless the features are well chosen
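The distance-to-centroid idea above can be sketched as a nearest-centroid classifier. This is an illustrative sketch only: the two-dimensional feature vectors and the word labels are hypothetical, and Euclidean distance is assumed as the "simple distance".

```python
import math

def centroid(vectors):
    """Mean vector of a list of equal-length feature vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(features, centroids):
    """Return the label whose centroid is closest to the input features."""
    return min(centroids, key=lambda label: euclidean(features, centroids[label]))

# Hypothetical 2-D training features for two words
training = {
    "yes": [[1.0, 1.2], [0.8, 1.0], [1.2, 0.9]],
    "no": [[4.0, 3.8], [4.2, 4.1], [3.9, 4.0]],
}
centroids = {label: centroid(vs) for label, vs in training.items()}
label = classify([1.1, 1.0], centroids)
print(label)
```

The classifier is only as good as the feature space: if similar-sounding words do not cluster, the nearest centroid will be the wrong one.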















  • not only to reduce costs

  • mostly to focus analysis on important aspects of the signal (thus raising accuracy)

  • use the same analysis to create model and to test it

  • not done in some recent end-to-end ASR; however, most ASR uses either MFCC or log spectral filter-bank energies

  • otherwise, feature space is far too complex




  • emulate how humans interpret speech

  • treat simply as a pattern recognition problem

  • exploit power of computers

  • expert-system methods

  • stochastic methods






  • inadequate training data (in speaker-dependent systems: user fatigue)

  • memory limitations

  • computation (searching among many possible texts)

  • inadequate models (poor assumptions made to reduce computation and memory, at the cost of reduced accuracy)

  • hard to train model parameters



  • Speech: not an arbitrary signal

  • source of input to ASR: human vocal tract

  • data compression should take account of the human source

  • precision of representation: need not exceed the speaker's ability to control speech




  • Aspects of speech that speakers do not directly control are free variation

  • Can be treated as distortion (noise, other sounds, reverberation)

  • Puts limits on accuracy needed

  • Creates mismatch between trained models and any new input

  • Intra-speaker: people never say the same exact utterance twice

  • Inter-speaker: everyone is different (size, gender, dialect,…)

  • Environment: SNR, microphone placement, …

  • Compare to vision PR: changes in lighting, shadows, obscuring objects, viewing angle, focus

  • Vocal-tract length normalization (VTLN); noise suppression






  • Speaker controls: amplitude, pitch, formants, voicing, speaking rate

  • Mapping from word (and phoneme) concepts in the brain to the acoustic output is complex

  • Trying to decipher speech is more complex than identifying objects in a visual scene

  • Vision: edges, texture, coloring, orientation, motion

  • Speech: indirect; not observing the vocal tract




  • Auditory system sensitive to: dynamic positions of spectral peaks, durations (relative to speaking rate), fundamental frequency (F0) patterns

  • Important: where and when energy occurs

  • Less relevant: overall spectral slope, bandwidths, absence of energy

  • Formant tracking: algorithms err in transitions; not directly used in ASR for many years




  • distribution of speech energy in frequency (spectral amplitude)

  • pitch period estimation

  • sampling rate typically:

  • 8,000 samples/s for telephone speech

  • 10,000 - 16,000 samples/s otherwise

  • usually 16 bits/sample

  • 8-bit mu-law log PCM (in the telephone network)
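The 8-bit mu-law coding mentioned above can be illustrated with the continuous companding formula. This is a sketch under stated assumptions: it uses mu = 255 and a plain uniform 8-bit quantizer on the companded value, and it is not the bit-exact segmented codec used in the telephone network.

```python
import math

MU = 255  # value conventionally used with 8-bit mu-law

def mu_law_compress(x, mu=MU):
    """Map a sample in [-1, 1] to a companded value in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    return math.copysign(((1 + mu) ** abs(y) - 1) / mu, y)

def quantize_8bit(y):
    """Uniformly quantize a value in [-1, 1] to an integer code 0..255."""
    return min(255, int((y + 1) / 2 * 256))

def dequantize_8bit(code):
    """Map an 8-bit code back to the centre of its quantization interval."""
    return (code + 0.5) / 256 * 2 - 1

quiet_sample = 0.01
code = quantize_8bit(mu_law_compress(quiet_sample))
restored = mu_law_expand(dequantize_8bit(code))
print(code, restored)
```

The logarithmic mapping spends more of the 256 codes on low amplitudes, so quiet samples are reconstructed more accurately than with uniform 8-bit quantization.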




  • Feature determination (e.g., formant frequencies, F0) requires error-prone methods

  • So, automatic methods (parameters) preferred:

  • FFT (fast Fourier transform)

  • LPC (linear predictive coding)

  • MFCC (mel-frequency cepstral coefficients)

  • RASTA-PLP

  • Log spectral (filter-bank) energies








  • Often, errors in weak speech and in transitions between voiced and unvoiced speech (e.g., doubling or halving F0)

  • peak-pick the time signal (look for energy increase at each closure of vocal cords)

  • usually first filter out energy above 1000 Hz (retain strong harmonics in F1 region)

  • often use autocorrelation to eliminate phase effects

  • often not done in ASR, due to the difficulty of exploiting F0 in its complex role of signaling different aspects of speech communication
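The autocorrelation approach to F0 estimation listed above can be sketched as follows. This is a minimal illustration on a synthetic voiced frame: the 50-400 Hz search range and the test signal are assumptions, and the low-pass filtering and voiced/unvoiced decision mentioned above are omitted.

```python
import math

def estimate_f0(frame, sample_rate, f0_min=50.0, f0_max=400.0):
    """Return the F0 (Hz) whose lag maximizes the frame's autocorrelation."""
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)
    best_lag, best_value = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Autocorrelation at this lag discards phase, keeping only periodicity
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag

# Synthetic voiced frame: fundamental plus two weaker harmonics
sample_rate = 8000
true_f0 = 125.0
frame = [
    sum(amp * math.sin(2 * math.pi * k * true_f0 * n / sample_rate)
        for k, amp in ((1, 1.0), (2, 0.5), (3, 0.25)))
    for n in range(512)
]
f0 = estimate_f0(frame, sample_rate)
print(f0)
```

On real speech the largest peak can fall at twice or half the true period, which is the doubling or halving error noted above.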




  • Objective: model speech spectral envelope with few (8-16) coefficients

  • 1) Linear predictive coding (LPC) analysis: standard spectral method for low-rate speech coding

  • 2) Cepstral processing: common in ASR; also can exploit some auditory effects

  • 3) Vector Quantization (VQ): reduces transmission rate (but also ASR accuracy)






  • Cepstrum: inverse FFT of the log-amplitude FFT of the speech

  • small set of parameters (often 10-13), as with LPC, but allows warping of frequency to match hearing

  • inverse DFT orthogonalizes

  • gross spectral detail in low-order values, finer detail in higher coefficients

  • C0: total speech energy (often discarded)
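The definition above (inverse transform of the log-amplitude spectrum) can be written out directly. This sketch uses a naive DFT for self-containment rather than a fast FFT, and the small constant added before the logarithm is an assumption to avoid taking the log of zero.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (O(N^2)); adequate for short frames."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def real_cepstrum(frame):
    """Inverse DFT of the log magnitude spectrum of the frame."""
    n = len(frame)
    log_mag = [math.log(abs(s) + 1e-12) for s in dft(frame)]
    return [sum(log_mag[k] * cmath.exp(2j * math.pi * k * q / n)
                for k in range(n)).real / n
            for q in range(n)]

# A scaled impulse has a flat spectrum, so only C0 is non-zero
frame = [math.e] + [0.0] * 63
cepstrum = real_cepstrum(frame)
print(cepstrum[0])
```

For the flat spectrum in this example all the information sits in C0 (the overall level); spectral shape, when present, appears in the higher coefficients.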




  • C1: balance of energy (low vs. high frequency)

  • C2, ..., C13 encode increasingly fine details about the spectrum (e.g., resolution to 100 Hz)

  • Mel cepstral coefficients (MFCCs)

  • model low frequencies linearly; above 1000 Hz logarithmically
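The mel warping described above (roughly linear at low frequencies, logarithmic above about 1000 Hz) can be illustrated with one widely used conversion formula. The source does not say which variant it has in mind, so the constants 2595 and 700 are an assumption.

```python
import math

def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (one common formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Filter centre frequencies equally spaced in mels between 0 and 4000 Hz
num_filters = 10
top_mel = hz_to_mel(4000.0)
centres_hz = [mel_to_hz(top_mel * (i + 1) / (num_filters + 1))
              for i in range(num_filters)]
print([round(c) for c in centres_hz])
```

The centre frequencies come out closely spaced at the low end and increasingly far apart at the high end, which is the frequency resolution the warping is meant to imitate.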




  • Linear discriminant analysis (LDA)

  • As in analysis of variance (ANOVA), regression analysis and principal component analysis (PCA), LDA finds a linear combination of features to separate pattern classes

  • Maximum likelihood linear transforms

  • Speaker Adaptive Transforms

  • Map sets of features (e.g., MFCCs, spectral energies) to a smaller, more efficient set




  • Segmenting speech spoken without pauses (continuous speech):

  • speech unit boundaries are not easily found automatically (vs., e.g., text-to-speech)

  • Variability in speech: different speakers, contexts, styles, channels

  • Factors: real-time; telephone; hesitations; restarts; filled pauses; other sounds (noise, etc.)




  • speaker dependence

  • size of vocabulary

  • small (< 100 words)

  • medium (100-1000 words)

  • large (1,000-20,000 words)

  • very large (> 20,000 words)

  • complexity of vocabulary words

  • alphabet (difficult)

  • digits (easy)




  • allowed sequences of words

  • - perplexity: mean number of words to consider at each point

  • - language models

  • style of speech: isolated words or continuous speech; how many words/utterance?

  • recording environment

  • - quiet (> 30 dB SNR)

  • - noisy (< 15 dB)

  • - noise-cancelling microphone

  • - telephone

  • real-time? feedback (rejection)?

  • type of error criterion

  • costs of errors
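The perplexity measure listed above can be made concrete with a short calculation. This sketch assumes the language model supplies a probability for each word in a test sequence; the probability values are hypothetical.

```python
import math

def perplexity(word_probs):
    """Perplexity of a word sequence, given the model probability of each word."""
    avg_log2 = sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2 ** (-avg_log2)

# Digit strings with no constraints: each of 10 digits is equally likely
digits_pp = perplexity([0.1] * 7)

# A more constrained (hypothetical) task where the model is more certain
constrained_pp = perplexity([0.5, 0.25, 0.5, 0.25])
print(digits_pp, constrained_pp)
```

With ten equally likely digits the perplexity equals the vocabulary size; a language model that restricts the allowed word sequences lowers it, which shrinks the search at each point.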










  • Very popular in the 1970s

  • Compares exemplar patterns, with timing flexibility to handle speaking rate variations

  • No assumption of similarity across patterns for the same utterance

  • Thus, no way to generalize

  • Poorly suited to formulating efficient models
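Comparing exemplar patterns with timing flexibility is commonly done with dynamic time warping (DTW). The surviving text does not name the method, so treating it as DTW is an assumption; the sketch below uses one-dimensional "features" and an absolute-difference local distance for brevity.

```python
def dtw_distance(template, observed):
    """Accumulated distance along the best time alignment of two sequences."""
    inf = float("inf")
    rows, cols = len(template), len(observed)
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            local = abs(template[i - 1] - observed[j - 1])
            # Allow a frame to repeat in either sequence, or both to advance
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[rows][cols]

template = [0.0, 1.0, 2.0, 1.0, 0.0]
slow = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 0.0, 0.0]   # same shape, half speed
other = [2.0, 2.0, 0.0, 0.0, 2.0, 2.0, 0.0, 0.0, 2.0, 2.0]  # different shape
print(dtw_distance(template, slow), dtw_distance(template, other))
```

The slowed-down version aligns perfectly with the template, while the different pattern does not. Each stored template is compared on its own, so nothing learned from one exemplar carries over to another.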











  • for observation probabilities, usually Gaussian pdfs, due to the simplicity of the model, which uses only a mean and a variance

  • (in M dimensions, need a mean for each parameter and an MxM covariance matrix, which captures the correlations between parameters)
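A Gaussian observation probability can be evaluated as below. To keep the sketch short it assumes a diagonal covariance matrix (one variance per parameter, no correlations), which is a simplification of the full MxM case described above; the mean, variance and test vectors are hypothetical.

```python
import math

def diag_gaussian_log_pdf(x, mean, var):
    """Log density of x under a Gaussian with a diagonal covariance matrix."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

# Hypothetical 2-D state model and two observation vectors
mean = [0.0, 1.0]
var = [1.0, 4.0]
near = diag_gaussian_log_pdf([0.1, 1.2], mean, var)
far = diag_gaussian_log_pdf([3.0, -4.0], mean, var)
print(near, far)
```

Working with log densities avoids numerical underflow when many frame probabilities are multiplied along a path.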




  • major difficulty: first-order frame-independence assumption

  • use of delta coefficients over several frames (e.g., 50 ms) helps to include timing information, but is inefficient

  • stochastic trajectory models and trended HMMs are examples of ways to improve timing modeling

  • higher-order Markov models are too computationally complex

  • incorporate more information about speech production and perception into the HMM architecture?
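The delta coefficients mentioned above are often computed with a regression over a few frames on each side of the current frame. The sketch below uses that common formula; the window of two frames per side and the edge handling (repeating the first and last frame) are assumptions.

```python
def delta(frames, window=2):
    """Regression-based delta for each frame; edges repeat the end frames."""
    denom = 2 * sum(n * n for n in range(1, window + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        d = [0.0] * len(frames[0])
        for n in range(1, window + 1):
            ahead = frames[min(t + n, last)]
            behind = frames[max(t - n, 0)]
            for i in range(len(d)):
                d[i] += n * (ahead[i] - behind[i])
        out.append([v / denom for v in d])
    return out

# One coefficient rising by 1 per frame, one held constant
frames = [[float(t), 5.0] for t in range(7)]
deltas = delta(frames)
print(deltas[3])
```

Appending these values to the static features gives each frame some local timing information, at the price of a larger feature vector.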
















  • Can handle unaligned speech input

  • Adds extra layer to RNN to allow “end-to-end” ASR
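The description above matches connectionist temporal classification (CTC), although the surviving text does not name it, so that reading is an assumption. One piece that is easy to show is how per-frame outputs become text without any alignment: repeated labels are merged and a special blank label is dropped. The blank symbol and the frame labels below are hypothetical.

```python
BLANK = "_"  # hypothetical symbol for the blank label

def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Collapse per-frame best labels: merge repeats, then drop blanks."""
    decoded = []
    previous = None
    for label in frame_labels:
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return "".join(decoded)

# The blank between the two runs of "l" keeps the double letter
text = ctc_greedy_decode(list("hh_e_ll_ll_oo"))
print(text)
```

This covers only the decoding rule; training additionally sums over all frame-level label sequences that collapse to the target text.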




  • fast matrix and vector multiplications



  • Recurrent artificial neural networks (NNs) now dominate ASR

  • Basic perceptron: each node in a layer outputs 0 or 1 as input to the next layer, based on a weighted linear combination from the layer beneath

  • 3 layers: adequate to handle all volumes in N-dimensional space

  • Steepest-descent (gradient) backpropagation training

  • Advances due to big data and massive computing power

  • End-to-end DNN or DNN/GMM ASR
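The basic perceptron described above (each node outputs 0 or 1 from a weighted linear combination of the layer beneath) can be sketched with threshold units. The weights here are hand-picked for illustration rather than trained, and XOR is chosen because a single layer cannot compute it.

```python
def step(z):
    """Threshold activation: output 1 if the weighted sum is positive."""
    return 1 if z > 0 else 0

def layer(inputs, weights, biases):
    """One layer of threshold units, one weight row and bias per unit."""
    return [step(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

# Hand-picked (not trained) weights that compute XOR
HIDDEN_W = [[1.0, 1.0], [1.0, 1.0]]
HIDDEN_B = [-0.5, -1.5]   # unit 0 acts as OR, unit 1 as AND
OUTPUT_W = [[1.0, -1.0]]
OUTPUT_B = [-0.5]         # fires for OR but not AND

def xor_net(a, b):
    hidden = layer([a, b], HIDDEN_W, HIDDEN_B)
    return layer(hidden, OUTPUT_W, OUTPUT_B)[0]

print([xor_net(a, b) for a in (0, 1) for b in (0, 1)])
```

Backpropagation training replaces the hard threshold with a differentiable activation so that gradients can flow through the layers.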



  • Still using stochastic gradient descent as the main way to set the network weights

  • Set learning step size for efficient updates

  • Momentum: to avoid traps in local minima

  • Same criterion as in codebook design for speech coding
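The roles of step size and momentum can be seen in a one-parameter sketch. For simplicity it uses the exact gradient of a quadratic loss in place of a noisy minibatch gradient; the loss, step size and momentum value are all assumptions chosen for illustration.

```python
def sgd_momentum(grad, w0, step_size=0.1, momentum=0.9, iterations=200):
    """Gradient descent with a momentum (velocity) term on a single weight."""
    w, velocity = w0, 0.0
    for _ in range(iterations):
        # Velocity accumulates past gradients, smoothing the update direction
        velocity = momentum * velocity - step_size * grad(w)
        w += velocity
    return w

# Hypothetical loss (w - 3)^2 with gradient 2 (w - 3); minimum at w = 3
w_final = sgd_momentum(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_final)
```

Too large a step size makes the updates diverge and too small a one makes training slow; the velocity term lets the weight coast through flat regions and shallow dips.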






  • Conditional random field acoustic models

  • Boosting probabilities

  • Support vector machines (SVM): good for binary classifications

  • Machine learning



