
Outline

  • Define the problem

  • What is speech?

  • Feature Selection

  • Models

    • Early methods
    • Modern statistical models
  • Current State of ASR

  • Future Work



There is no single ASR problem

  • The problem depends on many factors

    • Microphone: Close-mic, throat-mic, microphone array, audio-visual
    • Sources: band-limited, background noise, reverberation
    • Speaker: speaker dependent, speaker independent
    • Language: open/closed vocabulary, vocabulary size, read/spontaneous speech
    • Output: Transcription, speaker id, keywords


Accuracy

  • Accuracy

    • Percentage of tokens correctly recognized
  • Error Rate

    • The complement of accuracy (one minus accuracy); see the sketch after this list
  • Token Type

    • Phones
    • Words*
    • Sentences
    • Semantics?
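
A minimal sketch of computing token accuracy and error rate by aligning the hypothesis against the reference with a token-level edit distance (the sentences and function names are illustrative):

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions, and deletions
    needed to turn hyp into ref (Levenshtein distance over tokens)."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete every reference token
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert every hypothesis token
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitute = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitute, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)]

reference = "the cat sat on the mat".split()
hypothesis = "the cat sat on a mat".split()

error_rate = edit_distance(reference, hypothesis) / len(reference)
print(f"word error rate: {error_rate:.1%}")   # 16.7%: one substitution in six words
print(f"accuracy:        {1 - error_rate:.1%}")
```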


What is speech?

  • Analog signal produced by humans

  • You can think of the speech signal as being decomposed into a source and a filter

  • The source is the vocal folds in voiced speech

  • The filter is the vocal tract and articulators


Feature Selection

  • As in any data-driven task, the data must be represented in some format

  • Cepstral features have been found to perform well

  • They represent the spectrum of the log spectrum, informally "the frequency of the frequencies"

  • Mel-frequency cepstral coefficients (MFCC) are the most common variety
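
Feature extraction is typically delegated to a library. A minimal sketch, assuming librosa is installed and using a hypothetical file name:

```python
import librosa  # assumed available; any MFCC implementation would do

# Load the audio (hypothetical file), resampling to 16 kHz.
signal, rate = librosa.load("utterance.wav", sr=16000)

# 13 Mel-frequency cepstral coefficients per analysis frame.
mfccs = librosa.feature.mfcc(y=signal, sr=rate, n_mfcc=13)
print(mfccs.shape)  # (13, number_of_frames)
```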



Defined the multiple problems associated with ASR

  • Described how speech is produced

  • Illustrated how speech can be represented in an ASR system

  • Now that we have the data, how do we recognize the speech?



First known attempt at speech recognition

  • A toy from 1922 (the celluloid dog "Radio Rex")

  • Worked by analyzing the signal strength at 500 Hz



Originally thought to be a relatively simple task requiring a few years of concerted effort

  • In 1969, John Pierce's article "Whither Speech Recognition?" was published

  • A DARPA project ran from 1971 to 1976 in response to the statements in the Pierce article

  • We can examine a few general systems



Originally only worked for isolated words

  • Performs best when training and testing conditions match

  • For each word we want to recognize, we store a template or example based on actual data

  • Each test utterance is checked against the templates to find the best match

  • Uses the Dynamic Time Warping (DTW) algorithm



Create a similarity matrix for the two utterances

  • Use dynamic programming to find the lowest-cost path through the matrix, as in the sketch below
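
A minimal sketch of the DTW recursion, assuming NumPy and per-frame feature vectors such as the MFCCs above:

```python
import numpy as np

def dtw_cost(a, b):
    """Cost of the best Dynamic Time Warping alignment of two
    feature sequences, each of shape (n_frames, n_features)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # local frame distance
            # Extend the cheapest of the three allowed predecessor cells.
            cost[i, j] = d + min(cost[i - 1, j],      # a advances
                                 cost[i, j - 1],      # b advances
                                 cost[i - 1, j - 1])  # both advance
    return cost[n, m]

# Recognition: the template with the lowest cost wins (names illustrative).
# best_word = min(templates, key=lambda w: dtw_cost(test_features, templates[w]))
```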



One of the systems developed during the DARPA program

  • A blackboard-based system utilizing symbolic problem solvers

  • Each problem solver was called a knowledge source (KS)

  • A complex scheduler was used to decide when each KS should be called





The Hearsay-II system performed much better than the two other similar competing systems

  • However, only one system met the performance goals of the project

    • The Harpy system was also built at CMU
    • In many ways it was a predecessor to the modern statistical systems






For each frame of data, we need some way of describing the likelihood of it belonging to any of our classes

  • Two methods are commonly used

    • A multilayer perceptron (MLP) gives the posterior probability of a class given the data
    • Gaussian Mixture Model (GMM) gives the likelihood of the data given a class
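
A sketch of the GMM option using scikit-learn (assumed available) on stand-in random features: one mixture per phone class models the likelihood of the data given that class.

```python
import numpy as np
from sklearn.mixture import GaussianMixture  # assumed available

rng = np.random.default_rng(0)

# Stand-in training data: MFCC-like frames grouped by phone label.
train = {"ae": rng.normal(0.0, 1.0, (500, 13)),
         "k":  rng.normal(2.0, 1.0, (500, 13))}

# One GMM per class models p(frame | class).
models = {phone: GaussianMixture(n_components=4, random_state=0).fit(x)
          for phone, x in train.items()}

def frame_log_likelihoods(frame):
    """Log p(frame | class) for every phone class."""
    return {phone: g.score_samples(frame[None, :])[0]
            for phone, g in models.items()}

print(frame_log_likelihoods(rng.normal(size=13)))
```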




While the pronunciation model can be very complex, it is typically just a dictionary

  • The dictionary contains the valid pronunciations for each word

  • Examples:

    • Cat: k ae t
    • Dog: d ao g
    • Fox: f aa k s
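
In code, such a pronunciation model can be little more than a mapping from words to their valid phone sequences; a sketch reusing the entries above (a word may have several pronunciations, hence a list):

```python
lexicon = {
    "cat": [["k", "ae", "t"]],
    "dog": [["d", "ao", "g"]],
    "fox": [["f", "aa", "k", "s"]],
}

def pronunciations(word):
    """All valid phone sequences for a word, or [] if out of vocabulary."""
    return lexicon.get(word.lower(), [])
```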


Now we need some way of representing the likelihood of any given word sequence

  • Many methods exist, but n-grams are the most common

  • N-gram models are trained by simply counting the occurrences of word sequences in a training set



A unigram is the probability of any word in isolation

  • A bigram is the probability of a word given the previous word

  • Higher order ngrams continue in a similar fashion

  • A backoff probability is used for any unseen data
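
A sketch combining the last two slides: bigram counts from a toy corpus, with an unnormalized ("stupid backoff"-style) fall-back to unigram probabilities for unseen pairs. The corpus and the backoff weight are illustrative:

```python
from collections import Counter

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]  # toy training set

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))
total_words = sum(unigrams.values())

def bigram_prob(prev, word, backoff_weight=0.4):
    """P(word | prev); falls back to a scaled unigram for unseen pairs."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    return backoff_weight * unigrams[word] / total_words

print(bigram_prob("the", "cat"))  # 0.5: "the" is followed by "cat" half the time
print(bigram_prob("cat", "dog"))  # unseen pair, scaled-down unigram estimate
```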



We now have models to represent the three parts of our equation

  • We need a framework to join these models together

  • The standard framework used is the Hidden Markov Model (HMM)
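
The equation itself is not reproduced in this copy; presumably it is the standard noisy-channel formulation, sketched here with X the acoustic observations, Q a phone sequence, and W a word sequence (P(X) is constant and dropped, and Q is maximized or summed out):

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \underbrace{P(X \mid Q)}_{\text{acoustic}}\;
                       \underbrace{P(Q \mid W)}_{\text{pronunciation}}\;
                       \underbrace{P(W)}_{\text{language}}
```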



A state model using the Markov property

  • A state model using the Markov property

    • The Markov property states that the future depends only on the present state
  • Models the likelihood of transitions between states in a model

  • Given the model, we can determine the likelihood of any sequence of states
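
A toy sketch (a hypothetical two-state weather chain) of how the transition matrix yields the likelihood of any state sequence:

```python
import numpy as np

states = ["rain", "sun"]              # illustrative two-state chain
initial = np.array([0.5, 0.5])        # P(first state)
A = np.array([[0.7, 0.3],             # A[i, j] = P(next = j | current = i)
              [0.4, 0.6]])

def sequence_likelihood(sequence):
    """Probability of a full state sequence under the chain."""
    idx = [states.index(s) for s in sequence]
    p = initial[idx[0]]
    for prev, cur in zip(idx, idx[1:]):
        p *= A[prev, cur]  # Markov property: only the current state matters
    return p

print(sequence_likelihood(["rain", "rain", "sun"]))  # 0.5 * 0.7 * 0.3 = 0.105
```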



Similar to a Markov model, except the states are hidden

  • We now have observations tied to the individual states

  • We no longer know the exact state sequence given the data

  • Allows for the modeling of an underlying unobservable process



First we build an HMM for each phone

  • Next we combine the phone models based on the pronunciation model to create word level models

  • Finally, the word level models are combined based on the language model

  • We now have a giant network with potentially thousands or even millions of states
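
A schematic sketch of the composition, assuming three named states per phone HMM and the toy lexicon from earlier; a real decoder composes transition and observation parameters, not just state names:

```python
# Hypothetical: each phone HMM reduced to a list of named states
# (three emitting states per phone is a common choice, assumed here).
def phone_states(phone):
    return [f"{phone}_{i}" for i in range(3)]

lexicon = {"cat": [["k", "ae", "t"]], "dog": [["d", "ao", "g"]]}

def word_states(word):
    """Chain the phone models named by the pronunciation model."""
    states = []
    for phone in lexicon[word][0]:  # first listed pronunciation
        states.extend(phone_states(phone))
    return states

print(word_states("cat"))
# ['k_0', 'k_1', 'k_2', 'ae_0', 'ae_1', 'ae_2', 't_0', 't_1', 't_2']
# Word models are then stitched together according to the language model.
```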



Decoding happens in the same way as the previous example

  • For each time frame we need to maintain two pieces of information

    • The likelihood of being at any state
    • The previous state for every state
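
A minimal Viterbi sketch in the log domain that maintains exactly those two arrays, the per-state score and the backpointer (NumPy assumed; the parameter names are illustrative):

```python
import numpy as np

def viterbi(log_init, log_A, log_B):
    """Most likely state sequence through an HMM (log domain).

    log_init: (S,)   log initial state probabilities
    log_A:    (S, S) log transition probabilities
    log_B:    (T, S) log observation likelihood of each frame per state
    """
    T, S = log_B.shape
    score = np.full((T, S), -np.inf)    # best log score of being in each state
    back = np.zeros((T, S), dtype=int)  # the previous state for every state
    score[0] = log_init + log_B[0]
    for t in range(1, T):
        for s in range(S):
            candidates = score[t - 1] + log_A[:, s]
            back[t, s] = np.argmax(candidates)
            score[t, s] = candidates[back[t, s]] + log_B[t, s]
    # Trace the backpointers from the best final state.
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```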


Current State of ASR

  • What works well

    • Constrained vocabulary systems
    • Systems adapted to a given speaker
    • Systems in anechoic environments without background noise
    • Systems expecting read speech
  • What doesn't work

    • Large unconstrained vocabulary
    • Noisy environments
    • Conversational speech


Future Work

  • Better representations of audio based on human hearing

  • Better representation of acoustic elements based on articulatory phonology

  • Segmental models that do not rely on the simple frame-based approach



Hidden Markov Model Toolkit (HTK)

  • Hidden Markov Model Toolkit (HTK)

    • http://htk.eng.cam.ac.uk/
  • CHiME (a freely available dataset)

    • http://spandh.dcs.shef.ac.uk/projects/chime/PCC/datasets.html
  • Machine Learning Lectures

    • http://www.stanford.edu/class/cs229/
    • http://www.youtube.com/watch?v=UzxYlbK2c7E

