Define the problem

There is no single ASR problem

Accuracy

Analog signal produced by humans

As in any data-driven task, the data must be represented in some format

Defined the multiple problems associated with ASR

First known attempt at speech recognition

Originally thought to be a relatively simple task requiring a few years of concerted effort

Originally only worked for isolated words

Create a similarity matrix for the two utterances

One of the systems developed during the DARPA program

The Hearsay-II system performed much better than the two other similar competing systems

For each frame of data, we need some way of describing the likelihood of it belonging to any of our classes

While the pronunciation model can be very complex, it is typically just a dictionary

Now we need some way of representing the likelihood of any given word sequence

A unigram is the probability of any word in isolation

We now have models to represent the three parts of our equation

A state model using the Markov property

Similar to a Markov model except the states are hidden

First we build an HMM for each phone

Decoding happens in the same way as the previous example

What works well

Better representations of audio based on humans

Hidden Markov Model Toolkit (HTK)

Define the problem

What is speech?

Feature Selection

Models

Current State of ASR

Future Work

There is no single ASR problem

The problem depends on many factors

Accuracy

Error Rate

Token Type
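
As a rough illustration of how error rate depends on the token type, the sketch below assumes the error rate is the edit distance between reference and hypothesis divided by the reference length. The function names and the example sentences are hypothetical.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions, and insertions turning ref into hyp."""
    # prev[j] is the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by the number of reference tokens."""
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


reference = "the cat sat on the mat"
hypothesis = "the cat sat on mat"

word_error = error_rate(reference.split(), hypothesis.split())  # tokens are words
char_error = error_rate(list(reference), list(hypothesis))      # tokens are characters
accuracy = 1.0 - word_error
```

The same recognizer output gives a different number depending on whether the tokens are words or characters, so the token type should be stated alongside any reported figure.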

Analog signal produced by humans

You can think of the speech signal as being decomposed into a source and a filter

The source is the vocal folds in voiced speech

The filter is the vocal tract and articulators

As in any data-driven task, the data must be represented in some format

Cepstral features have been found to perform well

They represent the "frequency of the frequencies": the spectrum of the log-magnitude spectrum

Mel-frequency cepstral coefficients (MFCC) are the most common variety
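
A minimal sketch of the idea, assuming a plain real cepstrum (the inverse DFT of the log-magnitude spectrum). It leaves out the mel filterbank and the DCT that MFCCs use, and the toy frame and function names are made up for illustration.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform of a sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def real_cepstrum(frame, floor=1e-12):
    """Inverse DFT of the log-magnitude spectrum of one frame."""
    n = len(frame)
    # The floor avoids log(0) for spectral bins with no energy.
    log_mag = [math.log(max(abs(c), floor)) for c in dft(frame)]
    # For a real frame the log-magnitude spectrum is real and even,
    # so its inverse DFT is real and the imaginary part is discarded.
    return [sum(log_mag[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


# Toy frame: an impulse plus a half-amplitude echo 8 samples later.
frame = [0.0] * 64
frame[0] = 1.0
frame[8] = 0.5

cepstrum = real_cepstrum(frame)
# Index (quefrency) of the largest cepstral value, ignoring index 0.
peak = max(range(1, 32), key=lambda q: cepstrum[q])
```

The echo produces a periodic ripple in the spectrum, and the cepstrum turns that ripple into a peak at the echo delay, which is the sense in which it measures the frequency of the frequencies.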

Defined the multiple problems associated with ASR

Described how speech is produced

Illustrated how speech can be represented in an ASR system

Now that we have the data, how do we recognize the speech?

First known attempt at speech recognition

A toy from 1922

Worked by analyzing the signal strength at 500 Hz

Originally thought to be a relatively simple task requiring a few years of concerted effort

1969: J. R. Pierce's “Whither Speech Recognition?” is published

A DARPA project ran from 1971 to 1976 in response to the statements in the Pierce article

We can examine a few general systems

Originally only worked for isolated words

Performs best when training and testing conditions match

For each word we want to recognize, we store a template or example based on actual data

Each test utterance is checked against the templates to find the best match

Uses the Dynamic Time Warping (DTW) algorithm

Create a similarity matrix for the two utterances

Use dynamic programming to find the lowest-cost path through the matrix
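
The template-matching procedure can be sketched as follows. This is an illustration under stated assumptions: the matrix holds frame-to-frame distances (low is similar), the features are single numbers rather than vectors, and the templates and function names are hypothetical.

```python
def dtw_cost(a, b):
    """Cost of the cheapest alignment path between feature sequences a and b."""
    inf = float("inf")
    # Local distance matrix: one entry per pair of frames.
    dist = [[abs(x - y) for y in b] for x in a]
    # acc[i][j] is the cheapest cumulative cost of aligning a[:i] with b[:j].
    acc = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    acc[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            acc[i][j] = dist[i - 1][j - 1] + min(acc[i - 1][j],      # only a advances
                                                 acc[i][j - 1],      # only b advances
                                                 acc[i - 1][j - 1])  # both advance
    return acc[len(a)][len(b)]


def recognize(utterance, templates):
    """Return the label of the template with the lowest DTW cost."""
    return min(templates, key=lambda word: dtw_cost(utterance, templates[word]))


# Hypothetical one-dimensional features; real systems use vectors such as MFCCs.
templates = {
    "yes": [1.0, 3.0, 4.0, 3.0, 1.0],
    "no": [4.0, 2.0, 1.0, 1.0, 2.0],
}
test_utterance = [1.0, 1.0, 3.0, 4.0, 4.0, 3.0, 1.0]  # a slower "yes"
best = recognize(test_utterance, templates)
```

Because the path may repeat frames of either sequence, a word spoken more slowly than its template can still align with zero cost.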

One of the systems developed during the DARPA program

A blackboard-based system utilizing symbolic problem solvers

Each problem solver was called a knowledge source (KS)

A complex scheduler was used to decide when each KS should be called

The Hearsay-II system performed much better than the two other similar competing systems

However, only one system met the performance goals of the project

For each frame of data, we need some way of describing the likelihood of it belonging to any of our classes

Two methods are commonly used
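
As one illustration of scoring a frame against each class, the sketch below assumes each class is modeled by a single diagonal-covariance Gaussian. That choice is an assumption made for the example, and the class names and parameters are hypothetical.

```python
import math


def log_gaussian(frame, mean, var):
    """Log-density of a frame under a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(frame, mean, var))


# Hypothetical per-class parameters: one mean and variance per feature dimension.
classes = {
    "aa": {"mean": [1.0, 0.0], "var": [1.0, 1.0]},
    "iy": {"mean": [-1.0, 2.0], "var": [1.0, 1.0]},
}

frame = [0.8, 0.1]
# One log-likelihood per class for this frame.
scores = {name: log_gaussian(frame, p["mean"], p["var"])
          for name, p in classes.items()}
best_class = max(scores, key=scores.get)
```

Working in the log domain keeps the scores numerically stable when many frames are combined.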

While the pronunciation model can be very complex, it is typically just a dictionary

The dictionary contains the valid pronunciations for each word

Examples:
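
A dictionary of this kind could be sketched as a plain mapping from words to phone sequences. The entries and helper names below are illustrative only, and the phone symbols follow an ARPAbet-like convention chosen for the example.

```python
# Illustrative entries only; each word maps to one or more phone sequences.
lexicon = {
    "tomato": [["T", "AH", "M", "EY", "T", "OW"],
               ["T", "AH", "M", "AA", "T", "OW"]],
    "speech": [["S", "P", "IY", "CH"]],
}


def pronunciations(word):
    """Return every listed phone sequence for a word, or an empty list if unknown."""
    return lexicon.get(word.lower(), [])


def expand(words):
    """Phone sequence for a word sequence, using the first listed pronunciation of each."""
    phones = []
    for word in words:
        variants = pronunciations(word)
        if not variants:
            raise KeyError(f"no pronunciation for {word!r}")
        phones.extend(variants[0])
    return phones
```

A word with several valid pronunciations simply has several entries, as with "tomato" here.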

Now we need some way of representing the likelihood of any given word sequence

Many methods exist, but n-grams are the most common

N-gram models are trained by simply counting the occurrences of word sequences in a training set

A unigram is the probability of any word in isolation

A bigram is the probability of a word given the previous word

Higher-order n-grams continue in a similar fashion

A backoff probability is used for any word sequence not seen in training
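
Counting-based training can be sketched as below. The corpus is a toy example, and the backoff shown (a fixed weight times the unigram estimate) is a deliberate simplification assumed for illustration. It does not produce a properly normalized distribution.

```python
from collections import Counter

# Toy training set; <s> and </s> mark sentence boundaries.
corpus = [
    "the cat sat",
    "the cat ran",
    "the dog sat",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

total_words = sum(unigrams.values())


def p_unigram(word):
    """Relative frequency of a word in the training set."""
    return unigrams[word] / total_words


def p_bigram(word, prev, backoff_weight=0.4):
    """P(word | prev), backing off to a scaled unigram estimate for unseen pairs."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    # Simplified backoff (an assumption): fixed weight times the unigram estimate.
    return backoff_weight * p_unigram(word)
```

Here "cat" follows "the" in two of its three occurrences, so `p_bigram("cat", "the")` is 2/3, while the unseen pair ("dog", "ran") falls back to the scaled unigram estimate for "ran".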

We now have models to represent the three parts of our equation

We need a framework to join these models together

The standard framework used is the Hidden Markov Model (HMM)
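
In one common formulation, assumed here with symbols chosen for illustration, the three parts are an acoustic model, a pronunciation model, and a language model. With O the observed feature sequence, W a word sequence, and Q a phone sequence:

```latex
\hat{W} = \arg\max_{W} P(W \mid O)
        = \arg\max_{W} P(O \mid W)\, P(W)
        = \arg\max_{W} \sum_{Q}
          \underbrace{P(O \mid Q)}_{\text{acoustic model}}\,
          \underbrace{P(Q \mid W)}_{\text{pronunciation model}}\,
          \underbrace{P(W)}_{\text{language model}}
```

The second step is Bayes' rule with the constant P(O) dropped, and the last step assumes the observations depend on the words only through the phone sequence.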

A state model using the Markov property

Models the likelihood of transitions between states

Given the model, we can determine the likelihood of any sequence of states
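
A minimal sketch of scoring a state sequence under the Markov property, where each state depends only on the one before it. The two-state model and its probabilities are hypothetical.

```python
# Hypothetical two-state model; each row of transition probabilities sums to 1.
initial = {"sunny": 0.6, "rainy": 0.4}
transition = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}


def sequence_probability(states):
    """P(s1, ..., sn) = P(s1) times the product of P(s_t | s_{t-1})."""
    if not states:
        return 1.0
    prob = initial[states[0]]
    for prev, curr in zip(states, states[1:]):
        prob *= transition[prev][curr]
    return prob


p = sequence_probability(["sunny", "sunny", "rainy"])  # 0.6 * 0.8 * 0.2
```

For long sequences the product underflows quickly, so practical systems usually sum log-probabilities instead.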