Speech Recognition





Speech signal to text




Contextual effects

  • Contextual effects

    • Speech sounds vary with context
      • “How do you do?”
      • “Half and half”
      • /t/ in butter vs. bat
  • Within-speaker variability

    • Speaking rate, intensity, F0 contour
    • Voice quality
    • Speaking style
  • Between-speaker variability

    • Gender and age
    • Accents, dialects, native vs. non-native
      • Scottish vs. American /r/ in some contexts
  • Environment variability

    • Background noise
    • Microphone type


Speech Recognition

  • Speech Recognition

  • Feature Extraction

  • Modeling Speech

    • Hidden Markov Models (HMM): 3 basic problems
  • HMM Toolkit (HTK)

    • Steps for building an ASR using HTK


Waveform?

  • Spectrogram?

  • Need a representation of the speech signal that is robust to acoustic variation but sensitive to linguistic content



Extract features from short frames (frame period 10 ms, frame size 25 ms) – a sequence of feature vectors



Mel Scale: Approximate the unequal sensitivity of human hearing at different frequencies


  • Based on pitch perception



MFCC (Mel-frequency cepstral coefficients)

  • Widely used in speech recognition

  • Take the Fourier transform of the (windowed) signal → spectrum

  • Map the powers of the spectrum onto the mel scale and take the log

  • Take the discrete cosine transform of the mel log-amplitudes

  • The MFCCs are the amplitudes of the resulting spectrum
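The pipeline above can be sketched in plain NumPy. This is a minimal illustration, not a production front end; the 16 kHz sample rate, 512-point FFT, 26 filters and 12 coefficients are assumed defaults, not taken from these notes:

```python
import numpy as np

def hz_to_mel(f):
    # Mel scale: m = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_mels=26, n_ceps=12):
    # 1. Slice into 25 ms frames every 10 ms and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # 2. Fourier transform -> power spectrum.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # 3. Triangular mel filterbank, spaced evenly on the mel scale; log powers.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)
    # 4. DCT-II of the mel log-amplitudes; keep the first n_ceps amplitudes.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.arange(1, n_ceps + 1)[:, None] * (2 * n + 1) / (2.0 * n_mels))
    return log_mel @ dct.T  # shape: (n_frames, n_ceps)

# 1 s of a synthetic 440 Hz tone -> 98 frames of 12 coefficients each
feats = mfcc(np.sin(2 * np.pi * 440 * np.arange(16000) / 16000))
```

Real front ends add pre-emphasis, liftering and an energy term on top of this.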


Extract a feature vector from each frame

  • 12 MFCC coefficients + 1 normalized energy = 13 features

  • Delta MFCC = 13

  • Delta-Delta MFCC = 13

  • Total: 39 features

  • Inverted MFCCs:
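A sketch of how the delta and delta-delta streams can be computed and stacked onto the 13 static features. The ±2-frame regression window and the standard regression formula are assumptions here (common defaults, not stated in these notes):

```python
import numpy as np

def deltas(feat, N=2):
    """Delta features: regression slope over +/-N neighbouring frames.
    Delta-deltas are simply deltas of the deltas."""
    T = len(feat)
    padded = np.pad(feat, ((N, N), (0, 0)), mode="edge")  # repeat edge frames
    denom = 2 * sum(n * n for n in range(1, N + 1))
    return np.stack([
        sum(n * (padded[t + N + n] - padded[t + N - n]) for n in range(1, N + 1)) / denom
        for t in range(T)
    ])

static = np.random.default_rng(0).normal(size=(98, 13))  # 12 MFCCs + energy
d = deltas(static)                  # 13 delta features
dd = deltas(d)                      # 13 delta-delta features
obs = np.hstack([static, d, dd])    # 39-dimensional observation vectors
```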




Weighted finite-state acceptor: the future is independent of the past given the present (the Markov property)



An HMM is a Markov chain plus an emission probability function for each state

  • HMM: M = (A, B, Pi)

  • A = transition matrix

  • B = observation (emission) distributions

  • Pi = initial state probabilities
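As a toy illustration of the three parameter sets, here is an invented 2-state discrete HMM with 2 observation symbols:

```python
import numpy as np

# Toy 2-state discrete HMM M = (A, B, Pi); all numbers invented.
A = np.array([[0.7, 0.3],    # transition matrix: A[i, j] = P(next = j | now = i)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],    # emission distributions: B[i, k] = P(symbol k | state i)
              [0.2, 0.8]])
Pi = np.array([0.6, 0.4])    # initial state probabilities

# Each row of A and B, and Pi itself, must be a probability distribution.
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
assert np.isclose(Pi.sum(), 1.0)
```

In speech, B is usually a Gaussian mixture over MFCC vectors rather than a discrete table; the discrete case just keeps the example small.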





HMM: 3 basic problems

  • Evaluation

  • Decoding

  • Training



Evaluation: given an observation sequence O and a model M, how can we efficiently compute P(O | M), the likelihood of O given the model?

  • Solved by the forward algorithm
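The forward algorithm computes P(O | M) in O(TN^2) time. A minimal NumPy sketch for a discrete toy HMM (parameters invented):

```python
import numpy as np

def forward(A, B, Pi, obs):
    """Forward algorithm: P(O | M) in O(T N^2) time.
    alpha[t, i] = P(o_1..o_t, state_t = i | M)."""
    T, N = len(obs), len(Pi)
    alpha = np.zeros((T, N))
    alpha[0] = Pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]   # sum over predecessors
    return alpha[-1].sum()

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
Pi = np.array([0.6, 0.4])
print(forward(A, B, Pi, [0, 1, 0]))   # likelihood of observing symbols 0, 1, 0
```

Summing the likelihood over all 2^3 possible observation sequences of length 3 gives 1, a useful sanity check.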





Decoding: the Viterbi algorithm finds the most likely state sequence efficiently, in O(TN^2) time (T frames, N states)
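A matching Viterbi sketch in the log domain, using the same invented toy model:

```python
import numpy as np

def viterbi(A, B, Pi, obs):
    """Viterbi: most likely state sequence in O(T N^2), computed in log space."""
    T, N = len(obs), len(Pi)
    logA, logB = np.log(A), np.log(B)
    delta = np.log(Pi) + logB[:, obs[0]]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA        # scores[i, j]: best path ending i -> j
        back[t] = scores.argmax(axis=0)       # best predecessor of each state j
        delta = scores.max(axis=0) + logB[:, obs[t]]
    # Trace back the best path from the best final state.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(delta.max())

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
Pi = np.array([0.6, 0.4])
states, logp = viterbi(A, B, Pi, [0, 1, 0])
```

For this model and observation sequence the single best state path is 0 → 1 → 0, whereas the forward algorithm sums over all paths.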



Training: how do we estimate the model parameters M = (A, B, Pi) to maximize P(O | M)?

  • Baum-Welch (forward-backward, an instance of EM) algorithm
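A single Baum-Welch re-estimation step for a discrete toy HMM with one observation sequence can be sketched as below; HTK's HERest does the continuous-density, multi-utterance version of the same update. Each step is guaranteed not to decrease P(O | M):

```python
import numpy as np

def baum_welch_step(A, B, Pi, obs):
    """One Baum-Welch (EM) re-estimation step for a discrete HMM."""
    T, N = len(obs), len(Pi)
    # E-step: forward (alpha) and backward (beta) probabilities.
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = Pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    like = alpha[-1].sum()                               # P(O | M) under old params
    gamma = alpha * beta / like                          # P(state_t = i | O, M)
    xi = (alpha[:-1, :, None] * A[None] *                # P(state_t=i, state_t+1=j | O, M)
          (B[:, obs[1:]].T * beta[1:])[:, None, :]) / like
    # M-step: re-estimate parameters from the expected counts.
    A_new = xi.sum(0) / gamma[:-1].sum(0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        B_new[:, k] = gamma[np.array(obs) == k].sum(0) / gamma.sum(0)
    return A_new, B_new, gamma[0], like

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
Pi = np.array([0.6, 0.4])
for _ in range(5):                       # iterate: likelihood never decreases
    A, B, Pi, like = baum_welch_step(A, B, Pi, [0, 1, 0, 0, 1])
```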



HTK is a research toolkit for building and manipulating HMMs

  • Primarily designed for building HMM-based ASR systems

  • Tools, for example:

    • Extracting MFCC features
    • HMM algorithms
    • Grammar networks
    • Speaker adaptation


Examples:

  • Dial three three two six five four

  • Phone Woodland

  • Call Steve Young

  • Grammar:

    • $digit = ONE | TWO | THREE | FOUR | FIVE | SIX | SEVEN | EIGHT | NINE | OH | ZERO;
    • $name = [ JOOP ] JANSEN | [ JULIAN ] ODELL | [ DAVE ] OLLASON | [ PHIL ] WOODLAND
    • ( SENT-START ( DIAL <$digit> | (PHONE|CALL) $name ) SENT-END )
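The $name rule above is cut off in these notes (the HTK book's version of this grammar also lists [ STEVE ] YOUNG, matching the "Call Steve Young" example). HTK's HParse tool compiles such a grammar into a word network; purely as an illustration of the language the grammar accepts, here is a toy Python sampler (the probabilities are invented):

```python
import random

# Toy sampler for the voice-dialing grammar above (illustrative only).
# In HTK, <$digit> means one or more repetitions and [ X ] means X is optional.
DIGITS = ["ONE", "TWO", "THREE", "FOUR", "FIVE", "SIX",
          "SEVEN", "EIGHT", "NINE", "OH", "ZERO"]
NAMES = [(["JOOP"], "JANSEN"), (["JULIAN"], "ODELL"),
         (["DAVE"], "OLLASON"), (["PHIL"], "WOODLAND")]

def sample_sentence(rng):
    if rng.random() < 0.5:
        # DIAL <$digit>: one or more digits
        words = ["DIAL"] + rng.choices(DIGITS, k=rng.randint(1, 7))
    else:
        first, last = rng.choice(NAMES)
        words = [rng.choice(["PHONE", "CALL"])]
        if rng.random() < 0.5:          # optional first name
            words += first
        words.append(last)
    return " ".join(words)

rng = random.Random(0)
print(sample_sentence(rng))
```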






The HTK scripting language is used to generate phonetic transcriptions for all the training data



For each wave file, extract MFCC features

  • .wav → .mfc files
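In HTK this step is done with the HCopy tool, driven by a configuration file. A configuration along the lines of the HTK book tutorial (all values illustrative) might be:

```
# Illustrative HCopy configuration, following the HTK book tutorial
SOURCEFORMAT = WAV         # read .wav input
TARGETKIND   = MFCC_0_D_A  # 12 MFCCs + C0 + deltas + delta-deltas = 39 features
TARGETRATE   = 100000.0    # frame period 10 ms (units of 100 ns)
WINDOWSIZE   = 250000.0    # frame size 25 ms
USEHAMMING   = T           # Hamming window
PREEMCOEF    = 0.97        # pre-emphasis coefficient
NUMCHANS     = 26          # mel filterbank channels
CEPLIFTER    = 22          # cepstral liftering
NUMCEPS      = 12          # number of cepstral coefficients
```

It would be invoked as e.g. `HCopy -T 1 -C config -S codetr.scp`, where each line of the script file pairs a source .wav with a target .mfc (file names here are illustrative).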



5 states: 3 emitting states (plus non-emitting entry and exit states)

  • Flat start: means and variances are initialized to the global mean and variance of all the training data
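The topology is given by a prototype HMM definition. A sketch in HTK's notation, following the HTK book (the elisions stand for the remaining values; numbers are illustrative):

```
~o <VecSize> 39 <MFCC_0_D_A>
~h "proto"
<BeginHMM>
  <NumStates> 5
  <State> 2
    <Mean> 39
      0.0 0.0 0.0 (... 39 zeros ...)
    <Variance> 39
      1.0 1.0 1.0 (... 39 ones ...)
  (states 3 and 4 are defined the same way)
  <TransP> 5
    0.0 1.0 0.0 0.0 0.0
    0.0 0.6 0.4 0.0 0.0
    0.0 0.0 0.6 0.4 0.0
    0.0 0.0 0.0 0.7 0.3
    0.0 0.0 0.0 0.0 0.0
<EndHMM>
```

A command like `HCompV -C config -f 0.01 -m -S train.scp -M hmm0 proto` then performs the flat start, replacing the prototype's means and variances with the global ones.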



For each training pair of files (.mfc + .lab):

  • 1. Concatenate the corresponding monophone HMMs

  • 2. Use the Baum-Welch algorithm to train the HMMs given the MFCC features
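In HTK, this embedded re-estimation over the whole training set is performed by HERest; a typical invocation in the style of the HTK book (file names and pruning thresholds illustrative):

```
HERest -C config -I phones0.mlf -t 250.0 150.0 1000.0 \
       -S train.scp -H hmm0/macros -H hmm0/hmmdefs -M hmm1 monophones0
```

Several such passes are usually run, each reading the models from one directory and writing the re-estimated models to the next.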



So far, all monophone models have been trained

  • Next, train the short-pause (sp) model



The dictionary contains multiple pronunciations for some words.

  • Forced alignment selects, for each training utterance, the pronunciation that best matches the audio
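In HTK, forced alignment is done by running HVite in alignment mode; a command along the lines of the HTK book tutorial (file names and thresholds illustrative):

```
HVite -l '*' -o SWT -b silence -C config -a -H hmm7/macros -H hmm7/hmmdefs \
      -i aligned.mlf -m -t 250.0 -I words.mlf -S train.scp dict monophones1
```

The `-a` flag enables alignment against the given word-level transcriptions, and the output label file records which pronunciation was chosen for each word.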






After getting the best pronunciation

  • Retrain using the Baum-Welch algorithm with the best pronunciations


Phones may be realized differently in some contexts

  • ⇒ Build context-dependent acoustic models (HMMs)

  • Triphones: one preceding and one succeeding phone

  • Make triphones from monophones
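A sketch of the monophone-to-triphone relabeling in HTK's L-P+R notation (HTK's HLEd tool with a TC command does the real conversion; leaving silence context-independent, as below, is a common convention and an assumption here):

```python
def to_triphones(phones, keep=("sil", "sp")):
    """Relabel a monophone sequence as context-dependent triphones L-P+R.
    Phones in `keep` (silence, short pause) stay context-independent and
    do not act as context for their neighbours."""
    out = []
    for i, p in enumerate(phones):
        if p in keep:
            out.append(p)
            continue
        left = phones[i - 1] if i > 0 and phones[i - 1] not in keep else None
        right = phones[i + 1] if i + 1 < len(phones) and phones[i + 1] not in keep else None
        name = p
        if left:
            name = f"{left}-{name}"
        if right:
            name = f"{name}+{right}"
        out.append(name)
    return out

print(to_triphones(["sil", "hh", "ae", "d", "sil"]))
# -> ['sil', 'hh+ae', 'hh-ae+d', 'ae-d', 'sil']
```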



Clustering by growing decision trees

  • All states in the same leaf are tied



Train the acoustic models again using the Baum-Welch algorithm (HERest)

  • Increase the number of Gaussians for each state (mixture splitting)

    • HHEd followed by HERest
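In HTK, mixture splitting is requested through an HHEd edit script, in the style of the HTK book (file names illustrative):

```
# mix.hed -- double the number of Gaussians in every emitting state
MU 2 {*.state[2-4].mix}
```

applied with e.g. `HHEd -H hmm13/macros -H hmm13/hmmdefs -M hmm14 mix.hed tiedlist`, followed by further HERest passes; the split-then-retrain cycle is repeated until the desired number of mixture components is reached.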


Using the compiled grammar network (WNET)

  • Given a new speech file:

    • Extract the MFCC features (.mfc file)
    • Run Viterbi on the WNET given the .mfc file to get the most likely word sequence
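A decoding command in the style of the HTK book, where HVite runs Viterbi over the word network (file names, the word insertion penalty `-p` and the grammar scale factor `-s` are illustrative):

```
HVite -H hmm15/macros -H hmm15/hmmdefs -S test.scp -l '*' -i recout.mlf \
      -w wdnet -p 0.0 -s 5.0 dict tiedlist
```

The recognized word sequences are written to the master label file `recout.mlf`.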


Summary

  • MFCC features

  • HMM: 3 basic problems

  • Steps for building an ASR using HTK:

    • Features and data preparation
    • Monophone topology
    • Flat start
    • Training monophones
    • Handling multiple pronunciations
    • Context-dependent acoustic models (triphones) + tying
    • Final training
    • Decoding





