Historical introduction


Date: 04.11.2017

Historical introduction
Mathematical background (e.g., pattern classification, acoustics)
Feature extraction for speech recognition (and some neural processing)
What sound units are typically defined
Audio signal processing topics (pitch extraction, perceptual audio coding, source separation, music analysis)
Now – back to pattern recognition, but include time

ASR = static pattern classification + sequence recognition

Deterministic sequence recognition: template matching
Templates are typically word-based; don’t need phonetic sound units per se
Still need to put together local distances into something global (per word or utterance)

Basic approach the same for deterministic, statistical:

25 ms windows (e.g., Hamming), 10 ms steps (a frame)
Some kind of cepstral analysis (e.g., MFCC or PLP)
Cepstral vector at time n called x_n
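
The framing step above (25 ms Hamming windows, 10 ms steps) can be sketched as follows. This is a minimal illustration, not the course's code: the function name `frame_signal`, the 16 kHz sample rate, and the sine test signal are assumptions, and the cepstral analysis (MFCC or PLP) that would follow is left out.

```python
import math

def frame_signal(samples, sample_rate, win_ms=25.0, step_ms=10.0):
    """Split a list of samples into Hamming-windowed frames."""
    win_len = int(round(sample_rate * win_ms / 1000.0))
    step = int(round(sample_rate * step_ms / 1000.0))
    # Hamming window: 0.54 - 0.46 * cos(2*pi*k / (N - 1))
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (win_len - 1))
              for k in range(win_len)]
    frames = []
    start = 0
    # Only full windows are kept; a trailing partial window is dropped.
    while start + win_len <= len(samples):
        frames.append([samples[start + k] * window[k] for k in range(win_len)])
        start += step
    return frames

# Assumed test input: one second of a 440 Hz sine at 16 kHz.
signal = [math.sin(2.0 * math.pi * 440.0 * t / 16000.0) for t in range(16000)]
frames = frame_signal(signal, 16000)
print(len(frames), len(frames[0]))
```

Each frame would then be turned into one cepstral vector x_n.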

Words, phones most common

For template-based ASR, mostly words
For template-based ASR, local distances based on examples (reference frames) versus input frames

Easy if local matches are all correct (never happens!)

Local matches are unreliable
Need measure of goodness of fit
Need to integrate into global measure
Need to consider all possible sequences

Matrix for comparison between frames

Word template = multiple feature vectors
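Reference template = sequence of reference frames
Input template = sequence of input frames
Need to find D(input, reference), the global distance between the two templates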

Time Normalization

Which references to use
Defining distances/costs
Endpoints for input templates

Linear Time Normalization
Nonlinear Time Normalization – Dynamic Time Warp (DTW)
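
For contrast with DTW, linear time normalization can be sketched as a fixed proportional mapping between input and reference frames. This is only an illustration: the function names and the Euclidean local distance are assumptions, not taken from the slides.

```python
def euclidean(a, b):
    """Local distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def linear_norm_distance(inp, ref):
    """Average local distance under a purely linear time alignment."""
    n_in, n_ref = len(inp), len(ref)
    total = 0.0
    for i in range(n_in):
        # First input frame maps to first reference frame, last to last,
        # everything in between is placed proportionally.
        j = 0 if n_in == 1 else round(i * (n_ref - 1) / (n_in - 1))
        total += euclidean(inp[i], ref[j])
    return total / n_in

ref = [[0.0], [1.0], [2.0]]
print(linear_norm_distance(ref, ref))
```

Every sound is stretched by the same factor here, which is exactly the limitation that motivates the nonlinear warp.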

Speech sounds stretch/compress differently

Stop consonants versus vowels
Need to normalize differently

Permit many more variations

Ideally, compare all possible time warpings
Vintsyuk (1968): use dynamic programming

Bellman optimality principle (1962): an optimal policy is built from optimal policies for its subproblems

Best path through grid: if best path goes through grid point, best path includes best partial path to grid point
Classic example: knapsack problem

Stuffing a sack with items of different sizes and values

Goal: maximize value in sack
Key point 1: If max size is 10, and we know values of solutions for max size of 9, we can compute the final answer knowing the value of adding items.
Key point 2: Point 1 sounds recursive, but can be made efficiently nonrecursive by building a table
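
The two key points can be sketched with a table indexed by sack size. This is a minimal sketch that assumes the unbounded variant (each item type may be used more than once); the item list is made up for illustration.

```python
def knapsack_unbounded(items, capacity):
    """items: list of (size, value) pairs; returns the best total value."""
    # best[c] = best value achievable with a sack of size c
    best = [0] * (capacity + 1)
    for c in range(1, capacity + 1):
        # Leaving one unit of space unused is always allowed.
        best[c] = best[c - 1]
        for size, value in items:
            if size <= c:
                # Reuse the already-computed answer for the smaller sack.
                best[c] = max(best[c], best[c - size] + value)
    return best[capacity]

# Hypothetical items: (size, value)
print(knapsack_unbounded([(3, 4), (4, 5), (7, 10)], 10))  # prints 14
```

The table is filled once from small sizes to large, so no recursion is needed.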

Apply DP to ASR: Vintsyuk, Bridle, Sakoe

Let D(i,j) = total distortion up to frame i in input and frame j in reference
Let d(i,j) = local distance between frame i in input and frame j in reference
Let p(i,j) = set of possible predecessors to frame i in input and frame j in reference
$D(i,j) = d(i,j) + \min_{p(i,j)} D(p(i,j))$

(1) Compute local distance d in the 1st column (1st frame of input) for each reference template. Let D(0,j) = d(0,j) for each cell in each template
(2) For i=1 (2nd column), j=0, compute d(i,j) and add it to the minimum over all possible predecessor values of D to get D(i,j); repeat for each frame j in each template.
(3) Repeat (2) for each column to the end of input
(4) For each template, find best D in last column of input
(5) Choose the word for the template with smallest D
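
Steps (1) to (5) can be sketched as follows. The predecessor set (same reference frame, one back, or two back in the previous input column) and the Euclidean local distance are assumptions chosen for illustration; the slides leave p(i,j) open. The word labels and feature values are made up.

```python
def local_dist(a, b):
    """d(i,j): Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def dtw_distortion(inp, ref):
    """Best total distortion D between an input and one reference template."""
    n_in, n_ref = len(inp), len(ref)
    D = [[float("inf")] * n_ref for _ in range(n_in)]
    for j in range(n_ref):                       # step (1)
        D[0][j] = local_dist(inp[0], ref[j])
    for i in range(1, n_in):                     # steps (2) and (3)
        for j in range(n_ref):
            # Assumed predecessors: (i-1, j), (i-1, j-1), (i-1, j-2)
            preds = [D[i - 1][k] for k in (j, j - 1, j - 2) if k >= 0]
            D[i][j] = local_dist(inp[i], ref[j]) + min(preds)
    return min(D[n_in - 1])                      # step (4)

def recognize(inp, templates):
    """Step (5): pick the word whose template gives the smallest D."""
    return min(templates, key=lambda w: dtw_distortion(inp, templates[w]))

templates = {"yes": [[0.0], [1.0], [2.0], [1.0]], "no": [[5.0], [4.0], [5.0]]}
utterance = [[0.0], [0.1], [1.0], [2.1], [2.0], [1.0]]
print(recognize(utterance, templates))
```

A practical system would usually also constrain where paths may start and end in the reference.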

Computation: $O(N_{\mathrm{frames,ref}} \cdot N_{\mathrm{frames,in}} \cdot N_{\mathrm{templates}})$
Storage, though, can be just $O(N_{\mathrm{frames,ref}} \cdot N_{\mathrm{templates}})$
(store current column and previous column)
Constant reduction: global constraints
Constant reduction: local constraints
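
The storage saving and a global constraint can be sketched together. The band around the linear alignment path used here is one common form of global constraint and is an assumption, as are the function name, the `band` parameter, and the predecessor set.

```python
def dtw_two_columns(inp, ref, dist, band=None):
    """DTW distortion keeping only the previous and current columns."""
    INF = float("inf")
    n_in, n_ref = len(inp), len(ref)

    def in_band(i, j):
        # Global constraint: stay within `band` reference frames of the
        # position a purely linear alignment would predict.
        if band is None:
            return True
        centre = i * (n_ref - 1) / max(n_in - 1, 1)
        return abs(j - centre) <= band

    prev = [dist(inp[0], ref[j]) if in_band(0, j) else INF
            for j in range(n_ref)]
    for i in range(1, n_in):
        cur = [INF] * n_ref
        for j in range(n_ref):
            if not in_band(i, j):
                continue  # cell skipped: this is where the time is saved
            # Local constraint (assumed): come from j, j-1 or j-2.
            best = min(prev[k] for k in (j, j - 1, j - 2) if k >= 0)
            cur[j] = dist(inp[i], ref[j]) + best
        prev = cur  # the older column is no longer needed
    return min(prev)

d = lambda a, b: abs(a - b)
print(dtw_two_columns([0, 0, 1, 2, 3, 3], [0, 1, 2, 3], d, band=1))
```

Both kinds of constraint shrink the constant factor; the order of growth stays the same.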

All examples?
Prototypes?
DTW-based global distances permit clustering

(1) Initialize (how many, where)
(2) Assign examples to closest center (DTW distance)
(3) For each cluster, find the template whose maximum distance to the other members of the cluster is smallest; call it the center
(4) Repeat (2) and (3) until some stopping criterion is reached
(5) Use center templates as references for ASR
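
The clustering loop above can be sketched as follows. The compact DTW helper (scalar frames, endpoints pinned, three-way step), the choice of initial centers, and the stopping rule (centers unchanged) are all assumptions made to keep the example small.

```python
def dtw(a, b):
    """Small symmetric DTW between two sequences of scalar frames."""
    INF = float("inf")
    D = [[INF] * (len(b) + 1) for _ in range(len(a) + 1)]
    D[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            D[i][j] = abs(a[i - 1] - b[j - 1]) + min(
                D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[len(a)][len(b)]

def cluster_templates(examples, init_centers, max_iter=20):
    """Return sorted indices of the examples chosen as cluster centers."""
    centers = list(init_centers)                          # step (1)
    for _ in range(max_iter):
        clusters = {c: [] for c in centers}
        for idx, ex in enumerate(examples):               # step (2)
            nearest = min(centers, key=lambda c: dtw(ex, examples[c]))
            clusters[nearest].append(idx)
        new_centers = []
        for members in clusters.values():                 # step (3)
            if not members:
                continue
            # Minimax center: smallest worst-case distance to the cluster.
            new_centers.append(min(
                members,
                key=lambda m: max(dtw(examples[m], examples[o])
                                  for o in members)))
        if sorted(new_centers) == sorted(centers):        # step (4)
            break
        centers = new_centers
    return sorted(centers)

examples = [[0, 1, 2], [0, 1, 3], [0, 1, 5], [9, 9, 9], [9, 8, 9]]
print(cluster_templates(examples, init_centers=[2, 3]))  # prints [1, 3]
```

The returned examples would then serve as the reference templates in step (5).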

Normalizing for scale

Cepstral weighting
Perceptual weighting, e.g., JND
Learning distances, e.g., with ANN, statistics

Sounds easy

Hard in practice (noise, reverb, gain issues)
Simple systems use energy, time thresholds
More complex ones also use spectrum
Can be tuned
Not robust
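
A simple energy and time threshold detector of the kind mentioned above might look like the sketch below. The function name, the threshold values, and the rule (first run of frames above the energy threshold that lasts long enough) are assumptions for illustration only.

```python
def detect_endpoints(frame_energies, energy_threshold, min_speech_frames=5):
    """Return (start, end) frame indices of the first long-enough run of
    frames whose energy exceeds the threshold, or None if there is none."""
    run_start = None
    for i, energy in enumerate(frame_energies):
        if energy > energy_threshold:
            if run_start is None:
                run_start = i
        else:
            # Run just ended: accept it only if it lasted long enough.
            if run_start is not None and i - run_start >= min_speech_frames:
                return (run_start, i - 1)
            run_start = None
    # Handle a run that continues to the end of the input.
    if (run_start is not None
            and len(frame_energies) - run_start >= min_speech_frames):
        return (run_start, len(frame_energies) - 1)
    return None

energies = [0.1, 0.1, 5, 0.1, 4, 5, 6, 5, 4, 0.1, 0.1]
print(detect_endpoints(energies, energy_threshold=1, min_speech_frames=3))
```

The short burst at frame 2 is rejected by the time threshold, which shows both why such detectors can be tuned and why a fixed threshold is fragile under noise or gain changes.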

Time normalization
Recognition
Segmentation
Can’t have templates for all utterances
DP to the rescue

Vintsyuk, Bridle, Sakoe
Sakoe: 2-level algorithm
Vintsyuk, Bridle: one stage
Ney's explanation: Ney, H., “The use of a one-stage dynamic programming algorithm for connected word recognition,” IEEE Trans. Acoust. Speech Signal Process. 32: 263-271, 1984

In principle: one big distortion matrix (for 20,000 words, 50 frames/word, and a 1000-frame input [10 seconds], that would be 10^9 cells!)

Also required, backtracking matrix (since word segmentation not known)
Get best distortion
Backtrack to get words
Fundamental principle: find best segmentation and classification as part of the same process, not as sequential steps

In principle, backtracking matrix points back to best previous cell

Mostly just need backtrack to end of previous word
Simplifications possible

Distortion matrix -> 2 columns
Backtracking matrix -> 2 rows
“From template” points to template with lowest cost ending here
“From frame” points to end frame of previous word
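
The reduced bookkeeping can be sketched as below: one previous distortion column per template, plus "from template" and "from frame" arrays with one entry per input frame. This is a simplified illustration under stated assumptions (scalar frames, absolute-difference local distance, within-word steps from j or j-1, any word may follow any word, no transition costs); it is not Ney's formulation verbatim.

```python
def one_stage_dp(inp, templates):
    """Connected-word DP; returns (list of words, total distortion)."""
    INF = float("inf")
    n = len(inp)
    from_template = [None] * n  # best word ending at each input frame
    from_frame = [-1] * n       # end frame of the word before that one
    end_cost = [INF] * n        # best distortion of a word sequence ending here
    prev_D = {w: [INF] * len(r) for w, r in templates.items()}
    prev_S = {w: [0] * len(r) for w, r in templates.items()}  # word start frame
    for i in range(n):
        cur_D, cur_S = {}, {}
        for w, ref in templates.items():
            col_D, col_S = [INF] * len(ref), [0] * len(ref)
            for j in range(len(ref)):
                cands = []  # (cost so far, start frame of current word)
                if j == 0:
                    # Between-word step: start a word at the utterance start
                    # or right after the best word ending at frame i-1.
                    cands.append((0.0, 0) if i == 0 else (end_cost[i - 1], i))
                if i > 0:
                    # Within-word steps: stay on frame j or advance from j-1.
                    cands.append((prev_D[w][j], prev_S[w][j]))
                    if j > 0:
                        cands.append((prev_D[w][j - 1], prev_S[w][j - 1]))
                if cands:
                    cost, start = min(cands)
                    col_D[j] = abs(inp[i] - ref[j]) + cost
                    col_S[j] = start
            cur_D[w], cur_S[w] = col_D, col_S
        best_w = min(templates, key=lambda w: cur_D[w][-1])
        end_cost[i] = cur_D[best_w][-1]
        from_template[i] = best_w
        from_frame[i] = cur_S[best_w][-1] - 1
        prev_D, prev_S = cur_D, cur_S
    if end_cost[-1] == INF:
        return None, INF
    words, i = [], n - 1
    while i >= 0:               # backtrack word by word
        words.append(from_template[i])
        i = from_frame[i]
    return words[::-1], end_cost[-1]

templates = {"a": [1, 2, 3], "b": [7, 7]}
print(one_stage_dp([1, 2, 2, 3, 7, 7, 7, 1, 2, 3], templates))
```

Segmentation falls out of the backtrack: the word boundaries are never decided before recognition.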

“Within word” local constraints
“Between word” local constraints
Grammars
Transition costs

DTW combines segmentation, time norm, recognition; all segmentations considered

Same feature vectors used everywhere
Could segment separately, using acoustic-phonetic features cleverly
Example: FEATURE, Ron Cole (1983)

No structure from subword units

Average or exemplar values only
Cross-word pronunciation effects not handled
Limited flexibility for distance/distortion
Limited mathematical basis
-> Statistics!

Having examples can get interesting again when there are many of them

Potentially an augmentation of stat methods
Recent experiments show decent results
Somewhat different properties -> combination

Statistical ASR
Speech synthesis
Speaker recognition
Speaker diarization
Oral presentations on your projects
Written report on your project

Week of April 30: no class Monday, double class Wednesday May 2 (is that what people want?)

8 oral presentations by individuals, 12 minutes each + 3 minutes for questions
2 oral presentations by pairs – 17 minutes each + 3 minutes for questions
3:10 PM to 6 PM with a 10 minute mid-session break
Written report due Wednesday May 9, no late submissions (email attachment is fine)
