

Historical introduction

  • Historical introduction

  • Mathematical background (e.g., pattern classification, acoustics)

  • Feature extraction for speech recognition (and some neural processing)

  • What sound units are typically defined

  • Audio signal processing topics (pitch extraction, perceptual audio coding, source separation, music analysis)

  • Now – back to pattern recognition, but including time




ASR = static pattern classification + sequence recognition

  • Deterministic sequence recognition: template matching

  • Templates are typically word-based; don’t need phonetic sound units per se

  • Still need to put together local distances into something global (per word or utterance)





Basic approach is the same for deterministic and statistical ASR:

  • 25 ms windows (e.g., Hamming), 10 ms steps (one frame)
  • Some kind of cepstral analysis (e.g., MFCC or PLP)
  • Cepstral vector at time n written x_n
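A minimal numpy sketch of this front end, assuming a mono waveform `x` sampled at `sr` Hz that is at least one window long; the cepstral step (MFCC, PLP) would then be applied per frame:

```python
import numpy as np

def frame_signal(x, sr, win_ms=25.0, step_ms=10.0):
    """Slice a waveform into overlapping Hamming-windowed frames:
    25 ms windows advanced in 10 ms steps, as above."""
    win = int(sr * win_ms / 1000)    # samples per window
    step = int(sr * step_ms / 1000)  # samples per step (one frame)
    n_frames = 1 + (len(x) - win) // step
    window = np.hamming(win)
    return np.stack([x[i * step : i * step + win] * window
                     for i in range(n_frames)])

# Cepstral analysis of frame n then yields the feature vector x_n
# used everywhere below.
```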


Words, phones most common

  • For template-based ASR, mostly words

  • For template-based ASR, local distances based on examples (reference frames) versus input frames



Easy if local matches are all correct (never happens!)

  • Local matches are unreliable

  • Need measure of goodness of fit

  • Need to integrate into global measure

  • Need to consider all possible sequences





Matrix for comparison between frames

  • Word template = a sequence of feature vectors

  • Reference template = Y = (y_1, ..., y_M)

  • Input template = X = (x_1, ..., x_N)

  • Need to find a global distance D(X, Y)
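One common concrete choice for the local comparison is a Euclidean distance between cepstral vectors; a sketch of the resulting frame-by-frame distance matrix (the weighting schemes discussed later can refine this):

```python
import numpy as np

def local_distance_matrix(X, Y):
    """d(i, j) for every input frame x_i (rows of X, N x d) against
    every reference frame y_j (rows of Y, M x d), Euclidean."""
    diff = X[:, None, :] - Y[None, :, :]      # shape (N, M, d)
    return np.sqrt((diff ** 2).sum(axis=-1))  # shape (N, M)
```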



Time Normalization

  • Which references to use

  • Defining distances/costs

  • Endpoints for input templates



Linear Time Normalization

  • Linear Time Normalization

  • Nonlinear Time Normalization – Dynamic Time Warp (DTW)
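Linear time normalization simply stretches or compresses the whole template uniformly; a minimal sketch, interpolating each feature dimension onto a fixed number of frames:

```python
import numpy as np

def linear_time_normalize(frames, target_len):
    """Uniformly warp a (T, d) frame sequence to target_len frames by
    linear interpolation along the time axis."""
    T, d = frames.shape
    src = np.linspace(0.0, 1.0, T)           # original frame positions
    dst = np.linspace(0.0, 1.0, target_len)  # normalized positions
    return np.stack([np.interp(dst, src, frames[:, k])
                     for k in range(d)], axis=1)
```

As the next slides note, this uniform warp is exactly what fails for speech: different sounds stretch differently, which motivates DTW.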





Speech sounds stretch/compress differently

  • Stop consonants versus vowels

  • Need to normalize differently





Permit many more variations

  • Ideally, compare all possible time warpings

  • Vintsyuk (1968): use dynamic programming



Bellman optimality principle (1962): optimal policy given optimal policies for subproblems

  • Best path through a grid: if the best path goes through a grid point, it includes the best partial path up to that grid point

  • Classic example: the knapsack problem



Stuffing a sack with items of different sizes and values

  • Goal: maximize the total value in the sack

  • Key point 1: If the max size is 10 and we know the values of the solutions for max size 9, we can compute the final answer by knowing the value of adding each item.

  • Key point 2: Point 1 sounds recursive, but it can be made efficiently nonrecursive by building a table (see the sketch below)
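A standard table-building solution to the 0/1 knapsack, illustrating key point 2; the sizes and values here are made up for the example:

```python
def knapsack(sizes, values, capacity):
    """0/1 knapsack by DP: best[c] = maximum value achievable with
    total size <= c, updated item by item instead of recursing."""
    best = [0] * (capacity + 1)
    for size, value in zip(sizes, values):
        # Go through capacities downward so each item is added at most once.
        for c in range(capacity, size - 1, -1):
            best[c] = max(best[c], best[c - size] + value)
    return best[capacity]

print(knapsack(sizes=[5, 4, 6, 3], values=[10, 40, 30, 50], capacity=10))  # 90
```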





Apply DP to ASR: Vintsyuk, Bridle, Sakoe

  • Let D(i,j) = total distortion up to frame i in the input and frame j in the reference

  • Let d(i,j) = local distance between frame i in the input and frame j in the reference

  • Let p(i,j) = set of possible predecessors of grid point (i,j)

  • D(i,j) = d(i,j) + min_{(i',j') ∈ p(i,j)} D(i',j')



  • (1) Compute the local distance d in the 1st column (1st frame of input) for each reference template; let D(0,j) = d(0,j) for each cell in each template

  • (2) For i = 1 (2nd column) and each j, compute d(i,j) and add it to the minimum of all possible predecessor values of D to get D(i,j); repeat for each frame in each template

  • (3) Repeat (2) for each column to the end of the input

  • (4) For each template, find the best D in the last column of the input

  • (5) Choose the word for the template with the smallest D (a sketch of steps (1)-(5) follows)
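A compact sketch of steps (1)-(5), assuming Euclidean local distances, previous-column predecessors (i-1, j), (i-1, j-1), (i-1, j-2) as one possible local constraint, and the relaxed endpoints implied by step (1), so the path may start and end anywhere in the reference:

```python
import numpy as np

def dtw_distance(X, Y):
    """Total distortion D between input X (N x d) and one reference
    template Y (M x d), filled column by column over input frames."""
    N, M = len(X), len(Y)
    D = np.full((N, M), np.inf)
    for i in range(N):
        for j in range(M):
            d = np.linalg.norm(X[i] - Y[j])        # local distance d(i, j)
            if i == 0:
                D[i, j] = d                        # step (1): D(0, j) = d(0, j)
            else:                                  # step (2): recurrence
                D[i, j] = d + min(D[i - 1, jj]
                                  for jj in (j, j - 1, j - 2) if jj >= 0)
    return D[N - 1].min()                          # step (4): best D, last column

def recognize(X, templates):
    """Step (5): templates maps word -> (M x d) array; return the word
    whose template gives the smallest total distortion."""
    return min(templates, key=lambda w: dtw_distance(X, templates[w]))
```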



O(N_ref · N_in · N_templates) time, for N_ref frames per reference, N_in input frames, and N_templates templates

  • Storage, though, can be just O(N_ref · N_templates)

  • (store only the current and the previous column, as sketched below)

  • Constant-factor reduction: global constraints

  • Constant-factor reduction: local constraints
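The storage saving follows from the recurrence above only ever looking one column back; a sketch of the same computation keeping just two columns:

```python
import numpy as np

def dtw_distance_two_col(X, Y):
    """Same recurrence as dtw_distance above, but memory is O(M) per
    template: only the previous and current columns of D are kept."""
    N, M = len(X), len(Y)
    prev = np.array([np.linalg.norm(X[0] - y) for y in Y])  # column i = 0
    for i in range(1, N):
        cur = np.empty(M)
        for j in range(M):
            cur[j] = np.linalg.norm(X[i] - Y[j]) + min(
                prev[jj] for jj in (j, j - 1, j - 2) if jj >= 0)
        prev = cur
    return prev.min()
```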









All examples?

  • Prototypes?

  • DTW-based global distances permit clustering



  • (1) Initialize (how many, where)

  • (2) Assign examples to closest center (DTW distance)

  • (3) For each cluster, find the member template whose maximum distance to the other members is minimum; call it the center

  • (4) Repeat (2) and (3) until some stopping criterion is reached

  • (5) Use the center templates as references for ASR (a sketch of steps (2)-(4) follows)
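A sketch of steps (2)-(4) under the stated minimax-center rule, reusing dtw_distance from the earlier sketch; initialization (step (1)) is left to the caller:

```python
def cluster_templates(examples, centers, n_iter=10):
    """Cluster word examples under DTW distance. `examples` is a list
    of (T, d) frame arrays; `centers` is a list of indices into it."""
    for _ in range(n_iter):
        # Step (2): assign each example to the closest center.
        assign = [min(range(len(centers)),
                      key=lambda c: dtw_distance(ex, examples[centers[c]]))
                  for ex in examples]
        # Step (3): new center = member with the minimum value of its
        # maximum DTW distance to the other members.
        new_centers = []
        for c, old in enumerate(centers):
            members = [i for i, a in enumerate(assign) if a == c]
            if not members:                  # keep an empty cluster's center
                new_centers.append(old)
                continue
            new_centers.append(min(
                members,
                key=lambda i: max((dtw_distance(examples[i], examples[m])
                                   for m in members if m != i), default=0.0)))
        if new_centers == centers:           # step (4): stop at a fixed point
            break
        centers = new_centers
    return centers                           # step (5): reference templates
```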



Normalizing for scale

  • Cepstral weighting

  • Perceptual weighting, e.g., JND

  • Learning distances, e.g., with ANN, statistics
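One simple way to realize such scale normalization is a per-dimension weighted Euclidean local distance; the weight vector w stands in for whatever scheme is chosen (cepstral weighting, JND-derived weights, or weights learned from data):

```python
import numpy as np

def weighted_local_distance(x, y, w):
    """Weighted Euclidean d(x, y): w scales each cepstral dimension
    before comparison (e.g., cepstral weights or inverse variances)."""
    return np.sqrt(np.sum(w * (x - y) ** 2))
```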



Sounds easy

  • Hard in practice (noise, reverb, gain issues)

  • Simple systems use energy, time thresholds

  • More complex ones also use spectrum

  • Can be tuned

  • Not robust
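A toy version of the simple energy-plus-time-threshold detector described above; the threshold values are hypothetical and, as the slide warns, tuning them does not make this robust to noise, reverberation, or gain changes:

```python
import numpy as np

def simple_endpoints(frames, energy_thresh, min_frames=5):
    """Return (begin, end) frame indices of the first run of at least
    min_frames consecutive frames whose energy exceeds energy_thresh,
    or None if no such run exists."""
    energy = np.sum(frames ** 2, axis=1)   # per-frame energy
    above = energy > energy_thresh
    start = None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_frames:
                return start, i
            start = None
    if start is not None and len(above) - start >= min_frames:
        return start, len(above)
    return None
```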





Time normalization

  • Recognition

  • Segmentation

  • Can’t have templates for all utterances

  • DP to the rescue



Vintsyuk, Bridle, Sakoe

  • Sakoe: 2-level algorithm

  • Vintsyuk, Bridle: one stage

  • Explained in Ney, H., “The use of a one-stage dynamic programming algorithm for connected word recognition,” IEEE Trans. Acoust. Speech Signal Process. 32: 263–271, 1984



In principle: one big distortion matrix (for 20,000 words, 50 frames/word, and a 1000-frame input [10 seconds], that is 10⁹ cells!)

  • Also required: a backtracking matrix (since the word segmentation is not known)

  • Get best distortion

  • Backtrack to get words

  • Fundamental principle: find best segmentation and classification as part of the same process, not as sequential steps





In principle, backtracking matrix points back to best previous cell

  • Mostly just need backtrack to end of previous word

  • Simplifications possible



Distortion matrix -> 2 columns

  • Backtracking matrix -> 2 rows

  • “From template” points to template with lowest cost ending here

  • “From frame” points to end frame of previous word
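A sketch of just the backtracking step under these simplifications, assuming hypothetical per-input-frame arrays from_template[i] and from_frame[i] filled by the (omitted) one-stage forward pass, with from_frame[i] = -1 when the utterance's first word ends at frame i:

```python
def backtrack(from_template, from_frame, last_frame):
    """Recover the recognized word sequence from the two backpointer
    rows: from_template[i] = template with lowest cost ending at input
    frame i; from_frame[i] = end frame of the previous word."""
    words, i = [], last_frame
    while i >= 0:
        words.append(from_template[i])
        i = from_frame[i]          # jump to the end of the previous word
    return list(reversed(words))
```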





“Within word” local constraints

  • “Between word” local constraints

  • Grammars

  • Transition costs



DTW combines segmentation, time norm, recognition; all segmentations considered

  • Same feature vectors used everywhere

  • Could segment separately, using acoustic-phonetic features cleverly

  • Example: FEATURE, Ron Cole (1983)



No structure from subword units

  • Average or exemplar values only

  • Cross-word pronunciation effects not handled

  • Limited flexibility for distance/distortion

  • Limited mathematical basis

  • -> Statistics!



Having examples can get interesting again when there are many of them

  • Potentially an augmentation of statistical methods

  • Recent experiments show decent results

  • Somewhat different properties -> combination



Statistical ASR

  • Speech synthesis

  • Speaker recognition

  • Speaker diarization

  • Oral presentations on your projects

  • Written report on your project



Week of April 30: no class Monday, double class Wednesday May 2 (is that what people want?)

  • 8 oral presentations by individuals, 12 minutes each + 3 minutes for questions

  • 2 oral presentations by pairs – 17 minutes each + 3 minutes for questions

  • 3:10 PM to 6 PM with a 10-minute mid-session break

  • Written report due Wednesday May 9, no late submissions (email attachment is fine)


