Eric Fosler-Lussier The Ohio State University Speech & Language Technologies Lab ITRW - Speech Recognition & Intrinsic Variation 20 May 2006
Fill in the blanks
- 3, 6, __, 12, 15, __, 21, 24
- A B C __ E F __ H
- You’re going to Toulouse? Drink a bottle of _____ for me!
- What’s the red object?
Filling in the blanks: missing data Missing data approaches have been used to integrate over noisy acoustics
Decode this! (brackets indicate options)
- s iy n y {ah,ax,axr,er} -> senior
- {l,r} {eh,ih,iy} s er ch -> research
- {ah,ax} s ow {s,sh,z,zh} {eh,ih,iy} {eh,ey} {t,d} -> associate
(dictionary pronunciation as marked by transcribers, Buckeye Corpus of Speech)
What do these tasks have in common? Recovering from erroneous information? - Context plays a big role in helping “clean up”
Recovering from incomplete information! - We should be treating pronunciation variation as a missing data problem
- Integrate over “missing” phonological features
- How much information do you need to decode words?
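The missing-data view can be sketched in code: treat each phone position as a set of candidate phones (or as missing entirely) and keep every lexicon word consistent with the partial evidence. The tiny lexicon and pronunciations below are illustrative stand-ins, not the actual Buckeye data:

```python
# Sketch: decoding a word from an underspecified phone sequence.
# Each position holds a *set* of candidate phones (None = missing);
# a word matches if its pronunciation is consistent with every slot.
# The lexicon below is purely illustrative.

LEXICON = {
    "senior": ["s", "iy", "n", "y", "er"],
    "seen":   ["s", "iy", "n"],
    "sinner": ["s", "ih", "n", "er"],
}

def matches(observed, pron):
    """observed: list of candidate-phone sets (None = missing slot)."""
    if len(observed) != len(pron):
        return False
    return all(slot is None or phone in slot
               for slot, phone in zip(observed, pron))

# "s iy n y {ah,ax,axr,er}" with the last position uncertain:
observed = [{"s"}, {"iy"}, {"n"}, {"y"}, {"ah", "ax", "axr", "er"}]
print([w for w, p in LEXICON.items() if matches(observed, p)])
```

Even with the final vowel ambiguous, only one word in the lexicon survives, which is the point: the context carried by the other positions fills in the blank.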
Outline Problems with phonetic representations of variation - Potential advantages of phonological features
Re-examining the role of phonetic transcription Phonological feature approaches to ASR - Feature attribute detection
- Feature combination methods
- Learning to (dis-)trust features
A challenge for the future
“The Case Against The Phoneme” Homage to Ostendorf (ASRU 99)
Four major indications that phonetic modeling of variation is not appropriate:
- Lack of progress on spontaneous speech WER
  - McAllaster et al (98): 50% improvement possible
  - Finke & Waibel (97): 6% WER reduction
- Independence of decisions in phone-based models
  - When pronunciation variation is modeled on a phone-by-phone level, unusual baseforms are often created
  - Word-based learning fails to generalize across words
- Lack of granularity
  - Triphone contexts mean a symbolic change in one phone can affect 9 HMM states (min 90 msec)
  - Much variation is already handled by triphone context
- Difficulty in transcription
  - Phonetic transcription is expensive and time-consuming
  - Many decisions are difficult for transcribers to make
Using phonological features Finer granularity - Some phonological changes don’t result in canonical phones for a language
- English: uw can sometimes be fronted (toot)
- Common enough: TIMIT introduced a special phone (ux)
- Symbol change loses all commonality between phones (uw->ux)
- Handling odd phonological effects
- Phone deletions: many “deletions” really leave small traces of coarticulation on neighboring segments
- E.g. vowel nasalization with nasal deletion
Features may provide a basis for cross-lingual recognition - International Phonetic Alphabet
Issues with phonological features Interlingua: “high vowels in English are not the same as high vowels in Japanese” - Richard Wright, lunch Wednesday, ICASSP 2006
Concept of “independent directions” false - Correlation of feature values
- Distances among feature dimensions are no longer Euclidean
Dealing with feature spreading Even more difficulty in transcription - (but: Karen Livescu’s group, JHU workshop 2006)
Articulatory vs. acoustic features - No two definitions are exactly the same (see Richard’s talk)
Phonetic transcription There have been a number of efforts to transcribe speech phonetically - American English
- TIMIT (4 hr read speech)
- Switchboard (4 hr spontaneous speech)
- Buckeye Corpus (40 hr spontaneous speech) http://buckeyecorpus.osu.edu
ASR researchers have found it difficult to utilize phonetic transcriptions directly
ASR & Phonetic Transcription Saraclar & Khudanpur (04) examined the means of acoustic models where canonical phone /x/ was transcribed as [y] over all pairs x:y - Compared means of x:y to x:x, y:y
- Data showed that x:y means often fell between x:x and y:y, sometimes closer to x:x
Another view: data from Buckeye Corpus
Can you trust transcription? Perceptual marking ≠ acoustic measurement - Can’t take transcription at face value
What are the transcribers trying to tell us? - This phone doesn’t sound like a canonical phone
- Perhaps we can look at commonalities across canonical/transcribed phone
- ae:eh -> front vowel (& not high?)
Phonological features may help us represent transcription differences.
Variation in single-phone changes Compared canonical vs. transcribed consonants with single-phone substitutions in Switchboard, Buckeye - Differences in manner, place, voicing counted
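A minimal sketch of this kind of count, assuming a hand-written feature table; the feature values and phone set here are illustrative, not the actual Switchboard/Buckeye inventory:

```python
# Sketch: counting which feature dimensions differ for a
# canonical -> transcribed consonant substitution.
# Feature values are illustrative textbook assignments.
FEATURES = {          # phone: (manner, place, voicing)
    "t": ("stop",      "alveolar", "voiceless"),
    "d": ("stop",      "alveolar", "voiced"),
    "s": ("fricative", "alveolar", "voiceless"),
    "k": ("stop",      "velar",    "voiceless"),
}

def feature_diffs(canonical, transcribed):
    dims = ("manner", "place", "voicing")
    return [d for d, a, b in zip(dims, FEATURES[canonical],
                                 FEATURES[transcribed]) if a != b]

print(feature_diffs("t", "d"))  # ['voicing']
print(feature_diffs("t", "s"))  # ['manner']
print(feature_diffs("t", "k"))  # ['place']
```

The appeal of this view is that most single-phone substitutions turn out to differ in only one or two dimensions, so the transcription disagreement carries structure rather than noise.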
Recent approaches to feature modeling in ASR Since the 1990s there has been increased interest in phonological feature modeling - Deng et al (92 ff), Kirchhoff (96 ff)
Current directions of research - Approaches for detecting phonological features from data
- Methods of combining phonological features
- Knowing when to ignore information
Frame-level decisions - Most common: artificial neural network methods
- Input: various flavors of spectral/cepstral representations
- Output: estimating posterior P(feature|acoustics) on a per-frame level
- Recent competitor: support vector machines
- Typically used for binary decision problems
Segmental-level decisions: integrate over time - HMM detectors
- Hybrid ANN/Dynamic Bayesian Network
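The frame-level ANN setup can be sketched as follows; the layer sizes, random weights, and manner inventory are placeholders for a trained network, not a real detector:

```python
import numpy as np

# Sketch: a one-hidden-layer network estimating per-frame posteriors
# P(manner | acoustics). Weights are random stand-ins for trained ones;
# 39 dimensions mimics MFCCs + deltas + double-deltas.
rng = np.random.default_rng(0)
MANNER = ["stop", "fricative", "nasal", "vowel", "silence"]
W1, b1 = rng.normal(size=(39, 64)), np.zeros(64)
W2, b2 = rng.normal(size=(64, len(MANNER))), np.zeros(len(MANNER))

def frame_posteriors(x):
    """x: (n_frames, 39) cepstral features -> (n_frames, n_classes)."""
    h = np.tanh(x @ W1 + b1)
    z = h @ W2 + b2
    e = np.exp(z - z.max(axis=1, keepdims=True))   # stable softmax
    return e / e.sum(axis=1, keepdims=True)

frames = rng.normal(size=(10, 39))
post = frame_posteriors(frames)
print(post.shape)           # one posterior distribution per frame
```

The softmax output gives a proper distribution over feature values at every frame, which is what the downstream combination methods consume.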
Binary vs. n-ary features Contrastive features can be described as either binary or n-ary - Binary: /t/ : +stop -fricative …
- N-ary: /t/ : manner=stop
No real conclusion on which is better - Binary more matched to SVM learning
- N-ary allows for discrimination among classes
- Should a segment be allowed to be +stop +fricative?
- Anecdotally (our lab) we find n-ary features slightly better
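The two encodings might be contrasted as follows for /t/; the attribute inventory is an illustrative subset, not a full feature system:

```python
# Sketch: the same phone /t/ under the two encodings.
# Binary: an independent +/- decision per attribute, which permits
# odd combinations such as +stop +fricative.
binary_t = {"stop": +1, "fricative": -1, "nasal": -1,
            "voiced": -1, "alveolar": +1}

# N-ary: exactly one value per feature dimension, so classes within
# a dimension compete and +stop +fricative cannot occur.
nary_t = {"manner": "stop", "place": "alveolar", "voicing": "voiceless"}
```

The binary form maps naturally onto a bank of two-class SVMs; the n-ary form lets a single classifier discriminate among all values of one dimension.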
Hierarchical representations Phonological features are not truly independent - Chang et al (01): Place prediction improves if manner is known
- ANN predicts P(place=x|manner=y,X) vs P(place=x|X)
- Suggests need for hierarchical detectors
- Rajamanohar & Fosler-Lussier (05): Cascading errors make chained decisions worse
- Better to jointly model P(place=x,manner=y|X), or even derive P(place=x|X) from phone probabilities
- Frankel et al (04): Hierarchy can be integrated as additional dependencies in DBN
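Deriving feature posteriors from phone posteriors, as suggested above, is a simple marginalization; the phone-to-place map and probabilities below are illustrative:

```python
# Sketch: derive P(place=x|X) from phone probabilities by summing the
# posterior mass of all phones sharing that place. The phone set and
# posteriors here are illustrative.
PLACE = {"p": "labial", "b": "labial", "t": "alveolar",
         "d": "alveolar", "k": "velar", "g": "velar"}

def place_posteriors(phone_post):
    """phone_post: dict phone -> P(phone|X) for one frame."""
    out = {}
    for phone, p in phone_post.items():
        out[PLACE[phone]] = out.get(PLACE[phone], 0.0) + p
    return out

pp = place_posteriors({"p": 0.1, "b": 0.2, "t": 0.4,
                       "d": 0.1, "k": 0.1, "g": 0.1})
print(pp)
```

Because the place classes partition the phone set, the derived values still sum to one, and no chained place-given-manner decision is needed, avoiding the cascading-error problem.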
Combining features into higher-level structures Once you have (frame-level) estimates of phonological features, need to combine - Temporal integration: Markov structures
- Phonetic spatial integration: combining into higher-level units (phones, syllables, words)
Differences in methodologies: - spatial first, then temporal
- joint/factored spatio-temporal integration
- phone-level temporal integration with spatial rescoring
Combining features into higher-level structures Tandem ANN/HMM Systems - ANN feature posterior estimates are used as replacements for MFCCs in a mixture-of-Gaussians HMM system
- We find decorrelation of features (via PCA) necessary to keep models well conditioned
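A sketch of that tandem preprocessing, assuming log-posterior inputs and a PCA rotation (a common recipe, but the details here are illustrative rather than a specific system's):

```python
import numpy as np

# Sketch: decorrelating ANN posteriors before handing them to a
# Gaussian-mixture HMM (the "tandem" setup). Fake posteriors stand in
# for real ANN outputs.
rng = np.random.default_rng(1)
post = rng.dirichlet(np.ones(20), size=1000)   # 1000 frames, 20 classes

logp = np.log(post + 1e-10)                    # compress dynamic range
logp -= logp.mean(axis=0)                      # center per dimension
cov = np.cov(logp, rowvar=False)
eigval, eigvec = np.linalg.eigh(cov)           # PCA via eigendecomposition
tandem = logp @ eigvec                         # decorrelated features
print(tandem.shape)
```

After the rotation the feature covariance is (numerically) diagonal, which keeps diagonal-covariance Gaussian mixtures well conditioned.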
Lattice rescoring with Landmarks - Maximum entropy models for local word discrimination
- SVMs used as local features for MaxEnt model.
Dynamic Bayesian Models - Model asynchrony as a hidden variable
- SVM outputs used as observations of features
Combining features into higher-level structures Conditional random fields - CRFs jointly model spatio-temporal integration
- Probability expressed in terms of indicator functions s (state), t (transition)
- Usually binary in NLP applications
- Frame-level ANN posteriors are bounded
- Probabilities can serve as observation feature functions
- sstop(/t/,x,i)=P(manner=stop|xi)
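One way to realize such an observation feature function in code, in the spirit of s_stop(/t/, x, i) = P(manner=stop | x_i); the detector posteriors and label names are illustrative:

```python
# Sketch: an ANN posterior serving as a CRF state feature function.
# The function fires only when the hypothesized label matches, and
# returns the (bounded) detector posterior rather than a 0/1 indicator.
def make_state_feature(label, manner):
    def s(y_i, posteriors, i):
        return posteriors[i]["manner"][manner] if y_i == label else 0.0
    return s

s_stop_t = make_state_feature("t", "stop")
frame_posts = [{"manner": {"stop": 0.8, "fricative": 0.1, "vowel": 0.1}}]
print(s_stop_t("t", frame_posts, 0))   # 0.8
print(s_stop_t("s", frame_posts, 0))   # 0.0
```

Because the posteriors lie in [0, 1], they slot into the CRF exactly where NLP applications put binary indicator functions, and the learned weights then decide how much each feature/phone pairing matters.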
Conditional Random Fields CRFs make no independence assumptions about input Entire label sequence is modeled jointly - Monophone feature CRF phone recognition similar to triphone HMM
Learning parameters (λ, μ) determines importance of feature/phone relationships - Implicit model of partial phonological underspecification
Slow to train
Underspecification All of these models learn what phonological information is important in higher-level processing - Ignoring “canonical” feature definitions for phone is a form of underspecification
- Traditional underspecification: some features are undefined for a particular phone
- Weighted models: partial underspecification
When can you ignore phonetic information? - Crucially, when it doesn’t help you disambiguate between word hypotheses
Underspecification Example: unstressed syllables tend to show more phonetic variation than stressed syllables - Experiment: reduce phonetic representation for unstressed syllables to manner class
- Allowing recognizer to choose best representation (phone/manner) during training (WSJ0):
- Minor degradation for clean speech (9.9 vs. 9.1 WER)
- Larger improvement in 10dB car noise (15.8 vs 13.0 WER)
Moral: we don’t need to have exact phonetic representation to decode words
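The lexicon reduction in the experiment above might be sketched like this; the manner map, syllabification, and example word are illustrative stand-ins, not the WSJ0 setup:

```python
# Sketch: reduce the phonetic representation of unstressed syllables
# to manner classes, keeping full phones for stressed syllables.
MANNER = {"s": "fricative", "ax": "vowel", "b": "stop",
          "aw": "vowel", "t": "stop"}

def underspecify(syllables):
    """syllables: list of (phones, stressed) pairs for one word."""
    out = []
    for phones, stressed in syllables:
        out.extend(phones if stressed else [MANNER[p] for p in phones])
    return out

# "about": unstressed [ax] + stressed [b aw t]
print(underspecify([(["ax"], False), (["b", "aw", "t"], True)]))
```

The reduced entry claims only what the unstressed syllable reliably carries (its manner skeleton), which is why the coarser representation holds up better in noise.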
Vision for the Future Acoustic-phonetic variation is difficult - Still significant cause of errors in ASR
Underspecified models give a new way of looking at the problem - Rather than the “change x to y” model
Challenge for the field: - Current techniques for accent modeling, intrinsic pronunciation variation separate
- Can we build a model that handles both?
Conclusions We have come quite a distance since 1999 - New methods for phonological feature detection
- New methods for feature integration
- New ways of thinking about variation: underspecification
Still have a long way to go - Integrating more knowledge sources
- Stress, prosody, word confusability
- Solving the pronunciation adaptation problem in a general way
Fin
An example feature grid
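The grid from the original slide is not reproduced here; the structure it illustrates pairs each phone with one value per feature dimension, along these (illustrative, textbook-style) lines:

```python
# Illustrative feature grid: phone -> (manner, place, voicing,
# height, frontness); None marks dimensions undefined for the phone
# (traditional underspecification). Values are a sketch, not the
# grid from the slide.
GRID = {
    "p":  ("stop",      "labial",   "voiceless", None,   None),
    "m":  ("nasal",     "labial",   "voiced",    None,   None),
    "s":  ("fricative", "alveolar", "voiceless", None,   None),
    "iy": ("vowel",     None,       "voiced",    "high", "front"),
    "uw": ("vowel",     None,       "voiced",    "high", "back"),
}
```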