Underspecified Feature Models for Pronunciation Variation in ASR (Eric Fosler-Lussier)




Underspecified feature models for pronunciation variation in ASR

  • Eric Fosler-Lussier

  • The Ohio State University

  • Speech & Language Technologies Lab

  • ITRW - Speech Recognition & Intrinsic Variation

  • 20 May 2006


Fill in the blanks

  • 3, 6, __, 12, 15, __, 21, 24

  • A B C __ E F __ H

  • You’re going to Toulouse? Drink a bottle of _____ for me!

  • What’s the red object?



Filling in the blanks: missing data

  • Missing data approaches have been used to integrate over noisy acoustics



Decode this!

  • (brackets indicate options)

  • s iy n y {ah,ax,axr,er}

  • {l,r} {eh,ih,iy} s er ch

  • {ah,ax} s ow {s,sh,z,zh} {eh,ih,iy} {eh,ey} {t,d}



Decode this!

  • (brackets indicate options)

  • s iy n y {ah,ax,axr,er} -> senior

  • {l,r} {eh,ih,iy} s er ch -> research

  • {ah,ax} s ow {s,sh,z,zh} {eh,ih,iy} {eh,ey} {t,d} -> associate

  • dictionary pronunciation

  • as marked by transcribers (Buckeye Corpus of Speech)
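
The decoding exercise above can be sketched as set-based matching: each slot holds the phones that could not be ruled out, and a word matches if its pronunciation picks one phone from every slot. The toy lexicon below, including the distractor words and all pronunciations, is an assumption made for illustration and does not come from the slides.

```python
# Each slot is the set of phones still considered possible at that position.
OBSERVED = [
    [{"s"}, {"iy"}, {"n"}, {"y"}, {"ah", "ax", "axr", "er"}],
    [{"l", "r"}, {"eh", "ih", "iy"}, {"s"}, {"er"}, {"ch"}],
    [{"ah", "ax"}, {"s"}, {"ow"}, {"s", "sh", "z", "zh"},
     {"eh", "ih", "iy"}, {"eh", "ey"}, {"t", "d"}],
]

# Hypothetical toy lexicon; pronunciations are illustrative assumptions.
LEXICON = {
    "senior": ["s", "iy", "n", "y", "er"],
    "research": ["r", "iy", "s", "er", "ch"],
    "associate": ["ax", "s", "ow", "sh", "iy", "ey", "t"],
    "seizure": ["s", "iy", "zh", "er"],
    "recess": ["r", "iy", "s", "eh", "s"],
}


def matches(slots, pron):
    """True if the pronunciation has one phone from every slot, in order."""
    return len(slots) == len(pron) and all(p in s for s, p in zip(slots, pron))


def decode(slots, lexicon):
    """Return every word whose pronunciation is compatible with the slots."""
    return sorted(w for w, pron in lexicon.items() if matches(slots, pron))


for slots in OBSERVED:
    print(decode(slots, LEXICON))
```

Even with several slots left open, each sequence is compatible with only one word in this small lexicon, which is the point of the exercise.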



What do these tasks have in common?

  • Recovering from erroneous information?

    • Context plays a big role in helping “clean up”
  • Recovering from incomplete information!

    • We should be treating pronunciation variation as a missing data problem
      • Integrate over “missing” phonological features
    • How much information do you need to decode words?
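
One way to read "integrate over missing features" is sketched below. It assumes, for simplicity, that features are independent given the acoustics (an assumption the later slides question), so a missing feature sums to 1 and drops out of the product. The feature table and the posterior values are hypothetical.

```python
from math import prod

# Hypothetical canonical feature values for three phones (illustrative only).
PHONE_FEATURES = {
    "t": {"manner": "stop", "place": "alveolar", "voicing": "voiceless"},
    "d": {"manner": "stop", "place": "alveolar", "voicing": "voiced"},
    "s": {"manner": "fricative", "place": "alveolar", "voicing": "voiceless"},
}


def phone_score(phone, posteriors):
    """Product of the posteriors of the phone's feature values.

    `posteriors` maps a feature name to a distribution over its values.
    A feature absent from `posteriors` is treated as missing: summing its
    distribution over all values gives 1, so it contributes no factor.
    """
    spec = PHONE_FEATURES[phone]
    return prod(
        posteriors[feat].get(value, 0.0)
        for feat, value in spec.items()
        if feat in posteriors
    )


# Voicing evidence is missing, so /t/ and /d/ cannot be told apart.
evidence = {
    "manner": {"stop": 0.9, "fricative": 0.1},
    "place": {"alveolar": 0.8, "labial": 0.2},
}
for phone in PHONE_FEATURES:
    print(phone, phone_score(phone, evidence))
```

Here /t/ and /d/ receive the same score, and whether that matters depends on whether the competing word hypotheses differ in voicing at that position.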


Outline

  • Problems with phonetic representations of variation

    • Potential advantages of phonological features
  • Re-examining the role of phonetic transcription

  • Phonological feature approaches to ASR

    • Feature attribute detection
    • Feature combination methods
    • Learning to (dis-)trust features
  • A challenge for the future



“The Case Against The Phoneme” Homage to Ostendorf (ASRU 99)

  • Four major indications that phonetic modeling of variation is not appropriate:

    • Lack of progress on spontaneous speech WER
      • McAllaster et al (98): 50% improvement possible
      • Finke & Waibel (97): 6% WER reduction
    • Independence of decisions in phone-based models
      • When pronunciation variation is modeled on phone-by-phone level, unusual baseforms are often created
      • Word-based learning fails to generalize across words
    • Lack of granularity
      • Triphone contexts mean a symbolic change in phone can affect 9 HMM states (min 90 msec)
      • Much variation is already handled by triphone context
    • Difficulty in transcription
      • Phonetic transcription is expensive and time consuming
      • Many decisions difficult to make for transcribers
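
The "9 HMM states (min 90 msec)" figure in the granularity point can be reproduced with a short calculation. The sketch assumes a 3-state topology per triphone, a 10 ms frame shift, and at least one frame per state; these values are common defaults and are not stated on the slide.

```python
STATES_PER_TRIPHONE = 3  # assumed: 3-state left-to-right HMM per triphone
FRAME_SHIFT_MS = 10      # assumed: 10 ms frame shift
AFFECTED_TRIPHONES = 3   # the changed phone plus its left and right neighbours


def affected_states():
    """States whose model identity changes when one phone symbol changes."""
    return AFFECTED_TRIPHONES * STATES_PER_TRIPHONE


def min_duration_ms():
    """Shortest span covered if every affected state emits one frame."""
    return affected_states() * FRAME_SHIFT_MS


print(affected_states(), min_duration_ms())
```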


Using phonological features

  • Finer granularity

    • Some phonological changes don’t result in canonical phones for a language
      • English: uw can sometimes be fronted (toot)
      • Common enough: TIMIT introduced a special phone (ux)
      • Symbol change loses all commonality between phones (uw->ux)
    • Handling odd phonological effects
      • Phone deletions: many “deletions” really leave small traces of coarticulation on neighboring segments
      • E.g. vowel nasalization with nasal deletion
  • Features may provide basis for cross-lingual recognition

      • International Phonetic Alphabet


Issues with phonological features

  • Interlingua: “high vowels in English are not the same as high vowels in Japanese”

    • Richard Wright, lunch Wednesday, ICASSP 2006
  • Concept of “independent directions” false

    • Correlation of feature values
    • Distances are no longer Euclidean among feature dimensions
  • Dealing with feature spreading

  • Even more difficulty in transcription

    • (but: Karen Livescu’s group, JHU workshop 2006)
  • Articulatory vs. acoustic features

    • No two definitions are exactly the same (see Richard’s talk)


Phonetic transcription

  • There have been a number of efforts to transcribe speech phonetically

    • American English
      • TIMIT (4 hr read speech)
      • Switchboard (4 hr spontaneous speech)
      • Buckeye Corpus (40 hr spontaneous speech) http://buckeyecorpus.osu.edu
  • ASR researchers have found it difficult to utilize phonetic transcriptions directly



ASR & Phonetic Transcription

  • Saraclar & Khudanpur (04) examined the means of acoustic models where canonical phone /x/ was transcribed as [y] over all pairs x:y

    • Compared means of x:y to x:x, y:y
    • Data showed that x:y means often fell between x:x and y:y, sometimes closer to x:x
  • Another view: data from Buckeye Corpus

    • /ae/ is sometimes transcribed as [eh]
    • Examined 80 vowels from one speaker
      • Formant frequencies from center of vowel






Can you trust transcription?

  • Perceptual marking ≠ acoustic measurement

    • Can’t take transcription at face value
  • What are the transcribers trying to tell us?

    • This phone doesn’t sound like a canonical phone
    • Perhaps we can look at commonalities across canonical/transcribed phone
      • ae:eh -> front vowel (& not high?)
  • Phonological features may help us represent transcription differences.
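
The ae:eh example can be sketched as a comparison of feature specifications: values shared by the canonical and transcribed phone stay specified, and values that differ are left open over just the observed alternatives. The vowel feature table below is a hypothetical simplification.

```python
# Hypothetical, simplified vowel feature table (illustrative only).
VOWEL_FEATURES = {
    "ae": {"frontness": "front", "height": "low"},
    "eh": {"frontness": "front", "height": "mid"},
    "iy": {"frontness": "front", "height": "high"},
    "uw": {"frontness": "back", "height": "high"},
}


def compare(canonical, transcribed):
    """Split features into those the two phones share and those they do not."""
    a, b = VOWEL_FEATURES[canonical], VOWEL_FEATURES[transcribed]
    shared = {f: v for f, v in a.items() if b.get(f) == v}
    open_values = {f: {a[f], b[f]} for f in a if f in b and a[f] != b[f]}
    return shared, open_values


shared, open_values = compare("ae", "eh")
print(shared)       # features both phones agree on
print(open_values)  # features left underspecified, with candidate values
```

For ae:eh this gives a front vowel whose height is either low or mid, that is, front and not high.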



Variation in single-phone changes

  • Compared canonical vs. transcribed consonants with single-phone substitutions in Switchboard, Buckeye

    • Differences in manner, place, voicing counted


Recent approaches to feature modeling in ASR

  • Since the 90’s there has been increased interest in phonological feature modeling

    • Deng et al (92 ff), Kirchhoff (96 ff)
  • Current directions of research

    • Approaches for detecting phonological features from data
    • Methods of combining phonological features
    • Knowing when to ignore information


Feature detection methods

  • Frame-level decisions

    • Most common: artificial neural network methods
      • Input: various flavors of spectral/cepstral representations
      • Output: estimating posterior P(feature|acoustics) on a per-frame level
    • Recent competitor: support vector machines
      • Typically used for binary decision problems
  • Segmental-level decisions: integrate over time

    • HMM detectors
    • Hybrid ANN/Dynamic Bayesian Network


Binary vs. n-ary features

  • Features can be described as either binary or n-ary, depending on how they contrast

    • Binary: /t/ : +stop -fricative …
    • N-ary: /t/ : manner=stop
  • No real conclusion on which is better

    • Binary more matched to SVM learning
    • N-ary allows for discrimination among classes
      • Should a segment be allowed to be +stop +fricative?
    • Anecdotally (our lab) we find n-ary features slightly better
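
The two encodings can be compared with a small sketch. The manner inventory here is a hypothetical subset. The binary form can represent a segment that is both +stop and +fricative, while the n-ary form forces a single class.

```python
# Hypothetical subset of manner classes, for illustration only.
MANNERS = ("stop", "fricative", "nasal", "vowel")


def to_binary(manner):
    """N-ary value -> one +/- decision per manner class."""
    return {m: m == manner for m in MANNERS}


def to_nary(binary):
    """Binary vector -> single manner value, or None if it is not one-hot."""
    active = [m for m in MANNERS if binary[m]]
    return active[0] if len(active) == 1 else None


t_binary = to_binary("stop")
print(t_binary)

# +stop +fricative is representable in binary form but has no n-ary value.
both = dict(t_binary, fricative=True)
print(to_nary(t_binary), to_nary(both))
```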


Hierarchical representations

  • Phonological features are not truly independent

    • Chang et al (01): Place prediction improves if manner is known
      • ANN predicts P(place=x|manner=y,X) vs P(place=x|X)
      • Suggests need for hierarchical detectors
    • Rajamanohar & Fosler-Lussier (05): Cascading errors make chained decisions worse
      • Better to jointly model P(place=x,manner=y|X), or even derive P(place=x|X) from phone probabilities
    • Frankel et al (04): Hierarchy can be integrated as additional dependencies in DBN
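
Deriving feature posteriors from phone posteriors amounts to summing phone probabilities over the phones that share a feature value. The sketch below shows this marginalization for a joint (place, manner) posterior and for place alone; the phone inventory and posterior values are hypothetical.

```python
from collections import defaultdict

# Hypothetical (place, manner) labels for a few phones (illustrative only).
PHONE_FEATURES = {
    "p": ("labial", "stop"),
    "t": ("alveolar", "stop"),
    "s": ("alveolar", "fricative"),
    "m": ("labial", "nasal"),
}


def joint_posterior(phone_post):
    """P(place, manner | X): sum phone posteriors per (place, manner) pair."""
    joint = defaultdict(float)
    for phone, p in phone_post.items():
        joint[PHONE_FEATURES[phone]] += p
    return dict(joint)


def place_posterior(phone_post):
    """P(place | X): marginalize manner out of the joint posterior."""
    place = defaultdict(float)
    for (pl, _manner), p in joint_posterior(phone_post).items():
        place[pl] += p
    return dict(place)


frame = {"p": 0.1, "t": 0.5, "s": 0.3, "m": 0.1}  # made-up phone posteriors
print(joint_posterior(frame))
print(place_posterior(frame))
```

Because both distributions come from the same phone posterior, they stay mutually consistent, which a chain of separate detectors does not guarantee.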


Combining features into higher-level structures

  • Once you have (frame-level) estimates of phonological features, you need to combine them

    • Temporal integration: Markov structures
    • Phonetic spatial integration: combining into higher-level units (phones, syllables, words)
  • Differences in methodologies:

    • spatial first, then temporal
    • joint/factored spatio-temporal integration
    • phone-level temporal integration with spatial rescoring


Combining features into higher-level structures

  • Tandem ANN/HMM Systems

    • ANN feature posterior estimates replace MFCCs as the input to a mixture-of-Gaussians HMM system
    • We find decorrelation of features (via PCA) necessary to keep models well conditioned
  • Lattice rescoring with Landmarks

    • Maximum entropy models for local word discrimination
    • SVMs used as local features for MaxEnt model.
  • Dynamic Bayesian Models

    • Model asynchrony as a hidden variable
    • SVM outputs used as observations of features


Combining features into higher-level structures

  • Conditional random fields

    • CRFs jointly model spatio-temporal integration
    • Probability expressed in terms of indicator functions s (state), t (transition)
      • Usually binary in NLP applications
    • Frame-level ANN posteriors are bounded
      • Probabilities can serve as observation feature functions
        • s_stop(/t/, x, i) = P(manner=stop | x_i)
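
A minimal sketch of a linear-chain CRF whose state feature functions are frame-level posteriors, as in the bullet above. The label set, weights, and posterior values are made up for illustration, and the names `state_weights` and `transition_weights` are chosen here. This is not the system described in the talk.

```python
import math
from itertools import product

LABELS = ["t", "s"]


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def state_score(label, frame, state_weights):
    """Weighted sum of posterior-valued state features for one frame."""
    return sum(state_weights.get((label, attr), 0.0) * p
               for attr, p in frame.items())


def sequence_score(labels, frames, state_weights, transition_weights):
    """Unnormalized log score of a whole label sequence."""
    score = sum(state_score(y, f, state_weights) for y, f in zip(labels, frames))
    score += sum(transition_weights.get(pair, 0.0)
                 for pair in zip(labels, labels[1:]))
    return score


def log_partition(frames, state_weights, transition_weights):
    """Forward algorithm: log of the summed scores of all label sequences."""
    alpha = {y: state_score(y, frames[0], state_weights) for y in LABELS}
    for frame in frames[1:]:
        alpha = {
            y: state_score(y, frame, state_weights) + logsumexp(
                [alpha[prev] + transition_weights.get((prev, y), 0.0)
                 for prev in LABELS])
            for y in LABELS
        }
    return logsumexp(list(alpha.values()))


def sequence_probability(labels, frames, state_weights, transition_weights):
    return math.exp(
        sequence_score(labels, frames, state_weights, transition_weights)
        - log_partition(frames, state_weights, transition_weights))


# Made-up weights and per-frame manner posteriors.
STATE_WEIGHTS = {("t", "stop"): 2.0, ("s", "fricative"): 2.0}
TRANSITION_WEIGHTS = {("t", "t"): 0.5, ("s", "s"): 0.5}
FRAMES = [
    {"stop": 0.9, "fricative": 0.1},
    {"stop": 0.8, "fricative": 0.2},
    {"stop": 0.1, "fricative": 0.9},
]

candidates = list(product(LABELS, repeat=len(FRAMES)))
probs = {y: sequence_probability(y, FRAMES, STATE_WEIGHTS, TRANSITION_WEIGHTS)
         for y in candidates}
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

A state weight that is absent or zero means the corresponding attribute is ignored for that label, which is how a trained model of this kind can end up partially underspecified.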


Conditional Random Fields

  • CRFs make no independence assumptions about input

  • Entire label sequence is modeled jointly

    • Monophone feature CRF phone recognition performs similarly to a triphone HMM
  • Learning parameters (,) determines importance of feature/phone relationships

    • Implicit model of partial phonological underspecification
  • Slow to train



Underspecification

  • All of these models learn what phonological information is important in higher-level processing

    • Ignoring “canonical” feature definitions for phone is a form of underspecification
    • Traditional underspecification: some features are undefined for a particular phone
    • Weighted models: partial underspecification
  • When can you ignore phonetic information?

    • Crucially, when it doesn’t help you disambiguate between word hypotheses


Underspecification

  • Example: unstressed syllables tend to show more phonetic variation than stressed syllables

    • Experiment: reduce phonetic representation for unstressed syllables to manner class
    • Allowing recognizer to choose best representation (phone/manner) during training (WSJ0):
      • Minor degradation for clean speech (9.9 vs. 9.1 WER)
      • Larger improvement in 10dB car noise (15.8 vs 13.0 WER)
  • Moral: we don’t need to have exact phonetic representation to decode words
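
The representation used in this experiment can be sketched as follows: phones in stressed syllables stay fully specified, and phones in unstressed syllables are reduced to their manner class. The syllabification, stress marks, and manner map are assumptions made for illustration. The sketch shows only the reduced representation, not the training procedure in which the recognizer chooses between phone and manner units.

```python
# Hypothetical phone-to-manner map (illustrative only).
MANNER = {
    "ax": "vowel", "ow": "vowel", "iy": "vowel", "ey": "vowel", "er": "vowel",
    "s": "fricative", "sh": "fricative", "ch": "affricate",
    "t": "stop", "n": "nasal", "y": "glide", "r": "liquid",
}

# Hypothetical syllabified lexicon: (phones, is_stressed) per syllable.
SYLLABLES = {
    "senior": [(["s", "iy"], True), (["n", "y", "er"], False)],
    "research": [(["r", "iy"], False), (["s", "er", "ch"], True)],
    "associate": [(["ax"], False), (["s", "ow"], True),
                  (["sh", "iy"], False), (["ey", "t"], False)],
}


def underspecify(word):
    """Keep phones in stressed syllables; map unstressed phones to manner."""
    units = []
    for phones, stressed in SYLLABLES[word]:
        units.extend(phones if stressed else [MANNER[p] for p in phones])
    return units


def still_distinct(words):
    """True if no two words collapse to the same reduced representation."""
    reduced = [tuple(underspecify(w)) for w in words]
    return len(set(reduced)) == len(reduced)


for word in SYLLABLES:
    print(word, underspecify(word))
print(still_distinct(list(SYLLABLES)))
```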



Vision for the Future

  • Acoustic-phonetic variation is difficult

    • Still significant cause of errors in ASR
  • Underspecified models give a new way of looking at the problem

    • Rather than the “change x to y” model
  • Challenge for the field:

    • Current techniques for accent modeling and intrinsic pronunciation variation are separate
    • Can we build a model that handles both?


Conclusions

  • We have come quite a distance since 1999

    • New methods for phonological feature detection
    • New methods for feature integration
    • New ways of thinking about variation: underspecification
  • Still have a long way to go

    • Integrating more knowledge sources
      • Stress, prosody, word confusability
    • Solving the pronunciation adaptation problem in a general way


Fin



An example feature grid


