Intensive course, Dipartimento di Elettrotecnica ed Elettronica, Politecnico di Bari, 22–26 March 1999
Jacques Koreman, Institute of Phonetics, University of the Saarland, P.O. Box 15 11 50, D-66041 Saarbrücken, Germany. E-mail: jkoreman@coli.uni-sb.de
Organisation of the course Tuesday – Friday: - First half of each session: theory - Second half of each session: practice Interruptions invited!!!
Overview of the course 1. Variability in the signal 2. Phonetic features in ASR 3. Deriving phonetic features from the acoustic signal by a Kohonen network 4. ICSLP’98: “Exploiting transitions and focussing on linguistic properties for ASR” 5. ICSLP’98: “Do phonetic features help to improve consonant identification in ASR?”
The goal of ASR systems Input: spectral description of microphone signal, typically - energy in band-pass filters - LPC coefficients - cepstral coefficients Output: linguistic units, usually phones or phonemes (on the basis of which words can be recognised)
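As a concrete illustration of such a front end, here is a minimal Python sketch using librosa (a modern library chosen for brevity, not the tooling of the original systems; the file name is hypothetical). It computes 13 MFCCs per 10 ms frame plus their time derivatives, comparable to an HTK-style MFCC_E_D parameterisation:

    # Minimal spectral front end: MFCCs plus delta coefficients.
    # "utterance.wav" is a hypothetical input file.
    import librosa

    y, sr = librosa.load("utterance.wav", sr=16000)      # 16 kHz mono signal

    # 13 Mel-frequency cepstral coefficients, 25 ms window, 10 ms frame shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=400, hop_length=160)

    # First-order time derivatives ("delta" coefficients)
    delta = librosa.feature.delta(mfcc)

    print(mfcc.shape, delta.shape)                       # (13, n_frames) each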
Variability in the signal (1) Main problem in ASR: variability in the input signal Example: /k/ has very different realisations in different contexts. Its place of articulation varies from velar before back vowels to pre-velar before front vowels (own articulation of “keep”, “cool”)
Variability in the signal (2) Main problem in ASR: variability in the input signal Example: canonical /g/ is sometimes realised as a fricative or approximant, e.g. intervocalically (OE regen > E rain). In Danish this happens to all intervocalic voiced plosives; also, voiceless plosives become voiced.
Variability in the signal (3) Main problem in ASR: variability in the input signal Example: /h/ has very different realisations in different contexts. It can be considered as a voiceless realisation of the surrounding vowels. (spectrograms “ihi”, “aha”, “uhu”)
Variability in the signal (3a)
Variability in the signal (4) Main problem in ASR: variability in the input signal Example: deletion of segments due to articulatory overlap. Friction is superimposed on the vowel signal. (spectrogram G. “System”)
Variability in the signal (4a)
Variability in the signal (5) Main problem in ASR: variability in the input signal Example: the same vowel /a:/ is realised differently depending on its context. (spectrograms “aba”, “ada”, “aga”)
Variability in the signal (5a)
Modelling variability Hidden Markov models can represent the variable signal characteristics of phones
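To make this concrete, here is a minimal numpy sketch of a 3-state left-to-right HMM phone model scored with the forward algorithm. All probabilities are invented for illustration; a real system would use Gaussian mixture output densities over the spectral vectors rather than a fixed likelihood table:

    # A 3-state left-to-right HMM phone model scored with the forward
    # algorithm. All numbers are invented for illustration.
    import numpy as np

    A = np.array([[0.6, 0.4, 0.0],     # transition probabilities:
                  [0.0, 0.7, 0.3],     # each state loops or moves right
                  [0.0, 0.0, 1.0]])
    pi = np.array([1.0, 0.0, 0.0])     # always start in the first state

    # b[t, j] = likelihood of frame t under state j's output distribution
    b = np.array([[0.9, 0.1, 0.1],
                  [0.8, 0.3, 0.1],
                  [0.2, 0.9, 0.2],
                  [0.1, 0.3, 0.9],
                  [0.1, 0.2, 0.8]])

    alpha = pi * b[0]                  # forward probabilities at t = 0
    for t in range(1, len(b)):
        alpha = (alpha @ A) * b[t]     # propagate, then weight by frame fit

    print("P(frames | phone model) =", alpha.sum())

Because each state has its own output distribution and states may repeat, the same model accommodates phones of varying duration and spectral detail.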
Lexicon and language model (1) Linguistic knowledge about phone sequences (lexicon, language model) improves word recognition. Without linguistic knowledge, phone accuracy remains low (see the sketch below).
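A toy sketch (all numbers invented) of how such linguistic knowledge is combined with the acoustic evidence: a phone bigram can rerank hypotheses that the acoustics alone would confuse.

    # Toy bigram rescoring: combine an acoustic log likelihood with a phone
    # bigram log probability. Both tables are invented for illustration.
    import math

    acoustic = {("g", "i:"): -11.8, ("k", "i:"): -12.1}  # log P(signal | phones)
    bigram   = {("g", "i:"): 0.004, ("k", "i:"): 0.020}  # P(second phone | first)

    for hyp, ac in acoustic.items():
        print(hyp, round(ac + math.log(bigram[hyp]), 2))
    # ('g', 'i:') -17.32  vs  ('k', 'i:') -16.01: the language model
    # outvotes a slightly better acoustic score.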
Lexicon and language model (2) Using a lexicon and/or language model is not a top-down solution to all problems: sometimes pragmatic knowledge needed. Example:
Lexicon and language model (3) Using a lexicon and/or language model is not a top-down solution to all problems: sometimes pragmatic knowledge needed. Example:
CONCLUSIONS The acoustic parameters (e.g. MFCCs) are highly variable. We must try to improve phone accuracy by extracting linguistic information. BUT: not all problems can be solved this way.
Phonetic features in ASR Assumption: phone accuracy can be improved by deriving phonetic features from the spectral representation of the speech signal What are phonetic features?
A phonetic description of sounds The articulation of consonants
A phonetic description of sounds The articulation of vowels
Phonetic features: IPA IPA (International Phonetic Alphabet) chart - consonants and vowels - only phonemic distinctions (http://www.arts.gla.ac.uk/IPA/ipa.html)
The IPA chart (consonants)
The IPA chart (other consonants)
The IPA chart (non-pulm. cons.)
The IPA chart (vowels)
The IPA chart (diacritics)
IPA features (obstruents)
IPA features (sonorants)
IPA features (vowels)
Phonetic features Phonetic features - different systems (JFH: Jakobson, Fant & Halle; SPE: Chomsky & Halle; articulatory features) - distinguish “natural classes” of sounds which undergo the same phonological processes
SPE features (obstruents)

       cns syl nas son low hig cen bac rou ant cor cnt voi lat str ten
p0       1  -1  -1  -1  -1   0   0   0  -1   0   0  -1  -1  -1  -1   1
b0       1  -1  -1  -1  -1   0   0   0  -1   0   0  -1   1  -1  -1  -1
p        1  -1  -1  -1  -1  -1   0  -1  -1   1  -1  -1  -1  -1  -1   1
b        1  -1  -1  -1  -1  -1   0  -1  -1   1  -1  -1   1  -1  -1  -1
tden     1  -1  -1  -1  -1  -1   0  -1  -1   1   1  -1  -1  -1  -1   1
t        1  -1  -1  -1  -1  -1   0  -1  -1   1   1  -1  -1  -1  -1   1
d        1  -1  -1  -1  -1  -1   0  -1  -1   1   1  -1   1  -1  -1  -1
k        1  -1  -1  -1  -1   1   0   1  -1  -1  -1  -1  -1  -1  -1   1
g        1  -1  -1  -1  -1   1   0   1  -1  -1  -1  -1   1  -1  -1  -1
f        1  -1  -1  -1  -1  -1   0  -1  -1   1  -1   1  -1  -1   1   1
vfri     1  -1  -1  -1  -1  -1   0  -1  -1   1  -1   1   1  -1   1  -1
T        1  -1  -1  -1  -1  -1   0  -1  -1   1   1   1  -1  -1  -1   1
Dfri     1  -1  -1  -1  -1  -1   0  -1  -1   1   1   1   1  -1  -1  -1
s        1  -1  -1  -1  -1  -1   0  -1  -1   1   1   1  -1  -1   1   1
z        1  -1  -1  -1  -1  -1   0  -1  -1   1   1   1   1  -1   1  -1
S        1  -1  -1  -1  -1   1   0  -1  -1  -1   1   1  -1  -1   1   1
Z        1  -1  -1  -1  -1   1   0  -1  -1  -1   1   1   1  -1   1  -1
C        1  -1  -1  -1  -1   1   0  -1  -1  -1  -1   1  -1  -1   1   1
x        1  -1  -1  -1  -1   1   0   1  -1  -1  -1   1  -1  -1   1   1

(cns = consonantal, syl = syllabic, nas = nasal, son = sonorant, low, hig = high, cen = central, bac = back, rou = round, ant = anterior, cor = coronal, cnt = continuant, voi = voiced, lat = lateral, str = strident, ten = tense)
SPE features (sonorants)

       cns syl nas son low hig cen bac rou ant cor cnt voi lat str ten
m        1  -1   1   1  -1  -1   0  -1  -1   1  -1  -1   1  -1  -1   0
n        1  -1   1   1  -1  -1   0  -1  -1   1   1  -1   1  -1  -1   0
J        1  -1   1   1  -1   1   0  -1  -1  -1  -1  -1   1  -1  -1   0
N        1  -1   1   1  -1   1   0   1  -1  -1  -1  -1   1  -1  -1   0
l        1  -1  -1   1  -1  -1   0  -1  -1   1   1   1   1   1  -1   0
L        1  -1  -1   1  -1   1   0  -1  -1  -1  -1   1   1   1  -1   0
ralv     1  -1  -1   1  -1  -1   0  -1  -1   1   1   1   1  -1  -1   0
Ruvu     1  -1  -1   1  -1  -1   0   1  -1  -1  -1   1   1  -1  -1   0
rret     1  -1  -1   1  -1  -1   0  -1  -1  -1   1   1   1  -1  -1   0
j       -1  -1  -1   1  -1   1   0  -1  -1  -1  -1   1   1  -1  -1   0
vapr    -1  -1  -1   1  -1  -1   0  -1  -1   1  -1   1   1  -1  -1   0
w       -1  -1  -1   1  -1   1   0   1   1   1  -1   1   1  -1  -1   0
h       -1  -1  -1   1   1  -1   0  -1  -1  -1  -1   1  -1  -1  -1   0
XXX      0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
SPE features (vowels)

       cns syl nas son low hig cen bac rou ant cor cnt voi lat str ten
i       -1   1  -1   1  -1   1  -1  -1  -1  -1  -1   1   1  -1  -1   1
I       -1   1  -1   1  -1   1  -1  -1  -1  -1  -1   1   1  -1  -1  -1
e       -1   1  -1   1  -1  -1  -1  -1  -1  -1  -1   1   1  -1  -1   1
E       -1   1  -1   1  -1  -1  -1  -1  -1  -1  -1   1   1  -1  -1  -1
{       -1   1  -1   1   1  -1  -1  -1  -1  -1  -1   1   1  -1  -1  -1
a       -1   1  -1   1   1  -1  -1  -1  -1  -1  -1   1   1  -1  -1   1
y       -1   1  -1   1  -1   1  -1  -1   1  -1  -1   1   1  -1  -1   1
Y       -1   1  -1   1  -1   1  -1  -1   1  -1  -1   1   1  -1  -1  -1
2       -1   1  -1   1  -1  -1  -1  -1   1  -1  -1   1   1  -1  -1   1
9       -1   1  -1   1  -1  -1  -1  -1   1  -1  -1   1   1  -1  -1  -1
A       -1   1  -1   1   1  -1  -1   1  -1  -1  -1   1   1  -1  -1  -1
Q       -1   1  -1   1   1  -1  -1   1   1  -1  -1   1   1  -1  -1  -1
V       -1   1  -1   1  -1  -1  -1   1  -1  -1  -1   1   1  -1  -1  -1
O       -1   1  -1   1  -1  -1  -1   1   1  -1  -1   1   1  -1  -1  -1
o       -1   1  -1   1  -1  -1  -1   1   1  -1  -1   1   1  -1  -1   1
U       -1   1  -1   1  -1   1  -1   1   1  -1  -1   1   1  -1  -1  -1
u       -1   1  -1   1  -1   1  -1   1   1  -1  -1   1   1  -1  -1   1
Uschwa  -1   1  -1   1  -1  -1   1  -1   1  -1  -1   1   1  -1  -1  -1
3       -1   1  -1   1  -1  -1   1  -1  -1  -1  -1   1   1  -1  -1   1
@       -1   1  -1   1  -1  -1   1  -1  -1  -1  -1   1   1  -1  -1  -1
6       -1   1  -1   1   1  -1   1  -1  -1  -1  -1   1   1  -1  -1  -1

(Phone symbols in SAMPA notation.)
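To show how such a table supports natural-class queries, here is a small Python sketch that encodes a few rows from the tables above and selects all phones matching a partial feature specification:

    # A few rows of the SPE tables above, plus a natural-class query.
    FEATURES = ("cns", "syl", "nas", "son", "low", "hig", "cen", "bac",
                "rou", "ant", "cor", "cnt", "voi", "lat", "str", "ten")

    TABLE = {
        "p": (1, -1, -1, -1, -1, -1, 0, -1, -1, 1, -1, -1, -1, -1, -1, 1),
        "b": (1, -1, -1, -1, -1, -1, 0, -1, -1, 1, -1, -1, 1, -1, -1, -1),
        "s": (1, -1, -1, -1, -1, -1, 0, -1, -1, 1, 1, 1, -1, -1, 1, 1),
        "z": (1, -1, -1, -1, -1, -1, 0, -1, -1, 1, 1, 1, 1, -1, 1, -1),
        "m": (1, -1, 1, 1, -1, -1, 0, -1, -1, 1, -1, -1, 1, -1, -1, 0),
    }

    def natural_class(**spec):
        """All phones whose feature values match the given specification."""
        return [ph for ph, vals in TABLE.items()
                if all(vals[FEATURES.index(f)] == v for f, v in spec.items())]

    print(natural_class(voi=1))          # voiced phones: ['b', 'z', 'm']
    print(natural_class(cnt=1, voi=-1))  # voiceless continuants here: ['s']

A phonological process such as Danish intervocalic voicing can then be stated once for a whole class instead of phone by phone.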
CONCLUSION
Kohonen networks Kohonen networks are unsupervised neural networks Our Kohonen networks take vectors of acoustic parameters (MFCC_E_D) as input and output phonetic feature vectors Network size: 50 x 50 neurons
Training the Kohonen network 1. Self-organisation results in a phonotopic map 2. Phone calibration attaches array of phones to each winning neuron 3. Feature calibration replaces array of phones by array of phonetic feature vectors
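A minimal numpy sketch of step 1, the self-organisation phase. The 50 x 50 map matches the slide above; the learning-rate and neighbourhood schedules are invented for illustration. Steps 2 and 3 then attach phone labels, and subsequently feature vectors, to each neuron of the trained map:

    # Self-organisation of a 50 x 50 Kohonen map on acoustic frames.
    # Learning-rate and neighbourhood schedules are illustrative choices.
    import numpy as np

    H, W, D = 50, 50, 26                  # map size; D = input dimensionality

    def train(frames, weights, epochs=10, lr0=0.5, radius0=25.0):
        ys, xs = np.mgrid[0:H, 0:W]       # grid coordinates of every neuron
        for e in range(epochs):
            lr = lr0 * (1.0 - e / epochs)                # decaying rate
            radius = 1.0 + radius0 * (1.0 - e / epochs)  # shrinking radius
            for x in frames:
                # 1. find the winning neuron (best-matching unit)
                dist = np.linalg.norm(weights - x, axis=2)
                by, bx = np.unravel_index(dist.argmin(), dist.shape)
                # 2. Gaussian neighbourhood around the winner on the grid
                g = np.exp(-((ys - by) ** 2 + (xs - bx) ** 2)
                           / (2 * radius ** 2))
                # 3. pull weights towards the input, strongest at the winner
                weights += lr * g[:, :, None] * (x - weights)

    weights = np.random.default_rng(0).normal(size=(H, W, D))
    # train(mfcc_frames, weights)  # mfcc_frames: array of shape (n, D)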
Mapping with the Kohonen network The acoustic parameter vector belonging to one frame activates a neuron. The output is a weighted average of the phonetic feature vectors attached to the winning neuron and its K nearest neurons (see the sketch below).
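And a sketch of the mapping itself, reusing `weights` from the training sketch and assuming a hypothetical `calibration` array that stores one phonetic feature vector per neuron (the result of feature calibration):

    # Map one acoustic frame onto a phonetic feature vector.
    import numpy as np

    def map_frame(x, weights, calibration, K=4):
        H, W, D = weights.shape
        dist = np.linalg.norm(weights.reshape(-1, D) - x, axis=1)
        nearest = np.argsort(dist)[:K + 1]    # winner plus K nearest neurons
        w = 1.0 / (dist[nearest] + 1e-6)      # closer neurons weigh more
        w /= w.sum()
        feats = calibration.reshape(-1, calibration.shape[-1])[nearest]
        return w @ feats                      # weighted average feature vector

Different allophones landing on nearby neurons thus receive (near-)identical feature values, which is the many-to-one mapping described on the next slide.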
Advantages of Kohonen networks - Reduction of feature dimensions possible - Mapping onto linguistically meaningful dimensions (phonetically less severe confusions) - Many-to-one mapping allows different allophones (acoustic variability) to be mapped onto the same phonetic feature values - Automatic and fast mapping
Disadvantages of Kohonen networks They need to be trained on manually segmented and labelled material BUT: cross-language training has been shown to be successful
Hybrid ASR system
CONCLUSION Acoustic-phonetic mapping extracts linguistically relevant information from the variable input signal.
ICSLP’98: “Exploiting transitions and focussing on linguistic properties for ASR”
INTRODUCTION
INTRODUCTION
DATA
DATA
DATA
EXPERIMENT 1: SYSTEM
EXPERIMENT 1: RESULTS
EXPERIMENT 1: CONCLUSIONS
EXPERIMENT 2: SYSTEM
EXPERIMENT 2: RESULTS
EXPERIMENT 2: CONCLUSIONS
EXPERIMENT 3: SYSTEM
EXPERIMENT 3: RESULTS
EXPERIMENT 3: CONCLUSIONS
INTERPRETATION (1)
INTERPRETATION (2)
REFERENCES (1)
REFERENCES (2)
SUMMARY
ICSLP’98: “Do phonetic features help to improve consonant identification in ASR?”
INTRODUCTION
DATA
DATA
DATA (1)
DATA (2)
SYSTEM ARCHITECTURE
CONFUSIONS MAPPING
ACIS =
BASELINE SYSTEM
MAPPING SYSTEM
AFFRICATES (1)
AFFRICATES (2)
APMS =
APMS =
CONSONANT CONFUSIONS
CONCLUSIONS
CONCLUSIONS
REFERENCES (1)
REFERENCES (2)
SUMMARY
THE END