EE E6820: Speech & Audio Processing & Recognition
Lecture 11: ASR: Training & Systems
- Training HMMs
- Language modeling
- Discrimination & adaptation
Dan Ellis  http://www.ee.columbia.edu/~dpwe/e6820/
Columbia University Dept. of Electrical Engineering
Spring 2003
HMM review
• An HMM M_j is specified by:
  - states q^i
  - transition probabilities  $a_{ij} \equiv p(q_n^j \mid q_{n-1}^i)$
  - emission distributions  $b_i(x) \equiv p(x \mid q^i)$
  - (+ initial state probabilities  $\pi_i \equiv p(q_1^i)$)
  [Figure: "k a t" word model — state diagram with self-loops, its transition matrix (self-loops 0.9, advance 0.1), and per-state emission distributions p(x|q)]
• See e6820/papers/Rabiner89-hmm.pdf
HMM summary (1)
• HMMs are a generative model: recognition is inference of p(M_j | X)
• During generation, behavior of the model depends only on the current state q_n:
  - transition probabilities  p(q_{n+1} | q_n) = a_ij
  - observation distributions  p(x_n | q_n) = b_i(x)
• Given states Q = {q_1, q_2, ..., q_N} and observations X = X_1^N = {x_1, ..., x_N}, the Markov assumption makes
  $p(X, Q \mid M) = \prod_n p(x_n \mid q_n)\, p(q_n \mid q_{n-1})$
• Given observed emissions X, can calculate:
  $p(X \mid M_j) = \sum_{\mathrm{all}\ Q} p(X \mid Q, M)\, p(Q \mid M)$

HMM summary (2)
• Calculate p(X | M) via the forward recursion:
  $\alpha_n(j) = p(X_1^n, q_n^j) = \Big[\sum_{i=1}^{S} \alpha_{n-1}(i)\, a_{ij}\Big]\cdot b_j(x_n)$
• Viterbi (best path) approximation:
  $\alpha_n^*(j) = \max_i \{\alpha_{n-1}^*(i)\, a_{ij}\}\cdot b_j(x_n)$
  - then backtrace...  $Q^* = \underset{Q}{\operatorname{argmax}}\ p(X, Q \mid M)$,  with Q = {q_1, q_2, ..., q_N}
• Pictorially:
  [Figure: generative view — the model and its state sequence are assumed/hidden, the emissions X are observed, and Q*, M* are inferred]
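The two recursions above differ only in replacing the sum over predecessor states with a max. A minimal numpy sketch, assuming a toy model with a hand-set transition matrix `A`, initial probabilities `pi`, and pre-computed emission likelihoods `B[n, j] = b_j(x_n)` (all names and numbers here are illustrative, not from the slides):

```python
import numpy as np

def forward(pi, A, B):
    """alpha[n, j] = p(x_1..x_n, q_n = j); returns alpha and p(X | M)."""
    N, S = B.shape
    alpha = np.zeros((N, S))
    alpha[0] = pi * B[0]
    for n in range(1, N):
        alpha[n] = (alpha[n - 1] @ A) * B[n]      # sum over previous states i
    return alpha, alpha[-1].sum()

def viterbi(pi, A, B):
    """Best-path approximation: max instead of sum, then backtrace."""
    N, S = B.shape
    delta = np.zeros((N, S))
    psi = np.zeros((N, S), dtype=int)
    delta[0] = pi * B[0]
    for n in range(1, N):
        scores = delta[n - 1][:, None] * A        # scores[i, j] = delta_{n-1}(i) * a_ij
        psi[n] = scores.argmax(axis=0)            # best predecessor for each state j
        delta[n] = scores.max(axis=0) * B[n]
    q = np.zeros(N, dtype=int)
    q[-1] = delta[-1].argmax()
    for n in range(N - 2, -1, -1):                # backtrace
        q[n] = psi[n + 1, q[n + 1]]
    return q, delta[-1].max()

# toy example: 3 states, 5 frames of pre-computed emission likelihoods
pi = np.array([1.0, 0.0, 0.0])
A  = np.array([[0.9, 0.1, 0.0],
               [0.0, 0.9, 0.1],
               [0.0, 0.0, 1.0]])
B  = np.random.rand(5, 3)
alpha, pX = forward(pi, A, B)
path, best = viterbi(pi, A, B)
```

In practice these products underflow for long utterances, so real implementations work with log probabilities or rescale α at every frame.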
1  Training HMMs
   - Viterbi training
   - EM for HMM parameters
   - Forward-backward (Baum-Welch)
2  Language modeling
3  Discrimination & adaptation
Training HMMs
• Learn the HMM parameters from labelled training data
  - i.e. estimate a_ij, b_i(x) given data
  - better than DTW...
• Algorithms to improve p(M | X) are key to the success of HMMs
  - maximum likelihood of models...
• State alignments Q of training examples are generally unknown
  - else estimating parameters would be easy
  → Viterbi training: choose 'best' labels (heuristic)
  → EM training: 'fuzzy' labels (guaranteed local convergence)
[Figure: overall training loop — labelled training data ("one", "two", "three", "four", "five", ...) and word models (w ah n, t uw, th r iy, f ao, ...): fit models to data, re-estimate model parameters, repeat until convergence]
Viterbi training
• 'Fit models to data': align each training utterance to its model along the best (Viterbi) path, giving hard state labels Q*
  [Figure: "th r iy" data frames aligned to Viterbi state labels Q*]
• Re-estimate parameters by simple counts and averages over the labelled frames:
  $\mu_i = \frac{\sum_{n \in q^i} x_n}{\#(q_n = i)}$
  $a_{ij} = \frac{\#(q_{n-1} = i \rightarrow q_n = j)}{\#(q_n = i)}$
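A sketch of these two re-estimation formulas, assuming single-Gaussian states, a frame matrix `X` (N x D) and the Viterbi state labels `q` from the alignment step (names are illustrative):

```python
import numpy as np

def viterbi_reestimate(X, q, S):
    """Re-estimate per-state means and transition probabilities from
    a Viterbi state labelling q of the frames X (single-Gaussian sketch)."""
    q = np.asarray(q)
    N, D = X.shape
    mu = np.zeros((S, D))
    A = np.zeros((S, S))
    for i in range(S):
        frames = X[q == i]
        if len(frames):
            mu[i] = frames.mean(axis=0)           # mu_i = sum of frames in state i / count
    for n in range(1, N):
        A[q[n - 1], q[n]] += 1                    # count transitions i -> j
    A /= np.maximum(A.sum(axis=1, keepdims=True), 1)   # normalize rows to get a_ij
    return mu, A
```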
EM training
• Expectation-Maximization (EM):
  - finds locally-optimal parameters Θ to maximize the training-data likelihood p(x_train | Θ)
  - makes sense for decision rules like max p(x | M_j)·p(M_j)
• Principle: adjust Θ to maximize the expected log likelihood of known x & unknown u:
  $E[\log p(x, u \mid \Theta)] = \sum_u p(u \mid x, \Theta)\, \log\big[\, p(x \mid u, \Theta)\, p(u \mid \Theta) \,\big]$
  - for GMMs, unknowns = mixture assignments k
  - for HMMs, unknowns = hidden state q_n (take Θ to include M_j)
• Interpretation: "fuzzy" values for unknowns
[Figure: EM iteration — the data log likelihood log p(X | Θ) climbs to a local optimum as we alternate: estimate the unknowns p(q_n | X, Θ), adjust the model parameters Θ to maximize the expected log likelihood, re-estimate the unknowns, etc.]
EM for HMM parameters
• Expected log likelihood, summed over all possible state sequences Q^k:
  $\sum_{\mathrm{all}\ Q^k} p(Q^k \mid X, \Theta^{old})\, \log\big[\, p(X \mid Q^k, \Theta)\, p(Q^k \mid \Theta) \,\big]$
  $= \sum_{\mathrm{all}\ Q^k} p(Q^k \mid X, \Theta^{old})\, \log \prod_n p(x_n \mid q_n)\, p(q_n \mid q_{n-1})$
  $= \sum_{n=1}^{N} \sum_{i=1}^{S} p(q_n^i \mid X, \Theta^{old})\, \log p(x_n \mid q_n^i, \Theta)$
  $\;+\; \sum_{i=1}^{S} p(q_1^i \mid X, \Theta^{old})\, \log p(q_1^i \mid \Theta)$
  $\;+\; \sum_{n=2}^{N} \sum_{i=1}^{S} \sum_{j=1}^{S} p(q_{n-1}^i, q_n^j \mid X, \Theta^{old})\, \log p(q_n^j \mid q_{n-1}^i, \Theta)$
  - closed-form maximization by differentiation etc.
• Maximization requires the 'state occupancy probabilities' $p(q_n^i \mid X_1^N, \Theta^{old})$
  - these reduce to Viterbi training if forced to hard 0/1 assignments
• Update equations:
  $\mu_i^{new} = \frac{\sum_n p(q_n^i \mid X, \Theta^{old})\, x_n}{\sum_n p(q_n^i \mid X, \Theta^{old})}$
  $a_{ij}^{new} = p(q_n^j \mid q_{n-1}^i) = \frac{\sum_n p(q_{n-1}^i, q_n^j \mid X, \Theta^{old})}{\sum_n p(q_{n-1}^i \mid X, \Theta^{old})}$
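Given the soft occupancies, the updates are just weighted versions of the Viterbi-training averages. A small numpy sketch, assuming `gamma[n, i] = p(q_n^i | X, Θ_old)` and `xi[n, i, j] = p(q_n^i, q_{n+1}^j | X, Θ_old)` have already been computed by the forward-backward algorithm described next (array names and shapes are assumptions):

```python
import numpy as np

def em_updates(X, gamma, xi):
    """Soft (EM) re-estimates of per-state means and transitions.
    X: (N, D) frames; gamma: (N, S) occupancies; xi: (N-1, S, S) pair posteriors."""
    mu_new = (gamma.T @ X) / gamma.sum(axis=0)[:, None]        # occupancy-weighted means
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]   # sum_n xi / sum_n gamma_{n-1}
    return mu_new, A_new
```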
Forward-backward (Baum-Welch)
• The EM updates need the state occupancies $p(q_n^i \mid X_1^N, \Theta)$ over the whole utterance
• The forward algorithm gives $\alpha_n(i) = p(X_1^n, q_n^i)$
  - excludes the influence of the remaining data $X_{n+1}^N$
• Hence, define $\beta_n(i) = p(X_{n+1}^N \mid q_n^i, X_1^n)$
  so that  $\alpha_n(i)\cdot\beta_n(i) = p(q_n^i, X_1^N)$
  then  $p(q_n^i \mid X_1^N) = \frac{\alpha_n(i)\,\beta_n(i)}{\sum_j \alpha_n(j)\,\beta_n(j)}$
• Recursive definition for β:
  $\beta_n(i) = \sum_j \beta_{n+1}(j)\, a_{ij}\, b_j(x_{n+1})$
  - recurses backwards from the final state
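A numpy sketch of the backward recursion and the resulting occupancies, reusing the `forward()` conventions above (`A` transitions, `B[n, j] = b_j(x_n)`); again purely illustrative:

```python
import numpy as np

def backward(A, B):
    """beta[n, i] = p(x_{n+1}..x_N | q_n = i), recursed backwards from the end."""
    N, S = B.shape
    beta = np.zeros((N, S))
    beta[-1] = 1.0                                 # nothing left to explain at the final frame
    for n in range(N - 2, -1, -1):
        beta[n] = A @ (beta[n + 1] * B[n + 1])     # sum_j a_ij b_j(x_{n+1}) beta_{n+1}(j)
    return beta

def occupancies(alpha, beta):
    """gamma[n, i] = p(q_n = i | X) = alpha_n(i) beta_n(i) / sum_j alpha_n(j) beta_n(j)."""
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```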
• Transition re-estimation needs both α and β:
  - the probability of making each transition, normalized by the probability of being in the first state
• Obtain the counts from:
  $a_{ij}^{new} = p(q_n^j \mid q_{n-1}^i) = \frac{\sum_n p(q_{n-1}^i, q_n^j \mid X, \Theta^{old})}{\sum_n p(q_{n-1}^i \mid X, \Theta^{old})}$
  where
  $p(q_{n-1}^i, q_n^j, X \mid \Theta^{old}) = p(X_{n+1}^N \mid q_n^j)\, p(x_n \mid q_n^j)\, p(q_n^j \mid q_{n-1}^i)\, p(q_{n-1}^i, X_1^{n-1})$
  $= \beta_n(j)\, b_j(x_n)\, a_{ij}\, \alpha_{n-1}(i)$
  [Figure: trellis segment from state q_{n-1}^i to q_n^j annotated with α_{n-1}(i), a_ij, b_j(x_n), β_n(j)]
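The corresponding pairwise posteriors can be assembled directly from α, a_ij, b_j and β. The sketch below indexes transitions as (n, n+1) rather than the slide's (n-1, n), which is the same quantity shifted by one frame (names are assumptions):

```python
import numpy as np

def transition_posteriors(alpha, beta, A, B):
    """xi[n, i, j] = p(q_n = i, q_{n+1} = j | X), built from alpha, a_ij, b_j, beta."""
    N, S = B.shape
    xi = np.zeros((N - 1, S, S))
    for n in range(N - 1):
        # unnormalized: alpha_n(i) * a_ij * b_j(x_{n+1}) * beta_{n+1}(j)
        xi[n] = alpha[n][:, None] * A * (B[n + 1] * beta[n + 1])[None, :]
        xi[n] /= xi[n].sum()                       # normalize by p(X)
    return xi
```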
• Gaussian-mixture emission models are re-estimated the same way
  - just more complicated equations..., e.g. for the component means:
  $\mu_{ik}^{new} = \frac{\sum_n p(k \mid x_n, q_n^i, \Theta^{old})\, p(q_n^i \mid X, \Theta^{old})\, x_n}{\sum_n p(k \mid x_n, q_n^i, \Theta^{old})\, p(q_n^i \mid X, \Theta^{old})}$
• Practical GMMs:
  - 9 to 39 feature dimensions
  - 2 to 64 Gaussians per mixture, depending on the number of training examples
• Lots of data → more detailed state inventories
  - e.g. context-independent (CI): q_i = ae aa ax ...
  → context-dependent (CD): q_i = b-ae-b b-ae-k ...
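For concreteness, a sketch of evaluating one state's diagonal-covariance GMM log-likelihood log b_i(x) and the per-component posteriors p(k | x, q^i) that weight the mean update above (parameter names and shapes are illustrative):

```python
import numpy as np

def gmm_log_components(x, weights, means, variances):
    """log[ w_k N(x; mu_k, diag(var_k)) ] for each mixture component k.
    weights: (K,), means/variances: (K, D), x: (D,)."""
    diff = x - means
    return (np.log(weights)
            - 0.5 * np.sum(np.log(2 * np.pi * variances) + diff**2 / variances, axis=1))

def gmm_loglik(x, weights, means, variances):
    """log b_i(x): log-sum-exp over the components of one state's GMM."""
    lc = gmm_log_components(x, weights, means, variances)
    m = lc.max()
    return m + np.log(np.exp(lc - m).sum())

def mixture_posteriors(x, weights, means, variances):
    """p(k | x, q^i): the per-component 'unknowns' weighting the EM mean update."""
    lc = gmm_log_components(x, weights, means, variances)
    p = np.exp(lc - lc.max())
    return p / p.sum()
```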
• The full training procedure starts from a uniform initialization
  - approximate parameters / rough alignment
• Applicable for more than just words...
[Figure: embedded training loop — model inventory (ae_1 ae_2 ae_3, ...) and labelled training data ("dh ax k ae t s ae t aa n ..."); uniform initialization alignments give initialization parameters Θ_init; then repeat until convergence: estimate probabilities of unknowns p(q_n | X_1^N, Θ), maximize via parameters Θ: max E[log p(X, Q | Θ)]]
• Word-level transcriptions → composite utterance models built from phone models
  - alignment all handled implicitly
• What do the states end up meaning?
  - not necessarily what you intended; whatever locally maximizes data likelihood
  - slow convergence, poor discrimination in models
• Other kinds of data, transcriptions
  - less constrained initial models...
[Figure: word transcription "TWO ONE FIVE ..." expanded through pronunciations (ONE = w ah n, TWO = t uw) into a phone-state sequence "sil w ah n ... th r iy t uw ..."]
2  Language modeling
   - Pronunciation models
   - Grammars
   - Decoding
Language models
• Recall the recognition criterion:
  $M^* = \underset{M_j}{\operatorname{argmax}}\ p(M_j \mid X, \Theta) = \underset{M_j}{\operatorname{argmax}}\ p(X \mid M_j, \Theta)\, p(M_j \mid \Theta_L)$
  - M_j is a particular word sequence
  - Θ_L are the parameters related to the language
• The language model contributes two things:
  - pronunciation models: link state sequences to words
  - grammars: priors p(M_j | Θ_L) on word sequences {w_i}
• HMMs form a hierarchy: sub-phone states within phones within words within word sequences
  - can handle time dilation, pronunciation, grammar all within the same framework
  [Figure: hierarchy from sub-phone states (ae_1 ae_2 ae_3) through phone models (k ae t, ...) to the word level (THE {CAT, DOG} {SAT, ATE})]
  $p(q \mid M) = p(q, \phi, w \mid M) = p(q \mid \phi)\cdot p(\phi \mid w)\cdot p(w_n \mid w_1^{n-1})$
Pronunciation models
• Describe words as sequences of shared subword (phone) units, p(Q | w_i)
  - more training examples for each unit
  - generalizes to unseen words
  - (or can do it automatically...)
• Start e.g. from a pronouncing dictionary:
  ZERO(0.5)  z iy r ow
  ZERO(0.5)  z ih r ow
  ONE(1.0)   w ah n
  TWO(1.0)   tcl t uw
  ...
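Such a dictionary is easy to represent directly; a toy sketch (structure and names are illustrative, and the variant probabilities are just the ones listed above):

```python
# Pronouncing dictionary as word -> list of (probability, phone sequence) variants
LEXICON = {
    "ZERO": [(0.5, ["z", "iy", "r", "ow"]),
             (0.5, ["z", "ih", "r", "ow"])],
    "ONE":  [(1.0, ["w", "ah", "n"])],
    "TWO":  [(1.0, ["tcl", "t", "uw"])],
}

def expand(word_sequence):
    """Enumerate (prior, phone string) for every pronunciation variant of a word sequence."""
    hyps = [(1.0, [])]
    for w in word_sequence:
        hyps = [(p * pv, phones + pron)
                for p, phones in hyps
                for pv, pron in LEXICON[w]]
    return hyps

print(expand(["ZERO", "ONE"]))   # two variants, each with its product probability
```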
Learning pronunciations
• Learn pronunciation variation from data:
  - align to 'canonical' pronunciations
  - infer modification rules
  - predict other pronunciation variants
• e.g. 'd deletion':  d → Ø / l _ [stop]   p = 0.9
• Generate pronunciation variants; use forced alignment to find weights
  [Figure: surface phone strings, e.g. "f ay v y iy r ow l d", "f ah ay v y uh r ow l ..."]
Grammars
• Need priors over word sequences
  - full parses would help, but speech is often agrammatic
  → use local n-gram statistics instead:
  $p(w_n \mid w_1^{n-1}) = p(w_n \mid w_{n-K}, \ldots, w_{n-1})$
  - e.g. n-gram models of Shakespeare:
    n=1  To him swallowed confess hear both. Which. Of save on ...
    n=2  What means, sir. I confess she? then all sorts, he is trim, ...
    n=3  Sweet prince, Falstaff shall die. Harry of Monmouth's grave...
    n=4  King Henry. What! I will go seek the traitor Gloucester. ...
• Big win in recognizer WER
  - raw recognition results are often highly ambiguous
  - the grammar guides the search to 'reasonable' solutions
• n-grams are trained on large text corpora
  - 100M+ words
  - but: not like spoken language
• Sparseness: a large (e.g. 100,000-word) vocabulary → 10^15 possible trigrams!
  - never see enough examples
  - unobserved trigrams should NOT have Pr = 0!
• Smooth by backing off / interpolating:
  - use p(w_n) as an approximation to p(w_n | w_{n-1}) etc.
  - interpolate 1-gram, 2-gram, 3-gram with learned weights?
• Lots of ideas, e.g. category grammars
  - e.g. p(PLACE | "went", "to") · p(w_n | PLACE)
  - how to define categories?
  - how to tag words in the training corpus?
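A minimal sketch of the interpolation idea, mixing relative-frequency 1-, 2- and 3-gram estimates with fixed weights (real systems learn the weights, e.g. on held-out data; all names and numbers here are illustrative):

```python
from collections import Counter

def train_counts(tokens):
    """Collect unigram, bigram and trigram counts from a token list."""
    uni = Counter(tokens)
    bi  = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return uni, bi, tri

def interp_prob(w, hist, uni, bi, tri, lambdas=(0.1, 0.3, 0.6)):
    """p(w | hist) as a weighted mix of 1-, 2- and 3-gram relative frequencies,
    so an unseen trigram never gets probability exactly zero (if the unigram exists)."""
    total = sum(uni.values())
    p1 = uni[w] / total
    p2 = bi[(hist[-1], w)] / uni[hist[-1]] if uni[hist[-1]] else 0.0
    p3 = (tri[(hist[-2], hist[-1], w)] / bi[(hist[-2], hist[-1])]
          if bi[(hist[-2], hist[-1])] else 0.0)
    l1, l2, l3 = lambdas
    return l1 * p1 + l2 * p2 + l3 * p3

text = "the cat sat on the mat the cat ate".split()
uni, bi, tri = train_counts(text)
print(interp_prob("sat", ["the", "cat"], uni, bi, tri))
```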
Decoding
• Compiling words and grammar into one network gives a huge HMM
  - with 100,000+ individual states for LVCSR!
  → exploit the hierarchical structure instead:
  - phone states independent of word
  - next word (semi-)independent of word history
• Cannot evaluate every hypothesis exhaustively
  - need to restrict search to the most promising ones: beam search
  - sort by estimates of total probability = Pr(so far) + lower-bound estimate of the remainder
  - trade search errors for speed
• Start-synchronous algorithm:
  - extract the top hypothesis from the queue:
    $[\, P_n,\ \{w_1, \ldots, w_k\},\ n \,]$   (probability so far, words, next time frame)
  - find plausible words {w_i} starting at time n → new hypotheses:
    $[\, P_n \cdot p(X_n^{n+N-1} \mid w_i)\cdot p(w_i \mid w_k, \ldots),\ \{w_1, \ldots, w_k, w_i\},\ n+N \,]$
  - discard if too unlikely, or if the queue is too long
  - else re-insert into the queue and repeat
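A schematic of that loop using a priority queue, with the acoustic matcher and language model left as assumed black boxes (`word_logscore`, `lm_logprob`); the lower-bound estimate of the remaining probability is omitted for brevity, so this sketch is greedier than the algorithm described above:

```python
import heapq

def stack_decode(n_frames, vocab, word_logscore, lm_logprob, beam=100):
    """Start-synchronous stack-decoding sketch.
    word_logscore(w, n) -> (log p(X_n^{n+N-1} | w), N)   # assumed acoustic matcher
    lm_logprob(w, history) -> log p(w | history)         # assumed language model
    """
    queue = [(0.0, 0, ())]            # (-log prob so far, next frame n, word history)
    while queue:
        neg_logp, n, words = heapq.heappop(queue)
        if n >= n_frames:             # hypothesis accounts for all the audio: done
            return words, -neg_logp
        new_hyps = []
        for w in vocab:               # find plausible words starting at time n
            acoustic, dur = word_logscore(w, n)
            total = -neg_logp + acoustic + lm_logprob(w, words)
            new_hyps.append((-total, n + dur, words + (w,)))
        queue.extend(new_hyps)
        queue = heapq.nsmallest(beam, queue)   # discard unlikely / overflow hypotheses
        heapq.heapify(queue)
    return None
```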
3  Discrimination & adaptation
   - Discriminant models
   - Neural net acoustic models
   - Model adaptation
Discriminant models
• Standard HMM training is maximum likelihood
  - i.e. choose a single Θ to max p(X_trn | Θ)
  - Bayesian approach: actually p(Θ | X_trn)
• Decision rule is max p(X | M)·p(M)
  - training will increase p(X | M_correct)
  - may also increase p(X | M_wrong) ... as much?
• Discriminant training tries directly to increase discrimination between right & wrong models
  - e.g. Maximum Mutual Information (MMI):
  $I(M_j; X \mid \Theta) = \log \frac{p(M_j, X \mid \Theta)}{p(M_j \mid \Theta)\, p(X \mid \Theta)} = \log \frac{p(X \mid M_j, \Theta)}{\sum_k p(X \mid M_k, \Theta)\, p(M_k \mid \Theta)}$
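For a single utterance with a small set of competing models, the MMI criterion is just a log-likelihood minus a log-sum over all models; a toy numpy sketch (numbers purely illustrative):

```python
import numpy as np

def mmi_objective(logliks, logpriors, correct):
    """Per-utterance MMI criterion:
    I = log p(X | M_correct) - log sum_k p(X | M_k) p(M_k)."""
    joint = logliks + logpriors                   # log[p(X | M_k) p(M_k)] for each model
    m = joint.max()
    log_px = m + np.log(np.exp(joint - m).sum())  # log p(X) via log-sum-exp
    return logliks[correct] - log_px

# toy: three competing word-sequence models
print(mmi_objective(np.array([-120.0, -123.0, -125.0]),
                    np.log(np.array([0.5, 0.3, 0.2])), correct=0))
```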
Neural net acoustic models
• Train a network discriminantly to estimate the state posteriors p(q_i | x_n)
  - set the output targets from the known (aligned) state labels
• Nets are less sensitive to the input representation
  - skewed feature distributions
  - correlated features
• Use 'scaled likelihoods' in place of emission densities:
  $p(x_n \mid q_i) = \frac{p(q_i \mid x_n)\, p(x_n)}{p(q_i)} \propto \frac{p(q_i \mid x_n)}{p(q_i)}$
[Figure: feature calculation (cepstra C_0, C_1, C_2, ..., C_k over frames t_n .. t_{n+w}) feeding a network whose outputs are posteriors p(q_i | X) for phone classes h#, pcl, bcl, tcl, dcl, ...]
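Turning posteriors into scaled likelihoods is a single element-wise division by the class priors (typically the relative state frequencies in the training data); a small sketch with illustrative values:

```python
import numpy as np

def scaled_likelihoods(posteriors, priors, floor=1e-8):
    """Divide network posteriors p(q_i | x_n) by class priors p(q_i) to get
    quantities proportional to p(x_n | q_i), usable as HMM emission scores."""
    return posteriors / np.maximum(priors, floor)

# toy: 5 frames x 4 phone classes; priors = class relative frequencies in training data
posteriors = np.random.dirichlet(np.ones(4), size=5)
priors = np.array([0.4, 0.3, 0.2, 0.1])
emission_scores = scaled_likelihoods(posteriors, priors)
```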
• Typical hybrid network sizes:
  - input layer: 9 frames × 9-40 features ≈ 300 units
  - hidden layer: 100-8000 units, depending on training set size
  - output layer: 30-60 context-independent phones
• Hard to make context dependent
  - problems training many classes that are similar?
• Representation is partially opaque:
  [Figure: learned weights — input → hidden weights for hidden unit #187 (time frame × feature index), and hidden → output weights into the output layer (phones)]
Model adaptation
• Practical systems must cope with mismatch
  - test conditions are not like training data: accent, microphone, background noise ...
• Would like to adapt on the test data itself
  - but: no 'ground truth' labels or transcription
• Assume that recognizer output is correct; estimate a few parameters from those labels
  - e.g. Maximum Likelihood Linear Regression (MLLR)
[Figure: "Male data" and "Female data" scatter plots over two feature dimensions]
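A very reduced illustration of the MLLR idea: estimate one affine transform of the Gaussian means from adaptation data. Real MLLR maximizes likelihood with soft state occupancies and regression classes; the sketch below uses hard frame-to-state labels and a least-squares fit, so it is a stand-in only (all names and data are made up):

```python
import numpy as np

def estimate_mean_transform(X, state_ids, means):
    """Least-squares fit of an affine transform mu_i' = W mu_i + b so the transformed
    means match the adaptation frames assigned (hard labels here) to each state."""
    D = X.shape[1]
    sources = means[state_ids]                           # mean each frame 'should' match
    ext = np.hstack([sources, np.ones((len(X), 1))])     # append 1 for the bias term
    Wb, *_ = np.linalg.lstsq(ext, X, rcond=None)         # shape (D+1, D)
    return Wb[:D].T, Wb[D]                               # W (D, D), b (D,)

def adapt_means(means, W, b):
    return means @ W.T + b                               # apply to every state mean

# toy usage: 200 adaptation frames, 10 states, 13-dim features (all illustrative)
rng = np.random.default_rng(0)
means = rng.normal(size=(10, 13))
labels = rng.integers(0, 10, size=200)
X = means[labels] * 1.1 + 0.3 + 0.05 * rng.normal(size=(200, 13))
W, b = estimate_mean_transform(X, labels, means)
adapted = adapt_means(means, W, b)
```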
[Figure: overall recognizer structure — sound → feature calculation → feature vectors → acoustic classifier (network weights) → phone probabilities → HMM decoder (word models, language model) → phone & word labeling]
Summary
• HMMs combine state transitions and emission likelihoods in one model
  - the best path (Viterbi) performs recognition
• Parameters are trained by Viterbi alignment or EM
  - Viterbi training makes intuitive sense
  - EM training is guaranteed to converge
  - acoustic models (e.g. GMMs) train at the same time
• Language modeling captures higher structure
  - pronunciation, word sequences
  - fits directly into the HMM state structure
  - need to 'prune' the search space in decoding
• Further improvements...
  - discriminant training moves models 'apart'
  - adaptation adjusts models to new situations