Speech Recognition HUNG-YI LEE 李宏毅.
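Last Time. LAS: plain seq2seq. CTC: seq2seq whose decoder is a linear classifier. RNA: seq2seq that must output exactly one token for each input. RNN-T: seq2seq that can output multiple tokens for each input. Neural Transducer: RNN-T that reads one window of input at a time. MoCha: Neural Transducer whose window can move and resize freely.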
Two Points of View: Seq-to-seq and HMM. Source of image: Prof. Lin-shan Lee (李琳山), 《數位語音處理概論》 (Introduction to Digital Speech Processing).
Hidden Markov Model (HMM). Speech recognition maps speech $X$ to text $Y$. Decode: $Y^* = \arg\max_Y P(Y|X) = \arg\max_Y \frac{P(X|Y)P(Y)}{P(X)} = \arg\max_Y P(X|Y)P(Y)$. $P(X|Y)$: Acoustic Model (HMM). $P(Y)$: Language Model.
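The decoding rule above, Y* = arg max over Y of P(X|Y) P(Y), can be sketched in a few lines. This is only an illustration: the three candidate transcriptions and their log-scores are made-up numbers, and a real decoder searches a far larger space instead of scoring a fixed list.

```python
# Hypothetical log-scores for three candidate transcriptions of one utterance.
# log_acoustic[Y] stands for log P(X|Y), log_lm[Y] for log P(Y).
log_acoustic = {
    "what do you think": -12.0,
    "what do you sink": -11.5,
    "watt do you think": -12.2,
}
log_lm = {
    "what do you think": -4.0,
    "what do you sink": -9.0,
    "watt do you think": -8.5,
}


def decode(candidates):
    """Return the candidate Y that maximizes log P(X|Y) + log P(Y).

    P(X) is the same for every Y, so it drops out of the arg max.
    """
    return max(candidates, key=lambda y: log_acoustic[y] + log_lm[y])


best = decode(list(log_acoustic))
```

In this toy case the acoustic model alone slightly prefers "what do you sink", and the language model is what tips the decision.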
HMM: $P(X|Y) \rightarrow P(X|S)$. A token sequence $Y$ corresponds to a sequence of states $S$. Example: what do you think. Phoneme: hh w aa t d uw y uw th ih ng k. Tri-phone: …… t-d+uw, d-uw+y, uw-y+uw, y-uw+th …… State: t-d+uw1, t-d+uw2, t-d+uw3, d-uw+y1, d-uw+y2, d-uw+y3.
HMM: $P(X|Y) \rightarrow P(X|S)$. A sentence $Y$ corresponds to a sequence of states $S$. (Figure: a chain of states running from Start to End that generates $X$.)
HMM: $P(X|Y) \rightarrow P(X|S)$. A sentence $Y$ corresponds to a sequence of states $S$. Transition Probability $p(b|a)$: the probability of moving from one state to another. Emission Probability: the probability of an acoustic feature vector given a state, e.g. P(x|”t-d+uw1”), P(x|”d-uw+y3”), modeled by a Gaussian Mixture Model (GMM).
HMM – Emission Probability • Too many states …… Tied-state: different states, e.g. P(x|”t-d+uw3”) and P(x|”d-uw+y3”), hold pointers to the same address, so they share one distribution. Ultimate form: Subspace GMM [Povey, et al., ICASSP’10]. (Geoffrey Hinton also published deep learning for ASR in the same conference [Mohamed, et al., ICASSP’10].)
$P_\theta(X|S) = \sum_{h \in align(S)} P(X|h)$. An alignment $h$ says which state generates which vector $x^1, \dots, x^6$. $P(X|h)$ is the product along $h$ of the transition probabilities $P(b|a)$ and the emission probabilities (GMM) $P(x|a)$, e.g. $P(x^1|a)$ for the state $a$ that generates $x^1$. (Figure: candidate sequences $h^1$, $h^2$, … with their scores $P(X|h^1)$, $P(X|h^2)$.)
How to use Deep Learning?
Method 1: Tandem. A DNN state classifier takes an acoustic feature $x_i$ and outputs $p(a|x_i)$, $p(b|x_i)$, $p(c|x_i)$, ……; size of output layer = No. of states. The DNN output becomes the new acoustic feature for the HMM. The last hidden layer or a bottleneck layer is also possible.
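Method 2: DNN-HMM Hybrid. The HMM needs the emission probability $P(x|a)$, while the DNN outputs $P(a|x)$. Convert with $P(x|a) = \frac{P(x, a)}{P(a)} = \frac{P(a|x)P(x)}{P(a)}$, where $P(a|x)$ is the DNN output and $P(a)$ is counted from training data. The DNN can also be a CNN, LSTM, …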
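A minimal sketch of the hybrid conversion, assuming three states and made-up numbers. Since P(x) is the same for every state, it can be dropped when states are compared, which leaves log P(a|x) - log P(a). The function name, the posteriors, and the counts are all hypothetical.

```python
import math


def scaled_log_likelihoods(posteriors, state_counts):
    """Turn DNN posteriors P(a|x) into log P(x|a) up to the constant log P(x).

    posteriors:   P(a|x) for each state a, for one acoustic feature x
    state_counts: how often each state occurs in the aligned training data
    """
    total = sum(state_counts)
    return [
        math.log(p) - math.log(count / total)
        for p, count in zip(posteriors, state_counts)
    ]


# Hypothetical DNN output for one frame and hypothetical state counts.
posteriors = [0.7, 0.2, 0.1]
state_counts = [500, 300, 200]
scores = scaled_log_likelihoods(posteriors, state_counts)
```

Dividing by the prior keeps frequent states from dominating simply because they are frequent in the training data.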
How to train a state classifier? The data is utterance + label (without alignment): acoustic features paired with a state sequence. Train an HMM-GMM model and use it to get utterance + label (aligned): one state per acoustic feature, e.g. a a a b b c c.
How to train a state classifier? Train HMM-GMM model: utterance + label (without alignment) → utterance + label (aligned) → train DNN1.
How to train a state classifier? Utterance + label (without alignment) → realignment with DNN1 → utterance + label (aligned) → train DNN2.
Human Parity! • Microsoft's speech recognition technology passes a major milestone: conversational recognition reaches human level! (2016.10) https://www.bnext.com.tw/article/41414/bn-2016-10-19-020437-216 Machine 5.9% vs. Human 5.9% [Yu, et al., INTERSPEECH’16] • IBM vs Microsoft: 'Human parity' speech recognition record changes hands again (2017.03) http://www.zdnet.com/article/ibm-vs-microsoft-human-parity-speech-recognition-record-changes-hands-again/ Machine 5.5% vs. Human 5.1% [Saon, et al., INTERSPEECH’17]
Very Deep [Yu, et al., INTERSPEECH’16].
Back to End-to-end.
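LAS: $P(Y|X) = ?$ • LAS directly computes $P(Y|X)$: $P(Y|X) = p(a|X)\,p(b|a, X) \cdots$, where each decoder step outputs a distribution of size $V$. Decoding (Beam Search): $Y^* = \arg\max_Y \log P(Y|X)$. Training: $\theta^* = \arg\max_\theta \log P_\theta(\hat{Y}|X)$. (Figure: the LAS decoder reading $x^1, \dots, x^4$ and generating a, b, … one token per step.)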
CTC, RNN-T: $P(Y|X) = ?$ • LAS directly computes $P(Y|X) = p(a|X)\,p(b|a, X) \cdots$ • CTC and RNN-T need alignment: the encoder turns $x^1, \dots, x^4$ into $h^1, \dots, h^4$, an alignment $h$ has probability $P(h|X)$, and $P(Y|X) = \sum_{h \in align(Y)} P(h|X)$. Decoding (Beam Search): $Y^* = \arg\max_Y \log P(Y|X)$. Training: $\theta^* = \arg\max_\theta \log P_\theta(\hat{Y}|X)$.
HMM, CTC, RNN-T. HMM: $P_\theta(X|S) = \sum_{h \in align(S)} P(X|h)$. CTC, RNN-T: $P_\theta(Y|X) = \sum_{h \in align(Y)} P(h|X)$. 1. Enumerate all the possible alignments. 2. How to sum over all the alignments. 3. Training: $\theta^* = \arg\max_\theta \log P_\theta(\hat{Y}|X)$, $\frac{\partial P_\theta(\hat{Y}|X)}{\partial \theta} = ?$ 4. Testing (Inference, decoding): $Y^* = \arg\max_Y \log P(Y|X)$.
All the alignments, $h \in align(Y)$. Speech recognition: speech with $T = 6$ acoustic features, text c a t with $N = 3$ tokens. HMM: duplicate c a t to length $T$: c c a a a t, c a a a a t, … CTC: duplicate c a t and add φ to length $T$: c φ a a t t, φ c a φ t φ, … RNN-T: add φ × $T$: c φ φ φ a φ φ t φ, c φ φ a φ φ t φ φ, …… LAS: What are you all busy with? ☺
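HMM: duplicate c a t to length $T$ (c c a a a t, c a a a a t, …). For $n = 1$ to $N$: output the $n$-th token $t_n$ times. Constraint: $t_1 + t_2 + \cdots + t_N = T$, $t_n > 0$. Trellis Graph: columns $x^1, \dots, x^6$, rows c, a, t; each step either duplicates the current token or moves to the next token. (Figure: example paths through the trellis.)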
HMM Trellis Graph, continued. (Figure: the path c c c c c c, which only duplicates c and never moves to the next token; it violates $t_n > 0$ for a and t, so it is not a valid alignment.)
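The HMM rule (repeat the n-th token t_n times, with every t_n > 0 and the counts summing to T) can be enumerated directly. The sketch below is illustrative only; the function name is an assumption, and it picks the N - 1 frame positions at which the alignment moves on to the next token.

```python
from itertools import combinations


def hmm_alignments(tokens, T):
    """All ways to repeat each token at least once so the total length is T."""
    N = len(tokens)
    result = []
    # Each choice of N - 1 cut points in 1..T-1 fixes t_1, ..., t_N with t_n > 0.
    for cuts in combinations(range(1, T), N - 1):
        bounds = (0,) + cuts + (T,)
        h = []
        for n in range(N):
            repeat = bounds[n + 1] - bounds[n]  # this is t_n
            h.extend(tokens[n] * repeat)
        result.append("".join(h))
    return result


alignments = hmm_alignments("cat", T=6)
```

For c a t and T = 6 this yields 10 alignments, including ccaaat and caaaat; cccccc is not among them because a and t would have to be output zero times.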
CTC: duplicate c a t and add φ to length $T$ (c φ a a t t, φ c a φ t φ, …). Output “φ” $c_0$ times. For $n = 1$ to $N$: output the $n$-th token $t_n$ times, then output “φ” $c_n$ times. Constraint: $t_1 + t_2 + \cdots + t_N + c_0 + c_1 + \cdots + c_N = T$, $t_n > 0$, $c_n \geq 0$.
CTC trellis: columns $x^1, \dots, x^6$, rows φ c φ a φ t φ. Moves: duplicate, insert φ, next token. The path can start at the first φ or directly at c (φ can be skipped).
CTC trellis, from a φ row: duplicate φ, or go to the next token. It cannot skip any token.
CTC trellis, from a token row: duplicate the token, insert φ, or go directly to the next token.
CTC trellis. (Figure: example paths through the trellis with columns $x^1, \dots, x^6$ and rows φ c φ a φ t φ.)
CTC exception: when the next token is the same token. For s e e (rows φ s φ e φ e φ), the path cannot go from the first e directly to the second e, because … ee … → e once duplicates are merged; the φ between them has to be inserted.
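The CTC rules, including the repeated-token exception, follow from how an alignment is read back: merge consecutive duplicates, then remove the blank symbol φ. The brute-force sketch below tries every string of length T over the tokens plus φ and keeps those that collapse to the target. It is meant only for tiny examples, and the function names are assumptions.

```python
from itertools import groupby, product

BLANK = "φ"


def ctc_collapse(h):
    """Merge consecutive duplicates, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(h)]
    return "".join(s for s in merged if s != BLANK)


def ctc_alignments(tokens, T):
    """Every length-T sequence over the tokens plus blank that collapses to tokens."""
    alphabet = sorted(set(tokens)) + [BLANK]
    return [
        "".join(h)
        for h in product(alphabet, repeat=T)
        if ctc_collapse(h) == tokens
    ]


cat_alignments = ctc_alignments("cat", T=6)
see_alignments = ctc_alignments("see", T=4)
```

For s e e with T = 4 the only alignment is seφe: any path that steps from one e straight to the next collapses to se.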
RNN-T: add φ × $T$ (c φ φ φ a φ φ t φ, c φ φ a φ φ t φ φ, ……). Output “φ” $c_0$ times. For $n = 1$ to $N$: output the $n$-th token 1 time, then output “φ” $c_n$ times. Constraint: $c_0 + c_1 + \cdots + c_N = T$, $c_N > 0$, $c_n \geq 0$ for $n = 0$ to $N - 1$. That is, putting some φ before c, between c and a, and between a and t is optional, but after t φ must be put at least once.
RNN-T trellis: columns $x^1, \dots, x^6$, rows for the tokens c, a, t. Moves: output token, or insert φ.
RNN-T trellis. (Figure: example paths through the trellis with columns $x^1, \dots, x^6$ and rows c, a, t; moves are output token and insert φ.)
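For RNN-T, every alignment contains each token exactly once and the blank symbol φ exactly T times, and the last symbol must be φ. A small sketch of the enumeration under those rules (the function name is an assumption): place the N tokens among the first T + N - 1 positions and fill the rest with φ.

```python
from itertools import combinations

BLANK = "φ"


def rnnt_alignments(tokens, T):
    """All sequences with each token once, T blanks, and a blank at the end."""
    N = len(tokens)
    result = []
    # The final position is reserved for a blank, so the tokens are placed
    # (in order) among the first T + N - 1 positions.
    for positions in combinations(range(T + N - 1), N):
        h = [BLANK] * (T + N)
        for n, pos in enumerate(positions):
            h[pos] = tokens[n]
        result.append("".join(h))
    return result


alignments = rnnt_alignments("cat", T=6)
```

For c a t and T = 6 there are 56 such alignments, each of length 9.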
HMM, CTC, RNN-T as state graphs from Start to End. HMM: states c, a, t. CTC: states c, a, t plus four φ states. RNN-T: states c, a, t plus four φ states, with three “not gen” labels.
HMM, CTC, RNN-T. 1. Enumerate all the possible alignments. 2. How to sum over all the alignments. 3. Training. 4. Testing (Inference, decoding).
This part is challenging.
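Score Computation. $h$ = φ c φ φ a φ t φ φ. $P(h|X) = P(\phi|X) \times P(c|X, \phi) \times P(\phi|X, \phi c) \times \cdots$ (Figure: the path of $h$ through the RNN-T trellis with columns $x^1, \dots, x^6$ and rows c, a, t; moves are output token and insert φ.)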
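$h$ = φ c φ φ a φ t φ φ. The encoder outputs read at each step are $h^1, h^2, h^2, h^3, h^4, h^4, h^5, h^5, h^6$. A separate RNN reads <BOS>, c, a, t and produces $l^0, l^1, l^2, l^3$, which are used at each step as $l^0, l^0, l^1, l^1, l^1, l^2, l^2, l^3, l^3$. The resulting factors are $p_{1,0}(\phi)$, $p_{2,0}(c)$, $p_{2,1}(\phi)$, $p_{3,1}(\phi)$, $p_{4,1}(a)$, $p_{4,2}(\phi)$, $p_{5,2}(t)$, $p_{5,3}(\phi)$, $p_{6,3}(\phi)$.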
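Score Computation. For $h$ = φ c φ φ a φ t φ φ: $P(h|X) = p_{1,0}(\phi)\,p_{2,0}(c)\,p_{2,1}(\phi)\,p_{3,1}(\phi)\,p_{4,1}(a)\,p_{4,2}(\phi)\,p_{5,2}(t)\,p_{5,3}(\phi)\,p_{6,3}(\phi)$, one factor per step of the path through the trellis (columns $x^1, \dots, x^6$, rows c, a, t).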
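Score Computation. The distribution $p_{4,2}$ is the same no matter which path reaches that point of the trellis, because φ is not considered! The RNN on the token side only reads <BOS>, c, a (giving $l^0, l^1, l^2$), so $p_{4,2}$ depends only on $h^4$ and $l^2$.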
(Figure: the same unrolled network from $p_{4,2}$ onward. $h^4, h^5, h^5, h^6$ together with $l^2, l^2, l^3, l^3$ give $p_{4,2}$, $p_{5,2}$, $p_{5,3}$, $p_{6,3}$, whatever steps came before.)
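$\alpha_{i,j}$: the summation of the scores of all the alignments that read the $i$-th acoustic feature and output the $j$-th token. $\alpha_{4,2} = \alpha_{4,1}\,p_{4,1}(a) + \alpha_{3,2}\,p_{3,2}(\phi)$: from $\alpha_{4,1}$, generate “a”; from $\alpha_{3,2}$, generate “φ” and read $x^4$. Continuing over the whole trellis gives $\sum_{h \in align(Y)} P(h|X)$.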
With $\alpha_{4,2} = \alpha_{4,1}\,p_{4,1}(a) + \alpha_{3,2}\,p_{3,2}(\phi)$ applied at every point of the trellis, you can compute the summation of the scores of all the alignments.
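The recursion can be checked numerically on a toy problem. The sketch below makes several assumptions: the distributions p[i, j] are random stand-ins rather than outputs of a trained RNN-T, φ denotes the blank symbol, alpha[1, 0] is initialized to 1, and the final score is alpha[T, N] multiplied by p[T, N](φ) for the mandatory last blank. It compares the forward computation with a brute-force sum over all alignments.

```python
import random
from itertools import combinations

BLANK = "φ"
VOCAB = [BLANK, "c", "a", "t"]


def make_toy_distributions(T, N, vocab, seed=0):
    """Random stand-in for p[i, j]: one distribution over vocab per trellis point."""
    rng = random.Random(seed)
    p = {}
    for i in range(1, T + 1):
        for j in range(N + 1):
            weights = [rng.random() + 0.1 for _ in vocab]
            total = sum(weights)
            p[i, j] = {s: w / total for s, w in zip(vocab, weights)}
    return p


def forward(p, tokens, T):
    """alpha[i, j]: summed score of all partial alignments reaching (i, j)."""
    N = len(tokens)
    alpha = {(1, 0): 1.0}
    for i in range(1, T + 1):
        for j in range(N + 1):
            if (i, j) == (1, 0):
                continue
            score = 0.0
            if j > 0:  # from (i, j-1): generate the j-th token
                score += alpha[i, j - 1] * p[i, j - 1][tokens[j - 1]]
            if i > 1:  # from (i-1, j): generate blank, read the next feature
                score += alpha[i - 1, j] * p[i - 1, j][BLANK]
            alpha[i, j] = score
    return alpha[T, N] * p[T, N][BLANK], alpha


def brute_force(p, tokens, T):
    """Sum P(h|X) over every alignment h explicitly (tiny cases only)."""
    N = len(tokens)
    total = 0.0
    for positions in combinations(range(T + N - 1), N):
        i, j, score = 1, 0, 1.0
        for k in range(T + N):
            if k in positions:  # output a token: j advances
                score *= p[i, j][tokens[j]]
                j += 1
            else:  # output blank: i advances
                score *= p[i, j][BLANK]
                i += 1
        total += score
    return total


p = make_toy_distributions(T=6, N=3, vocab=VOCAB)
total, alpha = forward(p, "cat", T=6)
```

The forward pass visits each of the T × (N + 1) trellis points once, while the brute-force sum walks all 56 alignments for this case; the gap grows quickly with longer inputs.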
HMM, CTC, RNN-T. 1. Enumerate all the possible alignments. 2. How to sum over all the alignments. 3. Training. 4. Testing (Inference, decoding).
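Training. $\theta^* = \arg\max_\theta \log P_\theta(\hat{Y}|X)$, where $P(\hat{Y}|X) = \sum_h P(h|X)$. $\frac{\partial P(\hat{Y}|X)}{\partial \theta} = ?$ Example alignment φ c φ φ a φ t φ φ with score $p_{1,0}(\phi)\,p_{2,0}(c)\,p_{2,1}(\phi)\,p_{3,1}(\phi)\,p_{4,1}(a)\,p_{4,2}(\phi)\,p_{5,2}(t)\,p_{5,3}(\phi)\,p_{6,3}(\phi)$.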
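Each arrow in the trellis (columns $x^1, \dots, x^6$, rows c, a, t) is a component in $P(\hat{Y}|X) = \sum_h P(h|X)$: for example $p_{4,1}(a)$ and $p_{3,2}(\phi)$, as well as $p_{1,0}(\phi)$, $p_{2,0}(c)$, $p_{2,1}(\phi)$, $p_{3,1}(\phi)$, $p_{4,2}(\phi)$, $p_{5,2}(t)$, $p_{5,3}(\phi)$, $p_{6,3}(\phi)$ along one alignment.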