PowerPoint Presentation

Published on Apr 25, 2024

Scene 1 (0s)

Speech Recognition HUNG-YI LEE 李宏毅.

Scene 2 (6s)

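Last Time
LAS: simply seq2seq
CTC: seq2seq whose decoder is a linear classifier
RNA: seq2seq that must output exactly one item for each input
RNN-T: seq2seq that can output multiple items for each input
Neural Transducer: RNN-T that reads one window of input at a time
MoCha: Neural Transducer whose window can move and resize freely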

Scene 3 (26s)

Two Points of View: Seq-to-seq and HMM. Source of image: Prof. Lin-shan Lee (李琳山), 《數位語音處理概論》 (Introduction to Digital Speech Processing).

Scene 4 (36s)

Hidden Markov Model (HMM)
Speech recognition maps speech $X$ to text $Y$:
$Y^* = \arg\max_Y P(Y|X) = \arg\max_Y \frac{P(X|Y)\,P(Y)}{P(X)} = \arg\max_Y P(X|Y)\,P(Y)$
$P(X|Y)$: acoustic model (HMM). $P(Y)$: language model. Taking the $\arg\max$ over $Y$ is the decoding step.
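As a minimal sketch of the decoding rule $Y^* = \arg\max_Y P(X|Y)\,P(Y)$, the snippet below picks the best of two candidate transcriptions in the log domain. The candidate sentences, the probabilities, and the function name `decode` are all hypothetical and chosen only for illustration; a real decoder searches a far larger space.

```python
import math

def decode(acoustic_logp, lm_logp):
    """Return argmax_Y [log P(X|Y) + log P(Y)] over the shared candidates."""
    return max(acoustic_logp, key=lambda y: acoustic_logp[y] + lm_logp[y])

# Hypothetical scores: the acoustic model slightly prefers the wrong sentence,
# but the language model strongly prefers the right one.
acoustic_logp = {
    "what do you think": math.log(0.020),
    "what do you sink": math.log(0.025),
}
lm_logp = {
    "what do you think": math.log(0.010),
    "what do you sink": math.log(0.0001),
}

best = decode(acoustic_logp, lm_logp)
print(best)
```

$P(X)$ is dropped because it is the same for every candidate $Y$ and so does not change the $\arg\max$.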

Scene 5 (48s)

HMM
Phoneme: hh w aa t d uw y uw th ih ng k (what do you think)
Tri-phone: …… t-d+uw d-uw+y uw-y+uw y-uw+th ……
State: t-d+uw1 t-d+uw2 t-d+uw3 d-uw+y1 d-uw+y2 d-uw+y3
$P(X|Y) \to P(X|S)$: a token sequence $Y$ corresponds to a sequence of states $S$.

Scene 6 (1m 2s)

HMM: $P(X|Y) \to P(X|S)$. A sentence $Y$ corresponds to a sequence of states $S$. (Figure: a chain of states running from Start to End.)

Scene 7 (1m 10s)

HMM: $P(X|Y) \to P(X|S)$. A sentence $Y$ corresponds to a sequence of states $S$.
Transition probability $P(b|a)$: probability of moving from one state to another.
Emission probability: P(x|"t-d+uw1"), P(x|"d-uw+y3"), each modeled by a Gaussian Mixture Model (GMM).

Scene 8 (1m 20s)

HMM – Emission Probability
• Too many states ……
• Tied-state: P(x|"t-d+uw3") and P(x|"d-uw+y3") can share parameters, like two pointers to the same address.
• Ultimate form: Subspace GMM [Povey, et al., ICASSP'10]
(Geoffrey Hinton also published deep learning for ASR in the same conference [Mohamed, et al., ICASSP'10].)

Scene 9 (1m 34s)

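$P_\theta(X|S) = ?$
Transition: $P(b|a)$. Emission (GMM): $P(x|a)$.
$P_\theta(X|S) = \sum_{h \in \mathrm{align}(S)} P(X|h)$
Alignment $h$: which state generates which vector. The slide lists example alignments $h^1$ and $h^2$ over the six vectors $x^1, \dots, x^6$.
For one alignment $h = h_1 \dots h_6$, $P(X|h)$ is the product of the transition probabilities along $h$ and the emission probabilities $P(x^1|h_1)\,P(x^2|h_2) \cdots P(x^6|h_6)$.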

Scene 10 (1m 45s)

How to use Deep Learning?

Scene 11 (1m 51s)

Method 1: Tandem
A DNN state classifier takes an acoustic feature $x^i$ and outputs $P(a|x^i)$, $P(b|x^i)$, $P(c|x^i)$, ……; size of output layer = no. of states.
The DNN output is used as the new acoustic feature for the HMM. The last hidden layer or a bottleneck layer can also be used.

Scene 12 (2m 2s)

Method 2: DNN-HMM Hybrid
$P(x|a) = \frac{P(x,a)}{P(a)} = \frac{P(a|x)\,P(x)}{P(a)}$
$P(a|x)$: DNN output (the DNN can be a CNN, LSTM, …). $P(a)$: count from training data.
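The hybrid conversion can be sketched in a few lines: divide the DNN posterior $P(a|x)$ by the state prior $P(a)$ counted from training data, which gives $P(x|a)$ up to the factor $P(x)$ that is constant across states. The function name `scaled_log_likelihoods` and all numbers below are hypothetical, for illustration only.

```python
import math

def scaled_log_likelihoods(log_posteriors, state_counts):
    """Map log P(a|x) to log P(a|x) - log P(a), i.e. log P(x|a) minus the constant log P(x)."""
    total = sum(state_counts.values())
    scores = {}
    for state, log_post in log_posteriors.items():
        log_prior = math.log(state_counts[state] / total)  # P(a): count from training data
        scores[state] = log_post - log_prior
    return scores

# Hypothetical DNN output for one frame x and hypothetical state counts.
posteriors = {"a": 0.7, "b": 0.2, "c": 0.1}
counts = {"a": 500, "b": 300, "c": 200}

log_post = {s: math.log(p) for s, p in posteriors.items()}
scores = scaled_log_likelihoods(log_post, counts)
print(scores)
```

Because $P(x)$ does not depend on the state, these scaled scores can stand in for the emission probabilities when comparing states or alignments for the same frame.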

Scene 13 (2m 13s)

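How to train a state classifier?
Train an HMM-GMM model on utterance + label (without alignment) to obtain utterance + label (aligned).
Without alignment, only the acoustic features and the label are known. After alignment, each acoustic feature is paired with a state, e.g. state sequence: a a a b b c c.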

Scene 14 (2m 24s)

How to train a state classifier?
Train an HMM-GMM model on utterance + label (without alignment) to obtain utterance + label (aligned), then train DNN1 on the aligned data.

Scene 15 (2m 32s)

How to train a state classifier?
Use DNN1 to realign utterance + label (without alignment) into utterance + label (aligned), then train DNN2 on the realigned data.

Scene 16 (2m 39s)

Human Parity!
• Microsoft's speech recognition technology passes a major milestone: conversational recognition reaches human level! (2016.10)
• https://www.bnext.com.tw/article/41414/bn-2016-10-19-020437-216
Machine 5.9% vs. Human 5.9% [Yu, et al., INTERSPEECH'16]
• IBM vs Microsoft: 'Human parity' speech recognition record changes hands again (2017.03)
• http://www.zdnet.com/article/ibm-vs-microsoft-human-parity-speech-recognition-record-changes-hands-again/
Machine 5.5% vs. Human 5.1% [Saon, et al., INTERSPEECH'17]

Scene 17 (3m 10s)

Very Deep [Yu, et al., INTERSPEECH’16].

Scene 18 (3m 17s)

Back to End-to-end.

Scene 19 (3m 23s)

LAS
$P(Y|X) = ?$
• LAS directly computes $P(Y|X)$: $P(Y|X) = P(a|X)\,P(b|a,X) \cdots$
Decoding: $Y^* = \arg\max_Y \log P(Y|X)$ (beam search)
Training: $\theta^* = \arg\max_\theta \log P_\theta(\hat{Y}|X)$
(Figure: the LAS decoder produces a size-$V$ distribution at each step and outputs a, b, ….)

Scene 20 (3m 35s)

CTC, RNN-T
$P(Y|X) = ?$
• LAS directly computes $P(Y|X)$: $P(Y|X) = P(a|X)\,P(b|a,X) \cdots$
• CTC and RNN-T need alignment. The encoder outputs $h^1, h^2, h^3, h^4$; an alignment such as $h = a\,\phi\,b\,\phi$ (which maps to $a\,b$) has probability $P(h|X)$, and
$P(Y|X) = \sum_{h \in \mathrm{align}(Y)} P(h|X)$
Decoding: $Y^* = \arg\max_Y \log P(Y|X)$ (beam search)
Training: $\theta^* = \arg\max_\theta \log P_\theta(\hat{Y}|X)$

Scene 21 (3m 51s)

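HMM, CTC, RNN-T
HMM: $P_\theta(X|S) = \sum_{h \in \mathrm{align}(S)} P(X|h)$
CTC, RNN-T: $P_\theta(Y|X) = \sum_{h \in \mathrm{align}(Y)} P(h|X)$
1. Enumerate all the possible alignments
2. How to sum over all the alignments
3. Training: $\theta^* = \arg\max_\theta \log P_\theta(\hat{Y}|X)$, $\frac{\partial P_\theta(\hat{Y}|X)}{\partial \theta} = ?$
4. Testing (Inference, decoding): $Y^* = \arg\max_Y \log P(Y|X)$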

Scene 22 (4m 6s)

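HMM, CTC, RNN-T
HMM: $P_\theta(X|S) = \sum_{h \in \mathrm{align}(S)} P(X|h)$
CTC, RNN-T: $P_\theta(Y|X) = \sum_{h \in \mathrm{align}(Y)} P(h|X)$
1. Enumerate all the possible alignments
2. How to sum over all the alignments
3. Training: $\theta^* = \arg\max_\theta \log P_\theta(\hat{Y}|X)$, $\frac{\partial P_\theta(\hat{Y}|X)}{\partial \theta} = ?$
4. Testing (Inference, decoding): $Y^* = \arg\max_Y \log P(Y|X)$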

Scene 23 (4m 21s)

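All the alignments
HMM, CTC, RNN-T (LAS: "What are you all so busy with?" ☺)
Speech recognition: speech $X$ with $T = 6$ acoustic features → text "c a t" with $N = 3$ tokens. Each model defines its own set of alignments $h \in \mathrm{align}(Y)$:
HMM: duplicate tokens to length $T$, e.g. c c a a a t, c a a a a t, …
CTC: duplicate tokens and add φ to length $T$, e.g. c φ a a t t, φ c a φ t φ, …
RNN-T: add φ × $T$, e.g. c φ φ φ a φ φ t φ, c φ φ a φ φ t φ φ, ……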

Scene 24 (4m 36s)

HMM: duplicate "c a t" to length $T$: c c a a a t, c a a a a t, …
For n = 1 to $N$: output the n-th token $t_n$ times. Constraint: $t_1 + t_2 + \cdots + t_N = T$, $t_n > 0$.
Trellis graph: columns $x^1, \dots, x^6$, rows c, a, t. Each step either duplicates the current token or moves to the next token.

Scene 25 (4m 48s)

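HMM: duplicate "c a t" to length $T$: c c a a a t, c a a a a t, …
For n = 1 to $N$: output the n-th token $t_n$ times. Constraint: $t_1 + t_2 + \cdots + t_N = T$, $t_n > 0$.
Trellis graph: columns $x^1, \dots, x^6$, rows c, a, t. Each step either duplicates the current token or moves to the next token. (Figure: the trellis with the path c c c c c c marked.)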
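The HMM rule above (output the n-th token $t_n > 0$ times, with the counts summing to $T$) can be enumerated directly. This is a minimal sketch; the function name `hmm_alignments` is an illustrative choice, not something from the slides.

```python
from itertools import combinations

def hmm_alignments(tokens, T):
    """All ways to repeat each token t_n > 0 times so that the total length is T."""
    N = len(tokens)
    result = []
    # Choosing N-1 cut points among the T-1 gaps between frames fixes every t_n.
    for cuts in combinations(range(1, T), N - 1):
        bounds = (0,) + cuts + (T,)
        counts = [bounds[n + 1] - bounds[n] for n in range(N)]
        result.append("".join(tok * k for tok, k in zip(tokens, counts)))
    return result

aligns = hmm_alignments("cat", 6)
print(len(aligns))  # 10
print(aligns)
```

For "c a t" with $T = 6$ there are 10 alignments, including c c a a a t and c a a a a t from the slide. Each one is a path through the trellis that starts at c, ends at t, and only ever duplicates or moves to the next token.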

Scene 26 (5m 0s)

CTC: duplicate "c a t" and add φ to length $T$: c φ a a t t, φ c a φ t φ, …
output "φ" $c_0$ times
For n = 1 to $N$: output the n-th token $t_n$ times, then output "φ" $c_n$ times
Constraint: $t_1 + t_2 + \cdots + t_N + c_0 + c_1 + \cdots + c_N = T$, $t_n > 0$, $c_n \ge 0$
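The CTC constraint can be turned into an enumeration in the same way: choose token repeat counts $t_n > 0$ and blank counts $c_n \ge 0$ that together sum to $T$. This is a sketch under the assumption that no two neighbouring tokens are identical (true for "c a t"); the helper names `compositions` and `ctc_alignments` are hypothetical.

```python
BLANK = "φ"

def compositions(total, parts, minimum):
    """Yield all tuples of `parts` integers >= minimum that sum to `total`."""
    if parts == 1:
        if total >= minimum:
            yield (total,)
        return
    for first in range(minimum, total - minimum * (parts - 1) + 1):
        for rest in compositions(total - first, parts - 1, minimum):
            yield (first,) + rest

def ctc_alignments(tokens, T):
    """All length-T strings of the form φ^c0 tok1^t1 φ^c1 ... tokN^tN φ^cN."""
    N = len(tokens)
    result = []
    for n_token_frames in range(N, T + 1):                 # frames spent on tokens
        for t in compositions(n_token_frames, N, 1):       # t_n > 0
            for c in compositions(T - n_token_frames, N + 1, 0):  # c_n >= 0
                h = BLANK * c[0]
                for n in range(N):
                    h += tokens[n] * t[n] + BLANK * c[n + 1]
                result.append(h)
    return result

ctc_aligns = ctc_alignments("cat", 6)
print(len(ctc_aligns))  # 84
```

For "c a t" with $T = 6$ this gives 84 alignments, compared with 10 for the HMM, because φ may appear at any of the $N + 1$ gaps.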

Scene 27 (5m 12s)

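CTC: duplicate "c a t" and add φ to length $T$: c φ a a t t, φ c a φ t φ, …
Trellis: columns $x^1, \dots, x^6$, rows φ c φ a φ t φ. An alignment may start at φ or at c. From a token, the next step is one of: duplicate, insert φ, or next token (φ can be skipped).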

Scene 28 (5m 21s)

CTC: duplicate "c a t" and add φ to length $T$: c φ a a t t, φ c a φ t φ, …
Trellis: columns $x^1, \dots, x^6$, rows φ c φ a φ t φ. From φ, the next step is one of: duplicate φ, or next token. A step cannot skip any token.

Scene 29 (5m 30s)

CTC: duplicate "c a t" and add φ to length $T$: c φ a a t t, φ c a φ t φ, …
Trellis: columns $x^1, \dots, x^6$, rows φ c φ a φ t φ. From a token: duplicate, insert φ, or next token. From φ: duplicate or next token.

Scene 30 (5m 39s)

CTC: duplicate "c a t" and add φ to length $T$: c φ a a t t, φ c a φ t φ, …
Trellis: columns $x^1, \dots, x^6$, rows φ c φ a φ t φ. (Figure: example paths through the trellis.)

Scene 31 (5m 48s)

CTC: duplicate "c a t" and add φ to length $T$: c φ a a t t, φ c a φ t φ, …
Trellis: columns $x^1, \dots, x^6$, rows φ c φ a φ t φ. (Figure: further example paths through the trellis.)

Scene 32 (5m 57s)

CTC: duplicate tokens and add φ to length $T$.
Trellis: columns $x^1, \dots, x^6$, rows φ s φ e φ e φ. Moves: duplicate, insert φ, next token.
Exception: when the next token is the same token, the φ between them cannot be skipped, because … ee … → e.
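The exception for repeated tokens follows from how a CTC alignment is mapped back to text: runs of the same symbol are merged first, then φ is removed. The sketch below assumes that standard collapsing rule; the function name `ctc_collapse` is an illustrative choice.

```python
from itertools import groupby

BLANK = "φ"

def ctc_collapse(alignment):
    """Merge runs of identical symbols, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(alignment)]
    return "".join(symbol for symbol in merged if symbol != BLANK)

print(ctc_collapse("cφaatt"))  # cat
print(ctc_collapse("sφeeφe"))  # see
print(ctc_collapse("sφeeee"))  # se
```

Without a φ between the two e's, the run "eeee" merges into a single e, so an alignment for "s e e" has to pass through the φ row between them.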

Scene 33 (6m 8s)

RNN-T: add φ × $T$: c φ φ φ a φ φ t φ, c φ φ a φ φ t φ φ, ……
output "φ" $c_0$ times
For n = 1 to $N$: output the n-th token 1 time, then output "φ" $c_n$ times
Constraint: $c_0 + c_1 + \cdots + c_N = T$; $c_N > 0$; $c_n \ge 0$ for n = 1 to $N - 1$
For "c a t": putting some φ before c, between c and a, and between a and t is optional; after t, put some φ at least once.
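The RNN-T rule can be enumerated as well: every token appears exactly once, and $T$ blanks are distributed over the $N + 1$ slots with at least one φ in the last slot. The sketch assumes $c_0 \ge 0$, which matches the slide's examples that start directly with c; the helper names `blank_counts` and `rnnt_alignments` are hypothetical.

```python
BLANK = "φ"

def blank_counts(total, slots):
    """Yield tuples of `slots` non-negative integers summing to `total`."""
    if slots == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in blank_counts(total - first, slots - 1):
            yield (first,) + rest

def rnnt_alignments(tokens, T):
    """All strings φ^c0 tok1 φ^c1 ... tokN φ^cN with sum(c) = T and cN > 0."""
    N = len(tokens)
    result = []
    for c in blank_counts(T, N + 1):
        if c[N] == 0:  # the slot after the last token needs at least one φ
            continue
        h = BLANK * c[0]
        for n in range(N):
            h += tokens[n] + BLANK * c[n + 1]
        result.append(h)
    return result

rnnt_aligns = rnnt_alignments("cat", 6)
print(len(rnnt_aligns))  # 56
```

Every alignment has length $N + T = 9$ and ends in φ. For "c a t" with $T = 6$ there are 56 of them.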

Scene 34 (6m 25s)

RNN-T: add φ × $T$: c φ φ φ a φ φ t φ, c φ φ a φ φ t φ φ, ……
Trellis: columns $x^1, \dots, x^6$, rows c, a, t. Each step either outputs a token or inserts φ.

Scene 35 (6m 35s)

RNN-T: add φ × $T$: c φ φ φ a φ φ t φ, c φ φ a φ φ t φ φ, ……
Trellis: columns $x^1, \dots, x^6$, rows c, a, t. Each step either outputs a token or inserts φ. (Figure: example paths through the trellis.)

Scene 36 (6m 45s)

RNN-T: add φ × $T$: c φ φ φ a φ φ t φ, c φ φ a φ φ t φ φ, ……
Trellis: columns $x^1, \dots, x^6$, rows c, a, t. Each step either outputs a token or inserts φ. (Figure: a further example path through the trellis.)

Scene 37 (6m 54s)

HMM, CTC, RNN-T: state diagrams from Start to End for "c a t". The CTC and RNN-T diagrams add φ states, and three states are labeled "not gen".

Scene 38 (7m 1s)

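HMM, CTC, RNN-T
HMM: $P_\theta(X|Y) = \sum_{h \in \mathrm{align}(Y)} P(X|h)$
CTC, RNN-T: $P_\theta(Y|X) = \sum_{h \in \mathrm{align}(Y)} P(h|X)$
1. Enumerate all the possible alignments
2. How to sum over all the alignments
3. Training: $\theta^* = \arg\max_\theta \log P_\theta(\hat{Y}|X)$, $\frac{\partial P_\theta(\hat{Y}|X)}{\partial \theta} = ?$
4. Testing (Inference, decoding): $Y^* = \arg\max_Y \log P(Y|X)$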

Scene 39 (7m 16s)

This part is challenging.

Scene 40 (7m 23s)

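Score Computation
Trellis: columns $x^1, \dots, x^6$, rows c, a, t. Each step either outputs a token or inserts φ.
$h = \phi\; c\; \phi\; \phi\; a\; \phi\; t\; \phi\; \phi$
$P(h|X) = P(\phi|X) \times P(c|X, \phi) \times P(\phi|X, \phi c) \times \cdots$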

Scene 41 (7m 33s)

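For $h = \phi\; c\; \phi\; \phi\; a\; \phi\; t\; \phi\; \phi$, each symbol is drawn from a distribution $p_{i,j}$ computed from the encoder output $h^i$ and the token-RNN output $l^j$. The token RNN reads <BOS>, c, a, t and produces $l^0, l^1, l^2, l^3$.
Encoder output used at each step: $h^1\; h^2\; h^2\; h^3\; h^4\; h^4\; h^5\; h^5\; h^6$
Token-RNN output used at each step: $l^0\; l^0\; l^1\; l^1\; l^1\; l^2\; l^2\; l^3\; l^3$
Probabilities: $p_{1,0}(\phi)\; p_{2,0}(c)\; p_{2,1}(\phi)\; p_{3,1}(\phi)\; p_{4,1}(a)\; p_{4,2}(\phi)\; p_{5,2}(t)\; p_{5,3}(\phi)\; p_{6,3}(\phi)$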

Scene 42 (7m 44s)

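Score Computation
Trellis: columns $x^1, \dots, x^6$, rows c, a, t.
$h = \phi\; c\; \phi\; \phi\; a\; \phi\; t\; \phi\; \phi$
$P(h|X) = p_{1,0}(\phi)\, p_{2,0}(c)\, p_{2,1}(\phi)\, p_{3,1}(\phi)\, p_{4,1}(a)\, p_{4,2}(\phi)\, p_{5,2}(t)\, p_{5,3}(\phi)\, p_{6,3}(\phi)$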
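The score of one RNN-T alignment can be sketched as a walk over the trellis: start at cell $(i, j) = (1, 0)$, multiply in $p_{i,j}$ of each symbol, then increase $i$ after a φ (read the next acoustic feature) or $j$ after a token. The table of distributions here is random and purely hypothetical; in the real model each $p_{i,j}$ comes from the network. The names `random_table` and `alignment_score` are illustrative.

```python
import random

BLANK = "φ"
VOCAB = ["c", "a", "t", BLANK]

def random_table(T, N, seed=0):
    """Hypothetical stand-in for the network: one distribution p_{i,j} per trellis cell."""
    rng = random.Random(seed)
    table = {}
    for i in range(1, T + 1):
        for j in range(N + 1):
            weights = [rng.random() + 0.1 for _ in VOCAB]
            total = sum(weights)
            table[(i, j)] = {s: w / total for s, w in zip(VOCAB, weights)}
    return table

def alignment_score(h, p):
    """Return P(h|X) and the list of (i, j, symbol) factors that were multiplied."""
    i, j = 1, 0
    score = 1.0
    used = []
    for symbol in h:
        score *= p[(i, j)][symbol]
        used.append((i, j, symbol))
        if symbol == BLANK:
            i += 1  # φ: move on to the next acoustic feature
        else:
            j += 1  # token: move on to the next token
    return score, used

p = random_table(T=6, N=3)
score, used = alignment_score("φcφφaφtφφ", p)
print(used)
print(score)
```

The cells visited for φ c φ φ a φ t φ φ are (1,0), (2,0), (2,1), (3,1), (4,1), (4,2), (5,2), (5,3), (6,3), matching the subscripts on the slide.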

Scene 43 (7m 57s)

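Score Computation
Trellis: columns $x^1, \dots, x^6$, rows c, a, t. The token RNN reads <BOS>, c, a and produces $l^0, l^1, l^2$.
Every path that reaches cell (4, 2) uses the same distribution $p_{4,2}$, because φ is not considered by the token RNN!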

Scene 44 (8m 6s)

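The token RNN reads <BOS>, c, a, t and produces $l^0, l^1, l^2, l^3$.
Encoder output used at each step: $h^1\; h^2\; h^2\; h^3\; h^4\; h^4\; h^5\; h^5\; h^6$
Token-RNN output used at each step: $l^0\; l^0\; l^1\; l^1\; l^1\; l^2\; l^2\; l^3\; l^3$
Probabilities: $p_{1,0}(\phi)\; p_{2,0}(c)\; p_{2,1}(\phi)\; p_{3,1}(\phi)\; p_{4,1}(a)\; p_{4,2}(\phi)\; p_{5,2}(t)\; p_{5,3}(\phi)\; p_{6,3}(\phi)$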

Scene 45 (8m 15s)

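(Figure: the last steps of the alignment, using encoder outputs $h^4\; h^5\; h^5\; h^6$, token-RNN outputs $l^2\; l^2\; l^3\; l^3$, and probabilities $p_{4,2}(\phi)\; p_{5,2}(t)\; p_{5,3}(\phi)\; p_{6,3}(\phi)$. The token RNN reads <BOS>, c, a, t.)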

Scene 46 (8m 22s)

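Trellis: columns $x^1, \dots, x^6$, rows c, a, t.
$\alpha_{i,j}$: the summation of the scores of all the alignments that have read $i$ acoustic features and output $j$ tokens.
$\alpha_{4,2} = \alpha_{4,1}\, p_{4,1}(a) + \alpha_{3,2}\, p_{3,2}(\phi)$
From $\alpha_{4,1}$: generate "a". From $\alpha_{3,2}$: generate "φ" and read $x^4$.
Goal: $\sum_{h \in \mathrm{align}(Y)} P(h|X)$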

Scene 47 (8m 38s)

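Trellis: columns $x^1, \dots, x^6$, rows c, a, t.
$\alpha_{4,2} = \alpha_{4,1}\, p_{4,1}(a) + \alpha_{3,2}\, p_{3,2}(\phi)$
$\alpha_{i,j}$: the summation of the scores of all the alignments that have read $i$ acoustic features and output $j$ tokens.
Applying this recurrence to every cell, you can compute the summation of the scores of all the alignments.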
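The recurrence can be checked numerically. The sketch below fills in $\alpha_{i,j}$ cell by cell, starting from the assumed initial value $\alpha_{1,0} = 1$, and compares the result with an explicit sum over every path through the trellis. The distributions are random and hypothetical, and the names `random_table`, `forward` and `sum_over_paths` are illustrative.

```python
import random

BLANK = "φ"

def random_table(tokens, T, seed=0):
    """Hypothetical stand-in for the network outputs p_{i,j}."""
    rng = random.Random(seed)
    symbols = sorted(set(tokens)) + [BLANK]
    table = {}
    for i in range(1, T + 1):
        for j in range(len(tokens) + 1):
            weights = [rng.random() + 0.1 for _ in symbols]
            total = sum(weights)
            table[(i, j)] = {s: w / total for s, w in zip(symbols, weights)}
    return table

def forward(tokens, T, p):
    """alpha[(i, j)]: summed score of all partial alignments reaching cell (i, j)."""
    N = len(tokens)
    alpha = {(1, 0): 1.0}
    for i in range(1, T + 1):
        for j in range(N + 1):
            if (i, j) == (1, 0):
                continue
            total = 0.0
            if j > 0:  # came from (i, j-1) by generating token j
                total += alpha[(i, j - 1)] * p[(i, j - 1)][tokens[j - 1]]
            if i > 1:  # came from (i-1, j) by generating φ and reading x^i
                total += alpha[(i - 1, j)] * p[(i - 1, j)][BLANK]
            alpha[(i, j)] = total
    # The final φ, emitted from cell (T, N), ends the alignment.
    return alpha[(T, N)] * p[(T, N)][BLANK], alpha

def sum_over_paths(tokens, T, p):
    """Explicitly walk every path from cell (1, 0) and add up the path scores."""
    N = len(tokens)
    def walk(i, j):
        total = 0.0
        if j < N:
            total += p[(i, j)][tokens[j]] * walk(i, j + 1)
        if i < T:
            total += p[(i, j)][BLANK] * walk(i + 1, j)
        elif j == N:
            total += p[(i, j)][BLANK]
        return total
    return walk(1, 0)

tokens, T = "cat", 6
p = random_table(tokens, T)
total, alpha = forward(tokens, T, p)
brute = sum_over_paths(tokens, T, p)
print(total, brute)
```

The two numbers agree up to floating-point error, but the forward pass touches each of the $T \times (N + 1)$ cells once, while the explicit walk visits every alignment separately.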

Scene 48 (8m 54s)

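HMM, CTC, RNN-T
HMM: $P_\theta(X|Y) = \sum_{h \in \mathrm{align}(Y)} P(X|h)$
CTC, RNN-T: $P_\theta(Y|X) = \sum_{h \in \mathrm{align}(Y)} P(h|X)$
1. Enumerate all the possible alignments
2. How to sum over all the alignments
3. Training: $\theta^* = \arg\max_\theta \log P_\theta(\hat{Y}|X)$, $\frac{\partial P_\theta(\hat{Y}|X)}{\partial \theta} = ?$
4. Testing (Inference, decoding): $Y^* = \arg\max_Y \log P(Y|X)$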

Scene 49 (9m 9s)

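Training
$\theta^* = \arg\max_\theta \log P_\theta(\hat{Y}|X)$, where $P(\hat{Y}|X) = \sum_h P(h|X)$. $\frac{\partial P(\hat{Y}|X)}{\partial \theta} = ?$
For $h = \phi\; c\; \phi\; \phi\; a\; \phi\; t\; \phi\; \phi$: $P(h|X) = p_{1,0}(\phi)\, p_{2,0}(c)\, p_{2,1}(\phi)\, p_{3,1}(\phi)\, p_{4,1}(a)\, p_{4,2}(\phi)\, p_{5,2}(t)\, p_{5,3}(\phi)\, p_{6,3}(\phi)$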

Scene 50 (9m 22s)

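Trellis: columns $x^1, \dots, x^6$, rows c, a, t.
Each arrow, for example $p_{4,1}(a)$ or $p_{3,2}(\phi)$, is a component in $P(\hat{Y}|X) = \sum_h P(h|X)$.
For one alignment: $P(h|X) = p_{1,0}(\phi)\, p_{2,0}(c)\, p_{2,1}(\phi)\, p_{3,1}(\phi)\, p_{4,1}(a)\, p_{4,2}(\phi)\, p_{5,2}(t)\, p_{5,3}(\phi)\, p_{6,3}(\phi)$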