
DSpace

The institutional-repository DSpace system is dedicated to preserving digital materials of all kinds (e.g., text, images, PDF) and making them easy to access.

NTU Theses and Dissertations Repository › College of Electrical Engineering and Computer Science › Graduate Institute of Communication Engineering
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/50681

Full metadata record (DC field: value, language)
dc.contributor.advisor: 李琳山
dc.contributor.author: Hsiang-Hung Lu (en)
dc.contributor.author: 呂相弘 (zh_TW)
dc.date.accessioned: 2021-06-15T12:52:22Z
dc.date.available: 2016-07-26
dc.date.copyright: 2016-07-26
dc.date.issued: 2016
dc.date.submitted: 2016-07-19
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/50681
dc.description.abstract (zh_TW, translated): With the rise of big data, speech recognition technologies have steadily matured, and people look forward to what an era of voice can bring. These technologies now travel: no longer resources exclusive to advanced countries, they can be used by speakers of every language in every region of the world. Human speech in these different languages forms distinct systems, yet all of it shares one property: it is the signal medium through which humans understand one another, carrying emotion, ideas, information, and meaning.
This thesis investigates how corpora of different languages can help each other during learning, extending a conventional monolingual speech recognition system into a multilingual one and uncovering the crosslingual knowledge latent within it, in the hope of strengthening the recognition system for each language. Using the GlobalPhone multilingual corpus, we begin with purely linguistic knowledge, add data-driven methods, and finally merge the hidden layers of deep neural networks, exploring step by step, from coarse to fine, how the knowledge shared across languages in acoustic models can be combined.
Once a multilingual recognition system is built, the deep-learning model becomes larger and its training more complex. To hold this large amount of information while remaining convenient for real-time use, this thesis also investigates knowledge distillation, condensing the large multilingual model into a smaller one; this successfully extracts richer crosslingual generalization information and helps the multilingual speech recognition system become more accurate.
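The abstract above describes merging acoustic-model knowledge across languages from coarse to fine, one stage of which is data-driven clustering of confusable phoneme units. As a rough illustration only — a hypothetical, greatly simplified stand-in for the thesis's confusion-matrix-based hierarchical agglomerative merging, with made-up phone labels — a greedy average-linkage merge over symmetrized confusion counts might look like:

```python
import numpy as np

def merge_phones(confusion, labels, n_clusters):
    """Greedy agglomerative merging of phone units.

    Repeatedly merges the pair of clusters with the highest mutual
    confusion (average linkage over symmetrized confusion counts)
    until only `n_clusters` shared clusters remain.
    """
    # Symmetrize: confusion of i-as-j plus j-as-i.
    sim = confusion + confusion.T
    clusters = [[i] for i in range(len(labels))]
    while len(clusters) > n_clusters:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise similarity between the two clusters.
                s = np.mean([sim[i, j] for i in clusters[a] for j in clusters[b]])
                if s > best:
                    best, pair = s, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)  # b > a, so index a stays valid
    return [[labels[i] for i in c] for c in clusters]
```

Here the `EN_`/`DE_` prefixes in a call such as `merge_phones(conf, ["EN_a", "DE_a", "EN_t", "DE_t"], 2)` would mark the (hypothetical) source language of each phone; the units the recognizers confuse most often are merged first.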
dc.description.abstract (en): Speech signal processing technologies have matured along with the era of big data, and the appeal of voice technology draws wide attention. These resources are no longer held by only a few powerful companies; they are shared by speakers of different languages all over the world. The various forms of human speech each have unique properties, yet they all share one: people rely on speech to understand one another.
This thesis focuses on making speech data from different languages cooperate to enhance conventional monolingual speech recognition systems, by finding and exploiting latent crosslingual information. Using the GlobalPhone corpus, we investigate linguistic knowledge, data-driven methods, and model-sharing techniques. The research proceeds step by step from coarse phonetic-level merging to fine-grained model-level sharing, achieving better results through crosslingual information.
Once multilingual speech recognition systems are built, the models become deep and cumbersome, and training involves more complex and time-consuming techniques. To capture the generalization ability of these huge models in a small, readily deployable, real-time model, one can use knowledge distillation to extract the information, thereby achieving model compression.
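The compression step mentioned above follows the standard knowledge-distillation recipe: soften both the teacher's and the student's output distributions with a temperature softmax, then train the student on a weighted sum of the soft-target cross-entropy and the ordinary hard-label cross-entropy. A minimal sketch of those two ingredients follows; the `T` and `alpha` values are illustrative, not the thesis's settings:

```python
import numpy as np

def temperature_softmax(logits, T=1.0):
    # Divide logits by temperature T before the softmax; larger T
    # produces a softer (higher-entropy) distribution over classes.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_label, T=2.0, alpha=0.5):
    # Soft part: cross-entropy between the teacher's and the student's
    # distributions, both computed at temperature T.
    p_teacher = temperature_softmax(teacher_logits, T)
    p_student = temperature_softmax(student_logits, T)
    soft_ce = -np.sum(p_teacher * np.log(p_student))
    # Hard part: ordinary cross-entropy against the ground-truth label.
    hard_ce = -np.log(temperature_softmax(student_logits)[hard_label])
    return alpha * (T ** 2) * soft_ce + (1.0 - alpha) * hard_ce
```

The `T ** 2` factor on the soft term is the conventional scaling that keeps the soft- and hard-loss gradient magnitudes comparable as the temperature changes.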
dc.description.provenance (en): Made available in DSpace on 2021-06-15T12:52:22Z (GMT). No. of bitstreams: 1. ntu-105-R03942039-1.pdf: 3551605 bytes, checksum: ac364461065469fef8834392d0198d94 (MD5). Previous issue date: 2016.
dc.description.tableofcontents (zh_TW, translated):
Acknowledgements
Chinese Abstract
English Abstract
Chapter 1: Introduction
  1.1 Motivation and Background
  1.2 Related Work
  1.3 Research Directions
  1.4 Thesis Organization
Chapter 2: Background
  2.1 Deep Neural Networks (DNN)
    2.1.1 Overview
    2.1.2 Training Methods
    2.1.3 Dropout
  2.2 GlobalPhone Corpus
    2.2.1 Corpus Overview
    2.2.2 International Phonetic Alphabet (IPA)
  2.3 Multilingual Speech Recognition Systems
    2.3.1 Code-Mixing
    2.3.2 Crosslingual Transfer Learning
    2.3.3 Multilingual Sharing
Chapter 3: Monolingual Baseline Experiments
  3.1 Monolingual Speech Recognition System Architecture
    3.1.1 Feature Extraction
    3.1.2 Acoustic Modeling
    3.1.3 Language Modeling
    3.1.4 Decoder
  3.2 DNN Baseline Experiment Design
    3.2.1 Language Models
    3.2.2 Dropout
    3.2.3 Model Depth
    3.2.4 Baseline Experiment Summary
Chapter 4: Multilingual Acoustic Model Merging
  4.1 Evaluating Multilingual Speech Recognition Systems
  4.2 IPA-Based Phoneme Merging
    4.2.1 Phoneme Merging Table
    4.2.2 Experimental Results and Analysis
  4.3 Data-Driven Phoneme State Merging Based on Confusion Matrices
    4.3.1 Confusion Matrices
    4.3.2 Hierarchical Agglomerative Clustering and Merging
    4.3.3 Multilingual Systems with Merged Triphone States
    4.3.4 Adding IPA Constraints
    4.3.5 Experimental Results and Analysis
  4.4 Model-Sharing-Based Hidden Layer Merging
    4.4.1 Merging DNN Hidden Layers
    4.4.2 Experimental Results and Comparison
  4.5 Chapter Comparison and Overall Analysis
Chapter 5: Knowledge Distillation
  5.1 Overview
  5.2 Temperature Softmax
  5.3 Distillation Setup and Procedure
  5.4 Multilingual Knowledge Distillation
  5.5 Experimental Results and Analysis
    5.5.1 Monolingual Knowledge Distillation
    5.5.2 Multilingual Knowledge Distillation
    5.5.3 Distillation Parameters
    5.5.4 Chapter Summary
Chapter 6: Conclusion and Future Work
  6.1 Main Contributions
  6.2 Future Research Directions
References
dc.language.iso: zh-TW
dc.subject (zh_TW, translated): Crosslingual Information; Multilingual; Speech Recognition; Deep Learning; Knowledge Distillation
dc.subject (en): Crosslingual Information; Speech Recognition; Deep Learning; Multilingual; Knowledge Distillation
dc.title: 使用深層學習的語音辨識中的跨語言聲學模型 (zh_TW)
dc.title: Crosslingual Acoustic Modeling in Speech Recognition Using Deep Learning (en)
dc.type: Thesis
dc.date.schoolyear: 104-2
dc.description.degree: Master (碩士)
dc.contributor.oralexamcommittee: 李宏毅, 鄭秋豫, 陳信宏, 王小川, 簡仁宗
dc.subject.keyword (zh_TW, translated): Multilingual, Speech Recognition, Crosslingual Information, Deep Learning, Knowledge Distillation
dc.subject.keyword (en): Multilingual, Speech Recognition, Crosslingual Information, Deep Learning, Knowledge Distillation
dc.relation.page: 71
dc.identifier.doi: 10.6342/NTU201601009
dc.rights.note: 有償授權 (paid license)
dc.date.accepted: 2016-07-20
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) (zh_TW)
dc.contributor.author-dept: 電信工程學研究所 (Graduate Institute of Communication Engineering) (zh_TW)
Appears in Collections: Graduate Institute of Communication Engineering (電信工程學研究所)

Files in This Item:
ntu-105-1.pdf — 3.47 MB, Adobe PDF — not authorized for public access


All items in this system are protected by copyright, with all rights reserved, unless their copyright terms are otherwise indicated.

Contact Information
No. 1, Sec. 4, Roosevelt Rd., Da'an Dist., Taipei 10617, Taiwan (R.O.C.)
Tel: (02)33662353
Email: ntuetds@ntu.edu.tw
© NTU Library All Rights Reserved