NTU Theses and Dissertations Repository

Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/54516
Full metadata record (DC field: value [language]):
dc.contributor.advisor: 李宏毅 (Hung-Yi Lee)
dc.contributor.author: Jui-Yang Hsu [en]
dc.contributor.author: 徐瑞陽 [zh_TW]
dc.date.accessioned: 2021-06-16T03:01:34Z
dc.date.available: 2021-02-20
dc.date.copyright: 2021-02-20
dc.date.issued: 2020
dc.date.submitted: 2021-02-05
dc.identifier.citation:
[1] F. Jelinek, “Continuous speech recognition by statistical methods,” Proceedings of the IEEE, vol. 64, no. 4, pp. 532–556, 1976.
[2] L. Bahl, P. Brown, P. de Souza, and R. Mercer, “Maximum mutual information estimation of hidden Markov model parameters for speech recognition,” in ICASSP’86. IEEE International Conference on Acoustics, Speech, and Signal Processing, 1986, vol. 11, pp. 49–52.
[3] Lawrence R Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[4] G. D. Forney, “The Viterbi algorithm,” Proceedings of the IEEE, vol. 61, no. 3, pp. 268–278, 1973.
[5] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
[6] Li Deng and Dong Yu, “Deep learning: methods and applications,” Foundations and trends in signal processing, vol. 7, no. 3–4, pp. 197–387, 2014.
[7] Chung-Cheng Chiu, Tara N Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J Weiss, Kanishka Rao, Ekaterina Gonina, et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4774–4778.
[8] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 4960–4964.
[9] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al., “Deep speech 2: End-to-end speech recognition in English and Mandarin,” in International conference on machine learning, 2016, pp. 173–182.
[10] Linhao Dong, Shuang Xu, and Bo Xu, “Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5884–5888.
[11] Joaquin Vanschoren, “Meta-learning: A survey,” arXiv preprint arXiv:1810.03548, 2018.
[12] Lilian Weng, “Meta-learning: Learning to learn fast,” lilianweng.github.io/lil-log, 2018.
[13] Sebastian Ruder, “An overview of multi-task learning in deep neural networks,” arXiv preprint arXiv:1706.05098, 2017.
[14] Jaejin Cho, Murali Karthick Baskar, Ruizhi Li, Matthew Wiesner, Sri Harish Mallidi, Nelson Yalta, Martin Karafiat, Shinji Watanabe, and Takaaki Hori, “Multilingual sequence-to-sequence speech recognition: architecture, transfer learning, and language modeling,” in 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 521–527.
[15] Jiangyan Yi, Jianhua Tao, Zhengqi Wen, and Ye Bai, “Language-adversarial transfer learning for low-resource speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 3, pp. 621–630, 2018.
[16] Oliver Adams, Matthew Wiesner, Shinji Watanabe, and David Yarowsky, “Massively multilingual adversarial speech recognition,” arXiv preprint arXiv:1904.02210, 2019.
[17] Jake Snell, Kevin Swersky, and Richard Zemel, “Prototypical networks for few-shot learning,” in Advances in neural information processing systems, 2017, pp. 4077–4087.
[18] Chelsea Finn, Pieter Abbeel, and Sergey Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” arXiv preprint arXiv:1703.03400, 2017.
[19] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap, “Meta-learning with memory-augmented neural networks,” in International conference on machine learning, 2016, pp. 1842–1850.
[20] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales, “Learning to compare: Relation network for few-shot learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1199–1208.
[21] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando de Freitas, “Learning to learn by gradient descent by gradient descent,” in Advances in neural information processing systems, 2016, pp. 3981–3989.
[22] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al., “Matching networks for one shot learning,” in Advances in neural information processing systems, 2016, pp. 3630–3638.
[23] Jiatao Gu, Yong Wang, Yun Chen, Kyunghyun Cho, and Victor OK Li, “Meta-learning for low-resource neural machine translation,” arXiv preprint arXiv:1808.08437, 2018.
[24] Ondřej Klejch, Joachim Fainberg, Peter Bell, and Steve Renals, “Speaker adaptive training using model agnostic meta-learning,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019, pp. 881–888.
[25] Jui-Yang Hsu, Yuan-Jui Chen, and Hung-yi Lee, “Meta learning for end-to-end low-resource speech recognition,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7844–7848.
[26] Genta Indra Winata, Samuel Cahyawijaya, Zihan Liu, Zhaojiang Lin, Andrea Madotto, Peng Xu, and Pascale Fung, “Learning fast adaptation on cross-accented speech recognition,” arXiv preprint arXiv:2003.01901, 2020.
[27] Yann LeCun, Yoshua Bengio, et al., “Convolutional networks for images, speech, and time series,” The handbook of brain theory and neural networks, vol. 3361, no. 10, p. 1995, 1995.
[28] Michael Collins and Nigel Duffy, “Convolution kernels for natural language,” in Advances in neural information processing systems, 2002, pp. 625–632.
[29] Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, and Gerald Penn, “Applying convolutional neural networks concepts to hybrid nn-hmm model for speech recognition,” in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2012, pp. 4277–4280.
[30] Yanmin Qian, Mengxiao Bi, Tian Tan, and Kai Yu, “Very deep convolutional neural networks for noise robust speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 12, pp. 2263–2276, 2016.
[31] Takaaki Hori, Shinji Watanabe, Yu Zhang, and William Chan, “Advances in joint ctc-attention based end-to-end speech recognition with a deep cnn encoder and rnn-lm,” arXiv preprint arXiv:1706.02737, 2017.
[32] Jerome T Connor, R Douglas Martin, and Les E Atlas, “Recurrent neural networks and robust time series prediction,” IEEE transactions on neural networks, vol. 5, no. 2, pp. 240–254, 1994.
[33] Tomáš Mikolov, Stefan Kombrink, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur, “Extensions of recurrent neural network language model,” in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2011, pp. 5528–5531.
[34] Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
[36] Ilya Sutskever, Oriol Vinyals, and Quoc V Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, 2014, pp. 3104–3112.
[37] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[38] N. T. Vu, D. Imseng, D. Povey, P. Motlicek, T. Schultz, and H. Bourlard, “Multilingual deep neural network based acoustic modeling for rapid language adaptation,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 7639–7643.
[39] Sibo Tong, Philip Neil Garner, and Hervé Bourlard, “An investigation of deep neural networks for multilingual speech recognition training and adaptation,” in INTERSPEECH, 2017.
[40] Adriana Stan, Oliver Watts, Yoshitaka Mamiya, Mircea Giurgiu, Robert AJ Clark, Junichi Yamagishi, and Simon King, “Tundra: a multilingual corpus of found data for tts research created with light supervision,” in INTERSPEECH, 2013, pp. 2331–2335.
[41] Thibault Viglino, Petr Motlicek, and Milos Cernak, “End-to-end accented speech recognition,” in INTERSPEECH, 2019, pp. 2140–2144.
[42] Abhinav Jain, Vishwanath P Singh, and Shakti P Rath, “A multi-accent acoustic model using mixture of experts for speech recognition,” in INTERSPEECH, 2019, pp. 779–783.
[43] Sining Sun, Ching-Feng Yeh, Mei-Yuh Hwang, Mari Ostendorf, and Lei Xie, “Domain adversarial training for accented speech recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4854–4858.
[44] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, et al., “Espnet: End-to-end speech processing toolkit,” arXiv preprint arXiv:1804.00015, 2018.
[45] Aniruddh Raghu, Maithra Raghu, Samy Bengio, and Oriol Vinyals, “Rapid learning or feature reuse? towards understanding the effectiveness of maml,” arXiv preprint arXiv:1909.09157, 2019.
[46] Mark JF Gales, Kate M Knill, Anton Ragni, and Shakti P Rath, “Speech recognition and keyword spotting for low-resource languages: Babel project research at cued,” in Fourth International Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU-2014). International Speech Communication Association (ISCA), 2014, pp. 16–23.
[47] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales, “Learning to generalize: Meta-learning for domain generalization,” arXiv preprint arXiv:1710.03463, 2017.
[48] Yi-Chen Chen, Jui-Yang Hsu, Cheng-Kuang Lee, and Hung-yi Lee, “DARTS-ASR: Differentiable architecture search for multilingual speech recognition and adaptation,” 2020.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/54516
dc.description.abstract: This thesis investigates how well different transfer learning methods perform for automatic speech recognition in various scenarios when labeled data are limited. The study centers on Model-Agnostic Meta Learning (MAML), a meta learning method that has seen initial success in few-shot image recognition and reinforcement learning since 2017, and on multitask learning, which has long been established in the speech field. Three scenarios are examined: cross-lingual phoneme recognition, cross-accent end-to-end speech recognition, and cross-lingual end-to-end speech recognition. Moving from acoustic modeling to end-to-end speech recognition, from a relatively simple deep neural network to the more complex Transformer model, and from the cross-accent scenario with more similar data to the cross-lingual scenario, the thesis step by step extends the boundary of applying meta learning to speech-related tasks. By varying the pretraining datasets, validation datasets, number of fine-tuning iterations, amount of fine-tuning data, and sampling strategies during pretraining, it attempts to identify under what conditions meta learning brings larger performance gains. Experimental results show that in extremely low-resource cross-lingual phoneme recognition, and in low-resource cross-accent end-to-end speech recognition where the data are more similar, meta learning transfers better than multitask learning; however, in cross-lingual end-to-end speech recognition, where the data differ more and the model cannot be trained on too small a corpus, its performance is merely on par with multitask learning. This characterizes the properties a speech task scenario should have for meta learning to be worthwhile, serving as a reference for follow-up research. [zh_TW]
dc.description.abstract: This thesis surveys various transfer learning methods for speech recognition under low-resource settings. In addition to multitask learning, a popular implementation of transfer learning, we introduce meta learning methods into speech processing. The thesis uses cross-language phoneme recognition, cross-accent end-to-end speech recognition, and cross-language end-to-end speech recognition as testing scenarios. To explore the limits of applying meta learning in speech processing, we start from simple acoustic modeling and move to more complicated end-to-end speech recognition, from a simple multi-layer neural network to the more complicated Transformer architecture, and from a similar cross-accent setting to the more challenging, dissimilar cross-language setting. To find suitable transfer learning methods for a specific scenario, we control variables such as the pretraining datasets, validation sets, number of fine-tuning steps, amount of data used in fine-tuning, and sampling strategies during pretraining. The initial experiments show that under low-resource settings, meta learning methods outperform multitask learning methods on cross-language phoneme recognition and cross-accent end-to-end speech recognition. However, on more challenging tasks such as cross-language end-to-end speech recognition, there is no performance gap between the two methods. We believe these findings can help researchers explore further possibilities of applying meta learning methods in speech processing. [en]
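Both abstracts center on Model-Agnostic Meta Learning (MAML) [18] as the meta learning method compared against multitask pretraining. As a rough illustration of the inner/outer-loop structure MAML adds on top of ordinary pretraining, here is a minimal first-order MAML (FOMAML) sketch in PyTorch; the toy sine-regression tasks, the tiny model, and all hyperparameters are illustrative assumptions standing in for the thesis's actual ESPnet-based ASR setup, not a reproduction of it.

```python
# Minimal first-order MAML (FOMAML) sketch. Each "task" stands in for
# one language/accent in the thesis's setting; here it is a random sine
# regression problem. All names and hyperparameters are illustrative.
import copy
import torch
import torch.nn as nn

def sample_task():
    """Return a data sampler for one task (a random sine wave)."""
    amp = torch.rand(1).item() * 4.0 + 0.1
    phase = torch.rand(1).item() * 3.14
    def draw(n=16):
        x = torch.rand(n, 1) * 10 - 5
        return x, amp * torch.sin(x + phase)
    return draw

model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
inner_lr, inner_steps, tasks_per_batch = 1e-2, 1, 4

for meta_step in range(1000):
    meta_opt.zero_grad()
    for _ in range(tasks_per_batch):
        draw = sample_task()
        # Inner loop: adapt a copy of the shared initialization on the
        # task's support set with a few SGD steps.
        fast = copy.deepcopy(model)
        for _ in range(inner_steps):
            xs, ys = draw()
            loss = loss_fn(fast(xs), ys)
            grads = torch.autograd.grad(loss, list(fast.parameters()))
            with torch.no_grad():
                for p, g in zip(fast.parameters(), grads):
                    p -= inner_lr * g
        # Outer loop: evaluate the *adapted* weights on a query set from
        # the same task; the first-order approximation copies these
        # query-set gradients back onto the shared initialization.
        xq, yq = draw()
        qgrads = torch.autograd.grad(loss_fn(fast(xq), yq),
                                     list(fast.parameters()))
        for p, g in zip(model.parameters(), qgrads):
            p.grad = g.clone() if p.grad is None else p.grad + g
    meta_opt.step()
```

Multitask pretraining, by contrast, would skip the inner adaptation entirely and accumulate each task's loss gradient into the shared model directly; MAML instead optimizes the initialization for how well it performs after a few fine-tuning steps, which is what the thesis's low-resource adaptation scenarios test.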
dc.description.provenance: Made available in DSpace on 2021-06-16T03:01:34Z (GMT). No. of bitstreams: 1. U0001-0502202100244400.pdf: 2561588 bytes, checksum: 12f692f17e3215f8a1a928edfb223dfe (MD5). Previous issue date: 2020. [en]
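The provenance row above records the deposited bitstream's exact size and MD5 checksum. As a minimal sketch, assuming a copy of the file was saved locally under its repository filename, a downloader could verify it against those recorded values like this:

```python
# Verify the deposited PDF against the size and MD5 checksum recorded
# in dc.description.provenance. The local filename is an assumption.
import hashlib

EXPECTED_MD5 = "12f692f17e3215f8a1a928edfb223dfe"
EXPECTED_SIZE = 2561588  # bytes, from the provenance record

with open("U0001-0502202100244400.pdf", "rb") as f:
    data = f.read()

assert len(data) == EXPECTED_SIZE, "size mismatch"
assert hashlib.md5(data).hexdigest() == EXPECTED_MD5, "checksum mismatch"
print("bitstream verified")
```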
dc.description.tableofcontents:
Acknowledgements i
Chinese Abstract ii
English Abstract iii
Chapter 1: Introduction 1
1.1 Motivation 1
1.2 Related Work 3
1.3 Research Direction 3
1.4 Thesis Organization 4
Chapter 2: Background 5
2.1 Deep Neural Network (DNN) 5
2.1.1 Fundamentals 5
2.1.2 Training Methods 7
2.1.3 Convolutional Neural Network (CNN) 7
2.1.4 Recurrent Neural Network (RNN) 8
2.2 Sequence-to-Sequence Learning 10
2.2.1 Encoder-Decoder Architecture 10
2.2.2 Attention Mechanism 11
2.2.3 Self-Attention Mechanism 12
2.2.4 Transformer 13
2.3 Transfer Learning 15
2.3.1 Basic Concepts 15
2.3.2 Terminology 16
2.3.3 Multitask Learning 17
2.4 Meta Learning 18
2.4.1 Basic Concepts 18
2.4.2 A Taxonomy of Meta Learning Methods 19
2.4.3 An Analogy Between Optimization-Based Meta Learning and Transfer Learning 20
2.4.4 Model-Agnostic Meta Learning (MAML) 20
2.5 Chapter Summary 22
Chapter 3: Experiments on Meta Learning for Cross-Lingual Phoneme Recognition 24
3.1 Introduction 24
3.2 Datasets 24
3.3 Implementation Details 26
3.3.1 Data Preprocessing 26
3.3.2 Model Architecture and Training 26
3.3.3 Experimental Design 28
3.4 Results and Analysis 32
3.5 Chapter Summary 34
Chapter 4: Experiments on Meta Learning for Cross-Accent End-to-End Speech Recognition 35
4.1 Introduction 35
4.2 Datasets 37
4.3 Implementation Details 37
4.3.1 Data Preprocessing 37
4.3.2 Model Architecture and Training 37
4.3.3 Experimental Design 40
4.4 Results and Analysis 41
4.5 Chapter Summary 46
Chapter 5: Experiments on Meta Learning for Cross-Lingual End-to-End Speech Recognition 47
5.1 Introduction 47
5.2 Datasets 47
5.3 Implementation Details 49
5.3.1 Data Preprocessing 49
5.3.2 Model Architecture and Training 49
5.3.3 Experimental Design 51
5.4 Results and Analysis 51
5.5 Chapter Summary 53
Chapter 6: Conclusion and Future Work 54
6.1 Contributions and Discussion 54
6.2 Future Work 55
6.2.1 Transfer Across Different Data Distributions 55
6.2.2 Adopting Different Meta Learning Strategies 55
6.2.3 Other Applications in Speech Processing 55
References 56
dc.language.iso: zh-TW
dc.subject: Model-Agnostic Meta Learning [en]
dc.subject: Meta Learning [en]
dc.subject: Speech Recognition [en]
dc.subject: Acoustic Modeling [en]
dc.subject: Transfer Learning [en]
dc.subject: Low-Resource [en]
dc.subject: Multitask Learning [en]
dc.title: 元學習於端對端語音辨識之探討 [zh_TW]
dc.title: Meta Learning in End-to-End Speech Recognition [en]
dc.type: Thesis
dc.date.schoolyear: 109-1
dc.description.degree: Master's
dc.contributor.oralexamcommittee: 李琳山 (Lin-Shan Lee), 李彥寰 (Yen-Huan Li), 陳尚澤 (Shang-Tse Chen)
dc.subject.keyword: Meta Learning, Speech Recognition, Acoustic Modeling, Transfer Learning, Model-Agnostic Meta Learning, Low-Resource, Multitask Learning [en]
dc.relation.page: 63
dc.identifier.doi: 10.6342/NTU202100553
dc.rights.note: Access authorized with fee (有償授權)
dc.date.accepted: 2021-02-06
dc.contributor.author-college: College of Electrical Engineering and Computer Science (電機資訊學院)
dc.contributor.author-dept: Graduate Institute of Electrical Engineering (電機工程學研究所)
Appears in collections: Department of Electrical Engineering (電機工程學系)

Files in this item:
File: U0001-0502202100244400.pdf (not authorized for public access)
Size: 2.5 MB
Format: Adobe PDF

