Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/7291
Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 張智星 | |
| dc.contributor.author | Yueh-Ting Lee | en |
| dc.contributor.author | 李岳庭 | zh_TW |
| dc.date.accessioned | 2021-05-19T17:41:04Z | - |
| dc.date.available | 2024-07-25 | |
| dc.date.available | 2021-05-19T17:41:04Z | - |
| dc.date.copyright | 2019-07-25 | |
| dc.date.issued | 2019 | |
| dc.date.submitted | 2019-07-24 | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/7291 | - |
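| dc.description.abstract | In large-vocabulary speech recognition, replacing GMM-HMM with DNN-HMM as the acoustic model has brought significant improvements. This thesis uses a multi-task learning neural network model (MTL-DNN): in addition to the main task of senone classification, articulatory attributes describing manner and place of articulation serve as subtasks that are trained jointly with the DNN model to improve recognition results. Compared with previous work, we propose three improvements. First, the articulatory attribute labels are divided into four blocks, with the attributes within each block mutually exclusive, replacing the conventional multi-label scheme as the subtask output layers for training the MTL-TDNN model. Second, time-delay neural networks (TDNN) replace conventional neural networks, since a TDNN can incorporate more preceding and following context into training. Third, the subtask output layers are attached to lower hidden layers. Experiments use the Mandarin Chinese broadcast news corpus (MATBN), split into a small dataset, MATBN-20, and a large dataset, MATBN-200, and are evaluated by character error rate (CER). Compared with the conventional single-task TDNN model, the best model achieves relative improvements of 3.33% and 1% on MATBN-20 and MATBN-200, respectively. | zh_TW |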
| dc.description.abstract | In large vocabulary continuous speech recognition (LVCSR), replacing GMM-HMM with DNN-HMM acoustic models is well known to improve recognition performance. In this thesis, we use a multi-task learning model (MTL-DNN) that simultaneously minimizes the cross-entropy losses with respect to the output scores of senones and of articulatory attributes, such as place and manner. The proposed framework has three novelties compared with previous studies. First, the subtasks designed for articulation classification ensure that the attributes within each subtask are mutually exclusive. Second, instead of fully connected multilayer perceptrons, time-delay neural networks (TDNN) are adopted to model long temporal contexts efficiently. Finally, in the proposed MTL-TDNN architecture, the subtasks share neurons with the main task only in the first few layers. We performed experiments on the Mandarin Chinese broadcast news corpus (MATBN), using a small dataset (MATBN-20) and a large dataset (MATBN-200). Compared with the conventional single-task learning TDNN model, the proposed framework achieves relative character error rate (CER) reductions of 3.3% and 1% on the small and large datasets, respectively. | en |
| dc.description.provenance | Made available in DSpace on 2021-05-19T17:41:04Z (GMT). No. of bitstreams: 1 ntu-108-R06922117-1.pdf: 4607579 bytes, checksum: 5c581191999de08f2144706d4373e492 (MD5) Previous issue date: 2019 | en |
| dc.description.tableofcontents | Acknowledgements; Abstract (Chinese); Abstract; 1 Introduction: 1.1 Topic Overview, 1.2 Tools Overview, 1.3 Literature Review, 1.4 Chapter Overview; 2 Research Content: 2.1 Problem Definition, 2.2 Acoustic Features (2.2.1 Mel-Frequency Cepstral Coefficients, 2.2.2 Factor Analysis and i-Vectors), 2.3 Acoustic Model Training, 2.4 Articulatory Attributes: Place and Manner, 2.5 Time-Delay Neural Networks, 2.6 Multi-task Learning; 3 Experiments: 3.1 Corpora (3.1.1 Mandarin Chinese Broadcast News Corpus MATBN, 3.1.2 TCC-300 Microphone Speech Database, 3.1.3 Pronunciation Lexicon and Language Model), 3.2 Experimental Procedure (3.2.1 Acoustic Feature Extraction, 3.2.2 Neural Network Architecture, 3.2.3 Training Procedure and Parameter Settings, 3.2.4 Evaluation Method), 3.3 Experimental Results, 3.4 Error Analysis; 4 Conclusion and Future Work: 4.1 Conclusion, 4.2 Future Work; Bibliography | |
| dc.language.iso | zh-TW | |
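| dc.subject | 大詞彙語音辨識 (large-vocabulary speech recognition) | zh_TW |
| dc.subject | 多任務學習 (multi-task learning) | zh_TW |
| dc.subject | 發音特徵 (articulatory attributes) | zh_TW |
| dc.subject | 時延神經網路 (time-delay neural networks) | zh_TW |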
| dc.subject | multi-task learning | en |
| dc.subject | LVCSR | en |
| dc.subject | time-delay neural networks | en |
| dc.subject | articulatory attributes | en |
| dc.title | 使用基於發音方式與位置的多任務學習來改進華語大詞彙語音辨識 | zh_TW |
| dc.title | Improving Mandarin LVCSR Using Place and Manner Based Multi-task Learning | en |
| dc.type | Thesis | |
| dc.date.schoolyear | 107-2 | |
| dc.description.degree | Master's | |
| dc.contributor.oralexamcommittee | 廖元甫,王新民 | |
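| dc.subject.keyword | 多任務學習 (multi-task learning), 發音特徵 (articulatory attributes), 時延神經網路 (time-delay neural networks), 大詞彙語音辨識 (large-vocabulary speech recognition) | zh_TW |
| dc.subject.keyword | multi-task learning, articulatory attributes, time-delay neural networks, LVCSR | en |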
| dc.relation.page | 52 | |
| dc.identifier.doi | 10.6342/NTU201901599 | |
| dc.rights.note | Authorization granted (open access worldwide) | |
| dc.date.accepted | 2019-07-24 | |
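| dc.contributor.author-college | College of Electrical Engineering and Computer Science | zh_TW |
| dc.contributor.author-dept | Graduate Institute of Computer Science and Information Engineering | zh_TW |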
| dc.date.embargo-lift | 2024-07-25 | - |
| Appears in Collections: | Department of Computer Science and Information Engineering | |
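
The multi-task setup summarized in the abstract (a senone classifier trained jointly with attribute subtasks whose classes are mutually exclusive within each block) can be sketched as a per-frame loss. This is a minimal illustration under stated assumptions, not the thesis implementation: the block sizes, the `aux_weight` value, and all function names are hypothetical, and the network that would produce the logits is omitted.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy(logits, target):
    """Cross-entropy for a single frame, given the index of the correct class."""
    return -log_softmax(logits)[target]


def mtl_frame_loss(senone_logits, senone_target,
                   block_logits, block_targets, aux_weight=0.1):
    """Main-task loss plus a weighted sum of per-block auxiliary losses.

    Each entry of block_logits gets its own softmax, so classes inside one
    block are mutually exclusive while the blocks stay independent.
    aux_weight is a hypothetical value, not taken from the thesis.
    """
    main = cross_entropy(senone_logits, senone_target)
    aux = sum(cross_entropy(scores, target)
              for scores, target in zip(block_logits, block_targets))
    return main + aux_weight * aux


# Toy frame: 5 senones and four attribute blocks of assumed sizes 3, 4, 2, 3.
senone_logits = [0.2, 1.5, -0.3, 0.0, 0.7]
block_logits = [
    [1.0, 0.1, -0.5],
    [0.0, 0.0, 2.0, -1.0],
    [0.3, -0.3],
    [0.5, 0.5, 0.5],
]
loss = mtl_frame_loss(senone_logits, 1, block_logits, [0, 2, 0, 1])
print(f"joint loss: {loss:.4f}")
```

Using one softmax per block, rather than independent sigmoid outputs as in a multi-label scheme, is what enforces mutual exclusivity within a block. Setting `aux_weight=0.0` reduces the sketch to ordinary single-task senone training.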
Files in this item:
| File | Size | Format | |
|---|---|---|---|
| ntu-108-1.pdf | 4.5 MB | Adobe PDF | View/Open |
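Unless their copyright terms are specifically stated otherwise, all items in the system are protected by copyright, with all rights reserved.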
