Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/4793
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 李琳山(Lin-Shan Lee) | |
dc.contributor.author | Po-Wei Chou | en |
dc.contributor.author | 周伯威 | zh_TW |
dc.date.accessioned | 2021-05-14T17:47:21Z | - |
dc.date.available | 2015-03-13 | |
dc.date.available | 2021-05-14T17:47:21Z | - |
dc.date.copyright | 2015-03-13 | |
dc.date.issued | 2015 | |
dc.date.submitted | 2015-02-13 | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/4793 | - |
dc.description.abstract | In speech recognition, replacing the traditional Gaussian mixture model (GMM) with a deep neural network (DNN) in the acoustic model (AM) has become the mainstream approach owing to its superior performance. In this thesis, we use deep neural networks and convolutional neural networks (CNN) to produce the state probabilities required by the hidden Markov model (HMM), building acoustic models for large-vocabulary continuous speech recognition (LVCSR), and conduct a series of experiments on an English benchmark corpus. The results show that both the DNN and the CNN substantially outperform the traditional GMM-based approach in recognition accuracy, with the DNN performing best.
Because speech inevitably varies across speakers, this thesis also investigates how to perform speaker adaptation on the DNN acoustic model to resolve the mismatch between the target speaker's speech and the training corpus. Improving on feature-space discriminative linear regression (fDLR), we propose a state-clustered approach that models the distinct acoustic structure of each HMM state more precisely, adapts each cluster separately, and decodes in two passes, raising recognition accuracy for the target speaker. In a series of experiments on a bilingual (Mandarin-English) corpus recorded from Facebook status updates, the personalized acoustic models built with this method perform well with both small and large amounts of adaptation data. In addition, we implemented a GPU-accelerated deep neural network library. Beyond basic usage instructions, the thesis documents the library's software architecture and design rationale in detail and discusses several important GPU implementation issues. | zh_TW |
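The hybrid DNN-HMM setup described in the abstract can be summarized as follows: the network emits posterior probabilities over tied HMM states, and the decoder converts these to scaled likelihoods by dividing out the state priors. A minimal sketch of that conversion, where the layer sizes, ReLU activations, and function names are illustrative assumptions rather than the thesis's actual configuration:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dnn_state_posteriors(frames, weights, biases):
    """Forward pass of a feed-forward DNN acoustic model.

    frames:  (T, D) acoustic feature frames (e.g. stacked filterbank vectors)
    weights: list of per-layer weight matrices
    biases:  list of per-layer bias vectors
    Returns a (T, S) matrix of posteriors over the S tied HMM states.
    """
    h = frames
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(h @ W + b, 0.0)            # hidden layers (ReLU assumed)
    return softmax(h @ weights[-1] + biases[-1])  # output layer -> state posteriors

def scaled_log_likelihoods(posteriors, state_priors, eps=1e-10):
    # Hybrid decoding uses p(x|s) ∝ p(s|x) / p(s); the priors are typically
    # the state frequencies in the force-aligned training data.
    return np.log(posteriors + eps) - np.log(state_priors + eps)
```

The scaled log-likelihoods then take the place of the GMM emission scores in the Viterbi decoder, leaving the rest of the HMM machinery unchanged.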
dc.description.provenance | Made available in DSpace on 2021-05-14T17:47:21Z (GMT). No. of bitstreams: 1 ntu-104-R01942135-1.pdf: 38147670 bytes, checksum: 81934d6b0eda9b136172a2c11c4602f5 (MD5) Previous issue date: 2015 | en |
dc.description.tableofcontents | Thesis committee certification
Chinese abstract
1. Introduction: 1.1 Background and motivation; 1.2 Research directions and contributions; 1.3 Organization
2. Background: 2.1 Automatic speech recognition (2.1.1 Acoustic model; 2.1.2 Lexicon; 2.1.3 Language model); 2.2 Neural networks (2.2.1 Feed-forward neural networks; 2.2.2 Training neural networks); 2.3 Regularization of neural networks (2.3.1 L1 and L2 regularization; 2.3.2 Dropout); 2.4 Summary
3. Deep neural network acoustic models: 3.1 Replacing the GMM with a DNN as the acoustic model (3.1.1 Backpropagation); 3.2 Experiments and analysis (3.2.1 Experimental setup; 3.2.2 Results and analysis); 3.3 Summary
4. Convolutional neural network acoustic models: 4.1 Convolutional neural networks (4.1.1 Overview; 4.1.2 Convolutional layer; 4.1.3 Subsampling layer; 4.1.4 Backpropagation); 4.2 Experiments and analysis (4.2.1 Experimental setup; 4.2.2 Results and analysis); 4.3 Summary
5. Speaker adaptation of deep neural networks: 5.1 Overview; 5.2 SVD-based speaker adaptation; 5.3 Feature-space discriminative linear regression (fDLR); 5.4 State-clustered fDLR (5.4.1 State clustering of the HMM; 5.4.2 Two-pass decoding); 5.5 Experiments and analysis (5.5.1 Experimental setup; 5.5.2 Baseline experiments; 5.5.3 Results and analysis); 5.6 Summary
6. Deep neural network library and tools: 6.1 Overview; 6.2 Basic usage (6.2.1 Initializing a model; 6.2.2 Training a model on data; 6.2.3 Predicting with the trained network); 6.3 Code architecture (6.3.1 Memory layout; 6.3.2 Performance tuning and optimization); 6.4 Summary
7. Conclusion and future work: 7.1 Conclusion; 7.2 Future work
References; Appendix | |
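Chapter 5 of the outline concerns fDLR-style speaker adaptation: a per-speaker affine transform of the input features is estimated on a small amount of adaptation data while the speaker-independent network stays frozen; the thesis's state-clustered variant estimates one such transform per cluster of HMM states. A minimal sketch of the core idea, assuming for simplicity that the frozen network is reduced to a single softmax output layer (V, c) so the gradient stays closed-form; the function name, learning rate, and shapes are all hypothetical:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fdlr_adapt(X, y, V, c, lr=0.1, epochs=20):
    """Feature-space DLR sketch: learn an affine transform x' = x @ W + b for
    one speaker by cross-entropy gradient descent, while the frozen
    speaker-independent model (here a softmax layer V, c) is left untouched.

    X: (N, D) adaptation frames, y: (N,) aligned target state ids.
    Returns the speaker-specific transform (W, b).
    """
    N, D = X.shape
    W = np.eye(D)                    # start from the identity transform
    b = np.zeros(D)
    for _ in range(epochs):
        Xp = X @ W + b               # adapted features
        P = softmax(Xp @ V + c)      # frozen output layer
        P[np.arange(N), y] -= 1.0    # dL/dlogits for cross-entropy (P - onehot)
        G = (P @ V.T) / N            # gradient w.r.t. the adapted features
        W -= lr * X.T @ G            # chain rule into the transform parameters
        b -= lr * G.sum(axis=0)
    return W, b
```

In the state-clustered version described in the outline, this estimation would be repeated once per state cluster, and a first decoding pass supplies the frame-to-cluster alignment used by the second, adapted pass.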
dc.language.iso | zh-TW | |
dc.title | 以深層與卷積類神經網路建構聲學模型之大字彙連續語音辨識 | zh_TW |
dc.title | Deep and Convolutional Neural Networks for Acoustic Modeling in Large Vocabulary Continuous Speech Recognition | en |
dc.type | Thesis | |
dc.date.schoolyear | 103-1 | |
dc.description.degree | Master | |
dc.contributor.oralexamcommittee | 林軒田(Hsuan-Tien Lin),張智星(Jyh-Shing Jang),李宏毅(Hung-yi Lee) | |
dc.subject.keyword | 語音辨識,大字彙連續語音辨識,類神經網路,深層類神經網路 | zh_TW |
dc.subject.keyword | Speech Recognition, Large Vocabulary Continuous Speech Recognition, Artificial Neural Network, Deep Neural Network | en |
dc.relation.page | 80 | |
dc.rights.note | Authorized for release (open access worldwide) | |
dc.date.accepted | 2015-02-13 | |
dc.contributor.author-college | 電機資訊學院 | zh_TW |
dc.contributor.author-dept | 電信工程學研究所 | zh_TW |
Appears in Collections: | Graduate Institute of Communication Engineering
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-104-1.pdf | 37.25 MB | Adobe PDF | View/Open |
All items in this repository are protected by copyright, with all rights reserved, unless a specific license states otherwise.