NTU Theses and Dissertations Repository (DSpace)
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/60087

Full metadata record
dc.contributor.advisor: 李琳山 (Lin-shan Lee)
dc.contributor.author (en): Yi-Hsiu Liao
dc.contributor.author (zh_TW): 廖宜修
dc.date.accessioned: 2021-06-16T09:55:13Z
dc.date.available: 2020-02-08
dc.date.copyright: 2017-02-08
dc.date.issued: 2016
dc.date.submitted: 2016-12-29
dc.identifier.citation:
[1] Kurt Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural Networks, vol. 4, no. 2, pp. 251–257, 1991.
[2] George Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, 1989.
[3] Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[4] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber, “Highway networks,” arXiv preprint arXiv:1505.00387, 2015.
[5] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun, “Large margin methods for structured and interdependent output variables,” in Journal of Machine Learning Research, 2005, pp. 1453–1484.
[6] Hao Tang, Chao-Hong Meng, and Lin-Shan Lee, “An initial attempt for phoneme recognition using structured support vector machine (SVM),” in Acoustics, Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE, 2010, pp. 4926–4929.
[7] Xuedong Huang, Alex Acero, and Hsiao-Wuen Hon (foreword by Raj Reddy), Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice Hall PTR, 2001.
[8] Rivarol Vergin, Douglas O’Shaughnessy, and Azarshid Farhat, “Generalized mel frequency cepstral coefficients for large-vocabulary speaker-independent continuous-speech recognition,” Speech and Audio Processing, IEEE Transactions on, vol. 7, no. 5, pp. 525–532, 1999.
[9] Hynek Hermansky, Daniel W Ellis, and Shantanu Sharma, “Tandem connectionist feature extraction for conventional HMM systems,” in Acoustics, Speech, and Signal Processing, 2000. ICASSP’00. Proceedings. 2000 IEEE International Conference on. IEEE, 2000, vol. 3, pp. 1635–1638.
[10] Lawrence R Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[11] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” Signal Processing Magazine, IEEE, vol. 29, no. 6, pp. 82–97, 2012.
[12] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., “The Kaldi speech recognition toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011, number EPFL-CONF-192584.
[13] Thorsten Joachims, “Making large scale SVM learning practical,” Tech. Rep., Universität Dortmund, 1999.
[14] Andrej Karpathy, “Convolutional neural network for visual recognition,” 2015.
[15] Pavel Golik, Patrick Doetsch, and Hermann Ney, “Cross-entropy vs. squared error training: a theoretical and experimental comparison.”
[16] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller, “Playing Atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/60087
dc.description.abstract (zh_TW): In modern speech recognition, hybrid hidden Markov models (HMM) in which a deep neural network (DNN) replaces the Gaussian mixture model (GMM) have far surpassed traditional systems in recognition accuracy and have become the mainstream. Even in this mainstream architecture, however, the audio is still cut into very small frames that are recognized separately, using models optimized independently at different levels, rather than considering the overall structure of the utterance at once.
Structured learning, by contrast, does not train and recognize objects one at a time; it can take the structure of the whole input and output into account. If we take the sequence of acoustic feature vectors as the structured input and the phoneme sequence as the structured output, structured learning can exploit the information in the overall structure of the speech to find the best phoneme recognition result.
In this thesis, besides implementing a phoneme recognition system based on the structured support vector machine, we propose two new models that combine structured learning with deep learning, the structured deep neural network and the gradient structured deep neural network, and implement a phoneme recognition system with each.
Experiments on the TIMIT corpus show that although the structured support vector machine is only a linear model, with proper input it achieves a phoneme error rate of 22.7%. The structured deep neural network breaks through the limitations of linear models: using a nonlinear deep neural network, it beats the best current mainstream model, reaching a phoneme error rate of 17.8%. The gradient structured deep neural network, due to time constraints, has not yet achieved a good phoneme error rate, but it offers a new direction and may also be a new way to solve general maximization problems.
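The abstract's closing idea, casting decoding as a general maximization problem attacked with gradient methods, can be illustrated in miniature. The sketch below is a hypothetical stand-in, not the thesis's actual model: a simple concave quadratic score replaces the DNN score, and gradient ascent on a continuous relaxation of the output plays the role of decoding (the matrix `A`, vector `b`, step size, and iteration count are all illustrative choices):

```python
import numpy as np

# Toy differentiable "score": s(y) = -||A y - b||^2, to be maximized over y.
# A DNN score would replace this quadratic; the mechanics stay the same.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([4.0, 3.0])

def score(y):
    r = A @ y - b
    return -r @ r

def grad(y):
    # d s / d y = -2 A^T (A y - b)
    return -2.0 * A.T @ (A @ y - b)

y = np.zeros(2)          # continuous relaxation of the structured output
for _ in range(500):
    y += 0.05 * grad(y)  # gradient ascent step toward argmax_y s(y)

# The maximizer satisfies A y = b, i.e. y = [2, 3].
print(np.round(y, 3))
```

With a concave score the ascent converges to the global maximizer; with a real DNN score the same loop would only find a local maximum, which is one reason such decoding is harder than the convex case.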
dc.description.abstract (en): Nowadays, hybrid hidden Markov model (HMM) systems in which a deep neural network (DNN) replaces the Gaussian mixture model (GMM) show great improvement over traditional automatic speech recognition (ASR) and have become the mainstream. However, in this architecture we still divide the waveform into separate frames and optimize each model individually, without considering the structure of the whole utterance.
Structured learning, on the other hand, is capable of taking a whole structured input and producing a structured output without training on each object separately. Hence, we can take the acoustic feature sequence as the structured input and the phoneme sequence as the structured output; in this way the ASR problem is transformed into a structured learning problem.
In this thesis, we implement a structured support vector machine (SVM) as the baseline and propose two novel structured learning models for phoneme recognition: the structured deep neural network and the gradient structured deep neural network.
On the TIMIT corpus, although the structured SVM is a linear model, with proper input it achieves a 22.7% phoneme error rate (PER). The structured DNN is a nonlinear model and reaches 17.8% PER, beating the state-of-the-art result. The gradient structured DNN has not yet given good PER results, but it is a novel and interesting way to solve maximization problems.
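The structured formulation described above can be sketched concretely: a linear discriminant score(x, y) = w · Φ(x, y) over an utterance x and phoneme sequence y, decoded by Viterbi search, is the core of a structured SVM baseline. The toy below is a minimal sketch under assumed toy dimensions, with hypothetical frame-level and phoneme-bigram features and random weights; it illustrates only the decoding step, not the thesis's actual feature function or trained system:

```python
import numpy as np

# Hypothetical sizes: F acoustic feature dims per frame, P phoneme classes.
F, P = 3, 4
rng = np.random.default_rng(0)

# Linear model w split into unary (phoneme-vs-frame) and transition
# (phoneme-bigram) weights, so score(x, y) decomposes frame by frame.
W_unary = rng.standard_normal((P, F))
W_trans = rng.standard_normal((P, P))

def decode(x):
    """Viterbi search for argmax_y score(x, y) over phoneme sequences y."""
    T = len(x)
    emit = x @ W_unary.T                       # (T, P) per-frame scores
    best = np.zeros((T, P))                    # best partial score ending in p
    back = np.zeros((T, P), dtype=int)         # backpointers
    best[0] = emit[0]
    for t in range(1, T):
        # cand[i, j] = best path ending in i at t-1, then i -> j at t
        cand = best[t - 1][:, None] + W_trans + emit[t][None, :]
        back[t] = cand.argmax(axis=0)
        best[t] = cand.max(axis=0)
    y = [int(best[-1].argmax())]
    for t in range(T - 1, 0, -1):              # follow backpointers
        y.append(int(back[t, y[-1]]))
    return y[::-1]

x = rng.standard_normal((6, F))                # a toy "utterance" of 6 frames
print(decode(x))                               # one phoneme label per frame
```

Training the structured SVM then amounts to adjusting w so that the correct sequence outscores every competing sequence by a margin, with the same Viterbi-style search used to find the most violated constraint.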
dc.description.provenance (en): Made available in DSpace on 2021-06-16T09:55:13Z (GMT). No. of bitstreams: 1
ntu-105-R03921048-1.pdf: 3115923 bytes, checksum: 01801c50a95c3d456b4582dc2c9a805b (MD5)
Previous issue date: 2016
dc.description.tableofcontents:
Chinese Abstract ....... i
English Abstract ....... ii
Chapter 1. Introduction ....... 1
1.1 Motivation ....... 1
1.2 Related Work ....... 3
1.3 Contributions of This Thesis ....... 3
1.4 Thesis Organization ....... 4
Chapter 2. Background ....... 5
2.1 Phoneme Recognition Systems ....... 5
2.1.1 Front-End Processing ....... 5
2.1.2 Statistical Phoneme Recognition ....... 7
2.1.3 Acoustic Models ....... 8
2.1.4 Hidden Markov Models ....... 10
2.1.5 Search Algorithms and Lattices ....... 11
2.2 Deep Neural Networks ....... 12
2.2.1 Feedforward Neural Networks ....... 13
2.2.2 Training Neural Networks ....... 17
2.3 Structured Learning ....... 21
2.4 Chapter Summary ....... 22
Chapter 3. Structured Support Vector Machine ....... 23
3.1 Introduction ....... 23
3.2 System Architecture ....... 24
3.3 Structured Feature Extraction ....... 26
3.4 Discriminative Function ....... 27
3.5 Decoding ....... 29
3.6 Most Violated Constraint Sequence ....... 30
3.7 Loss Function ....... 32
3.8 Experiments and Analysis ....... 33
3.8.1 Experimental Setup ....... 33
3.8.2 Baseline Experiments ....... 36
3.8.3 Results and Analysis ....... 37
3.9 Chapter Summary ....... 39
Chapter 4. Structured Deep Neural Network ....... 40
4.1 Introduction ....... 40
4.2 System Architecture ....... 41
4.3 Structured Feature Extraction ....... 42
4.4 Discriminative Function ....... 44
4.5 Decoding ....... 44
4.6 Training Loss Functions ....... 45
4.6.1 Approximating Phoneme Accuracy Rate ....... 45
4.6.2 Margin Maximization ....... 45
4.7 Global Structured Deep Neural Network ....... 46
4.8 Experiments and Analysis ....... 48
4.8.1 Experimental Setup ....... 48
4.8.2 Results and Analysis ....... 49
4.9 Chapter Summary ....... 52
Chapter 5. Gradient Structured Deep Neural Network ....... 53
5.1 Introduction ....... 53
5.2 System Architecture ....... 55
5.3 Structured Feature Extraction ....... 57
5.4 Decoding ....... 61
5.5 Training Loss Functions ....... 61
5.6 Experiments and Analysis ....... 63
5.6.1 Experimental Setup ....... 63
5.6.2 Results and Analysis ....... 64
5.7 Chapter Summary ....... 73
Chapter 6. Conclusion and Future Work ....... 74
6.1 Conclusion and Future Work ....... 74
6.1.1 Summary ....... 74
6.1.2 Future Work ....... 74
References ....... 76
Appendix A. Computational Details of Structured Backward Feature Extraction ....... 79
A.1 Automatic Differentiation ....... 79
A.2 Structured Backward Feature Extraction ....... 82
dc.language.iso: zh-TW
dc.subject (zh_TW): 機器學習 (machine learning)
dc.subject (zh_TW): 語音辨識 (speech recognition)
dc.subject (zh_TW): 結構化學習 (structured learning)
dc.subject (zh_TW): 深度學習 (deep learning)
dc.subject (en): deep learning
dc.subject (en): structured learning
dc.subject (en): ASR
dc.subject (en): machine learning
dc.title (zh_TW): 基於結構化學習之初步音素辨識
dc.title (en): Towards Phoneme Recognition with Structured Learning
dc.type: Thesis
dc.date.schoolyear: 105-1
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 李宏毅 (Hung-yi Lee), 陳信宏 (Sin-Horng Chen), 鄭秋豫 (Chiu-Yu Tseng), 簡仁宗 (Jen-Tzung Chien)
dc.subject.keyword (zh_TW): 結構化學習 (structured learning), 深度學習 (deep learning), 語音辨識 (speech recognition), 機器學習 (machine learning)
dc.subject.keyword (en): structured learning, deep learning, ASR, machine learning
dc.relation.page: 84
dc.identifier.doi: 10.6342/NTU201603864
dc.rights.note: 有償授權 (licensed access for a fee)
dc.date.accepted: 2016-12-30
dc.contributor.author-college (zh_TW): 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept (zh_TW): 電機工程學研究所 (Graduate Institute of Electrical Engineering)
Appears in collections: 電機工程學系 (Department of Electrical Engineering)

Files in this item:
File: ntu-105-1.pdf (access restricted; not publicly available)
Size: 3.04 MB
Format: Adobe PDF


All items in the system are protected by copyright, with all rights reserved, unless otherwise indicated.
