NTU Theses and Dissertations Repository (DSpace)
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/60087

Full metadata record
dc.contributor.advisor: 李琳山 (Lin-shan Lee)
dc.contributor.author (en): Yi-Hsiu Liao
dc.contributor.author (zh_TW): 廖宜修
dc.date.accessioned: 2021-06-16T09:55:13Z
dc.date.available: 2020-02-08
dc.date.copyright: 2017-02-08
dc.date.issued: 2016
dc.date.submitted: 2016-12-29
dc.identifier.citation:
[1] Kurt Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural Networks, vol. 4, no. 2, pp. 251–257, 1991.
[2] George Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, 1989.
[3] Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[4] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber, “Highway networks,” arXiv preprint arXiv:1505.00387, 2015.
[5] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun, “Large margin methods for structured and interdependent output variables,” in Journal of Machine Learning Research, 2005, pp. 1453–1484.
[6] Hao Tang, Chao-Hong Meng, and Lin-Shan Lee, “An initial attempt for phoneme recognition using structured support vector machine (SVM),” in Acoustics, Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE, 2010, pp. 4926–4929.
[7] Xuedong Huang, Alex Acero, and Hsiao-Wuen Hon (foreword by Raj Reddy), Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice Hall PTR, 2001.
[8] Rivarol Vergin, Douglas O’Shaughnessy, and Azarshid Farhat, “Generalized mel frequency cepstral coefficients for large-vocabulary speaker-independent continuous-speech recognition,” Speech and Audio Processing, IEEE Transactions on, vol. 7, no. 5, pp. 525–532, 1999.
[9] Hynek Hermansky, Daniel W Ellis, and Shantanu Sharma, “Tandem connectionist feature extraction for conventional HMM systems,” in Acoustics, Speech, and Signal Processing, 2000. ICASSP’00. Proceedings. 2000 IEEE International Conference on. IEEE, 2000, vol. 3, pp. 1635–1638.
[10] Lawrence R Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[11] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” Signal Processing Magazine, IEEE, vol. 29, no. 6, pp. 82–97, 2012.
[12] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., “The Kaldi speech recognition toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011, number EPFL-CONF-192584.
[13] Thorsten Joachims, “Making large scale SVM learning practical,” Tech. Rep., Universität Dortmund, 1999.
[14] Andrej Karpathy, “Convolutional neural network for visual recognition,” 2015.
[15] Pavel Golik, Patrick Doetsch, and Hermann Ney, “Cross-entropy vs. squared error training: a theoretical and experimental comparison.”
[16] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller, “Playing Atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/60087
dc.description.abstract (zh_TW): In modern speech recognition, hybrid hidden Markov models (HMM) in which a deep neural network (DNN) replaces the Gaussian mixture model (GMM) have far surpassed traditional systems in recognition accuracy and have become the mainstream. Even in this mainstream architecture, however, the audio is still cut into very small frames that are recognized separately, using models optimized independently at different levels, rather than considering the overall structure of the utterance at once.
Structured learning, by contrast, does not train and recognize objects one at a time; it can take the structure of the whole input and output into account. If we take the sequence of acoustic feature vectors as the structured input and the phoneme sequence as the structured output, structured learning can exploit the information in the overall structure of the speech to find the best phoneme recognition result.
In this thesis, besides implementing a phoneme recognition system based on the structured support vector machine, we propose two new models that combine structured learning with deep learning, the structured deep neural network and the gradient structured deep neural network, and implement a phoneme recognition system with each.
Experiments on the TIMIT corpus show that although the structured support vector machine is only a linear model, with proper input it achieves a phoneme error rate of 22.7%. The structured deep neural network breaks through the limitations of linear models: using a nonlinear deep neural network, it beats the best current mainstream model, reaching a phoneme error rate of 17.8%. The gradient structured deep neural network, due to time constraints, has not yet achieved a good phoneme error rate, but it offers a new direction and may also be a new way to solve general maximization problems.
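The abstract's closing idea, casting decoding as a general maximization problem attacked with gradient methods, can be illustrated in miniature. The sketch below is a hypothetical stand-in, not the thesis's actual model: a simple concave quadratic score replaces the DNN score, and gradient ascent on a continuous relaxation of the output plays the role of decoding (the matrix `A`, vector `b`, step size, and iteration count are all illustrative choices):

```python
import numpy as np

# Toy differentiable "score": s(y) = -||A y - b||^2, to be maximized over y.
# A DNN score would replace this quadratic; the mechanics stay the same.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([4.0, 3.0])

def score(y):
    r = A @ y - b
    return -r @ r

def grad(y):
    # d s / d y = -2 A^T (A y - b)
    return -2.0 * A.T @ (A @ y - b)

y = np.zeros(2)          # continuous relaxation of the structured output
for _ in range(500):
    y += 0.05 * grad(y)  # gradient ascent step toward argmax_y s(y)

# The maximizer satisfies A y = b, i.e. y = [2, 3].
print(np.round(y, 3))
```

With a concave score the ascent converges to the global maximizer; with a real DNN score the same loop would only find a local maximum, which is one reason such decoding is harder than the convex case.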
dc.description.abstract (en): Nowadays, hybrid hidden Markov model (HMM) systems in which a deep neural network (DNN) replaces the Gaussian mixture model (GMM) show great improvement over traditional automatic speech recognition (ASR) and have become the mainstream. However, in this architecture we still divide the waveform into separate frames and optimize each model individually, without considering the structure of the whole utterance.
Structured learning, on the other hand, is capable of taking a whole structured input and producing a structured output without training on each object separately. Hence, we can take the acoustic feature sequence as the structured input and the phoneme sequence as the structured output; in this way the ASR problem is transformed into a structured learning problem.
In this thesis, we implement a structured support vector machine (SVM) as the baseline and propose two novel structured learning models for phoneme recognition: the structured deep neural network and the gradient structured deep neural network.
On the TIMIT corpus, although the structured SVM is a linear model, with proper input it achieves a 22.7% phoneme error rate (PER). The structured DNN is a nonlinear model and reaches 17.8% PER, beating the state-of-the-art result. The gradient structured DNN has not yet given good PER results, but it is a novel and interesting way to solve maximization problems.
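The structured formulation described above can be sketched concretely: a linear discriminant score(x, y) = w · Φ(x, y) over an utterance x and phoneme sequence y, decoded by Viterbi search, is the core of a structured SVM baseline. The toy below is a minimal sketch under assumed toy dimensions, with hypothetical frame-level and phoneme-bigram features and random weights; it illustrates only the decoding step, not the thesis's actual feature function or trained system:

```python
import numpy as np

# Hypothetical sizes: F acoustic feature dims per frame, P phoneme classes.
F, P = 3, 4
rng = np.random.default_rng(0)

# Linear model w split into unary (phoneme-vs-frame) and transition
# (phoneme-bigram) weights, so score(x, y) decomposes frame by frame.
W_unary = rng.standard_normal((P, F))
W_trans = rng.standard_normal((P, P))

def decode(x):
    """Viterbi search for argmax_y score(x, y) over phoneme sequences y."""
    T = len(x)
    emit = x @ W_unary.T                       # (T, P) per-frame scores
    best = np.zeros((T, P))                    # best partial score ending in p
    back = np.zeros((T, P), dtype=int)         # backpointers
    best[0] = emit[0]
    for t in range(1, T):
        # cand[i, j] = best path ending in i at t-1, then i -> j at t
        cand = best[t - 1][:, None] + W_trans + emit[t][None, :]
        back[t] = cand.argmax(axis=0)
        best[t] = cand.max(axis=0)
    y = [int(best[-1].argmax())]
    for t in range(T - 1, 0, -1):              # follow backpointers
        y.append(int(back[t, y[-1]]))
    return y[::-1]

x = rng.standard_normal((6, F))                # a toy "utterance" of 6 frames
print(decode(x))                               # one phoneme label per frame
```

Training the structured SVM then amounts to adjusting w so that the correct sequence outscores every competing sequence by a margin, with the same Viterbi-style search used to find the most violated constraint.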
dc.description.provenance (en): Made available in DSpace on 2021-06-16T09:55:13Z (GMT). No. of bitstreams: 1
ntu-105-R03921048-1.pdf: 3115923 bytes, checksum: 01801c50a95c3d456b4582dc2c9a805b (MD5)
Previous issue date: 2016
dc.description.tableofcontents:
Chinese Abstract ....... i
English Abstract ....... ii
Chapter 1. Introduction ....... 1
1.1 Motivation ....... 1
1.2 Related Work ....... 3
1.3 Contributions of This Thesis ....... 3
1.4 Thesis Organization ....... 4
Chapter 2. Background ....... 5
2.1 Phoneme Recognition Systems ....... 5
2.1.1 Front-End Processing ....... 5
2.1.2 Statistical Phoneme Recognition ....... 7
2.1.3 Acoustic Models ....... 8
2.1.4 Hidden Markov Models ....... 10
2.1.5 Search Algorithms and Lattices ....... 11
2.2 Deep Neural Networks ....... 12
2.2.1 Feedforward Neural Networks ....... 13
2.2.2 Training Neural Networks ....... 17
2.3 Structured Learning ....... 21
2.4 Chapter Summary ....... 22
Chapter 3. Structured Support Vector Machine ....... 23
3.1 Introduction ....... 23
3.2 System Architecture ....... 24
3.3 Structured Feature Extraction ....... 26
3.4 Discriminative Function ....... 27
3.5 Decoding ....... 29
3.6 Most Violated Constraint Sequence ....... 30
3.7 Loss Function ....... 32
3.8 Experiments and Analysis ....... 33
3.8.1 Experimental Setup ....... 33
3.8.2 Baseline Experiments ....... 36
3.8.3 Results and Analysis ....... 37
3.9 Chapter Summary ....... 39
Chapter 4. Structured Deep Neural Network ....... 40
4.1 Introduction ....... 40
4.2 System Architecture ....... 41
4.3 Structured Feature Extraction ....... 42
4.4 Discriminative Function ....... 44
4.5 Decoding ....... 44
4.6 Training Loss Functions ....... 45
4.6.1 Approximating Phoneme Accuracy Rate ....... 45
4.6.2 Margin Maximization ....... 45
4.7 Global Structured Deep Neural Network ....... 46
4.8 Experiments and Analysis ....... 48
4.8.1 Experimental Setup ....... 48
4.8.2 Results and Analysis ....... 49
4.9 Chapter Summary ....... 52
Chapter 5. Gradient Structured Deep Neural Network ....... 53
5.1 Introduction ....... 53
5.2 System Architecture ....... 55
5.3 Structured Feature Extraction ....... 57
5.4 Decoding ....... 61
5.5 Training Loss Functions ....... 61
5.6 Experiments and Analysis ....... 63
5.6.1 Experimental Setup ....... 63
5.6.2 Results and Analysis ....... 64
5.7 Chapter Summary ....... 73
Chapter 6. Conclusion and Future Work ....... 74
6.1 Conclusion and Future Work ....... 74
6.1.1 Summary ....... 74
6.1.2 Future Work ....... 74
References ....... 76
Appendix A. Computational Details of Structured Backward Feature Extraction ....... 79
A.1 Automatic Differentiation ....... 79
A.2 Structured Backward Feature Extraction ....... 82
dc.language.iso: zh-TW
dc.subject (zh_TW): 機器學習 (machine learning)
dc.subject (zh_TW): 語音辨識 (speech recognition)
dc.subject (zh_TW): 結構化學習 (structured learning)
dc.subject (zh_TW): 深度學習 (deep learning)
dc.subject (en): deep learning
dc.subject (en): structured learning
dc.subject (en): ASR
dc.subject (en): machine learning
dc.title (zh_TW): 基於結構化學習之初步音素辨識
dc.title (en): Towards Phoneme Recognition with Structured Learning
dc.type: Thesis
dc.date.schoolyear: 105-1
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 李宏毅 (Hung-yi Lee), 陳信宏 (Sin-Horng Chen), 鄭秋豫 (Chiu-Yu Tseng), 簡仁宗 (Jen-Tzung Chien)
dc.subject.keyword (zh_TW): 結構化學習 (structured learning), 深度學習 (deep learning), 語音辨識 (speech recognition), 機器學習 (machine learning)
dc.subject.keyword (en): structured learning, deep learning, ASR, machine learning
dc.relation.page: 84
dc.identifier.doi: 10.6342/NTU201603864
dc.rights.note: 有償授權 (licensed access for a fee)
dc.date.accepted: 2016-12-30
dc.contributor.author-college (zh_TW): 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept (zh_TW): 電機工程學研究所 (Graduate Institute of Electrical Engineering)
Appears in collections: 電機工程學系 (Department of Electrical Engineering)

Files in this item:
File: ntu-105-1.pdf (access restricted; not publicly available)
Size: 3.04 MB
Format: Adobe PDF


All items in the system are protected by copyright, with all rights reserved, unless otherwise indicated.
