Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/50690

Full metadata record (DC field: value [language])
dc.contributor.advisor: 李琳山
dc.contributor.author: Chih-Hsiang Yang [en]
dc.contributor.author: 楊植翔 [zh_TW]
dc.date.accessioned: 2021-06-15T12:52:51Z
dc.date.available: 2016-08-02
dc.date.copyright: 2016-08-02
dc.date.issued: 2016
dc.date.submitted: 2016-07-19
dc.identifier.citation[1] Henning Reetz and Allard Jongman, Phonetics: Transcription, production, acoustics, and perception, vol. 34, John Wiley & Sons, 2011.
[2] Simon King and Paul Taylor, “Detection of phonological features in continuous speech using neural networks,” Computer Speech & Language, vol. 14, no. 4, pp. 333–353, 2000.
[3] Vikramjit Mitra, Hosung Nam, Carol Y Espy-Wilson, Elliot Saltzman, and Louis Goldstein, “Retrieving tract variables from acoustics: a comparison of different machine learning strategies,” Selected Topics in Signal Processing, IEEE Journal of, vol. 4, no. 6, pp. 1027–1045, 2010.
[4] Benigno Uria, Steve Renals, and Korin Richmond, “A deep neural network for acoustic-articulatory speech inversion,” in NIPS 2011 Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[5] Vikramjit Mitra, Wen Wang, Andreas Stolcke, Hosung Nam, Colleen Richey, Jiahong Yuan, and Mark Liberman, “Articulatory trajectories for large-vocabulary speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 7145–7149.
[6] Vikramjit Mitra, Ganesh Sivaraman, Hosung Nam, Carol Espy-Wilson, and Elliot Saltzman, “Articulatory features from deep neural networks and their role in speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 3017–3021.
[7] Hayes Bruce, Introductory phonology, vol. 32, John Wiley & Sons, 2011.
[8] Roman Jakobson and Moris Halle, Fundamentals of language, vol. 1, Walter de Gruyter, 2002.
[9] Noam Chomsky and Morris Halle, “The sound pattern of english.,” 1968.
[10] EC Polome, “Frontiers of phonology: atoms, structures, derivations-durand, j, katamba, f,” 1997.
[11] Xuedong D Huang, Yasuo Ariki, and Mervyn A Jack, Hidden Markov models for speech recognition, vol. 2004, Edinburgh university press Edinburgh, 1990.
[12] Kunihiko Fukushima, “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position,” Biological cybernetics, vol. 36, no. 4, pp. 193–202, 1980.
[13] Paul Werbos, “Beyond regression: New tools for prediction and analysis in the behavioral sciences,” 1974
[14] John Duchi, Elad Hazan, and Yoram Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” The Journal of Machine Learning Research, vol. 12, pp. 2121–2159, 2011.
[15] Anthony J Robinson, “An application of recurrent nets to phone probability estimation,” Neural Networks, IEEE Transactions on, vol. 5, no. 2, pp. 298–305, 1994.
[16] Sepp Hochreiter and Jぴurgen Schmidhuber, “Long short term memory,” Neural computation, vol. 9, no. 8, pp. 1735 1780, 1997. 62
[17] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580, 2012.
[18] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[19] Abdel-rahman Mohamed, Tara N Sainath, George Dahl, Bhuvana Ramabhadran, Geoffrey E Hinton, and Michael A Picheny, “Deep belief networks using discriminative features for phone recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on. IEEE, 2011, pp. 5060–5063.
[20] Abdel-rahman Mohamed, George E Dahl, and Geoffrey Hinton, “Acoustic modeling using deep belief networks,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. 20, no. 1, pp. 14–22, 2012.
[21] George E Dahl, Dong Yu, Li Deng, and Alex Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. 20, no. 1, pp. 30–42, 2012.
[22] Najim Dehak, Patrick Kenny, R´eda Dehak, Pierre Dumouchel, and Pierre Ouellet, “Front-end factor analysis for speaker verification,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. 19, no. 4, pp. 788 798, 2011.
[23] Kanishka Rao, Fuchun Peng, Hasim Sak, and Franc¸oise Beaufays, “Grapheme-tophoneme conversion using long short term memory recurrent neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4225–4229.
[24] Ved Mitra, Gangadharan Sivaraman, Hosung Nam, Carol Espy-Wilson, and Elliot Saltzman, “Articulatory features from deep neural networks and their role in speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 3017–3021.
[25] Tanja Schultz, Ngoc Thang Vu, and Tim Schlippe, “Globalphone: A multilingual text & speech database in 20 languages,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 8126–8130.
[26] Alan Wrench and Korin Richmond, “Continuous speech recognition using articulatory data,” 2000.
[27] Leonardo Badino, Claudia Canevari, Luciano Fadiga, and Giorgio Metta, “Integrating articulatory data in deep neural network-based acoustic modeling,” Computer Speech & Language, vol. 36, pp. 173–195, 2016.
[28] Joe Frankel and Simon King, “Asr-articulatory speech recognition,” 2001.
[29] Matthew Richardson, Jeff Bilmes, and Chris Diorio, “Hidden-articulator markov models for speech recognition,” Speech Communication, vol. 41, no. 2, pp. 511– 529, 2003.
[30] Kevin Erler and George H Freeman, “An hmm-based speech recognizer using overlapping articulatory features,” The Journal of the Acoustical Society of America, vol. 100, no. 4, pp. 2500–2513, 1996.
[31] Supphanat Kanokphara and Julie Carson-Berndsen, “Better hmm-based articulatory feature extraction with context-dependent model.,” in FLAIRS Conference, 2005, pp.
370–374.
[32] R Caruana, “Multitask learning: A knowledge-based source of inductive bias1,” in Proceedings of the Tenth International Conference on Machine Learning. Citeseer,
pp. 41–48.
[33] Gokhan Tur, “Multitask learning for spoken language understanding,” in Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on. IEEE, 2006, vol. 1, pp. I–I.
[34] Xiao Li, Ye-Yi Wang, and Gぴokhan Tぴur, “Multi-task learning for spoken language understanding with shared slots.,” in INTERSPEECH, 2011, vol. 20, p. 1.
[35] Ronan Collobert and Jason Weston, “A unified architecture for natural language processing: Deep neural networks with multitask learning,” in Proceedings of the 25th international conference on Machine learning. ACM, 2008, pp. 160–167.
[36] Zhizheng Wu, Cassia Valentini-Botinhao, Oliver Watts, and Simon King, “Deep neural networks employing multi-task learning and stacked bottleneck features for speech synthesis,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4460–4464.
[37] John S Garofolo et al., “Getting started with the darpa timit cd-rom: An acoustic phonetic continuous speech database,” National Institute of Standards and Technology (NIST), Gaithersburgh, MD, vol. 107, 1988.
[38] Jui-Ting Huang, Jinyu Li, Dong Yu, Li Deng, and Yifan Gong, “Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 7304–7308.
[39] Dongpeng Chen, Brian Mak, Cheung-Chi Leung, and Sunil Sivadas, “Joint acoustic modeling of triphones and trigraphemes by multi-task learning deep neural networks for low-resource speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 5592–5596.
[40] Zhiyuan Tang, Lantian Li, and Dong Wang, “Multi-task recurrent model for speech and speaker recognition,” arXiv preprint arXiv:1603.09643, 2016.
[41] George E Dahl, Tara N Sainath, and Geoffrey E Hinton, “Improving deep neural networks for lvcsr using rectified linear units and dropout,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 8609–8613.
[42] I-Fan Chen and Hsin-Min Wang, “Articulatory feature asynchrony analysis and compensation in detection-based asr.,” in INTERSPEECH, 2009, pp. 3059–3062.
[43] Katrin Kirchhoff, “Robust speech recognition using articulatory information,” 1999.
[44] Katrin Kirchhoff, Gernot A Fink, and Gerhard Sagerer, “Combining acoustic and articulatory feature information for robust speech recognition,” Speech Communication, vol. 37, no. 3, pp. 303–319, 2002.
[45] Kun Li, “The use of multi-distribution deep neural networks for segmental and suprasegmental mispronunciation detection and diagnosis in l2 english speech,” 2015.
[46] Frantisek Gr´ezl, Martin Karafi´at, Stanislav Kont´ar, and Jan Cernocky, “Probabilistic and bottle-neck features for lvcsr of meetings,” in Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on. IEEE, 2007, vol. 4, pp. IV–757.
[47] Frantiˇsek Gr´ezl and Petr Fousek, “Optimizing bottle neck features for lvcsr,” in Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on. IEEE, 2008, pp. 4729–4732.
[48] Dong Yu and Michael L Seltzer, “Improved bottleneck features using pretrained deep neural networks.,” in INTERSPEECH, 2011, vol. 237, p. 240.
[49] Ngoc Thang Vu, Jochen Weiner, and Tanja Schultz, “Investigating the learning effect of multilingual bottle neck features for asr.,” in INTERSPEECH, 2014, pp. 825–829.
[50] Ngoc Thang Vu and Tanja Schultz, “Multilingual multilayer perceptron for rapid language adaptation between and across language families.,” in INTERSPEECH, 2013, pp. 515–519.
[51] Jie Li, Rong Zheng, Bo Xu, et al., “Investigation of cross-lingual bottleneck features in hybrid asr systems.,” in INTERSPEECH, 2014, pp. 1395–1399.
[52] Tasha Nagamine, Michael L Seltzer, and Nima Mesgarani, “Exploring how deep neural networks form phonemic categories,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[53] Cheng-Tao Chung, Wei-Ning Hsu, Cheng-Yi Lee, and Lin Shan Lee, “Enhancing automatically discovered multi-level acoustic patterns considering context consistency with applications in spoken term detection,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 5231–5235.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/50690
dc.description.abstract: Tri-articulatory features (tri-AF) are context-dependent articulatory features. When a person speaks, the shape of the mouth changes continuously, so the same articulatory feature should differ when the neighboring phones differ. This thesis divides articulatory features into eight major categories and builds a context-dependent Hidden Markov Model for each category, from which tri-articulatory feature labels are obtained.
In speech recognition, deep neural networks (DNNs) have been widely used to build acoustic models, and training a DNN with multiple targets has been shown to improve model performance. Taking this as the basic architecture, this thesis uses triphones, graphemes, and tri-articulatory features as multiple training targets to strengthen the acoustic model.
In addition, two-stage DNN models have recently come into wide use: the first-stage DNN serves as a feature extractor, and the extracted features are combined with the acoustic features as the input to the second-stage DNN. This thesis combines acoustic features with graphemes, tri-articulatory features, monolingual bottleneck features, and multilingual bottleneck features to realize a multi-input DNN.
Finally, this thesis combines the two approaches above to realize a multi-input/multi-target DNN; the two complement each other and yield the best experimental results.
[zh_TW]
dc.description.abstract: A tri-articulatory feature (tri-AF) is a context-dependent articulatory feature. When we speak, the shape of the mouth changes continuously, so the same phone in different contexts should differ in its articulatory features. In this thesis, articulatory features are categorized into eight groups; a context-dependent Hidden Markov Model is constructed for each group, from which tri-AF labels are obtained.
In speech recognition, deep neural networks (DNNs) have been widely used for acoustic modeling, and multi-target training has been demonstrated to improve the acoustic model. Following this idea, this thesis uses triphones, tri-AFs, and graphemes as multiple training targets to enhance the acoustic model.
On the other hand, two-stage DNNs have also become popular in recent years. The first stage acts as a feature-extraction model; the extracted features are concatenated with the acoustic features to form the input of the second stage. This thesis uses graphemes, tri-AFs, monolingual bottleneck features, and multilingual bottleneck features as extra inputs to realize a multi-input DNN.
Finally, multi-target and multi-input training are combined into a multi-input/multi-target DNN, which achieves the best recognition results.
[en]
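To make the tri-AF idea concrete, here is a minimal sketch, in Python, of how context-dependent articulatory-feature triplets can be formed from a phone sequence. The PHONE_TO_AF mapping and the left-center-right label format are hypothetical illustrations; the thesis itself derives frame-level tri-AF labels from context-dependent HMMs trained for each of the eight feature groups, which this sketch does not reproduce.

```python
# Hypothetical sketch: forming context-dependent articulatory-feature
# (tri-AF) labels from a phone sequence. Only the symbolic
# left-center-right construction is shown, using a toy phone-to-AF
# mapping for a single AF group (e.g., place of articulation).
# PHONE_TO_AF and the label format are assumptions, not the thesis's.

PHONE_TO_AF = {
    "p": "bilabial", "t": "alveolar", "k": "velar",
    "a": "vowel", "i": "vowel", "sil": "sil",
}

def tri_af_labels(phones):
    """Map each phone to a left-AF/center-AF/right-AF triplet,
    analogous to triphone labels but over articulatory features."""
    padded = ["sil"] + list(phones) + ["sil"]
    return [
        f"{PHONE_TO_AF[l]}-{PHONE_TO_AF[c]}+{PHONE_TO_AF[r]}"
        for l, c, r in zip(padded, padded[1:], padded[2:])
    ]

print(tri_af_labels(["k", "a", "t"]))
# ['sil-velar+vowel', 'velar-vowel+alveolar', 'vowel-alveolar+sil']
```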
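The multi-target setup described in the abstract can be sketched as a network with shared hidden layers and one output head per target, with the per-target cross-entropy losses summed during training. The sketch below uses PyTorch; all layer sizes and target counts are assumed for illustration and are not taken from the thesis.

```python
import torch
import torch.nn as nn

class MultiTargetDNN(nn.Module):
    """Shared hidden layers feeding one softmax head per training target
    (triphone states, graphemes, tri-AF states). All sizes are assumed."""
    def __init__(self, in_dim=440, hidden=1024,
                 n_triphone=3000, n_grapheme=60, n_tri_af=500):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.heads = nn.ModuleList([
            nn.Linear(hidden, n) for n in (n_triphone, n_grapheme, n_tri_af)
        ])

    def forward(self, x):
        h = self.shared(x)
        return [head(h) for head in self.heads]

# Training step: the per-target cross-entropy losses are summed so that
# the shared layers learn a representation useful for all three targets.
model = MultiTargetDNN()
ce = nn.CrossEntropyLoss()
x = torch.randn(8, 440)  # a batch of spliced acoustic feature vectors
targets = [torch.randint(0, n, (8,)) for n in (3000, 60, 500)]
loss = sum(ce(logits, y) for logits, y in zip(model(x), targets))
loss.backward()
```

Summing the losses lets the shared layers absorb supervision from all three targets at once; per-target loss weights could also be tuned.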
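Similarly, the two-stage, multi-input setup can be sketched as follows: a first-stage network with a narrow bottleneck layer is trained, its bottleneck activations are extracted as learned features, and these are concatenated with the original acoustic features as the input to a second-stage network. Again, all dimensions below are assumptions, not values from the thesis.

```python
import torch
import torch.nn as nn

# Stage 1: a DNN with a narrow bottleneck layer. After training it on its
# own targets (e.g., graphemes, or another language's triphones for
# multilingual bottleneck features), the bottleneck activations serve as
# learned features. All dimensions are assumed.
stage1 = nn.Sequential(
    nn.Linear(440, 1024), nn.ReLU(),
    nn.Linear(1024, 40),           # the 40-dim bottleneck layer
    nn.ReLU(),
    nn.Linear(40, 1024), nn.ReLU(),
    nn.Linear(1024, 3000),         # stage-1 classification targets
)

def bottleneck_features(x):
    """Run stage 1 only up to (and including) the bottleneck linear layer."""
    for layer in list(stage1)[:3]:  # Linear, ReLU, bottleneck Linear
        x = layer(x)
    return x

# Stage 2: concatenate the extracted features with the raw acoustic
# features and train a second DNN on the combined (multi-)input.
# Detaching keeps stage-1 weights frozen during stage-2 training.
x = torch.randn(8, 440)
combined = torch.cat([x, bottleneck_features(x).detach()], dim=1)  # 440 + 40
stage2 = nn.Sequential(nn.Linear(480, 1024), nn.ReLU(), nn.Linear(1024, 3000))
logits = stage2(combined)
```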
dc.description.provenance: Made available in DSpace on 2021-06-15T12:52:51Z (GMT). No. of bitstreams: 1. ntu-105-R03942066-1.pdf: 7480550 bytes, checksum: 4a90a2af6f2caa70482afc48ea0fbf99 (MD5). Previous issue date: 2016. [en]
dc.description.tableofcontents:
Oral Examination Committee Certification
Acknowledgements
Chinese Abstract
English Abstract
Chapter 1: Introduction
  1.1 Motivation
  1.2 Research Direction
  1.3 Thesis Organization
Chapter 2: Background
  2.1 Phones and Phonemes
  2.2 The International Phonetic Alphabet
  2.3 Articulatory Feature Classifications
    2.3.1 Binary Features
    2.3.2 Multi-valued Features
    2.3.3 Phonological Elements
    2.3.4 Articulatory Trajectories
  2.4 Hidden Markov Models
    2.4.1 Model Training and Parameter Updating
    2.4.2 The Viterbi Algorithm
  2.5 Deep Neural Networks
    2.5.1 Overview
    2.5.2 Training Methods
    2.5.3 Dropout
Chapter 3: Experimental Corpus and Baseline Experiments
  3.1 Speech Recognition Architecture
    3.1.1 Feature Extraction
    3.1.2 Acoustic Model
    3.1.3 Language Model
    3.1.4 Lexicon
    3.1.5 Decoder
  3.2 Experimental Corpus and Baseline Experiments
    3.2.1 The GlobalPhone Corpus
    3.2.2 Phone Modifications Made in This Thesis
    3.2.3 Baseline Experiments
Chapter 4: Hidden Markov Models for Articulatory Features
  4.1 Related Work
  4.2 Context-Dependent HMMs for Articulatory Features
Chapter 5: Multi-target Deep Neural Networks
  5.1 Related Work
  5.2 Experiments and Analysis
    5.2.1 Triphones and Graphemes
    5.2.2 Triphones and Tri-articulatory Features
    5.2.3 Triphones, Graphemes, and Tri-articulatory Features
  5.3 Chapter Summary
Chapter 6: Multi-input Deep Neural Networks
  6.1 Overview
  6.2 Experiments and Analysis
    6.2.1 Acoustic Features and Graphemes
    6.2.2 Acoustic Features and Articulatory Features
    6.2.3 Acoustic Features and Bottleneck Features
  6.3 Overall Comparison and Analysis
  6.4 Chapter Summary
Chapter 7: Conclusion and Future Work
  7.1 Conclusions and Contributions
  7.2 Future Work
References
Appendix A: The Backpropagation Algorithm
Appendix B: Phone-to-Articulatory-Feature Mapping Table
dc.language.iso: zh-TW
dc.subject: articulatory feature [zh_TW]
dc.subject: multi-input deep neural network [zh_TW]
dc.subject: multi-target deep neural network [zh_TW]
dc.subject: deep neural network [zh_TW]
dc.subject: bottleneck feature [zh_TW]
dc.subject: articulatory feature [en]
dc.subject: bottleneck feature [en]
dc.subject: deep neural network (DNN) [en]
dc.subject: multi-target DNN [en]
dc.subject: multi-input DNN [en]
dc.title: Tri-articulatory Features and Multi-input/Multi-target Deep Neural Networks [zh_TW]
dc.title: Tri-Articulatory Feature and Multi-input/Multi-target Deep Neural Network [en]
dc.type: Thesis
dc.date.schoolyear: 104-2
dc.description.degree: Master's
dc.contributor.oralexamcommittee: 陳信宏, 鄭秋豫, 李宏毅, 王小川, 簡仁宗
dc.subject.keyword: articulatory feature, bottleneck feature, deep neural network, multi-target deep neural network, multi-input deep neural network [zh_TW]
dc.subject.keyword: articulatory feature, bottleneck feature, deep neural network (DNN), multi-target DNN, multi-input DNN [en]
dc.relation.page: 73
dc.identifier.doi: 10.6342/NTU201601024
dc.rights.note: Paid authorization (restricted access)
dc.date.accepted: 2016-07-19
dc.contributor.author-college: College of Electrical Engineering and Computer Science [zh_TW]
dc.contributor.author-dept: Graduate Institute of Communication Engineering [zh_TW]
Appears in collections: Graduate Institute of Communication Engineering

Files in this item:
File | Size | Format | Access
ntu-105-1.pdf | 7.31 MB | Adobe PDF | Restricted (not authorized for public access)