Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/48026
Full metadata record
DC field: value [language]
dc.contributor.advisor: 李琳山
dc.contributor.author: Shang-Wen Li [en]
dc.contributor.author: 李尚文 [zh_TW]
dc.date.accessioned: 2021-06-15T06:44:40Z
dc.date.available: 2014-07-07
dc.date.copyright: 2011-07-07
dc.date.issued: 2011
dc.date.submitted: 2011-06-29
dc.identifier.citation:
[1] Defense Advanced Research Projects Agency, http://www.darpa.mil.
[2] H. Hermansky and S. Sharma, "Temporal patterns (TRAPS) in ASR of noisy speech," in ICASSP, 1999.
[3] H. Hermansky and N. Morgan, "RASTA processing of speech," IEEE Trans. Speech and Audio Processing, pp. 578–589, 1994.
[4] H. Hermansky and P. Fousek, "Multi-resolution RASTA filtering for tandem-based ASR," in Interspeech, 2005.
[5] H. Hermansky, D. Ellis, and S. Sharma, "Tandem connectionist feature extraction for conventional HMM systems," in ICASSP, 2000.
[6] D. A. Depireux, J. Z. Simon, D. J. Klein, and S. A. Shamma, "Spectro-temporal response field characterization with dynamic ripples in ferret primary auditory cortex," J. Neurophysiology, vol. 85, pp. 1220–1234, 2001.
[7] S. Thomas, S. Ganapathy, and H. Hermansky, "Recognition of reverberant speech using frequency domain linear prediction," IEEE Signal Processing Letters, vol. 15, pp. 681–684, 2008.
[8] S. Ganapathy, S. Thomas, and H. Hermansky, "Robust spectro-temporal features based on autoregressive models of Hilbert envelopes," in ICASSP, 2010.
[9] X. Domont, M. Heckmann, F. Joublin, and C. Goerick, "Hierarchical spectro-temporal features for robust speech recognition," in ICASSP, 2008.
[10] M. Kleinschmidt and D. Gelbart, "Improving word accuracy with Gabor feature extraction," in ICSLP, 2002.
[11] S. Zhao and N. Morgan, "Multi-stream spectro-temporal features for robust speech recognition," in Interspeech, 2008.
[12] B. Meyer and B. Kollmeier, "Complementarity of MFCC, PLP and Gabor features in the presence of speech-intrinsic variabilities," in Interspeech, 2009.
[13] S. Zhao, S. Ravuri, and N. Morgan, "Multi-stream to many-stream: Using spectro-temporal features for ASR," in ICASSP, 2009.
[14] S. Thomas, N. Mesgarani, and H. Hermansky, "A multistream multiresolution framework for phoneme recognition," in Interspeech, 2010.
[15] S. Ravuri and N. Morgan, "Using spectro-temporal features to improve AFE feature extraction for ASR," in Interspeech, 2010.
[16] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice Hall, 1993.
[17] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, 1999.
[18] S.-Y. Chang and L.-S. Lee, "Data-driven clustered hierarchical tandem system for LVCSR," in Interspeech, 2008.
[19] 張碩尹, "Mandarin large-vocabulary speech recognition with tandem clustered hierarchical multilayer perceptron acoustic models," M.S. thesis, Graduate Institute of Communication Engineering, National Taiwan University, 2009.
[20] International Computer Science Institute (ICSI), http://www.icsi.berkeley.edu/Speech/qn.html.
[21] Cambridge University Engineering Dept. (CUED), Machine Intelligence Laboratory, "HTK," http://htk.eng.cam.ac.uk/.
[22] H. Misra, H. Bourlard, and V. Tyagi, "New entropy based combination rules in HMM/ANN multi-stream ASR," in ICASSP, 2003.
[23] SRI Speech Technology and Research Laboratory, "SRILM," http://www.speech.sri.com/projects/srilm/.
[24] 潘奕誠, "One-pass and word-graph-based search algorithms for large-vocabulary continuous Mandarin speech recognition," M.S. thesis, Graduate Institute of Computer Science and Information Engineering, National Taiwan University, 2002.
[25] S. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recognizer," IEEE Trans. Acoustics, Speech and Signal Processing, vol. 35, pp. 400–401, 1987.
[26] M.-Y. Hwang, G. Peng, W. Wang, A. Faria, A. Heidel, and M. Ostendorf, "Building a highly accurate Mandarin speech recognizer," in ASRU, 2007.
[27] ESPS Version 5.0 Program Manual, http://www.speech.kth.se/software/.
[28] A. V. Nefian, L.-H. Liang, X.-B. Pi, X.-X. Liu, C. Mao, and K. Murphy, "A coupled HMM for audio-visual speech recognition," in ICASSP, 2002.
[29] P.-S. Huang, X.-D. Zhuang, and M. Hasegawa-Johnson, "Improving acoustic event detection using generalizable visual features and multi-modality modeling," in ICASSP, 2011.
[30] J. Frankel, D. Wang, and S. King, "Growing bottleneck features for tandem ASR," in Interspeech, 2008.
[31] O. Vinyals and S. Ravuri, "Comparing multilayer perceptron to deep belief network tandem features for robust ASR," in ICASSP, 2011.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/48026
dc.description.abstract: In conventional speech recognition, Mel-frequency cepstral coefficient (MFCC) features are extracted from the speech signal, and statistical models trained on these features are used for recognition. MFCC features, however, have some inherent limitations; for example, the information they capture is confined to a short time span. In recent years, many studies have obtained richer features, and thereby better recognition performance, by extracting longer-term information from the signal or by capturing temporal, spectral, and spectro-temporal modulations.
In this thesis, Gabor filters are used to extract features rich in spectro-temporal information. A multilayer perceptron learns how these features vary across phonemes and outputs phoneme posterior probability vectors, and a tandem system integrates the resulting Gabor posteriors with MFCC posteriors, which is found to improve recognition accuracy. A clustered hierarchical multilayer perceptron is further used to estimate more precise posteriors for easily confused phonemes, improving performance again. Finally, pitch features are added to the feature set and tonal variation is modeled in the acoustic units; the resulting system achieves a significant improvement in accuracy on Mandarin large-vocabulary broadcast news recognition. [zh_TW]
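For reference, the spectro-temporal Gabor filters mentioned in the abstract are commonly written in the following form; this is a sketch of the standard complex Gabor prototype with a Gaussian envelope, and the exact parameterization used in the thesis may differ. With time-frame index $n$, frequency-channel index $k$, center $(n_0, k_0)$, modulation frequencies $\omega_n, \omega_k$, and envelope widths $\sigma_n, \sigma_k$:

$$ g(n,k) = \exp\!\left(-\frac{(n-n_0)^2}{2\sigma_n^2} - \frac{(k-k_0)^2}{2\sigma_k^2}\right)\, \exp\!\bigl(i\,[\omega_n (n-n_0) + \omega_k (k-k_0)]\bigr) $$

Each Gabor feature is then obtained by correlating such a filter with the log-mel spectrogram and taking the real part or magnitude as the feature value.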
dc.description.abstract: In conventional speech recognition, MFCC features are used to extract speech information from the waveform, and statistical models are trained on these features for decoding. However, MFCC features retain only information within a short time span. Much recent research has focused on extracting long-term information from the speech signal, or variations in spectral, temporal, or spectro-temporal modulation frequency, and these studies achieve significant performance improvements.
Here, we utilize Gabor filters to extract Gabor features, which are rich in spectro-temporal information. An MLP is trained to learn the variation of Gabor features among different phonemes; the outputs of the MLP are Gabor posteriors. We use a tandem system to integrate the Gabor and MFCC posteriors and achieve better performance in our speech recognition system. Furthermore, we estimate posteriors more accurately with a clustered hierarchical MLP, which emphasizes the classification of error-prone phoneme pairs, and thus obtain even better recognition performance. Finally, we add pitch features during MLP training and adopt tonal acoustic units. With these modifications, we significantly improve performance on Mandarin large-vocabulary broadcast news recognition. [en]
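To make the tandem integration described above concrete, here is a minimal Python sketch of how two frame-level posterior streams could be merged into tandem features for a GMM-HMM back end. The names and choices (tandem_features, n_components, simple posterior averaging) are illustrative assumptions, not the thesis's actual pipeline.

import numpy as np
from sklearn.decomposition import PCA

def tandem_features(post_gabor, post_mfcc, mfcc, n_components=25, eps=1e-10):
    # post_gabor, post_mfcc: (n_frames, n_phones) phoneme posteriors from two MLPs
    # mfcc: (n_frames, n_mfcc) spectral features used by the GMM-HMM back end
    # 1. Average the two posterior streams (one simple combination rule;
    #    entropy-weighted combination is another option).
    post = 0.5 * (post_gabor + post_mfcc)
    # 2. Log compression makes the highly skewed posteriors more Gaussian-like,
    #    which suits a GMM-HMM acoustic model.
    log_post = np.log(post + eps)
    # 3. Decorrelate and reduce dimensionality with PCA (standing in for the
    #    Karhunen-Loeve transform commonly used in tandem systems).
    reduced = PCA(n_components=n_components).fit_transform(log_post)
    # 4. Append the decorrelated posterior features to the original MFCCs.
    return np.hstack([mfcc, reduced])

# Toy usage with random data standing in for real posteriors and features.
T, P, D = 200, 70, 39
feats = tandem_features(np.random.dirichlet(np.ones(P), T),
                        np.random.dirichlet(np.ones(P), T),
                        np.random.randn(T, D))
print(feats.shape)  # (200, 64)

Log compression and decorrelation of posteriors before appending them to the spectral features are the usual tandem steps; the thesis additionally replaces the single MLP with a clustered hierarchical MLP and augments the MLP input with pitch features.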
dc.description.provenance: Made available in DSpace on 2021-06-15T06:44:40Z (GMT). No. of bitstreams: 1. ntu-100-R98942035-1.pdf: 17213022 bytes, checksum: 3f732e46a75682dbc308aea40b5fa4af (MD5). Previous issue date: 2011. [en]
dc.description.tableofcontents:
Abstract (in Chinese)
Chapter 1: Introduction
1.1 Motivation
1.2 Principles of speech recognition
1.3 Feature extraction
1.3.1 Mel-frequency cepstral coefficient (MFCC) features
1.3.2 Temporal features
1.3.3 Spectro-temporal features
1.4 Acoustic models
1.5 Language models
1.6 Approach and results
1.7 Thesis organization
Chapter 2: Background
2.1 Gabor features
2.2 Multilayer perceptrons
2.3 Clustered hierarchical multilayer perceptrons
2.3.1 Phoneme distance
2.3.2 Hierarchical clustering
2.3.3 Clustered hierarchical multilayer perceptrons
2.4 Tandem systems
Chapter 3: Integration of Gabor features and Mel-frequency cepstral coefficients
3.1 Speech corpus and model settings
3.1.1 Corpus
3.1.2 Training and recognition tools
3.1.3 Acoustic model settings
3.2 Posterior probabilities
3.2.1 MFCC posteriors
3.2.2 Gabor posteriors
3.3 Tandem system integrating the posteriors
3.4 Experimental results
3.5 Complementarity analysis
3.6 Chapter summary
Chapter 4: Integration of clustered hierarchical multilayer perceptrons with different features
4.1 Speech corpus and model settings
4.1.1 Corpus
4.1.2 Training and recognition tools
4.1.3 Acoustic model settings
4.1.4 Lexicon and language model settings
4.2 Posterior probabilities
4.2.1 Posteriors from multilayer perceptrons
4.2.2 Posteriors from clustered hierarchical multilayer perceptrons
4.3 Experimental results
4.4 System analysis
4.4.1 Frame accuracy comparison of the different posteriors
4.4.2 Mean and variance normalization of posteriors
4.5 Chapter summary
Chapter 5: Integration of tonal features and tonal acoustic units
5.1 Speech corpus and model settings
5.1.1 Tones and phoneme set
5.1.2 Corpus and baseline experimental settings
5.1.3 Lexicon and language model settings
5.2 Pitch features
5.3 Posterior probabilities
5.3.1 Gabor and MFCC posteriors
5.3.2 Posteriors enhanced with pitch features
5.4 Tandem system integrating the posteriors
5.5 Experimental results
5.6 Analysis of different features and acoustic units
5.7 Chapter summary
Chapter 6: Conclusion and future work
6.1 Conclusion
6.2 Future work
References
dc.language.iso: zh-TW
dc.title: 用串接式系統整合加伯與基頻特徵之國語語音辨識 [zh_TW]
dc.title: Integrating Gabor and Pitch Features in Tandem Systems for Mandarin Speech Recognition [en]
dc.type: Thesis
dc.date.schoolyear: 99-2
dc.description.degree: Master's
dc.contributor.oralexamcommittee: 陳信宏, 鄭秋豫, 王小川, 簡仁宗
dc.subject.keyword: speech recognition, feature extraction, tandem system [zh_TW]
dc.subject.keyword: speech recognition, feature extraction, Tandem system [en]
dc.relation.page: 58
dc.rights.note: 有償授權 (authorized for a fee)
dc.date.accepted: 2011-06-29
dc.contributor.author-college: College of Electrical Engineering and Computer Science [zh_TW]
dc.contributor.author-dept: Graduate Institute of Communication Engineering [zh_TW]
Appears in collections: Graduate Institute of Communication Engineering

Files in this item:
File: ntu-100-1.pdf (currently not authorized for public access)
Size: 16.81 MB
Format: Adobe PDF

