NTU Theses and Dissertations Repository › 電機資訊學院 (College of Electrical Engineering and Computer Science) › 電信工程學研究所 (Graduate Institute of Communication Engineering)
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/4884
Full metadata record (DC field: value [language])
dc.contributor.advisor: 李琳山
dc.contributor.author: Ching-Feng Yeh [en]
dc.contributor.author: 葉青峰 [zh_TW]
dc.date.accessioned: 2021-05-14T17:49:35Z
dc.date.available: 2015-12-01
dc.date.available: 2021-05-14T17:49:35Z
dc.date.copyright: 2015-12-01
dc.date.issued: 2015
dc.date.submitted: 2015-09-10
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/4884
dc.description.abstract: This thesis investigates the recognition of a common type of bilingual code-switched speech: most of the speech signal in the speaker's utterances is in the host language (usually the speaker's native language), but a small number of words or phrases are in the guest language (usually the speaker's second language). In this situation, recognition is difficult not only because the languages switch frequently within an utterance, but also because far less data is available for the guest language, so recognition accuracy on the guest language is markedly lower. This thesis proposes an integrated framework for recognizing such highly imbalanced bilingual code-switched speech. It includes unit merging at different levels of acoustic modeling (model, state, Gaussian) to achieve cross-lingual data sharing; unit recovery to strengthen and reconstruct the merged acoustic models; occupancy-based unit ranking for more flexible cross-lingual and intra-lingual data sharing; and estimation of frame-level language posteriors using blurred posteriorgram features. In addition, these approaches are extended to today's two most successful deep neural network paradigms: as a bottleneck feature extractor and as a hidden Markov model state estimator. All proposed approaches are compared under unified conditions on a corpus recorded in a real-world scenario, and the experimental results show that the proposed framework substantially improves the recognition accuracy of bilingual code-switched speech. [zh_TW]
dc.description.abstract: This thesis considers the recognition of a widely observed type of bilingual code-switched speech: the speaker speaks primarily in the host language (usually his native language), but inserts a few words or phrases in the guest language (usually his second language) into many utterances. In this case, not only are the languages switched back and forth within an utterance, making language identification difficult, but much less data is available for the guest language, which results in poor recognition accuracy for the guest-language portions. In this thesis, we propose an integrated overall framework for recognizing such highly imbalanced code-switched speech. This includes unit-merging approaches at three levels of acoustic modeling (triphone models, HMM states, and Gaussians) for cross-lingual data sharing; unit recovery for reconstructing the identity of units of the two languages after merging; unit occupancy ranking, which offers much more flexible data sharing between units both across and within languages based on the accumulated occupancy of the HMM states; and estimation of frame-level language posteriors using Blurred Posteriorgram Features (BPFs) for use in decoding. In addition, we evaluate two approaches extending the above HMM-based approaches to state-of-the-art deep neural networks (DNNs): using bottleneck features in an HMM/GMM system, and modeling context-dependent HMM states directly. We present a complete set of experimental results comparing all approaches for a real-world application scenario under unified conditions, and show that the proposed approaches achieve very good improvement. [en]
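The cross-lingual unit merging described in the abstract rests on measuring distances between acoustic units of the two languages and sharing data between the closest pairs. As a rough, hypothetical sketch (not the thesis's actual per-level distance measures), the following pairs each guest-language unit, modeled as a diagonal-covariance Gaussian, with its nearest host-language unit under a symmetric KL divergence; all unit names and the merge threshold are illustrative only:

```python
import numpy as np

def symmetric_gaussian_kl(mu_p, var_p, mu_q, var_q):
    """Symmetric KL divergence between two diagonal-covariance Gaussians."""
    def kl(m1, v1, m2, v2):
        return 0.5 * np.sum(np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)
    return kl(mu_p, var_p, mu_q, var_q) + kl(mu_q, var_q, mu_p, var_p)

def merge_closest_units(host_units, guest_units, threshold):
    """Pair each guest-language unit with its nearest host-language unit,
    merging only when the distance falls below the threshold.
    Units are dicts mapping name -> (mean vector, diagonal variance vector).
    Returns (guest_name, host_name, distance) triples for merged pairs."""
    merged = []
    for g_name, (g_mu, g_var) in guest_units.items():
        best_name, best_dist = None, float("inf")
        for h_name, (h_mu, h_var) in host_units.items():
            d = symmetric_gaussian_kl(g_mu, g_var, h_mu, h_var)
            if d < best_dist:
                best_name, best_dist = h_name, d
        if best_dist < threshold:  # close enough to share training data
            merged.append((g_name, best_name, best_dist))
    return merged
```

In the thesis itself, distances are defined separately at the model, state, and Gaussian levels, and merging decisions additionally account for unit occupancy; this sketch only illustrates the nearest-neighbor pairing step common to all levels.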
dc.description.provenance: Made available in DSpace on 2021-05-14T17:49:35Z (GMT). No. of bitstreams: 1. ntu-104-D00942013-1.pdf: 10778160 bytes, checksum: 0ddaab9feb66bcbf0f80b4e1a173a0ec (MD5). Previous issue date: 2015 [en]
dc.description.tableofcontents:
口試委員會審定書 (Oral Examination Committee Certification)
摘要 (Chinese Abstract)
Abstract
1 Introduction and Overview of the Framework
1.1 Introduction
1.2 Baseline System
1.3 Overview of the Proposed Framework for Bilingual Speech Recognition
1.4 Chapter Outline
2 Target Corpora and Baseline Experimental Results
2.1 Target Code-switched Bilingual Corpora
2.2 Experimental Environment Setup
2.3 Baseline Results
3 HMM-based Cross-lingual Acoustic Modeling
3.1 Unit Merging
3.2 Unit Recovery
3.3 Distance Calculation Between Acoustic Units on Different Levels
3.3.1 Model Level Distance
3.3.2 State Level Distance
3.3.3 Gaussian Level Distance
3.4 Unit Occupancy Ranking for Unit Classification
3.5 Unit Occupancy Analysis
4 Experimental Results for HMM-based Cross-lingual Acoustic Modeling
4.1 Unit Merging on Different Levels (without Unit Recovery and Occupancy Ranking)
4.2 Unit Recovery on Different Levels (without Occupancy Ranking)
4.3 Unit Occupancy Ranking on Gaussian Level
5 Frame-level Language Posterior Estimates and Experimental Results
5.1 Frame-level Language Identification by Baseline System
5.2 Utilizing Frame-level Language Posterior Estimates in Decoding
5.3 Blurred Posteriorgram Features (BPFs)
5.4 Frame-level Language Identification Analysis
5.5 Experimental Results
6 DNN-based Cross-lingual Acoustic Modeling and Experimental Results
6.1 DNN-based Acoustic Modeling
6.2 Code-switched CD-DNN-HMM
6.3 Code-switched BF-HMM/GMM
6.4 Unit Merging on State Level for CD-DNN-HMM
6.5 Experimental Results
7 Conclusion
References
dc.language.iso: en
dc.subject: 語言識別 (Language Identification) [zh_TW]
dc.subject: 語音辨識 (Speech Recognition) [zh_TW]
dc.subject: 雙語混合 (Code-switching) [zh_TW]
dc.subject: 聲學模型 (Acoustic Modeling) [zh_TW]
dc.subject: Speech Recognition [en]
dc.subject: Language Identification [en]
dc.subject: Acoustic Modeling [en]
dc.subject: Code-switching [en]
dc.title: 使用跨語言聲學模型及音框層級語言識別來辨識高度不平衡雙語混合課程之整合性架構 [zh_TW]
dc.title: An Integrated Framework for Recognizing Highly Imbalanced Bilingual Code-switched Lectures with Cross-language Acoustic Modeling and Frame-level Language Identification [en]
dc.type: Thesis
dc.date.schoolyear: 104-1
dc.description.degree: 博士 (Doctoral)
dc.contributor.oralexamcommittee: 李宏毅, 貝蘇章, 林宗男, 張智星
dc.subject.keyword: 語音辨識, 雙語混合, 聲學模型, 語言識別 (Speech Recognition, Code-switching, Acoustic Modeling, Language Identification) [zh_TW]
dc.subject.keyword: Speech Recognition, Code-switching, Acoustic Modeling, Language Identification [en]
dc.relation.page: 74
dc.rights.note: Authorization granted (publicly available worldwide)
dc.date.accepted: 2015-09-10
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) [zh_TW]
dc.contributor.author-dept: 電信工程學研究所 (Graduate Institute of Communication Engineering) [zh_TW]
Appears in collections: 電信工程學研究所 (Graduate Institute of Communication Engineering)

Files in this item:
File: ntu-104-1.pdf | Size: 10.53 MB | Format: Adobe PDF


All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
