  1. NTU Theses and Dissertations Repository
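  2. College of Electrical Engineering and Computer Science (電機資訊學院)
  3. Graduate Institute of Communication Engineering (電信工程學研究所)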
Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/41758
Full metadata record
DC field: value [language]
dc.contributor.advisor: 李琳山 (Lin-Shan Lee)
dc.contributor.author: Che-Kuang Lin [en]
dc.contributor.author: 林哲光 [zh_TW]
dc.date.accessioned: 2021-06-15T00:30:13Z
dc.date.available: 2009-02-03
dc.date.copyright: 2009-02-03
dc.date.issued: 2009
dc.date.submitted: 2009-01-19
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/41758
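dc.description.abstract (original in zh_TW): An ideal speech recognition system must be able to handle naturally occurring spoken language, that is, spontaneous speech. Compared with carefully read or prepared speech, spontaneous speech has characteristics that make it harder for a system to process. One important characteristic is the edit disfluency, which occurs frequently throughout spontaneous speech. To interpret the speaker's intended meaning correctly and without distortion, the system must detect such edit disfluencies and handle them properly.
In this dissertation, we propose a framework for handling edit disfluencies in spontaneous speech. By locating the disfluency interruption points (IPs) in the speech and comparing the words spoken before and after them, the framework recovers the structure of the utterance and removes the edit words (the reparandum and optional editing terms), that is, words that are redundant or that the speaker misspoke and wished to correct, which makes the meaning easier to understand. Within this framework, we propose an effective set of features and models for detecting the IPs of edit disfluencies, and we use the detection results to improve the accuracy and intelligibility of the recognition output. The feature set is carefully designed to take into account the linguistic characteristics specific to Mandarin speech. The IP detection model improves on two well-known methods from machine learning research: decision trees (DTs) and maximum entropy models (MEs). By combining the strengths of both, we obtain a model better suited to IP detection: the decision-tree-based maximum entropy model (DT-ME). In addition, we propose a method for analyzing the prosodic structure of speech: statistical latent prosodic modeling (LPM). By analyzing the speaker's prosody during normal fluent speech and comparing it with the prosody where the flow of speech is interrupted, we further improve the DT-ME model and obtain a more accurate detection model. Finally, using conditional random fields (CRFs), we analyze the relationships between the words before and after an IP, identify and remove the edit words, and thereby recover the utterance structure and capture its meaning correctly.
Experimental results on Mandarin conversational speech show that the proposed framework effectively detects and handles edit disfluencies in spoken Mandarin and significantly reduces the detection error rate. The better grasp of utterance structure also yields better recognition results (higher recognition accuracy). We further examine the prosody analyzed by the proposed latent prosodic model. We also analyze how the features that are effective for detecting different types of edit disfluency differ, in order to better understand how these disfluencies differ in character.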
dc.description.abstract [en]: Detection of edit disfluencies is one of the keys to transcribing spontaneous utterances. In this dissertation, we present improved features and models to detect edit disfluencies and enhance transcription of spontaneous Mandarin speech using hypothesized disfluency interruption points (IPs) and edit word detection. A comprehensive set of prosodic features that takes into account the special characteristics of edit disfluencies in Mandarin is developed, and an improved model combining decision trees and maximum entropy is proposed to detect IPs. This model is further adapted to desired prosodic conditions by latent prosodic modeling, a probabilistic framework for analyzing speech prosody in terms of a set of latent prosodic states. These techniques contribute to higher recognition accuracy (by rescoring with the hypothesized IPs) and better edit word detection (using conditional random fields defined on Chinese characters) in the final transcription, as verified by experiments on a spontaneous Mandarin speech corpus. We analyze in detail the latent states produced by the proposed latent prosodic modeling, and we further analyze the relevance of the proposed prosodic features to each type of edit disfluency to gain insight into the characteristics of the various disfluency categories.
dc.description.provenance [en]: Made available in DSpace on 2021-06-15T00:30:13Z (GMT). No. of bitstreams: 1
ntu-98-F91942036-1.pdf: 3009649 bytes, checksum: b77da2228301813a4300ca7eec9250a3 (MD5)
Previous issue date: 2009
dc.description.tableofcontents:
ABSTRACT I
CHINESE ABSTRACT (中文摘要) II
TABLE OF CONTENTS V
LIST OF FIGURES VII
LIST OF TABLES VIII
1 INTRODUCTION 1
1.1 BACKGROUND 1
1.2 PRIMARY ACHIEVEMENTS OF THIS DISSERTATION 4
1.3 CHAPTER OUTLINE 7
2 BACKGROUND REVIEW AND EXPERIMENTAL ENVIRONMENTS 9
2.1 INTRODUCTION 9
2.2 REVIEW OF EXISTING APPROACHES FOR HANDLING DISFLUENCY IN SPEECH 9
2.2.1 Spontaneous speech processing 9
2.2.2 Decision Trees (DTs) for Classification 11
2.2.3 Maximum Entropy Models (MEs) for Classification 12
2.3 EXPERIMENTAL ENVIRONMENTS 14
2.3.1 Speech Corpora 14
2.3.2 Baseline System 15
2.4 SUMMARY 16
3 OVERVIEW OF THE PROPOSED FRAMEWORK FOR EDIT DISFLUENCY DETECTION 17
4 INTERRUPTION POINT DETECTION 21
4.1 INTRODUCTION 21
4.2 PROSODIC FEATURE EXTRACTION 21
4.2.1 Pitch-related feature 22
4.2.2 Duration-related features 24
4.2.3 Energy-related features 25
4.3 INITIAL INTERRUPTION POINT DETECTION MODELS 27
4.3.1 Integration of DT and ME (DT-ME) 28
4.4 LATENT PROSODIC MODELING (LPM) 29
4.5 USING AN LPM-ADAPTED MODEL FOR INTERRUPTION POINT DETECTION 34
4.6 EXPERIMENTAL RESULTS 37
4.6.1 Analysis of LPM Latent Prosodic States 37
4.6.2 Prosodic Feature Set Comparison in IP Detection 41
4.6.3 Initial IP Detection Model Comparison 43
4.6.4 Refined LPM-adapted DT-ME models for IP Detection 44
4.6.5 Comparison Between Lexical and Prosodic Information 45
4.6.6 Feature Analysis 46
4.7 SUMMARY 50
5 ENHANCED TRANSCRIPTION WITH EDIT WORD DETECTION 51
5.1 INTRODUCTION 51
5.2 SECOND-PASS RECOGNITION USING HYPOTHESIZED IPS 51
5.3 EDIT WORD DETECTION 53
5.4 EXPERIMENTAL RESULTS 56
5.4.1 Speech Recognition with IP Detection 56
5.4.2 Edit Word Detection 57
5.5 SUMMARY 59
6 CONCLUSIONS AND FUTURE WORKS 61
6.1 CONCLUSIONS 61
6.2 FUTURE WORKS 61
APPENDIX A. LIST OF THE ACOUSTIC MODELS FOR INITIALS/FINALS 65
BIBLIOGRAPHY 67
dc.language.iso: en
dc.subject: 自發性語音 (spontaneous speech) [zh_TW]
dc.subject: 修正性不流暢 (edit disfluency) [zh_TW]
dc.subject: 不流暢的中斷點 (disfluency interruption point) [zh_TW]
dc.subject: 抑揚頓挫 (prosody) [zh_TW]
dc.subject: 語音辨識 (speech recognition) [zh_TW]
dc.subject: spontaneous speech [en]
dc.subject: edit disfluency [en]
dc.subject: interruption point detection [en]
dc.subject: prosody [en]
dc.subject: speech recognition [en]
dc.title: 中文自發性語音辨識中偵測修正性不流暢現象之新方法 [zh_TW]
dc.title: New Approaches for Detecting Edit Disfluencies in Transcribing Spontaneous Mandarin Speech [en]
dc.type: Thesis
dc.date.schoolyear: 97-1
dc.description.degree: 博士 (Doctoral)
dc.contributor.oralexamcommittee: 貝蘇章 (Soo-Chang Pei), 陳銘憲 (Ming-Syan Chen), 傅立成 (Li-Chen Fu), 鄭士康 (Shyh-Kang Jeng), 陳志宏 (Jyh-Horng Chen), 吳家麟 (Ja-Ling Wu)
dc.subject.keyword: 修正性不流暢, 不流暢的中斷點, 抑揚頓挫, 語音辨識, 自發性語音 [zh_TW]
dc.subject.keyword: edit disfluency, interruption point detection, prosody, speech recognition, spontaneous speech [en]
dc.relation.page: 74
dc.rights.note: 有償授權 (paid license)
dc.date.accepted: 2009-01-19
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) [zh_TW]
dc.contributor.author-dept: 電信工程學研究所 (Graduate Institute of Communication Engineering) [zh_TW]
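Appears in collections: Graduate Institute of Communication Engineering (電信工程學研究所)

Files in this item:
File: ntu-98-1.pdf (not authorized for public access); size: 2.94 MB; format: Adobe PDF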