Automatic Key Term Extraction and Key Term Graph Generation from Spoken Documents

Yu Huang; 黃宥

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/47903

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	李琳山
dc.contributor.author	Yu Huang	en
dc.contributor.author	黃宥	zh_TW
dc.date.accessioned	2021-06-15T06:42:46Z	-
dc.date.available	2011-08-02
dc.date.copyright	2011-08-02
dc.date.issued	2011
dc.date.submitted	2011-07-07
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/47903	-
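dc.description.abstract	This thesis studies the automatic extraction of key terms from spoken documents and the automatic generation of key term graphs. Key terms are divided into key phrases and keywords, which are extracted with different methods. For key phrase extraction, we propose branching entropy. For keyword extraction, we propose a two-stage extraction method. The first stage uses a relative coherence measure (RCM), assisted by web knowledge, to obtain an initial ranking of the keywords. The second stage uses that initial ranking to extract lexical, prosodic, and semantic features of candidate keywords from the spoken documents, then trains a classifier with machine learning to produce a re-ranking of the keywords. Given the key terms, we further train a classifier with machine learning to automatically determine the relationship between each pair of key terms and thereby generate the key term graph. Lexical features, semantic features, and features from web knowledge are extracted to describe the relationships between key terms. We find that these features are additive, and we propose a method for evaluating key term graphs.	zh_TW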
dc.description.provenance	Made available in DSpace on 2021-06-15T06:42:46Z (GMT). No. of bitstreams: 1 ntu-100-R98922015-1.pdf: 4805965 bytes, checksum: fc316a12c5c186f8caef41d1f7b8e2d4 (MD5) Previous issue date: 2011	en
dc.description.tableofcontents	Oral Examination Committee Certification; Chinese Abstract; 1. Introduction; 1.1 Motivation; 1.2 Related Work; 1.2.1 Automatic Key Term Extraction; 1.2.2 Automatic Key Term Graph Generation; 1.3 Main Approaches and Contributions of This Thesis; 1.4 Chapter Organization; 2. Background; 2.1 Probabilistic Latent Semantic Analysis Model; 2.1.1 Latent Concept Model; 2.1.2 Estimating the Latent Concept Model with the Expectation-Maximization Algorithm; 2.1.3 Comparison of Probabilistic Latent Semantic Analysis with Conventional Latent Semantic Analysis; 2.1.4 Feature Parameters Based on Probabilistic Latent Semantic Analysis; 2.2 Support Vector Machine; 2.2.1 Introduction; 2.2.2 Theoretical Derivation of the Algorithm; 2.3 Chapter Summary; 3. Unsupervised Key Term Extraction from Spoken Documents; 3.1 Introduction; 3.2 Preprocessing; 3.2.1 Stop Word Removal; 3.2.2 Stemming; 3.2.3 Part-of-Speech Filtering; 3.3 Key Phrase Extraction; 3.3.1 Branching Entropy; 3.3.2 Implementing Branching Entropy with a Suffix Tree; 3.4 Initial Ranking for Keyword Extraction; 3.4.1 Topic Coherence Measure; 3.4.2 Web-Knowledge-Assisted Weighted Topic Coherence Measure; 3.5 Re-Ranking for Keyword Extraction; 3.5.1 Keyword Feature Extraction; 3.5.2 Keyword Extraction by Machine Learning; 3.6 Experimental Setup; 3.6.1 Experimental Corpus; 3.6.2 Training and Recognition System; 3.7 Experimental Results and Analysis; 3.7.1 Evaluation Method; 3.7.2 Generation of Reference Key Terms; 3.7.3 Analysis of Results; 3.8 Chapter Summary; 4. Key Term Graphs for Spoken Documents; 4.1 Introduction; 4.2 Feature Extraction for Key Term Relationships; 4.2.1 Lexical Features; 4.2.2 Semantic Features; 4.2.3 Features from Web Knowledge; 4.3 Extracting Key Term Relationships with Support Vector Machines; 4.4 Experimental Design; 4.4.1 Generation of Reference Key Term Graphs; 4.4.2 Evaluation Method; 4.5 Experimental Results and Analysis; 4.5.1 Analysis of Feature Effectiveness; 4.5.2 Evaluation and Presentation of Key Term Graph Results; 4.6 Chapter Summary; 5. Conclusions and Future Work; 5.1 Conclusions; 5.2 Future Work; References
dc.language.iso	zh-TW
dc.subject	支撐向量機	zh_TW
dc.subject	關鍵用語	zh_TW
dc.subject	關鍵詞	zh_TW
dc.subject	關鍵片語	zh_TW
dc.subject	語音文件	zh_TW
dc.subject	關鍵用語關係圖	zh_TW
dc.subject	機率式潛藏語意分析	zh_TW
dc.subject	機器學習	zh_TW
dc.subject	Spoken Documents	en
dc.subject	Keyword	en
dc.subject	Key Phrase	en
dc.subject	Key Term Graph	en
dc.subject	SVM	en
dc.subject	Support Vector Machine	en
dc.subject	Machine Learning	en
dc.subject	PLSA	en
dc.subject	Probabilistic Latent Semantic Analysis	en
dc.subject	Key Term	en
dc.title	語音文件中關鍵用語之自動擷取及其關係圖之自動生成	zh_TW
dc.title	Automatic Key Term Extraction and Key Term Graph Generation from Spoken Documents	en
dc.type	Thesis
dc.date.schoolyear	99-2
dc.description.degree	Master's
dc.contributor.oralexamcommittee	鄭秋豫,陳信宏,王小川,簡仁宗
dc.subject.keyword	關鍵用語,關鍵詞,關鍵片語,語音文件,關鍵用語關係圖,機率式潛藏語意分析,機器學習,支撐向量機	zh_TW
dc.subject.keyword	Key Term, Keyword, Key Phrase, Spoken Documents, Key Term Graph, Probabilistic Latent Semantic Analysis, PLSA, Machine Learning, Support Vector Machine, SVM	en
dc.relation.page	77
dc.rights.note	Licensed for a fee
dc.date.accepted	2011-07-07
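dc.contributor.author-college	College of Electrical Engineering and Computer Science	zh_TW
dc.contributor.author-dept	Graduate Institute of Computer Science and Information Engineering	zh_TW
Appears in collections:	Department of Computer Science and Information Engineering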

Files in this item:

File	Size	Format
ntu-100-1.pdf (not authorized for public access)	4.69 MB	Adobe PDF

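Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.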