Chinese NP Chunking: Experiments with Rule-based Method, Supervised, Semi-supervised and Unsupervised Learning

Yen-Hsi Lin; 林晏僖

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/44979

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	高成炎(Cheng-Yan Kao)
dc.contributor.author	Yen-Hsi Lin	en
dc.contributor.author	林晏僖	zh_TW
dc.date.accessioned	2021-06-15T04:00:13Z	-
dc.date.available	2010-03-17
dc.date.copyright	2010-03-17
dc.date.issued	2010
dc.date.submitted	2010-03-08
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/44979	-
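dc.description.abstract	NP chunking is a critical problem in natural language processing. Different combinations of structures, variation in other parts of speech, and ambiguity in both structure and the words themselves all strongly affect chunking results. Good chunking results benefit many applications related to natural language processing, especially services in which noun phrases make up most of the content, such as web mining and search engines. However, because Chinese is more complex than other languages and lacks large annotated corpora, Chinese NP chunking is even more difficult. In recent years, a large body of literature has addressed natural language processing problems, including chunking, by applying supervised classification methods to training corpora. These studies share some unresolved problems, such as insufficient training data, and proposed remedies are hard to find elsewhere in the literature. This thesis investigates four methods for identifying Chinese noun phrases. The first is a rule-based model, implemented from rules compiled by earlier researchers, which serves as a point of comparison. The second is a supervised learning model: we use YamCha (Yet Another Multipurpose CHunk Annotator), the SVM-based chunking tool proposed by Taku Kudo, to train an initial model for Chinese NP chunking, and we try settings that differ from the IOB representation and the semantic information at the two preceding and two following positions used in most of the literature, in order to find parameters suited to Chinese. The third is a semi-supervised learning model based on the idea of self-training, which uses unlabeled data from the web to strengthen the supervised model. The last is an unsupervised learning model that relies entirely on unlabeled raw data obtained from search engines and similar web resources, combined with linguistic features of Chinese itself. The experimental results show that the simplest rule-based approach achieves an f-rate of 0.71 in the open test, about 0.13 higher than the 0.58 of supervised learning. In the supervised learning experiments, the model built with our chosen parameters scores about 16 percentage points higher in the first-stage open test than the model built with parameters chosen in previous work. In semi-supervised learning, adding unlabeled data does improve on supervised learning: the f-rate in the second open test is 78.79%, about 8 percentage points higher than supervised learning, which retains the advantages of the classifier while improving the identification of nouns that become ambiguous through nominalization in Chinese. The unsupervised method, which does not rely on a classifier at all, achieves an f-rate of 84.57% in the open test, 17 percentage points higher than semi-supervised learning, and the open test shows that it is effective for long noun phrases and nominalized verbs.	zh_TW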
dc.description.provenance	Made available in DSpace on 2021-06-15T04:00:13Z (GMT). No. of bitstreams: 1 ntu-99-R95944002-1.pdf: 632349 bytes, checksum: aef9ddc5819871cbdb822ae4444a3d4c (MD5) Previous issue date: 2010	en
dc.description.tableofcontents	Oral Examination Committee Certification; Chinese Abstract; Abstract; Chapter 1 Introduction; Chapter 2 Literature Review; 2.1 Rule-based methods; 2.2 Supervised learning and statistical methods; 2.2.1 Kudo's support vector machine algorithm; 2.3 SVM and YAMCHA; 2.4 Semi-supervised learning; 2.5 Unsupervised learning; Chapter 3 Experimental Methods; 3.1 Noun phrase representation; 3.2 Rule-based approach; 3.2 Supervised learning; 3.3 Semi-supervised learning; 3.4 Unsupervised learning; Chapter 4 Experimental Results and Discussion; 4.1.0 Experimental corpus; 4.1.1 Related resources; 4.1.2 Potential problems; 4.1.3 Open test set; 4.2 Rule-based approach; 4.3 Supervised learning; 4.4 Semi-supervised learning; 4.5 Unsupervised learning; 4.6 Overall performance comparison; Chapter 5 Conclusion and Future Work; 5.1 Future work; References (Bibliographies); Appendix 1 Open test sentences; # Data for the supervised and semi-supervised open tests; # Sentences from the closed test corpus; # Sentences taken from recent news; # Contrasting cases of one verb serving two functions; Unknown words in the test sentences; Appendix 2 Chinese NP Chunking: a Semi-Supervised Approach (part of this thesis, accepted by Second International Symposium on Universal Communication, ISUC 2008, December 15-16, 2008)
dc.language.iso	zh-TW
dc.subject	web corpus	zh_TW
dc.subject	中文名詞組辨識	zh_TW
dc.subject	YamCha	zh_TW
dc.subject	監督式學習法	zh_TW
dc.subject	半監督式學習法	zh_TW
dc.subject	Chinese NP chunking	en
dc.subject	web corpus	en
dc.subject	semi-supervised learning	en
dc.subject	supervised-learning	en
dc.subject	YamCha	en
dc.title	中文名詞組的辨識：規則式判別、監督式、半監督式與非監督式學習法的實驗	zh_TW
dc.title	Chinese NP Chunking: Experiments with Rule-based Method, Supervised, Semi-supervised and Unsupervised Learning	en
dc.type	Thesis
dc.date.schoolyear	98-1
dc.description.degree	Master's
dc.contributor.coadvisor	高照明(Zhao-Ming Gao)
dc.contributor.oralexamcommittee	楊允言,劉昭麟(Chao-Lin Liu)
dc.subject.keyword	中文名詞組辨識, YamCha, 監督式學習法, 半監督式學習法, web corpus	zh_TW
dc.subject.keyword	Chinese NP chunking, YamCha, supervised-learning, semi-supervised learning, web corpus	en
dc.relation.page	61
dc.rights.note	Paid license
dc.date.accepted	2010-03-10
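dc.contributor.author-college	College of Electrical Engineering and Computer Science	zh_TW
dc.contributor.author-dept	Graduate Institute of Networking and Multimedia	zh_TW
Appears in Collections:	Graduate Institute of Networking and Multimedia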

Files in This Item:

File	Size	Format
ntu-99-1.pdf (not authorized for public access)	617.53 kB	Adobe PDF

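Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.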