  1. NTU Theses and Dissertations Repository
  2. College of Management
  3. Department of Information Management
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/61336
Full metadata record
DC field: value [language]
dc.contributor.advisor: 盧信銘 (Hsin-Min Lu)
dc.contributor.author: Chung-Wei Luo [en]
dc.contributor.author: 羅崇瑋 [zh_TW]
dc.date.accessioned: 2021-06-16T13:01:14Z
dc.date.available: 2013-08-14
dc.date.copyright: 2013-08-14
dc.date.issued: 2013
dc.date.submitted: 2013-08-07
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/61336
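dc.description.abstract: Named Entity Recognition (NER) is an important topic in the field of Information Extraction. It has been widely applied to English and other foreign-language data, both structured and unstructured, and many well-performing tools already exist. With the growing prominence of the Chinese-speaking world, many researchers have begun applying NER techniques to Chinese data. This thesis therefore proposes a method that builds on existing models and adds new ideas and key features, with the aim of improving recognition accuracy.
This thesis proposes CRF_CNR, a Chinese personal name recognition system based on Conditional Random Fields (CRFs) and augmented with features that help Chinese Named Entity Recognition (CNER) of personal names. Because Chinese, unlike English, has no explicit spaces to mark word boundaries, and a single Chinese character often carries limited meaning on its own, this study adds four features for personal name recognition: a word segmentation feature, a surname feature, a title feature, and the probability distribution of individual characters appearing in Chinese personal names. The segmentation feature uses the segmentation information in the Sinica Treebank ("中文句結構樹資料庫") provided by Academia Sinica. The surname feature uses the list of surnames in the Taiwan-Fukien area compiled by the Ministry of the Interior's household registration data warehouse system ("戶政資料倉儲系統"). The title feature covers kinship titles (e.g., 爺爺 "grandfather", 叔叔 "uncle") and professional titles (e.g., 總統 "president", 經理 "manager"). The name-character probability feature is derived from an analysis of recent university entrance examination admission lists in Taiwan (台灣大學聯考榜單).
In addition, the Chinese word segmentation system (CKIP) provided by Academia Sinica achieves reasonable segmentation accuracy, so this study uses its module to segment the text, applies simple post-processing that removes unneeded tags and keeps only the noun tags, and then filters the result with manually defined rules. The output serves as the baseline of this study. The data sets are the Sinica Treebank ("中文句結構樹資料庫") and the Balance Corpus ("現代漢語平衡語料庫"), both provided by Academia Sinica. The final experiments use CRFs as the model, and the results show that CRFs with the features above outperform the baseline and effectively improve the accuracy of Chinese personal name recognition.
[zh_TW]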
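The four feature types named in the abstract above (segmentation, surname, title, and name-character probability) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the tiny lexicons, the odds values, the threshold of 1.0, the B/I/E/S position tags, and the function names `segmentation_tags` and `char_features` are all assumptions made for illustration. It only shows how a pre-segmented sentence might be turned into one feature record per character before being handed to a CRF toolkit.

```python
from typing import Dict, List

# Hypothetical miniature lexicons; the thesis uses far larger lists.
SURNAMES = {"陳", "林", "王"}
TITLES = {"總統", "經理", "爺爺", "叔叔"}
# Illustrative odds values only, not figures from the thesis.
NAME_CHAR_ODDS = {"志": 3.2, "明": 2.5}


def segmentation_tags(words: List[str]) -> List[str]:
    """Tag each character with its position in its word: B, I, E, or S."""
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return tags


def char_features(words: List[str]) -> List[Dict[str, object]]:
    """Build one feature record per character of a pre-segmented sentence."""
    chars = [ch for word in words for ch in word]
    seg = segmentation_tags(words)
    # A character is "in a title" when the word it belongs to is a title.
    in_title = [word in TITLES for word in words for _ in word]
    features = []
    for i, ch in enumerate(chars):
        features.append({
            "char": ch,
            "seg": seg[i],
            "is_surname": ch in SURNAMES,
            "in_title": in_title[i],
            # Assumed threshold: odds of at least 1.0 count as "name-like".
            "high_name_odds": NAME_CHAR_ODDS.get(ch, 0.0) >= 1.0,
        })
    return features


if __name__ == "__main__":
    for record in char_features(["陳", "志明", "經理", "說"]):
        print(record)
```

In a real pipeline the context of neighbouring characters would normally be supplied through the CRF toolkit's feature templates rather than computed by hand here.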
dc.description.abstract: Named Entity Recognition (NER) is an important problem in information extraction and has been widely used to analyze natural language text in English and other languages. This thesis proposes CRF_CNR, a Chinese personal name recognition system based on Conditional Random Fields (CRFs) with character-level features. The system uses four features: a word segmentation feature, a surname feature, a title feature, and the probability distribution of Chinese characters appearing in Chinese names. The word segmentation feature uses the segmentation information in the Sinica TreeBank provided by Academia Sinica. The surname feature uses the Taiwan-area surname list from the Ministry of the Interior. The title feature covers kinship titles (e.g., grandfather, uncle) and professional titles (e.g., president, manager). The probability feature uses name lists from National Taiwan University's entrance exams in recent years to compute the odds of given-name characters. On a testbed constructed from the Sinica TreeBank and the Balance Corpus, ten-fold cross-validation yields an F1-measure of 0.871, higher than that of a baseline built on the Chinese Knowledge and Information Processing (CKIP) word segmentation system. [en]
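For readers unfamiliar with the F1-measure reported above, the following Python sketch shows one common way to compute precision, recall, and F1 for name recognition. It rests on assumptions not confirmed by this record: labels follow a B/I/O scheme, scoring is at the entity level (a predicted name counts only if its boundaries match the gold name exactly), and the function names `extract_spans` and `f1_score` are hypothetical. The thesis may use a different labeling scheme or scoring unit.

```python
from typing import List, Set, Tuple


def extract_spans(labels: List[str]) -> Set[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, of name spans in B/I/O labels."""
    spans: Set[Tuple[int, int]] = set()
    start = None
    # The trailing "O" is a sentinel that closes a span ending the sentence.
    for i, label in enumerate(labels + ["O"]):
        if label != "I" and start is not None:
            spans.add((start, i))
            start = None
        if label == "B":
            start = i
    return spans


def f1_score(gold: List[List[str]],
             pred: List[List[str]]) -> Tuple[float, float, float]:
    """Entity-level precision, recall, and F1 over parallel label sequences."""
    true_pos = n_gold = n_pred = 0
    for gold_labels, pred_labels in zip(gold, pred):
        gold_spans = extract_spans(gold_labels)
        pred_spans = extract_spans(pred_labels)
        true_pos += len(gold_spans & pred_spans)  # exact boundary matches
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = [["B", "I", "I", "O", "O", "B", "I"]]
    pred = [["B", "I", "I", "O", "O", "B", "O"]]
    print(f1_score(gold, pred))
```

In ten-fold cross-validation this score would be computed on each held-out fold in turn; whether the folds are then averaged or pooled is a design choice this record does not specify.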
dc.description.provenance: Made available in DSpace on 2021-06-16T13:01:14Z (GMT). No. of bitstreams: 1
ntu-102-R00725044-1.pdf: 3408844 bytes, checksum: 56114e040f899509349d8783850bed76 (MD5)
Previous issue date: 2013 [en]
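dc.description.tableofcontents: Table of Contents
Acknowledgements
Abstract
Chinese Abstract
List of Tables
List of Figures
Table of Contents
Chapter 1 Introduction
1.1 Research Motivation
1.2 Research Objectives
Chapter 2 Literature Review
2.1 Named Entity Recognition
2.2 Applying CRFs to Named Entity Recognition
2.3 Labeling Scheme
2.4 Features for Chinese Named Entity Recognition
Chapter 3 Research Gap
Chapter 4 Data Overview and Data Processing
4.1 Data Sources and Background
4.1.1 Sinica Treebank (中文句結構樹資料庫)
4.1.2 Balance Corpus (現代漢語平衡語料庫)
4.2 Data Processing and Resulting Statistics
4.2.1 Data Processing
4.2.2 Statistics after Data Processing
Chapter 5 System Design
5.1 System Flowchart
5.2 Overview of the System Flow
5.3 Feature Construction
5.3.1 Segmentation Features: Word Segmentation Information
5.3.2 Segmentation Features: Titles
5.3.3 Segmentation Features: Surnames
5.3.4 Probability Features: Characters That Frequently Appear in Chinese Names
5.3.5 Feature Generation Example
5.4 Baseline
5.4.1 CKIP, the Academia Sinica Chinese Word Segmentation System
5.4.2 Baseline Implementation Procedure
Chapter 6 Experiments and Performance
6.1 Evaluation Metrics
6.2 System Performance
6.2.1 Hyperparameter Tuning
6.2.2 Performance Comparison of CRF_CNR with Each Feature Added
6.2.3 mmseg4j vs. the Data Set's Own Segmentation
6.3 Error Analysis
Chapter 7 Conclusions and Suggestions
7.1 Experimental Conclusions and Suggestions
7.1.1 Expanding the Existing Feature Data Sets
7.1.2 Incorporating More Features Useful for Recognizing Chinese Personal Names
7.1.3 Combining Different Machine Learning Models to Improve Performance
7.2 Research Contributions
7.3 Future Research Directions
Chapter 8 References
Appendix 1 List of Surnames Used in the Experiments
Appendix 2 Top 100 Characters Ranked by Odds of Appearing in Chinese Personal Names
Appendix 3 List of Professional and Kinship Titles Used in the Experiments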
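dc.language.iso: zh-TW
dc.subject: Sequential Labeling [zh_TW]
dc.subject: Natural Language Processing [zh_TW]
dc.subject: Information Extraction [zh_TW]
dc.subject: Conditional Random Fields [zh_TW]
dc.subject: Named Entity Recognition [zh_TW]
dc.subject: Chinese Personal Names [zh_TW]
dc.subject: Chinese Names [en]
dc.subject: Natural Language Processing [en]
dc.subject: Sequential Labeling [en]
dc.subject: Conditional Random Fields [en]
dc.subject: Named Entity Recognition [en]
dc.subject: Information Extraction [en]
dc.title: 使用條件隨機域實作中文人名辨識系統 (Implementing a Chinese Personal Name Recognition System Using Conditional Random Fields) [zh_TW]
dc.title: Implementing Chinese Named Entity Recognition System Using Conditional Random Fields [en]
dc.type: Thesis
dc.date.schoolyear: 101-2
dc.description.degree: Master's
dc.contributor.oralexamcommittee: 魏志平 (Chih-Ping Wei), 陳建錦 (Chien-Chin Chen)
dc.subject.keyword: Information Extraction, Natural Language Processing, Sequential Labeling, Conditional Random Fields, Named Entity Recognition, Chinese Personal Names [zh_TW]
dc.subject.keyword: Information Extraction, Natural Language Processing, Sequential Labeling, Conditional Random Fields, Named Entity Recognition, Chinese Names [en]
dc.relation.page: 49
dc.rights.note: Paid license
dc.date.accepted: 2013-08-07
dc.contributor.author-college: College of Management [zh_TW]
dc.contributor.author-dept: Graduate Institute of Information Management [zh_TW]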
Appears in collections: Department of Information Management

Files in this item:
File, Size, Format
ntu-102-1.pdf (not authorized for public access), 3.33 MB, Adobe PDF