Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/33541

Full metadata record

| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 陳信希 | |
| dc.contributor.author | Yu-Chuan Wei | en |
| dc.contributor.author | 魏煜娟 | zh_TW |
| dc.date.accessioned | 2021-06-13T04:46:17Z | - |
| dc.date.available | 2006-07-19 | |
| dc.date.copyright | 2006-07-19 | |
| dc.date.issued | 2006 | |
| dc.date.submitted | 2006-07-17 | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/33541 | - |
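| dc.description.abstract | This thesis investigates the problem of personal name ambiguity. Just as a word can have several senses, a name can be shared by several people; the main goal of this study is to determine whether identical names appearing in different documents refer to the same person. Personal name disambiguation has received growing attention in recent years, with applications including personal profile construction, personal homepage search, expert finding, and social network analysis. We propose two types of name disambiguation methods that cluster the documents mentioning a name so that the documents in each cluster all refer to the same person. The multiple-classifier approach chains five classifiers to cluster documents. The five classifiers correspond to five features extracted from the documents, which serve as the basis for distinguishing individuals. The first two classifiers cluster by personal titles and communities, respectively, and are expected to achieve high precision; classifiers based on terms, temporal expressions, and URLs are then applied to raise recall and thereby improve overall performance. We also propose alternative algorithms for three of the classifiers and examine their effect. The single-classifier approach is another name disambiguation method: it considers multiple features simultaneously and clusters the documents directly. Here we examine the clustering results obtained with different clustering algorithms and different features. Our experimental data use three real personal names and consider how name disambiguation is affected by the fame of the person (celebrities vs. ordinary people), the type of data (news vs. web pages), and the source of the data (Taiwan vs. Mainland China). The results show that, in the multiple-classifier approach, clustering directly by personal titles outperforms the more complex two-stage decision method, and that full-text analysis introduces more noise and lowers system performance; in the single-classifier approach, considering all features together gives better results than using terms alone; and expanding communities via the Web benefits both approaches. The best result of the multiple-classifier approach reaches an F-score of 70%, a relative improvement of about 40% over a single classifier that uses only term features (the most basic name disambiguation method). Finally, in the conclusion, we point out directions for future work on this research topic. | zh_TW |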
| dc.description.abstract | In this thesis, we study the problem of personal name disambiguation. Many individuals share the same name. The objective of our work is to identify the different individuals mentioned in a set of documents and to cluster the documents into groups such that each group relates to one person. Two types of approaches are proposed and compared. In the multiple-classifier approach, several classifiers are integrated to disambiguate the denotations of personal names. Each classifier is built on one feature. Alternative algorithms are proposed for three of the classifiers. In the single-classifier approach, multiple features are considered simultaneously and the documents are clustered directly. Different clustering algorithms and different features are considered and compared in this approach. In the test data sets, we address how well known a person is (household name vs. general name), the sources of materials (newswire vs. web pages), and web pages from different areas (Mainland China vs. Taiwan). The experimental results for the multiple-classifier approach show that personal titles and communities are two strong cues for clustering. The first two classifiers achieve very high precision, and the last three classifiers improve recall at some expense of precision. The average F-score increases gradually from the first classifier to the last one. The results of several alternatives show that clustering by personal titles directly performs better than the two-step strategy, and that terms extracted from the full text appear to introduce considerable noise for name disambiguation. In the single-classifier approach, high performance is achieved when all types of features are applied. Expanding communities from the Web improves the performance of both approaches. The best alternative in the multiple-classifier approach achieves an F-score of 70%, a relative improvement of about 40% over the general name disambiguation method. We close by comparing the two proposed clustering approaches and drawing conclusions. | en |
| dc.description.provenance | Made available in DSpace on 2021-06-13T04:46:17Z (GMT). No. of bitstreams: 1 ntu-95-R92922129-1.pdf: 1213180 bytes, checksum: a022de333e07bd2b812a5e74e24146ea (MD5) Previous issue date: 2006 | en |
| dc.description.tableofcontents | Chapter 1 Introduction; 1.1 Motivation; 1.2 Problem Statement; 1.3 Related Work; 1.4 Main Issues; 1.5 The Organization of This Thesis; Chapter 2 Evaluation Corpora; 2.1 Selection Strategies of Testing Names; 2.2 Description of Resources; 2.2.1 Newswire; 2.2.2 Web Pages; 2.3 Comparison of Three Materials; 2.3.1 Newswire vs. Web Pages; 2.3.2 Web Pages in Taiwan vs. Web Pages in China; Chapter 3 Multiple-Classifier Approach; 3.1 Overview; 3.2 Data Preprocessing; 3.2.1.1 Data Extraction, Code Translation, Context Extraction, and POS Tagging; 3.2.2 Feature Extraction; 3.2.2.1 Personal Title Extraction; 3.2.2.2 Community Extraction; 3.2.2.3 Term Extraction; 3.2.2.4 Temporal Expression Extraction; 3.2.2.5 URL Extraction; 3.3 Five Classifiers in the Multiple-Classifier Approach; 3.3.1 A Classifier Using Personal Titles (C1); 3.3.1.1 Dividing by Title Keywords and Organization Names (C11); 3.3.1.2 Merging by Organization Names (C12); 3.3.2 A Classifier Using Communities (C2); 3.3.2.1 Disambiguating by Communities (C21); 3.3.2.2 Self-Dividing by Communities (C22); 3.3.3 A Classifier Using Term Vectors (C3); 3.3.3.1 Disambiguating by Term Vectors (C31); 3.3.3.2 Merging by Term Vectors (C32); 3.3.4 A Classifier Using Temporal Expressions (C4); 3.3.5 A Classifier Using URLs of Documents (C5); 3.4 Cluster Labeling; Chapter 4 Experiments of Multiple-Classifier Approach; 4.1 Evaluation Metrics; 4.2 Baseline Models; 4.3 Experimental Results; 4.3.1 Performance of Personal Title Classifier; 4.3.2 Performance of Community Classifier; 4.3.3 Performance of Term Vector Classifier; 4.3.4 Performance of Temporal Expression and URLs of Documents Classifiers; 4.3.5 Overall Performance and Discussion; 4.4 Alternative Approaches; 4.4.1 Personal Title Classifier; 4.4.1.1 Directly Clustering by Personal Titles; 4.4.1.2 Merging by Ratio; 4.4.1.3 Merging by Chi-square; 4.4.2 Community Classifier; 4.4.2.1 Community Expansion; 4.4.2.1.1 Building an NE Ontology; 4.4.2.1.2 Setting up a Community Chain from Two Ontologies; 4.4.2.1.3 Web Search with Double Checking Model; 4.4.2.1.4 Community Expansion from the Web; 4.4.2.2 Expansion in Community Classifier; 4.4.3 Term Vector Classifier; 4.5 Results of Alternative Approaches; 4.5.1 Personal Title Classifier; 4.5.2 Community Classifier; 4.5.3 Term Vector Classifier; Chapter 5 Single-Classifier Approach; 5.1 Agglomerative Clustering Algorithms; 5.2 Two Alternatives; 5.3 Experimental Results; 5.3.1 Three Agglomerative Clustering Algorithms; 5.3.2 Two Alternative Single-Classifiers; 5.4 Comparison between Multiple-Classifiers and Single-Classifiers; 5.5 Dynamic Threshold Setting; 5.5.1 Average-link with Dynamic Threshold; 5.5.2 Experiments; 5.6 Visualization of Results; Chapter 6 Conclusion and Future Work; 6.1 Conclusion; 6.2 Future Work; References; Appendix; I Statistics of “Chien-Ming Wang” in UDN, TW, and CN; II Performances of Multiple-Classifier Approach; III Test Data and Scores in Dynamic Threshold Setting | |
| dc.language.iso | en | |
| dc.subject | 資訊檢索 | zh_TW |
| dc.subject | 人名解歧 | zh_TW |
| dc.subject | Name Disambiguation | en |
| dc.subject | Information Retrieval | en |
| dc.title | 人名歧義性分析之研究 | zh_TW |
| dc.title | A Study of Personal Name Disambiguation | en |
| dc.type | Thesis | |
| dc.date.schoolyear | 94-2 | |
| dc.description.degree | Master's | |
| dc.contributor.oralexamcommittee | 梁婷,蔡益坤,陳光華 | |
| dc.subject.keyword | 人名解歧, 資訊檢索 | zh_TW |
| dc.subject.keyword | Name Disambiguation, Information Retrieval | en |
| dc.relation.page | 76 | |
| dc.rights.note | Paid license | |
| dc.date.accepted | 2006-07-18 | |
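| dc.contributor.author-college | College of Electrical Engineering and Computer Science | zh_TW |
| dc.contributor.author-dept | Graduate Institute of Computer Science and Information Engineering | zh_TW |
| Appears in collections: | Department of Computer Science and Information Engineering | |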
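Files in this item:
| File | Size | Format | |
|---|---|---|---|
| ntu-95-1.pdf (not authorized for public access) | 1.18 MB | Adobe PDF |
Unless their copyright terms are specifically indicated, all items in the system are protected by copyright, with all rights reserved.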
