Ambiguity Resolution of Author Names for Bibliographic Data (書目資料中著者姓名歧異性之解析)

Chi-Nan Hsieh; 謝其男

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/9960

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	陳光華
dc.contributor.author	Chi-Nan Hsieh	en
dc.contributor.author	謝其男	zh_TW
dc.date.accessioned	2021-05-20T20:52:03Z	-
dc.date.available	2011-08-18
dc.date.available	2021-05-20T20:52:03Z	-
dc.date.copyright	2011-08-18
dc.date.issued	2011
dc.date.submitted	2011-08-07
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/9960	-
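dc.description.abstract	When searching large volumes of scholarly information, users frequently encounter author name ambiguity, which makes disambiguating authors who share the same name an important research problem. In contrast to previous work, this study makes full use of the information in bibliographic records for identification and uses no information beyond the bibliographic data. We therefore use five features: co-author names (C), article title (T), journal title (J), publication year (Y), and number of pages (P); publication year and number of pages have not been used in any prior study. Using both supervised learning methods and an unsupervised method, the study examines the accuracy of author name disambiguation for a total of 28 different feature combinations. The results show that journal title (J) and co-author (C) are especially effective features: journal title (J) is important across all methods, whereas co-author (C) performs particularly well mainly with the Support Vector Machine (SVM). In addition, publication year (Y) and number of pages (P) clearly improve disambiguation accuracy when combined with other features; of the two, publication year (Y) gives the stronger gain (about 2.5% on average), and the effect of both features is especially pronounced with K-means clustering (about 5%). The combination CTJ, commonly used in previous studies, does not necessarily achieve the best accuracy; depending on the method, other combinations such as JYP, JY, and CJ can also achieve the best accuracy. Finally, comparing results by dataset size and complexity shows that, as the test datasets grow larger and more heterogeneous, bibliographic data from citations alone cannot provide adequate disambiguation performance. This indicates that, to resolve name ambiguity effectively, future research must link and map bibliographic information to external information in order to obtain more distinctive author features.	zh_TW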
dc.description.abstract	Resolving name ambiguity when retrieving academic information requires research on author identification. In contrast to previous work, this study addresses the problem using only the information contained in bibliographic data. Five features are extracted from the bibliographic data and used to disambiguate author names: co-author (C), article title (T), journal title (J), year (Y), and number of pages (P). Features Y and P have not been used in prior work. Both supervised learning methods (Naive Bayes and Support Vector Machine) and an unsupervised learning method (K-means) are employed to explore 28 different feature combinations. The findings show that journal title (J) and co-author (C) are very effective features. Feature J plays an important role in all three approaches, while feature C stands out mainly with SVM. In addition, year (Y) and number of pages (P) clearly improve accuracy when added to other feature combinations, and the average improvement from adding feature Y is larger than that from adding feature P. The effect is notably stronger in K-means clustering (+4.98% on average) than in the Naive Bayes model (+0.90% on average) and the Support Vector Machine (+0.15% on average). The traditionally used feature combination CTJ is not superior to JYP, and the combinations CJY, JY, and J are also very effective across the three methods. Finally, disambiguation accuracy on larger datasets is 10% lower than on smaller ones, which indicates the limits of what bibliographic data alone can achieve in this “numerous and jumbled” real world. Consequently, a promising future direction is to build an intelligent mechanism that accurately maps other information onto bibliographic information in order to obtain sufficient information for author disambiguation.	en
dc.description.provenance	Made available in DSpace on 2021-05-20T20:52:03Z (GMT). No. of bitstreams: 1 ntu-100-R97126004-1.pdf: 2948319 bytes, checksum: 9e10632dc05382dddf8cf2550799cbde (MD5) Previous issue date: 2011	en
dc.description.tableofcontents	Abstract (Chinese); Abstract; Table of Contents; List of Tables; List of Figures; Chapter 1 Introduction; 1.1 Background and Motivation; 1.2 Objectives of Research; 1.3 Restriction of Research; 1.4 Definition of Terms; 1.4.1 Bibliographic data; 1.4.2 Ambiguity Resolution; Chapter 2 Literature Review; 2.1 Name Disambiguation; 2.2 Ambiguity Resolution for Author; 2.3 Machine Learning; 2.3.1 Supervised Learning Methods; 2.3.2 Unsupervised Learning Methods; Chapter 3 Research Design; 3.1 Data Collection; 3.2 Feature Combinations; 3.3 Data Processing; 3.4 Machine Learning; 3.5 Performance Evaluation; 3.6 Settings for Year and Number of Pages; Chapter 4 Experimental Results; 4.1 Common Feature Combinations; 4.2 Features Year (Y) and Number of Pages (P); 4.3 Complexity of Datasets; 4.4 Top One Feature Combinations; Chapter 5 Conclusions and Suggestions; 5.1 Conclusions; 5.2 Suggestions for Future Studies; References; Appendix
dc.language.iso	en
dc.title	書目資料中著者姓名歧異性之解析	zh_TW
dc.title	Ambiguity Resolution of Author Names for Bibliographic Data	en
dc.type	Thesis
dc.date.schoolyear	99-2
dc.description.degree	Master's
dc.contributor.oralexamcommittee	唐牧群,黃乾綱
dc.subject.keyword	著者歧義性 (author ambiguity), 書目資料 (bibliographic data), 機器學習 (machine learning)	zh_TW
dc.subject.keyword	Author Disambiguation, Bibliographic Data, Machine Learning	en
dc.relation.page	47
dc.rights.note	Authorization granted (open access worldwide)
dc.date.accepted	2011-08-07
dc.contributor.author-college	College of Liberal Arts	zh_TW
dc.contributor.author-dept	Graduate Institute of Library and Information Science	zh_TW
Appears in Collections:	Department of Library and Information Science

Files in This Item:

File	Size	Format
ntu-100-1.pdf	2.88 MB	Adobe PDF

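Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.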