Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/9960
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | 陳光華
dc.contributor.author | Chi-Nan Hsieh | en
dc.contributor.author | 謝其男 | zh_TW
dc.date.accessioned | 2021-05-20T20:52:03Z
dc.date.available | 2011-08-18
dc.date.available | 2021-05-20T20:52:03Z
dc.date.copyright | 2011-08-18
dc.date.issued | 2011
dc.date.submitted | 2011-08-07
dc.identifier.citation | Bhattacharya, I., & Getoor, L. (2007). Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data, 1, 1-36.
Can, F., & Patton, J. M. (2004). Change of writing style with time. Computers and the Humanities, 38, 61-82.
Chang, C. C., & Lin, C. J. (2010). LIBSVM: A library for support vector machines (Version 3.0). Retrieved Oct. 4, 2010, from http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Churches, T., Christen, P., Lim, K., & Zhu, J. (2002). Preparation of name and address data for record linkage using hidden Markov models. BMC Medical Informatics and Decision Making, 2, 9.
CiteSeer (n.d.). About CiteSeerX. Retrieved Jan. 31, 2011 from http://citeseer.ist.psu.edu/about/site
Culotta, A., Kanani, P., Hall, R., Wick, M., & McCallum, A. (2007). Author disambiguation using error-driven machine learning with a ranking loss function. In Proceedings of the AAAI 6th International Workshop on Information Integration on the Web, 32-37.
Digital Author Identifier (DAI). (2009). DAI-Standard wiki. Retrieved Oct. 4, 2010, from http://www.surffoundation.nl/wiki/display/standards/DAI
DiLauro, T., Choudhury, G. S., Patton, M., Warner, J. W. & Brown, E. W. (2001). Automated name authority control and enhanced searching in the levy collection. D-Lib Magazine, 7(4).
Elmagarmid, A. K., Ipeirotis, P. G., & Verykios, V. S. (2007). Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19(1), 1–16.
Ferris, M. & Munson, T. (2002). Interior-point methods for massive support vector machines. SIAM Journal on Optimization 13 (3): 783–804.
French, J. C., Powell, A., & Schulman, E. (2000). Using clustering strategies for creating authority files. Journal of the American Society for Information Science, 51, 774-786.
Gale, W. A., Church, K. W., & Yarowsky, D. (1992). A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26, 415-439.
Han, H., Giles, L., & Zha, H. (2005a). Name disambiguation in author citations using a K-way spectral clustering method. In Proceedings of ACM/IEEE Joint Conference on Digital Libraries. Retrieved Oct. 4, 2010, Retrieved Nov. 27, 2009, from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.89.9354&rep=rep1&type=pdf
Han, H., Giles, L., Zha, H., Li, C., Tsioutsiouliklis, K. (2004). Two supervised learning approaches for name disambiguation in author citations. In Proceedings of the 4th ACM/IEEE-CS Joint Conference. Retrieved Oct. 4, 2010, Retrieved Nov. 27, 2009, from http://clgiles.ist.psu.edu/papers/JCDL-2004-author-disambiguation.pdf
Han, H., Giles, L., Zha, H., Xu, W. (2005b). A hierarchical Naive Bayes mixture model for name disambiguation in author citations. In Proceedings of the 2005 ACM symposium. Retrieved Oct. 4, 2010, Retrieved Nov. 27, 2009, from http://clgiles.ist.psu.edu/papers/SAC-2005-Naive-Bayes-Mixture.pdf
Hastie, T., Tibshirani, R., Friedman, J. (2011). Hierarchical clustering. The Elements of Statistical Learning (2nd ed.). New York: Springer, 520–528.
Hernandez, M. A., & Stolfo, S. J. (1998). Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2(1), 9–37.
Hill, S., & Provost, F. (2003). The myth of the double-blind review? Author identification using only citations. ACM SIGKDD Explorations, 5, 179-184.
Huang, J., Ertekin, S., & Giles, C. L. (2006). Efficient name disambiguation for large scale databases. In J. Furnkranz, T. Scheffer, & M. Spiliopoulou (Eds.), Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, 536-544.
International Standard Name Identifier (ISNI). (2009). ISNI Draft ISO 27729. Retrieved Oct. 4, 2010, Retrieved Nov. 27, 2009, from http://www.isni.org/
Jang, J. S. (2011). Data Clustering and Pattern Recognition. Retrieved Jan. 4, 2011, Retrieved Dec. 25, 2010, from http://mirlab.org/jang
Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14, 491-498.
Kanani, P., McCallum, A., & Pal, C. (2007). Improving author coreference by resource bounded information gathering from the web. In M. M. Veloso (Ed.), Proceedings of the 20th International Joint Conference on Artificial Intelligence, 429-434.
Koppel, M., Argamon, S., & Shimoni, A. (2002). Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17, 401-412.
Malin, B., Airoldi, E., & Carley, K. M. (2005). A network analysis model for disambiguation of names in lists. Computational and Mathematical Organization Theory, 11, 119-139.
Mitchell, T. M. (1997). Machine Learning. New York: McGraw Hill.
Murthy, S. K. (1998). Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery.
Naveman. (2011). Naveman Glossary. Retrieved Jan. 4, 2011, Retrieved Dec. 25, 2010, from http://www.navmanmarine.net/
OCLC. (2009). WorldCat Identity Service. Retrieved Oct. 4, 2010, Retrieved Dec. 25, 2009, from http://orlabs.oclc.org/identities
People Australia. (2010). People Australia Overview. Retrieved Oct. 4, 2010, Retrieved Dec. 25, 2009, from http://www.nla.gov.au/initiatives/peopleaustralia/index.html
Pereira, D. A., Ribeiro-Neto, B. A., Ziviani, N., Laender, A. H. F., Goncalves, M. A., Ferreira, A. A. (2009). Using web information for author name disambiguation. In Proc. of JCDL, pp 49–58.
ProQuest. (2009). Scholar Universe. Retrieved Oct. 4, 2010, Retrieved Dec. 25, 2009, from http://www.scholaruniverse.com
Research Name Resolver. (2010). NII Research Name Resolver. Retrieved Oct. 4, 2010, Retrieved Dec. 25, 2009, from http://rns.nii.ac.jp/;jsessionid=372CE9C69AF0745A1597C34DD3ACC420
Safavian, S. R., Landgrebe, D. (1991). A survey of decision tree classifier methodology. IEEE Trans. Systems Man Cybernet. 21, 660-674.
Smalheiser, N. R., Torvik, V. I. (2009). Author Name Disambiguation. Chapter in Annual Review of Information Science and Technology, v.43.
Song, Y., Huang, J., Councill, I. G., Li, J. & Giles, C. L. (2007). Efficient topic-based unsupervised name disambiguation. In E. M. Rasmussen, R. R. Larson, E. Toms, S. Sugimoto (Eds.), Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, 342-351.
Tan, Y. F., Kan, M. Y. & Lee, D. (2006). Search engine driven author disambiguation. In G. Marchionini, M. L. Nelson, & C. C. Marshall (Eds.), Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, 314-315.
Thomson Reuters. (2009). Distinct Author Identification System. Retrieved Oct. 4, 2010, Retrieved Dec. 25, 2009, from http://scientific.thomsonreuters.com/support/faq/wok3new/dais/
Thomson Reuters. (2011). Journal Citation Reports. Retrieved Oct. 4, 2010, Retrieved Jan. 3, 2011, from http://www.isiwebofknowledge.com/
Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 11.
Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2005). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56, 140-158.
Wiley-Blackwell. (2010). Author Service. Retrieved Oct. 4, 2010, Retrieved Dec. 25, 2009, from http://authorservices.wiley.com/bauthor/
Winkler, W. E. (1995). Matching and record linkage. In B. G. Cox et al. (Eds.), Business Survey Methods, New York: J. Wiley, 355-384.
Yang, D. L., Chang, J. H., Huang, M. C. & Liu, J. S. (1999). An efficient K-means-based clustering algorithm. In Proceedings of the 1st Asia-Pacific Conference on Intelligent Agent Technology, 269-273.
Yang, K. H., Jiang, J. Y., Lee, H. M., & Ho, J. M. (2007). Extracting citation relationships from web documents for author disambiguation. Technical Report No. TR-IIS-06-017. Retrieved Oct. 4, 2010, Retrieved Nov. 27, 2009, from http://www.iis.sinica.edu.tw/page/library/TechReport/tr2006/tr06017.pdf
Yang, K. H., Peng, H. T., Jiang, J. Y., Lee, H. M., & Ho, J. M. (2008). Author name disambiguation for citations using topic and web correlation. In Proceedings of the 12th European Conference on Research and Advanced Technology for Digital Libraries. Lecture Notes in Computer Science, 5173, 185–196. Retrieved Oct. 4, 2010, Retrieved Nov. 27, 2009, from http://www.iis.sinica.edu.tw/papers/hoho/7642-F.pdf
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/9960
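dc.description.abstract | When retrieving large volumes of scholarly information, users often run into author ambiguity, which makes resolving groups of same-named authors an important research topic. Compared with previous studies, this study makes full use of the information in bibliographic data for the identification task and uses no information beyond the bibliographic record. We therefore use five features: co-author names (C), article title (T), journal title (J), publication year (Y), and number of pages (P), of which publication year and number of pages have not been used in any other study. Using both supervised learning methods and an unsupervised method, the study examines the accuracy of author name disambiguation for a total of 28 different feature combinations.
The study finds that journal title (J) and co-author (C) are particularly effective features. Journal title (J) proves important under every method, while co-author (C) performs especially well mainly with the Support Vector Machine (SVM). In addition, publication year (Y) and number of pages (P) clearly improve disambiguation accuracy when combined with other features; of the two, publication year (Y) gives the larger boost (about 2.5% on average). The effect of publication year and number of pages is especially evident with K-means clustering (about 5%).
The feature combination CTJ, frequently used in previous studies, does not necessarily achieve the best accuracy; across the different classification methods, other combinations such as JYP, JY, and CJ can also achieve the best accuracy. Finally, a comparison of results by dataset size and complexity shows that as the test datasets grow larger and more heterogeneous, relying solely on the bibliographic data of cited references no longer provides adequate identification performance. This indicates that, to resolve name ambiguity effectively, future research must link and map bibliographic information outward to other information in order to obtain more distinctive author characteristics.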
zh_TW
dc.description.abstract | Resolving author name ambiguity is indispensable when retrieving academic information. In contrast to previous work, this study addresses the problem using only the information contained in bibliographic data. Five features are extracted from bibliographic data to disambiguate author names: co-author (C), article title (T), journal title (J), year (Y), and number of pages (P). Features Y and P have not been used in previous studies. Both supervised learning methods (Naive Bayes and Support Vector Machine) and an unsupervised learning method (K-means) are employed to explore 28 different feature combinations.
The findings show that journal title (J) and co-author (C) are highly effective features. Feature J plays an important role in all three approaches, while feature C stands out mainly with SVM. In addition, features year (Y) and number of pages (P) clearly improve accuracy when added to other feature combinations, and the average improvement from adding Y is larger than that from adding P. The effect is notably stronger with K-means clustering (+4.98% on average) than with the Naive Bayes model (+0.90% on average) or the Support Vector Machine (+0.15% on average).
The traditionally used feature combination CTJ is not superior to JYP, and the combinations CJY, JY, and J are also very effective across the three methods. Finally, disambiguation accuracy on larger datasets is 10% lower than on smaller ones, which indicates the limits of what bibliographic data alone can achieve in this “numerous and jumbled” real world. Consequently, a promising direction for future work is to build an intelligent mechanism that accurately maps other information onto bibliographic information in order to obtain sufficient information for author disambiguation.
en
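The abstract does not list the 28 feature combinations. One reading that yields exactly 28 is to take every non-empty subset of the five features that contains at least one of C, T, or J, so that Y and P only ever appear alongside another feature. This rule is an assumption made for illustration, not something the record states; the names `FEATURES`, `CORE`, and `feature_combinations` are likewise hypothetical. A minimal sketch in Python:

```python
from itertools import combinations

# Feature codes as given in the abstract:
# C = co-author, T = article title, J = journal title, Y = year, P = number of pages
FEATURES = ["C", "T", "J", "Y", "P"]

# Assumption (not stated in the record): every combination must include
# at least one of these features, so Y and P never stand alone.
CORE = {"C", "T", "J"}


def feature_combinations(features=FEATURES, core=CORE):
    """Return all non-empty feature subsets that contain a core feature."""
    combos = []
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            if core & set(subset):
                combos.append("".join(subset))
    return combos


combos = feature_combinations()
print(len(combos))  # 28 under the assumption above
```

Under this rule the combinations named in the abstract (CTJ, JYP, CJY, JY, CJ, J) are all included, while Y, P, and YP on their own are excluded.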
dc.description.provenance | Made available in DSpace on 2021-05-20T20:52:03Z (GMT). No. of bitstreams: 1
ntu-100-R97126004-1.pdf: 2948319 bytes, checksum: 9e10632dc05382dddf8cf2550799cbde (MD5)
Previous issue date: 2011
en
dc.description.tableofcontents | Table of Contents

Abstract (Chinese)
Abstract
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Background and Motivation
1.2 Objectives of Research
1.3 Restriction of Research
1.4 Definition of Terms
1.4.1 Bibliographic data
1.4.2 Ambiguity Resolution
Chapter 2 Literature Review
2.1 Name Disambiguation
2.2 Ambiguity Resolution for Author
2.3 Machine Learning
2.3.1 Supervised Learning Methods
2.3.2 Unsupervised Learning Methods
Chapter 3 Research Design
3.1 Data Collection
3.2 Feature Combinations
3.3 Data Processing
3.4 Machine Learning
3.5 Performance Evaluation
3.6 Settings for Year and Number of Pages
Chapter 4 Experimental Results
4.1 Common Feature Combinations
4.2 Features Year (Y) and Number of Pages (P)
4.3 Complexity of Datasets
4.4 Top One Feature Combinations
Chapter 5 Conclusions and Suggestions
5.1 Conclusions
5.2 Suggestions for Future Studies
References
Appendix
dc.language.iso | en
dc.title | 書目資料中著者姓名歧異性之解析 | zh_TW
dc.title | Ambiguity Resolution of Author Names for Bibliographic Data | en
dc.type | Thesis
dc.date.schoolyear | 99-2
dc.description.degree | Master's
dc.contributor.oralexamcommittee | 唐牧群, 黃乾綱
dc.subject.keyword | 著者歧義性, 書目資料, 機器學習 | zh_TW
dc.subject.keyword | Author Disambiguation, Bibliographic Data, Machine Learning | en
dc.relation.page | 47
dc.rights.note | Authorization granted (open access worldwide)
dc.date.accepted | 2011-08-07
dc.contributor.author-college | College of Liberal Arts | zh_TW
dc.contributor.author-dept | Graduate Institute of Library and Information Science | zh_TW
Appears in collections: Department of Library and Information Science

Files in this item:
File | Size | Format
ntu-100-1.pdf | 2.88 MB | Adobe PDF