Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/9960
Title: | Ambiguity Resolution of Author Names for Bibliographic Data (書目資料中著者姓名歧異性之解析) |
Author: | Chi-Nan Hsieh (謝其男) |
Advisor: | 陳光華 |
Keywords: | Author Disambiguation, Bibliographic Data, Machine Learning |
Publication year: | 2011 |
Degree: | Master's |
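Abstract: | When retrieving large volumes of scholarly information, users frequently run into author name ambiguity, which makes resolving groups of same-name authors an important research topic. Unlike previous studies, this study makes full use of the information in bibliographic records for disambiguation and uses no information beyond those records. We therefore use five features: co-author names (C), article title (T), journal title (J), publication year (Y), and number of pages (P). Of these, publication year and number of pages have not been used in any previous study. Using both supervised learning methods and an unsupervised method, this study examines the author name disambiguation accuracy of 28 different feature combinations.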
The study finds that journal title (J) and co-author (C) are particularly effective features: journal title (J) proves important under every method, while co-author (C) stands out mainly with the Support Vector Machine (SVM). In addition, publication year (Y) and number of pages (P) clearly improve disambiguation accuracy when combined with other features. Of the two, publication year (Y) gives the stronger boost (about 2.5% on average), and the effect of both features is especially pronounced with K-means clustering (about 5%). The feature combination CTJ, frequently used in previous studies, does not necessarily achieve the best accuracy; across the different methods, other combinations such as JYP, JY, and CJ can also reach the best accuracy. Finally, comparing results by dataset size and complexity shows that as the test datasets grow larger and more heterogeneous, the bibliographic data of cited documents alone can hardly provide sufficient disambiguation performance. This indicates that, to resolve name ambiguity effectively, future research must link and map bibliographic information outward to other information sources in order to obtain more distinctive author features.
Resolving name ambiguity when retrieving academic information makes research on author identification indispensable. In contrast to previous work, this study addresses the problem using only the information contained in bibliographic data. Five features, co-author (C), article title (T), journal title (J), year (Y), and number of pages (P), are extracted from bibliographic data and used to disambiguate author names. Features Y and P have not been used before. Both supervised learning methods (Naive Bayes and Support Vector Machine) and an unsupervised learning method (K-means) are employed to explore 28 different feature combinations. The findings show that journal title (J) and co-author (C) are very effective features. Feature J plays an important role in all three approaches, and feature C is outstanding mainly with SVM. In addition, year (Y) and number of pages (P) clearly improve accuracy when they accompany other feature combinations, and the average improvement from including feature Y is larger than that from feature P. Notably, the effect is stronger in K-means clustering (+4.98% on average) than in the Naive Bayes model (+0.90% on average) and the Support Vector Machine (+0.15% on average).
The traditionally used feature combination CTJ is also shown to be not superior to JYP, and the feature combinations CJY, JY, and J are likewise very effective in all three methods. Finally, disambiguation accuracy on the larger datasets is 10% lower than on the smaller ones, which indicates the limits of what bibliographic data alone can achieve in this “numerous and jumbled” real world. Consequently, a promising direction for future work is to build an intelligent mechanism that accurately maps other information onto bibliographic information, so as to obtain sufficient information for author disambiguation. |
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/9960 |
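Full-text authorization: | Authorized (open access worldwide) |
Appears in collections: | Department of Library and Information Science (圖書資訊學系) |
Files in this item:
File | Size | Format | |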
---|---|---|---|
ntu-100-1.pdf | 2.88 MB | Adobe PDF | View/Open |
Unless their copyright terms are specifically indicated, all items in the system are protected by copyright, with all rights reserved.