A Machine Learning Approach for Result Fusion in Multilingual Information Retrieval

Yu-Ting Wang; 王昱婷

Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/37957

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	陳信希
dc.contributor.author	Yu-Ting Wang	en
dc.contributor.author	王昱婷	zh_TW
dc.date.accessioned	2021-06-13T15:53:20Z	-
dc.date.available	2008-07-03
dc.date.copyright	2008-07-03
dc.date.issued	2008
dc.date.submitted	2008-06-23
dc.identifier.citation	Chen, K. H., H. H. Chen, et al. (2003). 'Overview of CLIR Task at the Third NTCIR Workshop.' NTCIR-3 Proceedings. Cheng, P. J., J. W. Teng, et al. (2004). 'Translating unknown queries with web corpora for cross-language information retrieval.' Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval: 146-153. Gatford, M., M. M. Hancock-Beaulieu, et al. (1995). 'Okapi at TREC-3.' The Third Text Retrieval Conference (TREC-3): 109-126. Kishida, K., K. Chen, et al. (2005). 'Overview of CLIR Task at the Fifth NTCIR Workshop.' Proc. of the NTCIR-5 Workshop Meeting: 1-38. Kishida, K., K. H. Chen, et al. (2004). 'Overview of CLIR task at the fourth NTCIR workshop.' Proceedings of NTCIR 4. Le Calve, A. and J. Savoy (2000). 'Database merging strategy based on logistic regression.' Information Processing and Management 36(3): 341-359. Lin, W. C. and H. H. Chen (2002). 'Merging Mechanisms in Multilingual Information Retrieval.' Working Notes for the CLEF 2002 Workshop: 97-102. Lu, C., Y. Xu, et al. (2007). 'Improving translation accuracy in web-based translation extraction.' Proceedings of NTCIR-6 Workshop. Martinez-Santiago, F., L. A. Urena-Lopez, et al. (2006). 'A merging strategy proposal: The 2-step retrieval status value method.' Information Retrieval 9(1): 71-93. Tsai, M. F., T. Y. Liu, et al. 'FRank: A Ranking Method with Fidelity Loss.' Powell, A. L., J. C. French, et al. (2000). 'The impact of database selection on distributed searching.' Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval: 232-239. Robertson, S. E. and S. Walker (1994). 'Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval.' Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval: 232-241. Rüping, S. (2000). 'mySVM-Manual.' Universität Dortmund, Lehrstuhl Informatik VIII, Oktober. Savoy, J. (2003). 'Cross-language information retrieval: experiments based on CLEF 2000 corpora.' Information Processing and Management 39(1): 75-115. Savoy, J., A. Le Calve, et al. (1997). 'Report on the TREC-5 experiment: Data fusion and collection fusion.' Proceedings of TREC 5: 500-238. Si, L. and J. Callan (2005). 'CLEF 2005: Multilingual Retrieval by Combining Multiple Multilingual Ranked Lists.' Proceedings of Cross-Language Evaluation Forum. Towell, G., E. M. Voorhees, et al. (1995). 'Learning collection fusion strategies for information retrieval.' International Conference on Machine Learning: 540-548. Valdivia, M. T. M., F. Martinez-Santiago, et al. (2003). 'Aprendizaje neuronal aplicado a la fusion de colecciones multilingues en CLIR.' Procesamiento del lenguaje natural: 227-234. Virga, P. and S. Khudanpur (2003). 'Transliteration of Proper Names in Cross-Lingual Information Retrieval.' Proceedings of the ACL Workshop on Multi-lingual Named Entity Recognition. Voorhees, E. M., N. K. Gupta, et al. (1995). 'Learning collection fusion strategies.' Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval: 172-179. Zhu, J. and H. Wang (2006). 'The effect of translation quality in MT-based cross-language information retrieval.' Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the ACL: 593-600. Chu, Y. L. (2001). 'Chinese Word Segmentation and Named Entity Extraction.' Master's thesis, Department of Computer Science, National Taiwan University.
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/37957	-
dc.description.abstract	多語言檢索主要是允許使用者給予一種語言的查詢，檢索出多種語言的相關文件。一般而言，處理多語言檢索，首先利用查詢，在各個語言的語料庫中找出在該語言中的相關文件；利用合併的方法，將此些不同語言的相關文件合併成最終多語言的相關文件集。在此論文中的主要議題是如何使用最佳的合併方法，來達到不錯的效能。此研究中，我們使用機器學習的方法去建立一個跨語言的合併模型；透過此合併模型去調整每篇文件的合併分數。首先，探討處理跨語言檢索問題過程中，有哪些是可能影響跨語言檢索效能的因素。我們從三個層面做探討；翻譯層面、文件本身的層面以及較為一般性層面的特徵。在翻譯層面，過去有不少研究顯示，跨語言檢索時，翻譯品質的好壞對檢索結果的效能佔有很大程度的影響性；除此之外，我們將查詢中的每一個字給予分類成一個類別，類別則由人為的方式下去做定義。發現有幾個類別在資料檢索過程中，佔有較大程度的影響性，甚至發現不同類別之間亦存在著某些程度的相關連；其中佔有一定影響性的類別，其翻譯品質好壞，對跨語言檢索更為重大。在文件本身層面，利用文件本身以及文件標題的長度來做為此文件所含有的資訊量指標。從此些層次取出特徵，利用機器學習的方法，不只學習出跨語言的合併模型，亦學習出在機器學習過程中哪些特徵是較具影響性的。實驗結果顯示，利用機器學習的方法，所達到的檢索效能較傳統合併的方法效能佳；且發現翻譯品質的好壞，包含組織名稱，事件名稱，抽象名詞以及專業名詞的翻譯品質對跨語言檢索最有影響性。	zh_TW
dc.description.abstract	Multilingual information retrieval (MLIR) aims to enable users to enter a query in one language and access relevant documents in various languages. MLIR is usually implemented by first retrieving from each language collection separately, which yields one bilingual ranked list of documents per collection. How to merge these bilingual lists is the main issue addressed in this work. We use a machine learning approach, FRank, to build a merge model, and merge the multiple bilingual lists using the merge model score and the retrieval score. First, we identify factors that may influence the MLIR process at three levels: the general level, the translation level, and the document level. At the translation level, previous studies have shown that translation quality is crucial for cross-language information retrieval. In addition, we classify each query term into one of a set of manually pre-defined categories. Our experiments show that some categories play a more important role in a query during retrieval, and that there are relationships between categories. The translation quality of those influential categories is crucial for MLIR. At the document level, we extract document length and document title length as indicators of how much information a document contains. Across these levels, we extract 62 features in total; using these features, we not only train a merge model but also identify which features are effective for the MLIR merging process. In our experiments, the merge model achieves better performance than all of the traditional merging strategies compared, including raw-score merging, round-robin merging, normalized-by-top-K merging, logistic regression, and the 2-step re-indexing merging method. Furthermore, from the features selected by FRank as weak learners, we find that the translation quality of some query term categories, translatable query terms, and the degree of ambiguity in translation are effective for MLIR merging.	en
dc.description.provenance	Made available in DSpace on 2021-06-13T15:53:20Z (GMT). No. of bitstreams: 1 ntu-97-R95922066-1.pdf: 531014 bytes, checksum: a9957e285519c36df95a7b2bd0ee1e06 (MD5) Previous issue date: 2008	en
dc.description.tableofcontents	Oral Examination Committee Certification i Chinese Abstract ii ABSTRACT iii LIST OF FIGURES vi LIST OF TABLES vii Chapter 1 INTRODUCTION 1 1.1. MOTIVATION 1 1.2. MERGING PROBLEM 2 1.3. THESIS STRUCTURE 4 Chapter 2 TRADITIONAL MERGING STRATEGIES 5 2.1. HEURISTIC MERGING STRATEGY 5 2.1.1. RAW SCORE 6 2.1.2. ROUND ROBIN 7 2.1.3. NORMALIZED BY TOP K 8 2.2. LEARNING BASED MERGING STRATEGY 9 2.2.1. LOGISTIC REGRESSION 9 2.3. RETRIEVAL BASED MERGING STRATEGY 11 2.3.1. 2-STEP RETRIEVAL STATUS VALUE METHOD 11 Chapter 3 ANALYZE THE INFLUENTIAL FEATURES FOR MLIR 13 3.1. TRANSLATION LEVEL 13 3.1.1. THE IMPORTANCE OF SOME QUERY TERMS 16 3.1.2. QUERY TERM CATEGORY 17 3.1.3. IMPORTANCE AND RELATIONS OF QUERY TERM CATEGORY 18 3.1.3.1. EXPERIMENT SETTING 19 3.1.3.2. RETRIEVE EXCEPT ONE QUERY TERM CATEGORY 21 3.1.3.3. RETRIEVE EXCEPT TWO QUERY TERM CATEGORY 26 3.1.4. PROMOTE PERFORMANCE IN CLIR 30 3.1.4.1. EXPERIMENT SETTING 33 3.1.4.2. EXPERIMENTAL RESULT 34 3.1.5. CONCLUSION ON TRANSLATION LEVEL 35 3.2. DOCUMENT LEVEL 36 3.3. GENERAL LEVEL 37 Chapter 4 USING FRANK TO BUILD A MERGE MODEL 38 4.1. SYSTEM OVERVIEW 38 4.2. FEATURE SELECTION 40 4.3. USING FRANK APPROACH TO MERGE 46 4.4. EXPERIMENT 47 4.4.1. EXPERIMENT SETTING AND PREPROCESSING 47 4.4.2. LEARNING A MERGE MODEL 49 4.4.3. EXPERIMENT RESULT 50 4.4.4. DISCUSSION 52 Chapter 5 CONCLUSION AND FUTURE WORK 57 5.1. CONCLUSION 57 5.2. FUTURE WORK 58 REFERENCES 59 APPENDIX NTCIR3 ENGLISH QUERY DATA SET WITH LABELED CATEGORY 61 NTCIR4 ENGLISH QUERY DATA SET WITH LABELED CATEGORY 65 NTCIR5 ENGLISH QUERY DATA SET WITH LABELED CATEGORY 70
dc.language.iso	en
dc.subject	結果合併	zh_TW
dc.subject	跨語言檢索	zh_TW
dc.subject	機器學習	zh_TW
dc.subject	Machine Learning	en
dc.subject	Multilingual Information Retrieval	en
dc.subject	Data Fusion	en
dc.title	以機器學習方法處理跨語言檢索合併問題	zh_TW
dc.title	A Machine Learning Approach for Result Fusion in Multilingual Information Retrieval	en
dc.type	Thesis
dc.date.schoolyear	96-2
dc.description.degree	Master's
dc.contributor.oralexamcommittee	張俊盛,梁婷,鄭卜壬
dc.subject.keyword	跨語言檢索,結果合併,機器學習	zh_TW
dc.subject.keyword	Multilingual Information Retrieval, Data Fusion, Machine Learning	en
dc.relation.page	73
dc.rights.note	Paid authorization (有償授權)
dc.date.accepted	2008-06-23
dc.contributor.author-college	電機資訊學院	zh_TW
dc.contributor.author-dept	資訊工程學研究所	zh_TW
Appears in Collections:	Department of Computer Science and Information Engineering
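
The English abstract in the record above names three heuristic baselines (raw-score, round-robin, and normalized-by-top-K merging) without defining them. The Python sketch below illustrates their commonly described forms, which is an assumption here since this record does not give the thesis's exact formulations. The function names, the `(doc_id, score)` list layout, and the default `k` are hypothetical, and the sketch does not represent the FRank merge model itself.

```python
from itertools import zip_longest

# Assumed input: one ranked list per language collection, each a list of
# (doc_id, score) pairs already sorted by descending retrieval score.

def raw_score_merge(lists):
    """Pool all documents and sort by their original retrieval scores."""
    pooled = [pair for ranked in lists for pair in ranked]
    return sorted(pooled, key=lambda pair: pair[1], reverse=True)

def round_robin_merge(lists):
    """Interleave the lists by rank: all rank-1 documents, then rank-2, ..."""
    merged = []
    for tier in zip_longest(*lists):
        merged.extend(pair for pair in tier if pair is not None)
    return merged

def normalized_top_k_merge(lists, k=3):
    """Divide each list's scores by the mean of its top-k scores, then pool.

    Assumption: normalization uses the average of the top-k scores.
    """
    pooled = []
    for ranked in lists:
        top_scores = [score for _, score in ranked[:k]]
        if not top_scores:
            continue  # skip empty result lists
        norm = sum(top_scores) / len(top_scores)
        if norm == 0:
            norm = 1.0  # avoid division by zero; scores stay unchanged
        pooled.extend((doc_id, score / norm) for doc_id, score in ranked)
    return sorted(pooled, key=lambda pair: pair[1], reverse=True)
```

Raw-score merging favors whichever collection produces larger scores, round-robin ignores scores entirely, and top-K normalization is one way to make scores from different collections roughly comparable before pooling.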

Files in this item:

File	Size	Format
ntu-97-1.pdf (not authorized for public access)	518.57 kB	Adobe PDF

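Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.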