Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/47514
Full metadata record
DC field: Value (Language)
dc.contributor.advisor: 鄭卜壬
dc.contributor.author: Ting-Chu Lin (en)
dc.contributor.author: 林庭竹 (zh_TW)
dc.date.accessioned: 2021-06-15T06:03:44Z
dc.date.available: 2012-08-20
dc.date.copyright: 2010-08-20
dc.date.issued: 2010
dc.date.submitted: 2010-08-16
dc.identifier.citation:
[1] N. Ailon. Aggregation of partial rankings, p-ratings and top-m lists. In SODA ’07: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 415–424, Philadelphia, PA, USA, 2007. Society for Industrial and Applied Mathematics.
[2] J. A. Aslam and M. Montague. Models for metasearch. In SIGIR ’01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 276–284, New York, NY, USA, 2001. ACM.
[3] J. P. Callan, Z. Lu, and W. B. Croft. Searching distributed collections with inference networks. In SIGIR ’95: Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval, pages 21–28, New York, NY, USA, 1995. ACM.
[4] E. M. Voorhees, N. K. Gupta, and B. Johnson-Laird. The collection fusion problem. In The Third Text REtrieval Conference (TREC-3), pages 95–104, 1994.
[5] E. A. Fox and J. A. Shaw. Combination of multiple searches. In The Second Text REtrieval Conference (TREC-2), pages 243–252. National Institute of Standards and Technology Special Publication 500-215, 1994.
[6] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT ’95: Proceedings of the Second European Conference on Computational Learning Theory, pages 23–37, London, UK, 1995. Springer-Verlag.
[7] X. Geng, T.-Y. Liu, T. Qin, and H. Li. Feature selection for ranking. In SIGIR ’07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 407–414, New York, NY, USA, 2007. ACM.
[8] J. H. Lee. Analyses of multiple evidence combination. SIGIR Forum, 31(SI):267–276, 1997.
[9] Q. Li, S.-H. Myaeng, Y. Jin, and B.-y. Kang. Concept unification of terms in different languages for IR. In ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, pages 641–648, Morristown, NJ, USA, 2006. Association for Computational Linguistics.
[10] D. Lillis, F. Toolan, R. Collier, and J. Dunnion. ProbFuse: a probabilistic approach to data fusion. In SIGIR ’06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 139–146, New York, NY, USA, 2006. ACM.
[11] F. Martínez-Santiago, L. A. Ureña-López, and M. Martín-Valdivia. A merging strategy proposal: The 2-step retrieval status value method. Inf. Retr., 9(1):71–93, 2006.
[12] M. Montague and J. A. Aslam. Condorcet fusion for improved retrieval. In CIKM ’02: Proceedings of the eleventh international conference on Information and knowledge management, pages 538–548, New York, NY, USA, 2002. ACM.
[13] Z. Nie, Y. Ma, S. Shi, J.-R. Wen, and W.-Y. Ma. Web object retrieval. In WWW ’07: Proceedings of the 16th international conference on World Wide Web, pages 81–90, New York, NY, USA, 2007. ACM.
[14] J. Pickens and G. Golovchinsky. Ranked feature fusion models for ad hoc retrieval. In CIKM ’08: Proceeding of the 17th ACM conference on Information and knowledge management, pages 893–900, New York, NY, USA, 2008. ACM.
[15] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Mach. Learn., 37(3):297–336, 1999.
[16] J. Xu and X. Li. Learning to rank collections. In SIGIR ’07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 765–766, New York, NY, USA, 2007. ACM.
[17] R. Yan and A. G. Hauptmann. Probabilistic latent query analysis for combining multiple retrieval sources. In SIGIR ’06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 324–331, New York, NY, USA, 2006. ACM.
[18] Z. Zheng, K. Chen, G. Sun, and H. Zha. A regression framework for learning ranking functions using relative relevance judgments. In SIGIR ’07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 287–294, New York, NY, USA, 2007. ACM.
[19] K. Zhou, G.-R. Xue, H. Zha, and Y. Yu. Learning to rank with ties. In SIGIR ’08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 275–282, New York, NY, USA, 2008. ACM.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/47514
dc.description.abstract: The search-result merging problem is to combine the results of several information retrieval systems or search engines so as to obtain a more accurate relevance ranking. Among the many ways to improve search quality, combining several different search engines is one line of research, and how to exploit their complementary strengths is its central question. Many studies have shown that effectively merging the results of several retrieval systems significantly improves ranking accuracy, raising retrieval effectiveness and precision beyond that of the original systems.
The goal of this thesis is to select a subset of the training data such that the fusion model trained on that subset yields the greatest improvement when it is used to fuse all of the data. We take the probFuse algorithm [10] as our example and propose two methods, a greedy approach and a boosting approach. The greedy approach offers two selection policies: choosing training queries independently, or taking the relationships among training queries into account. The boosting approach is a framework for the data fusion problem: it selects several subsets of the training data, trains a fusion model on each subset so that every model is optimized for part of the training data, and finally combines these models linearly into the final fusion model.
Through training data selection we improve the merged search results. Extensive experiments on TREC-3, 4, 5 and NTCIR-3, 4 show that a fusion algorithm trained on the selected data produces significantly better fused results than the same algorithm without selection, because well-chosen training data yields a more appropriate fusion model. We propose two useful training data selection methods and define them as a formal framework that can be applied to any fusion algorithm. (zh_TW)
dc.description.abstract: Search-result merging combines the results of several search engines to obtain better performance. Several early studies have shown that combining different information retrieval (IR) models can greatly improve retrieval effectiveness and accuracy beyond what any individual model can achieve. In machine learning and statistics applications, researchers likewise apply data fusion algorithms to ensemble the results of different models and combine their complementary abilities.
One of the state-of-the-art data fusion algorithms of recent years is probFuse [10]. It uses the past performance of each model to predict that model's confidence. Although it has shown promising performance in many studies, it considers only the models and does not take the diversity of queries into account. In our analysis of experiments on TREC-3, 4, 5 and NTCIR-3, 4, we found that the performance of a model varies from one query to another. Inspired by this observation, we assume that not all training examples are effective.
We propose two novel approaches, a greedy approach and a boosting approach, that select training data so as to maximize the improvement obtained from a data fusion algorithm. The greedy approach has two selection policies, dependent and independent, both of which greedily select training examples. Dependent selection takes the concurrence of training examples into account; independent selection chooses each training example individually, one after another. The boosting approach is a framework we designed for the data fusion problem; it emphasizes different training examples and generates a linear ensemble based on the weights of the training data.
Extensive experiments were performed on several data sets, including TREC-3, 4, 5 and NTCIR-3, 4, and the outcomes were very promising: with either of our data selection methods, the probFuse algorithm clearly performs better than before.
Our work not only improves the effectiveness of an existing fusion algorithm but also reduces training time. The boosting framework we propose can also be applied to any data fusion algorithm on the fly. (en)
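To make the two ideas in the abstract above more concrete, the sketch below shows (a) an independently greedy flavor of training-data selection, where a training query is kept only if a fusion model trained on that query alone beats a model trained on the full training set, and (b) the linear-ensemble combination that the boosting approach uses as its final step. This is a minimal illustration under assumed interfaces: `train_fusion_model`, `evaluate`, `min_gain`, and `model.score` are hypothetical placeholders rather than functions from the thesis, and the thesis's actual selection criteria and boosting weights are not reproduced here.

```python
# Minimal sketch of the two ideas in the abstract; the helper names are
# hypothetical placeholders, not code from the thesis.

def select_independently(train_queries, validation_queries,
                         train_fusion_model, evaluate, min_gain=0.0):
    """Independently greedy selection: keep a training query only if a fusion
    model trained on that query alone improves the validation score over a
    model trained on the whole training set."""
    baseline = evaluate(train_fusion_model(train_queries), validation_queries)
    selected = [q for q in train_queries
                if evaluate(train_fusion_model([q]), validation_queries)
                > baseline + min_gain]
    # Fall back to the full set if the filter would discard every query.
    return selected or list(train_queries)


def linear_ensemble_score(models, weights, query, doc):
    """Final combination step of the boosting approach: a linear ensemble of
    fusion models, each weighted by how much its training subset was emphasized."""
    return sum(w * model.score(query, doc) for w, model in zip(weights, models))
```

In the dependent policy and in the boosting framework, the simple per-query test above would be replaced by one that considers the queries already selected or the current example weights, respectively.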
dc.description.provenance: Made available in DSpace on 2021-06-15T06:03:44Z (GMT). No. of bitstreams: 1
ntu-99-R97922134-1.pdf: 3288384 bytes, checksum: c9a17c5b75644438fc2fb782124eb5d1 (MD5)
Previous issue date: 2010 (en)
dc.description.tableofcontents:
Contents
致謝
中文摘要
Abstract
1 Introduction
1.1 Background
1.2 Supervised Learning for Search Result Merging
1.3 Diversity of Search Results
1.4 Problem Specification
1.5 Proposed Approaches
2 Related Work
3 Methodology
3.1 Greedy Approach
3.1.1 Independently Greedy Approach
3.1.2 Dependently Greedy Approach
3.2 Boosting Approach
3.2.1 Boosting Idea
3.2.2 A Generalized Version of AdaBoost
3.2.3 Data Fusion Version of Boosting
3.2.4 Base Learner: probFuse Algorithm with Data Selection
3.2.5 Analysis of the Data Fusion Version of Boosting
4 Experiment
4.1 Description of Data Sets
4.2 Experiment Setting
4.3 Experiment Results
4.3.1 Exp 1: Greedy Approach
4.3.2 Exp 2: Boosting Approach
4.3.3 Summary
5 Discussion
6 Conclusion and Future Works
Bibliography
dc.language.iso: en
dc.subject: 資訊檢索 (zh_TW)
dc.subject: information retrieval (en)
dc.title: 監督式學習之搜尋結果合併問題中訓練資料篩選方法 (zh_TW)
dc.title: Training Data Selection for Supervised Learning Based Search-result Merging (en)
dc.type: Thesis
dc.date.schoolyear: 98-2
dc.description.degree: 碩士
dc.contributor.oralexamcommittee: 陳信希, 盧文祥
dc.subject.keyword: 資訊檢索 (zh_TW)
dc.subject.keyword: information retrieval (en)
dc.relation.page: 49
dc.rights.note: 有償授權
dc.date.accepted: 2010-08-16
dc.contributor.author-college: 電機資訊學院 (zh_TW)
dc.contributor.author-dept: 資訊工程學研究所 (zh_TW)
Appears in Collections: 資訊工程學系

Files in This Item:
File: ntu-99-1.pdf
Size: 3.21 MB
Format: Adobe PDF
Access: not authorized for public access


