Training Data Selection for Supervised Learning Based Search-result Merging

Ting-Chu Lin; 林庭竹

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/47514

Title:	Training Data Selection for Supervised Learning Based Search-result Merging (監督式學習之搜尋結果合併問題中訓練資料篩選方法)
Author:	Ting-Chu Lin (林庭竹)
Advisor:	鄭卜壬
Keywords:	information retrieval (資訊檢索)
Year of publication:	2010
Degree:	Master's
Abstract:	Search-result merging combines the results of several information retrieval systems or search engines to obtain a more accurate relevance ranking. There are many ways to improve search-engine quality; merging several different search engines is one research direction, and the central question is how to let the strengths of one system offset the weaknesses of another. Many studies have shown that effectively merging the results of several information retrieval systems significantly improves ranking accuracy, and that it can raise retrieval effectiveness and precision above those of the original individual systems. The goal of this thesis is to select a subset of the training data such that the fusion model trained on that subset yields the greatest improvement when it is used to fuse all of the training data. We use the ProbFuse algorithm [10] as an example and propose two methods: a greedy algorithm and a boosting algorithm. The greedy method has two selection policies: selecting training examples independently, or taking the relationships among training examples into account. The boosting algorithm is a framework for the data fusion problem. It selects several subsets of the training data and trains a fusion model on each, so that each model is optimized for part of the training data, and then combines these models linearly into the final fusion model. Training data selection improves the merged search results. Extensive experiments on TREC-3, 4, 5 and NTCIR-3, 4 show that a fusion algorithm with training data selection performs significantly better than the same algorithm without it, because well-chosen training data produce a more suitable fusion model. We propose two useful training data selection methods and define them as a formal framework that can be applied to any fusion algorithm.
Search-result merging combines the results of several search engines to achieve better performance. Several early studies have shown that combining different information retrieval (IR) models can greatly improve retrieval effectiveness and accuracy over what any individual model achieves. In machine learning and statistics applications, researchers likewise often apply data fusion algorithms to ensemble the results of different models and combine their different abilities. One of the state-of-the-art data fusion algorithms of recent years is ProbFuse [10]. It uses the past performance of each model to predict the confidence of that model. Although it has shown promising performance in many studies, it considers only the model and does not take the diversity of queries into account. In our analysis of experiments on TREC-3, 4, 5 and NTCIR-3, 4, we found that the performance of a model varies from one query to another. Inspired by this finding, we assume that not all training examples are effective. We propose two novel approaches, a Greedy approach and a Boosting approach, that select training data to maximize the improvement obtained from a data fusion algorithm. The Greedy approach has two selection policies, dependent and independent, both of which select training examples greedily. Dependent selection takes the relationships among training examples into account; independent selection chooses each training example individually, one after another. The Boosting approach is a framework we designed for the data fusion problem; it emphasizes different training examples and generates a linear ensemble based on the weights of the different training data. Extensive experiments were performed on several data sets, including TREC-3, 4, 5 and NTCIR-3, 4, and the outcome was very promising. With either of our data selection methods, the ProbFuse algorithm clearly performs better than before. Our work not only improves the effectiveness of an existing fusion algorithm but also reduces training time. The boosting framework we propose can also be applied to any data fusion algorithm on the fly.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/47514
Full-text license:	Paid license
Appears in collections:	Department of Computer Science and Information Engineering

Files in this item:

File	Size	Format
ntu-99-1.pdf (not authorized for public access)	3.21 MB	Adobe PDF
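
The abstract describes two greedy selection policies without giving their details, so the following is only a minimal sketch of how independent and dependent selection could look. Everything in it is an assumption made for illustration: the toy data in `QUERIES`, the helper names, the stopping rule, and above all the `train` function, which is a simple stand-in weighting scheme and not ProbFuse.

```python
from typing import Callable, List, Sequence, Tuple

Weights = Tuple[float, ...]

# Hypothetical toy data: effectiveness (e.g. average precision) of two
# retrieval systems on three training queries. The numbers are made up.
QUERIES = {"q1": (0.8, 0.2), "q2": (0.6, 0.3), "q3": (0.1, 0.7)}
NUM_SYSTEMS = 2


def train(subset: Sequence[str]) -> Weights:
    """Stand-in fusion model (not ProbFuse): each system's weight is its
    share of the total effectiveness observed on the chosen queries."""
    totals = [sum(QUERIES[q][i] for q in subset) for i in range(NUM_SYSTEMS)]
    norm = sum(totals)
    return tuple(t / norm for t in totals)


def evaluate(weights: Weights) -> float:
    """Mean weighted effectiveness over ALL training queries."""
    per_query = [
        sum(w * s for w, s in zip(weights, scores))
        for scores in QUERIES.values()
    ]
    return sum(per_query) / len(per_query)


def independent_selection(
    examples: Sequence[str],
    train_fn: Callable[[Sequence[str]], Weights],
    eval_fn: Callable[[Weights], float],
    k: int,
) -> List[str]:
    """Score every example on its own and keep the k best."""
    ranked = sorted(
        examples, key=lambda ex: eval_fn(train_fn([ex])), reverse=True
    )
    return ranked[:k]


def dependent_selection(
    examples: Sequence[str],
    train_fn: Callable[[Sequence[str]], Weights],
    eval_fn: Callable[[Weights], float],
) -> List[str]:
    """Forward selection: repeatedly add the example that helps most given
    the examples already chosen; stop when nothing improves the score."""
    selected: List[str] = []
    remaining = list(examples)
    best = float("-inf")
    while remaining:
        score, pick = max(
            ((eval_fn(train_fn(selected + [ex])), ex) for ex in remaining),
            key=lambda pair: pair[0],
        )
        if score <= best:
            break
        selected.append(pick)
        remaining.remove(pick)
        best = score
    return selected


if __name__ == "__main__":
    ids = list(QUERIES)
    print(independent_selection(ids, train, evaluate, k=2))
    print(dependent_selection(ids, train, evaluate))
```

On this toy data, independent selection with `k=2` keeps `q1` and `q2`, while dependent selection stops after `q1` because adding any further query lowers the score. In both cases the selected subset scores higher than training on all three queries, which is the effect the abstract attributes to training data selection. The `train_fn` and `eval_fn` parameters are kept generic so that a real fusion algorithm and a real evaluation measure could be substituted.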

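Unless their copyright terms are specifically stated otherwise, all items in the system are protected by copyright, with all rights reserved.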
