點對點網路下複合檢索系統之設計與實作

Li-Wei Yang; 楊立偉

請用此 Handle URI 來引用此文件： http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/25692

標題:	點對點網路下複合檢索系統之設計與實作 Complex Search in Peer-to-Peer Networks
作者:	Li-Wei Yang 楊立偉
指導教授:	莊裕澤(Yuh-Jzer Joung)
關鍵字:	點對點網路,資訊檢索,關鍵詞搜尋,前置搜尋,通用字元搜尋,多維度搜尋,超立方空間, Peer-to-peer network,Information retrieval,Keyword search,Prefix search,Wildcard search,Complex search,Multi-dimensional search,Hypercube,
出版年 :	2006
學位:	博士
摘要:	本研究提出了一種全新的索引與搜尋方案，稱之為KISS（以關鍵符記為基礎的索引與搜尋方案；keytoken-based index and search scheme）。這是將物件以其關鍵詞之字元與位置資訊來索引在超立方空間的一種方法。透過這個索引與搜尋方案，可以完整實現在點對點網路下各種複合檢索的能力，包含了關鍵詞搜尋、前置搜尋、通用字元搜尋、以及多維度搜尋等。為了避免傳統透過倒置列表法所帶來的問題，這個索引與搜尋方案使用超立方空間做為基礎，並將物件以其所屬的關鍵詞集合來做為索引的依據。因此，這個索引與搜尋方案具有以下特點：　　一、透過單一的查找操作，就能有效率地找到任何物件（就好比在DHT網路下的鍵值搜尋一樣有效率）；若要插入、刪除、或維護物件時，也都可在僅一個查找操作內完成，而不會跟該物件的關鍵詞集大小有關係。　　二、單一關鍵詞的索引資訊是由一群節點所負責：越常出現的關鍵詞，會由越多的節點來負責。因此由索引與搜尋所造成的工作負荷可以平均地分散在各節點上。　　三、每個查詢關鍵詞組均可以得到一個子超立方空間，並進而得到一個二元展開樹；而遍歷該二元展開樹，就可以搜尋到所有符合查詢的物件。其中，查詢結果可以互動、累進的方式來呈現。此外，導引式深度優先搜尋（Guided Depth-Frist Search，GDFS）及快取等技巧，能夠提升搜尋效率；而利用關鍵符記的連續性或對稱性來修剪二元展開樹，則可以縮小搜尋範圍。　　四、基於該二元展開樹的遍歷方式，能夠提供富有彈性的定序機制。例如，查詢結果可以依照物件的明確程度、關鍵詞長度、字典順序、或是相似度來進行排序，而不需經過全域統計的方式來完成。　　五、當把每個屬性視為一個子超立方空間時，則多個屬性可以構成一個較大的超立方空間，也因此多維度搜尋可以在這個較大的超立方空間上被實現，不論該多維度搜尋是針對單一屬性或是多個屬性的查詢。本索引與搜尋方案在進行多維度搜尋時，是一次同時考慮到多個屬性，因此當有越多屬性被指定時，搜尋會變得越有效率；而不像其它傳統方法是一次處理一個屬性，然後再將各屬性處理後的結果做合併。此外，透過將關鍵符記進行合併的技巧，可以調整該超立方空間的維度，端視各種上層應用、或底層網路之需要。　　總括來說，KISS這個索引與搜尋方案提供了一種簡單、有效、統一的方式，來實現點對點網路下各種複合檢索的能力，並且解決了各種既有的問題，如增加的索引成本、頻寬額外負擔、不平衡的負荷、網路熱點與故障點等。而這個方案也支援了富有彈性的定序機制，卻不需要任何的全域資訊。　　本研究以Java程式語言建立了一個KISS的雛形系統，並在該雛形系統上進行一連串的實驗，以評估本方案的負荷平衡程度與搜尋效率。本研究使用了兩組真實資料來進行實驗，第一組為網站分類目錄資料庫，共有131,180筆資料；第二組為網際網路上最大的音樂專輯資料庫，共有2,412,613筆資料。實驗結果證明之前所提到的特點都能夠被具體實現。而本研究也提供了關於本方案複雜性與維度的數學分析。 In this research work, KISS (Keytoken-based index and search scheme) is introduced with a novel technique to extract keytokens - character-position information - from keywords to index objects over a hypercube. With KISS, complex search capabilities are fully provided in P2P networks, including keyword search, prefix search, wildcard search, and multi-dimensional search. To address the problems of other existing approaches mainly based on inverted index, KISS uses hypercube to index objects according to their keyword set. As a result, the following features are provided: 1. An object can be efficiently and deterministically located by one lookup operation such as key search in DHT-based overlay. Object insertion, deletion, and maintenance also take one lookup only, regardless of the keyword set size of the object. 2. The index entries of a single keyword are handled by a set of nodes: the more popular a keyword is, the higher the number of nodes responsible for the keyword. So the index/search load of nodes can be balanced. 3. Searching the matching objects corresponds to traversing the spanning binomial tree of a subhypercube obtained from the queried keyword set. Every matched object can be retrieved interactively and cumulatively by exploring the tree. Techniques such as Guided Depth-Frist Search (GDFS) and caching can boost the search efficiency; while tree pruning by the continuous and/or symmetric property of keytoken can reduce the search space. 4. Flexible ranking is provided based on how the spanning binomial tree is traversed. For example, ranking by object specificness, keyword length, lexicographical order, or similarity, can be facilitated without global statistics. 5. Multi-dimensional search is supported by constructing a larger hypercube, which consists of subhypercubes for every attribute. Thus, queries on any single/multiple attribute(s) can be fulfilled as well. KISS takes multiple attributes into consideration at a time, and becomes more efficient when more attributes are specified, contrary to other existing approaches that can only process one attribute at a time and then merge all the search results for each one. In addition, by keytoken clustering, the dimensionality r of the hypercube is adjustable to fit various upper applications, or the underlying overlay. In summary, KISS provides a simple yet effective unified solution for complex search capabilities in P2P networks, and eliminates the problems such as increased index cost, bandwidth overhead, unbalanced load, hot spots, and points of failures, etc. It also supports a flexible ranking mechanism without any global knowledge required. A prototype of KISS is implemented in Java, and a series of experiments have been conducted to evaluate KISS in load balancing and search efficiency. Two real datasets are used: one consisting of 131,180 records from a website directory and the other consisting of 2,412,613 records from Internet CD database. Experimental results are presented to support the above-mentioned features, while a mathematical analysis of the complexity and dimensionality of KISS is also given.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/25692
全文授權:	未授權
顯示於系所單位：	資訊管理學系

文件中的檔案：

檔案	大小	格式
ntu-95-1.pdf 目前未授權公開取用	4.87 MB	Adobe PDF

顯示文件完整紀錄

系統中的文件，除了特別指名其著作權條款之外，均受到著作權保護，並且保留所有的權利。

DSpace

機構典藏 DSpace 系統致力於保存各式數位資料（如：文字、圖片、PDF）並使其易於取用。