Please use this identifier to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/4717
Title: | 可有效回覆字串相似度搜尋之演算法 A Query-Efficient Algorithm for String-Similarity Search |
Authors: | You-Sheng Huang 黃宥勝 |
Advisor: | 趙坤茂(Kun-Mao Chao) |
Keyword: | 編輯距離,字串相似度搜尋,範圍查詢,前k個查詢, edit distance,string similarity search,range query,top-k query, |
Publication Year : | 2015 |
Degree: | 碩士 |
Abstract: | 編輯距離(edit distance) 是一個廣泛地被用於測量字串之間相似程度
的度量,而字串相似度搜尋(string similarity search) 則要找出在特定的字串集合中和給予的查詢字串(query string) 相似的字串。以編輯距離為測量依據的字串相似度搜尋,被大量地應用在如數據清理(database cleaning) 、錯誤檢查更正(Error detection and correction) 與資料擷取(data retrieval) 等領域。目前此類問題的解決方法大多先將可排除的字串過濾掉後,再對剩下的字串進行檢驗。然而,這些方法在排除字串的過程中,幾乎全部都採用相同的過濾規則,使得效率隨著字串集合的改變而下降。為了克服這個問題,我們提出一個整合不同過濾方法 的資料結構(data structure) 讓字串的過濾維持穩定的效率。我們並提出對應的演算法解決兩個主要的字串相似度搜尋問題:範圍查詢(range query) 與前k 個查詢(top-k query) 。實驗結果顯示當門檻值小於一定的程度時,我們的方法在範圍查詢上具有優異的性能表現。 Edit distance, a measure determining the similarity between two strings, is a criterion that has been used widely. String similarity search finds strings in a dataset that are similar to a given query string. Edit-distance based string similarity search is exploited in many fields, e.g., database cleaning, error detection and correction and data retrieval. Most approaches toward string similarity search resort to filtering out as many strings in datasets as possible and verifying the remaining strings. However, these approaches use only one filtering method throughout the whole procedure, which makes the power of the method fluctuates during the whole procedure. To overcome this problem, we propose a data structure integrating different filtering methods and adopting a more efficient one on each phase. We also give corresponding algorithms for two important queries of string similarity search, range query and top-k query. Experimental results show that our approach is competitive for range query when thresholds are small enough. |
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/4717 |
Fulltext Rights: | 同意授權(全球公開) |
Appears in Collections: | 資訊工程學系 |
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-104-1.pdf | 491.49 kB | Adobe PDF | View/Open |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.