Skip navigation

DSpace

機構典藏 DSpace 系統致力於保存各式數位資料(如:文字、圖片、PDF)並使其易於取用。

點此認識 DSpace
DSpace logo
English
中文
  • 瀏覽論文
    • 校院系所
    • 出版年
    • 作者
    • 標題
    • 關鍵字
    • 指導教授
  • 搜尋 TDR
  • 授權 Q&A
    • 我的頁面
    • 接受 E-mail 通知
    • 編輯個人資料
  1. NTU Theses and Dissertations Repository
  2. 電機資訊學院
  3. 資訊工程學系
請用此 Handle URI 來引用此文件: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/4717
完整後設資料紀錄
DC 欄位值語言
dc.contributor.advisor趙坤茂(Kun-Mao Chao)
dc.contributor.authorYou-Sheng Huangen
dc.contributor.author黃宥勝zh_TW
dc.date.accessioned2021-05-14T17:45:47Z-
dc.date.available2016-07-20
dc.date.available2021-05-14T17:45:47Z-
dc.date.copyright2015-07-20
dc.date.issued2015
dc.date.submitted2015-07-06
dc.identifier.citation[1] A. Chandel, O. Hassanzadeh, N. Koudas, M. Sadoghi, and D. Srivastava. Benchmarking declarative approximate selection predicates. In SIGMOD, pp. 353-364, 2007.
[2] S. Chaudhuri, B.-C. Chen, V. Ganti, and R. Kaushik. Exampledriven design of efficient record matching queries, In VLDB, pp. 327-338, 2007.
[3] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, pp. 491-500, 2001.
[4] C. Li, B. Wang, and X. Yang. Vgram: Improving performance of approximate queries on string collections using variable-length grams. In VLDB, pp. 303-314, 2007.
[5] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pp. 257-266, 2008.
[6] R. Vernicaand and C. Li. Efficient top-k algorithms for fuzzy search in string collections. In KEYS ’09: Proceedings of the First International Workshop on Keyword Search on Structured Data, pp. 9-14, 2009.
[7] J. Qin, W. Wang, Y. Lu, C. Xiao, and X. Lin. Efficient exact edit similarity query processing with the asymmetric signature scheme. In SIGMOD Conference, pp. 1033-1044, 2011.
[8] J. Wang, G. Li, and J. Feng. Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In SIGMOD Conference, pp. 85-96, 2012.23
[9] A. Behm, S. Ji, C. Li, and J. Lu. Space-constrained gram-based indexing for efficient approximate string search. In ICDE, pp. 604-615, 2009.
[10] A. Behm, C. Li, and M. J. Carey. Answering approximate string queries on large data sets using external memory. In ICDE, pp. 888-899, 2011.
[11] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, pp. 5, 2006.
[12] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB,
pp. 918-929, 2006.
[13] G. Li, D. Deng, J. Wang, and J. Feng. Pass-join: A partition-based method for similarity joins. In PVLDB, vol. 5, no. 3, pp. 253-264, 2011.
[14] Z. Zhang, M. Hadjieleftheriou, B. C. Ooi, and D. Srivastava. Bedtree: an all-purpose index structure for string similarity search based on edit distance. In SIGMOD, pp. 915-926, 2010.
[15] W. Lu, X. Du, M. Hadjieleftheriou, and B. C. Ooi. Efficiently supporting edit distance based string similarity search using B+-trees. In IEEE Transactions on Knowledge and Data Engineering, vol.26, no.12, pp.2983-2996, 2014.
[16] W. J. Masek and M. Paterson. A faster algorithm computing string edit distances. In Journal of Computer and System Sciences, 20(1):18-31, 1980.
[17] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. In MIT Press and McGraw-Hill, 2001.
dc.identifier.urihttp://tdr.lib.ntu.edu.tw/jspui/handle/123456789/4717-
dc.description.abstract編輯距離(edit distance) 是一個廣泛地被用於測量字串之間相似程度
的度量,而字串相似度搜尋(string similarity search) 則要找出在特定的字串集合中和給予的查詢字串(query string) 相似的字串。以編輯距離為測量依據的字串相似度搜尋,被大量地應用在如數據清理(database
cleaning) 、錯誤檢查更正(Error detection and correction) 與資料擷取(data retrieval) 等領域。目前此類問題的解決方法大多先將可排除的字串過濾掉後,再對剩下的字串進行檢驗。然而,這些方法在排除字串的過程中,幾乎全部都採用相同的過濾規則,使得效率隨著字串集合的改變而下降。為了克服這個問題,我們提出一個整合不同過濾方法
的資料結構(data structure) 讓字串的過濾維持穩定的效率。我們並提出對應的演算法解決兩個主要的字串相似度搜尋問題:範圍查詢(range
query) 與前k 個查詢(top-k query) 。實驗結果顯示當門檻值小於一定的程度時,我們的方法在範圍查詢上具有優異的性能表現。
zh_TW
dc.description.abstractEdit distance, a measure determining the similarity between two strings, is a criterion that has been used widely. String similarity search finds strings in a dataset that are similar to a given query string. Edit-distance based string similarity search is exploited in many fields, e.g., database cleaning, error detection and correction and data retrieval. Most approaches toward string similarity search resort to filtering out as many strings in datasets as possible and verifying the remaining strings. However, these approaches use only one
filtering method throughout the whole procedure, which makes the power of the method fluctuates during the whole procedure. To overcome this problem, we propose a data structure integrating different filtering methods and
adopting a more efficient one on each phase. We also give corresponding algorithms for two important queries of string similarity search, range query and top-k query. Experimental results show that our approach is competitive for range query when thresholds are small enough.
en
dc.description.provenanceMade available in DSpace on 2021-05-14T17:45:47Z (GMT). No. of bitstreams: 1
ntu-104-R02922064-1.pdf: 503284 bytes, checksum: f5df39ebcd322ce63a3a47f39bfcf41d (MD5)
Previous issue date: 2015
en
dc.description.tableofcontents1 Introduction 1
1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Verifying Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.3 Pruning Technique . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Our Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Preliminaries 6
2.1 Preliminary Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Edit Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 Computations of Edit Distance . . . . . . . . . . . . . . . . . . . 7
2.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
iv
2.2.1 Range Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.2 Top-k Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 Complex-Tree Based Solution 9
3.1 Filtering Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1.1 String Length . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1.2 Letter Count . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1.3 Reference String . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 Index Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2.1 String Length . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2.2 Letter Count . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2.3 Reference String . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.3 Query Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.3.1 Range Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.3.2 Top-k Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4 Experimental Evaluation 18
4.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.2 Space Consumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3 Comparison with Other Approaches . . . . . . . . . . . . . . . . . . . . 19
4.3.1 Range Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3.2 Top-k Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
5 Conclusions and Future Work 22
5.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
dc.language.isoen
dc.subject編輯距離zh_TW
dc.subject字串相似度搜尋zh_TW
dc.subject範圍查詢zh_TW
dc.subject前k個查詢zh_TW
dc.subjecttop-k queryen
dc.subjectstring similarity searchen
dc.subjectrange queryen
dc.subjectedit distanceen
dc.title可有效回覆字串相似度搜尋之演算法zh_TW
dc.titleA Query-Efficient Algorithm for String-Similarity Searchen
dc.typeThesis
dc.date.schoolyear103-2
dc.description.degree碩士
dc.contributor.oralexamcommittee王弘倫(Hung-Lung Wang),黃耀廷(Yao-Ting Huang)
dc.subject.keyword編輯距離,字串相似度搜尋,範圍查詢,前k個查詢,zh_TW
dc.subject.keywordedit distance,string similarity search,range query,top-k query,en
dc.relation.page24
dc.rights.note同意授權(全球公開)
dc.date.accepted2015-07-06
dc.contributor.author-college電機資訊學院zh_TW
dc.contributor.author-dept資訊工程學研究所zh_TW
顯示於系所單位:資訊工程學系

文件中的檔案:
檔案 大小格式 
ntu-104-1.pdf491.49 kBAdobe PDF檢視/開啟
顯示文件簡單紀錄


系統中的文件,除了特別指名其著作權條款之外,均受到著作權保護,並且保留所有的權利。

社群連結
聯絡資訊
10617臺北市大安區羅斯福路四段1號
No.1 Sec.4, Roosevelt Rd., Taipei, Taiwan, R.O.C. 106
Tel: (02)33662353
Email: ntuetds@ntu.edu.tw
意見箱
相關連結
館藏目錄
國內圖書館整合查詢 MetaCat
臺大學術典藏 NTU Scholars
臺大圖書館數位典藏館
本站聲明
© NTU Library All Rights Reserved