有效率的生物序列查詢的過濾方法

Chao-Wen Lin; 林潮文

請用此 Handle URI 來引用此文件： http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/27503

完整後設資料紀錄

DC 欄位	值	語言
dc.contributor.advisor	李瑞庭
dc.contributor.author	Chao-Wen Lin	en
dc.contributor.author	林潮文	zh_TW
dc.date.accessioned	2021-06-12T18:07:29Z	-
dc.date.available	2008-01-02
dc.date.copyright	2008-01-02
dc.date.issued	2007
dc.date.submitted	2007-12-25
dc.identifier.citation	[1] Aghili, A., Agrawal, D., Abbadi, A. E., 2003. Filtration of string proximity search via transformation. In Proceedings of the Third IEEE International Symposium on Bioinformatics and Bioengineering, 149-157. [2] Agrawal, R., Faloutsos, C., Swami, A., 1993. Efficient similarity search in sequence databases. In Proceedings of the 4th International Conference of Foundations of Data Organization and Algorithms, The Federation of Ophthalmic and Dispensing Opticians, 69-84. [3] Altschul S. F., Gish, W., Miller, W., Myers, E., Lipman, D.J., 1990. Basic local alignment search tool. Journal of Molecular Biology 215, 403-410. [4] Altschul S. F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J., 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research 25, 3389-3402. [5] Aluru. S., 2006. Handbook of Computational Molecular Biology. Chapman & Hall., New York, U.S.A. [6] Amarzguioui, M., Holen, T., Babaie, E., Prydz, H., 2003. Tolerance for mutations and chemical modifications in a siRNA. Nucleic Acids Research 31, 589-595. [7] Amir,A., Lewenstein,M., Porat,E., 2000. Faster algorithms for string matching with k mismatches. In Proceedings of 11th ACM- SIAM Symposium on Discrete Algorithms (SODA), 794-803. [8] Antonin, G., 1984. R-Tree: A dynamic index structure for spatial searching. In Proceedings of ACM SIGMOD International Conference on Management of Data, 47-57. [9] Baeza-Yates,R., Perleberg,C., 1992. Fast and practical approximate string matching. Combinational Pattern Matching, 185-192. [10] Baxevanis, A., Francis Ouellette B. F., 1998. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, John Wiley Sons, Inc., New York, U.S.A. [11] Boutla, A., Delidakis, C., Livadaras, I., Tsagris, M., Tabler, M., 2001. Short 50-phosphorylated double-stranded RNAs induce RNA interference in Drosophila. Current Biology 11, 1776-1780. [12] Burkhardt, S., Crauser, A., Ferragina, P., Lenhof, H. P., Rivals, E., Vingron, M., 1999. Q-gram based database searching using a suffix array (quasar). In Proceedings of the 3rd International Conference on Computational Molecular Biology, 77-83. [13] Burkhardt, S., Kärkkäinen, J., 2003. Better filtering with gapped q-grams. Fundamenta Informaticae 56, 51-70. [14] Califano, A., Rigoutsos, I., 1993. FLASH: A fast look-up algorithm for string homology. In Proceedings of the 1st International Conference on Intelligent Systems for Molecular Biology, 56-64. [15] Chan, K.P., Fu. A.W-C., 1999. Efficient time series matching by wavelets. In Proceedings of 15th International Conference on Data Engineering, 126-133. [16] Chang, W. I., Lawler, E. L., 1994. Sublinear approximate string matching and biological applications. Algorithmica 12, 27-344. [17] Ciaccia, P., Patella, M., Zezula, P., 1997. M-tree: An efficient access method for similarity search in metric spaces. The Very Large Data Bases Journal, 426-435. [18] David A. W., Ramesh, J., 1996. Similarity Indexing with the SS-tree. In Proceedings of the Twelfth International Conference on Data Engineering, 516-523. [19] Delcher, A. L., Kasif, S., Fleischmann, R. D., Peterson, J., White, O., and Salzberg, L, 1999. Alignment of whole genomes. Nucleic Acid Research 27, 2369-2376. [20] Elbashir S. M., Harborth, J., Lendeckel, W., Yalcin, A., Weber, K., Tuschl, T., 2001. Duplexes of 21-nucleotide RNAs mediate RNA interference in cultured mammalian cells. Nature 411, 494-498. [21] Elbashir, S.M., Lendeckel, W., Tuschl, T., 2001. RNA interference is mediated by 21- and 22-nucleotide RNAs. Genes & Development 15, 188-200. [22] Faloutsos, C., Ranganathan, M., Manolopoulos, Y., 1994. Fast subsequence matching in time-series databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 419-429. [23] Giladi, E., Walker, M. G., Wang, J. Z., Volkmuth, W., 2002. SST: an algorithm for finding near-exact sequence matches in time proportional to the logarithm of the database size. Bioinformatics 18, 873-879. [24] Gusfield, D., 1997. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press., New York, U.S.A. [25] Han, J., Kamber, M., 2006. Data Mining: Concepts and Techniques, Morgan Kaufmann., New York, U.S.A. [26] Hamilton, A.J., Baulcombe, D.C., 1999. A species of small antisense RNA in post transcriptional gene silencing in plants. Science 286, 950-952. [27] Holen, T., Amarzguioui, M., Wiiger, M.T., Babaie, E., Prydz, H., 2002. Positional effects of short interfering RNAs targeting the human coagulation trigger tissue factor. Nucleic Acids Res 30, 1757-1766. [28] Jackson, A.L., Bartz, S.R., Schelter, J., Kobayashi, S.V., Burchard, J., Mao, M., Li, B.,. Cavet, G.., Linsley, P.S., 2003. Expression profiling reveals off-target gene regulation by RNAi. Nature Biotechnology 21, 635-637. [29] Jackson, A.L., Linsley, P. S., 2004. Noise amidst the silence: off-target effects of siRNAs? Trends Genet 20, 521-524. [30] Jagadish, H. V., Mendelzon, A. O., Milo. T., 1995. Similarity-based queries. In Proceedings of the 14th A CM SIGA CT-SIGMOD-SIGART Symposium on Principles of Database Systems, 36-45. [31] Katayama, N., Satoh, S., 1997. The SR-tree: An index structure for high-dimensional nearest neighbor queries. In Proceedings of ACM SIGMOD International Conference on Management of Data, 369-380. [32] Krishnan, A., Li, K. B., Issac, P., 2004. Rapid detection of conserved regions in protein sequences using wavelets. In Silico Biology 4, 133-148. [33] Kurtz., S., Schleiermacher, C., 1999. REPuter: fast computation of maximal repeats in complete genomes. Bioinformatics 15, 426-427. [34] Levenkova, N., Gu, Q., Rux, J.J., 2004. Gene specific siRNA selector. Bioinformatics 20, 430-432. [35] Lipman, D.J., Pearson, W.R., 1985. Rapid and sensitive protein similarity searches. Science 227, 1435-1441. [36] Ma, B., Tromp, J., Li, M., 2002. PatternHunter: Faster and more sensitive homology search. Bioinformatics 18, 440-445. [37] Manber, U., Myers, E. W., 1993. Suffix Arrays: A new method for on-line string searches. SIAM Journal on Computing 22, 935-948. [38] Meek, C., Patel, J., Kasetty, S., 2003. OASIS: An online and accurate technique for local-alignment searches on biological sequences. In Proceedings of the 29th VLDB Conference, 214-230. [39] Michael C., Hugh E., Williams., Adam C., 2004. Improved gapped alignment in BLAST. IEEE/ACM Transactions on Computational Biology and Bioinformatics 1, 116-129. [40] Moon, Y. S., Whang, K. Y., Han, W. S., 2002. General match: a subsequence matching method in time-series databases based on generalized windows. In Proceedings of ACM SIGMOD International Conference on Management of Data, 382-393. [41] Moon, Y. S., Whang, K. Y., Loh, W. K., 2001. Duality-based subsequence matching in time-series databases. In Proceedings of the 17th IEEE International Conference on Data Engineering, 263-272. [42] Naito,Y.,Yamada,T., Ui-Tei,K., Morishita,S., Saigo,K., 2004. siDirect: highly effective, target-specific siRNA design software for mammalian RNA interference. Nucleic Acids Research 32, 124-129. [43] National Center for Biotechnology Information (NCBI) website, http://www.ncbi.nlm.nih.gov/, 2004. [44] Navarro, N., 2001. A guided tour to approximate string matching. ACM Computing Surveys 33, 31-88. [45] Norbert, B., Hans-Peter, K., Ralf S., Bernhard, S., 1990. The R*-Tree: An efficient and robust access method for points and rectangles. In Proceedings of ACM SIGMOD International Conference on Management of Data, 322-331. [46] Ooi, B.C., Pang, H., Wang, H., Wong, L., Yu, C., 2002. Fast Filter-and-Refine Algorithms for Subsequence Selection. International Database Engineering and Applications Symposium, 243-255. [47] Overhoff, M., Alken, M., Far, R.K., Lemaitre, M., Lebleu, B., Sczakiel, G., Robbins, I., 2005. Local RNA target structure influences siRNA efficacy: a systematic global analysis. Journal of Molecular Biology 348, 871-881. [48] Pearson, W. R., 1994. Using the fasta program to search protein and DNA sequence databases. Methods in Molecular Biology 25, 365-389. [49] Pearson, W. R., 2001. Protein Sequence Comparison and Protein Evolution. ISMB2000 Tutorial, 1-36. [50] Popivanov, I., Miller, R. J., 2001. Similarity search over time series data using wavelets. In Proceedings of 17th International Conference on Data Engineering, 212-221. [51] Qiu, S., Adema, C.M., Lane, T., 2005. A computational study of off-target effects of RNA interference. Nucleic Acids Research 33, 1834-1847. [52] Rafei, D., Mendelzon, A., 1997. Similarity-based queries for time series data. In Proceedings of ACM SIGMOD International Conference on Management of Data, 13-25. [53] Rahmann,S., 2002. Rapid large-scale oligonucleotide selection for microarrays. In Proceedings of the First IEEE Computer Society Bioinformatics Conference, 54-63. [54] Santoyo, J., Vaquerizas, J.M., Dopazo, J., 2005. Highly specific and accurate selection of siRNAs for high-throughput functional assays. Bioinformatics 21, 1376-1382. [55] Saxena, S., Jonsson, Z.O., Dutta, A., 2003. Small RNAs with imperfect match to endogenous mRNA repress translation: implications for off-target activity of siRNA in mammalian cells. Journal of Biological Chemistry 278, 44312-44319. [56] Smith T. F., Waterman, M. S., 1981. Identification of common molecular subsequences. Journal of Molecular Biology 147, 195-197. [57] Snove,O., Holen,T., 2004. Many commonly used siRNAs risk off-target activity. Biochemical and Biophysical Research Communications 319, 256-263. [58] Trad, C. H. D., Fang, Q., Cosic, I., 2001. An overview of protein sequence comparisons using wavelets. In Proceedings of the IEEE Engineering in Medicine and Biology Society, 115-119. [59] Ui-Tei,K., Naito,Y., Takahashi,F., Haraguchi,T., Ohki-Hamazaki,H., Juni,A., Ueda,R.and Saigo,K., 2004. Guidelines for the selection of highly effective siRNAsequences for mammalian and chick RNA interference. Nucleic Acids Research 32, 936-948. [60] Ukkonen. E., 1992. Approximate string-matching with q-grams and maximal matches. Theoretical Computer Science 92, 191-211. [61] Waldrop, M. M., 1995. On-line archives let biologists interrogate the genome. Science 269, 1356-1358. [62] Wang L., Mu, F.Y., 2004. A web-based design center for vector-based siRNA and siRNA cassette. Bioinformatics 20, 1818-1820. [63] Weber, R., Schek, H. J., Blott. S., 1998. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of 24th International Conference on Very Large Data Bases, 194-205. [64] Xia, C., Shuai, C.L., Beng C,O., Anthony K. H., Tung., 2004. Piers: an efficient model for similarity search in DNA sequence databases, ACM SIGMOD Record 33, 39-44. [65] Yamada, T., Morishita, S., 2005. Accelerated off-target search algorithm for siRNA. Bioinformatics 21, 1316-1324. [66] Yamada,T., Morishita,S., 2004. Computing highly specific and noise-tolerant oligomers efficiently. Journal of Bioinformatic and computational Biology 2, 21-46. [67] Zhang, Z., Schwartz, S., Wagner, L., Miller, W., 2000. A greedy algorithm for aligning DNA sequences. Journal of Computational Biology 7, 203-214. [68] Zhao, W., Lane, T., 2005. siRNA off-target search: a hybrid q-gram based filtering approach. In Proceedings of the 5th International Workshop on Bioinformatics, Chicago, Illinois, 54-60. [69] Zhou, H., Wang, Y., Zeng,X., 2006. Fast and complete search of siRNA off-target sequences. In Proceedings of the International Conference on Bioinformatics & Computational Biology, 168-171.
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/27503	-
dc.description.abstract	DNA序列查詢與非siRNA目標序列查詢，都是將查詢序列與生物序列資料庫中所有序列做比對，將與查詢序列差異在一定範圍內的資料庫序列找出，兩者都是生物資訊中序列相似性查詢的重要工具。由於序列資料庫中的序列都可能多達數百萬個鹼基對，若以序列並列分析的方式來直接掃描資料庫一一做比對，則其所產生的時間複雜度將非常大。因此，部分學者提出了查詢過濾演算法，試圖在查詢時的第一個步驟，就過濾掉資料庫中大部分差異較大的序列，只留下一小部分的候選序列。然而，這類演算法皆存在若干問題，包括候選序列個數過多、偽陰性(false negative)比例過高等等。在此兩種查詢中，DNA序列比對的查詢序列比較長，且大都使用編輯距離(edit distance)當作查詢的依據；非siRNA目標序列查詢，其查詢序列較短，通常大約有19~25鹼基對，且大都是使用漢明距離 (Hamming distance)當作查詢的依據。所以在本篇論文中，我們針對此兩種查詢分別提出有效率的過濾演算法: 對DNA序列比對我們提出一個名為「以轉換法為基礎之資料庫查詢過濾的演算法」(我們稱之為TDF)，此方法先以小波轉換將資料庫每條子序列轉成一個特徵向量，並建立索引，然後利用查詢序列的特徵向量與此索引過濾掉不可能符合查詢條件的序列，找出候選子序列。對非siRNA目標序列查詢，我們提出一個名為「以共通前置序列為基礎之資料庫查詢的過濾演算法」(我們稱之為CPF)，此方法先將資料庫每條子序列排序，將有共通前置序列集中在一起，並為此共通前置序列建立索引，然後利用查詢序列的前置序列與此索引過濾掉不可能符合查詢條件的序列，找出候選子序列。最後，對於每條候選子序列，再去計算該子序列與查詢序列之間的真正距離。實驗結果顯示，此兩種查詢過濾演算法均能有效過濾掉大部分的資料序列，同時保證沒有偽陰性。實驗結果顯示TDF優於QUASAR與YM，CPF優於YM與SoS方法。	zh_TW
dc.description.abstract	In this dissertation, we propose two filtration methods for two sequence similarity queries of biological databases: DNA sequence similarity query and siRNA off-target query. Both queries are used to retrieve similar sequences from biological sequence databases. The DNA sequence similarity query is used to retrieve all sequences in a database so that the edit distance between the query and a sequence retrieved is less than a user-specified threshold, where the length of such a query is often long. It is mostly used to retrieve highly similar sequences. A small interfering RNA (siRNA), also called silencing RNA, is used to knock a gene down by an artificial mechanism. Although an siRNA is designed to silence a specific gene, many researchers have shown that the genes highly similar to the siRNA are also silenced or their expressions are depressed. An siRNA off-target query can be used to find those highly similar genes in a database, where the Hamming distance between the query and the sequence retrieved is less than a user-specified threshold. Its query length is usually short (normally 19~25 base pairs). For both queries, a filtration method can be used to screen out most unqualified data sequences from the database and leave only a small number of candidate sequences for sequence comparisons. For the DNA sequence similarity retrieval query, we propose a method called Transformation-based Database Filtration method (TDF). In the TDF method, the data sequences are first divided into several blocks, each of which is transformed into a feature vector by Haar wavelet transform and stored in an index file. Then, we search the index and extract those candidate blocks whose edit distance to the feature vector of the query sequence is less than a user-specified threshold. For the siRNA off-target query, we propose a method called Common Prefix Filtration method (CPF). In the CPF method, the data sequences are first sorted and stored in an index tree according to their common prefixes. Then, we search the index tree and filter out those sequences whose Hamming distance to the query sequence is greater than a user-specified threshold. We extract those possible candidate sequences and pass them to the verification stage. The experiment results show that our both filtration methods can filter out most unqualified sequences and guarantee no false negatives. The experimental results show that the TDF method outperforms the QUASAR and YM methods while the CPF method outperforms the YM and SoS methods.	en
dc.description.provenance	Made available in DSpace on 2021-06-12T18:07:29Z (GMT). No. of bitstreams: 1 ntu-96-D89725003-1.pdf: 854934 bytes, checksum: 92fea393664762d9af4f91738efc1df8 (MD5) Previous issue date: 2007	en
dc.description.tableofcontents	誌謝 i 論文摘要 ii DISSERTATION ABSTRACT III TABLE OF CONTENTS v LIST OF FIGURES VII LIST OF TABLES ix CHAPTER 1 INTRODUCTION 1 1.1 Background 1 1.2 Problem statement 5 1.3 Motivation 7 1.4 Our methods 7 1.4.1 Transformation-based Database Filtration method 7 1.4.2 Common Prefix Filtration method 8 1.5 Contributions 8 1.6 Organization of the dissertation 10 CHAPTER 2 RELATED WORK 11 2.1 Categories of similarity-based sequence queries 12 2.2 Sliding-window schemes 13 2.3 Classification schemes on similarity queries 13 2.3.1 Exhaustive search methods 13 2.3.2 Suffix-based methods 14 2.3.3 Seed-based or index-based methods 15 2.4 Filtration methods 15 2.4.1 Aghili-Agrawal-Abbadi 16 2.4.2 QUASAR 16 2.4.3 SST 17 2.4.4 Two-phase algorithm 18 2.4.5 SoS 20 2.4.6 YM 21 CHAPTER 3 TRANSFORMATION-BASED DATABASE FILTRATION METHOD 24 3.1 Problem definition 24 3.2 The algorithm 25 3.2.1 Transforming a sequence into feature vectors 26 3.2.2 The Manhattan and edit distances 27 3.2.3 Processing data and query sequences 30 3.2.4 Complexities of the TDF method 35 CHAPTER 4 COMMON PREFIX FILTRATION METHOD 38 4.1 Problem definition 38 4.2 The algorithm based on ungapped alignments 39 4.2.1 Preprocess the dataset 40 4.2.2 Construct the index tree 40 4.2.3 The query method 44 4.2.4 Complexities of the ungapped CPF method 48 4.3 The algorithm based on gapped alignments 51 4.3.1 Preprocess the dataset 52 4.3.2 Construct the index tree 52 4.3.3 The gapped query method 52 4.3.4 Complexities of the gapped CPF method 56 CHAPTER 5 PERFORMANCE ANALYSIS 61 5.1 The TDF method 61 5.1.1 Complexities of the comparing methods 61 5.1.2 The datasets 63 5.1.3 Determination of the window size 64 5.1.4 Experimental results 66 5.2 The ungapped CPF method 71 5.2.1 Complexities of the comparing methods 71 5.2.2 The datasets 73 5.2.3 Determination of the length of common prefixes 74 5.2.4 Experimental results 75 5.3 The gapped CPF method 79 5.3.1 Complexities of the comparing methods 80 5.3.2 Experimental results 82 CHAPTER 6 CONCLUSIONS AND FUTURE WORK 85 REFERENCES 88
dc.language.iso	en
dc.subject	序列查詢	zh_TW
dc.subject	生物序列資料庫	zh_TW
dc.subject	過濾方法	zh_TW
dc.subject	siRNA	zh_TW
dc.subject	sequence query	en
dc.subject	siRNA	en
dc.subject	filtration method	en
dc.subject	biological sequence database	en
dc.title	有效率的生物序列查詢的過濾方法	zh_TW
dc.title	Efficient Filtration Methods for Biological Sequence Databases	en
dc.type	Thesis
dc.date.schoolyear	96-1
dc.description.degree	博士
dc.contributor.oralexamcommittee	歐陽彥正,陳彥良,劉敦仁,陳炳宇
dc.subject.keyword	序列查詢,生物序列資料庫,過濾方法,siRNA,	zh_TW
dc.subject.keyword	sequence query,biological sequence database,filtration method,siRNA,	en
dc.relation.page	94
dc.rights.note	有償授權
dc.date.accepted	2007-12-25
dc.contributor.author-college	管理學院	zh_TW
dc.contributor.author-dept	資訊管理學研究所	zh_TW
顯示於系所單位：	資訊管理學系

文件中的檔案：

檔案	大小	格式
ntu-96-1.pdf 未授權公開取用	834.9 kB	Adobe PDF

顯示文件簡單紀錄

系統中的文件，除了特別指名其著作權條款之外，均受到著作權保護，並且保留所有的權利。