以核苷酸k聚體頻度分類序列

Hung-Yu Chen; 陳泓宇

請用此 Handle URI 來引用此文件： http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/718

完整後設資料紀錄

DC 欄位	值	語言
dc.contributor.advisor	趙坤茂(Kun-Mao Chao)
dc.contributor.author	Hung-Yu Chen	en
dc.contributor.author	陳泓宇	zh_TW
dc.date.accessioned	2021-05-11T05:00:05Z	-
dc.date.available	2019-08-05
dc.date.available	2021-05-11T05:00:05Z	-
dc.date.copyright	2019-08-05
dc.date.issued	2019
dc.date.submitted	2019-08-01
dc.identifier.citation	J. Alneberg, B. S. Bjarnason, I. de Bruijn, M. Schirmer, J. Quick, U. Z. Ijaz, L. Lahti, N. J. Loman, A. F. Andersson, and C. Quince. Binning metagenomic contigs by coverage and composition. Nature Methods, 11:1144–1146, 2014. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, 1990. P. Audano and F. Vannberg. KAnalyze: a fast versatile pipelined K-mer toolkit. Bioinformatics, 30(14):2070–2072, 2014. S. Batzoglou, D. B. Jaffe, K. Stanley, J. Butler, S. Gnerre, E. Mauceli, B. Berger, J. P. Mesirov, and E. S. Lander. ARACHNE: a whole-genome shotgun assembler. Genome Research, 12:177–189, 2002. S. Behera, S. Gayen, J. S. Deogun, and N. V. Vinodchandran. KmerEstimate: A Streaming Algorithm for Estimating k-mer Counts with Optimal Space Usage. In Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pages 438–447. ACM, 2018. G. Benoit , P. Peterlongo, M. Mariadassou, E. Drezen, S. Schbath, D. Lavenier, and C. Lemaitre. Multiple comparative metagenomics using multiset k-mer counting. PeerJ Computer Science, 2:e94, 2016. B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426, 1970. D. Campagna, C. Romualdi, N. Vitulo, M. D. Favero, M. Lexa, N. Cannata, and G. Valle. RAP: a new computer program for de novo identification of repeated sequences in whole genomes. Bioinformatics, 21(5):582–588, 2004. B. Chor, D. Horn, N. Goldman, Y. Levy, and T. Massingham. Genomic DNA k-mer spectra: models and modalities. Genome Biology, 10:R108, 2009. G. Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005. S. Deorowicz, A. Debudaj-Grabysz, and S. Grabowski. Disk-based k-mer counting on a PC. BMC Bioinformatics, 14:160, 2013. S. Deorowicz, M. Kokot, S. Grabowski, and A. Debudaj-Grabysz. KMC 2: fast and resource-frugal k-mer counting. Bioinformatics, 31(10):1569–1576, 2015. V. B. Dubinkina, D. S. Ischenko, V. I. Ulyantsev, A. V. Tyakht, and D. G. Alexeev. Assessment of k-mer spectrum applicability for metagenomic dissimilarity analysis. BMC Bioinformatics, 17:38, 2016. R. C. Edgar. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32(5):1792–1797, 2004. R. A. Edwards, R. Olson, T. Disz, G. D. Pusch, V. Vonstein, R. Stevens, and R. Overbeek. Real Time Metagenomics: Using k-mers to annotate metagenomes. Bioinformatics, 28(24):3316–3317, 2012. M. Erbert, S. Rechner, and M. Müller-Hannemann. Gerbil: a fast and memory-efficient k-mer counter with GPU-support. Algorithms for Molecular Biology, 12:9, 2017. Y. Fofanov, Y. Luo, C. Katili, J. Wang, Y. Belosludtsev, T. Powdrill, C. Belapurkar, V. Fofanov, T.B. Li, S. Chumakov, and B. M. Pettitt. How independent are the appearances of n-mers in different genomes? Bioinformatics, 20(15):2421–2428, 2004. J. Ge, N. Guo, J. Meng, B. Wang, P. Balaji, S. Feng, J. Zhou, and Y. Wei. K-mer Counting for Genomic Big Data. In International Conference on Big Data, pages 345–351. Springer, 2018. P. Havlak, R. Chen, K. J. Durbin, A. Egan, Y. Ren, X.-Z. Song, G. M. Weinstock, and R. A. Gibbs. The Atlas genome assembly system. Genome Research, 14(4):721– 732, 2004. J. Healy, E. E. Thomas, J. T. Schwartz, and M. Wigler. Annotating large genomes with exact word matches. Genome Research, 13(10):2306–2315, 2003. S. Heinz, J. Zobel, and H. E. Williams. Burst tries: a fast, efficient data structure for string keys. ACM Transactions on Information Systems (TOIS), 20(2):192–223, 2002. E. Karsenti, S. G. Acinas, P. Bork, C. Bowler, C. D. Vargas, J. Raes, M. Sullivan, D. Arendt, F. Benzoni, J.M. Claverie, M. Follows, G. Gorsky, P. Hingamp, D. Iudicone, O. Jaillon, S. KandelsLewis, U. Krzic, F. Not, H. Ogata, S. Pesant, E. G. Reynaud, C. Sardet, M. E. Sieracki, S. Speich, D. Velayoudon, J. Weissenbach, P. Wincker, and the Tara Oceans Consortium. A Holistic Approach to Marine Eco-Systems Biology. PLoS biology, 9(10):e1001177, 2011. D. R. Kelley, M. C. Schatz, and S. L. Salzberg. Quake: quality-aware detection and correction of sequencing errors. Genome Biology, 11(11):R116, 2010. M. Kokot, M. Długosz, and S. Deorowicz. KMC 3: counting and manipulating k-mer statistics. Bioinformatics, 33(17):2759–2761, 2017. S. Koren, B. P. Walenz, K. Berlin, J. R. Miller, N. H. Bergman, and A. M. Phillippy. Canu: scalable and accurate longread assembly via adaptive k-mer weighting and repeat separation. Genome research, 27(5):722–736, 2017. S. Kurtz, A. Narechania, J. C. Stein, and D. Ware. A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes. BMC Genomics, 9:517, 2008. A. Lefebvre, T. Lecroq, H. Dauchel, and J. Alexandre. FORRepeats: detects repeats on entire chromosomes and between genomes. Bioinformatics, 19(3):319–326, 2003. H. Li. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics, 32(14):2103–2110, 2016. H. Li. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18):3094–3100, 2018. Y. Li and XifengYan. MSPKmerCounter: A Fast and Memory Efficient Approach for K-mer Counting. arXiv:1505.06550 [qbio.GN], 2015. M. R. Liles, B. F. Manske, S. B. Bintrim, J. Handelsman, and R. M. Goodman. A Census of rRNA Genes and Linked Genomic Sequences within a Soil Metagenomic Library. PLoS biology, 69(5):2684–2691, 2003. H.-N. Lin and W.-L. Hsu. Kart: a divide-and-conquer algorithm for NGS read alignment. Bioinformatics, 33(15):2281–2287, 2017. B. Ma, J. Tromp, and M. Li. PatternHunter: faster and more sensitive homology search. Bioinformatics, 18(3):440–445, 2002. H. Ma, L.-C. Tu, A. Naseri, Y.C. Chung, D. Grunwald, S. Zhang, and T. Pederson. CRISPR-Sirius: RNA scaffolds for signal amplification in genome imaging. Nature Methods, 15(11):928–931, 2018. N. Maillet, G. Collet, T. Vannier, D. Lavenier, and P. Peterlongo. Commet: Comparing and combining multiple metagenomic datasets. In 2014 IEEE International Conference on Bioinformatics and Biomedicine. IEEE, 2014. N. Maillet, C. Lemaitre, R. Chikhi, D. Lavenier, and P. Peterlongo. Compareads: comparing huge metagenomic experiments. BMCBioinformatics, 13(Suppl19):S10, 2012. A.-A. Mamun, S. Pal, and S. Rajasekaran. KCMBT: a k-mer Counter based on Multiple Burst Trees. Bioinformatics, 32(18):2783–2790, 2016. S. C. Manekar and S. R. Sathe. A benchmark study of k-mer counting methods for highthroughput sequencing. GigaScience, 7(12):1–13, 2018. G. Marçais and C. Kingsford. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27(6):764–770, 2011. P. Melsted and J. K. Pritchard. Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinformatics, 12:333, 2011. J. R. Miller, A. L. Delcher, S. Koren, E. Venter, B. P. Walenz, A. Brownley, J. Johnson, K. Li, C. Mobarry, and G. Sutton. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics, 24(24):2818–2824, 2008. K. J. V. Nordström, M. C. Albani, G. V. James, C. Gutjahr, B. Hartwig, F. Turck, U. Paszkowski, G. Coupland, and K. Schneeberger. Mutation identification by direct comparison of whole-genome sequencing data from mutant and wildtype individuals using kmers. Nature Biotechnology, 31(4):325–330, 2013. B. D. Ondov, T. J. Treangen, P. Melsted, A. B. Mallonee, N. H. Bergman, S. Koren, and A. M. Phillippy. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology, 17:132, 2016. R. Ounit and S. Lonardi. Higher classification sensitivity of short metagenomic reads with CLARKS. Bioinformatics, 32(24):3823–3825, 2016. R. Ounit, S. Wanamaker, T. J. Close, and S. Lonardi. CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics, 16:236, 2015. P. Pandey, M. A. Bender, R. Johnson, and R.Patro. A generalpurpose counting filter: Making every bit count. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 775–787. ACM, 2017. P. Pandey, M. A. Bender, R. Johnson, and R. Patro. Squeakr: an exact and approximate k-mer counting system. Bioinformatics, 34(4):568–575, 2017. J. Pellicer, M. F. Fay, and I. J. Leitch. The largest eukaryotic genome of them all? Botanical Journal of the Linnean Society, 164(1):10–15, 2010. F. Putze, P. Sanders, and J. Singler. Cache-, hash-, and space-efficient bloom filters. Journal of Experimental Algorithmics, 14(4):1950–1957, 2009. J. Ren, N. A. Ahlgren, Y. Y. Lu, J. A. Fuhrman, and F. Sun. VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data. Microbiome, 5:69, 2017. G. Rizk, D. Lavenier, and R. Chikhi. DSK: k-mer counting with very low memory usage. Bioinformatics, 29(5):652–653, 2013. M. Roberts, W. Hayes, B. R. Hunt, S. M. Mount, and J. A. Yorke. Reducing storage requirements for biological sequence comparison. Bioinformatics, 20(18):3363– 3369, 2004. M. Roberts, B. R. Hunt, J. A. Yorke, R. A. Bolanos, and A. L. Delcher. A preprocessor for shotgun assembly of large genomes. Journal of Computational Biology, 11(4):734–752, 2004. R. S. Roy, D. Bhattacharya, and A. Schliep. Turtle: Identifying frequent k-mers with cache-efficient algorithms. Bioinformatics, 30(14):1950–1957, 2014. S. Seth, N. Välimäki, S. Kaski, and A. Honkela. Exploration and retrieval of whole-metagenome sequencing samples. Bioinformatics, 30(17):2471–2479, 2014. R. Sinha and J. Zobel. Cache-conscious sorting of large sets of strings with dynamic tries. ACM Journal of Experimental Algorithmics (JEA), 9(1.5):1–31, 2004. H. Sun, J. Ding, M. Piednoël, and K. Schneeberger. findGSE: estimating genome size variation within human and Arabidopsis using k-mer frequencies. Bioinformatics, 34(4):550–557, 2017. H. Teeling, J. Waldmann, T. Lombardot, M. Bauer, and F. O. Glöckner. TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics, 5:163, 2004. V. I. Ulyantsev, S. V. Kazakov, V. B. Dubinkina, A. V. Tyakht, and D. G. Alexeev. MetaFast: fast reference-free graph-based comparison of shotgun metagenomic data. Bioinformatics, 32(18):2760–2767, 2016. Y.-W. Wu and Y. Ye. A Novel Abundance-Based Algorithm for Binning Metagenomic Sequences Using l-tuples. Journal of Computational Biology, 18(3):523–534, 2011. S. Yooseph, G. Sutton, D. B. Rusch, A. L. Halpern, S. J. Williamson, K. Remington, J. A. Eisen, K. B. Heidelberg, G. Manning, W. Li, L. Jaroszewski, P. Cieplak, C. S. Miller, H. Li, S. T. Mashiyama, M. P. Joachimiak, C. van Belle, J.M. Chandonia, D. A. Soergel, Y. Zhai, K. Natarajan, S. Lee, B. J. Raphael, V. Bafna, R. Friedman,S. E. Brenner, A. Godzik, D. Eisenberg, J. E. Dixon, S. S. Taylor, R. L. Strausberg, M. Frazier, and J. C. Venter. The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families. PLoS biology, 5(3):e16, 2007. D. R. Zerbino and E. Birney. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome research, 18(5):821–829, 2008. Q. Zhang, J. Pell, R. Canino-Koning, A. C. Howe, and C. T. Brown. These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure. PloS one, 9(7):e101271, 2014. F. Zhou, V. Olman, and Y. Xu. Barcodes for genomes and applications. BMC Bioinformatics, 9:546, 2008.
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/handle/123456789/718	-
dc.description.abstract	序列分類在計算生物學的許多研究中是一個在研究初期就需要解決之問題，有許多方法被研發出來計算此問題，但隨著高通量定序技術的發展，需要計算的資料量也大幅增加，導致許多現有方法已無法在能取得的計算資源及可接受的時間內完成計算。以核苷酸k聚體為基礎的演算法就是其中一種，目前已有不少方法可以快速且準確的完成分類，但卻需要大量的計算空間，因此無法在一般個人電腦中完成計算。在本篇論文中，我們提出一個以核苷酸k聚體為基礎的演算法，在時間上與現有方法相當，在空間上則避免現有方法中儲存上的冗餘性而做出改善。為進一步降低所需記憶體空間，我們提出一個分割架構，此架構除了可以減少所需空間，也適合平行化以縮短計算所需時間。	zh_TW
dc.description.abstract	Sequence classification is a preliminary step in many researches of computational biology. There are a variety of methods proposed to compute this problem. However, with the development of high-throughput sequencing technologies, the datasets of sequencing data are getting much larger. As a result, many existing methods cannot accomplish this task with limited computational resource and acceptable time. The k-mer based algorithms are some of these methods. Most of them could finish the classification fast and accurately, but they need large computational space, which is not available in common personal computers. In this thesis, we propose a k-mer based algorithm. The time complexity of our algorithm is comparable to those of the existing methods, while we make an improvement in space usage by avoiding the redundancy of storing the k-mers. To further reduce the memory usage, we propose a partitioning strategy. In addition to the reduction in memory usage, the algorithm under this partitioning structure can be highly parallelized to improve performance.	en
dc.description.provenance	Made available in DSpace on 2021-05-11T05:00:05Z (GMT). No. of bitstreams: 1 ntu-108-R06945024-1.pdf: 2035380 bytes, checksum: 61f158fcc78409e0849a562c1bd4cf63 (MD5) Previous issue date: 2019	en
dc.description.tableofcontents	口試委員會審定書 i 誌謝 ii 摘要 iii Abstract iv Contents v List of Figures vii List of Tables viii List of Algorithms ix 1 Introduction 1 1.1 Motivation 3 1.2 Problem Description 3 1.3 Main Results 4 1.4 Organization of the Thesis 5 2 Related Work 6 2.1 k-mer Counting Tools 6 2.2 k-mer Based Metagenomic Comparison and Classification Methods 10 3 Methods 14 3.1 Algorithm 14 3.2 Analysis of the Algorithms 18 3.3 Partitioning Strategy 20 4 Conclusion 24 Bibliography 25
dc.language.iso	en
dc.subject	演算法	zh_TW
dc.subject	序列分類	zh_TW
dc.subject	環境基因體學	zh_TW
dc.subject	基因體學	zh_TW
dc.subject	k聚體	zh_TW
dc.subject	免序列比對	zh_TW
dc.subject	序列特徵	zh_TW
dc.subject	sequence classification	en
dc.subject	algorithm	en
dc.subject	sequence signature	en
dc.subject	alignment-free	en
dc.subject	k-mer	en
dc.subject	genomics	en
dc.subject	metagenomics	en
dc.title	以核苷酸k聚體頻度分類序列	zh_TW
dc.title	Sequence Classification Based on k-mer Frequencies	en
dc.date.schoolyear	107-2
dc.description.degree	碩士
dc.contributor.oralexamcommittee	吳彥緯,王弘倫
dc.subject.keyword	序列分類,環境基因體學,基因體學,k聚體,免序列比對,序列特徵,演算法,	zh_TW
dc.subject.keyword	sequence classification,metagenomics,genomics,k-mer,alignment-free,sequence signature,algorithm,	en
dc.relation.page	31
dc.identifier.doi	10.6342/NTU201902038
dc.rights.note	同意授權(全球公開)
dc.date.accepted	2019-08-01
dc.contributor.author-college	電機資訊學院	zh_TW
dc.contributor.author-dept	生醫電子與資訊學研究所	zh_TW
顯示於系所單位：	生醫電子與資訊學研究所

文件中的檔案：

檔案	大小	格式
ntu-108-1.pdf	1.99 MB	Adobe PDF	檢視/開啟

顯示文件簡單紀錄

系統中的文件，除了特別指名其著作權條款之外，均受到著作權保護，並且保留所有的權利。