使用快速取樣與連結策略之新式長序列基因映射器

Nae-Chyun Chen; 陳乃群

請用此 Handle URI 來引用此文件： http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/76444

完整後設資料紀錄

DC 欄位	值	語言
dc.contributor.advisor	盧奕璋(Yi-Chang Lu)
dc.contributor.author	Nae-Chyun Chen	en
dc.contributor.author	陳乃群	zh_TW
dc.date.accessioned	2021-07-09T15:52:27Z	-
dc.date.available	2022-08-29
dc.date.copyright	2017-08-29
dc.date.issued	2017
dc.date.submitted	2017-07-18
dc.identifier.citation	[1] F. Sanger, S. Nicklen, and A. R. Coulson, “Dna sequencing with chain terminating inhibitors,” Proceedings of the national academy of sciences, vol. 74, no. 12, pp. 5463–5467, 1977. [2] (2016) The cost of sequencing a human genome. National Institute of Health. [Online]. Available: https://www.genome.gov/sequencingcosts/ [3] S. B. Needleman and C. D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins,” Journal of molecular biology, vol. 48, no. 3, pp. 443–453, 1970. [4] T. F. Smith and M. S. Waterman, “Identification of common molecular subsequences,” Journal of molecular biology, vol. 147, no. 1, pp. 195–197,1981. [5] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic local alignment search tool,” Journal of molecular biology, vol. 215, no. 3, pp. 403–410, 1990. [6] F. J. Sedlazeck, P. Rescheneder, and A. Von Haeseler, “Nextgenmap: fast and accurate read mapping in highly polymorphic genomes,” Bioinformatics, p. btt468, 2013. [7] B. Liu, H. Guo, M. Brudno, and Y. Wang, “debga: read alignment with de bruijn graph-based seed and extension,” Bioinformatics, p. btw371, 2016. [8] I. Sovi ́c, M. Siki ́, A. Wilm, S. N. Fenlon, S. Chen, and N. Nagarajan, “Fast and sensitive mapping of nanopore sequencing reads with graphmap,” Nature communications, vol. 7, 2016. [9] S. M. Kiełbasa, R. Wan, K. Sato, P. Horton, and M. C. Frith, “Adaptive seeds tame genomic sequence comparison,” Genome research, vol. 21, no. 3, pp. 487–493, 2011. [10] B. Langmead and S. L. Salzberg, “Fast gapped-read alignment with bowtie 2,” Nature methods, vol. 9, no. 4, pp. 357–359, 2012. [11] H. Li, “Aligning sequence reads, clone sequences and assembly contigs with bwa-mem,” arXiv preprint arXiv:1303.3997, 2013. [12] S. Goodwin, J. D. McPherson, and W. R. McCombie, “Coming of age: ten years of next-generation sequencing technologies,” Nature Reviews Genetics, vol. 17, no. 6, pp. 333–351, 2016. [13] T. J. Treangen and S. L. Salzberg, “Repetitive dna and next-generation sequencing: computational challenges and solutions,” Nature Reviews Genetics, vol. 13, no. 1, pp. 36–46, 2012. [14] D. J. Munroe and T. J. Harris, “Third-generation sequencing fireworks at marco island: advances in sequencing platforms promise to make this technology more accessible,” Nature biotechnology, vol. 28, no. 5, pp. 426–429, 2010. [15] O. N. Technologies. (2016) About minion. [Online]. Available: https://nanoporetech.com/products/minion [16] N. Loman. (2016) Nanopore r9 rapid run data release. [Online]. Available: http://lab.loman.net/2016/07/30/nanopore-r9-data-release/#disqus thread [17] O. N. Technologies. (2016) no thanks, ive already got one. [Online]. Available: https://www.youtube.com/watch?v=nizGyutn6v4 [18] M. Burrows and D. J. Wheeler, “A block-sorting lossless data compression algorithm,” 1994. [19] P. Ferragina and G. Manzini, “Opportunistic data structures with applications,” in Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on. IEEE, 2000, pp. 390–398. [20] H. Li and R. Durbin, “Fast and accurate short read alignment with burrows–wheeler transform,” Bioinformatics, vol. 25, no. 14, pp. 1754–1760, 2009. [21] D. Okanohara and K. Sadakane, “A linear-time burrows-wheeler transform using induced sorting,” in International Symposium on String Processing and Information Retrieval. Springer, 2009, pp. 90–101. [22] P. Ferragina, T. Gagie, and G. Manzini, “Lightweight data indexing and compression in external memory,” Algorithmica, vol. 63, no. 3, pp. 707–730, 2012. [23] H. Li, “Fast construction of fm-index for long sequence reads,” Bioinformatics, p. btu541, 2014. [24] H. Li, J. Ruan, and R. Durbin, “Mapping short dna sequencing reads and calling variants using mapping quality scores,” Genome research, vol. 18, no. 11, pp. 1851–1858, 2008. [25] H. Li and R. Durbin, “Fast and accurate long-read alignment with burrows–wheeler transform,” Bioinformatics, vol. 26, no. 5, pp. 589–595, 2010. [26] H. Li, “Exploring single-sample snp and indel calling with whole-genome de novo assembly,” Bioinformatics, vol. 28, no. 14, pp. 1838–1844, 2012. [27] S. Burkhardt and J. K ̈arkk ̈ainen, “One-gapped q-gram filters for levenshtein distance,” in Annual Symposium on Combinatorial Pattern Matching. Springer, 2002, pp. 225–234. [28] G. Benson, A. Levy, and B. R. Shalom, “Longest common subsequence in k length substrings,” in International Conference on Similarity Search and Applications. Springer, 2013, pp. 257–265. [29] G. Myers, “A fast bit-vector algorithm for approximate string matching based on dynamic programming,” Journal of the ACM (JACM), vol. 46, no. 3, pp. 395–415, 1999. [30] Y. Turakhia, K. J. Zheng, G. Bejerano, and W. J. Dally, “Darwin: A hardware-acceleration framework for genomic sequence alignment,” bioRxiv, p. 092171, 2017. [31] T.-Y. Yeh and Y. N. Patt, “Two-level adaptive training branch prediction,” in Proceedings of the 24th annual international symposium on Microarchitecture. ACM, 1991, pp. 51–61. [32] M. Holtgrewe, A.-K. Emde, D. Weese, and K. Reinert, “A novel and well-defined benchmarking method for second generation read mapping,” BMC bioinformatics, vol. 12, no. 1, p. 210, 2011. [33] Y. Ono, K. Asai, and M. Hamada, “Pbsim: Pacbio reads simulator toward accurate genome assembly,” Bioinformatics, vol. 29, no. 1, pp. 119–121, 2013. [34] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin et al., “The sequence alignment/map format and samtools,” Bioinformatics, vol. 25, no. 16, pp. 2078–2079, 2009. [35] H. Thorvaldsd ́ottir, J. T. Robinson, and J. P. Mesirov, “Integrative genomics viewer (igv): high-performance genomics data visualization and exploration,” Briefings in bioinformatics, vol. 14, no. 2, pp. 178–192, 2013. [36] J. T. Robinson, H. Thorvaldsd ́ottir, W. Winckler, M. Guttman, E. S. Lander, G. Getz, and J. P. Mesirov, “Integrative genomics viewer,” Nature biotechnology, vol. 29, no. 1, pp. 24–26, 2011. [37] H. Li, “A statistical framework for snp calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data,” Bioinformatics, vol. 27, no. 21, pp. 2987–2993, 2011. [38] L. B. Jorde and S. P. Wooding, “Genetic variation, classification and’race’,” Nature genetics, vol. 36, pp. S28–S33, 2004.
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/76444	-
dc.description.abstract	自二十一世紀以來，次世代定序的突破性進展使得基因定序的金錢與時間成本皆大幅降低，開展一系列全新的生物與醫學研究。於2015年，利用特殊奈米孔道的高通量定序技術問世，此定序方法能夠快速分析更長的基因序列，有潛力解決許多序列分析的困難，例如大範圍的序列增減、重複區或是單倍型分析。然而由於這類奈米孔道定序的正確率尚不如前一代定序技術，對目前的主流基因映射演算法來說，映射奈米孔道序列相對困難。在本論文中，我們針對低正確率的長基因序列，提出了一基於布洛─惠勒斯轉換之短序列取樣與連結之基因映射器（BWSL）。此一方法採用包含取樣階段、連結階段與排列階段的三級架構設計。從目標序列中取樣出大量的短序列以提高映射的成功率，在連結與序列排列階段，我們設計了有效率的一維投影與分區動態演算法以提高程式效率與映射正確率。在基於模擬奈米孔道序列的實驗中，BWSL的映射靈敏度優於目前最好的序列映射器。以人類第四對染色體的模擬序列實驗為例，BWSL的映射靈敏度高達98.35%，與文獻中表現最好的GraphMap相比，結果高出0.83%。此外，BWSL比對結果的每鹼基對平均分數與變異分析結果也優於GraphMap，顯示BWSL在奈米孔道序列相關應用上的潛力。	zh_TW
dc.description.abstract	In the previous decade, the advent of high-throughput sequencing makes it possible to acquire and analyze sequence data with low cost and high speed. Since 2015, nanopore-based single-molecule sequencing platforms can generate reads longer than thousands of base pairs at high speed. However, when compared to the accuracy of traditional sequencer, the sequencing accuracy of nanopore platforms is relatively lower, which becomes a great challenge for sequence aligners. In this thesis, we propose Burrows-Wheeler-transform-based aligner with Seeding and Linking (BWSL) to efficiently align long nanopore reads. A three-stage architecture involving seeding, linking and extending is designed for sensitive mapping and accurate alignment. A great number of short seeds is generated to ensure high mapping quality. The seeds are processed with novel algorithms to efficiently be mapped to correct positions and generate accurate alignments. The sensitivity of BWSL on synthetic MinION datasets outperforms current state-of-the-art mappers. Using human chromosome 4 dataset as an example, the sensitivity reaches as high as 98.35% in BWSL, which is 0.83% better than GraphMap. Also, BWSL has high average alignment scores and great variant calling accuracy.	en
dc.description.provenance	Made available in DSpace on 2021-07-09T15:52:27Z (GMT). No. of bitstreams: 1 ntu-106-R04943093-1.pdf: 6976498 bytes, checksum: 806b27dde517e1b3c0f048ca641def96 (MD5) Previous issue date: 2017	en
dc.description.tableofcontents	Abstract i List of Figures vii List of Tables xi 1 Introduction 1 1.1 High-Throughput Sequencing . . . . . . . . . . . . . . . . . . . 1 1.1.1 Overview of Sequencing Technologies . . . . . . . . . . . 1 1.1.2 Sequence Data Processing . . . . . . . . . . . . . . . . . 2 1.1.3 Short Read Sequencing . . . . . . . . . . . . . . . . . . . 3 1.1.4 Single-Molecule Sequencing . . . . . . . . . . . . . . . . 3 1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.3 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . 5 2 Background 7 2.1 Sequence Alignment with Dynamic Programming . . . . . . . . . 7 2.2 String Searching with Indexes . . . . . . . . . . . . . . . . . . . 9 2.2.1 Burrows-Wheeler Transform . . . . . . . . . . . . . . . . 9 2.2.2 Ferragina-Manzini Index . . . . . . . . . . . . . . . . . . 10 2.3 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3.1 BLAST: Basic Local Alignment Search Tool . . . . . . . 14 2.3.2 BWA: Burrows-Wheeler Aligner . . . . . . . . . . . . . . 16 2.3.3 GraphMap . . . . . . . . . . . . . . . . . . . . . . . . . 17 3 Algorithms and Software Implementation 19 3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.2 Seeding Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.2.1 Seeds Sampling . . . . . . . . . . . . . . . . . . . . . . . 21 3.2.2 Seeds Alignment . . . . . . . . . . . . . . . . . . . . . . 22 3.2.3 Filtration of Hits . . . . . . . . . . . . . . . . . . . . . . 23 3.3 Linking Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.3.1 Putative Read Positions . . . . . . . . . . . . . . . . . . . 25 3.3.2 Queue-Based Flexible Histogram and Partial Sorting . . . 26 3.4 Extending Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.4.1 Genome Alignment Using Constant-Memory Trace-Back (GACT) . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.4.2 Dynamic Filtration Strategies . . . . . . . . . . . . . . . 31 3.5 Software Implementation . . . . . . . . . . . . . . . . . . . . . . 34 3.5.1 The BWA Front-End . . . . . . . . . . . . . . . . . . . . 34 3.5.2 Memory Management . . . . . . . . . . . . . . . . . . . 34 4 Experiments 39 4.1 Overview of the Results of BWSL . . . . . . . . . . . . . . . . . 39 4.2 Alignment Results with Synthetic Data . . . . . . . . . . . . . . . 41 4.2.1 Synthetic Datasets . . . . . . . . . . . . . . . . . . . . . 41 4.2.2 Evaluation of Sensitivity and Base-Level Accuracy . . . . 42 4.2.3 Visualization of Alignment Results . . . . . . . . . . . . 43 4.2.4 Variant Analysis of Alignment Results . . . . . . . . . . . 44 4.3 Parameters Selection Guidelines . . . . . . . . . . . . . . . . . . 45 4.3.1 Max Number of Hits Allowed . . . . . . . . . . . . . . . 47 4.3.2 Number of Chains . . . . . . . . . . . . . . . . . . . . . 49 4.3.3 The size of the First Tile . . . . . . . . . . . . . . . . . . 50 5 Conclusions and Future Work 53 5.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 References 55
dc.language.iso	en
dc.title	使用快速取樣與連結策略之新式長序列基因映射器	zh_TW
dc.title	A Novel Long Read Aligner Using Fast Seeding and Linking Strategies	en
dc.type	Thesis
dc.date.schoolyear	105-2
dc.description.degree	碩士
dc.contributor.oralexamcommittee	陳倩瑜(Chien-Yu Chen),陳和麟(Ho-Lin Chen),陳沛隆(Pei-Lung Chen)
dc.subject.keyword	基因序列映射,高通量定序,次世代定序,奈米孔道定序,	zh_TW
dc.subject.keyword	sequence alignment,high-throughput sequencing,next-generation sequencing,nanopore sequencing,	en
dc.relation.page	59
dc.identifier.doi	10.6342/NTU201701631
dc.rights.note	同意授權(全球公開)
dc.date.accepted	2017-07-18
dc.contributor.author-college	電機資訊學院	zh_TW
dc.contributor.author-dept	電子工程學研究所	zh_TW
dc.date.embargo-lift	2022-08-29	-
顯示於系所單位：	電子工程學研究所

文件中的檔案：

檔案	大小	格式
ntu-106-R04943093-1.pdf	6.81 MB	Adobe PDF	檢視/開啟

顯示文件簡單紀錄

系統中的文件，除了特別指名其著作權條款之外，均受到著作權保護，並且保留所有的權利。