針對高通量定序資料之可延展序列組合演算法

Chien-Chih Chen; 陳建智

請用此 Handle URI 來引用此文件： http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/6366

完整後設資料紀錄

DC 欄位	值	語言
dc.contributor.advisor	賴飛羆(Feipei Lai)
dc.contributor.author	Chien-Chih Chen	en
dc.contributor.author	陳建智	zh_TW
dc.date.accessioned	2021-05-16T16:27:15Z	-
dc.date.available	2013-02-16
dc.date.available	2021-05-16T16:27:15Z	-
dc.date.copyright	2013-02-16
dc.date.issued	2013
dc.date.submitted	2013-01-30
dc.identifier.citation	1. Zhang, Z., Schwartz, S., Wagner, L. & Miller, W. A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7, 203-214 (2000). 2. Idury, R. M. & Waterman, M. S. A new algorithm for DNA sequence assembly. J. Comput. Biol. 2, 291-306 (1995). 3. Myers, E. W. et al. A whole-genome assembly of Drosophila. Science 287, 2196-2204 (2000). 4. Simpson, J. T. et al. ABySS: a parallel assembler for short read sequence data. Genome Res. 19, 1117-1123 (2009). 5. Collins, L. J., Biggs, P. J., Voelckel, C. & Joly, S. An approach to transcriptome analysis of non-model organisms using short-read sequences. Genome Inform 21, 3-14 (2008). 6. Batzoglou, S. et al. ARACHNE: a whole-genome shotgun assembler. Genome Res. 12, 177-189 (2002). 7. de la Bastide, M. & McCombie, W. R. Assembling genomic DNA sequences with PHRAP. Curr Protoc Bioinformatics Chapter 11, Unit11.4 (2007). 8. Miller, J. R., Koren, S. & Sutton, G. Assembly algorithms for next-generation sequencing data. Genomics 95, 315-327 (2010). 9. Schatz, M. C., Delcher, A. L. & Salzberg, S. L. Assembly of large genomes using second-generation sequencing. Genome Res. 20, 1165-1173 (2010). 10. Koren, S., Treangen, T. J. & Pop, M. Bambus 2: scaffolding metagenomes. Bioinformatics 27, 2964-2971 (2011). 11. Pop, M. & Salzberg, S. L. Bioinformatics challenges of new sequencing technology. Trends Genet. 24, 142-149 (2008). 12. Li, Z. et al. Comparison of the two major classes of assembly algorithms: overlap-layout-consensus and de-bruijn-graph. Brief Funct. Genomics 11, 25-37 (2012). 13. Xia, Q. et al. Complete resequencing of 40 genomes reveals domestication events and genes in silkworm (Bombyx). Science 326, 433-436 (2009). 14. Medvedev, P., Georgiou, K., Myers, G. & Brudno, M. Computability of models for sequence assembly. Algorithms in Bioinformatics 289-301 (2007). 15. M. C. Schatz, D. Sommer, D. R. Kelley, and M. Pop. Contrail: Assembly of large genomes using cloud computing. at <http://contrail-bio.sourceforge.net.> 16. Lin, J. & Dyer, C. Data-intensive text processing with MapReduce. Synthesis Lectures on Human Language Technologies 3, 1-177 (2010). 17. Robertson, G. et al. De novo assembly and analysis of RNA-seq data. Nat. Methods 7, 909-912 (2010). 18. Li, R. et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 20, 265-272 (2010). 19. Hernandez, D., François, P., Farinelli, L., Osterås, M. & Schrenzel, J. De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Res. 18, 802-809 (2008). 20. Chaisson, M. J., Brinza, D. & Pevzner, P. A. De novo fragment assembly with short mate-paired reads: Does the read length matter? Genome Res. 19, 336-346 (2009). 21. Birol, I. et al. De novo transcriptome assembly with ABySS. Bioinformatics 25, 2872-2877 (2009). 22. Sanger, F., Nicklen, S. & Coulson, A. R. DNA sequencing with chain-terminating inhibitors. 1977. Biotechnology 24, 104-108 (1992). 23. Simpson, J. T. & Durbin, R. Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 22, 549-556 (2012). 24. Chen, C. C., Lin, W. D., Chang, Y. J., Chen, C. L. & Ho, J. M. Enhancing de novo transcriptome assembly by incorporating multiple overlap sizes. ISRN Bioinformatics 2012, (2012). 25. Glenn, T. C. Field guide to next-generation DNA sequencers. Mol. Ecol. Resour. 11, 759-769 (2011). 26. Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 29, 644-652 (2011). 27. Salzberg, S. L. et al. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res. 22, 557-567 (2012). 28. Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389-3402 (1997). 29. Pettersson, E., Lundeberg, J. & Ahmadian, A. Generations of sequencing technologies. Genomics 93, 105-111 (2009). 30. Pop, M. Genome assembly reborn: Recent computational challenges. Brief Bioinform 10, 354–366 (2009). 31. Paul Medvedev Genome Graphs. (2010). 32. Margulies, M. et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376-380 (2005). 33. Lander, E. S. & Waterman, M. S. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 2, 231-239 (1988). 34. White, T. Hadoop: The definitive guide. (Yahoo Press: 2010). 35. Gnerre, S. et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc. Natl. Acad. Sci. USA 108, 1513-1518 (2011). 36. Ilie, L., Fazayeli, F. & Ilie, S. HiTEC: accurate error correction in high-throughput sequencing data. Bioinformatics 27, 295-302 (2011). 37. Walter, C. Kryder’s law. Sci. Am. 293, 32–33 (2005). 38. Dean, J. & Ghemawat, S. MapReduce: Simplified data processing on large clusters. Communications of the ACM 51, 107-113 (2008). 39. Pruitt, K. D., Tatusova, T. & Maglott, D. R. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 35, D61-65 (2007). 40. Shendure, J. & Ji, H. Next-generation DNA sequencing. Nat. Biotechnol. 26, 1135-1145 (2008). 41. Mardis, E. R. Next-generation DNA sequencing methods. Annu. Rev. Genomics Hum. Genet. 9, 387-402 (2008). 42. Gao, S., Sung, W.-K. & Nagarajan, N. Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences. J. Comput. Biol. 18, 1681-1691 (2011). 43. Jackson, B. G., Schnable, P. S. & Aluru, S. Parallel short sequence assembly of transcriptomes. BMC Bioinformatics 10 Suppl 1, S14 (2009). 44. Nagarajan, N. & Pop, M. Parametric complexity of sequence assembly: theory and applications to next generation sequencing. J. Comput. Biol. 16, 897-908 (2009). 45. Morin, R. et al. Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing. BioTechniques 45, 81-94 (2008). 46. Kelley, D. R., Schatz, M. C. & Salzberg, S. L. Quake: quality-aware detection and correction of sequencing errors. Genome Biol. 11, R116 (2010). 47. Vishkin, U. Randomized speed-ups in parallel computation. Proceedings of the Sixteenth Annual ACM Symposium on Theory of Computing 230-239 (1984). 48. Boetzer, M., Henkel, C. V., Jansen, H. J., Butler, D. & Pirovano, W. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics 27, 578-579 (2011). 49. Metzker, M. L. Sequencing technologies—the next generation. Nat. Rev. Genet. 11, 31-46 (2009). 50. Stein, L. D. The case for cloud computing in genome informatics. Genome Biol. 11, 207 (2010). 51. Wang, J. et al. The diploid genome sequence of an Asian individual. Nature 456, 60-65 (2008). 52. Myers, E. W. The fragment assembly string graph. Bioinformatics 21 Suppl 2, ii79-85 (2005). 53. Mardis, E. R. The impact of next-generation sequencing technology on genetics. Trends Genet. 24, 133-141 (2008). 54. Mullikin, J. C. & Ning, Z. The phusion assembler. Genome Res. 13, 81-90 (2003). 55. Zerbino, D. R. & Birney, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821-829 (2008). 56. Furusawa, C. & Kaneko, K. Zipf’s law in gene expression. Phys. Rev. Lett. 90, 088102 (2003). 57. Yang X., Chockalingam S. P., Aluru S. A survey of error-correction methods for next-generation sequencing. Brief Bioinform. (2012).
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/6366	-
dc.description.abstract	DNA定序技術是研究分子生物的最重要的步驟之一，用來判定DNA片段所代表的序列資訊。隨著次世代定序技術的發展，基因體學跟轉錄體學的研究已進入到下一個里程。然而目前的定序技術無法將基因體或轉錄體從頭到尾一次定序完成，必需將樣本切成很多短的片段，再透過序列組合演算法將基因體或轉錄體還原。因此序列組合組合演算法一直是生物資訊中的核心問題。序列組合的挑戰在於: (1)序列中的錯誤、(2)序列中有重複片段、(3)序列中每各位置被定序到的次數不平均、(4)針對大量資料時計算複雜度的問題。其中，隨著次世代定序技術所帶來快速增加的資料量，使得序列組合軟體能具備可延展運算能力來處理大量序列資料為最迫切的需求。這類的需求剛好符合雲端運算的運作模式。在雲端雲算中，使用者可以依需求透過網路向供應商去設置幾千台的電腦資源來做大量資料的平行處理。針對大量的資料，這種可延展的雲端應用通常是發展在MapReduce的架構下。在本篇論文中，我們提出一個以MapReduce為架構的高通量序列組合演算法，稱作CloudBrush。CloudBrush是以雙向串圖(bidirected string graph)為基礎，主要分成兩個階段: 建構圖形(graph construction)和簡化圖形(graph simplification) 。在建構圖形的步驟，我們以一個讀序(read)當做圖上的一個節點(node)，節點跟節點間的圖邊(edge)定義為讀序跟讀序間的重複(overlap)。我們提出一個前綴延伸(prefix-and-extend)的演算法來判定兩兩讀序間的重複。在圖形簡化的步驟，我們採取常見的串圖簡化方法包括:傳遞簡約(transitive reduction)、路徑壓縮(path compression)、移除分枝結構(tips removal)和移除氣泡結構(bubble removal) 。此外我們提出一個新的串圖簡化方式:圖邊調整(edge adjustment)來移除串圖內因序列錯誤所造成的複雜結構。此簡化方式主要是利用圖形中鄰居間的序列資訊來做判斷，移除可能是錯誤造成的圖邊。在效能評比部分，我們利用Genome Assembly Gold-Standard Evaluation為基準來衡量CloudBrush組出的序列品質並跟其他序列組合工具做比較。實驗結果顯示我們的演算法可組出中等長度的N50，且不易發生序列誤接(mis-assembly)的情形。此外針對轉錄體的序列資料我們另外提出一個T-CloudBrush的流程。T-CloudBrush 主要是利用多重參數(multiple-k)來克服轉錄體序列資料深度(coverage)差異的問題。多重參數的概念主要是來自於這項觀察:在考慮序列組出的品質情況下，序列資料的深度跟序列組合演算法使用的重複區域(overlap size)參數呈現正相關的關係。實驗結果顯示T-CloudBrush可以增進轉錄體序列組合的品質。綜合上述，本篇論文在可延伸的運算架構底下，探討序列組合算法所面臨的挑戰，並針對大資料的處理，序列錯誤及序列資料深度差異的問題提出可能的解法。	zh_TW
dc.description.abstract	DNA sequencing is one of the most important procedures in molecular biology research for determining the sequences of bases in specific DNA segments. With the development of next-generation sequencing technologies, studies on genomics and transcriptomics are moving into a new era. However, the current DNA sequencing technologies cannot be used to read entire genomes or a transcript in 1 step; instead, small sequences of 20–1000 bases are read. Thus, sequence assembly continues to be one of the central problems in bioinformatics. The challenges facing sequence assembly include the following: (1) sequencing error, (2) repeat sequences, (3) nonuniform coverage, and (4) computational complexity of processing large volumes of data. From these challenges, considering the rapid growth of data throughput delivered by next-generation sequencing technologies, there is a pressing need for sequence assembly software that can efficiently handle massive sequencing data by using scalable and on-demand computing resources. These requirements fit in with the model of cloud computing. In cloud computing, computing resources can be allocated on demand over the Internet from several thousand computers offered by vendors for analyzing data in parallel. Such cloud-computing applications are constantly being developed for large datasets and are run under the framework of MapReduce. In this dissertation, we have proposed CloudBrush, a parallel pipeline that runs on the MapReduce framework for de novo assembly of high-throughput sequencing data. CloudBrush is based on bidirected string graphs and its analysis consists of 2 main stages: graph construction and graph simplification. During graph construction, a node is defined for each nonredundant sequence read, and the edge is defined for overlap between reads. We have developed a prefix-and-extend algorithm for identifying overlaps between a pair of reads. The graph is further simplified by using conventional operations such as transitive reduction, path compression, tip removal, and bubble removal. We have also introduced a new operation, edge adjustment, for removing error topology structures in string graphs. This operation uses the sequence information of all graph neighbors for each read and eliminates the edges connecting to reads containing rare bases. CloudBrush was evaluated against Genome Assembly Gold-Standard Evaluation (GAGE) benchmarks to compare its assembly quality with that of other assemblers. The results showed that our assemblies have a moderate N50, a low misassembly rate of misjoins, and indels. In addition, we have introduced 2 measures, precision and recall, to address the issues of faithfully aligned contigs in order to target genomes. Compared with the assembly tools used in the GAGE benchmarks, CloudBrush was found to produce contigs with high precision and recall. We have also introduced a T-CloudBrush pipeline for transcriptome data. T-CloudBrush uses the multiple-k concept to overcome the problem of nonuniform coverage of transcriptome data. This concept is based on observation of the correlation between sequencing data coverage and the overlap size used during assembly. The experiment results showed that T-CloudBrush improves the accuracy of de novo transcriptome assembly. In summary, this dissertation explores the challenges facing sequence assembly under the scalable computing framework and provides possible solutions for the problems of sequencing errors, nonuniform coverage, and processing of large volumes of data.	en
dc.description.provenance	Made available in DSpace on 2021-05-16T16:27:15Z (GMT). No. of bitstreams: 1 ntu-102-D96922007-1.pdf: 1563112 bytes, checksum: c21735e392d17daeb3414a9a2de65730 (MD5) Previous issue date: 2013	en
dc.description.tableofcontents	1. Introduction 1 1.1 Shotgun Sequencing and Assembly 2 1.2 Review of Sequencing Technology 4 1.3 Review of Assembly Algorithms 7 1.3.1 Evolution of Assembly Algorithms 7 1.3.2 Comparison of String Graphs and de Bruijn Graphs 10 1.3.3 Common Stages of Genome Assembly 12 1.4 Overview of this Dissertation 14 2. Genome Assembly with MapReduce 16 2.1 Introduction 16 2.2 Methods 17 2.2.1 Distributed Graph Processing in MapReduce 17 2.2.2 CloudBrush: String Graph Assembly Using MapReduce 19 2.3 Results and Discussion 30 2.3.1 Comparison with Other Tools Using GAGE Benchmarks 30 2.3.2 Runtime Analysis 32 3. Error Correction 34 3.1 Introduction 34 3.2 Methods 35 3.2.1 Correct Sequencing Error by String Graph 35 3.2.2 Correct Sequencing Error by Read Stack 43 3.3 Results and Discussion 46 3.3.1 Analysis of Edge Adjustment 46 3.3.2 Evaluation of Assembly Accuracy 51 4. De Novo Assembly of Transcriptome Data 57 4.1 Introduction 57 4.2 Results 58 4.2.1 On the Relationship Between Coverage and Optimum k 58 4.2.2 The Effect of Sequencing Error Rate 62 4.2.3 T-CloudBrush: Multiple Overlap Size of CloudBrush 63 4.2.4 Comparing T-CloudBrush with Existing Tools 65 4.3 Discussion 67 5. Conclusion and Future Research 69 Appendix A: List of Publications 71 Appendix B: CloudBrush Manual 72 Appendix C: CloudBrush Web Demo User Guide 75 Appendix D: Source Code 81 Bibliography 82
dc.language.iso	en
dc.title	針對高通量定序資料之可延展序列組合演算法	zh_TW
dc.title	Scalable Assembly of High-Throughput De Novo Sequencing Data	en
dc.type	Thesis
dc.date.schoolyear	101-1
dc.description.degree	博士
dc.contributor.coadvisor	陳俊良(Chuen-Liang Chen),何建明(Jan-Ming Ho)
dc.contributor.oralexamcommittee	趙坤茂,高成炎,劉邦鋒,洪士灝,林仲彥
dc.subject.keyword	序列組合,平行運算,基因體學,轉錄體學,生物資訊,	zh_TW
dc.subject.keyword	sequence assembly,parallel computing,genomics,transcriptomics,bioinformatics,	en
dc.relation.page	88
dc.rights.note	同意授權(全球公開)
dc.date.accepted	2013-01-30
dc.contributor.author-college	電機資訊學院	zh_TW
dc.contributor.author-dept	資訊工程學研究所	zh_TW
顯示於系所單位：	資訊工程學系

文件中的檔案：

檔案	大小	格式
ntu-102-1.pdf	1.53 MB	Adobe PDF	檢視/開啟

顯示文件簡單紀錄

系統中的文件，除了特別指名其著作權條款之外，均受到著作權保護，並且保留所有的權利。