The Implementation and Performance Improvement of Genome Assembly on Hadoop MapReduce

Chun-Yang Huang; 黃峻揚

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/61381

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	黃乾綱(Chien-Kang Huang)
dc.contributor.author	Chun-Yang Huang	en
dc.contributor.author	黃峻揚	zh_TW
dc.date.accessioned	2021-06-16T13:01:56Z	-
dc.date.available	2013-08-09
dc.date.copyright	2013-08-09
dc.date.issued	2013
dc.date.submitted	2013-08-06
dc.identifier.citation	1. Sanger, F., S. Nicklen, and A.R. Coulson, DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences, 1977. 74(12): p. 5463. 2. Adams, M.D., et al., Complementary DNA sequencing: expressed sequence tags and human genome project. Science, 1991. 252(5013): p. 1651-1656. 3. Velculescu, V.E., et al., Serial analysis of gene expression. Science, 1995. 270(5235): p. 484-487. 4. Kim, J., et al., Mapping DNA-protein interactions in large genomes by sequence tag analysis of genomic enrichment. Nature methods, 2004. 2(1): p. 47-53. 5. Kahvejian, A., J. Quackenbush, and J.F. Thompson, What would you do if you could sequence everything? Nature biotechnology, 2008. 26(10): p. 1125-1133. 6. Grossman, R.L., The case for cloud computing. IT professional, 2009. 11(2): p. 23-27. 7. Margulies, M., et al., Genome sequencing in microfabricated high-density picolitre reactors. Nature, 2005. 437(7057): p. 376-380. 8. Bentley, D.R., Whole-genome re-sequencing. Current opinion in genetics & development, 2006. 16(6): p. 545-552. 9. ; Available from: http://www.appliedbiosystems.com. 10. Pop, M., et al., Comparative genome assembly. Briefings in bioinformatics, 2004. 5(3): p. 237-248. 11. Myers, E.W., Toward simplifying and accurately formulating fragment assembly. Journal of Computational Biology, 1995. 2(2): p. 275-290. 12. Warren, R.L., et al., Assembling millions of short DNA sequences using SSAKE. Bioinformatics, 2007. 23(4): p. 500-501. 13. Jeck, W.R., et al., Extending assembly of short DNA sequences to handle error. Bioinformatics, 2007. 23(21): p. 2942-2944. 14. Dohm, J.C., et al., SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing. Genome Research, 2007. 17(11): p. 1697-1706. 15. Bryant, D., W.K. Wong, and T. Mockler, QSRA–a quality-value guided de novo short read assembler. BMC bioinformatics, 2009. 10(1): p. 69. 16. Butler, J., et al., ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Research, 2008. 18(5): p. 810-820. 17. Simpson, J.T., et al., ABySS: a parallel assembler for short read sequence data. Genome Research, 2009. 19(6): p. 1117-1123. 18. Zerbino, D.R. and E. Birney, Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research, 2008. 18(5): p. 821-829. 19. Zerbino, D.R., Genome assembly and comparison using de Bruijn graphs, D.R. Zerbino, Editor. 2009, European Molecular Biology Laboratory. 20. Li, R., et al., De novo assembly of human genomes with massively parallel short read sequencing. Genome Research, 2010. 20(2): p. 265-272. 21. Joshi, N., et al., Parallelization of Velvet,“a de novo genome sequence assembler”. 22. Amdahl, G.M. Validity of the single processor approach to achieving large scale computing capabilities. in Proceedings of the April 18-20, 1967, spring joint computer conference. 1967: ACM. 23. Li, Y., et al., Memory Efficient De Bruijn Graph Construction. arXiv preprint arXiv:1207.3532, 2012. 24. Metzker, M.L., Emerging technologies in DNA sequencing. Genome Research, 2005. 15(12): p. 1767-1776. 25. Mardis, E.R., The impact of next-generation sequencing technology on genetics. Trends in genetics, 2008. 24(3): p. 133-141. 26. Hillier, L.D.W., et al., Whole-genome sequencing and variant discovery in C. elegans. Nature methods, 2008. 5(2): p. 183-188. 27. Mortazavi, A., et al., Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature methods, 2008. 5(7): p. 621-628. 28. Leja, D. Chromosome. Available from: http://www.accessexcellence.org/RC/VL/GG/nhgri_PDFs/chromosome.pdf. 29. Ghemawat, S., H. Gobioff, and S.T. Leung. The Google file system. 2003: ACM. 30. Dean, J. and S. Ghemawat, MapReduce: Simplified data processing on large clusters. Communications of the ACM, 2008. 51(1): p. 107-113. 31. White, T., Hadoop: The Definitive Guide, Second Edition, M. Loukides, Editor. 2010, O'Reilly Media, Inc. 32. Vishwanath, K.V. and N. Nagappan. Characterizing cloud computing hardware reliability. 2010: ACM. 33. Pevzner, P.A., H. Tang, and M.S. Waterman, An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences, 2001. 98(17): p. 9748. 34. Havlak, P., et al., The Atlas genome assembly system. Genome Research, 2004. 14(4): p. 721-732. 35. Myers, E.W., et al., A whole-genome assembly of Drosophila. Science, 2000. 287(5461): p. 2196-2204. 36. Huang, X., et al., PCAP: a whole-genome assembly program. Genome Research, 2003. 13(9): p. 2164-2170. 37. Mullikin, J.C. and Z. Ning, The phusion assembler. Genome Research, 2003. 13(1): p. 81-90. 38. Fleischner, H., Eulerian graphs and related topics. Vol. 1. 1990: Elsevier. 39. Schatz, M.C., A.L. Delcher, and S.L. Salzberg, Assembly of large genomes using second-generation sequencing. Genome Research, 2010. 20(9): p. 1165-1173. 40. Schatz, M.C., Contrail-overview. 2009. 41. Salzberg, S.L., et al., GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Research, 2012. 22(3): p. 557-567. 42. Yang Li, P.K., Fangqiu Han, Shengqi Yang, Xifeng Yan, Subhash Suri, MSPexample. 2012. 43. 陳建智, Scalable sequence assembly algorithm for high-throughput sequencing data, in Degree thesis, Graduate Institute of Computer Science and Information Engineering, National Taiwan University. 2013, National Taiwan University.
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/61381	-
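dc.description.abstract	Genome assembly is the technique of reassembling sequenced genome fragments into the original sequence. It consumes substantial hardware resources, so the assembly process is difficult to run to completion for large genomes. Hadoop cloud computing has been a very popular topic in recent years: by building a distributed computing environment from multiple computers, it can effectively reduce local computation and avoid the resource waste caused by excessive data transfer between the local machine and the server. This thesis uses Contrail, an assembly tool developed by M. Schatz et al. that combines genome assembly with Hadoop cloud computing. Contrail implements genome assembly algorithms on the Hadoop cloud computing platform and uses its distributed computing to address the situation in which most assembly tools cannot complete the assembly of large genomes because of insufficient hardware resources. This thesis studies the differences in system architecture and API before and after the Hadoop revision, modifies the Contrail code so that it runs on the current Hadoop platform, and compares its assembly results with those of two commonly used assemblers, Velvet and SOAPdenovo. In addition, it improves performance in two respects: the utilization of computing resources in the Hadoop system and the graph algorithms of the assembly tool. The study finds that Contrail's assembly results resemble Velvet's on smaller genomes and SOAPdenovo's on larger genomes, and that for large genomes on which neither Velvet nor SOAPdenovo can readily complete the assembly process, Contrail successfully produces an assembly. This indicates that Contrail can handle large genomes that most assemblers struggle to process, and that its assembly results also have reference value.	zh_TW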
dc.description.abstract	Genome assembly is the process of piecing sequencing reads back together to reconstruct the original sequence. The process consumes substantial computing resources, which makes it hard to complete when assembling a large genome. Hadoop has been one of the most popular topics in recent years. By building a distributed computing environment, Hadoop reduces local computation and avoids frequent data transfer between server and client. This thesis uses Contrail, an assembly tool developed by M. Schatz that combines genome assembly with Hadoop cloud computing. By exploiting distributed computation, Contrail addresses the problem that most assemblers struggle to complete the assembly of large genomes. This thesis studies the changes in Hadoop's system architecture and API across versions, and revises the Contrail code so that it runs on the current version of the Hadoop platform. Furthermore, we improve the performance of Contrail and compare its assembly results with those of Velvet and SOAPdenovo. We find that Contrail's assembly results are similar to Velvet's on small genomes and more similar to SOAPdenovo's on larger genomes. For large genomes on which Velvet and SOAPdenovo struggle to complete the assembly process, Contrail completes it successfully.	en
dc.description.provenance	Made available in DSpace on 2021-06-16T13:01:56Z (GMT). No. of bitstreams: 1 ntu-102-R99525084-1.pdf: 4359440 bytes, checksum: 93b5798b04ab97658f8cd078f1e248a3 (MD5) Previous issue date: 2013	en
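dc.description.tableofcontents	Abstract I; Contents III; List of Figures VI; List of Tables VIII; Chapter 1 Introduction 1; Section 1 Related Work on Next-Generation Sequencing and Genome Assembly 2; Section 2 Challenges 4; Section 3 Contributions 4; Section 4 Thesis Organization 5; Chapter 2 Literature Review 6; Section 1 Next-Generation Sequencing 6; Section 2 Introduction to the Cloud Computing Platform: Hadoop 7; 1. HDFS 8; 2. Hadoop MapReduce 10; Section 3 Comparison of Two de novo Assembly Methods 11; Chapter 3 Overview of Contrail's Graph Algorithms 14; Section 1 Construction of Contrail's de Bruijn Graph 14; Section 2 Simplification of Contrail's de Bruijn Graph 16; Chapter 4 Improvements to the Contrail Algorithm 17; Section 1 Code Changes in Response to the Hadoop Revision 17; 1. Changes to JobConf, JobClient, and RunningJob 17; i. Changes to the JobConf class 18; ii. Changes to the JobClient class 19; iii. Removal of the RunningJob interface 21; 2. Changes to FileInputFormat and FileOutputFormat 22; 3. Changes to Reporter and OutputCollector 24; i. Removal of the Reporter interface 24; ii. Changes to OutputCollector 25; Section 2 Performance Improvements 27; Chapter 5 Experimental Results and Discussion 29; Section 1 Experimental Equipment 29; Section 2 Experimental Data 30; Section 3 Comparison of Experimental Results 31; 1. Staphylococcus aureus 31; 2. Rhodobacter sphaeroides (photosynthetic bacterium) 34; 3. Human Chromosome 14 37; 4. Bombus impatiens (bumblebee) 41; Section 4 Performance Improvements to Contrail 41; Section 5 Comparison with Other Work 44; Chapter 6 Conclusions and Future Directions 46; References 47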
dc.language.iso	zh-TW
dc.subject	Genome Assembly	zh_TW
dc.subject	MapReduce	zh_TW
dc.subject	Hadoop	zh_TW
dc.subject	Contrail	zh_TW
dc.subject	Genome Assembly	en
dc.subject	Hadoop	en
dc.subject	MapReduce	en
dc.subject	Contrail	en
dc.title	Implementation and Performance Improvement of Genome Assembly Algorithms on the Hadoop Cloud Computing Platform	zh_TW
dc.title	The Implementation and Performance Improvement of Genome Assembly on Hadoop MapReduce	en
dc.type	Thesis
dc.date.schoolyear	101-2
dc.description.degree	Master's
dc.contributor.oralexamcommittee	張瑞益(Ra-Yi Chang),歐陽彥正,陳倩瑜
dc.subject.keyword	Genome Assembly,Hadoop,MapReduce,Contrail	zh_TW
dc.subject.keyword	Genome Assembly,Hadoop,MapReduce,Contrail	en
dc.relation.page	49
dc.rights.note	Paid license
dc.date.accepted	2013-08-07
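dc.contributor.author-college	College of Engineering	zh_TW
dc.contributor.author-dept	Graduate Institute of Engineering Science and Ocean Engineering	zh_TW
Appears in Collections:	Department of Engineering Science and Ocean Engineering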

Files in This Item:

File	Size	Format
ntu-102-1.pdf (not authorized for public access)	4.26 MB	Adobe PDF

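Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.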