Development of a Fast Algorithm for Pathogen Identification through RNA-seq

Chin-Ting Wu; 吳勁廷

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/52678

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	莊曜宇
dc.contributor.author	Chin-Ting Wu	en
dc.contributor.author	吳勁廷	zh_TW
dc.date.accessioned	2021-06-15T16:22:57Z	-
dc.date.available	2018-08-17
dc.date.copyright	2015-08-17
dc.date.issued	2015
dc.date.submitted	2015-08-15
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/52678	-
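dc.description.abstract	Early diagnosis of infectious diseases caused by pathogens such as viruses, bacteria, or fungi is one of the major challenges in current clinical research. Beyond traditional methods for identifying bacterial strains and viruses, the development of next-generation sequencing has made sequencing-based searches for candidate pathogens an effective means of identification. Research groups around the world have developed several methods for this task. However, these algorithms require large amounts of computing time and resources, which makes them difficult to apply in practice. We therefore developed a novel, high-performance algorithm for pathogen identification. The algorithm takes RNA sequencing data and identifies pathogen sequence fragments in four steps. First, the reads are aligned to the human reference genome, and the non-human reads are retained for further analysis. Second, the non-human reads are assembled de novo: overlapping reads are joined and extended into longer sequence fragments (contigs). Third, a statistical model is used to assess the accuracy of the assembly. Finally, the contigs that pass the statistical test are searched with BLAST to determine their source species. We evaluated the algorithm on both simulated data and experimental RNA-seq data. On both, the algorithm showed high accuracy and sensitivity. A comparison with three other algorithms showed that ours has higher computational efficiency. We applied the method to cervical cancer, lung adenocarcinoma, and colorectal cancer datasets to identify pathogens that may be associated with these cancers, and the analysis found candidate pathogens for each cancer type. In summary, the algorithm detects candidate pathogens from RNA-seq data accurately and efficiently. Its computational efficiency is good and its running time is short, and we believe it will aid research on pathogen detection.	zh_TW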
dc.description.abstract	Early diagnosis of infectious diseases caused by viruses, bacteria, or fungi is an important issue in clinical research. In addition to traditional, labor-intensive in vitro experiments for strain or virus identification, in silico methods for pathogen identification have been developed following advances in next-generation sequencing. Research groups around the world have developed several such methods. However, these in silico methods are still time-consuming and compute-intensive, which creates practical obstacles. To address these issues, we developed an accurate and efficient algorithm that identifies pathogens from RNA-seq data in four steps. First, the sequencing reads are aligned to the human reference genome, and the reads that cannot be aligned are retained for subsequent analysis. Second, the retained reads are assembled into contigs by joining their overlapping regions. Third, a statistical model is applied to the putative transcript contigs to remove spurious contigs that result from random assembly. Finally, BLAST is applied to the contigs that pass the statistical test to identify the species and strains of the pathogens. To evaluate performance, we used both simulated and real data sets that contain samples with pathogen infections. The results on both simulated and real data show that our algorithm has high sensitivity and accuracy. A comparison with three other methods showed that our algorithm is more efficient. We also applied our method to cervical cancer, lung adenocarcinoma, and colorectal cancer data sets to identify pathogens possibly associated with these three cancers. In summary, our method is accurate and effective in detecting pathogens from RNA-seq data of patient samples. Moreover, its efficiency and short running time enable the use of large data sets in pathogen studies.	en
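The four steps in the abstract can be illustrated with a toy sketch. This is not the thesis implementation: the record does not give the aligner, the assembler, or the statistical model, so every function name, parameter, and sequence below is a hypothetical stand-in. Host filtering uses shared k-mers instead of read alignment, assembly is a simple greedy overlap merge, and the statistical test is replaced by a plain read-support threshold. The final BLAST step is only indicated in a comment.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def filter_host_reads(reads, host_reference, k=8):
    """Step 1 (stand-in for alignment): keep reads sharing no k-mer with the host."""
    host_kmers = kmers(host_reference, k)
    return [r for r in reads if kmers(r, k).isdisjoint(host_kmers)]


def overlap(a, b, min_len):
    """Longest suffix of a that equals a prefix of b, if at least min_len, else 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_overlap=6):
    """Step 2 (toy assembler): repeatedly merge the pair with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_overlap)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs  # nothing left to merge
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)


def supported_contigs(contigs, reads, min_reads=3):
    """Step 3 (assumed threshold, not the thesis model): keep contigs that
    contain at least min_reads of the retained reads."""
    kept = []
    for contig in contigs:
        support = sum(1 for r in reads if r in contig)
        if support >= min_reads:
            kept.append((contig, support))
    return kept


def candidate_contigs(reads, host_reference):
    """Run steps 1 to 3. Step 4 would submit each kept contig to BLAST."""
    non_host = filter_host_reads(reads, host_reference)
    contigs = greedy_assemble(non_host)
    return supported_contigs(contigs, non_host)


def windows(seq, size, step):
    """Cut error-free toy reads of a fixed size from seq."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, step)]


# Made-up sequences for illustration only.
HOST = "ATGGCCAAGCTGACCTTCGAGGATCTGAAACGTCTGGCAT"
PATHOGEN = "GTTCACGTAGGCTTAACCGATACGGTTCAG"
reads = windows(HOST, 12, 7) + windows(PATHOGEN, 12, 3) + ["CCCCTTTTGGGG"]

for contig, support in candidate_contigs(reads, HOST):
    print(support, contig)
```

In this toy run the host-derived reads are discarded in step 1, the overlapping pathogen reads merge back into one contig in step 2, and the single unrelated read is dropped in step 3 for lack of support. A real pipeline would use a spliced or short-read aligner, a transcriptome assembler, and a calibrated statistical test in place of these stand-ins.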
dc.description.provenance	Made available in DSpace on 2021-06-15T16:22:57Z (GMT). No. of bitstreams: 1 ntu-104-R02945032-1.pdf: 9061321 bytes, checksum: 4a6b85a677b160b5db9593ecf28d3835 (MD5) Previous issue date: 2015	en
dc.description.tableofcontents	CONTENTS
Acknowledgements I
Chinese Abstract III
ABSTRACT V
Chapter 1 Introduction 1
1.1 Background 1
1.1.1 Infectious disease and the Pathogen identification 1
1.1.2 Utilization of sequencing technology for pathogen identification 2
1.2 Background survey 3
1.2.1 Pathseq 4
1.2.2 RINS 5
1.2.3 CaPSID 6
1.2.4 Virana 7
1.3 Motivation and specific aims 7
Chapter 2 Materials and Methods 9
2.1 Reads alignment 11
2.2 Sequence assembly 11
2.3 Fake contigs removal 12
2.4 Species identification 14
2.5 Calculation of relative error and accuracy 15
2.6 Data preprocess and simulation 17
2.6.1 Data preprocess 17
2.6.2 Human RNA-seq data simulation 17
2.6.3 Pathogen RNA-seq data simulation 18
2.7 Datasets 19
2.7.1 Lung tissue infected by Varicella-Zoster Virus 20
2.7.2 Cervical cancer datasets 20
2.7.3 Lung adenocarcinoma datasets 21
2.7.4 Metagenomic study of colorectal cancer 21
Chapter 3 Results 23
3.1 Distribution modeling 23
3.2 Performance evaluation 25
3.3 Method comparison 27
3.4 RNA-seq of cell line with virus infection 29
3.5 Application of cervical cancer 30
3.5 Application of lung adenocarcinoma 34
3.6 Application of colorectal cancer 36
Chapter 4 Discussion 39
4.1 RNA extraction protocol 39
4.2 Model reliability 39
4.3 Method comparison 41
4.4 Limitation of Patho-finder 41
4.5 Results Interpretation 43
4.6 Metagenomic study 44
Chapter 5 Conclusion 46
Chapter 6 References 47
dc.language.iso	en
dc.subject	pathogen	zh_TW
dc.subject	RNA sequencing	zh_TW
dc.subject	next-generation sequencing	zh_TW
dc.subject	next generation sequencing	en
dc.subject	RNA-seq	en
dc.subject	pathogen	en
dc.title	Development of an Algorithm for Identifying Pathogens Using RNA Sequencing Data	zh_TW
dc.title	Development of a Fast Algorithm for Pathogen Identification through RNA-seq	en
dc.type	Thesis
dc.date.schoolyear	103-2
dc.description.degree	Master's
dc.contributor.oralexamcommittee	蔡孟勳,賴亮全,盧子彬,蕭朱杏,蕭自宏
dc.subject.keyword	next-generation sequencing, RNA sequencing, pathogen	zh_TW
dc.subject.keyword	next generation sequencing, RNA-seq, pathogen	en
dc.relation.page	54
dc.rights.note	Paid license
dc.date.accepted	2015-08-15
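dc.contributor.author-college	College of Electrical Engineering and Computer Science	zh_TW
dc.contributor.author-dept	Graduate Institute of Biomedical Electronics and Bioinformatics	zh_TW
Appears in Collections:	Graduate Institute of Biomedical Electronics and Bioinformatics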

Files in This Item:

File	Size	Format
ntu-104-1.pdf (not authorized for public access)	8.85 MB	Adobe PDF

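Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.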