Prioritization of Disease-Causing Variants of Exome Data by Machine Learning

Yu-Shan Huang; 黃榆珊

Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/17687

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	賴飛羆(Fei-pei Lai)
dc.contributor.author	Yu-Shan Huang	en
dc.contributor.author	黃榆珊	zh_TW
dc.date.accessioned	2021-06-08T00:30:54Z	-
dc.date.copyright	2020-08-07
dc.date.issued	2020
dc.date.submitted	2020-08-05
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/17687	-
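dc.description.abstract	In recent years, the rapid development of next-generation sequencing has made it possible to sequence the human genome in a very short time, so the technology is now widely used in clinical diagnosis, particularly for patients with hereditary diseases. After a patient's DNA is sequenced and processed through a complex bioinformatics pipeline, the result is a set of single nucleotide variants (SNVs) in the exome. Physicians usually have to look up the literature in various genetic databases for each variant by hand and pick out, from this large volume of data, the few variants that are both related to the patient's symptoms and pathogenic. This labor-intensive and time-consuming process is a burden on physicians who interpret clinical genetic disease. To help physicians interpret NGS variant analysis results more quickly and accurately, this study attempts to build a machine learning model that predicts the pathogenic variants in a patient's exome data. We assembled whole exome sequencing and gene panel data from multiple patients with hereditary diseases as training data and integrated annotations of exome variants from several databases as features. The model is trained with machine learning methods on the relationship between variant features and the patient's clinical symptoms; it automatically ranks the variants and outputs those most likely to cause the patient's clinical signs. As test data we used whole exome sequencing data from 108 patients with rare hereditary diseases at National Taiwan University Hospital, and we used keyword extraction tools to extract symptom keywords automatically from electronic medical records. The keywords and sequencing data were fed into the model to find, among roughly 741 candidate variants per patient, the target variants most relevant to the symptoms. Of the 134 known pathogenic variants, the model ranked 92.6% within the top ten of the candidate list. Using this system can substantially shorten the time physicians and genetics researchers spend interpreting variants, raise the diagnostic rate of clinical interpretation, and improve the quality of medical care.	zh_TW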
dc.description.abstract	In recent years, thanks to the rapid development of next-generation sequencing (NGS) technology, an entire human genome can be sequenced in a short period of time. NGS is therefore being widely introduced into clinical diagnostic practice, especially for the diagnosis of hereditary disorders. Processing a patient's DNA sequence data requires multiple tools and complex bioinformatics pipelines, and it produces exome data consisting of single nucleotide variants (SNVs). To determine the true causal variants in a patient with a genetic disease, physicians often need to review numerous features of every variant manually and search the literature in different genetic databases to understand the effect of each variant. Going through this laborious and time-consuming process case by case is a burden for physicians. To help physicians interpret the genetic variation information generated by NGS in less time, we constructed a machine learning model that predicts disease-causing variants in exome data. We collected sequencing data from whole exome sequencing and gene panels as the training set, and integrated variant annotations from multiple genetic databases as features for model training. The model ranks SNVs and outputs the most likely disease-causing candidates. For model testing, we collected whole exome sequencing data from 108 patients with rare genetic disorders at National Taiwan University Hospital. The inputs to the model were the sequencing data and phenotypic information extracted automatically from the patients' electronic medical records by keyword extraction tools. In 92.6% of cases, the model placed the causative variant within the top 10 of the ranked list, out of an average of 741 candidate variants per patient after filtering. The model's ranking is consistent with manual interpretation, and it has been used to assist the clinical diagnosis of genetic diseases.	en
dc.description.provenance	Made available in DSpace on 2021-06-08T00:30:54Z (GMT). No. of bitstreams: 1 U0001-0408202015035300.pdf: 5666028 bytes, checksum: 674a479f134c6d69e0a7f04a51f86b52 (MD5) Previous issue date: 2020	en
dc.description.tableofcontents	Oral Examination Committee Certification; Acknowledgements; Chinese Abstract; Abstract; Contents; List of Figures; List of Tables; 1 Introduction; 1.1 Background; 1.2 Motivation; 1.3 Objective; 2 Related Works; 2.1 Variant Annotation; 2.2 Variant Prioritization; 3 Data Description; 3.1 Data Introduction; 3.2 Data Description; 3.2.1 Patient Demographics; 3.2.2 Variant Call Format File; 3.2.3 Phenotype Information; 4 Methodology; 4.1 Workflow; 4.2 Variant Annotation and Filtering; 4.2.1 Variant Annotation; 4.2.2 Variant Filtering; 4.3 Phenotype Extraction; 4.3.1 Phenotype Extraction; 4.3.2 Phenotype-gene Similarity Score; 4.4 Data Preprocessing; 4.5 Feature Selection; 4.6 Building Model; 4.7 Performance Evaluation; 5 Result and Discussion; 5.1 Feature Selection; 5.2 Model Performance; 5.2.1 Prediction with Different Keyword Extraction Tools; 5.2.2 Prediction with Mixed Keyword Extraction Tools; 5.2.3 Other Machine Learning Methods; 5.2.4 Feature Importance; 5.2.5 In Comparison with Other Tools; 5.3 Web Application; 6 Conclusion and Future Works; A Appendix; Bibliography
dc.language.iso	en
dc.title	應用機器學習於排序外顯子資料之致病基因變異點	zh_TW
dc.title	Prioritization of Disease-Causing Variants of Exome Data by Machine Learning	en
dc.type	Thesis
dc.date.schoolyear	108-2
dc.description.degree	Master's
dc.contributor.coadvisor	李妮鍾(Ni-Chung Lee)
dc.contributor.oralexamcommittee	胡務亮(Wuh-Liang Hwu),簡榮彥(Jung-Yien Chien),郭律成(Lu-Cheng Kuo)
dc.subject.keyword	次世代基因定序分析,遺傳變異分析,機器學習	zh_TW
dc.subject.keyword	Next Generation Sequencing, Genetic Variation Analysis, Machine Learning	en
dc.relation.page	64
dc.identifier.doi	10.6342/NTU202002376
dc.rights.note	Not authorized
dc.date.accepted	2020-08-05
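dc.contributor.author-college	College of Electrical Engineering and Computer Science	zh_TW
dc.contributor.author-dept	Graduate Institute of Computer Science and Information Engineering	zh_TW
Appears in Collections:	Department of Computer Science and Information Engineering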

Files in This Item:

File	Size	Format
U0001-0408202015035300.pdf Restricted Access	5.53 MB	Adobe PDF
