Prioritization of Disease-Causing Variants of Exome Data by Machine Learning

Yu-Shan Huang; 黃榆珊

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/17687

Title:	Prioritization of Disease-Causing Variants of Exome Data by Machine Learning (應用機器學習於排序外顯子資料之致病基因變異點)
Author:	Yu-Shan Huang 黃榆珊
Advisor:	賴飛羆 (Fei-pei Lai)
Co-advisor:	李妮鍾 (Ni-Chung Lee)
Keywords:	Next Generation Sequencing, Genetic Variation Analysis, Machine Learning
Publication year:	2020
Degree:	Master's
Abstract:	In recent years, thanks to the rapid development of next-generation sequencing (NGS) technology, an entire human genome can be sequenced in a very short time. NGS has therefore been widely adopted in clinical diagnosis, especially for patients with hereditary disorders. After a patient's DNA is sequenced and processed through a complex bioinformatics pipeline, exome data on single nucleotide variants (SNVs) are generated. To determine the true causal variants, physicians often have to review each variant manually, search the relevant literature across multiple genetic databases, and pick out from this large volume of data the few variants that are both pathogenic and associated with the patient's symptoms. This laborious, time-consuming, case-by-case process is a burden for physicians interpreting clinical genetic disease. To help physicians interpret NGS-derived variant analysis results more quickly and accurately, this study attempts to build a machine learning model that predicts disease-causing variants in a patient's exome data. We collected whole exome sequencing (WES) and gene panel sequencing data from multiple patients with hereditary disorders as the training set, and integrated variant annotations from multiple genetic databases as features. The model is trained with machine learning methods on the association between variant features and the patient's clinical symptoms.
The model automatically ranks the variants and outputs those most likely to cause the patient's clinical presentation. For testing, we used whole exome sequencing data from 108 patients with rare genetic disorders at National Taiwan University Hospital. A keyword extraction tool automatically extracted phenotype keywords from each patient's electronic medical records, and these keywords were fed into the model together with the sequencing data to find the target variants most relevant to the symptoms among an average of about 741 candidate variants per patient after filtering. Of the 134 known disease-causing variants, the model ranked 92.6% within the top 10 of the candidate list. The model ranks on par with manual interpretation and has been used to support clinical diagnosis of genetic diseases. Using this system can substantially reduce the time physicians and genetics researchers spend interpreting variants, increase the clinical diagnostic rate, and improve the quality of medical care.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/17687
DOI:	10.6342/NTU202002376
Full-text authorization:	Not authorized
Appears in department:	Department of Computer Science and Information Engineering

Files in this item:

File	Size	Format
U0001-0408202015035300.pdf (not authorized for public access)	5.53 MB	Adobe PDF
