Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/660
Title: | 基於分群集成技術的非平衡學習應用於預測非編碼區變異的致病性 Clustering Ensemble Based Imbalanced Learning for Predicting Pathogenic Non-coding Variants |
Author: | Kai-Wen Chuang 莊凱文 |
Advisor: | 陳倩瑜 |
Keywords: | clustering ensemble, imbalanced data, non-coding variant, pathogenic, machine learning |
Publication Year: | 2019 |
Degree: | Master's |
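Abstract: | As next-generation sequencing and whole-genome sequencing become increasingly common, tens of millions of genetic variants have been discovered in the human genome, most of them located in non-coding regions. Variants in non-coding regions may alter gene regulatory mechanisms and thereby cause disease. In practice, however, only a very small fraction of variants affect gene function and lead to disease, so identifying disease-associated variants among such a large number of variants is a major challenge.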
In recent years, many machine learning methods have been used to predict pathogenic variants in the human genome. However, as the number of non-pathogenic variants grows, the imbalance between positive and negative (pathogenic/non-pathogenic) samples in the dataset increases, and the classifier's precision and recall drop markedly. To improve classifier performance on imbalanced datasets, this study develops CE-SMURF, a machine learning framework based on a Clustering Ensemble (CE) sampling technique and a hyper-ensemble method. It addresses the poor performance of general machine learning algorithms on imbalanced datasets and is applied to predicting pathogenic variants in non-coding regions. With the help of Next Generation Sequencing (NGS) and whole-genome sequencing (WGS), many variants have been found in the non-coding regions of the human genome, but confirmed pathogenic variants make up only a small minority. Finding pathogenic variants among such a large number of non-coding variants is a challenge. A method called HyperSMURF was previously proposed to tackle this problem by using both under-sampling and over-sampling techniques to balance the data. While reproducing the analytic results of HyperSMURF, we observed that this approach might generate minority-class samples that do not help training, or discard majority-class samples that might benefit training. This study therefore presents a machine learning framework, CE-SMURF. The CE-based (Clustering Ensemble-based) method is used to find the samples at the center of the majority class and the samples at the boundary of the minority class, and a resampling technique is then used to balance the class ratio. Moreover, to improve learning performance, we used an ensemble method to build multiple models and computed the final score of each variant by averaging the probabilities predicted by the individual models. We found that CE-SMURF can significantly improve performance in predicting pathogenic non-coding variants. |
URI: | http://tdr.lib.ntu.edu.tw/handle/123456789/660 |
DOI: | 10.6342/NTU201903179 |
Full-Text Authorization: | Authorized (open access worldwide) |
Appears in Collections: | Department of Biomechatronics Engineering |
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-108-1.pdf | 2.29 MB | Adobe PDF | View/Open |
Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are otherwise indicated.