Clustering Ensemble Based Imbalanced Learning for Predicting Pathogenic Non-coding Variants

Kai-Wen Chuang; 莊凱文

Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/660

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	陳倩瑜
dc.contributor.author	Kai-Wen Chuang	en
dc.contributor.author	莊凱文	zh_TW
dc.date.accessioned	2021-05-11T04:53:26Z	-
dc.date.available	2019-08-19
dc.date.available	2021-05-11T04:53:26Z	-
dc.date.copyright	2019-08-19
dc.date.issued	2019
dc.date.submitted	2019-08-13
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/handle/123456789/660	-
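dc.description.abstract	With next-generation sequencing and whole-genome sequencing becoming increasingly widespread, tens of millions of genetic variants have been discovered across the human genome, most of them located in non-coding regions. Variants in non-coding regions may alter gene regulatory mechanisms and thereby cause disease. In practice, however, only a very small fraction of variants affect gene function and lead to disease, so identifying disease-associated variants among so many candidates is a major challenge. In recent years many machine learning methods have been used to predict pathogenic variants in the human genome, but as the number of non-pathogenic variants grows, that is, as the imbalance between positive (pathogenic) and negative (non-pathogenic) samples in the dataset increases, classifier precision and recall drop markedly. To improve prediction on imbalanced datasets, this study develops CE-SMURF, a machine learning framework based on a clustering ensemble (CE) sampling technique and the Hyper-ensemble method. It addresses the poor performance of general machine learning algorithms on imbalanced datasets and is applied to predicting pathogenic variants in non-coding regions.	zh_TW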
dc.description.abstract	With the help of next-generation sequencing (NGS) and whole-genome sequencing (WGS), many variants have been found in the non-coding regions of the human genome, but confirmed pathogenic variants are only a small minority. Finding pathogenic variants among such a large number of non-coding variants is a challenge. A method called HyperSMURF was previously proposed to tackle this problem by using both under-sampling and over-sampling techniques to balance the data. In reproducing the analytic results of HyperSMURF, we observed that this approach might generate minority-class samples that do not help training, or discard majority-class samples that might benefit training. This study therefore presents a machine learning framework, CE-SMURF. A clustering ensemble (CE)-based method is used to find the center samples of the majority class and the boundary samples of the minority class, and a resampling technique is then used to balance the class ratio. Moreover, to improve learning performance, we used an ensemble method to build multiple models and computed the final score of each variant by averaging its predicted probability across the models. We found that CE-SMURF can significantly improve performance in predicting pathogenic non-coding variants.	en
dc.description.provenance	Made available in DSpace on 2021-05-11T04:53:26Z (GMT). No. of bitstreams: 1 ntu-108-R06631033-1.pdf: 2344716 bytes, checksum: 3629b31616e3f1d12a1ed13cfa2c5b98 (MD5) Previous issue date: 2019	en
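dc.description.tableofcontents	Acknowledgements i; Chinese Abstract ii; Abstract iii; Table of Contents iv; List of Figures vi; List of Tables vii; Chapter 1 Research Objectives 1; Chapter 2 Literature Review 3; 2.1 Non-coding Variants 3; 2.2 Databases 5; 2.2.1 HGMD (Human Gene Mutation Database) 5; 2.2.2 ClinVar 5; 2.2.3 1000 Genomes Project 6; 2.3 HyperSMURF 7; 2.3.1 Sampling Techniques 8; 2.3.2 Hyper-ensemble 9; 2.3.3 Pseudocode 9; 2.4 Clustering Ensemble Sampling for Imbalanced Data 10; 2.4.1 Clustering Ensemble 11; 2.4.2 Clustering Ensemble Sampling 12; Chapter 3 Methods 13; 3.1 Training Set 13; 3.1.1 Pathogenic 13; 3.1.2 Non-pathogenic 14; 3.2 Test Set 14; 3.2.1 Pathogenic 14; 3.2.2 Non-pathogenic 15; 3.3 Feature Selection 15; 3.4 CE-SMURF Machine Learning Framework 16; 3.4.1 Clustering Ensemble Sampling 17; 3.4.2 Hyper-ensemble 18; 3.4.3 CE-SMURF Parameters 19; 3.4.4 Pseudocode 19; 3.5 Model Performance Evaluation Metrics 21; Chapter 4 Results and Discussion 24; 4.1 Effect of Sampling Parameters on Training 24; 4.2 Comparison of Predictions Across Methods 26; 4.3 Effect of the Degree of Imbalance on Prediction 28; 4.4 Effect of Variant Data of Different Confidence Levels on Prediction 30; Chapter 5 Conclusion 32; References 33; Appendix 1 Detailed Features in Each Category 35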
dc.language.iso	zh-TW
dc.subject	machine learning	zh_TW
dc.subject	clustering ensemble	zh_TW
dc.subject	imbalanced data	zh_TW
dc.subject	non-coding variant	zh_TW
dc.subject	pathogenicity	zh_TW
dc.subject	machine learning	en
dc.subject	pathogenic	en
dc.subject	non-coding variant	en
dc.subject	imbalanced data	en
dc.subject	clustering ensemble	en
dc.title	Imbalanced Learning Based on Clustering Ensemble Techniques Applied to Predicting the Pathogenicity of Non-coding Variants	zh_TW
dc.title	Clustering Ensemble Based Imbalanced Learning for Predicting Pathogenic Non-coding Variants	en
dc.date.schoolyear	107-2
dc.description.degree	Master's
dc.contributor.oralexamcommittee	吳君泰,蔡懷寬
dc.subject.keyword	clustering ensemble,imbalanced data,non-coding variant,pathogenicity,machine learning	zh_TW
dc.subject.keyword	clustering ensemble,imbalanced data,non-coding variant,pathogenic,machine learning	en
dc.relation.page	36
dc.identifier.doi	10.6342/NTU201903179
dc.rights.note	Authorization granted (open access worldwide)
dc.date.accepted	2019-08-14
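dc.contributor.author-college	College of Bioresources and Agriculture	zh_TW
dc.contributor.author-dept	Graduate Institute of Bio-Industrial Mechatronics Engineering	zh_TW
Appears in Collections:	Department of Biomechatronics Engineering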
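
The English abstract in the record above describes building multiple models on rebalanced data and averaging each model's probability into a final score per variant. The sketch below illustrates only that averaging step, under loudly stated assumptions: the function names (`train_ensemble`, `ensemble_score`), the toy two-feature data, and the nearest-centroid base learner are all hypothetical stand-ins, and plain random under-sampling takes the place of the clustering-ensemble selection of center and boundary samples, which this record does not describe in detail. It should not be read as the thesis's implementation.

```python
import math
import random


def centroid(rows):
    """Component-wise mean of a non-empty list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_centroid_model(pos, neg):
    """Hypothetical stand-in base learner: store one centroid per class."""
    return centroid(pos), centroid(neg)


def predict_proba(model, x):
    """Turn the difference in squared distances into a score in (0, 1)."""
    c_pos, c_neg = model
    d_pos = sum((a - b) ** 2 for a, b in zip(x, c_pos))
    d_neg = sum((a - b) ** 2 for a, b in zip(x, c_neg))
    return 1.0 / (1.0 + math.exp(d_pos - d_neg))


def train_ensemble(pos, neg, n_parts, seed=0):
    """Split the majority class into n_parts and train one model per part."""
    rng = random.Random(seed)
    shuffled = neg[:]
    rng.shuffle(shuffled)
    parts = [shuffled[i::n_parts] for i in range(n_parts)]
    models = []
    for part in parts:
        # Assumption: random under-sampling to the minority size stands in
        # for the clustering-ensemble sample selection.
        k = min(len(part), len(pos))
        models.append(fit_centroid_model(pos, rng.sample(part, k)))
    return models


def ensemble_score(models, x):
    """Final score: mean of the per-model probabilities."""
    return sum(predict_proba(m, x) for m in models) / len(models)


# Toy imbalanced data: 5 "pathogenic" points near (2, 2), 50 others near (0, 0).
rng = random.Random(42)
pos = [[rng.gauss(2.0, 0.5), rng.gauss(2.0, 0.5)] for _ in range(5)]
neg = [[rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)] for _ in range(50)]

models = train_ensemble(pos, neg, n_parts=5)
print(ensemble_score(models, [2.0, 2.0]))  # score for a point near the minority class
print(ensemble_score(models, [0.0, 0.0]))  # score for a point near the majority class
```

Each model sees every minority sample but only one slice of the majority class, so no majority sample is shared between models and the averaged score draws on all partitions.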

Files in This Item:

File	Size	Format
ntu-108-1.pdf	2.29 MB	Adobe PDF	View/Open


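Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.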