Incorporating structure-based functional site prediction in predicting deleterious protein coding region variation in human genome

Pei-Hsuan Chen; 陳佩萱

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/67491

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	楊安綏(An-Suei Yang)
dc.contributor.author	Pei-Hsuan Chen	en
dc.contributor.author	陳佩萱	zh_TW
dc.date.accessioned	2021-06-17T01:34:32Z	-
dc.date.available	2018-07-31
dc.date.copyright	2017-08-29
dc.date.issued	2017
dc.date.submitted	2017-08-01
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/67491	-
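dc.description.abstract	With the development of high-throughput technologies and the growing number of sequence variants produced by various sequencing projects, how to apply computational methods to help interpret these variants has become a research topic of wide interest. Most existing methods use sequence-based or structure-based information to assess the effects of sequence variants and try to explain their impact on the disruption of protein function or on disease pathogenicity. In this study, we integrate sequence-based information with ISMBLab functional site prediction to identify deleterious amino acid substitutions that disrupt protein function. From the training set of the VIPUR prediction tool, we collected 8,884 protein variants, each with clear experimental evidence of whether it disrupts protein function, to build an SVM classifier. The results show that our classifier can be extended to predictions in other species and achieves better areas under the ROC and PR curves. In comparison with other methods, our classifier obtains a Matthews correlation coefficient of 0.405 on the human variant test set. In summary, we propose a method that incorporates structure-based functional site prediction to predict the effects of amino acid substitutions on protein function, and we show that features derived from ISMBLab functional site prediction are helpful for predicting protein variants.	zh_TW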
dc.description.abstract	As high-throughput techniques advance and different sequencing projects generate massive amounts of sequence variation data, applying computational methods to annotate these variations has become an issue of concern. Existing methods exploit sequence-based or structure-based information to interpret the effects of variations, and most of them relate those effects to the functional disruption of a protein or to disease pathogenicity. Here we present a method that integrates sequence-based information and ISMBLab functional site prediction to identify deleterious amino acid substitutions that disrupt protein function. In this work, we collect 8,884 protein variants from the VIPUR training set, each with clear experimental evidence on whether it disrupts protein function, to train an SVM classifier. The results show that our classifier can generalize to other organisms, with better values of the area under the ROC and PR curves. Compared with other methods, our classifier achieves a Matthews correlation coefficient of 0.405 on the human variant testing set. In summary, we provide a method that incorporates structure-based functional site prediction to predict the effects of amino acid substitutions on protein function, and we demonstrate that features derived from ISMBLab functional site prediction are useful for predicting protein variations.	en
dc.description.provenance	Made available in DSpace on 2021-06-17T01:34:32Z (GMT). No. of bitstreams: 1 ntu-106-R04B48002-1.pdf: 2568087 bytes, checksum: 2e427ce63e7716b7459418b7d5049d04 (MD5) Previous issue date: 2017	en
dc.description.tableofcontents	Oral Examination Committee Certification; Acknowledgements; Abstract (Chinese); ABSTRACT; LIST OF FIGURES; LIST OF TABLES; CHAPTER 1 Introduction; 1.1 Annotation of genetic variations; 1.2 Researches using machine learning algorithms; 1.3 Research hypothesis; CHAPTER 2 Materials and Methods; 2.1 Benchmark dataset; 2.2 Characterize protein variants; 2.2.1 Sequence-based features; 2.2.2 Structure-based features; 2.3 Training a classifier by support vector machine (SVM); 2.4 Independent set for comparisons; 2.5 Feature selection by recursive feature elimination; 2.6 Prediction capacity; CHAPTER 3 Results; 3.1 Correlations between features and the label; 3.2 Performance evaluation of features; 3.3 Comparison ISMBLab* to other classifiers; 3.4 Visualization of data distribution; 3.5 Feature selection by recursive feature elimination; 3.6 Detailed consequence and annotation in human variants; CHAPTER 4 Discussions; CHAPTER 5 Conclusions; CHAPTER 6 References
dc.language.iso	en
dc.subject	amino acid substitution	zh_TW
dc.subject	annotating variation	zh_TW
dc.subject	support vector machine	zh_TW
dc.subject	machine learning	zh_TW
dc.subject	deleterious variation	zh_TW
dc.subject	protein variation	zh_TW
dc.subject	support vector machine	en
dc.subject	amino acid substitution	en
dc.subject	protein variation	en
dc.subject	deleterious variation	en
dc.subject	machine learning	en
dc.subject	annotating variation	en
dc.title	Predicting deleterious variation in protein-coding regions of the genome by incorporating structure-based functional site prediction	zh_TW
dc.title	Incorporating structure-based functional site prediction in predicting deleterious protein coding region variation in human genome.	en
dc.type	Thesis
dc.date.schoolyear	105-2
dc.description.degree	Master's
dc.contributor.oralexamcommittee	蔡懷寬(Huai-Kuang Tsai),陳倩瑜(Chien-Yu Chen),許世宜(Sheh-Yi Sheu)
dc.subject.keyword	annotating variation,amino acid substitution,protein variation,deleterious variation,machine learning,support vector machine	zh_TW
dc.subject.keyword	annotating variation,amino acid substitution,protein variation,deleterious variation,machine learning,support vector machine	en
dc.relation.page	43
dc.identifier.doi	10.6342/NTU201702281
dc.rights.note	Paid license
dc.date.accepted	2017-08-02
dc.contributor.author-college	College of Life Science	zh_TW
dc.contributor.author-dept	Genome and Systems Biology Degree Program	zh_TW
Appears in Collections:	Genome and Systems Biology Degree Program
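
The abstract in the metadata record above reports a Matthews correlation coefficient (MCC) of 0.405 on the human variant testing set. As a minimal sketch of how that metric is defined for a binary deleterious/neutral classifier, the Python below computes MCC from confusion-matrix counts. The function name and the example counts are hypothetical and are not taken from the thesis.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Compute MCC from binary confusion-matrix counts (hypothetical helper)."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By common convention, MCC is reported as 0 when any marginal sum is zero.
    return numerator / denominator if denominator else 0.0


# Illustrative counts only; these are not results from the thesis.
example_mcc = matthews_corrcoef(tp=70, tn=60, fp=40, fn=30)
print(round(example_mcc, 3))
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), and it uses all four confusion-matrix cells, which is one reason it is often reported alongside ROC and PR areas when classes are imbalanced.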

Files in This Item:

File	Size	Format
ntu-106-1.pdf (not authorized for public access)	2.51 MB	Adobe PDF


Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are otherwise indicated.