Analyzing the Impacts of Sequence Conservation on Protein RNA-binding Residue Prediction

Ya-Ping Chen; 陳雅萍

Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/40187

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	黃乾綱
dc.contributor.author	Ya-Ping Chen	en
dc.contributor.author	陳雅萍	zh_TW
dc.date.accessioned	2021-06-14T16:42:21Z	-
dc.date.available	2011-08-17
dc.date.copyright	2011-08-17
dc.date.issued	2011
dc.date.submitted	2011-08-13
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/40187	-
dc.description.abstract	核糖核酸與蛋白質的交互作用在基因表現的許多階段扮演重要的角色，如合成信使核糖核酸前體、剪接信使核糖核酸以及轉譯。普遍認為蛋白質是透過結合區域或結合模體辨識目標核糖核酸，其對應之核酸型態以及辨識的結構等級皆相當多元，由蛋白質一級結構預測核糖核酸結合殘基因此頗具挑戰性。　　本論文延續ProteRNA預測方法發展支援向量機與隨機森林分類器，特徵方面增加預測而得之不穩定序，亦改善以WildSpan樣式之序列保留區域進行後處理的品質。考慮資料集製備的效果和結合位置差異，在五疊交叉驗證中，兩分類器的馬修相關係數分別可達0.5288和0.4698。而在獨立測試的表現，支援向量機領先各提供線上服務的預測方法，可獲得92.12%的精確度、38.10%的靈敏度、97.47%的專一性、59.89%的準確度、0.4657的F測量值和0.4381的馬修相關係數，隨機森林表現僅次於支援向量機，可獲得90.08%的精確度、34.47%的靈敏度、95.59%的專一性、43.62%的準確度、0.3851的F測量值和0.3346的馬修相關係數。　　觀察不同序列相似度之資料集對機器學習方法預測能力所構成的趨勢，並討論其中預測表現的提升與瓶頸來源為何。我們發現和查詢序列同時存在於資料集的同源序列，甚至相似度較低的遠親同源序列，都可能影響預測，使結果更接近正確的結合位置分佈。此外，直接比對資料集內最接近的序列以決定結合殘基，在部分情況的預測結果是優於特徵向量為位置加權矩陣之機器學習方法的；然而機器學習方法在獨立測試時面對新穎的蛋白質序列，預測的優良表現終究顯現其一般化能力。	zh_TW
dc.description.abstract	Protein-RNA interactions play a vital role in many stages of gene expression, such as pre-mRNA synthesis, mRNA splicing, and translation. It is generally believed that binding domains or binding motifs enable RNA-binding proteins to recognize their target RNA. Because the types of nucleic acid involved and the structural levels recognized are quite diverse, predicting RNA-binding residues from the primary structure of proteins is a challenging task. In this thesis, we continue the work of ProteRNA and develop two classifiers, a support vector machine (SVM) and a random forest (RF), with predicted protein disorder added as a new feature descriptor. For the post-processing procedure, we build a discriminator that improves pattern quality by distinguishing RNA-binding residues from other functionally important residues in conserved regions. When the effects of dataset preparation and the variance in binding sites are taken into account, the two classifiers achieve Matthews correlation coefficients (MCC) of 0.5288 and 0.4698, respectively, in five-fold cross-validation. In independent testing, our approach outperforms other predictors that provide online services. The SVM model achieves an accuracy of 92.12%, sensitivity of 38.10%, specificity of 97.47%, precision of 59.89%, F-score of 0.4657, and MCC of 0.4381; the RF model ranks second only to the SVM, with an accuracy of 90.08%, sensitivity of 34.47%, specificity of 95.59%, precision of 43.62%, F-score of 0.3851, and MCC of 0.3346. We examine how the performance measures of the machine learning methods trend across datasets built at different sequence identities, and discuss the sources of the performance gains and bottlenecks. We find that homologous sequences, and even remote homologs, present in the same dataset as the query sequence are likely to bring the prediction closer to the true distribution of binding sites. In addition, a method that identifies the nearest neighbor by sequence alignment and assigns binding residues accordingly may, in some cases, outperform machine learning methods trained on position-specific scoring matrices (PSSM). Nevertheless, when dealing with novel protein sequences in the independent test, the strong performance of the machine learning methods demonstrates their generalization ability.	en
dc.description.provenance	Made available in DSpace on 2021-06-14T16:42:21Z (GMT). No. of bitstreams: 1 ntu-100-R98525038-1.pdf: 3708330 bytes, checksum: c567e046a567f47cd6cd9403ad40007f (MD5) Previous issue date: 2011	en
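dc.description.tableofcontents	Acknowledgements I; Abstract (Chinese) II; ABSTRACT III; Table of Contents V; List of Figures VII; List of Tables VIII; Chapter 1 Introduction 1; Chapter 2 Literature Review 4; 2.1 Central Dogma 4; 2.2 Related Work 4; 2.3 BLASTClust 7; 2.4 Tools for Extracting Information from Amino Acid Sequences 7; 2.4.1 PSI-BLAST 7; 2.4.2 PSIPRED 8; 2.4.3 DISOPRED 9; 2.5 Random Forests 9; 2.6 Support Vector Machines 10; 2.7 WildSpan 14; Chapter 3 Materials and Methods 15; 3.1 Datasets 15; 3.2 Feature Vector Encoding 15; 3.3 Post-processing with WildSpan Patterns 17; 3.4 System Architecture 19; 3.5 Classifier Performance Evaluation 20; Chapter 4 Experimental Results 22; 4.1 Parameter Optimization 22; 4.2 Incorporating WildSpan Patterns 24; 4.3 Changing the Dataset Selection Method 25; 4.4 Labeling Binding Sites by Majority Vote 27; 4.5 Independent Test 29; 4.6 Sequence Identity and Predictive Performance 31; 4.6.1 Discussion of the Cross-Validation Datasets 31; 4.6.2 Discussion of the Training Sets for Independent Testing 36; 4.7 Method of Aligning to the Closest Sequence in the Dataset 37; Chapter 5 Conclusion 41; References 43; Appendix A: Sequence Similarity Dendrograms 47; Appendix B: Sequence Lists of RB33 and RB301 55; Appendix C: Supplementary Prediction Performance Data 57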
dc.language.iso	zh-TW
dc.subject	預測核糖核酸結合殘基	zh_TW
dc.subject	保留區域	zh_TW
dc.subject	機器學習	zh_TW
dc.subject	序列相似度	zh_TW
dc.subject	predicting RNA-binding residues	en
dc.subject	conserved regions	en
dc.subject	machine learning	en
dc.subject	sequence identities	en
dc.title	分析序列保留性對蛋白質中核糖核酸結合殘基預測之影響	zh_TW
dc.title	Analyzing the Impacts of Sequence Conservation on Protein RNA-binding Residue Prediction	en
dc.type	Thesis
dc.date.schoolyear	99-2
dc.description.degree	Master's
dc.contributor.oralexamcommittee	歐陽彥正,張瑞益,陳倩瑜
dc.subject.keyword	預測核糖核酸結合殘基,保留區域,機器學習,序列相似度	zh_TW
dc.subject.keyword	predicting RNA-binding residues, conserved regions, machine learning, sequence identities	en
dc.relation.page	60
dc.rights.note	Paid license
dc.date.accepted	2011-08-14
dc.contributor.author-college	工學院	zh_TW
dc.contributor.author-dept	工程科學及海洋工程學研究所	zh_TW
Appears in Collections:	Department of Engineering Science and Ocean Engineering (工程科學及海洋工程學系)
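
Note on the evaluation measures: the abstract reports F-score and MCC alongside precision and sensitivity. The sketch below uses the standard textbook definitions of these measures, which are assumed (not confirmed by this record) to be the ones used in the thesis. The function names `f_score` and `mcc` are illustrative. The confusion-matrix counts behind the reported MCC values are not given in this record, so only the F-score is recomputed from the published figures.

```python
import math


def f_score(precision: float, sensitivity: float) -> float:
    """F-score as the harmonic mean of precision and sensitivity (recall)."""
    return 2 * precision * sensitivity / (precision + sensitivity)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Precision and sensitivity as reported in the abstract (independent test).
svm_f = f_score(0.5989, 0.3810)
rf_f = f_score(0.4362, 0.3447)
print(round(svm_f, 4), round(rf_f, 4))
```

Rounded to four decimals, the two values agree with the F-scores stated in the abstract (0.4657 for the SVM and 0.3851 for the RF), which suggests the reported precision, sensitivity, and F-score figures are mutually consistent.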

Files in This Item:

File	Size	Format
ntu-100-1.pdf (not authorized for public access)	3.62 MB	Adobe PDF

Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are otherwise indicated.