Please use this identifier to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/40187
Title: | Analyzing the Impacts of Sequence Conservation on Protein RNA-binding Residue Prediction |
Authors: | Ya-Ping Chen 陳雅萍 |
Advisor: | 黃乾綱 |
Keyword: | predicting RNA-binding residues, conserved regions, machine learning, sequence identities |
Publication Year : | 2011 |
Degree: | Master's |
Abstract: | Protein-RNA interactions play a vital role in many stages of gene expression, such as pre-mRNA synthesis, mRNA splicing, and translation. RNA-binding proteins are generally believed to recognize their target RNA through binding domains or binding motifs. Because the nucleic acid types involved and the structural levels recognized are quite diverse, predicting RNA-binding residues from protein primary structure is a challenging task.
In this thesis, we extend the ProteRNA prediction method and develop two classifiers, a support vector machine (SVM) and a random forest (RF), adding predicted protein disorder as a new feature. For post-processing based on conserved sequence regions described by WildSpan patterns, we build a discriminator that improves pattern quality by distinguishing RNA-binding residues from other functionally important residues in conserved regions. When dataset preparation effects and variation in binding sites are taken into account, the two classifiers achieve Matthews correlation coefficients (MCC) of 0.5288 and 0.4698, respectively, in five-fold cross-validation.
On the independent test dataset, the SVM model outperforms the other predictors that provide online services, achieving an accuracy of 92.12%, sensitivity of 38.10%, specificity of 97.47%, precision of 59.89%, F-score of 0.4657, and MCC of 0.4381. The RF model ranks second, with an accuracy of 90.08%, sensitivity of 34.47%, specificity of 95.59%, precision of 43.62%, F-score of 0.3851, and MCC of 0.3346. We examine how the performance of the machine learning methods changes across datasets built at different sequence identity levels, and discuss the sources of performance gains and bottlenecks. We find that homologous sequences, even remote homologs with low sequence identity, that are present in the same dataset as the query sequence can influence prediction and bring the results closer to the true distribution of binding sites. In addition, a method that finds the nearest neighbor in the dataset by sequence alignment and assigns binding residues accordingly outperforms, in some cases, machine learning methods trained on position-specific scoring matrix (PSSM) features. Nevertheless, the strong performance of the machine learning methods on novel protein sequences in the independent test demonstrates their generalization ability. |
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/40187 |
Fulltext Rights: | Paid license |
Appears in Collections: | Department of Engineering Science and Ocean Engineering |
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-100-1.pdf Restricted Access | 3.62 MB | Adobe PDF |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
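
The evaluation measures quoted in the abstract (accuracy, sensitivity, specificity, precision, F-score, MCC) follow from the standard binary confusion-matrix definitions. The sketch below is only an illustration of those textbook formulas, not the thesis code: the function names `binary_metrics` and `f_score` are hypothetical, and the thesis's confusion-matrix counts are not given in this record, so only the reported precision and sensitivity values are reused to check that they reproduce the reported F-scores.

```python
import math


def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (sensitivity)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard measures from confusion-matrix counts (hypothetical helper).

    tp/fn count binding residues predicted as binding/non-binding;
    tn/fp count non-binding residues predicted as non-binding/binding.
    """
    def safe_div(num: float, den: float) -> float:
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    sensitivity = safe_div(tp, tp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": safe_div(tn, tn + fp),
        "precision": precision,
        "f_score": f_score(precision, sensitivity),
        "mcc": safe_div(tp * tn - fp * fn, mcc_den),
    }


# Consistency check using only values reported in the abstract.
svm_f = f_score(0.5989, 0.3810)  # SVM precision and sensitivity
rf_f = f_score(0.4362, 0.3447)   # RF precision and sensitivity
print(f"SVM F-score: {svm_f:.4f}, RF F-score: {rf_f:.4f}")
```

The gap between accuracy and MCC in the reported results is what these definitions would lead one to expect when non-binding residues greatly outnumber binding ones: accuracy and specificity are dominated by the majority class, while MCC and F-score are not.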