Application of Alphabet Code Encoded Sequence to Protein Threading

Yang-Wen Chen; 陳暘文

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/31501

Title:	Application of alphabet code encoded sequence to protein threading (區域結構碼序列在蛋白質穿針引線法上的應用)
Author:	Yang-Wen Chen (陳暘文)
Advisor:	陳中明
Keywords:	Protein Structure Prediction, Threading, Template Library, Amino Acid Sequence, Alphabet Code Encoded Sequence, Fixed Positions
Year of publication:	2006
Degree:	Master's
Abstract:	The Human Genome Project recently completed the sequencing of the human genome, and the resulting flood of genomic sequence data has revolutionized conventional medical science. With the aid of computer science, we can not only analyze large amounts of data in a short time but also help ensure the correctness and safety of biomedical research. However, research at the genomic level may be less practical for clinical applications than research at the protein level, because it is the proteins expressed by genes that actually take part in physiological processes. It is commonly believed that protein structure and protein function are closely related: if we have a clear picture of how a protein folds, we can largely determine its function.
In the past, protein structures could only be obtained experimentally, for example by X-ray diffraction or nuclear magnetic resonance (NMR) spectroscopy, and both methods have technical difficulties and limitations. Protein structure prediction has therefore come to play a major role in biomedical research. The main difficulty in protein structure prediction lies in selecting a backbone template, especially when no protein homologous to the query sequence is available. If an unknown sequence has a remote homologue in the protein database with relatively low amino acid sequence similarity (usually between 20% and 30%), sequence-structure alignment, also known as threading, can be used to predict its structure. Conventional protein threading methods work at the level of the amino acid sequence and exploit the fact that different amino acid types have different preferences for particular structural environments and spatial neighborhoods. However, the relationship between protein primary structure and tertiary structure is often weak, so threading still has many challenges to overcome. In this study, we propose a protein threading method based on alphabet code encoded sequences. Alphabet codes, derived in our earlier work, were created by grouping quadripeptide fragments with similar conserved structures into 30 clusters. Because these codes connect sequence to structure, they carry more structural information than amino acid sequences do. We selected 945 fold representatives from the SCOP 1.69 database to build our template library. After randomly choosing our training data and creating a corresponding scoring matrix for each energy term, we could measure how compatible each template in the library was with the input alphabet code encoded sequence.
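The template-scoring step described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not name the energy terms, so `match_energy` and `length_penalty` are hypothetical stand-ins, the weights are arbitrary, and alphabet codes are written as single letters although the real alphabet has 30 codes. It only shows the general shape of ranking templates by a weighted sum of energy terms, not the scoring function used in the thesis.

```python
from typing import Callable, Dict, List, Tuple

# An energy term maps (query codes, template codes) to a number; lower is better.
EnergyTerm = Callable[[str, str], float]


def match_energy(query: str, template: str) -> float:
    """Hypothetical term: -1 for every aligned position with the same code
    (ungapped, position-by-position comparison over the shared length)."""
    return -float(sum(q == t for q, t in zip(query, template)))


def length_penalty(query: str, template: str) -> float:
    """Hypothetical term: penalize the difference in sequence length."""
    return float(abs(len(query) - len(template)))


def total_energy(query: str, template: str,
                 terms: Dict[str, EnergyTerm],
                 weights: Dict[str, float]) -> float:
    """Weighted sum of all energy terms for one query/template pair."""
    return sum(weights[name] * term(query, template)
               for name, term in terms.items())


def rank_templates(query: str, library: Dict[str, str],
                   terms: Dict[str, EnergyTerm],
                   weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Score every template in the library and sort by ascending energy."""
    scored = [(name, total_energy(query, codes, terms, weights))
              for name, codes in library.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Toy data: template names and code strings are made up.
TERMS: Dict[str, EnergyTerm] = {"match": match_energy, "length": length_penalty}
WEIGHTS: Dict[str, float] = {"match": 1.0, "length": 0.5}
LIBRARY: Dict[str, str] = {"t1": "ABCD", "t2": "ABXX", "t3": "ZZ"}

ranking = rank_templates("ABCD", LIBRARY, TERMS, WEIGHTS)
```

Keeping the weights in a separate dictionary makes it easy to try different weightings per fold, which is the kind of tuning the abstract suggests may matter.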
To reduce the search space and time complexity, we also introduced the idea of finding several fixed positions, which is somewhat like finding structurally aligned motifs shared by two proteins. Although our preliminary results did not show convincing performance, we explored the possible application of alphabet code encoded sequences to protein threading. Compared with the well-known threading server GenTHREADER, our results were less reliable because our library contains fewer core templates; enlarging the template library should improve them. Results on some test data also suggested that a reliable algorithm is needed to find appropriate weights for the energy terms, and that the best weights may vary from fold to fold. Despite these obstacles, applying alphabet code encoded sequences to protein threading is a new attempt and a breakthrough in bioinformatics. We also showed that the unique dual nature of the alphabet code encoded sequence, which combines fast sequence-level searching with rich 3D structural information, deserves serious attention. We therefore anticipate that the concept of the alphabet code encoded sequence will be applied more broadly to protein structure prediction and to the field of structural bioinformatics.
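The idea of fixed positions can be sketched as anchor finding between two code sequences. The abstract does not say how the fixed positions are actually chosen, so the approach below is an assumed stand-in: it treats exact shared runs of `k` codes as candidate anchors and keeps the longest chain of anchors that appear in the same order in both sequences. The function names, the value of `k`, and the toy sequences are all hypothetical.

```python
from typing import Dict, List, Tuple

Hit = Tuple[int, int]  # (start in query, start in template)


def shared_kmers(query: str, template: str, k: int = 3) -> List[Hit]:
    """Return start positions of every length-k code run found in both sequences."""
    index: Dict[str, List[int]] = {}
    for j in range(len(template) - k + 1):
        index.setdefault(template[j:j + k], []).append(j)
    hits: List[Hit] = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits.append((i, j))
    return hits


def chain_anchors(hits: List[Hit], k: int = 3) -> List[Hit]:
    """Pick the longest chain of non-overlapping hits that keeps the same
    order in both sequences (simple O(n^2) dynamic programming)."""
    if not hits:
        return []
    hits = sorted(hits)
    best = [1] * len(hits)   # best[b]: length of the longest chain ending at hit b
    prev = [-1] * len(hits)  # back-pointer used to rebuild that chain
    for b in range(len(hits)):
        for a in range(b):
            follows = (hits[a][0] + k <= hits[b][0]
                       and hits[a][1] + k <= hits[b][1])
            if follows and best[a] + 1 > best[b]:
                best[b] = best[a] + 1
                prev[b] = a
    end = max(range(len(hits)), key=best.__getitem__)
    chain: List[Hit] = []
    while end != -1:
        chain.append(hits[end])
        end = prev[end]
    return chain[::-1]


# Toy example with made-up code strings.
anchors = chain_anchors(shared_kmers("ABCXXDEF", "ABCYDEFZ", k=3), k=3)
```

Once such anchors are fixed, only the segments between consecutive anchors need to be aligned, which is one way the search space could shrink.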
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/31501
Full-text license:	Paid license
Appears in collections:	Graduate Institute of Biomedical Engineering

Files in this item:

File	Size	Format
ntu-95-1.pdf (not authorized for public access)	789.66 kB	Adobe PDF

Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are otherwise specified.
