Mining Maximal Sequential Patterns in Protein Databases (蛋白質最大序列樣式探勘演算法)

Yu Ling; 凌宇

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/35957

Title:	Mining Maximal Sequential Patterns in Protein Databases (蛋白質最大序列樣式探勘演算法)
Author:	Yu Ling (凌宇)
Advisor:	李瑞庭
Keywords:	protein, maximal sequential pattern, suffix tree
Year of publication:	2005
Degree:	Master's
Abstract:	Sequential patterns in protein sequences are closely tied to the functions the proteins perform, so systematically mining significant sequential patterns from protein sequence databases has become an important research topic. This thesis proposes a suffix-tree-based algorithm for discovering such patterns. Every stage of the algorithm uses the occurrence information recorded in the suffix tree: mining closed frequent substrings, combining those closed substrings into maximal frequent sequential patterns, and adjusting the gaps between the substrings within a pattern. To keep the result compact, the algorithm prunes unnecessary patterns and retains only maximal ones. In the experiments, the algorithm finds the patterns recorded in the PROSITE database and also discovers other results worth further biological study, such as longer patterns and classifier pattern sets. It also produces better results than the method of Chang and Halgamuge in the experiments.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/35957
Full-text license:	Paid license
Appears in department:	Department of Information Management
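
The abstract names closed frequent substrings as the first stage of the mining process. As a rough illustration of that concept only, the sketch below enumerates substrings by brute force rather than with a suffix tree, so it is not the thesis's algorithm. The function names, the toy sequences, and the definition of support as "number of sequences containing the substring" are all assumptions made for this example.

```python
from collections import defaultdict


def frequent_substrings(sequences, min_support):
    """Map each substring to the set of sequence indices containing it.

    Brute-force enumeration (assumption: stands in for the suffix tree
    used in the thesis). Only substrings found in at least `min_support`
    sequences are kept.
    """
    occurrences = defaultdict(set)
    for idx, seq in enumerate(sequences):
        for start in range(len(seq)):
            for end in range(start + 1, len(seq) + 1):
                occurrences[seq[start:end]].add(idx)
    return {s: ids for s, ids in occurrences.items() if len(ids) >= min_support}


def closed_substrings(frequent):
    """Keep substrings that have no superstring with the same support.

    Support can only shrink as a substring grows, so it is enough to
    check extensions by a single character on the left or the right.
    """
    closed = {}
    for s, ids in frequent.items():
        has_equal_extension = any(
            len(t) == len(s) + 1 and s in t and len(t_ids) == len(ids)
            for t, t_ids in frequent.items()
        )
        if not has_equal_extension:
            closed[s] = len(ids)
    return closed


# Hypothetical toy sequences, not real protein data.
toy_sequences = ["ACDEFG", "XACDEY", "ZCDEFW"]
result = closed_substrings(frequent_substrings(toy_sequences, min_support=2))
print(result)
```

On these toy sequences, `CDE` is kept with support 3, while `ACDE` and `CDEF` are kept with support 2; shorter fragments such as `CD` are dropped because a longer substring has the same support. The brute-force enumeration is quadratic in sequence length, which is one reason an index such as a suffix tree is attractive for real databases.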

Files in this item:

File	Size	Format
ntu-94-1.pdf (not currently authorized for public access)	673.8 kB	Adobe PDF

Unless their copyright terms are specifically indicated, all items in the system are protected by copyright, with all rights reserved.
