Mining Maximal Sequential Patterns in Protein Databases (蛋白質最大序列樣式探勘演算法)

Yu Ling; 凌宇

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/35957

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	李瑞庭
dc.contributor.author	Yu Ling	en
dc.contributor.author	凌宇	zh_TW
dc.date.accessioned	2021-06-13T07:48:45Z	-
dc.date.available	2005-08-01
dc.date.copyright	2005-08-01
dc.date.issued	2005
dc.date.submitted	2005-07-26
dc.identifier.citation	[1] Agrawal, R. and Srikant, R. (1994). “Fast algorithms for mining association rules”, In Proc. Int. Conf. Very Large Data Bases (VLDB’94), 487-499. [2] Agrawal, R. and Srikant, R. (1996). “Mining sequential patterns: Generalizations and performance improvements”, In Proc. 5th Int. Conf. Extending Database Technology (EDBT’96), 3-17. [3] Ayres, J., Flannick, J., Gehrke, J. and Yiu, T. (2002). “Sequential pattern mining using a bitmap representation”, In Proc. Int. Conf. Knowledge Discovery and Data Mining (KDD’02), 429-435. [4] Bahar, I. and Chen, S-C. (2004). “Mining frequent patterns in protein structures: a study of protease families”, Bioinformatics, 20, i77-i85. [5] Bork, P. and Koonin, E. (1996). “Protein sequence motifs”, Current Opinion in Structural Biology, 6, 366-376. [6] Braun, W., Ivanciuc, O. and Schein, C. (2002). “Data mining of sequences and 3D structures of allergenic proteins”, Bioinformatics, 18, 1358-1364. [7] Braun, W., Mathura, V. and Schein, C. (2003). “Identifying property based sequence motifs in protein families and superfamilies: application to DNase-1 related endonucleases”, Bioinformatics, 19, 1381-1390. [8] Chang, B. and Halgamuge, S. (2002). “Protein motif extraction with neuro-fuzzy optimization”, Bioinformatics, 18, 1084-1090. [9] Cosic, I. (1994). “Macromolecular bioactivity: is it resonant interaction between macromolecules? - theory and applications”, IEEE Transactions on Bio-medical Engineering, 41, 1101-1114. [10] Gusfield, D. (1997). Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, Cambridge, U.K. [11] Han, J. and Kamber, M. (2001). Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, CA, USA. [12] Holm, L. and Heger, A. (2003). “Sensitive pattern discovery with ‘fuzzy’ alignments of distantly related proteins”, Bioinformatics, 19, i130-i137. [13] Keles, S., Laan, M. and Vulpe, C. (2004). “Regulatory motif finding by logical regression”, Bioinformatics, 20, 2799-2811. [14] Krishnan, A., Li, K-B., and Issac, P. (2004). “Rapid detection of conserved regions in protein sequences using wavelets”, In Silico Biology, 13-22. [15] Landgraf, R., Xenarios, I., Eisenberg, D. (2001). “Three-dimensional Cluster Analysis Identifies Interfaces and Functional Residue Clusters in Proteins”, Journal of Molecular Biology, 307, 1487-1502. [16] Li, H. and Li, J. (2004). “Discovery of stable and significant binding motif pairs from PDB complexes and protein interaction datasets”, Bioinformatics Advance Access. [17] McCreight, E.M. (1976). “A space-economic suffix tree construction algorithm”, Journal of the ACM, 23, 262-272. [18] Pei, J., Han, J., Mortazavi-Asl, B., Pinto, H., Chen, Q., Dayal, U. and Hsu, M.-C. (2001). “PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth”, In Proc. Int. Conf. Data Engineering (ICDE ’01), 215-224. [19] Rigoutsos, I. and Floratos, A. (1998). “Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm”, Bioinformatics, 14, 55-67. [20] Ukkonen, E. (1985). “Finding approximate patterns in strings”, Journal of Algorithms, 6, 132-137. [21] Ukkonen, E. (1995). “On-line construction of suffix trees”, Algorithmica, 14, 249-260. [22] Yan, X., Han, J. and Afshar, R. (2003). “CloSpan: Mining Closed Sequential Patterns in Large Datasets”, In Proc. SIAM Int. Conf. on Data Mining (SDM'03), 166-177.
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/35957	-
dc.description.abstract	蛋白質序列中的序列樣式，和蛋白質所執行的功能有著密不可分的關係。因此，如何從蛋白質序列資料庫中，透過系統化的方法，探勘出重要的序列樣式，已成為一個相當重要的研究課題。針對此一課題，本論文提出了一個以字尾樹(suffix tree)為基礎的演算法。演算法中的所有過程，包括探勘序列樣式中的封閉性高頻子字串(closed frequent substring)，將封閉性子字串組成最大高頻序列樣式(maximal frequent sequential pattern)，以及調整序列樣式中子字串間的間隔(gap)等，皆可利用字尾樹中所記錄的發生資訊(occurrence information)來完成。而為了確保序列樣式的精簡性，我們的演算法刪減了不必要的序列樣式，僅保留最大序列樣式。由實驗的結果顯示，我們的演算法不僅能夠找出PROSITE資料庫中所記錄的序列樣式，並且還發現了其他一些值得提供生物學家進一步研究的結果，例如更長的序列樣式，及分類樣式集合(classifier pattern set)等。另外，我們演算法在實驗中，也展現了較 Chang and Halgamuge 的方法更為優良的結果。	zh_TW
dc.description.abstract	Because sequential patterns are closely related to protein function, systematically mining significant sequential patterns from protein databases has become an important research topic. In this thesis, we propose a suffix-tree-based algorithm to discover patterns in protein databases. We use the occurrence information maintained in the suffix tree to mine closed frequent substrings, combine them into maximal frequent sequential patterns, and adjust the gaps between the substrings within the patterns. To keep the set of generated patterns compact, we prune unnecessary patterns and retain only the maximal ones. The experimental results show that our proposed algorithm finds not only the patterns recorded in the PROSITE database, but also other patterns worthy of further biological study, such as longer patterns and the classifier pattern set. In addition, our proposed algorithm produces better results than Chang and Halgamuge’s method in the experiments.	en
dc.description.provenance	Made available in DSpace on 2021-06-13T07:48:45Z (GMT). No. of bitstreams: 1 ntu-94-R92725020-1.pdf: 689972 bytes, checksum: 0a56349722299f29aeef7bc575780ecf (MD5) Previous issue date: 2005	en
dc.description.tableofcontents	Table of Contents i; List of Figures ii; List of Tables iii; Chapter 1 Introduction 1; Chapter 2 Literature Survey 3; 2.1 Sequential mining methods 3; 2.2 Feature-based methods 3; 2.3 Mathematical methods 6; 2.4 MSA-based methods 6; 2.5 Chang and Halgamuge’s algorithm 7; 2.6 Discussion 8; Chapter 3 Our Proposed Algorithm 10; 3.1 Term and notation description 10; 3.2 Problem definition and algorithm overview 12; 3.3 Finding all closed frequent substrings 14; 3.4 Finding all maximal frequent patterns 21; 3.5 Gap adjustment 23; Chapter 4 Experiments and Performance Evaluation 28; 4.1 Experiments on real data 28; 4.2 Comparisons with PROSITE database 33; 4.3 Comparisons with Chang and Halgamuge’s algorithm 38; Chapter 5 Conclusions and Future Work 40; References 42
dc.language.iso	en
dc.subject	蛋白質	zh_TW
dc.subject	字尾樹	zh_TW
dc.subject	最大序列樣式	zh_TW
dc.subject	suffix tree	en
dc.subject	maximal sequential pattern	en
dc.subject	protein	en
dc.title	蛋白質最大序列樣式探勘演算法	zh_TW
dc.title	Mining Maximal Sequential Patterns in Protein Databases	en
dc.type	Thesis
dc.date.schoolyear	93-2
dc.description.degree	Master's
dc.contributor.oralexamcommittee	陳良華,傅楸善
dc.subject.keyword	蛋白質, 最大序列樣式, 字尾樹	zh_TW
dc.subject.keyword	protein, maximal sequential pattern, suffix tree	en
dc.relation.page	43
dc.rights.note	Paid authorization (有償授權)
dc.date.accepted	2005-07-26
dc.contributor.author-college	管理學院	zh_TW
dc.contributor.author-dept	資訊管理學研究所	zh_TW
Appears in Collections:	Department of Information Management
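
The abstract describes mining closed frequent substrings as the first stage of the algorithm. The following is a minimal sketch of that notion only, not the thesis's implementation: it uses brute-force substring enumeration in place of a suffix tree, and it assumes that the support of a substring is the number of sequences containing it and that a substring is closed when no longer substring containing it has the same support. The function name `closed_frequent_substrings` and the toy sequences are hypothetical.

```python
from collections import defaultdict


def closed_frequent_substrings(sequences, min_support):
    """Return {substring: support} for closed frequent substrings.

    Assumption: support = number of sequences containing the substring.
    Brute-force enumeration stands in for the suffix tree used in the thesis.
    """
    # Record which sequences contain each substring.
    containing = defaultdict(set)
    for idx, seq in enumerate(sequences):
        for start in range(len(seq)):
            for end in range(start + 1, len(seq) + 1):
                containing[seq[start:end]].add(idx)

    # Keep substrings that occur in at least min_support sequences.
    frequent = {s: len(ids) for s, ids in containing.items()
                if len(ids) >= min_support}

    # A substring is not closed if a one-character extension (left or right)
    # has the same support; checking one-character extensions is enough
    # because support never increases as a substring grows.
    not_closed = set()
    for sub, support in frequent.items():
        for shorter in (sub[1:], sub[:-1]):
            if shorter and frequent.get(shorter) == support:
                not_closed.add(shorter)

    return {s: n for s, n in frequent.items() if s not in not_closed}


if __name__ == "__main__":
    toy = ["ACDEF", "XCDEY", "CDQEF"]  # hypothetical toy sequences
    print(closed_frequent_substrings(toy, min_support=2))
```

Enumerating every substring takes time quadratic in the length of each sequence, which is why a suffix tree that stores occurrence information, as the abstract describes, is the more practical choice for real protein databases.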

Files in this item:

File	Size	Format
ntu-94-1.pdf (not authorized for public access)	673.8 kB	Adobe PDF

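Unless their copyright terms are specifically indicated otherwise, all items in this system are protected by copyright, with all rights reserved.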