Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/61998
Title: | Efficient Methods for Mining Sequential Patterns of Biological Sequences (探勘生物序列循序樣式的高效率方法) |
Author: | Chiang-Chi Liao (廖強棋) |
Advisor: | Ming-Syan Chen (陳銘憲) |
Keywords: | Sequential patterns, Structured motifs, Pattern mining, Data mining, Bioinformatics |
Publication year: | 2013 |
Degree: | Doctoral |
Abstract: | Scientific progress in recent years has generated huge amounts of biological data, most of which remain unanalyzed. Mining these data may provide insights into various areas of biology, for example by finding co-occurring biosequences, which are essential for biological data mining and analysis. Data mining techniques such as sequential pattern mining may reveal implicit, meaningful patterns among DNA or protein sequences.
Pattern mining for biological sequences is an important problem in bioinformatics and computational biology. Sequential pattern mining can reveal implicit motifs of all lengths in biological sequences; such motifs usually have functional significance and specific structures. To realize this potential, biologists need to move away from traditional sequential pattern mining algorithms, which have difficulty handling the small alphabets and long sequences typical of biological data such as gene and protein sequences. To tackle this problem, this dissertation proposes the Depth-First SPelling (DFSP) algorithm, a general model for mining sequential patterns in biological sequences. DFSP runs faster than PrefixSpan, its leading competitor, and outperforms other sequential pattern mining algorithms on biological sequences.
Furthermore, gap constraints are important in computational biology because they accommodate irrelevant regions that are not conserved in evolution. This dissertation therefore devises an approach for efficiently mining sequential patterns (motifs) of biological sequences with gap constraints: the Depth-First Spelling algorithm for mining sequential patterns with Gap constraints in biological sequences (DFSG). DFSG is a general gap-constraint model for sequential pattern mining of biological sequences. Because GenPrefixSpan extends PrefixSpan with gap constraints, this dissertation compares DFSG against it. On biological sequences, DFSG's runtime is substantially shorter than that of GenPrefixSpan.
Intra- and inter-block gap constraints are shown to cope effectively with the substitutions, insertions, loops, and deletions that arise in evolution, but they incur an even higher computational cost. To address this challenge, this dissertation presents the Depth-first spelling algorithm for mining structured motifs with Intra- and inter-Block gap constraints in biological sequences (DIB). When mining biological sequences, DIB's runtime is shown to be much shorter than that of MAGIIC, a previously developed projection-based sequential pattern mining algorithm. |
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/61998 |
Full-text license: | Paid license |
Appears in department: | Department of Electrical Engineering |
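
The abstract describes mining sequential patterns under gap constraints but does not give the algorithms themselves. As a rough illustration only, the sketch below shows what a gap-constrained pattern occurrence and a depth-first, prefix-extending search might look like. It is not DFSP, DFSG, or DIB: the function names, the `min_gap`/`max_gap` parameters, the support definition (number of sequences containing the pattern), and the `max_len` cutoff are all assumptions made for this example.

```python
from typing import Dict, List


def occurs_with_gaps(pattern: str, seq: str, min_gap: int, max_gap: int) -> bool:
    """True if `pattern` occurs in `seq` with between min_gap and max_gap
    skipped characters between every pair of consecutive matched symbols.
    (Hypothetical helper; the gap definition is an assumption.)"""
    if not pattern:
        return True
    # Positions in seq where the prefix matched so far can end.
    ends = {i for i, ch in enumerate(seq) if ch == pattern[0]}
    for symbol in pattern[1:]:
        next_ends = set()
        for end in ends:
            lo = end + 1 + min_gap
            hi = min(end + 1 + max_gap, len(seq) - 1)
            for j in range(lo, hi + 1):
                if seq[j] == symbol:
                    next_ends.add(j)
        ends = next_ends
        if not ends:
            return False
    return True


def support(pattern: str, sequences: List[str], min_gap: int, max_gap: int) -> int:
    """Number of sequences that contain `pattern` under the gap constraints."""
    return sum(occurs_with_gaps(pattern, s, min_gap, max_gap) for s in sequences)


def mine_depth_first(sequences: List[str], alphabet: str, min_support: int,
                     min_gap: int, max_gap: int, max_len: int) -> Dict[str, int]:
    """Enumerate frequent patterns by extending a prefix one symbol at a time.
    A pattern can only occur where its prefix occurs, so an infrequent
    prefix is never extended."""
    frequent: Dict[str, int] = {}

    def extend(prefix: str) -> None:
        if len(prefix) >= max_len:
            return
        for symbol in alphabet:
            candidate = prefix + symbol
            count = support(candidate, sequences, min_gap, max_gap)
            if count >= min_support:
                frequent[candidate] = count
                extend(candidate)

    extend("")
    return frequent


if __name__ == "__main__":
    # Toy DNA sequences, invented for this example.
    dna = ["ACGTAC", "AGCTAC", "TTACGA"]
    patterns = mine_depth_first(dna, "ACGT", min_support=3,
                                min_gap=0, max_gap=1, max_len=3)
    for pattern, count in sorted(patterns.items()):
        print(pattern, count)
```

This naive version rescans every sequence for each candidate, which is exactly the kind of cost that becomes prohibitive on long sequences over small alphabets; it is meant only to make the problem statement concrete.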
Files in this item:
File | Size | Format | |
---|---|---|---|
ntu-102-1.pdf (not currently authorized for public access) | 1.46 MB | Adobe PDF |
Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are otherwise indicated.