Copy Number-Based Seeding Approaches to Efficient Orthology and Synteny Mapping in Genome Comparisons

Yu-Jung Chang; 張育榮

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/9563

Title:	Copy Number-Based Seeding Approaches to Efficient Orthology and Synteny Mapping in Genome Comparisons (詞頻探索方法用於高效率之基因體同源與同線圖譜對映)
Author:	Yu-Jung Chang (張育榮)
Advisor:	Cheng-Yen Kao (高成炎)
Co-advisor:	Jan-Ming Ho (何建明)
Keywords:	comparative genomics, synteny mapping, orthology mapping, sequence alignment, seeding, suffix array
Year of publication:	2008
Degree:	Doctoral
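Abstract:	Finding and tracing the segments that the genomes of different organisms share through a common evolutionary origin (known as synteny and orthology mapping) is a fundamental task in comparative genomics. With advances in sequencing technology, more and more large genomes have been completely or nearly completely sequenced. This makes orthology and synteny mapping by whole-genome comparison increasingly important, and it also brings new research challenges. Given the many genome sequences to compare, which have diverged over time and often span billions of base pairs, a core research question is how to build comparison engines and methods with high sensitivity, high specificity, and high efficiency. We first developed the UniMarker method for orthology and synteny mapping between closely related large genomes. Taking the human-mouse comparison as an example, the method uses short sequences of length 16 that occur exactly once in each of the two genomes to build occurrence spectra, from which orthologous and syntenic genomic segments are detected. Experimental results show that orthology and synteny mapping between human and mouse (each genome about three billion base pairs long) can be completed in a few hours on a single personal computer, and that the resulting map is 99% consistent with the map of the Mouse Genome Sequencing Consortium (MGSC). Next, for orthology and synteny mapping between large genomes that are not closely related, we propose a new type of seed called maximal α-marker pairs (α-pairs for short), where α is the upper bound on the total number of occurrences of the seed in the two sequences being compared. This selection criterion differs from the common practice of constraining seed length without considering copy number, for example fixed-length k-mers and MEM methods with a minimum length. Building on enhanced suffix arrays, we propose a linear-time algorithm that generates all α-pairs. In experiments comparing human against mouse, chicken, and pufferfish, the α-marker method achieved clearly better sensitivity and better efficiency than the length-constrained methods (k-mer, MEM) in orthology seeding with contiguous matching. In addition, we extend this copy number-based approach to orthology seeding with discontiguous matching. Comparisons of ROC curves show that discontiguous wobble α-pairs clearly outperform other discontiguous seeds that have no copy number constraint (spaced k-mer seeds).
Motivation: Orthology/synteny mapping, that is, finding orthologous regions among genomes and organizing these evolutionary counterparts into a coherent global picture, is fundamental to comparative genomics. As more genomes are completely sequenced and comparisons of massive nucleotide sequences become more common, the need for orthology/synteny mapping methods with high sensitivity, high specificity, and high efficiency becomes even more compelling. Results: First, we have developed the UniMarker (UM) method for synteny mapping of large, closely related genomes, such as those of human and mouse. In this method, the occurrence spectra of genome-wide unique 16-mer sequences present in both the human and mouse genomes are used to directly detect orthologous genomic segments. Because it requires no sequence alignment, the UM method is very fast: high-quality human-mouse synteny maps based on DNA comparisons can be completed in a few hours on a single desktop computer.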
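The marker-selection step described above can be illustrated with a small sketch. It is not the thesis implementation: the function names `unique_kmer_positions` and `shared_unique_markers` are hypothetical, the sequences are toy strings, reverse-complement strands are ignored, and the later step of building occurrence spectra from the markers is not shown. The sketch only collects k-mers that occur exactly once in each of two sequences and reports where they sit in both.

```python
from collections import Counter


def unique_kmer_positions(seq, k):
    """Map every k-mer that occurs exactly once in seq to its start position."""
    starts = range(len(seq) - k + 1)
    counts = Counter(seq[i:i + k] for i in starts)
    return {seq[i:i + k]: i for i in starts if counts[seq[i:i + k]] == 1}


def shared_unique_markers(genome_a, genome_b, k=16):
    """Return (kmer, pos_a, pos_b) for k-mers unique in both sequences.

    Hypothetical helper: results are sorted by position in genome_a.
    """
    unique_a = unique_kmer_positions(genome_a, k)
    unique_b = unique_kmer_positions(genome_b, k)
    shared = unique_a.keys() & unique_b.keys()
    return sorted(((m, unique_a[m], unique_b[m]) for m in shared),
                  key=lambda marker: marker[1])


if __name__ == "__main__":
    # Toy example with k=4; the abstract uses k=16 on whole genomes.
    for marker in shared_unique_markers("ACGTACGGTTCA", "TTACGGTTGACG", k=4):
        print(marker)
```

A dictionary-based count like this is only practical for short inputs; for genomes of about three billion base pairs a compact index would be needed, which is outside the scope of this sketch.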
Second, we propose a new type of DNA sequence seed for orthology mapping of genomes that are not closely related. We call our seeds α-pairs, where α is an integer equal to or greater than the number of times any qualifying seed can be found in the compared genomes. These copy number-based seeds are thus distinct from the well-known length-based seeds, such as fixed-length k-mer seeds or maximal exact match (MEM) seeds, which have a length of no less than k. We present a linear-time algorithm, based on enhanced suffix arrays, that efficiently retrieves the α-pairs in two given genomic sequences. We compared α-pairs with length-based seeds on their ability to detect the orthologues annotated by Ensembl and COG, for several vertebrate genomes/chromosomes and for prokaryote genomes separated by long evolutionary distances. The results suggest that orthology seeding by copy number can achieve higher sensitivity and better efficiency than orthology seeding by length. Moreover, we extend the α-pair method to generate discontiguous wobble seeds of maximal length with copy number constraints. ROC-curve comparisons for human chr.15 vs. mouse chr.7, chicken chr.10, and the pufferfish genome showed that the discontiguous wobble α-pairs performed significantly better than the spaced k-mer seeding methods tested.
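To make the contrast between copy number-based and length-based seeding concrete, here is a brute-force sketch. It rests on one reading of the abstract: an α-pair is taken to be a maximal exact match between the two sequences whose matched word occurs at most α times in the two sequences combined, with no minimum length. The names `count_occurrences` and `alpha_pairs` are hypothetical, and this quadratic scan is not the linear-time enhanced-suffix-array algorithm of the thesis.

```python
def count_occurrences(text, word):
    """Count occurrences of word in text, including overlapping ones."""
    return sum(1 for s in range(len(text) - len(word) + 1)
               if text.startswith(word, s))


def alpha_pairs(seq_a, seq_b, alpha):
    """Return (pos_a, pos_b, length) for maximal exact matches whose word
    has a total copy number of at most alpha in seq_a and seq_b combined.

    Assumed definition for illustration only; brute force, not linear time.
    """
    pairs = []
    for i in range(len(seq_a)):
        for j in range(len(seq_b)):
            if seq_a[i] != seq_b[j]:
                continue
            # Skip matches that could be extended to the left.
            if i > 0 and j > 0 and seq_a[i - 1] == seq_b[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            length = 0
            while (i + length < len(seq_a) and j + length < len(seq_b)
                   and seq_a[i + length] == seq_b[j + length]):
                length += 1
            word = seq_a[i:i + length]
            copies = count_occurrences(seq_a, word) + count_occurrences(seq_b, word)
            if copies <= alpha:
                pairs.append((i, j, length))
    return pairs


if __name__ == "__main__":
    print(alpha_pairs("ACGTTT", "GACGTA", alpha=2))
    print(alpha_pairs("ACGTTT", "GACGTA", alpha=3))
```

In this toy input, α=2 keeps only the match on the word `ACGT`, which occurs once in each sequence, whereas raising α to 3 also admits single-character matches. A length-based filter would instead accept or reject matches by `length` alone, regardless of how often the word repeats.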
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/9563
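Full-text authorization:	Authorized (open access worldwide)
Appears in collections:	Department of Computer Science and Information Engineering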

Files in this item:

File	Size	Format
ntu-97-1.pdf	2 MB	Adobe PDF	View/Open


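Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.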
