Scalable Assembly of High-Throughput De Novo Sequencing Data

Chien-Chih Chen; 陳建智

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/6366

Title:	Scalable Assembly of High-Throughput De Novo Sequencing Data (針對高通量定序資料之可延展序列組合演算法)
Author:	Chien-Chih Chen (陳建智)
Advisor:	Feipei Lai (賴飛羆)
Co-advisors:	Chuen-Liang Chen (陳俊良), Jan-Ming Ho (何建明)
Keywords:	sequence assembly, parallel computing, genomics, transcriptomics, bioinformatics
Publication year:	2013
Degree:	Ph.D.
Abstract:	DNA sequencing is one of the most important procedures in molecular biology research; it determines the sequence of bases in a given DNA segment. With the development of next-generation sequencing technologies, studies in genomics and transcriptomics are moving into a new era. However, current sequencing technologies cannot read an entire genome or transcript in one step. Instead, the sample is broken into many short fragments, reads of 20–1000 bases are produced, and a sequence assembly algorithm must reconstruct the genome or transcriptome from them. Thus, sequence assembly continues to be one of the central problems in bioinformatics.
The challenges facing sequence assembly include the following: (1) sequencing errors, (2) repeat sequences, (3) nonuniform coverage, and (4) the computational complexity of processing large volumes of data. Among these challenges, given the rapid growth in data throughput delivered by next-generation sequencing technologies, the most pressing need is for sequence assembly software that can efficiently handle massive sequencing data by using scalable, on-demand computing resources. These requirements fit the model of cloud computing, in which computing resources can be allocated on demand over the Internet from several thousand computers offered by vendors for analyzing data in parallel. Such scalable cloud applications for large datasets are commonly developed under the MapReduce framework. In this dissertation, we propose CloudBrush, a parallel pipeline that runs on the MapReduce framework for de novo assembly of high-throughput sequencing data. CloudBrush is based on bidirected string graphs, and its analysis consists of two main stages: graph construction and graph simplification. During graph construction, a node is defined for each nonredundant sequence read, and an edge is defined for each overlap between reads. We have developed a prefix-and-extend algorithm for identifying overlaps between pairs of reads. The graph is then simplified by using conventional operations such as transitive reduction, path compression, tip removal, and bubble removal. We also introduce a new operation, edge adjustment, for removing complex topological structures that sequencing errors create in string graphs. This operation uses the sequence information of all graph neighbors of each read and eliminates the edges connecting to reads that contain rare bases. CloudBrush was evaluated against the Genome Assembly Gold-Standard Evaluation (GAGE) benchmarks to compare its assembly quality with that of other assemblers.
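The graph-construction and transitive-reduction steps described above can be illustrated with a small in-memory sketch. This is not the CloudBrush implementation: it does not use MapReduce, it replaces the prefix-and-extend algorithm with a naive all-pairs suffix/prefix comparison, and it ignores reverse complements, so the graph is directed rather than bidirected. The function names and the `min_overlap` parameter are hypothetical and chosen only for illustration.

```python
from itertools import permutations


def overlap_length(a, b, min_overlap):
    """Return the length of the longest proper suffix of `a` that is a prefix of `b`.

    Returns 0 when no overlap of at least `min_overlap` bases exists.
    """
    shortest = max(min_overlap, 1)
    for size in range(min(len(a), len(b)) - 1, shortest - 1, -1):
        if a[-size:] == b[:size]:
            return size
    return 0


def build_overlap_graph(reads, min_overlap):
    """One node per nonredundant read; edge a -> b weighted by overlap length."""
    unique_reads = sorted(set(reads))
    graph = {read: {} for read in unique_reads}
    # Naive O(n^2) comparison of every ordered pair of distinct reads.
    for a, b in permutations(unique_reads, 2):
        size = overlap_length(a, b, min_overlap)
        if size:
            graph[a][b] = size
    return graph


def transitive_reduction(graph):
    """Remove edge a -> c whenever edges a -> b and b -> c both exist.

    Simplification: real string-graph reduction also checks that the
    overlaps are consistent; this sketch only looks at graph topology.
    """
    reduced = {node: dict(edges) for node, edges in graph.items()}
    for a, edges in graph.items():
        for b in edges:
            for c in graph[b]:
                if c != b and c in edges:
                    reduced[a].pop(c, None)
    return reduced


reads = ["ACGTAC", "GTACGG", "ACGGTT"]
graph = build_overlap_graph(reads, min_overlap=2)
reduced = transitive_reduction(graph)
```

In this toy input, the read `ACGTAC` overlaps both of the other reads, but its shorter overlap with `ACGGTT` is implied by the path through `GTACGG`, so the reduction removes that edge and leaves a simple chain.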
The results showed that our assemblies have a moderate N50 and low rates of misassembly (misjoins and indels). In addition, we have introduced two measures, precision and recall, to assess how faithfully the contigs align to the target genomes. Compared with the assembly tools used in the GAGE benchmarks, CloudBrush was found to produce contigs with high precision and recall. We have also introduced a T-CloudBrush pipeline for transcriptome data. T-CloudBrush uses the multiple-k concept to overcome the problem of nonuniform coverage in transcriptome data. This concept is based on the observation that, when assembly quality is taken into account, sequencing coverage is positively correlated with the overlap size used during assembly. The experimental results showed that T-CloudBrush improves the accuracy of de novo transcriptome assembly. In summary, this dissertation explores the challenges facing sequence assembly under a scalable computing framework and provides possible solutions for the problems of sequencing errors, nonuniform coverage, and processing of large volumes of data.
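The N50 statistic mentioned in the abstract has a standard definition: it is the largest contig length L such that contigs of length at least L together account for at least half of the total assembly length. A minimal sketch of that computation follows; the function name `n50` and the example lengths are illustrative and do not come from the thesis or the GAGE evaluation scripts.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 if it is empty)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0


example_n50 = n50([80, 70, 50, 40, 30, 20])
```

For the example lengths, the total is 290, and the two longest contigs (80 and 70) already cover 150 bases, so the N50 is 70.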
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/6366
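Full-text authorization:	Authorized (open to the public worldwide)
Appears in department:	Department of Computer Science and Information Engineering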

Files in this item:

File	Size	Format
ntu-102-1.pdf	1.53 MB	Adobe PDF


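Unless their copyright terms are specifically stated otherwise, all items in this system are protected by copyright, with all rights reserved.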
