Effect of de novo transcriptome assembly on quality of read mapping and transcript quantification

Ping-Han Hsieh; 謝秉翰

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/68808

Title:	Effect of de novo transcriptome assembly on quality of read mapping and transcript quantification (探討轉錄體序列組裝對序列回貼以及基因表現定量的影響)
Author:	Ping-Han Hsieh (謝秉翰)
Advisor:	Yen-Jen Oyang (歐陽彥正)
Co-advisor:	Chien-Yu Chen (陳倩瑜)
Keywords:	RNA-Seq, de novo transcriptome assembly, wrongly-assembled contigs, quantification of transcript abundance, supervised machine learning
Year of publication:	2017
Degree:	Master's
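Abstract (translated from Chinese):	RNA sequencing reveals how the transcriptome is expressed at different growth stages or physiological states, and thereby sheds light on gene regulatory pathways within an organism. Moreover, because RNA sequencing does not require a reference genome or transcriptome in advance, it is particularly suited to species whose genomes are not yet well annotated or that have never been studied. Without a reference sequence, researchers must assemble and reconstruct transcript sequences from the short sequenced RNA fragments. However, redundant or erroneous sequences produced during assembly may severely affect downstream quantification. How to estimate transcript abundance correctly by computational means is therefore an important problem. This thesis aims to evaluate how the quality of transcriptome assembly affects algorithms that quantify transcript abundance. Assembled sequences are classified into twelve categories with distinct meanings, and quantification is analyzed and compared for each category. The results show that even though the transcripts in an organism share a great deal of similarity, this has little effect on quantification against a reference genome or transcriptome; it does, however, cause assembly errors, which in turn lead to larger quantification errors for the assembled sequences. Assembly errors that merge multiple similar sequences into a single sequence have the most severe effect. In addition, this thesis proposes a supervised learning algorithm for predicting assembly errors, which can help future researchers better understand the assembled sequences they analyze. In summary, by comparing multiple assembly and quantification algorithms, this study gives researchers a better understanding of transcriptome assembly and quantification for species without a reference sequence.
Abstract (English):	Correct quantification of transcript abundance is essential for understanding the functional products of the genome under different physiological conditions and at different developmental stages. The development of high-throughput RNA sequencing (RNA-Seq) allows researchers to perform transcriptome analysis for organisms without a reference genome or transcriptome. For such projects, de novo transcriptome assembly must be carried out prior to quantification. However, the large number of fragmented contigs and redundant sequences produced by assemblers may result in unreliable abundance estimation. This study therefore first investigates how assembly quality affects the quality of read mapping and count estimation, and then proposes a classifier to characterize the assembled sequences. The experiments and analyses examine several important factors that might seriously affect the accuracy of RNA-Seq analysis. First, the effects on quantification quality of twelve distinct assembly groups, along with the intrinsic similarity present in the reference transcriptome, were examined. The results showed that similar subsequences present in the reference transcriptome only slightly influence mapping quality but lead to many poorly assembled contigs. Contigs that merge multiple transcripts into one decreased the reliability of abundance estimation most heavily. Second, a prediction algorithm was proposed to help researchers estimate quantification reliability for further analyses. In summary, the analytic results of this study provide valuable insights for future studies related to RNA-Seq data analysis.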
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/68808
DOI:	10.6342/NTU201703727
Full-text license:	Paid license
Appears in department:	Graduate Institute of Biomedical Electronics and Bioinformatics

Files in this item:

File	Size	Format
ntu-106-1.pdf (not authorized for public access)	33.12 MB	Adobe PDF

