Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/52139
Title: | 以貝氏分析方法來偵測轉錄體定序資料之顯著基因 Identification of Differentially Expressed Genes of RNA-Seq Data based on Bayesian Approaches |
Author: | Yu-Shiang Zeng 曾禹翔 |
Advisor: | 蔡政安 |
Keywords: | RNA-seq, Gene expression, Bayesian inference, Log-linear model |
Year of publication: | 2015 |
Degree: | Master's |
Abstract: | Next Generation Sequencing (NGS) technology has developed rapidly and matured in recent years, and it is now widely used in fields such as medical science, agriculture and biotechnology. NGS makes whole-genome sequencing and de novo sequencing possible and supports the exploration of biological theory; one of its core applications is RNA-seq. RNA-seq data are used to measure gene expression levels and to test whether a specific gene is differentially expressed, and they have gradually replaced microarray data as the standard for studying gene expression. However, RNA-seq read counts are discrete, and their variance is typically larger than their mean, a phenomenon known as over-dispersion. The negative binomial model is commonly applied to deal with over-dispersion, but estimating its parameters is a further issue that involves many statistical methods. Commonly used analysis tools for RNA-seq data, such as DESeq, edgeR and DSS, rely on point estimates of the parameters and do not take the uncertainty in those estimates into account. In this thesis, we build two models, a log-linear model and a Bayesian hierarchical model, and use the Markov chain Monte Carlo (MCMC) method to estimate the parameters of interest, which are then used to detect differentially expressed genes.
Finally, we compare the performance of DESeq, edgeR, DSS and our methods on both simulated and real RNA-seq data. Our log-linear model outperforms DESeq, edgeR and DSS when the numbers of replicates in the groups are close or equal. When the numbers of replicates are extremely unbalanced, we suggest using the median estimator to detect differentially expressed genes. |
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/52139 |
Full-text license: | Paid license |
Appears in department: | Department of Agronomy |
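The over-dispersion mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common negative binomial parameterization in which the variance equals `mu + phi * mu**2`, where `phi` is the dispersion. The function names and the read counts below are hypothetical and chosen only for illustration. The method-of-moments estimate shown is a simple point estimate of the kind the abstract contrasts with the MCMC approach; it is not the thesis's estimation procedure.

```python
from statistics import mean, variance


def nb_variance(mu: float, phi: float) -> float:
    """Variance of a negative binomial with mean mu and dispersion phi.

    With phi = 0 this reduces to the Poisson case (variance == mean).
    """
    return mu + phi * mu ** 2


def moment_dispersion(counts: list[int]) -> float:
    """Method-of-moments dispersion estimate, floored at 0 (the Poisson case)."""
    m = mean(counts)
    s2 = variance(counts)  # sample variance (n - 1 denominator)
    if m == 0:
        return 0.0
    return max((s2 - m) / m ** 2, 0.0)


# Hypothetical read counts for one gene across five replicates.
counts = [12, 30, 7, 45, 21]
phi_hat = moment_dispersion(counts)
print(f"mean={mean(counts):.1f} variance={variance(counts):.1f} dispersion={phi_hat:.3f}")
```

A dispersion estimate above zero indicates that the sample variance exceeds the mean, which is the situation a Poisson model cannot capture and the negative binomial model is meant to handle.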
Files in this document:
File | Size | Format | |
---|---|---|---|
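ntu-104-1.pdf Not currently authorized for public access | 4.73 MB | Adobe PDF |
Unless their copyright terms are specifically indicated, all documents in the system are protected by copyright, with all rights reserved.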