Methods based on distance statistics for detection of differentially expressed genes and gene set enrichment analysis

Yu-Chuan Tang; 湯育全

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/21617

Title:	Methods based on distance statistics for detection of differentially expressed genes and gene set enrichment analysis
Author:	Yu-Chuan Tang (湯育全)
Advisor:	Chen-An Tsai (蔡政安)
Keywords:	microarray, differentially expressed genes, gene set analysis, quantile difference
Year of publication:	2019
Degree:	Master's
Abstract:	The first part of this thesis studies the detection of individual differentially expressed genes. Statistical methods such as the t-test or SAM treat each gene as independent and test each one separately; because they ignore the correlation between genes, their results may be biased. A recently proposed statistic, the OR value, addresses this: it requires no model assumptions and no parameter estimation, and it uses Euclidean distance to account for the relationship between the gene under test and all other genes as well as the overall dispersion of the data. In this thesis, gene expression data are simulated from the multivariate normal distribution, the multivariate t distribution, and a mixture distribution; the OR value is then used to test whether each gene is differentially expressed, and the results are compared with those of the t-test and of methods that do not use the OR value. The weighted quantile difference method using the OR value performs well in all cases. It has high power and a lower false discovery rate especially under the multivariate t distribution with high correlation between genes, and under the mixture distribution when the shift between the two multivariate normal components is greater than 0. The second part addresses gene set analysis (GSA) under the self-contained hypothesis. The statistics that quantify expression differences in the first part are adapted for gene set analysis and compared with commonly used GSA methods. Only under the multivariate t distribution do the distance-based methods (the sum of quantile differences, the sum of weighted quantile differences, and the energy test) show clearly higher power than the other methods; under the other conditions no method has a clear advantage. In the third part, the OR-based method and competing methods are applied to a large-scale real dataset from a group of breast cancer patients for both single-gene differential expression analysis and gene set analysis. In summary, the OR value is worth considering for single-gene differential expression analysis, but gene set analysis may still require a more robust statistic.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/21617
DOI:	10.6342/NTU201901248
Full-text authorization:	Not authorized
Appears in department:	Department of Agronomy

Files in this document:

File	Size	Format
ntu-108-1.pdf (not authorized for public access)	3.87 MB	Adobe PDF


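Unless their copyright terms are specifically stated otherwise, documents in this system are protected by copyright, with all rights reserved.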
