Methods based on distance statistics for detection of differentially expressed genes and gene set enrichment analysis

Yu-Chuan Tang; 湯育全

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/21617

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	蔡政安(Chen-An Tsai)
dc.contributor.author	Yu-Chuan Tang	en
dc.contributor.author	湯育全	zh_TW
dc.date.accessioned	2021-06-08T03:39:56Z	-
dc.date.copyright	2019-07-15
dc.date.issued	2019
dc.date.submitted	2019-07-08
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/21617	-
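dc.description.abstract	The first part of this thesis addresses single differentially expressed gene analysis. Statistical methods such as the t-test or SAM treat each gene as independent and test each one separately for differential expression, but ignoring the correlation between genes can bias the test results. Recently, a new statistic called the OR value was proposed; its advantage is that it requires no model assumptions or parameter estimation, and it uses Euclidean distance to account for the relationship between the gene under test and all other genes, as well as the overall dispersion of the data. In this thesis, gene expression data are simulated from multivariate normal, multivariate t, and mixture distributions; the OR value is then used to test whether individual genes are differentially expressed, and the results are compared with the t-test and with methods that do not use the OR value. The weighted quantile difference method using the OR value performed well in all settings; in particular, it achieved high power and a lower false discovery rate under the multivariate t distribution with high correlation between genes, and under the mixture distribution when the difference in shift between the two multivariate normal components was greater than 0. The second part, gene set analysis, adopts the self-contained hypothesis; the testing methods mainly adapt the statistics from the first part that quantify differences in gene expression, and they are compared with commonly used gene set analysis methods. Only under the multivariate t distribution did distance-based methods such as the sum of quantile differences, the sum of weighted quantile differences, and the energy test show clearly better power than the other methods; in the other settings no method showed a particular advantage. The third part applies single differentially expressed gene analysis and gene set analysis to real data from a group of breast cancer patients and compares the results with other methods. Overall, the OR value is worth considering for single differentially expressed gene analysis, but gene set analysis may still need a more robust statistic.	zh_TW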
dc.description.abstract	The first part of this thesis examines the analysis of individual differentially expressed genes. Statistical methods such as the t-test or SAM treat each gene as independent and test each gene separately for differential expression. However, because these methods ignore the correlation between genes, their results may be biased. A novel statistic, the OR value, was recently proposed for identifying differentially expressed genes. Its advantages are that it requires neither model assumptions nor parameter estimation, and that it uses Euclidean distance to account for the correlation between genes and the dispersion of the data. In this thesis, gene expression data are simulated from the multivariate normal distribution, the multivariate t distribution, and a mixture distribution; the OR value is then used to identify differentially expressed genes, and the results are compared with the commonly used t-test and with non-OR methods. The weighted quantile difference method using the OR value performs well in all cases, especially under the multivariate t distribution with a high correlation coefficient and under the mixture distribution with a shift greater than 0. The second part addresses gene set analysis (GSA) under the self-contained hypothesis. The statistics from the first part are adapted for GSA and compared with commonly used gene set analysis methods. Only under the multivariate t distribution do the distance-based methods (the sum of quantile differences, the sum of weighted quantile differences, and the energy test) perform better than the other methods; under the other conditions, no method clearly outperforms the rest. Finally, the OR-based method and competing methods are applied to a large-scale dataset from a group of breast cancer patients for both differentially expressed gene analysis and gene set analysis. In summary, the OR value is a worthwhile method for differentially expressed gene analysis, but a more robust statistic may be needed to extend the analysis to the gene-set level.	en
dc.description.provenance	Made available in DSpace on 2021-06-08T03:39:56Z (GMT). No. of bitstreams: 1 ntu-108-R06621207-1.pdf: 3967027 bytes, checksum: 0d495b7c89cf27de120e19aec93cfd6a (MD5) Previous issue date: 2019	en
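dc.description.tableofcontents	Chapter 1 Research Background 1; Chapter 2 Single Differentially Expressed Gene Analysis 3; Section 1 Analysis Methods 3; (1) Methods Using the OR Value 3; (2) Other Methods 6; (3) Evaluation Metrics 7; Section 2 Simulation Settings 8; (1) Multivariate Normal Distribution 8; (2) Multivariate t Distribution 9; (3) Mixture Distribution 11; Section 3 Results 13; Chapter 3 Gene Set Analysis 15; Section 1 Analysis Methods 15; (1) Distance-Based Methods 15; (2) Other Commonly Used Methods 16; Section 2 Simulation Settings 20; (1) Multivariate Normal Distribution 20; (2) Multivariate t Distribution 21; (3) Mixture Distribution 23; Section 3 Results 25; Chapter 4 Real Data Analysis 27; Section 1 Data Processing 27; Section 2 Single Differentially Expressed Gene Analysis 27; Section 3 Gene Set Analysis 29; Chapter 5 Discussion 30; References 32; Appendix 190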
dc.language.iso	zh-TW
dc.title	以距離統計量為基準的基因選取方法及基因富集分析	zh_TW
dc.title	Methods based on distance statistics for detection of differentially expressed genes and gene set enrichment analysis	en
dc.type	Thesis
dc.date.schoolyear	107-2
dc.description.degree	Master's
dc.contributor.oralexamcommittee	劉力瑜(Li-yu Liu),蔡欣甫(Shin-Fu Tsai)
dc.subject.keyword	微陣列,差異表現基因,基因集分析,分位數差距	zh_TW
dc.subject.keyword	microarray,differentially expressed genes,gene set analysis,quantile difference	en
dc.relation.page	192
dc.identifier.doi	10.6342/NTU201901248
dc.rights.note	Not authorized
dc.date.accepted	2019-07-08
dc.contributor.author-college	College of Bioresources and Agriculture	zh_TW
dc.contributor.author-dept	Graduate Institute of Agronomy	zh_TW
Appears in Collections:	Department of Agronomy
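
The abstract above names the energy test as one of the distance-based methods used for gene set analysis under the self-contained hypothesis. As a rough illustration only, the following is a minimal sketch of a two-sample energy statistic with a label-permutation p-value. It is not the thesis's implementation: the function names, the toy simulation (independent normal genes, 15 samples per group, a shift of 1.5) and the number of permutations are all assumptions chosen for the example.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_pairwise(xs, ys):
    """Average distance over all pairs (x, y) with x in xs and y in ys."""
    return sum(euclidean(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def energy_statistic(xs, ys):
    """Two-sample energy statistic; larger values suggest different distributions."""
    n, m = len(xs), len(ys)
    e = 2.0 * mean_pairwise(xs, ys) - mean_pairwise(xs, xs) - mean_pairwise(ys, ys)
    return n * m / (n + m) * e


def permutation_pvalue(xs, ys, n_perm=200, seed=0):
    """Permutation p-value obtained by shuffling which samples belong to which group."""
    rng = random.Random(seed)
    observed = energy_statistic(xs, ys)
    pooled = list(xs) + list(ys)
    n = len(xs)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if energy_statistic(pooled[:n], pooled[n:]) >= observed:
            exceed += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return (exceed + 1) / (n_perm + 1)


def simulate(n_samples, n_genes, shift, rng):
    """Toy data (an assumption): independent normal genes with a common mean shift."""
    return [[rng.gauss(shift, 1.0) for _ in range(n_genes)] for _ in range(n_samples)]


rng = random.Random(1)
control = simulate(15, 5, 0.0, rng)  # hypothetical gene set of 5 genes, 15 samples
tumor = simulate(15, 5, 1.5, rng)    # same gene set with every gene shifted by 1.5

statistic = energy_statistic(control, tumor)
p_value = permutation_pvalue(control, tumor)
print(f"energy statistic = {statistic:.3f}, permutation p-value = {p_value:.3f}")
```

Because the null distribution comes from permuting the sample labels, the sketch needs no distributional assumption. The toy data use independent genes, so they do not reproduce the correlated multivariate normal, multivariate t or mixture settings described in the abstract.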

Files in This Item:

File	Size	Format
ntu-108-1.pdf (not authorized for public access)	3.87 MB	Adobe PDF


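Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.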