Improving AD Prediction using Genotyping Data Combined with Biological Pathways and Gene Expression Profiles

Hur Wang; 王賀

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/64850

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	陳倩瑜
dc.contributor.author	Hur Wang	en
dc.contributor.author	王賀	zh_TW
dc.date.accessioned	2021-06-16T23:01:56Z	-
dc.date.available	2025-03-06
dc.date.copyright	2020-03-06
dc.date.issued	2020
dc.date.submitted	2020-02-25
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/64850	-
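dc.description.abstract	With major advances in sequencing technology and computing resources, the association between disease genotypes and phenotypes has drawn growing attention. Over the past decade, many studies have used genome-wide association studies (GWAS) to investigate the effect of genetic variation on disease and have found associations between thousands of single nucleotide polymorphisms (SNPs) and diseases or traits. Geneticists argue, however, that this approach seriously underestimates the underlying biological mechanisms of complex diseases because it ignores interactions between individual variants, known as epistasis. Genome-wide epistasis analysis remains a major challenge: the analysis must expand all pairwise combinations of single variants, which requires substantial computing resources. Many machine-learning methods for epistasis detection have been proposed in recent years. In earlier work, our lab proposed GenEpi (Gene-based Epistasis Discovery), a detection method that first groups genetic variants by gene to find feature subsets and then builds models to search for epistasis, which effectively reduces computational complexity. However, this method may overlook potential cross-gene epistasis during gene selection. This thesis therefore proposes another method that integrates biological pathway data to improve the ability to predict cross-gene epistasis, addressing the cross-gene epistasis that GenEpi may miss. Because integrating pathway data produces too many combinations, the thesis further integrates gene expression data and uses differentially expressed genes to reduce computing time. First, the Limma package in R is used to compare the experimental group with the control group among the Alzheimer's disease samples and select differentially expressed genes. Next, gene pairs containing at least one differentially expressed gene are extracted from the Pathway Commons database, and each gene pair is modeled using combinatorial encoding and feature selection by L1-regularized linear regression. Finally, the features selected for each gene pair are pooled to build the final model for phenotype prediction. The genome-wide association data used in this study come from the Alzheimer's Disease Neuroimaging Initiative. Differential expression analysis yielded 192 differentially expressed genes; mapping these genes to the Pathway Commons Database gave 18,234 gene pairs; feature selection gave 42,427 features distributed over 11,139 gene pairs. After all features were merged and modeled again, the final prediction model selected 32 variant-site features, including the well-known biomarker APOE. In the final prediction model, 10-fold cross-validation gave a prediction accuracy and F1 score of 0.843 and 0.780, respectively, both higher than the results of the original GenEpi. By using biological pathway data and gene expression data, this study achieves better disease phenotype prediction; the SNPs obtained with this method can provide important leads for future functional studies and help elucidate the pathogenic mechanisms of complex diseases.	zh_TW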
dc.description.abstract	Epistasis is the interaction between genetic variants associated with phenotypes and is key to understanding complex diseases such as Alzheimer’s disease (AD). However, discovering epistasis is time-consuming because it requires testing all interactions among millions of variants. A previous study from our lab (GenEpi) used gene-based epistasis analysis, grouping the genetic variants within each gene to reduce computational complexity. With that approach, potential cross-gene epistasis may be neglected during gene selection. This thesis therefore presents a new method that integrates biological pathways to improve the prediction of AD using cross-gene epistasis. Moreover, if expression profiles are available, differentially expressed genes can be used to further reduce the computing time. First, differentially expressed genes (DEGs) were identified by comparing AD samples with control subjects using the Limma package in R. Next, gene pairs in the Pathway Commons Database that contain at least one DEG were obtained. Then, each gene pair was modeled using the two-element combinatorial encoding and the L1-regularized regression with stability selection proposed by GenEpi. After that, the features selected for each gene pair were pooled to construct the final model for predicting the phenotype. The genome-wide association (GWA) data and expression profiles used in this thesis are from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. We found 192 DEGs after differential expression analysis and obtained 18,234 gene pairs after mapping the DEGs to the Pathway Commons Database. After feature selection, we obtained 42,427 features in 11,139 gene pairs. We collected these features and modeled them again. The final prediction model contains 32 significant SNP features, including the well-known AD biomarker APOE. The 10-fold cross-validation (CV) accuracy and F1 score of the final model are 0.843 and 0.780, respectively, both better than those of the original version of GenEpi. By leveraging pathway data and gene expression data, the proposed method predicts the phenotype better. The discovered SNPs should provide important leads for designing future functional studies to understand the mechanisms of the complex disease AD.	en
dc.description.provenance	Made available in DSpace on 2021-06-16T23:01:56Z (GMT). No. of bitstreams: 1 ntu-109-R06631013-1.pdf: 4227079 bytes, checksum: 782cd7554ae8d369b0d2f0e649fd06c8 (MD5) Previous issue date: 2020	en
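dc.description.tableofcontents	Acknowledgements; Chinese Abstract; Abstract; Table of Contents; List of Figures; List of Tables; Chapter 1 Introduction: 1.1 Background, 1.2 Motivation, 1.3 Objectives; Chapter 2 Literature Review: 2.1 Alzheimer's Disease, 2.2 Genome-wide Association Studies, 2.3 Epistasis, 2.4 Epistasis Detection Methods (2.4.1 Prediction Methods Based on Multifactor Dimensionality Reduction, 2.4.2 Tree-based Prediction Methods, 2.4.3 Neural-network-based Prediction Methods, 2.4.4 Prediction Methods Combining Neural Networks and Random Forests, 2.4.5 Prediction Methods That Group SNP Features by Gene), 2.5 Simulated Data for Epistasis; Chapter 3 Methods: 3.1 Downloading the UCSC Database, 3.2 Linkage Disequilibrium Block Analysis, 3.3 Differential Expression Analysis, 3.4 Mapping Features to Gene Pairs in the Biological Pathway Data, 3.5 Combinatorial Encoding, Quality Control, and Feature Selection, 3.6 Feature Merging, 3.7 Phenotype Prediction; Chapter 4 Results and Discussion: 4.1 Differential Expression Analysis, 4.2 Comparison of Prediction Results for Gene Pairs, 4.3 Effect of Differentially Expressed Genes and Gene Interaction Data, 4.4 Comparison of Prediction Models, 4.5 Feature Analysis; Chapter 5 Conclusion; Chapter 6 References; Appendix 1 Important Features of ExtendSingleDEG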
dc.language.iso	zh-TW
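dc.title	Integrating Genotype Information with Biological Pathway and Gene Expression Data to Improve the Accuracy of Alzheimer's Disease Prediction	zh_TW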
dc.title	Improving AD Prediction using Genotyping Data Combined with Biological Pathways and Gene Expression Profiles	en
dc.type	Thesis
dc.date.schoolyear	108-1
dc.description.degree	Master's
dc.contributor.oralexamcommittee	歐陽彥正,蔡承宏
dc.subject.keyword	Alzheimer's disease, genome-wide association studies, epistasis, biological pathway data	zh_TW
dc.subject.keyword	Alzheimer’s disease, Epistasis, Genome-wide association studies, Biological pathway data, Gene expression profiles	en
dc.relation.page	32
dc.identifier.doi	10.6342/NTU202000208
dc.rights.note	Paid license
dc.date.accepted	2020-02-25
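dc.contributor.author-college	College of Bioresources and Agriculture	zh_TW
dc.contributor.author-dept	Department of Biomechatronics Engineering	zh_TW
Appears in Collections:	Department of Biomechatronics Engineering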
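
The gene-pair filtering and two-element combinatorial encoding steps described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function names, the toy gene and SNP identifiers (GENE_A, rs_a, and so on), and the choice to encode each genotype (0, 1, or 2 copies of an allele) as a one-hot indicator and multiply the indicators of two SNPs are all assumptions made for the example.

```python
from itertools import combinations, product

GENOTYPES = (0, 1, 2)  # assumed coding: number of copies of one allele


def pairs_with_deg(gene_pairs, degs):
    """Keep gene pairs in which at least one gene is differentially expressed."""
    degs = set(degs)
    return [(a, b) for a, b in gene_pairs if a in degs or b in degs]


def one_hot_genotypes(sample):
    """Turn {snp: genotype} into binary indicators keyed by (snp, genotype)."""
    return {
        (snp, g): int(value == g)
        for snp, value in sample.items()
        for g in GENOTYPES
    }


def pairwise_encode(sample):
    """Two-element combinatorial encoding (assumed form): for every pair of
    distinct SNPs and every genotype combination, the feature is 1 only when
    the sample carries both genotypes."""
    single = one_hot_genotypes(sample)
    features = {}
    for snp_a, snp_b in combinations(sorted(sample), 2):
        for g_a, g_b in product(GENOTYPES, repeat=2):
            key = (snp_a, g_a, snp_b, g_b)
            features[key] = single[(snp_a, g_a)] * single[(snp_b, g_b)]
    return features


# Hypothetical toy inputs; only APOE is a gene named in the abstract.
gene_pairs = [("APOE", "GENE_A"), ("GENE_B", "GENE_C"), ("GENE_C", "APOE")]
degs = {"APOE"}
kept = pairs_with_deg(gene_pairs, degs)

sample = {"rs_a": 2, "rs_b": 0}  # one subject, two hypothetical SNPs
encoded = pairwise_encode(sample)
```

Two SNPs with three genotypes each give nine interaction features per subject, exactly one of which is set. In a pipeline like the one the abstract describes, such features would then be passed to an L1-regularized model for selection, which is not shown here.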

Files in this item:

File	Size	Format
ntu-109-1.pdf (not currently authorized for public access)	4.13 MB	Adobe PDF

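Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.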