Utilizing a Pre-trained Model for Predicting HLA Types from SNP Data

沈宜萱; Yi-Hsuan Shen

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96534

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	趙坤茂	zh_TW
dc.contributor.advisor	Kun-Mao Chao	en
dc.contributor.author	沈宜萱	zh_TW
dc.contributor.author	Yi-Hsuan Shen	en
dc.date.accessioned	2025-02-19T16:24:09Z	-
dc.date.available	2025-02-20	-
dc.date.copyright	2025-02-19	-
dc.date.issued	2025	-
dc.date.submitted	2025-02-05	-
dc.identifier.citation	S. R. Browning and B. L. Browning. Beagle: Software for the efficient analysis of genome-wide association studies. https://faculty.washington.edu/browning/beagle/beagle.html#download. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint, arXiv:1810.04805, 2018. A. Dilthey, S. Leslie, L. Moutsianas, J. Shen, C. Cox, M. R. Nelson, and G. McVean. Multi-population classical HLA type imputation. PLoS Computational Biology, 9(2):e1002877, 2013. A. T. Dilthey, L. Moutsianas, Leslie, et al. HLA*IMP: An integrated framework for imputing classical HLA alleles from SNP genotypes. Bioinformatics, 27(10):968–972, 2011. J. He, S. Zhang, and C. Fang. Prediction of DNA enhancers based on multi-species genomic base model DNABERT-2 and BiGRU network. In Proceedings of the 2024 4th International Conference on Bioinformatics and Intelligent Computing, BIC ’24, page 375–379, 2024. B. Institute. SNP2HLA website. https://software.broadinstitute.org/mpg/snp2hla/. A. Jaramillo and K. Hacke. The human leukocyte antigen system: Nomenclature and DNA-based typing for transplantation. In S. Gönen, editor, Human Leukocyte Antigens, chapter 4. IntechOpen, Rijeka, 2023. Y. Ji, Z. Zhou, H. Liu, and R. V. Davuluri. DNABERT: pre-trained bidirectional encoder representations from transformers model for DNA-language in genome. Bioinformatics, 37(15):2112–2120, 2021. X. Jia, B. Han, S. Onengut-Gumuscu, Chen, et al. Imputing amino acid polymorphisms in human leukocyte antigens. PLoS One, 8(6):e64683, 2013. K. Kim, S. Bang, H. Lee, and S. Bae. Construction and application of a Korean reference panel for imputing classical alleles and amino acids of human leukocyte antigen genes. PLoS One, 9(11):e112546, 2014. S. Leslie, P. Donnelly, and G. McVean. A statistical method for predicting classical HLA alleles from SNP data. American Journal of Human Genetics, 82(1):48–56, 2008. I. Moaz, A. Abdallah, M. Yousef, et al. Main insights of genome wide association studies into HCV-related HCC. Egypt Liver Journal, 10(4), 2020. T. Naito. DEEP-HLA. https://github.com/tatsuhikonaito/DEEP-HLA, 2023. T. Naito and Y. Okada. HLA imputation and its application to genetic and molecular fine-mapping of the MHC region in autoimmune diseases. Seminars in Immunopathology, 44, 2022. T. Naito, K. Suzuki, J. Hirata, et al. A deep learning method for HLA imputation and trans-ethnic MHC fine-mapping of type 1 diabetes. Nature Communications, 12(1):1639, 2021. P. J. Norman, S. J. Norberg, N. Nemat-Gorgani, T. Royce, J. A., et al. Very long haplotype tracts characterized at high resolution from HLA homozygous cell lines. Immunogenetics, 67:601–623, 2015. B. Sivaprakasam and P. Sadagopan. HLA allele type prediction: A review on concepts, methods and algorithms. Asian Journal of Biological and Life Sciences, 12, 2023. K. Tanaka, K. Kato, N. Nonaka, et al. Efficient HLA imputation from sequential SNPs data by transformer. Journal of Human Genetics, 69:533–540, 2024. H. Wang, J. Li, H. Wu, E. Hovy, and Y. Sun. Pre-trained language models and their applications. Engineering, 25:51–65, 2023. X. Zheng, J. Shen, C. Cox, et al. HIBAG—HLA genotype imputation with attribute bagging. Pharmacogenomics Journal, 14:192–200, 2014. Z. Zhou, Y. Ji, W. Li, P. Dutta, R. V. Davuluri, and H. Liu. DNABERT-2: Efficient foundation model and benchmark for multi-species genome. ArXiv, abs/2306.15006, 2023.	-
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96534	-
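dc.description.abstract	Human leukocyte antigens (HLA) are closely tied to immune response, organ transplantation, and disease susceptibility. Determining HLA types and then performing fine-mapping association analysis helps identify the key loci underlying disease. Because directly sequencing HLA genotypes at scale is very time-consuming, HLA imputation was developed: a method that applies statistical and related techniques to SNP data to predict HLA types, and in recent years deep learning has increasingly been applied as well. Unsupervised pre-trained models, which learn from large amounts of unlabeled data and adapt to various downstream tasks through fine-tuning, are a popular approach. This study combines the strengths of BERT-based models in processing text sequences with the linkage disequilibrium of the HLA gene region: SNP loci from the reference dataset are converted into contiguous sequences, the pre-trained DNABERT-2 model is used to extract SNP sequence features and learn the sequences' contextual information, and different fine-tuning strategies are tried in order to predict HLA types effectively. The Pan-Asian and Korean datasets are used for validation. During fine-tuning, oversampling and class weighting are further tested to handle the highly imbalanced classes, and stratified k-fold cross-validation is used to assess model stability. The results confirm that, with data preprocessing and fine-tuning strategies, non-contiguous SNP sequences can still effectively improve the accuracy of HLA type prediction under limited resources. Many previous studies have shown the contribution of highly accurate HLA imputation to disease fine-mapping. Using deep learning to capture the information between gene sequences effectively may reduce reliance on large reference panels and improve prediction accuracy and efficiency, in turn advancing drug development and personalized medicine.	zh_TW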
dc.description.abstract	Human Leukocyte Antigen (HLA) is closely associated with immune responses, organ transplantation, and disease susceptibility. Identifying HLA types and then conducting fine-mapping association analysis can help locate the key loci related to diseases. However, direct sequencing of HLA genotypes at scale is time-consuming. HLA imputation, which applies statistical techniques to SNP data to predict HLA types, has therefore emerged as an alternative, and deep learning has recently been applied in this field as well. Unsupervised pre-trained models learn from large amounts of unlabeled data and can be fine-tuned for various downstream tasks, which makes them a popular approach. This study builds on the strengths of BERT-based models in processing text sequences and takes into account the linkage disequilibrium of the HLA gene region. SNP loci are converted into continuous sequences, the pre-trained DNABERT-2 model extracts features and learns the contextual information of those sequences, and different fine-tuning strategies are tried to predict HLA types effectively. The Pan-Asian and Korean datasets are used for validation. During fine-tuning, oversampling and class weighting are tested to address class imbalance, and stratified k-fold cross-validation is employed to evaluate the model's stability. The results show that, with SNP data preprocessing and fine-tuning strategies, HLA types can be predicted effectively from non-continuous SNP sequences even under resource-constrained conditions. Previous research has demonstrated that high-accuracy HLA imputation is crucial for disease mapping. Deep learning methods that capture gene sequence context effectively may reduce the need for large reference panels and improve prediction accuracy and efficiency, thereby advancing precision medicine.	en
dc.description.provenance	Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-02-19T16:24:08Z No. of bitstreams: 0	en
dc.description.provenance	Made available in DSpace on 2025-02-19T16:24:09Z (GMT). No. of bitstreams: 0	en
dc.description.tableofcontents	Acknowledgements i Abstract (Chinese) ii Abstract iii Contents v List of Figures viii List of Tables x Chapter 1 Introduction 1 1.1 HLA, SNP, and Linkage Disequilibrium 1 1.1.1 HLA 1 1.1.2 SNP and Haplotype in HLA Genes 2 1.1.3 Naming of HLA Alleles and Allelic Resolution 3 1.1.4 Linkage Disequilibrium 4 1.2 HLA Imputation 5 Chapter 2 Literature Review 6 2.1 Conventional HLA Imputation Methods 6 2.2 DEEP*HLA 7 2.3 HLARIMNT 7 2.4 Pre-trained Model 9 2.4.1 BERT Model 9 2.4.2 DNABERT Model 10 2.4.3 DNABERT-2 Model 12 2.5 Machine Learning Classifier 13 2.5.1 Random Forest 13 2.5.2 K-Nearest Neighbors (KNN) 13 2.6 Neural Network 14 2.6.1 Bidirectional Gated Recurrent Unit (BiGRU) 14 2.6.2 Bidirectional Long Short-Term Memory (BiLSTM) 14 Chapter 3 Material and Method 15 3.1 Datasets 15 3.1.1 Reference Panel 16 3.1.2 Pan-Asian Reference Panel 18 3.1.3 Korean Reference Panel 18 3.2 Sample SNP Dataset 19 3.3 Data Preprocessing Flow 19 3.3.1 HLA Data Grouping 19 3.3.2 Data Preprocessing 20 3.4 Pre-trained DNABERT-2 Model 24 3.5 Strategy - Feature Extraction with DNABERT-2 for Machine Learning Classifier 24 3.5.1 Random Forest and K-Nearest Neighbors 25 3.6 Strategy - Feature Extraction with DNABERT-2 for Neural Network Classification 26 3.7 Strategy - Fine-Tuning DNABERT-2 27 3.7.1 Model Architecture 27 3.7.2 Hyperparameters and Optimization 30 3.7.3 Handling Imbalanced Data 31 3.7.4 Fine-Tuning Higher Layers of DNABERT-2 and Integrating with BiGRU 33 3.8 Evaluation Metrics 35 Chapter 4 Experimental Results 37 4.1 Data Preprocessing 37 4.1.1 Pan-Asian Reference Panel 37 4.1.2 Korean Reference Panel 38 4.2 Feature Extraction with DNABERT-2 for Machine Learning Classifier 38 4.3 Feature Extraction with DNABERT-2 for Neural Network Classification 42 4.4 Fine-Tuning DNABERT-2 44 4.4.1 Imbalanced Data 44 4.4.2 HLABERTiGRU 46 4.4.3 Overall Performance Comparison 47 Chapter 5 Conclusion 50 5.1 Discussion 50 5.2 Future Work 51 References 52	-
dc.language.iso	en	-
dc.title	利用預訓練模型進行SNP數據預測HLA型別	zh_TW
dc.title	Utilizing a Pre-trained Model for Predicting HLA Types from SNP Data	en
dc.type	Thesis	-
dc.date.schoolyear	113-1	-
dc.description.degree	Master's	-
dc.contributor.oralexamcommittee	王弘倫;吳彥緯	zh_TW
dc.contributor.oralexamcommittee	Hung-Lung Wang;Yen-Wei Wu	en
dc.subject.keyword	HLA, SNP, HLA imputation, Fine-tuning, Pre-trained Model	zh_TW
dc.subject.keyword	HLA, SNP, HLA imputation, Fine-tuning, Pre-trained Model	en
dc.relation.page	54	-
dc.identifier.doi	10.6342/NTU202500437	-
dc.rights.note	Not authorized	-
dc.date.accepted	2025-02-06	-
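dc.contributor.author-college	College of Electrical Engineering and Computer Science	-
dc.contributor.author-dept	Graduate Institute of Biomedical Electronics and Bioinformatics	-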
dc.date.embargo-lift	N/A	-
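
The abstract above mentions class weighting and stratified k-fold cross-validation as ways to cope with highly imbalanced HLA allele classes. The following is a minimal sketch of both ideas, not the thesis's implementation (which this record does not include). It assumes inverse-frequency weights and a simple round-robin assignment of each class's samples to folds; the function names and the toy allele labels are hypothetical.

```python
from collections import Counter, defaultdict
import random


def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}


def stratified_kfold(labels, k, seed=0):
    """Yield (train_idx, test_idx) pairs, spreading each class across k folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Deal the shuffled indices of this class out to the folds in turn.
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


# Hypothetical toy labels; the allele names are illustrative only.
labels = ["A*02:01"] * 6 + ["A*24:02"] * 3 + ["A*33:03"] * 3
weights = class_weights(labels)
splits = list(stratified_kfold(labels, k=3))
```

Rare classes receive larger weights, so a weighted loss penalizes their misclassification more heavily. A class with fewer than k samples cannot appear in every test fold under this scheme, which is one reason rare alleles may need grouping or oversampling before cross-validation.
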
Appears in collections:	Graduate Institute of Biomedical Electronics and Bioinformatics

Files in this item:

File	Size	Format
ntu-113-1.pdf (not currently authorized for public access)	4.79 MB	Adobe PDF

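Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.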