Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96534
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 趙坤茂 | zh_TW |
dc.contributor.advisor | Kun-Mao Chao | en |
dc.contributor.author | 沈宜萱 | zh_TW |
dc.contributor.author | Yi-Hsuan Shen | en |
dc.date.accessioned | 2025-02-19T16:24:09Z | - |
dc.date.available | 2025-02-20 | - |
dc.date.copyright | 2025-02-19 | - |
dc.date.issued | 2025 | - |
dc.date.submitted | 2025-02-05 | - |
dc.identifier.citation | S. R. Browning and B. L. Browning. Beagle: Software for the efficient analysis of genome-wide association studies. https://faculty.washington.edu/browning/beagle/beagle.html#download.
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint, arXiv:1810.04805, 2018.
A. Dilthey, S. Leslie, L. Moutsianas, J. Shen, C. Cox, M. R. Nelson, and G. McVean. Multi-population classical hla type imputation. PLoS Computational Biology, 9(2):e1002877, 2013.
A. T. Dilthey, L. Moutsianas, Leslie, et al. Hla*imp: An integrated framework for imputing classical hla alleles from snp genotypes. Bioinformatics, 27(10):968–972, 2011.
J. He, S. Zhang, and C. Fang. Prediction of dna enhancers based on multi-species genomic base model dnabert-2 and bigru network. In Proceedings of the 2024 4th International Conference on Bioinformatics and Intelligent Computing, BIC ’24, page 375–379, 2024.
B. Institute. Snp2hla website. https://software.broadinstitute.org/mpg/snp2hla/.
A. Jaramillo and K. Hacke. The human leukocyte antigen system: Nomenclature and dna-based typing for transplantation. In S. Gönen, editor, Human Leukocyte Antigens, chapter 4. IntechOpen, Rijeka, 2023.
Y. Ji, Z. Zhou, H. Liu, and R. V. Davuluri. Dnabert: pre-trained bidirectional encoder representations from transformers model for dna-language in genome. Bioinformatics, 37(15):2112–2120, 2021.
X. Jia, B. Han, S. Onengut-Gumuscu, Chen, et al. Imputing amino acid polymorphisms in human leukocyte antigens. PLoS One, 8(6):e64683, 2013.
K. Kim, S. Bang, H. Lee, and S. Bae. Construction and application of a korean reference panel for imputing classical alleles and amino acids of human leukocyte antigen genes. PLoS One, 9(11):e112546, 2014.
S. Leslie, P. Donnelly, and G. McVean. A statistical method for predicting classical hla alleles from snp data. American Journal of Human Genetics, 82(1):48–56, 2008.
I. Moaz, A. Abdallah, M. Yousef, et al. Main insights of genome wide association studies into hcv-related hcc. Egypt Liver Journal, 10(4), 2020.
T. Naito. Deep-hla. https://github.com/tatsuhikonaito/DEEP-HLA, 2023.
T. Naito and Y. Okada. Hla imputation and its application to genetic and molecular fine-mapping of the mhc region in autoimmune diseases. Seminars in Immunopathology, 44, 2022.
T. Naito, K. Suzuki, J. Hirata, et al. A deep learning method for hla imputation and trans-ethnic mhc fine-mapping of type 1 diabetes. Nature Communications, 12(1):1639, 2021.
P. J. Norman, S. J. Norberg, N. Nemat-Gorgani, T. Royce, J. A., et al. Very long haplotype tracts characterized at high resolution from hla homozygous cell lines. Immunogenetics, 67:601–623, 2015.
B. Sivaprakasam and P. Sadagopan. Hla allele type prediction: A review on concepts, methods and algorithms. Asian Journal of Biological and Life Sciences, 12, 2023.
K. Tanaka, K. Kato, N. Nonaka, et al. Efficient hla imputation from sequential snps data by transformer. Journal of Human Genetics, 69:533–540, 2024.
H. Wang, J. Li, H. Wu, E. Hovy, and Y. Sun. Pre-trained language models and their applications. Engineering, 25:51–65, 2023.
X. Zheng, J. Shen, C. Cox, et al. Hibag—hla genotype imputation with attribute bagging. Pharmacogenomics Journal, 14:192–200, 2014.
Z. Zhou, Y. Ji, W. Li, P. Dutta, R. V. Davuluri, and H. Liu. Dnabert-2: Efficient foundation model and benchmark for multi-species genome. ArXiv, abs/2306.15006, 2023. | - |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96534 | - |
dc.description.abstract | 人類白血球抗原(HLA)與免疫反應、器官移植及疾病易感性等密切相關。藉由確認HLA型別再進行定位關聯分析能幫助了解疾病的關鍵位點。由於大規模直接定序HLA基因型十分耗時,因此發展出HLA imputation此種利用 SNP資料應用統計等技術來預測HLA型別的方法,並且近年也逐漸應用深度學習技術。
無監督預訓練模型透過大量未標註數據學習,藉由微調來適應各種下游任務,是一種熱門的方法。本研究結合了BERT-based 模型處理文字序列的優勢,並考量HLA基因區域的連鎖不平衡,將參考資料集的SNP位點轉換為連續序列,再利用 DNABERT-2預訓練模型提取SNP序列特徵並學習序列的上下文資訊,嘗試不同的微調策略進而有效預測HLA型別。本研究使用Pan-Asian 資料集和韓國資料集進行驗證,在微調過程中,進一步測試了過採樣和類別加權等方法來處理類別高度不平衡的資料,並採用Stratified K-fold cross-validation來評估模型穩定性。研究結果證實,透過資料預處理及微調策略,利用非連續的 SNP 序列亦能在資源有限的情況下有效提升HLA型別預測的準確性。 過去的多項研究已展示HLA imputation的高準確度對疾病精細定位的貢獻。利用深度學習有效捕捉基因序列間的信息,可能減少對大規模參考面板的依賴,並提升預測準確率與效率,進而推動藥物研發與個人化醫療。 | zh_TW |
dc.description.abstract | Human Leukocyte Antigen (HLA) is closely associated with immune responses, organ transplantation, and disease susceptibility. Identifying HLA types and conducting linkage analysis can aid in understanding the key loci related to diseases. However, direct sequencing of HLA genotypes is time-consuming. Hence, HLA imputation, a method that utilizes SNP data and applies statistical techniques to predict HLA types, has emerged as an alternative. Recently, deep learning has also been gradually applied in this field.
Unsupervised pre-trained models learn from large amounts of unlabeled data and can be fine-tuned for various downstream tasks, making them a popular approach. This study builds on the strengths of BERT-based models in processing textual sequences and takes into account the linkage disequilibrium of the HLA gene region. SNP loci are converted into continuous sequences, and the pre-trained DNABERT-2 model is used to extract features and learn the contextual information of the sequences. Different fine-tuning strategies are then explored to predict HLA types. Pan-Asian and Korean datasets are used in this study. During fine-tuning, oversampling and class weighting are tested to address the class imbalance problem, and stratified k-fold cross-validation is employed to evaluate the model’s stability. The results indicate that, with SNP data preprocessing and fine-tuning strategies, HLA types can be predicted effectively from SNP data under resource-constrained conditions, even when the DNA sequences are non-continuous. Previous research has demonstrated that high-accuracy HLA imputation is crucial for disease mapping. Effective deep learning methods can capture gene sequence context, which may reduce the need for large reference panels and improve prediction accuracy and efficiency, thereby advancing precision medicine. | en |
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-02-19T16:24:08Z No. of bitstreams: 0 | en |
dc.description.provenance | Made available in DSpace on 2025-02-19T16:24:09Z (GMT). No. of bitstreams: 0 | en |
dc.description.tableofcontents | Acknowledgements
Abstract (Chinese)
Abstract
Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 HLA, SNP and Linkage Disequilibrium
1.1.1 HLA
1.1.2 SNP and Haplotype in HLA Genes
1.1.3 Naming of HLA Alleles and Allelic Resolution
1.1.4 Linkage Disequilibrium
1.2 HLA Imputation
Chapter 2 Literature Review
2.1 Conventional HLA Imputation Methods
2.2 DEEP*HLA
2.3 HLARIMNT
2.4 Pre-trained Model
2.4.1 BERT Model
2.4.2 DNABERT Model
2.4.3 DNABERT-2 Model
2.5 Machine Learning Classifier
2.5.1 Random Forest
2.5.2 K-Nearest Neighbors (KNN)
2.6 Neural Network
2.6.1 Bidirectional Gated Recurrent Unit (BiGRU)
2.6.2 Bidirectional Long Short-Term Memory (BiLSTM)
Chapter 3 Material and Method
3.1 Datasets
3.1.1 Reference Panel
3.1.2 Pan-Asian Reference Panel
3.1.3 Korean Reference Panel
3.2 Sample SNP Dataset
3.3 Data Preprocessing Flow
3.3.1 HLA Data Grouping
3.3.2 Data Preprocessing
3.4 Pre-trained DNABERT-2 Model
3.5 Strategy - Feature Extraction with DNABERT-2 for Machine Learning Classifier
3.5.1 Random Forest and K-Nearest Neighbors
3.6 Strategy - Feature Extraction with DNABERT-2 for Neural Network Classification
3.7 Strategy - Fine-Tuning DNABERT-2
3.7.1 Model Architecture
3.7.2 Hyperparameters and Optimization
3.7.3 Handling Imbalanced Data
3.7.4 Fine-Tuning Higher Layers of DNABERT-2 and Integrating with BiGRU
3.8 Evaluation Metrics
Chapter 4 Experimental Results
4.1 Data Preprocessing
4.1.1 Pan-Asian Reference Panel
4.1.2 Korean Reference Panel
4.2 Feature Extraction with DNABERT-2 for Machine Learning Classifier
4.3 Feature Extraction with DNABERT-2 for Neural Network Classification
4.4 Fine-Tuning DNABERT-2
4.4.1 Imbalanced Data
4.4.2 HLABERTiGRU
4.4.3 Overall Performance Comparison
Chapter 5 Conclusion
5.1 Discussion
5.2 Future Work
References | - |
dc.language.iso | en | - |
dc.title | 利用預訓練模型進行SNP數據預測HLA型別 | zh_TW |
dc.title | Utilizing a Pre-trained Model for Predicting HLA Types from SNP Data | en |
dc.type | Thesis | - |
dc.date.schoolyear | 113-1 | - |
dc.description.degree | Master's | - |
dc.contributor.oralexamcommittee | 王弘倫;吳彥緯 | zh_TW |
dc.contributor.oralexamcommittee | Hung-Lung Wang;Yen-Wei Wu | en |
dc.subject.keyword | HLA, SNP, HLA imputation, Fine-tuning, Pre-trained Model | zh_TW |
dc.subject.keyword | HLA, SNP, HLA imputation, Fine-tuning, Pre-trained Model | en |
dc.relation.page | 54 | - |
dc.identifier.doi | 10.6342/NTU202500437 | - |
dc.rights.note | Not authorized | - |
dc.date.accepted | 2025-02-06 | - |
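dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
dc.contributor.author-dept | Graduate Institute of Biomedical Electronics and Bioinformatics | - |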
dc.date.embargo-lift | N/A | - |
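Appears in Collections: | Graduate Institute of Biomedical Electronics and Bioinformatics |
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-113-1.pdf (currently not authorized for public access) | 4.79 MB | Adobe PDF |
All items in the system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated.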