Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96534
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | 趙坤茂 | zh_TW
dc.contributor.advisor | Kun-Mao Chao | en
dc.contributor.author | 沈宜萱 | zh_TW
dc.contributor.author | Yi-Hsuan Shen | en
dc.date.accessioned | 2025-02-19T16:24:09Z | -
dc.date.available | 2025-02-20 | -
dc.date.copyright | 2025-02-19 | -
dc.date.issued | 2025 | -
dc.date.submitted | 2025-02-05 | -
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96534 | -
dc.description.abstract | 人類白血球抗原(HLA)與免疫反應、器官移植及疾病易感性等密切相關。藉由確認HLA型別再進行定位關聯分析能幫助了解疾病的關鍵位點。由於大規模直接定序HLA基因型十分耗時,因此發展出HLA imputation此種利用SNP資料應用統計等技術來預測HLA型別的方法,並且近年也逐漸應用深度學習技術。
無監督預訓練模型透過大量未標註數據學習,藉由微調來適應各種下游任務,是一種熱門的方法。本研究結合了BERT-based 模型處理文字序列的優勢,並考量HLA基因區域的連鎖不平衡,將參考資料集的SNP位點轉換為連續序列,再利用 DNABERT-2預訓練模型提取SNP序列特徵並學習序列的上下文資訊,嘗試不同的微調策略進而有效預測HLA型別。本研究使用Pan-Asian 資料集和韓國資料集進行驗證,在微調過程中,進一步測試了過採樣和類別加權等方法來處理類別高度不平衡的資料,並採用Stratified K-fold cross-validation來評估模型穩定性。研究結果證實,透過資料預處理及微調策略,利用非連續的 SNP 序列亦能在資源有限的情況下有效提升HLA型別預測的準確性。
過去的多項研究已展示HLA imputation的高準確度對疾病精細定位的貢獻。利用深度學習有效捕捉基因序列間的信息,可能減少對大規模參考面板的依賴,並提升預測準確率與效率,進而推動藥物研發與個人化醫療。 | zh_TW
dc.description.abstract | Human leukocyte antigen (HLA) is closely associated with immune responses, organ transplantation, and disease susceptibility. Identifying HLA types and then conducting association analysis helps locate the key loci related to diseases. However, direct sequencing of HLA genotypes at scale is time-consuming. HLA imputation, which predicts HLA types from SNP data using statistical techniques, has therefore emerged as an alternative, and deep learning has recently been applied to it as well.
Unsupervised pre-trained models learn from large amounts of unlabeled data and can be fine-tuned for various downstream tasks, which has made them a popular approach. This study builds on the strengths of BERT-based models in processing text sequences and takes into account linkage disequilibrium in the HLA gene region. SNP loci from the reference panel are converted into continuous sequences, and the pre-trained DNABERT-2 model is used to extract features from them and learn their contextual information. Different fine-tuning strategies are then tried to predict HLA types effectively.
Pan-Asian and Korean datasets are used for validation. During fine-tuning, oversampling and class weighting are tested to address the severe class imbalance, and stratified k-fold cross-validation is used to evaluate the model’s stability. The results show that, with SNP data preprocessing and suitable fine-tuning strategies, HLA types can be predicted effectively from non-continuous SNP sequences even under resource-constrained conditions.
Previous research has demonstrated that high-accuracy HLA imputation is crucial for disease mapping. Deep learning methods that effectively capture gene sequence context may reduce the need for large reference panels and improve prediction accuracy and efficiency, thereby advancing precision medicine. | en
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-02-19T16:24:08Z
No. of bitstreams: 0 | en
dc.description.provenance | Made available in DSpace on 2025-02-19T16:24:09Z (GMT). No. of bitstreams: 0 | en
dc.description.tableofcontents | Acknowledgements
摘要 (Abstract in Chinese)
Abstract
Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 HLA, SNP and Linkage Disequilibrium
1.1.1 HLA
1.1.2 SNP and Haplotype in HLA Genes
1.1.3 Naming of HLA Alleles and Allelic Resolution
1.1.4 Linkage Disequilibrium
1.2 HLA Imputation
Chapter 2 Literature Review
2.1 Conventional HLA Imputation Methods
2.2 DEEP*HLA
2.3 HLARIMNT
2.4 Pre-trained Model
2.4.1 BERT Model
2.4.2 DNABERT Model
2.4.3 DNABERT-2 Model
2.5 Machine Learning Classifier
2.5.1 Random Forest
2.5.2 K-Nearest Neighbors (KNN)
2.6 Neural Network
2.6.1 Bidirectional Gated Recurrent Unit (BiGRU)
2.6.2 Bidirectional Long Short-Term Memory (BiLSTM)
Chapter 3 Material and Method
3.1 Datasets
3.1.1 Reference Panel
3.1.2 Pan-Asian Reference Panel
3.1.3 Korean Reference Panel
3.2 Sample SNP Dataset
3.3 Data Preprocessing Flow
3.3.1 HLA Data Grouping
3.3.2 Data Preprocessing
3.4 Pre-trained DNABERT-2 Model
3.5 Strategy - Feature Extraction with DNABERT-2 for Machine Learning Classifier
3.5.1 Random Forest and K-Nearest Neighbors
3.6 Strategy - Feature Extraction with DNABERT-2 for Neural Network Classification
3.7 Strategy - Fine-Tuning DNABERT-2
3.7.1 Model Architecture
3.7.2 Hyperparameters and Optimization
3.7.3 Handling Imbalanced Data
3.7.4 Fine-Tuning Higher Layers of DNABERT-2 and Integrating with BiGRU
3.8 Evaluation Metrics
Chapter 4 Experimental Results
4.1 Data Preprocessing
4.1.1 Pan-Asian Reference Panel
4.1.2 Korean Reference Panel
4.2 Feature Extraction with DNABERT-2 for Machine Learning Classifier
4.3 Feature Extraction with DNABERT-2 for Neural Network Classification
4.4 Fine-Tuning DNABERT-2
4.4.1 Imbalanced Data
4.4.2 HLABERTiGRU
4.4.3 Overall Performance Comparison
Chapter 5 Conclusion
5.1 Discussion
5.2 Future Work
References | -
dc.language.iso | en | -
dc.subject | Fine-tuning | zh_TW
dc.subject | Pre-trained Model | zh_TW
dc.subject | HLA imputation | zh_TW
dc.subject | SNP | zh_TW
dc.subject | HLA | zh_TW
dc.subject | Pre-trained Model | en
dc.subject | HLA | en
dc.subject | SNP | en
dc.subject | HLA imputation | en
dc.subject | Fine-tuning | en
dc.title | 利用預訓練模型進行SNP數據預測HLA型別 | zh_TW
dc.title | Utilizing a Pre-trained Model for Predicting HLA Types from SNP Data | en
dc.type | Thesis | -
dc.date.schoolyear | 113-1 | -
dc.description.degree | 碩士 (Master's) | -
dc.contributor.oralexamcommittee | 王弘倫; 吳彥緯 | zh_TW
dc.contributor.oralexamcommittee | Hung-Lung Wang; Yen-Wei Wu | en
dc.subject.keyword | HLA, SNP, HLA imputation, Fine-tuning, Pre-trained Model | zh_TW
dc.subject.keyword | HLA, SNP, HLA imputation, Fine-tuning, Pre-trained Model | en
dc.relation.page | 54 | -
dc.identifier.doi | 10.6342/NTU202500437 | -
dc.rights.note | 未授權 (Not authorized) | -
dc.date.accepted | 2025-02-06 | -
dc.contributor.author-college | 電機資訊學院 | -
dc.contributor.author-dept | 生醫電子與資訊學研究所 | -
dc.date.embargo-lift | N/A | -
Appears in Collections: 生醫電子與資訊學研究所
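
Illustrative note on the abstract: the abstract mentions class weighting and stratified k-fold cross-validation as ways to cope with highly imbalanced HLA allele classes. The sketch below shows one common way those two ideas can be written in plain Python. It is not taken from the thesis: the function names, the made-up allele labels, and the inverse-frequency formula (the widely used "balanced" heuristic, n_samples / (n_classes * class_count)) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
import random


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get weights above 1, common classes below 1.
    (Assumed heuristic; the thesis may weight classes differently.)
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


def stratified_k_fold(labels, k, seed=0):
    """Yield (train_idx, test_idx) pairs with similar class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(by_class):
        indices = by_class[cls]
        rng.shuffle(indices)
        # Deal each class's samples round-robin so every fold gets a share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    for i in range(k):
        test = sorted(folds[i])
        train = sorted(j for f, fold in enumerate(folds) if f != i for j in fold)
        yield train, test


# Hypothetical, made-up labels: one common and one rare allele class.
labels = ["A*24:02"] * 8 + ["A*33:03"] * 2
weights = inverse_frequency_weights(labels)
splits = list(stratified_k_fold(labels, k=2))
```

With these toy labels the rare class receives a larger weight than the common one, and each of the two test folds contains one of the two rare samples. A class with fewer samples than folds would be missing from some test folds, which is one reason very rare alleles are hard to evaluate.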

Files in This Item:
File | Size | Format
ntu-113-1.pdf (Restricted Access) | 4.79 MB | Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
