Utilizing a Pre-trained Model for Predicting HLA Types from SNP Data

Yi-Hsuan Shen (沈宜萱)

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96534

Title:	Utilizing a Pre-trained Model for Predicting HLA Types from SNP Data
Author:	Yi-Hsuan Shen (沈宜萱)
Advisor:	Kun-Mao Chao (趙坤茂)
Keywords:	HLA, SNP, HLA imputation, Fine-tuning, Pre-trained Model
Year of publication:	2025
Degree:	Master's
Abstract:	Human leukocyte antigen (HLA) genes are closely associated with immune responses, organ transplantation, and disease susceptibility. Determining HLA types and then performing association fine-mapping helps identify the key loci underlying disease. Because directly sequencing HLA genotypes at scale is time-consuming, HLA imputation was developed as an alternative: it predicts HLA types from SNP data using statistical techniques, and in recent years deep learning has increasingly been applied to it. Unsupervised pre-trained models, which learn from large amounts of unlabeled data and are fine-tuned for downstream tasks, are a popular approach. This study draws on the strength of BERT-based models in processing text sequences and takes into account the linkage disequilibrium of the HLA gene region. SNP loci from the reference dataset are converted into continuous sequences, and the DNABERT-2 pre-trained model is used to extract features from these SNP sequences and learn their contextual information; different fine-tuning strategies are then tried for predicting HLA types. The Pan-Asian and Korean datasets are used for validation. During fine-tuning, oversampling and class weighting are tested as ways to handle the highly imbalanced classes, and stratified k-fold cross-validation is used to evaluate model stability. The results indicate that, with data preprocessing and fine-tuning strategies, sequences built from non-contiguous SNPs can still be used to predict HLA types effectively under limited resources. Previous studies have shown that accurate HLA imputation contributes to disease fine-mapping. Deep learning methods that capture the context of genetic sequences may reduce reliance on large reference panels and improve prediction accuracy and efficiency, thereby advancing drug development and personalized medicine.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96534
DOI:	10.6342/NTU202500437
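Full-text authorization:	Not authorized
Full-text public release date:	N/A
Appears in collections:	Graduate Institute of Biomedical Electronics and Bioinformatics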
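
The abstract mentions class weighting and stratified k-fold cross-validation for handling highly imbalanced HLA classes, but this record does not describe how either was implemented. The following is a minimal, standard-library sketch of the two ideas, assuming the common inverse-frequency weighting heuristic and a simple round-robin fold assignment. The function names and the toy allele labels are hypothetical and are not taken from the thesis.

```python
import random
from collections import Counter, defaultdict


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    This is one common heuristic; the thesis may have used a different one.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def stratified_k_fold(labels, k, seed=0):
    """Yield (train_idx, val_idx) pairs with roughly preserved class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin across the k folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    for i in range(k):
        val = sorted(folds[i])
        train = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train, val


# Toy, made-up allele labels for illustration only (not data from the thesis).
labels = ["A*02:01"] * 6 + ["A*11:01"] * 3 + ["A*24:02"] * 3
weights = balanced_class_weights(labels)
splits = list(stratified_k_fold(labels, k=3))
```

In this sketch, rarer alleles receive larger weights, which could then be passed to a weighted loss during fine-tuning. A class with fewer than `k` samples will be absent from some validation folds, which is one reason very rare alleles are hard to evaluate reliably.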

Files in this item:

File	Size	Format
ntu-113-1.pdf (not currently authorized for public access)	4.79 MB	Adobe PDF
