Applying deep learning on SNP array data to predict HLA alleles

蔡毓璁; Yu-Tsung Tsai

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90558

Title:	Applying deep learning on SNP array data to predict HLA alleles
Author:	蔡毓璁 Yu-Tsung Tsai
Advisor:	陳倩瑜 Chien-Yu Chen
Keywords:	Human leukocyte antigen, Machine learning, Transformer, Single nucleotide polymorphism
Year of publication:	2023
Degree:	Master's
Abstract:	Human leukocyte antigen (HLA) is the human form of the major histocompatibility complex (MHC), a group of genes located on chromosome 6. HLA is highly complex and is associated with infectious, immune, and cancer-related diseases; it also interacts with many immunomodulatory drugs. Current HLA genotyping methods require considerable time and resources, which makes them costly and limits their use in clinical diagnosis and in immunology-related research. Imputing HLA alleles from single-nucleotide polymorphism (SNP) data is therefore an attractive alternative. Existing HLA prediction tools achieve high consistency and accuracy for most loci, but certain loci, such as HLA-B, remain difficult to predict accurately. Given the high relevance of HLA to human health and disease, this study aims to develop a machine learning model that predicts HLA types quickly and accurately and that improves overall training performance.
The proposed method uses a Transformer model, an architecture from natural language processing, as its core and was trained on data from Taiwan Biobank 2.0. The resulting model, TW2HLA (Transformer With TaiWan HLA data), was adjusted and optimized to better fit the characteristics of HLA sequences, and it predicts HLA genotypes from SNP sites characteristic of the Taiwanese population. TW2HLA outperformed traditional HLA prediction models in both efficiency and accuracy. Because low-frequency HLA genotypes are rare, existing models often predict them poorly; TW2HLA was therefore also optimized for rare HLA genotypes, and the trained model performed well on them. The study compared TW2HLA with the imputation model SNP2HLA, the prediction model HIBAG, and the machine learning model DEEPHLA, using accuracy and sensitivity as evaluation metrics, with particular attention to performance on low-frequency HLA genotypes. It also discusses and compares the training dataset sizes used for the models. In summary, this thesis developed a Transformer-based HLA prediction model that performs well in predicting HLA allele types and can play a significant role in large-scale HLA testing. The method should help improve the efficiency of clinical diagnosis and research, and it is expected to serve as a foundation for future HLA-related studies.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90558
DOI:	10.6342/NTU202302754
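Full-text authorization:	Authorized (campus access only)
Full-text release date:	2025-12-31
Appears in collections:	Department of Biomechatronics Engineering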

Files in this item:

File	Size	Format
ntu-111-2.pdf (access restricted to NTU campus IP addresses; off campus, please use the VPN service)	2.12 MB	Adobe PDF
