  1. NTU Theses and Dissertations Repository
  2. College of Bioresources and Agriculture
  3. Department of Biomechatronics Engineering
Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98846
Title: Retrieving Leukemia-related Genes Using Large Language Models built with Biomedical Literature
Original title: 應用生醫文獻大型語言模型於白血病相關基因的擷取與分析
Author: Tung-Pu Lin (林東甫)
Advisor: Chien-Yu Chen (陳倩瑜)
Keywords: Acute Myeloid Leukemia (AML), Large Language Model (LLM), Biomedical Text Mining, Retrieval-Augmented Generation (RAG), Prompt Engineering, Fine-tuning
Publication year: 2025
Degree: Master's
Abstract: Acute Myeloid Leukemia (AML) is a highly heterogeneous malignancy whose pathological progression and treatment strategies are strongly influenced by somatic mutations, genetic characteristics, and responses to various treatments. With the continued advancement of personalized medicine and gene-targeted therapies, effectively identifying and analyzing disease-associated genes has become critical for improving prognosis and treatment outcomes. However, current methods for extracting gene information still rely heavily on conventional literature searches and manual curation, and scalable, semantically aware automated tools are lacking. To address this gap, this study explores the potential of open-source Large Language Models (LLMs) in biomedical literature analysis, using AML as a case study to establish a gene probability estimation pipeline based on the BioGPT model. BioGPT, developed by Microsoft, is a transformer-based language model built on OpenAI's original GPT-2 architecture and pretrained on biomedical literature abstracts. We adopt a next-token prediction framework rooted in Causal Language Modeling: we construct disease-context prompts and evaluate the normalized probability that a given gene name appears as the next token. This score is then used to infer the semantic association between the gene and the AML-related literature. Using the genes recommended in the European LeukemiaNet (ELN) AML treatment guidelines as test data, our results show that, with carefully designed prompts (prompt engineering), the predicted probability distributions of known AML-related genes differ significantly from those of other genes, indicating the potential effectiveness of this approach. To further improve model stability and generalizability, we integrated Retrieval-Augmented Generation (RAG) into the existing pipeline and tuned hyperparameters such as the embedding model and chunk size. The final results show that these refinements improved the model's prediction accuracy for AML-related genes and allowed it to focus on the most semantically relevant passages, leading to more selective and goal-oriented predictions. The proposed pipeline provides a modular framework for future biomedical LLM applications and offers preliminary empirical support and a technical reference for domain-specific knowledge extraction and semantic analysis using large language models.
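The next-token scoring idea described in the abstract can be sketched in a few lines. This is only an illustrative sketch, not the thesis's implementation: the `NextTokenLogits` interface, the `score_genes` helper, the `toy_model` stand-in, and the `GENE_A`/`GENE_B`/`GENE_C` placeholders are all hypothetical, and the logits are made-up numbers rather than BioGPT outputs. It also assumes that each gene name is a single token and that "normalized" means renormalizing over the candidate gene set; the thesis may define either differently.

```python
import math
from typing import Callable, Dict, List

# Hypothetical interface: maps a prompt to {token: logit} for the next token.
NextTokenLogits = Callable[[str], Dict[str, float]]


def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Convert raw logits into a probability distribution over the vocabulary."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: val / total for tok, val in exps.items()}


def score_genes(prompt: str, genes: List[str], model: NextTokenLogits) -> Dict[str, float]:
    """Score each candidate gene by its next-token probability after `prompt`.

    Assumption: scores are renormalized over the candidate set so they sum
    to 1; the thesis may normalize differently.
    """
    probs = softmax(model(prompt))
    raw = {gene: probs.get(gene, 0.0) for gene in genes}
    total = sum(raw.values())
    if total == 0.0:
        return {gene: 0.0 for gene in genes}
    return {gene: p / total for gene, p in raw.items()}


def toy_model(prompt: str) -> Dict[str, float]:
    """Stand-in for a causal LM; the logits are made-up placeholders, not results."""
    return {"GENE_A": 2.0, "GENE_B": 1.0, "GENE_C": 0.0, "the": 3.0}


# Disease-context prompt (hypothetical wording) followed by candidate genes.
scores = score_genes(
    "Acute myeloid leukemia is associated with mutations in",
    ["GENE_A", "GENE_B", "GENE_C"],
    toy_model,
)
for gene, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{gene}: {score:.3f}")
```

In a real pipeline, `toy_model` would be replaced by a call into a causal language model such as BioGPT, and multi-token gene names would need their token probabilities combined, which this sketch does not attempt.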
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98846
DOI: 10.6342/NTU202504150
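Full-text authorization: Authorized (open access worldwide)
Electronic full-text release date: 2025-08-20
Appears in collections: Department of Biomechatronics Engineering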

Files in this item:
File            Size      Format
ntu-113-2.pdf   3.98 MB   Adobe PDF


Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are otherwise indicated.
