Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98846
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | 陳倩瑜 | zh_TW
dc.contributor.advisor | Chien-Yu Chen | en
dc.contributor.author | 林東甫 | zh_TW
dc.contributor.author | Tung-Pu Lin | en
dc.date.accessioned | 2025-08-19T16:25:44Z | -
dc.date.available | 2025-08-20 | -
dc.date.copyright | 2025-08-19 | -
dc.date.issued | 2025 | -
dc.date.submitted | 2025-08-08 | -
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98846 | -
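dc.description.abstract | Acute myeloid leukemia (AML) is a highly heterogeneous malignancy whose pathological progression and treatment strategy are closely shaped by somatic mutations, genetic characteristics, and responses to different therapies. As personalized medicine and gene-targeted therapy continue to develop, effectively identifying and analyzing disease-related genes has become key to improving clinical prognosis assessment and treatment outcomes. Existing techniques for extracting gene information, however, still depend heavily on traditional literature search and manual curation, and lack tools that are scalable and capable of automated semantic understanding. To address this problem, this study explores the potential of open-source large language models (LLMs) for biomedical literature analysis and, taking AML as an example, establishes a gene probability analysis pipeline based on the BioGPT model. BioGPT is a language model developed by Microsoft, based on OpenAI's original GPT-2 architecture and further pretrained on biomedical literature abstracts. This study adopts an analysis framework centered on next-token prediction in causal language modeling. By constructing disease-context prompts, it evaluates the normalized probability that a specific gene name appears as the next token in order to infer the gene's semantic association with the disease literature. Using the gene set recommended in the European LeukemiaNet (ELN) AML treatment guidelines as test data, the results show that, with appropriate prompt design (prompt engineering), genes known to be associated with AML differ significantly from other genes in their predicted probability distributions, indicating that the method is potentially effective. To further strengthen the stability and generalization ability of the model, this study integrates a retrieval-augmented generation (RAG) mechanism and adjusts hyperparameters such as the embedding model and chunk size to improve the analysis architecture. The final results show that this tuning (fine-tuning) effectively improves the model's prediction accuracy for AML-related genes and focuses it on target passages with high semantic relevance. The analysis pipeline proposed in this study can serve as a modular methodological framework for future applications of biomedical language models, and provides preliminary empirical evidence and a technical reference for applying large language models to domain knowledge extraction and semantic analysis. | zh_TW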
dc.description.abstract | Acute Myeloid Leukemia (AML) is a highly heterogeneous malignancy whose pathological progression and therapeutic strategies are strongly influenced by somatic mutations, genetic characteristics, and responses to various treatments. With the continuing advancement of personalized medicine and gene-targeted therapies, effectively identifying and analyzing disease-associated genes has become critical for improving prognosis and treatment outcomes. However, current gene information extraction methods still rely heavily on conventional literature searches and manual curation, and lack scalable, semantically aware automated tools. To address this issue, this study explores the potential of open-source large language models (LLMs) in biomedical literature analysis, using AML as a case study to establish a gene probability estimation pipeline based on the BioGPT model. BioGPT, developed by Microsoft, is a transformer-based language model built upon OpenAI’s original GPT-2 architecture and further pretrained on biomedical literature abstracts. We adopt a next-token prediction framework rooted in causal language modeling, constructing disease-context prompts to evaluate the normalized probability of a given gene name appearing as the next token. This score is then used to infer the semantic association between the gene and AML-related literature. Using the genes recommended by the European LeukemiaNet (ELN) AML treatment guidelines as test data, our results show that, with carefully designed prompts, the predicted probability distributions for known AML-associated genes differ significantly from those of other genes, indicating the potential effectiveness of this approach. To further enhance model stability and generalizability, we integrated Retrieval-Augmented Generation (RAG) into the pipeline and tuned hyperparameters such as the embedding model and chunk size. The final results show that these refinements improved prediction performance for AML-related genes and allowed the model to focus on the relevant semantic scope, leading to more selective and goal-oriented predictions. The proposed pipeline provides a modular framework for future biomedical LLM applications and offers preliminary empirical support and a technical reference for domain-specific knowledge extraction and semantic analysis using large language models. | en
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-08-19T16:25:44Z. No. of bitstreams: 0 | en
dc.description.provenance | Made available in DSpace on 2025-08-19T16:25:44Z (GMT). No. of bitstreams: 0 | en
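dc.description.tableofcontents | Acknowledgements
Chinese Abstract
English Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Background
1.2 Research Objectives
Chapter 2 Literature Review
2.1 Genetic Characteristics and Treatment Response of Leukemia
2.2 Transformer
2.3 Language Modeling Tasks
2.4 Applications of Transformer-Based Models in Medical Research
2.5 Retrieval-Augmented Generation
Chapter 3 Methods
3.1 Data Collection and Preprocessing
3.1.1 PubMed
3.1.2 Collection and Preprocessing of AML-Related Literature
3.1.3 European LeukemiaNet
3.1.4 HUGO Gene Nomenclature Committee
3.2 Model Architecture and Pipeline Design
3.2.1 BioGPT
3.2.2 Prompt Design
3.2.3 Next-token prediction
3.2.4 Normalization Strategy
3.2.5 Computing Resources and Implementation Environment
3.3 Retrieval-Augmented Generation
3.3.1 Design Motivation
3.3.2 RAG Corpus Collection and Preprocessing
3.3.3 Embedding Model
3.3.4 Integration with the BioGPT Inference Pipeline
Chapter 4 Results and Discussion
4.1 Prompt Design and Generation Analysis
4.2 Prediction of ELN Target Genes
4.2.1 Comparison of ELN Target Genes and Background Genes
4.2.2 Comparison of ELN Target Genes and Non-AML Genes
4.3 RAG Tuning and Impact Analysis
4.3.1 Comparison of RAG Performance across Embedding Models
4.3.2 RAG Performance under Different Chunk Sizes
4.3.3 Effect of RAG on Predictions under Different Prompts
4.3.4 Effect of RAG on Predictions for ELN Target Genes and Non-AML Genes
4.4 Summary
Chapter 5 Conclusion
References | -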
dc.language.iso | zh_TW | -
dc.subject | Large language model | zh_TW
dc.subject | Fine-tuning | zh_TW
dc.subject | Acute myeloid leukemia | zh_TW
dc.subject | Retrieval-augmented generation | zh_TW
dc.subject | Biomedical text mining | zh_TW
dc.subject | Prompt engineering | zh_TW
dc.subject | Retrieval-Augmented Generation (RAG) | en
dc.subject | Fine-tune | en
dc.subject | Prompt Engineering | en
dc.subject | Biomedical Text Mining | en
dc.subject | Large Language Model (LLM) | en
dc.subject | Acute Myeloid Leukemia (AML) | en
dc.title | Applying Biomedical-Literature Large Language Models to the Extraction and Analysis of Leukemia-Related Genes | zh_TW
dc.title | Retrieving Leukemia-related Genes Using Large Language Models built with Biomedical Literature | en
dc.type | Thesis | -
dc.date.schoolyear | 113-2 | -
dc.description.degree | Master's | -
dc.contributor.oralexamcommittee | 蔡承宏; 謝秉翰 | zh_TW
dc.contributor.oralexamcommittee | Cheng-Hong Tsai; Ping-Han Hsieh | en
dc.subject.keyword | Acute myeloid leukemia, Large language model, Biomedical text mining, Retrieval-augmented generation, Prompt engineering, Fine-tuning | zh_TW
dc.subject.keyword | Acute Myeloid Leukemia (AML), Large Language Model (LLM), Biomedical Text Mining, Retrieval-Augmented Generation (RAG), Prompt Engineering, Fine-tune | en
dc.relation.page | 55 | -
dc.identifier.doi | 10.6342/NTU202504150 | -
dc.rights.note | Authorization granted (open access worldwide) | -
dc.date.accepted | 2025-08-13 | -
dc.contributor.author-college | College of Bioresources and Agriculture | -
dc.contributor.author-dept | Department of Biomechatronics Engineering | -
dc.date.embargo-lift | 2025-08-20 | -
Appears in Collections: Department of Biomechatronics Engineering
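
The abstract in this record describes scoring a gene by the normalized probability that its name follows a disease-context prompt. The following is a minimal sketch of that idea under stated assumptions, not the thesis's implementation: `toy_logits` is a hypothetical stand-in for a causal language model such as BioGPT, the vocabulary, sub-token split, and logit values are invented for illustration, and averaging log-probabilities over sub-tokens is only one possible normalization (the record does not say which strategy the thesis uses).

```python
import math


def log_softmax(logits):
    """Turn raw scores (token -> logit) into log-probabilities, stably."""
    m = max(logits.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return {tok: v - log_z for tok, v in logits.items()}


def gene_score(prompt_tokens, gene_tokens, next_token_logits):
    """Average log-probability of the gene's sub-tokens following the prompt.

    Assumption: length normalization by the mean over sub-tokens, so that
    genes split into several sub-tokens are not penalized for length alone.
    """
    context = list(prompt_tokens)
    total = 0.0
    for tok in gene_tokens:
        log_probs = log_softmax(next_token_logits(context))
        total += log_probs[tok]
        context.append(tok)  # condition the next step on this sub-token
    return total / len(gene_tokens)


def toy_logits(context):
    """Hypothetical model: invented logits over a four-token vocabulary."""
    if context[-1] == "FL":
        # After the first piece of "FLT3", the second piece is very likely.
        return {"T3": 3.0, "NPM1": 0.0, "FL": 0.0, "ACTB": 0.0}
    return {"NPM1": 2.0, "FL": 1.5, "ACTB": -1.0, "T3": -2.0}


prompt = ["genes", "mutated", "in", "AML", "include"]
scores = {
    "NPM1": gene_score(prompt, ["NPM1"], toy_logits),
    "FLT3": gene_score(prompt, ["FL", "T3"], toy_logits),
    "ACTB": gene_score(prompt, ["ACTB"], toy_logits),
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

With a real model, `toy_logits` would be replaced by a call that returns next-token logits for the current context; the scoring loop itself would stay the same.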

Files in this item:
File | Size | Format
ntu-113-2.pdf | 3.98 MB | Adobe PDF
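
The abstract also mentions adding retrieval-augmented generation and tuning the embedding model and chunk size. The sketch below illustrates how those two settings enter such a pipeline; it is an assumption-laden toy, not the thesis's code. The bag-of-words `embed` function is a hypothetical stand-in for a real embedding model, the chunk size is counted in words (the record does not state the unit), and the two-sentence corpus is illustrative text written for this example.

```python
import math
from collections import Counter


def chunk_words(text, chunk_size):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]


def embed(text):
    """Toy embedding: lowercase bag-of-words counts (stand-in for a model)."""
    return Counter(w.strip(".,;:()").lower() for w in text.split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, top_k=1):
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(query, chunks, top_k=1):
    """Prepend retrieved context to the prompt before next-token scoring."""
    context = " ".join(retrieve(query, chunks, top_k))
    return f"{context} {query}"


# Illustrative corpus: one on-topic and one off-topic sentence, 8 words each.
corpus = ("FLT3 mutations are frequent in acute myeloid leukemia. "
          "Wheat yield depends on soil nitrogen and rainfall.")
chunks = chunk_words(corpus, chunk_size=8)
query = "Genes mutated in acute myeloid leukemia include"
augmented = build_prompt(query, chunks)
print(augmented)
```

Changing `chunk_size` changes how much surrounding text each retrieved chunk carries, and swapping `embed` changes which chunks rank highest, which is why both are natural settings to compare.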
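Unless their copyright terms are specifically indicated otherwise, all items in this system are protected by copyright, with all rights reserved.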
