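Communicating with Biomedical Databases: Exploring Retrieval-Augmented Generation via Text-to-SQL, Text Embedding, and Knowledge Graph-Based Approaches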

葉政翔; Zheng-Xiang Ye

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97546

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	林仲彥	zh_TW
dc.contributor.advisor	Chung-Yen Lin	en
dc.contributor.author	葉政翔	zh_TW
dc.contributor.author	Zheng-Xiang Ye	en
dc.date.accessioned	2025-07-02T16:24:10Z	-
dc.date.available	2025-07-03	-
dc.date.copyright	2025-07-02	-
dc.date.issued	2025	-
dc.date.submitted	2025-06-20	-
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97546	-
dc.description.abstract	生物資料庫作為實驗研究與文獻彙整的核心樞紐，使科學家能高效的存取其專業領域內的資訊。資料庫提供者的目標不僅在於收集高品質的數據，亦在於確保服務的穩定性及搜尋結果的準確性。近年來大型語言模型的突破賦予模型強大的語意理解能力，使直覺式的自然語言搜尋成為可能。本研究整合大型語言模型至兩個不同的生物資料庫。MSCare是一個基於間質幹細胞PubMed文獻所建構的聊天機器人，為與非結構化文字資料互動的例子；TWHM聊天機器人則輔助臺灣漢醫藥 (TWHM) 資料庫，為與關聯式資料庫中結構化資料互動的例子。MSCare利用文字嵌入 (text embeddings) 與知識圖譜 (knowledge graph) 擷取相關文獻資訊並進行推理；TWHM聊天機器人則利用大型語言模型生成SQL，以支援藉自然語言進行進階資料庫查詢的技術。本研究設計了客製化的評估方法，用以分析並提升兩個系統的回應品質。結果顯示，MSCare在超過75%的問題上優於基準的大型語言模型，該表現主要來自於文字嵌入方法的貢獻。知識圖譜進一步提升了回應多樣性，並支援間接關係的推理，儘管在回應完整性方面仍有部分限制。MSCare的知識圖譜呈現無尺度網路 (scale-free network) 的特性，並有效捕捉MSC研究中的生物實體。藉本研究設計之資料表選擇與查詢優化策略，TWHM聊天機器人在SQL生成與執行方面有高成功率。本研究驗證了整合大型語言模型至生物資料庫的可行性。然而，在知識圖譜建構、檢索策略及系統效能的評估上仍存在挑戰，為後續研究與優化的重要方向。	zh_TW
dc.description.abstract	Biological databases serve as central hubs for collecting and organizing experimental research and literature, enabling scientists to efficiently access domain-specific information. Database providers aim not only to curate high-quality data but also to ensure stable services and accurate search results. Recent advances in large language models (LLMs) have introduced powerful semantic understanding capabilities, allowing for more intuitive searches using natural language. This study explores the integration of LLMs into two distinct biological databases. MSCare, a chatbot built on PubMed articles related to mesenchymal stem cells (MSCs), enables interaction with unstructured textual data. The TWHM chatbot, developed to supplement the Taiwan Han Medicine (TWHM) database, facilitates interaction with structured data stored in a relational database. MSCare integrates text embeddings and a knowledge graph to extract biomedical content and support reasoning, while the TWHM chatbot uses LLM-based SQL query generation to support advanced searches based on natural language questions. Custom evaluation methods were developed to assess and enhance the response quality of both systems. Results show that MSCare outperforms a baseline LLM on more than 75% of questions, with the primary contribution coming from the text embedding approach. The knowledge graph further enhances response diversity and supports reasoning over indirect relationships, despite some limitations in contextual completeness. The MSC knowledge graph exhibits scale-free properties and effectively captures key entities central to MSC research. The TWHM chatbot achieves a high success rate in SQL query generation and execution, enabled by tailored schema selection and query refinement strategies. This study demonstrates the feasibility of integrating LLMs into biological databases. Nevertheless, challenges remain in knowledge graph construction, retrieval strategy design, and precise system performance evaluation. These areas represent key directions for future enhancement.	en
dc.description.provenance	Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-07-02T16:24:10Z No. of bitstreams: 0	en
dc.description.provenance	Made available in DSpace on 2025-07-02T16:24:10Z (GMT). No. of bitstreams: 0	en
dc.description.tableofcontents	Acknowledgements; Abstract (in Chinese); Abstract; Contents; List of Figures; List of Tables; Chapter 1 Introduction; 1.1 Motivation; 1.2 Research Aims and Objectives; 1.3 Thesis Structure; Chapter 2 Background; 2.1 Language Models; 2.1.1 General Concepts; 2.1.2 A Brief History of Language Models; 2.1.3 Capabilities of Large Language Models; 2.2 Retrieval-Augmented Generation (RAG); 2.3 Database Management System (DBMS); 2.4 Text Embedding Models and Embedding-Based RAG; 2.4.1 Text Embedding Models; 2.4.2 RAG with Text Embedding Models; 2.5 Knowledge Graph and Graph RAG; 2.5.1 Biomedical Knowledge Graph; 2.5.2 RAG with Knowledge Graphs; 2.6 Text-to-SQL; Chapter 3 Materials and Methods; 3.1 Chatbot System Architecture; 3.1.1 Overview; 3.1.2 Chat History Management; 3.1.3 Language Detection; 3.1.4 Data Retrieval and Response Generation; 3.1.5 Implementation Details; 3.2 MSCare Design and Evaluation; 3.2.1 MSCare Data Sources; 3.2.2 MSCare Text Embedding Construction and Retrieval; 3.2.3 MSCare Knowledge Graph Construction; 3.2.4 MSCare Knowledge Graph Retrieval; 3.2.5 MSCare Response Generation; 3.2.6 MSCare Evaluation; 3.3 TWHM Chatbot Design and Evaluation; 3.3.1 TWHM Data Source; 3.3.2 TWHM Data Storage; 3.3.3 TWHM Data Retrieval; 3.3.4 TWHM Chatbot Response Generation; 3.3.5 TWHM Chatbot Evaluation; Chapter 4 MSCare Results; 4.1 Text Embedding-Based Retrieval; 4.1.1 Text Chunk Statistics; 4.1.2 Text Embedding Retrieval Evaluation; 4.2 MSC Knowledge Graph; 4.2.1 Graph Statistics; 4.2.2 Graph Entity Connectivity and Topology; 4.2.3 Knowledge Graph Retrieval Evaluation; 4.2.4 Reasoning with Paths in the Knowledge Graph; 4.3 MSCare Response Quality Evaluation; 4.4 MSCare Response Case Study; Chapter 5 TWHM Chatbot Results; 5.1 TWHM Database Statistics; 5.2 Text-to-SQL Evaluation; 5.2.1 Number of Generated Queries per Question; 5.2.2 Refiner Invocation Frequency; 5.2.3 Query Quality Evaluation; 5.3 TWHM Chatbot Response Case Study; Chapter 6 Discussion; 6.1 Text Embedding; 6.1.1 Relevance of Retrieved Text Chunks; 6.1.2 Text Chunking Strategies and Chunk Size; 6.2 Knowledge Graph; 6.2.1 MSC Knowledge Graph; 6.2.2 Graph Construction; 6.2.3 Graph Retrieval Strategies; 6.2.4 Indirect Relationship Inference; 6.2.5 Response Presentation; 6.3 Text-to-SQL; 6.3.1 Database Schema Selection and Representation; 6.3.2 SQL Quality Dependency on the Choices of LLMs; 6.3.3 Extensibility of the Text-to-SQL Approach Developed in this Study; 6.4 Limitations; 6.4.1 Response Generation; 6.4.2 Evaluation Methods; Chapter 7 Conclusion; Appendices; References	-
dc.language.iso	en	-
dc.subject	大型語言模型	zh_TW
dc.subject	自然語言轉SQL	zh_TW
dc.subject	知識圖譜	zh_TW
dc.subject	語意搜尋	zh_TW
dc.subject	檢索增強生成	zh_TW
dc.subject	生物資料庫	zh_TW
dc.subject	semantic search	en
dc.subject	retrieval-augmented generation	en
dc.subject	large language models	en
dc.subject	biological databases	en
dc.subject	text-to-SQL	en
dc.subject	knowledge graph	en
dc.title	與生醫資料庫對話：透過自然語言轉SQL、文字嵌入及知識圖譜方法探索檢索增強生成的應用	zh_TW
dc.title	Communicating with Biomedical Databases: Exploring Retrieval-Augmented Generation via Text-to-SQL, Text Embedding, and Knowledge Graph-Based Approaches	en
dc.type	Thesis	-
dc.date.schoolyear	113-2	-
dc.description.degree	Master's	-
dc.contributor.oralexamcommittee	莊樹諄;黃瀚萱;林泰元;陳淑華	zh_TW
dc.contributor.oralexamcommittee	Trees-Juen Chuang;Hen-Hsen Huang;Thai-Yen Ling;Shu-Hwa Chen	en
dc.subject.keyword	生物資料庫,大型語言模型,檢索增強生成,語意搜尋,知識圖譜,自然語言轉SQL	zh_TW
dc.subject.keyword	biological databases, large language models, retrieval-augmented generation, semantic search, knowledge graph, text-to-SQL	en
dc.relation.page	142	-
dc.identifier.doi	10.6342/NTU202501236	-
dc.rights.note	Authorization granted (open access worldwide)	-
dc.date.accepted	2025-06-20	-
dc.contributor.author-college	College of Life Science	-
dc.contributor.author-dept	Genome and Systems Biology Degree Program	-
dc.date.embargo-lift	2025-07-03	-
Appears in Collections:	Genome and Systems Biology Degree Program

Files in This Item:

File	Size	Format
ntu-113-2.pdf	10.3 MB	Adobe PDF

Unless their copyright terms are specifically indicated, all items in the system are protected by copyright, with all rights reserved.