Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97546
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | 林仲彥 | zh_TW
dc.contributor.advisor | Chung-Yen Lin | en
dc.contributor.author | 葉政翔 | zh_TW
dc.contributor.author | Zheng-Xiang Ye | en
dc.date.accessioned | 2025-07-02T16:24:10Z | -
dc.date.available | 2025-07-03 | -
dc.date.copyright | 2025-07-02 | -
dc.date.issued | 2025 | -
dc.date.submitted | 2025-06-20 | -
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97546 | -
dc.description.abstract | 生物資料庫作為實驗研究與文獻彙整的核心樞紐,使科學家能高效的存取其專業領域內的資訊。資料庫提供者的目標不僅在於收集高品質的數據,亦在於確保服務的穩定性及搜尋結果的準確性。近年來大型語言模型的突破賦予模型強大的語意理解能力,使直覺式的自然語言搜尋成為可能。本研究整合大型語言模型至兩個不同的生物資料庫。MSCare是一個基於間質幹細胞PubMed文獻所建構的聊天機器人,為與非結構化文字資料互動的例子;TWHM聊天機器人則輔助臺灣漢醫藥 (TWHM) 資料庫,為與關聯式資料庫中結構化資料互動的例子。MSCare利用文字嵌入 (text embeddings) 與知識圖譜 (knowledge graph) 擷取相關文獻資訊並進行推理;TWHM聊天機器人則利用大型語言模型生成SQL,以支援藉自然語言進行進階資料庫查詢的技術。本研究設計了客製化的評估方法,用以分析並提升兩個系統的回應品質。結果顯示,MSCare在超過75%的問題上優於基準的大型語言模型,該表現主要來自於文字嵌入方法的貢獻。知識圖譜進一步提升了回應多樣性,並支援間接關係的推理,儘管在回應完整性方面仍有部分限制。MSCare的知識圖譜呈現無尺度網路 (scale-free network) 的特性,並有效捕捉MSC研究中的生物實體。藉本研究設計之資料表選擇與查詢優化策略,TWHM聊天機器人在SQL生成與執行方面有高成功率。本研究驗證了整合大型語言模型至生物資料庫的可行性。然而,在知識圖譜建構、檢索策略及系統效能的評估上仍存在挑戰,為後續研究與優化的重要方向。 | zh_TW
dc.description.abstract | Biological databases serve as central hubs for collecting and organizing experimental research and literature, enabling scientists to efficiently access domain-specific information. Database providers aim not only to curate high-quality data but also to ensure stable services and accurate search results. Recent advances in large language models (LLMs) have introduced powerful semantic understanding capabilities, allowing for more intuitive searches using natural language. This study explores the integration of LLMs into two distinct biological databases. MSCare, a chatbot built on PubMed articles related to mesenchymal stem cells (MSCs), enables interaction with unstructured textual data. The TWHM chatbot, developed to supplement the Taiwan Han Medicine (TWHM) database, facilitates interaction with structured data stored in a relational database. MSCare integrates text embeddings and a knowledge graph to extract biomedical content and support reasoning, while the TWHM chatbot uses LLM-based SQL query generation to support advanced searches based on natural language questions. Custom evaluation methods were developed to assess and enhance the response quality of both systems. Results show that MSCare outperforms a baseline LLM on more than 75% of questions, with the primary contribution coming from the text embedding approach. The knowledge graph further enhances response diversity and supports reasoning over indirect relationships, despite some limitations in contextual completeness. The MSC knowledge graph exhibits scale-free properties and effectively captures key entities central to MSC research. The TWHM chatbot achieves a high success rate in SQL query generation and execution, enabled by tailored schema selection and query refinement strategies. This study demonstrates the feasibility of integrating LLMs into biological databases. Nevertheless, challenges remain in knowledge graph construction, retrieval strategy design, and precise system performance evaluation. These areas represent key directions for future enhancement. | en
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-07-02T16:24:10Z. No. of bitstreams: 0 | en
dc.description.provenance | Made available in DSpace on 2025-07-02T16:24:10Z (GMT). No. of bitstreams: 0 | en
dc.description.tableofcontents |
Acknowledgements
Abstract (in Chinese)
Abstract
Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Motivation
1.2 Research Aims and Objectives
1.3 Thesis Structure
Chapter 2 Background
2.1 Language Models
2.1.1 General Concepts
2.1.2 A Brief History of Language Models
2.1.3 Capabilities of Large Language Models
2.2 Retrieval-Augmented Generation (RAG)
2.3 Database Management System (DBMS)
2.4 Text Embedding Models and Embedding-Based RAG
2.4.1 Text Embedding Models
2.4.2 RAG with Text Embedding Models
2.5 Knowledge Graph and Graph RAG
2.5.1 Biomedical Knowledge Graph
2.5.2 RAG with Knowledge Graphs
2.6 Text-to-SQL
Chapter 3 Materials and Methods
3.1 Chatbot System Architecture
3.1.1 Overview
3.1.2 Chat History Management
3.1.3 Language Detection
3.1.4 Data Retrieval and Response Generation
3.1.5 Implementation Details
3.2 MSCare Design and Evaluation
3.2.1 MSCare Data Sources
3.2.2 MSCare Text Embedding Construction and Retrieval
3.2.3 MSCare Knowledge Graph Construction
3.2.4 MSCare Knowledge Graph Retrieval
3.2.5 MSCare Response Generation
3.2.6 MSCare Evaluation
3.3 TWHM Chatbot Design and Evaluation
3.3.1 TWHM Data Source
3.3.2 TWHM Data Storage
3.3.3 TWHM Data Retrieval
3.3.4 TWHM Chatbot Response Generation
3.3.5 TWHM Chatbot Evaluation
Chapter 4 MSCare Results
4.1 Text Embedding-Based Retrieval
4.1.1 Text Chunk Statistics
4.1.2 Text Embedding Retrieval Evaluation
4.2 MSC Knowledge Graph
4.2.1 Graph Statistics
4.2.2 Graph Entity Connectivity and Topology
4.2.3 Knowledge Graph Retrieval Evaluation
4.2.4 Reasoning with Paths in the Knowledge Graph
4.3 MSCare Response Quality Evaluation
4.4 MSCare Response Case Study
Chapter 5 TWHM Chatbot Results
5.1 TWHM Database Statistics
5.2 Text-to-SQL Evaluation
5.2.1 Number of Generated Queries per Question
5.2.2 Refiner Invocation Frequency
5.2.3 Query Quality Evaluation
5.3 TWHM Chatbot Response Case Study
Chapter 6 Discussion
6.1 Text Embedding
6.1.1 Relevance of Retrieved Text Chunks
6.1.2 Text Chunking Strategies and Chunk Size
6.2 Knowledge Graph
6.2.1 MSC Knowledge Graph
6.2.2 Graph Construction
6.2.3 Graph Retrieval Strategies
6.2.4 Indirect Relationship Inference
6.2.5 Response Presentation
6.3 Text-to-SQL
6.3.1 Database Schema Selection and Representation
6.3.2 SQL Quality Dependency on the Choices of LLMs
6.3.3 Extensibility of the Text-to-SQL Approach Developed in this Study
6.4 Limitations
6.4.1 Response Generation
6.4.2 Evaluation Methods
Chapter 7 Conclusion
Appendices
References
| -
dc.language.iso | en | -
dc.subject | 大型語言模型 | zh_TW
dc.subject | 自然語言轉SQL | zh_TW
dc.subject | 知識圖譜 | zh_TW
dc.subject | 語意搜尋 | zh_TW
dc.subject | 檢索增強生成 | zh_TW
dc.subject | 生物資料庫 | zh_TW
dc.subject | semantic search | en
dc.subject | retrieval-augmented generation | en
dc.subject | large language models | en
dc.subject | biological databases | en
dc.subject | text-to-SQL | en
dc.subject | knowledge graph | en
dc.title | 與生醫資料庫對話:透過自然語言轉SQL、文字嵌入及知識圖譜方法探索檢索增強生成的應用 | zh_TW
dc.title | Communicating with Biomedical Databases: Exploring Retrieval-Augmented Generation via Text-to-SQL, Text Embedding, and Knowledge Graph-Based Approaches | en
dc.type | Thesis | -
dc.date.schoolyear | 113-2 | -
dc.description.degree | Master's (碩士) | -
dc.contributor.oralexamcommittee | 莊樹諄;黃瀚萱;林泰元;陳淑華 | zh_TW
dc.contributor.oralexamcommittee | Trees-Juen Chuang; Hen-Hsen Huang; Thai-Yen Ling; Shu-Hwa Chen | en
dc.subject.keyword | 生物資料庫,大型語言模型,檢索增強生成,語意搜尋,知識圖譜,自然語言轉SQL | zh_TW
dc.subject.keyword | biological databases, large language models, retrieval-augmented generation, semantic search, knowledge graph, text-to-SQL | en
dc.relation.page | 142 | -
dc.identifier.doi | 10.6342/NTU202501236 | -
dc.rights.note | Authorization granted (open access worldwide) | -
dc.date.accepted | 2025-06-20 | -
dc.contributor.author-college | 生命科學院 | -
dc.contributor.author-dept | 基因體與系統生物學學位學程 | -
dc.date.embargo-lift | 2025-07-03 | -
Appears in collection: 基因體與系統生物學學位學程

Files in this item:
File | Size | Format
ntu-113-2.pdf | 10.3 MB | Adobe PDF


Items in this repository are protected by copyright, with all rights reserved, unless their copyright terms are otherwise specified.
