Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97826

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 郭瑞祥 | zh_TW |
| dc.contributor.advisor | Ruey-Shan Guo | en |
| dc.contributor.author | 黎家愷 | zh_TW |
| dc.contributor.author | Chia-Kai Li | en |
| dc.date.accessioned | 2025-07-18T16:05:25Z | - |
| dc.date.available | 2025-07-19 | - |
| dc.date.copyright | 2025-07-18 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-07-09 | - |
| dc.identifier.citation | [1] N. Aletras and M. Stevenson. Evaluating topic coherence using distributional semantics. In Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) – Long Papers, pages 13–22, March 2013.
[2] M. Aliannejadi, H. Zamani, F. Crestani, and W. B. Croft. Asking clarifying questions in open-domain information-seeking conversations. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 475–484, 2019.
[3] F. Bianchi, S. Terragni, and D. Hovy. Pre-training is a hot topic: Contextualized document embeddings improve topic coherence. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2279–2290, 2021.
[4] D. M. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.
[5] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
[6] R. J. G. B. Campello, D. Moulavi, and J. Sander. Density-based clustering based on hierarchical density estimates. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 160–172. Springer, 2013.
[7] J. Chang, J. Boyd-Graber, S. Gerrish, C. Wang, and D. M. Blei. Reading tea leaves: How humans interpret topic models. In Advances in Neural Information Processing Systems, pages 288–296, 2009.
[8] X. Cheng, X. Yan, Y. Lan, and J. Guo. BTM: Topic modeling over short texts. IEEE Transactions on Knowledge and Data Engineering, 26(12):2928–2941, 2014.
[9] M. de Groot, M. Aliannejadi, and M. R. Haas. Experiments on generalizability of BERTopic on multi-domain short text. arXiv preprint arXiv:2212.08459, 2022.
[10] A. B. Dieng, F. J. R. Ruiz, and D. M. Blei. Topic modeling in embedding spaces. Transactions of the Association for Computational Linguistics, 8:439–453, 2020.
[11] R. Ding, R. Nallapati, and B. Xiang. Coherence-aware neural topic modeling. arXiv preprint arXiv:1809.02687, 2018.
[12] P. Finardi, L. Avila, R. Castaldoni, P. Gengo, C. Larcher, M. Piau, P. Costa, and V. Caridá. The chronicles of RAG: The retriever, the chunk and the generator. arXiv preprint arXiv:2401.07883, 2024.
[13] L. Gao, Z. Dai, and J. Callan. COIL: Revisit exact lexical match in information retrieval with contextualized inverted list. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3030–3042, Online, June 2021. Association for Computational Linguistics.
[14] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, and H. Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2023.
[15] M. Grootendorst. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794, 2022.
[16] M. Grootendorst. What is so special about BERTopic? https://newsletter.maartengrootendorst.com/p/bertopic-what-is-so-special-about, 2022. Accessed: 2025-04-07.
[17] H. Han, Y. Wang, H. Shomer, K. Guo, J. Ding, Y. Lei, M. Halappanavar, R. A. Rossi, S. Mukherjee, X. Tang, Q. He, Z. Hua, B. Long, T. Zhao, N. Shah, A. Javari, Y. Xia, and J. Tang. Retrieval-augmented generation with graphs (GraphRAG), 2025. arXiv preprint, submitted 31 Dec 2024; revised 8 Jan 2025.
[18] H. Hashemi, H. Zamani, and W. B. Croft. Guided transformer: Leveraging multiple external sources for representation learning in conversational search. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1131–1140, 2021.
[19] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50–57, Berkeley, California, USA, 1999. ACM.
[20] J. Johnson, M. Douze, and H. Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547, 2019.
[21] V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W.-t. Yih. Dense passage retrieval for open-domain question answering. In Proceedings of EMNLP 2020, pages 6769–6781, November 2020.
[22] D. Kumar. Evaluation with RAGAS. https://medium.com/@danushidk507/evaluation-with-ragas-873a574b86a9, 2023. Accessed: 2025-04-07.
[23] S. Kumar. Generative retrieval for end-to-end search systems. https://blog.reachsumit.com/posts/2023/09/generative-retrieval/, 2023. Accessed: 2025-04-02.
[24] M. Li, L. Huang, C. H. Tan, and K. K. Wei. Helpfulness of online product reviews as seen by consumers: Source and content features. International Journal of Electronic Commerce, 17(4):101–136, 2013.
[25] M. Li, M. Li, K. Xiong, and J. Lin. Multi-task dense retrieval via model uncertainty fusion for open-domain question answering. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 274–287, November 2021.
[26] J. Lin, X. Ma, S. Lin, J. Yang, R. Pradeep, and R. Nogueira. Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2356–2363, 2021.
[27] F. Luna. UMAP: An alternative dimensionality reduction technique. https://medium.com/mcd-unison/umap-an-alternative-dimensionality-reduction-technique-7a5e77e80982, 2023. Accessed: 2025-04-02.
[28] X. Ma, Y. Gong, P. He, H. Zhao, and N. Duan. Query rewriting for retrieval-augmented large language models. arXiv preprint arXiv:2305.14283, 2023.
[29] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, 1967.
[30] M. B. Mahmoud. All about latent Dirichlet allocation (LDA) in NLP. https://mohamedbakrey094.medium.com/all-about-latent-dirichlet-allocation-lda-in-nlp-6cfa7825034e, 2023. Accessed: 2025-04-02.
[31] M. Mayank. MiniLM. https://mohitmayank.com/a_lazy_data_science_guide/natural_language_processing/minilm/, 2025. Accessed: 2025-04-02.
[32] L. McInnes, J. Healy, and S. Astels. HDBSCAN: Hierarchical density based clustering. Journal of Open Source Software, 2(11):205, 2017.
[33] L. McInnes, J. Healy, and S. Astels. How HDBSCAN works. https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html, 2017. Accessed: 2025-04-02.
[34] L. McInnes, J. Healy, and J. Melville. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
[35] N. Reimers and I. Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of EMNLP-IJCNLP 2019, pages 3982–3992, 2019.
[36] S. Robertson and H. Zaragoza. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389, 2009.
[37] O. Sanseviero. Sentence embeddings: Cross-encoders and re-ranking. https://osanseviero.github.io/hackerllama/blog/posts/sentence_embeddings2/, 2024. Accessed: 2025-04-07.
[38] S. Terragni, E. Fersini, B. G. Galuzzi, P. Tropeano, and A. Candelieri. OCTIS: Comparing and optimizing topic models is simple! In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics – System Demonstrations, pages 263–270, 2021.
[39] N. Tran and D. Litman. Enhancing knowledge retrieval with topic modeling for knowledge-grounded dialogue. arXiv preprint arXiv:2405.04713, 2024.
[40] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
[41] Y. Zhu, H. Yuan, S. Wang, J. Liu, W. Liu, C. Deng, Z. Dou, J. Wen, et al. Large language models for information retrieval: A survey. arXiv preprint arXiv:2308.07107 [cs.CL], 2023. | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97826 | - |
| dc.description.abstract | 檢索增強生成(RAG)的效能在很大程度上取決於檢索階段的品質。傳統的 RAG 系統主要依賴向量相似度匹配,卻無法完美捕捉查詢與文檔之間的潛在主題結構,導致檢索結果不理想。為了解決這個問題,本研究在 RAG 的檢索階段引入主題建模(透過 BERTopic 流程),對 Amazon 產品評論進行主題聚類,並利用查詢重寫來判定查詢的主題。透過將檢索範圍限定於相關主題,我們的方法能有效過濾無關文檔,提升檢索精度,並進一步改善下游文本生成的品質。我們在多個檢索基準數據集上進行實驗評估,結果顯示,相較於傳統的向量檢索方法(Naive RAG),BERTopic 檢索策略在檢索相關度與 RAG 的整體表現方面均有顯著提升。 | zh_TW |
| dc.description.abstract | The effectiveness of Retrieval-Augmented Generation (RAG) depends largely on the quality of its retrieval stage. Traditional RAG systems rely primarily on vector similarity matching, which fails to fully capture the latent topical structure linking queries and documents, leading to suboptimal retrieval results. To address this issue, this study introduces topic modeling (via a BERTopic pipeline) into the RAG retrieval stage: it clusters Amazon product reviews by topic and uses query rewriting to determine each query's topic. By restricting the retrieval scope to relevant topics, our approach filters out irrelevant documents, improves retrieval precision, and in turn raises the quality of downstream text generation. Experiments on multiple retrieval benchmark datasets show that, compared with conventional vector-based retrieval (Naive RAG), the BERTopic retrieval strategy significantly improves both retrieval relevance and overall RAG performance. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-07-18T16:05:25Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-07-18T16:05:25Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Table of Contents
Acknowledgements
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives
1.4 Thesis Organization
Chapter 2 Literature Review
2.1 RAG and Retrieval Techniques
2.1.1 Sparse Retrieval
2.1.2 Dense Retrieval
2.1.3 Hybrid Retrieval
2.2 Limitations of Traditional Topic Models
2.2.1 Limitations on Short Texts
2.2.2 Limits of Topic Evaluation Metrics: Topic Coherence
2.3 Topic Models in Information Retrieval
2.4 Query Rewriting in Retrieval
2.4.1 Automated Evaluation of RAG Systems
2.5 Research Gap
2.6 Research Contributions
Chapter 3 Methodology
3.1 Research Process
3.2 Topic Model Construction
3.2.1 Topic Model Selection
3.2.2 The Need for Dimensionality Reduction
3.2.3 Document Embedding
3.2.4 UMAP Dimensionality Reduction
3.2.5 HDBSCAN Clustering
3.2.5.1 Core Distance Computation
3.2.5.2 Mutual Reachability Distance Matrix Construction
3.2.5.3 Hierarchical Clustering and Stability-Based Pruning
3.2.6 c-TF-IDF Topic Representation
3.3 Experimental Parameter Configuration
3.3.1 WETC: Topic Coherence Metric
3.3.2 Topic Diversity Metric
3.4 Query-to-Topic Mapping
3.4.1 Overview of the Topic-Based Retrieval Strategy
3.4.2 Determining a Query's Topic
3.4.2.1 Approximate Topic Distribution
3.4.3 Query Rewriting Strategy
3.5 Theoretical Performance Analysis of Topic Filtering
3.5.1 Sufficient Conditions for Topic Filtering to Improve Retrieval
3.5.2 Notation
Chapter 4 Experiments and Discussion
4.1 Data Source and Description
4.2 Data Cleaning Process
4.2.1 Topic Model Hyperparameter Tuning and Evaluation
4.2.1.1 Word Embedding Model
4.2.1.2 Hyperparameter Search Space Design
4.2.1.3 Model Evaluation Metrics
4.2.1.4 Best Hyperparameter Configuration
4.3 Topic Modeling
4.3.1 Dimensionality Reduction and Clustering Settings
4.3.2 Vectorization and Topic-Word Reranking
4.3.3 Topic Hierarchy and Similarity Analysis
4.3.4 Topic Summaries
4.4 Validation Set Description
4.5 Experimental Results and Analysis
4.5.1 Evaluation Metric Definitions
4.5.2 Dense Retrieval
4.5.3 Hybrid Retrieval
4.5.4 Dense Retrieval with a Reranker
4.5.5 Hybrid Retrieval with a Reranker
4.5.6 Generation Quality Analysis
4.5.7 Error Sources and Model Limitations
Chapter 5 Conclusions and Future Work
5.1 Conclusions
5.2 Future Work
References | - |
| dc.language.iso | zh_TW | - |
| dc.subject | 資訊檢索 | zh_TW |
| dc.subject | 大型語言模型 | zh_TW |
| dc.subject | 主題建模 | zh_TW |
| dc.subject | 檢索增強生成 | zh_TW |
| dc.subject | 自然語言處理 | zh_TW |
| dc.subject | Information Retrieval | en |
| dc.subject | Large Language Models | en |
| dc.subject | Natural Language Processing | en |
| dc.subject | Retrieval-Augmented Generation | en |
| dc.subject | Topic Modeling | en |
| dc.title | 主題模型強化之短文本檢索與生成品質提升 | zh_TW |
| dc.title | Enhancing Retrieval and Generation Quality for Short Texts with Topic Models | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-2 | - |
| dc.description.degree | Master's | - |
| dc.contributor.oralexamcommittee | 王子騫;藍俊宏 | zh_TW |
| dc.contributor.oralexamcommittee | Tzu-Chien Wang;Jakey Blue | en |
| dc.subject.keyword | 資訊檢索,自然語言處理,檢索增強生成,主題建模,大型語言模型 | zh_TW |
| dc.subject.keyword | Information Retrieval, Natural Language Processing, Retrieval-Augmented Generation, Topic Modeling, Large Language Models | en |
| dc.relation.page | 66 | - |
| dc.identifier.doi | 10.6342/NTU202501611 | - |
| dc.rights.note | Not authorized | - |
| dc.date.accepted | 2025-07-11 | - |
| dc.contributor.author-college | College of Engineering | - |
| dc.contributor.author-dept | Graduate Institute of Industrial Engineering | - |
| dc.date.embargo-lift | N/A | - |
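As an illustrative companion to the abstract above, the following is a minimal sketch, assuming the `bertopic`, `sentence-transformers`, and `numpy` libraries, of the topic-filtered retrieval step the thesis describes: fit BERTopic on short review texts, map a (rewritten) query to a topic, then run dense retrieval only within that topic. The embedding model name `all-MiniLM-L6-v2`, the `rewrite_query()` stub, and all hyperparameters are placeholder assumptions, not the thesis's actual configuration.

```python
# Minimal sketch (not the thesis's code) of topic-filtered RAG retrieval:
# BERTopic clusters the corpus (UMAP -> HDBSCAN -> c-TF-IDF), a rewritten
# query is mapped to a topic, and dense retrieval is restricted to it.
import numpy as np
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer


def rewrite_query(query: str) -> str:
    # Placeholder for the LLM-based query rewriting step the thesis uses
    # to make a query's topic explicit before topic mapping.
    return query


def build_index(docs: list[str]):
    # Embed documents once and reuse the embeddings for both topic
    # modeling and retrieval. Note: BERTopic/HDBSCAN needs a reasonably
    # large corpus (hundreds of documents) to form stable clusters.
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model
    embeddings = embedder.encode(docs)
    topic_model = BERTopic(embedding_model=embedder)
    topics, _ = topic_model.fit_transform(docs, embeddings)
    return embedder, embeddings, topic_model, np.array(topics)


def topic_filtered_search(query, docs, embedder, embeddings,
                          topic_model, topics, k=5):
    rewritten = rewrite_query(query)
    # Assign the rewritten query to its most likely topic.
    pred, _ = topic_model.transform([rewritten])
    topic_id = pred[0]
    # Restrict candidates to that topic; fall back to the full corpus if
    # the query lands in HDBSCAN's outlier topic (-1).
    idx = (np.flatnonzero(topics == topic_id)
           if topic_id != -1 else np.arange(len(docs)))
    # Dense retrieval by cosine similarity within the restricted pool.
    q = embedder.encode([rewritten])[0]
    cand = embeddings[idx]
    sims = cand @ q / (np.linalg.norm(cand, axis=1) * np.linalg.norm(q) + 1e-12)
    top = idx[np.argsort(-sims)[:k]]
    return [docs[i] for i in top]
```

The fallback to the full corpus for outlier queries is one possible design choice under these assumptions; the thesis's own handling of queries that map to no topic may differ.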
| Appears in Collections: | Graduate Institute of Industrial Engineering |
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-113-2.pdf (Restricted Access) | 9.03 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
