Multi-DSI: Non-deterministic Identifier and Concept Alignment for Differentiable Search Index

柳宇澤; Yu-Ze Liu

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/94284

Title:	Multi-DSI: Non-deterministic Identifier and Concept Alignment for Differentiable Search Index
Author:	柳宇澤 (Yu-Ze Liu)
Advisor:	鄭卜壬 (Pu-Jen Cheng)
Keywords:	Generative Information Retrieval, Differentiable Search Index, Query Generation, Concept Alignment, Multiple Indexing Point Retrieval
Year of publication:	2024
Degree:	Master's
Abstract:	Information retrieval (IR) has been studied for a long time, and many methods have been proposed to tackle it. They fall roughly into two directions: statistical methods and deep learning methods. Statistical methods usually use the distribution of words to compute the similarity between queries and documents, whereas deep learning models tend to learn encoders that project queries and documents into a vector space for retrieval. With the advent of generative deep learning models, generative IR has gained increasing attention. Generative IR offers a new perspective on the retrieval problem: by having a generative model directly generate document identifiers, it reduces the cost of computing similarities at inference time, a cost that depends heavily on corpus size. However, existing methods face two issues: (1) when a document is represented by a single semantic ID, the retrieval model may fail to capture the multifaceted and complex content of the document; and (2) when the generated training data is semantically ambiguous, the retrieval model may struggle to distinguish between the contents of similar documents. To address these issues, we propose Multi-DSI, which (1) offers multiple non-deterministic semantic identifiers and (2) aligns the concepts of queries and documents to avoid ambiguity. Extensive experiments on two benchmark datasets demonstrate that the proposed model significantly outperforms baseline methods by 7.4%.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/94284
DOI:	10.6342/NTU202402960
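Full-text authorization:	Authorized (publicly available worldwide)
Appears in department:	Department of Computer Science and Information Engineering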

Files in this document:

File	Size	Format
ntu-112-2.pdf	846.78 kB	Adobe PDF


Unless specific copyright terms are indicated, all items in this system are protected by copyright, with all rights reserved.
