Multi-DSI: Non-deterministic Identifier and Concept Alignment for Differentiable Search Index

柳宇澤; Yu-Ze Liu

Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/94284

Title:	Multi-DSI: Non-deterministic Identifier and Concept Alignment for Differentiable Search Index
Authors:	柳宇澤 Yu-Ze Liu
Advisor:	鄭卜壬 Pu-Jen Cheng
Keyword:	Generative Information Retrieval, Differentiable Search Index, Query Generation, Concept Alignment, Multiple Indexing Point Retrieval
Publication Year:	2024
Degree:	Master's
Abstract:	Information retrieval (IR) has been studied for a long time, and many methods have been proposed to tackle IR problems. They fall roughly into two directions: statistical methods and deep learning methods. Statistical methods usually use the distribution of words to compute the similarity between queries and documents, whereas deep learning models tend to learn encoders that project queries and documents into a vector space for retrieval. With the advent of generative deep learning models, generative IR has gained increasing attention. It offers a new perspective on IR: a generative model directly generates document identifiers, which reduces the cost of computing similarities at inference time, a cost that depends heavily on corpus size. However, existing methods face two issues: (1) when a document is represented by a single semantic identifier (ID), the retrieval model may fail to capture the document's multifaceted and complex content; and (2) when the generated training data is semantically ambiguous, the retrieval model may struggle to distinguish between similar documents. To address these issues, we propose Multi-DSI, which (1) provides multiple non-deterministic semantic identifiers and (2) aligns the concepts of queries and documents to avoid ambiguity. Extensive experiments on two benchmark datasets show that the proposed model significantly outperforms baseline methods by 7.4%.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/94284
DOI:	10.6342/NTU202402960
Fulltext Rights:	Authorization granted (open to the public worldwide)
Appears in Collections:	Department of Computer Science and Information Engineering

Files in This Item:

File	Size	Format
ntu-112-2.pdf	846.78 kB	Adobe PDF
