Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/94284
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 鄭卜壬 | zh_TW |
dc.contributor.advisor | Pu-Jen Cheng | en |
dc.contributor.author | 柳宇澤 | zh_TW |
dc.contributor.author | Yu-Ze Liu | en |
dc.date.accessioned | 2024-08-15T16:37:02Z | - |
dc.date.available | 2024-08-16 | - |
dc.date.copyright | 2024-08-15 | - |
dc.date.issued | 2024 | - |
dc.date.submitted | 2024-08-05 | - |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/94284 | - |
dc.description.abstract | 信息檢索(IR)已經被研究了很長一段時間。為解決IR問題,提出了許多方法,這些方法大致分為兩個方向:統計方法和深度學習方法。統計方法通常利用詞語的分佈來計算查詢與文檔的相似性,而深度學習模型則傾向於學習編碼器,並將查詢和文檔投射到向量空間中進行檢索。隨著生成性深度學習模型的出現,生成性信息檢索(Generative IR)引起了越來越多的關注。生成性信息檢索為解決信息檢索問題提供了新視角,並且透過生成模型直接生成文檔的標示符,減少了在推理過程中計算相似性所需的複雜度,該複雜度極大地受語料庫規模的影響。然而,現有方法面臨兩個問題:(1)當文檔僅用一個語義標識符(ID)表示時,檢索模型可能無法捕捉到文檔多方面且複雜的內容;(2)當生成的訓練數據存在語義模糊時,檢索模型可能難以區分相似文檔內容之間的差異。為了解決這些問題,我們提出了Multi-DSI,旨在(1)提供多個非確定性的語義標識符(Non-deterministic Semantic Identifier);(2)對齊查詢和文檔的概念以避免模糊性。在兩個基準數據集上的大量實驗表明,所提出的模型比基線方法顯著提高了7.4%的性能。 | zh_TW |
dc.description.abstract | Information retrieval (IR) has been studied for a long time, and many methods have been proposed to tackle IR problems. They fall roughly into two directions: statistical methods and deep learning methods. While statistical methods usually exploit the distribution of words to compute query-document similarities, deep learning models tend to learn encoders that project queries and documents into a vector space for retrieval. With the advent of generative deep learning models, generative IR has gained increasing attention: it offers a new perspective on retrieval by directly generating document identifiers with a generative model, avoiding the inference-time similarity computation whose cost grows with the size of the corpus. However, existing methods face two issues: (1) when a document is represented by a single semantic ID, the retrieval model may fail to capture the multifaceted and complex content of the document; and (2) when the generated training data exhibits semantic ambiguity, the retrieval model may struggle to distinguish the differences in the content of similar documents. To address these issues, we propose Multi-DSI, which (1) offers multiple non-deterministic semantic identifiers and (2) aligns the concepts of queries and documents to avoid ambiguity. Extensive experiments on two benchmark datasets demonstrate that the proposed model significantly outperforms baseline methods by 7.4%. (A rough illustrative sketch of the multiple-identifier idea appears after the metadata table below.) | en
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-08-15T16:37:02Z No. of bitstreams: 0 | en |
dc.description.provenance | Made available in DSpace on 2024-08-15T16:37:02Z (GMT). No. of bitstreams: 0 | en |
dc.description.tableofcontents | Acknowledgements i
摘要 iii
ABSTRACT iv
CONTENTS vii
LIST OF FIGURES ix
LIST OF TABLES xi
Chapter 1 Introduction 1
Chapter 2 Related Work 4
2.1 Dense Retrieval 5
2.1.1 Cross-encoder architecture 5
2.1.2 Dual-encoder architecture 5
2.2 Generative Retrieval 6
Chapter 3 Methodology 9
3.1 Problem Statement 9
3.2 Model Overview 9
3.3 Concept Alignment with Document Information 9
3.4 Components in Multi-DSI 11
3.4.1 Document-aligned Concept Construction 11
3.4.2 Non-deterministic Concept-aware Semantic Identifier (CSID) 12
3.5 Model Training 14
3.6 Retrieval with a User Query 15
Chapter 4 Experiments 16
4.1 Experimental Settings 16
4.1.1 Datasets 16
4.1.2 Baseline Methods 17
4.1.3 Evaluation Metrics 17
4.1.4 Implementation Details 18
4.2 Experimental Results & Discussion 18
4.2.1 Comparison of Retrieval Performance 18
4.2.2 Ablation Study 19
4.2.3 Effectiveness of Multiple Document IDs 19
4.2.4 Need of Concept Alignment 21
4.2.5 Study on k of K-means used for CSID construction 22
4.2.6 Study on clustering_evoke_size used for CSID construction 22
Chapter 5 Conclusion 25
References 27 | -
dc.language.iso | en | - |
dc.title | Multi-DSI: 可微搜尋索引的非確定性標識符和概念對齊 | zh_TW |
dc.title | Multi-DSI: Non-deterministic Identifier and Concept Alignment for Differentiable Search Index | en |
dc.type | Thesis | - |
dc.date.schoolyear | 112-2 | - |
dc.description.degree | Master's | -
dc.contributor.oralexamcommittee | 陳信希;陳縕儂;姜俊宇 | zh_TW |
dc.contributor.oralexamcommittee | Hsin-Hsi Chen;Yun-Nung Chen;Jyun-Yu Jiang | en |
dc.subject.keyword | 生成式資訊檢索,可微分搜尋索引,查詢生成應用,概念對齊,多重索引點檢索 | zh_TW |
dc.subject.keyword | Generative Information Retrieval, Differentiable Search Index, Query Generation, Concept Alignment, Multiple Indexing Point Retrieval | en |
dc.relation.page | 32 | - |
dc.identifier.doi | 10.6342/NTU202402960 | - |
dc.rights.note | Authorization granted (open access worldwide) | -
dc.date.accepted | 2024-08-08 | - |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | -
dc.contributor.author-dept | Department of Computer Science and Information Engineering | -
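As a rough illustration of the multiple non-deterministic semantic identifiers described in the abstract, the sketch below assigns each document several candidate identifiers by clustering embeddings of its passages with K-means, so that different facets of a document can map to different identifiers. This is a minimal sketch under assumed design choices (passage-level embeddings, a generic `encode` function, raw K-means cluster indices used as identifier tokens); it is not the thesis's actual Multi-DSI or CSID procedure, and every function and parameter name here is hypothetical.

```python
# Minimal sketch (not the thesis's implementation): multiple semantic
# identifiers per document via K-means over passage embeddings.
# `encode` is a hypothetical stand-in for any text encoder.
from typing import Callable, Dict, List

import numpy as np
from sklearn.cluster import KMeans


def build_multi_semantic_ids(
    docs: Dict[str, List[str]],                  # doc_id -> list of passages
    encode: Callable[[List[str]], np.ndarray],   # hypothetical text encoder
    k: int = 8,                                  # number of K-means clusters
) -> Dict[str, List[int]]:
    """Assign each document the set of cluster indices its passages fall into.

    A document whose passages land in several clusters receives several
    identifiers, which is one simple way to expose multifaceted content.
    """
    all_passages = [p for passages in docs.values() for p in passages]
    embeddings = encode(all_passages)            # shape: (n_passages, dim)

    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = kmeans.fit_predict(embeddings)

    doc_ids: Dict[str, List[int]] = {}
    offset = 0
    for doc_id, passages in docs.items():
        doc_labels = labels[offset:offset + len(passages)]
        offset += len(passages)
        doc_ids[doc_id] = sorted({int(label) for label in doc_labels})
    return doc_ids


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_encode = lambda texts: rng.normal(size=(len(texts), 32))  # stand-in encoder
    corpus = {
        "d1": ["passage about topic A", "passage about topic B"],
        "d2": ["passage about topic C"],
    }
    print(build_multi_semantic_ids(corpus, fake_encode, k=2))
```

One natural way to use such a mapping is to train a sequence-to-sequence retriever on (query, identifier) pairs and count a prediction as a hit if it matches any of the target document's identifiers; the thesis's CSID construction additionally involves concept alignment and a clustering_evoke_size parameter, neither of which is modeled here.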
Appears in Collections: | Department of Computer Science and Information Engineering
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-112-2.pdf | 846.78 kB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.