NTU Theses and Dissertations Repository
Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/84146
Title: 圖片與文字的多模態檢索 (Multimodal Retrieval of Images and Text)
Mr. Right: Multimodal Retrieval on Representation of ImaGe witH Text
Authors: Cheng-An Hsieh (解正安)
Advisor: Pu-Jen Cheng (鄭卜壬)
Keywords: Multimodal Retrieval, Information Retrieval, Image Retrieval, Text Retrieval
Publication Year: 2022
Degree: Master's
Abstract: Multimodal learning is a recent challenge that extends unimodal learning to diverse modalities, such as text, images, or speech. This extension requires models to process and relate information from multiple modalities. In information retrieval, traditional retrieval tasks focus on the similarity between unimodal documents and queries, while image-text retrieval assumes that most texts describe the scene content of their paired images. This separation ignores the fact that real-world queries may involve text content, image concepts, or both. To address this, we introduce Multimodal Retrieval on Representation of ImaGe witH Text (Mr. Right), a novel and robust dataset for multimodal retrieval. We use the Wikipedia dataset, with its rich text-image examples, and generate three types of queries: text-related, image-related, and mixed. To validate the effectiveness of our dataset, we provide a multimodal training paradigm and evaluate previous text-retrieval and image-retrieval frameworks. The results show that the proposed multimodal retrieval can improve retrieval performance, but creating a well-unified document representation of texts and images remains a challenge. We hope Mr. Right helps broaden current retrieval systems and contributes to accelerating the advancement of multimodal learning in information retrieval.
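
The retrieval setup the abstract describes can be illustrated with a small sketch: each document gets a single fused representation built from its text and image embeddings, and a query of any type (text-related, image-related, or mixed) is ranked against all documents by cosine similarity. This is a minimal illustration assuming pre-computed embeddings; the weighted-average fusion and all function names here are hypothetical stand-ins, not the thesis's actual model.

import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    # Normalize rows to unit length so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def fuse_document(text_emb: np.ndarray, image_emb: np.ndarray,
                  alpha: float = 0.5) -> np.ndarray:
    # Combine a document's text and image embeddings into one multimodal
    # representation. A fixed weighted average is used purely for
    # illustration; a learned fusion layer is a common alternative.
    return l2_normalize(alpha * text_emb + (1.0 - alpha) * image_emb)

def rank_documents(query_emb: np.ndarray, doc_embs: np.ndarray) -> np.ndarray:
    # Return document indices sorted best-first by cosine similarity, so a
    # query can match text content, image content, or both.
    scores = doc_embs @ l2_normalize(query_emb)
    return np.argsort(-scores)

# Toy usage: three documents in a 4-dimensional embedding space.
rng = np.random.default_rng(0)
text_embs = l2_normalize(rng.normal(size=(3, 4)))
image_embs = l2_normalize(rng.normal(size=(3, 4)))
docs = np.stack([fuse_document(t, i) for t, i in zip(text_embs, image_embs)])
query = rng.normal(size=4)
print(rank_documents(query, docs))  # prints document indices, best match first

In practice the encoders and the fusion would be trained jointly, e.g. with a contrastive objective over the three query types, rather than fixed as in this toy example.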
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/84146
DOI: 10.6342/NTU202201355
Fulltext Rights: Authorized (access limited to campus network)
Embargo Lift Date: 2022-07-13
Appears in Collections: Graduate Institute of Networking and Multimedia

Files in This Item:
File: U0001-0807202217184200.pdf (14.53 MB, Adobe PDF; access limited to NTU IP range)


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
