Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/84146
Full metadata record
DC Field: Value [Language]
dc.contributor.advisor: 鄭卜壬 (Pu-Jen Cheng)
dc.contributor.author: Cheng-An Hsieh [en]
dc.contributor.author: 解正安 [zh_TW]
dc.date.accessioned: 2023-03-19T22:05:21Z
dc.date.copyright: 2022-07-13
dc.date.issued: 2022
dc.date.submitted: 2022-07-11
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/84146
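dc.description.abstract: Multimodal learning is a recent challenge that extends earlier unimodal learning to multiple modalities, such as text, images, or speech. This extension requires models to process and understand information from multiple modalities. In information retrieval, traditional retrieval tasks focus on the similarity between unimodal documents and queries, while image-text retrieval assumes that most texts contain only information about the image. This distinction ignores the fact that real-world queries may involve text content, image concepts, or both. To address this, we introduce multimodal retrieval based on text and images (Mr. Right), which includes a novel and robust multimodal retrieval dataset. We use a Wikipedia dataset with rich text and images and generate three types of queries: text-related, image-related, and mixed. To validate the effectiveness of our dataset, we provide a multimodal training approach and evaluate previous text retrieval and image retrieval models. The results show that the proposed multimodal retrieval can improve retrieval performance. However, creating multimodal information that contains both text and images remains a challenge. We hope Mr. Right will help us better broaden current retrieval systems and contribute to the advancement of multimodal learning in information retrieval. [zh_TW, translated]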
dc.description.abstract: Multimodal learning is a recent challenge that extends unimodal learning by generalizing its domain to diverse modalities, such as texts, images, or speech. This extension requires models to process and relate information from multiple modalities. In Information Retrieval, traditional retrieval tasks focus on the similarity between unimodal documents and queries, while image-text retrieval hypothesizes that most texts contain the scene context from images. This separation ignores the fact that real-world queries may involve text content, image captions, or both. To address this, we introduce Multimodal Retrieval on Representation of ImaGe witH Text (Mr. Right), a novel and robust dataset for multimodal retrieval. We utilize the Wikipedia dataset with rich text-image examples and generate three types of queries: text-related, image-related, and mixed. To validate the effectiveness of our dataset, we provide a multimodal training paradigm and evaluate previous text retrieval and image retrieval frameworks. The results show that the proposed multimodal retrieval can improve retrieval performance, but creating a well-unified document representation with texts and images is still a challenge. We hope Mr. Right allows us to better broaden current retrieval systems and contributes to accelerating the advancement of multimodal learning in Information Retrieval. [en]
dc.description.provenance: Made available in DSpace on 2023-03-19T22:05:21Z (GMT). No. of bitstreams: 1. U0001-0807202217184200.pdf: 14878863 bytes, checksum: a7b13533a1fe523d169feaafa6c00a68 (MD5). Previous issue date: 2022 [en]
dc.description.tableofcontents:
Acknowledgements
Abstract (Chinese)
Abstract
Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 background
Chapter 2 Related Work
  2.1 Retrieval dataset
  2.2 Retrieval model
Chapter 3 The Mr. Right Dataset
  3.1 Data Collection
  3.2 Annotated Query Validation
    3.2.1 Dataset Analysis
    3.2.2 Benchmark
    3.2.3 Evaluation Metrics
Chapter 4 Multimodal Retrieval
  4.1 Retrieval Formulation
  4.2 Model Architecture
  4.3 Training Objectives
Chapter 5 Experiments
  5.1 Dataset
  5.2 Baselines
  5.3 Experiment Setup
  5.4 Results and Analysis
Chapter 6 Conclusion
References
Appendix A: Supplementary Materials
Appendix B: Datasheets
  B.1 Motivation
  B.2 Collection Process
  B.3 Filtering
  B.4 Usage
Appendix C: Evaluation Details
  C.1 Benchmark tasks
  C.2 Baseline models
  C.3 Grad-CAM visualizations
  C.4 Failed examples
Appendix D: Dataset License
dc.language.iso: zh-TW
dc.subject: Image Retrieval (圖片檢索) [zh_TW]
dc.subject: Information Retrieval (資訊檢索) [zh_TW]
dc.subject: Document Retrieval (文檔檢索) [zh_TW]
dc.subject: Multimodal (多模態) [zh_TW]
dc.subject: Multimodal Retrieval [en]
dc.subject: Text Retrieval [en]
dc.subject: Image Retrieval [en]
dc.subject: Information Retrieval [en]
dc.title: Multimodal Retrieval of Images and Text (圖片與文字的多模態檢索) [zh_TW]
dc.title: Mr. Right: Multimodal Retrieval on Representation of ImaGe witH Text [en]
dc.type: Thesis
dc.date.schoolyear: 110-2
dc.description.degree: Master's
dc.contributor.oralexamcommittee: 陳信希 (Hsin-Hsi Chen), 魏志平 (Chih-Ping Wei), 高宏宇 (Hung-Yu Kao), 姜俊宇 (Jyun-Yu Jiang)
dc.subject.keyword: Multimodal (多模態), Information Retrieval (資訊檢索), Image Retrieval (圖片檢索), Document Retrieval (文檔檢索) [zh_TW]
dc.subject.keyword: Multimodal Retrieval, Information Retrieval, Image Retrieval, Text Retrieval [en]
dc.relation.page: 49
dc.identifier.doi: 10.6342/NTU202201355
dc.rights.note: Authorization granted (access restricted to campus)
dc.date.accepted: 2022-07-11
dc.contributor.author-college: College of Electrical Engineering and Computer Science [zh_TW]
dc.contributor.author-dept: Graduate Institute of Networking and Multimedia [zh_TW]
dc.date.embargo-lift: 2022-07-13
Appears in collections: Graduate Institute of Networking and Multimedia

Files in this item:
File: U0001-0807202217184200.pdf
Access: restricted to NTU campus IP addresses (off campus, please use the VPN remote-access service)
Size: 14.53 MB
Format: Adobe PDF
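The provenance field of this record gives the byte count and MD5 checksum of the PDF bitstream, so a downloaded copy can be checked against them. The following is a minimal sketch, not part of the repository's own tooling: the function name `file_matches` and the local file path in the usage note are assumptions made for illustration, while the two constants are copied from the record.

```python
import hashlib
from pathlib import Path

# Values copied from the dc.description.provenance field of this record.
EXPECTED_SIZE = 14878863
EXPECTED_MD5 = "a7b13533a1fe523d169feaafa6c00a68"


def file_matches(path, expected_size, expected_md5, chunk_size=1 << 20):
    """Return True if the file has the expected byte size and MD5 digest.

    `file_matches` is a hypothetical helper name, not a DSpace API.
    """
    p = Path(path)
    # Cheap check first: a size mismatch means the file cannot match.
    if p.stat().st_size != expected_size:
        return False
    digest = hashlib.md5()
    # Read in chunks so a large PDF is never held in memory all at once.
    with p.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.lower()
```

Assuming the PDF has been saved locally under its listed file name, `file_matches("U0001-0807202217184200.pdf", EXPECTED_SIZE, EXPECTED_MD5)` would report whether the copy agrees with the record. MD5 is used here only because it is the checksum the record publishes; it detects accidental corruption, not deliberate tampering.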
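Unless their copyright terms are specifically indicated otherwise, all items in the system are protected by copyright, with all rights reserved.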