Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92659
Title: Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation
Authors: Ji-Jia Wu
Advisor: Yung-Yu Chuang
Co-Advisor: Yen-Yu Lin
Keywords: Text-supervised learning, Semantic segmentation, Multi-modal learning, Prompt learning, Vision-language model
Publication Year: 2024
Degree: Master's
Abstract: This paper addresses text-supervised semantic segmentation, aiming to learn a model capable of segmenting arbitrary visual concepts within images by using only image-text pairs without dense annotations. Existing methods have demonstrated that contrastive learning on image-text pairs effectively aligns visual segments with the meanings of texts. We notice that there is a discrepancy between text alignment and semantic segmentation: a text often consists of multiple semantic concepts, whereas semantic segmentation strives to create semantically homogeneous segments. To address this issue, we propose a novel framework, Image-Text Co-Decomposition (CoDe), where the paired image and text are jointly decomposed into a set of image regions and a set of word segments, respectively, and contrastive learning is developed to enforce region-word alignment. To work with a vision-language model, we present a prompt learning mechanism that derives an extra representation to highlight an image segment or a word segment of interest, with which more effective features can be extracted from that segment. Comprehensive experimental results demonstrate that our method performs favorably against existing text-supervised semantic segmentation methods on six benchmark datasets. The code is available at https://github.com/072jiajia/image-text-co-decomposition.
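The abstract describes enforcing region-word alignment with contrastive learning over decomposed image regions and word segments. As a rough illustration of that kind of objective, here is a minimal NumPy sketch of a symmetric InfoNCE loss over matched region and word-segment embeddings. The function name, array shapes, and temperature value are assumptions for illustration only, not the paper's actual implementation (see the linked repository for that).

```python
import numpy as np

def region_word_contrastive_loss(regions, words, temperature=0.07):
    """Symmetric InfoNCE loss: row i of `regions` and `words` is a matched pair.

    regions, words: (n, d) embedding arrays; returns a scalar loss.
    """
    # L2-normalize the embeddings so dot products become cosine similarities
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    w = words / np.linalg.norm(words, axis=1, keepdims=True)
    logits = (r @ w.T) / temperature  # (n, n) region-word similarity matrix

    def cross_entropy_diagonal(l):
        # Cross-entropy with the matched pair (the diagonal) as the target class
        l = l - l.max(axis=1, keepdims=True)  # subtract row max for stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the region-to-word and word-to-region directions
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits.T))
```

Under this objective, each region embedding is pulled toward the word segment it was decomposed with and pushed away from the other word segments in the batch, which is what "enforce region-word alignment" amounts to in a contrastive setting.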
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92659
DOI: 10.6342/NTU202400974
Fulltext Rights: Authorized for release (worldwide open access)
Appears in Collections: Department of Computer Science and Information Engineering

Files in This Item:
ntu-112-2.pdf (2.49 MB, Adobe PDF)


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
