Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation

吳季嘉; Ji-Jia Wu

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92659

Title:	基於影像與文字共同分解之文本監督式語意分割 Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation
Author:	吳季嘉 Ji-Jia Wu
Advisor:	莊永裕 Yung-Yu Chuang
Co-advisor:	林彥宇 Yen-Yu Lin
Keywords:	Text-supervised learning, Semantic segmentation, Multi-modal learning, Prompt learning, Vision-language model
Publication year:	2024
Degree:	Master's
Abstract:	This thesis addresses text-supervised semantic segmentation: learning a model that can segment arbitrary visual concepts in images using only image-text pairs, without dense annotations. Existing methods have shown that contrastive learning on image-text pairs effectively aligns visual segments with the meanings of texts. We observe a discrepancy between text alignment and semantic segmentation: a text often contains multiple semantic concepts, whereas semantic segmentation strives to create semantically homogeneous segments. To address this issue, we propose a novel framework, Image-Text Co-Decomposition (CoDe), in which a paired image and text are jointly decomposed into a set of image regions and a set of word segments, respectively, and contrastive learning is used to enforce region-word alignment. To work with a vision-language model, we present a prompt learning mechanism that derives an extra representation to highlight an image segment or a word segment of interest, so that more effective features can be extracted from that segment. Comprehensive experimental results demonstrate that our method performs favorably against existing text-supervised semantic segmentation methods on six benchmark datasets. The code is available at https://github.com/072jiajia/image-text-co-decomposition.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92659
DOI:	10.6342/NTU202400974
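Full-text authorization:	Authorized (open access worldwide)
Appears in department:	Department of Computer Science and Information Engineering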
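
The abstract says contrastive learning is used to enforce region-word alignment but, in this record, does not give the loss itself. As a rough illustration only, the sketch below uses a generic symmetric InfoNCE-style loss, which may differ from what the thesis actually uses. It assumes the i-th image region is paired with the i-th word segment. The function names, the plain-list feature format, and the temperature of 0.07 are all hypothetical choices made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy over rows, where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def region_word_contrastive_loss(region_feats, word_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss (illustrative assumption, not the thesis's code).

    region_feats[i] and word_feats[i] are assumed to be a matched pair;
    every other combination is treated as a negative.
    """
    regions = [l2_normalize(r) for r in region_feats]
    words = [l2_normalize(w) for w in word_feats]
    # Similarity matrix: rows are regions, columns are word segments.
    logits = [
        [sum(a * b for a, b in zip(r, w)) / temperature for w in words]
        for r in regions
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    region_to_word = diagonal_cross_entropy(logits)
    word_to_region = diagonal_cross_entropy(logits_transposed)
    return 0.5 * (region_to_word + word_to_region)
```

With toy features, matched pairs should give a much lower loss than mismatched ones: for example, `region_word_contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])` is close to zero, while swapping the two word features makes the loss large.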

Files in this item:

File	Size	Format
ntu-112-2.pdf	2.49 MB	Adobe PDF

Unless their copyright terms are specifically stated, all items in this system are protected by copyright, with all rights reserved.