Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation

吳季嘉; Ji-Jia Wu

Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92659

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	莊永裕	zh_TW
dc.contributor.advisor	Yung-Yu Chuang	en
dc.contributor.author	吳季嘉	zh_TW
dc.contributor.author	Ji-Jia Wu	en
dc.date.accessioned	2024-05-31T16:05:10Z	-
dc.date.available	2024-06-01	-
dc.date.copyright	2024-05-31	-
dc.date.issued	2024	-
dc.date.submitted	2024-05-28	-
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92659	-
dc.description.abstract	本篇論文旨在解決文本監督式語義分割問題。在這個任務中，我們希望能僅透過影像-文字配對而無需密集標註，訓練出一個語義分割模型，在圖像中對任意視覺概念進行分割。現有方法顯示，透過圖像-文字配對進行對比學習，可以有效地將影像局部與文字含義對齊。我們注意到此學習方式存在問題：一段文字通常包含多個語義概念，而語義分割則傾向於針對單一物件進行分割。為解決此問題，我們提出了一個新框架，Image-Text Co-Decomposition （CoDe），在此框架中，配對的圖像與文字被共同分解為一組影像區域和文字片段的配對，並透過對比學習來強化影像區域與文字片段之間的對齊。此外，我們提出了一種提示學習機制，目的是強調影像和文字中分割出的影像區段或文字片段，從而使視覺語言模型能夠對這些影像區域和文字片段提取出更有效的特徵。實驗結果顯示，我們的方法在六個數據集上相較於現有的文本監督式語義分割方法較為有效。我們將程式碼公開在https://github.com/072jiajia/image-text-co-decomposition。	zh_TW
dc.description.abstract	This paper addresses text-supervised semantic segmentation, aiming to learn a model capable of segmenting arbitrary visual concepts within images by using only image-text pairs without dense annotations. Existing methods have demonstrated that contrastive learning on image-text pairs effectively aligns visual segments with the meanings of texts. We notice that there is a discrepancy between text alignment and semantic segmentation: A text often consists of multiple semantic concepts, whereas semantic segmentation strives to create semantically homogeneous segments. To address this issue, we propose a novel framework, Image-Text Co-Decomposition (CoDe), where the paired image and text are jointly decomposed into a set of image regions and a set of word segments, respectively, and contrastive learning is developed to enforce region-word alignment. To work with a vision-language model, we present a prompt learning mechanism that derives an extra representation to highlight an image segment or a word segment of interest, with which more effective features can be extracted from that segment. Comprehensive experimental results demonstrate that our method performs favorably against existing text-supervised semantic segmentation methods on six benchmark datasets. The code is available at https://github.com/072jiajia/image-text-co-decomposition.	en
dc.description.provenance	Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-05-31T16:05:10Z No. of bitstreams: 0	en
dc.description.provenance	Made available in DSpace on 2024-05-31T16:05:10Z (GMT). No. of bitstreams: 0	en
dc.description.tableofcontents	Acknowledgements; 摘要 (Abstract in Chinese); Abstract; Contents; List of Figures; List of Tables; 1 Introduction; 1.1 Background and Motivation; 1.2 Contributions; 1.3 Publication; 2 Related Works; 2.1 Open-Vocabulary Semantic Segmentation; 2.2 Text-Supervised Semantic Segmentation; 2.3 Prompt Tuning for Vision-Language Models; 3 Methodology; 3.1 Method Overview; 3.2 Image-Text Co-Segmentation; 3.3 Region-Word Highlighting; 3.4 Region-Word Alignment; 3.5 Implementation Details; 4 Experiments; 4.1 Datasets and Evaluation Settings; 4.2 Quantitative Comparisons; 4.3 Qualitative Results; 4.4 Ablation Study; 4.5 Ablation Study Visualization; 4.6 Multi-Noun Queries; 4.7 Failure case visualization; 5 Conclusions; References	-
dc.language.iso	en	-
dc.subject	文本監督式學習	zh_TW
dc.subject	語意分割	zh_TW
dc.subject	多模態學習	zh_TW
dc.subject	提示學習	zh_TW
dc.subject	視覺-語言模型	zh_TW
dc.subject	Semantic segmentation	en
dc.subject	Text-supervised learning	en
dc.subject	Vision-language model	en
dc.subject	Prompt learning	en
dc.subject	Multi-modal learning	en
dc.title	基於影像與文字共同分解之文本監督式語意分割	zh_TW
dc.title	Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation	en
dc.type	Thesis	-
dc.date.schoolyear	112-2	-
dc.description.degree	Master's	-
dc.contributor.coadvisor	林彥宇	zh_TW
dc.contributor.coadvisor	Yen-Yu Lin	en
dc.contributor.oralexamcommittee	孫民;陳尚澤;劉育綸	zh_TW
dc.contributor.oralexamcommittee	Min Sun;Shang-Tse Chen;Yu-Lun Liu	en
dc.subject.keyword	文本監督式學習,語意分割,多模態學習,提示學習,視覺-語言模型	zh_TW
dc.subject.keyword	Text-supervised learning,Semantic segmentation,Multi-modal learning,Prompt learning,Vision-language model	en
dc.relation.page	37	-
dc.identifier.doi	10.6342/NTU202400974	-
dc.rights.note	Authorization granted (open access worldwide)	-
dc.date.accepted	2024-05-29	-
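dc.contributor.author-college	College of Electrical Engineering and Computer Science	-
dc.contributor.author-dept	Department of Computer Science and Information Engineering	-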
Appears in Collections:	Department of Computer Science and Information Engineering
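
Illustrative note on the abstract: the English abstract in the record above states that contrastive learning is used to enforce region-word alignment. The snippet below is only a minimal sketch of what such an objective can look like, written as a generic symmetric InfoNCE loss in NumPy. It is not taken from the thesis or its repository; the function name `region_word_contrastive_loss`, the temperature value 0.07, and the toy feature shapes are all assumptions made for illustration, and the loss actually used by CoDe may differ.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale each feature vector to unit length so dot products are cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def region_word_contrastive_loss(region_feats, word_feats, temperature=0.07):
    """Generic symmetric InfoNCE loss (hypothetical helper, not the thesis code).

    region_feats, word_feats: arrays of shape (N, D); row i of each array is
    assumed to be a matched region / word-segment pair.
    """
    r = l2_normalize(region_feats)
    w = l2_normalize(word_feats)
    logits = r @ w.T / temperature           # (N, N) similarity matrix
    idx = np.arange(logits.shape[0])         # matched pairs sit on the diagonal
    loss_r2w = -log_softmax(logits, axis=1)[idx, idx].mean()  # region -> word
    loss_w2r = -log_softmax(logits, axis=0)[idx, idx].mean()  # word -> region
    return 0.5 * (loss_r2w + loss_w2r)

# Toy check with random features (assumed data, for illustration only).
rng = np.random.default_rng(0)
words = rng.normal(size=(4, 16))
aligned = words + 0.01 * rng.normal(size=(4, 16))  # regions close to their words
shuffled = aligned[::-1]                           # every region paired with a wrong word
loss_aligned = region_word_contrastive_loss(aligned, words)
loss_shuffled = region_word_contrastive_loss(shuffled, words)
```

In this toy setup the loss for correctly paired features should come out lower than the loss for mismatched pairs, which is the behavior a region-word alignment objective relies on.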

Files in This Item:

File	Size	Format
ntu-112-2.pdf	2.49 MB	Adobe PDF	View/Open

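Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.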