NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92659
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | 莊永裕 | zh_TW
dc.contributor.advisor | Yung-Yu Chuang | en
dc.contributor.author | 吳季嘉 | zh_TW
dc.contributor.author | Ji-Jia Wu | en
dc.date.accessioned | 2024-05-31T16:05:10Z | -
dc.date.available | 2024-06-01 | -
dc.date.copyright | 2024-05-31 | -
dc.date.issued | 2024 | -
dc.date.submitted | 2024-05-28 | -
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92659 | -
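dc.description.abstract | This thesis addresses the problem of text-supervised semantic segmentation. In this task, the goal is to train a semantic segmentation model that can segment arbitrary visual concepts in images using only image-text pairs, without dense annotations. Existing methods show that contrastive learning on image-text pairs can effectively align image regions with the meaning of text. We observe a problem with this way of learning: a piece of text usually contains multiple semantic concepts, whereas semantic segmentation tends to segment individual objects. To address this, we propose a new framework, Image-Text Co-Decomposition (CoDe), in which a paired image and text are jointly decomposed into a set of paired image regions and word segments, and contrastive learning is used to strengthen the alignment between image regions and word segments. In addition, we propose a prompt learning mechanism that highlights the segmented image regions or word segments, so that the vision-language model can extract more effective features from them. Experimental results show that our method is more effective than existing text-supervised semantic segmentation methods on six datasets. The code is publicly available at https://github.com/072jiajia/image-text-co-decomposition. | zh_TW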
dc.description.abstract | This paper addresses text-supervised semantic segmentation, aiming to learn a model capable of segmenting arbitrary visual concepts within images by using only image-text pairs without dense annotations. Existing methods have demonstrated that contrastive learning on image-text pairs effectively aligns visual segments with the meanings of texts. We notice that there is a discrepancy between text alignment and semantic segmentation: a text often consists of multiple semantic concepts, whereas semantic segmentation strives to create semantically homogeneous segments. To address this issue, we propose a novel framework, Image-Text Co-Decomposition (CoDe), where the paired image and text are jointly decomposed into a set of image regions and a set of word segments, respectively, and contrastive learning is developed to enforce region-word alignment. To work with a vision-language model, we present a prompt learning mechanism that derives an extra representation to highlight an image segment or a word segment of interest, with which more effective features can be extracted from that segment. Comprehensive experimental results demonstrate that our method performs favorably against existing text-supervised semantic segmentation methods on six benchmark datasets. The code is available at https://github.com/072jiajia/image-text-co-decomposition. | en
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-05-31T16:05:10Z. No. of bitstreams: 0 | en
dc.description.provenance | Made available in DSpace on 2024-05-31T16:05:10Z (GMT). No. of bitstreams: 0 | en
dc.description.tableofcontents | Acknowledgements
Abstract (in Chinese)
Abstract
Contents
List of Figures
List of Tables
1 Introduction
1.1 Background and Motivation
1.2 Contributions
1.3 Publication
2 Related Works
2.1 Open-Vocabulary Semantic Segmentation
2.2 Text-Supervised Semantic Segmentation
2.3 Prompt Tuning for Vision-Language Models
3 Methodology
3.1 Method Overview
3.2 Image-Text Co-Segmentation
3.3 Region-Word Highlighting
3.4 Region-Word Alignment
3.5 Implementation Details
4 Experiments
4.1 Datasets and Evaluation Settings
4.2 Quantitative Comparisons
4.3 Qualitative Results
4.4 Ablation Study
4.5 Ablation Study Visualization
4.6 Multi-Noun Queries
4.7 Failure Case Visualization
5 Conclusions
References | -
dc.language.iso | en | -
dc.subject | Text-supervised learning | zh_TW
dc.subject | Semantic segmentation | zh_TW
dc.subject | Multi-modal learning | zh_TW
dc.subject | Prompt learning | zh_TW
dc.subject | Vision-language model | zh_TW
dc.subject | Semantic segmentation | en
dc.subject | Text-supervised learning | en
dc.subject | Vision-language model | en
dc.subject | Prompt learning | en
dc.subject | Multi-modal learning | en
dc.title | 基於影像與文字共同分解之文本監督式語意分割 | zh_TW
dc.title | Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation | en
dc.type | Thesis | -
dc.date.schoolyear | 112-2 | -
dc.description.degree | Master's | -
dc.contributor.coadvisor | 林彥宇 | zh_TW
dc.contributor.coadvisor | Yen-Yu Lin | en
dc.contributor.oralexamcommittee | 孫民; 陳尚澤; 劉育綸 | zh_TW
dc.contributor.oralexamcommittee | Min Sun; Shang-Tse Chen; Yu-Lun Liu | en
dc.subject.keyword | Text-supervised learning, Semantic segmentation, Multi-modal learning, Prompt learning, Vision-language model | zh_TW
dc.subject.keyword | Text-supervised learning, Semantic segmentation, Multi-modal learning, Prompt learning, Vision-language model | en
dc.relation.page | 37 | -
dc.identifier.doi | 10.6342/NTU202400974 | -
dc.rights.note | Authorization granted (open access worldwide) | -
dc.date.accepted | 2024-05-29 | -
dc.contributor.author-college | College of Electrical Engineering and Computer Science | -
dc.contributor.author-dept | Department of Computer Science and Information Engineering | -
Appears in collections: Department of Computer Science and Information Engineering

Files in this item:
File | Size | Format
ntu-112-2.pdf | 2.49 MB | Adobe PDF
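
The abstract in this record describes using contrastive learning to enforce region-word alignment. As a rough illustration only, and not the thesis implementation, the sketch below computes a generic symmetric InfoNCE loss over paired region and word-segment embeddings. The function names, the use of cosine similarity, and the temperature of 0.07 are assumptions made for this example; the record does not specify them. The authors' actual code is at the GitHub URL given in the abstract.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def cosine_similarity_matrix(regions, words):
    """Entry [i][j] is the cosine similarity of region i and word segment j."""
    unit_regions = [l2_normalize(v) for v in regions]
    unit_words = [l2_normalize(v) for v in words]
    return [
        [sum(a * b for a, b in zip(r, w)) for w in unit_words]
        for r in unit_regions
    ]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(regions, words, temperature=0.07):
    """Generic symmetric InfoNCE loss; regions[i] is assumed paired with words[i].

    The temperature default is an illustrative assumption, not a value
    taken from the thesis.
    """
    if len(regions) != len(words):
        raise ValueError("regions and words must contain the same number of items")
    similarity = cosine_similarity_matrix(regions, words)
    logits = [[s / temperature for s in row] for row in similarity]
    transposed = [list(column) for column in zip(*logits)]
    region_to_word = cross_entropy_diagonal(logits)
    word_to_region = cross_entropy_diagonal(transposed)
    return 0.5 * (region_to_word + word_to_region)


if __name__ == "__main__":
    # Toy 2-D embeddings: each region points the same way as its paired word.
    toy_regions = [[1.0, 0.0], [0.0, 1.0]]
    toy_words = [[2.0, 0.0], [0.0, 3.0]]
    print(symmetric_info_nce(toy_regions, toy_words))
```

The loss is small when each region embedding is most similar to its own word segment and large when the pairing is scrambled, which is the behavior a contrastive alignment objective is meant to reward.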
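Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.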
