Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98081
Full metadata record
dc.contributor.advisor (zh_TW): 王勝德
dc.contributor.advisor (en): Sheng-De Wang
dc.contributor.author (zh_TW): 吳凱濠
dc.contributor.author (en): Kai-Hao Wu
dc.date.accessioned: 2025-07-24T16:07:16Z
dc.date.available: 2025-07-25
dc.date.copyright: 2025-07-24
dc.date.issued: 2025
dc.date.submitted: 2025-07-03
dc.identifier.citation:
[1] H. Caesar, J. Uijlings, and V. Ferrari. Coco-stuff: Thing and stuff classes in context. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1209–1218, 2018.
[2] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017.
[3] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
[4] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 801–818, 2018.
[5] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1290–1299, 2022.
[6] B. Cheng, A. Schwing, and A. Kirillov. Per-pixel classification is not all you need for semantic segmentation. Advances in neural information processing systems, 34:17864–17875, 2021.
[7] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88:303–338, 2010.
[8] X. Gu, T.-Y. Lin, W. Kuo, and Y. Cui. Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921, 2021.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 770–778, 2016.
[10] D. Hendrycks and K. Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
[11] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021.
[12] J. Li, D. Li, C. Xiong, and S. Hoi. Blip: Bootstrapping language-image pretraining for unified vision-language understanding and generation. In International conference on machine learning, pages 12888–12900. PMLR, 2022.
[13] F. Liang, B. Wu, X. Dai, K. Li, Y. Zhao, H. Zhang, P. Zhang, P. Vajda, and D. Marculescu. Open-vocabulary semantic segmentation with mask-adapted clip. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7061–7070, 2023.
[14] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie. A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11976–11986, 2022.
[15] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3431–3440, 2015.
[16] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 891–898, 2014.
[17] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[18] J. Xu, S. Liu, A. Vahdat, W. Byeon, X. Wang, and S. De Mello. Open-vocabulary panoptic segmentation with text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pages 2955–2966, 2023.
[19] M. Xu, Z. Zhang, F. Wei, H. Hu, and X. Bai. SAN: Side adapter network for open-vocabulary semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):15546–15561, 2023.
[20] M. Xu, Z. Zhang, F. Wei, Y. Lin, Y. Cao, H. Hu, and X. Bai. A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In Proceedings of the European conference on computer vision (ECCV), pages 736–753. Springer, 2022.
[21] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2881–2890, 2017.
[22] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 633–641, 2017.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98081
dc.description.abstract (zh_TW): Open-vocabulary semantic segmentation has become an important research topic in computer vision in recent years because real-world applications require recognizing categories that were never seen during training. Most existing mainstream methods adopt a two-stage architecture: a mask generator first produces candidate masks, which are then fed into a pre-trained contrastive image-text model for visual-language feature matching to determine the semantic category of each mask. However, such methods must repeatedly extract features from the same image during inference, which increases memory consumption and noticeably raises inference latency, limiting their efficiency in practical deployments. Although current one-stage models offer a clear inference-speed advantage over two-stage models, some of them require even more training time and resources than their two-stage counterparts.
This study addresses this problem with an efficient one-stage open-vocabulary semantic segmentation method that accelerates training while maintaining stable and reliable inference performance. A pre-trained contrastive image-text model is used directly as the backbone, its weights are frozen during training, and only the mask generator module is updated. This reduces training time and significantly lowers memory consumption. During inference, a mask validity filtering mechanism applies pooling to each mask to decide whether it is valid, preventing invalid masks from degrading the final segmentation quality. Experimental results show that the proposed architecture matches existing mainstream methods in both accuracy and speed while substantially improving inference efficiency and practicality.
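As a rough illustration of the training setup described in the abstract above (frozen pre-trained backbone, trainable mask generator only), the following is a minimal PyTorch-style sketch. The names `clip_backbone` and `mask_generator`, the choice of AdamW, and the learning rate are assumptions made for illustration, not the thesis's actual implementation details.

```python
import torch

def build_optimizer(clip_backbone: torch.nn.Module,
                    mask_generator: torch.nn.Module) -> torch.optim.Optimizer:
    # Freeze every backbone parameter so no gradients are computed
    # or stored for the pre-trained image-text model during training.
    for p in clip_backbone.parameters():
        p.requires_grad = False
    clip_backbone.eval()  # keep normalization statistics fixed

    # Only the mask generator's parameters are handed to the optimizer,
    # so the backbone weights never change.
    return torch.optim.AdamW(mask_generator.parameters(), lr=1e-4)
```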
dc.description.abstract (en): Open-vocabulary semantic segmentation has become an important research topic in the field of computer vision due to the necessity of identifying unseen categories during training in real-world applications. Existing mainstream approaches typically adopt a two-stage architecture, which first generates candidate masks using a mask generator and then feeds these masks into a pre-trained contrastive vision-language model, e.g., CLIP, to perform visual-language feature matching and determine the semantic category of each mask. However, these methods repeatedly extract features from the same image during inference, resulting in significant memory consumption and latency, thereby limiting their efficiency in practical deployment. Although current one-stage models offer significantly faster inference compared to two-stage models, some methods require even more training time and computational resources than their two-stage counterparts.
This study aims to address the above issue by adopting an efficient one-stage open-vocabulary semantic segmentation approach, with the goal of accelerating training while maintaining stable and reliable inference performance. A pre-trained contrastive vision-language model is directly employed as the backbone of the architecture, with its weights frozen during training and only the mask generator module updated. This approach effectively reduces training time and significantly lowers memory usage. During inference, a mask validity filtering mechanism based on mask pooling determines the effectiveness of each predicted mask, thereby preventing invalid masks from negatively affecting segmentation quality. Experimental results demonstrate that the proposed architecture achieves accuracy and speed comparable to those of existing mainstream methods, while substantially improving inference efficiency and practicality.
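The mask validity filtering step mentioned in the abstract can likewise be sketched. The abstract only states that pooling over each mask decides its validity; the averaging rule and the threshold below are assumptions for illustration, not the thesis's exact formulation.

```python
import torch

def filter_valid_masks(mask_logits: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Return a boolean tensor marking which of the N predicted masks to keep."""
    probs = mask_logits.sigmoid()                     # (N, H, W) per-pixel probabilities
    region = (probs > 0.5).float()                    # hard foreground region of each mask
    area = region.sum(dim=(1, 2)).clamp(min=1.0)      # avoid division by zero for empty masks
    pooled = (probs * region).sum(dim=(1, 2)) / area  # average probability inside each mask
    return pooled > threshold                         # keep only sufficiently confident masks
```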
dc.description.provenance (en): Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-07-24T16:07:16Z. No. of bitstreams: 0
dc.description.provenance (en): Made available in DSpace on 2025-07-24T16:07:16Z (GMT). No. of bitstreams: 0
dc.description.tableofcontents:
Verification Letter from the Oral Examination Committee i
Acknowledgements ii
摘要 (Chinese Abstract) iii
Abstract iv
Contents vi
List of Figures viii
List of Tables ix
Chapter 1 Introduction 1
Chapter 2 Related Work 5
2.1 Fully Supervised Semantic Segmentation 5
2.2 Vision–Language Pre-training 6
2.3 Open-Vocabulary Semantic Segmentation 8
Chapter 3 Methods 10
3.1 Overview 10
3.2 Image Encoder and Text Encoder 12
3.2.1 Image Encoder 12
3.2.2 Text Encoder 13
3.3 Mask Generator 15
3.3.1 Pixel Decoder 16
3.3.2 Transformer Encoder 17
3.3.3 Transformer Decoder 18
3.4 Open-vocabulary Inference 20
Chapter 4 Experiments 22
4.1 Datasets and Evaluation 22
4.2 Implementation Details 24
4.3 Experiment Results 25
4.4 Qualitative Results 28
Chapter 5 Ablation Study 30
5.1 Number of Transformer Encoder Layers 30
5.2 Number of Transformer Decoder Layers 31
5.3 Component Analysis 32
5.4 Effect of the fusion weights 34
Chapter 6 Conclusion 36
References 38
dc.language.iso: en
dc.subject (zh_TW): contrastive image-text pre-trained model (對比影像文字預訓練模型)
dc.subject (zh_TW): open-vocabulary semantic segmentation (開放詞彙的語義分割)
dc.subject (en): CLIP
dc.subject (en): Open-vocabulary segmentation
dc.title (zh_TW): Efficient One-Stage Open-Vocabulary Semantic Segmentation Based on a Frozen Pre-trained Image-Text Model Backbone (基於凍結預訓練文字影像模型骨幹的一階段高效率開放詞彙語義分割)
dc.title (en): SLCLIP: Efficient One-Stage Open-Vocabulary Semantic Segmentation with a Frozen CLIP Backbone
dc.type: Thesis
dc.date.schoolyear: 113-2
dc.description.degree: Master (碩士)
dc.contributor.oralexamcommittee (zh_TW): 雷欽隆;于天立;余承叡
dc.contributor.oralexamcommittee (en): Chin-Laung Lei; Tian-Li Yu; Cheng-Juei Yu
dc.subject.keyword (zh_TW): open-vocabulary semantic segmentation, contrastive image-text pre-trained model
dc.subject.keyword (en): Open-vocabulary segmentation, CLIP
dc.relation.page: 41
dc.identifier.doi: 10.6342/NTU202501503
dc.rights.note: Authorized (access restricted to campus) (同意授權(限校園內公開))
dc.date.accepted: 2025-07-07
dc.contributor.author-college: College of Electrical Engineering and Computer Science (電機資訊學院)
dc.contributor.author-dept: Department of Electrical Engineering (電機工程學系)
dc.date.embargo-lift: 2030-07-02
Appears in Collections: Department of Electrical Engineering

Files in This Item:
File: ntu-113-2.pdf (restricted access; not authorized for public release)
Size: 8.87 MB
Format: Adobe PDF


