Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98081
Full metadata record
dc.contributor.advisor (zh_TW): 王勝德
dc.contributor.advisor (en): Sheng-De Wang
dc.contributor.author (zh_TW): 吳凱濠
dc.contributor.author (en): Kai-Hao Wu
dc.date.accessioned: 2025-07-24T16:07:16Z
dc.date.available: 2025-07-25
dc.date.copyright: 2025-07-24
dc.date.issued: 2025
dc.date.submitted: 2025-07-03
dc.identifier.citation:
[1] H. Caesar, J. Uijlings, and V. Ferrari. Coco-stuff: Thing and stuff classes in context. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1209–1218, 2018.
[2] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017.
[3] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
[4] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 801–818, 2018.
[5] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1290–1299, 2022.
[6] B. Cheng, A. Schwing, and A. Kirillov. Per-pixel classification is not all you need for semantic segmentation. Advances in neural information processing systems, 34:17864–17875, 2021.
[7] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88:303–338, 2010.
[8] X. Gu, T.-Y. Lin, W. Kuo, and Y. Cui. Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921, 2021.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 770–778, 2016.
[10] D. Hendrycks and K. Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
[11] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021.
[12] J. Li, D. Li, C. Xiong, and S. Hoi. Blip: Bootstrapping language-image pretraining for unified vision-language understanding and generation. In International conference on machine learning, pages 12888–12900. PMLR, 2022.
[13] F. Liang, B. Wu, X. Dai, K. Li, Y. Zhao, H. Zhang, P. Zhang, P. Vajda, and D. Marculescu. Open-vocabulary semantic segmentation with mask-adapted clip. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7061–7070, 2023.
[14] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie. A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11976–11986, 2022.
[15] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3431–3440, 2015.
[16] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 891–898, 2014.
[17] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[18] J. Xu, S. Liu, A. Vahdat, W. Byeon, X. Wang, and S. De Mello. Open-vocabulary panoptic segmentation with text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pages 2955–2966, 2023.
[19] M. Xu, Z. Zhang, F. Wei, H. Hu, and X. Bai. SAN: Side adapter network for open-vocabulary semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):15546–15561, 2023.
[20] M. Xu, Z. Zhang, F. Wei, Y. Lin, Y. Cao, H. Hu, and X. Bai. A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In Proceedings of the European conference on computer vision (ECCV), pages 736–753. Springer, 2022.
[21] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2881–2890, 2017.
[22] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 633–641, 2017.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98081
dc.description.abstract (zh_TW): Open-vocabulary semantic segmentation has become an important research topic in computer vision in recent years because real-world applications require recognizing categories that were never seen during training. Most existing mainstream methods adopt a two-stage architecture: a mask generator first produces candidate masks, which are then fed into a pre-trained contrastive image-text model for visual-language feature matching to determine the semantic category of each mask. However, such methods must repeatedly extract features from the same image during inference, which increases memory consumption and noticeably raises inference latency, limiting their efficiency in practical deployments. Although current one-stage models offer a clear inference-speed advantage over two-stage models, some of them require even more training time and resources than their two-stage counterparts.
This study addresses this problem with an efficient one-stage open-vocabulary semantic segmentation method that accelerates training while maintaining stable and reliable inference performance. A pre-trained contrastive image-text model is used directly as the backbone, its weights are frozen during training, and only the mask generator module is updated. This reduces training time and significantly lowers memory consumption. During inference, a mask validity filtering mechanism applies pooling to each mask to decide whether it is valid, preventing invalid masks from degrading the final segmentation quality. Experimental results show that the proposed architecture matches existing mainstream methods in both accuracy and speed while substantially improving inference efficiency and practicality.
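As a rough illustration of the training setup described in the abstract above (frozen pre-trained backbone, trainable mask generator only), the following is a minimal PyTorch-style sketch. The names `clip_backbone` and `mask_generator`, the choice of AdamW, and the learning rate are assumptions made for illustration, not the thesis's actual implementation details.

```python
import torch

def build_optimizer(clip_backbone: torch.nn.Module,
                    mask_generator: torch.nn.Module) -> torch.optim.Optimizer:
    # Freeze every backbone parameter so no gradients are computed
    # or stored for the pre-trained image-text model during training.
    for p in clip_backbone.parameters():
        p.requires_grad = False
    clip_backbone.eval()  # keep normalization statistics fixed

    # Only the mask generator's parameters are handed to the optimizer,
    # so the backbone weights never change.
    return torch.optim.AdamW(mask_generator.parameters(), lr=1e-4)
```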
dc.description.abstract (en): Open-vocabulary semantic segmentation has become an important research topic in the field of computer vision due to the necessity of identifying unseen categories during training in real-world applications. Existing mainstream approaches typically adopt a two-stage architecture, which first generates candidate masks using a mask generator and then feeds these masks into a pre-trained contrastive vision-language model, e.g., CLIP, to perform visual-language feature matching and determine the semantic category of each mask. However, these methods repeatedly extract features from the same image during inference, resulting in significant memory consumption and latency, thereby limiting their efficiency in practical deployment. Although current one-stage models offer significantly faster inference compared to two-stage models, some methods require even more training time and computational resources than their two-stage counterparts.
This study aims to address the above issue by adopting an efficient one-stage open-vocabulary semantic segmentation approach, with the goal of accelerating training while maintaining stable and reliable inference performance. A pre-trained contrastive vision-language model is directly employed as the backbone of the architecture, with its weights frozen during training and only the mask generator module updated. This approach effectively reduces training time and significantly lowers memory usage. During inference, a mask validity filtering mechanism based on mask pooling determines the effectiveness of each predicted mask, thereby preventing invalid masks from negatively affecting segmentation quality. Experimental results demonstrate that the proposed architecture achieves accuracy and speed comparable to those of existing mainstream methods, while substantially improving inference efficiency and practicality.
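The mask validity filtering step mentioned in the abstract can likewise be sketched. The abstract only states that pooling over each mask decides its validity; the averaging rule and the threshold below are assumptions for illustration, not the thesis's exact formulation.

```python
import torch

def filter_valid_masks(mask_logits: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Return a boolean tensor marking which of the N predicted masks to keep."""
    probs = mask_logits.sigmoid()                     # (N, H, W) per-pixel probabilities
    region = (probs > 0.5).float()                    # hard foreground region of each mask
    area = region.sum(dim=(1, 2)).clamp(min=1.0)      # avoid division by zero for empty masks
    pooled = (probs * region).sum(dim=(1, 2)) / area  # average probability inside each mask
    return pooled > threshold                         # keep only sufficiently confident masks
```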
dc.description.provenance (en): Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-07-24T16:07:16Z. No. of bitstreams: 0
dc.description.provenance (en): Made available in DSpace on 2025-07-24T16:07:16Z (GMT). No. of bitstreams: 0
dc.description.tableofcontents:
Verification Letter from the Oral Examination Committee i
Acknowledgements ii
摘要 (Chinese Abstract) iii
Abstract iv
Contents vi
List of Figures viii
List of Tables ix
Chapter 1 Introduction 1
Chapter 2 Related Work 5
2.1 Fully Supervised Semantic Segmentation 5
2.2 Vision–Language Pre-training 6
2.3 Open-Vocabulary Semantic Segmentation 8
Chapter 3 Methods 10
3.1 Overview 10
3.2 Image Encoder and Text Encoder 12
3.2.1 Image Encoder 12
3.2.2 Text Encoder 13
3.3 Mask Generator 15
3.3.1 Pixel Decoder 16
3.3.2 Transformer Encoder 17
3.3.3 Transformer Decoder 18
3.4 Open-vocabulary Inference 20
Chapter 4 Experiments 22
4.1 Datasets and Evaluation 22
4.2 Implementation Details 24
4.3 Experiment Results 25
4.4 Qualitative Results 28
Chapter 5 Ablation Study 30
5.1 Number of Transformer Encoder Layers 30
5.2 Number of Transformer Decoder Layers 31
5.3 Component Analysis 32
5.4 Effect of the fusion weights 34
Chapter 6 Conclusion 36
References 38
dc.language.iso: en
dc.subject (zh_TW): contrastive image-text pre-trained model (對比影像文字預訓練模型)
dc.subject (zh_TW): open-vocabulary semantic segmentation (開放詞彙的語義分割)
dc.subject (en): CLIP
dc.subject (en): Open-vocabulary segmentation
dc.title (zh_TW): Efficient One-Stage Open-Vocabulary Semantic Segmentation Based on a Frozen Pre-trained Image-Text Model Backbone (基於凍結預訓練文字影像模型骨幹的一階段高效率開放詞彙語義分割)
dc.title (en): SLCLIP: Efficient One-Stage Open-Vocabulary Semantic Segmentation with a Frozen CLIP Backbone
dc.type: Thesis
dc.date.schoolyear: 113-2
dc.description.degree: Master (碩士)
dc.contributor.oralexamcommittee (zh_TW): 雷欽隆;于天立;余承叡
dc.contributor.oralexamcommittee (en): Chin-Laung Lei; Tian-Li Yu; Cheng-Juei Yu
dc.subject.keyword (zh_TW): open-vocabulary semantic segmentation, contrastive image-text pre-trained model
dc.subject.keyword (en): Open-vocabulary segmentation, CLIP
dc.relation.page: 41
dc.identifier.doi: 10.6342/NTU202501503
dc.rights.note: Authorized (access restricted to campus) (同意授權(限校園內公開))
dc.date.accepted: 2025-07-07
dc.contributor.author-college: College of Electrical Engineering and Computer Science (電機資訊學院)
dc.contributor.author-dept: Department of Electrical Engineering (電機工程學系)
dc.date.embargo-lift: 2030-07-02
Appears in Collections: Department of Electrical Engineering

Files in This Item:
File: ntu-113-2.pdf (restricted access; not authorized for public release)
Size: 8.87 MB
Format: Adobe PDF


