SLCLIP: Efficient One-Stage Open-Vocabulary Semantic Segmentation with a Frozen CLIP Backbone

吳凱濠; Kai-Hao Wu

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98081

Title:	基於凍結預訓練文字影像模型骨幹的一階段高效率開放詞彙語義分割 SLCLIP: Efficient One-Stage Open-Vocabulary Semantic Segmentation with a Frozen CLIP Backbone
Author:	吳凱濠 Kai-Hao Wu
Advisor:	王勝德 Sheng-De Wang
Keywords:	open-vocabulary semantic segmentation, contrastive image-text pre-trained model, Open-vocabulary segmentation, CLIP
Year of publication:	2025
Degree:	Master's
Abstract:	Open-vocabulary semantic segmentation has become an important research topic in computer vision because real-world applications must identify categories that were not seen during training. Existing mainstream approaches typically adopt a two-stage architecture: a mask generator first produces candidate masks, which are then fed into a pre-trained contrastive vision-language model, e.g., CLIP, for visual-language feature matching to determine the semantic category of each mask. However, these methods repeatedly extract features from the same image during inference, resulting in significant memory consumption and latency and limiting their efficiency in practical deployment. Although current one-stage models offer significantly faster inference than two-stage models, some of them require even more training time and computational resources than their two-stage counterparts. This study aims to address this issue by adopting an efficient one-stage open-vocabulary semantic segmentation approach, with the goal of accelerating training while maintaining stable and reliable inference performance. A pre-trained contrastive vision-language model is employed directly as the backbone of the architecture, with its weights frozen during training so that only the mask generator module is updated. This approach reduces training time and significantly reduces memory usage. During inference, a mask validity filtering mechanism based on mask pooling determines whether each predicted mask is valid, preventing invalid masks from degrading the final segmentation quality. Experimental results show that the proposed architecture achieves accuracy and speed comparable to those of existing mainstream methods while substantially improving inference efficiency and practicality.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98081
DOI:	10.6342/NTU202501503
Full-text authorization:	Authorized (open on campus only)
Full-text release date:	2030-07-02
Appears in collections:	Department of Electrical Engineering
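
The inference step described in the abstract (pool the backbone features under each predicted mask, then decide whether the mask is valid) can be illustrated with a minimal sketch. The abstract does not state the validity criterion, so everything below is an assumption made for illustration: a mask is treated as invalid when it is empty or when its pooled embedding has cosine similarity below a threshold to every class text embedding. The names `mask_pool`, `filter_and_classify`, and `min_score` are hypothetical and do not come from the thesis, and plain Python lists stand in for the frozen CLIP feature map and text embeddings.

```python
import math


def mask_pool(features, mask):
    """Mask-weighted average of per-pixel feature vectors.

    features: H x W x D nested lists (stand-in for a frozen backbone's feature map).
    mask:     H x W nested lists of weights in [0, 1].
    Returns a D-dimensional list, or None when the mask has zero total weight.
    """
    dim = len(features[0][0])
    total = [0.0] * dim
    weight = 0.0
    for feat_row, mask_row in zip(features, mask):
        for feat, m in zip(feat_row, mask_row):
            weight += m
            for k in range(dim):
                total[k] += m * feat[k]
    if weight == 0.0:
        return None  # empty mask: nothing to pool
    return [t / weight for t in total]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def filter_and_classify(features, masks, text_embeddings, min_score=0.5):
    """Keep only masks judged valid and label each with its best-matching class.

    Assumed validity rule (not taken from the thesis): the mask is non-empty and
    its pooled embedding reaches at least min_score similarity to some class.
    Returns a list of (mask_index, class_name, score) tuples.
    """
    results = []
    for idx, mask in enumerate(masks):
        pooled = mask_pool(features, mask)
        if pooled is None:
            continue  # invalid: empty mask
        scores = {name: cosine(pooled, emb) for name, emb in text_embeddings.items()}
        best = max(scores, key=scores.get)
        if scores[best] >= min_score:
            results.append((idx, best, scores[best]))
    return results


# Toy 2 x 2 feature map with 2-dimensional features.
features = [[[1.0, 0.0], [1.0, 0.0]],
            [[0.0, 1.0], [0.0, 1.0]]]
text_embeddings = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
masks = [
    [[1, 1], [0, 0]],  # covers only "cat"-like pixels
    [[0, 0], [0, 0]],  # empty mask
    [[1, 0], [1, 0]],  # mixes both regions, so it matches neither class well
    [[0, 0], [1, 1]],  # covers only "dog"-like pixels
]
results = filter_and_classify(features, masks, text_embeddings, min_score=0.9)
print([(idx, name) for idx, name, _ in results])
```

In this toy run the empty mask and the mixed mask are discarded, and the two clean masks are kept with their class labels. Because the image features are computed once and reused for every mask, the sketch also reflects why a one-stage design avoids the repeated feature extraction that the abstract attributes to two-stage methods.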

Files in this item:

File	Size	Format
ntu-113-2.pdf (not authorized for public access)	8.87 MB	Adobe PDF
