Enhancing Zero-Shot Training-Free Out-of-Distribution Detection by Leveraging Multi-Modal Features and Vision-Language Models

陳弘修; Hong-Siou Chen

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101044

Title:	Enhancing Zero-Shot Training-Free Out-of-Distribution Detection by Leveraging Multi-Modal Features and Vision-Language Models
Author:	陳弘修 (Hong-Siou Chen)
Advisor:	吳家麟 (Ja-Ling Wu)
Keywords:	Out-of-Distribution Detection, Multi-Modal Fusion, Zero-Shot Learning, Vision-Language Models, Superclass, Adaptive Proxies
Year of publication:	2025
Degree:	Master's
Abstract:	A central challenge in out-of-distribution (OOD) detection is identifying unknown samples that differ substantially from the training data. Recent training-free, zero-shot methods offer two distinct approaches. The first uses large language models (LLMs) to generate semantically more abstract “superclasses”, from which high-quality static negative labels are mined [10]. The second builds a feature memory bank and dynamically generates “adaptive proxies” that align with the distribution of the current test data [16]. These approaches emphasize, respectively, the precision of static semantics and the adaptability of dynamic features, and neither combines the two strengths. This thesis proposes SC+AdaNeg, a framework that, for the first time, fuses the two methods. The core contribution is not a simple stacking of the two but a new fusion architecture and scoring formula. First, we retain the superclass strategy to obtain high-quality textual priors while using the AdaNeg memory bank mechanism to capture dynamic visual features. Second, and most importantly, we redesign the scoring mechanism that fuses the two: we find that the more stable task-adaptive proxies complement static text features better than sample-adaptive proxies do, and we introduce a textual amplification factor (H) that strengthens the influence of the superclass-derived text priors, making them the anchor of the final decision. Experiments show that the fused framework achieves state-of-the-art (SOTA) performance on multiple OOD benchmark datasets and significantly outperforms each of the two original standalone methods. This work demonstrates the benefit of combining static semantic priors with dynamic visual adaptation and proposes a principled paradigm for their fusion in future OOD detection research.
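As a rough illustration of the kind of fusion the abstract describes, the sketch below combines a text-based score with a proxy-based score and weights the text term by an amplification factor H. The abstract does not state the actual scoring formula, so the function names, the softmax-ratio form, the temperature, the default value of H, and the mean-of-memory-bank proxy are all assumptions made for illustration, not the method of the thesis.

```python
import numpy as np

def task_adaptive_proxy(bank):
    """Hypothetical proxy: L2-normalised mean of the features stored in a memory bank."""
    proxy = bank.mean(axis=0)
    return proxy / np.linalg.norm(proxy)

def softmax_id_score(sim_id, sim_neg, temperature=0.01):
    """Share of softmax mass that falls on ID entries rather than negative entries."""
    logits = np.concatenate([sim_id, sim_neg]) / temperature
    logits -= logits.max()  # subtract the maximum for numerical stability
    weights = np.exp(logits)
    return weights[: len(sim_id)].sum() / weights.sum()

def fused_score(img_feat, id_text, neg_text, id_proxy, neg_proxy, H=2.0):
    """Assumed fusion rule: H-weighted text score plus proxy score, rescaled to [0, 1].

    All feature vectors are expected to be L2-normalised; each matrix holds one
    feature vector per row.
    """
    text_score = softmax_id_score(id_text @ img_feat, neg_text @ img_feat)
    proxy_score = softmax_id_score(id_proxy @ img_feat, neg_proxy @ img_feat)
    return (H * text_score + proxy_score) / (H + 1.0)
```

In this sketch a higher score means the image looks more in-distribution. Setting H above 1 makes the text-based term dominate the proxy-based term, which mirrors the role the abstract assigns to the text priors.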
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101044
DOI:	10.6342/NTU202502024
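Full-text authorization:	Not authorized
Electronic full-text release date:	N/A
Appears in department:	Department of Computer Science and Information Engineering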

Files in this item:

File	Size	Format
ntu-114-1.pdf (not authorized for public access)	6.33 MB	Adobe PDF

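Unless their copyright terms are specifically stated otherwise, all items in this system are protected by copyright, with all rights reserved.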