Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/95340

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 吳家麟 | zh_TW |
| dc.contributor.advisor | Ja-Ling Wu | en |
| dc.contributor.author | 陳常安 | zh_TW |
| dc.contributor.author | Chang-An Chen | en |
| dc.date.accessioned | 2024-09-05T16:15:22Z | - |
| dc.date.available | 2024-09-06 | - |
| dc.date.copyright | 2024-09-05 | - |
| dc.date.issued | 2024 | - |
| dc.date.submitted | 2024-08-09 | - |
| dc.identifier.citation | [1] A. Aghajanyan, L. Zettlemoyer, and S. Gupta. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. arXiv preprint arXiv:2012.13255, 2020.
[2] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3606–3613, 2014.
[3] J. Dong, Y. Gao, H. Zhou, J. Cen, Y. Yao, S. Yoon, and P. D. Sun. Towards few-shot out-of-distribution detection. arXiv preprint arXiv:2311.12076, 2023.
[4] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[5] S. Esmaeilpour, B. Liu, E. Robertson, and L. Shu. Zero-shot out-of-distribution detection based on the pre-trained model CLIP. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 6568–6576, 2022.
[6] P. Gao, S. Geng, R. Zhang, T. Ma, R. Fang, Y. Zhang, H. Li, and Y. Qiao. CLIP-Adapter: Better vision-language models with feature adapters. International Journal of Computer Vision, 132(2):581–595, 2024.
[7] Z. Han, C. Gao, J. Liu, S. Q. Zhang, et al. Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv preprint arXiv:2403.14608, 2024.
[8] D. Hendrycks and K. Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.
[9] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[10] R. Huang and Y. Li. MOS: Towards scaling out-of-distribution detection for large semantic space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8710–8719, 2021.
[11] M. Jia, L. Tang, B.-C. Chen, C. Cardie, S. Belongie, B. Hariharan, and S.-N. Lim. Visual prompt tuning. In European Conference on Computer Vision, pages 709–727. Springer, 2022.
[12] X. Jiang, F. Liu, Z. Fang, H. Chen, T. Liu, F. Zheng, and B. Han. Negative label guided OOD detection with pretrained vision-language models. arXiv preprint arXiv:2403.20078, 2024.
[13] J. Kim, J. Kim, and S. Hwang. Comparison of out-of-distribution detection performance of CLIP-based fine-tuning methods. In 2024 International Conference on Electronics, Information, and Communication (ICEIC), pages 1–4. IEEE, 2024.
[14] H. Lee, L. Soldaini, A. Cohan, M. Seo, and K. Lo. Back to basics: A simple recipe for improving out-of-domain retrieval in dense encoders. arXiv preprint arXiv:2311.09765, 2023.
[15] T. Li, G. Pang, X. Bai, W. Miao, and J. Zheng. Learning transferable negative prompts for out-of-distribution detection. arXiv preprint arXiv:2404.03248, 2024.
[16] C. Liao, T. Tsiligkaridis, and B. Kulis. Descriptor and word soups: Overcoming the parameter efficiency accuracy tradeoff for out-of-distribution few-shot learning. arXiv preprint arXiv:2311.13612, 2023.
[17] S. Liu, J. Keung, Z. Yang, F. Liu, Q. Zhou, and Y. Liao. Delving into parameter-efficient fine-tuning in code change learning: An empirical study. arXiv preprint arXiv:2402.06247, 2024.
[18] Y. Ming, Z. Cai, J. Gu, Y. Sun, W. Li, and Y. Li. Delving into out-of-distribution detection with vision-language representations. Advances in Neural Information Processing Systems, 35:35087–35102, 2022.
[19] Y. Ming and Y. Li. How does fine-tuning impact out-of-distribution detection for vision-language models? International Journal of Computer Vision, 132(2):596–609, 2024.
[20] A. Miyai, Q. Yu, G. Irie, and K. Aizawa. LoCoOp: Few-shot out-of-distribution detection via prompt learning. Advances in Neural Information Processing Systems, 36, 2024.
[21] L. Niss, K. Vogt-Lowell, and T. Tsiligkaridis. Quantified task misalignment to inform PEFT: An exploration of domain generalization and catastrophic forgetting in CLIP. arXiv preprint arXiv:2402.09613, 2024.
[22] K. O'Shea and R. Nash. An introduction to convolutional neural networks. arXiv preprint arXiv:1511.08458, 2015.
[23] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[24] W. Ren, X. Li, L. Wang, T. Zhao, and W. Qin. Analyzing and reducing catastrophic forgetting in parameter efficient tuning. arXiv preprint arXiv:2402.18865, 2024.
[25] G. Van Horn, O. Mac Aodha, Y. Song, Y. Cui, C. Sun, A. Shepard, H. Adam, P. Perona, and S. Belongie. The iNaturalist species classification and detection dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8769–8778, 2018.
[26] H. Wang, Y. Li, H. Yao, and X. Li. CLIPN for zero-shot OOD detection: Teaching CLIP to say no. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1802–1812, 2023.
[27] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3485–3492. IEEE, 2010.
[28] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1452–1464, 2017.
[29] K. Zhou, J. Yang, C. C. Loy, and Z. Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022. | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/95340 | - |
| dc.description.abstract | 最近在視覺語言模型方面的進展,如CLIP,已經徹底改變了零樣本分類任務。儘管傳統的微調方法可以提升性能,但對於大型模型來說,它們的成本很高。因此,研究現在集中在參數高效的技術上。然而,目前的評估標準聚焦在分類性能上,卻忽略了模型的可靠性。我們的研究通過對基於CLIP的微調方法進行全面比較分析來解決這一空缺。我們評估了不同參數高效微調(PEFT)方法在少樣本分佈外檢測中的表現,這對於評估模型可靠性至關重要。本論文揭示了僅採用參數高效微調(PEFT)方法時,在分佈外檢測性能上的不足,相較於其他基於CLIP的方法。為了解決這一限制,我們從PEFT中選擇了視覺提示(VPT)。通過將VPT作為一種附加應用來增強其他分佈外檢測技術,我們實現了顯著的性能提升,即使與當前表現最好(SOTA)的基於CLIP的OOD檢測方法相比也是如此。 | zh_TW |
| dc.description.abstract | Recent advances in vision-language models such as CLIP have revolutionized zero-shot classification tasks. While traditional fine-tuning enhances performance, it is costly for large-scale models, so research now focuses on parameter-efficient techniques. However, current evaluations predominantly measure classification performance and neglect model reliability. Our study addresses this gap by providing a thorough comparative analysis of CLIP-based fine-tuning methods: we assess the few-shot out-of-distribution (OOD) detection performance of different parameter-efficient fine-tuning (PEFT) methods, which is crucial for evaluating model reliability. This thesis reveals a shortfall in OOD detection performance when employing PEFT methods alone, compared with other CLIP-based approaches. To remedy this limitation, we select Visual Prompt Tuning (VPT) from among the PEFT methods. By utilizing VPT as an add-on to enhance other OOD detection techniques, we achieve notable performance gains even against the current state-of-the-art (SOTA) CLIP-based OOD detection methods. (An illustrative zero-shot OOD scoring sketch follows the metadata table below.) | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-09-05T16:15:22Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2024-09-05T16:15:22Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | 摘要 iii
Abstract v
Contents vii
List of Figures ix
List of Tables xi
Chapter 1 Introduction 1
Chapter 2 Related work 7
2.1 Zero-Shot CLIP-based OOD Detection 7
2.2 Parameter-Efficient Fine-Tuning (PEFT) 8
2.2.1 Additive-based methods 8
2.2.1.1 Prompt Tuning 8
2.2.1.2 Adapter 9
2.2.2 Low-Rank Adaptation (LoRA) 9
2.3 PEFT-based OOD Detection 10
2.4 Using PEFT methods as an Enhancement for other CLIP-based OOD Detection Methods 10
2.5 Summary 11
Chapter 3 Method 13
3.1 Zero-Shot CLIP-based OOD Detection 13
3.2 Parameter-Efficient Fine-Tuning (PEFT) in CLIP-based OOD Detection 16
3.2.1 CoOp 18
3.2.2 VPT 19
3.2.3 Unified Learning 22
3.2.4 Low-Rank Adaptation (LoRA) 23
3.3 VPT to enhance other CLIP-based OOD Detection Methods 25
3.3.1 Limitation of using only PEFT 25
3.3.2 Negative Label 26
3.3.3 Negative Prompt 28
Chapter 4 Experiments 33
4.1 Setup 33
4.2 The impact of PEFT methods on Zero-Shot CLIP's OOD detection performance 34
4.3 VPT as an enhancement on other CLIP-based OOD Detection methods 37
Chapter 5 Conclusion and Future Work 41
References 45 | - |
| dc.language.iso | zh_TW | - |
| dc.subject | 插件式應用 | zh_TW |
| dc.subject | 分佈外偵測 | zh_TW |
| dc.subject | 基石模型 | zh_TW |
| dc.subject | 物件辨識 | zh_TW |
| dc.subject | PEFT | en |
| dc.subject | Few-shot setting | en |
| dc.subject | Image classification | en |
| dc.subject | CLIP | en |
| dc.subject | Foundation model | en |
| dc.subject | OOD detection | en |
| dc.title | 透過視覺提示提升基於 CLIP 的分佈外偵測效能並比較PEFT方法間的優劣 | zh_TW |
| dc.title | Enhancing CLIP-based Out-of-Distribution Detection Performance with Visual Prompt Tuning and a Comparative Analysis of Parameter-Efficient Fine-Tuning (PEFT) Methods | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 112-2 | - |
| dc.description.degree | Master's | - |
| dc.contributor.oralexamcommittee | 陳文進;許永真;胡敏君;陳駿丞 | zh_TW |
| dc.contributor.oralexamcommittee | Wen-Chin Chen;Yung-Jen Hsu;Min-Chun Hu;Jun-Cheng Chen | en |
| dc.subject.keyword | 分佈外偵測,插件式應用,物件辨識,基石模型 | zh_TW |
| dc.subject.keyword | PEFT,CLIP,OOD detection,Few-shot setting,Image classification,Foundation model | en |
| dc.relation.page | 49 | - |
| dc.identifier.doi | 10.6342/NTU202402658 | - |
| dc.rights.note | Not authorized | - |
| dc.date.accepted | 2024-08-12 | - |
| dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
| dc.contributor.author-dept | Department of Computer Science and Information Engineering | - |
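For context on the zero-shot CLIP-based OOD detection that the abstract builds on, below is a minimal sketch in the spirit of the MCM score of Ming et al. ([18] in the citation list): an image is compared against text prompts for the in-distribution classes, and a low maximum softmax similarity flags it as out-of-distribution. This is an illustrative sketch only; the checkpoint, class names, image path, temperature, and threshold are assumptions for demonstration, not values taken from the thesis.

```python
# Minimal sketch of zero-shot CLIP-based OOD scoring (MCM-style, per [18]).
# All concrete values below (checkpoint, classes, tau, threshold) are
# illustrative assumptions, not settings from this thesis.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

id_classes = ["dog", "cat", "bird"]                 # hypothetical ID label set
prompts = [f"a photo of a {c}" for c in id_classes]

image = Image.open("query.jpg")                     # hypothetical test image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    img_feat = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_feat = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Cosine similarities between the image and each class prompt.
img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
sims = img_feat @ txt_feat.T                        # shape: (1, num_classes)

# MCM-style score: maximum softmax probability under a temperature tau.
# A low maximum means the image matches none of the ID classes well,
# i.e., it is likely out-of-distribution.
tau = 0.01                                          # assumed temperature
score = torch.softmax(sims / tau, dim=-1).max().item()
is_ood = score < 0.5                                # illustrative threshold
print(f"MCM score = {score:.3f}, flagged OOD = {is_ood}")
```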
Appears in Collections: Department of Computer Science and Information Engineering

Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-112-2.pdf (not authorized for public access) | 1.02 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.