Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88293
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 賴飛羆 | zh_TW |
dc.contributor.advisor | Fei-Pei Lai | en |
dc.contributor.author | 張凱庭 | zh_TW |
dc.contributor.author | Kai-Ting Chang | en |
dc.date.accessioned | 2023-08-09T16:24:17Z | - |
dc.date.available | 2023-11-09 | - |
dc.date.copyright | 2023-08-09 | - |
dc.date.issued | 2023 | - |
dc.date.submitted | 2023-07-18 | - |
dc.identifier.citation | [1] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko. End-to-end object detection with transformers. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pages 213–229. Springer, 2020.
[2] Y.-W. Chao, Y. Liu, X. Liu, H. Zeng, and J. Deng. Learning to detect human-object interactions. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 381–389. IEEE, 2018.
[3] Q. Dong, Z. Tu, H. Liao, Y. Zhang, V. Mahadevan, and S. Soatto. Visual relationship detection using part-and-sum transformers with composite queries. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3550–3559, 2021.
[4] C. Gao, J. Xu, Y. Zou, and J.-B. Huang. DRG: Dual relation graph for human-object interaction detection. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XII 16, pages 696–712. Springer, 2020.
[5] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan. Supervised contrastive learning. Advances in Neural Information Processing Systems, 33:18661–18673, 2020.
[6] B. Kim, J. Lee, J. Kang, E.-S. Kim, and H. J. Kim. HOTR: End-to-end human-object interaction detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 74–83, 2021.
[7] D.-J. Kim, X. Sun, J. Choi, S. Lin, and I. S. Kweon. Detecting human-object interactions with action co-occurrence priors. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16, pages 718–736. Springer, 2020.
[8] J. Li, P. Zhou, C. Xiong, and S. C. Hoi. Prototypical contrastive learning of unsupervised representations. arXiv preprint arXiv:2005.04966, 2020.
[9] Y.-L. Li, X. Liu, H. Lu, S. Wang, J. Liu, J. Li, and C. Lu. Detailed 2D-3D joint representation for human-object interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10166–10175, 2020.
[10] Z. Li, C. Zou, Y. Zhao, B. Li, and S. Zhong. Improving human-object interaction detection via phrase learning and label composition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 1509–1517, 2022.
[11] Y. Liao, A. Zhang, M. Lu, Y. Wang, X. Li, and S. Liu. GEN-VLKT: Simplify association and enhance interaction understanding for HOI detection. arXiv preprint arXiv:2203.13954, 2022.
[12] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[13] M. Tamura, H. Ohashi, and T. Yoshinaga. QPIC: Query-based pairwise human-object interaction detection with image-wide contextual information. In CVPR, 2021.
[14] O. Ulutan, A. Iftekhar, and B. S. Manjunath. VSGNet: Spatial attention network for detecting human object interactions using graph convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13617–13626, 2020.
[15] P. Wang, K. Han, X.-S. Wei, L. Zhang, and L. Wang. Contrastive learning-based hybrid networks for long-tailed image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 943–952, 2021.
[16] J. Zhu, Z. Wang, J. Chen, Y.-P. P. Chen, and Y.-G. Jiang. Balanced contrastive learning for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6908–6917, 2022.
[17] C. Zou, B. Wang, Y. Hu, J. Liu, Q. Wu, Y. Zhao, B. Li, C. Zhang, C. Zhang, Y. Wei, and J. Sun. End-to-end human object interaction detection with HOI transformer. In CVPR, 2021.
[18] K. He, et al. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[19] A. Vaswani, et al. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[20] M. Chen, et al. Reformulating HOI detection as adaptive set prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9004–9013, 2021.
[21] A. Zhang, et al. Mining the benefits of two-stage and one-stage HOI detection. Advances in Neural Information Processing Systems, 34:17209–17220, 2021.
[22] F. Z. Zhang, D. Campbell, and S. Gould. Efficient two-stage detection of human-object interactions with a novel unary-pairwise transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20104–20112, 2022.
[23] Y. Zhang, et al. Exploring structure-aware transformer over interaction proposals for human-object interaction detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19548–19557, 2022. | - |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88293 | - |
dc.description.abstract | Human-object interaction (HOI) detection is a fundamental task in computer vision whose goal is to localize human-object pairs in a visual scene and recognize their interactions. This study proposes a novel HOI detection framework that achieves better generalization by transferring prior knowledge from vision-language models. The framework is a hybrid structure composed of two branches: the first branch uses a contrastive loss to refine the representative feature vectors of human-object pairs extracted from CLIP, while the second branch is a classifier that uses a cross-entropy loss to make predictions from the learned feature vectors. This dual-branch design improves both the quality of the HOI representations and the classification performance. To raise the quality of the HOI feature vectors, we propose an enhanced Interaction Decoder that adds an extra cross-attention layer; the new decoder makes it easier for the model to extract informative, meaningful regions from CLIP's visual feature map. By combining this decoder with the detection backbone through a knowledge integration block, we achieve more precise localization of human-object pairs. Our framework delivers excellent performance under full supervision while adding only a small number of model parameters. | zh_TW |
dc.description.abstract | Human-object interaction (HOI) detection is a fundamental task in computer vision whose objective is to localize human-object pairs and identify their interactions in visual scenes. This work proposes a novel HOI detection framework that transfers prior knowledge from vision-language models to achieve better generalization. Our framework is a hybrid network consisting of two branches: the first branch uses a contrastive loss to improve the representations of human-object interactions extracted from CLIP, while the second branch is a classifier that uses a cross-entropy loss to make predictions from the learned features. The dual-branch design improves both the quality of the HOI representations and the classification performance. To enhance the quality of the interaction representations, we present an enhanced interaction decoder that incorporates an additional cross-attention layer, which enables the extraction of informative localized regions from CLIP's visual feature map. By integrating this decoder with the detection backbone through a knowledge integration block, we achieve more precise localization of human-object pairs. Our framework demonstrates superior performance in fully-supervised settings as well as in settings with limited parameters and data. | en |
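As a concrete illustration of the dual-branch objective described in the abstract, the sketch below pairs a cross-entropy classification branch with a contrastive feature-learning branch over CLIP-derived HOI features. This is a minimal sketch under stated assumptions, not the thesis implementation: the names (`DualBranchHOIHead`, `feat_dim`), the 0.5 loss weight, and the use of a batch-wise supervised contrastive loss [5] in place of the prototype-based variant the thesis describes are all assumptions made for exposition.

```python
# Hypothetical sketch of the dual-branch training objective from the abstract.
# Branch 1 refines CLIP-derived HOI features with a contrastive loss;
# branch 2 classifies them with cross-entropy. Names and the loss weight
# are illustrative assumptions, not the thesis code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchHOIHead(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int, temperature: float = 0.07):
        super().__init__()
        self.proj = nn.Linear(feat_dim, feat_dim)            # feature-learning branch
        self.classifier = nn.Linear(feat_dim, num_classes)   # classification branch
        self.temperature = temperature

    def contrastive_loss(self, z: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        """Batch-wise supervised contrastive loss over HOI embeddings [5]."""
        z = F.normalize(z, dim=-1)
        sim = z @ z.t() / self.temperature                   # pairwise similarities
        pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()
        pos_mask.fill_diagonal_(0)                           # exclude self-pairs
        logits = sim - sim.max(dim=1, keepdim=True).values.detach()  # stability
        self_mask = 1.0 - torch.eye(z.size(0), device=z.device)
        exp_logits = torch.exp(logits) * self_mask
        log_prob = logits - torch.log(exp_logits.sum(dim=1, keepdim=True) + 1e-8)
        # average log-probability over each anchor's positives
        mean_log_prob_pos = (pos_mask * log_prob).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
        return -mean_log_prob_pos.mean()

    def forward(self, hoi_feats: torch.Tensor, labels: torch.Tensor):
        z = self.proj(hoi_feats)                             # branch 1: representation
        logits = self.classifier(hoi_feats)                  # branch 2: prediction
        loss = F.cross_entropy(logits, labels) + 0.5 * self.contrastive_loss(z, labels)
        return logits, loss

# Usage (shapes only; 600 illustrative HOI classes):
# head = DualBranchHOIHead(feat_dim=512, num_classes=600)
# logits, loss = head(torch.randn(32, 512), torch.randint(0, 600, (32,)))
```

Keeping the classifier on the raw features while the projection head only shapes the embedding space mirrors the usual division of labor in hybrid contrastive/cross-entropy networks for long-tailed classification [15].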
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-08-09T16:24:17Z No. of bitstreams: 0 | en |
dc.description.provenance | Made available in DSpace on 2023-08-09T16:24:17Z (GMT). No. of bitstreams: 0 | en |
dc.description.tableofcontents | Oral Examination Committee Certification #
Acknowledgements i
ABSTRACT iii
摘要 (Chinese Abstract) v
CONTENTS vii
LIST OF FIGURES ix
LIST OF TABLES xi
Chapter 1 Introduction 1
Chapter 2 Related Work 4
2.1 HOI Detection 4
2.2 Vision-Language Models 4
2.3 Contrastive Learning 5
Chapter 3 Methods 8
3.1 Framework Architecture 8
3.1.1 Text Embedding for Prototype Initialization 10
3.2 HOI Detector Architecture 12
3.2.1 Interaction Decoder with CLIP Visual Knowledge Fusion 13
3.3 Training and Inference 16
3.3.1 Training 16
3.3.2 Inference 18
Chapter 4 Experiments 19
4.1 Experiment Settings 19
4.2 Effectiveness for HOI Detection 20
4.3 Ablation Study 22
4.3.1 Architecture Analysis 22
4.3.2 Pre-train the Feature Learning Branch or Not? 23
4.3.3 Prototype Update Strategy 24
4.4 Visualization 26
4.5 Limitations and Discussion 28
Chapter 5 Conclusion 30
REFERENCES 32 | - |
dc.language.iso | en | - |
dc.title | An HOI Detection Model Based on Vision-Language Model Transfer | zh_TW |
dc.title | A Hybrid Framework for HOI Detection based on VLMs Knowledge Transferring | en |
dc.type | Thesis | - |
dc.date.schoolyear | 111-2 | - |
dc.description.degree | Master's | - |
dc.contributor.oralexamcommittee | 劉庭祿;傅楸善;張哲瑋;陳啟煌 | zh_TW |
dc.contributor.oralexamcommittee | Tyng-Luh Liu;Chiou-Shann Fuh;Che-Wei Chang;Chi-Huang Chen | en |
dc.subject.keyword | HOI detection, contrastive learning, representation learning, vision-language model, knowledge transfer | zh_TW |
dc.subject.keyword | HOI detection, contrastive learning, representation learning, CLIP, knowledge transferring | en |
dc.relation.page | 35 | - |
dc.identifier.doi | 10.6342/NTU202301583 | - |
dc.rights.note | Authorized for release (access restricted to on-campus use) | - |
dc.date.accepted | 2023-07-19 | - |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
dc.contributor.author-dept | Department of Computer Science and Information Engineering | - |
Appears in Collections: Department of Computer Science and Information Engineering
Files in This Item:
File | Size | Format |
---|---|---|
ntu-111-2.pdf (access restricted to NTU campus IPs; use the VPN service from off campus) | 2.53 MB | Adobe PDF |