NTU Theses and Dissertations Repository

Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88293
Title: 人物交互檢測模型基於視覺語言模型轉移
A Hybrid Framework for HOI Detection based on VLMs Knowledge Transferring
Authors: 張凱庭
Kai-Ting Chang
Advisor: 賴飛羆
Fei-Pei Lai
Keyword: HOI detection, contrastive learning, representation learning, vision-language model, CLIP, knowledge transfer
Publication Year: 2023
Degree: Master's
Abstract: Human-object interaction (HOI) detection is a fundamental task in computer vision whose goal is to localize human-object pairs in a visual scene and identify their interactions. This work proposes a novel HOI detection framework that transfers prior knowledge from vision-language models to achieve better generalization. Our framework is a hybrid network consisting of two branches: the first branch uses a contrastive loss to improve the human-object interaction representations extracted from CLIP, while the second branch is a classifier that uses a cross-entropy loss to make predictions from the learned features. This dual-branch design improves both the quality of the interaction representations and the classification performance. To further enhance the quality of the HOI features, we present an enhanced Interaction Decoder that incorporates an additional cross-attention layer, which lets the model extract informative, localized regions from CLIP's visual feature map. By integrating this decoder with the detection backbone through a knowledge integration block, we achieve more precise localization of human-object pairs. Our framework demonstrates superior performance under full supervision while adding only a small number of model parameters.
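
The dual-branch design and the cross-attention interaction decoder described in the abstract can be illustrated with a minimal PyTorch sketch. This is a hypothetical reconstruction for illustration only, not the thesis implementation: the class name, feature dimensions, number of HOI classes, loss weighting, and the way CLIP visual tokens and text embeddings are fed in are all assumptions.

# Illustrative sketch (not the thesis code): a dual-branch HOI head that
# refines HOI queries with cross-attention over CLIP visual tokens, then
# trains with a contrastive loss against CLIP text embeddings (branch 1)
# and a cross-entropy classifier (branch 2). All shapes and names are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchHOIHead(nn.Module):
    def __init__(self, d_model=256, clip_dim=512, num_classes=117):
        super().__init__()
        # Extra cross-attention layer: HOI queries attend to CLIP's visual feature map.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.clip_proj = nn.Linear(clip_dim, d_model)   # project CLIP tokens to query dim
        self.norm = nn.LayerNorm(d_model)
        # Branch 1: contrastive head aligning HOI features with CLIP text embeddings.
        self.contrast_proj = nn.Linear(d_model, clip_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07), CLIP-style
        # Branch 2: plain classifier trained with cross-entropy.
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, hoi_queries, clip_visual_tokens, text_embeds, labels):
        # hoi_queries:        (B, Q, d_model)  HOI pair queries from the detector
        # clip_visual_tokens: (B, N, clip_dim) spatial tokens from CLIP's image encoder
        # text_embeds:        (C, clip_dim)    CLIP text embeddings of the C HOI classes
        # labels:             (B, Q)           ground-truth interaction class per query
        mem = self.clip_proj(clip_visual_tokens)
        attn_out, _ = self.cross_attn(hoi_queries, mem, mem)
        feats = self.norm(hoi_queries + attn_out)               # refined HOI features

        # Branch 1: contrastive loss between HOI features and class text embeddings.
        img_feats = F.normalize(self.contrast_proj(feats), dim=-1)   # (B, Q, clip_dim)
        txt_feats = F.normalize(text_embeds, dim=-1)                 # (C, clip_dim)
        sim = self.logit_scale.exp() * img_feats @ txt_feats.t()     # (B, Q, C)
        contrastive_loss = F.cross_entropy(sim.flatten(0, 1), labels.flatten())

        # Branch 2: standard classifier with cross-entropy.
        logits = self.classifier(feats)                              # (B, Q, C)
        cls_loss = F.cross_entropy(logits.flatten(0, 1), labels.flatten())

        return contrastive_loss + cls_loss, logits

if __name__ == "__main__":
    head = DualBranchHOIHead()
    loss, logits = head(
        torch.randn(2, 16, 256),         # 16 HOI queries per image
        torch.randn(2, 196, 512),        # 14x14 CLIP visual tokens
        torch.randn(117, 512),           # CLIP text embeddings of 117 HOI classes
        torch.randint(0, 117, (2, 16)),  # ground-truth interaction labels
    )
    print(loss.item(), logits.shape)

In this sketch the two losses are simply summed and share the same ground-truth labels; the actual framework may weight the branches differently and use the contrastive branch mainly to shape the representation while the classifier branch produces the final interaction predictions.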
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88293
DOI: 10.6342/NTU202301583
Fulltext Rights: Authorized (access restricted to campus network)
Appears in Collections: Department of Computer Science and Information Engineering

Files in This Item:
File           Size     Format
ntu-111-2.pdf  2.53 MB  Adobe PDF  (access limited to NTU IP range)


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
