Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98694
| Title: | 結合時間特徵金字塔以釋放 Deformable DETR 於零樣本時間動作區段生成之潛力 TP2-DETR: Unlocking Deformable DETR for Zero-Shot Temporal Action Proposal Generation with Temporal Feature Pyramids |
| Author: | 鄭雅勻 Ya-Yun Cheng |
| Advisor: | 許永真 Yung-Jen Hsu |
| Co-advisor: | 鄭文皇 Wen-Huang Cheng |
| Keywords: | Temporal Action Proposal Generation, Zero-Shot Learning, Deformable DETR, Feature Pyramid Network, Short Action Localization |
| Publication year: | 2025 |
| Degree: | Master's |
| Abstract: | In temporal action localization, video content changes slowly from frame to frame, so standard transformer attention tends to over-smooth the features. A promising solution is to adopt the deformable attention mechanism from Deformable DETR. However, features in zero-shot settings are mostly extracted from vision-language models and lack an intuitive temporal feature pyramid, so existing methods underutilize Deformable DETR's ability to detect short actions, an ability analogous to its original advantage in detecting small objects in images. To address this limitation, we introduce TP2-DETR, a novel end-to-end framework that integrates a dedicated Temporal Feature Pyramid Network (FPN) to unlock the full potential of Deformable DETR for Zero-Shot Temporal Action Proposal Generation (ZS-TAPG). We explore different FPN variants to better leverage the capabilities of Deformable DETR. To further ensure efficiency and training stability in the end-to-end system, we design a shared, lightweight, multi-scale-aware saliency prediction head for early supervision, complemented by multi-layer auxiliary proposal prediction heads for deep supervision. We conduct experiments on the THUMOS14 and ActivityNet1.3 datasets, and TP2-DETR achieves state-of-the-art performance in most zero-shot split settings. The improvements are particularly significant on THUMOS14, which contains a high proportion of short actions: average mAP improves by 5.14% and 10.27% under two common zero-shot split settings. These results show that the proposed design effectively harnesses Deformable DETR for ZS-TAPG. |
| URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98694 |
| DOI: | 10.6342/NTU202503736 |
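| Full-text license: | Authorized (open access worldwide) |
| Full-text release date: | 2025-08-19 |
| Appears in collections: | Department of Computer Science and Information Engineering |
Files in this item:
| File | Size | Format | |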
|---|---|---|---|
| ntu-113-2.pdf | 2.36 MB | Adobe PDF | View/Open |
Unless their copyright terms are specifically stated, all items in this system are protected by copyright, with all rights reserved.
