Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98694

Full metadata record (DC field / value / language):
dc.contributor.advisor: 許永真 (zh_TW)
dc.contributor.advisor: Yung-Jen Hsu (en)
dc.contributor.author: 鄭雅勻 (zh_TW)
dc.contributor.author: Ya-Yun Cheng (en)
dc.date.accessioned: 2025-08-18T16:07:49Z
dc.date.available: 2025-08-19
dc.date.copyright: 2025-08-18
dc.date.issued: 2025
dc.date.submitted: 2025-08-06
dc.identifier.citation:
[1] J. Aklilu, X. Wang, and S. Yeung-Levy. Zero-shot Action Localization via the Confidence of Large Vision-Language Models. arXiv preprint arXiv:2410.14340, 2025.
[2] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko. End-to-End Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision (ECCV), pages 213–229. Springer, 2020.
[3] Z. Chen, F. Zhong, Q. Luo, X. Zhang, and Y. Zheng. EdgeViT: Efficient visual modeling for edge computing. In Wireless Algorithms, Systems, and Applications–WASA 2022, volume 13644 of Lecture Notes in Computer Science, pages 393–405, Berlin, Heidelberg, 2022. Springer-Verlag.
[4] J.-R. Du, K.-Y. Lin, J. Meng, and W.-S. Zheng. Towards Completeness: A Generalizable Action Proposal Generator for Zero-Shot Temporal Action Localization. In Proceedings of the 27th International Conference on Pattern Recognition (ICPR), pages 252–267. Springer, 2024.
[5] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
[6] C. Han, H. Wang, J. Kuang, L. Zhang, and J. Gui. Training-Free Zero-Shot Temporal Action Detection with Vision-Language Models. CoRR, abs/2501.13795, 2025.
[7] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[8] F. C. Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 961–970, 2015.
[9] H. Idrees, A. R. Zamir, Y.-G. Jiang, A. Gorban, I. Laptev, R. Sukthankar, and M. Shah. The THUMOS challenge on action recognition for videos "in the wild". Computer Vision and Image Understanding, 155:1–23, Feb. 2017.
[10] C. Ju, T. Han, K. Zheng, Y. Zhang, and W. Xie. Prompting Visual-Language Models for Efficient Video Understanding. In Proceedings of the European Conference on Computer Vision (ECCV), pages 105–124. Springer, 2022.
[11] C. Ju, Z. Li, P. Zhao, Y. Zhang, X. Zhang, Q. Tian, Y. Wang, and W. Xie. Multi-modal prompting for low-shot temporal action localization. CoRR, abs/2303.11732, 2023.
[12] H.-J. Kim, J.-H. Hong, H. Kong, and S.-W. Lee. TE-TAD: Towards Full End-to-End Temporal Action Detection via Time-Aligned Coordinate Expression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18837–18846. IEEE, 2024.
[13] J. Kim, M. Lee, C.-H. Cho, J. Lee, and J.-P. Heo. Prediction-Feedback DETR for Temporal Action Detection. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 4266–4274, 2025.
[14] J. Kim, M. Lee, and J.-P. Heo. Self-feedback detr for temporal action detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10252–10262. IEEE, 2023.
[15] H. W. Kuhn. The Hungarian Method for the Assignment Problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
[16] F. Li, H. Zhang, S. Liu, J. Guo, L. M. Ni, and L. Zhang. DN-DETR: Accelerate DETR Training by Introducing Query DeNoising. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 46(4):2239–2251, 2024.
[17] B. Liberatori, A. Conti, P. Rota, Y. Wang, and E. Ricci. Test-time zero-shot temporal action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18720–18729. IEEE, 2024.
[18] C. Lin, C. Xu, D. Luo, Y. Wang, Y. Tai, C. Wang, J. Li, F. Huang, and Y. Fu. Learning Salient Boundary Feature for Anchor-free Temporal Action Localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3319–3328. IEEE, 2021.
[19] T. Lin, X. Liu, X. Li, E. Ding, and S. Wen. BMN: Boundary-Matching Network for Temporal Action Proposal Generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3888–3897, 2019.
[20] S. Liu, F. Li, H. Zhang, X. Yang, X. Qi, H. Su, J. Zhu, and L. Zhang. DAB-DETR: Dynamic anchor boxes are better queries for detr. In International Conference on Learning Representations (ICLR), 2022.
[21] S. Liu, C.-L. Zhang, C. Zhao, and B. Ghanem. End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18591–18601, 2024.
[22] X. Liu, Q. Wang, Y. Hu, X. Tang, S. Zhang, S. Bai, and X. Bai. End-to-End Temporal Action Detection With Transformer. IEEE Transactions on Image Processing, 31:5427–5441, 2022.
[23] I. Loshchilov and F. Hutter. Decoupled Weight Decay Regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
[24] D. Meng, X. Chen, Z. Fan, G. Zeng, H. Li, Y. Yuan, L. Sun, and J. Wang. Conditional DETR for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3631–3640. IEEE, 2021.
[25] S. Nag, O. Goldstein, and A. K. Roy-Chowdhury. Semantics Guided Contrastive Learning of Transformers for Zero-shot Temporal Activity Detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 6232–6242. IEEE, 2023.
[26] S. Nag, X. Zhu, Y.-Z. Song, and T. Xiang. Zero-Shot Temporal Action Detection via Vision-Language Prompting. In Proceedings of the European Conference on Computer Vision (ECCV), pages 681–697. Springer, 2022.
[27] OpenAI. GPT-4 Technical Report, 2024.
[28] T. Phan, K. Vo, D. Le, G. Doretto, D. A. Adjeroh, and N. Le. ZEETAD: Adapting Pretrained Vision-Language Model for Zero-Shot End-to-End Temporal Action Detection. CoRR, abs/2311.00729, 2023.
[29] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning Transferable Visual Models From Natural Language Supervision, 2021.
[30] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788. IEEE, 2016.
[31] D. Shi, Y. Zhong, Q. Cao, L. Ma, J. Li, and D. Tao. TriDet: Temporal Action Detection with Relative Boundary Modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18857–18866, 2023.
[32] D. Shi, Y. Zhong, Q. Cao, J. Zhang, L. Ma, J. Li, and D. Tao. ReAct: Temporal Action Detection with Relational Queries. In Proceedings of the European Conference on Computer Vision (ECCV), pages 324–344. Springer, 2022.
[33] J. Tan, J. Tang, L. Wang, and G. Wu. Relaxed Transformer Decoders for Direct Action Proposal Generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 13506–13515. IEEE, 2021.
[34] J. Tan, X. Zhao, X. Shi, B. Kang, and L. Wang. PointTAD: Multi-Label Temporal Action Detection with Learnable Query Points. In Proceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS), page 1111. Curran Associates Inc., 2022.
[35] M. Xu, C. Zhao, D. S. Rojas, A. Thabet, and B. Ghanem. G-TAD: Sub-Graph Localization for Temporal Action Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10153–10162. IEEE, 2020.
[36] S. Yan, X. Xiong, A. Nagrani, A. Arnab, Z. Wang, W. Ge, D. Ross, and C. Schmid. UnLoc: A Unified Framework for Video Localization Tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 13577–13587. IEEE, 2023.
[37] L. Yang, H. Peng, D. Zhang, J. Fu, and J. Han. Revisiting anchor mechanisms for temporal action localization. IEEE Transactions on Image Processing, 29:8535–8548, 2020.
[38] J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and Y. Wu. CoCa: Contrastive Captioners are Image-Text Foundation Models. Transactions on Machine Learning Research (TMLR), 2022.
[39] C. Zhang, J. Wu, and Y. Li. ActionFormer: Localizing Moments of Actions with Transformers. In Proceedings of the European Conference on Computer Vision (ECCV), pages 492–510. Springer, 2022.
[40] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai. Deformable DETR: Deformable Transformers for End-to-End Object Detection. In International Conference on Learning Representations (ICLR), 2021. Oral Presentation.
[41] Y. Zhu, G. Zhang, J. Tan, G. Wu, and L. Wang. Dual DETRs for Multi-Label Temporal Action Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18559–18569. IEEE, 2024.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98694
dc.description.abstract (zh_TW):
在時間動作定位任務中,由於影片本身幀與幀之間變化慢,使用標準 Transformer 注意力機制時,易造成過度平滑的現象。其中一種有效的解法是引入 Deformable DETR 中的可變形注意力機制。然而,特別是在零樣本設定下,所使用的特徵多來自視覺語言模型,因缺乏直觀的時間特徵金字塔,使得現有方法難以充分發揮 Deformable DETR 在偵測短動作方面的潛力,正如其原先在圖像中對小物體偵測所展現的優勢。
為了解決此一限制,我們提出 TP2-DETR,這是一種創新的端對端架構,融合特別設計的時間特徵金字塔網路,以全面釋放 Deformable DETR 在零樣本時間動作區間生成上的潛能。我們探索了不同的 FPN 變體,讓 Deformable DETR 更能發揮功效。為了整體系統的效率與訓練穩定性,我們進一步設計了一個共享、輕量且具多尺度感知能力的顯著性預測頭進行早期監督,並輔以多層輔助的動作區間預測頭提供深層監督訊號。
我們在 THUMOS14 與 ActivityNet1.3 資料集上進行實驗,TP2-DETR 在多數零樣本分割設定中達到最先進的表現;特別是在短動作比例較高的 THUMOS14 資料集中,在兩種常見的零樣本設定下,平均 mAP 分別提升了 5.14% 與 10.27%。上述結果顯示,我們所提出的設計能有效釋放 Deformable DETR 在零樣本時間動作區間生成任務中的潛力。
dc.description.abstract (en):
In temporal action localization, the slow frame-to-frame variation inherent in video often leads to over-smoothing when standard transformer attention is used. A promising remedy is the deformable attention mechanism of Deformable DETR. However, lacking an intuitive temporal feature pyramid, especially in zero-shot settings where features are extracted from vision-language models, existing methods underutilize Deformable DETR's ability to detect short actions, the temporal counterpart of its strength on small objects in images.
In this paper, we introduce TP2-DETR, a novel end-to-end framework that integrates a dedicated Temporal Feature Pyramid Network (FPN) to unlock the full potential of Deformable DETR for Zero-Shot Temporal Action Proposal Generation (ZS-TAPG). We explore different FPN variants to better exploit the capabilities of Deformable DETR. To further ensure efficiency and training stability in the end-to-end system, we design a shared, lightweight, multi-scale-aware salient head for early supervision, complemented by auxiliary prediction heads for deep supervision.
We conducted experiments on the THUMOS14 and ActivityNet1.3 datasets, demonstrating that TP2-DETR achieves state-of-the-art performance across most zero-shot split settings. Notably, it yields particularly large improvements on THUMOS14, which contains a high proportion of short actions, with average mAP gains of 5.14% and 10.27% under two common zero-shot split settings. These findings demonstrate the effectiveness of our design in fully harnessing Deformable DETR for ZS-TAPG.
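The abstract describes the architecture only at a high level, so the following is a rough illustration of the temporal-feature-pyramid idea rather than the thesis's actual implementation. It is a minimal sketch, assuming a strided 1D-convolution design in the spirit of the "CNN-based" FPN variant listed in the table of contents; all names (TemporalFPN, dim, num_levels), shapes, and layer choices are hypothetical.

# Illustrative sketch only; not the thesis's code.
import torch
import torch.nn as nn


class TemporalFPN(nn.Module):
    """Builds a multi-scale temporal feature pyramid by repeatedly halving
    the sequence length with strided 1D convolutions (hypothetical design)."""

    def __init__(self, dim: int = 512, num_levels: int = 4):
        super().__init__()
        self.downsample = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1),
                nn.GroupNorm(32, dim),
                nn.ReLU(inplace=True),
            )
            for _ in range(num_levels - 1)
        ])

    def forward(self, frame_feats: torch.Tensor) -> list:
        # frame_feats: (batch, T, dim) frame-level features, e.g. from a
        # CLIP-style vision-language encoder.
        x = frame_feats.transpose(1, 2)  # (batch, dim, T) for Conv1d
        pyramid = [x]
        for layer in self.downsample:
            x = layer(x)                 # halve the temporal length
            pyramid.append(x)
        # Return one (batch, T_l, dim) tensor per level, finest to coarsest.
        return [level.transpose(1, 2) for level in pyramid]


if __name__ == "__main__":
    feats = torch.randn(2, 128, 512)     # 2 videos, 128 frames, 512-d features
    levels = TemporalFPN()(feats)
    print([tuple(l.shape) for l in levels])
    # [(2, 128, 512), (2, 64, 512), (2, 32, 512), (2, 16, 512)]

In a Deformable DETR-style pipeline, each pyramid level would then typically be flattened and tagged with a level embedding before multi-scale deformable attention samples across scales; the actual TP2-DETR design, salient head, and auxiliary supervision may differ from this sketch.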
dc.description.provenance (en): Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-08-18T16:07:49Z. No. of bitstreams: 0
dc.description.provenance (en): Made available in DSpace on 2025-08-18T16:07:49Z (GMT). No. of bitstreams: 0
dc.description.tableofcontents:
Acknowledgements
摘要
Abstract
Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Background
1.2 Motivation
1.3 Proposed Method
1.4 Outline of the Thesis
Chapter 2 Related Work
2.1 Object Detection
2.2 Temporal Action Localization (TAL)
2.3 Zero-Shot Temporal Action Localization (ZS-TAL)
2.3.1 Training-Based Approaches
2.3.2 Training-Free Approaches
2.4 Over-smoothing in Transformer-based Architectures for TAL
Chapter 3 Problem Statement
3.1 Problem Definition
3.1.1 ZS-TAL
3.1.2 ZS-TAPG
3.2 Notation
Chapter 4 Methodology
4.1 Background
4.1.1 DETR and Deformable DETR Overview
4.1.2 From Small Objects to Short Actions
4.2 Model Overview
4.3 Temporal Feature Pyramid Network
4.3.1 Motivation
4.3.2 Observation from Spatial FPN in Object Detection
4.3.3 Temporal FPN Variants
4.3.3.1 Direct Downsampling
4.3.3.2 CNN-based Design
4.3.3.3 Transformer-based Design
4.4 Multi-Scale Aware Salient Head
4.4.1 Motivation
4.4.2 Salient Head Types
4.4.2.1 CNN-based
4.4.2.2 Unified MLP-based
4.5 Auxiliary Heads for Stable End-to-End Learning
4.5.1 Motivation
4.5.2 Early Supervision on Temporal FPN
4.5.3 Deep Supervision on Decoder Layers
4.6 Training
4.6.1 Bipartite Matching
4.6.2 Training Objectives
4.7 Inference
Chapter 5 Experiments
5.1 Datasets
5.1.1 THUMOS14
5.1.2 ActivityNet1.3
5.1.3 Zero-Shot Split Settings
5.2 Evaluation Metrics
5.3 Implementation Details
5.4 Main Results
5.4.1 Comparison with State-of-the-Art Methods
5.4.2 Comparison with Deformable DETR-based Methods
5.5 Further Analysis and Ablation Study
5.5.1 Choice of Temporal FPN
5.5.2 Design of Temporal FPN
5.5.3 Design of Salient Head
5.5.4 Effectiveness of Components
5.5.5 Qualitative Results
5.5.6 Prediction Quality
Chapter 6 Conclusion
6.1 Contributions
6.2 Limitations and Future Work
References
dc.language.iso: en
dc.subject: 零樣本學習 (zh_TW)
dc.subject: 時間動作區間生成 (zh_TW)
dc.subject: 短動作定位 (zh_TW)
dc.subject: 特徵金字塔網路 (zh_TW)
dc.subject: 可變形DETR (zh_TW)
dc.subject: Temporal Action Proposal Generation (en)
dc.subject: Zero-Shot Learning (en)
dc.subject: Short Action Localization (en)
dc.subject: Feature Pyramid Network (en)
dc.subject: Deformable DETR (en)
dc.title: 結合時間特徵金字塔以釋放 Deformable DETR 於零樣本時間動作區段生成之潛力 (zh_TW)
dc.title: TP2-DETR: Unlocking Deformable DETR for Zero-Shot Temporal Action Proposal Generation with Temporal Feature Pyramids (en)
dc.type: Thesis
dc.date.schoolyear: 113-2
dc.description.degree: 碩士
dc.contributor.coadvisor: 鄭文皇 (zh_TW)
dc.contributor.coadvisor: Wen-Huang Cheng (en)
dc.contributor.oralexamcommittee: 吳家麟;陳駿丞;楊智淵 (zh_TW)
dc.contributor.oralexamcommittee: Jia-Lin Wu;Jun-Cheng Chen;Chih-Yuan Yang (en)
dc.subject.keyword: 時間動作區間生成,零樣本學習,可變形DETR,特徵金字塔網路,短動作定位 (zh_TW)
dc.subject.keyword: Temporal Action Proposal Generation,Zero-Shot Learning,Deformable DETR,Feature Pyramid Network,Short Action Localization (en)
dc.relation.page: 63
dc.identifier.doi: 10.6342/NTU202503736
dc.rights.note: 同意授權(全球公開)
dc.date.accepted: 2025-08-11
dc.contributor.author-college: 電機資訊學院
dc.contributor.author-dept: 資訊工程學系
dc.date.embargo-lift: 2025-08-19
Appears in Collections: 資訊工程學系

Files in This Item:
File: ntu-113-2.pdf | Size: 2.36 MB | Format: Adobe PDF


Items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
