Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98694

Full metadata record

| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 許永真 | zh_TW |
| dc.contributor.advisor | Yung-Jen Hsu | en |
| dc.contributor.author | 鄭雅勻 | zh_TW |
| dc.contributor.author | Ya-Yun Cheng | en |
| dc.date.accessioned | 2025-08-18T16:07:49Z | - |
| dc.date.available | 2025-08-19 | - |
| dc.date.copyright | 2025-08-18 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-08-06 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98694 | - |
| dc.description.abstract | 在時間動作定位任務中,由於影片本身幀與幀之間變化慢,使用標準 Transformer 注意力機制時,易造成過度平滑的現象。其中一種有效的解法是引入 Deformable DETR 中的可變形注意力機制。然而,特別是在零樣本設定下,所使用的特徵多來自視覺語言模型,因缺乏直觀的時間特徵金字塔,使得現有方法難以充分發揮 Deformable DETR 在偵測短動作方面的潛力,正如其原先在圖像中對小物體偵測所展現的優勢。為了解決此一限制,我們提出 TP2-DETR,這是一種創新的端對端架構,融合特別設計的時間特徵金字塔網路,以全面釋放 Deformable DETR 在零樣本時間動作區間生成上的潛能。我們探索了不同的 FPN 變體來更好地讓 Deformable DETR 發揮功效。而進一步為了整體系統的效率與訓練穩定性,我們設計了一個共享、輕量且具多尺度感知能力的顯著性預測頭進行早期監督,並加以多層輔助的動作區間預測頭提供深層監督訊號。我們在 THUMOS14 與 ActivityNet1.3 資料集上進行實驗,TP2-DETR 在多數零樣本分割設定中達到最先進的表現,特別是在短動作比例較高的 THUMOS14 資料集中,在兩種常見的零樣本設定下,平均 mAP 分別提升了 5.14% 與 10.27%。上述結果顯示,我們所提出的設計能有效釋放 Deformable DETR 在零樣本時間動作區間生成任務中的潛力。 | zh_TW |
| dc.description.abstract | In temporal action localization, the inherent slowness of videos, where adjacent frames change very little, often causes standard transformer attention to over-smooth features. A promising remedy is the deformable attention of Deformable DETR. However, the lack of an intuitive temporal feature pyramid, especially in zero-shot settings where features are extracted from vision-language models, means that existing methods underutilize Deformable DETR's ability to detect short actions, the temporal counterpart of its advantage on small objects in images. In this paper, we introduce TP2-DETR, a novel end-to-end framework that integrates a dedicated Temporal Feature Pyramid Network (FPN) to unlock the full potential of Deformable DETR for Zero-Shot Temporal Action Proposal Generation (ZS-TAPG). We explore different FPN variants to better exploit the capabilities of Deformable DETR. To further ensure efficiency and training stability in the end-to-end system, we design a shared, lightweight, multi-scale-aware salient head for early supervision, complemented by auxiliary prediction heads for deep supervision. Experiments on the Thumos14 and ActivityNet1.3 datasets show that TP2-DETR achieves state-of-the-art performance across most zero-shot split settings. Notably, it yields particularly large gains on Thumos14, which contains a high proportion of short actions, improving average mAP by 5.14% and 10.27% under the two common zero-shot split settings. These findings demonstrate the effectiveness of our design in fully harnessing Deformable DETR for ZS-TAPG. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-08-18T16:07:49Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-08-18T16:07:49Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Acknowledgements i; 摘要 (Abstract in Chinese) ii; Abstract iv; Contents vi; List of Figures x; List of Tables xiii; Chapter 1 Introduction 1; 1.1 Background 1; 1.2 Motivation 2; 1.3 Proposed Method 3; 1.4 Outline of the Thesis 4; Chapter 2 Related Work 6; 2.1 Object Detection 6; 2.2 Temporal Action Localization (TAL) 7; 2.3 Zero-Shot Temporal Action Localization (ZS-TAL) 8; 2.3.1 Training-Based Approaches 9; 2.3.2 Training-Free Approaches 10; 2.4 Over-smoothing in Transformer-based Architectures for TAL 11; Chapter 3 Problem Statement 13; 3.1 Problem Definition 14; 3.1.1 ZS-TAL 14; 3.1.2 ZS-TAPG 15; 3.2 Notation 16; Chapter 4 Methodology 18; 4.1 Background 18; 4.1.1 DETR and Deformable DETR Overview 18; 4.1.2 From Small Objects to Short Actions 19; 4.2 Model Overview 19; 4.3 Temporal Feature Pyramid Network 21; 4.3.1 Motivation 21; 4.3.2 Observation from Spatial FPN in Object Detection 22; 4.3.3 Temporal FPN Variants 23; 4.3.3.1 Direct Downsampling 23; 4.3.3.2 CNN-based Design 24; 4.3.3.3 Transformer-based Design 25; 4.4 Multi-Scale Aware Salient Head 27; 4.4.1 Motivation 27; 4.4.2 Salient Head Types 27; 4.4.2.1 CNN-based 27; 4.4.2.2 Unified MLP-based 28; 4.5 Auxiliary Heads for Stable End-to-End Learning 29; 4.5.1 Motivation 29; 4.5.2 Early Supervision on Temporal FPN 29; 4.5.3 Deep Supervision on Decoder Layers 30; 4.6 Training 30; 4.6.1 Bipartite Matching 30; 4.6.2 Training Objectives 31; 4.7 Inference 32; Chapter 5 Experiments 34; 5.1 Datasets 34; 5.1.1 Thumos14 34; 5.1.2 ActivityNet1.3 35; 5.1.3 Zero-Shot Split Settings 35; 5.2 Evaluation Metrics 36; 5.3 Implementation Details 38; 5.4 Main Results 39; 5.4.1 Comparison with State-of-the-Art Methods 39; 5.4.2 Comparison with Deformable DETR-based methods 41; 5.5 Further Analysis and Ablation Study 43; 5.5.1 Choice of Temporal FPN 44; 5.5.2 Design of Temporal FPN 45; 5.5.3 Design of Salient Head 46; 5.5.4 Effectiveness of Components 47; 5.5.5 Qualitative Results 49; 5.5.6 Prediction Quality 50; Chapter 6 Conclusion 53; 6.1 Contributions 53; 6.2 Limitations and Future Work 54; References 58 | - |
| dc.language.iso | en | - |
| dc.subject | 零樣本學習 | zh_TW |
| dc.subject | 時間動作區間生成 | zh_TW |
| dc.subject | 短動作定位 | zh_TW |
| dc.subject | 特徵金字塔網路 | zh_TW |
| dc.subject | 可變形DETR | zh_TW |
| dc.subject | Temporal Action Proposal Generation | en |
| dc.subject | Zero-Shot Learning | en |
| dc.subject | Short Action Localization | en |
| dc.subject | Feature Pyramid Network | en |
| dc.subject | Deformable DETR | en |
| dc.title | 結合時間特徵金字塔以釋放 Deformable DETR 於零樣本時間動作區段生成之潛力 | zh_TW |
| dc.title | TP2-DETR: Unlocking Deformable DETR for Zero-Shot Temporal Action Proposal Generation with Temporal Feature Pyramids | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-2 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.coadvisor | 鄭文皇 | zh_TW |
| dc.contributor.coadvisor | Wen-Huang Cheng | en |
| dc.contributor.oralexamcommittee | 吳家麟;陳駿丞;楊智淵 | zh_TW |
| dc.contributor.oralexamcommittee | Jia-Lin Wu;Jun-Cheng Chen;Chih-Yuan Yang | en |
| dc.subject.keyword | 時間動作區間生成,零樣本學習,可變形DETR,特徵金字塔網路,短動作定位 | zh_TW |
| dc.subject.keyword | Temporal Action Proposal Generation,Zero-Shot Learning,Deformable DETR,Feature Pyramid Network,Short Action Localization | en |
| dc.relation.page | 63 | - |
| dc.identifier.doi | 10.6342/NTU202503736 | - |
| dc.rights.note | 同意授權(全球公開) | - |
| dc.date.accepted | 2025-08-11 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 資訊工程學系 | - |
| dc.date.embargo-lift | 2025-08-19 | - |
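
The English abstract above outlines TP2-DETR's core idea: build a temporal feature pyramid on top of frame-level vision-language features so that Deformable DETR's multi-scale deformable attention can recover short actions, analogous to small objects in images. The thesis code is not part of this record, so the snippet below is only a minimal sketch of that idea, assuming PyTorch; the class and parameter names (`TemporalFPN`, `feat_dim`, `hidden_dim`, `num_levels`) are illustrative and not taken from the thesis.

```python
# Minimal illustrative sketch (not the thesis implementation): build a temporal
# feature pyramid from frame-level features so a Deformable-DETR-style encoder
# can attend across several temporal resolutions. All names are hypothetical.
import torch
import torch.nn as nn


class TemporalFPN(nn.Module):
    """Toy multi-scale 1D pyramid over (batch, time, channel) clip features."""

    def __init__(self, feat_dim: int = 512, hidden_dim: int = 256, num_levels: int = 4):
        super().__init__()
        self.input_proj = nn.Conv1d(feat_dim, hidden_dim, kernel_size=1)
        # Each extra level halves the temporal length with a stride-2 conv,
        # mirroring the spatial FPN levels that help small-object detection.
        self.downsamples = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, stride=2, padding=1),
                nn.GroupNorm(32, hidden_dim),
                nn.ReLU(inplace=True),
            )
            for _ in range(num_levels - 1)
        )

    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
        # x: (B, T, C) frozen vision-language features -> (B, C, T) for Conv1d.
        feat = self.input_proj(x.transpose(1, 2))
        pyramid = [feat]
        for down in self.downsamples:
            pyramid.append(down(pyramid[-1]))
        # Levels of length T, T/2, T/4, ... would then be flattened and fed,
        # with level embeddings, to a multi-scale deformable-attention encoder.
        return pyramid


if __name__ == "__main__":
    fpn = TemporalFPN(feat_dim=512, hidden_dim=256, num_levels=4)
    clip_feats = torch.randn(2, 128, 512)  # 2 videos, 128 snippets, 512-d features
    for level, feat in enumerate(fpn(clip_feats)):
        print(level, tuple(feat.shape))     # (2, 256, 128), (2, 256, 64), ...
```

The thesis additionally explores CNN- and Transformer-based pyramid variants and a shared, multi-scale-aware salient head with auxiliary supervision; this sketch does not attempt to reproduce those components.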
| Appears in Collections: | 資訊工程學系 | |
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| ntu-113-2.pdf | 2.36 MB | Adobe PDF | View/Open |
All items in this repository are protected by copyright, with all rights reserved, unless their copyright terms are otherwise indicated.
