Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96750

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 許永真 | zh_TW |
| dc.contributor.advisor | Jane Yung-jen Hsu | en |
| dc.contributor.author | 廖金億 | zh_TW |
| dc.contributor.author | Chin-Yi Liao | en |
| dc.date.accessioned | 2025-02-21T16:23:11Z | - |
| dc.date.available | 2026-01-31 | - |
| dc.date.copyright | 2025-02-21 | - |
| dc.date.issued | 2024 | - |
| dc.date.submitted | 2024-12-16 | - |
| dc.identifier.citation | [1] Shuming Liu, Chen-Lin Zhang, Chen Zhao, and Bernard Ghanem. End-to-end temporal action detection with 1B parameters across 1000 frames. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18591–18601, June 2024. [2] Dingfeng Shi, Yujie Zhong, Qiong Cao, Lin Ma, Jia Li, and Dacheng Tao. TriDet: Temporal action detection with relative boundary modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18857–18866, 2023. [3] Chen-Lin Zhang, Jianxin Wu, and Yin Li. ActionFormer: Localizing moments of actions with transformers. In European Conference on Computer Vision (ECCV), volume 13664 of LNCS, pages 492–510, 2022. [4] Haroon Idrees, Amir R. Zamir, Yu-Gang Jiang, Alex Gorban, Ivan Laptev, Rahul Sukthankar, and Mubarak Shah. The THUMOS challenge on action recognition for videos "in the wild". Computer Vision and Image Understanding, 155:1–23, 2017. [5] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 961–970, 2015. [6] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100. International Journal of Computer Vision (IJCV), 130:33–55, 2022. [7] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019. [8] Ze Yang, Shaohui Liu, Han Hu, Liwei Wang, and Stephen Lin. RepPoints: Point set representation for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019. [9] Chenchen Zhu, Yihui He, and Marios Savvides. Feature selective anchor-free module for single-shot object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 840–849, 2019. [10] Xiang Li, Wenhai Wang, Lijun Wu, Shuo Chen, Xiaolin Hu, Jun Li, Jinhui Tang, and Jian Yang. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. In NeurIPS, 2020. [11] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2117–2125, 2017. [12] Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2999–3007, 2017. [13] Krishna Kumar Singh and Yong Jae Lee. Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017. [14] Junsuk Choe and Hyunjung Shim. Attention-based dropout layer for weakly supervised object localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2219–2228, 2019. [15] Junsuk Choe, Seong Joon Oh, Seungho Lee, Sanghyuk Chun, Zeynep Akata, and Hyunjung Shim. Evaluating weakly supervised object localization methods right. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. [16] Xiaolin Zhang, Yunchao Wei, and Yi Yang. Inter-image communication for weakly supervised localization. In European Conference on Computer Vision (ECCV). Springer, 2020. [17] Eunji Kim, Siwon Kim, Jungbeom Lee, Hyunwoo Kim, and Sungroh Yoon. Bridging the gap between classification and localization for weakly supervised object localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14258–14267, 2022. [18] Haolan Xue, Chang Liu, Fang Wan, Jianbin Jiao, Xiangyang Ji, and Qixiang Ye. DANet: Divergent activation for weakly supervised object localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6589–6598, 2019. [19] Shyamal Buch, Victor Escorcia, Chuanqi Shen, Bernard Ghanem, and Juan Carlos Niebles. SST: Single-stream temporal action proposals. In CVPR, 2017. [20] Runhao Zeng, Wenbing Huang, Mingkui Tan, Yu Rong, Peilin Zhao, Junzhou Huang, and Chuang Gan. Graph convolutional networks for temporal action localization. In ICCV, 2019. [21] Mengmeng Xu, Chen Zhao, David S. Rojas, Ali Thabet, and Bernard Ghanem. G-TAD: Sub-graph localization for temporal action detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. [22] D. Sridhar, N. Quader, S. Muralidharan, Y. Li, P. Dai, and J. Lu. Class semantics-based attention for action detection. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 13719–13728, Los Alamitos, CA, USA, October 2021. IEEE Computer Society. [23] Zhiwu Qing, Haisheng Su, Weihao Gan, Dongliang Wang, Wei Wu, Xiang Wang, Yu Qiao, Junjie Yan, Changxin Gao, and Nong Sang. Temporal context aggregation network for temporal action proposal refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 485–494, 2021. [24] Y. Chao, S. Vijayanarasimhan, B. Seybold, D. A. Ross, J. Deng, and R. Sukthankar. Rethinking the Faster R-CNN architecture for temporal action localization. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1130–1139, Los Alamitos, CA, USA, June 2018. IEEE Computer Society. [25] Zheng Shou, Jonathan Chan, Alireza Zareian, Kazuyuki Miyazawa, and Shih-Fu Chang. CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In CVPR, 2017. [26] Zixin Zhu, Wei Tang, Le Wang, Nanning Zheng, and G. Hua. Enriching local and global contexts for temporal action localization. In ICCV, 2021. [27] Zhiwu Qing, Haisheng Su, Weihao Gan, Dongliang Wang, Wei Wu, Xiang Wang, Yu Qiao, Junjie Yan, Changxin Gao, and Nong Sang. Temporal context aggregation network for temporal action proposal refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 485–494, 2021. [28] D. Sridhar, N. Quader, S. Muralidharan, Y. Li, P. Dai, and J. Lu. Class semantics-based attention for action detection. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 13719–13728, Los Alamitos, CA, USA, October 2021. IEEE Computer Society. [29] Chuming Lin, Chengming Xu, Donghao Luo, Yabiao Wang, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, and Yanwei Fu. Learning salient boundary feature for anchor-free temporal action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3320–3329, June 2021. [30] Le Yang, Houwen Peng, Dingwen Zhang, Jianlong Fu, and Junwei Han. Revisiting anchor mechanisms for temporal action localization. IEEE Transactions on Image Processing, PP:1–1, August 2020. [31] Min Yang, Guo Chen, Yin-Dong Zheng, Tong Lu, and Limin Wang. BasicTAD: An astounding RGB-only baseline for temporal action detection. Computer Vision and Image Understanding, 232:103692, 2023. [32] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021. [33] Sujoy Paul, Sourya Roy, and Amit K. Roy-Chowdhury. W-TALC: Weakly-supervised temporal activity localization and classification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 563–579, 2018. [34] Sanath Narayan, Hisham Cholakkal, Fahad Shahbaz Khan, and Ling Shao. 3C-Net: Category count and center loss for weakly-supervised action localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019. [35] Linjiang Huang, Yan Huang, Wanli Ouyang, and Liang Wang. Relational prototypical network for weakly supervised temporal action localization. Proceedings of the AAAI Conference on Artificial Intelligence, 34:11053–11060, April 2020. [36] Kyle Min and Jason J. Corso. Adversarial background-aware loss for weakly-supervised temporal activity localization. In European Conference on Computer Vision (ECCV), pages 283–299. Springer, 2020. [37] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach for deep face recognition. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision – ECCV 2016, pages 499–515, Cham, 2016. Springer International Publishing. [38] Hong-Ming Yang, Xu-Yao Zhang, Fei Yin, and Cheng-Lin Liu. Robust classification with convolutional prototype learning. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3474–3482, 2018. [39] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015. [40] Jia-Xing Zhong, Nannan Li, Weijie Kong, Tao Zhang, Thomas H. Li, and Ge Li. Step-by-step erasion, one-by-one collection: A weakly supervised temporal action detector. In Proceedings of the 26th ACM International Conference on Multimedia (MM '18), pages 35–44, New York, NY, USA, 2018. Association for Computing Machinery. [41] Linjiang Huang, Liang Wang, and Hongsheng Li. Foreground-action consistency network for weakly supervised temporal action localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8002–8011, October 2021. [42] Ashraful Islam, Chengjiang Long, and Richard J. Radke. A hybrid attention mechanism for weakly-supervised temporal action localization, 2021. [43] Zhekun Luo, Devin Guillory, Baifeng Shi, Wei Ke, Fang Wan, Trevor Darrell, and Huijuan Xu. Weakly-supervised action localization with expectation-maximization multi-instance learning. In Computer Vision – ECCV 2020, 2020. [44] Linjiang Huang, Liang Wang, and Hongsheng Li. Weakly supervised temporal action localization via representative snippet knowledge propagation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. [45] Yuanhao Zhai, Le Wang, Wei Tang, Qilin Zhang, Junsong Yuan, and Gang Hua. Two-stream consensus network for weakly-supervised temporal action localization. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VI, pages 37–54. Springer, 2020. [46] W. Yang, T. Zhang, X. Yu, T. Qi, Y. Zhang, and Feng Wu. Uncertainty guided collaborative training for weakly supervised temporal action detection. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 53–63, Los Alamitos, CA, USA, June 2021. IEEE Computer Society. [47] Qing Yu and Kent Fujiwara. Frame-level label refinement for skeleton-based weakly-supervised action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 3322–3330, 2023. [48] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 936–944, 2017. [49] Jingqiu Zhou, Linjiang Huang, Liang Wang, Si Liu, and Hongsheng Li. Improving weakly supervised temporal action localization by bridging train-test gap in pseudo labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23003–23012, 2023. [50] Abhinanda R. Punnakkal, Arjun Chandrasekaran, Nikos Athanasiou, Alejandra Quiros-Ramirez, and Michael J. Black. BABEL: Bodies, action and behavior with English labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 722–731, June 2021. [51] Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Gerard Pons-Moll, and Michael J. Black. AMASS: Archive of motion capture as surface shapes. In International Conference on Computer Vision (ICCV), pages 5442–5451, October 2019. [52] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, October 2015. [53] Can Zhang, Meng Cao, Dongming Yang, Jie Chen, and Yuexian Zou. CoLA: Weakly-supervised temporal action localization with snippet contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16010–16019, June 2021. [54] Lei Shi, Yifan Zhang, Jian Cheng, and Hanqing Lu. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In CVPR, 2019. | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96750 | - |
| dc.description.abstract | 由於有限的標註和缺乏骨骼結構以外的資訊,基於骨骼資訊的弱監督時間動作定位問題面臨著巨大的挑戰。我們提出了SMART,這是一種創新的方法,透過幾項關鍵貢獻解決了這些限制。首先,藉由引入多尺度特徵金字塔概念,SMART捕捉了更豐富的特徵資訊,提升了對影片序列中動作的整體理解。另外,我們的研究提出了兩個創新模組,以提高動作定位的準確性和穩健性。類別加權特徵對齊模組透過有效地對齊不同尺度的特徵,提高了動作識別和定位的精確度。動態高斯實例融合模組生成更高品質的動作邊界,並改善了對各種動作類型和持續時間的適應性。在Babel資料集和我們實驗室專有的AIMS資料集上進行評估,SMART在基於骨骼的弱監督時間動作定位任務中達到了最先進的表現。這項研究代表了在解決多樣化影片中有限標註的動作定位挑戰方面的重大進展。 | zh_TW |
| dc.description.abstract | Skeleton-based weakly supervised temporal action localization faces challenges due to limited annotations and the lack of information beyond skeletal structures. We introduce SMART, a novel approach that addresses these limitations through several key contributions. By incorporating a multi-scale feature pyramid concept, SMART captures richer feature information, enhancing the overall understanding of actions in video sequences. Our work presents two innovative modules to improve action localization accuracy and robustness. The Class-Weighted Feature Alignment module enhances action identification and localization precision by effectively aligning features across different scales. The Dynamic Gaussian-Based Instance Fusion module generates higher-quality action boundaries with improved adaptability to various action types and durations. Evaluated on the Babel dataset and our lab's proprietary AIMS dataset, SMART achieves state-of-the-art performance in skeleton-based weakly supervised temporal action localization. This work represents a significant advancement in addressing the challenges of action localization with limited annotations in diverse video understanding scenarios. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-02-21T16:23:11Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-02-21T16:23:11Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Acknowledgements i Abstract ii 摘要 iii List of Figures vii List of Tables ix List of Algorithms x Chapter 1 Introduction 1 1.1 Background 1 1.2 Motivation 2 1.3 Proposed Method 3 1.4 Outline of the Thesis 4 Chapter 2 Literature Review 5 2.1 Object Detection 5 2.2 Temporal Action Localization 6 2.2.1 Fully-supervised Temporal Action Localization 7 2.2.2 Weakly-supervised Temporal Action Localization 7 Chapter 3 Problem Statement 10 3.1 Symbol Table 10 3.2 Problem Definition 11 Chapter 4 Methodology 12 4.1 Overview 13 4.1.1 Basic Model 14 4.2 Feature Extractor: Feature Pyramid Network Enhanced GCN-based Architecture 15 4.3 Temporal Action Localization 16 4.3.1 Class-Weighted Feature Alignment 17 4.3.2 Training Objectives 19 4.4 Post-Processing Method 20 4.4.1 Dynamic Gaussian-Based Instance Fusion 20 Chapter 5 Experiments and Analysis 24 5.1 Datasets 24 5.1.1 Babel datasets 25 5.1.2 AIMS datasets 26 5.2 Metric of Evaluation 27 5.3 Implementation Details 28 5.4 Main Results 30 5.4.1 Comparison with The State-of-the-art 30 5.5 Further Analysis 33 5.5.1 Effectiveness of Each Component 33 5.5.2 Integrating Our Framework with Existing Methods 35 5.5.3 Analysis of Feature Alignment Impact 36 5.5.4 Effect of Timing for Adding the Feature Alignment Technique 38 5.5.5 Qualitative Results 40 Chapter 6 Conclusion 42 6.1 Contribution 42 6.2 Future work 43 Bibliography 45 Appendix A — AIMS Dataset Details 54 A.1 Alberta Infant Motor Scale (AIMS) 54 A.2 Data Processing Pipeline 56 Appendix B — Post-processing Methods for Temporal Action Detection 57 B.1 Overview of Instance Fusion Methods 57 B.1.1 Non-Maximum Suppression 57 B.1.2 Gaussian-Weighted Instance Fusion 58 | - |
| dc.language.iso | en | - |
| dc.subject | 骨架資料 | zh_TW |
| dc.subject | 時序動作定位 | zh_TW |
| dc.subject | 弱監督學習 | zh_TW |
| dc.subject | 影片理解 | zh_TW |
| dc.subject | 特徵對齊 | zh_TW |
| dc.subject | Feature alignment | en |
| dc.subject | Skeleton Data | en |
| dc.subject | Video Understanding | en |
| dc.subject | Temporal action localization | en |
| dc.subject | Weakly supervised learning | en |
| dc.title | 基於骨架的多尺度特徵對齊用於穩健時間動作定位 | zh_TW |
| dc.title | SMART: Skeleton-based Multi-scale Feature Alignment for Robust Temporal Action Localization | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-1 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.coadvisor | 傅立成 | zh_TW |
| dc.contributor.coadvisor | Li-Chen Fu | en |
| dc.contributor.oralexamcommittee | 鄭素芳;郭彥伶 | zh_TW |
| dc.contributor.oralexamcommittee | Suh-Fang Jeng;Yen-Ling Kuo | en |
| dc.subject.keyword | 時序動作定位,弱監督學習,特徵對齊,骨架資料,影片理解 | zh_TW |
| dc.subject.keyword | Temporal action localization, Weakly supervised learning, Feature alignment, Skeleton Data, Video Understanding | en |
| dc.relation.page | 59 | - |
| dc.identifier.doi | 10.6342/NTU202404729 | - |
| dc.rights.note | 未授權 | - |
| dc.date.accepted | 2024-12-17 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 資訊網路與多媒體研究所 | - |
| dc.date.embargo-lift | N/A | - |
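The English abstract above and Appendix B of the table of contents (B.1.2 Gaussian-Weighted Instance Fusion) refer to a Dynamic Gaussian-Based Instance Fusion post-processing step. The thesis body is not included in this record, so the sketch below is only a minimal, generic illustration of Gaussian-weighted fusion of overlapping temporal action proposals, not the authors' implementation; the function names and the `iou_thr` and `sigma` parameters are illustrative assumptions.

```python
import numpy as np

def temporal_iou(seg, segs):
    """1-D IoU between one [start, end] segment and an (N, 2) array of segments."""
    inter = np.maximum(
        0.0,
        np.minimum(seg[1], segs[:, 1]) - np.maximum(seg[0], segs[:, 0]),
    )
    union = (seg[1] - seg[0]) + (segs[:, 1] - segs[:, 0]) - inter
    return inter / np.maximum(union, 1e-8)

def gaussian_weighted_fusion(proposals, scores, iou_thr=0.5, sigma=0.4):
    """Merge overlapping temporal proposals into fused action instances.

    proposals: (N, 2) array of [start, end]; scores: (N,) confidences.
    Each fused boundary is a weighted average over the cluster around the
    current highest-scoring proposal, with Gaussian weights
    score * exp(-(1 - IoU)^2 / sigma) that emphasize close overlaps.
    """
    proposals = np.asarray(proposals, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores)          # process highest-scoring first
    proposals, scores = proposals[order], scores[order]

    fused = []
    while len(proposals) > 0:
        ious = temporal_iou(proposals[0], proposals)
        cluster = ious >= iou_thr        # proposals covering the same action
        w = scores[cluster] * np.exp(-((1.0 - ious[cluster]) ** 2) / sigma)
        start, end = (w @ proposals[cluster]) / w.sum()
        fused.append((start, end, scores[0]))
        proposals, scores = proposals[~cluster], scores[~cluster]
    return fused

# Example: three overlapping detections of one action collapse into a single
# fused instance; the distant segment survives as a separate instance.
print(gaussian_weighted_fusion([[1.0, 4.0], [1.2, 4.4], [0.8, 3.6], [9.0, 11.0]],
                               [0.9, 0.7, 0.6, 0.8]))
```

Unlike hard non-maximum suppression, which keeps the top proposal and discards its overlaps outright, fusion in this style averages the boundaries of each overlapping cluster with weights that decay smoothly as temporal IoU falls, which is the usual motivation for Gaussian-weighted variants.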
| Appears in Collections: | 資訊網路與多媒體研究所 |
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-113-1.pdf (not authorized for public access) | 3.42 MB | Adobe PDF |
All items in the system are protected by copyright, with all rights reserved, unless otherwise indicated.