Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98518

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 鄭文皇 | zh_TW |
| dc.contributor.advisor | Wen-Huang Cheng | en |
| dc.contributor.author | 鄒玲 | zh_TW |
| dc.contributor.author | Ling Zou | en |
| dc.date.accessioned | 2025-08-14T16:25:34Z | - |
| dc.date.available | 2025-08-15 | - |
| dc.date.copyright | 2025-08-14 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-08-01 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98518 | - |
| dc.description.abstract | 細粒度影片異常檢測(Fine-grained Video Anomaly Detection, FG-VAD)旨在僅利用影片層級的異常存在指示和相應的語義類別標籤,對影片中的異常幀進行定位。儘管現有的大多數方法都利用了 CLIP 特徵來解決這個問題,但仍存在關鍵挑戰。在視覺方面,CLIP 特徵雖然在靜態影像上表現良好,但缺乏時序感知,常常因光照突變、物體快速移動或幀切換過快而導致誤報。在語意方面,許多方法難以充分捕捉所提供類別標籤的細微語義,導致相關異常行為的誤判。為了解決這些局限性,我們提出了一種新方法,包含兩個關鍵模組:(1)視覺時序平滑(Visual Temporal Smoothing,VTS)模組,透過引入時序一致性來減少誤報;(2)文字增強表徵模組(Text-Enhanced Representation,TER),利用大型語言模型(Large Language Model,LLM)豐富異常標籤的語義理解,從而實現更準確的幀級分類。在兩個基準資料集上的大量實驗和全面的消融研究表明,我們的方法有效性顯著,優於現有的最新方法。 | zh_TW |
| dc.description.abstract | Fine-grained Video Anomaly Detection (FG-VAD) aims to localize anomalous frames within a video using only video-level indications of anomaly presence and a corresponding semantic category label. While most existing methods leverage CLIP features to tackle this problem, key challenges remain. On the visual side, CLIP features are effective for static images but lack temporal awareness, often leading to false alarms caused by sudden changes in illumination, rapid object motion, or fast frame transitions. On the semantic side, many approaches struggle to capture the nuanced meaning of the provided category label, resulting in missed detections of relevant anomalous actions. To overcome these limitations, we propose a novel method with two key components: (1) a Visual Temporal Smoothing (VTS) module designed to reduce false positives by incorporating temporal consistency, and (2) a Text-Enhanced Representation (TER) module that utilizes large language models (LLMs) to enrich the semantic understanding of anomaly labels, enabling more accurate frame-level classification. Extensive experiments on two benchmark datasets, along with comprehensive ablation studies, demonstrate the effectiveness of our approach, showing that it outperforms existing state-of-the-art methods. (Illustrative sketches of the VTS and TER ideas follow this metadata table.) | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-08-14T16:25:34Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-08-14T16:25:34Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Verification Letter from the Oral Examination Committee i
Acknowledgements ii
摘要 iii
Abstract iv
Contents vi
List of Figures ix
List of Tables xi
Chapter 1 Introduction 1
1.1 Background 1
1.2 Motivation 2
1.3 Proposed Method 4
1.4 Outline of the Thesis 5
Chapter 2 Related Works 6
2.1 Video Anomaly Detection and Coarse-Grained Video Anomaly Detection 6
2.2 Fine-Grained Video Anomaly Detection 8
2.3 Vision-Language Pre-training and Prompt Tuning 9
Chapter 3 Methodology 11
3.1 Overall Architecture 11
3.2 Visual Temporal Smoothing 13
3.2.1 Motion Smooth Module 14
3.2.2 Temporal Adapter Module 15
3.3 Text-Enhanced Representation 17
3.4 Objective Function 19
Chapter 4 Experiments 21
4.1 Datasets 21
4.1.1 XD-Violence 22
4.1.2 UCF-Crime 23
4.2 Implementation Details 24
4.3 Evaluation Metrics 25
4.4 Comparison with SOTA WS-VAD Methods 26
4.4.1 FG-VAD Results 26
4.4.2 CG-VAD Results 27
4.5 Qualitative Analyses 29
4.5.1 Qualitative FG-VAD Results 29
4.5.2 Visualization of Learned Representations 31
4.6 Ablation Studies 32
4.6.1 Effectiveness of Proposed Components 32
4.6.2 Effectiveness of Proposed Strategies 34
4.6.3 Effectiveness of Hyperparameters 36
4.6.4 Effectiveness of Prompt Templates 38
Chapter 5 Conclusion 41
5.1 Main Contributions 41
References 44
Appendix A — Supplementary Qualitative Analyses 53
A.1 Visualization Examples and Explanations 53 | - |
| dc.language.iso | en | - |
| dc.subject | 大型語言模型 | zh_TW |
| dc.subject | 細粒度影片異常偵測 | zh_TW |
| dc.subject | 多模態學習 | zh_TW |
| dc.subject | 動作輔助表徵學習 | zh_TW |
| dc.subject | Large language model | en |
| dc.subject | Multi-modal learning | en |
| dc.subject | Fine-grained video anomaly detection | en |
| dc.subject | Motion-assisted representation learning | en |
| dc.title | 基於運動輔助表徵學習之弱監督細粒度影片異常檢測 | zh_TW |
| dc.title | Weakly-Supervised Fine-Grained Video Anomaly Detection via Motion-Assisted Representation Learning | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-2 | - |
| dc.description.degree | Master's | - |
| dc.contributor.oralexamcommittee | 莊永裕;花凱龍;陳駿丞 | zh_TW |
| dc.contributor.oralexamcommittee | Yung-Yu Chuang;Kai-Lung Hua;Jun-Cheng Chen | en |
| dc.subject.keyword | 細粒度影片異常偵測,多模態學習,動作輔助表徵學習,大型語言模型 | zh_TW |
| dc.subject.keyword | Fine-grained video anomaly detection, Multi-modal learning, Motion-assisted representation learning, Large language model | en |
| dc.relation.page | 57 | - |
| dc.identifier.doi | 10.6342/NTU202503387 | - |
| dc.rights.note | Authorized for release (restricted to on-campus access) | - |
| dc.date.accepted | 2025-08-06 | - |
| dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
| dc.contributor.author-dept | Department of Computer Science and Information Engineering | - |
| dc.date.embargo-lift | 2025-08-15 | - |
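The abstract's VTS module addresses a concrete failure mode: a single-frame change in illumination or a burst of fast motion can spike a per-frame anomaly score even though nothing anomalous is happening. The sketch below is a minimal illustration of that temporal-consistency idea using a plain moving average; it is not the thesis's actual module (which, per the table of contents, comprises a Motion Smooth module and a Temporal Adapter), and the function name `temporal_smooth`, the window size, and the toy scores are assumptions made for the example.

```python
import numpy as np

def temporal_smooth(frame_scores: np.ndarray, window: int = 3) -> np.ndarray:
    """Damp single-frame spikes in per-frame anomaly scores with a moving average."""
    kernel = np.ones(window) / window
    # mode="same" keeps the output aligned frame-for-frame with the input.
    return np.convolve(frame_scores, kernel, mode="same")

# Toy scores: an isolated spike (index 2) vs. a sustained anomaly (indices 5-8).
scores = np.array([0.1, 0.1, 0.9, 0.1, 0.1, 0.8, 0.85, 0.9, 0.8, 0.1])
smoothed = temporal_smooth(scores)
# After smoothing, the lone spike drops to about 0.37 while the interior of the
# sustained segment stays near 0.85, so thresholding no longer fires a false alarm.
print(np.round(smoothed, 2))
```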
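The TER module's premise is that a bare category label such as "fighting" carries too little semantics for reliable frame-level matching, so an LLM expands it into richer descriptions before text encoding. Below is a minimal sketch of that enrichment step, assuming the Hugging Face `transformers` CLIP checkpoint `openai/clip-vit-base-patch32`; the hand-written `enriched` list stands in for actual LLM output, and averaging the embeddings into a single class prototype is an illustrative choice rather than the thesis's method.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

label = "fighting"
# Hand-written stand-ins for LLM-generated expansions of the bare label.
enriched = [
    f"a video frame showing {label}",
    "people punching and shoving each other in a brawl",
    "a violent physical altercation between several persons",
]
inputs = tokenizer(enriched, padding=True, return_tensors="pt")
with torch.no_grad():
    text_emb = model.get_text_features(**inputs)  # one embedding per description
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
# Average the enriched descriptions into one class prototype that a frame-level
# classifier could match against per-frame visual features.
class_prototype = text_emb.mean(dim=0)
```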
Appears in Collections: Department of Computer Science and Information Engineering

Files in This Item:

| File | Size | Format |
|---|---|---|
| ntu-113-2.pdf (access restricted to NTU campus IP addresses; off-campus users should use the VPN service) | 27.79 MB | Adobe PDF |

All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
