Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88589
Full metadata record
DC Field: Value (Language)
dc.contributor.advisor: 吳沛遠 (zh_TW)
dc.contributor.advisor: Pei-Yuan Wu (en)
dc.contributor.author: 張智堯 (zh_TW)
dc.contributor.author: Chih-Yao Chang (en)
dc.date.accessioned: 2023-08-15T16:57:46Z
dc.date.available: 2023-11-09
dc.date.copyright: 2023-08-15
dc.date.issued: 2023
dc.date.submitted: 2023-08-01
dc.identifier.citation[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In Proceedings of the IEEE international conference on computer vision, pages 609–617, 2017.
[2] J. Arevalo, T. Solorio, M. Montes-y Gómez, and F. A. González. Gated multimodal units for information fusion. arXiv preprint arXiv:1702.01992, 2017.
[3] C. Arteta, V. Lempitsky, and A. Zisserman. Counting in the wild. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VII 14, pages 483–498. Springer, 2016.
[4] O. Azy and N. Ahuja. Segmentation of periodically moving objects. In 2008 19th International Conference on Pattern Recognition, pages 1–4. IEEE, 2008.
[5] C. BenAbdelkader, R. Cutler, H. Nanda, and L. Davis. Eigengait: Motion-based recognition of people using image self-similarity. In Audio-and Video-Based Biometric Person Authentication: Third International Conference, AVBPA 2001 Halmstad, Sweden, June 6–8, 2001 Proceedings 3, pages 284–294. Springer, 2001.
[6] C. BenAbdelkader, R. G. Cutler, and L. S. Davis. Gait recognition using image self-similarity. EURASIP Journal on Advances in Signal Processing, 2004:1–14, 2004.
[7] L. Boominathan, S. S. Kruthiventi, and R. V. Babu. Crowdnet: A deep convolutional network for dense crowd counting. In Proceedings of the 24th ACM international conference on Multimedia, pages 640–644, 2016.
[8] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
[9] K. Chen, X. Du, B. Zhu, Z. Ma, T. Berg-Kirkpatrick, and S. Dubnov. Zero-shot audio source separation through query-based learning from weakly-labeled data. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 4441–4449, 2022.
[10] D. Chetverikov and S. Fazekas. On motion periodicity of dynamic textures. In BMVC, volume 1, pages 167–176. Citeseer, 2006.
[11] R. Cutler and L. S. Davis. Robust real-time periodic motion detection, analysis, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):781–796, 2000.
[12] D. Dwibedi, Y. Aytar, J. Tompson, P. Sermanet, and A. Zisserman. Counting out time: Class agnostic video repetition counting in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1038710396, 2020.
[13] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. arXiv preprint arXiv:1804.03619, 2018.
[14] C. Feichtenhofer, H. Fan, J. Malik, and K. He. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6202–6211, 2019.
[15] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multi-modal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847, 2016.
[16] D. Gandhi, L. Pinto, and A. Gupta. Learning to fly by crashing. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 39483955, 2017.
[17] R. Gao and K. Grauman. Co-separating sounds of visual objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3879–3888, 2019.
[18] R. Gao and K. Grauman. Visualvoice: Audio-visual speech separation with cross-modal consistency. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15490–15500. IEEE, 2021.
[19] D. Griffin and J. Lim. Signal estimation from modified short-time fourier transform. IEEE Transactions on acoustics, speech, and signal processing, 32(2):236243, 1984.
[20] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[21] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[22] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, et al. Cnn architectures for large-scale audio classification. In 2017 ieee international conference on acoustics, speech and signal processing (icassp), pages 131–135. IEEE, 2017.
[23] D. Hu, L. Mou, Q. Wang, J. Gao, Y. Hua, D. Dou, and X. X. Zhu. Ambient sound helps: Audiovisual crowd counting in extreme conditions. arXiv preprint arXiv:2005.07097, 2020.
[24] H. Hu, S. Dong, Y. Zhao, D. Lian, Z. Li, and S. Gao. Transrac: Encoding multi-scale temporal correlation with transformers for repetitive action counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19013–19022, 2022.
[25] I. N. Junejo, E. Dexter, I. Laptev, and P. Perez. View-independent action recognition from temporal self-similarities. IEEE transactions on pattern analysis and machine intelligence, 33(1):172–185, 2010.
[26] G. Karvounas, I. Oikonomidis, and A. Argyros. Reactnet: Temporal localization of repetitive activities in real-world videos. arXiv preprint arXiv:1910.06096, 2019.
[27] E. Kazakos, A. Nagrani, A. Zisserman, and D. Damen. Epic-fusion: Audio-visual temporal binding for egocentric action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5492–5501, 2019.
[28] M. Körner and J. Denzler. Temporal self-similarity for appearance-based action recognition in multi-view setups. In Computer Analysis of Images and Patterns: 15th International Conference, CAIP 2013, York, UK, August 27-29, 2013, Proceedings, Part I 15, pages 163–171. Springer, 2013.
[29] F. Laurent, M. Valderrama, M. Besserve, M. Guillard, J.-P. Lachaux, J. Martinerie, and G. Florence. Multimodal information improves the rapid detection of mental fatigue. Biomedical Signal Processing and Control, 8(4):400–408, 2013.
[30] V. Lempitsky and A. Zisserman. Learning to count objects in images. Advances in neural information processing systems, 23, 2010.
[31] O. Levy and L. Wolf. Live repetition counting. In Proceedings of the IEEE international conference on computer vision, pages 3020–3028, 2015.
[32] D. Lian, X. Chen, J. Li, W. Luo, and S. Gao. Locating and counting heads in crowds with a depth prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12):9056–9072, 2021.
[33] D. Lian, J. Li, J. Zheng, W. Luo, and S. Gao. Density map regression guided detection network for rgb-d crowd counting and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1821–1830, 2019.
[34] D. Liu, T. Jiang, and Y. Wang. Completeness modeling and context separation for weakly supervised temporal action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1298–1307, 2019.
[35] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1001210022, October 2021.
[36] Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, and H. Hu. Video swin transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3202–3211, 2022.
[37] X. Long, C. Gan, G. De Melo, J. Wu, X. Liu, and S. Wen. Attention clusters: Purely attention based local feature integration for video classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7834–7843, 2018.
[38] E. Lu, W. Xie, and A. Zisserman. Class-agnostic counting. In Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part III 14, pages 669–684. Springer, 2019.
[39] C. Panagiotakis, G. Karvounas, and A. Argyros. Unsupervised detection of periodic segments in videos. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 923–927. IEEE, 2018.
[40] E. Pogalin, A. W. Smeulders, and A. H. Thean. Visual quasi-periodicity. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.
[41] R. Rothe, R. Timofte, and L. Van Gool. Deep expectation of real and apparent age from a single image without facial landmarks. International Journal of Computer Vision, 126(2-4):144–157, 2018.
[42] T. F. Runia, C. G. Snoek, and A. W. Smeulders. Real-world repetition estimation by div, grad and curl. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9009–9017, 2018.
[43] T. F. Runia, C. G. Snoek, and A. W. Smeulders. Repetition estimation. International Journal of Computer Vision, 127(9):1361–1383, 2019.
[44] N. Seenouvong, U. Watchareeruetai, C. Nuthong, K. Khongsomboon, and N. Ohnishi. A computer vision based vehicle detection and counting system. In 2016 8th International conference on knowledge and smart technology (KST), pages 224–227. IEEE, 2016.
[45] E. Shechtman and M. Irani. Matching local self-similarities across images and videos. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007.
[46] C. Sun, I. N. Junejo, M. Tappen, and H. Foroosh. Exploring sparseness and self-similarity for action recognition. IEEE Transactions on Image Processing, 24(8):2488–2501, 2015.
[47] A. Thangali and S. Sclaroff. Periodic motion detection and estimation via space-time sampling. In 2005 Seventh IEEE Workshops on Applications of Computer Vision (WACV/MOTION’05)-Volume 1, volume 2, pages 176–182. IEEE, 2005.
[48] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018.
[49] P.-S. Tsai, M. Shah, K. Keiter, and T. Kasparis. Cyclic motion detection for motion based recognition. Pattern recognition, 27(12):1591–1603, 1994.
[50] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao. Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 74647475, 2023.
[51] W. Wang, J. Dai, Z. Chen, Z. Huang, Z. Li, X. Zhu, X. Hu, T. Lu, L. Lu, H. Li, et al. Internimage: Exploring large-scale vision foundation models with deformable convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14408–14419, 2023.
[52] F. Xiao, Y. J. Lee, K. Grauman, J. Malik, and C. Feichtenhofer. Audiovisual slowfast networks for video recognition. arXiv preprint arXiv:2001.08740, 2020.
[53] F. Xiao, L. Sigal, and Y. Jae Lee. Weakly-supervised visual grounding of phrases with linguistic structures. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5945–5954, 2017.
[54] W. Xie, J. A. Noble, and A. Zisserman. Microscopy cell counting and detection with fully convolutional regression networks. Computer methods in biomechanics and biomedical engineering: Imaging & Visualization, 6(3):283–292, 2018.
[55] S. S. A. Zaidi, M. S. Ansari, A. Aslam, N. Kanwal, M. Asghar, and B. Lee. A survey of modern deep learning based object detection models. Digital Signal Processing, 126:103514, 2022.
[56] H. Zhang, X. Xu, G. Han, and S. He. Context-aware and scale-insensitive temporal repetition counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 670–678, 2020.
[57] Y. Zhang, L. Shao, and C. G. Snoek. Repetitive activity counting by sight and sound. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14070–14079, 2021.
[58] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In 2017 IEEE international conference on robotics and automation (ICRA), pages 3357–3364. IEEE, 2017.
-
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88589
dc.description.abstract: 我們提出 VASsNet 作為視頻中重複動作計數的一種新穎方法,它結合了視覺、音頻及其相似度矩陣。在之前的工作中,視覺、視覺相似度和音頻特徵已分別用於重複運動計數。然而,由於缺乏對這三個方面信息的有效整合,在模糊和/或包含快速運動的視頻中很難獲得良好的計數結果。VASsNet 由四個路徑構成,即視覺、視覺相似性、音頻和音頻相似性路徑。採用多層跨模態信息融合方法,通過橫向連接有效地整合從這些路徑中提取的信息。通過實驗,我們演示了如何利用相似矩陣路徑來解決視頻中短期運動引起的、先前無法檢測到的重複動作計數問題,以及音頻路徑如何幫助提高模糊視頻的計數準確性。實驗結果表明,VASsNet 在 Countix 和 Countix-AV 數據集上實現了最先進的性能。 (zh_TW)
dc.description.abstract: We propose VASsNet, a novel approach to repetitive action counting in video that incorporates Vision, Audio, and their Similarity matrices. In previous work, vision, vision-similarity, and audio features have each been used separately for repetitive motion counting. However, without effective integration of these three sources of information, it is difficult to count accurately in videos that are blurry and/or contain rapid movements. VASsNet is structured as four pathways: vision, vision similarity, audio, and audio similarity. The information extracted from these pathways is integrated through lateral connections using a multi-layer cross-modal fusion approach. Through experiments, we demonstrate how the similarity-matrix pathways detect repetitive actions caused by short-term motion that previous methods miss, and how the audio pathway improves counting accuracy on blurry videos. Experimental results show that VASsNet achieves state-of-the-art performance on the Countix and Countix-AV datasets. (en)
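The record itself contains no code and the thesis PDF is restricted, so the following is only a minimal sketch of the two ingredients the abstract names: a temporal self-similarity matrix (TSM) computed from per-frame embeddings, and a lateral connection that fuses one pathway's features into another. The module names, tensor shapes, and the choice of softmax-normalized negative L2 distance as the similarity measure are assumptions for illustration, not the thesis's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def temporal_self_similarity(feats: torch.Tensor) -> torch.Tensor:
    """Temporal self-similarity matrix (TSM) from frame embeddings.

    feats: (B, T, D) embeddings of T video frames (or audio windows).
    Returns a (B, 1, T, T) similarity map: softmax over negative pairwise
    L2 distances, one common TSM construction (assumed here, not taken
    from the thesis).
    """
    dists = torch.cdist(feats, feats)      # (B, T, T) pairwise L2 distances
    sim = F.softmax(-dists, dim=-1)        # row-normalized similarity
    return sim.unsqueeze(1)                # add a channel axis for 2D convs

class LateralConnection(nn.Module):
    """Hypothetical lateral connection between two pathways: project the
    source pathway's features and add them into the destination stream."""

    def __init__(self, src_dim: int, dst_dim: int):
        super().__init__()
        self.proj = nn.Linear(src_dim, dst_dim)

    def forward(self, dst: torch.Tensor, src: torch.Tensor) -> torch.Tensor:
        # dst: (B, T, D_dst) and src: (B, T, D_src), aligned along time
        return dst + self.proj(src)
```

With four streams (vision, audio, and a TSM branch per modality), connections of this kind can be applied at several depths before a shared counting head, which is one plausible reading of the abstract's multi-layer cross-modal fusion through lateral connections; the actual wiring is not specified in this record.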
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-08-15T16:57:46Z. No. of bitstreams: 0 (en)
dc.description.provenance: Made available in DSpace on 2023-08-15T16:57:46Z (GMT). No. of bitstreams: 0 (en)
dc.description.tableofcontents:
摘要 i
Abstract iii
Contents v
List of Figures vii
List of Tables ix
Chapter 1 Introduction 1
Chapter 2 Related Works 5
2.1 Counting in video 5
2.2 Video feature extraction 7
2.3 Temporal Self-similarity Matrix (TSM) 8
2.4 Multiple stream model 8
Chapter 3 Main Model 11
3.1 Vision Pathway 12
3.2 Audio Pathway 13
3.3 Vision Similarity Pathway 14
3.4 Audio Similarity Pathway 15
3.5 Lateral connections (Fusion) 15
3.6 Repetition counting predictor 16
3.7 Loss function 16
Chapter 4 Experiments 19
4.1 Dataset 19
4.2 Implementation details 20
4.3 Evaluation Metrics 21
4.4 Ablation study 21
Chapter 5 Results 25
5.1 Comparison with Benchmarks 25
5.2 Hard-case analysis 26
5.3 Instance 27
Chapter 6 Conclusion 29
References 31
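The table of contents above lists a repetition counting predictor and loss function (sections 3.6 and 3.7) without further detail in this record. One common design in repetition-counting work, offered purely as an assumed illustration of what such a predictor might look like (not the thesis's confirmed method), predicts a period-length distribution per frame and converts it to a count by summing reciprocal expected periods:

```python
import torch
import torch.nn as nn

class CountingHead(nn.Module):
    """Hypothetical counting predictor via per-frame period regression.

    Each fused per-frame feature is mapped to a distribution over period
    lengths; the clip-level count is the sum of 1/period over frames.
    All names and dimensions here are assumptions for illustration.
    """

    def __init__(self, dim: int, max_period: int = 32):
        super().__init__()
        self.period_logits = nn.Linear(dim, max_period)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, D) fused per-frame features from the four pathways
        probs = self.period_logits(feats).softmax(dim=-1)     # (B, T, P)
        periods = torch.arange(1, probs.size(-1) + 1,
                               device=feats.device, dtype=feats.dtype)
        expected_period = (probs * periods).sum(dim=-1)       # (B, T)
        return (1.0 / expected_period).sum(dim=-1)            # (B,) count
```

A simple regression loss (e.g., L1 between predicted and ground-truth counts) would then fill the "Loss function" slot, though the loss the thesis actually uses is not stated in this record.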
dc.language.iso: en
dc.subject: 相似矩陣 (zh_TW)
dc.subject: 聲音 (zh_TW)
dc.subject: 重複性動作計數 (zh_TW)
dc.subject: 視覺 (zh_TW)
dc.subject: Vision (en)
dc.subject: Repetition counting (en)
dc.subject: Audio (en)
dc.subject: Similarity Matrix (en)
dc.title: 視覺音訊和相似網路用於影片重複動作計數 (zh_TW)
dc.title: Vision Audio and Similarity Networks for Video Repetition Counting (en)
dc.type: Thesis
dc.date.schoolyear: 111-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 王鈺強; 杜維洲 (zh_TW)
dc.contributor.oralexamcommittee: Yu-Chiang Wang; Wei-Zhou Du (en)
dc.subject.keyword: 重複性動作計數, 視覺, 聲音, 相似矩陣 (zh_TW)
dc.subject.keyword: Repetition counting, Vision, Audio, Similarity Matrix (en)
dc.relation.page: 39
dc.identifier.doi: 10.6342/NTU202302401
dc.rights.note: 未授權 (not authorized for public access)
dc.date.accepted: 2023-08-04
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 電信工程學研究所 (Graduate Institute of Communication Engineering)
Appears in collections: 電信工程學研究所 (Graduate Institute of Communication Engineering)

Files in this item:
File | Size | Format
ntu-111-2.pdf (restricted; not authorized for public access) | 2.53 MB | Adobe PDF
Except where their copyright terms are otherwise indicated, items in this repository are protected by copyright, with all rights reserved.
