Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88589
Full metadata record
DC Field: Value (Language)
dc.contributor.advisor: 吳沛遠 (zh_TW)
dc.contributor.advisor: Pei-Yuan Wu (en)
dc.contributor.author: 張智堯 (zh_TW)
dc.contributor.author: Chih-Yao Chang (en)
dc.date.accessioned: 2023-08-15T16:57:46Z
dc.date.available: 2023-11-09
dc.date.copyright: 2023-08-15
dc.date.issued: 2023
dc.date.submitted: 2023-08-01
dc.identifier.citation[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In Proceedings of the IEEE international conference on computer vision, pages 609–617, 2017.
[2] J. Arevalo, T. Solorio, M. Montes-y Gómez, and F. A. González. Gated multimodal units for information fusion. arXiv preprint arXiv:1702.01992, 2017.
[3] C. Arteta, V. Lempitsky, and A. Zisserman. Counting in the wild. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VII 14, pages 483–498. Springer, 2016.
[4] O. Azy and N. Ahuja. Segmentation of periodically moving objects. In 2008 19th International Conference on Pattern Recognition, pages 1–4. IEEE, 2008.
[5] C. BenAbdelkader, R. Cutler, H. Nanda, and L. Davis. Eigengait: Motion-based recognition of people using image self-similarity. In Audio-and Video-Based Biometric Person Authentication: Third International Conference, AVBPA 2001 Halmstad, Sweden, June 6–8, 2001 Proceedings 3, pages 284–294. Springer, 2001.
[6] C. BenAbdelkader, R. G. Cutler, and L. S. Davis. Gait recognition using image self-similarity. EURASIP Journal on Advances in Signal Processing, 2004:1–14, 2004.
[7] L. Boominathan, S. S. Kruthiventi, and R. V. Babu. Crowdnet: A deep convolutional network for dense crowd counting. In Proceedings of the 24th ACM international conference on Multimedia, pages 640–644, 2016.
[8] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
[9] K. Chen, X. Du, B. Zhu, Z. Ma, T. Berg-Kirkpatrick, and S. Dubnov. Zero-shot audio source separation through query-based learning from weakly-labeled data. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 4441–4449, 2022.
[10] D. Chetverikov and S. Fazekas. On motion periodicity of dynamic textures. In BMVC, volume 1, pages 167–176. Citeseer, 2006.
[11] R. Cutler and L. S. Davis. Robust real-time periodic motion detection, analysis, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):781–796, 2000.
[12] D. Dwibedi, Y. Aytar, J. Tompson, P. Sermanet, and A. Zisserman. Counting out time: Class agnostic video repetition counting in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1038710396, 2020.
[13] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. arXiv preprint arXiv:1804.03619, 2018.
[14] C. Feichtenhofer, H. Fan, J. Malik, and K. He. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6202–6211, 2019.
[15] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multi-modal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847, 2016.
[16] D. Gandhi, L. Pinto, and A. Gupta. Learning to fly by crashing. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 39483955, 2017.
[17] R. Gao and K. Grauman. Co-separating sounds of visual objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3879–3888, 2019.
[18] R. Gao and K. Grauman. Visualvoice: Audio-visual speech separation with cross-modal consistency. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15490–15500. IEEE, 2021.
[19] D. Griffin and J. Lim. Signal estimation from modified short-time fourier transform. IEEE Transactions on acoustics, speech, and signal processing, 32(2):236243, 1984.
[20] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[21] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[22] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, et al. Cnn architectures for large-scale audio classification. In 2017 ieee international conference on acoustics, speech and signal processing (icassp), pages 131–135. IEEE, 2017.
[23] D. Hu, L. Mou, Q. Wang, J. Gao, Y. Hua, D. Dou, and X. X. Zhu. Ambient sound helps: Audiovisual crowd counting in extreme conditions. arXiv preprint arXiv:2005.07097, 2020.
[24] H. Hu, S. Dong, Y. Zhao, D. Lian, Z. Li, and S. Gao. Transrac: Encoding multi-scale temporal correlation with transformers for repetitive action counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19013–19022, 2022.
[25] I. N. Junejo, E. Dexter, I. Laptev, and P. Perez. View-independent action recognition from temporal self-similarities. IEEE transactions on pattern analysis and machine intelligence, 33(1):172–185, 2010.
[26] G. Karvounas, I. Oikonomidis, and A. Argyros. Reactnet: Temporal localization of repetitive activities in real-world videos. arXiv preprint arXiv:1910.06096, 2019.
[27] E. Kazakos, A. Nagrani, A. Zisserman, and D. Damen. Epic-fusion: Audio-visual temporal binding for egocentric action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5492–5501, 2019.
[28] M. Körner and J. Denzler. Temporal self-similarity for appearance-based action recognition in multi-view setups. In Computer Analysis of Images and Patterns: 15th International Conference, CAIP 2013, York, UK, August 27-29, 2013, Proceedings, Part I 15, pages 163–171. Springer, 2013.
[29] F. Laurent, M. Valderrama, M. Besserve, M. Guillard, J.-P. Lachaux, J. Martinerie, and G. Florence. Multimodal information improves the rapid detection of mental fatigue. Biomedical Signal Processing and Control, 8(4):400–408, 2013.
[30] V. Lempitsky and A. Zisserman. Learning to count objects in images. Advances in neural information processing systems, 23, 2010.
[31] O. Levy and L. Wolf. Live repetition counting. In Proceedings of the IEEE international conference on computer vision, pages 3020–3028, 2015.
[32] D. Lian, X. Chen, J. Li, W. Luo, and S. Gao. Locating and counting heads in crowds with a depth prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12):9056–9072, 2021.
[33] D. Lian, J. Li, J. Zheng, W. Luo, and S. Gao. Density map regression guided detection network for rgb-d crowd counting and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1821–1830, 2019.
[34] D. Liu, T. Jiang, and Y. Wang. Completeness modeling and context separation for weakly supervised temporal action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1298–1307, 2019.
[35] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1001210022, October 2021.
[36] Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, and H. Hu. Video swin transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3202–3211, 2022.
[37] X. Long, C. Gan, G. De Melo, J. Wu, X. Liu, and S. Wen. Attention clusters: Purely attention based local feature integration for video classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7834–7843, 2018.
[38] E. Lu, W. Xie, and A. Zisserman. Class-agnostic counting. In Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part III 14, pages 669–684. Springer, 2019.
[39] C. Panagiotakis, G. Karvounas, and A. Argyros. Unsupervised detection of periodic segments in videos. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 923–927. IEEE, 2018.
[40] E. Pogalin, A. W. Smeulders, and A. H. Thean. Visual quasi-periodicity. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.
[41] R. Rothe, R. Timofte, and L. Van Gool. Deep expectation of real and apparent age from a single image without facial landmarks. International Journal of Computer Vision, 126(2-4):144–157, 2018.
[42] T. F. Runia, C. G. Snoek, and A. W. Smeulders. Real-world repetition estimation by div, grad and curl. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9009–9017, 2018.
[43] T. F. Runia, C. G. Snoek, and A. W. Smeulders. Repetition estimation. International Journal of Computer Vision, 127(9):1361–1383, 2019.
[44] N. Seenouvong, U. Watchareeruetai, C. Nuthong, K. Khongsomboon, and N. Ohnishi. A computer vision based vehicle detection and counting system. In 2016 8th International conference on knowledge and smart technology (KST), pages 224–227. IEEE, 2016.
[45] E. Shechtman and M. Irani. Matching local self-similarities across images and videos. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007.
[46] C. Sun, I. N. Junejo, M. Tappen, and H. Foroosh. Exploring sparseness and self-similarity for action recognition. IEEE Transactions on Image Processing, 24(8):2488–2501, 2015.
[47] A. Thangali and S. Sclaroff. Periodic motion detection and estimation via space-time sampling. In 2005 Seventh IEEE Workshops on Applications of Computer Vision (WACV/MOTION’05)-Volume 1, volume 2, pages 176–182. IEEE, 2005.
[48] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018.
[49] P.-S. Tsai, M. Shah, K. Keiter, and T. Kasparis. Cyclic motion detection for motion based recognition. Pattern recognition, 27(12):1591–1603, 1994.
[50] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao. Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 74647475, 2023.
[51] W. Wang, J. Dai, Z. Chen, Z. Huang, Z. Li, X. Zhu, X. Hu, T. Lu, L. Lu, H. Li, et al. Internimage: Exploring large-scale vision foundation models with deformable convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14408–14419, 2023.
[52] F. Xiao, Y. J. Lee, K. Grauman, J. Malik, and C. Feichtenhofer. Audiovisual slowfast networks for video recognition. arXiv preprint arXiv:2001.08740, 2020.
[53] F. Xiao, L. Sigal, and Y. Jae Lee. Weakly-supervised visual grounding of phrases with linguistic structures. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5945–5954, 2017.
[54] W. Xie, J. A. Noble, and A. Zisserman. Microscopy cell counting and detection with fully convolutional regression networks. Computer methods in biomechanics and biomedical engineering: Imaging & Visualization, 6(3):283–292, 2018.
[55] S. S. A. Zaidi, M. S. Ansari, A. Aslam, N. Kanwal, M. Asghar, and B. Lee. A survey of modern deep learning based object detection models. Digital Signal Processing, 126:103514, 2022.
[56] H. Zhang, X. Xu, G. Han, and S. He. Context-aware and scale-insensitive temporal repetition counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 670–678, 2020.
[57] Y. Zhang, L. Shao, and C. G. Snoek. Repetitive activity counting by sight and sound. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14070–14079, 2021.
[58] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In 2017 IEEE international conference on robotics and automation (ICRA), pages 3357–3364. IEEE, 2017.
-
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88589
dc.description.abstract: 我們提出 VASsNet 作為視頻中重複動作計數的一種新穎方法,它結合了視覺、音頻及其相似度矩陣。在之前的工作中,視覺、視覺相似度和音頻特徵已分別用於重複運動計數。然而,由於缺乏對這三個方面信息的有效整合,在模糊和/或包含快速運動的視頻中很難獲得良好的計數結果。VASsNet 由四個路徑構成,即視覺、視覺相似性、音頻和音頻相似性路徑。採用多層跨模態信息融合方法,通過橫向連接有效地整合從這些路徑中提取的信息。通過實驗,我們演示了如何利用相似矩陣路徑來解決視頻中短期運動引起的、先前無法檢測到的重複動作計數問題,以及音頻路徑如何幫助提高模糊視頻的計數準確性。實驗結果表明,VASsNet 在 Countix 和 Countix-AV 數據集上實現了最先進的性能。 (zh_TW)
dc.description.abstract: We propose VASsNet, a novel approach to repetitive action counting in video that incorporates Vision, Audio, and their Similarity matrices. In previous work, vision, vision-similarity, and audio features have each been used separately for repetitive motion counting. However, without effective integration of these three sources of information, it is difficult to count accurately in videos that are blurry and/or contain rapid movements. VASsNet is structured as four pathways: vision, vision similarity, audio, and audio similarity. The information extracted from these pathways is integrated through lateral connections using a multi-layer cross-modal fusion approach. Through experiments, we demonstrate how the similarity-matrix pathways detect repetitive actions caused by short-term motion that previous methods miss, and how the audio pathway improves counting accuracy on blurry videos. Experimental results show that VASsNet achieves state-of-the-art performance on the Countix and Countix-AV datasets. (en)
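The record itself contains no code and the thesis PDF is restricted, so the following is only a minimal sketch of the two ingredients the abstract names: a temporal self-similarity matrix (TSM) computed from per-frame embeddings, and a lateral connection that fuses one pathway's features into another. The module names, tensor shapes, and the choice of softmax-normalized negative L2 distance as the similarity measure are assumptions for illustration, not the thesis's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def temporal_self_similarity(feats: torch.Tensor) -> torch.Tensor:
    """Temporal self-similarity matrix (TSM) from frame embeddings.

    feats: (B, T, D) embeddings of T video frames (or audio windows).
    Returns a (B, 1, T, T) similarity map: softmax over negative pairwise
    L2 distances, one common TSM construction (assumed here, not taken
    from the thesis).
    """
    dists = torch.cdist(feats, feats)      # (B, T, T) pairwise L2 distances
    sim = F.softmax(-dists, dim=-1)        # row-normalized similarity
    return sim.unsqueeze(1)                # add a channel axis for 2D convs

class LateralConnection(nn.Module):
    """Hypothetical lateral connection between two pathways: project the
    source pathway's features and add them into the destination stream."""

    def __init__(self, src_dim: int, dst_dim: int):
        super().__init__()
        self.proj = nn.Linear(src_dim, dst_dim)

    def forward(self, dst: torch.Tensor, src: torch.Tensor) -> torch.Tensor:
        # dst: (B, T, D_dst) and src: (B, T, D_src), aligned along time
        return dst + self.proj(src)
```

With four streams (vision, audio, and a TSM branch per modality), connections of this kind can be applied at several depths before a shared counting head, which is one plausible reading of the abstract's multi-layer cross-modal fusion through lateral connections; the actual wiring is not specified in this record.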
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-08-15T16:57:46Z. No. of bitstreams: 0 (en)
dc.description.provenance: Made available in DSpace on 2023-08-15T16:57:46Z (GMT). No. of bitstreams: 0 (en)
dc.description.tableofcontents:
摘要 i
Abstract iii
Contents v
List of Figures vii
List of Tables ix
Chapter 1 Introduction 1
Chapter 2 Related Works 5
2.1 Counting in video 5
2.2 Video feature extraction 7
2.3 Temporal Self-similarity Matrix (TSM) 8
2.4 Multiple stream model 8
Chapter 3 Main Model 11
3.1 Vision Pathway 12
3.2 Audio Pathway 13
3.3 Vision Similarity Pathway 14
3.4 Audio Similarity Pathway 15
3.5 Lateral connections (Fusion) 15
3.6 Repetition counting predictor 16
3.7 Loss function 16
Chapter 4 Experiments 19
4.1 Dataset 19
4.2 Implementation details 20
4.3 Evaluation Metrics 21
4.4 Ablation study 21
Chapter 5 Results 25
5.1 Comparison with Benchmarks 25
5.2 Hard-case analysis 26
5.3 Instance 27
Chapter 6 Conclusion 29
References 31
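The table of contents above lists a repetition counting predictor and loss function (sections 3.6 and 3.7) without further detail in this record. One common design in repetition-counting work, offered purely as an assumed illustration of what such a predictor might look like (not the thesis's confirmed method), predicts a period-length distribution per frame and converts it to a count by summing reciprocal expected periods:

```python
import torch
import torch.nn as nn

class CountingHead(nn.Module):
    """Hypothetical counting predictor via per-frame period regression.

    Each fused per-frame feature is mapped to a distribution over period
    lengths; the clip-level count is the sum of 1/period over frames.
    All names and dimensions here are assumptions for illustration.
    """

    def __init__(self, dim: int, max_period: int = 32):
        super().__init__()
        self.period_logits = nn.Linear(dim, max_period)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, D) fused per-frame features from the four pathways
        probs = self.period_logits(feats).softmax(dim=-1)     # (B, T, P)
        periods = torch.arange(1, probs.size(-1) + 1,
                               device=feats.device, dtype=feats.dtype)
        expected_period = (probs * periods).sum(dim=-1)       # (B, T)
        return (1.0 / expected_period).sum(dim=-1)            # (B,) count
```

A simple regression loss (e.g., L1 between predicted and ground-truth counts) would then fill the "Loss function" slot, though the loss the thesis actually uses is not stated in this record.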
dc.language.iso: en
dc.subject: 相似矩陣 (zh_TW)
dc.subject: 聲音 (zh_TW)
dc.subject: 重複性動作計數 (zh_TW)
dc.subject: 視覺 (zh_TW)
dc.subject: Vision (en)
dc.subject: Repetition counting (en)
dc.subject: Audio (en)
dc.subject: Similarity Matrix (en)
dc.title: 視覺音訊和相似網路用於影片重複動作計數 (zh_TW)
dc.title: Vision Audio and Similarity Networks for Video Repetition Counting (en)
dc.type: Thesis
dc.date.schoolyear: 111-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 王鈺強; 杜維洲 (zh_TW)
dc.contributor.oralexamcommittee: Yu-Chiang Wang; Wei-Zhou Du (en)
dc.subject.keyword: 重複性動作計數, 視覺, 聲音, 相似矩陣 (zh_TW)
dc.subject.keyword: Repetition counting, Vision, Audio, Similarity Matrix (en)
dc.relation.page: 39
dc.identifier.doi: 10.6342/NTU202302401
dc.rights.note: 未授權 (not authorized for public access)
dc.date.accepted: 2023-08-04
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 電信工程學研究所 (Graduate Institute of Communication Engineering)
Appears in collections: 電信工程學研究所 (Graduate Institute of Communication Engineering)

Files in this item:
File | Size | Format
ntu-111-2.pdf (restricted; not authorized for public access) | 2.53 MB | Adobe PDF
Except where their copyright terms are otherwise indicated, items in this repository are protected by copyright, with all rights reserved.
