Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/77986
Full metadata record
DC Field: Value [Language]
dc.contributor.advisor: 陳銘憲 (Ming-Syan Chen)
dc.contributor.author: Yi-Jun Chen [en]
dc.contributor.author: 陳奕君 [zh_TW]
dc.date.accessioned: 2021-07-11T14:38:59Z
dc.date.available: 2025-08-20
dc.date.copyright: 2020-08-21
dc.date.issued: 2020
dc.date.submitted: 2020-08-17
dc.identifier.citation: References
[1] R. Allaoui, H. Mouane, Z. Asrih, S. Mars, I. El Hajjouji, et al. Fpga-based implementation of optical flow algorithm. In 2017 International Conference on Electrical and Information Technologies (ICEIT), pages 1–5, Rabat, Morocco, 2017. IEEE.
[2] M. Barekatain, M. Martí, H. Shih, S. Murray, K. Nakayama, Y. Matsuo, and H. Prendinger. Okutama-action: An aerial view video dataset for concurrent human action detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2017, Honolulu, HI, USA, July 21-26, 2017, pages 2153–2160, Honolulu, HI, USA, 2017. IEEE Computer Society.
[3] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould. Dynamic image networks for action recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 3034–3042, Las Vegas, NV, USA, 2016. IEEE Computer Society.
[4] A. F. Bobick and J. W. Davis. The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3): 257–267, 2001.
[5] G. Bradski. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 25:120–125, 2000.
[6] Z. Cai, L. Wang, X. Peng, and Y. Qiao. Multi-view super vector for action recognition. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 596–603, Columbus, OH, USA, 2014. IEEE Computer Society.
[7] C. R. de Souza, A. Gaidon, E. Vig, and A. M. L. Peña. Sympathy for the details: Dense trajectories and hybrid classification architectures for action recognition. In B. Leibe, J. Matas, N. Sebe, and M. Welling, editors, Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VII, volume 9911 of Lecture Notes in Computer Science, pages 697–716, Amsterdam, The Netherlands, 2016. Springer.
[8] Y. Du, W. Wang, and L. Wang. Hierarchical recurrent neural network for skeleton based action recognition. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 1110–1118, Boston, MA, USA, 2015. IEEE Computer Society.
[9] C. Feichtenhofer, A. Pinz, and R. P. Wildes. Spatiotemporal residual networks for video action recognition. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, page 3476–3484, Red Hook, NY, USA, 2016. Curran Associates Inc.
[10] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 1933–1941, Las Vegas, NV, USA, 2016. IEEE Computer Society.
[11] L. Fridman. Human-centered autonomous vehicle systems: Principles of effective shared autonomy, 2018.
[12] K. Greff, R. K. Srivastava, J. Koutnik, B. R. Steunebrink, and J. Schmidhuber. Lstm: A search space odyssey. IEEE transactions on neural networks and learning systems, 28(10):2222–2232, 2017.
[13] D. Gu. 3D Densely Connected Convolutional Network for the Recognition of Human Shopping Actions. PhD thesis, Université d’Ottawa/University of Ottawa, 2017.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778, Las Vegas, NV, USA, 2016. IEEE Computer Society.
[15] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications, 2017.
[16] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 2261–2269, Honolulu, HI, USA, 2017. IEEE Computer Society.
[17] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5mb model size, 2016.
[18] S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1): 221–231, 2013.
[19] V. Kantorov and I. Laptev. Efficient feature extraction, encoding, and classification for action recognition. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 2593–2600, Columbus, OH, USA, 2014. IEEE Computer Society.
[20] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and F. Li. Large-scale video classification with convolutional neural networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 1725–1732, Columbus, OH, USA, 2014. IEEE Computer Society.
[21] A. Kläser, M. Marszalek, and C. Schmid. A spatio-temporal descriptor based on 3d-gradients. In M. Everingham, C. J. Needham, and R. Fraile, editors, Proceedings of the British Machine Vision Conference 2008, Leeds, UK, September 2008, pages 1–10, Leeds, UK, 2008. British Machine Vision Association.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. Commun. ACM, 60(6):84–90, May 2017.
[23] I. Laptev and T. Lindeberg. Space-time interest points. In 9th IEEE International Conference on Computer Vision (ICCV 2003), 14-17 October 2003, Nice, France, pages 432–439, Nice, France, 2003. IEEE Computer Society.
[24] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, Toulon, France, 2017. OpenReview.net.
[25] K. Liu, W. Liu, C. Gan, M. Tan, and H. Ma. T-C3D: temporal convolutional 3d network for real-time action recognition. In S. A. McIlraith and K. Q. Weinberger, editors, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 7138–7145, New Orleans, Louisiana, USA, 2018. AAAI Press.
[26] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proceedings of the 7th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI’81, page 674–679, San Francisco, CA, USA, 1981. Morgan Kaufmann Publishers Inc.
[27] N. Ma, X. Zhang, H. Zheng, and J. Sun. Shufflenet V2: practical guidelines for efficient CNN architecture design. In V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, editors, Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIV, volume 11218 of Lecture Notes in Computer Science, pages 122–138, Munich, Germany, 2018. Springer.
[28] S. Mahmoudi, M. Kierzynka, P. Manneback, and K. Kurowski. Real-time motion tracking using optical flow on multiple gpus. Bulletin of the Polish Academy of Sciences, Technical Sciences, 62:139–150, 03 2014.
[29] J. Y. Ng, M. J. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 4694–4702, Boston, MA, USA, 2015. IEEE Computer Society.
[30] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[31] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 4510–4520, Salt Lake City, UT, USA, 2018. IEEE Computer Society.
[32] P. Scovanner, S. Ali, and M. Shah. A 3-dimensional sift descriptor and its application to action recognition. In Proceedings of the 15th ACM International Conference on Multimedia, MM ’07, page 357–360, New York, NY, USA, 2007. Association for Computing Machinery.
[33] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS’14, page 568–576, Cambridge, MA, USA, 2014. MIT Press.
[34] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Y. Bengio and Y. LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, San Diego, CA, USA, 2015. https://iclr.cc/archive/www/2015.html.
[35] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild, 2012.
[36] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le. Mnasnet: Platform-aware neural architecture search for mobile. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 2820–2828, Long Beach, CA, USA, 2019. Computer Vision Foundation / IEEE.
[37] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 4489–4497, Santiago, Chile, 2015. IEEE Computer Society.
[38] D. Tran, J. Ray, Z. Shou, S.-F. Chang, and M. Paluri. Convnet architecture search for spatiotemporal feature learning, 2017.
[39] G. Varol, I. Laptev, and C. Schmid. Long-term temporal convolutions for action recognition. IEEE transactions on pattern analysis and machine intelligence, 40(6): 1510–1517, 2017.
[40] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision, 103:60–79, 2012.
[41] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. International journal of computer vision, 103(1):60–79, 2013.
[42] H. Wang and C. Schmid. Action recognition with improved trajectories. In IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1-8, 2013, pages 3551–3558, Sydney, Australia, 2013. IEEE Computer Society.
[43] L. Wang, Y. Xiong, Z. Wang, and Y. Qiao. Towards good practices for very deep two-stream convnets, 2015.
[44] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. V. Gool. Temporal segment networks: Towards good practices for deep action recognition. In B. Leibe, J. Matas, N. Sebe, and M. Welling, editors, Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII, volume 9912 of Lecture Notes in Computer Science, pages 20–36, Amsterdam, The Netherlands, 2016. Springer.
[45] A. Wedel, T. Pock, C. Zach, H. Bischof, and D. Cremers. An improved algorithm for tv-l1 optical flow. In D. Cremers, B. Rosenhahn, A. L. Yuille, and F. R. Schmidt, editors, Statistical and Geometrical Approaches to Visual Motion Analysis, pages 23–45, Berlin, Heidelberg, 2009. Springer Berlin Heidelberg.
[46] D. Weinland, R. Ronfard, and E. Boyer. Free viewpoint action recognition using motion history volumes. Comput. Vis. Image Underst., 104(2):249–257, Nov. 2006.
[47] C. Wu, M. Zaheer, H. Hu, R. Manmatha, A. J. Smola, and P. Krähenbühl. Compressed video action recognition. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 6026–6035, Salt Lake City, UT, USA, 2018. IEEE Computer Society.
[48] X. Xu, X. Zhang, B. Yu, X. S. Hu, C. Rowen, J. Hu, and Y. Shi. Dac-sdc low power object detection challenge for uav applications, 2018.
[49] C. Zhang, G. Sun, Z. Fang, P. Zhou, P. Pan, and J. Cong. Caffeine: Toward uniformed representation and acceleration for deep convolutional neural networks. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 38(11):2072–2085, 2019.
[50] X. Zhang, X. Zhou, M. Lin, and J. Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 6848–6856, Salt Lake City, UT, USA, 2018. IEEE Computer Society.
[51] Y. Zhu, Z. Lan, S. Newsam, and A. Hauptmann. Hidden two-stream convolutional networks for action recognition. In C. V. Jawahar, H. Li, G. Mori, and K. Schindler, editors, Computer Vision – ACCV 2018, pages 363–378, Cham, 2019. Springer International Publishing.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/77986
dc.description.abstract: 動作識別是電腦視覺中非常熱門的研究領域,其應用廣泛,普遍影響著人們的生活,像是監視設備、人機互動系統,都仰賴從影片中辨識人體動作的能力。一般單張圖片的影像辨識大多利用空間資訊來獲取訊息;然而,影片不像單張圖像那樣容易處理,影片辨識多了豐富卻也複雜的時間資訊。多數研究會善用雙流架構深度模型來分別處理空間與時間特徵,但如此龐大的模型常常難以實作在嵌入式系統中,反而降低實用性。此外,時間特徵的抽取也常常成為限制速度的瓶頸。因此,如何設計一個在速度上能達到即時性、準確度能達到實用性,卻又足夠輕量、能放進移動設備的模型,便成為一個很重要的問題。
這篇論文提出了一個名為DiffusionNet的輕量級雙流模型架構,其在動作識別任務上可以達到良好的準確率以及即時性。我們假設光流具有粒子的特性,會隨著時間擴散,而我們的實驗證實此立論基礎能幫助我們更有效地萃取時間特徵。另外,我們也設計了自動調節的Focal Loss損失函數與注意力機制,來幫助模型更有效地獲得空間領域特徵。在輕量化模型的部分,我們透過MobileNetV2的Depth-wise和Point-wise Convolution計算結構來降低計算複雜度,並透過實作CUDA版本的Pyramidal Lucas-Kanade演算法來快速地生成光流,解決多數傳統光流法在時間上無法達到即時性的瓶頸問題。
[zh_TW]
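The abstract above attributes the model's reduced complexity to MobileNetV2-style depth-wise and point-wise convolutions. As a rough, hypothetical illustration of that building block (not the thesis code; channel sizes and input shape are made up), the PyTorch sketch below pairs a 3x3 depthwise convolution with a 1x1 pointwise convolution and compares the per-position multiply-accumulate count against a standard 3x3 convolution.

```python
# Illustrative sketch only: a MobileNetV2-style depthwise + pointwise pair.
# Layer sizes and the input tensor are hypothetical.
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # Depthwise: one 3x3 filter per input channel (groups=in_ch)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        # Pointwise: 1x1 convolution mixes channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU6(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

if __name__ == "__main__":
    x = torch.randn(1, 64, 56, 56)            # hypothetical feature map
    y = DepthwiseSeparableConv(64, 128)(x)
    print(y.shape)                            # torch.Size([1, 128, 56, 56])
    # Per-output-position MACs for this layer:
    #   standard 3x3 conv:        3*3*64*128 = 73,728
    #   depthwise + pointwise:    3*3*64 + 64*128 = 8,768  (~8.4x fewer)
```

For a 64-to-128 channel layer this cuts the per-pixel multiply-accumulates by roughly 8x, which is the kind of saving the abstract relies on to fit the spatial stream on an edge device.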
dc.description.abstract: Human action recognition is one of the most active research fields in computer vision. Many applications, such as video surveillance and human-computer interaction, require recognizing human actions in video sequences. Unlike image classification, which relies only on spatial information, action recognition must also cope with abundant but noisy temporal information. Many state-of-the-art methods are based on the two-stream architecture, in which two neural networks process spatial and temporal data separately and their outputs are fused for the final prediction. However, the temporal information is an order of magnitude larger than the spatial information, and previous works built on traditional CNN backbones (e.g., VGG and ResNet) cannot be deployed on edge devices. Moreover, most current models adopt the TV-L1 or Brox algorithm to compute optical flow, whose long latency prohibits real-time recognition. These two issues limit the deployment of action recognition in timing-critical tasks such as autonomous navigation.
In this thesis, we propose a new action recognition network architecture that achieves real-time performance on edge devices. We employ depthwise and pointwise convolutions to reduce computational complexity and leverage the Pyramidal Lucas-Kanade algorithm to shorten latency. In addition, we propose an optical flow diffusion assumption, and our experiments show that it helps the model extract temporal information. Moreover, we design an automated focal loss function and an attention mechanism to extract spatial features more efficiently. Empirical results show that our method greatly reduces computational complexity while preserving accuracy similar to that of the original deep two-stream model. Therefore, our model can meet the strict requirements of real-time applications on mobile devices.
[en]
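The abstract also credits a Pyramidal Lucas-Kanade optical flow implementation (a CUDA version, according to the Chinese abstract) for shortening the temporal stream's latency. The sketch below is only an illustration of the pyramidal LK call using OpenCV's CPU function cv2.calcOpticalFlowPyrLK on a coarse grid of points; the video filename, grid spacing, and window parameters are assumptions, and this is not the thesis implementation.

```python
# Minimal sketch (assumptions: input file name, 8-pixel grid, 21x21 window,
# 3 pyramid levels). The thesis uses a CUDA pyramidal LK implementation;
# this CPU version only illustrates the pyramidal Lucas-Kanade step itself.
import cv2
import numpy as np

cap = cv2.VideoCapture("input_video.mp4")           # hypothetical input clip
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

# Track a regular grid of points as a coarse stand-in for a dense flow field.
h, w = prev_gray.shape
ys, xs = np.mgrid[0:h:8, 0:w:8]
pts = np.stack([xs, ys], axis=-1).reshape(-1, 1, 2).astype(np.float32)

lk_params = dict(
    winSize=(21, 21),
    maxLevel=3,                                      # number of pyramid levels
    criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01),
)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Pyramidal Lucas-Kanade: returns the tracked positions of the grid points.
    new_pts, status, err = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None, **lk_params)
    flow = (new_pts - pts).reshape(-1, 2)            # (dx, dy) per grid point
    prev_gray = gray

cap.release()
```

In a two-stream setting, flow of this kind, stacked over several consecutive frames, would feed the temporal stream in place of the slower TV-L1 or Brox flow mentioned in the abstract.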
dc.description.provenance: Made available in DSpace on 2021-07-11T14:38:59Z (GMT). No. of bitstreams: 1
U0001-1608202000171300.pdf: 1342325 bytes, checksum: d2c48a039dd70d035d40d89641db6e33 (MD5)
Previous issue date: 2020
[en]
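The abstracts above also mention an automated focal loss. The sketch below shows only the standard focal loss that such a variant builds on; the automatic adjustment of the focusing parameter described in the thesis is not reproduced here, and the class count and gamma value are assumptions.

```python
# Sketch of the standard focal loss; the thesis's "automated" adjustment of
# gamma is NOT reproduced here. gamma=2 and the 12-class setup are assumptions.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """logits: (N, C) class scores, targets: (N,) integer labels."""
    log_probs = F.log_softmax(logits, dim=1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)   # log p_t
    pt = log_pt.exp()
    # Down-weight easy examples: (1 - p_t)^gamma scales the cross-entropy term.
    return (-(1.0 - pt) ** gamma * log_pt).mean()

if __name__ == "__main__":
    logits = torch.randn(4, 12)                  # hypothetical 12 action classes
    targets = torch.randint(0, 12, (4,))
    print(focal_loss(logits, targets))
```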
dc.description.tableofcontents: Contents
Acknowledgements i
摘要 ii
Abstract iii
Contents v
List of Figures vii
List of Tables viii
Chapter 1 Introduction 1
Chapter 2 Related Work 5
2.1 Action Recognition 5
2.2 Lightweight Neural Network 7
Chapter 3 Methodology 9
3.1 Reducing Computational Complexity of Convolution 10
3.2 Optical Flow Process Speedup 12
3.3 Optical Flow Diffusion Assumption 13
3.4 Spatial Feature Extraction Enhancement 13
3.4.1 Automated Focal Loss 14
3.4.2 Attention Mechanism 15
3.4.3 Randomly Block Images Strategy 16
Chapter 4 Experiments and Results 17
4.1 Dataset and Evaluation Protocol 17
4.2 Implementation Details 18
4.2.1 Data Augmentation 19
4.2.2 Training Strategy 19
4.2.3 Testing Strategy 20
4.2.4 Optical Flow Implementation 21
4.3 Optical Flow Diffusion Assumption 21
4.4 Spatial Model Design and Ablation Studies 22
4.5 Real-time Estimation 23
4.6 Effect of Fusion Method 24
4.7 The Comparison with the State of the Arts 27
4.8 Spatial and Temporal MACs Trade-off 28
Chapter 5 Conclusion 30
References 31
dc.language.iso: en
dc.subject: 嵌入式系統 [zh_TW]
dc.subject: 輕量化模型 [zh_TW]
dc.subject: 雙流神經網路 [zh_TW]
dc.subject: 動作識別 [zh_TW]
dc.subject: Lightweight Network [en]
dc.subject: Two-Stream Architecture [en]
dc.subject: Embedded Neural Networks [en]
dc.subject: Action Recognition [en]
dc.title: DiffusionNet:利用粒子擴散辨識連續動作的高效率雙流神經網路 [zh_TW]
dc.title: DiffusionNet: An Efficient Two-Stream Network for Continuous Action Recognition based on Particle Diffusion [en]
dc.type: Thesis
dc.date.schoolyear: 108-2
dc.description.degree: 碩士
dc.contributor.author-orcid: 0000-0002-8761-2418
dc.contributor.oralexamcommittee: 楊得年 (De-Nian Yang), 葉彌妍 (Mi-Yen Yeh), 帥宏翰 (Hong-Han Shuai)
dc.subject.keyword: 動作識別, 輕量化模型, 雙流神經網路, 嵌入式系統 [zh_TW]
dc.subject.keyword: Action Recognition, Two-Stream Architecture, Lightweight Network, Embedded Neural Networks [en]
dc.relation.page: 39
dc.identifier.doi: 10.6342/NTU202003550
dc.rights.note: 有償授權
dc.date.accepted: 2020-08-18
dc.contributor.author-college: 電機資訊學院 [zh_TW]
dc.contributor.author-dept: 電信工程學研究所 [zh_TW]
dc.date.embargo-lift: 2025-08-20
Appears in Collections: 電信工程學研究所

Files in This Item:
File / Size / Format
U0001-1608202000171300.pdf (Restricted Access) / 1.31 MB / Adobe PDF