Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/77986
Full metadata record
DC Field: Value [Language]
dc.contributor.advisor: 陳銘憲 (Ming-Syan Chen)
dc.contributor.author: Yi-Jun Chen [en]
dc.contributor.author: 陳奕君 [zh_TW]
dc.date.accessioned: 2021-07-11T14:38:59Z
dc.date.available: 2025-08-20
dc.date.copyright: 2020-08-21
dc.date.issued: 2020
dc.date.submitted: 2020-08-17
dc.identifier.citation: References
[1] R. Allaoui, H. Mouane, Z. Asrih, S. Mars, I. El Hajjouji, et al. Fpga-based implementation of optical flow algorithm. In 2017 International Conference on Electrical and Information Technologies (ICEIT), pages 1–5, Rabat, Morocco, 2017. IEEE.
[2] M. Barekatain, M. Martí, H. Shih, S. Murray, K. Nakayama, Y. Matsuo, and H. Prendinger. Okutama-action: An aerial view video dataset for concurrent human action detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2017, Honolulu, HI, USA, July 21-26, 2017, pages 2153–2160, Honolulu, HI, USA, 2017. IEEE Computer Society.
[3] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould. Dynamic image networks for action recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 3034–3042, Las Vegas, NV, USA, 2016. IEEE Computer Society.
[4] A. F. Bobick and J. W. Davis. The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3): 257–267, 2001.
[5] G. Bradski. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 25:120–125, 2000.
[6] Z. Cai, L. Wang, X. Peng, and Y. Qiao. Multi-view super vector for action recognition. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 596–603, Columbus, OH, USA, 2014. IEEE Computer Society.
[7] C. R. de Souza, A. Gaidon, E. Vig, and A. M. L. Peña. Sympathy for the details: Dense trajectories and hybrid classification architectures for action recognition. In B. Leibe, J. Matas, N. Sebe, and M. Welling, editors, Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VII, volume 9911 of Lecture Notes in Computer Science, pages 697–716, Amsterdam, The Netherlands, 2016. Springer.
[8] Y. Du, W. Wang, and L. Wang. Hierarchical recurrent neural network for skeleton based action recognition. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 1110–1118, Boston, MA, USA, 2015. IEEE Computer Society.
[9] C. Feichtenhofer, A. Pinz, and R. P. Wildes. Spatiotemporal residual networks for video action recognition. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, page 3476–3484, Red Hook, NY, USA, 2016. Curran Associates Inc.
[10] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 1933–1941, Las Vegas, NV, USA, 2016. IEEE Computer Society.
[11] L. Fridman. Human-centered autonomous vehicle systems: Principles of effective shared autonomy, 2018.
[12] K. Greff, R. K. Srivastava, J. Koutnik, B. R. Steunebrink, and J. Schmidhuber. Lstm: A search space odyssey. IEEE transactions on neural networks and learning systems, 28(10):2222–2232, 2017.
[13] D. Gu. 3D Densely Connected Convolutional Network for the Recognition of Human Shopping Actions. PhD thesis, Université d’Ottawa/University of Ottawa, 2017.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778, Las Vegas, NV, USA, 2016. IEEE Computer Society.
[15] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications, 2017.
[16] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 2261–2269, Honolulu, HI, USA, 2017. IEEE Computer Society.
[17] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5mb model size, 2016.
[18] S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1): 221–231, 2013.
[19] V. Kantorov and I. Laptev. Efficient feature extraction, encoding, and classification for action recognition. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 2593–2600, Columbus, OH, USA, 2014. IEEE Computer Society.
[20] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and F. Li. Large-scale video classification with convolutional neural networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 1725–1732, Columbus, OH, USA, 2014. IEEE Computer Society.
[21] A. Kläser, M. Marszalek, and C. Schmid. A spatio-temporal descriptor based on 3d-gradients. In M. Everingham, C. J. Needham, and R. Fraile, editors, Proceedings of the British Machine Vision Conference 2008, Leeds, UK, September 2008, pages 1–10, Leeds, UK, 2008. British Machine Vision Association.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. Commun. ACM, 60(6):84–90, May 2017.
[23] I. Laptev and T. Lindeberg. Space-time interest points. In 9th IEEE International Conference on Computer Vision (ICCV 2003), 14-17 October 2003, Nice, France, pages 432–439, Nice, France, 2003. IEEE Computer Society.
[24] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, Toulon, France, 2017. OpenReview.net.
[25] K. Liu, W. Liu, C. Gan, M. Tan, and H. Ma. T-C3D: temporal convolutional 3d network for real-time action recognition. In S. A. McIlraith and K. Q. Weinberger, editors, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 7138–7145, New Orleans, Louisiana, USA, 2018. AAAI Press.
[26] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proceedings of the 7th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI’81, page 674–679, San Francisco, CA, USA, 1981. Morgan Kaufmann Publishers Inc.
[27] N. Ma, X. Zhang, H. Zheng, and J. Sun. Shufflenet V2: practical guidelines for efficient CNN architecture design. In V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, editors, Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIV, volume 11218 of Lecture Notes in Computer Science, pages 122–138, Munich, Germany, 2018. Springer.
[28] S. Mahmoudi, M. Kierzynka, P. Manneback, and K. Kurowski. Real-time motion tracking using optical flow on multiple gpus. Bulletin of the Polish Academy of Sciences, Technical Sciences, 62:139–150, 03 2014.
[29] J. Y. Ng, M. J. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 4694–4702, Boston, MA, USA, 2015. IEEE Computer Society.
[30] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[31] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 4510–4520, Salt Lake City, UT, USA, 2018. IEEE Computer Society.
[32] P. Scovanner, S. Ali, and M. Shah. A 3-dimensional sift descriptor and its application to action recognition. In Proceedings of the 15th ACM International Conference on Multimedia, MM ’07, page 357–360, New York, NY, USA, 2007. Association for Computing Machinery.
[33] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS’14, page 568–576, Cambridge, MA, USA, 2014. MIT Press.
[34] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Y. Bengio and Y. LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, San Diego, CA, USA, 2015. https://iclr.cc/archive/www/2015.html.
[35] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild, 2012.
[36] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le. Mnasnet: Platform-aware neural architecture search for mobile. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 2820–2828, Long Beach, CA, USA, 2019. Computer Vision Foundation / IEEE.
[37] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 4489–4497, Santiago, Chile, 2015. IEEE Computer Society.
[38] D. Tran, J. Ray, Z. Shou, S.-F. Chang, and M. Paluri. Convnet architecture search for spatiotemporal feature learning, 2017.
[39] G. Varol, I. Laptev, and C. Schmid. Long-term temporal convolutions for action recognition. IEEE transactions on pattern analysis and machine intelligence, 40(6): 1510–1517, 2017.
[40] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision, 103:60–79, 2012.
[41] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. International journal of computer vision, 103(1):60–79, 2013.
[42] H. Wang and C. Schmid. Action recognition with improved trajectories. In IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1-8, 2013, pages 3551–3558, Sydney, Australia, 2013. IEEE Computer Society.
[43] L. Wang, Y. Xiong, Z. Wang, and Y. Qiao. Towards good practices for very deep two-stream convnets, 2015.
[44] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. V. Gool. Temporal segment networks: Towards good practices for deep action recognition. In B. Leibe, J. Matas, N. Sebe, and M. Welling, editors, Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII, volume 9912 of Lecture Notes in Computer Science, pages 20–36, Amsterdam, The Netherlands, 2016. Springer.
[45] A. Wedel, T. Pock, C. Zach, H. Bischof, and D. Cremers. An improved algorithm for tv-l1 optical flow. In D. Cremers, B. Rosenhahn, A. L. Yuille, and F. R. Schmidt, editors, Statistical and Geometrical Approaches to Visual Motion Analysis, pages 23–45, Berlin, Heidelberg, 2009. Springer Berlin Heidelberg.
[46] D. Weinland, R. Ronfard, and E. Boyer. Free viewpoint action recognition using motion history volumes. Comput. Vis. Image Underst., 104(2):249–257, Nov. 2006.
[47] C. Wu, M. Zaheer, H. Hu, R. Manmatha, A. J. Smola, and P. Krähenbühl. Compressed video action recognition. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 6026–6035, Salt Lake City, UT, USA, 2018. IEEE Computer Society.
[48] X. Xu, X. Zhang, B. Yu, X. S. Hu, C. Rowen, J. Hu, and Y. Shi. Dac-sdc low power object detection challenge for uav applications, 2018.
[49] C. Zhang, G. Sun, Z. Fang, P. Zhou, P. Pan, and J. Cong. Caffeine: Toward uniformed representation and acceleration for deep convolutional neural networks. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 38(11):2072–2085, 2019.
[50] X. Zhang, X. Zhou, M. Lin, and J. Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 6848–6856, Salt Lake City, UT, USA, 2018. IEEE Computer Society.
[51] Y. Zhu, Z. Lan, S. Newsam, and A. Hauptmann. Hidden two-stream convolutional networks for action recognition. In C. V. Jawahar, H. Li, G. Mori, and K. Schindler, editors, Computer Vision – ACCV 2018, pages 363–378, Cham, 2019. Springer International Publishing.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/77986
dc.description.abstract: 動作識別是電腦視覺中非常熱門的研究領域,其應用廣泛,普遍影響著人們的生活,像是監視設備、人機互動系統,都仰賴從影片中辨識人體動作的能力。一般單張圖片的影像辨識大多利用空間資訊來獲取訊息;然而,影片不像單張圖像那樣容易處理,影片辨識多了豐富卻也複雜的時間資訊。多數研究會善用雙流架構深度模型來分別處理空間與時間特徵,但如此龐大的模型常常難以實作在嵌入式系統中,反而降低實用性。此外,時間特徵的抽取也常常成為限制速度的瓶頸。因此,如何設計一個在速度上能達到即時性、準確度能達到實用性,卻又足夠輕量、能放進移動設備的模型,便成為一個很重要的問題。
這篇論文提出了一個名為DiffusionNet的輕量級雙流模型架構,其在動作識別任務上可以達到良好的準確率以及即時性。我們假設光流具有粒子的特性,會隨著時間擴散,而我們的實驗證實此立論基礎能幫助我們更有效地萃取時間特徵。另外,我們也設計了自動調節的Focal Loss損失函數與注意力機制,來幫助模型更有效地獲得空間領域特徵。在輕量化模型的部分,我們透過MobileNetV2的Depth-wise和Point-wise Convolution計算結構來降低計算複雜度,並透過實作CUDA版本的Pyramidal Lucas-Kanade演算法來快速地生成光流,解決多數傳統光流法在時間上無法達到即時性的瓶頸問題。
[zh_TW]
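The abstract above attributes the model's reduced complexity to MobileNetV2-style depth-wise and point-wise convolutions. As a rough, hypothetical illustration of that building block (not the thesis code; channel sizes and input shape are made up), the PyTorch sketch below pairs a 3x3 depthwise convolution with a 1x1 pointwise convolution and compares the per-position multiply-accumulate count against a standard 3x3 convolution.

```python
# Illustrative sketch only: a MobileNetV2-style depthwise + pointwise pair.
# Layer sizes and the input tensor are hypothetical.
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # Depthwise: one 3x3 filter per input channel (groups=in_ch)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        # Pointwise: 1x1 convolution mixes channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU6(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

if __name__ == "__main__":
    x = torch.randn(1, 64, 56, 56)            # hypothetical feature map
    y = DepthwiseSeparableConv(64, 128)(x)
    print(y.shape)                            # torch.Size([1, 128, 56, 56])
    # Per-output-position MACs for this layer:
    #   standard 3x3 conv:        3*3*64*128 = 73,728
    #   depthwise + pointwise:    3*3*64 + 64*128 = 8,768  (~8.4x fewer)
```

For a 64-to-128 channel layer this cuts the per-pixel multiply-accumulates by roughly 8x, which is the kind of saving the abstract relies on to fit the spatial stream on an edge device.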
dc.description.abstract: Human action recognition is one of the most active research fields in computer vision. Many applications, such as video surveillance and human-computer interaction, require recognizing human actions in video sequences. Unlike image classification, which relies only on spatial information, action recognition must also cope with abundant but noisy temporal information. Many state-of-the-art methods are based on the two-stream architecture, in which two neural networks process spatial and temporal data separately and their outputs are fused for the final prediction. However, the temporal information is an order of magnitude larger than the spatial information, and previous works built on traditional CNN backbones (e.g., VGG and ResNet) cannot be deployed on edge devices. Moreover, most current models adopt the TV-L1 or Brox algorithm to compute optical flow, whose long latency prohibits real-time recognition. These two issues limit the deployment of action recognition in timing-critical tasks such as autonomous navigation.
In this thesis, we propose a new action recognition network architecture that achieves real-time performance on edge devices. We employ depthwise and pointwise convolutions to reduce computational complexity and leverage the Pyramidal Lucas-Kanade algorithm to shorten latency. In addition, we propose an optical flow diffusion assumption, and our experiments show that it helps the model extract temporal information. Moreover, we design an automated focal loss function and an attention mechanism to extract spatial features more efficiently. Empirical results show that our method greatly reduces computational complexity while preserving accuracy similar to that of the original deep two-stream model. Therefore, our model can meet the strict requirements of real-time applications on mobile devices.
[en]
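The abstract also credits a Pyramidal Lucas-Kanade optical flow implementation (a CUDA version, according to the Chinese abstract) for shortening the temporal stream's latency. The sketch below is only an illustration of the pyramidal LK call using OpenCV's CPU function cv2.calcOpticalFlowPyrLK on a coarse grid of points; the video filename, grid spacing, and window parameters are assumptions, and this is not the thesis implementation.

```python
# Minimal sketch (assumptions: input file name, 8-pixel grid, 21x21 window,
# 3 pyramid levels). The thesis uses a CUDA pyramidal LK implementation;
# this CPU version only illustrates the pyramidal Lucas-Kanade step itself.
import cv2
import numpy as np

cap = cv2.VideoCapture("input_video.mp4")           # hypothetical input clip
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

# Track a regular grid of points as a coarse stand-in for a dense flow field.
h, w = prev_gray.shape
ys, xs = np.mgrid[0:h:8, 0:w:8]
pts = np.stack([xs, ys], axis=-1).reshape(-1, 1, 2).astype(np.float32)

lk_params = dict(
    winSize=(21, 21),
    maxLevel=3,                                      # number of pyramid levels
    criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01),
)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Pyramidal Lucas-Kanade: returns the tracked positions of the grid points.
    new_pts, status, err = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None, **lk_params)
    flow = (new_pts - pts).reshape(-1, 2)            # (dx, dy) per grid point
    prev_gray = gray

cap.release()
```

In a two-stream setting, flow of this kind, stacked over several consecutive frames, would feed the temporal stream in place of the slower TV-L1 or Brox flow mentioned in the abstract.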
dc.description.provenance: Made available in DSpace on 2021-07-11T14:38:59Z (GMT). No. of bitstreams: 1
U0001-1608202000171300.pdf: 1342325 bytes, checksum: d2c48a039dd70d035d40d89641db6e33 (MD5)
Previous issue date: 2020
[en]
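The abstracts above also mention an automated focal loss. The sketch below shows only the standard focal loss that such a variant builds on; the automatic adjustment of the focusing parameter described in the thesis is not reproduced here, and the class count and gamma value are assumptions.

```python
# Sketch of the standard focal loss; the thesis's "automated" adjustment of
# gamma is NOT reproduced here. gamma=2 and the 12-class setup are assumptions.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """logits: (N, C) class scores, targets: (N,) integer labels."""
    log_probs = F.log_softmax(logits, dim=1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)   # log p_t
    pt = log_pt.exp()
    # Down-weight easy examples: (1 - p_t)^gamma scales the cross-entropy term.
    return (-(1.0 - pt) ** gamma * log_pt).mean()

if __name__ == "__main__":
    logits = torch.randn(4, 12)                  # hypothetical 12 action classes
    targets = torch.randint(0, 12, (4,))
    print(focal_loss(logits, targets))
```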
dc.description.tableofcontents: Contents
Acknowledgements i
摘要 ii
Abstract iii
Contents v
List of Figures vii
List of Tables viii
Chapter 1 Introduction 1
Chapter 2 Related Work 5
2.1 Action Recognition 5
2.2 Lightweight Neural Network 7
Chapter 3 Methodology 9
3.1 Reducing Computational Complexity of Convolution 10
3.2 Optical Flow Process Speedup 12
3.3 Optical Flow Diffusion Assumption 13
3.4 Spatial Feature Extraction Enhancement 13
3.4.1 Automated Focal Loss 14
3.4.2 Attention Mechanism 15
3.4.3 Randomly Block Images Strategy 16
Chapter 4 Experiments and Results 17
4.1 Dataset and Evaluation Protocol 17
4.2 Implementation Details 18
4.2.1 Data Augmentation 19
4.2.2 Training Strategy 19
4.2.3 Testing Strategy 20
4.2.4 Optical Flow Implementation 21
4.3 Optical Flow Diffusion Assumption 21
4.4 Spatial Model Design and Ablation Studies 22
4.5 Real-time Estimation 23
4.6 Effect of Fusion Method 24
4.7 The Comparison with the State of the Arts 27
4.8 Spatial and Temporal MACs Trade-off 28
Chapter 5 Conclusion 30
References 31
dc.language.iso: en
dc.subject: 嵌入式系統 [zh_TW]
dc.subject: 輕量化模型 [zh_TW]
dc.subject: 雙流神經網路 [zh_TW]
dc.subject: 動作識別 [zh_TW]
dc.subject: Lightweight Network [en]
dc.subject: Two-Stream Architecture [en]
dc.subject: Embedded Neural Networks [en]
dc.subject: Action Recognition [en]
dc.title: DiffusionNet:利用粒子擴散辨識連續動作的高效率雙流神經網路 [zh_TW]
dc.title: DiffusionNet: An Efficient Two-Stream Network for Continuous Action Recognition based on Particle Diffusion [en]
dc.type: Thesis
dc.date.schoolyear: 108-2
dc.description.degree: 碩士
dc.contributor.author-orcid: 0000-0002-8761-2418
dc.contributor.oralexamcommittee: 楊得年 (De-Nian Yang), 葉彌妍 (Mi-Yen Yeh), 帥宏翰 (Hong-Han Shuai)
dc.subject.keyword: 動作識別, 輕量化模型, 雙流神經網路, 嵌入式系統 [zh_TW]
dc.subject.keyword: Action Recognition, Two-Stream Architecture, Lightweight Network, Embedded Neural Networks [en]
dc.relation.page: 39
dc.identifier.doi: 10.6342/NTU202003550
dc.rights.note: 有償授權
dc.date.accepted: 2020-08-18
dc.contributor.author-college: 電機資訊學院 [zh_TW]
dc.contributor.author-dept: 電信工程學研究所 [zh_TW]
dc.date.embargo-lift: 2025-08-20
Appears in Collections: 電信工程學研究所

Files in This Item:
File / Size / Format
U0001-1608202000171300.pdf (Restricted Access) / 1.31 MB / Adobe PDF