Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/21489
Full metadata record
DC Field / Value / Language
dc.contributor.advisor: 陳良基 (Liang-Gee Chen)
dc.contributor.author: Chia-Ho Lin (en)
dc.contributor.author: 林家禾 (zh_TW)
dc.date.accessioned: 2021-06-08T03:35:35Z
dc.date.copyright: 2019-08-01
dc.date.issued: 2019
dc.date.submitted: 2019-07-30
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/21489
dc.description.abstract: The growth of wearable devices in recent years has spawned a series of computer-vision applications, most of which involve recognizing human actions and the objects being manipulated. Neural-network accelerators have likewise advanced the development of human-object interaction (HOI) systems. However, architecture designs for the object-detection module of human-object interaction systems remain scarce, and such applications also call for a module that delivers real-time output at low power on high-resolution video input.
This thesis presents a hardware-oriented algorithm design for a Region Proposal Network adapted to human-object interaction scenarios. The motivation is to optimize the Region Proposal Network algorithm to meet the needs of future wearable devices. First, to address the network's large memory footprint, we exploit the sparsity of the feature maps and design a zero-coefficient skipping mechanism together with a sparse convolution architecture, greatly reducing the memory requirement. In addition, to address the network's high bandwidth usage, we design a pooling layer that operates patch by patch and reduce the number of initial anchor boxes, lowering the bandwidth requirement. This is the first work to propose a hardware-oriented algorithm design of a Region Proposal Network for human-object interaction. The resulting chip features low memory cost, low bandwidth usage, and high throughput, and can later be combined with a deep-neural-network engine to form a state-of-the-art real-time action-recognition system. (zh_TW)
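To make the zero-skipping idea above concrete, the following is a minimal, illustrative NumPy sketch; it is not the thesis's RTL design, and all names, shapes, and sparsity levels are assumptions. The convolution loop visits only non-zero activations, so the multiply-accumulate work shrinks with the sparsity of the feature map, which is the property a sparse convolution architecture exploits.

```python
import numpy as np

def conv2d_zero_skipping(fmap, weights):
    """fmap: (C_in, H, W) input feature map; weights: (C_out, C_in, K, K).
    Valid convolution (no padding); only non-zero activations are visited."""
    c_in, h, w = fmap.shape
    c_out, _, k, _ = weights.shape
    out_h, out_w = h - k + 1, w - k + 1
    out = np.zeros((c_out, out_h, out_w), dtype=np.result_type(fmap, weights))
    # Zero-skipping: enumerate only the non-zero activations and scatter each
    # one's contribution to every output position and output channel it touches.
    for c, y, x in zip(*np.nonzero(fmap)):
        a = fmap[c, y, x]
        for dy in range(k):
            oy = y - dy
            if not (0 <= oy < out_h):
                continue
            for dx in range(k):
                ox = x - dx
                if 0 <= ox < out_w:
                    out[:, oy, ox] += a * weights[:, c, dy, dx]
    return out

# Tiny usage example with a mostly-zero activation map, e.g. after ReLU.
rng = np.random.default_rng(0)
fmap = rng.random((8, 16, 16)).astype(np.float32)
fmap[fmap < 0.9] = 0.0                      # ~90% of activations are zero
weights = rng.random((4, 8, 3, 3)).astype(np.float32)
print(conv2d_zero_skipping(fmap, weights).shape)   # (4, 14, 14)
```

When the feature map is largely zero, the outer loop touches only the non-zero fraction of positions that a dense triple loop would visit; a channel-wise zero-skipping circuit can obtain a similar saving by gating off multiply-accumulate operations for zero inputs.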
dc.description.abstract: The growth of wearable devices in recent years has facilitated a series of advanced computer-vision applications, most of which involve recognizing human activity and the manipulated objects. The CNN-based accelerators proposed in recent years have also driven progress on human-object interaction systems. However, the lack of a robust and flexible architecture for object detection makes it difficult to build a reliable system for egocentric applications. In addition, the detector must respond in real time on high-resolution video at low power to fit egocentric use cases.
This thesis introduces a hardware-friendly algorithm and a hardware architecture for the Region Proposal Network used as a video representation. The Region Proposal Network is a lightweight deep neural network for detection that is easy to couple with network models of different sizes to balance accuracy against complexity, and it has been shown to benefit egocentric action recognition in recent work. Our goal is real-time computing for applications in real-world egocentric action recognition. First, we propose a hardware-friendly algorithm, including anchor reduction and patch-based RoI pooling, to reduce the resource requirements. Building on these techniques, we further design an architecture, including sparsity-aware convolution and channel-wise zero-skipping, to achieve high throughput, low memory cost, and low bandwidth.
We are the first to propose a hardware-friendly algorithm and architecture for the Region Proposal Network. With its small area, low memory cost, and high throughput, the chip can serve many applications on mobile devices or be combined with a deep neural network to implement state-of-the-art systems for real-time action recognition. (en)
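As a concrete illustration of the two algorithm-level ideas named in the abstract, here is a minimal NumPy sketch under assumed parameters (stride, scales, ratios, output grid size, and all tensor shapes are made up for the example; this is not the thesis's implementation): an anchor generator whose scale/ratio set is deliberately small, so fewer proposals have to be scored and moved on and off chip, and an RoI max-pooling routine whose per-cell work decomposes into small independent patches of the feature map.

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride=16, scales=(128.0,), ratios=(0.5, 1.0)):
    """Return (N, 4) anchors as (x1, y1, x2, y2) in image coordinates.
    A reduced scale/ratio set (here 1 x 2 instead of the common 3 x 3)
    means fewer boxes to score and sort downstream."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.asarray(anchors, dtype=np.float32)

def roi_max_pool(fmap, roi, out_size=7):
    """fmap: (C, H, W); roi: (x1, y1, x2, y2) in feature-map coordinates.
    Each output cell is the max over its own sub-window of the RoI, so the
    computation only ever needs one small patch of the feature map at a time."""
    c, h, w = fmap.shape
    x1, y1, x2, y2 = (int(round(v)) for v in roi)
    x1, y1 = min(max(x1, 0), w - 1), min(max(y1, 0), h - 1)
    x2, y2 = min(max(x2, x1 + 1), w), min(max(y2, y1 + 1), h)
    xs = np.linspace(x1, x2, out_size + 1).astype(int)
    ys = np.linspace(y1, y2, out_size + 1).astype(int)
    out = np.empty((c, out_size, out_size), dtype=fmap.dtype)
    for i in range(out_size):
        for j in range(out_size):
            y_lo, y_hi = ys[i], max(ys[i + 1], ys[i] + 1)
            x_lo, x_hi = xs[j], max(xs[j + 1], xs[j] + 1)
            out[:, i, j] = fmap[:, y_lo:y_hi, x_lo:x_hi].max(axis=(1, 2))
    return out

# Tiny usage example with made-up sizes.
fmap = np.random.default_rng(0).random((16, 38, 50)).astype(np.float32)
anchors = make_anchors(feat_h=38, feat_w=50)          # 38 * 50 * 2 = 3800 anchors
pooled = roi_max_pool(fmap, roi=(10, 8, 30, 24))
print(anchors.shape, pooled.shape)                    # (3800, 4) (16, 7, 7)
```

In a hardware mapping, evaluating the pooling grid patch by patch would let only a small tile of the feature map reside in on-chip buffers at a time, which is one way such a scheme can lower bandwidth; the thesis's actual RoI selection and RoI pooling stages are covered in Sections 4.2.2 and 4.2.3 of the table of contents below.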
dc.description.provenance: Made available in DSpace on 2021-06-08T03:35:35Z (GMT). No. of bitstreams: 1
ntu-108-R05943040-1.pdf: 10493424 bytes, checksum: fac525aa0fb3e7f5b65a878bf972646c (MD5)
Previous issue date: 2019 (en)
dc.description.tableofcontents: Abstract xi
1 Introduction 1
2 Related Works 5
2.1 Object Detection 5
2.2 Action Recognition 8
2.2.1 Hand-crafted Methods 8
2.2.2 Deep-Learning-Based Methods 9
2.3 Visual Relationships 9
2.4 Human-Object Interactions 10
2.4.1 Hand-crafted feature 10
2.4.2 CNN-based feature 11
2.4.3 Comparison 11
3 Human Object Interaction 13
3.1 Introduction 13
3.2 System Overview 14
3.3 Robust Detection Module 16
3.4 Region Proposal Network and RoI Pooling 18
3.4.1 Core Concept 20
3.4.2 Experiment 23
3.5 Need for acceleration 25
3.5.1 Goals 25
3.5.2 Challenges 26
4 Architecture Design and Implementation 29
4.1 Goals and Challenges 29
4.2 Hardware-Friendly Algorithm 29
4.2.1 Convolution Stage 31
4.2.2 RoI Selection Stage 42
4.2.3 RoI Pooling Stage 43
4.3 Proposed Architectures 47
4.3.1 Convolution submodule 47
4.4 Implementation Result 52
4.4.1 Specification 52
4.4.2 Comparison 52
5 Conclusion 55
Bibliography 57
dc.language.iso: en
dc.title: 人物互動之區域檢測網路硬體導向演算法 (zh_TW)
dc.title: Memory-Efficient Hardware-Friendly Algorithm of Region Proposal Network for Human-Object Interaction (en)
dc.type: Thesis
dc.date.schoolyear: 107-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 簡韶逸 (Shao-Yi Chien), 楊佳玲 (Chia-Lin Yang), 黃俊郎 (Jiun-Lang Huang)
dc.subject.keyword: 即時動作辨識, 區域檢測網絡, 硬體導向演算法 (real-time action recognition, region proposal network, hardware-oriented algorithm) (zh_TW)
dc.subject.keyword: real-time egocentric action recognition, region proposal network, hardware-friendly algorithm, architecture design (en)
dc.relation.page: 62
dc.identifier.doi: 10.6342/NTU201902212
dc.rights.note: 未授權 (not authorized)
dc.date.accepted: 2019-07-30
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) (zh_TW)
dc.contributor.author-dept: 電子工程學研究所 (Graduate Institute of Electronics Engineering) (zh_TW)
Appears in collections: 電子工程學研究所 (Graduate Institute of Electronics Engineering)

Files in this item:
File / Size / Format
ntu-108-1.pdf (not authorized for public access) / 10.25 MB / Adobe PDF