NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/81225
Full metadata record

DC Field: Value (Language)
dc.contributor.advisor: 許永真 (Yung-Chen HSU)
dc.contributor.author: Nien-Tse Lin (en)
dc.contributor.author: 林念澤 (zh_TW)
dc.date.accessioned: 2022-11-24T03:37:14Z
dc.date.available: 2021-08-11
dc.date.available: 2022-11-24T03:37:14Z
dc.date.copyright: 2021-08-11
dc.date.issued: 2021
dc.date.submitted: 2021-08-01
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/81225
dc.description.abstract: Action recognition has wide applications in video content annotation, surveillance systems, and human-computer interaction. With the advance of deep learning, algorithms based on 3D convolutional networks (3D ConvNets) can achieve high accuracy on large-scale datasets. However, such algorithms rely heavily on large amounts of training data to induce complex classification rules; as the dataset shrinks, both their accuracy and stability suffer severely. We argue that in the typical training scheme, supervising the model with action class labels alone cannot effectively guide a 3D ConvNet to extract features useful for action recognition. To address this problem, this study proposes a multi-task learning framework with an auxiliary task that learns the regions of human body movement, thereby strengthening the supervision of the model's feature extraction (see the illustrative sketch after this record). Experiments on the NTU RGB+D 60 dataset show that the proposed multi-task learning method can be successfully applied to three different 3D ConvNets and substantially improves their accuracy on small-scale datasets without adding inference time. (zh_TW)
dc.description.provenance: Made available in DSpace on 2022-11-24T03:37:14Z (GMT). No. of bitstreams: 1
U0001-2907202113081200.pdf: 7785344 bytes, checksum: dbbfb6feb69a08f7d0b14797a0c5caca (MD5)
Previous issue date: 2021 (en)
dc.description.tableofcontents:
1 Introduction...1
1.1 Background...1
1.2 Motivation...2
1.3 Proposed Method...3
1.4 Outline of the Thesis...4
2 Literature Review...5
2.1 Deep Learning for Action Recognition...5
2.1.1 Multi-stream Network Based Method...5
2.1.2 3D Convolutional Network Based Method...6
2.1.3 Skeleton Based Method...7
2.2 Human Pose Estimation...8
2.2.1 2D Single-Person Pose Estimation...8
2.2.2 2D Multi-Person Pose Estimation...8
2.3 Multi-task Learning for Action Recognition...9
3 Problem Definition...11
3.1 Action Recognition...12
3.2 Action Recognition on Small-Scale Datasets...13
4 Methodology...14
4.1 Algorithm of Joint Movement Pattern Identification...15
4.1.1 Full Image Joint Heatmap Evaluation...16
4.1.2 Joint Movement Pattern Identification...19
4.2 The Multi-task Learning Model...22
5 Experiments...25
5.1 Experiment Setup...25
5.1.1 The Dataset...25
5.1.2 Data Augmentation...26
5.1.3 Evaluation Metrics...27
5.2 Training Details...27
5.3 Experiment Results...27
5.3.1 Effect and Generalization Ability on Different 3D ConvNets...27
5.3.2 Generalization Ability to Sub-datasets of Different Sizes...36
5.4 Discussion...39
5.4.1 The Weight for the Joint Movement Pattern Loss...39
5.4.2 The Stability of 3D ConvNets on Small-Scale Datasets...40
5.4.3 Failure Cases...41
6 Conclusion...43
6.1 Summary and Contribution...43
6.2 Future Study...44
Bibliography...45
dc.language.iso: en
dc.subject: 動作識別 (Action Recognition) (zh_TW)
dc.subject: 小型資料集 (Small-scale Dataset) (zh_TW)
dc.subject: 三維卷積網路 (3D Convolutional Network) (zh_TW)
dc.subject: 多任務學習 (Multi-task Learning) (zh_TW)
dc.subject: Small-scale Dataset (en)
dc.subject: Action Recognition (en)
dc.subject: Multi-task Learning (en)
dc.subject: 3D Convolutional Neural Network (en)
dc.title: 利用多任務學習以關節移動圖像引導動作識別 (zh_TW)
dc.title: Joint Movement Pattern Guided Action Recognition Using Multi-task Learning (en)
dc.date.schoolyear: 109-2
dc.description.degree: 碩士 (Master)
dc.contributor.oralexamcommittee: 王鈺強 (Hsin-Tsai Liu), 李明穗 (Chih-Yang Tseng), 鄭素芳, 陳維超
dc.subject.keyword: 動作識別, 多任務學習, 三維卷積網路, 小型資料集 (zh_TW)
dc.subject.keyword: Action Recognition, Multi-task Learning, 3D Convolutional Neural Network, Small-scale Dataset (en)
dc.relation.page: 50
dc.identifier.doi: 10.6342/NTU202101889
dc.rights.note: 同意授權(限校園內公開) (Authorized; access restricted to campus)
dc.date.accepted: 2021-08-03
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) (zh_TW)
dc.contributor.author-dept: 資訊工程學研究所 (Graduate Institute of Computer Science and Information Engineering) (zh_TW)
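The abstract above describes the approach only in prose, and this record contains no implementation. The following is a minimal, hypothetical PyTorch sketch of how such a multi-task objective could be wired: a 3D-ConvNet feature map feeds both an action classifier and an auxiliary head that predicts joint-movement regions, with the auxiliary loss weighted by a coefficient (the thesis discusses such a weight in Section 5.4.1). All names (MultiTaskHead, multitask_loss, lambda_jmp) and tensor shapes are illustrative assumptions, not the thesis's actual code.

```python
# Hedged sketch of a multi-task objective for 3D-ConvNet action recognition.
# Assumptions (not from the thesis): the backbone returns a (N, C, T, H, W)
# feature map, and the auxiliary target is a per-cell movement-region mask.
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    """Action classifier plus auxiliary joint-movement-pattern predictor."""
    def __init__(self, feat_channels: int, num_classes: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.cls = nn.Linear(feat_channels, num_classes)
        # 1x1x1 conv predicts a movement-region logit per spatio-temporal cell.
        self.jmp = nn.Conv3d(feat_channels, 1, kernel_size=1)

    def forward(self, feats: torch.Tensor):
        # feats: (N, C, T, H, W) from any 3D ConvNet backbone.
        logits = self.cls(self.pool(feats).flatten(1))
        jmp_map = self.jmp(feats)  # (N, 1, T, H, W) movement-region logits
        return logits, jmp_map

def multitask_loss(logits, jmp_map, labels, jmp_target, lambda_jmp=0.5):
    # Auxiliary supervision is used only at training time; inference reads
    # `logits` alone, so no extra inference cost is added.
    ce = nn.functional.cross_entropy(logits, labels)
    bce = nn.functional.binary_cross_entropy_with_logits(jmp_map, jmp_target)
    return ce + lambda_jmp * bce

# Usage example with made-up shapes: 512-channel features, 60 action classes.
head = MultiTaskHead(feat_channels=512, num_classes=60)
feats = torch.randn(2, 512, 8, 7, 7)
logits, jmp_map = head(feats)
loss = multitask_loss(logits, jmp_map,
                      labels=torch.randint(0, 60, (2,)),
                      jmp_target=torch.rand(2, 1, 8, 7, 7))
```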
Appears in collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in this item:
File: U0001-2907202113081200.pdf (7.6 MB, Adobe PDF)
Access restricted to NTU campus IP addresses (use the library's VPN service for off-campus access).


Unless their copyright terms state otherwise, all items in this repository are protected by copyright, with all rights reserved.

© NTU Library All Rights Reserved