Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88899
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 許永真 | zh_TW |
dc.contributor.advisor | Yung-Jen Hsu | en |
dc.contributor.author | 魏資碩 | zh_TW |
dc.contributor.author | Tzu-Shuo Wei | en |
dc.date.accessioned | 2023-08-16T16:16:05Z | - |
dc.date.available | 2023-11-09 | - |
dc.date.copyright | 2023-08-16 | - |
dc.date.issued | 2023 | - |
dc.date.submitted | 2023-08-09 | - |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88899 | - |
dc.description.abstract | 人類活動辨識(Human Activity Recognition)很常透過骨架座標的方式表達動態關係。然而,當我們透過人體姿態檢測模型(Pose Estimation Model)對RGB影片進行辨識以生成骨架資料時,會因為目標物被畫面邊緣切割而造成資料缺失。這些缺失都發生在人體四肢,且都是從距離軀幹最遠的部分開始缺失,對人類活動辨識造成負面影響。
為了避免人類活動辨識錯誤,已經有許多針對人體骨架缺失還原的研究。然而,過去的研究多半針對少量缺失點散佈在骨架序列中的情況來做修補;當缺失點長時間發生在四肢時,目前仍沒有方法能準確地修補這些缺失。因此,本研究針對人體骨架序列中單一肢體的邊緣關節點長時間資料缺失,利用深度學習模型來還原缺失的資料點。
在本論文中,我們提出群組式抽樣(Group-based Sampling)來增加資料歧異度以及資料數量。在訓練方面,我們設計了專屬的兩階段訓練(Two-stage Training),並透過掩碼語言模型(Masked Language Model)的遮蔽訓練方式,漸進式地遮蔽骨架,讓模型逐漸學習不同動作下缺失區域的運動軌跡。我們實作了混合結構的變換器模型(Transformer Model),能同時萃取骨架的結構特徵以及變化特徵,並將所獲得的特徵有效地混合,讓後續預測模塊能對缺失區塊做準確預測。
本研究首先於Human3.6M資料集做實驗。我們發現,同時萃取骨架結構特徵以及變化特徵,在修補缺失區塊以及重建整體骨架序列上的準確度最高。我們的方法雖然目前只能針對人體骨架單一肢體的長時間缺失做修補,但在缺失區塊的修補能力以及骨架的重建能力上,都成功超越了最先進的(state-of-the-art)方法。 | zh_TW |
dc.description.abstract | Human Activity Recognition (HAR) often uses skeleton coordinates to express dynamic relationships. However, when skeleton data are generated by running a pose estimation model on RGB videos, joints may be lost because the subject is cut off by the edge of the frame. This loss typically occurs in the four limbs of the human body and starts from the parts farthest from the torso. Such data loss has a negative impact on human activity recognition.
To avoid errors in human activity recognition, many studies have focused on recovering missing human skeleton joints. However, previous research has primarily targeted a small number of missing joints scattered across the skeleton sequence; when joints in a limb are missing over a long period of time, no existing method recovers them accurately. This study therefore targets prolonged data loss in the distal joints of a single limb within a human skeleton sequence and employs a deep learning model to recover the missing joints.
In this thesis, we propose a group-based sampling method to increase data diversity and quantity. For training, we design a two-stage training strategy with a masking scheme that progressively masks the skeleton, following the masked language modeling technique; this allows the model to gradually learn the motion trajectories of the missing regions across different actions. We also implement a hybrid transformer-based model that extracts both structural and motion features from the skeleton and combines them effectively, enabling the subsequent prediction modules to predict the missing regions accurately.
We conducted our experiments on the Human3.6M dataset and found that extracting structural and motion features simultaneously achieves the highest accuracy in both recovering the missing regions and reconstructing the skeleton sequence. Although our method currently addresses only long-term missing joints in a single limb, it surpasses state-of-the-art methods in both recovering the missing regions and reconstructing the skeleton sequence. | en |
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-08-16T16:16:04Z No. of bitstreams: 0 | en |
dc.description.provenance | Made available in DSpace on 2023-08-16T16:16:05Z (GMT). No. of bitstreams: 0 | en |
dc.description.tableofcontents | Acknowledgements i
Chinese Abstract ii
Abstract iii
1 Introduction 1
1.1 Background and Motivation 1
1.2 Problem Description 4
1.3 Proposed Method 5
1.4 Outline of the Thesis 6
2 Literature Review 7
2.1 Applying Masked Language Modeling Techniques in Computer Vision 7
2.2 Human Skeleton Missing Joints Recovery 9
2.2.1 Prior-based Methods 10
2.2.2 Deep Neural Network Methods 11
3 Problem Definition 14
3.1 Notations 14
3.2 Missing Joints Recovery in Human Limb Skeleton 16
4 Methodology 18
4.1 Human Skeleton Data Preprocessing 19
4.2 Group-based Sampling Method 20
4.3 Masking Method 22
4.4 Two-stage Training Method 23
4.5 Masking Strategy with our Two-stage Training Method 24
4.6 Transformer-based Model Architecture 25
4.6.1 Structure-based Model 25
4.6.2 Movement-based Model 26
4.6.3 Hybrid Model 27
5 Experiments 31
5.1 Experiment Setup 31
5.1.1 The Dataset 32
5.1.2 Experiment Details 32
5.1.3 Evaluation Protocol 34
5.2 Experiment Result 35
5.2.1 The Impact of Different Skeleton Feature Extraction Methods on Model Structures 36
5.2.2 The Impact of the Length of the Skeleton Sequence Input on Performance 37
5.2.3 The Impact of Different Sampling Methods on the Results 38
5.2.4 The Impact of the Two-stage Training Strategy and Different Masking Strategies 39
5.2.5 Comparing the Performance of Our Method with Other Methods 45
5.3 Discussion 45
5.3.1 Structure-based Model Results Observation and Analysis 46
5.3.2 Movement-based Model Results Observation and Analysis 46
5.3.3 Hybrid Model Results Observation and Analysis 47
5.3.4 Analysis of Missing Joints at Various Limb Locations 48
6 Conclusion 60
6.1 Summary of Work 60
6.2 Contribution 61
6.3 Limitation 62
6.4 Future Study 62
Bibliography 62
A Application on the AIMS Dataset 71
A.1 Introduction to the AIMS Dataset 71
A.2 Skeleton Data Collection 72
A.3 Testing on Real Lost Video Data 72 | - |
dc.language.iso | en | - |
dc.title | 以變換器方法來修補缺失關節點用於骨架為基礎的動作識別 | zh_TW |
dc.title | A Transformer Approach to Recovering Missing Joints in Skeleton-Based Human Activity Recognition | en |
dc.type | Thesis | - |
dc.date.schoolyear | 111-2 | - |
dc.description.degree | Master's | - |
dc.contributor.oralexamcommittee | 古倫維;郭彥伶;陳駿丞;鄭素芳 | zh_TW |
dc.contributor.oralexamcommittee | Lun-Wei Ku;Yen-Ling Kuo;Jun-Cheng Chen;Suh-Fang Jeng | en |
dc.subject.keyword | 資料缺失修補,變換器方法,人體骨架表示法,深度學習,人類活動辨識 | zh_TW |
dc.subject.keyword | Missing Joints Recovery, Transformer Approach, Human Skeleton Representation, Deep Learning, Human Activity Recognition | en |
dc.relation.page | 73 | - |
dc.identifier.doi | 10.6342/NTU202302931 | - |
dc.rights.note | Authorized for release (campus access only) | - |
dc.date.accepted | 2023-08-10 | - |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
dc.contributor.author-dept | Graduate Institute of Networking and Multimedia | - |
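To make the recovery setup described in the abstract concrete, here is a minimal sketch, assuming PyTorch, a Human3.6M-style 17-joint skeleton, and hypothetical joint indices and hyperparameters. It illustrates only the masked-modeling idea (hide the distal joints of one limb for a long contiguous window, then train a transformer to regress them); it is not the thesis's hybrid structure-and-movement model or its group-based sampling pipeline.

```python
import torch
import torch.nn as nn

# Dimensions (assumptions for illustration): 50-frame windows of a
# Human3.6M-style 17-joint skeleton with 3D coordinates.
T, J, C = 50, 17, 3
ARM = [11, 12, 13]  # hypothetical shoulder -> elbow -> wrist indices

class MaskedJointRecovery(nn.Module):
    """Minimal transformer that embeds each frame's flattened skeleton,
    attends over time, and regresses all joint coordinates."""
    def __init__(self, d_model=128, n_heads=8, n_layers=4):
        super().__init__()
        self.embed = nn.Linear(J * C, d_model)               # per-frame embedding
        self.pos = nn.Parameter(torch.zeros(1, T, d_model))  # learned temporal positions
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, J * C)                # coordinate regression

    def forward(self, x):                                    # x: (B, T, J, C)
        b = x.size(0)
        h = self.embed(x.reshape(b, T, J * C)) + self.pos
        return self.head(self.encoder(h)).reshape(b, T, J, C)

def mask_limb(seq, joints, t0, t1):
    """Simulate a long occlusion: zero the given joints on frames [t0, t1)."""
    mask = torch.zeros_like(seq, dtype=torch.bool)
    mask[:, t0:t1, joints, :] = True
    return seq.masked_fill(mask, 0.0), mask

# One training step on synthetic data; as in masked language modeling,
# the loss is computed only on the masked (missing) region.
model = MaskedJointRecovery()
seq = torch.randn(8, T, J, C)                                # batch of skeleton sequences
corrupted, mask = mask_limb(seq, ARM[1:], t0=10, t1=40)      # elbow+wrist gone for 30 frames
pred = model(corrupted)
loss = ((pred - seq)[mask] ** 2).mean()
loss.backward()
# The progressive masking in the thesis's two-stage training could be
# scheduled by growing the masked window (t1 - t0) as training proceeds.
```

Restricting the loss to the masked region mirrors masked language modeling: the model is graded on reconstructing the missing trajectory rather than on copying the visible joints.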
Appears in Collections: | Graduate Institute of Networking and Multimedia
Files in this item:
File | Size | Format | |
---|---|---|---|
ntu-111-2.pdf Access restricted to NTU campus IP addresses (please use the VPN service for off-campus access) | 10.34 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.