Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/94825

Full metadata record
| DC 欄位 | 值 | 語言 |
|---|---|---|
| dc.contributor.advisor | 傅立成 | zh_TW |
| dc.contributor.advisor | Li-Chen Fu | en |
| dc.contributor.author | 羅恩至 | zh_TW |
| dc.contributor.author | En-Jhih Lo | en |
| dc.date.accessioned | 2024-08-19T17:08:17Z | - |
| dc.date.available | 2024-08-20 | - |
| dc.date.copyright | 2024-08-19 | - |
| dc.date.issued | 2024 | - |
| dc.date.submitted | 2024-08-07 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/94825 | - |
| dc.description.abstract | 隨著電腦視覺與影像處理領域的蓬勃發展,三維人體姿態估計被廣泛應用在不同領域上。過去研究分為以輸入單張影像或多張影像進行骨架之重建,而該任務有幾個影像表現的關鍵點,包括模型的骨架約束能力與抗遮蔽能力,此外,若能以更快更即時的方式輸出骨架,將更能應用在更多不同場景上。因此,本研究旨在以線上的偵測方式,透過提升模型之約束優化以及抗遮蔽能力,提升模型之三維人體骨架表現。為此我們提出了一種基於約束優化的時空間注意機制和圖混合網路方法。我們首先透過輸入RGB圖片,利用現有的二維骨架模型輸出二維姿勢,同時建立一個固定長度的儲存庫以儲存預測結果。接著利用所存儲之二維姿勢序列,設計一個二維姿勢提升至三維姿勢的網路,來預測所對應之三維姿勢。為了使模型能抗遮蔽,我們提出運動注意編碼器,透過提取二維姿勢的位置、速度與加速度特徵以進行優化,並設計運動注意解碼器,另外輸出一個運動細化三維姿勢,透過線上共同損失,使兩姿勢能相互學習與優化。而在約束優化部分,我們採用平行化的網路結構,一個分支採用圖卷積注意機制混合模型來捕捉局部特徵,另一分支透過注意機制網路以抓取骨架之時空間全局特徵,並透過彼此相互交換特徵的方式,使各分支能同時捕捉局部和全局的骨架特徵,以達成對骨架的時空間維度施加全局與局部約束的效果。此外,我們在訓練時引入新的損失函數,針對骨架的四肢關節長度、骨架關節點間角度等幾何特性進行約束,以進一步提升模型的性能。在實驗結果部分,我們的方法在兩個公開數據集Human3.6M和MPI-INF-3DHP具有比其他相關研究更加優異的表現,MPJPE分別為38.0 mm與15.6 mm,均達到SOTA。 | zh_TW |
| dc.description.abstract | With the rapid advancement of computer vision and image processing, 3D human pose estimation has found extensive applications across various domains. Previous research focuses on reconstructing skeletons from single or multiple input images, addressing critical factors such as the model's skeleton-constraint ability and robustness to occlusion. Moreover, faster, online, real-time skeleton output broadens the range of scenarios in which these models can be applied. This study aims to improve 3D human skeleton estimation through online detection by strengthening the model's constraint optimization and occlusion robustness. We propose a constraint-optimization method that combines an attention mechanism with a graph convolutional network in both the spatial and temporal domains. We first feed RGB images to an off-the-shelf 2D pose estimator and store the predicted 2D poses in a fixed-size cache. We then design a lifting network that maps the cached 2D pose sequence to the corresponding 3D pose sequence. To make the model robust to occlusion, we introduce a motion-aware encoder, optimized by extracting positional, velocity, and acceleration features from the 2D poses, and design a motion-aware decoder that outputs an additional motion-refined 3D pose sequence; an online mutual loss lets the two predictions learn from and refine each other. For constraint optimization, we adopt a parallel network structure: one branch employs a mixed graph-convolution attention model to capture local features, while the other uses an attention network to extract global spatio-temporal features of the skeleton. Through mutual feature exchange, each branch captures both local and global skeleton features, imposing global and local constraints on the skeleton in both the spatial and temporal dimensions. Additionally, we introduce new loss functions during training that impose geometric constraints, such as a bone-length constraint loss and a joint-angle loss, to further enhance the model's performance. Experimentally, our method markedly outperforms related work on the two public datasets Human3.6M and MPI-INF-3DHP, achieving MPJPEs of 38.0 mm and 15.6 mm, respectively, both state-of-the-art results. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-08-19T17:08:16Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2024-08-19T17:08:17Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Acknowledgements i Chinese Abstract ii ABSTRACT iii CONTENTS v LIST OF FIGURES viii LIST OF TABLES x Chapter 1 Introduction 1 1.1 Background 1 1.2 Motivation 2 1.3 Literature Review 5 1.3.1 2D Human Pose Estimation 5 1.3.2 One-Stage 3D Human Pose Estimation 7 1.3.3 Two-Stage 3D Human Pose Estimation 8 1.4 Contribution 12 1.5 Thesis Organization 12 Chapter 2 Preliminaries 14 2.1 Multi-layer Perceptron 14 2.2 Graph Convolutional Networks 16 2.3 Attention Mechanism 18 2.4 Transformer Encoder 20 Chapter 3 Methodology 23 3.1 Problem Foundation 23 3.2 System Overview 25 3.3 2D HPE 27 3.4 2D-to-3D Lift-Up Model 29 3.4.1 Feature Embedding 29 3.4.2 Motion-Aware Encoder 31 3.4.3 Global Local Constraint Module 35 3.4.4 Three-Level Feature Fusion and Regression Head 48 3.4.5 Motion-Embedding 49 3.4.6 Motion-Aware Decoder 51 3.5 Overall Objective Function 53 Chapter 4 Experimental Results 58 4.1 Environmental Setup 58 4.2 Datasets 59 4.2.1 Human3.6M 59 4.2.2 MPI-INF-3DHP 59 4.3 Implementation Details 60 4.3.1 Training Setup 60 4.3.2 Evaluation Metrics 61 4.4 Comparison with SOTA 62 4.4.1 Results on Human3.6M 62 4.4.2 Results on MPI-INF-3DHP 67 4.5 Ablation Studies 68 4.5.1 Effectiveness of MAE and MAD 69 4.5.2 Effectiveness of Global Local Constraint Modules 69 4.5.3 Effectiveness of GAJBE and Three-Level Fusion 71 4.6 Qualitative Results 73 Chapter 5 Conclusion and Future Work 75 REFERENCE 77 | - |
| dc.language.iso | en | - |
| dc.subject | 三維人體姿態估計 | zh_TW |
| dc.subject | 圖卷積 | zh_TW |
| dc.subject | 注意機制 | zh_TW |
| dc.subject | 抗遮蔽 | zh_TW |
| dc.subject | 約束 | zh_TW |
| dc.subject | constraints | en |
| dc.subject | attention mechanism | en |
| dc.subject | 3D Human Pose Estimation | en |
| dc.subject | graph convolutional network | en |
| dc.subject | robust to occlusion | en |
| dc.title | 基於全局與局部約束優化之時空間注意機制與圖卷積混合網路用於抗遮蔽之線上三維人體姿態估計 | zh_TW |
| dc.title | Global-Local Constraint-Based Optimized Spatio-Temporal GCN-Attention Mixed Model for Online 3D Human Pose Estimation Robust to Occlusion | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 112-2 | - |
| dc.description.degree | Master's | - |
| dc.contributor.oralexamcommittee | 王鈺強;傅楸善;黃世勳;黃正民 | zh_TW |
| dc.contributor.oralexamcommittee | Yu-Chiang Frank Wang;Chiou-Shann Fuh;Shih-Shinh Huang;Cheng-Ming Huang | en |
| dc.subject.keyword | 三維人體姿態估計, 約束, 抗遮蔽, 注意機制, 圖卷積 | zh_TW |
| dc.subject.keyword | 3D Human Pose Estimation, constraints, robust to occlusion, attention mechanism, graph convolutional network | en |
| dc.relation.page | 83 | - |
| dc.identifier.doi | 10.6342/NTU202403802 | - |
| dc.rights.note | Authorized (restricted to campus access) | - |
| dc.date.accepted | 2024-08-10 | - |
| dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
| dc.contributor.author-dept | Department of Electrical Engineering | - |
| dc.date.embargo-lift | 2027-08-14 | - |
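The abstract describes buffering 2D detections from an off-the-shelf estimator in a fixed-size cache and feeding the motion-aware encoder positional, velocity, and acceleration features of the 2D poses. The sketch below illustrates that input pipeline only; the window length of 27 frames and the 17-joint skeleton are hypothetical choices for illustration, not values taken from the thesis.

```python
from collections import deque
import numpy as np

# Hypothetical fixed-size cache for online inference: once full, appending a
# new frame evicts the oldest one, mirroring the fixed-length store described
# in the abstract. Window length is an illustrative assumption.
CACHE_LEN = 27
cache = deque(maxlen=CACHE_LEN)

def motion_features(poses: np.ndarray) -> np.ndarray:
    """Stack position, velocity, and acceleration of a (T, J, 2) pose sequence.

    Velocity and acceleration are first and second finite differences along
    time, front-padded so all three tensors share the shape (T, J, 2).
    """
    vel = np.diff(poses, n=1, axis=0, prepend=poses[:1])   # frame-to-frame motion
    acc = np.diff(vel, n=1, axis=0, prepend=vel[:1])       # change of motion
    return np.concatenate([poses, vel, acc], axis=-1)      # (T, J, 6)

# Usage: push per-frame 2D detections, then featurize the buffered window.
for t in range(30):
    cache.append(np.full((17, 2), float(t)))   # dummy 17-joint 2D poses
window = np.stack(cache)                       # (27, 17, 2), oldest frames evicted
feats = motion_features(window)                # (27, 17, 6)
```

A `deque` with `maxlen` gives O(1) eviction, which suits the online setting the thesis targets; a real pipeline would append estimator outputs instead of dummy arrays.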
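The training objective adds geometric constraints on limb bone lengths and inter-joint angles. A minimal sketch of a bone-length consistency term follows; the 16-bone parent/child edge list over 17 joints is a hypothetical skeleton layout, and the thesis's actual joint indexing and loss weighting may differ.

```python
import numpy as np

# Hypothetical (parent, child) bone list over a 17-joint skeleton;
# an illustrative assumption, not the thesis's exact definition.
BONES = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5), (5, 6),
         (0, 7), (7, 8), (8, 9), (9, 10),
         (8, 11), (11, 12), (12, 13), (8, 14), (14, 15), (15, 16)]

def bone_lengths(pose3d) -> np.ndarray:
    """Euclidean length of each bone in a (17, 3) pose."""
    p = np.asarray(pose3d, dtype=float)
    return np.array([np.linalg.norm(p[c] - p[a]) for a, c in BONES])

def bone_length_loss(pred, gt) -> float:
    """Mean absolute deviation between predicted and reference bone lengths."""
    return float(np.mean(np.abs(bone_lengths(pred) - bone_lengths(gt))))

# The term is invariant to global translation: shifting a pose rigidly
# leaves every bone length, and hence the loss, unchanged.
pose = np.arange(51, dtype=float).reshape(17, 3)
shifted = pose + np.array([1.0, -2.0, 0.5])
loss = bone_length_loss(shifted, pose)
```

Because it compares only pairwise joint distances, this term penalizes anatomically implausible limb stretching without constraining where the skeleton sits in camera space, which is why such losses are typically combined with a positional MPJPE-style loss.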
Appears in Collections: Department of Electrical Engineering

Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| ntu-112-2.pdf (not authorized for public access) | 2.55 MB | Adobe PDF | View/Open |

All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
