Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/94825

| Title: | 基於全局與局部約束優化之時空間注意機制與圖卷積混合網路用於抗遮蔽之線上三維人體姿態估計 Global-Local Constraint-Based Optimized Spatio-Temporal GCN-Attention Mixed Model for Online 3D Human Pose Estimation Robust to Occlusion |
| Author: | 羅恩至 En-Jhih Lo |
| Advisor: | 傅立成 Li-Chen Fu |
| Keywords: | 3D Human Pose Estimation, constraints, robust to occlusion, attention mechanism, graph convolutional network |
| Publication Year: | 2024 |
| Degree: | Master |
| Abstract: | With the rapid advancement of computer vision and image processing, 3D Human Pose Estimation has found extensive applications across various domains. Previous research has focused primarily on reconstructing skeletons from single or multiple input images, and two factors are critical to performance: the model's ability to enforce skeleton constraints and its robustness to occlusion. Moreover, producing skeleton outputs faster and in an online, real-time manner broadens the applicability of these models to diverse scenarios.
This study aims to improve 3D human skeleton estimation through online detection by strengthening the model's constraint optimization and occlusion robustness. We propose a constraint-optimization method that combines an attention mechanism with a graph convolutional network in both the spatial and temporal domains. We first feed RGB images to an off-the-shelf 2D pose estimator to obtain 2D poses, while maintaining a fixed-size cache to store the predicted results. We then design a lifting network that converts the stored 2D pose sequences into the corresponding 3D pose sequences. To make the model robust to occlusion, we introduce a motion-aware encoder, optimized by extracting positional, velocity, and acceleration features from the 2D poses, and design a motion-aware decoder that outputs an additional motion-refined 3D pose sequence. Through an online mutual loss, these two predicted poses learn from and optimize each other. For constraint optimization, we adopt a parallel network structure: one branch employs a graph convolutional attention mixed model to capture local features, while the other branch uses an attention network to extract the skeleton's global spatio-temporal features. Through global-local feature interaction, each branch captures both local and global skeleton features, imposing global and local constraints on the skeleton in both the spatial and temporal dimensions. Additionally, we introduce new loss functions during training that impose geometric constraints, such as a bone-length constraint loss and a joint-angle loss, to further enhance the model's performance. In our experiments, the proposed method markedly outperforms other related studies on two public datasets, Human3.6M and MPI-INF-3DHP, with MPJPEs of 38.0 mm and 15.6 mm, respectively, both achieving state-of-the-art results. |
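The evaluation metric (MPJPE) and one of the geometric constraint losses mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the thesis's implementation: it assumes a Human3.6M-style 17-joint skeleton, and the `PARENTS` kinematic table is a hypothetical placeholder, since the abstract does not specify the joint ordering or the exact form of the losses.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: average Euclidean distance (in mm,
    if inputs are in mm) between predicted and ground-truth 3D joints.
    pred, gt: arrays of shape (frames, joints, 3)."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

# Hypothetical parent index per joint for a 17-joint Human3.6M-style
# skeleton (-1 marks the root); the thesis's actual layout may differ.
PARENTS = [-1, 0, 1, 2, 0, 4, 5, 0, 7, 8, 9, 8, 11, 12, 8, 14, 15]

def bone_length_loss(pred, gt, parents=PARENTS):
    """Sketch of a bone-length constraint loss: penalize the deviation
    of each predicted bone length from the ground-truth bone length."""
    child = np.arange(1, len(parents))          # every non-root joint
    parent = np.asarray(parents[1:])            # its parent joint
    pred_bones = np.linalg.norm(pred[:, child] - pred[:, parent], axis=-1)
    gt_bones = np.linalg.norm(gt[:, child] - gt[:, parent], axis=-1)
    return float(np.mean(np.abs(pred_bones - gt_bones)))
```

A rigid 1 mm translation of the whole skeleton yields an MPJPE of exactly 1.0 mm but a bone-length loss of 0, illustrating why such geometric terms constrain skeleton shape independently of absolute position error.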
| URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/94825 |
| DOI: | 10.6342/NTU202403802 |
| Full-Text Permission: | Authorized (campus access only) |
| Full-Text Release Date: | 2027-08-14 |
| Appears in Collections: | Department of Electrical Engineering |
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| ntu-112-2.pdf (restricted, not publicly accessible) | 2.55 MB | Adobe PDF | View/Open |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
