3D Space Token Swinformer: Learning Critical Regions with Window-based Token Pruning in 3D Swin Transformer for Multi-View 3D Human Pose Estimation

謝宗翰; Tsung-Han Hsieh

Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97952

Title:	3D Space Token Swinformer: Learning Critical Regions with Window-based Token Pruning in 3D Swin Transformer for Multi-View 3D Human Pose Estimation
Authors:	謝宗翰 Tsung-Han Hsieh
Advisor:	王勝德 Sheng-De Wang
Keyword:	Multi-person multi-view 3D human pose estimation, Swin Transformer, token pruning, spatial self-attention; 3D human pose estimation, Swin Transformer, token pruning, window attention, space attention
Publication Year :	2025
Degree:	Master's
Abstract:	This work introduces 3D Space Token Swinformer, a 3D Swin Transformer method that emphasizes spatial attention for multi-view 3D human pose estimation. Regions of 3D space differ in importance, so we compute spatial attention to rank them, then remove low-importance windows and spatial tokens to retain the critical areas, reducing computation while maintaining performance. We first partition the latent feature volume into non-overlapping tokens, where each token represents a small region of 3D space. We then use 3D Swin RootNet to locate human center points and 3D Swin PoseNet to predict body joints. To select critical regions, we evaluate the importance of each window by computing window attention scores and propose a window selection module that removes low-importance windows (regions). A top-K token pruning module then selects the most important tokens from the retained windows, further emphasizing the critical regions. We evaluate our method on the Panoptic and Shelf datasets, and our model achieves competitive results both before and after token pruning (model compression). Visualization results show that window attention identifies key regions in 3D space (e.g., around the human body), while the token selection module further refines and retains the most important tokens. Our study demonstrates that, in multi-view 3D human pose estimation, the critical regions are primarily concentrated around the human body. We further integrate the 3D Swin PoseNet with the token selection module to retain the corresponding key tokens, thereby improving both the accuracy and efficiency of human pose estimation.
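The two-stage selection described in the abstract (drop low-importance windows, then keep the top-K tokens in each surviving window) can be illustrated with a minimal sketch. This is not the thesis implementation: the record does not say how window attention scores are computed or which ratios are used, so the mean-of-token-scores window score, the `keep_ratio` parameter, the value of `k`, and all function names below are assumptions made only for illustration.

```python
from typing import Dict, List


def select_windows(window_scores: List[float], keep_ratio: float) -> List[int]:
    """Return indices of the highest-scoring windows, in ascending index order."""
    # Assumption: a fixed fraction of windows is kept; at least one survives.
    n_keep = max(1, round(len(window_scores) * keep_ratio))
    ranked = sorted(range(len(window_scores)),
                    key=lambda i: window_scores[i], reverse=True)
    return sorted(ranked[:n_keep])


def top_k_tokens(token_scores: List[float], k: int) -> List[int]:
    """Return indices of the k highest-scoring tokens inside one window."""
    k = min(k, len(token_scores))
    ranked = sorted(range(len(token_scores)),
                    key=lambda i: token_scores[i], reverse=True)
    return sorted(ranked[:k])


def prune(scores_per_window: List[List[float]],
          keep_ratio: float, k: int) -> Dict[int, List[int]]:
    """Map each kept window index to the token indices retained in it."""
    # Assumption: a window's importance is the mean of its token scores.
    window_scores = [sum(s) / len(s) for s in scores_per_window]
    kept_windows = select_windows(window_scores, keep_ratio)
    return {w: top_k_tokens(scores_per_window[w], k) for w in kept_windows}


# Hypothetical attention scores: 4 windows of 4 tokens each.
scores = [
    [0.01, 0.02, 0.01, 0.03],  # mostly empty space
    [0.40, 0.10, 0.35, 0.05],  # near a person
    [0.02, 0.01, 0.02, 0.01],  # mostly empty space
    [0.20, 0.30, 0.05, 0.25],  # near a person
]
kept = prune(scores, keep_ratio=0.5, k=2)
print(kept)  # {1: [0, 2], 3: [1, 3]}
```

In this toy input, the two windows with low scores are discarded entirely, and each remaining window keeps only its two strongest tokens, which mirrors the abstract's claim that the retained tokens concentrate around the human body.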
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97952
DOI:	10.6342/NTU202500958
Fulltext Rights:	Authorization granted (access limited to campus)
Embargo Lift Date:	2030-01-01
Appears in Collections:	Department of Electrical Engineering

Files in This Item:

File	Size	Format
ntu-113-2.pdf (Restricted Access)	16.39 MB	Adobe PDF
