Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97952
| Title: | 基於視窗令牌剪枝的三維 Swin Transformer 在多視角三維人體姿態估計中的關鍵區域學習 3D Space Token Swinformer: Learning Critical Regions with Window-based Token Pruning in 3D Swin Transformer for Multi-View 3D Human Pose Estimation |
| Author: | 謝宗翰 Tsung-Han Hsieh |
| Advisor: | 王勝德 Sheng-De Wang |
| Keywords: | Multi-person multi-view 3D human pose estimation, Swin Transformer, Token pruning, Spatial self-attention, 3D human pose estimation, Window attention, Space attention |
| Publication Year: | 2025 |
| Degree: | Master's |
| Abstract: | This work introduces 3D Space Token Swinformer, a 3D Swin Transformer method centered on spatial attention, for multi-view 3D human pose estimation. Regions of 3D space differ in importance. We build this observation into the 3D Swin Transformer architecture by computing spatial attention to rank regions, then remove low-importance windows (regions) and spatial tokens so that only the most critical areas remain, reducing computation while maintaining performance.
We first partition the latent feature volume into non-overlapping tokens, where each token represents a small region of 3D space. We then use 3D Swin RootNet to locate the human center point and 3D Swin PoseNet to predict body joints. To select critical regions, we score each window by its window attention and propose a window selection module that removes low-importance windows. A top-K token pruning module then selects the most important tokens from the retained windows, further emphasizing the critical regions. We evaluate our method on the Panoptic and Shelf datasets, and our model achieves competitive results both before and after token pruning. Visualizations show that window attention identifies key regions in 3D space (e.g., around the human body), while the token selection module further refines and retains the most important tokens. Our study indicates that, in multi-view 3D human pose estimation, the critical regions are concentrated primarily around the human body. Integrating 3D Swin PoseNet with the token selection module to retain the corresponding key tokens improves both the accuracy and the efficiency of human pose estimation. |
| URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97952 |
| DOI: | 10.6342/NTU202500958 |
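| Full-Text License: | Authorization granted (access restricted to campus) |
| Full-Text Release Date: | 2030-01-01 |
| Appears in Collections: | Department of Electrical Engineering |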
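
The two-stage pruning described in the abstract (drop low-importance windows, then keep the top-K tokens from the windows that remain) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the `keep_ratio` and `k` parameters, and the use of mean token attention as the window score are all assumptions made for illustration.

```python
from typing import List, Tuple

# attn[w][t] is a hypothetical attention score for token t inside window w.
Attention = List[List[float]]


def window_scores(attn: Attention) -> List[float]:
    """Score each window by the mean attention of its tokens (assumed proxy)."""
    return [sum(window) / len(window) for window in attn]


def select_windows(attn: Attention, keep_ratio: float) -> List[int]:
    """Return the indices of the highest-scoring windows, in ascending order."""
    scores = window_scores(attn)
    n_keep = max(1, int(len(scores) * keep_ratio))  # always keep at least one
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


def top_k_tokens(attn: Attention, kept: List[int], k: int) -> List[Tuple[int, int]]:
    """Return (window, token) pairs for the k highest-scoring tokens in kept windows."""
    candidates = [(attn[w][t], w, t) for w in kept for t in range(len(attn[w]))]
    candidates.sort(key=lambda c: c[0], reverse=True)
    return sorted((w, t) for _, w, t in candidates[:k])


# Toy example: four windows of two tokens each.
attn = [[0.1, 0.2], [0.9, 0.7], [0.05, 0.05], [0.6, 0.3]]
kept = select_windows(attn, keep_ratio=0.5)
tokens = top_k_tokens(attn, kept, k=3)
print(kept)    # [1, 3]
print(tokens)  # [(1, 0), (1, 1), (3, 0)]
```

In this toy input, windows 1 and 3 have the highest mean scores and survive the first stage; the second stage then discards the weakest remaining token. A real model would operate on tensors of attention weights rather than nested lists.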
Files in this item:
| File | Size | Format | |
|---|---|---|---|
| ntu-113-2.pdf (not authorized for public access) | 16.39 MB | Adobe PDF | View/Open |
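Unless their copyright terms are specifically stated otherwise, all items in this system are protected by copyright, with all rights reserved.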
