Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97952
Title: 基於視窗令牌剪枝的三維 Swin Transformer 在多視角三維人體姿態估計中的關鍵區域學習
3D Space Token Swinformer: Learning Critical Regions with Window-based Token Pruning in 3D Swin Transformer for Multi-View 3D Human Pose Estimation
Authors: 謝宗翰
Tsung-Han Hsieh
Advisor: 王勝德
Sheng-De Wang
Keyword: Multi-person multi-view 3D human pose estimation, Swin Transformer, Token Pruning, Window attention, Space attention
Publication Year: 2025
Degree: Master's
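Abstract: This study proposes 3DTokenSwinformer, a new 3D Swin Transformer method that emphasizes spatial attention, applied to multi-view 3D human pose estimation. Because regions in 3D space differ in importance, the method computes spatial attention to improve model performance and uses that attention to rank the importance of regions in space. It then removes low-importance windows and spatial tokens and retains the important regions, reducing computation while maintaining performance.
The method first partitions the latent feature voxels into non-overlapping tokens, each corresponding to a small region of space. It then uses 3DSwinRootNet to locate human center points and 3D Swin PoseNet to predict human joints. In addition, to select critical regions, we evaluate the importance of each window by computing window attention and propose a window selection module that removes low-importance windows.
Next, a Top-K token pruning module is introduced to select key tokens from the retained windows, further strengthening the focus on critical regions. The method is evaluated on the Panoptic and Shelf datasets, and the results show competitive performance both before and after token pruning. Visualizations also confirm that the window attention mechanism effectively identifies critical regions in space (for example, around the human body), while the token pruning module further refines and retains the most important tokens, thereby improving both the accuracy and the efficiency of human pose estimation.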
In this work, we introduce 3D space token Swinformer for multi-view 3D human pose estimation. In 3D space, different regions exhibit varying levels of importance. We introduce this concept into the 3D Swin Transformer architecture and remove unimportant windows (regions) to retain the most critical areas. We first partition the latent feature volume into non-overlapping tokens, where each token represents a small region in 3D space.
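The partitioning step described above can be illustrated with a minimal sketch. It is not the thesis implementation: the function name `partition_into_tokens`, the cubic patch size, the single-channel nested-list volume, and the toy 4x4x4 input are all assumptions made for illustration.

```python
from itertools import product


def partition_into_tokens(volume, patch):
    """Split a D x H x W nested-list volume into non-overlapping patch^3 tokens.

    Returns a dict mapping each token's grid index (i, j, k) to the flat list
    of voxel values it covers. Assumes every dimension is divisible by `patch`
    (a simplification for this sketch).
    """
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    if depth % patch or height % patch or width % patch:
        raise ValueError("volume dimensions must be divisible by patch size")
    tokens = {}
    grid = product(range(depth // patch), range(height // patch), range(width // patch))
    for i, j, k in grid:
        # Gather the patch^3 voxels that belong to this token.
        tokens[(i, j, k)] = [
            volume[i * patch + a][j * patch + b][k * patch + c]
            for a, b, c in product(range(patch), repeat=3)
        ]
    return tokens


# Hypothetical toy volume: each voxel value encodes its own (z, y, x) position.
volume = [[[100 * z + 10 * y + x for x in range(4)] for y in range(4)] for z in range(4)]
tokens = partition_into_tokens(volume, patch=2)
print(len(tokens))        # 8 tokens, each covering a 2x2x2 region
print(tokens[(0, 0, 0)])  # [0, 1, 10, 11, 100, 101, 110, 111]
```

Because the tokens do not overlap, every voxel belongs to exactly one token, which is what allows a token to stand in for a small region of 3D space.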
We then utilize 3D Swin RootNet to locate the human center point and 3D Swin PoseNet to predict body joints. We evaluate the importance of each window by computing window attention scores and propose a window selection module to remove low-importance windows (regions). Subsequently, we introduce a top-K selection module to select the most important tokens from the retained windows, further emphasizing the critical regions. We evaluate our method on the Panoptic dataset, and our model achieves competitive results both before and after model compression. Visualization results demonstrate that our method effectively identifies key regions in 3D space (e.g., around the human body) through window attention, while the token selection module further refines and retains the most important tokens. Our study demonstrates that, in multi-view 3D human pose estimation tasks, the critical regions are primarily concentrated around the human body. We further integrate the 3D Swin PoseNet with the token selection module to retain the corresponding key tokens, thereby improving both the accuracy and efficiency of human pose estimation.
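The two pruning stages in the abstract (window selection, then top-K token selection) can be sketched as follows. The abstract does not specify how window attention scores are computed, so this sketch assumes per-token importance scores are already available and, purely as an assumption, scores a window by the mean of its tokens. The function names, the `keep_ratio` parameter, and the toy scores are hypothetical.

```python
import math


def window_scores_from_tokens(token_scores):
    """Score each window as the mean score of its tokens (an assumed rule)."""
    sums, counts = {}, {}
    for (window_id, _), score in token_scores.items():
        sums[window_id] = sums.get(window_id, 0.0) + score
        counts[window_id] = counts.get(window_id, 0) + 1
    return {w: sums[w] / counts[w] for w in sums}


def select_windows(window_scores, keep_ratio):
    """Keep the highest-scoring fraction of windows; returns sorted window ids."""
    n_keep = max(1, math.ceil(len(window_scores) * keep_ratio))
    ranked = sorted(window_scores, key=window_scores.get, reverse=True)
    return sorted(ranked[:n_keep])


def top_k_tokens(token_scores, kept_windows, k):
    """Keep the k highest-scoring tokens that lie inside the kept windows."""
    candidates = [(key, s) for key, s in token_scores.items() if key[0] in kept_windows]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return sorted(key for key, _ in candidates[:k])


# Hypothetical scores keyed by (window_id, token_id); windows 1 and 3 stand in
# for regions near a person, windows 0 and 2 for mostly empty space.
token_scores = {
    (0, 0): 0.05, (0, 1): 0.10,
    (1, 0): 0.90, (1, 1): 0.40,
    (2, 0): 0.20, (2, 1): 0.10,
    (3, 0): 0.70, (3, 1): 0.80,
}
kept = select_windows(window_scores_from_tokens(token_scores), keep_ratio=0.5)
pruned = top_k_tokens(token_scores, kept, k=3)
print(kept)    # [1, 3]
print(pruned)  # [(1, 0), (3, 0), (3, 1)]
```

Pruning whole windows first shrinks the candidate set cheaply, and the top-K step then trims within the surviving windows, which mirrors the coarse-to-fine order described in the abstract.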
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97952
DOI: 10.6342/NTU202500958
Fulltext Rights: Authorization granted (access restricted to campus)
Embargo lift date: 2030-01-01
Appears in Collections: 電機工程學系 (Department of Electrical Engineering)

Files in This Item:
File            Size      Format      Access
ntu-113-2.pdf   16.39 MB  Adobe PDF   Restricted Access


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
