Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97952
Full metadata record (each entry: DC field, value, language)
dc.contributor.advisor: 王勝德 (zh_TW)
dc.contributor.advisor: Sheng-De Wang (en)
dc.contributor.author: 謝宗翰 (zh_TW)
dc.contributor.author: Tsung-Han Hsieh (en)
dc.date.accessioned: 2025-07-23T16:13:31Z
dc.date.available: 2025-07-24
dc.date.copyright: 2025-07-23
dc.date.issued: 2025
dc.date.submitted: 2025-07-03
dc.identifier.citation: [1] EasyMocap: make human motion capture easier. GitHub, 2021.
[2] A. Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[3] V. Belagiannis, S. Amin, M. Andriluka, B. Schiele, N. Navab, and S. Ilic. 3d pictorial structures for multiple human pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1669–1676, 2014.
[4] V. Belagiannis, S. Amin, M. Andriluka, B. Schiele, N. Navab, and S. Ilic. 3d pictorial structures revisited: Multiple human pose estimation. IEEE transactions on pattern analysis and machine intelligence, 38(10):1929–1942, 2015.
[5] L. Bridgeman, M. Volino, J.-Y. Guillemaut, and A. Hilton. Multi-person 3d pose estimation and tracking in sports. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 0–0, 2019.
[6] B.-H. Chen and C.-c. Tsai. 3dsa: Multi-view 3d human pose estimation with 3d space attention mechanisms. In European Conference on Computer Vision, pages 323–339. Springer, 2024.
[7] H. Chen, P. Guo, P. Li, G. H. Lee, and G. Chirikjian. Multi-person 3d pose estimation in crowded scenes based on multi-view geometry. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, pages 541–557. Springer, 2020.
[8] X. Chen, Z. Liu, H. Tang, L. Yi, H. Zhao, and S. Han. Sparsevit: Revisiting activation sparsity for efficient high-resolution vision transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2061–2070, 2023.
[9] Y. Chen, R. Gu, O. Huang, and G. Jia. Vtp: volumetric transformer for multi-view multi-person 3d pose estimation. Applied Intelligence, 53(22):26568–26579, 2023.
[10] B. Cheng, A. Schwing, and A. Kirillov. Per-pixel classification is not all you need for semantic segmentation. Advances in neural information processing systems, 34:17864–17875, 2021.
[11] R. Choudhury, K. M. Kitani, and L. A. Jeni. Tempo: Efficient multi-view pose estimation, tracking, and forecasting. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 14704–14714, 2023.
[12] J. Dong, Q. Fang, W. Jiang, Y. Yang, Q. Huang, H. Bao, and X. Zhou. Fast and robust multi-person 3d pose estimation and tracking from multiple views. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):6981–6992, 2021.
[13] J. Dong, W. Jiang, Q. Huang, H. Bao, and X. Zhou. Fast and robust multi-person 3d pose estimation from multiple views. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7792–7801, 2019.
[14] S. Ershadi-Nasab, E. Noury, S. Kasaei, and E. Sanaei. Multiple human 3d pose estimation from multiview images. Multimedia Tools and Applications, 77:15573–15601, 2018.
[15] R. Hartley and A. Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2003.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[17] C. Huang, S. Jiang, Y. Li, Z. Zhang, J. Traish, C. Deng, S. Ferguson, and R. Y. Da Xu. End-to-end dynamic matching network for multi-view multi-person 3d pose estimation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVIII 16, pages 477–493. Springer, 2020.
[18] K. Iskakov, E. Burkov, V. Lempitsky, and Y. Malkov. Learnable triangulation of human pose. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7718–7727, 2019.
[19] H. Joo, H. Liu, L. Tan, L. Gui, B. Nabbe, I. Matthews, T. Kanade, S. Nobuhara, and Y. Sheikh. Panoptic studio: A massively multiview system for social motion capture. In Proceedings of the IEEE International Conference on Computer Vision, pages 3334–3342, 2015.
[20] Y. Li, S. Zhang, Z. Wang, S. Yang, W. Yang, S.-T. Xia, and E. Zhou. Token-pose: Learning keypoint tokens for human pose estimation. In Proceedings of the IEEE/CVF International conference on computer vision, pages 11313–11322, 2021.
[21] Z. Liao, J. Zhu, C. Wang, H. Hu, and S. L. Waslander. Multiple view geometry transformers for 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 708–717, 2024.
[22] J. Lin and G. H. Lee. Multi-view multi-person 3d pose estimation with plane sweep stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11886–11895, 2021.
[23] Y. Liu, M. Gehrig, N. Messikommer, M. Cannici, and D. Scaramuzza. Revisiting token pruning for object detection and instance segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2658–2668, 2024.
[24] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
[25] Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, and H. Hu. Video swin transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3202–3211, 2022.
[26] H. Ma, Z. Wang, Y. Chen, D. Kong, L. Chen, X. Liu, X. Yan, H. Tang, and X. Xie. Ppt: token-pruned pose transformer for monocular and multi-view human pose estimation. In European Conference on Computer Vision, pages 424–442. Springer, 2022.
[27] W. Mao, Y. Ge, C. Shen, Z. Tian, X. Wang, and Z. Wang. Tfpose: Direct human pose estimation with transformers. arXiv preprint arXiv:2103.15320, 2021.
[28] Y. Rao, W. Zhao, B. Liu, J. Lu, J. Zhou, and C.-J. Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. Advances in neural information processing systems, 34:13937–13949, 2021.
[29] N. D. Reddy, L. Guigues, L. Pishchulin, J. Eledath, and S. G. Narasimhan. Tessetrack: End-to-end learnable multi-person articulated 3d pose tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15190–15200, 2021.
[30] E. Remelli, S. Han, S. Honari, P. Fua, and R. Wang. Lightweight multi-view 3d pose estimation through camera-disentangled representation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6040–6049, 2020.
[31] V. Srivastav, K. Chen, and N. Padoy. Selfpose3d: Self-supervised multi-person multi-view 3d pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2502–2512, 2024.
[32] J. Su, C. Wang, X. Ma, W. Zeng, and Y. Wang. Virtualpose: Learning generalizable 3d human pose models from virtual data. In European Conference on Computer Vision, pages 55–71. Springer, 2022.
[33] X. Sun, B. Xiao, F. Wei, S. Liang, and Y. Wei. Integral human pose regression. In Proceedings of the European conference on computer vision (ECCV), pages 529–545, 2018.
[34] H. Tu, C. Wang, and W. Zeng. Voxelpose: Towards multi-camera 3d human pose estimation in wild environment. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pages 197–212. Springer, 2020.
[35] H. Wang, J. Liu, J. Tang, G. Wu, B. Xu, Y. Chou, and Y. Wang. Gtpt: Group-based token pruning transformer for efficient human pose estimation. In European Conference on Computer Vision, pages 213–230. Springer, 2024.
[36] J. Wang, F. Yang, B. Li, W. Gou, D. Yan, A. Zeng, Y. Gao, J. Wang, Y. Jing, and R. Zhang. Freeman: Towards benchmarking 3d human pose estimation under real-world conditions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21978–21988, 2024.
[37] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
[38] S. Wu, S. Jin, W. Liu, L. Bai, C. Qian, D. Liu, and W. Ouyang. Graph-based 3d multi-person pose estimation using multi-view images. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11148–11157, 2021.
[39] Y. Xu, J. Zhang, Q. Zhang, and D. Tao. Vitpose: Simple vision transformer baselines for human pose estimation. Advances in Neural Information Processing Systems, 35:38571–38584, 2022.
[40] S. Yang, Z. Quan, M. Nie, and W. Yang. Transpose: Keypoint localization via transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11802–11812, 2021.
[41] H. Ye, W. Zhu, C. Wang, R. Wu, and Y. Wang. Faster voxelpose: Real-time 3d human pose estimation by orthographic projection. In European Conference on Computer Vision, pages 142–159. Springer, 2022.
[42] J. Zhang, Y. Cai, S. Yan, J. Feng, et al. Direct multi-view multi-person 3d pose estimation. Advances in Neural Information Processing Systems, 34:13153–13164, 2021.
[43] Y. Zhang, L. An, T. Yu, X. Li, K. Li, and Y. Liu. 4d association graph for real-time multi-person motion capture using multiple video cameras. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1324–1333, 2020.
[44] Y. Zhang, C. Wang, X. Wang, W. Liu, and W. Zeng. Voxeltrack: Multi-person 3d human pose estimation and tracking in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2):2613–2626, 2022.
[45] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97952
dc.description.abstract (zh_TW): 本研究提出 3D Token Swinformer，一種注重空間注意力的 3D Swin Transformer 新方法，應用於多視角三維人體姿態估計。在三維空間中，針對區域重要性不同的特性，透過實現空間注意力計算來提升模型的效能，利用注意力來劃分空間中的重要性。再來，透過移除低重要性的窗口及空間令牌，保留重要的區域，在降低計算量的同時維持效能。
本方法首先將潛在特徵體素劃分為不重疊的令牌，其中每個令牌相當於空間中的小區域。接著，使用 3D Swin RootNet 定位人體中心點，並利用 3D Swin PoseNet 預測人體關節。此外，為了選擇關鍵區域，我們透過計算窗口注意力來評估各窗口的重要性，並提出窗口選擇模組來移除低重要性的窗口。
隨後，進一步引入 Top-K 令牌剪枝模組，從保留的窗口中篩選關鍵的令牌，以進一步強化對關鍵區域的關注。本研究使用 Panoptic 及 Shelf 資料集進行評估，結果顯示無論在令牌剪枝前後，皆達到了具競爭力的表現。視覺化的成果也證實，透過窗口注意力機制有效識別空間中的關鍵區域（例如人體周圍），而令牌剪枝模組進一步精煉並保留最重要的令牌，從而同時提升人體姿態估計的準確性與效率。
dc.description.abstract (en): In this work, we introduce the 3D Space Token Swinformer for multi-view 3D human pose estimation. In 3D space, different regions exhibit varying levels of importance. We introduce this notion into the 3D Swin Transformer architecture and remove unimportant windows (regions) so that only the most critical areas are retained. We first partition the latent feature volume into non-overlapping tokens, where each token represents a small region in 3D space.
We then use a 3D Swin RootNet to locate each person's center point and a 3D Swin PoseNet to predict the body joints. We evaluate the importance of each window by computing window attention scores and propose a window selection module that removes low-importance windows (regions). Subsequently, we introduce a top-K selection module that picks the most important tokens from the retained windows, further emphasizing the critical regions. We evaluate our method on the Panoptic dataset, and our model achieves competitive results both before and after model compression. Visualization results show that window attention effectively identifies key regions in 3D space (e.g., around the human body), while the token selection module further refines and retains the most important tokens. Our study shows that, in multi-view 3D human pose estimation, the critical regions are concentrated mainly around the human body. We further integrate the 3D Swin PoseNet with the token selection module to retain the corresponding key tokens, improving both the accuracy and efficiency of human pose estimation.
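The abstract describes a two-stage compression scheme: score each non-overlapping 3D window by its window attention, drop low-importance windows, then keep only the top-K tokens inside the retained windows. The sketch below is a minimal PyTorch illustration of that flow, not the thesis's implementation; the helper names (window_partition, prune), the tensor shapes, and the global-query scoring rule are assumptions made for this example.

```python
# A minimal sketch (assumed shapes and scoring; not the authors' released code).
import torch


def window_partition(volume, w):
    """Split a (B, C, D, H, W) feature volume into non-overlapping w*w*w windows.

    Returns tokens of shape (B, num_windows, w**3, C); each token is one voxel cell.
    """
    B, C, D, H, W = volume.shape
    x = volume.view(B, C, D // w, w, H // w, w, W // w, w)
    x = x.permute(0, 2, 4, 6, 3, 5, 7, 1).contiguous()  # (B, nD, nH, nW, w, w, w, C)
    return x.view(B, -1, w ** 3, C)


def prune(volume, w=4, keep_windows=0.5, keep_tokens=0.5):
    """Window selection followed by top-K token pruning on a voxel feature volume."""
    tokens = window_partition(volume, w)                 # (B, nW, T, C)
    B, nW, T, C = tokens.shape
    flat = tokens.reshape(B, nW * T, C)

    # Stand-in importance score: attention of a global (mean-token) query over all
    # tokens; a window's score is the total attention mass its tokens receive.
    q = flat.mean(dim=1, keepdim=True)                   # (B, 1, C)
    attn = torch.softmax(q @ flat.transpose(1, 2) / C ** 0.5, dim=-1)
    token_score = attn.reshape(B, nW, T)
    win_score = token_score.sum(dim=-1)                  # (B, nW)

    # Window selection: keep the highest-scoring windows, drop the rest.
    kW = max(1, int(keep_windows * nW))
    win_idx = win_score.topk(kW, dim=-1).indices         # (B, kW)
    kept = torch.gather(tokens, 1, win_idx[..., None, None].expand(-1, -1, T, C))
    kept_scores = torch.gather(token_score, 1, win_idx[..., None].expand(-1, -1, T))

    # Token pruning: within the retained windows, keep only the top-K tokens.
    kT = max(1, int(keep_tokens * kW * T))
    flat_kept = kept.reshape(B, kW * T, C)
    tok_idx = kept_scores.reshape(B, kW * T).topk(kT, dim=-1).indices
    return torch.gather(flat_kept, 1, tok_idx[..., None].expand(-1, -1, C))


if __name__ == "__main__":
    feat = torch.randn(2, 32, 16, 16, 16)                # a latent feature volume
    print(prune(feat).shape)                             # torch.Size([2, 1024, 32])
```

In the actual method the importance scores would come from the 3D Swin blocks' own window attention, and the surviving tokens would feed the 3D Swin RootNet and 3D Swin PoseNet heads; the stand-in scorer above only mimics the keep-the-critical-regions behaviour, with the compute savings coming from running attention over far fewer tokens.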
dc.description.provenance (en): Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-07-23T16:13:31Z. No. of bitstreams: 0
dc.description.provenance (en): Made available in DSpace on 2025-07-23T16:13:31Z (GMT). No. of bitstreams: 0
dc.description.tableofcontents: Acknowledgements i
摘要 iii
Abstract v
Contents vii
List of Figures ix
List of Tables xiii

Chapter 1 Introduction 1

Chapter 2 Related work 5
2.1 Voxel-based 3D human pose estimation method 5
2.2 Transformer in human pose estimation 6
2.3 Token pruning in Vision Transformers 7

Chapter 3 Method 9
3.1 3D token Swinformer 9
3.2 Token selection module 12

Chapter 4 Experiment 17
4.1 Implementation details 17
4.2 Datasets and evaluation metrics 17
4.3 Compare with existing method on Panoptic datasets 18
4.4 Compare with existing method on Shelf datasets 19
4.5 Ablation study 19
4.6 Token pruning on Panoptic datasets 22
4.7 Head pruning on Panoptic datasets 26

Chapter 5 Visualization 29
5.1 Posenet visualization results on Panoptic datasets 29
5.2 Rootnet visualization results on Panoptic datasets 32

Chapter 6 Conclusion 35

References 37
dc.language.iso: en
dc.subject: 空間自注意力 (zh_TW)
dc.subject: 令牌剪枝 (zh_TW)
dc.subject: Swin Transformer (zh_TW)
dc.subject: 多人多視角三維人體姿態估計 (zh_TW)
dc.subject: Space attention (en)
dc.subject: 3D Human pose estimation (en)
dc.subject: Swin Transformer (en)
dc.subject: Token Pruning (en)
dc.subject: Window attention (en)
dc.title: 基於視窗令牌剪枝的三維 Swin Transformer 在多視角三維人體姿態估計中的關鍵區域學習 (zh_TW)
dc.title: 3D Space Token Swinformer: Learning Critical Regions with Window-based Token Pruning in 3D Swin Transformer for Multi-View 3D Human Pose Estimation (en)
dc.type: Thesis
dc.date.schoolyear: 113-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 雷欽隆;蔡家齊;余承叡 (zh_TW)
dc.contributor.oralexamcommittee: Chin-Laung Lei; Chia-Chi Tsai; Cheng-Juei Yu (en)
dc.subject.keyword: 多人多視角三維人體姿態估計, Swin Transformer, 令牌剪枝, 空間自注意力 (zh_TW)
dc.subject.keyword: 3D Human pose estimation, Swin Transformer, Token Pruning, Window attention, Space attention (en)
dc.relation.page: 43
dc.identifier.doi: 10.6342/NTU202500958
dc.rights.note: 同意授權(限校園內公開) (authorized for release; restricted to on-campus access)
dc.date.accepted: 2025-07-04
dc.contributor.author-college: College of Electrical Engineering and Computer Science
dc.contributor.author-dept: Department of Electrical Engineering
dc.date.embargo-lift: 2030-01-01
Appears in collections: Department of Electrical Engineering

Files in this item:
File: ntu-113-2.pdf (restricted access; not publicly available), 16.39 MB, Adobe PDF

