Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101361
Full metadata record
DC Field: Value (Language)
dc.contributor.advisor: 徐宏民 (zh_TW)
dc.contributor.advisor: Winston H. Hsu (en)
dc.contributor.author: 刁一凡 (zh_TW)
dc.contributor.author: Egil Diau (en)
dc.date.accessioned: 2026-01-27T16:12:42Z
dc.date.available: 2026-01-28
dc.date.copyright: 2026-01-27
dc.date.issued: 2026
dc.date.submitted: 2026-01-22
dc.identifier.citation:
[1] S. Bahl, R. Mendonca, L. Chen, U. Jain, and D. Pathak. Affordances from human videos as a versatile representation for robotics. 2023.
[2] J. Chen, D. Gao, K. Q. Lin, and M. Z. Shou. Affordance grounding from demonstration video to target image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6799–6808, 2023.
[3] L. Chen, X. Chu, X. Zhang, and J. Sun. Simple baselines for image restoration. arXiv preprint arXiv:2204.04676, 2022.
[4] A. Delitzas, A. Takmaz, F. Tombari, R. Sumner, M. Pollefeys, and F. Engelmann. SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[5] H. Geng, Z. Li, Y. Geng, J. Chen, H. Dong, and H. Wang. Partmanip: Learning cross-category generalizable part manipulation policy from point cloud observations. arXiv preprint arXiv:2303.16958, 2023.
[6] K. Grauman, A. Westbury, L. Torresani, K. Kitani, J. Malik, T. Afouras, K. Ashutosh, V. Baiyya, S. Bansal, B. Boote, et al. Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19383–19400, 2024.
[7] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, and R. Girshick. Segment anything. arXiv:2304.02643, 2023.
[8] Y. Li, N. Zhao, J. Xiao, C. Feng, X. Wang, and T.-s. Chua. Laso: Language-guided affordance segmentation on 3d object. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14251–14260, 2024.
[9] Y.-C. Lin, P. Florence, A. Zeng, J. T. Barron, Y. Du, W.-C. Ma, A. Simeonov, A. R. Garcia, and P. Isola. Mira: Mental imagery for robotic affordances. In Conference on Robot Learning, pages 1916–1927. PMLR, 2023.
[10] Y. Liu, A. Gupta, P. Abbeel, and S. Levine. Imitation from observation: Learning to imitate behaviors from raw video via context translation. In 2018 IEEE international conference on robotics and automation (ICRA), pages 1118–1125. IEEE, 2018.
[11] H. Luo, W. Zhai, J. Zhang, Y. Cao, and D. Tao. Learning affordance grounding from exocentric images. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2252–2261, 2022.
[12] R. Mendonca, S. Bahl, and D. Pathak. Structured world models from human videos. 2023.
[13] G. Pavlakos, D. Shan, I. Radosavovic, A. Kanazawa, D. Fouhey, and J. Malik. Reconstructing hands in 3D with transformers. In CVPR, 2024.
[14] Y. Qin, Y.-H. Wu, S. Liu, H. Jiang, R. Yang, Y. Fu, and X. Wang. Dexmv: Imitation learning for dexterous manipulation from human videos, 2021.
[15] A. Rashid, S. Sharma, C. M. Kim, J. Kerr, L. Y. Chen, A. Kanazawa, and K. Goldberg. Language embedded radiance fields for zero-shot task-oriented grasping. In 7th Annual Conference on Robot Learning, 2023.
[16] D. Shan, J. Geng, M. Shu, and D. Fouhey. Understanding human hands in contact at internet scale. 2020.
[17] W. Shen, G. Yang, A. Yu, J. Wong, L. P. Kaelbling, and P. Isola. Distilled feature fields enable few-shot language-guided manipulation. In 7th Annual Conference on Robot Learning, 2023.
[18] A. Simeonov, Y. Du, A. Tagliasacchi, J. B. Tenenbaum, A. Rodriguez, P. Agrawal, and V. Sitzmann. Neural descriptor fields: Se(3)-equivariant object representations for manipulation. 2022.
[19] R. Suvorov, E. Logacheva, A. Mashikhin, A. Remizova, A. Ashukha, A. Silvestrov, N. Kong, H. Goka, K. Park, and V. Lempitsky. Resolution-robust large mask inpainting with fourier convolutions. arXiv preprint arXiv:2109.07161, 2021.
[20] S. Wang, V. Leroy, Y. Cabon, B. Chidlovskii, and J. Revaud. Dust3r: Geometric 3d vision made easy. In CVPR, 2024.
[21] T. Weng, D. Held, F. Meier, and M. Mukadam. Neural grasp distance fields for robot manipulation. IEEE International Conference on Robotics and Automation (ICRA), 2023.
[22] Y. Ye, X. Li, A. Gupta, S. D. Mello, S. Birchfield, J. Song, S. Tulsiani, and S. Liu. Affordance diffusion: Synthesizing hand-object interactions. In CVPR, 2023.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101361
dc.description.abstract: 發展具備泛化能力的機器人技能至今仍極具挑戰性。受到心理學啟發,「可供性」(affordance)已被視為一種具潛力的中介表徵,可用來引導機器人進行物體操控。然而,多數現有研究主要聚焦於來自影片的二維可供性,忽略了攝影機位置、絕對空間座標、深度與幾何結構等關鍵空間資訊。為此,本研究提出一種無需訓練的創新方法,可從第一人稱操作示範影片中建構三維可供性。針對第一人稱影片中缺乏靜態高品質畫面而導致三維重建困難的問題,我們採用三維基礎模型 DUST3R,能在不使用 COLMAP 的情況下,從稀疏影像中重建場景。我們首先以手部偵測技術擷取接觸時間與二維接觸點,再透過 DUST3R 還原互動場景,並將接觸點以高斯熱圖投影至三維空間;同時,我們利用三維手部姿態估計取得手部軌跡,並透過線性回歸整合其時空動態,建構出完整的人物與物體互動歷程。實驗結果顯示,我們的方法能有效應用於 Ego4D-Exo 資料集中的七項真實世界料理任務,展現其於複雜操控場景中建構三維可供性的潛力。 (zh_TW)
dc.description.abstract: Developing robots with generalizable skills remains exceedingly challenging. Drawing from psychology, the concept of affordance has emerged as a promising intermediate representation for guiding robot manipulation. However, prior work has primarily focused on 2D affordances from video, neglecting critical spatial information such as camera position, absolute spatial coordinates, depth, and geometry. In this paper, we present a novel training-free method that constructs 3D affordances from egocentric demonstration videos. To address the shortage of static, high-quality frames for 3D reconstruction in egocentric video, we employ the 3D foundation model DUST3R, which reconstructs scenes from sparse images without requiring COLMAP. We analyze videos with hand detection to identify contact times and 2D contact points, reconstruct the interaction scenes with DUST3R, and project the 2D contact points into 3D space using Gaussian heatmaps. Finally, we derive hand trajectories through 3D hand pose estimation and fit them with linear regression to integrate the spatiotemporal dynamics of human-object interactions. We demonstrate the effectiveness of our method on seven real-world hand-object manipulation tasks in cooking scenes from the Ego-Exo4D dataset. (en)
(An illustrative sketch of the contact-lifting and trajectory-fitting steps described above follows the metadata record.)
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2026-01-27T16:12:42Z. No. of bitstreams: 0 (en)
dc.description.provenance: Made available in DSpace on 2026-01-27T16:12:42Z (GMT). No. of bitstreams: 0 (en)
dc.description.tableofcontents:
Acknowledgements
摘要 [Chinese abstract]
Abstract
Contents
List of Figures
List of Tables
Chapter 1  Introduction
Chapter 2  Related Work
2.1  Affordance
2.2  Object Representation for Robot Manipulation
2.3  Robot Learning from Video
Chapter 3  Method
3.1  Contact Detection and Hand Segmentation
3.2  Single Image Reconstruction
3.3  3D Contact Point
3.4  Regressing the Post-Contact Trajectory
Chapter 4  Experiments
4.1  Experiment Setup and Dataset
4.2  Qualitative Results
4.2.1  Quantitative Results
4.2.2  Completion Rate
4.2.3  Affordance Heatmap
4.2.4  Trajectory Angle
4.3  Implementation Details
4.3.1  Video Pre-processing
4.3.2  Dealing with the Left/Right Hand Problem
4.3.3  Increasing Error Tolerance in 3D Hand Pose
4.4  Multi-view Setup
Chapter 5  Conclusion
Chapter 6  Future Directions
6.1  Modeling More Complex Post-Contact Trajectories
6.2  Hand-Object Contact Detection
6.3  Leveraging Diffusion Models for More Complete 3D Scene Synthesis
References
Appendix A  Detailed Analysis
A.1  The Main Problem of 2D Affordance
A.2  More 3D Affordance Results
A.3  Evaluation Details
A.3.1  2D Heatmap
A.3.2  Trajectory Angle
A.4  Failure Case Analysis
A.4.1  Contact Detection Failure
A.4.2  Transparent and Reflective Objects
A.4.3  3D Pose Estimation Failures
A.4.4  Object Too Small or Scene Too Complex
dc.language.iso: en
dc.subject: 基於影片的機器人學習
dc.subject: 機器操弄的物體表徵
dc.subject: 可供性
dc.subject: Robot Learning from Video
dc.subject: Object Representation for Robot Manipulation
dc.subject: Affordance
dc.title: 三維可供性之重建基於自我視角示範影片 (zh_TW)
dc.title: 3D Affordance Reconstruction from Egocentric Demonstration Video (en)
dc.type: Thesis
dc.date.schoolyear: 114-1
dc.description.degree: 碩士 [Master's]
dc.contributor.oralexamcommittee: 孫民;鄭文皇 (zh_TW)
dc.contributor.oralexamcommittee: Min Sun;Wen-Huang Cheng (en)
dc.subject.keyword: 基於影片的機器人學習,機器操弄的物體表徵,可供性 (zh_TW)
dc.subject.keyword: Robot Learning from Video, Object Representation for Robot Manipulation, Affordance (en)
dc.relation.page: 34
dc.identifier.doi: 10.6342/NTU202600215
dc.rights.note: 同意授權(全球公開) [consent granted; worldwide open access]
dc.date.accepted: 2026-01-23
dc.contributor.author-college: 電機資訊學院 [College of Electrical Engineering and Computer Science]
dc.contributor.author-dept: 資訊工程學系 [Department of Computer Science and Information Engineering]
dc.date.embargo-lift: 2026-01-28
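
The two geometric steps named in the abstract, lifting a detected 2D contact point into the reconstructed 3D scene via a Gaussian heatmap and summarizing the post-contact hand motion with a linear fit, can be pictured with the minimal NumPy sketch below. It assumes a DUST3R-style per-pixel pointmap, a detected contact pixel, and per-frame 3D hand positions as inputs; all function names, shapes, and parameters are illustrative assumptions, not the thesis's actual implementation.

import numpy as np


def gaussian_heatmap(height, width, center_uv, sigma=15.0):
    # 2D Gaussian centered on the detected contact pixel (u, v) = (col, row).
    u0, v0 = center_uv
    us, vs = np.meshgrid(np.arange(width), np.arange(height))
    dist_sq = (us - u0) ** 2 + (vs - v0) ** 2
    heat = np.exp(-dist_sq / (2.0 * sigma ** 2))
    return heat / heat.sum()


def lift_contact_to_3d(pointmap, contact_uv, sigma=15.0):
    # pointmap: (H, W, 3) per-pixel 3D points in a shared world frame.
    # Weight every reconstructed 3D point by the 2D contact heatmap and
    # return the weights plus the weighted-average 3D contact location.
    h, w, _ = pointmap.shape
    heat = gaussian_heatmap(h, w, contact_uv, sigma)
    contact_3d = (pointmap.reshape(-1, 3) * heat.reshape(-1, 1)).sum(axis=0)
    return heat, contact_3d


def fit_post_contact_direction(hand_xyz):
    # hand_xyz: (T, 3) 3D hand positions after contact, one row per frame.
    # Fit x(t), y(t), z(t) jointly with least squares; the slope row gives
    # the dominant motion direction, returned as a unit vector.
    t = np.arange(len(hand_xyz), dtype=float)
    design = np.stack([t, np.ones_like(t)], axis=1)   # columns: [t, 1]
    coeffs, _, _, _ = np.linalg.lstsq(design, hand_xyz, rcond=None)
    direction = coeffs[0]
    return direction / (np.linalg.norm(direction) + 1e-8)


if __name__ == "__main__":
    # Toy stand-in data only; real inputs would come from the reconstruction
    # and hand-pose stages described in the abstract.
    pointmap = np.random.rand(480, 640, 3)
    heat, contact_3d = lift_contact_to_3d(pointmap, contact_uv=(320, 240))
    hand_xyz = np.cumsum(np.random.rand(30, 3) * 0.01, axis=0)
    print("3D contact point:", contact_3d)
    print("post-contact direction:", fit_post_contact_direction(hand_xyz))

In this toy usage, the weighted average of the pointmap coordinates under the heatmap serves as the 3D contact location, and the per-axis least-squares slope gives the dominant post-contact motion direction.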
Appears in Collections: 資訊工程學系 [Department of Computer Science and Information Engineering]

Files in This Item:
File: ntu-114-1.pdf (5.05 MB, Adobe PDF)