Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101361
Full metadata record
DC Field: Value (Language)
dc.contributor.advisor: 徐宏民 (zh_TW)
dc.contributor.advisor: Winston H. Hsu (en)
dc.contributor.author: 刁一凡 (zh_TW)
dc.contributor.author: Egil Diau (en)
dc.date.accessioned: 2026-01-27T16:12:42Z
dc.date.available: 2026-01-28
dc.date.copyright: 2026-01-27
dc.date.issued: 2026
dc.date.submitted: 2026-01-22
dc.identifier.citation:
[1] S. Bahl, R. Mendonca, L. Chen, U. Jain, and D. Pathak. Affordances from human videos as a versatile representation for robotics. 2023.
[2] J. Chen, D. Gao, K. Q. Lin, and M. Z. Shou. Affordance grounding from demonstration video to target image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6799–6808, 2023.
[3] L. Chen, X. Chu, X. Zhang, and J. Sun. Simple baselines for image restoration. arXiv preprint arXiv:2204.04676, 2022.
[4] A. Delitzas, A. Takmaz, F. Tombari, R. Sumner, M. Pollefeys, and F. Engelmann. SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[5] H. Geng, Z. Li, Y. Geng, J. Chen, H. Dong, and H. Wang. Partmanip: Learning cross-category generalizable part manipulation policy from point cloud observations. arXiv preprint arXiv:2303.16958, 2023.
[6] K. Grauman, A. Westbury, L. Torresani, K. Kitani, J. Malik, T. Afouras, K. Ashutosh, V. Baiyya, S. Bansal, B. Boote, et al. Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19383–19400, 2024.
[7] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, and R. Girshick. Segment anything. arXiv:2304.02643, 2023.
[8] Y. Li, N. Zhao, J. Xiao, C. Feng, X. Wang, and T.-s. Chua. Laso: Language-guided affordance segmentation on 3d object. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14251–14260, 2024.
[9] Y.-C. Lin, P. Florence, A. Zeng, J. T. Barron, Y. Du, W.-C. Ma, A. Simeonov, A. R. Garcia, and P. Isola. Mira: Mental imagery for robotic affordances. In Conference on Robot Learning, pages 1916–1927. PMLR, 2023.
[10] Y. Liu, A. Gupta, P. Abbeel, and S. Levine. Imitation from observation: Learning to imitate behaviors from raw video via context translation. In 2018 IEEE international conference on robotics and automation (ICRA), pages 1118–1125. IEEE, 2018.
[11] H. Luo, W. Zhai, J. Zhang, Y. Cao, and D. Tao. Learning affordance grounding from exocentric images. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2252–2261, 2022.
[12] R. Mendonca, S. Bahl, and D. Pathak. Structured world models from human videos. 2023.
[13] G. Pavlakos, D. Shan, I. Radosavovic, A. Kanazawa, D. Fouhey, and J. Malik. Reconstructing hands in 3D with transformers. In CVPR, 2024.
[14] Y. Qin, Y.-H. Wu, S. Liu, H. Jiang, R. Yang, Y. Fu, and X. Wang. Dexmv: Imitation learning for dexterous manipulation from human videos, 2021.
[15] A. Rashid, S. Sharma, C. M. Kim, J. Kerr, L. Y. Chen, A. Kanazawa, and K. Goldberg. Language embedded radiance fields for zero-shot task-oriented grasping. In 7th Annual Conference on Robot Learning, 2023.
[16] D. Shan, J. Geng, M. Shu, and D. Fouhey. Understanding human hands in contact at internet scale. 2020.
[17] W. Shen, G. Yang, A. Yu, J. Wong, L. P. Kaelbling, and P. Isola. Distilled feature fields enable few-shot language-guided manipulation. In 7th Annual Conference on Robot Learning, 2023.
[18] A. Simeonov, Y. Du, A. Tagliasacchi, J. B. Tenenbaum, A. Rodriguez, P. Agrawal, and V. Sitzmann. Neural descriptor fields: Se(3)-equivariant object representations for manipulation. 2022.
[19] R. Suvorov, E. Logacheva, A. Mashikhin, A. Remizova, A. Ashukha, A. Silvestrov, N. Kong, H. Goka, K. Park, and V. Lempitsky. Resolution-robust large mask inpainting with fourier convolutions. arXiv preprint arXiv:2109.07161, 2021.
[20] S. Wang, V. Leroy, Y. Cabon, B. Chidlovskii, and J. Revaud. Dust3r: Geometric 3d vision made easy. In CVPR, 2024.
[21] T. Weng, D. Held, F. Meier, and M. Mukadam. Neural grasp distance fields for robot manipulation. IEEE International Conference on Robotics and Automation (ICRA), 2023.
[22] Y. Ye, X. Li, A. Gupta, S. D. Mello, S. Birchfield, J. Song, S. Tulsiani, and S. Liu. Affordance diffusion: Synthesizing hand-object interactions. In CVPR, 2023.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101361
dc.description.abstract: 發展具備泛化能力的機器人技能至今仍極具挑戰性。受到心理學啟發,「可供性」(affordance)已被視為一種具潛力的中介表徵,可用來引導機器人進行物體操控。然而,多數現有研究主要聚焦於來自影片的二維可供性,忽略了攝影機位置、絕對空間座標、深度與幾何結構等關鍵空間資訊。為此,本研究提出一種無需訓練的創新方法,可從第一人稱操作示範影片中建構三維可供性。針對第一人稱影片中缺乏靜態高品質畫面而導致三維重建困難的問題,我們採用三維基礎模型 DUST3R,能在不使用 COLMAP 的情況下,從稀疏影像中重建場景。我們首先以手部偵測技術擷取接觸時間與二維接觸點,再透過 DUST3R 還原互動場景,並將接觸點以高斯熱圖投影至三維空間;同時,我們利用三維手部姿態估計取得手部軌跡,並透過線性回歸整合其時空動態,建構出完整的人物與物體互動歷程。實驗結果顯示,我們的方法能有效應用於 Ego4D-Exo 資料集中的七項真實世界料理任務,展現其於複雜操控場景中建構三維可供性的潛力。 (zh_TW)
dc.description.abstract: Developing robots with generalizable skills remains exceedingly challenging. Drawing from psychology, the concept of affordance has emerged as a promising intermediate representation for guiding robot manipulation. However, prior work has primarily focused on 2D affordances from video, neglecting critical spatial information such as camera position, absolute spatial coordinates, depth, and geometry. In this paper, we present a novel training-free method that constructs 3D affordances from egocentric demonstration videos. To address the shortage of static, high-quality frames for 3D reconstruction in egocentric video, we employ the 3D foundation model DUST3R, which reconstructs scenes from sparse images without requiring COLMAP. We analyze videos with hand detection to identify contact times and 2D contact points, reconstruct the interaction scenes with DUST3R, and project the 2D contact points into 3D space using Gaussian heatmaps. Finally, we derive hand trajectories through 3D hand pose estimation and fit them with linear regression to integrate the spatiotemporal dynamics of human-object interactions. We demonstrate the effectiveness of our method on seven real-world hand-object manipulation tasks in cooking scenes from the Ego-Exo4D dataset. (en)
(An illustrative sketch of the contact-lifting and trajectory-fitting steps described above follows the metadata record.)
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2026-01-27T16:12:42Z. No. of bitstreams: 0 (en)
dc.description.provenance: Made available in DSpace on 2026-01-27T16:12:42Z (GMT). No. of bitstreams: 0 (en)
dc.description.tableofcontents:
Acknowledgements
摘要 [Chinese abstract]
Abstract
Contents
List of Figures
List of Tables
Chapter 1  Introduction
Chapter 2  Related Work
2.1  Affordance
2.2  Object Representation for Robot Manipulation
2.3  Robot Learning from Video
Chapter 3  Method
3.1  Contact Detection and Hand Segmentation
3.2  Single Image Reconstruction
3.3  3D Contact Point
3.4  Regressing the Post-Contact Trajectory
Chapter 4  Experiments
4.1  Experiment Setup and Dataset
4.2  Qualitative Results
4.2.1  Quantitative Results
4.2.2  Completion Rate
4.2.3  Affordance Heatmap
4.2.4  Trajectory Angle
4.3  Implementation Details
4.3.1  Video Pre-processing
4.3.2  Dealing with the Left/Right Hand Problem
4.3.3  Increasing Error Tolerance in 3D Hand Pose
4.4  Multi-view Setup
Chapter 5  Conclusion
Chapter 6  Future Directions
6.1  Modeling More Complex Post-Contact Trajectories
6.2  Hand-Object Contact Detection
6.3  Leveraging Diffusion Models for More Complete 3D Scene Synthesis
References
Appendix A  Detailed Analysis
A.1  The Main Problem of 2D Affordance
A.2  More 3D Affordance Results
A.3  Evaluation Details
A.3.1  2D Heatmap
A.3.2  Trajectory Angle
A.4  Failure Case Analysis
A.4.1  Contact Detection Failure
A.4.2  Transparent and Reflective Objects
A.4.3  3D Pose Estimation Failures
A.4.4  Object Too Small or Scene Too Complex
dc.language.iso: en
dc.subject: 基於影片的機器人學習
dc.subject: 機器操弄的物體表徵
dc.subject: 可供性
dc.subject: Robot Learning from Video
dc.subject: Object Representation for Robot Manipulation
dc.subject: Affordance
dc.title: 三維可供性之重建基於自我視角示範影片 (zh_TW)
dc.title: 3D Affordance Reconstruction from Egocentric Demonstration Video (en)
dc.type: Thesis
dc.date.schoolyear: 114-1
dc.description.degree: 碩士 [Master's]
dc.contributor.oralexamcommittee: 孫民;鄭文皇 (zh_TW)
dc.contributor.oralexamcommittee: Min Sun;Wen-Huang Cheng (en)
dc.subject.keyword: 基於影片的機器人學習,機器操弄的物體表徵,可供性 (zh_TW)
dc.subject.keyword: Robot Learning from Video, Object Representation for Robot Manipulation, Affordance (en)
dc.relation.page: 34
dc.identifier.doi: 10.6342/NTU202600215
dc.rights.note: 同意授權(全球公開) [consent granted; worldwide open access]
dc.date.accepted: 2026-01-23
dc.contributor.author-college: 電機資訊學院 [College of Electrical Engineering and Computer Science]
dc.contributor.author-dept: 資訊工程學系 [Department of Computer Science and Information Engineering]
dc.date.embargo-lift: 2026-01-28
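
The two geometric steps named in the abstract, lifting a detected 2D contact point into the reconstructed 3D scene via a Gaussian heatmap and summarizing the post-contact hand motion with a linear fit, can be pictured with the minimal NumPy sketch below. It assumes a DUST3R-style per-pixel pointmap, a detected contact pixel, and per-frame 3D hand positions as inputs; all function names, shapes, and parameters are illustrative assumptions, not the thesis's actual implementation.

import numpy as np


def gaussian_heatmap(height, width, center_uv, sigma=15.0):
    # 2D Gaussian centered on the detected contact pixel (u, v) = (col, row).
    u0, v0 = center_uv
    us, vs = np.meshgrid(np.arange(width), np.arange(height))
    dist_sq = (us - u0) ** 2 + (vs - v0) ** 2
    heat = np.exp(-dist_sq / (2.0 * sigma ** 2))
    return heat / heat.sum()


def lift_contact_to_3d(pointmap, contact_uv, sigma=15.0):
    # pointmap: (H, W, 3) per-pixel 3D points in a shared world frame.
    # Weight every reconstructed 3D point by the 2D contact heatmap and
    # return the weights plus the weighted-average 3D contact location.
    h, w, _ = pointmap.shape
    heat = gaussian_heatmap(h, w, contact_uv, sigma)
    contact_3d = (pointmap.reshape(-1, 3) * heat.reshape(-1, 1)).sum(axis=0)
    return heat, contact_3d


def fit_post_contact_direction(hand_xyz):
    # hand_xyz: (T, 3) 3D hand positions after contact, one row per frame.
    # Fit x(t), y(t), z(t) jointly with least squares; the slope row gives
    # the dominant motion direction, returned as a unit vector.
    t = np.arange(len(hand_xyz), dtype=float)
    design = np.stack([t, np.ones_like(t)], axis=1)   # columns: [t, 1]
    coeffs, _, _, _ = np.linalg.lstsq(design, hand_xyz, rcond=None)
    direction = coeffs[0]
    return direction / (np.linalg.norm(direction) + 1e-8)


if __name__ == "__main__":
    # Toy stand-in data only; real inputs would come from the reconstruction
    # and hand-pose stages described in the abstract.
    pointmap = np.random.rand(480, 640, 3)
    heat, contact_3d = lift_contact_to_3d(pointmap, contact_uv=(320, 240))
    hand_xyz = np.cumsum(np.random.rand(30, 3) * 0.01, axis=0)
    print("3D contact point:", contact_3d)
    print("post-contact direction:", fit_post_contact_direction(hand_xyz))

In this toy usage, the weighted average of the pointmap coordinates under the heatmap serves as the 3D contact location, and the per-axis least-squares slope gives the dominant post-contact motion direction.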
Appears in Collections: 資訊工程學系 [Department of Computer Science and Information Engineering]

Files in This Item:
File: ntu-114-1.pdf (5.05 MB, Adobe PDF)