Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101361

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 徐宏民 | zh_TW |
| dc.contributor.advisor | Winston H. Hsu | en |
| dc.contributor.author | 刁一凡 | zh_TW |
| dc.contributor.author | Egil Diau | en |
| dc.date.accessioned | 2026-01-27T16:12:42Z | - |
| dc.date.available | 2026-01-28 | - |
| dc.date.copyright | 2026-01-27 | - |
| dc.date.issued | 2026 | - |
| dc.date.submitted | 2026-01-22 | - |
| dc.identifier.citation | [1] S. Bahl, R. Mendonca, L. Chen, U. Jain, and D. Pathak. Affordances from human videos as a versatile representation for robotics. 2023.
[2] J. Chen, D. Gao, K. Q. Lin, and M. Z. Shou. Affordance grounding from demonstration video to target image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6799–6808, 2023.
[3] L. Chen, X. Chu, X. Zhang, and J. Sun. Simple baselines for image restoration. arXiv preprint arXiv:2204.04676, 2022.
[4] A. Delitzas, A. Takmaz, F. Tombari, R. Sumner, M. Pollefeys, and F. Engelmann. SceneFun3D: Fine-grained functionality and affordance understanding in 3D scenes. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[5] H. Geng, Z. Li, Y. Geng, J. Chen, H. Dong, and H. Wang. PartManip: Learning cross-category generalizable part manipulation policy from point cloud observations. arXiv preprint arXiv:2303.16958, 2023.
[6] K. Grauman, A. Westbury, L. Torresani, K. Kitani, J. Malik, T. Afouras, K. Ashutosh, V. Baiyya, S. Bansal, B. Boote, et al. Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19383–19400, 2024.
[7] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, and R. Girshick. Segment anything. arXiv:2304.02643, 2023.
[8] Y. Li, N. Zhao, J. Xiao, C. Feng, X. Wang, and T.-S. Chua. LASO: Language-guided affordance segmentation on 3D object. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14251–14260, 2024.
[9] Y.-C. Lin, P. Florence, A. Zeng, J. T. Barron, Y. Du, W.-C. Ma, A. Simeonov, A. R. Garcia, and P. Isola. MIRA: Mental imagery for robotic affordances. In Conference on Robot Learning, pages 1916–1927. PMLR, 2023.
[10] Y. Liu, A. Gupta, P. Abbeel, and S. Levine. Imitation from observation: Learning to imitate behaviors from raw video via context translation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1118–1125. IEEE, 2018.
[11] H. Luo, W. Zhai, J. Zhang, Y. Cao, and D. Tao. Learning affordance grounding from exocentric images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2252–2261, 2022.
[12] R. Mendonca, S. Bahl, and D. Pathak. Structured world models from human videos. 2023.
[13] G. Pavlakos, D. Shan, I. Radosavovic, A. Kanazawa, D. Fouhey, and J. Malik. Reconstructing hands in 3D with transformers. In CVPR, 2024.
[14] Y. Qin, Y.-H. Wu, S. Liu, H. Jiang, R. Yang, Y. Fu, and X. Wang. DexMV: Imitation learning for dexterous manipulation from human videos, 2021.
[15] A. Rashid, S. Sharma, C. M. Kim, J. Kerr, L. Y. Chen, A. Kanazawa, and K. Goldberg. Language embedded radiance fields for zero-shot task-oriented grasping. In 7th Annual Conference on Robot Learning, 2023.
[16] D. Shan, J. Geng, M. Shu, and D. Fouhey. Understanding human hands in contact at internet scale. 2020.
[17] W. Shen, G. Yang, A. Yu, J. Wong, L. P. Kaelbling, and P. Isola. Distilled feature fields enable few-shot language-guided manipulation. In 7th Annual Conference on Robot Learning, 2023.
[18] A. Simeonov, Y. Du, A. Tagliasacchi, J. B. Tenenbaum, A. Rodriguez, P. Agrawal, and V. Sitzmann. Neural descriptor fields: SE(3)-equivariant object representations for manipulation. 2022.
[19] R. Suvorov, E. Logacheva, A. Mashikhin, A. Remizova, A. Ashukha, A. Silvestrov, N. Kong, H. Goka, K. Park, and V. Lempitsky. Resolution-robust large mask inpainting with Fourier convolutions. arXiv preprint arXiv:2109.07161, 2021.
[20] S. Wang, V. Leroy, Y. Cabon, B. Chidlovskii, and J. Revaud. DUSt3R: Geometric 3D vision made easy. In CVPR, 2024.
[21] T. Weng, D. Held, F. Meier, and M. Mukadam. Neural grasp distance fields for robot manipulation. In IEEE International Conference on Robotics and Automation (ICRA), 2023.
[22] Y. Ye, X. Li, A. Gupta, S. D. Mello, S. Birchfield, J. Song, S. Tulsiani, and S. Liu. Affordance diffusion: Synthesizing hand-object interactions. In CVPR, 2023. | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101361 | - |
| dc.description.abstract | 發展具備泛化能力的機器人技能至今仍極具挑戰性。受到心理學啟發,「可供性」(affordance)已被視為一種具潛力的中介表徵,可用來引導機器人進行物體操控。然而,多數現有研究主要聚焦於來自影片的二維可供性,忽略了攝影機位置、絕對空間座標、深度與幾何結構等關鍵空間資訊。為此,本研究提出一種無需訓練的創新方法,可從第一人稱操作示範影片中建構三維可供性。針對第一人稱影片中缺乏靜態高品質畫面而導致三維重建困難的問題,我們採用三維基礎模型 DUST3R,能在不使用 COLMAP 的情況下,從稀疏影像中重建場景。我們首先以手部偵測技術擷取接觸時間與二維接觸點,再透過 DUST3R 還原互動場景,並將接觸點以高斯熱圖投影至三維空間;同時,我們利用三維手部姿態估計取得手部軌跡,並透過線性回歸整合其時空動態,建構出完整的人物與物體互動歷程。實驗結果顯示,我們的方法能有效應用於 Ego4D-Exo 資料集中的七項真實世界料理任務,展現其於複雜操控場景中建構三維可供性的潛力。 | zh_TW |
| dc.description.abstract | Developing robots with generalizable skills remains an exceedingly challenging task. Drawing from psychology, the concept of affordance has emerged as a promising intermediate representation for guiding robot manipulation. However, prior work has focused primarily on 2D affordances from video, neglecting critical spatial information such as camera position, absolute spatial coordinates, depth, and geometry. In this paper, we present a novel training-free method that constructs 3D affordances from egocentric demonstration videos. To address the lack of static, high-quality frames for 3D reconstruction in egocentric videos, we employ the 3D foundation model DUST3R, which reconstructs scenes from sparse images without requiring COLMAP. We analyze videos with hand detection to identify contact times and 2D contact points, reconstruct the interactions with DUST3R, and project the 2D contact points into 3D space as Gaussian heatmaps. Finally, we derive hand trajectories through 3D hand pose estimation and fit them with linear regression to integrate the spatiotemporal dynamics of human-object interactions. We demonstrate the effectiveness of our method on seven real-world hand-object manipulation tasks in cooking scenes from the Ego-Exo4D dataset (a minimal illustrative sketch of the contact-lifting and trajectory-fitting steps appears after the metadata table). | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2026-01-27T16:12:42Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2026-01-27T16:12:42Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Acknowledgements i
摘要 iii
Abstract v
Contents vii
List of Figures xi
List of Tables xiii
Chapter 1 Introduction 1
Chapter 2 Related Work 5
2.1 Affordance 5
2.2 Object Representation for Robot Manipulation 6
2.3 Robot Learning from Video 7
Chapter 3 Method 9
3.1 Contact Detection and Hand Segmentation 9
3.2 Single Image Reconstruction 10
3.3 3D Contact Point 10
3.4 Regress Post Contact Trajectory 11
Chapter 4 Experiments 13
4.1 Experiment Setup and Dataset 13
4.2 Qualitative Results 14
4.2.1 Quantitative Results 15
4.2.2 Completion Rate 15
4.2.3 Affordance Heatmap 16
4.2.4 Trajectory Angle 16
4.3 Implementation Details 17
4.3.1 Pre-process Video 17
4.3.2 Dealing with the Left/Right Hand Problem 18
4.3.3 Increase Error Tolerance in 3D Hand Pose 18
4.4 Multi-view Setup 18
Chapter 5 Conclusion 21
Chapter 6 Future Direction 23
6.1 Model More Complex Post Contact Trajectory 23
6.2 Hand Object Contact Detection 23
6.3 Leverage Diffusion Model for More Complete 3D Scene Synthesis 23
References 25
Appendix A — Detailed Analysis 29
A.1 The Main Problem of 2D Affordance 29
A.2 More 3D Affordance Results 29
A.3 Evaluation Details 30
A.3.1 2D Heatmap 30
A.3.2 Trajectory Angle 31
A.4 Failure Case Analysis 32
A.4.1 Contact Detection Failure 32
A.4.2 Transparent and Reflective Object 32
A.4.3 3D Pose Estimation Failed 33
A.4.4 Object Too Small or Too Complex Scene 34 | - |
| dc.language.iso | en | - |
| dc.subject | 基於影片的機器人學習 | - |
| dc.subject | 機器操弄的物體表徵 | - |
| dc.subject | 可供性 | - |
| dc.subject | Robot Learning from Video | - |
| dc.subject | Object Representation for Robot Manipulation | - |
| dc.subject | Affordance | - |
| dc.title | 三維可供性之重建基於自我視角示範影片 | zh_TW |
| dc.title | 3D Affordance Reconstruction from Egocentric Demonstration Video | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 114-1 | - |
| dc.description.degree | 碩士 (Master's) | - |
| dc.contributor.oralexamcommittee | 孫民;鄭文皇 | zh_TW |
| dc.contributor.oralexamcommittee | Min Sun;Wen-Huang Cheng | en |
| dc.subject.keyword | 基於影片的機器人學習,機器操弄的物體表徵,可供性 | zh_TW |
| dc.subject.keyword | Robot Learning from Video, Object Representation for Robot Manipulation, Affordance | en |
| dc.relation.page | 34 | - |
| dc.identifier.doi | 10.6342/NTU202600215 | - |
| dc.rights.note | 同意授權(全球公開) (authorized, open access worldwide) | - |
| dc.date.accepted | 2026-01-23 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 資訊工程學系 | - |
| dc.date.embargo-lift | 2026-01-28 | - |
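
To make the pipeline summarized in the abstract more concrete, the sketch below illustrates, under stated assumptions, how a detected 2D contact point could be lifted into 3D through a DUSt3R-style per-pixel pointmap, spread into a Gaussian affordance heatmap over the reconstructed scene, and how a post-contact hand trajectory could be summarized with linear regression. This is a minimal sketch, not the thesis implementation; the function names, array shapes, and synthetic data are assumptions introduced here for illustration.

```python
# Minimal sketch, not the thesis implementation. It assumes a DUSt3R-style
# per-pixel pointmap of shape (H, W, 3) giving a 3D coordinate for every
# pixel; all function names, shapes, and synthetic data are illustrative.
import numpy as np


def lift_contact_to_3d(pointmap, contact_uv):
    """Return the 3D point under a detected 2D contact pixel (u, v)."""
    u, v = contact_uv
    return pointmap[v, u]


def gaussian_affordance_heatmap(scene_points, contact_3d, sigma=0.05):
    """Weight reconstructed scene points by a Gaussian of their distance
    to the 3D contact point; values lie in [0, 1]."""
    sq_dist = np.sum((scene_points - contact_3d) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def fit_post_contact_trajectory(wrist_xyz):
    """Fit x/y/z of post-contact wrist positions linearly in time and
    return (start point, unit direction) of the resulting line."""
    t = np.arange(len(wrist_xyz), dtype=float)
    design = np.stack([t, np.ones_like(t)], axis=1)      # columns: [t, 1]
    coeffs, *_ = np.linalg.lstsq(design, wrist_xyz, rcond=None)
    velocity, start = coeffs[0], coeffs[1]               # per-axis slope / offset
    return start, velocity / (np.linalg.norm(velocity) + 1e-8)


if __name__ == "__main__":
    # Synthetic stand-ins for a reconstructed frame and a tracked wrist.
    pointmap = np.random.rand(480, 640, 3)
    contact_3d = lift_contact_to_3d(pointmap, (320, 240))
    heat = gaussian_affordance_heatmap(pointmap.reshape(-1, 3), contact_3d)
    wrist_xyz = contact_3d + 0.01 * np.arange(10)[:, None] * np.array([1.0, 0.0, 0.5])
    start, direction = fit_post_contact_trajectory(wrist_xyz)
    print(heat.shape, start.round(3), direction.round(3))
```

Fitting a single line is only a first-order summary of the post-contact motion, which is consistent with the thesis's stated future direction of modeling more complex post-contact trajectories (Section 6.1).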
Appears in Collections: 資訊工程學系
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-114-1.pdf | 5.05 MB | Adobe PDF |
