NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101330
Full metadata record
(DC field: value [language])
dc.contributor.advisor: 許永真 [zh_TW]
dc.contributor.advisor: Jane Yung-jen Hsu [en]
dc.contributor.author: 游一心 [zh_TW]
dc.contributor.author: Yi-Hsin Yu [en]
dc.date.accessioned: 2026-01-16T16:10:33Z
dc.date.available: 2026-01-17
dc.date.copyright: 2026-01-16
dc.date.issued: 2026
dc.date.submitted: 2026-01-12
dc.identifier.citation:
S. F. Bhat, R. Birkl, D. Wofk, P. Wonka, and M. Müller. ZoeDepth: Zero-shot transfer by combining relative and metric depth, 2023.
C. Cheang, S. Chen, Z. Cui, Y. Hu, L. Huang, T. Kong, H. Li, Y. Li, Y. Liu, X. Ma, H. Niu, W. Ou, W. Peng, Z. Ren, H. Shi, J. Tian, H. Wu, X. Xiao, Y. Xiao, J. Xu, and Y. Yang. GR-3 technical report, 2025.
C.-L. Cheang, G. Chen, Y. Jing, T. Kong, H. Li, Y. Li, Y. Liu, H. Wu, J. Xu, Y. Yang, H. Zhang, and M. Zhu. GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation, 2024.
S. Chen, H. Guo, S. Zhu, F. Zhang, Z. Huang, J. Feng, and B. Kang. Video Depth Anything: Consistent depth estimation for super-long videos, 2025.
C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion Policy: Visuomotor policy learning via action diffusion, 2024.
C. Doersch, A. Gupta, L. Markeeva, A. Recasens, L. Smaira, Y. Aytar, J. Carreira, A. Zisserman, and Y. Yang. TAP-Vid: A benchmark for tracking any point in a video, 2023.
Y. Du, M. Yang, B. Dai, H. Dai, O. Nachum, J. B. Tenenbaum, D. Schuurmans, and P. Abbeel. Learning universal policies via text-guided video generation, 2023.
Y. Du, M. Yang, P. Florence, F. Xia, A. Wahid, B. Ichter, P. Sermanet, T. Yu, P. Abbeel, J. B. Tenenbaum, L. Kaelbling, A. Zeng, and J. Tompson. Video language planning, 2023.
A. Escontrela, A. Adeniji, W. Yan, A. Jain, X. B. Peng, K. Goldberg, Y. Lee, D. Hafner, and P. Abbeel. Video prediction models as rewards for reinforcement learning, 2023.
M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381–395, June 1981.
Y. Fu, H. Zhang, D. Wu, W. Xu, and B. Boulet. Robot policy learning with temporal optimal transport reward, 2024.
R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, USA, 2nd edition, 2003.
M. Hu, W. Yin, C. Zhang, Z. Cai, X. Long, H. Chen, K. Wang, G. Yu, C. Shen, and S. Shen. Metric3D v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10579–10596, Dec. 2024.
T. Huang, G. Jiang, Y. Ze, and H. Xu. Diffusion Reward: Learning rewards via conditional video diffusion, 2024.
W. Huey, H. Wang, A. Wu, Y. Artzi, and S. Choudhury. Imitation learning from a single temporally misaligned video, 2025.
N. Karaev, I. Makarov, J. Wang, N. Neverova, A. Vedaldi, and C. Rupprecht. CoTracker3: Simpler and better point tracking by pseudo-labelling real videos, 2024.
P.-C. Ko, J. Mao, Y. Du, S.-H. Sun, and J. B. Tenenbaum. Learning to act from actionless videos through dense correspondences. In The Twelfth International Conference on Learning Representations, 2024.
S. Koppula, I. Rocco, Y. Yang, J. Heyward, J. Carreira, A. Zisserman, G. Brostow, and C. Doersch. TAPVid-3D: A benchmark for tracking any point in 3D, 2024.
L. Piccinelli, Y.-H. Yang, C. Sakaridis, M. Segu, S. Li, L. V. Gool, and F. Yu. UniDepth: Universal monocular metric depth estimation, 2024.
D. Schmidt and M. Jiang. Learning to act without actions, 2024.
R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. A Bradford Book, Cambridge, MA, USA, 2018.
C. Wen, X. Lin, J. So, K. Chen, Q. Dou, Y. Gao, and P. Abbeel. Any-point trajectory modeling for policy learning, 2024.
H. Wu, Y. Jing, C. Cheang, G. Chen, J. Xu, X. Li, M. Liu, H. Li, and T. Kong. Unleashing large-scale video generative pre-training for visual robot manipulation, 2023.
Y. Xiao, Q. Wang, S. Zhang, N. Xue, S. Peng, Y. Shen, and X. Zhou. SpatialTracker: Tracking any 2D pixels in 3D space, 2024.
L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao. Depth Anything V2, 2024.
T. Yu, D. Quillen, Z. He, R. Julian, A. Narayan, H. Shively, A. Bellathur, K. Hausman, C. Finn, and S. Levine. Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning, 2021.
Y. Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu. 3D Diffusion Policy: Generalizable visuomotor policy learning via simple 3D representations, 2024.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101330
dc.description.abstract: 近期以影片為基礎的機器人學習取得了快速進展,使得擴散模型能夠在完全不依賴動作標註的情況下生成視覺計畫,AVDC 即是一個代表性成果。然而,實際執行時的成功率往往並非受限於生成影片的品質,而是受到影片轉動作(video-to-action mapping)階段的缺陷所瓶頸:僵化的模式分類會導致系統性地將抓取與推動誤判,而逐幀串接的光流估計則會在長視野下累積誤差。為了解決這些限制,我們提出一個改進的影片轉動作框架,透過自適應的模式選擇機制,以及基於點追蹤與深度估計的 3D 重建流程來強化 AVDC。於 AVDC 所採用的 11 個 Meta-World 任務進行評估後,我們的方法在整體上提升了任務成功率,並能更忠實地執行擴散模型所生成的視覺計畫,從而縮小視覺規劃品質與實際機器人控制之間的落差。 [zh_TW]
dc.description.abstract: Recent progress in video-based robotic learning has enabled diffusion models to generate visual plans without requiring any action annotations, as exemplified by AVDC. However, in practice, the final performance is often bottlenecked not by the quality of the generated videos but by the imperfections in the video-to-action mapping: rigid mode classification causes systematic grasp/push errors, and sequential optical-flow estimation accumulates drift over long horizons. To address these limitations, we propose an improved video-to-action framework that augments AVDC with an adaptive mode selection mechanism and a more stable 3D motion reconstruction pipeline based on point tracking and temporally consistent depth estimation. Evaluated across the 11 Meta-World tasks used in AVDC, our method consistently increases task success rates and more faithfully executes the visual plans produced by the diffusion model, thereby narrowing the gap between visual planning quality and real robotic control. [en]
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2026-01-16T16:10:33Z. No. of bitstreams: 0 [en]
dc.description.provenance: Made available in DSpace on 2026-01-16T16:10:33Z (GMT). No. of bitstreams: 0 [en]
dc.description.tableofcontents:
Verification Letter from the Oral Examination Committee i
Acknowledgements iii
摘要 v
Abstract vii
Contents ix
List of Figures xi
List of Tables xiii
Denotation xv
Chapter 1 Introduction 1
1.1 Background 1
1.2 Motivation 2
1.3 Proposed Method 3
1.4 Thesis Organization 4
Chapter 2 Related Work 5
2.1 Computer Vision 6
2.1.1 Camera Matrix 6
2.1.2 RANSAC 7
2.1.3 Point Tracking 8
2.1.4 Monocular Depth Estimation 9
2.2 Reinforcement Learning 10
2.2.1 RL Problem Formulation 11
2.2.2 Robot Learning from Videos 11
2.2.2.1 Video Models as Visual Planners 12
2.2.2.2 Video-Language-Action Models 12
2.2.2.3 Video Models as Policy Supervisors 13
2.2.2.4 Videos as Experts in Inverse Reinforcement Learning 14
Chapter 3 Problem Definition 15
Chapter 4 Methodology 17
4.1 Mode Selection 19
4.2 Point Tracker 20
Chapter 5 Experiments 23
5.1 Environment Setup 24
5.2 Comparative Evaluation with AVDC 25
5.3 Comparative Evaluation with Diffusion Policy 25
5.4 Ablation Study 27
5.4.1 The Impact of Diffusion-Induced Errors 29
Chapter 6 Conclusion 33
6.1 Contribution 33
6.2 Limitations and Future Work 34
References 37
dc.language.iso: en
dc.subject: 電腦視覺
dc.subject: 強化學習
dc.subject: 基於影片的策略學習
dc.subject: Computer Vision
dc.subject: Reinforcement Learning
dc.subject: Video-based Policy Learning
dc.title: 透過自適應模式選擇與點追蹤實現穩定的影片轉動作 [zh_TW]
dc.title: Stable Video-to-Action Mapping via Adaptive Mode Selection and Point Tracking [en]
dc.type: Thesis
dc.date.schoolyear: 114-1
dc.description.degree: 碩士
dc.contributor.coadvisor: 李濬屹 [zh_TW]
dc.contributor.coadvisor: Chun-Yi Lee [en]
dc.contributor.oralexamcommittee: 郭彥伶; 柯宗瑋; 陳駿丞 [zh_TW]
dc.contributor.oralexamcommittee: Yen-Ling Kuo; Tsung-Wei Ke; Jun-Cheng Chen [en]
dc.subject.keyword: 電腦視覺, 強化學習, 基於影片的策略學習 [zh_TW]
dc.subject.keyword: Computer Vision, Reinforcement Learning, Video-based Policy Learning [en]
dc.relation.page: 40
dc.identifier.doi: 10.6342/NTU202504713
dc.rights.note: 同意授權(全球公開)
dc.date.accepted: 2026-01-13
dc.contributor.author-college: 電機資訊學院
dc.contributor.author-dept: 資訊工程學系
dc.date.embargo-lift: 2026-01-17
Appears in Collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in This Item:
File: ntu-114-1.pdf | Size: 10.22 MB | Format: Adobe PDF
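The abstract above describes a video-to-action pipeline whose 3D motion reconstruction is based on point tracking and depth estimation. As a purely illustrative sketch, not the thesis's actual implementation (which is not included in this record), the following Python snippet shows one standard way such a lift can be done: back-projecting tracked pixel coordinates into camera-space 3D trajectories using per-frame depth maps and a pinhole intrinsics matrix. The function name backproject_tracks and all inputs are hypothetical.

```python
"""Illustrative sketch only: assumes known camera intrinsics K and
per-frame metric depth maps; not code from the thesis itself."""
import numpy as np


def backproject_tracks(tracks_2d: np.ndarray,
                       depth_maps: np.ndarray,
                       intrinsics: np.ndarray) -> np.ndarray:
    """Lift 2D point tracks to camera-space 3D trajectories.

    tracks_2d:  (T, N, 2) pixel coordinates (u, v) per frame and point.
    depth_maps: (T, H, W) metric depth per frame.
    intrinsics: (3, 3) pinhole camera matrix K.
    Returns:    (T, N, 3) 3D points in the camera frame.
    """
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]
    T, N, _ = tracks_2d.shape
    points_3d = np.zeros((T, N, 3))
    for t in range(T):
        u = tracks_2d[t, :, 0]
        v = tracks_2d[t, :, 1]
        # Sample depth at the (rounded, clipped) tracked pixel locations.
        ui = np.clip(np.round(u).astype(int), 0, depth_maps.shape[2] - 1)
        vi = np.clip(np.round(v).astype(int), 0, depth_maps.shape[1] - 1)
        z = depth_maps[t, vi, ui]
        # Standard pinhole back-projection: x = (u - cx) * z / fx, etc.
        points_3d[t, :, 0] = (u - cx) * z / fx
        points_3d[t, :, 1] = (v - cy) * z / fy
        points_3d[t, :, 2] = z
    return points_3d


if __name__ == "__main__":
    # Toy usage with synthetic inputs; real inputs would come from a point
    # tracker (e.g. CoTracker) and a monocular depth model.
    rng = np.random.default_rng(0)
    K = np.array([[320.0, 0.0, 160.0],
                  [0.0, 320.0, 120.0],
                  [0.0, 0.0, 1.0]])
    tracks = rng.uniform([0, 0], [319, 239], size=(8, 5, 2))  # 8 frames, 5 points
    depths = rng.uniform(0.5, 2.0, size=(8, 240, 320))
    traj_3d = backproject_tracks(tracks, depths, K)
    print(traj_3d.shape)  # (8, 5, 3)
```

From such per-point 3D trajectories, a subsequent step could, for example, fit a rigid or translational motion per frame (e.g. by least squares or RANSAC, which the thesis reviews in Chapter 2) and convert it into end-effector motion targets; the details of that step in the thesis are not reproduced here.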