Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101330
Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 許永真 | zh_TW |
| dc.contributor.advisor | Jane Yung-jen Hsu | en |
| dc.contributor.author | 游一心 | zh_TW |
| dc.contributor.author | Yi-Hsin Yu | en |
| dc.date.accessioned | 2026-01-16T16:10:33Z | - |
| dc.date.available | 2026-01-17 | - |
| dc.date.copyright | 2026-01-16 | - |
| dc.date.issued | 2026 | - |
| dc.date.submitted | 2026-01-12 | - |
| dc.identifier.citation | S. F. Bhat, R. Birkl, D. Wofk, P. Wonka, and M. Müller. Zoedepth: Zero-shot transfer by combining relative and metric depth, 2023. C. Cheang, S. Chen, Z. Cui, Y. Hu, L. Huang, T. Kong, H. Li, Y. Li, Y. Liu, X. Ma, H. Niu, W. Ou, W. Peng, Z. Ren, H. Shi, J. Tian, H. Wu, X. Xiao, Y. Xiao, J. Xu, and Y. Yang. Gr-3 technical report, 2025. C.-L. Cheang, G. Chen, Y. Jing, T. Kong, H. Li, Y. Li, Y. Liu, H. Wu, J. Xu, Y. Yang, H. Zhang, and M. Zhu. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation, 2024. S. Chen, H. Guo, S. Zhu, F. Zhang, Z. Huang, J. Feng, and B. Kang. Video depth anything: Consistent depth estimation for super-long videos, 2025. C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion, 2024. C. Doersch, A. Gupta, L. Markeeva, A. Recasens, L. Smaira, Y. Aytar, J. Carreira, A. Zisserman, and Y. Yang. Tap-vid: A benchmark for tracking any point in a video, 2023. Y. Du, M. Yang, B. Dai, H. Dai, O. Nachum, J. B. Tenenbaum, D. Schuurmans, and P. Abbeel. Learning universal policies via text-guided video generation, 2023. Y. Du, M. Yang, P. Florence, F. Xia, A. Wahid, B. Ichter, P. Sermanet, T. Yu, P. Abbeel, J. B. Tenenbaum, L. Kaelbling, A. Zeng, and J. Tompson. Video language planning, 2023. A. Escontrela, A. Adeniji, W. Yan, A. Jain, X. B. Peng, K. Goldberg, Y. Lee, D. Hafner, and P. Abbeel. Video prediction models as rewards for reinforcement learning, 2023. M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381–395, June 1981. Y. Fu, H. Zhang, D. Wu, W. Xu, and B. Boulet. Robot policy learning with temporal optimal transport reward, 2024. R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, USA, 2nd edition, 2003. M. Hu, W. Yin, C. Zhang, Z. Cai, X. Long, H. Chen, K. Wang, G. Yu, C. Shen, and S. Shen. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10579–10596, Dec. 2024. T. Huang, G. Jiang, Y. Ze, and H. Xu. Diffusion reward: Learning rewards via conditional video diffusion, 2024. W. Huey, H. Wang, A. Wu, Y. Artzi, and S. Choudhury. Imitation learning from a single temporally misaligned video, 2025. N. Karaev, I. Makarov, J. Wang, N. Neverova, A. Vedaldi, and C. Rupprecht. Cotracker3: Simpler and better point tracking by pseudo-labelling real videos, 2024. P.-C. Ko, J. Mao, Y. Du, S.-H. Sun, and J. B. Tenenbaum. Learning to act from actionless videos through dense correspondences. In The Twelfth International Conference on Learning Representations, 2024. S. Koppula, I. Rocco, Y. Yang, J. Heyward, J. Carreira, A. Zisserman, G. Brostow, and C. Doersch. Tapvid-3d: A benchmark for tracking any point in 3d, 2024. L. Piccinelli, Y.-H. Yang, C. Sakaridis, M. Segu, S. Li, L. V. Gool, and F. Yu. Unidepth: Universal monocular metric depth estimation, 2024. D. Schmidt and M. Jiang. Learning to act without actions, 2024. R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. A Bradford Book, Cambridge, MA, USA, 2018. C. Wen, X. Lin, J. So, K. Chen, Q. Dou, Y. Gao, and P. Abbeel. Any-point trajectory modeling for policy learning, 2024. H. Wu, Y. Jing, C. Cheang, G. Chen, J. Xu, X. Li, M. Liu, H. Li, and T. Kong. Unleashing large-scale video generative pre-training for visual robot manipulation, 2023. Y. Xiao, Q. Wang, S. Zhang, N. Xue, S. Peng, Y. Shen, and X. Zhou. Spatialtracker: Tracking any 2d pixels in 3d space, 2024. L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao. Depth anything v2, 2024. T. Yu, D. Quillen, Z. He, R. Julian, A. Narayan, H. Shively, A. Bellathur, K. Hausman, C. Finn, and S. Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning, 2021. Y. Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations, 2024. | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101330 | - |
| dc.description.abstract | 近期以影片為基礎的機器人學習取得了快速進展,使得擴散模型能夠在完全不依賴動作標註的情況下生成視覺計畫,AVDC 即是一個代表性成果。然而,實際執行時的成功率往往並非受限於生成影片的品質,而是受到影片轉動作(video-to-action mapping)階段的缺陷所瓶頸:僵化的模式分類會導致系統性地將抓取與推動誤判,而逐幀串接的光流估計則會在長視野下累積誤差。為了解決這些限制,我們提出一個改進的影片轉動作框架,透過自適應的模式選擇機制,以及基於點追蹤與深度估計的 3D 重建流程來強化 AVDC。於 AVDC 所採用的 11 個 Meta-World 任務進行評估後,我們的方法在整體上提升了任務成功率,並能更忠實地執行擴散模型所生成的視覺計畫,從而縮小視覺規劃品質與實際機器人控制之間的落差。 | zh_TW |
| dc.description.abstract | Recent progress in video-based robotic learning has enabled diffusion models to generate visual plans without requiring any action annotations, as exemplified by AVDC. However, in practice, the final performance is often bottlenecked not by the quality of the generated videos but by the imperfections in the video-to-action mapping: rigid mode classification causes systematic grasp/push errors, and sequential optical-flow estimation accumulates drift over long horizons. To address these limitations, we propose an improved video-to-action framework that augments AVDC with an adaptive mode selection mechanism and a more stable 3D motion reconstruction pipeline based on point tracking and temporally consistent depth estimation. Evaluated across the 11 Meta-World tasks used in AVDC, our method consistently increases task success rates and more faithfully executes the visual plans produced by the diffusion model, thereby narrowing the gap between visual planning quality and real robotic control. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2026-01-16T16:10:33Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2026-01-16T16:10:33Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Verification Letter from the Oral Examination Committee i Acknowledgements iii 摘要 v Abstract vii Contents ix List of Figures xi List of Tables xiii Denotation xv Chapter 1 Introduction 1 1.1 Background 1 1.2 Motivation 2 1.3 Proposed Method 3 1.4 Thesis Organization 4 Chapter 2 Related Work 5 2.1 Computer Vision 6 2.1.1 Camera Matrix 6 2.1.2 RANSAC 7 2.1.3 Point Tracking 8 2.1.4 Monocular Depth Estimation 9 2.2 Reinforcement Learning 10 2.2.1 RL Problem Formulation 11 2.2.2 Robot Learning from Videos 11 2.2.2.1 Video Models as Visual Planners 12 2.2.2.2 Video-Language-Action Models 12 2.2.2.3 Video Models as Policy Supervisors 13 2.2.2.4 Videos as Experts in Inverse Reinforcement Learning 14 Chapter 3 Problem Definition 15 Chapter 4 Methodology 17 4.1 Mode Selection 19 4.2 Point Tracker 20 Chapter 5 Experiments 23 5.1 Environment Setup 24 5.2 Comparative Evaluation with AVDC 25 5.3 Comparative Evaluation with Diffusion Policy 25 5.4 Ablation Study 27 5.4.1 The Impact of Diffusion-Induced Errors 29 Chapter 6 Conclusion 33 6.1 Contribution 33 6.2 Limitations and Future Work 34 References 37 | - |
| dc.language.iso | en | - |
| dc.subject | 電腦視覺 | - |
| dc.subject | 強化學習 | - |
| dc.subject | 基於影片的策略學習 | - |
| dc.subject | Computer Vision | - |
| dc.subject | Reinforcement Learning | - |
| dc.subject | Video-based Policy Learning | - |
| dc.title | 透過自適應模式選擇與點追蹤實現穩定的影片轉動作 | zh_TW |
| dc.title | Stable Video-to-Action Mapping via Adaptive Mode Selection and Point Tracking | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 114-1 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.coadvisor | 李濬屹 | zh_TW |
| dc.contributor.coadvisor | Chun-Yi Lee | en |
| dc.contributor.oralexamcommittee | 郭彥伶;柯宗瑋;陳駿丞 | zh_TW |
| dc.contributor.oralexamcommittee | Yen-Ling Kuo;Tsung-Wei Ke;Jun-Cheng Chen | en |
| dc.subject.keyword | 電腦視覺,強化學習,基於影片的策略學習 | zh_TW |
| dc.subject.keyword | Computer Vision,Reinforcement Learning,Video-based Policy Learning | en |
| dc.relation.page | 40 | - |
| dc.identifier.doi | 10.6342/NTU202504713 | - |
| dc.rights.note | 同意授權(全球公開) | - |
| dc.date.accepted | 2026-01-13 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 資訊工程學系 | - |
| dc.date.embargo-lift | 2026-01-17 | - |
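
The abstract describes reconstructing 3D motion from tracked 2D points combined with monocular depth estimates. As a minimal illustrative sketch of that kind of step (not the thesis's actual implementation; the function name `backproject`, the intrinsic matrix `K`, and the sample points below are hypothetical), tracked pixels with per-point depth can be back-projected into 3D camera coordinates using the pinhole camera model:

```python
import numpy as np

def backproject(points_2d, depths, K):
    """Back-project tracked 2D pixels with per-point depth into 3D
    camera coordinates via pinhole intrinsics K.

    points_2d: (N, 2) array of (u, v) pixel coordinates
    depths:    (N,)   array of metric depths along the optical axis
    K:         (3, 3) camera intrinsic matrix
    Returns an (N, 3) array of 3D points in the camera frame.
    """
    n = len(points_2d)
    # Homogeneous pixel coordinates: (u, v, 1)
    pts_h = np.hstack([points_2d, np.ones((n, 1))])
    # Normalized camera rays: K^{-1} [u, v, 1]^T
    rays = (np.linalg.inv(K) @ pts_h.T).T
    # Scale each ray by its depth to obtain 3D points
    return rays * depths[:, None]

# Hypothetical intrinsics (focal length 500 px, principal point at 320, 240)
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
pts = np.array([[320.0, 240.0], [420.0, 240.0]])
depths = np.array([1.0, 2.0])
print(backproject(pts, depths, K))  # → [[0. 0. 1.] [0.4 0. 2.]]
```

Chaining this over the frames of a generated video yields per-point 3D trajectories; the thesis's pipeline additionally relies on temporally consistent depth and robust fitting (e.g. RANSAC, per its Chapter 2) to stabilize the result.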
Appears in Collections: 資訊工程學系
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| ntu-114-1.pdf | 10.22 MB | Adobe PDF | View/Open |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
