NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101330
Full metadata record
(DC field: value [language])
dc.contributor.advisor: 許永真 [zh_TW]
dc.contributor.advisor: Jane Yung-jen Hsu [en]
dc.contributor.author: 游一心 [zh_TW]
dc.contributor.author: Yi-Hsin Yu [en]
dc.date.accessioned: 2026-01-16T16:10:33Z
dc.date.available: 2026-01-17
dc.date.copyright: 2026-01-16
dc.date.issued: 2026
dc.date.submitted: 2026-01-12
dc.identifier.citation:
S. F. Bhat, R. Birkl, D. Wofk, P. Wonka, and M. Müller. ZoeDepth: Zero-shot transfer by combining relative and metric depth, 2023.
C. Cheang, S. Chen, Z. Cui, Y. Hu, L. Huang, T. Kong, H. Li, Y. Li, Y. Liu, X. Ma, H. Niu, W. Ou, W. Peng, Z. Ren, H. Shi, J. Tian, H. Wu, X. Xiao, Y. Xiao, J. Xu, and Y. Yang. GR-3 technical report, 2025.
C.-L. Cheang, G. Chen, Y. Jing, T. Kong, H. Li, Y. Li, Y. Liu, H. Wu, J. Xu, Y. Yang, H. Zhang, and M. Zhu. GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation, 2024.
S. Chen, H. Guo, S. Zhu, F. Zhang, Z. Huang, J. Feng, and B. Kang. Video Depth Anything: Consistent depth estimation for super-long videos, 2025.
C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion Policy: Visuomotor policy learning via action diffusion, 2024.
C. Doersch, A. Gupta, L. Markeeva, A. Recasens, L. Smaira, Y. Aytar, J. Carreira, A. Zisserman, and Y. Yang. TAP-Vid: A benchmark for tracking any point in a video, 2023.
Y. Du, M. Yang, B. Dai, H. Dai, O. Nachum, J. B. Tenenbaum, D. Schuurmans, and P. Abbeel. Learning universal policies via text-guided video generation, 2023.
Y. Du, M. Yang, P. Florence, F. Xia, A. Wahid, B. Ichter, P. Sermanet, T. Yu, P. Abbeel, J. B. Tenenbaum, L. Kaelbling, A. Zeng, and J. Tompson. Video language planning, 2023.
A. Escontrela, A. Adeniji, W. Yan, A. Jain, X. B. Peng, K. Goldberg, Y. Lee, D. Hafner, and P. Abbeel. Video prediction models as rewards for reinforcement learning, 2023.
M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381–395, June 1981.
Y. Fu, H. Zhang, D. Wu, W. Xu, and B. Boulet. Robot policy learning with temporal optimal transport reward, 2024.
R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, USA, 2nd edition, 2003.
M. Hu, W. Yin, C. Zhang, Z. Cai, X. Long, H. Chen, K. Wang, G. Yu, C. Shen, and S. Shen. Metric3D v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10579–10596, Dec. 2024.
T. Huang, G. Jiang, Y. Ze, and H. Xu. Diffusion Reward: Learning rewards via conditional video diffusion, 2024.
W. Huey, H. Wang, A. Wu, Y. Artzi, and S. Choudhury. Imitation learning from a single temporally misaligned video, 2025.
N. Karaev, I. Makarov, J. Wang, N. Neverova, A. Vedaldi, and C. Rupprecht. CoTracker3: Simpler and better point tracking by pseudo-labelling real videos, 2024.
P.-C. Ko, J. Mao, Y. Du, S.-H. Sun, and J. B. Tenenbaum. Learning to act from actionless videos through dense correspondences. In The Twelfth International Conference on Learning Representations, 2024.
S. Koppula, I. Rocco, Y. Yang, J. Heyward, J. Carreira, A. Zisserman, G. Brostow, and C. Doersch. TAPVid-3D: A benchmark for tracking any point in 3D, 2024.
L. Piccinelli, Y.-H. Yang, C. Sakaridis, M. Segu, S. Li, L. V. Gool, and F. Yu. UniDepth: Universal monocular metric depth estimation, 2024.
D. Schmidt and M. Jiang. Learning to act without actions, 2024.
R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. A Bradford Book, Cambridge, MA, USA, 2018.
C. Wen, X. Lin, J. So, K. Chen, Q. Dou, Y. Gao, and P. Abbeel. Any-point trajectory modeling for policy learning, 2024.
H. Wu, Y. Jing, C. Cheang, G. Chen, J. Xu, X. Li, M. Liu, H. Li, and T. Kong. Unleashing large-scale video generative pre-training for visual robot manipulation, 2023.
Y. Xiao, Q. Wang, S. Zhang, N. Xue, S. Peng, Y. Shen, and X. Zhou. SpatialTracker: Tracking any 2D pixels in 3D space, 2024.
L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao. Depth Anything V2, 2024.
T. Yu, D. Quillen, Z. He, R. Julian, A. Narayan, H. Shively, A. Bellathur, K. Hausman, C. Finn, and S. Levine. Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning, 2021.
Y. Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu. 3D Diffusion Policy: Generalizable visuomotor policy learning via simple 3D representations, 2024.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101330
dc.description.abstract: 近期以影片為基礎的機器人學習取得了快速進展,使得擴散模型能夠在完全不依賴動作標註的情況下生成視覺計畫,AVDC 即是一個代表性成果。然而,實際執行時的成功率往往並非受限於生成影片的品質,而是受到影片轉動作(video-to-action mapping)階段的缺陷所瓶頸:僵化的模式分類會導致系統性地將抓取與推動誤判,而逐幀串接的光流估計則會在長視野下累積誤差。為了解決這些限制,我們提出一個改進的影片轉動作框架,透過自適應的模式選擇機制,以及基於點追蹤與深度估計的 3D 重建流程來強化 AVDC。於 AVDC 所採用的 11 個 Meta-World 任務進行評估後,我們的方法在整體上提升了任務成功率,並能更忠實地執行擴散模型所生成的視覺計畫,從而縮小視覺規劃品質與實際機器人控制之間的落差。 [zh_TW]
dc.description.abstract: Recent progress in video-based robotic learning has enabled diffusion models to generate visual plans without requiring any action annotations, as exemplified by AVDC. However, in practice, the final performance is often bottlenecked not by the quality of the generated videos but by the imperfections in the video-to-action mapping: rigid mode classification causes systematic grasp/push errors, and sequential optical-flow estimation accumulates drift over long horizons. To address these limitations, we propose an improved video-to-action framework that augments AVDC with an adaptive mode selection mechanism and a more stable 3D motion reconstruction pipeline based on point tracking and temporally consistent depth estimation. Evaluated across the 11 Meta-World tasks used in AVDC, our method consistently increases task success rates and more faithfully executes the visual plans produced by the diffusion model, thereby narrowing the gap between visual planning quality and real robotic control. [en]
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2026-01-16T16:10:33Z. No. of bitstreams: 0 [en]
dc.description.provenance: Made available in DSpace on 2026-01-16T16:10:33Z (GMT). No. of bitstreams: 0 [en]
dc.description.tableofcontents:
Verification Letter from the Oral Examination Committee i
Acknowledgements iii
摘要 v
Abstract vii
Contents ix
List of Figures xi
List of Tables xiii
Denotation xv
Chapter 1 Introduction 1
1.1 Background 1
1.2 Motivation 2
1.3 Proposed Method 3
1.4 Thesis Organization 4
Chapter 2 Related Work 5
2.1 Computer Vision 6
2.1.1 Camera Matrix 6
2.1.2 RANSAC 7
2.1.3 Point Tracking 8
2.1.4 Monocular Depth Estimation 9
2.2 Reinforcement Learning 10
2.2.1 RL Problem Formulation 11
2.2.2 Robot Learning from Videos 11
2.2.2.1 Video Models as Visual Planners 12
2.2.2.2 Video-Language-Action Models 12
2.2.2.3 Video Models as Policy Supervisors 13
2.2.2.4 Videos as Experts in Inverse Reinforcement Learning 14
Chapter 3 Problem Definition 15
Chapter 4 Methodology 17
4.1 Mode Selection 19
4.2 Point Tracker 20
Chapter 5 Experiments 23
5.1 Environment Setup 24
5.2 Comparative Evaluation with AVDC 25
5.3 Comparative Evaluation with Diffusion Policy 25
5.4 Ablation Study 27
5.4.1 The Impact of Diffusion-Induced Errors 29
Chapter 6 Conclusion 33
6.1 Contribution 33
6.2 Limitations and Future Work 34
References 37
dc.language.iso: en
dc.subject: 電腦視覺
dc.subject: 強化學習
dc.subject: 基於影片的策略學習
dc.subject: Computer Vision
dc.subject: Reinforcement Learning
dc.subject: Video-based Policy Learning
dc.title: 透過自適應模式選擇與點追蹤實現穩定的影片轉動作 [zh_TW]
dc.title: Stable Video-to-Action Mapping via Adaptive Mode Selection and Point Tracking [en]
dc.type: Thesis
dc.date.schoolyear: 114-1
dc.description.degree: 碩士
dc.contributor.coadvisor: 李濬屹 [zh_TW]
dc.contributor.coadvisor: Chun-Yi Lee [en]
dc.contributor.oralexamcommittee: 郭彥伶; 柯宗瑋; 陳駿丞 [zh_TW]
dc.contributor.oralexamcommittee: Yen-Ling Kuo; Tsung-Wei Ke; Jun-Cheng Chen [en]
dc.subject.keyword: 電腦視覺, 強化學習, 基於影片的策略學習 [zh_TW]
dc.subject.keyword: Computer Vision, Reinforcement Learning, Video-based Policy Learning [en]
dc.relation.page: 40
dc.identifier.doi: 10.6342/NTU202504713
dc.rights.note: 同意授權(全球公開)
dc.date.accepted: 2026-01-13
dc.contributor.author-college: 電機資訊學院
dc.contributor.author-dept: 資訊工程學系
dc.date.embargo-lift: 2026-01-17
Appears in Collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in This Item:
File: ntu-114-1.pdf | Size: 10.22 MB | Format: Adobe PDF
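The abstract above describes a video-to-action pipeline whose 3D motion reconstruction is based on point tracking and depth estimation. As a purely illustrative sketch, not the thesis's actual implementation (which is not included in this record), the following Python snippet shows one standard way such a lift can be done: back-projecting tracked pixel coordinates into camera-space 3D trajectories using per-frame depth maps and a pinhole intrinsics matrix. The function name backproject_tracks and all inputs are hypothetical.

```python
"""Illustrative sketch only: assumes known camera intrinsics K and
per-frame metric depth maps; not code from the thesis itself."""
import numpy as np


def backproject_tracks(tracks_2d: np.ndarray,
                       depth_maps: np.ndarray,
                       intrinsics: np.ndarray) -> np.ndarray:
    """Lift 2D point tracks to camera-space 3D trajectories.

    tracks_2d:  (T, N, 2) pixel coordinates (u, v) per frame and point.
    depth_maps: (T, H, W) metric depth per frame.
    intrinsics: (3, 3) pinhole camera matrix K.
    Returns:    (T, N, 3) 3D points in the camera frame.
    """
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]
    T, N, _ = tracks_2d.shape
    points_3d = np.zeros((T, N, 3))
    for t in range(T):
        u = tracks_2d[t, :, 0]
        v = tracks_2d[t, :, 1]
        # Sample depth at the (rounded, clipped) tracked pixel locations.
        ui = np.clip(np.round(u).astype(int), 0, depth_maps.shape[2] - 1)
        vi = np.clip(np.round(v).astype(int), 0, depth_maps.shape[1] - 1)
        z = depth_maps[t, vi, ui]
        # Standard pinhole back-projection: x = (u - cx) * z / fx, etc.
        points_3d[t, :, 0] = (u - cx) * z / fx
        points_3d[t, :, 1] = (v - cy) * z / fy
        points_3d[t, :, 2] = z
    return points_3d


if __name__ == "__main__":
    # Toy usage with synthetic inputs; real inputs would come from a point
    # tracker (e.g. CoTracker) and a monocular depth model.
    rng = np.random.default_rng(0)
    K = np.array([[320.0, 0.0, 160.0],
                  [0.0, 320.0, 120.0],
                  [0.0, 0.0, 1.0]])
    tracks = rng.uniform([0, 0], [319, 239], size=(8, 5, 2))  # 8 frames, 5 points
    depths = rng.uniform(0.5, 2.0, size=(8, 240, 320))
    traj_3d = backproject_tracks(tracks, depths, K)
    print(traj_3d.shape)  # (8, 5, 3)
```

From such per-point 3D trajectories, a subsequent step could, for example, fit a rigid or translational motion per frame (e.g. by least squares or RANSAC, which the thesis reviews in Chapter 2) and convert it into end-effector motion targets; the details of that step in the thesis are not reproduced here.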