Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101247

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 王鈺強 | zh_TW |
| dc.contributor.advisor | Yu-Chiang Frank Wang | en |
| dc.contributor.author | 黃啟斌 | zh_TW |
| dc.contributor.author | Chi-Pin Huang | en |
| dc.date.accessioned | 2026-01-13T16:04:56Z | - |
| dc.date.available | 2026-01-14 | - |
| dc.date.copyright | 2026-01-13 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2026-01-07 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101247 | - |
| dc.description.abstract | 近年來,生成式模型與具身人工智慧的發展,顯著提升了多模態生成與機器人推理的能力。然而,當潛在空間表示被應用於具備不同需求的任務時,往往面臨可控性不足、組合性受限、推理能力不穩定以及效率瓶頸等問題,使其難以同時支援生成與具身推理等複雜應用。本論文以潛在空間表示為核心,探討其在任務需求逐步提升的情境下,如何有效支援多模態生成與具身推理。研究內容涵蓋四個相互關聯的主題:首先,探討潛在空間中語義操作的可靠性,以實現精確且穩定的概念抹除;其次,針對影片生成任務,研究主體與動作的解耦表示,以支援多主體、動作的影片客製化生成;接著,進一步將潛在空間延伸至具身推理與決策,探討其在複雜、長期的機器人任務中的幫助;最後,考量機器人在現實生活中的實際部署需求,研究如何透過壓縮冗餘文字推理至緊湊潛在空間,在維持推理能力的同時,也能夠提升推理效率。綜合而言,本論文系統性地分析潛在空間表示在不同任務需求下的角色與限制,並說明其在多模態生成與具身推理中作為共同表示的潛力,為未來整合生成、控制與推理的人工智慧系統奠定基礎。 | zh_TW |
| dc.description.abstract | Recent advancements in generative models and embodied AI have significantly expanded the capabilities of multimodal generation and robotic reasoning. However, when latent space representations are expected to support tasks with increasingly diverse and demanding requirements, they often encounter fundamental limitations in controllability, compositionality, reasoning stability, and efficiency. These challenges hinder their ability to serve as a shared representation across generation and embodied reasoning. This thesis investigates latent space representations for bridging multimodal generation and embodied reasoning in progressively complex task settings. We first examine the reliability of semantic operations in latent spaces, focusing on precise and stable concept-level manipulation in diffusion-based generation. We then study disentangled spatial and temporal representations in video generation, enabling coherent customization of multiple subjects and motions. Beyond generative tasks, we extend latent representations to embodied reasoning and decision-making, demonstrating their effectiveness in complex and long-horizon robotic manipulation. Finally, we explore how compressing redundant textual reasoning into compact latent representations can preserve reasoning capability while substantially improving efficiency. Taken together, this thesis provides a systematic analysis of the role and limitations of latent space representations across task regimes, and demonstrates their potential as a shared foundation for integrating generation, control, and reasoning in multimodal AI systems. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2026-01-13T16:04:56Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2026-01-13T16:04:56Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | 致謝 i
中文摘要 iii
Abstract v
Contents vii
List of Figures xi
List of Tables xvii
1 Latent Space Representations for Reliable Concept Erasing 1
1.1 Publication Preface 1
1.2 Introduction 2
1.3 Related Works 5
1.3.1 Erasing Concepts from Diffusion Models 5
1.3.2 Controlling Text-to-Image Diffusion Models 6
1.3.3 Adversarial Attack & Training 7
1.4 Method 8
1.4.1 Concept Erasing with Lightweight Eraser 8
1.4.2 Concept-Localized Regularization for Erasing Locality 10
1.4.3 Adversarial Prompt Learning for Erasing Robustness 11
1.5 Experiments 11
1.5.1 Quantitative Evaluation 15
1.5.2 Qualitative Evaluation 18
1.6 Conclusion 21
2 Latent Space Representations for Video Customization 23
2.1 Publication Preface 24
2.2 Introduction 24
2.3 Related Works 27
2.3.1 Text-to-Video Generation 27
2.3.2 Video Content Customization 27
2.4 Method 28
2.4.1 Preliminary: Video Diffusion Models 29
2.4.2 Subject and Motion Customization 30
2.4.3 Spatial-Temporal Collaborative Composition 33
2.5 Experiment 36
2.5.1 Experimental Setup 36
2.5.2 Main Results 37
2.5.3 Ablation Studies 40
2.6 Conclusion 41
3 Latent Space Representations for Embodied Reasoning 43
3.1 Publication Preface 44
3.2 Introduction 44
3.3 Related Works 47
3.3.1 Vision-Language-Action Models 47
3.3.2 Reasoning in Vision-Language-(Action) Models 47
3.4 Method 48
3.4.1 Problem Formulation 48
3.4.2 Reinforced Visual Latent Planning for Embodied Reasoning 49
3.4.3 Reasoning-Enhanced Action Adaptation 52
3.4.4 Learning Strategy and Inference 52
3.5 Experiment 53
3.5.1 Experimental Setup 53
3.5.2 Quantitative Evaluation 54
3.5.3 Qualitative Results 56
3.5.4 Ablation Study 58
3.5.5 Analysis of ThinkAct 59
3.6 Conclusion 61
4 Latent Space Representations for Efficient Embodied Reasoning 63
4.1 Introduction 64
4.2 Related Works 67
4.2.1 Vision-Language-Action (VLA) Models 67
4.2.2 Efficient Reasoning 68
4.3 Method 69
4.3.1 Problem Formulation 69
4.3.2 Efficient Embodied Reasoning 69
4.3.3 Reasoning-Enhanced Policy Learning 72
4.3.4 Learning Strategy and Inference 73
4.4 Experiment 74
4.4.1 Experimental Setup 74
4.4.2 Quantitative Evaluation 75
4.4.3 Analysis of Fast-ThinkAct 77
4.5 Conclusion 81
5 Conclusion 83
Reference 85 | - |
| dc.language.iso | en | - |
| dc.subject | 深度學習 | - |
| dc.subject | 電腦視覺 | - |
| dc.subject | 多模態生成 | - |
| dc.subject | 擴散模型 | - |
| dc.subject | 概念抹除 | - |
| dc.subject | 影片客製化 | - |
| dc.subject | 視覺語言動作模型 | - |
| dc.subject | 具身推理 | - |
| dc.subject | deep learning | - |
| dc.subject | computer vision | - |
| dc.subject | multimodal generation | - |
| dc.subject | diffusion models | - |
| dc.subject | concept erasing | - |
| dc.subject | video customization | - |
| dc.subject | vision-language-action models | - |
| dc.subject | embodied reasoning | - |
| dc.title | 探討潛在空間表徵於多模態生成與具身推理之應用 | zh_TW |
| dc.title | Exploring Latent Space Representations for Multimodal Generation and Embodied Reasoning | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 114-1 | - |
| dc.description.degree | 博士 | - |
| dc.contributor.coadvisor | 孫紹華 | zh_TW |
| dc.contributor.coadvisor | Shao-Hua Sun | en |
| dc.contributor.oralexamcommittee | 莊永裕;陳祝嵩;羅紹元;林彥宇;劉育綸;賴尚宏;陳煥宗 | zh_TW |
| dc.contributor.oralexamcommittee | Yung-Yu Chuang;Chu-Song Chen;Shao-Yuan Lo;Yen-Yu Lin;Yu-Lun Liu;Shang-Hong Lai;Hwann-Tzong Chen | en |
| dc.subject.keyword | 深度學習,電腦視覺,多模態生成,擴散模型,概念抹除,影片客製化,視覺語言動作模型,具身推理 | zh_TW |
| dc.subject.keyword | deep learning, computer vision, multimodal generation, diffusion models, concept erasing, video customization, vision-language-action models, embodied reasoning | en |
| dc.relation.page | 111 | - |
| dc.identifier.doi | 10.6342/NTU202600030 | - |
| dc.rights.note | 同意授權(全球公開) | - |
| dc.date.accepted | 2026-01-07 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 電信工程學研究所 | - |
| dc.date.embargo-lift | 2026-01-14 | - |
| Appears in Collections: | 電信工程學研究所 |
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-114-1.pdf | 25.89 MB | Adobe PDF |
