Exploring Latent Space Representations for Multimodal Generation and Embodied Reasoning

黃啟斌; Chi-Pin Huang

Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101247

Title:	探討潛在空間表徵於多模態生成與具身推理之應用 Exploring Latent Space Representations for Multimodal Generation and Embodied Reasoning
Authors:	黃啟斌 Chi-Pin Huang
Advisor:	王鈺強 Yu-Chiang Frank Wang
Co-Advisor:	孫紹華 Shao-Hua Sun
Keyword:	deep learning, computer vision, multimodal generation, diffusion models, concept erasing, video customization, vision-language-action models, embodied reasoning
Publication Year :	2025
Degree:	Doctoral
Abstract:	Recent advances in generative models and embodied AI have significantly expanded the capabilities of multimodal generation and robotic reasoning. However, when latent space representations are expected to support tasks with increasingly diverse and demanding requirements, they often run into limitations in controllability, compositionality, reasoning stability, and efficiency. These limitations make it difficult for them to serve as a shared representation across generation and embodied reasoning. This thesis takes latent space representations as its core and investigates how they can support multimodal generation and embodied reasoning in progressively more demanding task settings. The work covers four interrelated topics. We first examine the reliability of semantic operations in latent spaces, focusing on precise and stable concept erasing in diffusion-based generation. We then study disentangled subject and motion representations in video generation, enabling coherent customization of multiple subjects and motions. Beyond generative tasks, we extend latent representations to embodied reasoning and decision-making, demonstrating their effectiveness in complex, long-horizon robotic manipulation. Finally, considering the practical requirements of deploying robots in the real world, we explore how compressing redundant textual reasoning into compact latent representations can preserve reasoning capability while substantially improving efficiency. Taken together, this thesis provides a systematic analysis of the role and limitations of latent space representations across task regimes, and demonstrates their potential as a shared foundation for integrating generation, control, and reasoning in multimodal AI systems.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101247
DOI:	10.6342/NTU202600030
Fulltext Rights:	Authorized (open access worldwide)
Embargo Lift Date:	2026-01-14
Appears in Collections:	Graduate Institute of Communication Engineering

Files in This Item:

File	Size	Format
ntu-114-1.pdf	25.89 MB	Adobe PDF
