Exploring Latent Space Representations for Multimodal Generation and Embodied Reasoning

Chi-Pin Huang (黃啟斌)

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101247

Title:	Exploring Latent Space Representations for Multimodal Generation and Embodied Reasoning (探討潛在空間表徵於多模態生成與具身推理之應用)
Author:	Chi-Pin Huang (黃啟斌)
Advisor:	Yu-Chiang Frank Wang (王鈺強)
Co-advisor:	Shao-Hua Sun (孫紹華)
Keywords:	deep learning, computer vision, multimodal generation, diffusion models, concept erasing, video customization, vision-language-action models, embodied reasoning
Publication year:	2025
Degree:	Doctoral
Abstract:	Recent advances in generative models and embodied AI have significantly expanded the capabilities of multimodal generation and robotic reasoning. However, when latent space representations must support tasks with increasingly diverse and demanding requirements, they often run into limitations in controllability, compositionality, reasoning stability, and efficiency. These limitations hinder their use as a shared representation across generation and embodied reasoning. This thesis takes latent space representations as its core and investigates how they can support multimodal generation and embodied reasoning as task requirements progressively increase. It covers four interrelated topics. First, we examine the reliability of semantic operations in latent spaces, with the goal of precise and stable concept erasing in diffusion-based generation. Second, for video generation, we study disentangled representations of subjects and motions, enabling coherent customization of videos with multiple subjects and motions. Third, we extend latent representations to embodied reasoning and decision-making, demonstrating their effectiveness in complex, long-horizon robotic manipulation. Finally, motivated by the practical requirements of deploying robots in the real world, we explore how compressing redundant textual reasoning into compact latent representations can preserve reasoning capability while substantially improving inference efficiency. Taken together, this thesis provides a systematic analysis of the role and limitations of latent space representations across task regimes, and demonstrates their potential as a shared foundation for integrating generation, control, and reasoning in multimodal AI systems.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101247
DOI:	10.6342/NTU202600030
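Full-text authorization:	Authorized (open access worldwide)
Full-text release date:	2026-01-14
Appears in collections:	Graduate Institute of Communication Engineering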

Files in this item:

File	Size	Format
ntu-114-1.pdf	25.89 MB	Adobe PDF


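Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.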
