NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/100918
Title: 即時性多模態大型語言模型代理人架構於機械手臂對話機器人一般性取放任務及自訂義按摩任務之應用
Real-Time MLLM Agent Frameworks for Robot Arm Chatbots in Pick-and-Place and Customized Massage Tasks
Author: Lung Hsueh (薛龍)
Advisor: Meng-Shiun Tsai (蔡孟勳)
Keywords: Multimodal Large Language Models (MLLM), Retrieval-Augmented Generation (RAG), Agent-Based Control Systems, Voice Interaction, Robot Arm Control, Human–Robot Interaction (HRI)
Publication Year: 2025
Degree: Master's
Abstract:
This thesis proposes a Real-Time, Voice-Control-Enabled Multimodal Large Language Model (MLLM) Framework that integrates large language models (LLMs), vision–language models (VLMs), speech-to-text and text-to-speech voice modules, and retrieval-augmented generation (RAG) into a unified agent-based control system for robotic arms. At its core, the system introduces a Central Main Chat-Agent coupled with a Router Agent that delegates tasks to background modules, ensuring uninterrupted voice interaction through a clear frontend–backend separation. This design guarantees generalizability, as the frontend pipeline can be seamlessly applied to a wide range of robotic tasks. Scalability is achieved by enabling task parallelism and customizable code generation through the Router Agent, while dedicated libraries ensure reliable connections to robotic hardware.
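To picture the frontend–backend separation described above, the following is a minimal asyncio sketch, not the thesis implementation: a chat agent answers each utterance immediately while a router agent pushes slow robot-side work into background coroutines. All class, function, and task names here are illustrative assumptions.

```python
import asyncio

async def run_background_task(task_name: str, args: dict) -> str:
    """Stand-in for a backend module (e.g., trajectory generation or massage planning)."""
    await asyncio.sleep(2.0)          # pretend this is slow robot-side work
    return f"{task_name} finished with {args}"

class RouterAgent:
    """Delegates work to background modules without blocking the frontend."""
    def __init__(self) -> None:
        self.pending: list[asyncio.Task] = []

    def dispatch(self, task_name: str, args: dict) -> None:
        # create_task schedules the coroutine; the caller returns immediately
        self.pending.append(asyncio.create_task(run_background_task(task_name, args)))

class CentralChatAgent:
    """Frontend: turns each utterance into an immediate reply, routing heavy work away."""
    def __init__(self, router: RouterAgent) -> None:
        self.router = router

    async def handle_utterance(self, text: str) -> str:
        if "pick" in text.lower():
            self.router.dispatch("pick_and_place", {"object": "cup"})
            return "Okay, picking up the cup."   # reply now; execution continues in background
        return "How can I help?"

async def main() -> None:
    agent = CentralChatAgent(RouterAgent())
    print(await agent.handle_utterance("Please pick up the cup"))
    await asyncio.gather(*agent.router.pending)   # backend tasks finish after the reply

asyncio.run(main())
```

The point of the split is that waiting on backend work happens outside the conversational turn, so speech interaction never blocks on robot execution.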
In the general pick-and-place task, the framework demonstrates how LLM reasoning can directly control robots, while a concatenated chaining of multimodal modules enables sequential perception, reasoning, and execution for structured manipulation tasks. In the robot massage task, the system showcases how the voice module functions in conjunction with the Central Main Chat-Agent and Router Agent, implementing a frontend–backend separation framework that ensures smooth, real-time interactions. This case further demonstrates the potential of MLLMs to handle more complex, customizable, and human-centered robotic applications.
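The concatenated chaining of multimodal modules can be read as a three-stage pipeline in which each stage consumes the previous stage's output. The sketch below is an assumed illustration only: the detector output, object labels, and command strings are placeholders rather than the framework's actual interfaces.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    xyz: tuple[float, float, float]   # object position in robot coordinates (metres)

def perceive(image) -> list[Detection]:
    """Stage 1: a vision module (e.g., an object detector) proposes candidate objects."""
    return [Detection("red cup", (0.42, -0.10, 0.05))]   # placeholder output

def reason(instruction: str, detections: list[Detection]) -> Detection:
    """Stage 2: language-side reasoning selects the detection matching the instruction."""
    for det in detections:
        if det.label in instruction.lower():
            return det
    raise ValueError("no matching object found")

def execute(target: Detection) -> list[str]:
    """Stage 3: turn the chosen target into low-level motion commands."""
    x, y, z = target.xyz
    return [
        f"move_to({x:.2f}, {y:.2f}, {z + 0.10:.2f})",   # approach above the object
        f"move_to({x:.2f}, {y:.2f}, {z:.2f})",          # descend to grasp height
        "close_gripper()",
        "move_to(place_pose)",
    ]

# Each stage's output feeds the next, mirroring the perception -> reasoning -> execution chain.
commands = execute(reason("pick up the red cup", perceive(image=None)))
print(commands)
```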
Traditional industrial robots excel in repetitive, pre-programmed routines but lack adaptability for open-ended interaction or personalized services. By embedding multimodal AI directly into the control loop, the proposed framework addresses these limitations, enabling real-time perception, reasoning, and action across both structured and service-oriented environments. The system functions as a conversational agent supporting natural language dialogue, voice interaction, and visual grounding. An LLM-based chatbot orchestrates operations by invoking specialized modules for retrieval, reasoning, and trajectory generation. RAG pipelines maintain factual accuracy and contextual relevance, while memory mechanisms ensure continuity across extended interactions. On the perception side, the framework integrates YOLO-based vision modules with Transformer-based VLMs, enabling real-time object detection, segmentation, and contextual interpretation. A custom Python integrator bridges high-level reasoning with low-level robotic execution, ensuring seamless control.
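An integrator layer of the kind mentioned above typically validates the structured plan coming from the reasoning side before handing it to a hardware connection library. The sketch below assumes a JSON action format and a hypothetical RobotDriver class; neither is taken from the thesis.

```python
import json

WORKSPACE = {"x": (-0.5, 0.5), "y": (-0.5, 0.5), "z": (0.0, 0.6)}   # metres, assumed limits

def within_workspace(pose: dict) -> bool:
    """Reject any pose from the reasoning layer that lies outside the assumed reachable box."""
    return all(lo <= pose[axis] <= hi for axis, (lo, hi) in WORKSPACE.items())

class RobotDriver:
    """Placeholder for a vendor connection library; prints instead of moving hardware."""
    def move_linear(self, x: float, y: float, z: float) -> None:
        print(f"MOVE_L -> ({x:.3f}, {y:.3f}, {z:.3f})")

    def set_gripper(self, closed: bool) -> None:
        print("GRIPPER", "CLOSE" if closed else "OPEN")

def run_plan(plan_json: str, robot: RobotDriver) -> None:
    """Execute a JSON action plan emitted by the language/reasoning side."""
    for step in json.loads(plan_json):
        if step["action"] == "move" and within_workspace(step["pose"]):
            robot.move_linear(**step["pose"])
        elif step["action"] == "gripper":
            robot.set_gripper(step["closed"])
        else:
            raise ValueError(f"rejected step: {step}")

plan = (
    '[{"action": "move", "pose": {"x": 0.42, "y": -0.10, "z": 0.15}},'
    ' {"action": "gripper", "closed": true}]'
)
run_plan(plan, RobotDriver())
```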
Experimental validation demonstrates the framework’s capacity to translate natural language instructions into executable robotic trajectories, adapt dynamically to visual input, and sustain coherent user interactions. The contributions of this work lie in showing how LLM reasoning, multimodal grounding, and agent-based orchestration can be unified within a robotic control system. This integration illustrates the feasibility of bridging abstract AI reasoning with physical action, paving the way for scalable, interactive, and user-centered robotic platforms.
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/100918
DOI: 10.6342/NTU202504691
Full-Text Authorization: Authorized (open access worldwide)
Electronic Full-Text Release Date: 2025-11-27
Appears in Collections: Department of Mechanical Engineering

Files in This Item:
File: ntu-114-1.pdf (7.66 MB, Adobe PDF)

