Affordance-Guided Coarse-to-Fine Exploration for Base Placement in Mobile Manipulation

林資融; Tzu-Jung Lin

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98910

Title:	Affordance-Guided Coarse-to-Fine Exploration for Base Placement in Mobile Manipulation (基於可供性引導的粗至細探索方法應用於移動操作之基座定位)
Author:	Tzu-Jung Lin (林資融)
Advisor:	Winston H. Hsu (徐宏民)
Keywords:	Base placement, Vision-language models, Open-vocabulary mobile manipulation
Publication year:	2025
Degree:	Master's
Abstract:	In open-vocabulary mobile manipulation (OVMM), task success often hinges on choosing an appropriate base placement for the robot. Existing approaches typically navigate to a location near the target without considering affordances (the ways an object or scene can be acted upon), which frequently leads to manipulation failures. We propose Affordance-Guided Coarse-to-Fine Exploration, a zero-shot framework for base placement that integrates semantic understanding from vision-language models (VLMs) with geometric feasibility through iterative optimization. Our method constructs two cross-modal representations, Affordance RGB and Obstacle Map+, to align semantics with spatial context, enabling reasoning that extends beyond the egocentric limitations of RGB perception. To align the robot's interaction with the affordances a task requires, we use coarse semantic priors from VLMs to steer the search toward task-relevant regions and then refine placements with geometric constraints, reducing the risk of converging to local optima. Evaluated on five diverse OVMM tasks, our system achieves an 85% success rate, significantly outperforming classical geometric planners and VLM-based methods. This demonstrates the promise of affordance-aware, multimodal reasoning for generalizable, instruction-conditioned planning in OVMM.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98910
DOI:	10.6342/NTU202503918
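Full-text authorization:	Authorized (access limited to campus)
Electronic full-text release date:	2029-08-05
Appears in collections:	Graduate Institute of Networking and Multimedia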
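
The coarse-to-fine idea summarized in the abstract can be illustrated with a small sketch. This is not the thesis's implementation, which is not available in this record. It assumes a 2D search area, a hypothetical `semantic_score` function standing in for a VLM-derived prior, and a hypothetical `is_feasible` function standing in for the geometric constraints. The resampling scheme (Gaussian sampling around the best feasible candidates with a shrinking spread) is a generic technique chosen for illustration only.

```python
import math
import random
from typing import Callable, Optional, Tuple

Point = Tuple[float, float]


def coarse_to_fine_search(
    semantic_score: Callable[[Point], float],   # hypothetical stand-in for a VLM prior
    is_feasible: Callable[[Point], bool],       # hypothetical geometric feasibility check
    bounds: Tuple[float, float, float, float],  # (x_min, x_max, y_min, y_max)
    rounds: int = 4,
    samples_per_round: int = 200,
    elite_count: int = 10,
    seed: int = 0,
) -> Optional[Point]:
    """Return the best feasible base position found, or None if none is feasible."""
    rng = random.Random(seed)
    x_min, x_max, y_min, y_max = bounds

    # Coarse stage: sample uniformly over the whole search area.
    candidates = [
        (rng.uniform(x_min, x_max), rng.uniform(y_min, y_max))
        for _ in range(samples_per_round)
    ]
    sigma = 0.25 * max(x_max - x_min, y_max - y_min)
    best: Optional[Point] = None

    for round_index in range(rounds):
        # Geometric filter first, then rank the survivors by the semantic prior.
        feasible = [p for p in candidates if is_feasible(p)]
        if not feasible:
            break
        feasible.sort(key=semantic_score, reverse=True)
        elites = feasible[:elite_count]
        best = elites[0]
        if round_index == rounds - 1:
            break

        # Fine stage: keep the elites and resample around them with a smaller spread.
        sigma *= 0.5
        candidates = list(elites)
        for _ in range(samples_per_round):
            cx, cy = rng.choice(elites)
            x = min(max(rng.gauss(cx, sigma), x_min), x_max)
            y = min(max(rng.gauss(cy, sigma), y_min), y_max)
            candidates.append((x, y))
    return best


# Toy setup (all values are assumptions): a target at (2, 0) that is best
# approached from about 0.6 m away, inside a 0.4 m keep-out radius.
TARGET: Point = (2.0, 0.0)


def distance_to_target(p: Point) -> float:
    return math.hypot(p[0] - TARGET[0], p[1] - TARGET[1])


def toy_semantic_score(p: Point) -> float:
    # Higher is better; peaks on a ring 0.6 m from the target.
    return -abs(distance_to_target(p) - 0.6)


def toy_is_feasible(p: Point) -> bool:
    # The base may not overlap the keep-out disc around the target.
    return distance_to_target(p) >= 0.4


best_base = coarse_to_fine_search(
    toy_semantic_score, toy_is_feasible, bounds=(0.0, 4.0, -2.0, 2.0)
)
print(best_base)
```

Filtering by feasibility before ranking means an infeasible position is never returned, and carrying the elites into the next round means the best score cannot get worse between rounds.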

Files in this item:

File	Size	Format
ntu-113-2.pdf (not authorized for public access)	6.57 MB	Adobe PDF


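Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.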
