NTU Theses and Dissertations Repository
College of Electrical Engineering and Computer Science
Graduate Institute of Networking and Multimedia

Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98910
Title: 基於可供性引導的粗至細探索方法應用於移動操作之基座定位
Affordance-Guided Coarse-to-Fine Exploration for Base Placement in Mobile Manipulation
Authors: 林資融
Tzu-Jung Lin
Advisor: 徐宏民
Winston H. Hsu
Keywords: Base placement, Vision-language models, Open-vocabulary mobile manipulation
Publication Year: 2025
Degree: Master's
Abstract: In open-vocabulary mobile manipulation (OVMM), task success often hinges on the selection of an appropriate base placement for the robot. Existing approaches typically navigate to proximity-based regions without considering affordances (the possible interactions an object or scene supports), resulting in frequent manipulation failures. We propose Affordance-Guided Coarse-to-Fine Exploration, a zero-shot framework for base placement that integrates semantic understanding from vision-language models (VLMs) with geometric feasibility through an iterative optimization process. Our method constructs two cross-modal representations, Affordance RGB and Obstacle Map+, to align semantics with spatial context, enabling reasoning that extends beyond the egocentric limitations of RGB perception. To ensure interaction is guided by task-relevant affordances, we leverage coarse semantic priors from VLMs to focus the search on task-relevant regions and refine placements with geometric constraints, thereby reducing the risk of convergence to local optima. Evaluated on five diverse OVMM tasks, our system achieves an 85% success rate, significantly outperforming classical geometric planners and VLM-based methods. This demonstrates the promise of affordance-aware and multimodal reasoning for generalizable, instruction-conditioned planning in OVMM.
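
The coarse-to-fine loop the abstract describes can be summarized in pseudocode. The sketch below is a minimal, hypothetical illustration assuming a pre-sampled pool of candidate base poses; the callables vlm_score, collision_free, and reachable are placeholders for the components the abstract names (VLM semantic priors, Obstacle Map+ collision checks, and reachability constraints), not the thesis's actual interfaces.

    # Hypothetical sketch of affordance-guided coarse-to-fine base
    # placement search; all interfaces are illustrative assumptions.
    def select_base_placement(candidates, task, vlm_score,
                              collision_free, reachable,
                              rounds=3, keep_ratio=0.3):
        """Iteratively narrow candidate base poses: coarse semantic
        ranking first, geometric feasibility checks second."""
        pool = list(candidates)
        for _ in range(rounds):
            # Coarse step: rank poses by a task-conditioned affordance
            # score from a vision-language model; keep the top slice.
            pool.sort(key=lambda pose: vlm_score(pose, task), reverse=True)
            pool = pool[:max(1, int(len(pool) * keep_ratio))]
            # Fine step: discard poses that collide with obstacles or
            # leave the target outside the manipulator's reach.
            feasible = [p for p in pool
                        if collision_free(p) and reachable(p, task)]
            if feasible:
                pool = feasible
        return pool[0] if pool else None

Under this reading, the semantic ranking keeps the search anchored to task-relevant regions, while the feasibility filter inside each round is what reduces the risk of converging to a semantically plausible but geometrically infeasible local optimum.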
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98910
DOI: 10.6342/NTU202503918
Fulltext Rights: Access authorized (restricted to campus)
Embargo Lift Date: 2029-08-05
Appears in Collections: Graduate Institute of Networking and Multimedia

Files in This Item:
File: ntu-113-2.pdf (Restricted Access)
Size: 6.57 MB
Format: Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
