NTU Theses and Dissertations Repository
Please use this Handle URI to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98910
Full metadata record
DC Field: Value [Language]
dc.contributor.advisor: 徐宏民 [zh_TW]
dc.contributor.advisor: Winston H. Hsu [en]
dc.contributor.author: 林資融 [zh_TW]
dc.contributor.author: Tzu-Jung Lin [en]
dc.date.accessioned: 2025-08-20T16:15:10Z
dc.date.available: 2025-08-21
dc.date.copyright: 2025-08-20
dc.date.issued: 2025
dc.date.submitted: 2025-08-11
dc.identifier.citation:
[1] Bolte, B., Wang, A., Yang, J., Mukadam, M., Kalakrishnan, M., & Paxton, C. (2023). Usa-net: Unified semantic and affordance representations for robot memory. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 1–8. IEEE.
[2] Chang, M., Gervet, T., Khanna, M., Yenamandra, S., Shah, D., Min, S. Y., Shah, K., Paxton, C., Gupta, S., Batra, D., Mottaghi, R., Malik, J., & Chaplot, D. S. (2023). Goat: Go to any thing.
[3] Hart, P. E., Nilsson, N. J., & Raphael, B. (1968). A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2), 100–107.
[4] Huang, H., Lin, F., Hu, Y., Wang, S., & Gao, Y. (2024). Copa: General robotic manipulation through spatial constraints of parts with foundation models.
[5] Huang, J., Liang, J., Shi, B., Yang, Y., Liu, Y., Driess, D., Toussaint, M., Fu, K., Lin, K., Liu, Z. (2023). Ok-robot: Zero-shot object navigation using multimodal world models. In Conference on Robot Learning (CoRL).
[6] Huang, J., Yang, D., Driess, D., et al. (2023). Vlmaps: Grounding large language models with spatial maps for navigation. In Conference on Robot Learning (CoRL).
[7] Huang, W., Wang, C., Li, Y., Zhang, R., & Fei-Fei, L. (2024). Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation. arXiv preprint arXiv:2409.01652.
[8] Karaman, S., & Frazzoli, E. (2011). Sampling-based algorithms for optimal motion planning. The International Journal of Robotics Research, 30(7), 846–894.
[9] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., Dollár, P., & Girshick, R. (2023). Segment anything.
[10] Laina, S. B., Boche, S., Papatheodorou, S., Schaefer, S., Jung, J., & Leutenegger, S. (2025). Findanything: Open-vocabulary and object-centric mapping for robot exploration in any environment.
[11] Lee, O. Y., Xie, A., Fang, K., Pertsch, K., & Finn, C. (2025). Affordance-guided reinforcement learning via visual prompting.
[12] Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual instruction tuning.
[13] Minderer, M., Gritsenko, A., Stone, A., Neumann, M., Weissenborn, D., Dosovitskiy, A., Mahendran, A., Arnab, A., Dehghani, M., Shen, Z., Wang, X., Zhai, X., Kipf, T., & Houlsby, N. (2022). Simple open-vocabulary object detection with vision transformers.
[14] Nasiriany, S., Xia, F., Yu, W., Xiao, T., Liang, J., Dasgupta, I., Xie, A., Driess, D., Wahid, A., Xu, Z. (2024). Pivot: Iterative visual prompting elicits actionable knowledge for vlms. arXiv preprint arXiv:2402.07872.
[15] OpenAI. (2023). Gpt-4 technical report. https://arxiv.org/abs/2303.08774 (Accessed: June 6, 2025).
[16] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.-Y., Li, S.-W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., & Bojanowski, P. (2024). DINOv2: Learning robust visual features without supervision.
[17] Qiu, D., Ma, W., Pan, Z., Xiong, H., & Liang, J. (2024). Open-vocabulary mobile manipulation in unseen dynamic environments with 3D semantic maps. arXiv preprint arXiv:2406.18115.
[18] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning transferable visual models from natural language supervision.
[19] Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., Zeng, Z., Zhang, H., Li, F., Yang, J., Li, H., Jiang, Q., & Zhang, L. (2024). Grounded SAM: Assembling open-world models for diverse visual tasks.
[20] Sathyamoorthy, A. J., Weerakoon, K., Elnoor, M., Zore, A., Ichter, B., Xia, F., Tan, J., Yu, W., & Manocha, D. (2024). Convoi: Context-aware navigation using vision language models in outdoor and indoor environments.
[21] Shao, B., Cao, N., Ding, Y., Wang, X., Gu, F., & Chen, C. (2024). Moma-pos: An efficient object-kinematic-aware base placement optimization framework for mobile manipulation.
[22] Singh, A. M., Yang, J., Chen, A., Wu, J., & Finn, C. (2023). Clip-fields: Weakly supervised semantic fields for robotic manipulation. In Conference on Robot Learning (CoRL).
[23] Tan, S., Zhou, D., Shao, X., Wang, J., & Sun, G. (2025). Language-conditioned open-vocabulary mobile manipulation with pretrained models.
[24] Gemini Team, Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K. (2023). Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
[25] Yang, J., Zhang, H., Li, F., Zou, X., Li, C., & Gao, J. (2023). Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v.
[26] Yenamandra, S., Ramachandran, A., Yadav, K., Wang, A., Khanna, M., Gervet, T., Yang, T.-Y., Jain, V., Clegg, A. W., Turner, J., et al. (2023). Homerobot: Open-vocabulary mobile manipulation. arXiv preprint arXiv:2306.11565.
[27] Zhang, P., Gao, X., Wu, Y., Liu, K., Wang, D., Wang, Z., Zhao, B., Ding, Y., & Li, X. (2025). Moma-kitchen: A 100k+ benchmark for affordance-grounded last-mile navigation in mobile manipulation.
[28] Zhi, P., Zhang, Z., Zhao, Y., Han, M., Zhang, Z., Li, Z., Jiao, Z., Jia, B., & Huang, S. (2025). Closed-loop open-vocabulary mobile manipulation with gpt-4v.
[29] Zhu, J., Du, Z., Xu, H., Lan, F., Zheng, Z., Ma, B., Wang, S., & Zhang, T. (2024). Navi2gaze: Leveraging foundation models for navigation and target gazing.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98910
dc.description.abstract: 在開放詞彙的移動操作中,任務是否成功往往取決於機器人基座位置的選擇。現有的方法通常只會導航至接近目標的位置,卻沒有考慮可供性(物體或場景能提供的可能操作方式),導致操作失敗的情況經常發生。我們提出一種新的零樣本基座選擇框架,稱為「可供性引導的由粗至細探索」。該方法透過視覺-語言模型提供的語意理解,結合幾何可行性,進行迭代式優化。我們構建了兩種跨模態表示,分別是「可供性RGB圖」與「障礙地圖+」,用來將語意與空間資訊結合,使推理能突破RGB視角的限制。為了讓機器人的操作與任務所需的可供性相符,我們利用VLM提供的粗略語意先驗,引導搜尋過程集中在與任務相關的區域,並透過幾何限制進一步細化機器人的基座位置,降低陷入局部最佳解的風險。我們在五個不同類型的開放詞彙移動操作任務中對系統進行測試,達到了85%的成功率,顯著優於傳統幾何規劃器和基於VLM的方法。這顯示了可供性感知與多模態推理在開放詞彙移動操作中的廣泛應用潛力,並能實現具泛化能力、依指令執行的智能規劃。 [zh_TW]
dc.description.abstract: In open-vocabulary mobile manipulation (OVMM), task success often hinges on the selection of an appropriate base placement for the robot. Existing approaches typically navigate to proximity-based regions without considering affordances, resulting in frequent manipulation failures. We propose Affordance-Guided Coarse-to-Fine Exploration, a zero-shot framework for base placement that integrates semantic understanding from vision-language models (VLMs) with geometric feasibility through an iterative optimization process. Our method constructs cross-modal representations, namely Affordance RGB and Obstacle Map+, to align semantics with spatial context. This enables reasoning that extends beyond the egocentric limitations of RGB perception. To ensure interaction is guided by task-relevant affordances, we leverage coarse semantic priors from VLMs to guide the search toward task-relevant regions and refine placements with geometric constraints, thereby reducing the risk of convergence to local optima. Evaluated on five diverse open-vocabulary mobile manipulation tasks, our system achieves an 85% success rate, significantly outperforming classical geometric planners and VLM-based methods. This demonstrates the promise of affordance-aware and multimodal reasoning for generalizable, instruction-conditioned planning in OVMM. [en]
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-08-20T16:15:10Z. No. of bitstreams: 0 [en]
dc.description.provenance: Made available in DSpace on 2025-08-20T16:15:10Z (GMT). No. of bitstreams: 0 [en]
dc.description.tableofcontents:
Acknowledgements .............................................. i
摘要 .......................................................... iii
Abstract ...................................................... v
Contents ...................................................... vii
List of Figures ............................................... ix
List of Tables ................................................ xiii
Chapter 1 Introduction ....................................... 1
1.1 Introduction .................................................... 1
1.2 Related Work .................................................... 3
Chapter 2 Preliminaries ...................................... 7
2.1 Pipeline Overview ............................................... 7
2.2 Problem Statement ............................................... 8
Chapter 3 Method ............................................ 11
3.1 Affordance Guidance Projection .................................. 12
3.2 Affordance-Driven Coarse-to-Fine Optimization ................... 13
3.2.1 Affordance Point Selection ................................ 14
3.2.2 Iterative Optimization .................................... 14
Chapter 4 Experiments ....................................... 19
4.1 Experimental Setup .............................................. 19
4.2 Task Description ................................................ 19
4.3 Baseline Methods ................................................ 20
4.4 Comparison Results .............................................. 21
4.5 Alpha Comparison ................................................ 23
4.6 Ablation on Affordance Guidance Projection ...................... 25
4.7 Conclusion ...................................................... 26
References .................................................. 29
Appendix A — More Method Details ............................ 33
A.1 Map Representation .............................................. 33
A.2 Affordance Guidance Projection .................................. 33
A.3 Affordance Point Selection ...................................... 36
A.4 Baseline Methods ................................................ 37
A.5 Base placement distribution evolution ........................... 38
dc.language.iso: en
dc.subject: 視覺語言模型 [zh_TW]
dc.subject: 底座定位 [zh_TW]
dc.subject: 開放詞彙移動操作 [zh_TW]
dc.subject: Open-vocabulary mobile manipulation [en]
dc.subject: Base placement [en]
dc.subject: Vision-language models [en]
dc.title: 基於可供性引導的粗至細探索方法應用於移動操作之基座定位 [zh_TW]
dc.title: Affordance-Guided Coarse-to-Fine Exploration for Base Placement in Mobile Manipulation [en]
dc.type: Thesis
dc.date.schoolyear: 113-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 陳奕廷; 葉梅珍 [zh_TW]
dc.contributor.oralexamcommittee: Yi-Ting Chen; Mei-Chen Yeh [en]
dc.subject.keyword: 底座定位, 視覺語言模型, 開放詞彙移動操作 [zh_TW]
dc.subject.keyword: Base placement, Vision-language models, Open-vocabulary mobile manipulation [en]
dc.relation.page: 42
dc.identifier.doi: 10.6342/NTU202503918
dc.rights.note: 同意授權(限校園內公開) (authorized; access restricted to campus)
dc.date.accepted: 2025-08-14
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 資訊網路與多媒體研究所 (Graduate Institute of Networking and Multimedia)
dc.date.embargo-lift: 2029-08-05
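The abstract above describes an iterative, coarse-to-fine search in which coarse affordance priors from a VLM steer candidate base placements and geometric constraints prune infeasible ones. The Python sketch below is an editor's minimal illustration of that loop under stated assumptions, not the thesis implementation: score_affordance_with_vlm is a hypothetical stand-in for a VLM query over an Affordance RGB style rendering, and the toy occupancy grid stands in for Obstacle Map+; all parameters are invented for the example.

```python
# Minimal illustrative sketch of an affordance-guided coarse-to-fine
# base-placement search. NOT the thesis code: the VLM scoring function,
# the toy occupancy grid, and all parameters are hypothetical placeholders.
import numpy as np


def score_affordance_with_vlm(candidate_xy, task_prompt):
    """Hypothetical coarse semantic prior: higher means the (assumed) VLM
    judges this base position more suitable for the given task."""
    target = np.array([2.0, 1.0])  # assumed task-relevant location in metres
    return float(np.exp(-np.linalg.norm(candidate_xy - target)))


def geometrically_feasible(candidate_xy, occupancy, resolution):
    """Reject candidates that fall outside the map or on occupied cells."""
    ij = np.floor(candidate_xy / resolution).astype(int)
    h, w = occupancy.shape
    if not (0 <= ij[0] < h and 0 <= ij[1] < w):
        return False
    return occupancy[ij[0], ij[1]] == 0  # 0 = free space


def coarse_to_fine_base_placement(occupancy, resolution, task_prompt,
                                  iterations=4, samples=64, seed=0):
    """Sample base candidates, score them with the semantic prior, keep the
    best feasible one, then shrink the sampling region around it."""
    rng = np.random.default_rng(seed)
    center = np.array(occupancy.shape) * resolution / 2.0  # coarse start
    spread = max(occupancy.shape) * resolution / 2.0
    best_xy, best_score = None, -np.inf
    for _ in range(iterations):
        candidates = center + rng.normal(scale=spread, size=(samples, 2))
        feasible = [c for c in candidates
                    if geometrically_feasible(c, occupancy, resolution)]
        if not feasible:
            spread *= 1.5  # widen the search if everything was infeasible
            continue
        scores = np.array([score_affordance_with_vlm(c, task_prompt)
                           for c in feasible])
        top = int(scores.argmax())
        if scores[top] > best_score:
            best_xy, best_score = feasible[top], float(scores[top])
            center = best_xy  # re-centre on the best placement so far
        spread *= 0.5  # fine stage: shrink the sampling region
    return best_xy, best_score


if __name__ == "__main__":
    grid = np.zeros((40, 40), dtype=int)  # toy 4 m x 4 m map, 0.1 m cells
    grid[18:22, 8:12] = 1                 # a block of occupied cells
    xy, score = coarse_to_fine_base_placement(grid, 0.1, "open the drawer")
    print("chosen base position (m):", xy, "affordance score:", round(score, 3))
```

The shrinking sampling region here only conveys the coarse-to-fine idea; the thesis's actual representations, VLM prompting, and refinement criteria are detailed in Chapter 3 and Appendix A of the work itself.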
Appears in Collections: 資訊網路與多媒體研究所 (Graduate Institute of Networking and Multimedia)

Files in This Item:
File: ntu-113-2.pdf (not authorized for public access)
Size: 6.57 MB
Format: Adobe PDF


All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
