NTU Theses and Dissertations Repository
Please use this Handle URI to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98910
Full metadata record
DC Field: Value [Language]
dc.contributor.advisor: 徐宏民 [zh_TW]
dc.contributor.advisor: Winston H. Hsu [en]
dc.contributor.author: 林資融 [zh_TW]
dc.contributor.author: Tzu-Jung Lin [en]
dc.date.accessioned: 2025-08-20T16:15:10Z
dc.date.available: 2025-08-21
dc.date.copyright: 2025-08-20
dc.date.issued: 2025
dc.date.submitted: 2025-08-11
dc.identifier.citation:
[1] Bolte, B., Wang, A., Yang, J., Mukadam, M., Kalakrishnan, M., & Paxton, C. (2023). Usa-net: Unified semantic and affordance representations for robot memory. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 1–8. IEEE.
[2] Chang, M., Gervet, T., Khanna, M., Yenamandra, S., Shah, D., Min, S. Y., Shah, K., Paxton, C., Gupta, S., Batra, D., Mottaghi, R., Malik, J., & Chaplot, D. S. (2023). Goat: Go to any thing.
[3] Hart, P. E., Nilsson, N. J., & Raphael, B. (1968). A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2), 100–107.
[4] Huang, H., Lin, F., Hu, Y., Wang, S., & Gao, Y. (2024). Copa: General robotic manipulation through spatial constraints of parts with foundation models.
[5] Huang, J., Liang, J., Shi, B., Yang, Y., Liu, Y., Driess, D., Toussaint, M., Fu, K., Lin, K., Liu, Z. (2023). Ok-robot: Zero-shot object navigation using multimodal world models. In Conference on Robot Learning (CoRL).
[6] Huang, J., Yang, D., Driess, D., et al. (2023). Vlmaps: Grounding large language models with spatial maps for navigation. In Conference on Robot Learning (CoRL).
[7] Huang, W., Wang, C., Li, Y., Zhang, R., & Fei-Fei, L. (2024). Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation. arXiv preprint arXiv:2409.01652.
[8] Karaman, S., & Frazzoli, E. (2011). Sampling-based algorithms for optimal motion planning. The International Journal of Robotics Research, 30(7), 846–894.
[9] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., Dollár, P., & Girshick, R. (2023). Segment anything.
[10] Laina, S. B., Boche, S., Papatheodorou, S., Schaefer, S., Jung, J., & Leutenegger, S. (2025). Findanything: Open-vocabulary and object-centric mapping for robot exploration in any environment.
[11] Lee, O. Y., Xie, A., Fang, K., Pertsch, K., & Finn, C. (2025). Affordance-guided reinforcement learning via visual prompting.
[12] Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual instruction tuning.
[13] Minderer, M., Gritsenko, A., Stone, A., Neumann, M., Weissenborn, D., Dosovitskiy, A., Mahendran, A., Arnab, A., Dehghani, M., Shen, Z., Wang, X., Zhai, X., Kipf, T., & Houlsby, N. (2022). Simple open-vocabulary object detection with vision transformers.
[14] Nasiriany, S., Xia, F., Yu, W., Xiao, T., Liang, J., Dasgupta, I., Xie, A., Driess, D., Wahid, A., Xu, Z. (2024). Pivot: Iterative visual prompting elicits actionable knowledge for vlms. arXiv preprint arXiv:2402.07872.
[15] OpenAI. (2023). Gpt-4 technical report. https://arxiv.org/abs/2303.08774 (Accessed: June 6, 2025).
[16] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.-Y., Li, S.-W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., & Bojanowski, P. (2024). DINOv2: Learning robust visual features without supervision.
[17] Qiu, D., Ma, W., Pan, Z., Xiong, H., & Liang, J. (2024). Open-vocabulary mobile manipulation in unseen dynamic environments with 3D semantic maps. arXiv preprint arXiv:2406.18115.
[18] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning transferable visual models from natural language supervision.
[19] Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., Zeng, Z., Zhang, H., Li, F., Yang, J., Li, H., Jiang, Q., & Zhang, L. (2024). Grounded SAM: Assembling open-world models for diverse visual tasks.
[20] Sathyamoorthy, A. J., Weerakoon, K., Elnoor, M., Zore, A., Ichter, B., Xia, F., Tan, J., Yu, W., & Manocha, D. (2024). Convoi: Context-aware navigation using vision language models in outdoor and indoor environments.
[21] Shao, B., Cao, N., Ding, Y., Wang, X., Gu, F., & Chen, C. (2024). Moma-pos: An efficient object-kinematic-aware base placement optimization framework for mobile manipulation.
[22] Singh, A. M., Yang, J., Chen, A., Wu, J., & Finn, C. (2023). Clip-fields: Weakly supervised semantic fields for robotic manipulation. In Conference on Robot Learning (CoRL).
[23] Tan, S., Zhou, D., Shao, X., Wang, J., & Sun, G. (2025). Language-conditioned open-vocabulary mobile manipulation with pretrained models.
[24] Gemini Team, Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K. (2023). Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
[25] Yang, J., Zhang, H., Li, F., Zou, X., Li, C., & Gao, J. (2023). Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v.
[26] Yenamandra, S., Ramachandran, A., Yadav, K., Wang, A., Khanna, M., Gervet, T., Yang, T.-Y., Jain, V., Clegg, A. W., Turner, J., et al. (2023). Homerobot: Open-vocabulary mobile manipulation. arXiv preprint arXiv:2306.11565.
[27] Zhang, P., Gao, X., Wu, Y., Liu, K., Wang, D., Wang, Z., Zhao, B., Ding, Y., & Li, X. (2025). Moma-kitchen: A 100k+ benchmark for affordance-grounded last-mile navigation in mobile manipulation.
[28] Zhi, P., Zhang, Z., Zhao, Y., Han, M., Zhang, Z., Li, Z., Jiao, Z., Jia, B., & Huang, S. (2025). Closed-loop open-vocabulary mobile manipulation with gpt-4v.
[29] Zhu, J., Du, Z., Xu, H., Lan, F., Zheng, Z., Ma, B., Wang, S., & Zhang, T. (2024). Navi2gaze: Leveraging foundation models for navigation and target gazing.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98910
dc.description.abstract: 在開放詞彙的移動操作中,任務是否成功往往取決於機器人基座位置的選擇。現有的方法通常只會導航至接近目標的位置,卻沒有考慮可供性(物體或場景能提供的可能操作方式),導致操作失敗的情況經常發生。我們提出一種新的零樣本基座選擇框架,稱為「可供性引導的由粗至細探索」。該方法透過視覺-語言模型提供的語意理解,結合幾何可行性,進行迭代式優化。我們構建了兩種跨模態表示,分別是「可供性RGB圖」與「障礙地圖+」,用來將語意與空間資訊結合,使推理能突破RGB視角的限制。為了讓機器人的操作與任務所需的可供性相符,我們利用VLM提供的粗略語意先驗,引導搜尋過程集中在與任務相關的區域,並透過幾何限制進一步細化機器人的基座位置,降低陷入局部最佳解的風險。我們在五個不同類型的開放詞彙移動操作任務中對系統進行測試,達到了85%的成功率,顯著優於傳統幾何規劃器和基於VLM的方法。這顯示了可供性感知與多模態推理在開放詞彙移動操作中的廣泛應用潛力,並能實現具泛化能力、依指令執行的智能規劃。 [zh_TW]
dc.description.abstract: In open-vocabulary mobile manipulation (OVMM), task success often hinges on the selection of an appropriate base placement for the robot. Existing approaches typically navigate to proximity-based regions without considering affordances, resulting in frequent manipulation failures. We propose Affordance-Guided Coarse-to-Fine Exploration, a zero-shot framework for base placement that integrates semantic understanding from vision-language models (VLMs) with geometric feasibility through an iterative optimization process. Our method constructs cross-modal representations, namely Affordance RGB and Obstacle Map+, to align semantics with spatial context. This enables reasoning that extends beyond the egocentric limitations of RGB perception. To ensure interaction is guided by task-relevant affordances, we leverage coarse semantic priors from VLMs to guide the search toward task-relevant regions and refine placements with geometric constraints, thereby reducing the risk of convergence to local optima. Evaluated on five diverse open-vocabulary mobile manipulation tasks, our system achieves an 85% success rate, significantly outperforming classical geometric planners and VLM-based methods. This demonstrates the promise of affordance-aware and multimodal reasoning for generalizable, instruction-conditioned planning in OVMM. [en]
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-08-20T16:15:10Z. No. of bitstreams: 0 [en]
dc.description.provenance: Made available in DSpace on 2025-08-20T16:15:10Z (GMT). No. of bitstreams: 0 [en]
dc.description.tableofcontents:
Acknowledgements .............................................. i
摘要 .......................................................... iii
Abstract ...................................................... v
Contents ...................................................... vii
List of Figures ............................................... ix
List of Tables ................................................ xiii
Chapter 1 Introduction ....................................... 1
1.1 Introduction .................................................... 1
1.2 Related Work .................................................... 3
Chapter 2 Preliminaries ...................................... 7
2.1 Pipeline Overview ............................................... 7
2.2 Problem Statement ............................................... 8
Chapter 3 Method ............................................ 11
3.1 Affordance Guidance Projection .................................. 12
3.2 Affordance-Driven Coarse-to-Fine Optimization ................... 13
3.2.1 Affordance Point Selection ................................ 14
3.2.2 Iterative Optimization .................................... 14
Chapter 4 Experiments ....................................... 19
4.1 Experimental Setup .............................................. 19
4.2 Task Description ................................................ 19
4.3 Baseline Methods ................................................ 20
4.4 Comparison Results .............................................. 21
4.5 Alpha Comparison ................................................ 23
4.6 Ablation on Affordance Guidance Projection ...................... 25
4.7 Conclusion ...................................................... 26
References .................................................. 29
Appendix A — More Method Details ............................ 33
A.1 Map Representation .............................................. 33
A.2 Affordance Guidance Projection .................................. 33
A.3 Affordance Point Selection ...................................... 36
A.4 Baseline Methods ................................................ 37
A.5 Base placement distribution evolution ........................... 38
dc.language.iso: en
dc.subject: 視覺語言模型 [zh_TW]
dc.subject: 底座定位 [zh_TW]
dc.subject: 開放詞彙移動操作 [zh_TW]
dc.subject: Open-vocabulary mobile manipulation [en]
dc.subject: Base placement [en]
dc.subject: Vision-language models [en]
dc.title: 基於可供性引導的粗至細探索方法應用於移動操作之基座定位 [zh_TW]
dc.title: Affordance-Guided Coarse-to-Fine Exploration for Base Placement in Mobile Manipulation [en]
dc.type: Thesis
dc.date.schoolyear: 113-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 陳奕廷; 葉梅珍 [zh_TW]
dc.contributor.oralexamcommittee: Yi-Ting Chen; Mei-Chen Yeh [en]
dc.subject.keyword: 底座定位, 視覺語言模型, 開放詞彙移動操作 [zh_TW]
dc.subject.keyword: Base placement, Vision-language models, Open-vocabulary mobile manipulation [en]
dc.relation.page: 42
dc.identifier.doi: 10.6342/NTU202503918
dc.rights.note: 同意授權(限校園內公開) (authorized; access restricted to campus)
dc.date.accepted: 2025-08-14
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 資訊網路與多媒體研究所 (Graduate Institute of Networking and Multimedia)
dc.date.embargo-lift: 2029-08-05
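The abstract above describes an iterative, coarse-to-fine search in which coarse affordance priors from a VLM steer candidate base placements and geometric constraints prune infeasible ones. The Python sketch below is an editor's minimal illustration of that loop under stated assumptions, not the thesis implementation: score_affordance_with_vlm is a hypothetical stand-in for a VLM query over an Affordance RGB style rendering, and the toy occupancy grid stands in for Obstacle Map+; all parameters are invented for the example.

```python
# Minimal illustrative sketch of an affordance-guided coarse-to-fine
# base-placement search. NOT the thesis code: the VLM scoring function,
# the toy occupancy grid, and all parameters are hypothetical placeholders.
import numpy as np


def score_affordance_with_vlm(candidate_xy, task_prompt):
    """Hypothetical coarse semantic prior: higher means the (assumed) VLM
    judges this base position more suitable for the given task."""
    target = np.array([2.0, 1.0])  # assumed task-relevant location in metres
    return float(np.exp(-np.linalg.norm(candidate_xy - target)))


def geometrically_feasible(candidate_xy, occupancy, resolution):
    """Reject candidates that fall outside the map or on occupied cells."""
    ij = np.floor(candidate_xy / resolution).astype(int)
    h, w = occupancy.shape
    if not (0 <= ij[0] < h and 0 <= ij[1] < w):
        return False
    return occupancy[ij[0], ij[1]] == 0  # 0 = free space


def coarse_to_fine_base_placement(occupancy, resolution, task_prompt,
                                  iterations=4, samples=64, seed=0):
    """Sample base candidates, score them with the semantic prior, keep the
    best feasible one, then shrink the sampling region around it."""
    rng = np.random.default_rng(seed)
    center = np.array(occupancy.shape) * resolution / 2.0  # coarse start
    spread = max(occupancy.shape) * resolution / 2.0
    best_xy, best_score = None, -np.inf
    for _ in range(iterations):
        candidates = center + rng.normal(scale=spread, size=(samples, 2))
        feasible = [c for c in candidates
                    if geometrically_feasible(c, occupancy, resolution)]
        if not feasible:
            spread *= 1.5  # widen the search if everything was infeasible
            continue
        scores = np.array([score_affordance_with_vlm(c, task_prompt)
                           for c in feasible])
        top = int(scores.argmax())
        if scores[top] > best_score:
            best_xy, best_score = feasible[top], float(scores[top])
            center = best_xy  # re-centre on the best placement so far
        spread *= 0.5  # fine stage: shrink the sampling region
    return best_xy, best_score


if __name__ == "__main__":
    grid = np.zeros((40, 40), dtype=int)  # toy 4 m x 4 m map, 0.1 m cells
    grid[18:22, 8:12] = 1                 # a block of occupied cells
    xy, score = coarse_to_fine_base_placement(grid, 0.1, "open the drawer")
    print("chosen base position (m):", xy, "affordance score:", round(score, 3))
```

The shrinking sampling region here only conveys the coarse-to-fine idea; the thesis's actual representations, VLM prompting, and refinement criteria are detailed in Chapter 3 and Appendix A of the work itself.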
Appears in Collections: 資訊網路與多媒體研究所 (Graduate Institute of Networking and Multimedia)

Files in This Item:
File: ntu-113-2.pdf (not authorized for public access)
Size: 6.57 MB
Format: Adobe PDF


All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
