Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98910

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 徐宏民 | zh_TW |
| dc.contributor.advisor | Winston H. Hsu | en |
| dc.contributor.author | 林資融 | zh_TW |
| dc.contributor.author | Tzu-Jung Lin | en |
| dc.date.accessioned | 2025-08-20T16:15:10Z | - |
| dc.date.available | 2025-08-21 | - |
| dc.date.copyright | 2025-08-20 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-08-11 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98910 | - |
| dc.description.abstract | 在開放詞彙的移動操作中,任務是否成功往往取決於機器人基座位置的選擇。現有的方法通常只會導航至接近目標的位置,卻沒有考慮可供性(物體或場景能提供的可能操作方式),導致操作失敗的情況經常發生。我們提出一種新的零樣本基座選擇框架,稱為「可供性引導的由粗至細探索」。該方法透過視覺-語言模型提供的語意理解,結合幾何可行性,進行迭代式優化。我們構建了兩種跨模態表示,分別是「可供性RGB圖」與「障礙地圖+」,用來將語意與空間資訊結合,使推理能突破RGB視角的限制。為了讓機器人的操作與任務所需的可供性相符,我們利用VLM提供的粗略語意先驗,引導搜尋過程集中在與任務相關的區域,並透過幾何限制進一步細化機器人的基座位置,降低陷入局部最佳解的風險。我們在五個不同類型的開放詞彙移動操作任務中對系統進行測試,達到了85%的成功率,顯著優於傳統幾何規劃器和基於VLM的方法。這顯示了可供性感知與多模態推理在開放詞彙移動操作中的廣泛應用潛力,並能實現具泛化能力、依指令執行的智能規劃。 | zh_TW |
| dc.description.abstract | In open-vocabulary mobile manipulation (OVMM), task success often hinges on the selection of an appropriate base placement for the robot. Existing approaches typically navigate to proximity-based regions without considering affordances, resulting in frequent manipulation failures. We propose Affordance-Guided Coarse-to-Fine Exploration, a zero-shot framework for base placement that integrates semantic understanding from vision-language models (VLMs) with geometric feasibility through an iterative optimization process. Our method constructs cross-modal representations, namely Affordance RGB and Obstacle Map+, to align semantics with spatial context. This enables reasoning that extends beyond the egocentric limitations of RGB perception. To ensure interaction is guided by task-relevant affordances, we leverage coarse semantic priors from VLMs to guide the search toward task-relevant regions and refine placements with geometric constraints, thereby reducing the risk of convergence to local optima. Evaluated on five diverse open-vocabulary mobile manipulation tasks, our system achieves an 85% success rate, significantly outperforming classical geometric planners and VLM-based methods. This demonstrates the promise of affordance-aware and multimodal reasoning for generalizable, instruction-conditioned planning in OVMM. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-08-20T16:15:10Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-08-20T16:15:10Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Acknowledgements (i); 摘要 (iii); Abstract (v); Contents (vii); List of Figures (ix); List of Tables (xiii); Chapter 1 Introduction (1): 1.1 Introduction (1), 1.2 Related Work (3); Chapter 2 Preliminaries (7): 2.1 Pipeline Overview (7), 2.2 Problem Statement (8); Chapter 3 Method (11): 3.1 Affordance Guidance Projection (12), 3.2 Affordance-Driven Coarse-to-Fine Optimization (13), 3.2.1 Affordance Point Selection (14), 3.2.2 Iterative Optimization (14); Chapter 4 Experiments (19): 4.1 Experimental Setup (19), 4.2 Task Description (19), 4.3 Baseline Methods (20), 4.4 Comparison Results (21), 4.5 Alpha Comparison (23), 4.6 Ablation on Affordance Guidance Projection (25), 4.7 Conclusion (26); References (29); Appendix A - More Method Details (33): A.1 Map Representation (33), A.2 Affordance Guidance Projection (33), A.3 Affordance Point Selection (36), A.4 Baseline Methods (37), A.5 Base Placement Distribution Evolution (38) | - |
| dc.language.iso | en | - |
| dc.subject | 視覺語言模型 | zh_TW |
| dc.subject | 底座定位 | zh_TW |
| dc.subject | 開放詞彙移動操作 | zh_TW |
| dc.subject | Open-vocabulary mobile manipulation | en |
| dc.subject | Base placement | en |
| dc.subject | Vision-language models | en |
| dc.title | 基於可供性引導的粗至細探索方法應用於移動操作之基座定位 | zh_TW |
| dc.title | Affordance-Guided Coarse-to-Fine Exploration for Base Placement in Mobile Manipulation | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-2 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.oralexamcommittee | 陳奕廷;葉梅珍 | zh_TW |
| dc.contributor.oralexamcommittee | Yi-Ting Chen;Mei-Chen Yeh | en |
| dc.subject.keyword | 底座定位,視覺語言模型,開放詞彙移動操作 | zh_TW |
| dc.subject.keyword | Base placement, Vision-language models, Open-vocabulary mobile manipulation | en |
| dc.relation.page | 42 | - |
| dc.identifier.doi | 10.6342/NTU202503918 | - |
| dc.rights.note | 同意授權(限校園內公開) | - |
| dc.date.accepted | 2025-08-14 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 資訊網路與多媒體研究所 | - |
| dc.date.embargo-lift | 2029-08-05 | - |
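
The English abstract above summarizes an affordance-guided, coarse-to-fine search over base placements that mixes a VLM-derived semantic prior with geometric feasibility (the trade-off weight is the "alpha" examined in Section 4.5 of the table of contents). Purely as an illustration of that idea, here is a minimal, self-contained sketch of what such a loop could look like. It is not the thesis's implementation: every helper (`semantic_prior`, `geometrically_feasible`), the disc-obstacle map, and all sampling and shrink parameters are hypothetical stand-ins.

```python
# Illustrative sketch only: a generic coarse-to-fine base-placement search of the
# kind summarized in the abstract. All helpers and parameters are hypothetical.
import math
import random


def semantic_prior(x, y, affordance_point):
    """Hypothetical VLM-derived affordance score: higher when the candidate base
    is close to a task-relevant affordance point projected onto the map."""
    ax, ay = affordance_point
    return math.exp(-((x - ax) ** 2 + (y - ay) ** 2) / 2.0)


def geometrically_feasible(x, y, obstacles, robot_radius=0.3):
    """Hypothetical obstacle-map check: reject candidates that collide with
    any obstacle disc (cx, cy, r) on a simplified 2D occupancy abstraction."""
    return all(math.hypot(x - cx, y - cy) > r + robot_radius
               for cx, cy, r in obstacles)


def coarse_to_fine_base_placement(affordance_point, obstacles,
                                  workspace=((-3.0, 3.0), (-3.0, 3.0)),
                                  n_samples=200, n_iters=4, shrink=0.5,
                                  alpha=0.7, seed=0):
    """Sample base candidates, score them with a weighted mix of the semantic
    prior (weight alpha) and geometric clearance, then shrink the search window
    around the best candidate and repeat."""
    rng = random.Random(seed)
    (xmin, xmax), (ymin, ymax) = workspace
    best = None
    for _ in range(n_iters):
        scored = []
        for _ in range(n_samples):
            x = rng.uniform(xmin, xmax)
            y = rng.uniform(ymin, ymax)
            if not geometrically_feasible(x, y, obstacles):
                continue
            clearance = (min(math.hypot(x - cx, y - cy) - r
                             for cx, cy, r in obstacles) if obstacles else 1.0)
            score = alpha * semantic_prior(x, y, affordance_point) \
                    + (1 - alpha) * clearance
            scored.append((score, x, y))
        if not scored:
            break
        iter_best = max(scored)          # tuple compares by score first
        if best is None or iter_best[0] > best[0]:
            best = iter_best
        # Refine: shrink the sampling window around this iteration's best placement.
        _, bx, by = iter_best
        half_w = (xmax - xmin) * shrink / 2
        half_h = (ymax - ymin) * shrink / 2
        xmin, xmax = bx - half_w, bx + half_w
        ymin, ymax = by - half_h, by + half_h
    return None if best is None else (best[1], best[2])


if __name__ == "__main__":
    # Example: one affordance point near a counter edge and two obstacle discs.
    placement = coarse_to_fine_base_placement(
        affordance_point=(1.0, 0.5),
        obstacles=[(1.0, 1.5, 0.6), (-0.5, -0.5, 0.4)])
    print("selected base placement:", placement)
```

In this sketch, alpha plays the role of the semantic-versus-geometric trade-off; the actual method additionally reasons over the Affordance RGB and Obstacle Map+ representations described in the abstract, which this toy 2D disc model does not attempt to reproduce.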
Appears in collections: 資訊網路與多媒體研究所

Files in this item:
| File | Size | Format |
|---|---|---|
| ntu-113-2.pdf (restricted campus-only access) | 6.57 MB | Adobe PDF |
