Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98918

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 徐宏民 | zh_TW |
| dc.contributor.advisor | Winston H. Hsu | en |
| dc.contributor.author | 陳姵安 | zh_TW |
| dc.contributor.author | Pei-An Chen | en |
| dc.date.accessioned | 2025-08-20T16:16:57Z | - |
| dc.date.available | 2025-08-26 | - |
| dc.date.copyright | 2025-08-20 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-08-11 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98918 | - |
| dc.description.abstract | 具身機器人不應只是單純地依照指令執行,因為真實世界的環境常常充滿了意料之外的情況與例外。然而,現有的方法多半僅著重於直接執行指令,卻忽略了目標物件是否實際可操作,亦即缺乏對可用性的判斷能力。為了解決這一限制,我們提出 ADAPT,一個實用的基準測試(benchmark),用以挑戰機器人超越單純指令執行的能力。在 ADAPT 中,物件可能處於不可使用的狀態,例如正在被使用或是髒污,且這些資訊並未明確地在指令中指出。機器人必須能夠感知環境、辨識物件狀態,並規劃出使物件回復可用狀態的行動,例如在將髒叉子放進杯子前先將其清潔。為了實現這項能力,我們提出一個具備模組化設計、可獨立整合的架構,稱為可用性推理與動作感知記憶模組(Affordance Reasoner and Action-aware Memory, ARAM)。該模組能幫助機器人推理物件的可用性,並應用常識型規劃來偵測不可用狀態並產生替代方案。當 ARAM 整合至 FILM 系統中時,在任務成功率上可達 70% 的相對提升,在目標完成度上可提升 33%,顯示其在動態環境中能顯著增強任務完成表現與規劃效率。 | zh_TW |
| dc.description.abstract | Intelligent embodied agents should not simply follow instructions, as real-world environments often involve unexpected conditions and exceptions. However, existing methods usually focus on directly executing instructions without considering whether the target objects can actually be manipulated; that is, they lack the ability to assess object affordances. To address this limitation, we introduce ADAPT, a practical benchmark that challenges agents to go beyond simple instruction-following. In ADAPT, objects may be in unusable states, such as being in use or dirty, without this information being specified in the instructions. Agents must perceive the environment, recognize the object’s condition, and plan actions to restore the object to a usable state, such as cleaning a dirty fork before placing it into a cup (an illustrative sketch of this check-and-recover loop follows the metadata table below). To enable this capability, we propose a plug-and-play module called Affordance Reasoner and Action-aware Memory (ARAM). The module helps the agent reason about object affordances and apply commonsense planning to detect unusable states and generate alternative solutions. When integrated with FILM, ARAM achieves up to a 70% relative improvement in task success and a 33% improvement in goal completion, demonstrating its ability to enhance both task completion and planning efficiency in dynamic environments. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-08-20T16:16:56Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-08-20T16:16:57Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Acknowledgements; 摘要; Abstract; Contents; List of Figures; List of Tables; Chapter 1 Introduction; Chapter 2 Related Work (2.1 Embodied Instruction Following; 2.2 Commonsense Reasoning in Robotics: 2.2.1 Knowledge-Based Reasoning, 2.2.2 Language Model-Based Reasoning; 2.3 Affordance Reasoning in Robotics); Chapter 3 The ADAPT Benchmark (3.1 Problem Statement; 3.2 Dataset Construction; 3.3 Data Splits); Chapter 4 Method (4.1 Affordance Reasoner: 4.1.1 Trigger Condition, 4.1.2 LoRA Fine-Tuned LLaVA, 4.1.3 Multimodal In-Context Learning; 4.2 Action-Aware Memory); Chapter 5 Experiments (5.1 Experiment Setup; 5.2 Evaluation Metrics; 5.3 Evaluation Splits; 5.4 Baseline Methods: 5.4.1 Supervised Methods, 5.4.2 Few-Shot Methods; 5.5 Main Results; 5.6 Ablation Study; 5.7 Case Studies); Chapter 6 Conclusion (6.1 Limitations and Future Work); References; Appendix A — Embodied AI Benchmarks: A Comparative Perspective; Appendix B — Task Distribution and Evaluation Results (B.1 Static object affordance tasks, B.2 Dynamic object affordance tasks); Appendix C — Dataset Construction Details (C.1 Expert Demonstrations, C.2 Object Affordance Setting, C.3 Task Complexity, C.4 Annotation Process); Appendix D — Affordance Reasoning Capability; Appendix E — Affordance Reasoner Implementation (E.1 Visibility Detection, E.2 Affordance Detection); Appendix F — Action-Aware Memory Implementation (F.1 High-level Action Replanning, F.2 Replanning and Recovery Strategies); Appendix G — Case Study; Appendix H — Failure Case; Appendix I — Code and Data Availability (I.1 Future Release, I.2 Fine-Tuning and Computing Infrastructure Details) | - |
| dc.language.iso | en | - |
| dc.subject | 具身人工智慧 | zh_TW |
| dc.subject | 移動式操控 | zh_TW |
| dc.subject | 物體可供性任務規劃 | zh_TW |
| dc.subject | 多模態情境學習 | zh_TW |
| dc.subject | Mobile Manipulation | en |
| dc.subject | Affordance-Aware Task Planning | en |
| dc.subject | Embodied AI | en |
| dc.subject | Multimodal In-Context Learning | en |
| dc.title | 動態環境中基於常識及物體可供性的任務規劃 | zh_TW |
| dc.title | ADAPT: Commonsense and Affordance-aware Planning in Dynamic Environments | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-2 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.oralexamcommittee | 葉梅珍;陳奕廷 | zh_TW |
| dc.contributor.oralexamcommittee | Mei-Chen Yeh;Yi-Ting Chen | en |
| dc.subject.keyword | 具身人工智慧,移動式操控,物體可供性任務規劃,多模態情境學習 | zh_TW |
| dc.subject.keyword | Embodied AI, Mobile Manipulation, Affordance-Aware Task Planning, Multimodal In-Context Learning | en |
| dc.relation.page | 57 | - |
| dc.identifier.doi | 10.6342/NTU202503907 | - |
| dc.rights.note | 同意授權(限校園內公開) | - |
| dc.date.accepted | 2025-08-14 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 資訊工程學系 | - |
| dc.date.embargo-lift | 2025-08-26 | - |
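
The abstract describes ARAM's behavior only at a high level: detect that a target object is in an unusable state (e.g., dirty or in use), insert commonsense recovery actions, and then continue the original plan. The Python sketch below illustrates one way such a check-and-recover loop could be structured. It is a minimal sketch under stated assumptions, not the thesis's actual implementation: the helper callables (`observe`, `state_of`, `execute`), the `make_recovery` rules, and the AI2-THOR-style action names are hypothetical placeholders.

```python
# Minimal, hypothetical sketch of an affordance-check-and-recover loop.
# The helper callables and action names are illustrative assumptions,
# not the ARAM API described in the thesis.
from dataclasses import dataclass
from typing import List


@dataclass
class Subgoal:
    action: str                # e.g. "PutObject"
    target: str                # e.g. "Fork"
    receptacle: str = ""       # e.g. "Cup"
    is_recovery: bool = False  # skip the affordance check on inserted recovery steps


def make_recovery(state: str, obj: str) -> List[Subgoal]:
    """Commonsense recovery subgoals for an unusable object state (illustrative only)."""
    if state == "dirty":
        steps = [Subgoal("PickupObject", obj),
                 Subgoal("PutObject", obj, "SinkBasin"),
                 Subgoal("ToggleObjectOn", "Faucet"),
                 Subgoal("ToggleObjectOff", "Faucet")]
    elif state == "in_use":
        steps = [Subgoal("Wait", obj)]
    else:
        steps = []
    for s in steps:
        s.is_recovery = True
    return steps


def run_plan(plan: List[Subgoal], observe, state_of, execute) -> None:
    """Execute a high-level plan, prepending recovery subgoals whenever the
    affordance check reports the target object as unusable."""
    queue = list(plan)
    while queue:
        step = queue[0]
        if not step.is_recovery:
            obs = observe()                      # current egocentric observation
            status = state_of(obs, step.target)  # e.g. "usable", "dirty", "in_use"
            recovery = make_recovery(status, step.target)
            if recovery:
                # An action-aware memory would additionally record these inserted
                # steps so the same recovery is not re-planned on later failures.
                queue = recovery + queue
                continue
        execute(step)
        queue.pop(0)
```

For example, given the plan `[Subgoal("PickupObject", "Fork"), Subgoal("PutObject", "Fork", "Cup")]`, a `state_of` result of `"dirty"` would cause the washing subgoals to run before the fork is placed into the cup, matching the fork-and-cup scenario in the abstract.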
Appears in collections: 資訊工程學系 (Department of Computer Science and Information Engineering)
Files in this item:

| File | Size | Format |
|---|---|---|
| ntu-113-2.pdf (access restricted to NTU campus IP addresses; off-campus users should connect via the NTU VPN service) | 4.67 MB | Adobe PDF |
