Please use this identifier to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90532
Title: | 基於視覺語言模型及認知地圖之機器人室內環境問答系統 A Robot System for Indoor Environment Question Answering with Cognitive Map Leveraging Vision-language Models |
Authors: | 涂志宏 Chih-Hung Tu |
Advisor: | 傅立成 Li-Chen Fu |
Keyword: | Cognitive Map, Place Recognition, Robot Environmental Cognition, Environment Question Answering, Cognitive Robot |
Publication Year: | 2023 |
Degree: | Master's |
Abstract: | With the world's population aging rapidly and the demand for elderly care rising, the need for robots in human society is more urgent than ever. This has given rise to a new task: Embodied Question Answering (EQA). However, previous methods have neither explored how to build an understanding of the environment nor discussed how to effectively leverage existing vision-language models for this task, resulting in low exploration efficiency and limited responses. In recent years, large-scale pre-trained vision-language models have made remarkable progress and begun to demonstrate strong visual-language understanding, opening the door for robots to enter human life. However, these models were not designed for robotic tasks and are therefore difficult to use directly, while retraining or fine-tuning them requires substantial resources. How to effectively restructure and utilize these models to further enhance robot capabilities has become a new trend and challenge.
To this end, we propose a hierarchical environment question-answering architecture that effectively utilizes and extends the capabilities of existing large-scale pre-trained vision-language models, realizes robot question answering in real indoor environments, and further provides the environmental memory that previous EQA systems lacked. We adopt InstructBLIP and FlanT5 as our vision-language foundation; through the proposed architecture, the system achieves deep environmental cognition. The robot system has four main capabilities: 1) understanding objects in the environment and their states through a pre-trained vision backbone; 2) understanding natural-language questions and establishing bidirectional grounding between vision and language; 3) autonomously exploring the environment and constructing a cognitive map; 4) updating and utilizing the cognitive map to assist navigation and answer questions. In addition, we propose a novel zero-shot method that equips pre-trained vision backbones with visual place recognition (VPR) capability, ensuring that the robot can correctly localize itself on the cognitive map and update the map accordingly. Our VPR method surpasses prior work on several VPR datasets. We also verify the effectiveness and performance of cognitive-map navigation and question answering through experiments on a physical robot. Compared with previous systems, the proposed system effectively exploits pre-trained vision-language models and is less susceptible to environment shifts. In summary, we successfully use vision-language models to enhance the robot's ability to understand human questions and ground them in the environment. We believe that effective utilization of such models will be key to next-generation robot development. |
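The thesis implementation is not published on this page; as a minimal sketch of the zero-shot VPR idea described above (localizing the robot on a cognitive map by matching a query image embedding against stored keyframe embeddings), the following illustrates the matching step only. All function names, the similarity threshold, and the toy 4-D embeddings are assumptions for illustration; in the actual system the embeddings would come from a frozen pre-trained vision backbone.

```python
import numpy as np

def cosine_similarity(a, b):
    # Row-normalize both matrices; the dot product then gives cosine similarity.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def localize(query_embedding, map_embeddings, threshold=0.8):
    """Match a query image embedding against cognitive-map keyframe
    embeddings. Returns (best_place_index, score), or (None, score) when
    the best match falls below the threshold, i.e. a new, unseen place
    that should be added to the map."""
    sims = cosine_similarity(query_embedding[None, :], map_embeddings)[0]
    best = int(np.argmax(sims))
    if sims[best] >= threshold:
        return best, float(sims[best])
    return None, float(sims[best])

# Toy example: three stored places with illustrative 4-D "embeddings".
map_embs = np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0, 0.0]])
query = np.array([0.9, 0.1, 0.0, 0.0])   # close to place 0
place, score = localize(query, map_embs)  # place == 0
```

Because the backbone is frozen and only embedding similarity is used, no VPR-specific training is needed, which is what makes the approach zero-shot.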
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90532 |
DOI: | 10.6342/NTU202303014 |
Fulltext Rights: | Authorized (campus access only) |
Appears in Collections: | Department of Electrical Engineering |
Files in This Item:
File | Size | Format
---|---|---
ntu-111-2.pdf (Restricted Access) | 46.46 MB | Adobe PDF
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.