Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90532
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | 傅立成 | zh_TW
dc.contributor.advisor | Li-Chen Fu | en
dc.contributor.author | 涂志宏 | zh_TW
dc.contributor.author | Chih-Hung Tu | en
dc.date.accessioned | 2023-10-03T16:30:46Z | -
dc.date.available | 2023-11-09 | -
dc.date.copyright | 2023-10-03 | -
dc.date.issued | 2023 | -
dc.date.submitted | 2023-08-06 | -
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90532 | -
dc.description.abstract | 隨著全球人口結構高齡化,高齡照護需求日益上升,人類社會對機器人的需求愈發迫切,也因此產生了新的任務:機器人環境問答 (Embodied Question Answering)。然而過往的方法並未探討如何建立環境認知,也尚未有研究討論如何有效利用現行視覺語言模型完成這類任務,因此導致探索效率低下,且系統所能產生的回覆受到限制。近年來,大規模預訓練模型獲得可觀的發展,並開始展現出強健的視覺語言理解能力,成為機器人步入人類生活的曙光。然而,這些模型並非為機器人任務設計,往往難以直接使用,且重新訓練或微調都需要龐大的資源;如何有效地重構和利用這些模型,進一步提昇機器人能力,成為新的趨勢及挑戰。
為此,我們提出一套階層式環境問答架構,能夠有效利用及擴增現有大規模預訓練視覺語言模型的能力,實現機器人在真實室內環境的問答,並進一步具備過往機器人環境問答系統所欠缺的環境記憶能力。我們以 InstructBLIP 和 FlanT5 作為視覺語言模型基礎,透過所提出之架構,使系統能夠達成對環境的深度認知。該機器人系統同時具備以下四個能力:1) 透過預訓練視覺模型理解環境中的物體及其狀態、2) 理解人類自然語言問題並達成視覺和語言的雙向關聯、3) 自主探索環境並建立認知地圖、4) 更新及利用環境認知地圖來協助導航並回答問題。除此之外,我們也提出一種創新的零樣本 (Zero-shot) 方法,使預訓練視覺骨架模型具備視覺場景識別 (Visual Place Recognition, VPR) 的能力,確保機器人能正確定位於認知地圖並正確地更新地圖。我們提出的視覺場景識別方法在數個 VPR 資料集上超越過去的研究。我們也透過實體機器人實驗,驗證認知地圖導航及回答問題的有效性及表現。相較於過去的系統,本研究提出之系統能有效利用視覺語言預訓練模型,使系統更不易受環境遷移所影響。綜上所述,我們成功利用視覺語言模型,提升機器人理解人類問句並與環境進行關聯的能力。我們相信如何有效利用這些模型,將成為未來機器人發展的關鍵。 | zh_TW
dc.description.abstract | As populations age worldwide and the demand for elderly care grows, the need for robots in human society has become more urgent than ever. This has given rise to a new task: Embodied Question Answering (EQA). However, previous methods have neither explored how to build up an understanding of the environment nor examined how to make effective use of existing vision-language models for this task, which results in inefficient exploration and limits the responses such systems can generate. In recent years, large-scale pre-trained vision-language models have demonstrated robust visual-language understanding, opening the door for robots to enter everyday human life. These models, however, were not designed for robotic tasks and are therefore difficult to use directly, and retraining or fine-tuning them demands substantial resources. How to effectively restructure and exploit these models to further enhance robot capabilities has thus become a new trend and challenge in robotics research.
In this study, we propose a hierarchical architecture that effectively utilizes and extends existing large-scale pre-trained vision-language models to achieve question answering in real indoor environments, and further equips the robot with the environmental memory that previous EQA systems lacked. Using InstructBLIP and FlanT5 as the vision-language foundations, the proposed architecture enables the system to build a deep cognition of its environment. The robot system has four main capabilities: 1) understanding objects in the environment and their states through a pre-trained vision backbone; 2) understanding natural-language questions and establishing a bidirectional association between vision and language; 3) autonomously exploring the environment and constructing a cognitive map; and 4) updating and using the cognitive map to assist navigation and answer questions. In addition, we propose a novel zero-shot method that gives pre-trained vision backbones visual place recognition (VPR) capability, ensuring that the robot can localize itself on the cognitive map and update the map correctly. The proposed place recognition method surpasses prior work on several VPR datasets. We also verify the effectiveness of cognitive-map navigation and question answering through experiments on a physical robot. Compared with existing systems, the proposed system effectively leverages pre-trained vision-language models and is less susceptible to changes in the environment. In summary, we successfully use vision-language models to enhance the robot's ability to understand human questions and relate them to its environment, and we believe that the effective use of such models will be key to the development of future robots. | en
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-10-03T16:30:46Z
No. of bitstreams: 0 | en
dc.description.provenance | Made available in DSpace on 2023-10-03T16:30:46Z (GMT). No. of bitstreams: 0 | en
dc.description.tableofcontents | 口試委員會審定書 i
致謝 ii
摘要 iii
Abstract v
Contents vii
List of Figures xi
List of Tables xiii
Chapter 1 Introduction 1
1.1 Background and Motivation 1
1.2 Research Objectives 3
1.3 Contributions 4
1.4 Thesis Overview 5
Chapter 2 Related Works 6
2.1 Vision-Language Models 6
2.2 Vision Question Answering and Caption 7
2.3 Environment Question Answering 7
2.4 Cognitive Map 9
2.5 Visual Place Recognition 10
Chapter 3 Methodology 11
3.1 System Overview 11
3.2 Perception 13
3.2.1 Speech Recognition 13
3.2.2 Max-Min Boundary 13
3.2.2.1 Camera LiDAR and Laser Calibration 14
3.2.2.2 Maximum Boundary 16
3.2.2.3 Minimum Boundary 17
3.2.2.4 Uniform Spatial Sampling 18
3.2.3 Local Obstacle Map 18
3.2.4 Visual Perception Module 20
3.2.4.1 Large-Scale Pre-trained Vision Foundations 20
3.2.4.2 Vision Foundations as Visual Perception 22
3.2.4.3 Semantic Homogeneity Score 24
3.3 Cognitive Map 25
3.3.1 Definition 26
3.3.1.1 Cognitive Map Structure 26
3.3.1.2 Node on Cognitive Map 26
3.3.1.3 Edge on Cognitive Map 28
3.3.2 Cognitive Map Management 29
3.3.2.1 Cognitive Map Construction 29
3.3.2.2 Cognitive Map Exploration 30
3.3.2.3 Cognitive Map Update 33
3.4 Hybrid Localizer 35
3.4.1 ViTPR 36
3.4.1.1 Mutual Similarity 37
3.4.1.2 Salience Filter 38
3.4.1.3 Iterative Consensus 39
3.4.1.4 ViTPR-multi 41
3.4.2 Augmented Scan Matching 42
3.4.2.1 PLICP 43
3.4.2.2 Spatial Similarity 43
3.5 Reasoning 44
3.5.1 Vision-Language Model 45
3.5.1.1 Visual Question Answering 47
3.5.1.2 Evaluator 48
3.5.1.3 Classifier 49
3.5.1.4 Image-Text Matcher 50
3.5.1.5 Conditioned Generation 51
3.5.2 Coordinator 51
3.5.2.1 Query Classification 51
3.5.2.2 Observation Query Generation 52
3.5.2.3 Question Answering 53
3.5.3 Observation Retrieval 54
3.5.3.1 Query Matching 55
3.5.3.2 Visual-Context Grouping 57
3.6 Navigation 58
3.6.1 TopoBand 58
Chapter 4 Experiments 60
4.1 Visual Place Recognition 60
4.1.1 Experimental Setup 61
4.1.1.1 Nordland 61
4.1.1.2 Y412-VPR 62
4.1.2 Performance Evaluation 64
4.1.2.1 Nordland Dataset 64
4.1.2.2 Y412-VPR Dataset 65
4.1.3 Ablation Study 67
4.1.3.1 Feature Layer 69
4.1.3.2 Rotation Invariance 70
4.2 Global Localization 71
4.2.1 Experimental Setup 71
4.2.2 Evaluation 76
4.3 Observation Retrieval 77
4.3.1 Experimental Setup 78
4.3.2 Quantitative Result 80
4.4 Environment Question Answering 81
4.4.1 Experimental Setup 81
4.4.2 Quantitative Result 81
4.4.3 Qualitative Result 82
Chapter 5 Conclusion and Future Works 85
References 87
Appendix A — Tables 93
A.1 Observation Query Generation 93
A.2 Response Generation 93
A.3 Response Classification 94
A.4 Questions Sets 94 | -
dc.language.iso | en | -
dc.subject | 環境物件問答 | zh_TW
dc.subject | 認知地圖 | zh_TW
dc.subject | 場景識別 | zh_TW
dc.subject | 機器人環境認知 | zh_TW
dc.subject | 認知機器人 | zh_TW
dc.subject | Environmental Cognition | en
dc.subject | Cognitive Map | en
dc.subject | Cognitive Robot | en
dc.subject | Place Recognition | en
dc.subject | Environment Question Answering | en
dc.title | 基於視覺語言模型及認知地圖之機器人室內環境問答系統 | zh_TW
dc.title | A Robot System for Indoor Environment Question Answering with Cognitive Map Leveraging Vision-language Models | en
dc.type | Thesis | -
dc.date.schoolyear | 111-2 | -
dc.description.degree | 碩士 | -
dc.contributor.oralexamcommittee | 林沛群;郭重顯;張文中;宋開泰 | zh_TW
dc.contributor.oralexamcommittee | Pei-Chun Lin;Chung-Hsien Kuo;Wen-Chung Chang;Kai-Tai Song | en
dc.subject.keyword | 認知地圖,場景識別,機器人環境認知,環境物件問答,認知機器人 | zh_TW
dc.subject.keyword | Cognitive Map,Place Recognition,Environmental Cognition,Environment Question Answering,Cognitive Robot | en
dc.relation.page | 96 | -
dc.identifier.doi | 10.6342/NTU202303014 | -
dc.rights.note | 同意授權(限校園內公開) | -
dc.date.accepted | 2023-08-08 | -
dc.contributor.author-college | 電機資訊學院 | -
dc.contributor.author-dept | 電機工程學系 | -
dc.date.embargo-lift | 2026-08-05 | -
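
The abstract above describes a zero-shot approach that turns a frozen, pre-trained vision backbone into a visual place recognizer so the robot can localize itself on its cognitive map. As a rough illustration of that idea only, the sketch below matches a query view against stored map-node views by cosine similarity of global features from a frozen backbone; the model choice (CLIP ViT-B/32 via Hugging Face transformers) and all helper names are assumptions made for this example, and the thesis's actual ViTPR method additionally relies on patch-level mutual similarity, a salience filter, and iterative consensus, none of which are reproduced here.

# Illustrative sketch only (not the thesis implementation): zero-shot place matching
# with a frozen vision backbone. Model choice and helper names are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

_MODEL_ID = "openai/clip-vit-base-patch32"   # assumed stand-in for a frozen ViT backbone
model = CLIPModel.from_pretrained(_MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(_MODEL_ID)

@torch.no_grad()
def embed(image_paths):
    """Encode images into L2-normalized global descriptors using the frozen backbone."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def localize(query_path, node_paths):
    """Return (best node index, similarity) for a query view against stored map-node views."""
    q = embed([query_path])            # (1, D)
    nodes = embed(node_paths)          # (N, D)
    sims = (nodes @ q.T).squeeze(1)    # cosine similarities; features are unit-normalized
    best = int(torch.argmax(sims))
    return best, float(sims[best])

# Example usage (hypothetical file names):
# idx, score = localize("query.jpg", ["node_000.jpg", "node_001.jpg", "node_002.jpg"])

In the thesis itself, such visual matching is only one half of a hybrid localizer that also performs augmented scan matching (Section 3.4 of the table of contents above).
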
Appears in Collections: 電機工程學系 (Department of Electrical Engineering)

Files in This Item:
File | Size | Format
ntu-111-2.pdf (restricted access: not authorized for public viewing) | 46.46 MB | Adobe PDF