NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101167
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | 徐宏民 | zh_TW
dc.contributor.advisor | Winston H. Hsu | en
dc.contributor.author | 王廷郡 | zh_TW
dc.contributor.author | Ting-Jun Wang | en
dc.date.accessioned | 2025-12-31T16:11:09Z | -
dc.date.available | 2026-01-01 | -
dc.date.copyright | 2025-12-31 | -
dc.date.issued | 2025 | -
dc.date.submitted | 2025-12-17 | -
dc.identifier.citation[1] Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, et al. On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757, 2018.
[2] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-andlanguage navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3674–3683, 2018.
[3] Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, and Bryan A Plummer. A dataset for interactive vision-language navigation with unknown command feasibility. In European Conference on Computer Vision, pages 312–328. Springer, 2022.
[4] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158, 2017.
[5] Jiaqi Chen, Bingqian Lin, Ran Xu, Zhenhua Chai, Xiaodan Liang, and Kwan-Yee K. Wong. Mapgpt: Map-guided prompting with adaptive path planning for vision-andlanguage navigation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024.
[6] Shizhe Chen, Pierre-Louis Guhur, Cordelia Schmid, and Ivan Laptev. History aware multimodal transformer for vision-and-language navigation. Advances in neural information processing systems, 34:5834–5847, 2021.
[7] Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, and Ivan Laptev. Think global, act local: Dual-scale graph transformer for vision-andlanguage navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16537–16547, 2022.
[8] An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in visionlanguage models. In Advances in Neural Information Processing Systems, 2024.
[9] Iulia Comsa and Srini Narayanan. A benchmark for reasoning with spatial prepositions. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 16328–16335, Singapore, 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.1015. URL https: //aclanthology.org/2023.emnlp-main.1015/.
[10] Georgios Georgakis, Karl Schmeckpeper, Karan Wanchoo, Soham Dan, Eleni Miltsakaki, Dan Roth, and Kostas Daniilidis. Cross-modal map learning for vision and language navigation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15460–15470, 2022.
[11] Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. URL https://arxiv.org/abs/ 2403.05530.
[12] Meera Hahn, Amit Raj, and James M. Rehg. Which way is `right'?: Uncovering limitations of vision-and-language navigation models. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 2415–2417, Richland, SC, 2023. URL https://api. semanticscholar.org/CorpusID:253381693.
[13] Xianghao Kong, Jinyu Chen, Wenguan Wang, Hang Su, Xiaolin Hu, Yi Yang, and Si Liu. Controllable navigation instruction generation with chain of thought prompting. In European Conference on Computer Vision, pages 37–54. Springer, 2024.
[14] Jacob Krantz, Erik Wijmans, Arjun Majundar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision and language navigation in continuous environments. In European Conference on Computer Vision (ECCV), 2020.
[15] Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge. Roomacross-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding. arXiv preprint arXiv:2010.07954, 2020.
[16] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10965–10975, 2022.
[17] Bingqian Lin, Yunshuang Nie, Ziming Wei, Jiaqi Chen, Shikui Ma, Jianhua Han,Hang Xu, Xiaojun Chang, and Xiaodan Liang. Navcot: Boosting llm-based visionand-language navigation via learning disentangled reasoning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
[18] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European Conference on Computer Vision, pages 38–55. Springer, 2024.
[19] Yuxing Long, Xiaoqi Li, Wenzhe Cai, and Hao Dong. Discuss before moving: Visual language navigation via multi-expert discussions. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024.
[20] OpenAI. Gpt-4 technical report, 2023. URL https://openai.com/research/ gpt-4. Accessed: 2025-05-18.
[21] Bowen Pan, Rameswar Panda, SouYoung Jin, Rogerio Feris, Aude Oliva, Phillip Isola, and Yoon Kim. Langnav: Language as a perceptual representation for navigation. arXiv preprint arXiv:2310.07889, 2023.
[22] Alexander Pashevich, Cordelia Schmid, and Chen Sun. Episodic transformer for vision-and-language navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15942–15952, 2021.
[23] Yuankai Qi, Qi Wu, Peter Anderson, Xin Wang, William Yang Wang, Chunhua Shen, and Anton van den Hengel. Reverie: Remote embodied visual referring expression in real indoor environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9982–9991, 2020.
[24] Yanyuan Qiao, Qianyi Liu, Jiajun Liu, Jing Liu, and Qi Wu. Llm as copilot for coarse-grained vision-and-language navigation. In European Conference on Computer Vision, pages 459–476. Springer, 2024.
[25] Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded sam: Assembling openworld models for diverse visual tasks. arXiv preprint arXiv:2401.14159, 2024.
[26] Francesco Taioli, Stefano Rosa, Alberto Castellini, Lorenzo Natale, Alessio Del Bue, Alessandro Farinelli, Marco Cristani, and Yiming Wang. Mind the error! detection and localization of instruction errors in vision-and-language navigation. In 2024 IEEE/RSJInternational Conference on Intelligent Robots and Systems(IROS), pages 12993–13000. IEEE, 2024.
[27] Jesse Thomason, Michael Murray, Maya Cakmak, and Luke Zettlemoyer. Visionand-dialog navigation. In Conference on Robot Learning, pages 394–406. PMLR, 2020.
[28] Hongbin Wang, Todd R Johnson, Jiajie Zhang, and Yue Wang. A study of object-location memory. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 24, 2002. URL https://escholarship.org/uc/item/ 62s8d5p0.
[29] Xiaohan Wang, Wenguan Wang, Jiayi Shao, and Yi Yang. Lana: A language-capable navigator for instruction following and generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19048–19058, 2023.
[30] Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, and Shuqiang Jiang. Gridmm: Grid memory map for vision-and-language navigation. In Proceedings of the IEEE/CVF International conference on computer vision, pages 15625–15636, 2023.
[31] He Yan, Xinyao Hu, Xiangpeng Wan, Chengyu Huang, Kai Zou, and Shiqi Xu. Inherent limitations of gpt=4 regarding spatial information, 2023. URL https: //arxiv.org/abs/2312.03042.
[32] Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v. arXiv preprint arXiv:2310.11441, 2023.
[33] Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces, 2024. URL https://arxiv.org/abs/2412.14171.
[34] Yue Zhang and Parisa Kordjamshidi. Vln-trans: Translator for the vision and language navigation agent. arXiv preprint arXiv:2302.09230, 2023.
[35] Yue Zhang, Quan Guo, and Parisa Kordjamshidi. Navhint: Vision and language navigation agent with a hint generator. Association for Computational Linguistics, 2024.
[36] Gengze Zhou, Yicong Hong, Zun Wang, Xin Eric Wang, and Qi Wu. Navgpt-2: Unleashing navigational reasoning capability for large vision-language models. In European Conference on Computer Vision, pages 260–278. Springer, 2024.
[37] Gengze Zhou, Yicong Hong, and Qi Wu. Navgpt: Explicit reasoning in visionand-language navigation with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 7641–7649, 2024.
-
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101167 | -
dc.description.abstract | Existing Vision-and-Language Navigation (VLN) research mostly assumes that user instructions are always achievable, overlooking the fact that in reality humans often, due to memory errors, ask for objects that do not exist, causing the robot to search indefinitely or stop prematurely. To enable systems to handle such unreliable instructions, this work proposes a new task, Navigation Not Found (NAV-NF), which requires the robot, after reaching the target room, to output NOT-FOUND once it confirms that the target object is absent.
We design a data generation pipeline centered on a large language model (LLM) that rewrites instructions and verifies object absence with open-vocabulary object detection, producing instructions that are semantically natural but factually infeasible; human inspection shows an error rate below 2%. We further propose new evaluation metrics, including Reach & Found SR and Reach & Found SPL, to quantify exploration efficiency and decision quality under uncertainty.
Experimental results show that existing supervised and LLM-based VLN models perform poorly on NAV-NF, achieving only 5.4%–34.1% Reach & Found SR. We therefore propose ROAM (Room-Object Aware Movement), a coarse-to-fine two-stage framework that first localizes the target room with a supervised model and then explores within the room with a VLM/LLM. ROAM achieves the best results on all metrics, raising Reach & Found SR to 41.4%.
This work provides the first VLN benchmark dataset and a robust model for handling infeasible instructions, pushing VLN robotics toward more reliable, error-aware systems. | zh_TW
dc.description.abstract | Conventional Vision-and-Language Navigation agents assume that tasks are always feasible and lack mechanisms to handle cases where the target object cannot be found. To address this limitation, we propose a novel task, Navigation Not Found (NAV-NF), which introduces unreliable instructions, i.e., scenarios where the target object does not exist, reflecting real-world situations where humans may provide erroneous instructions. We develop a data generation pipeline that leverages Large Language Models to revise existing instructions and verify their correctness. Experimental results demonstrate that state-of-the-art models, whether supervised or LLM-based, struggle with exploration and often hallucinate or terminate prematurely. To mitigate this, we introduce a hybrid framework, Room-Object Aware Movement (ROAM), which achieves state-of-the-art performance across all evaluation metrics. | en
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-12-31T16:11:09Z. No. of bitstreams: 0 | en
dc.description.provenance | Made available in DSpace on 2025-12-31T16:11:09Z (GMT). No. of bitstreams: 0 | en
dc.description.tableofcontents |
Verification Letter from the Oral Examination Committee
Acknowledgements
摘要 (Chinese Abstract)
Abstract
Contents
List of Figures
List of Tables
Chapter 1 Introduction
Chapter 2 Related Work
  2.1 Vision-and-Language Navigation
  2.2 Instruction Errors in Vision-and-Language Navigation
Chapter 3 The NAV-NF Dataset
  3.1 Task Definition
  3.2 Dataset Construction
    3.2.1 Infeasible Instruction Generation
    3.2.2 Verifying Instruction Infeasibility
    3.2.3 Generating Exploration Paths
  3.3 Human Evaluation
  3.4 Evaluation Metrics
Chapter 4 Room-Object Aware Movement Framework
  4.1 Coarse-grained Stage
  4.2 Fine-grained Stage
    4.2.1 Architecture
    4.2.2 Free-space Raycasting Estimation Engine
Chapter 5 Experiments
  5.1 Baseline Methods
  5.2 Comparison with Baselines
  5.3 Ablation Study
  5.4 ROAM Improves Original REVERIE VLN
Chapter 6 Conclusion
References
Appendix A
  A.1 Implementation Details
    A.1.1 Dataset Generation Pipeline
    A.1.2 ROAM Framework
    A.1.3 ROAM Framework: Free-space Raycasting Estimation Engine (FREE)
  A.2 Design Considerations for Reach & Found SPL
| -
dc.language.iso | en | -
dc.subject | 視覺語言導航 (Vision-and-Language Navigation) | -
dc.subject | 視覺語言模型 (Vision-Language Model) | -
dc.subject | VLN | -
dc.subject | Vision-Language Navigation | -
dc.subject | Vision-Language Model | -
dc.title | 針對潛在不可行指令之視覺語言導航之基準與方法設計 | zh_TW
dc.title | NAV-NF: A Benchmark and Framework for Vision-Language Navigation under Infeasible Instructions | en
dc.type | Thesis | -
dc.date.schoolyear | 114-1 | -
dc.description.degree | 碩士 (Master) | -
dc.contributor.oralexamcommittee | 陳奕廷;陳文進 | zh_TW
dc.contributor.oralexamcommittee | Yi-Ting Chen;Wen-Chin Chen | en
dc.subject.keyword | 視覺語言導航,視覺語言模型 | zh_TW
dc.subject.keyword | VLN,Vision-Language Navigation,Vision-Language Model | en
dc.relation.page | 43 | -
dc.identifier.doi | 10.6342/NTU202504771 | -
dc.rights.note | 同意授權(全球公開) (authorization granted; open access worldwide) | -
dc.date.accepted | 2025-12-17 | -
dc.contributor.author-college | 電機資訊學院 (College of Electrical Engineering and Computer Science) | -
dc.contributor.author-dept | 資訊工程學系 (Department of Computer Science and Information Engineering) | -
dc.date.embargo-lift | 2026-01-01 | -
Appears in Collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in This Item:
File | Size | Format
ntu-114-1.pdf | 9.51 MB | Adobe PDF


Items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated in their copyright terms.
