DSpace

The DSpace institutional repository preserves digital materials of all kinds (e.g., text, images, PDFs) and makes them easy to access.

Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/102226
Full item record
DC Field | Value | Language
dc.contributor.advisor: 亞歷山卓克里維 [zh_TW]
dc.contributor.advisor: Alessandro Crivellari [en]
dc.contributor.author: 丘絲盈 [zh_TW]
dc.contributor.author: Si Ying Yau [en]
dc.date.accessioned: 2026-04-08T16:26:49Z
dc.date.available: 2026-04-09
dc.date.copyright: 2026-04-08
dc.date.issued: 2026
dc.date.submitted: 2026-03-09
dc.identifier.citation1. Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., & Reynolds, M. (2022). Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35, 23716-23736.
2. Alfonzo, M. A. (2005). To walk or not to walk? The hierarchy of walking needs. Environment and Behavior, 37(6), 808-836.
3. Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., & Zhang, L. (2018). Bottom-up and top-down attention for image captioning and visual question answering. Proceedings of the IEEE conference on computer vision and pattern recognition,
4. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., & Parikh, D. (2015). Vqa: Visual question answering. Proceedings of the IEEE international conference on computer vision,
5. Apple. (2025). Apple Maps. https://maps.apple.com/.
6. Arellana, J., Saltarín, M., Larrañaga, A. M., Alvarez, V., & Henao, C. A. (2020). Urban walkability considering pedestrians’ perceptions of the built environment: a 10-year review and a case study in a medium-sized city in Latin America. Transport reviews, 40(2), 183-203.
7. Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., & Huang, F. (2023). Qwen technical report. arXiv preprint arXiv:2309.16609.
8. Baidu. (2025). Baidu Total View. https://map.baidu.com/.
9. Basu, N., Haque, M. M., King, M., Kamruzzaman, M., & Oviedo-Trespalacios, O. (2022). A systematic review of the factors associated with pedestrian route choice. Transport reviews, 42(5), 672-694.
10. Bazi, Y., Bashmal, L., Al Rahhal, M. M., Ricci, R., & Melgani, F. (2024). Rs-llava: A large vision-language model for joint captioning and question answering in remote sensing imagery. Remote Sensing, 16(9), 1477.
11. Biljecki, F., & Ito, K. (2021). Street view imagery in urban analytics and GIS: A review. Landscape and Urban Planning, 215, 104217.
12. Bivina, G., Gupta, A., & Parida, M. (2020). Walk accessibility to metro stations: An analysis based on meso-or micro-scale built environment factors. Sustainable Cities and Society, 55, 102047.
13. Blečić, I., Saiu, V., & A. Trunfio, G. (2024). Enhancing Urban Walkability Assessment with Multimodal Large Language Models. International Conference on Computational Science and Its Applications,
14. Bordes, F., Pang, R. Y., Ajay, A., Li, A. C., Bardes, A., Petryk, S., Mañas, O., Lin, Z., Mahmoud, A., & Jayaraman, B. (2024). An introduction to vision-language modeling. arXiv preprint arXiv:2405.17247.
15. Center, N. L. S. a. M. (2006). National Land Use Investigation Data. https://www.nlsc.gov.tw/cp.aspx?n=13706
16. De Vos, J., Lättman, K., Van der Vlugt, A.-L., Welsch, J., & Otsuka, N. (2023). Determinants and effects of perceived walkability: a literature review, conceptual model and research agenda. Transport reviews, 43(2), 303-324.
17. Duncan, D. T., Aldstadt, J., Whalen, J., Melly, S. J., & Gortmaker, S. L. (2011). Validation of Walk Score® for estimating neighborhood walkability: an analysis of four US metropolitan areas. International journal of environmental research and public health, 8(11), 4160-4179. https://mdpi-res.com/d_attachment/ijerph/ijerph-08-04160/article_deploy/ijerph-08-04160.pdf?version=1403139473
18. Fang, F., Zeng, L., Li, S., Zheng, D., Zhang, J., Liu, Y., & Wan, B. (2022). Spatial context-aware method for urban land use classification using street view images. ISPRS Journal of Photogrammetry and Remote Sensing, 192, 1-12.
19. Feng, J., Wang, S., Liu, T., Xi, Y., & Li, Y. (2025). UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence. Proceedings of the IEEE/CVF International Conference on Computer Vision,
20. Feng, Y., Ding, L., & Xiao, G. (2023). Geoqamap-geographic question answering with maps leveraging LLM and open knowledge base (short paper). 12th International Conference on Geographic Information Science (GIScience 2023),
21. Google. (2025a). Google maps. https://www.google.com/maps.
22. Google. (2025b). Street View Static API. Retrieved 2025/02/22, from https://developers.google.com/maps/documentation/streetview.
23. Götschi, T., de Nazelle, A., Brand, C., & Gerike, R. (2017). Towards a comprehensive conceptual framework of active travel behavior: a review and synthesis of published frameworks. Current environmental health reports, 4(3), 286-295. https://pmc.ncbi.nlm.nih.gov/articles/PMC5591356/
24. Gu, J., Han, Z., Chen, S., Beirami, A., He, B., Zhang, G., Liao, R., Qin, Y., Tresp, V., & Torr, P. (2023). A systematic survey of prompt engineering on vision-language foundation models. arXiv preprint arXiv:2307.12980.
25. Hartsock, I., & Rasool, G. (2024). Vision-language models for medical report generation and visual question answering: A review. Frontiers in Artificial Intelligence, 7, 1430984. https://pmc.ncbi.nlm.nih.gov/articles/PMC11611889/pdf/frai-07-1430984.pdf
26. Harvey, C., & Aultman-Hall, L. (2016). Measuring urban streetscapes for livability: A review of approaches. The Professional Geographer, 68(1), 149-158.
27. He, X., & He, S. Y. (2025). How does the effect of walkability on walking behavior vary with the time of day? A study of Shenzhen, China. Journal of transport geography, 126, 104210.
28. HeiGIT. (2025). Retrieved from https://openrouteservice.org/.
29. Hu, R., & Singh, A. (2021). Unit: Multimodal multitask learning with a unified transformer. Proceedings of the IEEE/CVF international conference on computer vision,
30. Hu, Z., Wang, L., Lan, Y., Xu, W., Lim, E.-P., Bing, L., Xu, X., Poria, S., & Lee, R. K.-W. (2023). Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models. arXiv preprint arXiv:2304.01933.
31. Huang, D., Yan, C., Li, Q., & Peng, X. (2024). From Large Language Models to Large Multimodal Models: A Literature Review. Applied Sciences, 14(12), 5068.
32. Huang, M.-R. (2022). Using the Hierarchical Model of Walkability to Evaluate the Walking Environment around the Metro Station: The Case of Zhongxiao Xinsheng Station, MRT National Taipei University of Technology].
33. Ikeda, T., Hsu, M.-Y., Imai, H., Nishimura, S., Shimoura, H., Hashimoto, T., Tenmoku, K., & Mitoh, K. (1994). A fast algorithm for finding better routes by AI search techniques. Proceedings of VNIS'94-1994 Vehicle Navigation and Information Systems Conference,
34. Jain, P., Ienco, D., Interdonato, R., Berchoux, T., & Marcos, D. (2025). SenCLIP: Enhancing zero-shot land-use mapping for Sentinel-2 with ground-level prompting. 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV),
35. Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q., Sung, Y.-H., Li, Z., & Duerig, T. (2021). Scaling up visual and vision-language representation learning with noisy text supervision. International conference on machine learning,
36. Jiao, J., & Wang, H. (2023). Forecasting traffic speed during daytime from Google street view images using deep learning. Transportation research record, 2677(12), 743-753.
37. Jin, W., Cheng, Y., Shen, Y., Chen, W., & Ren, X. (2021). A good prompt is worth millions of parameters: Low-resource prompt-based learning for vision-language models. arXiv preprint arXiv:2110.08484.
38. Kang, B., Lee, S., & Zou, S. (2021). Developing sidewalk inventory data using street view images. Sensors, 21(9), 3300. https://mdpi-res.com/d_attachment/sensors/sensors-21-03300/article_deploy/sensors-21-03300-v3.pdf?version=1620792975
39. Kang, J., Körner, M., Wang, Y., Taubenböck, H., & Zhu, X. X. (2018). Building instance classification using street view images. ISPRS Journal of Photogrammetry and Remote Sensing, 145, 44-59.
40. Koo, B. W., Guhathakurta, S., & Botchwey, N. (2022). How are neighborhood and street-level walkability factors associated with walking behaviors? A big data approach using street view images. Environment and Behavior, 54(1), 211-241.
41. Kuckreja, K., Danish, M. S., Naseer, M., Das, A., Khan, S., & Khan, F. S. (2024). Geochat: Grounded large vision-language model for remote sensing. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
42. Kuo, C.-L., & Lin, Z.-S. (2024). A cost-effective approach to uncovering and mapping buildings’ exterior arcades using street view imagery. Remote Sensing Applications: Society and Environment, 34, 101164.
43. Leong, Y.-Y. (2021). A Study on the Pedestrian Environment Improvement around MRT Stations Affecting Pedestrian' Walking Willingness Chinese Culture University].
44. Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., & Gao, J. (2023). Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in neural information processing systems, 36, 28541-28564.
45. Li, J., Li, D., Savarese, S., & Hoi, S. (2023). Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. International conference on machine learning,
46. Li, J., Li, D., Xiong, C., & Hoi, S. (2022). Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. International conference on machine learning,
47. Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., & Hoi, S. C. H. (2021). Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems, 34, 9694-9705.
48. Li, M., Sheng, H., Irvin, J., Chung, H., Ying, A., Sun, T., Ng, A. Y., & Rodriguez, D. A. (2023). Marked crosswalks in US transit-oriented station areas, 2007–2020: A computer vision approach using street view imagery. Environment and Planning B: Urban Analytics and City Science, 50(2), 350-369.
49. Li, X. (2021). Examining the spatial distribution and temporal change of the green view index in New York City using Google Street View images and deep learning. Environment and Planning B: Urban Analytics and City Science, 48(7), 2039-2054.
50. Li, X., Wen, C., Hu, Y., & Zhou, N. (2023). RS-CLIP: Zero shot remote sensing scene classification via contrastive vision-language supervision. International Journal of Applied Earth Observation and Geoinformation, 124, 103497.
51. Li, X., Zhang, C., & Li, W. (2017). Building block level urban land-use information retrieval based on Google Street View images. GIScience & Remote Sensing, 54(6), 819-835.
52. Li, X., Zhang, C., Li, W., Ricard, R., Meng, Q., & Zhang, W. (2015). Assessing street-level urban greenery using Google Street View and a modified green view index. Urban Forestry & Urban Greening, 14(3), 675-685.
53. Li, Y., Zhao, Q., & Wang, M. (2024). Understanding urban traffic flows in response to COVID-19 pandemic with emerging urban big data in Glasgow. Cities, 154, 105381.
54. Lin, C.-Y. (2004). Rouge: A package for automatic evaluation of summaries. Text summarization branches out,
55. Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2024). Visual instruction tuning. Advances in neural information processing systems, 36.
56. Liu, J., Sun, L., Fu, R., & Yang, B. (2025). Towards Faithful Reasoning in Remote Sensing: A Perceptually-Grounded GeoSpatial Chain-of-Thought for Vision-Language Models. arXiv preprint arXiv:2509.22221.
57. Liu, X., Zhou, T., Wang, C., Wang, Y., Wang, Y., Cao, Q., Du, W., Yang, Y., He, J., & Qiao, Y. (2025). Toward the unification of generative and discriminative visual foundation model: a survey. The Visual Computer, 41(5), 3371-3412.
58. Lu, J., Goswami, V., Rohrbach, M., Parikh, D., & Lee, S. (2020). 12-in-1: Multi-task vision and language representation learning. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition,
59. Luan, J.-Y. (2023). The Development of Age-friendly Walkable Environment, around Kaiyuan Business District, Tainan: The Awakening of Community Awareness and The Experiment of Possibilities (Publication Number 7) National Cheng Kung University].
60. Lugtigheid, F., Park, A. J., Hwang, E., Spicer, V., & Brantingham, P. L. (2024). Sidewalk-Based Accessible Pedestrian Routing. 2024 IEEE 15th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON),
61. Map8. (2025). Map8. https://www.map8.zone/.
62. Meta. (2025). Mapillary. https://www.mapillary.com.
63. Nguyen, Q. C., Huang, Y., Kumar, A., Duan, H., Keralis, J. M., Dwivedi, P., Meng, H.-W., Brunisholz, K. D., Jay, J., & Javanmardi, M. (2020). Using 164 million google street view images to derive built environment predictors of COVID-19 cases. International journal of environmental research and public health, 17(17), 6359. https://mdpi-res.com/d_attachment/ijerph/ijerph-17-06359/article_deploy/ijerph-17-06359-v2.pdf?version=1599039272
64. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., & Clark, J. (2021). Learning transferable visual models from natural language supervision. International conference on machine learning,
65. Sahoo, P., Singh, A. K., Saha, S., Jain, V., Mondal, S., & Chadha, A. (2024). A systematic survey of prompt engineering in large language models: Techniques and applications. arXiv preprint arXiv:2402.07927.
66. Shah, D., Osiński, B., & Levine, S. (2023). Lm-nav: Robotic navigation with large pre-trained models of language, vision, and action. Conference on robot learning,
67. Shashank, A., & Schuurman, N. (2019). Unpacking walkability indices and their inherent assumptions. Health & place, 55, 145-154.
68. Shatu, F., & Yigitcanlar, T. (2018). Development and validity of a virtual street walkability audit tool for pedestrian route choice analysis—SWATCH. Journal of transport geography, 70, 148-160.
69. Shatu, F., Yigitcanlar, T., & Bunker, J. (2019). Objective vs. subjective measures of street environments in pedestrian route choice behaviour: Discrepancy and correlates of non-concordance. Transportation research part A: policy and practice, 126, 1-23.
70. Sinnott, R., & Zhong, S. (2023). Real-time route planning to reduce pedestrian pollution exposure in urban settings. Proceedings of the IEEE/ACM 10th International Conference on Big Data Computing, Applications and Technologies,
71. Siriaraya, P., Wang, Y., Zhang, Y., Wakamiya, S., Jeszenszky, P., Kawai, Y., & Jatowt, A. (2020). Beyond the shortest route: A survey on quality-aware route navigation for pedestrians. IEEE Access, 8, 135569-135590.
72. Steinmetz-Wood, M., Velauthapillai, K., O’Brien, G., & Ross, N. A. (2019). Assessing the micro-scale environment using Google Street View: the virtual systematic tool for evaluating pedestrian streetscapes (virtual-STEPS). BMC Public Health, 19, 1-11.
73. Su, W., Zhu, X., Cao, Y., Li, B., Lu, L., Wei, F., & Dai, J. (2019). Vl-bert: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530.
74. Tan, H., & Bansal, M. (2019). Lxmert: Learning cross-modality encoder representations from transformers. Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP),
75. Tanprasert, T., Siripanpornchana, C., Surasvadi, N., & Thajchayapong, S. (2020). Recognizing traffic black spots from street view images using environment-aware image processing and neural network. IEEE Access, 8, 121469-121478.
76. Tian, X., Gu, J., Li, B., Liu, Y., Wang, Y., Zhao, Z., Zhan, K., Jia, P., Lang, X., & Zhao, H. (2024). Drivevlm: The convergence of autonomous driving and large vision-language models. arXiv preprint arXiv:2402.12289.
77. Tong, M., She, J., Tan, J., Li, M., Ge, R., & Gao, Y. (2020). Evaluating street greenery by multiple indicators using street-level imagery and satellite images: a case study in Nanjing, China. Forests 11 (12): 1347. In.
78. Tong, Y., & Bode, N. W. (2022). The principles of pedestrian route choice. Journal of the Royal Society Interface, 19(189), 20220061. https://pmc.ncbi.nlm.nih.gov/articles/PMC8984324/
79. Varga, B., Ormándi, T., Tettamanti, T., Aba, A., & Esztergár-Kiss, D. (2025). Strategic trip planning for commuting considering user preferences of employees in organizations. Cities, 165, 106097.
80. Vartholomaios, A. (2023). Follow the shade: detection of optimally shaded pedestrian paths within the historic city center of Thessaloniki. IOP Conference Series: Earth and Environmental Science,
81. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
82. Wang, P., Yang, A., Men, R., Lin, J., Bai, S., Li, Z., Ma, J., Zhou, C., Zhou, J., & Yang, H. (2022). Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. International conference on machine learning,
83. Wang, X., Peng, Z., & Yang, X. (2025). Multimodal Data-Driven Hourly Dynamic Assessment of Walkability on Urban Streets and Exploration of Regulatory Mechanisms for Diurnal Changes: A Case Study of Wuhan City. Land, 14(8), 1551.
84. Wang, X., Shao, B., & Kim, H. (2025). Toward Reliable VLM: A Fine-Grained Benchmark and Framework for Exposure, Bias, and Inference in Korean Street Views. arXiv preprint arXiv:2506.03371.
85. Waze. (2025). Waze. https://www.waze.com/.
86. Wen, C., Lin, Y., Qu, X., Li, N., Liao, Y., Lin, H., & Li, X. (2025). RS-RAG: Bridging Remote Sensing Imagery and Comprehensive Knowledge with a Multi-Modal Dataset and Retrieval-Augmented Generation Model. arXiv preprint arXiv:2504.04988.
87. Winter, S. (2002). Modeling costs of turns in route planning. GeoInformatica, 6(4), 345-361.
88. Wozniak, M., Filomena, G., & Wronkowski, A. (2025). What’s your type? A taxonomy of pedestrian route choice behaviour in cities. Transportation Research Part F: Traffic Psychology and Behaviour, 109, 1257-1274.
89. Wu, D., Gong, J., Liang, J., Sun, J., & Zhang, G. (2020). Analyzing the influence of urban street greening and street buildings on summertime air pollution based on street view image data. ISPRS International Journal of Geo-Information, 9(9), 500.
90. Wu, M., Huang, Q., Gao, S., & Zhang, Z. (2023). Mixed land use measurement and mapping with street view images and spatial context-aware prompts via zero-shot multimodal learning. International Journal of Applied Earth Observation and Geoinformation, 125, 103591.
91. Xia, Y., Yabuki, N., & Fukuda, T. (2021). Development of a system for assessing the quality of urban street-level greenery using street view images and deep learning. Urban Forestry & Urban Greening, 59, 126995.
92. Yang, J., Zhao, L., Mcbride, J., & Gong, P. (2009). Can you see green? Assessing the visibility of urban forests in cities. Landscape and Urban Planning, 91(2), 97-104.
93. Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., & Chen, E. (2024). A survey on multimodal large language models. National Science Review, 11(12), nwae403. https://pmc.ncbi.nlm.nih.gov/articles/PMC11645129/
94. Yuan, Z., Zhang, T., Deng, Y., Zhang, J., Zhu, Y., Jia, Z., Zhou, J., & Zhang, J. (2024). Walkvlm: Aid visually impaired people walking by vision language model. arXiv preprint arXiv:2412.20903.
95. Yussif, A.-M., Zayed, T., Taiwo, R., & Fares, A. (2025). Sidewalk Pavement Defect Detection with Instance Segmentation: A Step Towards Improving Sidewalk Quality. CIB Conferences,
96. Zhang, G., Guhathakurta, S., Sanford, J., & Woo Koo, B. (2021). Application for locational intelligence and geospatial navigation (align): Smart navigation tool for generating routes that meet individual preferences. Urban informatics and future cities, 191-209.
97. Zhang, J., Huang, J., Jin, S., & Lu, S. (2024). Vision-language models for vision tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8), 5625-5644. https://ieeexplore.ieee.org/stampPDF/getPDF.jsp?tp=&arnumber=10445007&ref=
98. Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2019). Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675.
99. Zhang, Y., & Dong, R. (2018). Impacts of street-visible greenery on housing prices: Evidence from a hedonic price model and a massive street view image dataset in Beijing. ISPRS International Journal of Geo-Information, 7(3), 104.
100. Zhanga, J., Lia, Y., Fukudab, T., & Wang, B. (2024). Revolutionizing Urban Safety Perception Assessments: Integrating Multimodal Large Language Models with Street View Images. arXiv preprint arXiv:2407.19719.
101. Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., & Dong, Z. (2023). A survey of large language models. arXiv preprint arXiv:2303.18223, 1(2), 1-124.
102. Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., & Torralba, A. (2017). Scene parsing through ade20k dataset. Proceedings of the IEEE conference on computer vision and pattern recognition,
-
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/102226
dc.description.abstract [zh_TW]:
行人路徑規劃不僅協助行人抵達目的地,亦有助於提升步行效率。行人基於路徑規劃結果所選擇的路徑,將直接影響其移動的品質與感受,而此一決策過程受到建成環境、個人偏好與行動情境等多重因素的影響。由於其中多為難以量化的質性因子,如安全性、舒適性與便利性等考量,若欲於實務中納入這些因素,往往需列舉各項相關條件,導致路徑規劃在應用上面臨高度複雜性。此外,現有導航系統所提供的資訊多以時間、距離與轉彎數等量化指標為主,即便已提供多條路徑選項,其在步行體驗上的差異,也往往未被凸顯,導致使用者難以根據個人需求做出最適切的選擇。
有鑑於此,本研究提出「行人路徑描述-LLaVA (PRD-LLaVA)」架構,透過多模態大型語言模型 (multimodal large language model),分析路徑規劃所推薦路徑的街景影像,萃取影響行人步行體驗的客觀空間特徵,如人行道、土地利用、綠化比例與阻礙物等,並生成文字描述,以補足傳統導航系統缺乏的微觀環境資訊,促進行人路徑選擇。該架構包含三個處理階段:1. 首先,從導航成果路徑中(已具最快或最短路徑考量),辨識與萃取行人相關設施與步行環境;2. 其次生成單張影像的語意描述;3. 最終歸納為完整路徑說明。實驗結果顯示,透過結構化提示 (prompt) 與模型微調 (fine-tuning),PRD-LLaVA 於第一階段人行設施辨識任務達 93.01% 的準確率;在第二階段計算 BERTScore 之 F1 指標評估單張影像描述與人工標記資料之語意相似度,結果顯示 PRD-LLaVA 模型的表現為 0.70,優於 ChatGPT-4o 的 0.38;在第三階段,PRD-LLaVA 所生成路徑說明能概括行人路徑沿線之空間配置特徵,包括 6 種人行道情況、4 種土地利用、沿途地景特徵如商店、道路寬窄、綠化比例等。此外,本研究亦將模型成果整合為地理資訊系統,展示結合地圖之語意化路徑描述,提升行人在路徑選擇上的理解與判斷。
總結而言,本研究運用大型語言模型具有彈性的語言生成能力,結合街景影像、透過模型訓練,提出自動化描述路徑的方法,突破傳統路徑規劃中需手動定義、窮舉因子並蒐集資料的限制,提供一種具備彈性與效率的智慧步行導航解方。
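The second-stage evaluation reported in the abstract uses the BERTScore F1 metric to compare generated image descriptions against manually annotated references. As a rough, self-contained illustration of the precision/recall/F1 aggregation only (real BERTScore greedily matches tokens by cosine similarity of contextual BERT embeddings, typically via the `bert-score` package, rather than by exact token matches as done here), a token-overlap stand-in might look like:

```python
# Illustrative stand-in for BERTScore F1, NOT the actual metric: exact token
# matches replace the embedding-based cosine-similarity matching that real
# BERTScore performs, so the precision/recall/F1 aggregation is visible
# without any model download.
def token_f1(candidate: str, reference: str) -> float:
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # "Similarity" here is exact membership; BERTScore would use embeddings.
    precision = sum(1 for t in cand if t in ref) / len(cand)
    recall = sum(1 for t in ref if t in cand) / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With the real metric, one would instead call `bert_score.score(candidates, references, lang="en")` and average the returned F1 tensor; the thesis's reported 0.70 vs 0.38 figures are averages of that kind.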
dc.description.abstract [en]:
Pedestrian route planning not only assists pedestrians in reaching their destinations but also enhances the walking experience. The routes pedestrians choose based on planning outcomes directly influence the quality of their daily mobility, and this decision-making process is shaped by the built environment, individual preferences, and situational contexts. Many of these factors, such as safety, comfort, and convenience, are qualitative and difficult to quantify. Incorporating them into practical routing applications often requires exhaustive enumeration of relevant conditions, making route planning highly complex in practice. Moreover, existing navigation systems primarily convey information through quantitative indicators such as travel time, distance, and number of turns. Even when multiple alternative routes are offered, differences in walking experience are often insufficiently highlighted, making it difficult for users to select routes that best align with their individual needs and preferences.
To address these limitations, this study proposes the Pedestrian Route Description-LLaVA (PRD-LLaVA) framework, which leverages a multimodal large language model to analyze street view imagery along routes recommended by existing routing algorithms. The framework extracts objective spatial features that influence the pedestrian walking experience, such as sidewalks, land use, greenery, and obstacles, and generates textual descriptions to supply the micro-scale environmental information lacking in conventional navigation systems. PRD-LLaVA consists of three processing stages: (1) identifying pedestrian facilities and walking environments along the route produced by a navigation system, which already accounts for the shortest or fastest path; (2) generating a semantic description for each street view image; and (3) aggregating these descriptions into a route summary. Experimental results demonstrate that, through structured prompting and model fine-tuning, PRD-LLaVA achieved an accuracy of 93.01% on the first-stage pedestrian facility identification task. In the second stage, the BERTScore F1 metric was used to evaluate the semantic similarity between generated image descriptions and manually annotated references; PRD-LLaVA achieved a score of 0.70, outperforming ChatGPT-4o, which scored 0.38. In the third stage, the route descriptions generated by PRD-LLaVA effectively capture the spatial configuration along pedestrian routes, including six types of sidewalk conditions, four land-use categories, and streetscape features such as shops, street width, and the proportion of greenery. Furthermore, the model outputs are integrated into a geographic information system to demonstrate how semantic route descriptions can be combined with map-based interfaces, enhancing pedestrians' understanding during route selection.
In summary, this study exploits the flexible language-generation capabilities of large language models, combined with street view imagery and model training strategies, to propose an automated approach to generating pedestrian route descriptions. By overcoming the limitations of traditional route planning methods, which rely on manual feature definition, exhaustive factor enumeration, and labor-intensive data collection, the proposed framework offers a flexible and efficient solution for intelligent pedestrian navigation.
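The three processing stages named in the abstract (facility recognition, per-image description, route summarization) can be sketched as a simple orchestration loop. Everything below is a hypothetical placeholder: the function names and stub outputs are illustrative, since in the thesis each stage queries a fine-tuned LLaVA model with a structured prompt rather than returning canned values.

```python
# Hypothetical sketch of the three-stage PRD-LLaVA pipeline; the model calls
# are stubbed with fixed outputs so only the data flow between stages is shown.
from dataclasses import dataclass, field
from typing import List

@dataclass
class StreetViewPoint:
    image_id: str
    facilities: List[str] = field(default_factory=list)  # Stage 1 output
    description: str = ""                                # Stage 2 output

def stage1_recognize_facilities(image_id: str) -> List[str]:
    # Placeholder: the fine-tuned VLM would return facility labels here.
    return ["sidewalk", "greenery"]

def stage2_describe(image_id: str, facilities: List[str]) -> str:
    # Placeholder: the VLM would generate a sentence grounded in Stage 1 labels.
    return f"{image_id}: walkable segment with {', '.join(facilities)}."

def stage3_summarize(points: List[StreetViewPoint]) -> str:
    # Placeholder: the VLM would condense per-image descriptions into one summary.
    return " ".join(p.description for p in points)

def describe_route(image_ids: List[str]) -> str:
    """Run all three stages over the street view images sampled along a route."""
    points = []
    for img in image_ids:
        p = StreetViewPoint(image_id=img)
        p.facilities = stage1_recognize_facilities(img)
        p.description = stage2_describe(img, p.facilities)
        points.append(p)
    return stage3_summarize(points)
```

The point of the staged design, as the abstract describes it, is that each stage's output grounds the next prompt: recognized facilities constrain the per-image description, and the per-image descriptions constrain the route summary.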
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2026-04-08T16:26:49Z. No. of bitstreams: 0 [en]
dc.description.provenance: Made available in DSpace on 2026-04-08T16:26:49Z (GMT). No. of bitstreams: 0 [en]
dc.description.tableofcontents:
Oral Examination Committee Certification I
Acknowledgements II
Abstract (in Chinese) IV
Abstract V
Table of Contents VII
List of Images IX
List of Tables XI
List of Acronyms XII
Chapter 1. Introduction 1
1.1 Background and motivation 1
1.2 Aim and objectives 5
Chapter 2. Literature review 7
2.1 Pedestrian factors 7
2.1.1 Analysis of pedestrian factors 7
2.1.2 Pedestrian factors in Taiwan 11
2.2 Street view image analysis 12
2.3 Vision-Language models and prompt engineering 15
2.3.1 Vision-Language models 16
2.3.2 Prompt engineering 18
Chapter 3. Methods 23
3.1 Data acquisition 24
3.2 Data annotation and prompt design 25
3.2.1 Pedestrian factors in this study 26
3.2.2 Task 1: Facilities recognition 27
3.2.3 Task 2: Image description 30
3.2.4 Task 3: Summary 33
3.3 Model building 36
3.3.1 Large Language and Vision Assistant (LLaVA) 37
3.3.2 Pedestrian Route Description-LLaVA (PRD-LLaVA) model 38
3.3.3 Evaluation 39
Chapter 4. Results and discussion 44
4.1 Study area and materials 44
4.2 Training process and cross-validation results 49
4.3 Three-task testing results 52
4.3.1 Task 1 evaluation 52
4.3.2 Task 2 evaluation 54
4.3.3 Task 3 evaluation 55
4.4 Comparison and discussion 56
4.4.1 Task 1 56
4.4.2 Task 2 58
4.4.3 Task 3 63
4.5 Navigation system integration and application scenarios 76
4.6 Limitations 85
Chapter 5. Conclusions and future work 86
5.1 Conclusions 86
5.2 Future work 87
Reference 89
dc.language.iso: en
dc.subject: 行人路徑規劃
dc.subject: 街景影像
dc.subject: 路徑描述
dc.subject: 多模態大型語言模型
dc.subject: 行人路徑描述-LLaVA
dc.subject: pedestrian route planning
dc.subject: PRD-LLaVA
dc.subject: street view imagery
dc.subject: multimodal large language model
dc.title: 結合多模態大型語言模型與街景影像生成行人路徑描述 [zh_TW]
dc.title: Generating Pedestrian Route Description Using Multimodal Large Language Models and Street View Imagery [en]
dc.type: Thesis
dc.date.schoolyear: 114-2
dc.description.degree: 碩士 (Master)
dc.contributor.coadvisor: 郭巧玲 [zh_TW]
dc.contributor.coadvisor: Chiao-Ling Kuo [en]
dc.contributor.oralexamcommittee: 郭佩棻; 林峰田 [zh_TW]
dc.contributor.oralexamcommittee: Pei-Fen Kuo; Feng-Tyan Lin [en]
dc.subject.keyword: 行人路徑規劃, 街景影像, 路徑描述, 多模態大型語言模型, 行人路徑描述-LLaVA [zh_TW]
dc.subject.keyword: pedestrian route planning, PRD-LLaVA, street view imagery, multimodal large language model [en]
dc.relation.page: 101
dc.identifier.doi: 10.6342/NTU202600838
dc.rights.note: Authorization granted (access restricted to campus)
dc.date.accepted: 2026-03-10
dc.contributor.author-college: 理學院 (College of Science)
dc.contributor.author-dept: 地理環境資源學系 (Department of Geography)
dc.date.embargo-lift: 2026-04-09
Appears in Collections: 地理環境資源學系 (Department of Geography)

Files in This Item:
File | Size | Format
ntu-114-2.pdf | 9.57 MB | Adobe PDF (access restricted to NTU campus IP addresses; use the library VPN service when off campus)

