NTU Theses and Dissertations Repository

Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/102226
Title: Generating Pedestrian Route Description Using Multimodal Large Language Models and Street View Imagery
(Original title: 結合多模態大型語言模型與街景影像生成行人路徑描述)
Author: Si Ying Yau (丘絲盈)
Advisor: Alessandro Crivellari
Co-advisor: Chiao-Ling Kuo (郭巧玲)
Keywords: pedestrian route planning, route description, street view imagery, multimodal large language model, PRD-LLaVA
Publication Year: 2026
Degree: Master
Abstract:
Pedestrian route planning not only assists pedestrians in reaching their destinations but also contributes to enhancing the walking experience. The routes pedestrians choose based on planning outcomes directly influence the quality of their daily mobility, and this decision-making process is shaped by the built environment, individual preferences, and situational contexts. Many of these factors are qualitative and difficult to quantify, such as safety, comfort, and convenience. Incorporating such factors into practical routing applications often requires exhaustive enumeration of relevant conditions, resulting in highly complex route planning. Moreover, existing navigation systems primarily provide information through quantitative indicators such as travel time, distance, and number of turns. Even when multiple alternative routes are offered, differences in walking experience are often insufficiently highlighted, making it difficult for users to select routes that best align with their individual needs and preferences.
To address these limitations, this study proposes the Pedestrian Route Description-LLaVA (PRD-LLaVA) framework, which leverages a multimodal large language model to analyze street view imagery along routes recommended by existing routing algorithms. The framework extracts objective spatial features that influence the pedestrian walking experience, such as sidewalks, land use, greenery, and obstacles, and generates textual descriptions that supply the micro-scale environmental information lacking in conventional navigation systems. PRD-LLaVA consists of three processing stages: 1. identifying pedestrian facilities and walking environments along the navigation output route, which already accounts for the shortest or fastest path; 2. generating a semantic description for each street view image; and 3. aggregating these descriptions into a complete route summary. Experimental results demonstrate that, through structured prompting and model fine-tuning, PRD-LLaVA achieved an accuracy of 93.01% in the first-stage pedestrian facility identification task. In the second stage, the F1 score of BERTScore was used to evaluate the semantic similarity between generated image descriptions and manually annotated references; PRD-LLaVA achieved a score of 0.70, outperforming ChatGPT-4o, which scored 0.38. In the third stage, the route descriptions generated by PRD-LLaVA effectively capture the spatial configuration along pedestrian routes, including six types of sidewalk conditions, four land-use categories, and streetscape features such as shops, street width, and the proportion of greenery. Furthermore, the model outputs are integrated into a geographic information system to demonstrate how semantic route descriptions can be combined with map-based interfaces, enhancing pedestrians' understanding during route selection.
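
As a concrete illustration of the three-stage flow described above, the following Python sketch uses the public llava-hf/llava-1.5-7b-hf checkpoint from Hugging Face Transformers as a stand-in for the fine-tuned PRD-LLaVA weights, which are not described as publicly released; the prompt wording, image file names, and the simple concatenation used for stage 3 are illustrative assumptions, not the thesis's actual implementation.

    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    # Stand-in checkpoint; the thesis fine-tunes LLaVA into PRD-LLaVA.
    MODEL_ID = "llava-hf/llava-1.5-7b-hf"
    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = LlavaForConditionalGeneration.from_pretrained(MODEL_ID)

    def describe_image(path: str, instruction: str) -> str:
        """Stages 1-2: structured prompt over a single street view image."""
        image = Image.open(path)
        prompt = f"USER: <image>\n{instruction} ASSISTANT:"
        inputs = processor(text=prompt, images=image, return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=200)
        text = processor.decode(output_ids[0], skip_special_tokens=True)
        return text.split("ASSISTANT:")[-1].strip()  # keep only the answer

    # Hypothetical street view frames sampled along a recommended route.
    frames = ["route_img_001.jpg", "route_img_002.jpg"]
    instruction = (
        "Identify pedestrian facilities (sidewalk, greenery, obstacles, "
        "land use) and describe the walking environment in one sentence."
    )
    descriptions = [describe_image(f, instruction) for f in frames]

    # Stage 3, simplified: join per-image descriptions into a route
    # summary; the thesis aggregates these into a fuller narrative.
    route_summary = " ".join(descriptions)
    print(route_summary)

The second-stage evaluation can likewise be reproduced in outline with the open-source bert-score package, which implements the BERTScore metric reported above; the candidate and reference sentences here are made-up examples, not thesis data.

    from bert_score import score

    # Hypothetical model-generated descriptions, one per image.
    candidates = [
        "A wide sidewalk lined with shops and street trees on the left.",
        "No dedicated sidewalk; pedestrians share a narrow road with parked scooters.",
    ]
    # Manually annotated reference descriptions of the same images.
    references = [
        "Wide sidewalk with storefronts and greenery along the left side.",
        "Narrow shared road without a sidewalk, partly blocked by scooters.",
    ]

    # BERTScore matches tokens by contextual-embedding similarity and
    # returns precision/recall/F1 tensors, one value per sentence pair.
    P, R, F1 = score(candidates, references, lang="en")
    print(f"Mean BERTScore F1: {F1.mean().item():.2f}")
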
In summary, this study exploits the flexible language generation capabilities of large language models, in combination with model training strategies and street view imagery, to propose an automated approach for generating pedestrian route descriptions. By overcoming the limitations of traditional route planning methods that rely on manual feature definition, exhaustive factor enumeration, and labor-intensive data collection, the proposed framework offers a flexible and efficient solution for intelligent pedestrian navigation.
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/102226
DOI: 10.6342/NTU202600838
Full-Text Authorization: Authorized (access restricted to campus)
Electronic Full-Text Release Date: 2026-04-09
Appears in Collections: Department of Geography

Files in This Item:
File: ntu-114-2.pdf
Size: 9.57 MB
Format: Adobe PDF
Access: Restricted to NTU campus IP addresses (use the library's VPN service for off-campus access)


Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
