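Trajectory Planning in Dense Urban Environments: Utilizing Vision-Language Models to Learn Socially-Aware Behaviors for High-Uncertainty Scenarios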

Yik-Ming Chin (曾益銘)

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99614

Title:	Trajectory Planning in Dense Urban Environments: Utilizing Vision-Language Models to Learn Socially-Aware Behaviors for High-Uncertainty Scenarios (original Chinese title: 應用視覺語言模型於高密度市區中進行具有社會化互動認知之自駕車路徑規劃)
Author:	Yik-Ming Chin (曾益銘)
Advisor:	Chi-Sheng Shih (施吉昇)
Keywords:	Autonomous Vehicle, Motion Planning, Large Language Model, Vision Language Model, Socially-Aware
Year of publication:	2025
Degree:	Master's
Abstract:	Modern autonomous driving systems struggle in dense urban environments, where social interactions and uncertainty dominate the planning landscape. Existing approaches either rely on rigid rule-based modules or optimize learning-based models on scene geometry, but often fail to capture human-like social awareness and contextual reasoning. In contrast, this work argues that a socially-informed planner should reason jointly over spatial, semantic, and behavioral cues to generate trajectories aligned with implicit social norms. Toward this goal, this work proposes a Vision-Language Model (VLM)-based framework that unifies social attribute understanding and trajectory generation within a deployable architecture. Leveraging visual inputs and fine-grained action annotations, the model is trained with a custom loss that accounts for both trajectory and semantic alignment. Evaluated on a curated subset of the TITAN dataset, the system exhibits safety and social compliance. Experimental results show strong performance across four key evaluations: in NLP metrics, the model achieves a BLEU-4 score of 0.21, a ROUGE-L score of 0.37, and a METEOR score of 0.52; it also achieves a GPT-4o Rubric Score of 86, a VQA accuracy of 44%, and an average L2 distance error of 30 pixels, reflecting robust reasoning and planning capabilities. This work advocates for planning-centric learning with socially grounded supervision, and offers a solution toward socially-aware autonomous driving.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99614
DOI:	10.6342/NTU202502673
Full-text license:	Not authorized
Full-text public release date:	N/A
Appears in collections:	Graduate Institute of Networking and Multimedia
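
The abstract reports an average L2 distance error in pixels. As a minimal sketch of how such a metric is commonly computed, the Python snippet below averages the Euclidean distance between matching predicted and ground-truth waypoints. The function name `average_l2_error` and the list-of-(x, y)-pixel-pairs layout are assumptions made for illustration; the record does not describe the thesis's actual evaluation code.

```python
import math
from typing import Sequence, Tuple

# Hypothetical waypoint type: (x, y) image coordinates in pixels.
Point = Tuple[float, float]


def average_l2_error(pred: Sequence[Point], truth: Sequence[Point]) -> float:
    """Mean Euclidean distance between corresponding waypoints.

    Assumes both trajectories have the same number of waypoints and that
    waypoint i in `pred` corresponds to waypoint i in `truth`.
    """
    if len(pred) != len(truth):
        raise ValueError("trajectories must have the same number of waypoints")
    if not pred:
        raise ValueError("trajectories must contain at least one waypoint")
    # math.dist computes the Euclidean (L2) distance between two points.
    total = sum(math.dist(p, t) for p, t in zip(pred, truth))
    return total / len(pred)


# Example: one waypoint matches exactly, the other is off by a 3-4-5 triangle.
predicted = [(0.0, 0.0), (3.0, 4.0)]
ground_truth = [(0.0, 0.0), (0.0, 0.0)]
error = average_l2_error(predicted, ground_truth)
```

Under this definition, a reported value of 30 would mean that predicted waypoints lie, on average, 30 pixels from their ground-truth counterparts in image space.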

Files in this item:

File	Size	Format
ntu-113-2.pdf (not authorized for public access)	5.79 MB	Adobe PDF

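Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.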
