Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/102260

| Title: | TungVLM: A Vision-Language Framework for Vehicle Make, Model, and Color Recognition |
| Author: | Yen-Yu Tung |
| Advisor: | Chiou-Shann Fuh |
| Keywords: | Vehicle Make, Model, and Color Recognition; Vision-Language Models; YOLOv12; SigLIP 2; Deep Learning; Intelligent Transportation Systems |
| Year of Publication: | 2026 |
| Degree: | Master |
| Abstract: | With the advancement of Intelligent Transportation Systems (ITS) and technological law enforcement, Vehicle Make, Model, and Color Recognition (MMCR) has become a critical research topic. However, Taiwan's traffic environment presents unique challenges due to the high diversity of imported and domestic vehicles, making the distinction of subtle model features and colors in complex scenes difficult. Traditional supervised learning relies on large-scale, manually annotated datasets, a labor-intensive process susceptible to human error. To address these challenges, we propose TungVLM, an automated labeling and recognition system integrating advanced object detection with Vision-Language Models (VLMs). The primary objective is to leverage a high-performance detection model alongside a classification model with strong semantic understanding to achieve precise recognition of vehicle makes, models, and colors while maintaining recognition efficiency.
The methodology comprises three main stages: dataset preparation, recognition model training, and system inference. For data collection and preprocessing, TungVLM integrates the public DVM-Car (Deep Visual Marketing-Car) dataset with a large-scale collection of localized vehicle images scraped from major Taiwanese automotive trading websites (e.g., 8891, HOT, SUM). Subsequently, using the vehicle coordinates produced by the YOLOv12 (You Only Look Once version 12) object detection model, combined with the original vehicle attributes provided by the DVM-Car dataset or the used-car websites, the system automatically generates the corresponding image annotations. These annotations undergo manual verification to establish high-quality annotation files for subsequent recognition model training.
For the core architecture, we propose a hybrid framework combining a high-performance vision encoder with a Large Language Model (LLM). Specifically, we employ Google DeepMind's SigLIP 2 (Sigmoid Loss for Language-Image Pretraining) as the vision encoder to extract semantically rich deep image features, and Microsoft's Phi-4-mini-instruct as the LLM backbone. To fuse visual and linguistic features effectively, we design a Re-sampler containing learnable queries and a Transformer encoder, along with a Projector that maps visual features into the language latent space. The model's output layer is a multi-head classifier that predicts vehicle make, model (per make), color, and bounding box. Through this end-to-end training approach, the model learns both global vehicle characteristics and subtle local differences simultaneously.
For system architecture and multi-vehicle inference, we design a two-stage inference pipeline. First, for input images containing one or more vehicles, the system applies the YOLOv12 model for object detection.
YOLOv12 offers high detection speed and accuracy, enabling precise localization of vehicle bounding boxes in complex scenes. Subsequently, using the coordinates produced by YOLOv12, TungVLM crops each vehicle from the original image to eliminate background noise. The cropped images are then fed into the trained VLM recognition model, which outputs the complete attributes of each vehicle and generates the corresponding structured annotation files. In conclusion, this study develops a vehicle recognition system that integrates an automated labeling pipeline with an advanced VLM architecture. TungVLM not only effectively addresses the difficulty of recognizing Taiwan's diverse vehicle models, but also demonstrates the strong potential of Vision-Language Models in domain-specific applications. |
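The Re-sampler described in the abstract compresses a variable number of visual patch tokens into a fixed-length set of learnable query tokens before projection into the LLM. As a minimal sketch of that idea (not the thesis code), the following single-head cross-attention in NumPy shows how K learnable queries attend over N patch features; all dimensions and names are illustrative assumptions.

```python
import numpy as np

def resample(patch_feats: np.ndarray, queries: np.ndarray) -> np.ndarray:
    """patch_feats: (N, d) visual tokens; queries: (K, d) learnable queries.
    Returns a (K, d) fixed-length summary for the projector/LLM."""
    d = queries.shape[1]
    scores = queries @ patch_feats.T / np.sqrt(d)    # (K, N) attention logits
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)          # softmax over patches
    return attn @ patch_feats                        # (K, d) attended summary

rng = np.random.default_rng(0)
patches = rng.normal(size=(196, 64))   # e.g. a 14x14 ViT patch grid, d=64
queries = rng.normal(size=(8, 64))     # K=8 learnable queries (assumed K)
summary = resample(patches, queries)
print(summary.shape)                   # fixed (K, d) regardless of patch count
```

In a trained model the queries are parameters learned end-to-end, and a Transformer encoder plus a projection layer would follow; this sketch only illustrates why the output length is independent of the input image's token count.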
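The two-stage inference flow (detect, crop, then classify each vehicle) can be sketched as follows. This is a hedged illustration, not the thesis implementation: `detect_vehicles` and `classify_vehicle` are stubs standing in for YOLOv12 and the SigLIP 2 + LLM classifier, and all names and values are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    x1: int; y1: int; x2: int; y2: int; score: float

def detect_vehicles(image):
    """Stub detector: a real system would run YOLOv12 here."""
    return [Detection(10, 20, 110, 90, 0.97), Detection(150, 30, 260, 110, 0.91)]

def classify_vehicle(crop):
    """Stub classifier: a real system would run the trained VLM here."""
    return {"make": "Toyota", "model": "Corolla", "color": "white"}

def annotate(image):
    """Stage 1: detect boxes. Stage 2: crop each box to remove background
    clutter, classify the crop, and emit one structured record per vehicle."""
    records = []
    for det in detect_vehicles(image):
        crop = [row[det.x1:det.x2] for row in image[det.y1:det.y2]]
        attrs = classify_vehicle(crop)
        records.append({**attrs, "bbox": [det.x1, det.y1, det.x2, det.y2]})
    return records

image = [[0] * 300 for _ in range(200)]   # dummy 300x200 image
records = annotate(image)
print(len(records))                       # one record per detected vehicle
```

Cropping before classification is the key design choice: the VLM sees only the vehicle region, so background vehicles and scene clutter cannot contaminate the per-vehicle attribute prediction.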
| URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/102260 |
| DOI: | 10.6342/NTU202600860 |
| Full-Text License: | License granted (restricted to on-campus access) |
| Electronic Full-Text Release Date: | 2031-03-02 |
| Appears in Collections: | Graduate Institute of Biomedical Electronics and Bioinformatics |
Files in this item:
| File | Size | Format | |
|---|---|---|---|
| ntu-114-2.pdf (restricted access) | 5.69 MB | Adobe PDF | View/Open |
Items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
