Please use this identifier to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/102260

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 傅楸善 | zh_TW |
| dc.contributor.advisor | Chiou-Shann Fuh | en |
| dc.contributor.author | 童嬿瑜 | zh_TW |
| dc.contributor.author | Yen-Yu Tung | en |
| dc.date.accessioned | 2026-04-08T16:42:35Z | - |
| dc.date.available | 2026-04-09 | - |
| dc.date.copyright | 2026-04-08 | - |
| dc.date.issued | 2026 | - |
| dc.date.submitted | 2026-03-17 | - |
| dc.identifier.citation | [1] 8891 中古車網, "Used Car Search: Used Car Information Across Taiwan," https://www.8891.com.tw, 2025.
[2] abc 好車網, "Hotai Group Used Car Platform," https://www.abccar.com.tw, 2025.
[3] J. Alayrac et al., "Flamingo: A Visual Language Model for Few-Shot Learning," Proceedings of Advances in Neural Information Processing Systems, vol. 35, New Orleans, LA, pp. 23716-23736, 2022.
[4] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-End Object Detection with Transformers," Proceedings of the European Conference on Computer Vision, Glasgow, UK, pp. 213-229, 2020.
[5] W. C. Chang and Y. A. Chen, "Zero-Shot Vehicle Model Recognition via Text-Based Retrieval-Augmented Generation," arXiv:2510.18502, 2025.
[6] Q. M. Dinh, M. K. Ho, A. Q. Dang, and H. P. Tran, "TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, pp. 7134-7143, 2024.
[7] A. Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," Proceedings of International Conference on Learning Representations, Vienna, Austria, pp. 1-22, 2021.
[8] J. Fu, H. Zheng, and T. Mei, "Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-Grained Image Recognition," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, pp. 4438-4446, 2017.
[9] S. Gunasekar et al., "Textbooks Are All You Need," arXiv:2306.11644, 2023.
[10] K. M. He, X. Y. Zhang, S. Q. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, pp. 770-778, 2016.
[11] HOT大聯盟, "Certified Used Cars and After-Sales Service," https://www.hotcar.com.tw, 2025.
[12] J. Krause, M. Stark, J. Deng, and L. Fei-Fei, "3D Object Representations for Fine-Grained Categorization," Proceedings of the IEEE International Conference on Computer Vision Workshops, Sydney, Australia, pp. 554-561, 2013.
[13] J. Li, D. Li, S. Savarese, and S. Hoi, "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models," Proceedings of International Conference on Machine Learning, Honolulu, HI, pp. 1-13, 2023.
[14] X. Li, W. Wang, L. Wu, S. Chen, X. Hu, J. Li, J. Tang, and J. Yang, "Generalized Focal Loss: Learning Joint Representation of Quality and Distribution for Dense Object Detection," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, pp. 13364-13373, 2020.
[15] X. Li et al., "CLIP-ReID: Exploiting Vision-Language Model for Image Re-Identification without Concrete Text Labels," Proceedings of IEEE International Conference on Multimedia and Expo, Brisbane, Australia, pp. 1-6, 2023.
[16] H. Liu, C. Li, Q. Wu, and Y. J. Lee, "Visual Instruction Tuning," Proceedings of Advances in Neural Information Processing Systems, vol. 36, New Orleans, LA, pp. 1-25, 2023.
[17] X. Liu, W. Liu, T. Mei, and H. Ma, "A Deep Learning-Based Approach to Progressive Vehicle Re-Identification for Urban Surveillance," Proceedings of the European Conference on Computer Vision, Amsterdam, Netherlands, pp. 869-884, 2016.
[18] D. G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[19] J. Miao et al., "DVM-Car: A Large-Scale Automotive Dataset for Design Analysis and Generation," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, pp. 10435-10445, 2022.
[20] Microsoft Research, "Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs," arXiv:2503.01743, 2025.
[21] S. A. Najeeb, R. H. Raza, A. Yusuf, and Z. Sultan, "Fine-Grained Vehicle Classification in Urban Traffic Scenes Using Deep Learning," Proceedings of the International Conference on Robotics, Vision, Signal Processing and Power Applications, Penang, Malaysia, pp. 902-908, 2021.
[22] OpenAI, "GPT-4o Model," https://developers.openai.com/api/docs/models/gpt-4o, 2025.
[23] A. Radford et al., "Learning Transferable Visual Models From Natural Language Supervision," Proceedings of International Conference on Machine Learning, Virtual, pp. 8748-8763, 2021.
[24] S. Q. Ren, K. M. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137-1149, 2017.
[25] N. Shazeer, "GLU Variants Improve Transformer," arXiv:2002.05202, 2020.
[26] J. Su, M. Ahmed, Y. Pan, B. Han, Y. Liu, and J. Guo, "RoFormer: Enhanced Transformer with Rotary Position Embedding," Neurocomputing, vol. 568, art. 127063, pp. 1-24, 2024.
[27] SUM賞車網, "Used Car Trading and Warranty Service," https://www.sum.com.tw, 2025.
[28] Z. Sun, G. Bebis, and R. Miller, "On-Road Vehicle Detection: A Review," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 5, pp. 694-711, 2006.
[29] Y. Tian et al., "YOLOv12: Attention-Centric Real-Time Object Detectors," arXiv:2502.12524, 2025.
[30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention Is All You Need," Proceedings of Advances in Neural Information Processing Systems, vol. 30, Long Beach, CA, pp. 1-15, 2017.
[31] X. Xiao et al., "Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks," arXiv:2311.06242, 2024.
[32] H. Yang et al., "Large-Scale Vehicle Re-Identification in the Wild," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, pp. 190-199, 2020.
[33] L. Yang, P. Luo, C. C. Loy, and X. Tang, "A Large-Scale Car Dataset for Fine-Grained Categorization and Verification," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, pp. 3973-3981, 2015.
[34] X. Zhai, B. Mustafa, A. Steiner, and L. Beyer, "Sigmoid Loss for Language Image Pre-Training," Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, pp. 11975-11986, 2023.
[35] X. Zhai et al., "SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features," arXiv:2502.14786, 2025.
[36] Z. Zheng et al., "Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 7, New York, NY, pp. 12993-13000, 2020.
[37] L. Zhu, F. R. Yu, Y. Wang, B. Ning, and T. Tang, "Big Data Analytics in Intelligent Transportation Systems: A Survey," IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 1, pp. 383-398, 2019. | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/102260 | - |
| dc.description.abstract | 隨著智慧交通系統與科技執法需求的提升,車輛廠牌、型號與顏色辨識(Make, Model, and Color Recognition, MMCR)已成為電腦視覺領域中的重要研究議題。然而,在臺灣的交通環境中,車輛種類繁多,包含多樣化的進口車與國產車,且型號與顏色特徵在實際交通場景下難以準確區分。傳統的監督式學習方法依賴大量人工標註的資料集,不僅耗時費力,且標註的一致性與品質常受限於人為因素。為解決上述問題,我們提出了TungVLM,一套結合先進物件偵測技術與視覺語言模型的自動化標註與辨識系統,旨在透過整合高效能的物件偵測模型與具備強大語意理解能力的視覺語言分類模型,在維持辨識效率的同時,實現對車輛廠牌、型號及顏色的精準辨識。
本研究的方法論包含三個主要階段:資料集準備、辨識模型訓練以及系統推論。首先,在資料收集與前處理方面,本研究整合了公開資料集 DVM-Car (Deep Visual Marketing-Car),以及透過網路爬蟲技術從臺灣各大汽車交易網站(如 8891、HOT大聯盟、SUM汽車網 (Serve Your Motors) 等)收集的在地化車輛影像。接著,利用YOLOv12(You Only Look Once version 12)物件偵測模型產生的車輛座標資訊,結合DVM-Car資料集或二手車網站所提供的原始車輛屬性,自動生成影像對應標註,並經過人工驗證,建立高品質標註檔案以供後續辨識模型訓練使用。
在核心辨識模型的架構設計上,我們提出了一種結合高效能視覺編碼器與大型語言模型的混合架構。我們採用了 Google DeepMind 最新的SigLIP 2(Sigmoid Loss for Language-Image Pretraining)作為視覺編碼器,以擷取富含語義的深層影像特徵,並選用Microsoft的Phi-4-mini-instruct作為大型語言模型(Large Language Model, LLM)的骨幹。為了有效地融合視覺與語言特徵,本研究設計了一個包含可學習查詢向量與Transformer編碼器的重新採樣器,以及一個將視覺特徵映射至語言潛在空間的投影器。模型的輸出層設計為多頭分類器,分別針對車輛廠牌、各廠牌下的型號、顏色以及邊界框進行預測。透過這種端對端的訓練方式,模型能夠同時學習車輛的全域特徵與細微的局部差異。
在系統架構與多車影像推論方面,本研究設計了一套兩階段推論流程。首先,針對單一或多車輛的輸入影像,本系統導入了YOLOv12模型進行物件偵測。YOLOv12具備卓越的偵測速度與準確率,能夠從複雜場景中精準定位出車輛的邊界框。隨後,系統透過YOLOv12產生的座標資訊,將每一輛車從原始影像中精確裁切並去除背景雜訊干擾,最後將裁切後的影像輸入至訓練完成的VLM辨識模型,輸出每一輛車的完整屬性,並自動生成結構化的標註檔案。
綜合上述內容,本研究成功開發了一套結合自動化標註流程與先進VLM架構模型的車輛辨識系統。TungVLM不僅有效解決了臺灣多樣化車型辨識的困難,更證明了視覺語言模型在特定應用場景中的強大潛力。 | zh_TW |
| dc.description.abstract | With the advancement of Intelligent Transportation Systems (ITS) and technology-based traffic enforcement, Vehicle Make, Model, and Color Recognition (MMCR) has become a critical research topic in computer vision. However, Taiwan's traffic environment poses unique challenges: the high diversity of imported and domestic vehicles makes subtle model features and colors difficult to distinguish in complex scenes. Traditional supervised learning relies on large-scale, manually annotated datasets, a labor-intensive process susceptible to human error. To address these challenges, we propose TungVLM, an automated labeling and recognition system that integrates advanced object detection with Vision-Language Models (VLMs). The primary objective is to pair a high-performance detection model with a classification model possessing strong semantic understanding, achieving precise recognition of vehicle makes, models, and colors while maintaining recognition efficiency.
The methodology comprises three main stages: dataset preparation, recognition model training, and system inference. For data collection and preprocessing, TungVLM integrates the public DVM-Car dataset with a large-scale collection of localized vehicle images scraped from major Taiwanese automotive trading websites (e.g., 8891, HOT, SUM). Vehicle coordinates produced by the YOLOv12 object detector are then combined with the original vehicle attributes supplied by the DVM-Car dataset or the used-car websites to generate image annotations automatically. These annotations undergo manual verification, yielding high-quality annotation files for subsequent recognition-model training.
For the core architecture, we propose a hybrid framework combining a high-performance vision encoder with a Large Language Model (LLM). Specifically, we employ Google DeepMind's SigLIP 2 (Sigmoid Loss for Language-Image Pretraining) as the vision encoder to extract semantically rich deep image features, and Microsoft's Phi-4-mini-instruct as the LLM backbone. To fuse visual and linguistic features effectively, we design a re-sampler built from learnable queries and a Transformer encoder, together with a projector that maps visual features into the language latent space. The model's output layer is a multi-head classifier that predicts the vehicle make, the model within each make, the color, and the bounding box. Through this end-to-end training approach, the model learns global vehicle characteristics and subtle local differences simultaneously.
For system architecture and multi-vehicle inference, this study designs a two-stage inference pipeline. First, for input images containing one or more vehicles, the system applies the YOLOv12 model for object detection; its detection speed and accuracy enable precise localization of vehicle bounding boxes in complex scenes. Using the resulting coordinates, TungVLM crops each vehicle from the original image to eliminate background noise. The cropped images are then fed into the trained VLM recognition model, which outputs the complete attributes of each vehicle and generates the corresponding structured annotation files.
In conclusion, this study develops a vehicle recognition system that integrates an automated labeling pipeline with an advanced VLM architecture. TungVLM not only addresses the difficulty of recognizing Taiwan's diverse vehicle models but also demonstrates the strong potential of Vision-Language Models in domain-specific applications. (Illustrative code sketches of the annotation step, the re-sampler and projector, the classification heads, and the two-stage inference pipeline follow the metadata table below.) | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2026-04-08T16:42:35Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2026-04-08T16:42:35Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | 誌謝 (Acknowledgements) i
中文摘要 (Chinese Abstract) ii
ABSTRACT v
CONTENTS viii
LIST OF FIGURES xiii
LIST OF TABLES xviii
Chapter 1 Introduction 1
1.1 Background 1
1.2 Objectives 4
1.3 Contributions 6
1.4 Thesis Organization 7
Chapter 2 Related Works 9
2.1 Overview 9
2.2 Traditional Feature Extraction Methods 10
2.3 Deep Learning for Fine-Grained Vehicle Classification 12
2.4 Object Detection for Vehicle Localization 16
2.5 Vision-Language Models in Vehicle Recognition 18
Chapter 3 Background 22
3.1 Overview 22
3.2 The Transformer Architecture 23
3.2.1 Scaled Dot-Product Attention 24
3.2.2 Multi-Head Attention 25
3.2.3 Position-wise Feed-Forward Networks 26
3.2.4 Positional Encoding 27
3.2.5 Residual Connections and Layer Normalization 28
3.3 Vision Encoder: The SigLIP 2 Architecture 28
3.3.1 Vision Transformer Backbone 29
3.3.2 NaFlex: Native Aspect Ratio and Variable Resolution 31
3.3.3 Dense Features via Hybrid Training Objectives 32
3.4 Large Language Model: The Phi-4 Framework 33
3.4.1 Decoder-Only Transformer Backbone 34
3.4.2 Rotary Positional Embeddings 34
3.4.3 SwiGLU Activation Function 35
3.4.4 Data-Centric Reasoning Optimization 36
3.5 Attention-Centric Object Detection: YOLOv12 37
3.5.1 Area Attention Mechanism 38
3.5.2 Residual Efficient Layer Aggregation Networks (R-ELAN) Backbone Architecture 39
3.5.3 FlashAttention and Hardware Optimization 40
3.5.4 Position Encoding and Matching 40
3.6 Multimodal Fusion Mechanisms 41
3.6.1 Linear Projection 41
3.6.2 Perceiver Re-sampler 43
3.6.3 Q-Former 44
Chapter 4 Methodology 46
4.1 Overview 46
4.2 Dataset Preparation 47
4.2.1 Global Base Dataset: DVM-Car 48
4.2.2 Local Extension: Taiwan-Specific Web Scraping 52
4.2.3 Data Preprocessing and Cleaning 54
4.2.4 Automated Annotation and Vehicle-Centric Cropping 56
4.3 Recognition Model Training 57
4.3.1 Vision Encoder: SigLIP 2 58
4.3.2 Semantic Adapter: Re-sampler and Projector 59
4.3.3 Hierarchical Classification Mechanism 60
4.3.4 Training Strategy and Optimization 61
4.3.5 Evaluation Metrics 63
4.4 System Inference Pipeline 64
Chapter 5 Experiment Results 68
5.1 Overview 68
5.2 Experimental Environment 69
5.3 Results on DVM-Car & Web Scraping Dataset 70
5.3.1 Dataset Statistics for Experiment 71
5.3.2 Comparison with Baseline: Florence-2-base vs. TungVLM 72
5.3.3 TungVLM Detailed Accuracy 74
5.4 Results on CHTTL Surveillance Dataset 76
5.5 Ablation Studies 80
5.5.1 Impact of Vision Encoder and NaFlex 80
5.5.2 Impact of Hierarchical Classification Heads 81
Chapter 6 Conclusion and Future Works 83
6.1 Conclusion 83
6.2 Future Works 85
Chapter 7 References 88 | - |
| dc.language.iso | en | - |
| dc.subject | 車輛廠牌型號顏色辨識 | - |
| dc.subject | 視覺語言模型 | - |
| dc.subject | YOLOv12 | - |
| dc.subject | SigLIP 2 | - |
| dc.subject | 深度學習 | - |
| dc.subject | 智慧交通 | - |
| dc.subject | Vehicle Make, Model, and Color Recognition | - |
| dc.subject | Vision-Language Models | - |
| dc.subject | YOLOv12 | - |
| dc.subject | SigLIP 2 | - |
| dc.subject | Deep Learning | - |
| dc.subject | Intelligent Transportation Systems | - |
| dc.title | 童視覺語言模型:基於視覺語言模型的車輛品牌、型號與顏色辨識系統 | zh_TW |
| dc.title | TungVLM: A Vision-Language Framework for Vehicle Make, Model, and Color Recognition | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 114-2 | - |
| dc.description.degree | 碩士 (Master) | - |
| dc.contributor.oralexamcommittee | 董子嘉;曾昱嘉 | zh_TW |
| dc.contributor.oralexamcommittee | Tzu-Chia Tung;Yu-Chia Tseng | en |
| dc.subject.keyword | 車輛廠牌型號顏色辨識,視覺語言模型,YOLOv12,SigLIP 2,深度學習,智慧交通 | zh_TW |
| dc.subject.keyword | Vehicle Make, Model, and Color Recognition,Vision-Language Models,YOLOv12,SigLIP 2,Deep Learning,Intelligent Transportation Systems | en |
| dc.relation.page | 90 | - |
| dc.identifier.doi | 10.6342/NTU202600860 | - |
| dc.rights.note | 同意授權(限校園內公開) (Authorization granted; access restricted to campus) | - |
| dc.date.accepted | 2026-03-17 | - |
| dc.contributor.author-college | 電機資訊學院 (College of Electrical Engineering and Computer Science) | - |
| dc.contributor.author-dept | 生醫電子與資訊學研究所 (Graduate Institute of Biomedical Electronics and Bioinformatics) | - |
| dc.date.embargo-lift | 2031-03-02 | - |
| Appears in Collections: | 生醫電子與資訊學研究所 (Graduate Institute of Biomedical Electronics and Bioinformatics) |
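
The abstract's automated-labeling stage pairs YOLOv12 detection boxes with the make, model, and color that the DVM-Car record or the scraped listing already states, then writes an annotation file for later manual verification. The following is a minimal Python sketch of that idea only; the JSON field names and the `write_annotation` helper are illustrative assumptions, not the thesis's actual annotation format.

```python
import json

def write_annotation(image_path, boxes, listing_attrs, out_path):
    """Pair detector boxes with attributes already known from the data source.

    boxes: list of [x1, y1, x2, y2] pixel coordinates from the detector.
    listing_attrs: dict with "make"/"model"/"color" taken from the DVM-Car
    record or the scraped listing, not predicted by any model.
    """
    annotation = {
        "image": image_path,
        "vehicles": [
            {"bbox": box,
             "make": listing_attrs["make"],
             "model": listing_attrs["model"],
             "color": listing_attrs["color"],
             "verified": False}  # flipped to True after human review
            for box in boxes
        ],
    }
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(annotation, f, ensure_ascii=False, indent=2)

# Hypothetical listing image with one detected vehicle.
write_annotation("8891_12345.jpg", [[34, 80, 612, 410]],
                 {"make": "Toyota", "model": "Corolla Altis", "color": "white"},
                 "8891_12345.json")
```

Because each listing photographs a single vehicle whose attributes the seller already declared, the listing's own metadata can label every detected box without per-image human annotation, and verification reduces to the spot-checking the abstract describes.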
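The abstract describes the re-sampler as learnable queries plus a Transformer encoder, followed by a projector into the language latent space. Below is a minimal PyTorch sketch of one plausible reading: query tokens are concatenated with the SigLIP 2 patch features, refined jointly by a standard encoder, and only the query positions are projected. All sizes (1152-dimensional visual features, a 3072-dimensional LLM space, 64 queries, 2 layers) are assumptions for illustration, not the thesis's configuration.

```python
import torch
import torch.nn as nn

class ReSamplerProjector(nn.Module):
    def __init__(self, vis_dim=1152, llm_dim=3072, num_queries=64,
                 num_layers=2, num_heads=8):
        super().__init__()
        # Learnable queries that summarize a variable number of patch
        # tokens into a fixed-length visual prefix for the language model.
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=vis_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Linear projector mapping visual features into the LLM latent space.
        self.projector = nn.Linear(vis_dim, llm_dim)

    def forward(self, patch_tokens):  # (B, N, vis_dim) from the vision encoder
        b = patch_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Queries and patch tokens attend to each other in one sequence.
        x = self.encoder(torch.cat([q, patch_tokens], dim=1))
        return self.projector(x[:, :q.size(1)])  # keep only query positions

tokens = torch.randn(2, 196, 1152)         # fake SigLIP 2 patch features
print(ReSamplerProjector()(tokens).shape)  # torch.Size([2, 64, 3072])
```

Since SigLIP 2's NaFlex mode yields a variable number of patch tokens per image, re-sampling to a fixed query count gives the LLM a constant-length visual prefix, which is the usual motivation for a re-sampler over plain linear projection.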
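For the multi-head output layer, here is a sketch of one way to realize hierarchical make/model/color/bounding-box prediction: a shared pooled feature feeds a make head, a per-make bank of model heads (so model logits are conditioned on the make), a color head, and a box regressor. The class counts and the argmax routing are hypothetical, not taken from the thesis.

```python
import torch
import torch.nn as nn

class MultiHeadClassifier(nn.Module):
    def __init__(self, feat_dim=3072, num_makes=60, models_per_make=None,
                 num_colors=12):
        super().__init__()
        models_per_make = models_per_make or [10] * num_makes
        self.make_head = nn.Linear(feat_dim, num_makes)
        # One model head per make: the model vocabulary differs by make.
        self.model_heads = nn.ModuleList(
            nn.Linear(feat_dim, n) for n in models_per_make)
        self.color_head = nn.Linear(feat_dim, num_colors)
        # Bounding box as normalized (x1, y1, x2, y2) in [0, 1].
        self.bbox_head = nn.Sequential(nn.Linear(feat_dim, 4), nn.Sigmoid())

    def forward(self, pooled):  # (B, feat_dim) pooled fused feature
        make_logits = self.make_head(pooled)
        # Route each sample to the model head of its predicted make; during
        # training the ground-truth make could be used for routing instead.
        idx = make_logits.argmax(dim=-1).tolist()
        model_logits = [self.model_heads[i](pooled[b]) for b, i in enumerate(idx)]
        return {"make": make_logits, "model": model_logits,
                "color": self.color_head(pooled), "bbox": self.bbox_head(pooled)}

out = MultiHeadClassifier()(torch.randn(3, 3072))
print(out["make"].shape, out["bbox"].shape)  # torch.Size([3, 60]) torch.Size([3, 4])
```

End-to-end training would then sum cross-entropy losses over the make, model, and color heads with a regression loss (e.g., L1 or an IoU-based loss) on the box head; the abstract does not specify the weighting, so any particular scheme is an assumption.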
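Finally, the two-stage inference pipeline: YOLOv12 localizes every vehicle, each detection is cropped to strip background, and the crop goes to the trained recognition model. The sketch below assumes a YOLOv12 checkpoint usable through the `ultralytics` package (the checkpoint name "yolo12n.pt" is an assumption) and a `classify_crop` callable standing in for the trained TungVLM classifier, whose interface is not public.

```python
from PIL import Image
from ultralytics import YOLO  # assumes a YOLOv12-capable ultralytics install

def recognize_vehicles(image_path, classify_crop, conf=0.4):
    """Stage 1: detect vehicles; Stage 2: classify each cropped vehicle."""
    detector = YOLO("yolo12n.pt")  # hypothetical YOLOv12 checkpoint name
    image = Image.open(image_path).convert("RGB")
    records = []
    # COCO class ids 2/5/7 correspond to car, bus, and truck.
    for box in detector(image, conf=conf, classes=[2, 5, 7])[0].boxes:
        x1, y1, x2, y2 = (int(v) for v in box.xyxy[0].tolist())
        crop = image.crop((x1, y1, x2, y2))  # remove background clutter
        attrs = classify_crop(crop)  # -> {"make": ..., "model": ..., "color": ...}
        records.append({"bbox": [x1, y1, x2, y2], **attrs})
    return records  # one structured annotation entry per detected vehicle
```

Cropping before classification keeps the recognition model's input to a single vehicle, matching the abstract's point that background noise is removed before attribute prediction.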
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-114-2.pdf (Restricted Access) | 5.69 MB | Adobe PDF |
