Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/94622
| Title: | Combining Cross-Modal Models with Large Language Models for Human Action Recognition: The Case of the UCF-101 Dataset |
| Author: | Tung Hsu (許彤) |
| Advisor: | Chien-Kang Huang (黃乾綱) |
| Keywords: | Human Action Recognition, Prompt Engineering, Video Captioning Model, Large Language Model, Zero-Shot Recognition, Transfer Learning |
| Publication Year: | 2024 |
| Degree: | Master |
| Abstract: | Human Action Recognition (HAR) is a challenging task because it must handle complex dynamic scenes, temporal sequences, and visual information during video analysis. In recent years, recognition techniques have shifted from unimodal to cross-modal architectures, such as Vision-Language Models (VLMs). Previous research on and improvements to cross-modal recognition models typically involve incorporating vast amounts of training data and extensive finetuning to boost recognition performance. However, this process entails significant costs in training time, funding, and equipment, which may hinder deeper exploration in this field; moreover, generalizability across different tasks and datasets remains a key objective of current research. Reducing the training cost of cross-modal action recognition models while improving their generalizability is therefore a pressing issue. This study proposes a novel cross-modal architecture that aims to achieve generalizable human action recognition while effectively reducing the hidden costs of developing such models. The method combines a mature Video Captioning Model (VCM) with a Large Language Model (LLM) and, through prompt engineering techniques designed specifically for this task, builds an effective link between the core capabilities of the two models. Experimental results show that on the UCF-101 dataset the proposed architecture achieves a zero-shot recognition accuracy of 73.4%, outperforming some existing cross-modal human action recognition models and demonstrating its feasibility and development potential. In addition, this study conducts optimization analysis and validation of the proposed architecture: the optimization experiments confirm that lightweight training of the VCM via transfer learning can raise overall performance on specific tasks by 11.5% to 14%, which also supports the modular design as a convenient and effective basis for subsequent improvements. The outcomes of this study offer alternative perspectives and reference solutions for academic research and applications in image recognition, in the hope of exploring further possibilities in an era of rapid technological development. (A minimal code sketch illustrating this pipeline appears below this record.) |
| URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/94622 |
| DOI: | 10.6342/NTU202403933 |
| Full-Text Permission: | Authorized (restricted to on-campus access) |
| Electronic Full-Text Release Date: | 2026-07-15 |
| Appears in Collections: | Department of Engineering Science and Ocean Engineering |
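
The following Python sketch illustrates the pipeline described in the abstract: a video captioning model turns a clip into a free-form caption, a task-specific prompt asks a large language model to map that caption onto one of the 101 UCF-101 action labels, and zero-shot accuracy is computed over a labeled test split. The helper functions `generate_caption` and `query_llm`, the prompt wording, and the label-matching heuristic are assumptions for illustration only, not the implementation described in the thesis.

```python
# A minimal, illustrative sketch of the VCM + LLM zero-shot pipeline summarised in the
# abstract above. Helper functions, model choices, and prompt wording are assumptions
# made for illustration; they are not the thesis implementation.

UCF101_CLASSES = [
    "ApplyEyeMakeup", "Basketball", "PlayingGuitar", "YoYo",
    # ... the full UCF-101 label set contains 101 action classes
]

def generate_caption(video_path: str) -> str:
    """Placeholder for the Video Captioning Model (VCM): video -> free-form caption."""
    raise NotImplementedError("plug in a pretrained video captioning model here")

def query_llm(prompt: str) -> str:
    """Placeholder for the Large Language Model (LLM): prompt -> short text answer."""
    raise NotImplementedError("plug in an LLM API or a local language model here")

def build_prompt(caption: str, classes: list[str]) -> str:
    """Task-specific prompt engineering: ask the LLM to map a caption to one action label."""
    return (
        "You are an assistant for human action recognition.\n"
        f'Video caption: "{caption}"\n'
        f"Choose the single best matching action label from: {', '.join(classes)}.\n"
        "Answer with the label only."
    )

def classify_video(video_path: str, classes: list[str]) -> str:
    caption = generate_caption(video_path)                       # cross-modal step: video -> text
    answer = query_llm(build_prompt(caption, classes)).strip()   # language step: text -> label
    if answer in classes:
        return answer
    # Fallback: tolerate extra words in the LLM's answer by substring matching.
    return next((c for c in classes if c.lower() in answer.lower()), answer)

def zero_shot_accuracy(samples: list[tuple[str, str]], classes: list[str]) -> float:
    """samples: (video_path, ground_truth_label) pairs, e.g. from the UCF-101 test split."""
    correct = sum(classify_video(path, classes) == label for path, label in samples)
    return correct / len(samples)
```
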
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| ntu-112-2.pdf (restricted, not publicly available) | 3.32 MB | Adobe PDF | View/Open |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
