Please use this identifier to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99468

| Title: | 結合多視角特徵與指令調校視覺語言模型引導之情緒辨識方法 Context-Aware Emotion Recognition via Multi-View Instruction-Tuned Visual Language Guidance |
| Authors: | 胡家愷 Jia-Kai Hu |
| Advisor: | 周承復 Cheng-Fu Chou |
| Keywords: | Context-aware emotion recognition, Emotion understanding, Vision-Language Model, Instruction tuning, Data imbalance, Multi-label classification |
| Publication Year: | 2025 |
| Degree: | Master's |
| Abstract: | Context-aware emotion recognition is a challenging task that requires understanding various contextual cues—such as background, facial expressions, body posture, object relations, and human interactions—to infer a target individual's emotional state. While prior studies have primarily focused on visual cue extraction from images, recent advances in Vision-Language Models (VLMs) provide new opportunities to capture the semantic meanings of emotions through their rich understanding of both vision and language. In this work, we propose a two-stage framework that leverages the capabilities of VLMs. In the first stage, we fine-tune a VLM using a carefully designed instruction-tuning dataset to enhance its capacity for emotion understanding. To construct the instruction-tuning dataset, we utilize a large VLM with in-context learning and few-shot prompting to generate emotion data, while also addressing the issue of data imbalance. In the second stage, we extract the emotion-relevant representations from the fine-tuned VLM and train a lightweight classifier to perform emotion classification. Experimental results demonstrate that our method achieves competitive performance using only visual features from the scene, body, and face. Furthermore, instruction tuning significantly improves the model's emotional understanding, thereby enhancing downstream classification performance. |
| URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99468 |
| DOI: | 10.6342/NTU202501696 |
| Fulltext Rights: | Authorized (restricted to campus access) |
| Embargo Lift Date: | 2030-08-04 |
| Appears in Collections: | 資訊工程學系 (Department of Computer Science and Information Engineering) |
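The abstract's first stage builds an instruction-tuning dataset by prompting a large VLM with a few labeled examples (in-context learning) and then correcting label imbalance. The thesis does not publish its prompts or sampling code; the sketch below is a hypothetical illustration of both ideas, with the prompt wording, the `caption`/`labels` record fields, and the rarest-label bucketing heuristic all being assumptions of this sketch rather than the author's method.

```python
import random
from collections import Counter

def build_fewshot_prompt(exemplars, query_caption):
    """Assemble an in-context-learning prompt: a short instruction,
    a few labeled scene/emotion exemplars, then the unlabeled query."""
    lines = ["You are an emotion annotator. Label each scene with the "
             "emotions the target person shows."]
    for ex in exemplars:
        lines.append(f"Scene: {ex['caption']}\nEmotions: {', '.join(ex['labels'])}")
    lines.append(f"Scene: {query_caption}\nEmotions:")
    return "\n\n".join(lines)

def balance_by_rarest_label(samples, rng=random.Random(0)):
    """Oversample instruction pairs so emotion labels appear more evenly.
    Each sample is multi-label, so we bucket it under its rarest label
    and oversample every bucket up to the largest bucket's size."""
    counts = Counter(label for s in samples for label in s["labels"])
    buckets = {}
    for s in samples:
        rarest = min(s["labels"], key=lambda label: counts[label])
        buckets.setdefault(rarest, []).append(s)
    target = max(len(b) for b in buckets.values())
    balanced = []
    for b in buckets.values():
        balanced.extend(b)
        balanced.extend(rng.choices(b, k=target - len(b)))
    return balanced
```

In practice the prompt string would be sent to a large VLM together with the query image, and its completion parsed into labels; the oversampling step then equalizes how often each emotion appears in the fine-tuning set.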
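The second stage trains a lightweight classifier on scene, body, and face features extracted from the fine-tuned VLM, for a multi-label task with imbalanced labels. As a minimal stdlib-only sketch (not the thesis's implementation), the head below is a linear layer with one sigmoid per emotion over the concatenated features, trained against a per-label weighted binary cross-entropy; the `pos_weights` mechanism is one common way to counter rare labels and is an assumption here, not a detail confirmed by the abstract.

```python
import math

def sigmoid(z):
    """Logistic function mapping a score to a (0, 1) probability."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(scene, body, face, weights, biases):
    """Lightweight linear head over concatenated feature vectors:
    one independent sigmoid score per emotion label (multi-label)."""
    x = scene + body + face  # concatenate the three feature lists
    return [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            for w, b in zip(weights, biases)]

def weighted_bce(probs, targets, pos_weights):
    """Binary cross-entropy averaged over labels, with a positive-class
    weight per label so rare emotions contribute larger gradients."""
    eps = 1e-9
    total = 0.0
    for p, t, w in zip(probs, targets, pos_weights):
        total += -(w * t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))
    return total / len(probs)
```

With all weights and biases at zero every label scores 0.5, and an unweighted positive target costs ln 2 ≈ 0.693; raising a rare label's `pos_weight` scales its miss penalty, which is the imbalance lever this sketch illustrates.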
Files in This Item:
| File | Size | Format | Access |
|---|---|---|---|
| ntu-113-2.pdf | 20.86 MB | Adobe PDF | Restricted |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
