Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99468
Full metadata record (DC field: value [language])
dc.contributor.advisor: 周承復 [zh_TW]
dc.contributor.advisor: Cheng-Fu Chou [en]
dc.contributor.author: 胡家愷 [zh_TW]
dc.contributor.author: Jia-Kai Hu [en]
dc.date.accessioned: 2025-09-10T16:22:53Z
dc.date.available: 2025-09-11
dc.date.copyright: 2025-09-10
dc.date.issued: 2025
dc.date.submitted: 2025-07-30
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99468
dc.description.abstract [zh_TW]: Context-aware emotion recognition is a challenging task that requires understanding multiple contextual cues to infer a target person's emotional state. Prior work has focused mainly on extracting emotional cues from images, such as the background, facial features, body language, object relations, and interactions between people. In recent years, however, Vision-Language Models (VLMs) have advanced rapidly; by combining language and visual understanding, they capture the expression and semantics of emotion more deeply and open new possibilities for this task.
This study proposes a two-stage approach that combines a VLM with a classification model. In the first stage, we fine-tune the VLM on a purpose-built emotion instruction-tuning dataset to strengthen its understanding of emotion. To construct this dataset, we use a large vision-language model with in-context learning and few-shot prompting to generate the data, and we explicitly handle data imbalance. In the second stage, we extract features from the fine-tuned VLM and train a lightweight classification model for emotion classification.
Experimental results show that our method achieves competitive performance using only three kinds of visual features: scene, body, and face. Moreover, instruction tuning effectively strengthens the model's understanding of emotion and further improves downstream classification performance.
dc.description.abstract [en]: Context-aware emotion recognition is a challenging task that requires understanding various contextual cues (such as background, facial expressions, body posture, object relations, and human interactions) to infer a target individual's emotional state. While prior studies have primarily focused on visual cue extraction from images, recent advances in Vision-Language Models (VLMs) provide new opportunities to capture the semantic meanings of emotions through their rich understanding of both vision and language.
In this work, we propose a two-stage framework that leverages the capabilities of VLMs. In the first stage, we fine-tune a VLM using a carefully designed instruction-tuning dataset to enhance its capacity for emotion understanding. To construct the instruction-tuning dataset, we utilize a large VLM with in-context learning and few-shot prompting to generate emotion data, while also addressing the issue of data imbalance. In the second stage, we extract emotion-relevant representations from the fine-tuned VLM and train a lightweight classifier to perform emotion classification.
Experimental results demonstrate that our method achieves competitive performance using only visual features from the scene, body, and face. Furthermore, instruction tuning significantly improves the model's emotional understanding, thereby enhancing downstream classification performance.
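For illustration only, the sketch below (not drawn from the thesis itself) shows one minimal way the second stage described in the abstract could look: a lightweight classifier trained on features already extracted from the fine-tuned VLM for the scene, body, and face views, with an asymmetric loss for the multi-label setting. All module names, dimensions, and hyperparameters are hypothetical.

# Minimal illustrative sketch (hypothetical names and dimensions), assuming
# scene/body/face features have already been extracted from the fine-tuned VLM.
import torch
import torch.nn as nn

class LightweightEmotionClassifier(nn.Module):
    def __init__(self, feat_dim=4096, num_labels=26):
        super().__init__()
        # Concatenate the three view features and map them to label logits.
        self.mlp = nn.Sequential(
            nn.Linear(3 * feat_dim, 1024),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(1024, num_labels),
        )

    def forward(self, scene_feat, body_feat, face_feat):
        fused = torch.cat([scene_feat, body_feat, face_feat], dim=-1)
        return self.mlp(fused)  # raw logits, one per emotion label

def asymmetric_loss(logits, targets, gamma_pos=0.0, gamma_neg=4.0, clip=0.05):
    # Asymmetric loss for multi-label classification: easy negatives are
    # down-weighted more aggressively than positives, and negative
    # probabilities are shifted by a small margin before weighting.
    probs = torch.sigmoid(logits)
    probs_neg = (probs - clip).clamp(min=0.0)
    loss_pos = targets * (1.0 - probs).pow(gamma_pos) * torch.log(probs.clamp(min=1e-8))
    loss_neg = (1.0 - targets) * probs_neg.pow(gamma_neg) * torch.log((1.0 - probs_neg).clamp(min=1e-8))
    return -(loss_pos + loss_neg).mean()

In the two-stage framework described above, a classifier of this kind would be trained on cached VLM features while the VLM itself is adapted only in the first stage.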
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-09-10T16:22:53Z. No. of bitstreams: 0 [en]
dc.description.provenance: Made available in DSpace on 2025-09-10T16:22:53Z (GMT). No. of bitstreams: 0 [en]
dc.description.tableofcontents:
Verification Letter from the Oral Examination Committee i
Acknowledgements iii
摘要 (Chinese abstract) v
Abstract vii
Contents ix
List of Figures xiii
List of Tables xv
Chapter 1 Introduction 1
Chapter 2 Related Work 5
2.1 Context-Aware Emotion Recognition (CAER) 5
2.2 Vision Language Models (VLMs) 6
2.2.1 Visual Language Model in CAER 8
2.3 Instruction Tuning for Multi-Task VLM 9
2.4 Data Imbalance Handling 9
2.4.1 Data Sampling 10
2.4.2 Loss Function 10
Chapter 3 Dataset 13
3.1 Context-Aware Emotion Datasets 13
3.1.1 EMOTIC 14
3.1.1.1 Data Distribution 15
3.1.2 CAER-S 15
3.1.3 HECO 16
3.1.3.1 Data Distribution 17
3.2 QA Dataset Preparation 18
3.2.1 Task Formulation and Prompt Design 18
3.2.2 Data Balancing Strategy 19
3.2.2.1 Minor-Label QA Augmentation 19
3.2.2.2 Major-Label Down Sampling 20
3.2.3 Instruction Tuning Dataset Generation 20
3.3 Dataset Statistics after Processing 22
3.3.1 Label Distribution Before and After Balancing 22
3.3.2 Instruction-Tuning Task Composition 24
Chapter 4 Method 27
4.1 Overview of Two-Stage Architecture 28
4.2 Stage 1: Person-centric Emotion VLM 29
4.2.1 Model Architecture 30
4.3 Stage 2: Emotion Classifier Training 31
4.4 Loss Function 32
4.4.1 Stage 1: Instruction-Tuning Objective 32
4.4.2 Stage 2: Classification Objective 33
4.4.2.1 Multi-Label Classification: Asymmetric Loss 33
4.4.2.2 Multi-Class Classification: Cross-Entropy Loss 34
Chapter 5 Experiments 35
5.1 Experiment Settings 35
5.2 Evaluation Metrics 36
5.3 Classification Results 38
5.4 Qualitative Results 42
5.5 Ablation Study 42
5.5.1 Effect of Visual Feature Composition 42
5.5.2 Effect of Instruction Tuning Strategy 43
5.5.3 Effect of Data Balancing Strategy 43
5.5.4 Effect of Self-Attention Mechanism 44
5.5.5 Effect of Loss Function 44
5.5.6 Effect of Incremental VLM Enhancements 45
Chapter 6 Conclusion 47
References 49
Appendix A — Dataset Construction 57
A.1 Pose Description Task 57
A.2 Situation Description Task 58
A.3 Rationale Task 60
Appendix B — Qualitative Result 63
B.1 More Qualitative Results 63
dc.language.iso: en
dc.subject: 情緒理解 (Emotion understanding) [zh_TW]
dc.subject: 資料不平衡 (Data imbalance) [zh_TW]
dc.subject: 多標籤分類 (Multi-label classification) [zh_TW]
dc.subject: 情境感知情緒識別 (Context-aware emotion recognition) [zh_TW]
dc.subject: 視覺語言模型 (Vision-Language Model) [zh_TW]
dc.subject: 指令式微調 (Instruction tuning) [zh_TW]
dc.subject: Context-aware emotion recognition [en]
dc.subject: Vision-Language Model [en]
dc.subject: Emotion understanding [en]
dc.subject: Multi-label classification [en]
dc.subject: Data imbalance [en]
dc.subject: Instruction tuning [en]
dc.title: 結合多視角特徵與指令調校視覺語言模型引導之情緒辨識方法 [zh_TW]
dc.title: Context-Aware Emotion Recognition via Multi-View Instruction-Tuned Visual Language Guidance [en]
dc.type: Thesis
dc.date.schoolyear: 113-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 吳曉光;呂政修;蔡瑞煌;陳駿丞 [zh_TW]
dc.contributor.oralexamcommittee: Hsiao-Kuang Wu;Jenq-Shiou Leu;Rua-Huan Tsaih;Jun-Cheng Chen [en]
dc.subject.keyword: 情境感知情緒識別, 情緒理解, 視覺語言模型, 指令式微調, 資料不平衡, 多標籤分類 [zh_TW]
dc.subject.keyword: Context-aware emotion recognition, Emotion understanding, Vision-Language Model, Instruction tuning, Data imbalance, Multi-label classification [en]
dc.relation.page: 65
dc.identifier.doi: 10.6342/NTU202501696
dc.rights.note: 同意授權(限校園內公開) (access authorized, restricted to campus network)
dc.date.accepted: 2025-07-31
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 資訊工程學系 (Department of Computer Science and Information Engineering)
dc.date.embargo-lift: 2030-08-04
Appears in collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in this item:
ntu-113-2.pdf: 20.86 MB, Adobe PDF, access restricted (未授權公開取用)

