Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/95717
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 莊永裕 | zh_TW |
dc.contributor.advisor | Yung-Yu Chuang | en |
dc.contributor.author | 林菩提 | zh_TW |
dc.contributor.author | Pu-Ti Lin | en |
dc.date.accessioned | 2024-09-15T16:57:21Z | - |
dc.date.available | 2024-09-16 | - |
dc.date.copyright | 2024-09-15 | - |
dc.date.issued | 2024 | - |
dc.date.submitted | 2024-08-08 | - |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/95717 | - |
dc.description.abstract | Emotional Image Editing (EIE) 是通過圖片編輯來使圖片產生所需的情緒。為了便於使用者操作,使用者只需提供一張圖片和所需的情緒。這是一個尚未有太多研究的領域,現有方法受限於缺乏優質的資料集,以及無法參考使用者提供的圖片來決定編輯位置。本篇論文提出了一個基於Transformer和多模態大型語言模型(MLLM)的最先進的視覺情感分析(VEA)模型,用來建立一個包含能改變圖片情緒的指令的資料集。我們設計了EEdit,一個能進行情感圖片編輯的雙階段模型,由指令生成和圖片編輯兩部分組成。我們的模型在使用者心理學實驗中取得了最先進的結果。 | zh_TW |
dc.description.abstract | Emotional Image Editing (EIE) involves modifying an image so that the edited result evokes a desired emotion. To facilitate user interaction, users only need to provide an image and the desired emotion. This task is relatively novel and underexplored, and existing approaches are limited by inadequate datasets and by their inability to reference the user-provided image when determining edit locations. In this paper, we propose two models based on transformer and Multimodal Large Language Model (MLLM) architectures, the Multi-Branch Emotional Analysis Transformer (MEAT) and Emotion Question and Answer (EmotionQA), to create a dataset of instructions capable of altering image emotions. The transformer-based MEAT serves as a state-of-the-art Visual Emotion Analysis (VEA) model, while the MLLM-based EmotionQA aids in emotional image editing. We also introduce EEdit, a two-stage model for emotional image editing comprising an instruction generation model and an image editing model. Our proposed model achieves state-of-the-art results in user psychological experiments. | en |
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-09-15T16:57:21Z No. of bitstreams: 0 | en |
dc.description.provenance | Made available in DSpace on 2024-09-15T16:57:21Z (GMT). No. of bitstreams: 0 | en |
dc.description.tableofcontents | Verification Letter from the Oral Examination Committee i
摘要 ii
Abstract iii
Contents iv
List of Figures vi
List of Tables ix
Chapter 1. Introduction 1
Chapter 2. Related Work 5
2.1 Visual Emotion Analysis 5
2.2 Instruction-based Image Editing 6
2.3 Emotional Image Generation and Editing 7
2.4 Multimodal large language models (MLLMs) 8
Chapter 3. Method 9
3.1 Dataset 9
3.2 Visual analysis model 10
3.3 Instruction generation and Instruction-based image editing 13
Chapter 4. Result 16
4.1 Evaluation Metrics 16
4.2 Comparisons 18
Chapter 5. Conclusions 26
References 32 | - |
dc.language.iso | en | - |
dc.title | 基於多模態大型語言模型的情感圖片編輯 | zh_TW |
dc.title | EEdit: Emotional Image Editing with Multimodal Large Language Model on a Single GPU | en |
dc.type | Thesis | - |
dc.date.schoolyear | 112-2 | - |
dc.description.degree | Master's | - |
dc.contributor.oralexamcommittee | 林宏祥;吳昱霆 | zh_TW |
dc.contributor.oralexamcommittee | Hong-Shiang Lin;Yu-Ting Wu | en |
dc.subject.keyword | 情感圖片編輯,視覺情感分析,多模態大型語言模型, | zh_TW |
dc.subject.keyword | Emotional Image Editing, Visual Emotion Analysis, Multimodal Large Language Model | en |
dc.relation.page | 35 | - |
dc.identifier.doi | 10.6342/NTU202402633 | - |
dc.rights.note | Authorization granted (open access worldwide) | - |
dc.date.accepted | 2024-08-10 | - |
dc.contributor.author-college | 電機資訊學院 | - |
dc.contributor.author-dept | 資訊網路與多媒體研究所 | - |
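The abstract above describes EEdit as a two-stage pipeline: an MLLM first turns the input image and the target emotion into an editing instruction, and an instruction-based editor then applies that instruction. The sketch below is a minimal, hypothetical illustration of that flow, not the authors' released code: `generate_instruction` is a hand-written stand-in for the EmotionQA stage, a public InstructPix2Pix checkpoint (`timbrooks/instruct-pix2pix` via the `diffusers` library) stands in for the editing stage, and the file paths are placeholders.

```python
# Hypothetical sketch of the two-stage EEdit flow described in the abstract.
# Stage 1 is stubbed out (in the thesis it is an MLLM, EmotionQA); stage 2
# uses a public InstructPix2Pix checkpoint as a stand-in instruction-based editor.
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline


def generate_instruction(image: Image.Image, target_emotion: str) -> str:
    """Stage 1 placeholder: produce an editing instruction for the target
    emotion. In EEdit this instruction would come from an MLLM that also
    looks at the input image to decide what and where to edit."""
    return f"edit the scene so it feels {target_emotion}, e.g. add warm golden sunlight"


def edit_image(image: Image.Image, instruction: str) -> Image.Image:
    """Stage 2: apply the instruction with an instruction-based image editor."""
    pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
        "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
    ).to("cuda")
    result = pipe(
        prompt=instruction,
        image=image,
        num_inference_steps=20,
        image_guidance_scale=1.5,  # how closely to stay to the input image
        guidance_scale=7.5,        # how strongly to follow the instruction
    )
    return result.images[0]


if __name__ == "__main__":
    source = Image.open("input.jpg").convert("RGB")      # placeholder path
    instruction = generate_instruction(source, "contentment")
    edit_image(source, instruction).save("edited.jpg")   # placeholder path
```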
Appears in Collections: | 資訊網路與多媒體研究所
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-112-2.pdf | 32.92 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.