Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/95626

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 陳祝嵩 | zh_TW |
| dc.contributor.advisor | Chu-Song Chen | en |
| dc.contributor.author | 謝欣玉 | zh_TW |
| dc.contributor.author | Hsin-Yu Hsieh | en |
| dc.date.accessioned | 2024-09-15T16:10:55Z | - |
| dc.date.available | 2024-09-16 | - |
| dc.date.copyright | 2024-09-14 | - |
| dc.date.issued | 2024 | - |
| dc.date.submitted | 2024-08-13 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/95626 | - |
| dc.description.abstract | 在本研究中,我們設計了一個自動化流程生成高品質的個性化影像編輯資料集,並將一個多模態大型語言模型微調於此資料集上,得到了歷史上第一個可以進行個性化影像編輯任務的大型語言模型:SEED-PIE,我們的方法在個性化影像編輯的推論速度打敗過去所有的方法,提昇了將近 10 倍的速度,此外,我們的模型無須對新個性化主體進行新一輪的個性化訓練,而是能以新個性化主體之參考圖片直接進行個性化影像編輯任務(零樣本學習)。任何使用者都能簡易地使用SEED-PIE 模型以高速地進行個性化影像編輯任務,我們的模型在公開數據集:DreamEditBench 上達到不俗的表現,顯示我們的模型所生成的圖片能忠於參考圖片中的個性化主體並與來源圖片中的背景保持一致性。 | zh_TW |
| dc.description.abstract | In this study, we design an automated pipeline to generate a high-quality personalized image editing dataset. We then finetune a multimodal large language model on this dataset, resulting in SEED-PIE, the first large language model capable of performing personalized image editing tasks. Our method is more computationally efficient than previous approaches, achieving nearly a tenfold improvement in inference speed. A further advantage of our method is that it eliminates the need for additional personalization training for new subjects, enabling personalized image editing directly from reference images of new subjects (zero-shot learning). SEED-PIE allows any user to easily perform high-speed personalized image editing tasks. Our model demonstrates satisfactory performance on the public benchmark DreamEditBench, indicating its ability to generate images that remain faithful to the personalized subject in the reference image while maintaining consistency with the background of the source image. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-09-15T16:10:55Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2024-09-15T16:10:55Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Acknowledgements i
摘要 ii
Abstract iii
Contents iv
List of Figures vi
List of Tables ix
Chapter 1 Introduction 1
Chapter 2 Related work 8
2.1 Text-to-Image Generation 8
2.2 Personalized Image Generation 9
2.3 Text-guided Image Editing 9
2.4 Personalized Image Editing 10
2.5 Multimodal LLM based Image Generation 10
Chapter 3 Method 12
3.1 Overview of our approach 12
3.1.1 BLIP-Diffusion 12
3.1.2 SEED-X 14
3.2 Our approach: SEED-PIE 17
3.2.1 Automated Pipeline for Generating PIE Data 17
3.2.2 Instruction Tuning on Personalized Image Editing 19
Chapter 4 Experiments 20
4.1 Instruction Tuning 20
4.2 Time Comparison 21
4.3 Quantitative Result 22
4.4 Qualitative Result 24
Chapter 5 Conclusion 25
References 26 | - |
| dc.language.iso | en | - |
| dc.subject | 深度學習 | zh_TW |
| dc.subject | 多模態大型語言模型 | zh_TW |
| dc.subject | 影像編輯 | zh_TW |
| dc.subject | 個性化內容生成 | zh_TW |
| dc.subject | Deep Learning | en |
| dc.subject | Personalization | en |
| dc.subject | Image Editing | en |
| dc.subject | Multimodal Large Language Model | en |
| dc.title | 基於多模態大型語言模型之個性化影像編輯技術 | zh_TW |
| dc.title | Personalized Image Editing Based on Multimodal Large Language Model | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 112-2 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.oralexamcommittee | 楊惠芳;劉耿豪 | zh_TW |
| dc.contributor.oralexamcommittee | Huei-Fang Yang;Keng-Hao Liu | en |
| dc.subject.keyword | 個性化內容生成,影像編輯,多模態大型語言模型,深度學習 | zh_TW |
| dc.subject.keyword | Personalization,Image Editing,Multimodal Large Language Model,Deep Learning | en |
| dc.relation.page | 31 | - |
| dc.identifier.doi | 10.6342/NTU202404267 | - |
| dc.rights.note | 同意授權(全球公開) | - |
| dc.date.accepted | 2024-08-13 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 資訊工程學系 | - |
| dc.date.embargo-lift | 2029-08-13 | - |
| Appears in Collections: | 資訊工程學系 | |
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-112-2.pdf (embargoed until 2029-08-13) | 2.53 MB | Adobe PDF |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.