Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/100916

Full metadata record (DC field [language]: value)
dc.contributor.advisor [zh_TW]: 許永真
dc.contributor.advisor [en]: Jane Yung-Jen Hsu
dc.contributor.author [zh_TW]: 陳韋傑
dc.contributor.author [en]: Wei-Jie Chen
dc.date.accessioned: 2025-11-12T16:04:46Z
dc.date.available: 2025-11-13
dc.date.copyright: 2025-11-12
dc.date.issued: 2025
dc.date.submitted: 2025-08-11
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/100916
dc.description.abstract [zh_TW]:
在設計嶄新的視覺概念時,設計師經常從既有的想法中汲取靈感,藉由重新組合各種元素,以創造出獨特且原創的作品。隨著文字轉圖像(Text-to-Image, T2I)模型的快速發展,機器現已能夠協助此一創作過程,特別是在分解複雜視覺概念並將其與現有概念重新組合方面。然而,現有的視覺概念分解方法通常依賴於多樣化的輸入圖像,當輸入圖像在視覺上過於相似時,這些方法往往難以將物體從顯著的背景中分離出來,導致所產出的結果難以理解或應用。

本研究揭示了被分解的視覺子概念與擴散模型中 U-Net 的交叉注意力圖(cross-attention maps)之間的強烈關聯。基於此觀察,我們提出了一種新方法:AGTree。該方法使用交叉注意力圖作為內生遮罩,藉此有效抑制背景雜訊,並在訓練過程中引入隨機丟棄機制(random drop),以提升語意上的關聯性。此外,我們擴展了現有的評估指標,使其能更全面地評估模型表現。我們的方法在基於 CLIP 的量化指標上提升了 8.62%,而質化分析也證實了其在將背景資訊從學習到的表徵中解耦的有效性。原始程式碼可於以下網址取得:https://github.com/JackChen890311/AGTree。
dc.description.abstract [en]:
When designing novel visual concepts, designers often draw inspiration from existing ideas, recombining elements to create something unique and original. With the rapid advancement of text-to-image (T2I) models, machines can now assist in this creative process, particularly in decomposing complex visual concepts and recombining them with existing ones. However, current visual concept decomposition methods typically rely on diverse input images; when the inputs are visually similar, they struggle to isolate objects from prominent backgrounds, often producing outputs that are hard to interpret or apply.

In this work, we reveal a strong correlation between decomposed visual subconcepts and cross-attention maps within the diffusion U-Net. Building on this insight, we propose AGTree, a novel method that uses cross-attention maps as intrinsic masks to suppress background noise and incorporates random dropout during training to further enhance semantic relevance. Additionally, we extend the existing evaluation metric to provide a more comprehensive assessment of model performance. Quantitative results show that our method achieves an 8.62% improvement on a CLIP-based metric, while qualitative analyses demonstrate its effectiveness in disentangling background information from the learned representations. Code is available at: https://github.com/JackChen890311/AGTree.
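The abstract above outlines the core mechanism: cross-attention maps from the diffusion U-Net are turned into masks that suppress background regions during training. The sketch below is a minimal, hypothetical illustration of that idea, not the AGTree implementation (see the GitHub repository linked in the abstract for the actual code). It assumes a per-token attention map has already been pooled to a 2D array, smooths it with an exponential moving average, binarizes it with Otsu's threshold, and restricts a mean-squared reconstruction loss to the masked foreground. Function names, the EMA decay value, and the numpy/scikit-image dependencies are assumptions made for illustration; the random-drop component is omitted because the abstract does not specify what is dropped.

```python
# Illustrative sketch only -- not the AGTree implementation.
# Assumes a per-token cross-attention map has already been pooled to an (H, W) array.
import numpy as np
from skimage.filters import threshold_otsu  # Otsu's method, as referenced in the thesis outline


def ema_update(running_map, new_map, decay=0.9):
    """Exponential moving average of attention maps across training steps (decay value is assumed)."""
    if running_map is None:
        return new_map
    return decay * running_map + (1.0 - decay) * new_map


def attention_to_mask(attn_map):
    """Binarize an (H, W) cross-attention map into a {0, 1} foreground mask via Otsu's threshold."""
    threshold = threshold_otsu(attn_map)
    return (attn_map > threshold).astype(np.float32)


def masked_reconstruction_loss(noise_pred, noise_target, mask):
    """Mean squared error restricted to the masked foreground.

    noise_pred, noise_target: (B, C, H, W) arrays; mask: (H, W) binary array.
    """
    m = np.broadcast_to(mask[None, None], noise_pred.shape)
    squared_error = (noise_pred - noise_target) ** 2 * m
    return squared_error.sum() / max(m.sum(), 1e-8)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    attn = rng.random((64, 64))          # stand-in for a pooled cross-attention map
    smoothed = ema_update(None, attn)    # first step: EMA initialized from the current map
    mask = attention_to_mask(smoothed)
    loss = masked_reconstruction_loss(rng.normal(size=(1, 4, 64, 64)),
                                      rng.normal(size=(1, 4, 64, 64)),
                                      mask)
    print(mask.mean(), loss)
```

In practice the attention maps would be collected from the U-Net's cross-attention layers at each training step; the random data in the usage stub only exercises the masking and loss computation end to end.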
dc.description.provenance [en]: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-11-12T16:04:46Z. No. of bitstreams: 0
dc.description.provenance [en]: Made available in DSpace on 2025-11-12T16:04:46Z (GMT). No. of bitstreams: 0
dc.description.tableofcontents:
Acknowledgements i
摘要 ii
Abstract iii
Contents v
List of Figures viii
List of Tables x

Chapter 1 Introduction 1
1.1 Background 1
1.2 Motivation 2
1.3 Proposed Method 3
1.4 Thesis Organization 4

Chapter 2 Related Work 5
2.1 Diffusion Models 5
2.2 Personalized Generation 6
2.3 Concept Learning 7
2.4 Concept Decomposition 8
2.5 Style-Content Decomposition 8
2.6 Latent Diffusion Models 9
2.7 Attention Maps in LDMs 11

Chapter 3 Problem Definition 12
3.1 Notations 12
3.2 Visual Concept Decomposition 13
3.3 Research Question 14
3.4 Optimization Objective 15

Chapter 4 Method 17
4.1 Attention-based Mask 18
4.1.1 Attention Maps 18
4.1.2 Exponential Moving Average (EMA) 19
4.1.3 Otsu Binarization 20
4.2 Random Drop 21
4.3 Masked Reconstruction 21
4.4 Hyperparameter Settings 22

Chapter 5 Experiments 24
5.1 Dataset 24
5.2 Evaluation Metrics 25
5.3 Baseline 27
5.4 Quantitative Results 28
5.5 Qualitative Results 29
5.6 Background Leakage Experiments 31
5.7 Ablation Studies 35

Chapter 6 Conclusion 38

References 39

Appendix A — Implementation Details 45
A.1 Related Settings 45
A.2 Text Prompt Template 45
A.3 Prompt for Conceptual Tagging 47

Appendix B — Datasets 48
B.1 The AGTree Dataset A 48
B.2 The AGTree Dataset B 49
B.3 The Background Dataset 50

Appendix C — Supplementary Experiments 51
C.1 The AGTree Dataset B 51
C.2 The Consistency Score 52
C.3 Complete Qualitative Results 53
dc.language.iso: en
dc.subject: 注意力機制
dc.subject: 擴散模型
dc.subject: 生成式人工智慧
dc.subject: 視覺概念
dc.subject: 概念分解
dc.subject: Attention Mechanism
dc.subject: Diffusion Model
dc.subject: Generative AI
dc.subject: Visual Concept
dc.subject: Concept Decomposition
dc.title [zh_TW]: AGTree:基於注意力引導的視覺概念分解
dc.title [en]: AGTree: Attention Guided Visual Concept Decomposition
dc.type: Thesis
dc.date.schoolyear: 113-2
dc.description.degree: 碩士
dc.contributor.coadvisor [zh_TW]: 鄭文皇
dc.contributor.coadvisor [en]: Wen-Huang Cheng
dc.contributor.oralexamcommittee [zh_TW]: 楊智淵;吳家麟;陳駿丞
dc.contributor.oralexamcommittee [en]: Chih-Yuan Yang;Ja-Ling Wu;Jun-Cheng Chen
dc.subject.keyword [zh_TW]: 注意力機制,擴散模型,生成式人工智慧,視覺概念,概念分解
dc.subject.keyword [en]: Attention Mechanism,Diffusion Model,Generative AI,Visual Concept,Concept Decomposition
dc.relation.page: 59
dc.identifier.doi: 10.6342/NTU202502288
dc.rights.note: 同意授權(限校園內公開)
dc.date.accepted: 2025-08-13
dc.contributor.author-college: 電機資訊學院
dc.contributor.author-dept: 資訊工程學系
dc.date.embargo-lift: 2030-07-22
Appears in Collections: 資訊工程學系

Files in This Item:
File: ntu-113-2.pdf (not authorized for public access)
Size: 30.24 MB
Format: Adobe PDF

