Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/100916

Full metadata record

| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 許永真 | zh_TW |
| dc.contributor.advisor | Jane Yung-Jen Hsu | en |
| dc.contributor.author | 陳韋傑 | zh_TW |
| dc.contributor.author | Wei-Jie Chen | en |
| dc.date.accessioned | 2025-11-12T16:04:46Z | - |
| dc.date.available | 2025-11-13 | - |
| dc.date.copyright | 2025-11-12 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-08-11 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/100916 | - |
| dc.description.abstract | 在設計嶄新的視覺概念時,設計師經常從既有的想法中汲取靈感,藉由重新組合各種元素,以創造出獨特且原創的作品。隨著文字轉圖像(Text-to-Image, T2I)模型的快速發展,機器現已能夠協助此一創作過程,特別是在分解複雜視覺概念並將其與現有概念重新組合方面。然而,現有的視覺概念分解方法通常依賴於多樣化的輸入圖像,當輸入圖像在視覺上過於相似時,這些方法往往難以將物體從顯著的背景中分離出來,導致所產出的結果難以理解或應用。
本研究揭示了被分解的視覺子概念與擴散模型中 U-Net 的交叉注意力圖(cross-attention maps)之間的強烈關聯。基於此觀察,我們提出了一種新方法:AGTree。該方法使用交叉注意力圖作為內生遮罩,藉此有效抑制背景雜訊,並在訓練過程中引入隨機丟棄機制(random drop),以提升語意上的關聯性。此外,我們擴展了現有的評估指標,使其能更全面地評估模型表現。我們的方法在基於 CLIP 的量化指標上提升了 8.62%,而質化分析也證實了其在將背景資訊從學習到的表徵中解耦的有效性。原始程式碼可於以下網址取得:https://github.com/JackChen890311/AGTree。 | zh_TW |
| dc.description.abstract | When designing novel visual concepts, designers often draw inspiration from existing ideas, recombining elements to create something unique and original. With the rapid advancement of text-to-image (T2I) models, machines can now assist in this creative process, particularly in decomposing complex visual concepts and recombining them with existing ones. However, current decomposition methods for visual concepts typically rely on diverse input images. When the inputs are visually similar, these methods struggle to isolate objects from prominent backgrounds, often resulting in outputs that are hard to interpret or apply.
In this work, we reveal a strong correlation between decomposed visual subconcepts and cross-attention maps within the diffusion U-Net. Building on this insight, we propose AGTree, a novel method that uses cross-attention maps as intrinsic masks to effectively suppress background noise and incorporates a random drop mechanism during training to further enhance semantic relevance. Additionally, we extend the existing evaluation metric to provide a more comprehensive assessment of model performance. Quantitative results show that our method achieves an 8.62% improvement on a CLIP-based metric, while qualitative analyses demonstrate its effectiveness in disentangling background information from the learned representations. Code is available at: https://github.com/JackChen890311/AGTree. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-11-12T16:04:46Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-11-12T16:04:46Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Acknowledgements i 摘要 ii Abstract iii Contents v List of Figures viii List of Tables x Chapter 1 Introduction 1 1.1 Background 1 1.2 Motivation 2 1.3 Proposed Method 3 1.4 Thesis Organization 4 Chapter 2 Related Work 5 2.1 Diffusion Models 5 2.2 Personalized Generation 6 2.3 Concept Learning 7 2.4 Concept Decomposition 8 2.5 Style-Content Decomposition 8 2.6 Latent Diffusion Models 9 2.7 Attention Maps in LDMs 11 Chapter 3 Problem Definition 12 3.1 Notations 12 3.2 Visual Concept Decomposition 13 3.3 Research Question 14 3.4 Optimization Objective 15 Chapter 4 Method 17 4.1 Attention-based Mask 18 4.1.1 Attention Maps 18 4.1.2 Exponential Moving Average (EMA) 19 4.1.3 Otsu Binarization 20 4.2 Random Drop 21 4.3 Masked Reconstruction 21 4.4 Hyperparameter Settings 22 Chapter 5 Experiments 24 5.1 Dataset 24 5.2 Evaluation Metrics 25 5.3 Baseline 27 5.4 Quantitative Results 28 5.5 Qualitative Results 29 5.6 Background Leakage Experiments 31 5.7 Ablation Studies 35 Chapter 6 Conclusion 38 References 39 Appendix A — Implementation Details 45 A.1 Related Settings 45 A.2 Text Prompt Template 45 A.3 Prompt for Conceptual Tagging 47 Appendix B — Datasets 48 B.1 The AGTree Dataset A 48 B.2 The AGTree Dataset B 49 B.3 The Background Dataset 50 Appendix C — Supplementary Experiments 51 C.1 The AGTree Dataset B 51 C.2 The Consistency Score 52 C.3 Complete Qualitative Results 53 | - |
| dc.language.iso | en | - |
| dc.subject | 注意力機制 | - |
| dc.subject | 擴散模型 | - |
| dc.subject | 生成式人工智慧 | - |
| dc.subject | 視覺概念 | - |
| dc.subject | 概念分解 | - |
| dc.subject | Attention Mechanism | - |
| dc.subject | Diffusion Model | - |
| dc.subject | Generative AI | - |
| dc.subject | Visual Concept | - |
| dc.subject | Concept Decomposition | - |
| dc.title | AGTree:基於注意力引導的視覺概念分解 | zh_TW |
| dc.title | AGTree: Attention Guided Visual Concept Decomposition | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-2 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.coadvisor | 鄭文皇 | zh_TW |
| dc.contributor.coadvisor | Wen-Huang Cheng | en |
| dc.contributor.oralexamcommittee | 楊智淵;吳家麟;陳駿丞 | zh_TW |
| dc.contributor.oralexamcommittee | Chih-Yuan Yang;Ja-Ling Wu;Jun-Cheng Chen | en |
| dc.subject.keyword | 注意力機制,擴散模型,生成式人工智慧,視覺概念,概念分解 | zh_TW |
| dc.subject.keyword | Attention Mechanism,Diffusion Model,Generative AI,Visual Concept,Concept Decomposition | en |
| dc.relation.page | 59 | - |
| dc.identifier.doi | 10.6342/NTU202502288 | - |
| dc.rights.note | 同意授權(限校園內公開) | - |
| dc.date.accepted | 2025-08-13 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 資訊工程學系 | - |
| dc.date.embargo-lift | 2030-07-22 | - |
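The English abstract above describes the core mechanism of AGTree: cross-attention maps from the diffusion U-Net serve as intrinsic masks, smoothed with an exponential moving average and binarized with Otsu's method (per Chapter 4 in the table of contents), so that background regions are suppressed during masked reconstruction. The snippet below is a minimal, hypothetical PyTorch sketch of that general idea, included for illustration only; the names (`AttentionMask`, `otsu_threshold`, `masked_diffusion_loss`), tensor shapes, decay value, and loss weighting are assumptions of this sketch, not the authors' implementation, which is available at the GitHub link in the abstract.

```python
# Illustrative sketch only (not the AGTree code): turn a per-token
# cross-attention map into a binary foreground mask via EMA + Otsu, then use
# the mask to restrict the usual diffusion reconstruction (MSE) loss.
import torch
import torch.nn.functional as F


def otsu_threshold(values: torch.Tensor, bins: int = 256) -> float:
    """Classic Otsu threshold over a flat tensor of scores in [0, 1]."""
    hist = torch.histc(values, bins=bins, min=0.0, max=1.0)
    probs = hist / hist.sum()
    centers = torch.linspace(0.0, 1.0, bins, device=values.device)
    best_t, best_var = 0.0, -1.0
    for i in range(1, bins):
        w0, w1 = probs[:i].sum(), probs[i:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (probs[:i] * centers[:i]).sum() / w0
        mu1 = (probs[i:] * centers[i:]).sum() / w1
        between_var = w0 * w1 * (mu0 - mu1) ** 2  # between-class variance
        if between_var > best_var:
            best_var, best_t = between_var, centers[i].item()
    return best_t


class AttentionMask:
    """Keeps an EMA of one token's cross-attention map and binarizes it."""

    def __init__(self, decay: float = 0.9):
        self.decay = decay
        self.ema = None  # running average of the (H, W) attention map

    def update(self, attn_map: torch.Tensor) -> torch.Tensor:
        # Normalize the current map to [0, 1] before accumulating.
        attn_map = (attn_map - attn_map.min()) / (attn_map.max() - attn_map.min() + 1e-8)
        self.ema = attn_map if self.ema is None else self.decay * self.ema + (1 - self.decay) * attn_map
        # Binary foreground mask from the smoothed map.
        return (self.ema > otsu_threshold(self.ema.flatten())).float()


def masked_diffusion_loss(noise_pred: torch.Tensor, noise_target: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Resize the (H, W) mask to the latent resolution and restrict the
    # epsilon-prediction MSE to the foreground region it marks.
    mask = F.interpolate(mask[None, None], size=noise_pred.shape[-2:], mode="nearest")
    return ((noise_pred - noise_target) ** 2 * mask).sum() / mask.sum().clamp(min=1.0)
```

Presumably one such mask would be maintained per learned subconcept token and updated as the cross-attention maps evolve over training, which is what the EMA term in this sketch stands in for; the thesis itself should be consulted for the actual masking and random drop details.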
Appears in Collections: 資訊工程學系
Files in This Item:

| File | Size | Format |
|---|---|---|
| ntu-113-2.pdf (restricted access) | 30.24 MB | Adobe PDF |