Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/89020

Full metadata record
| DC 欄位 | 值 | 語言 |
|---|---|---|
| dc.contributor.advisor | 許永真 | zh_TW |
| dc.contributor.advisor | Yung-jen Hsu | en |
| dc.contributor.author | 林揚昇 | zh_TW |
| dc.contributor.author | Yang-Sheng Lin | en |
| dc.date.accessioned | 2023-08-16T16:47:52Z | - |
| dc.date.available | 2023-11-09 | - |
| dc.date.copyright | 2023-08-16 | - |
| dc.date.issued | 2023 | - |
| dc.date.submitted | 2023-08-07 | - |
| dc.identifier.citation | [1] J. Cao, L. Hou, M.-H. Yang, R. He, and Z. Sun, “Remix: Towards image-to-image translation with limited data,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15018–15027, 2021.
[2] J. Cao, M. Luo, J. Yu, M.-H. Yang, and R. He, “Scoremix: A scalable augmentation strategy for training gans with limited data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[3] Z. Li, X. Wu, B. Xia, J. Zhang, C. Wang, and B. Li, “A comprehensive survey on data-efficient gans in image generation,” arXiv preprint arXiv:2204.08329, 2022.
[4] Y. Wang, C. Wu, L. Herranz, J. Van de Weijer, A. Gonzalez-Garcia, and B. Raducanu, “Transferring gans: generating images from limited data,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 218–234, 2018.
[5] S. Mo, M. Cho, and J. Shin, “Freeze the discriminator: a simple baseline for fine-tuning gans,” arXiv preprint arXiv:2002.10964, 2020.
[6] T. Grigoryev, A. Voynov, and A. Babenko, “When, why, and which pretrained gans are useful?,” arXiv preprint arXiv:2202.08937, 2022.
[7] L. Yu, J. van de Weijer, et al., “Deepi2i: Enabling deep hierarchical image-to-image translation by transferring from gans,” Advances in Neural Information Processing Systems, vol. 33, pp. 11803–11815, 2020.
[8] Y. Wang, H. Laria, J. van de Weijer, L. Lopez-Fuentes, and B. Raducanu, “Transferi2i: Transfer learning for image-to-image translation from small datasets,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14010–14019, 2021.
[9] M.-Y. Liu, X. Huang, A. Mallya, T. Karras, T. Aila, J. Lehtinen, and J. Kautz, “Few-shot unsupervised image-to-image translation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10551–10560, 2019.
[10] G. Zhang, K. Cui, T.-Y. Hung, and S. Lu, “Defect-gan: High-fidelity defect synthesis for automated defect inspection,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2524–2534, 2021.
[11] K. Cui, J. Huang, Z. Luo, G. Zhang, F. Zhan, and S. Lu, “Genco: generative co-training for generative adversarial networks with limited data,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 499–507, 2022.
[12] T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, and T. Aila, “Training generative adversarial networks with limited data,” Advances in Neural Information Processing Systems, vol. 33, pp. 12104–12114, 2020.
[13] S. Zhao, Z. Liu, J. Lin, J.-Y. Zhu, and S. Han, “Differentiable augmentation for data-efficient gan training,” Advances in Neural Information Processing Systems, vol. 33, pp. 7559–7570, 2020.
[14] H. Zhang, Z. Zhang, A. Odena, and H. Lee, “Consistency regularization for generative adversarial networks,” arXiv preprint arXiv:1910.12027, 2019.
[15] Z. Zhao, S. Singh, H. Lee, Z. Zhang, A. Odena, and H. Zhang, “Improved consistency regularization for gans,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 11033–11041, 2021.
[16] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134, 2017.
[17] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232, 2017.
[18] Y. Pang, J. Lin, T. Qin, and Z. Chen, “Image-to-image translation: Methods and applications,” IEEE Transactions on Multimedia, vol. 24, pp. 3859–3881, 2021.
[19] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, “High-resolution image synthesis and semantic manipulation with conditional gans,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8798–8807, 2018.
[20] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu, “Semantic image synthesis with spatially-adaptive normalization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2337–2346, 2019.
[21] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pp. 694–711, Springer, 2016.
[22] X. Huang and S. Belongie, “Arbitrary style transfer in real-time with adaptive instance normalization,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 1501–1510, 2017.
[23] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, “Context encoders: Feature learning by inpainting,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544, 2016.
[24] W. Yang, X. Zhang, Y. Tian, W. Wang, J.-H. Xue, and Q. Liao, “Deep learning for single image super-resolution: A brief review,” IEEE Transactions on Multimedia, vol. 21, no. 12, pp. 3106–3121, 2019.
[25] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. Singh, and M.-H. Yang, “Diverse image-to-image translation via disentangled representations,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 35–51, 2018.
[26] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz, “Multimodal unsupervised image-to-image translation,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 172–189, 2018.
[27] T. Wang, T. Zhang, B. Zhang, H. Ouyang, D. Chen, Q. Chen, and F. Wen, “Pretraining is all you need for image-to-image translation,” arXiv preprint arXiv:2205.12952, 2022.
[28] G. Parmar, K. K. Singh, R. Zhang, Y. Li, J. Lu, and J.-Y. Zhu, “Zero-shot image-to-image translation,” arXiv preprint arXiv:2302.03027, 2023.
[29] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103, 2008.
[30] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009, 2022.
[31] Z. Xie, Z. Zhang, Y. Cao, Y. Lin, J. Bao, Z. Yao, Q. Dai, and H. Hu, “Simmim: A simple framework for masked image modeling,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9653–9663, 2022.
[32] Z. Fei, M. Fan, L. Zhu, J. Huang, X. Wei, and X. Wei, “Masked autoencoders meet generative adversarial networks and beyond,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 24449–24459, 2023.
[33] C. Wei, K. Mangalam, P.-Y. Huang, Y. Li, H. Fan, H. Xu, H. Wang, C. Xie, A. Yuille, and C. Feichtenhofer, “Diffusion models as masked autoencoders,” arXiv preprint arXiv:2304.03283, 2023.
[34] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410, 2019.
[35] R. Abdal, Y. Qin, and P. Wonka, “Image2stylegan: How to embed images into the stylegan latent space?,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4432–4441, 2019.
[36] R. Abdal, Y. Qin, and P. Wonka, “Image2stylegan++: How to edit the embedded images?,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8296–8305, 2020.
[37] E. Richardson, Y. Alaluf, O. Patashnik, Y. Nitzan, Y. Azar, S. Shapiro, and D. Cohen-Or, “Encoding in style: a stylegan encoder for image-to-image translation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2287–2296, 2021.
[38] Y. Choi, Y. Uh, J. Yoo, and J.-W. Ha, “Stargan v2: Diverse image synthesis for multiple domains,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8188–8197, 2020.
[39] J. Huang, K. Cui, D. Guan, A. Xiao, F. Zhan, S. Lu, S. Liao, and E. Xing, “Masked generative adversarial networks are data-efficient generation learners,” Advances in Neural Information Processing Systems, vol. 35, pp. 2154–2167, 2022.
[40] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
[41] M. Mundt, S. Majumder, S. Murali, P. Panetsos, and V. Ramesh, “Codebrim: Concrete defect bridge image dataset,” 2019.
[42] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “Stargan: Unified generative adversarial networks for multi-domain image-to-image translation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8789–8797, 2018.
[43] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.
[44] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, IEEE, 2009.
[45] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022, 2021.
[46] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning, pp. 8748–8763, PMLR, 2021.
[47] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Communications of the ACM, vol. 60, no. 6, pp. 84–90, 2017. | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/89020 | - |
| dc.description.abstract | 非監督的圖像到圖像轉換(Unsupervised Image-to-Image Translation)因其廣泛的應用範疇與不需要標註的特性,已成為圖像生成領域的研究重心之一,並獲得了非常顯著的成果。然而,在資料有限的情況下,確保訓練的穩定性並產生多樣且真實的圖像仍是很困難的研究問題。為了解決這些挑戰,我們提出了兩種簡單且即插即用的方法:遮罩自動編碼器生成對抗網絡(MAE-GAN)和風格嵌入自適應歸一化塊(SEAN)。
MAE-GAN是一種用於非監督圖像到圖像(Unsupervised I2I)任務的預訓練方法,它融合了MAE和GAN的架構和優點,並且在預訓練期間能學習到不同領域的風格信息,從而提高下游任務的訓練穩定性和圖像品質。SEAN塊是一種新的歸一化塊(Normalization Block),它利用了大規模的預訓練特徵提取器(Large-scale Pre-trained Feature Extractor),並在模型的每一層中各自學習每個不同領域的風格特徵空間。並且,它還能在多樣性和保真度之間進行選擇,使得可以生成更多樣化或更真實的圖像。我們的方法在資料型態較少見且具有挑戰性的混凝土缺陷橋樑圖像數據集(CODEBRIM)上取得了非常好的成果;此外,我們的方法僅使用10%的動物臉部數據集(AFHQ)進行訓練,便達到了與訓練在完整數據集上的模型相近的圖像品質,並且還能獲得更好的圖像多樣性,證明了其在現實世界中的應用性和巨大的潛力。 | zh_TW |
| dc.description.abstract | Unsupervised Image-to-Image Translation (Unsupervised I2I) has emerged as a significant area of interest and has recently seen substantial advancements due to its wide range of applications and reduced data annotation requirements. However, in scenarios with limited data, ensuring training stability and generating diverse, realistic images remain critical research directions. To address these challenges, we propose two simple, plug-and-play methods: the Masked AutoEncoder Generative Adversarial Network (MAE-GAN) and the Style Embedding Adaptive Normalization (SEAN) block.
The MAE-GAN, a pre-training method for Unsupervised I2I tasks, integrates the architectures and strengths of both MAE and GAN. It also learns domain-specific style information during pre-training, leading to stable training and improved image quality in downstream tasks. The SEAN block is a novel normalization block that leverages a large-scale pre-trained feature extractor and learns a separate style feature space for each domain at every layer of the model. It also allows a choice between diversity and fidelity, enabling the generation of more diverse or more realistic images. Our method achieves strong results on the uncommon and challenging COncrete DEfect BRidge IMage dataset (CODEBRIM), demonstrating its real-world applicability. In addition, when trained on just 10% of the Animal Faces-HQ dataset (AFHQ), our methods reach image quality on par with models trained on the full dataset while achieving greater image diversity, underscoring their practicality and potential. (A minimal illustrative sketch of both components follows this record.) | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-08-16T16:47:52Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2023-08-16T16:47:52Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | 1 Introduction
1.1 Background
1.2 Motivation
1.3 Proposed Method
1.4 Result
1.5 Contribution
1.6 Outline of this Thesis
2 Literature Review
2.1 Data-Efficient Generation
2.1.1 Pre-trained Generative Adversarial Networks
2.2 Image-to-Image Translation
2.2.1 Unsupervised Image-to-Image Translation
2.2.2 Data-Efficient Image-to-Image Translation
2.3 Masked AutoEncoder
2.3.1 Masked AutoEncoder with Generation Model
2.4 Latent Space Embedding
3 Methodology
3.1 Problem Definition
3.1.1 Notations
3.2 MAE-GAN for I2I Pre-training
3.2.1 Pre-training Pipeline
3.2.2 Training Objectives
3.2.3 Masked Strategies
3.3 Style Embedding Adaptive Normalization
3.3.1 Pre-trained Feature Extractor as Style Code Generator
3.3.2 Style Code Space
4 Experiments
4.1 Datasets
4.1.1 COncrete DEfect BRidge IMage
4.1.2 Animal Faces-HQ
4.2 Experiments Setup
4.3 Evaluation Metrics
4.3.1 Fréchet Inception Distance
4.3.2 Learned Perceptual Image Patch Similarity
4.4 Evaluation and Results
4.5 Further Analysis for Each Component in MAE-GAN
4.5.1 Masked Autoencoder Generative Adversarial Network Framework
4.5.2 Masking Strategy
4.6 Further Analysis for Each Component in SEAN
4.6.1 Comparison of Various Style Code Insertion Methods
4.6.2 Multiple Style Codes Mean and Label Embedding
4.6.3 Style Space Sampling
5 Conclusion
5.1 Summary and Contribution
5.2 Future Work
5.2.1 Pre-trained Feature Extractor Selection
5.2.2 Supervised Image-to-Image Translation
5.2.3 Integration with Data Augmentation Methods
Bibliography | - |
| dc.language.iso | en | - |
| dc.subject | 遮罩自動編碼器 | zh_TW |
| dc.subject | 資料缺乏下的圖像生成 | zh_TW |
| dc.subject | 非監督的圖像到圖像轉換 | zh_TW |
| dc.subject | 生成對抗網絡 | zh_TW |
| dc.subject | Multiple Domain Image-to-Image Translation | en |
| dc.subject | Unsupervised Image-to-Image Translation | en |
| dc.subject | Data-Efficient Generative Adversarial Network | en |
| dc.subject | Masked Autoencoder | en |
| dc.title | 資料缺乏下多樣且真實的影像生成 | zh_TW |
| dc.title | Diverse and Fidelity Image Synthesis for Unsupervised Image-to-Image Translation with Limited Data | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 111-2 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.oralexamcommittee | 郭彥伶;鄭文皇;陳駿丞;楊智淵 | zh_TW |
| dc.contributor.oralexamcommittee | Yen-Ling Guo;Wen-Huang Cheng;Jun-Cheng Chen;Chih-Yuan Yang | en |
| dc.subject.keyword | 非監督的圖像到圖像轉換,生成對抗網絡,資料缺乏下的圖像生成,遮罩自動編碼器 | zh_TW |
| dc.subject.keyword | Unsupervised Image-to-Image Translation,Multiple Domain Image-to-Image Translation,Data-Efficient Generative Adversarial Network,Masked Autoencoder | en |
| dc.relation.page | 52 | - |
| dc.identifier.doi | 10.6342/NTU202303360 | - |
| dc.rights.note | 同意授權(全球公開) | - |
| dc.date.accepted | 2023-08-09 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 資訊工程學系 | - |
Appears in Collections: 資訊工程學系
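The abstracts above outline two components, MAE-GAN pre-training (masked-image reconstruction combined with adversarial training) and the SEAN normalization block (per-domain style codes from a large-scale pre-trained feature extractor modulating each generator layer), but the record itself contains no implementation. The sketch below is only a minimal PyTorch illustration of those two general ideas, not the thesis's actual architecture: the function `random_patch_mask`, the class `StyleEmbeddingAdaptiveNorm`, the 16-pixel patches, the 75% mask ratio, and the 512-dimensional style code are all illustrative assumptions.

```python
import torch
import torch.nn as nn


def random_patch_mask(images: torch.Tensor, patch_size: int = 16,
                      mask_ratio: float = 0.75) -> torch.Tensor:
    """Zero out a random subset of non-overlapping patches, MAE-style.

    images: (N, C, H, W) with H and W divisible by patch_size.
    """
    n, _, h, w = images.shape
    gh, gw = h // patch_size, w // patch_size
    # Patch-level keep/drop decisions, shared across channels.
    keep = (torch.rand(n, 1, gh, gw, device=images.device) > mask_ratio).float()
    # Expand the patch grid back to pixel resolution.
    mask = keep.repeat_interleave(patch_size, dim=2).repeat_interleave(patch_size, dim=3)
    return images * mask


class StyleEmbeddingAdaptiveNorm(nn.Module):
    """AdaIN-style normalization modulated by an external style code."""

    def __init__(self, num_channels: int, style_dim: int):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        # Map the style code to per-channel scale and shift parameters.
        self.to_gamma = nn.Linear(style_dim, num_channels)
        self.to_beta = nn.Linear(style_dim, num_channels)

    def forward(self, x: torch.Tensor, style_code: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W) generator features; style_code: (N, style_dim),
        # e.g. pooled features of a target-domain image from a frozen extractor.
        h = self.norm(x)
        gamma = self.to_gamma(style_code)[:, :, None, None]
        beta = self.to_beta(style_code)[:, :, None, None]
        return (1 + gamma) * h + beta


if __name__ == "__main__":
    imgs = torch.randn(2, 3, 128, 128)
    masked = random_patch_mask(imgs)        # reconstruction target stays `imgs`
    feats = torch.randn(2, 64, 32, 32)      # intermediate generator features
    style = torch.randn(2, 512)             # per-domain style code
    sean = StyleEmbeddingAdaptiveNorm(num_channels=64, style_dim=512)
    print(masked.shape, sean(feats, style).shape)
```

In a data-efficient setup of this general kind, masking supplies the reconstruction-based pre-training signal, while style-modulated normalization is what lets a single generator serve multiple target domains.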
Files in This Item:

| File | Size | Format | |
|---|---|---|---|
| ntu-111-2.pdf | 19.67 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless the copyright terms are otherwise indicated.