Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/89905
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 陳祝嵩 | zh_TW |
dc.contributor.advisor | Chu-Song Chen | en |
dc.contributor.author | 吳玉辰 | zh_TW |
dc.contributor.author | Yu-Chen Wu | en |
dc.date.accessioned | 2023-09-22T16:37:21Z | - |
dc.date.available | 2023-11-09 | - |
dc.date.copyright | 2023-09-22 | - |
dc.date.issued | 2023 | - |
dc.date.submitted | 2023-08-09 | - |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/89905 | - |
dc.description.abstract | 場景文字編輯近年來取得了顯著進展,讓我們能夠將現實世界中的文字轉換成指定的文本內容。過去的研究主要依賴生成對抗網絡(GANs),並著重於從圖像中裁剪目標文字區域來引導編輯過程。隨著擴散模型生成品質的提升與進展,使得場景文字編輯也可使用擴散模型來實現。與大部分 GAN 研究不同,擴散模型通常使用整個場景進行填補,並考慮全局資訊,使填補區域得以更加真實。然而過去的研究比較無法控制所生成的文字風格與輸入及參考影像間的關係。在本研究中,我們著重於提升場景文字編輯的風格可控性。我們開發一個方法,讓用戶在交換真實圖像中的文字時能夠操縱文字風格。我們的方法基於近期的擴散模型DiffSTE 模型。利用 DiffSTE 可在指令中指定風格的特性,我們提出了一個集成風格分類和預訓練文本識別的框架,以引導 DiffSTE 在現實場景中生成帶有所需風格的文字。我們的主要貢獻包括實現真實場景的文字交換,以及對文字外觀的精細控制以及定制字體風格和顏色的能力。所開發的方法與技術可以根據用戶的偏好和具體應用需求增強提取文字的呈現效果。 | zh_TW |
dc.description.abstract | Scene text editing aims to enable the rewriting and style transformation of text in real-world images. Previous works mainly relied on Generative Adversarial Networks (GANs) and focused on cropping target text regions for guidance. With the improved generation quality of diffusion models, scene text editing has also been implemented with diffusion models. In this work, we emphasize style controllability in scene text editing. Our goal is to develop a system that allows users to manipulate text styles while swapping texts between real images. Our work builds on DiffSTE, a diffusion-based method in which styles can be specified as instructions. We introduce an approach that integrates style classification and pre-trained text recognition to guide DiffSTE in generating text with the desired styles in real-world scenes. Our main contributions include achieving realistic scene text swapping, fine-grained control over text appearance, and the ability to customize font styles and colors. This approach enhances the rewriting of extracted text according to user preferences and specific application requirements. | en |
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-09-22T16:37:21Z No. of bitstreams: 0 | en |
dc.description.provenance | Made available in DSpace on 2023-09-22T16:37:21Z (GMT). No. of bitstreams: 0 | en |
dc.description.tableofcontents | Verification Letter from the Oral Examination Committee i Acknowledgements ii 摘要 iii Abstract iv Contents vi List of Figures viii List of Tables xi Chapter 1 Introduction 1 1.1 Introduction of Scene Text Editing 1 Chapter 2 Related Works 5 2.1 Image Editing 5 2.2 Scene Text Editing 7 2.2.1 GAN-Based STE 7 2.2.2 Diffusion-Based STE 9 Chapter 3 Proposed Method 10 3.1 Framework Structure 11 3.2 Methodology 12 3.2.1 DiffSTE 12 3.2.2 Style Classification 13 3.2.3 Text Recognition 16 3.3 Training and Inference 16 3.3.1 Generate Training Data 17 3.3.2 Implementation Details 19 3.3.3 Inference 19 Chapter 4 Experiments 20 4.1 Text Swapping 20 4.1.1 Datasets and Baselines 21 4.1.2 Experiments Results 22 4.2 Style Classification 23 4.2.1 Different Backbone 23 4.2.2 User Study 24 4.2.2.1 Font Analysis 25 4.2.2.2 Color Analysis 26 4.3 Extensions 28 4.3.1 Multi-Reference Manipulation 28 Chapter 5 Conclusion 33 References 34 | - |
dc.language.iso | en | - |
dc.title | 可控制風格的場景文字編輯 | zh_TW |
dc.title | Style Controllable Scene Text Editing | en |
dc.type | Thesis | - |
dc.date.schoolyear | 111-2 | - |
dc.description.degree | Master | - |
dc.contributor.oralexamcommittee | 楊惠芳;黃文良 | zh_TW |
dc.contributor.oralexamcommittee | Huei-Fang Yang;Wen-Liang Hwang | en |
dc.subject.keyword | 場景文字,場景文字編輯,擴散模型 | zh_TW |
dc.subject.keyword | Scene Text, Scene Text Editing, Diffusion Model | en |
dc.relation.page | 37 | - |
dc.identifier.doi | 10.6342/NTU202301954 | - |
dc.rights.note | Authorization granted (access restricted to campus network) | - |
dc.date.accepted | 2023-08-11 | - |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
dc.contributor.author-dept | Department of Computer Science and Information Engineering | - |
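
The English abstract above describes guiding DiffSTE with a style classifier and a pre-trained text recognizer. As a rough illustration of how such classifier-style guidance is commonly wired into a diffusion sampling loop, the sketch below combines a style-classification loss and a text-recognition loss into a guidance gradient that nudges each denoising step. This is a minimal, hypothetical example: `StyleClassifier`, `TextRecognizer`, `denoise_step`, and all weights and shapes are illustrative placeholders, not the thesis implementation.

```python
# Illustrative sketch of generic classifier guidance for a diffusion-based
# scene text editor. All module names and numbers are placeholders.
import torch
import torch.nn as nn

class StyleClassifier(nn.Module):
    """Toy stand-in for a font/color style classifier."""
    def __init__(self, num_styles: int = 10):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, num_styles))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class TextRecognizer(nn.Module):
    """Toy stand-in for a pre-trained scene text recognizer (per-character logits)."""
    def __init__(self, vocab: int = 37, max_len: int = 8):
        super().__init__()
        self.max_len = max_len
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, vocab * max_len))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).view(x.size(0), self.max_len, -1)

def guidance_grad(x, style_id, target_chars, style_clf, recognizer,
                  w_style: float = 1.0, w_text: float = 1.0) -> torch.Tensor:
    """Gradient of the combined style + recognition loss w.r.t. the current image."""
    x = x.detach().requires_grad_(True)
    ce = nn.CrossEntropyLoss()
    style_loss = ce(style_clf(x), style_id)            # push toward the target style class
    rec_logits = recognizer(x)                          # (B, L, vocab)
    text_loss = ce(rec_logits.flatten(0, 1), target_chars.flatten())  # keep text legible/correct
    loss = w_style * style_loss + w_text * text_loss
    return torch.autograd.grad(loss, x)[0]

def denoise_step(x: torch.Tensor, t: int) -> torch.Tensor:
    """Placeholder for one reverse-diffusion step of the underlying editor."""
    return x - 0.01 * torch.randn_like(x)

# Guided sampling loop: after each denoising step, move against the guidance gradient.
x = torch.randn(1, 3, 64, 64)
style_clf, recognizer = StyleClassifier(), TextRecognizer()
style_id = torch.tensor([3])                 # desired font/color class (example value)
target_chars = torch.randint(0, 37, (1, 8))  # desired character sequence (example value)
for t in reversed(range(50)):
    x = denoise_step(x, t)
    x = x - 0.1 * guidance_grad(x, style_id, target_chars, style_clf, recognizer)
```

In a real system the two auxiliary networks would score the decoded or rendered text region rather than raw noise, and the weights `w_style` and `w_text` would trade off style fidelity against legibility of the target string; those details are assumptions here, not taken from the thesis.
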
Appears in Collections: | Computer Science and Information Engineering
Files in This Item:
File | Size | Format |
---|---|---|
ntu-111-2.pdf (access restricted to NTU campus IP addresses; use the VPN service from off campus) | 2.95 MB | Adobe PDF |