NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98493
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | 陳炳宇 | zh_TW
dc.contributor.advisor | Bing-Yu Chen | en
dc.contributor.author | 秦孝媛 | zh_TW
dc.contributor.author | Hsiao Yuan Chin | en
dc.date.accessioned | 2025-08-14T16:19:51Z | -
dc.date.available | 2025-08-15 | -
dc.date.copyright | 2025-08-14 | -
dc.date.issued | 2025 | -
dc.date.submitted | 2025-08-01 | -
dc.identifier.citation:
[1] I. Berger, A. Shamir, M. Mahler, E. Carter, and J. Hodgins. Style and abstraction in portrait sketching. ACM Transactions on Graphics (TOG), 32(4):1–12, 2013.
[2] M. Cai, Z. Huang, Y. Li, U. Ojha, H. Wang, and Y. J. Lee. Leveraging large language models for scalable vector graphics-driven image understanding. arXiv preprint arXiv:2306.06094, 2023.
[3] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021.
[4] M. Eitz, J. Hays, and M. Alexa. How do humans sketch objects? ACM Trans. Graph. (Proc. SIGGRAPH), 31(4):44:1–44:10, 2012.
[5] K. Frans, L. Soros, and O. Witkowski. CLIPDraw: Exploring text-to-drawing synthesis through language-image encoders. Advances in Neural Information Processing Systems, 35:5207–5218, 2022.
[6] S. Fu, N. Tamir, S. Sundaram, L. Chai, R. Zhang, T. Dekel, and P. Isola. DreamSim: Learning new dimensions of human visual similarity using synthetic data. In Advances in Neural Information Processing Systems, volume 36, pages 50742–50768, 2023.
[7] R. Gal, Y. Vinker, Y. Alaluf, A. Bermano, D. Cohen-Or, A. Shamir, and G. Chechik. Breathing life into sketches using text-to-video priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
[8] Y. Gryaditskaya, M. Sypesteyn, J. W. Hoftijzer, S. Pont, F. Durand, and A. Bousseau. OpenSketch: A richly-annotated dataset of product design sketches. ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia), 38(6):232, 2019.
[9] D. Ha and D. Eck. A neural representation of sketch drawings. In Proc. ICLR, 2018.
[10] A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
[11] A. Jain, A. Xie, and P. Abbeel. VectorFusion: Text-to-SVG by abstracting pixel-based diffusion models. In Proc. CVPR, pages 1911–1920, 2023.
[12] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proc. ICLR, 2015.
[13] H. Li, H. Zhang, Y. Wang, J. Cao, A. Shamir, and D. Cohen-Or. Curve style analysis in a set of shapes. In Computer Graphics Forum, volume 32, pages 77–88. Wiley Online Library, 2013.
[14] H. Lin, Y. Fu, X. Xue, and Y.-G. Jiang. Sketch-BERT: Learning sketch bidirectional encoder representation from transformers by self-supervised learning of sketch gestalt. In Proc. CVPR, pages 6758–6767, 2020.
[15] Z. Lin, D. Pathak, B. Li, J. Li, X. Xia, G. Neubig, P. Zhang, and D. Ramanan. Evaluating text-to-visual generation with image-to-text generation. arXiv preprint arXiv:2404.01291, 2024.
[16] K. Nishina and Y. Matsui. SVGEditBench: A benchmark dataset for quantitative assessment of LLMs' SVG editing capabilities. arXiv preprint arXiv:2404.13710, 2024.
[17] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
[18] Z. Qu, T. Xiang, and Y.-Z. Song. Sketchdreamer: Interactive text-augmented creative sketch ideation. In Proc. BMVC, 2023.
[19] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), pages 8748–8763. PMLR, 2021.
[20] L. S. F. Ribeiro, T. Bui, J. Collomosse, and M. Ponti. Sketchformer: Transformer-based representation for sketched structure. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14153–14162, 2020.
[21] J. A. Rodriguez, A. Puri, S. Agarwal, I. H. Laradji, S. Rajeswar, D. Vazquez, C. Pal, and M. Pedersoli. StarVector: Generating scalable vector graphics code from images and text. In Proc. AAAI, volume 39, pages 29691–29693, 2025.
[22] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
[23] P. Sangkloy, N. Burnell, C. Ham, and J. Hays. The sketchy database: Learning to retrieve badly drawn bunnies. ACM Transactions on Graphics (Proceedings of SIGGRAPH), 2016.
[24] Z. Tang, C. Wu, Z. Zhang, M. Ni, S. Yin, Y. Liu, Z. Yang, L. Wang, Z. Liu, J. Li, and N. Duan. StrokeNUWA: Tokenizing strokes for vector graphic synthesis. arXiv preprint arXiv:2401.17093, 2024.
[25] G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[26] Y. Vinker, Y. Alaluf, D. Cohen-Or, and A. Shamir. CLIPascene: Scene sketching with different types and levels of abstraction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4146–4156, 2023.
[27] Y. Vinker, E. Pajouheshgar, J. Y. Bo, R. C. Bachmann, A. H. Bermano, D. Cohen-Or, A. Zamir, and A. Shamir. CLIPasso: Semantically-aware object sketching. ACM Trans. Graph., 41(4), July 2022.
[28] Y. Vinker, T. R. Shaham, K. Zheng, A. Zhao, J. E. Fan, and A. Torralba. SketchAgent: Language-driven sequential sketch generation. arXiv preprint arXiv:2411.17673, 2024.
[29] J. Wang, H. Yuan, D. Chen, Y. Zhang, X. Wang, and S. Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023.
[30] R. Wu, W. Su, and J. Liao. Chat2SVG: Vector graphics generation with large language models and image diffusion models. arXiv preprint arXiv:2411.16602, 2024.
[31] R. Wu, W. Su, K. Ma, and J. Liao. IconShop: Text-guided vector icon synthesis with autoregressive transformers. ACM Transactions on Graphics (TOG), 42(6):1–14, 2023.
[32] X. Xing, J. Hu, G. Liang, J. Zhang, D. Xu, and Q. Yu. Empowering llms to understand and generate complex vector graphics. arXiv preprint arXiv:2412.11102, 2024.
[33] X. Xing, C. Wang, H. Zhou, J. Zhang, Q. Yu, and D. Xu. DiffSketcher: Text-guided vector sketch synthesis through latent diffusion models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[34] L. Zhang, A. Rao, and M. Agrawala. Adding conditional control to text-to-image diffusion models. In Proc. ICCV, pages 3836–3847, 2023.
[35] T. Zhou, C. Fang, Z. Wang, J. Yang, B. Kim, Z. Chen, J. Brandt, and D. Terzopoulos. Learning to sketch with deep q networks and demonstrated strokes. arXiv preprint arXiv:1810.05977, 2018.
[36] B. Zou, M. Cai, J. Zhang, and Y. J. Lee. VGBench: Evaluating large language models on vector graphics understanding and generation. arXiv preprint arXiv:2407.10972, 2024.
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98493 | -
dc.description.abstract | 草圖是重要的表達媒介,近年來已有眾多研究致力於自動草圖生成。其中一項對業餘使用者極具實用性的功能,是根據文字描述自動補全部分草圖以生成複雜場景,同時保留原始草圖的風格。現有方法僅著重於產出符合輸入提示內容、且具預設風格的草圖,而忽略了輸入部分草圖中的風格特徵,例如整體的抽象程度與局部筆劃風格等。為解決此挑戰,我們提出 AutoSketch,一種能適應多樣化草圖風格並支援多輪補全的風格感知向量草圖補全方法。AutoSketch 透過兩階段的流程,風格一致地補全輸入草圖。在第一階段,我們首先優化筆劃以符合一組輸入提示,該提示由原始文字描述擴充而來,擴充內容包含由視覺語言模型(VLM)所提取的風格描述。這些風格描述進一步產生非寫實的引導圖像,藉此引導補全更多內容筆劃。在第二階段,我們利用 VLM 將第一階段生成的筆劃調整為與輸入草圖風格一致,並透過一個迭代風格調整機制實現此目標。在每次迭代中,VLM 辨識輸入草圖與前一階段筆劃之間的風格差異,並將這些差異轉換為調整碼,用以更新筆劃。我們在各種草圖風格與文字提示下,將本方法與現有技術進行比較,並進行廣泛的消融研究、質性與量化評估,證實 AutoSketch 能支援多樣化的草圖創作情境。 | zh_TW
dc.description.abstract | Sketches are an important medium of expression, and many recent works concentrate on automatic sketch creation. One capability that is particularly useful for amateurs is text-based completion of a partial sketch to create a complex scene while preserving the style of the partial sketch. Existing methods focus solely on generating sketches that match the content of the input prompt in a predefined style, ignoring the style of the input partial sketch, e.g., its global abstraction level and local stroke styles. To address this challenge, we introduce AutoSketch, a style-aware vector sketch completion method that accommodates diverse sketch styles and supports iterative sketch completion. AutoSketch completes the input sketch in a style-consistent manner using a two-stage method. In the first stage, we optimize the strokes to match an input prompt augmented by style descriptions extracted by a vision-language model (VLM). These style descriptions lead to non-photorealistic guidance images, which enable more content to be depicted by the new strokes. In the second stage, we use the VLM to adjust the strokes from the first stage to adhere to the style of the input partial sketch through an iterative style-adjustment process. In each iteration, the VLM identifies a list of style differences between the input sketch and the strokes generated in the previous iteration and translates these differences into adjustment codes that modify the strokes. We compare our method with existing methods across various sketch styles and prompts, perform extensive ablation studies and qualitative and quantitative evaluations, and demonstrate that AutoSketch supports diverse sketching scenarios. | en
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-08-14T16:19:51Z. No. of bitstreams: 0 | en
dc.description.provenance | Made available in DSpace on 2025-08-14T16:19:51Z (GMT). No. of bitstreams: 0 | en
dc.description.tableofcontents:
Acknowledgements
摘要
Abstract
Contents
List of Figures
List of Tables
Chapter 1 Introduction
Chapter 2 Related Work
2.1 Vector Sketch Generation
2.2 Sketch Styles
2.3 LLM-based Sketch and SVG Editing
Chapter 3 Overview
Chapter 4 Content-centric Sketch Completion
4.1 Prompt Augmentation
4.2 Stroke Optimization for Completion
Chapter 5 VLM-based Sketch Style Adjustment
Chapter 6 Experiment
6.1 Implementation Details
6.2 Comparison with Existing Methods
6.3 Diverse Sketch Scenario
6.4 Ablation Study
6.4.1 The effectiveness of the style adjustment stage
6.4.2 The effectiveness of adaptive prompt augmentation
6.4.3 The effectiveness of using style adjustment code
6.4.4 Generalization of VLMs
Chapter 7 Limitations and Future Work
Chapter 8 Conclusion
References
Appendix A: VLM Preamble Detail
A.1 Style Difference Detection Preamble
A.2 Adjustment Code Generation Preamble
Appendix B: Detailed Case Examples
B.1 Case: Dogs Playing
B.1.1 Style Difference
B.1.2 Adjustment Code
B.2 Case: Girl Walking in the Park
B.2.1 Style Difference
B.2.2 Adjustment Code
dc.language.iso | zh_TW | -
dc.subject | 草圖補全 | zh_TW
dc.subject | 向量草圖 | zh_TW
dc.subject | 貝茲曲線 | zh_TW
dc.subject | 場景補全 | zh_TW
dc.subject | 風格感知 | zh_TW
dc.subject | Style-Aware | en
dc.subject | Scene Completion | en
dc.subject | Bézier Curves | en
dc.subject | Sketch Completion | en
dc.subject | Vector Sketches | en
dc.title | 視覺語言模型輔助之風格感知向量草圖補全 | zh_TW
dc.title | AutoSketch: VLM-Assisted Style-Aware Vector Sketch Completion | en
dc.type | Thesis | -
dc.date.schoolyear | 113-2 | -
dc.description.degree | 碩士 (Master) | -
dc.contributor.oralexamcommittee | 林文杰;王昱舜;朱宏國 | zh_TW
dc.contributor.oralexamcommittee | Wen-Chieh Lin;Yu-Shuen Wang;Hung-Kuo Chu | en
dc.subject.keyword | 向量草圖, 草圖補全, 風格感知, 場景補全, 貝茲曲線 | zh_TW
dc.subject.keyword | Vector Sketches, Sketch Completion, Style-Aware, Scene Completion, Bézier Curves | en
dc.relation.page | 52 | -
dc.identifier.doi | 10.6342/NTU202502832 | -
dc.rights.note | 未授權 (not authorized) | -
dc.date.accepted | 2025-08-06 | -
dc.contributor.author-college | 管理學院 (College of Management) | -
dc.contributor.author-dept | 資訊管理學系 (Department of Information Management) | -
dc.date.embargo-lift | N/A | -
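For orientation, the abstract above describes a two-stage algorithm: (1) content-centric stroke optimization against a prompt augmented with a VLM-extracted style description, and (2) iterative VLM-driven style adjustment via "adjustment codes". The following Python sketch illustrates that control flow only, under stated assumptions: every function here (vlm_describe_style, optimize_strokes, vlm_style_differences, apply_adjustment_code) is a hypothetical stub invented for this illustration, not the thesis's actual implementation or API.

```python
# Minimal, hypothetical sketch of the two-stage AutoSketch pipeline
# described in the abstract. All names are illustrative placeholders.
from dataclasses import dataclass, field


@dataclass
class Sketch:
    """A vector sketch, abstracted as a list of stroke records
    (e.g., Bezier-curve control points)."""
    strokes: list = field(default_factory=list)


def vlm_describe_style(partial: Sketch) -> str:
    """Stage 1 (prompt augmentation): ask a vision-language model for a
    textual style description (global abstraction level, local stroke
    character) of the partial input. Stubbed with a fixed answer."""
    return "loose, highly abstract, single-line contours"


def optimize_strokes(partial: Sketch, prompt: str) -> Sketch:
    """Stage 1 (completion): optimize new content strokes against the
    style-augmented prompt; per the abstract, the style description
    yields non-photorealistic guidance images. Stubbed."""
    return Sketch(strokes=partial.strokes + ["<new content strokes>"])


def vlm_style_differences(partial: Sketch, draft: Sketch) -> list[str]:
    """Stage 2: the VLM lists style differences between the input sketch
    and the current draft. Stubbed: reports no differences."""
    return []


def apply_adjustment_code(draft: Sketch, diffs: list[str]) -> Sketch:
    """Stage 2: the VLM translates each difference into an adjustment
    code (an edit over stroke parameters) applied to the draft. Stubbed."""
    return draft


def autosketch(partial: Sketch, prompt: str, max_iters: int = 5) -> Sketch:
    # Stage 1: content-centric completion with a style-augmented prompt.
    style = vlm_describe_style(partial)
    draft = optimize_strokes(partial, f"{prompt}, drawn as: {style}")

    # Stage 2: iterate until the VLM reports no remaining style differences.
    for _ in range(max_iters):
        diffs = vlm_style_differences(partial, draft)
        if not diffs:
            break
        draft = apply_adjustment_code(draft, diffs)
    return draft


if __name__ == "__main__":
    completed = autosketch(Sketch(strokes=["<partial strokes>"]), "dogs playing")
    print(completed)
```

The early exit mirrors the abstract's description of stage 2: adjustment repeats only while the VLM still detects style differences between the input sketch and the draft.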
Appears in collections: 資訊管理學系 (Department of Information Management)

Files in this item:
File | Size | Format
ntu-113-2.pdf | 31.6 MB | Adobe PDF (restricted: not authorized for public access)