NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/100950

Full metadata record (listed as "DC field: value (language)"):
dc.contributor.advisor: 陳炳宇 (zh_TW)
dc.contributor.advisor: Bing-Yu Chen (en)
dc.contributor.author: 朱育萱 (zh_TW)
dc.contributor.author: Yu-Hsuan Chu (en)
dc.date.accessioned: 2025-11-26T16:13:25Z
dc.date.available: 2025-11-27
dc.date.copyright: 2025-11-26
dc.date.issued: 2025
dc.date.submitted: 2025-10-28
dc.identifier.citation:
[1] Meta AI. Llama 3.2: Large language model Meta AI. https://ai.meta.com/llama, 2024. Accessed: 2024-11-25.
[2] Google. Gemini API: Gemini-1.5 Pro. https://ai.google.dev/gemini-api, 2024. Accessed: 2024-06-06.
[3] H. Laurençon, L. Tronchon, and V. Sanh. Unlocking the conversion of web screenshots into HTML code with the WebSight dataset. arXiv preprint arXiv:2403.09029, 2024.
[4] J. Lin, J. Guo, S. Sun, Z. Yang, J.-G. Lou, and D. Zhang. LayoutPrompter: Awaken the design ability of large language models. Advances in Neural Information Processing Systems, 36, 2024.
[5] A. Lozhkov, R. Li, L. B. Allal, F. Cassano, J. Lamy-Poirier, N. Tazi, A. Tang, D. Pykhtar, J. Liu, Y. Wei, et al. StarCoder 2 and The Stack v2: The next generation. arXiv preprint arXiv:2402.19173, 2024.
[6] OpenAI. GPT-4o: Advanced multimodal language model. https://openai.com, 2024. Accessed: 2024-10-06.
[7] PaddleOCR Contributors. PaddleOCR: Multi-language, awesome OCR toolkits based on PaddlePaddle. https://github.com/PaddlePaddle/PaddleOCR, 2023.
[8] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
[9] C. Si, Y. Zhang, Z. Yang, R. Liu, and D. Yang. Design2Code: How far are we from automating front-end engineering? arXiv preprint arXiv:2403.03163, 2024.
[10] Y. Wan, C. Wang, Y. Dong, W. Wang, S. Li, Y. Huo, and M. R. Lyu. Automatically generating UI code from screenshot: A divide-and-conquer-based approach. arXiv preprint arXiv:2406.16386, 2024.
[11] Q. Ye, H. Xu, G. Xu, J. Ye, M. Yan, Y. Zhou, J. Wang, A. Hu, P. Shi, Y. Shi, et al. mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
[12] P. Zhang, X. Dong, Y. Zang, Y. Cao, R. Qian, L. Chen, Q. Guo, H. Duan, B. Wang, L. Ouyang, et al. InternLM-XComposer-2.5: A versatile large vision language model supporting long-contextual input and output. arXiv preprint arXiv:2407.03320, 2024.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/100950
dc.description.abstract (zh_TW): 目前實務上,將使用者介面設計草稿轉換為前端程式碼的過程通常需要以人工方式進行,這不僅繁瑣且耗時,對於非技術領域的從業人員而言更是如此。然而,隨著多模態大型語言模型的迅速發展,將網頁截圖轉換為 HTML/CSS 程式碼的相關研究已取得一些突破。儘管如此,現有大多數方法仍依賴大量訓練資料集與龐大的運算資源來進行模型微調(fine-tuning),這凸顯出當前多模態大型語言模型在有效擷取網頁設計關鍵資訊方面仍有侷限。此外,即使在提示中嵌入對網頁的純文字描述來輔助前端程式碼的生成,結果仍不盡理想。
因此,本研究提出一種創新方法,將網頁設計中的關鍵元素(例如主要顏色、文字內容與邊界定位資訊)直接融入模型的提示(prompt)中,以改進多模態大型語言模型的生成效能。此方法不僅能避免耗時且資源密集的模型微調過程,還能提高模型生成前端程式碼的準確性與一致性。實驗結果證明,將顏色與文字座標資訊結合於提示中,可顯著提升模型重現網頁截圖的精準度;相較於以往需要微調的模型,我們的方法計算複雜度較低,卻能達到更高的效能與效率。
這項研究的貢獻在於,無需繁瑣的微調過程,我們的方法即能有效捕捉網頁設計中最關鍵的視覺元素,並高效地生成對應的 HTML/CSS 程式碼,為未來網頁設計自動化及前端開發工具的創新奠定基礎。
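The abstract above mentions extracting dominant colors from the webpage screenshot before prompting the model. The thesis's implementation is not included in this record; the following Python sketch shows one plausible way to obtain such colors, assuming k-means clustering of downsampled RGB pixels (Pillow and scikit-learn). The function name extract_dominant_colors, the cluster count k, and the hex output format are illustrative choices, not the authors' code.

# Minimal sketch of dominant-color extraction from a webpage screenshot.
# Assumes k-means clustering in RGB space; all parameters are illustrative.
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

def extract_dominant_colors(screenshot_path: str, k: int = 5) -> list[str]:
    """Return the k cluster centers as hex color codes, most frequent first."""
    img = Image.open(screenshot_path).convert("RGB")
    img.thumbnail((256, 256))                      # downsample to keep clustering cheap
    pixels = np.asarray(img).reshape(-1, 3)

    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)
    counts = np.bincount(km.labels_, minlength=k)

    # Order cluster centers by how many pixels they cover (most dominant first).
    order = np.argsort(counts)[::-1]
    return ["#%02x%02x%02x" % tuple(c) for c in km.cluster_centers_[order].astype(int)]

if __name__ == "__main__":
    print(extract_dominant_colors("screenshot.png", k=5))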
dc.description.abstract (en): Manually converting visual user interface designs into code is both complex and time-consuming, particularly for non-experts. With the rapid development of multi-modal large language models (MLLMs), there has been significant progress in generating accurate HTML/CSS code from website screenshots. However, most existing approaches depend on extensive training datasets and costly model fine-tuning, underscoring the limitations of current models in capturing key web design features. Even when plain-text descriptions of the given website screenshot are incorporated into the prompts, the generated code remains significantly different from the ground truth. To address these challenges, this paper introduces a novel method that leverages MLLMs by embedding dominant color codes and text margins extracted from website screenshots directly into the input prompt. Our approach aims to reduce reliance on large-scale training and fine-tuning while enhancing the extraction of web design elements critical for code generation. Experimental results demonstrate that incorporating color and positional information of text elements significantly improves the consistency and accuracy of translating website screenshots into functional code. Despite its low computational complexity, our method outperforms state-of-the-art methods on code generation tasks, achieving superior results without requiring additional model training or fine-tuning.
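The abstract also describes embedding text margins into the prompt. As a rough sketch only, the snippet below derives per-element margins from OCR bounding boxes and splices them, together with the dominant colors from the previous sketch, into a prompt string. It assumes the PaddleOCR 2.x ocr() interface (reference [7] above), which returns one list of (bounding box, (text, score)) pairs per page; the margin definition, the build_prompt helper, and the prompt wording are illustrative and are not taken from the thesis, whose actual prompts appear in Appendix A (see the table of contents below).

# Minimal sketch: derive left/top margins of text elements from OCR boxes and
# assemble a prompt containing both colors and text positions. Names, margin
# definition, and prompt wording are illustrative assumptions.
from paddleocr import PaddleOCR

def extract_text_margins(screenshot_path):
    # One detector instance with the English model; heavier options omitted.
    ocr = PaddleOCR(lang="en")
    result = ocr.ocr(screenshot_path)
    lines = result[0] if result and result[0] else []
    elements = []
    for box, (text, _score) in lines:
        xs = [point[0] for point in box]
        ys = [point[1] for point in box]
        elements.append({
            "text": text,
            "margin_left": int(min(xs)),   # px from the left edge of the screenshot
            "margin_top": int(min(ys)),    # px from the top edge of the screenshot
            "width": int(max(xs) - min(xs)),
            "height": int(max(ys) - min(ys)),
        })
    return elements

def build_prompt(dominant_colors, elements):
    color_part = "Dominant colors, most frequent first: " + ", ".join(dominant_colors)
    text_part = "\n".join(
        f'- "{e["text"]}" at left={e["margin_left"]}px, top={e["margin_top"]}px, '
        f'size={e["width"]}x{e["height"]}px'
        for e in elements
    )
    return (
        "Reproduce the attached webpage screenshot as a single self-contained HTML "
        "file with inline CSS.\n"
        + color_part + "\n"
        + "Text elements and their positions:\n" + text_part
    )

The resulting prompt string would then be supplied to the backbone MLLM alongside the screenshot itself.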
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-11-26T16:13:25Z. No. of bitstreams: 0 (en)
dc.description.provenance: Made available in DSpace on 2025-11-26T16:13:25Z (GMT). No. of bitstreams: 0 (en)
dc.description.tableofcontents:
摘要 i
Abstract iii
Contents v
List of Figures vii
List of Tables ix
Chapter 1 Introduction 1
Chapter 2 Related Work 3
2.1 Multi-modal Large Language Model 3
2.2 MLLMs for Webpage Generation 4
2.3 Prompt Engineering 5
Chapter 3 Methodology 7
3.1 From UI Design into Front-end Code 8
3.1.1 Screen Size 8
3.1.2 Color Usage 8
3.1.3 Spacing Rules 9
3.2 Key Feature Extraction: Dominant Colors and Text Margins 9
3.2.1 Dominant Color Extraction 9
3.2.2 Extraction of Text Element Margins 10
Chapter 4 Experiments 13
4.1 Datasets 13
4.2 Backbone Models 14
4.3 Evaluation Metrics 14
4.4 Quantitative Results 15
4.5 Qualitative Results 16
4.6 User Study 16
Chapter 5 Discussion 21
5.1 Closed-source vs. Open-source Models 21
5.2 Correlation Analysis: Enhancing Performance in Challenging Webpage Metrics 22
5.3 Effect of the Number of Dominant Colors 23
Chapter 6 Conclusion 27
6.1 Contribution 27
6.2 Limitation 29
6.3 Future Work 29
References 31
Appendix A Prompting Details 33
A.1 Direct Asking 33
A.2 Our Method Prompting 33
Appendix B User Study Details 35
dc.language.iso: en
dc.subject: 多模態大型語言模型
dc.subject: 網頁程式碼生成
dc.subject: 電腦視覺
dc.subject: Webpage Code Generation
dc.subject: MLLM
dc.subject: Visual-to-Code
dc.title: 基於網頁截圖生成前端程式碼:結合主要配色與文字邊距擷取資訊 (zh_TW)
dc.title: Webpage Code Generation from Screenshots with Dominant Color and Text Margin Extraction (en)
dc.type: Thesis
dc.date.schoolyear: 114-1
dc.description.degree: 碩士
dc.contributor.oralexamcommittee: 朱宏國;王昱舜 (zh_TW)
dc.contributor.oralexamcommittee: Hung-Kuo Chu;Yu-Shuen Wang (en)
dc.subject.keyword: 多模態大型語言模型, 網頁程式碼生成, 電腦視覺 (zh_TW)
dc.subject.keyword: Webpage Code Generation, MLLM, Visual-to-Code (en)
dc.relation.page: 36
dc.identifier.doi: 10.6342/NTU202504621
dc.rights.note: 同意授權(全球公開)
dc.date.accepted: 2025-10-29
dc.contributor.author-college: 管理學院
dc.contributor.author-dept: 資訊管理學系
dc.date.embargo-lift: 2025-11-27
Appears in Collections: 資訊管理學系 (Department of Information Management)

Files in This Item:
File: ntu-114-1.pdf | Size: 1.4 MB | Format: Adobe PDF