Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/95813

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 傅立成 | zh_TW |
| dc.contributor.advisor | Li-Chen Fu | en |
| dc.contributor.author | 林家禾 | zh_TW |
| dc.contributor.author | Jia-He Lin | en |
| dc.date.accessioned | 2024-09-18T16:10:30Z | - |
| dc.date.available | 2024-09-19 | - |
| dc.date.copyright | 2024-09-18 | - |
| dc.date.issued | 2024 | - |
| dc.date.submitted | 2024-08-06 | - |
| dc.identifier.citation | REFERENCES
[1] Prof. Hung-yi Lee. Diffusion model. https://speech.ee.ntu.edu.tw/~hylee/ml/ml2023-course-data/DiffusionModel%20(v2).pdf, 2023.
[2] Prof. Hung-yi Lee. Stable diffusion. https://speech.ee.ntu.edu.tw/~hylee/ml/ml2023-course-data/StableDiffusion%20(v2).pdf, 2023.
[3] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[4] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
[5] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[6] cloneofsimo. lora. https://github.com/cloneofsimo/lora, 2023.
[7] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed AA Osman, Dimitrios Tzionas, and Michael J Black. Expressive body capture: 3D hands, face, and body from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10975–10985, 2019.
[8] Chaoqun Gong, Yuqin Dai, Ronghui Li, Achun Bao, Jun Li, Jian Yang, Yachao Zhang, and Xiu Li. Text2Avatar: Text to 3D human avatar generation with codebook-driven body controllable attribute. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 16–20. IEEE, 2024.
[9] Fangzhou Hong, Mingyuan Zhang, Liang Pan, Zhongang Cai, Lei Yang, and Ziwei Liu. AvatarCLIP: Zero-shot text-driven generation and animation of 3D avatars. arXiv preprint arXiv:2205.08535, 2022.
[10] Yukang Cao, Yan-Pei Cao, Kai Han, Ying Shan, and Kwan-Yee K Wong. DreamAvatar: Text-and-shape guided 3D human avatar generation via diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 958–968, 2024.
[11] Nikos Kolotouros, Thiemo Alldieck, Andrei Zanfir, Eduard Bazavan, Mihai Fieraru, and Cristian Sminchisescu. DreamHuman: Animatable 3D avatars from text. Advances in Neural Information Processing Systems, 36, 2024.
[12] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[13] Blender. Shape key. https://docs.blender.org/manual/en/latest/animation/shape_keys/index.html, 2024.
[14] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.
[15] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[16] Unity Technologies. Unity. https://unity.com/, 2024.
[17] Blender. Blender. https://www.blender.org/, 2024.
[18] Dragomir Anguelov, Praveen Srinivasan, Daphne Koller, Sebastian Thrun, Jim Rodgers, and James Davis. SCAPE: Shape completion and animation of people. In ACM SIGGRAPH 2005 Papers, pages 408–416. 2005.
[19] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, October 2015.
[20] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[21] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[22] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.
[23] OpenAI. GPT-3.5-Turbo. https://platform.openai.com/docs/models/gpt-3-5-turbo, 2023.
[24] Google. gemma-7b. https://huggingface.co/google/gemma-7b, 2024.
[25] Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
[26] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
[27] OpenAI. GPT-4. https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4, 2023.
[28] OpenAI. GPT-4o. https://platform.openai.com/docs/models/gpt-4o, 2024.
[29] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[30] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III, pages 234–241. Springer, 2015.
[31] Lennart Demes. ambientCG. https://ambientcg.com/, 2017.
[32] RunDiffusion. Juggernaut-XL-v6. https://huggingface.co/RunDiffusion/Juggernaut-XL-v6, 2024.
[33] danielgatis. rembg. https://github.com/danielgatis/rembg, 2020.
[34] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, November 2019. | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/95813 | - |
| dc.description.abstract | 近年來,隨著元宇宙(Metaverse)技術的發展,虛擬分身(Avatar)扮演著至關重要的角色。但傳統的三維虛擬分身建模需要花費大量的時間及資源。為了解決這個問題,設計一個可以直接從文字產生三維虛擬分身的系統變得越來越重要,這種方法不僅降低了分身創建的門檻,還提供了更大的彈性。透過分析使用者提供的文字敘述,如性別、人種、身形與服裝等資訊,系統可以幫助其他模組預測及生成出符合敘述的虛擬分身。
比起傳統的角色建模,由文字生成虛擬分身雖然節省了大量時間,但現有方法還是需要數小時來生成,難以快速地讓使用者得到想要的結果。因此,本論文提出了TeCHAvatar,一個基於SMPL-X的虛擬分身生成系統,分別預測人種、形狀和生成紋理圖,並整合成一個完整的虛擬分身。這個過程僅需花費數十秒。由於現有的SOTA方法所產生出的大多是屬於隱式的虛擬分身,難以建立一組帶有骨骼的三維虛擬人物網格,無法應對後續像是骨骼動畫的應用。因此我們以SMPL-X的骨骼為基礎,將服裝綁定在同一骨骼上,並在虛擬分身變形後自動調整關節點到適當的位置,確保骨骼還是有效的。此外,為了提升系統對服裝提示詞的理解程度,我們提出了CTLoRA,用來幫助生成指定服裝材質的紋理圖。我們還設計了專用的著色器,用於渲染虛擬分身。 | zh_TW |
| dc.description.abstract | In recent years, with the development of metaverse technology, avatars have played a crucial role. However, traditional 3D avatar modeling requires a significant amount of time and resources. To address this issue, designing a system that can generate 3D avatars directly from text has become increasingly important. This approach not only lowers the barrier to avatar creation but also provides greater flexibility. By analyzing user-provided textual descriptions, such as gender, race, body shape, and clothing information, the system can assist other modules in predicting and generating avatars that match the descriptions.
Compared to traditional character modeling, generating avatars from text saves a great deal of time, but existing methods still require several hours to produce results, making it difficult for users to quickly obtain the desired outcome. Therefore, this thesis proposes TeCHAvatar, an avatar generation system based on SMPL-X that separately predicts race and shape and generates texture maps from the given text, then integrates them into a complete avatar. This process takes only a few seconds. Since most avatars produced by current state-of-the-art methods are implicit and cannot provide a 3D avatar mesh with a skeleton, they are unsuitable for subsequent applications such as skeletal animation. Hence, we build on the SMPL-X skeleton, binding the clothing to the same skeleton and automatically adjusting the joints to appropriate positions after the avatar deforms, ensuring the skeleton remains valid. Additionally, to enhance the system's understanding of clothing prompts, we propose CTLoRA, which helps generate texture maps for specified clothing materials. We also design specialized shaders for rendering the final avatars. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-09-18T16:10:30Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2024-09-18T16:10:30Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Acknowledgements i
Chinese Abstract ii
ABSTRACT iii
CONTENTS v
LIST OF FIGURES viii
LIST OF TABLES x
Chapter 1 Introduction 1
1.1 Background 1
1.2 Motivation 2
1.3 Related Work 3
1.3.1 Text-to-3D Avatar 3
1.4 Objectives and Contributions 4
1.5 Thesis Organization 6
Chapter 2 Preliminaries 8
2.1 3D Human Mesh 8
2.1.1 SMPL-X 10
2.2 Large Language Model 11
2.2.1 Gemma7b 12
2.2.2 GPT-3.5-Turbo 13
2.3 Diffusion Model 13
2.3.1 Latent Diffusion Model 15
2.3.2 Stable Diffusion 18
2.3.3 Stable Diffusion XL 19
2.3.4 Stable Diffusion 3 20
2.4 Low-Rank Adaptation 20
2.4.1 LoRA of SDXL 21
2.5 Unity 22
2.5.1 Shader 23
2.6 Blender 24
2.6.1 Cloth Physics Simulation 25
2.6.2 Surface Deform Modifier 25
Chapter 3 Methodology 27
3.1 System Overview 27
3.2 Text Decoupling Module 29
3.3 Texture Generating Module 31
3.3.1 Generate Texture Maps for Clothing 31
3.3.2 Generate Prints on Clothing 33
3.4 Normal Predicting Module 35
3.4.1 Normal Library 36
3.4.2 Inference 36
3.5 Shape Predicting Module 37
3.5.1 Shape Key 38
3.5.2 Training 39
3.5.3 Inference 41
3.6 Skin Color Predicting Module 42
3.6.1 Definition 42
3.6.2 Training 44
3.6.3 Inference 45
3.7 Bottom-Lerp Avatar Shader 46
3.7.1 Preprocessing 47
3.7.2 Inference 49
3.8 Multi-Prints Clothing Shader 50
3.8.1 Preprocessing 51
3.8.2 Inference 52
3.9 Avatar and Top Deforming 53
Chapter 4 Experiments 55
4.1 Experimental Setup 55
4.2 Evaluation Metrics 56
4.2.1 R-Precision 56
4.3 Experimental Results 56
4.3.1 Qualitative Results 56
4.3.2 Quantitative Results 57
4.4 Ablation Studies 58
Chapter 5 Conclusion 70
REFERENCES 72 | - |
| dc.language.iso | en | - |
| dc.title | 基於SMPL-X應用擴散模型的文字到可動畫三維人體虛擬分身服裝生成 | zh_TW |
| dc.title | TeCHAvatar: Text to Clothing on Animatable 3D Human Avatar based on SMPL-X with Diffusion Model | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 112-2 | - |
| dc.description.degree | Master's | - |
| dc.contributor.oralexamcommittee | 歐陽明;陳彥仰;莊永裕;徐偉恩;鄭龍磻 | zh_TW |
| dc.contributor.oralexamcommittee | Ming Ouh-Young;Mike Chen;Yung-Yu Chuang;Vincent Hsu;Lung-Pan Cheng | en |
| dc.subject.keyword | 虛擬分身,文字到虛擬分身,生成式模型,擴散模型 | zh_TW |
| dc.subject.keyword | Avatar, Text to Avatar, Generative Model, Diffusion Model | en |
| dc.relation.page | 76 | - |
| dc.identifier.doi | 10.6342/NTU202403191 | - |
| dc.rights.note | Not authorized | - |
| dc.date.accepted | 2024-08-09 | - |
| dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
| dc.contributor.author-dept | Department of Computer Science and Information Engineering | - |
Appears in Collections: Department of Computer Science and Information Engineering

Files in This Item:

| File | Size | Format |
|---|---|---|
| ntu-112-2.pdf Restricted Access | 19.15 MB | Adobe PDF |

All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
