Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/95814
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 傅立成 | zh_TW |
dc.contributor.advisor | Li-Chen Fu | en |
dc.contributor.author | 游鈞皓 | zh_TW |
dc.contributor.author | Chun-Hau Yu | en |
dc.date.accessioned | 2024-09-18T16:10:54Z | - |
dc.date.available | 2024-09-19 | - |
dc.date.copyright | 2024-09-18 | - |
dc.date.issued | 2024 | - |
dc.date.submitted | 2024-08-09 | - |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/95814 | - |
dc.description.abstract | AR/VR 技術因其在教育、社交活動、數位娛樂和遠程協作等潛在應用而在近年來備受關注。而在虛擬空間中,每個使用者都需要一個 3D 虛擬分身來代表自己。因此,本論文提出了 HARDER,一個三維虛擬分身重建系統,讓使用者可以重建出一個完整著色、穿戴服裝且可實時動畫的三維虛擬分身。
現有的大多數方法,若不是對資料要求過於嚴格(如需要深度資訊或多視角圖片),就是在特定區域出現顯著的性能下降。因此,我們引入了 Score Distillation Sampling (SDS) 技術,並設計了 FSIC 和 RADR 模組,讓 Latent Diffusion Model (LDM) 引導重建過程(特別是在不可見區域)以提升重建結果。此外,我們開發了多種訓練策略,包括 personalized LDM、delayed SDS、focused SDS 及 multi-pose SDS,讓訓練過程更有效率。我們的虛擬分身採用顯式表示,與現代大多數電腦圖學管線相容。並且,HARDER 僅需一張 RGB 圖片即能產生高度逼真的三維虛擬分身,且整個重建和實時動畫過程可在單一消費級 GPU 上完成,使此應用更加普及。 | zh_TW |
dc.description.abstract | AR/VR technology has gained much attention in recent years due to its potential in applications such as education, social activities, digital entertainment, and remote collaboration. In the virtual space, a 3D human avatar is needed to represent each user. Therefore, we propose HARDER, a 3D human avatar reconstruction system that allows users to reconstruct a fully textured, clothed, and real-time animatable 3D human avatar.
Most existing methods either impose overly strict data requirements, such as depth information or multi-view images, or suffer significant performance drops in specific regions. To address these challenges, we introduce the Score Distillation Sampling (SDS) technique and design the FSIC and RADR modules to let the Latent Diffusion Model (LDM) guide the reconstruction process, especially in unseen regions. Furthermore, we develop several training strategies, including personalized LDM, delayed SDS, focused SDS, and multi-pose SDS, to make training more efficient. Our avatars use an explicit representation that is compatible with modern computer graphics pipelines. Moreover, HARDER can generate a highly realistic 3D avatar from just a single RGB image, and the entire reconstruction and real-time animation pipeline runs on a single consumer-grade GPU, making the application more accessible. | en |
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-09-18T16:10:54Z No. of bitstreams: 0 | en |
dc.description.provenance | Made available in DSpace on 2024-09-18T16:10:54Z (GMT). No. of bitstreams: 0 | en |
dc.description.tableofcontents | Acknowledgements
Chinese Abstract
ABSTRACT
CONTENTS
LIST OF FIGURES
LIST OF TABLES
Chapter 1 Introduction
1.1 Background
1.2 Motivation
1.3 Objectives
1.4 Related Work
1.4.1 Implicit approaches
1.4.2 Explicit approaches
1.4.3 Related work comparison
1.5 Thesis Organization
Chapter 2 Preliminaries
2.1 Parametric Human Model
2.2 Human Pose Estimation
2.3 Normal Estimation
2.4 Super Resolution
2.5 Human Landmark Detection
2.6 Foreground Segmentation
2.7 Face Detection
2.8 Multimodal LLM
2.9 Differentiable Renderer
2.10 Text-to-Image Latent Diffusion Models (T2I LDM)
2.11 Face Recognition
Chapter 3 Methodology
3.1 Data Preprocessing
3.1.1 Landmark-Guided Segmentation (LaGS)
3.1.2 Hand-Excluded Reconstruction (HER)
3.2 Geometry Initialization
3.2.1 SMPL-X Rigging
3.3 Feature-Specific Image Captioning (FSIC)
3.4 Region-Aware Differentiable Rendering (RADR)
3.4.1 Camera Pose Generation
3.4.2 Detection-Based Filtering
3.5 Score Distillation Sampling (SDS)
3.5.1 Score-Based Generative Models
3.5.2 Text-to-Image Latent Diffusion Models
3.5.3 SDS Loss
3.6 Training Parameters
3.7 Training Objective
3.7.1 Image Loss
3.7.1.1 Body Reconstruction Loss
3.7.1.2 Face Reconstruction Loss
3.7.1.3 Face Recognition Loss
3.7.1.4 Chamfer Distance Loss
3.7.2 SDS Loss
3.8 Training Strategy
3.8.1 Personalized LDM
3.8.2 Delayed SDS
3.8.3 Focused SDS
3.8.4 Multi-Pose SDS
3.9 Limitation
3.10 Applications
3.10.1 Animation
3.10.2 Outfit Editing
Chapter 4 Experiments
4.1 Ablation Study
4.1.1 Ablation Study on Super Resolution
4.1.2 Ablation Study on LaGS
4.1.3 Ablation Study on HER
4.1.4 Ablation Study on L_face
4.1.5 Ablation Study on L_cd
4.1.6 Ablation Study on Personalized LDM
4.1.7 Ablation Study on Delayed SDS
4.1.8 Ablation Study on Focused SDS
4.1.9 Ablation Study on Multi-Pose SDS
4.2 Comparison
4.2.1 Qualitative Comparison
4.2.2 Quantitative Comparison
4.2.3 Reconstruction Time Comparison
4.2.4 User Study
Chapter 5 Conclusion
REFERENCES | - |
dc.language.iso | en | - |
dc.title | 三維虛擬分身重建系統基於分數蒸餾採樣與顯式表示 | zh_TW |
dc.title | HARDER: 3D Human Avatar Reconstruction with Distillation and Explicit Representation | en |
dc.type | Thesis | - |
dc.date.schoolyear | 112-2 | - |
dc.description.degree | Master's | - |
dc.contributor.oralexamcommittee | 歐陽明;陳彥仰;莊永裕;鄭龍磻;徐偉恩 | zh_TW |
dc.contributor.oralexamcommittee | Ming Ouh-young;Mike Chen;Yung-Yu Chuang;Lung-Pan Cheng;Vincent Hsu | en |
dc.subject.keyword | 虛擬分身,三維人體重建,潛在擴散模型,分數蒸餾採樣 | zh_TW |
dc.subject.keyword | Avatar, 3D Human Reconstruction, Latent Diffusion Models, Score Distillation Sampling | en |
dc.relation.page | 70 | - |
dc.identifier.doi | 10.6342/NTU202403194 | - |
dc.rights.note | Not authorized | - |
dc.date.accepted | 2024-08-12 | - |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
dc.contributor.author-dept | Department of Computer Science and Information Engineering | - |
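The abstract and keywords above center on Score Distillation Sampling (SDS). For orientation, the following is a minimal sketch of the SDS gradient as defined in DreamFusion (Poole et al., 2022), the technique the thesis builds on; the notation follows that paper rather than the thesis itself:

```latex
% SDS gradient from DreamFusion (Poole et al., 2022); notation follows that
% paper, not the thesis. theta parameterizes the 3D representation, and
% x = g(theta) is a differentiably rendered view. The frozen diffusion model
% predicts the noise epsilon added to x_t (the view noised to timestep t)
% under text condition y; w(t) is a timestep-dependent weight.
\nabla_{\theta}\,\mathcal{L}_{\mathrm{SDS}}\bigl(\phi,\, x = g(\theta)\bigr)
  = \mathbb{E}_{t,\,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_{\phi}(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right]
```

In a latent-diffusion setting such as the one the abstract describes, x would be replaced by the encoded latent of the rendered view; in either case the diffusion model stays frozen and the weighted noise residual is backpropagated through the renderer into the avatar's explicit geometry and texture.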
Appears in Collections: | Department of Computer Science and Information Engineering
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-112-2.pdf (currently not authorized for public access) | 2.55 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.