NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99611
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | 傅立成 | zh_TW
dc.contributor.advisor | Li-Chen Fu | en
dc.contributor.author | 余承諺 | zh_TW
dc.contributor.author | Cheng-Yen Yu | en
dc.date.accessioned | 2025-09-17T16:08:17Z | -
dc.date.available | 2025-09-18 | -
dc.date.copyright | 2025-09-17 | -
dc.date.issued | 2025 | -
dc.date.submitted | 2025-08-11 | -

dc.identifier.citation | -
[1] T. Alldieck, M. Magnor, W. Xu, C. Theobalt, and G. Pons-Moll. Video based reconstruction of 3d people models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8387–8397, Jun 2018. CVPR Spotlight Paper.
[2] J. de Vries. Learnopengl: Normal mapping, 2017. Accessed: 2025-07-19.
[3] J. Hessel, A. Holtzman, M. Forbes, R. L. Bras, and Y. Choi. Clipscore: A reference-free evaluation metric for image captioning, 2022.
[4] J. Hughes. Computer Graphics: Principles and Practice. The systems programming series. Addison-Wesley, 2014.
[5] T. Jiang, X. Chen, J. Song, and O. Hilliges. Instantavatar: Learning avatars from monocular video in 60 seconds. arXiv, 2022.
[6] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), July 2023.
[7] J. Kim, G. Gu, M. Park, S. Park, and J. Choo. Stableviton: Learning semantic correspondence with latent diffusion model for virtual try-on. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8176–8185, 2024.
[8] T. Kim, B. Kim, S. Saito, and H. Joo. Gala: Generating animatable layered assets from a single scan. In CVPR, 2024.
[9] D. P. Kingma and M. Welling. Auto-encoding variational bayes, 2022.
[10] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, and R. Girshick. Segment anything. arXiv:2304.02643, 2023.
[11] J. Lei, Y. Wang, G. Pavlakos, L. Liu, and K. Daniilidis. Gart: Gaussian articulated template models, 2023.
[12] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
[13] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, Oct. 2015.
[14] L. Medeiros, S. K. Sah, K. de Moraes Vestena, J. Doran, KabirSubbiah, R. Brown, Z. Deane-Mayer, dolhasz, f-fl0, littlestronomer, Bogay, D. Bansal, E. Özgüroğlu, K. Shah, M. Matějek, healthonrails, mutusfa, and rballachay. luca-medeiros/lang-segment-anything. https://github.com/luca-medeiros/lang-segment-anything, June 30, 2025.
[15] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
[16] G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A. A. A. Osman, D. Tzionas, and M. J. Black. Expressive body capture: 3D hands, face, and body from a single image. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 10975–10985, 2019.
[17] N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K. V. Alwala, N. Carion, C.-Y. Wu, R. Girshick, P. Dollár, and C. Feichtenhofer. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.
[18] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models, 2021.
[19] A. Shamir. A survey on mesh segmentation techniques. Computer Graphics Forum, 27(6):1539–1556, 2008.
[20] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[21] J. Wen, X. Zhao, Z. Ren, A. Schwing, and S. Wang. Gomavatar: Efficient animatable human modeling from monocular video using gaussians-on-mesh. In CVPR, 2024.
[22] H. Zhang, Y. Tian, Y. Zhang, M. Li, L. An, Z. Sun, and Y. Liu. Pymaf-x: Towards well-aligned full-body model regression from monocular images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[23] L. Zhang, A. Rao, and M. Agrawala. Adding conditional control to text-to-image diffusion models, 2023.
[24] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[25] F. Zhao, Z. Xie, M. Kampffmeyer, H. Dong, S. Han, T. Zheng, T. Zhang, and X. Liang. M3d-vton: A monocular-to-3d virtual try-on network. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 13239–13249, October 2021.

dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99611 | -

dc.description.abstract | zh_TW
在本論文中,我們提出了一個兩階段模組化框架,用於從單目 RGB 影片中重建虛擬分身,旨在支援服裝級控制和虛擬分身組合。為此,我們定義了兩個階段,即解纏階段和組合階段。在解纏階段,我們的系統將輸入影片分解為具有語義意義的組件,包括皮膚紋理和一組帶有紋理的服裝網格,每個組件都與一個規範姿勢對齊,並由用戶提供的文字提示引導。此階段利用參數模型(例如 SMPL-X)進行姿勢和形狀估計,確保跨幀的一致性。

在組合階段,解纏後的組件將根據使用者定義的設定(例如體型和服裝選擇)重新組合成一個統一的、可動畫化的網格。產生的虛擬分身支援運動重新導向和渲染,同時支援靈活的服裝重組。

我們的框架強調模組化、可重複使用性和速度,允許快速創建虛擬分身,且僅需少量品質權衡。實驗結果展現了高度的視覺連貫性、服裝完整性以及對構圖的支持。作為未來的發展方向,我們設想在系統中擴展一個 CLIP 引導的服裝檢索模組。這將使用戶能夠透過自然語言描述進行直覺的、基於文字的虛擬分身編輯。

dc.description.abstract | en
In this thesis, we present a two-stage modular framework for avatar reconstruction from monocular RGB videos, designed to support garment-level control and avatar composition. To this end, two stages are defined, namely, the Disentanglement Stage and the Composition Stage. In the Disentanglement Stage, our system decomposes the input video into semantically meaningful components, including a skin texture and a set of textured clothing meshes, each aligned to a canonical pose and guided by user-provided textual prompts. This stage leverages parametric models (e.g., SMPL-X) for pose and shape estimation, ensuring consistency across frames.

In the Composition Stage, the disentangled components are reassembled into a unified, animatable mesh based on user-defined settings, such as body shape and clothing selection. The resulting avatar supports motion retargeting and rendering while enabling flexible garment recombination.

Our framework emphasizes modularity, reusability, and speed, allowing rapid avatar creation with only minor quality trade-offs. Experimental results demonstrate high visual coherence, garment integrity, and support for composition. As a future direction, we envision extending the system with a CLIP-guided garment retrieval module. This would enable intuitive, text-based avatar editing through natural language descriptions.
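
To make the two-stage flow described in the abstract concrete, the sketch below shows one way such a pipeline could be organized in Python. It is a minimal illustration only: the names (GarmentMesh, DisentangledAssets, disentangle, compose) are hypothetical, the bodies are placeholders, and nothing here is taken from the thesis implementation.

# Hedged sketch of the two-stage pipeline from the abstract; all names are
# illustrative placeholders, not the thesis codebase.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class GarmentMesh:
    prompt: str                                   # text prompt that selected this garment
    vertices: list = field(default_factory=list)  # canonical-pose geometry
    texture: bytes = b""                          # painted garment texture

@dataclass
class DisentangledAssets:
    skin_texture: bytes            # completed skin texture in the canonical pose
    garments: List[GarmentMesh]    # one textured mesh per prompted garment
    smplx_params: Dict[str, list]  # per-frame SMPL-X pose/shape estimates

def disentangle(video_frames: list, prompts: List[str]) -> DisentangledAssets:
    """Stage 1: decompose the input video into reusable canonical-pose assets."""
    smplx_params = {"betas": [0.0] * 10, "poses": []}    # placeholder estimates
    garments = [GarmentMesh(prompt=p) for p in prompts]  # one mesh per prompt
    return DisentangledAssets(skin_texture=b"", garments=garments,
                              smplx_params=smplx_params)

def compose(assets: DisentangledAssets, body_shape: list,
            selected: List[str]) -> dict:
    """Stage 2: reassemble chosen garments onto a user-defined body shape."""
    chosen = [g for g in assets.garments if g.prompt in selected]
    # A full system would layer the meshes (e.g., via SDFs) and bind skinning
    # weights for animation; this stub only returns the selection to show the flow.
    return {"body_shape": body_shape, "layers": [g.prompt for g in chosen]}

# Example: rebuild an avatar wearing only the shirt recovered from the video.
assets = disentangle(video_frames=[], prompts=["shirt", "pants"])
avatar = compose(assets, body_shape=[0.0] * 10, selected=["shirt"])
print(avatar["layers"])  # ['shirt']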

dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-09-17T16:08:17Z. No. of bitstreams: 0 | en
dc.description.provenance | Made available in DSpace on 2025-09-17T16:08:17Z (GMT). No. of bitstreams: 0 | en

dc.description.tableofcontents | -
致謝 (Acknowledgements) i
中文摘要 (Chinese Abstract) iii
Abstract v
Contents vii
List of Figures xi
List of Tables xiii
Chapter 1 Introduction 1
1.1 Background 1
1.2 Motivation 1
1.3 Objectives 2
1.4 Contributions 3
1.5 Related Work 5
1.6 Thesis Organization 6
Chapter 2 Preliminaries 9
2.1 Mesh Representation 9
2.2 SMPL-X 11
2.3 Gaussian Splatting 12
2.4 Signed Distance Field (SDF) 14
2.5 Object Segmentation 15
2.5.1 Segment Anything Model 2 (SAM2) 15
2.5.2 Grounding DINO 16
2.5.3 LangSAM: Language Segment-Anything Model 18
2.6 Diffusion Model 19
2.6.1 Stable Diffusion 19
2.6.2 ControlNet 20
Chapter 3 Methodology 23
3.1 Disentanglement 26
3.1.1 Video Segmentation 26
3.1.2 Geometry Reconstruction 28
3.1.3 Mesh Segmentation 32
3.1.4 Skin Completion 40
3.1.5 Object Texture Painting 43
3.1.6 Summary 45
3.2 Composition 46
3.2.1 Canonical Model SDF Construction 47
3.2.2 Mesh Layering 49
3.2.3 Summary 52
Chapter 4 Experiments 55
4.1 Environment 55
4.2 Dataset 56
4.3 Qualitative Results 56
4.4 Quantitative Evaluation 57
4.5 Ablation Study 63
Chapter 5 Conclusion 67
References 69
dc.language.iso | en | -
dc.subject | 虛擬分身 | zh_TW
dc.subject | 三維人體重建 | zh_TW
dc.subject | 模組化分身合成 | zh_TW
dc.subject | 服裝解纏 | zh_TW
dc.subject | 高斯潑濺 | zh_TW
dc.subject | 擴散模型 | zh_TW
dc.subject | Modular Avatar Composition | en
dc.subject | Virtual Avatar | en
dc.subject | Diffusion Model | en
dc.subject | Gaussian Splatting | en
dc.subject | Garment Disentanglement | en
dc.subject | 3D Human Reconstruction | en
dc.title | MAC-RAD:透過從 RGB 視訊分離可重複使用資產實現模組化虛擬分身合成 | zh_TW
dc.title | MAC-RAD: Modular Avatar Composition via Reusable Assets Disentanglement from RGB Video | en
dc.type | Thesis | -
dc.date.schoolyear | 113-2 | -
dc.description.degree | 碩士 (Master) | -
dc.contributor.oralexamcommittee | 歐陽明;陳祝嵩;莊永裕;鄭龍磻;徐偉恩 | zh_TW
dc.contributor.oralexamcommittee | Ming Ouhyoung;Chu-Song Chen;Yung-Yu Chuang;Lung-Pan Cheng;Wei-En Hsu | en
dc.subject.keyword | 虛擬分身,三維人體重建,模組化分身合成,服裝解纏,高斯潑濺,擴散模型 | zh_TW
dc.subject.keyword | Virtual Avatar,3D Human Reconstruction,Modular Avatar Composition,Garment Disentanglement,Gaussian Splatting,Diffusion Model | en
dc.relation.page | 72 | -
dc.identifier.doi | 10.6342/NTU202503094 | -
dc.rights.note | 同意授權(全球公開) (authorization granted, publicly accessible worldwide) | -
dc.date.accepted | 2025-08-13 | -
dc.contributor.author-college | 電機資訊學院 (College of Electrical Engineering and Computer Science) | -
dc.contributor.author-dept | 資訊工程學系 (Department of Computer Science and Information Engineering) | -
dc.date.embargo-lift | 2028-08-01 | -
Appears in collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in this item:
File | Size | Format
ntu-113-2.pdf (publicly available online after 2028-08-01) | 37.44 MB | Adobe PDF