Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96805
Full metadata record
DC field: value (language)
dc.contributor.advisor: 王鈺強 (zh_TW)
dc.contributor.advisor: Yu-Chiang Frank Wang (en)
dc.contributor.author: 吳彬世 (zh_TW)
dc.contributor.author: Bin-Shih Wu (en)
dc.date.accessioned: 2025-02-21T16:37:58Z
dc.date.available: 2025-02-22
dc.date.copyright: 2025-02-21
dc.date.issued: 2025
dc.date.submitted: 2024-12-26
dc.identifier.citation:
P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas. Learning representations and generative models for 3d point clouds. In International conference on machine learning, pages 40–49. PMLR, 2018.
M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214–223. PMLR, 2017.
Z. Cao, F. Hong, T. Wu, L. Pan, and Z. Liu. Large-vocabulary 3d diffusion model with transformer. arXiv preprint arXiv:2309.07920, 2023.
E. R. Chan, C. Z. Lin, M. A. Chan, K. Nagano, B. Pan, S. De Mello, O. Gallo, L. J. Guibas, J. Tremblay, S. Khamis, et al. Efficient geometry-aware 3d generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16123–16133, 2022.
A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
H. Chefer, Y. Alaluf, Y. Vinker, L. Wolf, and D. Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG), 42(4):1–10, 2023.
K. Chen, C. B. Choy, M. Savva, A. X. Chang, T. Funkhouser, and S. Savarese. Text2shape: Generating shapes from natural language by learning joint embeddings. In Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part III 14, pages 100–116. Springer, 2019.
M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, and I. Sutskever. Generative pretraining from pixels. In International conference on machine learning, pages 1691–1703. PMLR, 2020.
Z. Chen and H. Zhang. Learning implicit fields for generative shape modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5939–5948, 2019.
Y.-C. Cheng, H.-Y. Lee, S. Tulyakov, A. G. Schwing, and L.-Y. Gui. Sdfusion: Multimodal 3d shape completion, reconstruction, and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4456–4465, 2023.
B. O. Community. Blender - a 3D modelling and rendering package. Blender Foundation, Stichting Blender Foundation, Amsterdam, 2018.
W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.
M. Ding, Z. Yang, W. Hong, W. Zheng, C. Zhou, D. Yin, J. Lin, X. Zou, Z. Shao, H. Yang, et al. Cogview: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, 34:19822–19835, 2021.
W. Feng, X. He, T.-J. Fu, V. Jampani, A. Akula, P. Narayana, S. Basu, X. E. Wang, and W. Y. Wang. Training-free structured diffusion guidance for compositional text-to-image synthesis. arXiv preprint arXiv:2212.05032, 2022.
R. Fu, X. Zhan, Y. Chen, D. Ritchie, and S. Sridhar. Shapecrafter: A recursive text-conditioned 3d shape generation model. Advances in Neural Information Processing Systems, 35:8882–8895, 2022.
J. Gao, T. Shen, Z. Wang, W. Chen, K. Yin, D. Li, O. Litany, Z. Gojcic, and S. Fidler. Get3d: A generative model of high quality 3d textured shapes learned from images. Advances In Neural Information Processing Systems, 35:31841–31854, 2022.
H. Ha, S. Agrawal, and S. Song. Fit2form: 3d generative model for robot gripper form design. In Conference on Robot Learning, pages 176–187. PMLR, 2021.
M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
T. Huang, Y. Zeng, B. Dong, H. Xu, S. Xu, R. W. Lau, and W. Zuo. Textfield3d: Towards enhancing open-vocabulary 3d generation with noisy text fields. arXiv preprint arXiv:2309.17175, 2023.
C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021.
M. Kang, J.-Y. Zhu, R. Zhang, J. Park, E. Shechtman, S. Paris, and T. Park. Scaling up gans for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10124–10134, 2023.
T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019.
T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8110–8119, 2020.
P. Katara, Z. Xian, and K. Fragkiadaki. Gen2sim: Scaling up robot learning in simulation with generative models. arXiv preprint arXiv:2310.18308, 2023.
D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
S. Laine, J. Hellsten, T. Karras, Y. Seol, J. Lehtinen, and T. Aila. Modular primitives for high-performance differentiable rendering. ACM Transactions on Graphics (TOG), 39(6):1–14, 2020.
J. Li, D. Li, S. Savarese, and S. Hoi. Blip-2: Bootstrapping language-image pretraining with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
J. Li, D. Li, C. Xiong, and S. Hoi. Blip: Bootstrapping language-image pretraining for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
K. Li and J. Malik. Implicit maximum likelihood estimation. arXiv preprint arXiv:1809.09087, 2018.
M. Li, Y. Duan, J. Zhou, and J. Lu. Diffusion-sdf: Text-to-shape via voxelized diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12642–12651, 2023.
C.-H. Lin, J. Gao, L. Tang, T. Takikawa, X. Zeng, X. Huang, K. Kreis, S. Fidler, M.-Y. Liu, and T.-Y. Lin. Magic3d: High-resolution text-to-3d content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 300–309, 2023.
Z. Liu, Y. Wang, X. Qi, and C.-W. Fu. Towards implicit text-guided 3d shape generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17896–17906, 2022.
L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger. Occupancy networks: Learning 3d reconstruction in function space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4460–4470, 2019.
G. Metzer, E. Richardson, O. Patashnik, R. Giryes, and D. Cohen-Or. Latent-nerf for shape-guided generation of 3d shapes and textures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12663–12673, 2023.
B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
P. Mittal, Y.-C. Cheng, M. Singh, and S. Tulsiani. Autosdf: Shape priors for 3d completion, reconstruction and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 306–315, 2022.
D. H. Park, S. Azadi, X. Liu, T. Darrell, and A. Rohrbach. Benchmark for compositional text-to-image synthesis. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021.
A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
D. Pavllo, J. Kohler, T. Hofmann, and A. Lucchi. Learning generative models of textured 3d meshes from real-world images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13879–13889, 2021.
B. Poole, A. Jain, J. T. Barron, and B. Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
A. Sanghi, H. Chu, J. G. Lambourne, Y. Wang, C.-Y. Cheng, M. Fumero, and K. R. Malekshan. Clip-forge: Towards zero-shot text-to-shape generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18603–18613, 2022.
A. Sanghi, R. Fu, V. Liu, K. D. Willis, H. Shayani, A. H. Khasahmadi, S. Sridhar, and D. Ritchie. Clip-sculptor: Zero-shot generation of high-fidelity and diverse shapes from natural language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18339–18348, 2023.
C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
P. Sharma, N. Ding, S. Goodman, and R. Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 2556–2565, 2018.
T. Shen, J. Gao, K. Yin, M.-Y. Liu, and S. Fidler. Deep marching tetrahedra: a hybrid representation for high-resolution 3d shape synthesis. Advances in Neural Information Processing Systems, 34:6087–6101, 2021.
M. Tao, B.-K. Bao, H. Tang, and C. Xu. Galip: Generative adversarial clips for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14214–14223, 2023.
A. Van Den Oord, O. Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
C. Wang, M. Chai, M. He, D. Chen, and J. Liao. Clip-nerf: Text-and-image driven manipulation of neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3835–3844, 2022.
J. Wei, H. Wang, J. Feng, G. Lin, and K.-H. Yap. Taps3d: Text-guided 3d textured shape generation from pseudo supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16805–16815, 2023.
T. Wu, J. Zhang, X. Fu, Y. Wang, J. Ren, L. Pan, W. Wu, L. Yang, J. Wang, C. Qian, et al. Omniobject3d: Large-vocabulary 3d object dataset for realistic perception, reconstruction and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 803–814, 2023.
J. Xu, X. Wang, W. Cheng, Y.-P. Cao, Y. Shan, X. Qie, and S. Gao. Dream3d: Zero-shot text-to-3d synthesis using 3d shape prior and text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20908–20918, 2023.
G. Yang, X. Huang, Z. Hao, M.-Y. Liu, S. Belongie, and B. Hariharan. Pointflow: 3d point cloud generation with continuous normalizing flows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4541–4550, 2019.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96805
dc.description.abstract (zh_TW): 由於缺乏大規模的文字與3D對應資料,近期的文字轉3D生成工作主要依賴於使用2D擴散模型來合成3D資料。由於基於擴散模型的方法通常在訓練和推理過程中需要大量的優化時間,因此基於GAN(生成式對抗網絡)的模型在快速生成3D資料方面仍然具有吸引力。在這篇論文中,我們提出了一種名為三平面注意力的文字引導3D生成方法(TPA3D),這是一個端到端可訓練的基於GAN的深度學習模型,專為快速文字到3D生成而設計。在訓練過程中僅觀察到3D形狀資料及其渲染的2D圖像,我們的TPA3D旨在檢索詳細的視覺描述,以合成對應的3D網格資料。這是通過我們提出的對句子和單詞級別文字特徵進行的注意力機制實現的。在實驗中,我們展示了TPA3D生成的高質量3D紋理形狀與細粒度描述相一致,同時展現了令人印象深刻的計算效率。
dc.description.abstract (en): Due to the lack of large-scale text-3D correspondence data, recent text-to-3D generation works mainly rely on utilizing 2D diffusion models for synthesizing 3D data. Since diffusion-based methods typically require significant optimization time for both training and inference, GAN-based models remain desirable for fast 3D generation. In this work, we propose Triplane Attention for text-guided 3D generation (TPA3D), an end-to-end trainable GAN-based deep learning model for fast text-to-3D generation. With only 3D shape data and their rendered 2D images observed during training, our TPA3D is designed to retrieve detailed visual descriptions for synthesizing the corresponding 3D mesh data. This is achieved by the proposed attention mechanisms on the extracted sentence- and word-level text features. In our experiments, we show that TPA3D generates high-quality 3D textured shapes aligned with fine-grained descriptions while exhibiting impressive computational efficiency.
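To make the word-level refinement idea in the abstract concrete, the following is a minimal PyTorch sketch of cross-attention from triplane feature tokens to per-word text features. The class name WordLevelTriplaneAttention, the tensor shapes, and the layer choices are illustrative assumptions for exposition only, not the implementation described in this thesis.

# Illustrative sketch only (not the thesis code): cross-attention that lets
# triplane feature tokens attend to word-level text features.
import torch
import torch.nn as nn

class WordLevelTriplaneAttention(nn.Module):
    # Hypothetical module: refines triplane features with word-level text features.
    def __init__(self, plane_dim: int, text_dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=plane_dim, kdim=text_dim, vdim=text_dim,
            num_heads=num_heads, batch_first=True)
        self.norm = nn.LayerNorm(plane_dim)

    def forward(self, planes: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # planes: (B, 3, C, H, W) triplane features; words: (B, L, D) word features
        B, P, C, H, W = planes.shape
        tokens = planes.permute(0, 1, 3, 4, 2).reshape(B, P * H * W, C)
        refined, _ = self.attn(query=tokens, key=words, value=words)
        tokens = self.norm(tokens + refined)  # residual refinement of plane tokens
        return tokens.reshape(B, P, H, W, C).permute(0, 1, 4, 2, 3)

# Usage with assumed shapes: 3 planes of 64-channel 32x32 features, 16 word tokens.
planes = torch.randn(2, 3, 64, 32, 32)
words = torch.randn(2, 16, 512)
print(WordLevelTriplaneAttention(plane_dim=64, text_dim=512)(planes, words).shape)

Refining each plane's tokens with per-word features in this residual fashion is one plausible way to inject fine-grained text details into a triplane-based GAN generator; the thesis itself should be consulted for the actual TPA design.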
dc.description.provenance (en): Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-02-21T16:37:58Z. No. of bitstreams: 0
dc.description.provenance (en): Made available in DSpace on 2025-02-21T16:37:58Z (GMT). No. of bitstreams: 0
dc.description.tableofcontents:
Acknowledgements i
摘要 iii
Abstract iv
Contents v
List of Figures viii
List of Tables xi
Chapter 1 Introduction 1
Chapter 2 Related Works 5
2.1 Text-Guided 2D Image Synthesis 5
2.2 Text-Guided 3D Object Generation 6
Chapter 3 Preliminary 8
Chapter 4 Method 10
4.1 Problem Formulation and Model Overview 10
4.2 Pseudo Caption Generation 12
4.3 Triplane Attention 3D Generator 12
4.3.1 Sentence-Level Triplane Generator 13
4.3.2 Word-Level Triplane Refinement via TPA 13
4.3.2.1 Triplane-Feature Consistency and Connectivity 15
4.3.2.2 Refinement with Word Features 16
4.4 Text-Guided Discriminators 16
4.5 Training and Inference 18
4.5.1 Training 18
4.5.2 Inference 18
Chapter 5 Experiments 19
5.1 Dataset 19
5.2 Quantitative Results 21
5.3 Qualitative Results 22
5.4 Further Analysis 23
5.4.1 Text-Guided Manipulation 23
5.4.2 Comparison with SDS-Based Methods 25
5.4.3 Ablation Study on TPA 25
5.4.4 Inference Speed Comparison 26
Chapter 6 Conclusion 27
References 28
Appendix A — Further Introduction 37
A.1 Implementation Details 37
A.1.1 Model Configuration 37
A.1.2 More Details about Datasets 37
A.1.3 More Details about GET3D 38
A.1.4 More Details of Pseudo Captioning 39
A.2 Ablation Study 39
A.2.1 Components in TPA blocks 40
A.2.2 TPAgeo and TPAtex 42
A.2.3 Block Numbers of TPA 42
A.2.4 Training Objectives 42
A.3 More Qualitative Results 43
A.3.1 Disentanglement of Geometry and Texture 44
A.3.2 Interpolation of Geometry and Texture 44
A.3.3 Generation for Detailed Description 45
A.3.4 Controllable Manipulation 46
A.3.5 Multi-class Generation 48
A.4 Human Evaluation for Verifying Fidelity 48
A.5 TPA Effectiveness for Simple or Complex Captions 49
A.6 Limitations 50
dc.language.iso: en
dc.subject: 文字到3D生成 (zh_TW)
dc.subject: 3D 視覺 (zh_TW)
dc.subject: 生成式AI (zh_TW)
dc.subject: 電腦視覺 (zh_TW)
dc.subject: 注意力機制 (zh_TW)
dc.subject: attention mechanism (en)
dc.subject: computer vision (en)
dc.subject: 3D vision (en)
dc.subject: generative AI (en)
dc.subject: text-to-3D generation (en)
dc.title: 三平面注意力於快速文字到 3D 生成 (zh_TW)
dc.title: TPA3D: Triplane Attention for Fast Text-to-3D Generation (en)
dc.type: Thesis
dc.date.schoolyear: 113-1
dc.description.degree: 碩士
dc.contributor.oralexamcommittee: 陳祝嵩;楊福恩 (zh_TW)
dc.contributor.oralexamcommittee: Chu-Song Chen;Fu-En Yang (en)
dc.subject.keyword: 電腦視覺,3D 視覺,生成式AI,文字到3D生成,注意力機制 (zh_TW)
dc.subject.keyword: computer vision,3D vision,generative AI,text-to-3D generation,attention mechanism (en)
dc.relation.page: 50
dc.identifier.doi: 10.6342/NTU202401418
dc.rights.note: 同意授權(限校園內公開)
dc.date.accepted: 2024-12-26
dc.contributor.author-college: 電機資訊學院
dc.contributor.author-dept: 電信工程學研究所
dc.date.embargo-lift: 2025-02-22
Appears in Collections: 電信工程學研究所

Files in This Item:
File: ntu-113-1.pdf (24.63 MB, Adobe PDF)
Access restricted to NTU campus IP addresses (off-campus users should connect via the library VPN service).