NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101440

Full metadata record
dc.contributor.advisor: 王鈺強 (zh_TW)
dc.contributor.advisor: Yu-Chiang Frank Wang (en)
dc.contributor.author: 黃聖喻 (zh_TW)
dc.contributor.author: Sheng-Yu Huang (en)
dc.date.accessioned: 2026-02-03T16:18:19Z
dc.date.available: 2026-02-04
dc.date.copyright: 2026-02-03
dc.date.issued: 2025
dc.date.submitted: 2026-01-07
dc.identifier.citation:
[1] W. Yuan, T. Khot, D. Held, C. Mertz, and M. Hebert, “Pcn: Point completion network,” in International Conference on 3D Vision (3DV), 2018. xi, 2, 4, 12, 13, 15, 34, 36
[2] X. Wen, P. Xiang, Z. Han, Y.-P. Cao, P. Wan, W. Zheng, and Y.-S. Liu, “Pmp-net++: Point cloud completion by transformer-enhanced multi-step point moving paths,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022. xi, 2, 5, 12, 13, 15
[3] X. Yu, Y. Rao, Z. Wang, Z. Liu, J. Lu, and J. Zhou, “Pointr: Diverse point cloud completion with geometry-aware transformers,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021. xi, 2, 4, 5, 12, 13, 15, 16, 22
[4] L. Pan, X. Chen, Z. Cai, J. Zhang, H. Zhao, S. Yi, and Z. Liu, “Variational relational point completion network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021. xi, xiii, 2, 4, 5, 12, 13, 15, 16, 34, 36, 44
[5] Q.-Y. Zhou, J. Park, and V. Koltun, “Open3d: A modern library for 3d data processing,” arXiv preprint arXiv:1801.09847, 2018. xii, 18
[6] Z. Liu, Y. Wang, X. Qi, and C.-W. Fu, “Towards implicit text-guided 3d shape generation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022. xii, xiv, 23, 25, 27, 32, 33, 38, 41, 43, 44, 48
[7] Y.-C. Cheng, H.-Y. Lee, S. Tulyakov, A. Schwing, and L. Gui, “Sdfusion: Multimodal 3d shape completion, reconstruction, and generation,” arXiv preprint arXiv:2212.04493, 2022. xii, 22, 23, 27, 30, 32, 33, 41, 43, 44
[8] B. Poole, A. Jain, J. T. Barron, and B. Mildenhall, “Dreamfusion: Text-to-3d using 2d diffusion,” arXiv preprint arXiv:2209.14988, 2022. xii, 23, 27, 34, 43
[9] A. Nichol, H. Jun, P. Dhariwal, P. Mishkin, and M. Chen, “Point-e: A system for generating 3d point clouds from complex prompts,” arXiv preprint arXiv:2212.08751, 2022. xii, 34, 43, 44
[10] C.-H. Lin, J. Gao, L. Tang, T. Takikawa, X. Zeng, X. Huang, K. Kreis, S. Fidler, M.-Y. Liu, and T.-Y. Lin, “Magic3d: High-resolution text-to-3d content creation,” arXiv preprint arXiv:2211.10440, 2022. xii, 23, 27, 34, 43
[11] P. Mittal, Y.-C. Cheng, M. Singh, and S. Tulsiani, “Autosdf: Shape priors for 3d completion, reconstruction and generation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022. xii, 22, 23, 25, 26, 32, 33, 44
[12] R. Wu, X. Chen, Y. Zhuang, and B. Chen, “Multimodal shape completion via conditional generative adversarial networks,” in Proceedings of the European Conference on Computer Vision (ECCV), 2020. xiii, 34, 36, 44
[13] Z. Chen, F. Long, Z. Qiu, T. Yao, W. Zhou, J. Luo, and T. Mei, “Anchorformer: Point cloud completion from discriminative nodes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023. xiii, 34, 36, 44
[14] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023. xiv, 39, 47
[15] A. Mirzaei, T. Aumentado-Armstrong, K. G. Derpanis, J. Kelly, M. A. Brubaker, I. Gilitschenski, and A. Levinshtein, “Spin-nerf: Multiview segmentation and perceptual inpainting with neural radiance fields,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023. xv, xvi, 70, 71, 74, 80, 81, 82, 83
[16] M. Ye, M. Danelljan, F. Yu, and L. Ke, “Gaussian grouping: Segment and edit anything in 3d scenes,” Proceedings of the European Conference on Computer Vision (ECCV), 2024. xv, xvi, 70, 71, 72, 74, 75, 78, 80, 81, 82, 83
[17] Y. Wang, Q. Wu, G. Zhang, and D. Xu, “Gscream: Learning 3d geometry and feature consistent gaussian splatting for object removal,” Proceedings of the European Conference on Computer Vision (ECCV), 2024. xv, xvi, 70, 71, 75, 80, 81, 82, 83, 84
[18] H. Chen, C. C. Loy, and X. Pan, “Mvip-nerf: Multi-view 3d inpainting on nerf scenes via diffusion prior,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024. xvi, 71, 75, 80, 81, 83, 84
[19] C. H. Lin, C. Kim, J.-B. Huang, Q. Li, C.-Y. Ma, J. Kopf, M.-H. Yang, and H.-Y. Tseng, “Taming latent diffusion model for neural radiance field inpainting,” Proceedings of the European Conference on Computer Vision (ECCV), 2024. xvi, 70, 71, 75, 80, 81, 83, 84
[20] J. Kerr, C. M. Kim, K. Goldberg, A. Kanazawa, and M. Tancik, “Lerf: Language embedded radiance fields,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2023. xvi, 82, 85
[21] J. T. Barron, B. Mildenhall, D. Verbin, P. P. Srinivasan, and P. Hedman, “Mip-nerf 360: Unbounded anti-aliased neural radiance fields,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022. xvi, 82, 83, 85
[22] D. Wang, T. Zhang, A. Abboud, and S. Süsstrunk, “Innerf360: Text-guided 3d-consistent object inpainting on 360-degree neural radiance fields,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024. xvii, 82, 84, 85
[23] S. Y. Huang, H.-Y. Hsu, and F. Wang, “Spovt: Semantic-prototype variational transformer for dense point cloud semantic completion,” Advances in Neural Information Processing Systems (NeurIPS), 2022. 1, 22
[24] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 2, 4, 25
[25] Y. Yang, C. Feng, Y. Shen, and D. Tian, “Foldingnet: Point cloud autoencoder via deep grid deformation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 2, 4
[26] L. P. Tchapmi, V. Kosaraju, H. Rezatofighi, I. Reid, and S. Savarese, “Topnet: Structural point cloud decoder,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 2, 4
[27] X. Wang, M. H. Ang Jr, and G. H. Lee, “Cascaded refinement network for point cloud completion,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 2, 4
[28] X. Wang, M. H. Ang, and G. Lee, “Cascaded refinement network for point cloud completion with self-supervision,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021. 2, 4
[29] Y. Xia, Y. Xia, W. Li, R. Song, K. Cao, and U. Stilla, “Asfm-net: Asymmetrical siamese feature matching network for point completion,” in Proceedings of the ACM Conference on Multimedia (MM), 2021. 2, 4
[30] M. Liu, L. Sheng, S. Yang, J. Shao, and S.-M. Hu, “Morphing and sampling network for dense point cloud completion,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2020. 2, 10
[31] H. Xie, H. Yao, S. Zhou, J. Mao, S. Zhang, and W. Sun, “Grnet: Gridding residual network for dense point cloud completion,” in Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2020. 2, 5
[32] X. Wang, M. H. Ang, and G. H. Lee, “Voxel-based network for shape completion by leveraging edge generation,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021. 2, 5
[33] P. Xiang, X. Wen, Y.-S. Liu, Y.-P. Cao, P. Wan, W. Zheng, and Z. Han, “Snowflakenet: Point cloud completion by snowflake point deconvolution with skip-transformer,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021. 2, 5
[34] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems (NeurIPS), 2017. 2, 5, 9
[35] J. Li, K. Han, P. Wang, Y. Liu, and X. Yuan, “Anisotropic convolutional networks for 3d semantic scene completion,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 2, 5
[36] A. Dourado, F. Guth, and T. de Campos, “Data augmented 3d semantic scene completion with 2d segmentation priors,” in Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), 2022. 2, 5
[37] X. Yang, H. Zou, X. Kong, T. Huang, Y. Liu, W. Li, F. Wen, and H. Zhang, “Semantic segmentation-assisted scene completion for lidar point clouds,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021. 2, 5
[38] Y. Cai, X. Chen, C. Zhang, K.-Y. Lin, X. Wang, and H. Li, “Semantic scene completion via integrating instances and scene in-the-loop,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021. 2, 5
[39] S. Zhang, S. Li, A. Hao, and H. Qin, “Point cloud semantic scene completion from rgb-d images,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2021. 2, 4, 5
[40] M. Garbade, Y.-T. Chen, J. Sawatzky, and J. Gall, “Two stream 3d semantic scene completion,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops), 2019. 2, 5
[41] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,” Advances in Neural Information Processing Systems (NeurIPS), 2017. 4, 25, 36, 57
[42] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, “Dynamic graph cnn for learning on point clouds,” ACM Transactions on Graphics (TOG), 2019. 4, 13, 25
[43] B. Gong, Y. Nie, Y. Lin, X. Han, and Y. Yu, “Me-pcn: point completion conditioned on mask emptiness,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021. 5
[44] X. Wen, T. Li, Z. Han, and Y.-S. Liu, “Point cloud completion by skip-attention network with hierarchical folding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 5
[45] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser, “Semantic scene completion from a single depth image,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 5
[46] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013. 8
[47] L. Yi, V. G. Kim, D. Ceylan, I.-C. Shen, M. Yan, H. Su, C. Lu, Q. Huang, A. Sheffer, and L. Guibas, “A scalable active framework for region annotation in 3d shape collections,” ACM Transactions on Graphics (TOG), 2016. 12
[48] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su et al., “Shapenet: An information-rich 3d model repository,” arXiv preprint arXiv:1512.03012, 2015. 12, 25, 32
[49] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014. 13, 32
[50] J. Ma and D. Yarats, “On the adequacy of untuned warmup for adaptive optimization,” Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2019. 13
[51] H. Fu, C. Li, X. Liu, J. Gao, A. Celikyilmaz, and L. Carin, “Cyclical annealing schedule: A simple approach to mitigating kl vanishing,” Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019. 13
[52] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” Advances in neural information processing systems, vol. 32, 2019. 13, 33
[53] N. Ravi, J. Reizenstein, D. Novotny, T. Gordon, W.-Y. Lo, J. Johnson, and G. Gkioxari, “Accelerating 3d deep learning with pytorch3d,” arXiv:2007.08501, 2020. 13
[54] Y. Nie, J. Hou, X. Han, and M. Nießner, “Rfd-net: Point scene understanding by semantic instance reconstruction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021. 13
[55] J. Hou, A. Dai, and M. Nießner, “Revealnet: Seeing behind objects in rgb-d scans,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 13
[56] S.-Y. Huang, C.-P. Huang, K.-P. Chang, Z.-T. Chou, I.-J. Liu, and Y.-C. F. Wang, “Learning shape-color diffusion priors for text-guided 3d object generation,” IEEE Transactions on Multimedia, 2025. 22
[57] P. An, D. Zhu, S. Quan, J. Ding, J. Ma, Y. Yang, and Q. Liu, “Esc-net: Alleviating triple sparsity on 3d lidar point clouds for extreme sparse scene completion,” IEEE Transactions on Multimedia, pp. 1–12, 2024. 22
[58] S. Li, P. Gao, X. Tan, and W. Xiang, “Rlgrid: Reinforcement learning controlled grid deformation for coarse-to-fine point cloud completion,” IEEE Transactions on Multimedia, pp. 1–16, 2023. 22
[59] X. Wen, J. Zhou, Y.-S. Liu, H. Su, Z. Dong, and Z. Han, “3d shape reconstruction from 2d images with disentangled attribute flow,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 22
[60] S. Cai, A. Obukhov, D. Dai, and L. Van Gool, “Pix2nerf: Unsupervised conditional p-gan for single image to neural radiance fields translation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 22
[61] L. Wu, Q. Zhang, J. Hou, and Y. Xu, “Leveraging single-view images for unsupervised 3d point cloud completion,” IEEE Transactions on Multimedia, pp. 1–14, 2023. 22
[62] X. Tu, J. Zhao, M. Xie, Z. Jiang, A. Balamurugan, Y. Luo, Y. Zhao, L. He, Z. Ma, and J. Feng, “3d face reconstruction from a single image assisted by 2d face images in the wild,” IEEE Transactions on Multimedia, vol. 23, pp. 1160–1172, 2021. 22
[63] W. Nie, C. Jiao, R. Chang, L. Qu, and A.-A. Liu, “Cpg3d: Cross-modal priors guided 3d object reconstruction,” IEEE Transactions on Multimedia, vol. 25, pp. 9383–9396, 2023. 22
[64] K. Chen, C. B. Choy, M. Savva, A. X. Chang, T. Funkhouser, and S. Savarese, “Text2shape: Generating shapes from natural language by learning joint embeddings,” in Proceedings of the Asian Conference on Computer Vision (ACCV). Springer, 2018. 22, 23, 25, 32
[65] K. Li and J. Malik, “Implicit maximum likelihood estimation,” arXiv preprint arXiv:1809.09087, 2018. 22, 25
[66] R. Fu, X. Zhan, Y. Chen, D. Ritchie, and S. Sridhar, “Shapecrafter: A recursive text-conditioned 3d shape generation model,” arXiv preprint arXiv:2207.09446, 2022. 22, 23, 25, 26, 33
[67] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in Proceedings of the International Conference on Machine Learning (ICML). PMLR, 2017. 23, 25
[68] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems (NeurIPS), 2017. 23, 26, 31
[69] S. Gong, M. Li, J. Feng, Z. Wu, and L. Kong, “Diffuseq: Sequence to sequence text generation with diffusion models,” Proceedings of the International Conference on Learning Representations (ICLR), 2023. 23
[70] A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi, “The curious case of neural text degeneration,” Proceedings of the International Conference on Learning Representations (ICLR), 2020. 23
[71] M. Li, Y. Duan, J. Zhou, and J. Lu, “Diffusion-sdf: Text-to-shape via voxelized diffusion,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 23, 27
[72] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems (NeurIPS), 2020. 23, 26, 41
[73] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 23, 26, 34, 41
[74] W.-C. Fan, Y.-C. Chen, D. Chen, Y. Cheng, L. Yuan, and Y.-C. F. Wang, “Frido: Feature pyramid diffusion for complex scene image synthesis,” Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2023. 23, 26, 41
[75] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans et al., “Photorealistic text-to-image diffusion models with deep language understanding,” Advances in Neural Information Processing Systems (NeurIPS), 2022. 23, 26
[76] O. Avrahami, D. Lischinski, and O. Fried, “Blended diffusion for text-driven editing of natural images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 23
[77] X. Zeng, A. Vahdat, F. Williams, Z. Gojcic, O. Litany, S. Fidler, and K. Kreis, “Lion: Latent point diffusion models for 3d shape generation,” Advances in Neural Information Processing Systems (NeurIPS), 2022. 23, 27, 30, 41
[78] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023. 24, 29, 33, 40
[79] D. Maturana and S. Scherer, “Voxnet: A 3d convolutional neural network for real-time object recognition,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2015. 25
[80] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger, “Occupancy networks: Learning 3d reconstruction in function space,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 25, 27
[81] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove, “Deepsdf: Learning continuous signed distance functions for shape representation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 25
[82] H. Thanh-Tung and T. Tran, “Catastrophic forgetting and mode collapse in gans,” in 2020 international joint conference on neural networks (ijcnn). IEEE, 2020, pp. 1–10. 25
[83] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 4401–4410. 25
[84] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018. 26, 31
[85] A. Van Den Oord, O. Vinyals et al., “Neural discrete representation learning,” Advances in Neural Information Processing Systems (NeurIPS), 2017. 26
[86] M. Honnibal and M. Johnson, “An improved non-monotonic transition system for dependency parsing,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015. 26
[87] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” Proceedings of the European Conference on Computer Vision (ECCV), 2020. 27
[88] Y. Chen, Y. Pan, Y. Li, T. Yao, and T. Mei, “Control3d: Towards controllable text-to-3d generation,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 1148–1156. 27
[89] A. Raj, S. Kaza, B. Poole, M. Niemeyer, N. Ruiz, B. Mildenhall, S. Zada, K. Aberman, M. Rubinstein, J. Barron et al., “Dreambooth3d: Subject-driven text-to-3d generation,” in Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 2349–2359. 27
[90] Y. Shi, P. Wang, J. Ye, M. Long, K. Li, and X. Yang, “Mvdream: Multi-view diffusion for 3d generation,” arXiv preprint arXiv:2308.16512, 2023. 27
[91] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, and Z. Sui, “A survey for in-context learning,” arXiv preprint arXiv:2301.00234, 2022. 29
[92] Z. Yang, Z. Gan, J. Wang, X. Hu, Y. Lu, Z. Liu, and L. Wang, “An empirical study of gpt-3 for few-shot knowledge-based vqa,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2022. 29
[93] E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville, “Film: Visual reasoning with a general conditioning layer,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2018. 31
[94] R. J. Williams and D. Zipser, “A learning algorithm for continually running fully recurrent neural networks,” Neural computation, vol. 1, no. 2, pp. 270–280, 1989. 32
[95] P. Cignoni, M. Callieri, M. Corsini, M. Dellepiane, F. Ganovelli, G. Ranzuglia et al., “Meshlab: an open-source mesh processing tool.” in Eurographics Italian chapter conference, vol. 2008. Salerno, Italy, 2008, pp. 129–136. 33
[96] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” Advances in Neural Information Processing Systems (NeurIPS), 2017. 33
[97] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in Proceedings of the International Conference on Machine Learning (ICML), 2021. 33, 37
[98] D. W. Shu, S. W. Park, and J. Kwon, “3d point cloud generative adversarial network based on tree structured graph convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 33
[99] T. Müller, A. Evans, C. Schied, and A. Keller, “Instant neural graphics primitives with a multiresolution hash encoding,” ACM Transactions on Graphics (TOG), vol. 41, no. 4, pp. 1–15, 2022. 34, 70, 73, 74
[100] M. Honnibal and M. Johnson, “An improved non-monotonic transition system for dependency parsing,” in Proceedings of the 2015 conference on empirical methods in natural language processing, 2015, pp. 1373–1378. 40
[101] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019. 40
[102] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin et al., “Opt: Open pre-trained transformer language models,” arXiv preprint arXiv:2205.01068, 2022. 40
[103] T.-Y. Wu, S.-Y. Huang, and Y.-C. F. Wang, “Data-efficient 3d visual grounding via order-aware referring,” in 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). IEEE, 2025, pp. 3107–3117. 50
[104] L.-H. Lee, T. Braud, P. Zhou, L. Wang, D. Xu, Z. Lin, A. Kumar, C. Bermejo, and P. Hui, “All one needs to know about metaverse: A complete survey on technological singularity, virtual ecosystem, and research agenda,” arXiv preprint arXiv:2110.05352, 2021. 50
[105] R. B. Kochanski, J. M. Lombardi, J. L. Laratta, R. A. Lehman, and J. E. O’Toole, “Image-guided navigation and robotics in spine surgery,” Neurosurgery, pp. 1179–1189, 2019. 50
[106] P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould, and A. Van Den Hengel, “Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments,” in Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 2018, pp. 3674–3683. 50
[107] A. Kamath, M. Singh, Y. LeCun, G. Synnaeve, I. Misra, and N. Carion, “Mdetr: Modulated detection for end-to-end multi-modal understanding,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 1780–1790. 50, 53, 54
[108] H. Zhang, P. Zhang, X. Hu, Y.-C. Chen, L. Li, X. Dai, L. Wang, L. Yuan, J.-N. Hwang, and J. Gao, “Glipv2: Unifying localization and vision-language understanding,” in Proceedings of Advances in Neural Information Processing Systems (NeurIPS), vol. 35, 2022. 50, 53, 54
[109] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu et al., “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” arXiv preprint arXiv:2303.05499, 2023. 50, 53, 54
[110] A. Shtedritski, C. Rupprecht, and A. Vedaldi, “What does clip know about a red circle? Visual prompt engineering for vlms,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 11987–11997. 50, 53
[111] L. Yang, Y. Wang, X. Li, X. Wang, and J. Yang, “Fine-grained visual prompting,” in Proceedings of Advances in Neural Information Processing Systems, 2024. 50, 53
[112] Y. Zhao, Z. Lin, D. Zhou, Z. Huang, J. Feng, and B. Kang, “Bubogpt: Enabling visual grounding in multi-modal llms,” arXiv preprint arXiv:2307.08581, 2023. 50
[113] P. Achlioptas, A. Abdelreheem, F. Xia, M. Elhoseiny, and L. Guibas, “Referit3d: Neural listeners for fine-grained 3d object identification in real world scenes,” in Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2020, pp. 422–440. 51, 54, 63, 64
[114] D. Z. Chen, A. X. Chang, and M. Nießner, “Scanrefer: 3d object localization in rgb-d scans using natural language,” in Proceedings of the European conference on computer vision (ECCV). Springer, 2020, pp. 202–221. 51, 64, 65
[115] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 2017, pp. 5828–5839. 51, 64
[116] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Transactions on Neural Networks, pp. 61–80, 2008. 51
[117] S. Huang, Y. Chen, J. Jia, and L. Wang, “Multi-view transformer for 3d visual grounding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 15524–15533. 51, 54, 55, 63
[118] Z. Yang, S. Zhang, L. Wang, and J. Luo, “Sat: 2d semantics assisted training for 3d visual grounding,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 1856–1866. 51, 54, 55, 63
[119] E. Bakr, Y. Alsaedy, and M. Elhoseiny, “Look around and refer: 2d synthetic semantics knowledge distillation for 3d visual grounding,” in Proceedings of Advances in Neural Information Processing Systems (NeurIPS), vol. 35, 2022. 51, 54, 55
[120] J. Luo, J. Fu, X. Kong, C. Gao, H. Ren, H. Shen, H. Xia, and S. Liu, “3d-sps: Single-stage 3d visual grounding via referred point progressive selection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 16454–16463. 51, 54, 55, 65
[121] A. Jain, N. Gkanatsios, I. Mediratta, and K. Fragkiadaki, “Bottom up top down detection transformers for language grounding in images and point clouds,” in Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2022, pp. 417–433. 51, 54, 63, 65
[122] M. Feng, Z. Li, Q. Li, L. Zhang, X. Zhang, G. Zhu, H. Zhang, Y. Wang, and A. Mian, “Free-form description guided 3d visual graph network for object grounding in point cloud,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 3722–3731. 51, 54
[123] P.-H. Huang, H.-H. Lee, H.-T. Chen, and T.-L. Liu, “Text-guided graph neural networks for referring 3d instance segmentation,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2021, pp. 1610–1618. 51, 54
[124] S. Chen, P.-L. Guhur, M. Tapaswi, C. Schmid, and I. Laptev, “Language conditioned spatial relation reasoning for 3d object grounding,” in Proceedings of Advances in Neural Information Processing Systems (NeurIPS), vol. 35, 2022. 51, 54
[125] Z. Wang, H. Huang, Y. Zhao, L. Li, X. Cheng, Y. Zhu, A. Yin, and Z. Zhao, “3DRP-net: 3D relative position-aware network for 3D visual grounding,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023. 51, 54
[126] E. M. Bakr, M. Ayman, M. Ahmed, H. Slim, and M. Elhoseiny, “Cot3dref: Chain-of-thoughts data-efficient 3d visual grounding,” arXiv preprint arXiv:2310.06214, 2023. 51, 52, 55, 58, 60, 63
[127] Z. Yuan, X. Yan, Z. Li, X. Li, Y. Guo, S. Cui, and Z. Li, “Toward explainable and fine-grained 3d grounding through referring textual phrases,” arXiv preprint arXiv:2207.01821, 2022. 51, 55
[128] A. Abdelreheem, K. Olszewski, H.-Y. Lee, P. Wonka, and P. Achlioptas, “Scanents3d: Exploiting phrase-to-3d-object correspondences for improved visio-linguistic models in 3d scenes,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024, pp. 3524–3534. 51, 55
[129] Y. Wu, X. Cheng, R. Zhang, Z. Cheng, and J. Zhang, “Eda: Explicit text-decoupling and dense alignment for 3d visual grounding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 19231–19242. 51, 55
[130] J. Hsu, J. Mao, and J. Wu, “Ns3d: Neuro-symbolic grounding of 3d objects and relations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 2614–2623. 51, 52, 55
[131] X. Zhu, H. Zhou, P. Xing, L. Zhao, H. Xu, J. Liang, A. Hauptmann, T. Liu, and A. Gallagher, “Open-vocabulary 3d semantic segmentation with text-to-image diffusion models,” Proceedings of the European Conference on Computer Vision (ECCV), 2024. 51
[132] A. Takmaz, E. Fedele, R. W. Sumner, M. Pollefeys, F. Tombari, and F. Engelmann, “Openmask3d: Open-vocabulary 3d instance segmentation,” Advances in Neural Information Processing Systems (NeurIPS), 2023. 51
[133] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., “Segment anything,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4015–4026. 51
[134] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10684–10695. 51
[135] W.-D. K. Ma, A. Lahiri, J. P. Lewis, T. Leung, and W. B. Kleijn, “Directed diffusion: Direct control of object placement through attention guidance,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2024. 52
[136] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021. 52, 55
[137] J. C. McVay and M. J. Kane, “Conducting the train of thought: working memory capacity, goal neglect, and mind wandering in an executive-control task.” Journal of Experimental Psychology: Learning, Memory, and Cognition, p. 196, 2009. 52, 58
[138] L. Chen, M. A. Lambon Ralph, and T. T. Rogers, “A unified model of human semantic knowledge and its disorders,” Nature human behaviour, p. 0039, 2017. 52, 58
[139] L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J.-N. Hwang et al., “Grounded language-image pre-training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 10965–10975. 53
[140] Y. Yao, A. Zhang, Z. Zhang, Z. Liu, T.-S. Chua, and M. Sun, “Cpt: Colorful prompt tuning for pre-trained vision-language models,” arXiv preprint arXiv:2109.11797, 2021. 53
[141] S. Subramanian, W. Merrill, T. Darrell, M. Gardner, S. Singh, and A. Rohrbach, “Reclip: A strong zero-shot baseline for referring expression comprehension,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2022. 53
[142] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Proceedings of the European conference on computer vision (ECCV). Springer, 2020, pp. 213–229. 54
[143] Z. Hao, L. Feng, L. Shilong, Z. Lei, S. Hang, Z. Jun, L. M. Ni, and S. Heung-Yeung, “DINO: DETR with improved denoising anchor boxes for end-to-end object detection,” in Proceedings of the International Conference on Learning Representations (ICLR), 2023. 54
[144] Z. Yuan, X. Yan, Y. Liao, R. Zhang, S. Wang, Z. Li, and S. Cui, “Instancerefer: Cooperative holistic understanding for visual grounding on point clouds through instance multi-level contextual referring,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 1791–1800. 54
[145] L. Yang, Z. Zhang, Z. Qi, Y. Xu, W. Liu, Y. Shan, B. Li, W. Yang, P. Li, Y. Wang et al., “Exploiting contextual objects and relations for 3d visual grounding,” in Proceedings of Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023. 54
[146] D. He, Y. Zhao, J. Luo, T. Hui, S. Huang, A. Zhang, and S. Liu, “Transrefer3d: Entity-and-relation aware transformer for fine-grained 3d visual grounding,” in Proceedings of the 29th ACM International Conference on Multimedia (MM), 2021, pp. 2344–2352. 54, 63
[147] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019, pp. 4171–4186. 55
[148] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019. 55
[149] S. Schuster, R. Krishna, A. Chang, L. Fei-Fei, and C. D. Manning, “Generating semantically precise scene graphs from textual descriptions for improved image retrieval,” in Proceedings of the fourth workshop on vision and language, 2015, pp. 70–80. 56
[150] L. Jiang, H. Zhao, S. Shi, S. Liu, C.-W. Fu, and J. Jia, “Pointgroup: Dual-set point grouping for 3d instance segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 4867–4876. 57, 64
[151] C. R. Qi, O. Litany, K. He, and L. J. Guibas, “Deep hough voting for 3d object detection in point clouds,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 9277–9286. 57
[152] Y. Wu, M. Shi, S. Du, H. Lu, Z. Cao, and W. Zhong, “3d instances as 1d kernels,” in Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2022, pp. 235–252. 57
[153] Y. Zhang, Z. Gong, and A. X. Chang, “Multi3drefer: Grounding text description to multiple 3d objects,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 15225–15236. 64, 65
[154] L. Zhao, D. Cai, L. Sheng, and D. Xu, “3dvg-transformer: Relation modeling for visual grounding on point clouds,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 2928–2937. 65
[155] S. Bird, E. Klein, and E. Loper, Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc., 2009. 66
[156] S.-Y. Huang, Z.-T. Chou, and Y.-C. F. Wang, “3d gaussian inpainting with depth-guided cross-view consistency,” in Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025, pp. 26704–26713. 69
[157] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2021. 70, 73
[158] J. T. Barron, B. Mildenhall, M. Tancik, P. Hedman, R. Martin-Brualla, and P. P. Srinivasan, “Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 5855–5864. 70
[159] A. Yu, R. Li, M. Tancik, H. Li, R. Ng, and A. Kanazawa, “Plenoctrees for real-time rendering of neural radiance fields,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 5752–5761. 70, 73
[160] C. Sun, M. Sun, and H.-T. Chen, “Direct voxel grid optimization: Superfast convergence for radiance fields reconstruction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5459–5469. 70, 73
[161] S. Fridovich-Keil, A. Yu, M. Tancik, Q. Chen, B. Recht, and A. Kanazawa, “Plenoxels: Radiance fields without neural networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5501–5510. 70, 73, 74
[162] R. Martin-Brualla, N. Radwan, M. S. Sajjadi, J. T. Barron, A. Dosovitskiy, and D. Duckworth, “Nerf in the wild: Neural radiance fields for unconstrained photo collections,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7210–7219. 70
[163] C. Reiser, S. Peng, Y. Liao, and A. Geiger, “Kilonerf: Speeding up neural radiance fields with thousands of tiny mlps,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 14335–14345. 70
[164] A. Chen, Z. Xu, A. Geiger, J. Yu, and H. Su, “Tensorf: Tensorial radiance fields,” in European Conference on Computer Vision. Springer, 2022, pp. 333–350. 70
[165] Z. Yu, A. Chen, B. Huang, T. Sattler, and A. Geiger, “Mip-splatting: Aliasfree 3d gaussian splatting,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 70
[166] M. Qin, W. Li, J. Zhou, H. Wang, and H. Pfister, “Langsplat: 3d language gaussian splatting,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 70
[167] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering,” ACM Transactions on Graphics (TOG), 2023. 70, 74, 79
[168] G. Chen and W. Wang, “A survey on 3d gaussian splatting,” arXiv preprint arXiv:2401.03890, 2024. 70
[169] M. C. Macedo and A. L. Apolinario, “Occlusion handling in augmented reality: past, present and future,” IEEE Transactions on Visualization and Computer Graphics, vol. 29, no. 2, pp. 1590–1609, 2021. 70
[170] W. Broll, “Augmented reality,” in Virtual and Augmented Reality (VR/AR) Foundations and Methods of Extended Realities (XR). Springer, 2022, pp. 291–329. 70
[171] R. Suvorov, E. Logacheva, A. Mashikhin, A. Remizova, A. Ashukha, A. Silvestrov, N. Kong, H. Goka, K. Park, and V. Lempitsky, “Resolution-robust large mask inpainting with fourier convolutions,” in Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), 2022. 70, 78, 80, 83
[172] C. Corneanu, R. Gadde, and A. M. Martinez, “Latentpaint: Image inpainting in latent space with diffusion models,” in Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), 2024. 70
[173] A. Lugmayr, M. Danelljan, A. Romero, F. Yu, R. Timofte, and L. Van Gool, “Repaint: Inpainting using denoising diffusion probabilistic models,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 70
[174] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach, “Sdxl: Improving latent diffusion models for high-resolution image synthesis,” Proceedings of the International Conference on Learning Representations (ICLR), 2024. 70
[175] S. Xie, Z. Zhang, Z. Lin, T. Hinz, and K. Zhang, “Smartbrush: Text and shape guided object inpainting with diffusion model,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 70
[176] B. Yang, S. Gu, B. Zhang, T. Zhang, X. Chen, X. Sun, D. Chen, and F. Wen, “Paint by example: Exemplar-based image editing with diffusion models,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 70
[177] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 70, 78, 80, 83
[178] S. Yang, X. Chen, and J. Liao, “Uni-paint: A unified framework for multimodal image inpainting with pretrained diffusion model,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023. 70
[179] Y. Hao, Y. Liu, Z. Wu, L. Han, Y. Chen, G. Chen, L. Chu, S. Tang, Z. Yu, Z. Chen et al., “Edgeflow: Achieving practical interactive segmentation with edge-guided flow,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021. 71
[180] S. Weder, G. Garcia-Hernando, A. Monszpart, M. Pollefeys, G. J. Brostow, M. Firman, and S. Vicente, “Removing objects from neural radiance fields,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 71
[181] Y. Yin, Z. Fu, F. Yang, and G. Lin, “Or-nerf: Object removing from 3d scenes guided by multiview segmentation with neural radiance fields,” arXiv preprint arXiv:2305.10503, 2023. 71, 74
[182] A. Mirzaei, T. Aumentado-Armstrong, M. A. Brubaker, J. Kelly, A. Levinshtein, K. G. Derpanis, and I. Gilitschenski, “Reference-guided controllable inpainting of neural radiance fields,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2023. 71
[183] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., “Segment anything,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2023. 71, 72, 74, 76, 82
[184] A. Mirzaei, R. De Lutio, S. W. Kim, D. Acuna, J. Kelly, S. Fidler, I. Gilitschenski, and Z. Gojcic, “Reffusion: Reference adapted diffusion models for 3d scene inpainting,” arXiv preprint arXiv:2404.10765, 2024. 71
[185] Z. Liu, H. Ouyang, Q. Wang, K. L. Cheng, J. Xiao, K. Zhu, N. Xue, Y. Liu, Y. Shen, and Y. Cao, “Infusion: Inpainting 3d gaussians via learning depth completion from diffusion prior,” arXiv preprint arXiv:2404.11613, 2024. 71
[186] E. Weber, A. Holynski, V. Jampani, S. Saxena, N. Snavely, A. Kar, and A. Kanazawa, “Nerfiller: Completing scenes via generative 3d inpainting,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 20731–20741. 71
[187] D. Hu, H. Fu, J. Guo, L. Peng, T. Chu, F. Liu, T. Liu, and M. Gong, “In-n-out: Lifting 2d diffusion prior for 3d object removal via tuning-free latents alignment,” Advances in Neural Information Processing Systems, vol. 37, pp. 45737–45766, 2024. 71
[188] K. Deng, A. Liu, J.-Y. Zhu, and D. Ramanan, “Depth-supervised nerf: Fewer views and faster training for free,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12882–12891. 73
[189] T. Ren, S. Liu, A. Zeng, J. Lin, K. Li, H. Cao, J. Chen, X. Huang, Y. Chen, F. Yan et al., “Grounded sam: Assembling open-world models for diverse visual tasks,” arXiv preprint arXiv:2401.14159, 2024. 74, 85
[190] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685, 2021. 75
[191] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 79, 83
[192] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” Advances in neural information processing systems, vol. 30, 2017. 83
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101440
dc.description.abstract (zh_TW): 近年來,3D電腦視覺的進展顯著提升了我們對複雜真實場景的理解能力。然而,真實世界的3D資料常因視角差異而變得不完整或存在跨視角的歧義,為穩健的3D建模帶來巨大挑戰。
本論文構建了一個統一的複雜3D視覺建模框架,研究分為物件層級與場景層級兩大任務。在物件層級中,首先解決了部分點雲的語義感知補全,以重建完整的幾何結構與語義標註;接著擴展單一模態生成至跨模態合成,將文字描述生成為3D模型。在場景層級中,我們示範了透過自然語言輸入精確定位3D場景中的目標物件,並在此基礎上執行場景級修補,移除指定物件後以多視角一致性技術填補缺失區域。這四大主題的系統整合顯著提升了建模的穩定性、生成品質、定位準確度與跨視角一致性,為沉浸式計算與機器人應用奠定了堅實基礎。
dc.description.abstract (en): Recent advancements in 3D computer vision have significantly enhanced our capability to interpret complex real-world scenes. However, real-world 3D data often exhibit incompleteness and ambiguity across viewpoints, presenting considerable challenges for robust 3D modeling. This thesis constructs a unified framework for complex 3D visual modeling, organized into two main categories: Object-Level and Scene-Level. In the Object-Level sections, we first explore 3D Modeling via Semantic-Aware Completion, reconstructing 3D objects with complete geometry and semantic labels from partial point clouds and semantic information; we then examine 3D Modeling via Cross-Modal Generation, extending single-modal point-cloud-to-point-cloud transformations to text-driven 3D content synthesis. For the Scene-Level sections, we begin by introducing 3D Modeling via Scene-Level Visual Grounding, demonstrating how natural language prompts can precisely identify target objects within 3D scenes; finally, with the capability of locating any object in a given 3D scene, we present 3D Modeling via Scene-Level Inpainting, showcasing methods to remove specified objects and coherently fill missing regions using multi-view-consistent techniques. Through the systematic integration of these four interconnected themes, our research demonstrates significant advancements in modeling robustness, generative fidelity, localization precision, and cross-view consistency, establishing a solid foundation for future applications in immersive computing, robotics, and beyond.
dc.description.provenance (en): Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2026-02-03T16:18:19Z. No. of bitstreams: 0
dc.description.provenance (en): Made available in DSpace on 2026-02-03T16:18:19Z (GMT). No. of bitstreams: 0
dc.description.tableofcontents:
致謝 (Acknowledgements) i
中文摘要 (Chinese Abstract) iii
Abstract v
Contents vii
List of Figures xi
List of Tables xix
1 3D Modeling via Semantic-Aware Completion 1
1.1 Publication Preface . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.1 Object completion . . . . . . . . . . . . . . . . . . . . . 4
1.3.2 Scene completion . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Proposed method . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4.1 Problem formulation and model overview . . . . . . . . . 6
1.4.2 Semantic-Prototype Variational Transformer . . . . . . . 7
1.4.3 Training and Inference . . . . . . . . . . . . . . . . . . . 11
1.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5.1 Dataset and Implementation Details . . . . . . . . . . . . 12
1.5.2 Semantic Point Cloud Completion . . . . . . . . . . . . . 13
1.5.3 Further Analysis and Ablation Study . . . . . . . . . . . . 14
1.6 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2 3D Modeling via Cross-Modal Generation 21
2.1 Publication Preface . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.1 Text-guided 3D object generation . . . . . . . . . . . . . 25
2.3.2 Diffusion models . . . . . . . . . . . . . . . . . . . . . . 26
2.4 Proposed method . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4.1 Problem Formulation and Model Overview . . . . . . . . 27
2.4.2 Phrase Disentanglement using LLM . . . . . . . . . . . . 29
2.4.3 Learning Shape-Color Diffusion Priors . . . . . . . . . . 30
2.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.5.1 Dataset and Implementation Details . . . . . . . . . . . . 32
2.5.2 Text-guided 3D object generation . . . . . . . . . . . . . 33
2.5.3 Extensional experiments on visual-guided 3D generation. . 36
2.5.4 Further analysis . . . . . . . . . . . . . . . . . . . . . . . 37
2.6 Limitation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3 3D Modeling via Scene-Level Visual Grounding 49
3.1 Publication Preface . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3.1 2D Visual Grounding . . . . . . . . . . . . . . . . . . . . 53
3.3.2 3D Visual Grounding . . . . . . . . . . . . . . . . . . . . 54
3.3.3 Data-Efficient 3D Visual Grounding . . . . . . . . . . . . 55
3.4 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.4.1 Problem Formulation and Model Overview . . . . . . . . 57
3.4.2 3D Visual Grounding with Order-Aware Object Referring 58
3.4.3 Order-Aware Warm-up with Synthetic Referential Order . 61
3.4.4 Overall Training Pipeline . . . . . . . . . . . . . . . . . . 63
3.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.5.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.5.2 Quantitative Results for Data Efficiency . . . . . . . . . . 64
3.5.3 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . 66
3.5.4 Qualitative Results . . . . . . . . . . . . . . . . . . . . . 67
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4 3D Modeling via Scene-Level Inpainting 69
4.1 Publication Preface . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3.1 3D Representations for Novel View Synthesis . . . . . . . 73
4.3.2 3D Scene Inpainting . . . . . . . . . . . . . . . . . . . . 74
4.4 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.4.1 Problem Definition and Model Overview . . . . . . . . . 75
4.4.2 Inferring Depth-Guided Inpainting Masks . . . . . . . . . 77
4.4.3 Inpainting-guided 3DGS Refinement . . . . . . . . . . . 78
4.4.4 Training and Inference . . . . . . . . . . . . . . . . . . . 80
4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.5.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.5.2 Quantitative Evaluations . . . . . . . . . . . . . . . . . . 83
4.5.3 Qualitative Results . . . . . . . . . . . . . . . . . . . . . 84
4.5.4 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . 85
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5 Conclusion 87
Reference 89
dc.language.iso: zh_TW
dc.subject: 深度學習
dc.subject: 電腦視覺
dc.subject: 3D 電腦視覺
dc.subject: 點雲
dc.subject: 3D生成
dc.subject: 3D視覺定位
dc.subject: 3D高斯潑濺
dc.subject: 場景修補
dc.subject: deep learning
dc.subject: computer vision
dc.subject: 3D computer vision
dc.subject: point cloud
dc.subject: 3D generation
dc.subject: 3D visual grounding
dc.subject: Gaussian Splatting
dc.subject: 3D scene inpainting
dc.title: 邁向複雜3D視覺建構: 從物件層級補全與生成,到場景層級跨模態定位與修復 (zh_TW)
dc.title: Towards Complex 3D Visual Modeling: From Object-Level Completion and Generation to Scene-Level Cross-Modal Localization and Inpainting (en)
dc.type: Thesis
dc.date.schoolyear: 114-1
dc.description.degree: 博士 (doctoral)
dc.contributor.oralexamcommittee: 陳祝嵩;陳煥宗;莊永裕;劉育綸;孫誠 (zh_TW)
dc.contributor.oralexamcommittee: Chu-Song Chen;Hwann-Tzong Chen;Yung-Yu Chuang;Yu-Lun Liu;Cheng Sun (en)
dc.subject.keyword: 深度學習,電腦視覺,3D 電腦視覺,點雲,3D生成,3D視覺定位,3D高斯潑濺,場景修補 (zh_TW)
dc.subject.keyword: deep learning, computer vision, 3D computer vision, point cloud, 3D generation, 3D visual grounding, Gaussian Splatting, 3D scene inpainting (en)
dc.relation.page: 115
dc.identifier.doi: 10.6342/NTU202600037
dc.rights.note: 同意授權(限校園內公開) (authorized for campus-only access)
dc.date.accepted: 2026-01-08
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 電信工程學研究所 (Graduate Institute of Communication Engineering)
dc.date.embargo-lift: 2026-02-04
Appears in Collections: 電信工程學研究所 (Graduate Institute of Communication Engineering)

Files in This Item:
File: ntu-114-1.pdf (28.67 MB, Adobe PDF). Access restricted to NTU campus IPs; off-campus users should connect via the library VPN service.