Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/8111
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 傅立成(Li-Chen Fu) | |
dc.contributor.author | Yun-Chih Guo | en |
dc.contributor.author | 郭耘志 | zh_TW |
dc.date.accessioned | 2021-05-20T00:48:59Z | - |
dc.date.available | 2024-03-01 | |
dc.date.available | 2021-05-20T00:48:59Z | - |
dc.date.copyright | 2021-03-03 | |
dc.date.issued | 2021 | |
dc.date.submitted | 2021-02-18 | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/8111 | - |
dc.description.abstract | 三維室內場景語意分割在電腦視覺中是一項十分熱門的研究項目,對於許多應用程序而言,準確了解場景中每個點的類別非常重要,受益於深度學習的發展,已經有許多基於體素和點的神經網絡被提出來解決語義分割問題。但是,大多數都沒有充分考慮空間結構的資訊。
本論文的目的是希望能夠提出一個系統,利用帶有顏色資訊的場景點雲對整個室內場景做語意分割。基於空間資料的稀疏性,我們設計了一個新穎的空間感知稀疏卷積運算。我們使用物體存在的空間資訊編碼作為額外的特徵,並使用自我注意力機制有效整合不同資訊;此外,我們引入了補全網路對分割網路的結果做修正,使場景中每種物體可以得到更加合理和完整的形狀。透過以上兩點方法,我們建立準確的場景語意分割網路,獲得整個場景的屬性分類。 在實驗的部分,我們使用兩個公開的資料集進行定量和定性的分析,首先我們對不同配置的模型進行比較,證明提出方法的有效性;其次將與其他最先進的方法進行比較,證明提出方法的優越性;最後則是一個實際應用的分析,說明具體的應用性。我們期望提出的三維場景語意分割系統能夠為實際應用提供準確而迅速的結果。 | zh_TW |
dc.description.abstract | 3D semantic segmentation of indoor scenes is a popular research topic in the field of computer vision. For many applications, it is very important to know exactly which category each point in the scene belongs to. Benefiting from the development of deep learning, many neural networks based on voxels and points have been proposed to solve semantic segmentation problems. However, most of them do not fully consider the information of the spatial structure.
Current voxel-based sparse convolutional neural networks can effectively extract 3D features in space. However, they assume that the features in empty space are zero, which causes a loss of spatial structure information. In this thesis, we propose a system that uses scene point clouds with color information to semantically segment an entire indoor scene. Based on the sparsity of spatial data, we design a novel spatial-aware sparse convolution operation (an illustrative sketch of this idea follows the metadata table below). We encode the spatial occupancy of objects as an additional feature and use a self-attention mechanism to aggregate features effectively. In addition, we introduce a completion network to refine the results of the segmentation network, so that each object in the scene obtains a more reasonable and complete shape. Through these two methods, we build an accurate scene semantic segmentation network that obtains the semantic information of the entire scene. In the experimental part, we use two public datasets to perform quantitative and qualitative analysis. We compare our results with other state-of-the-art methods to demonstrate the superiority of the method, and we examine our model under different configurations to verify the effectiveness of the proposed components. Finally, a real-world application is introduced to demonstrate our work. We expect that the proposed 3D scene semantic segmentation system can provide accurate and fast results for practical applications. | en |
dc.description.provenance | Made available in DSpace on 2021-05-20T00:48:59Z (GMT). No. of bitstreams: 1 U0001-1702202112242500.pdf: 13993690 bytes, checksum: ae43984c019fe772e473d08fda028fe8 (MD5) Previous issue date: 2021 | en |
dc.description.tableofcontents | 口試委員審定書 i; 致謝 ii; 中文摘要 iii; Abstract iv; Contents vi; List of Figures ix; List of Tables xi; Chapter 1 Introduction 1; 1.1 Motivation 2; 1.2 Challenge 3; 1.3 Related Work 6; 1.4 Contribution 9; 1.5 Thesis Organization 10; Chapter 2 Preliminaries 12; 2.1 Convolutional Neural Networks 12; 2.1.1 Basic Components 14; 2.1.2 ResNet 22; 2.1.3 3D Convolution 22; 2.2 Semantic Segmentation Framework 23; 2.2.1 FCN 24; 2.2.2 U-Net 25; 2.3 Sparse Convolutional Neural Networks 26; 2.3.1 Sparse Tensor Representation 28; 2.3.2 Spatial Hashing 29; 2.3.3 Output Coordinate Generation 29; 2.3.4 In-out Kernel Mapping 30; 2.3.5 Sparse Convolution 31; Chapter 3 Methodology 32; 3.1 Model Overview 32; 3.2 Spatial-aware Convolution Layer 33; 3.2.1 Spatial Occupancy Encoding 35; 3.2.2 Self-attention Aggregation 37; 3.3 Segmentation Net 39; 3.4 Completion Net 41; 3.5 Model Training 45; Chapter 4 Experiment Results 47; 4.1 Settings of Environment 47; 4.2 Datasets and Evaluation Metrics 48; 4.2.1 ScanNet 48; 4.2.2 Stanford 3D Large-Scale Indoor Spaces (S3DIS) 49; 4.2.3 Mean Accuracy and mean Intersection of Union (mIoU) 49; 4.3 Experimental Results 53; 4.3.1 Ablation Study 53; 4.3.2 Comparison with State-of-the-art 55; 4.3.3 Qualitative Results 56; 4.4 Real-World AR Application 59; Chapter 5 Conclusion 62; References 63 | |
dc.language.iso | en | |
dc.title | 基於空間感知卷積及形狀補全應用於擴增實境之三維語意分割 | zh_TW |
dc.title | 3D Semantic Segmentation based on Spatial-aware Convolution and Shape Completion for Augmented Reality Applications | en |
dc.type | Thesis | |
dc.date.schoolyear | 109-1 | |
dc.description.degree | 碩士 | |
dc.contributor.oralexamcommittee | 歐陽明(Ouh-Young Ming),鄭龍磻(Lung-Pan Chen),洪一平(Yi-Ping Hung) | |
dc.subject.keyword | 場景語意分割,稀疏卷積網路,空間感知卷積,補全網路,擴增實境 | zh_TW |
dc.subject.keyword | Scene Semantic Segmentation, Sparse Convolutional Network, Spatial-aware Convolution, Completion Network, Augmented Reality | en |
dc.relation.page | 67 | |
dc.identifier.doi | 10.6342/NTU202100720 | |
dc.rights.note | 同意授權(全球公開) | |
dc.date.accepted | 2021-02-18 | |
dc.contributor.author-college | 電機資訊學院 | zh_TW |
dc.contributor.author-dept | 資訊工程學研究所 | zh_TW |
dc.date.embargo-lift | 2024-03-01 | - |
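
Illustrative sketch (not part of the original record): the abstract above describes encoding the spatial occupancy of a scene as an additional feature and fusing it with the learned features through self-attention before segmentation. The minimal PyTorch sketch below shows one way such a fusion could look; the class name, tensor shapes, and the attention-style gate are assumptions made purely for illustration. The thesis itself operates on sparse convolutions over voxelized point clouds, whereas this sketch uses dense 3D convolutions as a simplified stand-in.

```python
# Hypothetical sketch of a "spatial-aware" convolution block: a feature branch
# plus an occupancy branch, fused by a learned attention-style gate.
# This is NOT the thesis implementation; all names and shapes are assumptions.
import torch
import torch.nn as nn


class SpatialAwareConvBlock(nn.Module):
    """Fuse appearance features with a spatial-occupancy encoding."""

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # Branch 1: ordinary 3D convolution over the input features (e.g. colors).
        self.feat_conv = nn.Conv3d(in_channels, out_channels, kernel_size, padding=pad)
        # Branch 2: encode the binary occupancy grid (where points exist in space).
        self.occ_conv = nn.Conv3d(1, out_channels, kernel_size, padding=pad)
        # Attention-style gate deciding, per voxel and channel, how to mix branches
        # (a simplification of the self-attention aggregation named in the abstract).
        self.attn = nn.Sequential(
            nn.Conv3d(2 * out_channels, out_channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, feats: torch.Tensor, occupancy: torch.Tensor) -> torch.Tensor:
        # feats:     (B, C_in, D, H, W) voxelized input features
        # occupancy: (B, 1, D, H, W) binary mask, 1 where a point exists
        f = self.feat_conv(feats)
        o = self.occ_conv(occupancy)
        a = self.attn(torch.cat([f, o], dim=1))  # gate values in (0, 1)
        return a * f + (1.0 - a) * o             # attended fusion of both branches


if __name__ == "__main__":
    block = SpatialAwareConvBlock(in_channels=3, out_channels=16)
    colors = torch.randn(1, 3, 32, 32, 32)               # RGB features per voxel
    occ = (torch.rand(1, 1, 32, 32, 32) > 0.9).float()   # sparse occupancy mask
    print(block(colors, occ).shape)                       # torch.Size([1, 16, 32, 32, 32])
```

In the pipeline described by the abstract, a fusion of this kind would operate on sparse voxel features, and a separate completion network would further refine the segmented shapes; both of those parts are omitted from this sketch.
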
Appears in Collections: | 資訊工程學系
Files in This Item:
File | Size | Format | |
---|---|---|---|
U0001-1702202112242500.pdf | 13.67 MB | Adobe PDF | View/Open |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.