Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/84547

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 傅立成(Li-Chen Fu) | |
| dc.contributor.author | Po-Hui Wang | en |
| dc.contributor.author | 王博輝 | zh_TW |
| dc.date.accessioned | 2023-03-19T22:15:11Z | - |
| dc.date.copyright | 2022-10-14 | |
| dc.date.issued | 2022 | |
| dc.date.submitted | 2022-09-21 | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/84547 | - |
| dc.description.abstract | 近年來,擴增實境開始被廣泛應用在各種場域,如倉儲、展演及教育等地。在相關應用的開發基礎中,能僅靠RGB影像就同時做到理解空間資訊及追蹤使用者姿態的定位演算法更是尤為重要。然而,目前市面上的相關套件普遍存在著一些限制,如Google公司開發的ARCore因為硬體需求,僅能使用在經過官方認證的設備上。這對於想要在自己的設備上進行擴增實境應用開發的一般使用者,造成了相當大的不便與阻礙。 在這篇論文中,我們提出了一個基於深度學習網路及SLAM的場景理解定位系統,可以僅靠RGB影像就同時做到場景理解及使用者定位。使用機器學習的方法,推導出使用者所處環境中有何種物件及其相對姿態,並透過SLAM匹配連續RGB影像幀的ORB特徵點來計算出使用者在空間中的位置及姿態。為了驗證此系統的有效性,我們使用3D引擎Unity作為開發平台,將此系統應用於我們設計的AR遠程會議應用中。使用者可透過配置於擴增實境眼鏡上的RGB攝像頭獲得周圍的環境資訊及使用者姿態,進而在鏡片上投影出其它會議參與者的虛擬分身在真實環境中的適當位置。 在實驗的部分,我們對系統中的各個模組進行了多種質性及量化分析,並邀請了數名受試者,請他們嘗試使用此AR遠端會議程式以評斷場景理解定位系統的好壞。由實驗結果及受試者的回饋問卷得知,此系統的定位穩定性及準確率獲得大多數使用者的肯定。我們期望我們所提出的場景理解定位系統能夠為更多獨立開發擴增實境應用的使用者,提供堅實且穩定的基礎技術。 | zh_TW |
| dc.description.abstract | Augmented reality has been widely adopted in fields such as warehousing, performance, and education. Among the technologies underlying such applications, a positioning algorithm that can simultaneously understand spatial information and track the user's pose using only RGB images is especially important. However, the development kits currently on the market generally have limitations; for example, ARCore, developed by Google, runs only on officially certified devices because of its hardware requirements. This prevents developers who would like to build augmented reality applications on their own devices from using those kits. In this thesis, we propose a scene understanding and positioning system based on neural networks and SLAM (Simultaneous Localization and Mapping) that achieves scene understanding and user localization simultaneously using only RGB images. A machine learning model first infers the classes and relative poses of the objects in the user's environment, and the user's position and pose in space are then computed by matching feature points across consecutive RGB image frames with SLAM. To verify the effectiveness of our system, we use the Unity 3D engine as the development platform and apply the system to an AR teleconference application we designed. The user obtains the surrounding environment information and the pose of an object of interest (say, a chair) through the RGB camera on a pair of augmented reality glasses, and the virtual avatar of another conference participant is then projected at an appropriate location (say, on that chair) in the real environment. In the experiments, besides demonstrating the effectiveness and efficiency of the developed system, we invite several subjects to try our AR teleconference application in a user study that collects their impressions of the quality of the scene understanding and positioning system. According to their feedback, most users agree that our system provides real-time, stable, and accurate positioning. We hope that the proposed scene understanding and positioning system offers a solid foundation that lets more people develop augmented reality applications on different devices more easily. | en |
| dc.description.provenance | Made available in DSpace on 2023-03-19T22:15:11Z (GMT). No. of bitstreams: 1 U0001-1908202218200400.pdf: 7607513 bytes, checksum: 0d8dfe173665ef416fb5d750ee3ba83f (MD5) Previous issue date: 2022 | en |
| dc.description.tableofcontents | 口試委員會審定書 i 中文摘要 ii ABSTRACT iii CONTENTS v LIST OF FIGURES vii LIST OF TABLES x Chapter 1 Introduction 1 1.1 Background 1 1.2 Motivation 4 1.3 Related work 6 1.3.1 Scene Understanding from a Single Image 6 1.3.2 Simultaneous Localization and Mapping 7 1.4 Contribution 9 1.5 Thesis organization 10 Chapter 2 Preliminaries 12 2.1 Convolutional Neural Network 12 2.1.1 Basic Components 14 2.1.2 ResNet 19 2.2 Scene Understanding Frameworks 21 2.2.1 YOLO 21 2.2.2 SGCN 23 2.3 Simultaneous Localization and Mapping 24 Chapter 3 Scene Understanding and Positioning 26 3.1 System Overview 26 3.2 Scene Understanding Module 27 3.2.1 Single-Image Scene Understanding Model 28 3.2.2 Model Fine-tuning 30 3.2.3 Post-processing Algorithms 31 3.3 Spatial Positioning Module 33 3.3.1 Interoperation Framework between Unity and SLAM 33 3.3.2 Integration of SLAM with Scene Understanding Module 36 3.4 Render Virtual Object Module 37 3.4.1 Position Correspondence of Real and Virtual Spaces 37 3.4.2 Translation Error Problem 41 3.5 Post-processing of Layout 44 Chapter 4 Experiments 46 4.1 Settings of Experimental Environment 46 4.2 Datasets and Evaluation Metrics 47 4.2.1 SUN RGB-D Dataset 48 4.2.2 Confusion Matrix and Mean Average Precision (mAP) 50 4.3 Implementation Details 53 4.4 Experimental Results 53 4.4.1 Scene Understanding Module 54 4.4.2 Spatial Positioning Module 55 4.5 AR Teleconference Application 62 4.5.1 User study 65 Chapter 5 Conclusions 67 REFERENCE 69 | |
| dc.language.iso | en | |
| dc.subject | 場景理解 | zh_TW |
| dc.subject | 擴增實境 | zh_TW |
| dc.subject | Unity | zh_TW |
| dc.subject | 同時定位與地圖構建 | zh_TW |
| dc.subject | 類神經網路 | zh_TW |
| dc.subject | Unity | en |
| dc.subject | Augmented Reality | en |
| dc.subject | Neural Networks | en |
| dc.subject | Scene Understanding | en |
| dc.subject | Simultaneous Localization and Mapping | en |
| dc.title | 基於RGB影像之場景理解定位系統應用於擴增實境遠程會議 | zh_TW |
| dc.title | A Scene Understanding and Positioning System from RGB Images for Teleconference Application in Augmented Reality | en |
| dc.type | Thesis | |
| dc.date.schoolyear | 110-2 | |
| dc.description.degree | Master | |
| dc.contributor.oralexamcommittee | 歐陽明(Ming Ouhyoung),莊永裕(Yung-Yu Chuang),洪一平(Yi-Ping Hung) | |
| dc.subject.keyword | 擴增實境,類神經網路,場景理解,同時定位與地圖構建,Unity | zh_TW |
| dc.subject.keyword | Augmented Reality, Neural Networks, Scene Understanding, Simultaneous Localization and Mapping, Unity | en |
| dc.relation.page | 74 | |
| dc.identifier.doi | 10.6342/NTU202202599 | |
| dc.rights.note | Authorized for release (access restricted to campus) | |
| dc.date.accepted | 2022-09-23 | |
| dc.contributor.author-college | 電機資訊學院 | zh_TW |
| dc.contributor.author-dept | 資訊網路與多媒體研究所 | zh_TW |
| dc.date.embargo-lift | 2025-09-19 | - |
| Appears in Collections: | 資訊網路與多媒體研究所 | |
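The abstracts above describe a positioning pipeline in which ORB feature points are matched across consecutive RGB frames to recover the user's pose. As a rough illustration of that matching-and-pose-recovery step only, here is a minimal Python sketch using OpenCV; it is not code from the thesis, and the intrinsic matrix `K`, the feature budget, and the frame file names are hypothetical placeholders.

```python
import cv2
import numpy as np

# Hypothetical pinhole intrinsics; a real system uses calibrated values.
K = np.array([[525.0,   0.0, 319.5],
              [  0.0, 525.0, 239.5],
              [  0.0,   0.0,   1.0]])

def relative_pose(prev_gray, curr_gray, K):
    """Estimate the relative camera rotation R and unit-scale translation t
    between two consecutive grayscale frames by matching ORB features."""
    orb = cv2.ORB_create(nfeatures=2000)
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(curr_gray, None)
    if des1 is None or des2 is None:
        raise ValueError("too few ORB features detected")
    # Hamming distance suits ORB's binary descriptors; cross-checking
    # keeps only mutually best matches.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:500]
    if len(matches) < 5:
        raise ValueError("not enough matches for the essential matrix")
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    # RANSAC rejects outlier matches while fitting the essential matrix.
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                   prob=0.999, threshold=1.0)
    # recoverPose returns R and a unit-length t: monocular scale is unknown.
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t

# Hypothetical usage with two saved frames:
# f0 = cv2.imread("frame0.png", cv2.IMREAD_GRAYSCALE)
# f1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
# R, t = relative_pose(f0, f1, K)
```

Two-view recovery of this kind determines translation only up to an unknown scale; a full SLAM system, such as the one the thesis integrates with Unity, additionally maintains a map, tracks against it, and corrects drift, none of which this sketch covers.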
Files in this item:
| File | Size | Format |
|---|---|---|
| U0001-1908202218200400.pdf (access restricted to NTU campus IP addresses; use the NTU VPN service from off campus) | 7.43 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
