Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90523

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 傅立成 | zh_TW |
| dc.contributor.advisor | Li-Chen Fu | en |
| dc.contributor.author | 陳建婷 | zh_TW |
| dc.contributor.author | Chien-Ting Chen | en |
| dc.date.accessioned | 2023-10-03T16:28:25Z | - |
| dc.date.available | 2023-11-09 | - |
| dc.date.copyright | 2023-10-03 | - |
| dc.date.issued | 2023 | - |
| dc.date.submitted | 2023-05-20 | - |
| dc.identifier.citation | Burgess, N. (2008). Spatial cognition and the brain. Annals of the New York Academy of Sciences, 1124(1), 77-97.
Zhang, Y., Tian, G., Lu, J., Zhang, M., & Zhang, S. (2019). Efficient dynamic object search in home environment by mobile robot: A priori knowledge-based approach. IEEE Transactions on Vehicular Technology, 68(10), 9466-9477.
Zhang, Y., Tian, G., Shao, X., Zhang, M., & Liu, S. (2022). Semantic grounding for long-term autonomy of mobile robots toward dynamic object search in home environments. IEEE Transactions on Industrial Electronics, 70(2), 1655-1665.
Zeng, Z., Röfer, A., & Jenkins, O. C. (2020, May). Semantic linking maps for active visual object search. In 2020 IEEE International Conference on Robotics and Automation (ICRA) (pp. 1984-1990). IEEE.
Huang, P.-H. (2022). User intent-driven navigation of home service robot based on semantic scene cognition. Master's thesis, National Taiwan University (http://tul.blog.ntu.edu.tw/archives/28277). doi: 10.6342/NTU202202718.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (pp. 8748-8763). PMLR.
Duan, H., Zhao, Y., Chen, K., Lin, D., & Dai, B. (2022). Revisiting skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2969-2978).
Redmon, J., & Farhadi, A. (2017). YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7263-7271).
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2016). SSD: Single shot multibox detector. In Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part I (pp. 21-37). Springer International Publishing.
Lu, D., & Weng, Q. (2007). A survey of image classification methods and techniques for improving classification performance. International Journal of Remote Sensing, 28(5), 823-870.
Zou, Z., Chen, K., Shi, Z., Guo, Y., & Ye, J. (2023). Object detection in 20 years: A survey. Proceedings of the IEEE.
Garcia-Garcia, A., Orts-Escolano, S., Oprea, S., Villena-Martinez, V., & Garcia-Rodriguez, J. (2017). A review on deep learning techniques applied to semantic segmentation. arXiv preprint arXiv:1704.06857.
https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Annotating_Object_Instances_with_a_Polygon_RNN
Glumov, N. I., Kolomiyetz, E. I., & Sergeyev, V. V. (1995). Detection of objects on the image using a sliding window mode. Optics & Laser Technology, 27(4), 241-249.
Felzenszwalb, P., McAllester, D., & Ramanan, D. (2008, June). A discriminatively trained, multiscale, deformable part model. In 2008 IEEE Conference on Computer Vision and Pattern Recognition (pp. 1-8). IEEE.
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2015). Region-based convolutional networks for accurate object detection and segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1), 142-158.
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28.
O'Shea, K., & Nash, R. (2015). An introduction to convolutional neural networks. arXiv preprint arXiv:1511.08458.
Hearst, M. A., Dumais, S. T., Osuna, E., Platt, J., & Scholkopf, B. (1998). Support vector machines. IEEE Intelligent Systems and their Applications, 13(4), 18-28.
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 779-788).
Redmon, J., & Farhadi, A. (2018). YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767.
Jocher, G., Chaurasia, A., Stoken, A., Borovec, J., Kwon, Y., Michael, K., ... & Jain, M. (2022). ultralytics/yolov5: v7.0 - YOLOv5 SOTA Realtime Instance Segmentation. Zenodo.
Wang, C. Y., Bochkovskiy, A., & Liao, H. Y. M. (2022). YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv preprint arXiv:2207.02696.
Garcia-Garcia, A., Orts-Escolano, S., Oprea, S., Villena-Martinez, V., & Garcia-Rodriguez, J. (2017). A review on deep learning techniques applied to semantic segmentation. arXiv preprint arXiv:1704.06857.
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3431-3440).
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III (pp. 234-241). Springer International Publishing.
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2961-2969).
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).
Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2881-2890).
Druon, R., Yoshiyasu, Y., Kanezaki, A., & Watt, A. (2020). Visual object search by learning spatial context. IEEE Robotics and Automation Letters, 5(2), 1279-1286.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., ... & Kavukcuoglu, K. (2016, June). Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning (pp. 1928-1937). PMLR.
Chaplot, D. S., Gandhi, D. P., Gupta, A., & Salakhutdinov, R. R. (2020). Object goal navigation using goal-oriented semantic exploration. Advances in Neural Information Processing Systems, 33, 4247-4258.
Anguita, D., Ghio, A., Oneto, L., Parra, X., & Reyes-Ortiz, J. L. (2013, April). A public domain dataset for human activity recognition using smartphones. In ESANN (Vol. 3, p. 3).
Spriggs, E. H., De La Torre, F., & Hebert, M. (2009, June). Temporal segmentation and activity classification from first-person sensing. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (pp. 17-24). IEEE.
Carreira, J., Noland, E., Hillier, C., & Zisserman, A. (2019). A short note on the Kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987.
Chen, C. H., & Ramanan, D. (2017). 3D human pose estimation = 2D pose estimation + matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7035-7043).
Toshev, A., & Szegedy, C. (2014). DeepPose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1653-1660).
Fang, H. S., Xie, S., Tai, Y. W., & Lu, C. (2017). RMPE: Regional multi-person pose estimation. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2334-2343).
Xu, W., Wu, M., Zhu, J., & Zhao, M. (2021). Multi-scale skeleton adaptive weighted GCN for skeleton-based human action recognition in IoT. Applied Soft Computing, 104, 107236.
Ahmad, T., Jin, L., Zhang, X., Lai, S., Tang, G., & Lin, L. (2021). Graph convolutional neural network for human action recognition: A comprehensive survey. IEEE Transactions on Artificial Intelligence, 2(2), 128-145.
Wang, X., Guo, S., Chen, J., Chen, P., & Cui, G. (2022). GCN-enhanced multidomain fusion network for through-wall human activity recognition. IEEE Geoscience and Remote Sensing Letters, 19, 1-5.
Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 4489-4497).
Siyal, A. R., Bhutto, Z., Shah, S. M. S., Iqbal, A., Mehmood, F., Hussain, A., & Saleem, A. (2020). Still image-based human activity recognition with deep representations and residual learning. International Journal of Advanced Computer Science and Applications, 11(5).
Snehitha, B., Sreeya, R. S., & Manikandan, V. M. (2021, December). Human activity detection from still images using deep learning techniques. In 2021 International Conference on Control, Automation, Power and Signal Processing (CAPS) (pp. 1-5). IEEE.
McDermott, K. B., & Roediger, H. L. (2018). Memory (encoding, storage, retrieval). General Psychology FA2018. Noba Project: Milwaukie, OR, 117-153.
Melton, A. W. (1963). Implications of short-term memory for a general theory of memory. Journal of Verbal Learning and Verbal Behavior, 2(1), 1-21.
https://gohighbrow.com/the-brain-and-memory/
McDermott, K. B., & Roediger, H. L. (2018). Memory (encoding, storage, retrieval). General Psychology FA2018. Noba Project: Milwaukie, OR, 117-153.
McLeod, S. A. (2007). Stages of memory: encoding, storage and retrieval. Retrieved, 21, 2015.
Hunt, R. R. (2003). Two contributions of distinctive processing to accurate memory. Journal of Memory and Language, 48(4), 811-825.
Craik, F. I., & Lockhart, R. S. (1972). Levels of processing: A framework for memory research. Journal of Verbal Learning and Verbal Behavior, 11(6), 671-684.
Bower, G. H., & Reitman, J. S. (1972). Mnemonic elaboration in multilist learning. Journal of Verbal Learning and Verbal Behavior, 11(4), 478-485.
Hunt, R. R., & McDaniel, M. A. (1993). The enigma of organization and distinctiveness. Journal of Memory and Language, 32(4), 421-445.
McGeoch, J. A. (1932). Forgetting and the law of disuse. Psychological Review, 39(4), 352.
Laney, C., & Loftus, E. F. (2023). Eyewitness testimony and memory biases. Australasian Policing, 15(1), 39-44.
Tulving, E., & Thomson, D. M. (1973). Encoding specificity and retrieval processes in episodic memory. Psychological Review, 80(5), 352.
Ramakrishnan, S. K., Gokaslan, A., Wijmans, E., Maksymets, O., Clegg, A., Turner, J., ... & Batra, D. (2021). Habitat-Matterport 3D Dataset (HM3D): 1000 large-scale 3D environments for embodied AI. arXiv preprint arXiv:2109.08238.
Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A., & Torralba, A. (2019). Semantic understanding of scenes through the ADE20K dataset. International Journal of Computer Vision, 127, 302-321.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1-9).
Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., & Torralba, A. (2017). Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 633-641).
McCormac, J., Handa, A., Davison, A., & Leutenegger, S. (2017, May). SemanticFusion: Dense 3D semantic mapping with convolutional neural networks. In 2017 IEEE International Conference on Robotics and Automation (ICRA) (pp. 4628-4635). IEEE.
Amanatides, J., & Woo, A. (1987, August). A fast voxel traversal algorithm for ray tracing. In Eurographics (Vol. 87, No. 3, pp. 3-10).
Schroff, F., Kalenichenko, D., & Philbin, J. (2015). FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 815-823).
Shahroudy, A., Liu, J., Ng, T. T., & Wang, G. (2016). NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1010-1019).
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., & Parikh, D. (2015). VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2425-2433).
Zellers, R., Bisk, Y., Farhadi, A., & Choi, Y. (2019). From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6720-6731).
Hong, Y., Wu, Q., Qi, Y., Rodriguez-Opazo, C., & Gould, S. (2020). A recurrent vision-and-language BERT for navigation. arXiv preprint arXiv:2011.13922.
Wang, Z., Yu, J., Yu, A. W., Dai, Z., Tsvetkov, Y., & Cao, Y. (2021). SimVLM: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904.
Speer, R., Chin, J., & Havasi, C. (2017, February). ConceptNet 5.5: An open multilingual graph of general knowledge. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 31, No. 1).
Thagard, P., & Schröder, T. (2014). Emotions as semantic pointers: Constructive neural mechanisms. In The Psychological Construction of Emotions. New York: Guilford.
Srinivasa-Desikan, B. (2018). Natural Language Processing and Computational Linguistics: A Practical Guide to Text Analysis with Python, Gensim, spaCy, and Keras. Packt Publishing Ltd.
Liu, V., & Chilton, L. B. (2022, April). Design guidelines for prompt engineering text-to-image generative models. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (pp. 1-23).
Savva, M., Kadian, A., Maksymets, O., Zhao, Y., Wijmans, E., Jain, B., ... & Batra, D. (2019). Habitat: A platform for embodied AI research. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 9339-9347).
Ramakrishnan, S. K., Gokaslan, A., Wijmans, E., Maksymets, O., Clegg, A., Turner, J., ... & Batra, D. (2021). Habitat-Matterport 3D Dataset (HM3D): 1000 large-scale 3D environments for embodied AI. arXiv preprint arXiv:2109.08238.
Jang, J., Kim, D., Park, C., Jang, M., Lee, J., & Kim, J. (2020, October). ETRI-Activity3D: A large-scale RGB-D dataset for robots to recognize daily activities of the elderly. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 10990-10997). IEEE.
Liu, J., Shahroudy, A., Perez, M., Wang, G., Duan, L. Y., & Kot, A. C. (2019). NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10), 2684-2701.
Anderson, P., Chang, A., Chaplot, D. S., Dosovitskiy, A., Gupta, S., Koltun, V., ... & Zamir, A. R. (2018). On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757. | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90523 | - |
| dc.description.abstract | 隨著醫療技術的進步,世界正面臨著人口高齡化及青年勞動力不足的問題,高齡長輩們的長期照護給居家服務型機器人帶來了大量需求。而隨著年齡增長,人類的記憶認知能力逐漸退化,故機器人具備物件搜索的推理決策力尤為重要,此外,為了使機器人能更好的提供居家照護服務,機器人須擁有對物品、環境與人類行為的辨識力,以及編碼、儲存及提取記憶的認知力,其中,在人類社會中,又以語言為最方便的互動方式,故機器人需要具備能理解人類語言並迅速給予人類適當回應的語言理解力,才能在人機互動中實現良好的移動導航。且為了能讓機器人能更輕巧便利適用於居家環境中,在設計系統架構時,考慮機器人上有限的感測器及運算資源也是一大難題。
在本研究中,我們提出了一個基於人類行為及記憶認知的物件搜索推理系統,在機器人上架設RGBD相機與2D Laser感測器,讓機器人在第一次進入新環境中可以在考慮不同場景的狀況下自主探索環境,使其對於新環境有初步的空間認知。此外,透過計算資料集裡物體出現的位置分布給予機器人先前的知識,使機器人對於目標物出現的地點有概念,除了使用2D Laser建立的2D平面地圖外,我們也利用RGBD相機搭配語義分割模型及場景辨識模型產生具有語義的點雲,建構擁有物件及場景的3D語義地圖,並在日後可以重複利用及更新此地圖來完成更多的動態物件搜索。另一方面,我們使用視覺語言模型CLIP來推論人類行為,將描述人類行為的文字與相機影像上的特徵值做相似度比對,最終取出相似度最高的文字當作影像中的人類行為,此模型在實體實驗中能實時推論出人類行為。接著先挑選出人類的重要行為,再將人類行為、行為人、行為地點、行為時間儲存成記憶,並利用ConceptNet所建立的常識圖譜得到與目標物相關的目標行為,並從記憶中提取與此目標行為最相關的目標人物,控制機器人找到目標人物,並與人類進行互動的語言中理解並得出目標物件的位置。 | zh_TW |
| dc.description.abstract | With the advancement of medical technology and the increase in life expectancy, the world is facing an aging population and a shortage of young labor, and the long-term care of the elderly has created a large demand for home care robots. As people age, their memory and cognitive abilities gradually degrade, and it becomes particularly difficult for them to remember where different objects are placed in the house. It is therefore important for a robot to be able to detect objects, scenes, and human activities, and to reason and make decisions so that it can search for objects on a person's behalf. In addition, to provide good home care service, the robot needs the cognitive ability to encode, store, and retrieve memories. Since language is the most convenient form of interaction in human society, the robot also needs language understanding so that it can comprehend what people say and respond appropriately and quickly, which in turn enables good mobile navigation during human-robot interaction. Finally, to keep the robot lightweight and convenient for the home environment, designing the system around a limited set of sensors and limited computing resources is itself a considerable challenge.
In this research, we develop a home care mobile robot system that can reason about the location of a target object and navigate to find it based on human activity inference and cognitive memory. The robot aims to help people search for objects misplaced in the home environment due to forgetting, whether the lost objects are personal or shared. The system is decomposed into three parts: the first is the object-goal navigation module, which includes the exploration and spatial surrounding detection sub-modules used to construct the semantic voxel map; the second is the cognitive memory module, which performs human and activity recognition and stores the encoded information in the robot's episodic memory so that memories concerning a lost object can be retrieved when necessary; the last is the interaction module, which infers the position of the target object by interacting with the relevant people. In our design, the robot, carrying RGBD cameras and 2D laser sensors, explores the environment autonomously while simultaneously constructing a 3D semantic map of scenes and objects that can be updated and reused to infer the locations of target objects in the future. To make the inference more human-like, we leverage existing large-scale data to calculate the location distribution of objects ahead of time, collect cues about human identities and their activities through face and human activity recognition, build a memory that stores these cues together with time and place, and finally use a commonsense knowledge graph to reason about where the target (lost) object might be located while interacting with people logically. We validate the proposed method with respect to SR (success rate), SPL (success weighted by path length), and DTS (distance to success) in a simulated environment, and the results show that our method searches for the target object efficiently and robustly, with a shorter path length and a higher success rate than three comparison methods from the literature. We further demonstrate the system in a real-world home environment, where it searches for either a personal object belonging to a single user or a public object shared by several users. (An illustrative code sketch of the CLIP-based activity inference and the SPL metric described here appears below, after the metadata table.) | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-10-03T16:28:25Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2023-10-03T16:28:25Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | 致謝 I
摘要 II
ABSTRACT III
TABLE OF CONTENTS V
LIST OF FIGURES VIII
LIST OF TABLES X
Chapter 1 Introduction 1
1.1 Background and Motivation 1
1.2 Research Objectives 3
1.3 Contributions 4
1.4 Thesis Overview 6
Chapter 2 Preliminaries 7
2.1 Object Detection and Semantic Segmentation 7
2.1.1 Object Detection 8
2.1.2 Semantic Segmentation 11
2.2 Object-Goal Navigation and Object Search 12
2.3 Human Activity Recognition 15
2.4 Memory Encoding, Storage and Retrieval 19
2.5 ROS (Robot Operating System) 23
Chapter 3 Methodology 25
3.1 System Overview 25
3.2 Exploration Module 28
3.3 Spatial Surrounding Detection Module 29
3.3.1 Semantic Segmentation 30
3.3.2 Scene Recognition 30
3.4 Object Prior Probability 31
3.5 Semantic Voxel Map 34
3.5.1 Navigation 40
3.6 Human and Activity Recognition 41
3.6.1 Human Activity Inference 43
3.7 Knowledge Graph Common Sense Module 46
3.8 Cognitive Memory Module 47
3.8.1 Definition of Memory Feature, Semantic Element and Event 49
3.8.2 Semantic Pointer 49
3.8.3 Memory Encoding and Storage 50
3.8.4 Memory Retrieval 51
3.9 Interaction Module 53
3.9.1 Robot Asking 54
3.9.2 Conversation Construction 55
Chapter 4 Experiment 59
4.1 Semantic Voxel Map 59
4.1.1 Simulator Setup 59
4.1.2 Object Prior Probability 61
4.1.3 Constructing the Map 67
4.2 Human Activity Inference 70
4.3 Knowledge Graph 77
4.4 Conversation Construction 78
4.5 Object Search 83
4.6 Real-world 87
Chapter 5 Conclusion 94
Reference 96
Figure 2-1 Classification, object detection, semantic segmentation [13] 8
Figure 2-2 A grid of cells, bounding box and the probability map in YOLOv1 [20] 9
Figure 2-3 Pose estimation [37] 17
Figure 2-4 Different memory in different brain regions [43] 20
Figure 2-5 Publish and subscribe in ROS 24
Figure 3-1 System architecture 25
Figure 3-2 System flow 27
Figure 3-3 Frontier in grid map 28
Figure 3-4 Semantic segmentation overview [56] 30
Figure 3-5 Scene recognition model [5] 31
Figure 3-6 Semantic voxel map overview 35
Figure 3-7 3D projection 36
Figure 3-8 Voxel map semantic information 37
Figure 3-9 Raycasting to filter the noise voxels 39
Figure 3-10 Flow of the navigation rule 40
Figure 3-11 PoseC3D architecture [7] 43
Figure 3-12 Our CLIP process for human activity recognition 44
Figure 3-13 The relationships in ConceptNet 47
Figure 3-14 The relationships between the object and the activity 47
Figure 3-15 Human memory hierarchy 48
Figure 3-16 Semantic element in an event 51
Figure 4-1 The robot in Habitat-Sim 60
Figure 4-2 Home environment in Habitat-Sim 68-70
Figure 4-3 The confusion matrix of each class from NTU RGB+D 120 using only image embedding 73
Figure 4-4 The confusion matrix of each class from ETRI using only image embedding 74
Figure 4-5 The confusion matrix of each class from ETRI using both image embedding and text embedding 74
Figure 4-6 The confusion matrix of each class from ETRI using the bounding box image 75
Figure 4-7 The confusion matrix of each class from ETRI using the square image 75
Figure 4-8 Knowledge graph in Neo4j 77
Figure 4-9 Mean score of the 6 questions in Questionnaire 1 81
Figure 4-10 Mean score of the 4 questions in Questionnaire 2 82
Figure 4-11 Searching for the book at the beginning 83
Figure 4-12 Searching for the book in the middle 84
Figure 4-13 Successfully finding the book at the end 84
Figure 4-14 Successfully finding the toy at the end 84
Figure 4-15 The sensors on OREO 88
Figure 4-16 The home environment in YongLing 412 89
Figure 4-17 Grid map of YongLing 412 89
Figure 4-18 Semantic voxel map of YongLing 412 89
Figure 4-19 Experiment of searching for the personal object - bottle 92
Figure 4-20 Experiment of searching for the public object - book 93
Table 4-1 The probability of each target object in each room 61
Table 4-2 The probability of each target object and each landmark object co-occurring in the living room 62
Table 4-3 The probability of each target object and each landmark object co-occurring in the aisle 62
Table 4-4 The probability of each target object and each landmark object co-occurring in the bedroom 63
Table 4-5 The probability of each target object and each landmark object co-occurring in the bathroom 63
Table 4-6 The probability of each target object and each landmark object co-occurring in the dining room 64
Table 4-7 The probability of each target object and each landmark object co-occurring in the game room 64
Table 4-8 The probability of each target object and each landmark object co-occurring in the kitchen 65
Table 4-9 The probability of each target object and each landmark object co-occurring in the library 65
Table 4-10 The probability of each target object and each landmark object co-occurring in the meeting room 66
Table 4-11 The probability of each target object and each landmark object co-occurring in the office 66
Table 4-12 The probability of each target object and each landmark object co-occurring in the staircase 67
Table 4-13 The accuracy of the CLIP model 76
Table 4-14 Accuracy of PoseC3D, CLIP, and ours 76
Table 4-15 Questions in Questionnaire 1 81
Table 4-16 Questions in Questionnaire 2 81
Table 4-17 Search for 5 target objects in 5 different unknown environments 85
Table 4-18 Search for 5 target objects in 5 different known environments 86 | - |
| dc.language.iso | en | - |
| dc.subject | 人機互動 | zh_TW |
| dc.subject | 常識圖譜 | zh_TW |
| dc.subject | 人類行為推論 | zh_TW |
| dc.subject | 物件目標導航 | zh_TW |
| dc.subject | 3D語義地圖 | zh_TW |
| dc.subject | 認知記憶 | zh_TW |
| dc.subject | 3D Semantic Map | en |
| dc.subject | Cognitive Memory | en |
| dc.subject | Commonsense Knowledge Graph | en |
| dc.subject | Human Activity Inference | en |
| dc.subject | Object-Goal navigation | en |
| dc.subject | Human Robot Interaction | en |
| dc.title | 基於人類行為及認知記憶之物件目標導航居家照護型機器人 | zh_TW |
| dc.title | Object-Goal Navigation of Home Care Robot based on Human Activity Inference and Cognitive Memory | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 111-2 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.oralexamcommittee | 林沛群;曾士桓;楊谷洋;宋開泰 | zh_TW |
| dc.contributor.oralexamcommittee | Pei-Chun Lin;Shih-Huan Tseng;Kuu-Young Young;Kai-Tai Song | en |
| dc.subject.keyword | 物件目標導航,3D語義地圖,人類行為推論,認知記憶,常識圖譜,人機互動, | zh_TW |
| dc.subject.keyword | Object-Goal navigation,3D Semantic Map,Human Activity Inference,Cognitive Memory,Commonsense Knowledge Graph,Human Robot Interaction | en |
| dc.relation.page | 100 | - |
| dc.identifier.doi | 10.6342/NTU202300825 | - |
| dc.rights.note | 未授權 | - |
| dc.date.accepted | 2023-05-22 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 電機工程學系 | - |
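As a companion to the abstract above, the following sketch illustrates two of the techniques it describes: CLIP-based zero-shot human activity inference (candidate activity descriptions are compared against the camera image, and the highest-similarity description is taken as the activity) and the SPL evaluation metric. This is a minimal illustrative sketch rather than the thesis implementation; the Hugging Face checkpoint name, the prompt list, and the function names are assumptions introduced here for illustration, while CLIP (Radford et al., 2021) and SPL (Anderson et al., 2018) are cited in the reference list above.

```python
# Minimal sketch (not the thesis code) of CLIP-based activity inference and the
# SPL metric. Checkpoint name, prompts, and helper names are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical candidate activity descriptions; the thesis uses its own prompt set.
ACTIVITY_PROMPTS = [
    "a person reading a book",
    "a person drinking from a bottle",
    "a person talking on the phone",
    "a person eating a meal",
]

def infer_activity(frame: Image.Image) -> str:
    """Return the candidate activity whose text embedding best matches the frame."""
    inputs = processor(text=ACTIVITY_PROMPTS, images=frame,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image holds the scaled cosine similarity between the image
    # embedding and each text embedding; argmax picks the best-matching prompt.
    best = out.logits_per_image.softmax(dim=-1).argmax().item()
    return ACTIVITY_PROMPTS[best]

def spl(successes, shortest_lengths, actual_lengths) -> float:
    """Success weighted by Path Length (Anderson et al., 2018):
    SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i), with binary success S_i,
    shortest-path length l_i, and actually traveled path length p_i."""
    terms = [s * l / max(p, l)
             for s, l, p in zip(successes, shortest_lengths, actual_lengths)]
    return sum(terms) / len(terms)
```

In use, `infer_activity(Image.open("frame.jpg"))` would be called on each incoming camera frame; per the abstract, the thesis then keeps only important activities and stores them, together with the actor, place, and time, in the robot's episodic memory for later retrieval.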
| Appears in Collections: | 電機工程學系 | |
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| ntu-111-2.pdf (restricted, not authorized for public access) | 6.59 MB | Adobe PDF | |
All items in the system are protected by copyright, with all rights reserved, unless otherwise indicated.
