Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92979

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 傅立成 | zh_TW |
| dc.contributor.advisor | Li-Chen Fu | en |
| dc.contributor.author | 陳慈安 | zh_TW |
| dc.contributor.author | Cih-An Chen | en |
| dc.date.accessioned | 2024-07-10T16:08:59Z | - |
| dc.date.available | 2024-07-11 | - |
| dc.date.copyright | 2024-07-10 | - |
| dc.date.issued | 2024 | - |
| dc.date.submitted | 2024-07-02 | - |
| dc.identifier.citation | [1] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, and Z. Sui, "A survey on in-context learning," arXiv preprint arXiv:2301.00234, 2022.
[2] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., "Chain-of-thought prompting elicits reasoning in large language models," Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837, 2022.
[3] J. D. Hwang, C. Bhagavatula, R. Le Bras, J. Da, K. Sakaguchi, A. Bosselut, and Y. Choi, "COMET-ATOMIC 2020: On symbolic and neural commonsense knowledge graphs," in AAAI, 2021.
[4] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, "Places: A 10 million image database for scene recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 6, pp. 1452–1464, 2018.
[5] D. Feil-Seifer and M. Mataric, "Defining socially assistive robotics," in 9th International Conference on Rehabilitation Robotics (ICORR 2005), 2005, pp. 465–468.
[6] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel et al., "Retrieval-augmented generation for knowledge-intensive NLP tasks," Advances in Neural Information Processing Systems, vol. 33, pp. 9459–9474, 2020.
[7] R. Speer, J. Chin, and C. Havasi, "ConceptNet 5.5: An open multilingual graph of general knowledge," 2018.
[8] M. Sap, R. Le Bras, E. Allaway, C. Bhagavatula, N. Lourie, H. Rashkin, B. Roof, N. A. Smith, and Y. Choi, "ATOMIC: An atlas of machine commonsense for if-then reasoning," 2019.
[9] N. Mostafazadeh, A. Kalyanpur, L. Moon, D. Buchanan, L. Berkowitz, O. Biran, and J. Chu-Carroll, "GLUCOSE: GeneraLized and COntextualized story explanations," in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), B. Webber, T. Cohn, Y. He, and Y. Liu, Eds. Online: Association for Computational Linguistics, Nov. 2020, pp. 4569–4586. [Online]. Available: https://aclanthology.org/2020.emnlp-main.370
[10] D. Xu, R. Martín-Martín, D.-A. Huang, Y. Zhu, S. Savarese, and L. Fei-Fei, "Regression planning networks," in Thirty-third Conference on Neural Information Processing Systems (NeurIPS), 2019.
[11] B. Ichter, P. Sermanet, and C. Lynch, "Broadly-exploring, local-policy trees for long horizon task planning," 2020.
[12] W. Huang, P. Abbeel, D. Pathak, and I. Mordatch, "Language models as zero-shot planners: Extracting actionable knowledge for embodied agents," in International Conference on Machine Learning. PMLR, 2022, pp. 9118–9147.
[13] M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausman, A. Herzog, D. Ho, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, E. Jang, R. J. Ruano, K. Jeffrey, S. Jesmonth, N. J. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, K.-H. Lee, S. Levine, Y. Lu, L. Luu, C. Parada, P. Pastor, J. Quiambao, K. Rao, J. Rettinghouse, D. Reyes, P. Sermanet, N. Sievers, C. Tan, A. Toshev, V. Vanhoucke, F. Xia, T. Xiao, P. Xu, S. Xu, M. Yan, and A. Zeng, "Do as I can, not as I say: Grounding language in robotic affordances," 2022.
[14] C. H. Song, B. M. Sadler, J. Wu, W.-L. Chao, C. Washington, and Y. Su, "LLM-Planner: Few-shot grounded planning for embodied agents with large language models," in 2023 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2023, pp. 2986–2997.
[15] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, "ReAct: Synergizing reasoning and acting in language models," arXiv preprint arXiv:2210.03629, 2022.
[16] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, "Reflexion: Language agents with verbal reinforcement learning," Advances in Neural Information Processing Systems, vol. 36, 2024.
[17] J. Li, D. Li, S. Savarese, and S. Hoi, "BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models," 2023.
[18] Z. Peng, W. Wang, L. Dong, Y. Hao, S. Huang, S. Ma, and F. Wei, "Kosmos-2: Grounding multimodal large language models to the world," 2023.
[19] S. I. Serengil and A. Ozpinar, "LightFace: A hybrid deep face recognition framework," in 2020 Innovations in Intelligent Systems and Applications Conference (ASYU). IEEE, 2020, pp. 23–27. [Online]. Available: https://ieeexplore.ieee.org/document/9259802
[20] Nate Raw, "vit-age-classifier (revision 461a4c4)," 2023. [Online]. Available: https://huggingface.co/nateraw/vit-age-classifier
[21] S. Song, S. P. Lichtenberg, and J. Xiao, "SUN RGB-D: A RGB-D scene understanding benchmark suite," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 567–576.
[22] B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba, "Semantic understanding of scenes through the ADE20K dataset," International Journal of Computer Vision, vol. 127, pp. 302–321, 2019.
[23] A. Pal, C. Nieto-Granda, and H. I. Christensen, "DEDUCE: Diverse scene detection methods in unseen challenging environments," 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4198–4204, 2019. [Online]. Available: https://api.semanticscholar.org/CorpusID:199064574
[24] L. Zhou, J. Cen, X. Wang, Z. Sun, T. L. Lam, and Y. Xu, "BORM: Bayesian object relation model for indoor scene recognition," in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021, pp. 39–46.
[25] B. Miao, L. Zhou, A. S. Mian, T. L. Lam, and Y. Xu, "Object-to-scene: Learning to transfer object knowledge to indoor scene recognition," in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 2069–2075.
[26] C. Song and X. Ma, "SRRM: Semantic region relation model for indoor scene recognition," in 2023 International Joint Conference on Neural Networks (IJCNN), 2023, pp. 1–8.
[27] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," 2018.
[28] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V. Springer, 2014, pp. 740–755.
[29] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1, 2005, pp. 886–893.
[30] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, SSD: Single Shot MultiBox Detector. Springer International Publishing, 2016, pp. 21–37. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-46448-0_2
[31] G. Jocher, A. Chaurasia, and J. Qiu, "Ultralytics YOLOv8," 2023. [Online]. Available: https://github.com/ultralytics/ultralytics
[32] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," 2014.
[33] F. Blochliger, M. Fehr, M. Dymczyk, T. Schneider, and R. Siegwart, "Topomap: Topological mapping and navigation based on visual SLAM maps," in 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 3818–3825.
[34] G. L. Oliveira, N. Radwan, W. Burgard, and T. Brox, "Topometric localization with deep learning," in Robotics Research: The 18th International Symposium ISRR. Springer, 2020, pp. 505–520.
[35] T. E. Behrens, T. H. Muller, J. C. Whittington, S. Mark, A. B. Baram, K. L. Stachenfeld, and Z. Kurth-Nelson, "What is a cognitive map? Organizing knowledge for flexible behavior," Neuron, vol. 100, no. 2, pp. 490–509, 2018.
[36] J. C. Whittington, D. McCaffary, J. J. Bakermans, and T. E. Behrens, "How to build a cognitive map," Nature Neuroscience, vol. 25, no. 10, pp. 1257–1272, 2022.
[37] R. A. Epstein, E. Z. Patai, J. B. Julian, and H. J. Spiers, "The cognitive map in humans: Spatial navigation and beyond," Nature Neuroscience, vol. 20, no. 11, pp. 1504–1513, 2017.
[38] C.-H. Tu, "A robot system for indoor environment question answering with cognitive map leveraging vision-language models," Master's thesis, National Taiwan University, 2023.
[39] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, "Learning transferable visual models from natural language supervision," 2021.
[40] J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, and S. C. H. Hoi, "Align before fuse: Vision and language representation learning with momentum distillation," Advances in Neural Information Processing Systems, vol. 34, pp. 9694–9705, 2021.
[41] J. Li, D. Li, C. Xiong, and S. Hoi, "BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation," in International Conference on Machine Learning. PMLR, 2022, pp. 12888–12900.
[42] OpenAI, "GPT-4 technical report," 2024.
[43] Gemini Team, "Gemini: A family of highly capable multimodal models," 2024.
[44] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, "Measuring massive multitask language understanding," arXiv preprint arXiv:2009.03300, 2020.
[45] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier et al., "Mistral 7B," arXiv preprint arXiv:2310.06825, 2023.
[46] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, "LLaMA: Open and efficient foundation language models," 2023.
[47] Anthropic, "Claude 3 Haiku: Our fastest model yet," 2024. [Online]. Available: https://www.anthropic.com/news/claude-3-haiku
[48] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, and H. Wang, "Retrieval-augmented generation for large language models: A survey," 2024.
[49] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. N. Fung, and S. Hoi, "InstructBLIP: Towards general-purpose vision-language models with instruction tuning," Advances in Neural Information Processing Systems, vol. 36, 2024.
[50] K. Kärkkäinen and J. Joo, "FairFace: Face attribute dataset for balanced race, gender, and age," 2019.
[51] M. Honnibal and I. Montani, "spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing," 2017.
[52] M. Grootendorst, "KeyBERT: Minimal keyword extraction with BERT," 2020. [Online]. Available: https://doi.org/10.5281/zenodo.4461265
[53] S. A. Sloman, "The empirical case for two systems of reasoning," Psychological Bulletin, vol. 119, no. 1, p. 3, 1996.
[54] S. Y. Min, D. S. Chaplot, P. Ravikumar, Y. Bisk, and R. Salakhutdinov, "FILM: Following instructions in language with modular methods," arXiv preprint arXiv:2110.07342, 2021.
[55] K. Kawamura, D. Noelle, K. Hambuchen, T. Rogers, and E. Turkay, "A multi-agent approach to self-reflection for cognitive robotics," in International Conference on Advanced Robotics, Coimbra, Portugal, 2003, pp. 568–575.
[56] H. Liu, C. Li, Q. Wu, and Y. J. Lee, "Visual instruction tuning," Advances in Neural Information Processing Systems, vol. 36, 2024. | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92979 | - |
| dc.description.abstract | 隨著人工智慧與機器學習技術的快速發展,機器人正逐步融入我們的日常生活,成為不可或缺的一部分。在居家環境中,認知社交型機器人(Cognitive Social Robot)不僅可以作為交談夥伴來陪伴使用者,還能擔任個人管家的角色,按照使用者的當前需求提供相應的服務和互動。近年來,大型語言模型的出現進一步提升了機器人在推理和決策的能力,除了用於產生與上下文更相關的回覆外,生成式行動規劃(Generative Action Planning)也是一個正積極被開發的領域。藉由處理及剖析指令,認知機器人能夠自主生成適當的行動序列,並透過與環境互動來調整其策略,以達成最終目標。若將此應用於居家環境,機器人將被賦予更高的認知推理能力,進而根據使用者的需求提供高效且適切的支援,不僅可提升居住者的生活品質,同時也促進了更為友好和自然的人機互動。
本研究旨在結合常識知識與大型語言模型,開發一個具有生成式行動規劃的認知家用社交型機器人系統,依據機器人從其視野所捕捉到的影像及接收到的使用者語音輸入,自動判斷並選擇最適當的角色。這些角色包括接待者、陪伴者和居家服務者,分別負責執行環境介紹、進行互動對話及提供特定的居家服務。為了有效理解和回應使用者的需求,我們會分別利用視覺語言模型和深度學習模型來進行場景辨識、物件偵測、臉部辨識及年齡估測等,以收集有關當前環境和使用者的重要資訊。此外,在居家服務者的角色中,對於隱含的語音輸入,我們藉由萃取ATOMIC2020知識庫中的常識知識以及善用大型語言模型來使機器人產生自我指令(Self-instruction),同時考量豐富的使用者及環境資訊,讓認知機器人進一步自行推理出可達成指令的一系列高階計畫序列(High-level-plans)。對於執行中的每一高階計畫,我們也提出一個重規劃演算法,讓機器人得以根據環境觀察來進行即時的反思及修正,大幅提升任務執行的效率及成功率。最後,我們將採用結合視覺語言模型及認知地圖的模組作為底層規劃者(Low-level-planner),讓機器人系統具備高度空間認知能力,對應高階計畫,在居家環境進行有效率的定位及導航。 | zh_TW |
| dc.description.abstract | With the rapid development of artificial intelligence and machine learning technologies, robots are progressively being integrated into our daily lives and becoming an indispensable part of them. In home environments, cognitive social robots not only serve as conversational companions but also act as personal assistants, providing home services and interactions tailored to users' needs. The emergence of large language models (LLMs) has further enhanced robots' reasoning and decision-making capabilities. Beyond generating more contextually relevant responses, generative action planning is also being actively researched. By analyzing given instructions, cognitive robots can autonomously generate action sequences and adjust their strategies through interaction with the environment. When operating in a domestic setting, such robots are endowed with enhanced cognitive reasoning abilities and can thereby provide more efficient services based on user needs. This not only improves residents' quality of life but also fosters more amiable and natural human-machine interaction.
This research proposes a framework that integrates commonsense knowledge with LLMs to develop a cognitive home social robot capable of generative action planning. Based on images captured from its field of vision and user utterances, the robot automatically and dynamically determines the most appropriate role to assume, such as a reception robot, a companion robot, or a home service robot. To respond to user needs effectively and efficiently, we employ vision-language models (VLMs) for tasks including scene recognition and object detection, along with deep learning models for facial recognition and age estimation, gathering crucial context about the environment and users. In particular, when the robot acts as a home service robot and the user utterance carries an implicit meaning, we first infer the explicit meaning from ATOMIC 2020, a commonsense knowledge base, and harness LLMs to enable the robot to generate self-instructions. By considering versatile user and environmental information, the cognitive robot autonomously reasons out a series of high-level plans that fulfill the self-instructions. For each executing high-level plan, we propose a replanning algorithm that allows the robot to reflect and replan online based on environmental observations, significantly improving the efficiency and success rate of the tasks being executed. Lastly, we incorporate a module that integrates VLMs and a cognitive map as the low-level planner, endowing the robot system with advanced spatial cognition capabilities to localize and navigate effectively within a home environment. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-07-10T16:08:59Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2024-07-10T16:08:59Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Committee Approval Certificate i
Acknowledgements ii
Chinese Abstract iii
Abstract v
Contents vii
List of Figures xi
List of Tables xiii
Chapter 1 Introduction 1
1.1 Background and Motivation 1
1.2 Research Objectives 2
1.3 Related Works 4
1.3.1 Generative Action Planning 4
1.3.2 Comparison 7
1.4 Contributions 9
1.5 Thesis Overview 10
Chapter 2 Preliminaries 11
2.1 Environment Perception 11
2.1.1 Scene Recognition 11
2.1.2 Object Detection 13
2.2 Cognitive Map 14
2.3 Vision-language Models 16
2.4 Large Language Models 18
2.5 Retrieval Augmented Generation 21
Chapter 3 Methodology 23
3.1 System Overview 23
3.2 User-related Information Collection 25
3.2.1 Scene Recognition 25
3.2.2 Object Detection 27
3.2.3 User Information Extraction 28
3.3 Robot's Role Classification 30
3.3.1 Role Types 30
3.3.2 Role Classification 31
3.4 Relevant Commonsense Knowledge Information Extraction 34
3.4.1 Knowledge Base Pre-processing 34
3.4.2 Knowledge Base Information Retrieval 37
3.5 Cognitive Service Providing 40
3.5.1 Interactive Response Generation 41
3.5.2 Self-instruction Generation 44
3.5.3 Action Planner 46
3.5.3.1 High Level Planner 47
3.5.3.2 Low Level Planner 53
3.5.3.3 Plan Correction 56
Chapter 4 Experiments 61
4.1 Robot Setup 61
4.2 Scene Recognition 63
4.2.1 Experimental Setup 63
4.2.2 Result 64
4.3 Robot's Role Classification 66
4.3.1 Experimental Setup 66
4.3.2 Result 67
4.4 Cognitive Service Providing 69
4.4.1 Interactive Response Generation 69
4.4.1.1 Experimental Setup 70
4.4.1.2 Result 70
4.4.2 Self-Instruction Generation 73
4.4.2.1 Experimental Setup 73
4.4.2.2 Result 75
4.4.3 Action Planner 76
4.4.3.1 Experimental Setup 76
4.4.3.2 Quantitative Result 77
4.4.3.3 Qualitative Result 80
4.5 Overall System 84
4.5.1 Ablation Study 84
4.5.2 Qualitative Result 87
Chapter 5 Conclusion 92
References 94
Appendix A — Tables 99
A.1 Robot's Role Classification 99
A.2 Knowledge Base Information Re-ranking 100
A.3 Question Set 101
A.3.1 Robot's Role Classification 101
A.3.2 Self-instruction Generation 103
A.3.3 Action Planner 104
A.4 Others 105 | - |
| dc.language.iso | en | - |
| dc.subject | 生成式行動規劃 | zh_TW |
| dc.subject | 大型語言模型 | zh_TW |
| dc.subject | 常識知識庫 | zh_TW |
| dc.subject | 認知家用社交機器人 | zh_TW |
| dc.subject | 重規劃演算法 | zh_TW |
| dc.subject | Cognitive Home Social Robot | en |
| dc.subject | Generative Action Planning | en |
| dc.subject | Replan Algorithm | en |
| dc.subject | Large Language Models | en |
| dc.subject | Commonsense Knowledge Base | en |
| dc.title | 基於常識知識與大型語言模型之具有生成式行動規劃之認知家用社交型機器人 | zh_TW |
| dc.title | Cognitive Home Social Robot with Generative Action Planning Based on Commonsense Knowledge Base and Large Language Models | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 112-2 | - |
| dc.description.degree | 碩士 (Master's) | - |
| dc.contributor.oralexamcommittee | 林沛群;蘇木春;連豊立;宋開泰 | zh_TW |
| dc.contributor.oralexamcommittee | Pei-Chun Lin;Mu-Chun Su;Feng-Li Lian;Kai-Tai Song | en |
| dc.subject.keyword | 生成式行動規劃,重規劃演算法,認知家用社交機器人,常識知識庫,大型語言模型 | zh_TW |
| dc.subject.keyword | Generative Action Planning, Replan Algorithm, Cognitive Home Social Robot, Commonsense Knowledge Base, Large Language Models | en |
| dc.relation.page | 105 | - |
| dc.identifier.doi | 10.6342/NTU202400866 | - |
| dc.rights.note | Authorized (restricted to on-campus access) | - |
| dc.date.accepted | 2024-07-03 | - |
| dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
| dc.contributor.author-dept | Department of Electrical Engineering | - |
| dc.date.embargo-lift | 2027-08-31 | - |
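
The abstract above describes a replanning algorithm in which the robot reflects on environmental observations and corrects its high-level plan online. As a minimal sketch of how such a reflect-and-replan loop could be structured, the following Python snippet uses toy stand-ins (`llm_propose_plan`, `execute_step`, `observe_environment`) for the LLM planner, the low-level executor, and the VLM perception stack; these names and behaviors are illustrative assumptions, not the thesis's actual implementation.

```python
import random

def llm_propose_plan(instruction, context):
    """Toy stand-in for an LLM call that returns ordered high-level steps."""
    base = ["go to kitchen", "find cup", "fill cup with water", "bring cup to user"]
    failed = context.get("failed_step")
    # Pretend the LLM repairs the plan by retrying from the failed step onward.
    return base[base.index(failed):] if failed in base else list(base)

def execute_step(step):
    """Toy stand-in for the low-level planner; fails 20% of the time."""
    return random.random() > 0.2

def observe_environment():
    """Toy stand-in for VLM-based scene and object observations."""
    return {"scene": "living room", "objects": ["sofa", "table", "cup"]}

def plan_with_replanning(instruction, max_retries=3):
    """Execute a high-level plan step by step; reflect and replan on failure."""
    context = observe_environment()
    plan = llm_propose_plan(instruction, context)
    retries = 0
    while plan:
        step = plan.pop(0)
        if execute_step(step):
            continue  # step succeeded, move to the next one
        retries += 1
        if retries > max_retries:
            return False  # too many failures, abort the task
        # Reflect: record the failed step, re-observe the environment,
        # and ask the planner for a corrected remainder of the plan.
        context = observe_environment()
        context["failed_step"] = step
        plan = llm_propose_plan(instruction, context)
    return True

if __name__ == "__main__":
    print(plan_with_replanning("bring me a glass of water"))
```

The point illustrated is the design choice the abstract attributes to the replanning algorithm: execution failures are folded back into the planning context and trigger a fresh plan rather than aborting the task outright.
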
| Appears in Collections: | Department of Electrical Engineering | |
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| ntu-112-2.pdf (Restricted access) | 27.67 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.