Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92260

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 鄭卜壬 | zh_TW |
| dc.contributor.advisor | Pu-Jen Cheng | en |
| dc.contributor.author | 劉冠廷 | zh_TW |
| dc.contributor.author | Guan-Ting Liu | en |
| dc.date.accessioned | 2024-03-21T16:18:55Z | - |
| dc.date.available | 2024-03-22 | - |
| dc.date.copyright | 2024-03-21 | - |
| dc.date.issued | 2024 | - |
| dc.date.submitted | 2024-02-07 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92260 | - |
| dc.description.abstract | 深度強化學習是當前一項重要的機器學習領域,在機器人、自駕車、金融交易、棋類遊戲以及生成式人工智慧等領域都有非常多成功的應用。深度強化學習通過與環境的互動來學習達成特定目標,所以要如何有效地學習並在不同的輸入狀態提取能夠泛化的表徵一直都是深度強化學習的一大考驗。除了有效地學習外,強化學習的相關研究也十分關注如何創建能夠泛化並被人類理解的策略。因為在自駕車、金融交易以及醫療領域都十分注重代理人系統的可解釋性以及可驗證性,而如何驗證跟解釋透過強化學習所學習到的策略是非常大的挑戰。為了更進一步探討以上議題,本論文旨在利用不同輸入狀態的相似度以及領域特定語言來分別提升強化學習的泛化性以及可解釋性。在提升泛化性這個面向上,我們比較不同狀態的特徵並計算其相似性,利用相關結果在多個環境中測試並評估提升的幅度及潛力。在提升可解釋性方面,我們利用組合程式,以生成描述超越現有數據集的精細行為的策略。這種創新方法在特定領域中展現出優於傳統方法的效果,有望促進更具適應性的強化學習策略的形成。另一個非常具有潛力的方向是將程式化強化學習和有限狀態機相結合,以更有效地展現複雜行為並解決長期任務。本論文所提出的方法,在許多不同的測試環境中取得非常好的效果,也可以與其他正則化或是提升可解釋性的方法互相搭配,進一步提升整體的可泛化性以及可解釋性。 | zh_TW |
| dc.description.abstract | Deep reinforcement learning (DRL) is a central area of contemporary machine learning, with successful applications in robotics, autonomous vehicles, financial trading, strategic games, and generative artificial intelligence. Because DRL learns to achieve specific objectives by interacting with an environment, learning effectively and extracting representations that generalize across diverse input states remains a persistent challenge. Beyond learning policies efficiently, reinforcement learning research also focuses on producing policies that both generalize and remain comprehensible to humans. Interpretability is critical in domains such as autonomous driving, financial trading, and healthcare, where agent systems must be verifiable as well as understandable, yet validating and explaining policies learned through reinforcement learning is difficult. This thesis addresses these issues by leveraging the similarity between distinct input states and by using domain-specific languages to improve the generalizability and interpretability of reinforcement learning. To improve generalization, we compare the features of different states, compute their similarities, and evaluate the resulting method across multiple environments to measure the magnitude and potential of the improvement. To improve interpretability, we compose programs to generate policies that describe fine-grained behaviors beyond those in existing datasets; this approach outperforms conventional methods in specific domains and can foster more adaptive reinforcement learning policies. Another promising direction combines programmatic reinforcement learning with finite-state machines to express complex behaviors and solve long-horizon tasks more effectively. The proposed methods achieve strong results across diverse testing environments and can be combined with other regularization or interpretability-enhancing techniques to further improve overall generalizability and interpretability. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-03-21T16:18:55Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2024-03-21T16:18:55Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Verification Letter from the Oral Examination Committee; Acknowledgements; 摘要; Abstract; Contents; List of Figures; List of Tables; Chapter 1 Current Advancement and Challenges of Reinforcement Learning; 1.1 Introduction of Reinforcement Learning and Markov Decision Process; 1.2 Challenges: Generalization in Reinforcement Learning; 1.2.1 Data Augmentation; 1.2.2 Regularization and Learning Invariance; 1.2.3 Domain Randomization and Inductive Bias; 1.3 Challenges: Interpretability in Reinforcement Learning; 1.3.1 Extracted Rules From Neural Networks or Demonstrations; 1.3.2 Program Guided Agents; 1.3.3 Symbolic Programs as Policies; 1.3.4 Human-Readable Programs as Policies; 1.4 Approaches for Improving the Generalization and Interpretability in Reinforcement Learning; Chapter 2 Improving the Generalization of Policies in Reinforcement Learning Using Cross-State Self-Constraint; 2.1 Challenges of Improving Generalization in Deep Reinforcement Learning; 2.2 Prior Works of Reinforcement Learning Regularization and Generalization; 2.3 Using Cross-State Similarity Ranking for Learning General State Features; 2.3.1 Behavior Definition; 2.3.2 Applying the Concept of Implicit Feedback in Reinforcement Learning; 2.3.3 Probabilistic Behavior Matching as Implicit Feedback; 2.3.4 Cross-State Self-Constraint in Combination with Reinforcement Learning; 2.4 Improving Generalization Performance on the OpenAI Procgen Benchmark; 2.4.1 Generalization on Procgen with Rainbow; 2.4.2 Generalization on Procgen with PPO; 2.4.3 Evaluation on Gym Atari with Rainbow; 2.4.4 Visualization of Representation Embedding; 2.5 Summary; Chapter 3 Improving the Interpretability of Policies Using Hierarchical Programmatic Reinforcement Learning; 3.1 Challenges of Improving the Interpretability of Policies in Reinforcement Learning; 3.2 Prior Works About Programmatic Reinforcement Learning; 3.3 Problem Formulation of Programmatic Reinforcement Learning; 3.3.1 Domain-Specific Language (DSL); 3.3.2 Markov Decision Process (MDP); 3.4 A Hierarchical Approach for Synthesizing Task-Solving Programs; 3.4.1 Continuous Parameterization of the Program Embedding Space; 3.4.2 Meta-Policy: Learning to Compose the Task-Solving Program; 3.5 Experiments in the Karel Environment; 3.5.1 The Karel Environment; 3.5.2 Detailed Descriptions of KAREL Tasks; 3.5.3 Detailed Descriptions of KAREL-HARD Tasks; 3.5.4 Experimental Settings; 3.5.4.1 Improved Program Dataset Generation for the Karel DSL; 3.5.4.2 Framework Implementation; 3.5.4.3 Baseline Approaches; 3.5.5 Experimental Results on the KAREL and KAREL-HARD Tasks; 3.5.6 Additional Experiments; 3.5.6.1 Generation of Out-Of-Distributional Programs; 3.5.6.2 Dimensionality of the Program Embedding Space; 3.5.6.3 Learning From Episodic Reward; 3.5.7 Qualitative Examination of Results; 3.6 Summary; Chapter 4 Combining Generalization and Interpretability in Reinforcement Learning; 4.1 Generalization of Interpretable Reinforcement Learning on Large-Scale Tasks; 4.2 Generalization of Interpretable Reinforcement Learning on Repetitive and Long-Horizon Tasks; 4.2.1 Retrieving a Set of Effective, Diverse, and Reusable Programs as Modes; 4.2.1.1 Using the Cross-Entropy Method to Retrieve Effective Programs; 4.2.1.2 Retrieval of Effective and Diverse Programs; 4.2.1.3 Retrieval of Effective, Diverse, and Compatible Programs; 4.2.2 Learning the Mode Transition Function; 4.2.3 Evaluating Program Machine Policies on Long-Horizon Tasks; 4.2.4 The KAREL-LONG Problem Set; 4.2.5 Integration of Diversity Multiplier into Cross-Entropy Method; 4.2.6 Ablation Study; 4.2.7 Comparison with Deep RL and Programmatic RL Approaches; 4.2.8 Efficiency in Program Sampling; 4.2.9 Inductive Generalization Capability; Chapter 5 Conclusion and Future Work; 5.1 Unraveling the Nexus between Interpretability and Human Understanding; 5.2 Enhancing Learning Efficiency for Programmatic Reinforcement Learning; 5.3 Leveraging Prior Knowledge for Better Interpretability; References; Appendix A: Supplementary Material of Cross-State Self-Constraint; A.1 Hyperparameters of Cross-State Self-Constraint with PPO and Rainbow; A.2 Further Experimental Results; Appendix B: Supplementary Material of Hierarchical Programmatic Reinforcement Learning; B.1 Hyperparameters and Settings; B.2 Synthesized Programs | - |
| dc.language.iso | en | - |
| dc.subject | 機器學習 | zh_TW |
| dc.subject | 強化學習 | zh_TW |
| dc.subject | 泛化性 | zh_TW |
| dc.subject | 可解釋性 | zh_TW |
| dc.subject | 程式化強化學習 | zh_TW |
| dc.subject | Reinforcement Learning | en |
| dc.subject | Machine Learning | en |
| dc.subject | Programmatic Reinforcement Learning | en |
| dc.subject | Interpretability | en |
| dc.subject | Generalization | en |
| dc.title | 朝向可泛化和可解釋的強化學習 | zh_TW |
| dc.title | Towards Generalizable and Interpretable Reinforcement Learning | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 112-1 | - |
| dc.description.degree | Doctoral | - |
| dc.contributor.oralexamcommittee | 陳縕儂;高宏宇;陳信希;李宏毅;許永真;曾新穆;林忠緯 | zh_TW |
| dc.contributor.oralexamcommittee | Yun-Nung Chen;Hung-Yu Kao;Hsin-Hsi Chen;Hung-Yi Lee;Yung-jen Hsu;Shin-Mu Tseng;Chung-Wei Lin | en |
| dc.subject.keyword | 機器學習,強化學習,泛化性,可解釋性,程式化強化學習 | zh_TW |
| dc.subject.keyword | Machine Learning, Reinforcement Learning, Generalization, Interpretability, Programmatic Reinforcement Learning | en |
| dc.relation.page | 131 | - |
| dc.identifier.doi | 10.6342/NTU202400365 | - |
| dc.rights.note | Authorized (restricted to on-campus access) | - |
| dc.date.accepted | 2024-02-16 | - |
| dc.contributor.author-college | 電機資訊學院 (College of Electrical Engineering and Computer Science) | - |
| dc.contributor.author-dept | 資訊網路與多媒體研究所 (Graduate Institute of Networking and Multimedia) | - |
| dc.date.embargo-lift | 2029-01-30 | - |
Appears in Collections: 資訊網路與多媒體研究所 (Graduate Institute of Networking and Multimedia)
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-112-1.pdf (not authorized for public access) | 17.57 MB | Adobe PDF |
Except where otherwise noted, all items in this repository are protected by copyright, with all rights reserved.