Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/81673

Complete metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 廖世偉(Shih-wei Liao) | |
| dc.contributor.author | Siyue Hu | en |
| dc.contributor.author | 胡思悅 | zh_TW |
| dc.date.accessioned | 2022-11-24T09:25:34Z | - |
| dc.date.available | 2022-11-24T09:25:34Z | - |
| dc.date.copyright | 2021-09-02 | |
| dc.date.issued | 2021 | |
| dc.date.submitted | 2021-08-20 | |
| dc.identifier.citation | [1] C. S. de Witt, T. Gupta, D. Makoviichuk, V. Makoviychuk, P. H. Torr, M. Sun, and S. Whiteson. Is independent learning all you need in the StarCraft multi-agent challenge? arXiv preprint arXiv:2011.09533, 2020. [2] J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. [3] M. Fortunato, M. G. Azar, B. Piot, J. Menick, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis, O. Pietquin, et al. Noisy networks for exploration. arXiv preprint arXiv:1706.10295, 2017. [4] J. Hu, S. Jiang, S. A. Harding, H. Wu, and S.-w. Liao. RIIT: Rethinking the importance of implementation tricks in multi-agent reinforcement learning. arXiv preprint arXiv:2102.03479, 2021. [5] S. Iqbal and F. Sha. Actor-attention-critic for multi-agent reinforcement learning. In International Conference on Machine Learning, pages 2961–2970. PMLR, 2019. [6] V. R. Konda and J. N. Tsitsiklis. Actor-critic algorithms. In Advances in Neural Information Processing Systems, pages 1008–1014. Citeseer, 2000. [7] L. Kraemer and B. Banerjee. Multi-agent reinforcement learning as a rehearsal for decentralized planning. Neurocomputing, 190:82–94, 2016. [8] D. Li, D. Zhao, Q. Zhang, and Y. Chen. Reinforcement learning and deep learning based lateral control for autonomous driving [application notes]. IEEE Computational Intelligence Magazine, 14(2):83–98, 2019. [9] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. arXiv preprint arXiv:1706.02275, 2017. [10] A. Mahajan, T. Rashid, M. Samvelyan, and S. Whiteson. MAVEN: Multi-agent variational exploration. arXiv preprint arXiv:1910.07483, 2019. [11] F. A. Oliehoek and C. Amato. A concise introduction to decentralized POMDPs. Springer, 2016. [12] F. A. Oliehoek, M. T. Spaan, and N. Vlassis. Optimal and approximate Q-value functions for decentralized POMDPs. Journal of Artificial Intelligence Research, 32:289–353, 2008. [13] P. Peng, Y. Wen, Y. Yang, Q. Yuan, Z. Tang, H. Long, and J. Wang. Multiagent bidirectionally-coordinated nets: Emergence of human-level coordination in learning to play StarCraft combat games. arXiv preprint arXiv:1703.10069, 2017. [14] M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and M. Andrychowicz. Parameter space noise for exploration. arXiv preprint arXiv:1706.01905, 2017. [15] W. Qiu, X. Wang, R. Yu, X. He, R. Wang, B. An, S. Obraztsova, and Z. Rabinovich. RMIX: Learning risk-sensitive policies for cooperative reinforcement learning agents. arXiv preprint arXiv:2102.08159, 2021. [16] T. Rashid, M. Samvelyan, C. Schroeder, G. Farquhar, J. Foerster, and S. Whiteson. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In International Conference on Machine Learning, pages 4295–4304. PMLR, 2018. [17] M. Samvelyan, T. Rashid, C. S. De Witt, G. Farquhar, N. Nardelli, T. G. Rudner, C.-M. Hung, P. H. Torr, J. Foerster, and S. Whiteson. The StarCraft multi-agent challenge. arXiv preprint arXiv:1902.04043, 2019. [18] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897. PMLR, 2015. [19] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015. [20] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. [21] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In International Conference on Machine Learning, pages 387–395. PMLR, 2014. [22] K. Son, D. Kim, W. J. Kang, D. E. Hostallero, and Y. Yi. QTRAN: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In International Conference on Machine Learning, pages 5887–5896. PMLR, 2019. [23] H. Song, M. Kim, D. Park, Y. Shin, and J.-G. Lee. Learning from noisy labels with deep neural networks: A survey. arXiv preprint arXiv:2007.08199, 2020. [24] P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls, et al. Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296, 2017. [25] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT Press, 2018. [26] R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour, et al. Policy gradient methods for reinforcement learning with function approximation. In NIPS, volume 99, pages 1057–1063. Citeseer, 1999. [27] M. Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, pages 330–337, 1993. [28] X. Wang, L. Ke, Z. Qiao, and X. Chai. Large-scale traffic signal control using a novel multiagent reinforcement learning. IEEE Transactions on Cybernetics, 51(1):174–187, 2020. [29] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992. [30] Y. Yang, J. Hao, B. Liao, K. Shao, G. Chen, W. Liu, and H. Tang. Qatten: A general framework for cooperative multiagent reinforcement learning. arXiv preprint arXiv:2002.03939, 2020. [31] C. Yu, A. Velu, E. Vinitsky, Y. Wang, A. Bayen, and Y. Wu. The surprising effectiveness of MAPPO in cooperative, multi-agent games. arXiv preprint arXiv:2103.01955, 2021. | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/81673 | - |
| dc.description.abstract | In recent years, many popular multi-agent reinforcement learning (MARL) algorithms have adopted the centralized training with decentralized execution (CTDE) paradigm. Recently, some researchers have applied the CTDE framework directly to the single-agent PPO algorithm, extending it into a multi-agent algorithm with a centralized value function (MAPPO) and evaluating it in the StarCraft II environment; experiments show, however, that MAPPO performs poorly on many StarCraft II tasks. To address this problem, we design noise-disturbance MAPPO (ND-MAPPO), which introduces a noise mechanism so that the centralized value function assigns a different value to each agent, thereby encouraging exploration (an illustrative sketch of this idea follows the metadata table below). Experiments show that the proposed method far outperforms MAPPO in most StarCraft II scenarios and in some scenarios also surpasses the state-of-the-art CTDE algorithm QMIX. In addition, we give the first theoretical proof that extending PPO to MAPPO through a centralized value function preserves its convergence guarantee, and we further analyze the value function to obtain some interesting insights. | zh_TW |
| dc.description.provenance | Made available in DSpace on 2022-11-24T09:25:34Z (GMT). No. of bitstreams: 1 U0001-0507202101384200.pdf: 1375025 bytes, checksum: d66115c1dac1c2ae8b8817a509221d70 (MD5) Previous issue date: 2021 | en |
| dc.description.tableofcontents | Verification Letter from the Oral Examination Committee i Acknowledgements iii 摘要 v Abstract vii Contents ix List of Figures xi List of Tables xiii Chapter 1 Introduction 1 Chapter 2 Preliminaries 5 Chapter 3 Related Works 9 Chapter 4 Methods 11 4.1 Motivation 11 4.2 Multi-agent PPO (MAPPO) 11 4.3 Noise-based Exploration 12 4.4 ND-MAPPO 12 4.5 Theoretical Perspective 14 Chapter 5 Experiments 15 5.1 Experiment Setup 15 5.1.1 Non-monotonic Matrix Game 15 5.1.2 StarCraft II 16 5.1.3 Evaluation Metric 17 5.2 Non-monotonic Matrix Game 17 5.3 SMAC 18 5.4 Ablation Studies 19 5.5 Extended V Value Analysis 20 5.6 Policy Entropy of ND-MAPPO 21 Chapter 6 Conclusion 23 Chapter 7 Broader Impact 25 References 27 Appendix A — Complete Detailed Multi-agent PPO Proof 31 A.0.1 Multi-agent PPO Convergence 31 A.0.2 Lower Bound 34 Appendix B — Additional Results 35 B.1 Additional SMAC Results 35 B.2 Ablation Studies 35 Appendix C — Hyperparameters 37 | |
| dc.language.iso | en | |
| dc.subject | 多智能體強化學習 | zh_TW |
| dc.subject | 噪音擾動 | zh_TW |
| dc.subject | 集中訓練分散執行 | zh_TW |
| dc.subject | Noise Disturbance | en |
| dc.subject | Centralized Training with Decentralized Execution | en |
| dc.subject | Multi-Agent Reinforcement Learning | en |
| dc.title | ND-MAPPO:具有噪音擾動的多智能體近似策略優化算法 | zh_TW |
| dc.title | ND-MAPPO: Noise Disturbance Multi-Agent Proximal Policy Optimization | en |
| dc.date.schoolyear | 109-2 | |
| dc.description.degree | 碩士 | |
| dc.contributor.oralexamcommittee | 戴敏育(Hsin-Tsai Liu),邱仁鈿(Chih-Yang Tseng),周俊男,孫瑞鴻 | |
| dc.subject.keyword | 多智能體強化學習,集中訓練分散執行,噪音擾動 | zh_TW |
| dc.subject.keyword | Multi-Agent Reinforcement Learning, Centralized Training with Decentralized Execution, Noise Disturbance | en |
| dc.relation.page | 38 | |
| dc.identifier.doi | 10.6342/NTU202101269 | |
| dc.rights.note | 未授權 | |
| dc.date.accepted | 2021-08-20 | |
| dc.contributor.author-college | 電機資訊學院 | zh_TW |
| dc.contributor.author-dept | 資訊工程學研究所 | zh_TW |
| Appears in Collections: | 資訊工程學系 |
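
The abstract above states only that ND-MAPPO perturbs the centralized value function with noise so that each agent receives its own value estimate; the record does not describe the concrete architecture. The snippet below is a minimal, hypothetical PyTorch sketch of that general idea, not the thesis implementation: the additive per-agent Gaussian noise with a learnable scale, the class name `NoisyCentralizedCritic`, and all network sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class NoisyCentralizedCritic(nn.Module):
    """Illustrative sketch (not the thesis code): a centralized value
    function V(s) whose output is perturbed by per-agent Gaussian noise,
    so each agent i receives its own estimate V_i(s)."""

    def __init__(self, state_dim: int, n_agents: int, hidden_dim: int = 64):
        super().__init__()
        self.n_agents = n_agents
        # Shared trunk that produces the centralized value V(s).
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )
        # One learnable noise scale per agent (an assumption for illustration).
        self.log_sigma = nn.Parameter(torch.zeros(n_agents))

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # state: (batch, state_dim) -> shared value: (batch, 1)
        v = self.trunk(state)
        sigma = self.log_sigma.exp()  # per-agent noise scale, shape (n_agents,)
        eps = torch.randn(state.shape[0], self.n_agents, device=state.device)
        # Each agent sees the shared value plus its own noise sample.
        return v + sigma * eps        # shape (batch, n_agents)


if __name__ == "__main__":
    critic = NoisyCentralizedCritic(state_dim=10, n_agents=3)
    s = torch.randn(4, 10)
    print(critic(s).shape)  # torch.Size([4, 3]): one value per agent
```

In a PPO-style update, each agent would then compute its advantages from its own column of this output, which is one way the noise could diversify exploration across agents while the value trunk remains shared.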
Files in this item:
| File | Size | Format |
|---|---|---|
| U0001-0507202101384200.pdf (restricted access; not authorized for public access) | 1.34 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
