NTU Theses and Dissertations Repository
College of Electrical Engineering and Computer Science › Department of Electrical Engineering
Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/71129
Title: Distributed Continuous Control with Meta Learning on Robotic Arms (元學習於分散式連續控制機械手臂)
Authors: Kuan-Ting Chen (陳冠廷)
Advisor: Sheng-De Wang (王勝德)
Keywords: 3D Robotic Arm, Deep Learning, Asynchronous Reinforcement Learning, Meta Learning
Publication Year: 2018
Degree: Master
Abstract: Deep reinforcement learning methods such as Deep Q-Learning (DQN) and Policy Gradient (PG) have been proposed for controlling robotic arms. Deep Deterministic Policy Gradient (DDPG) replaces the stochastic policy with a deterministic one, which simplifies training and improves performance. Reinforcement learning trains the underlying network from rewards given by the environment; a well-designed reward yields better performance and faster training, but crafting one requires domain knowledge and trial and error. In this thesis, we propose a DDPG-based method that incorporates Prioritized Experience Replay (PER), asynchronous agent learning, and meta learning. The meta-learning component uses multiple distributed learners, called workers, that learn from consecutive past states and rewards. Simulations on 6-DOF (IRB140) and 7-DOF (LBR iiwa 14 R820) robotic arms train control agents to reach randomly placed targets in three-dimensional space. Experiments show that the proposed algorithm outperforms DDPG with a hand-crafted reward function in both task success rate and training speed.
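The abstract mentions Prioritized Experience Replay, in which transitions with larger TD error are replayed more often. As a rough illustrative sketch only (not the thesis's actual implementation, whose details are not given here), a minimal proportional-prioritization buffer could look like this; the class and method names are hypothetical:

```python
import random

class PrioritizedReplayBuffer:
    """Illustrative proportional prioritized experience replay.

    Transitions with larger TD error get larger priority and are sampled
    more often; alpha controls how strongly priorities bias sampling
    (alpha=0 recovers uniform sampling).
    """

    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity = capacity
        self.alpha = alpha
        self.eps = eps          # keeps every priority strictly positive
        self.buffer = []        # stored transitions
        self.priorities = []    # one priority per transition
        self.pos = 0            # next slot to overwrite when full

    def add(self, transition, td_error):
        priority = (abs(td_error) + self.eps) ** self.alpha
        if len(self.buffer) < self.capacity:
            self.buffer.append(transition)
            self.priorities.append(priority)
        else:  # ring-buffer overwrite of the oldest entry
            self.buffer[self.pos] = transition
            self.priorities[self.pos] = priority
            self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        # Sample indices with probability proportional to stored priorities
        # (with replacement, as in proportional PER).
        indices = random.choices(range(len(self.buffer)),
                                 weights=self.priorities, k=batch_size)
        return [self.buffer[i] for i in indices], indices

    def update_priorities(self, indices, td_errors):
        # After a training step, refresh priorities with the new TD errors.
        for i, err in zip(indices, td_errors):
            self.priorities[i] = (abs(err) + self.eps) ** self.alpha
```

In a DDPG-style loop, each sampled batch would be used for a critic/actor update, after which `update_priorities` is called with the batch's fresh TD errors so that already-learned transitions fade from the sampling distribution.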
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/71129
DOI: 10.6342/NTU201801006
Fulltext Rights: Paid authorization (有償授權)
Appears in Collections: Department of Electrical Engineering

Files in This Item:
  ntu-107-1.pdf (Restricted Access), 1.68 MB, Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
