Okan YILDIRAN
The DARPA Robotics Challenge (DRC) promotes innovation in human-supervised robotic technology for disaster-response operations.
Turning a valve is still a difficult task in robotics. The DARPA Robotics Challenge includes tasks in which the robot must fully turn various valves to score points.
This project experiments with a scenario similar to the DRC valve task, using the reinforcement learning algorithm Sarsa in a ROS and VREP simulation environment.
State-Action-Reward-State-Action (SARSA) is a reinforcement learning algorithm for learning a policy for a Markov decision process. It is an on-policy Temporal Difference (TD) algorithm.
Temporal Difference algorithms need many episodes to find a good policy. Eligibility traces are widely used to reduce the number of episodes required. They are one of the basic mechanisms of reinforcement learning and can increase learning efficiency considerably. Combining eligibility traces with SARSA gives the algorithm called Sarsa(λ).
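As an illustration of how eligibility traces change the SARSA update, here is a minimal tabular sketch with accumulating traces. It is not the project's code: the function name, the NumPy table layout (one row per state, one column per action), and the choice of accumulating rather than replacing traces are all assumptions made for this example.

```python
import numpy as np

def sarsa_lambda_update(Q, E, s, a, r, s_next, a_next,
                        alpha=0.1, gamma=0.9, lam=0.9, terminal=False):
    """One tabular Sarsa(lambda) step; mutates Q and E in place.

    Q, E: float arrays of shape (n_states, n_actions) (assumed layout).
    """
    # TD target: bootstrap from the action actually chosen next (on-policy).
    target = r if terminal else r + gamma * Q[s_next, a_next]
    delta = target - Q[s, a]

    # Accumulating trace: mark the visited state-action pair.
    E[s, a] += 1.0

    # Every pair is updated in proportion to its trace.
    Q += alpha * delta * E

    # Traces decay by gamma * lambda after each step.
    E *= gamma * lam
    return delta

# Usage: two steps on a tiny 3-state, 2-action table.
Q = np.zeros((3, 2))
E = np.zeros((3, 2))
sarsa_lambda_update(Q, E, s=0, a=1, r=1.0, s_next=1, a_next=0)
sarsa_lambda_update(Q, E, s=1, a=0, r=1.0, s_next=2, a_next=0)
```

After the second step, the pair visited first (state 0, action 1) also receives part of the new TD error through its decayed trace. This is why fewer episodes are needed than with one-step SARSA.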
In the experiments, the xy-plane above the valve is discretized to reduce the state space. A state is the (x, y) position of the end effector of the robot arm.
Actions are defined as 8-directional movements.
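A minimal sketch of such a state and action encoding follows. The grid size, cell width, origin, and all names (`GRID_SIZE`, `CELL`, `discretize`, `move`) are hypothetical values chosen for illustration; the source does not state them.

```python
# Eight compass moves as (dx, dy) offsets between neighboring grid cells.
ACTIONS = [(-1, -1), (0, -1), (1, -1),
           (-1,  0),          (1,  0),
           (-1,  1), (0,  1), (1,  1)]

GRID_SIZE = 5   # assumed: 5 x 5 cells over the valve
CELL = 0.05     # assumed: cell width in metres

def discretize(x, y, origin=(0.0, 0.0)):
    """Map a continuous end-effector (x, y) position to a grid cell."""
    col = int((x - origin[0]) // CELL)
    row = int((y - origin[1]) // CELL)
    # Clamp positions outside the grid to the nearest border cell.
    col = min(max(col, 0), GRID_SIZE - 1)
    row = min(max(row, 0), GRID_SIZE - 1)
    return col, row

def move(state, action_index):
    """Apply one of the eight moves; stay in place if it would leave the grid."""
    dx, dy = ACTIONS[action_index]
    col, row = state[0] + dx, state[1] + dy
    if 0 <= col < GRID_SIZE and 0 <= row < GRID_SIZE:
        return (col, row)
    return state
```

In the simulated setting, `move` would only give the intended target cell. The cell actually reached would come from discretizing the end-effector position reported by the simulator.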
In some states, the arm cannot move to certain other states while gripping. For example, while gripping the valve, it cannot move toward the center of the valve.
Because of the limitations of the robot's arm and the existence of impossible actions, the outcomes of actions are non-deterministic.
Current Location
For the valve task, the reward function could be the angle of rotation or a single reward on full rotation. However, because the physics of a valve could not be fully implemented in the VREP simulation environment, rewards are instead defined at a few locations in the state space.
Goals
Some movements are not possible. These movements are given a small negative reward so that the agent learns which actions are possible.
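The reward scheme described above can be sketched as follows. The goal cells and the numeric reward values are placeholders invented for this example; the source only says that rewards sit at a few locations and that impossible moves receive a small negative reward.

```python
GOALS = {(4, 2), (2, 4)}     # hypothetical goal cells in the grid
GOAL_REWARD = 1.0            # assumed value
IMPOSSIBLE_PENALTY = -0.1    # assumed "small negative" value

def reward(next_state, move_succeeded):
    """Reward for one step on the discretized grid."""
    if not move_succeeded:
        # The arm could not execute the requested move.
        return IMPOSSIBLE_PENALTY
    return GOAL_REWARD if next_state in GOALS else 0.0
```

Here `move_succeeded` stands for whatever check detects that the end effector did not reach the requested cell, for example comparing the state before and after the action.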
Learning rate = 0.1
ε-greedy exploration rate (ε) = 0.9, decreasing to 0 over time
Discount factor = 0.9
λ = 0.9
Episodes = 50
Max steps per episode = 50
Non-change limit = 5
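To show how the exploration rate could be driven from these settings, here is a small sketch of ε-greedy action selection with a decaying ε. The source does not say how ε is decreased, so the linear per-episode schedule is an assumption, as are the names `epsilon_for` and `epsilon_greedy`.

```python
import random

def epsilon_for(episode, start=0.9, end=0.0, episodes=50):
    """Linearly interpolate epsilon from start to end (assumed schedule)."""
    frac = min(episode / (episodes - 1), 1.0)
    return start + (end - start) * frac

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice; ties resolve to the lowest action index.
    return q_values.index(max(q_values))
```

With ε at 0 in the final episode, the agent acts purely greedily with respect to its learned action values.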