Reinforcement Learning
Mobile Robot Systems
Juraj Micko (jm2186)
Kwot Sin Lee (ksl36)
Wilson Suen (wss28)
Problem background
Use reinforcement learning to learn a robot controller that follows a path
Main task: Learn velocities
Extension: Learn hyper-parameters of a controller

Environment
Action space: (u, \omega) (forward and angular velocity)
Path: \rho(t)\ \text{for}\ t \in [0, 1]
Environment

Observation space:
(\rho(t)_{x_r}, \rho(t)_{y_r})
- where t is chosen by the environment as the next point to follow, based on the robot's position
- coordinates are relative to the robot's frame
Path handling
- A continuous path is randomly generated
- Goals are generated based on the robot's progress along the path
- The robot is first trained to head towards a goal before handling full path following (sketched below)
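To make the environment concrete, here is a minimal sketch of how it could be structured, with a Gym-style step/observation interface. The class name, the goal-advancement rule, and all constants are illustrative assumptions, not the project's actual implementation.

import numpy as np

class PathFollowEnv:
    """Illustrative sketch of the path-following environment (not the project's actual code)."""

    def __init__(self, path, dt=0.1):
        self.path = path            # callable rho(t) -> (x, y) for t in [0, 1]
        self.dt = dt
        self.pose = np.zeros(3)     # robot pose (x, y, theta)
        self.t = 0.0                # progress parameter of the current goal point

    def observation(self):
        # Next path point rho(t), expressed in the robot's frame (relative coordinates).
        gx, gy = self.path(self.t)
        dx, dy = gx - self.pose[0], gy - self.pose[1]
        c, s = np.cos(self.pose[2]), np.sin(self.pose[2])
        return np.array([c * dx + s * dy, -s * dx + c * dy])

    def step(self, action):
        u, w = action               # forward and angular velocity
        x, y, theta = self.pose
        self.pose = np.array([x + u * np.cos(theta) * self.dt,
                              y + u * np.sin(theta) * self.dt,
                              theta + w * self.dt])
        # Advance the goal along the path based on the robot's progress (simplified rule).
        if np.linalg.norm(self.observation()) < 0.1:
            self.t = min(self.t + 0.05, 1.0)
        reward = 0.0                # reward terms are sketched on the reward-function slide
        done = self.t >= 1.0
        return self.observation(), reward, done, {}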


Reward Function
- Direction reward
- Distance reward
- Angular velocity penalty
Total reward = Direction reward + Distance reward - Angular velocity penalty
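A hedged sketch of how the three terms could be combined; the exact formulas and weightings are assumptions for illustration, not the project's actual reward.

import numpy as np

def reward(goal_rel, velocity_cmd, w_dir=1.0, w_dist=1.0, w_ang=0.1):
    """Illustrative reward: direction + distance - angular-velocity penalty.
    goal_rel: next path point in the robot frame; velocity_cmd: (u, omega).
    Weights and term definitions are assumptions, not the project's actual values."""
    u, omega = velocity_cmd
    # Direction reward: high when the robot is heading towards the goal point.
    heading_error = np.arctan2(goal_rel[1], goal_rel[0])
    direction_reward = np.cos(heading_error)
    # Distance reward: high (less negative) when the robot is close to the goal point.
    distance_reward = -np.linalg.norm(goal_rel)
    # Penalise large angular velocities to discourage spinning in place.
    angular_penalty = abs(omega)
    return w_dir * direction_reward + w_dist * distance_reward - w_ang * angular_penalty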
Extension
Learn hyper-parameters of a feedback-linearized low-level path-following controller
Path-following controller
[Diagram] Observation → Agent → controller coefficients (K_i, K_p, K_d, \epsilon) → Move holonomic point → (\dot{p_x}, \dot{p_y})
Move holonomic point
- PID coefficients
- Epsilon (holonomic point offset) → feedback linearisation

V_{final} = -\text{PID}_{(K_i, K_p, K_d)}(\text{error}) \cdot V_p + V_f \qquad \text{giving the holonomic-point velocity } (\dot{p_x}, \dot{p_y})
Path-following controller
[Diagram] Observation → Agent → controller coefficients (K_i, K_p, K_d, \epsilon) → Move holonomic point → (\dot{p_x}, \dot{p_y}) → feedback linearization → (u, \omega) → Move the robot = action in the environment
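A hedged sketch of the two building blocks on this slide: a PID on the path-following error, and the feedback-linearization step that converts the holonomic-point velocity (\dot{p_x}, \dot{p_y}) into robot commands (u, \omega). The agent would supply (K_i, K_p, K_d, \epsilon); the scalar-error formulation and all names are illustrative assumptions.

import numpy as np

class PID:
    """Minimal PID controller on a scalar error signal (illustrative only)."""
    def __init__(self, kp, ki, kd, dt=0.1):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.prev_error = 0.0, 0.0

    def __call__(self, error):
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def feedback_linearize(vp_x, vp_y, theta, epsilon):
    """Map a desired holonomic-point velocity (vp_x, vp_y) to robot commands (u, omega).
    Standard offset-point feedback linearization; epsilon is the holonomic-point offset."""
    u = vp_x * np.cos(theta) + vp_y * np.sin(theta)
    omega = (-vp_x * np.sin(theta) + vp_y * np.cos(theta)) / epsilon
    return u, omega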
DDPG
Deep Deterministic Policy Gradients
- Actor: the policy \pi
- Critic: the Q-function Q

Q^*(s_t, a_t) = r_t + \gamma\ \underset{a}{\mathrm{max}}\,Q^*(s_{t+1}, a)
\pi^*(s_t) = \underset{a}{\mathrm{argmax}}\ Q^*(s_t,a)

Transitions (\,s_t,\ a_t,\ r_t,\ s_{t+1}\,) stored in a replay buffer
Training [Actor]
Update the actor so that the critic's value of its action is maximised:
Q(s_t, \pi(s_t)) \rightarrow \mathrm{max}
Training [Critic]
Sample transitions (\,s_t,\ a_t,\ r_t,\ s_{t+1}\,) and regress the critic towards the bootstrapped target:
Q(s_t, a_t) \rightarrow r_t + \gamma\, Q^*(s_{t+1}, \pi^*(s_{t+1})) \approx r_t + \gamma\ \underset{a}{\mathrm{max}}\,Q^*(s_{t+1}, a) \qquad \text{(target)}
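A minimal sketch of one DDPG update implementing the two training steps above, in PyTorch style. Network definitions, replay-buffer sampling, and hyper-parameter values are assumptions, not the project's actual code.

import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99):
    """One DDPG update on a batch of transitions (s, a, r, s_next); illustrative sketch."""
    s, a, r, s_next = batch

    # Critic: regress Q(s, a) towards the bootstrapped target computed with the target networks.
    with torch.no_grad():
        target = r + gamma * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: maximise the critic's value of the actor's action (gradient ascent via negated loss).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()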
Updating target networks
- Soft updates (Polyak averaging):
\theta^Q : \text{Q network}\\
\theta^\pi : \text{policy network}
\theta^{Q^*} : \text{target Q network}
\theta^{\pi^*} : \text{target policy network}
\theta^{Q^*} \leftarrow \tau\theta^Q + (1-\tau)\theta^{Q^*}\\
\theta^{\pi^*} \leftarrow \tau\theta^\pi + (1-\tau)\theta^{\pi^*}
Target networks
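A minimal sketch of the soft (Polyak) update above, in PyTorch style; the value of tau is illustrative.

import torch

@torch.no_grad()
def soft_update(target_net, net, tau=0.005):
    """theta_target <- tau * theta + (1 - tau) * theta_target, applied parameter-wise."""
    for target_param, param in zip(target_net.parameters(), net.parameters()):
        target_param.mul_(1.0 - tau).add_(tau * param)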
Exploration
- The policy is deterministic, so we cannot sample exploratory actions from it
- Instead, add random noise to the selected action
- Noise from an Ornstein-Uhlenbeck process

\pi'(s_t) = \pi(s_t) + \mathcal{N}
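A minimal sketch of an Ornstein-Uhlenbeck noise process that could be added to the deterministic policy's actions; all parameter values are illustrative assumptions.

import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated noise for exploration."""
    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.state = np.full(size, mu, dtype=np.float64)

    def sample(self):
        # dx = theta * (mu - x) dt + sigma * sqrt(dt) * N(0, 1)
        dx = (self.theta * (self.mu - self.state) * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.state.shape))
        self.state += dx
        return self.state

An exploratory action is then \pi(s_t) + noise.sample(), clipped to the action bounds.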
PPO
Proximal Policy Optimisation
Overview
- Policy Gradient Methods
- Trust Region Policy Optimisation (TRPO)
- Proximal Policy Optimisation (PPO)
Policy Gradient Method
- Increase the probability of high-reward actions
- Objective Function
L^{\text{PG}} (\theta) = \hat{\mathbb{E}}_t \left[\log \pi_\theta(a_t \vert s_t) \hat{A}_t \right]
- Unstable in practice
- Extremely large policy updates
(\hat{A}_t: advantage function, \pi_\theta: policy network, t: time step, a_t: action, s_t: state)
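A hedged sketch of the vanilla policy-gradient objective as a loss (negated for minimisation), in PyTorch style; the assumption that policy(states) returns a torch.distributions object, and all names, are illustrative.

import torch

def pg_loss(policy, states, actions, advantages):
    """L^PG(theta) = E_t[ log pi_theta(a_t | s_t) * A_hat_t ], negated so that
    minimising this loss maximises the objective. `policy(states)` is assumed
    to return a torch.distributions distribution over actions."""
    log_probs = policy(states).log_prob(actions)
    return -(log_probs * advantages).mean()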
Trust Region Policy Optimisation
- Previously:
- Single bad step can collapse performance
- Dangerous to take large steps!
- Now:
- Constrain size of update
- Still take largest step for improvement
Trust Region Policy Optimisation
- Maximize a surrogate objective
\max_\theta \hat{\mathbb{E}}_t \left[ \frac{\pi_\theta(a_t \vert s_t)}{\pi_{\theta_{\text{old}}}(a_t \vert s_t)} \hat{A}_t \right]
(\pi_{\theta_\text{old}}: policy before the update)
- KL-divergence constraint
\hat{\mathbb{E}}_t \left[\text{KL}[\pi_{\theta_\text{old}}(\cdot \vert s_t), \pi_\theta(\cdot \vert s_t)]\right] \leq \delta
Trust Region Policy Optimisation
- Objective function
- Intuition:
- Take large steps
- New policy cannot be "too different"
- Difference in the form of KL-divergence
\max_\theta \hat{\mathbb{E}}_t \left[ \frac{\pi_\theta(a_t \vert s_t)}{\pi_{\theta_{\text{old}}}(a_t \vert s_t)} \hat{A}_t - \beta\, \text{KL}[\pi_{\theta_\text{old}}(\cdot \vert s_t), \pi_\theta(\cdot \vert s_t)] \right]
(\beta: scale of the KL penalty)
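A hedged sketch of this KL-penalized surrogate as a loss with a fixed beta (TRPO itself solves the constrained problem with a second-order method, as the next slide notes); all names and the assumption that the policies return torch.distributions objects are illustrative.

import torch
from torch.distributions import kl_divergence

def kl_penalized_loss(policy, old_policy, states, actions, advantages, beta=1.0):
    """Surrogate: ratio * A_hat - beta * KL(pi_old || pi_theta), negated for minimisation.
    old_policy is assumed to be a frozen copy of the policy before the update."""
    dist, old_dist = policy(states), old_policy(states)
    ratio = torch.exp(dist.log_prob(actions) - old_dist.log_prob(actions).detach())
    kl = kl_divergence(old_dist, dist)
    return -(ratio * advantages - beta * kl).mean()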
Trust Region Policy Optimisation
- Hard to tune \beta
- Different problems need different scaling
- Policy gradient update
- An analytical update is possible (using natural gradients)
- But it requires an expensive 2nd-order method to compute
Proximal Policy Optimisation
- Previously:
- Expensive 2nd order method
- Now:
- Use a 1st-order method
- Use softer constraints
- Mitigates the effect of possibly bad updates
L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t \left[ \min \left( r_t(\theta) \hat{A}_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]
r_t(\theta) = \frac{\pi_\theta(a_t \vert s_t)}{\pi_{\theta_{\text{old}}}(a_t \vert s_t)}
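A minimal sketch of the clipped surrogate above as a loss, in PyTorch style; the calling convention (pre-computed old log-probabilities, a policy returning torch.distributions objects) and the value of epsilon are assumptions.

import torch

def ppo_clip_loss(policy, old_log_probs, states, actions, advantages, eps=0.2):
    """L^CLIP: min(r_t * A_hat, clip(r_t, 1-eps, 1+eps) * A_hat), negated for minimisation."""
    log_probs = policy(states).log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)           # r_t(theta)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()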
Evaluation
Tools



Algorithms
- DDPG vs PPO

Warmup
Possible extensions
- Expand observation space
- Advanced simulation (Gazebo)
- Obstacles
- Reward: integrate path over a vector field
Thank you!
Mobile Robot Systems - Project presentation
By Juraj Mičko