# Some introductory concepts in Deep Reinforcement Learning

Leonardo Petrini @ PCSL Group Meeting - July 16th, 2020.

Motivation: the game of Pong

How to make a computer play pong?

1. Explicit programming
2. Supervised Learning: learn from examples
3. Reinforcement Learning: learn from experience

RL ingredients

e.g. (Game of Pong)

• State = {pads and ball positions and velocities}
• Actions = {up, down, stay}
• Reward = {+1 score a goal, -1  concede a goal}

(This relies on a Markov assumption: the current state is all that is needed to predict what happens next.)

A Taxonomy of RL Algorithms

spinningup.openai.com


Q-values and Policy (1)

Agent's cheat sheet:

For each state $$s_t$$, take action $$a_t = \argmax_a \text{TABLE}(s_t, a)$$

The entries of the table are the Q-values; the rule that picks an action from them is the policy.
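As a minimal sketch of the cheat sheet, the table can be a dictionary indexed by (state, action) and the policy a one-line argmax over it. The state names and the numbers below are made up for illustration; they are not taken from the talk.

```python
# Hypothetical toy table: keys are (state, action) pairs, values are Q-values.
ACTIONS = ["up", "down", "stay"]

q_table = {
    ("ball_above", "up"): 0.8,
    ("ball_above", "down"): -0.5,
    ("ball_above", "stay"): 0.1,
    ("ball_below", "up"): -0.4,
    ("ball_below", "down"): 0.7,
    ("ball_below", "stay"): 0.0,
}


def greedy_action(table, state, actions=ACTIONS):
    """Return argmax_a TABLE(state, a): the greedy policy read off the table."""
    return max(actions, key=lambda a: table[(state, a)])
```

With these illustrative numbers, the agent moves the pad towards the ball in both states.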

Q-values and Policy (2)

Q-value $$Q(s,a)$$: expected discounted reward for playing action $$a$$ in state $$s$$

$$\begin{aligned}Q(s_t,a_t) &\stackrel{.}{=} \langle r_t + \gamma r_{t+1} + \gamma^2 r_{t+2}+ \dots \rangle \\ &= \langle r_t + \gamma Q(s_{t+1}, a_{t+1}) \rangle,\end{aligned}$$

where $$\gamma < 1$$ is the discount factor.

Policy $$\pi(a|s)$$: probability of playing action $$a$$ in state $$s$$.

SARSA and Q-learning, next, are algorithms that exploit this recursive equality.
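The recursion can be checked on a single sampled episode, where the average reduces to one discounted sum. This is a small sketch; the reward sequence and the value of $$\gamma$$ are arbitrary choices for illustration.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma*r_1 + gamma^2*r_2 + ... for one sampled episode."""
    g = 0.0
    # Accumulate backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [0.0, 0.0, 1.0]  # illustrative: a single reward at the end
gamma = 0.9

g0 = discounted_return(rewards, gamma)      # return from t = 0
g1 = discounted_return(rewards[1:], gamma)  # return from t = 1
```

Here `g0` equals `rewards[0] + gamma * g1`, which is the sampled version of the equality above.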

SARSA and Q-learning

$$s \rightarrow a \rightarrow r \rightarrow s' \rightarrow a'$$

SARSA algorithm (on-policy). Initialize Q-values and start from state $$s_0$$

(1)  choose action $$a$$ according to policy $$\pi(a|s)$$ $$-$$ e.g. greedy: choose $$a^* = \argmax_a Q(s,a)$$

(2)  observe $$r$$ and $$s'$$

(3)  choose action $$a'$$ according to $$\pi(a'|s')$$

(4)  update with SARSA rule:

$$\Delta Q(s,a) = \eta [r + \gamma Q(s',a') - Q(s,a)]$$

(5)  $$s \leftarrow s', a \leftarrow a'$$

(6)  go to (2)
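The SARSA steps can be sketched in tabular form as follows. The environment is a made-up four-state corridor (not Pong) with a reward of +1 at the right end, and the policy is epsilon-greedy rather than purely greedy so that the agent keeps exploring. All names and hyperparameters are assumptions chosen for illustration.

```python
import random

N_STATES, GOAL = 4, 3   # hypothetical corridor: states 0..3, goal at 3
ACTIONS = (0, 1)        # 0 = left, 1 = right


def step(s, a):
    """Toy dynamics: move left/right, reward +1 on reaching the goal."""
    s_next = max(0, s - 1) if a == 0 else s + 1
    reward = 1.0 if s_next == GOAL else 0.0
    return reward, s_next, s_next == GOAL


def epsilon_greedy(Q, s, eps, rng):
    """Random action with probability eps, otherwise greedy (random tie-break)."""
    if rng.random() < eps:
        return rng.choice(ACTIONS)
    best = max(Q[(s, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if Q[(s, a)] == best])


def sarsa(episodes=500, eta=0.1, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(Q, s, eps, rng)                # choose a
        done = False
        while not done:
            r, s_next, done = step(s, a)                  # observe r and s'
            a_next = epsilon_greedy(Q, s_next, eps, rng)  # choose a'
            # No bootstrap term once the episode has ended.
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += eta * (target - Q[(s, a)])       # SARSA rule
            s, a = s_next, a_next                         # s <- s', a <- a'
    return Q
```

Calling `sarsa()` returns the learned table; in this corridor one would expect "right" to end up with the larger Q-value in every non-terminal state.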

Q-learning

Q-learning algorithm (off-policy)

Initialize Q-values and start from state $$s_0$$

(1)  choose action $$a$$ according to policy $$\pi(a|s)$$ $$-$$ e.g. greedy: choose $$a^* = \argmax_a Q(s,a)$$

(2)  observe $$r$$ and $$s'$$

(3)  update with Q-learning rule:

$$\Delta Q(s,a) = \eta [r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$$

(4)  $$s \leftarrow s'$$

(5)  go to (1)
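The only difference from SARSA is the target inside the update. The sketch below compares the two targets on one hypothetical transition in which the next action actually taken is exploratory; the states, actions and Q-values are invented for the example.

```python
def sarsa_target(Q, r, s_next, a_next, gamma):
    """On-policy target: uses the action a' that was actually chosen in s'."""
    return r + gamma * Q[(s_next, a_next)]


def q_learning_target(Q, r, s_next, actions, gamma):
    """Off-policy target: uses the best action in s', whatever is played next."""
    return r + gamma * max(Q[(s_next, a)] for a in actions)


def td_update(Q, s, a, target, eta):
    """Shared update: Delta Q(s, a) = eta * (target - Q(s, a))."""
    Q[(s, a)] += eta * (target - Q[(s, a)])


# Illustrative values only.
actions = ["up", "down"]
Q = {("s", "up"): 0.0, ("s2", "up"): 0.5, ("s2", "down"): 0.2}
r, gamma = 0.0, 0.9

t_sarsa = sarsa_target(Q, r, "s2", "down", gamma)     # a' = "down" (exploratory)
t_qlearn = q_learning_target(Q, r, "s2", actions, gamma)
```

With these numbers the SARSA target is 0.9 × 0.2 and the Q-learning target is 0.9 × 0.5: Q-learning evaluates the greedy policy even while the agent behaves differently.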

Discrete vs continuous state space

What if the state space is too large, or even continuous?

1. We would need a huge table
2. Exploring each state to fill the table could take an unrealistic amount of time

We could use a neural network to approximate the Q-value function!!

Deep Q-learning (DQN)

Use a neural network to learn Q-values.

Output vector $$\{Q_w(s, a_n)\}_{n=1}^N$$, one entry per action, where $$w$$ are the network parameters.

Loss function (from SARSA update rule):

$$\mathcal{L} = \frac{1}{2} [r + \gamma Q_w(s',a') - Q_w(s,a)]^2$$

The term $$r + \gamma Q_w(s',a')$$ is the target. It changes during learning, so extra care is needed: its dependence on $$w$$ is ignored when taking the gradient of the loss.
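To make "ignore the $$w$$ dependence of the target" concrete, here is a sketch of one gradient step on this loss. A linear model $$Q_w(s,a) = w_a \cdot \phi(s)$$ stands in for the neural network so that the gradient can be written by hand; the feature vectors and hyperparameters are assumptions for illustration.

```python
def q_value(w, features, action):
    """Linear stand-in for the network: Q_w(s, a) = w[a] . phi(s)."""
    return sum(wi * xi for wi, xi in zip(w[action], features))


def semi_gradient_step(w, s_feat, a, r, s_next_feat, a_next, gamma, lr):
    """One gradient-descent step on L = 1/2 (target - Q_w(s, a))^2."""
    # The target is computed first and then treated as a constant:
    # no gradient flows through Q_w(s', a').
    target = r + gamma * q_value(w, s_next_feat, a_next)
    td_error = target - q_value(w, s_feat, a)
    loss = 0.5 * td_error ** 2
    # With the target held fixed, dL/dw[a] = -td_error * phi(s).
    w[a] = [wi + lr * td_error * xi for wi, xi in zip(w[a], s_feat)]
    return loss


# Two actions, two features; all numbers are illustrative.
w = [[0.0, 0.0], [0.0, 0.0]]
loss1 = semi_gradient_step(w, [1.0, 0.0], 0, 1.0, [0.0, 1.0], 1, 0.9, 0.5)
loss2 = semi_gradient_step(w, [1.0, 0.0], 0, 1.0, [0.0, 1.0], 1, 0.9, 0.5)
```

Repeating the same transition shrinks the loss, since $$Q_w(s,a)$$ moves towards the fixed target.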

## From Q-learning to Policy Gradient: Can we learn the policy $$\pi(a|s)$$ directly?

Yes, we can!

Goal: approximate the optimal policy function $$\pi^*(a|s)$$

We do that by parametrizing the policy $$\pi_\theta(a|s)$$ and optimizing $$\theta$$ to maximize a performance measure $$J(\theta)$$.

$$\rightarrow$$  Gradient ascent in parameter space:

$$\theta_{n+1} = \theta_n + \eta \nabla_\theta J(\theta_n)$$

What performance measure? The total expected reward for playing an episode of the game following a trajectory $$\tau = \{s_0, \dots, s_{end}\}$$,

$$J(\theta) = \mathbb{E}_{\tau | \pi_\theta}[r(\tau)]$$

We define $$\pi_\theta(\tau) \stackrel{.}{=} p(s_0)\,\prod_{t=0}^T p(s_{t+1}| s_t, a_t)\pi_\theta(a_t|s_t)$$

How do we deal with the gradient of an expectation?


$$\begin{aligned} \nabla_\theta \mathbb{E}_{\tau | \pi_\theta}[r(\tau)] &= \nabla_\theta \int \pi_\theta(\tau)\, r(\tau)\, d\tau \\ &= \int \nabla_\theta \pi_\theta(\tau)\, r(\tau)\, d\tau \\ &= \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\, d\tau \\ &= \mathbb{E}_{\tau | \pi_\theta}[r(\tau)\nabla_\theta \log \pi_\theta(\tau)] \end{aligned}$$

The derivative of the expected reward is the expectation of the reward times the derivative of the log policy:



$$\begin{aligned} \nabla_\theta J(\theta) &= \mathbb{E}_{\tau | \pi_\theta}[r(\tau)\nabla_\theta \log \pi_\theta(\tau)] \\ &= \mathbb{E}_{\tau | \pi_\theta} \left[r(\tau) \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \right] \end{aligned}$$

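Recall: $$\pi_\theta(\tau) \stackrel{.}{=} p(s_0)\,\prod_{t=0}^T p(s_{t+1}| s_t, a_t)\pi_\theta(a_t|s_t)$$

Taking the log turns the product into a sum, and the terms $$p(s_0)$$ and $$p(s_{t+1}|s_t,a_t)$$ do not depend on $$\theta$$, so their gradients vanish. No need to know about the initial state distribution or the transition probabilities between states!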
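The identity can be checked numerically in the simplest possible setting: a one-step "trajectory" consisting of a single action drawn from a softmax policy. This sketch compares the expectation of reward times gradient of the log policy (summed exactly over actions) with a finite-difference gradient of the expected reward. The softmax parametrization, the rewards and the parameter values are assumptions for illustration.

```python
import math


def softmax(theta):
    m = max(theta)  # subtract the max for numerical stability
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def expected_reward(theta, rewards):
    """J(theta) = sum_a pi_theta(a) r(a)."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))


def grad_log_pi(theta, a):
    """For a softmax policy: d/dtheta_k log pi(a) = 1[k == a] - pi_k."""
    pi = softmax(theta)
    return [(1.0 if k == a else 0.0) - pi[k] for k in range(len(theta))]


def score_function_gradient(theta, rewards):
    """E_a[ r(a) * grad log pi(a) ], computed exactly by summing over actions."""
    pi = softmax(theta)
    grad = [0.0] * len(theta)
    for a, (p, r) in enumerate(zip(pi, rewards)):
        g = grad_log_pi(theta, a)
        for k in range(len(theta)):
            grad[k] += p * r * g[k]
    return grad


def finite_difference_gradient(theta, rewards, h=1e-6):
    """Central-difference estimate of grad J, used only as a reference."""
    grad = []
    for k in range(len(theta)):
        up, dn = list(theta), list(theta)
        up[k] += h
        dn[k] -= h
        diff = expected_reward(up, rewards) - expected_reward(dn, rewards)
        grad.append(diff / (2 * h))
    return grad


theta = [0.2, -0.3, 0.5]   # illustrative parameters
rewards = [1.0, 0.0, 2.0]  # illustrative reward per action
g_score = score_function_gradient(theta, rewards)
g_fd = finite_difference_gradient(theta, rewards)
```

The two gradients agree up to finite-difference error, and the score-function version only ever needed $$\nabla_\theta \log \pi_\theta$$.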

REINFORCE algorithm



$$\nabla J(\theta) = \mathbb{E}_{\tau | \pi_\theta} \left[r(\tau) \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \right]$$

Trajectory reward $$r(\tau)$$?

The REINFORCE algorithm weights each time step by the discounted return from that step onward,

$$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$$

in place of $$r(\tau)$$, since action $$a_t$$ cannot affect rewards received before time $$t$$.

Interpretation of $$\Delta\theta \varpropto G_t \frac{\nabla_\theta\pi_\theta (a_t|s_t)}{\pi_\theta(a_t|s_t)}$$: if taking action $$a_t$$ in state $$s_t$$ gives a positive return $$\rightarrow$$ move $$\theta$$ in the direction that increases the probability of repeating $$a_t$$ when visiting $$s_t$$.

Issue: rewards are usually sparse $$\rightarrow$$ averages over trajectories have high variance.
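A sketch of one REINFORCE update on a single recorded episode, assuming a tabular softmax policy (one vector of action preferences per state). The episode, the state name and the learning rate are invented for illustration.

```python
import math


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def returns_to_go(rewards, gamma):
    """G_t = r_t + gamma * r_{t+1} + ... for every t of one episode."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


def reinforce_update(theta, episode, gamma, lr):
    """episode: list of (state, action, reward); theta: state -> preferences."""
    returns = returns_to_go([r for _, _, r in episode], gamma)
    # Accumulate the whole gradient first so every term uses the same policy.
    grad = {s: [0.0] * len(p) for s, p in theta.items()}
    for (s, a, _), g in zip(episode, returns):
        pi = softmax(theta[s])
        for k in range(len(pi)):
            # G_t * grad log pi(a_t | s_t) for a softmax policy
            grad[s][k] += g * ((1.0 if k == a else 0.0) - pi[k])
    for s in theta:
        theta[s] = [t + lr * d for t, d in zip(theta[s], grad[s])]
    return returns


theta = {"s": [0.0, 0.0]}                                   # uniform policy
episode = [("s", 1, 0.0), ("s", 0, 0.0), ("s", 1, 1.0)]     # illustrative
returns = reinforce_update(theta, episode, gamma=0.9, lr=0.1)
```

Action 1 was followed by larger returns overall, so after the update the policy assigns it a probability above one half.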

REINFORCE w/ baseline

To lower the variance we can subtract a baseline $$b(s_t)$$ from the return,

$$\Delta\theta \varpropto (G_t - b(s_t)) \frac{\nabla_\theta\pi_\theta (a_t|s_t)}{\pi_\theta(a_t|s_t)}$$

which still keeps the gradient unbiased.

A common choice is to take the state value as baseline:

$$V(s) = \mathbb{E}_{\pi_\theta}[G_t | s_t = s]$$
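A short derivation of why the baseline does not bias the gradient, written for a discrete action set (an assumption made here for simplicity). The baseline depends on the state only, so it factors out of the average over actions:

$$\begin{aligned} \mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\left[ b(s)\, \nabla_\theta \log \pi_\theta(a|s) \right] &= b(s) \sum_a \pi_\theta(a|s)\, \frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)} \\ &= b(s)\, \nabla_\theta \sum_a \pi_\theta(a|s) \\ &= b(s)\, \nabla_\theta 1 = 0 \end{aligned}$$

The subtracted term therefore averages to zero, and only the variance of the estimate changes.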

Actor Critic Methods

Notice that we can decompose the expectation into a part over the trajectory up to time $$t$$ and a part over what follows, with each time step weighted by its return $$G_t$$ as in REINFORCE:

$$\begin{aligned} \nabla J(\theta) &= \mathbb{E}_{\tau | \pi_\theta} \left[r(\tau) \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \right] \\ & = \sum_{t=0}^T \mathbb{E}_{s_0, a_0, \dots, s_t, a_t} \left[\nabla_\theta \log \pi_\theta(a_t|s_t) \, \mathbb{E}_{s_{t+1}, \dots, s_{end}} [G_t] \right] \\ & \approx \sum_{t=0}^T \mathbb{E}_{s_0, a_0, \dots, s_t, a_t} \left[\nabla_\theta \log \pi_\theta(a_t|s_t) \, Q_w(s_t,a_t) \right] \end{aligned}$$

The inner expectation of $$G_t$$ is a Q-value, here approximated by $$Q_w(s_t,a_t)$$. This gives two components:

1. The “Critic” $$Q_w(s,a)$$ estimates the value function.
2. The “Actor” updates the policy distribution in the direction suggested by the Critic.

Both the Critic and the Actor are parameterized with neural networks.

    import torch.nn as nn
    import torch.nn.functional as F


    class ActorCritic(nn.Module):
        """Shared conv + LSTM trunk feeding an actor head and a critic head."""

        def __init__(self, num_inputs, num_actions):
            super().__init__()
            # Four stride-2 convolutions; each one roughly halves the spatial size.
            self.conv1 = nn.Conv2d(num_inputs, 32, 3, stride=2, padding=1)
            self.conv2 = nn.Conv2d(32, 32, 3, stride=2, padding=1)
            self.conv3 = nn.Conv2d(32, 32, 3, stride=2, padding=1)
            self.conv4 = nn.Conv2d(32, 32, 3, stride=2, padding=1)
            # 32 * 6 * 6 matches the flattened conv output for 84x84 input frames.
            self.lstm = nn.LSTMCell(32 * 6 * 6, 512)
            self.critic_linear = nn.Linear(512, 1)            # scalar value estimate
            self.actor_linear = nn.Linear(512, num_actions)   # one logit per action
            self._initialize_weights()

        def _initialize_weights(self):
            for module in self.modules():
                if isinstance(module, (nn.Conv2d, nn.Linear)):
                    nn.init.xavier_uniform_(module.weight)
                    # nn.init.kaiming_uniform_(module.weight)
                    nn.init.constant_(module.bias, 0)
                elif isinstance(module, nn.LSTMCell):
                    nn.init.constant_(module.bias_ih, 0)
                    nn.init.constant_(module.bias_hh, 0)

        def forward(self, x, hx, cx):
            x = F.relu(self.conv1(x))
            x = F.relu(self.conv2(x))
            x = F.relu(self.conv3(x))
            x = F.relu(self.conv4(x))
            # Flatten the feature maps and advance the recurrent state by one step.
            hx, cx = self.lstm(x.view(x.size(0), -1), (hx, cx))
            # Actor logits, critic value, and the new LSTM state (hx, cx).
            return self.actor_linear(hx), self.critic_linear(hx), hx, cx
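The network above only defines the two heads. How the Critic's estimate enters the Actor's update can be sketched in a much smaller setting: a single state, two actions, a tabular Critic and a softmax Actor. This is an illustrative toy, not the training loop used with the network; the rewards, learning rates and seed are assumptions.

```python
import math
import random


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def train_actor_critic(rewards=(0.0, 1.0), steps=2000,
                       actor_lr=0.1, critic_lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]  # actor: action preferences of the softmax policy
    q = [0.0, 0.0]      # critic: tabular Q estimate, one entry per action
    for _ in range(steps):
        pi = softmax(theta)
        a = 1 if rng.random() < pi[1] else 0
        r = rewards[a]
        # Critic: episodes last one step here, so the target for Q(a) is r.
        q[a] += critic_lr * (r - q[a])
        # Actor: policy-gradient step with the critic's Q(a) in place of G_t.
        for k in range(2):
            theta[k] += actor_lr * q[a] * ((1.0 if k == a else 0.0) - pi[k])
    return softmax(theta), q


pi, q = train_actor_critic()
```

In this toy the Critic learns that action 1 is worth about 1 and the Actor shifts almost all probability onto it.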

Some References:

- Artificial Neural Networks Course @ EPFL - Wulfram Gerstner

- Reinforcement Learning: An Introduction - second edition. Richard S. Sutton and Andrew G. Barto. The MIT Press - Cambridge, Massachusetts - London, England

Reward Shaping



To address the problem of sparse rewards one can manually design a reward function.

Presentation for PCSL Group Meeting @ EPFL