Some introductory concepts in
Deep Reinforcement Learning

 

 

Leonardo Petrini @ PCSL Group Meeting - July 16th, 2020.

Motivation: the game of Pong

How to make a computer play pong?

  1. Explicit programming
  2. Supervised Learning: learn from examples
  3. Reinforcement Learning: learn from experience

RL ingredients

Example (game of Pong):

  • State = {paddle and ball positions and velocities}
  • Actions = {up, down, stay}
  • Reward = {+1 when scoring a goal, -1 when conceding one}

(There is a Markov assumption here: the current state contains all the information needed to choose the next action. A minimal interaction loop is sketched below.)
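To make these ingredients concrete, here is a minimal sketch of the agent-environment loop, assuming the classic gym API and that an Atari Pong environment is installed; the random action choice is just a placeholder for a real policy.

import gym

env = gym.make("Pong-v0")               # the environment provides states, actions and rewards
state = env.reset()                     # initial state
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()                  # placeholder policy: up/down/stay at random
    state, reward, done, info = env.step(action)        # observe next state and reward (+1 / -1 on goals)
    total_reward += reward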

A Taxonomy of RL Algorithms

[Figure: a taxonomy of RL algorithms, from spinningup.openai.com]

Q-values and Policy (1)

Agent's cheat sheet: for each state \(s_t\), take action \(a_t = \argmax_a \text{TABLE}(s_t, a)\).

[Figure: the cheat-sheet table; its entries are the Q-values, and the action-selection rule over them defines the policy.]

Q-values and Policy (2)

Q-value \(Q(s,a)\): expected discounted reward for playing action \(a\) in state \(s\),

$$\begin{aligned}Q(s_t,a_t) &\stackrel{.}{=} \langle r_t + \gamma r_{t+1} + \gamma^2 r_{t+2}+ \dots \rangle \\ &= \langle r_t \rangle + \gamma Q(s_{t+1}, a_{t+1}),\end{aligned}$$

where \(\gamma < 1\) is the discount factor.

Policy \(\pi(a|s)\): probability of playing action \(a\) in state \(s\).

SARSA and Q-learning are algorithms that exploit this recursive equality.

SARSA and Q-learning

$$s \rightarrow a \rightarrow r \rightarrow s' \rightarrow a' $$

 

SARSA algorithm. Initialize Q-values and start from state \(s_0\)
  (1)  choose action \(a\) according to policy \(\pi(a|s)\) \(-\) e.g. greedy: choose \(a^* = \argmax_a Q(s,a)\)
  (2)  observe \(r\) and \(s'\)
  (3)  choose action \(a'\) according to \(\pi(a'|s')\)
  (4)  update with SARSA rule:
$$\Delta Q(s,a) = \eta [r + \gamma Q(s',a') - Q(s,a)]$$
  (5)  \(s \leftarrow s', a \leftarrow a'\)
  (6)  go to (1)
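A minimal tabular sketch of this loop, assuming a small discrete environment with a gym-style reset/step interface where states are integer indices, and an ε-greedy policy; the environment, the terminal-state handling, and the hyperparameter values are illustrative.

import numpy as np

def epsilon_greedy(Q, s, n_actions, eps=0.1):
    # with probability eps explore, otherwise exploit the current Q-values
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def sarsa(env, n_states, n_actions, episodes=1000, eta=0.1, gamma=0.99):
    Q = np.zeros((n_states, n_actions))              # the "table" of Q-values
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, n_actions)          # (1) choose a from pi(a|s)
        done = False
        while not done:
            s_next, r, done, _ = env.step(a)                     # (2) observe r and s'
            a_next = epsilon_greedy(Q, s_next, n_actions)        # (3) choose a' from pi(a'|s')
            # (4) SARSA rule: Delta Q = eta [r + gamma Q(s',a') - Q(s,a)]
            # (terminal states are not bootstrapped)
            Q[s, a] += eta * (r + gamma * Q[s_next, a_next] * (not done) - Q[s, a])
            s, a = s_next, a_next                                # (5) s <- s', a <- a'
    return Q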

Q-learning

 

Q-learning algorithm (off-policy)

Initialize Q-values and start from state \(s_0\)

  (1)  choose action \(a\) according to policy \(\pi(a|s)\) \(-\) e.g. greedy: choose \(a^* = \argmax_a Q(s,a)\)

  (2)  observe \(r\) and \(s'\)

  (3)  update with Q-learning rule:

$$\Delta Q(s,a) = \eta [r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$$

  (4)  \(s \leftarrow s'\)

  (5)  go to (1)
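In the same tabular setting as the SARSA sketch above, only the update changes: the bootstrap uses the greedy action in \(s'\) rather than the action actually played, so there is no need to pick \(a'\) before updating. An illustrative sketch:

import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, eta=0.1, gamma=0.99):
    # step (3): Delta Q = eta [r + gamma max_a' Q(s',a') - Q(s,a)]
    target = r + gamma * np.max(Q[s_next]) * (not done)
    Q[s, a] += eta * (target - Q[s, a])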

Discrete vs continuous state space

What if the state space is too large, or even continuous?

  1. We would need a huge table 
  2. Exploring each state to fill the table could take an unrealistic amount of time

 

We could use a neural network to approximate the Q-value function!! 

Deep Q-learning (DQN)

Use a neural network to learn Q-values.

Output vector \(\{Q_w(a_n,s)\}_{n=1}^N\), where \(w\) are the network parameters and \(N\) is the number of actions.

Loss function (from the SARSA update rule):

$$\mathcal{L} = \frac{1}{2} [r + \gamma Q_w(s',a') - Q_w(s,a)]^2$$

The term \(r + \gamma Q_w(s',a')\) plays the role of the target; its dependence on \(w\) is ignored when taking gradients. Since the target itself changes during learning, extra care is needed!
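A minimal PyTorch sketch of this loss, with the target computed without gradients to implement the "ignore the \(w\) dependence of the target" prescription; the network and the batch variables (s, a, r, s', a') are placeholders, not part of the slides.

import torch
import torch.nn.functional as F

def dqn_loss(q_net, s, a, r, s_next, a_next, gamma=0.99):
    # Q_w(s, a) for the actions actually played (batch of transitions)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    # target r + gamma Q_w(s', a'): computed under no_grad, so its w dependence is ignored
    with torch.no_grad():
        q_next = q_net(s_next).gather(1, a_next.unsqueeze(1)).squeeze(1)
        target = r + gamma * q_next
    return 0.5 * F.mse_loss(q_sa, target)

Note that standard DQN would instead bootstrap with \(\max_{a'} Q_w(s',a')\) evaluated on a separate, periodically updated target network; the sketch follows the SARSA-style loss written on the slide.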

From Q-learning to Policy Gradient: can we learn the policy \(\pi(a|s)\) directly?

Yes, we can!


Policy Gradient

Goal: approximate the optimal policy function \(\pi^*(a|s)\)

We do that by parametrizing the policy \(\pi_\theta(a|s)\) and optimizing \(\theta\) so as to maximize a performance measure \(J(\theta)\).

\(\rightarrow\)  Gradient Ascent in parameter space:

$$\theta_{n+1} = \theta_n + \eta \nabla_\theta J(\theta)$$
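As a concrete illustration of what "parametrizing the policy" can look like, here is a minimal softmax policy network in PyTorch; the architecture, layer sizes, and names are illustrative and not part of the slides.

import torch
import torch.nn as nn

class SoftmaxPolicy(nn.Module):
    # pi_theta(a|s): a small MLP mapping a state vector to action probabilities
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s):
        return torch.softmax(self.net(s), dim=-1)   # probabilities over actions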

Performance measure? The total expected reward for playing an episode of the game along a trajectory \(\tau = \{s_0, \dots, s_{end}\}\):

$$J(\theta) = \mathbb{E}_{\tau | \pi_\theta}[r(\tau)]$$

We define the probability of a trajectory under the policy as $$\pi_\theta(\tau) \stackrel{.}{=} p(s_0)\,\prod_{t=0}^T p(s_{t+1}| s_t, a_t)\,\pi_\theta(a_t|s_t)$$

How do we deal with the gradient of an expectation?

Policy Gradient Theorem (1)

The derivative of the expected reward is the expectation of the reward times the derivative of the log policy:

$$\begin{aligned} \nabla_\theta \mathbb{E}_{\tau | \pi_\theta}[r(\tau)] &= \nabla_\theta \int \pi_\theta(\tau)\,r(\tau)\,d\tau \\ &= \int \nabla_\theta \pi_\theta(\tau)\, r(\tau)\,d\tau \\ &= \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\,d\tau \\ &= \mathbb{E}_{\tau | \pi_\theta}[r(\tau)\,\nabla_\theta \log \pi_\theta(\tau)], \end{aligned}$$

where the third line uses the log-derivative trick \(\nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \log \pi_\theta\).

Policy Gradient Theorem (2)

Recall: \(\pi_\theta(\tau) \stackrel{.}{=} p(s_0)\,\prod_{t=0}^T p(s_{t+1}| s_t, a_t)\,\pi_\theta(a_t|s_t)\), so \(\log \pi_\theta(\tau)\) is a sum in which only the \(\log \pi_\theta(a_t|s_t)\) terms depend on \(\theta\). The gradient of the performance function finally reads

$$\begin{aligned} \nabla_\theta J(\theta) &= \mathbb{E}_{\tau | \pi_\theta}[r(\tau)\,\nabla_\theta \log \pi_\theta(\tau)] \\ &= \mathbb{E}_{\tau | \pi_\theta} \left[r(\tau) \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \right]. \end{aligned}$$

No need to know the initial state distribution or the transition probabilities between states!

REINFORCE algorithm

Performance measure gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau | \pi_\theta} \left[r(\tau) \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \right]$$

Trajectory reward \(r(\tau)\)? The REINFORCE algorithm takes the discounted return

$$r(\tau) = \sum_{t=0}^T G_t = \sum_{t=0}^T (r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots )$$

Interpreting \(\Delta\theta \varpropto G_t \frac{\nabla_\theta\pi_\theta (a_t|s_t)}{\pi_\theta(a_t|s_t)}\): if taking action \(a_t\) in state \(s_t\) gives a positive return \(\rightarrow\) move \(\theta\) in the direction that increases the probability of repeating \(a_t\) when visiting \(s_t\).

Issue. Rewards are usually sparse \(\rightarrow\) the average over trajectories has high variance.
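A minimal PyTorch sketch of one REINFORCE update, assuming a gym-style environment with discrete actions and the SoftmaxPolicy sketched earlier; the function name and hyperparameters are illustrative.

import torch

def reinforce_episode(env, policy, optimizer, gamma=0.99):
    log_probs, rewards = [], []
    s, done = env.reset(), False
    while not done:
        probs = policy(torch.as_tensor(s, dtype=torch.float32))
        dist = torch.distributions.Categorical(probs)
        a = dist.sample()                              # sample a ~ pi_theta(.|s)
        log_probs.append(dist.log_prob(a))
        s, r, done, _ = env.step(a.item())
        rewards.append(r)

    # discounted returns G_t, computed backwards through the episode
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.as_tensor(returns)

    # loss whose gradient is -sum_t G_t * grad log pi_theta(a_t|s_t)
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()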

REINFORCE w/ baseline

To lower the variance we can subtract a baseline

$$\Delta\theta \varpropto (G_t - b(s_t)) \frac{\nabla_\theta\pi_\theta (a_t|s_t)}{\pi_\theta(a_t|s_t)}$$

which still keeps the gradient unbiased.

 

A common choice is to take the state value as baseline:

$$V(s) = \mathbb{E}_{\pi_\theta}[G_t | s_t = s]$$
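The baseline does not bias the gradient because, for any function of the state alone, the extra term averages to zero under the policy:

$$\mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\big[b(s)\,\nabla_\theta \log \pi_\theta(a|s)\big] = b(s) \sum_a \pi_\theta(a|s)\,\frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)} = b(s)\,\nabla_\theta \sum_a \pi_\theta(a|s) = b(s)\,\nabla_\theta 1 = 0.$$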

Actor Critic Methods

Notice that we can decompose the expectation as

$$\begin{aligned} \nabla_\theta J(\theta) &= \mathbb{E}_{\tau | \pi_\theta} \left[r(\tau) \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \right] \\ & = \sum_{t=0}^T \mathbb{E}_{s_0, \dots s_t} \left[\nabla_\theta \log \pi_\theta(a_t|s_t) \right] \mathbb{E}_{s_{t+1}, \dots s_{end}} [G_t] \\ & = \sum_{t=0}^T \mathbb{E}_{s_0, \dots s_t} \left[\nabla_\theta \log \pi_\theta(a_t|s_t) \right] Q_w(s_t,a_t), \end{aligned}$$

where in the last line the expected future return \(\mathbb{E}_{s_{t+1}, \dots s_{end}} [G_t]\) is replaced by a learned estimate \(Q_w(s_t,a_t)\): a Q-value.

  1. The “Critic” \(Q_w(s,a)\) estimates the value function.
  2. The “Actor” updates the policy distribution in the direction suggested by the Critic.

and both the Critic and Actor functions are parameterized with neural networks, as in the following PyTorch example:


import torch.nn as nn
import torch.nn.functional as F


class ActorCritic(nn.Module):
    def __init__(self, num_inputs, num_actions):
        super(ActorCritic, self).__init__()
        # convolutional encoder: four stride-2 convolutions
        # (the 32 * 6 * 6 flattened size assumes 84x84 input frames)
        self.conv1 = nn.Conv2d(num_inputs, 32, 3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(32, 32, 3, stride=2, padding=1)
        self.conv3 = nn.Conv2d(32, 32, 3, stride=2, padding=1)
        self.conv4 = nn.Conv2d(32, 32, 3, stride=2, padding=1)
        # recurrent core, so the agent can integrate information over time
        self.lstm = nn.LSTMCell(32 * 6 * 6, 512)
        # two heads on the shared representation:
        # critic -> a single value estimate, actor -> one logit per action
        self.critic_linear = nn.Linear(512, 1)
        self.actor_linear = nn.Linear(512, num_actions)
        self._initialize_weights()

    def _initialize_weights(self):
        for module in self.modules():
            if isinstance(module, nn.Conv2d) or isinstance(module, nn.Linear):
                nn.init.xavier_uniform_(module.weight)
                # nn.init.kaiming_uniform_(module.weight)  # alternative initialization
                nn.init.constant_(module.bias, 0)
            elif isinstance(module, nn.LSTMCell):
                nn.init.constant_(module.bias_ih, 0)
                nn.init.constant_(module.bias_hh, 0)

    def forward(self, x, hx, cx):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = F.relu(self.conv4(x))
        # flatten the feature maps and update the LSTM state
        hx, cx = self.lstm(x.view(x.size(0), -1), (hx, cx))
        # return actor logits, critic value, and the new recurrent state
        return self.actor_linear(hx), self.critic_linear(hx), hx, cx
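A sketch of how the two heads could be combined in a loss, in the spirit of an advantage actor-critic update: here the critic's single output is used as a state-value baseline (rather than the Q-value of the previous slide), and `G` stands for an observed or bootstrapped return. The function name and variables are illustrative, not part of the slides.

import torch

def actor_critic_losses(log_prob, value, G):
    advantage = G - value                          # how much better than the critic expected
    actor_loss = -log_prob * advantage.detach()    # policy-gradient term, critic held fixed
    critic_loss = 0.5 * advantage.pow(2)           # regress the critic towards the return
    return actor_loss, critic_loss

# possible usage with the ActorCritic network above (hx, cx are the LSTM state):
# logits, value, hx, cx = model(frame, hx, cx)
# dist = torch.distributions.Categorical(logits=logits)
# action = dist.sample()
# ... play the action in the environment, accumulate the return G, then:
# actor_loss, critic_loss = actor_critic_losses(dist.log_prob(action), value.squeeze(), G)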

Some References:
 

 - Artificial Neural Networks Course @ EPFL - Wulfram Gerstner

 - Reinforcement Learning: An Introduction (second edition), Richard S. Sutton and Andrew G. Barto, The MIT Press, Cambridge, MA.

 - towardsdatascience.com/policy-gradients-in-a-nutshell-8b72f9743c5d

Reward Shaping

To address the problem of sparse rewards, one can manually design a denser reward function that guides the agent towards the desired behaviour.
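A toy illustration of the idea, assuming a gym-style environment: a wrapper that adds a small hand-designed bonus to the environment's sparse reward. The class, the `bonus_fn` hook, and the scale are purely illustrative.

class ShapedReward:
    # wraps an environment and adds a hand-crafted bonus to its sparse reward
    def __init__(self, env, bonus_fn, scale=0.01):
        self.env, self.bonus_fn, self.scale = env, bonus_fn, scale

    def reset(self):
        return self.env.reset()

    def step(self, action):
        state, reward, done, info = self.env.step(action)
        # shaped reward = sparse game reward + small dense bonus, e.g. a term
        # that grows when the paddle tracks the ball in Pong
        return state, reward + self.scale * self.bonus_fn(state), done, info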
