Deep Reinforcement Learning Workshop:
Pong from Pixels 🏓

Overview 👩🏫
- Intro to RL
- Intuition (and some math) behind Policy Gradients
- Do it yourself!
Intro to Reinforcement Learning
What is Reinforcement Learning? (Informally)
Reinforcement learning tackles complex problems by interacting with them repeatedly. You let your Agent make decisions within the Environment, observe the results, and adjust the Agent based on a Reward. Then repeat!

What is Reinforcement Learning? (Formally)
Reinforcement learning is a mathematical framework to describe behavior in an environment.
Central ideas
- Discrete time steps t
- The environment is fully described by a state s_t
- An agent follows a (stochastic) policy π(a_t | s_t) to decide which action a_t to take at each time-step, which affects the next state s_{t+1}
- Each time-step is associated with a reward r(s_t), which is what we want to maximize (at most individual time-steps it is zero). The whole loop is sketched in code below
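As a rough sketch of this interaction loop in code, here is a minimal example using the classic OpenAI Gym API (assuming the Atari extras are installed and the pre-0.26 Gym API; a random action stands in for the policy, and the workshop repo may structure this differently):

```python
import gym

env = gym.make("Pong-v0")       # the Environment
state = env.reset()             # initial state s_0
done, total_reward = False, 0.0

while not done:
    # A real agent would sample a_t from its policy π(a_t | s_t) here;
    # we use a random action as a placeholder.
    action = env.action_space.sample()
    # The Environment returns the next state s_{t+1} and the reward r(s_t).
    state, reward, done, info = env.step(action)
    total_reward += reward
```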
Describing Pong with RL
In this workshop, we will be teaching an agent to control the green paddle and win against a ball-tracking CPU.
In our case:
- Every 3 frames of gameplay is one time-step t
- s_t = frame_t − frame_{t−1} (the difference of two consecutive game images)
- Our policy π(a_t | s_t) is defined by a neural network. At each time-step, a_t is either UP or DOWN (see the sketch after this list)
- We receive +1 reward on the last time-step of a game we win, -1 on the last time-step of a game we lose, and 0 reward at all other time-steps
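A minimal sketch of what the state and the policy network might look like, following the common Karpathy-style Pong preprocessing. The crop values, network sizes, and function names here are illustrative assumptions and may differ from the workshop repo:

```python
import numpy as np

def preprocess(frame):
    """Turn a raw 210x160x3 Pong frame into a flat 80*80 binary vector."""
    frame = frame[35:195]                     # crop away the scoreboard
    frame = frame[::2, ::2, 0]                # downsample by 2, keep one channel
    frame = (frame != 144) & (frame != 109)   # erase the background colors
    return frame.astype(np.float32).ravel()

def make_state(frame, prev_frame):
    """s_t = frame_t - frame_{t-1}: the difference image captures motion."""
    return preprocess(frame) - preprocess(prev_frame)

def policy(state, W1, W2):
    """Tiny two-layer network giving P(a_t = UP | s_t)."""
    hidden = np.maximum(0.0, W1 @ state)      # ReLU hidden layer
    logit = W2 @ hidden                       # single output logit
    return 1.0 / (1.0 + np.exp(-logit))       # sigmoid -> probability of UP
```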
Deep RL in Context
RL has been around for half a century.
Previously, policies were learned with simple models. Over the past 10 years, deep learning has become popular and enabled an important set of breakthroughs in RL.
Besides games, emerging application areas include robotics and optimization.

DeepMind's AlphaGo beats Lee Sedol, one of the world's strongest Go players.
Policy Gradients
The Credit Assignment Problem
A major challenge of RL: the credit assignment problem
- Our network makes a decision at every time-step of the game, but may receive feedback only rarely, perhaps just once per several hundred decisions
- How do we know which actions were good and which were bad?
- This is important because machine learning depends on using positive/negative feedback to teach a model
Policy Gradients
The specific algorithm we'll be using to train our agent is called policy gradients.
It proposes one solution to the credit assignment problem.
General idea:
- Run a game and keep track of all actions a_1, a_2, ..., a_T
- If we win the game (+1 reward), encourage all of a_1, ..., a_T; if we lose (-1 reward), discourage all of them
- If we run enough games, good actions will eventually accumulate net positive feedback and bad actions net negative feedback (see the rollout sketch after this list)
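Here is a hedged sketch of that bookkeeping for one game. It reuses the hypothetical `make_state` and `policy` helpers from the earlier sketch and a classic-API Gym environment; the action ids and structure are assumptions, not the repo's actual code:

```python
import numpy as np

UP, DOWN = 2, 3   # ALE action ids that move the Pong paddle up/down (assumed; the repo may map these differently)

def play_one_game(env, policy_params):
    """Roll out a single game, remembering each action's probability and the rewards."""
    action_probs, rewards = [], []
    prev_frame = env.reset()
    frame, reward, done, _ = env.step(DOWN)            # take one step so we have two frames
    while not done:
        state = make_state(frame, prev_frame)          # s_t = frame_t - frame_{t-1}
        p_up = policy(state, *policy_params)           # P(UP | s_t)
        action = UP if np.random.rand() < p_up else DOWN
        action_probs.append(p_up if action == UP else 1.0 - p_up)
        prev_frame = frame
        frame, reward, done, _ = env.step(action)
        rewards.append(reward)                         # mostly 0, with the win/loss reward at the end
    return action_probs, rewards

# If the final reward is +1 (a win), every a_1, ..., a_T gets a positive label;
# if it is -1 (a loss), every action gets a negative label.
```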
Policy Gradient Theorem
Our loss function to minimize will be −E[R(t)], the negative expected value (average) of the reward under our neural network's policy.
We use gradient descent to find model parameters that minimize the loss. To do this, we must calculate ∇_θ(−E[R(t)]), the gradient of the loss with respect to the model parameters θ.
We'll skip the derivation, but the policy gradient theorem states that:
∇_θ(−E[R(t)]) = −E[R(t) · ∇_θ log p(a_t)]
where p(a_t) is the probability the policy assigned to the action that was actually chosen.
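For a single time-step, the term inside that expectation can be computed directly. Below is a minimal sketch for the sigmoid UP/DOWN policy from the earlier sketch; the repo may instead rely on a framework's built-in cross-entropy loss, which amounts to the same computation:

```python
import numpy as np

def pg_loss_and_grad(p_up, took_up, R):
    """Per-time-step policy-gradient loss and its gradient w.r.t. the policy logit.

    p_up:    P(UP | s_t) output by the sigmoid policy
    took_up: 1.0 if the sampled action a_t was UP, else 0.0
    R:       the reward (later: discounted reward) credited to time-step t
    """
    p_action = p_up if took_up else 1.0 - p_up      # p(a_t), the chosen action's probability
    loss = -R * np.log(p_action)                    # -R(t) * log p(a_t)
    # For a sigmoid output, d(loss)/d(logit) simplifies to -R * (took_up - p_up),
    # which can then be backpropagated through the rest of the network.
    dlogit = -R * (took_up - p_up)
    return loss, dlogit
```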
Discounted Reward
One last important detail:
In reality, we don't use the raw +1 or -1 rewards. We use the discounted reward
R(t) = Σ_{k=0}^{T−t} γ^k · r(s_{t+k})
where γ is the discount factor, a number between 0 and 1
Intuitively, the action taken at time-step t is credited with the rewards from future time-steps, but exponentially less so the further in the future they occur.
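A common way to compute this is a single backward pass over the per-time-step rewards. This is a sketch under the formula above (γ = 0.99 is just a typical value, and the repo's version may also normalize the result):

```python
import numpy as np

def discount_rewards(rewards, gamma=0.99):
    """R(t) = sum_{k=0}^{T-t} gamma^k * r(s_{t+k}) for every time-step t."""
    discounted = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        discounted[t] = running
    return discounted

# Example: a lost game lasting 5 time-steps (reward only on the last step)
print(discount_rewards([0, 0, 0, 0, -1]))
# -> [-0.9606, -0.9703, -0.9801, -0.99, -1.0] (approximately)
```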
Do It Yourself!
Obtain the Materials 💸
1. Star 🌟 the repository 🥰
2. Clone and `cd` into the repo:
$ git clone https://github.com/stewy33/pong-with-policy-gradients.git
$ cd pong-with-policy-gradients
3. Follow the instructions in the README! Depending on your platform, there may be installation errors - just let us know.
Notes and Tips
In the interest of time, we had to move quickly through the math, so let me know if you get stuck and I'll come to your breakout room.
Use TensorBoard judiciously. If performance doesn't improve, there is likely a bug in your code; if it's not obvious what the bug is, call me in.


Notes and Tips (cont.)
The tests are a guide, but they only cover the discounted-reward calculation and the policy network, so they don't guarantee correctness.
Be very careful and double check everything!
http://karpathy.github.io/2019/04/25/recipe/
But most importantly, have fun! 🤠
Deep RL Workshop: Pong from Pixels
By Stewy Slocum