Introducion to Reinforcement Learning

(aka how to make AI play Atari games)

by Cheuk Ting Ho (@cheukting_ho)

Why we like games?

Environment is simple

Actions are limited

Reward is quantified

that is, easy to solve

but Reinforcement Learning is not limited to games

The Journey:

Define the problem

Environment, agent, rewards
Markov Decision Process

Black-box method

(Deep) Cross-entropy method
Evolution Strategies

Open the box

Finding optimal policy using Bellman Equations
Model-free model
Exploration vs Exploitation
Cliff World - Q learning vs SARSA
Experience reply
Approx. Q learning and DQN

Agent: cart (Action: left, right)

Environment: the mountain

State: Location of the cart (x, y)

Reward: reaching the flag (+10)

Policy: series of actions

Define the problem

Markov Decision Process

outcomes are partly under the control of a decision maker (choosing an action) partly random (probability to a state)

Cross-entropy method

Tabular

- a table to keep track of the policy

- reward corresponding to the state and action pair

- update policy according to elite state and actions

Deep learning

- approximate with neural net

- when the table becomes too big

*caution: randomness in environment

Cross-entropy method

Sample rewards
Check the rewards distribution
Pick the elite policies (reward > certain percentile)
Update policy with only the elite policies

Deep learning

- Agent pick actions with prediction from a MLP classifier on the current state

Evolution Strategies

vs Reinforcement Learning

Black-box: don't care if there's an agent or environment
Guess and check: optimising rewards by tweaking parameters
No backprop: ES injects noise directly in the parameter space
(RL injects noise in the action space and uses backprop to compute the parameter updates)

Evolution Strategies

vs Reinforcement Learning

Open the black-box

knowledge of environment

knowledge of intermediate rewards

Reward design

Explaining goals to agent through rewards

Reward for WHAT never for HOW

Reward discounting to avoid inf. reward and +ve feeback loop

Finding optimal Policy

Tools

Use dynamic programming (Bellman equations)
Policy evaluation (based on Bellman expectation eq.)
Policy improvement (based on Bellman optimality eq.)

Steps

Evaluate given policy (Policy or Value iteration)
Policy iteration evaluate policy until convergence
Value iteration evaluate policy only with single iteration
Improve policy by acting greedily w.r.t. to its value function

Model-free Policy

Bellman equations depends on P(s',r|s,a)

What if we don't know P(s',r|s,a)?

Introduction Qπ(s,a) which is the expected gain at a state and action following policy π

Learning from trajectories

which is a sequence of
– states (s)
– actions (a)
– rewards (r)

Model-free Policy

Model-based: you know P(s'|s,a)
- can apply dynamic programming
- can plan ahead

Model-free: you can sample trajectories
- can try stuff out
- insurance not included

Model-free Policy

Finding expectation by:

1: Monte-Carlo

Averages Q over sampled paths
Needs full trajectory to learn
Less reliant on markov property

2: temporal difference

Uses recurrent formula for Q
Learns from partial trajectory
Works with infinite MDP
Needs less experience to learn

Exploration vs Exploitation

Don't want agent to stuck with current best action

Balance between using what you learned and trying to find
something even better

Exploration vs Exploitation

ε-greedy
With probability ε take random action;
otherwise, take optimal action

Softmax
Pick action proportional to softmax of shifted
normalized Q-values

Cliff world

(not Doom)

Q-learning will learn to follow the shortest path from the "optimal" policy

Reality: robot will fall due to
epsilon-greedy “exploration"

Introducing SARSA

Cliff world

(not Doom)

Difference:

SARSA gets optimal rewards under current policy

where
Q-learning assume policy would be optimal

Cliff world

(not Doom)

on-policy (e.g. SARSA)

Agent can pick actions
Agent always follows his own policy

off-policy (e.g. Q-learning)

Agent can't pick actions
Learning with exploration, playing without exploration
Learning from expert (expert is imperfect)
Learning from sessions (recorded data)

Experience replay

Store several past interactions in buffer
Train on random subsamples
Don't need to re-visit same (s,a) many times to learn it
Only works with off-policy algorithms

Approx. Q learning

State space is usually large,
sometimes continuous.
And so is action space;

Approximate agent with a function

Learn Q value using neural network

However, states do have a structure,

similar states have similar action outcomes.

DQN

Paper published by Google Deep Mind

to play Atari Breakout in 2015

https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf

Stacked 4 flames together and use a CNN as an agent (see the screen then take action)

DQN

Thank you!

Slides: https://slides.com/cheukting_ho/intro-rl

Course: https://github.com/yandexdataschool/Practical_RL