by Cheuk Ting Ho (@cheukting_ho)

Define the problem

- Environment, agent, rewards
- Markov Decision Process

Black-box method

- (Deep) Cross-entropy method
- Evolution Strategies

Open the box

- Finding optimal policy using Bellman Equations
- Model-free learning
- Exploration vs Exploitation
- Cliff World - Q learning vs SARSA
- Experience replay
- Approx. Q learning and DQN

Agent: cart (**Action**: left, right)

Environment: the mountain

**State**: Location of the cart (x, y)

**Reward**: reaching the flag (+10)

**Policy**: series of actions

Outcomes are partly under the control of a decision maker (choosing an action) and partly random (probability of transitioning to a state)

**Tabular**

- a table to keep track of the policy

- reward corresponding to the state and action pair

- update policy according to elite state and actions

**Deep learning**

- approximate with neural net

- when the table becomes too big

**Caution**: randomness in the environment

- Sample rewards
- Check the rewards distribution
- Pick the elite policies (reward > certain percentile)
- Update policy with only the elite policies
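The steps above can be sketched for the tabular case as follows. This is a minimal sketch, not the talk's implementation: the function names `select_elites` and `update_policy`, the session format `(states, actions, total_reward)`, the nearest-rank percentile and the use of `>=` for the threshold are all assumptions made for illustration.

```python
def select_elites(sessions, percentile=70):
    """Pick (state, action) pairs from sessions whose reward reaches the percentile.

    sessions: list of (states, actions, total_reward) tuples (assumed format).
    """
    rewards = sorted(r for _, _, r in sessions)
    # Nearest-rank percentile threshold (an assumption; other definitions exist).
    idx = min(len(rewards) - 1, int(len(rewards) * percentile / 100))
    threshold = rewards[idx]
    elites = []
    for states, actions, reward in sessions:
        if reward >= threshold:
            elites.extend(zip(states, actions))
    return elites


def update_policy(elites, n_states, n_actions):
    """New table: action frequencies among elite pairs, uniform for unseen states."""
    counts = [[0] * n_actions for _ in range(n_states)]
    for s, a in elites:
        counts[s][a] += 1
    policy = []
    for row in counts:
        total = sum(row)
        if total:
            policy.append([c / total for c in row])
        else:
            policy.append([1 / n_actions] * n_actions)
    return policy
```

In practice the new table is usually blended with the old one (smoothing) so that one lucky session does not overwrite everything; that step is left out here.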

**Deep learning**

- Agent picks actions using predictions from an MLP classifier on the current state

- Black-box: don't care if there's an agent or environment
- Guess and check: optimising rewards by tweaking parameters
- No backprop: ES injects noise directly in the parameter space

(RL injects noise in the action space and uses backprop to compute the parameter updates)
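A rough sketch of the "guess and check" loop, assuming a reward function that scores a whole parameter vector. The name `evolution_strategy`, the hyperparameters and the reward standardisation are illustrative assumptions (standardising rewards is a common variant, not something stated in the talk).

```python
import random


def evolution_strategy(reward_fn, theta, sigma=0.1, alpha=0.005,
                       population=50, iterations=200, seed=0):
    """Maximise reward_fn(theta) by perturbing parameters, with no backprop."""
    rng = random.Random(seed)
    theta = list(theta)  # work on a copy
    n = len(theta)
    for _ in range(iterations):
        # Inject Gaussian noise directly in parameter space.
        noises = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(population)]
        rewards = [reward_fn([t + sigma * e for t, e in zip(theta, eps)])
                   for eps in noises]
        # Standardise rewards so the step size does not depend on reward scale.
        mean = sum(rewards) / population
        std = (sum((r - mean) ** 2 for r in rewards) / population) ** 0.5 or 1.0
        advantages = [(r - mean) / std for r in rewards]
        # Move theta toward the perturbations that scored above average.
        for i in range(n):
            grad = sum(a * eps[i] for a, eps in zip(advantages, noises))
            theta[i] += alpha * grad / (population * sigma)
    return theta
```

The only thing the loop needs from the problem is a number per parameter vector, which is why it does not care whether there is an agent or an environment behind `reward_fn`.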

knowledge of intermediate rewards

Tools

- Use dynamic programming (Bellman equations)
- Policy evaluation (based on Bellman expectation eq.)
- Policy improvement (based on Bellman optimality eq.)

Steps

- Evaluate the given policy (policy or value iteration)
- Policy iteration: evaluate the policy until convergence
- Value iteration: evaluate the policy with only a single iteration
- Improve the policy by acting greedily w.r.t. its value function
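A minimal value iteration sketch on a toy problem, assuming the model is known. The dictionary layout `P[s][a] = [(prob, next_state, reward), ...]`, the three-state `chain` MDP and the function name are hypothetical choices for illustration; the +10 reward mirrors the flag example.

```python
def value_iteration(P, gamma=0.9, tol=1e-8):
    """P[s][a] is a list of (prob, next_state, reward); terminal states have no actions."""
    V = {s: 0.0 for s in P}

    def q(s, a):
        # Expected one-step reward plus discounted value of the next state.
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    while True:
        delta = 0.0
        for s in P:
            if not P[s]:
                continue  # terminal state keeps value 0
            best = max(q(s, a) for a in P[s])  # Bellman optimality backup
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Improve the policy by acting greedily w.r.t. the value function.
    policy = {s: max(P[s], key=lambda a: q(s, a)) for s in P if P[s]}
    return V, policy


# Toy chain: 0 <-> 1 -> 2 (flag). Reaching state 2 gives +10.
chain = {
    0: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 1, 0.0)]},
    1: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 2, 10.0)]},
    2: {},
}
V, policy = value_iteration(chain)
```

Here evaluation and improvement are fused into a single `max` per sweep; policy iteration would instead evaluate a fixed policy to convergence before each improvement step.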

Bellman equations depend on P(s',r|s,a)

What if we don't know P(s',r|s,a)?

Introducing Qπ(s,a), the expected gain (return) from taking action a in state s and then following policy π

Learning from trajectories

which is a sequence of

– states (s)

– actions (a)

– rewards (r)

**Model-based**: you know P(s'|s,a)

- can apply dynamic programming

- can plan ahead

**Model-free**: you can sample trajectories

- can try stuff out

- insurance not included

Finding expectation by:

**1: Monte-Carlo**

- Averages Q over sampled paths
- Needs full trajectory to learn
- Less reliant on the Markov property

**2: temporal difference**

- Uses recurrent formula for Q
- Learns from partial trajectory
- Works with infinite MDP
- Needs less experience to learn
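The two kinds of learning target can be put side by side in a short sketch. The function names and the discount values are assumptions for illustration.

```python
def discounted_returns(rewards, gamma=0.99):
    """Monte-Carlo targets: G_t = r_t + gamma * G_{t+1}. Needs the full episode."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]


def td_target(r, q_next, gamma=0.99, done=False):
    """One-step TD target: needs only the reward and the next estimate."""
    return r if done else r + gamma * q_next
```

The Monte-Carlo version walks the episode backwards, so it cannot run until the episode ends; the TD version can be applied after every single step, which is why it also works when episodes never terminate.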

Don't want the agent to get stuck with the current best action

Balance between using what you have learned and trying to find something even better

**ε-greedy**

With probability ε take a random action; otherwise, take the best action according to the current Q-values

**Softmax**

Pick an action with probability given by the softmax of shifted, normalized Q-values
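Both exploration rules fit in a few lines. This is a sketch under assumptions: the function names are made up, the Q-values for one state are a plain list, "shifted" is read as subtracting the maximum, and "normalized" as dividing by a temperature.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon explore at random, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def softmax_action(q_values, temperature=1.0, rng=random):
    """Sample an action with probability softmax(Q / temperature)."""
    m = max(q_values)  # shift by the max for numerical stability
    weights = [math.exp((q - m) / temperature) for q in q_values]
    return rng.choices(range(len(q_values)), weights=weights)[0]
```

A higher temperature flattens the distribution (more exploration); a temperature close to zero approaches the greedy choice.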

(not Doom)

Q-learning will learn to follow the shortest path, as given by the "optimal" policy

Reality: the robot will fall due to epsilon-greedy "exploration"

Introducing SARSA


Difference:

**SARSA** gets optimal rewards under the current policy, whereas **Q-learning** assumes the policy would be optimal
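The difference shows up in a single line of the update rule. A minimal sketch, assuming `Q` is a table indexed as `Q[state][action]`; the function names and the learning-rate and discount defaults are illustrative, and terminal-state handling is omitted.

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """Off-policy: bootstrap from the best next action, whatever is actually taken."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """On-policy: bootstrap from the next action the agent really picked."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
```

Because SARSA's target includes the exploratory actions the agent will actually take, its values near the cliff edge account for the chance of a random step off it.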


**on-policy (e.g. SARSA)**

- Agent can pick actions
- Agent always follows its own policy

**off-policy (e.g. Q-learning)**

- Agent can't pick actions
- Learning with exploration, playing without exploration
- Learning from expert (expert is imperfect)
- Learning from sessions (recorded data)

- Store several past interactions in buffer
- Train on random subsamples
- Don't need to re-visit same (s,a) many times to learn it
- Only works with off-policy algorithms
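A replay buffer along these lines can be sketched with the standard library. The class name `ReplayBuffer`, the transition layout `(s, a, r, s_next, done)` and the uniform sampling are assumptions for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest items drop out automatically
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Random subsample; smaller than batch_size if the buffer is not full yet.
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)
```

Sampled transitions were generated by older versions of the policy, which is why the update applied to them has to be an off-policy one such as Q-learning.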

State space is usually large, sometimes continuous, and so is the action space.

Approximate the agent with a function

Learn Q-values using a neural network

However, states do have a structure: similar states have similar action outcomes.

Paper published by Google DeepMind in 2015 to play Atari Breakout

https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf

Stacks 4 frames together and uses a CNN as the agent (sees the screen, then takes an action)
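The frame-stacking idea can be sketched on its own, without the network. `FrameStack` is a hypothetical helper, and frames are treated as opaque objects here; a real pipeline would also preprocess the images and convert the stack to an array before feeding the CNN.

```python
from collections import deque


class FrameStack:
    """Keep the last n frames so a single observation carries motion information."""

    def __init__(self, n_frames=4):
        self.frames = deque(maxlen=n_frames)

    def reset(self, frame):
        # At episode start, fill the stack with copies of the first frame.
        for _ in range(self.frames.maxlen):
            self.frames.append(frame)
        return list(self.frames)

    def step(self, frame):
        # Push the newest frame; the oldest one falls off the other end.
        self.frames.append(frame)
        return list(self.frames)
```

A single still frame does not show which way the ball is moving; a short stack of consecutive frames does, which keeps the observation closer to a Markov state.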

Slides: https://slides.com/cheukting_ho/intro-rl

Course: https://github.com/yandexdataschool/Practical_RL