Cheuk Ting Ho

@cheukting_ho

Cheukting

**Part 1**

- What is Reinforcement Learning
- 101 of Reinforcement Learning
- Cross-entropy Method
- Exercise - Cross-entropy Method
- Exercise - Deep Cross-entropy Method

**Part 2**

- Model-free Learning
- Cliff World: Q-learning vs SARSA
- Exercise - Cliff World

**Part 3**

- Experience Replay
- Approximate Q-learning and Deep Q-Network
- Exercise - DQN

Agent: cart (**Action**: left, right)

Environment: the mountain

**State**: Location of the cart (x, y)

**Reward**: reaching the flag (+10)

**Policy**: series of actions

Markov decision process: outcomes are partly under the control of a decision maker (choosing an action) and partly random (the probability of moving to each state)

**Tabular**

- a table to keep track of the policy

- reward corresponding to the state and action pair

- update policy according to elite state and actions

**Deep learning**

- approximate with neural net

- when the table becomes too big

**Caution**: randomness in the environment

- Sample rewards
- Check the rewards distribution
- Pick the elite policies (reward > certain percentile)
- Update policy with only the elite policies
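
The four steps above can be sketched for the tabular case as follows. This is a minimal illustration, not the exercise solution: the function names, the session format `(states, actions, total_reward)`, the nearest-rank percentile threshold, the `>=` comparison, and the smoothing `learning_rate` are all assumptions made for the example.

```python
from collections import defaultdict


def select_elites(sessions, percentile=50):
    """Return (state, action) pairs from sessions whose total reward
    is at or above the given reward percentile.

    sessions: list of (states, actions, total_reward) tuples.
    """
    rewards = sorted(total for _, _, total in sessions)
    # Nearest-rank style threshold (a simplifying assumption).
    index = min(len(rewards) - 1, int(len(rewards) * percentile / 100))
    threshold = rewards[index]
    elite_pairs = []
    for states, actions, total in sessions:
        if total >= threshold:
            elite_pairs.extend(zip(states, actions))
    return elite_pairs


def update_policy(policy, elite_pairs, n_actions, learning_rate=0.5):
    """Move policy[state] toward the action frequencies seen in elite sessions."""
    counts = defaultdict(lambda: [0] * n_actions)
    for state, action in elite_pairs:
        counts[state][action] += 1
    for state, action_counts in counts.items():
        total = sum(action_counts)
        for action in range(n_actions):
            elite_prob = action_counts[action] / total
            policy[state][action] = (
                (1 - learning_rate) * policy[state][action]
                + learning_rate * elite_prob
            )
    return policy
```

States that never appear in an elite session keep their previous action probabilities, which is one simple way to stay robust to the randomness mentioned in the caution.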

**Deep learning**

- The agent picks actions using predictions from an MLP classifier on the current state

The Bellman equations depend on P(s',r|s,a)

What if we don't know P(s',r|s,a)?

Introducing Qπ(s,a), the expected return from taking action a in state s and then following policy π
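
For reference, the standard textbook form of this quantity can be written as below. The discount factor γ, the time index t, and the notation π(a'|s') for the policy's action probabilities are assumptions introduced here; they do not appear on the slides. The second line is the Bellman expectation equation, which shows where P(s',r|s,a) enters.

```latex
Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a \right]

Q^{\pi}(s,a) = \sum_{s',r} P(s',r \mid s,a) \left[ r + \gamma \sum_{a'} \pi(a' \mid s')\, Q^{\pi}(s',a') \right]
```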

Learning from trajectories,

each of which is a sequence of

– states (s)

– actions (a)

– rewards (r)

**Model-based**: you know P(s'|s,a)

- can apply dynamic programming

- can plan ahead
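
As an illustration of the dynamic programming that a known model allows, here is a minimal value iteration sketch. The `transitions[s][a]` format (a list of `(probability, next_state, reward)` tuples), the discount `gamma`, and the fixed iteration count are assumptions chosen for the example.

```python
def value_iteration(transitions, n_states, n_actions, gamma=0.9, n_iters=100):
    """Compute state values when P(s'|s,a) and the rewards are known.

    transitions[s][a] is a list of (probability, next_state, reward).
    """
    values = [0.0] * n_states
    for _ in range(n_iters):
        # Bellman optimality backup applied to every state.
        values = [
            max(
                sum(p * (r + gamma * values[s2]) for p, s2, r in transitions[s][a])
                for a in range(n_actions)
            )
            for s in range(n_states)
        ]
    return values
```

A model-free agent cannot run this loop, because it has no `transitions` table to sum over; it can only sample.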

**Model-free**: you can sample trajectories

- can try stuff out

- insurance not included

Finding expectation by:

**1: Monte-Carlo**

- Averages Q over sampled paths
- Needs full trajectory to learn
- Less reliant on the Markov property

**2: temporal difference**

- Uses a recursive formula for Q
- Learns from partial trajectory
- Works with infinite MDP
- Needs less experience to learn
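
The contrast between the two estimators can be sketched as follows. This is a simplified illustration: Q is stored in a plain dict keyed by `(state, action)`, the Monte-Carlo version uses a constant step size `alpha` instead of an exact running average, and `gamma` and `alpha` are assumed parameters.

```python
def monte_carlo_update(q, trajectory, gamma=0.99, alpha=0.1):
    """Update Q from a complete episode.

    trajectory: list of (state, action, reward) for the full episode.
    """
    episode_return = 0.0
    # Walk backwards so the return from each step is available.
    for state, action, reward in reversed(trajectory):
        episode_return = reward + gamma * episode_return
        old = q.get((state, action), 0.0)
        q[(state, action)] = old + alpha * (episode_return - old)
    return q


def td_update(q, state, action, reward, next_state, next_action,
              gamma=0.99, alpha=0.1):
    """Update Q from a single transition, bootstrapping on the current Q."""
    old = q.get((state, action), 0.0)
    target = reward + gamma * q.get((next_state, next_action), 0.0)
    q[(state, action)] = old + alpha * (target - old)
    return q
```

`monte_carlo_update` needs the whole trajectory before it can run, while `td_update` can be called after every step, which is why it also works when episodes never end.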

Don't want the agent to get stuck with the current best action

Balance between using what you have learned and trying to find something even better

**ε-greedy**

With probability ε take a random action;

otherwise, take the best action found so far
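
A minimal sketch of ε-greedy action selection, assuming the Q-values for the current state are given as a list indexed by action (the function name and the `rng` parameter are illustrative):

```python
import random


def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the largest Q-value (first one on ties).
    return max(range(len(q_values)), key=lambda action: q_values[action])
```

With `epsilon=0.0` the agent never explores; with `epsilon=1.0` it acts uniformly at random.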

**Softmax**

Pick an action with probability proportional to the softmax of shifted, normalized Q-values
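
One way to sketch softmax exploration is shown below. Shifting by the maximum Q-value is done for numerical stability, and the `temperature` parameter is an assumption added here to control how greedy the choice is; neither detail is specified on the slide.

```python
import math
import random


def softmax_probs(q_values, temperature=1.0):
    """Turn Q-values into action probabilities."""
    shift = max(q_values)  # shifting does not change the probabilities
    exps = [math.exp((q - shift) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_action(q_values, temperature=1.0, rng=random):
    """Sample an action index according to the softmax probabilities."""
    probs = softmax_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

Unlike ε-greedy, actions with clearly poor Q-values are tried less often than actions that are nearly as good as the best one.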

(not Doom)

Q-learning will learn to follow the shortest path, as given by the "optimal" policy

Reality: the robot will fall due to ε-greedy "exploration"

Introducing SARSA

Difference:

**SARSA** gets optimal rewards under the current policy,

whereas

**Q-learning** assumes the policy will be optimal
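
The difference comes down to the update target. In the sketch below, Q is a dict keyed by `(state, action)`, and `gamma` and `alpha` are assumed hyperparameters; the helper names are illustrative.

```python
def q_learning_target(q, reward, next_state, actions, gamma=0.99):
    """Assume the best action will be taken in the next state."""
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    return reward + gamma * best_next


def sarsa_target(q, reward, next_state, next_action, gamma=0.99):
    """Use the action the current policy actually took in the next state."""
    return reward + gamma * q.get((next_state, next_action), 0.0)


def apply_update(q, state, action, target, alpha=0.1):
    """Move Q(state, action) a step of size alpha toward the target."""
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (target - old)
    return q[(state, action)]
```

Because `sarsa_target` includes the occasional exploratory step, the values it learns near the cliff edge are lower, which pushes the agent toward the safer path.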

**on-policy (e.g. SARSA)**

- Agent can pick actions
- Agent always follows its own policy

**off-policy (e.g. Q-learning)**

- Agent can't pick actions
- Learning with exploration, playing without exploration
- Learning from expert (expert is imperfect)
- Learning from sessions (recorded data)

- Store several past interactions in buffer
- Train on random subsamples
- Don't need to re-visit same (s,a) many times to learn it
- Only works with off-policy algorithms
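
A replay buffer along these lines can be sketched with a bounded deque. The class name, the transition layout `(state, action, reward, next_state, done)`, and the behavior of capping the batch at the buffer size are assumptions for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions; the oldest are dropped first."""

    def __init__(self, capacity):
        self._storage = deque(maxlen=capacity)

    def __len__(self):
        return len(self._storage)

    def add(self, state, action, reward, next_state, done):
        self._storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Return a random subsample of stored transitions."""
        batch_size = min(batch_size, len(self._storage))
        return random.sample(list(self._storage), batch_size)
```

A typical loop would call `add` after every environment step and then train on `sample(batch_size)` rather than on the latest transition alone.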

State space is usually large, sometimes continuous, and so is the action space.

Approximate the agent with a function

Learn Q-values using a neural network

However, states do have a structure: similar states have similar action outcomes.
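
To make the idea concrete without a deep learning library, the sketch below swaps the neural network for a linear model, with one weight vector per action. It is a stand-in for illustration only; the feature vectors, `gamma`, and `lr` are assumptions. The update is the usual semi-gradient Q-learning step.

```python
def q_value(weights, features, action):
    """Linear approximation: Q(s, a) = dot(weights[a], features(s))."""
    return sum(w * x for w, x in zip(weights[action], features))


def approx_q_learning_step(weights, features, action, reward,
                           next_features, done, gamma=0.99, lr=0.01):
    """Adjust the chosen action's weights to reduce the TD error."""
    n_actions = len(weights)
    if done:
        next_best = 0.0
    else:
        next_best = max(q_value(weights, next_features, a)
                        for a in range(n_actions))
    td_error = reward + gamma * next_best - q_value(weights, features, action)
    # Gradient of a linear Q with respect to its weights is the feature vector.
    weights[action] = [w + lr * td_error * x
                       for w, x in zip(weights[action], features)]
    return td_error
```

Because the weights are shared across states, an update for one state also changes the estimate for states with similar features, which is exactly the structure a table cannot exploit.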

Paper published by Google DeepMind in 2015

on playing Atari games such as Breakout

https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf

Stacked 4 frames together and used a CNN as the agent (it sees the screen, then takes an action)
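
The frame-stacking step can be sketched independently of the network. The `FrameStack` class below is a hypothetical helper, not code from the paper; frames can be any objects (for example image arrays), and repeating the first frame on reset is an assumption.

```python
from collections import deque


class FrameStack:
    """Keep the most recent n_frames observations as one stacked state."""

    def __init__(self, n_frames=4):
        self.frames = deque(maxlen=n_frames)

    def reset(self, first_frame):
        """Start an episode by repeating the first frame n_frames times."""
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return list(self.frames)

    def push(self, frame):
        """Add the newest frame, dropping the oldest, and return the stack."""
        self.frames.append(frame)
        return list(self.frames)
```

Stacking several frames lets the network infer motion, such as the direction of the ball, which a single screen image does not show.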