Shubham Dokania
@shubhamdokania
shubham1810
Reinforcement Learning is about learning what to do - how to map situations to actions, so as to maximize a numerical reward signal. The learner (agent) is not told what to do, but instead it must discover which actions yield the most reward via trial-and-error.
Basic Reinforcement Learning Workflow
The Markov Property: $\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]$, i.e. the future is independent of the past, given the present.
A Markov Process (Markov Chain) is a memoryless random process which follows the Markov Property.
A Markov Process is defined by the tuple $\langle \mathcal{S}, \mathcal{P} \rangle$, where $\mathcal{S}$ is a finite set of states and $\mathcal{P}$ is the state transition probability matrix.
The probability of state transition is defined as:
$$\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$$
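As a toy illustration (not part of the slides), a Markov chain can be simulated by repeatedly sampling the next state from the row of the transition matrix for the current state; the states and probabilities below are made up.

```python
import numpy as np

# Hypothetical 3-state Markov chain; each row of P sums to 1.
states = ["sleep", "study", "play"]
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

def sample_chain(start=0, steps=10, seed=0):
    """Sample a trajectory; the next state depends only on the current one."""
    rng = np.random.default_rng(seed)
    s, path = start, [start]
    for _ in range(steps):
        s = rng.choice(len(states), p=P[s])   # Markov property: only P[s, :] matters
        path.append(s)
    return [states[i] for i in path]

print(sample_chain())
```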
A Markov Reward Process is a Markov process with Rewards/Values associated with states.
It's represented by the tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$.
The Reward function is $\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]$,
and $\gamma \in [0, 1]$ is the discount factor.
In an MRP, $G_t$ is defined as the discounted return, given by
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
But what is the need for a discount factor?
- Avoids infinite returns in cyclic or infinite-horizon processes
- Provides control over the trade-off between long-term and short-term rewards
- Mathematically convenient
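A minimal sketch of how the discounted return $G_t$ defined above is computed from a reward sequence (the rewards and $\gamma$ here are arbitrary):

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Hypothetical reward sequence
print(discounted_return([1, 0, 0, 10], gamma=0.9))  # 1 + 0.9**3 * 10 = 8.29
```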
An MDP is an MRP with decisions, represented by the tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$,
where $\mathcal{A}$ is a finite set of actions.
The transition probabilities and Reward function both depend on the actions.
The actions are governed by a policy $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$.
An RL agent can include one or more of the following: a policy, a value function, and a model.
A simple categorization of a few RL methods
State based value learning:
$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$
In general,
$$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$$
State-Action based value learning:
$$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$
In general,
$$q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$$
For state based value learning, the optimal value is $v_*(s) = \max_\pi v_\pi(s)$.
For state-action based learning, $q_*(s, a) = \max_\pi q_\pi(s, a)$.
At the optimal condition, the optimal policy is:
$$\pi_*(a \mid s) = \begin{cases} 1 & \text{if } a = \arg\max_{a' \in \mathcal{A}} q_*(s, a') \\ 0 & \text{otherwise} \end{cases}$$
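As a small illustration (assuming a tabular $q_*$ has already been estimated), acting greedily with respect to the action values recovers the optimal policy:

```python
import numpy as np

# Hypothetical Q-table: 4 states x 2 actions
Q = np.array([[0.1, 0.5],
              [0.7, 0.2],
              [0.0, 0.0],
              [0.3, 0.9]])

greedy_policy = Q.argmax(axis=1)  # pi*(s) = argmax_a q*(s, a)
print(greedy_policy)              # action index chosen in each state
```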
The update rule in SARSA for the state-action value is
$$Q(S, A) \leftarrow Q(S, A) + \alpha \left( R + \gamma Q(S', A') - Q(S, A) \right)$$
which is essentially a generalisation of the TD(0) update for state values,
$$V(S) \leftarrow V(S) + \alpha \left( R + \gamma V(S') - V(S) \right)$$
where $R + \gamma Q(S', A')$ is the TD target and
$$\delta = R + \gamma Q(S', A') - Q(S, A)$$
is the TD error.
SARSA follows the Bellman expectation equation.
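A minimal tabular SARSA sketch; the environment interface (reset() returning a state index, step(a) returning (next_state, reward, done)), the epsilon_greedy helper, and the hyperparameters are illustrative assumptions, not from the slides:

```python
import numpy as np

def epsilon_greedy(Q, s, n_actions, eps, rng):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def sarsa(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, n_actions, eps, rng)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, n_actions, eps, rng)
            # On-policy TD update: the target uses the action actually taken next.
            td_target = r + gamma * Q[s_next, a_next] * (not done)
            Q[s, a] += alpha * (td_target - Q[s, a])
            s, a = s_next, a_next
    return Q
```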
The update in Q-learning is:
$$Q(S, A) \leftarrow Q(S, A) + \alpha \left( R + \gamma \max_{a'} Q(S', a') - Q(S, A) \right)$$
For the behaviour policy, we may use an $\epsilon$-greedy policy.
Implementation of Q-learning in a gridworld-like environment
Code: https://goo.gl/CE8xpC
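The snippet below is not the linked code; it is a minimal Q-learning loop under the same assumed environment interface and epsilon_greedy helper as the SARSA sketch above:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Behaviour policy: eps-greedy (epsilon_greedy as in the SARSA sketch).
            a = epsilon_greedy(Q, s, n_actions, eps, rng)
            s_next, r, done = env.step(a)
            # Off-policy TD update: the target maximises over next actions,
            # regardless of which action the behaviour policy will actually take.
            td_target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
    return Q
```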
Instead of using a discrete (tabular) state value representation, use a function approximator $\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$ or $\hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a)$.
For instance, consider a linear combination:
$$\hat{v}(s, \mathbf{w}) = \mathbf{x}(s)^\top \mathbf{w}$$
where the weights $\mathbf{w}$ can be updated using TD methods and $\mathbf{x}(s)$ is a feature representation of state $s$.
For function approximation, we can choose from many families of models.
We choose models that can be differentiated!
(Linear and Non-linear)
Update the weights by minimising the mean-squared error between the true value and the approximate value,
$$J(\mathbf{w}) = \mathbb{E}_\pi \left[ \left( v_\pi(S) - \hat{v}(S, \mathbf{w}) \right)^2 \right]$$
and use Gradient Descent to update the weights:
$$\Delta \mathbf{w} = \alpha \left( v_\pi(S) - \hat{v}(S, \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{v}(S, \mathbf{w})$$
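A sketch of semi-gradient TD(0) with the linear approximator $\hat{v}(s, \mathbf{w}) = \mathbf{x}(s)^\top \mathbf{w}$; the transition stream and feature function are placeholders, not from the slides:

```python
import numpy as np

def td0_linear(transitions, features, n_features, alpha=0.01, gamma=0.99):
    """Semi-gradient TD(0) for a linear value function v_hat(s, w) = x(s) @ w.

    `transitions` is an iterable of (s, r, s_next, done) tuples and
    `features(s)` returns the feature vector x(s); both are placeholders.
    """
    w = np.zeros(n_features)
    for s, r, s_next, done in transitions:
        x = features(s)
        v = x @ w
        v_next = 0.0 if done else features(s_next) @ w
        td_error = r + gamma * v_next - v   # TD target minus current estimate
        w += alpha * td_error * x           # gradient of v_hat w.r.t. w is x(s)
    return w
```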
Given the previous information, we can use any function approximator for estimating the value of Q, with the condition that the function be differentiable.
In a scenario where a Deep Neural Network is used as the function approximator, it is called a DQN (Deep Q-Network).
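A compressed sketch of the core DQN gradient step in PyTorch; the network size, minibatch format, and hyperparameters are illustrative assumptions, and a full agent would also need a replay buffer, target-network syncing, and an exploration schedule:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Small MLP mapping a state vector to one Q-value per action."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.net(x)

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on a minibatch (s, a, r, s_next, done) of tensors.

    `a` is a LongTensor of action indices; `done` is a float tensor of 0/1 flags.
    """
    s, a, r, s_next, done = batch
    # Q(s, a) for the actions actually taken.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrapped target uses a separate, slowly-updated target network.
        max_q_next = target_net(s_next).max(dim=1).values
        target = r + gamma * max_q_next * (1.0 - done)
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```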