Reinforcement learning

Shubham Dokania




  • Introduction to Reinforcement Learning
  • Markov Decision Process
  • Value Based Learning
    • state value based learning
    • state-action value based learning
    • Bellman equations
  • Temporal Difference Methods
  • Value function approximation
    • Deep Q-learning (DQN)
  • Code examples!

What is RL?

Reinforcement Learning is about learning what to do - how to map situations to actions, so as to maximize a numerical reward signal. The learner (agent) is not told what to do, but instead it must discover which actions yield the most reward via trial-and-error.

Basic Reinforcement Learning Workflow

learning from reward

  • The reward R_t defines the goal in an RL problem.
  • Gives the agent a sense of what is good and bad.
  • A higher reward is better.
  • Usually a function of environment state (situation).

What are states?

  • State s_t \in S is a representation of the current environment situation. S is the set of all states.
  • It's usually a function of the history H_t, where the history is defined as a sequence of observations, actions and rewards.
H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t
s_t = f(H_t)

Information state

  • A state is an information state (or Markov state) if it satisfies the Markov property,

i.e. the future is independent of the past, given the present:

\mathbb{P}[S_{t+1} | S_t, S_{t-1}, S_{t-2},...] = \mathbb{P}[S_{t+1} | S_t]
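The Markov property can be illustrated by sampling a trajectory in which each step looks only at the current state. This is a minimal sketch, assuming a made-up two-state "sunny/rainy" chain whose row `P[s]` gives the distribution of the next state; the names `STATES`, `P` and `sample_chain` are hypothetical.

```python
import random

# Hypothetical 2-state chain; P[s][s2] is the probability of moving from s to s2.
STATES = ["sunny", "rainy"]
P = [[0.9, 0.1],
     [0.5, 0.5]]

def sample_chain(P, start, steps, rng):
    """Sample a state sequence; the next state depends only on the current one."""
    s = start
    trajectory = [s]
    for _ in range(steps):
        s = rng.choices(range(len(P)), weights=P[s])[0]
        trajectory.append(s)
    return trajectory

rng = random.Random(0)
trajectory = sample_chain(P, start=0, steps=10, rng=rng)
print([STATES[s] for s in trajectory])
```

Nothing earlier than the current state `s` is consulted inside the loop, which is exactly what "memoryless" means here.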

Markov process (MP)

A Markov Process (Markov Chain) is a memoryless random process which follows the Markov Property.

A Markov Process is defined by the tuple < P, S >

The probability of state transition is defined as:

P_{ss'} = \mathbb{P}[S_{t+1}=s' | S_t=s]

Markov REWARD process (MRP)

A Markov Reward Process is a Markov process with Rewards/Values associated with states.

It's represented by the tuple < S, P, R, \gamma >

The reward function is

R_s = \mathbb{E}[R_{t+1} | S_t = s]

and \gamma is the discount factor.

model of an MRP

Return and discount

In an MRP, G_t is defined as the discounted return, given by

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... ; \gamma \in [0, 1]

But what is the need for a discount factor?

- Avoids infinite returns

- Provides control over the trade-off between long-term and short-term rewards

- Mathematically convenient
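To make the role of the discount factor concrete, the sketch below computes G_t for a finite reward sequence. The function name `discounted_return` and the reward values are illustrative assumptions, not part of the slides.

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    g = 0.0
    # Work backwards so that each step is G = r + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0]  # hypothetical R_{t+1}, R_{t+2}, R_{t+3}
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
print(discounted_return(rewards, gamma=0.0))  # only the immediate reward: 1.0
```

A small gamma makes the agent short-sighted, while a gamma close to 1 weighs distant rewards almost as much as immediate ones.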

Markov Decision Process (MDP)

An MDP is an MRP with decisions, represented by the tuple < S, A, P, R, \gamma >

where A is a finite set of actions.

The transition probabilities and reward function both depend on the actions.

The actions are governed by a policy \pi (a | s)

components of agent

Can include one or more of the following

  • Policy
  • Value function
  • Model


  • The policy of an agent defines its behavior.
  • It is given as
\pi (a | s) \text{ (stochastic) or } a = \pi (s) \text{ (deterministic)}
\pi (a | s) = \mathbb{P}[A_t = a | S_t = s]
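The difference between the two kinds of policy can be sketched with plain lookup tables. The state labels, action names and probabilities below are hypothetical, chosen only for illustration.

```python
import random

ACTIONS = ["left", "right"]  # hypothetical action set

# Stochastic policy: pi(a | s) stored as a table of action probabilities.
stochastic_pi = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}

# Deterministic policy: a = pi(s).
deterministic_pi = {"s0": "right", "s1": "left"}

def sample_action(pi, state, rng):
    """Draw an action a with probability pi(a | state)."""
    probs = pi[state]
    return rng.choices(list(probs), weights=list(probs.values()))[0]

rng = random.Random(0)
print(sample_action(stochastic_pi, "s0", rng))  # "left" or "right"
print(deterministic_pi["s0"])                   # always "right"
```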

value function

  • The value function is a prediction of the future reward for a state.
  • It's used to evaluate the quality of a state.
  • It's the expected return, i.e.
V(s) = \mathbb{E}[G_t | S_t = s]
V(s) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} +... | S_t = s]


  • A model predicts what the environment will do next.
  • The properties of a model are the state transition probability and a reward function.
  • In the case of a Partially Observable MDP (POMDP), the agent may form its own representation of the environment.

gridworld example




rl methods: categories

A simple categorization of a few RL methods

  • Temporal Difference Learning
    • TD(\lambda)
    • Q-learning
    • SARSA
    • Actor-critic
  • Policy Search based Learning
    • Policy Gradient
    • Evolutionary Strategies
  • Model based Learning
    • Stochastic Dynamic Programming
    • Bayesian Approaches

explore and exploit

  • Reinforcement Learning is like trial and error
  • The agent should explore the environment to search for better policies.
  • After selection of optimal policy, the agent maximises the rewards.
    • Exploration finds more information about the environment.
    • Exploitation uses known information to maximise reward.
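One common way to balance the two is an epsilon-greedy rule, which the slides later suggest as a behaviour policy for Q-learning. The sketch below assumes the agent already holds a list of estimated action values; `epsilon_greedy` and the numbers in `q` are illustrative.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """q_values: list of estimated action values for the current state."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

rng = random.Random(0)
q = [0.1, 0.7, 0.3]
print(epsilon_greedy(q, epsilon=0.0, rng=rng))  # always greedy: action 1
print(epsilon_greedy(q, epsilon=1.0, rng=rng))  # always a random action
```

Setting epsilon to 0 gives pure exploitation and setting it to 1 gives pure exploration; values in between trade one off against the other.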

prediction and control

  • Prediction : evaluate the future, given a policy.
  • Control: optimise the future, find best policy.

bellman equations

  • Bellman expectation equation
  • Bellman optimality equation

Value learning

State based value learning

V(s) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... | S_t = s]

In general

V(s) = \mathbb{E}[R_{t+1} + \gamma V(S_{t+1}) | S_t = s]
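The recursive form can be used directly as an update rule. The sketch below applies it repeatedly to a tiny two-state reward process, assuming the policy has already been folded into the transition matrix `P` and expected rewards `R`; all numbers are made up for illustration.

```python
def evaluate(P, R, gamma, sweeps=500):
    """Repeatedly apply V(s) <- R_s + gamma * sum_s2 P[s][s2] * V(s2)."""
    n = len(R)
    V = [0.0] * n
    for _ in range(sweeps):
        V = [R[s] + gamma * sum(P[s][s2] * V[s2] for s2 in range(n))
             for s in range(n)]
    return V

# Hypothetical 2-state process: state 1 is absorbing with zero reward.
P = [[0.5, 0.5],
     [0.0, 1.0]]
R = [1.0, 0.0]
V = evaluate(P, R, gamma=0.9)
print(V)
```

For this example the fixed point can be checked by hand: V(1) = 0 and V(0) = 1 + 0.9 * 0.5 * V(0), so V(0) = 1 / 0.55.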

Value learning

State-Action based value learning

Q_{\pi}(s, a) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... | S_t = s, A_t = a]

In general

Q_{\pi}(s, a) = \mathbb{E}[R_{t+1} + \gamma Q_{\pi}(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]

optimality equations

For state based value learning

V_*(s) = \max_{a \in A} \mathbb{E}[R_{t+1} + \gamma V_*(S_{t+1}) | S_t = s, A_t = a]

For state-action based learning

Q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma \max_{a' \in A} Q_*(S_{t+1}, a') | S_t = s, A_t = a]

Given the optimal action values, the optimal policy is:

\pi_*(s) = \arg\max_{a \in A} Q_*(s, a)
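Turning the optimality equation into an update gives value iteration, a dynamic programming method that is named here as an illustration rather than taken from the slides. The sketch assumes a made-up deterministic MDP in which `MDP[s][a]` holds the next state and the reward.

```python
GAMMA = 0.9

# Hypothetical deterministic MDP: MDP[s][a] = (next_state, reward).
# State 2 is terminal (no actions available).
MDP = {
    0: {"stay": (0, 0.0), "go": (1, 0.0)},
    1: {"back": (0, 0.0), "go": (2, 1.0)},
    2: {},
}

def value_iteration(mdp, gamma, sweeps=100):
    """Apply V(s) <- max_a [r + gamma * V(s')] until the values settle."""
    V = {s: 0.0 for s in mdp}
    for _ in range(sweeps):
        for s, actions in mdp.items():
            if actions:  # terminal states keep V = 0
                V[s] = max(r + gamma * V[s2] for s2, r in actions.values())
    return V

def greedy_policy(mdp, V, gamma):
    """pi(s) = argmax_a [r + gamma * V(s')] for every non-terminal state."""
    return {s: max(actions, key=lambda a: actions[a][1] + gamma * V[actions[a][0]])
            for s, actions in mdp.items() if actions}

V = value_iteration(MDP, GAMMA)
policy = greedy_policy(MDP, V, GAMMA)
print(V)
print(policy)
```

In this toy problem the best plan is to take "go" twice, so the value of state 0 is the final reward discounted once.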

temporal difference learning

  • TD methods can learn without a model of the environment, through sampling.
  • TD can learn from incomplete episodes.
  • TD updates a guess towards a guessed return, i.e. it bootstraps (like DP).


The update rule in SARSA for state-action value is

Q(s, a) = Q(s, a) + \alpha (R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(s, a))

which is essentially a generalisation of

Q(s, a) = Q(s, a) + \alpha (G_t - Q(s, a))

where the return G_t is replaced by the TD target R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}), and

\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(s, a)

is the TD error.

SARSA follows the Bellman expectation equation.
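A single SARSA step is small enough to write out in full. The sketch below stores Q in a dictionary keyed by (state, action); the state and action labels and the step size are hypothetical.

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """One SARSA step; returns the TD error delta_t."""
    td_target = r + gamma * Q[(s_next, a_next)]
    td_error = td_target - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error

Q = defaultdict(float)  # unseen (state, action) pairs default to 0
delta = sarsa_update(Q, s="s0", a="right", r=1.0,
                     s_next="s1", a_next="right", alpha=0.5, gamma=0.9)
print(delta, Q[("s0", "right")])  # TD error 1.0, updated value 0.5
```

The name SARSA comes from the five quantities the update needs: S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}.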


  • Q-learning is similar to SARSA, but involves two different policies
    • Behaviour policy: used to select actions while interacting with the environment.
    • Estimation (target) policy: the greedy policy used in the update rule.

The update in Q-learning is:

Q(s, a) = Q(s, a) + \alpha (R_{t+1} + \gamma \max_{a' \in A} Q(s', a') - Q(s, a))

For the behaviour policy, we may use an \epsilon-greedy policy.

implement q-learning

implementation of Q-learning in gridworld like environment
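A minimal sketch of tabular Q-learning, assuming a made-up one-row gridworld with five cells, a reward of 1 for reaching the rightmost cell, and an epsilon-greedy behaviour policy. The environment, hyperparameters and function names are illustrative assumptions.

```python
import random

N_CELLS = 5          # hypothetical one-row gridworld: cells 0..4, goal at cell 4
ACTIONS = [-1, +1]   # move left / move right

def step(state, action):
    """Return (next_state, reward, done) for the toy environment."""
    next_state = min(max(state + action, 0), N_CELLS - 1)
    done = next_state == N_CELLS - 1
    return next_state, (1.0 if done else 0.0), done

def greedy_action(q_row, rng):
    """Pick an action with maximal value, breaking ties at random."""
    best = max(q_row)
    return rng.choice([i for i, q in enumerate(q_row) if q == best])

def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_CELLS)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Behaviour policy: epsilon-greedy over the current estimates.
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))
            else:
                a = greedy_action(Q[s], rng)
            s_next, r, done = step(s, ACTIONS[a])
            # The update uses the greedy (max) value of the next state.
            target = r + (0.0 if done else gamma * max(Q[s_next]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q

Q = q_learning()
greedy = [Q[s].index(max(Q[s])) for s in range(N_CELLS - 1)]
print(greedy)  # index 1 means "move right"
```

The behaviour policy explores with probability epsilon, while the update always bootstraps from the best next action, which is what makes Q-learning off-policy.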



problems with q-learning

  • What happens if the state space is large?
    • Millions of states
    • Continuous state space
  • Cannot store so many states in memory.
  • Computation also becomes very slow!

value function approximation

  • Generalise to unseen states from information about seen states.
  • Estimate the value function through approximation.

value function approximation

Instead of using discrete state value representation, use

V(s, w) \approx V_\pi(s)
Q(s, a, w) \approx Q_\pi(s, a)

For instance, consider a linear combination:

Q(s, a, w) = \sum_i w_i \cdot f_i(s, a)

where the weights w_i can be updated using TD methods and f_i(s, a) is a feature representation.
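The linear form is just a dot product between weights and features. In the sketch below the feature function and the weight values are hypothetical, chosen so the arithmetic is easy to follow.

```python
def features(state, action):
    """Hypothetical feature vector f(s, a) for a numeric state and action in {0, 1}."""
    return [1.0, state, float(action), state * action]

def q_hat(state, action, w):
    """Q(s, a, w) = sum_i w_i * f_i(s, a)."""
    return sum(w_i * f_i for w_i, f_i in zip(w, features(state, action)))

w = [0.5, -1.0, 2.0, 0.0]
print(q_hat(state=0.25, action=1, w=w))  # 0.5 - 0.25 + 2.0 + 0.0 = 2.25
```

Because the same weights are shared across all states, an update made for one state also changes the estimate for similar states, which is the source of generalisation.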

types of models

For function approximation, we can choose from

  • Decision Trees / Random Forests
  • Linear models
  • Non-linear models (Neural Networks)
  • Nearest Neighbours
  • etc...

We choose models that can be differentiated!
(Linear and Non-linear)

defining a loss

  • There is no "training data" in RL
  • So, use Temporal Difference methods
  • We create a guessed target and an approximated guess.
  • Try to minimise the difference between the target and the approximation.

defining a loss

Update the weights by minimising a mean-squared error

J(w) = \mathbb{E}_\pi[(V_\pi(S) - \hat V(S, w))^2]

and use gradient descent to update the weights

\Delta w = -\frac{1}{2} \alpha \nabla_w J(w)
\Delta w = \alpha \mathbb{E}_\pi[(V_\pi(S) - \hat V(S, w))\nabla_w \hat V(S, w)]
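As a worked case, assume a linear approximator with state features f_i(s) (an assumption, by analogy with the f_i(s, a) used for Q). The gradient is then the feature vector itself, and sampling a single state in place of the expectation gives a stochastic update:

```latex
\hat V(s, w) = \sum_i w_i f_i(s) = w^\top f(s)
\quad\Rightarrow\quad
\nabla_w \hat V(s, w) = f(s)

\Delta w = \alpha \, (V_\pi(s) - w^\top f(s)) \, f(s)
```

Since V_\pi(s) is not known in practice, TD methods substitute a guessed target such as R_{t+1} + \gamma \hat V(S_{t+1}, w) for it.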


Update the weights by minimising a squared error against the Q-learning target (the target is treated as a constant when taking the gradient)

J(w) = (R_{t+1} + \gamma \max_{a' \in A} \hat Q(s', a') - \hat Q(s, a, w))^2

and use gradient descent to update the weights

\Delta w = \alpha (R_{t+1} + \gamma \max_{a' \in A} \hat Q(s', a') - \hat Q(s, a, w)) \nabla_w \hat Q(s, a, w)
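The weight update can be sketched for a linear approximator, where the gradient of Q with respect to w is simply the feature vector. The feature function, transition and hyperparameters below are hypothetical.

```python
def features(state, action):
    """Hypothetical features f(s, a): a bias term plus the state, routed by action."""
    return [1.0, state if action == 0 else 0.0, state if action == 1 else 0.0]

def q_hat(state, action, w):
    """Linear estimate Q(s, a, w) = sum_i w_i * f_i(s, a)."""
    return sum(w_i * f_i for w_i, f_i in zip(w, features(state, action)))

def q_learning_step(w, s, a, r, s_next, actions, alpha, gamma):
    """Return updated weights after one semi-gradient Q-learning step."""
    # The target is treated as a constant: no gradient flows through it.
    target = r + gamma * max(q_hat(s_next, a2, w) for a2 in actions)
    td_error = target - q_hat(s, a, w)
    # For a linear model, grad_w Q(s, a, w) is just f(s, a).
    grad = features(s, a)
    return [w_i + alpha * td_error * g_i for w_i, g_i in zip(w, grad)]

w = [0.0, 0.0, 0.0]
w = q_learning_step(w, s=1.0, a=1, r=1.0, s_next=2.0,
                    actions=[0, 1], alpha=0.5, gamma=0.9)
print(w)  # [0.5, 0.0, 0.5]
```

Swapping the linear model for a neural network changes only how `q_hat` and its gradient are computed; the form of the update stays the same.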

the deep q-network

Given the previous information, we can use any function approximator for estimating the value of Q, with the condition that the function be differentiable.


When a deep neural network is used as the function approximator, the result is called a Deep Q-Network (DQN).


DEEPMIND atari dqn

Thank you

Introduction to Reinforcement Learning

By Shubham Dokania

Presentation for Reinforcement Learning Lecture at Coding Blocks
