Reinforcement learning

Shubham Dokania
@shubhamdokania

shubham1810

overview

  • Introduction to Reinforcement Learning
  • Markov Decision Process
  • Value Based Learning
    • state value based learning
    • state-action value based learning
    • Bellman equations
  • Temporal Difference Methods
  • Value function approximation
    • Deep Q-learning (DQN)
  • Code examples!

What is RL?

Reinforcement Learning is about learning what to do - how to map situations to actions, so as to maximize a numerical reward signal. The learner (agent) is not told what to do, but instead it must discover which actions yield the most reward via trial-and-error.

Basic Reinforcement Learning Workflow

learning from reward

  • The Reward R_t defines the goal in an RL problem.
  • Gives the agent a sense of what is good and bad.
  • A reward of higher magnitude is better.
  • Usually a function of the environment state (situation).

What are states?

  • The state s_t \in S is a representation of the current environment situation; S is the set of all states.
  • It's usually a function of the history H_t, where the history is defined as a sequence of observations, actions and rewards:

H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t

s_t = f(H_t)

Information state

  • A state is an information state (or Markov state) if it satisfies the Markov property:

\mathbb{P}[S_{t+1} | S_t, S_{t-1}, S_{t-2},...] = \mathbb{P}[S_{t+1} | S_t]

i.e. the future is independent of the past, given the present.

Markov process (MP)

A Markov Process (Markov Chain) is a memoryless random process which follows the Markov Property.

A Markov Process is defined by the tuple < S, P >.

The probability of state transition is defined as:

P_{ss'} = \mathbb{P}[S_{t+1}=s' | S_t=s]
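As a quick illustration (not from the original slides), here is a minimal Python sketch of sampling a trajectory from a Markov chain; the two states and the transition matrix are made-up values.

import numpy as np

# Minimal sketch: sampling a trajectory from a Markov chain.
# The states and transition matrix below are made-up illustration values.
states = ["sunny", "rainy"]
P = np.array([[0.8, 0.2],   # row s: P_{ss'} = P[S_{t+1} = s' | S_t = s]
              [0.4, 0.6]])

rng = np.random.default_rng(0)
s = 0                                    # start in "sunny"
trajectory = [states[s]]
for t in range(10):
    s = rng.choice(len(states), p=P[s])  # sample S_{t+1} from row P[s]
    trajectory.append(states[s])
print(trajectory)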

Markov REWARD process (MRP)

A Markov Reward Process is a Markov process with rewards (values) associated with states.

It's represented by the tuple < S, P, R, \gamma >.

The Reward function is

R_s = \mathbb{E}[R_{t+1} | S_t = s]

and \gamma is the discount factor.

model of an MRP

Return and discount

In an MRP, the return G_t is defined as the discounted sum of future rewards:

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... ; \gamma \in [0, 1]

But what is the need for a discount factor?

  • Avoids infinite returns
  • Provides control over the trade-off between short-term and long-term rewards
  • Mathematically convenient
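A tiny sketch (with arbitrary illustration values) of computing the discounted return G_t for a finite reward sequence:

# Minimal sketch: computing the discounted return G_t for a finite episode.
# The reward sequence and gamma are arbitrary illustration values.
rewards = [1.0, 0.0, 0.0, 5.0]   # R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}
gamma = 0.9

G = sum(gamma**k * r for k, r in enumerate(rewards))
print(G)   # 1.0 + 0.9*0.0 + 0.81*0.0 + 0.729*5.0 = 4.645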

Markov Decision Process (MDP)

An MDP is an MRP with decisions, represented by the tuple < S, A, P, R, \gamma >

where A is a finite set of actions.

The transition probabilities and the Reward function both depend on the actions.

The actions are governed by a policy \pi (a | s).

components of agent

An agent can include one or more of the following:

  • Policy
  • Value function
  • Model

Policy

  • The Policy of an agent defines its behaviour.
  • It is given as

\pi (a | s) = \mathbb{P}[A_t = a | S_t = s]

which may be stochastic, \pi (a | s), or deterministic, a = \pi (s).
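For illustration, a minimal sketch of sampling an action from a stochastic policy stored as a table; the states, actions, and probabilities are all made up.

import numpy as np

# Minimal sketch: a stochastic policy pi(a | s) stored as a table of action
# probabilities per state. States, actions and numbers are made up.
pi = {
    "s0": {"left": 0.7, "right": 0.3},
    "s1": {"left": 0.1, "right": 0.9},
}
rng = np.random.default_rng(0)

def sample_action(state):
    actions = list(pi[state].keys())
    probs = list(pi[state].values())
    return rng.choice(actions, p=probs)   # A_t ~ pi(. | S_t = state)

print(sample_action("s0"))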

value function

  • The value function is a prediction of the future reward for a state.
  • It's used to evaluate the quality of a state.
  • It's the expected return, i.e.
V(s) = \mathbb{E}[G_t | S_t = s]
V(s) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} +... | S_t = s]

model

  • A model predicts what the environment will do next.
  • The properties of a model are the state transition probability and a reward function.
  • In case of a Partially Observable MDP (POMDP), the agent may form its own representation of the environment.

gridworld example

POLICY

VALUE

model

rl methods: categories

A simple categorization of a few RL methods

  • Temporal Difference Learning
    • TD(\lambda)
    • Q-learning
    • SARSA
    • Actor-critic
  • Policy Search based Learning
    • Policy Gradient
    • Evolutionary Strategies
  • Model based Learning
    • Stochastic Dynamic Programming
    • Bayesian Approaches

Example: ES on walker

explore and exploit

  • Reinforcement Learning is like trial and error
  • The agent should explore the environment to search for better policies.
  • After selecting a good policy, the agent exploits it to maximise rewards.
    • Exploration finds more information about the environment.
    • Exploitation uses known information to maximise reward.

prediction and control

  • Prediction : evaluate the future, given a policy.
  • Control: optimise the future, find best policy.

bellman equations

  • Bellman expectation equation
  • Bellman optimality equation

Value learning

State-based value learning:

V(s) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... | S_t = s]

In general,

V(s) = \mathbb{E}[R_{t+1} + \gamma V(S_{t+1}) | S_t = s]
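Because the Bellman expectation equation is linear in V, a small MRP can be solved for its state values directly; a minimal sketch with made-up transition and reward values:

import numpy as np

# Minimal sketch: V = R + gamma * P V is linear, so for a small MRP the
# values can be computed exactly as V = (I - gamma * P)^{-1} R.
# Transition matrix and rewards are made-up illustration values.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])   # last state is absorbing
R = np.array([1.0, 2.0, 0.0])     # R_s = E[R_{t+1} | S_t = s]
gamma = 0.9

V = np.linalg.solve(np.eye(3) - gamma * P, R)
print(V)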

Value learning

State-Action based value learning:

Q_{\pi}(s, a) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... | S_t = s, A_t = a]

In general,

Q_{\pi}(s, a) = \mathbb{E}[R_{t+1} + \gamma Q_{\pi}(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]

optimality equations

For state based value learning:

V_*(s) = max_{a \in A} (R_{t+1} + \gamma V_*(s'))

For state-action based learning:

Q_*(s, a) = R_{t+1} + \gamma max_{a' \in A} Q_*(s', a')

Under the optimality condition, the optimal policy is:

\pi_*(s) = argmax_{a \in A} Q_*(s, a)
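The optimality equation can be applied iteratively, which is value iteration; here is a minimal sketch on a made-up 2-state, 2-action MDP:

import numpy as np

# Minimal sketch of value iteration, which repeatedly applies the Bellman
# optimality backup. P[a, s, s'] and R[s, a] are made-up illustration values.
n_states, n_actions = 2, 2
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # transitions under action 0
              [[0.5, 0.5], [0.6, 0.4]]])   # transitions under action 1
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])                  # R[s, a]
gamma = 0.9

V = np.zeros(n_states)
for _ in range(100):
    Q = R + gamma * (P @ V).T    # Q[s, a] = R[s, a] + gamma * sum_s' P[a, s, s'] V[s']
    V = Q.max(axis=1)            # Bellman optimality backup

pi_star = Q.argmax(axis=1)       # greedy policy: pi_*(s) = argmax_a Q_*(s, a)
print(V, pi_star)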

temporal difference learning

  • TD methods can learn without a model of the environment, through sampling.
  • TD can learn from incomplete episodes.
  • TD updates a guess towards a guessed return, i.e. it bootstraps (like DP).

SARSA

The update rule in SARSA for the state-action value is

Q(s, a) = Q(s, a) + \alpha (R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(s, a))

which is essentially a generalisation of

Q(s, a) = Q(s, a) + \alpha (G_t - Q(s, a))

where G_t is replaced by the TD target R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}), and

\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(s, a)

is the TD error.

SARSA follows the Bellman expectation equation.
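A minimal sketch of one SARSA step on a tabular Q; the hyperparameters and the surrounding environment interaction are assumptions for illustration, not part of the slides.

import numpy as np
from collections import defaultdict

# Minimal sketch of tabular SARSA with an epsilon-greedy behaviour policy.
# alpha, gamma, epsilon and n_actions are made-up illustration values; the
# (s, a, r, s_next, a_next, done) transition is assumed to come from acting
# in some environment.
alpha, gamma, epsilon, n_actions = 0.1, 0.99, 0.1, 4
Q = defaultdict(lambda: np.zeros(n_actions))
rng = np.random.default_rng(0)

def epsilon_greedy(state):
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(Q[state]))           # exploit

def sarsa_update(s, a, r, s_next, a_next, done):
    # On-policy TD target uses the action actually taken next.
    target = r if done else r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])     # Q <- Q + alpha * TD error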

q-learning

  • Q-learning is similar to SARSA, but it uses two different policies:
    • Behaviour policy: used to select actions (generate experience).
    • Estimation policy: the greedy policy used in the update rule.

The update in Q-learning is:

Q(s, a) = Q(s, a) + \alpha (R_{t+1} + \gamma max_{a' \in A} Q(s', a') - Q(s, a))

For the behaviour policy, we may use an \epsilon-greedy policy, as in the sketch below.
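A minimal tabular Q-learning sketch, assuming a hypothetical gym-style environment `env` with the classic (obs, reward, done, info) step API and illustrative hyperparameters; the linked gridworld code on the next slide is a full implementation.

import numpy as np
from collections import defaultdict

# Minimal sketch of tabular Q-learning, assuming a gym-style environment
# `env` (hypothetical here) with discrete observations and actions.
alpha, gamma, epsilon = 0.1, 0.99, 0.1
Q = defaultdict(lambda: np.zeros(env.action_space.n))
rng = np.random.default_rng(0)

for episode in range(500):
    s, done = env.reset(), False
    while not done:
        # Behaviour policy: epsilon-greedy over the current Q estimates.
        if rng.random() < epsilon:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done, info = env.step(a)
        # Estimation policy is greedy: bootstrap with max_{a'} Q(s', a').
        target = r if done else r + gamma * np.max(Q[s_next])
        Q[s][a] += alpha * (target - Q[s][a])
        s = s_next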

implement q-learning

Implementation of Q-learning in a gridworld-like environment.

 

Code: https://goo.gl/CE8xpC

problems with q-learning

  • What happens if the state space is large?
    • Millions of states
    • Continuous state space
  • Cannot store so many states in memory.
  • Computation also becomes very slow!

value function approximation

  • Generalise unseen states from seen state information.
  • Estimate the value function through approximation.

value function approximation

Instead of using a discrete (tabular) state value representation, use

V(s, w) \approx V_\pi(s)

Q(s, a, w) \approx Q_\pi(s, a)

For instance, consider a linear combination:

Q(s, a, w) = \sum_i w_i \cdot f_i(s, a)

where the weights w_i can be updated using TD methods and f_i(s, a) is a feature representation.
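A minimal sketch of the linear form above, using simple one-hot features over a small discrete state-action space (the sizes are made up); for a linear model, \nabla_w Q(s, a, w) = f(s, a), which keeps the TD updates on the later slides simple.

import numpy as np

# Minimal sketch of a linear approximation Q(s, a, w) = sum_i w_i * f_i(s, a).
# Here f(s, a) is a one-hot feature over a small discrete (s, a) space (which
# makes the model equivalent to a table); richer features generalise across states.
n_states, n_actions = 5, 2
w = np.zeros(n_states * n_actions)

def features(s, a):
    f = np.zeros(n_states * n_actions)
    f[s * n_actions + a] = 1.0
    return f

def q_hat(s, a):
    return float(np.dot(w, features(s, a)))   # Q(s, a, w) = w . f(s, a)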

types of models

For function approximation, we can choose from

  • Decision Trees / Random Forests
  • Linear models
  • Non-linear models (Neural Networks)
  • Nearest Neighbours
  • etc...
     

We choose models that can be differentiated!
(Linear and Non-linear)

defining a loss

  • There is no "training data" in RL
  • So, use Temporal Difference methods
  • We create a guessed target and an approximated estimate.
  • We try to minimise the difference between the target and the approximation.

defining a loss

Update the weights by minimising a mean-squared error

J(w) = \mathbb{E}_\pi[(V_\pi(S) - \hat V(S, w))^2]

and use Gradient Descent to update the weights:

\Delta w = -\frac{1}{2} \alpha \nabla_w J(w)

\Delta w = \alpha \mathbb{E}_\pi[(V_\pi(S) - \hat V(S, w))\nabla_w \hat V(S, w)]

FOR Q-LEARNING

Update the weights by minimising a mean-squared error

J(w) = (R_{t+1} + \gamma max_{a' \in A} \hat Q(s', a', w) - \hat Q(s, a, w))^2

and use Gradient Descent to update the weights:

\Delta w = \alpha (R_{t+1} + \gamma max_{a' \in A} \hat Q(s', a', w) - \hat Q(s, a, w)) \nabla_w \hat Q(s, a, w)
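Continuing the hypothetical linear sketch from the value-function-approximation slide, the semi-gradient update above becomes the following (for a linear model, \nabla_w \hat Q(s, a, w) = f(s, a)):

# Minimal sketch of the semi-gradient Q-learning weight update, reusing the
# hypothetical features() / q_hat() helpers and weights w from the linear
# approximation sketch earlier; alpha and gamma are illustrative values.
alpha, gamma = 0.1, 0.99

def q_learning_weight_update(s, a, r, s_next, done):
    global w
    q_next = 0.0 if done else max(q_hat(s_next, b) for b in range(n_actions))
    td_error = r + gamma * q_next - q_hat(s, a)   # TD target minus current estimate
    w += alpha * td_error * features(s, a)        # delta_w = alpha * TD error * grad_w Q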

the deep q-network

Given the previous information, we can use any function approximator for estimating the value of Q, with the condition that the function be differentiable.

 

In a scenario where a Deep Neural Network is used as the function approximator, it is called a DQN (Deep Q-Network).
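A minimal DQN-style sketch, assuming PyTorch as the deep learning library; this is not the DeepMind Atari architecture, just a small MLP with a target network and the squared TD-error loss from the previous slides, with made-up dimensions and hyperparameters.

import torch
import torch.nn as nn

# Minimal DQN-style sketch in PyTorch. A small MLP approximates Q(s, .);
# the loss is the squared TD error, with a frozen target network for stability.
obs_dim, n_actions, gamma = 4, 2, 0.99

q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def train_step(s, a, r, s_next, done):
    # s, s_next: float tensors [batch, obs_dim]; a: long tensor [batch];
    # r, done: float tensors [batch]. Batches would normally come from a replay buffer.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)      # Q(s, a, w)
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values         # max_{a'} Q(s', a')
        target = r + gamma * (1.0 - done) * q_next
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()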

Architecture

DEEPMIND atari dqn

DQN

Example: Flappy bird

Thank you
