Reinforcement learning

Shubham Dokania
@shubhamdokania

shubham1810

overview

• Introduction to Reinforcement Learning
• Markov Decision Process
• Value Based Learning
  • State value based learning
  • State-action value based learning
• Bellman equations
• Temporal Difference Methods
• Value function approximation
• Deep Q-learning (DQN)
• Code examples!

What is RL?

Reinforcement Learning is about learning what to do - how to map situations to actions, so as to maximize a numerical reward signal. The learner (agent) is not told what to do, but instead it must discover which actions yield the most reward via trial-and-error.

Basic Reinforcement Learning Workflow

learning from reward

• The reward R_t defines the goal in an RL problem.
• Gives the agent a sense of what is good and bad.
• A higher reward is better.
• Usually a function of environment state (situation).

What are states?

• A state s_t \in S is a representation of the current environment situation, where S is the set of all states.
• It's usually a function of the history H_t, where the history is defined as a sequence of observations, actions and rewards:

H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t
s_t = f(H_t)

Information state

• A state is an information state (Markov state) if it satisfies the Markov property,

i.e. the future is independent of the past, given the present

\mathbb{P}[S_{t+1} | S_t, S_{t-1}, S_{t-2},...] = \mathbb{P}[S_{t+1} | S_t]

Markov process (MP)

A Markov Process (Markov Chain) is a memoryless random process which follows the Markov Property.

A Markov Process is defined by the tuple < S, P >, where S is a set of states and P is the state transition probability matrix.

The probability of state transition is defined as:

P_{ss'} = \mathbb{P}[S_{t+1}=s' | S_t=s]
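A minimal sketch of sampling from a Markov chain, given the transition probabilities P_{ss'}. The two-state weather chain, its probabilities, and the helper name `sample_chain` are illustrative assumptions, not part of the slides.

```python
import random

# Hypothetical two-state chain; states and probabilities are assumptions.
P = {
    "sunny": {"sunny": 0.9, "rainy": 0.1},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}

def sample_chain(P, start, steps, rng):
    """Sample a trajectory; each step depends only on the current state."""
    state, path = start, [start]
    for _ in range(steps):
        next_states = list(P[state])
        weights = [P[state][s] for s in next_states]
        # Draw S_{t+1} according to the row P[state] (the Markov property).
        state = rng.choices(next_states, weights=weights, k=1)[0]
        path.append(state)
    return path

path = sample_chain(P, "sunny", 10, random.Random(0))
print(path)
```

Note that the sampler never looks at earlier entries of `path`: the current state alone determines the distribution of the next one.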

Markov reward process (MRP)

A Markov Reward Process is a Markov process with rewards associated with states.

It's represented by the tuple

< S, P, R, \gamma >

The reward function is

R_s = \mathbb{E}[R_{t+1} | S_t = s]

and \gamma is the discount factor.

Return and discount

In an MRP, G_t is defined as the discounted return, given by

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... ; \gamma \in [0, 1]

But what is the need for a discount factor?

- Avoids infinite returns

- Provides control over the trade-off between long-term and short-term rewards

- Mathematically convenient
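As a small worked example, the discounted return of a finite reward sequence can be computed by folding from the last reward backwards, since G_t = R_{t+1} + \gamma G_{t+1}. The function name and the reward list below are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + ..., accumulated from the end."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0]                # hypothetical R_{t+1}, R_{t+2}, R_{t+3}
print(discounted_return(rewards, 0.5))   # 1 + 0.5 + 0.25 = 1.75
print(discounted_return(rewards, 0.0))   # 1.0, only the immediate reward counts
print(discounted_return(rewards, 1.0))   # 3.0, all rewards count equally
```

A small \gamma makes the agent short-sighted, while \gamma close to 1 weighs distant rewards almost as much as immediate ones.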

Markov Decision Process (MDP)

An MDP is an MRP with decisions, represented by the tuple

< S, A, P, R, \gamma >

where A is a finite set of actions.

The transition probabilities and reward function both depend on the actions.

The actions are governed by a policy \pi (a | s).

components of agent

Can include one or more of the following

• Policy
• Value function
• Model

Policy

• The policy of an agent defines its behavior.
• It is given as

\pi (a | s) \text{ (stochastic) or } a = \pi (s) \text{ (deterministic)}
\pi (a | s) = \mathbb{P}[A_t = a | S_t = s]

value function

• The value function is a prediction of the future reward for a state.
• It's used to evaluate the quality of a state.
• It's the expected return, i.e.
V(s) = \mathbb{E}[G_t | S_t = s]
V(s) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} +... | S_t = s]

model

• A model predicts what the environment will do next.
• The properties of a model are the state transition probability and a reward function.
• In case of a Partially Observable MDP (POMDP), the agent may form its own representation of the environment.

rl methods: categories

A simple categorization of a few RL methods

• Temporal Difference Learning
  • TD(\lambda)
  • Q-learning
  • SARSA
  • Actor-critic
• Policy Search based Learning
  • Policy Gradient
  • Evolutionary Strategies
• Model based Learning
  • Stochastic Dynamic Programming
  • Bayesian Approaches

explore and exploit

• Reinforcement Learning is like trial and error
• The agent should explore the environment to search for better policies.
• After selection of optimal policy, the agent maximises the rewards.
• Exploration finds more information about the environment.
• Exploitation uses known information to maximise reward.
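One common way to balance the two is an \epsilon-greedy rule: explore with a small probability \epsilon, otherwise exploit the current estimates. A minimal sketch; the function name and the action-value numbers are assumptions for illustration.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

rng = random.Random(0)
q_values = [0.1, 0.7, 0.3]   # hypothetical action-value estimates
picks = [epsilon_greedy(q_values, 0.1, rng) for _ in range(1000)]
print(picks.count(1) / len(picks))   # fraction of times the greedy action was chosen
```

With \epsilon = 0.1 and three actions, the greedy action is expected to be chosen a little over 90% of the time, and every action keeps a non-zero chance of being tried.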

prediction and control

• Prediction : evaluate the future, given a policy.
• Control: optimise the future, find best policy.

bellman equations

• Bellman expectation equation
• Bellman optimality equation

Value learning

State based value learning

In general

V(s) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... | S_t = s]
V(s) = \mathbb{E}[R_{t+1} + \gamma V(S_{t+1}) | S_t = s]
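The recursive form suggests a simple way to compute V when the transitions are known: start from a guess and apply the right-hand side repeatedly. A minimal sketch for a small MRP; the three-state chain, its rewards, and the function name `evaluate_mrp` are illustrative assumptions.

```python
def evaluate_mrp(P, R, gamma, sweeps=1000, tol=1e-10):
    """Iterate V(s) <- R_s + gamma * sum_s' P_ss' V(s') until it settles."""
    V = {s: 0.0 for s in P}
    for _ in range(sweeps):
        new_V = {s: R[s] + gamma * sum(p * V[s2] for s2, p in P[s].items())
                 for s in P}
        delta = max(abs(new_V[s] - V[s]) for s in P)
        V = new_V
        if delta < tol:
            break
    return V

# Hypothetical MRP: "a" -> "b" -> "end"; "end" is absorbing with no reward.
P = {"a": {"b": 1.0}, "b": {"end": 1.0}, "end": {"end": 1.0}}
R = {"a": 1.0, "b": 2.0, "end": 0.0}
V = evaluate_mrp(P, R, gamma=0.5)
print(V)   # V(end) = 0, V(b) = 2, V(a) = 1 + 0.5 * 2 = 2
```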

Value learning

State-Action based value learning

In general

Q_{\pi}(s, a) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... | S_t = s, A_t = a]
Q_{\pi}(s, a) = \mathbb{E}[R_{t+1} + \gamma Q_{\pi}(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]

optimality equations

For state based value learning

V_*(s) = \max_{a \in A} \mathbb{E}[R_{t+1} + \gamma V_*(S_{t+1}) | S_t = s, A_t = a]

For state-action based learning

Q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma \max_{a' \in A} Q_*(S_{t+1}, a') | S_t = s, A_t = a]

The optimal policy acts greedily with respect to the optimal action values:

\pi_*(s) = \arg\max_{a \in A} Q_*(s, a)
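When the model is known, the optimality backup can be applied repeatedly until the values stop changing (value iteration), and the greedy policy can then be read off from Q. A minimal sketch, assuming deterministic transitions; the toy MDP, its state and action names, and the function name are illustrative assumptions.

```python
def value_iteration(model, gamma, sweeps=1000, tol=1e-10):
    """model[s][a] = (reward, next_state); transitions are deterministic here."""
    V = {s: 0.0 for s in model}
    for _ in range(sweeps):
        # Optimality backup: take the best action's one-step lookahead value.
        new_V = {s: max(r + gamma * V[s2] for r, s2 in model[s].values())
                 for s in model}
        delta = max(abs(new_V[s] - V[s]) for s in model)
        V = new_V
        if delta < tol:
            break
    Q = {s: {a: r + gamma * V[s2] for a, (r, s2) in model[s].items()}
         for s in model}
    policy = {s: max(Q[s], key=Q[s].get) for s in model}
    return V, Q, policy

# Hypothetical MDP: "short" pays 1 and ends; "long" pays nothing now but
# reaches "mid", where "finish" pays 4.
model = {
    "start": {"short": (1.0, "end"), "long": (0.0, "mid")},
    "mid": {"finish": (4.0, "end")},
    "end": {"stay": (0.0, "end")},
}
V, Q, policy = value_iteration(model, gamma=0.5)
print(policy["start"], V["start"])   # long 2.0
```

With a much smaller discount factor (for example 0.1) the delayed reward is worth only 0.4 from "start", so the greedy choice flips to "short".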

temporal difference learning

• TD methods can learn without a model of the environment, through sampling.
• TD can learn from incomplete episodes.
• TD updates a guess towards a guess (bootstrapping, as in DP).

SARSA

The update rule in SARSA for state-action value is

Q(s, a) = Q(s, a) + \alpha (R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(s, a))

which is essentially a generalisation of

Q(s, a) = Q(s, a) + \alpha (G_t - Q(s, a))

where the return G_t is replaced by the TD target R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}), and

\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(s, a)

is the TD Error.

SARSA follows the Bellman expectation equation.
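A minimal sketch of a single SARSA update on a tabular Q, written directly from the update rule. The dictionary-based table, the state and action labels, and the function name are assumptions for illustration.

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """Apply one SARSA step in place and return the TD error."""
    td_target = r + gamma * Q[(s_next, a_next)]
    td_error = td_target - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error

Q = defaultdict(float)         # unseen (state, action) pairs start at 0
Q[("s1", "right")] = 2.0       # hypothetical existing estimate
delta = sarsa_update(Q, "s0", "right", r=1.0, s_next="s1", a_next="right",
                     alpha=0.5, gamma=0.5)
print(delta, Q[("s0", "right")])   # 2.0 1.0
```

The next action `a_next` is the one the agent actually takes in `s_next`, which is what makes the update on-policy.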

q-learning

• Q-learning is similar to SARSA, but involves two different policies
  • Behaviour policy: used to select actions while interacting with the environment
  • Estimation policy: used in the update rule (greedy with respect to Q)

The update in Q-learning is:

Q(s, a) = Q(s, a) + \alpha (R_{t+1} + \gamma \max_{a' \in A} Q(s', a') - Q(s, a))

For the behaviour policy, we may use an \epsilon-greedy policy.

implement q-learning

implementation of Q-learning in gridworld like environment

Code: https://goo.gl/CE8xpC
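For readers without access to the linked code, here is an independent minimal sketch of tabular Q-learning. It is not the linked implementation: the one-dimensional corridor environment, the hyperparameters, and all function names are assumptions chosen to keep the example short.

```python
import random

N_STATES, GOAL = 5, 4      # corridor cells 0..4, goal at the right end
ACTIONS = (-1, +1)         # move left or right

def step(state, action):
    """Hypothetical environment: reward 1 on reaching the goal, else 0."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

def greedy(Q, state, rng):
    """Greedy action index, breaking ties at random."""
    best = max(Q[state])
    return rng.choice([i for i, q in enumerate(Q[state]) if q == best])

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Behaviour policy: epsilon-greedy on the current estimates.
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))
            else:
                a = greedy(Q, state, rng)
            next_state, reward, done = step(state, ACTIONS[a])
            # The target uses the max over next actions, not the action taken.
            target = reward + (0.0 if done else gamma * max(Q[next_state]))
            Q[state][a] += alpha * (target - Q[state][a])
            state = next_state
    return Q

Q = q_learning()
names = ("left", "right")
print([names[greedy(Q, s, random.Random(0))] for s in range(GOAL)])
```

After training, the greedy action in every non-goal cell should be "right", and the learned values should decay by a factor of \gamma per step of distance from the goal.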

problems with q-learning

• What happens if the state space is large?
  • Millions of states
  • Continuous state space
• Cannot store so many states in memory.
• Computation also becomes very slow!

value function approximation

• Generalise to unseen states from information about seen states.
• Estimate the value function through approximation.

value function approximation

Instead of using discrete state value representation, use a parameterised approximation with weights w:

V(s, w) \approx V_\pi(s)
Q(s, a, w) \approx Q_\pi(s, a)

For instance, consider a linear combination:

Q(s, a, w) = \sum_i w_i f_i(s, a)

where the weights w_i can be updated using TD methods and f_i(s, a) is a feature representation.
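A minimal sketch of the linear form, evaluated for a continuous state that no table could enumerate. The feature function, the weights, and the function names are purely illustrative assumptions.

```python
def features(state, action):
    """Hypothetical features f_i(s, a) for a 1-D state and action in {-1, +1}."""
    return [1.0, state, action, state * action]

def q_value(state, action, w):
    """Q(s, a, w) = sum_i w_i * f_i(s, a)."""
    return sum(wi * fi for wi, fi in zip(w, features(state, action)))

w = [0.5, 0.25, -0.5, 1.0]      # hypothetical weights
print(q_value(2.0, +1, w))      # 0.5 + 0.5 - 0.5 + 2.0 = 2.5
print(q_value(2.0, -1, w))      # 0.5 + 0.5 + 0.5 - 2.0 = -0.5
```

Only the four weights are stored, regardless of how many distinct states the agent encounters.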

types of models

For function approximation, we can choose from

• Decision Trees / Random Forests
• Linear models
• Non-linear models (Neural Networks)
• Nearest Neighbours
• etc...

We choose models that can be differentiated!
(Linear and Non-linear)

defining a loss

• There is no "training data" in RL
• So, use Temporal Difference methods
• We create a guess target and an approximated guess
• Try to minimise the difference between the target and the approximation.

defining a loss

Update the weights by minimising a mean-squared error

J(w) = \mathbb{E}_\pi[(V_\pi(S) - \hat V(S, w))^2]

and use gradient descent to update the weights

\Delta w = -\frac{1}{2} \alpha \nabla_w J(w)
\Delta w = \alpha \mathbb{E}_\pi[(V_\pi(S) - \hat V(S, w))\nabla_w \hat V(S, w)]
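Since the true V_\pi(S) is not available, a common choice is to substitute a sampled TD target R_{t+1} + \gamma \hat V(S_{t+1}, w) and take one stochastic gradient step per transition. A minimal sketch for a linear \hat V, where the gradient with respect to w is simply the feature vector; the one-hot features and function names are assumptions for illustration.

```python
def v_hat(x, w):
    """Linear value estimate: dot product of weights and state features."""
    return sum(wi * xi for wi, xi in zip(w, x))

def td0_step(w, x, reward, x_next, alpha, gamma):
    """One semi-gradient TD(0) step; for a linear model grad_w v_hat = x."""
    error = reward + gamma * v_hat(x_next, w) - v_hat(x, w)
    return [wi + alpha * error * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]
x, x_next = [1.0, 0.0], [0.0, 1.0]   # hypothetical one-hot state features
w = td0_step(w, x, reward=1.0, x_next=x_next, alpha=0.5, gamma=0.5)
print(w)   # [0.5, 0.0]
```

The target is treated as a constant when differentiating, which is why this is usually called a semi-gradient update.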

For Q-learning

Update the weights by minimising the squared TD error

J(w) = (R_{t+1} + \gamma \max_{a' \in A} \hat Q(s', a', w) - \hat Q(s, a, w))^2

and use gradient descent to update the weights, treating the target as a constant

\Delta w = \alpha(R_{t+1} + \gamma \max_{a' \in A} \hat Q(s', a', w) - \hat Q(s, a, w))\nabla_w \hat Q(s, a, w)
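A minimal sketch of this weight update for a linear \hat Q. The one-hot feature map over two states and two actions, the starting weights, and the function names are assumptions chosen so the arithmetic is easy to follow.

```python
def q_hat(x, w):
    """Linear action-value estimate for feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def one_hot(state, action, n_actions=2, size=4):
    """Hypothetical features: one indicator per (state, action) pair."""
    x = [0.0] * size
    x[state * n_actions + action] = 1.0
    return x

def q_learning_step(w, s, a, r, s_next, actions, alpha, gamma):
    """One semi-gradient Q-learning step; grad_w q_hat = features(s, a)."""
    target = r + gamma * max(q_hat(one_hot(s_next, b), w) for b in actions)
    x = one_hot(s, a)
    error = target - q_hat(x, w)
    return [wi + alpha * error * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0, 0.0, 1.0]     # hypothetical weights: Q(1, 1) is currently 1
w = q_learning_step(w, s=0, a=1, r=1.0, s_next=1, actions=(0, 1),
                    alpha=0.5, gamma=0.5)
print(w)   # [0.0, 0.75, 0.0, 1.0]
```

With one-hot features this reduces exactly to the tabular Q-learning update; richer features let one update change the estimates of many states at once.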

the deep q-network

Given the previous information, we can use any function approximator for estimating the value of Q, with the condition that the function be differentiable.

When a deep neural network is used as the function approximator, the result is called a Deep Q-Network (DQN).

DQN

Example: Flappy bird

Thank you

Introduction to Reinforcement Learning

By Shubham Dokania

Presentation for Reinforcement Learning Session at MBRDI
