Shubham Dokania

@shubhamdokania

shubham1810

- Introduction to Reinforcement Learning
- Markov Decision Process
- Value Based Learning
- State-value based learning
- State-action value based learning
- Bellman equations

- Temporal Difference Methods
- Value function approximation
- Deep Q-learning (DQN)

- Code examples!

Reinforcement Learning is about learning what to do - how to map situations to actions, so as to maximize a numerical reward signal. The learner (agent) is not told what to do, but instead it must discover which actions yield the most reward via trial-and-error.

Basic Reinforcement Learning Workflow

- The reward $R_t$ defines the goal in an RL problem.
- It gives the agent a sense of what is good and bad.
- A higher reward is better.
- It is usually a function of the environment state (situation).

- The state $s_t \in S$ is a representation of the current environment situation, where $S$ is the set of all states.
- It is usually a function of the history, $s_t = f(H_t)$, where the history $H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t$ is the sequence of observations, actions and rewards.

- A state is an information state (Markov state) if it satisfies the Markov property, i.e. the future is independent of the past, given the present:

$\mathbb{P}[S_{t+1} | S_t, S_{t-1}, S_{t-2},...] = \mathbb{P}[S_{t+1} | S_t]$

A Markov Process (Markov Chain) is a memoryless random process which follows the Markov Property.

A Markov Process is defined by the tuple $< S, P >$.

The state transition probability is defined as:

$P_{ss'} = \mathbb{P}[S_{t+1}=s' | S_t=s]$

A Markov Reward Process is a Markov process with Rewards/Values associated with states.

It is represented by the tuple $< S, P, R, \gamma >$.

The reward function is

$R_s = \mathbb{E}[R_{t+1} | S_t = s]$

and $\gamma$ is the discount factor.

In an MRP, $G_t$ is defined as the discounted return, given by

$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... ; \quad \gamma \in [0, 1]$

But why do we need a discount factor?

- It avoids infinite returns in cyclic or infinite-horizon processes.
- It provides control over the balance between short-term and long-term rewards.
- It is mathematically convenient.
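
As a quick numerical illustration, here is a small Python sketch that computes the discounted return for a finite list of rewards (the reward values and $\gamma$ below are made up for illustration):

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    g = 0.0
    # Iterate backwards so each step folds in one more discount factor.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Illustrative rewards received after time t:
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62
```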

An MDP is an MRP with decisions, represented by the tuple $< S, A, P, R, \gamma >$, where $A$ is a finite set of actions.

The transition probabilities and the reward function both depend on the actions.

The actions are governed by a policy $\pi (a | s)$.
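
To make the tuple concrete, here is a minimal sketch of how such an MDP could be stored in Python. The two states, action names, transition probabilities and rewards are made-up toy values, not anything from these slides:

```python
# A toy MDP < S, A, P, R, gamma > stored as plain Python structures.
S = ["s0", "s1"]
A = ["left", "right"]

# P[s][a] is a list of (next_state, probability) pairs.
P = {
    "s0": {"left": [("s0", 1.0)], "right": [("s1", 0.8), ("s0", 0.2)]},
    "s1": {"left": [("s0", 1.0)], "right": [("s1", 1.0)]},
}

# R[s][a] is the expected immediate reward for taking action a in state s.
R = {
    "s0": {"left": 0.0, "right": 1.0},
    "s1": {"left": 0.0, "right": 2.0},
}

gamma = 0.9
```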

An RL agent can include one or more of the following:

- Policy
- Value function
- Model

- The policy of an agent defines its behaviour.
- It is given as $\pi (a | s)$ (stochastic) or $a = \pi (s)$ (deterministic), where

$\pi (a | s) = \mathbb{P}[A_t = a | S_t = s]$

- The value function is a prediction of the future reward for a state.
- It's used to evaluate the quality of a state.
- It's the expected return, i.e.

$V(s) = \mathbb{E}[G_t | S_t = s]$

$V(s) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} +... | S_t = s]$

- A model predicts what the environment will do next.
- The properties of a model are the state transition probability and a reward function.
- In the case of a Partially Observable MDP (POMDP), the agent may form its own representation of the environment.

A simple categorization of a few RL methods

- Temporal Difference Learning
- Q-learning
- SARSA
- TD($\lambda$)
- Actor-critic

- Policy Search based Learning
- Policy Gradient
- Evolutionary Strategies

- Model based Learning
- Stochastic Dynamic Programming
- Bayesian Approaches

- Reinforcement Learning proceeds by trial and error.
- The agent should explore the environment to search for better policies.
- After selecting an optimal policy, the agent exploits it to maximise reward.
- Exploration finds more information about the environment.
- Exploitation uses known information to maximise reward.

- Prediction: evaluate the future, given a policy.
- Control: optimise the future, find the best policy.

- Bellman expectation equation
- Bellman optimality equation

State based value learning

In general

$V(s) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... | S_t = s]$

$V(s) = \mathbb{E}[R_{t+1} + \gamma V(S_{t+1}) | S_t = s]$
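
The second form follows because the return itself is recursive:

$G_t = R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + ...) = R_{t+1} + \gamma G_{t+1}$

so taking expectations conditioned on $S_t = s$ gives $V(s) = \mathbb{E}[R_{t+1} + \gamma V(S_{t+1}) | S_t = s]$.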

State-Action based value learning

In general

$Q_{\pi}(s, a) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... | S_t = s, A_t = a]$

$Q_{\pi}(s, a) = \mathbb{E}[R_{t+1} + \gamma Q_{\pi}(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]$

For state based value learning:

$V_*(s) = \max_{a \in A} (R_{t+1} + \gamma V_*(s'))$

For state-action based learning:

$Q_*(s, a) = R_{t+1} + \gamma \max_{a' \in A} Q_*(s', a')$

Under the optimality condition, the optimal policy is:

$\pi_*(a | s) = \arg\max_{a \in A} Q_*(s, a)$
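
In code, extracting this greedy policy from a learned Q table is just an argmax per state. A small sketch, assuming Q is stored as a dictionary keyed by (state, action) pairs:

```python
def greedy_policy(Q, states, actions):
    """pi_*(s) = argmax_a Q(s, a) for each state."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
```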

- TD methods can learn without a model of the environment, through sampling.
- TD can learn from incomplete episodes.
- TD updates a guess towards a guessed return, i.e. it bootstraps (like DP).

The update rule in SARSA for the state-action value is

$Q(s, a) = Q(s, a) + \alpha (R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(s, a))$

which is essentially a generalisation of

$Q(s, a) = Q(s, a) + \alpha (G_t - Q(s, a))$

where $G_t$ is the TD target and

$\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(s, a)$

is the TD error.

SARSA follows the Bellman expectation equation.
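
As a rough illustration, one SARSA step could look like the following Python sketch. It assumes a tabular Q stored as a dictionary and an environment whose step(action) returns (next_state, reward, done); that interface is an assumption for the example, not something defined in these slides:

```python
import random
from collections import defaultdict

Q = defaultdict(float)          # Q[(state, action)] -> value
alpha, gamma, epsilon = 0.1, 0.9, 0.1
actions = [0, 1, 2, 3]          # e.g. up/down/left/right in a gridworld

def epsilon_greedy(state):
    """Pick a random action with probability epsilon, else the greedy action."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_step(env, state, action):
    """One on-policy SARSA update: uses the action actually taken next."""
    next_state, reward, done = env.step(action)        # assumed interface
    next_action = epsilon_greedy(next_state)
    target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
    Q[(state, action)] += alpha * (target - Q[(state, action)])
    return next_state, next_action, done
```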

- Q-learning is similar to SARSA, but uses two different policies:
- Behaviour policy: used to select actions in the environment.
- Estimation policy: used in the update rule.

The update in Q-learning is:

$Q(s, a) = Q(s, a) + \alpha (R_{t+1} + \gamma \max_{a' \in A} Q(s', a') - Q(s, a))$

For the behaviour policy, we may use an $\epsilon$-greedy policy.

Implementation of Q-learning in a gridworld-like environment.

Code: https://goo.gl/CE8xpC
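
The linked code is the reference implementation; the sketch below is only a rough outline of the same idea, with the environment's reset/step methods and actions attribute assumed as placeholders:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    Q = defaultdict(float)                     # Q[(state, action)] -> value

    for _ in range(episodes):
        state = env.reset()                    # assumed interface
        done = False
        while not done:
            # Behaviour policy: epsilon-greedy over the current Q estimates.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])

            next_state, reward, done = env.step(action)   # assumed interface

            # Estimation policy: greedy max over next actions (off-policy).
            best_next = max(Q[(next_state, a)] for a in env.actions)
            target = reward + (0.0 if done else gamma * best_next)
            Q[(state, action)] += alpha * (target - Q[(state, action)])

            state = next_state
    return Q
```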

- What happens if the state space is large?
- Millions of states
- Continuous state space

- Cannot store so many states in memory.
- Computation also becomes very slow!

- Generalise to unseen states from information about seen states.
- Estimate the value function through approximation.

Instead of using a discrete (tabular) state-value representation, use

$V(s, w) \approx V_\pi(s)$

$Q(s, a, w) \approx Q_\pi(s, a)$

For instance, consider a linear combination:

$Q(s, a, w) = \sum_i w_i \cdot f_i(s, a)$

where the weights $w_i$ can be updated using TD methods and $f_i(s, a)$ is a feature representation.
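
A minimal NumPy sketch of this linear form; the feature function below is a made-up placeholder, since a real one would encode task-specific information about the state-action pair:

```python
import numpy as np

def features(state, action, num_features=8):
    """Placeholder feature representation f(s, a); illustrative only."""
    rng = np.random.default_rng(hash((state, action)) % (2**32))
    return rng.normal(size=num_features)

w = np.zeros(8)                        # weight vector, one weight per feature

def q_hat(state, action, w):
    """Linear approximation: Q(s, a, w) = sum_i w_i * f_i(s, a)."""
    return np.dot(w, features(state, action))
```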

For function approximation, we can choose from

- Decision Trees / Random Forests
- Linear models
- Non-linear models (Neural Networks)
- Nearest Neighbours
- etc...

We choose models that can be differentiated!

(Linear and Non-linear)

- There is no "training data" in RL
- So, use Temporal Difference methods
- We create a guess target and an approximated guess.
- Try to minimise the difference between the target and the approximation.

Update the weights by minimising a mean-squared error

$J(w) = \mathbb{E}_\pi[(V_\pi(s) - \hat V(s, w))^2]$

and use gradient descent to update the weights

$\Delta w = -\frac{1}{2} \alpha \nabla_w J(w)$

$\Delta w = \alpha \mathbb{E}_\pi[(V_\pi(s) - \hat V(s, w))\nabla_w \hat V(s, w)]$

Similarly, for the action-value function, update the weights by minimising a mean-squared error

$J(w) = (R_{t+1} + \gamma \max_{a' \in A} \hat Q(s', a', w) - \hat Q(s, a, w))^2$

and use gradient descent to update the weights

$\Delta w = \alpha (R_{t+1} + \gamma \max_{a' \in A} \hat Q(s', a', w) - \hat Q(s, a, w)) \nabla_w \hat Q(s, a, w)$
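
Putting this update into code: for a linear approximator the gradient $\nabla_w \hat Q(s, a, w)$ is just the feature vector $f(s, a)$, so one semi-gradient Q-learning step could be sketched as below (the feature function is again an assumed placeholder):

```python
import numpy as np

def td_update(w, features, s, a, reward, s_next, actions,
              alpha=0.01, gamma=0.9, done=False):
    """One semi-gradient Q-learning update for a linear approximator.

    features(s, a) -> np.ndarray is an assumed feature function;
    for Q_hat(s, a, w) = w . f(s, a), the gradient w.r.t. w is f(s, a).
    """
    q_sa = np.dot(w, features(s, a))
    if done:
        target = reward
    else:
        target = reward + gamma * max(np.dot(w, features(s_next, a2))
                                      for a2 in actions)
    # Delta w = alpha * (target - Q_hat(s, a, w)) * grad_w Q_hat(s, a, w)
    return w + alpha * (target - q_sa) * features(s, a)
```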

Given the previous information, we can use any function approximator for estimating the value of Q, with the condition that the function be differentiable.

When a deep neural network is used as the function approximator, the resulting method is called a Deep Q-Network (DQN).
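
As a rough sketch of the core idea, assuming PyTorch: a small fully connected Q-network and one TD update step. The layer sizes and hyperparameters are illustrative, and a full DQN additionally uses experience replay and a separate target network, which are omitted here:

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """A small fully connected Q-network: state -> one Q-value per action."""
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, x):
        return self.net(x)

def dqn_update(q_net, optimizer, batch, gamma=0.99):
    """One gradient step on the TD error for a batch of transitions.

    batch is a tuple of tensors (states, actions, rewards, next_states, dones);
    a full DQN would draw it from a replay buffer and compute the target
    with a separate, periodically synchronised target network.
    """
    states, actions, rewards, next_states, dones = batch
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        max_next = q_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * max_next
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```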