# Reinforcement learning

Shubham Dokania

@shubhamdokania

shubham1810

## overview

- Introduction to Reinforcement Learning
- Markov Decision Process
- Value Based Learning
  - State value based learning
  - State-action value based learning
  - Bellman equations
- Temporal Difference Methods
- Value function approximation
- Deep Q-learning (DQN)
- Code examples!

## What is RL?

Reinforcement Learning is about learning what to do - how to map situations to actions, so as to maximize a numerical reward signal. The learner (agent) is not told what to do, but instead it must discover which actions yield the most reward via trial-and-error.

Basic Reinforcement Learning Workflow

## learning from reward

- The reward defines the goal in an RL problem.
- Gives the agent a sense of what is good and bad.
- A higher reward is better.
- Usually a function of environment state (situation).

## What are states?

- State is a representation of the current environment situation: $s_t \in \mathcal{S}$, where $\mathcal{S}$ is the set of all states.
- It's usually a function of the history, $s_t = f(H_t)$, where the history $H_t$ is defined as the sequence of observations, actions and rewards up to time $t$.

## Information state

- A state is an information state (Markov state) if it satisfies the Markov property:

$$P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, \dots, S_t]$$

i.e. the future is independent of the past, given the present.

## Markov process (MP)

A Markov Process (Markov Chain) is a memoryless random process which follows the Markov Property.

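A Markov Process is defined by the tuple $\langle \mathcal{S}, \mathcal{P} \rangle$, where $\mathcal{S}$ is a set of states and $\mathcal{P}$ is the state transition probability matrix.

The probability of state transition is defined as:

$$\mathcal{P}_{ss'} = P[S_{t+1} = s' \mid S_t = s]$$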
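As a minimal sketch of a Markov process, the snippet below samples a chain from a transition table. The two weather states and their probabilities are made up for illustration; the point is only that `step` looks at the current state and nothing earlier.

```python
import random

# Hypothetical two-state chain; states and probabilities are illustrative only.
STATES = ["sunny", "rainy"]
P = {
    "sunny": {"sunny": 0.9, "rainy": 0.1},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}

def step(state, rng):
    """Sample the next state from the row of P for the current state only."""
    next_states = list(P[state].keys())
    probs = list(P[state].values())
    return rng.choices(next_states, weights=probs, k=1)[0]

def sample_chain(start, length, seed=0):
    """Return a list of `length` states starting from `start`."""
    rng = random.Random(seed)
    chain = [start]
    for _ in range(length - 1):
        chain.append(step(chain[-1], rng))
    return chain

print(sample_chain("sunny", 10))
```

Each row of `P` sums to 1, which is what makes it a valid transition distribution.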

## Markov REWARD process (MRP)

A Markov Reward Process is a Markov process with Rewards/Values associated with states.

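It's represented by the tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$.

The Reward function is

$$\mathcal{R}_s = E[R_{t+1} \mid S_t = s]$$

and $\gamma \in [0, 1]$ is the discount factor.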

## model of a MRP

## Return and discount

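In an MRP, $G_t$ is defined as the discounted return, given by

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$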

But what is the need for a discount factor?

- Avoids infinite returns
- Provides control over the trade-off between long-term and short-term rewards
- Mathematically convenient
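A small sketch of the discounted return for a finite reward sequence may make these points concrete. The function name and the reward values are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """G = r_1 + gamma * r_2 + gamma^2 * r_3 + ... for a finite reward list."""
    g = 0.0
    # Work backwards so each step is just g = r + gamma * g.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1, 1, 1], gamma=0.5))   # 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1, 1, 1], gamma=0.0))   # only the immediate reward counts
print(discounted_return([1] * 1000, gamma=0.9))  # stays bounded, close to 1 / (1 - 0.9)
```

With a constant reward of 1, the return approaches $1 / (1 - \gamma)$ instead of growing without bound, and a smaller $\gamma$ weights near-term rewards more heavily.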

## Markov Decision Process (MDP)

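An MDP is an MRP with decisions, represented by the tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$

where $\mathcal{A}$ is a finite set of actions.

The transition probabilities and Reward function both depend on the actions:

$$\mathcal{P}^a_{ss'} = P[S_{t+1} = s' \mid S_t = s, A_t = a], \qquad \mathcal{R}^a_s = E[R_{t+1} \mid S_t = s, A_t = a]$$

The actions are governed by a policy $\pi(a \mid s)$.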

## components of agent

Can include one or more of the following

- Policy
- Value function
- Model

## Policy

- The policy of an agent defines its behaviour.
- It is given as $\pi(a \mid s) = P[A_t = a \mid S_t = s]$
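One way to picture a stochastic policy is as a table of action probabilities per state. The sketch below assumes made-up state and action names and samples an action from that table.

```python
import random

# Hypothetical policy table: pi[state][action] = probability of taking that action.
pi = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 1.0, "right": 0.0},  # deterministic in this state
}

def sample_action(policy, state, rng):
    """Draw an action according to the policy's probabilities for `state`."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

rng = random.Random(0)
print(sample_action(pi, "s0", rng))  # "right" about 80% of the time
print(sample_action(pi, "s1", rng))  # always "left"
```

A deterministic policy is the special case where one action in each state has probability 1.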

## value function

- The value function is a prediction of the future reward for a state.
- It's used to evaluate the quality of a state.
- It's the expected return, i.e. $v_\pi(s) = E_\pi[G_t \mid S_t = s]$

## model

- A model predicts what the environment will do next.
- The properties of a model are the state transition probability and a reward function.
- In case of a Partially Observable MDP (POMDP), the agent may form its own representation of the environment.

## gridworld example

## POLICY

## VALUE

## model

## rl methods: categories

A simple categorization of a few RL methods

- Temporal Difference Learning
  - Q-learning
  - SARSA
  - Actor-critic
- Policy Search based Learning
  - Policy Gradient
  - Evolutionary Strategies
- Model based Learning
  - Stochastic Dynamic Programming
  - Bayesian Approaches

## Example: ES on walker

## explore and exploit

- Reinforcement Learning is like trial and error
- The agent should explore the environment to search for better policies.
- After selection of optimal policy, the agent maximises the rewards.
- Exploration finds more information about the environment.
- Exploitation uses known information to maximise reward.
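A common way to balance the two is an $\epsilon$-greedy rule: explore with a small probability, exploit otherwise. The sketch below assumes a list of estimated action values; the numbers are illustrative.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

rng = random.Random(0)
q = [0.1, 0.5, 0.2]  # hypothetical action-value estimates
picks = [epsilon_greedy(q, epsilon=0.1, rng=rng) for _ in range(1000)]
print(picks.count(1) / len(picks))  # mostly the greedy action (index 1)
```

Setting `epsilon=0` gives pure exploitation, and `epsilon=1` gives pure exploration.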

## prediction and control

- Prediction: evaluate the future, given a policy.
- Control: optimise the future, i.e. find the best policy.

## bellman equations

- Bellman expectation equation
- Bellman optimality equation

## Value learning

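State based value learning

$$v_\pi(s) = E_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$$

In general

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_\pi(s') \right)$$

where $\pi(a \mid s)$ is the policy, and $\mathcal{R}^a_s$ and $\mathcal{P}^a_{ss'}$ are the reward function and transition probabilities of the MDP.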

## Value learning

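State-Action based value learning

$$q_\pi(s, a) = E_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$$

In general

$$q_\pi(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \sum_{a' \in \mathcal{A}} \pi(a' \mid s') q_\pi(s', a')$$

where $\pi(a \mid s)$ is the policy, and $\mathcal{R}^a_s$ and $\mathcal{P}^a_{ss'}$ are the reward function and transition probabilities of the MDP.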
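When the model is known, the Bellman expectation backup can be applied repeatedly to evaluate a policy. The sketch below does this on a hypothetical two-state MDP; the states, actions, rewards and the uniform random policy are all assumptions made for illustration.

```python
GAMMA = 0.9

# Hypothetical two-state MDP: MODEL[s][a] = list of (probability, next_state, reward).
MODEL = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}

# Uniform random policy: POLICY[s][a] = pi(a | s).
POLICY = {s: {a: 0.5 for a in MODEL[s]} for s in MODEL}

def q_value(v, s, a):
    """q(s, a) = sum over s' of P(s' | s, a) * (reward + gamma * v(s'))."""
    return sum(p * (r + GAMMA * v[s2]) for p, s2, r in MODEL[s][a])

def evaluate_policy(policy, tol=1e-10):
    """Repeat the Bellman expectation backup until the values stop changing."""
    v = {s: 0.0 for s in MODEL}
    while True:
        new_v = {s: sum(policy[s][a] * q_value(v, s, a) for a in MODEL[s])
                 for s in MODEL}
        if max(abs(new_v[s] - v[s]) for s in MODEL) < tol:
            return new_v
        v = new_v

v_pi = evaluate_policy(POLICY)
print({s: round(x, 3) for s, x in v_pi.items()})  # approximately {0: 7.25, 1: 7.75}
```

The same values can be checked by hand: solving the two linear equations $v(0) = 0.5 + 0.45\,v(0) + 0.45\,v(1)$ and $v(1) = 1 + 0.45\,v(0) + 0.45\,v(1)$ gives $v(0) = 7.25$ and $v(1) = 7.75$.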

## optimality equations

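For state based value learning

$$v_*(s) = \max_a E[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a]$$

For state-action based learning

$$q_*(s, a) = E[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s, A_t = a]$$

Under the optimality condition, the optimal policy is:

$$\pi_*(s) = \arg\max_a q_*(s, a)$$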
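With a known model, the Bellman optimality backup can be iterated directly (value iteration), and the optimal policy read off greedily. The sketch below uses a hypothetical two-state MDP whose states, actions and rewards are assumptions made for illustration.

```python
GAMMA = 0.9

# Hypothetical two-state MDP: MODEL[s][a] = list of (probability, next_state, reward).
MODEL = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}

def q_value(v, s, a):
    """q(s, a) = sum over s' of P(s' | s, a) * (reward + gamma * v(s'))."""
    return sum(p * (r + GAMMA * v[s2]) for p, s2, r in MODEL[s][a])

def value_iteration(tol=1e-10):
    """Repeat v(s) <- max_a q(s, a) until the values stop changing."""
    v = {s: 0.0 for s in MODEL}
    while True:
        new_v = {s: max(q_value(v, s, a) for a in MODEL[s]) for s in MODEL}
        if max(abs(new_v[s] - v[s]) for s in MODEL) < tol:
            return new_v
        v = new_v

def greedy_policy(v):
    """Pick, in every state, the action with the highest q(s, a)."""
    return {s: max(MODEL[s], key=lambda a: q_value(v, s, a)) for s in MODEL}

v_star = value_iteration()
print({s: round(x, 3) for s, x in v_star.items()})  # approximately {0: 19.0, 1: 20.0}
print(greedy_policy(v_star))                        # {0: 'go', 1: 'stay'}
```

Here staying in state 1 earns 2 per step forever, worth $2 / (1 - 0.9) = 20$, so the best move from state 0 is to go there.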

## temporal difference learning

- TD methods can learn without a model of the environment, through sampling.
- TD can learn from incomplete episodes.
- TD updates a guess towards a guess, i.e. it bootstraps from its own estimates (like DP).
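A minimal sketch of TD(0) value prediction follows. The three-state episode and its rewards are hypothetical, and the environment is kept deterministic so the result is easy to check by hand. Note that the update is applied after every single step, without waiting for the episode to finish and without using a model.

```python
def td0_update(v, s, r, s_next, alpha, gamma, terminal):
    """Move v[s] a step of size alpha towards the TD target r + gamma * v[s_next]."""
    target = r + (0.0 if terminal else gamma * v[s_next])
    td_error = target - v[s]
    v[s] += alpha * td_error
    return td_error

# Hypothetical episode: 0 -> 1 -> 2 (terminal), reward 1 on every step.
transitions = [(0, 1.0, 1, False), (1, 1.0, 2, True)]

v = {0: 0.0, 1: 0.0, 2: 0.0}
for _ in range(500):                       # replay the same episode many times
    for s, r, s_next, terminal in transitions:
        td0_update(v, s, r, s_next, alpha=0.1, gamma=1.0, terminal=terminal)

print({s: round(x, 3) for s, x in v.items()})  # approaches {0: 2.0, 1: 1.0, 2: 0.0}
```

The estimate for state 0 is built from the current estimate for state 1, which is the "guess towards a guess" idea.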

## SARSA

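The update rule in SARSA for state-action value is

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right)$$

which is essentially a generalisation of

$$V(S_t) \leftarrow V(S_t) + \alpha \left( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right)$$

where $\alpha$ is the step size, $R_{t+1} + \gamma V(S_{t+1})$ is the TD target and

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$

is the TD error.

SARSA follows the Bellman expectation equation.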
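As a sketch, one SARSA step can be written as a small function over a table of action values. The state and action names and the numbers are illustrative assumptions. The name comes from the quantities used: the state, action, reward, next state and next action actually taken.

```python
def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma):
    """One SARSA step: move q[(s, a)] towards r + gamma * q[(s_next, a_next)]."""
    td_target = r + gamma * q[(s_next, a_next)]
    td_error = td_target - q[(s, a)]
    q[(s, a)] += alpha * td_error
    return td_error

# Hypothetical table with two entries.
q = {("s0", "right"): 1.0, ("s1", "left"): 2.0}
err = sarsa_update(q, "s0", "right", r=1.0, s_next="s1", a_next="left",
                   alpha=0.5, gamma=0.9)
print(round(err, 3), round(q[("s0", "right")], 3))  # 1.8 1.9
```

The target is $1 + 0.9 \times 2 = 2.8$, the TD error is $2.8 - 1 = 1.8$, and half of that error is added to the old estimate.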

## q-learning

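- Q-learning is similar to SARSA, but uses two different policies:
  - Behaviour policy: used to select actions while interacting with the environment.
  - Estimation policy: used in the update rule.

The update in Q-learning is:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \right)$$

where $\alpha$ is the step size.

For the behaviour policy, we may use an $\epsilon$-greedy policy.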
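The following is a minimal tabular Q-learning sketch, not the implementation linked in these slides. It assumes a hypothetical five-state corridor (start at 0, goal at 4, reward -1 per move) and uses an epsilon-greedy behaviour policy while the update itself uses the greedy maximum over next actions.

```python
import random

N_STATES, GOAL = 5, 4   # hypothetical corridor: states 0..4, goal at state 4
LEFT, RIGHT = 0, 1

def env_step(s, a):
    """Deterministic corridor: reward -1 per move, episode ends at GOAL."""
    s_next = max(0, s - 1) if a == LEFT else s + 1
    return s_next, -1.0, s_next == GOAL

def epsilon_greedy(q_row, epsilon, rng):
    """Behaviour policy: random action with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = epsilon_greedy(q[s], epsilon, rng)        # act with the behaviour policy
            s_next, r, done = env_step(s, a)
            best_next = 0.0 if done else max(q[s_next])   # greedy value used in the update
            q[s][a] += alpha * (r + gamma * best_next - q[s][a])
            s = s_next
    return q

q = q_learning()
greedy = [max((LEFT, RIGHT), key=lambda a: q[s][a]) for s in range(GOAL)]
print(greedy)  # [1, 1, 1, 1], i.e. always move right
```

Because the update uses `max(q[s_next])` rather than the action the agent actually takes next, the learned values describe the greedy policy even though the agent sometimes explores.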

## implement q-learning

Implementation of Q-learning in a gridworld-like environment

Code: https://goo.gl/CE8xpC

## problems with q-learning

- What happens if the state space is large?
  - Millions of states
  - Continuous state space
- Cannot store so many states in memory.
- Computation also becomes very slow!

## value function approximation

- Generalise to unseen states from information about seen states.
- Estimate the value function through approximation.

## value function approximation

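Instead of using a discrete state value representation, use a parameterised function

$$\hat{v}(s, \mathbf{w}) \approx v_\pi(s), \qquad \hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a)$$

For instance, consider a linear combination:

$$\hat{v}(s, \mathbf{w}) = \mathbf{x}(s)^\top \mathbf{w} = \sum_j x_j(s) w_j$$

where the weights $\mathbf{w}$ can be updated using TD methods and $\mathbf{x}(s)$ is a feature representation of the state.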
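A sketch of a linear value function trained with TD(0) is given below. The feature map `features(s) = [1, s / 4]`, the corridor with reward -1 per step, and the step size are assumptions chosen so that the true values, $v(s) = -(4 - s)$, can be represented exactly by two weights.

```python
def features(s):
    """Hypothetical feature map x(s) = [1, s / 4] for corridor states 0..4."""
    return [1.0, s / 4.0]

def v_hat(x, w):
    """Linear value estimate: dot product of features and weights."""
    return sum(xi * wi for xi, wi in zip(x, w))

def td0_linear_update(w, x, r, x_next, alpha, gamma, terminal):
    """TD(0) step for a linear model; the gradient of v_hat w.r.t. w is x."""
    target = r + (0.0 if terminal else gamma * v_hat(x_next, w))
    td_error = target - v_hat(x, w)
    return [wi + alpha * td_error * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]
for _ in range(2000):              # repeated walks 0 -> 1 -> 2 -> 3 -> 4 (terminal)
    for s in range(4):
        w = td0_linear_update(w, features(s), -1.0, features(s + 1),
                              alpha=0.1, gamma=1.0, terminal=(s + 1 == 4))

print([round(wi, 3) for wi in w])  # close to [-4.0, 4.0]
```

Only two numbers are stored instead of one value per state, and the learned weights also produce a value for any state the feature map can encode.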

## types of models

For function approximation, we can choose from

- Decision Trees / Random Forests
- Linear models
- Non-linear models (Neural Networks)
- Nearest Neighbours
- etc...

We choose models that can be differentiated!

(Linear and Non-linear)

## defining a loss

- There is no "training data" in RL
- So, use Temporal Difference methods
- We create a guess target and an approximated guess.
- Try to minimise the difference between the target and the approximation.

## defining a loss

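Update the weights by minimising a mean-squared error between the TD target and the approximation

$$J(\mathbf{w}) = E_\pi\left[ \left( R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}) \right)^2 \right]$$

where $\hat{v}(s, \mathbf{w})$ is the approximate value function with weights $\mathbf{w}$.

And use Gradient Descent to update weights, treating the target as fixed, with step size $\alpha$

$$\Delta \mathbf{w} = \alpha \left( R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w})$$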

## FOR Q-LEARNING

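Update the weights by minimising a mean-squared error between the Q-learning target and the approximation

$$J(\mathbf{w}) = E\left[ \left( R_{t+1} + \gamma \max_{a'} \hat{q}(S_{t+1}, a', \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w}) \right)^2 \right]$$

where $\hat{q}(s, a, \mathbf{w})$ is the approximate action-value function with weights $\mathbf{w}$.

And use Gradient Descent to update weights, treating the target as fixed, with step size $\alpha$

$$\Delta \mathbf{w} = \alpha \left( R_{t+1} + \gamma \max_{a'} \hat{q}(S_{t+1}, a', \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{q}(S_t, A_t, \mathbf{w})$$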
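A single gradient step of this kind can be sketched for a linear action-value model. The parameterisation (one weight vector per action), the feature vectors and the numbers below are assumptions for illustration. The target is treated as a constant when taking the gradient.

```python
def q_hat(x, a, w):
    """Linear action value: dot product of features x with the weights of action a."""
    return sum(xi * wi for xi, wi in zip(x, w[a]))

def q_learning_linear_update(w, x, a, r, x_next, alpha, gamma, terminal):
    """One Q-learning gradient step on the squared TD error for a linear model."""
    n_actions = len(w)
    best_next = 0.0 if terminal else max(q_hat(x_next, b, w) for b in range(n_actions))
    td_error = r + gamma * best_next - q_hat(x, a, w)
    # The gradient of q_hat(x, a, w) w.r.t. w[a] is x; other actions are untouched.
    w[a] = [wi + alpha * td_error * xi for wi, xi in zip(w[a], x)]
    return td_error

# Hypothetical setup: two actions, two features, weights initialised to zero.
w = [[0.0, 0.0], [0.0, 0.0]]
err = q_learning_linear_update(w, x=[1.0, 0.5], a=1, r=1.0, x_next=[1.0, 1.0],
                               alpha=0.1, gamma=0.9, terminal=False)
print(err, w)  # TD error 1.0; only the weights of action 1 change
```

Swapping the linear model for any other differentiable function only changes how the gradient in the marked line is computed.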

## the deep q-network

Given the previous information, we can use any function approximator for estimating the value of Q, with the condition that the function be differentiable.

In a scenario where a Deep Neural Network is used as the function approximator, it is called a Deep Q-Network (DQN).

## Architecture

## DEEPMIND atari dqn

## DQN

Example: Flappy bird

# Thank you

#### Introduction to Reinforcement Learning

By Shubham Dokania

Presentation for Reinforcement Learning Session at MBRDI
