Reinforcement learning
Shubham Dokania
@shubhamdokania
shubham1810
workshop overview
- Introduction to Reinforcement Learning
- Markov Decision Process
- Value Based Learning
- state value based learning
- state-action value based learning
- Bellman equations
- Temporal Difference Methods
- Value function approximation
- Deep Q-learning (DQN)
- Code examples!
What is RL?
Reinforcement Learning is about learning what to do - how to map situations to actions, so as to maximize a numerical reward signal. The learner (agent) is not told what to do, but instead it must discover which actions yield the most reward via trial-and-error.
Basic Reinforcement Learning Workflow
learning from reward
- The Reward defines the goal in an RL problem.
- Gives the agent a sense of what is good and bad.
- A reward of higher magnitude is better.
- Usually a function of environment state (situation).
What are states?
- A state $S_t$ is a representation of the current environment situation. $\mathcal{S}$ is the set of all states.
- It's usually a function of the history, $S_t = f(H_t)$, where the history $H_t = O_1, R_1, A_1, \ldots, A_{t-1}, O_t, R_t$ is the sequence of observations, actions and rewards.
Information state
- A state $S_t$ is an information state (or Markov state) if it follows the Markov property: $\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]$
i.e. the future is independent of the past, given the present
Markov process (MP)
A Markov Process (Markov Chain) is a memoryless random process which follows the Markov Property.
A Markov Process is defined by the tuple $\langle \mathcal{S}, \mathcal{P} \rangle$.
The probability of state transition is defined as: $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$
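Below is a minimal sketch of sampling from a Markov chain. The three weather states and the transition matrix are illustrative assumptions, not from the workshop material:

```python
import numpy as np

# Hypothetical 3-state Markov chain; states and probabilities are made up
# purely for illustration.
states = ["sunny", "cloudy", "rainy"]
P = np.array([
    [0.8, 0.15, 0.05],   # P[s' | s = sunny]
    [0.3, 0.40, 0.30],   # P[s' | s = cloudy]
    [0.2, 0.30, 0.50],   # P[s' | s = rainy]
])

def sample_chain(start=0, steps=10):
    """Sample a trajectory; the next state depends only on the current one."""
    rng = np.random.default_rng(0)
    s, trajectory = start, [start]
    for _ in range(steps):
        s = rng.choice(len(states), p=P[s])
        trajectory.append(s)
    return [states[i] for i in trajectory]

print(sample_chain())
```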
Markov REWARD process (MRP)
A Markov Reward Process is a Markov process with Rewards/Values associated with states.
It's represented by the tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$.
The Reward function is $\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]$,
and $\gamma \in [0, 1]$ is the discount factor.
model of an MRP
Return and discount
In an MRP, the return $G_t$ is defined as the discounted sum of rewards: $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
But what is the need for a discount factor?
- Avoids infinite returns (see the worked example below)
- Provides control over the balance between short-term and long-term rewards
- Mathematically convenient
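For example, with $\gamma = 0.9$ a constant reward of $+1$ per step still gives a finite return:

$$G_t = \sum_{k=0}^{\infty} 0.9^k \cdot 1 = \frac{1}{1 - 0.9} = 10,$$

whereas with $\gamma = 1$ the same infinite sequence would diverge; a smaller $\gamma$ weights immediate rewards more heavily.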
Markov Decision Process (MDP)
An MDP is an MRP with decisions, represented by the tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$,
where $\mathcal{A}$ is a finite set of actions.
The transition probabilities $\mathcal{P}^a_{ss'}$ and the Reward function $\mathcal{R}^a_s$ both depend on the actions.
The actions are governed by a policy $\pi$.
components of agent
Can include one or more of the following
- Policy
- Value function
- Model
Policy
- The Policy of an agent defines its behaviour.
- It is given as $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$
value function
- The value function is a prediction of the future reward for a state.
- It's used to evaluate the quality of a state.
- It's the expected return, i.e. $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
model
- A model predicts what the environment will do next.
- The properties of a model are the state transition probability and a reward function.
- In case of a Partially Observable MDP (POMDP), the agent may form its own representation of the environment.
gridworld example
(Figures: the policy, the value function, and the model visualised on the gridworld)
rl methods: categories
A simple categorization of a few RL methods
- Temporal Difference Learning
  - Q-learning
  - SARSA
  - Actor-critic
- Policy Search based Learning
  - Policy Gradient
  - Evolutionary Strategies
- Model based Learning
  - Stochastic Dynamic Programming
  - Bayesian Approaches
explore and exploit
- Reinforcement Learning is like trial and error
- The agent should explore the environment to search for better policies.
- After selecting the optimal policy, the agent exploits it to maximise reward.
- Exploration finds more information about the environment.
- Exploitation uses known information to maximise reward.
prediction and control
- Prediction : evaluate the future, given a policy.
- Control: optimise the future, find best policy.
bellman equations
- Bellman expectation equation
- Bellman optimality equation
Value learning
State based value learning
In general, $v_\pi(s) = \mathbb{E}_\pi \left[ R_{t+1} + \gamma \, v_\pi(S_{t+1}) \mid S_t = s \right]$
Value learning
State-Action based value learning
In general, $q_\pi(s, a) = \mathbb{E}_\pi \left[ R_{t+1} + \gamma \, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \right]$
optimality equations
For state based value learning: $v_*(s) = \max_a \mathbb{E} \left[ R_{t+1} + \gamma \, v_*(S_{t+1}) \mid S_t = s, A_t = a \right]$
For state-action based learning: $q_*(s, a) = \mathbb{E} \left[ R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s, A_t = a \right]$
For the optimal condition, the optimal policy is: $\pi_*(s) = \arg\max_a q_*(s, a)$
temporal difference learning
- TD methods can learn without a model of the environment, through sampling.
- TD can learn from incomplete episodes.
- TD updates a guess towards a guess of the return, i.e. it bootstraps (like DP).
SARSA
The update rule in SARSA for the state-action value is
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma \, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right)$
which is essentially a generalisation of the TD(0) update $V(S_t) \leftarrow V(S_t) + \alpha \left( R_{t+1} + \gamma \, V(S_{t+1}) - V(S_t) \right)$,
where $R_{t+1} + \gamma \, Q(S_{t+1}, A_{t+1})$ is the TD target and
$\delta_t = R_{t+1} + \gamma \, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)$ is the TD error.
SARSA follows the Bellman expectation equation.
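A minimal sketch of one tabular SARSA episode, assuming a Gym-style environment with discrete states and actions; the environment interface and hyperparameters are illustrative assumptions, not the workshop's code:

```python
import numpy as np

def epsilon_greedy(Q, state, eps, rng):
    """Pick a random action with probability eps, otherwise the greedy one."""
    n_actions = Q.shape[1]
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def sarsa_episode(env, Q, alpha=0.1, gamma=0.99, eps=0.1):
    """One on-policy SARSA episode: update towards R + gamma * Q(S', A')."""
    rng = np.random.default_rng()
    state = env.reset()
    action = epsilon_greedy(Q, state, eps, rng)
    done = False
    while not done:
        next_state, reward, done, _ = env.step(action)
        next_action = epsilon_greedy(Q, next_state, eps, rng)
        td_target = reward + gamma * Q[next_state, next_action] * (not done)
        td_error = td_target - Q[state, action]
        Q[state, action] += alpha * td_error
        state, action = next_state, next_action
    return Q
```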
q-learning
- Q-learning is similar to SARSA, but uses two different policies:
- Behaviour policy: used to select actions while interacting with the environment.
- Estimation policy: the greedy policy used in the update rule.
The update in Q-learning is:
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right)$
For the behaviour policy, we may use an $\epsilon$-greedy policy.
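A corresponding sketch of the off-policy tabular Q-learning update: actions are chosen by the $\epsilon$-greedy behaviour policy, while the target bootstraps from the greedy (max) estimate. The Gym-style environment interface and hyperparameters are again assumptions:

```python
import numpy as np

def q_learning_episode(env, Q, alpha=0.1, gamma=0.99, eps=0.1):
    """One Q-learning episode: behave eps-greedily, update towards max_a Q(S', a)."""
    rng = np.random.default_rng()
    n_actions = Q.shape[1]
    state = env.reset()
    done = False
    while not done:
        # Behaviour policy: epsilon-greedy over the current Q estimates.
        if rng.random() < eps:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, done, _ = env.step(action)
        # Estimation policy: greedy (max over next-state action values).
        td_target = reward + gamma * np.max(Q[next_state]) * (not done)
        Q[state, action] += alpha * (td_target - Q[state, action])
        state = next_state
    return Q
```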
implement q-learning
implementation of Q-learning in a gridworld-like environment
Code: https://goo.gl/CE8xpC
problems with q-learning
- What happens if the state space is large?
- Millions of states
- Continuous state space
- Cannot store so many states in memory.
- Computation also becomes very slow!
value function approximation
- Generalise unseen states from seen state information.
- Estimate the value function through approximation.
value function approximation
Instead of using a discrete (tabular) state value representation, use an approximation $\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$.
For instance, consider a linear combination: $\hat{v}(s, \mathbf{w}) = \mathbf{x}(s)^\top \mathbf{w}$
where the weights $\mathbf{w}$ can be updated using TD methods and $\mathbf{x}(s)$ is a feature representation of the state, as sketched below.
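A small sketch of such a linear approximator updated with TD(0); the one-hot feature map and the state count are illustrative assumptions:

```python
import numpy as np

N_STATES = 16  # assumed size of the discrete state space

def features(state):
    """Illustrative feature map: one-hot encoding of a discrete state."""
    x = np.zeros(N_STATES)
    x[state] = 1.0
    return x

def v_hat(state, w):
    """Linear value estimate: x(s)^T w."""
    return features(state) @ w

def td0_update(w, state, reward, next_state, done, alpha=0.05, gamma=0.99):
    """Semi-gradient TD(0): move w towards the one-step bootstrapped target."""
    target = reward + gamma * v_hat(next_state, w) * (not done)
    error = target - v_hat(state, w)
    return w + alpha * error * features(state)   # gradient of x^T w w.r.t. w is x
```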
types of models
For function approximation, we can choose from
- Decision Trees / Random Forests
- Linear models
- Non-linear models (Neural Networks)
- Nearest Neighbours
- etc...
We choose models that can be differentiated!
(Linear and Non-linear)
defining a loss
- There is no "training data" in RL
- So, use Temporal Difference methods
- We create a guess target and an approximated guess.
- Try to minimise the difference between the target and the approximation.
defining a loss
Update the weights by minimising a mean-squared error:
$J(\mathbf{w}) = \mathbb{E}_\pi \left[ \left( R_{t+1} + \gamma \, \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}) \right)^2 \right]$
And use Gradient Descent to update the weights:
$\Delta \mathbf{w} = \alpha \left( R_{t+1} + \gamma \, \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w})$
FOR Q-LEARNING
Update the weights by minimising a mean-squared error:
$J(\mathbf{w}) = \mathbb{E} \left[ \left( R_{t+1} + \gamma \max_{a'} \hat{q}(S_{t+1}, a', \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w}) \right)^2 \right]$
And use Gradient Descent to update the weights:
$\Delta \mathbf{w} = \alpha \left( R_{t+1} + \gamma \max_{a'} \hat{q}(S_{t+1}, a', \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{q}(S_t, A_t, \mathbf{w})$
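Putting the loss and the gradient step together for Q-learning, here is a sketch of one semi-gradient weight update with a linear $\hat{q}(s, a, \mathbf{w})$; the state-action feature vectors are assumed to be provided by the caller:

```python
import numpy as np

def q_hat(x_sa, w):
    """Linear action-value estimate for a state-action feature vector x(s, a)."""
    return x_sa @ w

def q_learning_grad_step(w, x_sa, reward, next_xs, done, alpha=0.01, gamma=0.99):
    """One semi-gradient step on the squared TD error for Q-learning.

    next_xs: list of feature vectors x(S', a') for every action a' in state S'.
    The target uses max over q_hat(S', a'); the gradient is taken only through q_hat(S, A).
    """
    target = reward + gamma * max(q_hat(x, w) for x in next_xs) * (not done)
    td_error = target - q_hat(x_sa, w)
    return w + alpha * td_error * x_sa   # gradient of x^T w w.r.t. w is x
```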
the deep q-network
Given the previous information, we can use any function approximator to estimate the value of Q, with the condition that the function be differentiable.
When a Deep Neural Network is used as the function approximator, it's called a DQN.
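A minimal sketch of a DQN-style network and training step using Keras; the layer sizes, optimiser settings and batch format are illustrative assumptions, not the workshop's implementation:

```python
import numpy as np
from tensorflow import keras

def build_q_network(state_dim, n_actions):
    """Small fully-connected network mapping a state to one Q-value per action."""
    model = keras.Sequential([
        keras.layers.Dense(64, activation="relu", input_shape=(state_dim,)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(n_actions, activation="linear"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(1e-3), loss="mse")
    return model

def dqn_train_step(model, batch, gamma=0.99):
    """Fit the network towards the Q-learning targets for a sampled batch."""
    states, actions, rewards, next_states, dones = batch
    q_next = model.predict(next_states, verbose=0)     # Q(s', .) for bootstrapping
    targets = model.predict(states, verbose=0)         # start from current estimates
    targets[np.arange(len(actions)), actions] = (
        rewards + gamma * q_next.max(axis=1) * (1.0 - dones)
    )
    model.fit(states, targets, epochs=1, verbose=0)    # minimise the TD (MSE) loss
```

A full DQN additionally keeps an experience-replay buffer and a periodically updated target network, which are omitted here for brevity.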
Architecture
DEEPMIND atari dqn
flappy bird!
Implementation of a DQN agent on the game of Flappy Bird (Pygame)
Thank you
Deep Reinforcement Learning: A hands-on introduction
By Shubham Dokania
Workshop presentation for PyData 2017