Shubham Dokania
@shubhamdokania
shubham1810
Reinforcement Learning is about learning what to do - how to map situations to actions, so as to maximize a numerical reward signal. The learner (agent) is not told what to do, but instead it must discover which actions yield the most reward via trial-and-error.
Basic Reinforcement Learning Workflow
The Markov Property: $\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]$, i.e. the future is independent of the past, given the present.
A Markov Process (Markov Chain) is a memoryless random process which follows the Markov Property.
A Markov Process is defined by the tuple $\langle \mathcal{S}, \mathcal{P} \rangle$, where $\mathcal{S}$ is a finite set of states and $\mathcal{P}$ is the state transition probability matrix.
The probability of state transition is defined as:
$$\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$$
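As a toy illustration (not part of the slides), a Markov chain can be simulated by repeatedly sampling the next state from the row of the transition matrix for the current state; the states and probabilities below are made up.

```python
import numpy as np

# Hypothetical 3-state Markov chain; each row of P sums to 1.
states = ["sleep", "study", "play"]
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

def sample_chain(start=0, steps=10, seed=0):
    """Sample a trajectory; the next state depends only on the current one."""
    rng = np.random.default_rng(seed)
    s, path = start, [start]
    for _ in range(steps):
        s = rng.choice(len(states), p=P[s])   # Markov property: only P[s, :] matters
        path.append(s)
    return [states[i] for i in path]

print(sample_chain())
```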
A Markov Reward Process is a Markov process with Rewards/Values associated with states.
It's represented by the tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$.
The Reward function is $\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]$,
and $\gamma \in [0, 1]$ is the discount factor.
In an MRP, $G_t$ is defined as the discounted return, given by
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
But what is the need for a discount factor?
- Avoids infinite returns in cyclic or infinite-horizon processes
- Provides control over the trade-off between long-term and short-term rewards
- Mathematically convenient
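A minimal sketch of how the discounted return $G_t$ defined above is computed from a reward sequence (the rewards and $\gamma$ here are arbitrary):

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Hypothetical reward sequence
print(discounted_return([1, 0, 0, 10], gamma=0.9))  # 1 + 0.9**3 * 10 = 8.29
```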
An MDP is an MRP with decisions, represented by the tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$,
where $\mathcal{A}$ is a finite set of actions.
The transition probabilities and Reward function both depend on the actions.
The actions are governed by a policy $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$.
An RL agent can include one or more of the following: a policy, a value function, and a model.
A simple categorization of a few RL methods
State based value learning:
$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$
In general,
$$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$$
State-Action based value learning:
$$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$
In general,
$$q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$$
For state based value learning, the optimal value is $v_*(s) = \max_\pi v_\pi(s)$.
For state-action based learning, $q_*(s, a) = \max_\pi q_\pi(s, a)$.
At the optimal condition, the optimal policy is:
$$\pi_*(a \mid s) = \begin{cases} 1 & \text{if } a = \arg\max_{a' \in \mathcal{A}} q_*(s, a') \\ 0 & \text{otherwise} \end{cases}$$
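As a small illustration (assuming a tabular $q_*$ has already been estimated), acting greedily with respect to the action values recovers the optimal policy:

```python
import numpy as np

# Hypothetical Q-table: 4 states x 2 actions
Q = np.array([[0.1, 0.5],
              [0.7, 0.2],
              [0.0, 0.0],
              [0.3, 0.9]])

greedy_policy = Q.argmax(axis=1)  # pi*(s) = argmax_a q*(s, a)
print(greedy_policy)              # action index chosen in each state
```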
The update rule in SARSA for the state-action value is
$$Q(S, A) \leftarrow Q(S, A) + \alpha \left( R + \gamma Q(S', A') - Q(S, A) \right)$$
which is essentially a generalisation of the TD(0) update for state values,
$$V(S) \leftarrow V(S) + \alpha \left( R + \gamma V(S') - V(S) \right)$$
where $R + \gamma Q(S', A')$ is the TD target and
$$\delta = R + \gamma Q(S', A') - Q(S, A)$$
is the TD error.
SARSA follows the Bellman expectation equation.
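A minimal tabular SARSA sketch; the environment interface (reset() returning a state index, step(a) returning (next_state, reward, done)), the epsilon_greedy helper, and the hyperparameters are illustrative assumptions, not from the slides:

```python
import numpy as np

def epsilon_greedy(Q, s, n_actions, eps, rng):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def sarsa(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, n_actions, eps, rng)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, n_actions, eps, rng)
            # On-policy TD update: the target uses the action actually taken next.
            td_target = r + gamma * Q[s_next, a_next] * (not done)
            Q[s, a] += alpha * (td_target - Q[s, a])
            s, a = s_next, a_next
    return Q
```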
The update in Q-learning is:
$$Q(S, A) \leftarrow Q(S, A) + \alpha \left( R + \gamma \max_{a'} Q(S', a') - Q(S, A) \right)$$
For the behaviour policy, we may use an $\epsilon$-greedy policy.
Implementation of Q-learning in a gridworld-like environment
Code: https://goo.gl/CE8xpC
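The snippet below is not the linked code; it is a minimal Q-learning loop under the same assumed environment interface and epsilon_greedy helper as the SARSA sketch above:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Behaviour policy: eps-greedy (epsilon_greedy as in the SARSA sketch).
            a = epsilon_greedy(Q, s, n_actions, eps, rng)
            s_next, r, done = env.step(a)
            # Off-policy TD update: the target maximises over next actions,
            # regardless of which action the behaviour policy will actually take.
            td_target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
    return Q
```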
Instead of using a discrete (tabular) state value representation, use a function approximator $\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$ or $\hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a)$.
For instance, consider a linear combination:
$$\hat{v}(s, \mathbf{w}) = \mathbf{x}(s)^\top \mathbf{w}$$
where the weights $\mathbf{w}$ can be updated using TD methods and $\mathbf{x}(s)$ is a feature representation of state $s$.
For function approximation, we can choose from many families of models.
We choose models that can be differentiated!
(Linear and Non-linear)
Update the weights by minimising the mean-squared error between the true value and the approximate value,
$$J(\mathbf{w}) = \mathbb{E}_\pi \left[ \left( v_\pi(S) - \hat{v}(S, \mathbf{w}) \right)^2 \right]$$
and use Gradient Descent to update the weights:
$$\Delta \mathbf{w} = \alpha \left( v_\pi(S) - \hat{v}(S, \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{v}(S, \mathbf{w})$$
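A sketch of semi-gradient TD(0) with the linear approximator $\hat{v}(s, \mathbf{w}) = \mathbf{x}(s)^\top \mathbf{w}$; the transition stream and feature function are placeholders, not from the slides:

```python
import numpy as np

def td0_linear(transitions, features, n_features, alpha=0.01, gamma=0.99):
    """Semi-gradient TD(0) for a linear value function v_hat(s, w) = x(s) @ w.

    `transitions` is an iterable of (s, r, s_next, done) tuples and
    `features(s)` returns the feature vector x(s); both are placeholders.
    """
    w = np.zeros(n_features)
    for s, r, s_next, done in transitions:
        x = features(s)
        v = x @ w
        v_next = 0.0 if done else features(s_next) @ w
        td_error = r + gamma * v_next - v   # TD target minus current estimate
        w += alpha * td_error * x           # gradient of v_hat w.r.t. w is x(s)
    return w
```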
Given the previous information, we can use any function approximator for estimating the value of Q, with the condition that the function be differentiable.
In a scenario where a Deep Neural Network is used as the function approximator, it is called a DQN (Deep Q-Network).
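A compressed sketch of the core DQN gradient step in PyTorch; the network size, minibatch format, and hyperparameters are illustrative assumptions, and a full agent would also need a replay buffer, target-network syncing, and an exploration schedule:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Small MLP mapping a state vector to one Q-value per action."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.net(x)

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on a minibatch (s, a, r, s_next, done) of tensors.

    `a` is a LongTensor of action indices; `done` is a float tensor of 0/1 flags.
    """
    s, a, r, s_next, done = batch
    # Q(s, a) for the actions actually taken.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrapped target uses a separate, slowly-updated target network.
        max_q_next = target_net(s_next).max(dim=1).values
        target = r + gamma * max_q_next * (1.0 - done)
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```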