Introduction to Reinforcement Learning
Part 1 (probably the only one): Tabular Value-based RL
by Pavel Temirchev
Deep RL reading group
Background
- Basic Machine Learning (Linear Algebra, Probability, Stochastic Optimization, Neural Networks)
- Basic Scientific Python (numpy, matplotlib)
- Desirable, but not mandatory: PyTorch
Course Plan
Lectures                                  | Homework
Introduction, Tabular RL                  | Value Iteration
Value-based RL: Q-learning                | Playing games with DQN
Policy-based RL: REINFORCE                | - // -
Exploitation vs. Exploration dilemma      | Thoroughly read an article
Oral Exam                                 | - // -
(optional) RL as Probabilistic Inference  | - // -
Assessment criteria
In order to pass the course, students should:
- Pass 2 out of 2 assignments
- Pass the oral exam
Passing the exam:
- (1 day before exam) - choose an article from a given list
- Thoroughly describe the chosen article during the exam
Passing the assignments:
- code up mandatory formulas
- analyze the results for different initializations / parameters / models
What problems does RL solve?
Assume we have:
- Environment (with its state)
- Agent
- Agent's actions within the environment
We want an agent to act optimally in the environment.
Have we forgotten something?
The measure of optimality: the REWARD.
[Diagram: the agent-environment interaction loop. The agent takes actions; the environment returns a new state and a reward.]
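To make this loop concrete, here is a minimal sketch of the interaction written against the Gymnasium API; the CartPole environment and the random action choice are placeholders for illustration, not part of the lecture:

```python
import gymnasium as gym

# Create an environment (CartPole is used purely as an illustration).
env = gym.make("CartPole-v1")

state, info = env.reset(seed=0)
total_reward = 0.0

for t in range(200):
    # A placeholder "policy": sample a random action from the action space.
    action = env.action_space.sample()

    # The environment returns the next state and a scalar reward.
    state, reward, terminated, truncated, info = env.step(action)
    total_reward += reward

    if terminated or truncated:
        break

print("Return of the random policy:", total_reward)
```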
Example 1: Robotics (baking pancakes)
Environment state:
- coordinates x, y, z of all joints
Agent's actions:
- voltage applied to all joints
Rewards:
- pancake flipped: +1
- pancake baked: +10
- pancake failed: -10
Example 2: Self-driving cars
Environment state:
- images from the camera
- measurements from the sensors
Agent's actions:
- gas and brake applied
Rewards:
- destination point reached: +1
- traffic rule broken: -1
- accident: -10
Example 3: Chess
Environment state:
- coordinates x, y of all pieces
Agent's actions:
- a move of a chosen piece
Rewards:
- win: +1
- loss: -1
Some formal definitions
\( \mathcal{S} \) - the set of environment states
\( \mathcal{A} \) - the set of the agent's actions
\( r \in \mathbb{R} \) - the reward (a scalar)
\( p(s' \mid s, a) \) - transition probabilities
\( \pi(a \mid s) \) - the agent's policy (behavior, strategy)
\( r(s, a) \) - the reward function
Reminder: Tabular Definition of Functions
If the domain of function \(f\) is finite, then it can be written as a table:
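For instance (a sketch with made-up state and action names), a Q-function over finite sets of states and actions is just a 2-D array indexed by (state, action):

```python
import numpy as np

# Hypothetical finite domains: 3 states and 2 actions.
states = ["s0", "s1", "s2"]
actions = ["left", "right"]

# The tabular Q-function: one cell per (state, action) pair.
Q = np.zeros((len(states), len(actions)))

# "Evaluating" the function is just indexing into the table.
Q[1, 0] = 0.7          # Q(s1, left) = 0.7
print(Q[1, 0])
```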
Markov Decision Processes (finite-time):
The aim is to maximize the expected cumulative reward:
\( \max_\pi \; \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{T} r_t \right] \)
Trajectory:
\( \tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_T) \)
New states depend only on the previous state and the action made (not on the history!):
\( p(s_{t+1} \mid s_t, a_t, \dots, s_0, a_0) = p(s_{t+1} \mid s_t, a_t) \)
So it is enough to make decisions using only the current state (not the history!)
Markov Decision Processes (infinite-time):
What if the process is infinite?
Then the cumulative reward \( \sum_{t=0}^{\infty} r_t \) can be unbounded.
Let's make it bounded again!
Introduce a discount factor \( \gamma \in [0, 1) \).
Discounted sum of rewards:
\( G = \sum_{t=0}^{\infty} \gamma^t r_t \)
Then, given a bounded reward \( |r_t| \le R_{\max} \):
\( |G| \le \sum_{t=0}^{\infty} \gamma^t R_{\max} = \frac{R_{\max}}{1 - \gamma} \)
Cake today is better than cake tomorrow:
discounting encourages the agent to get rewards faster!
Cake eating problem:
At which time-step should you eat the cake?
The episode terminates after eating.
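A short worked version of the argument, assuming the cake gives a single reward \( r_{\text{cake}} > 0 \) at the moment it is eaten and nothing afterwards:

```latex
% Value of the policy "eat the cake at time-step t":
G(t) = \sum_{k=0}^{\infty} \gamma^k r_k = \gamma^t \, r_{\text{cake}}, \qquad 0 \le \gamma < 1
% G(t) is maximized at t = 0: with discounting, eat the cake as early as possible.
```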
Why not Supervised Machine Learning?
Advertisement problem:
supervised learning would need a dataset of (situation, correct action) pairs.
But there is no way to take time dependencies into account!
And there is no such dataset either!
State-value function (V-function):
\( V^\pi(s) = \mathbb{E}_\pi \left[ \sum_{t \ge 0} \gamma^t r_t \,\middle|\, s_0 = s \right] \)
Action-value function (Q-function):
\( Q^\pi(s, a) = \mathbb{E}_\pi \left[ \sum_{t \ge 0} \gamma^t r_t \,\middle|\, s_0 = s, \, a_0 = a \right] \)
Optimality Bellman Equation
Theorem (kinda):
If the Q-function of a policy \(\pi^*\) satisfies, for any state-action pair \( (s_t, a_t) \):
\( Q^{\pi^*}(s_t, a_t) = \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t)} \left[ r(s_t, a_t) + \gamma \max_{a_{t+1}} Q^{\pi^*}(s_{t+1}, a_{t+1}) \right] \)
then the policy \(\pi^*\) is the optimal policy,
and \(Q^{\pi^*} = Q^*\) is the optimal action-value function.
If the optimal Q-function is known for any state-action pair \( (s_t, a_t) \) (it is just a table),
then recovering the optimal policy is easy:
\( \pi^*(s) = \arg\max_a Q^*(s, a) \)
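As a sketch (the table shape is an arbitrary placeholder), recovering the greedy policy from a tabular Q-function is a single argmax per state:

```python
import numpy as np

# Assume Q has shape (n_states, n_actions).
Q = np.random.rand(5, 3)

# The greedy (optimal, if Q is optimal) policy: pick the best action in each state.
policy = Q.argmax(axis=1)      # shape (n_states,)
print(policy)
```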
Dynamic Programming
for Optimality Bellman Equation
The loop:
while not converged:
    for each \( s \in \mathcal{S} \):
        for each \( a \in \mathcal{A} \):
            \( Q(s, a) \leftarrow \sum_{s'} p(s' \mid s, a) \left[ r(s, a) + \gamma \max_{a'} Q(s', a') \right] \)
This will converge to the optimal Q-function.
What's wrong with this algorithm? Can you use it in practice?
It requires us to know the transition probabilities \( p(s'|s, a) \)
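Below is a minimal NumPy sketch of this Q-value iteration loop; it assumes the model is given as arrays P (transition probabilities, shape [S, A, S]) and R (rewards, shape [S, A]), a convention chosen for the example rather than fixed by the lecture:

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.99, tol=1e-8, max_iters=10_000):
    """Tabular Q-value iteration with a known model.

    P: transition probabilities, shape (S, A, S), P[s, a, s'] = p(s' | s, a)
    R: rewards, shape (S, A), R[s, a] = r(s, a)
    """
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))

    for _ in range(max_iters):
        # Bellman optimality backup for every (s, a) at once:
        # Q(s, a) <- r(s, a) + gamma * sum_{s'} p(s'|s, a) * max_{a'} Q(s', a')
        Q_new = R + gamma * P @ Q.max(axis=1)
        if np.max(np.abs(Q_new - Q)) < tol:
            Q = Q_new
            break
        Q = Q_new

    greedy_policy = Q.argmax(axis=1)
    return Q, greedy_policy
```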
Q-learning
Dynamic Programming with Monte-Carlo sampling
REMINDER: Monte-Carlo estimate of an expectation:
\( \mathbb{E}_{x \sim p(x)} \left[ f(x) \right] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i), \quad x_i \sim p(x) \)
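A tiny numerical illustration of such an estimate (the distribution and the function are arbitrary choices for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E_{x ~ N(0, 1)}[x^2] (true value is 1.0) by averaging samples.
samples = rng.normal(loc=0.0, scale=1.0, size=10_000)
mc_estimate = np.mean(samples ** 2)
print(mc_estimate)   # close to 1.0
```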
The loop:
while not converged:
    for t = 0 to T:
        act in the environment: \( a_t \sim \pi_b(a \mid s_t) \), observe \( r_t \) and \( s_{t+1} \)
        (the choice of the behavioral policy \( \pi_b \) will be discussed on the next slide)
        update the table, using exponential averaging as the MC estimate:
        \( Q(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') \right] \)
This will (with some dirty hacks) converge to the optimal Q-function.
What's wrong with this algorithm? Can you use it in practice?
Q-learning
Exploration vs. Exploitation
Example: learning how to move forward.
With an inaccurate Q-function estimate, the purely greedy policy can get stuck and never try the action that is actually better.
Solution (\(\epsilon\)-greedy strategy): add noise to \( \pi \):
take a random action with probability \( \epsilon \), otherwise act greedily w.r.t. \( Q \).
Q-learning
Graphical representation of the learning loop
[Diagram: the learning loop]
1. behave in the environment with the behavioral policy \(\pi_b\)
2. collect transitions \( (s_t, a_t, r_t, s_{t+1}) \)
3. update the \(Q\)-function
4. set the greedy policy \( \pi(s) = \arg\max_a Q(s, a) \)
5. set the behavioral policy \(\pi_b\) (\(\epsilon\)-greedy around \(\pi\))
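Putting the pieces together, here is a sketch of tabular Q-learning with an \(\epsilon\)-greedy behavioral policy; it assumes a Gymnasium-style environment with discrete states and actions, and all hyperparameters are illustrative defaults rather than recommended values:

```python
import numpy as np

def q_learning(env, n_episodes=500, alpha=0.5, gamma=0.99,
               eps_start=1.0, eps_decay=0.99, seed=0):
    """Tabular Q-learning sketch for an env with Discrete observation/action spaces."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    eps = eps_start

    for episode in range(n_episodes):
        state, _ = env.reset(seed=seed + episode)
        done = False
        while not done:
            # epsilon-greedy behavioral policy pi_b
            if rng.random() < eps:
                action = env.action_space.sample()
            else:
                action = int(Q[state].argmax())

            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # exponential-averaging update towards the Bellman target
            target = reward + (0.0 if terminated else gamma * Q[next_state].max())
            Q[state, action] = (1 - alpha) * Q[state, action] + alpha * target

            state = next_state

        eps *= eps_decay  # decrease exploration after each episode

    return Q, Q.argmax(axis=1)
```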
Q-learning example
Windy Gridworld Navigation Problem
[Diagram: a gridworld with start and goal cells; an upward wind of varying strength blows the agent off course in some columns]
States: the cells of the grid
Actions: moves up / down / left / right
Rewards:
Discounting:
Learning rate:
\(\epsilon\)-greedy strategy (decrease \(\epsilon\) after each episode)
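For completeness, a compact environment sketch that can be plugged into the q_learning function above. It follows the classic Sutton and Barto windy-gridworld layout (a 7x10 grid with column-wise upward wind, reward -1 per step); the exact numbers used on the slide are not shown, so treat these as assumptions:

```python
import numpy as np
from gymnasium.spaces import Discrete

class WindyGridworld:
    """Minimal windy gridworld with a Gymnasium-like reset/step interface."""

    HEIGHT, WIDTH = 7, 10
    WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]        # upward wind strength per column (assumed)
    START, GOAL = (3, 0), (3, 7)
    MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

    def __init__(self):
        self.observation_space = Discrete(self.HEIGHT * self.WIDTH)
        self.action_space = Discrete(len(self.MOVES))

    def reset(self, seed=None):
        self.pos = self.START
        return self._state(), {}

    def step(self, action):
        dr, dc = self.MOVES[action]
        r, c = self.pos
        # apply the chosen move plus the wind of the current column, then clip to the grid
        r = int(np.clip(r + dr - self.WIND[c], 0, self.HEIGHT - 1))
        c = int(np.clip(c + dc, 0, self.WIDTH - 1))
        self.pos = (r, c)
        terminated = self.pos == self.GOAL
        reward = -1.0                             # assumed: -1 per time step until the goal
        return self._state(), reward, terminated, False, {}

    def _state(self):
        r, c = self.pos
        return r * self.WIDTH + c
```

It can then be trained with the sketch above, e.g. `Q, policy = q_learning(WindyGridworld())`.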
Approximate Q-learning
(one step behind neural networks)
Why do we need approximations? Isn't everything fine as it is?
Not quite: for Atari games the state is a raw screen image, so the number of possible states is astronomically large, far more than the number of atoms in the Universe.
No table with one cell per state can ever be stored.
DQN (Deep Q-Network): approximate \( Q(s, a) \) with a neural network instead of a table.
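As a teaser for the approximate setting, a minimal PyTorch sketch of a Q-network that replaces the table; the layer sizes and the state/action dimensions are arbitrary placeholders:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action (instead of a table lookup)."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Greedy action selection, analogous to argmax over a row of the Q-table.
q_net = QNetwork(state_dim=4, n_actions=2)
state = torch.zeros(1, 4)                 # a dummy state, batch of size 1
action = q_net(state).argmax(dim=1).item()
```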
Thank you for your attention!
Links (clickable):
for reading:
online course: