Artyom Sorokin | 05 Feb
Supervised Learning case:
Given a dataset \(D = \{(X_i, y_i)\}_{i=1}^{N}\)
Learn a function that will predict \(y\) from \(X\): \(f_\theta(X) \approx y\)
e.g. find parameters \(\theta\) that will minimize \(\sum_{i=1}^{N} L(f_\theta(X_i), y_i)\), where \(L\) is a loss function
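A minimal sketch of this objective (assuming a linear model, a squared-error loss, and plain gradient descent; the toy data and names like `predict` and `loss` are illustrative, not from the lecture):

```python
import numpy as np

# Toy dataset D = {(X_i, y_i)}: 100 samples, 3 features each (illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

theta = np.zeros(3)            # parameters to learn
lr = 0.1                       # gradient-descent step size

def predict(theta, X):
    return X @ theta           # f_theta(X)

def loss(theta, X, y):
    return np.mean((predict(theta, X) - y) ** 2)   # L(f_theta(X), y)

# Find theta that minimizes the average loss with plain gradient descent.
for _ in range(500):
    grad = 2 * X.T @ (predict(theta, X) - y) / len(y)
    theta -= lr * grad

print(loss(theta, X, y))       # should end up close to the noise level
```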
Standard Assumptions:
What if you don't have answers at all?
What if your answers are not good enough?
Assume that we have expert trajectories, i.e. sufficiently good answers:
Will this work?
Yes, but only if for any possible behavior there is a close example from the data
Not always possible!
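One way to read "learn from expert trajectories" is plain behavioral cloning: flatten the expert data into (observation, action) pairs and train a classifier on them. A minimal sketch, assuming scikit-learn, discrete actions, and toy data (all names and numbers are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Expert trajectories flattened into (observation, action) pairs (toy data).
rng = np.random.default_rng(0)
expert_obs = rng.normal(size=(500, 4))                  # o_t
expert_act = (expert_obs[:, 0] > 0).astype(int)         # a_t chosen by the "expert"

# Behavioral cloning = supervised learning of pi(a | o) on expert data.
policy = LogisticRegression().fit(expert_obs, expert_act)

# The cloned policy is only reliable on observations close to the expert data.
new_obs = rng.normal(size=(5, 4))
print(policy.predict(new_obs))
```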
New Plan (DAgger algorithm):
1. Train \(\pi_\theta(a_t \mid o_t)\) on expert data \(D\)
2. Run \(\pi_\theta(a_t \mid o_t)\) to collect new observations \(D'\)
3. Ask humans to label \(D'\) with actions \(a_t\)
4. Aggregate \(D \leftarrow D \cup D'\) and repeat from step 1
But step 3 is really hard to do: asking humans to label \(D'\) with actions \(a_t\).
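A toy sketch of the DAgger loop under the same setup as above (the `expert_label` and `run_policy` functions are stand-ins for the human labeling in step 3 and for rolling out the policy in an environment; everything here is illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def expert_label(obs):
    # Stand-in for a human/expert labeling observations with actions (step 3).
    return (obs[:, 0] > 0).astype(int)

def run_policy(policy, n=200):
    # Stand-in for step 2; in a real setting the visited observations
    # would depend on the actions the current policy takes.
    return rng.normal(size=(n, 4))

# Initial dataset from the expert.
D_obs = rng.normal(size=(200, 4))
D_act = expert_label(D_obs)

for _ in range(5):                                  # DAgger iterations
    policy = LogisticRegression().fit(D_obs, D_act)  # 1. train pi on D
    new_obs = run_policy(policy)                     # 2. run pi to collect D'
    new_act = expert_label(new_obs)                  # 3. label D' with expert actions
    D_obs = np.vstack([D_obs, new_obs])              # 4. aggregate D <- D u D'
    D_act = np.concatenate([D_act, new_act])
```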
Do we still need Reinforcement Learning?
Impressive results in image and text generation with SL!
SL results: impressive because it looks like something a person might do!
RL results: impressive because no person had thought of it!
If you know what you want, but don't know how to do it... USE REWARDS!
WIN: +1
LOSE: -1
Assumptions:
You have an Agent and an Environment that interact with each other:
Interaction with the environment is typically divided into episodes.
The Agent has a policy: \(\pi(a_t \mid s_t)\)
Agent learns its policy via Trial and Error!
The goal is to find a policy that maximizes the total expected reward: \(\pi^* = \arg\max_\pi \mathbb{E}\left[\sum_t r_t\right]\)
Why do we need the expectation \(\mathbb{E}\)?
A non-deterministic policy or environment leads to a distribution of total rewards!
Why not use, e.g., the maximum or the minimum of that distribution instead?
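Putting the pieces above together as code: a minimal Gymnasium-style interaction loop (the environment name, episode count, and the random policy are purely illustrative). Different episodes give different total rewards, which is exactly the distribution the expectation averages over:

```python
import gymnasium as gym
import numpy as np

env = gym.make("CartPole-v1")          # example environment
returns = []

for episode in range(10):              # interaction is divided into episodes
    obs, info = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()   # a (random) policy pi(a_t | s_t)
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        done = terminated or truncated
    returns.append(total_reward)

# A distribution of total rewards; the RL objective maximizes its
# expectation, estimated here by the sample mean.
print(np.mean(returns))
```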
What should an agent observe?
Is this enough?
Does the agent need past observations?
Task: Open the red door with the key
Details: The agent starts at a random location
Actions:
Which observations are enough to learn the optimal policy?
For 2 and 3, the agent doesn't need to remember its history:
Markov property: "The future is independent of the past given the present."
MDP is a 5-tuple \(\langle S, A, R, T, \gamma \rangle\): states \(S\), actions \(A\), reward function \(R\), transition function \(T\), and discount factor \(\gamma\).
Given the Agent's policy \(\pi\), the RL objective becomes: \(J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t} \gamma^t R(s_t, a_t)\right]\)
Discount factor \(\gamma\) determines how much we should care about the future!
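A tiny hand-made MDP as a concrete instance of the 5-tuple, together with the discounted return that \(\gamma\) controls (the states, rewards, and transition probabilities are made up for illustration):

```python
import numpy as np

# A hand-made MDP <S, A, R, T, gamma> with 2 states and 2 actions.
S = [0, 1]
A = [0, 1]
R = np.array([[0.0, 1.0],          # R[s][a]: reward for action a in state s
              [2.0, 0.0]])
T = np.array([[[0.9, 0.1],         # T[s][a][s']: probability of s -> s' under a
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.1, 0.9]]])
gamma = 0.9

def discounted_return(rewards, gamma):
    # sum_t gamma^t * r_t : gamma sets how much we care about the future
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Roll out a fixed policy (always take action 1) for 50 steps.
rng = np.random.default_rng(0)
s, rewards = 0, []
for t in range(50):
    a = 1
    rewards.append(R[s, a])
    s = rng.choice(S, p=T[s, a])
print(discounted_return(rewards, gamma))
```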
MDP for Contextual Multi-Armed Bandits: states are sampled i.i.d., every episode has length 1.
MDP for Multi-Armed Bandits: a single state, every episode has length 1.
ChatGPT was fine-tuned in this setting with the PPO algorithm (explained in lectures 6 & 7).
You can formulate a SL problem as an RL problem!
Given a dataset \(D = \{(X_i, y_i)\}_{i=1}^{N}\):
We consider \(X_i\) as states and \(y_i\) as correct actions!
Then the reward function will be \(R(X_i, a_i) = 1\) if \(a_i = y_i\), else \(0\).
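A sketch of that reward function: each \(X_i\) is a state, the predicted class is the action, and the reward only says whether the chosen action matched \(y_i\). This is exactly the contextual-bandit setting above (the toy pairs and the fixed action are illustrative):

```python
def reward(y_true, action):
    # R(X_i, a_i) = 1 if a_i == y_i else 0
    return 1.0 if action == y_true else 0.0

# One "episode" per sample: observe the state X_i, pick an action, get a scalar reward.
dataset = [([0.2, 0.7], 3), ([0.9, 0.1], 0)]   # toy (X_i, y_i) pairs, labels are class ids
for X_i, y_i in dataset:
    a_i = 3                                    # whatever action the current policy picks
    print(reward(y_i, a_i))                    # 1.0 for the first pair, 0.0 for the second
```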
Why don't we use Reinforcement Learning everywhere?
Because Reinforcement Learning is a harder problem!
Reward contains less information than a correct answer!
We have ground truth labels: fox: 0, bread: 0, truck: 0, dog: 1
We have rewards: only the chosen class gets a reward (here -3); for the other classes the reward stays unknown (?)
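The difference in information content, made concrete: with a label you get a learning signal for every class, while a reward only tells you about the single action you took (toy numbers, purely illustrative):

```python
import numpy as np

classes = ["fox", "bread", "truck", "dog"]
probs = np.array([0.1, 0.2, 0.3, 0.4])          # current classifier / policy output

# Supervised learning: the full target distribution is known.
label = np.array([0.0, 0.0, 0.0, 1.0])          # ground truth: "dog"
cross_entropy = -np.sum(label * np.log(probs))  # a gradient signal for every class

# Reinforcement learning: we only observe a reward for the action we chose.
chosen = 0                                       # we picked "fox"
reward = -3.0                                    # ...and got punished
# We learned nothing about "bread", "truck" or "dog" on this step.
print(cross_entropy, classes[chosen], reward)
```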
Reward is a proxy for your goal, but they are not the same!
Goal: Train a bot to win the game!
Rewards:
Your data is not i.i.d. Previous actions affect future states and rewards.
Credit Assignment Problem:
How to determine which actions are responsible for the outcome?
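A small sketch of why this is hard, plus one standard (but crude) answer that the lecture does not yet introduce: give each action credit for everything that happened after it, i.e. the discounted reward-to-go (the episode and \(\gamma\) are illustrative):

```python
# Credit assignment: the only reward arrives at the very end of the episode,
# yet every earlier action may have contributed to it.
rewards = [0, 0, 0, 0, 0, 0, 0, 1]     # win at the last step, silence before it

gamma = 0.99
reward_to_go, g = [], 0.0
for r in reversed(rewards):
    g = r + gamma * g                  # sum of future (discounted) rewards
    reward_to_go.append(g)
reward_to_go.reverse()
print(reward_to_go)                    # earlier actions get discounted credit for the win
```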
The training dataset is changing with the policy.
This can lead to a catastrophic forgetting problem:
The agent unlearns its policy in some parts of the state space.
As a team, we have conducted several RL courses:
RL Basics: code this
Deep Reinforcement Learning: code this
Advanced Topics: learn about this
Grades are given on a ten-point scale.
Six assignments are planned. Each assignment is worth 2 or 1 points depending on its difficulty.
If an assignment is submitted within two weeks after the deadline, it is worth 80% of its original value; later than that, only 60%.
At the end of the course there will be an optional test worth 2 points in total.
Details: