Introduction to Reinforcement Learning

part 1 (probably the only one)

Tabular Value-based RL

made by Pavel Temirchev

Deep RL reading group

Some formal definitions

$s_t \in \mathcal{S}$ - the set of environment states

$a_t \in \mathcal{A}$ - the set of the agent's actions

$r_t \in \mathbb{R}$ - the reward (a scalar)

$s_{t+1} \sim p(s_{t+1}|s_t, a_t)$ - transition probabilities

$a_t \sim \pi(a_t|s_t)$ - the agent's policy (behavior, strategy)

$r_t = r(s_{t+1}, s_t, a_t)$ - reward function

Reminder: Tabular Definition of Functions

If the domain of function $f$ is finite, then it can be written as a table:

\begin{array}{c|ccccc}
\mathcal{X} & x_0 & x_1 & x_2 & x_3 & x_4 \\
\hline
f & f(x_0) & f(x_1) & f(x_2) & f(x_3) & f(x_4)
\end{array}

A function of two arguments becomes a two-dimensional table:

\begin{array}{c|cc}
f & x_0 \in \mathcal{X} & x_1 \in \mathcal{X} \\
\hline
y_0 \in \mathcal{Y} & f(x_0, y_0) & f(x_1, y_0) \\
y_1 \in \mathcal{Y} & f(x_0, y_1) & f(x_1, y_1)
\end{array}
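As a minimal sketch of the tabular idea, a function on a finite domain can be stored as a lookup table and evaluated by indexing. The names `f_table` and `g_table` and all the values are made up for illustration.

```python
# One-argument function on the finite domain {x_0, ..., x_4}, stored as a list.
# The values are arbitrary placeholders.
f_table = [0.5, -1.0, 2.0, 0.0, 3.5]

def f(i):
    """Evaluate f(x_i) by table lookup."""
    return f_table[i]

# Two-argument function on {x_0, x_1} x {y_0, y_1}, stored as a nested list:
# g_table[i][j] holds the value for the pair (x_i, y_j).
g_table = [[1.0, 2.0],
           [3.0, 4.0]]

def g(i, j):
    """Evaluate the two-argument function at (x_i, y_j) by table lookup."""
    return g_table[i][j]
```

Policies, value functions and transition probabilities over finite state and action sets can all be stored this way.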

Markov Decision Processes (finite-time):

Trajectory:

\tau = \{s_0, a_0, s_1, \dots, s_t, a_t, s_{t+1}, \dots, a_{T-1}, s_T \}

p(\tau) = p(s_0) \prod_{t=0}^{T-1} p(s_{t+1}|s_t, a_t) \pi(a_t|s_t)

where $p(s_0)$ is the distribution of the initial state.

New states depend only on the previous state and the action taken (not on the history!)

It is enough to make decisions using only the current state (not the history!)

The aim is to maximize the expected cumulative reward:

\mathbb{E}_{\tau \sim p(\tau)} \sum_{t=0}^{T-1} r_t \rightarrow \max_\pi
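Because of the Markov property, a rollout needs only the current state at every step. The following is a rough sketch of sampling a trajectory from tabular $p$ and $\pi$. The 2-state, 2-action MDP, the function name `sample_trajectory` and the nested-list layout are assumptions made for illustration.

```python
import random

def sample_trajectory(P, pi, s0, T, rng):
    """Roll out T steps: a_t ~ pi(.|s_t), then s_{t+1} ~ p(.|s_t, a_t).

    P[s][a] is a list of probabilities over next states,
    pi[s] is a list of probabilities over actions.
    """
    states, actions = [s0], []
    s = s0
    for _ in range(T):
        a = rng.choices(range(len(pi[s])), weights=pi[s])[0]
        s = rng.choices(range(len(P[s][a])), weights=P[s][a])[0]
        actions.append(a)
        states.append(s)
    return states, actions

# Hypothetical 2-state, 2-action MDP used only for illustration.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.0, 1.0], [0.5, 0.5]]]
pi = [[0.5, 0.5], [0.5, 0.5]]
states, actions = sample_trajectory(P, pi, s0=0, T=5, rng=random.Random(0))
```

The returned lists correspond to $s_0, \dots, s_T$ and $a_0, \dots, a_{T-1}$.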

Markov Decision Processes (infinite-time):

What if the process is infinite?

T \rightarrow +\infty

Then the expected cumulative reward

\mathbb{E}_{\tau \sim p(\tau)} \sum_{t=0}^\infty r_t

can be unbounded.

Let's make it bounded again! Introduce $\gamma$ - the discount factor:

0 \le \gamma < 1

Then, given bounded reward:

\mathbb{E}_{\tau \sim p(\tau)} \sum_{t=0}^\infty \gamma^t r_t < \infty
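One way to see why the discounted sum is finite is to bound it by a geometric series. This short derivation assumes $|r_t| \le R_{\max}$ for all $t$; the symbol $R_{\max}$ is introduced here and does not appear on the slides.

```latex
\left| \sum_{t=0}^\infty \gamma^t r_t \right|
\le \sum_{t=0}^\infty \gamma^t |r_t|
\le R_{\max} \sum_{t=0}^\infty \gamma^t
= \frac{R_{\max}}{1 - \gamma}
```

The bound holds for every trajectory, so it also holds for the expectation over trajectories.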

Discounted sum of rewards:

r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 r_3 + \dots

Cake today is better than cake tomorrow: discounting encourages the agent to get rewards faster!

Cake eating problem:

\mathcal{A} = \{ \text{eat}, \text{do nothing} \}

r(\text{eat}) = 1

r(\text{do nothing}) = 0

The episode terminates after eating.

At which time step should you eat the cake?
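A small sketch of the cake problem: if the agent does nothing for $k$ steps and then eats, the discounted return is $\gamma^k$, so eating immediately is best for any $\gamma < 1$. The value `gamma = 0.9` and the helper names are assumptions chosen for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute r_0 + gamma * r_1 + gamma^2 * r_2 + ..."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

def cake_rewards(eat_at):
    """Rewards when the agent does nothing until step eat_at, then eats."""
    return [0.0] * eat_at + [1.0]

gamma = 0.9  # hypothetical discount factor
# Returns for eating at step 0, 1, 2, 3: gamma^0, gamma^1, gamma^2, gamma^3.
returns = [discounted_return(cake_rewards(k), gamma) for k in range(4)]
```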

Why not Supervised Machine Learning?

Advertisement problem:

Need a dataset:

(s_0, a_0) \rightarrow p(click)

(s_0, a_1) \rightarrow p(click)

\dots

(s_N, a_N) \rightarrow p(click)

No way to take time dependencies into account!

And there is no dataset either!

State-value function (v-function)

V^\pi(s_t) = \mathbb{E}_{a_t} \mathbb{E}_{s_{t+1}} \Bigg[ r_t + \mathbb{E}_{a_{t+1}} \mathbb{E}_{s_{t+2}} \Big[ \gamma r_{t+1} + \mathbb{E}_{a_{t+2}} \mathbb{E}_{s_{t+3}}[\dots]\Big]\Bigg]

With the expectations written out as sums:

V^\pi(s_t) = \sum_{a_t} \pi(a_t|s_t) \sum_{s_{t+1}} p(s_{t+1}|s_t,a_t)\Bigg[ r_t + \dots \Bigg]

At the last step of a finite episode ($s_{T-1} \rightarrow a_{T-1} \rightarrow s_T$), where $r_{T-1} = r(s_T, s_{T-1}, a_{T-1})$:

V^\pi(s_{T-1}) = \sum_{a_{T-1}} \pi(a_{T-1}|s_{T-1}) \sum_{s_T} p(s_T|s_{T-1},a_{T-1}) r_{T-1}

Recursive form:

V^\pi(s_t) = \mathbb{E}_{a_t} \mathbb{E}_{s_{t+1}} \Bigg[ r_t + \gamma V^\pi(s_{t+1}) \Bigg]
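The recursive form can be turned into a simple iterative procedure: start from $V = 0$ and apply the right-hand side repeatedly. This is a sketch only. The function name `evaluate_policy`, the nested-list layout and the toy MDP are assumptions, and the transition probabilities are taken as known.

```python
def evaluate_policy(P, R, pi, gamma, n_sweeps=500):
    """Iterate V(s) <- sum_a pi(a|s) sum_s' p(s'|s,a) [r(s',s,a) + gamma V(s')].

    P[s][a][s2] is p(s2|s,a); R[s][a][s2] is r(s2, s, a); pi[s][a] is pi(a|s).
    """
    n_states = len(P)
    V = [0.0] * n_states
    for _ in range(n_sweeps):
        V = [
            sum(
                pi[s][a] * P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                for a in range(len(P[s]))
                for s2 in range(n_states)
            )
            for s in range(n_states)
        ]
    return V

# Hypothetical 2-state MDP in which every transition pays reward 1,
# so the value of every state is 1 / (1 - gamma).
P = [[[0.5, 0.5], [1.0, 0.0]],
     [[0.0, 1.0], [0.3, 0.7]]]
R = [[[1.0, 1.0], [1.0, 1.0]],
     [[1.0, 1.0], [1.0, 1.0]]]
pi = [[0.5, 0.5], [0.2, 0.8]]
V = evaluate_policy(P, R, pi, gamma=0.5)
```

With constant reward 1 and $\gamma = 0.5$ the result should be close to 2 for both states, which gives a quick sanity check.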

Action-value function (q-function)

Q^\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}} \Bigg[ r_t + \mathbb{E}_{a_{t+1}} \mathbb{E}_{s_{t+2}} \Big[ \gamma r_{t+1} + \mathbb{E}_{a_{t+2}} \mathbb{E}_{s_{t+3}}[\dots]\Big]\Bigg]

At the last step of a finite episode ($s_{T-1} \rightarrow a_{T-1} \rightarrow s_T$), where $r_{T-1} = r(s_T, s_{T-1}, a_{T-1})$:

Q^\pi(s_{T-1}, a_{T-1}) = \sum_{s_T} p(s_T|s_{T-1},a_{T-1}) r_{T-1}

Recursive form:

Q^\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}} \Bigg[ r_t + \gamma \mathbb{E}_{a_{t+1}} Q^\pi(s_{t+1}, a_{t+1}) \Bigg]
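The recursion for $Q^\pi$ can be iterated in the same spirit. A minimal sketch, assuming known transition probabilities stored as nested lists; the name `evaluate_q` and the toy MDP are hypothetical.

```python
def evaluate_q(P, R, pi, gamma, n_sweeps=500):
    """Iterate Q(s,a) <- sum_s' p(s'|s,a) [r + gamma * sum_a' pi(a'|s') Q(s',a')]."""
    n_states, n_actions = len(P), len(P[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_sweeps):
        # E_{a'} Q(s', a') under the policy pi, for every s'
        V = [sum(pi[s][a] * Q[s][a] for a in range(n_actions))
             for s in range(n_states)]
        Q = [
            [
                sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                    for s2 in range(n_states))
                for a in range(n_actions)
            ]
            for s in range(n_states)
        ]
    return Q

# Hypothetical MDP: in state 0, action 0 stays (reward 0) and action 1 moves to
# the absorbing state 1 with reward 1; nothing is earned in state 1.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
R = [[[0.0, 0.0], [0.0, 1.0]],
     [[0.0, 0.0], [0.0, 0.0]]]
pi = [[0.5, 0.5], [0.5, 0.5]]
Q = evaluate_q(P, R, pi, gamma=0.5)
```

For this toy problem the fixed point can be checked by hand: $Q(s_0, a_1) = 1$ and $Q(s_0, a_0) = 0.5 \cdot (0.5\, Q(s_0, a_0) + 0.5)$, which gives $Q(s_0, a_0) = 1/3$.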

Optimality Bellman Equation

Theorem (kinda):

If the Q-function of a policy $\pi^*$ satisfies, for every state-action pair $(s_t, a_t)$:

Q^{\pi^*}(s_t, a_t) = \mathbb{E}_{s_{t+1}} \Big[ r_t + \gamma \max_a Q^{\pi^*}(s_{t+1}, a) \Big]

then the policy $\pi^*$ is the optimal policy, and $Q^{\pi^*} = Q^*$ is the optimal action-value function.

If the optimal Q-function is known for every state-action pair $(s_t, a_t)$ (it is just a table), then recovering the optimal policy is easy:

a_t = \arg\max_a Q^*(s_t, a)
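Reading the greedy policy off a Q-table is a one-liner per state. A small sketch, where the table `Q_star` holds made-up numbers and ties are broken in favor of the lowest action index (an arbitrary choice).

```python
def greedy_policy(Q):
    """Return, for every state s, the action index maximizing Q[s][a]."""
    return [max(range(len(row)), key=lambda a: row[a]) for row in Q]

# Hypothetical Q-table with 3 states (rows) and 2 actions (columns).
Q_star = [[0.1, 0.7],
          [0.4, 0.2],
          [0.0, 0.0]]
policy = greedy_policy(Q_star)
```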

Dynamic Programming

for Optimality Bellman Equation

The loop:

Q(s, a) \leftarrow 0 \;\;\;\; \forall(s, a)

while not convergent:
   for $s \in \mathcal{S}$:
      for $a \in \mathcal{A}$:
         $Q(s, a) \leftarrow \mathbb{E}_{s'} \Big[ r + \gamma \max_{a'} Q(s', a') \Big]$

Will converge to the optimal Q-function

What's wrong with this algorithm? Can you use it in practice?

It requires us to know the transition probabilities $p(s'|s, a)$
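To make the limitation concrete, here is a sketch of the loop for a case where $p(s'|s, a)$ and the rewards are known and stored as tables. The function name `q_value_iteration`, the stopping tolerance and the three-state chain are assumptions for illustration.

```python
def q_value_iteration(P, R, gamma, tol=1e-10, max_sweeps=10_000):
    """Repeat Q(s,a) <- E_{s'}[r + gamma * max_a' Q(s',a')] until Q stops changing.

    P[s][a][s2] is p(s2|s,a) and R[s][a][s2] is r(s2, s, a); both must be known.
    """
    n_states, n_actions = len(P), len(P[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(max_sweeps):
        delta = 0.0
        for s in range(n_states):
            for a in range(n_actions):
                new_q = sum(
                    P[s][a][s2] * (R[s][a][s2] + gamma * max(Q[s2]))
                    for s2 in range(n_states)
                )
                delta = max(delta, abs(new_q - Q[s][a]))
                Q[s][a] = new_q
        if delta < tol:  # "convergent" is taken to mean: no entry changed noticeably
            break
    return Q

# Hypothetical chain 0 -> 1 -> 2. Action 0 stays in place, action 1 moves right.
# Entering state 2 pays reward 1; state 2 is absorbing with reward 0.
P = [[[1, 0, 0], [0, 1, 0]],
     [[0, 1, 0], [0, 0, 1]],
     [[0, 0, 1], [0, 0, 1]]]
R = [[[0, 0, 0], [0, 0, 0]],
     [[0, 0, 0], [0, 0, 1]],
     [[0, 0, 0], [0, 0, 0]]]
Q = q_value_iteration(P, R, gamma=0.9)
```

Every update reads `P` directly, which is exactly the information an agent usually does not have.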

Q-learning

Dynamic Programming with Monte-Carlo sampling

REMINDER: Monte-Carlo estimate of an expectation

\mathbb{E}_{x\sim p(x)} f(x) \approx \frac{1}{N} \sum_{i=1}^N f(x_i),\;\;\;\; x_i \sim p(x)

Q(s, a) \leftarrow 0 \;\;\;\; \forall(s, a)

while not convergent:
   for t=0 to T-1:
      $a_t = \arg\max_a Q(s_t, a)$   (will be discussed at the next slide)
      $(r_t, s_{t+1}) \sim \texttt{environment}(a_t)$
      $Q^{old}(s_t, a_t) \leftarrow Q(s_t, a_t)$
      $Q(s_t, a_t) \leftarrow \alpha Q^{old}(s_t, a_t) + (1-\alpha) \big[r_t + \gamma \max_{a} Q(s_{t+1}, a) \big]$   (use exponential averaging as the MC estimate)

Here $\alpha$ weights the old estimate, so the step toward the new target has size $1-\alpha$.

Will (with some dirty hacks) converge to the optimal Q-function

What's wrong with this algorithm? Can you use it in practice?
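A single update step of this scheme might look as follows. This is a sketch, not a reference implementation. It keeps the convention used here, in which $\alpha$ multiplies the old estimate, and the `terminal` flag is an addition that drops the bootstrap term at the end of an episode.

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma, terminal=False):
    """Apply Q(s,a) <- alpha * Q_old(s,a) + (1 - alpha) * [r + gamma * max_a' Q(s',a')].

    alpha weights the OLD estimate, following the formula in the text.
    If `terminal` is True, the target is just r (no bootstrapping from s_next).
    """
    target = r if terminal else r + gamma * max(Q[s_next])
    Q[s][a] = alpha * Q[s][a] + (1.0 - alpha) * target
    return Q[s][a]

# Hypothetical table with 2 states and 2 actions.
Q = [[0.0, 0.0], [1.0, 3.0]]
# target = 1 + 0.5 * 3 = 2.5, so the new value is 0.9 * 0 + 0.1 * 2.5 = 0.25
new_value = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.9, gamma=0.5)
```

Only one sampled transition $(s_t, a_t, r_t, s_{t+1})$ is needed per update, so the transition probabilities never appear.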

Q-learning

Exploration vs. Exploitation

Let's learn how to move forward. The reward is the change of the agent's position $x$:

r_t = x_{t+1} - x_t

Initial Q-function estimate:

Q(s_0, a=\text{fall}) = 0

Q(s_0, a=\text{step}) = 0

Suppose the greedy policy picks:

\pi(a|s_0)=\arg\max_aQ(s_0, a) = \text{fall}

Falling forward moves the agent by $\Delta x$, so the estimate becomes:

Q(s_0, a=\text{fall}) = \Delta x

Now fall always wins the $\arg\max$, and the agent never tries to step.

Solution ($\epsilon$-greedy strategy): add noise to $\pi$: take a random action with probability $\epsilon$
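The $\epsilon$-greedy rule can be sketched in a few lines. The action encoding (index 0 = fall, index 1 = step) and the value 0.1 standing in for $\Delta x$ are assumptions for illustration.

```python
import random

def epsilon_greedy_action(Q, s, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    n_actions = len(Q[s])
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[s][a])

# Single state s_0; Q(s_0, fall) = 0.1 (playing the role of Delta x), Q(s_0, step) = 0.
Q = [[0.1, 0.0]]
rng = random.Random(0)
greedy_only = [epsilon_greedy_action(Q, 0, 0.0, rng) for _ in range(100)]
explored = [epsilon_greedy_action(Q, 0, 0.5, rng) for _ in range(1000)]
```

With $\epsilon = 0$ the agent only ever falls; with $\epsilon > 0$ the step action is also tried from time to time, which gives its Q-value a chance to be updated.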

Q-learning

Graphical representation of the learning loop

The loop cycles through the following steps:

- set optimal policy $\pi$: $\pi(a|s)=\arg\max_aQ(s, a)$
- set behavioral policy $\pi_b$: $\pi_b = \begin{cases} \pi, & \text{w.p. } (1-\epsilon) \\ \text{rand}, & \text{w.p. } \epsilon \end{cases}$
- behave
- collect
- update $Q$-function

Q-learning example

Windy Gridworld Navigation Problem

[Figure: the gridworld, with a "wind" label]

Actions: [shown in the figure]

States: $s_t = (x, y)$

Rewards: $r(s_{goal}) = 1$; $0$ otherwise

Discounting: $\gamma = 0.99$

Learning rate: $\alpha = 0.9$

$\epsilon$-greedy strategy: $\epsilon_0 = 1$, $\epsilon_{i+1} = 0.99\epsilon_i$ (decrease $\epsilon$ after each episode)
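Putting the pieces together, a tabular Q-learning run on a small windy grid could be sketched as below. Only the hyperparameters ($\gamma = 0.99$, $\alpha = 0.9$, $\epsilon_0 = 1$, decay 0.99) and the reward rule come from the setup above. The grid size, wind strengths, start and goal cells, the four-move action set, and the episode and step limits are all assumptions made for this sketch.

```python
import random

# Assumed layout (not from the slides): 4 x 6 grid, wind pushes the agent up.
HEIGHT, WIDTH = 4, 6
WIND = [0, 0, 1, 1, 0, 0]                   # upward push per column (assumed)
START, GOAL = (3, 0), (1, 4)                # (row, col), assumed
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right (assumed)

def step(state, action):
    """Move, apply the wind of the current column, clip to the grid."""
    row, col = state
    d_row, d_col = MOVES[action]
    new_row = min(max(row + d_row - WIND[col], 0), HEIGHT - 1)
    new_col = min(max(col + d_col, 0), WIDTH - 1)
    next_state = (new_row, new_col)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done

def train(n_episodes=500, max_steps=200, gamma=0.99, alpha=0.9, seed=0):
    rng = random.Random(seed)
    Q = {(r, c): [0.0] * len(MOVES) for r in range(HEIGHT) for c in range(WIDTH)}
    epsilon = 1.0
    for _ in range(n_episodes):
        state = START
        for _ in range(max_steps):
            # epsilon-greedy behavioral policy
            if rng.random() < epsilon:
                action = rng.randrange(len(MOVES))
            else:
                action = max(range(len(MOVES)), key=lambda a: Q[state][a])
            next_state, reward, done = step(state, action)
            target = reward if done else reward + gamma * max(Q[next_state])
            # alpha weights the old estimate, as in the update rule of this deck
            Q[state][action] = alpha * Q[state][action] + (1 - alpha) * target
            state = next_state
            if done:
                break
        epsilon *= 0.99  # decrease epsilon after each episode
    return Q

Q = train()
```

Since the only reward is 1 at the goal, every learned Q-value lies between 0 and 1, and larger values indicate state-action pairs closer to the goal.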


What problems does RL solve?

Example 1: Robotics (baking pancakes)

Example 2: Optimal control for a well

Example 3: Self-driving cars

Example 4: Chess


Approximate Q-learning

DQN (Deep Q-Network)
