Games AI

0: Intro

About & not about

- Not about Neural Networks
- Not about Mathematics
- Reinforcement Learning
- But we're still doing it
- We're creating AI that learns a Game
- We're not creating AI to be in a Game
- Important for reasoning
- Unavoidable for certain problems

About me

Course Overview

Chapters

0 - RL Intro
  • General Infos
  • Base Principle
  • Markov Decision Process

1 - Dynamic Programming
  • Markov Decision Process
  • General Policy Iteration
  • Assuming perfect Conditions
  • Learning
    • On Policy
    • Model Based

2 - Monte Carlo
  • Sampling Theory
  • Monte Carlo Variants
  • Learning
    • Off Policy
    • Model Free

3 - Temporal Difference
  • Merging Chapters 1 & 2
  • SARSA / Q-Learning
  • Learning
    • Online
    • Immediately

4 - Function Approximation
  • Eliminate memory limitations
  • Improve on Chapter 3
  • Learn Generalisation
  • PyTorch Intro

5 - Deep Q-Learning
  • Neural Networks
  • Improve RL Loop
  • Improve NN Loop

6 - Policy Gradient

Portfolio

Portfolio

1 - Dynamic Programming: Icy Lake
2 - Monte Carlo: Blackjack
3 - Temporal Difference: Taxi
4 - Function Approximation: Mountain Car
5 - Deep Q-Learning: Atari 2600

Rating

Each of the five portfolio projects counts for 20% of the grade (5 points each).

00 - 04 points: Note 5
05 - 10 points: Note 4
11 - 15 points: Note 3
16 - 20 points: Note 2
21 - 25 points: Note 1

Literature

The Book
  • "Bible" of RL
  • Very detailed
  • PDF is available for free

A Video Playlist
  • Fast & compressed
  • Beautiful visualisations
  • Linked GitHub repo

Examples

Dota 2 (OpenAI Five)
https://openai.com/research/openai-five

AlphaGo
https://deepmind.google/technologies/alphago/

Autonomous UAV Navigation
https://arxiv.org/pdf/1801.05086.pdf

LittleDog (Boston Dynamics)
https://www.roboticsproceedings.org/rss05/p27.pdf

Playing Atari
https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf

Tools

  • Gymnasium
  • PyTorch
  • Python + Jupyter
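
As a minimal sketch of how these tools fit together (assuming a standard gymnasium install; CartPole-v1 is just an illustrative environment here, not one of the portfolio tasks), the basic interaction loop looks roughly like this:

import gymnasium as gym

# CartPole-v1 is an arbitrary illustrative choice of environment.
env = gym.make("CartPole-v1")

obs, info = env.reset(seed=0)
total_reward = 0.0

for _ in range(200):
    # A "random agent": sample any valid action from the action space.
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated or truncated:
        obs, info = env.reset()

env.close()
print(f"Reward collected by the random agent: {total_reward}")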

Reinforcement Learning

What is RL?

Machine Learning

Supervised
  • Labeled Data
  • Regression
  • Classification

Unsupervised
  • Unlabeled Data
  • Clustering
  • Embedding

Reinforcement
  • Semisupervised
  • Interactive

Principle

[Diagram: the agent-environment loop. At time t the agent, in state S_t with reward R_t, selects action A_t according to its policy \pi(a \mid s); the environment, with dynamics p(s', r \mid s, a), returns the next state S_{t+1} and reward R_{t+1}.]

Time:    t \in T = \{0, 1, 2, \ldots\}
State:   s \in S
Action:  a \in A
Reward:  r \in R \subset \mathbb{R}

Trajectory:  S_0, A_0, R_1, \hspace{3mm} S_1, A_1, R_2, \hspace{3mm} S_2, A_2, R_3, \hspace{3mm} \ldots

Goal:  G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots = \sum_{k=0}^{T} \gamma^k R_{t+k+1}
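
A tiny sketch of the return G_t defined above, computed for a finite reward sequence (the reward values and gamma = 0.9 below are made up purely for illustration):

def discounted_return(rewards, gamma=0.9):
    # G_t = R_{t+1} + gamma * R_{t+2} + ... = sum_k gamma^k * R_{t+k+1}
    g = 0.0
    # Work backwards so each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Made-up rewards R_1, R_2, R_3:
print(discounted_return([1.0, 0.0, 2.0]))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62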

Finite Markov Decision Process

The agent-environment loop above, with finite sets S, A and R, the dynamics p(s', r \mid s, a) and the policy \pi(a \mid s), forms a finite Markov Decision Process (MDP).

Markov What?

Markov Property

The probability of transitioning to the next state depends only on the current state, not on the sequence of events that preceded it.
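
To make p(s', r \mid s, a) and \pi(a \mid s) concrete, here is a hedged sketch of a tiny made-up finite MDP stored as lookup tables, plus one sampled interaction step (the states, actions, probabilities and rewards are all invented for illustration):

import random

# Dynamics p(s', r | s, a) for a made-up two-state MDP:
# (state, action) -> list of (probability, next_state, reward)
P = {
    ("s0", "left"):  [(1.0, "s0", 0.0)],
    ("s0", "right"): [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "left"):  [(1.0, "s0", 0.0)],
    ("s1", "right"): [(1.0, "s1", 2.0)],
}

# A stochastic policy pi(a | s): action probabilities per state.
pi = {
    "s0": {"left": 0.5, "right": 0.5},
    "s1": {"left": 0.1, "right": 0.9},
}

def sample_action(state):
    # Draw an action a ~ pi(. | s).
    actions = list(pi[state])
    weights = list(pi[state].values())
    return random.choices(actions, weights=weights, k=1)[0]

def env_step(state, action):
    # Draw (s', r) ~ p(., . | s, a).
    outcomes = P[(state, action)]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = random.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward

s = "s0"
a = sample_action(s)
s_next, r = env_step(s, a)
print(f"S_t={s}, A_t={a} -> S_t+1={s_next}, R_t+1={r}")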

Rewards & Objective & Policies

Goal (the actual outcome from time t onwards):

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots = \sum_{k=0}^\infty \gamma^k R_{t+k+1}

State Value (the expected outcome in a given state, following \pi):

v_\pi(s) \equiv \mathbb{E}_\pi[G_t \,|\, S_t = s] = \mathbb{E}_\pi \left[ \sum_{k=0}^\infty \gamma^k R_{t+k+1} \,|\, S_t = s \right], \text{ for all } s \in \mathcal{S}

Action Value (the expected outcome in a given state when taking a given action, then following \pi):

q_\pi(s, a) \equiv \mathbb{E}_\pi[G_t \,|\, S_t = s, A_t = a] = \mathbb{E}_\pi \left[ \sum_{k=0}^\infty \gamma^k R_{t+k+1} \,|\, S_t = s, A_t = a \right]

Policy:

\pi(a \mid s)

Optimal Policy (at least as good as any other policy in every state):

\pi_*(a \mid s), \quad \text{where}

v_{\pi_*}(s) \geq v_\pi(s) \quad \text{for all}\ s\ \text{and any}\ \pi

q_{\pi_*}(s, a) \geq q_\pi(s, a) \quad \text{for all}\ s, a\ \text{and any}\ \pi

\pi_*(s) = \argmax_{a}\ q_*(s, a)
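
As a hedged preview of what the later chapters do with these definitions, here is a minimal sketch of estimating v_\pi(s) by averaging sampled returns G_t, and of reading a greedy policy \pi_*(s) = \argmax_a q(s, a) out of an action-value table (all numbers and table entries are invented for illustration):

from collections import defaultdict

# Monte-Carlo-style estimate of v_pi(s): average the returns G_t
# observed after visiting state s (made-up sample returns below).
returns = defaultdict(list)
returns["s0"] += [2.6, 1.9, 3.1]
returns["s1"] += [0.5, 0.7]

v = {s: sum(gs) / len(gs) for s, gs in returns.items()}
print(v)  # {'s0': 2.533..., 's1': 0.6}

# Greedy policy extracted from an action-value table q(s, a):
# pi_*(s) = argmax_a q(s, a)   (q-values invented for illustration).
q = {
    "s0": {"left": 0.1, "right": 1.2},
    "s1": {"left": 0.4, "right": 0.3},
}
pi_star = {s: max(actions, key=actions.get) for s, actions in q.items()}
print(pi_star)  # {'s0': 'right', 's1': 'left'}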