Machine Learning Introduction


Structure

  • Fundamentals
  • Introduction Supervised Learning
    • Neural Nets with AND
    • MNIST
  • Introduction Unsupervised Learning
    • K-Means Clustering
  • Introduction Reinforcement Learning
    • Q-Learning
  • Summary


Fundamentals

  • What is ML?
  • Why do we need it?
  • What are the use cases?
  • Do I need a Ph.D. to understand all of this?


Supervised Learning

[Figure: example pictures labeled "Mammal" and "Not a mammal", plus a new, unlabeled picture asking "Mammal?"]


Logical AND - Problem

x_1  x_2  x_1 ∧ x_2
 0    0       0
 0    1       0
 1    0       0
 1    1       1

[Plot: the four inputs in the (x_1, x_2) plane, three points labeled ∧ = 0 and one labeled ∧ = 1]


Logical AND - Problem

[Plot: a line h = 0 separating the ∧ = 0 points (h < 0) from the ∧ = 1 point (h >= 0) in the (x_1, x_2) plane]

h = a + b \cdot x_1 + c \cdot x_2

h = 0 \Rightarrow x_2 = \frac{a + b \cdot x_1}{-c}

How to learn such a function?


Perceptron

[Diagram: inputs 1, x_1, x_2, \ldots, x_n with weights w_0, w_1, w_2, \ldots, w_n feeding a summation node \sum followed by an activation \sigma]

\sigma(\sum_{i=0} x_i \cdot w_i)

x = input
w = weights
\sigma = activation function


Perceptron for AND

[Diagram: inputs 1, x_1, x_2 with weights w_0, w_1, w_2 feeding \sum followed by \sigma]

h = w_0 + x_1 \cdot w_1 + x_2 \cdot w_2   (compare h = a + x_1 \cdot b + x_2 \cdot c)

\sigma(h): h < 0 \Rightarrow 0 (false), \quad h \geq 0 \Rightarrow 1 (true)


Perceptron for AND

Input:
x_1 = 0, \; x_2 = 1

Randomly chosen weights:
w_0 = -0.3, \; w_1 = 0.3, \; w_2 = 0.5

-0.3 + 0 \cdot 0.3 + 1 \cdot 0.5 = 0.2
\sigma(0.2) = 1

0 = false, 1 = true, so 0 ∧ 1 is currently (wrongly) classified as true.
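
This forward pass can be written in a few lines of Python. A minimal sketch (the function names are ours, the weights and step activation are the ones above):

def sigma(h):
    # step activation: 0 (false) for h < 0, 1 (true) for h >= 0
    return 1 if h >= 0 else 0

def perceptron(x1, x2, w=(-0.3, 0.3, 0.5)):
    w0, w1, w2 = w
    h = w0 * 1 + w1 * x1 + w2 * x2   # the constant 1 is the bias input
    return sigma(h)

print(perceptron(0, 1))   # sigma(0.2) = 1, so 0 AND 1 is (wrongly) classified as true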


Cost Function

x_1  x_2  y  \hat{y}  C
 0    0   0    0       0
 0    1   0    1      -1
 1    0   0    1      -1
 1    1   1    1       0

y = desired result
\hat{y} = our result
\mathcal{M} := misclassified patterns

cost = - \sum_{i\in\mathcal{M}} | y_i - \hat{y_i}| = -2

[Plot: the two misclassified points in the (x_1, x_2) plane, each marked -1]
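
A short sketch that recomputes the table and the cost of -2 with the initial weights (-0.3, 0.3, 0.5); the helper names are ours:

def sigma(h):
    return 1 if h >= 0 else 0

def predict(x1, x2, w=(-0.3, 0.3, 0.5)):
    w0, w1, w2 = w
    return sigma(w0 + w1 * x1 + w2 * x2)

and_table = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

cost = -sum(abs(y - predict(x1, x2))
            for (x1, x2), y in and_table
            if predict(x1, x2) != y)
print(cost)   # -2, from the two misclassified patterns (0,1) and (1,0)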


Backpropagation

\hat{y} = \sigma (\sum_{i = 0} x_i \cdot w_i)

cost = -\sum_{i\in\mathcal{M}} | y_i - \hat{y_i}| = -|\mathcal{M}|

\frac{\partial cost}{\partial w_j} \approx -\sum_{i \in \mathcal{M}} - x_{i,j}

Update rule:

w_j \leftarrow w_j + \eta \cdot (\sum_{i\in\mathcal{M}} -x_{i,j})


Backpropagation

Misclassified (written with the bias input x_0 = 1):
(x_0, x_1, x_2) = (1, 0, 1) \text{ and } (1, 1, 0)

Update rule:
w_j \leftarrow w_j + \eta \cdot (\sum_{i\in\mathcal{M}} -x_{i,j})

Learning rate:
\eta = 0.1

Weights:
w_0 = -0.3, w_1 = 0.3, w_2 = 0.5

w_0 = -0.3 + 0.1 \cdot (-1 - 1) = -0.5
w_1 = 0.3 + 0.1 \cdot (0 - 1) = 0.2
w_2 = 0.5 + 0.1 \cdot (-1 - 0) = 0.4
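
The same update, written out as a small sketch (each misclassified pattern carries its bias input x_0 = 1):

eta = 0.1
w = [-0.3, 0.3, 0.5]
misclassified = [(1, 0, 1), (1, 1, 0)]   # the two wrongly classified AND inputs

# w_j <- w_j + eta * sum over misclassified patterns of (-x_j)
w = [w_j + eta * sum(-x[j] for x in misclassified) for j, w_j in enumerate(w)]
print([round(w_j, 2) for w_j in w])   # [-0.5, 0.2, 0.4]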


Test

[Plot: the four inputs with the learned decision boundary]

x_1 = 0, x_2 = 0: \sigma(-0.5 + 0.2 \cdot 0 + 0.4 \cdot 0) = \sigma(-0.5) = 0
x_1 = 0, x_2 = 1: \sigma(-0.5 + 0.2 \cdot 0 + 0.4 \cdot 1) = \sigma(-0.1) = 0
x_1 = 1, x_2 = 0: \sigma(-0.5 + 0.2 \cdot 1 + 0.4 \cdot 0) = \sigma(-0.3) = 0
x_1 = 1, x_2 = 1: \sigma(-0.5 + 0.2 \cdot 1 + 0.4 \cdot 1) = \sigma(0.1) = 1


Summary


  • Get labeled data (the AND table)
  • Run the data through the perceptron and calculate the error
  • Use the partial derivative of the cost function to create a learning rule
  • For every mislabeled sample, apply the learning rule (see the training sketch below)
  • Hope that the data is linearly separable

[Plot: XOR, the classic example that is not linearly separable]
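
Putting the steps above together, a minimal training-loop sketch for the AND perceptron. It follows the slides' simplified rule rather than any particular library, and the helper names are ours:

def sigma(h):
    return 1 if h >= 0 else 0

def predict(w, x):   # x = (x0, x1, x2) with bias input x0 = 1
    return sigma(sum(w_j * x_j for w_j, x_j in zip(w, x)))

data = [((1, 0, 0), 0), ((1, 0, 1), 0), ((1, 1, 0), 0), ((1, 1, 1), 1)]
w, eta = [-0.3, 0.3, 0.5], 0.1

for _ in range(100):
    misclassified = [x for x, y in data if predict(w, x) != y]
    if not misclassified:   # every pattern is classified correctly
        break
    # learning rule: w_j <- w_j + eta * sum(-x_j) over the misclassified patterns
    w = [w_j + eta * sum(-x[j] for x in misclassified) for j, w_j in enumerate(w)]

print([round(w_j, 2) for w_j in w])        # [-0.5, 0.2, 0.4], as on the Test slide
print([predict(w, x) for x, _ in data])    # [0, 0, 0, 1], the AND table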


Example - MNIST

[Figure: an unknown handwritten digit ("?") given to the classifier]

Output: 13% #0, 0% #1, 5% #2, 1% #3, 67% #4, 2% #5, 2% #6, 3% #7, 3% #8, 4% #9



Multiple Perceptrons

[Diagram: the b/w pixel data x = (x_1, x_2, \ldots, x_n) feeds a layer of perceptrons (\sum); their outputs \textbf{z} go through a softmax]

\sigma(\textbf{z})_j = \frac{e^{\textbf{z}_j}}{\sum^{K}_{k = 1} e^{\textbf{z}_k}}

Output: 13% #0, 0% #1, 5% #2, 1% #3, 67% #4, 2% #5, 2% #6, 3% #7, 3% #8, 4% #9
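
The softmax itself is tiny. A sketch with a made-up score vector z, chosen so that the output roughly matches the percentages above (subtracting max(z) only guards against overflow):

import math

def softmax(z):
    # sigma(z)_j = exp(z_j) / sum_k exp(z_k)
    m = max(z)
    exps = [math.exp(z_j - m) for z_j in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [1.3, -2.0, 0.4, -1.1, 3.0, -0.6, -0.7, -0.3, -0.2, 0.1]   # one score per digit 0-9
print([round(p, 2) for p in softmax(z)])   # a distribution summing to 1, peaked at digit 4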


Linearly Separable

[Figure: digit images reduced to b/w pixel rows such as 0,1,0 / 1,1,1 / 1,1,0, with an index over the pixels and the pixel sums = 6, = 4, = 5]


Multiple Layers of Perceptrons

[Diagram: the same b/w pixel input x = (x_1, \ldots, x_n), now passed through several layers of perceptrons (\sum) before the softmax output]

\sigma(\textbf{z})_j = \frac{e^{\textbf{z}_j}}{\sum^{K}_{k = 1} e^{\textbf{z}_k}}

Output: 13% #0, 0% #1, 5% #2, 1% #3, 67% #4, 2% #5, 2% #6, 3% #7, 3% #8, 4% #9
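
A sketch of one forward pass through such a stack of layers, with made-up sizes and random weights (NumPy; the ReLU activation is a stand-in, since the slides only define the step function and the softmax):

import numpy as np

rng = np.random.default_rng(0)

def layer(x, W, b):
    # one layer of perceptrons: weighted sums followed by an activation
    return np.maximum(0.0, W @ x + b)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

n_pixels, n_hidden, n_digits = 784, 32, 10          # made-up layer sizes
W1, b1 = rng.normal(0, 0.1, (n_hidden, n_pixels)), np.zeros(n_hidden)
W2, b2 = rng.normal(0, 0.1, (n_digits, n_hidden)), np.zeros(n_digits)

x = rng.integers(0, 2, n_pixels).astype(float)      # fake b/w pixel vector
z = W2 @ layer(x, W1, b1) + b2
print(softmax(z).round(2))                          # 10 probabilities, one per digit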


Additional Information

  • Convolutional Networks
  • Recurrent Networks
  • LSTM Neurons


Neural Style Transfer

https://handong1587.github.io/deep_learning/2015/10/09/fun-with-deep-learning.html


Neural Photorealistic Style Transfer

https://github.com/luanfujun/deep-photo-styletransfer


Text to Speech

[Audio samples: normal text, randomly generated text, music]

https://deepmind.com/blog/wavenet-generative-model-raw-audio/


Unsupervised Learning

[Figure: two scatter plots of Feature 1 vs. Feature 2, one with unknown structure and one with the structure (clusters) known]


K - Means Clustering

[Plot: the four data points and the two initial cluster centers in the Feature 1 / Feature 2 plane]

X = \{ (1,2), (1,1), (2,3), (3,3) \}
C_1 = \{ (1,1) \}
C_2 = \{ (3,3) \}

euclid(\textbf{x}, \textbf{y}) = \sqrt{\sum_{i = 0}^n (x_i - y_i)^2}

euclid(x_1, C_1) = \sqrt{(1-1)^2 + (2-1)^2} = 1
euclid(x_1, C_2) = \sqrt{(1-3)^2 + (2-3)^2} = 2.24
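
The distance computation as a quick sketch (function name is ours):

import math

def euclid(x, y):
    # square root of the summed squared coordinate differences
    return math.sqrt(sum((x_i - y_i) ** 2 for x_i, y_i in zip(x, y)))

X = [(1, 2), (1, 1), (2, 3), (3, 3)]
c1, c2 = (1, 1), (3, 3)

print(round(euclid(X[0], c1), 2))   # 1.0, so x_1 is closer to C_1
print(round(euclid(X[0], c2), 2))   # 2.24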


K - Means Clustering

[Plot: x_1 = (1,2) is assigned to C_1, and the C_1 center moves to (1, 1.5)]

X = \{ (1,2), (1,1), (2,3), (3,3) \}
C_1 = \{ (1,1) \}
C_2 = \{ (3,3) \}

euclid(x_1, C_1) = \sqrt{(1-1)^2 + (2-1)^2} = 1
euclid(x_1, C_2) = \sqrt{(1-3)^2 + (2-3)^2} = 2.24

C_1 = \{ (\frac{1 + 1}{2}, \frac{1 + 2}{2}) \} = \{(1, 1.5)\}


K - Means Clustering

[Plot: x_3 = (2,3) is assigned to C_2, and the C_2 center moves to (2.5, 3)]

X = \{ (1,2), (1,1), (2,3), (3,3) \}
C_1 = \{ (1,1.5) \}
C_2 = \{ (3,3) \}

euclid(\textbf{x}, \textbf{y}) = \sqrt{\sum_{i = 0}^n (x_i - y_i)^2}

euclid(x_3, C_1) = \sqrt{(2-1)^2 + (3-1.5)^2} = 1.80
euclid(x_3, C_2) = \sqrt{(2-3)^2 + (3-3)^2} = 1

C_2 = \{ (\frac{2 + 3}{2}, \frac{3 + 3}{2}) \} = \{(2.5, 3)\}


K - Means Clustering

[Plot: all four points with the current centers C_1 = (1, 1.5) and C_2 = (2.5, 3)]

X = \{ (1,2), (1,1), (2,3), (3,3) \}
C_1 = \{ (1,1.5) \}
C_2 = \{ (2.5,3) \}

euclid(\textbf{x}, \textbf{y}) = \sqrt{\sum_{i = 0}^n (x_i - y_i)^2}

\forall x \in X: euclid(x, C_1) < euclid(x, C_2) \Rightarrow x \in C_1
\forall x \in X: euclid(x, C_1) > euclid(x, C_2) \Rightarrow x \in C_2

\text{If any assignment changed, recompute the centroids of } C_{1,2} \text{ and repeat.}
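
The whole loop as a compact sketch (plain Python, our own function names), run on the four points above with k = 2:

import math

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def kmeans(X, centers, steps=10):
    for _ in range(steps):
        # assignment step: every point goes to its nearest center
        clusters = [[] for _ in centers]
        for x in X:
            nearest = min(range(len(centers)), key=lambda i: euclid(x, centers[i]))
            clusters[nearest].append(x)
        # update step: recompute the centroids; stop once nothing moves any more
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

X = [(1, 2), (1, 1), (2, 3), (3, 3)]
print(kmeans(X, [(1, 1), (3, 3)]))   # centers (1, 1.5) and (2.5, 3), as on the slides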


K - Means Clustering - Caveats

  • Number of clusters
  • Similarity measure
  • No convergence


Additional Information

  • Principal Component Analysis
  • Support Vector Machines
  • Autoencoder


K-Means Clustering of 40K samples of homework

http://practicalquant.blogspot.de/2013/10/semi-automatic-method-for-grading-a-million-homework-assignments.html


Reinforcement Learning

[Diagram: the agent performs an action on the environment; the environment returns a new state and a reward to the agent]


Formalizing RL

[Diagram: agent-environment loop (Action, State, Reward)]

s_t = \text{current state}
a_t = \text{action}
r_t = \text{reward}
t = \text{timestep}

s_0 \xrightarrow{a_0} s_1, r_1


Formalizing RL

s_0 \xrightarrow{a_0} s_1, r_1
s_1 \xrightarrow{a_1} s_2, r_2
\ldots
s_{n-1} \xrightarrow{a_{n-1}} s_n, r_n


Reward

R = r_1 + r_2 + \ldots + r_n


Timed Reward

R_t = r_t + r_{t+1} + \ldots + r_n


Discount Rate

\gamma = \text{discount rate} \in [0,1]

R_t = r_t + \gamma \cdot (r_{t+1} + \ldots + r_n)


Discount Rate

\gamma = \text{discount rate} \in [0,1]

R_t = r_t + \gamma \cdot r_{t+1} + \gamma^2 \cdot r_{t+2} + \ldots + \gamma^{n-t} \cdot r_n
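
The discounted return as a two-line sketch; the reward list is made up:

def discounted_return(rewards, gamma):
    # R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, 0.0, 5.0]            # made-up rewards r_t, ..., r_n
print(discounted_return(rewards, 0.9))    # 1.0 + 0.9**3 * 5.0 = 4.645
print(discounted_return(rewards, 0.0))    # gamma = 0: only the immediate reward, 1.0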


Short-Sighted Reward

\gamma = 0:

R_t = r_t + 0 \cdot r_{t+1} + 0 \cdot r_{t+2} + \ldots + 0 \cdot r_n \;\; \Rightarrow \;\; R_t = r_t


Balanced Rewards

\gamma = 0.9:

R_t = r_t + 0.9 \cdot r_{t+1} + 0.81 \cdot r_{t+2} + \ldots + (\gamma^{n-t} \ll 0.9) \cdot r_n


Q(uality) - Learning

Q(s_t, a_t) = max(R_{t+1})

Q(s_t, a_t) represents the quality of taking action a_t in state s_t, assuming we keep playing optimally from that point on.


Q(uality) - Learning

Q(s_t, a_t) = max(R_{t+1})

\pi = \text{policy}

\pi(s_t) = argmax_a[Q(s_t, a)]

Problem: How to construct such a Q function?


Bellman Equation

Q(s_t, a_t) = r_{t+1} + \gamma \cdot max_{a_{t+1}}[Q(s_{t+1}, a_{t+1})]

The maximal reward is the immediate reward plus the (discounted) maximum future reward of the next state.


Learning Q-Function

Start in s_0. The Q-table (one row per state s_0 \ldots s_n, one column per action a_0 \ldots a_n) is initialized with zeros:

        a_0   a_1   ...   a_n
s_0      0     0    ...    0
s_1      0     0    ...    0
...     ...   ...   ...   ...
s_n      0     0    ...    0


Learning Q-Function

\pi(s_0) = argmax_a[Q(s_0, a)] = a_0 \quad \text{(all entries are still 0)}

s_1, r_1 = 0.1 \leftarrow \text{execute } a_0

Q[s_0, a_0] = Q[s_0, a_0] + \alpha \cdot (0.1 + \gamma \cdot max_{a}[Q(s_1, a)] - Q(s_0, a_0))

\alpha = \text{learning rate}

        a_0   a_1   ...   a_n
s_0     0.1    0    ...    0
s_1      0     0    ...    0
...     ...   ...   ...   ...
s_n      0     0    ...    0

\text{restart with } s_1
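
One such update as a sketch. The environment step is faked (we simply hand back s_1 and r_1 = 0.1), and the table size, alpha and gamma are assumptions, not values from the slides:

n_states, n_actions = 5, 3                  # made-up sizes
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma = 1.0, 0.9                     # assumed learning rate and discount rate

def policy(s):
    # pi(s) = argmax_a Q[s][a]
    return max(range(n_actions), key=lambda a: Q[s][a])

s0 = 0
a0 = policy(s0)                             # all entries are 0, so a_0 is picked
s1, r1 = 1, 0.1                             # faked environment step: execute a_0

target = r1 + gamma * max(Q[s1])
Q[s0][a0] += alpha * (target - Q[s0][a0])
print(Q[s0][a0])                            # 0.1, the entry shown in the table above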


Learning Q-Function with NNs

[Diagram: the state is fed into a neural net (NN) that outputs one Q value per action: Q(s_0, a_0), Q(s_0, a_1), \ldots, Q(s_0, a_n)]

\pi(s_0) = \text{feedforward } s_0 \text{ and take the action with the largest output}

s_1, r_1 = 0.1 \leftarrow \text{execute } a_0

\text{feedforward } s_1 \rightarrow max_{a}[Q(s_1, a)]

\text{backprop for } s_0 \text{ with } a_0, \text{ using } 0.1 + \gamma \cdot max_{a}[Q(s_1, a)] \text{ as the target}

\text{restart with } s_1
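
The same loop with a function approximator, sketched with a plain linear model standing in for the neural net (sizes, alpha, gamma and the fake environment step are all assumptions):

import numpy as np

rng = np.random.default_rng(0)
n_state_dims, n_actions = 4, 3                      # made-up sizes
W = rng.normal(0, 0.1, (n_actions, n_state_dims))   # linear "network": one row per action
alpha, gamma = 0.1, 0.9                             # assumed learning rate and discount rate

def q_values(s):
    # "feedforward": one Q value per action for state s
    return W @ s

s0 = rng.random(n_state_dims)
a0 = int(np.argmax(q_values(s0)))                   # pi(s_0): pick the largest output

s1, r1 = rng.random(n_state_dims), 0.1              # faked environment step: execute a_0

target = r1 + gamma * np.max(q_values(s1))          # feedforward s_1 for max_a Q(s_1, a)
error = q_values(s0)[a0] - target
W[a0] -= alpha * error * s0                         # "backprop": one gradient step on the squared error
# then restart from s_1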


Additional Information

  • Experience Replay
  • Exploration vs. Exploitation (\epsilon\text{-greedy})

Slides adapted from an excellent tutorial:
https://www.nervanasys.com/demystifying-deep-reinforcement-learning/


TORCS

https://yanpanlau.github.io/2016/10/11/Torcs-Keras.html


Buzzwords we learned today

  • Perceptron
  • Backpropagation
  • Neural Nets
  • Supervised Learning
  • MNIST Dataset
  • Linearly Separable
  • Unsupervised Learning
  • Clustering
  • K-Means
  • Distance Measures
  • Convergence
  • Reinforcement Learning
  • Policy
  • Q-Learning
  • Discount Rate
  • TORCS


Image Sources

  • Cetacea - http://www.toggo.de/media/slider-wal-3-14295-10110.jpg
  • Orca - http://elelur.com/mammals/orca.html
  • Pinniped - http://www.interestingfunfacts.com/amazing-facts-about-pinniped.html
  • Deep Sea Frill Shark - http://images.nationalgeographic.com/wpf/media-live/photos/000/181/cache/deep-sea01-frill-shark_18161_600x450.jpg
  • Shark - http://www.livescience.com/55001-shark-attacks-increasing.html
  • Dolphin - http://weknownyourdreamz.com/dolphin.html