INF1339 W11 Lobby
2019 Slide Deck
[Figure: a small gridworld with a +1 reward cell and several -1 reward cells]
[Figure: a decision point labelled NOW with branching paths A, B, C, and D]
The Road Not Taken
BY ROBERT FROST
Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth;
Then took the other, as just as fair,
And having perhaps the better claim,
Because it was grassy and wanted wear;
Though as for that the passing there
Had worn them really about the same,
And both that morning equally lay
In leaves no step had trodden black.
Oh, I kept the first for another day!
Yet knowing how way leads on to way,
I doubted if I should ever come back.
I shall be telling this with a sigh
Somewhere ages and ages hence:
Two roads diverged in a wood, and I—
I took the one less traveled by,
And that has made all the difference.
Did it make ALL the difference?
https://devblogs.nvidia.com/deep-learning-nutshell-reinforcement-learning/ (Video courtesy of Mark Harris, who says he is “learning reinforcement” as a parent.)
"I returned, and saw under the sun, that the race is not to the swift, not the battle to the strong, neither yet bread to the wise, nor yet riches to men of understanding, nor yet favour to men of skill; but time and chance happeneth to them all."
Ecclesiastes 9:11, King James Version
We can choose between actions based on their expected value. Here we choose action 1.
Imagine life as movement in a chess board world. Each square is a "state" and we choose one of four actions: move one square at a time UP, DOWN, LEFT, or RIGHT.
Actions take us from state to state.
Choosing Among Options: Policy
[Figure: state 12 with an arrow for each action, labelled with its probability: 0.5, 0.25, 0.25, 0.00]
Read: 50% of the time I go UP, 25% of the time I go DOWN, 25% of the time I go RIGHT. I never go LEFT.
Policy(state 12) = (0.5, 0.25, 0, 0.25)
Choosing Among Options: Policy
[Figure: the same arrows from state 12, now showing the reward reached by each action: 10, 0, 0, and -4]
Policy(state 12) = (0.5, 0.25, 0, 0.25)
Ex(Policy(state 12)) = 0.5 x 10 + 0.25 x (-4) = 4
The "value" of a choice at a fork in the road is its expected value - the weighted average of the things life paths that start with heading down that fork.
We can simplify life and think of it as moving around on a chess board.
A policy is a list of how likely we are to choose each action in a given state.
We can compute the value of a policy in a given state.
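As a minimal sketch in plain JavaScript, the state-12 calculation above is just a weighted average. The action order and which direction carries the -4 reward are my assumptions; the slide only pairs 0.5 with 10 and 0.25 with -4.
// Action order assumed to be [UP, DOWN, LEFT, RIGHT], as in the policy above.
const policyState12 = [0.5, 0.25, 0.0, 0.25];  // action probabilities
const rewardsState12 = [10, -4, 0, 0];         // reward reached by each action (placement assumed)

function expectedValue(probs, rewards) {
  // weighted average: sum of P(action) x reward(action)
  return probs.reduce((sum, p, i) => sum + p * rewards[i], 0);
}

console.log(expectedValue(policyState12, rewardsState12));  // 0.5*10 + 0.25*(-4) = 4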
"Advice to a Young Tradesman", Benjamin Franklin
https://www.pickpik.com/
"The future balance in year n of starting balance B0 is ..."
"The discounted value future balance Bn is that value times gamma to the n"
1 plus interest rate raised to the nth power times the initial balance
Rearrange to express current balance in terms of future balance
Discounting Example
Assume gamma = 0.9.
Rewardnow = 0.9^10 x 1000 ≈ 349
Rewardnow = 0.9^14 x 1000 ≈ 229
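A quick sketch of the arithmetic behind this slide (the function name is mine):
// Present value of a reward arriving n steps in the future, discounted by gamma per step.
function discounted(reward, gamma, n) {
  return reward * Math.pow(gamma, n);
}

console.log(discounted(1000, 0.9, 10));  // 348.68... -> the 349 on the slide
console.log(discounted(1000, 0.9, 14));  // 228.77... -> roughly 229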
The expected value of an action in gridworld is related to the probabilities of downstream rewards AND the length of the paths to those rewards.
1 | 2 | 3 | 4 | 5 | 6 | 7
8 | 9 | 10 | X | X | 13 | 14
15 | 16 | 17 | 18 | 19 | 20 | 21
22 | 23 | 24 | 25 | 26 | 27 | 28
29 | X | 31 | X | 33 | 34 | 35
36 | X | 38 | X | 40 | 41 | 42
43 | 44 | 45 | 46 | 47 | 48 | 49
Start here: the upper left state. End here: the goal state. X = walls/obstacles - can't go here. Assume a reward at the goal.
[Figure: the same grid with each open cell labelled by its value: 100 at the goal, then 90, 81, 73, 66, 59, 53, 48, 43, 39, 35, 31, 28 as you move one step farther from the goal - each step away multiplies the value by 0.9.]
100 x 0.9 = 90
90 x 0.9 = 81
81 x 0.9 = 73
73 x 0.9 = 66
66 x 0.9 = 59
59 x 0.9 = 53
53 x 0.9 = 48
48 x 0.9 = 43
43 x 0.9 = 39
39 x 0.9 = 35
35 x 0.9 = 31
31 x 0.9 = 28
This sort of* says: "the expected reward averaged over all the paths that start out in the upper left state is 28."
* not quite, though, because we haven't taken into account all the wandering paths we could take
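The whole chain of numbers above is just repeated multiplication by gamma, starting from the reward of 100 at the goal. A sketch:
// Propagate the goal reward backwards along the 12-step path, gamma = 0.9.
const gamma = 0.9;
let v = 100;                 // reward at the goal
const chain = [v];
for (let step = 0; step < 12; step++) {
  v = v * gamma;             // each state is worth 0.9 times the next state toward the goal
  chain.push(Math.round(v));
}
console.log(chain);  // [100, 90, 81, 73, 66, 59, 53, 48, 43, 39, 35, 31, 28]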
The Plan: try out everything over and over and over again and keep track of how everything works out each time.
Putting it all together:
A grid of STATES: S = {s1, s2, ..., sn}
A set of actions: A = {a1, a2, a3, a4}
Transition table: for each STATE, the state that each ACTION (U, D, R, L) takes you to. X = the move is blocked (edge of the grid).

STATE | U | D | R | L
---|---|---|---|---
1 | X | 7 | 2 | X
2 | X | 8 | 3 | 1
3 | X | 9 | 4 | 2
4 | X | 10 | 5 | 3
5 | X | 11 | 6 | 4
6 | X | 12 | X | 5
7 | 1 | 13 | 8 | X
8 | 2 | 14 | 9 | 7
9 | 3 | 15 | 10 | 8
10 | 4 | 16 | 11 | 9
11 | 5 | 17 | 12 | 10
12 | 6 | 18 | X | 11
13 | 7 | 19 | 14 | X
14 | 8 | 20 | 15 | 13
15 | 9 | 21 | 16 | 14
16 | 10 | 22 | 17 | 15
17 | 11 | 23 | 18 | 16
18 | 12 | 24 | X | 17
19 | 13 | 25 | 20 | X
20 | 14 | 26 | 21 | 19
21 | 15 | 27 | 22 | 20
22 | 16 | 28 | 23 | 21
23 | 17 | 29 | 24 | 22
24 | 18 | 30 | X | 23
25 | 19 | 31 | 26 | X
26 | 20 | 32 | 27 | 25
27 | 21 | 33 | 28 | 26
28 | 22 | 34 | 29 | 27
29 | 23 | 35 | 30 | 28
30 | 24 | 36 | X | 29
31 | 25 | X | 32 | X
32 | 26 | X | 33 | 31
33 | 27 | X | 34 | 32
34 | 28 | X | 35 | 33
35 | 29 | X | 36 | 34
36 | 30 | X | X | 35
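One way to hold this table in code - a sketch; the object layout and the use of null for X are my choices, not from the slides:
// transitions[s] gives the state each action leads to from state s (null = blocked).
const transitions = {
  1: { U: null, D: 7,  R: 2, L: null },
  2: { U: null, D: 8,  R: 3, L: 1 },
  7: { U: 1,    D: 13, R: 8, L: null },
  // ... one entry per state, exactly as in the table above
};

console.log(transitions[1].D);  // 7: DOWN from state 1 lands in state 7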
In each state an action takes me to another state.
The rule I use for choosing actions in each state is my policy.
For example: mostly I go down, sometimes right or up, rarely left.
STATE | UP | DOWN | LEFT | RIGHT
---|---|---|---|---
1 | 0 | .5 | 0 | .5
2 | 0 | .33 | .33 | .33
3 | 0 | .33 | .33 | .33
4 | 0 | .33 | .33 | .33
5 | 0 | .33 | .33 | .33
6 | 0 | .5 | .5 | 0
... | ... | ... | ... | ...
[Figure: the top row of the grid, states 1 through 6]
policy = [ [0, 0.5, 0.5, 0],      // one row per state: action probabilities
           [0, 0.33, 0.33, 0.33],
           [0, 0.33, 0.33, 0.33],
           [0, 0.33, 0.33, 0.33],
           [0, 0.33, 0.33, 0.33],
           [0, 0.5, 0.5, 0]
           // ... one row for each remaining state
         ];
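To follow a policy like this, the agent samples an action from the probabilities in the current state's row. A minimal sketch (the function name is mine):
// Sample an action index from one row of the policy, e.g. [0, 0.5, 0.5, 0].
function sampleAction(probs) {
  const r = Math.random();
  let cumulative = 0;
  for (let i = 0; i < probs.length; i++) {
    cumulative += probs[i];
    if (r < cumulative) return i;
  }
  return probs.length - 1;  // guard: rows like [0, .33, .33, .33] only sum to 0.99
}

console.log(sampleAction([0, 0.5, 0.5, 0]));  // 1 or 2, each about half the time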
Repeat a task many times to determine a policy that will maximize the expected reward.
Recall our discussion of "points" in a solution space.
What combination of, say, Kp, Ki, Kd yields the minimal error?
What combination of weights in a neural network yields the best predictions?
What combination of actions in states (policy) yields the highest reward?
For computer vision, some of the readings introduced the idea of "gradient descent" - learning by moving in the "weight landscape" toward weight combinations that reduce the cost/loss.
HERE, the learning proceeds by moving in the direction of increasing reward on the policy landscape.
state "Quality" estimate
policy in this state
reward in this state
walls
goal state
Current policy in this state
"Quality" of this state
R=+1.0
R=-1.0
R=-1.0
R=+1.0
R=-1.0
R=-1.0
Policy Evaluation (one sweep)
Q of each state is just equal to the reward of that state
policy in this state is .25 in every direction
Left is blocked, so that action leaves us in this state (whose value is -1); up leads to a state with no reward; right leads to a state with R=-1; down leads to a state with R=+1.
Each move contributes (probability) x (next state's value) x gamma, e.g. 0.25 x 1 x 0.9 = 0.225 ≈ 0.23.
So the four actions contribute 0, -0.23, +0.23, and -0.23. Their sum is -0.23; adding this state's own reward of -1.0 gives a new Q of -1.23.
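The same single-state update as a sketch in code, under the assumptions just described (random 0.25 policy, gamma = 0.9, and neighbour values equal to their rewards):
const gamma = 0.9;
const ownReward = -1.0;
const probs      = { up: 0.25, down: 0.25, right: 0.25, left: 0.25 };
const nextValues = { up: 0.0,  down: 1.0,  right: -1.0, left: -1.0 };  // left is blocked: it "leads" back to this state

let expectedNext = 0;
for (const a of Object.keys(probs)) {
  expectedNext += probs[a] * gamma * nextValues[a];  // e.g. 0.25 x 0.9 x 1.0 ≈ 0.23
}
console.log(ownReward + expectedNext);  // -1.225, which the slide rounds to -1.23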
Initial Condition
1 policy evaluation
2 policy evaluations
Q of each state is just equal to the reward of that state
Initial Condition
1 policy evaluation
1 policy update
The policy here is updated from random.
The L and D options lead to states with Q = -1.0; U and R lead to states with Q = 0.0.
The new policy here is U = 0.5, R = 0.5.
Evaluate: recalculate the expected reward from this state based on the current policy here and the current quality of the next states.
Update: reset the policy based on the best alternative next states.
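A sketch of that update step: make the policy greedy over the neighbours' current Q values, splitting probability evenly among ties (the function name is mine):
// Greedy policy improvement for one state, given the Q value each action leads to.
function greedyPolicy(nextQ) {
  const actions = Object.keys(nextQ);
  const best = Math.max(...actions.map(a => nextQ[a]));
  const winners = actions.filter(a => nextQ[a] === best);
  const policy = {};
  for (const a of actions) policy[a] = winners.includes(a) ? 1 / winners.length : 0;
  return policy;
}

// The example above: L and D lead to Q = -1.0, U and R to Q = 0.0.
console.log(greedyPolicy({ U: 0.0, D: -1.0, L: -1.0, R: 0.0 }));  // { U: 0.5, D: 0, L: 0, R: 0.5 }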
One policy evaluation step has assigned values to the states that have rewards in them. And then the agent's policy was updated.
Consider the state outlined in purple. What will its value be if we evaluate the current policy?
The state's previous value was -1.0. The policy for this state has only one option: down. Following this policy lands us in a state with reward 1.0. But this state is one step away, so we discount the reward with gamma = 0.9. The result is -1.0 + 0.9 x 1.0 = -0.1.
YOU ARE HERE
Each state has a "quality" Q based on the expected value of the rest-of-your-life paths that start there.
What's your policy? How has it evolved? Remember not to play it too safely!
Let's play with Andrej Karpathy's Gridworld RL simulation written using the ReinforceJS library. The next few paragraphs paraphrase his text. The simulation takes place in what we call a "toy environment" called Gridworld - a rectangle of squares on which an "agent" can move. Each square represents a state. One square (upper left) is the start square. Another square is the goal square. In this particular simulation:
In other words, this is a deterministic, finite Markov Decision Process (MDP) and as always the goal is to find an agent policy (shown here by arrows) that maximizes the future discounted reward. My favorite part is letting Value iteration converge, then changing the cell rewards and watching the policy adjust.
Interface. The color of the cells (initially all white) shows the current estimate of the Value (discounted reward) of that state, with the current policy. Note that you can select any cell and change its reward with the Cell reward slider.
TensorFlow
Google's AI/ML platform
"Competitors": OpenCV, Caffe2, pyTorch, Keras, and many more
scalar
vector
matrix
tensor
Nodes have a weight for each input.
Layers have multiple nodes.
Models have multiple layers.
You get the picture.
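In TensorFlow.js those four shapes look like this - a sketch with placeholder values, assuming the library is already loaded (e.g. via a script tag) as the global tf:
const scalar = tf.scalar(3.14);                        // rank 0: one number
const vector = tf.tensor1d([1, 2, 3]);                 // rank 1: a list
const matrix = tf.tensor2d([[1, 2], [3, 4]]);          // rank 2: rows and columns
const tensor = tf.tensor3d([[[1], [2]], [[3], [4]]]);  // rank 3 and up
console.log(scalar.rank, vector.rank, matrix.rank, tensor.rank);  // 0 1 2 3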
Resources
image classification
object detection
body segmentation
pose estimation
language embedding
speech recognition
// A small convolutional network in TensorFlow.js.
// The specific layer parameters are illustrative (e.g. a 28x28 grayscale input); the slide left them out.
const model = tf.sequential();
model.add(tf.layers.conv2d({ inputShape: [28, 28, 1], filters: 8, kernelSize: 5, activation: 'relu' }));
model.add(tf.layers.maxPooling2d({ poolSize: 2, strides: 2 }));
model.add(tf.layers.conv2d({ filters: 16, kernelSize: 5, activation: 'relu' }));
model.add(tf.layers.maxPooling2d({ poolSize: 2, strides: 2 }));
model.add(tf.layers.flatten());
model.add(tf.layers.dense({ units: 10, activation: 'softmax' }));
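Before training, a model like this also needs an optimizer and a loss; a typical next line (the specific choices here are illustrative, not from the slides) would be:
model.compile({ optimizer: 'adam', loss: 'categoricalCrossentropy', metrics: ['accuracy'] });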
[Figure: a convolution's kernel sliding across the input with a given stride]