@KatharineCodes @JavaFXpert
Challenge: Given that there is only one state that gives a reward, how can the agent work out what actions will get it to the reward?
(AKA the credit assignment problem)
Goal of an episode is to maximize total reward
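In standard RL notation (not from the slides), the total reward for an episode that ends at step T is simply the sum of the per-step rewards the agent receives:

```latex
G = r_1 + r_2 + \cdots + r_T = \sum_{t=1}^{T} r_t
```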
From BasicBehavior example in https://github.com/jmacglashan/burlap_examples
In this example, all actions are deterministic
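To illustrate what "deterministic" means here, the sketch below is a standalone illustration (hypothetical class and method names, not the BURLAP GridWorldDomain API): the same action from the same cell always produces the same next state, with moves off the grid leaving the agent where it is.

```java
// Standalone illustration of deterministic grid-world dynamics
// (hypothetical types; not the BURLAP GridWorldDomain API).
public class DeterministicGridWorld {
    private final int width, height;

    public DeterministicGridWorld(int width, int height) {
        this.width = width;
        this.height = height;
    }

    // The same (state, action) pair always produces the same next state.
    public int[] step(int x, int y, String action) {
        int nx = x, ny = y;
        switch (action) {
            case "LEFT":  nx = x - 1; break;
            case "RIGHT": nx = x + 1; break;
            case "DOWN":  ny = y - 1; break;
            case "UP":    ny = y + 1; break;
        }
        // Moves that would leave the grid keep the agent in place.
        if (nx < 0 || nx >= width || ny < 0 || ny >= height) {
            return new int[] {x, y};
        }
        return new int[] {nx, ny};
    }
}
```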
Low discount factors cause the agent to prefer immediate rewards
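With discounting, the agent maximizes the discounted return (standard notation, not from the slides). A discount factor γ close to 0 shrinks future terms quickly, so only near-term rewards matter; γ close to 1 values distant rewards almost as highly as immediate ones:

```latex
G_t = r_{t+1} + \gamma\, r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \qquad 0 \le \gamma \le 1
```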
How often should the agent try new paths vs. greedily taking known paths?
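A common answer is an epsilon-greedy policy. The sketch below (illustrative names; qValues[i] is assumed to be the learned value of action i in the current state) explores a random action with probability epsilon and otherwise exploits the best-known action:

```java
import java.util.Random;

// Minimal epsilon-greedy action selection sketch.
public class EpsilonGreedy {
    private final Random random = new Random();
    private final double epsilon; // e.g. 0.1 = explore 10% of the time

    public EpsilonGreedy(double epsilon) {
        this.epsilon = epsilon;
    }

    public int chooseAction(double[] qValues) {
        if (random.nextDouble() < epsilon) {
            return random.nextInt(qValues.length); // explore: random action
        }
        int best = 0;                              // exploit: best known action
        for (int i = 1; i < qValues.length; i++) {
            if (qValues[i] > qValues[best]) {
                best = i;
            }
        }
        return best;
    }
}
```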
States (x, y) | Left | Right | Up | Down |
---|---|---|---|---|
... | | | | |
2, 7 | 2.65 | 4.05 | 0.00 | 3.20 |
2, 8 | 3.65 | 4.50 | 4.50 | 3.65 |
2, 9 | 4.05 | 5.00 | 5.00 | 4.05 |
2, 10 | 4.50 | 4.50 | 5.00 | 3.65 |
... | | | | |

Q-Learning table of expected values (cumulative discounted rewards) as a result of taking an action (column) from a state (row) and following an optimal policy thereafter. Here's an explanation of how calculations in a Q-Learning table are performed: http://artint.info/html/ArtInt_265.html
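To make the table above concrete, here is a minimal, self-contained sketch (not BURLAP code; class and key names are hypothetical) of the standard tabular Q-learning update: each entry is nudged toward the observed reward plus the discounted best value of the next state.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal, self-contained sketch of the tabular Q-learning update
// (not BURLAP API; state/action keys are illustrative).
public class QTableSketch {
    private final Map<String, Double> qTable = new HashMap<>();
    private final double learningRate = 0.1;    // alpha
    private final double discountFactor = 0.99; // gamma

    private double q(String state, String action) {
        return qTable.getOrDefault(state + "|" + action, 0.0);
    }

    // Standard Q-learning update after observing (state, action, reward, nextState).
    public void update(String state, String action, double reward,
                       String nextState, String[] nextActions) {
        double maxNextQ = 0.0;
        for (String a : nextActions) {
            maxNextQ = Math.max(maxNextQ, q(nextState, a));
        }
        double target = reward + discountFactor * maxNextQ;
        double oldQ = q(state, action);
        qTable.put(state + "|" + action, oldQ + learningRate * (target - oldQ));
    }
}
```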
Learning to win from experience rather than by being trained
Our learning agent is the "X" player, receiving +5 for winning, -5 for losing, and -1 for each turn
The "O" player is part of the Environment. State and reward updates that it gives the Agent consider the "O" play.
States | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|---|
O I X I O X X I O, O won | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
I I I I I I O I X, in prog | 1.24 | 1.54 | 2.13 | 3.14 | 2.23 | 3.32 | N/A | 1.45 | N/A |
I I O I I X O I X, in prog | 2.34 | 1.23 | N/A | 0.12 | 2.45 | N/A | N/A | 2.64 | N/A |
I I O O X X O I X, in prog | +4.0 | -6.0 | N/A | N/A | N/A | N/A | N/A | -6.0 | N/A |
X I O I I X O I X, X won | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
... | | | | | | | | | |

Q-Learning table of expected values (cumulative discounted rewards) as a result of taking an action from a state and following an optimal policy thereafter. Columns are the actions (indices 0–8 of the cells that can be played); an unoccupied cell is represented with an I in the States column.
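To tie the table to code, here is a minimal sketch (hypothetical names) that represents a board as a 9-character string like the States column ('X', 'O', or 'I' for unoccupied) and picks the playable cell with the highest Q-value:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of looking up the best move from a tic-tac-toe Q-table whose
// states are 9-character strings ('X', 'O', or 'I' for unoccupied).
public class TicTacToeQLookup {
    // Maps "state|cellIndex" to a learned Q-value (filled in during training).
    private final Map<String, Double> qTable = new HashMap<>();

    public int bestCellToPlay(String state) {
        int bestCell = -1;
        double bestValue = Double.NEGATIVE_INFINITY;
        for (int cell = 0; cell < 9; cell++) {
            if (state.charAt(cell) != 'I') {
                continue; // occupied cells are N/A, as in the table
            }
            double value = qTable.getOrDefault(state + "|" + cell, 0.0);
            if (value > bestValue) {
                bestValue = value;
                bestCell = cell;
            }
        }
        return bestCell; // -1 if the board is full
    }
}
```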
Andrew Ng video: https://www.coursera.org/learn/machine-learning/lecture/zcAuT/welcome-to-machine-learning
Iris flower dataset: https://en.wikipedia.org/wiki/Iris_flower_data_set
Visual neural net server: http://github.com/JavaFXpert/visual-neural-net-server
Visual neural net client: http://github.com/JavaFXpert/ng2-spring-websocket-client
Deep Learning for Java: http://deeplearning4j.org
Spring Initializr: http://start.spring.io
Kaggle datasets: http://kaggle.com
Tic-tac-toe client: https://github.com/JavaFXpert/tic-tac-toe-client
Gluon Mobile: http://gluonhq.com/products/mobile/
Tic-tac-toe REST service: https://github.com/JavaFXpert/tictactoe-player
Java app that generates a tic-tac-toe training dataset: https://github.com/JavaFXpert/tic-tac-toe-minimax
Understanding The Minimax Algorithm article: http://neverstopbuilding.com/minimax
Optimizing neural networks article: https://medium.com/autonomous-agents/is-optimizing-your-ann-a-dark-art-79dda77d103
A.I. Duet application: http://aiexperiments.withgoogle.com/ai-duet/view/
BURLAP library: http://burlap.cs.brown.edu
BURLAP examples, including BasicBehavior: https://github.com/jmacglashan/burlap_examples
Markov Decision Process: https://en.wikipedia.org/wiki/Markov_decision_process
Q-Learning table calculations: http://artint.info/html/ArtInt_265.html
Exploitation vs. exploration: https://en.wikipedia.org/wiki/Multi-armed_bandit
Reinforcement Learning: An Introduction: https://webdocs.cs.ualberta.ca/~sutton/book/bookdraft2016sep.pdf
Tic-tac-toe reinforcement learning app: https://github.com/JavaFXpert/tic-tac-toe-rl