Suppose I wanted to teach a robot to sort a deck of cards. One approach would be to come up with an algorithm for sorting cards and then tell the robot to follow the recipe. A good sorting recipe should work on piles of any size, stop when it is done, and complete the task efficiently.
But would it be possible to train a robot to do this task just by giving it a reward for succeeding, without ever telling it what the task is or how to carry it out?
To keep things simple, imagine a world consisting of two cards and four "cells": two where the cards are laid out and two spare cells that the cards can be moved through in order to exchange places. There are 12 arrangements with the cards in different cells and four where they end up in the same cell ("collisions").
[Figure: the 16 card arrangements. Twelve place cards A and B in different cells (for example "A | B" and "B | A"); four are collisions with both cards in one cell ("AB"). Labels: Arm Up, Arm Down, Collisions.]
So we have 16 card states and 8 arm states. Combined that gives 16x8=128 states.
One is the start state and one is the end state.
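The count can be checked by enumerating the states directly. This is only a sketch: the cell numbering 0 to 3 is arbitrary, and it assumes the 8 arm states come from the arm being over one of the four cells while either raised or lowered, which the slide does not spell out.

```python
from itertools import product

CELLS = range(4)              # four cells; the numbering 0..3 is an assumption
ARM_HEIGHTS = ("up", "down")  # assumed: the arm is either raised or lowered

# A card state assigns one cell to card A and one cell to card B.
card_states = list(product(CELLS, CELLS))
distinct = [(a, b) for a, b in card_states if a != b]
collisions = [(a, b) for a, b in card_states if a == b]

# Assumed arm state: which cell the arm is over, and whether it is up or down.
arm_states = list(product(CELLS, ARM_HEIGHTS))

# A full state pairs a card state with an arm state.
states = list(product(card_states, arm_states))

print(len(distinct), len(collisions), len(arm_states), len(states))
# prints: 12 4 8 128
```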
The system moves from state to state via one of 6 possible actions:
UP
DOWN
EAST
WEST
NORTH
SOUTH
[Figure: start state "B | A", end state "A | B".]
[Figure: step-by-step card positions from start to end. Move labels, in the order they appear: down, north, up, east, south, down, west, up, north, down, east, south, up.]
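The move labels in the walkthrough can be replayed with a small simulator. This is a sketch under several assumptions that the slides do not state: the four cells form a 2x2 grid, the cards start in the south row, the arm starts raised above the west card, a lowered arm drags whatever card is under it, and a move off the table is ignored. The names `step` and `MOVES` are made up for illustration.

```python
# Assumed layout: (row, col) with row 0 = north, row 1 = south,
# col 0 = west, col 1 = east.
MOVES = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}

def step(cards, arm, arm_down, action):
    """Apply one action and return the new (cards, arm, arm_down)."""
    if action in ("up", "down"):
        return dict(cards), arm, action == "down"
    dr, dc = MOVES[action]
    target = (arm[0] + dr, arm[1] + dc)
    if not (0 <= target[0] < 2 and 0 <= target[1] < 2):
        return dict(cards), arm, arm_down  # assumed: off-table moves do nothing
    # Assumed: a lowered arm carries any card in its cell along with it.
    moved = {name: (target if arm_down and pos == arm else pos)
             for name, pos in cards.items()}
    return moved, target, arm_down

# Start state "B | A": B in the south-west cell, A in the south-east cell.
cards = {"B": (1, 0), "A": (1, 1)}
arm, arm_down = (1, 0), False  # assumed starting pose: raised, above B

sequence = ["down", "north", "up", "east", "south", "down", "west",
            "up", "north", "down", "east", "south", "up"]
for action in sequence:
    cards, arm, arm_down = step(cards, arm, arm_down, action)

print(cards)  # A ends in the south-west cell, B in the south-east: "A | B"
```

Under these assumptions the thirteen moves park B in the spare row, slide A across, and then bring B down on the other side, so the two cards exchange places without ever colliding.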
Instruct a Robot to Find the Largest Card
The robot can only:
Point at the cards as card1, card2, etc.
Read the value of a card.
Remember things.
Compare things.
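Using only those four abilities, finding the largest card might look like the sketch below. The function name `find_largest` and the card values are illustrative, pointing is modeled as a list index, and the sketch assumes there is at least one card on the table.

```python
def find_largest(cards):
    """Return the position (0-based) of the largest card."""
    best_pointer = 0               # point at card1 and remember where it is
    best_value = cards[0]          # read its value and remember it
    for pointer in range(1, len(cards)):   # point at card2, card3, ...
        value = cards[pointer]             # read the value of that card
        if value > best_value:             # compare with the remembered value
            best_pointer, best_value = pointer, value  # remember the new best
    return best_pointer

cards = [7, 2, 10, 4]  # illustrative values only
print(find_largest(cards))  # prints 2, i.e. card3
```

The robot never needs to see all the cards at once: it remembers a single pointer and a single value, and it makes one comparison per card.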
A robot arm hovers above a table. It can move in four directions (north, south, east, west), and it can go down to touch the table or go up to hover above it.
[Figure: arm actions DOWN, UP, EAST, WEST, NORTH, SOUTH, HOME.]
The arm can go to its "HOME" position
It can also COMPARE the card it is hovering over with the card to its right ("A>B?").
[Figure: animation frames in which the arm returns HOME and asks "A>B?" of adjacent cards. The answers shown, in order: YES, NO, YES, YES.]
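One way to turn the COMPARE and HOME actions into a sorting recipe is sketched below. The slides show only the question and its YES/NO answers, so the rule "swap the two cards on YES, and go HOME and repeat until a pass has no YES" is an assumption. That rule is the classic bubble sort, not necessarily the method these slides go on to use. The names `compare_pass` and `sort_cards` are made up for illustration.

```python
def compare_pass(cards):
    """Walk left to right once, asking A>B? of each adjacent pair.

    Swaps the pair in place on YES (assumed rule) and returns the answers.
    """
    answers = []
    for i in range(len(cards) - 1):
        if cards[i] > cards[i + 1]:          # A>B? YES
            cards[i], cards[i + 1] = cards[i + 1], cards[i]
            answers.append("YES")
        else:                                # A>B? NO
            answers.append("NO")
    return answers

def sort_cards(cards):
    """Repeat passes from HOME until a whole pass answers NO every time."""
    cards = list(cards)  # work on a copy so the caller's list is untouched
    while "YES" in compare_pass(cards):
        pass             # go HOME and start another pass
    return cards

print(sort_cards([3, 1, 2]))  # prints [1, 2, 3]
```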
Action sequence, built up one move at a time (D = down, U = up, N = north, S = south, E = east, W = west):
D, N, E, U, S, D, W, U, N, W, D, S, U