(achieve superhuman performance)
Learn features from samples
No more input than a human sees (pixels)
Learn new features in the same model while preserving the existing ones
Improve existing model
Reuse existing model
One model per game
Algorithm is the same for all games
Hyperparameters for the algorithm are the same for all games
state = game.start()
pixels = state.getPixels()                      # raw screen pixels are the only input
rewards = model.predict(pixels)                 # one predicted Q value per possible action
bestAction = getBestActionFromRewards(rewards)  # pick the action with the highest predicted value
state = game.move(bestAction)                   # take that action and observe the new state
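In code, getBestActionFromRewards boils down to an argmax over the model's output vector. A minimal sketch, assuming a trained Keras model and a preprocessed 84x84x4 state (both assumed here, not defined in the original):

import numpy as np

def get_best_action_from_rewards(model, state):
    # state: preprocessed stack of the last frames, shape (84, 84, 4); model: a trained Keras network (assumed)
    q_values = model.predict(state[np.newaxis, ...])[0]  # one predicted future reward per action
    return int(np.argmax(q_values))                      # pick the action with the highest prediction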
import gym

env = gym.make('BreakoutDeterministic-v0')
env.reset()

while True:
    env.render()
    # Random action
    action = env.action_space.sample()
    next_state, reward, is_done, info = env.step(action)
    print("Reward: {}, is_done: {}".format(reward, is_done))
    if is_done:
        env.reset()
(replace with pseudocode)
Given a state and an action, predict Q and update the network with the target Q value for that action
The next state is the current state with the newest frame appended at the end and the oldest frame removed, so the network always sees the most recent four frames
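A minimal sketch of that sliding window of frames with numpy (the 84x84 frame size comes from the network's input shape further down; the function name is an assumption):

import numpy as np

def next_state_from(state, new_frame):
    # state: the last 4 preprocessed frames stacked along the channel axis, shape (84, 84, 4)
    # new_frame: the newest preprocessed frame, shape (84, 84)
    # Drop the oldest frame (channel 0) and append the newest one at the end.
    return np.append(state[:, :, 1:], new_frame[:, :, np.newaxis], axis=2)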
Predict an action from the model
Make the next move in the game with that action
Record the resulting state, the action taken and the reward
Put that transition into a replay memory holding up to 1 million elements
Each time a game finishes or a step threshold is reached, pick 32 transitions at random from the memory
Calculate the target Q value for each of these 32 transitions
Train the network on these 32 transitions and their target Q values
Start a new game
In short: save the transition in memory, play until you die, train a batch of 32 elements against the network, and start a new game (sketched in code below)
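A minimal sketch of that memory and training step in Python, assuming a Keras-style model with predict/fit and states stored as 84x84x4 arrays (the helper names, the gamma value of 0.95 and the preprocessing are assumptions):

import random
from collections import deque
import numpy as np

memory = deque(maxlen=1000000)   # replay memory holding up to 1 million transitions
gamma = 0.95                     # discount factor, matching the worked example below

def remember(state, action, reward, next_state, done):
    # Store one transition after every move in the game.
    memory.append((state, action, reward, next_state, done))

def replay(model, batch_size=32):
    # Sample 32 random transitions and train the network on their target Q values.
    batch = random.sample(memory, batch_size)
    states = np.array([t[0] for t in batch])
    next_states = np.array([t[3] for t in batch])
    q_current = model.predict(states)      # the network's current predictions, shape (32, num_actions)
    q_next = model.predict(next_states)    # predictions for the following states
    for i, (state, action, reward, next_state, done) in enumerate(batch):
        # Only the slot of the action that was actually taken is overwritten with the target.
        q_current[i][action] = reward if done else reward + gamma * np.max(q_next[i])
    model.fit(states, q_current, verbose=0)  # one training pass on the 32 transitions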
target = reward + γ × max Q(next state) = 1 + 0.95 × 0.7 = 1.665
The network is trained with 1.665 written into the output slot for the action that was taken in the current state; the other slots keep the network's own predictions:
[0.45][0.23][1.665][0.7664]
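The same arithmetic in numpy, assuming the action taken was index 2 and using a placeholder for its old prediction (both assumptions, not from the original):

import numpy as np

q_target = np.array([0.45, 0.23, 0.0, 0.7664])   # the network's own predictions; index 2 is a placeholder
q_target[2] = 1 + 0.95 * 0.7                     # reward + gamma * best Q of the next state = 1.665
# q_target is now [0.45, 0.23, 1.665, 0.7664], the vector the network is trained towards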
initialize Q(num_states, num_actions) arbitrarily
observe initial state s
repeat
    select and carry out an action a
    observe reward r and new state s'
    Q(s, a) = r + γ max_a' Q(s', a')
    s = s'
until terminated
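A runnable sketch of this tabular version, assuming the old four-value gym step API used in the Breakout example above and the FrozenLake-v0 environment (neither is mentioned in the original); the ε-greedy exploration is borrowed from the DQN pseudocode further down:

import gym
import numpy as np

env = gym.make('FrozenLake-v0')        # any environment with a small, discrete state space works
Q = np.zeros((env.observation_space.n, env.action_space.n))
gamma = 0.95
epsilon = 0.1

for episode in range(5000):
    s = env.reset()                    # observe initial state s
    done = False
    while not done:
        # select and carry out an action a (epsilon-greedy instead of purely greedy)
        a = env.action_space.sample() if np.random.rand() < epsilon else int(np.argmax(Q[s]))
        s_next, r, done, _ = env.step(a)           # observe reward r and new state s'
        Q[s, a] = r + gamma * np.max(Q[s_next])    # the simplified update from the pseudocode above
        s = s_next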
initialize replay memory D
initialize action-value function Q with random weights
observe initial state s
repeat
    select an action a
        with probability ε select a random action
        otherwise select a = argmax_a' Q(s, a')
    carry out action a
    observe reward r and new state s'
    store experience <s, a, r, s'> in replay memory D

    sample random transitions <ss, aa, rr, ss'> from replay memory D
    calculate target for each minibatch transition
        if ss' is a terminal state then tt = rr
        otherwise tt = rr + γ max_a' Q(ss', aa')
    // the squared error is the loss that gets backpropagated
    train the Q network using (tt - Q(ss, aa))^2 as loss

    s = s'
until terminated
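The action-selection step of this pseudocode as a small Python helper (the function name, the ε value of 0.1 and the Keras-style model are assumptions):

import random
import numpy as np

def select_action(model, state, num_actions, epsilon=0.1):
    # With probability epsilon take a random action (exploration) ...
    if random.random() < epsilon:
        return random.randrange(num_actions)
    # ... otherwise take a = argmax_a' Q(s, a') (exploitation).
    q_values = model.predict(state[np.newaxis, ...])[0]
    return int(np.argmax(q_values))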
Convolution2D(32, 8, 8, subsample=(4, 4), input_shape=(84, 84, 4))  # 84x84 pixels, 4 stacked frames; subsample = stride
Activation('relu')
Convolution2D(64, 4, 4, subsample=(2, 2))
Activation('relu')
Convolution2D(64, 3, 3, subsample=(1, 1))
Activation('relu')
Flatten()       # flatten the convolutional features before the dense layers
Dense(512)
Activation('relu')
Dense(4)        # one output per possible action
Activation('linear')   # raw Q values, so no squashing activation
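The same architecture assembled into a complete model, here written with the newer Keras 2 names (Conv2D and strides instead of Convolution2D and subsample); the RMSprop optimizer follows the DQN paper and the mean squared error loss matches the (tt - Q(ss, aa))^2 loss above:

from keras.models import Sequential
from keras.layers import Conv2D, Flatten, Dense

model = Sequential([
    Conv2D(32, (8, 8), strides=(4, 4), activation='relu', input_shape=(84, 84, 4)),
    Conv2D(64, (4, 4), strides=(2, 2), activation='relu'),
    Conv2D(64, (3, 3), strides=(1, 1), activation='relu'),
    Flatten(),                          # flatten the convolutional features for the dense layers
    Dense(512, activation='relu'),
    Dense(4, activation='linear'),      # one Q value per possible action in Breakout
])
model.compile(optimizer='rmsprop', loss='mse')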