Reinforcement learning in Scala

Chris Birchall

Reinforcement learning

Unsupervised learning

Fools say they learn from experience;

I prefer to learn from the experience of others.

Delayed rewards

Environment

Markov Decision Process

  • State machine satisfying Markov property
  • Defines two functions:
    • Given current state and an action, what is the next state?
    • Given current state, action and next state, what is the reward?

Environment.scala



trait Environment[State, Action] {

  def step(
    ...
  ): ...



}

Environment.scala



trait Environment[State, Action] {

  def step(
    currentState: State,
    actionTaken: Action
  ): ...



}

Environment.scala



trait Environment[State, Action] {

  def step(
    currentState: State,
    actionTaken: Action
  ): (State, Reward)



}

Environment.scala

type Reward = Double

trait Environment[State, Action] {

  def step(
    currentState: State,
    actionTaken: Action
  ): (State, Reward)



}

Environment.scala

type Reward = Double

trait Environment[State, Action] {

  def step(
    currentState: State,
    actionTaken: Action
  ): (State, Reward)

  def isTerminal(state: State): Boolean

}
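
As a concrete (if tiny) example of the trait, here is a sketch of a one-dimensional corridor environment. The cells, actions and rewards are assumptions for illustration, not the Gridworld used in the demo.

// A five-cell corridor: states are cell indices 0..4, actions are
// "left"/"right", and reaching cell 4 ends the episode with reward 1.
object Corridor extends Environment[Int, String] {

  def step(currentState: Int, actionTaken: String): (Int, Reward) = {
    val nextState = actionTaken match {
      case "left" => math.max(0, currentState - 1)
      case _      => math.min(4, currentState + 1)
    }
    val reward = if (nextState == 4) 1.0 else 0.0
    (nextState, reward)
  }

  def isTerminal(state: Int): Boolean = state == 4
}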

Agent

Agent

At every time step t, the agent:

  1. knows what state it is currently in
  2. chooses an action to take
  3. is told the new state, and what reward it received
  4. learns something!

AgentBehaviour.scala



 
                              

trait AgentBehaviour[AgentData, State, Action] {

  ...

}

AgentBehaviour.scala



 
                              

trait AgentBehaviour[AgentData, State, Action] {

  def chooseAction(
    agentData: AgentData,
    state: State,
    validActions: List[Action]
  ): ...

}

AgentBehaviour.scala






trait AgentBehaviour[AgentData, State, Action] {

  def chooseAction(
    agentData: AgentData,
    state: State,
    validActions: List[Action]
  ): (Action, ...)

}

AgentBehaviour.scala

type Reward = Double

case class ActionResult[State](reward: Reward, 
                               nextState: State)

trait AgentBehaviour[AgentData, State, Action] {

  def chooseAction(
    agentData: AgentData,
    state: State,
    validActions: List[Action]
  ): (Action, ActionResult[State] => AgentData)

}
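
For comparison, the simplest possible behaviour is a sketch like this: pick a random valid action and learn nothing. (Entirely an illustrative assumption; the learning agent comes later.)

// A behaviour with no agent data (Unit) and no learning: the update
// function ignores the ActionResult and returns Unit again.
object RandomBehaviour extends AgentBehaviour[Unit, Int, String] {

  def chooseAction(
      agentData: Unit,
      state: Int,
      validActions: List[String]): (String, ActionResult[Int] => Unit) = {
    val action = validActions(scala.util.Random.nextInt(validActions.size))
    (action, _ => ())
  }
}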

Runner

  • Start with initial agent data and state
  • At every time step:
    1. Ask the agent to choose an action
    2. Tell the environment, which will return the new state and a reward
    3. Tell these to the agent, which will return an improved version of itself
    4. (Update the UI)

 

Runner

var agentData    = initialAgentData
var currentState = initialState

def step(): Unit = {
  ...
}


Runner

var agentData    = initialAgentData
var currentState = initialState

def step(): Unit = {
  val (nextAction, updateAgent) =
    agentBehaviour.chooseAction(agentData, currentState, ...)

  ...
}


Runner

var agentData    = initialAgentData
var currentState = initialState

def step(): Unit = {
  val (nextAction, updateAgent) =
    agentBehaviour.chooseAction(agentData, currentState, ...)

  val (nextState, reward) =
    env.step(currentState, nextAction)

  ...
}


Runner

var agentData    = initialAgentData
var currentState = initialState

def step(): Unit = {
  val (nextAction, updateAgent) =
    agentBehaviour.chooseAction(agentData, currentState, ...)

  val (nextState, reward) = 
    env.step(currentState, nextAction)

  agentData = updateAgent(ActionResult(reward, nextState))
  currentState = nextState

  updateUI(agentData, currentState)
}
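
To run a whole episode, the runner might wrap step() in a loop until the environment reports a terminal state. A sketch, assuming env, initialState and the slide's variables are in scope:

// Run steps until the episode ends, then reset the state
// (but keep the learned agent data for the next episode).
def runEpisode(): Unit = {
  while (!env.isTerminal(currentState))
    step()
  currentState = initialState
}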


DEMO

Gridworld

How does it learn?

  • State-action values
  • Policies
  • Prediction and control
  • Model-free vs model-driven
  • Exploitation vs exploration
  • Bootstrapping

State-action values

For each state s

(e.g. agent is in cell (1, 2) on the grid)

and each action a

(e.g. "move left"),

Q(s, a) = estimate of the value of being in state s and taking action a

(Q*(s, a) = the optimal value)

Value?

Total return: the sum of all rewards received from that point onward
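
More precisely, in the standard discounted formulation (assuming the discount rate γ that appears later in QLearning.scala), the return from time t is:

G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}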

Policy

Agent needs a policy:

"If I'm in some state s, what action should I take?"

If we have a state-action value function Q(s, a),

then making a policy is trivial:

If I'm in state s,

choose the action a with the highest Q(s, a)

My Super Awesome Policy

(state, action)        Q(s, a)
((1, 1), Move Left)       1.4
((1, 1), Move Right)      9.3
((1, 1), Move Up)         2.2
((1, 1), Move Down)       3.7

"If I'm in state (1, 1), I should move right"

Reduce a hard problem (learning an optimal policy)

into two easier problems:

  • "evaluation"/"prediction"
    • measure the performance of some policy π
  • "improvement"/"control"
    • find a policy slightly better than that one

Prediction and control

  1. Start with arbitrary policy
  2. Repeat:
    1. Evaluate current policy
    2. Improve it slightly

General strategy
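
A minimal sketch of that loop, with hypothetical evaluate and improve functions standing in for whatever concrete prediction and control algorithm is used:

// Generalised policy iteration, sketched abstractly: alternate between
// evaluating the current policy and improving it, a fixed number of times.
// `Policy`, `Values`, `evaluate` and `improve` are all hypothetical here.
def generalStrategy[Policy, Values](
    initialPolicy: Policy,
    evaluate: Policy => Values,
    improve: (Policy, Values) => Policy,
    iterations: Int): Policy =
  (1 to iterations).foldLeft(initialPolicy) { (policy, _) =>
    improve(policy, evaluate(policy))
  }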

Model-free/model-driven

Exploitation/exploration

ε-greedy

follow the policy most of the time,

but occasionally pick a random action
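
A sketch of an ε-greedy choice, assuming the same shape as the epsilonGreedy helper called later in QLearning.scala: given the action values for the current state, return the chosen action and its current value.

// With probability ε pick a random action (explore),
// otherwise pick the highest-valued action (exploit).
def epsilonGreedy[Action](
    actionValues: Map[Action, Double],
    ε: Double): (Action, Double) = {
  val random = scala.util.Random
  val chosen =
    if (random.nextDouble() < ε)
      actionValues.keys.toVector(random.nextInt(actionValues.size))
    else
      actionValues.maxBy(_._2)._1
  (chosen, actionValues(chosen))
}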

Bootstrapping

Basing estimates on other estimates

Eventually converges to the right answer!

Recap

  • State-action values
  • Policies
  • Prediction and control
  • Model-free vs model-driven
  • Exploitation vs exploration
  • Bootstrapping

Q-learning

  • Model-free
  • Exploration
  • Bootstrapping

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max\limits_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]

Q-learning

where s_t is the current state, a_t is the action taken, r_{t+1} is the reward received, and s_{t+1} is the next state

QLearning.scala

case class QLearning[State, Action](
    α: Double, // step size, 0.0 ≦ α ≦ 1.0
    γ: Double, // discount rate, 0.0 ≦ γ ≦ 1.0
    ε: Double, // 0.0 ≦ ε ≦ 1.0
    Q: Map[State, Map[Action, Double]]
)

Agent data

QLearning.scala

object QLearning {

  implicit def agentBehaviour[State, Action] =
    new AgentBehaviour[QLearning[State, Action], State, Action] {

      type UpdateFn = 
        ActionResult[State] => QLearning[State, Action]

      def chooseAction(
          agentData: QLearning[State, Action],
          state: State,
          validActions: List[Action]): (Action, UpdateFn) = {

        ...

      }

    }

}

Agent behaviour

QLearning.scala

def chooseAction(
  agentData: QLearning[State, Action],
  state: State,
  validActions: List[Action]): (Action, UpdateFn) = {

  val actionValues = 
    agentData.Q.getOrElse(state, zeroForAllActions)

  // choose the next action
  val (chosenAction, currentActionValue) = 
    epsilonGreedy(actionValues, agentData.ε)

  ...

}
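
The zeroForAllActions default is elided on the slide; one plausible definition (an assumption) gives every valid action an initial value of zero:

// Default action values for a state the agent has never seen before.
val zeroForAllActions: Map[Action, Double] =
  validActions.map(_ -> 0.0).toMap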

Agent behaviour

QLearning.scala

val updateStateActionValue: UpdateFn = { actionResult =>

  val maxNextStateActionValue = ...

  val updatedActionValue = 
    currentActionValue + agentData.α * (
        actionResult.reward 
      + agentData.γ * maxNextStateActionValue
      - currentActionValue
    )

  val updatedQ = ...

  agentData.copy(Q = updatedQ)

}
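
The two elided values might be computed roughly like this (a sketch reusing names from the previous slides, not necessarily the talk's exact code):

// Best value currently estimated for the next state (0.0 if unseen).
val nextActionValues =
  agentData.Q.getOrElse(actionResult.nextState, zeroForAllActions).values
val maxNextStateActionValue =
  if (nextActionValues.isEmpty) 0.0 else nextActionValues.max

// Q with this state's entry replaced by the new estimate for the chosen action.
val updatedQ =
  agentData.Q.updated(
    state,
    actionValues.updated(chosenAction, updatedActionValue))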

Agent behaviour

Scaling up to more interesting problems

StateConversion.scala

trait StateConversion[EnvState, AgentState] {

  def convertState(envState: EnvState): AgentState

}
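
For instance, the pole-balancing demo's continuous readings could be bucketed into a small discrete state. The field names and bucket boundaries below are assumptions for illustration:

// Hypothetical continuous environment state for pole-balancing.
case class PoleState(angle: Double, angularVelocity: Double)

// Collapse it to a coarse (sign-of-angle, sign-of-velocity) pair
// so the tabular Q-learning agent has a small, finite state space.
object PoleStateConversion extends StateConversion[PoleState, (Int, Int)] {
  def convertState(envState: PoleState): (Int, Int) =
    (if (envState.angle < 0) 0 else 1,
     if (envState.angularVelocity < 0) 0 else 1)
}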

DEMO

Pole-balancing

HOMEWORK

Pacman!

Next steps

  • Smarter policies
    • Softmax, decay ε over time, ...
  • More efficient learning
    • TD(λ), Q(λ), eligibility traces, ...
  • Large/∞ state space
    • Function approximation
    • Deep RL

Learn more

Thank you!

Slides, demo and code:

https://cb372.github.io/rl-in-scala/
