Reinforcement learning
in Scala
Chris Birchall
Fools say they learn from experience;
I prefer to learn from the experience of others.
type Reward = Double
trait Environment[State, Action] {
  def step(
    currentState: State,
    actionTaken: Action
  ): (State, Reward)
  def isTerminal(state: State): Boolean
}
At every time step t
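As a concrete illustration of what an environment does at each time step, here is a minimal sketch of a grid world that implements the Environment trait. The `Cell` and `Move` types, the 5x5 grid size, the goal position and the reward values are all assumptions made up for this example; they are not taken from the talk's demo.

```scala
object GridWorldSketch {
  type Reward = Double

  // Same shape as the Environment trait shown in the slides
  trait Environment[State, Action] {
    def step(currentState: State, actionTaken: Action): (State, Reward)
    def isTerminal(state: State): Boolean
  }

  // Hypothetical state and action types for a small grid
  final case class Cell(x: Int, y: Int)

  sealed trait Move
  case object MoveLeft  extends Move
  case object MoveRight extends Move
  case object MoveUp    extends Move
  case object MoveDown  extends Move

  val size = 5          // assumed 5x5 grid
  val goal = Cell(4, 4) // assumed goal cell

  val gridWorld: Environment[Cell, Move] = new Environment[Cell, Move] {
    // Keep coordinates inside the grid: walking into a wall leaves the agent in place
    private def clamp(i: Int): Int = math.max(0, math.min(size - 1, i))

    def step(currentState: Cell, actionTaken: Move): (Cell, Reward) = {
      // Convention chosen for this sketch: "up" decreases y
      val next = actionTaken match {
        case MoveLeft  => currentState.copy(x = clamp(currentState.x - 1))
        case MoveRight => currentState.copy(x = clamp(currentState.x + 1))
        case MoveUp    => currentState.copy(y = clamp(currentState.y - 1))
        case MoveDown  => currentState.copy(y = clamp(currentState.y + 1))
      }
      // Assumed rewards: +10 for reaching the goal, -1 for every other step
      val reward = if (next == goal) 10.0 else -1.0
      (next, reward)
    }

    def isTerminal(state: Cell): Boolean = state == goal
  }
}
```

Note that `step` is a pure function of the current state and the action, so the environment itself holds no mutable state.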
 
                              
case class ActionResult[State](reward: Reward, 
                               nextState: State)
trait AgentBehaviour[AgentData, State, Action] {
  def chooseAction(
    agentData: AgentData,
    state: State,
    validActions: List[Action]
  ): (Action, ActionResult[State] => AgentData)
}
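To show how the returned function closes over the agent's decision, here is a minimal sketch of the simplest possible behaviour: an agent that picks a random valid action and only keeps a running total of its rewards. The `RandomAgent` data type and its `seed` field are hypothetical and exist only for this example; the sketch also assumes `validActions` is non-empty.

```scala
import scala.util.Random

object RandomAgentSketch {
  type Reward = Double
  case class ActionResult[State](reward: Reward, nextState: State)

  // Same shape as the AgentBehaviour trait shown in the slides
  trait AgentBehaviour[AgentData, State, Action] {
    def chooseAction(
        agentData: AgentData,
        state: State,
        validActions: List[Action]
    ): (Action, ActionResult[State] => AgentData)
  }

  // Hypothetical agent data: total reward so far, plus a seed so the agent stays immutable
  final case class RandomAgent(totalReward: Reward, seed: Long)

  def randomBehaviour[State, Action]: AgentBehaviour[RandomAgent, State, Action] =
    new AgentBehaviour[RandomAgent, State, Action] {
      def chooseAction(
          agentData: RandomAgent,
          state: State,
          validActions: List[Action]
      ): (Action, ActionResult[State] => RandomAgent) = {
        val rng      = new Random(agentData.seed)
        val action   = validActions(rng.nextInt(validActions.size)) // assumes a non-empty list
        val nextSeed = rng.nextLong()
        // The update function is called later, once the environment has responded
        val update = (result: ActionResult[State]) =>
          RandomAgent(agentData.totalReward + result.reward, nextSeed)
        (action, update)
      }
    }
}
```

Returning a function rather than exposing a separate `update` method means the agent cannot be updated without first having chosen an action.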
var agentData    = initialAgentData
var currentState = initialState
def step(): Unit = {
  val (nextAction, updateAgent) =
    agentBehaviour.chooseAction(agentData, currentState, ...)
  val (nextState, reward) = 
    env.step(currentState, nextAction)
  agentData = updateAgent(ActionResult(reward, nextState))
  currentState = nextState
  updateUI(agentData, currentState)
}
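The same loop can also be written without `var`s. The following is a sketch under stated assumptions, not the talk's demo code: `runEpisode`, the `maxSteps` cut-off, the `corridor` environment and the `alwaysLast` agent are all hypothetical names invented here, and the list of valid actions is assumed to be the same in every state.

```scala
object EpisodeLoopSketch {
  type Reward = Double
  case class ActionResult[State](reward: Reward, nextState: State)

  trait Environment[State, Action] {
    def step(currentState: State, actionTaken: Action): (State, Reward)
    def isTerminal(state: State): Boolean
  }

  trait AgentBehaviour[AgentData, State, Action] {
    def chooseAction(
        agentData: AgentData,
        state: State,
        validActions: List[Action]
    ): (Action, ActionResult[State] => AgentData)
  }

  // Repeats choose -> step -> update until a terminal state or the step limit is reached
  @annotation.tailrec
  def runEpisode[AgentData, State, Action](
      env: Environment[State, Action],
      behaviour: AgentBehaviour[AgentData, State, Action],
      validActions: List[Action],
      agentData: AgentData,
      state: State,
      maxSteps: Int
  ): (AgentData, State) =
    if (maxSteps <= 0 || env.isTerminal(state)) (agentData, state)
    else {
      val (action, updateAgent) = behaviour.chooseAction(agentData, state, validActions)
      val (nextState, reward)   = env.step(state, action)
      val nextAgent             = updateAgent(ActionResult(reward, nextState))
      runEpisode(env, behaviour, validActions, nextAgent, nextState, maxSteps - 1)
    }

  // Toy environment: positions 0..3 on a line, reward 1.0 on reaching position 3
  val corridor: Environment[Int, Int] = new Environment[Int, Int] {
    def step(currentState: Int, actionTaken: Int): (Int, Reward) = {
      val next = math.max(0, math.min(3, currentState + actionTaken))
      (next, if (next == 3) 1.0 else 0.0)
    }
    def isTerminal(state: Int): Boolean = state == 3
  }

  // Toy agent: always picks the last valid action; its data is the total reward so far
  val alwaysLast: AgentBehaviour[Double, Int, Int] = new AgentBehaviour[Double, Int, Int] {
    def chooseAction(
        agentData: Double,
        state: Int,
        validActions: List[Int]
    ): (Int, ActionResult[Int] => Double) =
      (validActions.last, (result: ActionResult[Int]) => agentData + result.reward)
  }
}
```

Calling `runEpisode(corridor, alwaysLast, List(-1, 1), 0.0, 0, 100)` walks the agent from position 0 to position 3 and returns the accumulated reward together with the final state.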
For each state s
(e.g. agent is in cell (1, 2) on the grid)
and each action a
(e.g. "move left"),
Q(s, a) = estimate of the value of being in state s and taking action a
(Q*(s, a) = the optimal value)
Value = the total return of all rewards from that point onward
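Written out, and assuming the usual discounted-return convention from Sutton & Barto (the slides do not spell this out), the return from time step t and the quantity that Q estimates are roughly as follows. Here γ is the discount rate that appears later in the agent data; with γ = 1 the return is simply the sum of all future rewards.

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

Q(s, a) \approx \mathbb{E}\left[ G_t \mid S_t = s,\ A_t = a \right]
```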
If we have a state-action value function Q(s, a),
then making a policy is trivial
Agent needs a policy:
"If I'm in some state s, what action should I take?"
If I'm in state s,
choose the action a with the highest Q(s, a)
| (state, action) | Q(s, a) | 
|---|---|
| ((1, 1), Move Left) | 1.4 | 
| ((1, 1), Move Right) | 9.3 | 
| ((1, 1), Move Up) | 2.2 | 
| ((1, 1), Move Down) | 3.7 | 
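Reading a policy off such a table amounts to taking an argmax over the actions for the current state. A minimal sketch, using the values from the table; the `greedyAction` helper and the use of plain strings for actions are assumptions for illustration only:

```scala
object GreedyPolicySketch {
  // Values copied from the table for state (1, 1)
  val q: Map[(Int, Int), Map[String, Double]] = Map(
    (1, 1) -> Map(
      "Move Left"  -> 1.4,
      "Move Right" -> 9.3,
      "Move Up"    -> 2.2,
      "Move Down"  -> 3.7
    )
  )

  // Returns the action with the highest Q(s, a), or None if the state has no entries yet
  def greedyAction[State, Action](
      table: Map[State, Map[Action, Double]],
      state: State
  ): Option[Action] =
    table.get(state).filter(_.nonEmpty).map(values => values.maxBy(_._2)._1)
}
```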
"If I'm in state (1, 1), I should move right"
Reduce a hard problem (learning an optimal policy)
into two easier problems:
ε-greedy
follow the policy most of the time,
but occasionally pick a random action
Eventually converges to the right answer!
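One possible shape for an ε-greedy choice is sketched below. It is not necessarily how the talk's `epsilonGreedy` helper is written: the explicit `rng` parameter is an assumption added here so the function is easy to test, and the map of action values is assumed to be non-empty.

```scala
import scala.util.Random

object EpsilonGreedySketch {
  // Returns the chosen action together with its current estimated value
  def epsilonGreedy[Action](
      actionValues: Map[Action, Double],
      ε: Double,
      rng: Random
  ): (Action, Double) = {
    val entries = actionValues.toVector // assumes at least one action
    if (rng.nextDouble() < ε)
      entries(rng.nextInt(entries.size)) // explore: uniformly random action
    else
      entries.maxBy(_._2)                // exploit: action with the highest value
  }
}
```

With ε = 0.0 the agent always follows the greedy policy; with ε = 1.0 it always acts randomly.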
Agent data
case class QLearning[State, Action](
    α: Double, // step size, 0.0 ≦ α ≦ 1.0
    γ: Double, // discount rate, 0.0 ≦ γ ≦ 1.0
    ε: Double, // probability of picking a random action, 0.0 ≦ ε ≦ 1.0
    Q: Map[State, Map[Action, Double]] // estimated value of each (state, action) pair
)
Agent behaviour
object QLearning {
  // Explicit result type: implicit definitions should not rely on type inference
  implicit def agentBehaviour[State, Action]
      : AgentBehaviour[QLearning[State, Action], State, Action] =
    new AgentBehaviour[QLearning[State, Action], State, Action] {
      type UpdateFn =
        ActionResult[State] => QLearning[State, Action]
      def chooseAction(
          agentData: QLearning[State, Action],
          state: State,
          validActions: List[Action]): (Action, UpdateFn) = {
        ...
      }
    }
}
def chooseAction(
  agentData: QLearning[State, Action],
  state: State,
  validActions: List[Action]): (Action, UpdateFn) = {
  val actionValues = 
    agentData.Q.getOrElse(state, zeroForAllActions)
  // choose the next action
  val (chosenAction, currentActionValue) = 
    epsilonGreedy(actionValues, agentData.ε)
  ...
}
val updateStateActionValue: UpdateFn = { actionResult =>
  val maxNextStateActionValue = ...
  val updatedActionValue = 
    currentActionValue + agentData.α * (
        actionResult.reward 
      + agentData.γ * maxNextStateActionValue
      - currentActionValue
    )
  val updatedQ = ...
  agentData.copy(Q = updatedQ)
}
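To make the arithmetic of the update concrete, here is a small sketch. The helpers `maxActionValue` and `updatedTable` are hypothetical ways of filling in the parts elided above, not necessarily what the original code does; in particular, treating an unseen state as having value 0.0 is an assumption.

```scala
object QUpdateSketch {
  // max over a' of Q(s', a'); an unseen or empty state counts as 0.0 (assumption)
  def maxActionValue[State, Action](
      q: Map[State, Map[Action, Double]],
      state: State
  ): Double =
    q.get(state).filter(_.nonEmpty).map(_.values.max).getOrElse(0.0)

  // The update rule from the slide: move the current estimate a fraction α
  // of the way towards (reward + γ * best value of the next state)
  def updatedValue(
      current: Double,
      reward: Double,
      maxNext: Double,
      α: Double,
      γ: Double
  ): Double =
    current + α * (reward + γ * maxNext - current)

  // Writes the new value back into the nested map without mutating the old one
  def updatedTable[State, Action](
      q: Map[State, Map[Action, Double]],
      state: State,
      action: Action,
      newValue: Double
  ): Map[State, Map[Action, Double]] =
    q.updated(state, q.getOrElse(state, Map.empty[Action, Double]).updated(action, newValue))
}
```

For example, with a current estimate of 2.0, a reward of 1.0, a best next-state value of 4.0 and α = γ = 0.5, the new estimate is 2.0 + 0.5 × (1.0 + 0.5 × 4.0 − 2.0) = 2.5.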
trait StateConversion[EnvState, AgentState] {
  def convertState(envState: EnvState): AgentState
}
Reinforcement Learning: An Introduction (Sutton & Barto)
Slides, demo and code: