Reinforcement learning

in Scala

Chris Birchall

Reinforcement learning

Unsupervised learning

Fools say they learn from experience;

I prefer to learn from the experience of others.

Delayed rewards

Environment

Markov Decision Process

  • A state machine satisfying the Markov property
  • Defines two functions:
    • Given the current state and an action, what is the next state?
    • Given the current state, the action and the next state, what is the reward?

Environment.scala

type Reward = Double

trait Environment[State, Action] {

  def step(
    currentState: State,
    actionTaken: Action
  ): (State, Reward)

  def isTerminal(state: State): Boolean

}
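
As a concrete, hypothetical example (not code from the talk): a tiny one-dimensional environment in which the agent walks left or right along a corridor and is rewarded only for reaching the right-hand end.

// hypothetical example implementation of the Environment trait
sealed trait CorridorAction
case object MoveLeft  extends CorridorAction
case object MoveRight extends CorridorAction

case class CorridorState(position: Int) // 0 to 5, where 5 is the goal

object Corridor extends Environment[CorridorState, CorridorAction] {

  def step(
    currentState: CorridorState,
    actionTaken: CorridorAction
  ): (CorridorState, Reward) = {
    val nextPosition = actionTaken match {
      case MoveLeft  => math.max(0, currentState.position - 1)
      case MoveRight => math.min(5, currentState.position + 1)
    }
    val reward = if (nextPosition == 5) 1.0 else 0.0 // reward only at the goal
    (CorridorState(nextPosition), reward)
  }

  def isTerminal(state: CorridorState): Boolean =
    state.position == 5

}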

Agent

At every time step t, the agent:

  1. knows what state it is currently in
  2. chooses an action to take
  3. is told the new state, and what reward it received
  4. learns something!

AgentBehaviour.scala

type Reward = Double

case class ActionResult[State](reward: Reward, 
                               nextState: State)

trait AgentBehaviour[AgentData, State, Action] {

  def chooseAction(
    agentData: AgentData,
    state: State,
    validActions: List[Action]
  ): (Action, ActionResult[State] => AgentData)

}
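
The simplest possible behaviour, just to make the shape of the trait concrete, is a hypothetical agent that picks a random action and never learns anything:

import scala.util.Random

// an agent with no memory: its AgentData is Unit and never changes
class RandomAgent[State, Action] extends AgentBehaviour[Unit, State, Action] {

  def chooseAction(
    agentData: Unit,
    state: State,
    validActions: List[Action]
  ): (Action, ActionResult[State] => Unit) = {
    val action = validActions(Random.nextInt(validActions.size))
    (action, _ => ()) // ignore the reward and next state: nothing is learned
  }

}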

Runner

  • Start with initial agent data and state
  • At every time step:
    1. Ask the agent to choose an action
    2. Tell the environment, which will return the new state and a reward
    3. Tell these to the agent, which will return an improved version of itself
    4. (Update the UI)

 

Runner

var agentData    = initialAgentData
var currentState = initialState

def step(): Unit = {
  val (nextAction, updateAgent) =
    agentBehaviour.chooseAction(agentData, currentState, ...)

  val (nextState, reward) = 
    env.step(currentState, nextAction)

  agentData = updateAgent(ActionResult(reward, nextState))
  currentState = nextState

  updateUI(agentData, currentState)
}
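
Wiring in isTerminal, a whole episode can be run by stepping until the environment says the state is terminal. A sketch (runEpisodes is a hypothetical helper; agentData is deliberately kept across episodes so learning carries over):

def runEpisodes(episodes: Int): Unit =
  (1 to episodes).foreach { _ =>
    currentState = initialState
    // keep stepping until we reach a terminal state
    while (!env.isTerminal(currentState)) {
      step()
    }
  }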


DEMO

Gridworld

How does it learn?

  • State-action values
  • Policies
  • Prediction and control
  • Model-free vs model-driven
  • Exploitation vs exploration
  • Bootstrapping

State-action values

For each state s

(e.g. agent is in cell (1, 2) on the grid)

and each action a

(e.g. "move left"),

Q(s, a) = estimate of value of being in state s and taking action a

(Q*(s, a) = the optimal value)

Value?

The total return: the sum of all rewards received from that point onward
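
Written out with the discount rate γ that appears later in the Q-learning update (the standard discounted-return definition):

G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}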

Policy

Agent needs a policy:

"If I'm in some state s, what action should I take?"

If we have a state-action value function Q(s, a),

making a policy is trivial:

if I'm in state s, choose the action a with the highest Q(s, a)

My Super Awesome Policy

  (state, action)         Q(s, a)
  ((1, 1), Move Left)     1.4
  ((1, 1), Move Right)    9.3
  ((1, 1), Move Up)       2.2
  ((1, 1), Move Down)     3.7

"If I'm in state (1, 1), I should move right"

Reduce a hard problem (learning an optimal policy)

into two easier problems:

  • "evaluation"/"prediction"
    • measure the performance of some policy π
  • "improvement"/"control"
    • find a policy slightly better than that one

Prediction and control

General strategy:

  1. Start with an arbitrary policy
  2. Repeat:
    1. Evaluate the current policy
    2. Improve it slightly

Model-free/model-driven

Exploitation/exploration

ε-greedy

With probability 1 − ε, follow the policy (exploit);

with probability ε, pick a random action (explore)

Bootstrapping

Basing estimates on other estimates

Given enough exploration and a suitable step size, this eventually converges to the right answer!

Recap

  • State-action values
  • Policies
  • Prediction and control
  • Model-free vs model-driven
  • Exploitation vs exploration
  • Bootstrapping

Q-learning

  • Model-free
  • Exploration
  • Bootstrapping

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]

Q-learning

  • s_t: the current state
  • a_t: the action taken in that state
  • r_{t+1}: the reward received
  • s_{t+1}: the next state

QLearning.scala

case class QLearning[State, Action](
    α: Double, // step size, 0.0 ≦ α ≦ 1.0
    γ: Double, // discount rate, 0.0 ≦ γ ≦ 1.0
    ε: Double, // exploration rate, 0.0 ≦ ε ≦ 1.0
    Q: Map[State, Map[Action, Double]]
)

Agent data
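
A hypothetical starting point for the agent data (GridState and GridAction stand in for the demo's actual types, and the hyperparameter values are illustrative):

val initialAgentData = QLearning[GridState, GridAction](
  α = 0.1,      // learn a little from each step
  γ = 0.9,      // future rewards are worth slightly less than immediate ones
  ε = 0.1,      // explore roughly 10% of the time
  Q = Map.empty // nothing learned yet
)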

QLearning.scala

object QLearning {

  implicit def agentBehaviour[State, Action] =
    new AgentBehaviour[QLearning[State, Action], State, Action] {

      type UpdateFn = 
        ActionResult[State] => QLearning[State, Action]

      def chooseAction(
          agentData: QLearning[State, Action],
          state: State,
          validActions: List[Action]): (Action, UpdateFn) = {

        ...

      }

    }

}

Agent behaviour

QLearning.scala

def chooseAction(
  agentData: QLearning[State, Action],
  state: State,
  validActions: List[Action]): (Action, UpdateFn) = {

  val actionValues = 
    agentData.Q.getOrElse(state, zeroForAllActions)

  // choose the next action
  val (chosenAction, currentActionValue) = 
    epsilonGreedy(actionValues, agentData.ε)

  ...

}

Agent behaviour
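
The zeroForAllActions and epsilonGreedy helpers are elided above. A minimal sketch of what they might look like, consistent with how they are used (my reconstruction, not necessarily the talk's exact code):

import scala.util.Random

// inside chooseAction, where Action and validActions are in scope:

// every valid action starts out with an estimated value of 0.0
val zeroForAllActions: Map[Action, Double] =
  validActions.map(_ -> 0.0).toMap

// with probability ε explore (pick a random action), otherwise exploit
// (pick the highest-valued action); either way, return the chosen action
// together with its current value estimate
def epsilonGreedy(actionValues: Map[Action, Double], ε: Double): (Action, Double) =
  if (Random.nextDouble() < ε)
    Random.shuffle(actionValues.toList).head
  else
    actionValues.maxBy { case (_, value) => value }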

QLearning.scala

val updateStateActionValue: UpdateFn = { actionResult =>

  val maxNextStateActionValue = ...

  val updatedActionValue = 
    currentActionValue + agentData.α * (
        actionResult.reward 
      + agentData.γ * maxNextStateActionValue
      - currentActionValue
    )

  val updatedQ = ...

  agentData.copy(Q = updatedQ)

}

Agent behaviour
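
The two values elided above can be filled in along these lines (again a sketch, reusing the names already in scope in chooseAction):

// highest estimated value over all actions in the next state
// (0.0 if we have no estimates for that state yet)
val maxNextStateActionValue: Double =
  agentData.Q
    .getOrElse(actionResult.nextState, Map.empty[Action, Double])
    .values
    .foldLeft(0.0)((acc, value) => math.max(acc, value))

// write the new estimate for (state, chosenAction) back into the nested Q map
val updatedQ: Map[State, Map[Action, Double]] =
  agentData.Q.updated(
    state,
    actionValues.updated(chosenAction, updatedActionValue)
  )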

Scaling up to more interesting problems

StateConversion.scala

trait StateConversion[EnvState, AgentState] {

  def convertState(envState: EnvState): AgentState

}
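
For example (hypothetical types and bucketing, not the demo's actual code): the pole-balancing environment's state is continuous, so it has to be discretised before a tabular Q-learning agent can use it as a key in its Q map.

// the environment's state: continuous values
case class PoleEnvState(angle: Double, angularVelocity: Double)

// the agent's state: a small, discrete space suitable for a lookup table
case class PoleAgentState(angleBucket: Int, velocityBucket: Int)

object PoleStateConversion extends StateConversion[PoleEnvState, PoleAgentState] {

  // crude discretisation: scale each continuous value and round it to a bucket index
  def convertState(envState: PoleEnvState): PoleAgentState =
    PoleAgentState(
      angleBucket    = (envState.angle * 10).round.toInt,
      velocityBucket = (envState.angularVelocity * 10).round.toInt
    )

}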

DEMO

Pole-balancing

HOMEWORK

Pacman!

Next steps

  • Smarter policies
    • Softmax, decay ε over time, ...
  • More efficient learning
    • TD(λ), Q(λ), eligibility traces, ...
  • Large/∞ state space
    • Function approximation
    • Deep RL

Learn more

Thank you!

Slides, demo and code:

https://cb372.github.io/rl-in-scala/
