Reinforcement learning in Scala
Chris Birchall
Fools say they learn from experience;
I prefer to learn from the experience of others.
type Reward = Double

trait Environment[State, Action] {
  def step(
    currentState: State,
    actionTaken: Action
  ): (State, Reward)

  def isTerminal(state: State): Boolean
}
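As a concrete illustration, here is a minimal sketch of an Environment for a one-dimensional corridor. The names `CorridorExample`, `Move`, `MoveLeft`, `MoveRight` and `corridor`, as well as the reward values, are assumptions made for this example and are not taken from the talk.

```scala
object CorridorExample {
  type Reward = Double

  // Same shape as the trait above, repeated so the sketch is self-contained
  trait Environment[State, Action] {
    def step(currentState: State, actionTaken: Action): (State, Reward)
    def isTerminal(state: State): Boolean
  }

  // Hypothetical actions for a corridor with cells 0 to 4
  sealed trait Move
  case object MoveLeft extends Move
  case object MoveRight extends Move

  // The state is the agent's cell index; reaching cell 4 ends the episode
  val corridor: Environment[Int, Move] = new Environment[Int, Move] {
    private val goal = 4

    def step(currentState: Int, actionTaken: Move): (Int, Reward) = {
      val next = actionTaken match {
        case MoveLeft  => math.max(0, currentState - 1)
        case MoveRight => math.min(goal, currentState + 1)
      }
      // Assumed reward scheme: +1 for reaching the goal, a small penalty otherwise
      val reward = if (next == goal) 1.0 else -0.1
      (next, reward)
    }

    def isTerminal(state: Int): Boolean = state == goal
  }
}
```

Note that `step` is a pure function of the current state and the action, so the environment itself holds no mutable state.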
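At every time step t, the agent observes the current state and chooses an action;
the environment responds with a reward and the next state.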
type Reward = Double

case class ActionResult[State](reward: Reward, nextState: State)

trait AgentBehaviour[AgentData, State, Action] {
  def chooseAction(
    agentData: AgentData,
    state: State,
    validActions: List[Action]
  ): (Action, ActionResult[State] => AgentData)
}
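To show how the returned update function is meant to be used, here is a minimal sketch of an agent that picks actions at random and only keeps a running tally. `RandomAgentExample`, `Tally` and `randomAgent` are hypothetical names introduced for this example, and the agent learns nothing.

```scala
import scala.util.Random

object RandomAgentExample {
  type Reward = Double
  case class ActionResult[State](reward: Reward, nextState: State)

  // Same shape as the trait above, repeated so the sketch is self-contained
  trait AgentBehaviour[AgentData, State, Action] {
    def chooseAction(
      agentData: AgentData,
      state: State,
      validActions: List[Action]
    ): (Action, ActionResult[State] => AgentData)
  }

  // Hypothetical agent data: number of steps taken and total reward so far
  case class Tally(steps: Int, totalReward: Reward)

  // Picks uniformly at random; assumes validActions is non-empty
  def randomAgent[State, Action](rng: Random): AgentBehaviour[Tally, State, Action] =
    new AgentBehaviour[Tally, State, Action] {
      def chooseAction(
        agentData: Tally,
        state: State,
        validActions: List[Action]
      ): (Action, ActionResult[State] => Tally) = {
        val action = validActions(rng.nextInt(validActions.size))
        // The update function closes over the current agent data and produces
        // the new agent data once the environment has responded
        val update = (result: ActionResult[State]) =>
          Tally(agentData.steps + 1, agentData.totalReward + result.reward)
        (action, update)
      }
    }
}
```

Returning a function rather than mutating the agent keeps `chooseAction` pure: the caller decides when, and whether, to apply the update.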
var agentData = initialAgentData
var currentState = initialState

def step(): Unit = {
  // ask the agent for an action and a function to update it afterwards
  val (nextAction, updateAgent) =
    agentBehaviour.chooseAction(agentData, currentState, ...)

  // let the environment react to the action
  val (nextState, reward) =
    env.step(currentState, nextAction)

  // feed the result back to the agent, then advance
  agentData = updateAgent(ActionResult(reward, nextState))
  currentState = nextState
  updateUI(agentData, currentState)
}
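The same loop can also be written without `var`s. The following is a sketch only: `EpisodeLoopExample`, `runEpisode` and `maxSteps` are hypothetical names, and it assumes the same list of valid actions applies in every state, which the slide leaves unspecified.

```scala
import scala.annotation.tailrec

object EpisodeLoopExample {
  type Reward = Double
  case class ActionResult[State](reward: Reward, nextState: State)

  trait Environment[State, Action] {
    def step(currentState: State, actionTaken: Action): (State, Reward)
    def isTerminal(state: State): Boolean
  }

  trait AgentBehaviour[AgentData, State, Action] {
    def chooseAction(
      agentData: AgentData,
      state: State,
      validActions: List[Action]
    ): (Action, ActionResult[State] => AgentData)
  }

  // Steps until a terminal state or the (assumed) step limit is reached,
  // returning the final agent data and state
  @tailrec
  def runEpisode[AgentData, State, Action](
    env: Environment[State, Action],
    behaviour: AgentBehaviour[AgentData, State, Action],
    validActions: List[Action], // assumption: identical in every state
    agentData: AgentData,
    state: State,
    maxSteps: Int
  ): (AgentData, State) =
    if (maxSteps <= 0 || env.isTerminal(state)) (agentData, state)
    else {
      val (action, updateAgent) =
        behaviour.chooseAction(agentData, state, validActions)
      val (nextState, reward) = env.step(state, action)
      val nextData = updateAgent(ActionResult(reward, nextState))
      runEpisode(env, behaviour, validActions, nextData, nextState, maxSteps - 1)
    }
}
```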
For each state s
(e.g. agent is in cell (1, 2) on the grid)
and each action a
(e.g. "move left"),
Q(s, a) = estimate of the value of being in state s and taking action a
(Q*(s, a) = the optimal value)
Value = the total return of all rewards from that point onward
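For reference, these quantities can be written in the notation of Sutton & Barto. This assumes future rewards are discounted by a rate γ between 0 and 1; with γ = 1 the return is the plain sum of rewards.

```latex
G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}
\qquad
Q^{*}(s, a) = \max_{\pi} \, \mathbb{E}_{\pi}\!\left[ G_t \mid S_t = s,\ A_t = a \right]
```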
Agent needs a policy:
"If I'm in some state s, what action should I take?"
If we have a state-action value function Q(s, a),
then making a policy is trivial:
if I'm in state s,
choose the action a with the highest Q(s, a)
| (state, action) | Q(s, a) |
|---|---|
| ((1, 1), Move Left) | 1.4 |
| ((1, 1), Move Right) | 9.3 |
| ((1, 1), Move Up) | 2.2 |
| ((1, 1), Move Down) | 3.7 |
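The lookup in the table above can be sketched in a few lines. `GreedyPolicyExample`, `Move`, `Cell` and `greedyAction` are hypothetical names for this example; the values are the ones from the table.

```scala
object GreedyPolicyExample {
  sealed trait Move
  case object MoveLeft extends Move
  case object MoveRight extends Move
  case object MoveUp extends Move
  case object MoveDown extends Move

  type Cell = (Int, Int)

  // The row of the table for state (1, 1)
  val cellValues: Map[Move, Double] =
    Map(MoveLeft -> 1.4, MoveRight -> 9.3, MoveUp -> 2.2, MoveDown -> 3.7)

  val Q: Map[Cell, Map[Move, Double]] = Map((1, 1) -> cellValues)

  // Greedy policy: the action with the highest estimated value,
  // or None if nothing is known about the state yet
  def greedyAction[State, Action](
    q: Map[State, Map[Action, Double]],
    state: State
  ): Option[Action] =
    q.get(state)
      .filter(_.nonEmpty)
      .map(_.reduce((best, next) => if (next._2 > best._2) next else best)._1)
}
```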
"If I'm in state (1, 1), I should move right"
Reduce a hard problem (learning an optimal policy)
into two easier problems:
1. estimating the state-action values Q(s, a)
2. deriving a policy from those estimates
ε-greedy:
with probability 1 - ε, follow the greedy policy;
with probability ε, pick a random action instead
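A minimal sketch of ε-greedy selection follows. The function name matches the `epsilonGreedy` helper called in the talk's code, but its body is not shown on the slides, so this implementation, the explicit `rng` parameter and the wrapper object `EpsilonGreedyExample` are assumptions.

```scala
import scala.util.Random

object EpsilonGreedyExample {
  // With probability ε pick a uniformly random action, otherwise the one
  // with the highest estimated value. Returns the chosen action together
  // with its current value estimate. Assumes actionValues is non-empty.
  def epsilonGreedy[Action](
    actionValues: Map[Action, Double],
    ε: Double,
    rng: Random
  ): (Action, Double) =
    if (rng.nextDouble() < ε) {
      // explore
      val entries = actionValues.toVector
      entries(rng.nextInt(entries.size))
    } else {
      // exploit
      actionValues.reduce((best, next) => if (next._2 > best._2) next else best)
    }
}
```

With ε = 0.0 this is the purely greedy policy; with ε = 1.0 every action is chosen at random.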
Given enough exploration (and a suitably decaying step size),
Q eventually converges to the optimal values Q*!
Agent data
case class QLearning[State, Action](
  α: Double, // step size, 0.0 ≦ α ≦ 1.0
  γ: Double, // discount rate, 0.0 ≦ γ ≦ 1.0
  ε: Double, // exploration rate, 0.0 ≦ ε ≦ 1.0
  Q: Map[State, Map[Action, Double]] // state-action value estimates
)
Agent behaviour
object QLearning {
  // explicit result type so the implicit resolves predictably
  implicit def agentBehaviour[State, Action]
      : AgentBehaviour[QLearning[State, Action], State, Action] =
    new AgentBehaviour[QLearning[State, Action], State, Action] {
      type UpdateFn =
        ActionResult[State] => QLearning[State, Action]

      def chooseAction(
        agentData: QLearning[State, Action],
        state: State,
        validActions: List[Action]
      ): (Action, UpdateFn) = {
        ...
      }
    }
}
def chooseAction(
  agentData: QLearning[State, Action],
  state: State,
  validActions: List[Action]
): (Action, UpdateFn) = {
  // look up the current estimates for this state
  val actionValues =
    agentData.Q.getOrElse(state, zeroForAllActions)

  // choose the next action
  val (chosenAction, currentActionValue) =
    epsilonGreedy(actionValues, agentData.ε)
  ...
}
val updateStateActionValue: UpdateFn = { actionResult =>
  val maxNextStateActionValue = ...

  // move the estimate towards: reward + γ * (best value in the next state)
  val updatedActionValue =
    currentActionValue + agentData.α * (
      actionResult.reward
        + agentData.γ * maxNextStateActionValue
        - currentActionValue
    )

  val updatedQ = ...
  agentData.copy(Q = updatedQ)
}
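To make the arithmetic of the update concrete, here it is as a standalone function with one worked example. `QUpdateExample` and the numbers are made up for illustration; the maximum next-state value is passed in as a parameter because the slides do not show how it is computed.

```scala
object QUpdateExample {
  // One Q-learning update for a single (state, action) pair:
  // nudge the current estimate towards reward + γ * maxNext by a fraction α
  def updatedActionValue(
    current: Double,
    reward: Double,
    maxNext: Double,
    α: Double,
    γ: Double
  ): Double =
    current + α * (reward + γ * maxNext - current)

  // Worked example: the target is 1.0 + 0.5 * 4.0 = 3.0, the error is
  // 3.0 - 2.0 = 1.0, and half of that error is added to the estimate
  val example: Double =
    updatedActionValue(current = 2.0, reward = 1.0, maxNext = 4.0, α = 0.5, γ = 0.5)
  // example == 2.5
}
```

With α = 0.0 the estimate never changes; with α = 1.0 it is replaced by the target outright.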
trait StateConversion[EnvState, AgentState] {
  def convertState(envState: EnvState): AgentState
}
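One plausible use of this trait is to let the agent see a simpler state than the one the environment tracks. The sketch below assumes that reading; `StateConversionExample`, `GridState` and `gridToCell` are hypothetical names.

```scala
object StateConversionExample {
  // Same shape as the trait above, repeated so the sketch is self-contained
  trait StateConversion[EnvState, AgentState] {
    def convertState(envState: EnvState): AgentState
  }

  // Hypothetical environment state carrying more detail than the agent needs
  case class GridState(x: Int, y: Int, stepsTaken: Int)

  // The agent only sees its cell, so environment states that differ only in
  // stepsTaken map to the same agent state
  implicit val gridToCell: StateConversion[GridState, (Int, Int)] =
    new StateConversion[GridState, (Int, Int)] {
      def convertState(envState: GridState): (Int, Int) =
        (envState.x, envState.y)
    }
}
```

Mapping many environment states onto one agent state keeps the Q table small, at the cost of the agent being unable to tell those states apart.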
Reinforcement Learning: An Introduction (Sutton & Barto)
Slides, demo and code: