S ← ∅
b ← b0
PUSH(S, b)
while ∃n such that (b, n) ∈ H do
    A ← {n | (b, n) ∈ H}
    C ← {a | a ∈ A, TEST(preconds(a), W)}
    ----> b ← CHOOSE(C, P, S, W) <----
    PUSH(S, b)
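To make this concrete, here is a minimal Python sketch of the same loop, under assumed data shapes: H is modeled as a set of (parent, child) edges, each behavior exposes a preconds(world) predicate, and choose is a stand-in for CHOOSE (the P and S arguments are dropped for brevity).

    def run_selection(b0, H, world, choose):
        S = []                                             # trace of selected behaviors (the PUSH target)
        b = b0
        S.append(b)
        while any(parent == b for parent, _ in H):         # b still has children in the hierarchy
            A = {child for parent, child in H if parent == b}
            C = {a for a in A if a.preconds(world)}        # candidates whose preconditions hold
            b = choose(C, world)                           # the decision we zoom in on next
            S.append(b)
        return S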
How do we CHOOSE(...)?
Decisions are only valid if our "strategy" holds
Reinforcement Learning
What it brings to the table when we talk about self-acting agents
In a nutshell
CHOOSE(Q, C, W) :-
    a ← ZEROVAL()
    for all c ∈ C do
        t ← GET(Q, W, c)
        if t > a then
            a ← t
    return DECIDE(a, C, Q)
We have a table Q that maps a composite key (W, c) to a value.
C represents all the possible actions.
W represents our state.
The maximum value is picked, and then a DECISION of Exploitation vs Exploration has to be made.
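One possible (assumed) way to flesh out CHOOSE and DECIDE in Python is an epsilon-greedy rule: the Q table is a plain dict keyed by (state, action), and state_key is a hypothetical helper that turns the world state W into a hashable key.

    import random

    def state_key(world):
        # Assumed helper: reduce the world state W (here a dict) to a hashable key.
        return tuple(sorted(world.items()))

    def choose(Q, C, world, epsilon=0.1):
        # Exploration: with probability epsilon, ignore the table and try something new.
        if random.random() < epsilon:
            return random.choice(list(C))
        # Exploitation: pick the candidate with the highest learned value for this state.
        state = state_key(world)
        return max(C, key=lambda c: Q.get((state, c), 0.0))

Epsilon-greedy is only one way to implement DECIDE; softmax selection or a decaying epsilon are common alternatives. To plug this into the earlier run_selection sketch you would bind Q first, e.g. functools.partial(choose, Q).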
Q Table, What is your PROFESSION?!
LEARN(Q, a, W, W*, P) :-
    v ← GET(Q, W*, a)
    d ← ZEROVAL()
    for all c* ∈ C do
        t ← GET(Q, W, c*)
        if t > d then
            d ← t
    r ← REWARD(a, W, W*, P)
    q ← v + α*(r + γ*d − v)
    PUT(Q, W*, a, q)
The learning process basically updates our assumptions about the problem (represented by the values in Q) based on new information from the last action performed.
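Under the same assumed dict-based Q table (and reusing state_key from the previous sketch), LEARN is the standard Q-learning update; the constants below stand in for the learning rate α and discount factor γ, and the reward callable stands in for REWARD(a, W, W*, P).

    ALPHA = 0.1    # learning rate (alpha)
    GAMMA = 0.9    # discount factor (gamma)

    def learn(Q, a, world, prev_world, C, reward):
        # world is the state after executing action a, prev_world is W* (the state before),
        # C is the set of actions available in the new state.
        s_prev, s_new = state_key(prev_world), state_key(world)
        v = Q.get((s_prev, a), 0.0)                                # current estimate of Q(W*, a)
        d = max((Q.get((s_new, c), 0.0) for c in C), default=0.0)  # best value reachable from W
        r = reward(a, world, prev_world)
        # Temporal-difference update: move the old estimate toward r + gamma * d.
        Q[(s_prev, a)] = v + ALPHA * (r + GAMMA * d - v)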
E ← ∅
while E = ∅ do
    W* ← W
    K ← UPDATE(W)
    W ← REVISE(W, K)
    E ← {a | a ∈ S, TEST(termconds(a), W)}
for all e ∈ E do
    LEARN(Q, e, W, W*, P)
Once the selected step and its sub-steps have finished running, the state is updated, which allows us to apply the learning process.
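Putting it together, a Python sketch of this monitoring loop could look like the following; update, revise, reward and candidates are assumed stand-ins for UPDATE, REVISE, REWARD and "the actions available in the revised state", and each behavior in S is assumed to expose a termconds(world) predicate.

    def run_step(Q, S, world, update, revise, reward, candidates):
        E = set()
        prev_world = world
        while not E:
            prev_world = world                          # remember W* before revising
            K = update(world)                           # gather new information
            world = revise(world, K)                    # fold it into the current state
            E = {a for a in S if a.termconds(world)}    # steps whose termination conditions hold
        for e in E:
            learn(Q, e, world, prev_world, candidates(world), reward)
        return world, E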
So, over time, our table Q should converge to an optimal policy.