RL in Systems-land
Nikhilesh Singh
Many examples taken from David Silver's UCL course
(literally took screenshots this time!)
https://www.youtube.com/watch?v=2pWv7GOvuf0&list=PLqYmG7hTraZDMOYHWgPebj2MfCFzFObQ
What we know?

RL agent attributes
- Policy
- Value Function
- Model
The parts left out
- But, how exactly are the actions chosen?
- How do you update the values?
- It's all just tables, there is a limit to what we can store, right?
- My problem has states unimaginably large, what to do?
- How do you do it in real life? Any algorithm?
- ...
- Wait a minute, how did the AlphaGo guys did it?

Q. What do we want?
A. An optimal way to behave in the environment.
The Search for Optimality
To get the optimal policy
- Evaluation (Prediction Problem)
- Comparison
- Improvement (Control Problem)
Prediction
Some terms
- Return:
- Value function:
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-1}R_{T}
v_\pi(s) = E_\pi[G_t | S_t=s]
Remember that Markov thingy?
- Since we are taking decisions we call it MDP.
- Markov Decision Process
- Just a fancy name for our setup!
- If we know the MDP, DP would do.
(\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R},\gamma)
If we have an Oracle...
- Dynamic Programming
- Wait, doesn't DP require some properties?
- Optimal substructure
- Overlapping sub-problems
- MDPs have that, always!
Q. What if we have an unknown MDP?
A. We sample.
A common theme
- We want to estimate some value x.
- There is some Target value (some function of x) and the observed value.
- Error = Target - Observed
- Update x to push in the direction that reduces error
- x = x + k*(Target - Observed)
- Iterate
Q. I have a sample, what now?
A. Update accordingly.
How about this?
- Generate an episode.
- Log when you visit a state.
- Take the average return you get from each state.
How about this?


Monte-Carlo Policy Evaluation
- First-visit/Every-visit
- Updates done at the end of the episode.
- But what if the episodes are really, really long?
How about this?
- Try attempting updates at every step.
- But we need a target to move towards, G_t in MC.
- Idea: use the estimated return as the target.
How about this?


TD(0) Learning
- Incomplete episodes, bootstrapping.
- Guessing towards a guess.

MC vs TD(0)

An overused example
- Two states, A and B, no discounting, 8 episodes.
- A,0,B,0
- B,1
- B,1
- B,1
- B,1
- B,1
- B,1
- B,0
- V(A) = ?, V(B) = ?
MC
- Best Fit to observed returns
- V(B) = 6/8 = 0.75
- V(A) = avg return when A was visited = 0/1 = 0
TD(0)
- Solution to the underlying MDP
- V(B) = 0.25*0 + 0.75*1 = 0.75
- V(A) = (1*0.75) = 0.75

DP

MC vs TD(0)


The link

Are we done?
- We have solved the Prediction Problem.
- We still have to solve Control, to improve and get the optimal behavior.
Control
Value Function
- V(s) - State Value Function
- Q(s,a) - State-Action Value Function
A common theme


Policy Evaluation: Estimate
Policy Improvement: Generate
\pi^\prime \geq \pi
V_\pi
Monte-Carlo

Policy Evaluation: MC evaluation
Policy Improvement: -Greedy
\epsilon
Beneath the illustration

TD version


SARSA

A windy example!

Goal: Reach G in the least time-steps
State: Location in the grid
Actions: U, D, L, R, Diag.
Rewards: -1 per timestep
A windy example!

The impact of

\lambda


Underneath SARSA( )
\lambda

"The wise learn from the mistake of others."
- Otto Van Bismarck
Off-policy

Q-learning

Q-learning


Q. All these have tables, what if we have lots and lots of states?
What do we need?
- Q(s,a) for all s and a
- Problems:
- We need to visit each state enough.
- And the tables can be huge.
Q. What do we do when we don't know a function correctly?
A. Approximate.

Key Idea
- We already have the sampled episodes.
- Use them as labeled data.
- Feed it to the approximator (IID).
- Not proven for convergence but works quite well in practice.

An example

Mapping to our ideas
- Goal: Break into a malware detection system.
- Environment: The system.
Mapping to our ideas
-
State: A 2530-dimensional feature vector
- PE header metadata
- Section metadata: section name, size and characteristics
- Import and Export Table metadata
- Counts of human-readable strings (e.g. file paths, URLs, and registry key names)
- Byte histogram
Mapping to our ideas
-
Actions: Should not break the code
- adding a function to the import address table that is never used
- manipulating existing section names
- creating new (unused) sections
- appending bytes to extra space at the end of sections
- removing signer information
- packing or unpacking the file
- manipulating debug info
Mapping to our ideas
-
Rewards: Should drive towards the goal
- 0 - malware sample was detected.
- R - evaded.
Mapping to our ideas
-
Algorithm:
- A form of Q-Learning
-
Problem:
- Map the nodes of a netlist onto a chip canvas.
- Optimize final power, performance, and area.
Mapping to our ideas
-
State:
-
graph embedding of the netlist
- placed and unplaced nodes.
- node embedding of the current micro to place.
- metadata about the netlist.
- a mask representing the feasibility of placing the current node on to each cell of the grid
-
graph embedding of the netlist
Mapping to our ideas
-
Actions:
- All the locations in the discrete canvas space a given the current macro can be placed without violating any hard constraints on density or blockages.
- canvas = m x n grid
- action space = prob dist for m x n locations
- action = argmax (m x n grid)
Mapping to our ideas
-
Rewards:
- 0 for all actions except the last action where the reward is a negative weighted sum of proxy wire length and congestion, subject to density constraints.
1
Mapping to our ideas
-
Some trickery is applied here.
- Ideally, we want a cool enough EDA tool.
- but we need to be able to calculate the rewards quickly, as there would be huge amount of iterations.
- So can, we have something that is faster to compute, and correlates with the final values?
- That is what the wirelength and congestion values are doing.
- And as we thought : they are using a neural network to predict the reward as well.
1
Next steps
- Check out sentdex tutorials on youtube.
- Explore gym framework.
- In fact Endgame had released a gym framework for PE.

Resources
sysDL_rl2
By Nikhilesh Singh
sysDL_rl2
- 44