RL in Systems-land

Nikhilesh Singh

Many examples taken from David Silver's UCL course

(literally took screenshots this time!)

https://www.youtube.com/watch?v=2pWv7GOvuf0&list=PLqYmG7hTraZDMOYHWgPebj2MfCFzFObQ

What we know?

www.incompleteideas.net/book/the-book-2nd.html

RL agent attributes

Policy
Value Function
Model

The parts left out

But, how exactly are the actions chosen?
How do you update the values?
It's all just tables, there is a limit to what we can store, right?
My problem has states unimaginably large, what to do?
How do you do it in real life? Any algorithm?
...
Wait a minute, how did the AlphaGo guys did it?

Q. What do we want?

A. An optimal way to behave in the environment.

The Search for Optimality

To get the optimal policy

Evaluation (Prediction Problem)
Comparison
Improvement (Control Problem)

Prediction

Some terms

Return:
Value function:

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-1}R_{T}

v_\pi(s) = E_\pi[G_t | S_t=s]

Remember that Markov thingy?

Since we are taking decisions we call it MDP.
Markov Decision Process
Just a fancy name for our setup!
If we know the MDP, DP would do.

(\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R},\gamma)

If we have an Oracle...

Dynamic Programming
Wait, doesn't DP require some properties?
- Optimal substructure
- Overlapping sub-problems
MDPs have that, always!

Q. What if we have an unknown MDP?

A. We sample.

A common theme

We want to estimate some value x.
There is some Target value (some function of x) and the observed value.
Error = Target - Observed
Update x to push in the direction that reduces error
x = x + k*(Target - Observed)
Iterate

Q. I have a sample, what now?

A. Update accordingly.

How about this?

Generate an episode.
Log when you visit a state.
Take the average return you get from each state.

How about this?

Monte-Carlo Policy Evaluation

First-visit/Every-visit
Updates done at the end of the episode.
But what if the episodes are really, really long?

How about this?

Try attempting updates at every step.
But we need a target to move towards, G_t in MC.
Idea: use the estimated return as the target.

How about this?

TD(0) Learning

Incomplete episodes, bootstrapping.
Guessing towards a guess.

MC vs TD(0)

An overused example

Two states, A and B, no discounting, 8 episodes.
- A,0,B,0
- B,1
- B,1
- B,1
- B,1
- B,1
- B,1
- B,0
V(A) = ?, V(B) = ?

MC

Best Fit to observed returns
V(B) = 6/8 = 0.75
V(A) = avg return when A was visited = 0/1 = 0

TD(0)

Solution to the underlying MDP
V(B) = 0.25*0 + 0.75*1 = 0.75
V(A) = (1*0.75) = 0.75

DP

MC vs TD(0)

The link

Are we done?

We have solved the Prediction Problem.
We still have to solve Control, to improve and get the optimal behavior.

Control

Value Function

V(s) - State Value Function
Q(s,a) - State-Action Value Function

A common theme

Policy Evaluation: Estimate

Policy Improvement: Generate

\pi^\prime \geq \pi

V_\pi

Monte-Carlo

Policy Evaluation: MC evaluation

Policy Improvement: -Greedy

\epsilon

Beneath the illustration

TD version

SARSA

A windy example!

Goal: Reach G in the least time-steps

State: Location in the grid

Actions: U, D, L, R, Diag.

Rewards: -1 per timestep

A windy example!

The impact of

\lambda

Underneath SARSA( )

\lambda

"The wise learn from the mistake of others."

- Otto Van Bismarck

Off-policy

Q-learning

Q-learning

Q. All these have tables, what if we have lots and lots of states?

What do we need?

Q(s,a) for all s and a
Problems:
- We need to visit each state enough.
- And the tables can be huge.

Q. What do we do when we don't know a function correctly?

A. Approximate.

Key Idea

We already have the sampled episodes.
Use them as labeled data.
Feed it to the approximator (IID).
Not proven for convergence but works quite well in practice.

An example

Mapping to our ideas

Goal: Break into a malware detection system.
Environment: The system.

Mapping to our ideas

State: A 2530-dimensional feature vector
- PE header metadata
- Section metadata: section name, size and characteristics
- Import and Export Table metadata
- Counts of human-readable strings (e.g. file paths, URLs, and registry key names)
- Byte histogram

Mapping to our ideas

Actions: Should not break the code
- adding a function to the import address table that is never used
- manipulating existing section names
- creating new (unused) sections
- appending bytes to extra space at the end of sections
- removing signer information
- packing or unpacking the file
- manipulating debug info

Mapping to our ideas

Rewards: Should drive towards the goal
- 0 - malware sample was detected.
- R - evaded.

Mapping to our ideas

Algorithm:
- A form of Q-Learning

Chip Placement with Deep Reinforcement Learning

Problem:
- Map the nodes of a netlist onto a chip canvas.
- Optimize final power, performance, and area.

Mapping to our ideas

State:
- graph embedding of the netlist
  - placed and unplaced nodes.
  - node embedding of the current micro to place.
  - metadata about the netlist.
  - a mask representing the feasibility of placing the current node on to each cell of the grid

Mapping to our ideas

Actions:
- All the locations in the discrete canvas space a given the current macro can be placed without violating any hard constraints on density or blockages.
- canvas = m x n grid
- action space = prob dist for m x n locations
- action = argmax (m x n grid)

Mapping to our ideas

Rewards:
- 0 for all actions except the last action where the reward is a negative weighted sum of proxy wire length and congestion, subject to density constraints.

Mapping to our ideas

Some trickery is applied here.
- Ideally, we want a cool enough EDA tool.
- but we need to be able to calculate the rewards quickly, as there would be huge amount of iterations.
- So can, we have something that is faster to compute, and correlates with the final values?
- That is what the wirelength and congestion values are doing.
- And as we thought : they are using a neural network to predict the reward as well.

Next steps

Check out sentdex tutorials on youtube.
Explore gym framework.
In fact Endgame had released a gym framework for PE.

RL in Systems-land

Nikhilesh Singh

What we know?

RL agent attributes

The parts left out

Q. What do we want?

A. An optimal way to behave in the environment.

The Search for Optimality

To get the optimal policy

Prediction

Some terms

Remember that Markov thingy?

If we have an Oracle...

Q. What if we have an unknown MDP?

A. We sample.

A common theme

Q. I have a sample, what now?

A. Update accordingly.

How about this?

How about this?

Monte-Carlo Policy Evaluation

How about this?

How about this?

TD(0) Learning

MC vs TD(0)

An overused example

MC

TD(0)

DP

MC vs TD(0)

The link

Are we done?

Control

Value Function

A common theme

Monte-Carlo

Beneath the illustration

TD version

SARSA

A windy example!

A windy example!

The impact of

Underneath SARSA( )

"The wise learn from the mistake of others."

- Otto Van Bismarck

Off-policy

Q-learning

Q-learning

Q. All these have tables, what if we have lots and lots of states?

What do we need?

Q. What do we do when we don't know a function correctly?

A. Approximate.

Key Idea

An example

Mapping to our ideas

Mapping to our ideas

Mapping to our ideas

Mapping to our ideas

Mapping to our ideas

Chip Placement with Deep Reinforcement Learning

Mapping to our ideas

Mapping to our ideas

Mapping to our ideas

Mapping to our ideas

Next steps

Resources