Attack of the Killer Microseconds
Finlay Fehlauer - Computing Platforms 2024
[6]
Latency
Cause            | Magnitude  | Mitigation
DRAM             | < 100 ns   | Caching, Prefetching, Out-of-Order Execution, Branch Prediction
Networking/Flash | 1 - 100 μs | ???
Disk I/O         | > 1 ms     | Context Switch, Preloading
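To put the three latency classes in perspective, the following minimal sketch converts a stall time into the number of core cycles that pass while waiting. The 3 GHz clock and the names `kAssumedClockGHz` and `StallCycles` are assumptions chosen for illustration; they do not come from the talk.

```cpp
#include <cstdint>

// Assumed core clock, for illustration only (not a figure from the talk).
constexpr double kAssumedClockGHz = 3.0;

// Number of core cycles that elapse during a stall of `latency_ns`
// nanoseconds: nanoseconds times cycles per nanosecond.
constexpr std::uint64_t StallCycles(double latency_ns) {
  return static_cast<std::uint64_t>(latency_ns * kAssumedClockGHz);
}
```

Under this assumption, `StallCycles(100.0)` gives 300 cycles for a 100 ns DRAM access, `StallCycles(1000.0)` gives 3000 cycles for a 1 μs event, and `StallCycles(1e6)` gives 3000000 cycles for a 1 ms disk access.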
New Microseconds

Current Mitigations
Out-of-Order Execution
Context Switching
Prefetching
...
Recap Async vs Sync
Synchronous: A sends a request to B and waits for the response.
Asynchronous: A sends a request to B, continues work, and polls until it gets the response.
Load blocking
Context-switching
...
All synchronous!
Why?
- Easier to code
- No need to poll
- No different idioms
- No need for custom implementation
Async/Await
Coroutines
Virtual Threads
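The sync/async recap can be sketched in C++ as follows. This is a minimal illustration, assuming a hypothetical `SlowOperation` that stands in for the work done by B; `CallSync` and `CallAsyncWithPolling` are likewise names invented for this example.

```cpp
#include <chrono>
#include <future>
#include <thread>

// Hypothetical stand-in for the operation performed by "B" (e.g. an RPC).
int SlowOperation(int x) {
  std::this_thread::sleep_for(std::chrono::milliseconds(20));
  return x * 2;
}

// Synchronous: the caller blocks until B has produced the response.
int CallSync(int x) { return SlowOperation(x); }

// Asynchronous: start the operation, keep working, and poll for completion.
// `*polls` reports how many times the caller checked before the response
// was ready.
int CallAsyncWithPolling(int x, int* polls) {
  std::future<int> response =
      std::async(std::launch::async, SlowOperation, x);
  *polls = 0;
  while (response.wait_for(std::chrono::seconds(0)) !=
         std::future_status::ready) {
    ++*polls;
    // "Continue work" would go here; this sketch just sleeps briefly.
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  return response.get();  // "Get response"
}
```

Both functions return the same value for the same input. The asynchronous version needs the extra polling loop and the future object, which is the kind of additional idiom the "Why?" list refers to.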
In his own words ...
Case Study: Count Words
// Returns the number of words found in the document specified by 'doc_id'.
int CountWords(const string& doc_id) {
  Index index;
  bool status = ReadDocumentIndex(doc_id, &index);
  if (!status)
    return -1;
  string doc;
  status = ReadDocument(index.location, &doc);
  if (!status)
    return -1;
  return CountWordsInString(doc);
}
Case Study: Count Words
// Heap-allocated state tracked between asynchronous operations.
struct AsyncCountWordsState {
  bool status;
  std::function<void(int)> done_callback;
  Index index;
  string doc;
};

// Forward declarations of the callbacks, which are used before they are defined.
void ReadDocumentIndexDone(AsyncCountWordsState* state);
void ReadDocumentDone(AsyncCountWordsState* state);

// Invokes the 'done' callback, passing the number of words found in the
// document specified by 'doc_id'.
void AsyncCountWords(const string& doc_id, std::function<void(int)> done) {
  // Kick off the first asynchronous operation, and invoke the
  // ReadDocumentIndexDone function when it finishes. State between
  // asynchronous operations is tracked in a heap-allocated 'state' object.
  auto state = new AsyncCountWordsState();
  state->done_callback = done;
  AsyncReadDocumentIndex(doc_id, &state->status, &state->index,
                         std::bind(&ReadDocumentIndexDone, state));
}

// First callback function.
void ReadDocumentIndexDone(AsyncCountWordsState* state) {
  if (!state->status) {
    state->done_callback(-1);
    delete state;
  } else {
    // Kick off the second asynchronous operation, and invoke the
    // ReadDocumentDone function when it finishes. The 'state' object
    // is passed to the second callback for final cleanup.
    AsyncReadDocument(state->index.location, &state->status, &state->doc,
                      std::bind(&ReadDocumentDone, state));
  }
}

// Second callback function.
void ReadDocumentDone(AsyncCountWordsState* state) {
  if (!state->status) {
    state->done_callback(-1);
  } else {
    state->done_callback(CountWordsInString(state->doc));
  }
  delete state;
}
Why the problem? (pt. 1)
1. Demand
2. Supply

Moore's law / Dennard scaling failed
More accelerators needed
More RPC calls
This is a single web search!
[1]
Hardware mitigations don't scale
Not enough ILP or hardware threads
Software mitigations don't scale
Context switching is expensive
Why the problem? (pt. 2)
Server 1
RMD
RDMA
Remote Software
Operation Thread
Network Thread
Processor
Core
Interrupt
RPC
TCP
In addition: Sleep state, debugging
Why the problem? (pt. 3)
Overhead vastly exceeds processing time
Infrequent microsecond calls: OK
Uncoupled calls: OK
Frequent, coupled calls: Issue

[2]
Solutions (pt. 1)
Design for microsecond latencies:
Reduce lock contention
Job scheduling
Interrupt handling
Spin polling
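As a minimal sketch of the spin-polling item, the two functions below wait for the same event in two ways: by busy-polling an atomic flag and by blocking on a condition variable. The names `WaitBySpinning` and `WaitByBlocking` and the payload value are invented for this illustration. Spinning keeps a core busy but avoids going through the scheduler to wake the waiter; blocking frees the core but pays for the wake-up.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>

// Spin polling: the waiter repeatedly checks a flag until the producer
// sets it. The core stays busy for the whole wait.
int WaitBySpinning() {
  std::atomic<bool> ready{false};
  int payload = 0;
  std::thread producer([&] {
    payload = 42;  // Published by the release store below.
    ready.store(true, std::memory_order_release);
  });
  while (!ready.load(std::memory_order_acquire)) {
    // Busy-wait; no system call, no context switch requested.
  }
  producer.join();
  return payload;
}

// Blocking: the waiter sleeps on a condition variable and is woken by the
// producer's notification.
int WaitByBlocking() {
  std::mutex m;
  std::condition_variable cv;
  bool ready = false;
  int payload = 0;
  std::thread producer([&] {
    {
      std::lock_guard<std::mutex> lock(m);
      payload = 42;
      ready = true;
    }
    cv.notify_one();
  });
  {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [&] { return ready; });
  }
  producer.join();
  return payload;
}
```

Both functions return the value written by the producer thread; they differ only in what the waiting core does in the meantime.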
Solutions (pt. 2)
Idea: Look at HPC!
Not directly applicable
Simple, static data structures vs. Complex, dynamic data sets
Few programmers / changes vs. Many programmers / changes
Hardware visible vs. Scale-out, hardware invisible
Highest performance vs. Highest efficiency (P/$)
Underutilization accepted vs. Underutilization unacceptable
Reliable hardware vs. Reliability in stack
No encryption vs. Authentication / Encryption
Solutions (pt. 3)
mwait
monitor
CPU
L1
L2
Context
1. Redesign ILP to handle microsecond latencies
2. Lightweight threads (also for I/O!)
3. Redesign hardware mechanism to manage state for I/O
4. Stop processor (don't use power!)
...
Outlook: NetClone
Idea: Clone RPCs to reduce tail latency
Switch
Networking Chip
Clone ASIC
RPC 1
RPC 2
[3]
Problem: Cloning RPCs incurs same μs overhead
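The slide places the cloning in the switch. The sketch below is not NetClone; it only illustrates the general idea of request cloning in software on the client side, which is exactly where the μs overhead named above is paid. `HandleRequest`, `ClonedCall`, and the per-replica delays are hypothetical names and values for this example, and at least one replica is assumed.

```cpp
#include <atomic>
#include <chrono>
#include <future>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical replica: answers after its own service time.
int HandleRequest(int request, std::chrono::milliseconds service_time) {
  std::this_thread::sleep_for(service_time);
  return request + 1;
}

// Sends the same request to every replica and returns the first reply.
// Slower replies are dropped. Assumes `replica_delays` is not empty.
int ClonedCall(int request,
               const std::vector<std::chrono::milliseconds>& replica_delays) {
  struct Shared {
    std::promise<int> first_reply;
    std::atomic<bool> answered{false};
  };
  auto shared = std::make_shared<Shared>();
  std::future<int> result = shared->first_reply.get_future();
  for (std::chrono::milliseconds delay : replica_delays) {
    // Detached so the caller does not have to wait for slow replicas.
    std::thread([shared, request, delay] {
      int reply = HandleRequest(request, delay);
      // Only the first replica to finish fulfils the promise.
      if (!shared->answered.exchange(true)) {
        shared->first_reply.set_value(reply);
      }
    }).detach();
  }
  return result.get();
}
```

With one fast and one slow replica, the call completes after roughly the fast replica's service time, so a slow outlier no longer determines the latency of the request.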
Outlook: AgileWatts
Idea: CPU Boot-Down/Up
[4,5]
Problem: Restarting CPU and loading context overhead
Vector Execution Engine (ZMM)
Vector
Execution Engine
Execution Ports
Fast power gating
Load / Store
Decode & MS-ROM
L1 (Inst.)
Fetch / Prefetch
L1 (Data) & Ctl.
L2 (256 KB) & Ctl.
L2 (768 KB)
Out-of-order engine
Ungated power + cache in sleep
Thank you for your attention!
Sources
0. Barroso, L., Marty, M., Patterson, D., & Ranganathan, P. (2017). Attack of the killer microseconds. In: CACM.
1. Sites, R. (2021). Understanding Software Dynamics (ch. 6.6).
2. Kanev, S. et al. (2015). Profiling a warehouse-scale computer. In: ISCA 2015.
3. Kim, G. (2023). NetClone: Fast, Scalable, and Dynamic Request Cloning for Microsecond-Scale RPCs.
4. Haj-Yahya, J. et al. (2022). AgileWatts: An Energy-Efficient CPU Core Idle-State Architecture for Latency-Sensitive Server Applications. In: 55th IEEE/ACM International Symposium on Microarchitecture (MICRO).
5. Haj-Yahya, J. Personal communication (26.02.24).
6. Kogias, M. Personal communication (27.02.24).