Attack of the Killer Microseconds

Finlay Fehlauer - Computing Platforms 2024

[6]

Latency

Cause               Magnitude     Mitigation
DRAM                < 100 ns      Caching, Prefetching, Out-of-Order Execution, Branch Prediction
Networking/Flash    1 - 100 μs    ???
Disk I/O            > 1 ms        Context Switch, Preloading

New Microseconds

Current Mitigations

Out-of-Order Execution

Context Switching

Prefetching

...

Recap Async vs Sync

[Diagram] Synchronous: A sends a request to B and waits for the response.

[Diagram] Asynchronous: A sends a request to B, continues work, and polls until it gets the response.

Recap Async vs Sync

Load blocking

Context-switching

...

All synchronous!

Why?

  • Easier to code
  • No need to poll
  • No different idioms
  • No need for custom implementation
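The contrast between the two styles can be sketched in a few lines of C++. This is only an illustration: `SlowRequest`, `CallSync`, and `CallAsyncWithPolling` are hypothetical names that do not come from the slides, and a 5 ms sleep stands in for the remote call.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

// Hypothetical stand-in for a remote call that takes a while to answer.
int SlowRequest(int x) {
  std::this_thread::sleep_for(std::chrono::milliseconds(5));
  return x * 2;
}

// Synchronous style: the caller simply waits for the response.
int CallSync(int x) {
  return SlowRequest(x);
}

// Asynchronous style: start the request, then poll until the response is
// ready. 'polls' reports how often the caller had to check for completion.
int CallAsyncWithPolling(int x, int* polls) {
  assert(polls != nullptr);
  std::future<int> response = std::async(std::launch::async, SlowRequest, x);
  *polls = 0;
  while (response.wait_for(std::chrono::milliseconds(0)) !=
         std::future_status::ready) {
    ++*polls;                   // Other useful work would go here.
    std::this_thread::yield();  // Be polite to other threads while polling.
  }
  return response.get();
}
```

Both functions return the same value; the asynchronous one needs an explicit polling loop and extra bookkeeping, which is the cost the bullet points refer to.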

Async/Await

Coroutines

Virtual Threads

In his own words ...

Case Study: Count Words

// Returns the number of words found in the document specified by 'doc_id'.
int CountWords(const string& doc_id) {
  Index index;
  bool status = ReadDocumentIndex(doc_id, &index);
  if (!status)
    return -1;
  string doc;
  status = ReadDocument(index.location, &doc);
  if (!status)
    return -1;
  return CountWordsInString(doc);
}

Case Study: Count Words

// Heap-allocated state tracked between asynchronous operations.
struct AsyncCountWordsState {
  bool status;
  std::function<void(int)> done_callback;
  Index index;
  string doc;
};

// Forward declarations: the callbacks are defined below but are already
// referenced in AsyncCountWords.
void ReadDocumentIndexDone(AsyncCountWordsState* state);
void ReadDocumentDone(AsyncCountWordsState* state);

// Invokes the 'done' callback, passing the number of words found in the
// document specified by 'doc_id'.
void AsyncCountWords(const string& doc_id, std::function<void(int)> done) {
  // Kick off the first asynchronous operation, and invoke the
  // ReadDocumentIndexDone function when it finishes. State between
  // asynchronous operations is tracked in a heap-allocated 'state' object.
  auto state = new AsyncCountWordsState();
  state->done_callback = done;
  AsyncReadDocumentIndex(doc_id, &state->status, &state->index,
                         std::bind(&ReadDocumentIndexDone, state));
}

// First callback function.
void ReadDocumentIndexDone(AsyncCountWordsState* state) {
  if (!state->status) {
    state->done_callback(-1);
    delete state;
  } else {
    // Kick off the second asynchronous operation, and invoke the
    // ReadDocumentDone function when it finishes. The 'state' object
    // is passed to the second callback for final cleanup.
    AsyncReadDocument(state->index.location, &state->status, &state->doc, std::bind(&ReadDocumentDone, state));
  }
}

// Second callback function.
void ReadDocumentDone(AsyncCountWordsState* state) {
  if (!state->status) {
    state->done_callback(-1);
  } else {
    state->done_callback(CountWordsInString(state->doc));
  }
  delete state;
}
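For comparison with the callback chain, here is a minimal sketch of how the straight-line version can be kept while still returning control to the caller, by running it behind a `std::future`. It is not the approach from the paper: the in-memory tables and the bodies of `ReadDocumentIndex`, `ReadDocument`, and `CountWordsInString` are hypothetical stand-ins so that the example is self-contained, and `CountWordsFuture` is an invented name.

```cpp
#include <cassert>
#include <future>
#include <map>
#include <sstream>
#include <string>

// Hypothetical in-memory stand-ins for the storage calls used in the slides.
struct Index { std::string location; };

static const std::map<std::string, std::string> kIndexTable = {
    {"doc1", "loc1"}};
static const std::map<std::string, std::string> kDocTable = {
    {"loc1", "attack of the killer microseconds"}};

bool ReadDocumentIndex(const std::string& doc_id, Index* index) {
  assert(index != nullptr);
  auto it = kIndexTable.find(doc_id);
  if (it == kIndexTable.end()) return false;
  index->location = it->second;
  return true;
}

bool ReadDocument(const std::string& location, std::string* doc) {
  assert(doc != nullptr);
  auto it = kDocTable.find(location);
  if (it == kDocTable.end()) return false;
  *doc = it->second;
  return true;
}

// Counts whitespace-separated words.
int CountWordsInString(const std::string& doc) {
  std::istringstream in(doc);
  std::string word;
  int count = 0;
  while (in >> word) ++count;
  return count;
}

// Same straight-line logic as the synchronous CountWords case study.
int CountWords(const std::string& doc_id) {
  Index index;
  if (!ReadDocumentIndex(doc_id, &index)) return -1;
  std::string doc;
  if (!ReadDocument(index.location, &doc)) return -1;
  return CountWordsInString(doc);
}

// Runs the synchronous code on another thread; the caller decides when to
// wait for the result. No hand-written state object or callbacks are needed.
std::future<int> CountWordsFuture(const std::string& doc_id) {
  return std::async(std::launch::async, CountWords, doc_id);
}
```

The trade-off is that each request now occupies a thread while it waits, which only pays off if threads and context switches are cheap enough at microsecond scale.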

Why the problem? (pt. 1)

1. Demand

Moore's law / Dennard scaling failed ⇒ More accelerators needed

More RPC calls

This is a single web search! [1]

2. Supply

Hardware mitigations don't scale → Not enough ILP or hardware threads

Software mitigations don't scale → Context switching is expensive

Why the problem? (pt. 2)

[Diagram, built up step by step. Labels: Server 1, RMD, RDMA, Remote Software, Operation Thread, Network Thread, Processor, Core, Interrupt, RPC, TCP]

In addition: Sleep state, debugging

Why the problem? (pt. 3)

Overhead vastly exceeds processing time

Why the problem? (pt. 3)

Infrequent microsecond calls: OK

Uncoupled calls: OK

Frequent, coupled calls: Issue

[2]

Solutions (pt. 1)

Design for microsecond latencies:

Reduce lock contention

Job scheduling

Interrupt handling

Spin polling
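Spin polling can be sketched as follows. This is a minimal illustration under assumptions, not a recommendation from the slides: `SpinPoll` and `WaitForCompletion` are hypothetical names, and the spin budget `max_spins` is a tuning parameter invented for the example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Checks 'ready' up to 'max_spins' times without giving up the CPU.
// Returns true if the flag was observed set.
bool SpinPoll(const std::atomic<bool>& ready, long max_spins) {
  assert(max_spins >= 0);
  for (long i = 0; i < max_spins; ++i) {
    if (ready.load(std::memory_order_acquire)) return true;
  }
  return ready.load(std::memory_order_acquire);
}

// Hybrid wait: spin briefly in the hope that the event completes within
// microseconds, then fall back to yielding the CPU between checks.
void WaitForCompletion(const std::atomic<bool>& ready, long max_spins) {
  if (SpinPoll(ready, max_spins)) return;
  while (!ready.load(std::memory_order_acquire)) {
    std::this_thread::yield();
  }
}
```

Spinning avoids the cost of a context switch when the wait is short, at the price of burning CPU cycles while waiting, so the spin budget has to stay small.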

Solutions (pt. 2)

Idea: Look at HPC!

Not directly applicable

Simple, static data structures vs. Complex, dynamic data sets

Few programmers / changes vs. Many programmers / changes

Hardware visible vs. Scale-out, hardware invisible

Highest performance vs. Highest efficiency (P/$)

Underutilization accepted vs. Underutilization unacceptable

Reliable hardware vs. Reliability in stack

No encryption vs. Authentication / Encryption

Solutions (pt. 3)

1. Redesign ILP to handle microsecond latencies

2. Lightweight threads (also for I/O!)

3. Redesign hardware mechanism to manage state for I/O

4. Stop processor (don't use power!)

...

[Diagram labels: monitor, mwait, CPU, L1, L2, Context]

Outlook: NetClone

Idea: Clone RPCs to reduce tail latency

[Diagram labels: Switch, Networking Chip, Clone ASIC, RPC 1, RPC 2]

[3]

Problem: Cloning RPCs incurs same μs overhead
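The basic idea of request cloning, done in software, can be sketched as below. This is not NetClone's in-switch mechanism: `ClonedRequest` is a hypothetical function, each "replica" is just a thread, and the per-replica delays in milliseconds are made-up stand-ins for service time.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <future>
#include <mutex>
#include <thread>
#include <vector>

// Sends the same request to every replica and returns the index of the
// replica whose response arrived first. 'delays_ms' is a hypothetical
// stand-in for each replica's service time.
std::size_t ClonedRequest(const std::vector<int>& delays_ms) {
  assert(!delays_ms.empty());
  std::promise<std::size_t> first;
  std::once_flag once;
  std::future<std::size_t> winner = first.get_future();
  std::vector<std::thread> replicas;
  for (std::size_t i = 0; i < delays_ms.size(); ++i) {
    replicas.emplace_back([&first, &once, &delays_ms, i] {
      std::this_thread::sleep_for(std::chrono::milliseconds(delays_ms[i]));
      // Only the first response is kept; later ones are dropped.
      std::call_once(once, [&first, i] { first.set_value(i); });
    });
  }
  std::size_t result = winner.get();  // Available as soon as one replica answers.
  // Joining is only for a clean shutdown of this sketch; a real client would
  // not wait for the slower replicas.
  for (std::thread& t : replicas) t.join();
  return result;
}
```

For example, `ClonedRequest({40, 1})` returns 1, because the second replica answers first. Every clone issued this way pays the full per-request software overhead again, which is the problem stated on the slide.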

Outlook: AgileWatts

Idea: CPU Boot-Down/Up

[4,5]

Problem: Overhead of restarting the CPU and reloading its context

[Diagram labels: Vector Execution Engine (ZMM), Execution Ports, Load / Store, Decode & MS-ROM, L1 (Inst.), Fetch / Prefetch, L1 (Data) & Ctl., L2 (256 KB) & Ctl., L2 (768 KB), Out-of-order engine. Legend: Fast power gating; Ungated power + cache in sleep]

Thank you for your attention!

Sources

0. Barroso, L., Marty, M., Patterson, D., & Ranganathan, P. (2017). Attack of the Killer Microseconds. In: CACM.
1. Sites, R. (2021). Understanding Software Dynamics (ch. 6.6).
2. Kanev, S. et al. (2015). Profiling a Warehouse-Scale Computer. In: ISCA 2015.
3. Kim, G. (2023). NetClone: Fast, Scalable, and Dynamic Request Cloning for Microsecond-Scale RPCs.
4. Haj-Yahya, J. et al. (2022). AgileWatts: An Energy-Efficient CPU Core Idle-State Architecture for Latency-Sensitive Server Applications. In: 55th IEEE/ACM (MICRO).
5. Haj-Yahya, J. Personal communication (26.02.24).
6. Kogias, M. Personal communication (27.02.24).
