Let's build our own message streaming platform
Piotr Gankiewicz
# whoami
Hello


10+ YOE
# intro
The origins






# intro
How it started





# iggy
How it's going


iggy.apache.org
Apache Iggy (Incubating)
# iggy
- Blazingly fast message streaming in Rust
- TCP, QUIC, HTTP transport protocols
- Very high throughput and stable P99+
- 5+ GB/s writes & reads on a single node
- Vibrant community, multiple SDKs supported
- Benchmarking as a first-class citizen
- Built-in CLI, Web UI and other tooling
Iggy CLI
# iggy

cargo install iggy-cli
Iggy Web UI
# iggy

Iggy Bench CLI
# iggy

Iggy Benchmarks Platform
# iggy

benchmarks.iggy.rs
The Stream
# stream
Stream

# stream
Parallel reads

# build
Talk is cheap, show me the code

# build
Hello world
struct Stream {
    id: u32,
    offset: u64,
    path: String,
}

struct Message {
    offset: u64,
    payload: Vec<u8>,
}
# build
Reads & writes
impl Stream {
    fn append(&mut self, message: Message) {
        // TODO: Persist the append-only data
    }

    fn poll(&self, offset: u64, count: u64) -> Vec<Message> {
        // TODO: Load the persisted data by offset
        todo!()
    }
}
# build
Serialization
impl Message {
    // Uses the `bytes` crate's BufMut trait; little-endian to match the hexdump below
    fn as_bytes(&self) -> Vec<u8> {
        let mut bytes = vec![];
        bytes.put_u64_le(self.offset);
        bytes.put_u32_le(self.payload.len() as u32);
        bytes.put(self.payload.as_slice());
        bytes
    }
}
# build
Deserialization
impl Message {
    fn from_bytes(bytes: &[u8]) -> Message {
        let offset = u64::from_le_bytes(bytes[0..8].try_into().unwrap());
        let length = u32::from_le_bytes(bytes[8..12].try_into().unwrap()) as usize;
        let payload = bytes[12..12 + length].to_vec();
        Message {
            offset,
            payload
        }
    }
}
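A quick round-trip sanity check (assuming the little-endian layout above):

let message = Message { offset: 1, payload: b"hello".to_vec() };
let bytes = message.as_bytes();
assert_eq!(bytes.len(), 8 + 4 + 5); // offset + length + payload
let decoded = Message::from_bytes(&bytes);
assert_eq!(decoded.offset, 1);
assert_eq!(decoded.payload, b"hello");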
# build
hexdump
00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  |................|
05 00 00 00 68 65 6c 6c 6f 01 00 00 00 00 00 00  |....hello.......|
00 02 00 00 00 00 00 00 00 05 00 00 00 77 6f 72  |.............wor|
6c 64                                             |ld|
# build
File storage
impl Stream {
    async fn append(&mut self, mut message: Message) {
        self.offset += 1;
        message.offset = self.offset;
        let bytes = message.as_bytes();
        let mut file = file::open(&self.path).await;
        file.write_all(&bytes).await;
    }
}
# build
Durable file storage
impl Stream {
    async fn append(&mut self, mut message: Message) {
        self.offset += 1;
        message.offset = self.offset;
        let bytes = message.as_bytes();
        let mut file = file::open(&self.path).await;
        file.write_all(&bytes).await;
        file.sync_all().await;
    }
}
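Durability has a price. A minimal, hypothetical micro-benchmark (std only; the "bench.log" path and iteration count are arbitrary) comparing buffered appends with appends followed by fsync:

use std::fs::OpenOptions;
use std::io::Write;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open("bench.log")?;
    let payload = [0u8; 64];

    let start = Instant::now();
    for _ in 0..1_000 {
        file.write_all(&payload)?;
    }
    println!("buffered writes: {:?}", start.elapsed());

    let start = Instant::now();
    for _ in 0..1_000 {
        file.write_all(&payload)?;
        file.sync_all()?; // fsync() on every write
    }
    println!("write + fsync:   {:?}", start.elapsed());
    Ok(())
}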
# build
fsync
"This frequency of application-level fsyncs has a large impact on both latency and throughput. Setting a large flush interval will improve throughput as the operating system can buffer the many small writes into a single large write."
# build
Messages batching
async fn append(&mut self, messages: Vec<Message>) {
    for message in messages {
        self.unsaved_messages.push(message);
    }
    if self.unsaved_messages.len() < 1000 {
        return;
    }
    let mut bytes = vec![];
    for message in &self.unsaved_messages {
        bytes.put(message.as_bytes().as_slice());
    }
    let mut file = file::open(&self.path).await;
    file.write_all(&bytes).await;
    file.sync_all().await; // fsync() here?
    self.unsaved_messages.clear();
}
# build
Background saver
spawn(async move {
    let mut interval = interval(Duration::from_secs(5));
    loop {
        interval.tick().await;
        stream.persist_unsaved_messages().await; // fsync()
    }
});
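Note: this sketch assumes the stream handle is shareable across tasks (e.g. behind an Arc plus a lock), and that tokio's interval completes its first tick immediately before settling into the 5-second cadence.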
# build
Clustering
- Replicate the data across multiple nodes
- Provide high availability & reliability
- Results in additional complexity
- Might have an impact on the latency
- Raft, Viewstamped Replication etc.
- Probably not within the scope of this talk :)
# build
Reading the data at any offset

# build
Index - the offset position

# build
Storing the index
async fn append(&mut self, message: Message) {
    // ... previous stuff
    self.position += bytes.len() as u32; // Message as bytes
    let mut file = file::open(&self.index_path).await;
    file.write_u32(self.position).await;
}
# build
Reading the index
async fn poll(&self, offset: u64, count: u64) -> Vec<Message> {
    let mut file = file::open(&self.index_path).await;
    file.seek(SeekFrom::Start(4 * offset)).await;
    let position = file.read_u32().await;
    let mut file = file::open(&self.stream_path).await;
    file.seek(SeekFrom::Start(position as u64)).await;
    // Load N messages based on the count
    todo!()
}
Networking
# network
Let there be network

# network
Multiple streams support
struct Server {
    streams: HashMap<u32, Stream>,
    clients: HashMap<u32, Client>,
}

impl Server {
    async fn append(&self, stream_id: u32, message: Message) {
        let stream = self.get_stream(stream_id);
        stream.append(message).await
    }

    async fn poll(&self, stream_id: u32, offset: u64, count: u64) -> Vec<Message> {
        let stream = self.get_stream(stream_id);
        stream.poll(offset, count).await
    }
}
# network
Parallel writes & reads
struct Server {
    streams: HashMap<u32, Arc<RwLock<Stream>>>,
    clients: HashMap<u32, Client>,
}

impl Server {
    async fn append(&self, stream_id: u32, message: Message) {
        let stream = self.get_stream(stream_id);
        let mut stream = stream.write().await; // Acquire write lock
        stream.append(message).await
    }

    async fn poll(&self, stream_id: u32, offset: u64, count: u64) -> Vec<Message> {
        let stream = self.get_stream(stream_id);
        let stream = stream.read().await; // Acquire read lock
        stream.poll(offset, count).await
    }
}
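With a RwLock, many clients can poll a stream concurrently, but every append still serializes on that stream's single write lock; the partitioning shown next spreads writes across many finer-grained locks.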
# network
Partitioning

# network
Partitioning
struct Topic {
    id: u32,
    path: String,
    partitions: HashMap<u32, Partition>
}

struct Partition {
    id: u32,
    offset: u64,
    path: String
}
# network
Partitioning
struct Server {
    topics: HashMap<u32, Arc<RwLock<Topic>>>,
    clients: HashMap<u32, Client>
}

struct Topic {
    id: u32,
    path: String,
    partitions: HashMap<u32, Partition>
}
# network
Partitioning
struct Server {
    topics: HashMap<u32, Topic>,
    clients: HashMap<u32, Client>
}

struct Topic {
    id: u32,
    path: String,
    partitions: HashMap<u32, Arc<RwLock<Partition>>>
}
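One detail the slides leave implicit is how a message picks its partition. A common approach (a sketch, not necessarily Iggy's actual routing) is hashing a message key:

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Route a message to a partition by hashing its key, so all messages
// with the same key land in the same partition (and keep their order).
fn partition_id(key: &[u8], partition_count: u32) -> u32 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() % partition_count as u64) as u32
}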
Performance
Zero-copy (de)serialization
- Regular (de)serialization consists of 2 stages
  - Break down a model into serializable types
  - Serialization of the types using a given format
- Read + Parse + Reconstruct
- Zero-copy directly references these bytes
- Deserialization is just casting the pointer
- &[u8] -> &T (without an additional cost)
# performance
How to zero-copy?
- Just use rkyv (github.com/rkyv/rkyv)
- Mom, we have zero-copy at home
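The parsing loop on the next slide builds MessageView values; a minimal sketch of such a borrowed view type (assuming an offset and a payload are all it needs to carry):

// A zero-copy "view" over a message: the payload borrows directly from
// the input buffer instead of being copied into an owned Vec<u8>.
struct MessageView<'a> {
    offset: u64,
    payload: &'a [u8],
}

impl<'a> MessageView<'a> {
    fn new(offset: u64, payload: &'a [u8]) -> Self {
        Self { offset, payload }
    }
}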

# performance
let mut pos = 0;
while pos < slice.len() {
    let offset = u64::from_le_bytes(slice[pos..pos + 8].try_into()?);
    pos += 8;
    let payload_length = u64::from_le_bytes(slice[pos..pos + 8].try_into()?);
    pos += 8;
    let payload = &slice[pos..pos + payload_length as usize];
    pos += payload_length as usize;
    let message: MessageView<'_> = MessageView::new(offset, payload);
}
How to zero-copy?
# performance
fn write_value_at<const N: usize>(slice: &mut [u8], value: [u8; N], position: usize) {
    let slice = &mut slice[position..position + N];
    let ptr = slice.as_mut_ptr();
    unsafe {
        std::ptr::copy_nonoverlapping(value.as_ptr(), ptr, N);
    }
}
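Hypothetical usage, filling the header of the message layout from earlier:

let mut buffer = [0u8; 12];
write_value_at(&mut buffer, 1u64.to_le_bytes(), 0); // offset at bytes 0..8
write_value_at(&mut buffer, 5u32.to_le_bytes(), 8); // length at bytes 8..12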
# performance
User vs Kernel space

Syscalls
impl Stream {
    async fn append(&mut self, mut message: Message) {
        self.offset += 1;
        message.offset = self.offset;
        let bytes = message.as_bytes();
        // 1. Open file
        let mut file = file::open(&self.path).await;
        // 2. Write to file
        file.write_all(&bytes).await;
    } // 3. Close file
}
# performance
ulimit
"Kafka uses a very large number of files and a large number of sockets to communicate with the clients.
All of this requires a relatively high number of available file descriptors."
"You should increase your file descriptor count to at least 100,000, and possibly much more."
# performance
io_uring
- New, asynchronous I/O for Linux
- An alternative to epoll, kqueue, aio
- Unified interface for network & storage
- Reduces syscalls and context switches
- Allows batching multiple calls as a single one
- Readiness-based vs completion-based I/O
- Battle-tested solution (TigerBeetle and others)
# performance
io_uring

# performance
monoio
let file = file::open(&self.path).await?;
let mut position = 0;
let buffer = vec![0u8; 4];
let (result, buffer) = file.read_exact_at(buffer, position).await;
if result.is_err() {
    return Err(Error::InvalidOffset);
}
let offset = u32::from_le_bytes(buffer.as_slice().try_into()?);
position += 4;
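The (result, buffer) pair is characteristic of completion-based runtimes: the kernel may still reference the buffer while the read is in flight, so the API takes ownership of it and hands it back once the operation completes.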
# performance
DirectIO
- Bypass the kernel page cache via DMA
- Reduce unnecessary memory copies
- Lower and more predictable (tail) latencies
- Better CPU & RAM utilization
- Fsyncgate - errors on fsync are unrecoverable
- Doesn't play too well with Tokio...
# performance
DirectIO
use std::os::unix::fs::OpenOptionsExt;

// Linux x86-64 flag values
const O_DIRECT: i32 = 0x4000;
const O_DSYNC: i32 = 0x1000;
const ALIGNED_SIZE: usize = 512;

let file = std::fs::File::options()
    .read(true)
    .write(true)
    .custom_flags(O_DIRECT | O_DSYNC)
    .open(&self.file_path);
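ALIGNED_SIZE matters because O_DIRECT requires the buffer address, length, and file offset to all be multiples of the logical block size. A minimal sketch of an aligned allocation:

use std::alloc::{alloc, dealloc, Layout};

// Allocate a 4 KiB buffer aligned to the 512-byte block size.
let layout = Layout::from_size_align(4096, ALIGNED_SIZE).unwrap();
let buffer = unsafe { alloc(layout) };
// ... issue O_DIRECT reads/writes through `buffer` ...
unsafe { dealloc(buffer, layout) };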
# performance
Context switch

# performance
Context switch
impl Stream {
    async fn append(
        &self,
        partition_id: u32,
        message: Message
    ) {
        let partition = self.get_partition(partition_id);
        // Maybe a context switch if the lock is contended
        let mut partition = partition.write().await;
        // Context switch due to async
        partition.append(message).await
    }
}
# performance
Work stealing
# performance

https://tokio.rs/blog/2019-10-scheduler
Thread affinity & Thread-per-core

# performance
Shared Nothing

# performance
monoio

# performance
Optimization rabbit hole
- Kernel bypass
- DPDK
- eBPF
- Just to name a few :)
# performance
# thanks


https://spetz.github.io/posts/rustikon-2025/
Rustikon 2025
By Piotr Gankiewicz