🐦 Ingesting Petabytes at Tinybird

PyData Madrid, 2024-01-18

🤳 Your Host Tonight

Alex Fernández "pinchito"

Silverback developer at Tinybird

🗂️ What we will see today

🤔 What does Tinybird do?

📑 Principles

🔧 Techniques

🪄 Tricks

🤔 What does Tinybird do?

And what does Tinybird not do?

🙅‍♀️ First, what does Tinybird not do?

It is not a ClickHouse wrapper

It does not store logs

It is not a DWH

🤌 So, what does Tinybird do?

Real time analytics

Process data at scale

Publish API endpoints

🗣️ Customer quotes

🙊 Non-secret monthly numbers

Ingest petabytes of data

Process many petabytes of data

Serve billions of requests

📑 Principles

How do we do it?

⏱️ Real real-time

From days, hours or minutes — to seconds

Process data as it comes

Reduce latencies

Blog: Real-Time Data Ingestion: The Foundation for Real-time Analytics

🕵️ Customer focus

Really close to customers

Direct communication channels

Everyone does customer support

Fast iteration

🏗️ Production centric

Most engineers do on-call

Everyone suffers the pain

Everyone deploys

Top-notch production culture

🐶 Eat your own dog food

Use Tinybird as much as possible

Be your first user

Blog: Using Tinybird for real-time marketing at Tinybird

⚡ Speed wins

Iterate fast

Communicate often

Don't wait for permission

Hardest principle to implement — and copy

🔧 Techniques

More to the Point

🗿 Monorepo

Everything in the same repo

Ingest, backend and UI together

Includes code, tests, docs, tooling, infra, CI itself

A controversial practice (Google, Facebook)

When it works, it works great!

🛬 Continuous deployment

Deploy tens of times per day

Write → Test → Review → Merge → Deploy

Everything goes straight to production

Blog: How we cut our CI pipeline execution time in half

🌐 HTTP interface

Use the path of least resistance

Most popular ingest interface: Events API

An afterthought at Devo
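
A minimal sketch of what ingesting over HTTP can look like from the client side, using only the standard library. It assumes the Events API accepts newline-delimited JSON in a POST to `/v0/events`, with the data source name as a query parameter and a bearer token. The host, the `page_views` data source, the token placeholder and both helper functions are hypothetical. The request is built but never sent.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL; the real host depends on the workspace region.
EVENTS_URL = "https://api.tinybird.co/v0/events"


def to_ndjson(events):
    """Serialize a list of dicts as newline-delimited JSON bytes."""
    return "\n".join(json.dumps(event) for event in events).encode("utf-8")


def build_events_request(datasource, token, events):
    """Build (but do not send) a POST request carrying the events."""
    query = urllib.parse.urlencode({"name": datasource})
    return urllib.request.Request(
        f"{EVENTS_URL}?{query}",
        data=to_ndjson(events),
        headers={"Authorization": f"Bearer {token}"},
        method="POST",
    )


request = build_events_request(
    "page_views",      # hypothetical data source name
    "<your-token>",    # placeholder, not a real token
    [{"path": "/", "ms": 12}, {"path": "/docs", "ms": 30}],
)
# Sending it would be: urllib.request.urlopen(request)
```

Any language with an HTTP client can do the same, which is what makes it the path of least resistance.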

🚏 Requests vs events

Event

Request

🥅 Gather data

Pre-aggregate data

Aggregate ClickHouse operations

Faster, more predictable responses
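
The gathering idea can be sketched as a small buffer that turns many tiny writes into a few large ones. This illustrates the general technique and is not Tinybird's actual code: the `Batcher` class, its `sink` callback and the `max_rows` threshold are all hypothetical, and a plain list stands in for the database.

```python
class Batcher:
    """Collect small writes and hand them over in larger batches."""

    def __init__(self, sink, max_rows=3):
        self.sink = sink          # called with a list of rows
        self.max_rows = max_rows  # flush once this many rows are buffered
        self.rows = []

    def add(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.max_rows:
            self.flush()

    def flush(self):
        if not self.rows:
            return
        batch, self.rows = self.rows, []
        self.sink(batch)


inserts = []  # stands in for the database: one entry per insert operation
batcher = Batcher(inserts.append, max_rows=3)
for n in range(7):
    batcher.add({"n": n})
batcher.flush()  # push whatever is left in the buffer

batch_sizes = [len(batch) for batch in inserts]
```

Seven rows reach the sink as three inserts instead of seven. A real gatherer would also flush on a timer, so a quiet stream does not leave rows waiting.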

📈 Be ahead

Customers increase their data overnight by 2x, 5x, 10x

Be ready for deluges

Look for the next bottleneck

🪄 Tricks

Of the Trade

🎛️ ClickHouse optimizations

Team of experts

Open source model

Contribute everything upstream

Blog: Resolving a Year-long ClickHouse Lock Contention

🖇️ Async programming

async/await? Copied from JavaScript? Or .NET?

Python servers: Tornado, Gunicorn, Starlette

Fastest performance
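
A small sketch of why async helps an I/O-bound server, using only `asyncio` from the standard library. The `handle` coroutine and its 0.1 s delay are made up for illustration; the sleep stands in for waiting on a database or another service.

```python
import asyncio
import time


async def handle(request_id, delay=0.1):
    """Pretend to wait on I/O (a database, another service)."""
    await asyncio.sleep(delay)
    return f"done {request_id}"


async def main():
    start = time.perf_counter()
    # All ten handlers wait concurrently on a single thread.
    results = await asyncio.gather(*(handle(i) for i in range(10)))
    elapsed = time.perf_counter() - start
    return results, elapsed


results, elapsed = asyncio.run(main())
```

Run one after another, the ten handlers would take about a second; here the waits overlap, so the total stays close to a single delay.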

🍱 Use types

Explicit types

Check with mypy every time

I hate it

Catches a lot of errors
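
A short sketch of the kind of error explicit types catch before the code runs. The `Event` class and both functions are hypothetical examples, not Tinybird code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    name: str
    size_bytes: int


def total_bytes(events: list[Event]) -> int:
    return sum(event.size_bytes for event in events)


def find_event(events: list[Event], name: str) -> Optional[Event]:
    for event in events:
        if event.name == name:
            return event
    return None


events = [Event("click", 120), Event("view", 80)]
total = total_bytes(events)
missing = find_event(events, "purchase")
# Writing `missing.size_bytes` directly is flagged by mypy,
# because `missing` may be None. The check below satisfies it.
size = missing.size_bytes if missing is not None else 0
```

Without the annotation the unchecked access would only fail at runtime, and only when the event is actually missing.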

🔌 C extensions

Invoked from Python using CFFI

For extreme situations

Blog: Splitting CSV files at 3 GB/s

🙏 Thanks!

❓ Questions?