Making Sense of Time in Distributed Systems

Ju Liu @arkh4m Elixir London Meetup

🙌 HI!!! 🙌

I'm Ju 🙇🏻

Also known as @arkh4m

I'm an Italian 🇮🇹

Living in London 🇬🇧

And I love to climb ⛰

Falsehoods Programmers Believe About Time

36. Time always goes forward

Let me give you a real world example

At midnight UTC on New Year’s Day, deep inside Cloudflare’s custom RRDNS software, a number went negative when it should always have been, at worst, zero. A little later this negative value caused RRDNS to panic.

RRDNS is written in Go and uses Go’s time.Now() function to get the time. Unfortunately, this function does not guarantee monotonicity. Go currently doesn’t offer a monotonic time source (see issue 12914 for discussion).

The code takes the upstream time values and feeds them to Go’s rand.Int63n() function, which promptly panics if its argument is negative. That's where the RRDNS panics were coming from.
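A minimal Python sketch of this class of bug and two common defences. It is not the RRDNS code: `DEFAULT_TIMEOUT_MS` and all function names are hypothetical, and `random.randrange` stands in for Go's `rand.Int63n` because it also rejects non-positive arguments.

```python
import random
import time

DEFAULT_TIMEOUT_MS = 100  # hypothetical fallback value, not taken from RRDNS


def wall_clock_elapsed_ms(start: float, end: float) -> int:
    """Elapsed time from two wall-clock readings.

    The result is negative if the wall clock stepped backwards in between.
    """
    return int((end - start) * 1000)


def pick_random_below(rtt_ms: int) -> int:
    """Return a random integer in [0, rtt_ms), guarding against rtt_ms <= 0."""
    if rtt_ms <= 0:  # a check for == 0 alone would let negative values through
        rtt_ms = DEFAULT_TIMEOUT_MS
    return random.randrange(rtt_ms)


def monotonic_elapsed_ms(start: float) -> int:
    """Elapsed time since `start`, where `start` came from time.monotonic().

    A monotonic clock never goes backwards, so the result is never negative.
    """
    return int((time.monotonic() - start) * 1000)
```

The guard treats a negative duration like a missing one, and the monotonic clock avoids producing a negative duration in the first place.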

The Fix

In Distributed Systems Time is even Harder

Each computer has a clock built in, but those clocks are independent. The clocks on different machines can vary quite a bit.

Network Time Protocol

NTP

NTP is a networking protocol for clock synchronization between computer systems over packet-switched, variable-latency data networks. It is intended to synchronize all participating computers to within a few milliseconds of Coordinated Universal Time (UTC).

A Few Milliseconds

My MacBook can do about 40'000 MIPS

In a millisecond, that's 40 million instructions.

How long does it take an IP packet to go from London to New York and back?

QUIZ TIME

Distance London - New York

5585 Km

Speed of light over physical medium

200'000 Km/s

Time = \frac{Distance}{Speed}

The Formula™

So it takes 28 ms to get from London to New York.

\frac{5585 Km}{200000 \frac{Km}{s}} = 0.027925 s

Which means a 56 ms round trip time.
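The same arithmetic as a short Python sketch, using the distance and speed figures from the slides. The constant and function names are only illustrative.

```python
DISTANCE_KM = 5585              # London to New York, from the slides
SPEED_KM_PER_S = 200_000        # speed of light over a physical medium, from the slides


def one_way_ms(distance_km: float, speed_km_per_s: float) -> float:
    """Time = Distance / Speed, converted from seconds to milliseconds."""
    return distance_km / speed_km_per_s * 1000


ONE_WAY_MS = one_way_ms(DISTANCE_KM, SPEED_KM_PER_S)   # roughly 27.9 ms
ROUND_TRIP_MS = 2 * ONE_WAY_MS                         # roughly 55.9 ms
```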

Once network hops are taken into account, it ends up being closer to 70 ms.

So my laptop can do 2.8 billion instructions while a packet goes back and forth from London to New York

If we connect 10 nodes together that's about 28 billion instructions in total
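A quick check of these numbers in Python, assuming the 40'000 MIPS and 70 ms figures from the slides. The variable names are only illustrative.

```python
MIPS = 40_000                   # millions of instructions per second
ROUND_TRIP_MS = 70              # London to New York and back, including hops
NODES = 10

INSTRUCTIONS_PER_MS = MIPS * 1_000_000 // 1000             # 40 million
INSTRUCTIONS_PER_ROUND_TRIP = INSTRUCTIONS_PER_MS * ROUND_TRIP_MS
TOTAL_ACROSS_NODES = INSTRUCTIONS_PER_ROUND_TRIP * NODES
```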

In Distributed Systems Time is Really Hard

Let's look at an example

  1. We have a central logger and a set of workers.
  2. The workers send messages between each other and report to the logger.
  3. We want to be able to reconstruct the order of the events that the logger receives.

A central logger
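A small Python sketch of why this is hard if the logger relies on the workers' own clocks. The worker names, the 50 ms skew and the event times are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    worker: str
    event: str
    true_time_ms: int    # what an ideal global clock would read
    local_time_ms: int   # what the worker's own clock reads


# Hypothetical skew: worker_a's clock runs 50 ms ahead of worker_b's.
CLOCK_SKEW_MS = {"worker_a": 50, "worker_b": 0}


def record(worker: str, event: str, true_time_ms: int) -> LogEntry:
    """Build the entry a worker would report, stamped with its skewed clock."""
    return LogEntry(worker, event, true_time_ms, true_time_ms + CLOCK_SKEW_MS[worker])


log = [
    record("worker_a", "send hello", 0),
    record("worker_b", "receive hello", 10),
]

# Sorting by the workers' own timestamps puts the receive before the send.
by_local_clock = sorted(log, key=lambda entry: entry.local_time_ms)
```

Here the message is received 10 ms after it was sent, yet the logger's sorted view shows the receive first.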

Happened-Before Relationships

A happened-before C can be written as A→C

A→C

E→F→G
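Happened-before is transitive: from E→F and F→G we can also conclude E→G. A minimal Python sketch of that idea, treating the relationships above as edges in a graph. The `EDGES` dictionary and the function name are only illustrative.

```python
# Direct happened-before relationships taken from the slide: A→C and E→F→G.
EDGES = {"A": ["C"], "E": ["F"], "F": ["G"]}


def happened_before(edges: dict, a: str, b: str) -> bool:
    """True if b can be reached from a by following happened-before edges."""
    stack = [a]
    seen = set()
    while stack:
        node = stack.pop()
        for successor in edges.get(node, ()):
            if successor == b:
                return True
            if successor not in seen:
                seen.add(successor)
                stack.append(successor)
    return False
```

With these edges, `happened_before(EDGES, "E", "G")` is true, while A and G are unrelated in both directions.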

DEMO TIME

Lamport Timestamps To The Rescue!

Lamport timestamps are logical timestamps

How they work

  • Each node has a local counter, which is initialized to a starting number.
  • Each node increments the counter when it performs an action or sends a message.
  • Each node sends its local counter along with the message.

The Algorithm™

  • When a node receives a message, it compares the counter in the message with its own local counter and sets its local counter to the larger of the two, plus one.
  • Profit!

The Algorithm™
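The rules above fit in a few lines of Python. This is a minimal sketch, assuming a starting value of 0; the class and method names are only illustrative.

```python
class LamportClock:
    """A logical clock following the rules described above."""

    def __init__(self, start: int = 0):
        self.counter = start

    def tick(self) -> int:
        """Local action: increment the counter."""
        self.counter += 1
        return self.counter

    def send(self) -> int:
        """Sending also increments; the returned value travels with the message."""
        return self.tick()

    def receive(self, message_counter: int) -> int:
        """Take the larger of the two counters, plus one."""
        self.counter = max(self.counter, message_counter) + 1
        return self.counter


# Two nodes: a sends one message, b does two local actions before receiving it.
a = LamportClock()
b = LamportClock()
sent_at = a.send()                 # a's counter becomes 1
b.tick()                           # b's counter becomes 1
b.tick()                           # b's counter becomes 2
received_at = b.receive(sent_at)   # max(2, 1) + 1 = 3
```

The send is stamped 1 and the receive is stamped 3, so the timestamps agree with the happened-before order.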

DEMO TIME

  1. Lamport timestamps can only give a partial ordering of the events.
  2. If A happened-before B, then we know that lamport(A) < lamport(B)
  3. But if we have two timestamps lamport(A) < lamport(B) we cannot infer that A happened-before B!

Caveat Emptor
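A tiny Python sketch of the third point, with two nodes that never exchange a message. The helper name and event counts are made up for illustration.

```python
def local_event_timestamps(count: int, start: int = 0) -> list:
    """Lamport timestamps of `count` local actions on an isolated node."""
    counter = start
    stamps = []
    for _ in range(count):
        counter += 1
        stamps.append(counter)
    return stamps


node_x = local_event_timestamps(1)   # [1]
node_y = local_event_timestamps(3)   # [1, 2, 3]

lamport_a = node_x[0]   # event A: the only action on node x
lamport_b = node_y[2]   # event B: the third action on node y
```

Here `lamport_a < lamport_b`, but the nodes never communicated, so A did not happen-before B: the two events are concurrent.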

Vector Clocks 

To The Rescue
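A minimal Python sketch of a vector clock, assuming the usual formulation: each node keeps one counter per node, merges with an element-wise maximum on receive, and compares whole vectors to decide ordering. The class, method and function names are only illustrative.

```python
class VectorClock:
    """One counter per known node instead of a single shared number."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counters = {node_id: 0}

    def tick(self) -> dict:
        """Local action: increment our own entry and return a snapshot."""
        self.counters[self.node_id] = self.counters.get(self.node_id, 0) + 1
        return dict(self.counters)

    def send(self) -> dict:
        """Sending counts as an action; the snapshot travels with the message."""
        return self.tick()

    def receive(self, message_counters: dict) -> dict:
        """Merge with an element-wise maximum, then count the receive itself."""
        for node, count in message_counters.items():
            self.counters[node] = max(self.counters.get(node, 0), count)
        return self.tick()


def happened_before(a: dict, b: dict) -> bool:
    """a → b if every entry of a is <= b's and at least one is strictly smaller."""
    nodes = set(a) | set(b)
    return (all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
            and any(a.get(n, 0) < b.get(n, 0) for n in nodes))


def concurrent(a: dict, b: dict) -> bool:
    """Neither event happened-before the other, and they are not the same event."""
    nodes = set(a) | set(b)
    same = all(a.get(n, 0) == b.get(n, 0) for n in nodes)
    return not same and not happened_before(a, b) and not happened_before(b, a)


x = VectorClock("x")
y = VectorClock("y")
e1 = x.send()         # {"x": 1}
e2 = y.tick()         # {"y": 1}
e3 = y.receive(e1)    # {"y": 2, "x": 1}
```

Unlike Lamport timestamps, comparing the vectors tells the cases apart: `e1` happened-before `e3`, while `e1` and `e2` are reported as concurrent.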

THANK YOU! QUESTIONS?

Bibliography
