Making Sense of Time in Distributed Systems
Ju Liu @arkh4m Elixir London Meetup
🙌 HI!!! 🙌
I'm Ju 🙇🏻
Also known as @arkh4m
I'm Italian 🇮🇹
Living in London 🇬🇧
And I love to climb ⛰
Falsehoods Programmers Believe About Time
36. Time always goes forward
Let me give you a real world example
At midnight UTC on New Year’s Day, deep inside Cloudflare’s custom RRDNS software, a number went negative when it should always have been, at worst, zero. A little later this negative value caused RRDNS to panic.
RRDNS is written in Go and uses Go’s time.Now() function to get the time. Unfortunately, this function does not guarantee monotonicity. Go currently doesn’t offer a monotonic time source (see issue 12914 for discussion).
The code takes the upstream time values and feeds them to Go’s rand.Int63n() function, which promptly panics if its argument is negative. That's where the RRDNS panics were coming from.
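As a general illustration of the failure mode (not Cloudflare's actual code or patch), here is a minimal Python sketch. The clock readings and the helper names `naive_elapsed` and `clamped_elapsed` are made up for the example. It shows a duration going negative when the wall clock steps back, and two common defences: clamping the value, or measuring with a monotonic clock.

```python
import time


def naive_elapsed(start, end):
    """Difference of two wall-clock readings; negative if the clock stepped back."""
    return end - start


def clamped_elapsed(start, end):
    """Same difference, but never below zero."""
    return max(0.0, end - start)


# Hypothetical wall-clock readings (in seconds) where the clock was set back
# between the two samples, for instance by a leap second or an NTP adjustment.
start_reading = 100.5
end_reading = 99.9

negative = naive_elapsed(start_reading, end_reading)   # below zero
safe = clamped_elapsed(start_reading, end_reading)     # clamped to 0.0

# time.monotonic() is documented to never go backwards, so a difference of
# two readings is always >= 0.
t0 = time.monotonic()
t1 = time.monotonic()
monotonic_elapsed = t1 - t0
```

Clamping hides the symptom, while a monotonic clock removes the cause. Which one fits depends on whether the value has to be comparable across machines.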
The Fix
In Distributed Systems Time is even Harder
Each computer has a clock built in, but those clocks are independent. The clocks on different machines can vary quite a bit.
Network Time Protocol
NTP
NTP is a networking protocol for clock synchronization between computer systems over packet-switched, variable-latency data networks. It is intended to synchronize all participating computers to within a few milliseconds of Coordinated Universal Time (UTC).
A Few Milliseconds
My MacBook can do about 40'000 MIPS.
In a millisecond, that's 40 million instructions.
How long does it take an IP packet to go from London to New York and back?
QUIZ TIME
Distance London - New York
5585 km
Speed of light over physical medium
200'000 km/s
The Formula™
time = distance / speed = 5585 km / 200'000 km/s ≈ 0.028 s
So it takes about 28 ms to get from London to New York.
Which means a 56 ms round trip time.
Considering network hops it ends up being closer to 70 ms.
So my laptop can do 2.8 billion instructions while a packet goes back and forth from London to New York.
If we connect 10 nodes together, that's about 28 billion instructions in total.
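The back-of-the-envelope numbers above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic. The constant names are made up for the example, and the 70 ms figure is the rough observed round trip quoted above, not a computed value.

```python
DISTANCE_KM = 5585          # London - New York
SPEED_KM_PER_S = 200_000    # speed of light over a physical medium
MIPS = 40_000               # millions of instructions per second
OBSERVED_RTT_MS = 70        # round trip including network hops
NODES = 10

# time = distance / speed, converted to milliseconds
one_way_ms = DISTANCE_KM / SPEED_KM_PER_S * 1000
round_trip_ms = 2 * one_way_ms

# 40'000 MIPS is 40'000 million instructions per second
instructions_per_ms = MIPS * 1_000_000 // 1000
instructions_per_rtt = instructions_per_ms * OBSERVED_RTT_MS
instructions_all_nodes = instructions_per_rtt * NODES

print(f"one way: {one_way_ms:.1f} ms, round trip: {round_trip_ms:.1f} ms")
print(f"instructions per round trip: {instructions_per_rtt:,}")
print(f"instructions across {NODES} nodes: {instructions_all_nodes:,}")
```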
In Distributed Systems Time is Really Hard
Let's look at an example
A central logger
- We have a central logger and a set of workers.
- The workers send messages between each other and report to the logger.
- We want to be able to reconstruct the order of the events that the logger receives.
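To see why the logger cannot simply sort by each worker's wall clock, here is a small simulated sketch. The worker names, the 50 ms clock offset and the `LogEntry` type are all hypothetical, chosen only to make the problem visible.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    worker: str
    event: str
    wall_clock_ms: int  # timestamp taken from the worker's own clock


# Hypothetical clock offsets: worker_b's clock runs 50 ms behind worker_a's.
CLOCK_OFFSET_MS = {"worker_a": 0, "worker_b": -50}


def local_time(worker, true_time_ms):
    """What the worker's own (skewed) clock reads at a given true time."""
    return true_time_ms + CLOCK_OFFSET_MS[worker]


# True order: worker_a sends at t=1000, worker_b receives at t=1010.
log = [
    LogEntry("worker_a", "send hello", local_time("worker_a", 1000)),
    LogEntry("worker_b", "receive hello", local_time("worker_b", 1010)),
]

# Sorting by the reported wall-clock time puts the receive before the send.
sorted_by_wall_clock = sorted(log, key=lambda entry: entry.wall_clock_ms)
events_in_order = [entry.event for entry in sorted_by_wall_clock]
print(events_in_order)
```

With only 50 ms of skew, the reconstructed log claims a message was received before it was sent.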
Happened-Before Relationships
A happened-before C can be written as A→C
A→C
E→F→G
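One way to make the relation concrete is to treat it as reachability in a graph of direct "happened-before" edges. The sketch below uses only the relations named above (A→C and E→F→G). The original diagram may contain more edges, so the "concurrent" result for A and G is an assumption of this example.

```python
# Direct happened-before edges named above: A→C and E→F→G.
EDGES = {"A": ["C"], "E": ["F"], "F": ["G"]}


def happened_before(a, b, edges=EDGES):
    """True if b can be reached from a by following one or more edges."""
    stack = list(edges.get(a, []))
    seen = set()
    while stack:
        node = stack.pop()
        if node == b:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, []))
    return False


def concurrent(a, b, edges=EDGES):
    """True if neither event happened before the other."""
    return (
        a != b
        and not happened_before(a, b, edges)
        and not happened_before(b, a, edges)
    )


print(happened_before("E", "G"))  # transitive: E→F and F→G
print(concurrent("A", "G"))       # no path in either direction
```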
DEMO TIME
Lamport Timestamps To The Rescue!
Lamport timestamps are logical timestamps
How they work
The Algorithm™
- Each node has a local counter, which is initialized to a starting number.
- Each node increments its counter when it performs an action or sends a message.
- Each node sends its local counter along with the message.
- When a node receives a message, it compares the message's counter with its local counter and sets the local counter to the larger of the two, plus one.
- Profit!
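The steps of the algorithm can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the code from the demo. The class name `LamportClock`, its method names and the starting value of 0 are choices made for this example.

```python
class LamportClock:
    """A single node's logical clock."""

    def __init__(self, start=0):
        self.counter = start

    def tick(self):
        """Local action: increment the counter and return it."""
        self.counter += 1
        return self.counter

    def send(self):
        """Sending counts as an action; the result travels with the message."""
        return self.tick()

    def receive(self, message_counter):
        """Take the larger of the two counters, plus one."""
        self.counter = max(self.counter, message_counter) + 1
        return self.counter


alice = LamportClock()
bob = LamportClock()

alice.tick()                      # alice performs an action -> 1
stamp = alice.send()              # alice sends a message    -> 2
bob.tick()                        # bob performs an action   -> 1
received_at = bob.receive(stamp)  # max(1, 2) + 1            -> 3
```

The receive is stamped 3 and the send 2, so the send is ordered before the receive regardless of what the two machines' wall clocks say.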
DEMO TIME
Caveat Emptor
- Lamport timestamps can only give a partial ordering of the events.
- If A happened-before B, then we know that lamport(A) < lamport(B).
- But if we have two timestamps with lamport(A) < lamport(B), we cannot infer that A happened-before B!
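A tiny sketch of the last caveat: two nodes that never exchange a message still produce timestamps that compare as smaller and larger. The helper `run_local_events` is hypothetical and models only local counter increments.

```python
def run_local_events(count, start=0):
    """Return the Lamport timestamps of `count` purely local events."""
    counter = start
    stamps = []
    for _ in range(count):
        counter += 1
        stamps.append(counter)
    return stamps


# The two nodes never talk to each other, so none of their events are
# related by happened-before.
node_1_stamps = run_local_events(1)
node_2_stamps = run_local_events(3)

lamport_a = node_1_stamps[-1]  # event A on node 1
lamport_b = node_2_stamps[-1]  # event B on node 2

# lamport(A) < lamport(B) holds, yet A and B are concurrent.
print(lamport_a, lamport_b, lamport_a < lamport_b)
```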
Vector Clocks To The Rescue
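For a rough idea of how vector clocks address the caveat, here is a minimal sketch under the usual textbook rules: one counter per node, an element-wise maximum on receive, and a component-wise comparison. The function names and the two-node setup are assumptions for this example.

```python
def new_clock(nodes):
    """A vector clock with one zeroed counter per node."""
    return {node: 0 for node in nodes}


def tick(clock, node):
    """Local action or send: increment the node's own entry."""
    updated = dict(clock)
    updated[node] += 1
    return updated


def merge(local, incoming, node):
    """On receive: element-wise maximum, then increment the node's own entry."""
    merged = {n: max(local[n], incoming[n]) for n in local}
    merged[node] += 1
    return merged


def happened_before(a, b):
    """a -> b if every entry of a is <= b's and at least one is smaller."""
    return all(a[n] <= b[n] for n in a) and any(a[n] < b[n] for n in a)


def concurrent(a, b):
    """Neither clock dominates the other."""
    return a != b and not happened_before(a, b) and not happened_before(b, a)


nodes = ["n1", "n2"]
a = tick(new_clock(nodes), "n1")  # event on n1
b = tick(new_clock(nodes), "n2")  # independent event on n2
c = merge(b, a, "n2")             # n2 receives n1's message

print(concurrent(a, b))       # a and b are unrelated
print(happened_before(a, c))  # the send precedes the receive
```

Unlike a single Lamport counter, comparing two vector clocks can tell "happened before" apart from "concurrent".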
THANK YOU! QUESTIONS?
Bibliography
- https://github.com/Arkham/lamport_logger
- https://en.wikipedia.org/wiki/Lamport_timestamps
- http://www.goodmath.org/blog/2016/03/16/time-in-distributed-systems-lamport-timestamps/
- https://8thlight.com/blog/rylan-dirksen/2013/10/04/synchronization-in-a-distributed-system.html
- http://basho.com/posts/technical/why-vector-clocks-are-easy/