The most important part of a distributed system: Time
Atomic, consistent, and isolated transactions are a feature that every developer has likely used at least once in their career, and that every internet user has relied on indirectly.
Time is the most important part of distributed systems, and often the bane of the distributed system engineer’s career.
But why?
Why is time so important? Why is something as simple as `time.Now()` so hard? And why does everything depend on it being right?
Why time is so important
Time is how we allow concurrent transactions to take place in databases.
Databases could get pretty slow if all operations had to happen in order, so by keeping track of the time that we started a transaction, we know how to ignore certain subsequent updates (isolation), and make sure that future transactions see everything that we’ve done (atomic & consistent).
This is called MVCC, or “Multi-Version Concurrency Control”.
Traditionally, this hasn’t been too much of an issue for databases (well, Postgres DBAs might disagree, given XID wraparound).
Transactions simply get the current system time, use it to compare MVCC timestamps, and go on their way. Clocks drift, but we can detect when they drift backwards (or at least, further back than the last time we observed them).
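As a rough illustration (not any particular database’s implementation), an MVCC visibility check boils down to comparing a row version’s timestamps against the transaction’s start time. The type and field names here are made up for the example:

```go
package mvcc

import "time"

// Version is a single MVCC version of a row. The field names are
// illustrative, not taken from any specific database.
type Version struct {
	CreatedAt time.Time // when the creating transaction started
	DeletedAt time.Time // zero value means "not deleted"
}

// VisibleTo reports whether this version should be seen by a transaction
// that started at txStart: it must have been created no later than txStart,
// and not yet deleted as of txStart.
func (v Version) VisibleTo(txStart time.Time) bool {
	if v.CreatedAt.After(txStart) {
		return false // created by a later transaction: ignore it (isolation)
	}
	if !v.DeletedAt.IsZero() && !v.DeletedAt.After(txStart) {
		return false // already deleted when our transaction started
	}
	return true
}
```

The whole scheme leans on those timestamps being trustworthy, which is exactly where distributed systems run into trouble.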
With distributed systems, time is the #1 enemy.
Time is hard
Time, and clocks, have always been at least a bit of a challenge for computer systems, and if all computers had perfect clocks, distributed systems wouldn’t be quite as hard.
The major problem is that clocks drift, sometimes by upwards of seconds. Even in cloud data centers with atomic clocks, you can see drift of upwards of 100 microseconds, which is close to the round-trip time within a single AZ.
Now there’s a bit of a hint in there: if the precision of clocks is known, then we can round up beyond that precision and use that time, right?
It’s not that simple, unfortunately.
Distributed systems are often used to solve problems at scale (where one server just wouldn’t cut it), and scale means lots of requests. So many, in fact, that it’s quite common to get request timestamp collisions (multiple requests within the same millisecond, or 100us interval).
And the more servers you have (the larger the distributed system), the more likely it is that the worst-case drift across your fleet approaches the maximum. It gets even worse as you scale!
We have to find some way to agree upon time!
The hybrid-timestamp oracle
The solution that most major systems have converged on is a hybrid-timestamp oracle.
A timestamp oracle is a dedicated server or process that just does one thing: serves monotonic hybrid timestamps.
What’s a hybrid timestamp?
A hybrid timestamp is a combination of a timestamp and a monotonic counter for a chosen time interval. It’s typically represented something like this:
```
1731706496411:0
^^^^^^^^^^^^^ ^
time in ms    monotonic counter
("interval")
```
The interval is typically some higher-level granularity of time (e.g. every 10 milliseconds), and the monotonic counter is reset every time this interval updates.
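A minimal sketch of that representation in Go, assuming the `millis:counter` string form shown above (the type and function names are illustrative):

```go
package hybridts

import (
	"fmt"
	"strconv"
	"strings"
)

// HybridTimestamp pairs a coarse time interval (milliseconds here) with a
// monotonic counter that resets whenever the interval advances.
type HybridTimestamp struct {
	IntervalMillis int64
	Counter        uint64
}

// String renders the timestamp in the "1731706496411:0" form shown above.
func (h HybridTimestamp) String() string {
	return fmt.Sprintf("%d:%d", h.IntervalMillis, h.Counter)
}

// Before reports whether h was issued before other: compare intervals first,
// then the monotonic counter within the interval.
func (h HybridTimestamp) Before(other HybridTimestamp) bool {
	if h.IntervalMillis != other.IntervalMillis {
		return h.IntervalMillis < other.IntervalMillis
	}
	return h.Counter < other.Counter
}

// Parse reads a hybrid timestamp back from its string form.
func Parse(s string) (HybridTimestamp, error) {
	intervalPart, counterPart, ok := strings.Cut(s, ":")
	if !ok {
		return HybridTimestamp{}, fmt.Errorf("malformed hybrid timestamp %q", s)
	}
	millis, err := strconv.ParseInt(intervalPart, 10, 64)
	if err != nil {
		return HybridTimestamp{}, err
	}
	count, err := strconv.ParseUint(counterPart, 10, 64)
	if err != nil {
		return HybridTimestamp{}, err
	}
	return HybridTimestamp{IntervalMillis: millis, Counter: count}, nil
}
```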
The timestamp oracle must strictly follow two rules:
The time interval must never move backwards
The monotonic counter must move forward with every request under the current interval
By having a hybrid timestamp, we don’t have to worry about “clock overrun”: requests coming in so fast that we couldn’t guarantee every request an incrementing number without moving our clock forward too quickly.
By using coarse timestamp intervals, the only operations we need to perform to serve a request are:
Read an atomic integer from memory (time interval)
Increment and load an atomic integer from memory (monotonic counter)
In terms of responding to an HTTP request, that’s negligible overhead: There’s so little data that this could be entirely kept in L1 cache!
This is actually one case where the performance of the HTTP framework probably matters more than the task it’s performing.
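Here’s a minimal sketch of that hot path in Go (illustrative, not EpicEpoch’s actual code). One common trick, and an assumption on my part here, is to pack the interval and counter into a single 64-bit word so serving a timestamp is one atomic add, with no tearing between the two reads:

```go
package oracle

import "sync/atomic"

// Oracle serves monotonic hybrid timestamps. The interval and the counter
// are packed into one 64-bit word: the high bits hold the interval, the low
// bits hold the counter. The bit split below is an illustrative assumption.
type Oracle struct {
	state atomic.Uint64
}

const counterBits = 18 // ~262k timestamps per interval in this sketch

// Next returns the next hybrid timestamp under the current interval.
// The entire hot path is a single lock-free atomic add.
func (o *Oracle) Next() (interval, counter uint64) {
	s := o.state.Add(1) // bump the monotonic counter
	return s >> counterBits, s & (1<<counterBits - 1)
}

// AdvanceInterval is called by a background process to move to a new time
// interval and reset the counter to zero. It must never move backwards.
func (o *Oracle) AdvanceInterval(newInterval uint64) {
	for {
		old := o.state.Load()
		if old>>counterBits >= newInterval {
			return // refuse to move the interval backwards
		}
		if o.state.CompareAndSwap(old, newInterval<<counterBits) {
			return
		}
	}
}
```

The two rules above fall out of this: the CAS loop refuses to move the interval backwards, and the atomic add guarantees the counter only moves forward within an interval.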
Transactions MUST start with time
Before a client can begin a transaction, it must get a hybrid timestamp to indicate the start of the transaction (often it needs a second timestamp for commit time as well).
The process goes like:
Client requests an exclusive timestamp from the oracle
The oracle gets the current time interval and increments the monotonic counter
The oracle combines the two, and returns the hybrid timestamp to the client
The client begins distributed transaction processing
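A rough sketch of the client side of that flow, assuming a hypothetical `/timestamp` HTTP endpoint that returns the `millis:counter` form (both are assumptions, not a real API):

```go
package client

import (
	"fmt"
	"io"
	"net/http"
)

// FetchStartTimestamp asks the oracle for an exclusive hybrid timestamp that
// marks the start of a transaction. The "/timestamp" path and the plain-text
// response format are hypothetical.
func FetchStartTimestamp(oracleURL string) (string, error) {
	resp, err := http.Get(oracleURL + "/timestamp")
	if err != nil {
		return "", fmt.Errorf("requesting timestamp: %w", err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", fmt.Errorf("reading timestamp: %w", err)
	}
	// e.g. "1731706496411:42" -- used as the transaction's start time for
	// every subsequent read and write in the transaction.
	return string(body), nil
}
```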
To get a better understanding of why distributed transactions need shared time, I suggest reading Google’s Percolator paper; it’s well written and makes distributed transaction processing clear.
The TLDR is that timestamp checks are needed constantly on both the read and write paths to make sure we maintain ACID guarantees. It also enables “time travel queries”, which is a feature that’s hard to give up once you’ve got it.
The timestamp oracle will typically use some replicated storage, and in the background (e.g. every 100 milliseconds), sync an update with followers to advance the time interval and reset the monotonic counter.
By building a distributed timestamp oracle with a consensus protocol like Raft, we can ensure that our oracle service minimizes downtime by quickly electing a new leader, and directing timestamp requests to it.
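Continuing the `Oracle` sketch from earlier (again illustrative, not EpicEpoch’s actual code), the background loop might look like this, with `proposeInterval` standing in for a Raft proposal that returns once the new interval is committed:

```go
package oracle

import (
	"context"
	"time"
)

// runIntervalAdvancer periodically picks a new time interval and replicates
// it before exposing it to the serving path. proposeInterval is a stand-in
// for a Raft proposal that returns once the new interval is committed.
func (o *Oracle) runIntervalAdvancer(ctx context.Context, proposeInterval func(context.Context, uint64) error) {
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Round wall-clock time down to a 10ms interval.
			next := uint64(time.Now().UnixMilli()/10) * 10
			if err := proposeInterval(ctx, next); err != nil {
				// e.g. a leader election in progress: keep serving
				// timestamps under the old interval and retry later.
				continue
			}
			o.AdvanceInterval(next) // also resets the monotonic counter
		}
	}
}
```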
This is how I’ve done it in my exploration implementation: EpicEpoch.
This simple service, written in Go, uses Raft for consensus with followers, and can comfortably serve millions of unique timestamps per second. I chose a 10ms time interval, as that’s generally more than enough headroom for a single cloud region. Serving timestamps is disjoint from Raft, so even if Raft is currently in an election, we can keep incrementing the counter.
The design specifically disaggregates Raft from serving timestamps, so the serving path is only limited by how fast it can increment an atomic counter and respond to HTTP requests.
And thankfully, serving hybrid timestamps is quite light on resources, meaning it’s easy for a single process to serve millions of timestamps per second (or even more!).
I’d encourage you to poke around the code; it’s a pretty small codebase and easy to run.
Defending the title
Now you can see why timestamp oracles are so important: They are the first step to reading or writing from the database with the ACID guarantees we often expect.
In strongly consistent systems like FoundationDB, we must get a timestamp before we can read, so we know whether a write has committed or not.
In eventually consistent systems like ScyllaDB or Cassandra, the client’s time is used for LWW (last-write wins) conflict resolution. If the drift is significant, this can mean losing writes to a misbehaving client. Not good!
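A sketch of what LWW resolution boils down to (illustrative, not ScyllaDB’s or Cassandra’s implementation):

```go
package lww

// Write is a value tagged with the client-supplied timestamp used for
// last-write-wins resolution. The field names are illustrative.
type Write struct {
	Value     []byte
	Timestamp int64 // client-reported time, e.g. microseconds since epoch
}

// Resolve picks the winner between two conflicting writes: the larger
// timestamp wins. A client whose clock runs far ahead will keep "winning",
// silently discarding otherwise-correct writes.
func Resolve(a, b Write) Write {
	if b.Timestamp > a.Timestamp {
		return b
	}
	return a
}
```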
Many systems have a timestamp oracle built directly in:
FoundationDB has a dedicated “sequencer” process
TiDB (and TiKV) uses the Placement Driver (PD) as the timestamp oracle
Google Spanner simply “waits out” the uncertainty, so you get a latency floor of roughly 7ms on an operation (this may have changed since I last looked it up)
Without reliable time, distributed systems fall apart really fast.
They exhibit undefined behavior, present insidious bugs, and gaslight developers into thinking transactions aren’t isolated (they are, they’re just given bad information).
And often these bugs go undetected for a long time, because they don’t throw errors. They just silently overwrite data or include uncommitted data, unknowingly violating the sacred ACID pact between DBs and devs.