08 / 09

GCP / 08

Spanner

A relational database that speaks SQL, scales out by adding machines, replicates across continents, and still gives you transactions that behave as if the whole thing ran on one box. That sentence should be impossible, and for thirty years it was. Spanner makes it true with a strange ingredient: a clock that knows how wrong it might be. This page explains TrueTime and commit wait properly, then works through splits, interleaved tables, Paxos leaders, the two kinds of read, multi-region trade-offs, and the bill. It ends with a gcloud lab where you run the real thing for about an hour.

A promise that sounds like a lie

Before Spanner, distributed databases offered a forced choice. You could have a relational database with real transactions, but it lived on one primary machine and scaled by buying a bigger machine until there were no bigger machines. Or you could have a horizontally scaled store, but you gave up cross-row transactions, or consistency, or SQL, or all three, and your application code absorbed the difference. Google felt this pain directly: the AdWords backend ran on a sharded MySQL fleet whose resharding operations took years of engineering time, and the teams building on Bigtable kept asking for transactions that spanned rows.

The Spanner paper (Corbett et al., OSDI 2012) answered with a position that was almost confrontational for its era: "we believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions." In other words, give people the strong abstraction and let them optimise later, because working around missing transactions is the expensive part. The system behind that sentence is a sharded, replicated key-value substrate with a SQL layer on top, and its headline guarantee is external consistency: if transaction T1 commits before transaction T2 starts, then T1's commit timestamp is smaller than T2's. Every observer, in every region, sees the same order, and that order agrees with real time. This is the strictest practical isolation level there is, equivalent to strict serializability, and Spanner provides it across datacenters on opposite sides of the planet.

Cloud Spanner, generally available since 2017, is the externalised version of that internal system. You get the same machinery (TrueTime, splits, Paxos replication) behind a managed surface with two SQL dialects, GoogleSQL and a PostgreSQL-flavoured one. It belongs to the same family of ideas as the other distributed SQL systems (CockroachDB, YugabyteDB, TiDB), all of which descend from the paper, but Spanner remains the only one that gets to assume Google's clock hardware. That assumption is where the story starts.

TrueTime: a clock that admits its error

The hard problem in a distributed database is ordering. Transactions land on machines in different buildings and different countries, and you need to put them in one global order that every replica agrees on. The classic approaches either funnel everything through a single timestamp oracle (a bottleneck and a single point of failure) or use logical clocks, which order events that communicate but say nothing about events that do not. Wall clocks would be ideal, except that ordinary server clocks synchronised over NTP can be off by tens of milliseconds, occasionally much more, and the worst part is that you have no idea by how much at any given moment. Comparing two timestamps from two machines is meaningless if either clock might be lying. The general problem of why machine time is so slippery has its own page on time and clocks; Spanner's contribution is a way to live with the slipperiness instead of pretending it away.

TrueTime's move is to stop returning a single number. The API call TT.now() returns an interval, [earliest, latest], with a guarantee: the true absolute time, whatever it is, lies inside that interval. Two helper calls follow directly: TT.after(t) is true once t is definitely in the past (the whole current interval sits above it), and TT.before(t) is true while t is definitely in the future. Instead of a clock that is silently wrong, you get a clock that is honestly uncertain, and honest uncertainty turns out to be something you can build on.

Keeping that interval narrow takes hardware. Each Google datacenter runs a set of time masters; most have GPS receivers with dedicated antennas, and a minority carry atomic clocks as a hedge against GPS-wide failures (the paper calls these Armageddon masters, which tells you the failure they were designed for). The two source types fail in unrelated ways: GPS is vulnerable to antenna faults, jamming, and spoofing, while atomic clocks drift slowly but steadily. Every machine runs a time daemon that polls a mix of nearby and distant masters roughly every 30 seconds, rejects liars using an agreement algorithm, and then, between polls, inflates its reported uncertainty at a worst-case drift rate of 200 microseconds per second. The result is a sawtooth: uncertainty (called epsilon) starts near 1 millisecond right after a poll, ramps to roughly 7 milliseconds before the next one, and averaged about 4 milliseconds at the time of the paper. Single-digit milliseconds of admitted error, fleet-wide, with a hard guarantee attached. That number is the whole foundation.

Commit wait: turning clock error into ordering

Here is the famous trick, and it is small enough to state in two sentences. When a transaction is ready to commit, its Paxos leader picks the commit timestamp s = TT.now().latest, the pessimistic top of the current uncertainty interval. Then the leader deliberately stalls, holding the transaction's locks, until TT.after(s) becomes true, and only then applies the writes and makes the commit visible. That stall is commit wait.

Commit wait. The leader assigns the most pessimistic timestamp in its interval, then waits the uncertainty out before letting anyone see the commit.

Why does this buy external consistency? Walk it through. When T1's commit becomes visible, TT.after(s1) holds, which means true time is already past s1. Any transaction T2 that starts after seeing T1's effects therefore starts at a true time greater than s1, and when T2's leader calls TT.now(), the guarantee says the whole returned interval contains true time, so even its earliest edge is above s1. T2's timestamp, picked from that interval, must exceed s1. No messages were exchanged, no central oracle was consulted, and the two transactions may have run on different continents through different leaders. The ordering falls out of two local clock reads plus a short nap. Bounded clock error, waited out, becomes a global ordering guarantee.

Notice what the design did to the engineering incentives. Clock uncertainty is no longer a correctness threat; it is a latency tax, paid as a few milliseconds of commit wait. Buy better clocks and your database commits faster, which must be the only system where the database team's roadmap includes antenna placement. In practice the tax is mild: commit wait overlaps with the Paxos replication round the commit needs anyway, so much of it hides inside work the leader was already doing.

The timestamps earn their keep a second time on the read side. Because every commit carries a timestamp consistent with real time, "the state of the entire database at time t" is a well-defined object, and any replica that has caught up through t can serve a perfectly consistent snapshot at t without taking a single lock. Lock-free consistent snapshots across a global database are arguably the feature; external consistency is the guarantee that makes them trustworthy.

Splits: how a table becomes a fleet

Under the SQL surface, a Spanner table is its primary key. Rows live physically sorted by primary key, and the sorted keyspace is chopped into contiguous ranges called splits. The split is the unit of everything: it is the unit of distribution (splits are spread across the servers in your instance), the unit of replication (each split is copied to multiple zones or regions), and the unit of consensus (each split has its own Paxos group, more on that below). A small database is one split; a large one is thousands, and you never name or manage them.

Spanner moves split boundaries on its own, for two reasons. Size: when a split grows past a few gigabytes it divides. Load: when a split is hot, it divides at the point of heat and the pieces can move to less busy servers. This is the operational pitch in one line — the resharding project that costs other teams a quarter happens here continuously, silently, and without a maintenance window. The flip side is that everything about performance now hangs on the primary key, because the key decides which split, and therefore which machine, every operation lands on.

Interleaved tables, and the hot-key trap

A range-sharded system gives you one big lever: things that are close in key order are close in the world. Spanner's schema language exposes that lever as interleaved tables. Declare Orders as interleaved in parent Customers, with the parent's key as a prefix of the child's key, and the child rows are stored physically between their parent rows. A customer and all of her orders sit next to each other on disk, inside the same split, served by the same Paxos group.

Interleaving stores child rows physically under their parent, so parent–child joins and transactions never leave the split.

The payoff is twofold. Joins along the hierarchy become local scans instead of cross-machine lookups, and transactions that touch a parent and its children stay inside one Paxos group, which means they commit in a single consensus round with no two-phase coordination. The constraint that makes this work is also the caveat: split boundaries only fall between root rows, so a root row plus every descendant under it must fit in one split. Keep a hierarchy to single-digit gigabytes; a celebrity-customer row with a hundred million children becomes an unsplittable hot lump. Interleave when children are almost always fetched with their parent, and stop nesting when a level starts being queried independently.

Now the trap, and it catches almost everyone arriving from MySQL or Postgres. There is no auto-increment in Spanner, and the omission is deliberate. A monotonically increasing primary key (a sequence, a timestamp, anything that only grows) means every insert lands at the tail of the keyspace, in the same final split, on one machine. Your globally distributed database now has the write throughput of a single server, and load-based splitting cannot save you because the heat never spreads: it just follows the tail. The fixes all amount to scattering the key: use a UUIDv4, or one of Spanner's built-in bit-reversed sequences (monotonic inside, scattered outside), or prefix the key with a hash or shard number, or reorder a composite key so a well-distributed column comes first. The same reasoning applies to secondary indexes: an index on a timestamp column is itself a table whose key is the timestamp, with the same tail split and the same single hot machine.

Paxos groups, leaders, and the two kinds of read

Each split is replicated as its own consensus group running Paxos. In a regional instance, that means a replica in each of three zones; one of them holds a renewable leader lease (about ten seconds at a time) and does the talking. The leader owns the lock table for its split, sequences all writes through Paxos, and assigns commit timestamps. Different splits have different leaders, deliberately scattered, so the leadership load of a thousand splits spreads across the whole instance rather than electing one boss machine.

Every split is its own Paxos group. Leaders are scattered across zones so no single zone carries all the write coordination.

A write, then. The client sends its buffered mutations (Spanner buffers writes client-side until commit, so a "blind write" that updates rows without reading them first never takes read locks at all) to the leader of the affected split. The leader acquires write locks, picks the commit timestamp, replicates the write to a majority of replicas through Paxos, sits out commit wait, then applies and acknowledges. If the transaction touches several splits, Spanner layers two-phase commit over the Paxos groups, with one group acting as coordinator. Textbook 2PC has a terrifying failure mode, a coordinator that dies and leaves everyone blocked, but here each participant is itself a replicated state machine that does not really die, which defuses most of the horror. The cost that remains is latency: extra rounds of cross-group coordination. This is why schema design keeps coming back to "make related data share a split": single-group transactions skip 2PC entirely.

Reads are where Spanner hands you a meaningful dial. A strong read reads at a current timestamp and is guaranteed to see everything committed before it. The replica serving it must prove it is caught up through that timestamp; each replica tracks a "safe time" derived from its Paxos state, and if the replica's safe time lags, the read either waits or asks the leader, which can mean a cross-zone or cross-region round trip. A stale read gives up recency on purpose: exact staleness ("read as of 14:03:00") or bounded staleness ("read at most ten seconds old") lets the nearest replica serve immediately from its own history, no leader conversation needed. In a regional instance the difference is small. In a multi-region instance it is the difference between a single-digit-millisecond local read and a transoceanic round trip, which is why the standard advice for read-heavy global workloads is to use bounded staleness of ten to fifteen seconds for anything that can tolerate it. The dial is per-read, not per-database: the same application can read its own writes strongly and render everyone else's dashboards stale.

Rule of thumb. Reserve strong reads for read-your-own-writes paths and invariant checks. Everything else (feeds, dashboards, catalogues, search backfills) takes a bounded-staleness read and gets served by the replica in the same building as the user.

Schema changes without a maintenance window

Spanner performs DDL online, and the mechanism is another TrueTime trick worth knowing. In the paper's design, a schema change is a transaction that gets assigned a timestamp in the future. Reads and writes carry the schema version they expect; anything timestamped before the change uses the old schema, anything after uses the new one, and the changeover needs no global lock and no quiesced fleet, because the timestamp itself is the coordination. Cloud Spanner exposes this as plain DDL statements that run asynchronously while traffic continues: adding a column is quick, while building a secondary index triggers a background backfill that can take a while on large tables but never blocks reads or writes. You still owe the usual application discipline (deploy code that tolerates both schemas, never drop a column something still reads), but the database-side downtime of a MySQL ALTER on a big table simply is not part of the model.

Multi-region, and what it does to your latency

An instance's replication topology is fixed by its configuration. Regional configurations keep three read-write replicas in three zones of one region: quorum traffic stays inside the region, writes cost a few milliseconds, and the SLA is 99.99%. Multi-region configurations spread replicas across regions in fixed recipes. A representative one, nam3, puts two read-write replicas in each of two North American regions plus a witness replica in a third; the witness votes in Paxos but stores no readable data, existing purely so quorum survives a regional outage. Wider configurations such as nam-eur-asia1 add read-only replicas on other continents: full copies that serve reads but never vote, so they cannot slow commits down. Multi-region carries a 99.999% SLA, which is about five minutes of budgeted unavailability a year.

The latency arithmetic follows from the machinery you have already seen. Every configuration names a default leader region, and writes must reach the leader and then a quorum, so write latency has a floor of roughly the round trip between the leader region and the nearest other read-write region. No clever schema fixes that; it is physics plus Paxos. Strong reads from a non-leader region may need that round trip too, when the local replica's safe time lags. Stale reads are the escape hatch: a read-only replica in Tokyo serves a fifteen-second-stale read in a couple of milliseconds regardless of where the leader sits. So the placement rule is blunt: put the leader region where the writers are, lean on read-only replicas plus staleness everywhere else, and treat any product requirement of "strong reads, everywhere, fast" as a requirement to renegotiate.

What it costs, and when it is actually worth it

Spanner is priced on three meters: compute (nodes, or tenths of a node called processing units, 1000 PU to a node), storage per gigabyte-month, and network egress. Ballpark figures, worth re-checking against current pricing: a regional node runs about $0.90 an hour, around $650 a month, with storage near $0.30 per GB-month; a node in a large multi-region configuration is roughly three times that, with pricier storage to match the extra copies. The minimum footprint is 100 PU, which brings a real, production-SLA regional Spanner down to roughly $65 a month, and a node handles on the order of thousands of writes and tens of thousands of reads per second (workload-dependent; CPU utilisation is the metric that actually governs scaling). Storage caps out around 10 TB per node, so very large datasets set a compute floor even when traffic is light.

Need	Honest answer
One region, fits on a big primary, Postgres or MySQL ecosystem	Cloud SQL. Boring, cheap, fine.
Postgres compatibility with more headroom and read pools	AlloyDB. Real Postgres, vertical writes, scale-out reads.
Write throughput beyond any single primary, or five-nines, or multi-region strong consistency	Spanner. This is the territory it was built for.
Global analytics over huge data, no transactions needed	Not a Spanner job at all; that is BigQuery.

The honest comparison with Cloud SQL is not really about price per gigabyte; it is about which failure you would rather own. Cloud SQL is a managed single-primary database: cheaper at small scale, instantly familiar, compatible with the whole Postgres extension ecosystem, and bounded by the largest machine available, with failover measured in seconds-to-minutes and maintenance windows that exist. AlloyDB raises the single-primary ceiling considerably and keeps wire-level Postgres compatibility, but writes still funnel through one primary. Spanner removes the ceiling and the failover concept (a zone loss is absorbed by quorum, not handled by a script), and charges for it three ways: money, the latency floors above, and schema discipline, because it is not wire-compatible with anything and its performance model punishes Postgres habits like sequential keys. A useful tiebreaker question: would your team rather spend the next two years sharding a relational database by hand, or adapting to Spanner's key-design rules? Teams that face that choice tend to find Spanner cheap at the price. Teams that do not face it are buying a global consensus machine to run a blog.

The CAP question, answered carefully

"Spanner beats CAP" is the popular take, and it is wrong in an instructive way. Eric Brewer, who formulated the conjecture and works at Google, wrote a 2017 paper specifically to keep the record straight. His framing: Spanner is technically CP. When a network partition happens, it chooses consistency and forfeits availability; the minority side of a partitioned Paxos group stops serving rather than serve wrong answers. No theorem was harmed.

What Brewer actually claims is narrower and more interesting: Spanner is "effectively CA", in his words, because Google made partitions so rare that choosing C over A almost never costs anything users can measure. Spanner runs on Google's private backbone, with redundant paths, owned fibre, and traffic engineering, so the network is not the open internet the CAP conjecture worries about; and with better than five-nines availability, partitions sit below users' noise floor, lost among the other rare failures that any system has. The lesson generalises poorly and is still worth internalising: CAP describes a forced choice at the moment of partition, and one valid engineering response is to spend enormous effort making that moment vanishingly rare, then choose consistency without much regret. If you cannot make that spend (and outside a hyperscaler you mostly cannot), your CP system will visibly pause during partitions. Spanner's availability story is bought with infrastructure, not cleverness, and Brewer is unusually candid about that.

CLI lab: run a real Spanner for an hour

Everything above is checkable from a terminal. This lab creates the smallest billable Spanner (100 processing units, roughly nine cents an hour; there is also a 90-day free-trial instance type if your account qualifies), builds the interleaved Customers/Orders schema from the diagrams, and demonstrates the strong-versus-stale read difference with timestamps you pick yourself. Tear it down at the end; the meter runs while the instance exists.

Enable the API and create the instance. 100 processing units is a tenth of a node and the smallest unit money can buy.
gcloud services enable spanner.googleapis.com gcloud spanner instances create lab-spanner \ --config=regional-us-central1 \ --description="Spanner codex lab" \ --processing-units=100
Create a database with an interleaved schema. Orders is physically stored inside Customers; note the parent key repeated as the prefix of the child key.
gcloud spanner databases create labdb --instance=lab-spanner \ --ddl='CREATE TABLE Customers ( CustomerId INT64 NOT NULL, Name STRING(64) ) PRIMARY KEY (CustomerId); CREATE TABLE Orders ( CustomerId INT64 NOT NULL, OrderId INT64 NOT NULL, Item STRING(64), Total INT64 ) PRIMARY KEY (CustomerId, OrderId), INTERLEAVE IN PARENT Customers ON DELETE CASCADE'
Insert a parent and two children.
gcloud spanner rows insert --database=labdb --instance=lab-spanner \ --table=Customers --data=CustomerId=1,Name=Ada gcloud spanner rows insert --database=labdb --instance=lab-spanner \ --table=Orders --data=CustomerId=1,OrderId=1001,Item=keyboard,Total=120 gcloud spanner rows insert --database=labdb --instance=lab-spanner \ --table=Orders --data=CustomerId=1,OrderId=1002,Item=monitor,Total=300
Run a strong read. This is the default; the serving replica proves it has seen every commit up to now. The join stays inside one split thanks to interleaving — add --query-mode=PROFILE to see the plan confirm it.
gcloud spanner databases execute-sql labdb --instance=lab-spanner \ --sql='SELECT c.Name, o.OrderId, o.Item, o.Total FROM Customers c JOIN Orders o ON c.CustomerId = o.CustomerId'
Capture a timestamp, then change the data. The recorded instant becomes the "past" you will read from in the next step.
TS=$(date -u +%Y-%m-%dT%H:%M:%SZ) sleep 2 gcloud spanner databases execute-sql labdb --instance=lab-spanner \ --sql='UPDATE Orders SET Total = 999 WHERE CustomerId = 1 AND OrderId = 1001'
Strong versus stale, side by side. The strong read sees 999. The stale read at your captured timestamp sees 120: a consistent snapshot of the database as it was, served without locks. Bounded staleness ("at most n seconds old") works the same way and is available through the client libraries.
gcloud spanner databases execute-sql labdb --instance=lab-spanner \ --sql='SELECT Total FROM Orders WHERE CustomerId = 1 AND OrderId = 1001' gcloud spanner databases execute-sql labdb --instance=lab-spanner \ --read-timestamp=$TS \ --sql='SELECT Total FROM Orders WHERE CustomerId = 1 AND OrderId = 1001'
Watch ON DELETE CASCADE follow the interleaving. Deleting the parent row removes its co-located children in the same single-split transaction.
gcloud spanner databases execute-sql labdb --instance=lab-spanner \ --sql='DELETE FROM Customers WHERE CustomerId = 1' gcloud spanner databases execute-sql labdb --instance=lab-spanner \ --sql='SELECT COUNT(*) AS remaining FROM Orders'
Tear down. The instance bills until this runs.
gcloud spanner databases delete labdb --instance=lab-spanner --quiet gcloud spanner instances delete lab-spanner --quiet

Total cost if you finish inside an hour: about a dime. What you touched was not a demo tier; it was the same TrueTime, the same splits, and the same Paxos groups that run the large installations, just with one-tenth of one node's worth of compute in front of them.

Up next

09 — Pub/Sub

The messaging fabric: push versus pull subscriptions, ordering keys, exactly-once delivery, and why the fan-out model differs from a Kafka-style log.

Continue

← Back to the GCP index ↑ The codex

Found this useful?

Spanner

A promise that sounds like a lie

TrueTime: a clock that admits its error

Commit wait: turning clock error into ordering

Splits: how a table becomes a fleet

Interleaved tables, and the hot-key trap

Paxos groups, leaders, and the two kinds of read

Schema changes without a maintenance window

Multi-region, and what it does to your latency

What it costs, and when it is actually worth it

The CAP question, answered carefully

Further reading

CLI lab: run a real Spanner for an hour

09 — Pub/Sub