Cloud Databases

Google Spanner: Globally Distributed SQL at Scale

Understand Google Spanner's architecture — TrueTime, interleaved tables, schema design, and cost model

JusDB Team
April 12, 2023
11 min read
176 views

Running a globally distributed database that maintains strong consistency across continents sounds like an unsolved problem — until Google shipped Spanner into production in 2012 and later made it available on Google Cloud. Spanner powers Google Ads, Gmail, and Google Photos, systems that cannot afford split-brain scenarios or eventual consistency anomalies. The engineering team solved the hardest distributed systems problem — external consistency at planetary scale — by leaning on atomic clocks and GPS receivers instead of purely logical time. What came out of that work is a relational database that behaves like a single-node PostgreSQL instance from an ACID perspective while transparently sharding across data centers on multiple continents.

TL;DR
  • Spanner achieves external consistency (stronger than serializability) using TrueTime — GPS and atomic clocks that bound clock skew to under 7 ms.
  • Replication uses multi-Paxos; writes go to a leader replica and are committed only after a quorum acknowledges.
  • Interleaved tables physically co-locate child rows with parent rows, eliminating cross-shard joins for hierarchical data.
  • Stale reads can dramatically reduce latency and cost by reading from the nearest replica without acquiring locks.
  • Pricing starts at $0.90 per node-hour for regional configs, making Spanner expensive for workloads that don't justify global distribution.
  • CockroachDB is the most direct open-source alternative if vendor lock-in or cost is a concern.

What Is Google Spanner?

Spanner is a fully managed, horizontally scalable relational database service available on Google Cloud. Unlike most distributed databases that sacrificed consistency for availability (per the CAP theorem framing popular in the 2010s), Spanner makes a different trade-off: it prioritizes consistency and partition tolerance, accepting that during a network partition some requests will block rather than return stale data.

The original 2012 paper described Spanner as the first system to distribute data at global scale and support externally consistent distributed transactions. External consistency is a property stronger than serializability: it means if transaction T1 commits before transaction T2 starts (in wall-clock time), then T2 will observe all of T1's writes. No distributed system can guarantee this property without a reliable source of wall-clock time — which is exactly what TrueTime provides.

From a developer-facing perspective, Spanner exposes a SQL interface (GoogleSQL dialect, which is largely ANSI SQL-compliant with extensions), supports DDL-driven schema changes that run online without downtime, and integrates natively with the Google Cloud ecosystem including IAM, VPC Service Controls, and Cloud Monitoring.

TrueTime: The Clock That Makes It All Work

Every distributed database must decide how to order events across nodes that have independent clocks. NTP synchronization is good to within tens of milliseconds, and logical clocks (Lamport timestamps, hybrid logical clocks) can order events causally but cannot bound the gap between logical and wall-clock time. Spanner takes a different approach entirely.

TrueTime is a global time API developed by Google that exposes time as an interval [earliest, latest] rather than a single point. The interval represents the uncertainty in the current wall-clock time. Internally, TrueTime uses GPS receivers and atomic clocks deployed in every Google data center to synchronize time. The resulting clock skew is bounded to under 7 milliseconds in practice (Google targets under 6 ms epsilon).

Here is how TrueTime enables external consistency for commits:

  1. When a transaction is ready to commit, Spanner calls TT.now() and gets back an interval [t_earliest, t_latest].
  2. Spanner assigns the commit timestamp s >= t_latest.
  3. Before releasing the commit, Spanner waits until TT.now().earliest > s — meaning the current time is guaranteed to be after the commit timestamp on every node.
  4. This "commit-wait" ensures no future transaction can receive a timestamp earlier than s, preserving the external consistency invariant.
Warning

Commit-wait introduces latency proportional to TrueTime epsilon (typically 4–14 ms). For workloads that issue many small writes in tight loops, this is a meaningful overhead. Design your write path to batch operations where possible.

Architecture: Paxos, Splits, and Directories

Spanner organizes data into tablets — contiguous ranges of rows ordered by primary key. Each tablet is replicated across multiple zones (physical data centers) using a variant of Paxos called multi-Paxos. One replica in each Paxos group is elected leader; all writes go to the leader, which proposes the write to the group and commits once a quorum (typically 2 of 3 or 3 of 5) acknowledges.

As a table grows, Spanner automatically splits it into multiple tablets at row-key boundaries. Splits are the unit of distribution: different splits can have different Paxos leaders in different zones, enabling geographic distribution of read/write traffic. Splits merge back together when data is deleted.

The concept of a directory (now called a "key range" in Cloud Spanner documentation) sits between the application schema and the physical split. A directory is the smallest unit of data movement between Paxos groups. Spanner moves directories between groups to balance load and co-locate frequently accessed data.

For multi-region configurations, Spanner places voting replicas across regions. A typical 5-replica multi-region setup might have 2 voting replicas in us-central1, 2 in us-east1, and 1 non-voting witness in us-west1. Writes must reach a quorum, so round-trip latency between regions directly affects write throughput. Google publishes multi-region configs (like nam6, eur3, asia1) with predefined replica placements optimized for continental distribution.

Data Modeling: Interleaved Tables and Schema Design

Spanner's most distinctive schema feature is table interleaving. When you declare a child table as interleaved in a parent table, Spanner physically stores child rows adjacent to their parent row on disk, within the same tablet. This eliminates cross-shard joins for parent-child access patterns and is the primary tool for modeling hierarchical data efficiently.

text
CREATE TABLE Users (
  UserId     INT64 NOT NULL,
  Email      STRING(255),
  CreatedAt  TIMESTAMP,
) PRIMARY KEY (UserId);

CREATE TABLE Orders (
  UserId     INT64 NOT NULL,
  OrderId    INT64 NOT NULL,
  TotalCents INT64,
  Status     STRING(50),
) PRIMARY KEY (UserId, OrderId),
  INTERLEAVE IN PARENT Users ON DELETE CASCADE;

With this schema, a query fetching a user and all their orders reads a single contiguous range of bytes — no distributed join required. The ON DELETE CASCADE clause propagates deletes from parent to child rows within the same atomic operation.

Tip

Interleaving is effective for access patterns where you almost always read a parent alongside its children. If your queries frequently access child rows independently (e.g., scanning all orders across all users), a flat table with a secondary index on UserId may outperform an interleaved design.

Spanner does support foreign key constraints (added in 2020), but they behave differently from traditional RDBMS: they enforce referential integrity at transaction commit time using distributed lookups, which adds latency. Many Spanner practitioners enforce referential integrity at the application layer and avoid database-level foreign keys for performance-sensitive paths.

Secondary indexes in Spanner are global by default — they are stored as separate interleaved tables internally. You can scope an index to store additional columns to cover common queries and avoid base-table lookups:

text
CREATE INDEX OrdersByStatus
ON Orders (Status, UserId)
STORING (TotalCents, CreatedAt);

The GoogleSQL dialect is largely compatible with ANSI SQL but has notable differences: no AUTO_INCREMENT (use UUID() or application-generated UUIDs — but beware hotspots with sequential IDs), no sequences (as of early 2024 sequences are available in preview), and ARRAY and STRUCT types for semi-structured data.

Warning

Using monotonically increasing integers (like Unix timestamps or auto-increment IDs) as leading primary key components creates a write hotspot: all inserts go to the same split leader. Use bit-reversed sequences, UUIDs, or hash-prefixed keys to distribute writes across splits.

Stale Reads and Read-Only Transactions

Not every query needs to read the absolute latest committed data. Spanner exposes stale reads — reads that return data as of a timestamp in the past (bounded staleness) or a specific past timestamp (exact staleness). Because stale reads do not need to acquire locks and can be served by any replica (including non-leader replicas in nearby zones), they are both lower-latency and lower-cost.

text
// Java client example: bounded staleness read
try (ReadOnlyTransaction txn =
    dbClient.singleUseReadOnlyTransaction(
        TimestampBound.ofMaxStaleness(15, TimeUnit.SECONDS))) {
  ResultSet rs = txn.executeQuery(
      Statement.of("SELECT OrderId, Status FROM Orders WHERE UserId = @uid"),
      Map.of("uid", Value.int64(userId)));
  // process rs
}

For analytics, reporting dashboards, and read-heavy microservices where a few seconds of staleness is acceptable, bounded staleness reads can cut query latency by 50–80% compared to strong reads in multi-region configurations.

When to Use Spanner

Spanner is the right tool when several conditions are true simultaneously:

  • Global user base requiring low write latency across regions. If your users span multiple continents and they write data (not just read), Spanner's multi-region configs provide consistent write performance everywhere without application-level sharding.
  • Strong consistency is non-negotiable. Financial ledgers, inventory management, seat reservation systems — anywhere a dirty read or lost update has real-world consequences.
  • Scale beyond a single PostgreSQL instance. Spanner's horizontal scaling is transparent; adding nodes increases throughput without schema changes or application rewrites.
  • Operational simplicity matters. Spanner is fully managed: no replication configuration, no failover scripts, no vacuum tuning.

Limitations and Cost

Spanner's pricing model is based on compute capacity measured in Processing Units (PUs). 1000 PUs equals 1 node. Regional instance pricing starts at approximately $0.90 per node-hour (1000 PUs), plus storage at $0.30 per GB/month and network egress. A multi-region configuration costs roughly 3x the regional price for the same compute capacity, reflecting the additional replicas and cross-region replication traffic.

For a workload running 1 node 24/7 in a regional config: $0.90 × 24 × 30 = $648/month before storage. A comparable RDS PostgreSQL db.r6g.xlarge runs ~$250/month. Spanner's premium is justified only when you genuinely need its distributed consistency guarantees.

Spanner vs CockroachDB: CockroachDB is the most direct architectural peer — both use distributed consensus replication (Raft in CockroachDB's case) and support SQL. CockroachDB is open-source and can run on your own infrastructure, eliminating vendor lock-in. Spanner's TrueTime gives it a latency advantage for commit-wait scenarios, but CockroachDB's hybrid logical clocks provide equivalent causal consistency for most workloads. CockroachDB's SQL compatibility with PostgreSQL is significantly higher, reducing migration friction.

Warning

Spanner is overkill — and expensive — if your application is single-region, your data fits comfortably on a single PostgreSQL instance, or your write throughput doesn't exceed what a well-tuned primary-replica PostgreSQL cluster can handle. Start with a managed PostgreSQL or MySQL offering and migrate to Spanner when you actually hit the ceiling.

Additional limitations to plan around:

  • Mutations per transaction are limited to 80,000 by default.
  • Maximum row size is 10 MB.
  • DDL changes are online but can take minutes to hours for large tables.
  • No stored procedures or triggers (logic lives in the application layer).
  • Limited ecosystem compared to PostgreSQL — ORMs, migration tools, and query analyzers have uneven support.
Key Takeaways
  • Spanner's TrueTime API uses GPS and atomic clocks to bound clock uncertainty, enabling external consistency — the gold standard for distributed transaction ordering.
  • Multi-Paxos replication with automatic split management provides transparent horizontal scaling; no manual sharding or resharding is ever required.
  • Interleaved tables are Spanner's primary performance optimization: co-locate child rows with parents to eliminate cross-shard joins in hierarchical access patterns.
  • Stale reads serve any replica without locking — use bounded staleness for dashboards, analytics, and read-heavy paths to cut latency and cost.
  • The minimum viable Spanner instance costs roughly $648/month (regional, 1 node) before storage; multi-region triples that figure. Budget accordingly before committing to the platform.
  • CockroachDB is the most viable open-source alternative, offering similar distributed SQL semantics with PostgreSQL wire compatibility and no vendor lock-in.
  • Avoid sequential integer primary keys — they hotspot writes onto a single split leader. Use UUIDs, hash prefixes, or bit-reversed sequences instead.

Evaluating Your Database Architecture?

Choosing between Spanner, CockroachDB, Aurora Global, and a well-tuned single-region PostgreSQL cluster is one of the highest-leverage architectural decisions a team makes. The wrong choice means either paying for global distribution you don't need or hitting scalability walls at the worst possible time.

JusDB tracks performance benchmarks, pricing changes, and architectural patterns across every major managed database — relational, distributed, and beyond. Browse the full database comparison library to find the right fit for your workload, or use the side-by-side comparison tool to pit Spanner directly against CockroachDB, AlloyDB, or Aurora on the dimensions that matter to your team.

Share this article