Zero-Conflict Config at Scale: A Cellular Multi-Raft Approach

This post explores the design of a distributed configuration system for multi-DC infrastructure — the kind of system that underpins GPU clouds, large-scale Kubernetes deployments, and any platform where thousands of nodes need consistent, safe, and fast access to configuration state.


The Problem

Any platform operating across multiple data centers faces the same configuration challenge. Every server in every DC needs to know its operating rules: tenant resource quotas, network policies, feature flags, scheduling constraints, storage mount configurations. This is configuration state — not user data, not telemetry, but the rules that tell infrastructure how to behave.

The characteristics of this workload shape every design decision: reads are constant and latency-sensitive, writes are rare but high-stakes, and every node must keep working even when the control plane doesn't. The system we need: fast local reads, safe linearizable writes, regional survivability, and instant rollback when things go wrong.


Cellular Multi-Raft Architecture

The foundational principle is no fate sharing. Each data center is an independent failure domain. A crash in one tenant's config service should never affect another tenant, and should never propagate to other DCs.

This calls for a Cellular Multi-Raft design. Each DC runs multiple independent Raft groups — one per tenant, plus a dedicated group for platform-wide config. Each Raft group is a small cluster (3 or 5 nodes), handling consensus for a narrow slice of the key space.

┌─── us-east-1 ───────────────────────┐     ┌─── eu-west-1 ───────────────────────┐
│                                     │     │                                     │
│  Raft Group: tenant-acme (3 nodes)  │     │  Raft Group: tenant-acme (3 nodes)  │
│  Raft Group: tenant-globex          │     │  Raft Group: tenant-globex          │
│                                     │     │                                     │
│  Read Replica Layer (every node)    │     │  Read Replica Layer                 │
│  Config API Gateway                 │     │  Config API Gateway                 │
└──────────────┬──────────────────────┘     └──────────────┬──────────────────────┘
               │          Async Replication Bus            │
               └──────────────────┬────────────────────────┘
                                  │
                    ┌─────────────┴───────────────┐
                    │  Global Consensus Group     │
                    │  (spans all DCs — manages   │
                    │   Home-DC ownership map     │
                    │   and failover)             │
                    └─────────────────────────────┘

Why Multi-Raft, not one monolithic Raft? Three reasons. First, blast radius — a bug or crash in acme's Raft group can't touch globex. Second, scale — each group manages hundreds of keys, keeping log replay fast and snapshotting cheap. Third, ownership — one Raft group per tenant maps cleanly to permissions boundaries and operational isolation.

The tenant-level Raft groups in each DC are local — all nodes in the same data center, sub-millisecond latency, no WAN dependency. There is exactly one exception: the Global Consensus Group, which spans all DCs with one node per DC. This group manages the ownership map (which DC is authoritative for which keys) and coordinates failover. It's the only component that pays the cross-DC latency tax, and it does so for a very small number of writes — typically fewer than 10 per day.


Data Model: Hierarchical Key Space

Config state is naturally hierarchical. Tenants contain services, services contain components, components have settings. The key space reflects this:

/platform/feature/gradual_rollout     → false       (global default)
/tenant/acme/feature/gradual_rollout  → true        (tenant override)
/dc/us-east-1/network/mtu            → 1500        (DC-specific)

The alternative — flat key-value with naming conventions like acme.network.mtu — breaks down in three specific ways. Bulk operations depend on everyone following the naming convention perfectly. Permissions require regex parsing of key names. And inheritance (platform defaults that tenants can override) must be implemented in every client rather than being built into the data model.

With hierarchical keys, "show me all config for tenant acme" is a subtree read. Permissions are structural: a tenant admin owns /tenant/acme/*. And inheritance uses a resolution chain — the system walks from most-specific to least-specific and returns the first match:

/dc/us-east-1/tenant/acme/network/mtu  → not set
/tenant/acme/network/mtu               → not set
/dc/us-east-1/network/mtu              → 1500 ✓ found
/platform/network/mtu                  → 9000  (not reached; resolution stops at the first match)
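The resolution walk is only a few lines. A minimal sketch, assuming the store is a flat dict and the candidate chain is supplied most-specific first (the function name and chain construction are illustrative, not part of the system's API):

```python
def resolve(store, chain):
    """Walk candidate keys from most-specific to least-specific;
    return the first (key, value) found, or (None, None)."""
    for key in chain:
        if key in store:
            return key, store[key]
    return None, None

# Mirror of the example above: only the DC-level key is set.
store = {
    "/dc/us-east-1/network/mtu": 1500,
    "/platform/network/mtu": 9000,
}
chain = [
    "/dc/us-east-1/tenant/acme/network/mtu",  # not set
    "/tenant/acme/network/mtu",               # not set
    "/dc/us-east-1/network/mtu",              # found here: walk stops
    "/platform/network/mtu",                  # fallback, never reached
]
key, value = resolve(store, chain)  # → ("/dc/us-east-1/network/mtu", 1500)
```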

Physical Storage

Under the hood, the hierarchy is stored in a sorted key-value engine (RocksDB, backed by an LSM-tree). The tree structure is a logical abstraction — physical storage is flat, with keys sorted lexicographically. This means prefix scans are range scans: O(log N) to seek to the start of the range, then sequential reads from there. Fast and cache-friendly.
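The prefix-scan-as-range-scan idea can be sketched over a sorted Python list, with `bisect` standing in for the LSM-tree's seek (the key set is illustrative):

```python
import bisect

def range_scan(sorted_keys, prefix):
    """Prefix scan over lexicographically sorted keys: an O(log N)
    binary-search seek, then sequential reads until the prefix ends."""
    start = bisect.bisect_left(sorted_keys, prefix)
    result = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break                  # sorted order: once past the prefix, done
        result.append(key)
    return result

keys = sorted([
    "/platform/feature/gradual_rollout",
    "/tenant/acme/feature/gradual_rollout",
    "/tenant/acme/network/mtu",
    "/tenant/globex/network/mtu",
])
hits = range_scan(keys, "/tenant/acme/")
# → all of tenant acme's keys, none of globex's or the platform's
```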

Each value is a protobuf message:

message ConfigEntry {
  bytes         value          = 1;  // the actual config, encoded
  uint64        version        = 2;  // monotonically increasing per key
  Hlc           hlc            = 3;  // (physical, logical, node_id)
  uint32        home_epoch     = 4;  // increments on Home-DC failover
  string        updated_by     = 5;  // who made this change
  RolloutPolicy rollout_policy = 6;  // canary configuration
}

Protobuf over JSON is a deliberate choice. Binary encoding is smaller on the wire and faster to parse. Strict typing catches schema violations at write time. And protobuf's backward/forward compatibility guarantees mean config schemas can evolve without coordinated rollouts across every reader.

API Surface

Put(key, value, expected_rev)     // Compare-And-Swap
Get(key)                          // single key read
Range(prefix)                     // subtree scan
Txn(checks[], ops[])              // mini-transaction across multiple keys
Watch(prefix)                     // gRPC stream for real-time push

Compare-And-Swap is the foundation. A blind Put allows two operators to silently clobber each other's changes. CAS enforces explicit coordination: "set resource quota to 8, but only if it's still at 4." If someone else changed it to 12, the CAS fails and the operator gets a clear error.
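The CAS contract can be sketched with a toy in-memory store (the class and exception names are illustrative stand-ins for the real service):

```python
class ConflictError(Exception):
    pass

class ConfigStore:
    """Toy in-memory stand-in for the Put/Get API."""
    def __init__(self):
        self._data = {}  # key -> (value, revision)

    def get(self, key):
        return self._data.get(key, (None, 0))

    def put(self, key, value, expected_rev):
        _, rev = self.get(key)
        if rev != expected_rev:
            raise ConflictError(f"revision is {rev}, expected {expected_rev}")
        self._data[key] = (value, rev + 1)
        return rev + 1

store = ConfigStore()
store.put("/tenant/acme/resource/quota", 4, expected_rev=0)

# Operator A reads revision 1, then operator B sneaks in a write...
_, rev_seen_by_a = store.get("/tenant/acme/resource/quota")
store.put("/tenant/acme/resource/quota", 12, expected_rev=rev_seen_by_a)

# ...so A's CAS fails loudly instead of silently clobbering B's change.
try:
    store.put("/tenant/acme/resource/quota", 8, expected_rev=rev_seen_by_a)
    outcome = "clobbered"
except ConflictError:
    outcome = "conflict detected"
```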

Transactions matter because config changes often span multiple keys. "Enable the new feature AND increase the resource limit" must be atomic — a node seeing one without the other is a misconfiguration waiting to become an incident.

Watch is how infrastructure nodes get real-time config updates. Instead of polling on a fixed interval, a node opens a gRPC stream on a key prefix and receives change events as they're committed. This reduces both latency (changes propagate in milliseconds instead of polling intervals) and load (no redundant polls when nothing has changed).


Home-DC Pinning: Eliminating Conflicts by Construction

This is the core consistency model, and it's deliberately simple: every key has exactly one authoritative data center (the "Home DC"). All writes for that key go to the Home DC. Other DCs receive read-only replicas via async replication.

Every key → ONE Home-DC → writes go to local Raft leader → linearizable
Other DCs → async mirror → local reads at <1ms
Conflicts: impossible during normal operation (single writer per key)

The alternative — allowing writes to any DC and resolving conflicts after the fact with CRDTs or semantic merge — adds significant complexity. Config state has a natural ownership model: a tenant's config is managed by that tenant's team, typically from one primary DC. When the problem has a natural single-writer structure, adding multi-writer machinery is complexity without payoff.

This is the same model used by Spanner (single leaseholder per split) and CockroachDB (single leaseholder per range). It works because the consistency guarantee — linearizable writes, no conflicts — is more valuable for config state than the write-availability benefit of multi-DC writes.

The explicit tradeoff: an operator in eu-west-1 writing to a key homed in us-east-1 pays ~150ms cross-DC round-trip latency. This is acceptable because config writes are infrequent. Reads — the hot path — are always served locally from the async mirror.

Cross-DC Replication

When a write commits at the Home DC, the Raft leader publishes the change event to an async replication bus. Mirror DCs consume this stream and apply changes to their local Raft groups.

The replication bus guarantees causal ordering per source DC. If us-east-1 commits event 46 (create tenant "acme") and then event 47 (set acme's resource quota), every mirror DC is guaranteed to see 46 before 47. Without this guarantee, a mirror could temporarily have config for a tenant that doesn't exist yet locally — a dangling reference that causes errors in every node that tries to resolve it.

Each change event carries a causal dependency vector. The receiving DC checks whether it has applied all prior events from the source. If yes, it applies the new event. If not, it buffers and waits. This is stronger than eventual consistency (which permits arbitrary reordering) but cheaper than global consensus (which requires cross-DC round trips on every write).
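The apply-or-buffer decision can be sketched with per-source sequence numbers standing in for the full dependency vector (a simplification; class and field names are illustrative):

```python
from collections import defaultdict

class MirrorApplier:
    """Apply replication events from each source DC in sequence order,
    buffering any event that arrives before its predecessors."""
    def __init__(self):
        self.applied_seq = defaultdict(int)  # source DC -> last seq applied
        self.buffer = defaultdict(dict)      # source DC -> {seq: event}
        self.log = []

    def receive(self, source, seq, event):
        self.buffer[source][seq] = event
        # Drain everything that is now contiguous with what we've applied.
        while self.applied_seq[source] + 1 in self.buffer[source]:
            nxt = self.applied_seq[source] + 1
            self.log.append(self.buffer[source].pop(nxt))
            self.applied_seq[source] = nxt

m = MirrorApplier()
m.applied_seq["us-east-1"] = 45                   # last event already applied
m.receive("us-east-1", 47, "set acme quota")      # arrives early: buffered
assert m.log == []                                # 46 is still missing
m.receive("us-east-1", 46, "create tenant acme")  # gap fills, both drain
# m.log is now ["create tenant acme", "set acme quota"]: no dangling reference
```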


Hybrid Logical Clocks

Events in a distributed system need a total order. Physical clocks drift between machines — two servers can disagree on "now" by milliseconds. Pure logical clocks (Lamport) fix ordering but lose all connection to wall time, making operational debugging nearly impossible.

Hybrid Logical Clocks combine both:

HLC = (physical_time, logical_counter, node_id)

On local event:   physical = max(local_hlc.physical, wall_clock())
                  logical  = if physical advanced → 0, else → logical + 1

On receive:       physical = max(local.physical, remote.physical, wall_clock())
                  logical  = derived from which physical won

Comparison:       physical first → logical second → node_id as final tiebreaker

The physical component stays close to wall-clock time, so operators can reason about "when did this change happen" in human terms. The logical counter handles the case where two events occur within the same clock tick. The node ID is a tiebreaker that guarantees a total order even when physical time and logical counter are identical across two nodes. It can be any globally unique, consistently ordered identifier.

HLC respects causality (if event A happened before B, A's HLC is strictly less than B's), stays close to real time (unlike Lamport clocks), and requires no specialized hardware (unlike Google's TrueTime, which depends on GPS receivers and atomic clocks).
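The update rules above fit in a small class. A sketch with an injectable `wall_clock` so the demo is deterministic (the class shape is illustrative):

```python
import time

class HLC:
    """Hybrid Logical Clock: (physical, logical, node_id)."""
    def __init__(self, node_id, wall_clock=lambda: int(time.time() * 1000)):
        self.node_id = node_id
        self.wall = wall_clock
        self.physical = 0
        self.logical = 0

    def now(self):
        """Tick for a local event."""
        wall = self.wall()
        if wall > self.physical:
            self.physical, self.logical = wall, 0
        else:
            self.logical += 1
        return (self.physical, self.logical, self.node_id)

    def update(self, remote):
        """Merge a remote timestamp on message receipt."""
        r_phys, r_log, _ = remote
        new_phys = max(self.physical, r_phys, self.wall())
        if new_phys == self.physical == r_phys:
            self.logical = max(self.logical, r_log) + 1
        elif new_phys == self.physical:
            self.logical += 1
        elif new_phys == r_phys:
            self.logical = r_log + 1
        else:
            self.logical = 0       # fresh wall-clock reading won outright
        self.physical = new_phys
        return (self.physical, self.logical, self.node_id)

# Frozen wall clock forces the logical counter to do the ordering work.
clk = HLC(node_id="n1", wall_clock=lambda: 1000)
a = clk.now()                    # (1000, 0, "n1")
b = clk.now()                    # (1000, 1, "n1"): same tick, counter bumps
c = clk.update((1000, 7, "n2"))  # (1000, 8, "n1"): causality preserved
assert a < b < c                 # tuple comparison yields the total order
```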


Home-DC Failover

Normal operation produces zero conflicts. The single-writer model guarantees it. But there is one edge case: DC failover.

When a Home DC goes down, ownership must transfer to another DC. The process:

t=0s     us-east-1 becomes unreachable
t=0s     Reads continue from mirrors everywhere (no impact)
t=5s     Global consensus group detects the failure
t=10s    Old Raft leader's lease expires (fencing guarantee)
t=11s    Global consensus assigns eu-west-1 as new Home, epoch increments
t=11s    eu-west-1 begins accepting writes

Lease-based fencing prevents split-brain. The old leader holds a time-bounded lease. Even if it comes back online after a network blip, it cannot accept writes once the lease has expired. It must re-join the cluster and re-sync state from the new Home.

The mirror Raft groups in non-Home DCs exist primarily for this moment. A pure cache (in-memory, no durability, no consensus) could serve reads during normal operation at lower cost. But during failover, promoting a cache to accept writes requires bootstrapping a new Raft group from scratch — a process that takes minutes. The mirror Raft group is pre-built infrastructure: durable data on disk, three-node consensus already running, ready to accept writes in seconds. It's an insurance premium paid during normal operation for fast failover when it counts.

The Zombie Leader Edge Case

There is a narrow window — the lease-expiry period — where conflicts can theoretically occur. If the old leader accepted a write just before the network blip (while its lease was still valid), that write may not have reached the replication bus before the failover. When the old DC reconnects, it delivers this "zombie write" to a system that has already moved on.

The resolution is simple: every write carries a home_epoch number, which increments each time ownership transfers. Higher epoch always wins.

zombie write:  epoch=5, hlc=(9000, 0)    ← from old home
new write:     epoch=6, hlc=(11000, 0)   ← from new home

epoch 6 > epoch 5 → new write wins. Zombie write → audit log.

The zombie write is never silently lost — it's preserved in the audit trail. But it does not affect the authoritative state. This keeps the entire conflict resolution logic confined to a single, auditable comparison function.
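That single comparison function can be sketched as an (epoch, HLC) tuple comparison, with the loser routed to the audit log (function and field names are illustrative):

```python
def resolve_write(current, incoming, audit_log):
    """Higher home_epoch wins; HLC breaks ties within an epoch.
    The losing write is preserved in the audit log, never dropped."""
    cur_key = (current["epoch"], current["hlc"])
    inc_key = (incoming["epoch"], incoming["hlc"])
    if inc_key > cur_key:
        audit_log.append(current)
        return incoming
    audit_log.append(incoming)
    return current

audit = []
zombie = {"epoch": 5, "hlc": (9000, 0), "value": "stale"}
fresh  = {"epoch": 6, "hlc": (11000, 0), "value": "current"}

winner = resolve_write(fresh, zombie, audit)
# winner carries "current"; the zombie's "stale" value sits in the audit log
```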


The Read Path

This is the hot path — thousands of requests per second per DC. It must be fast, local, and resilient.

Every infrastructure node runs a config agent as a sidecar process. The agent maintains an in-memory cache of relevant config keys, populated from the local Raft follower via a background refresh loop.

Infrastructure Node
┌──────────────────────────────────────────────┐
│  Config Agent                                │
│                                              │
│  In-memory cache:                            │
│    /tenant/acme/resource/max = 8  (v47)      │
│    last_refresh: 2s ago                      │
│    staleness_bound: 5s                       │
│                                              │
│  Application reads → from cache              │
│  Latency: <1 microsecond, zero network calls │
└──────────────────┬───────────────────────────┘
                   │ background poll every 2-5s
                   │ "give me changes since version 47"
                   ▼
         Local Raft Follower (same DC)
         → returns incremental deltas
         → latency: <1ms

When the local Raft follower becomes unreachable, the agent makes a graded decision. If cached config is within the staleness bound (default: 5 seconds), it continues serving with a stale: true flag — the application decides whether stale config is acceptable for its use case. If staleness exceeds the bound, the agent falls back to compiled-in safe defaults. Under no circumstances does the agent reach across DC boundaries for a read. If the local Raft follower is down, the DC likely has a broader infrastructure issue; adding cross-DC network calls would compound the problem.
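The graded decision can be sketched as a small function for the follower-down path (the `SAFE_DEFAULTS` table and function name are illustrative):

```python
SAFE_DEFAULTS = {"/tenant/acme/resource/max": 4}  # compiled-in fallback values

def read_when_follower_down(cache, key, now, staleness_bound=5.0):
    """Graded read when the local Raft follower is unreachable:
    serve stale within the bound, else fall back to safe defaults.
    Never reaches across DC boundaries."""
    entry = cache.get(key)
    if entry is not None:
        value, last_refresh = entry
        if now - last_refresh <= staleness_bound:
            return value, "stale"        # within bound: serve, flagged stale
    return SAFE_DEFAULTS.get(key), "safe_default"  # beyond bound

cache = {"/tenant/acme/resource/max": (8, 100.0)}  # value, last refresh time
within = read_when_follower_down(cache, "/tenant/acme/resource/max", now=102.0)
beyond = read_when_follower_down(cache, "/tenant/acme/resource/max", now=110.0)
# within → (8, "stale"); beyond → (4, "safe_default")
```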

Watch Notifications with a Radix Tree

The Watch API pushes changes to subscribed clients in real time. The implementation challenge: a client watching the prefix /tenant/acme/ must be notified when /tenant/acme/resource/max changes. A hash map can only match exact keys — it cannot find prefix watchers.

A radix tree (also called a compressed trie) solves this in O(K) time, where K is the length of the changed key. When a write commits, the system traverses the key path through the radix tree, collecting all registered watchers along the way:

             (root)
            /      \
      tenant/      platform/
        |
      acme/  ← watcher registered here → NOTIFIED
        |
      resource/
        |
      max  ← this key changed

Every node along the path from root to the changed key is checked for watchers. This is a single traversal with no backtracking — efficient enough to run on every commit without measurable overhead.
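The traversal can be sketched with a plain path trie, a simplified stand-in for the compressed radix tree (same O(K) walk, just without edge compression; names are illustrative):

```python
class WatchTree:
    """Prefix-watch registry as a path trie: one root-to-key traversal
    collects every watcher registered along the path."""
    def __init__(self):
        self.root = {"watchers": [], "children": {}}

    def register(self, prefix, watcher):
        node = self.root
        for seg in prefix.strip("/").split("/"):
            node = node["children"].setdefault(
                seg, {"watchers": [], "children": {}})
        node["watchers"].append(watcher)

    def notify(self, changed_key):
        """O(K) in the key length: walk down, no backtracking."""
        hits, node = list(self.root["watchers"]), self.root
        for seg in changed_key.strip("/").split("/"):
            node = node["children"].get(seg)
            if node is None:
                break                 # no deeper watchers registered
            hits.extend(node["watchers"])
        return hits

tree = WatchTree()
tree.register("/tenant/acme/", "acme-agent")
tree.register("/platform/", "platform-agent")
hits = tree.notify("/tenant/acme/resource/max")  # → ["acme-agent"]
```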


Operational Safety: Config Changes as Deployments

A config change to production infrastructure is operationally equivalent to a code deployment. A bad value can misallocate resources, break network connectivity, or disable security controls. The system treats it accordingly.

Canary Rollouts

Every config mutation can carry a rollout policy. Instead of applying the new value to all nodes simultaneously, the system rolls it out to a small percentage first:

PUT /tenant/acme/resource/max_allocation
Body: { value: 16, rollout: { strategy: "canary", percent: 5, duration: 300 } }

Each node's config agent determines its canary cohort membership deterministically: hash(node_id) % 100 < canary_percent. Nodes in the cohort apply the new value; others stay on the previous version. The system monitors error rates and latency for the configured duration. If metrics remain healthy, the change auto-promotes to 100%. If metrics degrade, it auto-rolls back.
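The cohort check is one line. A sketch that uses SHA-256 rather than a language-level `hash()` (Python's built-in hash is salted per process, so a stable hash is needed for the decision to survive restarts; the function name is illustrative):

```python
import hashlib

def in_canary_cohort(node_id, canary_percent):
    """Deterministic cohort membership: the same node always lands in
    the same bucket, so repeated evaluations never flip-flop."""
    bucket = int(hashlib.sha256(node_id.encode()).hexdigest(), 16) % 100
    return bucket < canary_percent

nodes = [f"node-{i}" for i in range(1000)]
cohort = [n for n in nodes if in_canary_cohort(n, canary_percent=5)]
# roughly 5% of nodes land in the cohort, and membership is stable
```

A useful property of the bucket scheme: raising `canary_percent` only ever adds nodes to the cohort, so promotion from 5% to 100% never flips a node back to the old value.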

This transforms config changes from a binary "push and pray" into a controlled, observable, reversible deployment.

Instant Rollback

Every config mutation creates an immutable entry in a version store. Rollback is not a special operation — it's a normal write that sets the active version pointer to a previous entry. This means rollback goes through the same Raft consensus, replication, and audit trail as any other write. Target: <1 second within the originating DC, <5 seconds globally.

Observability

The system exposes four key metrics. Config version per node detects nodes stuck on old config — a common operational issue when an agent crashes or loses its connection to the Raft follower. Propagation latency (time from Home-DC commit to last mirror seeing the change) gives a direct measure of replication health. HLC drift between DCs serves as a replication lag indicator, alerting operators when a mirror DC is falling dangerously behind. Change velocity catches runaway automation — a sudden spike in config writes per second usually indicates a feedback loop in an automation system.


Performance Considerations

Request hedging addresses tail latency on the read path. If a read from a local Raft follower takes longer than 10ms (well above the expected <1ms), the agent fires a parallel request to the Raft leader. The first response wins. This clips p99 latency without adding meaningful load — the hedge only fires for slow requests.
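A sketch of the hedging pattern with threads, using simulated reads with exaggerated timings (the helper names and delays are illustrative):

```python
import concurrent.futures
import time

def hedged_read(primary, hedge, hedge_after=0.010):
    """Fire the primary read; if it hasn't answered within hedge_after
    seconds, fire a parallel request and take whichever finishes first."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        futures = {pool.submit(primary)}
        done, _ = concurrent.futures.wait(futures, timeout=hedge_after)
        if not done:
            futures.add(pool.submit(hedge))  # primary is slow: hedge fires
            done, _ = concurrent.futures.wait(
                futures, return_when=concurrent.futures.FIRST_COMPLETED)
        return next(iter(done)).result()

# Simulated 50ms follower stall vs. a 5ms leader read.
slow_follower = lambda: (time.sleep(0.050), "follower")[1]
fast_leader   = lambda: (time.sleep(0.005), "leader")[1]
result = hedged_read(slow_follower, fast_leader)  # the hedge wins
```

Note the asymmetry: when the primary answers within the threshold, the hedge never fires, which is why the pattern adds almost no steady-state load.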

Thundering herds are a risk with the Watch API. A single config change can trigger notifications to thousands of subscribed agents. Without mitigation, all agents process the notification simultaneously, potentially overwhelming downstream services. The defense is two-layered: server-side batching (coalesce changes within a 50ms window into a single notification) and backpressure (limit concurrent Watch pushes, queue the rest with jitter).

Log compaction prevents unbounded Raft WAL growth. Once the LSM-tree flushes committed entries to SSTables on disk, old WAL entries can be replaced with a point-in-time snapshot. Raft followers that fall too far behind receive the snapshot directly instead of replaying the entire log — a critical optimization for recovering nodes or adding new replicas.


Security Model

Configuration state controls resource allocation, network policies, and access rules. Compromising the config database is equivalent to compromising the entire infrastructure.

Transport security uses mTLS between all components. Every node holds a certificate issued by an internal certificate authority. The config API gateway validates both the client certificate and the RBAC token before accepting any request.

Access control follows the hierarchical key structure. Tenant admins have write access to /tenant/{name}/* and nothing else. Platform admins have write access to /platform/*. Infrastructure nodes have read-only access scoped to their tenant's config and platform defaults. The permissions model is structural, not regex-based — a direct benefit of the hierarchical key design.

Audit logging records every mutation: identity, timestamp, source IP, the key changed, the old value, and the new value. The audit log is immutable and append-only, shipped to a separate storage system under independent security monitoring.

Encryption at rest uses envelope encryption with per-tenant data keys. The master key lives in an HSM. Key rotation does not require re-encrypting existing config entries — new entries use the new key, and old entries are re-encrypted lazily during compaction.


Summary

This design is built on three core ideas.

Conflicts eliminated by construction. Home-DC pinning ensures every key has exactly one authoritative writer. The same model used by Spanner and CockroachDB, chosen because config state has a natural ownership structure that makes multi-writer complexity unjustified.

Reads are always local, always fast. Async replication keeps every DC's mirror current. Nodes read from an in-memory cache populated by a local Raft follower. No read ever crosses a DC boundary.

Config changes are deployments. Canary rollouts, instant rollback, and comprehensive observability treat every config mutation with the same rigor as a code push. The operational safety mechanisms exist because in large-scale infrastructure, the cost of a bad config change is high and immediate.

The architecture can start simple — single-DC, etcd-backed — and evolve toward purpose-built infrastructure as scale demands. No big-bang migration, no premature optimization — just a well-defined interface that allows the implementation to change underneath while the contract remains stable.