Kubernetes: Notes from Building a Multi-Cluster Orchestrator
A problem I'd been reading about: how do platforms place stateless workloads across many clusters, and migrate them when clusters fail?
Code: multi-cluster-orchestrator · DESIGN.md
The Gap
Kubernetes stops at the cluster boundary. A Deployment lives in one cluster. There's no native way to say "place this across these regions, with these constraints, and survive any single cluster's failure."
KubeFed and Argo CD ApplicationSets propagate copies to whichever clusters you list or select; they don't decide placement for you. Cluster API manages cluster lifecycle, not the workloads running on those clusters. Karmada gets closer but is heavyweight for an exploration.
The thesis: a CRD declares intent; a controller decides which cluster runs what, applies it, and migrates on failure.
Architecture
┌─────────────────────────────┐
│        mgmt cluster         │
│  CRDs, controller, secrets  │
└──────────────┬──────────────┘
               │ K8s client per region
         ┌─────┴─────┐
         ▼           ▼
     region-a    region-b
Three clusters: one management, two targets. Management holds the platform state and runs the controller. Targets are platform-naive — they just hold Deployments.
A platform's control plane should be a separate failure domain from the workloads it manages. Karmada, Open Cluster Management (OCM), and Cluster API all separate management from targets for the same reason.
Placement
For each workload, the engine does three things: filter feasible clusters, score them, distribute replicas. It is the same filter-then-score model kube-scheduler applies to nodes, moved one level up. Hierarchical scheduling, where a meta-scheduler picks the cluster and the local scheduler picks the node, is a long-standing pattern in large schedulers (Mesos's two-level scheduling is the textbook example) and in modern multi-cluster systems.
Three choices worth naming:
The engine is a pure function. Takes a workload and clusters, returns a plan. No I/O. Trivially testable. The plan is data — you can log it, replay it, simulate it.
Scoring is a weighted sum. Region preference (weight 10), capacity (weight 1), headroom as a separate large penalty. Adding a new component is one function plus one line. Composes cleanly.
Headroom is unconditional. The penalty for eating into a cluster's spare capacity is applied on its own, not folded into the capacity score. Drift safety should be structural, not strategic: fold it into one scoring strategy and it stops applying to the others.
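A minimal sketch of the filter, score, distribute pipeline described above, written as a pure function. The weights 10 and 1 come from the text; the `Cluster` and `Workload` types, the 20% headroom floor, the penalty of 100, and the greedy fill in score order are all assumptions made for illustration, not the repo's actual code.

```go
package main

import (
	"fmt"
	"sort"
)

// Cluster and Workload are hypothetical, simplified stand-ins for the CRD types.
type Cluster struct {
	Name     string
	Region   string
	Healthy  bool
	Capacity int // total replica slots
	Used     int // slots already taken
}

type Workload struct {
	Name            string
	Replicas        int
	PreferredRegion string
}

const (
	regionWeight    = 10.0  // from the post
	capacityWeight  = 1.0   // from the post
	headroomFloor   = 0.2   // assumption: keep 20% of each cluster free
	headroomPenalty = 100.0 // assumption: what "large" means here
)

func freeSlots(c Cluster) int { return c.Capacity - c.Used }

// feasible keeps healthy clusters that still have at least one free slot.
func feasible(clusters []Cluster) []Cluster {
	out := make([]Cluster, 0, len(clusters))
	for _, c := range clusters {
		if c.Healthy && freeSlots(c) > 0 {
			out = append(out, c)
		}
	}
	return out
}

// score is a weighted sum; the headroom penalty is applied on its own,
// outside the capacity term.
func score(w Workload, c Cluster) float64 {
	freeRatio := float64(freeSlots(c)) / float64(c.Capacity)
	s := capacityWeight * freeRatio
	if c.Region == w.PreferredRegion {
		s += regionWeight
	}
	if freeRatio < headroomFloor {
		s -= headroomPenalty
	}
	return s
}

// Place is pure: same inputs, same plan, no I/O.
// The plan maps cluster name to replica count.
func Place(w Workload, clusters []Cluster) (map[string]int, error) {
	ranked := feasible(clusters)
	sort.SliceStable(ranked, func(i, j int) bool {
		si, sj := score(w, ranked[i]), score(w, ranked[j])
		if si != sj {
			return si > sj
		}
		return ranked[i].Name < ranked[j].Name // deterministic tie-break
	})
	plan := map[string]int{}
	remaining := w.Replicas
	for _, c := range ranked {
		if remaining == 0 {
			break
		}
		n := freeSlots(c)
		if n > remaining {
			n = remaining
		}
		plan[c.Name] = n
		remaining -= n
	}
	if remaining > 0 {
		return nil, fmt.Errorf("workload %s: %d replicas could not be placed", w.Name, remaining)
	}
	return plan, nil
}

func main() {
	clusters := []Cluster{
		{Name: "region-a", Region: "a", Healthy: true, Capacity: 10, Used: 2},
		{Name: "region-b", Region: "b", Healthy: true, Capacity: 10, Used: 0},
	}
	plan, err := Place(Workload{Name: "web", Replicas: 12, PreferredRegion: "a"}, clusters)
	fmt.Println(plan, err) // map[region-a:8 region-b:4] <nil>
}
```

Because the plan is plain data, a failover test needs no cluster at all: flip `Healthy` to false on one input and compare the two plans.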
Failover
Two reconcilers. One probes each cluster every 15 seconds. The other watches workloads and cluster registrations: a health flip enqueues every workload, the engine excludes the dead cluster, and the reconciler migrates the replicas elsewhere.
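The "health flip enqueues every workload" step can be sketched without any Kubernetes dependencies. `HealthTracker`, `OnProbe`, and the `enqueue` callback are hypothetical names. With controller-runtime this would more likely be wired as a watch with a mapping function such as `handler.EnqueueRequestsFromMapFunc`, but the post doesn't say which mechanism the repo uses.

```go
package main

import "fmt"

// HealthTracker remembers the last probe result per cluster (hypothetical type).
type HealthTracker struct {
	last map[string]bool
}

func NewHealthTracker() *HealthTracker {
	return &HealthTracker{last: map[string]bool{}}
}

// Observe records a probe result and reports whether the cluster's health
// changed since the previous probe. The first probe only sets the baseline.
func (t *HealthTracker) Observe(cluster string, healthy bool) bool {
	prev, seen := t.last[cluster]
	t.last[cluster] = healthy
	return seen && prev != healthy
}

// OnProbe is what the health reconciler would call after each probe.
// On a flip it enqueues every workload so the placement engine re-runs.
func OnProbe(t *HealthTracker, cluster string, healthy bool, workloads []string, enqueue func(string)) {
	if !t.Observe(cluster, healthy) {
		return
	}
	for _, w := range workloads {
		enqueue(w)
	}
}

func main() {
	tracker := NewHealthTracker()
	workloads := []string{"web", "api"}
	var queue []string
	enqueue := func(name string) { queue = append(queue, name) }

	OnProbe(tracker, "region-a", true, workloads, enqueue)  // baseline, nothing queued
	OnProbe(tracker, "region-a", true, workloads, enqueue)  // steady state, nothing queued
	OnProbe(tracker, "region-a", false, workloads, enqueue) // flip: queue everything
	fmt.Println(queue)                                      // [web api]
}
```

Enqueueing only on transitions keeps the steady-state cost at one probe per cluster per interval; a recovery is also a flip, so workloads get re-evaluated when a cluster comes back.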
One tradeoff: when a cluster goes unreachable, you can't drain its old Deployments. Blocking until it returns means failover never completes. Log-and-continue means orphans in the dead cluster, reclaimed on recovery.
I chose log-and-continue. Workloads stuck terminating against a permanently dead cluster are operationally worse than orphaned Deployments sitting inside it.
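A sketch of what log-and-continue might look like in the cleanup step. `CleanupStale`, `DeleteFunc`, and the returned orphan list are hypothetical; the assumption is that the reconciler records orphans somewhere (for example in workload status) so they can be reclaimed once the cluster returns.

```go
package main

import (
	"errors"
	"log"
)

// DeleteFunc removes a workload's Deployment from one cluster (hypothetical signature).
type DeleteFunc func(cluster, workload string) error

// CleanupStale deletes the workload from clusters that held it before but are
// not in the new plan. A failed delete is logged and recorded as an orphan
// instead of blocking the migration.
func CleanupStale(workload string, previous []string, plan map[string]int, del DeleteFunc) []string {
	var orphans []string
	for _, cluster := range previous {
		if _, keep := plan[cluster]; keep {
			continue
		}
		if err := del(cluster, workload); err != nil {
			log.Printf("orphaning %s in %s: %v", workload, cluster, err)
			orphans = append(orphans, cluster)
		}
	}
	return orphans
}

// errDial stands in for a network failure against a dead cluster.
var errDial = errors.New("dial tcp: connection refused")

// deadCluster returns a DeleteFunc that fails only for the named cluster.
func deadCluster(name string) DeleteFunc {
	return func(cluster, _ string) error {
		if cluster == name {
			return errDial
		}
		return nil
	}
}

func main() {
	orphans := CleanupStale("web",
		[]string{"region-a", "region-b"},
		map[string]int{"region-b": 3},
		deadCluster("region-a"))
	log.Println("reclaim on recovery:", orphans)
}
```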
The Bug
The first time I tested failover by corrupting a kubeconfig secret, migration worked but the reconcile loop kept failing. The cached client had outlived the cluster behind it: ClientFor returned cleanly, then Delete failed at the network layer.
The fix: distinguish "couldn't build a client" from "built a client but the cluster is unreachable." Both want the same response, but they arrive through different paths.
Abstractions over networked resources need to handle errors at every layer they could occur, not just at the API surface. Same pattern as database connection pools — a pooled connection doesn't mean the database is reachable.
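One way to express the two failure paths, assuming sentinel errors and Go's standard error wrapping. `ErrClientBuild`, `ErrUnreachable`, `DeleteFrom`, and the `Client` interface are illustrative names; only `ClientFor` and `Delete` appear in the post, and their real signatures are not shown there.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

var (
	ErrClientBuild = errors.New("could not build client")
	ErrUnreachable = errors.New("cluster unreachable")
	errForbidden   = errors.New("forbidden") // demo-only: a rejection from a live cluster
)

// Client and ClientFactory are hypothetical; ClientFactory plays the role of ClientFor.
type Client interface {
	Delete(workload string) error
}

type ClientFactory func(cluster string) (Client, error)

// DeleteFrom tags each failure with the layer it came from.
func DeleteFrom(cluster, workload string, clientFor ClientFactory) error {
	c, err := clientFor(cluster)
	if err != nil {
		return fmt.Errorf("%w (%s): %v", ErrClientBuild, cluster, err)
	}
	if err := c.Delete(workload); err != nil {
		var netErr net.Error
		if errors.As(err, &netErr) {
			return fmt.Errorf("%w (%s): %v", ErrUnreachable, cluster, err)
		}
		return err // not a network failure: surface it unchanged
	}
	return nil
}

// ClusterGone reports whether either path means "treat the cluster as dead".
func ClusterGone(err error) bool {
	return errors.Is(err, ErrClientBuild) || errors.Is(err, ErrUnreachable)
}

// staleClient simulates a cached client whose cluster has disappeared.
type staleClient struct{ err error }

func (s staleClient) Delete(string) error { return s.err }

// refused builds the kind of error a dial against a dead endpoint produces.
func refused() error {
	return &net.OpError{Op: "dial", Net: "tcp", Err: errors.New("connection refused")}
}

// cached returns a factory that always succeeds, like a warm client cache.
func cached(deleteErr error) ClientFactory {
	return func(string) (Client, error) { return staleClient{err: deleteErr}, nil }
}

// broken simulates a factory that cannot build a client at all.
func broken(string) (Client, error) { return nil, errors.New("bad kubeconfig") }

func main() {
	fmt.Println(ClusterGone(DeleteFrom("region-a", "web", cached(refused()))))    // true
	fmt.Println(ClusterGone(DeleteFrom("region-a", "web", broken)))               // true
	fmt.Println(ClusterGone(DeleteFrom("region-a", "web", cached(errForbidden)))) // false
}
```

The caller only asks `ClusterGone(err)`, so both paths get the same response while the wrapped message still records which layer failed.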
What changes at scale
- Communication mode. Direct K8s clients work for tens of clusters. In the mid-hundreds, hub-and-spoke agents take over, mostly so connections are initiated from the target clusters and each region can keep running when the hub is unreachable.
- Capacity reservation. Headroom absorbs drift; reservation APIs eliminate it.
- Workload identity. Kubeconfigs for the prototype; SPIFFE for true hybrid cloud.
- Hysteresis on recovery. Eager rebalancing causes thundering herd; production deprioritizes recently-recovered clusters.
DESIGN.md walks through each with effort estimates.
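As one illustration of the recovery item above, a cooldown penalty is a simple way to deprioritize a cluster that has just come back. This is a sketch under assumptions: `RecoveryPenalty`, the linear decay, and both parameters are hypothetical, and DESIGN.md may describe a different mechanism.

```go
package main

import (
	"fmt"
	"time"
)

// RecoveryPenalty returns a score penalty for a cluster that recovered
// sinceRecovery ago. It starts at maxPenalty and decays linearly to zero
// over cooldown, so replicas drift back gradually instead of all at once.
func RecoveryPenalty(sinceRecovery, cooldown time.Duration, maxPenalty float64) float64 {
	if cooldown <= 0 || sinceRecovery >= cooldown {
		return 0
	}
	if sinceRecovery < 0 {
		sinceRecovery = 0
	}
	remaining := 1 - float64(sinceRecovery)/float64(cooldown)
	return maxPenalty * remaining
}

func main() {
	cooldown := 10 * time.Minute
	for _, since := range []time.Duration{0, 5 * time.Minute, 10 * time.Minute} {
		fmt.Println(since, RecoveryPenalty(since, cooldown, 100))
	}
}
```

Subtracted from the weighted-sum score, the penalty behaves like any other scoring component: one function plus one line.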
Why
I wanted to build it rather than just read about it. Hybrid cloud, AI inference, and the limits of open-source tooling are pushing large platforms to build custom orchestration layers — I wanted to see what those actually look like.