How Kubernetes Routing Evolved for Stateful LLM Inference

Kubernetes routing was designed for stateless, equal-cost web services. Large language model (LLM) inference violates nearly every assumption that design rests on. The result has been a multi-year evolution of new Kubernetes APIs, controllers, and routing components purpose-built for inference traffic — culminating, as of May 2026, in a generally available Gateway API Inference Extension.

This article traces the reasoning behind that evolution: the physics inside a single GPU pod that makes inference stateful, the specific assumptions standard routing makes, where each one fails under inference traffic, and the primitives the community built in response. Each new abstraction maps directly to a property of the workload, and understanding the workload is the fastest way to understand why the primitives look the way they do.

The request shape that breaks the model

From the outside, a web request and an LLM request are indistinguishable: HTTP in, HTTP out. Internally they share almost nothing.

LLM generation is autoregressive. The model emits one token at a time, and each token depends on every token before it. Producing 500 tokens of output requires 500 sequential forward passes through the model, because token N+1 takes token N as part of its input. Generation within a single request cannot be parallelized.

This produces two phases with fundamentally different performance characteristics:

Prefill processes the entire prompt in a single pass and produces the first output token. It is compute-bound: the arithmetic dominates, and the work scales with prompt length.
Decode produces each subsequent token, one pass per token. It is memory-bandwidth-bound: the GPU spends most of each step streaming model weights from high-bandwidth memory and performs comparatively little arithmetic.

Two latency metrics follow directly from this split, and they are the metrics inference systems are measured against:

Time To First Token (TTFT) — the delay before the first output token appears. It is dominated by prefill plus any time the request spends queued.
Inter-Token Latency (ITL) — the interval between subsequent tokens, sometimes called Time Per Output Token (TPOT). It is dominated by decode speed.

These are distinct quantities governed by distinct bottlenecks. A single aggregate "P99 latency," the standard web-service metric, conflates them and obscures both.
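As a minimal sketch of how the two metrics separate, the following derives TTFT and mean ITL from per-token arrival timestamps for one request. The function name `latency_metrics` and the timestamp layout are illustrative assumptions, not part of any engine's API.

```python
from statistics import mean


def latency_metrics(request_sent: float, token_times: list[float]) -> dict[str, float]:
    """Split one request's latency into TTFT and mean ITL (all values in seconds).

    request_sent: time the client sent the request.
    token_times:  arrival time of each output token, in order (hypothetical layout).
    """
    if not token_times:
        raise ValueError("need at least one output token")
    ttft = token_times[0] - request_sent
    # Gaps between consecutive tokens; empty when only one token was produced.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = mean(gaps) if gaps else 0.0
    return {"ttft": ttft, "itl": itl, "total": token_times[-1] - request_sent}


# Four tokens: the first arrives after 0.5 s, the rest 0.25 s apart.
m = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
```

Here the total latency of 1.25 s hides the fact that 0.5 s of it was spent before the first token and the rest was spread evenly across decode steps.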

The deeper consequence is cost variance. An inference request's total work depends on prompt length and output length, and the two together produce a cost range exceeding 100×: a 50-token prompt requesting 10 tokens is trivial, while a 4,000-token prompt requesting 2,000 tokens is enormous. Requests per second is therefore not a meaningful unit of load. The accurate unit is tokens per second, separated into prefill and decode load.
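A small illustration of why request rate hides load, using the two request shapes above. The `Request` type and `offered_load` helper are hypothetical, and raw token counts are only a crude proxy for work, since attention cost also grows with context length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    prompt_tokens: int   # processed during prefill
    output_tokens: int   # produced during decode, one per step


def offered_load(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Express load as request rate plus prefill and decode tokens per second."""
    return {
        "requests_per_s": len(requests) / window_seconds,
        "prefill_tokens_per_s": sum(r.prompt_tokens for r in requests) / window_seconds,
        "decode_tokens_per_s": sum(r.output_tokens for r in requests) / window_seconds,
    }


# Same request rate, very different token load.
light = offered_load([Request(50, 10)] * 10, window_seconds=1.0)
heavy = offered_load([Request(4000, 2000)] * 10, window_seconds=1.0)
```

Both workloads register as 10 requests per second, yet the heavy one carries 100 times the tokens of the light one by raw count.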

The KV cache: where the state lives

Generating each token requires the model to attend to every preceding token. For each prior token, the model computed two vectors — a Key and a Value — during the forward pass that produced it. Recomputing these at every step would impose quadratic cost, so they are retained in GPU memory. This is the KV cache, and it is the central data structure of modern inference serving.

The cache grows linearly with sequence length and is substantial. On a 70-billion-parameter model sharded across eight GPUs, a 4,000-token context consumes on the order of a gigabyte of GPU memory per request. Model weights are static and loaded once; the KV cache is the dynamic component, and it claims most of the GPU memory that remains once the weights are loaded.
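The gigabyte figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes a Llama-style 70B configuration (80 layers, 8 key/value heads under grouped-query attention, head dimension 128, 16-bit values); these numbers are assumptions for illustration and differ between models.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for one sequence, summed across all GPU shards."""
    # The leading 2 accounts for one Key and one Value vector per token per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Assumed Llama-style 70B shape with a 4,000-token context.
per_request = kv_cache_bytes(tokens=4000, layers=80, kv_heads=8, head_dim=128)
per_token = per_request // 4000
```

Under these assumptions each token costs 327,680 bytes and the full context costs about 1.31 GB, which is consistent with the order-of-magnitude claim.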

This inverts the capacity model. In a web service, concurrency is bounded by CPU and connection limits — elastic constraints that usually accommodate one additional request. In an inference pod, concurrency is bounded by GPU memory holding KV cache, a hard limit. When that memory is exhausted, the engine cannot admit another request; it must queue it, reject it, or evict an existing request.

Two optimizations make the model practical:

PagedAttention, introduced by the vLLM project, applies operating-system-style virtual memory to the KV cache: fixed-size blocks, a per-request block table, a free list, and copy-on-write for shared prefixes. It reduced KV cache memory waste from roughly 60% to under 4%, approximately doubling the number of concurrent requests a pod can hold.
Prefix caching reuses cached blocks across requests that share a leading token sequence. In conversational workloads, each turn resends the full history; in retrieval-augmented generation (RAG), each request prepends the same documents. The prefix overlap is large, and a cache hit can reduce TTFT by a factor of four to seven.

Prefix caching carries a structural implication that propagates through every layer above it: pods are no longer interchangeable. A pod that already holds a request's prefix in cache serves it several times faster than a pod that does not. This single property invalidates the assumption of pod fungibility that underlies conventional load balancing.
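One way to picture block-level prefix reuse is chained hashing: each full block is hashed together with its predecessor's hash, so a matching hash implies a matching prefix up to that block. This is a simplified sketch, not vLLM's implementation; the tiny `BLOCK_SIZE`, the hash choice, and the function names are all assumptions.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; assumed small value to keep the example readable


def block_hashes(token_ids: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Return one chained hash per full block of tokens."""
    hashes: list[str] = []
    parent = ""
    full = len(token_ids) - len(token_ids) % block_size  # ignore the partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + ":" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def cached_prefix_tokens(token_ids: list[int], cache: set[str],
                         block_size: int = BLOCK_SIZE) -> int:
    """Count leading tokens whose blocks are already present in the cache."""
    hit = 0
    for h in block_hashes(token_ids, block_size):
        if h not in cache:
            break  # the first miss ends the reusable prefix
        hit += block_size
    return hit


history = [1, 2, 3, 4, 5, 6, 7, 8]
pod_cache = set(block_hashes(history))            # a pod that served the history
next_turn = history + [9, 10, 11, 12, 13]         # the conversation continues
reused = cached_prefix_tokens(next_turn, pod_cache)
diverged = cached_prefix_tokens([1, 2, 3, 99, 5, 6, 7, 8], pod_cache)
```

The follow-up turn reuses all 8 history tokens on this pod, while a prompt that differs inside the first block reuses nothing, even though its later tokens are identical.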

Continuous batching and the redefinition of queue depth

Because decode is memory-bandwidth-bound, a single decoding request barely utilizes the GPU's compute capacity — the engine loads the full weight set from memory to produce one token. The resolution is continuous batching: load the weights once and produce a token for many in-flight requests simultaneously. The batch is reconstructed on every decode step, with completed requests removed and newly admitted requests added, keeping GPU utilization high. This technique is what made high-throughput LLM serving viable.
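The effect of rebuilding the batch on every step can be seen in a toy simulation that counts decode steps. It is a sketch under strong assumptions: prefill and KV memory limits are ignored, and `max_batch` is a hypothetical cap on concurrent requests.

```python
from collections import deque


def static_batching(output_lengths: list[int], max_batch: int) -> int:
    """Each batch runs until its longest request finishes before the next starts."""
    steps = 0
    for i in range(0, len(output_lengths), max_batch):
        steps += max(output_lengths[i:i + max_batch])
    return steps


def continuous_batching(output_lengths: list[int], max_batch: int) -> int:
    """Finished requests leave and waiting requests join before every step."""
    waiting = deque(output_lengths)
    running: list[int] = []   # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        running = [remaining - 1 for remaining in running]  # one token each
        steps += 1
        running = [remaining for remaining in running if remaining > 0]
    return steps


lengths = [8, 2, 2, 2, 2]  # one long request and four short ones
static_steps = static_batching(lengths, max_batch=2)
continuous_steps = continuous_batching(lengths, max_batch=2)
```

With these inputs static batching needs 12 steps and continuous batching needs 8, because the slot freed by each short request is refilled immediately instead of idling until the long request completes.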

Continuous batching also redefines what "queue depth" means, because a modern engine maintains three distinct queues:

Waiting: requests received but not yet admitted to the running batch, because no KV cache memory is available for them. This is the queue most often meant by "queue depth," and it is a memory-admission queue rather than a CPU work queue.
Running: requests currently decoding and holding KV cache.
Swapped: requests that were admitted and then preempted under memory pressure, their KV blocks moved out of GPU memory so that they can be restored, or recomputed, once memory frees up.

The signals these queues expose form a lead/lag hierarchy: KV cache utilization leads queue depth, which in turn leads TTFT. By the time TTFT breaches an SLO, the underlying cause has been observable in KV utilization for several seconds.

The degradation is not gradual. Past approximately 85–90% KV utilization, a pod transitions from stable operation into a preemption cascade — a sharp, self-amplifying phase change rather than a linear slowdown. This behavior contrasts with stateless services, where overload typically manifests as graceful degradation. The practical implication for routing is that decisions must act on the leading signal, before the cliff; reacting to TTFT alone is reacting too late.
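The three queues and the memory-driven preemption between them can be sketched as a toy scheduler. This is not an engine's actual scheduler: the `Scheduler` and `Req` classes, the block counts, and the newest-first victim policy are assumptions chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(eq=False)
class Req:
    name: str
    blocks: int  # KV cache blocks the request currently needs (hypothetical unit)


class Scheduler:
    def __init__(self, total_blocks: int):
        self.total_blocks = total_blocks
        self.waiting: deque[Req] = deque()
        self.running: list[Req] = []
        self.swapped: list[Req] = []

    def used(self) -> int:
        return sum(r.blocks for r in self.running)

    def utilization(self) -> float:
        return self.used() / self.total_blocks

    def admit(self) -> None:
        """Admit waiting requests in order while their KV blocks still fit."""
        while self.waiting and self.used() + self.waiting[0].blocks <= self.total_blocks:
            self.running.append(self.waiting.popleft())

    def grow(self, name: str, extra: int) -> None:
        """A running request's context grew; preempt newer requests if memory runs out."""
        target = next(r for r in self.running if r.name == name)
        target.blocks += extra
        while self.used() > self.total_blocks:
            victim = next((r for r in reversed(self.running) if r is not target), None)
            if victim is None:
                raise RuntimeError("request exceeds total KV capacity")
            self.running.remove(victim)
            self.swapped.append(victim)


s = Scheduler(total_blocks=10)
s.waiting.extend([Req("a", 4), Req("b", 4), Req("c", 4)])
s.admit()
admitted = [r.name for r in s.running]   # "a" and "b" fit; "c" stays in waiting
s.grow("a", extra=3)                     # "a" now needs 7 blocks, so "b" is preempted
```

Note that the waiting queue filled at 80% utilization, before any preemption occurred, which is the ordering the lead/lag hierarchy describes.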

Five assumptions in standard routing, and where inference breaks each

Standard Kubernetes routing follows a well-understood path: a Service resolves to a set of pod endpoints via EndpointSlice, and kube-proxy distributes connections across them, typically at random. This model rests on five assumptions, each of which inference traffic violates.

Endpoints are binary. A readiness probe reports a pod as Ready or NotReady. An inference pod, however, can report Ready=True while operating at 95% KV utilization with TTFT already several times its baseline, mid-cascade. Readiness is binary; inference health is a gradient. A finer-grained probe does not resolve this, because the appropriate response to a degrading pod is reduced traffic, not removal — evicting it from the endpoint set redirects its full load onto the remaining pods and risks pushing them over the same cliff.
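A graded alternative to binary readiness might map KV utilization to a routing weight that tapers before the cliff and never reaches zero. The thresholds and the `traffic_weight` name below are assumptions for illustration, not values from any gateway or engine.

```python
def traffic_weight(kv_utilization: float, soft_limit: float = 0.80,
                   hard_limit: float = 0.95, min_weight: float = 0.05) -> float:
    """Map KV cache utilization (0..1) to a relative routing weight.

    Below soft_limit the pod takes full weight; between the limits the weight
    falls linearly; above hard_limit it keeps a small floor so the pod is
    de-prioritized rather than removed from the endpoint set.
    """
    if kv_utilization <= soft_limit:
        return 1.0
    if kv_utilization >= hard_limit:
        return min_weight
    headroom = (hard_limit - kv_utilization) / (hard_limit - soft_limit)
    return min_weight + (1.0 - min_weight) * headroom


weights = {u: traffic_weight(u) for u in (0.50, 0.85, 0.90, 0.99)}
```

The floor is a deliberate design choice in this sketch: a pod at 99% utilization still receives a trickle of traffic, so its load is not dumped onto its neighbors all at once.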

Pods are fungible. This holds for stateless services and fails under prefix caching. A pod holding a request's cached prefix may respond in 100 milliseconds where a random pod takes 700 milliseconds for the identical request. Routing that ignores cache locality discards the four-to-sevenfold speedup and prevents any pod from accumulating a useful working set.

Requests are equal-cost. Round-robin and random distribution depend on this. With cost variance exceeding 100×, distributing equal request counts produces highly unequal work. One pod may draw several long requests and saturate while another draws short requests and idles. The load balancer, counting connections, cannot observe the imbalance.
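The imbalance is easy to reproduce. The sketch below distributes a deliberately adversarial sequence of request costs, measured in tokens, across two pods. The cost-aware variant assumes the cost is known at routing time, which is not true in practice because output length is unknown in advance; it is included only for contrast.

```python
from itertools import cycle


def round_robin_load(costs: list[int], pods: int) -> list[int]:
    """Assign requests to pods in rotation and return total tokens per pod."""
    load = [0] * pods
    for pod, cost in zip(cycle(range(pods)), costs):
        load[pod] += cost
    return load


def least_loaded(costs: list[int], pods: int) -> list[int]:
    """Assign each request to the pod with the least accumulated tokens so far."""
    load = [0] * pods
    for cost in costs:
        load[load.index(min(load))] += cost
    return load


# Alternating long (6,000-token) and short (60-token) requests.
costs = [6000, 60, 6000, 60, 6000, 60]
rr = round_robin_load(costs, pods=2)
ll = least_loaded(costs, pods=2)
```

Both pods receive three requests under round-robin, yet one carries 18,000 tokens and the other 180. A connection-counting balancer would report the two as evenly loaded.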

Cost is observable at the load-balancer layer. The signals that determine routing quality — KV utilization, running batch size, recent TTFT — reside inside the pod, in the inference engine. They are not visible to kube-proxy. The only remedy is to scrape engine metrics into a routing component that consumes them. But scraping operates on an interval, which means the router necessarily acts on data that is stale by construction. In a system where pod state can shift within seconds, even a few seconds of staleness is significant.

Termination is fast. A 30-second termination grace period drains web requests cleanly. An inference request mid-decode may require 30 or more additional seconds to complete, and the pod may hold gigabytes of cache representing minutes of accumulated computation. Pod replacement, inexpensive in the web model, becomes one of the most costly operations in the inference model.

Each of these is a structural mismatch rather than a parameter to tune. None can be resolved within the standard model, which is why the ecosystem developed new primitives — and why those primitives take the form they do.

The new primitives

The community's response is the Gateway API Inference Extension (GIE), a SIG Network project that originated in WG Serving. As of May 2026, GIE has reached general availability, with its InferencePool resource graduated to a stable v1 API. It introduces two resources, each aligned with a distinct concern.

InferencePool replaces the Kubernetes Service as the routing backend for model servers. It selects a set of pods that share a model, accelerator type, and server configuration, and binds them to a routing extension. Functionally, it is a Service that carries inference-specific semantics.

InferenceObjective declares how a request should be treated relative to others, namely its priority and criticality, enabling the router to make per-workload shedding and fairness decisions.

The component that performs per-request routing is the Endpoint Picker (EPP). On each request, the gateway data plane — Envoy, via its external processing interface — consults the EPP, which selects a target pod. The EPP continuously observes engine metrics such as KV utilization, queue length, and active LoRA adapters, and returns a routing decision based on them. The EPP is pluggable: different implementations apply different scoring policies behind a single standard interface. This decomposition reframes the routing problem from "which load balancer" to "which scoring policy," and allows that policy to evolve independently of the gateway implementation.
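To make "scoring policy" concrete, the following sketch ranks pods by a weighted sum of KV headroom, queue length, and prefix-cache overlap. It does not reproduce the EPP's interface or its actual scorers; the `PodMetrics` fields, the weights, and the formula are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodMetrics:
    name: str
    kv_utilization: float    # 0..1, as last reported by the engine
    queue_length: int        # requests in the waiting queue
    prefix_hit_ratio: float  # 0..1, share of this prompt already cached on the pod


def score(p: PodMetrics, w_kv: float = 1.0, w_queue: float = 1.0,
          w_prefix: float = 2.0) -> float:
    """Higher is better: reward free KV memory, short queues, and cache overlap."""
    return (w_kv * (1.0 - p.kv_utilization)
            + w_queue / (1 + p.queue_length)
            + w_prefix * p.prefix_hit_ratio)


def pick(pods: list[PodMetrics]) -> PodMetrics:
    return max(pods, key=score)


pods = [
    PodMetrics("pod-a", kv_utilization=0.50, queue_length=0, prefix_hit_ratio=0.0),
    PodMetrics("pod-b", kv_utilization=0.70, queue_length=1, prefix_hit_ratio=0.9),
    PodMetrics("pod-c", kv_utilization=0.97, queue_length=6, prefix_hit_ratio=1.0),
]
chosen = pick(pods)
```

With these weights the picker prefers pod-b: it trades a little cache overlap for a pod that is not already near saturation, rather than sending the request to the cache-hot but overloaded pod-c.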

GIE is an open standard rather than a managed-cloud feature. It runs on Envoy Gateway, kgateway, Istio, NGINX, and GKE Gateway, and it can extend any gateway that supports both the Gateway API and Envoy external processing. Inference-aware routing is therefore portable across clouds and on-premises deployments.

The frontier: unsolved problems

The GIE roadmap names the problems that remain open, which is itself a useful signal of where the difficulty lies.

Prefix-cache-aware load balancing, including interfaces to remote KV caches, aims to route requests to the pods that already hold their prefixes. Projects such as llm-d maintain a near-real-time, fleet-wide index of which pod holds which KV blocks, so the router can favor cache hits — while mitigating the failure mode in which many requests for the same popular prefix converge on a single cache-hot pod and overload it.
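A router-side index of this kind can be sketched as a map from block hash to the pods believed to hold that block. This is not llm-d's design; the `PrefixIndex` class and the placeholder hash strings are assumptions, and a real index would also need expiry as pods evict blocks.

```python
from collections import defaultdict


class PrefixIndex:
    """Map each KV block hash to the set of pods reported to hold it."""

    def __init__(self):
        self._holders = defaultdict(set)

    def record(self, pod: str, block_hashes: list[str]) -> None:
        """Register that a pod holds the given blocks."""
        for h in block_hashes:
            self._holders[h].add(pod)

    def matched_blocks(self, block_hashes: list[str]) -> dict[str, int]:
        """For each pod, count how many leading blocks of the request it holds."""
        matched: dict[str, int] = {}
        candidates = None
        for depth, h in enumerate(block_hashes, start=1):
            holders = self._holders.get(h, set())
            # A pod stays a candidate only while it holds every block so far.
            candidates = holders if candidates is None else candidates & holders
            if not candidates:
                break
            for pod in candidates:
                matched[pod] = depth
        return matched


index = PrefixIndex()
index.record("pod-a", ["h1", "h2", "h3"])
index.record("pod-b", ["h1"])
overlap = index.matched_blocks(["h1", "h2", "h4"])
```

The result gives the router a per-pod overlap depth that it can weigh against load, instead of sending every request for a popular prefix to the single pod with the deepest match.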

Distributed and tiered KV cache (llm-d, LMCache, and the Mooncake design) treats the cache as a pooled resource across pods, organized into GPU, CPU, and NVMe tiers, rather than an opaque allocation confined to one pod.

Prefill/decode disaggregation separates the compute-bound prefill phase and the memory-bandwidth-bound decode phase onto distinct pods, each provisioned for its phase's bottleneck.

Heterogeneous accelerators require cost- and latency-aware routing across pools containing multiple GPU types.

Latency-prediction routing replaces scoring on raw point-in-time metrics with per-pod predictions of TTFT and TPOT, routing on predicted SLO headroom. To remain accurate as workloads shift, such predictors are retrained continuously on a sliding window of recent traffic.
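As a rough illustration of a predictor that tracks shifting workloads, the sketch below fits a one-variable least-squares line of TTFT against prompt length over a fixed-size window of recent samples. Real predictors use richer features and models; the class name, window size, and single feature here are assumptions.

```python
from collections import deque


class SlidingTTFTPredictor:
    """Predict TTFT from prompt length using only the most recent samples."""

    def __init__(self, window: int = 100):
        # Old samples fall out automatically once the window is full.
        self.samples = deque(maxlen=window)  # (prompt_tokens, observed_ttft_seconds)

    def observe(self, prompt_tokens: int, ttft_seconds: float) -> None:
        self.samples.append((prompt_tokens, ttft_seconds))

    def predict(self, prompt_tokens: int) -> float:
        n = len(self.samples)
        if n == 0:
            raise ValueError("no samples observed yet")
        mean_x = sum(x for x, _ in self.samples) / n
        mean_y = sum(y for _, y in self.samples) / n
        var_x = sum((x - mean_x) ** 2 for x, _ in self.samples)
        if var_x == 0:
            return mean_y  # all prompts had the same length; fall back to the mean
        slope = sum((x - mean_x) * (y - mean_y) for x, y in self.samples) / var_x
        return mean_y + slope * (prompt_tokens - mean_x)


p = SlidingTTFTPredictor(window=3)
for tokens, ttft in [(100, 0.2), (200, 0.3), (300, 0.4)]:
    p.observe(tokens, ttft)
before = p.predict(400)   # fitted on the lightly loaded regime

for tokens, ttft in [(100, 0.4), (200, 0.6), (300, 0.8)]:
    p.observe(tokens, ttft)
after = p.predict(400)    # the window now holds only the slower regime
```

Once the pod slows down, the old samples age out of the window and the prediction for a 400-token prompt rises from about 0.5 s to about 1.0 s.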

A common thread connects these efforts. The router operates on an inherently delayed, partial view of pod state. Classical load-balancing theory established decades ago that acting on stale load information can induce herd behavior — traffic converges on whichever endpoint most recently appeared underloaded, saturates it, and then converges elsewhere, producing oscillation. The inference case is more severe because the degradation cliff is steeper. How a router should weigh a metric's age, not only its value, is among the least settled questions in the layer, and current systems address it largely implicitly.
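One hypothetical way to weigh a metric's age is to blend each pod's last observed utilization toward the fleet average as the observation gets older, so that stale readings stop distinguishing pods and the router drifts back toward even spreading. The half-life and the function name below are assumptions, not a mechanism taken from GIE or llm-d.

```python
def age_discounted_utilization(observed: float, fleet_mean: float,
                               age_seconds: float, half_life: float = 2.0) -> float:
    """Blend a stale per-pod reading toward the fleet mean as it ages.

    Trust in the observation halves every half_life seconds. A fresh reading
    is used as-is; a very old one is treated as carrying no pod-specific
    information.
    """
    trust = 0.5 ** (age_seconds / half_life)
    return trust * observed + (1.0 - trust) * fleet_mean


fresh = age_discounted_utilization(0.5, fleet_mean=0.75, age_seconds=0.0)
aged = age_discounted_utilization(0.5, fleet_mean=0.75, age_seconds=2.0)
ancient = age_discounted_utilization(0.5, fleet_mean=0.75, age_seconds=60.0)
```

A pod that looked underloaded two seconds ago is scored halfway between its reading and the fleet mean, which dampens the tendency for all traffic to converge on it at once.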

Conclusion

Inference did not expose a flaw in Kubernetes routing so much as a mismatch of workload. The routing model was built for services that are stateless, equal-cost, fast to drain, and observable from outside the pod. LLM inference is none of these: it is stateful by virtue of the KV cache, highly variable in cost, expensive to displace, and observable in detail only from inside the engine, on a delay.

The new primitives (InferencePool, InferenceObjective, and the Endpoint Picker) exist because routing for inference is a delayed-observation control problem over backends that are stateful, non-fungible, and degrade on a gradient. The relevant design questions have shifted from selecting a load balancer to determining how a picker should score endpoints when every available signal is graded rather than binary, request-specific, gathered from inside the pod, and stale on arrival. It is a new problem class for the Kubernetes data path, and the abstractions for it are being standardized in the open.

Sources and currency: API status (GIE general availability, InferencePool v1, InferenceObjective, the Endpoint Picker, and roadmap items) reflects the upstream Gateway API Inference Extension project and the Kubernetes blog as of May 2026. Engine internals (PagedAttention, prefix caching, continuous batching) reflect vLLM's published design. The stale-load-information argument draws on classical load-balancing results, including work by Mitzenmacher and by Dahlin. This area changes rapidly; specific API and status details should be verified against current upstream documentation.