Evolving Kubernetes for generative AI inference
Friday, August 29, 2025, 11:00 AM, from InfoWorld
To meet the needs of generative AI, Google Cloud joined forces with ByteDance and Red Hat to engineer these foundational improvements directly into the Kubernetes open-source project. This community-driven effort has given Kubernetes a native understanding of AI inference, tackling critical areas such as inference performance benchmarking, LLM-aware routing, inference gateway load balancing, and dynamic resource allocation. These foundational investments create a more robust and efficient platform for AI, allowing the entire ecosystem to benefit from the following:

- Benchmarking and qualification of accelerators with the Inference Perf project.
- Operationalizing scale-out architectures with LLM-aware routing via the Gateway API Inference extension.
- Scheduling and fungibility across a wide range of accelerators with DRA (dynamic resource allocation) for hardware accelerators and the vLLM library for LLM inference and serving.

Figure 1. Kubernetes inference stack. (Google Cloud)

Vertically integrating inference servers with Kubernetes

While engineering core improvements into Kubernetes, we also recognized the need to evolve the inference servers that run on top of it. Previously, inference servers like vLLM, SGLang, and Triton were treated as standalone components deployed onto Kubernetes infrastructure. With techniques like disaggregated serving, however, there are benefits to vertically integrating inference servers and Kubernetes so they operate as a single system. For instance, with disaggregated serving, a global process running in the Kubernetes control plane to maximize key-value (KV) cache utilization can lead to better performance. That's why Google Cloud is now working with Red Hat, Nvidia, IBM Research, CoreWeave, and others to form llm-d. This initiative integrates vLLM and Kubernetes, bringing together not only two technologies but also two communities that can now collaborate more closely.
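To make the KV-cache idea concrete, here is a minimal, hypothetical sketch of what a global scheduler in the control plane might do: route each request to the decode worker whose KV cache already holds the longest matching prompt prefix, so cached computation is reused. All names here are illustrative; this is not the llm-d API, and real schedulers match token prefixes rather than raw strings.

```python
# Hypothetical sketch of KV-cache-aware global scheduling (illustrative only).
from dataclasses import dataclass, field


@dataclass
class DecodeWorker:
    name: str
    # Prompt prefixes this worker already holds in its KV cache.
    # (Real systems track token-block hashes, not raw strings.)
    cached_prefixes: set = field(default_factory=set)


def common_prefix_len(a: str, b: str) -> int:
    """Length of the shared leading prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def route(prompt: str, workers: list) -> DecodeWorker:
    """Pick the worker whose cache covers the longest prefix of this prompt."""
    def best_prefix(w: DecodeWorker) -> int:
        return max((common_prefix_len(prompt, p) for p in w.cached_prefixes),
                   default=0)
    return max(workers, key=best_prefix)


workers = [
    DecodeWorker("vllm-decode-0", {"Summarize this log:"}),
    DecodeWorker("vllm-decode-1", {"Explain Kubernetes"}),
]
print(route("Explain Kubernetes DRA", workers).name)  # vllm-decode-1
```

A per-node scheduler cannot make this choice well, because it only sees its own cache; that is the argument for running the coordination process globally in the control plane.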
The open-source work lays the groundwork for new features in Google Cloud's Google Kubernetes Engine (GKE) that provide an out-of-the-box implementation of all the Kubernetes and llm-d primitives.

Simplifying deployment with GKE Inference Quickstart

The journey to production for an AI model can be fraught with complexity, from selecting the right hardware accelerators to configuring the optimal serving environment. GKE Inference Quickstart is a new feature designed to streamline this process. The back end for Quickstart is an extensive benchmark database, maintained by Google Cloud, that generates latency vs. throughput curves for every new model and accelerator configuration. The benchmarking system is based on the standard Inference Perf project in Kubernetes. By providing pre-configured, optimized setups, GKE Inference Quickstart takes the guesswork out of deploying models. It helps you make data-driven decisions about the most suitable accelerators for your specific needs, whether GPUs or TPUs. This accelerates time-to-market for AI applications and ensures you start with a foundation built for performance and efficiency.

Figure 2. A benchmarked latency vs. throughput curve stored in the Quickstart database. (Google Cloud)

Unlocking the power of TPUs for inference

Google's Tensor Processing Units (TPUs) have long been a cornerstone of its internal AI development, offering exceptional performance and cost-efficiency for machine learning workloads. Now GKE is making it easier to leverage these benefits for your own inference tasks. With the new vLLM/TPU integration, you can deploy your models on TPUs without extensive code modifications. A highlight is support for the popular vLLM library on TPUs, allowing interoperability across GPUs and TPUs.
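The decision that Quickstart's benchmark database automates can be sketched in a few lines: given measured latency/throughput points per accelerator, pick the cheapest configuration that still meets a latency objective. The accelerator names, numbers, and function below are illustrative assumptions, not Quickstart's actual data or API.

```python
# Illustrative sketch of accelerator selection from a benchmark table.
# All figures are made up for the example.
benchmarks = [
    # (accelerator, p50 latency in ms/token, throughput in tok/s, $/hour)
    ("nvidia-l4",   45, 1200, 0.70),
    ("nvidia-h100", 12, 9800, 6.98),
    ("tpu-v5e",     20, 4300, 1.20),
]


def pick_accelerator(slo_ms_per_token: float):
    """Cheapest accelerator whose benchmarked latency meets the SLO,
    or None if no configuration qualifies."""
    eligible = [b for b in benchmarks if b[1] <= slo_ms_per_token]
    return min(eligible, key=lambda b: b[3], default=None)


print(pick_accelerator(25))  # with these numbers: the tpu-v5e row
```

The real system works over full latency vs. throughput curves per model and accelerator (Figure 2), but the shape of the decision, meet the SLO at the lowest cost, is the same.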
By opening up the power of TPUs for inference on GKE, Google Cloud is giving customers broader choices for optimizing the price-to-performance ratio of demanding AI workloads.

AI-aware load balancing with GKE Inference Gateway

Unlike traditional load balancers that distribute traffic in round-robin fashion, GKE Inference Gateway is intelligent and AI-aware. It understands the unique characteristics of generative AI workloads, where a simple request can produce a lengthy, computationally intensive response. The GKE Inference Gateway routes requests to the most appropriate model replica, taking into account factors such as current load and expected processing time, which is proxied by KV cache utilization. This prevents a single long-running request from blocking other, shorter requests, a common cause of high latency in AI applications. The result is a dramatic improvement in performance and resource utilization.

Figure 3. Metrics without and with the GKE Inference Gateway, showcasing relative performance gains. (Google Cloud)

Toward an AI-aware cloud-native ecosystem

These developments point to a future where generative AI inference is seamlessly integrated into the cloud-native ecosystem, driven by an AI-aware Kubernetes. Collaborative efforts across open-source communities like WG Serving, SIG Scalability, llm-d, and vLLM are creating a vibrant, evolving platform that directly addresses the unique challenges of large language models and specialized hardware. Ultimately, this ongoing development creates a flywheel of innovation, keeping the Kubernetes platform current with the latest advancements in AI and hardware. The emphasis on fungibility across accelerators and the strong community collaboration ensure that practitioners have a robust, efficient, and adaptable environment for deploying and scaling their generative AI applications, accelerating the journey from model development to production.
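As a closing illustration, the AI-aware routing idea described above, preferring the replica with the lowest KV cache utilization rather than cycling round-robin, reduces to a one-line scoring rule. The replica names and metric are illustrative assumptions; this is a sketch of the concept, not the gateway's implementation.

```python
# Hedged sketch of KV-utilization-aware replica selection (illustrative).
def pick_replica(replicas: dict) -> str:
    """replicas maps replica name -> KV cache utilization in [0.0, 1.0].
    Lower utilization is a proxy for less in-flight work, so pick the min."""
    return min(replicas, key=replicas.get)


replicas = {"vllm-0": 0.92, "vllm-1": 0.35, "vllm-2": 0.60}
print(pick_replica(replicas))  # vllm-1: lowest KV cache utilization
```

A round-robin balancer would have sent roughly a third of new traffic to vllm-0 even while it was nearly saturated; scoring by a load signal is what prevents long-running requests from stalling short ones.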
— Generative AI Insights provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss the challenges and opportunities of generative artificial intelligence. The selection is wide-ranging, from technology deep dives to case studies to expert opinion, but also subjective, based on our judgment of which topics and treatments will best serve InfoWorld’s technically sophisticated audience. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Contact doug_dineley@foundryco.com.
https://www.infoworld.com/article/4045563/evolving-kubernetes-for-generative-ai-inference.html