
Cutting Kubernetes costs with virtual clusters

Monday, August 19, 2024, 10:45 AM, from InfoWorld
Straight out of the webscale playbook, platform engineering was considered a futuristic discipline until just a few years ago. Would platform engineering really trickle down to mainstream enterprise teams? Did companies really want to operate their infrastructure like the major cloud providers? Now it’s a practice that 80% of enterprises will have adopted by 2026, according to Gartner.

Platform engineering means different things to different people, and there’s no golden path that prescribes exactly how to do it right. But the main goals are universally understood. On the one hand, platform engineering strives to boost developer velocity, by removing bottlenecks and adding self-service. On the other hand, it aims to standardize on central controls like security and compliance, so you can keep costs and complexity in check. 

Increasing developer velocity has been a clear win for platform engineering. Containers, microservices, Kubernetes, CI/CD, and the modern development workflow undeniably have made software development a faster, more productive, more automated experience for distributed teams. 

But the “central control” part of platform engineering? It’s not so easy to declare victory quite yet. We’re in the midst of a multi-year backlash against the high cost and complexity of the cloud operating model. And today the central control side of platform engineering isn’t just a platform engineering team issue; it’s a CFO issue, as cloud bills soar and companies feel severe pressure to find cost savings. The cloud and Kubernetes are here to stay, but fixing a broken central control plane is a multimillion-dollar dilemma that many enterprises are struggling with today.

That’s why an open-source project called vCluster is having a breakthrough moment. vCluster takes aim at the heart of the Kubernetes operating model, the cluster abstraction, to deliver a range of benefits to organizations building on Kubernetes. vCluster not only dramatically reduces resource overhead, which in turn can add up to significant cost savings, but also brings more agility and more central control to platform engineering teams.

The open-source path to opportunity

vCluster co-creators (and Loft Labs co-founders) Lukas Gentele and Fabian Kramm met as computer science students at the University of Mannheim. They followed a similar technology path: from Java and graph databases like Neo4j, to web-focused technologies like PHP and JavaScript, then to Go when Docker and Kubernetes took off and it became clear that Go was the future. When Gentele started an IT consultancy while still in college, Kramm was his first hire.

Within that IT services business, Gentele and Kramm created a project called DevSpace—essentially a Docker Compose alternative focused on streamlining Kubernetes workflows—and put it on GitHub. That was their first exposure to developing and maintaining an open-source project. They had both contributed fixes to open-source projects before, but had never owned a project or driven one as maintainers. Seeing the magic of open source firsthand, shipping a project and watching people use it, value it, and contribute to it, they were hooked.

After graduating from college, the two set out to build a PaaS product (what Gentele describes as “like Heroku for Kubernetes”), applied to Y Combinator, got denied but were invited to apply again, then parlayed that into participation in the SkyDeck accelerator program at U.C. Berkeley. Ultimately, they concluded that PaaS was a very difficult business to run. They weren’t the first to hit this wall, as the struggles of Cloud Foundry, of Heroku, and of the Docker founders to monetize dotCloud had demonstrated.

“We realized we had a lot of free users for our PaaS but not a lot of willingness to pay,” Gentele said. “OK, so what did we learn? We learned that running large Kubernetes clusters and sharing those clusters with users was extremely complicated and expensive, and that there was a much better way to do this that was a much bigger opportunity than the PaaS. An idea that could be useful for anyone running Kubernetes clusters.”

Fleets are the wrong abstraction

In the early days of container orchestration, the market got comfortable with the idea of treating servers like “cattle” (interchangeable hardware that can be swapped out), versus “pets” (each server approached with its own care and feeding) as a core concept of turning servers into clusters.

For years, there was an architectural debate around whether to create a fleet of small clusters or a single massive cluster. “Kubernetes itself was designed to run at large scale,” said Gentele. “It’s not meant to run as a five-node cluster. If you have these small clusters, you get so much duplication and inefficiency.”

But as the major cloud providers rolled out their Kubernetes offerings, small single-tenant clusters and fleets of multiple clusters were the units of abstraction sold to the enterprise market, complete with “fleet management” solutions for coordinating all the moving parts and keeping services in sync in the “clusters of mini clusters” approach.

Gentele attributes a large portion of today’s cloud cost overruns to this original sin by the cloud providers.

The first consequence of the fleet approach is that the penalty of heavyweight infrastructure components gets paid multiple times. Platform teams want to standardize on core services like Istio and Open Policy Agent—services that are designed to run at scale and become highly inefficient when many copies run at small scale. In the fleet approach, these services get installed in every cluster, so the entire platform stack is replicated across many small clusters, as the sketch below illustrates.
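To make the duplication concrete, here is a minimal sketch of what the fleet model demands, assuming three hypothetical kubeconfig contexts (team-a, team-b, team-c) and an illustrative policy-agent manifest; every cluster gets its own full copy of each control-plane service:

    # Fleet model: every cluster carries its own copy of the platform stack.
    # The contexts and the manifest file below are hypothetical placeholders.
    for ctx in team-a team-b team-c; do
      istioctl install --context "$ctx" -y                 # a full Istio control plane, per cluster
      kubectl --context "$ctx" apply -f gatekeeper.yaml    # a policy agent, per cluster
    done

Every component in that loop also has to be upgraded, patched, and kept in sync once per cluster, which is where much of the operational cost hides.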

The other major consequence is that these clusters run all the time. Nobody turns them off. None of the major cloud offerings provide an easy way to turn off an entire cluster with the click of a button. Instead, shutting a cluster down is a manual process that requires a policy to be put in place, and bringing it back up takes around 30 minutes to spin up the entire platform stack of services used to connect, manage, secure, and monitor the cluster. It’s also hard to tell when a cluster is truly “idle” when all of these platform services—security components, policy agents, compliance, backup, monitoring, and logging—continue running underneath.
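For illustration, the closest thing to an off switch on a managed offering is manually scaling the node pools to zero. The cluster name, node pool, and zone below are hypothetical (GKE shown; other providers are similar):

    # "Pausing" a hypothetical GKE cluster means draining its node pools by hand...
    gcloud container clusters resize demo-cluster \
      --node-pool default-pool --num-nodes 0 --zone us-central1-a

    # ...and "resuming" means scaling back up, then waiting for the whole
    # platform stack (mesh, policy agents, monitoring) to become ready again.
    gcloud container clusters resize demo-cluster \
      --node-pool default-pool --num-nodes 3 --zone us-central1-a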

vCluster: Addition by subtraction

Gentele and Kramm had the epiphany that the fleet approach to clusters could be vastly improved upon, and that Kubernetes multitenancy could be redefined beyond traditional namespace approaches.

In 2023, they released vCluster and introduced the concept of “virtual clusters,” an abstraction for creating lightweight virtual Kubernetes clusters. Just as virtual private networks create a virtual network over physical infrastructure, virtual clusters create isolated Kubernetes environments on top of a shared physical Kubernetes cluster.
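As a minimal sketch of the workflow, assuming a host cluster is already reachable via kubectl and using hypothetical names, the open-source vcluster CLI stands up a virtual cluster inside an ordinary namespace:

    # Create a virtual cluster named "team-a"; on the host side it lives
    # in a plain namespace of the shared physical cluster.
    vcluster create team-a --namespace team-a

    # Point kubectl at the virtual cluster; from here it behaves like
    # any other certified Kubernetes cluster.
    vcluster connect team-a

    # Tear the environment down when it is no longer needed.
    vcluster delete team-a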

vCluster is a certified Kubernetes distribution, so the virtual clusters behave exactly the same way as any other Kubernetes cluster—with one important difference. Whereas each virtual cluster manages its own namespaces, pods, and services, it does not replicate the platform stack. Instead, it shares the heavyweight platform components, such as Istio or Open Policy Agent, run by the underlying physical cluster. With this shared platform stack, virtual clusters are no longer dragging around the albatross of replicating platform services.

And yet each vCluster has its own API server and control plane, providing strong isolation between tenants and giving platform teams the ability to define custom policies for security and governance. Tenants can create their own namespaces, deploy their own custom resources, protect cluster data by using their own backing stores, and apply their own access control policies to users.
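One way to see that isolation in practice, continuing the hypothetical team-a example from above:

    # Inside the virtual cluster, the tenant is effectively cluster-admin
    # against its own API server...
    vcluster connect team-a
    kubectl create namespace experiments
    kubectl get namespaces          # shows only this tenant's namespaces

    # ...while from the host cluster's point of view, the whole virtual
    # cluster is just workloads inside one ordinary namespace.
    vcluster disconnect
    kubectl get pods -n team-a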

At the same time, vCluster gives platform teams far greater speed and agility than a physical cluster. A virtual cluster can be spun up in a mere fraction of the time it takes to spin up a physical cluster. A restart of a vCluster takes about six seconds, versus 30 or 45 minutes to restart a physical Kubernetes cluster that’s running heavyweight platform services like Istio underneath.

“Kubernetes is great from an API perspective, a tooling perspective, a standardization perspective—but the architecture that the cloud providers advocated in running clusters of small clusters took the industry back to the physical server in terms of cost and heaviness,” Gentele said. “In the ’90s, someone had to actually physically walk into a data center, plug in a server, issue credentials, and take some other manual steps,” he said. “We’re in a similar boat with Kubernetes today. You have so many enterprises running their entire application stack in each cluster, which creates a lot of duplication.”

vCluster makes the Kubernetes cluster more lightweight and ephemeral, similar to what virtual machines did for physical servers and what containers did for workloads in general.

“Spinning up small single-tenant Kubernetes clusters was a really terrible idea in the first place, because it’s very costly, and it’s very, very hard to manage,” Gentele said. “You’re going to end up with hundreds of clusters. And then you’ve got to maintain things like ingress controller, cert manager, Prometheus, and metrics across all these clusters. That’s a lot of work, and it’s really hard to keep in sync.”

vCluster by the numbers

vCluster has more than 6,000 stars on GitHub and more than 120 contributors. The project has drawn the attention of Kubernetes experts such as Rancher’s former CTO and co-founder Darren Shepherd, who has been advocating for the use of virtual clusters. Teams from Adobe, CoreWeave, and Codefresh have been outspoken about their use of vCluster at events like KubeCon.

Gentele and Kramm’s startup Loft Labs was recently funded to extend enterprise capabilities around vCluster. The $24M Series A was led by Khosla Ventures, which is known for being the first institutional investor in companies like GitLab and OpenAI.

The startup’s commercial offering on top of vCluster has generated particular excitement over its “sleep mode,” which turns off inactive virtual clusters automatically. Enterprises that spin up clusters typically leave them running around the clock. Loft Labs’ product measures virtual cluster activity by monitoring incoming API requests and uses sleep mode to automatically scale down virtual clusters that aren’t being used, saving cloud resources and overall cost.
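As an illustrative sketch only (the key names below are assumptions, not the product’s documented schema; check Loft Labs’ docs for the real one), an idle-shutdown policy for a virtual cluster might look something like this:

    # Hypothetical sleep-mode policy in a vcluster.yaml-style config.
    # Key names are illustrative assumptions, not a documented schema.
    sleepMode:
      enabled: true
      autoSleep:
        afterInactivity: 30m   # no API requests for 30 minutes -> scale down
      autoWakeup:
        onAccess: true         # the next incoming API request wakes the cluster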

vCluster may help enterprises run cloud infrastructure more efficiently and drive down cloud costs, but it also gives them a clearer path to winning on both central control and developer velocity. In addition to stretching physical cluster resources further, vCluster provides each virtual cluster with its own API server and control plane, giving platform teams both more flexibility and more control over the management, security, resource allocation, and scaling of their Kubernetes clusters.
https://www.infoworld.com/article/3486255/cutting-kubernetes-costs-with-virtual-clusters.html
