How generative AI could aid Kubernetes operations

Monday, December 23, 2024, 10:00 AM, from InfoWorld

Enterprises often encounter friction using Kubernetes (K8s for short) at scale to orchestrate large swaths of containers, not to mention an escalating number of clusters. And due to its complexity, diagnosing issues within Kubernetes isn’t all that easy. At the same time, IT is looking at AI to automate aspects of configuring, managing, and debugging complex back-end technologies.

“Trying to solve IT problems with AI is nothing new,” says Itiel Schwartz, co-founder and CTO at Komodor. “It typically overpromises and underdelivers.” Yet, although he was skeptical at first, he now sees promise in using fine-tuned generative AI models to reduce barriers and streamline Kubernetes operations.

Fine-tuned AI for root cause analysis

Accuracy in AI models hinges on their training data sets. And today’s popular large language models (LLMs), like OpenAI’s GPT, Meta’s Llama, Anthropic’s Claude, or Google’s Gemini, are trained on vast corpora of text data. While this works for general-purpose use, they often produce irrelevant recommendations for ultra-specific devops functions, says Schwartz.

Instead of using catch-all models, Schwartz believes narrow models are better for diagnosing Kubernetes issues. They can help avoid AI hallucinations or errors by following a more authoritative, controlled process, such as fetching a single piece of highly relevant data: logs, metrics, or related changes.

One such tool is Komodor’s KlaudiaAI, an AI agent narrowly trained on historical investigations into Kubernetes operational issues. KlaudiaAI is trained for root cause analysis and excels at identifying an issue, sourcing relevant logs, and offering specific remediation steps. For example, when an engineer encounters a crashed pod, KlaudiaAI might correlate this to an API rate limit found in the logs and suggest setting a new rate limit.
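To make the crashed-pod example concrete, here is a minimal Python sketch of the general idea of tying a crash to a rate-limit signal in logs. It is not how KlaudiaAI works internally, which the article does not describe; the function name, the regular expression, and the sample log lines are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical pattern: treats an HTTP 429 or the phrase "rate limit"
# as a sign of API throttling. Real logs vary widely.
RATE_LIMIT_PATTERN = re.compile(r"\b429\b|rate limit", re.IGNORECASE)


def suggest_cause(pod_status, log_lines):
    """Return a possible cause for a crash-looping pod, with log evidence.

    pod_status is a status string such as "CrashLoopBackOff";
    log_lines is a list of log lines already fetched for that pod.
    """
    if pod_status != "CrashLoopBackOff":
        return None  # nothing to investigate in this sketch
    hits = [line for line in log_lines if RATE_LIMIT_PATTERN.search(line)]
    if hits:
        return {"cause": "api_rate_limit", "evidence": hits}
    return {"cause": "unknown", "evidence": []}


# Made-up log lines, for illustration only.
sample_logs = [
    "2024-12-01T10:00:00Z INFO starting worker",
    "2024-12-01T10:00:02Z ERROR upstream returned HTTP 429 Too Many Requests",
    "2024-12-01T10:00:02Z FATAL rate limit exceeded, exiting",
]

result = suggest_cause("CrashLoopBackOff", sample_logs)
print(result["cause"], len(result["evidence"]))
```

A keyword match like this is far cruder than a trained model, but it shows the shape of the task: start from a symptom, pull one relevant data source, and attach the evidence to the suggested cause.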

Using AI to automate K8s management

Of course, Komodor isn’t the only company investigating the use of AI agents and automation to streamline Kubernetes management. K8sGPT, an open-source Cloud Native Computing Foundation (CNCF) sandbox project, uses Kubernetes-specific analyzers to diagnose cluster issues and respond with remediation advice in plain English. Robusta is a similar AI copilot designed for Kubernetes troubleshooting, such as incident resolution and alerts. Cast AI uses generative AI to auto-scale Kubernetes infrastructure to reduce operating expenses.

And if we look at the major cloud service providers, ChatOps is nothing new. For instance, Amazon offers AWS Chatbot, which can provide alerts and diagnostic information on Amazon Elastic Kubernetes Service workloads and configure resources based on chat commands. Amazon also has Amazon Q, an AI assistant with a variety of skills including building on the AWS cloud, though it’s not specifically geared toward K8s management.

Similarly, Google’s generative AI assistant, Gemini, serves as an all-purpose tool for Google Cloud, and is not specifically bred for remediating Kubernetes issues. However, Google Kubernetes Engine is optimized for training and running AI/ML workloads, and its GKE Autopilot can optimize the performance of infrastructure. A Kubernetes-focused AI assistant may not be far behind.

Other major cloud players are also looking to cash in on generative AI, notably in the monitoring and observability space. Last year, Datadog introduced Bits AI, a devops copilot designed for incident response across any data source Datadog touches. Bits AI can go deep to surface traces and logs and provide remediation advice for incident resolution.

Yet, the ongoing problem, Schwartz says, is that most of the AI models in the enterprise IT market still cast too wide a net with their training sets to be useful for the specific area of Kubernetes diagnosis. “If you use a generic AI model to investigate these issues, it will simply fail you. We tried it time after time,” Schwartz says. “As you narrow the scope down, the possibility of hallucinations goes down.”

That said, a heightened attention to detail can bring downsides. For example, Schwartz notes that Klaudia is often slower than other models (it may take 20 seconds to come up with an answer). This is because it prioritizes accuracy over speed, using an iterative investigation process that continues until the root cause is identified. The good news is that by incorporating more sanity checks, the model improves accuracy, he says.

Reducing barriers to K8s usability

Kubernetes is the undisputed infrastructure layer for modern IT. Impressively, 84% of respondents to CNCF’s 2023 annual survey said they were using or evaluating Kubernetes. And much headway has been made in optimizing Kubernetes for AI/ML workloads. “One of the biggest reasons for migrating to K8s is the ability to run more efficient ML,” says Schwartz.

Yet, security, complexity, and monitoring rank as the topmost challenges in using or deploying containers for heavily cloud-native organizations. According to PerfectScale, common issues, such as not setting memory limits, not properly allocating RAM for pods, or not setting CPU requests, threaten the reliability of Kubernetes. Now, the question is whether generative AI can help operators, like platform engineers or site reliability engineers, better interface with the platform.
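The misconfigurations listed above are simple enough to check mechanically. As a rough sketch, the Python function below scans a Pod manifest, represented as a plain dictionary, for containers missing a memory limit, a memory request, or a CPU request. The field names follow the standard Pod spec (spec.containers[].resources); the function name and the example manifest are hypothetical, and a real check would also need to cover init containers and namespace-level defaults.

```python
def find_resource_gaps(pod_manifest):
    """Return (container_name, missing_setting) pairs for a Pod manifest dict."""
    gaps = []
    containers = pod_manifest.get("spec", {}).get("containers", [])
    for container in containers:
        name = container.get("name", "<unnamed>")
        resources = container.get("resources") or {}
        requests = resources.get("requests") or {}
        limits = resources.get("limits") or {}
        if "memory" not in limits:
            gaps.append((name, "limits.memory"))
        if "memory" not in requests:
            gaps.append((name, "requests.memory"))
        if "cpu" not in requests:
            gaps.append((name, "requests.cpu"))
    return gaps


# Hypothetical manifest: "api" is fully specified, "sidecar" sets nothing.
example_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "demo"},
    "spec": {
        "containers": [
            {
                "name": "api",
                "image": "example/api:1.0",
                "resources": {
                    "requests": {"cpu": "250m", "memory": "128Mi"},
                    "limits": {"memory": "256Mi"},
                },
            },
            {"name": "sidecar", "image": "example/sidecar:1.0"},
        ]
    },
}

for container_name, missing in find_resource_gaps(example_pod):
    print(f"{container_name}: missing {missing}")
```

Rule-based checks like this catch the obvious gaps; the harder part, and where the article suggests AI might help, is connecting a gap to the symptoms an operator actually sees.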

“Gen AI is not really production grade in most companies,” says Schwartz, who acknowledges its limitations and notes that it tends to work best in human-in-the-loop scenarios. Nevertheless, he foresees AIOps soon becoming a helpful ally for addressing root causes, misconfigurations, and network issues and for guiding optimizations.

Kubernetes-specific, fine-tuned AIs could help operators more quickly diagnose problems, like failed deploys or failed jobs, and tie them to root causes when they arise. “Gen AI is going to take this toil and automate it,” Schwartz says.