Automating devops with Azure SRE Agent

Thursday June 5, 2025. 11:00 AM , from InfoWorld

Modern AI tools have been around for some time now, offering a wide variety of different services. Much of what’s been delivered so far has only touched the surface of what’s possible; voice recognition and transcription tools are probably the most useful.

The development of agentic AI has changed things. Instead of using AI to simply generate text, we can use a natural language user interface to use context to construct workflows and manage operations. Support for OpenAPI interfaces allows us to go from user intent to actual service calls, with newer technologies such as the Model Context Protocol (MCP) providing defined interfaces to applications.

Building agents into workflows

Agentic workflows don’t even need to be generated by humans; they can be triggered by events, providing completely automated operations that pop up as natural language in command lines and in tools like Teams. Microsoft’s Adaptive Cards, originally intended to surface elements of microwork inside Teams chats, are an ideal tool for triggering steps in a long workflow managed by an agent but putting a human inside the loop, responding to reports and triggering next steps.

Using agents to handle exceptions in business processes is a logical next step in the evolution of AI applications. We can avoid many of the issues that come with chatbots, as agents are grounded in a known state and use AI tools to orchestrate responses to a closed knowledge domain. A system prompt automatically escalates to a human when unexpected events occur.

Adding agents to site reliability engineering

One domain where this approach should be successful is in automating some of the lower-level functions of site reliability engineering (SRE), such as restarting servers, rotating certificates, managing configurations, and the like. Here we have a known ground state, a working stack composed of infrastructure, platforms, and applications. Microsoft describes this as agentic devops.

With modern cloud-native applications, much of that stack is encoded in configuration files, even infrastructure, using tools like Azure Resource Manager or Terraform templates, or languages like Bicep and Pulumi. At a higher level, APIs can be described using TypeSpec, with Kubernetes platforms described in YAML files.

All these configuration files, often stored as build artifacts in standard repositories, give us a desired state we can build on. In fact, if we’re in a Windows Server environment we can use PowerShell’s Desired State Configuration definitions as a ground state.

Once a system is operating, logs and other metrics can be collected and collated by Azure’s monitoring tools and stored in a Fabric data lake where analytics tools using Kusto Query Language can query that data and generate reports and tables that allow us to analyze what’s happening across what can be very complex, distributed systems. Together these features are the back ends for SRE dashboards, helping engineers pinpoint issues and quickly deploy fixes before they affect users.

Azure agent tools for the rest of us

With its existing devops tools, Microsoft has many of the necessary pieces to build a set of SRE agents. Its existing agent tools and expanding roster of MCP servers can be mixed with this data and these services to provide alerts and even basic remediation. So, it wasn’t surprising to hear that Azure is already using something similar internally and will shortly be launching a preview of a public SRE agent. Microsoft has a long history of taking internal tools and turning them into pieces of its public-facing platform. Much of Azure is built on top of the systems Microsoft needed to build and run its cloud applications.

Announced at Build 2025, Azure SRE Agent is designed to help you manage production services using reasoning large language models (LLMs) to determine root causes and suggest fixes based on logs and other system metrics. The underlying approach is more akin to traditional machine learning tools: looking for exceptions and then comparing the current state of a system with best practices and with desired state configurations to get your system back up and running as quickly as possible—even before any users have been impacted.

The aim of Azure SRE Agent is to reduce the load on site reliability engineers, administrators, and developers, allowing them to concentrate on larger tasks without being distracted from their current flow. It’s designed to run in the background, using normal operations to fine-tune the underlying model to fit applications and their supporting infrastructure.

The resulting context model can be queried at any time, using natural language, much like using the Azure MCP Server in Visual Studio Code. As the system is grounded in your Azure resources, results will be based on data and logs, much like using a specific retrieval-augmented generation (RAG) AI tool but without the complexity that comes with building a vector index in real time. Instead, services like Fabric’s data agent provide an API that manages queries for you. There’s even the option to use such tools to visualize data, using markup to draw graphs and charts.

Making this type of agent event-driven is important, as it can be tied to services like Azure’s Security Graph. By using current Azure security policies as a best practice, it’s able to compare current state with what it should be, informing users of issues and performing basic remediations in line with Azure recommendations. For example, it can update web servers to a new version of TLS, ensuring that your applications remain online.

Events can be sourced from Azure tools like Monitor, pulling alert details to drive an automated root-cause analysis. As the agent is designed to work with known Azure data sources, it’s able to use these to detect exceptions and then determine the possible cause, reporting back its conclusions to a duty site reliability engineer. This gives the engineer not only an alert but a place to start investigations and remediations.

There is even the option of handling basic fixes once they are approved by a site reliability engineer. The list of approved operations is sensibly small, including triggering scaling, restarting, and where appropriate, rolling back changes.

Recording what has been discovered and what has been done is important. The root-cause analysis, the problem discovery, and any fixes will be written to the application’s GitHub as an issue so further investigations can be carried out and longer-term changes made. This is good devops practice, as it brings the developer back into the site reliability discussion, ensuring that everyone involved is informed and that devops teams can make plans to keep this and any other problem from happening again.

The agent produces daily reports focused on incidents and their status, overall resource group health, a list of possible actions to improve performance and health, and suggestions for possible proactive maintenance.

Getting started with Azure SRE Agent

The Azure SRE Agent is currently a gated public preview (you need to sign up to get access). Microsoft has made much of the documentation public so you can see how it works. Working with SRE Agent appears to be relatively simple, starting with creating an agent from the Azure portal and assigning it to an account and a resource group. You’ll need to choose a region to run the agent from, currently only Sweden Central, though it can monitor resource groups in any region.

For now, interacting with the agent takes place in the Azure Portal. Most site reliability engineers working with Azure will have the portal open most of the time—it’s not really the best place to have chats. Hopefully Microsoft will take advantage of Teams and Adaptive Cards to bring it closer to other parts of their ecosystem. Delivering reports as Power BI dashboards would also help the SRE Agent fit into a typical service operations center.

Tools like this are an important evolution of AI and generative AI in particular. They’re not chatbots, though you can chat with them. Instead, they’re grounded in real-time data and they use reasoning approaches to extract context and meaning from data, using that context to construct a workflow based on best practices and current and desired system states. Building the agent around a human-in-the-loop model, where human approval is required before any actions are taken, may slow things down but will increase trust in the early stages of a new way of working.

It’ll be interesting to see how this agent evolves, as Microsoft rolls out deeper Azure integration through its MCP servers and as languages like TypeSpec give us a way to add context to OpenAPI descriptions. Deeply grounded AI applications should go a long way to deliver on the Copilot promise, providing tools that make users’ tasks easier and don’t interrupt them. They will also show us the type of AI applications we should be building as the underlying platforms and practices mature.