Databricks adds customizable evaluation tools to boost AI agent accuracy
Thursday, November 6, 2025, 12:28 PM, from InfoWorld
Databricks is expanding the evaluation capabilities of its Agent Bricks interface with three new features that are expected to help enterprises improve the accuracy and reliability of AI agents.
Agent Bricks, released in beta in June, is a generative AI-driven automated interface that streamlines agent development for enterprises. It combines technologies developed by MosaicML, including TAO, the synthetic data generation API, and the Mosaic Agent platform.

The new features, Agent-as-a-Judge, Tunable Judges, and Judge Builder, enhance Agent Bricks’ automated evaluation system with more flexibility and customization, Craig Wiley, senior director of product management at Databricks, told InfoWorld.

Agent Bricks’ automated evaluation system can generate evaluation benchmarks via an LLM judge based on the defined agent task or workflow, often using synthetic data, to assess agent performance as part of its auto-optimization loop. However, it didn’t give developers an automated way to dig through an agent’s execution trace and find the relevant steps without writing code.

One of the new features, Agent-as-a-Judge, adds that capability, saving developers time and complexity while surfacing insights from an agent’s trace that can make evaluations more accurate.

“It’s a new capability that makes those automated evaluations even smarter and more adaptable — adding intelligence that can automatically identify which parts of an agent’s trace to evaluate, removing the need for developers to write or maintain complex traversal logic,” Wiley said.

Derek Ashmore, agentic AI enablement principal at AI and data consultancy Asperitas Consulting, also sees Agent-as-a-Judge as a more flexible and explainable way to assess AI agent accuracy than the automated scoring that originally shipped with Agent Bricks.

Tunable Judges for agents with domain expertise

Another new feature, Tunable Judges, is designed to give enterprises the flexibility to tune LLM judges for agents with domain expertise, a growing requirement in enterprise production environments.

“Enterprises value domain experts’ input to ensure accurate evaluations that reflect unique contexts, business needs, or compliance standards,” said Robert Kramer, principal analyst at Moor Insights & Strategy. “When Agent Bricks was initially introduced, many enterprises welcomed the ability to automate the evaluation and assessment of agents based on quality. As these agents transitioned from prototypes to a more demanding production environment, the limitations of generic evaluation logic became evident,” Kramer added.

Tunable Judges grew out of customer feedback, specifically on capturing subject matter expertise accurately and letting enterprises define what “correctness” means for their agents, Wiley said.

Tunable Judges could be used, for example, to ensure that clinical summaries in healthcare don’t omit contraindications, to enforce compliant language in portfolio recommendations, or to evaluate tone, de-escalation accuracy, and policy adherence in customer support.

Enterprises can also use the new “make_judge” SDK introduced in MLflow 3.4.0 to create custom LLM judges by defining tailored evaluation criteria in natural language within Python code and running evaluations against them, as in the sketch below.
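The following is a minimal sketch of that workflow, based on the make_judge API introduced in MLflow 3.4.0. The judge name, evaluation criteria, model URI, and sample data are hypothetical illustrations mirroring the healthcare example above, not taken from Databricks’ documentation.

    # Minimal sketch: a custom LLM judge via MLflow 3.4's make_judge API.
    # The judge name, criteria, model URI, and sample data below are
    # hypothetical, chosen to mirror the healthcare example in the article.
    from mlflow.genai.judges import make_judge

    # "Correctness" is defined in plain natural language. {{ inputs }} and
    # {{ outputs }} are template variables filled in at evaluation time;
    # instructions may instead reference {{ trace }} so the judge inspects
    # the agent's full execution trace (the Agent-as-a-Judge pattern).
    contraindication_judge = make_judge(
        name="contraindication_check",
        instructions=(
            "Given the patient record in {{ inputs }} and the generated "
            "clinical summary in {{ outputs }}, answer 'pass' if every "
            "contraindication in the record is reflected in the summary; "
            "otherwise answer 'fail' and explain what was omitted."
        ),
        model="openai:/gpt-4o-mini",  # any supported judge-model URI
    )

    # Score a single agent response; the returned Feedback object carries
    # the judge's verdict and its reasoning.
    feedback = contraindication_judge(
        inputs={"patient_record": "On warfarin; allergic to penicillin."},
        outputs="Patient is on warfarin. Avoid penicillin-class antibiotics.",
    )
    print(feedback.value, feedback.rationale)

Per the MLflow documentation, a judge built this way can also be passed as a scorer to MLflow’s batch evaluation APIs to grade a full dataset of agent responses rather than one output at a time.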
Easing the complexity of agent evaluation

Enterprises also have the option of using Judge Builder, a new visual interface within the Databricks workspace, to create and tune LLM judges with domain knowledge from subject matter experts, and to take advantage of the Agent-as-a-Judge capability.

Judge Builder, according to Kramer, is Databricks’ effort to set itself apart from rivals such as Snowflake, Salesforce, and ServiceNow, which also offer agent evaluation features, by making agent evaluation less complex and more customizable.

“Snowflake’s agent tools use frameworks to check quality, but they don’t let you tune checks with business-specific feedback or domain rules in the same way Databricks does,” Kramer said. Snowflake already offers AI observability and Cortex Agents, including “LLM-as-a-judge” evaluations, but these focus on measuring accuracy and performance rather than interpreting an agent’s full execution trace.

Comparing Databricks’ new agent evaluation tools with those of Salesforce and ServiceNow, Kramer said both vendors mostly focus on automating workflows and outcomes without deep, tunable agent judgment options. “If you need really tailored compliance or want business experts involved in agent quality, Databricks has the edge. For more basic automations, these differences probably matter less,” Kramer added.
https://www.infoworld.com/article/4085696/databricks-adds-customizable-evaluation-tools-to-boost-ai-...