2-agent architecture: Separating context from execution in AI systems

Monday, September 15, 2025, 11:01 AM, from InfoWorld
When I first started experimenting with voice AI agents for real-world tasks like restaurant reservations and customer service calls, I quickly ran into a fundamental problem. My initial monolithic agent was trying to do everything at once: understand complex customer requests, research restaurant availability, handle real-time phone conversations and adapt to unexpected responses from human staff. The result was an AI that performed poorly at everything.

After days of experimentation with my voice AI prototype — which handles booking dinner reservations — I discovered that the most robust and scalable approach employs two specialized agents working in concert: a context agent and an execution agent. This architectural pattern fundamentally changes how we think about AI task automation by separating concerns and optimizing each component for its specific role.

The problem with monolithic AI agents

My early attempts at building voice AI used a single agent that tried to handle everything. When a user wanted to book a restaurant reservation, this monolithic agent had to simultaneously analyze the request (“book a table for four at a restaurant with vegan options”), formulate a conversation strategy and then execute a real-time phone call with dynamic human staff.

This created two critical challenges that I experienced firsthand:

Missing context during live calls. The most painful problem was when new information surfaced during phone conversations that my agent wasn’t prepared for. A restaurant staff member would ask, “Do you have any allergies we should know about?” and my agent would freeze because it didn’t know the user’s dietary restrictions unless the user happened to be listening and could supply that information in real time. I watched calls fail repeatedly because the agent couldn’t access crucial user preferences when humans asked unexpected but reasonable questions.

Conflicting processing speeds. Voice agents need to provide real-time responses during phone calls to feel natural in conversation. But gathering comprehensive context, analyzing user preferences and executing tasks with updated information takes significant processing time. The agent couldn’t simultaneously do deep context analysis and maintain the sub-two-second response times required for natural phone conversations.

The 2-agent architecture pattern

After rebuilding my system, I developed what I call the two-agent architecture. This approach creates specialized agents with distinct responsibilities that mirror how humans actually handle complex tasks.

Context agent: The strategic planner

The context agent operates like a research analyst, taking time to thoroughly understand the situation before any action occurs. In my restaurant reservation system, this agent performs deep analysis through a multi-stage pipeline.

The context agent engages in a natural conversation with the user to gather comprehensive information before any phone calls are made. Here’s how this typically unfolds:

Initial request gathering. When a user says, “I want to book dinner tonight,” the context agent asks clarifying questions: “How many people will be dining? What type of cuisine are you in the mood for? Any dietary restrictions I should know about? What time works best for you?”

Preference refinement. As the conversation develops, the agent digs deeper. If the user mentions “something healthy,” it might ask, “Are you looking for high-carb options, or do you prefer high-protein dishes? Any specific cuisines you’re avoiding?” This back-and-forth continues until the agent has a complete picture.

Research and validation. Using web search and other MCP tools, the context agent researches local restaurants that match the criteria, checks their current availability and reviews their menus for dietary accommodations. It might come back to the user with: “I found three restaurants with excellent vegan options. Would you prefer Thai or Italian cuisine?”

Strategy formulation. Once the agent determines it has sufficient context — knowing the party size, cuisine preference, dietary restrictions, preferred time, backup times and even backup restaurant options — it creates a detailed execution plan for the phone call.

The key insight is that this entire context-gathering conversation happens before any restaurant is called, ensuring the execution agent has everything it needs for a successful phone interaction.
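
To make this concrete, here is a simplified Python sketch of the kind of plan the context agent hands to the execution agent. The field names and values are illustrative rather than my exact schema; the point is that every detail the call might require is resolved before dialing.

```python
from dataclasses import dataclass, field


@dataclass
class RestaurantOption:
    """One researched restaurant, with the details the execution agent may need."""
    name: str
    phone: str
    notes: str = ""  # e.g. "vegan menu confirmed online"


@dataclass
class ExecutionPlan:
    """Everything gathered up front, so the execution agent never has to guess."""
    party_size: int
    cuisine: str
    dietary_restrictions: list[str]
    preferred_time: str
    backup_times: list[str]
    customer_name: str
    customer_phone: str
    primary: RestaurantOption
    backups: list[RestaurantOption] = field(default_factory=list)


# A fully resolved plan, ready to hand off (all values are made up for illustration).
plan = ExecutionPlan(
    party_size=4,
    cuisine="Thai",
    dietary_restrictions=["vegan"],
    preferred_time="7:00 PM",
    backup_times=["6:30 PM", "7:30 PM"],
    customer_name="Alex",
    customer_phone="555-0123",
    primary=RestaurantOption("Green Basil", "555-0101", "vegan menu confirmed"),
    backups=[RestaurantOption("Luna Trattoria", "555-0102")],
)
```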

Execution agent: The real-time performer

While the context agent thinks deeply, the execution agent handles the actual phone conversation. In my system, this agent receives the enriched context and immediately begins the call, making split-second decisions during the interaction.

I’ve watched this agent handle scenarios like:

Restaurant staff saying “We’re fully booked at 6pm” → immediately offering alternative times from the context plan.

Being asked “What’s your phone number?” → providing the customer’s number from the context.

Getting transferred to a manager → re-establishing rapport and context without missing a beat.

Discovering the restaurant doesn’t have good vegan options → politely ending the call and moving to the backup restaurant.

The key insight I learned is that real-time conversation requires a completely different type of intelligence than strategic planning. The execution agent needs to be fast, adaptive and focused solely on the immediate interaction.
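
As a simplified illustration of that turn-by-turn intelligence, the sketch below shows how each of those branches can be answered straight from the plan gathered earlier. The real agent routes these decisions through a conversation model rather than keyword checks; the function name and phrasing here are illustrative only.

```python
def next_reply(staff_utterance: str, plan) -> str:
    """Toy single-turn logic; `plan` is the ExecutionPlan from the earlier sketch.

    A production execution agent classifies intent with a conversation model.
    Keyword checks stand in for that here, to show that every branch is
    answered from pre-gathered context instead of pausing the live call.
    """
    text = staff_utterance.lower()

    if "fully booked" in text or "no tables" in text:
        return f"Understood. Would {' or '.join(plan.backup_times)} work instead?"

    if "phone number" in text:
        return f"Of course, it's {plan.customer_phone}."

    if "allergies" in text or "dietary" in text:
        return f"Yes, we'll need {', '.join(plan.dietary_restrictions)} options, please."

    if "don't have" in text or "can't accommodate" in text:
        # Politely end the call; the orchestrator then dials the first backup restaurant.
        return "No problem at all, thanks for checking. Have a good evening."

    # Default opener at the start of the call.
    return f"Hi, I'd like to book a table for {plan.party_size} at {plan.preferred_time}."
```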

Implementation patterns from the field

Through building and testing my voice AI system, I’ve identified two primary implementation patterns:

Sequential processing

This is the approach I use for complex scenarios. The context agent has a complete conversation with the user, gathers all necessary information, researches options using web search tools and creates a comprehensive execution plan. Only after this entire process is finished does the execution agent begin making phone calls. This ensures maximum context quality but takes more time upfront.
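
A minimal sketch of that sequential hand-off, with stand-in classes in place of the real agents, looks like this:

```python
from dataclasses import dataclass


@dataclass
class ReservationRequest:
    """The raw user ask, before the context agent has clarified anything."""
    text: str  # e.g. "book a table for four tonight, somewhere with vegan options"


class ContextAgent:
    """Stand-in for the real agent: clarifying dialogue plus web/MCP research."""

    def build_plan(self, request: ReservationRequest) -> dict:
        # Conversation -> research -> strategy formulation, all before any call.
        return {"party_size": 4, "preferred_time": "7:00 PM", "backup_times": ["6:30 PM"]}


class ExecutionAgent:
    """Stand-in for the real agent: drives the live phone call."""

    def place_call(self, plan: dict) -> bool:
        # Dial, converse turn by turn, fall back to backup options if needed.
        return True


def book_reservation(request: ReservationRequest) -> bool:
    # Sequential pattern: context work finishes completely before any call starts.
    plan = ContextAgent().build_plan(request)
    return ExecutionAgent().place_call(plan)
```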

Continuous collaboration

For long-running customer service calls, both agents work together throughout the interaction. The context agent provides ongoing analysis while the execution agent handles the conversation and provides real-time feedback about what’s working.
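
Conceptually, that looks like two coroutines connected by queues: the execution agent streams observations out, and the context agent streams updated guidance back in. The toy asyncio sketch below shows only the plumbing; the analysis and conversation logic are placeholders.

```python
import asyncio


async def execution_agent(feedback: asyncio.Queue, guidance: asyncio.Queue) -> None:
    """Handles the live call: pushes observations out, pulls fresh guidance in."""
    for turn in ["caller asks about allergies", "caller requests a supervisor"]:
        await feedback.put(turn)      # report what just happened on the call
        hint = await guidance.get()   # in a real system this must not block the conversation
        print(f"[execution] {turn} -> using guidance: {hint}")
    await feedback.put(None)          # signal the end of the call


async def context_agent(feedback: asyncio.Queue, guidance: asyncio.Queue) -> None:
    """Analyzes mid-call feedback and streams updated guidance back."""
    while (event := await feedback.get()) is not None:
        # A real agent would re-run research or consult stored preferences here.
        await guidance.put(f"advice for '{event}'")


async def main() -> None:
    feedback, guidance = asyncio.Queue(), asyncio.Queue()
    await asyncio.gather(
        execution_agent(feedback, guidance),
        context_agent(feedback, guidance),
    )


asyncio.run(main())
```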

Real-world benefits I’ve observed

The two-agent architecture has delivered measurable improvements in my voice AI system:

Specialized optimization. My context agent now uses a deliberate, accuracy-focused model configuration, while my execution agent uses a faster, conversation-optimized setup (see the configuration sketch after this list). This specialization improved both context quality and conversation naturalness.

Independent scaling. During peak dinner reservation hours, I can scale up execution agents to handle more simultaneous calls while maintaining fewer context agents for the research-heavy work.

Improved reliability. When my context agent fails to find restaurant information, the execution agent can still make the call and gather information directly. When the execution agent encounters an unexpected conversation flow, it doesn’t break the entire system.

Enhanced debugging. I can now easily identify whether failures stem from poor context analysis (wrong restaurant information) or execution problems (awkward conversation flow). This separation has dramatically reduced my debugging time.
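
To illustrate the specialized-optimization point, the split looks something like this in configuration terms. The model names and numbers below are placeholders, not my production settings:

```python
# Illustrative only: model names and values are placeholders, not production settings.
AGENT_CONFIG = {
    "context_agent": {
        "model": "large-reasoning-model",      # slower, deliberate, accuracy-focused
        "temperature": 0.2,                    # keep research and planning conservative
        "max_latency_seconds": 30,             # latency matters far less before the call
        "tools": ["web_search", "mcp_restaurant_lookup"],
    },
    "execution_agent": {
        "model": "fast-conversational-model",  # optimized for natural, low-latency turns
        "temperature": 0.7,                    # allow more natural phrasing on the phone
        "max_latency_seconds": 2,              # sub-two-second replies to feel human
        "tools": [],                           # everything it needs arrives in the plan
    },
}
```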

Monitoring what matters

I track different metrics for each agent to understand system performance:

For the context agent, I monitor processing time (how long context analysis takes), context quality scores (completeness of restaurant research) and strategy complexity (how detailed the execution plan is).

For the execution agent, I track conversation success rates, call duration and how often backup strategies are needed. This separation allows me to optimize each agent independently: improving context quality doesn’t affect conversation speed, and vice versa.
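
In code, that separation can be as simple as each agent reporting its own metrics object, so dashboards and alerts never mix the two. The fields below mirror the measures above; the names and sample values are illustrative:

```python
from dataclasses import dataclass


@dataclass
class ContextAgentMetrics:
    processing_time_s: float      # how long context analysis took
    context_quality_score: float  # completeness of the restaurant research (0 to 1)
    plan_steps: int               # how detailed the execution plan is


@dataclass
class ExecutionAgentMetrics:
    call_succeeded: bool          # did the reservation actually get booked?
    call_duration_s: float
    backups_used: int             # how often backup strategies were needed


# Logged separately, so each agent can be tuned without touching the other.
context_metrics = ContextAgentMetrics(42.0, 0.9, 6)
execution_metrics = ExecutionAgentMetrics(True, 95.0, 0)
```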

The path forward

The two-agent architecture represents a fundamental shift in how we design AI systems for complex, real-world tasks. I’ve learned that separating context analysis from execution creates systems that are more reliable, scalable and maintainable than traditional monolithic approaches.

The key to success lies in clearly defining the boundaries between context and execution, implementing robust communication protocols and optimizing each agent for its specific role. When done correctly, the result is an AI system that combines thoughtful analysis with responsive execution, much like how humans naturally approach complex tasks.

For any developer building AI systems that need to handle real-world complexity, I recommend starting with this architectural pattern. The separation of concerns will save you countless hours of debugging and create a foundation that scales as your use cases grow.

This article is published as part of the Foundry Expert Contributor Network.
https://www.infoworld.com/article/4056424/2-agent-architecture-separating-context-from-execution-in-...
