The wrong mental model
Most people who work with AI still think of it as an assistant — something you talk to, that responds, that augments a human completing a task. This framing made sense when the only thing language models could do was generate text in response to a prompt. It no longer makes sense, and holding onto it is costing businesses real money.
The right mental model is an organisation you design. An organisation in which every role is filled by an agent that specialises in that role — researcher, planner, executor, validator, critic — and which operates under an orchestrator that assigns tasks, monitors outputs, routes feedback, and decides when the work is done. The humans who built it set the goals. The agents execute.
This distinction matters for one non-obvious reason: when you design an agent system as an organisation rather than an assistant, the complexity of problems it can solve changes qualitatively. A single model asked to do research, plan a response, write the content, and then check its own work has to context-switch between fundamentally different cognitive modes. A specialist agent for each role doesn't context-switch. It has a single job and does that job well, and its output becomes the input for the next specialist.
The architectural insight: agents improve each other. A critic reviewing a coder's output catches errors the coder would not catch alone. A planner decomposing a task before an executor attempts it cuts wasted cycles, because the executor no longer has to discover the shape of the problem by trial and error.
Orchestration patterns that work
There are three dominant orchestration patterns in production multi-agent systems today, and choosing between them correctly is the primary architectural decision you'll make when building an agent workflow.
Sequential pipeline
The simplest pattern. Agent A produces output that becomes the input for Agent B, which produces output for Agent C, and so on. No branching, no parallelism, no feedback loops. This is appropriate for well-defined tasks with clear stages where each stage depends strictly on the previous one — content production being the canonical example: research → outline → draft → edit → publish.
The limitation is obvious: if Agent B fails or produces low-quality output, the error propagates downstream. Without a validation step at each stage, a single weak link in the middle of the chain produces garbage at the end of it.
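A minimal sketch of a sequential pipeline with a validation check after every stage. The stage functions here are hypothetical lambdas standing in for model-backed agents, and the names `run_pipeline` and `StageFailed` are illustrative rather than taken from any framework.

```python
from typing import Callable, List, Tuple

# A stage pairs a name, a transform, and a validator for that transform's output.
Stage = Tuple[str, Callable[[str], str], Callable[[str], bool]]


class StageFailed(Exception):
    """Raised when a stage's output fails its validator."""


def run_pipeline(stages: List[Stage], initial_input: str) -> str:
    data = initial_input
    for name, transform, is_valid in stages:
        data = transform(data)
        if not is_valid(data):
            # Stop here so a bad intermediate result never reaches later stages.
            raise StageFailed(f"stage '{name}' produced invalid output")
    return data


# Hypothetical stand-ins for research, outline, and draft agents.
stages: List[Stage] = [
    ("research", lambda topic: f"notes on {topic}", lambda out: len(out) > 0),
    ("outline", lambda notes: f"outline from {notes}", lambda out: out.startswith("outline")),
    ("draft", lambda outline: f"draft from {outline}", lambda out: "draft" in out),
]

result = run_pipeline(stages, "agent orchestration")
print(result)
```

Failing fast at the weak stage is one possible design choice. A real pipeline might retry the stage or hand off to a human instead of raising.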
Master/worker with feedback
The more powerful pattern for complex tasks. A master orchestrator breaks work into subtasks, dispatches them to specialist workers in parallel or sequence, collects results, evaluates them against success criteria, and either accepts the output or routes it back to the relevant worker with specific feedback. This loop continues until the master is satisfied.
The master/worker pattern with feedback is what makes agents genuinely robust. The master doesn't just collect outputs — it judges them. An agent that can evaluate its own workers' outputs and route corrections is an agent that improves on each iteration without human intervention. The human doesn't need to be in the loop to fix errors; the orchestrator handles that.
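The accept-or-route-back loop can be sketched as follows. This is an illustration under assumptions: the worker and the evaluator are hypothetical plain functions standing in for model calls, and the `max_rounds` cap is an addition of this sketch so the loop always terminates.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Review:
    accepted: bool
    feedback: str = ""


def run_with_feedback(
    worker: Callable[[str, Optional[str]], str],
    evaluate: Callable[[str], Review],
    task: str,
    max_rounds: int = 3,
) -> str:
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        output = worker(task, feedback)
        review = evaluate(output)
        if review.accepted:
            return output
        feedback = review.feedback  # route specific feedback back to the worker
    raise RuntimeError(f"no acceptable output after {max_rounds} rounds")


# Hypothetical worker: omits the summary until the master asks for it.
def draft_worker(task: str, feedback: Optional[str]) -> str:
    text = f"Report on {task}."
    if feedback and "summary" in feedback:
        text += " Summary: done."
    return text


# Hypothetical success criterion applied by the master.
def master_evaluate(output: str) -> Review:
    if "Summary:" in output:
        return Review(accepted=True)
    return Review(accepted=False, feedback="add a summary section")


final = run_with_feedback(draft_worker, master_evaluate, "Q3 latency")
print(final)
```

The first round is rejected with specific feedback and the second round is accepted, so the correction happens without a human in the loop.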
Swarm / emergent coordination
The most experimental pattern, and the one with the highest ceiling for complexity. Individual agents operate with minimal centralised coordination, following simple local rules and sharing state through a common memory or message bus. Emergent coordination arises from the interaction of these simple local decisions. This is appropriate for highly exploratory tasks — research problems, creative generation — where the solution space is poorly defined and rigid orchestration would constrain the search.
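As a toy illustration of coordination through shared state, the sketch below uses a plain dictionary as the common memory. Each hypothetical agent follows one local rule and nothing schedules them centrally. The agent names and the `run_swarm` helper are assumptions made for this example.

```python
from typing import Callable, Dict, List

Blackboard = Dict[str, str]
# Each agent inspects the shared state and returns True if it wrote something.
Agent = Callable[[Blackboard], bool]


def idea_agent(board: Blackboard) -> bool:
    if "idea" not in board:
        board["idea"] = "use caching"
        return True
    return False


def critique_agent(board: Blackboard) -> bool:
    if "idea" in board and "critique" not in board:
        board["critique"] = f"risk with '{board['idea']}': stale data"
        return True
    return False


def refine_agent(board: Blackboard) -> bool:
    if "critique" in board and "refined" not in board:
        board["refined"] = board["idea"] + " with a short TTL"
        return True
    return False


def run_swarm(agents: List[Agent], board: Blackboard, max_ticks: int = 10) -> Blackboard:
    for _ in range(max_ticks):
        # Build the full list first so every agent gets a turn on each tick.
        acted = [agent(board) for agent in agents]
        if not any(acted):
            break  # quiescent: no agent has anything left to contribute
    return board


# The agents are listed in the "wrong" order on purpose: the sequence of work
# emerges from the state of the board, not from the order of the list.
board = run_swarm([refine_agent, critique_agent, idea_agent], {})
print(board["refined"])
```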
The risk is obvious: emergent systems are hard to debug and harder to predict. For most business automation use cases, master/worker with feedback is the right choice.
Tool use, memory, and planning
What distinguishes a capable agent from a language model producing text is its ability to take actions in the world via tool calls, to maintain state across turns, and to plan sequences of actions rather than reacting turn by turn. These three capabilities — tool use, memory, and planning — are what make the difference between an impressive demo and a working production system.
Tool use
Modern agent frameworks expose tools as structured function signatures that a model can call: search the web, execute code, read a file, write to a database, call an API, send a message. The model reasons about which tool to call, constructs the appropriate input, receives the output, and continues its reasoning. A well-designed toolkit is effectively the agent's interface to reality.
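A minimal sketch of what exposing a tool as a structured signature and dispatching a call can look like. The registry, the `register_tool` decorator, and the JSON call shape are assumptions made for illustration. Real frameworks each define their own formats.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical registry: tool name -> description, parameter types, callable.
TOOLS: Dict[str, Dict[str, Any]] = {}


def register_tool(name: str, description: str, parameters: Dict[str, str]):
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        TOOLS[name] = {"description": description, "parameters": parameters, "fn": fn}
        return fn
    return decorator


@register_tool("add", "Add two integers.", {"a": "integer", "b": "integer"})
def add(a: int, b: int) -> int:
    return a + b


def dispatch(tool_call_json: str) -> str:
    """Run a tool call that the model emitted as JSON and return a JSON reply."""
    call = json.loads(tool_call_json)
    tool = TOOLS.get(call["name"])
    if tool is None:
        return json.dumps({"error": f"unknown tool {call['name']}"})
    args = call.get("arguments", {})
    missing = [p for p in tool["parameters"] if p not in args]
    if missing:
        return json.dumps({"error": f"missing arguments: {missing}"})
    return json.dumps({"result": tool["fn"](**args)})


reply = dispatch('{"name": "add", "arguments": {"a": 2, "b": 3}}')
print(reply)
```

Returning errors as data rather than raising lets the model read the failure and try a corrected call on its next step.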
The Model Context Protocol (MCP) has emerged as a widely adopted open standard for describing tools and exposing them to models. A tool published this way carries a name, a description, and an input schema, so several agents can share one set of tool definitions instead of each reimplementing them. A researcher agent's tools (web search, academic databases, document parsing) can be exposed to an orchestrator that dispatches research subtasks without knowing the implementation details of how the researcher gets its information.
Memory
Agents need multiple types of memory. In-context memory is the conversation history — what has happened in the current session. Episodic memory is structured recall of past runs — what worked, what failed, what the user preferred last time. Semantic memory is the distilled knowledge base — facts, documentation, institutional knowledge that persists across all sessions. A well-architected agent uses all three: it reasons in context, learns from episodes, and retrieves from its semantic store when it needs information that isn't in the immediate context window.
The practical implementation is typically a combination of a vector database for semantic retrieval, a structured database for episodic logs, and the model's context window for in-session reasoning. The orchestrator manages what gets loaded into context and what gets retrieved from external stores.
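The three stores can be sketched in a few lines. This is an illustration only: plain Python lists stand in for the vector database and the structured episodic log, and word-overlap scoring stands in for embedding similarity. The class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AgentMemory:
    context: List[str] = field(default_factory=list)    # in-session turns
    episodes: List[Dict[str, str]] = field(default_factory=list)  # past-run logs
    knowledge: List[str] = field(default_factory=list)  # persistent facts

    def remember_turn(self, text: str) -> None:
        self.context.append(text)

    def log_episode(self, task: str, outcome: str) -> None:
        self.episodes.append({"task": task, "outcome": outcome})

    def retrieve(self, query: str, k: int = 1) -> List[str]:
        # Word overlap is a stand-in for vector similarity search.
        q = set(query.lower().split())
        scored = [(len(q & set(doc.lower().split())), doc) for doc in self.knowledge]
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
        return [doc for _, doc in ranked[:k]]

    def build_prompt(self, query: str, max_turns: int = 2) -> str:
        # Decide what gets loaded into context: retrieved facts, recent turns, query.
        parts = self.retrieve(query) + self.context[-max_turns:] + [query]
        return "\n".join(parts)


mem = AgentMemory(knowledge=["refunds are processed within 5 days",
                             "api keys rotate monthly"])
mem.remember_turn("user: hi")
mem.log_episode("refund request", "resolved")
prompt = mem.build_prompt("how long do refunds take")
print(prompt)
```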
Planning
ReAct (Reasoning + Acting) is the foundational pattern: the agent alternates between explicit reasoning steps (thinking about what to do next) and action steps (doing it), with observation steps (reading the result) in between. This loop continues until the agent decides the task is complete or it has exhausted its approach.
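A compact sketch of the reason, act, observe loop. The policy here is a hypothetical scripted function standing in for a model call, `calc` is a toy tool, and the `max_steps` cap is an assumption of this sketch that represents the agent exhausting its approach.

```python
from typing import Callable, Dict, List, Tuple

# A step is (thought, action_name, action_input); the action "finish" ends the loop.
Step = Tuple[str, str, str]
Policy = Callable[[str, List[str]], Step]

OBS_PREFIX = "Observation: "


def react(task: str, policy: Policy, tools: Dict[str, Callable[[str], str]],
          max_steps: int = 5) -> Tuple[str, List[str]]:
    trace: List[str] = []
    for _ in range(max_steps):
        thought, action, arg = policy(task, trace)   # reasoning step
        trace.append(f"Thought: {thought}")
        if action == "finish":
            return arg, trace
        observation = tools[action](arg)             # action step
        trace.append(f"Action: {action}[{arg}]")
        trace.append(f"{OBS_PREFIX}{observation}")   # observation step
    return "gave up", trace


# Hypothetical scripted policy: call the calculator once, then finish.
def scripted_policy(task: str, trace: List[str]) -> Step:
    observations = [line for line in trace if line.startswith(OBS_PREFIX)]
    if not observations:
        return ("I need the numeric value.", "calc", "6*7")
    value = observations[-1][len(OBS_PREFIX):]
    return ("I have the value.", "finish", value)


def calc(expr: str) -> str:
    left, right = expr.split("*")
    return str(int(left) * int(right))


answer, trace = react("What is 6 times 7?", scripted_policy, {"calc": calc})
print(answer)
```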
More sophisticated planning involves decomposition before execution: the agent breaks a complex goal into a tree of subtasks, assigns each subtask to the appropriate specialist, and plans the dependency graph before taking any action. This is analogous to a project manager producing a Gantt chart before the team starts work — it front-loads reasoning cost to eliminate downstream execution cost.
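Planning the dependency graph before acting can be sketched with the standard library's `graphlib` module (Python 3.9 or later). The subtask names in `plan` are hypothetical. The point is that the order of execution, and which subtasks can run in parallel, falls out of the declared dependencies.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical plan: each subtask maps to the subtasks it depends on.
plan: Dict[str, Set[str]] = {
    "research": set(),
    "outline": {"research"},
    "draft": {"outline"},
    "gather_images": {"research"},
    "publish": {"draft", "gather_images"},
}


def execution_batches(graph: Dict[str, Set[str]]) -> List[List[str]]:
    """Group subtasks into batches; everything in a batch can run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the plan contains a cycle
    batches: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # subtasks whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = execution_batches(plan)
for i, batch in enumerate(batches, start=1):
    print(i, batch)
```

In this plan, `outline` and `gather_images` land in the same batch because both depend only on `research`.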
Incident response: a real case
The numbers that get cited most often in the enterprise AI agent literature are the incident response benchmarks, because they're concrete and because the gap is stark. In a documented deployment at a mid-size SaaS company, the baseline for resolving a P2 production incident (service degradation, non-critical) was 30 minutes of human engineer time at a fully-loaded cost of approximately $15 per incident, once you account for the engineer's time, the cost of the degraded service, and the on-call overhead.
After deploying an incident response agent pipeline, the same class of incidents resolved in an average of 28 seconds at a compute cost of under $0.80 per incident. The agent pipeline ran the following sequence automatically on alert trigger: query the monitoring system for correlated metrics, search the runbook database for matching symptom patterns, identify the most probable root cause from the top three runbook matches, attempt the recommended mitigation, verify the mitigation by checking the relevant metrics, and either close the incident or escalate to a human with a structured report if mitigation failed after two attempts.
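The sequence above can be sketched as a small control loop. Everything concrete here is an assumption made for illustration: the `Runbook` shape, the symptom matching by exact string, the simulated queue metric, and the choice to try the next-best runbook on the second attempt. The source does not describe the deployment's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Runbook:
    name: str
    symptom: str
    mitigate: Callable[[], None]


@dataclass
class Outcome:
    status: str          # "closed" or "escalated"
    attempts: List[str]  # runbook names tried, in order


def handle_alert(symptoms: List[str], runbooks: List[Runbook],
                 is_healthy: Callable[[], bool], max_attempts: int = 2) -> Outcome:
    # Match runbooks against observed symptoms and keep the top three.
    candidates = [rb for rb in runbooks if rb.symptom in symptoms][:3]
    attempts: List[str] = []
    for rb in candidates[:max_attempts]:
        rb.mitigate()
        attempts.append(rb.name)
        if is_healthy():  # verify by re-checking the relevant metric
            return Outcome("closed", attempts)
    # Mitigation failed: hand a structured report to a human.
    return Outcome("escalated", attempts)


# Simulated environment: a queue metric that only one mitigation actually fixes.
state = {"queue_depth": 900}


def restart_workers() -> None:
    state["queue_depth"] = 900  # has no effect in this simulation


def scale_consumers() -> None:
    state["queue_depth"] = 10


runbooks = [
    Runbook("restart-workers", "queue_backlog", restart_workers),
    Runbook("scale-consumers", "queue_backlog", scale_consumers),
]
outcome = handle_alert(["queue_backlog"], runbooks, lambda: state["queue_depth"] < 100)
print(outcome)
```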
The human engineer was only ever involved when the agent escalated — which happened in approximately 12% of cases. The other 88% resolved without human intervention. The net result was not that the engineers were out of a job; it was that they stopped spending their working hours triaging routine incidents and started spending them on the 12% of incidents that were genuinely novel.
This is the pattern across almost every documented production deployment: the agent handles the routine cases, humans handle the exceptions. The ratio of routine to exceptional varies by domain, but in most operational contexts it's 80/20 or better. The 80% that agents handle represents the work that was consuming the majority of experienced engineers' time.
Customer service at scale
Customer service is the application domain with the most documented production deployments, for obvious reasons: it has the highest volume, the most repetitive query patterns, and the most legible success metrics (resolution rate, time to resolution, customer satisfaction score).
The agent architecture for customer service is typically a triage orchestrator that classifies incoming queries, routes them to specialist agents (billing questions to a billing agent, technical issues to a technical agent, returns to a returns agent), and monitors resolution quality. Each specialist agent has access to the relevant tools: billing agents can query the billing system and issue credits; technical agents can access documentation and known-issue databases; returns agents can check order history and initiate refunds.
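A minimal sketch of the classify-then-route step. The keyword classifier is a deliberately crude stand-in for a model-based one, and the specialist handlers are hypothetical functions that only describe what a real agent would do with its tools.

```python
from typing import Callable, Dict, Tuple

# Hypothetical keyword lists; a production triage agent would use a model here.
KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "billing": ("invoice", "charge", "credit"),
    "technical": ("error", "crash", "bug"),
    "returns": ("refund", "return"),
}


def classify(query: str) -> str:
    text = query.lower()
    for category, words in KEYWORDS.items():
        if any(word in text for word in words):
            return category
    return "human"  # no confident match: hand off to a person


# Hypothetical specialists, each of which would hold its own tools.
SPECIALISTS: Dict[str, Callable[[str], str]] = {
    "billing": lambda q: "billing agent: querying the billing system",
    "technical": lambda q: "technical agent: searching the known-issue database",
    "returns": lambda q: "returns agent: checking order history",
}


def triage(query: str) -> str:
    handler = SPECIALISTS.get(classify(query))
    if handler is None:
        return "escalated to human support"
    return handler(query)


for q in ["I was charged twice this month",
          "The app shows an error on login",
          "I want to return my order",
          "Hello there"]:
    print(triage(q))
```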
The sophisticated versions of this architecture include a quality-control agent that reviews completed interactions for compliance with policy, accuracy of information provided, and tone — and routes interactions for human review when any dimension falls below threshold. The QC agent doesn't just flag problems; it produces structured feedback that the specialist agents can learn from in subsequent runs.
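The threshold check and structured feedback can be sketched as below. The three dimensions come from the text, but the 0 to 1 scoring scale and the threshold values are assumptions chosen for the example.

```python
from typing import Dict, List, TypedDict

# Hypothetical minimum scores per dimension, on a 0.0 to 1.0 scale.
THRESHOLDS: Dict[str, float] = {"policy": 0.9, "accuracy": 0.9, "tone": 0.7}


class QCResult(TypedDict):
    needs_human_review: bool
    feedback: List[str]


def review_interaction(scores: Dict[str, float]) -> QCResult:
    feedback: List[str] = []
    for dimension, minimum in THRESHOLDS.items():
        score = scores.get(dimension, 0.0)  # a missing score counts as failing
        if score < minimum:
            feedback.append(f"{dimension} score {score:.2f} is below threshold {minimum:.2f}")
    # Any failing dimension routes the interaction to a human reviewer.
    return {"needs_human_review": bool(feedback), "feedback": feedback}


qc = review_interaction({"policy": 0.95, "accuracy": 0.80, "tone": 0.90})
print(qc)
```

The feedback strings are kept structured per dimension so that a specialist agent could be shown exactly which dimension fell short.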
What this architecture enables at scale is personalisation that was previously impossible. A human agent handling 40 conversations per day can maintain some context on returning customers. An agent system handling 40,000 conversations per day can retrieve complete interaction history, preference data, and product usage data for every interaction — and use that context to provide more relevant responses than any human could.
Autonomous code review
Code review is an interesting case because the bottleneck is particularly expensive: senior engineers are the scarcest resource in most software organisations, and having them review junior engineers' code has always felt necessary but inefficient.
The agent architecture for autonomous code review runs multiple specialist review agents in parallel on each pull request: a correctness agent that checks for logic errors and edge cases, a security agent that looks for common vulnerability patterns, a style agent that enforces conventions, a test-coverage agent that identifies untested code paths, and a documentation agent that checks that changes are appropriately documented. An orchestrator aggregates their findings, deduplicates overlapping comments, prioritises by severity, and produces a structured review that surfaces the highest-priority issues first.
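The orchestrator's aggregation step can be sketched as follows. This covers only deduplication and prioritisation, not the parallel execution of the review agents. The `Finding` fields, the three severity levels, and the rule that two comments on the same file and line count as overlapping are all assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Lower rank means higher priority.
SEVERITY_RANK: Dict[str, int] = {"critical": 0, "major": 1, "minor": 2}


@dataclass(frozen=True)
class Finding:
    agent: str
    file: str
    line: int
    severity: str
    message: str


def aggregate(findings: List[Finding]) -> List[Finding]:
    # Treat findings on the same file and line as overlapping; keep the most severe.
    best: Dict[Tuple[str, int], Finding] = {}
    for f in findings:
        key = (f.file, f.line)
        current = best.get(key)
        if current is None or SEVERITY_RANK[f.severity] < SEVERITY_RANK[current.severity]:
            best[key] = f
    # Surface the highest-priority issues first, with a stable tie-break.
    return sorted(best.values(), key=lambda f: (SEVERITY_RANK[f.severity], f.file, f.line))


findings = [
    Finding("style", "app.py", 10, "minor", "line too long"),
    Finding("security", "app.py", 10, "critical", "SQL built by string concatenation"),
    Finding("coverage", "util.py", 3, "major", "branch not covered by tests"),
]
review = aggregate(findings)
for f in review:
    print(f.severity, f.file, f.line, f.message)
```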
The results in production deployments show these systems catching approximately 70-80% of the issues that senior engineers would have caught in a manual review, with a false positive rate low enough that developers don't tune them out. The remaining 20-30% of issues — the subtle architectural decisions, the questions of whether this approach is the right one conceptually — still require human judgment. But the routine correctness, security, and style feedback that was consuming senior engineering time is largely automated.
Building the self-managing business
The conceptual ceiling for multi-agent systems is the self-managing business — a business in which the operational layer runs autonomously, humans set goals and handle true exceptions, and agents handle everything in between. This is not a distant possibility. The components exist today.
A content business, for example, can run the following loop autonomously: a research agent identifies trending topics in the domain, a planner agent generates an editorial calendar, a writer agent produces draft content for each scheduled piece, an editor agent refines and fact-checks, a publisher agent schedules and deploys via the CMS, an analytics agent monitors performance, and a feedback agent reports results to the planner to inform the next cycle. The humans who built this system set the editorial direction. The agents execute the operation.
The same structure applies to a software-as-a-service business: monitor customer usage, identify churn signals, draft personalised outreach, handle inbound support, triage bug reports, generate release notes, manage the publish pipeline. Each of these is a multi-agent workflow. They can run in parallel and coordinate through shared state. The business runs while the humans sleep.
The critical architectural requirement is the feedback loop. An autonomous business that cannot evaluate its own performance and adjust its behaviour will drift. The quality-control and analytics layers are not optional enhancements — they are what converts an automation into an intelligent system.
Where to start
The practical starting point for most organisations is not the fully autonomous business. It's identifying the highest-volume, most repetitive workflow in the organisation and building a master/worker pipeline to automate it. The criteria for a good first target are: the workflow has clear inputs and outputs, success is measurable, the cost of a failure is recoverable (not catastrophic), and the volume is high enough that the automation pays for itself quickly.
Incident response, customer service triage, content production, and code review are all strong first targets because they meet all four criteria. They're also domains with mature tooling and documented production deployments to learn from.
The coding environment matters. Cursor brings AI-native editing to developers who are building agent systems, treating the model as a first-class collaborator that understands the entire codebase. Claude (via the API) provides the reasoning and generation capability at the core of most agent architectures. These are the tools that serious builds run on.
Build the first pipeline. Measure it. Understand where it fails and why. Then extend. The architecture compounds — each additional specialist agent makes the system more capable, and the feedback loops get more sophisticated as you accumulate data on what works and what doesn't. The autonomous workforce is built incrementally, not deployed all at once.