Designing multi-agent systems for complex workflows

Multi-agent systems are a useful abstraction for genuinely complex workflows. They are also the easiest way to ship something nobody can debug.

By ForthClover Engineering · February 2026 · 11 min read · Architecture

Multi-agent systems are having a moment. Every framework on the market is shipping a router, an orchestrator, a manager, a crew. The demos look impressive — three agents passing notes to each other, accomplishing something a single agent "could not". The trouble starts the second one of those agents misbehaves and there is no obvious way to figure out which one it was.

Multi-agent architectures are a useful abstraction. They are also the easiest way to ship a system that nobody on your team can debug at 2 AM. This post is the rules of thumb we use to decide when multiple agents are actually warranted, how to structure the ones we keep, and what we put in place so the system stays observable when it inevitably misbehaves.

First: do you actually need multiple agents?

The honest answer most of the time is no. A single agent with a well-designed tool palette will outperform a multi-agent system on the same task in most workflows we've worked on. One model, one context window, one trace to read when something breaks. Before splitting into agents, the bar we use is that at least one of the following has to be true:

  • The work is genuinely parallel. Three independent sub-tasks with no dependencies between them benefit from running concurrently. Sequential sub-tasks pretending to be agents are just one agent in a costume.
  • The roles need different system prompts that actively conflict. A reviewer that has to push back on a writer cannot share the writer's "be helpful" framing. Splitting them is genuine isolation, not architectural decoration.
  • The tools each role needs are large and non-overlapping. A model with thirty tools in its palette will pick the wrong one. Splitting the palette across role-scoped agents reduces the choice surface.
  • You need different model tiers per role. The planner needs a strong reasoning model; the worker does mechanical extraction on a small model. This is one of the few cases where multi-agent architectures pay for themselves on cost alone.

If none of the above apply, our default recommendation is to stay single-agent and revisit the question after a real production incident has told you what is actually missing.

The three patterns we end up using

When the bar above is cleared, almost every multi-agent system we've shipped reduces to one of three patterns. There are others — auctions, debates, recursive self-organizing crews — but they tend to be research demos. These are the ones that survive a quarter in production.

[Figure: Three patterns. The multi-agent topologies we actually use; most production multi-agent systems we ship reduce to one of these three, and the others tend to be research demos. Pipeline (linear chain: Researcher -> Drafter -> Reviewer -> Publisher). Manager & Workers (parallel decomposition: a Planner farms work out to Search, Code, and Data workers, the workers report back, and the planner synthesizes). Peer + Blackboard (a shared workspace read and written by a Critic, a Writer, and an Editor).]

1. Pipeline

Agents run in a fixed sequence. Each one's output becomes the next one's input. Routing is implicit — there is no orchestrator deciding who runs next; the topology is hard-coded. This is the boring, reliable pattern, and it is the pattern we reach for first.

Researcher  ->  Drafter  ->  Reviewer  ->  Publisher

Use it when the workflow is deterministic, the steps are well-defined, and you want the trace to read top-to-bottom. The cost is rigidity: you cannot easily skip steps or rerun a single stage on a different input without code changes.
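In code, a pipeline reduces to a fold over an ordered list of agent callables. A minimal sketch, assuming each agent is a plain function from a payload dict to a payload dict; the stage functions here are illustrative stand-ins, not a real framework API:

```python
from typing import Callable

Payload = dict
Agent = Callable[[Payload], Payload]

def run_pipeline(stages: list[Agent], payload: Payload) -> Payload:
    # Each stage's output becomes the next stage's input; the topology
    # is just the list order, so the trace reads top-to-bottom.
    for stage in stages:
        payload = stage(payload)
    return payload

# Stand-in stages; real ones would call a model.
def researcher(p): return {**p, "sources": ["..."]}
def drafter(p):    return {**p, "draft": f"Draft on {p['topic']}"}
def reviewer(p):   return {**p, "approved": True}

result = run_pipeline([researcher, drafter, reviewer], {"topic": "Q3 costs"})
```

The rigidity is visible in the sketch: skipping a stage or rerunning one in isolation means editing the list, which is exactly the trade the pattern makes for a readable trace.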

2. Manager-and-workers

A planning agent decomposes the request into sub-tasks and farms them out to specialist workers, then synthesizes the results. The manager is the only agent with the full picture; the workers see only the slice they need.

Planner
  |
  +-- Data agent
  +-- Search agent
  +-- Code-execution agent
  |
Synthesizer (often the planner again)

Use it when the work is genuinely parallelizable and the decomposition is non-trivial. The risk is that the planner makes bad decompositions and you spend the rest of the run trying to fix them. Cap the number of sub-tasks per request and refuse to recurse — a planner that is allowed to call another planner will eventually do so, and the trace becomes unreadable.
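The cap-and-refuse-to-recurse advice can be enforced mechanically. A minimal sketch with stand-in planner and worker functions (a real planner would call a reasoning model; `MAX_SUBTASKS` and the function names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

MAX_SUBTASKS = 5  # hard cap on the planner's decomposition

def plan(request: str) -> list[str]:
    # Stand-in planner: a real one would call a strong reasoning model.
    subtasks = [f"{request}: part {i}" for i in range(3)]
    if len(subtasks) > MAX_SUBTASKS:
        # Refuse bad decompositions up front rather than fixing them mid-run.
        raise ValueError("planner produced too many sub-tasks")
    return subtasks

def worker(subtask: str) -> dict:
    # Workers see only their slice, never the full request.
    return {"subtask": subtask, "result": "..."}

def run(request: str) -> dict:
    subtasks = plan(request)
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(worker, subtasks))
    # The planner, acting as synthesizer, is the only agent with the full picture.
    return {"request": request, "results": results}
```

Note there is no path by which a worker can call `plan` again; keeping recursion structurally impossible is cheaper than detecting it at runtime.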

3. Peer-with-blackboard

Several agents share a structured workspace and contribute to it asynchronously. Each agent reads the workspace, decides whether it has something to add, and writes back. A termination condition (vote, max-rounds, supervisor) ends the run.

This is the pattern most people picture when they hear "multi-agent". It is also the easiest way to ship a system you cannot debug. Reserve it for workflows that genuinely benefit from iterative refinement — collaborative writing, design critique, complex search — and instrument it aggressively. We have walked into more than one engagement where this pattern was being used because it was interesting, not because it was needed.
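The termination condition is the part people forget, so it is worth seeing in code. A minimal sketch with a max-rounds cap plus quiescence detection; the `writer` and `critic` functions are illustrative stand-ins for model-backed agents:

```python
MAX_ROUNDS = 4  # hard termination condition for the shared workspace

def run_blackboard(agents: dict, workspace: dict) -> dict:
    # Each round, every agent reads the workspace and may write an update back.
    # Without the round cap, two disagreeing agents can loop forever.
    for round_no in range(MAX_ROUNDS):
        changed = False
        for name, agent in agents.items():
            update = agent(workspace)
            if update:
                workspace.update(update)
                workspace.setdefault("log", []).append((round_no, name))
                changed = True
        if not changed:
            break  # quiescence: nobody had anything left to add
    return workspace

def writer(ws):
    return {"draft": "v1"} if "draft" not in ws else None

def critic(ws):
    return {"approved": True} if "draft" in ws and "approved" not in ws else None

final = run_blackboard({"writer": writer, "critic": critic}, {})
```

The `log` entries double as cheap instrumentation: the run's history of who wrote what, in which round, is exactly the trace you need when the workspace ends up in a state nobody expected.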

Make the message contract real

The single largest source of bugs we've seen in multi-agent systems is informal message passing. One agent emits a string; the next agent has to guess what shape it is in. Two weeks later one of them changes its prompt and the receiver starts silently misinterpreting half the inputs.

The fix is mundane: define a typed contract for every inter-agent message and validate it on both ends. We use plain JSON schemas with required fields, and we treat a schema violation as a hard failure rather than something to quietly recover from.

# Researcher -> Drafter
{
  "topic": "Q3 cost outlook",
  "sources": [
    { "title": "...", "url": "...", "snippet": "..." }
  ],
  "key_findings": ["..."],
  "confidence": 0.0      # 0.0..1.0, set by researcher
}

The schema does two useful things. It tells you exactly which agent failed when something breaks. And it gives you the ability to swap any one agent for a different model — or even a deterministic non-LLM implementation — without touching the rest of the pipeline.
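Validation does not need a framework. A minimal hand-rolled sketch using only the stdlib; a real system might reach for jsonschema or pydantic instead, and the field names here follow the researcher-to-drafter message above:

```python
# Required fields and types for the researcher -> drafter contract.
REQUIRED = {"topic": str, "sources": list, "key_findings": list, "confidence": float}

def validate(msg: dict, sender: str, receiver: str) -> dict:
    for field, ftype in REQUIRED.items():
        if field not in msg:
            # Hard failure that names the boundary, so the trace
            # tells you exactly which agent broke the contract.
            raise ValueError(f"{sender} -> {receiver}: missing field {field!r}")
        if not isinstance(msg[field], ftype):
            raise ValueError(f"{sender} -> {receiver}: {field!r} is not {ftype.__name__}")
    if not 0.0 <= msg["confidence"] <= 1.0:
        raise ValueError(f"{sender} -> {receiver}: confidence out of range")
    return msg
```

Calling `validate` on both the sending and receiving side costs microseconds and converts "silently misinterpreting half the inputs" into a loud failure at the exact boundary that changed.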

Observability is not optional

A single-agent run produces one trace. A multi-agent run produces a tree of traces, and if you cannot read the tree, you cannot debug the system. Before we ship, we make sure every run captures:

  • A correlation ID that flows through every agent call, every tool invocation, every model request, and every queue hop.
  • The full message payload at every agent boundary. Not a summary. The actual JSON the agent received and the actual JSON it produced.
  • Per-agent token and cost accounting so a runaway worker shows up as a spike on its own line, not buried in the aggregate.
  • A timeline view that lets a human read the run in order, with collapsible sub-trees for each agent. LangSmith and Langfuse do this well; rolling your own with OpenTelemetry spans is also fine if you have the time.

Without these, the only debugging tool you have is staring at the model's final answer and reasoning backward. That is not a debugging tool.
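The correlation-ID requirement is the one most often botched, because the ID has to flow through calls that never mention it. A minimal sketch using a `contextvars.ContextVar` so the ID travels implicitly; `start_run` and `traced` are illustrative helper names, not a real library API:

```python
import contextvars
import json
import logging
import uuid

# One correlation ID per run, carried implicitly through every call in the run.
run_id: contextvars.ContextVar[str] = contextvars.ContextVar("run_id")
log = logging.getLogger("agents")

def start_run() -> str:
    rid = uuid.uuid4().hex
    run_id.set(rid)
    return rid

def traced(agent_name: str, agent_fn, payload: dict) -> dict:
    # Log the full payload at the boundary, not a summary: the actual
    # JSON the agent received and the actual JSON it produced.
    log.info("%s %s in  %s", run_id.get(), agent_name, json.dumps(payload))
    out = agent_fn(payload)
    log.info("%s %s out %s", run_id.get(), agent_name, json.dumps(out))
    return out
```

In production you would emit these as OpenTelemetry spans rather than log lines, but the invariant is the same: every record a run produces is findable from one ID.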

Failure modes to plan for from day one

Every multi-agent system we've shipped has hit at least three of these in the first quarter. Designing for them up front is much cheaper than retrofitting them after the incident.

  • Infinite handoffs. Agent A calls agent B, B calls A, A calls B again. Cap the number of handoffs per request and terminate when you hit it.
  • Cascading hallucinations. Agent A invents a fact in its summary; agent B treats the summary as ground truth; the invented fact ends up in the user-facing answer with high confidence. The fix is to keep the source material in the message contract, not just the summary.
  • Confidence laundering. Each agent shaves a little uncertainty off its predecessor's output. By the time the synthesizer sees it, a 0.6-confidence draft has become a 0.95-confidence final answer. Carry confidence scores through the pipeline and refuse to inflate them.
  • Partial failure. One worker times out; the manager produces a synthesis as if the data were complete. Make missing data explicit in the contract and force the synthesizer to handle it.
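Two of these guards fit in a few lines each. A minimal sketch of a handoff cap and a never-inflate confidence rule; the agent functions and `MAX_HANDOFFS` value are illustrative:

```python
MAX_HANDOFFS = 6  # ceiling on A -> B -> A ping-pong per request

def run_with_handoffs(entry, payload: dict) -> dict:
    # Each agent returns (payload, next_agent); None means the run is done.
    agent, hops = entry, 0
    while agent is not None:
        if hops >= MAX_HANDOFFS:
            raise RuntimeError(f"handoff cap hit after {hops} hops")
        payload, agent = agent(payload)
        hops += 1
    return payload

def combined_confidence(scores: list[float]) -> float:
    # Carry the weakest link forward: a synthesizer may lower confidence,
    # but never raise it above any of its inputs.
    return min(scores)

# Stand-in agents forming a terminating chain.
def draft_agent(p):  return ({**p, "draft": "v1"}, review_agent)
def review_agent(p): return ({**p, "approved": True}, None)

final = run_with_handoffs(draft_agent, {})
```

The confidence rule is deliberately crude; the point is that the aggregate is computed by code you control, not by whichever model happens to write the final summary.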

Frameworks: useful, but not load-bearing

LangGraph, AutoGen, and CrewAI all do reasonable jobs as scaffolding for the patterns above. Pick whichever your team can read most fluently and which integrates with the observability stack you already have. None of them solve the hard parts — message contracts, observability, failure modes — for you. Those are still your problem regardless of which framework you pick, and they are where almost all of the production failures we've seen come from.

The right way to think about a multi-agent framework is the way you would think about a web framework. It saves you some plumbing. It does not design your system. The system you actually have is the one you can debug, not the one you drew on the whiteboard.