Most agent demos do not survive contact with production traffic. A team will assemble a clever loop of LLM calls, wire it into a couple of tools, and ship a video on Twitter. Two weeks later the same agent is silently looping on a recursive tool call, burning a thousand dollars a day, and returning answers that look fluent and are quietly wrong.
This post is the small set of patterns we put in place every time we move an agent from a working demo to something a customer can rely on. None of it is novel. All of it is the kind of plumbing the demos skip.
Treat the LLM like an unreliable network call
The single most useful mental shift is to stop thinking of the model call as a function and start thinking of it as a network dependency that may time out, return garbage, or rate-limit you under load. Once you accept that, most of the production patterns fall out naturally.
That means timeouts on every model call, not just the obvious ones. It means circuit breakers when error rates spike instead of letting a bad upstream knock your whole agent over. And it means treating retries as a first-class concern, not something you bolt on after the first incident.
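A minimal timeout wrapper using only the standard library, for SDKs that do not expose a deadline of their own (the 30-second default here is an assumption, not a recommendation):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

_executor = ThreadPoolExecutor(max_workers=8)

def with_timeout(func, *args, timeout=30.0):
    # Run the call on a worker thread and enforce a hard deadline,
    # even if the underlying SDK has no timeout parameter of its own.
    future = _executor.submit(func, *args)
    try:
        return future.result(timeout=timeout)
    except TimeoutError:
        future.cancel()  # best effort; the thread may still be running
        raise
```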
Circuit breakers
A circuit breaker is the cheapest insurance you will ever buy for an agent. Track failures over a rolling window; once you cross a threshold, refuse calls to the failing dependency for a cool-down period, then probe it gently before closing the circuit again. The shape we use most often:
```python
import time

class CircuitOpen(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # cool-down in seconds
        self.opened_at = None
        self.state = "CLOSED"  # or "OPEN" / "HALF_OPEN"

    def _cooldown_elapsed(self):
        return time.monotonic() - self.opened_at >= self.reset_after

    def _on_success(self):
        # Any success fully closes the breaker and clears the count.
        self.failure_count = 0
        self.state = "CLOSED"

    def _on_failure(self):
        self.failure_count += 1
        if self.failure_count >= self.failure_threshold or self.state == "HALF_OPEN":
            self.state = "OPEN"
            self.opened_at = time.monotonic()

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if self._cooldown_elapsed():
                self.state = "HALF_OPEN"  # let one probe through
            else:
                raise CircuitOpen("upstream is unhealthy")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result
```

The wins are not theoretical. The first time a model provider has a regional incident and your agent stays up because the breaker tripped at 5 errors instead of letting 50 retries pile up on the queue, the pattern has paid for itself permanently.
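Wiring it in is then one line at each call site. A usage sketch, where `call_model` is a hypothetical stand-in for your real SDK call:

```python
breaker = CircuitBreaker(failure_threshold=5, reset_after=60)

def call_model(prompt):
    ...  # hypothetical: your real SDK call goes here

def ask_model(prompt):
    # Any exception from call_model counts against the breaker; once it
    # trips, callers get CircuitOpen immediately instead of piling on.
    return breaker.call(call_model, prompt)
```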
Figure: the three states of a circuit breaker. A failing dependency moves the breaker from CLOSED to OPEN; after a cool-down, HALF_OPEN lets a single probe through, and a success closes the circuit again.
Retries with exponential backoff and jitter
Retry on transient failures only: never retry a 4xx (a 429 rate limit is the one exception), and never retry a content policy refusal. Always add jitter so a coordinated upstream hiccup does not produce a synchronized retry storm from your fleet. Three to five attempts is almost always enough; anything more is masking a real problem.
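A minimal sketch of the policy, assuming a `TransientError` exception type that your client raises for timeouts, 429s, and 5xx responses (a hypothetical name, not a real library's):

```python
import random
import time

class TransientError(Exception):
    pass  # hypothetical: raised for timeouts, 429s, and 5xx responses

def retry_with_backoff(func, attempts=4, base_delay=0.5, max_delay=8.0):
    for attempt in range(attempts):
        try:
            return func()
        except TransientError:
            if attempt == attempts - 1:
                raise  # budget exhausted; surface the real problem
            # Exponential backoff capped at max_delay, with full jitter
            # so a fleet of workers does not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))
```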
Stop the agent from looping forever
The single most common production failure of an agent is not a wrong answer; it is the agent that never finishes. A half-formed plan, a tool that returns inconsistent shapes, a model that decides to call the same function thirty times in a row — and now your queue is full and your bill is climbing.
Three guardrails matter; a sketch combining all three follows the list:
- A hard step budget. Cap the number of tool-call iterations per request. Twenty is generous for most workflows; fifty is a panic button. Past the budget, the agent should surrender gracefully: return what it has so far and log the trace as unfinished.
- A token budget. Track cumulative input and output tokens per request. If a single conversation crosses a threshold, terminate. This is the only reliable defence against runaway cost when a model gets stuck.
- A repetition detector. If the agent calls the same tool with the same arguments twice in a row, treat it as a loop. The model has run out of new ideas and will usually keep running out of new ideas.
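A hedged sketch of all three guardrails in one loop, assuming hypothetical helpers `run_step` (one plan/act iteration), `count_tokens`, and `surrender` (return partial results and log an unfinished trace):

```python
MAX_STEPS = 20        # hard step budget: generous for most workflows
MAX_TOKENS = 200_000  # per-request token budget (assumed threshold)

def run_agent(request):
    tokens_used = 0
    last_call = None
    for _ in range(MAX_STEPS):
        action = run_step(request)           # hypothetical: one tool-call iteration
        tokens_used += count_tokens(action)  # hypothetical: input + output tokens
        if tokens_used > MAX_TOKENS:
            return surrender(request, reason="token budget exceeded")
        if action.is_final:
            return action.answer
        call = (action.tool, action.args)
        if call == last_call:
            # Same tool, same arguments, twice in a row: treat it as a loop.
            return surrender(request, reason="repetition detected")
        last_call = call
    return surrender(request, reason="step budget exceeded")
```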
State you can recover from
Most agents we inherit have no real notion of state. The conversation lives in memory, the tool history lives in memory, and one container restart reduces a half-hour investigation to nothing. Worse, the user has no idea anything went wrong because the next request just starts fresh.
The fix is not glamorous. Persist the agent's state at every meaningful step — usually after each tool call — to the same store you would use for any other workflow engine. Postgres works. DynamoDB works. Redis with a real persistence policy works. The store does not matter. The discipline does.
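A minimal sketch of the discipline using Postgres via psycopg 3; the `checkpoints` table (with a unique key on `request_id, step`) and the shape of `state` are assumptions, not a prescribed schema:

```python
import json
import psycopg  # assumption: Postgres via psycopg 3

def checkpoint(conn, request_id, step, state):
    # Persist the full agent state after every tool call. The unique key
    # makes the write idempotent: a retried step overwrites, never duplicates.
    conn.execute(
        """
        INSERT INTO checkpoints (request_id, step, state)
        VALUES (%s, %s, %s)
        ON CONFLICT (request_id, step) DO UPDATE SET state = EXCLUDED.state
        """,
        (request_id, step, json.dumps(state)),
    )
    conn.commit()

def resume(conn, request_id):
    # A crashed worker reloads the latest checkpoint and continues the trace.
    row = conn.execute(
        "SELECT step, state FROM checkpoints"
        " WHERE request_id = %s ORDER BY step DESC LIMIT 1",
        (request_id,),
    ).fetchone()
    return (row[0], json.loads(row[1])) if row else (0, None)
```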
With state persisted, a crashed worker resumes the trace. A bad deploy can be rolled back. A long-running agent can be paused and continued. And, the underrated win, a support engineer can answer the question "what did the agent actually do?" three days after the fact.
Observability is the feature that ships the rest
The single highest-leverage thing you can build for an agent is the trace view. Every tool call, every model call, every input and output, ordered by time, queryable by user. Without that, your team is debugging a black box. With it, every incident becomes a clear narrative you can hand to a junior engineer.
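If you roll your own before adopting a tool, the core is just an append-only event log keyed by request. A JSONL sketch of the minimum record worth keeping; the field names are our choice, not any product's schema:

```python
import json
import time
import uuid

def record_event(log_file, request_id, kind, name, payload):
    # One line per model call or tool call, ordered by time,
    # queryable by request. Keep full inputs and outputs, not summaries.
    event = {
        "event_id": str(uuid.uuid4()),
        "request_id": request_id,
        "timestamp": time.time(),
        "kind": kind,        # "model_call" or "tool_call"
        "name": name,        # model name or tool name
        "payload": payload,  # full input and output
    }
    log_file.write(json.dumps(event) + "\n")
```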
We default to Langfuse because it's self-hosted-friendly and the data model maps cleanly to how agents actually work. LangSmith and Helicone are also reasonable choices. The right tool is the one your team will actually open at 2am during the first incident; any of them is dramatically better than nothing.
Graceful degradation beats clever recovery
When something fails — and something will fail — the agent should fall back to a smaller, simpler answer rather than throw. A search agent that cannot reach its primary retriever should fall back to a keyword search. A summariser that cannot reach its preferred model should fall back to a smaller, faster, cheaper one and tell the user the answer may be less polished.
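The pattern is small enough to read in one screen. A sketch for the search-agent example above, where `vector_search` and `keyword_search` are hypothetical retrievers:

```python
def search(query):
    # Try the primary retriever; on any failure, step down to the simpler
    # path and tell the caller what happened rather than throwing.
    try:
        return {"results": vector_search(query), "degraded": False}
    except Exception:
        return {
            "results": keyword_search(query),
            "degraded": True,
            "notice": "Primary search was unavailable; showing keyword matches.",
        }
```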
The user-visible bar is not "my agent never fails". The bar is "when my agent struggles, the user still gets something useful and knows what just happened".
The short list, if you only do five things
When we audit a team's agent stack, this is the checklist we walk. Anything missing from it is a real risk in production:
- Hard step budget and per-request token budget on every agent.
- Timeouts and a circuit breaker on every external call, including the model itself.
- Persistent state at each tool-call boundary, with idempotent handlers so retries do not double-charge anything.
- A trace view that lets you replay any production request end to end.
- A documented degradation path for every external dependency the agent uses.
None of this is exciting. All of it is what separates an agent that earns its keep from one that quietly costs you money.