
Building reliable AI agents: evaluation, guardrails, and the loop that won't end

Agents fail in characteristic ways. Here are the failure modes worth designing for — and the patterns that turn an unreliable agent into something you can deploy.

4 min read · agents / evaluation / production

Agents are seductive because they look like they’re doing what a human would do — until they aren’t. Then they hallucinate a tool call, retry the same broken action eleven times, or quietly succeed at the wrong task and report victory.

The path to a reliable agent isn’t a smarter model. It’s understanding the small list of ways agents fail, and designing around each one.

Failure mode 1: the loop that won’t end

An agent decides it needs more information. It calls a tool. The tool returns nothing useful. The agent decides it needs more information. It calls a tool again. Twenty minutes later, you have a $30 OpenAI bill and a confused trace.

What to design in:

  • A hard cap on total steps per run.
  • A cap on identical or near-identical tool calls (most agents will repeat themselves before they give up).
  • A “give up gracefully” branch — when the loop budget runs out, the agent must produce a structured “I couldn’t” response, not silently die.

The hard cap is non-negotiable. Without it, every other reliability feature is theoretical.
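
What the budget looks like in code is small. A minimal sketch, assuming a generic loop in which call_llm proposes the next action as a dict and call_tool executes it; both names, and the return shapes, are placeholders rather than any particular framework's API:

    import json
    from collections import Counter

    MAX_STEPS = 15            # hard cap on total steps per run
    MAX_REPEAT_CALLS = 3      # cap on near-identical tool calls

    def run_agent(task, call_llm, call_tool):
        """Drive the loop with a step budget, repeat detection, and a graceful-fail path."""
        history = []
        seen_calls = Counter()

        for _ in range(MAX_STEPS):
            action = call_llm(task, history)   # assumed to return a dict describing the next step

            if action["type"] == "final_answer":
                return {"status": "ok", "answer": action["content"]}

            # Fingerprint the call so near-identical retries are counted together.
            fingerprint = (action["tool"], json.dumps(action["args"], sort_keys=True))
            seen_calls[fingerprint] += 1
            if seen_calls[fingerprint] > MAX_REPEAT_CALLS:
                break   # stuck in a loop; stop burning tokens

            result = call_tool(action["tool"], action["args"])
            history.append({"action": action, "result": result})

        # Give up gracefully: a structured "I couldn't", never a silent death.
        return {
            "status": "gave_up",
            "reason": "step or repeat-call budget exhausted",
            "steps_taken": len(history),
        }

Fingerprinting on the tool name plus the serialised arguments is what catches the near-identical retries; an exact string match on the raw call tends to miss them.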

Failure mode 2: confident wrong answers

An agent finishes its run and produces a clean, articulate answer that happens to be wrong. The articulation makes humans trust it. This is the most expensive failure mode in production because nobody catches it.

What to design in:

  • A self-check step before final output: “given the tool outputs you actually saw, does your answer follow?”
  • Faithfulness evaluation in your offline eval set — not just “is the answer right,” but “is the answer supported by what the agent actually retrieved or computed?”
  • Refusals as a first-class output. An agent that can say “I couldn’t verify this” is more useful than one that always answers.
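
The self-check can be one extra model call over the material the agent actually gathered. A sketch, with call_llm standing in for your model and the prompt wording purely illustrative:

    def self_check(question, tool_outputs, draft_answer, call_llm):
        """Verify the draft answer against the tool outputs the agent actually saw."""
        verdict = call_llm(
            "Given only the tool outputs below, is the draft answer fully supported?\n"
            "Reply SUPPORTED or UNSUPPORTED, then one sentence of reasoning.\n\n"
            f"Question: {question}\n"
            f"Tool outputs: {tool_outputs}\n"
            f"Draft answer: {draft_answer}"
        )
        if verdict.strip().upper().startswith("SUPPORTED"):
            return {"status": "ok", "answer": draft_answer}
        # Refusal as a first-class output, with the reason attached.
        return {"status": "unverified", "answer": None, "reason": verdict}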

Failure mode 3: tool-call drift

The agent invents a tool that doesn’t exist. Or calls a real tool with arguments in the wrong shape. Or stringifies a number when the tool wants an integer.

What to design in:

  • Strict schema validation on every tool call. If validation fails, the agent gets the error back as a tool result and can self-correct — don’t crash the run.
  • A small list of well-named tools, with clear docstrings. Forty tools is too many; the model will get confused. If you’re past ten, think about composing or grouping.
  • A “no available tool” path. The agent should know when not to call anything.
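
For the validation itself, a sketch using pydantic; the tool name lookup_order and its fields are invented for illustration, and the tools dict maps names to callables:

    from pydantic import BaseModel, ValidationError

    # One schema per tool. A stringified "42" gets coerced or rejected here,
    # not passed through to a tool expecting an integer.
    class LookupOrderArgs(BaseModel):
        order_id: int
        include_history: bool = False

    TOOL_SCHEMAS = {"lookup_order": LookupOrderArgs}

    def execute_tool_call(name, raw_args, tools):
        """Validate before executing; return errors as tool results, don't crash the run."""
        if name not in tools:
            # The "no available tool" path: tell the agent instead of letting it guess.
            return {"error": f"unknown tool '{name}'; available tools: {sorted(tools)}"}
        try:
            args = TOOL_SCHEMAS[name].model_validate(raw_args)
        except ValidationError as e:
            # Fed back to the agent so it can self-correct on the next step.
            return {"error": f"invalid arguments for '{name}': {e.errors()}"}
        return tools[name](**args.model_dump())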

Failure mode 4: context bloat

By step 15, the agent’s context window contains every tool call, every tool response, the original instructions, the system prompt, and now it can’t think clearly because it’s drowning in its own history.

What to design in:

  • Summarisation of older tool results once the context gets long. Keep the structured outcome, drop the verbose payload.
  • A scratchpad/notes mechanism: let the agent write a short running summary and refer to that instead of the full history.
  • A separate “planner” step that operates on the summary, not the raw context.
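
A rough sketch of the compaction, keeping the last few exchanges verbatim and cutting older ones down to their outcome; the thresholds and the optional summarisation call are placeholders to tune:

    MAX_RECENT = 4            # keep the last few tool exchanges verbatim
    MAX_PAYLOAD_CHARS = 200   # older results are cut down to a short outcome

    def compact_history(history, call_llm=None):
        """Summarise or truncate older tool results; keep recent ones intact."""
        older, recent = history[:-MAX_RECENT], history[-MAX_RECENT:]

        compacted = []
        for entry in older:
            result = str(entry["result"])
            if len(result) > MAX_PAYLOAD_CHARS:
                result = result[:MAX_PAYLOAD_CHARS] + "...[truncated]"
            compacted.append({"action": entry["action"], "result": result})

        # Optional scratchpad: a short running summary the planner can read
        # instead of the raw transcript.
        scratchpad = None
        if call_llm is not None and older:
            scratchpad = call_llm(
                "Summarise, in five bullet points, what has been tried and learned so far:\n"
                f"{compacted}"
            )
        return compacted + recent, scratchpad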

Failure mode 5: silent success at the wrong task

The agent achieved something that looks like the goal but isn’t the goal. It summarised the wrong document. It updated the wrong record. It answered an adjacent question, not the one asked.

What to design in:

  • An explicit goal-restatement step at the start of the run. The agent should write out “what I’m being asked to do” — and you can evaluate whether that restatement is correct, separately from whether the final answer is correct.
  • Acceptance criteria as structured input, not just prose in the prompt. Make “did this satisfy criteria X, Y, Z?” a check the agent must pass before it can terminate.
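
One way to make that termination check concrete, with call_llm as a placeholder and the criteria supplied alongside the task rather than buried in the prompt:

    def can_terminate(restated_goal, acceptance_criteria, draft_answer, call_llm):
        """Check every acceptance criterion before the agent is allowed to finish."""
        unmet = []
        for criterion in acceptance_criteria:
            verdict = call_llm(
                f"Goal: {restated_goal}\n"
                f"Criterion: {criterion}\n"
                f"Answer: {draft_answer}\n"
                "Does the answer satisfy this criterion? Reply YES or NO."
            )
            if not verdict.strip().upper().startswith("YES"):
                unmet.append(criterion)
        return {"can_finish": not unmet, "unmet_criteria": unmet}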

Evaluation, not just one input

You can’t grade an agent by running one example and reading the output. Agent behaviour is high-variance: the same prompt run twice can produce different paths.

A real evaluation rig for an agent has, at minimum:

  • A scenario set: tasks the agent should handle, including ones it should refuse.
  • Multiple runs per scenario: catch variance. If 8/10 succeed and 2/10 hallucinate, that’s a different release decision than 10/10 succeed.
  • Stepwise grading: was the right tool called? Were arguments correct? Was the final answer faithful to the steps?
  • A regression baseline: yesterday’s success rate, so a change that drops you from 92% to 78% gets caught before it ships.

LangSmith, LangFuse, and LangWatch all have machinery for this. The choice of tool matters less than committing to the rig at all.
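
A tool-agnostic sketch of that minimum, with run_agent and grade standing in for your own runner and grader; the grader is where the stepwise checks live, since it sees the full trace rather than just the final answer:

    RUNS_PER_SCENARIO = 10
    BASELINE_SUCCESS_RATE = 0.92   # yesterday's number, i.e. the regression gate

    def evaluate(scenarios, run_agent, grade):
        """Run every scenario several times and compare the aggregate to the baseline."""
        per_scenario = {}
        for scenario in scenarios:
            passes = sum(
                grade(scenario, run_agent(scenario["task"]))   # grade returns True/False
                for _ in range(RUNS_PER_SCENARIO)
            )
            per_scenario[scenario["name"]] = passes / RUNS_PER_SCENARIO

        overall = sum(per_scenario.values()) / len(per_scenario)
        return {
            "per_scenario": per_scenario,
            "overall": overall,
            "regressed": overall < BASELINE_SUCCESS_RATE,
        }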

Guardrails are not a model

You cannot prompt an agent into being safe. Guardrails are systems-level:

  • Input filters for prompt injection, prohibited topics, or oversized payloads.
  • Tool-level permissions — most agents should not be able to call most tools. Scope aggressively.
  • Output filters that strip or refuse responses containing PII, secrets, or harmful content.
  • Human-in-the-loop checkpoints for any action with real-world side effects. An agent that can send email or write to a production database should ask before it does.

A safe agent is one where the system can’t do unsafe things, not one that’s been politely asked not to.
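
A sketch of what that looks like as a layer the agent cannot talk its way around; the agent and tool names are invented, and ask_human is whatever approval mechanism you already have:

    # Which tools each agent may call, and which calls need a human sign-off.
    ALLOWED_TOOLS = {"support_agent": {"lookup_order", "search_docs", "send_email"}}
    NEEDS_APPROVAL = {"send_email", "write_db"}   # anything with real-world side effects

    def guarded_tool_call(agent_name, tool_name, args, tools, ask_human):
        """Enforce tool scoping and human approval at the system level, not in the prompt."""
        if tool_name not in ALLOWED_TOOLS.get(agent_name, set()):
            return {"error": f"'{agent_name}' is not permitted to call '{tool_name}'"}
        if tool_name in NEEDS_APPROVAL and not ask_human(agent_name, tool_name, args):
            return {"error": f"human reviewer declined the '{tool_name}' call"}
        return tools[tool_name](**args)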

The shortest version

A reliable agent has:

  1. A step budget and a graceful-fail path.
  2. Strict tool schemas and a small, well-named tool set.
  3. A self-check before final answer.
  4. An evaluation rig you actually run.
  5. Guardrails outside the prompt, not inside it.

The model gets smarter every quarter. The system around the model is what makes it trustworthy.

