You Can't Reproduce Your Agent's Bugs—That's Why You Can't Fix Them

#ai #agents #observability #testing

Here is a bug report I have received, in some form, at every company running agents in production:

"The agent gave a customer a wrong refund amount yesterday around 2pm. Can you look into it?"

Here is how that investigation goes when your stack isn't built for it: you find the timestamp, re-run the same prompt, and it works perfectly. Correct refund, every time. You change nothing; it keeps being right. Eventually you write "could not reproduce — will monitor," which is a professional way of saying you gave up.

This is the failure I think is most under-discussed in the whole agent space. Not hallucination, not drift, not cost. Irreproducibility. A bug you cannot reproduce is a bug you cannot fix, cannot test, and cannot prove you've fixed. Agents are, by nature, the most irreproducible software most of us have ever shipped.

The opinion I'll defend: reproducibility is the precondition for every other quality practice you claim to have. Your evals, regression tests, and CI gates all assume you can take a real failure and run it again on demand. If you can't, that apparatus is built on sand.

Why "re-run it" silently lies to you

For a normal backend bug, reproduction is mostly free: same request, same row, same code, same bug. We've built an instinct that says if I run it again with the same inputs, I'll see the same thing. For agents that instinct is wrong in four independent ways, each enough to break reproduction alone:

The model is nondeterministic. Temperature above zero means the same prompt yields different completions. The 2pm run took a reasoning path your replay never will, and the bug lived on that path.
You aren't replaying the same input. You re-ran the same template. But the prompt the model saw had interpolated context — a retrieved document, the user's account state, the date, a tool result — and that resolved input is gone.
The world moved. The agent called a tool that hit a database or an API. Today those return different values than yesterday. Even a deterministic model would behave differently, because its inputs changed.
Hidden state. The model version rolled silently behind a pinned name; a cache was warm then, cold now; a flag flipped. None of it is in your code; all of it changed the run.

Only the first is about the model's randomness. The other three are about inputs you failed to capture — which means the fix is mostly engineering discipline, not a model problem. You can't make production deterministic, but you can make a run replayable, and those are very different goals.

Replayable beats deterministic

The instinct is to chase determinism: temperature: 0, pin every version, freeze the world. That's a trap. A temperature-zero agent is often a worse agent, and you still haven't captured the tool outputs, so you still can't reproduce a past failure. Determinism is something you'd impose on all of production forever. Replayability is a property you attach to each run as it happens, and it's strictly more powerful: it reconstructs that specific failure no matter how nondeterministic production was.

To replay a run you must have captured, when it happened: the resolved input (the exact bytes the model saw after templating), every tool call's raw output (what the APIs returned then), and the execution parameters (model id and version, temperature, seed, system prompt).

This is the seam where the two tools I lean on operate as one unit, because reproduction needs both a record and a verdict. AgentLens captures the trace — every model and tool step, the resolved inputs, the raw outputs, the parameters — the raw material a replay is rebuilt from. agent-eval is the other half: it takes that captured run, re-executes it under pinned conditions, and scores whether the bug is present. AgentLens makes the failure replayable; agent-eval makes the replay a pass/fail test you can gate on. A trace with no scorer is an archive you read by hand; a scorer with no trace is grading a prompt you've already lost.

import { getTrace } from "agentlens";
import { evaluate, assert } from "agent-eval";

interface ReplayBundle {
  resolvedPrompt: string;                 // exact bytes the model saw at 2pm
  params: { model: string; temperature: number; seed?: number };
  toolReplays: Record<string, unknown>;   // call signature -> raw output THEN
}

// Pull a real failure out of AgentLens into a self-contained replay bundle.
async function bundleFromTrace(traceId: string): Promise<ReplayBundle> {
  const trace = await getTrace(traceId);
  const model = trace.steps.find((s) => s.kind === "model");
  if (!model) throw new Error(`no model step in ${traceId}`);

  // Freeze each tool's output by call signature, so a replay returns
  // yesterday's values instead of hitting today's moved-on world.
  const toolReplays: Record<string, unknown> = {};
  for (const s of trace.steps.filter((s) => s.kind === "tool")) {
    toolReplays[`${s.name}:${JSON.stringify(s.input)}`] = s.output;
  }
  return {
    resolvedPrompt: model.input,                                 // RESOLVED input
    params: { model: model.model, temperature: model.temperature, seed: model.seed },
    toolReplays,
  };
}

// Re-run the agent against the captured reality. Tools are stubbed to replay
// recorded outputs, so the ONLY thing that can vary is the agent itself.
async function replay(b: ReplayBundle): Promise<string> {
  return runAgent(b.resolvedPrompt, {
    ...b.params,
    toolResolver: (name: string, input: unknown) =>
      b.toolReplays[`${name}:${JSON.stringify(input)}`],
  });
}

// Turn the reproduced failure into a permanent regression eval.
async function lockAsRegression(traceId: string, mustNotContain: string[]) {
  const b = await bundleFromTrace(traceId);
  const output = await replay(b);
  return evaluate({
    input: b.resolvedPrompt,
    output,
    checks: [
      assert.notContains(mustNotContain),  // the bogus refund amount, forbidden forever
      assert.judge({ criterion: "refund amount matches the tool result", threshold: 0.8 }),
    ],
    metadata: { sourceTraceId: traceId },  // provenance back to the real incident
  });
}

Two decisions do all the work. Tool outputs are replayed, not re-fetched: the toolResolver hands the agent yesterday's recorded responses instead of calling the live API. If your replay re-queries the database, you're testing a world that has changed, and any "fix" you observe might just be the data moving again — pinning tool outputs isolates the one variable you want to study. And the resolved prompt is replayed, not the template: the subtly-wrong retrieved document or the unusual account state lived in the resolved input, and nowhere else.

Replaying once is debugging; replaying many times is the fix

A subtlety I won't skip: if the failure happened on a high-temperature reasoning path, replaying once might reproduce it or might not, because you'll roll a different path. For nondeterministic failures a single replay isn't a reproduction; it's one sample.

The honest technique is to replay the bundle N times and measure the failure rate. A bug in 8 of 50 replays is reproduced — you've proven it's real and quantified it — even though no single run is guaranteed to show it. Your fix isn't "the replay passed once"; it's "the rate across 50 replays went from 16% to 0%."

async function reproductionRate(traceId: string, isBug: (o: string) => boolean, n = 50) {
  const b = await bundleFromTrace(traceId);
  const runs = await Promise.all(Array.from({ length: n }, () => replay(b)));
  const hits = runs.filter(isBug).length;
  return { rate: hits / n, hits, n }; // e.g. { rate: 0.16, hits: 8, n: 50 }
}

This reframes the bug-fixing loop. You don't fix a failure and eyeball it once; you capture it, establish its rate, change the agent, prove the rate dropped, and keep the replay in your suite so it can never silently climb back. agent-eval runs the bundle; the AgentLens trace behind it tells you, when a replay does fail, which step diverged from the recorded path.

What to do Monday

You don't need production to be deterministic. You need every run to be reconstructable:

Capture the resolved input, not the template. Store only the template and you've already lost most of your future bug reports.
Record every tool call's raw output inline with the run. A reproduction that re-fetches from a moved-on world is not a reproduction.
Stamp the execution parameters — model version, temperature, seed, system prompt — onto the trace. "It was a different model version" is a real root cause you'll otherwise never see.
Measure failure rate, not a single replay. For nondeterministic bugs, reproduction is statistical: replay N times, and treat "rate went to zero" as the bar for fixed.

The agents will keep producing failures you can't explain from the symptom alone. The difference between a team that fixes them and one that closes tickets with "could not reproduce" isn't model quality or prompt skill — it's whether the run left behind enough of itself to be run again. Capture the trace with AgentLens, replay-and-score it with agent-eval, and "could not reproduce" stops being a sentence you're allowed to write.

Top comments (2)

Nazar Boyko • Jun 24

Replayable beating deterministic is the right frame, and the four ways reproduction breaks is the cleanest version of that argument I've read. The hard part in practice is that you have to capture everything before you know which run will fail, so you're storing resolved inputs and raw tool outputs for 100% of traffic, including whatever PII rode along in them. So the real operational question becomes retention. How do you decide what to keep and for how long, when the trace you age out is the one bug report you can never reconstruct? Curious whether you sample, or capture everything and lean on short retention plus redaction.

Some comments may only be visible to logged-in visitors. Sign in to view all comments.