Your Evals Are Flaky Too: Stop Trusting a Pass Rate You Can't Reproduce

#ai #evaluation #testing #typescript

We spent two years teaching everyone that agents are non-deterministic. Same prompt, different output, every run. Fine. We internalized it. We stopped asserting equality, we built model-as-judge evals, we put them in CI.

And then we quietly assumed the evals were deterministic. They are not.

Your eval suite is a non-deterministic system grading another non-deterministic system. If you haven't measured how much your own grader wobbles, you don't have a quality gate. You have a coin flip wearing a lab coat.

The bug that taught me this

We had a model-as-judge check on a support agent: "Does the response correctly resolve the customer's stated issue? Return PASS or FAIL." Green for weeks. Then a release went out, the dashboard stayed green, and complaints spiked anyway.

I reran the exact same eval on the exact same 200 stored responses. 14 of them flipped verdict. Not because the agent changed — the responses were frozen on disk. The judge changed its mind. Temperature, sampling, a model-side update, who knows. My "97% pass rate" was 97% +/- something I had never measured, and that something was big enough to hide a regression.

The eval wasn't wrong. It was flaky. And a flaky gate is worse than no gate, because it manufactures confidence.

Flaky evals come from three places

1. The judge model. Any LLM-as-judge call inherits the same variance as the thing it's grading. Run it at temperature: 0 and you reduce it, but "reduce" is not "eliminate" — providers don't guarantee determinism even at zero, and a silent model version bump resets your baseline overnight.

2. The harness around the judge. Retrieved context, the order tools resolved, a truncated input, a rate-limit retry that changed what the judge actually saw. The judge gave a perfectly consistent answer — to a different question than last time, because its inputs drifted.

3. Your own rubric. "Is this answer good?" is not a spec. Vague rubrics push the variance into the judge's interpretation, where you can't see it. Tight, decomposed rubrics collapse it.

Notice that only one of these is "the model is random." The other two are infrastructure, and you can't tell them apart from a PASS/FAIL alone.

Treat your eval like flaky test code, because it is

Backend engineers already know how to handle non-deterministic checks: you don't ship a flaky test, you quarantine it, you measure its flake rate, and you fix or delete it. Same discipline here.

First, quantify the flake before you trust the score. Run each judge call N times and look at the agreement, not the average:

type Verdict = "PASS" | "FAIL";

interface JudgeResult {
  verdict: Verdict;
  // every model + tool step that produced this verdict
  traceId: string;
}

async function stabilityCheck(
  caseId: string,
  runJudge: () => Promise<JudgeResult>,
  samples = 5,
): Promise<{ verdict: Verdict | "UNSTABLE"; agreement: number; traceIds: string[] }> {
  const results = await Promise.all(
    Array.from({ length: samples }, () => runJudge()),
  );

  const passes = results.filter((r) => r.verdict === "PASS").length;
  const agreement = Math.max(passes, samples - passes) / samples;
  const traceIds = results.map((r) => r.traceId);

  // If the judge can't agree with itself, the verdict is not a signal.
  if (agreement < 0.8) {
    return { verdict: "UNSTABLE", agreement, traceIds };
  }
  return {
    verdict: passes > samples / 2 ? "PASS" : "FAIL",
    agreement,
    traceIds,
  };
}

The point isn't the magic 0.8. The point is that UNSTABLE is now a first-class outcome. A case where the judge flips 3-of-5 is not a 60% pass — it's a broken check, and it should fail loud and get quarantined, not silently average into a comforting number.

Second — and this is the half everyone skips — you have to be able to debug the disagreement. Knowing a case is UNSTABLE is useless if you can't see why the judge split. That requires the trace behind every one of those five runs: the resolved prompt the judge actually received, the retrieved context, the tool outputs, the raw judge completion. Not the summary. The bytes.

This is exactly where the two-layer split earns its keep

This is the workflow I keep coming back to, and it's two halves of one loop — not two products you bolt together at the end.

agent-eval is the layer that scores and gates the output. It runs the deterministic checks and the model-as-judge passes, it computes the stability/agreement above, and it's the thing that turns "the agent answered" into PASS / FAIL / UNSTABLE that CI can act on. It owns the verdict.

AgentLens is the layer that captures the trace of how that verdict happened — every model call and every tool step, with resolved inputs and raw outputs, for both the agent run and the judge run. It owns the explanation.

You need both because the eval score alone can't tell you which of the three flake sources you hit. When agent-eval flags a case UNSTABLE, you pull the five AgentLens traces side by side and the cause is immediately legible: if the resolved judge inputs are identical across runs and the verdict still flipped, it's the judge model — tighten the rubric or pin the version. If the inputs differ, it was never a judge problem — your harness is non-deterministic and the agent's context drifted between runs. Same UNSTABLE flag, opposite fix. The verdict tells you that it's unstable; the trace tells you why, and the why is the only thing that's actionable.

Without the trace, every flaky eval looks like "the model is random," so you reach for temperature: 0, watch the flake rate drop a little, and convince yourself it's solved. It isn't — you just made the infrastructure bug quieter.

What to actually do Monday

Stop reporting a single pass rate from a single run. Report agreement. A 95% pass rate at 0.6 judge agreement is noise.
Make UNSTABLE a failing state in CI. If your grader can't agree with itself across a handful of samples, that case does not get to vote on whether you ship.
Pin and alert on judge model versions the same way you pin the agent's. A silent provider bump is a silent baseline reset.
When a check goes unstable, read the traces, not the average. The fix for "the judge is random" and "my context drifted" are opposite, and the verdict alone can't distinguish them.

We earned our humility about agents being non-deterministic the hard way. The eval layer is built from the same stochastic parts and deserves the same suspicion. A green dashboard you can't reproduce isn't a quality signal — it's a story you're telling yourself, and the only way to check whether it's true is to grade the grader and keep the trace that proves the grade.