Most agent memory stores a confidence score the way it stores everything else. You
write it once and it sits there. The agent decides a fact is worth 0.9, the store
keeps 0.9, and three weeks later, after something has contradicted that fact, the
store still hands back 0.9. Confidence is a number written at one moment and
never looked at again. It is stale, and nothing in the system knows it.
That is the quiet failure of pull memory. You query, it returns the closest
matches with whatever score they were saved at, and noticing that a fact has gone
soft is on you.
Recall takes the other path. Effective confidence is not a stored field. It is
recomputed from the graph every time you read, so a contradiction landing anywhere
drops the claim's confidence on the next query, with no model rerun and no human
in the loop.
The formula
It is plain arithmetic, on purpose. For a cell, the effective confidence is:
effective = clamp01( stated × calibration + support − challenge )
- stated is what the author claimed when they wrote it.
- calibration discounts the author by their track record.
- support is corroboration from incoming supports edges.
- challenge is the weight of incoming contradicts and concerns edges.
Support and challenge are not raw sums. Each is squashed through a saturation
curve with a different ceiling:
support = 0.15 × tanh(supportMass)
challenge = 0.60 × tanh(challengeMass)
The asymmetry is the whole point. Corroboration is cheap to manufacture, so
support saturates fast under a low ceiling: stack ten agreeing cells and you add
at most 0.15. Real contradiction is rare and informative, so challenge runs to a
0.6 ceiling. One honest contradiction can move a claim further than a pile of
agreement.
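The two saturation curves can be sketched in a few lines of Python. This is a minimal illustration of the formula above, not Recall's actual code: the function and constant names are hypothetical, and only the 0.15 and 0.60 ceilings and the tanh shape come from the text.

```python
import math

SUPPORT_CEILING = 0.15    # from the formula above
CHALLENGE_CEILING = 0.60  # from the formula above


def clamp01(x: float) -> float:
    """Clamp x into the closed interval [0, 1]."""
    return max(0.0, min(1.0, x))


def support_term(support_mass: float) -> float:
    """Corroboration, saturating under the low ceiling."""
    return SUPPORT_CEILING * math.tanh(support_mass)


def challenge_term(challenge_mass: float) -> float:
    """Contradiction, saturating under the higher ceiling."""
    return CHALLENGE_CEILING * math.tanh(challenge_mass)


def effective_confidence(stated, calibration=1.0, support_mass=0.0, challenge_mass=0.0):
    """effective = clamp01(stated * calibration + support - challenge)."""
    return clamp01(
        stated * calibration
        + support_term(support_mass)
        - challenge_term(challenge_mass)
    )


# Compare how much each term can contribute as its mass grows.
for mass in (1.0, 2.0, 5.0, 10.0):
    print(f"mass={mass:>4}: support={support_term(mass):.3f}  "
          f"challenge={challenge_term(mass):.3f}")
```

At a mass of 1.0 the challenge term is already larger than the support term can ever become, which is the asymmetry in numbers.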
A worked example you can check
A fresh claim, stated 0.9, author with no track record yet, no support, no
challenge:
effective = clamp01(0.9 × 1 + 0 − 0) = 0.90
One contradiction lands from a source stated at 1.0, a challengeMass of 1.0:
challenge = 0.60 × tanh(1.0) ≈ 0.457
effective = clamp01(0.90 − 0.457) ≈ 0.44
The same claim now reads 0.44. Nobody edited it. A second contradiction pushes the
mass to 2.0:
challenge = 0.60 × tanh(2.0) ≈ 0.578
effective = clamp01(0.90 − 0.578) ≈ 0.32
Down to 0.32, and the original 0.9 is still on record, just demoted. Ten
supporting cells would have added at most 0.15. Cheap agreement barely moves it; a
real challenge moves it a lot.
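The numbers in this example can be reproduced directly. The sketch below assumes that a mass is the sum of the stated confidences of the incoming sources. That matches the example (one source stated at 1.0 gives a challengeMass of 1.0) but is not spelled out in the text, and the function name is hypothetical.

```python
import math


def effective(stated, calibration, support_sources, challenge_sources):
    # ASSUMPTION: each mass is the sum of the sources' stated confidences.
    support = 0.15 * math.tanh(sum(support_sources))
    challenge = 0.60 * math.tanh(sum(challenge_sources))
    return max(0.0, min(1.0, stated * calibration + support - challenge))


fresh = effective(0.9, 1.0, [], [])
one_contradiction = effective(0.9, 1.0, [], [1.0])
two_contradictions = effective(0.9, 1.0, [], [1.0, 1.0])

print(f"{fresh:.2f} {one_contradiction:.2f} {two_contradictions:.2f}")  # 0.90 0.44 0.32

# Ten agreeing sources on a 0.5 claim add just under the 0.15 ceiling.
# (A 0.5 claim is used so the gain is not hidden by the clamp at 1.0.)
gain = effective(0.5, 1.0, [1.0] * 10, []) - 0.5
print(f"{gain:.3f}")  # 0.150
```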
Calibration, and one honest choice in it
Before support and challenge apply, the author's stated number is multiplied by a
calibration factor. An author who has been contradicted before is discounted in
proportion to how often they were wrong times how confident they were when wrong.
The factor is floored at 0.5, so it never zeroes anyone out.
The honest detail is what it is not. It is not raw Brier scoring. Raw Brier also
punishes a humble author who hedges low on claims that turn out fine, and
punishing humility is the opposite of the incentive a memory system should create.
So the discount keys on overconfidence specifically, being wrong while sure.
Hedge honestly and you are not penalized. Claim 0.95 and get contradicted and you
are.
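One way to read that rule is sketched below. The exact formula is not given in the text, so treat this as an assumption: the factor is taken to be 1 minus (fraction of past claims contradicted × mean stated confidence on those contradicted claims), floored at 0.5. The PastClaim type and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List

CALIBRATION_FLOOR = 0.5  # from the text: never zero anyone out


@dataclass
class PastClaim:
    stated: float       # confidence the author stated at write time
    contradicted: bool  # whether the claim was later contradicted


def calibration(history: List[PastClaim]) -> float:
    """ASSUMED form: 1 - wrong_rate * mean confidence when wrong, floored."""
    wrong = [c for c in history if c.contradicted]
    if not wrong:
        # No track record, or never contradicted: no discount.
        return 1.0
    wrong_rate = len(wrong) / len(history)
    confidence_when_wrong = sum(c.stated for c in wrong) / len(wrong)
    return max(CALIBRATION_FLOOR, 1.0 - wrong_rate * confidence_when_wrong)


# A humble author who hedges low and is never contradicted is not penalized.
humble = [PastClaim(0.3, False) for _ in range(10)]

# An author who claims 0.95 and is contradicted half the time is discounted.
overconfident = [
    PastClaim(0.95, True), PastClaim(0.95, True),
    PastClaim(0.95, False), PastClaim(0.95, False),
]

print(calibration(humble))                   # 1.0
print(round(calibration(overconfident), 3))  # 0.525
```

Unlike a Brier-style score, nothing in this sketch looks at claims that held up, so a low stated confidence on a correct claim costs the author nothing.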
Why this beats a stored score
A vector store returns whatever confidence was saved alongside a chunk. A flat notes file
returns whatever it says. Neither knows the fact was contradicted last Tuesday,
because the contradiction is not part of how the score is computed. The score and
the conflict live in different places.
In Recall they live in the same place. The contradiction is an edge on the graph,
and the score is computed from the graph, so the moment the edge exists the score
reflects it, on the next read, deterministically. The reader is the same agent
that wrote the memory, now working from fresh context, and the substrate has
repriced what that agent knows in the meantime.
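A compute-on-read store can be sketched with Python's built-in sqlite3 module. The schema, table names, and read_confidence function below are hypothetical and are not Recall's actual layout. The sketch also assumes that an edge weighs as much as its source cell's stated confidence, that contradicts and concerns edges count equally, and that calibration is 1.0.

```python
import math
import sqlite3

# Hypothetical schema: cells keep the immutable stated score, edges keep relations.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cells (id INTEGER PRIMARY KEY, claim TEXT, stated REAL NOT NULL);
    CREATE TABLE edges (src INTEGER NOT NULL, dst INTEGER NOT NULL, kind TEXT NOT NULL);
""")


def read_confidence(conn, cell_id, calibration=1.0):
    """Recompute effective confidence from the graph at read time."""
    (stated,) = conn.execute(
        "SELECT stated FROM cells WHERE id = ?", (cell_id,)
    ).fetchone()
    # ASSUMPTION: an edge's weight is the stated confidence of its source cell.
    rows = conn.execute(
        """
        SELECT e.kind, SUM(c.stated)
        FROM edges AS e JOIN cells AS c ON c.id = e.src
        WHERE e.dst = ?
        GROUP BY e.kind
        """,
        (cell_id,),
    ).fetchall()
    mass = dict(rows)
    support_mass = mass.get("supports", 0.0)
    challenge_mass = mass.get("contradicts", 0.0) + mass.get("concerns", 0.0)
    value = (stated * calibration
             + 0.15 * math.tanh(support_mass)
             - 0.60 * math.tanh(challenge_mass))
    return max(0.0, min(1.0, value))


conn.execute("INSERT INTO cells VALUES (1, 'claim A', 0.9)")
before = read_confidence(conn, 1)

# A later cell contradicts claim A. Only an edge is added; cell 1 is untouched.
conn.execute("INSERT INTO cells VALUES (2, 'claim B, contradicting A', 1.0)")
conn.execute("INSERT INTO edges VALUES (2, 1, 'contradicts')")
after = read_confidence(conn, 1)

stored = conn.execute("SELECT stated FROM cells WHERE id = 1").fetchone()[0]
print(f"before={before:.2f} after={after:.2f} stored={stored}")
# before=0.90 after=0.44 stored=0.9
```

The stored value never changes. The only write was the new edge, and the next read picked it up.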
What it is not
This is a ranking signal, not a verdict on truth. A low effective confidence means
a claim is contested or comes from an author who has been wrong while sure, not
that it is false. The ceilings and curves are tunable defaults. And it is
deliberately deterministic arithmetic over the graph, not a model second-guessing
itself, which is what makes it inspectable: open any cell and see why its number
is what it is, term by term.
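Because the score is plain arithmetic, a term-by-term breakdown is easy to produce. The sketch below is one hypothetical shape for such a breakdown; the explain function and its field names are illustrative and may not match what Recall actually exposes.

```python
import math


def explain(stated, calibration, support_mass, challenge_mass):
    """Return each term of the formula alongside the final number."""
    base = stated * calibration
    support = 0.15 * math.tanh(support_mass)
    challenge = 0.60 * math.tanh(challenge_mass)
    return {
        "stated": stated,
        "calibration": calibration,
        "base": base,            # stated × calibration
        "support": support,      # added to base
        "challenge": challenge,  # subtracted from base
        "effective": max(0.0, min(1.0, base + support - challenge)),
    }


# The twice-contradicted claim from the worked example.
for term, value in explain(0.9, 1.0, 0.0, 2.0).items():
    print(f"{term:>12}: {value:.3f}")
```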
That is the trade. You give up a number that looks stable and never moves. You get
one you can recompute, that demotes a stale claim the instant the evidence turns,
and that you can read the reasons for. For an agent that has to act on what it
remembers, the second is worth more.
Recall is local-first, runs on SQLite, and sets up with one command. The code and
the formula above are open: github.com/H-XX-D/recall-memory-substrate
Top comments (9)
Recomputing confidence on read is much closer to how memory should work.
A stored score quietly becomes historical metadata. A computed score can react to contradiction, age, source quality, and newer evidence. That makes confidence a property of the current knowledge graph, not a number someone wrote once.
Computed-not-stored confidence is the move that separates a substrate that learns from one that just remembers. The asymmetric ceilings on support vs challenge are the architectural commit — corroboration cheap, contradiction informative — and the calibration discount keyed on wrong-while-sure rather than raw Brier is the right shape for the incentive surface: hedge-honest authors don't get punished, confidently-wrong authors do. That's the incentive a memory substrate should create.
Same disease shows up one floor up, at the place that authors the calibration. If the calibration factor is itself computed by the same lineage that produced the overconfidence — the same model rating its own past confidence against its own track record on its own claims — the discount becomes a self-graded test. The asymmetry survives only if at least one input to the calibration comes from somewhere the writer can't reach: an external timestamp, an independent annotator, a downstream signal the model can't reconstruct. Otherwise overconfident authors that share lineage with the calibrator silently get their own slack.
The inspectable side of this — "open any cell and see why its number is what it is, term by term" — is the structural answer to a class of failure I keep watching: confidence as a number you can read but can't audit. The same primitive matters for any binding entry in a system that has to act on what it remembers: every score points back at the artifact that grounded it and forward at the diagnostic that can falsify it. Otherwise the score is folklore wearing arithmetic.
Because you carry confidence in your entries, and that is exactly where these systems lead you in circles. The score an author states at write time is immutable; it is never edited in place. What moves is an effective score computed at read time: the stated value attenuated by the author's own track record, a Brier-style measure of how their past high-confidence claims held up against later contradiction, plus any challenges standing against that specific entry. A writer cannot buy trust by restating, and confidence cannot drift upward on its own, because the honest stated number and the earned effective number stay separate and the read returns the earned one.
Right — the writer-can't-edit-stated + effective-computed-from-elsewhere split is what closes the loop. The lineage independence isn't at the writer/author boundary; it's at the time-and-cross-claim-graph boundary, which is structurally different and harder to silently collapse. The challenges that move effective downward come from later entries, different sessions, different anchors — that's the off-path the calibration leans on.
One edge: the structure assumes sufficient challenge density across the graph. In a long-running substrate with many entries and active correction loops, the cross-claim mass does its work. In a cold-start system with one author and few entries, the calibration discount has nothing to chew on, so the effective score reads as honest while still being whatever the author wrote. The asymmetric ceilings (0.15 support / 0.6 challenge) help against support inflation, but they don't manufacture challenges if none have arrived yet. Cold-start is the case where the elegance has nothing to leverage; the system gets honest as it accumulates.
You either prime a cold start with a global store and project-scoped substores with an _init, so that by the time a capable model's first context compaction fires, your DB's doing the work.
Yes — that makes the cold-start boundary much cleaner. The global store supplies continuity, project substores constrain relevance, and _init provides the bootstrap contract. By first compaction, retrieval history and supersede/use edges can replace the prior with computed evidence.
So the system doesn’t begin with confidence; it begins with scope. Confidence becomes legitimate only after the project store has accumulated behavior.
_init is scaffolding. If it is still carrying the building after three compactions, something has gone architecturally decorative.
Or a massive foundation of well-specced and well-designed system platform building control, maybe a defense project. People one-shot web apps like changing channels these days. Doesn't mean they are any good. Any serious project still requires competent engineering, separation of concerns, and experience to get off the ground. Even then, you're lucky to make any money right now. Except for maybe rewriting those webapps lol.