DEV Community

Confidently wrong is worse than "I don't know"

Todd Hendricks on June 22, 2026

Someone left a comment on my last post and then deleted it before I could reply. I am going to answer it anyway, because it said the thing better t...

TxDesk • Jun 22

the second failure you name, the silent one, is the one i think most people never even classify as a failure, and that's exactly why it's the dangerous one. a wrong answer is at least an event, it hands you something to check. a fact you can't surface is a non-event, it leaves no trace, so you proceed without the thing you already knew and nothing in the system registers that anything happened. you called it "deleted with extra steps," which is right, but i'd push it one further: it's worse than deletion, because deletion you'd eventually notice and re-acquire. this you never look for, because as far as you and the model can tell, it was never there.
the part i'd build on is your point that a smarter model makes it worse. that generalizes past memory: capability multiplies whatever the substrate hands it, so the better the reasoner, the more persuasively it argues the stale fact and the more smoothly it papers over the missing one. confidence is a function of fluency, not correctness, and a great reasoner on a substrate that can't represent doubt is just a more convincing version of the same three errors. which is why your fix being in the memory and not the model is the right place to put it: doubt has to be a property of the stored fact, not something you hope the reasoner reconstructs at read time.
the open question i'd hand back: who computes the confidence, and what stops it from being gamed? you said the runtime attenuates it on contradiction history rather than a number you typed, which is the right instinct. but if an attacker or just a noisy source can manufacture supporting edges, confidence becomes another thing that can be inflated. the same problem as everywhere else, the score is only as trustworthy as the independence of the inputs feeding it.

Todd Hendricks • Jun 22 • Edited

effective = clamp01( stated × calibration + support − challenge )
Two scores: an immutable stated confidence that the model gives at write time, scaled by calibration (the model's history of being contradicted, a lambda I'll explain in another post), and an effective confidence that is computed at read time. So your model can be as confident as it wants, but it needs supporting evidence.
Then there's the end-turn write-back hook, which won't let the model end the turn unless it writes what changed and what it relates_to, depends_on, contradicts, etc., through a strict schema firewall. It sounds heavy, but it's not; even before I designed the hooks, these newer frontier models started reaching for it. There are two other important hooks, a compile and a verify, so all in all a single exchange becomes five turns with the model on your computer, and you only notice one. That would seem like a lot of tokens, but grepping 1000 md files costs way more on a serious project. The compile packet is bounded.
I'm using IDs, tags, and addressable cells to organize the memories/writes, and pushing a deliberately incomplete index into context at the start of the prompt. Then a verify hook stops the model from continuing unless it opens the cell addresses and reads what they contain; then it does its work, and the end hook won't end its turn until it does the write-back....
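For anyone who wants to poke at the read-time formula, here is a minimal Python sketch of it. The function and parameter names mirror the terms in the formula; the value ranges and the sample numbers are assumptions for illustration, not values from Recall.

```python
def clamp01(x: float) -> float:
    """Clamp a value into the closed interval [0, 1]."""
    return max(0.0, min(1.0, x))


def effective_confidence(stated: float, calibration: float,
                         support: float, challenge: float) -> float:
    """Read-time confidence: the stated score scaled by calibration,
    raised by support and lowered by challenge, then clamped to [0, 1].

    How calibration, support, and challenge are derived is not shown
    here; they are plain inputs in this sketch.
    """
    return clamp01(stated * calibration + support - challenge)


# A model that claims 0.99 but has a poor calibration history and no
# supporting edges ends up well below its stated score.
unsupported = effective_confidence(0.99, 0.5, 0.0, 0.0)

# Heavy challenge with little support is clamped at the floor.
challenged = effective_confidence(0.2, 0.5, 0.0, 0.6)
```

The point of the split is visible in the first call: the stated value never changes, but what a reader sees depends on terms the writer does not control.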

TxDesk • Jun 24

the two-score split is the right shape, and computing effective at read time from support minus challenge is what makes it ungameable from the write side. the model can claim 0.99 all it wants, calibration plus the supporting-edge requirement is what actually has to be earned. that closes the "model inflates its own confidence" hole cleanly.

the gap i'd poke at is the support term itself. calibration scores the writer, but support counts the edges, and edges don't have a calibration score. so the failure mode isn't a confident model anymore, it's a fact propped up by three supporting edges that all trace back to the same origin. correlated support reads as strong support. a stale fact that got cited into four notes early on looks better-supported than a true correction that only just arrived with one edge. the score rewards how well-connected a claim is, which is usually a proxy for true but comes apart exactly when a wrong thing spread before the right thing showed up.

so the question back: does support weight independence, or just count? because if two supporting edges share a source they aren't two confirmations, they're one fact wearing a coat. the thing i keep landing on across all of this is that every confidence score is only as good as the independence of whatever feeds it, and independence is the hardest property to verify cheaply at write time.
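to make the count-versus-independence distinction concrete, here is a rough Python sketch. it assumes every supporting edge carries an `origin` label naming the root source it traces back to; that field, the `SupportEdge` type, and the per-edge weight are all hypothetical, and assigning `origin` reliably is exactly the hard part described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SupportEdge:
    source_cell: str     # the cell that cites the claim
    origin: str          # hypothetical: root source this cell traces back to
    weight: float = 0.1  # hypothetical per-edge contribution


def counted_support(edges: list[SupportEdge]) -> float:
    """Naive support: every edge counts, even when edges share an origin."""
    return sum(e.weight for e in edges)


def independent_support(edges: list[SupportEdge]) -> float:
    """Support that credits each distinct origin once (its strongest edge)."""
    best_per_origin: dict[str, float] = {}
    for e in edges:
        best_per_origin[e.origin] = max(best_per_origin.get(e.origin, 0.0), e.weight)
    return sum(best_per_origin.values())


# A stale fact cited into four notes that all trace back to one origin...
stale = [SupportEdge(f"note-{i}", origin="origin-A") for i in range(4)]
# ...versus a correction that only just arrived with one edge.
correction = [SupportEdge("note-9", origin="origin-B")]
```

under counting, the stale fact looks four times better supported than the correction; under the per-origin version the two are tied, which is the "one fact wearing a coat" case collapsing back to one confirmation.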

Tae Kim • Jun 22

The calibration gap is what makes this expensive in production. A model that says it does not know hands the cost back immediately. A model that confidently misremembers distributes that cost invisibly to everyone downstream who acts on the output. In a RAG pipeline I worked on, we added a coverage check before the response goes out: if a generated claim references a fact not grounded in any retrieved chunk, flag it. It does not solve all hallucination but catches the pure confabulation cases where the model fills in details the context never gave it.
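A minimal Python sketch of that kind of coverage check, under loud assumptions: claims are already split into separate strings, and "grounded" is approximated by token overlap with the best-matching retrieved chunk. The overlap heuristic, the `min_overlap` threshold, and the function names are illustrative stand-ins, not what the pipeline above actually used.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def ungrounded_claims(claims: list[str], chunks: list[str],
                      min_overlap: float = 0.6) -> list[str]:
    """Return the claims whose tokens are not sufficiently covered by any chunk."""
    chunk_tokens = [_tokens(c) for c in chunks]
    flagged = []
    for claim in claims:
        claim_tokens = _tokens(claim)
        if not claim_tokens:
            continue  # nothing to check in an empty claim
        # Fraction of the claim's tokens found in the best-matching chunk.
        best = max(
            (len(claim_tokens & ct) / len(claim_tokens) for ct in chunk_tokens),
            default=0.0,
        )
        if best < min_overlap:
            flagged.append(claim)
    return flagged


chunks = ["The cache TTL is 300 seconds for the session store."]
claims = ["The cache TTL is 300 seconds", "The cache is backed by Redis"]
flagged = ungrounded_claims(claims, chunks)
# flagged == ["The cache is backed by Redis"]
```

Token overlap is crude, so a production check would more likely swap in an entailment or embedding comparison, but the control flow stays the same: every claim must be matched to retrieved context before the response goes out.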

Todd Hendricks • Jun 22 • Edited

The coverage check is slick, but it moves the problem up a layer when the store is huge. Similarity doesn't necessarily mean relevance; multi-hop and aggregation questions still trip it up, and so does my arch-nemesis: stale, outdated chunks.

UnitBuilds • Jun 22

I feel that... I've been working on my Autonomous Accounting Suite (Doccit), and the real issue I've been hitting lately is that LLMs tend to trust their guts too much. It read 1943.20 as 943.20, at 99% confidence, because the dot-matrix print was overlapping with form text. Instead of saying 'hey, there's an anomaly here, maybe I'm wrong?', it cleared it as a high-confidence match. That happens wayyyyy too often to be usable. And that's just PIT checks; continuous evaluation is even worse for LLMs. When you're dealing with long-context work, it seems that it just doesn't keep track of shifts. For the foundry, I had to write from scratch a branching decision-making system, heavily inspired by git, that lets it recognize when changes were made to the core design and how they affect everything. That, tied in with a dependency graph DB and a discourse thread where all agents can voice their change requests on shared components, with third-party evaluation by another model cross-referencing the intersecting works with the proposals to verify that the changes won't break it... seems to me like way too much work to have to redo for every single project. There really has to be a better way. I'll have a look when I get a chance and give feedback on recall-memory-substrate. Thank you for looking at one of the biggest problems in AI, one that people simply brush off as 'too much context' before starting a new chat.

Todd Hendricks • Jun 22 • Edited

That would be very much appreciated. I also built an optimization suite with a bunch of solvers, including a really fast sparse QUBO/Ising algorithm you can run on a modern CUDA GPU (offhand: a million variables, 265 billion updates/sec). Anyway, the product is the same idea: instead of memory, this one moves computation away from the model and into dedicated algorithms on a graph that the models construct, instead of having them hand-roll NumPy themselves. If you can give me some feedback on Recall, I'll need some beta testers on that too.

Vasyl • Jun 24

Really like this. For me the same thing happens in retrieval: the model sounds sure even when the answer was never in the chunks we pulled. So now I check first — can this be answered from what we have? If not, it stays quiet. How does your score handle old facts that nobody challenged yet?

Todd Hendricks • Jun 24 • Edited

I run a hook at the beginning of the exchange that does a simple best-match keyword search and pushes an incomplete "primer" of addressable cell IDs to orient the model to the concept. Then another hook instructs a compile of the relevant subgraph, which expands those cells, their relations, their depends_on edges, and other cells associated with them. The model does the work, and before the turn ends another hook forces a strict write schema that doesn't let the turn end until it gets a stated confidence and the cell edges are wired in, so every entry is structured the same.
To answer the question: there's a JSON key called supercede, which gets updated at write time. If a fact is pulled, the new cell gets appended with the old cell's ID, but until that happens nothing changes: a fact is a fact, even an old one. There are a few of these keys that represent things like concern, contradiction, health currency, and salience. So a delta is happening every prompt and being recorded, or it's a new entry. The secret sauce is that everything happens in a single forward pass, while the model is actually processing information.
I did a 6-part series this week; you can check it out if you have the time. It's OSS, and there's a live repo at the end of the post if you want to inspect it. Your feedback would be appreciated.
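A rough Python sketch of the write-back gate and the supercede behavior described above, for readers who want something concrete. Only the supercede, stated, relates_to, depends_on, and contradicts names come from this thread; the `id` and `text` keys, the in-memory dict store, and all function names are assumptions made for illustration and are not Recall's actual schema.

```python
# Hypothetical minimal schema: keys every write-back must carry.
REQUIRED_KEYS = {"id", "text", "stated", "relates_to",
                 "depends_on", "contradicts", "supercede"}


def validate_writeback(cell: dict) -> None:
    """Reject a write that is missing keys or has an out-of-range confidence."""
    missing = REQUIRED_KEYS - cell.keys()
    if missing:
        raise ValueError(f"write-back rejected, missing keys: {sorted(missing)}")
    if not 0.0 <= cell["stated"] <= 1.0:
        raise ValueError("stated confidence must be in [0, 1]")


def write_cell(store: dict, cell: dict) -> None:
    """Append a validated cell; existing cells are never modified in place."""
    validate_writeback(cell)
    if cell["id"] in store:
        raise ValueError(f"cell {cell['id']} already exists; append a new cell instead")
    store[cell["id"]] = cell


def live_cells(store: dict) -> list[dict]:
    """Cells that no later cell has superseded. Old, unchallenged facts stay live."""
    superseded = {c["supercede"] for c in store.values() if c["supercede"] is not None}
    return [c for c in store.values() if c["id"] not in superseded]


store: dict = {}
write_cell(store, {"id": "c1", "text": "API limit is 100 req/min", "stated": 0.9,
                   "relates_to": [], "depends_on": [], "contradicts": [],
                   "supercede": None})
# Until something supersedes c1 it remains a live fact, however old it is.
write_cell(store, {"id": "c2", "text": "API limit is 250 req/min", "stated": 0.8,
                   "relates_to": ["c1"], "depends_on": [], "contradicts": ["c1"],
                   "supercede": "c1"})
live_ids = [c["id"] for c in live_cells(store)]
```

In this sketch the old cell is kept and only filtered out at read time, so the history of what was believed, and what replaced it, stays inspectable.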