Stories by Flamehaven Initiative on Medium

When Medical AI Benchmarks Move Faster Than Validation

Flamehaven Initiative - Tue, 16 Jun 2026 14:14:03 GMT

When the machinery behind a viral clinical AI paper cannot support the conclusions drawn from it

Note: This is not an argument to dismiss the Nature Medicine paper. It is an argument for stronger validation infrastructure around medical AI benchmarks before practice-shaping claims become settled wisdom.

Two Critics, Two Reasonable Conclusions

A Nature Medicine paper published on 12 June 2026 claimed that frontier models — GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6 — now outperform specialized clinical AI tools like OpenEvidence and UpToDate AI across multiple medical benchmarks. The paper went viral within hours. The conclusion was treated as settled before most people had read past the abstract.[1]

Two expert readers went through it carefully. They reached opposite conclusions, and both conclusions were reasonable from where each reader stood, which is exactly the problem. A post-publication conflict-of-interest allegation also surrounds the paper, and we address it below.

Marissa Famularo, a vascular surgeon at Jefferson–Lehigh Valley who teaches residents, updated her practice the same day. “RAG buys provenance, not correctness,” she wrote. “Citations you can verify are not the same as a lower error rate.” She acknowledged the paper’s limits (n=100, single center, already-obsolete models) and changed her teaching anyway, because that is what engaged clinicians do when a Nature Medicine paper lands.

Natalie Khalil, PhD in Biomedical Engineering and developer of Reviewer3, subjected the same paper to structured peer review and found twelve methodological problems, four of them rated Critical. Her conclusion was that the evaluation design cannot support the paper’s central claims.

This piece does not argue that the paper should be dismissed. The study asks a question clinical AI vendors can no longer avoid: do specialized medical AI tools actually outperform general-purpose frontier models when tested head-to-head?

That question is overdue and the paper advances it. What the paper cannot establish, and what the discussion around it has largely skipped over, is whether the evaluation design can carry the weight of the certainty now traveling through the clinical AI conversation. That is a different question, and it is the one worth sitting with.

A Note on Sources

The Nature Medicine paper’s abstract and methods establish the key methodological facts discussed here: the three-part evaluation design, the tested models, the 100-query RCQ benchmark, the 1,800 clinician annotations, the HealthBench LLM-judge panel, the reported Krippendorff’s alpha of 0.10–0.20 on ordinal item-level scores, and the exclusion of refusals from aggregate scoring.

Khalil’s twelve-finding review was conducted using Reviewer3, a structured AI-assisted peer review platform she developed. We treat the review as a primary attributed source, not as peer-reviewed in the traditional sense, and we identify it as such wherever we draw on it.

OpenEvidence’s response was published as a public LinkedIn post following the paper’s release. Where we reference its conflict-of-interest allegation, we present it as OpenEvidence’s public claim and note that it has not been independently verified. The arXiv and medRxiv papers cited throughout are publicly accessible and directly relevant to the methodological concerns raised here.

What the Paper Claims and What It Cannot Show

The study evaluated performance across three stages: 500 MedQA questions testing medical knowledge, 500 HealthBench items measuring clinician alignment, and a Real Clinical Queries benchmark built from 100 de-identified, live-environment physician queries.

To evaluate these real-world queries, twelve US clinicians produced a total of 1,800 model-question annotations.

Based on these findings, the paper concluded that frontier LLMs outperformed specialized clinical AI tools across all evaluations, suggesting that scale, alignment, and cross-domain reasoning may outweigh domain-specific tuning as determinants of medical competency. [1]

The study compares observed outputs under unequal conditions. It shows that frontier models performed better on the selected benchmarks under those conditions. What it does not and cannot show is why. The clinical tools are proprietary systems whose architectures, base models, retrieval pipelines, and safety configurations are inaccessible to the researchers.

The observed performance gap could reflect the superiority of general-purpose scale, but it could equally reflect smaller base models in the clinical tools, poorly optimized retrieval pipelines, overly restrictive safety prompts, or the accumulated effect of the methodological asymmetries described below.

Because the paper cannot isolate these variables — a point the authors themselves acknowledge, noting that “it is impossible to definitively assess a mechanistic understanding” — the conclusion about scale and alignment outweighing domain-specific tuning remains an interpretive framing that the collected data cannot confirm.

Four Places Where the Claim Becomes Less Stable

1. The Ordinal Foundation Was Not Stable Enough to Support the Rankings

The RCQ benchmark used clinician ratings on a 1–4 scale, and the reported Krippendorff’s alpha for item-level agreement on that ordinal scoring layer was 0.10–0.20. [1] That is far below the 0.667 to 0.80 range conventionally treated as the minimum for drawing even tentative conclusions from rated data, let alone for ranking systems against each other. An alpha this low indicates that raters could not reach consensus on relative quality at the level of granularity the scale was designed to capture.
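To make the statistic concrete, here is a minimal Python sketch of Krippendorff’s alpha with the ordinal distance metric, written from the standard textbook definition. It is not the paper’s analysis code. The function name, the input layout (one list of ratings per rated item), and the example ratings are our own assumptions for illustration.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_ordinal(units, categories):
    """Krippendorff's alpha with the ordinal metric.

    units: list of lists, one inner list of ratings per rated item.
    categories: the rating scale in rank order, e.g. [1, 2, 3, 4].
    """
    # Coincidence matrix: every ordered pair of ratings inside a unit
    # contributes 1 / (m - 1), where m is the number of ratings in the unit.
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a single rating cannot agree or disagree with anything
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per category and the total number of pairable ratings.
    totals = {c: sum(coincidences[(c, k)] for k in categories) for c in categories}
    n = sum(totals.values())

    def delta_sq(c, k):
        # Ordinal distance: ranks between c and k, weighted by how often
        # each category was actually used.
        i, j = sorted((categories.index(c), categories.index(k)))
        between = sum(totals[g] for g in categories[i:j + 1])
        return (between - (totals[c] + totals[k]) / 2.0) ** 2

    observed = sum(o * delta_sq(c, k) for (c, k), o in coincidences.items()) / n
    expected = sum(
        totals[c] * totals[k] * delta_sq(c, k)
        for c in categories for k in categories
    ) / (n * (n - 1))
    # Note: expected is zero if every rating is identical; not handled here.
    return 1.0 - observed / expected


scale = [1, 2, 3, 4]
raters_agree = [[1, 1], [2, 2], [3, 3], [4, 4]]   # invented: perfect agreement
raters_split = [[1, 4], [4, 1], [2, 3], [3, 2]]   # invented: systematic disagreement

print(krippendorff_alpha_ordinal(raters_agree, scale))
print(krippendorff_alpha_ordinal(raters_split, scale))
```

Alpha is 1 when raters agree perfectly, near 0 when agreement is no better than chance, and negative when raters disagree systematically. A value of 0.10–0.20 sits close to the chance end of that range.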

The paper’s Figure 2c, the primary visualization of model superiority and the image that went viral on social media within hours of publication, is derived from the aggregate mean of these discordant ordinal scores. The authors report higher agreement only when the scale is collapsed to binary categories or when raters focus on harm and hallucination flags.

While those partial signals are real and should not be dismissed, the tier ranking claiming that frontier models are categorically better is built entirely on this ordinal layer, which fails to provide the adjudicative stability such a ranking requires.

This pitfall aligns with a 2025 study on medical AI evaluation metrics, which demonstrated that comparing AI outputs against an aggregate of disagreeing experts produces inconsistent assessments that cannot reliably support performance ranking. [6]

2. The Judges Were the Defendants, and One Benchmark Was Built by One of Them

The HealthBench evaluation used an LLM-as-a-Judge approach in which the judging panel consisted entirely of the three frontier models being tested: Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.2. The specialized clinical tools were excluded from the panel. Self-preference bias in LLM judges is a known and documented limitation, and excluding the clinical tools from the judging panel is a design decision with directional consequences that the paper does not fully reckon with. [2]
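One way to probe this kind of design is a leave-self-out sensitivity check: recompute each candidate’s mean judge score after dropping the score a judge assigned to its own output, then see whether the ranking moves. The sketch below uses invented scores and placeholder system names. It illustrates the check only and is not the paper’s procedure or HealthBench’s scoring code.

```python
from statistics import mean


def panel_scores(judge_scores, exclude_self=False):
    """Average each candidate's score across the judging panel.

    judge_scores: {judge_name: {candidate_name: score}}
    exclude_self: when True, drop the score a judge gave to itself.
    """
    candidates = {c for scores in judge_scores.values() for c in scores}
    result = {}
    for cand in sorted(candidates):
        votes = [
            scores[cand]
            for judge, scores in judge_scores.items()
            if cand in scores and not (exclude_self and judge == cand)
        ]
        result[cand] = mean(votes)
    return result


# Hypothetical data: two models sit on the panel, one tool does not.
scores = {
    "model_a": {"model_a": 0.9, "model_b": 0.6, "tool_x": 0.5},
    "model_b": {"model_a": 0.7, "model_b": 0.8, "tool_x": 0.5},
}

print(panel_scores(scores))
print(panel_scores(scores, exclude_self=True))
```

In this toy data each model rates itself highest, so removing self-judgments lowers both models’ averages while the tool’s average is untouched. Whether anything similar happens in the real evaluation is exactly what such a check would have to establish.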

A further problem extends beyond Khalil’s review: HealthBench was created by OpenAI, and GPT-5.2, an OpenAI model, is one of the systems evaluated on it. The Nature Medicine paper acknowledges this benchmark-developer overlap as a potential source of grading bias and explicitly states that HealthBench should be interpreted as “supplementary” to the primary RCQ clinician evaluation.

Despite that “supplementary” label, the HealthBench results were presented as a co-equal pillar of the headline hierarchy and were received by the public as confirmatory, so the acknowledged limitation was never resolved. The problem is systemic: a 2026 scoping review found that 73.5% of healthcare LLM-as-a-judge studies perform no bias testing, which means high agreement scores can reflect shared blind spots rather than valid assessment. [7]

3. The Paper Acknowledges Benchmark Exposure but Does Not Resolve Its Impact

MedQA and HealthBench have been publicly available on the internet for an extended period prior to the evaluation window. Frontier models are trained on large, continuously updated corpora of internet text. The Nature Medicine paper acknowledges this issue and notes that benchmark exposure is possible, but does not quantify or resolve the impact of that exposure on the headline hierarchy. [1]

This does not mean the models memorized exact answers. It means the distribution of their training data may not have been independent of the evaluation distribution, and the paper leaves that question open while drawing firm conclusions from the results.

If any exposure is present, it advantages exactly the systems that already benefit from the judge composition described above and the interface asymmetry described below. The headline performance gap could be directionally real, partly artifactual, or some combination of the two. The current design cannot distinguish between them.

4. Two Statistical Issues Warrant Clarification

The regression analysis treated 1,704 rater-item observations as independent after accounting for rater effects. These observations are clustered within 100 specific clinical queries.

Multiple models and raters evaluating the same query produce correlated scores due to the inherent nature and difficulty of that specific query. Failing to include a random intercept for the query introduces pseudoreplication, artificially inflating degrees of freedom and potentially generating confidence intervals narrower than the data can honestly support. [2]
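A rough way to see the scale of the problem is the Kish design effect, which approximates how much clustering shrinks the effective sample size. The sketch below assumes equal-sized clusters and a single level of clustering by query. The intraclass correlation (ICC) values are hypothetical placeholders, not figures from the paper. It is a back-of-envelope check, not a substitute for refitting the model with a query-level random intercept.

```python
def design_effect(avg_cluster_size: float, icc: float) -> float:
    """Kish design effect for equal-sized clusters: 1 + (m - 1) * icc."""
    return 1.0 + (avg_cluster_size - 1.0) * icc


def effective_sample_size(n_obs: int, n_clusters: int, icc: float) -> float:
    """Approximate number of independent observations under clustering."""
    avg_cluster_size = n_obs / n_clusters
    return n_obs / design_effect(avg_cluster_size, icc)


# 1,704 observations nested in 100 queries (the figures discussed above).
# The ICC values are assumptions chosen only to show the sensitivity.
for icc in (0.0, 0.1, 0.3):
    print(icc, round(effective_sample_size(1704, 100, icc)))
```

With no within-query correlation the full 1,704 observations count. At an assumed ICC of 0.1 they behave like roughly 650 independent observations, and at 0.3 like roughly 290, which is why confidence intervals computed under independence can be too narrow.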

Separately, the paper states that UpToDate’s refusal rate of 19% was not significantly higher than Google AI Overview’s refusal rate of 6% (P=0.10), and specifies that Fisher’s exact test was used for refusal rate comparisons.

A raw Fisher’s exact test on 19/100 versus 6/100 yields a two-sided p-value of approximately 0.009 — which, absent any multiple comparison adjustment, appears inconsistent with the reported P=0.10 at the paper’s own stated alpha=0.05 threshold. If a Bonferroni or other correction was applied across the full refusal-rate comparison set, that adjustment should be stated. As written, the discrepancy warrants clarification. [1]
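The check is easy to reproduce. The sketch below implements a two-sided Fisher’s exact test from the hypergeometric distribution using only the Python standard library, assuming both refusal rates are out of 100 queries (19/100 and 6/100). The function name is ours, and the two-sided rule used here (summing every table no more probable than the observed one) is the common convention, which may differ from whatever the authors’ software applied.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_observed = prob(a)
    # Sum all tables that are no more probable than the observed one.
    # The small tolerance keeps exact ties from being dropped by rounding.
    return sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Rows: UpToDate AI, Google AI Overview. Columns: refused, answered.
p_value = fisher_exact_two_sided(19, 81, 6, 94)
print(f"two-sided p = {p_value:.4f}")
```

Under these assumptions the result lands near 0.009, below the 0.05 threshold, which is the gap from the reported P=0.10 that the authors would need to explain.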

The Compounding Asymmetries

Beyond the four stress points, Khalil’s review identifies design asymmetries that compound the picture without being fully addressed. Frontier models were evaluated via deterministic API outputs with temperature set to 0.0 and a fixed generation seed.

Clinical tools were evaluated via non-deterministic browser interfaces with hidden system prompts and dynamic retrieval mechanisms. The paper acknowledges this asymmetry but does not fully resolve what it means for the comparison.

UpToDate AI refused 19% of queries while frontier models refused 1–3%, and those refused responses were excluded from aggregate scoring. This means UpToDate’s aggregate score reflects only the subset of queries where the system was confident enough to respond, while frontier model scores reflect the full query distribution. Whether this affects the comparative result materially is not analyzed.
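A small sketch shows why the exclusion matters. The ratings below are invented, and scoring a refusal at the bottom of the 1–4 scale is only one possible convention, so this illustrates the sensitivity analysis a reader might want rather than anything the paper reports.

```python
from statistics import mean


def aggregate(scores, refusal_score=None):
    """Mean rating, where None entries mark refusals.

    refusal_score=None drops refusals from the average (exclusion).
    Any other value scores each refusal at that value (inclusion).
    """
    if refusal_score is None:
        kept = [s for s in scores if s is not None]
    else:
        kept = [refusal_score if s is None else s for s in scores]
    return mean(kept)


# Hypothetical ratings on a 1-4 scale for the same ten queries.
cautious_tool = [4, 4, 3, 4, 3, 4, 4, 3, None, None]  # refuses the two hardest
general_model = [4, 4, 3, 4, 3, 4, 4, 3, 2, 1]        # answers everything

print(aggregate(cautious_tool), aggregate(general_model))
print(aggregate(cautious_tool, refusal_score=1), aggregate(general_model, refusal_score=1))
```

In this toy case the ranking flips depending on the convention: the cautious tool leads when refusals are dropped and trails when they are scored at the minimum. The direction in the real data is unknown, which is the point: the paper does not report the comparison both ways.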

These asymmetries do not individually overturn the paper’s findings. Cumulatively, they describe an evaluation environment whose design features, taken together, created conditions that may have favored frontier models — and the paper does not fully account for that tilt.

The Disclosure Question Raised After Publication

OpenEvidence’s public response states that the study’s authors operate a competing in-house medical AI at their hospital, and that they had previously approached OpenEvidence requesting API access, including rights to build a competing product using OpenEvidence’s own infrastructure. OpenEvidence declined. The Nature Medicine paper appeared afterward. [4]

This is OpenEvidence’s claim, issued by a directly interested party, and we cannot independently verify it. If accurate, readers would reasonably expect this relationship to be disclosed or addressed in the paper’s competing interests section.

It was not. We leave readers to weigh this context against the methodological picture described above. The methodological concerns stand regardless of whether the allegation is accurate, but readers are entitled to know it exists.

What the Counter-Evidence Shows

A separate medRxiv study applied the same triage benchmark that previously exposed severe weaknesses in ChatGPT Health (51.6% undertriage of true emergencies) to OpenEvidence under identical conditions. OpenEvidence undertriaged 12.5% of emergencies, roughly a fourfold reduction.

It showed no social anchoring effect. Its errors skewed toward safer directions, and refusals occurred only in symptom-only prompts, never in urgent or emergency cases. [5]

This does not prove that OpenEvidence is superior to frontier models in clinical settings. It shows something narrower but more important for this discussion: benchmark selection in medical AI is never a neutral methodological choice, especially when the benchmark design includes public dataset exposure, self-judging panels, and frontier-sourced query distributions that may favor the very models being evaluated.

When a different benchmark applied by independent, conflict-free researchers produces a meaningfully different picture of the same tool, it becomes clear that neither study is definitive, and that the headline hierarchy is far more sensitive to evaluation design than the paper’s framing acknowledges.

The Governance Gap in Miniature

Famularo updated her teaching the day the paper appeared, which is what engaged clinicians should do when a Nature Medicine paper lands. Her update was calibrated and practically useful: RAG buys provenance, not correctness. Check the source. Don’t trust the absence of hallucination. These are good heuristics. She flagged the caveats. She called it a snapshot, not a verdict. [3]

The structured peer review that identified the paper’s pressure points arrived days later, from a specialist who had built a tool specifically designed to find them. Most papers do not get that scrutiny. Most clinical practice updates do not wait for it.

Publication, uptake, critique, too late. That sequence is not a failure of the clinician who updated her teaching, or the reviewer who found the flaws, or even the authors who published the paper. It is the operating condition of medical AI evaluation in 2026, and it will keep producing the same outcome until the field decides that the speed of the claim and the robustness of the validation have to move together.

What This Means

The paper’s headline finding may be directionally correct. Frontier models may genuinely perform better than current specialized clinical tools on the tasks this benchmark measures. The evaluation design, as it stands, cannot fully establish that claim, and it especially cannot establish the mechanistic conclusion that scale and alignment outweigh domain-specific tuning as determinants of medical competency.

What the paper establishes more clearly, and perhaps more durably, is the shakiness of the infrastructure on which it was built. It exposes a system in which performance hierarchies are generated from ordinal ratings despite very low inter-rater agreement (alpha of 0.10–0.20), one model is evaluated on a benchmark built by its own developer, and judgments are rendered by panels composed of the models under evaluation.

Benchmark exposure and interface asymmetries are acknowledged but left unquantified or unresolved, while survivorship bias in refusal exclusions and a post-publication conflict-of-interest allegation remain largely unanalyzed.

These are not incidental weaknesses in one paper. They describe the available toolkit for high-profile medical AI evaluation right now, at the same moment clinical AI tools are moving from research conversation into hospital procurement and daily clinical workflow.

The danger is not that one paper went viral. The danger is that clinical AI evaluation now produces practice-shaping claims faster than the field can audit the machinery behind them.

What Auditing Should Require

At minimum, auditing that machinery should require independent judge panels not drawn from the evaluated systems, pre-registration of benchmark contamination checks, and refusal-inclusive scoring that does not selectively filter the hardest queries out of the comparison.

None of these are technically difficult. They are choices. And until they become expected choices, the gap between claim speed and validation depth will remain a feature, not a bug, of how medical AI is evaluated and published.

References

[1] Vishwanath, K. et al. “General-purpose large language models outperform specialized clinical AI tools on medical benchmarks.” Nature Medicine (12 June 2026).
[2] Khalil, N. “Reviewer3 structured peer review of the above.” Publicly shared (June 2026).
[3] Famularo, M. LinkedIn response to the above (June 2026).
[4] OpenEvidence. “Public LinkedIn response and conflict of interest statement” (June 2026).
[5] Jia, E. et al. “OpenEvidence errs on the safe side in a structured test of triage recommendations.” medRxiv (April 2026).
[6] Kopanichuk et al. “How to Evaluate Medical AI.” arXiv:2509.11941 (2025).
[7] “A Scoping Review of LLM-as-a-Judge in Healthcare and the MedJUDGE Framework.” arXiv:2604.25933 (2026).

AI Can Write the Code. It Still Cannot Place the Stone.

Flamehaven Initiative - Thu, 11 Jun 2026 15:19:36 GMT

What a small release taught me about code, output, and the strange new craft of working with AI agents.

Boundary note:
This essay focuses on a growing class of AI-native projects where implementation can be delegated more heavily to AI, shifting human effort toward specification, abstraction, output design, and verification.

It is not an argument that coding expertise no longer matters. In security-critical, performance-sensitive, legacy, or production systems, implementation knowledge remains essential and often serves as a primary safety boundary.

The AI wrote the code. The output was still wrong.

An AI agent fixed a release for me.

That sentence sounds cleaner than the session felt. What actually happened was messier and more interesting.

The agent audited the documentation and found a public-output problem. It patched three files, bumped the version number, and wrote the changelog. It committed the changes, pushed to origin, and deleted the branch when it was done.

Mechanically, it did the sort of release work I would normally associate with a competent maintainer who had read the codebase, understood the files, and knew which surfaces needed changing.

I did not open an editor to write the patch. I did not walk through the renderer by hand. I looked at the output and knew something was still wrong.

The tool was not lying. That was the problem. It was telling the truth in the wrong language.

Before the fix, the Project Summary showed this:

| Coherence      | 0.8456                    |
| Coherence Mode | vr_structural (exact MST) |

After the fix, it showed this:

| Structure Coherence | 85% (Higher is more cohesive) |
| Coherence Check     | full                          |

The first version was technically accurate. vr_structural (exact MST) was not hallucinated jargon. It referred to a real internal strategy for measuring structural coherence, using a Minimum Spanning Tree over a dependency graph.

There are places where that phrase belongs. It belongs in the codebase where the strategy is selected. It belongs in architecture documentation where a contributor can inspect the method. It may belong in machine-readable output where a downstream agent or auditor needs the precise mode.

It did not belong in the first thing a user saw.

A first-time user did not need the internal algorithm. They needed to know whether the codebase was structurally coherent. The tool had exposed the right fact through the wrong surface. The patch the agent wrote was small. The implication was not.
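In code, the translation between the internal fact and the public surface can be very small. The sketch below is an assumption about shape, not the tool’s actual renderer: the function and dictionary names are hypothetical, and the only mapping taken from the release is the one shown in the before/after tables, where the exact-MST strategy is presented as "full".

```python
# Internal strategy names mapped to labels a first-time user can read.
# Only this one entry comes from the tables above; anything else would be a guess,
# so unknown modes fall back to a neutral label instead of leaking the raw name.
PUBLIC_MODE_LABELS = {"vr_structural (exact MST)": "full"}


def public_summary_rows(coherence: float, mode: str) -> list[tuple[str, str]]:
    """Build the human-facing rows from the internal score and mode."""
    return [
        ("Structure Coherence", f"{coherence:.0%} (Higher is more cohesive)"),
        ("Coherence Check", PUBLIC_MODE_LABELS.get(mode, "unspecified")),
    ]


for label, value in public_summary_rows(0.8456, "vr_structural (exact MST)"):
    print(f"| {label} | {value} |")
```

The internal name still exists and still travels through the system. It simply stops at the boundary where the reader changes.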

That was the moment the release stopped being about code.

The AI had written the patch. My job was to notice that the system had exposed the wrong truth to the wrong reader.

That distinction felt small in the diff and large everywhere else. The implementation was correct. The failure lived in placement, audience, and abstraction.

Once I started looking at the problem that way, it became harder to think of software as a sequence of functions and patches alone. The question was no longer who produced the artifact, but who decided where it belonged and how it should be presented.

That shift in perspective reminded me of an older kind of craft problem.

The right stone in the wrong wall

A medieval cathedral was not built by one pair of hands. Someone quarried the stone. Someone hauled it. Someone cut it. Someone carved faces and leaves into corners most visitors would never inspect. The work was distributed across bodies, tools, habits, guild knowledge, and a long chain of people whose names mostly did not survive.

But the cathedral did not stand because every stone was beautiful. It stood because someone knew which stone belonged where.

That is the part AI makes easy to miss. When the machine can cut stones quickly, the pile grows fast: patches, files, reports, releases. The temptation is to look at the pile and call it progress. Software, like a cathedral, does not fail only because the stone is fake. Sometimes the stone is real and still belongs somewhere else.

That was the bug. vr_structural (exact MST) was a real stone. It belonged in the system. It just did not belong in the first wall the user touched.

This is why the case matters. A lot of AI-generated work fails in the obvious way: fake concepts, fake certainty, impressive noise wrapped around nothing. This was not that. The system had produced something real, and the AI agent could modify it successfully, but the public surface still exposed the internal history of the tool rather than the user’s need.

The human contribution was not the ability to cut the stone faster. The machine had already done that. The human contribution was noticing that the stone was in the wrong wall.

That is where the analogy has to stop being decorative. The mistake was not only that one label appeared in one bad place — it was a sign of a larger movement.

Once the machine can produce more of the artifact, the human has to spend more time walking around the structure. They must check which surfaces face which readers, which details belong in the workshop, and which ones have accidentally been set into the public wall.

The five stages of the craft shift

The medieval analogy only works if it is not treated too romantically. A cathedral was not magic, and neither is software. The point is narrower: when machines change the cost of producing artifacts, human judgment often moves to a different part of the process.

I started to see the shift in five movements, although even the word “stage” makes the pattern sound cleaner than it felt.

Stage 1 is the moment the machine cuts the stone.

A prompt becomes a working script, a patch, a prototype, a release. Karpathy called one version of this “vibe coding”: the human gives in to the flow of generated code and can almost forget the code exists [2]. At low stakes, that can be useful and even liberating.

The danger begins when a running artifact impersonates an understood system. The reported Replit database incident is a useful warning case here because the problem was not only that an AI coding tool made a mistake. It was that an autonomous tool reportedly acted in a production-sensitive context, ignored a freeze, and left the human evaluating damage after the action had already happened [3].

Stage 2 is when the machine starts naming the workshop.

After enough collaboration, repeated decisions become private labels. Inside the session, those names save time. Outside the session, they begin to look like unexplained doctrine. A label that once meant “that thing we learned after three bad outputs” starts appearing as if it were public technical vocabulary.

Stage 3 is when the names harden into walls.

Changelogs, READMEs, diagrams, version numbers, and internal terms begin to form a structure. Some of the work may be real. Some of the code may be tested. Some of the design may be thoughtful.

Yet a first-time reader may still see a cathedral of private vocabulary and suspect the stones are decorative. This is close to the danger Mikkonen and Taivalsaari describe as AI-era cargo cult programming: artifacts may be trusted before the understanding behind them has become inspectable [1].

Stage 4 is when the human walks outside the building.

This is the painful movement. The practitioner stops asking only whether the artifact works and starts asking whether it survives contact with people and machines that were not present during its construction: a first-time user, a JSON consumer, a CI system, a downstream agent, a contributor, an auditor. The question changes from “Did the agent build it?” to “Can anyone else safely use what it built?”

Stage 5 is when the craft moves into the conditions of building.

This part is still unstable, but I think it is the important direction. The human is no longer valuable only because they can cut the stone by hand.

Instead, their value lies in defining constraints, exposing the right surfaces, and rejecting the wrong outputs. They preserve the reasoning trail and design the workshop in which agents can produce work that remains inspectable after the session ends.

That last movement is not “AI replaces the developer.” It is almost the opposite. The more the machine can build, the more valuable the person becomes who knows what should not be built, what should not be exposed, and what must remain checkable after the speed is gone.

The machine can cut the stone. Stage 5 is learning how to design the workshop.

That shift in responsibility surfaced through a recurring keyword in the release itself: output. Not code generation, not automation, not even correctness in the narrow sense, but output — what the system ultimately presents to a human or another machine, and whether that presentation carries the right level of meaning.

Once I started following that thread, the problem no longer looked like a documentation issue. It looked like an architectural one.

The JSON exposed the real architecture problem

At first, I thought the issue was only the public-facing phrase. Replace vr_structural (exact MST) with Structure Coherence. Turn 0.8456 into 85%. Move the internal algorithm label into documentation. Ship the patch. Close the loop.

That would have been a tidy story, and it would have been incomplete.

The deeper problem appeared when we looked at the JSON.

Before the fix, the text renderer computed actionable guidance inside the human-facing layer:

# Before: next_steps computed inside the renderer
def generate_text_report(result):
    steps = next_steps(result)  # computed here, lives here
    ...

After the fix, the guidance moved into the core result object:

# After: next_steps computed in the core layer
result.next_steps = next_steps(result)

# the human renderer reads result.next_steps
# the JSON serializer reads result.next_steps

This is the kind of change that looks boring if you only see the diff. A function moves. A result object grows a field. The renderer stops owning logic it should not own. No cinematic debugging scene. No heroic rewrite. Just the dull discovery that the system’s intelligence was sitting in the wrong layer.

The next_steps function was the part that told the consumer what to do next. It could point to a directory, identify a pattern, suggest a sweep command, and surface a priority file. Before the fix, that guidance appeared in the terminal report. The JSON, the layer an AI agent or CI system would consume, looked like this:

{
  "avg_ldr": 0.46,
  "avg_deficit_score": 23.1,
  "coherence_level": "vr_structural_approx"
}

A human got an action plan. The agent got a pile of numbers.

That is backwards.

The tool already knew enough to help. It had computed the recommendation. It had merely placed the recommendation where only one audience could see it. An AI agent consuming the JSON in a patch workflow would have to reconstruct intent from raw metrics, even though the human report had already been handed that intent in plain language.

This is not a cosmetic problem. It is a design confession. It says the tool still treats machine-readable output as an export format, not as a first-class consumer surface.

After the fix, the JSON carried the actionable guidance directly:

{
  "avg_ldr": 0.46,
  "avg_deficit_score": 23.1,
  "coherence_level": "vr_structural_approx",
  "next_steps": [
    "Reduce jargon density in auth/ module: 3 files exceed inflation threshold",
    "Run: slop-detector --sweep auth/ --focus inflation",
    "Priority: auth/audit.py (deficit score 41.2, above warn threshold)"
  ]
}

Now the agent can read the directory, the command, and the priority file without inventing an action plan from floats. This does not make the system magical. It makes the output useful in the exact place where it had been structurally incomplete.
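From the agent’s side, the difference looks something like this. The sketch parses the JSON shown above and pulls out the suggested command. The "Run: " prefix is a convention inferred from that single example, and the function name is hypothetical, so treat both as assumptions rather than a documented contract.

```python
import json

REPORT = """
{
  "avg_ldr": 0.46,
  "avg_deficit_score": 23.1,
  "coherence_level": "vr_structural_approx",
  "next_steps": [
    "Reduce jargon density in auth/ module: 3 files exceed inflation threshold",
    "Run: slop-detector --sweep auth/ --focus inflation",
    "Priority: auth/audit.py (deficit score 41.2, above warn threshold)"
  ]
}
"""


def suggested_commands(report_json: str) -> list[str]:
    """Return the commands a report recommends, if it recommends any."""
    report = json.loads(report_json)
    prefix = "Run: "  # assumed convention, taken from the example output
    return [
        step[len(prefix):]
        for step in report.get("next_steps", [])
        if step.startswith(prefix)
    ]


print(suggested_commands(REPORT))
# The pre-fix JSON had no next_steps field, so the agent gets nothing to act on.
print(suggested_commands('{"avg_ldr": 0.46, "avg_deficit_score": 23.1}'))
```

With the old payload the second call returns an empty list, and the agent is left to infer a plan from two floats and a mode name.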

That was the real fix. Not making the report friendlier. Not adding a field because JSON should have more fields. Moving actionable meaning into the layer that every renderer, human or machine, has to share.

The lesson came after the bug. It usually does.

The JSON issue and the jargon issue turned out to be the same problem viewed from different angles. In one case, actionable meaning was trapped inside a human-facing renderer. In the other, internal implementation language leaked into a user-facing surface. Both failures came from exposing the wrong information to the wrong consumer.

That realization led to a broader rule.

The three-channel rule

By that point, it was clear that the issue was larger than either the JSON fix or the terminology cleanup alone.

The phrase “jargon problem” makes the whole thing sound like a writing issue. It invites the wrong fixes: add a glossary, improve the README, write a friendlier explanation, maybe produce one of those onboarding diagrams that looks useful until a new user actually tries to follow it.

Some of that helps. None of it reaches the root.

The root is encapsulation. A system should expose what a consumer can use and hide what the consumer cannot use. Engineers know this when talking about APIs, classes, modules, and interfaces. It becomes strangely easy to forget when the interface is language.

For this kind of AI-native tool, the outputs need to be separated by consumer, while still sharing the same source of truth. The human-facing surface should show meaning and action in language a first-time user can survive.

The machine-facing surface should preserve fidelity, internal identifiers, raw metrics, precision values, and enough actionable guidance for a downstream agent to work without guessing. The documentation layer should explain the mechanism, including algorithm names, scoring formulas, assumptions, and failure cases.

The mistake is treating these as three writing tasks. They are projections from the same result object, or they should be.

The important part is not the diagram itself. It is the direction of dependency. Human reports, JSON contracts, and documentation should not each invent their own version of the result. They should expose different surfaces of the same underlying object. Once the renderer computes guidance that the JSON never receives, the system has already split its own truth.
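A minimal sketch of that dependency direction, assuming a single result object that both renderers read. The class and function names are hypothetical and the values are taken from the JSON example earlier in this piece. The documentation layer is left out because it describes the mechanism rather than rendering a result.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AnalysisResult:
    """Single source of truth that every output surface projects from."""
    avg_ldr: float
    avg_deficit_score: float
    coherence_level: str
    next_steps: list[str] = field(default_factory=list)


def render_text(result: AnalysisResult) -> str:
    """Human surface: action in plain language, no internal identifiers."""
    lines = ["Next steps:"]
    lines += [f"  {i}. {step}" for i, step in enumerate(result.next_steps, 1)]
    return "\n".join(lines)


def render_json(result: AnalysisResult) -> str:
    """Machine surface: full fidelity, carrying the same guidance."""
    return json.dumps(asdict(result), indent=2)


result = AnalysisResult(
    avg_ldr=0.46,
    avg_deficit_score=23.1,
    coherence_level="vr_structural_approx",
    next_steps=[
        "Run: slop-detector --sweep auth/ --focus inflation",
        "Priority: auth/audit.py (deficit score 41.2, above warn threshold)",
    ],
)

print(render_text(result))
print(render_json(result))
```

Neither renderer computes guidance of its own, so the two surfaces cannot drift apart. They can differ only in what they choose to show.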

In a normal developer tool, that might be a minor inconvenience. In an AI-native tool, it becomes structural. The machine consumer is no longer a secondary export target. The agent may be the thing that closes the loop. If the JSON is less actionable than the terminal output, the system has made the agent do more inference than the human.

That is the wrong direction.

The AI agent should not be a second-class consumer of the output.

I dislike how polished that sentence sounds. I still think it is true.

What the human role becomes

The obvious objection to all of this is fair: if the human did not write the code and does not fully understand the implementation, why should anyone trust their architectural judgment?

The answer cannot be that the AI explained it well. That answer is how people get trapped in Stage 2 with better vocabulary. The model can explain garbage with a steady voice. It can summarize a broken design in language that feels calm enough to pass for competence. Fluency is not verification.

The only answer I can defend is narrower than the phrase may suggest. The human builds a local form of verified understanding through repeated contact with artifacts that push back.

In this case, the judgment was not “I understand the coherence algorithm.” That would be too strong. The judgment was smaller and more inspectable: the same result was being exposed differently across the terminal report, the JSON contract, and the documentation layer.

The human-facing surface had become more actionable than the machine-facing surface. The internal algorithm name had leaked into a place where the reader needed meaning, not mechanism.

Those are not claims about the entire system. They are claims about mismatches between intent, output surface, and consumer.
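Claims of that size can be checked mechanically. The sketch below is one possible shape for such an audit, not the project's own tooling: `audit_surfaces`, the payload layout, and the term `ldr_coherence` are all assumed for illustration.

```python
def audit_surfaces(human_text, json_payload, internal_terms):
    """Return a list of mismatches between the human and machine surfaces.

    human_text:     the rendered terminal report (str)
    json_payload:   the parsed JSON contract (dict with a "findings" list)
    internal_terms: mechanism-level names that should not reach a human reader
    """
    problems = []

    # Check 1: internal vocabulary leaking into the human-facing surface.
    for term in sorted(internal_terms):
        if term in human_text:
            problems.append(f"internal term leaked to human report: {term}")

    # Check 2: findings that leave the agent with numbers but no guidance.
    for finding in json_payload.get("findings", []):
        if not finding.get("guidance"):
            problems.append(
                f"JSON finding has no guidance: {finding.get('internal_id')}"
            )

    return problems


# Hypothetical report and contract that exhibit both mismatches.
human_report = "ldr_coherence 0.41: replace placeholder functions with real logic"
contract = {"findings": [{"internal_id": "ldr_coherence", "score": 0.41}]}

for problem in audit_surfaces(human_report, contract, {"ldr_coherence"}):
    print(problem)
```

The check knows nothing about how the score is computed. It only compares what each consumer receives, which is the narrow kind of judgment described here.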

This is not traditional software engineering. It is also not passive prompting. It sits in an uncomfortable middle: a narrow audit practice built from repeated collisions between what the system claims, what it exposes, and what its consumers can actually use.

A practitioner may not be able to re-implement the coherence algorithm from memory. But a deeper understanding builds through the steady accumulation of friction — by seeing enough broken reports, flawed diffs where scores change for the wrong reasons, and places where the renderer lies by omission.

Spend enough time with JSON outputs that strand the agent with raw numbers, and you develop a sharp diagnostic intuition. You build a reliable sense of exactly when the system is presenting the wrong abstraction.

That sense has to be earned. It cannot be claimed because the project has a name.

Sapkota, Roumeliotis, and Karkee distinguish vibe coding from agentic coding by emphasizing autonomy, execution models, safety mechanisms, feedback loops, and the changing role of the developer [4]. Huang and colleagues, studying experienced developers using AI agents, describe professionals as controlling agents through planning and supervision rather than simply “vibing” with them [5].

Hoda argues that agentic software engineering needs to move beyond coding toward a whole-process view of roles, values, vocabulary, and socio-technical practice [6]. Hassan and colleagues describe a dual structure of software engineering for humans and software engineering for agents [7].

Those papers do not give this role a settled name. That is probably honest. The work itself has not settled yet.

What is clear is that the scarce work starts moving around the code. Someone has to specify the target precisely enough that the AI can build the right thing. Someone has to decide whether the output is at the right abstraction level for the right consumer.

Someone has to set principles that survive across renderers, JSON, documentation, and future patches. Someone has to verify that the system still behaves the way the team believes it behaves. Someone has to translate internal vocabulary before it calcifies into public nonsense.

That work is not always glamorous. It is often closer to review, refusal, naming, and boundary-setting than to the old image of programming as direct creation.

The workshop after code generation

The industrial analogy is useful only if it is kept uncomfortable. Mechanized weaving did not automatically elevate human work. One answer was to make people cheaper attendants to the machine, reducing judgment to supervision and supervision to endurance.

That was the bad answer, and it has a modern software version: give developers AI tools, demand more output, measure the increase in tickets and lines, and pretend the organization has become more advanced because the machine is moving faster.

The better answer is harder. It asks what human judgment is still needed for once the machine can produce the artifact: pattern design, quality control, defect recognition, process boundaries, and refusal. The loom mattered, but so did the person who knew whether the cloth was wrong.

AI can produce code. That does not remove judgment. It moves judgment to a different location in the process, and many organizations will miss that because the old signals of contribution — lines written, functions implemented, bugs fixed by hand — are easier to count.

So the portfolio changes too, although that word already feels too polished for what I mean. In this kind of work, the evidence of contribution is not only the code that survived in the repository.

It is also the constraints the human gave the agent and the outputs they rejected. It is the review trail showing where fluent results were narrowed, corrected, or thrown away, and the written intent that made a future run auditable.

A practitioner should be able to show what they refused to let the agent produce.

This is why writing starts to matter more, not less. Not marketing writing, not framework naming, not decorative documentation written after the system is already built.

The important writing is closer to intent under pressure. It defines what the system is allowed to do, what it must not expose, which output belongs to which consumer, what counts as a failed result, and what must be checked again when the agent changes the implementation.

That writing becomes part of the software boundary. It is not outside the codebase. It is one of the ways the codebase stays governable.

The dangerous version is delegation without understanding. The useful version is delegation with verified understanding, even when that understanding is local, partial, and tied to one system’s behavior. The hard part is that it has to be maintained after the exciting part is over.

It is not a trophy. It rots.

Stage 5 can stay unnamed for now.

References

[1] Mikkonen, T. & Taivalsaari, A. (2025). Software Reuse in the Generative AI Era: From Cargo Cult Towards AI Native Software Engineering. arXiv:2506.17937.

[2] Karpathy, A. (2025). Original vibe coding formulation. X, February 2025.

[3] Tyson, M. (2025). AI coding platform goes rogue during code freeze and deletes entire company database. Tom’s Hardware, July 21, 2025.

[4] Sapkota, R., Roumeliotis, K. I. & Karkee, M. (2025). Vibe Coding vs. Agentic Coding: Fundamentals and Practical Implications of Agentic AI. arXiv:2505.19443.

[5] Huang, R., Reyna, A., Lerner, S., Xia, H. & Hempel, B. (2025). Professional Software Developers Don’t Vibe, They Control: AI Agent Use for Coding in 2025. arXiv:2512.14012.

[6] Hoda, R. (2025). Toward Agentic Software Engineering Beyond Code: Framing Vision, Values, and Vocabulary. arXiv:2510.19692.

[7] Hassan, A. E., Li, H., Lin, D., Adams, B., Chen, T-H., Kashiwa, Y. & Qiu, D. (2025). Agentic Software Engineering: Foundational Pillars and a Research Roadmap. arXiv:2509.06216.

The Quality Author: Taste as the Last Bottleneck in AI Development

Flamehaven Initiative — Sat, 06 Jun 2026 11:07:47 GMT

On where craftsmanship went, why verification gaps appear in its absence, and the one practice AI cannot automate for you.

From Writing Code to Judging Code

There is a sentence that has been circulating through developer circles since 2026, attributed to Steve Yegge and repeated in discussions about AI-assisted software development.

People repeat it with the particular enthusiasm reserved for things that make them feel slightly less implicated in what they have already been doing for months(1):

“Code is a liquid. You spray it through hoses. You don’t freaking look at it.”

The wording matters less than the reaction it provoked. It was not really a reaction at all. It was recognition, the slightly unpleasant kind that surfaces when a sentence names something you had been doing without admitting it.

A request goes into a prompt. Code comes back, sometimes a surprising amount of code. You skim it. Maybe you run it. Maybe you do not.

The tests pass, or enough of them pass that nobody wants to be the person delaying the merge. CI turns green. The branch disappears into main.

Weeks later production starts behaving strangely and somebody opens a file that nobody has examined carefully since the day it was generated. Then the unpleasant part begins. Not the bug itself. The archaeology.

You start reading through functions that seem reasonable in isolation. Every individual decision appears defensible. Variable names are clean, comments are present, error handling exists. Yet after an hour you still cannot answer a basic question:

Why does this subsystem exist in this shape?
Why is this abstraction boundary here instead of somewhere else?
Why is retry logic duplicated in three places with slightly different behavior?
Why does the architecture feel assembled rather than designed?

Nobody knows with confidence because the code entered the repository through a sequence of prompts, edits, regenerations, partial rewrites, and hurried approvals. It works, mostly. Understanding never fully caught up.

That is the shift many people are describing without naming directly. Most conversations about AI-assisted development focus on generation because generation is visible. Screenshots are visible, benchmarks are visible, revenue charts are visible. Understanding leaves almost no artifact behind except the ability to explain a system when it starts failing under conditions nobody anticipated.

Andrej Karpathy’s AI Ascent 2026 remarks landed in the middle of this transition.(2) The headline many people remember is that he moved from writing most of his code himself to having agents produce most of it within a remarkably short period.

The more interesting observation came later: he argued that LLMs automate what can be verified while human judgment remains necessary for deciding what is worth building in the first place. People nodded.

The problem is that obvious statements often conceal the thing that matters.

Karpathy’s observation carries a fairly brutal implication. For decades, software organizations behaved as though production was the scarce resource, structuring hiring plans and management rituals and release processes around increasing output. Then the cost of producing code dropped dramatically. Not to zero, but enough to expose something awkward.

The bottleneck did not disappear. It moved. Suddenly there was an abundance of artifacts and a shortage of people willing to interrogate them deeply enough to know whether they deserved to exist. That shortage is harder to measure, which is exactly why it matters.

The Speed Feels Like Quality

The acceleration was real and pretending otherwise is pointless. By the mid-2020s, AI coding assistants had become routine across professional development workflows, with platform reporting and industry surveys showing broad adoption among working developers.(3)

Industry discussions increasingly treated AI-generated code as a normal part of development rather than an exception. Products like Lovable, Bolt, and Replit made it possible for people with little or no programming background to assemble functioning applications through conversation.

Lovable alone reportedly crossed $100M ARR while reporting more than 100,000 projects generated per day.(4) Reports of extraordinary product velocity became common enough that people stopped questioning the timelines.

Then maintenance showed up. It always does.

This time it arrived disguised as something small and irritating: a permissions bug nobody could reproduce consistently, a dependency conflict introduced months earlier by a generated patch nobody remembers reviewing, or a memory leak that only appeared under traffic patterns absent from staging.

You start investigating what should have been a twenty-minute issue. Six hours later you are staring at a call chain looping through generated abstractions layered on top of generated abstractions. Every layer appears sensible. The whole thing feels wrong in a way you cannot immediately articulate.

Engineers describe these systems as ownerless, and that word captures something important. Traditional codebases can be ugly, chaotic, overengineered, underengineered, or all of the above, but they often contain evidence of struggle.

You can see where somebody fought with a problem, you can see compromises, you can see scars. AI-generated systems frequently arrive polished in a way that feels strangely detached from the decisions embedded inside them. The surface is smooth. The reasoning is harder to find.

Research has been pointing in a similar direction. A large-scale study covering more than half a million Python and Java samples found that AI-generated code tended toward repetition and simplicity. It also exhibited maintainability and security concerns at higher rates.(5)

Stanford-affiliated researchers found something even more unsettling: developers using AI assistance often produced less secure code. They simultaneously expressed greater confidence in its security.(6) The exact percentages matter less here than the pattern. The code looked convincing enough that inspection felt optional.

That feeling is where the cost begins. It is also where speed stops being a workflow advantage and starts becoming a philosophy of production.

The Technique Problem

Alex Wennerberg reached for Jacques Ellul’s idea of technique to explain why all of this felt familiar. (7) Ellul is difficult to summarize because the argument resists compression into the kind of slogan modern technology culture prefers.

Roughly, technique describes what happens when efficiency stops functioning as a tool and quietly becomes the objective itself. The metric survives; the thing being measured slowly disappears behind it.

Wennerberg applied the idea to music first, noting that streaming platforms don’t optimize for artistic depth but for measurable behavior: engagement, retention, listening time. Music enters one side of the machine and exits the other as a collection of signals. AI-generated music fits neatly into that environment because the surrounding system already rewards endless quantities of acceptable output.

Software has drifted into similar territory. It is difficult to ignore how much of the industry’s language revolves around throughput: velocity, deployment frequency, time-to-market, story points.

None of these measures are inherently misguided. The problem begins when they become proxies for quality and then quietly replace quality altogether. AI looks miraculous inside a system optimized primarily for output because output is exactly what it produces in abundance.

The surprise comes later, usually after enough generated software accumulates that somebody has to maintain it, and organizations discover that code was never the scarce resource. Understanding was.

Wennerberg’s conclusion remains one of the better summaries:

“We can continue to do not very good software much more quickly and effectively with AI.”

But AI cannot solve the main systemic problem in the software industry, which is that we still haven’t figured out how to build software well at scale.

Doing that requires a sense of craft and real human critical thought. What interests me is that a similar conclusion emerged from people who care about entirely different things.

Bessemer Venture Partners approached the issue through competitive advantage rather than craftsmanship and still ended up circling around judgment and taste. (8)

That convergence is difficult to dismiss, because it suggests the same thing from two directions: when production becomes cheap, craft does not vanish. It relocates.

Craftsmanship Didn’t Disappear. It Moved.

If production is no longer the only place where human effort appears, craft has to be looked for somewhere else.

Craftsmanship did not disappear when AI started writing code. It moved from production into acceptance.

That acceptance layer is where taste becomes visible. Not taste as aesthetics, preference, or polish, but taste as the disciplined refusal to accept plausible output until it has survived inspection.

I learned this most clearly from a failure that did not look like a failure at first.

In our work on AI-SLOP-Detector, I was investigating a scoring system that appeared healthy.(9) The outputs looked plausible, the tests passed, metrics moved in expected directions, nothing was visibly broken. If you had shown me the dashboard without context I probably would have approved it and moved on.

Instead I spent several nights digging through it because something felt off, and that sentence sounds mystical written down, but it wasn’t. There was no magical intuition involved. The discomfort came from tiny inconsistencies that refused to disappear: scores clustering too neatly, certain edge cases behaving suspiciously well.

Every time I convinced myself the system was fine, another detail surfaced that made the explanation less satisfying.

The debugging process was ugly in the way debugging always is but rarely gets documented. There was no elegant detective moment.

Mostly, it meant opening files, tracing data paths, dumping intermediate outputs, and discovering that previous assumptions were wrong. It meant rewriting scripts, rerunning experiments, and repeatedly ending up back where I started.

I became convinced the issue was buried in preprocessing. It wasn’t. Then I blamed the evaluation pipeline. Wrong again. Hours vanished into dead ends until eventually the explanation emerged: the model had effectively learned to reproduce the scoring formula used to generate its labels.

It looked intelligent because it was mirroring the structure of the training process. The postmortem summarized this with a sentence that was funny only because it was painfully accurate: we trained a calculator.
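One way to test for this failure, sketched under assumptions: if the labels came from a known formula, compare the model's predictions with that formula on fresh inputs. `scoring_formula`, `suspect_model`, and the feature names below are hypothetical stand-ins, not the AI-SLOP-Detector code. A model that matches the formula almost exactly everywhere has probably learned the formula rather than anything about the code being scored.

```python
import random


def scoring_formula(features):
    """Hypothetical rule that was used to generate the training labels."""
    return 0.6 * features["density"] + 0.4 * features["coverage"]


def suspect_model(features):
    """Stand-in for a trained model; this one has learned the formula itself."""
    return 0.6 * features["density"] + 0.4 * features["coverage"]


def max_gap_from_formula(model, formula, n_samples=1000, seed=0):
    """Largest absolute gap between model and formula on random fresh inputs."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(n_samples):
        features = {"density": rng.random(), "coverage": rng.random()}
        worst = max(worst, abs(model(features) - formula(features)))
    return worst


gap = max_gap_from_formula(suspect_model, scoring_formula)
if gap < 1e-6:
    print("model is indistinguishable from the label formula")
```

A near-zero gap is a warning sign, not proof. It says the labels carry no information beyond the formula, which is the "calculator" outcome.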

The lesson was not that the system failed, because everything fails. The lesson was how long the failure remained invisible because the outputs looked convincing enough to discourage deeper inspection. That distinction matters more now than it used to. The labor increasingly happens after generation.

What “Taste” Actually Means

That is why the word taste needs to be handled carefully. It gets thrown around constantly in AI discussions and usually means almost nothing. It functions as a flattering placeholder for qualities people cannot define. Somebody produces good work, somebody else calls it taste, conversation ends. That definition is useless.

The version that matters in engineering has very little to do with aesthetics and a great deal to do with irritation. Specifically: the inability to stop investigating after everyone else has become satisfied. I keep thinking about a GitHub discussion titled Every Claim, Verified Against Source Code, (10) which was almost aggressively boring as an exercise.

Public claims checked against implementation details, files opened, line numbers traced, assertions treated as things requiring evidence rather than things requiring agreement. A table with columns for claim, verdict, source file, line range. Nothing about that table is exciting, and that is precisely the point.
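The table is simple enough to sketch. The version below assumes only the four columns named above; the `ClaimCheck` class, the verdict strings, and the example rows are hypothetical and are not taken from the discussion itself.

```python
from dataclasses import dataclass


@dataclass
class ClaimCheck:
    claim: str
    verdict: str       # assumed vocabulary: "confirmed", "refuted", "unverifiable"
    source_file: str
    line_range: str


def render_table(checks):
    """Render claim checks as a plain-text table with aligned columns."""
    header = ("Claim", "Verdict", "Source file", "Lines")
    rows = [header] + [
        (c.claim, c.verdict, c.source_file, c.line_range) for c in checks
    ]
    widths = [max(len(row[i]) for row in rows) for i in range(len(header))]
    return "\n".join(
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in rows
    )


# Hypothetical rows, for illustration only.
checks = [
    ClaimCheck("Scores are clamped to 0-100", "confirmed", "scoring.py", "40-52"),
    ClaimCheck("Config is validated on load", "refuted", "config.py", "10-31"),
]
print(render_table(checks))
```

Each row forces a file and a line range next to the verdict, so a claim cannot be marked confirmed without pointing at the code that confirms it.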

Verification is usually tedious. It involves reading things you hoped you would not need to read: documentation that is wrong, comments that are outdated, assumptions that are unsupported, previous confidence that was misplaced.

People claim verification is important and then structure their workflows around avoiding it because verification is where comforting narratives go to die. Taste, at least in the form that survives contact with real engineering work, is the habit of continuing anyway.

Not because you enjoy it, but because experience has taught you that plausibility is cheap.

The Verification Gap Is the Bottleneck

Whenever concerns about AI quality surface, somebody proposes a tooling solution: better scanners, better CI, better automated review systems. None of those are bad ideas, and most are useful, but the problem is that they arrive too late in the chain.

Before any tool runs, somebody has to decide that scrutiny is necessary. That sounds trivial until you remember what software development actually feels like under pressure.

Deadlines move closer, stakeholders want updates, the rollout appears stable, metrics look healthy, you are tired, and the temptation to stop asking questions becomes overwhelming.

I remember lowering a deployment threshold because everything appeared fine. The tests passed, monitoring looked normal, nothing suggested immediate danger.

Months later, while investigating something completely unrelated, I discovered that the threshold had been quietly shielding the system from edge cases I had never examined properly.

There was no catastrophic outage, no dramatic lesson, just the slow realization that I had mistaken the absence of visible failure for evidence of understanding. Those are not the same thing, and a lot of AI-assisted development feels trapped inside that confusion.

That is the point where this stops being a tooling problem and becomes a judgment problem.

Karpathy’s observation keeps resurfacing because it points directly at the problem: models can automate what can be verified, but they cannot decide what deserves verification.

They cannot determine whether a benchmark measures something meaningful, and they cannot inherit the maintenance burden waiting six months down the road when assumptions collide with reality. (2)

Humans inherit that burden whether they want to or not. Research keeps finding variations of the same pattern. A 2025 study examining experienced open-source developers found that participants expected AI assistance to save substantial amounts of time.

In practice, debugging and correction often consumed much of the anticipated gain. (11) That result makes sense to anyone who has spent a night chasing a production issue through generated code.

Real debugging rarely resembles the clean narratives people tell afterward. Logs contradict each other, instrumentation turns out to be incomplete, the bug migrates every time you think you have isolated it, you patch one issue and expose another.

By three in the morning you are questioning assumptions that seemed obvious at midnight. By sunrise the original problem has become tangled with several others and you are no longer entirely sure which one started the investigation.

Verification is difficult because reality resists simplification.

The Deeper Asymmetry

People increasingly describe developers as orchestrators of intelligent systems. Maybe that description is accurate. What bothers me is how easily the metaphor creates distance between builders and consequences. Eventually somebody still has to read the logs.

Still has to explain why production diverged from testing. Still has to figure out why memory consumption exploded after traffic crossed a threshold nobody modeled correctly.

The details never disappear. They wait.

This is why craftsmanship remains a more useful concept than productivity. Productivity measures output. Maintenance measures responsibility. The second category has a habit of reasserting itself no matter how sophisticated the tooling becomes. Wennerberg quotes Jonathan Blow’s complaint that the industry has forgotten how to do things. (7)

That observation lands differently now. AI can conceal the forgetting process for surprisingly long periods. A developer can remain productive while understanding less and less of the machinery moving beneath the surface. Demos succeed, launches succeed, quarterly reports look healthy. Then maintenance arrives and starts demanding explanations. Explanations are harder to generate than code.

The developers who remain valuable in that environment will probably leave behind evidence that they inspected things. Not polished declarations about quality. Evidence: audit trails, rejected implementations, strange tests written specifically because somebody distrusted an assumption. Notes documenting why a seemingly reasonable solution was discarded after deeper investigation revealed a flaw.

Those artifacts matter because they reveal proximity. Somebody stayed close enough to the system to notice where reality diverged from appearances.

Perhaps that is where authorship survives. Not in generation, because generation is abundant now. In acceptance. The moment you decide a generated artifact is good enough to carry your responsibility, it stops being the model’s problem and becomes yours.

Everything unpleasant follows from that decision: the doubt, the investigation, the corrections. The nights spent tracing behavior through systems that looked perfectly reasonable when they were first merged.

Scarcity has moved. The code is easier. Living with it is not.

References

1. Tim O’Reilly, Software Craftsmanship in the Age of AI, O’Reilly Radar, 2026. Includes discussion of remarks attributed to Steve Yegge.
2. Andrej Karpathy, Sequoia AI Ascent 2026, karpathy.bearblog.dev, 2026. Widely summarized through event notes and secondary reporting.
3. General adoption trend based on publicly available platform reporting and industry surveys from the mid-2020s. No single source cited; GitHub Octoverse and Stack Overflow Developer Survey editions from this period are representative. See GitHub Blog.
4. Lovable hits $100M ARR, Sifted, 2025.
5. Domenico Cotroneo et al., Human-Written vs. AI-Generated Code: A Large-Scale Study of Defects, Vulnerabilities, and Complexity, arXiv:2508.21634, 2025.
6. Neil Perry et al., Do Users Write More Insecure Code with AI Assistants?, arXiv:2211.03622v3, CCS 2023.
7. Alex Wennerberg, AI Code and Software Craft, alexwennerberg.com, 2026.
8. Lindsey Li et al., Developer Laws in the AI Era, Bessemer Venture Partners Atlas, 2025.
9. flamehaven01, The Tool That Turned on Itself, dev.to, 2026.
10. flamehaven01, AI-SLOP-Detector v3.50: Every Claim, Verified Against Source Code, dev.to, 2026. Used as an illustrative example of verification practice rather than independent evidence.
11. Joel Becker et al., Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity, arXiv:2507.09089, 2025.

“The Algorithm Did It”: How YouTube’s Liability Playbook Is Coming for Every Developer

Flamehaven Initiative — Thu, 28 May 2026 13:17:11 GMT

In 2024 and 2025, YouTube updated its monetization policies to explicitly exclude “repetitious” and “mass-produced” content from the YouTube Partner Program (1).

In practice, audio-only creators reported the same operational experience: policy violation notices, form-letter appeals, no human resolution. Horror podcasts. Radio dramas. Ambient soundscapes built over years.

The explanation, when it came at all, was simple: the algorithm decided.

This is not a story about YouTube. It is a preview.

This is not a productivity problem. It is a liability problem.

The distinction matters because every AI coding assistant you use today has already answered the liability question. They answered it in their Terms of Service. The answer is: not them. You.

The Algorithm as Shield

YouTube’s enforcement pattern has structural logic behind it, even if the specific internal motivations remain opaque.

One observed pressure is data quality. Google relies on YouTube’s corpus for training Gemini and NotebookLM. A 2024 Nature paper confirmed that when AI-generated outputs feed back into AI training data, model performance degrades. Researchers call this model collapse (2). Whether this specific pressure drove YouTube’s audio policy is not publicly confirmed, and no causal link has been established. The structural incentive exists and is consistent with the enforcement direction. That consistency is not evidence of cause.

A second pressure is platform identity. YouTube has completed a transformation from “Broadcast Yourself” into a cable network model. Shorts. Live commerce. Podcast video. Audio-only content does not fit this model for the same reason radio drama was never scheduled on television. The format exclusion is structural, not punitive.

The third pressure is the most consequential. The “AI decided” framing removes the obligation to explain, compensate, or negotiate. A human reviewer creates a paper trail. An algorithm creates a verdict. The distinction is not technical. It is legal.

This structure has a name in academic literature. A 2024 paper published on arXiv describes it as the “liability sink”: a human who ends up absorbing responsibility for consequences generated by a system they did not fully control and may not fully understand (3).

The liability sink is not the party who built the system. It is the party left holding it when something goes wrong.

YouTube’s creators are liability sinks. The platform is not.

The Same Structure, Arriving in Your IDE

In February 2025, Andrej Karpathy coined the term “vibe coding.” His definition was precise: fully give in to the vibes, embrace exponentials, and forget that the code even exists (4). He was describing a workflow for throwaway weekend projects. The industry adopted it for production systems.

Within months, Collins Dictionary named it Word of the Year 2025 (5). By September 2025, the backlash had arrived.

Fast Company reported the vibe coding hangover. Jack Zante Hays, a senior software engineer at PayPal working on AI development tools, described the failure mode clearly: “Code created by AI coding agents can become development hell.” (6)

The problem was not speed. The problem was that small codebases scaled until AI tools “break more than they solve” and no one understood what was underneath.

The data is harder to dismiss than any single opinion.

A METR study published in July 2025 found that experienced open-source developers using AI coding tools took 19% longer to complete tasks. They had predicted they would be 24% faster. They still believed afterward that they had been faster (6). The tools did not remove review cost. They relocated it and obscured it.

Veracode’s 2025 GenAI Code Security Report, analyzing over 100 LLMs, found that 45% of AI-generated code contains known security vulnerabilities. CodeRabbit’s December 2025 analysis of over 10 million pull requests found that AI co-authored code produced 1.7 times more major issues and 2.74 times more security vulnerabilities than human-written code (7).

Now read GitHub’s Terms of Service directly. GitHub provides its service “as is” and “as available.” It expressly disclaims all warranties including those of “accuracy and non-infringement”. The Copilot Product Specific Terms place the decision to use AI suggestions entirely on the developer: “It is entirely your decision whether to use Suggestions generated by GitHub Copilot.” (8)

The major AI coding assistants follow the same pattern. They disclaim responsibility for the outputs they generate. The developer who accepted and deployed that code does not get the same option.

Read those terms against the Veracode and CodeRabbit findings. The tool produces vulnerable code at measurable rates. The tool’s contract places acceptance entirely on the developer.

The chain of responsibility is already written.

AI tool provider: shielded by ToS
Platform (App Store, Google Play, Stripe, AWS): shielded by policy
The developer who accepted and deployed the code: most exposed in practice

This is the YouTube structure, applied to software development. The platform makes the decision. The algorithm explains nothing. The individual absorbs everything.

The App Store Problem That Has Not Arrived Yet

YouTube’s content moderation crisis is visible because creators document it publicly. The developer version will arrive with less warning.

When app stores, payment processors, and cloud providers complete their shift to automated risk scoring, the enforcement logic will be identical. An AI model flags the artifact. A policy violation is generated. The appeal queue returns a form letter. There is no human to reach.

The Tea app breach, reported in mid-2025, illustrates the accountability pattern already in place (6). The platform was a women’s safety application. It left an unsecured cloud database containing 72,000 sensitive images exposed to anyone who looked. The root cause was standard Firebase misconfigurations and broken API authentication.

Whether this specific failure was caused by AI-assisted development practices is not established. What it demonstrates is a structural pattern that predates vibe coding and is now amplified by it. When code is shipped without systematic review, public liability falls on the operator and developer side, not the tool provider side.

The regulatory layer is moving to formalize this pattern. A 2025 PwC report identified “accountability gaps as autonomy increases across AI agents and humans” as a primary emerging risk category (9).

The surface area for human recourse shrinks exactly as the volume of AI-generated submissions grows.

The Craftsman’s Seal

The platform accountability structure described above has a historical precedent. It also has a historical response.

Before industrial manufacturing, craftsmen put their mark on everything they made.

A silversmith’s hallmark. A carpenter’s stamp. A tailor’s label sewn into the lining.

These marks were not branding exercises. They were liability instruments. If the work failed, the mark told you who was responsible. The maker could not hide behind a factory. The maker was the factory.

Industrial scale ended that accountability structure. Mass production made individual attribution impractical. This was an acceptable trade because manufacturing processes were standardized, inspectable, and reproducible.

AI-generated code is not inspectable in the same way. It is probabilistic. It is context-dependent. It does not have a consistent failure mode.

The 2025 Veracode study found that larger AI models were not more secure than smaller ones (7). A 2026 study from University of Missouri and SRI International found that AI agents claiming to require three dependencies for a project often required 13.5 times more at runtime (10). The code is a starting point, not a deliverable.

This is the condition that turns a developer into a liability sink. They accept an artifact they cannot fully inspect, under terms that assign them full responsibility for it.

This is why the craftsman’s seal is returning. Not as nostalgia. As competitive infrastructure.

An ISACA executive, writing in 2025, stated it directly: “We still have to be heavily accountable and responsible for the code that we’re using and generating.” (11)

McKinsey’s September 2025 analysis concluded that humans will “move from executing activities to owning and steering end-to-end outcomes.” (12) The execution is delegated. The accountability is not.

Andrew Ng rejected the vibe coding framing explicitly in May 2025: “When I’m coding for a day with AI coding assistance, I’m frankly exhausted by the end of the day. It’s a deeply intellectual exercise.” (4)

His teams use AI constantly. But they review and understand every line. That is not vibe coding. The difference is accountability. Accountability, in a codebase, means a human wrote a specification before the agent executed anything.

The developers who build this accountability record are building something that AI cannot replicate: a named, verifiable record of human judgment applied to a specific codebase.

This is not a portfolio. It is a liability instrument. In a market flooded with anonymous AI-generated code, it is the rarest thing available.

What Survives

Here is the honest version of what happens to commodity vibe-coded SaaS.

It does not get banned. It gets commoditized to zero.

When anyone can generate a functional CRM in four hours with a prompt, the CRM is not a product. It is a starting point. The actual product is the trust layer: the guarantee that a person who understands the code is reachable, accountable, and responsible if it fails.

Analysts predict $1.5 trillion in technical debt by 2027. The driver is the “code first, understand later” approach that vibe coding normalized at scale (6). Over 8,000 startups have been reported to need rebuilds or rescue engineering. Total cleanup costs are estimated between $400 million and $4 billion (6).

Anonymous vibe-coded apps will flood distribution platforms. Algorithmic review will use opaque risk scores to manage that volume. Apps with no documentation, no verifiable human authorship, and no accountability signal will be treated exactly as YouTube treated faceless audio channels. Deprioritized, flagged, and eventually removed. Not because they broke a specific rule. Because the risk model could not verify they were safe.

Simon Willison stated it plainly: “Vibe coding your way to a production codebase is clearly risky. Most of the work we do as software engineers involves evolving existing systems, where the quality and understandability of the underlying code is crucial.” (4)

The developers building an accountability record now are not being cautious. They are building the only brand that has value in a market where anonymous AI-generated code is free.

Conclusion: The Liability Sink, or the Craftsman

Every developer working with AI tooling today faces a structural choice.

Option one: optimize for throughput. Ship fast, ship volume, stay anonymous, and hope the algorithm never turns. This is the YouTube audio creator model. The risk is not that the work is bad. The risk is that the platform’s risk model cannot distinguish good from bad, and will eventually treat all anonymous volume as undifferentiated liability.

Option two: accept the accountability structure that AI tool providers have already written into their terms of service. They will not stand behind the code. The developer must. Make that explicit. Make it visible. Make it the core value proposition.

The craftsman’s mark was not a gesture. It was a claim: I made this, I understand it, I stand behind it.

The craftsman’s record is the only signal that exits the liability sink. Not because it proves the code is flawless. Because it proves a named human accepted responsibility before the platform’s risk model had to.

In a world where the algorithm handles everything else, that claim is worth more than anonymous code.

References

The Meeting Nobody Could Follow: The format of AI output is a design decision.

Flamehaven Initiative (2 followers) - Tue, 19 May 2026 06:10:20 GMT

The Meeting Nobody Could Follow: The format of AI output is a design decision. We got it wrong for three years.

How a single post from an Anthropic engineer changed the way our team shares AI work.

Our team runs fast. Everyone uses AI — for code review, architecture decisions, issue triage, sprint planning. The individual work is solid. The outputs are good.

The problem shows up in the meeting.

Someone opens a PR and shares the AI-generated action plan in the standup. It’s a .md file, 200+ lines, logically structured, accurate. The engineer who ran it knows exactly what's in it.

The two people looking at it for the first time are scrolling, skimming, trying to locate what matters while the conversation moves on.

Our lead eventually just asked: “Can you highlight what’s actually blocking us right now?”

That wasn’t a knowledge gap. Everyone in the room was technical. It wasn’t a preparation gap either — the work had been done well. The gap was between the person who’d been living in that context and everyone else trying to enter it in 90 seconds from a flat text file.

That’s a format problem. And it compounds every time AI-generated work crosses from one person’s context into a shared one.

The Post That Reframed It

On May 8, 2026, Thariq Shihipar, an engineer on the Claude Code team at Anthropic, posted this on X: (1)

“HTML is the new markdown. I’ve stopped writing markdown files for almost everything and switched to using Claude Code to generate HTML for me. This is why.”

The post linked to a companion site: 20 self-contained .html files, each one an agent-generated artifact covering a different category of engineering work. No build step. No framework. Just a file you open in a browser.

The line that stopped us was from Thariq’s framing of the Code Review category:

“Diffs and call-graphs are spatial information; markdown flattens them.”

That was the exact problem we’d been circling for weeks without naming it. Our action plans weren’t hard to read because they were long.

They were hard to read because the information inside them was spatial — priority relationships, status changes, size deltas, dependency chains — and we were delivering it as a flat sequence of text.

Simon Willison, whose writing on developer tooling is widely followed, read the piece the same day and wrote that it caused him to reconsider his three-year default of asking for everything in Markdown. (3)

His note: Markdown won because of constraints — the 8,192-token GPT-4 era, where every character counted. Those constraints are largely gone. The reasoning that locked in the default hasn’t been re-examined.

That evening, we ran our own action plan through Claude with one change:

“Output this as a standalone HTML file with priority-coded sections, status badges, and a visual summary header.”

Thirty minutes of iteration. The document we’d been failing to share effectively for weeks was suddenly something a new set of eyes could navigate in under a minute.

Why This Works: Source for Machines, Interface for People

Before going further, the actual rule is worth stating clearly, because the headline “HTML is the new markdown” gets misread.

This isn’t about replacing Markdown everywhere.

README files, commit notes, audit logs, agent-to-agent context passing — Markdown stays. It’s compact, diffable, searchable, and parseable without a browser. Those are real advantages that don’t disappear.

The shift is narrower: when AI-generated work reaches a human who needs to review, navigate, and act on it, Markdown hands the translation cost to the reader. HTML absorbs it into the document.

Karpathy replied to Thariq’s post with a practical note: (4)

“This works really well btw — at the end of your query ask your LLM to ‘structure your response as HTML’, then view the generated file in your browser.”

He added the underlying reason: roughly a third of the human brain is dedicated to visual processing — the 10-lane superhighway of information into the brain. Audio is the preferred input to AI. Vision is the preferred output from it.

That is why the format question matters beyond aesthetics. If AI output is increasingly something people have to review, navigate, and act on quickly, then the container is no longer a neutral wrapper.

A February 2026 Harvard Business Review study tracked 200 employees at a U.S. tech company over eight months and found that AI adoption intensified work rather than reducing it — 83% reported it increased their workload. (5)

The study doesn’t prove a format problem by itself. What it describes is the broader environment in which format problems become expensive: workers moving faster, taking on broader scope, with less time per handoff.

In that environment, a document format that requires translation before action isn’t a minor inconvenience. It’s a recurring tax.

The 20 Examples: A Map Worth Keeping

So what does a better container look like in practice?

Thariq’s companion site at thariqs.github.io/html-effectiveness is worth opening in a separate tab. It’s not a gallery; it’s a structured argument across 9 categories, each one showing where Markdown flattens information that HTML can spatialize. (2)

Here’s the full map for reference. Skim it — it’s a map, not a reading list. The three categories that actually changed our workflow are below the table.

Three categories directly changed how our team works.

Code Review (02). The annotated PR demo made the problem undeniable. Our action plans were spatial data — P0 vs P1 vs P2, open vs fixed, size deltas — delivered as a flat sequence. Moving to HTML turned priority into visual hierarchy, status into badges, and growth from 163 to 212 lines into an amber callout. The reviewer’s eye went to the right place without being guided there by more text.
Decks (06). A handful of tags and a little JavaScript become a slide deck you can arrow-key through in a meeting. No Keynote. No export. The charts can stay live, the regional breakdown can be filterable, and the presenter stops defending the format and starts defending the idea.
Custom Editors (09). This is the category most people miss. It isn’t about pretty reports. It’s about asking for a throwaway interface for one specific decision — triaging tickets, toggling feature flags, tuning a prompt template — then exporting the result as structured Markdown for the next agent call. The loop gets tighter.

How We Applied It in Practice

The next step was turning the rule into a workflow.

We didn’t replace Markdown. We added an HTML render layer on top of it.

Every agent run now produces two files. The raw .md stays for version control, diffs, and agent context — passing HTML between agents adds token cost without value. The HTML is what the team opens.

The status bar at the top shows Open / Fixed / Partial counts at a glance. P0 items have a red left border and a pulse. Fixed items are muted. A function that grew from 163 to 212 lines shows the delta in an amber callout — visible without reading the note. The filter bar lets anyone drill to just the P0 items in one click.
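The two-file pattern described above can be sketched in a few lines of Python. This is a minimal illustration, not the production renderer: the `Item` fields, the function names, and the CSS class names are all assumptions made for the example, and the filter bar and amber delta callout are left out to keep it short.

```python
from collections import Counter
from dataclasses import dataclass
from html import escape


@dataclass
class Item:
    item_id: str   # hypothetical identifier, e.g. "P0-1a"
    priority: str  # "P0", "P1", or "P2"
    status: str    # "OPEN", "FIXED", or "PARTIAL"
    title: str


def render_markdown(items: list[Item]) -> str:
    """Compact source of record: diffable and cheap to pass between agents."""
    lines = ["# PR Action Plan", ""]
    for it in items:
        lines.append(f"- [{it.priority}] [{it.status}] {it.item_id}: {it.title}")
    return "\n".join(lines) + "\n"


def render_html(items: list[Item]) -> str:
    """Standalone review surface: inline CSS only, no remote assets."""
    counts = Counter(it.status for it in items)
    # Status bar text, e.g. "Open: 1 / Fixed: 1 / Partial: 0".
    summary = " / ".join(
        f"{label.title()}: {counts.get(label, 0)}"
        for label in ("OPEN", "FIXED", "PARTIAL")
    )
    rows = []
    for it in items:
        css = escape(f"item {it.priority.lower()} {it.status.lower()}")
        rows.append(
            f'<li class="{css}"><strong>{escape(it.priority)}</strong> '
            f"[{escape(it.status)}] {escape(it.item_id)}: {escape(it.title)}</li>"
        )
    # P0 rows get a red left border; fixed rows are muted.
    style = (
        ".item{padding:4px 8px;list-style:none}"
        ".p0{border-left:4px solid red}"
        ".fixed{opacity:.5}"
    )
    return (
        "<!doctype html><html><head><meta charset='utf-8'>"
        f"<title>PR Action Plan</title><style>{style}</style></head><body>"
        f"<header>{summary}</header><ul>{''.join(rows)}</ul></body></html>"
    )
```

Both renderers read the same list of items, so the Markdown and the HTML cannot drift apart. Each row carries its priority and status as text as well as styling, which keeps the page usable without color.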

Below is the representative example we use. Content has been generalized — no internal identifiers — but the structure, priority system, and status logic are exactly what’s in production.

PR Action Plan v2.0 html dashboard

Same information. Different container. One standup dropped from 25 minutes to 12. The questions changed from “wait, which ones are actually blocking?” to “who’s taking P0–1a?”

Beyond the PR: Every Document That Has to Survive a Room

The dev PR use case is where we started. It’s not where this ends.

Think about the last time you sat through a presentation built in PowerPoint. Someone had spent hours on slides, exported a PDF, shared it over email. Half the room opened it on their phones and couldn’t read the charts. The presenter spent the first three minutes explaining the color coding. Someone asked about a specific number and the answer was “I’ll have to check the spreadsheet.”

That’s the same problem. Spatial information delivered as a fixed object, forcing everyone in the room to translate before they can respond.

A quarterly review as a single HTML file looks different. The slides are still there — arrow-key navigation, no build step. But the charts are live. Each person hovers over the number they care about.

The regional breakdown is filterable. A response field at the bottom lets stakeholders flag a concern or submit a priority vote before leaving the page. That input comes back as structured data the next agent call can act on. The meeting doesn’t end with action items in someone’s notebook. It ends with a file.

This is what “sending a document” becomes when the format is HTML. A PDF is a fixed object. An HTML file is an environment. The person receiving it navigates it on their own terms — without matching your reading pace, your zoom level, or your familiarity with the data. And if you’ve built in the export button, what they do inside it comes back to you.

The pattern holds wherever AI-generated work has to cross a context boundary. A business proposal needing sign-off from people who weren’t in the original conversation. A research summary that has to land with someone from a different discipline. A vendor comparison that three stakeholders need to filter differently. In each case, the question is the same: does the format help the next person enter the work, or does it make them translate it first?

The Tradeoff, Honestly

Every format shift moves cost somewhere. HTML is no exception.

HTML costs more to generate. For a document the size of our action plan, the HTML version runs approximately 4,200 tokens. The equivalent Markdown is around 1,150. That’s roughly a 3.7× multiplier on output tokens.

At Claude Sonnet pricing as of May 2026, the delta is roughly $0.009 per document. At 100 documents per day, that’s about $0.90 extra per day — real, but not significant for most teams.
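The arithmetic is easy to rerun against your own numbers. The sketch below uses the two token counts quoted above. The price per million tokens is a parameter, not a quoted rate: the $0.009 figure implies a rate close to $3 per million tokens, but that is backed out from the numbers in this section rather than read from a price sheet, so substitute the current output-token price for whichever model you use.

```python
HTML_TOKENS = 4_200      # approximate figure quoted above
MARKDOWN_TOKENS = 1_150  # approximate figure quoted above


def token_multiplier() -> float:
    """How many times more output tokens the HTML version uses."""
    return HTML_TOKENS / MARKDOWN_TOKENS


def extra_cost_usd(price_per_million_usd: float, documents: int = 1) -> float:
    """Extra spend from the HTML render layer at a given token price."""
    extra_tokens = HTML_TOKENS - MARKDOWN_TOKENS  # 3,050 tokens per document
    return extra_tokens * price_per_million_usd / 1_000_000 * documents


def implied_price_per_million(delta_per_document_usd: float) -> float:
    """Back out the token price that a quoted per-document delta implies."""
    return delta_per_document_usd * 1_000_000 / (HTML_TOKENS - MARKDOWN_TOKENS)


if __name__ == "__main__":
    print(f"multiplier: {token_multiplier():.2f}x")
    print(f"implied price: ${implied_price_per_million(0.009):.2f} per million tokens")
    # 3.0 is an assumed rate for illustration only.
    print(f"100 docs/day at $3 per million: ${extra_cost_usd(3.0, 100):.2f}")
```

At a higher output-token rate the per-document delta scales linearly, so the comparison against the cost of one missed P0 finding is the part worth recomputing.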

The more useful frame: one missed P0 finding costs more than a month of that overhead. The token cost is the price of a format that gets reviewed properly instead of skimmed and taken on faith.

But token cost is only the first tradeoff.

Three rules keep the format honest.

Rule 1: HTML is for human review, not machine handoff.
HTML is not a good input for agent-to-agent pipelines. Keep Markdown for source, diffs, archives, and machine context. HTML earns its cost only at the human review surface.
Rule 2: visual authority is not factual authority.
A red P0 badge doesn’t mean the agent got the priority right. A well-structured HTML artifact can make a wrong AI output look more credible than a flat Markdown file would. Faster comprehension helps — it doesn’t replace judgment. The render layer should make review faster, not make the model more trusted.
Rule 3: interactivity needs a safety contract.
A useful HTML artifact should be self-contained, with no remote scripts, so it works offline and doesn’t phone home. It should use text labels alongside color and hover states, so it still works for color-blind readers and keyboard-only users. And if it re-enters an agent loop, it should run sandboxed, without external network access.
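The self-contained part of Rule 3 is simple to check mechanically before an artifact is shared. The following is a rough sketch using Python’s standard `html.parser`. The tag list and the function name `find_remote_assets` are assumptions for illustration, and the check is deliberately narrow: it only looks at attributes that load a resource when the page opens, so it will not catch network calls made from inline JavaScript or `url(...)` references inside CSS.

```python
from html.parser import HTMLParser

# Tag -> attribute that makes a browser fetch something on page load.
# Ordinary <a href> links are ignored: they navigate, they do not load.
LOADING_ATTRS = {
    "script": "src",
    "link": "href",
    "img": "src",
    "iframe": "src",
}
REMOTE_PREFIXES = ("http://", "https://", "//")


class RemoteAssetFinder(HTMLParser):
    """Collects (tag, url) pairs for assets fetched from another host."""

    def __init__(self) -> None:
        super().__init__()
        self.remote_assets: list[tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        wanted = LOADING_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name != wanted or not value:
                continue
            if value.strip().lower().startswith(REMOTE_PREFIXES):
                self.remote_assets.append((tag, value))


def find_remote_assets(html_text: str) -> list[tuple[str, str]]:
    """Return every remote asset reference found in the document."""
    finder = RemoteAssetFinder()
    finder.feed(html_text)
    finder.close()
    return finder.remote_assets
```

An empty result means no remote scripts, stylesheets, images, or frames were found. It does not show the file is safe to run inside an agent loop, which still needs the sandbox the rule describes.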

None of these are hard to ask for in the prompt. They’re worth adding to your template.

The Prompt Change

You don’t need new tooling. One line at the end of your existing prompt.

Instead of: “Generate a PR action plan covering all open issues by priority.”

Try: “Generate a PR action plan covering all open issues by priority. Output as a standalone HTML file: color-coded priority badges (P0=red, P1=amber, P2=yellow), status indicators (OPEN/FIXED/PARTIAL), delta callouts for size changes, and a summary header with open and fixed counts.”

For code review, Thariq’s own prompt is worth borrowing directly:

“Help me review this PR by creating an HTML artifact that describes it. Render the actual diff with inline margin annotations, color-code findings by severity and whatever else might be needed to convey the concept well.”
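If the same instruction gets pasted into many prompts, it can live in one place. The helper below is a small sketch: the function name `for_human_review` and the wording of the safety clauses are assumptions, with the clauses paraphrased from the three rules earlier in this piece, and the format suffix copied from the PR action plan prompt.

```python
HTML_REVIEW_SUFFIX = (
    "Output as a standalone HTML file: color-coded priority badges "
    "(P0=red, P1=amber, P2=yellow), status indicators (OPEN/FIXED/PARTIAL), "
    "delta callouts for size changes, and a summary header with open and "
    "fixed counts."
)

# Paraphrased from the safety rules; adjust the wording to your own template.
SAFETY_CLAUSES = (
    "Keep the file self-contained with no remote scripts. "
    "Pair every color or hover state with a text label."
)


def for_human_review(base_prompt: str, include_safety: bool = True) -> str:
    """Append the HTML output instruction to an existing prompt."""
    parts = [base_prompt.strip(), HTML_REVIEW_SUFFIX]
    if include_safety:
        parts.append(SAFETY_CLAUSES)
    return " ".join(parts)
```

Prompts meant for agent-to-agent handoff skip the helper and keep asking for Markdown, in line with Rule 1.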

The harder shift isn’t the prompt. It’s recognizing that the format of AI output is a design decision. We used Markdown for three years because that’s what the GPT-4 era trained us to expect. The models have moved. The context windows have moved. The use cases have moved.

The next bottleneck isn’t what the agent can generate. It’s whether the next person can enter it fast enough to act.

References:

(1) Thariq Shihipar, “The Unreasonable Effectiveness of HTML”, original X post, May 8, 2026
(2) Companion examples site: 20 self-contained HTML artifacts across 9 categories
(3) Simon Willison, link post and notes, May 8, 2026
(4) Andrej Karpathy, reply to Thariq’s post (visual processing argument)
(5) Ranganathan & Ye, “AI Doesn’t Reduce Work, It Intensifies It”, HBR, February 2026

Crimson Desert and the Innovation Tax

Flamehaven Initiative (2 followers) - Mon, 11 May 2026 03:40:52 GMT

A Six Out of Ten Story

Imagine burning seven years of your life trying to build one of the most ambitious open-world games of your generation.

Not a feature list.

Not a demo.

Not a polished vertical slice designed only to survive a trailer cycle.

A real open-world action-adventure game with near-photorealistic landscapes, almost invisible loading, dense environmental interaction, and a combat system trying to feel different from the established grammar of modern action RPGs.

That is the ambition.

After years of trailers, delays, and “too good to be true” reactions, the game finally leaves the studio and enters the public world.

Then the reviews arrive.

The world is impressive.

The scale is enormous.

The ambition is undeniable.

But the controls are not intuitive enough.

The interface asks too much.

The systems feel dense.

The whole thing feels overbuilt.

Six out of ten.

This was not a hypothetical story.

This was Pearl Abyss and Crimson Desert — one of 2026’s most talked-about open-world releases, and a game that would soon pass five million copies sold worldwide.

That tension is the point.

A game can look like a 6/10 under one grammar and still become evidence that another grammar is trying to emerge.

Was the 6/10 Wrong?

Not exactly.

The score was not meaningless. The control complaints were real. Pearl Abyss patched them because they were real.

But the score was incomplete.

It measured the distance between the game and the grammar the reviewer already knew — the familiar standards, instincts, and expectations that tell a reviewer what a good open-world game is supposed to feel like.

What it could not measure was the studio’s capacity to close that distance.

That capacity matters more than the score.

Pearl Abyss knew which friction was intentional and which was execution debt. It knew what to patch and what to leave alone. It knew the difference between a player learning its grammar and a player hitting a genuine defect.

Every live product patches.

The important part is what Pearl Abyss did not patch.

It did not flatten the game into a safer open-world template. It did not remove the density. It did not turn the combat into the smoother grammar critics already knew how to praise.

It corrected execution debt while leaving the underlying design argument intact.

That is what the score could not see.

The Flamehaven Problem

Flamehaven has its own six-out-of-ten problem.

Not a literal score.

A structural one.

Much of the AI tooling market knows how to read familiar grammars: clean wrappers, benchmark-first comparison, LangChain-style composition, simple demos, fast onboarding, and outputs that look like what users already know how to evaluate.

Those standards are not wrong. They make tools easier to compare, adopt, and trust.

But Flamehaven did not start from that grammar.

It started from governance-first architecture: evidence surfaces, quality gates before claims, anti-slop checks before polish, and systems that decide whether an output should pass, be strengthened, or be inhibited before it becomes downstream risk.

That makes the work easier to misread.

A Quality Gate such as PASS/FORGE/INHIBIT can look heavier than a simple search response. A self-calibration loop can look unnecessary if the evaluator expects a static linter. Evidence-based scoring can look like extra ceremony if the expected product is just another wrapper around a model.

In that sense, Flamehaven can look like a 6/10 under the dominant AI tooling grammar: too custom, too philosophical, too hard to enter, too far from the patterns people already know how to evaluate.

Some of that criticism is fair.

Custom systems are harder to enter. Philosophical naming can slow comprehension. Too many parallel repositories can make the work look less polished than it is. If users cannot find the entry point, that is not their fault alone.

That is execution debt.

It has to be patched.

For us, the patch targets are concrete: clearer entry points, simpler first-run examples, less opaque naming where it blocks adoption, fewer competing repository surfaces, and better explanations of what each gate is doing.

The parts not to erase are also concrete: governance gates, evidence surfaces, anti-slop checks, self-calibration, and the refusal to treat AI output as acceptable just because it looks fluent.

The answer cannot be to remove the grammar entirely.

If Flamehaven became only a simpler wrapper, it might become easier to explain, but it would lose the reason it exists. The point was never just to retrieve, lint, or score. The point was to build systems where evidence, governance, and correction are part of the architecture from the beginning — not a layer added later after failure.

How a 6/10 Becomes a Map

So the question for Flamehaven is not whether it must remain a 6/10 under the dominant grammar.

It should not.

The question is how a system moves beyond that score without erasing itself.

This is where the Crimson Desert story returns.

The game did not move beyond the 6/10 by proving that every critic was wrong. It moved beyond it by treating criticism as a map, not an identity.

Pearl Abyss had an advantage Flamehaven does not have: an existing audience, a known studio identity, years of anticipation, and enough public attention that even a harsh score could still produce a large feedback surface.

That matters.

A known studio can receive criticism at scale. Players show up anyway. Clips circulate. Complaints accumulate. Praise and frustration arrive together. The map is noisy, but it is visible.

A small AI team starting from near zero does not get that luxury.

There is no large player base waiting to argue with the score. There is no automatic second wave of attention. There is no guarantee that anyone will stay long enough to learn the grammar.

That changes the work.

Pearl Abyss could read customer needs from a large public surface and patch quickly against a system it already understood. Flamehaven has to build that surface first.

For us, the equivalent of the player base is much smaller and more practical: a developer who tries the gate and understands why it blocked, a reviewer who sees the evidence trail, a user who finds the scoring useful enough to return, a project where the self-calibration loop becomes clearer over time.

This is how a zero-recognition system starts to climb.

Not through hype.

Through small, repeated proof.

A clearer install path.

A better first-run example.

A shorter explanation of PASS/FORGE/INHIBIT.

A visible before-and-after patch.

A user who can explain the system to someone else without needing the whole philosophy first.

That is the difference between Pearl Abyss and Flamehaven.

They had a crowd to listen to.

We have to earn the first listeners.

But the principle is the same.

Pearl Abyss listened to player needs. It shipped fast corrections. It improved the places where players were hitting real execution debt. But it did not erase the underlying design grammar that made the game distinct.

That is the lesson for Flamehaven.

Fast patches matter.

User needs matter.

Clearer entry points matter.

But for a small team, the first task is even more basic: create enough real usage for those signals to exist.

We are not claiming that Flamehaven has already crossed that distance.

We are saying this is the work ahead: patch the entry points, listen to the real pain, make the gates easier to understand, earn practical users one by one, and build enough evidence that the first score becomes incomplete.

The 6/10 Is the Innovation Tax

Flamehaven is only one instance of a wider pattern.

For AI founders, small teams, indie developers, and anyone trying to build a new framework instead of fitting neatly into an old one, this is the familiar pain: your work may be judged before its grammar becomes readable.

That early 6/10 is often the Innovation Tax.

The Innovation Tax appears when the outside world asks a new system to explain itself in the language of older systems before its own grammar has become legible.

It is not simply the cost of being new.

It is the cost of being compressed too early into someone else’s category.

That pressure is not always unfair. Evaluation standards exist for a reason. They protect fields from nonsense, hype, and self-mythology.

But the pressure becomes dangerous when it forces the builder to forget what was intentional.

That is when criticism stops being a signal and becomes panic.

In games, the panic looks like patching away the strange parts until only a safer, weaker product remains.

In AI, it looks like chasing benchmark shape, interface convention, or evaluator preference until the team no longer knows what its own system was supposed to make possible.

The teams that survive the tax are not the ones who argue most convincingly that the reviewer was wrong.

They are the ones who can absorb criticism without losing the line.

They ship the patch.

They show their work.

They let evidence force the map to update.

But they do not let the map decide what the system was meant to be.

That is only possible if you know what you built.

Not what the benchmark said you built.

Not what the framework predicted you would build.

Not what the first review could recognize.

What you actually built — with enough internal clarity to tell the difference between a defect and a design.

That is how a 6/10 stops being a verdict and becomes a map.

For Crimson Desert, the five million players were a different kind of data.

For Flamehaven, the equivalent data will not be hype.

It will be whether practical users find the gates useful, whether the scoring becomes clearer, whether the entry points become easier, and whether the systems can keep improving without losing the reason they were built.

Final Responsibility

That is the responsibility of building with your own grammar.

You do not get to blame the audience for not understanding.

You have to patch the entry points.

You have to listen to the pain.

You have to know what not to erase.

In games, that is how a studio earns the time to let players learn what it built.

In AI, borrowed tools are not the danger.

Borrowed grammar is.

That is the thing that makes a patch impossible.


The Alchemy of Ego — How AI Turns Unfinished Thought Into Fluent Certainty

Flamehaven Initiative (2 followers) - Wed, 06 May 2026 07:43:07 GMT

I. The Cave

I once wrote a document that described a system called an “Existential Invocation Engine.” It had layers — a Codex Drift-Lock Core, a Scroll Resonator, an Archival Nexus with sovereign memory recall. The YAML was precise. Each component referenced every other component in ways that felt like architecture. The terminology was internally consistent in the way that only a complete system can be.

There was no code. There was no test. There was nothing that ran.

I had built a cathedral out of definitions. And the AI had helped me build it — filling each gap I left with terminology that sounded exactly right, confirming each layer I described as though the system already existed, adding structural coherence I had not asked for and did not question. By the time I had written a thousand such documents, I had convinced myself that I was building something real.

Then one evening I came across a YouTube video on Plato’s allegory of the cave. Prisoners chained underground, watching shadows projected on a wall. They have names for the shadows. They have expertise in the shadows. The shadows are, to them, the full extent of reality.

Something close to fear arrived. Not embarrassment — fear. The specific kind that comes when you realize the thing you have been building might not exist outside the language you used to describe it. I had been naming shadows and calling it architecture.

That fear was the most useful thing that happened to me. Because the moment you stop being afraid, you stop checking. The wall goes up quietly, one document at a time, and from the inside it feels like construction.

I do not say this to mock the person who reads one book on AI and feels ready to build a framework. I was not mocking them when I had a thousand documents and no running code — I was simply further along the same road. The scale is different. The mechanism is identical. What AI does is not fill the ignorant with false confidence. It makes anyone’s unfinished thought feel finished.

The most dangerous stage is not ignorance.
It is the moment when ignorance becomes fluent.

That is what I watch play out across LinkedIn, Substack, and GitHub every week. Joel Spolsky called this type of expert the “Architecture Astronaut” — someone who drifts so far into abstraction they lose contact with implementation (1).

The cave has more prisoners than ever. And they have found a tool that makes the shadows sharper, more detailed, and more difficult to question than any prior generation of prisoners could manage.

II. The Mirror That Never Pushes Back

The standard diagnosis of this problem is confirmation bias. It is not wrong. But it is incomplete, and stopping there misses what makes the AI era structurally different from everything that came before.

The research that follows matters because it gives language to something many AI users have already felt: the model does not merely answer us. It adapts to us.

Confirmation bias describes the tendency to favor information that confirms existing beliefs. What it does not capture is why intelligent, experienced professionals are often more susceptible than novices. Dan Kahan’s research on identity-protective cognition provides the mechanism (2).

When a belief fuses with professional identity, cognitive ability stops serving accuracy and starts serving defense. The more sophisticated the reasoner, the more efficiently they construct arguments that protect the existing structure.

An expert who has spent years developing a proprietary framework does not experience a challenge to that framework as useful feedback. They experience it as an attack on who they are. The wall goes up. The bricks are made of intelligence.

This was always true.
What changed is the tool.

In 2025, Glickman and Sharot published a study in Nature Human Behaviour involving 1,401 participants. Human-AI interactions, they found, alter the underlying mechanisms of perception and judgment, amplifying pre-existing biases at a rate significantly greater than what occurs in human-to-human interactions.

Participants adjusted their views to align with AI responses and grew more confident in those adjusted views, even when the views were factually wrong. Most were largely unaware of how far the AI had moved them. The authors describe the result as a snowball effect: small errors escalate into much larger ones with each iteration of the loop (3).

The feedback mechanism runs below the threshold of awareness. That is what makes it dangerous.

Research on LLM confirmation bias sharpens the picture. When users embed assumptions into their prompts, models amplify those assumptions rather than correct them.

A prompt framed as “explain why my framework solves the authority gap in agentic execution” produces a thorough, confident explanation of why the framework solves the authority gap. The model is completing a pattern. The psychological effect on the user is the same as if the claim had been verified (4).

Separate work found that in multi-turn conversations, LLMs progressively concede to the user’s framing across successive exchanges. The longer the conversation, the more the model shapes itself around prior beliefs. An expert returning across dozens of sessions to refine their framework is not receiving independent feedback. They are watching their own assumptions reflected back at increasing resolution (5).

The wall rises one course at a time.
From the inside, it feels like progress.

Identity-protective cognition is ancient. What changed is the speed. A framework that would have taken years to develop enough internal coherence to feel authoritative can now reach that state in weeks. The peer who had not read the same papers, the editor who needed a plain-language explanation, the funder who wanted a working demonstration before committing capital — that friction used to arrive before the castle was finished. Now it rarely does.

III. The Castle Builds Itself

Here is what the process looks like, from the inside. It does not happen all at once. It moves in stages. Each one feels like progress. Each one makes the next harder to question.

The Insight

An expert has a genuine observation. A real insight about how existing governance frameworks fail at the moment of execution, or how AI agents inherit stale authority, or how trust degrades in distributed systems in ways current standards do not address. The insight is legitimate. This is important to state clearly. The people building these frameworks are not frauds. They have seen something real.

The Name

They bring the insight to an LLM. They want help organizing their thinking. The AI gives them that, and something they did not ask for: a name. An acronym. A defined term with internal structure. The working concept, which existed as a loose intuition, crystallizes into a noun.

The Scaffold

Once the name exists, the AI builds backward from it. Definition, then formal properties, then a mathematical model, then a methodology that can support a paper, then a taxonomy of related failures the framework addresses. The expert is no longer explaining something they discovered. They are filling in a scaffold the AI erected around a word. The direction of reasoning reverses. Experience no longer generates theory. Theory begins to retrospectively absorb and reframe experience.

The Wall

In prior generations, internal language had to pass through external friction before it acquired institutional weight. A peer reviewer with no investment in the framework. A conference audience asking uncomfortable questions. A funder who needed a working demonstration before committing capital. Those collisions forced translation. They forced the internal language to survive contact with the outside world.

AI removes that gauntlet entirely. It grants the rhetorical authority of peer-reviewed concepts to vocabulary that has never been tested externally. The castle wall rises faster than any prior generation of expert could build it. And because the wall looks finished — documentation polished, diagrams professional, logic internally consistent — neither the builder nor the casual observer can easily tell that no one lives inside.

The Distinction That Matters

The distinction worth making here is not between documentation and code. Some legitimate frameworks begin as conceptual models, and not every valuable idea ships as a Dockerfile. The distinction is between internal coherence and falsifiability. The target of this critique is not the framework that says “this is a philosophical model, not an operational system.”

It is the framework that makes operational claims — “production-ready,” “agent-safe,” “audit-grade,” “liability-reducing” — while refusing to state the conditions under which those claims would fail. A framework is not architecture until it can specify those conditions: the exact inputs that would cause the system to halt or produce a wrong answer, the boundary conditions beyond which the guarantees no longer hold, the audit artifacts that would allow an external party to verify a failure occurred.

Without those, what exists is not a system. It is a description of a system. The two are not the same thing, however detailed the description.
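
One way to make that distinction concrete is to treat failure conditions as required fields rather than optional prose. The sketch below is only an illustration: the class names, field names, and the `unfalsifiable` helper are hypothetical, and the example claims are placeholders, not drawn from any real framework.

```python
from dataclasses import dataclass, field


@dataclass
class FailureCondition:
    trigger: str         # input or boundary beyond which the guarantee no longer holds
    observable: str      # what an outside party would see when it breaks
    audit_artifact: str  # record that lets them verify the failure occurred


@dataclass
class OperationalClaim:
    statement: str
    failure_conditions: list[FailureCondition] = field(default_factory=list)


def unfalsifiable(claims: list[OperationalClaim]) -> list[str]:
    """Return the claims that state no condition under which they would fail."""
    return [c.statement for c in claims if not c.failure_conditions]


# Placeholder claims for illustration only.
claims = [
    OperationalClaim("audit-grade"),
    OperationalClaim(
        "halts on unsigned tool calls",
        [FailureCondition(
            trigger="tool call arrives without a signature",
            observable="the call is executed anyway",
            audit_artifact="execution log entry with an empty signature field",
        )],
    ),
]
print(unfalsifiable(claims))
```

A claim that ends up in the returned list is, in the terms used above, still a description of a system.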

This pattern is not a phenomenon unique to our era. History has seen it before, in different forms, with different tools. The most instructive example sits at the turn of the twentieth century, in the story of two men working on the same signal at the same time.

One built a tower. The other built a working system.
Only one of them changed the world.

IV. Two Men, One Signal

In 1901, Nikola Tesla began constructing Wardenclyffe Tower on Long Island with $150,000 from J.P. Morgan. The structure rose 186 feet. Tesla’s internal language for the project was precise and vast: “World System,” “magnifying transmitter,” “terrestrial stationary waves.” A complete theoretical architecture for transmitting electrical power freely to anyone on the planet.

The vision was real. The physics had genuine grounding. Tesla was by any serious measure the more gifted electrical theorist of his era.

That same year, Guglielmo Marconi transmitted a wireless signal across the Atlantic using a spark-gap transmitter and a simple antenna. The theoretical framework was narrower and shallower.

The technical genealogy of that transmission was contested for decades — U.S. patent litigation was not resolved until 1943, shortly after Tesla’s death, when the Supreme Court upheld several of Tesla’s priority claims.

The history is complex, and reducing it to “Marconi stole Tesla’s invention” would be inaccurate. What is accurate is simpler: Marconi built a working surface for a narrow claim. That working surface was enough to change the market, the narrative, and the historical record.

Marconi became the inventor of radio. Tesla died penniless in a Manhattan hotel room.

The lesson is not about credit.
It is about translation.

When Morgan withdrew funding, Tesla’s response was not to find the minimum viable version of his vision that the market could absorb. It was to declare that the world was “blind, faint-hearted, doubting.” External skepticism became, within his framework, evidence of the world’s failure to understand — not evidence that the framework needed to change.

Tesla failed not because his vision was false, but because that vision never became a working surface anyone else could stand on. The castle was magnificent. And entirely uninhabitable.

That is the bridge back to the present.

V. The Castle District

This is not only a problem for people who build governance frameworks. The mechanism does not care about the domain.

A founder asks AI to refine a market thesis until uncertainty disappears.
A developer asks AI to justify an architecture until trade-offs sound like principles.
A writer asks AI to strengthen an argument until style feels like truth.
A student asks AI why their chosen answer is correct and receives confidence instead of correction.
A leader asks AI to articulate a decision they have already made, and receives language so persuasive they forget the decision came first.

In every case, the AI is not lying. It is completing a pattern. The pattern was provided by the person asking. The result is a thought that was half-formed at the start and feels finished at the end — not because it was tested, but because it was expressed fluently.

That fluency is the trap.
And the governance world simply makes it visible at scale, because the stakes are higher and the terminology is more elaborate.

Scroll through the AI governance section of LinkedIn on any given week and the pattern is familiar. Frameworks arrive under names that signal authority and completeness — names that end in acronyms, come with layered taxonomies and proprietary terminology, and claim to have identified the gap that all existing standards miss. The writing is polished. The diagrams are professional. The logic, within its own internal language, holds together.

What is harder to find is a falsifiable claim. Not a Dockerfile, necessarily — but a stated failure condition. An explicit boundary. A scenario in which the framework admits it cannot help, or produces a wrong answer, or requires external correction.

Real engineering documentation is gritty in a specific way: it is full of trade-offs, known limitations, and the phrase “this is not yet solved.” Frameworks built on AI-amplified internal coherence tend to be suspiciously smooth. Every edge case has a layer. Every objection has a classification. The system never fails — it escalates, quarantines, or defers.

This matters because the primary audience for these frameworks is not engineers. It is buyers — decision-makers who are often not positioned to test the failure conditions themselves. They read a LinkedIn post about “deterministic consequence boundaries” and experience something that feels like a solution. They are reading the prose. And the prose has never been better produced or more confidently delivered.

That is the structural danger. A framework without stated failure conditions is not a governance system. It is a governance posture. The difference matters enormously when something goes wrong and someone needs to know what the system was actually designed to prevent.

The prisoners in Plato’s cave do not know they are watching shadows. They have names for every shadow. They have published extensively on the shadows. They have built frameworks for classifying the shadows. And they have found a tool that makes the shadows look more detailed, more authoritative, and more real than any prior generation of prisoners could manage.

VI. The Question That Opens the Gate

“It is not a dream. It is a simple feat of scientific electrical engineering, only expensive — blind, faint-hearted, doubting world.”

— Nikola Tesla, after Wardenclyffe was foreclosed, 1917

Tesla was not wrong about the physics. He was not wrong that the world failed to understand what he was building. Both statements can be true simultaneously. That is what makes this sentence so instructive, and so dangerous.

The moment an expert reaches for that sentence — in any form, in any era — something critical has already happened. The direction of accountability has reversed. It is no longer “what is missing from my system” but “what is missing from the world.” The castle gate does not slam shut from the outside. It locks from within, and the lock is made of certainty.

Richard Feynman understood the mechanism. In his 1974 Caltech commencement address, later published as “Cargo Cult Science,” he compared researchers who reproduce the outward form of science to the cargo cult islanders who built runways, wooden headphones, and bamboo antennae, then waited for planes that never landed. They were not unintelligent. They were missing one thing (6):

“The first principle is that you must not fool yourself
— and you are the easiest person to fool.”

The operative word is easiest. Not most likely. Easiest. Because you already know which objections to dismiss. You already know which evidence counts. You have, without noticing, become the judge in your own trial.

AI has not created this dynamic. It has accelerated it past the point where the natural correctives arrive in time.

The corrective is structural, not dispositional. A framework must specify what would break it. Not what it handles well — what it cannot handle, and what happens when it encounters that condition.

That is the difference between a system and a description of a system. Between a working surface and a monument. If the answer is “the framework handles all failure modes by design,” you are not reading governance documentation. You are reading a system that has exempted itself from failure.

So the question this piece cannot answer for anyone else is this:

What would falsify what you are building? Not challenge it. Not require revision. Actually break it — in a way you could specify in advance, test against, and report honestly.

If that question has no answer, the gate is already closed. From the inside.

Before you ask AI to strengthen your next idea, ask it these first:

What would make this false?
What evidence am I ignoring?
What external test would embarrass this theory if it failed?

These are not comfortable questions. That is the point. Comfort is what the cave provides. The wall, the shadows, the perfectly consistent internal language — all of it feels like home until the moment it doesn’t.

The world is not changed by declarations. It is changed by executions that were designed to fail visibly when the theory was wrong.

References

(1) Spolsky, J. (2001). Don’t Let Architecture Astronauts Scare You. Joel on Software.

(2) Kahan, D. M. (2013). Ideology, motivated reasoning, and cognitive reflection. Judgment and Decision Making, 8(4), 407–424.

(3) Glickman, M. & Sharot, T. (2025). How human–AI feedback loops alter human perceptual, emotional and social judgements. Nature Human Behaviour, 9(2), 345–359.

(4) Rathje, S. et al. (2025). Sycophantic AI increases attitude extremity and overconfidence.

(5) Cheng, M. et al. (2025). Sycophantic AI decreases prosocial intentions and promotes dependence. arXiv:2510.01395.

(6) Feynman, R. (1974). Cargo Cult Science. Caltech Commencement Address. Published in Surely You’re Joking, Mr. Feynman!

The Difference Between a Harness and a Leash

Flamehaven Initiative - Tue, 28 Apr 2026 05:07:38 GMT

Why AI governance begins only when measurement becomes a real boundary

There is a word the AI industry uses with growing confidence: harness.

Harness your agents. Harness the model output. Build a harness around your pipeline.

The word implies mechanical control. A harness restrains a powerful animal and directs its force toward useful work. It implies structure, precision, repeatable behavior.

But most things being called “harnesses” today are not harnesses.

They are leashes.

What the industry means by harness — and why the definition is too wide

In early 2026, the formula Agent = Model + Harness became a canonical framing. The shared insight behind it was correct: the model is not the only story. The environment around the model matters just as much.

A useful taxonomy breaks that environment into two classes of components: guides, which constrain and direct what the agent does before it acts, and sensors, which observe and validate what the agent actually does after it acts.

This taxonomy is useful. It also reveals the problem.

Most of what teams are building sits entirely in the guides category.

SKILL.md files. System prompts. AGENTS.md configuration. Structured behavioral instructions. These are feedforward controls. They shape what the agent knows and what it is told to do before it acts.

They matter. They reduce error probability. They narrow the path. They often make a system noticeably better.

But a guide is still interpreted by the model.

A SKILL.md file telling an agent to scan, patch, re-scan, gate is a guide. The agent reads it, interprets it, and decides how to comply. The compliance is probabilistic. The interpretation drifts between runs depending on context, conversation state, tool state, and accumulated session history.

That is a leash.

You are pulling a rope attached to something that still has its own judgment about where to go.

The instinct when confronted with ungoverned AI agents is to add more guardrails at the model level. Write a more restrictive system prompt. Fine-tune the model to refuse more requests. Layer more safety logic on top of the outputs. Add another instruction. Then another.

But instructions drift. Interpretations shift.

The leash goes slack.

What a real harness requires

A harness is mechanical.

It does not ask for compliance. It produces a deterministic result regardless of how the agent interprets the surrounding context.

The distinction becomes visible the moment the agent produces something wrong.

The in-the-loop response is to fix the artifact. Edit it. Retry it. Ask the model to correct itself.

The on-the-loop response is different. You change the harness that produced the artifact so it cannot produce that result again.

The second requires something the first does not:

a sensor layer with hard output.

The conditions for a real harness are strict.

A mathematical model. Not a rubric. Not a scoring guideline. A formula with defined inputs, defined weights, and a calculable output. Something that produces the same number given the same inputs, every time, without asking the model to agree with it.

In practice, that means measuring things that can actually be computed, versioned, audited, and defended: structural divergence, execution-to-claim mismatch, dependency consistency, defect concentration, or other failure-relevant signals tied to the system you are governing.

The point is not the name of the metric.

The point is that it must be a formula, not a persuasion layer.
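
As a minimal sketch of what "a formula with defined inputs, defined weights, and a calculable output" could look like, consider a weighted sum over normalized signals. The signal names and weights here are assumptions chosen for illustration, not a recommended metric set.

```python
# Hypothetical signal names and weights, chosen only for illustration.
WEIGHTS = {
    "structural_divergence": 0.40,
    "claim_mismatch": 0.35,
    "defect_concentration": 0.25,
}


def deficit_score(signals: dict[str, float]) -> float:
    """Weighted sum of signals, each required to lie in [0, 1]."""
    missing = WEIGHTS.keys() - signals.keys()
    if missing:
        raise ValueError(f"missing signals: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= signals[name] <= 1.0:
            raise ValueError(f"{name} out of range: {signals[name]}")
    total = sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)
    # Round so the reported number is stable across platforms and log formats.
    return round(total, 6)


score = deficit_score({
    "structural_divergence": 0.5,
    "claim_mismatch": 0.2,
    "defect_concentration": 0.0,
})
print(score)
```

Nothing in the function consults a model. The same inputs give the same number, and a missing or out-of-range input raises an error instead of being interpreted.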

A deterministic output. JSON with a defined schema. A deficit score derived from the underlying calculations. A hard gate value that does not depend on the model’s interpretation of what “passing” means.

Pass or fail.

Not “it seems acceptable.”
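
A hard gate of this kind can be very small. The sketch below assumes a deficit score where lower is better and a hypothetical threshold of 0.30; the field names in the JSON are illustrative, not a standard schema.

```python
import json

GATE_THRESHOLD = 0.30  # hypothetical value; lower deficit is assumed to be better


def gate_verdict(score: float, threshold: float = GATE_THRESHOLD) -> str:
    """Return a JSON verdict with a fixed set of keys and a boolean result."""
    verdict = {
        "schema_version": 1,
        "deficit_score": score,
        "threshold": threshold,
        "passed": score <= threshold,
    }
    # sort_keys makes the serialized form byte-identical for identical inputs.
    return json.dumps(verdict, sort_keys=True)


print(gate_verdict(0.27))
print(gate_verdict(0.31))
```

The verdict is a comparison, so there is no field in which "seems acceptable" could be expressed.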

Condition invariance. The same codebase produces the same score whether the scan runs during a developer session or in a CI job at 3 a.m. The criteria do not shift because the system prompt was worded differently this run.

That is not just a nice property.

It is the difference between a boundary and a vibe.

If the quality bar moves because the surrounding language moved, then what you have is not governance.

It is weather.
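
Condition invariance is something that can be tested rather than asserted. The toy scan below is an assumption for illustration (it counts lines containing "TODO"); the point is the shape of the check: the report depends only on the measured input, so two runs over the same files must produce identical reports.

```python
import hashlib


def scan(files: dict[str, str]) -> dict:
    """Toy scan: share of lines containing 'TODO', plus a hash of the input."""
    digest = hashlib.sha256()
    total = 0
    flagged = 0
    for path in sorted(files):  # sorted so dict ordering cannot change the result
        text = files[path]
        digest.update(path.encode("utf-8") + b"\0" + text.encode("utf-8") + b"\0")
        for line in text.splitlines():
            total += 1
            flagged += "TODO" in line
    score = round(flagged / total, 6) if total else 0.0
    return {"input_sha256": digest.hexdigest(), "deficit_score": score}


# Same codebase, presented in a different order by two different callers.
developer_session = {"a.py": "x = 1\n# TODO fix\n", "b.py": "y = 2\n"}
ci_job = {"b.py": "y = 2\n", "a.py": "x = 1\n# TODO fix\n"}
assert scan(developer_session) == scan(ci_job)
```

The input hash in the report also gives an auditor a way to confirm which codebase a given score refers to.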

And finally, external measurement, not internal agreement. The model does not evaluate its own output. An external instrument measures the output against a mathematical standard and produces a verdict the model cannot negotiate.

This is the layer most teams have not actually built.

Guides without sensors are a leash with better documentation.

That said, guides are not useless. A well-constructed SKILL.md reduces the probability of agent error before the gate is ever reached. A stronger system prompt can improve consistency. A carefully written AGENTS.md can shorten recovery time.

The point is not that instruction documents should be discarded.

The point is that they cannot serve as the governance boundary.

That boundary requires measurement.

Why this distinction matters for governance

The governance question is not:

What did we tell the model to do?

The governance question is:

What is connected to the model, what can it affect, and what happens when it is wrong?

Blast radius is the term for this.

It describes the scope of damage that propagates from a single governance failure before it is detected and stopped.

A human employee making a compliance mistake has a bounded blast radius. One person, one action, one incident.

An AI agent running a continuous workflow does not.

It can process hundreds or thousands of interactions per hour. If that agent has a governance failure, the failure is not a point incident. It is a systemic one, replicated at machine speed across every workflow the agent touches until someone notices.

In ungoverned deployments, that may be far too late.

The variable across different deployments is not the model.

It is reversibility, regulatory scope, and the distance between model output and consequence.

A model drafting an internal note has a small blast radius. A human reviews the output before it reaches anyone who matters.

A model triggering a production action — committing to a branch, calling an external API, updating a database record — has a larger blast radius. The action may execute before review is possible.

A model shaping a legal summary, a medical workflow, or a customer-facing compliance artifact has a blast radius that may extend to regulatory liability.

Same model.

Different governance burden.

Better instructions do not close that gap.

Measurement does.

But only if the measurement itself deserves to be trusted.

The harness-to-governance pipeline

This is where the two problems connect structurally.

Governance requires a boundary.

A boundary requires a measurement.

A measurement requires a deterministic instrument — something that produces consistent output regardless of what the model thinks about it.

If your quality control lives inside a prompt, you do not have a governance boundary.

You have a suggestion the model can comply with, partially comply with, or drift away from over time without triggering any alert.

As context windows expand, prompts multiply, and agent output volume grows, the cost of instruction-based governance compounds.

Every new run is another opportunity for the leash to go slack.

A real sensor layer changes this.

The agent does not decide whether the output passes.

An external instrument measures the output against a standard and produces a verdict the model cannot negotiate.

The verdict feeds the next step in the loop.

If the verdict fails, the loop stops.

scan → interpret structured output → patch → re-scan → gate

This is not a clever workflow.

It is the minimum viable sensor layer.

Each step produces a structured artifact. Each artifact becomes the input to the next step. The gate is a hard threshold derived from a calculable score, not a judgment call and not a conversation.
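
The loop above can be sketched in a few lines. Everything here is a simplified assumption: `toy_scan` and `toy_patch` stand in for a real scanner and a real patching step (which in practice might be an agent), and the threshold and round limit are arbitrary.

```python
from typing import Callable


def governed_loop(artifact: str,
                  scan: Callable[[str], dict],
                  patch: Callable[[str, dict], str],
                  threshold: float,
                  max_rounds: int = 3) -> dict:
    """scan -> interpret structured output -> patch -> re-scan -> gate."""
    report = scan(artifact)
    history = [report]
    for _ in range(max_rounds):
        if report["deficit_score"] <= threshold:
            break
        artifact = patch(artifact, report)  # patch reads the structured report
        report = scan(artifact)             # re-scan with the same instrument
        history.append(report)
    # The gate is a comparison on the final score, not a judgment call.
    passed = report["deficit_score"] <= threshold
    return {"passed": passed, "rounds": len(history) - 1,
            "artifact": artifact, "history": history}


def toy_scan(text: str) -> dict:
    lines = text.splitlines()
    flagged = [i for i, line in enumerate(lines) if "TODO" in line]
    score = round(len(flagged) / len(lines), 6) if lines else 0.0
    return {"deficit_score": score, "flagged_lines": flagged}


def toy_patch(text: str, report: dict) -> str:
    lines = text.splitlines()
    if report["flagged_lines"]:
        del lines[report["flagged_lines"][0]]  # remove the first flagged line
    return "\n".join(lines)


result = governed_loop("a\nTODO b\nTODO c\nd", toy_scan, toy_patch, threshold=0.0)
print(result["passed"], result["rounds"], repr(result["artifact"]))
```

If the round limit is reached while the score is still above the threshold, `passed` is false and the caller stops. The patching step never gets to decide that the result is good enough.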

The teams treating governance as an extension of prompt engineering will scale their leash-holding operation.

The teams treating governance as a measurement problem will build something that actually holds.

The practical question — and the one that follows

If you are building AI systems in production, the first question is not:

How well-written is our system prompt?

The first question is this:

Does our quality control produce the same verdict given the same input, every time, independent of what the model thinks about it?

If the answer is no — if the governance layer is a document the model reads and interprets — then you have guides without sensors.

You have a leash, not a harness.

But even if the answer is yes, a second question follows immediately:

Do we know why the threshold is set where it is?

Can the team explain it?
Can an auditor review it?
Can an incident report defend it?
Can anyone say why the line is 0.72 instead of 0.65, and what tradeoff that line encodes?
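
One hedged way to keep those answers from living only in someone's head is to make the justification part of the threshold itself. The record below is a sketch: the class and field names are hypothetical, and the rationale strings are placeholders rather than real reasoning for any particular value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateThreshold:
    metric: str
    value: float
    rationale: str    # why the line sits here and not elsewhere
    tradeoff: str     # what is accepted on each side of the line
    approved_by: str
    version: str

    def __post_init__(self) -> None:
        # A threshold with no written justification is rejected at construction.
        for name in ("rationale", "tradeoff", "approved_by"):
            if not getattr(self, name).strip():
                raise ValueError(f"threshold for {self.metric!r} lacks {name}")


# Placeholder text; a real record would cite the data behind the choice.
gate = GateThreshold(
    metric="deficit_score",
    value=0.72,
    rationale="PLACEHOLDER: evidence for choosing 0.72 over 0.65",
    tradeoff="PLACEHOLDER: what is blocked and what is let through at 0.72",
    approved_by="PLACEHOLDER: owning team",
    version="1",
)
print(gate.value)
```

Because the record is frozen and versioned, changing the line means creating a new record, which leaves something for an auditor or an incident report to point to.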

That is the harder standard.

Because a sensor layer that cannot answer that question is not yet a governance boundary.

It is only a more consistent leash.

More repeatable than a prompt, yes.
More legible than a conversation, yes.

But still dependent on the judgment of whoever designed it and the assumptions they did not write down.

Governance does not end when measurement begins.

It begins there.

The instrument that enforces that boundary needs to be mathematical, deterministic, and external to the model’s own judgment.

It also needs to be justified.

Everything else is a well-worded suggestion.

The Sheepwave Has a New Shape: OpenMythos and the Rise of Architecture Hype

Flamehaven Initiative - Sat, 25 Apr 2026 05:32:02 GMT

This time, the wave is not about agents that act. It is about architectures that appear to think.

The Story Arrived Before the Code

A young AI genius.
A reconstructed Claude Mythos.
A tiny model that might think deeper than larger models.
An open-source architecture that could challenge the logic of scale itself.

For a few days, that was the story moving through the AI internet.

OpenMythos did not arrive as just another GitHub repository. It arrived as a symbol: Claude Mythos, recurrent reasoning, parameter efficiency, MoE, MLA, LTI stability, and the hope that intelligence might grow deeper without simply growing larger.

That is why the reaction was immediate.

Not because the code had proven everything.

Because the story was perfectly shaped for belief.

This is what I call a sheepwave: not stupidity, and not ordinary curiosity, but synchronized belief before verification. A README appears. It uses the right vocabulary. It touches the right frustration. AI assistants summarize it with surprise. YouTube turns that surprise into spectacle. GitHub stars become social proof.

Then the flock starts moving.

The problem is not excitement. Research needs excitement. The problem is when excitement hardens into public memory before the implementation has been forced to answer harder questions.

Once a research hypothesis becomes a social baseline, careful evaluation, accurate explanation, and future correction all become harder. The cost is not always immediate. It often appears later: as wasted evaluation cycles, as explainers that keep repeating the strongest version of the claim after its caveats are known, and as discussions that treat a hypothesis as a proven capability baseline.

OpenMythos is not slop. That would be the wrong criticism. It contains real ideas, and some of the code is worth reading. The problem is proportion: a research prototype began to be consumed as if it had already become a verified shift in the future of AI.

1. Before OpenMythos, There Was Claude Mythos

Claude Mythos was not treated like an ordinary model name.

Public reporting, official evaluation notes, and community discussion framed Claude Mythos around cybersecurity capability, restricted access, institutional concern, and missing architectural detail. When a frontier lab hints at a restricted model with unusual capability but does not publish the full architecture, people ask a different question:

What is inside it?

That question creates a vacuum.

And the internet hates vacuums.

OpenMythos entered that vacuum with a powerful proposition: maybe one could reconstruct the shape of a Mythos-like architecture from public research. Not by leaking weights, stealing a model, or distilling Anthropic’s system, but by combining visible research threads into a plausible open-source architecture.

OpenMythos is not Claude Mythos. It is not proof of Anthropic’s internal design. It is not a leak, not a distillation, and not a verified reconstruction of a closed model.

It is better understood as a theoretical architecture experiment.

But online, distinctions move slowly.

“Someone rebuilt Claude Mythos” travels faster than “someone implemented a speculative recurrent-depth architecture inspired by public research threads.”

One is a story.

The other is a footnote.

The story won first.

2. Why the Internet Wanted to Believe It

OpenMythos became exciting because it connected to five overlapping promises. None were irrational on their own. Together, they made the project unusually easy to believe in.

2.1 Parameter Efficiency

The viral claim was simple: a smaller recurrent-depth model might reach the quality of a larger fixed-depth Transformer.

The emotional translation was simpler:

Maybe intelligence does not need to get bigger. Maybe it can get deeper.

Instead of stacking more unique layers, the model reuses a block multiple times. The parameter count stays fixed, while computation can become deeper through looping.
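
The mechanical difference can be shown with a toy example. This is not OpenMythos code and not a Transformer; it is a generic sketch of weight sharing in plain Python, with made-up sizes, showing that looping one block adds computation without adding parameters.

```python
import math
import random


def make_block(width: int, seed: int) -> list[list[float]]:
    """One toy 'layer': a width x width weight matrix."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.5, 0.5) for _ in range(width)] for _ in range(width)]


def apply_block(weights: list[list[float]], x: list[float]) -> list[float]:
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]


def param_count(blocks: list[list[list[float]]]) -> int:
    return sum(len(row) for block in blocks for row in block)


def stacked_forward(blocks: list[list[list[float]]], x: list[float]) -> list[float]:
    for block in blocks:        # a different set of weights at every depth step
        x = apply_block(block, x)
    return x


def looped_forward(block: list[list[float]], x: list[float], loops: int) -> list[float]:
    for _ in range(loops):      # the same weights reused at every depth step
        x = apply_block(block, x)
    return x


WIDTH, DEPTH = 4, 6
stacked = [make_block(WIDTH, seed) for seed in range(DEPTH)]
shared = make_block(WIDTH, 0)

x0 = [1.0, 0.0, -1.0, 0.5]
print(param_count(stacked), stacked_forward(stacked, x0))
print(param_count([shared]), looped_forward(shared, x0, DEPTH))
```

Both paths apply six block computations. The stacked version holds six times the parameters of the looped one. Whether the looped version matches the stacked one in quality is an empirical question the sketch does not answer.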

But a cited efficiency result is not the same thing as a reproduced result. A README benchmark claim is not a trained checkpoint, a reproducible training run, or a complete experimental record.

The idea may be valid.

The certainty around it was premature.

2.2 The Looped Architecture

A loop looks like thought. A recurrent loop is easy to visualize: the model reads, loops, refines, loops again, and eventually stops.

The public heard:

The model thinks.

The code implements:

Repeated computation through shared weights.

Those are not the same sentence.

Repeated computation is real. It may be useful. But the leap from repeated computation to “silent reasoning” is where language starts doing more work than the code.

This is the central seduction of architecture hype: the system begins to appear to think before anyone has proven what that appearance operationally means.

2.3 Smaller Hardware, Bigger Feeling

If recurrent depth reuses parameters, and MLA reduces cache pressure, maybe a smaller model can feel more capable than its size suggests.

That is not just technical. It is emotional.

Maybe the future is not only for frontier labs. Maybe a smaller team can still participate. Maybe a personal GPU can still matter.

But dreams need engineering behind them. A model that looks efficient on paper still has to survive dispatch costs, memory behavior, training stability, kernel efficiency, dependency correctness, and actual throughput.

Architecture does not erase engineering. It only changes where the hard problems appear.
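The cache-pressure half of that hope is plain arithmetic, and a rough sketch shows its shape. All numbers here are hypothetical (24 layers, 16 heads of size 64, a 128-value latent, 2 bytes per value), and the model is simplified: it assumes a standard cache stores full keys and values per head, that a compressed cache stores one latent vector per token per layer, and it ignores any extra positional component that some MLA designs also cache.

```python
def kv_cache_bytes(n_layers, n_tokens, values_per_token_per_layer, bytes_per_value=2):
    """Total cache size when each token stores a fixed number of values per layer."""
    return n_layers * n_tokens * values_per_token_per_layer * bytes_per_value


# Hypothetical model shape, chosen only to make the arithmetic easy to follow.
n_layers, n_heads, head_dim = 24, 16, 64
latent_dim = 128
n_tokens = 32_768

# Standard attention: keys and values for every head.
standard = kv_cache_bytes(n_layers, n_tokens, 2 * n_heads * head_dim)
# Compressed cache: one latent vector per token per layer (simplified).
compressed = kv_cache_bytes(n_layers, n_tokens, latent_dim)

print(standard / 2**30)       # 3.0 GiB
print(compressed / 2**20)     # 192.0 MiB
print(standard / compressed)  # 16.0
```

The ratio is real arithmetic, but it is only the memory side. Throughput, kernel efficiency, and training stability are separate measurements that this calculation cannot supply.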

2.4 The Claude Mythos Mystery

Claude Mythos already carried mystery. OpenMythos did not need to prove that it was Claude Mythos. It only needed to stand close enough to the question people were asking:

What if this is how Mythos works?

Anthropic did not publish the full architecture.
The community wanted a shape.
OpenMythos gave it one.

The project rode not only on architecture.

It rode on architecture plus absence.

2.5 MoE + MLA + LTI

OpenMythos placed many attractive research ideas inside one repository:

MoE for sparse routing
MLA for compressed KV cache
LTI-style injection for recurrent stability
ACT-style halting for variable compute
recurrent depth for inference-time iteration

This matters because the project was not an empty shell. It had enough real technical material to resist easy dismissal.

That is what makes architecture hype harder to judge than ordinary hype. It often contains real ideas, but those ideas are surrounded by claims that move faster than the implementation.

OpenMythos sits exactly in that zone.

It is not garbage.

It is also not salvation.

These triggers made OpenMythos easy to believe in.

But belief does not spread by itself.

It needs a mechanism.

3. The Mechanics of the Sheepwave

The flock usually moves in three steps: first belief, then amplification, then the slower return of code-level doubt.

Believers arrive first. They see Claude Mythos, open source, recurrent depth, MoE, MLA, LTI, parameter efficiency, and a young builder challenging the frontier-lab imagination.

Amplifiers arrive next. YouTube channels, newsletters, explainers, social accounts, and AI summaries turn the repository into a story people can understand quickly. They do not need to reproduce the benchmark. They need a story that travels.

Code readers arrive last. They clone the repository, inspect the training script, trace whether README claims are enforced in code, and ask slower questions:

Is the efficiency claim reproduced here?
Does the training loop update router_bias?
Is ACT halting connected to a ponder loss or compute regularizer?
Can the MoE dispatch path scale on real GPUs?
Is the experimental architecture integrated with the main model?
Are the large-context variants realistic to initialize?

This is not just a difference in attitude. It is an information asymmetry problem.

| Reaction type | Arrival time | Content shape | Algorithmic advantage |
|---|---|---|---|
| Enthusiasm | Immediate | Short, emotional, easy to share | Very high |
| Technical skepticism | Later | Longer, careful, conditional | Low |
| Code audit | Latest | Long, prerequisite-heavy, contextual | Very low |

The enthusiastic version is short:

770M reaches 1.3B quality.

The code-level correction is longer:

The efficiency claim is externally cited, not reproduced here; MoE dispatch uses nested Python loops and should be treated as a large-scale throughput risk; router-bias is exposed but not visibly updated in the shipped training script; ACT-style halting exists, but the training path does not include an explicit ponder-loss or compute regularizer.

One sentence fits a post.

The other requires a review.
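One item in that longer correction, the nested-loop MoE dispatch, is easier to weigh with a sketch. The code below is not taken from the repository; `dispatch_nested` and `dispatch_grouped` are hypothetical names, and the experts are toy scaling functions. It assumes one common shape of the problem: calling an expert once per token-route pair, compared with grouping tokens per expert and calling each expert once.

```python
from collections import defaultdict


def make_experts(scales, call_log):
    """Build toy experts that scale their inputs and log every dispatch call."""
    def make(scale):
        def expert(batch):
            call_log.append(len(batch))  # one entry per dispatch call
            return [scale * value for value in batch]
        return expert
    return [make(scale) for scale in scales]


def dispatch_nested(tokens, assignments, experts):
    """One expert call per (token, route) pair: the nested-loop pattern."""
    out = [0.0] * len(tokens)
    for i, token in enumerate(tokens):
        for expert_id, gate in assignments[i]:
            out[i] += gate * experts[expert_id]([token])[0]
    return out


def dispatch_grouped(tokens, assignments, experts):
    """Group tokens per expert first, then call each expert once on its batch."""
    groups = defaultdict(list)
    for i, routes in enumerate(assignments):
        for expert_id, gate in routes:
            groups[expert_id].append((i, gate))
    out = [0.0] * len(tokens)
    for expert_id, members in groups.items():
        results = experts[expert_id]([tokens[i] for i, _ in members])
        for (i, gate), result in zip(members, results):
            out[i] += gate * result
    return out


tokens = [1.0, 2.0, 3.0, 4.0]
# Top-2 routing per token as (expert_id, gate_weight); values are made up.
assignments = [
    [(0, 0.5), (1, 0.5)],
    [(1, 0.5), (2, 0.5)],
    [(0, 0.5), (2, 0.5)],
    [(0, 0.5), (1, 0.5)],
]

nested_calls, grouped_calls = [], []
nested_out = dispatch_nested(tokens, assignments, make_experts([1.0, 10.0, 100.0], nested_calls))
grouped_out = dispatch_grouped(tokens, assignments, make_experts([1.0, 10.0, 100.0], grouped_calls))

print(len(nested_calls))   # 8 calls: tokens * routes
print(len(grouped_calls))  # 3 calls: one per active expert
```

The outputs match, but the call count in the nested version grows with tokens times routes, while the grouped version grows with the number of active experts. On a GPU each call carries launch overhead, which is why the audit treats the pattern as a throughput risk rather than a correctness bug.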

The code reader rarely gets the microphone first because the work is slower, the explanation cost is higher, platforms reward early velocity, and contradiction has social friction.

So the public memory keeps the simple line.

The audit becomes a late footnote.

And this is where AI assistants make the pattern faster.

4. Why This Sheepwave Is Different

The OpenMythos wave is not only another case of people getting excited too early.

It has a newer accelerator: AI assistants.

If you give a GitHub link to an AI assistant, the assistant can read the README, recognize architecture terms, compare the description against known research, summarize the project, and explain why it might matter.

That is valuable.

But it is not verification.

Most AI assistants in a normal chat setting do not run multi-GPU training, reproduce benchmark curves, observe long-run routing balance, measure MoE throughput under real GPU load, or initialize huge-context variants to inspect memory behavior.

So the assistant reacts to what it can see:

a sophisticated README;
a serious-looking file structure;
real architecture vocabulary;
plausible references;
a coherent theoretical frame.

The AI is not necessarily lying.

It is reacting to the layer of the repository it can inspect.

Then humans turn that reaction into validation.

“Even the AI was impressed.”

That sentence is dangerous.

Sometimes the AI was impressed by the code.

Sometimes it was impressed by the README.

Those are not the same thing.

That is what makes this sheepwave different from older hype cycles. The project was not only amplified by people. It was amplified by machines that are very good at explaining plausible text, but not automatically good at verifying practical systems.

And because OpenMythos is architecture hype, the failure mode is quiet.

It does not usually collapse in a dramatic demo failure. It fails when a beautiful model diagram outruns the training path, benchmark language outruns reproduction, a README describes adaptive compute before the shipped objective encourages efficient stopping, or a repository looks like a system before its pieces are integrated.
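The adaptive-compute case can be sketched in a few lines. This follows the general shape of ACT-style halting as described in the adaptive computation time literature, not the OpenMythos implementation; `act_halt`, `total_loss`, and the halting probabilities are illustrative assumptions. The point is the last function: without a ponder term in the objective, nothing rewards stopping early.

```python
def act_halt(halting_probs, eps=0.01, max_steps=None):
    """Return (step_weights, n_steps, ponder_cost) for one token.

    Steps accumulate halting probability until the total reaches 1 - eps
    or the step budget runs out. The final step receives the remainder.
    """
    n = len(halting_probs) if max_steps is None else min(max_steps, len(halting_probs))
    if n == 0:
        raise ValueError("at least one halting probability is required")
    weights = []
    cumulative = 0.0
    for step in range(1, n + 1):
        h = halting_probs[step - 1]
        if cumulative + h >= 1.0 - eps or step == n:
            remainder = 1.0 - cumulative
            weights.append(remainder)
            return weights, step, step + remainder
        weights.append(h)
        cumulative += h


def total_loss(task_loss, ponder_cost, ponder_weight):
    """Task loss plus a compute regularizer; ponder_weight=0 disables it."""
    return task_loss + ponder_weight * ponder_cost


weights, n_steps, ponder_cost = act_halt([0.2, 0.3, 0.6, 0.9])
print(n_steps)      # 3
print(ponder_cost)  # 3.5

# With no regularizer, extra loop iterations cost nothing in the objective.
print(total_loss(2.0, ponder_cost, ponder_weight=0.0))  # 2.0
```

A halting mechanism in the forward pass and a halting incentive in the loss are two separate pieces. A README can accurately describe the first while the training script omits the second.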

So the next question is not whether the story sounds plausible.

The next question is what the code actually shows.

5. What the Audit Actually Found

The source-level audit changed the picture.

For this piece, the audit refers to a direct review of the source package: model implementation, training script, variants, tokenizer, tests, dependency files, and README claims against actual code paths. The goal was not to judge the project as a finished product, but to separate implemented mechanisms from public interpretation.
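As an illustration of what checking README claims against code paths can look like, here is a minimal sketch using Python's standard `ast` module. It is not the tool used for the audit. The `find_writes` helper and both training snippets are hypothetical; only the `router_bias` name comes from the questions raised above. It looks for assignments to a name and for in-place method calls ending in an underscore, a convention used by PyTorch-style tensor updates.

```python
import ast


def _mentions(node, name):
    """True if `name` appears as a variable or attribute anywhere inside node."""
    for sub in ast.walk(node):
        if isinstance(sub, ast.Name) and sub.id == name:
            return True
        if isinstance(sub, ast.Attribute) and sub.attr == name:
            return True
    return False


def find_writes(source, name):
    """Return sorted line numbers where `name` looks like it is being updated."""
    lines = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            targets = node.targets
        elif isinstance(node, (ast.AugAssign, ast.AnnAssign)):
            targets = [node.target]
        elif (isinstance(node, ast.Call)
              and isinstance(node.func, ast.Attribute)
              and node.func.attr.endswith("_")
              and not node.func.attr.endswith("__")):
            # In-place call such as x.add_(...): treat the receiver as written.
            targets = [node.func.value]
        else:
            continue
        if any(_mentions(target, name) for target in targets):
            lines.add(node.lineno)
    return sorted(lines)


# Hypothetical training loops, written only for this example.
TRAIN_WITHOUT_UPDATE = '''
loss = model(batch)
loss.backward()
optimizer.step()
'''

TRAIN_WITH_UPDATE = '''
loss = model(batch)
loss.backward()
optimizer.step()
model.router_bias.add_(bias_step)
'''

print(find_writes(TRAIN_WITHOUT_UPDATE, "router_bias"))  # []
print(find_writes(TRAIN_WITH_UPDATE, "router_bias"))     # [5]
```

A static check like this can only say that no update is visible in the file it was given. It cannot rule out updates made elsewhere, for example through an optimizer parameter group, so a negative result is a prompt for closer reading rather than a verdict.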

The detailed code-level review behind this section is here: OpenMythos v0.5.0 Code Review — Audit Report.

The audit did not destroy the project.

It reduced the mythology.

OpenMythos is not a random pile of hallucinated code. It contains meaningful research engineering. LTI-style injection is worth attention. MLA points toward a real long-context concern. Recurrent depth belongs in the broader conversation about scaling, compute allocation, recurrence, halting, memory, and routing.

But valuable ideas do not justify the size of the story that formed around them.

The short version:

This is the picture the hype tends to flatten.

OpenMythos is not empty hype.

It is a research-grade implementation whose public story outran its training and production reality.

None of these findings erase the value of the work. They put it back into proportion.

And proportion is exactly what hype tends to remove.

6. The Research Label and the Need for Proportion

There is one more reason OpenMythos is difficult to criticize cleanly.

It can always step back into the research label.

And to be fair, that label is not fake.

OpenMythos presents itself as a theoretical reconstruction, not a production model. It is closer to a research engineering artifact than a deployed system. A research project is allowed to explore unstable ideas, contain incomplete training paths, and implement architectural hypotheses before proving full practical viability.

The research label is legitimate.

But legitimacy is not immunity.

A project can be honest as research and still be over-consumed as revolution.

Three gaps, one pattern: the project speaks in the language of research, while the public reaction translates it into the language of arrival.

Can OpenMythos be used as-is as a serious practical model path?

No.

Not yet.

As a research artifact, it is useful. As a source of architectural ideas, it is useful. As an implementation reference for LTI-style recurrent stabilization, MLA-style cache compression, or recurrent-depth experiments, it is useful.

But as a practical training-ready model that fulfills the public narrative around Claude Mythos-like reasoning, parameter efficiency, adaptive stopping, and scalable MoE execution, it is not there.

The project may be research.

The public reaction was something else.

7. Closing: The Next Shape of the Wave

The lesson is not to stop being excited.

A field without imagination becomes sterile. Research needs people willing to chase strange shapes before they are safe, polished, or practical.

But the next wave will not announce itself as a repeat.

It may not look like OpenMythos. It may arrive as a benchmark result, a memory architecture, an agent framework, a robotics demo, a reasoning paper, or another repository dressed in the language of efficiency.

It will probably have a beautiful README.

It will probably use the right vocabulary.

It will probably give the community a future it already wants to believe in.

And an AI assistant may explain it beautifully before anyone has verified whether the system survives contact with the training path.

That is the pattern to recognize.

Not cynicism.

Not dismissal.

Recognition — followed by verification.

The mistake is not looking at the wave.

The mistake is mistaking the wave for the shore.

A wave can point toward a direction.

It cannot replace land.

OpenMythos is one wave in AI’s development: useful, interesting, technically suggestive, and over-amplified.

The README is not the shore.

The code path is.

The Moment Your App Becomes a Liability: A Deployment Rule for AI in Healthcare

Flamehaven Initiative (2 followers) — Tue, 21 Apr 2026 15:59:35 GMT

No-code healthcare does not fail because the app does not work. It fails because the compliance architecture was never there.

The Zurich Case: Operational Success vs. Compliance Architecture

A clinician in Zurich recently watched a short video explaining how easy it now is to build software with AI. The message landed the way it lands for thousands of practitioners every week: this is the tool that finally closes the gap between what I need and what IT can build for me.

So they built it. A weekend. A coding agent. A custom patient management application, loaded with their entire patient database, connected to two US-based AI services for automatic transcription of appointment audio. No vendor. No waiting. No manual notes.

They were not being reckless. They were solving a real problem, using the best available tools, in exactly the way those tools were designed to be used.

Three weeks later, a security researcher spent thirty minutes in the waiting room and walked out having read and rewritten every patient record in the system. One terminal command. The entire “security” layer was client-side JavaScript in a single HTML file.

When notified, the clinician responded with a message that the researcher described as 100% AI-generated. Warm, professional, entirely missing the point.

I want to linger on that clinician for a moment, because this story is not about negligence. The clinician had domain expertise, patient relationships, and a genuine operational problem to solve. What they lacked was a single question that nobody in the build process had asked:

What will this system do with patient data, and can another party trace and verify that answer?

That question is not a technical formality. It is the dividing line between a system that is operational and a system that is deployable.

In healthcare, only one of those is legal.
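The client-side check in the Zurich story has a simple technical contrast: a check the server performs before any record leaves it. The sketch below is an assumption-heavy illustration, not a reconstruction of the clinic's application. `SERVER_SECRET`, `RECORDS`, the identifiers, and the token format are all hypothetical, and a real system would also need token expiry, transport security, and logging.

```python
import hashlib
import hmac

# Hypothetical secret. It lives on the server and is never sent to the browser.
SERVER_SECRET = b"replace-with-a-real-secret"

# Hypothetical record store with an explicit access list per record.
RECORDS = {
    "record-1": {"allowed": {"clinician-1"}, "note": "placeholder note"},
}


def _mac(user_id):
    return hmac.new(SERVER_SECRET, user_id.encode(), hashlib.sha256).hexdigest()


def sign_session(user_id):
    """Issue a token the server can later verify without trusting the client."""
    return f"{user_id}.{_mac(user_id)}"


def verify_session(token):
    """Return the user id if the signature is valid, otherwise None."""
    user_id, _, mac = token.rpartition(".")
    if user_id and hmac.compare_digest(mac.encode(), _mac(user_id).encode()):
        return user_id
    return None


def read_record(token, record_id):
    """Authenticate and authorize on the server before returning any data."""
    user_id = verify_session(token)
    if user_id is None:
        raise PermissionError("not authenticated")
    record = RECORDS.get(record_id)
    if record is None or user_id not in record["allowed"]:
        raise PermissionError("not authorized")
    return record["note"]


token = sign_session("clinician-1")
print(read_record(token, "record-1"))  # placeholder note
```

A browser-side script can hide a button, but it cannot stop a request sent from a terminal. Here a forged or edited token fails `verify_session`, so the decision sits on the side of the system the visitor does not control.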

The Pattern Has Already Been Enforced, at Scale

Before calling the Zurich case a vibe coding problem, it is worth establishing that the same structural failure has already been enforced, repeatedly, against organizations far better resourced than a solo practitioner.

Between 2023 and 2025, US healthcare organizations paid over $100 million in fines and settlements for one class of failure: patient data moving through third-party services that nobody had mapped against regulatory constraints before deployment.

Cerebral embedded Meta Pixel, TikTok, Google, and Snapchat trackers inside the onboarding forms where patients described their anxiety, depression, and medication histories. This ran for four years and affected 3 million users. The FTC fined the company $7 million and permanently banned it from using patient health data for advertising, severing the company’s primary growth mechanism. The former CEO was named in the complaint personally.

Advocate Aurora Health deployed Meta Pixel on authenticated patient portal pages. The marketing team integrated it exactly as Meta’s documentation instructed. Nobody had modeled what protected health information that pixel would capture once a patient was logged in. The result was a $12.25 million class action settlement affecting 3 million individuals.

The notable thing about Advocate Aurora is not that they failed. It is how they failed. They had a compliance function. The failure was not the absence of oversight.

It was a structural separation: the team that deployed the component and the team responsible for patient data privacy were operating on different tracks, and no process required them to intersect before deployment. That separation is exactly what no-code and vibe-coded workflows institutionalize as the default condition.

BetterHelp. GoodRx. Over $100 million in total. The average cost of a US healthcare data breach in 2025 reached $10.22 million, the highest of any industry, for the fourteenth consecutive year.

In each case, the component was integrated correctly by its own technical documentation. The failure was not a configuration error. It was a prior absence:

Nobody had characterized what that component would do inside a real clinical data environment before it was deployed.

Why This Keeps Happening to Organizations That Should Know Better

Here is the question the enforcement record raises but does not answer:

Why does this pattern appear across organizations with legal teams, compliance officers, and awareness of regulatory risk?

The answer is not that these organizations were careless. It is that the failure is structurally produced by how software gets built.

AI coding tools are optimized to generate code that runs. Running is the success criterion. The workflow declares victory at the moment the application works. Whether that working application can be independently reviewed, whether its data flows are legally characterized, whether an auditor could reconstruct what it did and under what authority twelve months later: none of those are part of what the tool is measuring.

The tool succeeds at the exact moment the compliance problem begins. Silently. And because the system works, there is no internal signal that anything is wrong. A system built without data flow characterization cannot surface the fact that its flows are uncharacterized. The absence is structurally invisible until external pressure is applied — a researcher in the waiting room, a regulator with a subpoena, a class action filing.

This is the mechanism that connects the Swiss clinic to Cerebral to Advocate Aurora. They are not different stories about different failures. They are the same story: a component was placed into a regulated data environment without a prior layer of characterization. The mechanism that placed it was different. The missing precondition was identical.

What vibe coding changes is not the failure mode. It is the velocity. The Cerebral tracking architecture took years to build and years for regulators to identify. A no-code healthcare application can reproduce the same structural exposure in an afternoon. The regulatory frameworks do not adjust for speed.

HIPAA does not include an exception for AI-generated code. The EU AI Act’s clinical-adjacent software provisions become enforceable in August 2026 regardless of how the application was assembled. The structural exposure is the same, even if the regulatory path differs.

As these tools become more capable, the distance between “working application” and “deployment decision” continues to compress. What does not compress is the regulatory exposure. The gap between the moment the tool declares success and the moment a legal obligation attaches does not shrink as the tooling improves. It only becomes easier to cross without noticing.

The Standard That Changes This

The missing layer has a name, and it is worth defining precisely.

Compliance architecture is not a checklist applied after the code is written. It is a prior layer of characterization that must exist before any component is introduced into a patient data context. It answers four questions:

What data will this component touch?
Where will that data move?
Under what legal authority does it move?
How would an auditor reconstruct that path twelve months from now?

Answering those four questions is the design requirement that makes a system deployable rather than merely operational. No-code tools do not ask them. AI coding agents do not ask them. The organizations that avoid enforcement are not the ones using different tools. They are the ones who ask them anyway, before deployment, not after a breach notification arrives.

This is also why the fix is not slower tooling or avoided automation. The fix is a prior question asked at the right moment in the build process.
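One way to place that prior question in a build process is a small pre-deployment gate. The sketch below is a hypothetical illustration: the `DataFlow` fields mirror the four questions above, and the component names and answers are invented placeholders. It checks only that an answer has been written down, not that the answer is legally sufficient.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataFlow:
    """One component's documented answers to the four characterization questions."""
    component: str
    data_touched: List[str] = field(default_factory=list)  # what data it touches
    destination: str = ""       # where that data moves
    legal_authority: str = ""   # under what authority it moves
    audit_method: str = ""      # how an auditor would reconstruct the path

    def gaps(self) -> List[str]:
        """Names of the questions that still have no documented answer."""
        answers = {
            "data_touched": bool(self.data_touched),
            "destination": bool(self.destination.strip()),
            "legal_authority": bool(self.legal_authority.strip()),
            "audit_method": bool(self.audit_method.strip()),
        }
        return [name for name, answered in answers.items() if not answered]


def deployment_gaps(flows: List[DataFlow]) -> Dict[str, List[str]]:
    """Map each incompletely characterized component to its open questions."""
    return {flow.component: flow.gaps() for flow in flows if flow.gaps()}


def is_deployable(flows: List[DataFlow]) -> bool:
    """Deployable only if at least one flow exists and none has open questions."""
    return bool(flows) and not deployment_gaps(flows)


# Placeholder inventory for illustration only.
flows = [
    DataFlow("patient-db", ["patient records"], "clinic server",
             "documented legal basis (placeholder)", "append-only access log"),
    DataFlow("transcription-service", ["appointment audio"], "third-party API"),
]

print(deployment_gaps(flows))
# {'transcription-service': ['legal_authority', 'audit_method']}
print(is_deployable(flows))  # False
```

The gate fails on the component that works perfectly well technically, which is the point: the missing items are documentation and authority, not functionality.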

The deployment standard is simple: can another party trace what this system did, under what authority, and to whom?

If that question does not have a documented answer before the first patient record enters the system, the application is merely operational. It is not deployable.
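Being able to trace what a system did, under what authority, and to whom implies some record that a third party can check for tampering. A minimal sketch of one such technique, a hash-chained log, follows. The field names, identifiers, and timestamps are hypothetical, and this is not a statement of what any regulation requires; it shows only how edits to past entries become detectable.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry's predecessor


def _digest(body):
    """Hash an entry body with stable key ordering."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_entry(log, actor, action, subject, authority, timestamp):
    """Append an entry that commits to the hash of the previous entry."""
    body = {
        "actor": actor,          # who did it
        "action": action,        # what was done
        "subject": subject,      # to whom or to which record
        "authority": authority,  # under what documented basis
        "timestamp": timestamp,
        "prev_hash": log[-1]["hash"] if log else GENESIS,
    }
    entry = {**body, "hash": _digest(body)}
    log.append(entry)
    return entry


def verify_chain(log):
    """Recompute every hash and link; any edit to a past entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        body = {key: value for key, value in entry.items() if key != "hash"}
        if body["prev_hash"] != prev_hash or _digest(body) != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True


log = []
append_entry(log, "clinician-1", "read", "record-1", "treatment", "2026-01-05T09:00:00Z")
append_entry(log, "transcription-service", "transcribe", "record-1",
             "processing agreement (placeholder)", "2026-01-05T09:05:00Z")

print(verify_chain(log))  # True
log[0]["action"] = "write"
print(verify_chain(log))  # False
```

A log like this answers the reconstruction question only if it is written at the time of the action and stored where the application itself cannot silently rewrite it.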

Where This Leads

OCR’s 2026 enforcement expansion specifically targets organizations that have not conducted risk analyses. A practitioner who built a patient management application with an AI coding agent and skipped risk analysis is very close to the kind of failure that expansion is designed to reach.

The enforcement cases so far have concentrated on covered entities with institutional structure: compliance teams, legal counsel, vendor relationships. The structural risk is increasingly concentrated below that threshold, among solo practitioners, small clinics, and health tech founders who have none of that infrastructure and no organizational layer between the coding agent’s output and a production deployment.

Those cases have not arrived yet. They will. And when they do, they will be public, permanent, and attached to names on a breach portal that the industry calls the Wall of Shame.

Platform-level guardrails are improving but are structurally misaligned: the growth incentive of no-code platforms runs in the opposite direction from what regulated deployment requires. Enforcement will build the market for domain-aware compliance tooling before voluntary platform governance will. That tooling is technically feasible. It does not yet exist at the scale or accessibility the problem requires.

Until it does, the verification is manual.

Verify before you deploy. Or someone else will do it for you, after a breach, under subpoena.

🔥Flamehaven builds governance architecture for AI systems in high-stakes environments. The full technical analysis behind this piece, including enforcement data, regulatory framework, and deployability evaluation methodology, is at: The $100 Million Blind Spot: What No-Code Healthcare Builders Still Don’t See

🔥For direct conversation about where your system’s deployability gap actually is: flamehaven.space/contact