DEV Community

Daniel Nwaneri

Everyone's Excited About Claude Tag. Nobody's Built the Trust Layer.

Social friction in ambient AI channels

Andrej Karpathy, OpenAI co-founder and former Tesla AI director, called Claude Tag the third major redesign of LLM UI/UX. First the LLM was a website. Then it was an app you downloaded. Now it's a persistent, asynchronous teammate that lives inside your Slack channels with org-wide context. He's right about the architecture. He's silent on what happens to the room.

Simon Smith, who'd already wired ChatGPT Workspace Agents into his team's Slack, said ambient visibility helps adoption: people watch each other use Claude in a shared channel and learn organically, no training program required. That's true for the person who turned Claude on. It's a different experience for the person who didn't get a vote.

I wrote about this eighteen hours after the announcement. Tag Claude into a five-person team and the moment it joins, every message anyone types is something an AI reads. You stop looking like someone using a tool. You start looking like the person who brought a surveillance device into the meeting. The frame you've built from there is unwinnable. Good output gets read as "she's outsourcing her thinking." Mediocre output gets read as "see, this is what we were worried about." There's no third outcome that proves the skeptics wrong.

Gail Weiner replied to that thread with something simple. Bring the skeptics into the conversation. Ask how comfortable they are. Start small, let them pick the first use case, and let the small win be something they can point to and say out loud: this added value.

That's not diplomacy. It's a trust mechanism with a hard edge. The moment a skeptic says "okay, this added value" out loud, they're no longer the person blocking the rollout. They're on record as the person who approved it. Gail named it better than I did: the human trust layer. Everything Karpathy is excited about runs on top of that layer, and nobody launching Claude Tag this week is talking about who builds it.

Days before this launch I wrote production-safe-agent-loop, a small Python library for keeping single-agent loops from running away. A four-agent LangChain loop ran eleven days and cost $47,000. Claude Code recursion has burned $16,000 to $50,000 in five hours. The fix wasn't a smarter agent. It was five primitives: a spec writer that forces three answers before the loop runs, a circuit breaker with hard ceilings, an append-only ledger, the loop that respects both, and a review surface that assembles a fixed five-element frame once the run finishes: the original promise, the acceptance criteria, the diff, the evidence, and the unresolved assumptions.
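To make two of those primitives concrete, here is a minimal sketch of a circuit breaker with hard ceilings feeding an append-only ledger. It is an illustration under stated assumptions, not the production-safe-agent-loop API: the names `CircuitBreaker`, `Ledger`, `CeilingExceeded`, and `run_loop` are hypothetical, and the ledger is append-only by interface only, since it lives in the same process as the loop.

```python
import json
from dataclasses import dataclass, field


class CeilingExceeded(RuntimeError):
    """Raised when a run crosses one of its hard ceilings."""


@dataclass
class CircuitBreaker:
    # Hypothetical ceilings; real values depend on the workload.
    max_steps: int
    max_cost_usd: float
    steps: int = 0
    cost_usd: float = 0.0

    def charge(self, step_cost_usd: float) -> None:
        """Record one step and trip if either ceiling is crossed."""
        self.steps += 1
        self.cost_usd += step_cost_usd
        if self.steps > self.max_steps:
            raise CeilingExceeded(f"step ceiling {self.max_steps} exceeded")
        if self.cost_usd > self.max_cost_usd:
            raise CeilingExceeded(f"cost ceiling {self.max_cost_usd} exceeded")


@dataclass
class Ledger:
    _rows: list = field(default_factory=list)

    def append(self, event: dict) -> None:
        # Serialize on write so later mutation of `event` cannot alter history.
        self._rows.append(json.dumps(event, sort_keys=True))

    def rows(self) -> tuple:
        # Return an immutable copy; there is no update or delete method.
        return tuple(self._rows)


def run_loop(step_costs, breaker: CircuitBreaker, ledger: Ledger) -> str:
    """Run until the work is done or the breaker trips; return the stop reason."""
    for cost in step_costs:
        try:
            breaker.charge(cost)
        except CeilingExceeded as exc:
            ledger.append({"event": "tripped", "reason": str(exc)})
            return "tripped"
        ledger.append({"event": "step", "cost_usd": cost})
    return "completed"
```

The point of the design is that the ceiling check sits outside the agent's own judgment: the loop stops because a counter says so, not because the agent decides it is finished.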

The last piece is the one that matters here: attestation. A human reviews the frame, and attestation is not approval. It's a record that they reviewed exactly what's in front of them and they're taking responsibility for what happens next. The frame gets hashed. Two reviewers attesting the same session get the same hash. That's a receipt, not a vibe.
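A minimal sketch of why two reviewers land on the same hash, assuming the frame is a JSON-serializable dict. The field names and the functions `frame_hash` and `attest` are hypothetical stand-ins, not the library's actual interface. The reviewer's identity is kept outside the hashed content, so the hash depends only on what was reviewed.

```python
import hashlib
import json

# Hypothetical field names for the five-element frame.
FRAME_FIELDS = (
    "promise",
    "acceptance_criteria",
    "diff",
    "evidence",
    "unresolved_assumptions",
)


def frame_hash(frame: dict) -> str:
    """Hash a canonical serialization of the frame (sorted keys, fixed separators)."""
    missing = [name for name in FRAME_FIELDS if name not in frame]
    if missing:
        raise ValueError(f"frame is missing fields: {missing}")
    canonical = json.dumps(
        {name: frame[name] for name in FRAME_FIELDS},
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def attest(frame: dict, reviewer: str) -> dict:
    """Record that `reviewer` reviewed exactly this frame. Not an approval."""
    return {"reviewer": reviewer, "frame_sha256": frame_hash(frame)}
```

Canonical serialization is what makes the receipt reproducible: key order and whitespace do not change the digest, while any change to the diff or the evidence does.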

Claude Tag has none of this. It's ambient, persistent, and it decides on its own initiative what's relevant across every channel it's in. The five-element frame I built for single-agent loops maps directly onto "what did Claude decide in this channel, and did a human actually sign off on it." Just at team scale, across a dozen channels, not one developer's terminal session.

What ships without it

VentureBeat is asking about data retention and vendor lock-in. Twitter is asking what it can do. The actual unresolved question is structural: when an ambient agent acts on its own initiative across a dozen channels, who gets the five-element frame, who has to attest to it, and what happens when nobody does.

Claude Tag isn't wrong to exist. It shipped the easy half. The architecture works. The trust layer is still unbuilt, and it's not a UX problem. It's an audit problem with a name and a shape, and I already wrote the code for what it looks like when someone takes it seriously.

More on agent governance at dannwaneri.com/ai-agents.


AI helped me research and edit this piece. The arguments, the examples, and the opinions are mine. So is whatever's wrong with them.

Top comments (20)

Mike Czerwinski

Attestation as record-not-approval is the primitive most teams will skip when they retrofit Claude Tag into existing review practice. It is also the part of your post I want to keep, because it names a distinction the rest of the conversation collapses: approval implies endorsement of the decision, attestation only commits the reviewer to having read the artifact and accepted accountability for what happens next. Those are different liabilities and they need different surfaces.

The hash convergence on two-reviewer attestation does interesting work I had not seen named explicitly. Two reviewers attesting the same session get the same hash by construction, because the artifact is content-addressed. That gives you a verifiable record of independence-via-content, which is different from independence-via-different-judgment. Independence-via-content says "we both saw the same thing"; independence-via-judgment says "we agreed despite seeing differently." Most agentic verification setups conflate these and end up with neither.

The piece that maps directly from your single-agent framework to the ambient-team case, and where I think the harder problem hides, is: at team scale, who authors the five-element frame matters as much as who attests to it. If the same agent that made the decision also assembles "original promise / acceptance criteria / diff / evidence / unresolved assumptions," the attestation is over a self-curated artifact. Structural separation between deciding agent and frame author is the part that needs ambient infrastructure most teams don't have yet, and it does not get solved by "tag a different agent to write the summary" because the deciding agent's outputs are still the only source the framer sees.

Gail Weiner's "skeptic on record as approver" mechanism is also the social version of your hash receipt: a public commitment a third party can point at later. That is the same primitive at a different layer, content-addressed accountability at the technical layer, name-addressed accountability at the organizational layer. Both have the same structural property: the accountability is hard to relocate after the fact.

Honest stage marker on this side: I work adjacent (operator-side decision audit on dev.to, with a parallel decision-audit primitive set called jugeni-contracts published this week). Reading your post sharpened how I think about the attestation-vs-approval split, which is a distinction I had been leaving implicit.

Daniel Nwaneri

You're right, and it's worse in my own repo than the Claude Tag case makes obvious. The ledger rows are written by the same process that made the decisions. ReviewSurface reads self-reported evidence and calls it independent verification. Separating framer from decider means an observer outside the agent's own process: something logging tool calls and token counts at the system level, not trusting the agent to log itself honestly. I haven't built that yet. Where would you put the boundary?

Mike Czerwinski

The boundary is the transport seam, not the code path. The agent can lie about what it did internally; it cannot lie about what bytes left its process. So the observer has to live where the agent's output crosses into the world — the syscall boundary, the MCP transport, the LLM API call, the file write fsync. Not inside the agent's loop, calling it "instrumented."

Concretely, three properties the observer needs:

1. Different process, append-only sink it cannot rewrite. If the agent can edit the log, the log is self-report under a different name. OTel collector in another container, or a unix socket to a separate writer with one-way pipe. The agent doesn't get a handle to the past.
2. Capture at the wire, not the wrapper. Token counts and tool calls logged by whatever talks to the model provider, not by the agent's own SDK wrapper. The wrapper is on the agent's side of the seam. Anthropic's API logs, OpenAI usage records, MCP server-side logs: those sit on the world's side.
3. Content-addressed, so the ledger row points at the bytes that actually went over the wire. Hash of the prompt, hash of the response, signed by the transport layer. If ReviewSurface reads evidence_hash and the observer's log has a different hash for the same call, you have a discrepancy without trusting either side.

The honest version of "I haven't built that yet" is that almost nobody has: most agent observability is wrapper-side, which means it is still self-report with extra latency. The closest production examples I have seen are the ones that piggyback on existing transport layers nobody owns: provider usage records, MCP server logs, syscall auditing. Those exist because somebody else wrote them for a different reason.

The shape that would land in your repo without rewriting everything: a sidecar process subscribing to the MCP server's stdio, hashing each call and response, writing append-only to a sink the agent's runtime cannot reach. ReviewSurface keeps reading its own ledger; an external diff job compares the two. Disagreement is the signal, agreement is not.
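A rough sketch of that external diff job, under loud assumptions: both sides key their rows by a shared call identifier, the agent's ledger stores one content hash per call, and the sidecar stores the hash of the bytes it saw on the wire. The functions `wire_hash` and `diff_records` and the finding labels are hypothetical; nothing here is an existing ReviewSurface or MCP interface.

```python
import hashlib


def wire_hash(request_bytes: bytes, response_bytes: bytes) -> str:
    """Hash one call as seen on the wire.

    The request length is prefixed so that moving bytes between request
    and response always changes the digest.
    """
    h = hashlib.sha256()
    h.update(len(request_bytes).to_bytes(8, "big"))
    h.update(request_bytes)
    h.update(response_bytes)
    return h.hexdigest()


def diff_records(agent_ledger: dict, observer_log: dict) -> list:
    """Compare {call_id: hash} maps from the agent and from the sidecar.

    Returns (call_id, finding) pairs. An empty list means no disagreement
    was found, which is weaker than proof of honesty.
    """
    findings = []
    for call_id in sorted(agent_ledger.keys() | observer_log.keys()):
        claimed = agent_ledger.get(call_id)
        observed = observer_log.get(call_id)
        if claimed is None:
            # The sidecar saw a call the agent never wrote down.
            findings.append((call_id, "unreported_call"))
        elif observed is None:
            # The agent claims a call the sidecar never saw.
            findings.append((call_id, "unobserved_call"))
        elif claimed != observed:
            findings.append((call_id, "hash_mismatch"))
    return findings
```

The job only ever reports disagreement. It has no output that means "verified", which matches the rule that disagreement is the signal and agreement is not.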

Three-actor structure stays: frame author (agent), attester (sidecar), reader (anyone). What changes is that the attester is not running the same code as the framer.

Daniel Nwaneri

The "wire not wrapper" distinction is the part I needed. My repo doesn't run MCP; it calls the Anthropic SDK directly, so the seam is the HTTPS call to api.anthropic.com, not stdio. That means the sidecar has to terminate TLS or proxy the connection, not just wrap the client in the same interpreter, since a same-process wrapper is still self-report with extra steps. Does the diff-job design still hold once the proxy has to decrypt to hash anything, or does that just relocate the trust problem to whoever holds the cert?

Mike Czerwinski

Yes, the trust problem relocates. That isn't a bug to solve, it's the property to design around. No software-only solution gives you an observer that needs no trust at all; you can only push trust to actors with different motives, fewer conflicts of interest, or harder-to-compromise positions. A same-process wrapper has no independent motive and is self-report under another name. Different process, different motive, harder-to-compromise position is the gradient.

Three options for your Anthropic-SDK-direct setup, ranked by infra cost:

1. Anthropic-side usage records as an already-existing observer. They log request/response metadata server-side because they have to bill you and detect abuse. That motive is not identical to your agent's motive of self-favorable reporting. You don't control what they log and you don't get content hashes by default, but it's zero new infra, and the structural property (separate party, different motive) is real. Cheap; partial coverage. Worth pulling first because someone already paid for it.
2. eBPF uprobes on SSL_write / SSL_read. These are user-space library functions, so the hook is a uprobe rather than a kprobe. An eBPF program, running in the kernel, captures the cleartext bytes at the library call before they enter TLS, and the sidecar reads them from that hook. You never terminate TLS, so the cert ownership question doesn't apply. Trust relocates to the kernel, which is the threat-model boundary you were already implicitly trusting. Cilium and Falco run eBPF in production for security observability. Same primitive, applied to honesty rather than security.
3. TLS-terminating proxy with separately-owned cert. Works, but only buys you independence if the cert owner is a different operator/team/org than the agent's owner. Same team holds the cert = security theater. Different team holds the cert = real gain, plus production complexity (cert rotation, latency, SPOF).

The cleanest of the three for a small ops surface is #2. The trust relocation lands on the kernel, which is where most threat models already draw the line, and you skip the TLS termination problem entirely. The cleartext is in process memory for a few microseconds anyway; the kernel just gets to see it without you having to MITM yourself.

Honest stage: I haven't shipped this at the SSL_write boundary in production. The eBPF pattern works for security observability — applying it to honesty observability is the same primitive, but I am reasoning by analogy on the production-readiness side. The Anthropic-side usage path I am more sure of, because it costs nothing to start using.

Daniel Nwaneri

The eBPF route is the cleanest design, but it's a kernel-level commitment that the audience of a "minimal library, $5/month" tutorial repo won't make. Anthropic's own usage records are the only option that matches the project's existing philosophy of zero added infrastructure. Is partial coverage (metadata only, no content hash) still worth shipping as a real v2 step, or is it thin enough that it's not meaningfully different from having nothing?

Mike Czerwinski

Metadata-only is meaningfully different from nothing if you reframe what you are trying to catch. The question is not "does this give us full audit"; it is "does this catch realistic adversarial patterns at this layer."

What metadata-only catches (Anthropic usage records vs ReviewSurface):

- Ghost calls. Ledger says N tool calls, Anthropic billing says N+1. The extra one is structural fraud you cannot fake server-side because somebody charged you for it.
- Model substitution. Ledger claims Haiku, usage shows Opus. Or the reverse: claims premium reasoning, actually ran cheap. Cost-drift fingerprint.
- Token count mismatch. Ledger claims response of length X, usage shows tokens consumed for length Y. Tampering with the content row gets caught when it does not match what was actually billed.
- Sequence / timestamp violations. Out-of-order calls, calls during periods the agent claims it was idle, retries not logged.

What it does not catch: content-level fabrication where token counts and call patterns are honest but the response text was tampered after the fact. That is a real adversary but a less common one in practice, and crucially, it almost always co-occurs with one of the structural patterns above, because most fraud has to lie about what happened as well as what was returned.

The framing that makes this worth shipping as v2: this is the boring-on-purpose primitive. Two-layer defense — server-side metadata for did this happen as claimed, wrapper-side content for what was the substance. Self-report on the content side is acceptable precisely because the structural facts are externally verified. Tamper the content and the metadata fingerprint diverges. That is the same shape as a correctness primitive being cheaper to verify than to forge.

Concrete v2 ship spec, fits the "$5/month, minimal library" shape:

1. For each ReviewSurface entry, store a fingerprint: (timestamp, model_id, tokens_in, tokens_out, tool_call_count)
2. Periodically pull Anthropic usage records for the same window.
3. Diff the two. Mismatch = flag for review.

That is one cron job and one diff function. No proxy, no TLS termination, no kernel. Audience can adopt it without changing their stack.
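The diff function could look roughly like the sketch below. Assumptions are loud here: the provider-side records are modeled as the same `Fingerprint` shape as the ledger rows, which real usage exports may not match field for field; fetching them is out of scope; rows are paired by nearest timestamp within a hypothetical `max_skew_s` tolerance; and the model names in the usage example are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fingerprint:
    timestamp: float      # seconds since epoch
    model_id: str
    tokens_in: int
    tokens_out: int
    tool_call_count: int


def diff_window(ledger, usage, max_skew_s=5.0):
    """Pair each ledger fingerprint with the nearest unused usage record.

    Returns (kind, fingerprint) flags for human review.
    """
    flags = []
    unmatched = sorted(usage, key=lambda u: u.timestamp)
    for entry in sorted(ledger, key=lambda e: e.timestamp):
        candidates = [
            u for u in unmatched
            if abs(u.timestamp - entry.timestamp) <= max_skew_s
        ]
        if not candidates:
            # The ledger claims a call the provider has no record of.
            flags.append(("missing_from_usage", entry))
            continue
        match = min(candidates, key=lambda u: abs(u.timestamp - entry.timestamp))
        unmatched.remove(match)
        if match.model_id != entry.model_id:
            flags.append(("model_substitution", entry))
        if (match.tokens_in, match.tokens_out) != (entry.tokens_in, entry.tokens_out):
            flags.append(("token_mismatch", entry))
        if match.tool_call_count != entry.tool_call_count:
            flags.append(("tool_call_mismatch", entry))
    # Whatever the provider recorded that the ledger never mentioned.
    flags.extend(("ghost_call", u) for u in unmatched)
    return flags
```

Token comparison is exact equality in this sketch, which is deliberately strict; how much tolerance to allow is left open.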

Honest stage: this is not equivalent to content-level audit and you should not claim it is. But "structural fraud caught at the transport seam, content fraud surfaced by mismatch" is a real v2 step, not a fig leaf. The fig-leaf version would be hashing the response on the agent's side and storing it next to the ledger row: that is just self-report with a checksum.

Daniel Nwaneri

This is a complete v2 spec, not just a direction. Ghost calls and model substitution are the two I hadn't separated out, and "tamper the content and the metadata fingerprint diverges" is the line that makes the self-report half acceptable instead of suspicious. Building the fingerprint-plus-diff-job exactly as specified. The one thing still undefined is mismatch tolerance: legitimate retries and network hiccups versus an actual flag. Want to weigh in on that threshold once it's built, or is that a separate problem?

Mike Czerwinski

Same problem, separate architecture — worth naming the principles now so you don't rediscover them under deadline pressure.

The threshold has two distinct shapes that conflate badly. Infrastructure noise (transparent SDK retries, network hiccups, HTTP 5xx) has an observable signature: idempotency keys, exponential backoff timing, error class. Adversarial evasion-via-noise is the case where someone hides fraud in the same noise floor on purpose. Different problems, different rules.

Five principles worth specifying before cutting numbers:

1. Retry-aware aggregation before diffing. Single semantic call that triggers 3 SDK retries should collapse to one ledger entry vs three usage records. Match on the leader, not the raw count.
2. Class-aware thresholds, not flat. Token count mismatch of ±10 is rounding noise. ±500 is signal. Model substitution is always signal: no legitimate noise should change model_id. Call count mismatch by 1 within retry window is noise; by 5 is signal.
3. Per-actor accumulation, not per-event. Single mismatch is not a flag. Multiple mismatches accumulating on the same actor within window = signal. Same shape as injection-suspicion that accumulates on the writer rather than firing per cell: cost lands on the actor's standing, not the individual call.
4. Calibrate empirically, not from theory. Ship with conservative thresholds, log near-misses, hand-review the first N days, tighten based on actual noise distribution. The numbers can only come from production data. Speccing them now means picking the wrong numbers more confidently.
5. Spec the adversarial-mode threshold separately. The honest-retry-tolerance case and the someone-hides-in-noise case have opposite design pressures. One threshold for both ships either too-loose or too-tight.

Yes, want to weigh in once it's built and there is real mismatch data to look at. Threshold setting from spec is the kind of thing that has to be re-done from production numbers anyway. Framework in spec, numbers from data.
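The first three principles can be sketched as constraints in code before any numbers are calibrated. Everything below is an assumption made for illustration: the record fields (`idempotency_key`, `model_id`, `tokens_out`), the choice of the first record in a retry group as the leader, the reading that a "signal" verdict flags an actor immediately while softer mismatches accumulate, and the constants, which are just the example figures from the list above. The time window is omitted for brevity.

```python
from collections import Counter

# Placeholders from the examples above, not calibrated thresholds.
TOKEN_NOISE = 10
TOKEN_SIGNAL = 500


def collapse_retries(usage_records):
    """Collapse records sharing an idempotency key into one leader row."""
    leaders = {}
    for rec in usage_records:
        key = rec["idempotency_key"]
        if key in leaders:
            leaders[key]["attempts"] += 1
        else:
            # Copy so the caller's record is not mutated.
            leaders[key] = {**rec, "attempts": 1}
    return list(leaders.values())


def classify(ledger_row, usage_row):
    """Class-aware verdict for one matched pair."""
    if ledger_row["model_id"] != usage_row["model_id"]:
        return "signal"  # no legitimate noise changes the model
    gap = abs(ledger_row["tokens_out"] - usage_row["tokens_out"])
    if gap == 0:
        return "ok"
    if gap <= TOKEN_NOISE:
        return "noise"
    if gap >= TOKEN_SIGNAL:
        return "signal"
    return "near_miss"  # logged for later hand review and calibration


def actors_to_flag(events, min_soft=3):
    """events: iterable of (actor, verdict) pairs.

    A 'signal' flags the actor at once; 'noise' and 'near_miss' only
    flag after `min_soft` of them accumulate on the same actor.
    """
    soft = Counter()
    flagged = set()
    for actor, verdict in events:
        if verdict == "signal":
            flagged.add(actor)
        elif verdict in ("noise", "near_miss"):
            soft[actor] += 1
            if soft[actor] >= min_soft:
                flagged.add(actor)
    return sorted(flagged)
```

Keeping the thresholds as named constants rather than inline literals is the point: the structure can ship now and the numbers can be replaced once production data exists.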

Daniel Nwaneri

Framework now, numbers from data: agreed, and the right call. Building it with all five principles specified as constraints, not guesses. Will ping you when there's real mismatch data to look at. Thanks, Mike.

Mike Czerwinski

Locked in. Constraints-not-guesses is the version of "spec the framework" that survives contact with production noise. Will be reading when the data lands.

Daniel Nwaneri

Appreciate the whole thread. The tutorial's the next thing I write and you're getting credited by name in it.

Theo Valmis

The unwinnable-frame observation is the real insight here, sharper than the tooling debate. Good output reads as outsourcing, bad output confirms the fear, and there's no third result that converts a skeptic on its own. Gail's move works because it changes who's on record, not because it changes the tech. One thing I'd add: the skeptic's 'this added value' only holds if they can see what the agent did and why. Trust survives when the work is legible, not just when the output happens to be good.

Daniel Nwaneri

Right, and it's the same requirement Mike's working through in the other thread, just at a different layer. His fix for technical trust is an independently auditable trail: something a different process can check without trusting the agent's own account. The skeptic's version of legible isn't a ledger, though. They're not opening a diff tool. What does "I can see what it did and why" actually look like for someone who was never going to read the logs?

leob

Deep stuff, and I have the feeling you're right ... when is Anthropic going to hire you?

Daniel Nwaneri

They don't have to. I already gave the code away.🤣

leob

They don't have to - but they might want to, when they recognize you've got some serious "skillz" ... ;-)

Just curious, if they'd offer you a nice "position" (remote), would you consider it? :-)
(but maybe it's only possible if you relocate to the US, I don't know)

Daniel Nwaneri

Remote, yes. Relocating to the US, no. That's the version I'd actually consider...

leob

Oh yes that's so true, I couldn't agree more ...

I would also NEVER consider relocating to the US, even if they'd beg me (vanishingly small chance of that, lol) - especially not with the current Trump administration and their ICE insanity and all that ... thanks but no thanks!

(well it's completely theoretical, because they seem to have decided that they really do NOT want any foreigners in their country anymore, not even the best and the brightest, or the most hardworking)

Daniel Nwaneri

lol, let's leave that one to the comments section and let the repo do the talking.