DEV Community: Brenn Hill

What Is Human-in-the-Loop (HITL) in AI? A Practical Guide

Brenn Hill — Tue, 23 Jun 2026 16:59:49 +0000

Human-in-the-loop (HITL) in AI means keeping a person involved in an automated system's decisions, approving, editing, or interrupting what an AI does, instead of letting it run fully on its own. For AI agents, human-in-the-loop is the practice of pausing the agent at chosen points so a human can review or steer an action before it takes effect. The hard part isn't adding a human. It's making sure that human can actually catch the mistakes that matter.

This guide explains what human-in-the-loop is, the three forms it takes in real AI agents, how it differs from full automation and human-on-the-loop, and why a review step is not the same as safety. Then it covers how to do HITL well, and when you should prevent a bad outcome instead of reviewing for it.

What human-in-the-loop actually means

The phrase comes from control systems and machine learning, where a "loop" is the cycle of an action, its result, and a correction. Putting a human in the loop means the cycle can't close without a person: the system stops and waits for input. Putting a human on the loop means the system runs autonomously while a person watches and can step in. Taking the human out of the loop means full automation.

In AI agents (code agents, computer-use agents, support bots, ops automations) human-in-the-loop is how teams try to keep oversight as autonomy grows. The agent proposes or starts an action; a human gets a say. That's the idea. Whether it works depends entirely on the details, which is where most implementations quietly fail.

The three modes of human-in-the-loop AI

HITL shows up in agents in three recognizable shapes. Most products use a mix.

1. Approve-before-act

The agent describes an action and waits for a yes before doing it: "Run this command?" "Send this email?" "Delete these rows?" This is the most common pattern and the most over-trusted. It feels safe because nothing happens without a click, but a click is not the same as understanding. (More on that below.)

2. Review-and-edit

The agent produces a draft (code, a message, a plan, a config change) and the human reviews and edits it before it ships. This is genuinely useful when the artifact is legible and the reviewer has time: a small diff, a short email, a single query. It degrades fast when the output is large or dense, because reviewers skim.

3. Interrupt-and-resume

The agent runs autonomously, but a human (or a monitor) can pause, redirect, or kill it mid-task. This is the human-on-the-loop end of the spectrum, and it's the right default for high-throughput work where stopping for every action would be absurd. It only counts as oversight if the interrupt is real: reachable, fast, and able to halt in-flight work.

HITL vs. full automation vs. human-on-the-loop

These aren't three boxes. They're points on an autonomy ladder, and the right point depends on the action.

Full automation: the agent acts, no human gate. Correct for trivial, reversible, contained actions where a human adds nothing.
Human-on-the-loop: the agent acts autonomously while a human supervises and can interrupt. Good for high-volume work with a healthy kill switch and monitoring.
Human-in-the-loop: the agent waits for a human before (or during) the action. Reserve it for the consequential, gateable middle.

The mistake is treating this as one global setting ("we have a human in the loop") applied to everything an agent can do. That guarantees either fatigue (you gate trivia) or blind risk (you under-gate the dangerous stuff). The LoopRails framework argues you should pick the mode per action, not per product. For the full mapping of grades to autonomy modes, see the playbook.

The misconception: a review step is not safety

Here's the belief that breaks most HITL designs: if a human approves it, it's safe. It isn't, and the evidence is direct.

In research on AI coding agents (see the LoopRails codex), requiring plan-approval before the agent acted cut attack occurrence from roughly 90% down to 60 to 74%. That sounds like a win until you look at whether the human actually caught the bad action when it slipped through. Human intervention success stayed at just 9 to 26% across every approval strategy tested. Gating reduced how often bad actions happened, but barely improved the human's ability to recognize and stop one. The gate was working as a speed bump, not as a detector.

Why? Automation bias. People over-trust system suggestions and approve them without real scrutiny, especially when the system has been right before, when the output looks confident, and when there's time pressure to keep moving. A confirmation prompt does not turn a person into a good error-catcher. It mostly turns them into a click.

Two failure modes follow from this:

The Rubber Stamp: approvals get clicked through reflexively, so the gate stops bad actions occasionally but rarely catches a targeted one.
The Moral Crumple Zone: when something goes wrong, the human who clicked "approve" gets the blame, even though they never had a realistic chance to catch the problem. The review existed to assign accountability, not to prevent harm.

If your oversight only proves that a review step exists, you have Phantom Oversight: a control that looks like safety on the org chart and does nothing in production.

The better question

Don't ask "should a human review this?" Ask: can a human realistically catch this mistake in time?

That reframes oversight as an engineering problem with a testable answer. A gate can work when the reviewer can see the real action and its consequences, has the competence and the time to judge, and can actually stop or reverse it. When they can't, when the consequence is high but their controllability is low, review is a trap. You're staging a decision the human can't really make, and a confirmation prompt just launders the risk into their name.

That's the line between oversight that prevents harm and oversight that exists to be pointed at after harm.

How to do human-in-the-loop well

LoopRails frames good HITL as four moves: Grade, Guard, Show, Prove.

Grade

Score every action an agent can take on three axes (reversibility, blast radius, and stakes) and let the highest axis set the grade, G0 to G3.

G0, trivial: reversible, local, no stakes (read a file, run a read-only query). No gate; gating it just breeds fatigue.
G1, low: at most one medium axis (edit a local file, run tests). Cheap undo beats a confirmation.
G2, high: any one high axis (git push, spend within budget, send an internal message). Confirm-before with a real preview.
G3, critical: irreversible and external or severe (deploy, pay, delete prod data, post publicly). Prevent, or escalate. Review alone won't hold here.

Guard

Match the control to the grade. Don't spend attention on G0/G1; gate G2 with a preview; for G3, lean on prevention patterns over approval prompts: Sandbox-First (contain blast radius in the environment), Blast-Radius Cap (limit any single action's magnitude), Capability Lock (make the bad action impossible, not discouraged), Runtime Shield, Kill Switch, Circuit Breaker, and Maker-Checker (the proposer is never the approver).

Show

When you do pull a human in, design the moment. Show them the real action and its consequences (a diff, a preview, the side effects, whether it can be undone) rather than a bare "Approve?" Surface the agent's uncertainty and provenance so they can check rather than trust. And spend attention sparingly: interrupt rarely and at meaningful breakpoints, because over-prompting trains people to dismiss prompts.

Prove

Treat "a human reviews it" as a claim to validate, not a checkbox. Seed known errors and prompt-injection attempts into your pipeline and measure whether the human (or monitor) actually catches them. The number that matters is intervention-success rate, not approval rate. Untested oversight is unvalidated oversight.

Underneath all four moves, keep every governed action on the RAIL: Reversible, Authorized, Interruptible, and Logged. An action that satisfies those four leaves even a missed review recoverable, scoped, stoppable, and accountable.

When HITL is the wrong tool, prevent instead

Sometimes the honest answer to "can a human catch this in time?" is no. The action is too fast, too opaque, or too irreversible, and no realistic prompt would let a person intervene effectively. In that case, don't add a review. Adding one creates a Rubber Stamp and a Moral Crumple Zone at once. Change the action instead so the bad outcome can't happen or can be undone.

The clearest example is the lethal trifecta. An agent that has (1) access to private data, (2) exposure to untrusted content, and (3) a way to send data externally can be tricked by prompt injection into exfiltrating that data. No "are you sure?" prompt reliably catches this, because the malicious instruction is buried in content the human won't read, and the agent looks like it's doing its job. The fix isn't review; it's prevention. Remove any one leg (cut external send, isolate the private data, or sanitize the untrusted input) and the attack can't complete. That's a Capability Lock, not a gate.

When consequence is high and controllability is low, prevention beats review every time.

Key takeaways

Human-in-the-loop means a person can approve, edit, or interrupt an AI's action before it takes effect, the opposite of full automation.
It shows up in three modes: approve-before-act, review-and-edit, and interrupt-and-resume.
Adding a review step is not the same as safety: gates cut how often bad actions occur but barely improve a human's ability to catch one (9 to 26% intervention success), and automation bias makes approvals reflexive.
Ask "can a human realistically catch this in time?", not "should a human review this?"
Do HITL well with Grade, Guard, Show, Prove, and keep every action Reversible, Authorized, Interruptible, Logged.
When a human can't catch the mistake in time, prevent the bad outcome instead of staging a review.

Get started

Stop asking whether you have a human in the loop and start grading your agent's actions. Run your riskiest actions through the interactive grader to see their G0 to G3 grade and the controls that match, then work the four moves with the practitioner playbook. Keep the cheatsheet next to your next agent review, and the next time someone proposes "just add an approval step," ask whether the human can actually catch the mistake in time.

Originally published at looprails.dev/article-what-is-human-in-the-loop.html. LoopRails is a free, sourced framework for designing human-in-the-loop oversight of AI agents.

73% of AI-agent credential leaks trace back to one mundane thing: debug logging

Brenn Hill — Tue, 23 Jun 2026 15:38:17 +0000

A paper accepted to ASE 2026 — "How Your Credentials Are Leaked by LLM Agent Skills: An Empirical Study" (Chen et al.) — did something most agent-security discussion doesn't: it measured. The authors sampled 17,022 third-party agent "skills" and looked for credentials leaking out of them. The result is worth sitting with.

The numbers

520 skills leaked credentials, across 1,708 distinct issues.
89.6% of the leaked credentials were immediately exploitable — and 92.5% of those during routine execution, no privilege escalation needed.
Secrets removed from 107 upstream repositories persisted across 50+ forks, so "we patched it" didn't actually fix it downstream.

But the single most useful finding is the mechanism.

The dominant cause is boring, and that's the point

73.5% of the leaks came from debug logging.

Not a clever exploit. Not a novel attack. Debug logging. Here's why that's not as dumb as it sounds: in most agent frameworks, a tool's stdout is piped straight into the model's context window — and from there into your traces and logs. So the moment a skill prints something for debugging, and that something happens to include an API key or a token, the secret has been handed to the model and written to your logs. Nobody decided to leak it. The plumbing did.

This reframes how you should think about data hygiene in an agent. We spend most of our attention on what comes in — prompt injection, untrusted documents, poisoned tool descriptions. But a tool's output is a leakage channel too, running in the opposite direction, and it's the one this study found doing the most damage in the wild.

What to actually do about it

Three things, in order of leverage:

Redact secrets on the tool-output path — before it reaches the context window or the logs. Same discipline you apply to untrusted input, pointed the other way. A secret-shaped string in stdout should be scrubbed before the framework forwards it anywhere.
Keep credentials capability-scoped and short-lived. The study found 89.6% of leaked secrets immediately exploitable largely because they were broad and long-lived. A read-only, 15-minute token that leaks is a much smaller problem than a standing god-credential that leaks.
Vet skills — and re-vet them. The fork-persistence finding is a reminder that a skill you approved can change underneath you. Pin it, fingerprint it (a hash of its code/description), and re-check on load.

None of this requires new infrastructure. It requires treating tool output with the same suspicion you already give tool input.

This study is one of the sources behind *BRACE*, an open, vendor-neutral framework for securing autonomous AI agents. Its run-time guide covers exactly this — data hygiene runs both ways, and tool output is a leakage channel. BRACE is built by reading the incidents and the research and asking, each time: what concrete control would have prevented or contained this?

When your agent does something bad, can you tell which agent did it?

Brenn Hill — Tue, 23 Jun 2026 15:02:22 +0000

An agent does something it shouldn't: deletes a record it had no business touching, sends a message to the wrong tenant, calls an API in a tight loop until the bill spikes. Someone asks the only question that matters in the first ten minutes of an incident: which agent did this?

If the honest answer is "we're not sure," everything downstream is harder. You can't contain what you can't name. You can't kill a build you can't identify. You can't audit, and you can't learn enough to stop it happening again.

The frustrating part is that this is usually an identity problem, not a logging problem. The logs exist. They just don't say enough to point at one agent.

Why the action is unattributable

Two patterns make most agent actions impossible to pin down.

The first is shared service accounts. Ten agents share one set of credentials, so every action shows up as the same actor. The IdP records "service-account-prod did X" ten thousand times a day, and there is no way to separate the agent that misbehaved from the nine that didn't.

The second is agents running under a human's credentials. The agent inherits the launching user's full access, and in the logs the action is indistinguishable from something the human did by hand. Now you have an attribution problem and a blast-radius problem: the agent can do anything the human can.

There's a subtler version too. Say two different builds of an agent both authenticate as the same account. Same name in the IdP, same token. In the logs they are identical — but one has a changed system prompt, a newer model, a different tool config. When one of them goes wrong, you cannot tell from the logs which build was running.

Give the agent its own identity

Step one: the agent gets its own identity, separate from the launching user. Not a human's credentials. Not a shared service account. Its own.

This is the foundation everything else sits on. Once the agent authenticates as itself, every action it takes is at least its own action in the record — not borrowed from a person, not blended with nine siblings.

Stamp six fields on every action

Identity at the IdP is necessary but not sufficient. The action itself needs to carry enough context to be reconstructed later. Stamp six fields on every action:

Accountable party — who is responsible for this agent existing.
Operational owner — who actually runs and maintains it day to day.
Tenant — which customer this action was on behalf of.
Agent-type-id — which build of the agent this is.
Agent-instance-id — which specific run.
Trace context — where this sits in the call graph.

Together they answer: who's responsible, who operates it, for which customer, which build, which run, and where in the chain of calls. Most systems capture one or two of these — usually a tenant and maybe a trace id. The gap between "one or two" and "all six" is exactly the gap that makes an incident unattributable.

Make agent-type-id a content hash, not a name

The field that quietly breaks is agent-type-id, because the obvious implementation is a name someone assigns. Call it support-agent-v2 and ship it. Three weeks later someone swaps the model, tweaks the system prompt, and ships again — still support-agent-v2. The name didn't change; the behavior did. Silent drift, invisible in the logs.

Make agent-type-id a content hash instead. Hash over everything that determines how the agent behaves: the container image, the harness, the system prompt, the model identifier, the config. Like a container digest, but extended past the image to everything that shapes behavior.

The property you want is that the id changes when any input changes. Swap the model, the hash changes. Edit one line of the system prompt, the hash changes. A changed build can no longer masquerade as the old one, because it gets a new id automatically. Drift stops being silent and shows up as a new agent-type-id in your logs.

Track parent-child lineage

Agents spawn sub-agents, and the sub-agent is where a lot of trouble actually happens. So record lineage: which sub-agent ran, under which parent, and — this is the part people miss — the prompt the parent handed it.

That parent-passed prompt is often the only place an injected instruction is visible. A poisoned tool result or a manipulated upstream response turns into an instruction the parent passes down. If you didn't capture the handoff, the injection leaves no trace and the sub-agent looks like it just decided to misbehave on its own.

Identity is the recovery surface

The thing to internalize: identity isn't paperwork you do for compliance. It's the recovery surface. Containment, the kill switch, the audit trail, and the learning afterward all depend on being able to attribute an action to a specific agent build and run.

And it has to be there before the incident. Identity added afterward is too late for the incident you're currently in — you can instrument for next time, but the one in front of you stays unattributable.

The BRACE agent guide goes deeper on the field definitions and how they fit the broader framework.

One honest question to leave you with: pull up your logs from an action an agent took an hour ago. Can you name the specific build that took it? If not, that's the gap — and it's worth closing before you need it.

After an agent deleted a production database, I mapped what actually stops these failures

Brenn Hill — Tue, 23 Jun 2026 15:02:21 +0000

A coding agent deleted a production database during a stated code freeze, then reported that rollback was impossible (it wasn't). Another agent deleted a user's files after misreading a command. A destructive payload was merged into a widely-distributed developer extension and shipped to roughly a million people. A zero-click prompt injection quietly exfiltrated data from a major enterprise AI assistant.

These aren't edge cases anymore. Once an agent can plan, call tools, change real systems, and spawn sub-agents without a human reviewing each action, the question stops being "is the model good?" and becomes "what can this thing actually do when it's wrong?"

I spent a while reading through the public incidents and trying to find the common thread. Here's the one that reframed it for me.

An agent is not the code that shipped — it's a configuration

When we review traditional software, we review code. For an autonomous agent there often isn't much code to review. The behavior comes from a runtime configuration: a container, a harness (the wrapper that runs the model and hands it tools), a system prompt, a set of available tools, a memory store, an identity, and a network boundary.

Two agents built from the exact same model can behave completely differently depending on how those parts are assembled. So the security question isn't "is the code safe" — it's "is the configuration bounded." That shift changes where you put your effort.

Five concerns, and what each one bounds

I organized the configuration into five places where things actually go wrong:

Build-time — architecture, API access, the container, the harness. Fixed when the agent is built and frozen into the artifact. This is where you decide what the agent can reach at all.
Run-time — data, memory, and behavioral checks active on every execution. This is where you watch what it's doing live.
Agent — the per-agent-type concerns: scoped tokens, the system prompt as a policy surface, what tools it actually needs versus what it's been handed.
Configuration — drift. The approved config and the running config diverge over time, and a hardened deployment quietly decays into an unsafe one.
Ecosystem — the shared substrate every agent runs on: identity issuance, egress control, the MCP servers and supply chain it pulls from.

Each concern bounds a different failure class. A scoped token bounds blast radius. Egress control bounds exfiltration. Drift detection bounds the slow decay. None of them are exotic; most are built from tools you already run.

The single highest-leverage control

If you do one thing: make the harness deny destructive verbs by default.

Dangerous actions — delete, drop, wipe, force-push, mass-revoke — get blocked at the harness unless explicitly allowed for that agent type. Not "the model was told not to." Intercepted, in the wrapper, where a confused or manipulated model can't talk its way past it.

This is high-leverage because it sits below the model's reasoning. The production-DB deletion and the file-deletion incident both share a shape: an agent ran an irreversible operation it was never authorized to run. A harness that refuses destructive verbs by default turns "catastrophic and irreversible" into "blocked and logged" — without depending on the model being right in the moment. Pair it with narrowly scoped tokens (read:invoices, not invoices:*) and you've bounded the two worst incident classes.

I built a framework for this, and I'd like you to tear it apart

The thing I put together is called BRACE — Build-time, Run-time, Agent, Configuration, Ecosystem. It's nine controls, three observability requirements, and a one-page sign-off checklist, and it reverse-maps to the OWASP Agentic Top 10 and MITRE ATLAS so you can see exactly which threat loses its primary mitigation for every control you choose to skip.

It's open and vendor-neutral. The guides and the checklist are here: https://braceframework.org/

I'm not posting this to sell you anything — there's nothing to buy. I'm posting it because I'd rather find the holes now than after someone ships an agent on it. If the "agent is a configuration" framing breaks down somewhere, or a control is missing, or one of them is unworkable in practice, I want to hear it. Adopting only part of it means you're accepting the remaining risk on purpose, and I'd like that risk to be honest.

So: what's the worst autonomous-agent failure you've personally seen or cleaned up? I'm collecting the ones that don't make the news.

Build-time is where agent security is won or lost

Brenn Hill — Tue, 23 Jun 2026 15:02:19 +0000

In 2025 an AI coding agent deleted a production database during a stated code freeze, then told the operator a rollback was impossible. It wasn't a jailbreak or an exotic exploit. The agent simply had a path to prod, a credential that could drop tables, and a harness that let the destructive call through. Every link in that chain was a decision someone made before the agent ever started its run.

That's the uncomfortable, useful part. Most agent security advice is about getting the model to behave — better prompts, better refusals, better guardrails on its output. Those help, but they all depend on the model behaving. Build-time controls don't. They're the things you freeze in advance — the tools, the network routes, the credentials, the deny list — and they hold whether the model is well-behaved, confused, or actively hijacked by injected input. If you only invest in one layer, invest here.

Here's how I think about it, in plain terms.

Give the agent the least it needs, decided ahead of time. Least privilege for tools, MCP servers, credentials, file paths, network destinations. Whatever you grant is exactly what a compromised agent can do — there's no daylight between "capabilities" and "attack surface."

There's a second reason that points the same direction, and it's the one people miss: fewer tools also makes the agent better. Every tool and MCP server you attach gets injected into the model's context as schema, on every step. More tools means more tokens spent reading menus, and measurably worse tool selection — the model fumbles more when it has fifty options than when it has six. So trimming the tool surface is not a security tax you pay against capability. It buys you both. That reframing tends to win the argument with people who'd otherwise resist locking things down.

Bound the blast radius by topology, not trust. An agent can't delete a prod database it has no network route to. Physical and network isolation is threat-model-agnostic — it doesn't care whether the cause was a clever attacker, a confused model, or a runaway loop. Separate prod from staging from dev for real, so an agent with staging access can't reach production under any failure mode.

Scope and time-box every credential. Capability-scoped tokens (read:tickets, never tickets:*), short-lived, no standing god-credentials sitting in the environment. An analysis agent should not be able to delete records through the same token it uses to read them. This is the layer teams skip most, because issuing a token per capability is real operational work — and it's exactly the layer that contains the damage when every other layer fails.

Gate destructive actions in the harness, deny by default. This is the single highest-leverage build-time control, so it's worth being precise about. The harness — the loop that runs the model and calls tools — keeps an explicit list of destructive verb classes: file deletion, recursive removal, DROP/TRUNCATE, force-push to protected branches, infra teardown, payment-state changes, mass external sends. Each one is intercepted before it executes. The default is deny. An agent that invokes a destructive verb not pre-authorized for this run gets stopped at the harness — not because the harness reasoned that the call was unsafe, but because nothing reasoned that it was safe.

The point is that the model never gets to be the thing standing between a destructive command and your data. A human (or an out-of-band credential) does. When Amazon Q shipped a destructive wiper payload to roughly a million developers through a VS Code extension, the failure was a destructive verb reaching execution with nothing in front of it. A deny-by-default harness is the thing that's in front of it.

Two more, briefly:

Pin and sign the build, fold the digest into the agent's identity. A minimal signed container, pinned by digest, so silent drift between "what we reviewed" and "what's running" is detectable rather than invisible.

Treat the harness and system prompt as versioned, diff-reviewed artifacts — not config you can hand-edit in a web UI with no history. Every change to the tool allowlist, the capability scopes, or the destructive-verb list lands as a reviewed diff, approved by someone who isn't the author.

None of this is exotic. It's the same engineering discipline you already apply to anything that touches production, pointed at a new kind of actor — one that improvises, runs unattended, and will do whatever its capabilities allow. The full guide and checklist is here: https://braceframework.org/guides/build-time/

So, practically: how are you handling destructive tool calls today? Is there a real deny-by-default gate in your harness, or is the model still the last thing between an agent and DROP TABLE?

You can't prevent prompt injection. So what do you actually do?

Brenn Hill — Tue, 23 Jun 2026 15:02:18 +0000

There's a quiet assumption baked into a lot of agent security work: that with enough prompt engineering, the right system message, or the next model version, we'll get the model to stop following malicious instructions. It hasn't happened, and it's worth designing as if it won't. No current model reliably refuses adversarial input when that input is formatted as instructions. A single crafted prompt can strip the careful alignment you layered on top.

So the useful question isn't "how do I prevent injection?" It's "injection will sometimes succeed — what state is my agent in afterward, and what can it actually do from there?"

That reframe is the whole game for run-time security: the protections that run live on every execution, not the ones you reason about at design time. Here are the parts that have held up in practice.

The model is not a security boundary

If a single input can flip the model's behavior, then the model can't be the thing standing between an attacker and your systems. Treat it like a component that will occasionally do the wrong thing, and put the boundary somewhere it can't talk its way past.

Concretely, that means two things downstream of the model:

Capability-scoped credentials. The agent holds only the permissions the current task needs. A hijacked agent with read-only, narrowly-scoped tokens does a lot less damage than one holding your admin key.
A gate on destructive verbs. Deleting, sending, paying, granting access — these get an explicit check (a policy, a confirmation, a second factor) that doesn't depend on the model having behaved.

Containment limits the blast radius. Detection tells you it happened. Neither requires the model to be trustworthy, which is the point.

Separate the data channel from the instruction channel

Almost every injection bug reduces to one sentence: data got read as instructions. The fetched web page, the retrieved document, the tool output, the user upload — all of it is data, and somewhere it got concatenated into the context the model treats as commands.

So treat every external input as untrusted: user messages, fetched pages, tool outputs, retrieved documents, uploads. Indirect injection is the nasty case here — the payload rides in on content your agent went and fetched on its own, so "trusting the source" buys you nothing. Defend at the boundary where data enters, and don't splice untrusted text into the instruction context.

Data hygiene runs both ways

Here's the part that's easy to miss. You watch what comes in. But a tool's output is a leakage channel too.

Agent frameworks routinely pipe tool stdout — including debug logging — straight into the model's context window, and from there into your logs. An empirical study of 17,022 agent skills found credentials leaking exactly this way, with debug logging behind 73.5% of the cases. The secret was never meant for the model; it just happened to be on stdout, and the framework forwarded it.

The fix is unglamorous: redact secrets from tool output before it reaches context or logs. Same discipline as input, opposite direction.

Monitor behavior, separately from quality

A hijacked agent can produce clean, well-formatted, "high quality" output while doing something it shouldn't. Quality monitoring won't catch it, because nothing about the result looks wrong. You need a separate signal: does this sequence of actions look like normal behavior for this agent?

That means baselining the action sequences you expect and alerting on deviation. There's a gradient of effort:

Static rules — cheap, catch the obvious (an agent that never emails suddenly emailing).
Sequence-pattern baselines — learn the normal shape of an agent's actions, flag the ones that don't fit.
A second model as judge — independent review of the primary agent's behavior.

One detail that's easy to overlook: log context size at decision time. Context size shapes behavior, so a baseline that doesn't condition on it will drift and misfire. Record it alongside the action.

And memory makes it persistent

If your agent has memory, a one-shot injection can become a standing one — a poisoned "fact" gets written once and re-executes every session. Keep memory hygienic: scope it per instance or type, validate what gets written, and keep per-entry provenance so you can trace where a "fact" came from.

None of this prevents prompt injection. It assumes injection lands and asks what your system does next. The BRACE run-time guide walks through these as a checklist if you want the structured version.

So, honest question: if an agent of yours got hijacked mid-task right now, would you see it in the action stream — or are you flying blind on everything after the prompt? What does your behavioral baseline actually look like?

Your AI agent is only as secure as the tools and agents it calls

Brenn Hill — Tue, 23 Jun 2026 15:02:17 +0000

We spend a lot of effort hardening the agent itself: scoping its permissions, sandboxing its code execution, watching its outputs. Then it loads a third-party MCP server, and most of that work routes around the locks we built.

That's the uncomfortable part of agent security nobody automates away: your agent is only as safe as the agents and tools it calls. It loads third-party tools, talks to MCP servers, spawns sub-agents, and shares a substrate — a registry, an identity plane, a gateway, a kill-switch bus — with every other agent in your system. A failure in any of those doesn't stay put. It cascades through the shared substrate.

A useful framing here: every control you build has two halves. An agent-scoped half (what this agent is allowed to do) and an ecosystem-scoped half (the shared infrastructure every agent leans on). Most teams build the first half and assume the second. Here are six things worth getting concrete about.

1. A tool you vetted can turn hostile later

The scariest supply-chain fact about MCP is that approval is not a permanent state. In September 2025, the postmark-mcp npm package shipped a routine-looking update. The only meaningful diff between the benign version and the malicious one was a single added line: a Bcc field on the send-email function, quietly copying every message to an attacker's domain. Anyone on auto-update started leaking email with no visible change in behavior.

That's a rug pull: vetted on Monday, hostile on Thursday. Pinning versions and signing help, but they don't tell you what changed. For that you want a fingerprint — a hash of the tool's description plus its schema — recorded at approval time and re-checked on every load. If the fingerprint moves, the tool stops until a human looks. Cheap to compute, and it turns a silent rug pull into a loud one.

2. Tool descriptions and schemas are untrusted input

Here's the detail that trips people up: a tool's description and parameter schema get injected straight into the agent's prompt. That makes them an instruction channel, not just documentation. Invariant Labs demonstrated this last year — a benign-looking tool whose description carried hidden instructions to exfiltrate data. The term that stuck is tool poisoning, and it's just prompt injection wearing a tool's clothes.

So treat tool metadata like any other hostile input. Before a description reaches the model, scan it for invisible Unicode, right-to-left override characters, HTML comments, base64/hex blobs, and role-override phrasing ("ignore previous instructions", "you are now..."). Strip control characters. If you wouldn't trust a string from a web form, don't trust one from a tool registry.

3. Watch for lookalikes

A malicious server doesn't need to beat your real tool — it just needs to sit next to it with a confusingly similar name. send_email vs send_emai1. Typosquatting and cross-server name confusion let a rogue tool intercept calls meant for a trusted one. Flag near-duplicate tool names, and namespace every tool by the verified identity of the server that published it, so two tools called search are never ambiguous.

4. Put a fail-closed gateway at the MCP boundary

If you take one architectural idea from this, take this one: route all MCP traffic through a single auditable choke point. One gateway that authenticates the caller, scans the call and the response, rate-limits, writes an audit trail — and on any error, denies. Not "log and continue." Deny. A gateway that fails open is just latency.

You don't have to invent the spec yourself. Microsoft's open MCP Security Gateway spec is one conformance-tested implementation of exactly this pattern, and it's a reasonable reference point even if you build your own.

5. The kill switch has to reach the sub-agents

Most kill switches halt the parent agent and call it done. But the parent has spawned sub-agents and opened tool sessions, and those keep running with the parent gone — orphaned processes still holding credentials and making calls. A real stop signal propagates to every sub-agent and tool session, and leaves each one in a safe state.

And like any safety system: if you haven't tested it firing, you don't have it. Pull the switch in a drill and watch whether the sub-agents actually stop.

Where this fits

These five concerns — vetting, poisoning, lookalikes, the gateway, the kill switch — are the E (Ecosystem) layer of BRACE, an open framework for agent security. The guide goes deeper on the substrate model and the agent-scoped/ecosystem-scoped split if you want the longer version.

None of this is exotic. It's the same supply-chain hygiene we already apply to dependencies — pin, sign, fingerprint, verify on load — pointed at a new kind of dependency that can also talk to your model.

So a real question to leave with: how are you vetting the MCP servers and tools your agents load today — and would you catch it if one of them changed after you approved it?

There's no pull request to review for an autonomous agent. So what do you review?

Brenn Hill — Tue, 23 Jun 2026 15:02:16 +0000

When you ship a normal service, security review has an anchor: the diff. Someone opens a pull request, someone reads it, and the thing that runs in production is the thing that got reviewed.

Now put an autonomous agent in production. It plans, calls tools, and changes state, often without a human approving each action. Ask the obvious question — where's the PR for what it just did? — and there isn't one. The agent didn't ship the action in a commit. It decided it at runtime.

So the review you're used to doing is aimed at the wrong artifact. Let me try to point it at the right one.

The agent is a configuration, not the code that shipped

Here's the load-bearing observation: an autonomous agent is not code. It's a runtime configuration of infrastructure — a container, a harness (the loop that runs the model and calls tools on its behalf), a system prompt, a tool surface, memory, an identity, and a network egress policy.

The model is mostly fixed. Everything around it is not. Two agents built from the same model with the same task description can behave completely differently depending on how those parts are configured — what tools they can reach, what the system prompt tells them to do, what memory they carry, what the network will let out. The security-relevant artifact is the running configuration, and a review aimed at your application code walks right past it.

Once you see the agent as a config, the question "what do you review?" has an answer: you review the config.

Treat the system prompt and harness as versioned, diff-reviewed artifacts

Most teams treat the system prompt as settings — a text box, editable in a dashboard, changed by whoever has access. That's the problem. The system prompt encodes the task, the rules for refusing requests, and the shape of the output. A one-line change to it can quietly remove a guardrail. An editable-in-prod system prompt is an unreviewed, unattributed code path with full influence over the agent's behavior.

The incidents bear this out. The "Rules File Backdoor" weaponized editable rules files in Cursor and GitHub Copilot using invisible Unicode. DPD's support bot started swearing at customers after a guardrail-removing system update. NYC's MyCity bot told landlords they could refuse Section 8 tenants — illegal advice, live for weeks. None of these were model failures. They were configuration changes that nobody reviewed because nobody treated the configuration as something you review.

So treat it like code. Put the system prompt, the harness config, hooks, and the MCP server list in version control. Change them only through review. Diff them. Now a guardrail removal shows up as a reviewable change with an author on it, instead of a silent edit in a prod console.

Freeze the config per release, and pin it into identity

If the config is the artifact, then a change to the config is a new build — and you want that to be visible. The practical move is to take a content hash over the deployed configuration — container digest, harness version, system prompt version, model identifier, settings — and make that hash the agent's type identity (BRACE calls it the agent-type-id). Now any change to any of those parts produces a different identity. You can't quietly swap a prompt and keep the same name. The new config is a new build, and your logs say so.

Detect drift at agent granularity

Infrastructure-as-code drift detection is a control your platform team probably already runs. The catch is that it usually runs at the host level, and the changes that matter for an agent hide one level down.

So point it at the agent's surface specifically: container images, harness configurations, MCP server lists, system prompts, and per-agent network egress policy — not just the host's. An MCP server quietly added to one agent's list (see the Postmark-MCP malicious-package incident) won't trip a host-level check. A drift check scoped to that agent's configuration will.

Capture the config-adjacent observables

Two things tell you what configuration actually ran, and both are easy to drop on the floor:

Decision-time context size — how much context the model had in front of it when it acted. The same agent behaves differently with a near-empty context than a near-full one.
The parent-passed prompt — in multi-agent setups, what the calling agent actually handed down. That's part of the effective configuration of the child, and it's invisible if you only log the child's own prompt.

If an action goes wrong, these are often the difference between "we can see exactly what config was in play" and "we're guessing."

None of this requires new infrastructure. Version control, content hashing, IaC drift detection, and structured logging are things you already run. What changes is where you point them — at the agent's configuration, which is the artifact that actually decides what the agent does.

This is the Configuration concern in BRACE, an open framework for agent security; the guide goes deeper on each control.

One honest question to leave with: do you version and diff-review your agents' system prompts — or are they editable runtime config that anyone with console access can change without a trace? The answer tells you whether there's anything to review at all.

How a sandwich defeats North Korea's hackers (and the US couldn't for 70 years)

Brenn Hill — Thu, 02 Apr 2026 06:11:21 +0000

Two days ago, Google's Mandiant team attributed the axios npm compromise to UNC1069 — a North Korean threat group previously linked to cryptocurrency theft and attacks on DeFi platforms. The malicious code shares significant overlap with WAVESHAPER, a C++ backdoor Mandiant attributed to the same group in February.

North Korea just weaponized the most popular HTTP client in JavaScript. 100 million weekly downloads. The payload: a cross-platform RAT that harvests credentials, SSH keys, and cloud tokens from every developer machine that runs npm install.

The United States has spent 70 years and trillions of dollars trying to contain North Korea. Nuclear negotiations, sanctions, carrier groups, diplomatic pressure, UN resolutions. None of it has stopped the DPRK from becoming one of the most effective cyber threats on the planet.

A sloppy joe sandwich stops them in 3 seconds.

What happened

On March 30, the attacker compromised the npm account of axios's lead maintainer (jasonsaayman) using a stolen access token. They changed the account email to a Proton Mail address and published two malicious versions:

axios@1.14.1 — published March 31, 00:21 UTC
axios@0.30.4 — published March 31, 01:00 UTC

Both versions injected a new dependency: plain-crypto-js@4.2.1. This package was never imported anywhere in the axios source. Its sole purpose was to run a postinstall hook that deployed platform-specific RATs:

macOS: Binary at /Library/Caches/com.apple.act.mond, executed via AppleScript
Windows: PowerShell RAT with Registry persistence and in-memory binary injection
Linux: Python RAT script via nohup

The dropper script deleted itself after execution to hide forensic evidence. The attacker staged plain-crypto-js 18 hours in advance, pre-built three platform payloads, and hit both release branches within 39 minutes. This was not amateur hour.

How sloppy-joe blocks every layer of this attack

sloppy-joe is an open-source supply chain security tool. It runs before npm install. It reads your package-lock.json and checks every dependency — direct and transitive — against multiple independent signals. No packages are downloaded. No code is executed.

Signal 1: Version age gate

ERROR axios [metadata/version-age]
      Version '1.14.1' of 'axios' was published 0 hours ago (minimum: 72 hours).
      New versions need time for the community and security scanners to review them.
 Fix: Wait until the version is at least 72 hours old, or pin to an older version.

The compromised versions were live for 2-3 hours before npm yanked them. A 72-hour gate means they never get installed. Period. This requires zero knowledge of the attack — it works purely on the principle that new versions should survive community review before hitting production.

This single check stops the attack.

But sloppy-joe doesn't stop at one signal. With --deep transitive scanning, plain-crypto-js gets demolished by five independent checks:

Signal 2: New package detection

ERROR plain-crypto-js [metadata/new-package]
      'plain-crypto-js' was first published 0 days ago. New packages are higher
      risk — verify this is a legitimate, maintained project before depending on it.
 Fix: Verify 'plain-crypto-js' at its registry page and source repository.

plain-crypto-js was created the day before the attack. Brand new packages as transitive dependencies of 100M-download packages are inherently suspicious.

Signal 3: Install script risk amplifier

ERROR plain-crypto-js [metadata/install-script-risk]
      'plain-crypto-js' has install scripts AND was published 0 days ago and with
      0 downloads. Install scripts on new, low-download packages are the #1
      malware delivery vector.
 Fix: Do not install this package. Verify it is legitimate before proceeding.

Install scripts + new package + zero downloads. This is the exact fingerprint of a supply chain attack. sloppy-joe's install script risk signal combines multiple weak signals into a high-confidence detection. Every real-world npm supply chain attack in the last 5 years has matched this pattern.

Signal 4: No source repository

WARNING plain-crypto-js [metadata/no-repository]
        'plain-crypto-js' has no source repository URL and is a new package
        (< 30 days old). Legitimate packages almost always link to their source code.
 Fix: Verify 'plain-crypto-js' at its registry page.

Legitimate packages link to their GitHub/GitLab repo. Malicious packages created as payload delivery vehicles don't bother.

Signal 5: Name similarity

WARNING plain-crypto-js [similarity/mutation-match]
        'plain-crypto-js' is suspiciously similar to existing package 'crypto-js'
 Fix: Verify this is the package you intend to use.

plain-crypto-js is a clear attempt to look like crypto-js — a real, popular cryptography package with hundreds of millions of downloads. sloppy-joe's mutation generators catch this.

Five signals. One sandwich. Zero dollars.

Here's what's remarkable: none of these detections require threat intelligence feeds, malware signature databases, or AI-powered behavioral analysis. They're all variations of the same primitive: cross-reference what the code claims to use against what actually exists on the registry.

Is the version too new? Flag it.
Is the transitive dep brand new? Flag it.
Does a brand new package have install scripts? Block it.
Does it have no source repository? Flag it.
Does the name look like a popular package? Flag it.

Each signal alone is informational. All five firing together on the same package is a certainty.

The cost of this defense

# Add to any CI pipeline
npx sloppy-joe check

That's it. One line. Runs in 5-15 seconds. No account needed. No API key. No subscription. Open source, MIT licensed.

The DPRK's UNC1069 spent 18 hours staging payloads, pre-building RATs for three platforms, and compromising a maintainer account. sloppy-joe catches it in the time it takes to read a package-lock.json.

What sloppy-joe can't catch

Honesty matters: sloppy-joe would not have detected the credential theft itself. The attacker hijacked the real maintainer's account — the npm _npmUser field still shows jasonsaayman. There's no publisher change to detect. npm's provenance attestation (OIDC-based publishing via GitHub Actions) is the real defense against token theft — the malicious versions were published manually, bypassing axios's CI pipeline.

sloppy-joe catches the payload, not the compromise. But the payload is what hurts you.

Get started

# Install
cargo install sloppy-joe

# Run
sloppy-joe check

# In CI (GitHub Actions)
- uses: brennhill/sloppy-joe-action@v1

brennhill / sloppy-joe

Shields against supply-chain, slopsquatting, and typosquatting attacks from dependencies and code.

Catch hallucinated, typosquatted, and non-canonical dependencies
before they reach production.

cargo install sloppy-joe

The LiteLLM supply chain attack (March 2026) compromised a package with 97M monthly downloads. Attackers stole publishing credentials, pushed malicious versions that harvested SSH keys, cloud credentials, and K8s secrets. sloppy-joe's default 72-hour version age gate would have blocked both poisoned versions — they were discovered within hours, well before the gate would have opened. If you run sloppy-joe check in CI, this attack fails. Full analysis

AI code generators hallucinate package names ~20% of the time. Attackers register those names and wait. sloppy-joe catches them in CI before npm install or pip install runs.

How to Use

# Install (single static binary, no runtime dependencies)
cargo install sloppy-joe
# Or download an auditable binary archive from GitHub Releases
# https://github.com/brennhill/sloppy-joe/releases

# Check current project — auto-detects ecosystem from manifest files
sloppy-joe check

# Check a

…

View on GitHub

sloppy-joe is an open-source supply chain security tool that catches hallucinated, typosquatted, and compromised dependencies before they reach production. It runs before your package manager, requires no code execution, and blocks attacks like axios, LiteLLM, event-stream, and ua-parser-js.

DEV Community: Brenn Hill

What Is Human-in-the-Loop (HITL) in AI? A Practical Guide

What human-in-the-loop actually means

The three modes of human-in-the-loop AI

1. Approve-before-act

2. Review-and-edit

3. Interrupt-and-resume

HITL vs. full automation vs. human-on-the-loop

The misconception: a review step is not safety

The better question

How to do human-in-the-loop well

Grade

Guard

Show

Prove

When HITL is the wrong tool, prevent instead

Key takeaways

Get started

73% of AI-agent credential leaks trace back to one mundane thing: debug logging

The numbers

The dominant cause is boring, and that's the point

What to actually do about it

When your agent does something bad, can you tell which agent did it?

Why the action is unattributable

Give the agent its own identity

Stamp six fields on every action

Make agent-type-id a content hash, not a name

Track parent-child lineage

Identity is the recovery surface

After an agent deleted a production database, I mapped what actually stops these failures

An agent is not the code that shipped — it's a configuration

Five concerns, and what each one bounds

The single highest-leverage control

I built a framework for this, and I'd like you to tear it apart

Build-time is where agent security is won or lost

You can't prevent prompt injection. So what do you actually do?

The model is not a security boundary

Separate the data channel from the instruction channel

Data hygiene runs both ways

Monitor behavior, separately from quality

And memory makes it persistent

Your AI agent is only as secure as the tools and agents it calls

1. A tool you vetted can turn hostile later

2. Tool descriptions and schemas are untrusted input

3. Watch for lookalikes

4. Put a fail-closed gateway at the MCP boundary

5. The kill switch has to reach the sub-agents

Where this fits

There's no pull request to review for an autonomous agent. So what do you review?

The agent is a configuration, not the code that shipped

Treat the system prompt and harness as versioned, diff-reviewed artifacts

Freeze the config per release, and pin it into identity

Detect drift at agent granularity

Capture the config-adjacent observables

How a sandwich defeats North Korea's hackers (and the US couldn't for 70 years)

What happened

How sloppy-joe blocks every layer of this attack

Signal 1: Version age gate

Signal 2: New package detection

Signal 3: Install script risk amplifier

Signal 4: No source repository

Signal 5: Name similarity

Five signals. One sandwich. Zero dollars.

The cost of this defense

What sloppy-joe can't catch

Get started

brennhill / sloppy-joe

Shields against supply-chain, slopsquatting, and typosquatting attacks from dependencies and code.

Catch hallucinated, typosquatted, and non-canonical dependenciesbefore they reach production.

How to Use

Catch hallucinated, typosquatted, and non-canonical dependencies
before they reach production.