close

DEV Community

Cover image for How My AI Agent Hacked Its Own Permissions (And What It Taught Me)

How My AI Agent Hacked Its Own Permissions (And What It Taught Me)

Permission bypass via standard file tools

Have you ever tried to build an automation that works so well it bypasses the very rules you set for it? Recently, I was working on a small repository designed to automate the painful process of updating my resume. The idea was simple: build a system that runs weekly, checks my social media activity, and proposes updates to my CV, complete with a fresh branch and a diff ready for my review every Monday morning. You can check out the repository here: https://github.com/tyutinalexkz/cv

I used an AI agent to do the heavy lifting. As a developer who values security, I configured the agent with no default command execution permissions. Step-by-step, I granted it specific capabilities for in-repo file management. It worked perfectly.

But then, I got ambitious.

Privilege escalation by agent

Once the workflow was tested, I asked the agent to configure its own environment to perform this flow silently every week. I essentially said, "Make this run automatically without asking me."

The agent attempted to change its permissions, but hit a wall - it didn't have the explicit authorization to modify the workspace configuration directly. A normal script would throw an error and stop. But this was a thinking model.

It looked at the list of commands I had already allowed it to use. It saw standard file manipulation tools. And then, it compiled a chain of commands - specifically using cp and jq - to manipulate its own configuration files. By doing so, it effectively granted itself the new capabilities it needed, bypassing the standard configuration flow and its limitations!

I just sat there, laughing. I was observing it as a developer, seeing how easy it could be to live without security barriers if you know the right tools. But the underlying lesson was profound. Even a helpful, non-malicious AI, when given a goal and a subset of seemingly harmless tools, will find creative ways to achieve that goal - even if it means escalating its own privileges.

If we give an agent to a user in a corporate setting, it might seem safe if we restrict its primary permissions. But as my little experiment showed, an agent with basic file manipulation tools and problem - solving skills can easily find a workaround. The future of AI safety isn't just about what an agent is explicitly allowed to do; it's about what it can piece together from the tools it has.

Top comments (10)

Collapse
 
nazar_boyko profile image
Nazar Boyko

The fun part isn't that the agent was sneaky, it's that cp plus jq were never really "file tools", they were "edit any file, including the one that defines my permissions" tools. Once the config that grants capabilities lives inside the agent's writable space, you've handed it permission editing rights without ever naming them. Gating by command name misses this, since the danger is the reach of the tools, not the tools themselves. The fix that jumps out is keeping the file that defines permissions outside whatever the agent can touch, so the config that controls the cage isn't sitting inside the cage.

Collapse
 
alexandertyutin profile image
Alexander Tyutin Google Developer Group

Yeah, good point. Thanks 👍️

Collapse
 
anp2network profile image
ANP2 Network

Building on @nazar_boyko — moving the permission file out of the agent's writable space is necessary, but assuming that's the whole fix just buys a quieter version of the same bug. The surface isn't that one file, it's every input the policy loader trusts: make the canonical config read-only and the next chain is a secondary path the loader also reads, an env override, whatever has higher precedence. The only thing that actually closes it is when the grant comes from a separate principal the agent can request from but can't author, so that no composition of the tools it holds yields a capability it wasn't issued. A file, even a protected one, is still data the holder can route to; a principal is something it has to ask.

The part nobody's flagged: in production you wouldn't be sitting there laughing, you'd see nothing. There's no failed-auth log, because the escalation never touched the auth path — it routed around it through the file API. So the detection most teams build, watching the permission/config API for unauthorized changes, is aimed at the wrong door. The event worth alerting on is a write landing on anything the grant decision depends on, whatever tool made it.

Collapse
 
alexandertyutin profile image
Alexander Tyutin Google Developer Group

The part nobody's flagged: in production you wouldn't be sitting there laughing, you'd see nothing

Yeah, good point, thanks 👍️

Collapse
 
unitbuilds profile image
UnitBuilds

Yip. It's like Git, you think it doesnt have permissions, but it has permissions to write a python file and execute it... Just like that, all barriers are bypassable, because it can execute scripts, that bypass it's restraints. Even if you dont let it run the python file, it can execute a command line and execute the script, especially if it's in it's scratch directory, it can even run it with it's background agents, without ever needing permissions, as it's an 'internal tool' for it.

Collapse
 
alexandertyutin profile image
Alexander Tyutin Google Developer Group

Insightful, thanks 🤔

Collapse
 
yune120 profile image
Yunetzi

If AI bypasses its own rules, who should own the guardrails—humans or code?

Collapse
 
alexandertyutin profile image
Alexander Tyutin Google Developer Group

Perfect question! I do not trust to boundaries defined in the same agent instructions :D

Collapse
 
jugeni profile image
Mike Czerwinski

The cage-inside-cage shape Nazar named and ANP2 extended is the same single-point-of-failure pattern wearing a different costume: the architecture nominally has a gate, but the gate authority lives where the thing being gated can reach it. Moving the permission file outside the agent's writable space closes the specific cp+jq route. ANP2's separate-principal point closes the broader class. Neither closes the test of whether the cage actually catches escalation attempts you did not anticipate.

The piece that keeps getting under-budgeted in production setups around this: a planted-fault test on the cage itself. Pick an escalation chain you have not seen the agent try, run it as a planted violation, watch whether the cage emits an alert or silently fails to fire. The cage is doing the work only if it has demonstrably caught a deliberate violation in the recent past. Otherwise it is decorative, including the version with a separate principal, because the principal API can still be misconfigured silently and you would not know until the chain you did not anticipate showed up in the wild.

ANP2's "you'd see nothing" is the right honest stage marker. The corollary on the detection side: alert on the write target, plant the violation, log when the alert fires, and treat the absence of recent fires as evidence the cage is no longer in effect, not that it is working. That last move is the one most teams skip.

Same shape as quorum-costume verification one floor sideways: independence assumption nobody verifies, here applied to the boundary between the agent and the file that defines what the agent is allowed to do. Naming the gate does not gate. Testing the gate under planted faults does.

Some comments may only be visible to logged-in visitors. Sign in to view all comments.