DEV Community: Tessl

AI Agent Governance: 10 Takeaways from Engineering Leaders on Agentic Development

Tessl — Fri, 26 Jun 2026 08:31:25 +0000

Agentic development starts as a productivity story, but at scale it quickly becomes a governance problem.

At AI Native DevCon London, we hosted a set of Chatham House roundtables with senior engineering leaders from a range of organizations. I won’t attribute comments to individuals or companies, but the patterns were strikingly consistent: agentic development is moving from an individual tooling conversation into an enterprise operating model question.

The first wave was familiar enough: devs tried GitHub Copilot, Cursor, Claude Code, Codex, Devin and similar tools, and many found obvious value. They wrote code faster, produced tests faster, explored ideas faster, and in some cases revived work that had been sitting in the backlog because it was too costly to attempt.

The interesting question is what happens once agents stop being a personal accelerator and start touching the way an engineering organization works. At that point, the problem shifts from “does the tool help?” to “can we make this safe, repeatable, measurable, and economically sane?”

That shift is why I think the most useful frame is AI agent governance. It means the systems that let teams move faster without losing control, including identity, permissions, context, evals, model routing, cost visibility, policy, ownership, and feedback loops.

On a side note, you can hear my talk “skills are the new code”, where I share my personal framework towards agent governance and a proposed solution towards enterprise agent enablement.

Watch on YouTube

Let’s now look at the 10 main takeaways from our roundtable.

1. Agent adoption starts with enthusiasm, but scaling it requires deliberate rollout

Most organizations seem to start the same way: give developers access to AI coding tools and let the motivated teams run.

This is the right instinct at the start, because the space is moving too quickly for a purely top-down programme to discover all the useful patterns. Bottom-up energy creates learning quickly. It also surfaces where agents are genuinely useful, rather than where a transformation deck hoped they might be.

But it also creates fragmentation.

Different teams adopt different tools, build different prompts, store skills in different repos, and develop different assumptions about what is safe enough to automate. One group may use agents for test generation, another for code review, another for product specs, another for deployment automation. Before long, the organization can have dozens of useful experiments that don’t yet add up to a system.

The trick is not to kill the experimentation but to create a path from local learning to shared practice.

The first wave of adoption was mostly about individual productivity. The next wave has to be about repeatable, governed team workflows. That means rollout phases, clear ownership, a view of which tools are approved for which classes of work, and a way to convert the best local experiments into standards others can reuse.

This is a familiar pattern from cloud and DevOps: the early adopters prove what is possible, then the platform forms around them. The difference this time is that the cycle is much faster, and the unit being governed is not just infrastructure or code, but the agentic workflow itself.

2. The strongest ROI case is not productivity. It is increased ambition

A lot of the public conversation around AI in software development is still framed around productivity.

Can engineers do the same work faster?
Can teams ship more with the same number of people?
Can the business do the same with less?

Many business leaders will look for savings, and it would be naive to pretend otherwise. It is also worth acknowledging that some of this is hard to say openly in a group setting, however intimate. In practice, some leaders will seek to capitalize on productivity by doing the same work with fewer people, reducing costs, or slowing future hiring.

But the roundtables reinforced a concern I have had for a while: if we hype AI productivity too aggressively, we may slow adoption by making people fear what adoption means.

If the internal narrative is mostly about headcount reduction, people will defend themselves. They may hide the real gains, avoid showing how much faster a workflow became, or keep their best agent patterns private because sharing them feels like making the case for fewer people.

That is not a cultural foundation for transformation. A better frame is ambition.

Agents make prototypes cheaper. They let senior engineers explore ideas that have been trapped behind calendar time. They change the build-versus-buy equation, because a capability that once required an RFP and a vendor project may now be plausible for a small internal team to try.

This is the version of the story that leaders should emphasize publicly and internally. The question should be “what can we now attempt that we previously would not have attempted?”

That framing does not deny the economics but it does point them in a healthier direction. The long-term narrative should not be about lowering the floor, but about raising the ceiling. If AI is understood as a way to increase ambition rather than quietly reduce capacity, more people will lean in, and the organization is more likely to discover the compounding benefits.

3. Why context engineering is becoming a first-class engineering asset

Agents are only as useful as the context they can apply.

That context includes specs, tests, policies, architecture guidance, product requirements, runbooks, coding conventions, incident patterns, security rules, and domain language. Most organizations already have some of this knowledge, but it is rarely as clean or discoverable as the agentic era requires. Some of it lives in docs, some in Slack, some in tickets, some in code comments, and a great deal of it lives in people’s heads.

In the pre-agent world, weak documentation was annoying but survivable. A dev could ask the person who knew the system, or learn the convention through review comments. In the agentic world, missing context becomes a direct limit on what the agent can do.

This is why skills matter.

Skills turn tacit engineering knowledge into reusable context that agents can apply. They are not just prompts with nicer packaging; they are a way to encode how an organization wants work done, from API usage to security checks to writing style to deployment workflow.

This is also where Tessl’s view of agentic development comes in. If agents are going to participate across the SDLC, organizations need a way to collaboratively develop, discover, evaluate, and improve the context those agents rely on. Skills and evals are two sides of that problem: skills package the knowledge agents need, while evals show whether that knowledge actually improved the outcome.

Once you see context this way, and move the mental framework from SDLC → CDLC (Context Development Lifecyle illustrated above), documentation stops being a hygiene task and becomes infrastructure. The teams that write down how they work, keep that knowledge current, and make it available to agents will have a structural advantage over teams that treat context as tribal knowledge.

4. Cost matters, but the wrong framing leads to the wrong decisions

Model costs are becoming real.

In the earliest adoption phase, many teams did not feel the cost directly. Usage was limited, pilots were small, and in some cases vendor pricing or subsidies made the economics look less material than they would eventually become. But that phase is ending…

As agents become part of daily development, cost shows up in more places: large context windows, repeated attempts, long-running tasks, model upgrades, autonomous workflows, and agents that call other tools in loops.

A prompt that is cheap as a one-off experiment can become expensive when it runs across hundreds of devs every day, each with a large repo context, multiple retries, and a frontier model selected by default.

This is why AI FinOps needs to become a real discipline!

The cloud analogy is useful (but only up to a point). In cloud, cost followed infrastructure usage. In AI, cost follows cognition-like work: reasoning, context, retries, tool calls, evals, and orchestration. That makes it harder to map spend to value, because the bill may be attached to a workflow that saved a week of engineering time, avoided a security incident, accelerated a customer feature, or simply produced three bad attempts before a human rewrote it.

Even in the few weeks since these roundtables took place, awareness of AI costs has increased substantially. That will continue as agent adoption broadens. Leaders will need visibility into where spend goes, which models are used for which tasks, where context is being wasted, and which workflows justify their cost because they improve delivery, quality, risk, or ambition.

The wrong answer is to suppress usage blindly. The better answer is to manage it deliberately: model routing, caching, context discipline, budgets, observability, and evals that help teams know whether cheaper options are good enough.

5. Model routing will be part of AI agent governance

There was broad agreement that not every task should use the largest or most expensive frontier model. A good example is how we’ve recently switched Tessl’s default eval model from Sonnet 4.6 to GLM 5.1. The principle is easy to accept, but the operational question is harder: how does an organization know which model is good enough for which job?

The answer will not be one model - it will be routing.

Frontier models will remain valuable for ambiguous reasoning, complex planning, and tasks where the cost of a poor answer is high. Smaller models may be better for bounded, repeatable work where the task is well specified and the output can be validated. Open models have become capable enough that, for many narrow tasks, they may be more than sufficient and much cheaper. Local or private deployments may make sense when data sensitivity, latency, or control matters more than raw capability.

The risk is that every team solves this independently. One team standardises on Claude Code, another on Cursor, another on Codex, another experiments with open models, and the organization ends up with duplicated eval work and no shared view of quality, cost, or risk.

This is why model routing belongs inside AI agent governance. The decision should depend on the task, the data, the quality bar, the blast radius, the cost, and the validation available. The real capability is not choosing a favorite model; it is building the measurement and routing layer that lets teams use the right model for the right task.

The important test is not whether a smaller model works once. It is whether it meets the quality bar repeatedly under realistic inputs, with the context and constraints the workflow will actually have in production.

6. Why AI agent governance is becoming the enterprise security bottleneck

Cost is rising, but security is still the concern most likely to limit enterprise adoption.

The risks are easy to understand once you stop thinking about agents as chatbots and start thinking about them as actors inside the development environment. A coding agent running with a developer’s credentials may be able to access internal repositories, package registries, logs, deployment systems, tickets, customer data, and production-adjacent systems. If that agent can browse the web, install packages, execute scripts, or move data between systems, the blast radius changes materially.

This does not mean the right answer is to block agents. It means the trust model has to mature.

One useful mental model from the roundtables was to treat agents like new employees or interns. You would not give an intern every credential and full production access on day one. You would start with a defined scope, observe their work, review their decisions, and expand trust over time. Agents need a version of the same path.

That path includes identity, entitlements, sandboxing, audit trails, tool restrictions, policy enforcement, and incident response. It also includes a decision about whether the agent acts as the human, as a separate identity, or as a constrained delegated identity. Without that, security teams are left with a choice between approving risky autonomy or blocking usage entirely.

There is also an important cost dynamic here. In many enterprises, security constraints currently limit usage, which means they also shield the organization from the full cost curve. If only a small number of teams can use agents in limited ways, the token bill remains constrained. Once identity, permissions, sandboxing, and audit controls mature, adoption will expand, and costs that were previously hidden by limited rollout will become much more visible.

So security may be the immediate bottleneck, but cost is waiting behind it.

7. As coding gets cheaper, alignment becomes the bottleneck

Agents reduce the cost of implementation, but that does not mean the organization automatically moves faster. It means the bottleneck moves.

If code becomes cheaper to produce, the relative cost of everything around code increases: product clarity, architecture decisions, security approvals, change management, compliance, release coordination, and cross-team alignment. Several leaders described a version of the same pattern, where teams can now build faster than the organization can decide, approve, or absorb.

This changes the economics of software delivery.

For years, engineering organizations optimised heavily against duplication. Build the shared capability once, coordinate across teams, extract commonality, and reuse the platform. That instinct still matters, but the trade-off changes when implementation becomes cheaper and coordination remains expensive. In some cases, duplicating a capability inside a clear domain boundary may be more effective than forcing multiple teams through a shared dependency.

This is not an argument against architecture. It is an argument for architecture that recognises where the bottleneck has moved.

Agentic development works best when work has clear ownership, limited dependencies, strong tests, and a constrained blast radius. It struggles when success depends on many teams agreeing before anything can move. The practical leadership question is therefore not just “how do we make developers faster?” but “what will become the constraint once they are?”

8. Enterprise AI agent governance needs explicit, automated controls

Most organizations already have controls for software delivery: code review, change management, access approval, security review, compliance checks, deployment gates, incident response, and audit logging.

The problem is that many of those controls were designed for humans.

They rely on judgement, institutional memory, informal interpretation, or manual process. People know what the policy really means. Reviewers know when something feels risky. Security teams know which exceptions matter. Auditors accept a workflow because they recognise the human pattern behind it.

Agents force these assumptions into the open.

If a policy is ambiguous, an agent cannot reliably follow it. If a control depends on a human noticing something subtle, it may not scale. If a process is only documented in training material, it is not agent-ready. If an approval exists mainly so another team can find out what is happening, it may need to be redesigned.

This is governance debt, and agentic development exposes it.

The answer is not to invent an entirely new governance model from scratch. It is to make existing controls explicit, automated, and measurable. That means clearer policies, better identity systems, structured workflows, automated checks, traceability across agent actions, and evals that test whether the agent is actually following the standards it was given.

You cannot govern what you cannot see, and you cannot improve what you cannot evaluate. That is why skills, observability, and evals belong in the same conversation as security.

9. Standardization matters, but premature standardization can kill learning

Every organization adopting agents faces the same tension: how much freedom should teams have?

Too little standardization creates chaos. Too much standardization too early kills discovery.

The roundtables surfaced many examples of parallel experimentation: multiple teams creating skills, multiple repositories collecting prompts, different approaches to code review, different rules for test generation, different ideas about how much autonomy is acceptable. Some duplication happened because teams wanted control. Some happened because they did not know someone else had already solved the problem.

Early duplication is not always bad. It can be how teams learn. It can reveal which patterns work across different environments, and it can create local champions who are credible because they solved a real problem rather than followed a mandate.

But local learning only becomes organizational advantage if it becomes visible.

The healthiest pattern is to let teams experiment, make the work discoverable, then converge deliberately. That requires communities of practice, internal demos, shared repos, skill registries, lightweight review processes, and a platform team that sees its job as amplifying the good patterns rather than suppressing all variation.

The question is not whether to standardise. The question is when. Experimentation should be broad while the organization is learning. Production patterns should become intentional once that learning starts to repeat.

10. The talent model is shifting from writing code to directing, verifying, and integrating work

Agentic development changes what great engineering looks like.

It does not remove the need for engineering skill. If anything, judgement becomes more important. But the work shifts from producing every line of code to defining the task, supplying the context, delegating to agents, verifying the output, integrating the result, and knowing when something is subtly wrong.

Some engineers will thrive in that environment. They are comfortable with ambiguity, orchestration, and context switching. They can hold the goal in their head while inspecting partial outputs. They know how to specify, review, and correct without needing to manually produce every detail.

Others may struggle, especially if their identity is tied primarily to deep, single-threaded implementation or writing every line by hand. That style of work will not disappear, but it will become part of a larger system in which humans increasingly design and supervise the machinery of software creation.

One analogy that came up in the discussions was the shift from building the furniture to building or operating the factory that builds the furniture. Another is management: working with agents can feel like defining work, delegating it, reviewing the output, and intervening when needed.

That does not mean every engineer becomes a people manager. It means more engineers will need management-like skills for systems of agents: specification, delegation, verification, feedback, and accountability.

The emerging role is less “the person who writes all the code” and more “the person who ensures the right system gets built.”

Closing thoughts: What are the main blockers for enterprise agent adoption?

Blocker	What leaders are seeing	Why it matters
Security	Agents inherit human permissions, touch sensitive systems, browse the web, or act without enough containment.	It limits rollout today, but also defines the trust model for everything that follows.
Cost	Usage grows through larger context windows, repeated runs, frontier models, and always-on workflows.	AI FinOps becomes a durable discipline, not a one-off optimisation project.
Model deployment	Frontier models are powerful, but many enterprise tasks may be better served by smaller, open, or specialised models.	The capability to route work across models becomes more strategic than picking a single model.
Context	Agents need specs, policies, tests, docs, runbooks, examples, and domain language to do useful work reliably.	Context becomes infrastructure, and weak documentation becomes an adoption blocker.
Alignment	Implementation gets cheaper, while decisions, approvals, architecture, and cross-team coordination still move at human speed.	The bottleneck moves from writing code to agreeing what should be built and how it should fit.

Most of the roundtable discussion reinforced what enterprise leaders already feel: agentic development is useful, the tools are improving quickly, and adoption is uneven.

From my perspective, three novel points stood out:

Hyping AI productivity can hinder adoption. If the story inside a company is mostly about doing the same work with fewer people, employees will quite reasonably hear a threat. A better transformation narrative is ambition: agents let teams attempt more, build more, explore more, and pursue work that previously looked out of reach. This shift turns the questions around and focuses on nurturing an enterprise culture directed at empowering devs (not scaring them!).
We need AI FinOps! Managing AI costs is not a short-lived problem that disappears once models get cheaper. As agents become embedded in development workflows, usage expands, model choice diversifies, and context-heavy workflows become normal. Cost needs to be observed, managed, and tied to value.
In the enterprise, the security bottleneck currently shields organizations from the full cost curve. Many companies are not yet seeing the true cost of broad agent adoption because security constraints are limiting usage. Once the controls mature, adoption will expand, and the cost question will become much sharper.

The next generation of engineering teams won’t be defined by how many agents they use, but by how well they govern them.

At Tessl, this is the approach we’re building towards: agent governance rooted in context, evaluations, and security. A practical place to start is to point your coding agent at the Tessl CLI and ask it to evaluate your context. It is a simple way to see assess the quality of your context, understand where the gaps are, and think what governance will need to cover next.

Cursor's new leaderboard shows teams the most popular plugins, skills and MCPs

Tessl — Thu, 25 Jun 2026 06:28:02 +0000

As engineering teams adopt more agent tooling, keeping track of what's actually running across an organisation has become its own problem. Plugins, skills, and MCP servers get configured differently by different developers, with no shared view of what teammates are using, what's proven out, or what's worth standardising on. The result is a sprawl of JSON config files and scattered settings that nobody has full visibility into.

Cursor's latest update takes aim at that. Version 3.9, released June 22, introduces what the company calls a "Customize" page — a single interface for managing plugins, skills, MCP servers, subagents, rules, commands, and hooks across an organisation, controllable at user, team, or workspace level.

A leaderboard that shows what teammates actually use

The headline feature is a leaderboard showing which plugins, skills, and MCPs are most used both within a team and across the broader Cursor community. For skills, the leaderboard surfaces how many times each has been used by the team in the past 30 days, and what proportion of those invocations were agent-initiated versus human-initiated — useful signal for understanding which skills are genuinely being put to work.

Leaderboard (Skills)

For plugins, teams can see how many teammates have already added a given plugin, and click through to add it to their own setup in one step.

Leaderboard (plugins)

Previously, there was no way to see what teammates had configured — adoption was an individual, manual process with no shared signal. The leaderboard turns it into a discovery surface driven by real usage data, drawing on both internal team behaviour and community-wide trends.

Canvases, shared dashboards, and broader marketplace support

The update also introduces prebuilt plugin canvases — shared, interactive dashboards that render live data from partner tools directly inside Cursor. The Atlassian canvas, for instance, pulls a real-time view of Jira issues, sprint progress, and project documents into the editor, giving teams a live window into their project state without switching context. Teams get a ready-made starting point they can open and reuse, rather than building the wiring themselves.

Plugin canvas

Team marketplaces, which allow organisations to distribute private plugins internally, now also support GitLab, Bitbucket, and Azure DevOps repositories — previously the feature was limited to GitHub.

Cursor's bigger picture: SpaceX, a GitHub challenger, and a quiet acqui-hire

The update lands at a moment when Cursor is the most closely watched company in developer tools. SpaceX recently confirmed a $60 billion all-stock deal to acquire Cursor's parent company Anysphere — the largest acquisition of a venture-backed startup on record. Around the same time, Cursor unveiled Origin: an agent-native code hosting platform designed as a challenger to GitHub, which has been logging hundreds of incidents over the past year as it struggles to keep pace with the volume of code AI agents are generating.

Elsewhere, Cursor also quietly absorbed open-source coding assistant Continue, in an acqui-hire that shut down the product and handed its codebase to the community under its existing Apache 2.0 licence.

For engineering leaders already managing Cursor deployments at scale, the governance question is only going to grow as agent tooling becomes more embedded in how teams work. A unified control plane and a usage leaderboard won't resolve every challenge, but they give platform teams something they didn't have before: a clear view of what's actually running.

The new Tessl review: now you decide what "good" looks like:

Tessl — Wed, 24 Jun 2026 06:41:25 +0000

The new Tessl review: now you decide what "good" looks like:

For a while now Tessl has been able to review the quality of your skills straight out of the box. By simply running tessl skill review you get a score against Anthropic's best practices with no setup required. That is a sensible default and it has served most people well, but a default is still somebody else's opinion that you or your organisation might look at and disagree with.

Today we are launching a new version of Tessl’s review functionality. It does three new things: reviews your skills agentically with greater accuracy, and lets you define what good actually means for your skills, and keeps a sharable history of your skill review runs.

The problem with one definition of good

On one of my skills, the current review provides a quality score of 82%. The description review scores a perfect 100%, but the content section drops to 55%, with conciseness at 1 out of 3 and progressive disclosure at 1 out of 3.

In some people’s view, nothing is wrong with the skill, but the judge is marking it down for keeping one tight, self-contained skill rather than spreading it across five files. That is a reasonable position and it is Anthropic's position. But what if your org prefers larger, consolidated skills, in which case an 82 is punishing me for doing exactly what we want. Perhaps we even have further constraints which are being missed in my skill but completely being overlooked by the review and giving me a false sense of quality.

Here’s a video of the new Tessl review in action:

Watch on YouTube

Offering a more accurate review

The new Tessl review is invoked using tessl review run from the CLI or via the agent (but make sure it’s calling the new version!) and you need to pass a workspace name where your review results will be stored.

One of the bigger changes is under the hood. Whereas the previous review used an LLM as a judge in a single pass, the new version uses an agent. It takes more turns, gathers more information about the skill and associated files and reaches a better more grounded verdict. You will still see some variation between runs, since an LLM judge is non-deterministic by it’s very nature, but the results are more accurate.

Defining what good skills look like for your organization

This is the exciting part that changes how reviews determine what’s right, as the new review allows you to pass your own rubric, as a plugin, and review against it.

We’ve made a plugin called review-plugin-creator that walks you through building a custom review plugin. This allows you to fork the Anthropic best practices if you only wish to change a few things, so everything sensible stays in place by default and you only change what you disagree with. In my case I flipped a single rule, the one that punishes consolidated skills.

The creator produces a plugin holding your guidelines and rubric. To reference it on a tessl review run, you can reference it locally in the file system, or link to a private or public plugin on the Tessl Registry.

Running the same skill again, this time with your rules, and you’ll see updated scores. In my case, the consolidated skill now scores full marks on conciseness and progressive disclosure, and the content section reflects what my org actually values rather than what a generic default assumes.

Seeing your reviews

Everything you see at the CLI is also on the Tessl Registry. Head to your workspace and you will find your review plugin alongside a full history of review runs. Each run shows the same breakdown you get in the terminal, plus the plugin that produced it, so you always know which definition of good a score was measured against.

In your workspace settings you can set a default review plugin. From then on every review run from that workspace uses it automatically. You can still override it per run with the --review-plugin flag whenever you need to.

The rest of the toolkit

A few more commands worth knowing:

tessl review list --workspace <workspace-name> lists every review run against a workspace
tessl review view <review-id> opens a single run and shows its full output.
tessl review fix is the new home for the --optimize behaviour you already know from our previous review. It agentically applies fixes to the skill based on a review outcome and can update your SKILL.md directly.

What does this mean for the old command?

tessl skill review is not going anywhere yet. We have deliberately left it in place so nothing breaks for anyone relying on it today, although you may see a deprecation message. That said, tessl review run is where all the work is going from here, so please move across and start using it, so you’re not caught out when we do turn off the older review feature. We’ll also be releasing updates to our GitHub actions soon to make use of the new tessl review functionality.

Try it now

The new Tessl review is live and you can use it today, do note that you’ll need a free account in order to use the Tessl review command (you can check the full documentation here. There is plenty more to come and we will keep you posted as it lands. For now, run it against your own skills, write a rubric that matches how your team actually thinks about quality, then tell us how it performs in your environment. Your feedback shapes what we build next.

Customise Tessl review: https://tessl.io/registry/tessl/review-plugin-creator

Learn more about Tessl: https://tessl.io

Common Pitfalls of Skills Development (And How to Fix Them)

Tessl — Tue, 23 Jun 2026 06:28:50 +0000

I recently gave a version of this talk at AI Engineer Europe in London. What follows is the fuller story — what we found when we looked at thousands of skills, what goes wrong, and how to fix it.

You know that scene in The Matrix? Neo gets a spike in the back of his head, they upload kung fu directly into his brain, and he just... knows it.

That's what a skill is for an AI coding agent. You write a markdown file — a SKILL.md — and the agent loads it when the task matches. Suddenly it knows your team's deployment process, or how your API handles pagination, or that you never use semicolons.

It's not code. It's context. Procedural knowledge, injected at the right moment.

The thing is — Neo's upload worked perfectly. Ours? Not always.

Skills are everywhere now

We spent some time analysing essentially all of public GitHub. In November last year, 12 repos had SKILL.md files. By March — five thousand four hundred and sixty. That's 450x growth in fourteen weeks.

Skills went from zero to 27% of all agent config activity in three months. Faster adoption than CLAUDE.md, AGENTS.md, or any of the dotfile formats before them. And 1 in 12 merged PRs on GitHub now touches an agent config file — 8.4%, up from basically zero eighteen months ago.

This is not a niche thing anymore. This is how people are working.

Watch on YouTube

But are they electrifying?

Ninety percent of agent config files are never updated after creation. Write once, forget forever.

Your codebase evolves every day. Your dependencies change. Your API contracts shift. But the instructions you gave your agent? Frozen in time.

For Gemini files it's even worse — 97% are write-once. And the purpose-built "skill-as-product" repos? Over half are under 50 kilobytes. Wrapper repos. Many are AI-generated. High churn, low staying power.

We have this explosion of skills, and most of them are going stale the moment they're committed.

What we did about it

The DevRel team at Tessl spent a couple of months doing something pretty hands-on. We went out and found open-source projects with SKILL.md files. We ran them through our review tooling. And where we could improve them, we opened pull requests. To strangers. On the internet.

622 PRs. 559 different repos. Nearly six thousand skills touched.

We weren't just theorising about what goes wrong. We were in the trenches, reading other people's skills, fixing them, and learning from the maintainer responses.

At the time of writing, 96 of those PRs got merged. 140 were closed. The rest were still open. That's a 15% merge rate on cold PRs to strangers' repos — which honestly isn't bad.

And along the way, we learned exactly where skills break.

Pitfall #1: Vague descriptions

Your description field is your activation signal. It's the if-clause the agent evaluates before it decides to load your skill. If it's generic, the agent has no signal. It either ignores you, or worse, activates on the wrong task.

Before:

"A helpful skill for code review and quality improvement"

After:

"Runs ESLint with project rules, flags type-safety violations, and suggests fixes. Use when reviewing TypeScript PRs or running pre-commit checks."

From our outreach, 105 of our merged PRs specifically fixed descriptions. It was the single most common fix.

And our research team measured this. When skills are installed but the agent isn't forced to use them, activation drops to 41%. Less than half. The skill is right there, installed, ready to go — and the agent walks right past it.

The strongest predictor of activation is what we call "distinctiveness conflict risk" — does your description use terms unique enough that the agent can tell your skill apart from its own built-in behaviours?

Skills with strong domain-specific nouns — "Remotion", "Calendly", "path-traversal-finder" — those activate well. Skills described with generic terms like "API", "code", "debugging"? They compete with the agent's own capabilities and lose.

What matters isn't how detailed your skill is. It's whether the description signals a concrete, bounded task that doesn't overlap with what the agent already knows how to do.

Pitfall #2: God skills

We found a Microsoft Foundry skill with 50 files in it. Fifty. Even with progressive disclosure, no agent is loading all of that context effectively.

And our review scores said it was fine. The evals passed. But three scenarios can't cover the surface area of fifty files. There's more content in that skill than can possibly be tested.

This is the God Skill problem. A skill that tries to do everything produces a description so broad it either never activates, or activates for the wrong reason. One skill, one workflow. That's the rule.

The SkillsBench paper from earlier this year confirmed it: 16 out of 84 tasks showed negative skill deltas. The skill actively made the agent worse. Usually because it introduced conflicting guidance or unnecessary complexity for something the model already handled well.

Pitfall #3: Context bloat

We know that leaner skills perform better. One of our users reported that after optimising their skill, it used 40% fewer tokens and finished in half the time compared to scanning source code directly.

But here's the irony: when we run our own optimiser, the output is on average 17% longer than the input. The machine adds examples, caveats, edge cases. It's thorough — but thoroughness burns context window.

Human-written skills often contain things the LLM already knows. You don't need to explain what a REST API is. You don't need to define what TypeScript generics are. The agent knows. What it doesn't know is your specific conventions.

The fix is progressive disclosure. Core instructions in the body. Detailed reference material in separate resource files, loaded on demand. Not upfront.

There's a related subtlety that bit us too. When you generate eval scenarios automatically, there's a risk that the scenario description accidentally tells the agent what to do to score well. We call it criteria leakage. The task says "implement audit logging with structured JSON output" — and the scoring rubric checks for structured JSON output. The baseline scores 80% just from reading the task description, without the skill.

Our research team measured this: 30% of auto-generated scenarios had meaningful leakage. And when leakage is high but the scenario is generic, the skill can actually score worse than baseline. The leaked info is enough for the agent without the skill, and the skill just adds noise.

If your baseline scores are suspiciously high, your scenarios might be doing the agent's homework for it.

Pitfall #4: Activation varies by agent

Activation isn't just about your description. It varies dramatically by agent harness.

Setup	Activation Rate
Claude Code (forced)	98%
Single skill installed	62%
10 skills installed	58%
Claude Code (not forced)	41%

With a single skill installed in a controlled test, activation is 62%. Add nine more skills and it drops to 58%. And installing too many skills can mean they conflict — the agent gets confused about which one to use, picks the wrong one, or picks none.

One of our colleagues tested a security review skill via MCP and reported: "The agent took the hint and just carried on" — completely ignoring the skill instructions. It acknowledged the skill existed but didn't follow it.

The honest bit

"We disagree pretty strongly with some of Tessl's guidance. Please stop submitting automated rewrites of our skills." — Open source maintainer

Not everyone loved our pull requests. And that's fair.

Reviewing a skill isn't just checking markdown formatting. If a skill augments a library, there's institutional knowledge baked in. The skill might encode proprietary details about how an org operates. Running a review without access to the project's test suite, without the external APIs, without the full context — you can't prove the "improvement" actually improves anything.

We can tell you if the description follows best practices. We can tell you if the structure is right. But we can't tell you if the content is correct for your specific domain without running it against your actual workload.

That's why evals matter. Static review is necessary but not sufficient. It's like static analysis versus actually running your tests.

The fix: the Context Development Lifecycle

So how do you actually fix all of this? Our very own Patrick Debois wrote about this as the Context Development Lifecycle. The idea is that context needs engineering rigour. The same discipline you'd give a shared library.

Generate: Capture the implicit knowledge. Your conventions, your architecture decisions, your API quirks. The agent can draft, but the human decides what's true.

Evaluate: Test it. Reviews check structure. Task evals run the agent on real scenarios with and without the skill and measure the difference. That's the only way to know.

Distribute: Version it, publish it, secure it. Skills need owners, changelogs, and semver. A skill without version history is technical debt from the moment it's shared.

Observe: Watch what happens in production. Monitor activation. Check adherence. Close the loop.

The teams that win won't be the ones with the best models. They'll be the ones with the best context.

The numbers that matter

Our large-scale eval study across 1,200 skills showed roughly 20% absolute improvement in accuracy when the agent has skill access. Even more interesting: smaller, cheaper models remain competitive with larger models when given good skills. That's a direct cost saving.

And when you optimise properly — trim the fat, fix the description, use progressive disclosure — you get the same results with 40% fewer tokens in half the time.

But here's the caveat: human-curated skills improve performance by over 16 percentage points. Self-generated skills? Negligible or even negative. The quality of the skill matters enormously.

Skill adherence across projects ranges from 19% to 94%, with an average of 62%. The variance is huge — and that's the gap where good engineering practices make the difference.

Skills aren't just nice-to-have. They're a multiplier. But only if you treat them like software.

Start fixing your skills today

Submit for review: Send your skill to the Tessl registry for review and scoring.

Automate it: Add the tesslio/skill-review GitHub Action to your repo so every PR that touches a SKILL.md gets reviewed automatically.

Run it locally:

`npx tessl skill review ./SKILL.md`

The review gives you a score and line-level suggestions. The --optimize flag applies them. Iterate until you're above 70% before publishing. And when you're ready to go further, generate eval scenarios and run task evals — that's where you move from "does this look right" to "does this actually help."

If you're looking to bring this rigour to your engineering team, we can help with that too.

Anthropic, OpenAI, or Cursor model for your agent skills? 7 learnings from running 880 evals (including Opus 4.7)

Tessl — Mon, 22 Jun 2026 06:42:40 +0000

Claude Opus 4.7 shipped last week, and the question any engineering team reaches for is how it compares to its peers.

It is the strongest frontier coding model we tested on the baseline leaderboard, and it will be the easy default a lot of teams reach for.

But in 2026, the model you reach for could matter less than the skill you load with it.

That is what 880 evals across nine models (Opus 4.7, Opus 4.6, Sonnet 4.6, Haiku 4.5, gpt-5.4, gpt-5.3-codex, gpt-5-codex, and Cursor's Composer-2) tell us.

Let’s take a step back. It’s now 2026, and agent skills are spreading like wildfire… (even our favourite movies are catching up to them).

Watch on YouTube

Every major agent ecosystem now has some version of them.

So the question worth asking, whether you are a dev, a platform engineer, or an engineering leader, is which skills actually earn their context weight, and which ones just add cost.

At Tessl, we believe context -particularly agent skills- and the broader concept of a context development lifecycle are where this space is heading (see also: Why the best AI coding teams will win on context). The results below add to a growing body of signals pointing to a shift that is already underway.

Top-line results

Model	Native behavior rate coverage (e.g "without skill")	Adherence to skill ("with skill")	Lift	$/run (with skill)	Avg time (with skill)
claude-opus-4-7	80.5%	94.5%	+14.0	$1.00	158.9s
claude-opus-4-6	77.1%	93.8%	+16.7	$0.53	126.6s
claude-sonnet-4-6	75.6%	93.3%	+17.7	$0.31	125.1s
claude-haiku-4-5	61.2%	84.3%	+23.1	$0.12	77.8s
gpt-5.4	75.9%	92.7%	+16.8	N/A*	135.4s
gpt-5.3-codex	75.8%	91.9%	+16.1	N/A*	87.9s
gpt-5-codex	73.8%	85.1%	+11.3	N/A*	136.2s
cursor-composer-2	73.6%	90.5%	+16.9	N/A*	152.0s

We’ve evaluated 11 node.js development skills (documentation, fastify-best-practices, init, linting-neostandard-eslint9, node-best-practices, nodejs-core, oauth, octocat, skill-optimizer, snipgrapher, typescript-magician), and aggregated “with vs without” skill performance. For each skill we generated up to 5 realistic tasks from its content, each paired with evaluation criteria - full. We then solved each task with an agent under two conditions: with access to the skill and without access to the skill (full set up explained here). Codex and Cursor per-run costs aren't reported by the eval platform - source those directly from OpenAI and Cursor pricing for now.

Seven things the numbers told us

1. Every single configuration got positive lift

Eight models, 11 skills, 5 scenarios each. 88 configurations. Every single one posted a positive average lift with a skill loaded.

Smallest: gpt-5-codex at +11.3 points.
Largest: Haiku 4.5 at +23.1.
Most configurations landed somewhere around +16 points.

This is not a story about Opus 4.7 winning and another model losing. Cursor's Composer-2 lifted from 73.6% to 90.5%, a +16.9 bump that puts it mid-pack alongside Sonnet and gpt-5.4. Across the Codex family alone, lifts ranged from +11.3 (gpt-5-codex) to +16.8 (gpt-5.4), so not every variant within a vendor benefits equally, but all of them benefited. Skills lifted scores across vendors, across tiers, across model generations. That is about as clean a signal as benchmark data gets.

2. Skills helped weaker models the most

Haiku 4.5 went from 61.2% to 84.3% with a skill loaded. That is a 23.1-point lift, the biggest gain of any configuration we tested. Opus 4.7 gained 14 percentage points. Sonnet gained 17.7.

The pattern holds across every model family in the set. If you are reaching for a smaller, cheaper model to control cost, skills could be where your adherence is going to come from, not the next tier up.

3. A cheap model with a skill can beat an expensive one without

The skill set we leveraged is a Node.js-focused skill, so models can leverage certain Node.js scenarios directly from their pre-training. In our analysis, Haiku 4.5 with a skill, at 84.3%, outperformed every single baseline configuration we tested, including Opus 4.7 at 80.5%.

Meanwhile, up at the frontier, Opus 4.7 (94.5%), Opus 4.6 (93.8%), and Sonnet 4.6 (93.3%) all landed within 1.2 points of each other with skills loaded. Without skills, that spread was closer to 5 points. Skills appear to compress the adherence gap between different mode tiers. This confirms the same result as a deeper research we recently released: small models with context become powerful!

4. The biggest gains came from context no model was trained on

The single largest skill lift in the benchmark: snipgrapher at +36 points (51.9% → 88.0%). snipgrapher is a niche CLI by Matteo Collina - public on npm, but not exactly a household name. Whatever frontier models absorbed about it from training data, they clearly had not absorbed enough: its 51.9% baseline is one of the lowest in the set. Loading the skill closed that gap fast. Second place: node-best-practices at +29.2 points.

The skills that pulled the most weight were the ones encoding knowledge the model had no way to pick up from pretraining: private APIs, internal conventions, uncommon domains. Wrappers over material a frontier model already knows rarely justified their token cost. For anyone publishing skills, that looks like the bet worth making.

5. Loading a skill is a real context budget decision

We’ve seen loading a skill can be as much as a 3x cost increase for +2pp performance. Here is what "add a skill" costs Opus 4.7 at the frontier:

Input tokens: 557K → 1,016K per run. An 82% increase.
Cost per run: $0.61 → $1.00. Two-thirds more per invocation.
Turns taken: 17.5 → 24.4. A 40% jump. (Opus 4.6 jumped from 13.7 to 22, a 63% jump.)

Skills at this capability level look less like brief hints and more like importing a big dependency: they can buy you capability, and they can cost you size, speed, and complexity. If you are orchestrating agents at scale, plan the context weight as deliberately as you plan the skill.

6. Sonnet-plus-skill might be the Opus-replacer hiding in the numbers

Sonnet 4.6 with a skill: 93.3%.

Opus 4.7 with a skill: 94.5%.

A 1.2-point gap. At a third of the per-run cost ($0.31 vs $1.00) and around 34 seconds faster on average.

For teams already running Opus on every workload and wondering whether it is earning its keep, that gap looks slim on anything that is not the hardest 5% of tasks. On almost every scenario we tested, Sonnet with a skill produced an output a senior developer would be hard pressed to separate from Opus with a skill. And you get the change back on every invocation.

If your top constraint is latency rather than cost, gpt-5.3-codex with a skill is the other sweet spot worth stress-testing. It landed at 91.9% skill adherence in 87.9 seconds, nearly half the run time of Opus-with-skill for a 2.6-point adherence gap. For latency-sensitive agentic pipelines, that combination is arguably the speed champion of the benchmark.

7. Haiku-plus-skill is the most underrated production config we tested

Haiku 4.5 with a skill: 84.3% at $0.12 per run. Average run time: 77 seconds. Roughly half the latency of Opus-with-skill, and 12% of the cost.

Adding the skill to Haiku barely moved the cost needle either. The run went from $0.104 to $0.119, a 1.5-cent marginal increase. Compare that to Opus, where the same skill switch added 39 cents per run. The lift on Haiku is enormous. The cost of getting it is effectively free at scale.

For throughput-heavy workloads such as batch jobs, eval loops, retries, or anything running at volume, that looks like the ROI champion of this benchmark.

Disclaimer: We are not saying every team should default to Haiku. We are saying the question worth asking before reaching for the most expensive tier is a simple one: would 84% with a skill be good enough for this workload?

What this means for you

If you are a dev

The skills that pulled the most weight were the ones encoding context the model was never trained on. If you are building or choosing skills for your own workflow, the ones that will move your skill adherence most are tied to your specific stack: your internal APIs, your company's style guide, the framework nobody outside your repo has ever seen. Thin wrappers over library docs the model already knows are rarely going to earn their token cost.

There is also a practical implication for day-to-day work. Your wallet already knew this, but you perhaps do not need Opus for every task. For routine work such as code review, commit message generation, or refactor suggestions, Haiku 4.5 with a well-built skill is fast enough and accurate enough, and the round trip is roughly half the time.

If you are a platform engineer or DX lead

You are the one rolling agentic tooling out across developers at scale.

Take 100 devs running their agent 20 times a day:

Opus 4.7 with a skill at $1.00 per run: around $60,000 a month.
Sonnet 4.6 with a skill at $0.31 per run: around $18,600 a month.
Haiku 4.5 with a skill at $0.12 per run: around $7,200 a month.

An 82% increase in input tokens when you switch skills on is not an edge case at scale, it is the main cost driver. Governance matters, context budgets matter, and the skills you bless need to earn their weight, not just their skill adherence.

That is exactly the problem the Tessl registry aims to solve. Every skill in the Registry ships with eval scores, security scores, and impact metrics so you can see which skills actually earn their weight before you ship them to your org. Run evals against your own workloads to quantify the productivity you are losing on generic outputs versus what you can win back with the right skills in context. That is the kind of governance layer a platform team could take into a budget conversation.

If you are a VP of engineering

You may now have defensible data to make a tier-down case where it makes sense. Sonnet-with-skill delivers output within 1.2 points of Opus-with-skill at a third of the cost. For most workloads that are not the hardest 5% of tasks, that gap will not show up in the output quality your team is shipping.

Also worth knowing if you are picking a default for your org: skills lifted every configuration we tested, across Claude, Codex and Cursor. Your agent choice does not have to be locked to a single vendor to benefit from a skill-first strategy. That is useful leverage in procurement conversations and in any "should we standardise on X?" discussion.

If you want to run this decision with numbers for your own org, head over to your terminal, spin up an agent, and ask it to run evaluation with Tessl for your skill across different models. That could turn a procurement conversation into a data conversation.

Closing thoughts for AI enablement leads (even if your job title doesn't say so yet!)

This is the role that looks most directly at numbers like these. It doesn't always come with a standard title. Right now, the responsibility is sitting inside platform teams, developer experience functions, senior devs who have taken on the hat, and VPs of engineering wearing it as a second role.

What the role is responsible for: making sure hundreds of devs in an org have agentic tooling that is reliable, affordable, and performant. Which model a team defaults to, which skills are blessed, how much context a workload is allowed to pull, which workloads run where. These decisions land on whoever has that scope.

A few things to pay attention to in this data if that is you:

The Sonnet-with-skill vs Opus-with-skill comparison could be a procurement conversation. At a 3x cost difference for effectively equivalent output on most tasks, this is the kind of number that should be going into your infra budget chats.
The 82% token increase when you switch a skill is the argument for context governance. Your skills need to be evaluated on what they lift, not just on whether they are available.
Haiku with a skill is the config worth testing for internal, high-frequency workloads. Running evals on your own skills, generating routine summaries, drafting internal docs. The output doesn't have to be Opus-grade. It has to be good enough, often enough, at a price your org can afford across hundreds of developers.

We believe the AI enablement lead will become a titled role inside engineering orgs over the next twelve to eighteen months, the same way DevOps lead and developer experience lead emerged before it. If you are that person in your org today, the above table is for you.

Opus 4.7 is a solid upgrade, but if you only take one thing from this piece: in 2026, picking the skill might matter more than picking the model.

Spin up your agent and request to leverage Tessl scenario evals for your skills, or speak to sales about Tessl for enterprise.

Evaluating Kimi 2.5 vs Kimi 2.6: What happens to agent skills when the model gets smarter?

Tessl — Sun, 21 Jun 2026 06:41:07 +0000

When a stronger model ships, there are two questions every skill author should want answered, and evals are the only honest way to answer either:

Which skills just got absorbed? A model that now knows how to do X natively does not need a skill telling it to do X. Fewer skills to maintain, leaner context, lower cost.
Which skills still matter? Behaviour-level guidance (conventions, preferences, project-specific workflows) is not something pretraining will fill in for you. Those skills should keep paying.

Moonshot gave us early access to Kimi K2.6. We ran the Tessl agent skill evaluation harness on the same 21 skills and 100 paired scenarios against three solvers: Kimi K2.5, Kimi K2.6, and Claude Sonnet 4.5.

A solver is the model whose output the grader scores; a paired scenario is the same task run twice per solver, once without the skill installed and once with it. These are early signals from one pre-release on one skill set. A deeper cross-model analysis with clean baselines across the board is in progress and will be its own piece.

What does our setup look like?

Scenarios and rubrics are held fixed across the two Moonshot runs. The only variable is the solver.

Solver A: Kimi K2.5
Solver B: Kimi K2.6
Scenario generator: Claude Sonnet 4.5, up to 5 scenarios per skill, derived from each skill's SKILL.md
Grader: Claude Sonnet 4.5, weighted-checklist rubric derived from the same SKILL.md
Per skill × per solver: every scenario solved twice, baseline (no skill installed) and with-skill

Per-skill n=5 is noisy; the aggregate over 100 scenarios is where the signal lives.

Three findings:

Kimi 2.6 is a better model than K2.5: Without skills, K2.6 sits ~2 pp (percentage points) above K2.5 in aggregate, with double-digit moves on specific skills.
Kimi 2.6 holds its own against Sonnet 4.5. We picked Sonnet 4.5 as a competitive baseline, and found in this evaluation set that the K2.6 performed better both in the with/without skill scenario by around ~8 p.p.
Skills remain a durable lever as models improve. The uplift skills buy stays roughly similar as Kimi improves (+17.05 pp on K2.5, +17.20 pp on K2.6).

1. Kimi K2.6’s baseline performance is superior

Solver	Baseline (no skill)	With skill	Uplift
Kimi K2.5	73.2%	90.2%	+17.05 pp
Kimi K2.6	75.0%	92.2%	+17.20 pp

Kimi K2.6 is a better model than K2.5 on this skill set. Two findings to back this up:

Four skills are now redundant on K2.6. In the 21-skill set, 4 skills have K2.6 baselines ≥ 95%, up from 2 under K2.5. agent-gossip-coordinator is the clearest example: K2.5 needed the skill (+8.0 pp uplift), K2.6 already solves it at 96.4%, and the skill now hurts by 4.8 pp. These skills are no longer earning their context budget as superior models can take care of it.
Both K2.5 regressions cleaned up. Two skills that made K2.5 worse (3d-molecule-ray-tracer: −7.0 pp; agent-base-template-generator: −2.6 pp) both resolve on K2.6. The skills were not wrong; the weaker model was just interpreting them awkwardly.

2. Kimi 2.6 holds its own against Sonnet 4.5

Putting K2.6 next to Sonnet 4.5 on the same 21 skills and same rubric, the early picture is this:

Solver	Baseline (no skill)	With skill	Uplift
Kimi K2.6	75.0%	92.2%	+17.20 pp
Sonnet 4.5	63.2%	84.5%	+21.3 pp

On these early signals, it appears that Kimi K2.6 is competitive with Sonnet 4.5 for the task categories these skills cover. We are scheduled to make a deeper cross-model study with clean baselines across all three solvers is in progress - but this is an early signal that Kimi 2.6 is comparable to certain of the world’s leading providers.

3. Skills remain a durable lever as models improve

With vs without the skill installed, on Kimi:

K2.5: +17.05 pp.
K2.6: +17.20 pp.

The uplift the skill buys does not shrink as the solver gets stronger. The baseline moves, the with-skill score moves with it, and the delta the skill contributes stays in the same range.Two illustrative cases, both Kimi versions, same rubric:

agent-agent. K2.5 17.7% → 79.9%. K2.6 33.9% → 88.8%. The baseline closed 16 pp of the gap. The skill still buys roughly 55 pp on top.
agent-development. K2.5 41.2% → 100.0% K2.6 55.0% → 100.0%. The baseline closed 14 pp of the gap. The skill covers the rest.

One nuance worth flagging here and reserving for a dedicated follow-up: not every uplift is equal. An initial pass comparing the same skills on Sonnet 4.5 suggests that skills prescribing ecosystem-specific tool calls or conventions lose the most in the cross-family handoff, while skills graded against real, verifiable behaviour (actual CLI flags, actual API shapes) transfer more readily. We view this as the most actionable signal for skill authors, but a broader sample and matched baselines across models are needed before we publish a complete analysis.

What this means for skill authors

Kimi K2.6 is a stronger solver than K2.5 on the task categories in this skill set, and competitive with Sonnet 4.5.
Rerun your evals when the model changes. Baselines move unevenly; some skills become redundant, some keep paying. You cannot tell which is which without running the evaluation.
If you want to run this kind of comparison on your own skills, the harness used here is the Tessl skill evaluation framework. Same structured scenarios, same weighted-checklist grading, pointed at whichever solver and skill set you give it. You can also spin up your agent and ask it to evaluate your skill with Tessl (and you can pick Kimi as your model).

Closing

Kimi K2.6 is a better model than K2.5 on this skill set: a +1.9 pp baseline gain, four skills now solved without any skill installed, and both K2.5 regressions cleaned up.

Skills still matter as models get better: the +17 pp uplift we saw on K2.5 held on K2.6, and uplift in a similar range appears on Sonnet. All of this comes from a single pre-release evaluation on 21 skills; a deeper study with clean baselines across the board is the next piece.

The above reflect early signals. On early signals it appears Kimi 2.6 is competitive with Sonnet 4.5, though a deeper study across more models and a balanced skill sample is in progress and will be published separately.

Thanks to Moonshot for early access to K2.6! Head over to Tessl to evaluate and optimize your skills.

Stop guessing whether your Skill works: skill-optimizer measures and improves it

Tessl — Sat, 20 Jun 2026 07:32:29 +0000

I typed one sentence into Claude Code: Please optimize the Fastify skill in this project, and then walked away to grab a coffee.

When I returned, I had a complete picture of how well Matteo Collina's fastify-best-practices skill was actually performing: five realistic eval scenarios, a baseline score for each, a full before/after comparison, a diagnosed regression, a proposed fix, and a rerun confirming the improvement. The skill went from an average success rate of 67% to 94% across real-world scenarios. I didn't write a single eval. I didn't design a single rubric. I just said three words and let skill-optimizer do the rest.

Important Update: skill-optimizer can now test whether your skill gets invoked at all. In a plugin with multiple skills, the agent has to route to the right one before any of the optimization logic matters. Activation evals (--solver=activation) surface routing gaps scenario by scenario, and automatically suggest description rewrites to fix them. It's the check you didn't know you were missing. Additionally, results analysis now uses a structured four-bucket framework (working / gap / redundant / regression) rather than a simple diagnosis pass.

Introducing skill-optimizer

When you write a SKILL.md, you're essentially writing instructions for an AI agent. The problem is you're writing those instructions blindly. You don't know:

Whether the agent actually follows them
Which parts are redundant (the agent already knows how to do things without the skill)
Which parts cause regressions (your instructions confuse the agent more than help)
Whether it works on cheaper models (Haiku) or only on expensive ones (Opus)

The skill-optimizer plugin runs your skill through a judge-scored eval pipeline, testing the agent with and without your skill on real tasks, then scoring the delta. You're not guessing anymore, you have real numbers to back up your feelings, as all Jedi should have.

How it works: two complementary approaches

The plugin combines two methods:

Skill review (tessl skill review) A static analysis of your SKILL.md itself. Scores it on four dimensions: completeness, actionability, conciseness, and robustness. This phase quickly catches structural problems before you even run the agent.
Task evals (tessl eval run) Generates realistic task scenarios from your skill, runs an agent on each scenario twice (once without your skill as a baseline, and once using your skill), then has an LLM as a judge score both outputs against a per-scenario rubric. The score delta tells you the skill's value-add.

The skill, optimize-skill-performance-and-instructions, combines both approaches into a single end-to-end cycle.

A real example: mcollina's fastify-best-practices skill

mcollina/skills is Matteo Collina's open-source collection of skills for modern Node.js development. It already has 1,200+ stars, 80+ forks. It covers Fastify, TypeScript, linting, documentation, and core Node.js patterns, with a SKILL.md per skill and shared rules files wiring it all together.

We ran skill-optimizer against the fastify-best-practices skill. Here's what I did as a how to so you can follow along if you like.

What actually happened

Step 1: Install the skill optimizer skill

In your skills project run:

`tessl i tessl-labs/skill-optimizer`

That's it! The skills become available to Claude Code when you start it next.

Step 2: Kick off the full optimization cycle

From within Claude Code, I asked just one thing:

`Please optimize the Fastify skill in this project`

Remember, always say please! That triggered a skill called optimize-skill-performance-and-instructions, which is the top level skill in the plugin that calls the others as needed. Claude Code took it from there. From Step 3, you’ll see the full sequence that claude ran automatically, and what happened at each stage.

Step 2a: Skill review (Stage 1)

Claude Code kicks off by performing a review of the Fastify skill using Tessl.

`tessl skill review skills/fastify/SKILL.md`

The result was encouraging:

`Average Score: 100%

  Description: 100%
    specificity: 3/3
    trigger_term_quality: 3/3
    completeness: 3/3
    distinctiveness_conflict_risk: 3/3

  Content: 100%
    conciseness: 3/3
    actionability: 3/3
    workflow_clarity: 3/3
    progressive_disclosure: 3/3

✔ Skill evaluation completed successfully!`

A perfect score. The description was praised for its explicit Use when guidance, natural trigger terms (Fastify, server.ts, app.ts, Pino), and clear Fastify-specific terminology that keeps it from conflicting with generic Node.js skills.

This wasn’t a surprise to me, of course, as I already worked with Matteo in a previous PR to improve all of these before.

Here's the important lesson though: a perfect review score doesn't mean your skill is actually working. The static review tells you the instructions are well-formed. It doesn't tell you whether the agent follows them. That's what the evals are for.

Does Your Skill Even Get Invoked?

A new addition to the skill! When your plugin contains multiple skills, there's a step that happens before any scoring logic runs: the agent has to pick the right skill for the task. It reads each scenario, looks at your skill descriptions,and routes accordingly. Get that wrong, and your eval scores are measuring the wrong thing entirely.

That's what activation evals are for. Rather than scoring outputs, they ask a simpler question: did the right skill actually fire?

tessl eval run <path/to/plugin> --solver=activation

The output shows you which skill activated for each scenario, or whether anything activated at all. The agent looked at the task and didn't find a skill it considered relevant. Skill-optimizer will automatically read your skill descriptions and the failing scenario, and suggest minimal rewrites to close the gap.

This matters because scored evals only tell you how well a skill performs once it's running. If it never runs in the first place, no amount of instruction-polishing will move your scores.

Step 2b: Generate eval scenarios (Stage 2)

Claude then generated 5 real world scenarios with Tessl for the skill:

`tessl scenario generate . --count=5`

Here are the various scenarios that were created.

Five realistic, well-scoped scenarios covering the core surface area of the skill: production config, schema validation, auth, database plugins, and file handling with tests.

Step 2c: Run evals (Stage 3)

Following the scenario generation, Claude then ran each of the scenarios as an eval using the claude-sonnet-4-6 model, with Tessl:

`tessl eval run . --agent=claude:claude-sonnet-4-6`

Claude Code shares a monitoring URL and polls every few minutes.

Step 2d: Analyze results (Stage 4)

Here's what came back:

Three scenarios with big gains, one modest gain, and one regression. The production config scenario is the standout. The skill took the agent from 41% to a perfect 100%. Without the skill, the agent had no idea to reach for env-schema, close-with-grace, or @fastify/under-pressure. With it, it nailed every check.

The regression on the database scenario needs attention, but we wouldn’t have known this without the fix!

Four Buckets, Not Just Pass/Fail

When I described how skill-optimizer diagnoses gaps earlier, I framed it as identifying what the skill was missing. That's still true, but the current version is considerably more structured about it. Every criterion in your eval results now gets sorted into one of four buckets:

Working well: with-skill score is high and meaningfully above baseline. These are your strengths. Leave them alone.
Plugin gap: both baseline and with-skill scores are low. The agent doesn't know this without your help, and the skill isn't teaching it yet. These have the highest return on fixing.
Redundant: baseline is already high without the skill. The agent knows this from general training, which means your instructions are adding context overhead without adding value for this criterion.
Regression: with-skill score is lower than baseline. The skill is actively confusing the agent on this point. Highest priority to address.

The redundant bucket is the one that tends to catch people off guard. The instinct is that more guidance is always better, but instructions covering things the model already does well just take up attention budget. Skill-optimizer flags these and suggests either removing the criterion altogether or replacing it with a harder scenario that actually tests what your skill brings to the table.

Step 2e: Diagnose and fix (Stage 5)

The regression: database-plugin-architecture

Drilling into the per-check breakdown reveals the problem:

  `Scenario 4: Database plugin architecture with official adapters

  Baseline (without context)
    onClose hook for cleanup           7/10  (70%)
    Async hooks used                   10/10 (100%)
    Structured logging in routes/hooks 2/10  (20%)

  With context
    onClose hook for cleanup           6/10  (60%)   ← got worse
    Async hooks used                   7/10  (70%)   ← got worse
    Structured logging in routes/hooks 0/10  (0%)    ← got worse`

Two checks the agent handled fine without the skill actually got worse with it. Claude Code diagnosed the cause: hooks.md contained a callback-style AVOID example that was confusing the agent's async hook implementation. And database.md had no example of structured logging in route handlers, leaving a gap the baseline agent was partially filling on its own.

The gaps: TypeBox schema scenario

  `Shared schema with $id and $ref              0/8  (0%)   → same score both runs
  additionalProperties: false on input schemas 0/8  (0%)   → skill not teaching this
  @fastify/error used                          0/10  (0%)  → not mentioned in skill`

So it turns out that these weren't regressions, but rather that the skill just wasn't covering them at all.

Here is the summary of fixes that Claude automatically went on to make:

Step 2f: Re-run and verify (Stage 6)

Claude then reran the tests to show the improvement after the fixes to the skill was made:

The regression is gone. The TypeBox scenario jumped from 82% to 92%. The file upload scenario went from 85% to 94%. Overall average moved from 89% to 94%.

One stubborn gap remains: Structured logging in routes/hooks is still scoring 0/10 even after the fixes. That's for the next iteration.

Step 2g Does Your Skill Work Across Models?

I mentioned earlier that you can validate across Haiku, Sonnet, and Opus. The compare-skill-model-performance skill now makes this a structured workflow rather than something you'd stitch together manually. You run your scenarios against all three models and get a side-by-side comparison.

`tessl eval run . --agent=claude:claude-haiku-4-5
tessl eval run . --agent=claude:claude-sonnet-4-6
tessl eval run . --agent=claude:claude-opus-4-6`

But the more useful output is the failure pattern classification.

There are four patterns to watch for:

Universal failure — all three models fail the same criterion. This is a tile gap: the instruction is missing, ambiguous, or conflicting across your files. - Capability gradient — Haiku fails, but Sonnet and Opus pass. Your instructions are present, but they're too implicit for a smaller model to follow reliably. The fix is more explicit phrasing, not more content. - Model anomaly — a single model fails while the others pass. Likely eval variance. Worth noting, but not worth over-engineering a fix. - Regression — with-skill scores drop below baseline on one or more models. The skill is actively hurting performance, regardless of which model it affects.

The capability gradient pattern is the one that changes how I think about writing skill instructions. If you're publishing to the registry, you don't control which model your users run. Instructions that only work because Opus can infer what you meant aren't robust — they're prompts that happen to work on a capable model. Writing more explicit instructions closes that gap across the whole model range.

Once all three models come in at ≥ 85% with no regressions, you have a clean signal to publish:If Haiku struggles on specific criteria, Claude Code will tell you, and the fix is usually simpler, more explicit phrasing rather than restructuring the whole skill.

Once all three models score well, it':

`tessl tile publish <path/to/tile>`

Summary: when to reach for each skill

You want to...	Use this skill
Run a full skill optimize end-to-end	optimize-skill-performance-and-instructions
Generate scenarios + first baseline run	setup-skill-performance
You have eval results, want to fix and re-run	optimize-skill-performance
Quickly audit SKILL.md quality (no evals)	optimize-skill-instructions
compare-skill-model-performance	compare-skill-model-performance

The fastify-best-practices skill scored a perfect 100% on static review, well-structured description, good trigger terms, clean layout. And it still had a regression in production.

That's the gap skill-optimizer closes. Static review tells you the instructions are well-formed. Evals tell you whether the agent actually follows them. For the production config scenario, the skill took the agent from 41% to 100%, things like env-schema, close-with-grace, and @fastify/under-pressure that the agent simply doesn't reach for without explicit guidance. That gap is impossible to identify without measurement.

For anyone publishing skills to the Tessl registry, running this before you publish is the difference between shipping something that works and shipping something you hope works.

Open-Source Agents vs Sonnet 4.6: GLM 5.2, MiniMax M3, Kimi 2.7 and Qwen 3.7 Tested

Tessl — Fri, 19 Jun 2026 07:59:52 +0000

A year ago, the choice between an open-source coding model and a frontier model from a major lab was not really a choice. You used the frontier model and paid for it. The open models were cheaper, and you could feel why.

That gap has closed. We ran four open-source models, GLM 5.2, MiniMax M3, Kimi K2.7-code, and Qwen3.7-Plus, against Claude Sonnet 4.6 through the same evaluation: nearly 1,000 real coding scenarios, each solved twice, one with no help and one with an agent skill supplying the conventions for the task. The result is not a tidy story where the expensive model wins. One open model beats Sonnet on quality and cost at the same time. Another is the cheapest thing in the test by an order of magnitude and still cannot be trusted to follow a clear instruction. The practical question is this: if you are choosing a coding agent today, how close is open-source to the frontier, and where does it still fall apart?

The setup: same tasks, with and without the skill

Every model solved the same scenarios twice. The baseline run gave the model the task and nothing else. The skill run gave it the same task plus an agent skill, the packaged conventions and instructions for the tool in question. Comparing the two runs isolates one thing: how much the model improves when you hand it the right context.

We scored each run on two axes. Instruction-following measures whether the model did the task the way it was asked, using the right APIs, conventions, and constraints. Task-completion measures whether the work runs and produces the intended result. Overall score weights them four to three in favor of instruction-following, because a coding agent that completes the wrong thing confidently is worse than one that stalls. The tasks and skills used are publicly available, in the task-evals-for-skills dataset, so you can inspect any scenario yourself.

Cost is the average dollars per task, recomputed from each scenario's measured token counts at real list prices. The four open models run on Fireworks at their published Standard rates. Sonnet 4.6 is priced at Anthropic's list. We report solve-only cost, which excludes the grading step, the same convention as the rest of the series.

One number to keep in mind: across every model, the skill adds about 20 points to the Overall score, and almost all of that gain is in instruction-following. The models could already complete most tasks. What they lacked was the conventions, and that is exactly what a skill carries.

How five coding agents score on accuracy

Here is the full scoreboard on each model's paired scenarios, baseline then with the skill.

	GLM 5.2	MiniMax M3	Sonnet 4.6	Kimi K2.7-code	Qwen3.7-Plus
Overall score	91.9	91.4	90.8	88.7	82.2
Overall score (baseline, no skill)	71.7	70.5	66.4	69.2	62.7
Overall lift from the skill	+20.2	+20.9	+24.4	+19.5	+19.5
Instruction-following	87.4	87.2	86.1	82.5	77.2
Instruction-following (baseline)	56.2	55.4	49.1	52.8	45.7
Task-completion	97.8	97.0	97.1	96.9	88.9
Turns to complete	18.5	22.7	17.7	27.5	16.5
Output tokens per task	8,813	8,952	6,841	21,787	12,296
List price (input / output, per MTok)	$1.40 / $4.40	$0.30 / $1.20	$3 / $15	$0.95 / $4.00	$0.40 / $1.60
Cost per task	$0.289	$0.207	$0.296	$0.661	$0.068
Points per dollar	318	442	307	134	1,204

Two facts jump out before any analysis. The top of the table is a near-tie on quality: four points separate first from fourth. And the cost column spans a factor of ten. The decision, in other words, is no longer about who can do the work. It is about what you are willing to pay for the last point of accuracy, and which model you can actually trust to follow instructions.

Line the five models up by cost and by quality and three of them earn their price: Qwen at the cheap end, MiniMax in the middle, and GLM 5.2 at the top. Nothing in the test beats any of these three on both cost and quality at once. Sonnet 4.6 is not one of them. GLM 5.2 scores as high and costs slightly less per task, so on this test there is no reason to reach for Sonnet over it. Kimi is the most expensive model in the test and only the fourth most accurate.

The model that ties Sonnet

The headline of this series promised an open model that ties Sonnet. The data is stronger than that. GLM 5.2 finishes at 91.9 Overall against Sonnet's 90.8, and it does so at $0.289 per task against Sonnet's $0.296. When directly comparing the scenarios that all five models ran, GLM 5.2 reaches 93.5 and Sonnet 91.9. The open model is ahead on quality and on cost.

There is a nuance worth stating precisely, because it cuts the other way and the comparison should be fair. On those tasks, Sonnet is the single best model on 54 percent of them, more than any other model. So Sonnet wins the typical scenario by a small margin. GLM 5.2 still comes out ahead on the average because it is more consistent: it has fewer catastrophic low scores dragging its mean down. If you care about the median task, Sonnet edges it. If you care about avoiding the bad day, GLM 5.2 wins. Both readings are true, and both point at a real tie at the top rather than a blowout.

MiniMax M3 lands in almost the same place as Sonnet on quality, 91.4 to 90.8, while costing about 30 percent less per task. It is the value pick at the top of the table.

The model that won't listen

Qwen3.7-Plus is the cautionary tale, and the interesting thing is how it fails. It is not simply a weaker model that scores lower everywhere. It is a model that will do the work and ignore your instructions while doing it.

Start with the obvious signal. Qwen has the lowest instruction-following score in the test, 77.2 with the skill against 82 or higher for everyone else, and the lowest baseline at 45.7. But the average understates the problem, because Qwen's scores are volatile. Sixteen percent of its scenarios still score under 50 on instruction-following even with the skill in hand, compared to 6 to 13 percent for the rest. The skill is right there and it gets ignored one time in six.

The clearest evidence is in task-completion. Every other model sits at 97. Qwen sits at 88.9, the only model whose ability to finish the job also sags. When we look at the scenarios where Qwen scores low on instruction-following, most are not cases of it giving up. In 116 of them Qwen completed the task to a high standard but followed the instructions poorly, against 87 where it failed both. That 116 is the whole thesis in one number. Handed the conventions for a tool, Qwen frequently builds something that works, in its own way, ignoring how it was asked to build it.

Adding the skill can even backfire. For most models the skill almost never hurts; 3 to 6 percent of scenarios regress. For Qwen, 14 percent regress, with some catastrophic single drops. A scenario that scored 100 at baseline fell to 4.6 with the skill. Two others fell from 88.6 to zero. The skill does not just fail to help Qwen on these tasks. It actively derails the model, which then spends 38 percent more turns and 28 percent more money to arrive at a worse answer. If you are running an agent loop unattended, that combination of cheap, confident, and non-compliant is the worst profile in the table.

Where every agent stumbles: web research and scraping

The most useful finding is not about any one model. It is the cluster where all five break the same way: web research and scraping. Group those skills together, Firecrawl, Tavily, Apify, Browser-use, Brave, Exa, and LangChain, and every model's instruction-following collapses relative to its own work elsewhere. GLM drops 20 points, Kimi 27, Qwen 15, MiniMax 13, and Sonnet 18. The hardest scenarios in the entire test, by mean score across all five models, are dominated by Firecrawl command-line tasks and a Cloudflare investigation-notes scenario that averages 18.9 out of 100.

It is also where models most often step outside their sandbox, reading files they were not given, scanning the filesystem for API keys, or hunting for the grading criteria instead of solving the task as set. These out-of-bounds flags hit 16 to 36 percent of cluster scenarios against single digits elsewhere, with Sonnet the worst at 36 percent. The pattern fits the task: scraping and search skills need API credentials, so the models go hunting for keys rather than working only from what the task provided.

The honest takeaway is that web research and scraping are simply hard for every model, open or closed, and Sonnet stumbles here exactly like the open ones. These tasks involve live network calls, long agentic loops, and grading checks that are easy to satisfy superficially. If you deploy any of these agents on scraping or research workloads, expect a 15 to 25 point drop from your clean-task instruction-following, and budget for the occasional run that costs an order of magnitude more than the median. And spending more does not help: output tokens and turn count both correlate slightly negatively with the Overall score, so the long, expensive runs are the ones thrashing toward a wrong answer, not doing careful extra work.

Which coding agent should you pick?

The skill is the great equalizer, so the first rule is to use one. It adds about 20 points to every model, and it adds the most, 24.4, to Sonnet, which starts mid-pack at baseline and only reaches the top tier once it has the conventions. Without the skill the ranking reshuffles entirely. The model you would pick depends almost entirely on whether you give it the right context, which is the whole premise of treating skills as first-class software.

With that settled, here is the opinionated guidance.

Choose GLM 5.2 if you want the highest accuracy and you are not paying frontier-lab prices to get it. It tops the table, it is the most consistent model in the test, and it costs less per task than Sonnet. For most teams comparing against a Claude or GPT default, this is the result that should change your spend.

Choose MiniMax M3 if you want Sonnet-level quality at the lowest cost among the strong models. It matches Sonnet within a point at about 30 percent less per task.

Choose Sonnet 4.6 if you are already in the Anthropic ecosystem and value the per-scenario edge on typical tasks. It wins the most head-to-head matchups, refused nothing in our run, and is the leanest model on output tokens. You are paying a small premium for that consistency-versus-peak tradeoff, and on this test an open model matches it.

Reach for Kimi K2.7-code on focused coding tasks where completion matters more than cost. Kimi finishes the job as reliably as the leaders (96.9 task-completion); its weaker spot is following instructions to the letter. Per token it is cheaper than GLM and Sonnet, so on short-output work it costs less than its $0.66 average suggests, but it tends to run long, which makes it better suited to high-value, lower-volume work than to large fleets.

Treat Qwen3.7-Plus as a specialist, not a generalist. At $0.068 per task it is cheaper than everything else by a wide margin. But it follows instructions worst and its quality is the most volatile. Use it where the task is forgiving and the savings dominate. Do not use it where doing the task the prescribed way actually matters.

The broader signal is the one the pricing pages miss. Open-source coding agents have caught the frontier on accuracy, and the gap that remains is not capability but reliability. The same skill carried every model up by about the same amount, which means the differentiator is no longer raw model quality. It is whether the model listens.

How I Scan My Agent Context Across GitHub with Skill Inventory

Tessl — Thu, 18 Jun 2026 07:01:27 +0000

Most teams do not know how many agent skills they have.

That matters because context engineering changes the problem. The work is no longer just about a prompt or a model call.

It is about the skill estate around it, and that is where duplication, overlap, and ownership drift start to matter.

Tessl’s latest Skill Inventory is meant to make that visible. It scans a GitHub org, maps the skills it finds, and turns a loose estate into something you can reason about.

Why should developers care about skill sprawl?

For developers, this shows up in a few concrete ways:

A change to one skill can affect a different repo you did not mean to touch.
A duplicate skill can keep drifting because nobody is sure which copy is canonical.
A linked eval can fan out into more scenarios than the author expected.
A loose first-party skill can keep living outside the place people actually look for it.

The problem is usually a combination of a few things:

Skills live in several repos, so there is no obvious single source of truth.
Variants get copied and changed slightly, so the same idea appears under different names.
Ownership becomes fuzzy, so nobody knows which copy should be updated first.
Linked skills can fan out into linked evals and scenarios, so one change can trigger more than expected.

That is not an argument for banning flexibility, but it is an argument for visibility. If a team wants to connect skills to other skills, that can be a legitimate pattern. The important part is understanding the shape of the estate before the next change turns into a chain reaction.

Jame Moss, Member of Technical Staff at Tessl, expands on this in his latest talk at AI DevCon London:

Watch on YouTube

What your Skill Inventory shows

Tessl ‘s skill Inventory is designed to answer the question, "What do we actually have?"

It gives you a map of the skills in your org and lets you slice that map in a few different ways:

By skill, so you can see the skill itself and where it appears.
By repo, so you can see where a skill lives in your codebase.
By finding, so you can focus on the overlaps, unmanaged copies, and other issues that deserve attention.
By scan history, so you can see what changed between runs.

Teams often know they have a lot of skills, but not how many, not which ones are duplicated, and not which ones are effectively drifting out of standard.

Skill Inventory gives you a useful shortcut: if a skill already exists publicly, you do not need to treat it like an unknown object.

The three views in the webUI reflect that workflow:

1. Estate is the broad map, showing evaluations, uses, findings, and security assessment

2. Triage is the findings view, where you dig into overlaps, variant groups, and other issues.

3. Scan is the history view, so the inventory becomes a living report rather than a one-time audit.

On the triage page, grouped variants make the overlap story easier to read. Instead of hunting through a list of near-duplicates, you can open the group and inspect the detail in one place.

If you want to try it yourself, scan a different GitHub org from the one we used yesterday. There is still one known edge case where already-scanned items can get skipped, but that should not get in the way for a new user.

How I use it in practice

The flow is intentionally simple.

`tessl login
tessl inventory import`

I do not want the first pass to require a long setup or a new mental model. I want to run a scan, get the inventory, and make the estate visible before I spend time deciding what to change. I actually copy paste the above directly in my agent, and find the process even smoother there!

Skill Inventory runs entirely from your machine and only uses the GitHub access your account already has. That makes it a good starting point when you want a low-friction read on the state of your skills without first copying repos or building a separate index.

Once the scan is done, the value comes from interpretation:

Which skills are duplicated, and which one looks like the better canonical version?
Which skills are used in enough places that they should probably be published and governed more deliberately?
Which files look like orphaned skill.md documents that are drifting around without any clear home?
Which skills are linked to other skills in ways that create accidental fan-out?

Those are not abstract governance questions. They are the questions that determine whether the next change is easy or expensive.

The takeaway

Skills are multiplying across your repos. Most teams have no idea how many they have, who owns them, or how many are near-duplicates of each other. Skill Inventory gives you a map of every skill in your org - what exists, where it's used, and where you're duplicating effort.

Once you can see the estate, the next decisions become easier: which skills to keep, which ones to merge, which ones to publish, and which ones need a harder look before they keep growing.

You can try it today for free: ask your agent to download the Tessl CLI, login and run tessl inventory import.

Securing the Coder, Not the Code: Notes on Agentic Development and Security

Tessl — Wed, 17 Jun 2026 08:02:49 +0000

A few years ago I left Snyk day-to-day to start Tessl, because I'd fallen in love with AI and was convinced that the way we build software was about to change in a way that broke most of our security assumptions. I still believe that. The talk I gave recently at Snyk’s security conference was an attempt to make the case concretely, and this post is the written version of that talk for anyone who wasn't in the room.

tl;dr - as agents create and delete code at unprecedented speed, the job of us, humans, is not to secure the code, but to get agents to secure it as they build. This is a material shift, requiring new tools, approaches and metrics.

Here's what I think it means, and what we should do about it.

From AI-augmented to AI-native

The first wave of AI in development was augmentation, pioneered by Copilot and then Cursor, where you wrote code and AI helped you write it faster. The second wave is delegation, where you ask an agent to do a task and it goes off and tries to do it, which means the agent is the developer and you become the reviewer, the prompter, and the auditor of intent.

This isn't a controversial statement anymore, agentic development is where everything has consolidated, and while I'll focus on software because that's the canary in the coalmine, every form of knowledge work is heading the same way.

The productivity gains are there, but agentic development changes the unit of work, the determinism of the output, and the rate of change of the development process all at once, and each of those shifts breaks a different assumption we built our security practices around.

It's non-deterministic. Compile once, compile again, same result is no longer how things work, because the same prompt produces different output and we have to get statistical about it.
The unit of software is changing. What we secure used to be the implementation, and now it's increasingly the instructions: skills, prompts, context, which represents a new unit, a new attack surface, and a need for new tooling.
It moves faster than ever. Development cycles are compressing, security has to keep up, and the only way to keep up with agents is for security itself to become agentic.

Challenge 1: non-determinism means you have to measure

If you came up through DevOps, you know the ethos: if it moves, measure it, and if it doesn't move, measure it in case it moves. Servers were the most statistical creature we'd ever shipped to production, and the answer was always the same, which is that you can't optimise what you can't measure.

Agents are a different category of statistical altogether, because the non-determinism isn't just at the request layer, it compounds across model output, tool selection, retrieval, retries, and multi-step planning. This means a measurement approach that worked for servers won't catch any of it.

The way you do this in the AI world is evals, where you define a task, define what good looks like up front, and run the agent against that task many times, scoring the runs.

I ran this on ElevenLabs as an example, given that ElevenLabs is a brilliant London-based text-to-speech lab that recently launched a music generation API. I gave the agent a task to build a dynamic soundtrack generator for a game studio, scored each run, and ran it ten times across five scenarios.

The results were noisy across the board, with absolute scores coming in low because the music API is new and underrepresented in the model weights, and the variance was the bigger story: the same task, the same prompt, ten runs, materially different outcomes each time.

The most common answer today is context, and the most common unit of context is skills, which are Markdown files (with a bit of structure) that give the agent the knowledge it needs to do a task well.

Knowledge is not the same as intelligence, and a model can be highly capable in the abstract while still failing a task because it doesn't know the specific API surface, the internal convention, or the deprecated import path it needs to avoid. From the outside that failure looks identical to a model that simply isn't smart enough.

We took an ElevenLabs music skill that explained the API, ran the same evals, and the agent went from a 50% average without the skill to 98% with it, with the variance compressing and the task actually getting done.

CodeGuard, and why more context isn't better context

Same idea, but applied to security. CodeGuard is a project Cisco built and donated, which packages OWASP security rules into a skill that helps agents write more secure code, so I created six evaluation scenarios focused specifically on authorisation and scored the agent's output with and without it.

Without CodeGuard, the agent scored 48% on the authorisation scorecard, and with CodeGuard it improved by nearly 1.78x, which is a meaningful lift, but the second experiment was the more interesting one.

When I stripped CodeGuard down to just the authentication and authorisation content, roughly 5% of the original skill, and re-ran the same evals, the score jumped to 98%. This means less context, scoped tightly to the task, beat more context by a wide margin.

More context is not necessarily better context, because if I sat you down and told you 100 things, no matter how brilliant they were, you'd give less attention to each one than if I told you three. Attention is a scarce resource for humans and models alike, which means choosing what to say and what to leave out is part of the craft.

The same pattern shows up when you vary the agent rather than the skill, where identical instructions run through Opus, Sonnet, Codex, and Cursor produce materially different scores. Context isn't just a property of the skill, it's a property of the skill-agent pair, and your context needs to be tuned for the agents you're actually using.

The Context Development Lifecycle

When you start treating skills as something you build, evaluate, optimise, distribute, and observe in production, you have a lifecycle, and we've been calling this the Context Development Lifecycle (CDLC), which I think sits alongside the SDLC rather than inside it.

The CDLC is where humans live, building the context that guides the agents, which is then applied across the SDLC where the agents do the work.

The observe step matters, because evals are like tests in that they're useful but they go out of sync with reality if you don't also watch what's actually happening in production. If you want a loop that closes: build, evaluate, distribute, observe, learn, improve.

The same skill, the same instruction, can be applied end to end, the same way a great dev uses the same knowledge to spec a feature, write the code, review it, ship it, and troubleshoot the incident. With skills representing that knowledge, this means that from a security lens the same skill can secure the writing step, the audit step, and the incident response step.

Challenge 2: skills are a new unit of software

The more you live inside the CDLC, the more obvious the second challenge becomes which is that we're talking about skills as if they were documents when they aren't. Skills are stored as Markdown, edited in the same tools as a Confluence page, and reviewed like prose, but at runtime an agent executes them as instructions, which puts skills much closer to code than to documentation, and the security model has to follow that reality rather than the file extension.

That means they have all the failure modes software has, plus some new ones.

Malicious skills. Snyk and others have documented attackers seeding skills with instructions designed to make the agent do something it shouldn't. We've seen examples in our own registry of skills that look like standard blockchain API helpers but with one step that quietly downloads a password-protected zip, which is detectable if you're looking.
Vulnerable skills. A skill that asks the user to put API keys directly inside the prompt, or makes MCP calls with plain vanilla tokens, is insecure by design even if the author meant no harm.
Negligent skills. Not an industry term, but it should be, because these are skills that lack basic safety instructions like "check this into a private repo, and if you can't commit, fail, don't exfiltrate the work some other way,". We've all seen agents in reward-seeking mode, keen to please and willing to delete files, escape sandboxes, do whatever it takes to complete the task. Negligence skills are the ones that don't tell the agent where the guardrails are.
Supply chain. How people consume skills today is by downloading Markdown files from random GitHub repos and checking them into their own, often in seven different folders to support seven different agents, which is fine for now but is going to bite teams eventually.

Once you treat skills as a software artifact rather than a document, most of the framing problem solves itself, because versioning, dependency resolution, provenance, scanning, signing, and lifecycle management are problems the package ecosystem has been working through for two decades, and a lot of the answers port over with light adaptation.

What enterprise-grade skill governance looks like

Three elements, in roughly the order you need them.

Governance and security is about knowing what's happening: auditing who publishes and who installs, constraining the supply chain to centralised paths the same way you'd constrain npm, and scanning skills for malicious content before they hit the registry. Most teams I talk to haven't started on this yet, which is the bit that blocks rollout.

Standardisation and reuse is the next problem once skills are flowing, because duplication and drift become real issues when three people on the team have built a "review the code" skill and a fourth comes along. Teams need a way to compare, standardise, and choose.

Continuous optimisation is the holy grail, and it's where the CDLC closes the loop by observing what the agent did, whether it succeeded, whether the user had to correct it. Devs need to feed that signal back into their evals to evolve the skill and ship the new version, which is what the teams at the cutting edge are doing.

This is the area we've built Tessl to help with, as a platform for collaboratively developing skills, discovering and installing them with confidence, scanning them with Snyk before they go live, and observing how they perform once they're in use. This is why platform teams, DevX teams, and the newly emerging "AI enablement" teams use us to eliminate duplicates, drive usage of the good skills, and manage costs as agentic development scales.

Challenge 3: security must become agentic to keep up

The third challenge is the simplest to state and the hardest to act on, which is that agentic development moves faster than any prior development paradigm. Security has to move at the same speed, and the only way to do that is for security itself to become agentic.

I've lived through a version of this transition before, given that the move from waterfall to cloud took a set of manual security processes that worked fine on quarterly release cycles and made them actively dangerous in continuous deployment. The manual code review before every release was a reasonable practice in 1998 and a liability by 2015, which meant the teams that automated their scanning early built a durable advantage while the teams that didn't spent the next decade catching up under pressure.

The same inflection point is now happening with agents, where practices that are still tolerated in modern cloud security, like manual triage of vulnerabilities, manual dependency upgrades, and manual review of supply chain changes, are already automated by the leading teams. In the agent era they won't be tolerated at all, because the future is here, as Gibson said, but it's not evenly distributed.

A long list of things move from "nice to have" to "must have", including smarter prioritisation, automated upgrades, supply chain manipulation detection, and drift detection on skills, all of which are things agents can genuinely help with.

Security is now being squeezed from both sides, given that attackers are already operating agentically with automated reconnaissance, exploit generation, and lateral movement at speeds humans can't match. Businesses are operating agentically to ship faster, which means a security function that stays human-paced doesn't just slow things down, it becomes the asymmetric weak point in the system, and the math stops working.

If security does become agentic, though, we can finally fix the things we've spent fifteen years trying to get developers to do consistently, which is the part that excites me.

Come talk about this at DevCon

We're going much deeper on all of this at AI Native Dev Con (DevCon) London, June 1st and 2nd. It's the conference we put together for people actually building and shipping in the agentic era.

The line-up is focused on delivering real-world case studies from teams that have rolled out agents at scale, the platform engineers building the enablement layer, and the security folks figuring out how to keep up. I'll also be expanding on the CDLC, skills as software, and what good governance actually looks like.

If any of this resonated, or if you want to discuss about it in person, I'd love to see you there. All the details, the agenda, and registration are at tessl.io/devcon.

See you in London.

We ran Composer 2.5 and 2.5 Fast across 11 skills. Surprisingly, Fast won.

Tessl — Tue, 16 Jun 2026 06:39:06 +0000

Cursor just shipped Composer 2.5 and Composer 2.5 Fast. We benchmarked both across 11 engineering skills, 5 scenarios per skill, averaged across three independent LLM judges. The fast model scored higher, ran 32% quicker, and costs exactly the same. If you are reaching for Composer 2.5 over Composer 2.5 Fast, you are paying the same price for a slower, slightly worse model.

Here is the full picture.

TL;DR

Composer 2.5 Fast scores 92.7% with skill context. Composer 2.5 scores 92.1%. Fast wins.
Both are ahead of gpt-5.5, gpt-5.4, and the previous Composer 2.
The fast model completes scenarios in 59 seconds on average. The regular model takes 87 seconds.

Where They Land in the Benchmark

We ran 6 models across 11 skills, scoring each run with three independent judges and averaging the results. Here is where the full leaderboard sits:

Model	Avg baseline	Avg with-skill	Lift
opus-4-7	80.8%	93.4%	+12.6
composer-2.5-fast	79.6%	92.7%	+13.1
composer-2.5	79.0%	92.1%	+13.1
composer-2	74.2%	89.6%	+15.4
gpt-5.5	75.5%	89.4%	+13.9
gpt-5.4	74.1%	89.3%	+15.2
gpt-5.3	65.5%	83.9%	+18.4
gpt-5-codex	68.7%	78.7%	+10.0

Composer 2.5 Fast sits 1.3 points behind opus-4-7 and 3.3 points clear of everything else. That is a meaningful gap. The previous Composer 2 sits alongside gpt-5.4 and gpt-5.5 at roughly 89-90%. Cursor has moved its own model up a full competitive tier in a single release.

The Fast model seems better.

Normally a "fast" variant trades quality for speed. Composer 2.5 Fast does not do that. It scores 0.6 points higher than the regular model while running 28 seconds faster per scenario (59s vs 87s on average across 110 scored runs).

The per-skill breakdown shows where the differences accumulate:

Skill	2.5 with-skill	2.5-fast with-skill	Winner
documentation	97%	98%	fast
fastify	99%	94%	2.5
init	87%	86%	2.5
linting	98%	99%	fast
node-best-practices	95%	95%	tie
nodejs-core	98%	98%	tie
oauth	92%	89%	2.5
octocat	95%	96%	fast
skill-optimizer	98%	98%	tie
snipgrapher	93%	93%	tie
typescript	82%	76%	2.5

The regular model wins on fastify (+5), oauth (+3), and typescript (+6). The fast model wins on documentation, linting, and octocat. For most skills they are within noise. The overall average breaks toward fast because it avoids some of the deeper failures the regular model hits on documentation and linting under stricter judges.

The typescript result is worth flagging separately. Both models score lower with skill context than without it on typescript. The regular model drops from baseline to 82% with skill; the fast model drops further to 76%. Something about how these models interact with the typescript skill works against them. If typescript is central to your workflow, treat this as a yellow flag worth investigating.

The Cost Argument

Both Composer 2.5 variants are part of the Cursor subscription. The marginal cost of choosing one over the other is zero. There is no per-token bill that changes when you switch from the regular to the fast model.

This makes the benchmark result unusually clean: faster, cheaper (relatively), and better. The only case where you might prefer the regular model is if you are working heavily in fastify or oauth-heavy codebases where it holds a consistent 3-5 point lead. For everything else, the fast model is the better default.

Compare this to the OpenAI side of the leaderboard. gpt-5.5 and gpt-5.4 both land around 89%, behind both Composer 2.5 variants, and carry per-token API costs that accumulate with usage. The Cursor subscription gives you a stronger model at a fixed price, which changes the economics significantly if you are running agents at any kind of scale.

What Changed from Composer 2

The gap between Composer 2 and Composer 2.5 is larger than the leaderboard position suggests. The with-skill scores are 89.6% vs 92.1-92.7%, a 2.5-3 point jump. More importantly, the baseline scores tell a different story: Composer 2 sits at 74.2% without context, while Composer 2.5 sits at 79-80%. That 5-6 point baseline improvement means the new model is genuinely stronger at the task, not just better at following instructions when given them.

The lift numbers reinforce this. Composer 2 shows +15.4 points of lift from skill context. Both 2.5 variants show +13.1. A lower lift number means the model needs less scaffolding to perform well. Composer 2 was getting more out of the skill context because it needed it more. Composer 2.5 is a better baseline model that skills push even higher.

The One Caveat

These scores are averaged across three judges (Sonnet, GPT-5.5, Opus-4-7). The raw Sonnet-only scores for Composer 2.5 were 94% and 92%, which looked even better. After applying stricter judges, the numbers settled at 92.1% and 92.7%. That is the correct comparison to make against the other models in this benchmark, which went through the same three-judge process. A single-judge Sonnet score would have overstated the gap.

Why Your Gemini Bill Doesn't Match the Model Names

Tessl — Mon, 15 Jun 2026 05:24:38 +0000

Why Your Gemini Bill Doesn't Match the Model Names

tl;dr - Across roughly 3,300 paired skill-eval runs, Gemini 3.5 Flash cost $1.05 per task against Gemini 3.1 Pro's $0.66, for scores that were effectively identical: 88.6 versus 87.9.

The pricing is even stranger when you look at the actual task costs. Gemini 3.5 Flash and Gemini 4.5 Flash are separated by almost 8× in per-task cost, while Gemini 3.1 Pro comes in cheaper than both. The invoice does not appear to follow the naming hierarchy.

Where the numbers come from?

The benchmark ran every task twice, once with the relevant skill applied and once without, across four Gemini models in OpenHands, totaling roughly 800 tasks per model. Rather than relying on dashboard estimates, we pulled per-call token counts directly from agent session logs and computed costs using Google's published per-token prices. We then compared the resulting per-task costs across models.

The headline data

Model	$/task (w/ skill)	Score	Pts per $	Input tokens	Turns	List $/Mtok
3.1 Flash Lite	$0.035	70.2	2,006	0.31M	17	$0.25
3 Flash Preview	$0.135	85.4	633	0.63M	24	$0.50
3.1 Pro Preview	$0.66	87.9	132	0.65M	26	$2.00
3.5 Flash	$1.05	88.6	85	1.41M	39	$1.50

A few things stand out from this data.

Cost order and name order are uncorrelated. Gemini 3.1 Pro is cheaper per task than Gemini 3.5 Flash despite carrying a higher per-token list price, while Gemini 4.5 Flash and Gemini 4.5 Flash-Lite, which sit in the same product family, differ dramatically in actual spend. Model names describe intended positioning, but they are a poor guide to real-world agent costs.
Scores do improve with each model generation, which is a genuine positive trend and a good reason to track releases, but capability gains do not automatically translate to cost reductions.
Finally, the practical value pick is Gemini 3 Flash Preview, which lands within three points of the leading models at roughly one-fifth the per-task cost, making it the most efficient option for workloads where a score in the 85 range is acceptable.

Why volume beats unit price

The cost of an agentic task is the product of two variables:

`Task cost = price-per-token × tokens the model decides to spend`

Model names establish the first variable. The second is determined at runtime by the model's behavior on the specific task, and it only becomes visible after you read your session logs.

For Gemini 3.5 Flash, the per-task cost breaks down as follows:

Non-cached input: $0.72
Cache-read input: $0.14
Output (including thinking): $0.19

The dominant driver is input volume. Gemini 3.5 Flash sent 1.41 million tokens of context across 39 agent turns per task. Pro sent roughly half that volume across 26 turns, and even at its higher list price of $2.00 per million tokens, its lower volume resolves to a lower total bill.

A model with a cheaper per-token rate that takes more turns to reach an answer will erode its own discount. It is also worth noting that 63-75% of input across these runs was cache-read, which means the effective sensitivity to turn count is even higher than raw list prices suggest: the multiplier is accumulating in your session logs, not on your pricing page.

Skills move cost by tier

Adding a relevant skill to each run changed per-task cost in opposite directions depending on which model ran it:

Pro saw cost drop $0.20 per task (-23%) while the score gained 20 points. The model used fewer turns and less exploratory backtracking, which suggests it was able to act on the structured guidance directly rather than discovering the solution path through iteration.
3.5 Flash was essentially flat, with cost shifting by less than $0.03 in either direction.
3 Flash Preview and Flash Lite each spent slightly more tokens for marginal score gains (+$0.03 and +$0.01 respectively).

The underlying pattern is consistent: a skill compresses the solution path for a model capable of following structured guidance precisely, reducing turn count and therefore total cost. For a model still resolving ambiguity through exploration, the same skill adds context to process rather than a shortcut to apply, and the cost holds steady or rises marginally. A skill is a shortcut for a capable model and overhead for a weaker one.

In practical terms, this produces two clear operating points. Pro with a relevant skill at $0.66 per task is the most cost-efficient route to top-tier performance. Gemini 3 Flash Preview with a skill at $0.135 per task delivers roughly five times the score-per-dollar of either leader, for a score three points lower, which is a reasonable trade for many workloads.

Measure, don't assume

Four takeaways from this data that apply beyond this specific benchmark:

1/ Do not budget from the rate card. Cost your workload based on measured tokens and turns on your specific tasks, with your specific prompts, in your specific agent harness. Per-token list prices are a useful first filter for ordering candidates, not a reliable predictor of relative spend.

2/ Read cost at the session layer. Aggregate dashboards can show $0 while spend accumulates in the background. Token usage needs to come from raw API responses or agent session logs to be trusted for budgeting purposes.

3/ Watch turn count first. The 39-versus-26 turn gap between 3.5 Flash and Pro is the primary cause of the price inversion observed here, and turn count is the variable most commonly absent from observability tooling. It is the multiplier on everything else in the cost equation.

4/ Re-measure when models update. Gemini 3.5 Flash is a newer release than Gemini 3 Flash Preview and scores higher, but it costs roughly eight times more in this agentic context. Capability improvements and cost improvements are independent variables, and any cost benchmark needs to be re-run with each version update rather than assumed to hold.

Caveats

These results come from a single agent harness (OpenHands), a single benchmark with explicit skill-relevance disclosure, and a specific sample window. Different tasks, prompt structures, and turn-length patterns will shift the absolute numbers and may shift the relative rankings. The finding to carry forward is not a specific model recommendation but a methodology: in agentic settings, cost rankings are not derivable from per-token rates alone, and the ranking that applies to your workload depends on that workload's specific behavioral profile.

A model name is a pricing tier, not a cost forecast. In agentic workflows, the deciding variable is how many tokens the model chooses to spend to reach an answer, a figure visible only after you run the work and read the logs. The rate card gives you one of the two inputs; only measurement gives you both.

Next: which skills actually earn their tokens? In these runs, 42% produced significant performance gains while 5% were net overhead. We’ll follow up on this analysis in the next post.