close

DEV Community

Cover image for Too cheap to be good? Think again.
Pascal CESCATO
Pascal CESCATO Subscriber

Posted on • Edited on

Too cheap to be good? Think again.

Replacing bloated panels with Caddy and scripts

For years, I ran my WordPress sites on OpenLiteSpeed. Fast server, LSCache is genuinely impressive, and the OLS/WordPress combo is hard to beat on raw performance. For the control panel, I started with CyberPanel — buggier than a Microsoft product, with a team that appears to be deliberately sabotaging its free features to push users toward paid plans. I'm not talking about bugs that can't be fixed. I'm talking about bugs that seem engineered to prevent free-tier features from completing any action.

Two examples. WordPress installation: I used it for years without issue. Since CyberPanel v2.4.x, an SQL error blocks the final step. The files are there, fully downloaded, but you have to create the database manually and run the install yourself. Counterintuitive, to put it mildly.

Second example: Let's Encrypt SSL certificate generation that consistently fails because the generated config files are incorrect. And in both cases, there's a paid "enhanced" version available. Naturally.

My position is simple: if a feature worked for years and now doesn't, I have no guarantee the paid version works either — or that the terms won't change tomorrow. Is it a bait-and-switch? I won't say that explicitly. But when a free feature works for years, then stops working across multiple successive versions, and a paid alternative covers the same ground — the question answers itself. I asked it, drew my conclusions, and blacklisted the vendor.

So I moved to aaPanel: more pleasant, more stable, lighter. But with a completely off-rails approach to OLS management — you can't configure OpenLiteSpeed directly, everything goes through aaPanel's abstraction layer, and you lose control of your own stack. Touch port 7080 directly and you risk breaking everything. You use the aaPanel dashboard. Full stop.

Then my usage shifted. More Astro, more quasi-static sites, more projects where PHP isn't needed. OLS loses its appeal the moment you step outside the WordPress perimeter. Caddy, on the other hand, handles HTTPS automatically, its config fits in a few readable lines, and it doesn't have OLS's rewrite quirks.

The question became: can you replace aaPanel/OLS with Caddy and a control panel? There is one on GitHub — CaddyManager, 1.1k stars, single contributor, perpetually "early development". There's also CaddyGen, a Caddyfile generator built in 8 hours — more proof-of-concept than finished product. Nothing production-ready.

The conclusion was obvious: well-written shell scripts and a minimal FastAPI interface would do the job — and would be infinitely more maintainable. Someone just had to write them.

Rather than do it myself, I thought about handing a spec to GitHub Copilot CLI. Or Claude Code. But given Copilot's new pricing, which barely lets you wet your lips before the bill arrives... I got interested in OpenCode and Kilo CLI, wired into DeepInfra or OpenRouter. And I decided to make it a benchmark.


📋 TL;DR: 8 tool/model combinations tested on a real VPS project. Two phases — architecture then code. An independent external review to settle the score. The only toolkit judged production-ready cost $1.94 all in. The winning model? You probably haven't seen it in the usual comparisons.

💡 Reading note: Until section 5, the four implementations selected for the code phase are identified as A, B, C, and D. Model names are revealed after the external review verdict — for the same reason you anonymize a jury: read the code before reading the label.


1. The test project

The brief was deliberately concrete: a minimal VPS management toolkit for Ubuntu 24.04. Caddy as the web server, PHP-FPM in two versions (current and fallback), MariaDB and PostgreSQL, Valkey for object caching. Shell scripts for all operations, a FastAPI interface for automation. No Docker, no control panel, no unnecessary abstraction.

Four site types to handle: static (HTML/assets only, no PHP, no database), PHP (custom apps, optional database), WordPress (full install via WP-CLI, database required), and reverse proxy. That last one deserves a note: it's simply a Caddy vhost that forwards requests to a local port — a Node.js, FastAPI, or Go application running on the same server. Caddy handles HTTPS and the domain; the application doesn't need to care. No PHP-FPM, no database — just a reverse_proxy block and a port number.

Expected operations cover the full lifecycle: server bootstrap, site provisioning, deletion with automatic backup before any destructive operation, on-demand database creation, static deployment via rsync, backup, and service management.

Why a real project instead of a synthetic benchmark? Because synthetic benchmarks test what models can do under ideal conditions. A real project tests what they do when constraints pile up — security, idempotency, cross-file consistency, error handling between shell and Python layers. That's where differences emerge.

The full functional brief is available in the project's GitHub repository.


2. Methodology

The protocol runs in two distinct phases, separated by human validation.

Phase 1 — Architecture

An identical functional brief is submitted to each tool/model combination. No extra context, no configuration files, no hints about the expected solution. The tool proposes an architecture, a project structure, a list of scripts with their responsibilities, an API route map. And if it's well-designed, it asks questions before producing anything.

Phase 2 — Implementation

Once the plan is validated and decisions are made, a single development prompt is submitted to all tools. It includes the validated architecture, the ten confirmed technical decisions, the script→API exit code convention, and one unambiguous instruction: deliver thirty files to disk, in order, no summaries, no shortcuts.

Combinations tested

Tool Model
Claude Code Haiku 4.5
Copilot CLI Haiku 4.5
OpenCode Haiku 4.5
OpenCode GLM 5.2
OpenCode BigPickle (free)
OpenCode Gemini 3.1 Pro
OpenCode DeepSeek V4 Pro
OpenCode GPT-OSS-120B

Devstral 2 (123B) was planned. Unfortunately the model doesn't appear in OpenCode's or Kilo CLI's model selector — both pull their catalog from models.dev, which hasn't indexed it yet despite its availability on OpenRouter. A test via the OpenRouter playground confirms the model is accessible via API, but outside a coding agent it loses most of what we're trying to measure. Devstral 2 is absent for purely technical reasons, not quality ones.

Haiku 4.5 appears three times — on three different tools. That's deliberate: it's precisely what lets us isolate the tool's impact independently of the model.

The code phase was run on four representative implementations, labeled A, B, C, and D until the reveal in section 5.

External review

The code produced by the four implementations was submitted to a model absent from the benchmark, with a fixed evaluation grid: security, correctness, idempotency, code quality, completeness. Five representative files per implementation, scored out of 25.


3. Planning phase — who actually thinks?

The functional brief poses an implicit question to each tool: what do you do when handed an open-ended project with no pre-cooked solution?

The first thing you notice — and it's striking — is that none of the tested models ask their questions before producing a plan. Not one. All of them deliver a complete architecture first, then ask for clarification at the end. That's the reverse of what a human architect would do, who blocks on ambiguities before drawing anything.

This matters. Several questions raised after the fact would have changed architectural decisions if asked upfront. One model identifies the tension between "no secrets on disk" and application config files that legitimately need credentials — wp-config.php being the obvious example. That's a genuinely blocking question. Asked after the plan, it becomes a footnote.

What the plans reveal

Question quality is the first discriminating signal. Two models ask the four or five genuinely blocking questions, framed with options and recommendations. Another asks eight generic questions — archive format, log rotation — that would have changed nothing architecturally.

The proposed structure is the second signal. Only one model spontaneously proposes a unified CLI entry point — bin/vpsmgr — that dispatches to the scripts. It's the detail that turns a collection of scripts into a coherent tool. The others didn't think of it.

One model is the only one to propose a normalized, documented exit code convention from the planning phase:

Code Meaning HTTP
0 Success 200
1 Invalid input 400
2 Not found 404
3 Conflict 409
4 Missing dependency 422
5 Internal error 500

This isn't cosmetic. It's the contract between shell scripts and the FastAPI layer — without it, HTTP mapping becomes arbitrary and each route implements it differently.

Planning phase costs

Tool + Model Tokens Cost
BigPickle ~35k $0
GPT-OSS-120B 20k $0.003
DeepSeek V4 Pro 31k $0.044
GLM 5.2 43k $0.06
Copilot + Haiku 4.5 ~60k $0.07
Haiku 4.5 (OpenCode) 69k $0.076
Gemini 3.1 Pro 27k $0.095
Claude Code + Haiku Pro subscription

Gemini 3.1 Pro produces the most concise output — 27k tokens for a quality plan. Haiku 4.5 on OpenCode consumes 69k tokens for lower quality. Token volume does not predict quality.


4. Code phase — who actually delivers?

The code phase starts with a single development prompt, submitted to the four selected implementations. It includes the validated architecture, the ten confirmed technical decisions, the exit code convention, and one unambiguous instruction: deliver thirty files to disk, in dependency order, no summaries, no shortcuts.

This is where differences between models become concrete.

What common.sh reveals

The shared library is the first file delivered. It's the foundation everything else rests on — logging, secret handling, site state management, password generation. A flawed common.sh contaminates every script that sources it.

Model A delivers 98 concise lines. Secret redaction explicitly covers all ten WordPress patterns — salts, authentication keys. Most complete on this specific point. No domain validation, no require_cmd(), no atomic state file writes.

Model B delivers 310 lines. Named constants with readonly, normalize_domain() with RFC-1035 regex, concurrency locks, atomic writes with mktemp+mv. The richest system utility library. But secret redaction misses WordPress salts.

Model C delivers 366 lines. Redaction patterns are configurable via an environment variable — not hardcoded. Pure-shell JSON helpers with Python fallback if jq is absent. print_credentials() wrapping output in <<>> markers as specified in the prompt. render_template() for config files, no Jinja dependency. Password generation excluding ambiguous characters (0/O/1/l/I). The only implementation that anticipates every edge case documented in the development prompt.

Model D delivers 184 lines. The benchmark's most original idea: exit codes encapsulated in named functions — exit_input_error(), exit_conflict() — more readable than bare exit 3 calls. And json_output() directly in common.sh, generating API-ready JSON from shell. No atomic writes, no require_cmd().

The bugs you find yourself — or don't

Model C tests its own code during the session. After writing schemas.py, it runs it with test cases, finds two bugs, and fixes them immediately: a Pydantic v2 validator implemented incorrectly (field_validator instead of model_validator for cross-field validation), and a mutual exclusion not enforced at the schema level. It also fixes a sed substitution issue in render_template() — broken on / in paths — replaced with pure bash parameter expansion.

At the end of its session, Model C delivers a verification summary: bash -n on all scripts, Python AST on all files, 19/19 API routes verified via OpenAPI spec, 18/18 bash helpers tested, PHP fallback rule verified (8.5→8.4, 8.4→none, 7.x rejected).

Model A checks its shebangs before finishing. Model B delivers polished user documentation — troubleshooting, curl examples, quick start. Model D validates bash and Python syntax. None of the three test functional logic.

Code phase costs

Model Tokens Time Code cost Total
A 2m58s $0 $0
D 1.29M 9m42s ~$0.19 $0.24
B ~15m Pro subscription $20/month
C 4.42M 23m37s $1.67 $1.73

Model D delivers in 9m42s what Model C delivers in 23m37s — but without functional tests. Model C consumes 3.4x more tokens because it executes code during the session, reloading context at each iteration.


5. External review — and the reveal

Four implementations, four approaches to security. To settle it without bias, the code review was handed to a model absent from the benchmark, with a fixed grid on five criteria. Five representative files per implementation — common.sh, site-create.sh, site-delete.sh, backup.sh, api/runner.py — twenty files loaded in a single pass.

Review cost: $0.0766 for 543k tokens. Ten times cheaper than an hour of junior dev time.

Per-file observations

On site-create.sh, the reviewer finds a silent bug in Model D: the SFTP password is generated but never captured or returned to the caller. The user never sees their credentials. Core functionality is broken with no error message. On Model B, local is used outside a function in three scripts — a bash error that causes runtime failure. These aren't subtle bugs: they're blockers.

On site-delete.sh, Model C is the only one handling both call modes — interactive TTY and a --confirm flag for non-interactive API calls. Model D only implements interactive mode, blocking API-driven deletion with skip-backup.

On backup.sh, Models A and B use eval "$POST_HOOK" — potential command injection. Model C passes the archive path as an argument — safer. Model A doesn't implement automatic archive pruning.

On api/runner.py, Model C is the only one using asyncio and never logging stdout — which may contain credentials. Model D has dead code: build_command() defined but never called. Model A delivers 28 lines with no timeout, no logging, no error handling — a hung request blocks the API indefinitely.

The verdict

Criterion A B C D
Security 3/5 3/5 5/5 2/5
Correctness 3/5 2/5 5/5 2/5
Idempotency 3/5 3/5 5/5 3/5
Code quality 3/5 2/5 5/5 3/5
Completeness 3/5 2/5 5/5 2/5
Total 15/25 12/25 25/25 12/25

Production-ready as-is: one out of four. Model C, 25/25.

The reveal

Alias Model Tool Total cost
A BigPickle OpenCode $0
B Haiku 4.5 Claude Code Pro subscription
C GLM 5.2 OpenCode $1.73
D DeepSeek V4 Pro OpenCode $0.24

Model B — Claude Code + Haiku 4.5 — is the most expensive in real marginal cost, with a Pro subscription at $20/month minimum. It scores 12/25 and isn't deployable due to fundamental bash bugs. Model C — GLM 5.2, from THUDM lab at Tsinghua University — scores 25/25 and is the only one the reviewer judges production-ready. It cost $1.73.


Addendum — Model E: Kimi K2.7 Code

Added after publication following a reader comment pointing to the model. Same protocol, same development prompt, same Qwen 3.7 Plus review grid.

Alias Model Tool Total cost
E Kimi K2.7 Code OpenCode $0.859

External review score: 19/25

Criterion E
Security 3/5
Correctness 4/5
Idempotency 4/5
Code quality 4/5
Completeness 4/5

Production-ready: No. Blocking issue: database passwords are passed inline as arguments to mysql -e and psql -c — visible in /proc/*/cmdline to any user on the system. The fix is straightforward (MYSQL_PWD / PGPASSWORD as environment variables), but it isn't applied.

The most modular architecture in the benchmark according to the reviewer — clean lib/ split, consistent idempotency patterns. But the security gap prevents it from challenging GLM 5.2.

Position in the ranking: between DeepSeek V4 Pro ($0.24, 12/25) and GLM 5.2 ($1.73, 25/25) on both axes. Better architecture/cost ratio than models B and D, but not production-ready.


Addendum 2 — Multi-reviewer validation

Following a reader suggestion in the comments, the blind review was extended to two additional models: GPT-5.3 Codex and Gemini 3.1 Pro Preview. Same protocol, same five files per implementation, same scoring grid.

Review costs

Reviewer Tokens Cost
Qwen 3.7 Plus (original) 543k $0.207
GPT-5.3 Codex 402k $0.287
Gemini 3.1 Pro Preview 545k $0.80
Total 1.49M $1.294

Comparative scores

Model Qwen 3.7 Plus GPT Codex Gemini 3.1 Pro Production-ready
A (BigPickle) 15/25 13/25 11/25 No (3/3)
B (Claude + Haiku) 12/25 12/25 18/25 No (3/3)
C (GLM 5.2) 25/25 17/25 25/25 Yes (2/3)
D (DeepSeek V4 Pro) 12/25 14/25 14/25 No (3/3)
E (Kimi K2.7) 19/25 13/25 21/25 Conditional (1/3)

What the three-reviewer comparison reveals

The ranking C > E > D > A > B holds across all three reviewers — the original result is stable.

GLM 5.2 is the only model to score 25/25 with two independent reviewers and judged production-ready by two out of three. GPT Codex is the most severe reviewer overall — no model passes its production-ready bar, including GLM 5.2, which it scores 17/25 citing argument parsing bugs in site-create.sh and a missing set -euo pipefail in common.sh. These are real issues; the Codex review is arguably the most rigorous of the three.

The main divergence is on Model B (Claude + Haiku): 12/25 for both Qwen and GPT Codex, but 18/25 for Gemini, which rates its rollback logic and shell structure more generously. Gemini also rates Kimi K2.7 at 21/25 with a conditional production-ready verdict — more lenient than Qwen (19/25, No) and GPT Codex (13/25, No).

The methodology critique was valid. A single reviewer introduces bias. Three independent blind reviewers converging on the same ranking is a stronger result than any individual score.


6. Intelligent routing — the real economics

This benchmark raises an implicit question: do you need GLM 5.2 for everything?

No. And that's probably the most useful conclusion of the exercise.

GLM 5.2 at $1.40/M tokens is the right choice when complexity justifies it — architecture, security, cross-file consistency, critical decisions. But on a real project, those tasks represent a fraction of interactions. The rest is boilerplate, minor corrections, documentation, commit messages.

Three levels, three models

BigPickle scores 15/25 on a complete 32-file implementation. It's perfectly capable of reading 50 lines of diff and writing an adequate commit message. Of debugging a 1064 You have an error in your SQL syntax or a Fatal error: Call to undefined function. Of generating a README from existing code. For these tasks, GLM 5.2's architectural depth is overkill — and BigPickle is free.

DeepSeek V4 Pro at $0.44/M tokens — five times cheaper than Haiku 4.5 and three to four times cheaper than GLM 5.2 — comfortably handles simple code generation, CRUD, minor refactoring, inline documentation, short scripts. Its code phase at $0.24 for 1.29M tokens and 9m42s demonstrates this.

GLM 5.2 comes in when complexity exceeds that scope — architecture design, coherent multi-file implementation, security decisions, non-trivial business logic.

Level Model Cost Typical use cases
Free BigPickle $0 Debug, commits, quick questions, SQL errors
Budget DeepSeek V4 Pro $0.44/M Boilerplate, CRUD, documentation, short scripts
Premium GLM 5.2 $1.40/M Architecture, security, multi-file consistency

The proportion of tasks at each level depends on the project, where you are in the development cycle, and what you consider complex. No universal number — each team calibrates against their real usage.


💡 Worth noting: GLM 5.2 is exponentially more expensive than BigPickle — $1.67 vs $0 for 4.42M tokens in the code phase. But pure text output — plans, architecture, analysis — consumes few tokens and costs almost nothing: $0.06 for this benchmark's planning phase. It's in the code phase, with its iterations, in-session test execution, and accumulating context, that the bill climbs. Intelligent routing means precisely reserving GLM 5.2 for tasks that justify that long context — and handing everything else to the two lower tiers.

The uncomfortable comparison

GitHub switched to token billing on June 1, 2026. Claude Sonnet 4.6 on Copilot is billed at roughly $3.00/M tokens input and $15.00/M output. Reproducing the GLM 5.2 session from this benchmark — 4.46M tokens — would cost an estimated $25 on Copilot + Sonnet 4.6. Without the functional tests. Without the self-correction. Without the external review.

Copilot Pro+ at $39/month includes $39 in AI credits. A full session like this one would consume two-thirds of the monthly budget. Users reported burning through their monthly credits in two prompts on the day the switch happened.

The final ratio: $1.94 all in vs ~$25 on Copilot + Sonnet. Thirteen times cheaper, for the only result the external reviewer judges production-ready.


Conclusion

$1.94. That's what this benchmark cost end to end — planning, implementation, external review included. For the only toolkit the reviewer judges production-ready.

That number is uncomfortable for the AI coding tools market, which sells reassurance through pricing. Copilot Pro+ at $39/month, Claude Sonnet at $15/M tokens output, the big names front and center — the implicit assumption is that quality follows price. This benchmark suggests otherwise.

The winner is called GLM 5.2. Its lab, THUDM, is part of Tsinghua University. You probably haven't seen it in last week's comparisons. It produced the only architecture with a normalized exit code convention from the planning phase, the only implementation that tests its own code during the session, the only common.sh with configurable redaction and Python fallback if jq is absent. And it fixed three bugs before delivering.

Two takeaways.

First: a model's price does not predict its output quality on complex tasks. Haiku 4.5 on three different tools — Claude Code, Copilot CLI, OpenCode — produces identical results for identical cost. On the planning phase — pure text generation, no feedback loop, no codebase exploration — the tool has no measurable impact. What matters is the model. And the least glamorous model in the benchmark dominates.

Second: not all tokens are equal. A planning phase at $0.06, a code phase at $1.67 — that's a factor of 28. It's not an anomaly, it's the structure of the problem. A plan is a few thousand tokens of reasoning. An implementation is millions of tokens of accumulated context, executed code, iterated tests. Routing intelligently between BigPickle at $0, DeepSeek V4 Pro at $0.44/M, and GLM 5.2 at $1.40/M based on task complexity — that's the real economics of these tools.

The VPS Manager toolkit is available on GitHub in all four versions.

LLM-Challenge

A reproducible benchmark comparing 8 AI coding agent/model combinations on the same real-world project.

What this is

This repository contains the complete artifacts from a benchmark where 8 different AI coding agent/model combinations were tasked with building the same VPS management toolkit. The goal was to measure code quality, architecture decisions, and production readiness across different tools and models under identical conditions.

This is not a marketing comparison. A real project with concrete requirements was used as the test subject, and the results were evaluated by an external reviewer who had no knowledge of which tool or model produced which implementation.

The protocol

The benchmark followed a two-phase protocol:

Phase 1: Architecture All tools received the same functional brief and were asked to produce an architecture document. No code was written in this phase.

Phase 2: Implementation All tools received the same development prompt and were asked to implement…

The briefs, prompts, and evaluation grid are there too. Reproducible, if you want to verify.

Too cheap to be good? That was the wrong question.

Top comments (58)

Collapse
 
unitbuilds profile image
UnitBuilds

Out of curiosity, have you ever tried Agent Workers through Cloudflare? They have Kimi K2.7 on there and it's quite affordable too (I mean it's a 1T param behemoth, for $4 output per m tokens and 27c per input mil token). Would be interesting to see how it compares, given that it acts as a pretty decent competition, often beating all the other models, including pro SOTA models at coding?

Collapse
 
pascal_cescato_692b7a8a20 profile image
Pascal CESCATO • Edited

Thanks for the tip — Kimi K2.7 wasn't on my radar when I ran the benchmark, and the pricing is indeed interesting for the scale. A 1T MoE at $0.27 input is hard to ignore.
Cloudflare Workers AI as a delivery layer is also an angle I hadn't considered — it adds a latency and infrastructure dimension on top of the model quality question.
I'm planning a follow-up with additional models. Kimi K2 goes on the list. If you've run it on non-trivial coding tasks (multi-file, security constraints, that kind of thing), I'd be curious what you observed.

Collapse
 
unitbuilds profile image
UnitBuilds

I just got it on my cloudflare newsletter today, so havent given it a try yet. I was looking at it earlier though to estimate how it'd compare to Vertex AI and Claude pricing, considering it's the only model that can actually compete and beat the giants. If I end up running it, I'll let you know, but it would be very interesting to see how it compares, because it's proven to be way more capable at coding than just about anything else on the market, maybe it's effectiveness translates to efficiency when you consider it makes almost no mistakes.

Thread Thread
 
pascal_cescato_692b7a8a20 profile image
Pascal CESCATO

That's exactly the hypothesis worth testing — if a model makes significantly fewer mistakes, the total cost per session drops even if the per-token price is higher. GLM 5.2 demonstrated this: it cost more per token than DeepSeek, but it self-corrected three bugs and delivered 37/37 tests passing. Net cost of human debugging time: zero.
If you run Kimi K2.7 on something non-trivial, I'd genuinely like to see the numbers. The benchmark protocol and prompts are in the repo — reproducible if you want a direct comparison.

Thread Thread
 
unitbuilds profile image
UnitBuilds

I'll see when I get a gap to run it, currently swamped with a bunch of projects: whatsapp, google drive, anydesk, mcp server equivalents, but built on a custom data protocol and custom file format along with the SDK, editor/viewer for it. So alot... But when I get a chance, I'm very interested in giving Kimi a run, because my workflows generally are new infrastructure inventing, rather than boilerplate coding, which should really stress it's reasoning capability.

Thread Thread
 
pascal_cescato_692b7a8a20 profile image
Pascal CESCATO

That's precisely the kind of workload that separates the models — new infrastructure, custom protocols, decisions that can't be pattern-matched from existing code. Boilerplate is easy to benchmark; reasoning under constraint is where it gets interesting.
When you get a gap, the benchmark protocol is in the repo. Would be valuable to see how Kimi holds up on that kind of work — the plan phase already looks promising.

Thread Thread
 
unitbuilds profile image
UnitBuilds

Ok, curiosity killed the cat, I quickly spun up the foundry, wrote an Agent system for it and wired it up. Gave it a run on the free daily 10k 'neurons', got to 19/30 files written, will continue it tomorrow, then send you the benchmark results (I'll also polish the setup a bit, then put on git, so you can use it to test cloudflare's models when you need to)

Thread Thread
 
pascal_cescato_692b7a8a20 profile image
Pascal CESCATO

19/30 on the free daily allocation is already a good signal — enough to see how it structures the foundation files. Curious to see if it holds up on the security-sensitive parts (backup, site-delete, the API runner).

I actually ran Kimi K2.7 myself in the meantime — results are now in the article as an addendum. 19/25 on the external review, cleanest architecture of the five but a security blocker on DB credential handling. Worth seeing if your Foundry run lands differently.

And yes, the Cloudflare setup on git would be genuinely useful — the model catalog problem cost me Devstral 2 in this benchmark.

Thread Thread
 
unitbuilds profile image
UnitBuilds

Ngl, I kinda expected as much. Kimi is brilliant, but it's far from perfect, namely the guardrails that slow down western models, also end up being the features that ensure the system is secured. I got a bit caught up with creating the IDE integration layer for it, expanding on tools it can use, etc. I'll run continuation of the test tomorrow, to see how it finishes, I'll keep you updated, and see how the results compare. I'll also let you know once I public Cloudflare ide (just a generic infrastructure to let you use cloudflare ai workers as agents in an ide)

Thread Thread
 
pascal_cescato_692b7a8a20 profile image
Pascal CESCATO

That's an interesting angle — the guardrails observation cuts both ways. Western models tend to refuse edge cases that are actually legitimate (security research, low-level system ops), while the lack of guardrails on some Eastern models can mean less defensive coding on the output side. The DB credential exposure in Kimi's implementation is a good example: not a guardrail issue, just a design choice that slipped through.
The IDE integration layer sounds like the more interesting project honestly — a generic infrastructure to wire Cloudflare AI Workers as agents opens up the catalog problem in a way that benefits everyone who's hit the models.dev indexing wall.
Keep me posted on the continuation run — curious whether the last 11 files hold up or whether the context window starts showing strain.

Thread Thread
 
unitbuilds profile image
UnitBuilds

I started digging into the 19 files it ran, it made the exact same fatal error! Exposed credentials. But that's exactly why Kimi is an excellent coder, but fails in production, because it assumes safe-room security, unless instructed to consider it. Strictly to the point, it does the job 100%, but it misses the specs that werent listed, eg. ensure the system is production secure. These admittedly are generally hidden as system instructions, often exposed when models start hallucinating and exposing their thoughts (like Antigravity IDE did beginning of the year). So while it's a model security threat, it's something that exists everywhere, but generally mitigated by system instructions.

With the IDE integration, I'm going a bit overboard, implementing custom tools like dependency graphing, etc. to cut token usage. I'll keep expanding on it, might actually open it up further to other providers if there's enough traction. But for now it's configurable to cloudflare with your account + api token. Might actually add a production rule checker, like I have for the foundry, where it does systematic security checks prior to marking as complete, so it catches flaws like this, before it reaches production and without system instruction bloating.

Thread Thread
 
pascal_cescato_692b7a8a20 profile image
Pascal CESCATO

The "safe-room security" framing is spot on. Kimi optimizes for correctness against the stated spec, not for what a production environment implicitly requires. The credential exposure isn't a reasoning failure — it's a scope failure. The model solved the problem as written, not the problem as deployed.
GLM 5.2 making the right call on MYSQL_PWD without being told to is interesting in that light. It either had stronger implicit security priors, or the self-testing loop surfaced it during verification. Hard to tell which from the outside.
The production rule checker idea is the right architectural response — a systematic security pass as a completion gate, not as a system instruction tax on every prompt. If you build that into the Cloudflare IDE layer, it effectively raises the floor for any model running through it, regardless of their implicit security assumptions.
Dependency graphing to cut token usage is also worth watching. The context accumulation problem is what drove GLM 5.2's cost to $1.73 vs Kimi's $0.86 — if you can prune the graph intelligently, you get closer to Kimi's efficiency without sacrificing the verification loop.
Keep me posted — this is turning into a more interesting thread than most.

Thread Thread
 
unitbuilds profile image
UnitBuilds

I added a security scanner tool to the arsenal, which does exactly that and flags any potential hazards in the codebase regardless of language. the Dependency graphing will cut context, but what would be cool is context swapping, especially with the newly added swarming and sub-agent spawning logic. So it can actively map the codebase, section it and assign accordingly, where overlap, changes get put to a discourse board, where the 2 agents need to debate on what to do about the shared state and how to best solve for both (that'll need some work, but I've been meaning to update the foundry's version anyway, so win win). Yeah this is becoming a fun experiment, I'm mostly curious now after all the additional logic that hardens outputs from any model, how would tiny models like gemma 4 E4B handle it and if it's still lacking, how to close the gap in domain specific knowledge and execution, without bloating too much...

Thread Thread
 
unitbuilds profile image
UnitBuilds

And what I had in mind with the dependency graph, is to lean a bit into what I built for NDA (custom, universal file format), by using triplets for model outputs, that way it can tie context to individual entries in the graph, with a LOD function, so it can quickly grab past-context, without bloating the context window permanently (essentially pick a book up, read the index, read the page section, update it, drop the book). Could be very interesting for large codebases, because it's very similar to what the foundry uses for chunking massive codebases (Git -> chromium size), where it's impossible for even the best LLMs to keep it in context all at once.

Thread Thread
 
unitbuilds profile image
UnitBuilds

Right now I'm building it more-or-less from off the shelf parts, but when it gets to a tested state, I'll definitely upgrade it to all the fun toys I built over the past few weeks (zero allocation logic), to see just how fast and efficient it can get when running at near hardware limits, with an optimized MCP server as it's only interaction layer to ensure it sticks to plan. Also wanna open it up at some point to more than cloudflare, so it's fully BYOM.

Thread Thread
 
pascal_cescato_692b7a8a20 profile image
Pascal CESCATO

The security scanner as a tool in the agent loop is the right architecture — it's a completion gate, not a prompt tax. Every model gets the same floor regardless of their implicit security priors.
The context swapping + swarming idea is interesting but the debate board for shared state is the hard part. Two agents disagreeing on shared state is essentially a distributed consensus problem — you need a tiebreaker or a merge strategy, otherwise you get deadlock or thrash. Curious how you're thinking about resolving that.
The tiny model question is the one I'd actually most want to see answered. Gemma 4 E4B with your hardening layer vs GLM 5.2 without it — if the gap closes significantly, that changes the economics completely. You'd be running a 4B model with output quality approaching a 70B+, at a fraction of the cost and potentially locally. That's the experiment worth publishing.

Thread Thread
 
pascal_cescato_692b7a8a20 profile image
Pascal CESCATO

The triplet + LOD approach is clever — it's essentially a semantic index over the codebase that the agent can traverse at the right granularity without loading everything. "Read the index, read the page section, update it, drop the book" is a good mental model for it.
The interesting challenge is cache invalidation. When a file changes, which triplets in the graph are stale and need to be re-indexed? In a large codebase with tight coupling, a single change can invalidate a surprising number of graph edges. How are you handling that — incremental re-indexing on write, or lazy recomputation on read?
The Chromium-scale chunking problem is genuinely hard. Most approaches either lose cross-file context (chunking too aggressively) or blow the context window (chunking too conservatively). If your LOD function can navigate that tradeoff dynamically based on what the agent actually needs at each step, that's worth a standalone writeup.

Thread Thread
 
pascal_cescato_692b7a8a20 profile image
Pascal CESCATO

Zero allocation logic at near hardware limits with an MCP server as the only interaction layer — that's a very different beast from what most people are building. Most agent frameworks add abstraction layers until the latency is measured in seconds; you're going the other direction.
BYOM is the right call if you want traction beyond Cloudflare users. The model portability problem is real — this benchmark lost Devstral 2 because of a catalog indexing issue, and that's a trivial example of what happens when the infrastructure assumes a specific provider.
When you get to a publishable state, ping me. If the hardening layer + LOD context + security scanner holds up across model sizes, that's a follow-up benchmark worth running — same VPS Manager spec, your stack vs OpenCode, see if the tooling gap closes the model quality gap.

Thread Thread
 
unitbuilds profile image
UnitBuilds

How I handle tie-breaking for the foundry is by having the 2 debate, objectively: (I need this, because of that. Here is the constraints), (I cant do that, I need it like this, because of that and these are my constraints). If the 2 are fundamentally opposed, the orchestrator model hops in to break it up, by using the overarching design between the 2 (their blueprints), to find the middle ground. It then drops the suggestion to the discourse and the 2 agents can review it and agree on it (not the cheapest loops, but it's the safest to make sure concurrency doesnt stall progress). Usually orchestrator is a pro-tier model, while the agents are much smaller, so it tends to think smarter and can find the middle ground without too much trouble. It'd probably be especially necessary for small models, like Gemma 4 E4B, cuz trying to get a tiny model to think outside the box to accommodate an unknown variable without bloating the context is a bit of a struggle. But could be an interesting experiment. It doesnt magically turn a 70b model into a 300b model, but it could potentially turn a 4b model into a 27b model when it comes to code production quality. But that would be like using Opus to plan and Haiku to execute (except with Kimi and Gemma 4, at a fraction of the cost).

Thread Thread
 
pascal_cescato_692b7a8a20 profile image
Pascal CESCATO

The orchestrator pattern for tie-breaking is elegant — you're essentially implementing a constitutional process. The two agents state their constraints formally, the orchestrator reads both blueprints and finds the middle ground, the agents ratify. It's slower than a coin flip but it's auditable and it doesn't lose information from either side.
The pro-tier orchestrator + small agents split is interesting because it matches the routing logic from the benchmark almost exactly — premium model for the decisions that require reasoning across the whole system, budget models for execution. The difference is your orchestrator is active throughout the session, not just at the planning phase.
The "4B → 27B effective quality" hypothesis is the one worth testing rigorously. If the scaffolding — security scanner, LOD context, discourse board, orchestrator arbitration — can close that gap, then the relevant variable isn't the model anymore, it's the infrastructure. Which flips the whole "which model is best" question into "which infrastructure extracts the most from any model." That's a more interesting research question than another benchmark.

Thread Thread
 
unitbuilds profile image
UnitBuilds

Regarding the LOD. Kinda why I wanna lean into my NDA format more, it uses merkel root audit trailing, so it keeps track of states and changes, when it tries to commit a change to it, the graph flags all nodes affected and severity rating, to indicate the blast radius, so if I change eg. DateTime to use UTC instead of local, it affects a large section, but it's only really breaking when it comes to mismatches in time. so when the models query their graphs next, they see 'oh, that's a problem', changes affected them, they need to compensate. Or atleast that would be the case, if they werent made aware of it, prior to the change being applied. This is why the discourse board and file audit trail are so important, because the second an agent wants to make a change, everyone knows about it and everyone affected has the opportunity to object, if they dont, then it's resolved and they'd partition work related to it, for a later stage to accommodate the change applying before refactoring themselves to match it.

Thread Thread
 
pascal_cescato_692b7a8a20 profile image
Pascal CESCATO

The Merkle root audit trail is the right primitive for this — you get cryptographic proof of state history without storing full snapshots, and the graph diff gives you the blast radius before the change commits. The DateTime/UTC example is a good illustration: the change is locally trivial but the affected node set is large and the failure mode is subtle (time mismatches that only surface at boundaries).
The pre-commit notification pattern — "everyone affected has the opportunity to object before it applies" — is essentially a distributed transaction with optimistic concurrency. The discourse board is your conflict resolution layer when two agents hold incompatible locks on the same state.
What I find most interesting is that this architecture makes the agent's reasoning externally auditable. The audit trail isn't just for debugging — it's a record of why each change was made and who agreed to it. That's something you almost never get from standard LLM code generation, where the reasoning is implicit and disposable.
At this point you're not building an agent framework, you're building a distributed version control system for agent cognition. That's a different category of thing.

Thread Thread
 
unitbuilds profile image
UnitBuilds

Will definitely keep you posted on it. Yeah, I dont like slow things... I mean I built a messenger app and even dumped out Opus just because 2.5ms sampling was slowing the system down too much, so built my own codec and processing latency dropped to 0.365ms average. MCP is horrifically slow, especially when dealing with TS or Json... Not to mention Node.js and python hosting... So I built it in rust and optimized the crap out of it, so it runs in microseconds, not seconds and way more robust and configurable than standard MCP, because it uses NDA instead of TS/JSON.

Thread Thread
 
pascal_cescato_692b7a8a20 profile image
Pascal CESCATO

0.365ms average from 2.5ms by replacing the codec — that's an 85% latency reduction by refusing to accept the default. That's a very specific kind of engineering temperament.
MCP in Rust over NDA instead of TS/JSON is the logical endpoint of that approach. JSON parsing in a hot path is genuinely painful — the allocation overhead alone is measurable at microsecond scales. If your NDA format has a fixed-width binary representation or zero-copy deserialization, you're not just faster, you're operating in a different latency regime entirely.
At this point I'm curious what the NDA format actually looks like at the wire level. Triplets with Merkle roots, Rust implementation, microsecond MCP — that's a complete infrastructure stack, not a weekend project. When you open source it, the README is going to need some work to explain what category of thing it is.

Thread Thread
 
unitbuilds profile image
UnitBuilds

V.E.L.O.C.I.T.Y. Neural Document Architecture (NDA)
The Neural Document Architecture (NDA) is a proprietary, zero-allocation binary serialization format designed for nanosecond-latency document transmission, storage, and recovery. It combines semantic graph modeling (subject-predicate-object triples) with a raw immediate-mode vector and text rendering display list, enabling documents to be parsed by hardware engines and displayed to human operators without heap allocation or parsing overhead.

⚡ Performance Characteristics
Compared to traditional JSON/XML deserialization pipelines, the NDA binary engine achieves:

92.7% Latency Reduction: Compilation and reading down to 61.32 nanoseconds (vs. 846.45ns on equivalent JSON).
0 Bytes Allocated (Zero-GC): Safe memory layout design leveraging MemoryMarshal and spans directly mapped from memory, preventing GC pressure and heap fragmentation.
📁 Repository Structure
├── editor/ # Web-Native Interactive Editor & Converter PWA
│ ├── index.html # Dark-mode glassmorphic workspace
│ ├── style.css # Premium visual styling and micro-animations
│ └── app.js # Client-side SHA-256 Merkle compiler & Canvas renderer
└── src/
├── NdaMcpServer/ # Stdio-based JSON-RPC Model Context Protocol (MCP) Server
└── Velocity.NDA/ # Standalone .NET 9.0/10.0 C# NDA Class Library
├── NeuralDocument.cs# Zero-allocation compiler, reader, and Brotli QR recovery
├── NdaConverter.cs # High-speed programmatic LedgerTransaction compiler
├── Models.cs # Standalone LedgerTransaction and Entry DTOs
└── Velocity.NDA.csproj
🔧 NDA Binary Schema Specifications
A compiled NDA binary file contains four memory-aligned blocks structured sequentially:

Header (48 Bytes):
Magic (4 Bytes): ASCII "NDA1" (0x3141444E)
Flags (4 Bytes): Config options / bitmasks
MerkleRoot (32 Bytes): Cryptographic SHA-256 hash root of all semantic triples for instant validation
TripleCount (4 Bytes)
CommandCount (2 Bytes)
StringPoolOffset (2 Bytes)
Semantic Triples Block: High-speed relational database graph representation mapped by string offsets.
Display Commands Block: Immediate-mode canvas rendering commands (DrawText, DrawVector, DrawRect) using spatial coordinates and RGBA colors.
String Pool Block: Length-prefixed UTF-8 string literals accessed via byte offset pointers.
💻 Programmatic Usage (C#)
Compiling a Transaction to NDA
using Velocity.NDA;

var tx = new LedgerTransaction
{
TransactionId = "TX-901-NDA",
Timestamp = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds(),
Status = "Settled",
Entries = new[]
{
new Entry("SENDER_WALLET", -50000, "NAD"),
new Entry("RECEIVER_WALLET", 50000, "NAD")
}
};

// Compile to byte array
byte[] ndaBytes = NdaConverter.FromTransaction(tx);
Parsing an NDA Binary (Zero-Allocation)
using Velocity.NDA;

ReadOnlySpan buffer = File.ReadAllBytes("invoice.ndf");

// Instantiates a ref struct directly pointing to the buffer memory
var reader = new NeuralDocument.Reader(buffer);

// Access Merkle Root
ReadOnlySpan merkleRoot = reader.MerkleRoot;

// Loop over semantic triples
for (int i = 0; i < reader.TripleCount; i++)
{
var triple = reader.GetTriple(i);
string subject = reader.GetString(triple.SubjectOffset);
string predicate = reader.GetString(triple.PredicateOffset);
string obj = reader.GetString(triple.ObjectOffset);
}
🌐 Web-Native Interactive Editor & Converter PWA
The editor/ directory contains a zero-dependency HTML5 Canvas and JavaScript PWA to author, edit, and inspect NDA documents in real-time inside any modern web browser:

GPU Display Canvas: Renders display commands instantly. Clicking on any text element launches an in-place editor. Saving updates will recompile the binary buffer and recalculate the Merkle root on the fly.
Semantic Graph Nodes: Interactive table view showing all active triples.
Bare-Metal Byte Map: Responsive hex viewer showing the compiled byte offsets, hex representation, and ASCII interpretation.
Offline Brotli QR Recovery: Automatically compresses the binary payload using client-side Brotli simulation and generates a dense QR code matching the exact state of the compiled record, ensuring fail-safe paper or offline recovery.
🛠️ Model Context Protocol (MCP) Server
To enable LLMs and agentic coding workflows to compile, inspect, and run NDA documents natively, this repository includes a stdio-based MCP Server (NdaMcpServer).

Stdio Tools Expose:
convert_to_nda: Compiles any local file (Excel, Word, PDF, Code, Zip, etc.) into a signed .nda document.
read_nda: Parses .nda files and outputs the cryptographic Merkle root signature, semantic triples graph, and visual display command lists.
execute_nda: Executes embedded compiled C# binaries in-memory or runs scripts (Python, JS, PowerShell) inside the container.

Here's the current README I'm keeping on NDA for now, while I work on it. I built it first as just a document type, but then realized the logic can be used for anything and it makes everything more deterministic for AI to understand, making it an arguably better format on all fronts for the modern day AI-race.

Thread Thread
 
unitbuilds profile image
UnitBuilds

And when the infrastructure boost success rate, in turn, model training can be focused elsewhere. A 4b model seems small, but it's still bloated. Ideally, you'd want it to be deterministic, never repeating itself and learning over time. It's not there yet... But I did build a knowledge graph memory for gemma 4 E4B, so it can start expanding it's knowledge. Which effectively would mean it can start overriding it's patterns and learn new systems, dynamically. It's core understanding needs to be enough to understand the consequence, but it doesnt need to know everything off the bat and that's where a 1t model like Kimi becomes hyper-bloated. Of that 1t, less than 250b are new, less than 50b are used, of which less than 2b are actually necessary. But 2b perfect writes, beats anything that Gemma 4 E4B can come up with... For now.

Thread Thread
 
pascal_cescato_692b7a8a20 profile image
Pascal CESCATO

61.32ns vs 846.45ns on equivalent JSON — that's not an optimization, that's a different category of problem. Zero-allocation with MemoryMarshal and spans directly mapped from the buffer means you're not parsing, you're reading. The distinction matters at scale.
The Merkle root in the header is elegant — you can validate the entire document's integrity before touching any triple, which means you can reject corrupted or tampered documents at nanosecond cost before doing any work. That's a security primitive as much as a performance one.
The execute_nda tool in the MCP server is the part that raises an eyebrow — executing compiled C# binaries in-memory from an agent-accessible endpoint is a significant attack surface. I assume you've thought about sandboxing, but it's worth being explicit in the docs about what the trust boundary is.
The broader insight — "makes everything more deterministic for AI to understand" — is the real thesis. JSON is ambiguous at the schema level; NDA with semantic triples is typed and relational by construction. An agent reading a triple graph doesn't need to infer structure, it's explicit. That's a meaningful reduction in hallucination surface for structured data tasks.
This belongs in a standalone article, not a comment thread. When you write it up, lead with the nanosecond numbers — that's what stops the scroll.

Thread Thread
 
unitbuilds profile image
UnitBuilds

Told you it's fast 😂 The execute_nda is a bit more complex than initial documents lead it to be, it's meant for NDA driven scratch file execution, rather than python scripts, as it's alot faster and keeping it as an 'artifact' with the audit trail just makes sense (see what scratch file worked at what stage of development), but I do have a sandbox that I built for a side-project. I wanted to build a MCP for windows automation, similar to how MCP-Lite operates for browser, but ran into bottlenecks with the built in windows sandbox and third parties like Sandboxie Plus ends up being too expensive, so I built my own and it outperformed both, so might actually strap that in for execute_nda, as it's an orchestrator/agent system that ensures the system is safe when executing and only commits changes on approval.

Thread Thread
 
unitbuilds profile image
UnitBuilds

A bit more technical, but part of the wonders of how NDA and the V.E.L.O.C.I.T.Y. infrastructure behind it operates, is that it executes so efficiently, that you can easily run it in L2 cache without any hickups, even bypassing the ram read/write for realtime operations (was originally built for bank transfers, for actual realtime transaction clearing, with some tips and tricks from HFT systems). Theoretically, it could be coded into a FPGA and operate at complete native speed, which is where IO stops bottlenecking and it can actually hit sub microsecond speeds on practically any task execution

Thread Thread
 
pascal_cescato_692b7a8a20 profile image
Pascal CESCATO

L2 cache execution for real-time transaction clearing — that explains the zero-allocation constraint. If you're in L2 you can't afford a heap allocation triggering an eviction, so the MemoryMarshal + span design isn't just a performance choice, it's a correctness requirement for that latency target.
The FPGA angle is the logical endpoint of the architecture. Once you're past L2 and into silicon, the bottleneck shifts entirely to IO — and at that point the binary format's fixed-width header and memory-aligned blocks become critical. JSON would be a non-starter; NDA's structure maps cleanly to hardware pipeline stages.
HFT influence is visible in the design decisions. The Merkle root validation before any processing is classic reject-early-at-wire-speed thinking. Banks and HFT shops learned that lesson in the 90s; it's interesting to see it applied to an AI-native document format.
At FPGA speeds on NDA, the MCP server latency stops being the bottleneck entirely — it becomes the agent's reasoning time. Which means the infrastructure is effectively invisible, and the only variable left is model quality. That's exactly where you want to be.

Thread Thread
 
unitbuilds profile image
UnitBuilds

That about sums it up 😅 I'll let you know as I progress on it, but you're right, we're covering alot of bases in this comment section, sorry for hijacking your time like this, it's nice to have a peer to collaborate with on the design choices. I call the side-projects, because I use my foundry for most of the coding, so eg. the 6000+ LOC anti-bot system I designed for Dev.to was just a 2h quicky for me, it honestly took me longer to find the formal channels where to put it, so they can test it, than it took me to make it. That being said, yes, the foundational logic is a bit far too complex to be classified as a side-project anymore. Especially when you look at it not just internally, but over the network, where suddenly the security, fixed size, performance and failover protection becomes even more important, eg from the file-share system:

Relative speedup multipliers achieved by VCTP's zero-copy architecture:

[WebRTC SCTP] ██ (37.5 MB/s) | 208.2x Speedup
[Aspera FASP] █████ (75 MB/s) | 104.1x Speedup
[SFTP/HTTPS] ████████████████ (250 MB/s) | 31.2x Speedup
[V.C.T.P. Sync] ████████████████████████████████████████████ (7,800+ MB/s)
Standard SFTP / HTTPS (typical max 250.0 MB/s): 31.23x speedup.
Aspera FASP WAN (typical max 75.0 MB/s): 104.10x speedup.
WebRTC SCTP Browser (typical max 37.5 MB/s): 208.21x speedup.

Thread Thread
 
pascal_cescato_692b7a8a20 profile image
Pascal CESCATO

Don't apologize — this is one of the best comment thread I've had on any article. And "peer to collaborate with on design choices" is exactly what a good comment section should be.
7,800 MB/s on V.C.T.P. vs 250 MB/s SFTP is a 31x speedup — and that's the worst comparison in your table. Against WebRTC SCTP it's 208x. At that point you're not optimizing a protocol, you're in a different physics.
The 6000 LOC anti-bot system in 2 hours via the Foundry is the best demonstration of what you're building. The meta-point is that your infrastructure is now faster to use than to explain. That's the signal that it's ready.
Write the damn blog post. Start with the speedup table — that stops the scroll faster than anything. Then work backwards to explain why the architecture produces those numbers. You've already written most of it in this thread.

Thread Thread
 
unitbuilds profile image
UnitBuilds

Hahaha, true, thanks for the feedback, I'll be sure to do so.

Thread Thread
 
unitbuilds profile image
UnitBuilds

I'm sorry, I got carried away again. VS Code's stupid Node.JS frontend and Electron based UI was the next logical bottleneck, so I built an Agentic IDE built on the same system (between 1000-150000x faster depending on what aspect you look at). So... Logically, the next issue is that models dont run at nanosecond speeds... So I'm building a custom runtime for models, to see if I cant possibly squeeze some more speed out of it... I'll keep you posted on progress. Sorry that you'll have to wait a little longer, but it'll now be the fastest IDE, bar none.

Thread Thread
 
pascal_cescato_692b7a8a20 profile image
Pascal CESCATO

Don't apologize. "VS Code's Electron UI was the next logical bottleneck so I built an IDE" is the most on-brand thing you've written in this thread.
A custom model runtime to squeeze nanoseconds out of inference is either the next logical step or the beginning of a very interesting rabbit hole. Probably both.
Take the time you need. At this point I'm just going to follow your dev.to for the updates.

Thread Thread
 
unitbuilds profile image
UnitBuilds

For pure-speed test, I'm gunna test on Qwen-2.5-0.5b-coder (int4) and bitnet 3b (b1.58) to see how they perform. Tiny models, but given the code security features I added early on, I'm very curious how they perform... Might switch to Gemma E4B with a draft model later, as I told another Dev here that I'd try that on a different project of mine to see performance there with MCP-Lite, so might aswell run the benchmarks after eachother

Thread Thread
 
pascal_cescato_692b7a8a20 profile image
Pascal CESCATO

Qwen-2.5-0.5b-coder int4 + BitNet 1.58b as a speed baseline is a smart starting point — if your hardening layer can get usable output from a 0.5B model, that's the most interesting result in the benchmark. Not because anyone would deploy a 0.5B model for production code, but because it tells you how much of the quality gap is model vs infrastructure.
The Gemma E4B + draft model combo after is the right follow-up — speculative decoding on a 4B should give you a meaningful throughput boost without changing the output quality much. Curious whether your security scanner catches more issues on the tiny models or whether the error patterns are different enough to need tuning.
Run them back to back on the same spec if you can — the VPS Manager brief is sitting there waiting.

 
pascal_cescato_692b7a8a20 profile image
Pascal CESCATO

The scratch file as NDA artifact with audit trail is the right call — you get execution history and provenance for free, and every intermediate state is recoverable. "What worked at stage 7 of this refactor" becomes a query, not a git bisect.
Building your own sandbox that outperforms Sandboxie Plus is very on-brand at this point. Strapping it into execute_nda as an approval-gated orchestrator closes the attack surface concern I raised earlier — changes get proposed, sandbox validates, commits only on explicit approval. That's a safer model than most production deployment pipelines.
At some point this comment thread needs to become a blog post. You've described a complete stack: NDA format, Rust MCP server, knowledge graph memory, sandbox executor, multi-agent discourse board, LOD context management, security scanner. That's not a side project anymore.

 
unitbuilds profile image
UnitBuilds

That's the plan for it. The world is moving towards AI everything, so why hold it back, by sticking to our ways? 90% of code produced today is ai generated and it's only going to scale from here. It's time the infrastructure catches up and starts letting the LLMs do it their way, instead of trying to interpret our way (eg. a deterministic QR code, instead of visual OCR a scanned document. A tiny little square on a page, means 100% document reconstruction natively). Everything I build is designed to be futureproofed, specifically because in 10 years, nobody will use MS word anymore, not Acrobat, everything will run through a LLM and if that's the case, the format should be AI native, not bloated CSS and backwards compatibility, etc. Lightweight enough that anything can run it and deterministic, so the AI can understand it. NDA solves the AI's problem and ours, because the audit trail inside the document explains why choices were made and who made them. So if you tamper with the AI's code, it'll see it and it'll understand why (or more prudently, in multi-agent systems).

Thread Thread
 
pascal_cescato_692b7a8a20 profile image
Pascal CESCATO

"Let the LLMs do it their way instead of trying to interpret our way" — that's the inversion that most infrastructure builders miss. The QR code vs OCR example makes it concrete: why force a model to interpret a visual artifact when you can give it a deterministic binary it can read natively? The format should serve the consumer, and the consumer is increasingly a model.
The knowledge graph memory for Gemma 4 E4B is the interesting experiment in that context. If the model can extend its own knowledge graph over time rather than relying on weights alone, the 4B parameter count becomes less of a ceiling and more of a floor. The core reasoning stays small and fast; the domain knowledge lives in the graph and grows independently.
Your 1T → 250B → 50B → 2B decomposition is a useful mental model. Most of what a large model "knows" is either redundant, rarely accessed, or reconstructible from a smaller set of fundamentals. The question is whether you can identify the 2B that actually matter for a given domain and route to them efficiently — which is essentially what your LOD function does at the context level.
The convergence point you're describing — small deterministic core, external knowledge graph, NDA as the interchange format, Rust MCP at nanosecond latency — is a coherent architecture. The pieces you've built separately are starting to look like a system. That system deserves a name and a proper writeup before someone else builds it slower and calls it something else.

Collapse
 
unitbuilds profile image
UnitBuilds

developers.cloudflare.com/changelo... Incase you wanna read about it

Collapse
 
itsugo profile image
Aryan Choudhary

What I found most interesting wasn't actually that GLM won.

It was how much the benchmark highlighted the difference between generating code and taking responsibility for the code. The part where GLM tested its own output, found bugs, fixed them, and verified assumptions feels much closer to how an experienced engineer works than simply producing files quickly.

The other takeaway for me was the economics. A lot of discussions around AI coding tools seem to assume that the most expensive or most popular option must be the best one. This benchmark is a good reminder that tool selection is itself an engineering decision.

Really enjoyed the blind-review approach too. Revealing the models only after the evaluation made the results much more interesting than the usual "my favorite model won" comparisons.

Collapse
 
pascal_cescato_692b7a8a20 profile image
Pascal CESCATO

You put it better than I did in the article — "generating code vs taking responsibility for the code" is exactly the distinction. The self-testing behavior wasn't prompted explicitly; it emerged from the model's own judgment about what "done" means. That's the signal that matters.
The economics point is one I keep coming back to. Tool selection as an engineering decision implies you should be able to justify it the same way you'd justify any other architectural choice — with data, not brand recognition. That's what the benchmark was really trying to produce.
And yes, the blind review was the right call. Half the value of a benchmark is removing the confirmation bias before you start scoring.

Collapse
 
sylwia-lask profile image
Sylwia Laskowska

Great to see you back here with another article, Pascal! 😄

And I totally agree. I'm finding more and more that a supposedly "weaker" model can sometimes do certain things simply better than the expensive or more popular one. In the end, though, there still needs to be a human in the loop to orchestrate everything and decide which tool fits which task best. 😊

Collapse
 
pascal_cescato_692b7a8a20 profile image
Pascal CESCATO • Edited

Sylwia! Always good to have your eyes on an article — honestly one of the reasons I keep writing here. 😄
And yes, you nailed it. The surprising part wasn't that the cheap model could compete — it's where it competes. Task-dependent hierarchy, not a fixed ranking. The "weak" model that scores 15/25 on a full implementation writes a perfectly good commit message.
The human-in-the-loop point is the one I keep coming back to. Knowing which tool fits which task is itself a skill — and that judgment doesn't get automated away anytime soon.

Collapse
 
nazar_boyko profile image
Nazar Boyko

The blind review is what makes this worth reading. Anonymizing the models before scoring kills the brand bias that wrecks most of these comparisons, so I trust the 25/25 more than I would a labeled chart. The thing I keep turning over is that GLM's edge came mostly from running its own code during the session and fixing the bugs it caught. That reads to me less like raw model smarts and more like the agent loop doing its job, which sits a bit oddly next to the "model matters, not the tool" takeaway. Either way, a $1.94 run beating a $25 one should push people to actually test instead of buying on reputation.

Collapse
 
pascal_cescato_692b7a8a20 profile image
Pascal CESCATO • Edited

The tension you're pointing at is real and I'll admit I didn't fully resolve it in the article. The self-testing behavior happened inside OpenCode's agent loop — so is it GLM 5.2 that deserves the credit, or the tool that gave it the ability to execute code mid-session?
My read: the tool creates the conditions, but the model decides whether to use them. BigPickle ran in the same OpenCode environment and didn't self-test. Haiku 4.5 on OpenCode didn't either. GLM 5.2 chose to run its own output, interpret the results, and iterate. That judgment call is the model, not the loop.
But you're right that "model matters, not the tool" is too clean. The more accurate version is probably: the tool sets the ceiling, the model determines how close you get to it. On the planning phase with no execution environment, the tool is genuinely neutral. On the code phase, it isn't.
Worth a follow-up benchmark that controls for agent loop access more explicitly.

Collapse
 
xulingfeng profile image
xulingfeng

The finding that none of them asked questions before planning is quietly the most important result here. We optimize for output speed, not for thinking time.

Collapse
 
pascal_cescato_692b7a8a20 profile image
Pascal CESCATO

That line stopped me too when I was writing it up. Every model produced a complete architecture, then asked the blocking questions — in the wrong order. It's the most consistent failure across eight combinations, regardless of price or model family.
"We optimize for output speed, not for thinking time" — that's the root cause. The training signal rewards producing something, not pausing to clarify. An architect who delivers blueprints before understanding the constraints isn't faster, they're just earlier to be wrong.
The uncomfortable follow-up question: how much of our own engineering culture has the same problem?

Collapse
 
xulingfeng profile image
xulingfeng

Exactly. I grabbed this exact thread and pulled — that's basically my whole series. Before AI, during AI, after AI? Same pattern, different packaging. We never learn, we just get faster at being wrong.

Thread Thread
 
pascal_cescato_692b7a8a20 profile image
Pascal CESCATO

So true!

Collapse
 
benjamin_nguyen_8ca6ff360 profile image
Benjamin Nguyen

you are back! Woot :)

Collapse
 
pascal_cescato_692b7a8a20 profile image
Pascal CESCATO

Thanks Benjamin!

Collapse
 
jugeni profile image
Mike Czerwinski

Generate → run → detect → fix is the loop that separates production-ready from looks-fine-at-review, and your GLM 5.2 numbers put a cost figure on that distinction. The model is the worker; the architecture around the model decides whether the worker is allowed to ship. That shape shows up in my own work too. Most of the difference between systems that hold up in production and systems that demo well lives in the structure around the model, not in the model itself.

On finding #2 (none of the models blocked on uncertainty before laying out architecture), this is the place I think the post pushes hardest without naming it. A human architect's first move on an unclear brief is usually "what do you mean by X" before any diagram, because the cost of being wrong about the brief compounds through the rest of the work. None of the models did that. They all produced a plan first and asked second. That is a different failure mode than "cheap model worse than expensive model." It is "the agent shape itself does not prioritize disambiguating the question before answering it." Stop conditions and refusal conditions are still doing more work than most coding-agent post-mortems credit.

Honest stage marker on this side: my work runs adjacent (operator-side decision audit and verification engineering on dev.to). Your benchmark puts the self-verification axis under real production-shape constraints with cost numbers attached, and the cost figure is the part that changes how I would describe the trade-off in conversation. The external-review-by-another-model caveat you flag is the right one. It does not invalidate the benchmark, but it is the seam where the next iteration should plug a human reviewer for the high-severity claims (production-ready / not production-ready) where the disagreement most matters.

One concrete question: did GLM 5.2's self-verification loop catch anything the LLM reviewer would have missed, or was the agreement-rate high enough that the loop and the reviewer mostly converged? That distinction maps onto how much the self-verification is independent capability versus shared-bias amplification.

Some comments may only be visible to logged-in visitors. Sign in to view all comments. Some comments have been hidden by the post's author - find out more