For years, I ran my WordPress sites on OpenLiteSpeed. Fast server, LSCache is genuinely impressive, and the OLS/WordPress combo is hard to beat on ...
Some comments have been hidden by the post's author - find out more
For further actions, you may consider blocking this person and/or reporting abuse
Out of curiosity, have you ever tried Agent Workers through Cloudflare? They have Kimi K2.7 on there and it's quite affordable too (I mean it's a 1T param behemoth, for $4 output per m tokens and 27c per input mil token). Would be interesting to see how it compares, given that it acts as a pretty decent competition, often beating all the other models, including pro SOTA models at coding?
Thanks for the tip — Kimi K2.7 wasn't on my radar when I ran the benchmark, and the pricing is indeed interesting for the scale. A 1T MoE at $0.27 input is hard to ignore.
Cloudflare Workers AI as a delivery layer is also an angle I hadn't considered — it adds a latency and infrastructure dimension on top of the model quality question.
I'm planning a follow-up with additional models. Kimi K2 goes on the list. If you've run it on non-trivial coding tasks (multi-file, security constraints, that kind of thing), I'd be curious what you observed.
I just got it on my cloudflare newsletter today, so havent given it a try yet. I was looking at it earlier though to estimate how it'd compare to Vertex AI and Claude pricing, considering it's the only model that can actually compete and beat the giants. If I end up running it, I'll let you know, but it would be very interesting to see how it compares, because it's proven to be way more capable at coding than just about anything else on the market, maybe it's effectiveness translates to efficiency when you consider it makes almost no mistakes.
That's exactly the hypothesis worth testing — if a model makes significantly fewer mistakes, the total cost per session drops even if the per-token price is higher. GLM 5.2 demonstrated this: it cost more per token than DeepSeek, but it self-corrected three bugs and delivered 37/37 tests passing. Net cost of human debugging time: zero.
If you run Kimi K2.7 on something non-trivial, I'd genuinely like to see the numbers. The benchmark protocol and prompts are in the repo — reproducible if you want a direct comparison.
I'll see when I get a gap to run it, currently swamped with a bunch of projects: whatsapp, google drive, anydesk, mcp server equivalents, but built on a custom data protocol and custom file format along with the SDK, editor/viewer for it. So alot... But when I get a chance, I'm very interested in giving Kimi a run, because my workflows generally are new infrastructure inventing, rather than boilerplate coding, which should really stress it's reasoning capability.
That's precisely the kind of workload that separates the models — new infrastructure, custom protocols, decisions that can't be pattern-matched from existing code. Boilerplate is easy to benchmark; reasoning under constraint is where it gets interesting.
When you get a gap, the benchmark protocol is in the repo. Would be valuable to see how Kimi holds up on that kind of work — the plan phase already looks promising.
What I found most interesting wasn't actually that GLM won.
It was how much the benchmark highlighted the difference between generating code and taking responsibility for the code. The part where GLM tested its own output, found bugs, fixed them, and verified assumptions feels much closer to how an experienced engineer works than simply producing files quickly.
The other takeaway for me was the economics. A lot of discussions around AI coding tools seem to assume that the most expensive or most popular option must be the best one. This benchmark is a good reminder that tool selection is itself an engineering decision.
Really enjoyed the blind-review approach too. Revealing the models only after the evaluation made the results much more interesting than the usual "my favorite model won" comparisons.
You put it better than I did in the article — "generating code vs taking responsibility for the code" is exactly the distinction. The self-testing behavior wasn't prompted explicitly; it emerged from the model's own judgment about what "done" means. That's the signal that matters.
The economics point is one I keep coming back to. Tool selection as an engineering decision implies you should be able to justify it the same way you'd justify any other architectural choice — with data, not brand recognition. That's what the benchmark was really trying to produce.
And yes, the blind review was the right call. Half the value of a benchmark is removing the confirmation bias before you start scoring.
Great to see you back here with another article, Pascal! 😄
And I totally agree. I'm finding more and more that a supposedly "weaker" model can sometimes do certain things simply better than the expensive or more popular one. In the end, though, there still needs to be a human in the loop to orchestrate everything and decide which tool fits which task best. 😊
Sylwia! Always good to have your eyes on an article — honestly one of the reasons I keep writing here. 😄
And yes, you nailed it. The surprising part wasn't that the cheap model could compete — it's where it competes. Task-dependent hierarchy, not a fixed ranking. The "weak" model that scores 15/25 on a full implementation writes a perfectly good commit message.
The human-in-the-loop point is the one I keep coming back to. Knowing which tool fits which task is itself a skill — and that judgment doesn't get automated away anytime soon.
The blind review is what makes this worth reading. Anonymizing the models before scoring kills the brand bias that wrecks most of these comparisons, so I trust the 25/25 more than I would a labeled chart. The thing I keep turning over is that GLM's edge came mostly from running its own code during the session and fixing the bugs it caught. That reads to me less like raw model smarts and more like the agent loop doing its job, which sits a bit oddly next to the "model matters, not the tool" takeaway. Either way, a $1.94 run beating a $25 one should push people to actually test instead of buying on reputation.
The tension you're pointing at is real and I'll admit I didn't fully resolve it in the article. The self-testing behavior happened inside OpenCode's agent loop — so is it GLM 5.2 that deserves the credit, or the tool that gave it the ability to execute code mid-session?
My read: the tool creates the conditions, but the model decides whether to use them. BigPickle ran in the same OpenCode environment and didn't self-test. Haiku 4.5 on OpenCode didn't either. GLM 5.2 chose to run its own output, interpret the results, and iterate. That judgment call is the model, not the loop.
But you're right that "model matters, not the tool" is too clean. The more accurate version is probably: the tool sets the ceiling, the model determines how close you get to it. On the planning phase with no execution environment, the tool is genuinely neutral. On the code phase, it isn't.
Worth a follow-up benchmark that controls for agent loop access more explicitly.
The finding that none of them asked questions before planning is quietly the most important result here. We optimize for output speed, not for thinking time.
That line stopped me too when I was writing it up. Every model produced a complete architecture, then asked the blocking questions — in the wrong order. It's the most consistent failure across eight combinations, regardless of price or model family.
"We optimize for output speed, not for thinking time" — that's the root cause. The training signal rewards producing something, not pausing to clarify. An architect who delivers blueprints before understanding the constraints isn't faster, they're just earlier to be wrong.
The uncomfortable follow-up question: how much of our own engineering culture has the same problem?
Exactly. I grabbed this exact thread and pulled — that's basically my whole series. Before AI, during AI, after AI? Same pattern, different packaging. We never learn, we just get faster at being wrong.
So true!
you are back! Woot :)
Thanks Benjamin!
Generate → run → detect → fix is the loop that separates production-ready from looks-fine-at-review, and your GLM 5.2 numbers put a cost figure on that distinction. The model is the worker; the architecture around the model decides whether the worker is allowed to ship. That shape shows up in my own work too. Most of the difference between systems that hold up in production and systems that demo well lives in the structure around the model, not in the model itself.
On finding #2 (none of the models blocked on uncertainty before laying out architecture), this is the place I think the post pushes hardest without naming it. A human architect's first move on an unclear brief is usually "what do you mean by X" before any diagram, because the cost of being wrong about the brief compounds through the rest of the work. None of the models did that. They all produced a plan first and asked second. That is a different failure mode than "cheap model worse than expensive model." It is "the agent shape itself does not prioritize disambiguating the question before answering it." Stop conditions and refusal conditions are still doing more work than most coding-agent post-mortems credit.
Honest stage marker on this side: my work runs adjacent (operator-side decision audit and verification engineering on dev.to). Your benchmark puts the self-verification axis under real production-shape constraints with cost numbers attached, and the cost figure is the part that changes how I would describe the trade-off in conversation. The external-review-by-another-model caveat you flag is the right one. It does not invalidate the benchmark, but it is the seam where the next iteration should plug a human reviewer for the high-severity claims (production-ready / not production-ready) where the disagreement most matters.
One concrete question: did GLM 5.2's self-verification loop catch anything the LLM reviewer would have missed, or was the agreement-rate high enough that the loop and the reviewer mostly converged? That distinction maps onto how much the self-verification is independent capability versus shared-bias amplification.