The half of software that didn't get cheaper

KYUNGHWAN KIM — Tue, 23 Jun 2026 23:37:51 +0000

Something flipped this year, and most of the conversation is still pointed at the wrong half of it.

The cost of producing code collapsed. A feature that used to eat a day now takes an hour. We all feel it. What almost nobody is pricing in is that the cost of the other half, understanding what the code does and being sure it's right, didn't fall with it. If anything, it went up.

That's the whole thesis: writing got cheap, reading and trusting did not. And that gap, not the speed of generation, is the thing that now decides whether you can actually ship.

Why reading got harder, not easier

When a teammate opens a pull request, every line carries intent. You can ask "why did you do this?" and get a real answer, because a person made a decision there. You review against that intent.

Generated code has no intent behind any single line. It's locally plausible everywhere — each line looks like something a competent developer might write — but there's no reasoning underneath to lean on. So you can't review it the way you review a human diff. You can't ask the author why, because there was no author and no why. You're left checking behavior directly, which is slower and far less forgiving.

This is also why the line-by-line diff, the unit we've reviewed software in for twenty years, quietly stopped working. The diff assumes each changed line was a human choice worth scrutinizing. When the lines are generated, the diff tells you what changed but nothing about where the risk is. The dangerous failures aren't in the lines that changed — they're in how those changes interact with everything that didn't.

The bottleneck moved from typing to judging

For most of the history of this craft, the scarce skill was production: can you build the thing. That's no longer scarce. The scarce skill is now judgment: can you look at a working-looking pile of code and tell whether it's actually correct.

And the surveys I've seen keep drawing the same shape. Reviewers rate AI-assisted code highly at review time: clean, readable, idiomatic. Then it ships, and incident counts go up, senior engineers spend more time firefighting, and a meaningful chunk of that code needs significant rework. "Looks reviewed" and "is correct" came apart. Readability, the thing generated code optimizes for, is exactly what lowers a reviewer's guard. Rough code makes you slow down; beautiful code makes you sign off on behavior you never checked.

The imbalance nobody's fixing

Here's what bothers me most. We spent a year building incredible tools for the half that got cheap: autocomplete, agents, parallel PRs, generate-more-of-everything. We built almost nothing for the half that stayed expensive.

Everyone now has a way to produce more code. Almost no one has a way to get better at judging it. There's no gym for the skill that suddenly matters most. You're expected to develop it by accident, in production, on code you didn't write — which is the worst possible place to learn it.

That asymmetry is the gap I keep coming back to, and it's the reason I'm building theloupe.dev: a place to practice the half that didn't get cheaper, which means reading code you didn't write and deciding, with reasons, whether it's actually right.

I'll get into how you even practice that — how you make something feel like a real production failure instead of a toy — in the next post. For now I just want to put the thesis down plainly: the cost of writing fell off a cliff, the cost of being sure didn't, and we've been building almost entirely for the wrong one.

Why I'm building a place to practice catching AI's bugs

KYUNGHWAN KIM — Sun, 21 Jun 2026 05:15:58 +0000

Two moments made me start building Loupe. Neither was some grand realization — they were just things that kept happening until I couldn't ignore them.

The first happened several times a day. I'd hand a feature to my AI coding agent (Claude Code, Codex, whichever) and minutes later it would come back: done, tests passing. Green checkmarks everywhere. But when I actually read what it had written, I kept noticing the same thing: the tests weren't written to check the feature. They were written to pass. They cleared the bar, let the agent declare victory, and moved on. The code ran. The demo worked. Whether it was actually correct was a separate question, and nobody, human or machine, had bothered to ask it.
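
One way to make that difference concrete is to run a test against a deliberately broken version of the feature and see whether it notices. The sketch below is illustrative only: apply_coupon, broken_apply_coupon, weak_test, strong_test, and notices_breakage are hypothetical names invented for this example, not output from any agent and not part of Loupe. It is a hand-rolled take on the idea behind mutation testing, assuming the feature is a simple percentage coupon.

```python
def apply_coupon(total_cents: int, percent_off: int) -> int:
    """Hypothetical feature: price in cents after a percentage coupon."""
    return total_cents - (total_cents * percent_off) // 100


def broken_apply_coupon(total_cents: int, percent_off: int) -> int:
    """Deliberately wrong variant: ignores the coupon entirely."""
    return total_cents


def weak_test(fn) -> bool:
    """Written to pass: only checks that a non-negative int comes back."""
    result = fn(10_000, 15)
    return isinstance(result, int) and result >= 0


def strong_test(fn) -> bool:
    """Written to check the feature: expected values worked out by hand."""
    return fn(10_000, 15) == 8_500 and fn(10_000, 0) == 10_000


def notices_breakage(test, good, broken) -> bool:
    """A test earns trust only if it passes on good code and fails on broken code."""
    return test(good) and not test(broken)


print(notices_breakage(weak_test, apply_coupon, broken_apply_coupon))    # False
print(notices_breakage(strong_test, apply_coupon, broken_apply_coupon))  # True
```

The weak test is green on both versions, so its checkmark says nothing about the feature. The strong test fails the moment the coupon is ignored.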

The second happened in a meeting. A teammate and I agreed to build a specific feature. Some time later, they told me it was done. No docs, no walkthrough, just "done." So I asked the most basic questions I could think of. How many endpoints does it have? What happens in this edge case? Silence. They couldn't answer. The feature was "finished," and the person who shipped it couldn't describe what they'd shipped.

Both times, the exact same thought hit me: we're about to put this in front of real users, and not one person has actually verified it.

I couldn't let it go, because it isn't a one-off; it's the shape of how we build now. Writing code became almost free; AI produces it in seconds. But reading code, understanding it, and being able to say with confidence "yes, this is correct" didn't get any easier. If anything we do it less than we used to, because "the tests pass" feels like permission to stop looking. The cheaper generation gets, the more code flows past unread.

And the failures from that gap aren't loud. A crash you catch instantly. The dangerous bug is the one that runs perfectly and just returns the wrong number: the refund that's slightly too large, the query that quietly drops a row, the check that never actually runs. It passes the test. It demos fine. It's wrong. And it ships, because everything looked green.
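
Here is a minimal sketch of what the refund case can look like, assuming a hypothetical helper for an order bought with a percentage discount. The function names, prices, and discount rule are invented for illustration and are not taken from a real incident.

```python
from decimal import Decimal


def refund_amount_buggy(unit_price: Decimal, quantity: int, discount_rate: Decimal) -> Decimal:
    """Hypothetical helper. Runs fine, but refunds the list price and ignores the discount."""
    return unit_price * quantity


def refund_amount(unit_price: Decimal, quantity: int, discount_rate: Decimal) -> Decimal:
    """Refunds what the customer actually paid after the discount."""
    return unit_price * quantity * (Decimal("1") - discount_rate)


# The only case the test suite covers: no discount, so both versions agree.
assert refund_amount_buggy(Decimal("20.00"), 2, Decimal("0")) == Decimal("40.00")
assert refund_amount(Decimal("20.00"), 2, Decimal("0")) == Decimal("40.00")

# The case nobody checked: two items bought at 10% off, so 36.00 was paid.
paid = Decimal("36.00")
overpaid = refund_amount_buggy(Decimal("20.00"), 2, Decimal("0.10")) - paid
print(overpaid)  # 4.00
```

Nothing crashes and the suite stays green. The only evidence of the bug is a number that is 4.00 too high, and you only see it if you work out by hand what the answer should have been.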

So I'm building Loupe: a place to practice the one skill all of this quietly demands and almost nobody trains. You get real, AI-written code that runs and passes its tests, and your job is to find the part that's quietly wrong. Not to write it faster. To read it and catch what everyone else waved through.

Because "it runs" and "it's right" are not the same sentence. Someone still has to know the difference — and I'd like more of us to be that someone.

This is the first post in a series where I'll build this in the open: the decisions, the mistakes, the early users, the real bugs. Follow along if that's your kind of problem. → theloupe.dev

DEV Community: KYUNGHWAN KIM
