Davidslv

The View Layer Rails Couldn't See

Wed, 24 Jun 2026 00:00:00 +0000

For most of Rails’ history, almost every layer of the stack has had a tool that parses it. RuboCop reads your Ruby as a syntax tree; Brakeman traces tainted data through it; ESLint reads your JavaScript. The view layer was the exception. The ERB linters we had worked on the text of the template, not on the HTML that text produces.

Marco Roth’s Herb closes that gap: written in C, it parses HTML-with-embedded-ERB into a real syntax tree. We have run it in production in a Rails engine for a few months. Here is what we wired in, what it caught, and what is still rough.

For about twenty years, ERB tooling was string-based. erb_lint and better_html did useful work, but they pattern-matched over text. They had no parse tree of the HTML you were producing, so they could not reliably catch a <div> opened inside a <p>, or a value interpolated into an attribute where the escaping rules differ from body text.

The harder case in our codebase was not ERB at all. Parts of our admin surface were written in Arbre — the object-oriented “DOM in Ruby” that ActiveAdmin uses to build markup with div do … end blocks instead of templates. Arbre first appeared in 2011 and solved a real problem of its era. I never warmed to the abstraction myself, but taste is beside the point here. The real issue is that Arbre is invisible to static analysis: there is no HTML in the file, only Ruby that emits HTML at runtime. No HTML linter sees it, no language server checks the Stimulus targets inside it, no formatter touches it. Every .html.arb file is a file outside the ecosystem.

What a parse tree makes possible

Once you have a tree, you can express checks that a string linter cannot do reliably. In our engine, herb analyze gates on two:

Nesting — invalid element nesting, the kind browsers silently repair and then render differently than you meant.
Accessibility — structural checks (missing labels and the like) that only make sense with the element tree in front of you.

Herb can also check security — for instance, output in an attribute position, where a naive <%= %> can break out of the attribute. We leave those rules off, though: better_html (under erb_lint) already owns ERB safety, and running both would just double-report. Either way the point holds: the tree is what makes the check possible at all.
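To make the nesting point concrete, here is a toy sketch in plain Ruby. It is not Herb's API or its algorithm: the names nesting_errors, BLOCK_LEVEL and VOID are made up for illustration, and the regex tokenizer is exactly the shortcut a real parser avoids. The only point is that the check needs a stack of open elements, which is a tree in miniature.

```ruby
# Toy illustration only: NOT Herb's API or algorithm.
# Elements that may not appear inside an open <p> (a deliberately short list).
BLOCK_LEVEL = %w[div p ul ol table section].freeze
# Void elements never get a closing tag, so they never go on the stack.
VOID = %w[br img input hr meta link].freeze

def nesting_errors(html)
  stack = []   # names of the elements that are currently open
  errors = []

  # Matches opening and closing tags; ERB tags (<% ... %>) do not match
  # because "%" is not a letter.
  html.scan(%r{<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>}) do |closing, name|
    name = name.downcase
    next if VOID.include?(name)

    if closing.empty?
      if stack.include?("p") && BLOCK_LEVEL.include?(name)
        errors << "<#{name}> opened inside <p>"
      end
      stack.push(name)
    elsif stack.last == name
      stack.pop
    end
  end

  errors
end
```

A line-by-line pattern match has no equivalent of the stack, which is why this class of check only becomes reliable once something tracks which elements are open.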

The view layer has been consolidating on ERB

Herb did not appear in a vacuum. For a decade the view layer has been moving in one direction:

2019

ViewComponent born at GitHub

An ERB template plus a Ruby class. Built and run by the most prolific ERB renderer on the planet — and still rendering ERB.

2023 · Rails 7.1

Strict locals land

A magic comment lets plain partials declare the locals they accept, required or defaulted, and raise on anything else, closing much of the safety gap that sent people looking for alternatives.

2024–25

Phlex 2 ships

Pure-Ruby views — fast and composable, but deliberately outside ERB. Good, and against the grain of everything around it.

2025

Herb arrives

An HTML-aware ERB parser, linter, formatter and language server, written in C. The first time the view layer gets a real parse tree.

2025–26

GitHub adopts Herb · Rails core leans in

Herb runs across GitHub's enormous ERB surface, and a proposed “ReActionView” direction at Rails World 2025 imagines Herb as the foundation Rails parses views with.

Two things get conflated here — a commenter rightly called me on it. The substrate has settled: ERB is where the ecosystem’s weight sits, and that is a safe bet. The tooling on top is young — Herb is at 0.10. That is not a contradiction; it is the reason this is worth writing down. (The one popular bet the other way is Phlex — pure-Ruby views, genuinely good for composition. It is a different trade: write views as Ruby, and you are back to having no HTML for an HTML-aware tool to read.)

The lived test: porting Arbre out of an engine

Here is where it became a diff. We wrote a hard rule into the engine’s style guide: new view partials are .html.erb, and the moment you touch an existing .html.arb file for any reason, you port it to ERB in the same pull request.

Existing Arbre files keep working untouched until something forces an edit: a bug fix, a copy change, a new field. Then you port the whole file first and make your change in ERB. Pre-commit and CI both fail on any added or modified .html.arb in the diff (git diff --name-only --diff-filter=AM | grep '\.html\.arb$'). No hotfix bypass: the gate is only credible if it survives pressure.
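If the gate lives in a Ruby hook rather than a shell one-liner, the filter can be sketched roughly like this. The helper name arbre_violations is hypothetical, and it assumes the hook hands it a newline-separated list of changed paths, for example the output of git diff --name-only --diff-filter=AM.

```ruby
# Hypothetical helper for a pre-commit or CI script.
# `changed_paths` is a newline-separated list of added or modified files.
# Returns the Arbre templates in that list; an empty array means the gate passes.
def arbre_violations(changed_paths)
  changed_paths
    .each_line
    .map(&:strip)
    .select { |path| path.end_with?(".html.arb") }
end
```

A hook would print the returned paths and exit non-zero when the array is not empty.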

That has the engine at 193 ERB partials and three remaining Arbre files — the three are just ones nobody has needed to touch. There was no big-bang migration; the legacy set shrinks as you work.

What makes the rule fair is that porting Arbre is mechanical. You convert the whole file once, and it is done. That is cheap enough to demand on every touch — which it would not be for, say, a cop that rewrote existing method calls.

Here is one of those three files, trimmed. In Arbre the markup is Ruby — there is no <div>, no class, no attribute for a tool to read:

div class: 'bg-white rounded-xl border px-6 py-5' do
  para t("#{i18n}.heading"), class: 'text-xs uppercase'
  text_node button_to t("#{i18n}.button"),
            sync_record_path(record),
            method: :post,
            class: 'px-4 py-2 rounded-lg bg-blue-600 text-white'
end

Ported, the same output is real HTML that herb analyze and herb-lsp can parse:

<div class="bg-white rounded-xl border px-6 py-5">
  <p class="text-xs uppercase"><%= t("#{i18n}.heading") %></p>
  <%= button_to t("#{i18n}.button"),
        sync_record_path(record),
        method: :post,
        class: "px-4 py-2 rounded-lg bg-blue-600 text-white" %>
</div>

div do … end becomes <div>…</div>, para becomes <p>, and the text_node wrappers fall away. One file, one pass.

Honest about the rough edges

In the editor, herb-lsp (we list marcoroth.herb-lsp as a recommended extension) gives live HTML diagnostics as you type. But this is a young toolchain — 0.10.1 — and two things are still rough:

Stimulus checks are off, on purpose. We bootstrap Stimulus inline in an ERB partial rather than in application.js, and the Stimulus parser cannot follow controller registrations embedded in ERB — so it flags every real, registered controller as unknown. The four stimulus-data-* rules are disabled in .herb.yml until our bootstrap moves to a .js file. So the diagnostic I most wanted — catching a typo’d data-controller as I type it — is not live for us yet.

Two ERB linters, side by side. herb analyze for the parser-level validators, Shopify’s erb_lint for whitespace and better_html safety. Where they overlap, we keep the rule in one so they do not double-report. The new tool has not replaced the old one; it sits next to it and will subsume more over time.

The lesson

One idea is worth keeping: the reach of your tooling is an architectural property, and you choose it. Picking Arbre years ago did not just pick a way to write markup — it put that markup beyond every analyser we would later wish we had. Picking ERB put the view layer back inside reach of the linters, the language servers, the safety checks, and whatever Rails builds next on Herb. The template-engine debate usually turns on ergonomics; the more durable axis is legibility to tools, and on that axis HTML-aware ERB is in a different class from the DSLs it replaces.

So: write ERB, lint it with something that actually parses the HTML, and when you find a corner the tools cannot see, treat it as a bug in your architecture — not a quirk you live with.

The setup here is a production Rails engine running Herb 0.10.1 (herb analyze in pre-push and CI), erb_lint 0.9.0 with better_html for safety, marcoroth.herb-lsp as a recommended extension, and an enforced Arbre-to-ERB porting rule that has the engine at 193 ERB partials to 3 remaining Arbre files.

If you want the longer story on building Rails applications that stay maintainable as they grow — boundaries, engines, testing, and honest trade-offs — that is what Modular Rails: Architecture for the Long Game covers in depth. Read it free on the web, or get the paperback (UK).

The Propshaft Version Lever You Were Told Was Gone

Tue, 23 Jun 2026 00:00:00 +0000

A piece of feedback to the Rails community crossed my feed this week. A team had migrated an application to Rails 8.1.3, adopted Propshaft — the asset pipeline that replaced Sprockets as the Rails 8 default — and concluded that it had removed the ability to set a version string to force new fingerprints on precompile. Their words were that this introduced “a weakness to the platform.” The reasoning was sound: they used that lever to be certain a client was running the latest deployed assets, and now it appeared to be gone.

The instinct is correct. That lever matters. But the conclusion is wrong, and the way I know it is wrong is the point of this post: I cloned Propshaft, read the source, and then generated a fresh Rails 8.1.3 app and tested it, rather than trusting the blog posts. The version setting is still there. It is wired into the digest. The Rails generator writes it into your initializer with a comment explaining what it does. And when I precompiled twice to prove it, it behaved exactly as advertised. It is missing from Propshaft’s README — which is a different problem from being removed, and a much smaller one.

What the blog posts will tell you

Search for “Propshaft cache busting” and every result says the same thing, more or less correctly:

Propshaft appends a content-based fingerprint to each filename. application.css becomes application-a1b2c3d4.css. When the content changes, the digest changes, the filename changes, and the browser fetches the new file. Unlike Sprockets, there is no config.assets.version to manage — the content hash handles everything.

That last sentence is the one that does the damage. It is repeated across tutorials, and it is the source of the belief that the lever was deleted. The first half is true. The second half is folklore.

What the source actually says

Propshaft is small enough to read in a sitting, which is exactly why I reach for git clone before I reach for an opinion. The whole digest mechanism is one method in lib/propshaft/asset.rb:

def digest
  @digest ||= Digest::SHA1.hexdigest("#{content_with_compile_references}#{load_path.version}").first(8)
end

Read it slowly. The string being hashed is not just the file’s content. It is the content concatenated with load_path.version. The fingerprint is a SHA1 of content plus a version string, truncated to eight characters.

So where does load_path.version come from? Two short hops up. The load path is built in lib/propshaft/assembly.rb:

def load_path
  @load_path ||= Propshaft::LoadPath.new(
    config.paths,
    compilers: compilers,
    version: config.version,          # <- right here
    file_watcher: config.file_watcher,
    integrity_hash_algorithm: config.integrity_hash_algorithm
  )
end

And config.version is config.assets.version, which the Propshaft railtie sets a default for in lib/propshaft/railtie.rb:

config.assets.version = "1"

The chain is unbroken:

flowchart LR
    A["config.assets.version<br/>(generated app: &quot;1.0&quot;)"] --> B[Assembly]
    B -->|"version: config.version"| C[LoadPath.version]
    C --> D["Asset#digest<br/>SHA1(content + version)"]
    D --> E["application-a1b2c3d4.css"]

    style A fill:#e8a838,stroke:#b07828,color:#fff
    style D fill:#4a90d9,stroke:#2c5f8a,color:#fff
    style E fill:#27ae60,stroke:#1e8449,color:#fff

config.assets.version exists in Propshaft. The railtie defaults it to "1", it is folded into every single asset digest, and this is Propshaft 1.3.2 — the version shipping with current Rails 8.

And here is the part that turns “undocumented” into “actually documented, in your own repository.” Generate a fresh Rails 8.1.3 app and open config/initializers/assets.rb, and the generator has already written this for you:

# Version of your assets, change this if you want to expire all your assets.
Rails.application.config.assets.version = "1.0"

The generated app overrides the railtie’s "1" with "1.0", but the point is the comment. The exact “enter version information to force new fingerprints” feature the feedback believed was disabled is scaffolded into every new Rails app, on a line whose comment names the precise use case: change this if you want to expire all your assets. Nobody removed the lever. It is sitting in an initializer the generator wrote, one git grep version config/ away.

Using it

Because the version string is part of the hashed input, changing it changes the hash for every asset, regardless of whether any file content changed. Edit the line the generator already gave you:

# config/initializers/assets.rb
Rails.application.config.assets.version = "2.0"

Precompile, and every fingerprint differs from the previous build — the same shift for every file in the pipeline. The asset URLs embedded in your views all change, and any client requesting an old URL gets a cache miss and pulls the fresh file. That is precisely the “force new fingerprint generation on precompile” behaviour the feedback assumed had been taken away.
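A minimal sketch of why one string moves every fingerprint, assuming the digest has the shape described earlier (SHA1 of content plus version, first eight hex characters). The fingerprint helper and the sample asset contents are invented for illustration; this is not Propshaft code.

```ruby
require "digest"

# Same shape as the digest described above: SHA1(content + version), truncated.
def fingerprint(content, version)
  Digest::SHA1.hexdigest("#{content}#{version}")[0, 8]
end

# Made-up asset contents; nothing here changes between the two "builds".
assets = {
  "application.css" => "body { color: red; }",
  "app.js"          => "console.log('hello')"
}

v1 = assets.transform_values { |content| fingerprint(content, "1.0") }
v2 = assets.transform_values { |content| fingerprint(content, "2.0") }

# Every file gets a new name even though no content changed.
changed = assets.keys.select { |name| v1[name] != v2[name] }
puts "#{changed.size} of #{assets.size} fingerprints changed"
```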

It is worth noting this is more reliable than the feature people remember. Sprockets had a long-standing bug (sprockets-rails#240) where bumping config.assets.version produced identical digests anyway — the lever was connected to nothing. Propshaft’s version genuinely participates in the hash. The thing that was supposedly removed actually works better than the original.

I didn’t take the source’s word for it

Reading the source tells you what should happen. Before publishing this I generated a clean Rails 8.1.3 app (propshaft (1.3.2), no Sprockets in the lockfile), added one declaration to app/assets/stylesheets/application.css, and precompiled it twice. Between the two builds I changed nothing except the version string, and I checked that the CSS file was byte-for-byte identical (same MD5) across both runs.

Build one, with the default config.assets.version = "1.0", produced this manifest:

{ "application.css": { "digested_path": "application-a863ad16.css" } }

Build two, after changing only the version to "2.0" — no content change, identical MD5 — produced:

{ "application.css": { "digested_path": "application-09b5bd28.css" } }

Every digest moved, not just the stylesheet’s:

| asset | `version = "1.0"` | `version = "2.0"` |
| --- | --- | --- |
| `application.css` | `a863ad16` | `09b5bd28` |
| `rails-ujs.esm.js` | `e925103b` | `a4ead74f` |
| `rails-ujs.js` | `20eaf715` | `0b7c6ef1` |

Then, to be sure I understood why, I reproduced the digest by hand from the formula in asset.rb — SHA1(content + version) truncated to eight characters:

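require "digest"
# Assumes the raw file is what Propshaft hashes (no compile-time references rewritten).
content = File.read("app/assets/stylesheets/application.css")
# [0, 8] keeps the first eight hex characters in plain Ruby; String#first needs Active Support.
Digest::SHA1.hexdigest("#{content}1.0")[0, 8]  # => "a863ad16"  ✓ matches build one
Digest::SHA1.hexdigest("#{content}2.0")[0, 8]  # => "09b5bd28"  ✓ matches build two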

Both hand-computed digests matched the precompiled filenames exactly. The lever works, it works for the documented reason, and the mechanism is no deeper than concatenating a string before hashing.

Why you almost never need it

Here is the more interesting half, and the reason the lever is quietly scaffolded rather than loudly advertised.

With Sprockets, the version string earned its keep because Sprockets’ digests were not purely content-addressed and were occasionally inconsistent between environments and gem versions. The version knob was the manual override you reached for when you did not trust the automatic digest. It was a workaround for unpredictability.

Propshaft’s digest is a plain SHA1 of the content plus the version string. It is deterministic: for a given version, identical bytes always produce the identical fingerprint, and any byte change produces a new one. The automatic case is now trustworthy, so the manual override has almost nothing left to do. If you changed an asset, its fingerprint already changed. You do not bump a version to “make sure,” because the content is the version.

The only scenario where the global lever is the right tool is when you want every asset to get a new URL without changing any content: forcing a CDN that keyed on something unexpected to re-pull, recovering from a poisoned edge cache, or invalidating after a build-toolchain change that altered how files are produced but not what they contain. Real situations, but rare ones. That rarity is why the generated app sets the field to "1.0" and most teams never touch it again — not because it was deleted.

The cache-busting problem the lever does not solve

There is a deeper point hiding in the original feedback, and it is the part worth carrying away even if you never touch config.assets.version.

The stated goal was “make sure the client indeed has the latest asset being deployed.” Bumping the asset version does not, on its own, guarantee that — because the fingerprinted URL lives inside your HTML:

<link rel="stylesheet" href="/assets/application-9f8e7d6c.css">

A fingerprinted asset is safe to cache forever, which is the whole win: the URL only points at one immutable version of the file. But the document that references it is a different cache layer entirely. If a CDN, a reverse proxy, or the browser is holding an old copy of the HTML, the client keeps reading the old asset URL — and your shiny new version = "2.0" digests sit on the server unrequested.

So “did the client get the latest assets?” is really two questions stacked on top of each other:

Assets: solved by fingerprinting, automatically, the moment content changes. Cache them with max-age set to a year and immutable.
The HTML that names them: not an asset-pipeline concern at all. This is the layer to get right — a short or zero max-age on the document, ETag/Last-Modified revalidation, and correct CDN cache rules for HTML responses. If this layer serves stale HTML, no amount of fingerprinting downstream will help.
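The two layers can be sketched as a single decision about the Cache-Control header. This is an illustration, not Rails or Propshaft code: cache_control_for and the FINGERPRINTED pattern are hypothetical, and the pattern assumes the eight-character hex digest shown in the filenames above.

```ruby
ONE_YEAR = 31_536_000 # seconds

# Assumption: fingerprinted files live under /assets/ and end in -<8 hex chars>.<ext>
FINGERPRINTED = %r{\A/assets/.+-[0-9a-f]{8}\.[a-z0-9.]+\z}

# Hypothetical helper: picks a Cache-Control value per response path.
def cache_control_for(path)
  if path.match?(FINGERPRINTED)
    # The URL names one immutable version of the file, so cache it for a year.
    "public, max-age=#{ONE_YEAR}, immutable"
  else
    # HTML documents (and anything unfingerprinted) must revalidate on each use.
    "no-cache"
  end
end
```

In a Rails app the asset half is typically set through config.public_file_server.headers or at the CDN; the document half is whatever your controllers and proxy rules send for HTML responses.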

Reaching for config.assets.version to fix a stale-client problem is, most of the time, fixing the layer that already works and ignoring the one that does not. The fingerprint was never the weak link. The document cache is.

The generalisable lesson

There are two, and they are both cheap habits.

Read the source, then run the experiment, before you declare a feature gone. Propshaft is a few hundred lines. The entire digest behaviour — the thing three dozen blog posts summarise, sometimes wrongly — is one method you can read in under a minute. Cloning the repository and grepping for version took less time than writing the post that announced its removal, and it would have produced the opposite, correct conclusion. Generating a throwaway app and precompiling it twice took five more minutes and turned “the source says it should work” into “I watched it work.” When a tool’s behaviour surprises you, the library’s own code is the primary source and a two-build experiment is the proof. Tutorials are secondary, and they inherit each other’s mistakes.

Match the fix to the layer. “The client has stale assets” feels like one problem but spans two caches with two different owners. Fingerprinting owns the asset layer and has owned it well since Propshaft shipped. HTTP cache headers own the document layer. Bumping an asset version to solve a document-caching symptom is the kind of fix that appears to work — because you redeployed and cleared something — while leaving the actual cause in place to resurface on the next deploy.

The lever you were told was gone is still bolted to the dashboard. It is just that the car mostly steers itself now, and the warning light you are chasing is wired to a different system entirely.

The code in this post is from Propshaft 1.3.2, the asset pipeline that ships by default with Rails 8. The digest method is lib/propshaft/asset.rb; the railtie default is in lib/propshaft/railtie.rb; the generated config.assets.version line is in config/initializers/assets.rb. Every digest in this post was reproduced from a clean rails new on Rails 8.1.3 — two precompiles, one version change, identical content — not from memory.

From One Controller to Thirteen Handlers: A Webhook Refactor

Tue, 16 Jun 2026 00:00:00 +0000

A webhook controller is the natural place to put webhook code. You name it WebhooksController, you put a def stripe action in it, and you start writing. Six months later it is 200 lines long and you cannot remember what half of it does. This post is about the moment I noticed mine had become a god object, and the small architectural shift that fixed it.

The code is real. It is from Seams, the gem I am building that scaffolds modular Rails engines. The Billing engine handles Stripe webhooks and the controller had grown to seven distinct responsibilities. I will walk you through what was wrong, what I changed, and – more usefully – the generalisable pattern underneath, so you can recognise the same smell in code that has nothing to do with webhooks.

What a webhook controller actually does

When a Stripe event arrives, the controller does roughly this:

Read the request body.
Verify the Stripe signature so you know the payload is real.
Record the event id in your database so retries do not re-fire your subscribers.
Decide which Stripe event type this is.
Pull the relevant fields out of the payload (customer id, subscription id, amount).
Upsert your local mirror of the resource (a Subscription row, an Invoice row).
Publish a domain event for the rest of your application to react to.

That is seven verbs. Each one is a separate concern with its own reasons to change, its own failure modes, and its own testing requirements. Putting them in one place means you cannot exercise any of them in isolation, you cannot see at a glance what the controller is responsible for, and – the part that broke me – adding a new event type means editing the same file every time.

Mine had five event types mapped. The roadmap said twelve. Going from five to twelve in the same controller would have produced a 300-line action method with a case statement that nobody would want to review.

The smell, named

The cleanest way I know to spot a god object is to write down its responsibilities as verbs. If the list is longer than three, the class is doing too much. The Single Responsibility Principle is usually taught as “a class should have one reason to change,” which sounds vague until you try to change something. If two unrelated bits of code in the same file change for two unrelated reasons, you keep tripping over the other one.

Each of these gave the webhook controller its own reason to change:

Trust: signature verification changes when Stripe rotates webhook signing schemes.
Idempotency: dedupe logic changes when you move from a unique-index dedupe to a Redis-based one.
Mapping: the event type table changes every time you support a new event.
Extraction: the payload-parsing rules change when Stripe ships an API version bump.
Persistence: the local upsert changes when your data model changes.
Publishing: the canonical event names change when your event-bus contract changes.
Forking: the LTD (Lifetime Deal) special case changes when product decides to add another mode.

Seven independent change vectors in one class. Every commit risked touching unrelated code; every test had to boot the full request stack just to exercise a single field’s extraction.

The refactor

The shape I landed on splits the controller into three layers: a thin entry point, a router, and a flat directory of single-purpose handlers.

flowchart LR
    Stripe([Stripe]) -->|POST /billing/webhooks/stripe| WC[WebhooksController]
    WC -->|verify + record| DB[(WebhookEvent)]
    WC -->|lookup| ER{EventRouter}
    ER -->|"customer.subscription.created"| H1[SubscriptionCreatedHandler]
    ER -->|"invoice.paid"| H2[InvoicePaidHandler]
    ER -->|"checkout.session.completed"| H3[CheckoutSessionCompletedHandler]
    ER -->|"... 10 more"| H4[...]
    H1 -->|upsert + publish| Bus([Event bus])
    H2 -->|upsert + publish| Bus
    H3 -->|fork on mode| Bus

    style WC fill:#4a90d9,stroke:#2c5f8a,color:#fff
    style ER fill:#e8a838,stroke:#b07828,color:#fff
    style Bus fill:#27ae60,stroke:#1e8449,color:#fff

The controller shrank from ~210 lines to about 95. Most of those 95 are documentation comments. The action itself is now ten lines:

def stripe
  payload   = request.body.read
  signature = request.headers["Stripe-Signature"]

  event = Billing.gateway.verify_webhook(
    payload: payload, signature: signature, secret: Billing.configuration.webhook_secret
  )

  record_and_dispatch("stripe", event)
  head :ok
rescue Billing::WebhookError => e
  Seams::Observability.adapter.warn("billing.webhook.invalid", error: e.message)
  head :bad_request
end

record_and_dispatch inserts the WebhookEvent row inside a transaction and then calls Webhooks::EventRouter.handler_for(event[:type]) to look up a handler class. If there is one, it instantiates and calls it; if there is not, the controller no-ops. Stripe sends event types nobody subscribed to all the time, so a missing handler is normal, not an error.
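The post does not show record_and_dispatch itself, so here is an in-memory sketch of the behaviour just described, under loud assumptions: a Set stands in for the WebhookEvent table and its unique index, a plain Hash of callables stands in for the router, and the returned symbols exist only to make the three outcomes visible. The class name WebhookDispatcher is invented.

```ruby
require "set"

# In-memory stand-in, not the Seams implementation.
class WebhookDispatcher
  def initialize(routes)
    @routes = routes     # event type => callable handler
    @seen_ids = Set.new  # stands in for the WebhookEvent unique index
  end

  def record_and_dispatch(event)
    # Set#add? returns nil when the id is already present: a retry, so skip it.
    return :duplicate unless @seen_ids.add?(event[:id])

    handler = @routes[event[:type]]
    return :ignored if handler.nil? # unsubscribed event types are normal

    begin
      handler.call(event)
    rescue StandardError
      # Mimic a transaction rollback so the sender's retry is processed again.
      @seen_ids.delete(event[:id])
      raise
    end
    :handled
  end
end

paid = []
routes = { "invoice.paid" => ->(event) { paid << event[:id] } }
dispatcher = WebhookDispatcher.new(routes)
```

Calling dispatcher.record_and_dispatch twice with the same event id runs the handler once; the second call is recognised as a retry.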

The handlers themselves form a small inheritance tree:

Billing::Webhooks::Handler                          ← abstract base
  ├── SubscriptionHandlerBase                       ← shared upsert
  │   ├── SubscriptionCreatedHandler                ← SEAMS_EVENT = "subscription.created.billing"
  │   ├── SubscriptionUpdatedHandler                ← "subscription.updated.billing"
  │   ├── SubscriptionDeletedHandler                ← "subscription.canceled.billing"
  │   └── SubscriptionTrialWillEndHandler           ← "subscription.trial_will_end.billing"
  ├── InvoiceHandlerBase                            ← shared upsert
  │   ├── InvoiceCreatedHandler                     (status: draft)
  │   ├── InvoicePaidHandler                        (status: paid)
  │   ├── InvoicePaymentFailedHandler               (status: open)
  │   ├── InvoiceFinalizedHandler                   (status: open)
  │   └── InvoiceVoidedHandler                      (status: void)
  ├── PaymentSucceededHandler
  ├── PaymentFailedHandler
  ├── ChargeRefundedHandler
  └── CheckoutSessionCompletedHandler               ← LTD fork lives here

Most leaves are three lines. SubscriptionCreatedHandler is literally:

class SubscriptionCreatedHandler < SubscriptionHandlerBase
  SEAMS_EVENT = "subscription.created.billing"
end

The shared upsert lives in SubscriptionHandlerBase. The leaf only declares which canonical event name to publish. That is the entire difference between “subscription created” and “subscription updated” – one constant.

Three patterns, one shape

What I have just described is three classical patterns layered on top of each other.

The Template Method pattern is doing the heavy lifting in SubscriptionHandlerBase and InvoiceHandlerBase: a base class defines the algorithm (upsert + publish), subclasses fill in the variable parts (the canonical event name, the invoice status). When five out of six subclasses share the same logic and only the constants differ, Template Method is the right shape. It keeps the shared code in one place and makes each variant trivial to read.
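A stripped-down sketch of that shape, runnable without Rails. Everything prefixed Sketch, plus the STORE hash and PUBLISHED array, is a stand-in I made up; the real base class upserts a database row and publishes to an event bus, and its internals are not shown in this post.

```ruby
# In-memory stand-ins (assumptions) for the database and the event bus.
STORE = {}
PUBLISHED = []

class SketchSubscriptionHandlerBase
  def initialize(event:)
    @event = event
  end

  # The template method: the algorithm is fixed here, once.
  def call
    subscription = @event[:data]
    STORE[subscription[:id]] = subscription                    # shared upsert
    PUBLISHED << [self.class::SEAMS_EVENT, subscription[:id]]  # shared publish
  end
end

# Each leaf supplies only the part that varies: the canonical event name.
class SketchSubscriptionCreatedHandler < SketchSubscriptionHandlerBase
  SEAMS_EVENT = "subscription.created.billing"
end

class SketchSubscriptionUpdatedHandler < SketchSubscriptionHandlerBase
  SEAMS_EVENT = "subscription.updated.billing"
end
```

The lookup self.class::SEAMS_EVENT is what lets the base class stay ignorant of which leaf it is running as.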

The Strategy pattern is the relationship between the controller and the handlers: the controller does not know which handler will run; it asks the router for one and invokes it through a uniform interface (handler.new(event:, gateway:).call). The controller and the router are decoupled from the concrete strategy. Adding a new strategy does not require changing either of them.

The Registry pattern is what EventRouter is. It is a hash of strings to class names with a register method that lets hosts add their own mappings without monkey-patching. This is the seam that turns a closed system into an open one. A consuming application can write:

Billing::Webhooks::EventRouter.register(
  "customer.tax_id.created",
  "MyApp::TaxIdCreatedHandler"
)

…and now a Stripe event type that the gem never heard of routes to host code. No subclassing, no config block, no fork. The extension point is published. This is what people mean when they say “open for extension, closed for modification” – the framework’s behaviour does not need to change for the host to add behaviour.
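For readers who have not built one, a registry of this kind can be very small. The sketch below is an assumption about the general shape, not the gem's source: SketchEventRouter and SketchInvoicePaidHandler are invented names, and resolving the stored class name with Object.const_get is my choice for a dependency-free example.

```ruby
# Illustrative registry, not the Seams EventRouter.
module SketchEventRouter
  @routes = {} # event type string => handler class name string

  class << self
    # Hosts call this to add or override a mapping.
    def register(event_type, handler_class_name)
      @routes[event_type] = handler_class_name
    end

    # Returns the handler class, or nil when nobody subscribed to the type.
    def handler_for(event_type)
      name = @routes[event_type]
      name && Object.const_get(name)
    end
  end
end

class SketchInvoicePaidHandler; end

SketchEventRouter.register("invoice.paid", "SketchInvoicePaidHandler")
```

Storing names rather than classes means a mapping can be registered before the handler class is loaded; the constant is only resolved at dispatch time.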

Five concrete wins

The reason I find this shape worth talking about is that it pays off in five different ways, and the wins compound.

Adding event types becomes a one-class job. A new file, four lines long, registered in one place. There is no integration risk because the controller does not change. There is no “what else does this method do?” anxiety because each handler does one thing.

Each handler is testable in isolation. Before the refactor, the controller spec needed a full request stack. After it, a handler spec is a plain Ruby object instantiated with a hash. I have a directory of saved Stripe event fixtures from Phase 3 (1/4) of the same project; a handler spec reads one, calls .new(event:, gateway:).call, and asserts on the resulting database state. That is a unit test, not an integration test. It runs in milliseconds.

The Lifetime Deal fork stops being a special case. Before: there was a checkout_lifetime? predicate baked into the controller, with its own branching. After: CheckoutSessionCompletedHandler examines mode and metadata.access_type and forks internally. The controller does not know LTDs exist. The router does not know LTDs exist. Only the handler that needs to know, knows. That is the kind of thing that lets you delete the LTD feature later – if product changes its mind – by deleting one file.

Async dispatch becomes a config flip, not a code rewrite. I shipped a ProcessEventJob that takes the same (gateway:, event_data:) arguments and calls the same router. The controller checks Billing.configuration.process_webhooks_async and either runs the handler in the request thread or enqueues the job. Stripe recommends returning a 2xx quickly, before any slow processing; hosts who need that flip a flag. Hosts who prefer the existing transactional semantics (handler failure rolls back the WebhookEvent insert, Stripe retries) keep them. The handler did not change.
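The flip can be pictured as one branch in front of an unchanged handler. This is a sketch under assumptions: dispatch is a hypothetical function, and queue is any object that responds to <<, standing in for enqueuing the real job.

```ruby
# Hypothetical sketch: the handler is identical in both modes;
# only where and when it runs changes.
def dispatch(event, handler:, async:, queue:)
  if async
    queue << event       # defer: a background worker would run the handler later
    :enqueued
  else
    handler.call(event)  # run inline, inside the request
    :processed
  end
end
```

Because the branch sits outside the handler, switching modes needs no change to any handler class.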

The extension point is published. Hosts adding Stripe events the gem does not ship with do not have to fork the gem. They write a handler in their own codebase and call EventRouter.register. This is the difference between a tool you use and a tool you have to maintain a fork of.

The trade-off, honestly

Thirteen small files where there was one large file. That is a real cost. You now have to navigate a directory tree to read all the webhook code, and someone seeing the codebase for the first time will spend a minute orienting themselves.

I think it is the right trade. Here is why.

When code is in one big file, you read it linearly. When it is in thirteen small files, you read whichever file matches the case you care about. The “navigate a directory” cost is only paid by readers who need to understand the whole system. Readers who only need to understand “what happens when an invoice is paid” go to one file with seven lines in it.

The opposite trade – one big file – is paid every time you change anything. Every commit shows up in git blame next to unrelated code. Every test has to set up state for the whole controller. Every bug fix risks breaking a sibling case. The cost is paid continuously, by everyone, forever.

When this pattern does not apply

If your controller has two event types and they are stable forever, leave it alone. The refactor’s value is in case count and change frequency. With two cases, the inheritance tree adds more cognitive load than it removes.

The breakeven I have seen empirically is somewhere around five cases or “I expect this to grow.” Below that, a case statement is fine. Above that, the per-case classes start paying for themselves.

The other condition is that the cases must have a similar shape. Webhook handlers do: each takes an event hash, optionally upserts something, publishes a canonical event. That uniformity is what makes a registry possible. If your “cases” each take different inputs and do different things, you do not have a Strategy problem – you have a routing problem at the wrong layer.

The generalisable lesson

The Single Responsibility Principle scales by case count. A controller with one verb has one responsibility. A controller with seven verbs has seven, and that does not feel like a problem until your case count grows enough that the verbs start interfering with each other.

When you find yourself adding the Nth when to a case statement, or the Nth if to a method that is already long, ask whether the cases should be classes. Not always – it costs more files. But often enough that the question is worth asking every time.

Three follow-on principles fall out of this:

When the cases share most of the work and differ in constants, Template Method keeps the shared code in one place.

When the cases need to be looked up by a runtime value (an event type, a strategy name, a content type), Registry is the seam that lets you add cases without editing the dispatcher.

When the host application needs to add cases that the framework never anticipated, publish the extension point. A register method on a public module is worth more than any documentation telling people how to monkey-patch.
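To make the first of these concrete, here is a small sketch of Template Method for handlers that share a workflow and differ only in constants. The class names, the CANONICAL_EVENT constant, and the returned hash are invented for the example. Only the general shape (take an event hash, publish a canonical event) comes from the post.

```ruby
# Base class owns the shared workflow; subclasses supply only constants.
class BaseHandler
  def call(event)
    {
      name: self.class::CANONICAL_EVENT,           # differs per subclass
      subject_id: event.fetch("data").fetch("id")  # shared extraction logic
    }
  end
end

class InvoicePaidHandler < BaseHandler
  CANONICAL_EVENT = "billing.invoice_paid"
end

class InvoiceFailedHandler < BaseHandler
  CANONICAL_EVENT = "billing.invoice_payment_failed"
end

event = { "data" => { "id" => "in_001" } }
puts InvoicePaidHandler.new.call(event)[:name]
puts InvoiceFailedHandler.new.call(event)[:name]
```

Each subclass is a couple of lines because everything that does not vary lives in the base class.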

The webhook controller is just the example. The shape is everywhere.

The code in this post is from Seams, an open-source gem that scaffolds modular Rails engines. The Billing engine ships the full handler hierarchy, the registry, and the opt-in async job, ready to use in your host application. If you find yourself building this pattern by hand, you might save a few hours.

If you want the longer story on building modular Rails applications — boundaries, engines, testing, and the trade-offs — that is what Modular Rails: Architecture for the Long Game covers in depth. Read it free on the web, or get the paperback (UK).

When Rails Engines Are the Wrong Tool

Tue, 09 Jun 2026 00:00:00 +0000

I have spent an entire book making the case for Rails engines. Now I am going to tell you when not to use them.

This is not a hedge. It is honesty. Every architectural tool has a cost, and engines are no exception. Using them when they are not warranted creates overhead that slows your team down rather than speeding them up. Knowing when not to reach for an engine is just as important as knowing how to build one.

The Decision Flowchart

Before introducing an engine, run through this:

flowchart TD
    A["Do you have more than<br/>10-15 models?"] -->|No| B["Engines are overkill.<br/>Use namespaces."]
    A -->|Yes| C["Do you have more than<br/>one team or domain area?"]

    C -->|No| D["Consider namespaces<br/>and conventions first."]
    C -->|Yes| E["Do you have clear<br/>domain boundaries?"]

    E -->|No| F["Identify boundaries first.<br/>Engines can wait."]
    E -->|Yes| G["Will the engine have<br/>its own models and<br/>business logic?"]

    G -->|No| H["A plain Ruby gem<br/>or concern may suffice."]
    G -->|Yes| I["An engine is<br/>likely the right choice."]

    style A fill:#e8a838,stroke:#b07828,color:#fff
    style B fill:#d9654a,stroke:#8a3a2c,color:#fff
    style C fill:#e8a838,stroke:#b07828,color:#fff
    style D fill:#4a90d9,stroke:#2c5f8a,color:#fff
    style E fill:#e8a838,stroke:#b07828,color:#fff
    style F fill:#4a90d9,stroke:#2c5f8a,color:#fff
    style G fill:#e8a838,stroke:#b07828,color:#fff
    style H fill:#4a90d9,stroke:#2c5f8a,color:#fff
    style I fill:#27ae60,stroke:#1e8449,color:#fff

Notice how many paths lead away from “use an engine.” That is intentional. Engines should be the answer to a specific problem, not the default structure for every Rails application.

Applications That Are Too Small

If your application has fewer than 10-15 models, engines almost certainly add more overhead than value. The ceremony of gemspecs, dummy apps, mountable routes, and inter-engine dependency management is not justified when the entire codebase fits comfortably in one developer’s head.

For small applications, namespaces give you most of the organisational benefit at zero cost:

# app/models/billing/invoice.rb
module Billing
  class Invoice < ApplicationRecord
    # All the billing logic, clearly namespaced
  end
end

# app/mailers/notifications/mailer.rb
module Notifications
  class Mailer < ApplicationMailer
    # All the notification logic, clearly namespaced
  end
end

This communicates domain boundaries to developers without introducing any infrastructure. The Billing:: prefix tells you where this class belongs. The directory structure mirrors the namespace. It is not enforced, but it is clear.

Teams That Are Too Small

A team of two or three Software Engineers working on a single application does not need engine boundaries. The communication overhead of a small team is low enough that conventions and code review are sufficient to maintain boundaries.

Engines shine when teams are large enough that not everyone can hold the full codebase in their head. If every developer on your team already knows every model, every controller, and every service object, the boundary enforcement that engines provide is solving a problem you do not have.

The threshold is not precise, but in my experience, engines start paying for themselves when you have 5+ developers working on a codebase with 30+ models across at least 2-3 distinct domain areas.

The Honest Calculation

Before introducing engines, ask three questions:

What is the actual cost of the problem we are solving? Not the theoretical cost. The actual cost. How many hours per month does your team lose to cross-domain coupling? How many production incidents were caused by unexpected dependencies? If you cannot point to specific, recent pain, the problem may not justify the solution.
What is the ongoing cost of the engine infrastructure? Each engine needs its own gemspec, its own test setup, its own factories, its own migration strategy. Someone has to maintain that infrastructure. That someone is usually the most senior developer on the team, which means your most expensive resource is spending time on plumbing.
Is there a cheaper solution that gets us 80% of the benefit? Namespaces, conventions, Packwerk, or even just better code review might address the boundary problem without the full weight of engines.

The Premature Boundary Trap

The most common mistake is drawing boundaries before you understand the domain. You create an engines/billing and an engines/shipping on day one, then discover three months later that billing and shipping share a concept – “order line items” – that does not fit neatly into either engine.

Now you have three bad options: duplicate the concept, create a third engine for shared logic, or collapse the boundary you just built. All of them are expensive. The premature boundary cost you more than having no boundary at all.

The antidote is simple: wait. Let the domain reveal its boundaries through co-change patterns (as discussed in Chapter 9) rather than guessing them up front. Six months of git history is a better domain expert than any whiteboard session.
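One way to read co-change patterns out of history is to count how often two areas of the codebase change in the same commit. The sketch below is an assumption about how such a tally might be done, not the method from the book. It takes commits as arrays of file paths (which you could build from git log --name-only), and the rule for mapping a path to an area, along with the sample paths, is made up for illustration.

```ruby
# Map a path to a coarse "area". Assumption: namespaced files look like
# app/<layer>/<namespace>/<file>, and anything shallower counts as "core".
def area_for(path)
  segments = path.split("/")
  segments.length >= 4 ? segments[2] : "core"
end

# Count how many commits touched each pair of areas together.
def co_change_counts(commits)
  counts = Hash.new(0)
  commits.each do |files|
    areas = files.map { |f| area_for(f) }.uniq.sort
    areas.combination(2).each { |pair| counts[pair] += 1 }
  end
  counts
end

# Illustrative history: each inner array is the file list of one commit.
commits = [
  ["app/models/billing/invoice.rb", "app/models/shipping/parcel.rb"],
  ["app/models/billing/invoice.rb", "app/services/shipping/label.rb"],
  ["app/models/billing/refund.rb"],
  ["app/models/user.rb", "app/models/billing/invoice.rb"]
]

co_change_counts(commits)
  .sort_by { |_pair, n| -n }
  .each { |(a, b), n| puts "#{a} + #{b}: #{n}" }
# billing + shipping: 2
# billing + core: 1
```

Pairs that keep changing together are a hint that a boundary drawn between them would be crossed on most pull requests.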

Signs You Have Over-Modularised

If you have already introduced engines, watch for these signals that you have gone too far:

graph TB
    subgraph "Signs of Over-Modularisation"
        S1["Every PR touches<br/>3+ engines"]
        S2["Engine interfaces are<br/>thicker than the logic<br/>behind them"]
        S3["Developers spend more time<br/>on engine plumbing than<br/>on features"]
        S4["Cross-engine integration<br/>tests outnumber<br/>unit tests"]
        S5["New developers take<br/>longer to onboard<br/>than before engines"]
    end

    S1 --> FIX["Consider merging<br/>those engines"]
    S2 --> FIX
    S3 --> FIX
    S4 --> FIX
    S5 --> FIX

    style S1 fill:#d9654a,stroke:#8a3a2c,color:#fff
    style S2 fill:#d9654a,stroke:#8a3a2c,color:#fff
    style S3 fill:#d9654a,stroke:#8a3a2c,color:#fff
    style S4 fill:#d9654a,stroke:#8a3a2c,color:#fff
    style S5 fill:#d9654a,stroke:#8a3a2c,color:#fff
    style FIX fill:#27ae60,stroke:#1e8449,color:#fff

If every pull request touches three or more engines, your boundaries are in the wrong place. If the interface code between engines is more complex than the domain logic inside them, you have created accidental complexity. If new developers are slower to become productive than they were before you introduced engines, the architecture is working against you.

The fix is not to abandon engines entirely. It is to merge the ones that should not have been separate in the first place. Collapsing a bad boundary is not failure – it is learning.

Alternatives That Might Be Enough

Before reaching for an engine, consider whether one of these lighter-weight alternatives solves your problem:

Namespaces and directory structure. Zero cost, immediate clarity. If your problem is “developers put code in the wrong place,” namespaces may be all you need.

Concerns and modules. Shared behaviour extracted into mixins. Not a boundary mechanism, but effective for reducing duplication within a bounded context.

Service objects. Encapsulate a business operation in a single class. Good for complex workflows, but they do not create boundaries – they live inside them.

Packwerk. Static boundary analysis without runtime isolation. If your problem is “we want to detect boundary violations” rather than “we need hard enforcement,” Packwerk gives you most of the benefit at a fraction of the cost.

Plain Ruby gems. If the module has no Rails dependencies, a gem gives you complete isolation with minimal ceremony. A pricing calculator, a tax engine, a PDF generator – these are gems, not engines.

Each of these tools has a place. The mature Software Engineer asks “what is the cheapest tool that solves my actual problem?” rather than “what is the most architecturally pure solution?”

Knowing when not to use a tool is a sign of mastery, not timidity. The best architectures are not the ones with the most boundaries. They are the ones where every boundary earns its keep.

Honesty about what your application actually needs – not what it might need someday – is the beginning of architectural maturity.

This was adapted from Chapter 15 of Modular Rails: Architecture for the Long Game. The book covers the full analysis including performance overhead, boot time impact, memory considerations, and route compilation costs.

Read the entire book free on the web — every chapter, no paywall. Prefer print or Kindle? Amazon US · Amazon UK · all editions & prices.

Testing Strategy for a Modular Rails Application

Tue, 02 Jun 2026 00:00:00 +0000

This is an adapted excerpt from Chapter 13 of Modular Rails: Architecture for the Long Game, my book on building maintainable Ruby on Rails applications using Rails Engines.

Your test suite takes 40 minutes. Mine takes 4.

That is not because I write fewer tests. It is because when I change billing code, I only run billing tests. When I change notification code, I only run notification tests. The full suite runs in CI, but a Software Engineer working on a single engine gets feedback in seconds, not minutes.

This is the testing payoff of a modular architecture. But it does not happen automatically. Engines need a deliberate testing strategy – one that preserves isolation while catching integration failures.

The Testing Pyramid for Engines

The testing pyramid for a modular Rails application has an extra dimension: scope.

graph TB
    subgraph "Testing Pyramid"
        E2E["End-to-End Tests<br/>(host app, few)"]
        INT["Integration Tests<br/>(cross-engine, some)"]
        CONTRACT["Contract Tests<br/>(engine boundaries, moderate)"]
        UNIT["Unit Tests<br/>(inside engine, many)"]
    end

    E2E --- INT
    INT --- CONTRACT
    CONTRACT --- UNIT

    style E2E fill:#d9654a,stroke:#8a3a2c,color:#fff
    style INT fill:#e8a838,stroke:#b07828,color:#fff
    style CONTRACT fill:#4a90d9,stroke:#2c5f8a,color:#fff
    style UNIT fill:#27ae60,stroke:#1e8449,color:#fff

The bottom of the pyramid – unit tests inside an engine – should be the vast majority of your tests. These run fast because they only load the engine, not the entire application. Contract tests verify that the interfaces between engines work correctly. Integration tests confirm that engines compose properly. End-to-end tests are few and focused on critical user journeys.

The Dummy App and RSpec Setup

Every engine generated with rails plugin new includes a test/dummy application (or spec/dummy, if you pass --dummy-path to put it there). This is a minimal Rails app that mounts your engine, providing just enough context to run tests without loading your full application.

Here is a typical spec/rails_helper.rb for an engine:

# engines/billing/spec/rails_helper.rb
require "spec_helper"

ENV["RAILS_ENV"] ||= "test"
require File.expand_path("dummy/config/environment", __dir__)

abort("The Rails environment is running in production mode!") if Rails.env.production?

require "rspec/rails"

# Load engine factories
Dir[Billing::Engine.root.join("spec/factories/**/*.rb")].each { |f| require f }

ActiveRecord::Migration.maintain_test_schema!

RSpec.configure do |config|
  config.fixture_paths = [Billing::Engine.root.join("spec/fixtures")]
  config.use_transactional_fixtures = true
  config.infer_spec_type_from_file_location!
  config.filter_rails_from_backtrace!
end

Notice that this helper loads the dummy app, not your real application. The engine’s tests are completely self-contained. They boot in a fraction of the time because they only load billing code, not the 200 models from the rest of your application.

Engine Factory Setup

Factories need special attention in a modular application. Each engine should define its own factories, and those factories should only reference models within the engine:

# engines/billing/spec/factories/invoices.rb
FactoryBot.define do
  factory :invoice, class: "Billing::Invoice" do
    sequence(:number) { |n| "INV-#{n.to_s.rjust(6, '0')}" }
    amount { 99.99 }
    currency { "GBP" }
    status { :draft }
    user_id { 1 }  # Simple foreign key, no User factory dependency
  end
end

The key decision here is user_id { 1 } instead of association :user. The billing engine should not depend on a User factory from the core application. It only needs a valid foreign key. This keeps the engine’s tests truly independent.

Contract Tests: The Boundary Guarantee

Contract tests verify that engines honour their interfaces. They are the most important and most underused testing pattern in modular applications.

Here is a concrete example. Your billing engine expects that any “billable” object responds to certain methods:

# engines/billing/spec/contracts/billable_contract.rb
RSpec.shared_examples "a billable" do
  it { is_expected.to respond_to(:email) }
  it { is_expected.to respond_to(:billing_name) }
  it { is_expected.to respond_to(:stripe_customer_id) }
  it { is_expected.to respond_to(:billing_address) }
end

The billing engine defines this contract. Any model that wants to be billable must pass it:

# In the host app or core engine
RSpec.describe User do
  it_behaves_like "a billable"
end

Now here is where contract tests prove their worth. Imagine you upgrade the billing engine and add a new requirement to the billable interface:

# After upgrade: billing engine v2.0
RSpec.shared_examples "a billable" do
  it { is_expected.to respond_to(:email) }
  it { is_expected.to respond_to(:billing_name) }
  it { is_expected.to respond_to(:stripe_customer_id) }
  it { is_expected.to respond_to(:billing_address) }
  it { is_expected.to respond_to(:tax_id) }  # New in v2.0
end

When the host app runs its tests, the contract test fails immediately:

Failures:

  1) User behaves like a billable is expected to respond to :tax_id
     Failure/Error: it { is_expected.to respond_to(:tax_id) }
       expected #<User> to respond to :tax_id

The failure is clear, specific, and caught before deployment. Without contract tests, this would surface as a NoMethodError in production the first time the billing engine calls tax_id on a user.

Selective Test Execution

The real speed gain comes from only running the tests that matter. Here is a script that determines which engines were affected by a change and runs only their tests:

#!/bin/bash
# scripts/run_affected_tests.sh
# Runs tests only for engines that changed since the base branch

# Stop at the first failing suite so the script exits non-zero in CI
set -eo pipefail

BASE_BRANCH=${1:-main}
CHANGED_FILES=$(git diff --name-only "$BASE_BRANCH"...HEAD)

AFFECTED_ENGINES=()
# Read line by line so paths containing spaces are not split
while IFS= read -r file; do
  if [[ $file == engines/* ]]; then
    engine=$(echo "$file" | cut -d'/' -f2)
    # Record each engine once
    if [[ ! " ${AFFECTED_ENGINES[*]} " =~ " ${engine} " ]]; then
      AFFECTED_ENGINES+=("$engine")
    fi
  fi
done <<< "$CHANGED_FILES"

if [ ${#AFFECTED_ENGINES[@]} -eq 0 ]; then
  echo "No engine changes detected. Running host app tests only."
  bundle exec rspec spec/
else
  echo "Affected engines: ${AFFECTED_ENGINES[*]}"
  for engine in "${AFFECTED_ENGINES[@]}"; do
    echo "--- Testing $engine ---"
    (cd "engines/$engine" && bundle exec rspec)
  done
  echo "--- Testing host app integration ---"
  bundle exec rspec spec/
fi

This script is the bridge between local development speed and CI thoroughness. Locally, a Software Engineer runs only the affected engine’s tests. In CI, you can run this script for PR builds while running the full suite on merge to main.

CI Flow

Your CI pipeline should reflect the modular structure:

graph LR
    PR["Pull Request"] --> DETECT["Detect Changed<br/>Engines"]
    DETECT --> E1["Test Engine A"]
    DETECT --> E2["Test Engine B"]
    DETECT --> E3["Test Engine C"]
    E1 --> INT["Integration Tests"]
    E2 --> INT
    E3 --> INT
    INT --> MERGE["Merge"]

    style PR fill:#4a90d9,stroke:#2c5f8a,color:#fff
    style DETECT fill:#e8a838,stroke:#b07828,color:#fff
    style E1 fill:#27ae60,stroke:#1e8449,color:#fff
    style E2 fill:#27ae60,stroke:#1e8449,color:#fff
    style E3 fill:#27ae60,stroke:#1e8449,color:#fff
    style INT fill:#8e44ad,stroke:#6c3483,color:#fff
    style MERGE fill:#27ae60,stroke:#1e8449,color:#fff

Each engine’s tests run in parallel. Only the affected engines are tested on PR builds. Integration tests run after engine tests pass. The full suite runs as a merge gate.

Why Your Test Suite Gets Faster

Let’s do the arithmetic. Suppose your monolithic test suite has 3,000 tests that take 40 minutes. You extract the application into 5 engines with roughly equal test distribution:

Each engine: ~600 tests, ~8 minutes
Integration tests: ~200 tests, ~5 minutes
Parallel engine execution: 8 minutes (all 5 run simultaneously)
Total CI time: 8 + 5 = 13 minutes (down from 40)

But the real win is local development. A Software Engineer working on billing runs 600 tests in 8 minutes instead of 3,000 tests in 40 minutes. And because the engine boots faster (no loading 200 unrelated models), those 600 tests often run in under 4 minutes.

The arithmetic only gets better as the application grows. Adding a new engine does not slow down existing engine tests. Each engine’s test time stays constant while the monolithic suite would keep growing.
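The same arithmetic can be written as a small calculation. This is a sketch using the illustrative figures above. The function name, and the assumption that every engine suite gets its own parallel worker, are mine.

```ruby
# Wall-clock CI time when engine suites run in parallel and the
# integration suite runs afterwards. Assumes one worker per engine.
def ci_wall_time(engine_minutes, integration_minutes)
  engine_minutes.max + integration_minutes
end

engines = [8, 8, 8, 8, 8]           # five engines, ~8 minutes each
modular = ci_wall_time(engines, 5)  # integration tests: ~5 minutes
monolith = 40

puts "Modular CI: #{modular} min"        # Modular CI: 13 min
puts "Saved: #{monolith - modular} min"  # Saved: 27 min

# A sixth engine of the same size leaves the wall-clock time unchanged
puts ci_wall_time(engines + [8], 5)      # 13
```

The max in ci_wall_time is the whole argument: parallel time is set by the slowest engine, not by the number of engines.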

That is why your test suite takes 40 minutes and mine takes 4. Not cleverness. Structure.

This was adapted from Chapter 13 of Modular Rails: Architecture for the Long Game. The book covers the full testing strategy including SimpleCov configuration, Capybara setup, database cleaning, CI YAML examples, and automated quality tools.


The Modular Monolith as the Default Starting Point

Tue, 26 May 2026 00:00:00 +0000

This is an adapted excerpt from Chapter 17 of Modular Rails: Architecture for the Long Game, my book on building maintainable Ruby on Rails applications using Rails Engines.

“Majestic monolith. The vast majority of web applications should start here and never leave.” – David Heinemeier Hansson

The microservices conversation has been going on for over a decade now, and the industry is starting to reach a consensus that most teams arrived at too late: distributed systems are expensive, and the default starting point should be a well-structured monolith.

This chapter makes the case that a modular monolith – specifically, a Rails application structured with engines – is the right default for most teams. Not because microservices are bad, but because the operational cost of distribution is almost always underestimated.

The Operational Cost Nobody Talks About

Consider a simple operation: recording a payment. In a monolith, this is a method call:

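# Monolith: one process, one database, one transaction
class PaymentsController < ApplicationController
  def create
    payment = nil

    # Wrap the writes so a failure in any of them rolls all of them back
    ActiveRecord::Base.transaction do
      payment = Payment.create!(payment_params)
      Invoice.find(payment.invoice_id).mark_paid!
      AuditLog.record(:payment_created, payment)
    end

    # Enqueue the email only once the transaction has committed
    NotificationMailer.payment_received(payment).deliver_later

    render json: payment, status: :created
  end
end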

Four operations, one request, one database transaction. If any step fails, the transaction rolls back. The code is straightforward to write, straightforward to test, and straightforward to debug.

Now consider the same operation in a microservices architecture:

# Microservices: four services, four databases, eventual consistency
class PaymentsController < ApplicationController
  def create
    payment = PaymentService.create(payment_params)

    # Synchronous call to billing service
    response = BillingClient.mark_invoice_paid(
      payment.invoice_id,
      idempotency_key: SecureRandom.uuid
    )
    raise BillingServiceError unless response.success?

    # Asynchronous event for notification service
    EventBus.publish("payment.created", {
      payment_id: payment.id,
      user_id: payment.user_id,
      amount: payment.amount
    })

    # Asynchronous event for audit service
    EventBus.publish("payment.created.audit", {
      payment_id: payment.id,
      action: :created,
      timestamp: Time.current.iso8601
    })

    render json: payment, status: :created
  rescue BillingClient::TimeoutError
    # What do we do? Payment is recorded but invoice is not marked paid.
    # Retry? Compensate? Queue for later?
    CompensationJob.perform_later(:payment_billing_sync, payment.id)
    render json: payment, status: :accepted  # 202, not 201
  rescue EventBus::PublishError
    # Payment and invoice are updated but notifications may not fire.
    # Is this acceptable? Depends on the business rules.
    FailedEventJob.perform_later("payment.created", payment.id)
    render json: payment, status: :created
  end
end

The same four operations now involve network calls, serialisation, idempotency keys, timeout handling, compensation logic, and eventual consistency. The code is three times longer, but more importantly, the failure modes have multiplied. What happens when the billing service is down? What happens when the event bus loses a message? What happens when the compensation job fails?

graph LR
    subgraph "Monolith"
        M1["Controller"] --> M2["Payment"]
        M1 --> M3["Invoice"]
        M1 --> M4["Mailer"]
        M1 --> M5["AuditLog"]
    end

    subgraph "Microservices"
        S1["Payment Service"] -->|"HTTP"| S2["Billing Service"]
        S1 -->|"Event Bus"| S3["Notification Service"]
        S1 -->|"Event Bus"| S4["Audit Service"]
        S2 -->|"timeout?"| S5["Compensation Job"]
        S3 -->|"lost message?"| S6["Retry Queue"]
    end

    style M1 fill:#27ae60,stroke:#1e8449,color:#fff
    style M2 fill:#27ae60,stroke:#1e8449,color:#fff
    style M3 fill:#27ae60,stroke:#1e8449,color:#fff
    style M4 fill:#27ae60,stroke:#1e8449,color:#fff
    style M5 fill:#27ae60,stroke:#1e8449,color:#fff
    style S1 fill:#e8a838,stroke:#b07828,color:#fff
    style S2 fill:#e8a838,stroke:#b07828,color:#fff
    style S3 fill:#e8a838,stroke:#b07828,color:#fff
    style S4 fill:#e8a838,stroke:#b07828,color:#fff
    style S5 fill:#d9654a,stroke:#8a3a2c,color:#fff
    style S6 fill:#d9654a,stroke:#8a3a2c,color:#fff

Every arrow in the microservices diagram is a potential failure point. Every potential failure point needs handling code, monitoring, alerting, and runbooks.

Companies That Came Back

The most compelling argument for starting with a monolith comes from companies that tried microservices and came back:

Amazon Prime Video published a case study in 2023 describing how they moved from a distributed microservices architecture to a monolith for their video quality monitoring tool – and reduced costs by 90% while improving throughput. The distributed architecture created bottlenecks at service boundaries that vanished when the code ran in a single process.

Segment famously migrated from a microservices architecture back to a monolith after discovering that the operational overhead of managing well over a hundred microservices was consuming more engineering time than feature development. Their engineering team wrote candidly about how the microservices architecture that was supposed to enable faster development had become a tax on every team.

Istio, the service mesh project, consolidated from multiple microservices into a single binary called Istiod. Their blog post explained that the microservices architecture added operational complexity without meaningful benefits at their scale.

Shopify – one of the largest Rails applications in the world – chose a modular monolith over microservices. They invested heavily in Packwerk and component-based architecture rather than splitting into services. Their reasoning: the cost of network boundaries was not justified by the organisational benefits.

Engines as a Stepping Stone

The modular monolith gives you the best of both worlds. You get the organisational benefits of clear boundaries – team ownership, independent development, focused testing – without the operational cost of distribution.

And critically, engines preserve the option to extract services later. An engine with a clean interface can become a microservice when (and only when) the operational cost is justified by a genuine need.
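What a “clean interface” buys you can be sketched like this: callers depend on one public module, and the implementation behind it can be swapped from an in-process call to a remote client without touching them. Every name here (Billing.mark_invoice_paid, the two adapter classes, the faked remote call) is hypothetical and only meant to show the shape.

```ruby
module Billing
  # In-process adapter: what the engine does while it lives in the monolith.
  class LocalAdapter
    def mark_invoice_paid(invoice_id)
      { invoice_id: invoice_id, status: "paid", via: :local }
    end
  end

  # Remote adapter: what the same interface could look like after extraction.
  # The network call is faked so the sketch stays self-contained.
  class RemoteAdapter
    def mark_invoice_paid(invoice_id)
      { invoice_id: invoice_id, status: "paid", via: :http }
    end
  end

  class << self
    attr_writer :adapter

    # Default to the in-process implementation.
    def adapter
      @adapter ||= LocalAdapter.new
    end

    # The only method other engines are meant to call.
    def mark_invoice_paid(invoice_id)
      adapter.mark_invoice_paid(invoice_id)
    end
  end
end

puts Billing.mark_invoice_paid(42)[:via]

# Extraction day: swap the adapter, callers stay the same.
Billing.adapter = Billing::RemoteAdapter.new
puts Billing.mark_invoice_paid(42)[:via]
```

If other engines reach past this kind of facade into Billing's models directly, the swap is no longer a one-line change, which is why the interface matters more than the engine packaging.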

graph LR
    A["Monolith<br/>(everything in app/)"] -->|"Step 1:<br/>Add structure"| B["Modular Monolith<br/>(engines)"]
    B -->|"Step 2:<br/>Only if needed"| C["Selective Extraction<br/>(one engine becomes<br/>a service)"]
    C -->|"Step 3:<br/>Rarely needed"| D["Distributed System<br/>(multiple services)"]

    style A fill:#d9654a,stroke:#8a3a2c,color:#fff
    style B fill:#27ae60,stroke:#1e8449,color:#fff
    style C fill:#e8a838,stroke:#b07828,color:#fff
    style D fill:#4a90d9,stroke:#2c5f8a,color:#fff

Most applications never get past step 2. And that is perfectly fine. The goal is not to arrive at microservices. The goal is to have a codebase that is maintainable, testable, and adaptable to whatever comes next.

The Decision Framework

When the microservices conversation comes up – and it will – use this framework:

flowchart TD
    A["Do you have a scaling problem<br/>that cannot be solved with<br/>vertical scaling?"] -->|Yes| B["Is the problem isolated<br/>to a specific domain?"]
    A -->|No| C["Stay with<br/>modular monolith"]

    B -->|Yes| D["Do you have the team<br/>and infrastructure to<br/>operate a distributed system?"]
    B -->|No| C

    D -->|Yes| E["Extract that one<br/>domain as a service"]
    D -->|No| F["Invest in infrastructure<br/>first, extract later"]

    style A fill:#e8a838,stroke:#b07828,color:#fff
    style B fill:#e8a838,stroke:#b07828,color:#fff
    style C fill:#27ae60,stroke:#1e8449,color:#fff
    style D fill:#e8a838,stroke:#b07828,color:#fff
    style E fill:#4a90d9,stroke:#2c5f8a,color:#fff
    style F fill:#8e44ad,stroke:#6c3483,color:#fff

The framework is deliberately conservative. Every “No” keeps you on the monolith, at least for now, because the default should be the simpler architecture. You only move to a distributed system when you have a specific, measurable problem that cannot be solved any other way, and the team and infrastructure to support it.

Start with a modular monolith. Structure it well. Extract when – and only when – the evidence demands it.

This was adapted from Chapter 17 of Modular Rails: Architecture for the Long Game. The book covers the full microservices question including network latency, debugging, data consistency, and the complete decision framework.

For the bigger picture — engines, Packwerk, data ownership and the full set of trade-offs in one place — see The Modular Monolith in Rails.


Spec is the Artefact

Fri, 22 May 2026 00:00:00 +0000

A passing test tells you the implementation is correct. The second-order question — was the work behind this code the work we meant to do — is the one comprehension debt and the perception gap have both been circling. This post is the third leg.

The argument of the first two posts was diagnostic. Teams using AI-assisted code accrue a gap between what exists in the codebase and what anyone on the team understands; the mechanism that hides the gap from inside is a perception failure on both sides of the review. Neither post offered a remedy. The closing of the second one promised this one would.

There are several remedies that would be defensible at this point — more local architecture, mentor-model review, a financial reframing of the conversation with leadership. They are all real moves. The argument here is for the one I have found most leverage in: change what the primary artefact is.

The wrong primary artefact

The implicit model for most code review, AI-assisted or not, is that code is the artefact. The author produces it. The reviewer evaluates it. Tests guard it. CI signs off on it. Everything is oriented to the diff.

This worked, more or less, when the human writing the code carried the intent in their head. Code was a lossy projection of intent, and the reviewer could partially reconstruct the projection because both author and reviewer had been trained to read code as if it spoke for the work behind it. The PR description filled in what the code didn’t say.

When the AI produces the code, the human author no longer carries the intent in the same way. The intent was in the prompt, in the chat session, in the back-and-forth — most of it gone by the time the diff lands in the reviewer’s queue. What remains is what the previous post called an appearance signal — visible enough to be trusted, opaque about whether the work behind it cohered. It looks like the work has been done. It cannot be inspected to confirm the work was done.

The PR description is in the same category. A thoughtful PR description is the human side of a structured reasoning trace. The reviewer is now holding two appearance signals and no ground truth.

Spec is the artefact

The structural move that follows is small and unfashionable. Stop treating the code as the primary artefact. Treat the specification as the primary artefact — the contract the implementation is meant to honour, written by a human before the AI touches anything. Code becomes implementation detail. Review becomes verification against the contract.

The framing has a forty-year lineage. Bertrand Meyer’s Applying Design by Contract (IEEE Computer, 1992) made the same structural argument for Eiffel: a routine is the contract it promises to honour, the implementation is detail. What is new is what the cost of skipping the contract has become. Under human authoring, the cost was future debugging. Under AI authoring, the cost is that the reviewer cannot tell whether the work was done at all.

The current vocabulary is Spec-Driven Development, the framing The Serious CTO uses in the talk this trilogy has drawn vocabulary from. GitHub’s Spec Kit calls the same thing a project constitution: non-negotiable principles around code quality, testing, user experience, and performance, baked in before generation begins. Caporusso & Perdue (ISCAP 2025) compared direct prompting against requirements-first prompting across seven LLMs and found that structured requirements improved code quality — early empirical support for a move whose case is still mostly programmatic.

The point is not new tooling. The point is that the artefact a reviewer is asked to evaluate now sits above the code, in a layer the human author still authored. The author still carries intent — into the spec, where it is durable, rather than into the chat session, where it is not.

Why this compresses the perception gap

The perception gap was structural. Author and reviewer held different artefacts in working memory. The author held the description, the boundaries, the tests they wrote. The reviewer held the diff. Neither could feel what they were costing the other.

When the spec is the artefact, this asymmetry compresses. The author and the reviewer are both oriented to the same object — the contract. The author wrote it; the reviewer is evaluating whether the implementation honours it. The cognitive load on the reviewer is bounded by spec size, not diff size. An eighteen-hundred-line implementation of a one-page spec is reviewable in a way an eighteen-hundred-line diff with a thoughtful description is not.

This also removes the silent disagreement about what the work is. The author and the reviewer can disagree about whether the implementation honours the contract — that is a productive disagreement, scoped to a shared artefact. The perception gap depended on them disagreeing about what the work had been at all.

What this looks like in practice

“Spec” is doing several jobs at once here, and it is worth being honest about it. Some specs are prose contracts a human writes out in English. Some are executable. Some are inferred from a type system. Each lives at a different level of formality. What they share, under AI authoring, is that they are independently checkable artefacts of intent — the only parts of the loop that are not appearance-of-thought signals.

In a Ruby codebase, the move is mostly elevating instruments the team already has.

RSpec and the discipline behind it. A well-written test is a spec of behaviour at the granularity the team has chosen. The shift is in workflow order — write the contract as a test before the AI drafts the implementation, then accept the implementation only when it honours the test. This is TDD without nostalgia; it works at LLM speed because the spec lives somewhere the AI cannot edit during generation.
Property-based testing. Claessen & Hughes (2000) framed properties as specifications of permitted output shapes — the implementation is graded by random sampling against the property, not by the cases the author happened to enumerate. Where example-based tests check the cases you remembered, property tests check the cases the AI might have missed.
Consumer-driven contracts at service boundaries. Ian Robinson’s 2006 framing — and twenty years of Pact in production — capture exactly the property contracts need to have under AI authoring: durable, executable, and owned by both sides. A boundary contract is a spec the implementation cannot edit on its way through.
Architectural decision records. Michael Nygard’s 2011 piece already named the problem the trilogy is circling: “one of the hardest things to track during the life of a project is the motivation behind certain decisions.” An ADR is the spec for the next change, written by the team that owns the consequences.
Type systems where you have them. Sorbet and RBS in Ruby, or static typing in any other language, are specs the compiler verifies for free. Mündler et al. (PLDI 2025) found that in TypeScript, 94% of compilation errors in LLM-generated code are type-check failures — useful evidence that the instrument pays off where it exists, and a useful caution that dynamic languages do not get this protection for free and have to recover the discipline elsewhere.
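
To make the property-based item concrete, here is a minimal sketch in plain Ruby. It hand-rolls the random sampling instead of using QuickCheck or a Ruby property-testing gem, and `normalise_tags`, `PROPERTIES`, and `check_properties` are hypothetical names chosen for illustration. The properties are the human-written contract; the function stands in for an AI-drafted implementation.

```ruby
# Stand-in for an AI-drafted implementation (hypothetical function).
def normalise_tags(tags)
  tags.map { |tag| tag.strip.downcase }.reject(&:empty?).uniq.sort
end

# The contract: properties every output must satisfy, whatever the input.
PROPERTIES = {
  "idempotent"              => ->(_input, output) { normalise_tags(output) == output },
  "sorted and unique"       => ->(_input, output) { output == output.uniq.sort },
  "no blank or padded tags" => ->(_input, output) { output.none? { |t| t.empty? || t != t.strip } },
}.freeze

# Builds a random list of messy tags from a small, fixed alphabet.
def random_tags(rng)
  alphabet = ["a", "B", " c ", "", "  ", "Ruby", "ruby ", "RAILS"]
  Array.new(rng.rand(0..6)) { alphabet.sample(random: rng) }
end

# Samples inputs with a seeded generator and returns every [property, input]
# pair that failed. An empty array means no counterexample was found.
def check_properties(runs: 200, seed: 42)
  rng = Random.new(seed)
  failures = []
  runs.times do
    input = random_tags(rng)
    output = normalise_tags(input)
    PROPERTIES.each do |name, property|
      failures << [name, input] unless property.call(input, output)
    end
  end
  failures
end
```

Calling `check_properties` returns an empty array for this implementation. A redrafted `normalise_tags` that forgot to strip whitespace would show up as a failing pair, even though nobody wrote that case down as an example.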

None of these are new. What is new is the framing. Under code-as-artefact, these instruments are quality-of-life. Under spec-as-artefact, they are structurally load-bearing: they are the parts of the loop in which intent stays durable.

What this doesn’t fix

The obvious objection is that the AI will draft the spec too, and the perception gap re-enters one layer up. The objection is real. Zietsman (2026) calls this the correlated-failure problem: without an external reference, the generating agent and the reviewing agent share the same training distribution, and “the review checks code against itself, not against intent.”

The defence is that a specification a human authored — even one the AI drafted and the human pruned — sits in a different cognitive position than code the human glanced at. The specs worth keeping are the ones the author can answer questions about. That discipline is exactly what the previous post called comprehension, surfaced one layer up.

The honest counter is that the same deadline pressure that produced comprehension debt will erode spec-first discipline too. Kuutila et al.’s systematic review of time pressure in software engineering found that quality assurance is the practice that bends first under load — and spec-first work is QA upstream of itself. The argument is not that spec-first survives the pressure without institutional support. It is that under spec-first, the practice that bends first is visible, named, and budgetable.

The trilogy

I have started writing specs first on the work I review. I have not finished. The discipline is harder than the writing it replaces, because it forces decisions I was making by inference earlier — what is in scope, what is out of scope, what counts as success. The work surfaces.

Three sentences for the trilogy. Comprehension debt is what AI-assisted teams accrue when generation outruns understanding. The perception gap is what hides the debt from inside the team. The structural response is to move the artefact one layer up — to write the contract first, and to commit to keeping it there.

This is one shape of answer. There are others; the architectural ones and the financial ones are both real. I have argued for this one because it makes the smallest change to the workflow and the largest change to what is being reviewed. The discipline is fragile. It is also the only one I have found that compresses the gap.

Series: Comprehension Debt · The Perception Gap · Spec is the Artefact (this post).

Sources: Meyer — Applying Design by Contract (IEEE Computer, 1992) · Mündler et al. — Type-Constrained Code Generation with Language Models (PLDI 2025) · Caporusso & Perdue — ISCAP 2025 · Claessen & Hughes — QuickCheck (ICFP 2000) · Robinson — Consumer-Driven Contracts (2006) · Nygard — Documenting Architecture Decisions (2011) · Zietsman — The Specification as Quality Gate (2026) · Kuutila et al. — Time Pressure in Software Engineering (2020) · GitHub’s Spec Kit. Vocabulary from The Serious CTO’s video on the hidden cost of AI coding.

The Perception Gap

Thu, 21 May 2026 00:00:00 +0000

An engineer opens a pull request. It is eighteen hundred lines across roughly fifteen files. The description has the kind of structure you write when you mean it — what changed, why, what you would push back on if you were the reviewer. They have thought about it. They feel organised. They are organised, from where they’re sitting.

The reviewer opens it and is being asked to evaluate two or three architectural decisions and several new features in one sitting. The diff is too large to hold in working memory. The PR description helps, but only at the level it summarises — the line-by-line judgement is still on the reviewer. They are also organised, in the way the work has arrived at their desk.

Both of them are right about how organised they are. Neither of them is in a position to feel what they are costing the other.

This is the perception gap. It runs through teams building seriously with AI-assisted code, and from inside it is hard to see.

What the research says

In a 2025 study by METR, sixteen experienced open-source developers were timed on 246 real tasks. They predicted AI would speed them up by 24%. It slowed them down by 19%. After the slowdown was measured, they still reported feeling 20% faster.

The sample is small and the headline slowdown finding is contested: METR themselves posted an update in early 2026 acknowledging selection-bias problems that may mean their numbers underestimate AI speedup. What the update doesn’t undermine is the perception finding: a 39-point gap between the 20% speed-up developers reported feeling after the experiment and the 19% slowdown measured during it. Developers who had just been timed being slower still believed they had been faster.

The interesting finding is not the slowdown itself. The interesting finding is that the developers could not perceive it. This is a calibration problem of a familiar shape — the literature on self-assessment is decades old — with a new accelerant. The feedback loop that would ordinarily tell an experienced engineer “you should adjust how you’re working” was gone.

It doesn’t stop at the developer

The METR study measured solo developers on tasks. The same shape extends past the solo case in a way the study didn’t measure: the perception gap operates not just within a single developer using AI, but between the AI-assisted author and the human reviewer.

I have spent recent sprints on the receiving end of this. A single AI-assisted author on one of the codebases I review put up tens of thousands of net lines in a fortnight, across PRs whose individual diffs comfortably exceeded the eighteen hundred lines I opened with. The PRs were thoughtful: tight commit messages, sensible scoping, descriptions that named what changed. The volume was not reviewable at human pace. What I felt as a reviewer — that I was perpetually catching up, that the diff was always larger than the attention I could supply — is the same shape the METR developers reported on themselves, only viewed from the other side of the keyboard.

The author writes a thoughtful PR. They feel organised because they are organised — by the description they wrote, by the boundaries they named, by the tests they added. The artefacts of their thinking are visible to them. What is not visible to them is the cognitive cost of the whole change held in the reviewer’s head at the same time.

The reviewer experiences something else. The description helps but does not substitute. Two or three architectural decisions plus several new features in one diff is a working-memory load no amount of structure in the PR body resolves. The reviewer is not lazy, not slow, not failing; the diff is simply asking for a kind of attention the human reviewer cannot supply at the rate the PR was assembled.

Both authors and reviewers are honest about their experience. Neither has the feedback that would let them correct the other. The author keeps shipping at one rate, the reviewer keeps absorbing at theirs, and the mismatch compounds across every PR until something visible breaks.

Where the cost shows up

The cost is real. It appears in places leaders aren’t watching closely.

The 2024 DORA report found that a 25% increase in AI adoption corresponded with a 7.2% decrease in delivery stability and a 1.5% decrease in throughput. Independent industry telemetry from Faros AI, sampling ten thousand-plus developers across more than a thousand teams, puts the throughput side of the same picture in starker terms: high-AI-adoption teams merging substantially more pull requests while code review time goes up. The dashboards leaders watch for “AI is working” — PR volume, individual velocity — were green. The dashboard for whether the system itself was holding together was quietly drifting in the other direction.

Veracode’s 2025 GenAI Code Security Report tested over 100 large language models against 80 code-completion tasks designed to surface OWASP Top 10 vulnerabilities. 45% of the generated code contained security flaws. AI failed to defend against cross-site scripting in 86% of relevant samples. Java fared worst, at a 72% security failure rate.

GitClear’s analysis of 211 million lines of code found that 2024 was the first year on record where copy-pasted code exceeded refactored code. Duplicated code blocks of five or more lines rose roughly eightfold over the year, while moved code fell from around a quarter of all changes in 2021 to under 10% in 2024. The code was growing faster than it was being shaped.

The pattern isn’t only on the human side. Apple’s Illusion of Thinking (2025) found that large reasoning models reduce their reasoning effort as problem complexity grows past a threshold, despite having token budget to spare — and raised the question of whether the visible reasoning trace reflects reasoning or its appearance. (The headline accuracy-collapse finding has been contested on methodology grounds; the effort-decline result is the part the critique didn’t touch.) The author’s structured PR description and the model’s structured reasoning trace are artefacts of the same kind — visible enough to be trusted, opaque about whether the work behind them held. The reviewer is downstream of both.

Four independent sources across three facets (throughput, security, code shape), all measuring outcomes consistent with a perception gap.

The pattern that ties them together is what I called comprehension debt in the previous post: the gap between how much code exists in your system and how much of it any human actually understands. If the perception gap is the mechanism, comprehension debt is the form it takes. The author shipped code, the reviewer approved it, the dashboards stayed green — and at the end of all that, fewer people on the team can explain what the system is doing than before.

What to do about it

I am in the vignette above. The reviewer is me as often as the author is, and both seats are familiar. When AI has sped up my own work, the speed-up has felt real. So has the cost on the other side of my own multi-thousand-line PRs. Both kinds of experience happened to the same engineer. Neither corrected the other in time to change the next one.

The dashboards leaders look at are not pointed at this. PR volume, individual velocity, lead time — none of them measure the cognitive split between the two sides of a review, or the fraction of last quarter’s shipped code anyone on the team could explain at 2am without opening the file. There is no automated way to measure that second number; it requires asking, which is part of why it doesn’t get measured.

A question worth taking to your team this week, then. Look at the last three large pull requests you approved. Without re-reading the diff, can you re-derive why the architectural decisions in each had to go that way? If the answer is no for two of them, the perception gap is already operating in your codebase, and your dashboards haven’t told you. The structural response is the subject of the next post.

Data drawn from: METR — Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity (2025) and METR’s Feb 2026 design update · DORA — Accelerate State of DevOps Report 2024 · Veracode — 2025 GenAI Code Security Report · GitClear — AI Copilot Code Quality 2025 Research · Apple — The Illusion of Thinking (2025).

Vocabulary from The Serious CTO’s videos on AI killing code review, where developer time actually goes, and the hidden cost of AI coding.

Rails Engines vs Packwerk: When to Use What

Tue, 19 May 2026 00:00:00 +0000

This is an adapted excerpt from Chapter 16 of Modular Rails: Architecture for the Long Game, my book on building maintainable Ruby on Rails applications using Rails Engines.

Rails engines are not the only way to introduce structure into a monolith. Packwerk, plain Ruby gems, namespaces, and even Hanami slices offer different trade-offs. The question is not which tool is “best” – it is which tool fits the problem you actually have.

This post focuses on the comparison that comes up most often: Rails engines versus Packwerk.

Packwerk: Static Boundary Enforcement

Packwerk, created by Shopify, takes a fundamentally different approach to modularity. Instead of runtime isolation (separate load paths, independent gemspecs, mountable routes), Packwerk enforces boundaries at analysis time through static checks.

A package in Packwerk is a directory with a package.yml file:

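# components/billing/package.yml
# Report references to packages not listed under `dependencies`.
enforce_dependencies: true
# Report references to this package's non-public constants.
# (From Packwerk 3.0 the privacy checker ships in the packwerk-extensions gem.)
enforce_privacy: true
dependencies:
  - components/core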

That is the entire configuration. The directory structure stays inside your existing app/ folder. There are no gemspecs, no dummy apps, no mountable routes. You add Packwerk to an existing application and draw boundaries around code that already exists.

Packwerk then analyses your code statically – without running it – and reports violations:

components/notifications/app/models/notifications/mailer.rb:12
  Billing::Invoice is private to components/billing

The violation tells you that the notification mailer is reaching into billing’s internals. You fix it by either making Invoice part of billing’s public API or by introducing an interface.

The Public API Pattern

Both engines and Packwerk benefit from explicit public APIs, but Packwerk makes this a first-class concept. You mark classes as public by placing them in a public/ directory within your package:

components/billing/
  app/
    models/
      billing/
        invoice.rb          # private
        line_item.rb         # private
        payment_gateway.rb   # private
    public/
      billing/
        charge_customer.rb   # public API
        invoice_summary.rb   # public API
  package.yml

Other packages can only reference Billing::ChargeCustomer and Billing::InvoiceSummary. Any reference to Billing::Invoice directly triggers a violation. This is a powerful pattern – it forces you to think about what your module exposes rather than what it contains.

Engines can achieve the same thing through convention and code review, but Packwerk enforces it automatically.
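
The public API pattern can be sketched in plain Ruby. This is a minimal illustration, not code from the book: the `Invoice` fields, the in-memory `INVOICES` store (standing in for the database), and the `find` and `subject_for` methods are all hypothetical. The point is that the notification code only ever names `Billing::InvoiceSummary`.

```ruby
module Billing
  # Private to the package: stands in for app/models/billing/invoice.rb.
  # The fields are hypothetical.
  Invoice = Struct.new(:id, :total_cents, :paid, keyword_init: true)

  # Hypothetical in-memory store standing in for the database.
  INVOICES = { 7 => Invoice.new(id: 7, total_cents: 1200, paid: false) }.freeze

  # Public API: would live in app/public/billing/invoice_summary.rb.
  # Exposes a read-only view so callers never touch Invoice itself.
  class InvoiceSummary
    attr_reader :invoice_id, :total_cents, :status

    def initialize(invoice_id:, total_cents:, status:)
      @invoice_id  = invoice_id
      @total_cents = total_cents
      @status      = status
    end

    # Looks up the private record and returns only the public view of it.
    def self.find(invoice_id)
      invoice = INVOICES.fetch(invoice_id)
      new(
        invoice_id: invoice.id,
        total_cents: invoice.total_cents,
        status: invoice.paid ? :paid : :outstanding
      )
    end
  end
end

module Notifications
  class Mailer
    # References only the public constant, so a privacy check has nothing to flag.
    def subject_for(invoice_id)
      summary = Billing::InvoiceSummary.find(invoice_id)
      "Invoice ##{summary.invoice_id} is #{summary.status}"
    end
  end
end
```

With the sample data, `Notifications::Mailer.new.subject_for(7)` returns "Invoice #7 is outstanding". Billing stays free to reshape `Invoice` as long as the summary keeps its three readers.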

When to Use Which

Here is where it gets practical. Each tool excels in different situations:

Dimension	Rails Engines	Packwerk
Isolation	Runtime (separate load paths, gemspecs)	Static analysis only
Setup cost	Medium-high (gemspec, dummy app, routes)	Low (add gem, create package.yml)
Enforcement	Hard – code literally cannot see other engines without dependencies	Soft – violations are warnings, not errors
Migration path	Must move files, update requires	Draw boundaries around existing code
Independent testing	Yes – each engine has its own test suite	Partial – tests still run in one suite
Route isolation	Full mountable routes	No route concept
Database migrations	Can be engine-specific	Application-level only
Team ownership	Natural – each engine is a unit	Possible but requires tooling
Extraction to service	Straightforward – engine is already isolated	Requires significant refactoring

The key difference is enforcement philosophy. Engines say “you physically cannot cross this boundary.” Packwerk says “we will tell you when you cross this boundary.” Both are valid. The right choice depends on your team’s discipline and your application’s trajectory.
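
On the engine side, the boundary is declared in the gemspec. A minimal sketch, assuming a hypothetical billing engine that depends on a hypothetical core engine (the names mirror the package.yml example; the version, summary, and author are invented for illustration). The assignment to `billing_spec` is only there so the result can be inspected; a real gemspec file would simply end with the specification.

```ruby
# billing.gemspec (hypothetical engine)
billing_spec = Gem::Specification.new do |spec|
  spec.name    = "billing"
  spec.version = "0.1.0"
  spec.summary = "Billing engine"
  spec.authors = ["Example Author"]
  spec.files   = Dir["{app,config,lib}/**/*"]

  # The engine can rely only on what it declares here.
  spec.add_dependency "rails"
  spec.add_dependency "core"
end
```

When the engine's own suite runs against its dummy app, typically only these declared dependencies are on the load path, so a reference to an undeclared engine fails at load time instead of being reported by a checker.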

Brief Mentions: Other Approaches

Plain Ruby gems are the lightest-weight option. If your module has no Rails dependencies – a pricing calculator, a tax rules engine, a PDF generator – a gem gives you complete isolation with minimal overhead. No Rails, no ActiveRecord, just Ruby.

Namespaces and modules cost nothing to set up. They communicate intent – Billing::Invoice tells developers that this class belongs to the billing domain. But namespaces have zero enforcement. Nothing prevents Notifications::Mailer from calling Billing::Invoice.find(42).
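
A few lines of plain Ruby show how little a namespace enforces. The classes below are hypothetical stand-ins with no ActiveRecord behind them, and `find` just builds an object.

```ruby
module Billing
  # Hypothetical stand-in for an ActiveRecord model.
  class Invoice
    attr_reader :id

    def initialize(id)
      @id = id
    end

    def self.find(id)
      new(id)
    end
  end
end

module Notifications
  class Mailer
    # Nothing in the language objects to this cross-namespace reference.
    def invoice_for(id)
      Billing::Invoice.find(id)
    end
  end
end
```

`Notifications::Mailer.new.invoice_for(42)` runs without complaint, which is the gap the heavier tools exist to close.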

Hanami slices offer a middle ground for teams building new applications. Each slice gets its own container, dependencies, and persistence layer. The trade-off is that you are no longer writing Rails.

Each Tool’s Sweet Spot

Tool	Best for
Rails Engines	Teams that need hard boundaries, independent deployability potential, or are on the path to eventual service extraction
Packwerk	Large teams adopting modularity incrementally in an existing monolith, where moving files is too disruptive
Plain Ruby gems	Framework-agnostic domain logic with no Rails dependencies
Namespaces	Small teams with strong conventions, or as a stepping stone to stronger boundaries
Hanami slices	New applications where the team is willing to move beyond Rails conventions

Layering Your Tools

These tools are not mutually exclusive. In practice, many mature applications use several of them together:

graph TB
    subgraph "Application"
        direction TB
        N["Namespaces & Conventions<br/>(every team, day one)"]
        P["Packwerk Packages<br/>(boundary detection)"]
        E["Rails Engines<br/>(hard isolation)"]
        G["Plain Ruby Gems<br/>(framework-free logic)"]
    end

    N -->|"When conventions<br/>aren't enough"| P
    P -->|"When static analysis<br/>isn't enough"| E
    E -->|"For domain logic<br/>without Rails"| G

    style N fill:#4a90d9,stroke:#2c5f8a,color:#fff
    style P fill:#e8a838,stroke:#b07828,color:#fff
    style E fill:#27ae60,stroke:#1e8449,color:#fff
    style G fill:#8e44ad,stroke:#6c3483,color:#fff

You start with namespaces because they are free. When namespaces are not enough, you add Packwerk to detect boundary violations. When detection is not enough and you need enforcement, you extract an engine. When the engine contains logic that does not need Rails at all, you pull it into a plain gem.

Each layer builds on the one below. You do not have to pick one tool and commit to it forever. You escalate as the pain justifies the cost.

The best architecture teams I have worked with use this layered approach. They start cheap, escalate deliberately, and always ask: “Is the boundary problem we have worth the cost of the tool we are reaching for?”

This was adapted from Chapter 16 of Modular Rails: Architecture for the Long Game. The book covers all five approaches in depth – with working code, migration guides, and the honest trade-offs for each.

Read the entire book free on the web — every chapter, no paywall. Prefer print or Kindle? Amazon US · Amazon UK · all editions & prices.

Comprehension Debt

Sun, 17 May 2026 00:00:00 +0000

“You’re not a developer anymore. You’re a reviewer of code you don’t understand.”

That line is from The Serious CTO, and it named something I’d been feeling but didn’t have words for. The shape of the work has changed. The volume of code that gets generated, reviewed, and shipped has decoupled from the volume of code any human actually holds in their head. There’s a debt accruing in that gap, and we don’t track it.

He calls it comprehension debt — the difference between how much code exists in your system and how much of it anyone on the team could explain at 2am.

I want to make the case that this is the most important kind of debt our industry has accumulated in the past two years, and that nothing in the standard toolkit measures it.

It’s not technical debt

Ward Cunningham coined “technical debt” in 1992 to describe a deliberate trade. You take a shortcut, you know you took it, you plan to pay it back. The transaction was between an engineer and the future engineer who would inherit the code. Both were assumed to be humans, and both were assumed to remember why.

Comprehension debt is different in three ways.

It isn’t deliberate. Nobody chooses to ship code they don’t understand. It happens because the AI generated 600 lines, the spec was implicit in the conversation, the conversation is gone, and the tests pass. The trade was never on the table to refuse.

It isn’t local. Technical debt usually sits in a specific module you can point at. Comprehension debt is distributed across thousands of small decisions, each one defensible, none of them remembered. The total is much larger than any of its parts.

And it doesn’t show up in any dashboard, which is the part I want to spend a moment on.

What the dashboards measure

Look at any modern engineering analytics platform and you’ll see roughly the same vocabulary. DORA metrics — lead time for changes, deployment frequency, change failure rate, mean time to recovery. Flow metrics — cycle time, work-in-progress, throughput. And now, increasingly, AI Impact metrics — suggestion acceptance rate, percentage of PRs assisted by AI, generated lines per engineer per week. The pitch is the one I keep seeing in the marketing copy: actionable engineering insight across DORA, Flow, AI Impact, and more. Turn engineering insights into predictable outcomes.

These are useful. I look at DORA numbers regularly and I’d argue every team should. They are also, all of them, measurements of motion.

Lead time tells you how fast work moves through the pipeline. Deployment frequency tells you how often it ships. Change failure rate and MTTR tell you what fraction breaks and how fast you recover. Cycle time tells you how long an item sat in flight. AI acceptance rate tells you how often a generated suggestion was kept.

None of them ask the question that matters here: of the code we shipped last quarter, what percentage could any member of the team explain right now, without opening the file?

That number doesn’t have a name yet. The “AI Impact” category got close — it noticed AI was changing something about engineering and tried to put a measurement on it — but the things it measures are still adoption and volume, not comprehension. Acceptance rate doesn’t care whether the engineer who accepted the suggestion understood it. Lines-per-engineer doesn’t care whether anyone could narrate those lines six weeks later.

So the lagging-indicator failure mode is precisely what you’d expect. DORA numbers stay green right up until they don’t, at which point the debt has already compounded across every change that touched anything near the broken thing. The dashboards eventually notice, but only via the breakage. By then you’re not measuring comprehension — you’re measuring its absence.

The standard answers don’t measure understanding

This is the part I want to be honest about, because every senior engineer reading this has the same instinct I had: surely more review, more linting, more tests, more automation catches this. They don’t, and it’s worth being precise about why.

Code review measures “looks reasonable”, which is approximately the same problem as the code itself: a human skimming a diff, deciding whether it pattern-matches against something they’d write. It doesn’t ask whether anyone could explain why this code exists, or what would have to change in the world to make it wrong. The failure mode has a name now: LGTM syndrome. The data backs it up. High-AI-adoption teams in Faros AI’s recent telemetry are merging 98% more PRs while review time goes up 91%. We’re rubber-stamping more, faster.

Linters and type-checkers measure syntax, not intent. They will tell you the function returns a string. They will not tell you whether the string represents the thing the caller assumed it represented. In TypeScript, the type-checker catches the large share of compilation errors in LLM-generated code that are type-check failures, which is real value, but the errors that matter are the ones that compile.

Tests measure observed behaviour at the point of writing. They are a memory of the assumptions that were live when the test was written. When the assumptions change — and they do, constantly, in any product still being shaped — the tests pass and the meaning quietly diverges. Tests are necessary infrastructure. They are not a measurement of comprehension.

“Future AI will refactor it” is the most seductive answer and the one most worth refuting. AI can refactor syntax. It cannot refactor meaning it never had. If no human ever understood why a piece of code is the way it is, the AI cleaning it up is doing the same thing the AI that wrote it did — pattern-matching against training data, producing something plausible, hoping the tests pass. You’re not paying down the debt. You’re laundering it.

What it looks like when it’s compounding

The shape is familiar once you start looking for it.

A field gets added to a model. Six months later nobody can quite explain what it’s for, but removing it breaks four jobs. The PR that introduced it was approved, the tests passed, the issue is closed. The “why” exists in nobody’s head.

A controller has three early-return branches that each handle a “subtle case”. The cases were real when the code was written. Whether they’re still real is unclear, and checking would require reconstructing a conversation that happened in a chat session that’s now gone.

A 2am incident lands and the person on call can read the trace but can’t narrate the code that produced it. The original author, if there even was a single one, is the model that generated it. The graceful degradation everyone hoped for at the architecture stage requires understanding that no longer exists in the team.

None of these are individually catastrophic. Collectively they’re the new shape of legacy systems, and we’re building them at a rate previous generations of engineers couldn’t have imagined.

A field-level note

I’ve been building static-analysis and review tooling on a real Rails codebase for a while now. Custom linters, date-gated style rules, multi-agent PR review with worktree isolation, the works. Each layer was a response to something concrete. Each one helps. None of them measure comprehension. That’s not a criticism of the tools — they were never trying to. It’s a recognition that the thing I’m trying to defend against doesn’t yet have a measurement, which means it doesn’t yet have a budget, which means it grows.

What I want to ask

This is the open part, because I genuinely don’t know.

What would a codebase look like that actively maintained comprehension? Not after the fact, via documentation written under deadline, but as a continuous property the team owned.

What would the measurement be? A coverage metric for “any team member can describe this module within five minutes”? A required-reviewer rule that says the reviewer has to be able to teach the change, not just approve it? A spec-first workflow where the human writes the contract and the AI generates the implementation, so review is “does this honour the contract?” rather than “does this code look correct?”
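
To show what the first of those might look like, here is one possible shape for a coverage-style number. It is a sketch under loud assumptions: the survey data is invented, the names are hypothetical, and the answers still have to be collected by asking people, so this only tallies them.

```ruby
# Hypothetical survey answers: for each module, whether each team member
# said they could describe it within five minutes without opening the file.
SURVEY = {
  "billing"       => { "ana" => true,  "ben" => false, "cyd" => false },
  "notifications" => { "ana" => true,  "ben" => true,  "cyd" => false },
  "webhooks"      => { "ana" => false, "ben" => false, "cyd" => false },
  "reporting"     => { "ana" => false, "ben" => true,  "cyd" => true  },
}.freeze

# Percentage of modules that at least `min_explainers` people can explain.
def comprehension_coverage(survey, min_explainers: 1)
  return 0.0 if survey.empty?

  covered = survey.count { |_module, answers| answers.values.count(true) >= min_explainers }
  (covered * 100.0 / survey.size).round(1)
end
```

With the sample data, coverage is 75.0 when one explainer is enough and 50.0 when two are required. The second figure is probably the more honest one, since a module only one person can narrate is a single resignation away from zero.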

I have partial answers and I’m not confident in any of them. What I am confident in is that the response to comprehension debt cannot be more of the same review, more of the same linters, more of the same hope. Whatever the answer is, it has to measure something we are not currently measuring.

If you’re seeing this too, I’d genuinely like to hear what you’re trying. The vocabulary exists now. The next part is figuring out what to do with it.

Further viewing: The Serious CTO’s videos on AI killing code review, where developer time actually goes, and the hidden cost of AI coding are where the comprehension-debt vocabulary comes from, and they’re worth your time.
