nicolas.vbgh

Posted on Jun 24

Less Database, More Files

#database #duckdb #postgres #dataengineering

A few things I noticed. None of them alarming on their own.

Our database doubles in size every year. Not "grows a lot" — doubles.
Of its 180 GB, five tables are north of 20 GB each, and I have never run an UPDATE on a single row in any of them.
Those tables are genomic mutations — variant calls. We don't even produce them. Another service does the sequencing and the calling and hands them to us as VCF files — essentially horrible CSVs — which we unpack into rows. We just store them and show them.
Refreshing pre-production copies the whole thing across, every time, almost none of which changed since the last refresh.
The same mutations get serialized out of one service and deserialized into the next, through a mapping layer, and when a number looks wrong, nobody can tell you who actually owns it.

Individually, each one earns a shrug. Together they're a pattern, and once I saw the pattern I couldn't unsee it. Every item on that list is the same mistake wearing a different mask:

We are keeping immutable data — data we don't even generate — inside a machine whose entire reason to exist is managing change.

What a database is actually for

A database is a machine for managing change. That's the job, the whole job. Transactions, locks, MVCC, UPDATE, DELETE, foreign keys — every hard, expensive thing Postgres does, it does so that two things changing the same row at the same time don't turn it into mush. It's a brilliant machine for a genuinely hard problem.

Now look at those five tables. One writer — and not even us, an upstream service. Append-only. No updates, no deletes, nothing racing anything for anything. They use none of that machinery and pay full price for all of it, because the machinery is bolted to the box whether you touch it or not.

The mutations were never database-shaped. They're records of something that already happened, produced somewhere else, frozen on arrival. That isn't a row. That's a file.

You don't UPDATE something that already happened — you keep it or you forge it. Postgres is a machine built to let you forge it safely, with transactions and locks and rollbacks. These never change. We brought a vault to guard a photograph.

(And no, "just delete the old ones" isn't on the table — these are clinical records, retained for years by law. Which is exactly the point: we need keep-and-read, not throw-away. Deleting loses the data; we want it small and still there.)

The shape that fits (three boring pieces)

If you've never built this way, here are the parts. Each is dull on its own, which is why the combination works.

Object storage (S3, GCS, Azure Blob): a giant hard drive that lives on the internet. You hand it a file and a name — mutations/sample_42.parquet — and later ask for that name back. The cloud provider runs the disks, the redundancy, the scaling. It isn't a filesystem and isn't a database: no real folders, no queries, just give me the bytes for this key, fetched over HTTP. You pay about two cents per GB per month, you don't run it, size it, patch it, scale it, or back it up, and it does not care how full it is. Hold that thought.

Parquet: a table stored column by column instead of row by row. Read 3 columns out of 12 and you touch a quarter of the file. Columns of similar values compress hard, because compression eats repetition. Half a million mutation rows land at around 5 MiB. (VCF is the genomics interchange standard — great for streaming one sample top to bottom, slow for reading four columns across millions of rows. Parquet holds the same facts in a shape our queries like.)

DuckDB: SQLite for analytics. A library, not a server — pip install, runs inside your process, nothing to deploy, nobody to page. It reads Parquet directly off object storage — gs://… for Google Cloud Storage, s3://… for AWS — grabbing only the bytes a query needs.

Together, and this is the whole integration:

import duckdb

# Runs inside your app, pulls only the bytes this query needs, straight from the bucket.
duckdb.sql("""
    SELECT chrom, pos, gene, impact
    FROM read_parquet('gs://mutations/sample_42.parquet')
    WHERE chrom = '7' AND pos BETWEEN 50000000 AND 60000000
""")

No instance, no replica, nothing to fail over. The compute is the app you already run; the storage is that two-cent bucket.

Put a rough number on it: at a few MiB per sample and a couple of cents per GB-month, a gigabyte holds a couple hundred samples — so a dollar a month stores something like 10,000 of them. Don't hold me to the exact figure; these are order-of-magnitude numbers, and the order of magnitude is the point. That's the whole storage bill, with no server in the loop, because there isn't one. (Reads add small per-request and egress charges; the storage itself really is this cheap.)

What changes if we do it — and what doesn't, if we don't

Take the observations one at a time. For each, the same two futures: leave it in Postgres, or move it to files.

The growth curve. Doubling is the most expensive word in infrastructure — it turns up dressed as success and leaves with your budget. The disk-space version of this is the obvious one and everyone already gets it: storage is cheap per gigabyte, but a database that doubles every year still climbs the price tiers. Fine — but marginal, and not why I'm writing this. The bytes were never the real cost. What doubling actually inflates is everything operational that scales with total size — backup and restore windows, replication lag, failover time, the hours a major-version upgrade runs, how long it takes to clone the instance, the working set fighting for cache. Don't move the data and that tax compounds every year. Do, and the bucket just absorbs the bytes — 50 GB or 50 TB, it doesn't slow down, doesn't want a maintenance window, doesn't have a bad night. The first time a dataset of mine 50×'d and nothing happened, I felt vaguely cheated. That's it? That's it.

The reads. We measured it, deliberately over a bad path — laptop, home internet, bucket on another continent — so the numbers would be a worst case, not a flattering one. A full read of a 500k-row file took nearly 30 seconds, which looks damning until you see why. Parquet doesn't store a table as one solid block; it splits the rows into batches called row groups and records each column's min and max per group, so a reader can skip the groups that can't match a filter and fetch only the ones it needs. That file had been written with about a hundred small row groups, so DuckDB pulled the footer (the index of where everything lives) and then a separate byte range for every column chunk in every group — a few hundred sequential round trips, almost all the time spent waiting rather than moving data. We rewrote the same rows into one large row group and the read dropped to ~2 seconds:

500k-row file, cold read over the public internet	time
`SELECT *` — ~100 small row groups	~29 s
`SELECT *` — one large row group	~2 s
`COUNT(*)` — answered from the footer, no data scan	~1 s

Same bytes, an order of magnitude faster, purely from layout. The COUNT(*) row is the tell: the same file totals up in about a second because DuckDB touches only the footer — so the 30 seconds was never the file's size, it was a full scan paying a separate round trip for every chunk. Read that same data off local disk instead of across an ocean and it comes back in tens of milliseconds, a hundred times faster again — the bottleneck was the round trips, not Parquet and not DuckDB. Co-locate the compute with the bucket and most of it evaporates.

Row-group size is the dial, and it cuts both ways: too small and you drown in round trips; collapse everything into a single group and the per-group min/max statistics can't skip anything, so a selective WHERE quietly reads the whole file. Sort the rows by the columns you filter on — here chrom and pos — and pick a row-group size in the low millions, and you get both: few requests and real skipping. The absolute numbers above are pessimistic on purpose; the lesson is the ratio: layout is the performance model. (Every figure here came out of a small, repeatable benchmark harness — one config file, one command — so the same suite re-runs co-located instead of living as a screenshot.)

The backups. Don't, and pg_dump re-serializes the same frozen rows every night, and you keep thirty nightly copies of data that changed zero times — backing up a brick, on the off chance the brick moves. Do, and the file you wrote is the backup; object storage already gives you durability, versioning, cross-region copies. Your database backups shrink to the part that can actually break.

Every one of those is the same move. We kept reaching for a bigger database when the fix was less database.

The one that actually changed my mind: ownership

The cost stuff is nice. This is the part that made me sit up.

The same mutations show up in several services, so each keeps its own copy, in its own database, because that's what good microservices do — own your data. Four teams, four copies, four subtly different truths. (A "single source of truth" you have to reassemble from four databases isn't a single source of truth.) To feed those copies you build the machinery: exports, imports, a nightly sync, a serializer to turn one service's rows into a payload, a deserializer to turn it back on the other side, a mapping layer in the middle because one service calls the field impact, the next calls it effect, and a third buries it three levels deep in JSON. We called it a data pipeline. It was an expensive game of telephone, and we acted surprised when the message came out garbled.

That plumbing is where the incidents actually live. Not the logic — the pipes between the copies. The serializer that drops a field after a schema bump. The import that half-finishes. And when a figure looks wrong, debugging becomes archaeology: the display team says "not us, ask the provider," the provider says "not us, we just ingest what the upstream service sends," and three teams and half a week later the bug is a serializer two hops up. Nobody owns the data; everybody owns a copy of it.

And here's the bit I genuinely like — maybe you will too. The data never changed. We had built an entire synchronization system for static facts. Synchronization is for things that move. These don't. We manufactured a whole category of bugs to solve a problem that, for immutable data, does not exist.

Put the mutations in one place, as files, and the apparatus evaporates. One copy in the bucket. Every service reads the same files with an embedded DuckDB. No export, because there's nowhere to export to. No serializer between services, because Parquet already is the format — the bytes you stored are the bytes you read. No mapping layer, because everyone reads the same columns.

The schema does drift over time, of course, so here's how that's handled without the old pain. The format lives in code, versioned: v1, v2, v3. A file is stamped with its version; a small shared reader upcasts v1 and v2 into the current v3 shape on the way out. That reader is a library every service imports. The file plus the reader is the contract — and it travels with the data instead of rotting on a Confluence page. The upstream producer writes v3; old v1 and v2 files keep working forever; nobody has to coordinate a migration across teams.

You're not sharing a database — the thing the dogma rightly warns against, because writers fight over rows. You're sharing a published, self-describing, read-only artifact. Which is the thing you were supposed to do all along.

"Isn't this just Iceberg?"

If you've been near data engineering you're already thinking it: this is a lakehouse, go use Apache Iceberg (or Delta, or Hudi). And you could. These are table formats: you still keep a pile of Parquet files in a bucket, but a metadata-and-catalog layer sits on top and makes the pile behave like one big database table. You can update it, get transactions across many files at once, ask what it looked like last Tuesday ("time travel"), and let several writers change it without corrupting each other — the catalog keeps track of which files currently belong to which table. They're genuinely impressive, and for big, mutable, many-writer datasets they're worth every byte.

But look at what we actually need. One producer, not many. Append-only, so no multi-file transactions to coordinate. No second writer to referee. The only database-ish feature on the whole list is schema evolution — and we just did that with a version stamp and a fifty-line reader.

So we take the one thing we need and skip the entire apparatus: no catalog to run, no metadata layer, no compaction jobs, no Iceberg-aware query engine, no new ecosystem to adopt. A file and a function. If we ever grow into many writers and cross-file transactions, Iceberg will be right there, and that'll be the day to reach for it. Today it would be a cathedral built to house a bicycle.

Not that this beats Iceberg — for immutable data from a single producer it's dramatically simpler, and simpler is the entire point.

It scales — which was the original problem all along

Go back to the first list. The complaint underneath it was never really size — it was the pipeline: the serializers, the nightly sync, the mapping layer shuttling the same frozen facts between four databases. That design had two weaknesses wired in, and reading files in place removes both.

It was serial — stages in a line, each one a place to stall or half-finish. And it ran on serialization — rows out of one service into a payload, then back into rows on the other side, a translation at every hop and a fresh way to corrupt the data each time.

A bucket has neither. There's no shared lock, no connection pool, no single writer everyone queues behind; serving huge numbers of independent requests at once is the entire product. So parallelism is close to free. Reading one 500k-row file took 2 seconds — reading ten of them at once, each on its own thread, also takes about 2 seconds. They don't wait on each other, because there's nothing to wait on. (You're bounded by your own bandwidth and thread count, not by the bucket.) You scale reads by issuing more of them; writes are the same story — independent files, no coordination, nothing to referee.

And the serializer layer is simply gone. The bytes a service reads are the bytes the producer wrote — no payload format in between to drop a field, rename it downstream, or half-finish an import at 3am. We didn't fix the pipeline bugs; we deleted the pipeline, and the whole category of failure went with it.

The honest bill

Nothing's free. What you give up:

Joins get a little harder. Mutations probably join patient or sample data that lives in Postgres. DuckDB can ATTACH the database and join files to live tables in one query — it works, but it crosses a network. Test it before you bet on it.
No indexed lookups on arbitrary columns. Parquet is fast on the column you sorted by and shrugs at the rest. Need instant lookups by many different keys? Keep a thin index table in Postgres.
No UPDATE. Change a historical row and you rewrite a file. Fine for data that never changes, a dealbreaker for data that does. That's the sorting rule.
Co-locate compute and storage. Remember the 30-versus-2 seconds. Over a bad network, object storage bills you a round trip for every breath.
The query engine now lives in your app. Postgres did the CPU and RAM of querying on a separate box; now DuckDB does it inside your service. Budget for it: tens of MB resident per instance, brief spikes into the hundreds on a heavy sort or aggregation (cap it with SET memory_limit), and a short CPU burst per query — decompressing a 5 MiB file is a few tens of milliseconds, near-zero between queries. Modest, but it now scales with your replica count instead of one database. The part you can't tune away is a latency floor: even co-located, a query spends a few network round trips reaching the bucket before the first row arrives — tens of milliseconds, where a warm Postgres index lookup is about one. Great for scans, a step backward for chatty point-lookups.
Many-file queries reopen the problem. One file per sample is ideal for reading one sample. Ask something across the whole cohort — allele frequency over ten thousand samples — and you're opening ten thousand files, back in the thousands-of-requests regime that made the first read slow. Pack and partition for the queries you actually run; a per-sample layout is a bet that most reads are per-sample.

The balance

Here's the trade laid out flat, so you can judge for your own context.

What you gain

Win	Why it lands
Storage scales on its own axis, cheaply	The database stops growing where it can't use the room
Backups and migrations shrink to data that changes	You stop copying frozen rows on every move
One copy, one owner, no sync	The whole class of serializer-and-drift bugs disappears
Reads parallelize for free	Ten files in 2 seconds, same as one
Write the reader once, reuse everywhere	A schema change updates one shared library, not N services — less code, fewer places to be wrong
Far simpler than a lakehouse	A file and a function, no catalog to run

What it costs — and how to blunt it

Cost	Mitigation
The engine runs in your app: memory, CPU, a latency floor	Cap `memory_limit`, co-locate with the bucket, use it for scans rather than chatty lookups
No fast lookups on arbitrary keys	Keep a thin index table in Postgres
No `UPDATE`	Only put genuinely immutable data here; to correct a fact, rewrite the file
Joins to live tables cross a network	`ATTACH` Postgres and benchmark the joins that matter before betting on them
Cohort-wide queries reopen the many-files problem	Partition or pack files for the access pattern you actually run

Line them up and the asymmetry is obvious. The gains are structural — they hold, and they widen as the data grows. The costs are conditional — each one bites only if you need a thing this data doesn't do: an update, a point lookup, a join you didn't plan for. And every mitigation reduces to "keep a little index in Postgres" or "lay the files out for the query you run" — never "go back to keeping immutable data in a change-management engine." For a single producer of append-only facts, the scale doesn't tip. It barely wobbles.

Who this is actually for

Before you start planning the migration: this isn't a general-purpose "ditch your database" argument. It pays off on one specific shape of data, and it's worth being honest about the edges.

Reach for it when:

The data is append-only and immutable — written once, read often, never updated. Audit logs, event histories, financial ledgers, sensor readings, ML feature snapshots, and yes, genomic calls all fit.
There's one producer (or a few), and many readers.
That data is a large, growing fraction of a database whose change-management features it never touches.
You recognize the symptoms: storage that doubles, migrations that drag, a sync pipeline nobody enjoys owning.

Leave it in the database when:

The data mutates — anything with real UPDATE/DELETE traffic.
Your primary access pattern is sub-millisecond point lookups across many keys.
You have many concurrent writers contending for the same records — that's exactly the problem a database, or a lakehouse like Iceberg, exists to solve.
The dataset is small enough that none of this hurts. Then the simplest thing is to do nothing.

So — would you actually do this?

I built all of this as a POC. I measured it, I argued myself through every objection, and every answer came back the same: cheaper, simpler, less haunted. The database stops growing on the axis it can't use. The serializers vanish. The data finally has one home, one owner, and a version stamp instead of a Confluence page.

And here's the part that keeps a question mark on the end instead of a victory lap. It's still a POC. I have not run it in production. I haven't felt what happens at 3am when a query pattern I didn't predict meets a network I don't control. I haven't lived a year of schema drift, or hit the join that turns out to matter more than I thought, or met the colleague who genuinely needed a point lookup the layout can't give them.

I have run out of objections to argue with. The numbers work, the ownership model is cleaner, the POC holds up. I am ready to flip the switch — I just haven't yet. That gap between "convinced in staging" and "confident in production" is exactly what I'm curious about. If you've done this — put immutable data in files and read it in place, in a real system, at 3am — I'd genuinely like to hear where it bit you. What did the POC not warn you about?

DEV Community