Elvis

Posted on Jun 24

How Database Indexes Actually Work — From the Disk Up

#computerscience #database #performance #tutorial

Every index is a different answer to the same trade-off.

A while ago I gave a talk to my engineering team about database indexes. The question I actually wanted to answer was the one nobody asks out loud: what is an index, physically, and why does adding one turn a 40-minute query into a 40-millisecond one?

I don't think you really understand a tool until you can rebuild it from the ground up. So that's what this is. We're going to start at the disk, invent the index ourselves, and only then give it a name. By the end, "which index should I use?" stops being a thing you memorize and becomes a thing you can reason about.

First, what does "slow" even mean?

Before indexes, before B-trees, there's one question: how does a database read a row off a disk?

It doesn't read rows. It reads blocks (also called pages). A block is a fixed-size chunk of disk — say 512 bytes. If each row is about 128 bytes, then one block holds four rows:

512 B per block ÷ 128 B per row = 4 rows per block

The disk can only hand you a whole block at a time, and fetching one costs a seek, on a spinning disk, on the order of 10 milliseconds. Hold onto that number; it's where all the pain comes from.

Now run this query with no index:

SELECT * FROM users WHERE id = 7;

The database has no idea where id = 7 lives. So it does the only thing it can: it reads every block and checks every row. This is a full scan.

With 100 rows:

100 rows ÷ 4 rows/block = 25 blocks
25 blocks × 10 ms        = 250 ms

A quarter of a second to find one row. Annoying, but survivable. Now scale up to a million rows:

1,000,000 ÷ 4   = 250,000 blocks
250,000 × 10 ms ≈ 42 minutes

Forty-two minutes. To find one row. Every query, reading every block, forever. This is not a tuning problem, it's unusable by design. The whole reason indexes exist is sitting right there in that number.

Inventing the index

Here's the move. What if, instead of searching the big table, we kept a second, much smaller table on the side, one that just says which block each id lives in?

INDEX                 USERS TABLE
id  → block           Block 2
 1  → Block 1         ┌────┬───────┬─────────┐
 2  → Block 1         │ id │ name  │ dept    │
 ...                  │  5 │ David │ Finance │
 5  → Block 2         │  6 │ Lin   │ Eng     │
 6  → Block 2         │  7 │ Yusuf │ HR      │  ← right here
 7  → Block 2         │  8 │ Eva   │ Eng     │
 ...                  └────┴───────┴─────────┘
100 → Block 25

Each index entry is tiny, just an id and a block pointer, maybe 16 bytes. That means a single block holds far more index entries than table rows:

512 B ÷ 16 B = 32 index entries per block

And because the index is sorted by id, we don't have to scan it, we can binary-search it. Let's redo the million-row lookup:

Index size:      1,000,000 ÷ 32   = 31,250 blocks
Binary search:   log₂(31,250)     ≈ 15 block reads
Fetch the row:   + 1 read
Total:           16 reads × 10 ms = 160 ms

42 minutes → 160 milliseconds. Same disk, same query, same data. The only thing that changed is that we stopped scanning and started navigating.

But notice the new problem hiding in that math: the index itself is 31,250 blocks. It's a big sorted thing that we now have to search efficiently and keep sorted as rows are inserted and deleted. Binary search over a flat file is fine until you insert a row in the middle and have to shuffle everything down. So the real question becomes: how do we organize the index? That choice, and it is always a choice, is the entire rest of the story.

B-trees: the default answer since 1971

Most of the time, when you type CREATE INDEX and don't think about it, you get a B-tree. It's been the default for over fifty years, and once you see what it's optimizing for, you understand why.

Start from a binary tree, the structure you'd reach for instinctively. The problem: a binary tree of a million keys is about 20 levels deep, and on disk every level you descend is another ~10 ms seek. Twenty seeks is slow. The depth is the enemy.

So you ask the obvious next question: why only two children per node? If a node can have two children, why not hundreds? Make each node exactly one disk page (~8 KB), pack it with hundreds of keys, and suddenly the tree is wide and shallow instead of narrow and deep. That's a B-tree: a balanced, sorted tree with a high fanout.

                 [ 30 | 60 | 90 ]
                /      |       |     \
         [10|20]   [40|50]  [70|80]  ...
         /  |  \    /  |  \
     1..9 .. ..  41..49 51..59 ..

Finding id = 45: at the root, 45 is between 30 and 60, so take that branch; in the next node, land in the 41..49 leaf; done. Three page reads.

Here's the payoff that still feels like magic to me. With a fanout of ~256 keys per node, the depth grows as log₂₅₆(n):

log₂₅₆(1,000,000,000) ≈ 3.74

Four page reads will find any row in a billion-row table. That's the whole reason B-trees won.

And because the tree stays sorted, the same structure answers a whole family of questions for free:

Equality — WHERE email = '...'
Range — WHERE created_at BETWEEN x AND y
Sorting — ORDER BY on the indexed column
Prefix — WHERE name LIKE 'Joh%'

What it's bad at is just as informative, because it's the mirror image of what it's good at. A B-tree struggles with low-cardinality columns (WHERE active = true — a boolean index barely narrows anything), with suffix or substring search (LIKE '%son' can't use the sorted order), and with full-text search inside long documents. Every one of those weaknesses is a direct consequence of "balanced, sorted tree." The shape that makes ranges fast is the same shape that makes %son impossible.

In practice you write one line and the database does the rest:

CREATE INDEX idx_users_email ON users(email);
SELECT * FROM users WHERE email = 'alice@example.com';

You declare the index. The engine picks B-tree by default. Queries use it automatically. This is why Postgres and MySQL reach for it first.

When the default doesn't fit

Everything above is one answer to "how do we organize the index." The interesting part is what happens when your workload asks a different question. Each of the indexes below exists because someone had a query a B-tree couldn't serve well, and they were willing to give something up to get it.

Hash indexes — equality, and nothing else

If all you ever do is look up an exact key, a tree is overkill. Hash the key, jump straight to its bucket, read the row:

alice@…  →  hash()  →  bucket 47  →  row #4,827,193

That's O(1) — faster than any tree, no matter how big the table gets. The catch is total: hashing destroys order. Similar keys scatter to unrelated buckets, so there are no ranges, no sorting, no prefix matches. You traded the entire ORDER BY / BETWEEN family for raw point-lookup speed. This is the engine behind Redis and Memcached:

SET user:1234 "{...}"
GET user:1234   → value, in microseconds

Exact key in, exact value out. No range query is the point, not a limitation.

Inverted indexes — searching inside text

Now flip the question entirely. A B-tree answers "find the row matching this key." Full-text search asks "find the documents containing these words." Different question, different index.

The trick is to invert the natural mapping. Instead of document → words, you store word → documents:

Forward                          Inverted
Doc 1 → "refund for shipping"    "refund"   → [Doc 1, Doc 2]
Doc 2 → "refund policy"          "shipping" → [Doc 1, Doc 3]
Doc 3 → "shipping was fast"      "policy"   → [Doc 2]

Search a word, get its document list in one lookup — microseconds, regardless of how big the corpus is. That unlocks full-text search across millions of documents, boolean queries (refund AND shipping NOT cancelled), relevance ranking, phrase search, and autocomplete.

The cost is the inverse of B-tree's strength. Writes are slow, because one document touches many word lists. It's eventually consistent — Elasticsearch refreshes on roughly a one-second delay. And it gives up ACID guarantees, so it isn't a system of record. Which is why the most common pattern in modern backends isn't "replace your database with Elasticsearch," it's both:

PostgreSQL (source of truth, B-trees) → pipeline → Elasticsearch (search, inverted index)

Each index does the job it's shaped for.

LSM-trees — when writes outpace reads

B-trees do their bookkeeping at write time: every insert may have to find the right leaf, maybe split a node, maybe rebalance. That's random I/O, and it's fine until you're ingesting a firehose of writes.

The Log-Structured Merge tree refuses to play that game. It never overwrites in place. It appends new writes to an in-memory table (and a write-ahead log), and when that fills up it flushes to disk as a sorted file. Background threads later merge those files together:

writes → MEMTABLE (RAM) --flush--> SSTable 1, 2, 3 (disk) --compact--> merged SSTable

The write path is just an append — no tree navigation, no splits, sequential I/O. That makes writes 10–100× faster than a B-tree.

But — and this is the line from the talk I most wanted people to remember — the cost didn't disappear, it moved. Reads got harder: a key might be in the memtable or in any of the SSTables, and if it was updated, several versions exist and the newest wins. Worst case, a read probes every file. Two optimizations make that bearable: bloom filters (which answer "definitely not here, or maybe here" cheaply) and small sparse in-memory indexes per file. The work shifted from synchronous write-time effort to asynchronous background compaction. This is the engine inside Cassandra, RocksDB, ScyllaDB, and most time-series databases.

Vector indexes — searching by similarity

The newest question is the strangest. Not "find the match" but "find the things similar to this thing." An embedding model turns text, images, or audio into vectors, positioned so that similar meanings land near each other in high-dimensional space:

"shipment missing"  ─┐
                      ├─ cluster together
"package never came" ─┘            "update my email"  (far away)

The index finds nearest neighbors fast. The twist is that exact nearest-neighbor search in high dimensions is brutally expensive, so these indexes go approximate: ANN (approximate nearest neighbor) algorithms like HNSW and IVF return results that are ~98% correct but ~1000× faster. You trade a sliver of accuracy for tractability — and that trade is what makes semantic search, recommendations, and RAG pipelines possible. (This is the part I run into daily; the retrieval layer in the systems I build lives or dies on this trade-off.) In practice: pgvector, Pinecone, and friends.

So which index should you use?

Here's the whole landscape on one page:

Index	Best for	The trade-off	In the wild
B-tree	Equality, range, sort	Slower writes (tree upkeep)	Postgres, MySQL
Hash	Equality only — O(1)	No range, sort, or prefix	Redis, Memcached
Inverted	Full-text search	Slow writes, eventually consistent	Elasticsearch, Lucene
LSM-tree	Write-heavy workloads	Reads probe many files	Cassandra, RocksDB
Vector	Similarity search	Approximate, not exact	Pinecone, pgvector

Read that table top to bottom and the pattern is unmistakable. Every row buys speed on one kind of question by giving up speed on another. There is no free lunch and there is no universal winner.

That's the thing I wanted my team to walk away with, and it's the thing I'll leave you with too:

There is no best index. There are only indexes that fit your workload — and indexes that fight it.

The next time someone asks "should we add an index here?", you don't need to remember a rule. Ask what question the query is really asking, ask what you're willing to give up, and the right structure picks itself. That's the difference between memorizing a tool and actually understanding it — and it's a difference that, once you start looking, shows up absolutely everywhere in software.

DEV Community