close

DEV Community

강해수
강해수

Posted on • Originally published at dailymanuallab.com

FTS5 on Cloudflare D1 cut my Korean keyword search from 220ms to 22ms — here's what almost broke it

The bm25() relevance score in FTS5 returns a negative float. Lower means more relevant. I ordered results DESC for three days and wondered why garbage surfaced first.

I run 12 production Workers handling Korean D2C ad-ops. One indexes ~400K ad keywords — brand names, product slugs, mixed Korean/Latin search terms — and drives a dashboard that rerenders on every keystroke. The original LIKE '%키워드%' scans were clocking 180–220ms in D1. That's a hard no for typeahead. FTS5 is shipped enabled in D1, no opt-in required, so the migration looked straightforward. It wasn't.

The first real trap is Korean tokenization. FTS5's default unicode61 tokenizer splits on whitespace and punctuation. Korean doesn't use spaces between morphemes consistently — a single compound word can contain what a user types as two separate search tokens. That mismatch kills recall on single-morpheme queries. By week 3 at 180K rows I was measuring 88% recall on a hand-verified 500-row golden set, down from 91% at 40K rows. The drop isn't catastrophic but it's directional, and it gets worse as the index grows with noisier long-tail terms. The mitigation I landed on was query-side: splitting user input and joining tokens with OR before passing to MATCH, which recovers most of the recall loss without touching the tokenizer.

const ftsQuery = query.trim().split(/\s+/).join(' OR ');
Enter fullscreen mode Exit fullscreen mode

The second trap is content tables. content='kw_master' means FTS5 stores only the index, not the raw text — good for space, bad if you forget that inserts, updates, and deletes on kw_master don't automatically propagate. You need triggers, or the index silently drifts from the source table. I set up three triggers (insert, delete, update) and a one-time backfill. P50 latency at 390K rows after six weeks: 22ms. P95: under 45ms. Both within the sub-50ms target.

There's more to the full story — the exact trigger DDL that handles the content table sync without double-writing, how I sampled latency inside the Worker without killing performance, and the one wrangler command sequence I run at 3am when the index drifts.

I wrote up the full breakdown — including the week-6 benchmark table and the production incident that validated the trigger setup — over on dailymanuallab.com.

Full post →

Top comments (0)