Doctor PDF

Posted on Jun 24

How I built a PDF-to-Word converter that runs 100% in the browser (no server)

#programming #ai #productivity #opensource

"Extracting text with coordinates from pdf.js, rebuilding paragraphs and tables, fixing Arabic right-to-left order, and generating a real .docx — all client-side, no upload."
tags: javascript, webdev, pdf, opensource
canonical_url: https://doctor-pdf.com/pdf-to-word.html


TL;DR — I built a PDF → Word (.docx) converter that does everything in the browser: no file ever leaves the user's device. The hard parts weren't the file formats — they were reconstructing layout (paragraphs, tables, headings) from a flat stream of positioned glyphs, and getting Arabic / right-to-left text to come out in logical order. Here's how it works, with the gotchas I hit along the way. Live tool: doctor-pdf.com/pdf-to-word.html.

Why client-side?

Almost every "free" PDF tool uploads your file to a server. For contracts, IDs, and confidential documents that's a non-starter for a lot of people. Browsers are now powerful enough that the entire pipeline — parsing the PDF, reconstructing the document, and writing a Word file — can run in JavaScript on the user's machine. Nothing is uploaded. That constraint shaped every decision below.

The stack is intentionally boring and CDN-only:

pdf.js — read text, positions, fonts, colours, and the operator list.
docx — generate a real .docx (lazy-loaded ~740 KB, only when the user clicks Download).
<canvas> — the browser's own text shaper, which turns out to be the secret weapon for Arabic.

Step 1 — PDF text isn't text, it's positioned glyphs

The first surprise for anyone new to PDFs: there are no paragraphs, no lines, often no spaces. pdf.js hands you textContent.items, each roughly:

{ str: "Invoice", transform: [fs,0,0,fs, x, y], width, fontName }

That's a glyph run with a position. A "line" of text might arrive as ten fragments; a visual space between two columns might be no character at all, just a horizontal gap. So step one is to rebuild lines and spaces from geometry:

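// order lines by their y coordinate (PDF y grows upward, so descending y = top-to-bottom)
lines.sort((a, b) => b.y - a.y);

// within a line, rebuild missing spaces from the x-gap to the previous chunk
let text = "";
let prev = null;                               // no gap to measure before the first chunk
for (const cur of chunks) {                    // chunks are ordered left-to-right
  if (prev) {
    const gap = cur.x - (prev.x + prev.w);
    if (gap > cur.fs * 0.18) text += " ";      // a normal word space
    if (gap > cur.fs * 4)    text += "   ";     // a column gap → several extra spaces
  }
  text += cur.str;
  prev = cur;
}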

That fs * 0.18 threshold (a fraction of the font size) is the single most-tuned number in the whole project. Too high and words run together; too low and you get H e l l o.
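The grouping step itself (turning raw items into lines before spaces are rebuilt) can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: groupIntoLines is a hypothetical helper, the 0.5 baseline tolerance is an assumed starting value, and the font size is approximated from the vertical scale of the text matrix.

```javascript
// Hypothetical helper: group pdf.js-style text items into visual lines.
// Assumes each item looks like { str, transform: [a, b, c, d, x, y], width }.
function groupIntoLines(items, yTolerance = 0.5) {
  const chunks = items
    .filter((it) => it.str.length > 0)
    .map((it) => ({
      str: it.str,
      x: it.transform[4],
      y: it.transform[5],
      w: it.width,
      // Font size approximated by the vertical scale of the text matrix.
      fs: Math.hypot(it.transform[2], it.transform[3]),
    }));

  // PDF y grows upward, so descending y reads top-to-bottom.
  chunks.sort((a, b) => b.y - a.y);

  const lines = [];
  for (const chunk of chunks) {
    const line = lines[lines.length - 1];
    // Same line if the baseline sits within a fraction of the font size
    // of the line's first chunk (0.5 is an assumed, untuned default).
    if (line && Math.abs(line.y - chunk.y) <= chunk.fs * yTolerance) {
      line.chunks.push(chunk);
    } else {
      lines.push({ y: chunk.y, chunks: [chunk] });
    }
  }

  // Within each line, order fragments left-to-right.
  for (const line of lines) line.chunks.sort((a, b) => a.x - b.x);
  return lines;
}
```

Each returned line carries chunks with x, w, fs, and str, which is the shape the space-rebuilding loop expects.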

Step 2 — From lines to paragraphs, headings, and lists

Once you have lines, you infer structure from typography and whitespace — the same cues a human reads:

New paragraph when the vertical gap between lines jumps, when alignment changes, or when a line stops well short of the right margin (a "short last line").
Heading when a line's font size is meaningfully larger than the body — or it's short, bold, and standalone. Important caveat I learned the hard way: in a document where everything is bold (lots of legal letters are), "bold" carries no signal. So I compute a document-wide boldFraction and only treat bold as a heading hint when bold is uncommon (< 40%).
List item when a line starts with a bullet/number glyph.
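The heading rule can be sketched like this. It is an illustration under stated assumptions: the line shape { text, fs, bold, standalone }, the 1.15 size ratio, and the 60-character cap are all hypothetical; only the 40% bold cutoff comes from the text above.

```javascript
// Hypothetical line shape: { text, fs, bold, standalone }.
function boldFraction(lines) {
  if (lines.length === 0) return 0;
  return lines.filter((l) => l.bold).length / lines.length;
}

// Body size is taken to be the most frequent font size (bucketed to half-points).
function mostCommonFontSize(lines) {
  const counts = new Map();
  for (const l of lines) {
    const key = Math.round(l.fs * 2) / 2;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  let best = 0;
  let bestCount = -1;
  for (const [size, count] of counts) {
    if (count > bestCount) {
      best = size;
      bestCount = count;
    }
  }
  return best;
}

function classifyHeadings(lines) {
  const bodySize = mostCommonFontSize(lines);
  // Bold only counts as a hint when it is uncommon in the document.
  const boldIsSignal = boldFraction(lines) < 0.4;
  return lines.map((l) => {
    const larger = l.fs >= bodySize * 1.15; // assumed ratio
    const shortBold =
      boldIsSignal && l.bold && l.standalone && l.text.length <= 60; // assumed cap
    return { ...l, heading: larger || shortBold };
  });
}
```

In an all-bold legal letter, boldIsSignal is false, so only a genuinely larger font size can promote a line to a heading.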

A fun bug: bullet characters (U+2022) are often drawn in a SymbolMT font. Carry that font name into Word and the bullet renders as a □ tofu box. The fix is to drop the font for known symbol fonts so the bullet falls back to a Unicode-capable default.
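A minimal sketch of that fix, assuming the real font name has already been resolved for the run. The list of symbol-font patterns and the fontForWord name are hypothetical; adjust them to the fonts that show up in your own files.

```javascript
// Assumed list of symbol-font name fragments (not exhaustive).
const SYMBOL_FONTS = /symbol|wingdings|webdings|dingbats/i;

// Returns undefined for symbol fonts so the run falls back to
// the document's default (Unicode-capable) font.
function fontForWord(fontName) {
  if (!fontName || SYMBOL_FONTS.test(fontName)) return undefined;
  return fontName;
}
```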

Step 3 — The hard one: Arabic and right-to-left

This is where most converters fall over. Legacy PDFs frequently emit Arabic glyphs in visual order — i.e. already laid out left-to-right on the page — instead of logical order. Concatenate the fragments naively and you get Arabic that's reversed and, worse, mixed Arabic/Latin lines like QUOTATION / عرض أسعار that come out scrambled.

The reconstruction is a small bidi pass per line:

// line contains Arabic drawn in visual (LTR) order →
// 1) reverse all runs to base right-to-left logical order
runs.reverse();
// 2) re-flip maximal runs of Latin / digits so they read left-to-right again
flipLatinRunsBackToLTR(runs);

Pure-Latin lines short-circuit out (so English documents are untouched), and the same routine runs inside table cells. The payoff: الإمارات العربية المتحدة / Sharjah - United Arab Emirates comes out correct in both scripts, and in the Word output the paragraphs get bidirectional: true and their runs get rightToLeft: true.
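At the character level, the same idea can be sketched as below. This is a simplified illustration, not the tool's routine: visualToLogical and the two regexes are hypothetical, it assumes the caller already knows the line was emitted in visual order, and reversing by code point will misplace combining diacritics, which a real implementation would need to handle.

```javascript
// Arabic blocks, including the presentation forms legacy PDFs often contain.
const ARABIC = /[\u0600-\u06FF\u0750-\u077F\uFB50-\uFDFF\uFE70-\uFEFF]/;
// A Latin/digit run: starts and ends with a letter or digit and may contain
// spaces and light punctuation in between, so multi-word runs flip as a unit.
const LATIN_RUN = /[A-Za-z0-9][A-Za-z0-9 .,:\/\-]*[A-Za-z0-9]|[A-Za-z0-9]/g;

const reverseChars = (s) => [...s].reverse().join("");

function visualToLogical(line) {
  // Pure-Latin lines short-circuit out untouched.
  if (!ARABIC.test(line)) return line;
  // 1) reverse everything to get right-to-left logical order
  const reversed = reverseChars(line);
  // 2) flip maximal Latin/digit runs back so they read left-to-right
  return reversed.replace(LATIN_RUN, (run) => reverseChars(run));
}
```

Because a run may include inner spaces, a phrase like "United Arab Emirates" is flipped back as one unit rather than word by word, which keeps the word order intact.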

One more Arabic-specific gotcha: italic / sheared text. A naive "skip rotated text" watermark filter (checking the transform matrix) also kills faux-italic text, because the shear shows up in the same matrix entries. The fix is to normalise by scale and separate rotation from shear before deciding to drop a run.
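One way to separate the two, sketched under assumptions: the matrix is the usual [a, b, c, d, e, f], the function names are hypothetical, and the 10-degree cutoff is an illustrative value. The decomposition normalises the first axis, measures how far the second axis leans along it (shear), and reads rotation from the first axis alone.

```javascript
// Split the linear part of a text matrix into rotation and shear.
function decomposeTextMatrix([a, b, c, d]) {
  const scaleX = Math.hypot(a, b);
  if (scaleX === 0) return { rotationDeg: 0, shear: 0 };
  const ux = a / scaleX;
  const uy = b / scaleX;
  // Projection of the second axis onto the first = raw shear.
  let shear = ux * c + uy * d;
  const vx = c - ux * shear;
  const vy = d - uy * shear;
  const scaleY = Math.hypot(vx, vy);
  if (scaleY !== 0) shear /= scaleY; // normalise by scale: tan(shear angle)
  return { rotationDeg: (Math.atan2(b, a) * 180) / Math.PI, shear };
}

// Drop a run only when it is genuinely rotated; sheared (faux-italic)
// text has rotationDeg close to 0 and survives the filter.
function isRotatedWatermark(transform, minDeg = 10) {
  return Math.abs(decomposeTextMatrix(transform).rotationDeg) > minDeg;
}
```

A faux-italic matrix such as [12, 0, 4, 12, x, y] has a non-zero c entry but zero rotation, which is exactly the case a naive "any off-diagonal entry" check gets wrong.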

Step 4 — Reconstructing tables geometrically

Tables are the other place server engines usually win, because a PDF table is just… lines of text that happen to align into columns, often with rows that wrap across several text-lines. My first attempt (split each line at big x-gaps) produced fragmented messes on real multi-column tables.

The approach that actually worked is a small geometric reconstructor:

Find a region — consecutive lines that each have ≥ 2 well-separated fragment groups ("anchors").
Rows — group fragment y-positions; a gap bigger than fontSize * 1.5 starts a new row, so wrapped multi-line rows merge correctly.
Columns — cluster fragment x-positions (tolerance ~25pt) and keep clusters that appear in at least half the rows.
Assign each fragment to its nearest column centre.
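The column steps can be sketched as a one-dimensional clustering pass. This is a hedged illustration: findColumns and nearestColumn are hypothetical names, each row is assumed to be an array of fragments with an x position, and clustering against the running mean is one simple choice among several.

```javascript
// rows: array of rows, each an array of fragments { x }.
function findColumns(rows, tolerance = 25) {
  const points = [];
  rows.forEach((row, rowIndex) => {
    for (const frag of row) points.push({ x: frag.x, rowIndex });
  });
  points.sort((p, q) => p.x - q.x);

  // Greedy clustering: join the current cluster while x stays within
  // `tolerance` of its running mean, otherwise start a new cluster.
  const clusters = [];
  for (const p of points) {
    const last = clusters[clusters.length - 1];
    if (last && p.x - last.sum / last.count <= tolerance) {
      last.sum += p.x;
      last.count += 1;
      last.rows.add(p.rowIndex);
    } else {
      clusters.push({ sum: p.x, count: 1, rows: new Set([p.rowIndex]) });
    }
  }

  // Keep clusters seen in at least half the rows; return their centres.
  return clusters
    .filter((c) => c.rows.size >= rows.length / 2)
    .map((c) => c.sum / c.count);
}

// Index of the column centre closest to x.
function nearestColumn(centres, x) {
  let best = 0;
  for (let i = 1; i < centres.length; i++) {
    if (Math.abs(centres[i] - x) < Math.abs(centres[best] - x)) best = i;
  }
  return best;
}
```

A stray fragment that appears in only one row forms its own cluster and is filtered out, so it ends up assigned to whichever real column is nearest.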

The make-or-break part is the precision guards — what separates a real data table from a page that merely looks columnar (like a two-column CV):

cols >= 3 && rows >= 3 &&
fillRatio >= 0.7 &&
fullRowFraction >= 0.7 &&   // most rows fill EVERY column
headerRowFullyFilled        // kills "contact block" false positives

fullRowFraction is the key signal: a genuine table has rows that fill every column (≈1.0); a sidebar layout doesn't (≈0.67). With those guards the tool reconstructs reflowed/wrapped tables that match what the commercial converters produce — while a CV's two-column layout correctly stays as prose. Borderless label/value info-grids (the Client | …, Project | … block at the top of invoices) get their own lighter detector and render without grid lines.
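As a rough sketch, the guards can be computed from a grid of cell strings. The looksLikeTable name and the grid shape (rows of equal length, empty string for a missing cell) are assumptions for illustration; the thresholds are the ones quoted above.

```javascript
// grid: array of rows, each an array of cell strings ("" = empty cell).
function looksLikeTable(grid) {
  const rows = grid.length;
  const cols = rows > 0 ? grid[0].length : 0;
  if (cols < 3 || rows < 3) return false;

  const filled = (cell) => typeof cell === "string" && cell.trim() !== "";
  let filledCells = 0;
  let fullRows = 0;
  for (const row of grid) {
    const n = row.filter(filled).length;
    filledCells += n;
    if (n === cols) fullRows += 1;
  }

  const fillRatio = filledCells / (rows * cols);
  const fullRowFraction = fullRows / rows;          // rows that fill EVERY column
  const headerRowFullyFilled = grid[0].every(filled);

  return fillRatio >= 0.7 && fullRowFraction >= 0.7 && headerRowFullyFilled;
}
```

A layout where every row fills two of three columns has a fill ratio of about 0.67 and no full rows at all, so it fails the guards and stays as prose.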

Step 5 — Carrying images

To match diagrams in English docs and stamps/signatures in Arabic ones, I walk pdf.js's operator list, track the current transformation matrix through save/restore/transform, and on each paintImageXObject pull the bitmap from page.objs, draw it to a canvas at ~2× for sharpness, and emit a PNG. Images that overlap vertically get grouped into a single side-by-side row so figures that sit next to each other in the PDF stay next to each other in Word.
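The matrix tracking can be sketched as follows. This is a simplified illustration rather than the tool's code: collectImagePlacements is a hypothetical helper, it takes the operator list and the OPS table as parameters (in real use, the result of page.getOperatorList() and pdfjsLib.OPS), it assumes the image id is the first argument of paintImageXObject, and it ignores inline and repeated image operators.

```javascript
// Concatenate two PDF-style affine matrices [a, b, c, d, e, f]:
// the result applies m2 inside the coordinate system of m1.
function concat(m1, m2) {
  return [
    m1[0] * m2[0] + m1[2] * m2[1],
    m1[1] * m2[0] + m1[3] * m2[1],
    m1[0] * m2[2] + m1[2] * m2[3],
    m1[1] * m2[2] + m1[3] * m2[3],
    m1[0] * m2[4] + m1[2] * m2[5] + m1[4],
    m1[1] * m2[4] + m1[3] * m2[5] + m1[5],
  ];
}

function collectImagePlacements({ fnArray, argsArray }, OPS) {
  let ctm = [1, 0, 0, 1, 0, 0]; // current transformation matrix
  const stack = [];
  const placements = [];
  for (let i = 0; i < fnArray.length; i++) {
    const fn = fnArray[i];
    const args = argsArray[i];
    if (fn === OPS.save) {
      stack.push(ctm);
    } else if (fn === OPS.restore) {
      ctm = stack.pop() || ctm; // tolerate unbalanced restores
    } else if (fn === OPS.transform) {
      ctm = concat(ctm, args);
    } else if (fn === OPS.paintImageXObject) {
      // Images are painted into the unit square, so the CTM alone
      // gives the position and size on the page.
      placements.push({
        id: args[0],
        x: ctm[4],
        y: ctm[5],
        width: Math.hypot(ctm[0], ctm[1]),
        height: Math.hypot(ctm[2], ctm[3]),
      });
    }
  }
  return placements;
}
```

The resulting y and height values are what a vertical-overlap check would compare to decide which images belong in the same side-by-side row.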

⚠️ Hang warning worth its own paragraph: page.objs.get(id, callback) will wait forever if an image object never becomes ready — on some files that froze the whole conversion. The fix was to make the lookup synchronous and skippable: objs.has(id) ? objs.get(id) : null. A skipped image is infinitely better than a frozen tab.
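A small sketch of a skippable lookup, assuming an objs object that exposes has(id) and a synchronous get(id) that may throw for unresolved objects. The resolveImages helper and its return shape are hypothetical; counting the skipped images makes it easy to tell the user that something was left out.

```javascript
// Never wait on an image: take it if it is ready, otherwise skip it.
function getImageOrNull(objs, id) {
  try {
    return objs.has(id) ? objs.get(id) : null;
  } catch (err) {
    return null; // treat any lookup failure as "not available"
  }
}

// Resolve a list of image ids, keeping track of how many were skipped.
function resolveImages(ids, objs) {
  const images = [];
  let skipped = 0;
  for (const id of ids) {
    const img = getImageOrNull(objs, id);
    if (img) images.push({ id, img });
    else skipped += 1;
  }
  return { images, skipped };
}
```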

Step 6 — Writing the .docx

With an ordered list of blocks (paragraph / table / image), the docx library does the heavy lifting. The thing to get right is fidelity of the runs: explicit font size (real points), colour (only when it's not near-black, so dark-mode Word doesn't get weird), bold/italic, alignment, and — a detail that bit me — set the heading run's size and colour explicitly instead of relying on Word's Heading2 style, which otherwise renders everything the same size and blue.

new TextRun({
  text: run.text,
  font: cleanFontName(run.fontName),   // strip "ABCDEF+" subset prefix, map Helvetica→Arial
  size: Math.round(run.fs) * 2,        // docx uses half-points
  bold: run.bold, italics: run.ital,
  color: nearBlack ? undefined : run.hex,
  rightToLeft: isArabic, // ...
})
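The helpers referenced in those options might look roughly like this. Everything here is a sketch under assumptions: the alias table only contains the Helvetica mapping mentioned in the comment, the style-suffix stripping and the near-black threshold of 40 are illustrative, and runOptions rounds after doubling (Math.round(fs * 2)) so that half-point sizes such as 10.5 pt survive, which differs slightly from the snippet above.

```javascript
const FONT_ALIASES = { Helvetica: "Arial" }; // extend as needed

function cleanFontName(fontName) {
  if (!fontName) return undefined;
  const base = fontName.replace(/^[A-Z]{6}\+/, ""); // strip "ABCDEF+" subset prefix
  const family = base.split(/[-,]/)[0];             // drop "-Bold" / ",Italic" suffixes (assumed)
  return FONT_ALIASES[family] || family;
}

// hex is "RRGGBB" with or without a leading "#".
function isNearBlack(hex, threshold = 40) {
  const n = parseInt(hex.replace(/^#/, ""), 16);
  const r = (n >> 16) & 255;
  const g = (n >> 8) & 255;
  const b = n & 255;
  return Math.max(r, g, b) <= threshold;
}

// Plain options object that could be passed to a TextRun constructor.
function runOptions(run) {
  return {
    text: run.text,
    font: cleanFontName(run.fontName),
    size: Math.round(run.fs * 2), // half-points
    bold: !!run.bold,
    italics: !!run.ital,
    color: isNearBlack(run.hex) ? undefined : run.hex,
  };
}
```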

Packer.toBlob() produces the file, and a classic appendChild → click → remove anchor triggers the download. No server round-trip.

Things I'd tell my past self

PDF is a presentation format, not a content format. You're not parsing a document; you're reverse-engineering one from ink positions.
Tune thresholds against real files, not synthetic ones. Every magic number here (0.18, 1.5, 0.7) was calibrated on actual contracts, CVs, and invoices — and re-checked for regressions on every change.
The browser is a great text shaper. For anything involving Arabic shaping or RTL, rendering to <canvas> and embedding the result sidesteps a whole category of font-embedding pain.
Never ship a feature that can hang. A wrong-but-fast result beats a correct-but-frozen tab every time. Guard every async decode with a synchronous fallback or a timeout.
Know your limits. Damaged ToUnicode maps (the PDF's own defect) extract as �, and Arabic OCR isn't good enough to silently recover small/decorative text. When that happens the tool shows an honest quality banner instead of pretending.

Try it / read the code

The tool is live and free, no signup, nothing uploaded: doctor-pdf.com/pdf-to-word.html. It's part of a suite of 15 browser-only PDF tools I'm building — Arabic-first, since 400M+ Arabic speakers had no real privacy-respecting option.

If you're building something similar, happy to compare notes in the comments. 🔒
