From 1.4 tok/s to 36 tok/s: What Building a Zero-Dependency C LLM Engine Taught Me About DRAM Ceilings
I started Project Zero with a single question: how fast can you run BitNet b1.58 inference on a CPU if you write everything in C and skip every ML framework?
The first answer was humbling: 1.4 tokens/second. Debug build, scalar arithmetic, no SIMD. The CPU was spending most of its time loading 8192 bytes of FP32 weights to compute what is mathematically just negations and no-ops on ternary values.
Nine months later, the same model runs at 36.25 tok/s on a 4-core Xeon, and that number isn't an estimate. It's 95% of the analytical DRAM bandwidth ceiling for that hardware, third-party verified on OpenBenchmarking.org. On my dev laptop (i5-11300H), it hits 42.83 tok/s with the INT4 classifier path.
This is the story of how we got there, and what I learned about the single constraint that governs CPU LLM inference.
Why C99 and Zero Dependencies?
Before the performance story: why bother?
Most LLM inference runs on Python with CUDA. Those stacks are genuinely excellent for production GPU workloads. But they carry real costs: ~2GB of Python runtime, framework libraries, CUDA toolkit, and model conversion tooling. For someone running inference on an old server, an edge device, or an embedded system, that overhead is the bottleneck.
Project Zero compiles with gcc -O3. No Makefile magic beyond that. The binary is one executable. You give it a .gguf file and a prompt. That's it.
This also turns out to be a useful constraint for understanding what's actually slow. When you can't blame the framework, you have to understand the math.
The Optimization Journey
BitNet b1.58 weights are ternary: each weight is one of {−1, 0, +1}. Dense matrix-vector multiply with ternary weights isn't really multiplication; each weight either adds the activation, subtracts it, or skips it. The naive approach (dequantize to FP32, run FMA) wastes about 95% of memory bandwidth loading 32-bit floats that each encode 1.58 bits.
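To make the "no multiplication" point concrete, here is a minimal scalar sketch. It is not the engine's kernel: the function name `ternary_dot` and the one-weight-per-`int8_t` storage are assumptions chosen for readability.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Dot product of one ternary weight row with float activations.
 * Each w[i] must be -1, 0, or +1; no multiply is ever issued. */
float ternary_dot(const int8_t *w, const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        assert(w[i] >= -1 && w[i] <= 1);
        if (w[i] > 0)
            acc += x[i];   /* +1: add the activation */
        else if (w[i] < 0)
            acc -= x[i];   /* -1: subtract the activation */
        /* 0: skip the activation entirely */
    }
    return acc;
}
```

For weights {+1, 0, −1, +1} and activations {2, 5, 3, 4}, this returns 2 − 3 + 4 = 3. The branches are only there to show the three cases; a fast kernel would avoid them.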
Here's the journey, with the specific cause of each jump:
1.4 tok/s: baseline. Scalar, debug mode, FP32 dequantization.
5.5 tok/s: AVX-512 enabled, CPU governor set to performance, spinlock thread pool, debug flags stripped. Pure mechanical cleanup.
10.5 tok/s: disabled earlyoom on the dev machine. +91% from one config change. The OOM killer was quietly pausing worker threads during inference. This was one of the more embarrassing debugging sessions.
13.0 tok/s: HT scheduling fix, tokenizer rewrite, top_p sampling removed from the hot path. At this point we were at 97% of the single-channel DDR4 bandwidth ceiling on the dev laptop. The hardware was the limit, not the code.
15.5 tok/s: RAM upgrade from single-channel to dual-channel DDR4. The ceiling doubled, and we got most of it.
16.1 tok/s: T=6 thread sweet spot identified, KV cache strategy fix. Hyperthreading past T=4 on the i5 adds thermal pressure that starts throttling, so the ceiling actually drops.
At this point we started working on the Xeon (Emerald Rapids, AVX-512 VNNI). The i5 plateau was a hardware ceiling.
21.2 tok/s (Xeon) — INT8 VNNI classifier path. Instead of floating-point accumulation, use vpdpbusd to accumulate int8 dot products into int32 accumulators. This is where the compute picture changes.
32.7 tok/s — VBMI 3-instruction unpack. This is the leap worth explaining in detail.
36.25 tok/s — INT4 VBMI classifier + PGO/LTO. 95% of the DRAM ceiling. This is where we are now.
The Three Instructions That Matter
The hot path for ternary matmul on AVX-512 VBMI machines reduces to three instructions per 64-element block:
vpermi2b: table lookup decode. Each byte of the packed weight stream encodes 4 ternary values as 2-bit fields. vpermi2b performs 64 byte-granularity lookups in parallel against a 128-byte table held in two 512-bit registers, decoding packed ternary into 64 signed bytes in a single instruction at 3 cycles latency. This replaces the unpack→shift→mask sequence entirely.
vpternlogd: 3-input bitwise logic. A somewhat underused AVX-512 instruction that can express any boolean function of three inputs using an 8-bit truth table. We use it to compute sign masks and zero masks from the decoded ternary values without branching.
vpdpbusd: INT8 VNNI accumulation. This is the accumulation workhorse: 64 int8 MAC operations per instruction, accumulated into 16 int32 lanes. The ternary weights are stored as {0, 1, 2} (not {−1, 0, +1}) to satisfy the instruction's unsigned-operand constraint, with a bias correction applied to the final result.
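The {0, 1, 2} encoding and its bias correction are easier to see in scalar form. The sketch below models only the arithmetic (unsigned code times signed activation, summed into int32), not the SIMD lane layout. The bit order within a byte (lowest two bits first) and the names `ternary_code` and `ternary_dot_biased` are assumptions, not taken from the repo.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Extract the i-th 2-bit code (0, 1 or 2) from a packed stream.
 * Assumed layout: four codes per byte, lowest bits first. */
static inline uint8_t ternary_code(const uint8_t *packed, size_t i)
{
    return (uint8_t)((packed[i / 4] >> (2 * (i % 4))) & 0x3u);
}

/* Scalar model of a u8 x s8 -> s32 accumulation plus bias correction.
 * Since code = w + 1, sum(code * x) - sum(x) equals sum(w * x). */
int32_t ternary_dot_biased(const uint8_t *packed, const int8_t *x, size_t n)
{
    int32_t acc = 0;   /* sum of code[i] * x[i] */
    int32_t bias = 0;  /* sum of x[i] */
    for (size_t i = 0; i < n; i++) {
        uint8_t code = ternary_code(packed, i);
        assert(code <= 2);
        acc += (int32_t)code * (int32_t)x[i];
        bias += (int32_t)x[i];
    }
    return acc - bias;
}
```

With weights {+1, 0, −1, +1} stored as codes {2, 1, 0, 2} (one packed byte, 0x86 under the assumed bit order) and activations {2, 5, 3, 4}, the biased sum is 17, the activation sum is 14, and the corrected result is 3, matching the direct ternary dot product. The activation sum depends only on the input vector, so it can be computed once and reused for every row.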
On Sapphire Rapids at 4 cores: FP32 FMA gives you 128 MACs/cycle. INT8 VNNI gives you 512 MACs/cycle. But the bigger win is memory: 512 bytes of packed ternary vs. 8192 bytes of FP32 for the same 2048-weight row. At 36 tok/s we are consuming ~11.7 GB/s of DRAM; the measured ceiling on that hardware is ~12.3 GB/s.
The code is in src/math/ternary_matmul_packed_vbmi.c if you want to read it.
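If the "8-bit truth table" description of vpternlogd feels abstract, a scalar model of one 32-bit lane may help. This follows the instruction's documented bit-indexing rule. The function name `ternlog32` is made up for illustration, and the truth tables the engine actually uses are not stated in this post.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of vpternlogd on one 32-bit lane. At every bit position the
 * three input bits (a, b, c) form an index 0..7 into imm8, and the selected
 * bit of imm8 becomes the result bit. */
uint32_t ternlog32(uint32_t a, uint32_t b, uint32_t c, uint8_t imm8)
{
    uint32_t r = 0;
    for (int bit = 0; bit < 32; bit++) {
        unsigned idx = (((a >> bit) & 1u) << 2)
                     | (((b >> bit) & 1u) << 1)
                     |  ((c >> bit) & 1u);
        r |= (uint32_t)((imm8 >> idx) & 1u) << bit;
    }
    return r;
}
```

Two commonly cited immediates: 0x96 gives a three-way XOR, and 0xCA gives a bitwise select (take b where a is 1, otherwise c). A select of that kind is one plausible way to build branch-free masks, though which immediates fit a given kernel depends on its data layout.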
What "95% of the Ceiling" Actually Means
There's a moment during optimization when you realize you've stopped fighting the algorithm and started fighting physics.
The analytical DRAM bandwidth ceiling for BitNet inference is:
ceiling = DRAM_bandwidth / bytes_per_weight / model_size_weights
For a 4-core Xeon with 16.0 GB/s DRAM and a 512-byte-per-row packed ternary representation:
ceiling ≈ 16.0 GB/s ÷ (512 bytes / 2048 weights) ÷ model_size_weights ≈ 38 tok/s
We're at 36.25. The remaining 5% is cache miss overhead, thread synchronization, and the non-matmul parts of the transformer (attention, normalization, sampling).
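The ceiling arithmetic is simple enough to keep as a helper while tuning. The sketch below assumes every weight is streamed from DRAM exactly once per token; the function names are hypothetical, and the per-token weight count is left as a parameter because this post does not state it.

```c
#include <assert.h>

/* Upper bound on tokens/second when every weight is read from DRAM once
 * per token: bandwidth / bytes_per_weight / weights_per_token. */
double dram_ceiling_tok_per_s(double bandwidth_bytes_per_s,
                              double bytes_per_weight,
                              double weights_per_token)
{
    assert(bytes_per_weight > 0.0 && weights_per_token > 0.0);
    return bandwidth_bytes_per_s / bytes_per_weight / weights_per_token;
}

/* Fraction of that ceiling actually achieved by a measured run. */
double ceiling_fraction(double measured_tok_per_s, double ceiling_tok_per_s)
{
    assert(ceiling_tok_per_s > 0.0);
    return measured_tok_per_s / ceiling_tok_per_s;
}
```

With the figures quoted above, 16.0e9 bytes/s at 512/2048 = 0.25 bytes per weight allows 64e9 weights per second, which is then divided by the number of weights read per token. `ceiling_fraction(36.25, 38.0)` comes out at roughly 0.954, the 95% figure.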
There's no algorithmic trick that gets you above this ceiling without changing the memory layout. You'd need either more DRAM bandwidth (different hardware) or a fundamentally different approach (speculative decoding, batching, sparse attention). For single-stream inference on a fixed model, 36.25 is roughly as fast as this hardware can go.
The Honest Gap: DeepSeek MoE
I want to be upfront about where the engine is slow.
On dense GGUF models (SmolLM2-135M F16), Project Zero runs at 100 tok/s on the i5-11300H, slightly ahead of llama.cpp at 1-2 threads and within 5% at peak threads.
On DeepSeek-V2 Q4_K_S: 1.9 tok/s vs. llama.cpp's 13.8 tok/s. We're 7x slower.
The root cause is a memory access pattern problem that I haven't solved. MoE routing selects 2 out of 64 experts per token, and each expert's weights are scattered non-contiguously in memory. On a dense model, the weight stream is sequential — prefetching works, DRAM throughput is high. On MoE, each expert selection triggers a cold fetch from a different region of the weight matrix, causing an L3 miss rate above 80%.
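To illustrate why routing breaks the sequential stream, here is a small sketch of top-k expert selection followed by an offset calculation. It assumes a simplified layout in which experts are stored back to back with a fixed size. Both function names are hypothetical, and this is not how either Project Zero or llama.cpp lays out expert weights.

```c
#include <assert.h>
#include <stddef.h>

/* Write the indices of the k largest router scores into out[0..k-1],
 * highest score first. Simple O(n*k) selection, fine for small k. */
void top_k_experts(const float *scores, size_t n, size_t k, size_t *out)
{
    assert(k <= n);
    for (size_t j = 0; j < k; j++) {
        size_t best = n;                  /* n means "none chosen yet" */
        for (size_t i = 0; i < n; i++) {
            int taken = 0;
            for (size_t t = 0; t < j; t++)
                if (out[t] == i)
                    taken = 1;            /* already selected earlier */
            if (taken)
                continue;
            if (best == n || scores[i] > scores[best])
                best = i;
        }
        out[j] = best;
    }
}

/* Byte offset of an expert's weights under the assumed back-to-back layout. */
size_t expert_offset(size_t expert, size_t bytes_per_expert)
{
    return expert * bytes_per_expert;
}
```

Consecutive tokens can pick unrelated indices, so the offsets fed to the matmul jump around the weight file instead of advancing linearly. That is the pattern a hardware prefetcher cannot follow.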
llama.cpp handles this with memory layout optimizations I haven't replicated. This is the open problem.
What's Next
The engine currently supports BitNet ternary and dense F16/Q4_K GGUF models in a single binary with an OpenAI-compatible HTTP API. Pre-built x86-64 Linux binaries are in the GitHub releases (no compiler required).
Two things I'm actively working on that I don't have good answers for:
Fused Q4_K matmul kernel — dense models need the same treatment as ternary. The current path dequantizes before accumulation; a fused kernel would eliminate ~30% of memory bandwidth.
MoE expert prefetching — if I can predict which experts will be selected 2-3 layers ahead, I can prefetch their weights before the scatter. The routing decision is deterministic given the hidden state, so prediction is possible. Whether it's fast enough is unproven.
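The mechanical half of the prefetching idea can be sketched with the GCC/Clang `__builtin_prefetch` hint. The predictor is the unsolved part and is not shown. The function name `prefetch_expert` and the 64-byte line size are assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed cache line size in bytes */

/* Issue read prefetch hints across a block of expert weights and return
 * the number of cache lines requested. __builtin_prefetch is only a hint:
 * it never faults, and the CPU is free to ignore it. */
size_t prefetch_expert(const void *weights, size_t bytes)
{
    const char *p = (const char *)weights;
    size_t lines = 0;
    for (size_t off = 0; off < bytes; off += CACHE_LINE) {
        __builtin_prefetch(p + off, 0 /* read */, 1 /* low temporal locality */);
        lines++;
    }
    return lines;
}
```

A caller would invoke this on the predicted experts' weight blocks a few layers before they are needed. A whole expert is likely larger than the cache, so a real version would probably request only a leading window and let the hardware prefetcher continue from there; whether that wins anything is exactly the unproven part.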
If you work on CPU inference or memory-bound kernel optimization, or have ideas on the MoE problem, the repo is at github.com/shifulegend/project-zero and there's a Help Wanted section in the README.
The OpenBenchmarking results are public and reproducible: Xeon result · i5-11300H result.
If you try it and hit something weird — or you get a better result than 36.25 tok/s on any hardware — I'd genuinely like to hear about it.