Machine learning in production: the model is the easy part

#ai #machinelearning #mlops

A model that scores 95% on your test set feels like the finish line. Then you ship it, and you find out it was the starting line. The model was maybe 10% of the work; everything that makes it survive production is the other 90%.

We deploy machine-learning systems for companies, and the projects that stall almost never stall on model accuracy. They stall on the engineering around the model. Here's what actually breaks.

1. Training-serving skew

Your model was trained on clean, batch-computed features. In production it gets features computed by different code, at request time, sometimes from a slightly different source. The distributions drift apart and accuracy quietly craters — with no error, no crash, just worse predictions.

The fix is sharing one feature-computation path between training and serving (a feature store or shared library), and logging production features so you can compare them against training. If training and serving don't compute features the same way, nothing else matters.
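
Here's a minimal sketch of what "one feature-computation path" can look like, assuming a plain shared Python module rather than a full feature store. The feature names, the `compute_features` and `mean_gap` helpers, and the sample rows are all hypothetical, made up for illustration.

```python
import math
from statistics import mean


def compute_features(raw: dict) -> dict:
    """Single source of feature logic, imported by both the training job
    and the request handler (hypothetical features)."""
    amount = float(raw["amount"])
    return {
        "log_amount": math.log1p(amount),
        # Assumes Monday=0 ... Sunday=6, as in datetime.weekday().
        "is_weekend": 1.0 if raw["day_of_week"] in (5, 6) else 0.0,
    }


def mean_gap(train_rows, served_rows, name: str) -> float:
    """Absolute difference in one feature's mean between training rows
    and logged serving rows. A crude skew signal, not a full test."""
    return abs(mean(r[name] for r in train_rows) - mean(r[name] for r in served_rows))


raw_rows = [{"amount": 10, "day_of_week": 1}, {"amount": 200, "day_of_week": 6}]

# Training builds its rows through the shared function...
train_rows = [compute_features(r) for r in raw_rows]

# ...and serving calls the same function, then logs what it computed.
served_log = [compute_features(r) for r in raw_rows]

# A divergent serving implementation that forgot the log transform.
skewed_log = [{"log_amount": float(r["amount"])} for r in raw_rows]

print(mean_gap(train_rows, served_log, "log_amount"))  # identical path, no gap
print(mean_gap(train_rows, skewed_log, "log_amount"))  # large gap flags the skew
```

The point of the sketch is the import structure, not the arithmetic: once both sides call the same function, the logged serving features become directly comparable to the training set.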

2. The model rots and nobody notices

The world changes — user behaviour, pricing, seasonality, an upstream schema. A model frozen at launch slowly decays against a moving target. Without monitoring, the first signal you get is a business metric dropping months later.

Monitor input drift and prediction distributions, not just uptime. Alert when the live data stops looking like the training data. Treat "the model still works" as a claim that needs evidence, not an assumption.
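
One common way to put a number on "the live data stops looking like the training data" is the population stability index (PSI). The sketch below is one option among several, not the only way to monitor drift. The bin count, the epsilon floor, and the 0.25 alert threshold are assumptions (0.25 is a widely quoted rule of thumb, not a law), and `psi` is a hypothetical helper name.

```python
import math


def psi(expected, actual, bins: int = 10) -> float:
    """Population stability index of `actual` against `expected`,
    using equal-width bins fitted on the expected (training) sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clip out-of-range values into edge bins
            counts[idx] += 1
        eps = 1e-6  # floor so empty bins don't produce log(0)
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


ALERT_THRESHOLD = 0.25  # assumed threshold; tune per feature

train_sample = [i / 100 for i in range(1000)]
live_sample = [v + 5 for v in train_sample]  # simulated upstream shift

if psi(train_sample, live_sample) > ALERT_THRESHOLD:
    print("input drift alert")
```

The same function works on prediction scores as well as input features, which covers both halves of the advice above.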

3. There's no retraining pipeline

"We'll retrain when it degrades" usually means a person manually re-running a notebook they half-remember. Build the retraining path early: reproducible data snapshots, an automated training run, evaluation against a held-out set, and a gate that blocks a worse model from shipping. Retraining should be a button, not an archaeology project.

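The gate at the end of that pipeline can be very small. Here's a sketch, assuming models are plain callables and that a single score on a shared held-out set decides promotion. `score`, `gate`, and `min_gain` are hypothetical names, and plain accuracy stands in for whatever metric you actually care about.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[float, int]  # (feature value, label), kept tiny for illustration


def score(predict: Callable[[float], int], holdout: Sequence[Example]) -> float:
    """Fraction of held-out examples predicted correctly (stand-in metric)."""
    return sum(predict(x) == y for x, y in holdout) / len(holdout)


def gate(candidate, production, holdout: Sequence[Example], min_gain: float = 0.0) -> bool:
    """Allow promotion only if the candidate scores at least as well as the
    current production model on the same held-out set."""
    return score(candidate, holdout) >= score(production, holdout) + min_gain


holdout = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
production = lambda x: int(x > 0.5)
candidate = lambda x: int(x > 0.8)  # a retrained model that got worse

if not gate(candidate, production, holdout):
    print("blocked: candidate is worse than production")
```

Both models are scored on the same examples in the same run, so the comparison doesn't depend on a number someone wrote down months ago.
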
4. The offline winner loses online

The model with the best test-set score is often too slow or too expensive to serve at real traffic. Latency and cost are product features. Sometimes a smaller, simpler model that answers in 50 ms beats a heavyweight that needs 2 seconds, because users feel the 2 seconds and the heavyweight quietly blows the budget.
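
One way to make that trade-off explicit is to treat the latency budget as a hard constraint and only then compare offline scores. This is a sketch under assumptions: the `Candidate` type, the `p95` and `pick` helpers, the scores, and the latency samples are all invented for illustration, and real samples should come from load tests at realistic traffic.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    offline_score: float
    latencies_ms: List[float]  # measured per-request latencies


def p95(samples: List[float]) -> float:
    """95th percentile by the nearest-rank method."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def pick(candidates: List[Candidate], budget_ms: float) -> Optional[Candidate]:
    """Best offline score among candidates whose p95 latency fits the budget."""
    eligible = [c for c in candidates if p95(c.latencies_ms) <= budget_ms]
    return max(eligible, key=lambda c: c.offline_score, default=None)


small = Candidate("small", 0.91, [40, 45, 50, 48, 52])
big = Candidate("big", 0.94, [1800, 2000, 2100, 1900, 2050])

chosen = pick([small, big], budget_ms=100)
print(chosen.name if chosen else "nothing fits the budget")
```

A cost-per-request ceiling can be added as a second filter in the same way.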

5. Accuracy is not the business metric

A fraud model at 99% accuracy is worthless if it flags so many false positives that support drowns. Define the metric that maps to the actual outcome — caught fraud net of review cost, handle-time saved, conversion lifted — and optimise that. Accuracy is a proxy, and proxies get gamed.
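
To make "caught fraud net of review cost" concrete, here's a toy comparison. The transaction amounts, the flat per-review cost, and the helper names `accuracy` and `net_value` are assumptions for illustration, not figures from a real system.

```python
def accuracy(y_true, y_flagged) -> float:
    """Share of transactions where the flag matches the true label."""
    return sum(t == f for t, f in zip(y_true, y_flagged)) / len(y_true)


def net_value(y_true, y_flagged, amounts, review_cost) -> float:
    """Fraud amount caught minus the cost of reviewing every flagged case."""
    caught = sum(a for t, f, a in zip(y_true, y_flagged, amounts) if t and f)
    reviews = sum(1 for f in y_flagged if f)
    return caught - review_cost * reviews


# Ten transactions, one of them fraudulent (hypothetical data).
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
amounts = [500, 20, 35, 15, 60, 25, 40, 30, 45, 10]
REVIEW_COST = 15  # assumed cost of one manual review

flags_nothing = [0] * 10                         # never flags anything
flags_some = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]      # catches the fraud, plus 2 false positives

for name, flags in [("flags_nothing", flags_nothing), ("flags_some", flags_some)]:
    print(name, accuracy(y_true, flags), net_value(y_true, flags, amounts, REVIEW_COST))
```

In this toy data the model that flags nothing has the higher accuracy and the lower net value, which is the proxy problem in miniature. Raise `REVIEW_COST` or add more false positives and the second model's value falls too, which is the support-drowning case.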

The pattern underneath all of these

None of this is exotic ML. It's data engineering, monitoring, CI/CD, and clear success metrics — applied to a component that happens to be statistical and non-deterministic. The teams whose models create value aren't the ones with the fanciest architecture; they're the ones who treat the model as one part of a production system that has to be observed, retrained, and held to a real metric.

That's the lens we bring from running production systems at scale before the current AI wave. If you have models that work in a notebook but stall on the way to production, that's the kind of machine-learning deployment and consulting work we do at Krazimo.

What's killed an ML project for you on the way to production? I'll get into specifics in the comments.

Top comments (1)

Nazar Boyko • Jun 24

Point 3's gate against shipping a worse model is doing a lot of quiet trusting, since it assumes the held-out set itself stays honest. The drift you describe in point 2 hits that eval set too, so a model can clear the gate while already being worse on this week's traffic. Pulling the held-out slice from recent production instead of the original snapshot keeps the gate pointed at the real target. Calling all of this plain data engineering with a statistical part bolted on is the honest framing most ML writeups dodge.