Production infrastructure notes

Spend the error budget wisely

Infrastructure notes on running systems that can fail but must recover: production patterns for AI workloads, VMware deployments, and audit-ready operations, from an architect who has operated regulated fintech systems for 10+ years.

Free · Operators only · No spam

The premise

Every production system has an error budget — the amount of failure it can absorb before SLAs break. The job is not zero downtime. It is spending that budget deliberately: on planned maintenance, controlled rollouts, and calculated risk — not on surprises at 3am.
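As a rough illustration of the arithmetic behind that premise, here is a minimal sketch that converts an availability target into a downtime budget and subtracts planned spend from it. The 99.9% target, the 30-day window, and the function names are assumptions chosen for the example, not figures from these notes.

```python
from datetime import timedelta


def error_budget(availability_target: float, window: timedelta) -> timedelta:
    """Downtime the window can absorb before the target is missed.

    availability_target is a fraction, e.g. 0.999 for "three nines".
    """
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in (0, 1]")
    return window * (1.0 - availability_target)


def remaining_budget(budget: timedelta, spent: list[timedelta]) -> timedelta:
    """Budget left after subtracting every planned or unplanned outage."""
    return budget - sum(spent, timedelta())


if __name__ == "__main__":
    # Hypothetical numbers: 99.9% availability over a 30-day window.
    budget = error_budget(0.999, timedelta(days=30))
    # Deliberate spend: one maintenance window and one controlled rollout.
    planned = [timedelta(minutes=20), timedelta(minutes=10)]
    left = remaining_budget(budget, planned)
    print(f"budget: {budget}, left after planned work: {left}")
```

With these assumed inputs the budget comes to roughly 43 minutes per 30 days, so two planned activities totalling 30 minutes leave only about 13 minutes for anything unplanned. That is the trade-off the premise describes.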

What I write about

VMware vSphere · NVIDIA AI Enterprise · Dell VxRail · HPE Synergy · vGPU & MIG · Audit-ready ops · Reliability patterns · NCA-AIIO prep

More notes

Compliance · 22 min read

What auditors asked when we deployed AI: questions, answers, and what we learned

Real audit questions when AI infrastructure entered our regulated environment. PCI DSS, ISO 27001, and regulatory inspection patterns. The answers that passed, the answers that didn't, and how to prepare evidence that scales.

Operations · 18 min read

The AI memory crunch: how DRAM and NAND price shocks reshape infrastructure budgets

DDR5 prices up 3-4x. Enterprise SSDs up 470%. Memory manufacturers redirecting capacity to AI customers. Notes from infrastructure operators navigating the worst memory market in a decade, and the procurement strategies that work.

Operations · 12 min read

Bandwidth contention at peak: backup vs traffic vs telemetry

At peak, four streams fight for one network: live user traffic, near-real-time backup replication, log shipping, and metrics. Here's a quantified worked example of the saturation, why load tests miss it, and a tiered must-have / should-have / nice-to-have fix list.

Architecture · 12 min read

Security-first infrastructure for payments: isolation, key management, and PCI scope reduction

How payment infrastructure is architected security-first: PCI scope reduction, HSM-backed key management, tokenization, and the segmentation that keeps the highest-risk data in the smallest possible blast radius.

vSAN · 20 min read

vSAN for mixed workloads: policy design, AI patterns, and the OSA-to-ESA transition

Operating vSAN clusters that host both regulated banking workloads and AI training. Storage policy design for mixed workload classes, OSA and ESA architecture trade-offs, and lessons from running both in production.


the error budget

One deep technical note every Friday. Production patterns, audit-ready configurations, and lessons from operating mission-critical infrastructure. Written by an architect, for architects.

Free. Unsubscribe anytime. No spam, ever.

10+ years of fintech infrastructure

10 VMware clusters in production

Anonymous, with no vendor influence