AditModi's Blog

Generative AI on Kubernetes — Book Review
There's a specific kind of frustration that comes from reading a book that's half about what you actually needed. Most resources on running AI in production either assume you're a data scientist who p
Apr 5, 2026 · 7 min read
NVIDIA's Two Gifts to Kubernetes: DRA and AICR — What They Mean for Your EKS GPU Platform
From integer counting to structured resources — how Dynamic Resource Allocation and the AI Cluster Readiness framework finally make GPU infrastructure manageable at scale. Contents The Two Nightmares
Apr 4, 2026 · 18 min read
421 posts
Senior Cloud Engineer at Digital-Alpha
llm-d on EKS: The New Inference Resource Model That Changes How You Think About GPU Routing
Your vLLM cluster has a problem you probably don't know about. It's not a bug. Nothing is crashing. The metrics dashboard looks fine. But right now, every time a request hits your load balancer, there
Mar 28, 2026 · 27 min read
GPU Deadlock on EKS: What Gang Scheduling Actually Is, Why the Default Scheduler Fails You, and Three Ways to Fix It
There's a class of production incident that doesn't page anyone. No error rate spikes. No latency alert fires. The cluster health dashboard shows green. GPU nodes are online. Pods are running. And yet
Mar 28, 2026 · 30 min read
AWS Community Day Pune 2026 — Notes From a Grateful Attendee and Speaker
Travel has been relentless lately. Back-to-back weeks, airports blurring into each other, calendar looking like a game of Tetris someone is losing badly. But AWS Community Day Pune was non-negotiable.
Mar 22, 2026 · 4 min read
GTC 2026: Day 1 Notes From Someone Who Couldn't Sleep After Watching It
I watched Jensen's keynote live. Three hours. I had tea going cold next to me and a notepad filling up fast. I'm not going to recap every announcement. There are enough of those. What I want to do is
Mar 17, 2026 · 5 min read
Optimizing LLM Inference at Scale: SGLang and NVIDIA Dynamo on Amazon EKS
There's a GPU utilization chart that haunts every platform engineer running LLM inference in production. The x-axis is time, the y-axis is GPU utilization, and the line does something uncomfortable: i
Mar 15, 2026 · 28 min read
The Unified GPU Platform: Running Slurm, Ray, and Kubernetes Inference on a Single EKS Cluster Without Scheduling Chaos
There's a meeting that happens at every organization the moment their AI ambitions outgrow their GPU budget. It usually involves three teams talking past each other. The HPC team says: "We need 32 GPU
Mar 8, 2026 · 32 min read
Inference Engineering — Book Review
Some books explain AI infrastructure with clean diagrams and tidy abstractions. This one pulls you into the engine room and shows you what actually happens between a prompt and a response — the memory
Mar 3, 2026 · 6 min read