If you're building an app on top of OpenAI or Anthropic, you've probably noticed something uncomfortable: there's no native way to limit how much any individual user can cost you.
OpenAI gives you org-level spend caps and project budgets. These protect OpenAI's billing relationship with you. They don't protect you from the one user who discovers they can run your AI feature in a loop at 3am and generate $400 in API costs before you wake up.
I learned this the hard way. Twice.
Why the obvious solutions don't work
The first thing most developers try is an org-level spend cap. Set it to $100/month and call it done. The problem is that cap applies to your entire organisation. One heavy user can consume it entirely, leaving everyone else with nothing. You also have no visibility into who did it.
The second thing people try is request-based rate limiting. Limit each user to 100 requests per day. This sounds reasonable until you realise that 100 requests to gpt-4o-mini cost almost nothing, while 100 requests to gpt-4o with long context windows could cost $50. Request counts and cost are completely different things.
The third thing people try, and what I tried first, is building it in their existing stack. Log the token counts to Postgres after each call, check the total before the next one. The problem is latency. A database read before every AI call, on top of the AI call itself, in a serverless environment adds 200-400ms to every request. Users notice immediately.
The architecture that actually works
After two failed attempts I landed on something that works cleanly.
Redis for the counters
Redis is the right tool for spend counters because it serves reads and increments from memory in well under a millisecond, and its increments are atomic. If two requests from the same user finish simultaneously, both increments are applied and neither update is lost. A Postgres round trip before every call simply can't compete on latency here.
The key structure is simple:
spend:{user_id}:{YYYY-MM} → decimal (running spend in dollars)
blocked:{user_id} → "1" or "0"
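The monthly key resets naturally. You start writing to a new key each month, and the old one expires as long as you set a TTL on it. No cron jobs needed. Note that the blocked key has no month in its name, so give it a TTL that ends with the month as well, otherwise a user blocked in March stays blocked in April.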
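As a rough sketch of the key layout above, these two helpers build the monthly key and work out a TTL that runs to the end of the month. The function names are mine and purely illustrative, and I'm assuming months are cut on UTC boundaries:

```typescript
// Hypothetical helpers; the names are illustrative, not from any library.

// Builds the monthly spend key, e.g. "spend:user_42:2025-03" (UTC month).
function spendKey(userId: string, now: Date): string {
  const month = String(now.getUTCMonth() + 1).padStart(2, '0')
  return `spend:${userId}:${now.getUTCFullYear()}-${month}`
}

// Seconds from `now` until the first instant of the next UTC month.
// Date.UTC rolls month 12 over into January of the following year.
function secondsUntilNextMonth(now: Date): number {
  const nextMonth = Date.UTC(now.getUTCFullYear(), now.getUTCMonth() + 1, 1)
  return Math.ceil((nextMonth - now.getTime()) / 1000)
}
```

You would pass the second value as the expiry (Redis `EXPIRE`, or `SET ... EX`) on the blocked flag, and the same value plus a safety margin on the spend key.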
An edge function for the intercept
The check that happens before every AI call needs to be as fast as possible. Running it on a Cloudflare Worker means it executes at the Cloudflare location nearest the caller, which here is your app server. A Redis read from a Cloudflare Worker in the same region typically completes in under 20ms.
The intercept logic is straightforward:
- Read the user's current spend from Redis
- Compare against their tier limit
- Return allow or block
No database calls in the allow path. The allow path is purely Redis.
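The three steps above can be sketched roughly like this. The `SpendReader` interface is my own stand-in for a Redis client whose GET returns a string or null, and the function and type names are hypothetical. I've also included the blocked flag from the key structure, since it is part of the same read:

```typescript
// Minimal read interface (an assumption): any client with a string GET fits.
interface SpendReader {
  get(key: string): Promise<string | null>
}

type Decision =
  | { allowed: true }
  | { allowed: false; reason: 'blocked' | 'over_limit' }

async function intercept(
  store: SpendReader,
  userId: string,
  month: string, // e.g. "2025-03"
  limitUsd: number,
): Promise<Decision> {
  // Both reads are issued concurrently, so the allow path waits for one round trip.
  const [blocked, spend] = await Promise.all([
    store.get(`blocked:${userId}`),
    store.get(`spend:${userId}:${month}`),
  ])
  if (blocked === '1') return { allowed: false, reason: 'blocked' }
  // A missing key means the user has not spent anything this month.
  const spentUsd = spend === null ? 0 : Number(spend)
  if (spentUsd >= limitUsd) return { allowed: false, reason: 'over_limit' }
  return { allowed: true }
}
```

Nothing here touches the database: tier limits are passed in, so they can come from config or a cached lookup.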
Fire-and-forget logging
After the AI call completes, you need to log the cost and update the counter. This should never block the response to your user. In a Cloudflare Worker, pass the log call's promise to waitUntil on the execution context (ctx.waitUntil) so it keeps running after you've already returned the response.
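Here is a minimal sketch of that pattern. `ExecutionContextLike` mirrors only the `waitUntil` method of the Workers execution context, and `respondThenLog`, `callModel` and `logUsage` are hypothetical names for illustration:

```typescript
// Only the part of the Workers execution context this sketch needs.
interface ExecutionContextLike {
  waitUntil(promise: Promise<unknown>): void
}

async function respondThenLog<T>(
  ctx: ExecutionContextLike,
  callModel: () => Promise<T>,
  logUsage: (result: T) => Promise<void>,
): Promise<T> {
  const result = await callModel()
  // Hand the promise to the runtime instead of awaiting it, so the caller
  // gets the result immediately while the log call finishes in the background.
  ctx.waitUntil(
    logUsage(result).catch((err) => console.error('usage log failed', err)),
  )
  return result
}
```

The thing to watch for is an accidental `await` on the log promise anywhere in the call chain, which quietly turns fire-and-forget back into a blocking step.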
The log call does the heavier work: calculates cost from token counts, increments the Redis counter using INCRBYFLOAT, sets the blocked key if the new total exceeds the limit, and writes the event to your database for analytics.
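A sketch of the cost calculation and counter update might look like the following. The prices in the table are placeholders I've filled in for illustration, so check your provider's pricing page, and `SpendWriter` is my own minimal interface modelled on the Redis INCRBYFLOAT and SET commands rather than any specific client's API. The analytics write to your database is left out:

```typescript
// Illustrative prices in USD per million tokens. Placeholders only:
// verify against your provider's current pricing before relying on them.
const PRICES: Record<string, { input: number; output: number }> = {
  'gpt-4o': { input: 2.5, output: 10 },
  'gpt-4o-mini': { input: 0.15, output: 0.6 },
}

function costUsd(model: string, inputTokens: number, outputTokens: number): number {
  const price = PRICES[model]
  if (!price) throw new Error(`no price configured for ${model}`)
  return (inputTokens * price.input + outputTokens * price.output) / 1e6
}

// Minimal write interface (an assumption), mirroring INCRBYFLOAT and SET.
interface SpendWriter {
  incrbyfloat(key: string, amount: number): Promise<number>
  set(key: string, value: string): Promise<unknown>
}

async function recordSpend(
  store: SpendWriter,
  userId: string,
  month: string, // e.g. "2025-03"
  limitUsd: number,
  cost: number,
): Promise<number> {
  // The increment is atomic and returns the new running total.
  const total = await store.incrbyfloat(`spend:${userId}:${month}`, cost)
  // Flip the blocked flag once the user has reached or passed their limit.
  if (total >= limitUsd) await store.set(`blocked:${userId}`, '1')
  return total
}
```

Because the counter is a float, expect tiny rounding differences over many increments. If that matters for you, storing integer micro-dollars with INCRBY is a common alternative.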
The SDK wrapper
The cleanest integration is a single wrap function around your AI call:
const callAI = guard(openai.chat.completions.create.bind(openai.chat.completions))
// Everywhere else in your codebase, completely unchanged:
const result = await callAI({ model: 'gpt-4o', messages: [...] }, ctx)
Before the call: intercept. After the call: log. If the user is blocked: throw a typed error with your upgrade message. The rest of your codebase never changes.
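To make the shape of that wrapper concrete, here is a stripped-down sketch. It is not the real SDK: this `guard` takes its `check` and `log` hooks as an explicit second argument, and `SpendLimitError`, `GuardContext` and `GuardHooks` are names invented for the example:

```typescript
// Typed error so callers can show an upgrade message instead of a generic failure.
class SpendLimitError extends Error {
  readonly userId: string
  constructor(userId: string) {
    super('Monthly AI budget reached. Upgrade to continue.')
    this.name = 'SpendLimitError'
    this.userId = userId
  }
}

interface GuardContext {
  userId: string
}

// Hypothetical hooks: `check` is the pre-call intercept, `log` the post-call update.
interface GuardHooks<R> {
  check(ctx: GuardContext): Promise<boolean> // true means the call is allowed
  log(ctx: GuardContext, result: R): Promise<void>
}

function guard<A, R>(fn: (args: A) => Promise<R>, hooks: GuardHooks<R>) {
  return async (args: A, ctx: GuardContext): Promise<R> => {
    if (!(await hooks.check(ctx))) throw new SpendLimitError(ctx.userId)
    const result = await fn(args)
    // Deliberately not awaited: logging must not delay the response.
    // Errors are swallowed here for brevity; in a Worker you would hand
    // this promise to waitUntil instead.
    void hooks.log(ctx, result).catch(() => {})
    return result
  }
}
```

Callers catch `SpendLimitError` once, at the edge of the request handler, and turn it into whatever upgrade prompt the product uses.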
Handling streaming
Most modern AI integrations use streaming for the typewriter effect. This complicates cost tracking because you don't know the final token count until the stream ends.
The solution for OpenAI is to set stream_options: { include_usage: true } on the request, which causes OpenAI to append a final chunk with complete token usage. For Anthropic it's slightly more involved. The message_start event contains input tokens and message_delta contains output tokens, and they're on different paths in the chunk object. You need to accumulate across both events rather than just reading the last one:
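// This loop runs inside an async generator that wraps the Anthropic stream.
const accumulated: Record<string, number> = {}
for await (const chunk of stream) {
  // message_start carries the input token count under chunk.message.usage
  if (chunk.type === 'message_start' && chunk.message?.usage) {
    Object.assign(accumulated, chunk.message.usage)
  }
  // message_delta carries the output token count under chunk.usage
  if (chunk.usage) {
    Object.assign(accumulated, chunk.usage)
  }
  // Pass every chunk through untouched so the caller still gets the stream
  yield chunk
}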
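The OpenAI side is simpler because there is only one place to look. This sketch assumes the request was made with stream_options: { include_usage: true }. `OpenAIChunkLike` describes only the fields read here, and `trackOpenAIUsage` and `onUsage` are illustrative names:

```typescript
// Only the fields this sketch reads; real chunks carry more.
interface UsageLike {
  prompt_tokens: number
  completion_tokens: number
}

interface OpenAIChunkLike {
  choices: unknown[]
  usage?: UsageLike | null
}

// Passes chunks through unchanged and reports usage when it shows up.
async function* trackOpenAIUsage<T extends OpenAIChunkLike>(
  stream: AsyncIterable<T>,
  onUsage: (usage: UsageLike) => void,
): AsyncGenerator<T> {
  for await (const chunk of stream) {
    // With include_usage set, usage is null on every chunk except the
    // final one, which has an empty choices array.
    if (chunk.usage) onUsage(chunk.usage)
    yield chunk
  }
}
```

One caveat: if the stream is cancelled or interrupted, the final usage chunk may never arrive, so it is worth deciding up front how to account for aborted calls.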
What about users who go over the limit?
The cleanest approach is to allow one call to go over, then block the next one.
You can't block before a streaming call completes because you don't know the final cost until it's done. Blocking mid-stream is also a terrible user experience. So the flow is: check if the user is already blocked before the call, and check if the new total exceeds the limit after the call. A user at 99% of their limit might get one more call through before being blocked. That's fine. It's how mobile data plans and banks work. Document it clearly and users understand.
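A tiny in-memory simulation makes the flow easier to see. `BudgetSim` is a made-up stand-in for the two Redis keys, and the dollar amounts are invented for the example:

```typescript
// In-memory stand-in for the spend counter and blocked flag, for illustration only.
class BudgetSim {
  private spentUsd = 0
  private blocked = false
  private readonly limitUsd: number

  constructor(limitUsd: number) {
    this.limitUsd = limitUsd
  }

  // Pre-call check: only the blocked flag is consulted.
  canCall(): boolean {
    return !this.blocked
  }

  // Post-call update: add the real cost, then block once the limit is reached.
  record(costUsd: number): void {
    this.spentUsd += costUsd
    if (this.spentUsd >= this.limitUsd) this.blocked = true
  }
}

const sim = new BudgetSim(10)
sim.record(9.9) // the user is now at 99% of a $10 limit
const firstAllowed = sim.canCall() // still allowed: the flag is not set yet
sim.record(0.4) // this call pushes the total past the limit
const secondAllowed = sim.canCall() // blocked from here on
```

The overshoot is bounded by the cost of whatever was in flight when the limit was crossed. With several concurrent requests from the same user, more than one call can slip through, so size your limits with that in mind.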
Building it yourself
If you want to build this yourself, here's what you need:
- Upstash Redis (serverless, generous free tier, works in Cloudflare Workers)
- Cloudflare Workers for the intercept endpoint
- Your existing database for the event log and analytics
- An npm package that wraps your AI calls
The worker itself is roughly 150 lines of TypeScript. The SDK wrapper is another 100. The tricky parts are getting the streaming accumulation right and making sure the logging is genuinely fire-and-forget and not just awaited somewhere upstream.
The shortcut
I extracted all of this into a standalone tool called Nasca. You sign up, define your user tiers and monthly spend budgets in dollars, with no token calculations, and copy and paste the integration code. The SDK wraps your AI function once and handles everything else: tracking, blocking, upgrade prompts, and a dashboard showing spend per user.
It supports OpenAI and Anthropic. Free up to 250 end-users.
If you want to build it yourself, everything in this post is enough to get started. If you'd rather skip straight to shipping your actual product, Nasca does it in 10 lines.