Last month's Anthropic invoice: $312. Sixty percent of it traced back to a single retry pattern I couldn't see anywhere in my normal logs.
The agent was failing on tool calls, then re-entering the loop with the full context intact — 18K input tokens per invocation on a task that needs 3-4K. Claude Code's UI looked fine. Workers logs showed 200s. D1 writes were clean. The billing dashboard just said "tokens used" with no breakdown by worker or call chain.
I found the culprit only after shipping Workers logs to R2 via Logpush and querying with DuckDB:
    SELECT
        worker_name,
        COUNT(*) AS call_count,
        AVG(input_tokens) AS avg_input,
        SUM(input_tokens) AS total_input
    FROM read_parquet('s3://my-logs/workers/2026-05/*.parquet')
    GROUP BY worker_name
    ORDER BY total_input DESC;
One worker — ad-report-summarizer — was eating 58% of total input tokens. That query cost me maybe 20 minutes to set up. The Logpush + R2 + DuckDB stack runs under $5/month.
Once I had a suspect, I used Claude Code's --verbose flag to reconstruct the tool call chain. Most people treat --verbose as a log-level toggle. It's not — it dumps the full tool input/output JSON for every call in the session. Pipe it to a file, run jq on it, and you can replay the exact sequence that blew up your context.
For multi-agent loops specifically (I run 6 Slack bots coordinated through Workers), KV counters have been the single most reliable safeguard. A counter keyed to the conversation thread, checked on every bot invocation, with a last_actor field — when the counter approaches the limit, last_actor tells you immediately which bot is driving the chain. Six months in, it's almost always summarizer-bot triggering router-bot triggering summarizer-bot again.
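A minimal sketch of that kind of counter, not the exact code from my bots. The key prefix `loop:`, the limit of 8 hops, the one-hour TTL, and the names `advance`, `recordHop`, and `KVLike` are all assumptions made for illustration. `KVLike` is only shaped like the `get`/`put` subset of a Workers KV binding. KV is eventually consistent and has no atomic increment, so treat the count as approximate; it is a tripwire for runaway chains, not an exact ledger.

```typescript
// Subset of the Workers KV interface this sketch relies on (assumed shape).
interface KVLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, options?: { expirationTtl?: number }): Promise<void>;
}

interface ThreadState {
  count: number;      // bot invocations seen so far on this thread
  lastActor: string;  // bot that made the most recent allowed invocation
}

interface HopResult extends ThreadState {
  allowed: boolean;
}

// Hypothetical values; tune them per workflow.
const MAX_HOPS = 8;
const TTL_SECONDS = 3600;

// Pure decision step: no I/O, so it can be tested without a KV binding.
function advance(prev: ThreadState | null, actor: string, maxHops: number): HopResult {
  const state = prev ?? { count: 0, lastActor: "" };
  if (state.count >= maxHops) {
    // Limit reached: refuse the hop and report who made the last allowed one.
    return { count: state.count, lastActor: state.lastActor, allowed: false };
  }
  return { count: state.count + 1, lastActor: actor, allowed: true };
}

// I/O wrapper: call once at the top of every bot invocation.
async function recordHop(
  kv: KVLike,
  threadId: string,
  actor: string,
  maxHops: number = MAX_HOPS,
): Promise<HopResult> {
  const key = `loop:${threadId}`;
  const raw = await kv.get(key);
  const prev = raw ? (JSON.parse(raw) as ThreadState) : null;
  const result = advance(prev, actor, maxHops);
  if (result.allowed) {
    const next: ThreadState = { count: result.count, lastActor: result.lastActor };
    // The TTL lets the counter expire once the thread goes quiet.
    await kv.put(key, JSON.stringify(next), { expirationTtl: TTL_SECONDS });
  }
  return result;
}
```

When `recordHop` returns `allowed: false`, the bot can bail out before calling the model and log `lastActor` together with the thread ID.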
The harder unsolved problem: I'm still seeing intermittent schema drift in tool call responses — same prompt, same model, valid JSON but different structure. It's non-deterministic, doesn't reproduce on demand, and when it triggers a retry, costs double. I haven't confirmed whether it's a Sonnet serialization quirk or something in my Workers pipeline.
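One way to make this kind of drift visible is to log a structural fingerprint of each tool call response and compare it across runs. The sketch below is an illustration under assumptions, not a diagnosis: `shapeOf` is a hypothetical helper that reduces any parsed JSON value to a string describing its keys and value types while ignoring the actual values. Two responses with the same fingerprint have the same structure, so a changed fingerprint marks a drift event to investigate.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Reduce a JSON value to a canonical description of its structure.
// Object keys are sorted so key order does not affect the result;
// arrays collapse to the sorted set of distinct element shapes.
function shapeOf(value: Json): string {
  if (value === null) return "null";
  if (Array.isArray(value)) {
    const shapes = Array.from(new Set(value.map((v) => shapeOf(v)))).sort();
    return `[${shapes.join("|")}]`;
  }
  if (typeof value === "object") {
    const obj = value;
    const fields = Object.keys(obj)
      .sort()
      .map((k) => `${k}:${shapeOf(obj[k])}`);
    return `{${fields.join(",")}}`;
  }
  return typeof value; // "string", "number", or "boolean"
}

// Usage: same structure gives the same fingerprint, whatever the values.
const expected = shapeOf({ items: [{ id: 1 }, { id: 2 }] });
const drifted = shapeOf({ items: { id: 1 } }); // array became a bare object
console.log(expected === drifted); // false
```

Logging the fingerprint next to the request ID on every tool call response would show how often the structure changes and under which prompts, which may help separate a model-side cause from a pipeline-side one.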
I wrote up the full breakdown — including the PostToolUse hook setup for snapshotting tool call sequences, the cf-ray correlation trick for tracing multi-worker chains, and the per-tool production evaluation table — over on riversealab.com.
Top comments (1)
This is a great example of why AI costs are often an observability problem before they're a model problem.
The silent retry loop is particularly painful because everything can look healthy from an infrastructure perspective (200 responses, successful writes, no obvious errors) while token consumption quietly explodes in the background.
The KV counter approach is smart too. In multi-agent systems, the expensive failures are rarely single-agent mistakes; they're usually feedback loops between agents that individually behave correctly but collectively create runaway execution paths.
We've seen similar patterns while helping teams scale AI workflows at IT Path Solutions. Cost spikes are often traced back to retries, context bloat, or agent-to-agent loops rather than model selection itself. The biggest savings usually come from better tracing and guardrails, not switching providers.
The lesson here is valuable: if you can't attribute token usage to specific workflows, you're debugging your AI bill blind.