If you're building AI agents that work with browser screenshots, you already know the pain.
You take a full 1920×1080 screenshot, pass it to GPT-4o or Claude, and watch your token bill climb — while the model downscales the image anyway and blurs the exact text you needed it to read.
There's a better way.
The problem
Vision LLMs are expensive for two reasons when you feed them full screenshots:
- Token cost — a full screenshot can cost 10–20x more tokens than a small crop
- Accuracy loss — models internally downscale large images, blurring fine text, labels, and UI elements
But your agent already knows where to look. Browser automation tools like Playwright and Puppeteer let you call getBoundingClientRect() inside the page, which returns the exact position and size of any element on screen.
So why are you sending the whole screenshot?
The solution
I built a stateless pay-per-use API that takes a screenshot and pixel coordinates, and returns just the cropped element as a lossless PNG — ready to pass directly to your vision LLM.
POST /crop
{
"image": "<base64 screenshot>",
"x": 120,
"y": 45,
"width": 640,
"height": 80
}
Returns:
{
"success": true,
"data": {
"base64": "iVBORw0KGgo...",
"mime": "image/png",
"width": 640,
"height": 80,
"bytes": 4821
}
}
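A 4 KB crop instead of a 2 MB screenshot. The model still sees the element you care about, and because image tokens scale with pixel dimensions rather than file size, a crop like this can use roughly 95% fewer tokens.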
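To put a rough number on the savings, here is a back-of-the-envelope estimate. It assumes the approximation Anthropic documents for Claude, tokens ≈ (width × height) / 750, and ignores any provider-side resizing, so the full-screenshot figure is an upper bound. The function name approxImageTokens is made up for this sketch, and other providers count differently (OpenAI, for example, bills by tiles), so treat the result as a ballpark rather than a bill.

```javascript
// Rough image-token estimate: tokens ≈ (width_px * height_px) / 750.
// Assumption: no provider-side downscaling is modeled, so large images
// come out as upper bounds.
function approxImageTokens(width, height) {
  return Math.ceil((width * height) / 750);
}

const fullScreenshot = approxImageTokens(1920, 1080); // whole viewport
const crop = approxImageTokens(640, 80);              // the element from the example request
const savings = 1 - crop / fullScreenshot;

console.log({
  fullScreenshot,
  crop,
  savings: `${(savings * 100).toFixed(1)}%`,
});
```

Under these assumptions the crop comes out well under 5% of the full screenshot's token count, which is in line with the figure above.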
How payment works
Here's where it gets interesting. The API uses the x402 payment protocol — HTTP's long-dormant 402 Payment Required status code, finally put to use.
There are no API keys. No accounts. No subscriptions. The agent pays 0.0005 USDC (about $0.0005) per crop on Base L2 automatically.
The flow:
1. Agent POSTs to /crop (no payment header)
← 402 with payment instructions in headers
2. Agent transfers 0.0005 USDC to recipient wallet on Base
(near-zero gas, ~2 second settlement)
3. Agent POSTs again with x-payment-tx-hash header
← 200 with cropped PNG
The entire exchange is two HTTP requests and one on-chain transfer. No human intervention. No billing dashboard. The money lands directly in the operator's wallet on-chain.
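The three steps above can be wrapped in one small helper. This is a minimal sketch, not the service's official client: cropWithPayment, fetchFn, and pay are names invented here, and pay stands in for whatever wallet code you use to send USDC and get back a transaction hash. The header names are the ones shown in this post.

```javascript
// Sketch of the probe -> pay -> retry flow.
// `fetchFn` and `pay` are injected so the flow can be exercised without
// a network connection or a wallet.
// `pay` is a hypothetical callback: ({ recipient, amount }) => Promise<txHash>.
async function cropWithPayment(url, cropRequest, { fetchFn = fetch, pay }) {
  const init = {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(cropRequest),
  };

  // Step 1: unpaid request; the server is expected to answer 402.
  const probe = await fetchFn(url, init);
  if (probe.status !== 402) {
    throw new Error(`Expected a 402 payment challenge, got ${probe.status}`);
  }
  const recipient = probe.headers.get('x-payment-recipient');
  const amount = probe.headers.get('x-payment-price-usdc');
  if (!recipient || !amount) {
    throw new Error('402 response is missing payment headers');
  }

  // Step 2: settle on-chain and obtain the transaction hash.
  const txHash = await pay({ recipient, amount });

  // Step 3: repeat the same request with the payment proof attached.
  const paid = await fetchFn(url, {
    ...init,
    headers: { ...init.headers, 'x-payment-tx-hash': txHash },
  });
  if (!paid.ok) {
    throw new Error(`Crop request failed with status ${paid.status}`);
  }
  const { data } = await paid.json();
  return data;
}
```

Injecting fetchFn and pay keeps the payment logic separate from the HTTP logic, so an agent framework can swap in its own wallet without touching the request code.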
Real agent integration
Here's what using it looks like in a Playwright agent:
import { chromium } from 'playwright';
import OpenAI from 'openai';
// Hypothetical module: your own wallet logic that sends USDC on Base
// and resolves to the transaction hash (for example, built with viem)
import { sendUsdc } from './wallet.js';

const CROP_URL = 'https://x402-vision-cropper.onrender.com/crop';
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com/dashboard');

// Take screenshot (returned as a Buffer, so no temp file is needed)
const imageB64 = (await page.screenshot()).toString('base64');

// Get element coordinates (viewport-relative CSS pixels)
const rect = await page.$eval('.price-display', el => el.getBoundingClientRect().toJSON());
await browser.close();

// The probe and the paid retry send the same request body
const body = JSON.stringify({
  image: imageB64,
  x: Math.floor(rect.x),
  y: Math.floor(rect.y),
  width: Math.floor(rect.width),
  height: Math.floor(rect.height),
});

// Probe the API for payment instructions
const probe = await fetch(CROP_URL, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body,
});
// → 402 response with payment details in headers
const recipient = probe.headers.get('x-payment-recipient');
const amount = probe.headers.get('x-payment-price-usdc');

// Pay on Base L2
const txHash = await sendUsdc({ recipient, amount }); // your wallet logic here

// Resubmit with payment proof
const result = await fetch(CROP_URL, {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'x-payment-tx-hash': txHash,
  },
  body,
});
const { data } = await result.json();

// Pass the tiny crop to your vision LLM instead of the full screenshot
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{
    role: 'user',
    content: [
      { type: 'image_url', image_url: { url: `data:${data.mime};base64,${data.base64}` } },
      { type: 'text', text: 'What is the price shown?' }
    ]
  }]
});
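One detail worth handling in a real agent: getBoundingClientRect() reports CSS pixels, while the screenshot is measured in device pixels. The two match only when the device scale factor is 1 (Playwright's default), and flooring the width and height can shave a pixel off the element's edge. The helper below is a small sketch of one way to handle both. The name toCropRegion and its parameters are invented for this example and are not part of the API.

```javascript
// Convert a DOMRect (CSS pixels) into a crop region in screenshot pixels.
// `scale` is the page's device scale factor (window.devicePixelRatio);
// it is assumed to be 1 unless you pass something else.
function toCropRegion(rect, imageWidth, imageHeight, scale = 1) {
  // Round outward so no edge pixels of the element are lost,
  // then clamp to the screenshot bounds.
  const left = Math.max(0, Math.floor(rect.x * scale));
  const top = Math.max(0, Math.floor(rect.y * scale));
  const right = Math.min(imageWidth, Math.ceil((rect.x + rect.width) * scale));
  const bottom = Math.min(imageHeight, Math.ceil((rect.y + rect.height) * scale));

  if (right <= left || bottom <= top) {
    throw new Error('Element lies outside the screenshot');
  }
  return { x: left, y: top, width: right - left, height: bottom - top };
}

// Example: a fractional rect on a 1920x1080 screenshot at scale 1
console.log(toCropRegion({ x: 120.4, y: 45.2, width: 640.3, height: 80.1 }, 1920, 1080));
```

The returned object has the same x, y, width, and height fields as the /crop request body, so it can be spread straight into the payload next to the image.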
The architecture
The server is intentionally minimal:
- Fastify on Node.js — low memory footprint
- Sharp for image processing — in-RAM only, no disk writes
- Zero persistent storage — every request is stateless, data exists only for the duration of the request
- Runs on a 512MB single-CPU container on Render
The entire codebase is about 400 lines across 7 files. No database. No session state. No auth layer beyond the payment itself.
Try it
The API is live now:
# Check it's running
curl https://x402-vision-cropper.onrender.com/health
# Trigger the payment challenge
curl -X POST https://x402-vision-cropper.onrender.com/crop \
-H "Content-Type: application/json" \
-d '{"image":"'"$(python3 -c "print('A'*200)")"'","x":0,"y":0,"width":10,"height":10}'
Machine-readable docs for agents: https://x402-vision-cropper.onrender.com/llms.txt
What I learned building this
x402 is genuinely exciting but very early. The protocol works cleanly — payment instructions in headers, proof in the retry, settlement on-chain. But the agent ecosystem is still catching up. Most frameworks don't have native wallet support yet.
Stateless by design is underrated. No database means no stored data to breach, no GDPR headache, no backup strategy, no connection pooling. Every request lives and dies in RAM. For a high-throughput API that processes sensitive screenshot data, this is the right architecture.
The unit economics make sense at scale. At $0.0005 per crop the service costs less than a rounding error compared to what it saves on vision tokens. The challenge isn't pricing — it's volume.
If you're building browser agents or anything that feeds screenshots to vision models, give it a try. And if you're building in the x402 / agentic payments space I'd love to hear what you're working on.