Headless Browser Cost Per LLM-Vision Call: 2026 Benchmarks

Rendering a page with Playwright or Puppeteer costs roughly $0.0003 to $0.002 per screenshot, but the headless browser cost per LLM-vision call blows that number out of the water once you add inference. in 2026, the real bill isn’t the browser — it’s what you do with the pixels. this benchmark breaks down the full stack cost so you can decide whether vision-based scraping actually pencils out for your use case.

What You’re Actually Paying For

the cost of a single LLM-vision scrape has four components:

browser compute — EC2/GCP/Hetzner instance time to render and screenshot
screenshot resolution — image size in bytes directly affects token count
vision model input tokens — the dominant cost line
output tokens — parsed JSON or structured data returned

a 1280×800 screenshot at standard PNG compression runs about 200KB to 600KB, which encodes to roughly 800 to 2,400 vision tokens depending on the model’s tiling scheme. at GPT-4o pricing ($2.50/1M input tokens as of Q1 2026), that’s $0.002 to $0.006 per screenshot in model costs alone — before you count the system prompt or output.

if your pipeline auto-generates Pydantic extraction schemas the way LLM-Driven Scraping Schemas: Auto-Generating Pydantic Models (2026) covers, output token counts stay tight. but a generic “extract everything” prompt can balloon output to 500+ tokens per call.

Provider Benchmark: Vision Cost Per Call (2026)

the table below uses a 1280×800 screenshot (approx. 1,200 vision tokens in), a 150-token system prompt, and a 200-token JSON output. prices are public list rates as of May 2026.

model	input $/1M	output $/1M	est. cost/call	notes
GPT-4o (OpenAI)	$2.50	$10.00	$0.0052	fastest, widest MIME support
Claude 3.5 Sonnet	$3.00	$15.00	$0.0066	strong structured output
Claude 3 Haiku	$0.25	$1.25	$0.0006	low accuracy on dense layouts
Gemini 1.5 Flash	$0.075	$0.30	$0.0002	best cost/call, slightly worse JSON
Gemini 1.5 Pro	$1.25	$5.00	$0.0028	best accuracy on complex tables
Llama 3.2 Vision (self-hosted, A100)	~$0.40/hr GPU	—	$0.0008–0.0015	depends on throughput

at scale, Gemini 1.5 Flash is the clear winner on cost if your pages are clean. for dense e-commerce grids or dynamically rendered dashboards, Gemini 1.5 Pro or GPT-4o accuracy often offsets the 10x price difference in downstream data quality.

Browser Compute Cost and Resolution Trade-offs

headless Chromium on an e2-standard-2 GCP instance (2 vCPU, 8GB RAM, ~$0.067/hr) handles about 30 to 50 screenshot+call cycles per hour at safe concurrency. that adds $0.0013 to $0.0022 per call in browser compute — negligible vs. model costs.

where you can cut costs meaningfully:

lower resolution: drop to 960×600. vision token count falls ~35%, saving ~$0.001/call on GPT-4o
crop to the relevant DOM node: screenshot only the target element with element.screenshot() rather than full page
JPEG over PNG: same visual fidelity at 40-60% smaller file size, fewer tokens
cache rendered screenshots: if the underlying page content doesn’t change hourly, storing the screenshot and re-using it for multiple model calls is free

from playwright.async_api import async_playwright
import base64

async def screenshot_element(url: str, selector: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page(viewport={"width": 960, "height": 600})
        await page.goto(url, wait_until="networkidle")
        el = page.locator(selector)
        img_bytes = await el.screenshot(type="jpeg", quality=75)
        await browser.close()
        return base64.b64encode(img_bytes).decode()

this pattern gets you a 75KB JPEG in most cases instead of a 400KB PNG — a 4x token reduction on the same content.

When Vision Actually Beats CSS Selectors

vision-based scraping earns its cost premium in three specific scenarios:

rendered canvas elements, PDF embeds, or SVG charts where there is no DOM to parse
sites with aggressive selector obfuscation that causes frequent breakage (the maintenance cost argument in Self-Healing Scrapers with LLMs: When Selectors Break (2026) quantifies this well)
rapid prototyping where you need extraction working in hours, not days

for stable sites with predictable HTML, vision adds cost without accuracy benefit. the break-even point is roughly 3 selector breakages per month — if a site breaks your selectors less often than that, traditional Playwright scraping is cheaper over a 6-month horizon.

Downstream Storage: Don’t Overlook Embedding Costs

once you have LLM-extracted data, many pipelines push it into vector stores for semantic retrieval. that’s a separate cost layer. embedding 200 tokens of extracted output via text-embedding-3-small costs $0.00000004 — essentially free per record. but at 100K records/day, you’re generating ~50MB of 1536-dimension float32 vectors daily, and storage and query costs in pgvector vs Qdrant vs Weaviate (2026) become non-trivial.

if your use case is building a RAG pipeline over scraped documentation, the vision extraction cost is a one-time ingest cost — amortize it against the number of downstream queries to get the real per-query economics.

for data pipelines that pull from structured news or financial feeds rather than raw HTML, it’s worth comparing against API alternatives. the NewsAPI Pricing 2026 breakdown illustrates how paid feed APIs can undercut full-stack vision scraping by 10x to 50x for content-class data, where the structured data already exists.

Bottom Line

for most scraping use cases, Gemini 1.5 Flash at $0.0002/call with cropped JPEG screenshots is the cheapest viable vision option in 2026 — use it as the default and escalate to GPT-4o or Gemini Pro only when output quality fails validation. the browser compute cost is a rounding error; the model call is the bill. DRT will continue tracking provider pricing as it shifts through the year, since this table will look different by Q4.