Scraping with vision models (GPT-4o, Claude 3.5, Gemini Pro)

Vision model scraping in 2026 has crossed the line from cool demo to legitimate production tool. The major LLMs (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro and Flash) all accept image input now, all do strong OCR and layout reasoning, and all return strict JSON when asked. For scraping work specifically, that means you can take a screenshot of any web page and extract structured data without writing a single CSS selector.

This guide covers when vision-model scraping wins, how to use each major model effectively, the cost picture, and the production patterns that keep latency and bills under control. Working code throughout.

Why vision model scraping matters

Three problems vision models solve that text-based extraction cannot.

First, sites that render content with images. PDF embeds, infographics, sites that ship product information as images for SEO reasons. Text scrapers see nothing. Vision models read the image.

Second, sites with bot defenses that mangle HTML. Cloudflare’s HTML scrambling, sites that randomize class names per request, sites that ship CSS sprites instead of text. Vision models bypass all of it because they read the rendered pixels.

Third, layout-driven extraction. When the same field name appears in two places (header price and main price), text extraction guesses. Vision extraction sees which one is bigger, more prominent, in the right region.

How vision-model scraping works

The pattern is consistent across all three providers:

  1. Render the target page in a headless browser
  2. Take a screenshot (full page or viewport)
  3. Send the screenshot plus an extraction prompt and schema to the vision model
  4. Validate and store the result

The browser is just a screenshot generator. No selectors. No DOM traversal. The model does all the layout reasoning.

When to skip vision entirely

Vision extraction is the wrong tool when the page is static text in a stable HTML structure. The cost of vision tokens dwarfs text tokens, and accuracy is no better. Reach for vision only when text extraction fails or hits one of the three winning conditions described later.

Implementation with GPT-4o

import asyncio
import base64
import json

from openai import AsyncOpenAI
from playwright.async_api import async_playwright

client = AsyncOpenAI()

PRODUCT_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "currency": {"type": "string"},
        "in_stock": {"type": "boolean"},
        "rating": {"type": ["number", "null"]},
        "review_count": {"type": ["integer", "null"]},
    },
    "required": ["title", "price", "currency", "in_stock", "rating", "review_count"],
    "additionalProperties": False,
}

async def screenshot_url(url: str) -> bytes:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page(viewport={"width": 1280, "height": 1024})
        await page.goto(url, wait_until="networkidle")
        png = await page.screenshot(full_page=True)
        await browser.close()
        return png

async def extract_with_gpt4o(png_bytes: bytes) -> dict:
    b64 = base64.b64encode(png_bytes).decode()
    resp = await client.chat.completions.create(
        model="gpt-4o",
        response_format={
            "type": "json_schema",
            "json_schema": {"name": "product", "schema": PRODUCT_SCHEMA, "strict": True},
        },
        messages=[
            {"role": "system", "content": "Extract product data from this screenshot."},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}", "detail": "high"}}
            ]},
        ],
    )
    return json.loads(resp.choices[0].message.content)

async def main():
    png = await screenshot_url("https://www.lazada.sg/products/example.html")
    print(await extract_with_gpt4o(png))

asyncio.run(main())

The detail: "high" setting matters. The default ("auto") downsamples large images and loses fine text. For product pages with small price labels, always use "high".

Implementation with Claude Sonnet 4.5

from anthropic import AsyncAnthropic
import base64

client = AsyncAnthropic()

async def extract_with_claude(png_bytes: bytes) -> dict:
    b64 = base64.b64encode(png_bytes).decode()
    resp = await client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=2000,
        tools=[{
            "name": "save_product",
            "description": "Save the extracted product",
            "input_schema": PRODUCT_SCHEMA,
        }],
        tool_choice={"type": "tool", "name": "save_product"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": b64}},
                {"type": "text", "text": "Extract the product data from this screenshot."},
            ],
        }],
    )
    # The forced tool call comes back as a tool_use block; pull its input
    return next(block.input for block in resp.content if block.type == "tool_use")

Claude has no detail flag; it processes the image as you send it, downscaling only when it exceeds Anthropic’s size limits. The trade-off is a higher per-image cost than GPT-4o on large screenshots.

Implementation with Gemini 1.5 Pro

import google.generativeai as genai
import os
import json

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# Gemini's response_schema takes an OpenAPI-style subset of JSON Schema, so strip
# unsupported keys such as additionalProperties before passing the shared schema.
GEMINI_SCHEMA = {k: v for k, v in PRODUCT_SCHEMA.items() if k != "additionalProperties"}

model = genai.GenerativeModel(
    "gemini-1.5-pro",
    generation_config={"response_mime_type": "application/json", "response_schema": GEMINI_SCHEMA},
)

async def extract_with_gemini(png_bytes: bytes) -> dict:
    response = await model.generate_content_async([
        "Extract the product data from this screenshot.",
        {"mime_type": "image/png", "data": png_bytes},
    ])
    return json.loads(response.text)

Gemini’s huge context window (2 million tokens on 1.5 Pro) and dedicated response_schema parameter make it a natural fit for vision extraction at scale. Cost per image is competitive with the others.

Side-by-side comparison

We ran 100 product page screenshots from Lazada, Amazon, and Best Buy through each model.

Metric               | GPT-4o                  | Claude Sonnet 4.5         | Gemini 1.5 Pro            | Gemini 1.5 Flash
Cost per image       | $0.027                  | $0.041                    | $0.024                    | $0.0035
Latency (p50)        | 2.4 s                   | 3.1 s                     | 2.0 s                     | 1.2 s
Extraction accuracy  | 96%                     | 97%                       | 95%                       | 91%
Best at              | UI element recognition  | Text-heavy pages          | Long pages, multilingual  | Cost-sensitive volume
Worst at             | Very small text         | Image-heavy without text  | OCR on stylized fonts     | Complex layouts

For most production scraping, GPT-4o or Claude Sonnet is the right pick. Gemini Flash is the value play when cost matters more than the last five points of accuracy.

Comparing model behavior on the same screenshot

The same Lazada product page screenshot, three models, same JSON Schema:

GPT-4o output: clean extraction, with an occasional off-by-one on review counts when the displayed number includes a comma in a non-US locale.

Claude Sonnet 4.5 output: most reliable on text-heavy pages, occasionally over-conservative on in_stock (returns false if any “out of stock” appears anywhere on the page, including in related products).

Gemini 1.5 Pro output: strongest at multilingual content, occasional layout confusion when the price is in a sidebar widget rather than the main panel.

Practical implication: if you have multilingual targets, lean Gemini. If you have text-heavy English ecommerce, lean Claude. If you have a mix of languages and want a balanced default, GPT-4o.

When vision wins over text extraction

Vision wins when one of three conditions holds:

  1. The HTML is intentionally obfuscated (Cloudflare scrambling, randomized classes)
  2. Critical content is rendered as image (price tags as PNGs, infographics)
  3. Layout matters for disambiguation (multiple prices on the same page)

Vision loses when the HTML is clean and well-structured. Text extraction is 5-10x cheaper and just as accurate on those targets.

For more on text extraction patterns, see LLM extraction patterns: structured output from messy HTML.

Real failure modes

A few specific failure patterns observed in production:

The model reads a strikethrough price (the “old” price) instead of the current price. Mitigation: explicit instruction “extract the current price, not the strikethrough or comparison price.”

The model extracts a related product’s price when the main product price is hidden behind a button. Mitigation: instruct “extract only the main product on this page” and add a sentinel check (e.g. the title must contain a known keyword).

The model treats currency-only labels (just “$”) as full prices. Mitigation: validate that price > 0 and reject extractions where price is implausibly small.
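
A minimal validation pass covering the price and sentinel mitigations above. The expected_keyword argument is an illustrative stand-in for whatever you already know about the product; field names follow PRODUCT_SCHEMA.

def validate_extraction(data: dict, expected_keyword: str | None = None) -> bool:
    # Reject implausibly small prices (a bare "$" misread as a price)
    price = data.get("price")
    if not isinstance(price, (int, float)) or price <= 0:
        return False
    # Sentinel check: the title must mention a keyword we already know about the main product
    if expected_keyword and expected_keyword.lower() not in (data.get("title") or "").lower():
        return False
    return True

Run it before writing to storage; extractions that fail go to a retry queue or a crop-and-re-extract pass.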

The model fails on sites that render prices with web fonts containing custom glyphs (some bot defenses ship a font that maps digits to other glyphs). Mitigation: hybrid extraction that also reads the HTML, since the HTML still contains the real character codes.

Hybrid extraction: vision for hard fields only

The cost-optimal pattern for many sites is hybrid. Use cheap text extraction for the easy fields (title, description) and reserve vision for the fields that fail text extraction (price hidden behind dynamic rendering, stock indicator embedded in an SVG icon).

async def hybrid_extract(html: str, png: bytes) -> dict:
    text_result = await extract_text_with_4o_mini(html)
    if text_result.get("price") is None or text_result.get("currency") is None:
        vision_result = await extract_with_gpt4o(png)
        text_result["price"] = vision_result.get("price")
        text_result["currency"] = vision_result.get("currency")
    return text_result

This pattern saves significant cost over pure vision while catching the cases where text fails.
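
For completeness, here is a minimal sketch of the text-side helper the hybrid function assumes. The gpt-4o-mini prompt and the relaxed schema are illustrative; the relaxed schema is what lets the price-is-None check above fire when the HTML genuinely lacks the field.

# Relaxed copy of the schema so the text pass can return null for fields it cannot find
TEXT_SCHEMA = json.loads(json.dumps(PRODUCT_SCHEMA))
TEXT_SCHEMA["properties"]["price"] = {"type": ["number", "null"]}
TEXT_SCHEMA["properties"]["currency"] = {"type": ["string", "null"]}

async def extract_text_with_4o_mini(html: str) -> dict:
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={
            "type": "json_schema",
            "json_schema": {"name": "product", "schema": TEXT_SCHEMA, "strict": True},
        },
        messages=[
            {"role": "system", "content": "Extract product data from this HTML. Return null for fields that are not present."},
            {"role": "user", "content": html[:100000]},
        ],
    )
    return json.loads(resp.choices[0].message.content)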

Full-page vs viewport screenshots

Full-page screenshots capture everything but produce huge PNGs that cost more to process and confuse models with too much content.

Viewport screenshots capture only the visible region but may miss below-the-fold content (reviews, related products).

The pragmatic default: viewport screenshot for the primary entity, scroll-and-snap for any below-the-fold field you specifically need.

async def screenshot_with_scroll(url: str, scroll_targets=None) -> list[bytes]:
    screenshots = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page(viewport={"width": 1280, "height": 1024})
        await page.goto(url, wait_until="networkidle")
        screenshots.append(await page.screenshot())

        if scroll_targets:
            for selector in scroll_targets:
                el = page.locator(selector).first  # locator() is synchronous; .first is a property, not a coroutine
                await el.scroll_into_view_if_needed()
                screenshots.append(await page.screenshot())

        await browser.close()
    return screenshots

Handling multiple entities per page

For pages with many entities (a search results page, a category listing), pass the screenshot with an array schema.

LISTING_SCHEMA = {
    "type": "object",
    "properties": {
        "items": {
            "type": "array",
            "items": PRODUCT_SCHEMA,
            "minItems": 0,
            "maxItems": 50,
        },
    },
    "required": ["items"],
    "additionalProperties": False,
}

async def extract_listing(png_bytes: bytes) -> dict:
    # use GPT-4o or Claude with the listing schema
    ...

Vision models handle arrays well. The cap on maxItems prevents runaway hallucination on confused inputs.
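
A fleshed-out version of the stub above, assuming the same GPT-4o client and prompt style as the single-product extractor:

async def extract_listing(png_bytes: bytes) -> dict:
    b64 = base64.b64encode(png_bytes).decode()
    resp = await client.chat.completions.create(
        model="gpt-4o",
        response_format={
            "type": "json_schema",
            # Strict mode supports only a subset of JSON Schema; if the API rejects
            # minItems/maxItems, drop them from LISTING_SCHEMA or set strict to False.
            "json_schema": {"name": "listing", "schema": LISTING_SCHEMA, "strict": True},
        },
        messages=[
            {"role": "system", "content": "Extract every product visible in this listing screenshot."},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}", "detail": "high"}}
            ]},
        ],
    )
    return json.loads(resp.choices[0].message.content)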

Adding proxies

Proxies live in your screenshot step, not the vision call. Configure the headless browser:

async def screenshot_with_proxy(url: str, proxy: str) -> bytes:
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy={"server": proxy},
        )
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        png = await page.screenshot(full_page=True)
        await browser.close()
        return png

For ASEAN ecommerce specifically, a Singapore mobile proxy carries clean carrier IPs that survive the strongest bot defenses. Pair it with full-page screenshots for product listings on Lazada and Shopee.

Production patterns

Three patterns separate hobby vision scraping from production.

First, downsample appropriately. Vision models have an effective resolution they actually use. For GPT-4o, anything above 2048×2048 wastes tokens. Resize before sending.

from PIL import Image
import io

def resize_for_model(png_bytes: bytes, max_dim: int = 2048) -> bytes:
    img = Image.open(io.BytesIO(png_bytes))
    if max(img.size) > max_dim:
        ratio = max_dim / max(img.size)
        new_size = (int(img.size[0] * ratio), int(img.size[1] * ratio))
        img = img.resize(new_size, Image.LANCZOS)
    out = io.BytesIO()
    img.save(out, format="PNG", optimize=True)
    return out.getvalue()

Second, cache by image hash. Identical screenshots produce identical extractions. SHA-256 the PNG, key your cache on it.
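
A minimal in-process version of that cache; swap the dict for Redis or SQLite when the run outlives the process.

import hashlib

_extraction_cache: dict[str, dict] = {}

async def cached_extract(png_bytes: bytes) -> dict:
    key = hashlib.sha256(png_bytes).hexdigest()
    if key not in _extraction_cache:
        _extraction_cache[key] = await extract_with_gpt4o(png_bytes)
    return _extraction_cache[key]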

Third, run two models in parallel for high-stakes data: GPT-4o and Claude Sonnet, keeping only the fields they agree on. This catches the rare hallucination at 2x cost.
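
A sketch of the agreement check, reusing the GPT-4o and Claude extractors defined earlier; route pages with any disagreement to manual review or a third tie-breaking model.

async def extract_with_agreement(png_bytes: bytes) -> tuple[dict, list[str]]:
    gpt_result, claude_result = await asyncio.gather(
        extract_with_gpt4o(png_bytes),
        extract_with_claude(png_bytes),
    )
    # Fields where the two models disagree are the ones worth a human (or third-model) look
    disagreements = [k for k in gpt_result if gpt_result.get(k) != claude_result.get(k)]
    return gpt_result, disagreements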

Memory and disk considerations

Full-page screenshots can be large. A 4000-pixel-tall page at 2x DPR is roughly 8 MB as PNG, 600 KB as JPEG quality 80. For high-volume pipelines:

Compress to JPEG before sending. JPEG quality 85 is visually indistinguishable from PNG for typical web pages and cuts payload size by 90 percent.

Stream screenshots through a temporary buffer rather than holding them all in memory. A 100-worker pool with full-page PNGs can OOM a 16 GB host quickly.

Cache screenshots locally for replay. The screenshot is the source of truth for an extraction run; saving it lets you re-extract with a different model later without re-fetching.

def to_jpeg(png_bytes: bytes, quality: int = 85) -> bytes:
    img = Image.open(io.BytesIO(png_bytes)).convert("RGB")
    out = io.BytesIO()
    img.save(out, format="JPEG", quality=quality, optimize=True)
    return out.getvalue()

Real benchmarks across sites

100 product pages each, full-page screenshot, GPT-4o:

Site        | Success rate | Avg cost per page
Lazada SG   | 98%          | $0.029
Shopee SG   | 96%          | $0.031
Amazon US   | 99%          | $0.025
Walmart     | 97%          | $0.027
Best Buy    | 95%          | $0.030
Tokopedia   | 94%          | $0.034

Add browser cost (Browserbase or self-hosted) at $0.002-$0.005 per page. Total per 1000 pages: $30-$40 with vision, vs $5-$10 with text-only extraction. Vision wins on accuracy and resilience; text wins on cost.

Token cost mechanics

Vision tokens are computed differently from text tokens. The mechanics matter for cost prediction.

GPT-4o computes vision tokens by splitting the image into 512×512 tiles, charging 170 tokens per tile, plus a fixed 85 tokens for the low-res view. A 1024×1024 image is 4 tiles plus the base = 765 tokens. A 1600×1024 image is 6 tiles plus base = 1105 tokens. detail: low uses only the 85 base tokens at the cost of accuracy.

Claude charges roughly one token per 750 pixels (width × height / 750) and downscales anything larger than about 1.15 megapixels, so a single image tops out around 1,600 tokens.

Gemini charges a flat 258 tokens per image regardless of size, which makes it dramatically cheaper for large screenshots.

The implication: oversized full-page screenshots sent through GPT-4o get downscaled to fit the tile grid anyway, so you pay for resolution the model cannot use and lose the fine text in the process. Resize to roughly 1280×800 yourself and a page costs under 1,500 vision tokens with minimal accuracy loss.
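
A small estimator built from the tile math above, following OpenAI's published rescaling rules (fit within 2048×2048, then scale the short side to 768); treat it as a budgeting aid, not billing ground truth.

import math

def gpt4o_vision_tokens(width: int, height: int, detail: str = "high") -> int:
    if detail == "low":
        return 85
    # Fit within a 2048x2048 square, then scale the shorter side down to 768
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

# gpt4o_vision_tokens(1024, 1024) -> 765; gpt4o_vision_tokens(1600, 1024) -> 1105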

Region cropping for high-stakes fields

For mission-critical fields (transaction prices, contract terms, regulatory disclosures), crop the image to the field region and send only the crop. This pushes accuracy from roughly 96 percent on full pages to over 99 percent on focused crops.

def crop_to_region(png_bytes: bytes, x: int, y: int, w: int, h: int) -> bytes:
    img = Image.open(io.BytesIO(png_bytes))
    cropped = img.crop((x, y, x + w, y + h))
    out = io.BytesIO()
    cropped.save(out, format="PNG")
    return out.getvalue()

# Use selectors or LLM observation to find the region first, then crop and re-extract

The two-step approach (full page first, then crop and re-extract critical fields) is the right pattern when accuracy matters more than cost.

Multimodal pipelines: combining vision and text

The strongest extraction pipelines combine HTML and screenshot in the same LLM call. The model uses the HTML as ground truth for structured fields and the screenshot for visual context.

async def multimodal_extract(html: str, png: bytes, schema: dict) -> dict:
    b64 = base64.b64encode(png).decode()
    resp = await client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_schema", "json_schema": {"name": "x", "schema": schema, "strict": True}},
        messages=[
            {"role": "system", "content": "Use the HTML for structured data and the screenshot for layout context."},
            {"role": "user", "content": [
                {"type": "text", "text": f"HTML:\n{html[:100000]}"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}", "detail": "high"}},
            ]},
        ],
    )
    return json.loads(resp.choices[0].message.content)

This combination outperforms either alone on roughly 60 percent of pages we tested. Cost is higher than text-only by the vision token premium.

Vision-based crawling for shape discovery

A clever pattern uses vision to discover the shape of an unknown site. Take a few screenshots, ask the model to describe the layout in structured form (“this site has a header, a search bar, a product grid with 3 columns”), then use that description to build a Playwright scraper.

This bootstraps a deterministic scraper from a few vision calls, paying once for discovery instead of every scrape.
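
A minimal version of the discovery call; the layout schema and prompt here are illustrative, and the output feeds whoever (or whatever) writes the Playwright selectors.

LAYOUT_SCHEMA = {
    "type": "object",
    "properties": {
        "page_type": {"type": "string"},
        "regions": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "description": {"type": "string"},
                    "contains_fields": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["name", "description", "contains_fields"],
                "additionalProperties": False,
            },
        },
    },
    "required": ["page_type", "regions"],
    "additionalProperties": False,
}

async def describe_layout(png_bytes: bytes) -> dict:
    b64 = base64.b64encode(png_bytes).decode()
    resp = await client.chat.completions.create(
        model="gpt-4o",
        response_format={
            "type": "json_schema",
            "json_schema": {"name": "layout", "schema": LAYOUT_SCHEMA, "strict": True},
        },
        messages=[
            {"role": "system", "content": "Describe this page's layout: the page type and its major regions, with the data fields each region contains."},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}", "detail": "high"}}
            ]},
        ],
    )
    return json.loads(resp.choices[0].message.content)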

Common production gotchas

A few patterns that bite teams using vision models.

The model occasionally hallucinates fields that look plausible but are not on the page. Always validate extracted data against the source HTML or a deterministic check.

Different vendors handle base64 differently. OpenAI accepts a data: URL. Anthropic accepts the raw base64 with media_type. Gemini accepts the bytes directly with mime_type. Wrappers help but the bare APIs differ.

Image preprocessing libraries (PIL, OpenCV) introduce subtle artifacts that can change OCR output. Save the raw screenshot and the preprocessed version both, and prefer the raw if the preprocessing is not strictly necessary.

Vision token cost varies by model in non-obvious ways. Always benchmark on your actual screenshots; do not extrapolate from documented pricing alone.

Frequently asked questions

How do I evaluate which vision model is best for my specific target?
Hand-label 50 representative pages, run each model with the same schema, score against the gold set. Cost is roughly $5 per evaluation run; the data drives a multi-month decision.

Can vision models read tiny text like product SKUs?
Up to a point. GPT-4o and Claude Sonnet handle text down to about 8px reliably at high detail. For smaller text, crop to the relevant region before sending.

What about charts and tables?
All three models handle structured tables in screenshots well. Charts are mixed; line and bar charts work, complex multi-series charts often fail. Pass the underlying data if you can.

How do I handle international character sets?
Vision OCR for Chinese, Japanese, Korean, Thai, Arabic is strong on Gemini Pro and Claude Sonnet. GPT-4o is good but slightly behind on uncommon scripts. Test on your specific target.

Can I use vision to fill out forms?
Indirectly. Vision models can identify form fields and instruct your scraper. For actual form filling, browser automation (Playwright, browser-use) is the right tool.

What about cost-effective open-source vision models?
Llama 3.2 90B Vision and Qwen 2.5 VL 72B are the strongest open-source vision models in early 2026. Self-hosted on a 4xA100 machine, cost per image is around $0.001 if you have throughput to amortize. Below the major closed-source models on quality, especially on small text.

Can vision models extract from videos?
Indirectly. Sample frames at 1 fps, send each frame to the vision model, aggregate the extractions. For long videos, sampling every 5 seconds and aggregating works well.
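
A sketch of the frame-sampling step using OpenCV (the opencv-python package is assumed); each returned frame goes through the same extraction call as a screenshot.

import cv2

def sample_frames(video_path: str, every_seconds: float = 1.0) -> list[bytes]:
    frames = []
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    step = max(1, int(fps * every_seconds))
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            ok, buf = cv2.imencode(".png", frame)
            if ok:
                frames.append(buf.tobytes())
        index += 1
    cap.release()
    return frames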

How do vision models handle CAPTCHAs?
They will solve simple image CAPTCHAs (find traffic lights, identify text in a distorted image) reasonably well, but the major LLM providers refuse the obvious “solve this CAPTCHA” prompts. Phrasing matters. Solver services remain more reliable for production CAPTCHA workflows.

Can I extract from rendered PDFs as images?
Yes. Convert the PDF to images with pdf2image or similar, then run vision extraction on each page. For text-heavy PDFs, the modern LLM APIs accept PDFs directly, which is faster and cheaper.
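
A sketch with pdf2image (which requires poppler installed); each rendered page goes through the same vision extractor as a screenshot.

from pdf2image import convert_from_path
import io

async def extract_from_pdf(path: str) -> list[dict]:
    results = []
    for page_image in convert_from_path(path, dpi=150):
        buf = io.BytesIO()
        page_image.save(buf, format="PNG")
        results.append(await extract_with_gpt4o(buf.getvalue()))
    return results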

Is there a future where vision replaces selector-based scraping entirely?
For sites that change layout faster than engineers can update selectors, vision is already winning. For high-volume known-shape sites, the cost gap keeps selector-based extraction relevant. The real future is hybrid: vision for discovery and resilience, selectors for the bulk.

Can I run vision extraction on edge devices?
The smaller open-source vision models (Qwen 2 VL 2B, Llava-OneVision 7B) run on consumer GPUs. Quality is well below the major models but adequate for known-shape extraction.

For more on the broader AI scraping landscape, browse the AI modern scraping category.
