Anthropic Prompt Caching for LLM Scraping Pipelines (2026)

If you’re running LLM calls against scraped HTML at any real volume, Anthropic prompt caching is probably the single fastest way to cut a large share of your input-token bill without touching your extraction logic.

Prompt caching lets Claude reuse a previously computed KV cache for the “static” portion of your prompt: your system prompt, schema definitions, few-shot examples. you pay a small premium to write the cache once (1.25x the input token rate for the default 5-minute cache), then a fraction (10% of the input token rate) on every cache hit. for scraping pipelines where the schema and instructions don’t change between pages, the hit rate is almost always high enough to matter.

Below are the real mechanics: what’s cacheable and what isn’t, what savings you can actually expect, and how to wire it into a production scraper.

How the cache works (and what breaks it)

Anthropic’s cache operates on a prefix model. you mark a block with cache_control: {"type": "ephemeral"}, and the API caches everything up to that breakpoint for 5 minutes, with the TTL reset every time the cached prefix is read. if your next request matches that exact prefix, you get a cache hit.

This means token order matters absolutely. if you’re constructing prompts dynamically and shuffle the field order or inject any variable content before the cache breakpoint, you’ll get a miss. the single most common mistake is putting a request-level variable (like the page URL or a timestamp) early in the prompt before the static schema block.
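One way to guard against this is to build every request through a single helper that keeps per-request values in the user message, then fingerprint the cached portion to confirm it never varies. This is a minimal sketch: `build_request`, `prefix_fingerprint`, and the prompt text are illustrative names rather than part of the Anthropic SDK, and the hash is only a local proxy for the server-side prefix match.

```python
import hashlib
import json

# Hypothetical static instructions; in practice this is your full schema block.
STATIC_PROMPT = "You extract structured product data from HTML."


def build_request(html: str, url: str) -> dict:
    """Build kwargs for client.messages.create with variables after the breakpoint."""
    return {
        "system": [
            {
                "type": "text",
                "text": STATIC_PROMPT,  # never interpolate per-request values here
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [
            # URL and HTML change per page, so they live in the user turn.
            {"role": "user", "content": f"Source URL: {url}\n\n{html}"}
        ],
    }


def prefix_fingerprint(request: dict) -> str:
    """Hash the cached portion so a unit test can assert it never varies."""
    blob = json.dumps(request["system"], sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()
```

Comparing `prefix_fingerprint` across two requests for different pages in a unit test is a cheap way to catch a stray timestamp or URL that has crept into the system block.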

A valid caching setup for a scraper looks like this:

import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = """You extract structured product data from HTML.
Return JSON matching this schema exactly:
{"title": str, "price": float, "sku": str, "availability": str}
Rules:
- price must be a float, strip currency symbols
- availability: one of "in_stock", "out_of_stock", "unknown"
- sku: null if not found
"""

def extract(html: str) -> dict:
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"}  # cache this prefix
            }
        ],
        messages=[
            # per-page content goes after the cached prefix
            {"role": "user", "content": f"Extract from this HTML:\n\n{html}"}
        ]
    )
    # assumes the model returns bare JSON; handle fenced or malformed output in production
    return json.loads(response.content[0].text)

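The schema and rules sit before the breakpoint; the HTML does not. once the static prefix is long enough to be cacheable (the prompt above is abbreviated and would fall below the minimum on its own), every call after the first one within the 5-minute TTL hits the cache on the system prefix, paying only 10% of the input cost for those tokens.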

Real-world savings numbers

Cache hit savings depend on two things: how large your static prefix is relative to the total input, and what your hit rate actually is.

Here’s what the math looks like across a few realistic configurations:

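| Setup | System tokens | HTML tokens | Cache hit savings per call |
| --- | --- | --- | --- |
| Simple schema, short pages | 400 | 600 | ~36% input cost |
| Detailed schema + few-shots | 1,200 | 800 | ~54% input cost |
| Complex schema + long HTML | 1,200 | 4,000 | ~21% input cost |
| Batch job, same schema all day | 800 | 1,500 | ~31% input cost |

Each figure is 90% of the static prefix’s share of input tokens, which is what a cache hit saves. the 400- and 800-token prefixes would need padding past the minimum cacheable length before they cache at all, which shifts those two rows.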

For a pipeline scraping 50,000 pages/day with a 1,200-token system prompt and moderate HTML, that’s 60M static-prefix tokens a day, and at a near-100% hit rate the 90% discount removes the cost of roughly 54M of them. multiply by your model’s per-million input rate to get dollars: at $1 per million input tokens, for example, that is about $54 per day. across a month of continuous scraping it adds up fast. and if you’re on claude-sonnet-4-6 for richer extraction, the raw savings scale up because the token rates are higher.
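The arithmetic behind these estimates is simple enough to script. The sketch below assumes the published multipliers for the default 5-minute cache (reads at 0.10x and writes at 1.25x the base input rate); `cache_savings_fraction` is an illustrative helper, and the hit rate is whatever you measure in your own pipeline.

```python
def cache_savings_fraction(
    static_tokens: int,
    variable_tokens: int,
    hit_rate: float = 1.0,
    read_multiplier: float = 0.10,   # assumed cache-read price vs base input
    write_multiplier: float = 1.25,  # assumed cache-write price vs base input
) -> float:
    """Fraction of input cost saved versus sending every token at the base rate."""
    total = static_tokens + variable_tokens
    # Expected price multiplier for the static prefix: hits are read from
    # the cache, misses pay the write premium.
    static_multiplier = hit_rate * read_multiplier + (1 - hit_rate) * write_multiplier
    cached_cost = static_tokens * static_multiplier + variable_tokens
    return 1 - cached_cost / total


if __name__ == "__main__":
    for static, variable in [(1200, 800), (1200, 4000)]:
        saved = cache_savings_fraction(static, variable)
        print(f"{static} static / {variable} variable: {saved:.0%} saved")
```

Note that the result goes negative when the hit rate is low enough, because cache writes cost more than plain input tokens. That is the case to watch for in bursty pipelines.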

This is separate from response-level caching (caching the full output for identical inputs), which is covered in detail in the Caching LLM Responses for Scrapers: Hit-Rate Patterns That Save 70% (2026) post. prompt caching operates at the API layer; response caching operates at your application layer. both are worth doing.

Where prompt caching fits in the cost stack

Prompt caching is not a replacement for model routing. if you’re choosing between haiku and sonnet on a per-page basis, you should be doing that first — it has a bigger per-call impact. Building an LLM Model Router for Scraping: Cheap vs Smart Trade-Offs (2026) has a good breakdown of when to route to a smaller model vs. a capable one.

Prompt caching layers on top of whatever model you’re already using. it’s not a routing decision — it’s a construction decision. you’re just making sure the static part of your prompt gets reused efficiently.

Stack the levers in order of impact:

  1. Model selection first. haiku vs. sonnet is a 5x cost difference before any caching.
  2. Prompt caching second. cuts input cost for the static prefix by 90%.
  3. Batching third — relevant if latency isn’t a constraint. OpenAI Batch API vs Real-Time for Scraping: Cost + Latency 2026 covers how Anthropic’s async approach compares if you’re doing cross-provider cost analysis.
  4. Response caching fourth. eliminates cost entirely for repeated identical pages (re-crawls, mostly).

If you want the full per-token picture across providers, the Scraping Cost Per Token 2026: Comparing 9 LLMs for Web Extraction breakdown is worth a read before you commit to a model choice.

What doesn’t cache well

There are some patterns that cause consistent cache misses in scraping pipelines — worth being explicit:

  • Variable injection before the breakpoint. anything that changes per-request (URL, timestamp, user-agent, session ID) should always come after the cached prefix, in the user message.
  • Schema templating with f-strings. if you’re rendering field names or types dynamically per-domain, you’re generating different system prompts every time. consider a fixed superset schema with optional fields instead.
  • Multi-turn extraction sessions. the cache breakpoint has to be at a stable conversation prefix. adding prior turns before your schema block invalidates it.
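  • Too-short static prefixes. the minimum cacheable block depends on the model: 1,024 tokens on most Sonnet and Opus models, and 2,048 or more on Haiku models. if your cached prefix is shorter than that, the request still succeeds but nothing is cached. consider expanding your few-shot examples until you clear the threshold for the model you’re calling.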
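For the schema-templating case, one possible shape is a single superset schema rendered the same way for every domain, with per-domain field selection applied after extraction. Everything here (`SUPERSET_SCHEMA`, `render_system_prompt`, `project`, and the field names) is hypothetical and only illustrates the idea.

```python
import json

# Hypothetical superset of fields across every domain you scrape.
SUPERSET_SCHEMA = {
    "title": "str",
    "price": "float or null",
    "sku": "str or null",
    "availability": "in_stock | out_of_stock | unknown",
    "author": "str or null",
    "published_at": "ISO 8601 date or null",
}


def render_system_prompt() -> str:
    """Render one prompt for all domains; sort_keys keeps the text stable."""
    schema = json.dumps(SUPERSET_SCHEMA, sort_keys=True, indent=2)
    # The f-string is safe here because nothing per-request is interpolated.
    return (
        "You extract structured data from HTML.\n"
        "Return JSON matching this schema; use null for fields that do not apply:\n"
        f"{schema}\n"
    )


def project(record: dict, wanted: list[str]) -> dict:
    """Keep only the fields a given domain cares about, after extraction."""
    return {key: record.get(key) for key in wanted}
```

The trade-off is a few extra output tokens for null fields in exchange for one shared cache entry across all domains.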

The 5-minute TTL is also worth understanding. if your scraper is batching jobs with gaps longer than 5 minutes between bursts, you’ll pay the cache-write price again on the first call of each burst. keeping a steady drip of calls warm is better than occasional large batches with idle gaps. Anthropic also offers a 1-hour TTL at a higher cache-write price, which can suit bursty schedules.
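To see how a schedule interacts with the TTL, you can replay call timestamps through a simplified model. The sketch assumes one shared prefix, sequential calls, and a TTL that restarts on every call; `count_cache_writes` is an illustrative helper, and real behavior under concurrency may differ.

```python
def count_cache_writes(
    call_times: list[float], ttl_seconds: float = 300.0
) -> tuple[int, int]:
    """Return (writes, reads) for sequential calls sharing one cached prefix.

    Simplified model: a call within ttl_seconds of the previous call is a
    hit; anything later is a miss that rewrites the cache.
    """
    writes = reads = 0
    last_call = None
    for t in sorted(call_times):
        if last_call is not None and t - last_call <= ttl_seconds:
            reads += 1
        else:
            writes += 1
        last_call = t  # a write and a hit both restart the TTL clock
    return writes, reads


bursty = [0, 1, 2, 600, 601, 602]        # two bursts, 10 minutes apart
steady = [0, 120, 240, 360, 480, 600]    # same call count, evenly spaced
print(count_cache_writes(bursty), count_cache_writes(steady))
```

With the same six calls, the bursty schedule pays for two cache writes and the steady one pays for one.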

Integrating with a downstream analytics pipeline

The savings compound when you’re pushing directly into storage. if you’re scraping to something like ClickHouse for real-time analytics (the architecture described in Scraping to ClickHouse: Real-Time Analytics Pipeline for Web Data (2026)), scraper throughput directly determines your ingestion rate. cutting LLM latency by hitting the cache means you can push more rows per second without adding workers.

This also changes your cost model. at scale, the cost-per-row metric starts to matter more than cost-per-call. with prompt caching, a pipeline scraping 1M pages/day on haiku can get input costs down to roughly $0.25-0.40 per 1,000 pages depending on HTML size. That’s a meaningful number to track in your unit economics.

The quick checklist before calling your pipeline “cache-optimized”:

  • static schema and instructions are before the cache_control breakpoint
  • no variable content in the system prompt block
  • cached prefix meets the minimum cacheable length for your model (1,024 tokens on most Sonnet and Opus models, more on Haiku; pad with few-shots if needed)
  • call frequency is high enough to stay inside the 5-min TTL
  • you’re logging cache_creation_input_tokens vs cache_read_input_tokens in your metrics to verify hit rate
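For the last checklist item, a small helper can turn the usage block of each response into a hit ratio you can ship to your metrics system. This sketch assumes an object shaped like `response.usage` from the Messages API; `cache_hit_ratio` is an illustrative name, and the `SimpleNamespace` stands in for a real response so the example runs offline.

```python
from types import SimpleNamespace


def cache_hit_ratio(usage) -> float:
    """Share of input tokens served from cache for one response.

    Expects input_tokens (uncached), cache_creation_input_tokens and
    cache_read_input_tokens; missing or None values count as zero.
    """
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    created = getattr(usage, "cache_creation_input_tokens", 0) or 0
    uncached = getattr(usage, "input_tokens", 0) or 0
    total = read + created + uncached
    return read / total if total else 0.0


# Stand-in for response.usage so the sketch runs without an API call.
example = SimpleNamespace(
    input_tokens=800, cache_creation_input_tokens=0, cache_read_input_tokens=1200
)
print(f"cache hit ratio: {cache_hit_ratio(example):.2f}")
```

A ratio that stays at zero across a run usually means the prefix is changing between calls or is below the minimum cacheable length.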

Bottom line

Prompt caching is a low-effort, high-return optimization for any scraper that’s running the same schema against variable content — which is most of them. get your model routing right first, then layer caching on top, and you’ll generally land at 40-60% off your input costs with maybe an afternoon of refactoring. DRT covers these cost levers in depth across the LLM scraping series; the numbers above are representative of real production pipelines, not synthetic benchmarks.
