Caching LLM Responses for Scrapers: Hit-Rate Patterns That Save 70% (2026)

LLM-powered scrapers are expensive by default. If you’re calling GPT-4o or Claude Sonnet on every page you fetch, you’re paying full token cost even when 80% of those pages are structurally identical. Caching LLM responses is the single highest-leverage cost cut available in 2026, and teams that instrument it properly report a 60-70% reduction in inference spend within the first month.

Why scraping workloads are unusually cacheable

Most LLM use cases have low cache-hit potential because every prompt is unique. Scraping is the opposite: you’re repeatedly running the same extraction logic against pages that share a template (product listings, SERP results, job boards, company profiles). The HTML structure changes slowly, and the extraction prompt almost never changes. This makes scraper workloads closer to a database read pattern than a conversational one.

The key insight is separating what varies (the raw HTML) from what doesn’t (the system prompt, the schema definition, the extraction instructions). If you’re using Pydantic AI for structured extraction, your schema and field-level instructions are fixed per scraper: exactly the kind of prefix that caches well.
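
As a minimal sketch of that split, assuming a hypothetical Product schema (plain Pydantic here; Pydantic AI wires the same pieces together for you): everything below except the html argument is a stable prefix that can be hashed once per scraper.

from pydantic import BaseModel, Field

# fixed per scraper: the schema and field-level instructions never change
class Product(BaseModel):
    name: str = Field(description="product title as displayed on the page")
    price: float = Field(description="current price, numeric only")
    in_stock: bool

SYSTEM_PROMPT = (
    "Extract a Product from the HTML below. "
    "Return only the fields defined in the schema."
)

def build_messages(html: str) -> list[dict]:
    # only the user message varies from page to page
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": html},
    ]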

Hit-rate patterns by page type

Not all pages cache equally. Here’s what we observe across production scraping pipelines:

page type                     | typical cache hit rate | cache TTL sweet spot
product listings (e-commerce) | 65-80%                 | 6-24h
SERP result pages             | 40-55%                 | 1-4h
company profile pages         | 70-85%                 | 48-72h
news/article pages            | 10-25%                 | no cache
job postings                  | 60-75%                 | 12-24h

News and dynamic editorial content rarely repeat, so caching adds complexity without payback. Everything else, especially structured directories and marketplaces, hits hard.

The variables that drive hit rate:

  • Normalization before hashing. Strip session tokens, ad parameters, timestamps, and CSRF tokens from the HTML before computing the cache key; a single UTM parameter difference will cause a miss on an otherwise identical page (see the sketch after this list).
  • Domain-level TTL tuning. Sites that update product prices hourly (Amazon, Lazada) need short TTLs; B2B directories update quarterly. Don’t apply a uniform TTL.
  • Schema version in the cache key. When your Pydantic model changes, stale cache entries return the wrong structure. Append a schema hash or version integer to the key.
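
A minimal normalization sketch; the parameter list and regexes below are illustrative assumptions, not a complete set, and should be tuned per target site:

import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# hypothetical tracking-parameter denylist; extend per target site
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid", "sessionid"}

def normalize_url(url: str) -> str:
    # drop tracking query params and the fragment so logically identical
    # URLs hash to the same key
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k.lower() not in TRACKING_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

def normalize_html(html: str) -> str:
    # strip CSRF token inputs and ISO timestamps so they don't poison the hash
    html = re.sub(r'name="csrf[^"]*"\s+value="[^"]*"', "", html)
    html = re.sub(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}", "", html)
    return html

Run both normalizers before computing the cache key, so two fetches of the same logical page always produce the same hash.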

Implementation: semantic vs. exact-match caching

There are two approaches, and they serve different goals.

Exact-match caching hashes the full prompt (normalized HTML + system prompt + model ID) and stores the response. This is cheap, fast, and reliable: use Redis or DynamoDB with a short TTL, and implementation takes an afternoon.

Semantic caching embeds the prompt, stores the vector, and returns a cached response when cosine similarity exceeds a threshold (typically 0.95+). This catches near-duplicate pages that aren’t byte-identical. GPTCache and LangChain’s semantic cache integrations support this pattern, but the embedding call itself costs tokens, so the math only works when your hit rate is above 50% and the saved LLM call is significantly more expensive than the embedding.
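
A minimal in-process sketch of the idea, assuming a hypothetical embed() that returns unit-normalized vectors; a production setup would use a vector index rather than a linear scan:

import numpy as np

_vectors: list[np.ndarray] = []
_responses: list[dict] = []
SIMILARITY_THRESHOLD = 0.95

def semantic_lookup(prompt: str) -> dict | None:
    # embed() is assumed: any embedding model returning unit vectors works
    v = embed(prompt)
    for vec, resp in zip(_vectors, _responses):
        # dot product of unit vectors == cosine similarity
        if float(np.dot(v, vec)) >= SIMILARITY_THRESHOLD:
            return resp
    return None

def semantic_store(prompt: str, response: dict) -> None:
    _vectors.append(embed(prompt))
    _responses.append(response)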

For most scraping pipelines, exact-match caching with good normalization gets you 70% of the way there. Semantic caching adds complexity and an embedding latency penalty that matters if you’re running real-time extraction. Here’s a minimal exact-match implementation:

import hashlib
import json

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def cache_key(system_prompt: str, html: str, model: str, schema_version: int) -> str:
    # crude but effective: drop everything after the first CSRF token;
    # see the normalization sketch above for a more thorough approach
    normalized = html.split("csrfToken")[0]
    payload = f"{model}:{schema_version}:{system_prompt}:{normalized}"
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_extract(system_prompt, html, model, schema_version, ttl=3600):
    key = cache_key(system_prompt, html, model, schema_version)
    cached = r.get(key)
    if cached:
        return json.loads(cached)  # cache hit: the LLM is never called
    result = call_llm(system_prompt, html, model)  # your LLM wrapper, defined elsewhere
    r.setex(key, ttl, json.dumps(result))  # expire after the domain-appropriate TTL
    return result

If you’re routing between cheap and expensive models based on page complexity, the cache layer should sit upstream of the router: a cached response skips the router entirely. See Building an LLM Model Router for Scraping for the routing logic that feeds into this.
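
A sketch of that layering, reusing cache_key and the Redis client from above; route_model is a hypothetical stand-in for the router in the linked guide. Note the key uses a fixed pipeline tag instead of a model ID, because on a miss the route hasn’t been chosen yet:

def routed_extract(system_prompt: str, html: str, schema_version: int, ttl: int = 3600) -> dict:
    key = cache_key(system_prompt, html, "router-v1", schema_version)
    cached = r.get(key)
    if cached:
        return json.loads(cached)  # hit: neither the router nor the LLM runs
    model = route_model(html)  # hypothetical: picks a cheap or expensive model
    result = call_llm(system_prompt, html, model)
    r.setex(key, ttl, json.dumps(result))
    return result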

Provider-level caching: Anthropic and OpenAI

Both Anthropic and OpenAI now offer native prompt caching that reduces cost on repeated system-prompt prefixes, which complements (not replaces) your application-level cache.

Anthropic’s prefix caching charges 10% of the base input-token cost on cache hits for Claude Sonnet and Opus. If your system prompt plus schema runs 2,000 tokens and you’re hitting it 10,000 times a day, the savings are material. Real-world numbers for Anthropic prompt caching in scraping workflows show a 40-60% input-cost reduction on structured extraction pipelines specifically.
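
A minimal sketch using the Anthropic Python SDK’s cache_control marker on the system block; the model ID is an example, and SYSTEM_PROMPT is the fixed prefix from earlier:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def extract_with_prefix_cache(html: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # example model ID; substitute your own
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,  # stable prefix: billed at the cache-hit rate on repeats
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": html}],
    )
    return response.content[0].text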

OpenAI’s approach is automatic prefix caching on prompts over 1,024 tokens, billed at 50% of base input cost. No configuration is required, but you lose control over what gets cached. For batch workloads where latency doesn’t matter, the OpenAI Batch API cuts costs further still (50% off list price) and pairs well with an application cache that deduplicates before the batch even runs.
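
A sketch of that dedup-then-batch pattern, reusing cache_key, r, and SYSTEM_PROMPT from above; the model name and schema version here are illustrative:

import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def submit_batch(pages: list[tuple[str, str]]) -> None:
    # pages: (page_id, html); only cache misses make it into the batch file
    with open("requests.jsonl", "w") as f:
        for page_id, html in pages:
            key = cache_key(SYSTEM_PROMPT, html, "gpt-4o-mini", 1)
            if r.get(key):
                continue  # already cached: skip even the discounted batch price
            f.write(json.dumps({
                "custom_id": key,
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-4o-mini",
                    "messages": [
                        {"role": "system", "content": SYSTEM_PROMPT},
                        {"role": "user", "content": html},
                    ],
                },
            }) + "\n")
    with open("requests.jsonl", "rb") as fh:
        batch_file = client.files.create(file=fh, purpose="batch")
    client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
    )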

The efficient stack in 2026: application-level exact-match cache first, then provider-level prefix caching on the system prompt, then model-tier routing. Each layer stacks on the previous one.

Measuring and tuning cache performance

You can’t optimize what you don’t measure. Instrument these four metrics from day one (a minimal per-domain counter sketch follows the list):

  1. Hit rate by domain: identifies which scrapers cache well and which need TTL adjustment.
  2. Cost per extracted record: compare cached vs. uncached to quantify savings; benchmark against per-token cost data across nine models.
  3. Stale hit rate: how often cached results differ from a fresh extraction (sample 1-2% of hits to catch TTL drift).
  4. Normalization miss rate: how often two logically identical pages miss the cache due to normalization gaps (log cache keys and inspect collisions).

A weekly review of these four numbers will surface the most expensive gaps. The most common fix is always the same: tighten normalization. Teams that add proper HTML normalization routinely jump from a 45% to a 70%+ hit rate without changing anything else.

Bottom line

Caching LLM responses is not optional at scale; it’s the difference between a scraping pipeline that costs $0.003 per record and one that costs $0.012. Start with exact-match Redis caching, normalize your HTML aggressively before hashing, version your schema in the cache key, and layer provider-level prefix caching on top. DRT covers the full cost stack for LLM-powered extraction, from caching patterns through to model selection and batch scheduling.
