Web Scraping Cost per 1,000 Pages: A 2026 Breakdown

Most scraping cost estimates are wrong before you even run a single request. Web scraping cost per 1,000 pages isn’t one number; it’s four numbers combined: proxy bandwidth, compute time, storage, and retry overhead. Teams that only price the proxy are usually off by 2x to 5x. This benchmark runs twelve stacks against three target types and folds in the retry tax that most writeups quietly ignore.

The 12 stacks: what they actually cost

All numbers below assume 1,000 successful fetches, not attempts. Proxy costs use $5/GB residential unless noted. Compute is AWS Lambda x86 at 1.5GB RAM. Targets are bucketed as static HTML (no JS rendering required), JS-rendered SPAs, and bot-protected pages with active Cloudflare or Akamai challenges.

Stack                              | Static HTML ($/1K) | JS-rendered ($/1K) | Bot-protected ($/1K)
httpx + datacenter proxy           | $0.09              | n/a                | $0.44
requests + datacenter proxy        | $0.11              | n/a                | $0.48
httpx + residential proxy ($1/GB)  | $0.27              | n/a                | $1.05
Playwright + datacenter proxy      | $0.38              | $0.45              | $1.90
Scrapy + rotating residential      | $0.58              | n/a                | $2.50
Playwright + residential ($5/GB)   | $0.65              | $0.79              | $3.10
Puppeteer + residential ($5/GB)    | $0.71              | $0.85              | $3.30
Browserless + residential          | $0.94              | $1.15              | $4.00
ScraperAPI (standard)              | $1.40              | n/a                | $3.40
Apify managed platform             | $1.20              | $1.55              | $5.00
ScrapingBee (JS render)            | n/a                | $2.10              | $5.80
Bright Data SERP API               | n/a                | n/a                | $9.90

The spread between cheapest and most expensive on bot-protected pages is roughly 22x. That’s not a pricing anomaly; it’s the compounding effect of proxy tier, compute overhead, and block rate all hitting at once. If you haven’t pinned down which proxy tier your target actually needs, the 2026 decision tree for $1, $5, and $15/GB residential proxies is the right starting point before you commit to a stack.

What’s inside the cost

The bill breaks down into four components, and their relative weight shifts completely depending on target type.

  • Proxy bandwidth: dominant on bot-protected targets. At 800KB average page size and $5/GB, that’s $4.00 per 1,000 pages in bandwidth alone, before retries inflate it further.
  • Compute: Lambda Playwright invocations run about $0.0015-0.002 each. At 1,000 pages that’s $1.50-2.00. Cheap, but it’s also the number that scales linearly with your retry rate.
  • Storage: ignored until month two. S3 Standard at $0.023/GB-month adds up fast when you’re storing raw HTML. Switching to Cloudflare R2 for scraped data cuts egress fees to zero, which matters a lot if pipelines re-read the same data repeatedly.
  • Retry overhead: this one gets its own section below because it’s where most budgets quietly break.

On static HTML jobs with no bot protection, proxy and storage are both minimal. Compute dominates and it’s cheap. Flip to bot-protected targets and the ratio inverts: proxy spend can hit 75-85% of the total bill. That ratio is what makes stack selection non-trivial.
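To make the four components concrete, here is a minimal sketch of a per-1,000-pages cost model. The function name, parameter names, and the simplification that every failed attempt costs a full page of bandwidth and a full invocation are assumptions for illustration; it is not a reconstruction of how the benchmark table was produced.

```python
def cost_per_1k(page_kb, proxy_usd_per_gb, compute_usd_per_fetch,
                storage_usd_per_gb_month, attempts_per_success=1.0):
    """Rough USD per 1,000 successful fetches, with one month of storage.

    Assumption: each failed attempt costs as much bandwidth and compute
    as a successful one. Real block pages are often smaller.
    """
    if attempts_per_success < 1:
        raise ValueError("attempts_per_success must be >= 1")
    gb_per_1k = page_kb * 1000 / 1_000_000  # decimal GB per 1,000 pages
    proxy = gb_per_1k * proxy_usd_per_gb * attempts_per_success
    compute = 1000 * compute_usd_per_fetch * attempts_per_success
    storage = gb_per_1k * storage_usd_per_gb_month  # only successes are stored
    return {
        "proxy": proxy,
        "compute": compute,
        "storage": storage,
        "total": proxy + compute + storage,
    }


# Figures quoted in the list above: 800KB pages, $5/GB, $0.0015 per invocation.
breakdown = cost_per_1k(800, 5.00, 0.0015, 0.023)
for name, usd in breakdown.items():
    print(f"{name:8s} ${usd:.2f}")
```

With no retries, the proxy line comes out at the $4.00 quoted above; raising `attempts_per_success` scales proxy and compute while leaving storage flat.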

The retry tax

A 0% block rate doesn’t exist. Here’s how failure rates scale with target difficulty, and what they actually do to cost:

  1. No bot protection, static HTML: 2-5% block rate. Cost multiplier ~1.03x. Not worth optimizing.
  2. Rate-limited or login-walled pages: 10-20% block rate. Multiplier 1.11-1.25x. Budget for it.
  3. Cloudflare JS challenge with datacenter proxies: 25-50% block rate. Multiplier up to 2x. This is where proxy tier starts mattering enormously.
  4. Akamai or Imperva with cheap residential ($1/GB): 35-65% block rate. Multiplier up to 2.86x. You need ISP or sticky residential at $5-15/GB here.
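The multipliers in the list are consistent with a simple model: if each attempt is blocked independently with probability b and you retry until one succeeds, the expected number of attempts per success is 1 / (1 - b). The sketch below assumes that model; the function name is illustrative.

```python
def retry_multiplier(block_rate):
    """Expected attempts per successful fetch when retrying until success.

    Assumes every attempt is blocked independently with the same probability.
    """
    if not 0 <= block_rate < 1:
        raise ValueError("block_rate must be in [0, 1)")
    return 1 / (1 - block_rate)


for rate in (0.03, 0.10, 0.20, 0.50, 0.65):
    print(f"{rate:.0%} blocked -> {retry_multiplier(rate):.2f}x")
```

The curve is what matters: going from 50% to 65% blocked adds more cost than going from 0% to 40%.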

A job that looks like $1.00/1K with a clean success rate can easily hit $2.50/1K once you account for retries. The fix isn’t always to buy more expensive proxies: a datacenter and residential hybrid architecture can cut that multiplier by 40-70% on mixed workloads by escalating proxy tier only when the cheaper option gets blocked.

Here’s the escalation pattern in Python:

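from playwright.async_api import async_playwright, Error as PlaywrightError

# Cheapest tier first; cost_per_gb is bookkeeping only and is not sent to Playwright.
PROXY_TIERS = [
    {"server": "http://dc-proxy:8080", "cost_per_gb": 0.30},
    {"server": "http://resi-proxy:8080", "cost_per_gb": 5.00},
    {"server": "http://isp-proxy:8080", "cost_per_gb": 15.00},
]

async def fetch_with_escalation(url, max_tiers=3):
    # Start Playwright once and reuse it across tiers.
    async with async_playwright() as p:
        for tier in PROXY_TIERS[:max_tiers]:
            # Pass only the proxy keys Playwright expects.
            browser = await p.chromium.launch(proxy={"server": tier["server"]})
            try:
                page = await browser.new_page()
                resp = await page.goto(url)
                if resp and resp.status == 200:
                    return await page.content()
            except PlaywrightError:
                pass  # timeout or network error: treat as a block and escalate
            finally:
                await browser.close()
    return None  # every tier failed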

Not production code, but the pattern holds: don’t pay residential rates for pages that don’t need them. Most targets are actually scrapable with datacenter proxies within the first 1-2 attempts. Only escalate on confirmed failure.
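To see why escalation pays off, it helps to price it. The sketch below estimates expected bandwidth cost per 1,000 successes when each page tries the cheapest tier first and moves up only on failure. The per-tier success rates are hypothetical placeholders, not measurements from this benchmark; the $/GB figures and the 800KB page size come from the text above.

```python
def escalation_cost_per_1k(tiers, page_gb):
    """Expected proxy bandwidth cost (USD) per 1,000 successful fetches.

    tiers: list of (usd_per_gb, success_rate) pairs, cheapest first.
    Each page tries tiers in order, one attempt per tier, and stops at the
    first success. If every tier fails, the whole chain is retried.
    """
    expected_cost = 0.0
    p_reach = 1.0  # probability that all earlier tiers failed
    for usd_per_gb, success_rate in tiers:
        expected_cost += p_reach * usd_per_gb * page_gb
        p_reach *= 1 - success_rate
    p_success = 1 - p_reach  # chance that one pass through the chain succeeds
    return 1000 * expected_cost / p_success


PAGE_GB = 0.0008  # 800KB page

# Success rates below are made up for illustration.
hybrid = escalation_cost_per_1k([(0.30, 0.60), (5.00, 0.90), (15.00, 0.98)], PAGE_GB)
resi_only = escalation_cost_per_1k([(5.00, 0.90)], PAGE_GB)

print(f"hybrid:           ${hybrid:.2f} per 1K")
print(f"residential only: ${resi_only:.2f} per 1K")
```

With these made-up rates the hybrid path costs roughly half of the residential-only path. The saving depends heavily on how often the datacenter tier succeeds, so measure that rate on your own targets before trusting any estimate.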

Compute and storage: the second bill

Lambda is fine until it isn’t. For low-volume scraping under ~10,000 pages/day it’s usually the right choice: no infra to manage, and billing is pure consumption. Past that volume, cold start latency and the 15-minute execution cap start becoming real constraints. The full cost comparison of headless browsers on Lambda, Fargate, and Cloud Run shows Fargate winning on sustained Playwright workloads past about 50K pages/day once you amortize the container overhead.

Runtime choice also has a cost dimension that’s easy to miss. Bun outperforms Node.js by 30-40% on httpx-equivalent scraping benchmarks in 2026, and those time savings directly reduce billed compute on Lambda. Not the biggest lever, but it’s a free optimization if you’re already choosing a runtime.

Storage is the cost that creeps. Three rules that hold up:

  • Store gzip-compressed HTML, not raw. Typical ratio is 5:1 to 8:1, which means R2 or B2 storage costs shrink by the same factor.
  • Use cold storage tiers for anything older than 14 days. Most scraped HTML has a short half-life.
  • Set object lifecycle policies before the first scrape, not after you’ve accumulated 200GB of orphaned raw files.
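The first rule is a few lines with the standard library. This is a minimal sketch using Python’s `gzip` module; the helper names are illustrative, and the synthetic sample is far more repetitive than real pages, so measure the ratio on your own HTML rather than reading anything into this one.

```python
import gzip


def compress_html(html: str) -> bytes:
    """Gzip-compress a page before writing it to object storage."""
    return gzip.compress(html.encode("utf-8"), compresslevel=6)


def compression_ratio(html: str) -> float:
    """Raw size divided by compressed size (higher is better)."""
    raw = html.encode("utf-8")
    return len(raw) / len(compress_html(html))


# Synthetic, highly repetitive sample; real pages compress less well.
sample = ("<html><body>"
          + "<div class='row'><span>item</span></div>" * 500
          + "</body></html>")

blob = compress_html(sample)
restored = gzip.decompress(blob).decode("utf-8")
assert restored == sample  # compression is lossless
print(f"{len(sample.encode('utf-8'))} bytes raw, {len(blob)} bytes gzipped")
```

Storing the compressed bytes with a `.html.gz` suffix (a naming convention assumed here, not required by any storage provider) keeps it obvious downstream that a decompress step is needed.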

Egress is the sneaky one. If your pipeline reads scraped data back out of S3 to a processor outside AWS, you’re paying about $0.09/GB every time (reads from a Lambda in the same region as the bucket don’t incur that transfer charge). R2’s zero egress fee makes a meaningful difference once you’re moving real data volumes around.

When to move up the stack

Managed platforms (Apify, ScrapingBee, Bright Data) aren’t overpriced if you’re factoring in engineering time correctly. At 500K pages/month, the difference between self-managed Playwright at $1.90/1K and Apify at $5.00/1K is roughly $1,550/month. If maintaining your own rotating proxy pool and retry logic is costing two engineering hours a week, that’s likely $2,000+ in loaded cost. The math can favor managed platforms even when the per-page price looks worse.
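The comparison above is easy to rerun with your own numbers. This sketch treats the $2,000 engineering figure as a monthly cost, which is an assumption about how the estimate is meant; the function and field names are illustrative.

```python
def compare_monthly(pages_per_month, self_usd_per_1k, managed_usd_per_1k,
                    eng_usd_per_month):
    """Total monthly cost of self-managed (including engineering time) vs managed."""
    thousands = pages_per_month / 1000
    self_total = thousands * self_usd_per_1k + eng_usd_per_month
    managed_total = thousands * managed_usd_per_1k
    return {
        "self_managed": self_total,
        "managed": managed_total,
        "per_page_gap": thousands * (managed_usd_per_1k - self_usd_per_1k),
        "cheaper": "managed" if managed_total < self_total else "self_managed",
    }


# Figures from the paragraph above; engineering cost assumed to be monthly.
result = compare_monthly(500_000, 1.90, 5.00, 2000)
print(f"per-page gap:  ${result['per_page_gap']:,.0f}/month")
print(f"self-managed:  ${result['self_managed']:,.0f}/month")
print(f"managed:       ${result['managed']:,.0f}/month")
print(f"cheaper:       {result['cheaper']}")
```

The result is sensitive to the engineering estimate, so it is worth tracking actual maintenance hours for a month before deciding.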

But below roughly 200K pages/month, self-managed almost always wins on cost. The fixed overhead of building the system is already paid for.

Bottom line

If you’re paying more than $2.00 per 1,000 pages on non-SERP targets, there’s almost certainly a stack or proxy tier decision that can cut it in half, before changing anything about the scraping logic itself. Hybrid proxy escalation and gzip storage are usually the two moves with the best return per hour of engineering time. We cover both in depth across the DRT infrastructure series, so pick the piece that matches where your current bill is highest and start there.
