Bandwidth Optimization for Proxies: Reduce Costs & Increase Speed

Residential proxy providers charge $5-15 per GB. A naive scraper downloading full pages with images, CSS, and JavaScript can burn through gigabytes in hours. By optimizing bandwidth, you can reduce costs by 70-90% while also improving speed — less data means faster responses.

This guide covers every technique to minimize proxy bandwidth consumption.

The Bandwidth Problem

A typical web page in 2026:

Average page weight: 2.5 MB
├── HTML:        100 KB (4%)
├── JavaScript:  900 KB (36%)
├── CSS:         200 KB (8%)
├── Images:      1,100 KB (44%)
├── Fonts:       150 KB (6%)
└── Other:       50 KB (2%)

If you only need text data, you're wasting 96% of bandwidth.

Scraping 100,000 pages at full weight: 250 GB = $1,250-$3,750 in residential proxy costs.

Scraping 100,000 pages (HTML only): 10 GB = $50-$150. A 25x cost reduction.
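
The arithmetic is worth keeping on hand when sizing a job. A minimal sketch (the page weights and the $5-15/GB rates are the assumptions from above):

def estimate_proxy_cost(pages, page_size_kb, price_low=5, price_high=15):
    """Rough proxy spend: total GB transferred and a USD cost range."""
    total_gb = pages * page_size_kb / 1_000_000  # decimal KB -> GB
    return total_gb, total_gb * price_low, total_gb * price_high

# 100,000 pages: full weight (~2,500 KB) vs HTML only (~100 KB)
for label, size_kb in [("full page", 2_500), ("HTML only", 100)]:
    gb, low, high = estimate_proxy_cost(100_000, size_kb)
    print(f"{label}: {gb:,.0f} GB -> ${low:,.0f}-${high:,.0f}")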

Technique 1: Block Unnecessary Resources

With Playwright

from playwright.async_api import async_playwright

async def scrape_text_only(url, proxy):
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            proxy={"server": proxy}
        )
        page = await browser.new_page()

        # Block images, CSS, fonts, media
        await page.route("**/*.{png,jpg,jpeg,gif,svg,webp,ico}",
                        lambda route: route.abort())
        await page.route("**/*.{css,woff,woff2,ttf,eot}",
                        lambda route: route.abort())
        await page.route("**/*.{mp4,mp3,avi,flv}",
                        lambda route: route.abort())

        # Block known tracking/analytics domains
        block_domains = [
            "google-analytics.com", "googletagmanager.com",
            "facebook.net", "doubleclick.net", "hotjar.com",
        ]
        for domain in block_domains:
            await page.route(f"**/*{domain}*", lambda route: route.abort())

        await page.goto(url, wait_until="domcontentloaded")
        content = await page.content()
        await browser.close()
        return content

With Puppeteer (Node.js)

const puppeteer = require('puppeteer');

async function scrapeEfficient(url, proxyServer) {
    const browser = await puppeteer.launch({
        args: [`--proxy-server=${proxyServer}`]
    });
    const page = await browser.newPage();

    // Enable request interception
    await page.setRequestInterception(true);
    page.on('request', (req) => {
        const blockedTypes = ['image', 'stylesheet', 'font', 'media'];
        if (blockedTypes.includes(req.resourceType())) {
            req.abort();
        } else {
            req.continue();
        }
    });

    await page.goto(url, { waitUntil: 'domcontentloaded' });
    const content = await page.content();
    await browser.close();
    return content;
}

Technique 2: Request Compression

Ask servers to compress responses:

import httpx

# Note: httpx decodes gzip automatically; install the `brotli` package
# so it can also decode Brotli ("br") responses.

async def compressed_request(url, proxy):
    async with httpx.AsyncClient(
        proxy=proxy,
        headers={
            "Accept-Encoding": "br, gzip, deflate",  # Prefer Brotli
        }
    ) as client:
        response = await client.get(url)

        # httpx handles decompression automatically
        # But check what encoding was used
        encoding = response.headers.get("content-encoding", "none")
        raw_size = int(response.headers.get("content-length", 0))
        decoded_size = len(response.content)

        print(f"Encoding: {encoding}")
        print(f"Decoded size: {decoded_size:,} bytes")
        if raw_size:
            # Content-Length, when present, is the compressed transfer size
            savings = (1 - raw_size / decoded_size) * 100 if decoded_size else 0
            print(f"Transfer size: {raw_size:,} bytes")
            print(f"Savings: {savings:.1f}%")
        else:
            print("Transfer size unknown (no Content-Length header)")

        return response

Compression savings:

HTML:  60-80% smaller with gzip, 65-85% with Brotli
JSON:  70-90% smaller with gzip
CSS:   60-75% smaller
JS:    50-70% smaller
Images: 0% (already compressed)

Technique 3: API-First Scraping

Many websites load data via API calls. Intercepting these APIs gives you structured data without page overhead:

import httpx

class APIInterceptScraper:
    """Find and use internal APIs instead of parsing HTML."""

    async def discover_apis(self, url, proxy):
        """Use browser to discover API endpoints."""
        from playwright.async_api import async_playwright

        apis_found = []

        async with async_playwright() as p:
            browser = await p.chromium.launch(proxy={"server": proxy})
            page = await browser.new_page()

            # Capture XHR/fetch requests (likely JSON endpoints)
            def capture(req):
                if req.resource_type in ("xhr", "fetch"):
                    apis_found.append({
                        "url": req.url,
                        "method": req.method,
                        "type": req.resource_type,
                    })

            page.on("request", capture)

            await page.goto(url)
            await page.wait_for_timeout(5000)
            await browser.close()

        return [
            api for api in apis_found
            if "api" in api["url"].lower() or "json" in api["url"].lower()
        ]

    async def scrape_via_api(self, api_url, proxy):
        """Directly call the API — much less bandwidth."""
        async with httpx.AsyncClient(proxy=proxy) as client:
            response = await client.get(
                api_url,
                headers={
                    "Accept": "application/json",
                    "X-Requested-With": "XMLHttpRequest",
                }
            )
            return response.json()

# Bandwidth comparison:
# Full page load: ~2.5 MB
# API call only:  ~5-50 KB (50-500x less bandwidth)

Technique 4: Conditional Requests

Avoid re-downloading unchanged content:

import httpx

class ConditionalScraper:
    """Only download content that has changed."""

    def __init__(self, proxy):
        self.proxy = proxy
        self.etag_cache = {}
        self.modified_cache = {}

    async def get_if_modified(self, url):
        headers = {}

        # Send ETag if we have one
        if url in self.etag_cache:
            headers["If-None-Match"] = self.etag_cache[url]

        # Send Last-Modified if we have it
        if url in self.modified_cache:
            headers["If-Modified-Since"] = self.modified_cache[url]

        async with httpx.AsyncClient(proxy=self.proxy) as client:
            response = await client.get(url, headers=headers)

            if response.status_code == 304:
                # Not modified — zero bandwidth for content
                print(f"  {url}: Not modified (0 bytes)")
                return None

            # Cache headers for next request
            if "etag" in response.headers:
                self.etag_cache[url] = response.headers["etag"]
            if "last-modified" in response.headers:
                self.modified_cache[url] = response.headers["last-modified"]

            print(f"  {url}: Downloaded ({len(response.content)} bytes)")
            return response.content

Technique 5: Partial Content (Range Requests)

Download only what you need from large files:

async def download_partial(url, proxy, start_byte=0, end_byte=1024):
    """Download specific byte range."""
    async with httpx.AsyncClient(proxy=proxy) as client:
        response = await client.get(
            url,
            headers={"Range": f"bytes={start_byte}-{end_byte}"}
        )
        if response.status_code == 206:
            print(f"Partial content: {len(response.content)} bytes")
            return response.content
        else:
            print("Server doesn't support range requests")
            return response.content

# Useful for: checking headers of large files,
# downloading just the first N KB of a page,
# resuming interrupted downloads

Technique 6: Response Streaming

Process data as it arrives without buffering the full response:

async def stream_and_extract(url, proxy, max_bytes=50_000):
    """Stream response and stop after finding what we need."""
    async with httpx.AsyncClient(proxy=proxy) as client:
        total_bytes = 0
        chunks = []

        async with client.stream("GET", url) as response:
            async for chunk in response.aiter_bytes(chunk_size=8192):
                chunks.append(chunk)
                total_bytes += len(chunk)

                # Check if we have what we need
                content_so_far = b"".join(chunks).decode("utf-8", errors="ignore")
                if "<title>" in content_so_far and "</title>" in content_so_far:
                    # Found the title — stop downloading
                    break

                if total_bytes > max_bytes:
                    break

        content = b"".join(chunks).decode("utf-8", errors="ignore")
        print(f"Downloaded {total_bytes:,} bytes instead of full page")
        return content

Technique 7: Local Caching

import hashlib
import json
import os
import time

import httpx

class BandwidthCache:
    """Cache responses locally to avoid repeated proxy requests."""

    def __init__(self, cache_dir="./cache", ttl=3600):
        self.cache_dir = cache_dir
        self.ttl = ttl
        os.makedirs(cache_dir, exist_ok=True)

    def _cache_key(self, url):
        return hashlib.sha256(url.encode()).hexdigest()

    def get(self, url):
        key = self._cache_key(url)
        path = os.path.join(self.cache_dir, key)

        if not os.path.exists(path):
            return None

        with open(path, "r") as f:
            cached = json.load(f)

        if time.time() - cached["timestamp"] > self.ttl:
            os.remove(path)
            return None

        return cached["content"]

    def set(self, url, content):
        key = self._cache_key(url)
        path = os.path.join(self.cache_dir, key)

        with open(path, "w") as f:
            json.dump({
                "url": url,
                "content": content,
                "timestamp": time.time(),
            }, f)

    async def get_or_fetch(self, url, proxy):
        cached = self.get(url)
        if cached:
            return cached

        async with httpx.AsyncClient(proxy=proxy) as client:
            response = await client.get(url)
            content = response.text
            self.set(url, content)
            return content

Bandwidth Savings Summary

Technique                   Savings             Effort
Block images/CSS/fonts      60-80%              Low
Gzip/Brotli compression     60-85%              None (just set header)
API-first scraping          90-99%              Medium
Conditional requests        100% on unchanged   Low
Streaming with early stop   50-95%              Medium
Local caching               100% on repeats     Low
Combined approach           90-99%              Medium
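
As a rough sketch of how the techniques compose, here is a single fetcher that layers an in-memory cache, conditional requests, and compression on top of plain httpx (the class name, TTL, and cache structure are illustrative, not from any particular library):

import time
import httpx

class OptimizedFetcher:
    """Combine local caching, conditional requests, and compression."""

    def __init__(self, proxy, ttl=3600):
        self.proxy = proxy
        self.ttl = ttl
        self.cache = {}       # url -> (timestamp, content)
        self.validators = {}  # url -> validator headers for the next request

    async def fetch(self, url):
        # 1. Fresh local cache hit: zero bandwidth
        cached = self.cache.get(url)
        if cached and time.time() - cached[0] < self.ttl:
            return cached[1]

        # 2. Ask for compressed transfer and send validators if we have them
        headers = {"Accept-Encoding": "br, gzip, deflate"}
        headers.update(self.validators.get(url, {}))

        async with httpx.AsyncClient(proxy=self.proxy) as client:
            response = await client.get(url, headers=headers)

        # 3. 304 Not Modified: reuse the stale cache entry, near-zero content bytes
        if response.status_code == 304 and cached:
            self.cache[url] = (time.time(), cached[1])
            return cached[1]

        # 4. Store content and validators for the next round
        validators = {}
        if "etag" in response.headers:
            validators["If-None-Match"] = response.headers["etag"]
        if "last-modified" in response.headers:
            validators["If-Modified-Since"] = response.headers["last-modified"]
        self.validators[url] = validators
        self.cache[url] = (time.time(), response.text)
        return response.text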

FAQ

How much can I save on proxy costs with bandwidth optimization?

Combining resource blocking (images/CSS), compression, and API-first scraping typically reduces bandwidth by 90-95%. On residential proxies at $10/GB, scraping 100K full-weight pages (about 250 GB) costs roughly $2,500 unoptimized versus $125-$250 optimized.

Does blocking resources affect the data I can scrape?

Blocking images and CSS does not affect text data extraction. However, some JavaScript-heavy sites need certain scripts to render content. Test with and without blocking to ensure your target data still loads.
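
One way to run that test is to load the same page twice, once with blocking and once without, and confirm your target element still appears. A minimal sketch with Playwright (the CSS selector is a hypothetical placeholder for your own target data):

from playwright.async_api import async_playwright

BLOCKED_TYPES = ("image", "stylesheet", "font", "media")

async def blocking_is_safe(url, selector, proxy):
    """Load the page with and without resource blocking and report
    whether the target selector renders in both cases."""
    found = {}
    async with async_playwright() as p:
        browser = await p.chromium.launch(proxy={"server": proxy})
        for block in (False, True):
            page = await browser.new_page()
            if block:
                async def drop_heavy(route):
                    if route.request.resource_type in BLOCKED_TYPES:
                        await route.abort()
                    else:
                        await route.continue_()
                await page.route("**/*", drop_heavy)
            await page.goto(url, wait_until="domcontentloaded")
            found["blocked" if block else "full"] = await page.locator(selector).count() > 0
            await page.close()
        await browser.close()
    return found  # e.g. {"full": True, "blocked": True} means blocking is safe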

Should I use headless browser or HTTP requests for bandwidth efficiency?

HTTP requests (httpx, requests) use dramatically less bandwidth than headless browsers. A headless browser downloads and executes all page resources. Use HTTP requests when possible, and switch to headless browsers only for JavaScript-rendered content — with resource blocking enabled.
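
A common pattern is to try the cheap path first and only pay for a browser when it fails. A sketch of that fallback, where needed_marker is a hypothetical substring that proves the data rendered:

import httpx
from playwright.async_api import async_playwright

async def fetch_cheapest(url, proxy, needed_marker):
    """Plain HTTP first; fall back to a headless browser with resource
    blocking only if the target data is missing from the raw HTML."""
    async with httpx.AsyncClient(proxy=proxy,
                                 headers={"Accept-Encoding": "br, gzip"}) as client:
        response = await client.get(url)
        if needed_marker in response.text:
            return response.text  # ~100 KB instead of ~2.5 MB

    # Data is rendered client-side: use a browser, but block heavy resources
    async with async_playwright() as p:
        browser = await p.chromium.launch(proxy={"server": proxy})
        page = await browser.new_page()
        await page.route("**/*", lambda route: route.abort()
                         if route.request.resource_type in ("image", "stylesheet", "font", "media")
                         else route.continue_())
        await page.goto(url, wait_until="domcontentloaded")
        content = await page.content()
        await browser.close()
        return content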

Does compression use more CPU on my machine?

Decompression is fast — gzip adds less than 1ms per response, Brotli 1-3ms. The bandwidth savings (60-85% less data) far outweigh the minimal CPU cost. Always enable compression.

How do I know which optimization gives the biggest improvement?

Measure your baseline first. Profile a sample of requests to see the average page size breakdown (HTML, JS, CSS, images). Block the largest category first. Typically, blocking images gives the biggest single improvement (40-50% savings), followed by CSS and fonts.
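
A baseline can be measured with the browser you already use. A sketch that sums reported transfer sizes per resource type for one page load (Content-Length is an approximation; chunked responses count as 0 here):

from collections import defaultdict
from playwright.async_api import async_playwright

async def profile_page_weight(url, proxy):
    """Break one page load down by resource type using Content-Length."""
    sizes = defaultdict(int)

    def record(response):
        size = int(response.headers.get("content-length", 0))
        sizes[response.request.resource_type] += size

    async with async_playwright() as p:
        browser = await p.chromium.launch(proxy={"server": proxy})
        page = await browser.new_page()
        page.on("response", record)
        await page.goto(url, wait_until="networkidle")
        await browser.close()

    total = sum(sizes.values()) or 1
    for rtype, size in sorted(sizes.items(), key=lambda item: -item[1]):
        print(f"{rtype:>12}: {size / 1024:8.1f} KB ({size / total:5.1%})")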

