Distributed Web Scraping: Architecture Guide

Scraping millions of pages from a single machine is a recipe for failure. Between IP bans, memory limits, and bandwidth bottlenecks, single-node scrapers hit a wall fast. Distributed web scraping spreads the workload across multiple machines, enabling you to crawl at scale while maintaining reliability and speed.

This guide covers the architecture patterns, message queues, worker coordination, and proxy integration strategies you need to build a production-grade distributed scraping system.

Why Distributed Scraping?

Single-machine scrapers face hard limits:

  • IP blocking: One IP making thousands of requests gets flagged instantly
  • Memory constraints: Storing millions of URLs in memory crashes your process
  • Bandwidth limits: A single connection can only download so fast
  • No fault tolerance: If the machine goes down, you lose everything
  • CPU bottleneck: Parsing HTML is CPU-intensive at scale

Distributed scraping solves all of these by splitting work across multiple nodes, each with its own IP, memory, and processing power.

Core Architecture Components

A distributed scraping system has five key components:

┌─────────────┐      ┌───────────────┐      ┌─────────────┐
│  Scheduler  │─────▶│ Message Queue │─────▶│   Workers   │
│ (URL Mgmt)  │      │  (Redis/RMQ)  │      │ (Scrapers)  │
└─────────────┘      └───────────────┘      └──────┬──────┘
                                                   │
                     ┌──────────────┐              │
                     │   Storage    │◀─────────────┘
                     │ (DB/S3/etc)  │
                     └──────────────┘

                     ┌──────────────┐
                     │  Proxy Pool  │
                     │  (Rotating)  │
                     └──────────────┘

1. Scheduler (URL Manager)

The scheduler manages which URLs need to be scraped and distributes them to workers. It handles:

  • URL deduplication (avoiding re-scraping the same page)
  • Priority queuing (important pages first)
  • Retry logic for failed requests
  • Rate limiting per domain
  • Politeness policies (respecting robots.txt)
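
The politeness check can be sketched with the standard library's robotparser. This is a minimal example, not part of the system above; the `allowed` helper, its user-agent string, and the `robots_lines` offline-testing hook are all illustrative:

```python
# A minimal robots.txt politeness check; the scheduler would call this
# before queueing a URL. Pass robots_lines to check rules offline.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed(url, user_agent="MyScraper", robots_lines=None):
    """Return True if robots.txt permits fetching this URL."""
    rp = RobotFileParser()
    if robots_lines is not None:
        rp.parse(robots_lines)  # pre-supplied rules, handy for testing
    else:
        parts = urlparse(url)
        rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        rp.read()  # fetches robots.txt over the network
    return rp.can_fetch(user_agent, url)
```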

2. Message Queue

The message queue decouples the scheduler from workers. Popular choices:

| Queue        | Best For                      | Throughput    |
|--------------|-------------------------------|---------------|
| Redis        | Simple setups, fast           | ~100K msg/sec |
| RabbitMQ     | Complex routing, reliability  | ~50K msg/sec  |
| Apache Kafka | Massive scale, event streaming| ~1M msg/sec   |
| Amazon SQS   | AWS-native, managed           | ~3K msg/sec   |

3. Workers (Scrapers)

Workers pull URLs from the queue, fetch pages, parse data, and store results. Each worker:

  • Runs independently
  • Uses its own proxy connection
  • Reports success/failure back to the scheduler
  • Can be scaled horizontally

4. Storage Layer

Scraped data needs persistent storage:

  • PostgreSQL/MySQL: Structured data with relationships
  • MongoDB: Semi-structured or varying schemas
  • Amazon S3: Raw HTML pages, screenshots
  • Elasticsearch: Full-text search across scraped content

5. Proxy Pool

Each worker routes requests through rotating proxies to avoid detection. A proxy manager assigns proxies to workers and rotates them based on usage.

Implementation with Python and Redis

Here’s a production-ready distributed scraping system using Python, Redis, and Celery:

Project Structure

distributed-scraper/
├── config.py
├── scheduler.py
├── worker.py
├── tasks.py
├── proxy_manager.py
├── storage.py
├── requirements.txt
└── docker-compose.yml

Configuration

# config.py
import os

REDIS_URL = os.getenv("REDIS_URL", "redis://localhost:6379/0")
MONGO_URL = os.getenv("MONGO_URL", "mongodb://localhost:27017")
DATABASE_NAME = "scraper_db"

# Scraping settings
MAX_RETRIES = 3
REQUEST_TIMEOUT = 30
CONCURRENT_REQUESTS = 50
RATE_LIMIT_PER_DOMAIN = 2  # requests per second

# Proxy settings
PROXY_API_URL = os.getenv("PROXY_API_URL", "https://proxy-provider.com/api")
PROXY_API_KEY = os.getenv("PROXY_API_KEY", "")
PROXY_ROTATION_INTERVAL = 10  # requests before rotating

Scheduler

# scheduler.py
import hashlib
import json

import redis

from config import REDIS_URL


class Scheduler:
    def __init__(self):
        self.redis = redis.from_url(REDIS_URL)
        self.seen_key = "scraper:seen_urls"
        self.queue_key = "scraper:url_queue"
        self.failed_key = "scraper:failed_urls"
        self.stats_key = "scraper:stats"

    def add_url(self, url, priority=0, metadata=None):
        """Add URL to scraping queue if not already seen."""
        url_hash = hashlib.md5(url.encode()).hexdigest()

        # Check if already scraped
        if self.redis.sismember(self.seen_key, url_hash):
            return False

        # Add to seen set
        self.redis.sadd(self.seen_key, url_hash)

        # Add to priority queue (higher score = popped first)
        task = {
            "url": url,
            "priority": priority,
            "metadata": metadata or {},
            "retries": 0
        }
        self.redis.zadd(self.queue_key, {json.dumps(task): priority})
        return True

    def add_urls_bulk(self, urls, priority=0):
        """Add multiple URLs efficiently via a pipeline."""
        pipe = self.redis.pipeline()
        added = 0
        for url in urls:
            url_hash = hashlib.md5(url.encode()).hexdigest()
            if not self.redis.sismember(self.seen_key, url_hash):
                pipe.sadd(self.seen_key, url_hash)
                task = json.dumps({
                    "url": url, "priority": priority,
                    "metadata": {}, "retries": 0
                })
                pipe.zadd(self.queue_key, {task: priority})
                added += 1
        pipe.execute()
        return added

    def get_next_url(self):
        """Get the highest-priority URL from the queue."""
        result = self.redis.zpopmax(self.queue_key)
        if result:
            task_json, score = result[0]
            return json.loads(task_json)
        return None

    def retry_url(self, task, max_retries=3):
        """Re-queue a failed URL with incremented retry count."""
        task["retries"] += 1
        if task["retries"] < max_retries:
            # Lower priority for retries
            new_priority = task["priority"] - task["retries"]
            self.redis.zadd(
                self.queue_key,
                {json.dumps(task): new_priority}
            )
        else:
            self.redis.lpush(self.failed_key, json.dumps(task))

    def get_stats(self):
        """Get scraping statistics."""
        return {
            "queued": self.redis.zcard(self.queue_key),
            "seen": self.redis.scard(self.seen_key),
            "failed": self.redis.llen(self.failed_key)
        }

Celery Worker Tasks

# tasks.py
import logging

import requests
from bs4 import BeautifulSoup
from celery import Celery

from config import REDIS_URL, MAX_RETRIES, REQUEST_TIMEOUT
from proxy_manager import ProxyManager
from storage import DataStore

app = Celery("scraper", broker=REDIS_URL, backend=REDIS_URL)
app.conf.update(
    worker_prefetch_multiplier=1,
    task_acks_late=True,
    task_reject_on_worker_lost=True,
    worker_max_tasks_per_child=100,
)

proxy_manager = ProxyManager()
data_store = DataStore()
logger = logging.getLogger(__name__)


@app.task(bind=True, max_retries=MAX_RETRIES, default_retry_delay=60)
def scrape_url(self, url, metadata=None):
    """Scrape a single URL with proxy rotation and retry logic."""
    proxy = proxy_manager.get_proxy()
    try:
        response = requests.get(
            url,
            proxies={"http": proxy, "https": proxy},
            timeout=REQUEST_TIMEOUT,
            headers={
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                              "AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36"
            }
        )
        response.raise_for_status()

        # Parse the page
        soup = BeautifulSoup(response.text, "lxml")
        data = extract_data(soup, url, metadata)

        # Store results
        data_store.save(data)
        proxy_manager.report_success(proxy)
        return {"status": "success", "url": url}

    except requests.exceptions.ProxyError:
        proxy_manager.report_failure(proxy)
        raise self.retry(exc=Exception(f"Proxy failed: {proxy}"))

    except requests.exceptions.HTTPError as e:
        status = e.response.status_code
        if status == 429:
            # Rate limited: back off longer before retrying
            raise self.retry(exc=e, countdown=120)
        elif status == 403:
            proxy_manager.report_blocked(proxy)
            raise self.retry(exc=e)
        raise

    except Exception as e:
        logger.error(f"Failed to scrape {url}: {e}")
        raise self.retry(exc=e)


def extract_data(soup, url, metadata):
    """Extract structured data from parsed HTML."""
    return {
        "url": url,
        "title": soup.title.string if soup.title else None,
        "text": soup.get_text(separator=" ", strip=True)[:5000],
        "links": [a.get("href") for a in soup.find_all("a", href=True)],
        "metadata": metadata or {}
    }

Proxy Manager

# proxy_manager.py
import redis
import requests

from config import REDIS_URL, PROXY_API_URL, PROXY_API_KEY


class ProxyManager:
    def __init__(self):
        self.redis = redis.from_url(REDIS_URL)
        self.proxy_key = "scraper:proxies"
        self.blocked_key = "scraper:blocked_proxies"

    def refresh_proxies(self):
        """Fetch a fresh proxy list from the provider API."""
        response = requests.get(
            PROXY_API_URL,
            headers={"Authorization": f"Bearer {PROXY_API_KEY}"}
        )
        proxies = response.json().get("proxies", [])
        pipe = self.redis.pipeline()
        pipe.delete(self.proxy_key)
        for proxy in proxies:
            pipe.sadd(self.proxy_key, proxy)
        pipe.execute()
        return len(proxies)

    def get_proxy(self):
        """Get a random proxy from the pool."""
        proxy = self.redis.srandmember(self.proxy_key)
        if proxy:
            return proxy.decode()
        # If the pool is empty, refresh it
        self.refresh_proxies()
        proxy = self.redis.srandmember(self.proxy_key)
        return proxy.decode() if proxy else None

    def report_success(self, proxy):
        """Mark a proxy as working."""
        self.redis.hincrby("scraper:proxy_stats", f"{proxy}:success", 1)

    def report_failure(self, proxy):
        """Record a proxy failure; remove it after too many failures."""
        failures = self.redis.hincrby(
            "scraper:proxy_stats", f"{proxy}:failures", 1
        )
        if failures > 5:
            self.redis.srem(self.proxy_key, proxy)
            self.redis.sadd(self.blocked_key, proxy)

    def report_blocked(self, proxy):
        """Mark a proxy as blocked and remove it from the pool."""
        self.redis.srem(self.proxy_key, proxy)
        self.redis.sadd(self.blocked_key, proxy)

Docker Compose Setup

# docker-compose.yml
version: "3.8"

services:
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  mongodb:
    image: mongo:7
    ports:
      - "27017:27017"
    volumes:
      - mongo_data:/data/db

  scheduler:
    build: .
    command: python scheduler_runner.py
    environment:
      - REDIS_URL=redis://redis:6379/0
      - MONGO_URL=mongodb://mongodb:27017
    depends_on:
      - redis
      - mongodb

  worker:
    build: .
    command: celery -A tasks worker --concurrency=10 --loglevel=info
    environment:
      - REDIS_URL=redis://redis:6379/0
      - MONGO_URL=mongodb://mongodb:27017
    depends_on:
      - redis
      - mongodb
    deploy:
      replicas: 4

  monitor:
    build: .
    command: celery -A tasks flower --port=5555
    ports:
      - "5555:5555"
    depends_on:
      - redis

volumes:
  mongo_data:

Scaling Strategies

Horizontal Scaling

Add more worker containers:

docker-compose up --scale worker=10

Each worker handles N concurrent tasks, so 10 workers at concurrency 10 gives 100 parallel scraping tasks.

Domain-Based Sharding

Assign specific domains to specific workers to maintain rate limits:

# Domain-aware task routing
from urllib.parse import urlparse

@app.task(bind=True)
def scrape_url(self, url, metadata=None):
    domain = urlparse(url).netloc
    # Route to domain-specific queue
    queue_name = f"scraper.{domain.replace('.', '_')}"
    # ... scraping logic
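
A fuller sketch of the same idea: compute the queue name at dispatch time and pass it to apply_async, so each per-domain queue can be consumed by workers enforcing that domain's rate limit. The `domain_queue` and `enqueue_scrape` helpers here are illustrative, not part of the system above:

```python
# Hypothetical dispatch helpers: derive a per-domain Celery queue name
# and route tasks there via apply_async's queue argument.
from urllib.parse import urlparse

def domain_queue(url):
    domain = urlparse(url).netloc
    return f"scraper.{domain.replace('.', '_')}"

def enqueue_scrape(url):
    queue_name = domain_queue(url)
    # scrape_url.apply_async(args=[url], queue=queue_name)
    return queue_name
```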

Geographic Distribution

Deploy workers in different regions to reduce latency and appear more natural:

  • US workers scrape US sites
  • EU workers scrape EU sites
  • APAC workers scrape Asian sites

Use cloud providers with multi-region support (AWS, GCP, Azure) and assign location-appropriate residential proxies to each region.

Deduplication at Scale

URL deduplication becomes critical when crawling millions of pages. Redis sets work up to ~100M URLs. Beyond that, use probabilistic data structures:

# Bloom filter for memory-efficient deduplication
from pybloom_live import ScalableBloomFilter

class BloomDeduplicator:
    def __init__(self):
        self.bloom = ScalableBloomFilter(
            initial_capacity=1000000,
            error_rate=0.001
        )

    def is_seen(self, url):
        if url in self.bloom:
            return True
        self.bloom.add(url)
        return False

A Bloom filter with a 0.1% false positive rate uses about 14.4 bits per element — roughly 1.8 MB per million URLs versus ~64 MB for a Redis set.
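
The memory figure follows from the standard Bloom filter sizing formula, m = -n ln(p) / (ln 2)², shown here as a quick calculation:

```python
# Bloom filter sizing: bits needed for n items at false-positive rate p.
import math

def bloom_bits(n, p):
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))

# 1M URLs at 0.1% false positives -> ~14.4 bits per URL, ~1.8 MB total
megabytes = bloom_bits(1_000_000, 0.001) / 8 / 1e6
```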

Handling Failures Gracefully

Circuit Breaker Pattern

Stop hammering a site that’s returning errors:

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=300):
        self.failure_count = {}
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.open_since = {}

    def can_request(self, domain):
        if domain not in self.failure_count:
            return True
        if self.failure_count[domain] >= self.failure_threshold:
            if time.time() - self.open_since[domain] > self.reset_timeout:
                self.failure_count[domain] = 0
                return True
            return False
        return True

    def record_failure(self, domain):
        self.failure_count[domain] = self.failure_count.get(domain, 0) + 1
        if self.failure_count[domain] >= self.failure_threshold:
            self.open_since[domain] = time.time()

    def record_success(self, domain):
        self.failure_count[domain] = 0

Dead Letter Queue

URLs that fail after all retries go to a dead letter queue for manual inspection:

import time

def handle_permanent_failure(task):
    """Store permanently failed tasks for investigation."""
    # data_store is the DataStore instance from tasks.py
    data_store.save_failed({
        "url": task["url"],
        "retries": task["retries"],
        "last_error": task.get("last_error"),
        "timestamp": time.time()
    })

Performance Optimization Tips

  1. Use connection pooling: Reuse HTTP connections with requests.Session()
  2. Parse selectively: Don’t parse the entire DOM if you only need specific elements
  3. Compress storage: Store raw HTML with gzip compression
  4. Batch database writes: Buffer results and write in batches of 100+
  5. Use lxml over html.parser: 10x faster HTML parsing
  6. Pre-filter URLs: Skip non-HTML resources (images, PDFs) early
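
Tip 4 can be sketched as a small write buffer that flushes in batches. `BatchWriter` and the `save_batch` call are illustrative, not part of the code above:

```python
# Sketch: buffer scraped records and flush them in batches,
# turning N single-row writes into one bulk insert.
class BatchWriter:
    def __init__(self, flush_size=100):
        self.buffer = []
        self.flush_size = flush_size

    def add(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self):
        if self.buffer:
            # save_batch(self.buffer)  # hypothetical bulk-insert call
            self.buffer.clear()
```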

Monitoring Your Distributed Scraper

Track these key metrics:

  • Pages/minute: Overall throughput
  • Error rate: Percentage of failed requests
  • Queue depth: How many URLs are waiting
  • Proxy health: Success rate per proxy
  • Latency: Average response time

Use Celery Flower (included in the Docker setup) for real-time worker monitoring, and export metrics to Prometheus/Grafana for long-term tracking.
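
As a sketch, the headline numbers reduce to simple ratios over success/failure counters collected per window. The function and field names here are illustrative:

```python
# Derive throughput and error rate from raw counters over a time window.
def metrics_snapshot(successes, failures, queue_depth, window_minutes):
    total = successes + failures
    return {
        "pages_per_minute": total / window_minutes if window_minutes else 0.0,
        "error_rate": failures / total if total else 0.0,
        "queue_depth": queue_depth,
    }
```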

FAQ

How many workers do I need?

Start with 4-5 workers, each handling 10-20 concurrent requests. Scale based on your throughput needs and target site tolerance. For most projects, 10-20 workers handle millions of pages per day.

Should I use Celery or build my own worker system?

Use Celery for most projects — it handles task routing, retries, monitoring, and scaling out of the box. Build custom only if you need sub-millisecond latency or have unusual routing requirements.

How do I avoid getting blocked at scale?

Combine rotating residential proxies, realistic request headers, randomized delays, and domain-based rate limiting. See our anti-bot detection guide for detailed strategies.

What’s the best message queue for distributed scraping?

Redis for simplicity and speed (most projects). RabbitMQ if you need complex routing rules. Kafka if you’re processing 10M+ URLs and need replay capability.

How do I handle JavaScript-rendered pages in distributed scraping?

Use dedicated browser worker pools running headless Chrome/Playwright. These are resource-intensive, so separate them from your lightweight HTTP workers. See our JavaScript scraping guide for details.

Conclusion

Distributed web scraping transforms scraping from a fragile single-machine process into a resilient, scalable data collection system. The architecture — scheduler, message queue, workers, storage, and proxy pool — provides the foundation for scraping at any scale.

Start with the Redis + Celery setup shown here, add proxy rotation for IP management, and scale workers horizontally as your needs grow. The investment in distributed architecture pays off the moment your scraping targets exceed what a single machine can handle.
