Distributed Web Scraping: Architecture Guide
Scraping millions of pages from a single machine is a recipe for failure. Between IP bans, memory limits, and bandwidth bottlenecks, single-node scrapers hit a wall fast. Distributed web scraping spreads the workload across multiple machines, enabling you to crawl at scale while maintaining reliability and speed.
This guide covers the architecture patterns, message queues, worker coordination, and proxy integration strategies you need to build a production-grade distributed scraping system.
Why Distributed Scraping?
Single-machine scrapers face hard limits:
- IP blocking: One IP making thousands of requests gets flagged instantly
- Memory constraints: Storing millions of URLs in memory crashes your process
- Bandwidth limits: A single connection can only download so fast
- No fault tolerance: If the machine goes down, you lose everything
- CPU bottleneck: Parsing HTML is CPU-intensive at scale
Distributed scraping solves all of these by splitting work across multiple nodes, each with its own IP, memory, and processing power.
Core Architecture Components
A distributed scraping system has five key components:
┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│  Scheduler  │────▶│ Message Queue│────▶│   Workers   │
│ (URL Mgmt)  │     │ (Redis/RMQ)  │     │ (Scrapers)  │
└─────────────┘     └──────────────┘     └──────┬──────┘
                                                │
                    ┌──────────────┐            │
                    │   Storage    │◀───────────┤
                    │ (DB/S3/etc)  │            │
                    └──────────────┘            │
                                         ┌──────┴──────┐
                                         │ Proxy Pool  │
                                         │ (Rotating)  │
                                         └─────────────┘
1. Scheduler (URL Manager)
The scheduler manages which URLs need to be scraped and distributes them to workers. It handles:
- URL deduplication (avoiding re-scraping the same page)
- Priority queuing (important pages first)
- Retry logic for failed requests
- Rate limiting per domain (see the sketch below)
- Politeness policies (respecting robots.txt; also sketched below)
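The scheduler implementation later in this guide covers deduplication, priorities, and retries, but not the last two items. Here is a minimal sketch of both, assuming a shared Redis counter for rate limiting; the module name and key format are illustrative and not part of the project shown later:
# politeness.py (illustrative module, not in the project layout below)
import time
import urllib.robotparser
from urllib.parse import urlparse

import redis

class PolitenessPolicy:
    def __init__(self, redis_url, rate_limit_per_domain=2):
        self.redis = redis.from_url(redis_url)
        self.rate_limit = rate_limit_per_domain  # requests per second per domain
        self.robots_cache = {}

    def allowed_by_robots(self, url, user_agent="*"):
        """Check robots.txt, caching one parser per domain."""
        domain = urlparse(url).netloc
        if domain not in self.robots_cache:
            parser = urllib.robotparser.RobotFileParser()
            parser.set_url(f"https://{domain}/robots.txt")
            try:
                parser.read()
            except Exception:
                parser = None  # robots.txt unreachable: treat as allowed
            self.robots_cache[domain] = parser
        parser = self.robots_cache[domain]
        return True if parser is None else parser.can_fetch(user_agent, url)

    def within_rate_limit(self, url):
        """Allow at most rate_limit requests per domain per second,
        counted in Redis so the limit is shared across all workers."""
        domain = urlparse(url).netloc
        key = f"scraper:rate:{domain}:{int(time.time())}"
        count = self.redis.incr(key)
        self.redis.expire(key, 2)  # per-second counters can expire quickly
        return count <= self.rate_limit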
2. Message Queue
The message queue decouples the scheduler from workers. Popular choices:
| Queue | Best For | Throughput |
|---|---|---|
| Redis | Simple setups, fast | ~100K msg/sec |
| RabbitMQ | Complex routing, reliability | ~50K msg/sec |
| Apache Kafka | Massive scale, event streaming | ~1M msg/sec |
| Amazon SQS | AWS-native, managed | ~3K msg/sec |
3. Workers (Scrapers)
Workers pull URLs from the queue, fetch pages, parse data, and store results. Each worker:
- Runs independently
- Uses its own proxy connection
- Reports success/failure back to the scheduler
- Can be scaled horizontally
4. Storage Layer
Scraped data needs persistent storage:
- PostgreSQL/MySQL: Structured data with relationships
- MongoDB: Semi-structured or varying schemas
- Amazon S3: Raw HTML pages, screenshots
- Elasticsearch: Full-text search across scraped content
5. Proxy Pool
Each worker routes requests through rotating proxies to avoid detection. A proxy manager assigns proxies to workers and rotates them based on usage.
Implementation with Python and Redis
Here’s a production-ready distributed scraping system using Python, Redis, and Celery:
Project Structure
distributed-scraper/
├── config.py
├── scheduler.py
├── worker.py
├── tasks.py
├── proxy_manager.py
├── storage.py
├── requirements.txt
└── docker-compose.yml
Configuration
# config.py
import os
REDIS_URL = os.getenv("REDIS_URL", "redis://localhost:6379/0")
MONGO_URL = os.getenv("MONGO_URL", "mongodb://localhost:27017")
DATABASE_NAME = "scraper_db"
# Scraping settings
MAX_RETRIES = 3
REQUEST_TIMEOUT = 30
CONCURRENT_REQUESTS = 50
RATE_LIMIT_PER_DOMAIN = 2 # requests per second
# Proxy settings
PROXY_API_URL = os.getenv("PROXY_API_URL", "https://proxy-provider.com/api")
PROXY_API_KEY = os.getenv("PROXY_API_KEY", "")
PROXY_ROTATION_INTERVAL = 10 # requests before rotating
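The requirements.txt listed in the project layout isn't shown in this guide; one plausible, unpinned set of dependencies for the code that follows (pin exact versions for reproducible builds) would be:
# requirements.txt (illustrative, unpinned)
celery
redis
requests
beautifulsoup4
lxml
pymongo
flower
pybloom-live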
Scheduler
# scheduler.py
import redis
import json
import hashlib
from urllib.parse import urlparse
from config import REDIS_URL, RATE_LIMIT_PER_DOMAIN
class Scheduler:
def __init__(self):
self.redis = redis.from_url(REDIS_URL)
self.seen_key = "scraper:seen_urls"
self.queue_key = "scraper:url_queue"
self.failed_key = "scraper:failed_urls"
self.stats_key = "scraper:stats"
def add_url(self, url, priority=0, metadata=None):
"""Add URL to scraping queue if not already seen."""
url_hash = hashlib.md5(url.encode()).hexdigest()
# Check if already scraped
if self.redis.sismember(self.seen_key, url_hash):
return False
# Add to seen set
self.redis.sadd(self.seen_key, url_hash)
# Add to priority queue
task = {
"url": url,
"priority": priority,
"metadata": metadata or {},
"retries": 0
}
self.redis.zadd(self.queue_key, {json.dumps(task): priority})
return True
def add_urls_bulk(self, urls, priority=0):
"""Add multiple URLs efficiently."""
pipe = self.redis.pipeline()
added = 0
for url in urls:
url_hash = hashlib.md5(url.encode()).hexdigest()
if not self.redis.sismember(self.seen_key, url_hash):
pipe.sadd(self.seen_key, url_hash)
task = json.dumps({
"url": url, "priority": priority,
"metadata": {}, "retries": 0
})
pipe.zadd(self.queue_key, {task: priority})
added += 1
pipe.execute()
return added
def get_next_url(self):
"""Get highest-priority URL from queue."""
result = self.redis.zpopmax(self.queue_key)
if result:
task_json, score = result[0]
return json.loads(task_json)
return None
def retry_url(self, task, max_retries=3):
"""Re-queue a failed URL with incremented retry count."""
task["retries"] += 1
if task["retries"] < max_retries:
# Lower priority for retries
new_priority = task["priority"] - task["retries"]
self.redis.zadd(
self.queue_key,
{json.dumps(task): new_priority}
)
else:
self.redis.lpush(self.failed_key, json.dumps(task))
def get_stats(self):
"""Get scraping statistics."""
return {
"queued": self.redis.zcard(self.queue_key),
"seen": self.redis.scard(self.seen_key),
"failed": self.redis.llen(self.failed_key)
}
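Seeding the queue is then just a matter of instantiating the scheduler and adding start URLs; the URLs below are placeholders:
# seed.py (illustrative)
from scheduler import Scheduler

scheduler = Scheduler()
scheduler.add_url("https://example.com/", priority=10)
added = scheduler.add_urls_bulk([f"https://example.com/page/{i}" for i in range(100)])
print(f"Queued {added} new URLs; stats: {scheduler.get_stats()}")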
Celery Worker Tasks
# tasks.py
from celery import Celery
from proxy_manager import ProxyManager
from storage import DataStore
from config import REDIS_URL, MAX_RETRIES, REQUEST_TIMEOUT
import requests
from bs4 import BeautifulSoup
import logging
app = Celery("scraper", broker=REDIS_URL, backend=REDIS_URL)
app.conf.update(
worker_prefetch_multiplier=1,
task_acks_late=True,
task_reject_on_worker_lost=True,
worker_max_tasks_per_child=100,
)
proxy_manager = ProxyManager()
data_store = DataStore()
logger = logging.getLogger(__name__)
@app.task(bind=True, max_retries=MAX_RETRIES, default_retry_delay=60)
def scrape_url(self, url, metadata=None):
"""Scrape a single URL with proxy rotation and retry logic."""
proxy = proxy_manager.get_proxy()
try:
response = requests.get(
url,
proxies={"http": proxy, "https": proxy},
timeout=REQUEST_TIMEOUT,
headers={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36"
}
)
response.raise_for_status()
# Parse the page
soup = BeautifulSoup(response.text, "lxml")
data = extract_data(soup, url, metadata)
# Store results
data_store.save(data)
proxy_manager.report_success(proxy)
return {"status": "success", "url": url}
except requests.exceptions.ProxyError:
proxy_manager.report_failure(proxy)
raise self.retry(exc=Exception(f"Proxy failed: {proxy}"))
except requests.exceptions.HTTPError as e:
if response.status_code == 429:
raise self.retry(exc=e, countdown=120)
elif response.status_code == 403:
proxy_manager.report_blocked(proxy)
raise self.retry(exc=e)
raise
except Exception as e:
logger.error(f"Failed to scrape {url}: {e}")
raise self.retry(exc=e)
def extract_data(soup, url, metadata):
"""Extract structured data from parsed HTML."""
return {
"url": url,
"title": soup.title.string if soup.title else None,
"text": soup.get_text(separator=" ", strip=True)[:5000],
"links": [a.get("href") for a in soup.find_all("a", href=True)],
"metadata": metadata or {}
}
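The docker-compose file below runs python scheduler_runner.py, which the guide doesn't show. A minimal sketch of what that process might look like, draining the Redis priority queue and handing URLs to Celery (the file name comes from the compose command; the polling loop is an assumption):
# scheduler_runner.py (sketch)
import time

from scheduler import Scheduler
from tasks import scrape_url

def main(poll_interval=1.0):
    scheduler = Scheduler()
    while True:
        task = scheduler.get_next_url()
        if task is None:
            time.sleep(poll_interval)  # queue empty: wait before polling again
            continue
        # Hand the URL to Celery; any available worker picks it up from the broker
        scrape_url.delay(task["url"], metadata=task["metadata"])

if __name__ == "__main__":
    main()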
Proxy Manager
# proxy_manager.py
import redis
import requests
import random
from config import REDIS_URL, PROXY_API_URL, PROXY_API_KEY
class ProxyManager:
def __init__(self):
self.redis = redis.from_url(REDIS_URL)
self.proxy_key = "scraper:proxies"
self.blocked_key = "scraper:blocked_proxies"
def refresh_proxies(self):
"""Fetch fresh proxy list from provider API."""
response = requests.get(
PROXY_API_URL,
headers={"Authorization": f"Bearer {PROXY_API_KEY}"}
)
proxies = response.json().get("proxies", [])
pipe = self.redis.pipeline()
pipe.delete(self.proxy_key)
for proxy in proxies:
pipe.sadd(self.proxy_key, proxy)
pipe.execute()
return len(proxies)
def get_proxy(self):
"""Get a random proxy from the pool."""
proxy = self.redis.srandmember(self.proxy_key)
if proxy:
return proxy.decode()
# If pool is empty, refresh
self.refresh_proxies()
proxy = self.redis.srandmember(self.proxy_key)
return proxy.decode() if proxy else None
def report_success(self, proxy):
"""Mark a proxy as working."""
self.redis.hincrby("scraper:proxy_stats", f"{proxy}:success", 1)
def report_failure(self, proxy):
"""Record proxy failure, remove if too many failures."""
failures = self.redis.hincrby(
"scraper:proxy_stats", f"{proxy}:failures", 1
)
if failures > 5:
self.redis.srem(self.proxy_key, proxy)
self.redis.sadd(self.blocked_key, proxy)
def report_blocked(self, proxy):
"""Mark proxy as blocked and remove from pool."""
self.redis.srem(self.proxy_key, proxy)
self.redis.sadd(self.blocked_key, proxy)
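The worker tasks import DataStore from storage.py, which isn't shown above. A minimal MongoDB-backed sketch consistent with the config (the collection names are assumptions):
# storage.py (sketch)
from pymongo import MongoClient

from config import MONGO_URL, DATABASE_NAME

class DataStore:
    def __init__(self):
        self.client = MongoClient(MONGO_URL)
        self.db = self.client[DATABASE_NAME]

    def save(self, data):
        """Upsert a scraped page keyed by URL so re-scrapes overwrite."""
        self.db.pages.update_one({"url": data["url"]}, {"$set": data}, upsert=True)

    def save_failed(self, record):
        """Record a permanently failed task for later inspection."""
        self.db.failed.insert_one(record)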
Docker Compose Setup
# docker-compose.yml
version: "3.8"
services:
redis:
image: redis:7-alpine
ports:
- "6379:6379"
mongodb:
image: mongo:7
ports:
- "27017:27017"
volumes:
- mongo_data:/data/db
scheduler:
build: .
command: python scheduler_runner.py
environment:
- REDIS_URL=redis://redis:6379/0
- MONGO_URL=mongodb://mongodb:27017
depends_on:
- redis
- mongodb
worker:
build: .
command: celery -A tasks worker --concurrency=10 --loglevel=info
environment:
- REDIS_URL=redis://redis:6379/0
- MONGO_URL=mongodb://mongodb:27017
depends_on:
- redis
- mongodb
deploy:
replicas: 4
monitor:
build: .
command: celery -A tasks flower --port=5555
ports:
- "5555:5555"
depends_on:
- redis
volumes:
mongo_data:
Scaling Strategies
Horizontal Scaling
Add more worker containers:
docker-compose up --scale worker=10
Each worker handles N concurrent tasks, so 10 workers with 10 concurrency = 100 parallel scraping tasks.
Domain-Based Sharding
Assign specific domains to specific workers to maintain rate limits:
# Domain-aware task routing: send each URL to a per-domain queue
from urllib.parse import urlparse
from tasks import scrape_url

def enqueue_url(url, metadata=None):
    domain = urlparse(url).netloc
    queue_name = f"scraper.{domain.replace('.', '_')}"
    # Dispatch the existing scrape_url task to the domain-specific queue
    scrape_url.apply_async(args=[url, metadata], queue=queue_name)
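A worker dedicated to one domain's queue is then started with Celery's -Q flag (the queue name here is illustrative):
celery -A tasks worker -Q scraper.example_com --concurrency=5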
Geographic Distribution
Deploy workers in different regions to reduce latency and appear more natural:
- US workers scrape US sites
- EU workers scrape EU sites
- APAC workers scrape Asian sites
Use cloud providers with multi-region support (AWS, GCP, Azure) and assign location-appropriate residential proxies to each region.
Deduplication at Scale
URL deduplication becomes critical when crawling millions of pages. Redis sets work up to ~100M URLs. Beyond that, use probabilistic data structures:
# Bloom filter for memory-efficient deduplication
from pybloom_live import ScalableBloomFilter
class BloomDeduplicator:
def __init__(self):
self.bloom = ScalableBloomFilter(
initial_capacity=1000000,
error_rate=0.001
)
def is_seen(self, url):
if url in self.bloom:
return True
self.bloom.add(url)
return False
A Bloom filter with a 0.1% false positive rate uses roughly 1.8 MB per million URLs, versus ~64 MB for a Redis set. Note that an in-process Bloom filter isn't shared between workers, so it belongs in the scheduler process; for a shared filter, the RedisBloom module keeps the same structure inside Redis.
Handling Failures Gracefully
Circuit Breaker Pattern
Stop hammering a site that’s returning errors:
import time
class CircuitBreaker:
def __init__(self, failure_threshold=5, reset_timeout=300):
self.failure_count = {}
self.failure_threshold = failure_threshold
self.reset_timeout = reset_timeout
self.open_since = {}
def can_request(self, domain):
if domain not in self.failure_count:
return True
if self.failure_count[domain] >= self.failure_threshold:
if time.time() - self.open_since[domain] > self.reset_timeout:
self.failure_count[domain] = 0
return True
return False
return True
def record_failure(self, domain):
self.failure_count[domain] = self.failure_count.get(domain, 0) + 1
if self.failure_count[domain] >= self.failure_threshold:
self.open_since[domain] = time.time()
def record_success(self, domain):
self.failure_count[domain] = 0
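One way to wire the breaker into a fetch helper is sketched below; note that this in-memory version is per-process, so in a multi-worker deployment the counters would need to live in Redis to be shared:
import requests
from urllib.parse import urlparse

breaker = CircuitBreaker()

def guarded_fetch(url, **kwargs):
    """Fetch a URL only if its domain's circuit is closed."""
    domain = urlparse(url).netloc
    if not breaker.can_request(domain):
        raise RuntimeError(f"Circuit open for {domain}, skipping {url}")
    try:
        response = requests.get(url, **kwargs)
        response.raise_for_status()
        breaker.record_success(domain)
        return response
    except requests.exceptions.RequestException:
        breaker.record_failure(domain)
        raise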
Dead Letter Queue
URLs that fail after all retries go to a dead letter queue for manual inspection:
def handle_permanent_failure(task):
"""Store permanently failed tasks for investigation."""
data_store.save_failed({
"url": task["url"],
"retries": task["retries"],
"last_error": task.get("last_error"),
"timestamp": time.time()
})
Performance Optimization Tips
- Use connection pooling: Reuse HTTP connections with requests.Session() (see the sketch after this list)
- Parse selectively: Don’t parse the entire DOM if you only need specific elements
- Compress storage: Store raw HTML with gzip compression
- Batch database writes: Buffer results and write in batches of 100+
- Use lxml over html.parser: 10x faster HTML parsing
- Pre-filter URLs: Skip non-HTML resources (images, PDFs) early
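A sketch of the first and fourth tips applied together: a module-level Session reused by every task in a worker process, plus a simple write buffer (the batch size is an assumption):
import requests
from pymongo import MongoClient

from config import MONGO_URL, DATABASE_NAME

# One Session per worker process: TCP connections are pooled and reused
session = requests.Session()

class BufferedWriter:
    """Buffer scraped documents and flush them to MongoDB in batches."""
    def __init__(self, batch_size=100):
        self.db = MongoClient(MONGO_URL)[DATABASE_NAME]
        self.batch_size = batch_size
        self.buffer = []

    def add(self, doc):
        self.buffer.append(doc)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.db.pages.insert_many(self.buffer)
            self.buffer = []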
Monitoring Your Distributed Scraper
Track these key metrics:
- Pages/minute: Overall throughput
- Error rate: Percentage of failed requests
- Queue depth: How many URLs are waiting
- Proxy health: Success rate per proxy
- Latency: Average response time
Use Celery Flower (included in the Docker setup) for real-time worker monitoring, and export metrics to Prometheus/Grafana for long-term tracking.
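For a quick look at queue health without extra infrastructure, the Scheduler.get_stats() helper defined earlier can be polled on a timer (the interval is arbitrary):
import time
from scheduler import Scheduler

def log_stats(interval=60):
    scheduler = Scheduler()
    while True:
        stats = scheduler.get_stats()  # queued, seen, and failed counts from Redis
        print(f"queued={stats['queued']} seen={stats['seen']} failed={stats['failed']}")
        time.sleep(interval)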
FAQ
How many workers do I need?
Start with 4-5 workers, each handling 10-20 concurrent requests. Scale based on your throughput needs and target site tolerance. For most projects, 10-20 workers handle millions of pages per day.
Should I use Celery or build my own worker system?
Use Celery for most projects — it handles task routing, retries, monitoring, and scaling out of the box. Build custom only if you need sub-millisecond latency or have unusual routing requirements.
How do I avoid getting blocked at scale?
Combine rotating residential proxies, realistic request headers, randomized delays, and domain-based rate limiting. See our anti-bot detection guide for detailed strategies.
What’s the best message queue for distributed scraping?
Redis for simplicity and speed (most projects). RabbitMQ if you need complex routing rules. Kafka if you’re processing 10M+ URLs and need replay capability.
How do I handle JavaScript-rendered pages in distributed scraping?
Use dedicated browser worker pools running headless Chrome/Playwright. These are resource-intensive, so separate them from your lightweight HTTP workers. See our JavaScript scraping guide for details.
Conclusion
Distributed web scraping transforms scraping from a fragile single-machine process into a resilient, scalable data collection system. The architecture — scheduler, message queue, workers, storage, and proxy pool — provides the foundation for scraping at any scale.
Start with the Redis + Celery setup shown here, add proxy rotation for IP management, and scale workers horizontally as your needs grow. The investment in distributed architecture pays off the moment your scraping targets exceed what a single machine can handle.