Distributed Web Scraping: Architecture Guide
Scraping millions of pages from a single machine is a recipe for failure. Between IP bans, memory limits, and bandwidth bottlenecks, single-node scrapers hit a wall fast. Distributed web scraping spreads the workload across multiple machines, enabling you to crawl at scale while maintaining reliability and speed.
This guide covers the architecture patterns, message queues, worker coordination, and proxy integration strategies you need to build a production-grade distributed scraping system.
Why Distributed Scraping?
Single-machine scrapers face hard limits:
- IP blocking: One IP making thousands of requests gets flagged instantly
- Memory constraints: Storing millions of URLs in memory crashes your process
- Bandwidth limits: A single connection can only download so fast
- No fault tolerance: If the machine goes down, you lose everything
- CPU bottleneck: Parsing HTML is CPU-intensive at scale
Distributed scraping solves all of these by splitting work across multiple nodes, each with its own IP, memory, and processing power.
Core Architecture Components
A distributed scraping system has five key components:
┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│  Scheduler  │────▶│ Message Queue│────▶│   Workers   │
│ (URL Mgmt)  │     │ (Redis/RMQ)  │     │ (Scrapers)  │
└─────────────┘     └──────────────┘     └──────┬──────┘
                                                │
                    ┌──────────────┐            │
                    │   Storage    │◀───────────┤
                    │ (DB/S3/etc)  │            │
                    └──────────────┘            │
                                         ┌──────┴──────┐
                                         │ Proxy Pool  │
                                         │ (Rotating)  │
                                         └─────────────┘
1. Scheduler (URL Manager)
The scheduler manages which URLs need to be scraped and distributes them to workers. It handles:
- URL deduplication (avoiding re-scraping the same page)
- Priority queuing (important pages first)
- Retry logic for failed requests
- Rate limiting per domain (see the sketch below)
- Politeness policies (respecting robots.txt; also sketched below)
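The scheduler implementation later in this guide covers deduplication, priorities, and retries, but not the last two items. Here is a minimal sketch of both, assuming a shared Redis counter for rate limiting; the module name and key format are illustrative and not part of the project shown later:
# politeness.py (illustrative module, not in the project layout below)
import time
import urllib.robotparser
from urllib.parse import urlparse

import redis

class PolitenessPolicy:
    def __init__(self, redis_url, rate_limit_per_domain=2):
        self.redis = redis.from_url(redis_url)
        self.rate_limit = rate_limit_per_domain  # requests per second per domain
        self.robots_cache = {}

    def allowed_by_robots(self, url, user_agent="*"):
        """Check robots.txt, caching one parser per domain."""
        domain = urlparse(url).netloc
        if domain not in self.robots_cache:
            parser = urllib.robotparser.RobotFileParser()
            parser.set_url(f"https://{domain}/robots.txt")
            try:
                parser.read()
            except Exception:
                parser = None  # robots.txt unreachable: treat as allowed
            self.robots_cache[domain] = parser
        parser = self.robots_cache[domain]
        return True if parser is None else parser.can_fetch(user_agent, url)

    def within_rate_limit(self, url):
        """Allow at most rate_limit requests per domain per second,
        counted in Redis so the limit is shared across all workers."""
        domain = urlparse(url).netloc
        key = f"scraper:rate:{domain}:{int(time.time())}"
        count = self.redis.incr(key)
        self.redis.expire(key, 2)  # per-second counters can expire quickly
        return count <= self.rate_limit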
2. Message Queue
The message queue decouples the scheduler from workers. Popular choices:
| Queue | Best For | Throughput |
|---|---|---|
| Redis | Simple setups, fast | ~100K msg/sec |
| RabbitMQ | Complex routing, reliability | ~50K msg/sec |
| Apache Kafka | Massive scale, event streaming | ~1M msg/sec |
| Amazon SQS | AWS-native, managed | ~3K msg/sec |
3. Workers (Scrapers)
Workers pull URLs from the queue, fetch pages, parse data, and store results. Each worker:
- Runs independently
- Uses its own proxy connection
- Reports success/failure back to the scheduler
- Can be scaled horizontally
4. Storage Layer
Scraped data needs persistent storage:
- PostgreSQL/MySQL: Structured data with relationships
- MongoDB: Semi-structured or varying schemas
- Amazon S3: Raw HTML pages, screenshots
- Elasticsearch: Full-text search across scraped content
5. Proxy Pool
Each worker routes requests through rotating proxies to avoid detection. A proxy manager assigns proxies to workers and rotates them based on usage.
Implementation with Python and Redis
Here’s a production-ready distributed scraping system using Python, Redis, and Celery:
Project Structure
distributed-scraper/
├── config.py
├── scheduler.py
├── worker.py
├── tasks.py
├── proxy_manager.py
├── storage.py
├── requirements.txt
└── docker-compose.yml
Configuration
# config.py
import os
REDIS_URL = os.getenv("REDIS_URL", "redis://localhost:6379/0")
MONGO_URL = os.getenv("MONGO_URL", "mongodb://localhost:27017")
DATABASE_NAME = "scraper_db"
# Scraping settings
MAX_RETRIES = 3
REQUEST_TIMEOUT = 30
CONCURRENT_REQUESTS = 50
RATE_LIMIT_PER_DOMAIN = 2 # requests per second
# Proxy settings
PROXY_API_URL = os.getenv("PROXY_API_URL", "https://proxy-provider.com/api")
PROXY_API_KEY = os.getenv("PROXY_API_KEY", "")
PROXY_ROTATION_INTERVAL = 10 # requests before rotating
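The requirements.txt listed in the project layout isn't shown in this guide; one plausible, unpinned set of dependencies for the code that follows (pin exact versions for reproducible builds) would be:
# requirements.txt (illustrative, unpinned)
celery
redis
requests
beautifulsoup4
lxml
pymongo
flower
pybloom-live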
Scheduler
# scheduler.py
import redis
import json
import hashlib
from urllib.parse import urlparse
from config import REDIS_URL, RATE_LIMIT_PER_DOMAIN
class Scheduler:
def __init__(self):
self.redis = redis.from_url(REDIS_URL)
self.seen_key = "scraper:seen_urls"
self.queue_key = "scraper:url_queue"
self.failed_key = "scraper:failed_urls"
self.stats_key = "scraper:stats"
def add_url(self, url, priority=0, metadata=None):
"""Add URL to scraping queue if not already seen."""
url_hash = hashlib.md5(url.encode()).hexdigest()
# Check if already scraped
if self.redis.sismember(self.seen_key, url_hash):
return False
# Add to seen set
self.redis.sadd(self.seen_key, url_hash)
# Add to priority queue
task = {
"url": url,
"priority": priority,
"metadata": metadata or {},
"retries": 0
}
self.redis.zadd(self.queue_key, {json.dumps(task): priority})
return True
def add_urls_bulk(self, urls, priority=0):
"""Add multiple URLs efficiently."""
pipe = self.redis.pipeline()
added = 0
for url in urls:
url_hash = hashlib.md5(url.encode()).hexdigest()
if not self.redis.sismember(self.seen_key, url_hash):
pipe.sadd(self.seen_key, url_hash)
task = json.dumps({
"url": url, "priority": priority,
"metadata": {}, "retries": 0
})
pipe.zadd(self.queue_key, {task: priority})
added += 1
pipe.execute()
return added
def get_next_url(self):
"""Get highest-priority URL from queue."""
result = self.redis.zpopmax(self.queue_key)
if result:
task_json, score = result[0]
return json.loads(task_json)
return None
def retry_url(self, task, max_retries=3):
"""Re-queue a failed URL with incremented retry count."""
task["retries"] += 1
if task["retries"] < max_retries:
# Lower priority for retries
new_priority = task["priority"] - task["retries"]
self.redis.zadd(
self.queue_key,
{json.dumps(task): new_priority}
)
else:
self.redis.lpush(self.failed_key, json.dumps(task))
def get_stats(self):
"""Get scraping statistics."""
return {
"queued": self.redis.zcard(self.queue_key),
"seen": self.redis.scard(self.seen_key),
"failed": self.redis.llen(self.failed_key)
}
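Seeding the queue is then just a matter of instantiating the scheduler and adding start URLs; the URLs below are placeholders:
# seed.py (illustrative)
from scheduler import Scheduler

scheduler = Scheduler()
scheduler.add_url("https://example.com/", priority=10)
added = scheduler.add_urls_bulk([f"https://example.com/page/{i}" for i in range(100)])
print(f"Queued {added} new URLs; stats: {scheduler.get_stats()}")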
Celery Worker Tasks
# tasks.py
from celery import Celery
from proxy_manager import ProxyManager
from storage import DataStore
from config import REDIS_URL, MAX_RETRIES, REQUEST_TIMEOUT
import requests
from bs4 import BeautifulSoup
import logging
app = Celery("scraper", broker=REDIS_URL, backend=REDIS_URL)
app.conf.update(
worker_prefetch_multiplier=1,
task_acks_late=True,
task_reject_on_worker_lost=True,
worker_max_tasks_per_child=100,
)
proxy_manager = ProxyManager()
data_store = DataStore()
logger = logging.getLogger(__name__)
@app.task(bind=True, max_retries=MAX_RETRIES, default_retry_delay=60)
def scrape_url(self, url, metadata=None):
"""Scrape a single URL with proxy rotation and retry logic."""
proxy = proxy_manager.get_proxy()
try:
response = requests.get(
url,
proxies={"http": proxy, "https": proxy},
timeout=REQUEST_TIMEOUT,
headers={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36"
}
)
response.raise_for_status()
# Parse the page
soup = BeautifulSoup(response.text, "lxml")
data = extract_data(soup, url, metadata)
# Store results
data_store.save(data)
proxy_manager.report_success(proxy)
return {"status": "success", "url": url}
except requests.exceptions.ProxyError:
proxy_manager.report_failure(proxy)
raise self.retry(exc=Exception(f"Proxy failed: {proxy}"))
except requests.exceptions.HTTPError as e:
if response.status_code == 429:
raise self.retry(exc=e, countdown=120)
elif response.status_code == 403:
proxy_manager.report_blocked(proxy)
raise self.retry(exc=e)
raise
except Exception as e:
logger.error(f"Failed to scrape {url}: {e}")
raise self.retry(exc=e)
def extract_data(soup, url, metadata):
"""Extract structured data from parsed HTML."""
return {
"url": url,
"title": soup.title.string if soup.title else None,
"text": soup.get_text(separator=" ", strip=True)[:5000],
"links": [a.get("href") for a in soup.find_all("a", href=True)],
"metadata": metadata or {}
}
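The docker-compose file below runs python scheduler_runner.py, which the guide doesn't show. A minimal sketch of what that process might look like, draining the Redis priority queue and handing URLs to Celery (the file name comes from the compose command; the polling loop is an assumption):
# scheduler_runner.py (sketch)
import time

from scheduler import Scheduler
from tasks import scrape_url

def main(poll_interval=1.0):
    scheduler = Scheduler()
    while True:
        task = scheduler.get_next_url()
        if task is None:
            time.sleep(poll_interval)  # queue empty: wait before polling again
            continue
        # Hand the URL to Celery; any available worker picks it up from the broker
        scrape_url.delay(task["url"], metadata=task["metadata"])

if __name__ == "__main__":
    main()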
Proxy Manager
# proxy_manager.py
import redis
import requests
import random
from config import REDIS_URL, PROXY_API_URL, PROXY_API_KEY
class ProxyManager:
def __init__(self):
self.redis = redis.from_url(REDIS_URL)
self.proxy_key = "scraper:proxies"
self.blocked_key = "scraper:blocked_proxies"
def refresh_proxies(self):
"""Fetch fresh proxy list from provider API."""
response = requests.get(
PROXY_API_URL,
headers={"Authorization": f"Bearer {PROXY_API_KEY}"}
)
proxies = response.json().get("proxies", [])
pipe = self.redis.pipeline()
pipe.delete(self.proxy_key)
for proxy in proxies:
pipe.sadd(self.proxy_key, proxy)
pipe.execute()
return len(proxies)
def get_proxy(self):
"""Get a random proxy from the pool."""
proxy = self.redis.srandmember(self.proxy_key)
if proxy:
return proxy.decode()
# If pool is empty, refresh
self.refresh_proxies()
proxy = self.redis.srandmember(self.proxy_key)
return proxy.decode() if proxy else None
def report_success(self, proxy):
"""Mark a proxy as working."""
self.redis.hincrby("scraper:proxy_stats", f"{proxy}:success", 1)
def report_failure(self, proxy):
"""Record proxy failure, remove if too many failures."""
failures = self.redis.hincrby(
"scraper:proxy_stats", f"{proxy}:failures", 1
)
if failures > 5:
self.redis.srem(self.proxy_key, proxy)
self.redis.sadd(self.blocked_key, proxy)
def report_blocked(self, proxy):
"""Mark proxy as blocked and remove from pool."""
self.redis.srem(self.proxy_key, proxy)
self.redis.sadd(self.blocked_key, proxy)
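The worker tasks import DataStore from storage.py, which isn't shown above. A minimal MongoDB-backed sketch consistent with the config (the collection names are assumptions):
# storage.py (sketch)
from pymongo import MongoClient

from config import MONGO_URL, DATABASE_NAME

class DataStore:
    def __init__(self):
        self.client = MongoClient(MONGO_URL)
        self.db = self.client[DATABASE_NAME]

    def save(self, data):
        """Upsert a scraped page keyed by URL so re-scrapes overwrite."""
        self.db.pages.update_one({"url": data["url"]}, {"$set": data}, upsert=True)

    def save_failed(self, record):
        """Record a permanently failed task for later inspection."""
        self.db.failed.insert_one(record)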
Docker Compose Setup
# docker-compose.yml
version: "3.8"
services:
redis:
image: redis:7-alpine
ports:
- "6379:6379"
mongodb:
image: mongo:7
ports:
- "27017:27017"
volumes:
- mongo_data:/data/db
scheduler:
build: .
command: python scheduler_runner.py
environment:
- REDIS_URL=redis://redis:6379/0
- MONGO_URL=mongodb://mongodb:27017
depends_on:
- redis
- mongodb
worker:
build: .
command: celery -A tasks worker --concurrency=10 --loglevel=info
environment:
- REDIS_URL=redis://redis:6379/0
- MONGO_URL=mongodb://mongodb:27017
depends_on:
- redis
- mongodb
deploy:
replicas: 4
monitor:
build: .
command: celery -A tasks flower --port=5555
ports:
- "5555:5555"
depends_on:
- redis
volumes:
mongo_data:
Scaling Strategies
Horizontal Scaling
Add more worker containers:
docker-compose up --scale worker=10
Each worker handles N concurrent tasks, so 10 workers with 10 concurrency = 100 parallel scraping tasks.
Domain-Based Sharding
Assign specific domains to specific workers to maintain rate limits:
# Domain-aware task routing: send each URL to a per-domain queue
from urllib.parse import urlparse
from tasks import scrape_url

def enqueue_url(url, metadata=None):
    domain = urlparse(url).netloc
    queue_name = f"scraper.{domain.replace('.', '_')}"
    # Dispatch the existing scrape_url task to the domain-specific queue
    scrape_url.apply_async(args=[url, metadata], queue=queue_name)
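A worker dedicated to one domain's queue is then started with Celery's -Q flag (the queue name here is illustrative):
celery -A tasks worker -Q scraper.example_com --concurrency=5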
Geographic Distribution
Deploy workers in different regions to reduce latency and appear more natural:
- US workers scrape US sites
- EU workers scrape EU sites
- APAC workers scrape Asian sites
Use cloud providers with multi-region support (AWS, GCP, Azure) and assign location-appropriate residential proxies to each region.
Deduplication at Scale
URL deduplication becomes critical when crawling millions of pages. Redis sets work up to ~100M URLs. Beyond that, use probabilistic data structures:
# Bloom filter for memory-efficient deduplication
from pybloom_live import ScalableBloomFilter
class BloomDeduplicator:
def __init__(self):
self.bloom = ScalableBloomFilter(
initial_capacity=1000000,
error_rate=0.001
)
def is_seen(self, url):
if url in self.bloom:
return True
self.bloom.add(url)
return False
A Bloom filter with a 0.1% false positive rate uses roughly 1.8 MB per million URLs, versus ~64 MB for a Redis set. Note that an in-process Bloom filter isn't shared between workers, so it belongs in the scheduler process; for a shared filter, the RedisBloom module keeps the same structure inside Redis.
Handling Failures Gracefully
Circuit Breaker Pattern
Stop hammering a site that’s returning errors:
import time
class CircuitBreaker:
def __init__(self, failure_threshold=5, reset_timeout=300):
self.failure_count = {}
self.failure_threshold = failure_threshold
self.reset_timeout = reset_timeout
self.open_since = {}
def can_request(self, domain):
if domain not in self.failure_count:
return True
if self.failure_count[domain] >= self.failure_threshold:
if time.time() - self.open_since[domain] > self.reset_timeout:
self.failure_count[domain] = 0
return True
return False
return True
def record_failure(self, domain):
self.failure_count[domain] = self.failure_count.get(domain, 0) + 1
if self.failure_count[domain] >= self.failure_threshold:
self.open_since[domain] = time.time()
def record_success(self, domain):
self.failure_count[domain] = 0
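One way to wire the breaker into a fetch helper is sketched below; note that this in-memory version is per-process, so in a multi-worker deployment the counters would need to live in Redis to be shared:
import requests
from urllib.parse import urlparse

breaker = CircuitBreaker()

def guarded_fetch(url, **kwargs):
    """Fetch a URL only if its domain's circuit is closed."""
    domain = urlparse(url).netloc
    if not breaker.can_request(domain):
        raise RuntimeError(f"Circuit open for {domain}, skipping {url}")
    try:
        response = requests.get(url, **kwargs)
        response.raise_for_status()
        breaker.record_success(domain)
        return response
    except requests.exceptions.RequestException:
        breaker.record_failure(domain)
        raise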
Dead Letter Queue
URLs that fail after all retries go to a dead letter queue for manual inspection:
def handle_permanent_failure(task):
"""Store permanently failed tasks for investigation."""
data_store.save_failed({
"url": task["url"],
"retries": task["retries"],
"last_error": task.get("last_error"),
"timestamp": time.time()
})
Performance Optimization Tips
- Use connection pooling: Reuse HTTP connections with requests.Session() (see the sketch after this list)
- Parse selectively: Don’t parse the entire DOM if you only need specific elements
- Compress storage: Store raw HTML with gzip compression
- Batch database writes: Buffer results and write in batches of 100+
- Use lxml over html.parser: 10x faster HTML parsing
- Pre-filter URLs: Skip non-HTML resources (images, PDFs) early
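A sketch of the first and fourth tips applied together: a module-level Session reused by every task in a worker process, plus a simple write buffer (the batch size is an assumption):
import requests
from pymongo import MongoClient

from config import MONGO_URL, DATABASE_NAME

# One Session per worker process: TCP connections are pooled and reused
session = requests.Session()

class BufferedWriter:
    """Buffer scraped documents and flush them to MongoDB in batches."""
    def __init__(self, batch_size=100):
        self.db = MongoClient(MONGO_URL)[DATABASE_NAME]
        self.batch_size = batch_size
        self.buffer = []

    def add(self, doc):
        self.buffer.append(doc)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.db.pages.insert_many(self.buffer)
            self.buffer = []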
Monitoring Your Distributed Scraper
Track these key metrics:
- Pages/minute: Overall throughput
- Error rate: Percentage of failed requests
- Queue depth: How many URLs are waiting
- Proxy health: Success rate per proxy
- Latency: Average response time
Use Celery Flower (included in the Docker setup) for real-time worker monitoring, and export metrics to Prometheus/Grafana for long-term tracking.
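For a quick look at queue health without extra infrastructure, the Scheduler.get_stats() helper defined earlier can be polled on a timer (the interval is arbitrary):
import time
from scheduler import Scheduler

def log_stats(interval=60):
    scheduler = Scheduler()
    while True:
        stats = scheduler.get_stats()  # queued, seen, and failed counts from Redis
        print(f"queued={stats['queued']} seen={stats['seen']} failed={stats['failed']}")
        time.sleep(interval)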
FAQ
How many workers do I need?
Start with 4-5 workers, each handling 10-20 concurrent requests. Scale based on your throughput needs and target site tolerance. For most projects, 10-20 workers handle millions of pages per day.
Should I use Celery or build my own worker system?
Use Celery for most projects — it handles task routing, retries, monitoring, and scaling out of the box. Build custom only if you need sub-millisecond latency or have unusual routing requirements.
How do I avoid getting blocked at scale?
Combine rotating residential proxies, realistic request headers, randomized delays, and domain-based rate limiting. See our anti-bot detection guide for detailed strategies.
What’s the best message queue for distributed scraping?
Redis for simplicity and speed (most projects). RabbitMQ if you need complex routing rules. Kafka if you’re processing 10M+ URLs and need replay capability.
How do I handle JavaScript-rendered pages in distributed scraping?
Use dedicated browser worker pools running headless Chrome/Playwright. These are resource-intensive, so separate them from your lightweight HTTP workers. See our JavaScript scraping guide for details.
Conclusion
Distributed web scraping transforms scraping from a fragile single-machine process into a resilient, scalable data collection system. The architecture — scheduler, message queue, workers, storage, and proxy pool — provides the foundation for scraping at any scale.
Start with the Redis + Celery setup shown here, add proxy rotation for IP management, and scale workers horizontally as your needs grow. The investment in distributed architecture pays off the moment your scraping targets exceed what a single machine can handle.