Web Scraping Architecture: Design Patterns & Best Practices
A simple requests.get() loop works for scraping 100 pages. Scraping 10 million pages requires proper architecture — job queues, worker pools, proxy management, data pipelines, error handling, and monitoring. This guide covers proven architecture patterns for production scraping systems.
Architecture Patterns
Pattern 1: Simple Pipeline
URL List → Fetcher → Parser → Storage

Good for: Small jobs (<10K pages), single machine, one-time scraping.
import asyncio
import json

import httpx
from bs4 import BeautifulSoup

async def simple_pipeline(urls, proxy, output_file):
    semaphore = asyncio.Semaphore(10)  # Cap in-flight requests at 10

    async with httpx.AsyncClient(proxy=proxy, timeout=30) as client:

        async def fetch_and_parse(url):
            async with semaphore:
                response = await client.get(url)
                soup = BeautifulSoup(response.text, 'lxml')
                return {
                    'url': url,
                    'title': soup.title.string if soup.title else '',
                    'content': soup.get_text()[:500],
                }

        tasks = [fetch_and_parse(url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)

    # Keep only successful results; failures come back as exception objects
    with open(output_file, 'w') as f:
        json.dump([r for r in results if not isinstance(r, Exception)], f)
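Invoking it is a single asyncio.run() call; the URL list and proxy address below are placeholders:

urls = [f'https://example.com/page/{i}' for i in range(100)]
asyncio.run(simple_pipeline(urls, proxy='http://user:pass@proxy.example:8080',
                            output_file='results.json'))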
Pattern 2: Queue-Based Architecture

               ┌──────────┐
URL Source ──→ │  Queue   │
               │ (Redis)  │
               └────┬─────┘
             ┌──────┴──────┐
       ┌─────┤   Workers   ├─────┐
       │     └─────────────┘     │
  ┌────┴───┐                ┌────┴───┐
  │Worker 1│      ...       │Worker N│
  └────┬───┘                └────┬───┘
       │     ┌─────────────┐     │
       └─────┤   Results   ├─────┘
             │    Queue    │
             └──────┬──────┘
             ┌──────┴──────┐
             │   Storage   │
             │   Pipeline  │
             └─────────────┘

Good for: Medium-large jobs (10K-10M pages), distributed workers.
import asyncio
import json

import httpx
import redis
from bs4 import BeautifulSoup

class QueueBasedScraper:
    def __init__(self, redis_url='redis://localhost:6379'):
        self.redis = redis.from_url(redis_url)
        self.url_queue = 'scraper:urls'
        self.result_queue = 'scraper:results'
        self.seen_set = 'scraper:seen'

    def enqueue_urls(self, urls):
        # Deduplicate against the seen set before queueing
        pipe = self.redis.pipeline()
        for url in urls:
            if not self.redis.sismember(self.seen_set, url):
                pipe.rpush(self.url_queue, url)
                pipe.sadd(self.seen_set, url)
        pipe.execute()

    async def worker(self, worker_id, proxy):
        # Note: these redis-py calls block the event loop; switch to
        # redis.asyncio if you run many workers in one process.
        async with httpx.AsyncClient(proxy=proxy, timeout=30) as client:
            while True:
                url = self.redis.blpop(self.url_queue, timeout=5)
                if not url:
                    continue  # Queue empty; keep polling
                url = url[1].decode()
                try:
                    response = await client.get(url)
                    result = self.parse(response)
                    self.redis.rpush(self.result_queue, json.dumps(result))
                except Exception as e:
                    # Park failures in a dead-letter queue for inspection
                    self.redis.rpush('scraper:errors', json.dumps({
                        'url': url, 'error': str(e)
                    }))

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        return {
            'url': str(response.url),
            'title': soup.title.string if soup.title else '',
            'status': response.status_code,
        }
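A sketch of a runner that seeds the queue and starts a few workers in one process; the seed URLs and proxy address are placeholders, and the workers loop until the process is stopped:

async def main():
    scraper = QueueBasedScraper()
    scraper.enqueue_urls([f'https://example.com/page/{i}' for i in range(1000)])
    # Workers never return; kill the process (or add a sentinel) to stop
    await asyncio.gather(*(
        scraper.worker(i, proxy='http://user:pass@proxy.example:8080')
        for i in range(4)
    ))

asyncio.run(main())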
Pattern 3: Microservice Architecture

┌────────────┐   ┌────────────┐   ┌────────────┐   ┌────────────┐
│ Scheduler  │──→│  Fetcher   │──→│   Parser   │──→│  Storage   │
│  Service   │   │  Service   │   │  Service   │   │  Service   │
└──────┬─────┘   └──────┬─────┘   └──────┬─────┘   └──────┬─────┘
       │                │                │                │
       └────────────────┼────────────────┼────────────────┘
                  ┌─────┴────┐      ┌────┴─────┐
                  │  Proxy   │      │ Monitor  │
                  │ Manager  │      │ Service  │
                  └──────────┘      └──────────┘

Good for: Enterprise scale (10M+ pages), team operations, multi-tenant.
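The services communicate over queues rather than direct calls, so each one can scale and deploy independently. As a minimal sketch, the Fetcher service's core loop could look like the following, assuming Redis lists as the transport; the queue names and payload shape are illustrative, not a fixed contract:

import json

import httpx
import redis

r = redis.from_url('redis://localhost:6379')

def run_fetcher_service(proxy=None):
    """Consume URLs from the fetch inbox; hand raw HTML to the parser."""
    with httpx.Client(proxy=proxy, timeout=30) as client:
        while True:
            job = r.blpop('fetch:inbox', timeout=5)
            if not job:
                continue  # Inbox empty; keep polling
            url = job[1].decode()
            response = client.get(url)
            # The Parser service owns extraction, so ship raw HTML downstream
            r.rpush('parse:inbox', json.dumps({
                'url': url,
                'status': response.status_code,
                'html': response.text,
            }))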
Data Pipeline Design
import asyncio
import re

class ScrapingPipeline:
    """Pluggable data processing pipeline."""

    def __init__(self):
        self.stages = []

    def add_stage(self, stage_func):
        self.stages.append(stage_func)
        return self  # Allow chaining: pipeline.add_stage(a).add_stage(b)

    async def process(self, item):
        for stage in self.stages:
            if asyncio.iscoroutinefunction(stage):
                item = await stage(item)
            else:
                item = stage(item)
            if item is None:
                return None  # Item filtered out by this stage
        return item

# Define pipeline stages
def clean_text(item):
    for key in ['title', 'description']:
        if key in item:
            item[key] = re.sub(r'\s+', ' ', item[key]).strip()
    return item

def validate_price(item):
    if 'price' in item:
        try:
            item['price'] = float(re.sub(r'[^\d.]', '', item['price']))
        except ValueError:
            item['price'] = None
    return item

def filter_incomplete(item):
    required = ['title', 'price', 'url']
    if all(item.get(f) for f in required):
        return item
    return None  # Skip incomplete items

# Build pipeline
pipeline = ScrapingPipeline()
pipeline.add_stage(clean_text)
pipeline.add_stage(validate_price)
pipeline.add_stage(filter_incomplete)
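Processing a scraped item then looks like this (field values invented for illustration):

item = {'title': '  Blue   Widget ', 'price': '$19.99', 'url': 'https://example.com/p/1'}
result = asyncio.run(pipeline.process(item))
# -> {'title': 'Blue Widget', 'price': 19.99, 'url': 'https://example.com/p/1'}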
Error Handling & Retries

import asyncio

import httpx

class ResilientScraper:
    def __init__(self, max_retries=3):
        self.max_retries = max_retries
        self.error_counts = {}  # Per-URL failure tally, useful for monitoring

    async def fetch(self, url, client):
        for attempt in range(self.max_retries):
            try:
                response = await client.get(url, timeout=30)
                if response.status_code == 429:
                    # Rate limited: honor Retry-After if the server sends it
                    wait = int(response.headers.get('retry-after', 30))
                    await asyncio.sleep(wait)
                    continue
                if response.status_code >= 500:
                    # Server error: back off exponentially before retrying
                    await asyncio.sleep(2 ** attempt)
                    continue
                return response
            except (httpx.TimeoutException, httpx.ConnectError):
                self.error_counts[url] = self.error_counts.get(url, 0) + 1
                await asyncio.sleep(2 ** attempt)
        return None  # All retries exhausted
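Called from any async context, with a plain client (no proxy shown here):

async def fetch_one(url):
    scraper = ResilientScraper(max_retries=3)
    async with httpx.AsyncClient() as client:
        response = await scraper.fetch(url, client)
        if response is None:
            print(f'Gave up on {url} after {scraper.error_counts.get(url, 0)} network errors')
        return response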
Internal Links
- Distributed Web Scraping — scale across multiple machines
- Proxy Load Balancing — manage proxy pools in your architecture
- Building a Web Scraping Queue with Redis — implement queue-based scraping
- Monitoring Web Scrapers — add observability
- Docker for Web Scraping — containerize your architecture
FAQ
When should I move from simple scripts to proper architecture?
When you consistently scrape more than 10,000 pages, need to run on a schedule, require reliability guarantees, or work in a team. The investment in architecture pays off quickly through reduced debugging time and better data quality.
Which message queue should I use?
Redis is simplest for small-medium scale. RabbitMQ offers more features (routing, acknowledgments). Apache Kafka handles massive throughput but adds complexity. Start with Redis and upgrade only when you outgrow it.
How many concurrent workers should I run?
Start with 10-20 concurrent workers and monitor error rates and response times. Increase until you see degradation. Residential proxies typically support 10-50 concurrent connections per IP. Datacenter proxies can handle 100+.
Should I use Scrapy or build custom?
Scrapy is excellent for structured spider-based scraping with built-in middleware, pipelines, and concurrency. Build custom when you need specific queue integration, microservice architecture, or non-standard scraping patterns. Many teams use Scrapy for crawling and custom code for processing.
How do I handle schema changes on target websites?
Use CSS/XPath selectors with fallback chains. Monitor extraction accuracy with assertions. Set up alerts when expected fields are empty. Version your parsers and test against saved HTML samples before deploying changes.
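As a minimal sketch of a fallback chain, where the selector strings and sample HTML are invented for illustration:

from bs4 import BeautifulSoup

def extract_first(soup, selectors, default=None):
    """Try CSS selectors in order; return the first non-empty match."""
    for css in selectors:
        node = soup.select_one(css)
        if node and node.get_text(strip=True):
            return node.get_text(strip=True)
    return default

html = '<h1 class="product-title">Blue Widget</h1>'  # Stand-in for a fetched page
soup = BeautifulSoup(html, 'lxml')
title = extract_first(soup, [
    'h1.product-title',      # current layout
    'h1[itemprop="name"]',   # previous layout, kept as a fallback
    'h1',                    # last resort
], default='')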
Related Reading
- AJAX Request Interception: Scraping API Calls Directly
- Azure Functions for Serverless Web Scraping: the Complete Guide
- Build an Anti-Detection Test Suite: Verify Browser Stealth
- Build a News Crawler in Python: Step-by-Step Tutorial
- How to Configure Proxies on iPhone and Android
- How to Use Proxies in Node.js (Axios, Fetch, Puppeteer)