Web Scraping Architecture: Design Patterns & Best Practices

A simple requests.get() loop works for scraping 100 pages. Scraping 10 million pages requires proper architecture — job queues, worker pools, proxy management, data pipelines, error handling, and monitoring. This guide covers proven architecture patterns for production scraping systems.

Architecture Patterns

Pattern 1: Simple Pipeline

URL List → Fetcher → Parser → Storage

Good for: Small jobs (<10K pages), single machine, one-time scraping.

import asyncio
import httpx
from bs4 import BeautifulSoup
import json

async def simple_pipeline(urls, proxy, output_file):
    results = []
    
    async with httpx.AsyncClient(proxy=proxy, timeout=30) as client:
        semaphore = asyncio.Semaphore(10)
        
        async def fetch_and_parse(url):
            async with semaphore:
                response = await client.get(url)
                soup = BeautifulSoup(response.text, 'lxml')
                return {
                    'url': url,
                    'title': soup.title.string if soup.title else '',
                    'content': soup.get_text()[:500],
                }
        
        tasks = [fetch_and_parse(url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)
    
    with open(output_file, 'w') as f:
        json.dump([r for r in results if not isinstance(r, Exception)], f)
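
Running the pipeline is a single asyncio.run() call (a usage sketch; the URL list and proxy string are placeholders):

# Hypothetical invocation: substitute your own URLs and proxy credentials.
urls = [f'https://example.com/page/{i}' for i in range(1, 101)]
asyncio.run(simple_pipeline(urls, 'http://user:pass@proxy.example.com:8080', 'results.json'))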

Pattern 2: Queue-Based Architecture

                    ┌──────────┐
URL Source ──────→  │  Queue   │
                    │ (Redis)  │
                    └─────┬────┘
                    ┌─────┴─────┐
              ┌─────┤  Workers  ├─────┐
              │     └───────────┘     │
         ┌────┴───┐              ┌───┴────┐
         │Worker 1│              │Worker N│
         └────┬───┘              └───┬────┘
              │     ┌───────────┐    │
              └─────┤  Results  ├────┘
                    │  Queue    │
                    └─────┬────┘
                    ┌─────┴────┐
                    │ Storage  │
                    │ Pipeline │
                    └──────────┘

Good for: Medium-large jobs (10K-10M pages), distributed workers.

import asyncio
import json

import httpx
import redis
from bs4 import BeautifulSoup

class QueueBasedScraper:
    def __init__(self, redis_url='redis://localhost:6379'):
        self.redis = redis.from_url(redis_url)
        self.url_queue = 'scraper:urls'
        self.result_queue = 'scraper:results'
        self.seen_set = 'scraper:seen'
    
    def enqueue_urls(self, urls):
        pipe = self.redis.pipeline()
        for url in urls:
            if not self.redis.sismember(self.seen_set, url):
                pipe.rpush(self.url_queue, url)
                pipe.sadd(self.seen_set, url)
        pipe.execute()
    
    async def worker(self, worker_id, proxy):
        async with httpx.AsyncClient(proxy=proxy, timeout=30) as client:
            while True:
                # blpop is a blocking call, so run it in a thread to avoid stalling the event loop
                item = await asyncio.to_thread(self.redis.blpop, self.url_queue, 5)
                if not item:
                    continue
                
                url = item[1].decode()
                try:
                    response = await client.get(url)
                    result = self.parse(response)
                    self.redis.rpush(self.result_queue, json.dumps(result))
                except Exception as e:
                    self.redis.rpush('scraper:errors', json.dumps({
                        'url': url, 'error': str(e)
                    }))
    
    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        return {
            'url': str(response.url),
            'title': soup.title.string if soup.title else '',
            'status': response.status_code,
        }
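
A minimal driver ties it together (a sketch; the seed URLs, proxy, and worker count are placeholders, and workers run until cancelled):

# Hypothetical entry point: enqueue seed URLs, then run several workers concurrently.
async def run_workers(seed_urls, proxy, num_workers=4):
    scraper = QueueBasedScraper()
    scraper.enqueue_urls(seed_urls)
    # Workers loop forever; cancel the task (or Ctrl+C) to stop them.
    await asyncio.gather(*(scraper.worker(i, proxy) for i in range(num_workers)))

# asyncio.run(run_workers(['https://example.com/page/1'], 'http://user:pass@proxy.example.com:8080'))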

Pattern 3: Microservice Architecture

┌────────────┐   ┌──────────┐   ┌────────────┐   ┌──────────┐
│ Scheduler  │──→│ Fetcher  │──→│  Parser    │──→│ Storage  │
│ Service    │   │ Service  │   │  Service   │   │ Service  │
└────────────┘   └──────────┘   └────────────┘   └──────────┘
      │               │               │               │
      └───────────────┼───────────────┼───────────────┘
                      │               │
                ┌─────┴────┐   ┌─────┴────┐
                │  Proxy   │   │ Monitor  │
                │ Manager  │   │ Service  │
                └──────────┘   └──────────┘

Good for: Enterprise scale (10M+ pages), team operations, multi-tenant.
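
Each box becomes an independently deployable service. Below is a sketch of what the fetcher service might look like as a small HTTP API using FastAPI; the /fetch endpoint, request fields, and response shape are illustrative assumptions, not a prescribed interface:

# Sketch of a fetcher microservice; endpoint and field names are assumptions.
from typing import Optional

import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class FetchRequest(BaseModel):
    url: str
    proxy: Optional[str] = None

@app.post('/fetch')
async def fetch(req: FetchRequest):
    # A production service would reuse a shared client instead of creating one per request.
    async with httpx.AsyncClient(proxy=req.proxy, timeout=30) as client:
        response = await client.get(req.url)
    # The parser service receives the raw HTML plus metadata downstream.
    return {'url': str(response.url), 'status': response.status_code, 'html': response.text}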

Data Pipeline Design

import asyncio

class ScrapingPipeline:
    """Pluggable data processing pipeline."""
    
    def __init__(self):
        self.stages = []
    
    def add_stage(self, stage_func):
        self.stages.append(stage_func)
        return self
    
    async def process(self, item):
        for stage in self.stages:
            # Support both sync and async stages
            if asyncio.iscoroutinefunction(stage):
                item = await stage(item)
            else:
                item = stage(item)
            if item is None:
                return None  # Item filtered out
        return item

# Define pipeline stages
import re

def clean_text(item):
    for key in ['title', 'description']:
        if key in item:
            item[key] = re.sub(r'\s+', ' ', item[key]).strip()
    return item

def validate_price(item):
    if 'price' in item:
        try:
            item['price'] = float(re.sub(r'[^\d.]', '', item['price']))
        except ValueError:
            item['price'] = None
    return item

def filter_incomplete(item):
    required = ['title', 'price', 'url']
    if all(item.get(f) for f in required):
        return item
    return None  # Skip incomplete items

# Build pipeline
pipeline = ScrapingPipeline()
pipeline.add_stage(clean_text)
pipeline.add_stage(validate_price)
pipeline.add_stage(filter_incomplete)
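
Processing a scraped item is then a single await; incomplete items come back as None (a usage sketch with a made-up item):

# Hypothetical item; a None result means it was filtered out by the pipeline.
async def handle(raw_item):
    item = await pipeline.process(raw_item)
    if item is not None:
        print(item)

# asyncio.run(handle({'url': 'https://example.com/p/1', 'title': '  Widget  ', 'price': '$19.99'}))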

Error Handling & Retries

import asyncio
import httpx

class ResilientScraper:
    def __init__(self, max_retries=3):
        self.max_retries = max_retries
        self.error_counts = {}
    
    async def fetch(self, url, client):
        for attempt in range(self.max_retries):
            try:
                response = await client.get(url, timeout=30)
                
                if response.status_code == 429:
                    # Retry-After may be seconds or an HTTP date; fall back to 30s
                    retry_after = response.headers.get('retry-after', '30')
                    wait = int(retry_after) if retry_after.isdigit() else 30
                    await asyncio.sleep(wait)
                    continue
                
                if response.status_code >= 500:
                    await asyncio.sleep(2 ** attempt)
                    continue
                
                return response
            
            except (httpx.TimeoutException, httpx.ConnectError):
                await asyncio.sleep(2 ** attempt)
                continue
        
        return None  # All retries exhausted
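
Inside a worker, a None return means the URL was abandoned after all retries (a usage sketch; the client and URL are placeholders):

# Hypothetical usage: log and skip URLs that exhaust their retries.
async def fetch_one(url, proxy):
    scraper = ResilientScraper(max_retries=3)
    async with httpx.AsyncClient(proxy=proxy, timeout=30) as client:
        response = await scraper.fetch(url, client)
        if response is None:
            print(f'giving up on {url}')
        return response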

FAQ

When should I move from simple scripts to proper architecture?

When you consistently scrape more than 10,000 pages, need to run on a schedule, require reliability guarantees, or work in a team. The investment in architecture pays off quickly through reduced debugging time and better data quality.

Which message queue should I use?

Redis is the simplest option at small-to-medium scale. RabbitMQ adds routing and delivery acknowledgments. Apache Kafka handles massive throughput but brings operational complexity. Start with Redis and upgrade only when you outgrow it.

How many concurrent workers should I run?

Start with 10-20 concurrent workers and monitor error rates and response times. Increase until you see degradation. Residential proxies typically support 10-50 concurrent connections per IP. Datacenter proxies can handle 100+.

Should I use Scrapy or build custom?

Scrapy is excellent for structured spider-based scraping with built-in middleware, pipelines, and concurrency. Build custom when you need specific queue integration, microservice architecture, or non-standard scraping patterns. Many teams use Scrapy for crawling and custom code for processing.

How do I handle schema changes on target websites?

Use CSS/XPath selectors with fallback chains. Monitor extraction accuracy with assertions. Set up alerts when expected fields are empty. Version your parsers and test against saved HTML samples before deploying changes.
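
A fallback chain can be as simple as trying selectors in order (a sketch; the selectors and sample HTML are hypothetical):

from bs4 import BeautifulSoup

# Hypothetical fallback chain: try each CSS selector until one matches.
def extract_first(soup, selectors):
    for selector in selectors:
        node = soup.select_one(selector)
        if node:
            return node.get_text(strip=True)
    return None  # A None here should trigger an extraction-failure alert

soup = BeautifulSoup('<h1 class="product-title">Widget</h1>', 'lxml')
print(extract_first(soup, ['h1.product-title', 'h1[itemprop="name"]', 'h1']))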

