Docker Compose for Scraping Infrastructure

Building a web scraping infrastructure from scratch means coordinating multiple services — scrapers, proxy rotators, message queues, databases, and monitoring tools. Docker Compose lets you define this entire stack in a single YAML file, making it reproducible, portable, and easy to scale.

This guide builds a production-ready scraping stack with Docker Compose, including proxy management, task queuing, and data storage.

Architecture

docker-compose.yml
├── scraper (Python + Playwright)
├── proxy-rotator (custom proxy middleware)
├── redis (task queue + caching)
├── postgres (results storage)
├── grafana (monitoring dashboard)
└── prometheus (metrics collection)

Docker Compose Configuration

services:
  scraper:
    build:
      context: ./scraper
      dockerfile: Dockerfile
    environment:
      - PROXY_URL=http://proxy-rotator:8888
      - REDIS_URL=redis://redis:6379
      - DATABASE_URL=postgresql://scraper:password@postgres:5432/scraping
      - CONCURRENT_WORKERS=10
      - LOG_LEVEL=INFO
    depends_on:
      - redis
      - postgres
      - proxy-rotator
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 4G

  proxy-rotator:
    build: ./proxy-rotator
    ports:
      - "8888:8888"
    environment:
      - PROXY_LIST_FILE=/config/proxies.txt
      - ROTATION_STRATEGY=weighted
      - HEALTH_CHECK_INTERVAL=60
    volumes:
      - ./config:/config
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis-data:/data
    command: redis-server --appendonly yes

  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: scraping
      POSTGRES_USER: scraper
      POSTGRES_PASSWORD: password
    ports:
      - "5432:5432"
    volumes:
      - postgres-data:/var/lib/postgresql/data
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
    volumes:
      - grafana-data:/var/lib/grafana

volumes:
  redis-data:
  postgres-data:
  grafana-data:
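
The compose file above mounts a prometheus.yml that this guide does not define. A minimal sketch that scrapes the workers (the job name and port are assumptions; it presumes the scraper exposes prometheus-client metrics on :8000):

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: scraper
    static_configs:
      - targets: ["scraper:8000"]

With the files in place, bring the whole stack up with docker compose up -d --build.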

Scraper Dockerfile

FROM python:3.12-slim

RUN pip install --no-cache-dir \
    requests beautifulsoup4 lxml \
    playwright redis psycopg2-binary \
    prometheus-client

# Installs Chromium plus the system libraries it needs; a separate
# apt-get chromium install is unnecessary when using Playwright's browsers
RUN playwright install --with-deps chromium

WORKDIR /app
COPY . /app

CMD ["python", "main.py"]

Scaling Scrapers

# Scale to 5 scraper instances
docker compose up -d --scale scraper=5

# Check status
docker compose ps

# View logs for all scrapers
docker compose logs -f scraper

Database Initialization

-- init.sql
CREATE TABLE IF NOT EXISTS scraped_pages (
    id SERIAL PRIMARY KEY,
    url TEXT NOT NULL,
    domain TEXT NOT NULL,
    status_code INTEGER,
    content_hash TEXT,
    data JSONB,
    scraped_at TIMESTAMP DEFAULT NOW(),
    proxy_used TEXT,
    latency_ms FLOAT
);

CREATE INDEX idx_scraped_domain ON scraped_pages(domain);
CREATE INDEX idx_scraped_date ON scraped_pages(scraped_at);
-- Expression indexes need parentheses around the cast
CREATE UNIQUE INDEX idx_scraped_url_date ON scraped_pages (url, (scraped_at::date));
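
The unique index lets workers upsert rather than error on a second fetch of the same URL within a day; for example (values illustrative):

-- Re-scraping the same URL on the same day updates the row in place
INSERT INTO scraped_pages (url, domain, status_code, data)
VALUES ('https://example.com/item/1', 'example.com', 200, '{"price": 19.99}')
ON CONFLICT (url, (scraped_at::date))
DO UPDATE SET status_code = EXCLUDED.status_code,
              data        = EXCLUDED.data;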

FAQ

How much RAM does this stack need?

The minimum is 4GB for a single scraper instance. Each Playwright browser uses 200-500MB. Redis and PostgreSQL add another 500MB-1GB. For 5 scrapers with headless browsers, plan for 8-16GB.

Can I use this with residential proxies?

Yes. Replace the proxy-rotator service with your provider’s gateway URL in the scraper environment variables. Commercial providers handle rotation through their backconnect gateways.
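
For example, pointing the scraper at a backconnect gateway (hostname, port, and credentials are placeholders):

services:
  scraper:
    environment:
      - PROXY_URL=http://USER:PASS@gateway.provider.example:7777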

How do I deploy this to production?

For production, use Docker Swarm or Kubernetes. Add persistent volumes for databases, configure resource limits, set up log aggregation, and use secrets management for proxy credentials. See our Kubernetes scraping guide for details.
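
For example, Compose file-based secrets keep proxy credentials out of environment variables (the file path is illustrative):

services:
  scraper:
    secrets:
      - proxy_credentials

secrets:
  proxy_credentials:
    file: ./secrets/proxy_credentials.txt

The secret is mounted at /run/secrets/proxy_credentials inside the container for the application to read.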

Network Configuration

Docker Compose networks isolate your scraping infrastructure from other containers:

networks:
  scraping-net:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.0.0/16

services:
  scraper:
    networks:
      - scraping-net
  proxy-rotator:
    networks:
      - scraping-net
  redis:
    networks:
      - scraping-net
  postgres:
    networks:
      - scraping-net

Environment-Specific Configurations

Use Docker Compose override files for different environments:

# Development (with debugging tools)
docker compose -f docker-compose.yml -f docker-compose.dev.yml up

# Production (optimized, no debug ports)
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d

# docker-compose.dev.yml
services:
  scraper:
    environment:
      - LOG_LEVEL=DEBUG
      - CONCURRENT_WORKERS=2
    ports:
      - "5678:5678"  # debugger port

  redis:
    ports:
      - "6379:6379"  # expose for local debugging
# docker-compose.prod.yml
services:
  scraper:
    environment:
      - LOG_LEVEL=WARNING
      - CONCURRENT_WORKERS=20
    deploy:
      replicas: 3
      resources:
        limits:
          memory: 4G
          cpus: "2.0"
    restart: always

Health Checks

Add health checks to ensure services are running properly:

services:
  scraper:
    healthcheck:
      test: ["CMD", "python", "-c", "import requests; requests.get('http://localhost:8000/health')"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s

  redis:
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 3

  postgres:
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U scraper"]
      interval: 10s
      timeout: 5s
      retries: 3
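
The scraper check above assumes the worker serves /health on port 8000. One minimal way to do that from the worker process (a sketch; the module and function names are hypothetical):

# health.py - run a tiny /health endpoint on a background thread
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status = 200 if self.path == "/health" else 404
        self.send_response(status)
        self.end_headers()
        if status == 200:
            self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # keep health probes out of the worker logs

def start_health_server(port=8000):
    server = HTTPServer(("0.0.0.0", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()

With these checks defined, depends_on in the scraper service can use condition: service_healthy so workers only start once Redis and Postgres are actually ready.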

Logging and Log Aggregation

Centralize logs from all containers:

services:
  scraper:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"

  # Optional: Add Loki for log aggregation
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    command: -config.file=/etc/loki/local-config.yaml
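
Loki only stores logs; to ship container output into it you also need a collector such as Promtail (a minimal sketch; the config file path is an assumption):

  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - ./promtail-config.yaml:/etc/promtail/config.yml
    command: -config.file=/etc/promtail/config.yml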

Proxy Rotator Service

Build a lightweight proxy rotation service as part of your stack. This sketch is an HTTP relay rather than a standard forward proxy: the scraper sends each request to the rotator with an X-Target-URL header, and the rotator forwards it through the next upstream proxy in round-robin order (the ROTATION_STRATEGY=weighted setting shown earlier would need additional logic):

# proxy-rotator/main.py
from flask import Flask, request, Response
import requests
import random
import os

app = Flask(__name__)

# Load proxies from file
proxy_file = os.getenv("PROXY_LIST_FILE", "/config/proxies.txt")
with open(proxy_file) as f:
    PROXIES = [line.strip() for line in f if line.strip()]

current_index = 0

@app.route("/", defaults={"path": ""}, methods=["GET", "POST", "PUT", "DELETE"])
@app.route("/<path:path>", methods=["GET", "POST", "PUT", "DELETE"])
def proxy_request(path):
    global current_index
    target_url = request.headers.get("X-Target-URL")
    if not target_url:
        return {"error": "Missing X-Target-URL header"}, 400

    proxy = PROXIES[current_index % len(PROXIES)]
    current_index += 1

    try:
        resp = requests.request(
            method=request.method,
            url=target_url,
            headers={k: v for k, v in request.headers.items()
                     if k.lower() not in ("host", "x-target-url")},
            data=request.get_data(),
            proxies={"http": proxy, "https": proxy},
            timeout=30,
        )
        # Drop hop-by-hop headers that would corrupt the relayed response
        excluded = {"content-encoding", "content-length", "transfer-encoding", "connection"}
        headers = [(k, v) for k, v in resp.headers.items() if k.lower() not in excluded]
        return Response(resp.content, status=resp.status_code, headers=headers)
    except Exception as e:
        return {"error": str(e)}, 502

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8888)
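
The compose file builds this service from ./proxy-rotator, so it needs its own Dockerfile alongside main.py; a minimal sketch:

# proxy-rotator/Dockerfile
FROM python:3.12-slim
RUN pip install --no-cache-dir flask requests
WORKDIR /app
COPY main.py .
CMD ["python", "main.py"]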

Volume Management and Data Persistence

Properly manage persistent data:

volumes:
  redis-data:
    driver: local
  postgres-data:
    driver: local
  grafana-data:
    driver: local
  scraper-output:
    driver: local
    driver_opts:
      type: none
      device: /data/scraping-output
      o: bind

Backup Strategy

#!/bin/bash
# backup-scraping-data.sh

# Backup PostgreSQL
docker compose exec -T postgres pg_dump -U scraper scraping > backup_$(date +%Y%m%d).sql

# Backup Redis
docker compose exec -T redis redis-cli BGSAVE
# BGSAVE is asynchronous; give the snapshot a moment to finish
sleep 5
docker cp $(docker compose ps -q redis):/data/dump.rdb ./redis_backup_$(date +%Y%m%d).rdb

# Backup Grafana dashboards
docker cp $(docker compose ps -q grafana):/var/lib/grafana/grafana.db ./grafana_backup_$(date +%Y%m%d).db
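
To restore the PostgreSQL dump later, feed it back through psql (substitute the filename the script produced):

docker compose exec -T postgres psql -U scraper -d scraping < backup_YYYYMMDD.sql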

Performance Tuning

Service      Parameter            Development   Production
Scraper      CONCURRENT_WORKERS   2             20
Scraper      Memory limit         1G            4G
Redis        maxmemory            256mb         2gb
PostgreSQL   shared_buffers       128MB         1GB
PostgreSQL   work_mem             4MB           64MB
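
The Redis and PostgreSQL values can be applied directly from the compose file; a sketch using the production column (the eviction policy is an illustrative choice, not part of the table):

  redis:
    command: redis-server --appendonly yes --maxmemory 2gb --maxmemory-policy allkeys-lru

  postgres:
    command: postgres -c shared_buffers=1GB -c work_mem=64MB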

Monitoring the Stack

# View resource usage
docker compose stats

# Check service health
docker compose ps

# Follow logs
docker compose logs -f --tail=100 scraper

# Scale services
docker compose up -d --scale scraper=5

For more on proxy configuration in containerized environments, see our Docker proxy setup guide. For scaling beyond Docker Compose, explore our Kubernetes scraping deployment guide.

