Docker Compose for Scraping Infrastructure
Building a web scraping infrastructure from scratch means coordinating multiple services — scrapers, proxy rotators, message queues, databases, and monitoring tools. Docker Compose lets you define this entire stack in a single YAML file, making it reproducible, portable, and easy to scale.
This guide builds a production-ready scraping stack with Docker Compose, including proxy management, task queuing, and data storage.
Architecture
docker-compose.yml
├── scraper (Python + Playwright)
├── proxy-rotator (custom proxy middleware)
├── redis (task queue + caching)
├── postgres (results storage)
├── grafana (monitoring dashboard)
└── prometheus (metrics collection)

Docker Compose Configuration
version: '3.8'

services:
  scraper:
    build:
      context: ./scraper
      dockerfile: Dockerfile
    environment:
      - PROXY_URL=http://proxy-rotator:8888
      - REDIS_URL=redis://redis:6379
      - DATABASE_URL=postgresql://scraper:password@postgres:5432/scraping
      - CONCURRENT_WORKERS=10
      - LOG_LEVEL=INFO
    depends_on:
      - redis
      - postgres
      - proxy-rotator
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 4G

  proxy-rotator:
    build: ./proxy-rotator
    ports:
      - "8888:8888"
    environment:
      - PROXY_LIST_FILE=/config/proxies.txt
      - ROTATION_STRATEGY=weighted
      - HEALTH_CHECK_INTERVAL=60
    volumes:
      - ./config:/config
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis-data:/data
    command: redis-server --appendonly yes

  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: scraping
      POSTGRES_USER: scraper
      POSTGRES_PASSWORD: password
    ports:
      - "5432:5432"
    volumes:
      - postgres-data:/var/lib/postgresql/data
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
    volumes:
      - grafana-data:/var/lib/grafana

volumes:
  redis-data:
  postgres-data:
  grafana-data:
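The prometheus service mounts a ./prometheus.yml that the compose file references but this guide hasn't shown. A minimal sketch, assuming each scraper exposes prometheus-client metrics on port 8000 (the job name and port are assumptions):

# prometheus.yml: minimal scrape config (illustrative)
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "scrapers"
    # DNS service discovery picks up every replica when the scraper
    # service is scaled, since Docker's DNS returns one A record per container
    dns_sd_configs:
      - names: ["scraper"]
        type: A
        port: 8000

Scraper Dockerfile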
FROM python:3.12-slim

RUN pip install --no-cache-dir \
    requests beautifulsoup4 lxml \
    playwright redis psycopg2-binary \
    prometheus-client

# Playwright manages its own Chromium build; --with-deps also installs
# the system libraries the browser needs on a slim image, so a separate
# apt-get install of chromium is unnecessary
RUN playwright install --with-deps chromium

WORKDIR /app
COPY . /app

CMD ["python", "main.py"]
# Scale to 5 scraper instances
docker compose up -d --scale scraper=5

# Check status
docker compose ps

# View logs for all scrapers
docker compose logs -f scraper

Scaling works here because the scraper service publishes no host ports; services that bind a fixed host port (such as redis) cannot be scaled this way.

Database Initialization
-- init.sql
CREATE TABLE IF NOT EXISTS scraped_pages (
    id SERIAL PRIMARY KEY,
    url TEXT NOT NULL,
    domain TEXT NOT NULL,
    status_code INTEGER,
    content_hash TEXT,
    data JSONB,
    scraped_at TIMESTAMP DEFAULT NOW(),
    proxy_used TEXT,
    latency_ms FLOAT
);

CREATE INDEX idx_scraped_domain ON scraped_pages(domain);
CREATE INDEX idx_scraped_date ON scraped_pages(scraped_at);
-- PostgreSQL requires an extra set of parentheses around index expressions
CREATE UNIQUE INDEX idx_scraped_url_date ON scraped_pages(url, (scraped_at::date));
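The unique index caps storage at one row per URL per day. A sketch of how inserts can lean on it to skip same-day duplicates (the values are placeholders):

-- Re-scrapes of the same URL on the same day are silently dropped
INSERT INTO scraped_pages (url, domain, status_code, content_hash, data, proxy_used, latency_ms)
VALUES ('https://example.com/page', 'example.com', 200, 'abc123',
        '{"title": "Example"}', 'http://203.0.113.1:8080', 350.0)
ON CONFLICT (url, (scraped_at::date)) DO NOTHING;

FAQ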
How much RAM does this stack need?
The minimum is 4GB for a single scraper instance. Each Playwright browser uses 200-500MB. Redis and PostgreSQL add another 500MB-1GB. For 5 scrapers with headless browsers, plan for 8-16GB.
Can I use this with residential proxies?
Yes. Replace the proxy-rotator service with your provider’s gateway URL in the scraper environment variables. Commercial providers handle rotation through their backconnect gateways.
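For example, in an override file; the gateway hostname, port, and credentials below are placeholders for whatever your provider issues:

# docker-compose.override.yml: point the scraper at a provider gateway
services:
  scraper:
    environment:
      - PROXY_URL=http://username:password@gateway.example-provider.com:7777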
How do I deploy this to production?
For production, use Docker Swarm or Kubernetes. Add persistent volumes for databases, configure resource limits, set up log aggregation, and use secrets management for proxy credentials. See our Kubernetes scraping guide for details.
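As a minimal sketch of the secrets piece, Compose can mount file-based secrets so credentials never appear in environment variables or the YAML itself (the file path is illustrative):

# Secrets land in the container at /run/secrets/<name>
services:
  scraper:
    secrets:
      - proxy_credentials

secrets:
  proxy_credentials:
    file: ./secrets/proxy_credentials.txt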
Network Configuration
Docker Compose networks isolate your scraping infrastructure from other containers:
networks:
  scraping-net:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.0.0/16

services:
  scraper:
    networks:
      - scraping-net
  redis:
    networks:
      - scraping-net
  postgres:
    networks:
      - scraping-net

Environment-Specific Configurations
Use Docker Compose override files for different environments:
# Development (with debugging tools)
docker compose -f docker-compose.yml -f docker-compose.dev.yml up

# Production (optimized, no debug ports)
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d

# docker-compose.dev.yml
services:
  scraper:
    environment:
      - LOG_LEVEL=DEBUG
      - CONCURRENT_WORKERS=2
    ports:
      - "5678:5678"  # debugger port
  redis:
    ports:
      - "6379:6379"  # expose for local debugging

# docker-compose.prod.yml
services:
  scraper:
    environment:
      - LOG_LEVEL=WARNING
      - CONCURRENT_WORKERS=20
    deploy:
      replicas: 3
      resources:
        limits:
          memory: 4G
          cpus: "2.0"
    restart: always
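Before deploying, docker compose config renders the merged result of the base file plus overrides, which is the quickest way to confirm the layering did what you expect:

# Print the fully merged production configuration without starting anything
docker compose -f docker-compose.yml -f docker-compose.prod.yml config

Health Checks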
Add health checks to ensure services are running properly:
services:
  scraper:
    healthcheck:
      # raise_for_status() makes non-200 responses count as unhealthy too
      test: ["CMD", "python", "-c", "import requests; requests.get('http://localhost:8000/health').raise_for_status()"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s
  redis:
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 3
  postgres:
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U scraper"]
      interval: 10s
      timeout: 5s
      retries: 3
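The scraper check assumes the container answers on http://localhost:8000/health, which nothing shown so far provides. A minimal sketch of such an endpoint, served from a daemon thread inside the scraper process:

# health_server.py: tiny /health endpoint (illustrative)
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        # Keep healthcheck probes out of the container logs
        pass

def start_health_server(port=8000):
    # Daemon thread dies together with the main scraper process
    server = HTTPServer(("0.0.0.0", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()

Logging and Log Aggregation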
Centralize logs from all containers:
services:
  scraper:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"

  # Optional: add Loki for log aggregation
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    command: -config.file=/etc/loki/local-config.yaml

Proxy Rotator Service
Build a lightweight proxy rotation service as part of your stack:
# proxy-rotator/main.py
import os

import requests
from flask import Flask, request, Response

app = Flask(__name__)

# Load proxies from file
proxy_file = os.getenv("PROXY_LIST_FILE", "/config/proxies.txt")
with open(proxy_file) as f:
    PROXIES = [line.strip() for line in f if line.strip()]

current_index = 0

@app.route("/", defaults={"path": ""}, methods=["GET", "POST", "PUT", "DELETE"])
@app.route("/<path:path>", methods=["GET", "POST", "PUT", "DELETE"])
def proxy_request(path):
    global current_index
    target_url = request.headers.get("X-Target-URL")
    if not target_url:
        return {"error": "Missing X-Target-URL header"}, 400
    # Simple round-robin rotation over the loaded proxy list
    proxy = PROXIES[current_index % len(PROXIES)]
    current_index += 1
    try:
        resp = requests.request(
            method=request.method,
            url=target_url,
            # Strip Host and the internal routing header before forwarding
            headers={k: v for k, v in request.headers
                     if k not in ("Host", "X-Target-URL")},
            data=request.get_data(),
            proxies={"http": proxy, "https": proxy},
            timeout=30,
        )
        return Response(resp.content, status=resp.status_code,
                        headers=dict(resp.headers))
    except Exception as e:
        return {"error": str(e)}, 502

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8888)
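From the scraper's side, every request goes to the rotator with the real destination in the X-Target-URL header, matching the service above:

# Inside the compose network, PROXY_URL resolves to http://proxy-rotator:8888
import requests

resp = requests.get(
    "http://proxy-rotator:8888",
    headers={"X-Target-URL": "https://example.com/products"},
    timeout=60,
)
print(resp.status_code, len(resp.content))

Volume Management and Data Persistence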
Properly manage persistent data:
volumes:
  redis-data:
    driver: local
  postgres-data:
    driver: local
  grafana-data:
    driver: local
  scraper-output:
    driver: local
    driver_opts:
      type: none
      device: /data/scraping-output
      o: bind

Backup Strategy
#!/bin/bash
# backup-scraping-data.sh

# Backup PostgreSQL
docker compose exec -T postgres pg_dump -U scraper scraping > backup_$(date +%Y%m%d).sql

# Backup Redis (BGSAVE is asynchronous; in a real script, poll LASTSAVE
# until it advances before copying the dump)
docker compose exec -T redis redis-cli BGSAVE
docker cp $(docker compose ps -q redis):/data/dump.rdb ./redis_backup_$(date +%Y%m%d).rdb

# Backup Grafana dashboards
docker cp $(docker compose ps -q grafana):/var/lib/grafana/grafana.db ./grafana_backup_$(date +%Y%m%d).db
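Restores run the same commands in reverse. A sketch for PostgreSQL (the dump filename is a placeholder; stop the scrapers first so nothing writes mid-restore):

# Restore PostgreSQL from a dated dump
docker compose stop scraper
docker compose exec -T postgres psql -U scraper scraping < backup_20250101.sql
docker compose start scraper

Performance Tuning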
| Service | Parameter | Development | Production |
|---|---|---|---|
| Scraper | CONCURRENT_WORKERS | 2 | 20 |
| Scraper | Memory limit | 1G | 4G |
| Redis | maxmemory | 256mb | 2gb |
| PostgreSQL | shared_buffers | 128MB | 1GB |
| PostgreSQL | work_mem | 4MB | 64MB |
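Neither Redis nor PostgreSQL reads these values from the compose file by default; one way to apply the production column is to pass them as server flags (a sketch, tune to your hardware):

# Apply production tuning values as server flags
services:
  redis:
    command: redis-server --appendonly yes --maxmemory 2gb
  postgres:
    command: postgres -c shared_buffers=1GB -c work_mem=64MB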
Monitoring the Stack
# View resource usage
docker compose stats

# Check service health
docker compose ps

# Follow logs
docker compose logs -f --tail=100 scraper

# Scale services
docker compose up -d --scale scraper=5

For more on proxy configuration in containerized environments, see our Docker proxy setup guide. For scaling beyond Docker Compose, explore our Kubernetes scraping deployment guide.
Related Reading
- Build an Anti-Detection Test Suite: Verify Browser Stealth
- Build a News Crawler in Python: Step-by-Step Tutorial
- AJAX Request Interception: Scraping API Calls Directly
- Azure Functions for Serverless Web Scraping: the Complete Guide
- How to Configure Proxies on iPhone and Android
- How to Use Proxies in Node.js (Axios, Fetch, Puppeteer)