Web Scraping with Docker: Containerized Crawlers
Running web scrapers directly on your machine creates dependency conflicts, environment inconsistencies, and deployment headaches. Docker solves these problems by packaging your scraper, its dependencies, and its runtime environment into portable containers that run identically everywhere.
This guide covers building Docker images for scrapers, managing headless browsers in containers, orchestrating multi-container scraping systems, and deploying to production.
Why Docker for Web Scraping?
Docker provides several advantages for scraping projects:
- Reproducible environments: Same Python version, same libraries, everywhere
- Easy deployment: Push an image, pull it on any server, run
- Isolation: Scrapers can’t interfere with each other or the host system
- Scaling: Spin up 50 scraper containers in seconds
- Resource limits: Cap CPU and memory per container
- Version control: Tag images for rollback capability
Basic Scraper Dockerfile
Simple HTTP Scraper
# Dockerfile
FROM python:3.12-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
gcc \
&& rm -rf /var/lib/apt/lists/*
# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy scraper code
COPY . .
# Run the scraper
CMD ["python", "scraper.py"]
# requirements.txt
requests==2.31.0
beautifulsoup4==4.12.3
lxml==5.1.0
pymongo==4.6.1
# scraper.py
import requests
from bs4 import BeautifulSoup
import json
import os
PROXY_URL = os.getenv("PROXY_URL", "")
TARGET_URL = os.getenv("TARGET_URL", "https://example.com")
def scrape(url):
proxies = {"http": PROXY_URL, "https": PROXY_URL} if PROXY_URL else None
response = requests.get(
url,
proxies=proxies,
headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0"},
timeout=30
)
response.raise_for_status()
soup = BeautifulSoup(response.text, "lxml")
return {
"url": url,
"title": soup.title.string if soup.title else None,
"links": [a["href"] for a in soup.find_all("a", href=True)]
}
if __name__ == "__main__":
result = scrape(TARGET_URL)
print(json.dumps(result, indent=2))
Build and run:
docker build -t my-scraper .
docker run -e TARGET_URL="https://example.com" -e PROXY_URL="http://user:pass@proxy:8080" my-scraper
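The requirements.txt above pins pymongo, but the basic scraper only prints its result. If you want each run to persist what it scraped, a small addition to scraper.py along these lines would work; this is a sketch, and the MONGO_URL variable and the scraper_db.pages collection name are assumptions that match the Compose setup shown later in this guide:
# Optional addition to scraper.py: persist results to MongoDB
# (MONGO_URL and the scraper_db.pages collection are assumed names)
import os
from pymongo import MongoClient
def save_result(result: dict) -> None:
    client = MongoClient(os.getenv("MONGO_URL", "mongodb://localhost:27017/"))
    client.scraper_db.pages.insert_one(result)  # one document per scraped page
Call save_result() with the dict returned by scrape(), and pass MONGO_URL with -e when running the container.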
Headless Browser Scraper with Docker
For JavaScript-rendered pages, you need a headless browser inside the container:
Playwright in Docker
# Dockerfile.playwright
FROM mcr.microsoft.com/playwright/python:v1.41.0-jammy
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "browser_scraper.py"]
# browser_scraper.py
from playwright.sync_api import sync_playwright
import json
import os
TARGET_URL = os.getenv("TARGET_URL", "https://example.com")
PROXY_SERVER = os.getenv("PROXY_SERVER", "")
def scrape_with_browser(url):
with sync_playwright() as p:
browser_args = {}
if PROXY_SERVER:
browser_args["proxy"] = {"server": PROXY_SERVER}
browser = p.chromium.launch(
headless=True,
args=["--no-sandbox", "--disable-dev-shm-usage"],
**browser_args
)
page = browser.new_page()
page.goto(url, wait_until="networkidle", timeout=60000)
# Wait for dynamic content
page.wait_for_timeout(2000)
data = {
"url": url,
"title": page.title(),
"content": page.inner_text("body")[:5000],
"screenshot": None
}
# Take screenshot for debugging
page.screenshot(path="/app/output/screenshot.png")
browser.close()
return data
if __name__ == "__main__":
result = scrape_with_browser(TARGET_URL)
print(json.dumps(result, indent=2))
Run with volume mount for screenshots:
docker run -v $(pwd)/output:/app/output \
-e TARGET_URL="https://example.com" \
my-browser-scraper
Chrome + Selenium in Docker
# Dockerfile.selenium
FROM python:3.12-slim
# Install Chrome
RUN apt-get update && apt-get install -y --no-install-recommends \
wget gnupg2 \
&& wget -q -O - https://dl.google.com/linux/linux_signing_key.pub | apt-key add - \
&& echo "deb http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list \
&& apt-get update \
&& apt-get install -y google-chrome-stable \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "selenium_scraper.py"]
Shared Memory for Chrome
Chrome in Docker needs extra shared memory. Without it, you’ll get crashes:
# Option 1: Increase shm size
docker run --shm-size=2g my-browser-scraper
# Option 2: Use /tmp instead of /dev/shm
docker run --shm-size=256m -e CHROME_FLAGS="--disable-dev-shm-usage" my-browser-scraper
Multi-Container Scraping with Docker Compose
For production setups, use Docker Compose to orchestrate scrapers with supporting services:
# docker-compose.yml
version: "3.8"
services:
scraper:
build:
context: .
dockerfile: Dockerfile
environment:
- REDIS_URL=redis://redis:6379/0
- MONGO_URL=mongodb://mongo:27017/scraper_db
- PROXY_URL=${PROXY_URL}
- CONCURRENCY=20
depends_on:
- redis
- mongo
deploy:
replicas: 4
resources:
limits:
cpus: "1.0"
memory: 512M
redis:
image: redis:7-alpine
ports:
- "6379:6379"
volumes:
- redis_data:/data
mongo:
image: mongo:7
ports:
- "27017:27017"
volumes:
- mongo_data:/data/db
environment:
- MONGO_INITDB_DATABASE=scraper_db
scheduler:
build:
context: .
dockerfile: Dockerfile
command: python scheduler.py
environment:
- REDIS_URL=redis://redis:6379/0
depends_on:
- redis
dashboard:
build:
context: .
dockerfile: Dockerfile.dashboard
ports:
- "8080:8080"
environment:
- REDIS_URL=redis://redis:6379/0
- MONGO_URL=mongodb://mongo:27017/scraper_db
depends_on:
- redis
- mongo
volumes:
redis_data:
mongo_data:
Start the entire stack:
# Start all services
docker-compose up -d
# Scale scrapers
docker-compose up -d --scale scraper=8
# View logs
docker-compose logs -f scraper
# Stop everything
docker-compose down
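The scheduler service in the Compose file runs a scheduler.py that is not shown. One common pattern is for it to push target URLs onto a Redis list that the scraper replicas consume. Here is a minimal sketch; the scrape:queue key, the TARGET_URLS and SCHEDULE_INTERVAL variables, and the redis dependency are assumptions:
# scheduler.py (sketch): periodically enqueue URLs for the scraper containers
import os
import time
import redis
REDIS_URL = os.getenv("REDIS_URL", "redis://localhost:6379/0")
QUEUE_KEY = "scrape:queue"  # assumed queue name, shared with the scrapers
INTERVAL = int(os.getenv("SCHEDULE_INTERVAL", "3600"))  # seconds between runs
def main():
    r = redis.Redis.from_url(REDIS_URL)
    urls = [u for u in os.getenv("TARGET_URLS", "").split(",") if u]
    while True:
        for url in urls:
            r.rpush(QUEUE_KEY, url)  # scrapers would BLPOP from the same key
        time.sleep(INTERVAL)
if __name__ == "__main__":
    main()
Each scraper replica would then block on blpop for the same key and process URLs as they arrive.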
Optimizing Docker Images for Scrapers
Multi-Stage Builds
Reduce image size by separating build and runtime:
# Build stage
FROM python:3.12 AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --target=/app/deps -r requirements.txt
# Runtime stage
FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /app/deps /usr/local/lib/python3.12/site-packages/
COPY . .
CMD ["python", "scraper.py"]
This reduces image size from ~900MB to ~200MB.
Layer Caching
Order Dockerfile commands from least to most frequently changed:
FROM python:3.12-slim
# System deps change rarely - cached
RUN apt-get update && apt-get install -y gcc && rm -rf /var/lib/apt/lists/*
# Python deps change occasionally - cached after first build
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Code changes frequently - only this layer rebuilds
COPY . .
CMD ["python", "scraper.py"]
Alpine-Based Images
Use Alpine for the smallest possible images:
FROM python:3.12-alpine
RUN apk add --no-cache gcc musl-dev libxml2-dev libxslt-dev
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "scraper.py"]
Image size drops from ~200MB to ~80MB.
Environment Configuration
Use environment variables for all configurable settings:
# config.py
import os
class Config:
PROXY_URL = os.getenv("PROXY_URL", "")
PROXY_USERNAME = os.getenv("PROXY_USERNAME", "")
PROXY_PASSWORD = os.getenv("PROXY_PASSWORD", "")
TARGET_URLS = os.getenv("TARGET_URLS", "").split(",")
CONCURRENCY = int(os.getenv("CONCURRENCY", "10"))
REQUEST_TIMEOUT = int(os.getenv("REQUEST_TIMEOUT", "30"))
OUTPUT_DIR = os.getenv("OUTPUT_DIR", "/app/output")
REDIS_URL = os.getenv("REDIS_URL", "redis://localhost:6379/0")
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")
Use .env files for local development:
# .env
PROXY_URL=http://user:pass@proxy.example.com:8080
CONCURRENCY=5
LOG_LEVEL=DEBUG
docker run --env-file .env my-scraper
Persistent Storage and Volumes
Mount volumes for scraped data persistence:
# Mount output directory
docker run -v $(pwd)/data:/app/output my-scraper
# Named volume for database
docker volume create scraper_data
docker run -v scraper_data:/app/data my-scraper
For large-scale scraping, write directly to cloud storage:
import boto3
import json
import os
s3 = boto3.client("s3")
def save_to_s3(data, key):
s3.put_object(
Bucket=os.getenv("S3_BUCKET"),
Key=key,
Body=json.dumps(data),
ContentType="application/json"
)
Docker Networking for Scrapers
Proxy Containers
Run your own proxy as a Docker container alongside scrapers:
services:
squid-proxy:
image: ubuntu/squid:latest
ports:
- "3128:3128"
volumes:
- ./squid.conf:/etc/squid/squid.conf
scraper:
build: .
environment:
- PROXY_URL=http://squid-proxy:3128
depends_on:
- squid-proxy
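The squid-proxy service mounts a squid.conf that is not shown here. A deliberately minimal, permissive configuration for a trusted internal Compose network could be as small as the sketch below; it is an assumption, not a hardened setup, and should never be exposed to the public internet:
# squid.conf (minimal sketch): accept all clients on the internal network
http_port 3128
http_access allow all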
Network Isolation
Create separate networks for internal and external traffic:
services:
scraper:
networks:
- internal
- external
redis:
networks:
- internal
mongo:
networks:
- internal
networks:
internal:
internal: true # No internet access
external:
driver: bridge # Internet access for scrapers
Health Checks and Restart Policies
services:
scraper:
build: .
restart: unless-stopped
healthcheck:
test: ["CMD", "python", "-c", "import requests; requests.get('http://localhost:8081/health')"]
interval: 30s
timeout: 10s
retries: 3
start_period: 10s
Add a health endpoint to your scraper:
from threading import Thread
from http.server import HTTPServer, BaseHTTPRequestHandler
class HealthHandler(BaseHTTPRequestHandler):
def do_GET(self):
self.send_response(200)
self.end_headers()
self.wfile.write(b"OK")
def start_health_server():
server = HTTPServer(("0.0.0.0", 8081), HealthHandler)
server.serve_forever()
# Start health check server in background
Thread(target=start_health_server, daemon=True).start()
Production Deployment Tips
- Use .dockerignore to exclude unnecessary files:
.git
__pycache__
*.pyc
.env
output/
- Pin dependency versions in requirements.txt for reproducibility
- Run as non-root user:
RUN useradd -m scraper
USER scraper
- Set resource limits to prevent runaway containers from killing the host
- Configure logging: rotate local json-file logs so they stay bounded, or use a driver such as syslog or fluentd to centralize them:
logging:
driver: json-file
options:
max-size: "10m"
max-file: "3"
FAQ
How much memory does a Docker scraper container need?
For HTTP-only scrapers: 128-256MB. For headless browser scrapers: 512MB-2GB depending on page complexity. Always set memory limits to prevent runaway containers.
Should I use Docker or Docker Compose?
Use Docker for single-container scrapers. Use Docker Compose when you need multiple services (Redis, database, multiple scrapers). For production at scale, consider Kubernetes.
How do I debug a scraper inside Docker?
Run interactively: docker run -it my-scraper /bin/bash. Or attach to a running container: docker exec -it container_id /bin/bash.
Can I run headless Chrome in Docker without --no-sandbox?
Not easily in standard Docker containers. The --no-sandbox flag is needed because Docker containers don’t support the Linux sandboxing Chrome uses. This is safe because the container itself provides isolation.
How do I update my scraper without downtime?
Rebuild the image, then run docker-compose up -d --no-deps scraper to recreate only the scraper containers without touching Redis or MongoDB. For true zero-downtime rolling updates that replace replicas one at a time, use Docker Swarm or Kubernetes.
Conclusion
Docker transforms web scraping from a fragile local process into a portable, scalable, and reproducible system. Start with a simple Dockerfile, add Docker Compose for multi-service setups, and scale to Kubernetes when your needs outgrow a single host.
The containerized approach pairs perfectly with proxy rotation — each container can use its own proxy endpoint, making your scraping operation both scalable and stealthy.