Web Scraping with Docker: Containerized Crawlers

Running web scrapers directly on your machine creates dependency conflicts, environment inconsistencies, and deployment headaches. Docker solves all of this by packaging your scraper, its dependencies, and runtime environment into portable containers that run identically everywhere.

This guide covers building Docker images for scrapers, managing headless browsers in containers, orchestrating multi-container scraping systems, and deploying to production.

Why Docker for Web Scraping?

Docker provides several advantages for scraping projects:

  • Reproducible environments: Same Python version, same libraries, everywhere
  • Easy deployment: Push an image, pull it on any server, run
  • Isolation: Scrapers can’t interfere with each other or the host system
  • Scaling: Spin up 50 scraper containers in seconds
  • Resource limits: Cap CPU and memory per container
  • Version control: Tag images for rollback capability

Basic Scraper Dockerfile

Simple HTTP Scraper

# Dockerfile
FROM python:3.12-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy scraper code
COPY . .

# Run the scraper
CMD ["python", "scraper.py"]

# requirements.txt
requests==2.31.0
beautifulsoup4==4.12.3
lxml==5.1.0
pymongo==4.6.1

# scraper.py
import requests
from bs4 import BeautifulSoup
import json
import os

PROXY_URL = os.getenv("PROXY_URL", "")
TARGET_URL = os.getenv("TARGET_URL", "https://example.com")

def scrape(url):
    proxies = {"http": PROXY_URL, "https": PROXY_URL} if PROXY_URL else None
    response = requests.get(
        url,
        proxies=proxies,
        headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0"},
        timeout=30
    )
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "lxml")
    return {
        "url": url,
        "title": soup.title.string if soup.title else None,
        "links": [a["href"] for a in soup.find_all("a", href=True)]
    }

if __name__ == "__main__":
    result = scrape(TARGET_URL)
    print(json.dumps(result, indent=2))

Build and run:

docker build -t my-scraper .
docker run -e TARGET_URL="https://example.com" -e PROXY_URL="http://user:pass@proxy:8080" my-scraper

Headless Browser Scraper with Docker

For JavaScript-rendered pages, you need a headless browser inside the container:

Playwright in Docker

# Dockerfile.playwright
FROM mcr.microsoft.com/playwright/python:v1.41.0-jammy

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["python", "browser_scraper.py"]

# browser_scraper.py
from playwright.sync_api import sync_playwright
import json
import os

TARGET_URL = os.getenv("TARGET_URL", "https://example.com")
PROXY_SERVER = os.getenv("PROXY_SERVER", "")

def scrape_with_browser(url):
    with sync_playwright() as p:
        browser_args = {}
        if PROXY_SERVER:
            browser_args["proxy"] = {"server": PROXY_SERVER}

        browser = p.chromium.launch(
            headless=True,
            args=["--no-sandbox", "--disable-dev-shm-usage"],
            **browser_args
        )
        page = browser.new_page()
        page.goto(url, wait_until="networkidle", timeout=60000)

        # Wait for dynamic content
        page.wait_for_timeout(2000)

        data = {
            "url": url,
            "title": page.title(),
            "content": page.inner_text("body")[:5000],
            "screenshot": None
        }

        # Take screenshot for debugging
        page.screenshot(path="/app/output/screenshot.png")

        browser.close()
        return data

if __name__ == "__main__":
    result = scrape_with_browser(TARGET_URL)
    print(json.dumps(result, indent=2))

Run with volume mount for screenshots:

docker run -v $(pwd)/output:/app/output \
    -e TARGET_URL="https://example.com" \
    my-browser-scraper

Chrome + Selenium in Docker

# Dockerfile.selenium
FROM python:3.12-slim

# Install Chrome
RUN apt-get update && apt-get install -y --no-install-recommends \
    wget gnupg2 \
    && wget -q -O - https://dl.google.com/linux/linux_signing_key.pub | apt-key add - \
    && echo "deb http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list \
    && apt-get update \
    && apt-get install -y google-chrome-stable \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["python", "selenium_scraper.py"]
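The Dockerfile expects a selenium_scraper.py, which isn't shown above. A minimal sketch might look like the following; it assumes selenium>=4.6 is listed in requirements.txt so Selenium Manager can download a chromedriver matching the installed Chrome:

# selenium_scraper.py (a sketch; selenium>=4.6 with Selenium Manager assumed)
import json
import os

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

TARGET_URL = os.getenv("TARGET_URL", "https://example.com")

def scrape_with_selenium(url):
    options = Options()
    options.add_argument("--headless=new")
    options.add_argument("--no-sandbox")             # required in most containers
    options.add_argument("--disable-dev-shm-usage")  # work around small /dev/shm
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return {"url": url, "title": driver.title}
    finally:
        driver.quit()

if __name__ == "__main__":
    print(json.dumps(scrape_with_selenium(TARGET_URL), indent=2))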

Shared Memory for Chrome

Chrome in Docker needs more shared memory than the 64MB /dev/shm Docker allocates by default; without it, tabs crash on heavier pages:

# Option 1: Increase shm size
docker run --shm-size=2g my-browser-scraper

# Option 2: Use /tmp instead of /dev/shm
docker run --shm-size=256m -e CHROME_FLAGS="--disable-dev-shm-usage" my-browser-scraper
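If you run the browser scraper under Docker Compose (next section), the same setting can live in the compose file via the shm_size service key. A sketch, with the service name as an assumption:

services:
  browser-scraper:
    build:
      context: .
      dockerfile: Dockerfile.playwright
    shm_size: "2gb"   # equivalent of --shm-size=2g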

Multi-Container Scraping with Docker Compose

For production setups, use Docker Compose to orchestrate scrapers with supporting services:

# docker-compose.yml
version: "3.8"

services:
  scraper:
    build:
      context: .
      dockerfile: Dockerfile
    environment:
      - REDIS_URL=redis://redis:6379/0
      - MONGO_URL=mongodb://mongo:27017/scraper_db
      - PROXY_URL=${PROXY_URL}
      - CONCURRENCY=20
    depends_on:
      - redis
      - mongo
    deploy:
      replicas: 4
      resources:
        limits:
          cpus: "1.0"
          memory: 512M

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data

  mongo:
    image: mongo:7
    ports:
      - "27017:27017"
    volumes:
      - mongo_data:/data/db
    environment:
      - MONGO_INITDB_DATABASE=scraper_db

  scheduler:
    build:
      context: .
      dockerfile: Dockerfile
    command: python scheduler.py
    environment:
      - REDIS_URL=redis://redis:6379/0
    depends_on:
      - redis

  dashboard:
    build:
      context: .
      dockerfile: Dockerfile.dashboard
    ports:
      - "8080:8080"
    environment:
      - REDIS_URL=redis://redis:6379/0
      - MONGO_URL=mongodb://mongo:27017/scraper_db
    depends_on:
      - redis
      - mongo

volumes:
  redis_data:
  mongo_data:

Start the entire stack:

# Start all services
docker-compose up -d

# Scale scrapers
docker-compose up -d --scale scraper=8

# View logs
docker-compose logs -f scraper

# Stop everything
docker-compose down
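The compose file runs a scheduler.py that the scraper replicas depend on, but its contents aren't shown. One minimal way to wire producer and consumers together is a Redis list used as a work queue; the queue name and the SEED_URLS variable below are assumptions, not part of the stack above:

# scheduler.py (sketch): push seed URLs onto a Redis list for the scrapers
import os
import redis

r = redis.Redis.from_url(os.getenv("REDIS_URL", "redis://localhost:6379/0"))
seed_urls = os.getenv("SEED_URLS", "https://example.com").split(",")

for url in seed_urls:
    r.lpush("scrape_queue", url)
print(f"queued {len(seed_urls)} URLs")

# Inside each scraper replica, a matching consumer loop might look like:
#     while True:
#         item = r.brpop("scrape_queue", timeout=30)
#         if item is None:
#             break
#         _, url = item
#         scrape(url.decode())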

Optimizing Docker Images for Scrapers

Multi-Stage Builds

Reduce image size by separating build and runtime:

# Build stage
FROM python:3.12 AS builder

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir --target=/app/deps -r requirements.txt

# Runtime stage
FROM python:3.12-slim

WORKDIR /app

COPY --from=builder /app/deps /usr/local/lib/python3.12/site-packages/
COPY . .

CMD ["python", "scraper.py"]

This reduces image size from ~900MB to ~200MB.

Layer Caching

Order Dockerfile commands from least to most frequently changed:

FROM python:3.12-slim

# System deps change rarely - cached
RUN apt-get update && apt-get install -y gcc && rm -rf /var/lib/apt/lists/*

# Python deps change occasionally - cached after first build
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Code changes frequently - only this layer rebuilds
COPY . .

CMD ["python", "scraper.py"]

Alpine-Based Images

Use Alpine for the smallest possible images:

FROM python:3.12-alpine

RUN apk add --no-cache gcc musl-dev libxml2-dev libxslt-dev

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["python", "scraper.py"]

Image size drops from ~200MB to ~80MB.

Environment Configuration

Use environment variables for all configurable settings:

# config.py
import os

class Config:
    PROXY_URL = os.getenv("PROXY_URL", "")
    PROXY_USERNAME = os.getenv("PROXY_USERNAME", "")
    PROXY_PASSWORD = os.getenv("PROXY_PASSWORD", "")
    TARGET_URLS = os.getenv("TARGET_URLS", "").split(",")
    CONCURRENCY = int(os.getenv("CONCURRENCY", "10"))
    REQUEST_TIMEOUT = int(os.getenv("REQUEST_TIMEOUT", "30"))
    OUTPUT_DIR = os.getenv("OUTPUT_DIR", "/app/output")
    REDIS_URL = os.getenv("REDIS_URL", "redis://localhost:6379/0")
    LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")
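The scraper can then read everything from Config instead of scattering os.getenv calls through the code. A small sketch; the session setup and loop are illustrative, not part of the config module:

# Sketch: consuming Config in a scraper entrypoint
import requests
from config import Config

session = requests.Session()
if Config.PROXY_URL:
    session.proxies = {"http": Config.PROXY_URL, "https": Config.PROXY_URL}

for url in filter(None, Config.TARGET_URLS):
    response = session.get(url, timeout=Config.REQUEST_TIMEOUT)
    print(url, response.status_code)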

Use .env files for local development:

# .env
PROXY_URL=http://user:pass@proxy.example.com:8080
CONCURRENCY=5
LOG_LEVEL=DEBUG

docker run --env-file .env my-scraper

Persistent Storage and Volumes

Mount volumes for scraped data persistence:

# Mount output directory
docker run -v $(pwd)/data:/app/output my-scraper

# Named volume for database
docker volume create scraper_data
docker run -v scraper_data:/app/data my-scraper

For large-scale scraping, write directly to cloud storage:

import json
import os

import boto3

s3 = boto3.client("s3")

def save_to_s3(data, key):
    s3.put_object(
        Bucket=os.getenv("S3_BUCKET"),
        Key=key,
        Body=json.dumps(data),
        ContentType="application/json"
    )
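A usage sketch: partitioning keys by date plus a stable hash of the URL keeps objects easy to list and re-scrape. The key layout and the reuse of scrape() from the earlier scraper.py are assumptions:

# Hypothetical key layout: scrapes/YYYY/MM/DD/<url-hash>.json
import hashlib
from datetime import datetime, timezone

def key_for(url):
    digest = hashlib.sha1(url.encode()).hexdigest()[:12]
    day = datetime.now(timezone.utc).strftime("%Y/%m/%d")
    return f"scrapes/{day}/{digest}.json"

save_to_s3(scrape(TARGET_URL), key_for(TARGET_URL))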

Docker Networking for Scrapers

Proxy Containers

Run your own proxy as a Docker container alongside scrapers:

services:
  squid-proxy:
    image: ubuntu/squid:latest
    ports:
      - "3128:3128"
    volumes:
      - ./squid.conf:/etc/squid/squid.conf

  scraper:
    build: .
    environment:
      - PROXY_URL=http://squid-proxy:3128
    depends_on:
      - squid-proxy
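The compose file mounts a ./squid.conf; a minimal, permissive example for local testing could be as small as the sketch below. This is an assumption for illustration only — keep the proxy off public networks and tighten the ACLs for anything real:

# squid.conf (minimal sketch for testing only)
http_port 3128
http_access allow all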

Network Isolation

Create separate networks for internal and external traffic:

services:
  scraper:
    networks:
      - internal
      - external

  redis:
    networks:
      - internal

  mongo:
    networks:
      - internal

networks:
  internal:
    internal: true   # No internet access
  external:
    driver: bridge   # Internet access for scrapers

Health Checks and Restart Policies

services:
  scraper:
    build: .
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "python", "-c", "import requests; requests.get('http://localhost:8081/health')"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s

Add a health endpoint to your scraper:

from threading import Thread
from http.server import HTTPServer, BaseHTTPRequestHandler

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"OK")

def start_health_server():
    server = HTTPServer(("0.0.0.0", 8081), HealthHandler)
    server.serve_forever()

# Start health check server in background
Thread(target=start_health_server, daemon=True).start()

Production Deployment Tips

  1. Use .dockerignore to exclude unnecessary files:

     .git
     __pycache__
     *.pyc
     .env
     output/

  2. Pin dependency versions in requirements.txt for reproducibility.

  3. Run as a non-root user:

     RUN useradd -m scraper
     USER scraper

  4. Set resource limits to prevent runaway containers from killing the host.

  5. Use logging drivers to centralize logs:

     logging:
       driver: json-file
       options:
         max-size: "10m"
         max-file: "3"

FAQ

How much memory does a Docker scraper container need?

For HTTP-only scrapers: 128-256MB. For headless browser scrapers: 512MB-2GB depending on page complexity. Always set memory limits to prevent runaway containers.

Should I use Docker or Docker Compose?

Use Docker for single-container scrapers. Use Docker Compose when you need multiple services (Redis, database, multiple scrapers). For production at scale, consider Kubernetes.

How do I debug a scraper inside Docker?

Run interactively: docker run -it my-scraper /bin/bash. Or attach to a running container: docker exec -it container_id /bin/bash.

Can I run headless Chrome in Docker without --no-sandbox?

Not easily in standard Docker containers. The --no-sandbox flag is needed because Docker containers don’t support the Linux sandboxing Chrome uses. This is safe because the container itself provides isolation.

How do I update my scraper without downtime?

Build the new image, then run docker-compose up -d --no-deps scraper. Compose recreates the scraper containers with the new image while leaving Redis and Mongo untouched; for true one-at-a-time rolling updates, use Docker Swarm or Kubernetes.

Conclusion

Docker transforms web scraping from a fragile local process into a portable, scalable, and reproducible system. Start with a simple Dockerfile, add Docker Compose for multi-service setups, and scale to Kubernetes when your needs outgrow a single host.

The containerized approach pairs perfectly with proxy rotation — each container can use its own proxy endpoint, making your scraping operation both scalable and stealthy.
