Web Scraping with Docker: Containerized Crawlers

Running web scrapers directly on your machine creates dependency conflicts, environment inconsistencies, and deployment headaches. Docker solves all of this by packaging your scraper, its dependencies, and runtime environment into portable containers that run identically everywhere.

This guide covers building Docker images for scrapers, managing headless browsers in containers, orchestrating multi-container scraping systems, and deploying to production.

Why Docker for Web Scraping?

Docker provides several advantages for scraping projects:

  • Reproducible environments: Same Python version, same libraries, everywhere
  • Easy deployment: Push an image, pull it on any server, run
  • Isolation: Scrapers can’t interfere with each other or the host system
  • Scaling: Spin up 50 scraper containers in seconds
  • Resource limits: Cap CPU and memory per container
  • Version control: Tag images for rollback capability

Basic Scraper Dockerfile

Simple HTTP Scraper

# Dockerfile
FROM python:3.12-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy scraper code
COPY . .

# Run the scraper
CMD ["python", "scraper.py"]

# requirements.txt
requests==2.31.0
beautifulsoup4==4.12.3
lxml==5.1.0
pymongo==4.6.1

# scraper.py
import requests
from bs4 import BeautifulSoup
import json
import os

PROXY_URL = os.getenv("PROXY_URL", "")
TARGET_URL = os.getenv("TARGET_URL", "https://example.com")

def scrape(url):
    proxies = {"http": PROXY_URL, "https": PROXY_URL} if PROXY_URL else None
    response = requests.get(
        url,
        proxies=proxies,
        headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0"},
        timeout=30
    )
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "lxml")
    return {
        "url": url,
        "title": soup.title.string if soup.title else None,
        "links": [a["href"] for a in soup.find_all("a", href=True)]
    }

if __name__ == "__main__":
    result = scrape(TARGET_URL)
    print(json.dumps(result, indent=2))

Build and run:

docker build -t my-scraper .
docker run -e TARGET_URL="https://example.com" -e PROXY_URL="http://user:pass@proxy:8080" my-scraper

Headless Browser Scraper with Docker

For JavaScript-rendered pages, you need a headless browser inside the container:

Playwright in Docker

# Dockerfile.playwright
FROM mcr.microsoft.com/playwright/python:v1.41.0-jammy

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["python", "browser_scraper.py"]

# browser_scraper.py
from playwright.sync_api import sync_playwright
import json
import os

TARGET_URL = os.getenv("TARGET_URL", "https://example.com")
PROXY_SERVER = os.getenv("PROXY_SERVER", "")

def scrape_with_browser(url):
    with sync_playwright() as p:
        browser_args = {}
        if PROXY_SERVER:
            browser_args["proxy"] = {"server": PROXY_SERVER}

        browser = p.chromium.launch(
            headless=True,
            args=["--no-sandbox", "--disable-dev-shm-usage"],
            **browser_args
        )
        page = browser.new_page()
        page.goto(url, wait_until="networkidle", timeout=60000)

        # Wait for dynamic content
        page.wait_for_timeout(2000)

        data = {
            "url": url,
            "title": page.title(),
            "content": page.inner_text("body")[:5000],
            "screenshot": None
        }

        # Take screenshot for debugging
        page.screenshot(path="/app/output/screenshot.png")

        browser.close()
        return data

if __name__ == "__main__":
    result = scrape_with_browser(TARGET_URL)
    print(json.dumps(result, indent=2))

Run with volume mount for screenshots:

docker run -v $(pwd)/output:/app/output \
    -e TARGET_URL="https://example.com" \
    my-browser-scraper

Chrome + Selenium in Docker

# Dockerfile.selenium
FROM python:3.12-slim

# Install Chrome
RUN apt-get update && apt-get install -y --no-install-recommends \
    wget gnupg2 \
    && wget -q -O - https://dl.google.com/linux/linux_signing_key.pub | apt-key add - \
    && echo "deb http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list \
    && apt-get update \
    && apt-get install -y google-chrome-stable \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["python", "selenium_scraper.py"]
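The Dockerfile expects a selenium_scraper.py, which isn't shown above. A minimal sketch might look like the following; it assumes selenium>=4.6 is listed in requirements.txt so Selenium Manager can download a chromedriver matching the installed Chrome:

# selenium_scraper.py (a sketch; selenium>=4.6 with Selenium Manager assumed)
import json
import os

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

TARGET_URL = os.getenv("TARGET_URL", "https://example.com")

def scrape_with_selenium(url):
    options = Options()
    options.add_argument("--headless=new")
    options.add_argument("--no-sandbox")             # required in most containers
    options.add_argument("--disable-dev-shm-usage")  # work around small /dev/shm
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return {"url": url, "title": driver.title}
    finally:
        driver.quit()

if __name__ == "__main__":
    print(json.dumps(scrape_with_selenium(TARGET_URL), indent=2))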

Shared Memory for Chrome

Chrome in Docker needs more shared memory than the 64MB /dev/shm Docker allocates by default; without it, tabs crash on heavier pages:

# Option 1: Increase shm size
docker run --shm-size=2g my-browser-scraper

# Option 2: Use /tmp instead of /dev/shm
docker run --shm-size=256m -e CHROME_FLAGS="--disable-dev-shm-usage" my-browser-scraper
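If you run the browser scraper under Docker Compose (next section), the same setting can live in the compose file via the shm_size service key. A sketch, with the service name as an assumption:

services:
  browser-scraper:
    build:
      context: .
      dockerfile: Dockerfile.playwright
    shm_size: "2gb"   # equivalent of --shm-size=2g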

Multi-Container Scraping with Docker Compose

For production setups, use Docker Compose to orchestrate scrapers with supporting services:

# docker-compose.yml
version: "3.8"

services:
  scraper:
    build:
      context: .
      dockerfile: Dockerfile
    environment:
      - REDIS_URL=redis://redis:6379/0
      - MONGO_URL=mongodb://mongo:27017/scraper_db
      - PROXY_URL=${PROXY_URL}
      - CONCURRENCY=20
    depends_on:
      - redis
      - mongo
    deploy:
      replicas: 4
      resources:
        limits:
          cpus: "1.0"
          memory: 512M

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data

  mongo:
    image: mongo:7
    ports:
      - "27017:27017"
    volumes:
      - mongo_data:/data/db
    environment:
      - MONGO_INITDB_DATABASE=scraper_db

  scheduler:
    build:
      context: .
      dockerfile: Dockerfile
    command: python scheduler.py
    environment:
      - REDIS_URL=redis://redis:6379/0
    depends_on:
      - redis

  dashboard:
    build:
      context: .
      dockerfile: Dockerfile.dashboard
    ports:
      - "8080:8080"
    environment:
      - REDIS_URL=redis://redis:6379/0
      - MONGO_URL=mongodb://mongo:27017/scraper_db
    depends_on:
      - redis
      - mongo

volumes:
  redis_data:
  mongo_data:

Start the entire stack:

# Start all services
docker-compose up -d

# Scale scrapers
docker-compose up -d --scale scraper=8

# View logs
docker-compose logs -f scraper

# Stop everything
docker-compose down
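The compose file runs a scheduler.py that the scraper replicas depend on, but its contents aren't shown. One minimal way to wire producer and consumers together is a Redis list used as a work queue; the queue name and the SEED_URLS variable below are assumptions, not part of the stack above:

# scheduler.py (sketch): push seed URLs onto a Redis list for the scrapers
import os
import redis

r = redis.Redis.from_url(os.getenv("REDIS_URL", "redis://localhost:6379/0"))
seed_urls = os.getenv("SEED_URLS", "https://example.com").split(",")

for url in seed_urls:
    r.lpush("scrape_queue", url)
print(f"queued {len(seed_urls)} URLs")

# Inside each scraper replica, a matching consumer loop might look like:
#     while True:
#         item = r.brpop("scrape_queue", timeout=30)
#         if item is None:
#             break
#         _, url = item
#         scrape(url.decode())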

Optimizing Docker Images for Scrapers

Multi-Stage Builds

Reduce image size by separating build and runtime:

# Build stage
FROM python:3.12 AS builder

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir --target=/app/deps -r requirements.txt

# Runtime stage
FROM python:3.12-slim

WORKDIR /app

COPY --from=builder /app/deps /usr/local/lib/python3.12/site-packages/
COPY . .

CMD ["python", "scraper.py"]

This reduces image size from ~900MB to ~200MB.

Layer Caching

Order Dockerfile commands from least to most frequently changed:

FROM python:3.12-slim

# System deps change rarely - cached
RUN apt-get update && apt-get install -y gcc && rm -rf /var/lib/apt/lists/*

# Python deps change occasionally - cached after first build
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Code changes frequently - only this layer rebuilds
COPY . .

CMD ["python", "scraper.py"]

Alpine-Based Images

Use Alpine for the smallest possible images:

FROM python:3.12-alpine

RUN apk add --no-cache gcc musl-dev libxml2-dev libxslt-dev

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["python", "scraper.py"]

Image size drops from ~200MB to ~80MB.

Environment Configuration

Use environment variables for all configurable settings:

# config.py
import os

class Config:
    PROXY_URL = os.getenv("PROXY_URL", "")
    PROXY_USERNAME = os.getenv("PROXY_USERNAME", "")
    PROXY_PASSWORD = os.getenv("PROXY_PASSWORD", "")
    TARGET_URLS = os.getenv("TARGET_URLS", "").split(",")
    CONCURRENCY = int(os.getenv("CONCURRENCY", "10"))
    REQUEST_TIMEOUT = int(os.getenv("REQUEST_TIMEOUT", "30"))
    OUTPUT_DIR = os.getenv("OUTPUT_DIR", "/app/output")
    REDIS_URL = os.getenv("REDIS_URL", "redis://localhost:6379/0")
    LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")
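The scraper can then read everything from Config instead of scattering os.getenv calls through the code. A small sketch; the session setup and loop are illustrative, not part of the config module:

# Sketch: consuming Config in a scraper entrypoint
import requests
from config import Config

session = requests.Session()
if Config.PROXY_URL:
    session.proxies = {"http": Config.PROXY_URL, "https": Config.PROXY_URL}

for url in filter(None, Config.TARGET_URLS):
    response = session.get(url, timeout=Config.REQUEST_TIMEOUT)
    print(url, response.status_code)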

Use .env files for local development:

# .env
PROXY_URL=http://user:pass@proxy.example.com:8080
CONCURRENCY=5
LOG_LEVEL=DEBUG

docker run --env-file .env my-scraper

Persistent Storage and Volumes

Mount volumes for scraped data persistence:

# Mount output directory
docker run -v $(pwd)/data:/app/output my-scraper

# Named volume for database
docker volume create scraper_data
docker run -v scraper_data:/app/data my-scraper

For large-scale scraping, write directly to cloud storage:

import json
import os

import boto3

s3 = boto3.client("s3")

def save_to_s3(data, key):
    s3.put_object(
        Bucket=os.getenv("S3_BUCKET"),
        Key=key,
        Body=json.dumps(data),
        ContentType="application/json"
    )
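A usage sketch: partitioning keys by date plus a stable hash of the URL keeps objects easy to list and re-scrape. The key layout and the reuse of scrape() from the earlier scraper.py are assumptions:

# Hypothetical key layout: scrapes/YYYY/MM/DD/<url-hash>.json
import hashlib
from datetime import datetime, timezone

def key_for(url):
    digest = hashlib.sha1(url.encode()).hexdigest()[:12]
    day = datetime.now(timezone.utc).strftime("%Y/%m/%d")
    return f"scrapes/{day}/{digest}.json"

save_to_s3(scrape(TARGET_URL), key_for(TARGET_URL))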

Docker Networking for Scrapers

Proxy Containers

Run your own proxy as a Docker container alongside scrapers:

services:
  squid-proxy:
    image: ubuntu/squid:latest
    ports:
      - "3128:3128"
    volumes:
      - ./squid.conf:/etc/squid/squid.conf

  scraper:
    build: .
    environment:
      - PROXY_URL=http://squid-proxy:3128
    depends_on:
      - squid-proxy
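The compose file mounts a ./squid.conf; a minimal, permissive example for local testing could be as small as the sketch below. This is an assumption for illustration only — keep the proxy off public networks and tighten the ACLs for anything real:

# squid.conf (minimal sketch for testing only)
http_port 3128
http_access allow all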

Network Isolation

Create separate networks for internal and external traffic:

services:
  scraper:
    networks:
      - internal
      - external

  redis:
    networks:
      - internal

  mongo:
    networks:
      - internal

networks:
  internal:
    internal: true   # No internet access
  external:
    driver: bridge   # Internet access for scrapers

Health Checks and Restart Policies

services:
  scraper:
    build: .
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "python", "-c", "import requests; requests.get('http://localhost:8081/health')"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s

Add a health endpoint to your scraper:

from threading import Thread
from http.server import HTTPServer, BaseHTTPRequestHandler

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"OK")

def start_health_server():
    server = HTTPServer(("0.0.0.0", 8081), HealthHandler)
    server.serve_forever()

# Start health check server in background
Thread(target=start_health_server, daemon=True).start()

Production Deployment Tips

  1. Use .dockerignore to exclude unnecessary files:

     .git
     __pycache__
     *.pyc
     .env
     output/

  2. Pin dependency versions in requirements.txt for reproducibility.

  3. Run as a non-root user:

     RUN useradd -m scraper
     USER scraper

  4. Set resource limits to prevent runaway containers from killing the host.

  5. Use logging drivers to centralize logs:

     logging:
       driver: json-file
       options:
         max-size: "10m"
         max-file: "3"

FAQ

How much memory does a Docker scraper container need?

For HTTP-only scrapers: 128-256MB. For headless browser scrapers: 512MB-2GB depending on page complexity. Always set memory limits to prevent runaway containers.

Should I use Docker or Docker Compose?

Use Docker for single-container scrapers. Use Docker Compose when you need multiple services (Redis, database, multiple scrapers). For production at scale, consider Kubernetes.

How do I debug a scraper inside Docker?

Run interactively: docker run -it my-scraper /bin/bash. Or attach to a running container: docker exec -it container_id /bin/bash.

Can I run headless Chrome in Docker without --no-sandbox?

Not easily in standard Docker containers. The --no-sandbox flag is needed because Docker containers don’t support the Linux sandboxing Chrome uses. This is safe because the container itself provides isolation.

How do I update my scraper without downtime?

Build the new image, then run docker-compose up -d --no-deps scraper. Compose recreates the scraper containers with the new image while leaving Redis and Mongo untouched; for true one-at-a-time rolling updates, use Docker Swarm or Kubernetes.

Conclusion

Docker transforms web scraping from a fragile local process into a portable, scalable, and reproducible system. Start with a simple Dockerfile, add Docker Compose for multi-service setups, and scale to Kubernetes when your needs outgrow a single host.

The containerized approach pairs perfectly with proxy rotation — each container can use its own proxy endpoint, making your scraping operation both scalable and stealthy.
