Kubernetes Web Scraping Deployment: Scalable Crawlers

Kubernetes provides the scalability and resilience needed for enterprise-grade web scraping. When you need to process millions of pages daily through rotating proxies, Kubernetes can auto-scale your crawler fleet based on queue depth, restart failed pods automatically, and schedule workloads onto node pools in different regions.

Architecture

Kubernetes Cluster
└── Namespace: scraping
    ├── Deployment: crawler (auto-scaled 1-50 pods, proxy-rotator sidecar in each pod)
    ├── CronJob: scheduler (triggers scraping runs)
    ├── Service: redis (task queue)
    ├── StatefulSet: postgres (results storage)
    └── HPA: auto-scaler (scales on queue depth)

Crawler Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-crawler
  namespace: scraping
spec:
  replicas: 3
  selector:
    matchLabels:
      app: crawler
  template:
    metadata:
      labels:
        app: crawler
    spec:
      containers:
        - name: crawler
          image: your-registry/crawler:latest
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
          env:
            - name: REDIS_URL
              value: "redis://redis-svc:6379"
            - name: PROXY_URL
              valueFrom:
                secretKeyRef:
                  name: proxy-credentials
                  key: proxy_url
            - name: CONCURRENT_TASKS
              value: "5"

        - name: proxy-sidecar
          image: your-registry/proxy-rotator:latest
          ports:
            - containerPort: 8888
          env:
            - name: UPSTREAM_PROXY
              valueFrom:
                secretKeyRef:
                  name: proxy-credentials
                  key: upstream_url
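
A minimal sketch of the worker loop running inside the crawler container, assuming the redis and requests libraries, a Redis list named scraping-tasks, and the REDIS_URL / PROXY_URL variables injected by the Deployment above; the queue name and the result storage are illustrative, not part of the manifest.

# crawler.py - minimal single-threaded worker loop (CONCURRENT_TASKS would drive a pool).
import os

import redis
import requests

QUEUE = "scraping-tasks"                               # assumed queue name
queue = redis.Redis.from_url(os.environ["REDIS_URL"])  # injected by the Deployment
proxy = os.environ.get("PROXY_URL")                    # from the proxy-credentials Secret
proxies = {"http": proxy, "https": proxy} if proxy else None

def process(url: str) -> None:
    # Route the request through the rotating proxy and keep the raw HTML.
    resp = requests.get(url, proxies=proxies, timeout=30)
    resp.raise_for_status()
    queue.hset("results", url, resp.text)              # placeholder persistence

while True:
    # BLPOP blocks until a URL is available, so idle pods burn almost no CPU.
    _, raw_url = queue.blpop(QUEUE)
    try:
        process(raw_url.decode())
    except requests.RequestException:
        queue.rpush(QUEUE, raw_url)                    # naive retry: put the URL back on the queue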

Horizontal Pod Autoscaler

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: crawler-hpa
  namespace: scraping
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-crawler
  minReplicas: 1
  maxReplicas: 50
  metrics:
    - type: External
      external:
        metric:
          name: redis_queue_length
          selector:
            matchLabels:
              queue: scraping-tasks
        target:
          type: AverageValue
          averageValue: "100"
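
The External metric above only works if something exposes the Redis queue length to the Kubernetes metrics API, typically Prometheus plus prometheus-adapter. A minimal exporter sketch, assuming the prometheus_client and redis libraries; the adapter rule that maps this gauge to the redis_queue_length external metric is not shown.

# queue_exporter.py - publish the queue depth as a Prometheus gauge.
import os
import time

import redis
from prometheus_client import Gauge, start_http_server

QUEUE = "scraping-tasks"
queue_length = Gauge("redis_queue_length", "Pending scraping tasks", ["queue"])
conn = redis.Redis.from_url(os.environ.get("REDIS_URL", "redis://redis-svc:6379"))

start_http_server(9100)                                # Prometheus scrapes this port
while True:
    queue_length.labels(queue=QUEUE).set(conn.llen(QUEUE))
    time.sleep(15)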

Secrets Management

apiVersion: v1
kind: Secret
metadata:
  name: proxy-credentials
  namespace: scraping
type: Opaque
stringData:
  proxy_url: "http://user:pass@gate.provider.com:7777"
  upstream_url: "http://user:pass@residential.provider.com:8080"
  db_password: "secure_password_here"

CronJob for Scheduled Scraping

apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-scrape
  namespace: scraping
spec:
  schedule: "0 6 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: scheduler
              image: your-registry/scheduler:latest
              command: ["python", "enqueue_urls.py"]
              env:
                - name: REDIS_URL
                  value: "redis://redis-svc:6379"
          restartPolicy: OnFailure
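
A sketch of what enqueue_urls.py might do; the URL list and queue name are illustrative, and in practice the targets would come from a sitemap, database, or the ConfigMap shown later.

# enqueue_urls.py - push the day's URLs onto the Redis task queue.
import os

import redis

QUEUE = "scraping-tasks"                               # must match the crawler's queue name
conn = redis.Redis.from_url(os.environ["REDIS_URL"])

urls = [
    "https://example.com/catalog?page=1",
    "https://example.com/catalog?page=2",
]

# RPUSH appends to the tail; crawler pods BLPOP from the head, giving FIFO order.
conn.rpush(QUEUE, *urls)
print(f"enqueued {len(urls)} URLs onto {QUEUE}")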

FAQ

How many pods do I need for large-scale scraping?

Each pod typically handles 5-20 concurrent requests. With proxy round-trips of a few seconds, that works out to roughly 50-100 requests per minute per pod, so for 1,000 requests per minute start with 10-20 pods. The HPA then scales on queue depth; monitor resource usage and adjust.

How do I handle proxy credentials securely?

Use Kubernetes Secrets and mount them as environment variables. Never hardcode proxy credentials in your container images. Consider using external secret managers like HashiCorp Vault for larger deployments.

Can I use different proxies for different scraping targets?

Yes. Create separate deployments or use configuration maps to assign different proxy types to different crawling tasks. Social media targets might use mobile proxies while e-commerce uses residential.

Resource Management

Properly sizing resources prevents OOM kills and ensures efficient cluster usage:

# Resource quotas for the scraping namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: scraping-quota
  namespace: scraping
spec:
  hard:
    requests.cpu: "20"
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"
    pods: "100"

Persistent Volume for Results

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: scraping-results
  namespace: scraping
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  storageClassName: standard
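
Crawlers that write raw pages to this shared volume (instead of, or alongside, Postgres) simply mount the claim and write under its path. A small sketch assuming the claim is mounted at /data/results; the mount path and file naming are illustrative.

# save_result.py - write a scraped page onto the shared ReadWriteMany volume.
import hashlib
import pathlib

RESULTS_DIR = pathlib.Path("/data/results")            # assumed volumeMount path for the PVC

def save_page(url: str, html: str) -> pathlib.Path:
    # Hash the URL so the filename is stable and filesystem-safe.
    name = hashlib.sha256(url.encode()).hexdigest() + ".html"
    path = RESULTS_DIR / name
    path.write_text(html, encoding="utf-8")
    return path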

Monitoring with Prometheus

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: crawler-monitor
  namespace: scraping
spec:
  selector:
    matchLabels:
      app: crawler
  endpoints:
    - port: metrics
      interval: 15s
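
The ServiceMonitor only finds something to scrape if a Service labelled app: crawler exposes a port named metrics serving Prometheus text format. A sketch of instrumenting the crawler with prometheus_client; the metric names and port 9090 are illustrative and must match the Service's named port.

# metrics.py - expose crawler counters on the port the ServiceMonitor scrapes.
from prometheus_client import Counter, Histogram, start_http_server

pages_scraped = Counter("pages_scraped_total", "Pages fetched successfully")
scrape_errors = Counter("scrape_errors_total", "Failed fetch attempts")
fetch_seconds = Histogram("fetch_duration_seconds", "Time spent per request")

start_http_server(9090)                                # call once at crawler startup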

Network Policies

Restrict traffic flow for security:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: scraper-network-policy
  namespace: scraping
spec:
  podSelector:
    matchLabels:
      app: crawler
  policyTypes:
    - Egress
    - Ingress
  egress:
    - to: []  # Allow all egress (scrapers need internet access)
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: redis
    - from:
        - podSelector:
            matchLabels:
              app: monitoring

Configuration Maps

Store scraping configurations separately from code:

apiVersion: v1
kind: ConfigMap
metadata:
  name: scraping-config
  namespace: scraping
data:
  targets.json: |
    {
      "targets": [
        {"domain": "example.com", "rate_limit": "10/min", "proxy_type": "residential"},
        {"domain": "store.example.com", "rate_limit": "5/min", "proxy_type": "datacenter"}
      ]
    }
  proxy-config.yaml: |
    rotation_strategy: weighted
    health_check_interval: 60
    max_retries: 3
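
Mounting this ConfigMap into the crawler lets the code pick a proxy type per target, as suggested in the FAQ above. A sketch that loads targets.json from an assumed mount path of /etc/scraping and maps proxy_type to a gateway URL; the mount path and the RESIDENTIAL_PROXY_URL / DATACENTER_PROXY_URL variables are illustrative.

# target_config.py - choose a proxy per target domain from the mounted ConfigMap.
import json
import os
from urllib.parse import urlparse

CONFIG_PATH = "/etc/scraping/targets.json"             # assumed ConfigMap volumeMount
with open(CONFIG_PATH) as fh:
    TARGETS = {t["domain"]: t for t in json.load(fh)["targets"]}

# Gateway URLs per proxy type, supplied via Secrets in a real deployment.
PROXY_POOLS = {
    "residential": os.environ.get("RESIDENTIAL_PROXY_URL"),
    "datacenter": os.environ.get("DATACENTER_PROXY_URL"),
}

def proxy_for(url: str):
    # Return the gateway for this domain's proxy_type, or None if the domain is unlisted.
    target = TARGETS.get(urlparse(url).hostname or "")
    return PROXY_POOLS.get(target["proxy_type"]) if target else None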

Rolling Updates

Deploy new scraper versions without downtime by adding a rolling-update strategy to the web-crawler Deployment:

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%

# Deploy new version
kubectl set image deployment/web-crawler crawler=your-registry/crawler:v2.0 -n scraping

# Watch rollout
kubectl rollout status deployment/web-crawler -n scraping

# Rollback if needed
kubectl rollout undo deployment/web-crawler -n scraping

Troubleshooting

| Issue | Diagnosis | Fix |
| --- | --- | --- |
| Pods crashing | kubectl logs pod-name -n scraping | Check memory limits, proxy config |
| Slow scaling | kubectl describe hpa crawler-hpa | Adjust HPA thresholds |
| DNS resolution | kubectl exec pod -- nslookup target.com | Check CoreDNS config |
| Proxy connectivity | kubectl exec pod -- curl -x proxy:port https://httpbin.org/ip | Verify secrets, network policies |
| Resource exhaustion | kubectl top pods -n scraping | Increase limits or scale down |

Cost Optimization

| Strategy | Savings | Trade-off |
| --- | --- | --- |
| Spot/Preemptible instances | 60-80% | Pods may be evicted |
| Cluster autoscaler | Variable | Cold start latency |
| Right-sizing resources | 20-40% | Requires monitoring |
| Off-peak scheduling | 30-50% | Limited scraping windows |

# Use spot instances for crawlers
spec:
  nodeSelector:
    cloud.google.com/gke-spot: "true"
  tolerations:
    - key: cloud.google.com/gke-spot
      operator: Equal
      value: "true"
      effect: NoSchedule
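
Because spot nodes can be reclaimed on short notice, the crawler should treat SIGTERM as a cue to stop pulling new tasks and re-enqueue whatever is in flight. A minimal sketch; the shutting_down flag would be checked before each BLPOP in the worker loop shown earlier.

# graceful_shutdown.py - drain cleanly when a spot node is reclaimed.
import signal

shutting_down = False

def _handle_sigterm(signum, frame):
    # Kubernetes sends SIGTERM before evicting the pod; stop taking new work.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, _handle_sigterm)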

For container-level proxy configuration, see our Docker proxy setup guide. For simpler deployments, start with our Docker Compose scraping guide.

