Kubernetes Web Scraping Deployment: Scalable Crawlers
Kubernetes provides the scalability and resilience needed for enterprise-grade web scraping. When you need to process millions of pages daily through rotating proxies, Kubernetes auto-scales your crawler fleet based on queue depth, handles failures gracefully, and distributes traffic across geographic regions.
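Concretely, each crawler pod can be little more than a loop that pops URLs from the Redis task queue and fetches them through the rotating proxy. A minimal sketch, assuming the redis-py and requests packages and the `REDIS_URL`/`PROXY_URL` environment variables defined in the Deployment manifest (the queue and result key names here are illustrative):

```python
# Minimal crawler worker loop (sketch): pop URLs from a Redis list and
# fetch them through the rotating proxy configured via environment vars.
import os

def build_proxies(proxy_url):
    """Map a single proxy URL onto the dict format requests expects."""
    return {"http": proxy_url, "https": proxy_url}

def run_worker(queue_name="scraping-tasks"):
    import redis      # third-party: redis-py
    import requests   # third-party

    r = redis.Redis.from_url(os.environ["REDIS_URL"])
    proxies = build_proxies(os.environ["PROXY_URL"])
    while True:
        # BLPOP blocks until a task is available; returns (key, value)
        _, url = r.blpop(queue_name)
        try:
            resp = requests.get(url.decode(), proxies=proxies, timeout=30)
            r.rpush("scraping-results", resp.text)
        except requests.RequestException:
            r.rpush(queue_name, url)  # requeue failed tasks

# Inside the container entrypoint this would simply be: run_worker()
```

Because the workers only talk to Redis, scaling the fleet is just a matter of adding pods; no coordination between crawlers is needed.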
Architecture
```
Kubernetes Cluster
├── Namespace: scraping
│   ├── Deployment: crawler (auto-scaled 1-50 pods)
│   ├── Deployment: proxy-sidecar (injected per pod)
│   ├── CronJob: scheduler (triggers scraping runs)
│   ├── Service: redis (task queue)
│   ├── StatefulSet: postgres (results storage)
│   └── HPA: auto-scaler (scales on queue depth)
```

Crawler Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-crawler
  namespace: scraping
spec:
  replicas: 3
  selector:
    matchLabels:
      app: crawler
  template:
    metadata:
      labels:
        app: crawler
    spec:
      containers:
        - name: crawler
          image: your-registry/crawler:latest
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
          env:
            - name: REDIS_URL
              value: "redis://redis-svc:6379"
            - name: PROXY_URL
              valueFrom:
                secretKeyRef:
                  name: proxy-credentials
                  key: proxy_url
            - name: CONCURRENT_TASKS
              value: "5"
        - name: proxy-sidecar
          image: your-registry/proxy-rotator:latest
          ports:
            - containerPort: 8888
          env:
            - name: UPSTREAM_PROXY
              valueFrom:
                secretKeyRef:
                  name: proxy-credentials
                  key: upstream_url
```

Horizontal Pod Autoscaler
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: crawler-hpa
  namespace: scraping
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-crawler
  minReplicas: 1
  maxReplicas: 50
  metrics:
    - type: External
      external:
        metric:
          name: redis_queue_length
          selector:
            matchLabels:
              queue: scraping-tasks
        target:
          type: AverageValue
          averageValue: "100"
```

Secrets Management
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: proxy-credentials
  namespace: scraping
type: Opaque
stringData:
  proxy_url: "http://user:pass@gate.provider.com:7777"
  upstream_url: "http://user:pass@residential.provider.com:8080"
  db_password: "secure_password_here"
```

CronJob for Scheduled Scraping
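The CronJob manifest below invokes `enqueue_urls.py` every morning. A minimal sketch of what such a script might look like, assuming redis-py and a seed-list file whose path is a placeholder of ours:

```python
# Sketch of enqueue_urls.py: seed the Redis task queue that the crawler
# pods consume and the HPA external metric watches.
import os

def load_urls(path="urls.txt"):
    """Read one URL per line, skipping blank lines and comments."""
    with open(path) as f:
        return [line.strip() for line in f
                if line.strip() and not line.startswith("#")]

def enqueue(urls, queue_name="scraping-tasks"):
    import redis  # third-party: redis-py
    r = redis.Redis.from_url(os.environ["REDIS_URL"])
    if urls:
        r.rpush(queue_name, *urls)
    return len(urls)

# In the CronJob container: enqueue(load_urls())
```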
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-scrape
  namespace: scraping
spec:
  schedule: "0 6 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: scheduler
              image: your-registry/scheduler:latest
              command: ["python", "enqueue_urls.py"]
              env:
                - name: REDIS_URL
                  value: "redis://redis-svc:6379"
          restartPolicy: OnFailure
```

FAQ
How many pods do I need for large-scale scraping?
Each pod typically handles 5-20 concurrent requests. For 1,000 requests per minute, start with 10-20 pods. Kubernetes HPA will auto-scale based on your queue depth. Monitor resource usage and adjust.
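That rule of thumb is simple arithmetic; this sketch makes it explicit (the defaults of 5 concurrent slots per pod and 3 seconds per request are our assumptions, matching the lower end of the range above):

```python
import math

def pods_needed(requests_per_minute, concurrent_per_pod=5, avg_request_seconds=3):
    """Estimate pod count: each pod sustains concurrent_per_pod slots,
    and each slot completes 60 / avg_request_seconds requests per minute."""
    per_pod_throughput = concurrent_per_pod * (60 / avg_request_seconds)
    return math.ceil(requests_per_minute / per_pod_throughput)

print(pods_needed(1000))  # → 10 pods for 1,000 requests/minute
```

Treat the result as a starting replica count; the HPA handles the rest once real queue-depth data comes in.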
How do I handle proxy credentials securely?
Use Kubernetes Secrets and mount them as environment variables. Never hardcode proxy credentials in your container images. Consider using external secret managers like HashiCorp Vault for larger deployments.
Can I use different proxies for different scraping targets?
Yes. Create separate deployments or use configuration maps to assign different proxy types to different crawling tasks. Social media targets might use mobile proxies while e-commerce uses residential.
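One way to wire this up is to have the crawler read the mounted targets config and select a proxy pool per domain. A sketch, where the pool URLs and the `proxy_for` helper are our own illustrative names:

```python
import json

# Hypothetical proxy pools keyed by type; URLs are placeholders.
PROXY_POOLS = {
    "residential": "http://user:pass@residential.provider.com:8080",
    "datacenter": "http://user:pass@dc.provider.com:9000",
    "mobile": "http://user:pass@mobile.provider.com:7000",
}

def proxy_for(domain, targets):
    """Return the proxy URL configured for a domain (default: datacenter)."""
    for t in targets:
        if t["domain"] == domain:
            return PROXY_POOLS[t.get("proxy_type", "datacenter")]
    return PROXY_POOLS["datacenter"]

# targets.json as mounted from the ConfigMap described later in this guide
config = json.loads('{"targets": [{"domain": "example.com", "proxy_type": "residential"}]}')
print(proxy_for("example.com", config["targets"]))
```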
Resource Management
Properly sizing resources prevents OOM kills and ensures efficient cluster usage:
```yaml
# Resource quotas for the scraping namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: scraping-quota
  namespace: scraping
spec:
  hard:
    requests.cpu: "20"
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"
    pods: "100"
```

Persistent Volume for Results
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: scraping-results
  namespace: scraping
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  storageClassName: standard
```

Monitoring with Prometheus
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: crawler-monitor
  namespace: scraping
spec:
  selector:
    matchLabels:
      app: crawler
  endpoints:
    - port: metrics
      interval: 15s
```

Network Policies
Restrict traffic flow for security:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: scraper-network-policy
  namespace: scraping
spec:
  podSelector:
    matchLabels:
      app: crawler
  policyTypes:
    - Egress
    - Ingress
  egress:
    - to: [] # Allow all egress (scrapers need internet access)
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: redis
    - from:
        - podSelector:
            matchLabels:
              app: monitoring
```

Configuration Maps
Store scraping configurations separately from code:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: scraping-config
  namespace: scraping
data:
  targets.json: |
    {
      "targets": [
        {"domain": "example.com", "rate_limit": "10/min", "proxy_type": "residential"},
        {"domain": "store.example.com", "rate_limit": "5/min", "proxy_type": "datacenter"}
      ]
    }
  proxy-config.yaml: |
    rotation_strategy: weighted
    health_check_interval: 60
    max_retries: 3
```

Rolling Updates
Deploy new scraper versions without downtime:
```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
```

```bash
# Deploy new version
kubectl set image deployment/web-crawler crawler=your-registry/crawler:v2.0 -n scraping

# Watch rollout
kubectl rollout status deployment/web-crawler -n scraping

# Rollback if needed
kubectl rollout undo deployment/web-crawler -n scraping
```

Troubleshooting
| Issue | Diagnosis | Fix |
|---|---|---|
| Pods crashing | `kubectl logs pod-name -n scraping` | Check memory limits, proxy config |
| Slow scaling | `kubectl describe hpa crawler-hpa` | Adjust HPA thresholds |
| DNS resolution | `kubectl exec pod -- nslookup target.com` | Check CoreDNS config |
| Proxy connectivity | `kubectl exec pod -- curl -x proxy:port https://httpbin.org/ip` | Verify secrets, network policies |
| Resource exhaustion | `kubectl top pods -n scraping` | Increase limits or scale down |
Cost Optimization
| Strategy | Savings | Trade-off |
|---|---|---|
| Spot/Preemptible instances | 60-80% | Pods may be evicted |
| Cluster autoscaler | Variable | Cold start latency |
| Right-sizing resources | 20-40% | Requires monitoring |
| Off-peak scheduling | 30-50% | Limited scraping windows |
```yaml
# Use spot instances for crawlers
spec:
  nodeSelector:
    cloud.google.com/gke-spot: "true"
  tolerations:
    - key: cloud.google.com/gke-spot
      operator: Equal
      value: "true"
      effect: NoSchedule
```

For container-level proxy configuration, see our Docker proxy setup guide. For simpler deployments, start with our Docker Compose scraping guide.
Related Reading
- Build an Anti-Detection Test Suite: Verify Browser Stealth
- Build a News Crawler in Python: Step-by-Step Tutorial
- AJAX Request Interception: Scraping API Calls Directly
- Azure Functions for Serverless Web Scraping: the Complete Guide
- How to Configure Proxies on iPhone and Android
- How to Use Proxies in Node.js (Axios, Fetch, Puppeteer)