Monitoring Web Scrapers: Alerting & Dashboards
A scraping pipeline without monitoring is a ticking time bomb. Target websites change structure, proxies go down, rate limits tighten, and CAPTCHAs appear — all without warning. Proper monitoring catches these issues in minutes instead of days, protecting both data quality and your proxy budget.
What to Monitor
| Metric | Why It Matters | Alert Threshold |
|---|---|---|
| Success rate | Detect blocking or site changes | < 90% |
| Response time | Detect slow proxies or throttling | > 5 seconds avg |
| Error rate by type | Diagnose specific issues | > 5% for any type |
| Proxy health | Detect dead proxies | < 80% healthy |
| Data quality | Detect parser breakage | Empty fields > 10% |
| Bandwidth usage | Control costs | > daily budget |
| Queue depth | Detect processing bottlenecks | Growing > 1 hour |
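The thresholds in the table can be wired into a simple checker that compares current metric values against their limits and returns the ones that breach. A minimal sketch, assuming metrics arrive as a plain dict your own collection code populates (the metric names and limits below mirror the table, not any particular library):

```python
# Alert-threshold sketch: names and limits mirror the table above.
# The metrics dict is assumed to be filled in by your collection code.
THRESHOLDS = {
    'success_rate':    lambda v: v < 0.90,   # < 90% success
    'avg_latency_s':   lambda v: v > 5,      # > 5 s average response time
    'error_rate':      lambda v: v > 0.05,   # > 5% for any error type
    'healthy_proxies': lambda v: v < 0.80,   # < 80% of pool healthy
    'empty_field_pct': lambda v: v > 0.10,   # > 10% empty fields
}

def breached(metrics: dict) -> list[str]:
    """Return the names of metrics that violate their alert threshold."""
    return [name for name, check in THRESHOLDS.items()
            if name in metrics and check(metrics[name])]

# Example: a low success rate and too many empty fields should both fire
alerts = breached({'success_rate': 0.85, 'avg_latency_s': 1.2,
                   'empty_field_pct': 0.25})
```

Keeping the thresholds in one dict makes them easy to review and tune in one place as targets tighten or loosen.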
Prometheus Metrics Export
```python
import time

import httpx
from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Define metrics
REQUESTS_TOTAL = Counter(
    'scraper_requests_total',
    'Total scraping requests',
    ['target_domain', 'status', 'proxy_provider']
)
REQUEST_DURATION = Histogram(
    'scraper_request_duration_seconds',
    'Request duration in seconds',
    ['target_domain'],
    buckets=[0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30]
)
PROXY_HEALTH = Gauge(
    'scraper_proxy_healthy',
    'Number of healthy proxies',
    ['provider']
)
ITEMS_SCRAPED = Counter(
    'scraper_items_scraped_total',
    'Total items successfully scraped',
    ['target_domain']
)
DATA_QUALITY = Gauge(
    'scraper_data_quality_score',
    'Data quality score (0-1)',
    ['target_domain', 'field']
)
QUEUE_SIZE = Gauge(
    'scraper_queue_size',
    'Current queue size',
    ['queue_name']
)

class MonitoredScraper:
    def __init__(self, metrics_port=9090):
        start_http_server(metrics_port)
        print(f"Metrics available at http://localhost:{metrics_port}/metrics")

    async def scrape(self, url, proxy, domain=None):
        domain = domain or httpx.URL(url).host
        start = time.time()
        try:
            async with httpx.AsyncClient(proxy=proxy, timeout=30) as client:
                response = await client.get(url)
            duration = time.time() - start
            REQUEST_DURATION.labels(target_domain=domain).observe(duration)
            if response.status_code == 200:
                REQUESTS_TOTAL.labels(
                    target_domain=domain,
                    status='success',
                    proxy_provider='default'
                ).inc()
            else:
                REQUESTS_TOTAL.labels(
                    target_domain=domain,
                    status=f'http_{response.status_code}',
                    proxy_provider='default'
                ).inc()
            return response
        except Exception:
            REQUESTS_TOTAL.labels(
                target_domain=domain,
                status='error',
                proxy_provider='default'
            ).inc()
            raise
```
Grafana Dashboard
```json
{
  "dashboard": {
    "title": "Web Scraper Monitoring",
    "panels": [
      {
        "title": "Request Success Rate",
        "type": "gauge",
        "targets": [{
          "expr": "sum(rate(scraper_requests_total{status='success'}[5m])) / sum(rate(scraper_requests_total[5m])) * 100"
        }]
      },
      {
        "title": "Requests per Second",
        "type": "graph",
        "targets": [{
          "expr": "sum(rate(scraper_requests_total[1m])) by (target_domain)"
        }]
      },
      {
        "title": "Response Time (p95)",
        "type": "graph",
        "targets": [{
          "expr": "histogram_quantile(0.95, rate(scraper_request_duration_seconds_bucket[5m]))"
        }]
      },
      {
        "title": "Proxy Health",
        "type": "gauge",
        "targets": [{
          "expr": "scraper_proxy_healthy"
        }]
      }
    ]
  }
}
```
Data Quality Monitoring
```python
class DataQualityMonitor:
    def __init__(self, required_fields):
        self.required_fields = required_fields
        self.total_items = 0
        self.field_counts = {f: 0 for f in required_fields}

    def check_item(self, item, domain):
        self.total_items += 1
        for field in self.required_fields:
            value = item.get(field)
            if value and str(value).strip():
                self.field_counts[field] += 1
        # Update Prometheus metrics
        for field, count in self.field_counts.items():
            quality = count / self.total_items if self.total_items > 0 else 0
            DATA_QUALITY.labels(
                target_domain=domain,
                field=field
            ).set(quality)

    def report(self):
        print("\nData Quality Report:")
        for field, count in self.field_counts.items():
            pct = count / self.total_items * 100 if self.total_items > 0 else 0
            status = "OK" if pct > 90 else "WARNING" if pct > 70 else "CRITICAL"
            print(f"  {field}: {pct:.1f}% populated [{status}]")

# Usage
monitor = DataQualityMonitor(['title', 'price', 'description', 'image_url'])
for item in scraped_items:
    monitor.check_item(item, 'example.com')
monitor.report()
```
Alert Configuration
```yaml
# prometheus_alerts.yml
groups:
  - name: scraper_alerts
    rules:
      - alert: LowSuccessRate
        expr: sum(rate(scraper_requests_total{status="success"}[5m])) / sum(rate(scraper_requests_total[5m])) < 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Scraper success rate below 90%"
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(scraper_request_duration_seconds_bucket[5m])) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above 5 seconds"
      - alert: ProxyPoolDegraded
        expr: scraper_proxy_healthy < 5
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Less than 5 healthy proxies"
```
Internal Links
- Web Scraping Architecture — build monitorable architectures
- Proxy Failover Strategies — automated failure response
- Proxy Performance Benchmarks — establish performance baselines
- Building a Web Scraping Dashboard — custom monitoring UI
- Data Validation for Scraped Data — automated quality checks
FAQ
Do I really need monitoring for web scraping?
If you scrape more than once (scheduled jobs, ongoing data collection), yes. Without monitoring, you discover problems only when stakeholders report missing or stale data — hours or days later. Monitoring catches issues within minutes.
What is the simplest monitoring setup?
Start with logging success/failure counts to a file and a cron job that checks the log. Graduate to Prometheus + Grafana when you need dashboards and alerts. For small projects, even a Slack webhook on error count > threshold works well.
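That Slack-webhook approach fits in a few lines of stdlib Python. A sketch, assuming a standard Slack incoming-webhook URL (the URL and the threshold of 50 below are placeholders, not values from this article):

```python
import json
import urllib.request

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def should_alert(error_count: int, threshold: int = 50) -> bool:
    """Decide whether the error count warrants an alert."""
    return error_count > threshold

def send_slack_alert(error_count: int, threshold: int = 50) -> None:
    """Post a message to the Slack incoming webhook."""
    payload = {"text": f"Scraper alert: {error_count} errors "
                       f"(threshold {threshold})"}
    req = urllib.request.Request(
        SLACK_WEBHOOK,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # raises on failure, surfacing delivery errors

# In the cron job: only touch the network once the threshold is crossed
# if should_alert(error_count):
#     send_slack_alert(error_count)
```

Separating the decision (`should_alert`) from delivery keeps the threshold logic testable without network access.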
How do I detect when a website changes its structure?
Monitor data quality — when your parser stops extracting certain fields, the quality score drops. Set alerts on per-field extraction rates. Also run periodic checks comparing sample scrapes against expected patterns.
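The "expected patterns" check can be as simple as one regex per field run over a sample scrape. A sketch with hypothetical patterns for a product page (the field names and regexes are illustrative, not taken from any real target):

```python
import re

# Hypothetical expectations for a product page: each field's value should
# match its pattern; a sudden drop in match rate suggests a layout change.
EXPECTED = {
    'price': re.compile(r'^\$?\d+(\.\d{2})?$'),
    'image_url': re.compile(r'^https?://'),
}

def pattern_match_rate(items: list[dict]) -> dict[str, float]:
    """Fraction of items whose field value matches the expected pattern."""
    rates = {}
    for field, pattern in EXPECTED.items():
        matched = sum(1 for it in items
                      if pattern.match(str(it.get(field, ''))))
        rates[field] = matched / len(items) if items else 0.0
    return rates

# If 'price' drops from ~1.0 to ~0.0 between runs, the price selector
# has most likely broken, even if requests still return HTTP 200.
rates = pattern_match_rate([
    {'price': '19.99', 'image_url': 'https://cdn.example.com/a.jpg'},
    {'price': '', 'image_url': 'http://cdn.example.com/b.jpg'},
])
```

This catches the common failure mode where the scrape "succeeds" at the HTTP level but the selectors silently extract nothing.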
What metrics matter most?
Success rate (are requests working?), data quality (is the parser working?), and queue depth (is processing keeping up?). These three metrics catch 90% of scraping issues. Add latency and proxy health for production systems.
How much does monitoring infrastructure cost?
Prometheus and Grafana are free and open source. A small VPS ($5/month) runs both comfortably. Cloud-hosted alternatives (Grafana Cloud, Datadog) have free tiers for small workloads. The cost of not monitoring (missed data, wasted proxy budget) far exceeds monitoring infrastructure costs.
Related Reading
- AJAX Request Interception: Scraping API Calls Directly
- Azure Functions for Serverless Web Scraping: the Complete Guide
- Build an Anti-Detection Test Suite: Verify Browser Stealth
- Build a News Crawler in Python: Step-by-Step Tutorial
- How to Configure Proxies on iPhone and Android
- How to Use Proxies in Node.js (Axios, Fetch, Puppeteer)