Monitoring Web Scrapers: Alerting & Dashboards

A scraping pipeline without monitoring is a ticking time bomb. Target websites change structure, proxies go down, rate limits tighten, and CAPTCHAs appear, all without warning. Proper monitoring catches issues in minutes instead of days, protecting data quality and preventing wasted proxy spend.

What to Monitor

| Metric | Why It Matters | Alert Threshold |
|---|---|---|
| Success rate | Detect blocking or site changes | < 90% |
| Response time | Detect slow proxies or throttling | > 5 seconds avg |
| Error rate by type | Diagnose specific issues | > 5% for any type |
| Proxy health | Detect dead proxies | < 80% healthy |
| Data quality | Detect parser breakage | Empty fields > 10% |
| Queue depth | Detect processing bottlenecks | Growing > 1 hour |
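
To make the thresholds concrete, here is a minimal sketch of evaluating a batch of request results against them. The result-dict fields (status, duration) and the bandwidth handling are assumptions for illustration:

def check_thresholds(results, bandwidth_used, daily_budget):
    # results: list of dicts like {"status": "success", "duration": 1.2}
    # Field names and thresholds are illustrative, mirroring the table above.
    total = len(results)
    if total == 0:
        return ["no results in window: scraper may be stalled"]
    alerts = []
    success = sum(1 for r in results if r["status"] == "success")
    if success / total < 0.90:
        alerts.append(f"success rate {success / total:.1%} below 90%")
    avg_latency = sum(r["duration"] for r in results) / total
    if avg_latency > 5.0:
        alerts.append(f"average response time {avg_latency:.1f}s above 5s")
    if bandwidth_used > daily_budget:
        alerts.append("bandwidth usage over daily budget")
    return alerts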

Prometheus Metrics Export

import time
from urllib.parse import urlparse

import httpx
from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Define metrics
REQUESTS_TOTAL = Counter(
    'scraper_requests_total',
    'Total scraping requests',
    ['target_domain', 'status', 'proxy_provider']
)

REQUEST_DURATION = Histogram(
    'scraper_request_duration_seconds',
    'Request duration in seconds',
    ['target_domain'],
    buckets=[0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30]
)

PROXY_HEALTH = Gauge(
    'scraper_proxy_healthy',
    'Number of healthy proxies',
    ['provider']
)

ITEMS_SCRAPED = Counter(
    'scraper_items_scraped_total',
    'Total items successfully scraped',
    ['target_domain']
)

DATA_QUALITY = Gauge(
    'scraper_data_quality_score',
    'Data quality score (0-1)',
    ['target_domain', 'field']
)

QUEUE_SIZE = Gauge(
    'scraper_queue_size',
    'Current queue size',
    ['queue_name']
)

class MonitoredScraper:
    def __init__(self, metrics_port=9090):
        start_http_server(metrics_port)
        print(f"Metrics available at http://localhost:{metrics_port}/metrics")
    
    async def scrape(self, url, proxy, domain=None):
        # Derive the label value from the URL unless a domain is given explicitly
        domain = domain or urlparse(url).netloc
        
        start = time.time()
        try:
            async with httpx.AsyncClient(proxy=proxy, timeout=30) as client:
                response = await client.get(url)
            
            duration = time.time() - start
            REQUEST_DURATION.labels(target_domain=domain).observe(duration)
            
            if response.status_code == 200:
                REQUESTS_TOTAL.labels(
                    target_domain=domain,
                    status='success',
                    proxy_provider='default'
                ).inc()
                return response
            else:
                REQUESTS_TOTAL.labels(
                    target_domain=domain,
                    status=f'http_{response.status_code}',
                    proxy_provider='default'
                ).inc()
                return response
        
        except Exception:
            REQUESTS_TOTAL.labels(
                target_domain=domain,
                status='error',
                proxy_provider='default'
            ).inc()
            raise
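
A minimal usage sketch, assuming the class above (the target URL and proxy credentials are placeholders):

import asyncio

async def main():
    scraper = MonitoredScraper(metrics_port=9090)
    # Placeholder target and proxy; substitute your own
    response = await scraper.scrape(
        "https://example.com/products",
        proxy="http://user:pass@proxy.example.com:8080",
    )
    print(response.status_code)

asyncio.run(main())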

Grafana Dashboard

{
  "dashboard": {
    "title": "Web Scraper Monitoring",
    "panels": [
      {
        "title": "Request Success Rate",
        "type": "gauge",
        "targets": [{
          "expr": "sum(rate(scraper_requests_total{status='success'}[5m])) / sum(rate(scraper_requests_total[5m])) * 100"
        }]
      },
      {
        "title": "Requests per Second",
        "type": "graph",
        "targets": [{
          "expr": "sum(rate(scraper_requests_total[1m])) by (target_domain)"
        }]
      },
      {
        "title": "Response Time (p95)",
        "type": "graph",
        "targets": [{
          "expr": "histogram_quantile(0.95, rate(scraper_request_duration_seconds_bucket[5m]))"
        }]
      },
      {
        "title": "Proxy Health",
        "type": "gauge",
        "targets": [{
          "expr": "scraper_proxy_healthy"
        }]
      }
    ]
  }
}
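
To provision the dashboard without clicking through the UI, you can post the JSON to Grafana's dashboard endpoint. A minimal sketch, assuming Grafana runs on localhost:3000, the JSON above is saved to a file (the filename is a placeholder), and a service-account token is in the GRAFANA_API_KEY environment variable:

import json
import os

import httpx

# Load the dashboard JSON shown above and push it to Grafana
with open("scraper_dashboard.json") as f:
    payload = json.load(f)
payload["overwrite"] = True  # replace any existing dashboard with the same title

resp = httpx.post(
    "http://localhost:3000/api/dashboards/db",
    json=payload,
    headers={"Authorization": f"Bearer {os.environ['GRAFANA_API_KEY']}"},
)
resp.raise_for_status()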

Data Quality Monitoring

class DataQualityMonitor:
    def __init__(self, required_fields):
        self.required_fields = required_fields
        self.total_items = 0
        self.field_counts = {f: 0 for f in required_fields}
    
    def check_item(self, item, domain):
        self.total_items += 1
        
        for field in self.required_fields:
            value = item.get(field)
            if value and str(value).strip():
                self.field_counts[field] += 1
        
        # Update Prometheus metrics
        for field, count in self.field_counts.items():
            quality = count / self.total_items if self.total_items > 0 else 0
            DATA_QUALITY.labels(
                target_domain=domain,
                field=field
            ).set(quality)
    
    def report(self):
        print("\nData Quality Report:")
        for field, count in self.field_counts.items():
            pct = count / self.total_items * 100 if self.total_items > 0 else 0
            status = "OK" if pct > 90 else "WARNING" if pct > 70 else "CRITICAL"
            print(f"  {field}: {pct:.1f}% populated [{status}]")

# Usage (scraped_items: list of item dicts produced by your parser)
monitor = DataQualityMonitor(['title', 'price', 'description', 'image_url'])
for item in scraped_items:
    monitor.check_item(item, 'example.com')
monitor.report()

Alert Configuration

# prometheus_alerts.yml
groups:
  - name: scraper_alerts
    rules:
      - alert: LowSuccessRate
        expr: sum(rate(scraper_requests_total{status="success"}[5m])) / sum(rate(scraper_requests_total[5m])) < 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Scraper success rate below 90%"
      
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(scraper_request_duration_seconds_bucket[5m])) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above 5 seconds"
      
      - alert: ProxyPoolDegraded
        expr: scraper_proxy_healthy < 5
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Less than 5 healthy proxies"


FAQ

Do I really need monitoring for web scraping?

If you scrape more than once (scheduled jobs, ongoing data collection), yes. Without monitoring, you discover problems only when stakeholders report missing or stale data — hours or days later. Monitoring catches issues within minutes.

What is the simplest monitoring setup?

Start with logging success/failure counts to a file and a cron job that checks the log. Graduate to Prometheus + Grafana when you need dashboards and alerts. For small projects, even a Slack webhook on error count > threshold works well.
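
As a concrete sketch of that last option (the webhook URL and threshold are placeholders):

import httpx

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
ERROR_THRESHOLD = 50  # illustrative error budget per run

def alert_if_needed(error_count):
    # Slack incoming webhooks accept a simple {"text": ...} payload
    if error_count > ERROR_THRESHOLD:
        httpx.post(SLACK_WEBHOOK, json={
            "text": f"Scraper errors: {error_count} (threshold {ERROR_THRESHOLD})",
        })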

How do I detect when a website changes its structure?

Monitor data quality — when your parser stops extracting certain fields, the quality score drops. Set alerts on per-field extraction rates. Also run periodic checks comparing sample scrapes against expected patterns.
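
A simple form of that periodic check is a parser canary: re-scrape a stable, known page and assert the expected fields still come back. A minimal sketch (the field names are placeholders):

def canary_check(parse_fn, html):
    # parse_fn: your existing parser; html: a freshly fetched known page
    item = parse_fn(html)
    missing = [f for f in ("title", "price") if not item.get(f)]
    if missing:
        raise RuntimeError(f"Parser canary failed; missing fields: {missing}")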

What metrics matter most?

Success rate (are requests working?), data quality (is the parser working?), and queue depth (is processing keeping up?). These three metrics catch 90% of scraping issues. Add latency and proxy health for production systems.

How much does monitoring infrastructure cost?

Prometheus and Grafana are free and open source. A small VPS ($5/month) runs both comfortably. Cloud-hosted alternatives (Grafana Cloud, Datadog) have free tiers for small workloads. The cost of not monitoring (missed data, wasted proxy budget) far exceeds monitoring infrastructure costs.

