Creating a Web Scraping Dashboard with Grafana

Running web scrapers without monitoring is like driving blind. You do not know when proxies fail, when target sites change their structure, or when your error rate spikes. A Grafana dashboard gives you real-time visibility into every aspect of your scraping operation.

This guide covers setting up Prometheus metrics in your Python scraper and building Grafana dashboards to visualize them.

What We Will Monitor

A complete scraping dashboard tracks these metrics:

  • Request rate — requests per second, broken down by target domain
  • Success/failure rate — HTTP status codes, timeouts, connection errors
  • Proxy health — alive proxies, response times per proxy, rotation stats
  • Scraping throughput — pages scraped, items extracted per minute
  • Queue depth — pending URLs, backlog size
  • Resource usage — CPU, memory, bandwidth

Architecture

Scraper (Python) → Prometheus → Grafana
    ↓                              ↑
  Metrics endpoint (/metrics)  Dashboards

The scraper exposes metrics on an HTTP endpoint. Prometheus scrapes that endpoint every 15 seconds. Grafana queries Prometheus to render dashboards.
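
Once instrumented, the /metrics endpoint serves plain text in the Prometheus exposition format. Abbreviated, illustrative output (metric names match those defined below; the values are made up):

```text
# HELP scraper_requests_total Total HTTP requests made
# TYPE scraper_requests_total counter
scraper_requests_total{domain="example.com",method="GET",status="200"} 1027.0
scraper_requests_total{domain="example.com",method="GET",status="403"} 12.0
# HELP scraper_queue_size Number of URLs in queue
# TYPE scraper_queue_size gauge
scraper_queue_size 412.0
```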

Setting Up Prometheus Metrics in Python

Install the Prometheus client library:

pip install prometheus-client httpx

Create a metrics module for your scraper:

from prometheus_client import (
    Counter, Histogram, Gauge,
    start_http_server, CollectorRegistry
)

REGISTRY = CollectorRegistry()

# Request metrics
REQUESTS_TOTAL = Counter(
    'scraper_requests_total',
    'Total HTTP requests made',
    ['domain', 'method', 'status'],
    registry=REGISTRY
)

REQUEST_DURATION = Histogram(
    'scraper_request_duration_seconds',
    'Request duration in seconds',
    ['domain'],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0],
    registry=REGISTRY
)

# Proxy metrics
PROXY_POOL_SIZE = Gauge(
    'scraper_proxy_pool_size',
    'Number of proxies in pool',
    ['status'],  # alive, dead, cooldown
    registry=REGISTRY
)

PROXY_LATENCY = Histogram(
    'scraper_proxy_latency_seconds',
    'Proxy response latency',
    ['proxy_id'],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0],
    registry=REGISTRY
)

# Scraping metrics
ITEMS_SCRAPED = Counter(
    'scraper_items_scraped_total',
    'Total items extracted',
    ['item_type'],
    registry=REGISTRY
)

PAGES_SCRAPED = Counter(
    'scraper_pages_scraped_total',
    'Total pages scraped',
    ['domain'],
    registry=REGISTRY
)

QUEUE_SIZE = Gauge(
    'scraper_queue_size',
    'Number of URLs in queue',
    registry=REGISTRY
)

ERRORS = Counter(
    'scraper_errors_total',
    'Total errors by type',
    ['error_type'],  # timeout, connection, blocked, parse_error
    registry=REGISTRY
)

ACTIVE_SESSIONS = Gauge(
    'scraper_active_sessions',
    'Number of active scraping sessions',
    registry=REGISTRY
)

def start_metrics_server(port=8000):
    start_http_server(port, registry=REGISTRY)
    print(f"Metrics server running on http://localhost:{port}/metrics")

Instrumenting Your Scraper

Wrap your scraping logic with metric collection:

import httpx
import time
import asyncio
from urllib.parse import urlparse
from metrics import (
    REQUESTS_TOTAL, REQUEST_DURATION, ITEMS_SCRAPED,
    PAGES_SCRAPED, ERRORS, QUEUE_SIZE, PROXY_POOL_SIZE,
    PROXY_LATENCY, ACTIVE_SESSIONS, start_metrics_server
)

class InstrumentedScraper:
    def __init__(self, proxies: list):
        self.proxies = proxies
        self.queue = asyncio.Queue()

    async def fetch(self, url: str, proxy: str | None = None) -> httpx.Response:
        domain = urlparse(url).netloc
        start = time.monotonic()

        try:
            # A new client per request keeps the example simple;
            # reuse a shared client in production for connection pooling
            async with httpx.AsyncClient(
                proxy=proxy, timeout=30
            ) as client:
                response = await client.get(url)

            duration = time.monotonic() - start

            REQUESTS_TOTAL.labels(
                domain=domain,
                method='GET',
                status=str(response.status_code)
            ).inc()
            REQUEST_DURATION.labels(domain=domain).observe(duration)

            if proxy:
                # Strip credentials so they never appear as a label value;
                # per-proxy labels can blow up cardinality with large pools
                proxy_id = proxy.split('@')[-1] if '@' in proxy else proxy
                PROXY_LATENCY.labels(proxy_id=proxy_id).observe(duration)

            if response.status_code == 200:
                PAGES_SCRAPED.labels(domain=domain).inc()
            elif response.status_code == 403:
                ERRORS.labels(error_type='blocked').inc()
            elif response.status_code == 429:
                ERRORS.labels(error_type='rate_limited').inc()

            return response

        except httpx.TimeoutException:
            ERRORS.labels(error_type='timeout').inc()
            REQUESTS_TOTAL.labels(
                domain=domain, method='GET', status='timeout'
            ).inc()
            raise
        except httpx.ConnectError:
            ERRORS.labels(error_type='connection').inc()
            REQUESTS_TOTAL.labels(
                domain=domain, method='GET', status='connection_error'
            ).inc()
            raise

    async def scrape_page(self, url: str, proxy: str):
        response = await self.fetch(url, proxy)
        items = self.parse_items(response.text)
        ITEMS_SCRAPED.labels(item_type='product').inc(len(items))
        return items

    def parse_items(self, html: str) -> list:
        # Your parsing logic here
        return []

    async def run(self, urls: list):
        start_metrics_server(port=8000)

        for url in urls:
            await self.queue.put(url)
        QUEUE_SIZE.set(self.queue.qsize())

        ACTIVE_SESSIONS.set(1)
        while not self.queue.empty():
            url = await self.queue.get()
            QUEUE_SIZE.set(self.queue.qsize())
            proxy = self.proxies[0]  # Use your rotation logic
            try:
                await self.scrape_page(url, proxy)
            except (httpx.TimeoutException, httpx.ConnectError):
                pass  # already counted in fetch(); avoid double-counting
            except Exception:
                ERRORS.labels(error_type='unknown').inc()
        ACTIVE_SESSIONS.set(0)
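
The run() loop above always uses the first proxy. A minimal round-robin rotation (a sketch, not the logic of any particular library) can stand in for the `self.proxies[0]` placeholder:

```python
from itertools import cycle

class RoundRobinProxies:
    """Cycle through a proxy list, skipping proxies marked dead."""

    def __init__(self, proxies: list[str]):
        self._cycle = cycle(proxies)
        self._dead: set[str] = set()
        self._size = len(proxies)

    def mark_dead(self, proxy: str) -> None:
        self._dead.add(proxy)

    def next(self) -> str:
        # Try at most one full pass through the pool before giving up
        for _ in range(self._size):
            proxy = next(self._cycle)
            if proxy not in self._dead:
                return proxy
        raise RuntimeError("no alive proxies in pool")

pool = RoundRobinProxies(["http://p1:8080", "http://p2:8080", "http://p3:8080"])
pool.mark_dead("http://p2:8080")
print(pool.next())  # http://p1:8080
print(pool.next())  # http://p3:8080 (p2 is skipped)
```

When a proxy is marked dead, decrement the alive gauge with PROXY_POOL_SIZE.labels(status='alive').dec() so the dashboard stays in sync.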

Docker Compose Setup

Run Prometheus and Grafana alongside your scraper:

# docker-compose.yml
version: '3.8'

services:
  scraper:
    build: .
    ports:
      - "8000:8000"  # Metrics endpoint

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana

volumes:
  grafana-data:

Prometheus configuration:

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'scraper'
    static_configs:
      - targets: ['scraper:8000']
    scrape_interval: 5s

Building Grafana Dashboards

After starting the services with docker-compose up, open Grafana at http://localhost:3000 and add Prometheus as a data source.

Request Rate Panel

Create a time series panel with this PromQL query:

rate(scraper_requests_total[5m])

Group by status to see success vs. failure trends:

sum by (status) (rate(scraper_requests_total[5m]))
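
rate() estimates the per-second increase of a counter over the window. For two samples of scraper_requests_total it conceptually reduces to the following (a simplified sketch; real PromQL also handles counter resets and extrapolation):

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second rate between the first and last (timestamp, value) samples."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# Counter went from 1200 to 1500 requests over a 300 s (5 m) window
print(simple_rate([(0.0, 1200.0), (300.0, 1500.0)]))  # 1.0 request/s
```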

Error Rate Panel

Show error percentage as a gauge:

sum(rate(scraper_errors_total[5m]))
/
sum(rate(scraper_requests_total[5m]))
* 100

Set thresholds: green below 5%, yellow from 5% to 15%, red above 15%.
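
The same thresholds can be expressed as a tiny mapping function (purely illustrative; Grafana applies thresholds in the panel configuration, not in code):

```python
def threshold_color(error_pct: float) -> str:
    """Map an error percentage to the gauge color."""
    if error_pct < 5:
        return "green"
    if error_pct < 15:
        return "yellow"
    return "red"

print(threshold_color(3.2))   # green
print(threshold_color(12.0))  # yellow
print(threshold_color(22.5))  # red
```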

Proxy Health Panel

Display alive vs. dead proxies as a stat panel:

scraper_proxy_pool_size{status="alive"}

Latency Distribution

Use a heatmap panel over the histogram buckets, grouped by the le label:

sum by (le) (rate(scraper_request_duration_seconds_bucket[5m]))

Throughput Panel

Show items scraped per minute:

rate(scraper_items_scraped_total[1m]) * 60

Queue Depth Panel

Monitor backlog with a time series:

scraper_queue_size

Alerting Rules

Configure alerts for critical conditions. The rules below use Prometheus alerting-rule syntax; the same PromQL expressions work in Grafana-managed alerts:

# Alert when error rate exceeds 20%
- alert: HighErrorRate
  expr: >
    sum(rate(scraper_errors_total[5m]))
    / sum(rate(scraper_requests_total[5m]))
    > 0.2
  for: 5m
  annotations:
    summary: "Scraper error rate above 20%"

# Alert when all proxies are dead
- alert: NoAliveProxies
  expr: scraper_proxy_pool_size{status="alive"} == 0
  for: 1m
  annotations:
    summary: "No alive proxies in pool"

# Alert when queue is growing (scraper falling behind)
- alert: QueueBacklog
  expr: scraper_queue_size > 10000
  for: 10m
  annotations:
    summary: "Scraping queue backlog exceeds 10,000 URLs"
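
The for: clause means the expression must stay true continuously before the alert fires. A sketch of that evaluation logic (an assumed simplification of Prometheus's actual pending/firing state machine):

```python
def alert_fires(breaches: list[bool], for_intervals: int) -> bool:
    """Fire only if the condition held for the last `for_intervals` evaluations."""
    if len(breaches) < for_intervals:
        return False
    return all(breaches[-for_intervals:])

# Error rate above threshold for 5 consecutive evaluations -> fires
print(alert_fires([True] * 5, for_intervals=5))          # True
# A single recovery inside the window resets the alert
print(alert_fires([True, True, False, True, True], 5))   # False
```

This is why for: 5m suppresses transient spikes: a single healthy evaluation restarts the countdown.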

Dashboard JSON Export

Export your dashboard as JSON for version control. In Grafana, go to Dashboard Settings > JSON Model. Save it alongside your scraper code so teammates can import the same dashboard.

# Save dashboard
curl -H "Authorization: Bearer $GRAFANA_API_KEY" \
  http://localhost:3000/api/dashboards/uid/scraper-dashboard \
  > grafana-dashboard.json

# Import dashboard
curl -X POST -H "Authorization: Bearer $GRAFANA_API_KEY" \
  -H "Content-Type: application/json" \
  -d @grafana-dashboard.json \
  http://localhost:3000/api/dashboards/db

FAQ

Do I need Prometheus or can I use a simpler backend?

Prometheus is the standard for time-series metrics, but you can use InfluxDB or even SQLite for simpler setups. Grafana supports many data sources. For quick prototyping, push metrics directly to Grafana Cloud using their free tier.

How much overhead does metrics collection add?

The Prometheus client library adds negligible overhead: well under 1 ms per metric operation. A scrape of the /metrics endpoint takes 5-10 ms. For scrapers making hundreds of requests per second, metrics collection is invisible in performance profiles.

What is the most important metric to monitor?

Error rate by type. A sudden spike in 403 or 429 errors means your proxies are getting blocked. A spike in timeouts means proxy infrastructure issues. Monitor error rate first, then optimize latency and throughput.

How long should I retain scraping metrics?

Keep high-resolution data (15-second intervals) for about 7 days; configure this with --storage.tsdb.retention.time=7d. Note that vanilla Prometheus does not downsample, so for 1-minute rollups over 30 days or hourly aggregates over a year, pair it with a long-term store such as Thanos or Mimir.
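
Downsampling 15-second samples to 1-minute averages looks roughly like this (a sketch of the idea; in practice a long-term store such as Thanos or Mimir does this for you):

```python
def downsample(samples: list[float], factor: int = 4) -> list[float]:
    """Average consecutive groups of `factor` samples (4 x 15 s = 1 min)."""
    return [
        sum(samples[i:i + factor]) / len(samples[i:i + factor])
        for i in range(0, len(samples), factor)
    ]

print(downsample([10, 12, 14, 16, 20, 20, 20, 20]))  # [13.0, 20.0]
```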

Can I monitor multiple scrapers on one dashboard?

Yes. Add a job or instance label to your metrics. In Grafana, use template variables to filter by scraper instance. This lets you view all scrapers on one dashboard or drill into a specific one.
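
With an instance label in place, a per-scraper panel query filtered by a Grafana template variable (assumed here to be named $scraper) might look like:

```promql
sum by (instance) (rate(scraper_requests_total{instance=~"$scraper"}[5m]))
```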

