Creating a Web Scraping Dashboard with Grafana
Running web scrapers without monitoring is like driving blind. You do not know when proxies fail, when target sites change their structure, or when your error rate spikes. A Grafana dashboard gives you real-time visibility into every aspect of your scraping operation.
This guide covers setting up Prometheus metrics in your Python scraper and building Grafana dashboards to visualize them.
What We Will Monitor
A complete scraping dashboard tracks these metrics:
- Request rate — requests per second, broken down by target domain
- Success/failure rate — HTTP status codes, timeouts, connection errors
- Proxy health — alive proxies, response times per proxy, rotation stats
- Scraping throughput — pages scraped, items extracted per minute
- Queue depth — pending URLs, backlog size
- Resource usage — CPU, memory, bandwidth
Architecture
Scraper (Python) → Prometheus → Grafana
       ↓                           ↓
Metrics endpoint (/metrics)    Dashboards

The scraper exposes metrics on an HTTP endpoint. Prometheus scrapes that endpoint every 15 seconds. Grafana queries Prometheus to render dashboards.
Setting Up Prometheus Metrics in Python
Install the Prometheus client library:
pip install prometheus-client httpx

Create a metrics module for your scraper:
from prometheus_client import (
    Counter, Histogram, Gauge,
    start_http_server, CollectorRegistry
)

REGISTRY = CollectorRegistry()

# Request metrics
REQUESTS_TOTAL = Counter(
    'scraper_requests_total',
    'Total HTTP requests made',
    ['domain', 'method', 'status'],
    registry=REGISTRY
)

REQUEST_DURATION = Histogram(
    'scraper_request_duration_seconds',
    'Request duration in seconds',
    ['domain'],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0],
    registry=REGISTRY
)

# Proxy metrics
PROXY_POOL_SIZE = Gauge(
    'scraper_proxy_pool_size',
    'Number of proxies in pool',
    ['status'],  # alive, dead, cooldown
    registry=REGISTRY
)

PROXY_LATENCY = Histogram(
    'scraper_proxy_latency_seconds',
    'Proxy response latency',
    ['proxy_id'],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0],
    registry=REGISTRY
)

# Scraping metrics
ITEMS_SCRAPED = Counter(
    'scraper_items_scraped_total',
    'Total items extracted',
    ['item_type'],
    registry=REGISTRY
)

PAGES_SCRAPED = Counter(
    'scraper_pages_scraped_total',
    'Total pages scraped',
    ['domain'],
    registry=REGISTRY
)

QUEUE_SIZE = Gauge(
    'scraper_queue_size',
    'Number of URLs in queue',
    registry=REGISTRY
)

ERRORS = Counter(
    'scraper_errors_total',
    'Total errors by type',
    ['error_type'],  # timeout, connection, blocked, parse_error
    registry=REGISTRY
)

ACTIVE_SESSIONS = Gauge(
    'scraper_active_sessions',
    'Number of active scraping sessions',
    registry=REGISTRY
)

def start_metrics_server(port=8000):
    start_http_server(port, registry=REGISTRY)
    print(f"Metrics server running on http://localhost:{port}/metrics")

Instrumenting Your Scraper
Wrap your scraping logic with metric collection:
import httpx
import time
import asyncio
from urllib.parse import urlparse

from metrics import (
    REQUESTS_TOTAL, REQUEST_DURATION, ITEMS_SCRAPED,
    PAGES_SCRAPED, ERRORS, QUEUE_SIZE, PROXY_POOL_SIZE,
    PROXY_LATENCY, ACTIVE_SESSIONS, start_metrics_server
)

class InstrumentedScraper:
    def __init__(self, proxies: list):
        self.proxies = proxies
        self.queue = asyncio.Queue()

    async def fetch(self, url: str, proxy: str = None) -> httpx.Response:
        """Fetch a URL and record request, latency, and error metrics."""
        domain = urlparse(url).netloc
        start = time.monotonic()
        try:
            async with httpx.AsyncClient(
                proxy=proxy, timeout=30
            ) as client:
                response = await client.get(url)
            duration = time.monotonic() - start
            REQUESTS_TOTAL.labels(
                domain=domain,
                method='GET',
                status=str(response.status_code)
            ).inc()
            REQUEST_DURATION.labels(domain=domain).observe(duration)
            if proxy:
                # Use the host portion as the proxy label, not the credentials
                proxy_id = proxy.split('@')[-1] if '@' in proxy else proxy
                PROXY_LATENCY.labels(proxy_id=proxy_id).observe(duration)
            if response.status_code == 200:
                PAGES_SCRAPED.labels(domain=domain).inc()
            elif response.status_code == 403:
                ERRORS.labels(error_type='blocked').inc()
            elif response.status_code == 429:
                ERRORS.labels(error_type='rate_limited').inc()
            return response
        except httpx.TimeoutException:
            ERRORS.labels(error_type='timeout').inc()
            REQUESTS_TOTAL.labels(
                domain=domain, method='GET', status='timeout'
            ).inc()
            raise
        except httpx.ConnectError:
            ERRORS.labels(error_type='connection').inc()
            REQUESTS_TOTAL.labels(
                domain=domain, method='GET', status='connection_error'
            ).inc()
            raise

    async def scrape_page(self, url: str, proxy: str):
        response = await self.fetch(url, proxy)
        items = self.parse_items(response.text)
        ITEMS_SCRAPED.labels(item_type='product').inc(len(items))
        return items

    def parse_items(self, html: str) -> list:
        # Your parsing logic here
        return []

    async def run(self, urls: list):
        start_metrics_server(port=8000)
        for url in urls:
            await self.queue.put(url)
        QUEUE_SIZE.set(self.queue.qsize())
        ACTIVE_SESSIONS.set(1)
        while not self.queue.empty():
            url = await self.queue.get()
            QUEUE_SIZE.set(self.queue.qsize())
            proxy = self.proxies[0]  # Use your rotation logic
            try:
                await self.scrape_page(url, proxy)
            except Exception:
                ERRORS.labels(error_type='unknown').inc()
        ACTIVE_SESSIONS.set(0)
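To tie it together, a minimal entry point might look like the sketch below. The proxy URL and target URLs are hypothetical placeholders:

if __name__ == "__main__":
    # Hypothetical proxy credentials and target list -- substitute your own
    scraper = InstrumentedScraper(
        proxies=["http://user:pass@proxy1.example.com:8080"]
    )
    asyncio.run(scraper.run([
        "https://example.com/products?page=1",
        "https://example.com/products?page=2",
    ]))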
Docker Compose Setup
Run Prometheus and Grafana alongside your scraper:
# docker-compose.yml
version: '3.8'
services:
  scraper:
    build: .
    ports:
      - "8000:8000"  # Metrics endpoint
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana
volumes:
  grafana-data:

Prometheus configuration:
# prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'scraper'
    static_configs:
      - targets: ['scraper:8000']
    scrape_interval: 5s

Building Grafana Dashboards
After starting the services with docker-compose up, open Grafana at http://localhost:3000 and add Prometheus as a data source. From inside the Compose network, its URL is http://prometheus:9090.
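You can also provision the data source automatically instead of clicking through the UI. A minimal sketch using Grafana's provisioning format; the file path is an assumption, and the file must be mounted into the container under /etc/grafana/provisioning/datasources:

# grafana/provisioning/datasources/prometheus.yml (hypothetical path)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true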
Request Rate Panel
Create a time series panel with this PromQL query:
rate(scraper_requests_total[5m])

Group by status to see success vs. failure trends:
sum by (status) (rate(scraper_requests_total[5m]))

Error Rate Panel
Show error percentage as a gauge:
sum(rate(scraper_errors_total[5m]))
/
sum(rate(scraper_requests_total[5m]))
* 100

Set thresholds: green below 5%, yellow from 5% to 15%, red above 15%.
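In the exported dashboard JSON, that threshold setup looks roughly like this (field names follow Grafana's panel schema; exact nesting varies by Grafana version):

"fieldConfig": {
  "defaults": {
    "thresholds": {
      "mode": "absolute",
      "steps": [
        { "color": "green", "value": null },
        { "color": "yellow", "value": 5 },
        { "color": "red", "value": 15 }
      ]
    }
  }
}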
Proxy Health Panel
Display alive vs. dead proxies as a stat panel:
scraper_proxy_pool_size{status="alive"}

Latency Distribution
Use a heatmap panel, aggregating the histogram buckets by their le label:
sum by (le) (rate(scraper_request_duration_seconds_bucket[5m]))

Throughput Panel
Show items scraped per minute:
rate(scraper_items_scraped_total[1m]) * 60

Queue Depth Panel
Monitor backlog with a time series:
scraper_queue_size

Alerting Rules
Configure alerts for critical conditions. The rules below use Prometheus alerting-rule syntax; the same PromQL expressions work in Grafana's alert editor:
# Alert when error rate exceeds 20%
- alert: HighErrorRate
  expr: >
    sum(rate(scraper_errors_total[5m]))
    / sum(rate(scraper_requests_total[5m]))
    > 0.2
  for: 5m
  annotations:
    summary: "Scraper error rate above 20%"

# Alert when all proxies are dead
- alert: NoAliveProxies
  expr: scraper_proxy_pool_size{status="alive"} == 0
  for: 1m
  annotations:
    summary: "No alive proxies in pool"

# Alert when the queue backlog is too large (scraper falling behind)
- alert: QueueBacklog
  expr: scraper_queue_size > 10000
  for: 10m
  annotations:
    summary: "Scraping queue backlog exceeds 10,000 URLs"

Dashboard JSON Export
Export your dashboard as JSON for version control. In Grafana, go to Dashboard Settings > JSON Model. Save it alongside your scraper code so teammates can import the same dashboard.
# Save dashboard
curl -H "Authorization: Bearer $GRAFANA_API_KEY" \
http://localhost:3000/api/dashboards/uid/scraper-dashboard \
> grafana-dashboard.json
# Import dashboard
curl -X POST -H "Authorization: Bearer $GRAFANA_API_KEY" \
-H "Content-Type: application/json" \
-d @grafana-dashboard.json \
http://localhost:3000/api/dashboards/db

If the import fails, strip the meta block and the dashboard id from the exported JSON and wrap the rest as {"dashboard": {...}, "overwrite": true}, which is the payload /api/dashboards/db expects.

Internal Links
- Building a Proxy Checker Tool — check proxy health programmatically
- Proxy Health Monitor with Node.js — alternative monitoring approach
- Web Scraping ETL Pipeline with Airflow — orchestrate scraping workflows
- Proxy Log Analyzer — analyze historical proxy performance
- Proxy Performance Benchmarks — benchmark your setup
FAQ
Do I need Prometheus or can I use a simpler backend?
Prometheus is the standard for time-series metrics, but you can use InfluxDB or even SQLite for simpler setups. Grafana supports many data sources. For quick prototyping, push metrics directly to Grafana Cloud using their free tier.
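If you do push to Grafana Cloud, a remote_write block in prometheus.yml is the usual route. A minimal sketch, where the endpoint and credentials are placeholders you get from your Grafana Cloud account:

# prometheus.yml (endpoint and credentials are placeholders)
remote_write:
  - url: https://prometheus-prod-01.grafana.net/api/prom/push
    basic_auth:
      username: YOUR_INSTANCE_ID
      password: YOUR_API_TOKEN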
How much overhead does metrics collection add?
The Prometheus client library adds negligible overhead: well under a millisecond per metric operation. A scrape of the /metrics endpoint takes 5-10ms. For scrapers making hundreds of requests per second, metrics collection is invisible in performance profiles.
What is the most important metric to monitor?
Error rate by type. A sudden spike in 403 or 429 errors means your proxies are getting blocked. A spike in timeouts means proxy infrastructure issues. Monitor error rate first, then optimize latency and throughput.
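For example, a single query gives you a per-type breakdown panel:

sum by (error_type) (rate(scraper_errors_total[5m]))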
How long should I retain scraping metrics?
Keep high-resolution data (15-second intervals) for 7 days. Downsample to 1-minute intervals for 30 days. For long-term trend analysis, keep hourly aggregates for 1 year. Configure Prometheus retention with --storage.tsdb.retention.time=7d. Note that Prometheus does not downsample on its own; the lower-resolution tiers need recording rules or a long-term store such as Thanos or Mimir.
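In the Docker Compose setup above, the retention flag goes in the command for the prometheus service. Overriding command replaces the image's default arguments, so the config file flag must be repeated:

prometheus:
  image: prom/prometheus:latest
  command:
    - '--config.file=/etc/prometheus/prometheus.yml'
    - '--storage.tsdb.retention.time=7d'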
Can I monitor multiple scrapers on one dashboard?
Yes. Add a job or instance label to your metrics. In Grafana, use template variables to filter by scraper instance. This lets you view all scrapers on one dashboard or drill into a specific one.
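With an instance template variable defined in Grafana, panel queries filter like this:

sum by (status) (rate(scraper_requests_total{instance=~"$instance"}[5m]))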
Related Reading
- Build an Anti-Detection Test Suite: Verify Browser Stealth
- Build a News Crawler in Python: Step-by-Step Tutorial
- AJAX Request Interception: Scraping API Calls Directly
- Azure Functions for Serverless Web Scraping: the Complete Guide
- How to Configure Proxies on iPhone and Android
- How to Use Proxies in Node.js (Axios, Fetch, Puppeteer)