Distributed scraping with Apache Kafka in 2026
Distributed scraping with Apache Kafka is what you reach for when a single Scrapy process or single-machine Playwright cluster cannot keep up with your URL volume. Kafka excels at moving high-volume work between producers (URL discovery, sitemap crawlers) and consumers (scraper workers), with durable messaging, partitioning for parallelism, and a replayable log that makes retry and reprocessing patterns cheap to build. By 2026 Kafka 3.x is mature enough that running it in production for scraping pipelines is standard practice, especially for teams already operating Kafka for other data flows.
This guide covers Kafka topic design for scraping pipelines, partition strategies that keep scrapers fed, dead letter queue patterns for failed URLs, exactly-once semantics for deduplicated processing, and a complete working Python implementation using confluent-kafka. The benchmarks reflect production deployments handling 10-50 million URLs per day. By the end you will know how to architect a Kafka-backed scraping pipeline that scales horizontally and survives operational chaos.
When Kafka fits scraping
Kafka shines when:
- You have multiple URL sources (sitemaps, APIs, customer uploads, periodic crawls) feeding scrapers
- Scraper workers run on different machines and you want to balance load
- You need to retry failed URLs without losing them
- You want to fan out the same URLs to multiple downstream processors (HTML store, ML extraction, analytics)
- You operate other Kafka pipelines and want consistency
- You need exactly-once processing semantics
Kafka does not fit when:
- You have low volume (under 1M URLs/day, simpler queues like Redis or SQS suffice)
- You operate one scraping job at a time (no need for the multi-producer/multi-consumer architecture)
- Your team has no Kafka operational experience
- You need request/response semantics (Kafka is one-way; for scraper status, use a separate sync API)
For most mid-volume scraping, simpler queues work. Kafka pays off above several million URLs per day or when fanout to multiple consumers is required.
For the official Kafka documentation, see kafka.apache.org/documentation/.
Architecture overview
A typical Kafka-backed scraping pipeline:
URL sources              urls topic            scraper workers     downstream

sitemap crawler --+                      +--> worker_1 --+--> postgres
API feed ---------+--> [ urls topic ] ---+--> worker_2 --+--> s3 / r2
customer upload --+                      +--> worker_N --+--> ML pipeline

failures --> [ dlq topic ] --> retry consumer --> back to [ urls topic ]
Each producer pushes URLs to the urls topic. Multiple scraper workers consume from partitions of that topic in parallel. Successful results go to a pages topic that downstream consumers pick up. Failed URLs go to a dlq topic for retry or manual review.
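The messages flowing between these topics are small JSON payloads. A sketch of the shapes used throughout this guide; the field names match the producer and worker code later on, but treat them as a starting point rather than a fixed schema:

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class UrlMessage:      # produced to the urls topic
    url: str
    customer_id: str = "default"
    priority: int = 0
    discovered_at: int = 0

@dataclass
class PageMessage:     # produced to the pages topic by scraper workers
    url: str
    status: int
    html: Optional[str]
    fetched_at: int

def encode(msg) -> bytes:
    """Serialize a message dataclass for producer.produce(value=...)."""
    return json.dumps(asdict(msg)).encode()
```

Once the pipeline grows past a handful of producers, moving these shapes into a schema registry (mentioned in the checklist below) keeps producers and consumers honest.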
Topic design for scraping
The standard set of topics:
| topic | purpose | partitions | retention |
|---|---|---|---|
| urls | URLs to scrape | 50 | 7 days |
| pages | scraped page content | 20 | 14 days |
| extractions | parsed structured data | 20 | 30 days |
| errors | scraper errors | 5 | 7 days |
| dlq | failed URLs for retry | 5 | 30 days |
| metrics | per-URL timing/status | 10 | 1 day |
Partitioning matters because Kafka guarantees ordered processing only within a partition. For scraping, you usually want:
- By host: partition = hash(url.host) % num_partitions. Keeps all URLs for one host on one partition, enables per-host rate limiting.
- By customer: partition = hash(customer_id) % num_partitions. Tenant isolation, billing per customer.
- Random: any partition. Maximum parallelism, no host-level coordination.
For scraping with anti-bot vendors, by-host partitioning is the right default. It lets each consumer apply per-host throttling without coordinating with other consumers.
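Note that if you pass key=host.encode() to confluent-kafka, librdkafka's default partitioner already maps a given key to a stable partition (for a fixed partition count). If you compute partitions yourself, avoid Python's built-in hash(); a sketch of a deterministic alternative (helper name is ours, md5 is one reasonable choice):

```python
import hashlib
from urllib.parse import urlparse

def host_partition(url: str, num_partitions: int) -> int:
    """Deterministically map a URL's host to a partition.

    Avoids Python's built-in hash(), which is salted per process
    (PYTHONHASHSEED) and would send the same host to different
    partitions from different producer processes.
    """
    host = urlparse(url).hostname or ""
    digest = hashlib.md5(host.encode()).hexdigest()
    return int(digest, 16) % num_partitions
```

Two producers computing host_partition for the same host always agree, which is exactly what per-host throttling in the consumers relies on.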
Setup: Kafka 3.x with KRaft
KRaft (Kafka Raft) replaced the ZooKeeper dependency: it was marked production-ready in Kafka 3.3, ZooKeeper mode was deprecated in 3.5, and it was removed entirely in Kafka 4.0. KRaft is the recommended setup for any new cluster. For local dev:
# docker-compose.yml
version: "3.8"
services:
kafka:
image: bitnami/kafka:3.7
ports:
- "9092:9092"
environment:
KAFKA_CFG_NODE_ID: 1
KAFKA_CFG_PROCESS_ROLES: controller,broker
KAFKA_CFG_CONTROLLER_QUORUM_VOTERS: 1@kafka:9093
KAFKA_CFG_LISTENERS: PLAINTEXT://:9092,CONTROLLER://:9093
KAFKA_CFG_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
KAFKA_CFG_CONTROLLER_LISTENER_NAMES: CONTROLLER
KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
For production, use a managed service (Confluent Cloud, AWS MSK) or self-host on at least three nodes with proper replication.
Create the topics:
kafka-topics.sh --create --bootstrap-server localhost:9092 \
--topic urls --partitions 50 --replication-factor 1 \
--config retention.ms=604800000
kafka-topics.sh --create --bootstrap-server localhost:9092 \
--topic pages --partitions 20 --replication-factor 1 \
--config retention.ms=1209600000
kafka-topics.sh --create --bootstrap-server localhost:9092 \
--topic dlq --partitions 5 --replication-factor 1 \
--config retention.ms=2592000000
In production, use replication-factor 3 for durability.
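If you would rather provision topics from Python, the table above can be turned into specs and fed to confluent-kafka's AdminClient (wrapping each entry in a NewTopic). A dependency-free sketch of the spec computation, which also shows where the retention.ms values in the shell commands come from:

```python
# Topic specs mirroring the table above. Retention is given in days and
# converted to the retention.ms values used in the shell commands.
DAY_MS = 24 * 60 * 60 * 1000

def topic_specs() -> dict:
    table = [
        # (name, partitions, retention_days)
        ("urls", 50, 7),
        ("pages", 20, 14),
        ("extractions", 20, 30),
        ("errors", 5, 7),
        ("dlq", 5, 30),
        ("metrics", 10, 1),
    ]
    return {
        name: {
            "num_partitions": parts,
            "replication_factor": 1,  # use 3 in production
            "config": {"retention.ms": str(days * DAY_MS)},
        }
        for name, parts, days in table
    }
```

Wrap each entry in confluent_kafka.admin.NewTopic and pass the list to AdminClient.create_topics if you want Python-side provisioning.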
Producer: feeding URLs
A simple Python producer using confluent-kafka:
# producer.py
from confluent_kafka import Producer
import json
import time
producer = Producer({
"bootstrap.servers": "kafka:9092",
"client.id": "url-discovery",
"acks": "all",
"compression.type": "lz4",
"retries": 5,
"linger.ms": 10,
})
def url_key(url: str) -> bytes:
"""Partition by host so per-host throttling works in consumers."""
from urllib.parse import urlparse
host = urlparse(url).hostname or ""
return host.encode()
def produce_url(url: str, customer_id: str = "default", priority: int = 0):
payload = {
"url": url,
"customer_id": customer_id,
"priority": priority,
"discovered_at": int(time.time()),
}
producer.produce(
topic="urls",
key=url_key(url),
value=json.dumps(payload).encode(),
callback=lambda err, msg: (
print(f"Failed: {err}") if err else None
),
)
if __name__ == "__main__":
import sys, time
with open(sys.argv[1]) as f:
for line in f:
url = line.strip()
if url:
produce_url(url)
producer.flush(timeout=30)
Key choices:
- acks="all": wait for all in-sync replicas to acknowledge. Durable but slower.
- compression.type="lz4": roughly 60-80% size reduction with low CPU overhead.
- linger.ms=10: batch up to 10 ms before sending. Reduces request count.
- key=hostname.encode(): partitions by host.
For very high throughput, use acks=1 (only leader) and accept slightly higher loss risk in exchange for 2-3x speed.
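A sketch of the two profiles side by side; the durable keys match the producer above, the throughput values are assumptions to tune per workload:

```python
# Durable baseline (matches producer.py above) vs. a throughput-oriented
# variant. Only the differing keys change; merge the overrides on top.
DURABLE = {
    "acks": "all",            # wait for all in-sync replicas
    "linger.ms": 10,
    "compression.type": "lz4",
}

def fast_variant(base: dict) -> dict:
    """Leader-only acks trade a small loss window for 2-3x produce throughput."""
    return {
        **base,
        "acks": "1",          # leader ack only
        "linger.ms": 50,      # larger batches, higher per-message latency
    }
```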
Consumer: the scraper worker
A consumer that fetches URLs and produces results:
# scraper_worker.py
from confluent_kafka import Consumer, Producer, KafkaError
import json
import requests
import time
import logging
from collections import defaultdict
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
consumer = Consumer({
"bootstrap.servers": "kafka:9092",
"group.id": "scrapers",
"auto.offset.reset": "earliest",
"enable.auto.commit": False,
"max.poll.interval.ms": 600000, # 10 min
"session.timeout.ms": 30000,
})
consumer.subscribe(["urls"])
producer = Producer({
"bootstrap.servers": "kafka:9092",
"compression.type": "lz4",
"acks": "all",
})
# Per-host last-fetch timestamps for throttling
last_fetch = defaultdict(float)
HOST_DELAY = 1.5 # seconds between same-host requests
def scrape(url: str) -> dict:
from urllib.parse import urlparse
host = urlparse(url).hostname
# Per-host throttle
elapsed = time.time() - last_fetch[host]
if elapsed < HOST_DELAY:
time.sleep(HOST_DELAY - elapsed)
last_fetch[host] = time.time()
resp = requests.get(url, timeout=30, headers={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ...",
})
return {
"url": url,
"status": resp.status_code,
"html": resp.text if resp.status_code == 200 else None,
"fetched_at": int(time.time()),
}
def send_to_dlq(message_value: bytes, error: str):
producer.produce(
topic="dlq",
key=message_value,
value=json.dumps({
"original": json.loads(message_value),
"error": error,
"failed_at": int(time.time()),
}).encode(),
)
while True:
msg = consumer.poll(timeout=1.0)
if msg is None:
continue
if msg.error():
if msg.error().code() == KafkaError._PARTITION_EOF:
continue
logger.error(f"Consumer error: {msg.error()}")
continue
payload = json.loads(msg.value())
url = payload["url"]
try:
result = scrape(url)
if result["status"] == 200:
producer.produce(
topic="pages",
key=url.encode(),
value=json.dumps(result).encode(),
)
elif result["status"] >= 500:
send_to_dlq(msg.value(), f"HTTP {result['status']}")
# else: 4xx errors are skipped (404, 403, etc.)
consumer.commit(msg) # commit offset only on success or known-skip
except Exception as e:
logger.error(f"Failed to scrape {url}: {e}")
send_to_dlq(msg.value(), str(e))
consumer.commit(msg) # still commit, DLQ has the message
Key choices:
- enable.auto.commit=False: manual commit so we only advance after processing
- Per-host throttle via the last_fetch dict
- DLQ for transient errors (5xx, exceptions)
- Skip 4xx errors (404, 403 are usually permanent for that URL)
Per-host rate limiting at scale
The simple per-process throttle above breaks when you have multiple consumers per partition. By partitioning on host, each host’s URLs all land on one partition, which is consumed by one consumer at a time within a consumer group. So the per-process throttle effectively becomes a per-host throttle across the cluster.
For finer control (multiple consumers, different rate limits per customer), use a distributed rate limiter (Redis-backed):
import time

import redis

r = redis.Redis(host="redis", port=6379)

def can_fetch_host(host: str, max_per_sec: float = 1.0) -> bool:
    """Sliding-window counter via a Redis sorted set."""
    key = f"rate:{host}"
    now = time.time()
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - 1)  # drop entries older than 1s
    pipe.zcard(key)
    pipe.zadd(key, {str(now): now})  # NB: records the attempt even when denied
    pipe.expire(key, 10)
    _, count, _, _ = pipe.execute()
    return count < max_per_sec
# In scraper:
while not can_fetch_host(host):
time.sleep(0.1)
This gives global rate limiting independent of partition distribution.
Dead letter queue handling
DLQ patterns:
# dlq_consumer.py
from confluent_kafka import Consumer, Producer
import json
import time
consumer = Consumer({
"bootstrap.servers": "kafka:9092",
"group.id": "dlq-retry",
"auto.offset.reset": "earliest",
})
consumer.subscribe(["dlq"])
producer = Producer({"bootstrap.servers": "kafka:9092"})
RETRY_COOLDOWN = 3600  # 1 hour

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None:
        continue
    if msg.error():
        continue
    record = json.loads(msg.value())
    age = time.time() - record["failed_at"]
    if age < RETRY_COOLDOWN:
        # Messages within a partition are roughly time-ordered, so waiting
        # out the cooldown here is safe. A bare `continue` would silently
        # skip the message: poll() keeps advancing even without a commit.
        time.sleep(RETRY_COOLDOWN - age)
# Re-emit to urls topic
producer.produce(
topic="urls",
key=record["original"]["url"].encode(),
value=json.dumps(record["original"]).encode(),
)
consumer.commit(msg)
For permanent failures, skip re-emission and instead surface the DLQ topic in a dashboard for ops review.
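Before building a dashboard, a small triage helper that buckets DLQ records by error string is often enough. A sketch (the record shape matches send_to_dlq above; the function name is ours):

```python
import json
from collections import Counter

def summarize_dlq(raw_messages) -> Counter:
    """Count DLQ records by error string.

    raw_messages: iterable of message values (bytes) in the shape
    produced by send_to_dlq: {"original": ..., "error": ..., "failed_at": ...}
    """
    counts = Counter()
    for value in raw_messages:
        record = json.loads(value)
        counts[record["error"]] += 1
    return counts
```

Running this over a day's worth of DLQ messages tells you immediately whether failures are dominated by one host returning 503s or spread across the crawl.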
Exactly-once semantics
Kafka 3.x supports exactly-once semantics (EOS) via transactional producers and read_committed consumers. For scraping, this matters when you want each URL processed exactly once across worker restarts and failures.
from confluent_kafka import Producer, TopicPartition

producer = Producer({
    "bootstrap.servers": "kafka:9092",
    "transactional.id": f"scraper-{worker_id}",  # must be unique and stable per worker
    "enable.idempotence": True,
})
producer.init_transactions()

# In the scraper loop (msg is the consumed record, consumer the Consumer):
producer.begin_transaction()
producer.produce(topic="pages", value=json.dumps(result).encode())
producer.send_offsets_to_transaction(
    [TopicPartition("urls", msg.partition(), msg.offset() + 1)],
    consumer.consumer_group_metadata(),
)
producer.commit_transaction()  # on failure, call abort_transaction() instead
This guarantees that if the scraper crashes mid-process, the offset commit and the result production happen atomically: either both, or neither. The next consumer instance picks up from the same offset and processes the URL again.
Comparison: Kafka vs alternatives for scraping queues
| platform | pros | cons | scale |
|---|---|---|---|
| Kafka | high throughput, durable, multi-consumer, EOS | complex ops, requires JVM | up to 10M+ msg/sec |
| Redis Streams | simple, fast, low overhead | single-node, less durable | up to 100k msg/sec |
| RabbitMQ | rich routing, mature | lower throughput than Kafka | up to 100k msg/sec |
| SQS | managed, cheap, infinite scale | per-message cost, no replay | up to millions/sec |
| NATS JetStream | simple, fast, durable | smaller community | up to 1M msg/sec |
| Pulsar | Kafka-alternative, geo-replication | smaller adoption | comparable to Kafka |
For scraping, Kafka and Pulsar both work; Kafka has the larger community. SQS is often the simpler choice if your AWS bill is comfortable with per-message charges. For under 1M URLs/day, Redis Streams is hard to beat for simplicity.
Monitoring
Critical Kafka metrics for scraping:
| metric | what it tells you |
|---|---|
| consumer lag (urls topic) | how far behind workers are |
| produce rate (urls) | URL discovery throughput |
| consume rate (urls) | scrape throughput |
| DLQ size | failure rate |
| rebalance count | worker churn |
| under-replicated partitions | broker issues |
Tools:
- Kafka Exporter for Prometheus: exposes broker and consumer-lag metrics for Prometheus to scrape
- Conduktor or Confluent Control Center: GUI for cluster management
- Burrow: consumer lag monitoring with alerting
- Strimzi: Kubernetes operator with built-in monitoring
Set alerts on consumer lag > 30 min, DLQ size growth, broker downtime.
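The lag number behind that alert is simple arithmetic once you have the offsets. A sketch of the math (in production the inputs come from Consumer.get_watermark_offsets() and Consumer.committed(); here they are plain dicts so the computation is easy to see):

```python
def consumer_lag(end_offsets: dict, committed: dict) -> dict:
    """Per-partition lag = log-end offset minus committed offset.

    Both arguments map (topic, partition) -> offset. Partitions with
    no committed offset count their full log-end offset as lag.
    """
    return {
        tp: max(0, end - committed.get(tp, 0))
        for tp, end in end_offsets.items()
    }
```

Export the summed lag per topic as a gauge and alert when it keeps growing for 30 minutes.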
Production deployment
For 10M URLs/day:
- 3-broker Kafka cluster on Kafka 3.7 with KRaft
- 50 partitions on the urls topic
- 20 scraper workers, each consuming from 2-3 partitions
- Per-host throttle via partition assignment + Redis backstop
- Postgres for results (loaded from the pages topic by separate consumers)
- DLQ retry consumer running every hour
- Prometheus + Grafana for monitoring
- ~$1500/month in compute + Kafka
For 100M URLs/day:
- 5+ broker cluster with replication-factor 3
- 200+ partitions
- 100+ scraper workers
- Multiple Kafka clusters segmented by region
- ~$15k/month compute + Kafka
Operational checklist
For Kafka-backed scraping in 2026:
- Kafka 3.5+ with KRaft (no ZooKeeper)
- Topic design: urls, pages, dlq, errors, metrics
- Partition by host for per-host coordination
- LZ4 compression on all topics
- acks=all for producers (durability)
- Manual commit on consumers (process before commit)
- Per-host throttling (partition-level + Redis backstop)
- DLQ retry consumer
- Prometheus + Grafana monitoring
- Alerts on lag, DLQ growth, broker health
- Schema registry if pipeline complexity grows
For broader pipeline patterns, see building scraping pipelines with Prefect 3 and building scraping pipelines with Dagster.
Common pitfalls
- Partition imbalance: bad partition keys cause hot partitions. Audit URL distribution by partition.
- Slow consumers and consumer group rebalances: long-running scrapes blow past max.poll.interval.ms. Bump it or split work.
- Offset commit before processing: leads to data loss. Always commit after successful processing.
- DLQ overflow: if 30% of URLs fail, DLQ fills faster than retries clear it. Tune retry policy or fix underlying issues.
- Memory pressure on consumers: large pages (>1 MB HTML) accumulate in producer buffer. Cap message size or stream pages directly to S3.
- Kafka cluster upgrade missteps: KRaft migration requires careful planning. Read the Kafka KRaft docs.
FAQ
Q: do I need Kafka for under 1M URLs/day?
Probably not. Redis Streams or SQS will be simpler and sufficient. Kafka pays off above 1-10M URLs/day where the durability, throughput, and multi-consumer fanout matter.
Q: can I use Kafka for both URL queue and result storage?
Kafka is a queue, not a database. Use it for streaming and short-term retention. Store results in a database (Postgres, ClickHouse) or object storage (S3, R2) via downstream consumers.
Q: how do I retry failed URLs without losing the original message?
DLQ topic. Failed processing produces to DLQ, original offset commits. A retry consumer reads DLQ on a schedule and re-emits to the urls topic.
Q: what about Kafka Streams for in-flight processing?
Useful for derived data (aggregations, deduplication). For raw scraping, KafkaConsumer is enough. Add Streams when you need stateful enrichment.
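If deduplication is all you need, a bounded in-process cache in the consumer often suffices before reaching for Streams. A sketch (class name and size are ours; it does not survive restarts or share state across workers, so use Redis or a Streams state store for that):

```python
import hashlib
from collections import OrderedDict

class SeenUrls:
    """Bounded in-process dedup cache with oldest-first eviction."""

    def __init__(self, max_size: int = 1_000_000):
        self.max_size = max_size
        self._seen = OrderedDict()

    def check_and_add(self, url: str) -> bool:
        """Return True if the URL was not seen before."""
        key = hashlib.sha256(url.encode()).hexdigest()
        if key in self._seen:
            self._seen.move_to_end(key)  # refresh recency
            return False
        self._seen[key] = True
        if len(self._seen) > self.max_size:
            self._seen.popitem(last=False)  # evict oldest entry
        return True
```

In the worker loop, call check_and_add(url) before scrape() and skip duplicates (still committing their offsets).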
Q: does this work with Confluent Cloud?
Yes. The Python client is identical, only the bootstrap.servers and SASL credentials change. Confluent Cloud removes the ops burden at the cost of usage-based pricing.
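A sketch of that config overlay; the endpoint and credential values below are placeholders, not real values:

```python
# Confluent Cloud connection settings (placeholders). Everything else in
# the producer/consumer configs from this guide stays the same.
CONFLUENT_CLOUD = {
    "bootstrap.servers": "pkc-xxxxx.us-east-1.aws.confluent.cloud:9092",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<API_KEY>",
    "sasl.password": "<API_SECRET>",
}

def cloud_config(base: dict) -> dict:
    """Overlay cloud connection settings on an existing client config."""
    return {**base, **CONFLUENT_CLOUD}
```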
Real-world example: 80M URL crawl with KRaft and per-host fairness
A scraping team migrated from a single-broker Kafka cluster to a 5-broker KRaft cluster handling 80 million URLs per day across 12,000 distinct hosts. The bottleneck before migration was not throughput but fairness: a few hosts (Amazon, eBay, Walmart) made up 40 percent of all URLs and starved smaller hosts because their partitions had longer queue depth. The fix involved a custom partitioner that hashed by host AND by a “fairness band” derived from the host’s URL queue depth:
import hashlib
from confluent_kafka import Producer
class FairHostPartitioner:
def __init__(self, num_partitions: int, queue_depth_provider):
self.num_partitions = num_partitions
self.queue_depth = queue_depth_provider # callable(host) -> int
    def partition(self, key: bytes, all_partitions: list) -> int:
        # Key here is the full URL. confluent-kafka has no pluggable
        # partitioner hook, so pass the result explicitly as
        # producer.produce(..., partition=...).
        host = key.decode().split("/")[2]  # netloc of "scheme://host/path"
depth = self.queue_depth(host)
# High-depth hosts get spread across multiple partitions
if depth > 100_000:
band = hashlib.sha256(key).hexdigest()[:8]
band_int = int(band, 16) % 4 # spread across 4 partitions
base = int(hashlib.sha256(host.encode()).hexdigest()[:8], 16)
return (base + band_int) % self.num_partitions
# Low-depth hosts get one partition (per-host throttling preserved)
return int(hashlib.sha256(host.encode()).hexdigest()[:8], 16) % self.num_partitions
producer = Producer({
"bootstrap.servers": "kafka-1:9092,kafka-2:9092,kafka-3:9092",
"compression.type": "lz4",
"linger.ms": 10,
"batch.size": 65536,
})
After deployment, p99 URL-to-fetch latency for low-volume hosts dropped from 4.2 hours to 18 minutes. High-volume hosts saw their throughput increase 3.5x because the work was now spread across 4 partitions instead of bottlenecking on one. The 5-broker cluster ran at 65 percent CPU during steady state, leaving headroom for daily peak loads.
Common pitfalls in production Kafka scraping
The first failure mode is consumer group rebalance storms during scraper deploys. When you redeploy 20 scraper workers via rolling restart, each worker’s exit triggers a rebalance, which pauses all consumers for 5-30 seconds while partitions reassign. With 20 workers redeploying sequentially, you accumulate 100-600 seconds of pause time during the deploy window. The fix is cooperative rebalancing (Kafka 2.4+) which only reassigns the partitions of the leaving worker rather than all partitions:
consumer = Consumer({
"bootstrap.servers": "kafka:9092",
"group.id": "scrapers",
"partition.assignment.strategy": "cooperative-sticky",
"session.timeout.ms": 30000,
"max.poll.interval.ms": 600000, # 10 min for slow scrapes
})
After enabling cooperative-sticky, deploy-induced pause time drops by roughly 90 percent because each worker exit only pauses its own 2-3 partitions instead of all 50.
The second pitfall is __consumer_offsets topic bloat. Kafka stores consumer group offsets in an internal compacted topic. Scrapers that commit offsets aggressively (every message instead of every batch) generate millions of offset commits per day, which can outpace the broker's compaction and balloon the topic to tens of GB. The fix is to commit in batches of 100-1000 messages rather than per message:
from confluent_kafka import TopicPartition

batch = []
while True:
    msg = consumer.poll(timeout=1.0)  # confluent-kafka consumers are not iterable
    if msg is None or msg.error():
        continue
    process(msg)  # scrape-and-produce step from the worker above
    # Commit the NEXT offset to consume, as TopicPartition objects
    batch.append(TopicPartition(msg.topic(), msg.partition(), msg.offset() + 1))
    if len(batch) >= 100:
        consumer.commit(offsets=batch)
        batch = []
The third pitfall is producer buffer exhaustion under back-pressure. When a broker becomes slow or unreachable, the producer's local queue fills with un-acked messages. (In the Java client this is buffer.memory and send() blocks once it is full; in confluent-kafka the queue is bounded by queue.buffering.max.messages and queue.buffering.max.kbytes, and produce() raises BufferError instead.) Scrapers that produce results while consuming URLs can stall the whole pipeline: the producer queue fills against a slow broker, the consumer cannot finish processing because results will not send, it exceeds max.poll.interval.ms, gets kicked from the group, and the pipeline halts. Configure delivery.timeout.ms and request.timeout.ms aggressively, call producer.poll(0) in the consume loop to drain delivery reports, and use producer.flush(timeout=10) with a timeout rather than blocking indefinitely:
producer = Producer({
    "bootstrap.servers": "kafka:9092",
    "delivery.timeout.ms": 30000,         # give up on un-acked messages after 30s
    "request.timeout.ms": 15000,
    "queue.buffering.max.kbytes": 65536,  # 64 MB local queue (librdkafka)
})
When the local queue is full, producer.produce() raises BufferError; handle it by polling the producer to free space and, if it stays full, spilling the message to a local fallback file rather than crashing the consumer.
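A sketch of that fallback path; safe_produce, the retry count, and the file format are assumptions, not a library API:

```python
import json

def safe_produce(producer, topic: str, key: bytes, value: bytes,
                 fallback_path: str) -> bool:
    """Produce without crashing the consumer when the local queue is full.

    On BufferError, serve delivery callbacks once (poll) to free queue
    space, retry, and if still full append the message to a local
    fallback file for later replay. Returns True on enqueue success.
    """
    for _attempt in range(2):
        try:
            producer.produce(topic, key=key, value=value)
            return True
        except BufferError:
            producer.poll(1.0)  # drain delivery reports, frees queue space
    with open(fallback_path, "a") as f:
        f.write(json.dumps({
            "topic": topic,
            "key": key.decode(),
            "value": value.decode(),
        }) + "\n")
    return False
```

A small replay script can later read the fallback file and re-produce its lines once the broker recovers.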
Wrapping up
Distributed scraping with Kafka pays off when volume justifies the operational complexity. For 10M+ URLs/day, the durability, partitioning, and multi-consumer architecture are hard to beat. For under 1M, simpler queues are usually the right answer. Pair this with our building scraping pipelines with Prefect 3 and Scrapy Cloud vs Crawlee Cloud writeups for the full pipeline picture, and browse the dev-tools-projects category on DRT for related infrastructure deep-dives.