Distributed scraping with Apache Kafka in 2026

Distributed scraping with Apache Kafka is what you reach for when a single Scrapy process or single-machine Playwright cluster cannot keep up with your URL volume. Kafka excels at moving high-volume work between producers (URL discovery, sitemap crawlers) and consumers (scraper workers) with durable messaging, partitioning for parallelism, and built-in retry semantics. By 2026 Kafka 3.x is mature enough that running it in production for scraping pipelines is standard practice, especially for teams already operating Kafka for other data flows.

This guide covers Kafka topic design for scraping pipelines, partition strategies that keep scrapers fed, dead letter queue patterns for failed URLs, exactly-once semantics for deduplicated processing, and a complete working Python implementation using confluent-kafka. The benchmarks reflect production deployments handling 10-50 million URLs per day. By the end you will know how to architect a Kafka-backed scraping pipeline that scales horizontally and survives operational chaos.

When Kafka fits scraping

Kafka shines when:

  • You have multiple URL sources (sitemaps, APIs, customer uploads, periodic crawls) feeding scrapers
  • Scraper workers run on different machines and you want to balance load
  • You need to retry failed URLs without losing them
  • You want to fan out the same URLs to multiple downstream processors (HTML store, ML extraction, analytics)
  • You operate other Kafka pipelines and want consistency
  • You need exactly-once processing semantics

Kafka does not fit when:

  • You have low volume (under 1M URLs/day, simpler queues like Redis or SQS suffice)
  • You operate one scraping job at a time (no need for the multi-producer/multi-consumer architecture)
  • Your team has no Kafka operational experience
  • You need request/response semantics (Kafka is one-way; for scraper status, use a separate sync API)

For most mid-volume scraping, simpler queues work. Kafka pays off above several million URLs per day or when fanout to multiple consumers is required.

For the official Kafka documentation, see kafka.apache.org/documentation/.

Architecture overview

A typical Kafka-backed scraping pipeline:

   URL sources                     Scraper workers           Storage / downstream
   +----------+                    +-------------+           +-------------+
   | sitemap  | --+                |             | --+       | postgres    |
   | crawler  |   |                | worker_1    |   |       +-------------+
   +----------+   |                |             |   |
                  |  +-------+     +-------------+   |       +-------------+
   +----------+   +->| urls  |---->|             |   +------>| s3 / r2     |
   | API feed |  +-->| topic |     | worker_2    |   |       +-------------+
   +----------+   |  +-------+     |             |   |
                  |                +-------------+   |       +-------------+
   +----------+   |                |             |   +------>| ML pipeline |
   | customer |--+                 | worker_N    |           +-------------+
   | upload   |                    +-------------+
   +----------+
                                                      retries: dlq topic

Each producer pushes URLs to the urls topic. Multiple scraper workers consume from partitions of that topic in parallel. Successful results go to a pages topic that downstream consumers pick up. Failed URLs go to a dlq topic for retry or manual review.

Topic design for scraping

The standard set of topics:

topic         purpose                  partitions   retention
urls          URLs to scrape           50           7 days
pages         scraped page content     20           14 days
extractions   parsed structured data   20           30 days
errors        scraper errors           5            7 days
dlq           failed URLs for retry    5            30 days
metrics       per-URL timing/status    10           1 day

Partitioning matters because Kafka guarantees ordered processing only within a partition. For scraping, you usually want:

  • By host: partition = hash(url.host) % num_partitions. Keeps all URLs for one host on one partition, enables per-host rate limiting.
  • By customer: partition = hash(customer_id) % num_partitions. Tenant isolation, billing per customer.
  • Random: any partition. Maximum parallelism, no host-level coordination.

For scraping with anti-bot vendors, by-host partitioning is the right default. It lets each consumer apply per-host throttling without coordinating with other consumers.
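
As a sketch, these strategies differ only in what you pass as the message key: Kafka's default partitioner hashes the key to pick a partition, and a None key gets spread across partitions.

from urllib.parse import urlparse

def key_by_host(url: str) -> bytes:
    # All URLs for one host land on the same partition
    return (urlparse(url).hostname or "").encode()

def key_by_customer(customer_id: str) -> bytes:
    # Tenant isolation: one customer's URLs stay together
    return customer_id.encode()

# Random: pass key=None and let the default partitioner spread messages
# producer.produce(topic="urls", key=None, value=payload)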

Setup: Kafka 3.x with KRaft

Kafka 3.5 deprecated the ZooKeeper dependency (it was removed entirely in 4.0). KRaft (Kafka Raft) is now the recommended setup. For local dev:

# docker-compose.yml
version: "3.8"
services:
  kafka:
    image: bitnami/kafka:3.7
    ports:
      - "9092:9092"
    environment:
      KAFKA_CFG_NODE_ID: 1
      KAFKA_CFG_PROCESS_ROLES: controller,broker
      KAFKA_CFG_CONTROLLER_QUORUM_VOTERS: 1@kafka:9093
      KAFKA_CFG_LISTENERS: PLAINTEXT://:9092,CONTROLLER://:9093
      KAFKA_CFG_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_CFG_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT

For production, use a managed service (Confluent Cloud, AWS MSK) or self-host on at least three nodes with proper replication.

Create the topics:

kafka-topics.sh --create --bootstrap-server localhost:9092 \
    --topic urls --partitions 50 --replication-factor 1 \
    --config retention.ms=604800000

kafka-topics.sh --create --bootstrap-server localhost:9092 \
    --topic pages --partitions 20 --replication-factor 1 \
    --config retention.ms=1209600000

kafka-topics.sh --create --bootstrap-server localhost:9092 \
    --topic dlq --partitions 5 --replication-factor 1 \
    --config retention.ms=2592000000

In production, use replication-factor 3 for durability.
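
If you would rather create topics from code than from the shell, the equivalent via confluent-kafka's AdminClient looks roughly like this:

from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

new_topics = [
    NewTopic("urls", num_partitions=50, replication_factor=1,
             config={"retention.ms": "604800000"}),   # 7 days
    NewTopic("pages", num_partitions=20, replication_factor=1,
             config={"retention.ms": "1209600000"}),  # 14 days
    NewTopic("dlq", num_partitions=5, replication_factor=1,
             config={"retention.ms": "2592000000"}),  # 30 days
]

# create_topics returns {topic_name: future}; wait on each to surface errors
for topic, future in admin.create_topics(new_topics).items():
    future.result()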

Producer: feeding URLs

A simple Python producer using confluent-kafka:

# producer.py
from confluent_kafka import Producer
from urllib.parse import urlparse
import json
import time

producer = Producer({
    "bootstrap.servers": "kafka:9092",
    "client.id": "url-discovery",
    "acks": "all",
    "compression.type": "lz4",
    "retries": 5,
    "linger.ms": 10,
})

def url_key(url: str) -> bytes:
    """Partition by host so per-host throttling works in consumers."""
    host = urlparse(url).hostname or ""
    return host.encode()

def produce_url(url: str, customer_id: str = "default", priority: int = 0):
    payload = {
        "url": url,
        "customer_id": customer_id,
        "priority": priority,
        "discovered_at": int(time.time()),
    }
    producer.produce(
        topic="urls",
        key=url_key(url),
        value=json.dumps(payload).encode(),
        callback=lambda err, msg: (
            print(f"Failed: {err}") if err else None
        ),
    )

if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as f:
        for line in f:
            url = line.strip()
            if url:
                produce_url(url)
    producer.flush(30)  # wait up to 30s for outstanding deliveries

Key choices:

  • acks="all": wait for all in-sync replicas to acknowledge. Durable but slower.
  • compression.type="lz4": roughly 60-80% size reduction with low CPU overhead.
  • linger.ms=10: batch up to 10ms before sending. Reduces request count.
  • key=hostname.encode(): partitions by host.

For very high throughput, use acks=1 (only leader) and accept slightly higher loss risk in exchange for 2-3x speed.
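
As a sketch, a throughput-tuned variant of the producer config above (assuming the loss of a handful of URLs on broker failover is acceptable):

from confluent_kafka import Producer

fast_producer = Producer({
    "bootstrap.servers": "kafka:9092",
    "client.id": "url-discovery-fast",
    "acks": "1",                # leader-only acknowledgement
    "compression.type": "lz4",
    "linger.ms": 50,            # larger batches, fewer requests
})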

Consumer: the scraper worker

A consumer that fetches URLs and produces results:

# scraper_worker.py
from confluent_kafka import Consumer, Producer, KafkaError
import json
import requests
import time
import logging
from collections import defaultdict

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "scrapers",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
    "max.poll.interval.ms": 600000,  # 10 min
    "session.timeout.ms": 30000,
})
consumer.subscribe(["urls"])

producer = Producer({
    "bootstrap.servers": "kafka:9092",
    "compression.type": "lz4",
    "acks": "all",
})

# Per-host last-fetch timestamps for throttling
last_fetch = defaultdict(float)
HOST_DELAY = 1.5  # seconds between same-host requests

def scrape(url: str) -> dict:
    from urllib.parse import urlparse
    host = urlparse(url).hostname

    # Per-host throttle
    elapsed = time.time() - last_fetch[host]
    if elapsed < HOST_DELAY:
        time.sleep(HOST_DELAY - elapsed)
    last_fetch[host] = time.time()

    resp = requests.get(url, timeout=30, headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ...",
    })
    return {
        "url": url,
        "status": resp.status_code,
        "html": resp.text if resp.status_code == 200 else None,
        "fetched_at": int(time.time()),
    }

def send_to_dlq(message_value: bytes, error: str):
    producer.produce(
        topic="dlq",
        key=message_value,
        value=json.dumps({
            "original": json.loads(message_value),
            "error": error,
            "failed_at": int(time.time()),
        }).encode(),
    )

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None:
        continue
    if msg.error():
        if msg.error().code() == KafkaError._PARTITION_EOF:
            continue
        logger.error(f"Consumer error: {msg.error()}")
        continue

    payload = json.loads(msg.value())
    url = payload["url"]

    try:
        result = scrape(url)
        if result["status"] == 200:
            producer.produce(
                topic="pages",
                key=url.encode(),
                value=json.dumps(result).encode(),
            )
        elif result["status"] >= 500:
            send_to_dlq(msg.value(), f"HTTP {result['status']}")
        # else: 4xx errors are skipped (404, 403, etc.)

        consumer.commit(msg)  # commit offset only on success or known-skip
    except Exception as e:
        logger.error(f"Failed to scrape {url}: {e}")
        send_to_dlq(msg.value(), str(e))
        consumer.commit(msg)  # still commit, DLQ has the message

Key choices:

  • enable.auto.commit=False: manual commit so we only advance after processing
  • Per-host throttle via last_fetch dict
  • DLQ for transient errors (5xx, exceptions)
  • Skip 4xx errors (404, 403 are usually permanent for that URL)

Per-host rate limiting at scale

Within a consumer group, Kafka assigns each partition to exactly one consumer at a time. Because by-host partitioning puts all of a host's URLs on one partition, the simple per-process throttle above effectively becomes a per-host throttle across the whole cluster: no two workers ever fetch the same host concurrently.

For finer control (multiple consumers, different rate limits per customer), use a distributed rate limiter (Redis-backed):

import redis
import time

r = redis.Redis(host="redis", port=6379)

def can_fetch_host(host: str, max_per_sec: float = 1.0) -> bool:
    """Sliding-window rate limit via a Redis sorted set."""
    key = f"rate:{host}"
    now = time.time()

    # Count requests in the last second, without recording this attempt yet,
    # so denied polls don't inflate the window and starve the host forever
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - 1)  # drop entries older than 1s
    pipe.zcard(key)
    _, count = pipe.execute()
    if count >= max_per_sec:
        return False

    # Record this fetch
    pipe = r.pipeline()
    pipe.zadd(key, {str(now): now})
    pipe.expire(key, 10)
    pipe.execute()
    return True

# In scraper:
while not can_fetch_host(host):
    time.sleep(0.1)

This gives global rate limiting independent of partition distribution.

Dead letter queue handling

DLQ patterns:

# dlq_consumer.py
from confluent_kafka import Consumer, Producer
import json
import time

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "dlq-retry",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
    "max.poll.interval.ms": 7200000,  # must exceed the cooldown sleep below
})
consumer.subscribe(["dlq"])

producer = Producer({"bootstrap.servers": "kafka:9092"})

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None:
        continue
    if msg.error():
        continue

    record = json.loads(msg.value())
    age = time.time() - record["failed_at"]

    # Retry after a 1-hour cooldown. DLQ partitions are roughly time-ordered,
    # so wait out the remainder rather than skipping: a skipped-but-committed
    # message would never be retried.
    if age < 3600:
        time.sleep(3600 - age)

    # Re-emit to urls topic
    producer.produce(
        topic="urls",
        key=record["original"]["url"].encode(),
        value=json.dumps(record["original"]).encode(),
    )
    producer.flush()

    consumer.commit(msg)

For permanent failures, point a dashboard query at the DLQ topic for ops review.
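
To stop a permanently broken URL from cycling through retries forever, carry a retry count in the record and cap it. A sketch, as a drop-in replacement for the re-emit step above (the retry_count field and the permanent-failures topic are assumptions, not part of the code above):

MAX_RETRIES = 3  # illustrative cap

retries = record["original"].get("retry_count", 0)
if retries >= MAX_RETRIES:
    # Park it for manual review instead of re-queueing
    producer.produce(topic="permanent-failures", value=msg.value())
else:
    original = dict(record["original"], retry_count=retries + 1)
    producer.produce(
        topic="urls",
        key=original["url"].encode(),
        value=json.dumps(original).encode(),
    )
consumer.commit(msg)

Because send_to_dlq preserves the original payload verbatim, the count survives each round trip through the DLQ.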

Exactly-once semantics

Kafka 3.x supports exactly-once semantics (EOS) via transactional producers and read_committed consumers. For scraping, this matters when you want each URL processed exactly once across worker restarts and failures.

from confluent_kafka import Producer, TopicPartition

producer = Producer({
    "bootstrap.servers": "kafka:9092",
    "transactional.id": f"scraper-{worker_id}",  # must be stable per worker
    "enable.idempotence": True,
})
producer.init_transactions()

# In the scraper loop, after successfully processing msg:
producer.begin_transaction()
producer.produce(topic="pages", key=url.encode(), value=json.dumps(result).encode())
producer.send_offsets_to_transaction(
    # Commit the *next* offset to consume, hence + 1
    [TopicPartition("urls", msg.partition(), msg.offset() + 1)],
    consumer.consumer_group_metadata(),
)
producer.commit_transaction()

This guarantees that if the scraper crashes mid-process, the offset commit and the result production happen atomically: either both, or neither. The next consumer instance picks up from the same offset and processes the URL again.
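
The reader side matters too: downstream consumers of pages must set isolation.level=read_committed, or they will see results from transactions that later aborted. A minimal sketch:

from confluent_kafka import Consumer

pages_consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "page-loaders",
    "isolation.level": "read_committed",  # skip messages from aborted transactions
    "auto.offset.reset": "earliest",
})
pages_consumer.subscribe(["pages"])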

Comparison: Kafka vs alternatives for scraping queues

platform         pros                                            cons                          scale
Kafka            high throughput, durable, multi-consumer, EOS   complex ops, requires JVM     up to 10M+ msg/sec
Redis Streams    simple, fast, low overhead                      single-node, less durable     up to 100k msg/sec
RabbitMQ         rich routing, mature                            lower throughput than Kafka   up to 100k msg/sec
SQS              managed, cheap, infinite scale                  per-message cost, no replay   up to millions/sec
NATS JetStream   simple, fast, durable                           smaller community             up to 1M msg/sec
Pulsar           Kafka alternative, geo-replication              smaller adoption              comparable to Kafka

For scraping, Kafka and Pulsar both work; Kafka has the larger community. SQS is often the simpler choice if your AWS bill is comfortable with per-message charges. For under 1M URLs/day, Redis Streams is hard to beat for simplicity.

Monitoring

Critical Kafka metrics for scraping:

metric                        what it tells you
consumer lag (urls topic)     how far behind workers are
produce rate (urls)           URL discovery throughput
consume rate (urls)           scrape throughput
DLQ size                      failure rate
rebalance count               worker churn
under-replicated partitions   broker issues

Tools:

  • Kafka Exporter for Prometheus: scrapes Kafka metrics
  • Conduktor or Confluent Control Center: GUI for cluster management
  • Burrow: consumer lag monitoring with alerting
  • Strimzi: Kubernetes operator with built-in monitoring

Set alerts on consumer lag > 30 min, DLQ size growth, broker downtime.
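
You can also check lag ad hoc from Python by comparing each partition's committed offset against its high watermark. A minimal sketch, assuming the 50-partition urls topic from earlier:

from confluent_kafka import Consumer, TopicPartition

# Same group.id as the workers; we only query offsets, never subscribe
c = Consumer({"bootstrap.servers": "kafka:9092", "group.id": "scrapers"})
partitions = [TopicPartition("urls", p) for p in range(50)]

for tp in c.committed(partitions, timeout=10):
    low, high = c.get_watermark_offsets(tp, timeout=10)
    committed = tp.offset if tp.offset >= 0 else low  # offset < 0 means no commit yet
    print(f"urls[{tp.partition}] lag = {high - committed}")
c.close()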

Production deployment

For 10M URLs/day:

  • 3-broker Kafka cluster on Kafka 3.7 with KRaft
  • 50 partitions on urls topic
  • 20 scraper workers, each consuming from 2-3 partitions
  • Per-host throttle via partition assignment + Redis backstop
  • Postgres for results (loaded from pages topic by separate consumers)
  • DLQ retry consumer running every hour
  • Prometheus + Grafana for monitoring
  • ~$1500/month in compute + Kafka

For 100M URLs/day:

  • 5+ broker cluster with replication-factor 3
  • 200+ partitions
  • 100+ scraper workers
  • Multiple Kafka clusters segmented by region
  • ~$15k/month compute + Kafka

Operational checklist

For Kafka-backed scraping in 2026:

  • Kafka 3.5+ with KRaft (no ZooKeeper)
  • Topic design: urls, pages, dlq, errors, metrics
  • Partition by host for per-host coordination
  • LZ4 compression on all topics
  • acks=all for producers (durability)
  • Manual commit on consumers (process before commit)
  • Per-host throttling (partition-level + Redis backstop)
  • DLQ retry consumer
  • Prometheus + Grafana monitoring
  • Alerts on lag, DLQ growth, broker health
  • Schema registry if pipeline complexity grows

For broader pipeline patterns, see building scraping pipelines with Prefect 3 and building scraping pipelines with Dagster.

Common pitfalls

  • Partition imbalance: bad partition keys cause hot partitions. Audit URL distribution by partition.
  • Slow consumers and consumer group rebalances: long-running scrapes blow past max.poll.interval.ms. Bump it or split work.
  • Offset commit before processing: leads to data loss. Always commit after successful processing.
  • DLQ overflow: if 30% of URLs fail, DLQ fills faster than retries clear it. Tune retry policy or fix underlying issues.
  • Memory pressure on consumers: large pages (>1 MB HTML) accumulate in the producer buffer. Cap message size or stream page bodies directly to S3 (see the sketch after this list).
  • Kafka cluster upgrade missteps: KRaft migration requires careful planning. Read the Kafka KRaft docs.
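
For the memory-pressure pitfall, a minimal sketch of the offload pattern: store the HTML body in object storage and produce only a small pointer message. The bucket name and helper are illustrative, and boto3 is assumed:

import boto3
import hashlib
import json

s3 = boto3.client("s3")

def produce_page_pointer(producer, url: str, html: str):
    key = f"pages/{hashlib.sha256(url.encode()).hexdigest()}.html"
    s3.put_object(Bucket="scraped-pages", Key=key, Body=html.encode())
    # The Kafka message stays small: the URL plus where the body lives
    producer.produce(
        topic="pages",
        key=url.encode(),
        value=json.dumps({"url": url, "s3_key": key}).encode(),
    )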

FAQ

Q: do I need Kafka for under 1M URLs/day?
Probably not. Redis Streams or SQS will be simpler and sufficient. Kafka pays off above 1-10M URLs/day where the durability, throughput, and multi-consumer fanout matter.

Q: can I use Kafka for both URL queue and result storage?
Kafka is a queue, not a database. Use it for streaming and short-term retention. Store results in a database (Postgres, ClickHouse) or object storage (S3, R2) via downstream consumers.

Q: how do I retry failed URLs without losing the original message?
DLQ topic. Failed processing produces to DLQ, original offset commits. A retry consumer reads DLQ on a schedule and re-emits to the urls topic.

Q: what about Kafka Streams for in-flight processing?
Useful for derived data (aggregations, deduplication). For raw scraping, KafkaConsumer is enough. Add Streams when you need stateful enrichment.

Q: does this work with Confluent Cloud?
Yes. The Python client is identical, only the bootstrap.servers and SASL credentials change. Confluent Cloud removes the ops burden at the cost of usage-based pricing.

Real-world example: 80M URL crawl with KRaft and per-host fairness

A scraping team migrated from a single-broker Kafka cluster to a 5-broker KRaft cluster handling 80 million URLs per day across 12,000 distinct hosts. The bottleneck before migration was not throughput but fairness: a few hosts (Amazon, eBay, Walmart) made up 40 percent of all URLs and starved smaller hosts because their partitions had longer queue depth. The fix involved a custom partitioner that hashed by host AND by a “fairness band” derived from the host’s URL queue depth:

import hashlib
from confluent_kafka import Producer

class FairHostPartitioner:
    def __init__(self, num_partitions: int, queue_depth_provider):
        self.num_partitions = num_partitions
        self.queue_depth = queue_depth_provider  # callable(host) -> int

    def partition(self, key: bytes, all_partitions: list) -> int:
        host = key.decode().split("/")[2]  # key here is the full URL, e.g. b"https://host/path"
        depth = self.queue_depth(host)
        # High-depth hosts get spread across multiple partitions
        if depth > 100_000:
            band = hashlib.sha256(key).hexdigest()[:8]
            band_int = int(band, 16) % 4  # spread across 4 partitions
            base = int(hashlib.sha256(host.encode()).hexdigest()[:8], 16)
            return (base + band_int) % self.num_partitions
        # Low-depth hosts get one partition (per-host throttling preserved)
        return int(hashlib.sha256(host.encode()).hexdigest()[:8], 16) % self.num_partitions

producer = Producer({
    "bootstrap.servers": "kafka-1:9092,kafka-2:9092,kafka-3:9092",
    "compression.type": "lz4",
    "linger.ms": 10,
    "batch.size": 65536,
})
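
Note that confluent-kafka's Python client does not accept pluggable partitioner classes, so the class above is applied by computing the partition yourself and passing it to produce() explicitly. A sketch of the wiring, where get_queue_depth is a hypothetical depth lookup (e.g. Redis-backed):

import json

partitioner = FairHostPartitioner(num_partitions=200,
                                  queue_depth_provider=get_queue_depth)

url = "https://example.com/product/123"
producer.produce(
    topic="urls",
    key=url.encode(),
    value=json.dumps({"url": url}).encode(),
    partition=partitioner.partition(url.encode(), all_partitions=None),
)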

After deployment, p99 URL-to-fetch latency for low-volume hosts dropped from 4.2 hours to 18 minutes. High-volume hosts saw their throughput increase 3.5x because the work was now spread across 4 partitions instead of bottlenecking on one. The 5-broker cluster ran at 65 percent CPU during steady state, leaving headroom for daily peak loads.

Common pitfalls in production Kafka scraping

The first failure mode is consumer group rebalance storms during scraper deploys. When you redeploy 20 scraper workers via rolling restart, each worker’s exit triggers a rebalance, which pauses all consumers for 5-30 seconds while partitions reassign. With 20 workers redeploying sequentially, you accumulate 100-600 seconds of pause time during the deploy window. The fix is cooperative rebalancing (Kafka 2.4+) which only reassigns the partitions of the leaving worker rather than all partitions:

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "scrapers",
    "partition.assignment.strategy": "cooperative-sticky",
    "session.timeout.ms": 30000,
    "max.poll.interval.ms": 600000,  # 10 min for slow scrapes
})

After enabling cooperative-sticky, deploy-induced pause time drops by roughly 90 percent because each worker exit only pauses its own 2-3 partitions instead of all 50.

The second pitfall is __consumer_offsets topic bloat. Kafka stores consumer group offsets in an internal compacted topic. Scrapers that commit offsets aggressively (every message instead of every batch) generate millions of offset commits per day, which can outgrow the broker's compaction capacity and cause the topic to balloon to tens of GB. The fix is to commit in batches of 100-1000 messages rather than per message:

from confluent_kafka import TopicPartition

batch = []
while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    process(msg)
    # Commit the *next* offset to consume, hence offset + 1
    batch.append(TopicPartition(msg.topic(), msg.partition(), msg.offset() + 1))
    if len(batch) >= 100:
        consumer.commit(offsets=batch, asynchronous=False)
        batch = []

The third pitfall is producer buffer exhaustion under back-pressure. When a broker becomes slow or unreachable, the producer's local queue (bounded by queue.buffering.max.messages and queue.buffering.max.kbytes in librdkafka; the Java client's equivalent is buffer.memory) fills with un-acked messages. Scrapers that produce results while consuming URLs can then stall the whole pipeline: the producer stops accepting results, the consumer stops making progress, blows past max.poll.interval.ms, gets kicked from the group, and everything halts. Configure delivery.timeout.ms and request.timeout.ms aggressively so stuck messages fail fast, and call producer.flush(10) with a timeout rather than blocking indefinitely:

producer = Producer({
    "bootstrap.servers": "kafka:9092",
    "delivery.timeout.ms": 30000,          # fail a message after 30s instead of retrying forever
    "request.timeout.ms": 15000,
    "queue.buffering.max.kbytes": 65536,   # 64MB local queue
})

When the local queue is full, producer.produce() raises BufferError immediately. Handle it by calling producer.poll() to serve delivery callbacks (which frees queue space) and, if the queue stays full, spilling the message to a local fallback file rather than crashing the consumer.
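
A minimal sketch of that fallback handling (the fallback path is illustrative):

def safe_produce(topic: str, key: bytes, value: bytes):
    try:
        producer.produce(topic=topic, key=key, value=value)
    except BufferError:
        # Local queue full: serve delivery callbacks to free space, retry once
        producer.poll(1.0)
        try:
            producer.produce(topic=topic, key=key, value=value)
        except BufferError:
            # Still full: spill to disk so the consumer loop keeps moving
            with open("/var/tmp/kafka-fallback.jsonl", "ab") as f:
                f.write(value + b"\n")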

Wrapping up

Distributed scraping with Kafka pays off when volume justifies the operational complexity. For 10M+ URLs/day, the durability, partitioning, and multi-consumer architecture are hard to beat. For under 1M, simpler queues are usually the right answer. Pair this with our building scraping pipelines with Prefect 3 and Scrapy Cloud vs Crawlee Cloud writeups for the full pipeline picture, and browse the dev-tools-projects category on DRT for related infrastructure deep-dives.
