Choosing the wrong distributed scraper architecture costs you weeks of debugging, not hours. In 2026, the two dominant patterns for distributed scraper architecture are master-worker and pub-sub, and they are not interchangeable. Pick the wrong one and you end up with a scheduling bottleneck at 50k URLs/day or a message backlog that takes down your Redis cluster at 500k. This piece breaks down both patterns, pinpoints where each one fails, and shows how to decide fast.
Master-Worker: Simple Until It Isn’t
Master-worker is the most common starting point. One coordinator process owns the URL queue and work state and hands out jobs to N workers. Workers pull tasks from the master, execute them, and report back.
It works well at small to medium scale because the control flow is obvious. The master has global visibility: retry logic, deduplication, and rate limiting all live in one place. Frameworks like Scrapy with a Redis scheduler, or a custom FastAPI coordinator feeding Celery workers, follow this model.
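A minimal sketch of this model, with in-memory stand-ins for the queue and state (real setups use Redis or Celery; all class and URL names here are illustrative):

```python
import queue
import threading

class Coordinator:
    """Toy master: owns the URL queue, the dedup set, and result state."""
    def __init__(self, urls):
        self.tasks = queue.Queue()
        self.seen = set()
        self.results = {}
        self.lock = threading.Lock()
        for url in urls:
            self.submit(url)

    def submit(self, url):
        # Dedup lives in one place because the master has global visibility.
        if url not in self.seen:
            self.seen.add(url)
            self.tasks.put(url)

    def report(self, url, status):
        with self.lock:
            self.results[url] = status

def worker(coord, fetch):
    # Workers pull tasks, execute, and report back to the master.
    while True:
        try:
            url = coord.tasks.get_nowait()
        except queue.Empty:
            return
        coord.report(url, fetch(url))
        coord.tasks.task_done()

coord = Coordinator(["https://a.example/1", "https://a.example/2",
                     "https://a.example/1"])  # duplicate is dropped at submit
threads = [threading.Thread(target=worker, args=(coord, lambda u: "ok"))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The appeal is visible even in the toy version: retry counts, dedup, and rate limits would all attach to `Coordinator`, one process with the full picture.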
The architecture collapses in two scenarios:
- Master becomes a single point of failure. If it crashes mid-crawl, in-flight jobs vanish unless you have explicit lease tracking.
- Scheduling throughput caps out. At high worker counts (100+), the master’s dispatch loop becomes the bottleneck. You see workers sitting idle waiting for assignment.
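The in-flight-job problem above is usually mitigated with explicit lease tracking: the master records which worker holds each job and when the lease expires, and on restart re-queues anything stale. A minimal in-memory sketch (names and TTL are illustrative):

```python
import time

class LeaseTable:
    """Tracks which worker holds which URL and when that lease expires."""
    def __init__(self, ttl_seconds=30.0):
        self.ttl = ttl_seconds
        self.leases = {}  # url -> (worker_id, expiry_timestamp)

    def acquire(self, url, worker_id, now=None):
        now = time.time() if now is None else now
        holder = self.leases.get(url)
        if holder and holder[1] > now:
            return False  # another worker holds a live lease
        self.leases[url] = (worker_id, now + self.ttl)
        return True

    def release(self, url):
        self.leases.pop(url, None)

    def reclaim_expired(self, now=None):
        """Run on master restart (or periodically) to re-queue stale jobs."""
        now = time.time() if now is None else now
        expired = [u for u, (_, exp) in self.leases.items() if exp <= now]
        for u in expired:
            del self.leases[u]
        return expired

table = LeaseTable(ttl_seconds=30.0)
table.acquire("https://a.example/1", "worker-1", now=0.0)
stale = table.reclaim_expired(now=60.0)  # worker-1 died; lease has expired
```

The reclaimed URLs go back on the queue, so a master crash costs you at most one lease TTL of duplicate work per in-flight job, not the jobs themselves.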
For most scraping projects under 200k URLs/day with moderate worker counts (under 30), master-worker is fine. For anything larger, you need the pub-sub path.
Pub-Sub: Scale at the Cost of Complexity
Pub-sub inverts the control model. Producers push URLs (or scraping tasks) onto a topic or queue, and workers consume independently. There is no central coordinator — just a broker in the middle.
The canonical 2026 stack uses Kafka for high-volume pipelines or Redis Streams for mid-tier setups. Scraper Queue Patterns 2026: SQS vs Redis vs RabbitMQ vs Kafka covers the broker tradeoffs in depth, but the short version: Kafka if you need replay and audit, Redis Streams if you want low latency and simpler ops, SQS if you are already on AWS and want zero broker maintenance.
```yaml
# Kafka topic config for a scraper workload
scraper-tasks:
  partitions: 24              # ~1 partition per 4 workers at a 100-worker ceiling
  replication-factor: 3
  retention.ms: 86400000      # 24h replay window for dead-letter recovery
  min.insync.replicas: 2
```

Pub-sub shines when you have multiple consumer types reading from the same stream. Your HTML fetcher, your parser, and your structured-data validator can all consume independently, making the pipeline composable. Adding a new consumer does not touch existing code.
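That composability comes from each consumer tracking its own offset into a shared log, Kafka-consumer-group style. A toy in-memory version (the `Log`/`Consumer` classes are stand-ins for a real broker topic and client):

```python
class Log:
    """In-memory stand-in for a broker topic: an append-only record list."""
    def __init__(self):
        self.records = []

    def publish(self, record):
        self.records.append(record)

class Consumer:
    """Each consumer owns its offset; consumers never affect each other."""
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        batch = self.log.records[self.offset:]
        self.offset = len(self.log.records)
        return batch

topic = Log()
fetcher, validator = Consumer(topic), Consumer(topic)
topic.publish({"url": "https://a.example/1", "html": "<p>hi</p>"})
fetched = fetcher.poll()      # the fetcher sees the record...
validated = validator.poll()  # ...and so does the validator, independently
topic.publish({"url": "https://a.example/2", "html": "<p>bye</p>"})
# A consumer added later replays from offset 0 without touching the others.
auditor = Consumer(topic)
audit_batch = auditor.poll()
```

This is also why the 24h `retention.ms` in the config above matters: a late-added consumer can only replay what the broker still retains.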
The operational overhead is real. Message ordering, deduplication, and exactly-once semantics are all your problem now. For multi-step workflows where step B depends on step A’s output, you will want to look at Saga Pattern for Multi-Step Scraping Workflows in 2026 before you commit to a pub-sub design.
Pattern Comparison at a Glance
| Dimension | Master-Worker | Pub-Sub |
|---|---|---|
| Coordination | Centralized | Decentralized via broker |
| Failure surface | Master crash = full stop | Broker crash = queue pause |
| Scaling ceiling | ~50 workers before bottleneck | 500+ workers with partitioning |
| Retry logic | Simple (master owns state) | Complex (consumer must handle) |
| Operational overhead | Low | Medium-high |
| Best fit | Focused crawls, internal tools | Multi-stage pipelines, high volume |
| 2026 tooling | Scrapy+Redis, Celery, APScheduler | Kafka, Redis Streams, SQS+Lambda |
State Management: Where Both Patterns Get Complicated
Neither pattern is clean without solving seen-URL deduplication and job status tracking. This is where most production scraper failures actually originate.
In master-worker setups, state lives on the master: a Postgres table tracking URL status, timestamps, and retry counts is the standard. The risk is that at high insert rates, Postgres starts to lag, and you end up querying stale state.
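A sketch of that standard status table, using sqlite3 as a stand-in for Postgres (the schema and helper names are illustrative; in Postgres the dedup insert would be `INSERT ... ON CONFLICT DO NOTHING`):

```python
import sqlite3

# sqlite3 in-memory DB as a stand-in for Postgres; the schema is the point.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE url_status (
        url         TEXT PRIMARY KEY,
        status      TEXT NOT NULL DEFAULT 'pending',  -- pending|in_flight|done|failed
        retry_count INTEGER NOT NULL DEFAULT 0,
        updated_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
""")

def enqueue(url):
    """Insert-or-ignore makes repeated enqueues a no-op at the table level."""
    conn.execute("INSERT OR IGNORE INTO url_status (url) VALUES (?)", (url,))

def mark_failed(url):
    conn.execute(
        "UPDATE url_status SET status = 'failed', "
        "retry_count = retry_count + 1, updated_at = CURRENT_TIMESTAMP "
        "WHERE url = ?", (url,))

enqueue("https://a.example/1")
enqueue("https://a.example/1")  # duplicate enqueue does nothing
mark_failed("https://a.example/1")
row = conn.execute("SELECT status, retry_count FROM url_status").fetchone()
```

The retry-count column is what lets the master enforce a max-attempts policy without any worker-side state.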
Pub-sub setups tend to push state into Redis (for hot deduplication with a Bloom filter or a simple SET) and Postgres for durable history. The Scraper State Management: Redis, DynamoDB, or Postgres in 2026 breakdown is worth reading before you pick a store, especially if you are running at a scale where DynamoDB’s cost model starts to make sense.
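The hot-dedup layer is conceptually simple: hash the URL to a fixed-width key and do a check-and-add against a set. A minimal in-memory sketch mirroring the Redis SET approach (with real Redis, `SADD` returns 1 on first insert and 0 on a duplicate; a Bloom filter trades exactness for memory at larger scale):

```python
import hashlib

class HotDedup:
    """In-memory stand-in for a Redis SET used for seen-URL dedup."""
    def __init__(self):
        self._seen = set()

    def check_and_add(self, url):
        # Hash to a fixed-width key, as you would for a Redis set member.
        key = hashlib.sha256(url.encode()).hexdigest()
        if key in self._seen:
            return False  # already seen: drop it before it hits the queue
        self._seen.add(key)
        return True

dedup = HotDedup()
first = dedup.check_and_add("https://a.example/1")   # new URL
second = dedup.check_and_add("https://a.example/1")  # duplicate
```

The split in the text above is the usual division of labor: this check sits in the hot path, while Postgres keeps the durable history you can audit later.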
Regardless of pattern, idempotency is non-negotiable. Retries without idempotency keys produce silent duplicates in your output dataset. If you have not audited your retry paths, Scraper Idempotency: Why Your Retries Are Creating Duplicates (2026) will save you a painful data-cleaning exercise later.
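One way to make retries safe is a deterministic idempotency key derived from the task's identity, so a retry overwrites its earlier attempt instead of appending a duplicate. A sketch under illustrative assumptions (the key fields and the dict-as-results-store are stand-ins):

```python
import hashlib

def idempotency_key(url, task_type, schema_version="v1"):
    """Same logical task -> same key, regardless of how many times it runs."""
    raw = f"{task_type}:{schema_version}:{url}"
    return hashlib.sha256(raw.encode()).hexdigest()

output = {}  # stand-in for your results store, keyed by idempotency key

def write_result(url, task_type, data):
    # Keyed writes make retries upserts: attempt two replaces attempt one.
    output[idempotency_key(url, task_type)] = data

write_result("https://a.example/1", "parse", {"title": "first attempt"})
write_result("https://a.example/1", "parse", {"title": "retry"})  # no duplicate
```

The same key can double as a Postgres primary key or a DynamoDB partition key, which is what turns the write path into an upsert in a real store.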
Hybrid Patterns Worth Knowing
Pure master-worker and pure pub-sub are the theoretical poles. Most production scrapers in 2026 run a hybrid.
The most common variant: a lightweight coordinator (master) handles domain-level scheduling and rate limits, but delegates URL-level dispatch to a Redis Stream or SQS queue. Workers pull directly from the queue without going through the coordinator for every job.
This preserves the observability of master-worker (you have a single place to see crawl progress and impose per-domain politeness rules) while removing the dispatch bottleneck.
Three scenarios where this hybrid wins:
- You are scraping 50+ domains simultaneously with different rate limits per domain. The coordinator enforces per-domain token buckets; the queue handles volume.
- Your workers are heterogeneous (some fetch, some render JavaScript via headless Chrome, some handle authenticated sessions). Different consumer groups read from different queue partitions based on task type.
- You need burst capacity. Workers auto-scale based on queue depth (SQS + Lambda or Kafka consumer group lag), while the coordinator stays at fixed size.
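The first scenario's coordinator side can be sketched with per-domain token buckets gating admission to the shared queue (a plain list stands in for the Redis Stream or SQS queue; rates and domains are illustrative):

```python
class TokenBucket:
    """Per-domain politeness: refill at `rate` tokens/sec up to `capacity`."""
    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now):
        # Refill based on elapsed time, then spend one token if available.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {"a.example": TokenBucket(rate=1.0, capacity=2),
           "b.example": TokenBucket(rate=10.0, capacity=20)}

def admit(domain, url, task_queue, now):
    """Coordinator gate: only rate-limit-cleared URLs reach the shared queue."""
    if buckets[domain].allow(now):
        task_queue.append((domain, url))
        return True
    return False

q = []
admitted = [admit("a.example", f"https://a.example/{i}", q, now=0.0)
            for i in range(3)]  # capacity 2, so the third is deferred
```

Workers never see this logic: they pull from the queue at full speed, and the coordinator's bucket state is the only place politeness lives.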
The full design space for these patterns is covered in Web Scraping Architecture: Design Patterns and Best Practices, which is worth bookmarking as a reference as your system grows.
Bottom Line
If you are under 200k URLs/day with a single scraping stage, start with master-worker and invest the saved complexity budget in state management and idempotency. If you are building a multi-stage pipeline or need horizontal scale past 50 concurrent workers, go pub-sub with Kafka or Redis Streams from the start — retrofitting it later is painful. DRT covers these architecture decisions regularly because the wrong call at design time costs far more than the time spent reading first.
Related guides on dataresearchtools.com
- Scraper Queue Patterns 2026: SQS vs Redis vs RabbitMQ vs Kafka
- Saga Pattern for Multi-Step Scraping Workflows in 2026
- Scraper Idempotency: Why Your Retries Are Creating Duplicates (2026)
- Scraper State Management: Redis, DynamoDB, or Postgres in 2026
- Web Scraping Architecture: Design Patterns and Best Practices