Scraper state management is the unglamorous reason most large-scale crawls fail silently — pick the wrong store and you’re debugging duplicate records, lost progress, and race conditions at 3am instead of shipping. Whether you’re tracking visited URLs, storing session cookies, or coordinating a distributed worker pool, the choice between Redis, DynamoDB, and Postgres shapes your entire pipeline architecture.
What “State” Means in a Scraper Context
Before comparing stores, be precise about what you’re tracking. Scraper state falls into three categories:
- Frontier state — which URLs are queued, in-flight, or completed
- Session state — cookies, auth tokens, browser fingerprints (covered in depth in Session Management in Anti-Detect Browsers: Cookies, Storage State)
- Output state — deduplicated records, checksums, last-seen timestamps
Each category has different access patterns. Frontier state is high-write, high-read, and ephemeral. Session state is read-heavy and needs fast key lookups. Output state is append-heavy and benefits from rich querying. One database rarely handles all three well.
Redis: Fast, Fragile, and Perfectly Scoped
Redis is the right choice for frontier state and anything with a TTL. A Sorted Set with score = crawl priority is the canonical URL queue. SET with the NX and EX options gives you a cheap distributed lock with a built-in lease, so two workers don't fetch the same URL simultaneously.
```python
# Claim a URL with a 60s worker lease (SET NX + EX via redis-py)
acquired = redis_client.set(f"lock:{url_hash}", worker_id, nx=True, ex=60)
if acquired:
    process(url)
    redis_client.sadd("visited", url_hash)
```

The real risk is data loss. Redis persistence (AOF + RDB) helps, but under memory pressure the eviction policy will silently drop keys unless you've set maxmemory-policy noeviction and sized your instance correctly. For a 10M-URL frontier, plan for ~1.2 GB of RAM at minimum. Redis Cluster adds horizontal scale but complicates multi-key transactions, so keep your key schema flat.
Use Redis when: queue throughput exceeds 5K ops/sec, TTL-based expiry is a feature (not a bug), and you can tolerate eventual state loss on hard failures.
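A minimal sketch of the Sorted Set frontier, assuming the redis-py client and a hypothetical "frontier" key (lower score = higher priority; the key name and function shapes are illustrative, not from the article):

```python
def make_client():
    # Requires the redis-py package; connection details are an assumption.
    import redis
    return redis.Redis(decode_responses=True)

def enqueue(r, url: str, priority: float) -> None:
    # ZADD is idempotent per member, so re-discovered URLs don't duplicate.
    r.zadd("frontier", {url: priority})

def claim_next(r):
    # ZPOPMIN atomically removes and returns the lowest-score
    # (highest-priority) member, so two workers never receive the same URL.
    popped = r.zpopmin("frontier", 1)
    return popped[0][0] if popped else None
```

Note that ZPOPMIN alone loses the URL if a worker dies mid-fetch; pair it with the lock-and-lease pattern above when claimed URLs must survive crashes.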
DynamoDB: Global Scale With Painful Constraints
DynamoDB makes sense when you're running geographically distributed scrapers and need sub-10ms reads everywhere without managing replication yourself. The pattern from Scraper Sharding by IP, Geo, or Account Pool in 2026 maps cleanly onto DynamoDB's partition key model: shard by geo or account pool, and each shard becomes a natural DynamoDB partition.
The tradeoffs are real:
- Queries outside the primary key require GSIs, which add cost and replication lag.
- Transactions (up to 100 items) exist but are expensive — at 2x the WCU cost of standard writes.
- There is no native “pop from queue” atomic operation. You simulate it with conditional writes and risk hot partitions under burst load.
- Costs surprise teams at scale. At 1M writes/day with 1KB average item size, you’re looking at ~$1.25/day on-demand — cheap until your scraper has a retry bug and writes 50M items overnight.
DynamoDB shines for output state storage when records are keyed by a natural ID (product SKU, listing ID) and you need global tables for multi-region pipelines.
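One way to simulate the missing atomic pop is a lease taken via conditional write. A sketch under stated assumptions: a hypothetical frontier table keyed by url_id, with claimed_by and lease_until attributes that are not from the article. Only build_claim_request is pure; the boto3 call in claim is illustrative:

```python
import time

def build_claim_request(table: str, url_id: str, worker_id: str,
                        lease_s: int = 60) -> dict:
    # Build UpdateItem kwargs that atomically claim a frontier item.
    # The ConditionExpression lets the write succeed only if the item is
    # unclaimed or its lease has expired, simulating a queue "pop".
    now = int(time.time())
    return {
        "TableName": table,
        "Key": {"url_id": {"S": url_id}},
        "UpdateExpression": "SET claimed_by = :w, lease_until = :t",
        "ConditionExpression":
            "attribute_not_exists(claimed_by) OR lease_until < :now",
        "ExpressionAttributeValues": {
            ":w": {"S": worker_id},
            ":t": {"N": str(now + lease_s)},
            ":now": {"N": str(now)},
        },
    }

def claim(client, table: str, url_id: str, worker_id: str) -> bool:
    # Requires boto3/botocore; a failed condition means another worker
    # currently holds the lease.
    from botocore.exceptions import ClientError
    try:
        client.update_item(**build_claim_request(table, url_id, worker_id))
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False
        raise
```

Because every claim is a conditional write, burst traffic against a small set of hot url_id partitions is exactly the hot-partition risk noted above.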
Postgres: Slower, Safer, More Powerful
Postgres is underrated for scraper state because engineers assume it can’t handle queue workloads. It can, with the right schema. SKIP LOCKED turns a standard table into a reliable job queue without any extra infrastructure:
```sql
SELECT id, url FROM crawl_queue
WHERE status = 'pending'
LIMIT 10
FOR UPDATE SKIP LOCKED;
```

This is genuinely useful when your frontier is under 500K URLs and durability matters more than raw throughput. Postgres also wins on output state: complex deduplication, joins against reference data, and time-series queries on crawl metadata are all native SQL.
If your scraper runs multi-step workflows, Postgres as a state store pairs naturally with the Saga Pattern for Multi-Step Scraping Workflows in 2026 — each saga step writes a row with a status column, giving you a full audit trail and easy compensating transactions on failure.
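As a sketch, the per-step saga table could look like the DDL below; the table and column names are illustrative assumptions, not from the article:

```python
SAGA_STEPS_DDL = """
CREATE TABLE IF NOT EXISTS saga_steps (
    job_id     bigint      NOT NULL,
    step       text        NOT NULL,  -- e.g. 'fetch', 'parse', 'store'
    status     text        NOT NULL DEFAULT 'pending'
               CHECK (status IN ('pending', 'done', 'failed', 'compensated')),
    detail     jsonb,                 -- step output or error context
    updated_at timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (job_id, step)
);
"""
```

Querying this table by job_id reconstructs the full audit trail, and a compensating transaction is just an update that flips a step's status to 'compensated'.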
The ceiling is connection count. At 50+ concurrent workers, you need PgBouncer in transaction mode or you’ll exhaust Postgres connections fast. Write throughput on a single primary tops out around 10K rows/sec for simple inserts — fine for most pipelines, a hard limit for hyperscale crawls.
Comparison: Which Store for Which Job
| Use case | Redis | DynamoDB | Postgres |
|---|---|---|---|
| URL frontier (< 1M URLs) | good | overkill | good |
| URL frontier (> 10M URLs) | best | viable | risky |
| Session / cookie store | best | viable | poor |
| Distributed lock | best | viable | viable |
| Output deduplication | poor | good | best |
| Multi-step job state | poor | poor | best |
| Global multi-region | poor | best | poor |
| Ops complexity | low | low | medium |
| Cost at 10M writes/day | low | medium | low |
Combining Stores Without Over-Engineering
Most production scrapers use two stores, not one. Redis handles the hot path (queue + locks + session cache), Postgres handles durable output and workflow state. DynamoDB enters the picture only when you’re running cross-region infrastructure or need serverless autoscaling with zero ops.
A clean pattern: Redis as the frontier, Postgres as the sink. Workers pop from Redis, process the page, write extracted records to Postgres, and ack the job. If you need idempotent retries — and you do — read Scraper Idempotency: Why Your Retries Are Creating Duplicates (2026) before wiring up your retry logic, because the wrong approach creates silent duplicates that corrupt your dataset.
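A sketch of the idempotent sink write on the Postgres side, assuming a hypothetical records table with a unique natural_id column (names are illustrative); the checksum guard turns retried writes into no-ops instead of duplicates:

```python
UPSERT_SQL = """
INSERT INTO records (natural_id, payload, checksum, last_seen)
VALUES (%(id)s, %(payload)s, %(checksum)s, now())
ON CONFLICT (natural_id) DO UPDATE
   SET payload   = EXCLUDED.payload,
       checksum  = EXCLUDED.checksum,
       last_seen = EXCLUDED.last_seen
 WHERE records.checksum IS DISTINCT FROM EXCLUDED.checksum;
"""

def sink_record(conn, natural_id: str, payload: str, checksum: str) -> None:
    # Ack the Redis job only after this commit succeeds; replaying the same
    # record later is harmless because the upsert is idempotent.
    with conn.cursor() as cur:
        cur.execute(UPSERT_SQL, {"id": natural_id, "payload": payload,
                                 "checksum": checksum})
    conn.commit()
```

Committing before the ack means a crash can re-deliver a job but never lose one, which is the safe side of the tradeoff for a sink.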
For event-driven pipelines where scrape jobs are triggered by upstream signals rather than a static URL list, consider whether your state store needs to integrate with a message bus. Event-Driven Scraping with Kafka Connect in 2026 covers how to checkpoint consumer offsets alongside scraper state so restarts don’t reprocess or miss records.
Keep the architecture boring: two stores with clear ownership beats three stores with blurry boundaries.
Bottom Line
For most engineering teams in 2026: use Redis for the frontier and session cache, Postgres for durable output and workflow state, and only reach for DynamoDB if you have genuine multi-region requirements or a serverless constraint. Don’t add infrastructure to solve problems you don’t have yet. Coverage of production scraper architecture patterns, including state management edge cases, is a recurring focus at dataresearchtools.com.