Scraper Idempotency: Why Your Retries Are Creating Duplicates (2026)

Retries without idempotency guarantees are one of the most common causes of duplicate data in production scrapers.

Most engineers bolt on retry logic and call it resilience. What they actually ship is a duplicate factory. Scraper idempotency is the property that running the same job twice (whether by retry, crash recovery, or queue redelivery) produces exactly one clean record in your data store. Without it, every transient network blip or pod restart compounds into data quality debt that takes weeks to untangle.

Why Scrapers Duplicate in the First Place

Duplication almost always originates at one of three seams: job dispatch, HTTP execution, or persistence. A queue consumer crashes after a successful fetch but before acknowledging the message, so the broker redelivers. A requests.post to your ingestion endpoint times out on the client side but the write commits on the server side. A scraper queue with at-least-once delivery (the SQS default; Kafka consumer groups without idempotent producers) replays the same URL after a visibility timeout expires.

The failure pattern is always the same: success happened, but your code did not observe it, so it tries again.

Idempotency Keys: The Core Primitive

An idempotency key is a deterministic, collision-resistant identifier you derive from the job's inputs before any side effects occur. For a scraper, the key is typically a hash of the target URL plus a crawl-window timestamp, rounded to your deduplication window.

import hashlib
import time

def idempotency_key(url: str, window_minutes: int = 60) -> str:
    # floor the timestamp to a coarse bucket so retries inside the
    # deduplication window (and small clock drift) hash identically
    bucket = int(time.time() // (window_minutes * 60))
    raw = f"{url}|{bucket}"
    return hashlib.sha256(raw.encode()).hexdigest()

You write this key to your state store before fetching, using an upsert with a conflict-do-nothing guard. If the key already exists, the job is a duplicate and you skip it. If the write succeeds, you own the slot and proceed. This "claim first, fetch second" pattern is the only safe ordering; fetching first and writing after opens a window for double-writes.
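A minimal sketch of the claim step against Postgres, assuming a dedup_keys table with a primary-key key column (the table and column names are illustrative, and cur/conn are an open psycopg2 cursor and connection):

CLAIM_SQL = """
    INSERT INTO dedup_keys (key, claimed_at)
    VALUES (%s, now())
    ON CONFLICT (key) DO NOTHING
"""

def claim(cur, key: str) -> bool:
    # rowcount is 1 if we inserted (we own the slot), 0 if the key already existed
    cur.execute(CLAIM_SQL, (key,))
    return cur.rowcount == 1

Commit the claim before any side effect: if claim(cur, idempotency_key(url)) returns True, call conn.commit(), then fetch.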

For state store selection, the tradeoffs are real:

| store | dedup window | throughput | consistency |
| --- | --- | --- | --- |
| Redis SET NX | seconds to hours | very high | single-node; loses state on crash without AOF |
| Postgres upsert | unlimited | moderate | ACID; safe for billing-grade dedup |
| DynamoDB conditional write | unlimited | high | eventually consistent reads need care |
| Bloom filter (Redis) | approximate | extremely high | false positives, no deletion |

For high-throughput crawls, a two-layer approach works well: Redis for fast in-flight dedup, Postgres as the durable record. The scraper state management comparison covers the persistence tradeoffs in detail.
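A sketch of the two-layer check, assuming a local Redis instance and the claim() helper sketched above (redis-py's set with nx=True returns a truthy value only for the first writer):

import redis

r = redis.Redis()

def is_new(cur, key: str, ttl_seconds: int = 3600) -> bool:
    # layer 1: fast in-flight dedup; SET NX succeeds only for the first caller
    if not r.set(f"inflight:{key}", 1, nx=True, ex=ttl_seconds):
        return False
    # layer 2: the durable Postgres claim survives a Redis restart
    return claim(cur, key)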

At-Least-Once Queues and How to Tame Them

SQS, RabbitMQ, and Kafka with consumer groups all guarantee at-least-once delivery. This is correct behavior, not a bug: your consumer must be idempotent, not your queue.

The practical checklist, with a consumer sketch after the list:

  • set visibility timeout (SQS) or ack timeout (RabbitMQ) to 2x your p99 job duration, not 30 seconds
  • delete or ack the message only after your persistence write completes and is confirmed
  • on crash recovery, re-check the idempotency key before re-executing any side effects
  • never ack before you write; never write before you claim
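A sketch of that ordering against SQS with boto3. The queue URL is a placeholder, and idempotency_key, claim, cur, and conn are the helpers sketched earlier; fetch_and_store stands in for your pipeline:

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/scrape-jobs"  # placeholder

def consume_once():
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        url = msg["Body"]
        if claim(cur, idempotency_key(url)):  # 1. claim the slot
            conn.commit()
            fetch_and_store(url)              # 2. write, confirmed before the ack
        # 3. ack last; duplicates were skipped above, so deleting is safe either way
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])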

Multi-step scraping workflows (fetch, parse, enrich, store) compound the risk because each step is a potential duplicate point. The saga pattern gives you a compensating-transaction model that handles partial failures without replaying completed steps.
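One way to keep completed steps from replaying is a per-step sub-key, so crash recovery skips anything already claimed. A sketch reusing the helpers above; run_step is a hypothetical dispatcher, and this covers only the skip-completed-work half of a saga (compensating actions for rollback are the other half):

STEPS = ["fetch", "parse", "enrich", "store"]

def run_pipeline(cur, conn, url: str):
    base = idempotency_key(url)
    for step in STEPS:
        if claim(cur, f"{base}:{step}"):  # each step gets its own claim
            conn.commit()
            run_step(step, url)           # hypothetical per-step dispatcher
        # already-claimed steps are skipped on redelivery or restart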

Deduplication at the Persistence Layer

Even with upstream idempotency keys, your database should enforce uniqueness as a backstop. A composite unique index on (url_hash, crawl_date) turns a race between two workers into a constraint violation instead of a silent duplicate.

-- fetched_at must be timestamp without time zone here: date_trunc over
-- timestamptz is not IMMUTABLE, so Postgres rejects it in an index expression
CREATE UNIQUE INDEX idx_pages_dedup
ON scraped_pages (url_hash, (date_trunc('hour', fetched_at)));

Handle the violation explicitly in your application code:

from psycopg2 import errors

# INSERT_SQL, row, cur, and conn come from your pipeline: an INSERT into
# scraped_pages and an open psycopg2 cursor/connection
try:
    cur.execute(INSERT_SQL, row)
    conn.commit()
except errors.UniqueViolation:
    conn.rollback()
    # legitimate duplicate: increment a duplicate-rate metric, log, and skip

Swallowing UniqueViolation here is correct; re-raising it is not. The distinction matters for alerting: you want to track the duplicate rate, not treat each duplicate as an incident.

For distributed crawlers sharded across proxy pools or geographies, idempotency keys need to be globally unique, not just unique within one worker's namespace. A shared Redis cluster or a central Postgres sequence solves this. Scraper sharding by IP and geo pool introduces the coordination overhead that makes cross-shard dedup non-trivial.
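One failure mode worth calling out: the key derivation itself must stay worker-agnostic. Folding a shard ID or pool name into the hash input silently re-namespaces the key per shard (illustrative):

# WRONG: the same URL claimed from two shards yields two different keys
raw = f"{shard_id}|{url}|{bucket}"

# RIGHT: derive only from job inputs, claim against the shared store
raw = f"{url}|{bucket}"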

Testing Idempotency Before It Fails in Production

Idempotency is not testable by reading code; you have to inject failures.

A minimal idempotency test harness:

  1. enqueue the same job URL three times with identical parameters
  2. kill the worker process mid-execution on the second run (SIGKILL, not SIGTERM)
  3. let the third run complete naturally
  4. assert exactly one record exists in your output store
  5. assert the idempotency key table has exactly one row for that URL/window

Run this in CI against a real database and a real queue (LocalStack for SQS, a local RabbitMQ container). Mocking the queue in unit tests tells you nothing about message redelivery behavior.
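A sketch of the hard kill in step 2, assuming a worker entrypoint worker.py that takes the URL as an argument (the entrypoint name and the delay are illustrative; tune the delay so the kill lands between fetch and ack):

import os
import signal
import subprocess
import time

def run_and_kill(url: str, kill_after_seconds: float = 0.5) -> None:
    # start the worker exactly as production would, then hard-kill it mid-flight
    proc = subprocess.Popen(["python", "worker.py", url])
    time.sleep(kill_after_seconds)
    os.kill(proc.pid, signal.SIGKILL)  # SIGKILL: no atexit handlers, no cleanup
    proc.wait()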

Also test clock skew: if two workers hash the same URL with a one-second difference that straddles a window boundary, they will generate different keys and both will proceed. Floor your timestamp to a coarse bucket (hourly is usually right) to absorb clock drift across pods.

Bottom Line

Build idempotency keys into job dispatch before you write any retry logic; retrofitting them later means auditing every side effect in your pipeline. Use Postgres unique constraints as your final backstop, Redis SET NX for speed, and test with actual process kills, not mocked exceptions. DRT covers this class of scraper infrastructure problem regularly; the state management and queue pattern guides in this series give you the implementation depth to go from principle to production.
