Scraper sharding is the single architectural call that separates a scraper that’s blocked by Tuesday from a pipeline collecting billions of records a month without a hitch. Partition your workload across multiple identity axes — IP pools, geographic exit nodes, or authenticated account pools — so no single identity ever sees enough traffic to trip a detector. In 2026, Cloudflare Bot Management, DataDome v5, and PerimeterX px3 are basically standard on any serious e-commerce or social platform. If you’re running production scraping without sharding, you’re not dealing with a performance problem. You’re dealing with a survivability problem.
## The three sharding axes
Not all sharding works the same way. Pick the axis that actually matches how the target detects abuse.
IP sharding splits requests across a proxy pool. It’s the most common approach and the cheapest to implement — works fine against rate limits that are purely IP-keyed, which is still the majority of simpler setups. The failure mode is behavioral fingerprinting: if 200 IPs all visit /product/ in the same crawl order with matching user agents, the platform clusters them even if no single IP trips a limit.
Geo sharding takes IP sharding further by routing through IPs that actually match the target’s expected audience geography. A US retail scraper coming from Singapore datacenter IPs is flagged instantly — geo scoring is cheap to compute and free to enforce. Residential and mobile proxies in the target country are the fix. The tradeoffs between proxy types look like this:
| Proxy type | Avg cost / GB | Geo accuracy | Block rate (retail sites) | Best for |
|---|---|---|---|---|
| Datacenter | $0.50–$1.50 | City-level | High (60–80%) | APIs, low-risk targets |
| Residential | $4–$8 | City/ISP-level | Medium (15–30%) | General retail, social |
| Mobile (4G/5G) | $10–$25 | Cell tower-level | Low (2–8%) | High-security targets |
| ISP (static residential) | $2–$5 | ISP-level | Low-medium (10–20%) | Account-based workflows |
Account pool sharding is required when the target is session-aware. LinkedIn, Amazon seller data, Google Maps, any platform gating content behind login — you need to shard across authenticated accounts, rotating cookies and sessions independently of IP rotation. Account pools are expensive to build, but platforms that correlate device fingerprint + session + IP together can’t be cracked any other way. The proxy selection principles covered in Mobile Proxies for Roblox: Account Management and Geo-Access — rotating mobile IPs paired with account-level session isolation — apply directly here, not just for gaming platforms.
## Shard assignment strategies
How you assign URLs to shards matters as much as pool size.
Round-robin is the naive default and creates detectable patterns. A better approach: hash-based assignment, where you hash a canonicalized URL to consistently route the same target to the same shard. This prevents the same URL from being hit by multiple identities in a short window, which looks coordinated.
```python
import hashlib

def assign_shard(url: str, pool: list[str]) -> str:
    # same URL always lands on the same proxy -- avoids multi-identity fingerprinting
    key = hashlib.md5(url.encode()).hexdigest()
    idx = int(key, 16) % len(pool)
    return pool[idx]
```

For targets you need to hit repeatedly (price monitoring, inventory checks), use time-bucketed rotation: `(url_hash + hour_bucket) % pool_size`. Each identity gets a rest window while temporal consistency holds within a session.
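A minimal sketch of that variant, assuming hour-level buckets and the same pool list as above:

```python
import hashlib
import time

def assign_shard_bucketed(url: str, pool: list[str], bucket_seconds: int = 3600) -> str:
    # same URL sticks to one proxy for the length of a bucket, then rotates on rollover
    url_hash = int(hashlib.md5(url.encode()).hexdigest(), 16)
    bucket = int(time.time()) // bucket_seconds
    return pool[(url_hash + bucket) % len(pool)]
```

The bucket width doubles as the rest window: once the hour rolls over, the previous identity stops seeing that URL until its slot comes back around.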
Don’t let shard assignment logic live in each worker. Centralize it. If you’re using a job queue — and if you’re at any real scale you should be, as covered in Distributed Scraper Architecture 2026: Master-Worker vs Pub-Sub Patterns — the dispatcher is the right place to resolve shard assignment before a task is enqueued.
## State and coordination
Sharding creates shared mutable state: proxy health scores, account cooldown timers, per-shard rate limit counters. This state has to be centralized and fast.
Redis is the standard for proxy and account pool lease management. A sorted set keyed by proxy ID with `score = last_used_timestamp` gives you O(log N) lease assignment. A TTL-based cooldown key per proxy enforces rest windows after a 429 or soft block. For persistent shard assignment and block history, you want a relational store. Scraper State Management: Redis, DynamoDB, or Postgres in 2026 lays out the tradeoffs in detail — the short answer is Redis for hot lease state, Postgres for durable mapping and audit trail.
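A minimal sketch of that lease-and-cooldown pattern with redis-py; the key names (`proxy:leases`, `cooldown:<id>`) and the cooldown length are assumptions, not a prescribed layout:

```python
import time
import redis

r = redis.Redis()  # assumes a reachable Redis instance

def lease_proxy() -> str | None:
    # ZRANGE walks the sorted set from lowest score, i.e. least recently used
    for raw in r.zrange("proxy:leases", 0, 10):
        proxy_id = raw.decode()
        if not r.exists(f"cooldown:{proxy_id}"):             # TTL key gates the rest window
            r.zadd("proxy:leases", {proxy_id: time.time()})  # bump last_used score
            return proxy_id
    return None  # everything is leased out or cooling down

def start_cooldown(proxy_id: str, seconds: int = 300) -> None:
    # called after a 429 or soft block; the key expires on its own
    r.set(f"cooldown:{proxy_id}", 1, ex=seconds)
```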
Per-shard state records worth tracking:
- last assigned timestamp
- cumulative requests in rolling 60-minute window
- block count in last 24 hours
- cooldown status (active / cooling / available)
- geographic assignment for geo shards
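One way to shape that record, sketched as a dataclass with illustrative field names:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ShardState:
    shard_id: str
    last_assigned: datetime | None = None
    requests_60m: int = 0                # rolling 60-minute request count
    blocks_24h: int = 0                  # block events in the last 24 hours
    cooldown_status: str = "available"   # active / cooling / available
    geo: str | None = None               # geographic assignment, geo shards only
```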
## Retry logic and deduplication
Sharded scrapers fail in a specific way: a task fails on shard A, gets retried, and if retry assignment isn’t shard-aware it lands on shard A again. Or worse, it runs twice on different shards and you get duplicate records. Both are bad.
Retry logic in a sharded system must do two things: reassign the failed task to a different shard, and be idempotent at the data layer. Scraper Idempotency: Why Your Retries Are Creating Duplicates (2026) covers the deduplication mechanics — the architectural implication for sharding is that your task schema needs a `shard_id` field and a `retry_count` field, and retry routing must explicitly exclude the shard that produced the failure.
A workable retry policy for a sharded scraper:
- Attempt 1 — assigned shard via hash-based routing
- Attempt 2 — rotate to next available shard, excluding attempt 1’s shard
- Attempt 3 — escalate proxy tier (datacenter → residential → mobile)
- Attempt 4 — flag for manual review or dead-letter queue
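A sketch of that policy as routing logic, assuming the task dict carries `url`, `shard_id`, `retry_count`, and `tier` fields (the helper and field names are illustrative):

```python
import hashlib

TIER_ORDER = ["datacenter", "residential", "mobile"]  # escalation path from the policy above

def route_retry(task: dict, pools: dict[str, list[str]]) -> dict | None:
    # returns an updated task routed to a fresh shard, or None to dead-letter it
    attempt = task["retry_count"] + 1
    if attempt >= 4:
        return None  # attempt 4: flag for manual review / dead-letter queue

    tier = task.get("tier", TIER_ORDER[0])
    if attempt == 3 and tier != TIER_ORDER[-1]:
        tier = TIER_ORDER[TIER_ORDER.index(tier) + 1]  # attempt 3: escalate proxy tier

    # explicitly exclude the shard that produced the failure, then re-hash
    candidates = [p for p in pools[tier] if p != task["shard_id"]]
    if not candidates:
        return None
    key = int(hashlib.md5(task["url"].encode()).hexdigest(), 16)
    return {**task, "retry_count": attempt, "tier": tier,
            "shard_id": candidates[key % len(candidates)]}
```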
## Scaling pools dynamically
Static pools work at small scale. Above 50 concurrent workers and multi-million URL queues, you need dynamic pool management: adding proxies when block rates climb, retiring degraded accounts, rebalancing geo coverage without restarting workers.
Event-driven architectures fit this well. When a worker emits a `proxy.blocked` event, a pool manager consumes it and marks the proxy unavailable — without the worker needing to own any pool logic. Event-Driven Scraping with Kafka Connect in 2026 covers the Kafka-side implementation for exactly this pattern.
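On the consumer side, a minimal sketch with kafka-python, reusing the Redis keys from earlier; the topic name and event payload shape are assumptions:

```python
import json
import redis
from kafka import KafkaConsumer  # kafka-python

r = redis.Redis()
consumer = KafkaConsumer(
    "proxy.blocked",                     # topic name is an assumption
    bootstrap_servers="localhost:9092",
    value_deserializer=json.loads,
)

for event in consumer:
    proxy_id = event.value["proxy_id"]   # payload shape is an assumption
    # mark the proxy unavailable; workers never own any pool logic themselves
    r.set(f"cooldown:{proxy_id}", 1, ex=900)
    r.zrem("proxy:leases", proxy_id)     # pull it out of the active lease set
```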
A few operational rules that actually hold up at scale:
- Alert when block rate on any shard exceeds 5% in a 15-minute window
- Auto-rotate any proxy that hits 3 consecutive 429s
- Keep 20–30% of pool capacity idle as burst reserve
- For account pools, enforce a minimum 45-minute cooldown after any soft-block signal
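Those thresholds are easy to encode. A sketch of the health check, with counter inputs assumed to come from the per-shard state tracked above:

```python
BLOCK_RATE_ALERT = 0.05       # 5% block rate in a 15-minute window
CONSECUTIVE_429_LIMIT = 3     # auto-rotate after three straight 429s
ACCOUNT_COOLDOWN_S = 45 * 60  # minimum rest after any soft-block signal

def shard_action(requests_15m: int, blocks_15m: int, consecutive_429s: int) -> str:
    # maps the operational rules above onto a pool-manager action
    if consecutive_429s >= CONSECUTIVE_429_LIMIT:
        return "rotate"
    if requests_15m and blocks_15m / requests_15m > BLOCK_RATE_ALERT:
        return "alert"
    return "ok"
```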
## Bottom line
Pick your sharding axis based on what the target detects: IP for rate-limited APIs, geo-accurate residential or mobile for retail and social, account pools for session-gated platforms. Centralize shard state in Redis for hot data and Postgres for durable records, and make retry logic shard-aware from the start or you’ll be fighting duplicates and mystery blocks at scale. DRT covers these architecture patterns continuously — if you’re building a production scraping system in 2026, the infra decisions you lock in early will define your ceiling for the next two years.
## Related guides on dataresearchtools.com
- Scraper Idempotency: Why Your Retries Are Creating Duplicates (2026)
- Scraper State Management: Redis, DynamoDB, or Postgres in 2026
- Event-Driven Scraping with Kafka Connect in 2026
- Distributed Scraper Architecture 2026: Master-Worker vs Pub-Sub Patterns
- Mobile Proxies for Roblox: Account Management and Geo-Access