Multi-step scraping workflows fail in interesting ways. You fetch a product listing, enrich it with seller data, run it through an LLM classifier, and write it to your warehouse — but the classifier call times out at step 3. Now you have a half-written record, a paid API call you can’t refund, and a retry that will duplicate steps 1 and 2. The saga pattern solves this class of problem by treating each step as a compensatable transaction, giving distributed scraping pipelines the same failure-recovery guarantees that microservices teams have relied on for years.
## What the Saga Pattern Actually Is
A saga is a sequence of local transactions where each step publishes an event or calls a service, and every step has a corresponding compensating action that undoes its effect if something downstream fails. There are two coordination styles:
- Choreography: each service listens for events and decides what to do next. No central coordinator, low coupling, harder to trace.
- Orchestration: a central saga orchestrator drives the sequence, calling each step in order and issuing compensations on failure. Easier to debug, single point of control.
For scraping pipelines, orchestration wins. You want one place to see “step 2 of 7 failed on job abc-123” — not reconstruct it from six different service logs. Tools like Temporal, Conductor, and AWS Step Functions implement orchestration natively and have become standard infrastructure for high-volume data collection in 2026.
| Coordinator | Language support | Durable execution | Hosted option | Cost model |
|---|---|---|---|---|
| Temporal | Python, Go, Java, TS | yes (event sourcing) | Temporal Cloud | per action |
| AWS Step Functions | JSON state machine | yes (managed) | always | per state transition |
| Prefect | Python-first | partial (flow runs) | Prefect Cloud | per run |
| Conductor (Netflix OSS) | polyglot via REST | yes | self-host only | free |
Temporal is the pragmatic default for new builds in 2026: the Python SDK is mature, local dev via `temporal server start-dev` is frictionless, and durable execution means your workflow survives worker crashes without custom checkpoint logic.
## Mapping Saga Steps to a Scraping Workflow
Take a realistic pipeline: scrape a product page, resolve the canonical URL, fetch pricing history, call an enrichment API, write to Postgres. Each step has a natural compensating action:
| Step | Action | Compensation |
|---|---|---|
| 1. Fetch page | HTTP GET + cache raw HTML | delete cache entry |
| 2. Parse + normalize | extract fields | no-op (stateless) |
| 3. Enrich via API | POST to enrichment service | call /cancel or mark unused |
| 4. Write to warehouse | INSERT record | DELETE by job_id |
| 5. Notify downstream | publish event to queue | publish tombstone event |
Step 2 is stateless, so no compensation is needed. Steps 3 and 4 are where most pipelines bleed money on failures. If your enrichment API charges per call, a compensation that marks the result unused at least prevents you from billing the same enrichment twice on retry.
This maps cleanly onto scraper state management backends: saga state (which step is active, what compensation has run) lives in Postgres or DynamoDB, not in memory — because worker crashes are inevitable.
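As a sketch of what that persisted state can track, here is a minimal in-memory model of a saga-state record — in production this row would live in Postgres or DynamoDB keyed by `job_id`. The class name, step list, and helper methods are illustrative, not from any particular framework:

```python
from dataclasses import dataclass, field

# Hypothetical step names for the five-stage pipeline above.
STEPS = ["fetch", "parse", "enrich", "write", "notify"]


@dataclass
class SagaState:
    job_id: str
    completed: list = field(default_factory=list)
    compensated: list = field(default_factory=list)

    def record_step(self, step: str) -> None:
        # Persist after every step so a crashed worker can resume or compensate.
        assert step in STEPS, f"unknown step: {step}"
        self.completed.append(step)

    def pending_compensations(self) -> list:
        # Compensate completed steps in reverse order, skipping stateless ones.
        stateless = {"parse"}
        return [s for s in reversed(self.completed)
                if s not in stateless and s not in self.compensated]
```

A failure after step 3 (`enrich`) would leave `pending_compensations()` returning the enrich and fetch compensations, in reverse order, with the stateless parse step skipped.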
## Implementing a Simple Saga in Temporal (Python)
```python
from temporalio import workflow, activity
from datetime import timedelta


@activity.defn
async def fetch_page(url: str) -> dict:
    # returns {"html": "...", "cache_key": "..."}
    ...


@activity.defn
async def enrich_product(data: dict) -> dict:
    # calls external enrichment API
    ...


@activity.defn
async def write_to_warehouse(record: dict) -> str:
    # returns row_id
    ...


@activity.defn
async def compensate_warehouse(row_id: str) -> None:
    # DELETE FROM products WHERE id = row_id
    ...


@workflow.defn
class ProductScrapeSaga:
    @workflow.run
    async def run(self, url: str) -> dict:
        row_id = None
        try:
            page = await workflow.execute_activity(
                fetch_page, url,
                schedule_to_close_timeout=timedelta(seconds=30))
            enriched = await workflow.execute_activity(
                enrich_product, page,
                schedule_to_close_timeout=timedelta(seconds=60))
            row_id = await workflow.execute_activity(
                write_to_warehouse, enriched,
                schedule_to_close_timeout=timedelta(seconds=10))
            return {"status": "ok", "row_id": row_id}
        except Exception:
            if row_id:
                # Temporal requires a timeout on every activity call,
                # compensations included.
                await workflow.execute_activity(
                    compensate_warehouse, row_id,
                    schedule_to_close_timeout=timedelta(seconds=10))
            raise
```
The key here: `schedule_to_close_timeout` is enforced by Temporal, not your application code. If the enrichment API hangs for 90 seconds on a 60-second budget, Temporal cancels the activity and triggers your `except` block automatically — no `asyncio.wait_for` boilerplate scattered across the codebase.
Activities should be idempotent by design: use a `job_id` or `url_hash` as an idempotency key on every INSERT so retries after partial failures never create duplicate records.
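A minimal sketch of that idempotency key, assuming a Postgres table with a unique index on an `idempotency_key` column (the table and column names here are hypothetical):

```python
import hashlib


def idempotency_key(job_id: str, url: str) -> str:
    # Deterministic: the same job retried after a partial failure always
    # produces the same key, so the unique index rejects the duplicate.
    return hashlib.sha256(f"{job_id}:{url}".encode()).hexdigest()


# ON CONFLICT DO NOTHING turns a retried INSERT into a no-op instead of
# a duplicate row; RETURNING id yields no row when the insert was skipped.
UPSERT_SQL = """
INSERT INTO products (idempotency_key, payload)
VALUES (%s, %s)
ON CONFLICT (idempotency_key) DO NOTHING
RETURNING id;
"""
```

The same approach works in the Temporal activity: compute the key once in the workflow and pass it into `write_to_warehouse`, so every retry of that activity carries the identical key.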
## Queue and State Integration
Sagas don’t replace your message queue — they sit above it. The saga orchestrator decides sequencing; your queue handles concurrency. If you have a URL backlog feeding into this saga, the right pattern is:
- Consumer pulls a URL from the queue (SQS, Redis Streams, Kafka — covered in detail in scraper queue patterns)
- Consumer starts a new Temporal workflow execution per URL
- Temporal workers execute activities concurrently across N machines
- Saga state is persisted in Temporal’s own event history, not your queue
This separation matters for distributed scraper architecture: your pub-sub layer handles fan-out and backpressure, while the saga layer handles step sequencing and failure recovery. Conflating them makes both harder to reason about.
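The consumer side of that split might look like the following sketch. `pull_urls` is a hypothetical stand-in for your SQS/Kafka consumer, and the deterministic workflow id means a duplicate queue delivery maps to an already-running workflow, which Temporal rejects rather than starting twice:

```python
import hashlib


def workflow_id_for(url: str) -> str:
    # Deterministic id: the same URL delivered twice maps to the same
    # workflow execution, so Temporal dedupes the duplicate start.
    return "scrape-" + hashlib.sha256(url.encode()).hexdigest()[:16]


async def consume(pull_urls):
    # temporalio imported lazily so the helper above stays dependency-free.
    from temporalio.client import Client

    client = await Client.connect("localhost:7233")
    async for url in pull_urls():
        await client.start_workflow(
            "ProductScrapeSaga",       # workflow type registered on the worker
            url,
            id=workflow_id_for(url),
            task_queue="scrape-saga",  # hypothetical task queue name
        )
```

The queue's own redelivery semantics stay untouched; dedup happens at the workflow-id layer, which is where the saga state actually lives.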
For saga state storage outside Temporal (if you’re using a lighter orchestrator like Prefect or a custom state machine), Postgres with `SELECT FOR UPDATE SKIP LOCKED` is reliable and cheap. DynamoDB works if you’re AWS-native and need sub-millisecond lock latency at scale.
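As a sketch, the claim query for that pattern might look like this. The `saga_jobs` table is hypothetical; each worker atomically claims one pending row, and `SKIP LOCKED` makes concurrent workers skip past rows another worker has already locked instead of blocking on them:

```python
# One worker wins each row; competing workers skip it rather than wait.
CLAIM_SQL = """
UPDATE saga_jobs
   SET status = 'running', locked_at = now()
 WHERE id = (
     SELECT id
       FROM saga_jobs
      WHERE status = 'pending'
      ORDER BY id
      LIMIT 1
        FOR UPDATE SKIP LOCKED
 )
RETURNING id, url;
"""


def claim_next_job(conn):
    # conn is a psycopg connection; returns (id, url) or None when the
    # backlog is empty. The commit releases the row lock.
    with conn.cursor() as cur:
        cur.execute(CLAIM_SQL)
        row = cur.fetchone()
    conn.commit()
    return row
```

The subquery-plus-UPDATE shape matters: locking happens in the inner SELECT, so two workers polling simultaneously can never claim the same job.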
## When the Saga Pattern Is Wrong
Sagas add operational complexity. For simple two-step scrape-and-store pipelines, the overhead is not worth it. Use sagas when:
- You have 3 or more steps with non-trivial failure modes
- At least one step calls an external API with real cost or side effects
- You need auditability: which step failed, what data was in flight
- You’re building on top of a pattern-based extraction system (like those described in AutoScraper Tutorial 2026) and routing extracted fields through multiple downstream enrichment stages
Don’t use sagas when:
- All steps are idempotent reads with no external side effects
- You control the full pipeline and can replay from source without cost
- Your team has no prior exposure to durable execution frameworks — the learning curve is real, so budget a sprint for it
## Bottom Line
If your scraping pipeline touches paid APIs, writes to multiple datastores, or runs steps that take more than a few seconds each, the saga pattern will save you money and debugging time. Start with Temporal and orchestration mode, keep compensations simple, and treat idempotency as a first-class requirement from day one. DRT covers this layer of the data-collection stack in depth — architecture decisions made here compound across every scraping job you run.
## Related guides on dataresearchtools.com
- Distributed Scraper Architecture 2026: Master-Worker vs Pub-Sub Patterns
- Scraper Queue Patterns 2026: SQS vs Redis vs RabbitMQ vs Kafka
- Scraper Idempotency: Why Your Retries Are Creating Duplicates (2026)
- Scraper State Management: Redis, DynamoDB, or Postgres in 2026
- Pillar: AutoScraper Tutorial 2026: Pattern-Based Scraping Without Selectors