Multi-step scraping workflows fail in interesting ways. You fetch a product listing, enrich it with seller data, run it through an LLM classifier, and write it to your warehouse — but the classifier call times out at step 3. Now you have a half-written record, a paid API call you can’t refund, and a retry that will duplicate steps 1 and 2. The saga pattern solves this class of problem by treating each step as a compensatable transaction, giving distributed scraping pipelines the same failure-recovery guarantees that microservices teams have relied on for years.
## What the Saga Pattern Actually Is
A saga is a sequence of local transactions where each step publishes an event or calls a service, and every step has a corresponding compensating action that undoes its effect if something downstream fails. There are two coordination styles:
- Choreography: each service listens for events and decides what to do next. No central coordinator, low coupling, harder to trace.
- Orchestration: a central saga orchestrator drives the sequence, calling each step in order and issuing compensations on failure. Easier to debug, single point of control.
For scraping pipelines, orchestration wins. You want one place to see “step 2 of 7 failed on job abc-123” — not reconstruct it from six different service logs. Tools like Temporal, Conductor, and AWS Step Functions implement orchestration natively and have become standard infrastructure for high-volume data collection in 2026.
| Coordinator | Language support | Durable execution | Hosted option | Cost model |
|---|---|---|---|---|
| Temporal | Python, Go, Java, TS | yes (event sourcing) | Temporal Cloud | per action |
| AWS Step Functions | JSON state machine | yes (managed) | always | per state transition |
| Prefect | Python-first | partial (flow runs) | Prefect Cloud | per run |
| Conductor (Netflix OSS) | polyglot via REST | yes | self-host only | free |
Temporal is the pragmatic default for new builds in 2026: the Python SDK is mature, local dev via `temporal server start-dev` is frictionless, and durable execution means your workflow survives worker crashes without custom checkpoint logic.
## Mapping Saga Steps to a Scraping Workflow
Take a realistic pipeline: scrape a product page, resolve the canonical URL, fetch pricing history, call an enrichment API, write to Postgres. Each step has a natural compensating action:
| Step | Action | Compensation |
|---|---|---|
| 1. Fetch page | HTTP GET + cache raw HTML | delete cache entry |
| 2. Parse + normalize | extract fields | no-op (stateless) |
| 3. Enrich via API | POST to enrichment service | call /cancel or mark unused |
| 4. Write to warehouse | INSERT record | DELETE by job_id |
| 5. Notify downstream | publish event to queue | publish tombstone event |
Step 2 is stateless, so no compensation is needed. Steps 3 and 4 are where most pipelines bleed money on failures. If your enrichment API charges per call, a compensation that marks the result unused at least prevents you from billing the same enrichment twice on retry.
This maps cleanly onto scraper state management backends: saga state (which step is active, what compensation has run) lives in Postgres or DynamoDB, not in memory — because worker crashes are inevitable.
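As a sketch of what that persisted state can track, here is a minimal in-memory model of a saga-state record — in production this row would live in Postgres or DynamoDB keyed by `job_id`. The class name, step list, and helper methods are illustrative, not from any particular framework:

```python
from dataclasses import dataclass, field

# Hypothetical step names for the five-stage pipeline above.
STEPS = ["fetch", "parse", "enrich", "write", "notify"]


@dataclass
class SagaState:
    job_id: str
    completed: list = field(default_factory=list)
    compensated: list = field(default_factory=list)

    def record_step(self, step: str) -> None:
        # Persist after every step so a crashed worker can resume or compensate.
        assert step in STEPS, f"unknown step: {step}"
        self.completed.append(step)

    def pending_compensations(self) -> list:
        # Compensate completed steps in reverse order, skipping stateless ones.
        stateless = {"parse"}
        return [s for s in reversed(self.completed)
                if s not in stateless and s not in self.compensated]
```

A failure after step 3 (`enrich`) would leave `pending_compensations()` returning the enrich and fetch compensations, in reverse order, with the stateless parse step skipped.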
## Implementing a Simple Saga in Temporal (Python)
```python
from temporalio import workflow, activity
from datetime import timedelta


@activity.defn
async def fetch_page(url: str) -> dict:
    # returns {"html": "...", "cache_key": "..."}
    ...


@activity.defn
async def enrich_product(data: dict) -> dict:
    # calls external enrichment API
    ...


@activity.defn
async def write_to_warehouse(record: dict) -> str:
    # returns row_id
    ...


@activity.defn
async def compensate_warehouse(row_id: str) -> None:
    # DELETE FROM products WHERE id = row_id
    ...


@workflow.defn
class ProductScrapeSaga:
    @workflow.run
    async def run(self, url: str) -> dict:
        row_id = None
        try:
            page = await workflow.execute_activity(
                fetch_page, url,
                schedule_to_close_timeout=timedelta(seconds=30))
            enriched = await workflow.execute_activity(
                enrich_product, page,
                schedule_to_close_timeout=timedelta(seconds=60))
            row_id = await workflow.execute_activity(
                write_to_warehouse, enriched,
                schedule_to_close_timeout=timedelta(seconds=10))
            return {"status": "ok", "row_id": row_id}
        except Exception:
            if row_id:
                # Temporal requires a timeout on every activity call,
                # compensations included.
                await workflow.execute_activity(
                    compensate_warehouse, row_id,
                    schedule_to_close_timeout=timedelta(seconds=10))
            raise
```
The key here: `schedule_to_close_timeout` is enforced by Temporal, not your application code. If the enrichment API hangs for 90 seconds on a 60-second budget, Temporal cancels the activity and triggers your `except` block automatically — no `asyncio.wait_for` boilerplate scattered across the codebase.
Activities should be idempotent by design: use a `job_id` or `url_hash` as an idempotency key on every INSERT so retries after partial failures never create duplicate records.
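A minimal sketch of that idempotency key, assuming a Postgres table with a unique index on an `idempotency_key` column (the table and column names here are hypothetical):

```python
import hashlib


def idempotency_key(job_id: str, url: str) -> str:
    # Deterministic: the same job retried after a partial failure always
    # produces the same key, so the unique index rejects the duplicate.
    return hashlib.sha256(f"{job_id}:{url}".encode()).hexdigest()


# ON CONFLICT DO NOTHING turns a retried INSERT into a no-op instead of
# a duplicate row; RETURNING id yields no row when the insert was skipped.
UPSERT_SQL = """
INSERT INTO products (idempotency_key, payload)
VALUES (%s, %s)
ON CONFLICT (idempotency_key) DO NOTHING
RETURNING id;
"""
```

The same approach works in the Temporal activity: compute the key once in the workflow and pass it into `write_to_warehouse`, so every retry of that activity carries the identical key.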
## Queue and State Integration
Sagas don’t replace your message queue — they sit above it. The saga orchestrator decides sequencing; your queue handles concurrency. If you have a URL backlog feeding into this saga, the right pattern is:
- Consumer pulls a URL from the queue (SQS, Redis Streams, Kafka — covered in detail in scraper queue patterns)
- Consumer starts a new Temporal workflow execution per URL
- Temporal workers execute activities concurrently across N machines
- Saga state is persisted in Temporal’s own event history, not your queue
This separation matters for distributed scraper architecture: your pub-sub layer handles fan-out and backpressure, while the saga layer handles step sequencing and failure recovery. Conflating them makes both harder to reason about.
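The consumer side of that split might look like the following sketch. `pull_urls` is a hypothetical stand-in for your SQS/Kafka consumer, and the deterministic workflow id means a duplicate queue delivery maps to an already-running workflow, which Temporal rejects rather than starting twice:

```python
import hashlib


def workflow_id_for(url: str) -> str:
    # Deterministic id: the same URL delivered twice maps to the same
    # workflow execution, so Temporal dedupes the duplicate start.
    return "scrape-" + hashlib.sha256(url.encode()).hexdigest()[:16]


async def consume(pull_urls):
    # temporalio imported lazily so the helper above stays dependency-free.
    from temporalio.client import Client

    client = await Client.connect("localhost:7233")
    async for url in pull_urls():
        await client.start_workflow(
            "ProductScrapeSaga",       # workflow type registered on the worker
            url,
            id=workflow_id_for(url),
            task_queue="scrape-saga",  # hypothetical task queue name
        )
```

The queue's own redelivery semantics stay untouched; dedup happens at the workflow-id layer, which is where the saga state actually lives.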
For saga state storage outside Temporal (if you’re using a lighter orchestrator like Prefect or a custom state machine), Postgres with `SELECT FOR UPDATE SKIP LOCKED` is reliable and cheap. DynamoDB works if you’re AWS-native and need sub-millisecond lock latency at scale.
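As a sketch, the claim query for that pattern might look like this. The `saga_jobs` table is hypothetical; each worker atomically claims one pending row, and `SKIP LOCKED` makes concurrent workers skip past rows another worker has already locked instead of blocking on them:

```python
# One worker wins each row; competing workers skip it rather than wait.
CLAIM_SQL = """
UPDATE saga_jobs
   SET status = 'running', locked_at = now()
 WHERE id = (
     SELECT id
       FROM saga_jobs
      WHERE status = 'pending'
      ORDER BY id
      LIMIT 1
        FOR UPDATE SKIP LOCKED
 )
RETURNING id, url;
"""


def claim_next_job(conn):
    # conn is a psycopg connection; returns (id, url) or None when the
    # backlog is empty. The commit releases the row lock.
    with conn.cursor() as cur:
        cur.execute(CLAIM_SQL)
        row = cur.fetchone()
    conn.commit()
    return row
```

The subquery-plus-UPDATE shape matters: locking happens in the inner SELECT, so two workers polling simultaneously can never claim the same job.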
## When the Saga Pattern Is Wrong
Sagas add operational complexity. For simple two-step scrape-and-store pipelines, the overhead is not worth it. Use sagas when:
- You have 3 or more steps with non-trivial failure modes
- At least one step calls an external API with real cost or side effects
- You need auditability: which step failed, what data was in flight
- You’re building on top of a pattern-based extraction system (like those described in AutoScraper Tutorial 2026) and routing extracted fields through multiple downstream enrichment stages
Don’t use sagas when:
- All steps are idempotent reads with no external side effects
- You control the full pipeline and can replay from source without cost
- Your team has no prior exposure to durable execution frameworks — the learning curve is real, so budget a sprint for it
## Bottom Line
If your scraping pipeline touches paid APIs, writes to multiple datastores, or runs steps that take more than a few seconds each, the saga pattern will save you money and debugging time. Start with Temporal and orchestration mode, keep compensations simple, and treat idempotency as a first-class requirement from day one. DRT covers this layer of the data-collection stack in depth — architecture decisions made here compound across every scraping job you run.
## Related guides on dataresearchtools.com
- Distributed Scraper Architecture 2026: Master-Worker vs Pub-Sub Patterns
- Scraper Queue Patterns 2026: SQS vs Redis vs RabbitMQ vs Kafka
- Scraper Idempotency: Why Your Retries Are Creating Duplicates (2026)
- Scraper State Management: Redis, DynamoDB, or Postgres in 2026
- Pillar: AutoScraper Tutorial 2026: Pattern-Based Scraping Without Selectors