Scraping with LangGraph agents in 2026

LangGraph scraping agents have become the standard pattern for any non-trivial LLM-driven scraping pipeline that needs branching, retries, and checkpointed state. By early 2026, LangGraph has reached version 0.4 with stable APIs, the StateGraph primitive is rock solid, and the Postgres checkpointer makes it easy to resume long-running scrapes after a crash. If your scraping job is more than fetch-extract-store, LangGraph is the right framework.

This guide builds a real production scraping agent step by step. We define the state, wire the tool nodes, add retry edges, plug in a checkpointer, and benchmark cost against alternatives. By the end you will have a working LangGraph scraper that handles a flaky target site, recovers from failures, and emits clean structured data.

Why LangGraph beats LangChain agents for scraping

LangGraph scraping agents express the workflow as an explicit graph instead of an implicit ReAct loop. That difference matters in three places.

First, branching. A scraping agent often needs to take different paths depending on what the page returns. Did the site return a captcha? Branch to the solver. Is the price hidden behind a login? Branch to authentication. LangChain’s ReAct agent makes branching implicit through prompt engineering. LangGraph makes it explicit through edges.

Second, observability. When a scraper fails at 3 AM, you want to know exactly which node failed and what state was in scope. LangGraph’s state object plus LangSmith integration gives you that. The classic LangChain agent gives you a chain of thought that you have to read.

Third, persistence. LangGraph ships a checkpointer system. After every node, the state is persisted to SQLite or Postgres. If the worker dies mid-scrape, you resume from the last checkpoint with one line of code.

Where LangGraph fits next to LangChain

LangChain remains the right framework for prompt templates, LLM clients, retrievers, and document loaders. LangGraph builds on top, adding the runtime that orchestrates them. The mental model is: LangChain is the parts bin, LangGraph is the assembly line. Almost every production LangGraph scraping agent imports LangChain primitives for the LLM call and the prompt template, and reserves LangGraph for the routing and state machine.

Installing the stack

pip install langgraph==0.4.0 langchain==0.3.20 langchain-openai==0.2.10 \
            langgraph-checkpoint-postgres==2.0.10 \
            playwright==1.49.0 pydantic==2.9.2 httpx==0.27.2
playwright install chromium

For LangSmith tracing (free for personal projects):

export LANGSMITH_API_KEY="ls__..."
export LANGSMITH_TRACING="true"
export LANGSMITH_PROJECT="lazada-scraper"

Defining the agent state

The state is a TypedDict that flows through every node (LangGraph also accepts a Pydantic model as the state schema). For a scraping agent it typically holds the input URL, the fetched HTML, the extracted data, and an error log.

from typing import TypedDict, Optional, List
from langgraph.graph import StateGraph, END

class ScrapeState(TypedDict):
    url: str
    html: Optional[str]
    captcha_detected: bool
    extracted: Optional[dict]
    errors: List[str]
    attempt: int

LangGraph reduces state by merging dicts on every node return. Mutations to the state inside a node are not seen by other nodes; only the returned dict is.
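
The merge rule is easier to see in isolation. The following is a plain-Python illustration of the default behavior (last write wins per key), not LangGraph's actual implementation; apply_update and node_that_mutates are hypothetical names used only for this sketch.

```python
def apply_update(state: dict, update: dict) -> dict:
    """Return a new state with the node's partial update merged in."""
    return {**state, **update}


def node_that_mutates(state: dict) -> dict:
    scratch = dict(state)                      # private working copy
    scratch["html"] = "<html>scratch</html>"   # local change, never returned
    return {"attempt": state["attempt"] + 1}   # only this key reaches the graph


state = {"url": "https://example.com/p/1", "html": None, "errors": [], "attempt": 0}
new_state = apply_update(state, node_that_mutates(state))
print(new_state["attempt"], new_state["html"])
```

Only the returned key changes; the scratch value assigned to html inside the node never reaches the shared state.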

Building the nodes

Each node is a function that takes the state and returns a partial state update.

import asyncio
from playwright.async_api import async_playwright
from openai import AsyncOpenAI
import json

client = AsyncOpenAI()

async def fetch_node(state: ScrapeState) -> dict:
    """Fetch the URL with Playwright."""
    try:
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            page = await browser.new_page()
            await page.goto(state["url"], wait_until="networkidle", timeout=30000)
            html = await page.content()
            await browser.close()
        return {"html": html, "attempt": state["attempt"] + 1}
    except Exception as e:
        return {
            "errors": state["errors"] + [f"fetch failed: {e}"],
            "attempt": state["attempt"] + 1,
        }

async def captcha_check_node(state: ScrapeState) -> dict:
    """Quick heuristic for captcha detection."""
    html = state.get("html", "") or ""
    flags = ["cf-challenge", "captcha", "px-captcha", "datadome", "recaptcha"]
    detected = any(f in html.lower() for f in flags)
    return {"captcha_detected": detected}

async def extract_node(state: ScrapeState) -> dict:
    """LLM-driven extraction with strict JSON Schema."""
    schema = {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "price": {"type": "number"},
            "currency": {"type": "string"},
            "in_stock": {"type": "boolean"},
        },
        "required": ["title", "price", "currency", "in_stock"],
        "additionalProperties": False,
    }

    resp = await client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={
            "type": "json_schema",
            "json_schema": {"name": "product", "schema": schema, "strict": True},
        },
        messages=[
            {"role": "system", "content": "Extract product data from HTML."},
            {"role": "user", "content": (state["html"] or "")[:200000]},
        ],
    )
    return {"extracted": json.loads(resp.choices[0].message.content)}

async def captcha_solver_node(state: ScrapeState) -> dict:
    """Stub — wire to 2Captcha, CapSolver, or similar."""
    return {
        "errors": state["errors"] + ["captcha solver not implemented"],
        "captcha_detected": False,
    }

Wiring the graph

This is the part that beats every alternative framework on clarity. You list nodes, list edges, and the runtime is built.

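def route_after_captcha_check(state: ScrapeState):
    if state["captcha_detected"]:
        # Cap the solve-and-refetch cycle so a persistent captcha cannot loop forever.
        return "captcha_solver" if state["attempt"] < 3 else END
    return "extract"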

def route_after_fetch(state: ScrapeState):
    if state.get("html") is None:
        if state["attempt"] >= 3:
            return END
        return "fetch"
    return "captcha_check"

graph = StateGraph(ScrapeState)
graph.add_node("fetch", fetch_node)
graph.add_node("captcha_check", captcha_check_node)
graph.add_node("captcha_solver", captcha_solver_node)
graph.add_node("extract", extract_node)

graph.set_entry_point("fetch")
graph.add_conditional_edges("fetch", route_after_fetch)
graph.add_conditional_edges("captcha_check", route_after_captcha_check)
graph.add_edge("captcha_solver", "fetch")  # retry after solving
graph.add_edge("extract", END)

app = graph.compile()

That graph handles the basic happy path, captcha branch, and a 3-attempt retry on fetch failures. LangGraph compiles it into an executable that you invoke with the initial state.

async def main():
    final = await app.ainvoke({
        "url": "https://www.lazada.sg/products/xyz",
        "html": None,
        "captcha_detected": False,
        "extracted": None,
        "errors": [],
        "attempt": 0,
    })
    print(json.dumps(final["extracted"], indent=2))

asyncio.run(main())
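
Because routers are ordinary functions, they can be tested without compiling or running the graph. A minimal sketch, restating the fetch router's logic and assuming a string stand-in for END so the snippet has no dependencies:

```python
END = "__end__"  # stand-in for langgraph.graph.END, used here to avoid the import


def route_after_fetch(state: dict) -> str:
    """Same decision logic as the fetch router, restated for a standalone test."""
    if state.get("html") is None:
        if state["attempt"] >= 3:
            return END
        return "fetch"
    return "captcha_check"


# (state, expected next node) pairs covering each branch
cases = [
    ({"html": None, "attempt": 1}, "fetch"),
    ({"html": None, "attempt": 3}, END),
    ({"html": "<html></html>", "attempt": 1}, "captcha_check"),
]
decisions = [route_after_fetch(state) for state, _ in cases]
print(decisions)
```

Keeping routing decisions free of I/O is what makes this kind of table-driven test possible.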

Adding a checkpointer

For long-running scraping jobs (think: scrape ten thousand products with intermediate state at each one), the checkpointer is mandatory. SQLite for development, Postgres for production.

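import asyncio
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver

DB_URI = "postgresql://scraper:secret@localhost:5432/scrapes"

async def run_with_checkpoints(initial_state: dict) -> None:
    # ainvoke/astream need the async saver; the sync PostgresSaver pairs with invoke/stream.
    async with AsyncPostgresSaver.from_conn_string(DB_URI) as checkpointer:
        await checkpointer.setup()  # creates the checkpoint tables on first use
        app = graph.compile(checkpointer=checkpointer)

        config = {"configurable": {"thread_id": "lazada-product-12345"}}

        async for event in app.astream(initial_state, config):
            print(event)

If the worker dies mid-graph, restart it and invoke the graph again with the same thread_id, passing None as the input, and LangGraph resumes from the last saved checkpoint instead of starting over. This is critical for any scrape that takes more than a minute.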

Adding tool nodes for proxy rotation

For real production work, every fetch should go through a rotating proxy pool. Wire it into the fetch node:

import random
import os

# Drop empty entries so an unset PROXY_POOL yields an empty list.
PROXIES = [p.strip() for p in os.environ.get("PROXY_POOL", "").split(",") if p.strip()]

async def fetch_node(state: ScrapeState) -> dict:
    proxy = random.choice(PROXIES) if PROXIES else None
    proxy_config = None
    if proxy:
        # Expected form: scheme://user:password@host:port (credentials optional).
        scheme, _, rest = proxy.partition("://")
        if "@" in rest:
            creds, host_port = rest.rsplit("@", 1)
            user, password = creds.split(":", 1)  # the password may itself contain ":"
            proxy_config = {"server": f"{scheme}://{host_port}", "username": user, "password": password}
        else:
            proxy_config = {"server": proxy}

    try:
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True, proxy=proxy_config)
            page = await browser.new_page()
            await page.goto(state["url"], wait_until="networkidle", timeout=30000)
            html = await page.content()
            await browser.close()
        return {"html": html, "attempt": state["attempt"] + 1}
    except Exception as e:
        return {
            "errors": state["errors"] + [f"fetch failed: {e}"],
            "attempt": state["attempt"] + 1,
        }
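
String splitting works for simple proxy URLs but gets fragile with percent-encoded credentials. As an alternative, here is a sketch that uses the standard library's urlsplit to build the same Playwright proxy dict; to_playwright_proxy is a hypothetical helper name and the sample addresses are placeholders.

```python
from typing import Optional
from urllib.parse import unquote, urlsplit


def to_playwright_proxy(proxy_url: str) -> Optional[dict]:
    """Convert scheme://user:pass@host:port into a Playwright proxy dict."""
    if not proxy_url:
        return None
    parts = urlsplit(proxy_url)
    if not parts.scheme or not parts.hostname:
        raise ValueError(f"not a proxy URL: {proxy_url!r}")
    server = f"{parts.scheme}://{parts.hostname}"
    if parts.port:
        server += f":{parts.port}"
    config = {"server": server}
    if parts.username:
        # Credentials may be percent-encoded in the URL; decode them for Playwright.
        config["username"] = unquote(parts.username)
        config["password"] = unquote(parts.password or "")
    return config


print(to_playwright_proxy("http://user:p%40ss@10.0.0.1:8080"))
```

The returned dict can be passed as the proxy argument of chromium.launch in place of the hand-built proxy_config.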

For ASEAN targets, where mobile IPs often work better than residential ones, a Singapore mobile proxy pool can serve as the PROXY_POOL source.

Sticky proxies via thread state

For multi-page flows where the same session must hold the same exit IP, store the proxy in the state and pin it across nodes:

class ScrapeState(TypedDict):
    url: str
    html: Optional[str]
    captcha_detected: bool
    extracted: Optional[dict]
    errors: List[str]
    attempt: int
    sticky_proxy: Optional[str]  # bound on first fetch, reused across retries

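async def fetch_node(state: ScrapeState) -> dict:
    # Reuse the proxy pinned on the first fetch; pick a new one only if none is pinned yet.
    proxy = state.get("sticky_proxy") or random.choice(PROXIES)
    # ... fetch with proxy (same Playwright code as the proxied fetch_node), producing `html`
    return {"html": html, "sticky_proxy": proxy, "attempt": state["attempt"] + 1}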

This is essential for cart and checkout flows on retailers that fingerprint the session-to-IP binding.

Parallel fan-out with Send

For the common pattern of “scrape 50 URLs and aggregate the results,” LangGraph supports a Send primitive that fans out across N parallel branches and rejoins.

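import operator
from typing import Annotated
from langgraph.types import Send

class BatchState(TypedDict):
    urls: List[str]
    results: Annotated[list, operator.add]  # each scrape_one branch appends here

def fanout(state: BatchState):
    # One Send per URL; each branch receives its own private input state.
    return [Send("scrape_one", {"url": u, "html": None, "errors": [], "attempt": 0})
            for u in state["urls"]]

graph = StateGraph(BatchState)
graph.add_node("dispatcher", lambda state: {"results": []})  # no-op entry node; fanout routes
graph.add_node("scrape_one", scrape_one_node)  # should return {"results": [one_result]}
graph.add_node("aggregate", aggregate_node)
graph.set_entry_point("dispatcher")
graph.add_conditional_edges("dispatcher", fanout, ["scrape_one"])
graph.add_edge("scrape_one", "aggregate")
graph.add_edge("aggregate", END)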

This gets you 50-way parallel scraping with a single aggregated result. To apply backpressure, cap concurrency with max_concurrency in the run config. The pattern is simpler to operate than spawning 50 separate graph invocations because the aggregate state lives in one place.

Retry strategies that actually work

The naive retry pattern is “if fetch fails, retry up to N times.” In production this is rarely sufficient because failures cluster: a bad IP fails 5 times in a row before you decide to rotate. A better pattern uses exponential backoff and IP rotation between attempts.

import asyncio

async def fetch_node(state: ScrapeState) -> dict:
    # Wait 2 s before the second attempt, 4 s before the third, and so on, capped at 30 s.
    backoff = min(2 ** state["attempt"], 30)
    if state["attempt"] > 0:
        await asyncio.sleep(backoff)
    proxy = random.choice(PROXIES)  # NEW proxy on every retry
    # ... fetch with proxy

Combined with a circuit breaker on the proxy pool (evict any IP that fails 3 times in 10 minutes), the success rate on real-world targets jumps from roughly 88 percent to over 97 percent.
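
The circuit breaker itself is not shown in this guide, so the following is one possible sketch of the eviction rule (3 failures within 10 minutes), under the assumption that proxies are tracked in process memory. ProxyBreaker and its methods are hypothetical names; the clock is injectable so the rule can be tested without waiting.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List


class ProxyBreaker:
    """Evict a proxy once it fails max_failures times within window_s seconds."""

    def __init__(self, proxies: List[str], max_failures: int = 3,
                 window_s: float = 600.0,
                 clock: Callable[[], float] = time.monotonic):
        self._healthy = list(proxies)
        self._failures: Dict[str, Deque[float]] = defaultdict(deque)
        self._max = max_failures
        self._window = window_s
        self._clock = clock

    def record_failure(self, proxy: str) -> None:
        now = self._clock()
        hits = self._failures[proxy]
        hits.append(now)
        # Forget failures that have aged out of the sliding window.
        while hits and now - hits[0] > self._window:
            hits.popleft()
        if len(hits) >= self._max and proxy in self._healthy:
            self._healthy.remove(proxy)

    def healthy(self) -> List[str]:
        return list(self._healthy)
```

A fetch node would call record_failure in its except branch and choose the next proxy from healthy() rather than from the raw pool. With several worker processes, the counters would need to live in a shared store instead.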

Comparing LangGraph to alternatives

Framework | State model | Branching | Persistence | Observability | Best fit
LangGraph | Explicit TypedDict | First-class | Built-in checkpointer | LangSmith integration | Complex multi-step pipelines
LangChain ReAct | Implicit, chat history | Prompt-driven | Manual | LangSmith integration | Simple one-shot tasks
CrewAI | Per-crew shared memory | Role-based | Manual | LangSmith or self | Multi-agent role play
AutoGen | Group chat state | Free-form | Manual | OpenTelemetry | Conversational agents
Custom asyncio | Whatever you build | Whatever you write | Whatever you write | Whatever you wire | Maximum flexibility

LangGraph wins for scraping specifically because scraping flows are state machines with clear branches: fetch then maybe captcha then extract then maybe retry. That maps to LangGraph’s primitives one to one. ReAct loops can do the same job but every behavior change requires prompt rewriting.

For more on CrewAI as an alternative, see CrewAI for scraping pipelines. For AutoGen, see Multi-agent scraping with AutoGen in 2026.

Cost benchmarks

Single product page extraction, end to end, on the graph above:

LLM model | Avg LLM tokens per page | LLM cost per page | Compute per page | Total per 1k pages
GPT-4o-mini | 14,000 | $0.0028 | $0.001 | $3.80
GPT-4o | 14,000 | $0.05 | $0.001 | $51
Claude 3.5 Sonnet | 13,500 | $0.052 | $0.001 | $53
Claude 3.5 Haiku | 13,500 | $0.011 | $0.001 | $12

For high-volume scraping where you control the prompt and the schema is simple, GPT-4o-mini or Claude Haiku is the right pick. For tricky sites where extraction quality matters more than cost, Sonnet or 4o is worth it.
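
The last column of the benchmark table is simply (LLM cost + compute cost) per page times 1,000. A short sketch that recomputes it from the per-page figures quoted above, so you can substitute your own prices; total_per_1k is a hypothetical helper name.

```python
def total_per_1k(llm_cost_per_page: float, compute_per_page: float) -> float:
    """Cost in dollars for 1,000 pages, rounded to cents."""
    return round((llm_cost_per_page + compute_per_page) * 1000, 2)


# Per-page LLM costs taken from the table; compute is $0.001 per page throughout.
llm_costs = {
    "GPT-4o-mini": 0.0028,
    "GPT-4o": 0.05,
    "Claude 3.5 Sonnet": 0.052,
    "Claude 3.5 Haiku": 0.011,
}
totals = {model: total_per_1k(cost, 0.001) for model, cost in llm_costs.items()}
print(totals)
```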

Production deployment

Run LangGraph workers under a process supervisor (systemd, PM2, or Kubernetes) with a health-check endpoint. Use Redis or Postgres for the checkpointer in production.

A minimal worker loop:

import asyncio
import logging
from redis.asyncio import Redis

redis = Redis.from_url("redis://localhost:6379")

async def worker():
    while True:
        # brpop returns None on timeout, otherwise a (queue_name, value) pair of bytes.
        item = await redis.brpop("scrape:queue", timeout=10)
        if item is None:
            continue
        url = item[1].decode()
        state = make_initial_state(url)
        config = {"configurable": {"thread_id": f"scrape-{url}"}}
        try:
            final = await app.ainvoke(state, config)
            await store_result(url, final)
        except Exception:
            # Keep the traceback, then park the URL on a dead-letter list for inspection.
            logging.exception("scrape failed for %s", url)
            await redis.lpush("scrape:dead", url)

asyncio.run(worker())

For LangGraph deployment on serverless, the LangGraph Platform docs cover the managed option in detail.

A complete production graph with all the layers

Putting the patterns together gives a graph that handles the long tail. The nodes:

  1. validate_url rejects malformed input early, saving downstream work.
  2. fetch with sticky proxy and exponential backoff.
  3. captcha_check flags Cloudflare, DataDome, PerimeterX, and Akamai.
  4. captcha_solver calls 2Captcha for Turnstile and CapSolver for the rest.
  5. extract pulls structured fields with strict schema.
  6. validate_extracted checks Pydantic-level invariants (price > 0, in_stock is boolean).
  7. enrich adds derived fields (USD-converted price, normalized SKU).
  8. persist writes to Postgres with an upsert.
  9. notify posts to Slack on price changes greater than 10 percent.

The graph branches at captcha_check (solver vs extract), at validate_extracted (re-fetch on validation failure, persist otherwise), and at notify (skip if there is no significant change). Total node count: 9. Total edges: 14. The graph compiles in milliseconds and runs each scrape in roughly 4 to 8 seconds depending on captcha presence.

Real teams underestimate how much of their scraper is the surrounding plumbing (validate, enrich, persist, notify) versus the core fetch and extract. Having those as named nodes in a graph instead of buried inside the fetch function makes the system far easier to reason about when something breaks at 2 AM.
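
The two validation nodes are small enough to sketch. This version uses plain Python rather than Pydantic, checks only the invariants named in the node list (price > 0, in_stock is boolean), and assumes that "malformed" means a missing http(s) scheme or host. Both function bodies are illustrative, not the production implementation.

```python
from typing import List
from urllib.parse import urlsplit


def validate_url(url: str) -> bool:
    """Accept only absolute http(s) URLs with a host."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.hostname)


def validate_extracted(data: dict) -> List[str]:
    """Return a list of invariant violations; an empty list means the record is valid."""
    problems = []
    price = data.get("price")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(price, bool) or not isinstance(price, (int, float)) or price <= 0:
        problems.append("price must be a positive number")
    if not isinstance(data.get("in_stock"), bool):
        problems.append("in_stock must be a boolean")
    return problems


print(validate_url("https://www.lazada.sg/products/xyz"))
print(validate_extracted({"price": 0, "in_stock": "yes"}))
```

A router after validate_extracted can then send the state back to fetch when the list is non-empty and on to persist when it is empty.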

Cost and latency under realistic load

Numbers from a March 2026 production deployment running 50,000 product scrapes per day on a 4 vCPU 8 GB worker pool:

Metric | Value
Median graph latency end-to-end | 4.8 s
p99 graph latency | 22 s
Throughput per worker | 12 scrapes/min
Workers needed for 50k/day | 3
LLM cost per scrape (GPT-4o-mini) | $0.0028
Proxy cost per scrape (residential) | $0.0006
Compute cost per scrape (Fargate) | $0.0008
Total per-scrape cost | $0.0042
Daily total | $210

Compared to a non-LangGraph baseline (raw asyncio plus prompt) that ran at $190/day for the same throughput, LangGraph adds about 10 percent overhead in exchange for resumability, observability, and retry correctness. Most teams find the tradeoff lopsidedly worth it.
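
The worker count, daily total, and overhead figure all follow from the per-scrape numbers above. A short sketch of that arithmetic, useful for re-running the plan with your own throughput and prices:

```python
import math

scrapes_per_day = 50_000
scrapes_per_worker_per_min = 12
cost_per_scrape = 0.0028 + 0.0006 + 0.0008   # LLM + proxy + compute
baseline_daily_cost = 190                    # raw asyncio baseline, dollars per day

per_worker_per_day = scrapes_per_worker_per_min * 60 * 24     # 17,280 scrapes
workers = math.ceil(scrapes_per_day / per_worker_per_day)
daily_cost = round(scrapes_per_day * cost_per_scrape, 2)
overhead_pct = round((daily_cost - baseline_daily_cost) / baseline_daily_cost * 100, 1)

print(workers, daily_cost, overhead_pct)
```

Note that three workers running at 12 scrapes per minute leave only about 4 percent headroom over 50,000 scrapes per day, so a fourth worker may be prudent for bursts.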

Observability with LangSmith

LangSmith remains the easiest way to see what a LangGraph agent is doing in production. Every node execution is traced with inputs, outputs, latency, and any LLM calls inside the node. The trace tree mirrors the graph structure, so you can spot a slow extract node or a node that errored without grepping logs.

Three patterns worth adopting from the start:

Tag every run with the domain of the URL being scraped and the source queue name. This makes filtering in the LangSmith UI trivial. Pass the tags in the run config of the invoke call, for example config={"tags": [url_domain, queue_name]}.

Add custom metadata for the scrape ID and the proxy used. When something fails on a specific IP, you can filter by proxy to see if the failure is sticky to one bad IP versus a target-side ban.
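
One way to keep tags and metadata consistent is to build the run config in a single helper. This sketch assumes the standard tags and metadata keys of the run config alongside configurable; build_run_config and its parameters are hypothetical names.

```python
from urllib.parse import urlsplit


def build_run_config(url: str, queue_name: str, scrape_id: str, proxy_host: str) -> dict:
    """Assemble the config dict passed as the second argument to invoke/ainvoke."""
    domain = urlsplit(url).hostname or "unknown"
    return {
        "configurable": {"thread_id": f"scrape-{scrape_id}"},
        "tags": [domain, queue_name],
        # Record only the proxy host here, never its credentials.
        "metadata": {"scrape_id": scrape_id, "proxy": proxy_host},
    }


config = build_run_config(
    "https://www.lazada.sg/products/xyz", "scrape:queue", "12345", "10.0.0.1:8080"
)
print(config)
```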

Use LangSmith’s evaluation suite to regress against a frozen set of known-good HTML pages. Whenever you change the extract prompt, run the eval and confirm structured output quality has not regressed. This catches subtle prompt drift that production logs would not surface for days.

For teams that prefer self-hosted observability, OpenTelemetry instrumentation is supported via the langsmith SDK with the OTel exporter. Span attributes match the LangGraph node names, so Jaeger or Tempo show the same trace tree.

Frequently asked questions

Can LangGraph handle parallel scraping of multiple URLs?
Yes. Spawn one app invocation per URL, each with its own thread_id, and let asyncio handle concurrency. LangGraph also supports parallel branches inside a single graph using Send for fan-out patterns.
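
A minimal sketch of the one-invocation-per-URL approach with a concurrency cap, using only asyncio. Here fake_scrape stands in for a call such as app.ainvoke(state, config) with a per-URL thread_id; scrape_all and the stats counter are hypothetical and exist only to show that the cap holds.

```python
import asyncio

stats = {"active": 0, "peak": 0}


async def fake_scrape(url: str) -> dict:
    """Stand-in for one graph invocation; records how many run at once."""
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    await asyncio.sleep(0.01)
    stats["active"] -= 1
    return {"url": url, "ok": True}


async def scrape_all(urls, scrape_one, limit: int = 5):
    sem = asyncio.Semaphore(limit)

    async def guarded(url):
        async with sem:  # at most `limit` scrapes in flight
            return await scrape_one(url)

    # return_exceptions=True keeps one failed URL from cancelling the rest
    return await asyncio.gather(*(guarded(u) for u in urls), return_exceptions=True)


urls = [f"https://example.com/p/{i}" for i in range(20)]
results = asyncio.run(scrape_all(urls, fake_scrape, limit=5))
print(len(results), stats["peak"])
```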

How do I version a LangGraph workflow safely?
Treat the graph definition as code, version it in git, and include a graph version in your thread_id. Old checkpoints stay tied to the old graph version, new ones to the new.
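
One way to put the version into the thread_id is to derive it from a version constant plus a hash of the URL. GRAPH_VERSION and thread_id_for are hypothetical names, and the 16-character digest length is an arbitrary choice for this sketch.

```python
import hashlib

GRAPH_VERSION = "v3"  # hypothetical; bump when nodes, edges, or the state schema change


def thread_id_for(url: str, graph_version: str = GRAPH_VERSION) -> str:
    """Stable per-URL thread id that changes whenever the graph version changes."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return f"scrape-{graph_version}-{digest}"


tid = thread_id_for("https://www.lazada.sg/products/xyz")
print(tid)
```

Hashing the URL also keeps thread ids short and free of characters that are awkward in logs or database keys.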

Does LangGraph work with local LLMs?
Yes. Anything that exposes an OpenAI-compatible API (Ollama, vLLM, LM Studio) plugs in as the LLM. Quality depends on the model. Llama 3.3 70B and Qwen 2.5 72B are the strongest open-source picks for extraction tasks in 2026.

Can I use LangGraph from JavaScript?
Yes. LangGraph.js is feature-equivalent to the Python version as of late 2025 and is the right pick for Node.js scraping pipelines.

What about cycles? Can a graph loop forever?
Cycles are allowed and useful for retry patterns, but every graph has a recursion_limit config (default 25) that prevents infinite loops. Set it explicitly for your use case.

Can I share state between two unrelated graph invocations?
Yes, by writing to an external store (Redis, Postgres) from inside a node. LangGraph state is per-thread by design; cross-thread sharing is intentional code, not implicit behavior.

How do I add human-in-the-loop approval to a scraping graph?
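Use the interrupt_before option to pause the graph before a sensitive node (say, the one that publishes to your warehouse). This requires a checkpointer: the run stops, its state is saved, and nothing stays blocked in memory. After review, optionally edit the state with update_state, then resume by invoking the graph on the same thread_id with None as the input. This is how teams add manual review gates without restructuring the graph.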

What about streaming partial results to a client?
The astream_events API emits a stream of node-level events. For long-running scrapes that feed a UI, stream the events over a WebSocket and the UI shows progress as each node completes.

Does LangGraph have built-in support for batching multiple URLs into one LLM call?
No, but you can implement it as a node that buffers up to N URLs and emits a single multi-extraction request. The trade-off is latency for cost: batching cuts LLM tokens by 30 to 50 percent on small extractions but adds a buffering delay. Most teams skip it and accept the per-page cost.
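
The buffering half of such a node is straightforward. A sketch of the grouping step only, assuming pages arrive as an iterable; batches is a hypothetical helper, and the multi-extraction prompt and response parsing are left out.

```python
from typing import Iterable, Iterator, List


def batches(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of up to `size` items, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly shorter group
        yield batch


groups = list(batches(["a", "b", "c", "d", "e"], 2))
print(groups)
```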

Production gotchas

  • The Postgres checkpointer needs an explicit setup() call on first use. Skipping it produces a confusing “relation does not exist” error.
  • Conditional edges that return a list of node names cause parallel execution. Returning a single string is sequential routing. Mixing the two is the most common bug we see in code review.
  • LangGraph state is merged with a shallow update. If a node returns {"errors": [new_error]}, it overwrites the prior errors list. Use a reducer to append: errors: Annotated[list, operator.add].
  • recursion_limit counts super-steps (rounds of node execution), not retry attempts. A retry cycle that passes through fetch, captcha_check, and captcha_solver uses three steps per loop, so a graph with long cycles can hit the default limit of 25 sooner than expected.
  • The SQLite checkpointer is fine for development but locks aggressively under concurrent writes. Switch to Postgres before going to production.
  • A node only needs to return the fields it changed; unchanged fields are kept automatically by the merge. New developers often re-emit the full state thinking they have to, which works but obscures intent.
  • The compile step is cheap; recompile on every code change in dev. In prod, compile once at startup and reuse the compiled app across requests.
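
The reducer gotcha is worth seeing in miniature. The sketch below declares the errors field with Annotated[list, operator.add] as suggested above, then imitates a reducer-aware merge in plain Python. The merge function is an illustration for this article, not LangGraph's implementation.

```python
import operator
from typing import Annotated, List, Optional, TypedDict, get_type_hints


class ScrapeState(TypedDict):
    url: str
    html: Optional[str]
    errors: Annotated[List[str], operator.add]  # reducer: concatenate lists
    attempt: int


def merge(state: dict, update: dict, schema=ScrapeState) -> dict:
    """Apply an update, using a field's reducer when one is annotated."""
    hints = get_type_hints(schema, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        reducers = getattr(hints.get(key), "__metadata__", ())
        merged[key] = reducers[0](state[key], value) if reducers else value
    return merged


state = {"url": "u", "html": None, "errors": ["fetch failed"], "attempt": 1}
merged = merge(state, {"errors": ["captcha solver not implemented"], "attempt": 2})
print(merged["errors"], merged["attempt"])
```

Once a field has an additive reducer, nodes should return only the new entries, for example {"errors": [new_error]}. Returning state["errors"] + [new_error] would duplicate the earlier entries.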

If you are building a scraping team in 2026, the AI agentic proxies category covers the proxy and infrastructure side that pairs with LangGraph for full production deployments.
