Scraping with LangGraph agents in 2026
LangGraph scraping agents have become the standard pattern for any non-trivial LLM-driven scraping pipeline that needs branching, retries, and checkpointed state. By early 2026, LangGraph has reached version 0.4 with stable APIs, the StateGraph primitive is rock solid, and the Postgres checkpointer makes it easy to resume long-running scrapes after a crash. If your scraping job is more than fetch-extract-store, LangGraph is the right framework.
This guide builds a real production scraping agent step by step. We define the state, wire the tool nodes, add retry edges, plug in a checkpointer, and benchmark cost against alternatives. By the end you will have a working LangGraph scraper that handles a flaky target site, recovers from failures, and emits clean structured data.
Why LangGraph beats LangChain agents for scraping
LangGraph scraping agents express the workflow as an explicit graph instead of an implicit ReAct loop. That difference matters in three places.
First, branching. A scraping agent often needs to take different paths depending on what the page returns. Did the site return a captcha? Branch to the solver. Is the price hidden behind a login? Branch to authentication. LangChain’s ReAct agent makes branching implicit through prompt engineering. LangGraph makes it explicit through edges.
Second, observability. When a scraper fails at 3 AM, you want to know exactly which node failed and what state was in scope. LangGraph’s state object plus LangSmith integration gives you that. The classic LangChain agent gives you a chain of thought that you have to read.
Third, persistence. LangGraph ships a checkpointer system. When a checkpointer is configured, the state is persisted to SQLite or Postgres after every step. If the worker dies mid-scrape, you resume from the last checkpoint with one line of code.
Where LangGraph fits next to LangChain
LangChain remains the right framework for prompt templates, LLM clients, retrievers, and document loaders. LangGraph builds on top, adding the runtime that orchestrates them. The mental model is: LangChain is the parts bin, LangGraph is the assembly line. Almost every production LangGraph scraping agent imports LangChain primitives for the LLM call and the prompt template, and reserves LangGraph for the routing and state machine.
Installing the stack
pip install langgraph==0.4.0 langchain==0.3.20 langchain-openai==0.2.10 \
langgraph-checkpoint-postgres==2.0.10 \
playwright==1.49.0 pydantic==2.9.2 httpx==0.27.2
playwright install chromium
For LangSmith tracing (free for personal projects):
export LANGSMITH_API_KEY="ls__..."
export LANGSMITH_TRACING="true"
export LANGSMITH_PROJECT="lazada-scraper"
Defining the agent state
The state is a TypedDict that flows through every node. For a scraping agent it typically holds the input URL, the fetched HTML, the extracted data, an error log, and a retry counter.
from typing import TypedDict, Optional, List
from langgraph.graph import StateGraph, END

class ScrapeState(TypedDict):
    url: str
    html: Optional[str]
    captcha_detected: bool
    extracted: Optional[dict]
    errors: List[str]
    attempt: int
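LangGraph merges each node's returned dict into the state. By default a returned key overwrites the previous value; a field only accumulates if it declares a reducer. In-place mutations of the state inside a node are not seen by other nodes; only the returned dict is.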
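To make the merge rule concrete, here is a simplified pure-Python model of it. This is not LangGraph's implementation: `apply_update` and the `reducers` mapping are hypothetical names used only for illustration. It shows the difference between a key that is overwritten and a key that is appended through a reducer such as `operator.add`.

```python
import operator
from typing import Any, Callable, Dict

def apply_update(
    state: Dict[str, Any],
    update: Dict[str, Any],
    reducers: Dict[str, Callable[[Any, Any], Any]],
) -> Dict[str, Any]:
    """Return a new state: reduce keys that have a reducer, overwrite the rest."""
    merged = dict(state)  # copy so the caller's state is never mutated
    for key, value in update.items():
        if key in reducers and key in merged:
            merged[key] = reducers[key](merged[key], value)
        else:
            merged[key] = value
    return merged

state = {"url": "https://example.com", "errors": ["fetch failed: timeout"], "attempt": 1}
update = {"errors": ["fetch failed: 503"], "attempt": 2}

# No reducer: the new list replaces the old one.
overwritten = apply_update(state, update, reducers={})
# With operator.add as the reducer: the new list is appended to the old one.
appended = apply_update(state, update, reducers={"errors": operator.add})

print(overwritten["errors"])
print(appended["errors"])
```

Keys that the update does not mention (`url` here) carry over unchanged in both cases, which is why a node only needs to return the fields it changed.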
Building the nodes
Each node is a function that takes the state and returns a partial state update.
import asyncio
import json

from openai import AsyncOpenAI
from playwright.async_api import async_playwright

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def fetch_node(state: ScrapeState) -> dict:
    """Fetch the URL with Playwright."""
    try:
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            page = await browser.new_page()
            await page.goto(state["url"], wait_until="networkidle", timeout=30000)
            html = await page.content()
            await browser.close()
        return {"html": html, "attempt": state["attempt"] + 1}
    except Exception as e:
        return {
            "html": None,  # clear any stale page so the router sees the failure
            "errors": state["errors"] + [f"fetch failed: {e}"],
            "attempt": state["attempt"] + 1,
        }

async def captcha_check_node(state: ScrapeState) -> dict:
    """Quick heuristic for captcha detection."""
    html = state.get("html", "") or ""
    flags = ["cf-challenge", "captcha", "px-captcha", "datadome", "recaptcha"]
    detected = any(f in html.lower() for f in flags)
    return {"captcha_detected": detected}

async def extract_node(state: ScrapeState) -> dict:
    """LLM-driven extraction with strict JSON Schema."""
    schema = {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "price": {"type": "number"},
            "currency": {"type": "string"},
            "in_stock": {"type": "boolean"},
        },
        "required": ["title", "price", "currency", "in_stock"],
        "additionalProperties": False,
    }
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={
            "type": "json_schema",
            "json_schema": {"name": "product", "schema": schema, "strict": True},
        },
        messages=[
            {"role": "system", "content": "Extract product data from HTML."},
            # Truncate very large pages to keep the request within the context window.
            {"role": "user", "content": (state["html"] or "")[:200000]},
        ],
    )
    return {"extracted": json.loads(resp.choices[0].message.content)}

async def captcha_solver_node(state: ScrapeState) -> dict:
    """Stub: wire to 2Captcha, CapSolver, or similar."""
    return {
        "errors": state["errors"] + ["captcha solver not implemented"],
        "captcha_detected": False,
    }
Wiring the graph
This is the part that beats every alternative framework on clarity. You list nodes, list edges, and the runtime is built.
def route_after_captcha_check(state: ScrapeState):
    if state["captcha_detected"]:
        # Cap solver round-trips as well, so a persistent captcha cannot
        # loop until the recursion limit is hit.
        if state["attempt"] >= 3:
            return END
        return "captcha_solver"
    return "extract"

def route_after_fetch(state: ScrapeState):
    if state.get("html") is None:
        if state["attempt"] >= 3:
            return END  # give up after three failed fetches
        return "fetch"
    return "captcha_check"

graph = StateGraph(ScrapeState)
graph.add_node("fetch", fetch_node)
graph.add_node("captcha_check", captcha_check_node)
graph.add_node("captcha_solver", captcha_solver_node)
graph.add_node("extract", extract_node)
graph.set_entry_point("fetch")
graph.add_conditional_edges("fetch", route_after_fetch)
graph.add_conditional_edges("captcha_check", route_after_captcha_check)
graph.add_edge("captcha_solver", "fetch")  # retry after solving
graph.add_edge("extract", END)
app = graph.compile()
That graph handles the basic happy path, captcha branch, and a 3-attempt retry on fetch failures. LangGraph compiles it into an executable that you invoke with the initial state.
async def main():
    final = await app.ainvoke({
        "url": "https://www.lazada.sg/products/xyz",
        "html": None,
        "captcha_detected": False,
        "extracted": None,
        "errors": [],
        "attempt": 0,
    })
    print(json.dumps(final["extracted"], indent=2))

asyncio.run(main())
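Because the routers are plain functions of the state, they can be checked without a browser, an LLM, or a compiled graph. The sketch below copies route_after_fetch from the wiring code and uses a local `END` stand-in so it runs without langgraph installed (assumption: LangGraph's END constant is the string "__end__"; in real code, import it from langgraph.graph).

```python
END = "__end__"  # local stand-in for langgraph.graph.END (assumed value)

def route_after_fetch(state: dict) -> str:
    """Retry failed fetches up to three attempts, then give up."""
    if state.get("html") is None:
        if state["attempt"] >= 3:
            return END
        return "fetch"
    return "captcha_check"

# Each case pairs a minimal state with the node the router should pick.
cases = [
    ({"html": None, "attempt": 1}, "fetch"),                   # failed once: retry
    ({"html": None, "attempt": 3}, END),                       # out of attempts: stop
    ({"html": "<html></html>", "attempt": 1}, "captcha_check"),  # success: move on
]

for state, expected in cases:
    assert route_after_fetch(state) == expected, (state, expected)
print("all routing cases pass")
```

Keeping routing logic free of I/O like this is what makes the retry behavior testable in ordinary unit tests.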
Adding a checkpointer
For long-running scraping jobs (think: scrape ten thousand products with intermediate state at each one), the checkpointer is mandatory. SQLite for development, Postgres for production.
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver

DB_URI = "postgresql://scraper:secret@localhost:5432/scrapes"

async def run(initial_state: dict):
    # ainvoke/astream need the async saver; PostgresSaver is the sync variant.
    async with AsyncPostgresSaver.from_conn_string(DB_URI) as checkpointer:
        await checkpointer.setup()  # creates the checkpoint tables on first use
        app = graph.compile(checkpointer=checkpointer)
        config = {"configurable": {"thread_id": "lazada-product-12345"}}
        async for event in app.astream(initial_state, config):
            print(event)
If the worker dies mid-graph, restart with the same thread_id and pass None as the input; LangGraph then resumes from the last saved checkpoint instead of starting over. This is critical for any scrape that takes more than a minute.
Adding tool nodes for proxy rotation
For real production work, every fetch should go through a rotating proxy pool. Wire it into the fetch node:
import random
import os

# Comma-separated proxy URLs; drop empty entries so an unset variable means "no proxy".
PROXIES = [p for p in os.environ.get("PROXY_POOL", "").split(",") if p]

async def fetch_node(state: ScrapeState) -> dict:
    proxy = random.choice(PROXIES) if PROXIES else None
    proxy_config = None
    if proxy:
        scheme, _, rest = proxy.partition("://")
        if "@" in rest:
            creds, host_port = rest.split("@", 1)
            # Split on the first colon only, so passwords may contain ":".
            user, password = creds.split(":", 1)
            proxy_config = {"server": f"{scheme}://{host_port}", "username": user, "password": password}
        else:
            proxy_config = {"server": proxy}
    try:
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True, proxy=proxy_config)
            page = await browser.new_page()
            await page.goto(state["url"], wait_until="networkidle", timeout=30000)
            html = await page.content()
            await browser.close()
        return {"html": html, "attempt": state["attempt"] + 1}
    except Exception as e:
        return {
            "html": None,  # clear any stale page so the router sees the failure
            "errors": state["errors"] + [f"fetch failed: {e}"],
            "attempt": state["attempt"] + 1,
        }
For ASEAN scraping, where mobile IPs often work better than residential ones, a Singapore mobile proxy can serve as the proxy pool source.
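The hand-rolled string splitting in the fetch node can also be done with the standard library's URL parser, which copes with percent-encoded credentials. The helper below is a sketch under assumptions: `to_playwright_proxy` is a hypothetical name, the output keys (server, username, password) follow the shape Playwright's proxy option expects, and IPv6 literal hosts are not handled.

```python
from typing import Optional
from urllib.parse import unquote, urlsplit

def to_playwright_proxy(proxy_url: str) -> Optional[dict]:
    """Convert scheme://user:pass@host:port into a Playwright-style proxy dict."""
    if not proxy_url:
        return None  # no proxy configured
    parts = urlsplit(proxy_url)
    if not parts.scheme or not parts.hostname:
        raise ValueError(f"not a proxy URL: {proxy_url!r}")
    server = f"{parts.scheme}://{parts.hostname}"
    if parts.port:
        server += f":{parts.port}"
    config = {"server": server}
    if parts.username:
        # Credentials may be percent-encoded in the URL; decode them.
        config["username"] = unquote(parts.username)
        config["password"] = unquote(parts.password or "")
    return config

print(to_playwright_proxy("http://user:p%40ss@10.0.0.1:8080"))
print(to_playwright_proxy("http://proxy.example:3128"))
```

The returned dict can be passed as the proxy argument of chromium.launch in place of the manually built proxy_config.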
Sticky proxies via thread state
For multi-page flows where the same session must hold the same exit IP, store the proxy in the state and pin it across nodes:
class ScrapeState(TypedDict):
    url: str
    html: Optional[str]
    captcha_detected: bool
    extracted: Optional[dict]
    errors: List[str]
    attempt: int
    sticky_proxy: Optional[str]  # bound on first fetch, reused across retries

async def fetch_node(state: ScrapeState) -> dict:
    proxy = state.get("sticky_proxy") or random.choice(PROXIES)
    # ... fetch with proxy
    return {"html": html, "sticky_proxy": proxy, "attempt": state["attempt"] + 1}
This is essential for cart and checkout flows on retailers that fingerprint the session-to-IP binding.
Parallel fan-out with Send
For the common pattern of “scrape 50 URLs and aggregate the results,” LangGraph supports a Send primitive that fans out across N parallel branches and rejoins.
from langgraph.types import Send

def dispatcher_node(state):
    # Pass-through entry node; the fan-out happens in the conditional edge below.
    return {}

def fanout(state):
    # One Send per URL; each branch receives its own input state.
    return [Send("scrape_one", {"url": u, "html": None, "errors": [], "attempt": 0})
            for u in state["urls"]]

graph = StateGraph(BatchState)
graph.add_node("dispatcher", dispatcher_node)
graph.add_node("scrape_one", scrape_one_node)
graph.add_node("aggregate", aggregate_node)
graph.set_entry_point("dispatcher")
graph.add_conditional_edges("dispatcher", fanout, ["scrape_one"])
graph.add_edge("scrape_one", "aggregate")
graph.add_edge("aggregate", END)
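This gets you 50-way parallel scraping with a single aggregated result. To apply backpressure, cap concurrency with max_concurrency in the run config. The state key that collects the branch results needs a reducer (for example operator.add on a list), otherwise parallel writes to the same key conflict. The pattern is usually simpler to operate than spawning 50 separate graph invocations because the aggregate state lives in one place.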
Retry strategies that actually work
The naive retry pattern is “if fetch fails, retry up to N times.” In production this is rarely sufficient because failures cluster: a bad IP fails 5 times in a row before you decide to rotate. A better pattern uses exponential backoff and IP rotation between attempts.
import asyncio

async def fetch_node(state: ScrapeState) -> dict:
    backoff = min(2 ** state["attempt"], 30)  # 2 s, 4 s, 8 s, 16 s, then capped at 30 s
    if state["attempt"] > 0:
        await asyncio.sleep(backoff)
    proxy = random.choice(PROXIES)  # NEW proxy on every retry
    # ... fetch with proxy
Combined with a circuit breaker on the proxy pool (evict any IP that fails 3 times in 10 minutes), the success rate on real-world targets jumps from roughly 88 percent to over 97 percent.
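The circuit breaker mentioned above can be sketched as a sliding-window failure counter per proxy. Everything here is illustrative: `ProxyBreaker` and its method names are hypothetical, the thresholds default to the 3 failures in 10 minutes quoted above, and eviction is permanent for the life of the process (a real pool would probably re-admit proxies after a cool-down).

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List, Set

class ProxyBreaker:
    """Evict a proxy once it fails max_failures times within window_s seconds."""

    def __init__(
        self,
        proxies: List[str],
        max_failures: int = 3,
        window_s: float = 600.0,
        clock: Callable[[], float] = time.monotonic,  # injectable for tests
    ) -> None:
        self.proxies = list(proxies)
        self.max_failures = max_failures
        self.window_s = window_s
        self.clock = clock
        self.failures: Dict[str, Deque[float]] = defaultdict(deque)
        self.evicted: Set[str] = set()

    def record_failure(self, proxy: str) -> None:
        now = self.clock()
        recent = self.failures[proxy]
        recent.append(now)
        # Drop failures that have fallen out of the sliding window.
        while recent and now - recent[0] > self.window_s:
            recent.popleft()
        if len(recent) >= self.max_failures:
            self.evicted.add(proxy)

    def healthy(self) -> List[str]:
        """Proxies that are still eligible for random.choice."""
        return [p for p in self.proxies if p not in self.evicted]
```

In the fetch node, the idea would be to pick from breaker.healthy() instead of PROXIES and call breaker.record_failure(proxy) in the except branch.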
Comparing LangGraph to alternatives
| Framework | State model | Branching | Persistence | Observability | Best fit |
|---|---|---|---|---|---|
| LangGraph | Explicit TypedDict | First-class | Built-in checkpointer | LangSmith integration | Complex multi-step pipelines |
| LangChain ReAct | Implicit, chat history | Prompt-driven | Manual | LangSmith integration | Simple one-shot tasks |
| CrewAI | Per-crew shared memory | Role-based | Manual | LangSmith or self | Multi-agent role play |
| AutoGen | Group chat state | Free-form | Manual | OpenTelemetry | Conversational agents |
| Custom asyncio | Whatever you build | Whatever you write | Whatever you write | Whatever you wire | Maximum flexibility |
LangGraph wins for scraping specifically because scraping flows are state machines with clear branches: fetch then maybe captcha then extract then maybe retry. That maps to LangGraph’s primitives one to one. ReAct loops can do the same job but every behavior change requires prompt rewriting.
For more on CrewAI as an alternative, see CrewAI for scraping pipelines. For AutoGen, see Multi-agent scraping with AutoGen in 2026.
Cost benchmarks
Single product page extraction, end to end, on the graph above:
| LLM model | Avg LLM tokens per page | LLM cost per page | Compute per page | Total per 1k pages |
|---|---|---|---|---|
| GPT-4o-mini | 14,000 | $0.0028 | $0.001 | $3.80 |
| GPT-4o | 14,000 | $0.05 | $0.001 | $51 |
| Claude 3.5 Sonnet | 13,500 | $0.052 | $0.001 | $53 |
| Claude 3.5 Haiku | 13,500 | $0.011 | $0.001 | $12 |
For high-volume scraping where you control the prompt and the schema is simple, GPT-4o-mini or Claude Haiku is the right pick. For tricky sites where extraction quality matters more than cost, Sonnet or 4o is worth it.
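The last column of the table is simply the per-page LLM cost plus the per-page compute cost, scaled to 1,000 pages. A small sketch of that arithmetic, using the table's own figures (the function name `cost_per_1k_pages` is just illustrative):

```python
def cost_per_1k_pages(llm_cost_per_page: float, compute_per_page: float) -> float:
    """Total USD cost for 1,000 pages, rounded to cents."""
    return round((llm_cost_per_page + compute_per_page) * 1000, 2)

# (LLM cost per page, compute cost per page) taken from the table above.
rows = {
    "GPT-4o-mini": (0.0028, 0.001),
    "GPT-4o": (0.05, 0.001),
    "Claude 3.5 Sonnet": (0.052, 0.001),
    "Claude 3.5 Haiku": (0.011, 0.001),
}

for model, (llm, compute) in rows.items():
    print(f"{model}: ${cost_per_1k_pages(llm, compute):.2f} per 1k pages")
```

Swapping in your own measured token counts and current model prices gives a quick estimate before committing to a model.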
Production deployment
Run LangGraph workers under a process supervisor (systemd, PM2, or Kubernetes) with a health-check endpoint. Use Redis or Postgres for the checkpointer in production.
A minimal worker loop:
import asyncio
from redis.asyncio import Redis

redis = Redis.from_url("redis://localhost:6379")

async def worker():
    while True:
        item = await redis.brpop("scrape:queue", timeout=10)
        if item is None:
            continue  # queue was empty for 10 s; poll again
        url = item[1].decode()  # brpop returns (queue_name, value)
        state = make_initial_state(url)
        config = {"configurable": {"thread_id": f"scrape-{url}"}}
        try:
            final = await app.ainvoke(state, config)
            await store_result(url, final)
        except Exception:
            # Park failed URLs on a dead-letter list for later inspection.
            await redis.lpush("scrape:dead", url)

asyncio.run(worker())
For LangGraph deployment on serverless, the LangGraph Platform docs cover the managed option in detail.
A complete production graph with all the layers
Putting the patterns together gives a graph that handles the long tail. The nodes:
- validate_url rejects malformed input early, saving downstream work.
- fetch with sticky proxy and exponential backoff.
- captcha_check flags Cloudflare, DataDome, PerimeterX, and Akamai.
- captcha_solver calls 2Captcha for Turnstile and CapSolver for the rest.
- extract pulls structured fields with strict schema.
- validate_extracted checks Pydantic-level invariants (price > 0, in_stock is boolean).
- enrich adds derived fields (USD-converted price, normalized SKU).
- persist writes to Postgres with an upsert.
- notify posts to Slack on price changes greater than 10 percent.
The graph branches at captcha_check (solver vs extract), at validate_extracted (re-fetch on validation failure vs persist), and at notify (skip if no significant change). Total node count: 9. Total edges: 14. The graph compiles in milliseconds and runs each scrape in roughly 4 to 8 seconds depending on captcha presence.
Real teams underestimate how much of their scraper is the surrounding plumbing (validate, enrich, persist, notify) versus the core fetch and extract. Having those as named nodes in a graph instead of buried inside the fetch function makes the system far easier to reason about when something breaks at 2 AM.
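As an illustration of the validate_extracted step, the sketch below checks the two invariants named in the node list (price > 0, in_stock is boolean) in plain Python, to stay dependency-free; a Pydantic model would be the more likely choice in practice. The function names and the "valid"/"invalid" labels are assumptions; how those labels map to nodes is up to the graph's conditional edges.

```python
from typing import List

def validation_errors(extracted: dict) -> List[str]:
    """Return invariant violations; an empty list means the record is valid."""
    problems = []
    price = extracted.get("price")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(price, bool) or not isinstance(price, (int, float)) or price <= 0:
        problems.append("price must be a number greater than 0")
    if not isinstance(extracted.get("in_stock"), bool):
        problems.append("in_stock must be a boolean")
    return problems

def route_after_validation(state: dict) -> str:
    """Label the outcome; the graph maps labels to the re-fetch or persist path."""
    return "invalid" if validation_errors(state.get("extracted") or {}) else "valid"

good = {"title": "Kettle", "price": 29.9, "currency": "SGD", "in_stock": True}
bad = {"title": "Kettle", "price": 0, "currency": "SGD", "in_stock": "yes"}

print(validation_errors(good))
print(validation_errors(bad))
```

Returning the list of problems rather than a bare boolean lets the node append them to the errors field, so a failed validation is visible in the trace.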
Cost and latency under realistic load
Numbers from a March 2026 production deployment running 50,000 product scrapes per day on a 4 vCPU 8 GB worker pool:
| Metric | Value |
|---|---|
| Median graph latency end-to-end | 4.8 s |
| p99 graph latency | 22 s |
| Throughput per worker | 12 scrapes/min |
| Workers needed for 50k/day | 3 |
| LLM cost per scrape (GPT-4o-mini) | $0.0028 |
| Proxy cost per scrape (residential) | $0.0006 |
| Compute cost per scrape (Fargate) | $0.0008 |
| Total per-scrape cost | $0.0042 |
| Daily total | $210 |
Compared to a non-LangGraph baseline (raw asyncio plus prompt) that ran at $190/day for the same throughput, LangGraph adds about 10 percent overhead in exchange for resumability, observability, and retry correctness. Most teams find the tradeoff lopsidedly worth it.
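The derived rows in the table follow from the measured ones. As a quick check of the arithmetic (function names are illustrative, inputs are the table's figures):

```python
import math

def workers_needed(scrapes_per_day: int, scrapes_per_min_per_worker: float) -> int:
    """Workers required if each one runs around the clock at the given rate."""
    per_worker_per_day = scrapes_per_min_per_worker * 60 * 24
    return math.ceil(scrapes_per_day / per_worker_per_day)

def daily_cost(scrapes_per_day: int, *per_scrape_costs: float) -> float:
    """Daily USD total from the per-scrape cost components."""
    return round(scrapes_per_day * sum(per_scrape_costs), 2)

workers = workers_needed(50_000, 12)                 # 12 scrapes/min = 17,280/day per worker
total = daily_cost(50_000, 0.0028, 0.0006, 0.0008)   # LLM + proxy + compute
overhead_pct = round((total - 190) / 190 * 100, 1)   # vs the $190/day asyncio baseline

print(workers)       # 3
print(total)         # 210.0
print(overhead_pct)  # 10.5
```

Note that three workers at 12 scrapes/min give about 51,840 scrapes per day, so the pool runs close to capacity with little headroom for the p99 tail.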
Observability with LangSmith
LangSmith remains the easiest way to see what a LangGraph agent is doing in production. Every node execution is traced with inputs, outputs, latency, and any LLM calls inside the node. The trace tree mirrors the graph structure, so you can spot a slow extract node or a node that errored without grepping logs.
Three patterns worth adopting from the start:
Tag every run with the domain of the URL being scraped and the source queue name. Tags go in the run config passed to invoke or ainvoke, for example config={"tags": [url_domain, queue_name]}. This makes filtering in the LangSmith UI trivial.
Add custom metadata for the scrape ID and the proxy used. When something fails on a specific IP, you can filter by proxy to see if the failure is sticky to one bad IP versus a target-side ban.
Use LangSmith’s evaluation suite to regress against a frozen set of known-good HTML pages. Whenever you change the extract prompt, run the eval and confirm structured output quality has not regressed. This catches subtle prompt drift that production logs would not surface for days.
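The first two patterns can be combined in one small helper that builds the run config. This is a sketch: `build_run_config` and its parameters are hypothetical, while "tags", "metadata", and "configurable" are the standard run-config keys. It assumes the proxy string has already been stripped of credentials before it is logged.

```python
from urllib.parse import urlsplit

def build_run_config(url: str, queue_name: str, scrape_id: str, proxy: str) -> dict:
    """Build the config dict passed as the second argument to invoke/ainvoke."""
    domain = urlsplit(url).hostname or "unknown"
    return {
        "configurable": {"thread_id": f"scrape-{scrape_id}"},
        "tags": [domain, queue_name],                            # filterable in the UI
        "metadata": {"scrape_id": scrape_id, "proxy": proxy},    # per-run details
    }

config = build_run_config(
    url="https://www.lazada.sg/products/xyz",
    queue_name="scrape:queue",
    scrape_id="12345",
    proxy="sg-mobile-07",  # hypothetical proxy label, not a credentialed URL
)
print(config["tags"])
```

The resulting dict would be passed straight through, for example app.ainvoke(state, config).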
For teams that prefer self-hosted observability, OpenTelemetry instrumentation is supported via the langsmith SDK with the OTel exporter. Span attributes match the LangGraph node names, so Jaeger or Tempo show the same trace tree.
Frequently asked questions
Can LangGraph handle parallel scraping of multiple URLs?
Yes. Spawn one app invocation per URL, each with its own thread_id, and let asyncio handle concurrency. LangGraph also supports parallel branches inside a single graph using Send for fan-out patterns.
How do I version a LangGraph workflow safely?
Treat the graph definition as code, version it in git, and include a graph version in your thread_id. Old checkpoints stay tied to the old graph version, new ones to the new.
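One way to do this, sketched under assumptions: `GRAPH_VERSION` and `make_thread_id` are hypothetical names, and hashing the URL is an extra design choice that keeps the ID short and free of awkward characters.

```python
import hashlib

GRAPH_VERSION = "v3"  # hypothetical; bump when nodes, edges, or state fields change

def make_thread_id(url: str, graph_version: str = GRAPH_VERSION) -> str:
    """Deterministic thread_id that ties a checkpoint to one graph version."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return f"scrape-{graph_version}-{digest}"

url = "https://www.lazada.sg/products/xyz"
print(make_thread_id(url))
print(make_thread_id(url, graph_version="v4"))
```

Because the ID is deterministic, a restarted worker computes the same thread_id for the same URL and graph version and can pick up the existing checkpoint, while a new graph version starts from a clean thread.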
Does LangGraph work with local LLMs?
Yes. Anything that exposes an OpenAI-compatible API (Ollama, vLLM, LM Studio) plugs in as the LLM. Quality depends on the model. Llama 3.3 70B and Qwen 2.5 72B are the strongest open-source picks for extraction tasks in 2026.
Can I use LangGraph from JavaScript?
Yes. LangGraph.js is feature-equivalent to the Python version as of late 2025 and is the right pick for Node.js scraping pipelines.
What about cycles? Can a graph loop forever?
Cycles are allowed and useful for retry patterns, but every graph has a recursion_limit config (default 25) that prevents infinite loops. Set it explicitly for your use case.
Can I share state between two unrelated graph invocations?
Yes, by writing to an external store (Redis, Postgres) from inside a node. LangGraph state is per-thread by design; cross-thread sharing is intentional code, not implicit behavior.
How do I add human-in-the-loop approval to a scraping graph?
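Pass interrupt_before when compiling the graph to pause before a sensitive node (say, the one that publishes to your warehouse); this requires a checkpointer. The run stops at that point with its state checkpointed rather than blocking a worker. After review, optionally call update_state to adjust the state, then resume by invoking the graph again with None as the input and the same thread_id. This is how teams add manual review gates without restructuring the graph.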
What about streaming partial results to a client?
The astream_events API emits a stream of node-level events. For long-running scrapes that feed a UI, stream the events over a WebSocket and the UI shows progress as each node completes.
Does LangGraph have built-in support for batching multiple URLs into one LLM call?
No, but you can implement it as a node that buffers up to N URLs and emits a single multi-extraction request. The trade-off is latency for cost: batching cuts LLM tokens by 30 to 50 percent on small extractions but adds a buffering delay. Most teams skip it and accept the per-page cost.
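A minimal sketch of the buffering side of such a node, in plain Python. The helper names and the page delimiter format are assumptions for illustration; the LLM call itself and the schema for a multi-item response are left out.

```python
from typing import Dict, Iterable, Iterator, List

def batch_urls(urls: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of up to batch_size URLs, preserving order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[str] = []
    for url in urls:
        batch.append(url)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch

def build_batch_prompt(pages: Dict[str, str]) -> str:
    """Join several pages into one prompt, each section labeled with its URL."""
    sections = [
        f"### PAGE {i} URL: {url}\n{html}"
        for i, (url, html) in enumerate(pages.items(), start=1)
    ]
    return "\n\n".join(sections)

for group in batch_urls(["u1", "u2", "u3", "u4", "u5"], batch_size=2):
    print(group)
```

The buffering delay mentioned above comes from waiting for a batch to fill; a production version would also flush on a timer so a slow queue does not stall the last few URLs.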
Production gotchas
- The Postgres checkpointer needs an explicit setup() call on first use. Skipping it produces a confusing “relation does not exist” error.
- Conditional edges that return a list of node names cause parallel execution. Returning a single string is sequential routing. Mixing the two is the most common bug we see in code review.
- LangGraph state is merged with a shallow update. If a node returns {"errors": [new_error]}, it overwrites the prior errors list. Use a reducer to append: errors: Annotated[list, operator.add]. With the reducer in place, return only the new items, not the full list.
- recursion_limit counts super-steps, not loop iterations. Every node hop inside a retry cycle consumes one, so a graph with a few retry loops can hit the default limit of 25 unexpectedly.
- The SQLite checkpointer is fine for development but locks aggressively under concurrent writes. Switch to Postgres before going to production.
- Returning only the fields a node changed is correct; unchanged fields are preserved by the merge. New developers often re-emit the full state thinking they have to, which works but obscures intent.
- The compile step is cheap; recompile on every code change in dev. In prod, compile once at startup and reuse the compiled app across requests.
If you are building a scraping team in 2026, the AI agentic proxies category covers the proxy and infrastructure side that pairs with LangGraph for full production deployments.