Scraping JavaScript-heavy SPAs with AI agents in 2026
AI-agent scraping pipelines for SPAs have become the dominant pattern for modern web targets because the entire ecommerce, SaaS, and content stack has converged on React, Vue, and Next.js. Server-rendered HTML is the exception; client-rendered, hydrated, lazy-loaded SPAs are the rule. Traditional fetch-and-parse scrapers fail on these targets. AI agents that drive a real headless browser succeed.
This guide covers the patterns that work in 2026 for scraping SPAs with AI agents. We cover detection, rendering strategy, wait conditions, structured extraction, proxy integration, and the edge cases that bite if you miss them. Code in Python and TypeScript throughout.
Why SPAs break traditional scrapers
A SPA serves a near-empty HTML shell that loads JavaScript bundles, calls APIs, and renders the actual content client-side. If you fetch the URL with requests or httpx, you get the shell. The data you want is generated milliseconds to seconds later by JavaScript that never ran in your scraper.
The fix is to render the page in a real browser. That used to mean Puppeteer or Selenium with brittle wait conditions and per-site selector maintenance. In 2026, the better fix is to drive the browser with an AI agent that watches the page render in real time and decides when to extract.
Detecting SPA targets
Before reaching for an AI agent, check whether the target actually needs one. Many sites that look like SPAs in DevTools are actually server-rendered or hybrid.
Quick detection script:
```python
import httpx
from bs4 import BeautifulSoup

def is_spa(url: str) -> bool:
    r = httpx.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(r.text, "html.parser")
    text_chars = len(soup.get_text(strip=True))
    has_react_root = bool(soup.find(id="root") or soup.find(id="__next"))
    has_app_div = bool(soup.find("div", attrs={"id": "app"}))
    has_low_text = text_chars < 1000
    return (has_react_root or has_app_div) and has_low_text

print(is_spa("https://www.lazada.sg/"))         # True
print(is_spa("https://news.ycombinator.com/"))  # False
```
If is_spa returns False, use a normal HTTP scraper. If True, you need a real browser.
The agentic browser pattern
The pattern that wins in 2026 looks like this:
- Launch a headless Chromium with stealth defaults
- Navigate to the target URL
- Wait for visual stability (network idle plus a small grace period)
- Take a screenshot and dump the rendered HTML
- Pass both to an LLM with a strict JSON Schema
- Validate the result, retry if needed
In code, with browser-use as the agent driver:
```python
import asyncio

from browser_use import Agent, Browser, BrowserConfig, Controller
from langchain_openai import ChatOpenAI
from pydantic import BaseModel

class Product(BaseModel):
    title: str
    price: float
    currency: str
    in_stock: bool

controller = Controller(output_model=Product)

async def scrape_spa_product(url: str) -> Product:
    browser = Browser(config=BrowserConfig(
        headless=True,
        extra_chromium_args=["--disable-blink-features=AutomationControlled"],
    ))
    agent = Agent(
        task=(
            f"Visit {url}, wait for the product details to fully load, "
            f"and return the title, price (number), currency code, and stock status."
        ),
        llm=ChatOpenAI(model="gpt-4o-mini"),
        browser=browser,
        controller=controller,
        max_failures=3,
    )
    history = await agent.run()
    return Product.model_validate_json(history.final_result())

product = asyncio.run(scrape_spa_product("https://www.lazada.sg/products/example-12345.html"))
print(product)
```
The agent watches the page render and decides when to extract. No selector maintenance. No wait condition tuning per site.
For more on browser-use specifically, see our browser-use scraping guide.
SPA framework cheat sheet
Different frameworks fingerprint differently. Quick recognition guide:
| Framework | Tells |
|---|---|
| Next.js (App Router) | __next div, __next-build-id meta, _next/static/... script URLs |
| Next.js (Pages Router) | __NEXT_DATA__ script tag, _next/static/chunks/pages/... |
| Remix | __remixContext script, data-route on root |
| SvelteKit | __sveltekit_* global, data-sveltekit-* attrs |
| Nuxt | __NUXT__ script, _nuxt/... script URLs |
| Astro | astro-island custom elements, mixed SSR with client islands |
| React + Vite | <div id="root"> + /src/main.tsx script ref |
| Vue + Vite | <div id="app"> + Vue devtools meta |
| Angular | <app-root> element, ng-version attr |
Knowing the framework tells you whether the data is already in the initial HTML payload. Next.js with App Router and Remix often serve fully rendered HTML; Astro typically does too. SPA frameworks with strict client rendering (vanilla React + Vite, vanilla Vue) almost always need a real browser.
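One practical consequence: if the fingerprint says Next.js Pages Router, the data often sits in the `__NEXT_DATA__` script tag and you can skip the browser entirely. A minimal sketch, assuming the usual `props.pageProps` layout (the exact path varies per site):

```python
import json

import httpx
from bs4 import BeautifulSoup

def next_data(url: str) -> dict:
    # Pages Router embeds the page's server props as JSON in a script tag.
    r = httpx.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(r.text, "html.parser")
    tag = soup.find("script", id="__NEXT_DATA__")
    if tag is None:
        raise ValueError("no __NEXT_DATA__; this page needs a real browser")
    return json.loads(tag.string)["props"]["pageProps"]
```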
Wait conditions that actually work
The trickiest part of SPA scraping is waiting long enough for content to hydrate without waiting forever. Three patterns:
Network idle plus grace period. Wait for networkidle (no requests for 500ms) then sleep an additional 1-2 seconds. Catches most React Suspense boundaries.
DOM-stability detection. Watch the DOM mutation count, wait until it stabilizes for 1-2 seconds.
Sentinel selector. Wait for a known element that only appears after content loads (price, product image, review count). Most reliable when you know the site.
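A sentinel wait in plain Playwright (Python), with a hypothetical `.price` selector standing in for whatever element you know loads last:

```python
from playwright.async_api import async_playwright

async def fetch_after_sentinel(url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="domcontentloaded")
        # ".price" is a placeholder; pick an element that only
        # appears once the content has hydrated.
        await page.wait_for_selector(".price", timeout=10_000)
        html = await page.content()
        await browser.close()
        return html
```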
For agentic scraping, the agent handles this implicitly. You just give it enough time budget per page.
Stagehand example with explicit wait:
```typescript
import { z } from "zod";
import { Stagehand } from "@browserbasehq/stagehand";

const stagehand = new Stagehand({ env: "LOCAL", modelName: "gpt-4o-mini" });
await stagehand.init();
const page = stagehand.page;

await page.goto("https://www.lazada.sg/products/example.html", { waitUntil: "networkidle" });
await page.waitForTimeout(1500); // grace period for React hydration

const data = await page.extract({
  instruction: "Extract product title, price, currency, in-stock status",
  schema: z.object({
    title: z.string(),
    price: z.number(),
    currency: z.string(),
    inStock: z.boolean(),
  }),
});
```
When networkidle lies
networkidle is a useful default but it lies on three common patterns. Long-poll connections never go idle, so a chat widget keeps the network busy forever. Analytics beacons that fire every 5 seconds prevent idle from triggering. Web sockets that retry every few seconds also prevent the idle state.
The mitigation is to combine networkidle with a hard timeout cap and a DOM-stability check. If networkidle has not fired within 10 seconds but the visible content has stopped changing, extract anyway.
```typescript
import type { Page } from "playwright";

async function waitForContent(page: Page, timeoutMs = 15000) {
  const start = Date.now();
  let lastDomSize = 0;
  let stableCount = 0;
  while (Date.now() - start < timeoutMs) {
    const size = await page.evaluate(() => document.body.innerText.length);
    if (size === lastDomSize && size > 500) {
      stableCount++;
      if (stableCount >= 3) return; // 1.5s of stability
    } else {
      stableCount = 0;
    }
    lastDomSize = size;
    await page.waitForTimeout(500);
  }
}
```
This pattern beats raw networkidle on roughly 30 percent of SPAs we tested.
Handling infinite scroll and lazy loading
Many SPAs render content lazily as the user scrolls. Two patterns to handle this.
Scroll loop. Scroll to bottom in a loop until the page height stops growing.
```typescript
import type { Page } from "playwright";

async function scrollUntilStable(page: Page, maxScrolls = 20) {
  let lastHeight = 0;
  for (let i = 0; i < maxScrolls; i++) {
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await page.waitForTimeout(1000);
    const height = await page.evaluate(() => document.body.scrollHeight);
    if (height === lastHeight) return;
    lastHeight = height;
  }
}
```
Intercept the underlying API. Many SPAs lazy-load by calling a paginated JSON endpoint. Open DevTools, find the call, and hit it directly. Skip the browser entirely. This is the highest-throughput pattern when it applies.
```python
import httpx

async def fetch_lazada_listings(category: str, page: int):
    url = f"https://www.lazada.sg/api/listing?category={category}&page={page}"
    headers = {"x-csrf-token": "...", "User-Agent": "Mozilla/5.0"}
    async with httpx.AsyncClient() as c:
        r = await c.get(url, headers=headers)
        return r.json()
```
When you can find and hit the underlying API, do that. AI agents are the fallback when the API is hidden, signed, or rate-limited too aggressively for direct access.
Comparison of SPA scraping approaches
| Approach | Cost per page | Reliability on changing layouts | Maintenance | Best fit |
|---|---|---|---|---|
| Direct API interception | $0.001 | High | Medium | When you can find the API |
| Playwright with custom selectors | $0.005 | Low | High | Stable known-shape sites |
| Stagehand extract | $0.04 | High | Low | Long-tail SPA targets |
| browser-use full agent | $0.04 | High | Low | Multi-step SPA flows |
| Operator/Computer Use | $0.20 | Highest | Lowest | Hardest targets |
For 80 percent of SPA scraping work in 2026, Stagehand’s extract primitive plus a 1.5-second grace period is the sweet spot. It is cheap enough to scale, reliable enough to ignore most layout changes, and easy enough to write that a junior engineer can ship a new scraper in an hour.
Hydration race conditions
A common bug: you take a screenshot at the wrong moment and the agent extracts placeholder data (“Loading…”, skeleton boxes, default values).
Two defenses:
First, validate the output. Reject any extraction where price equals zero or title contains “loading”. Retry with a longer wait.
```python
def validate_product(p: dict) -> bool:
    if not p.get("title") or "loading" in p["title"].lower():
        return False
    if p.get("price", 0) <= 0:
        return False
    return True
```
Second, take two screenshots 500ms apart and compare. If they differ significantly, the page is still rendering. Wait, retry, repeat.
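A sketch of the double-screenshot check with Playwright's Python API and Pillow; it treats any pixel difference as "still rendering" (a per-site tolerance threshold is a reasonable refinement):

```python
import io

from PIL import Image, ImageChops

async def render_settled(page, interval_ms: int = 500) -> bool:
    # Two screenshots taken interval_ms apart; a nonzero diff
    # bounding box means the page is still changing.
    first = Image.open(io.BytesIO(await page.screenshot())).convert("RGB")
    await page.wait_for_timeout(interval_ms)
    second = Image.open(io.BytesIO(await page.screenshot())).convert("RGB")
    return ImageChops.difference(first, second).getbbox() is None
```

Loop on this with a retry cap rather than extracting after a single pass.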
Adding proxy rotation
SPAs are typically served by sites with strong bot defenses (Cloudflare, DataDome, Akamai). Mobile or residential proxies are mandatory.
In browser-use:
```python
from browser_use import Browser, BrowserConfig

browser = Browser(config=BrowserConfig(
    headless=True,
    proxy={"server": "http://proxy.example.com:8000", "username": "u", "password": "p"},
))
```
In Stagehand:
```typescript
const stagehand = new Stagehand({
  env: "LOCAL",
  localBrowserLaunchOptions: {
    proxy: { server: "http://proxy.example.com:8000", username: "u", password: "p" },
  },
});
```
For ASEAN ecommerce SPAs (Lazada, Shopee, Tokopedia), a Singapore mobile proxy carries real Singtel and StarHub IPs that avoid the data-center blocks these sites apply.
Structured extraction at the end
The right pattern is to use the agent only to reach the target page and dump the rendered HTML, then run structured extraction with a cheaper model.
```python
# step 1: navigate with the agent, then dump the rendered HTML
agent = Agent(
    task=f"Reach the product page at {url} and extract the full HTML",
    llm=ChatOpenAI(model="gpt-4o-mini"),
)
result = await agent.run()
html = await agent.browser.context.pages[0].content()

# step 2: cheap structured extraction
import json
from openai import AsyncOpenAI

client = AsyncOpenAI()
extract = await client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_schema", "json_schema": {
        "name": "product",
        "schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "price": {"type": "number"},
                "currency": {"type": "string"},
                "in_stock": {"type": "boolean"},
            },
            "required": ["title", "price", "currency", "in_stock"],
            "additionalProperties": False,
        },
        "strict": True,
    }},
    messages=[{"role": "user", "content": html[:200000]}],
)
product = json.loads(extract.choices[0].message.content)
```
This split typically cuts cost by half compared to one big agent loop. For more, see LLM extraction patterns.
Network interception for hidden data
When the underlying API is hidden but exists, intercept the network requests directly. Both Playwright and Stagehand expose a request/response listener that captures everything the browser fetches.
```typescript
const apiResponses: Record<string, unknown> = {};

page.on("response", async (response) => {
  const url = response.url();
  if (url.includes("/api/product/") && response.headers()["content-type"]?.includes("json")) {
    try {
      apiResponses[url] = await response.json();
    } catch {}
  }
});

await page.goto(productUrl);
await waitForContent(page);
// apiResponses now contains the underlying JSON payloads
```
This pattern often gets you cleaner data than DOM extraction because the API payload is the source of truth that the SPA renders. Once you find the API, you can hit it directly and skip the browser entirely.
SPA scraping with vision-only extraction
For SPAs where the DOM is obfuscated (CSS-in-JS with random class names, shadow DOM components, canvas-rendered text), vision extraction can succeed where DOM extraction fails.
```python
import json

from openai import AsyncOpenAI

client = AsyncOpenAI()

async def extract_from_screenshot(png_b64: str, schema: dict) -> dict:
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_schema", "json_schema": {"name": "x", "schema": schema, "strict": True}},
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Extract data from this screenshot per the schema"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{png_b64}"}},
        ]}],
    )
    return json.loads(resp.choices[0].message.content)
```
Vision extraction is roughly 3x more expensive per page than DOM extraction but works on a handful of sites where DOM extraction is essentially impossible. For more, see scraping with vision models 2026.
Production patterns
Three patterns separate hobby SPA scrapers from production ones.
First, cap per-page wall clock. Even agents can spin when a page is broken. Set a hard 60-second timeout and treat exceeded timeouts as failures.
Second, monitor cost per page. SPA scraping costs add up fast. Log tokens consumed per scrape and alert if a single page exceeds 30,000 tokens (likely a confused agent).
Third, keep a fallback. When the agent fails, fall through to a static Playwright scraper with cached selectors. Catches the cases where the agent is wrong and the deterministic code is right.
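A sketch tying the first and third patterns together, reusing scrape_spa_product from earlier; scrape_with_playwright is a hypothetical deterministic fallback scraper:

```python
import asyncio

async def scrape_with_fallback(url: str) -> dict:
    try:
        # Hard wall-clock cap: a spinning agent counts as a failure.
        product = await asyncio.wait_for(scrape_spa_product(url), timeout=60)
        return product.model_dump()
    except Exception:
        # Fall through to a static Playwright scraper with cached
        # selectors (hypothetical helper, not shown here).
        return await scrape_with_playwright(url)
```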
Cookie banners and modal interruptions
The single most common cause of stuck SPA scrapes is a cookie banner blocking the content. The agent sees the banner, the LLM does not realize it should dismiss it, and the agent gives up.
Hardcode a banner-handling preamble in your task:
"If a cookie banner, age verification modal, region selector, or login prompt
is visible, dismiss it (click 'Accept all', 'I'm 18+', 'Close', or similar).
Then proceed with the main task."
This single instruction raises success rates on European retailer SPAs by 15 to 25 percentage points.
For Stagehand, you can also use act to dismiss known modals before extract:
```typescript
await page.act("If a cookie banner is visible, click the 'Accept all' button");
const data = await page.extract({ instruction: "...", schema: ... });
```
Common SPA scraping pitfalls
A handful of failure modes worth memorizing.
The agent sees server-rendered placeholder content and extracts the placeholder. Mitigation: detect placeholders by content patterns (“loading”, skeleton boxes, default 0 values).
The page redirects to a login wall after a few page loads. Mitigation: rotate session cookies and IPs, or warm up the session with realistic browsing before scraping.
The site uses cursor-based pagination with opaque tokens. Mitigation: simulate the user click that triggers the next page rather than constructing the URL yourself.
The site delivers different markup based on user agent or viewport. Mitigation: use a realistic UA and a desktop viewport (1440×900), not the Playwright defaults.
The site lazy-loads images that block extraction because the LLM expects them. Mitigation: prefer text-only extractions, and if you need images, scroll first.
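For the user-agent and viewport mitigation above, a minimal Playwright (Python) context setup; the UA string is illustrative, so keep it current:

```python
from playwright.async_api import async_playwright

async def realistic_context():
    p = await async_playwright().start()
    browser = await p.chromium.launch(headless=True)
    return await browser.new_context(
        # Realistic desktop fingerprint instead of the Playwright defaults.
        user_agent=(
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0.0.0 Safari/537.36"
        ),
        viewport={"width": 1440, "height": 900},
    )
```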
Real benchmarks on common SPAs
100 product pages each, GPT-4o-mini extraction:
| Target | Approach | Success rate | Avg time per page |
|---|---|---|---|
| Lazada SG | browser-use + mobile proxy | 98% | 7.1 s |
| Shopee SG | Stagehand + mobile proxy | 96% | 6.4 s |
| Amazon US | Playwright + residential | 94% | 3.2 s |
| Best Buy | browser-use + residential | 91% | 8.5 s |
| Booking.com | Stagehand + residential | 88% | 12 s |
For more on Lazada specifically, see our Lazada Thailand scraping guide. For Shopee, see Shopee Indonesia scraping.
Hydration timing across SPA frameworks
Different frameworks hydrate at different speeds. Average time from domcontentloaded to “fully interactive” on a typical product page:
| Framework | Median hydration | p99 hydration |
|---|---|---|
| Next.js App Router (RSC) | 250 ms | 1.4 s |
| Next.js Pages Router | 600 ms | 2.8 s |
| Remix | 350 ms | 1.6 s |
| SvelteKit | 200 ms | 1.1 s |
| Nuxt 3 | 480 ms | 2.4 s |
| React + Vite SPA | 1.2 s | 4.5 s |
| Vue + Vite SPA | 1.0 s | 4.2 s |
| Astro with islands | 200 ms (mostly SSR) | 1.0 s |
For most production scrapers, a 2-second wait is sufficient. For React + Vite SPAs, bump to 4 seconds. The cost in latency is small compared to the cost in failed extractions.
Frequently asked questions
Can I scrape an SPA without a real browser?
Sometimes. If the SPA uses Next.js with getServerSideProps or React Server Components, the initial HTML may already contain the data you need. Check by curl-ing the URL. Otherwise, a real browser is required.
What about hydration mismatches?
Hydration mismatches happen when client-rendered HTML differs from server HTML. Wait until after hydration completes (typically 500-1500ms) before extracting.
How do I handle authentication on SPAs?
Most SPAs use cookie-based session auth. Log in once, save the storage state, replay on each scrape. Both Playwright and Stagehand expose storageState config for this.
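A minimal sketch with Playwright's Python storage_state API:

```python
from playwright.async_api import async_playwright

async def save_login_state() -> None:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto("https://example.com/login")  # hypothetical login URL
        # ... complete the login, then persist cookies and localStorage:
        await context.storage_state(path="auth.json")

async def authed_context(browser):
    # Replay the saved session on every subsequent scrape.
    return await browser.new_context(storage_state="auth.json")
```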
Can I run an SPA scraper on AWS Lambda?
Cold starts kill latency. Lambda with the Chromium layer works for occasional jobs. For high throughput, run on Fargate or a dedicated VPS. Browserbase is the pay-per-minute alternative.
Why is my agent extracting “0” for prices?
Almost always a hydration timing issue. The agent extracted while the React app still showed the skeleton placeholder. Add a longer wait or a stability check before extraction.
How do I parallelize SPA scraping when each page takes 8 seconds?
Run multiple browser contexts in parallel within one Chromium process (cheaper than multiple browsers). Cap concurrency at roughly 1 per CPU core to avoid thrashing the browser’s rendering pipeline.
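A sketch of the multi-context pattern in Playwright (Python), with a semaphore capping concurrency:

```python
import asyncio

from playwright.async_api import async_playwright

async def scrape_many(urls: list[str], concurrency: int = 4) -> list[str]:
    sem = asyncio.Semaphore(concurrency)  # roughly 1 per CPU core

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)

        async def one(url: str) -> str:
            async with sem:
                # Contexts share one Chromium process: cheaper than N browsers.
                context = await browser.new_context()
                page = await context.new_page()
                await page.goto(url, wait_until="networkidle")
                html = await page.content()
                await context.close()
                return html

        return await asyncio.gather(*(one(u) for u in urls))
```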
How do I handle SPAs that detect headless browsers?
Use headless=False with a virtual display (Xvfb) on Linux, or use Browserbase’s stealth mode. The single biggest tell is the navigator.webdriver flag, which Chromium sets to true whenever the browser is driven by automation and which most stealth plugins patch.
Can I use the Beautiful Soup HTML output from a Playwright render?
Yes. After page.content(), you have the rendered HTML as a string and can pass it to BeautifulSoup or lxml for traditional parsing. The combination of headless render plus traditional parser is a real production pattern.
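A minimal sketch of that handoff, with a placeholder selector:

```python
from bs4 import BeautifulSoup

async def parse_rendered(page) -> list[str]:
    html = await page.content()  # rendered DOM, post-hydration
    soup = BeautifulSoup(html, "html.parser")
    # ".price" is a placeholder for whatever you need to pull.
    return [el.get_text(strip=True) for el in soup.select(".price")]
```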
What about WebSockets and Server-Sent Events?
Real-time data over WebSocket or SSE is harder to scrape because the data flows continuously. Use Playwright’s page.on('websocket') event listener to capture frames as they arrive, or intercept the underlying API connection.
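A sketch of the frame listener with Playwright's Python API:

```python
from playwright.async_api import Page

def capture_ws_frames(page: Page, frames: list) -> None:
    # Collect every frame from every WebSocket the page opens.
    def on_web_socket(ws):
        ws.on("framereceived", lambda payload: frames.append(payload))
    page.on("websocket", on_web_socket)
```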
How do I scrape an SPA behind a paywall I have access to?
Save the browser storage state (cookies, localStorage) after manual login, then load it on each scrape via storageState. Refresh the state when it expires.
Are there SPAs that just cannot be scraped reliably?
Yes. Sites with strong client-side encryption (some financial dashboards), sites that gate data behind interactive verification (real-time KYC), and sites with anti-replay tokens that bind to a specific browser fingerprint. For these, manual data export or partner APIs are the only realistic paths.
For broader patterns on the agentic browser stack, see our AI modern scraping category.