Scraping JavaScript-heavy SPAs with AI agents in 2026
AI-agent scraping pipelines for SPAs have become the dominant pattern for modern web targets because the entire ecommerce, SaaS, and content stack has converged on React, Vue, and Next.js. Server-rendered HTML is the exception; client-rendered, hydrated, lazy-loaded SPAs are the rule. Traditional fetch-and-parse scrapers fail on these targets. AI agents that drive a real headless browser succeed.
This guide covers the patterns that work in 2026 for scraping SPAs with AI agents. We cover detection, rendering strategy, wait conditions, structured extraction, proxy integration, and the edge cases that bite if you miss them. Code in Python and TypeScript throughout.
Why SPAs break traditional scrapers
A SPA serves a near-empty HTML shell that loads JavaScript bundles, calls APIs, and renders the actual content client-side. If you fetch the URL with requests or httpx, you get the shell. The data you want is generated milliseconds to seconds later by JavaScript that never ran in your scraper.
The fix is to render the page in a real browser. That used to mean Puppeteer or Selenium with brittle wait conditions and per-site selector maintenance. In 2026, the better fix is to drive the browser with an AI agent that watches the page render in real time and decides when to extract.
Detecting SPA targets
Before reaching for an AI agent, check whether the target actually needs one. Many sites that look like SPAs in DevTools are actually server-rendered or hybrid.
Quick detection script:
```python
import httpx
from bs4 import BeautifulSoup

def is_spa(url: str) -> bool:
    r = httpx.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(r.text, "html.parser")
    text_chars = len(soup.get_text(strip=True))
    has_react_root = bool(soup.find(id="root") or soup.find(id="__next"))
    has_app_div = bool(soup.find("div", attrs={"id": "app"}))
    has_low_text = text_chars < 1000
    return (has_react_root or has_app_div) and has_low_text

print(is_spa("https://www.lazada.sg/"))         # True
print(is_spa("https://news.ycombinator.com/"))  # False
```
If is_spa returns False, use a normal HTTP scraper. If True, you need a real browser.
The agentic browser pattern
The pattern that wins in 2026 looks like this:
- Launch a headless Chromium with stealth defaults
- Navigate to the target URL
- Wait for visual stability (network idle plus a small grace period)
- Take a screenshot and dump the rendered HTML
- Pass both to an LLM with a strict JSON Schema
- Validate the result, retry if needed
In code, with browser-use as the agent driver:
```python
import asyncio

from browser_use import Agent, Browser, BrowserConfig, Controller
from langchain_openai import ChatOpenAI
from pydantic import BaseModel

class Product(BaseModel):
    title: str
    price: float
    currency: str
    in_stock: bool

controller = Controller(output_model=Product)

async def scrape_spa_product(url: str) -> Product:
    browser = Browser(config=BrowserConfig(
        headless=True,
        extra_chromium_args=["--disable-blink-features=AutomationControlled"],
    ))
    agent = Agent(
        task=(
            f"Visit {url}, wait for the product details to fully load, "
            f"and return the title, price (number), currency code, and stock status."
        ),
        llm=ChatOpenAI(model="gpt-4o-mini"),
        browser=browser,
        controller=controller,
        max_failures=3,
    )
    history = await agent.run()
    return Product.model_validate_json(history.final_result())

product = asyncio.run(scrape_spa_product("https://www.lazada.sg/products/example-12345.html"))
print(product)
```
The agent watches the page render and decides when to extract. No selector maintenance. No wait condition tuning per site.
For more on browser-use specifically, see our browser-use scraping guide.
SPA framework cheat sheet
Different frameworks fingerprint differently. Quick recognition guide:
| Framework | Tells |
|---|---|
| Next.js (App Router) | __next div, __next-build-id meta, _next/static/... script URLs |
| Next.js (Pages Router) | __NEXT_DATA__ script tag, _next/static/chunks/pages/... |
| Remix | __remixContext script, data-route on root |
| SvelteKit | __sveltekit_* global, data-sveltekit-* attrs |
| Nuxt | __NUXT__ script, _nuxt/... script URLs |
| Astro | astro-island custom elements, mixed SSR with client islands |
| React + Vite | <div id="root"> + /src/main.tsx script ref |
| Vue + Vite | <div id="app"> + Vue devtools meta |
| Angular | <app-root> element, ng-version attr |
Knowing the framework tells you whether the data is already in the initial HTML payload. Next.js with App Router and Remix often serve fully rendered HTML; Astro typically does too. SPA frameworks with strict client rendering (vanilla React + Vite, vanilla Vue) almost always need a real browser.
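One practical consequence: if the fingerprint says Next.js Pages Router, the data often sits in the `__NEXT_DATA__` script tag and you can skip the browser entirely. A minimal sketch, assuming the usual `props.pageProps` layout (the exact path varies per site):

```python
import json

import httpx
from bs4 import BeautifulSoup

def next_data(url: str) -> dict:
    # Pages Router embeds the page's server props as JSON in a script tag.
    r = httpx.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(r.text, "html.parser")
    tag = soup.find("script", id="__NEXT_DATA__")
    if tag is None:
        raise ValueError("no __NEXT_DATA__; this page needs a real browser")
    return json.loads(tag.string)["props"]["pageProps"]
```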
Wait conditions that actually work
The trickiest part of SPA scraping is waiting long enough for content to hydrate without waiting forever. Three patterns:
Network idle plus grace period. Wait for networkidle (no requests for 500ms) then sleep an additional 1-2 seconds. Catches most React Suspense boundaries.
DOM-stability detection. Watch the DOM mutation count, wait until it stabilizes for 1-2 seconds.
Sentinel selector. Wait for a known element that only appears after content loads (price, product image, review count). Most reliable when you know the site.
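A sentinel wait in plain Playwright (Python), with a hypothetical `.price` selector standing in for whatever element you know loads last:

```python
from playwright.async_api import async_playwright

async def fetch_after_sentinel(url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="domcontentloaded")
        # ".price" is a placeholder; pick an element that only
        # appears once the content has hydrated.
        await page.wait_for_selector(".price", timeout=10_000)
        html = await page.content()
        await browser.close()
        return html
```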
For agentic scraping, the agent handles this implicitly. You just give it enough time budget per page.
Stagehand example with explicit wait:
```typescript
import { z } from "zod";
import { Stagehand } from "@browserbasehq/stagehand";

const stagehand = new Stagehand({ env: "LOCAL", modelName: "gpt-4o-mini" });
await stagehand.init();
const page = stagehand.page;

await page.goto("https://www.lazada.sg/products/example.html", { waitUntil: "networkidle" });
await page.waitForTimeout(1500); // grace period for React hydration

const data = await page.extract({
  instruction: "Extract product title, price, currency, in-stock status",
  schema: z.object({
    title: z.string(),
    price: z.number(),
    currency: z.string(),
    inStock: z.boolean(),
  }),
});
```
When networkidle lies
networkidle is a useful default but it lies on three common patterns. Long-poll connections never go idle, so a chat widget keeps the network busy forever. Analytics beacons that fire every 5 seconds prevent idle from triggering. Web sockets that retry every few seconds also prevent the idle state.
The mitigation is to combine networkidle with a hard timeout cap and a DOM-stability check. If networkidle has not fired within 10 seconds but the visible content has stopped changing, extract anyway.
```typescript
import type { Page } from "playwright";

async function waitForContent(page: Page, timeoutMs = 15000) {
  const start = Date.now();
  let lastDomSize = 0;
  let stableCount = 0;
  while (Date.now() - start < timeoutMs) {
    const size = await page.evaluate(() => document.body.innerText.length);
    if (size === lastDomSize && size > 500) {
      stableCount++;
      if (stableCount >= 3) return; // 1.5s of stability
    } else {
      stableCount = 0;
    }
    lastDomSize = size;
    await page.waitForTimeout(500);
  }
}
```
This pattern beats raw networkidle on roughly 30 percent of SPAs we tested.
Handling infinite scroll and lazy loading
Many SPAs render content lazily as the user scrolls. Two patterns to handle this.
Scroll loop. Scroll to bottom in a loop until the page height stops growing.
```typescript
import type { Page } from "playwright";

async function scrollUntilStable(page: Page, maxScrolls = 20) {
  let lastHeight = 0;
  for (let i = 0; i < maxScrolls; i++) {
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await page.waitForTimeout(1000);
    const height = await page.evaluate(() => document.body.scrollHeight);
    if (height === lastHeight) return;
    lastHeight = height;
  }
}
```
Intercept the underlying API. Many SPAs lazy-load by calling a paginated JSON endpoint. Open DevTools, find the call, and hit it directly. Skip the browser entirely. This is the highest-throughput pattern when it applies.
```python
import httpx

async def fetch_lazada_listings(category: str, page: int):
    url = f"https://www.lazada.sg/api/listing?category={category}&page={page}"
    headers = {"x-csrf-token": "...", "User-Agent": "Mozilla/5.0"}
    async with httpx.AsyncClient() as c:
        r = await c.get(url, headers=headers)
        return r.json()
```
When you can find and hit the underlying API, do that. AI agents are the fallback when the API is hidden, signed, or rate-limited too aggressively for direct access.
Comparison of SPA scraping approaches
| Approach | Cost per page | Reliability on changing layouts | Maintenance | Best fit |
|---|---|---|---|---|
| Direct API interception | $0.001 | High | Medium | When you can find the API |
| Playwright with custom selectors | $0.005 | Low | High | Stable known-shape sites |
| Stagehand extract | $0.04 | High | Low | Long-tail SPA targets |
| browser-use full agent | $0.04 | High | Low | Multi-step SPA flows |
| Operator/Computer Use | $0.20 | Highest | Lowest | Hardest targets |
For 80 percent of SPA scraping work in 2026, Stagehand’s extract primitive plus a 1.5-second grace period is the sweet spot. It is cheap enough to scale, reliable enough to ignore most layout changes, and easy enough to write that a junior engineer can ship a new scraper in an hour.
Hydration race conditions
A common bug: you take a screenshot at the wrong moment and the agent extracts placeholder data (“Loading…”, skeleton boxes, default values).
Two defenses:
First, validate the output. Reject any extraction where price equals zero or title contains “loading”. Retry with a longer wait.
```python
def validate_product(p: dict) -> bool:
    if not p.get("title") or "loading" in p["title"].lower():
        return False
    if p.get("price", 0) <= 0:
        return False
    return True
```
Second, take two screenshots 500ms apart and compare. If they differ significantly, the page is still rendering. Wait, retry, repeat.
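A sketch of the double-screenshot check with Playwright's Python API and Pillow; it treats any pixel difference as "still rendering" (a per-site tolerance threshold is a reasonable refinement):

```python
import io

from PIL import Image, ImageChops

async def render_settled(page, interval_ms: int = 500) -> bool:
    # Two screenshots taken interval_ms apart; a nonzero diff
    # bounding box means the page is still changing.
    first = Image.open(io.BytesIO(await page.screenshot())).convert("RGB")
    await page.wait_for_timeout(interval_ms)
    second = Image.open(io.BytesIO(await page.screenshot())).convert("RGB")
    return ImageChops.difference(first, second).getbbox() is None
```

Loop on this with a retry cap rather than extracting after a single pass.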
Adding proxy rotation
SPAs are typically served by sites with strong bot defenses (Cloudflare, DataDome, Akamai). Mobile or residential proxies are mandatory.
In browser-use:
```python
from browser_use import Browser, BrowserConfig

browser = Browser(config=BrowserConfig(
    headless=True,
    proxy={"server": "http://proxy.example.com:8000", "username": "u", "password": "p"},
))
```
In Stagehand:
```typescript
const stagehand = new Stagehand({
  env: "LOCAL",
  localBrowserLaunchOptions: {
    proxy: { server: "http://proxy.example.com:8000", username: "u", password: "p" },
  },
});
```
For ASEAN ecommerce SPAs (Lazada, Shopee, Tokopedia), a Singapore mobile proxy carries real Singtel and StarHub IPs that avoid the data-center blocks these sites apply.
Structured extraction at the end
The right pattern is to use the agent only to reach the target page and dump the rendered HTML, then run structured extraction with a cheaper model.
```python
# step 1: navigate with the agent, then dump the rendered HTML
agent = Agent(
    task=f"Reach the product page at {url} and extract the full HTML",
    llm=ChatOpenAI(model="gpt-4o-mini"),
)
result = await agent.run()
html = await agent.browser.context.pages[0].content()

# step 2: cheap structured extraction
import json
from openai import AsyncOpenAI

client = AsyncOpenAI()
extract = await client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_schema", "json_schema": {
        "name": "product",
        "schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "price": {"type": "number"},
                "currency": {"type": "string"},
                "in_stock": {"type": "boolean"},
            },
            "required": ["title", "price", "currency", "in_stock"],
            "additionalProperties": False,
        },
        "strict": True,
    }},
    messages=[{"role": "user", "content": html[:200000]}],
)
product = json.loads(extract.choices[0].message.content)
```
This split typically cuts cost by half compared to one big agent loop. For more, see LLM extraction patterns.
Network interception for hidden data
When the underlying API is hidden but exists, intercept the network requests directly. Both Playwright and Stagehand expose a request/response listener that captures everything the browser fetches.
```typescript
const apiResponses: Record<string, unknown> = {};

page.on("response", async (response) => {
  const url = response.url();
  if (url.includes("/api/product/") && response.headers()["content-type"]?.includes("json")) {
    try {
      apiResponses[url] = await response.json();
    } catch {}
  }
});

await page.goto(productUrl);
await waitForContent(page);
// apiResponses now contains the underlying JSON payloads
```
This pattern often gets you cleaner data than DOM extraction because the API payload is the source of truth that the SPA renders. Once you find the API, you can hit it directly and skip the browser entirely.
SPA scraping with vision-only extraction
For SPAs where the DOM is obfuscated (CSS-in-JS with random class names, shadow DOM components, canvas-rendered text), vision extraction can succeed where DOM extraction fails.
```python
import json

from openai import AsyncOpenAI

client = AsyncOpenAI()

async def extract_from_screenshot(png_b64: str, schema: dict) -> dict:
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_schema", "json_schema": {"name": "x", "schema": schema, "strict": True}},
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Extract data from this screenshot per the schema"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{png_b64}"}},
        ]}],
    )
    return json.loads(resp.choices[0].message.content)
```
Vision extraction is roughly 3x more expensive per page than DOM extraction but works on a handful of sites where DOM extraction is essentially impossible. For more, see scraping with vision models 2026.
Production patterns
Three patterns separate hobby SPA scrapers from production ones.
First, cap per-page wall clock. Even agents can spin when a page is broken. Set a hard 60-second timeout and treat exceeded timeouts as failures.
Second, monitor cost per page. SPA scraping costs add up fast. Log tokens consumed per scrape and alert if a single page exceeds 30,000 tokens (likely a confused agent).
Third, keep a fallback. When the agent fails, fall through to a static Playwright scraper with cached selectors. Catches the cases where the agent is wrong and the deterministic code is right.
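A sketch tying the first and third patterns together, reusing scrape_spa_product from earlier; scrape_with_playwright is a hypothetical deterministic fallback scraper:

```python
import asyncio

async def scrape_with_fallback(url: str) -> dict:
    try:
        # Hard wall-clock cap: a spinning agent counts as a failure.
        product = await asyncio.wait_for(scrape_spa_product(url), timeout=60)
        return product.model_dump()
    except Exception:
        # Fall through to a static Playwright scraper with cached
        # selectors (hypothetical helper, not shown here).
        return await scrape_with_playwright(url)
```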
Cookie banners and modal interruptions
The single most common cause of stuck SPA scrapes is a cookie banner blocking the content. The agent sees the banner, the LLM does not realize it should dismiss it, and the agent gives up.
Hardcode a banner-handling preamble in your task:
"If a cookie banner, age verification modal, region selector, or login prompt
is visible, dismiss it (click 'Accept all', 'I'm 18+', 'Close', or similar).
Then proceed with the main task."
This single instruction raises success rates on European retailer SPAs by 15 to 25 percentage points.
For Stagehand, you can also use act to dismiss known modals before extract:
```typescript
await page.act("If a cookie banner is visible, click the 'Accept all' button");
const data = await page.extract({ instruction: "...", schema: ... });
```
Common SPA scraping pitfalls
A handful of failure modes worth memorizing.
The agent sees server-rendered placeholder content and extracts the placeholder. Mitigation: detect placeholders by content patterns (“loading”, skeleton boxes, default 0 values).
The page redirects to a login wall after a few page loads. Mitigation: rotate session cookies and IPs, or warm up the session with realistic browsing before scraping.
The site uses cursor-based pagination with opaque tokens. Mitigation: simulate the user click that triggers the next page rather than constructing the URL yourself.
The site delivers different markup based on user agent or viewport. Mitigation: use a realistic UA and a desktop viewport (1440×900), not the Playwright defaults.
The site lazy-loads images that block extraction because the LLM expects them. Mitigation: prefer text-only extractions, and if you need images, scroll first.
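For the user-agent and viewport mitigation above, a minimal Playwright (Python) context setup; the UA string is illustrative, so keep it current:

```python
from playwright.async_api import async_playwright

async def realistic_context():
    p = await async_playwright().start()
    browser = await p.chromium.launch(headless=True)
    return await browser.new_context(
        # Realistic desktop fingerprint instead of the Playwright defaults.
        user_agent=(
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0.0.0 Safari/537.36"
        ),
        viewport={"width": 1440, "height": 900},
    )
```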
Real benchmarks on common SPAs
100 product pages each, GPT-4o-mini extraction:
| Target | Approach | Success rate | Avg time per page |
|---|---|---|---|
| Lazada SG | browser-use + mobile proxy | 98% | 7.1 s |
| Shopee SG | Stagehand + mobile proxy | 96% | 6.4 s |
| Amazon US | Playwright + residential | 94% | 3.2 s |
| Best Buy | browser-use + residential | 91% | 8.5 s |
| Booking.com | Stagehand + residential | 88% | 12 s |
For more on Lazada specifically, see our Lazada Thailand scraping guide. For Shopee, see Shopee Indonesia scraping.
Hydration timing across SPA frameworks
Different frameworks hydrate at different speeds. Average time from domcontentloaded to “fully interactive” on a typical product page:
| Framework | Median hydration | p99 hydration |
|---|---|---|
| Next.js App Router (RSC) | 250 ms | 1.4 s |
| Next.js Pages Router | 600 ms | 2.8 s |
| Remix | 350 ms | 1.6 s |
| SvelteKit | 200 ms | 1.1 s |
| Nuxt 3 | 480 ms | 2.4 s |
| React + Vite SPA | 1.2 s | 4.5 s |
| Vue + Vite SPA | 1.0 s | 4.2 s |
| Astro with islands | 200 ms (mostly SSR) | 1.0 s |
For most production scrapers, a 2-second wait is sufficient. For React + Vite SPAs, bump to 4 seconds. The cost in latency is small compared to the cost in failed extractions.
Frequently asked questions
Can I scrape an SPA without a real browser?
Sometimes. If the SPA uses Next.js with getServerSideProps or React Server Components, the initial HTML may already contain the data you need. Check by curl-ing the URL. Otherwise, a real browser is required.
What about hydration mismatches?
Hydration mismatches happen when client-rendered HTML differs from server HTML. Wait until after hydration completes (typically 500-1500ms) before extracting.
How do I handle authentication on SPAs?
Most SPAs use cookie-based session auth. Log in once, save the storage state, replay on each scrape. Both Playwright and Stagehand expose storageState config for this.
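A minimal sketch with Playwright's Python storage_state API:

```python
from playwright.async_api import async_playwright

async def save_login_state() -> None:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto("https://example.com/login")  # hypothetical login URL
        # ... complete the login, then persist cookies and localStorage:
        await context.storage_state(path="auth.json")

async def authed_context(browser):
    # Replay the saved session on every subsequent scrape.
    return await browser.new_context(storage_state="auth.json")
```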
Can I run an SPA scraper on AWS Lambda?
Cold starts kill latency. Lambda with the Chromium layer works for occasional jobs. For high throughput, run on Fargate or a dedicated VPS. Browserbase is the pay-per-minute alternative.
Why is my agent extracting “0” for prices?
Almost always a hydration timing issue. The agent extracted while the React app still showed the skeleton placeholder. Add a longer wait or a stability check before extraction.
How do I parallelize SPA scraping when each page takes 8 seconds?
Run multiple browser contexts in parallel within one Chromium process (cheaper than multiple browsers). Cap concurrency at roughly 1 per CPU core to avoid thrashing the browser’s rendering pipeline.
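A sketch of the multi-context pattern in Playwright (Python), with a semaphore capping concurrency:

```python
import asyncio

from playwright.async_api import async_playwright

async def scrape_many(urls: list[str], concurrency: int = 4) -> list[str]:
    sem = asyncio.Semaphore(concurrency)  # roughly 1 per CPU core

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)

        async def one(url: str) -> str:
            async with sem:
                # Contexts share one Chromium process: cheaper than N browsers.
                context = await browser.new_context()
                page = await context.new_page()
                await page.goto(url, wait_until="networkidle")
                html = await page.content()
                await context.close()
                return html

        return await asyncio.gather(*(one(u) for u in urls))
```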
How do I handle SPAs that detect headless browsers?
Use headless=False with a virtual display (Xvfb) on Linux, or use Browserbase’s stealth mode. The single biggest tell is the navigator.webdriver flag, which Chromium sets to true whenever the browser is driven by automation and which most stealth plugins patch.
Can I use the Beautiful Soup HTML output from a Playwright render?
Yes. After page.content(), you have the rendered HTML as a string and can pass it to BeautifulSoup or lxml for traditional parsing. The combination of headless render plus traditional parser is a real production pattern.
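A minimal sketch of that handoff, with a placeholder selector:

```python
from bs4 import BeautifulSoup

async def parse_rendered(page) -> list[str]:
    html = await page.content()  # rendered DOM, post-hydration
    soup = BeautifulSoup(html, "html.parser")
    # ".price" is a placeholder for whatever you need to pull.
    return [el.get_text(strip=True) for el in soup.select(".price")]
```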
What about WebSockets and Server-Sent Events?
Real-time data over WebSocket or SSE is harder to scrape because the data flows continuously. Use Playwright’s page.on('websocket') event listener to capture frames as they arrive, or intercept the underlying API connection.
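A sketch of the frame listener with Playwright's Python API:

```python
from playwright.async_api import Page

def capture_ws_frames(page: Page, frames: list) -> None:
    # Collect every frame from every WebSocket the page opens.
    def on_web_socket(ws):
        ws.on("framereceived", lambda payload: frames.append(payload))
    page.on("websocket", on_web_socket)
```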
How do I scrape an SPA behind a paywall I have access to?
Save the browser storage state (cookies, localStorage) after manual login, then load it on each scrape via storageState. Refresh the state when it expires.
Are there SPAs that just cannot be scraped reliably?
Yes. Sites with strong client-side encryption (some financial dashboards), sites that gate data behind interactive verification (real-time KYC), and sites with anti-replay tokens that bind to a specific browser fingerprint. For these, manual data export or partner APIs are the only realistic paths.
For broader patterns on the agentic browser stack, see our AI modern scraping category.