Reverse Engineering AJAX-Heavy Sites for Scraping in 2026

AJAX-heavy sites are the most common scraping target in 2026, and also the most misunderstood. Engineers reach for Playwright or Puppeteer by default, burn through browser credits, and still get rate-limited because they never looked at what the page was actually doing under the hood. Reverse engineering AJAX calls is faster, cheaper, and more reliable than driving a browser when you know where to look.

Why Browser Automation Is Often the Wrong Starting Point

Headless browsers solve the rendering problem but ignore the network layer. A site loading 40 API calls on mount is not asking you to execute JavaScript — it is asking you to replicate those calls. When you script a browser instead, you pay the cost of DOM parsing, JavaScript execution, and bot detection surface area for every single request.

The smarter path is intercepting the real HTTP traffic, identifying which endpoints carry the data you need, and calling them directly. If you are scraping a SPA framework like Next.js, this pairs well with techniques from Reverse Engineering Single Page Apps Powered by Next.js for Scraping (2026), where the JSON hydration payload often makes separate API calls entirely redundant.

Intercepting AJAX Traffic: The Right Tools

Start with the browser’s own DevTools. Open the Network tab, filter by XHR/Fetch, reload the page, and sort by size descending. The largest responses are almost always the data you want.

For programmatic interception at scale, three tools dominate:

| Tool | Best For | Overhead | Anti-Bot Risk |
|---|---|---|---|
| mitmproxy | Scripted HTTPS interception | Low | Low (no browser) |
| Playwright network events | Capturing calls during render | Medium | Medium |
| Charles Proxy | Manual session analysis | None (GUI) | Low |
| Frida | Mobile/native app APIs | High | Low |

mitmproxy is the workhorse. Run it as a transparent proxy, record a session, then export the HAR file. You will have every request, response header, cookie, and timing in one place.

mitmproxy --mode transparent --save-stream-file session.mitm
# Later: read the recorded flows back and dump them as HAR (built in since mitmproxy 10.1)
mitmdump -nr session.mitm --set hardump=session.har

Once you have the HAR, parse it for JSON responses with content-type: application/json and payloads over 1 KB. Those are your targets.
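That filter is a few lines of Python. A minimal sketch, assuming the standard HAR 1.2 layout (log.entries[].response.content):

```python
import json

def find_json_targets(har_path, min_bytes=1024):
    """Scan a HAR file for large JSON responses -- the likely data endpoints."""
    with open(har_path) as f:
        har = json.load(f)
    targets = []
    for entry in har["log"]["entries"]:
        content = entry["response"]["content"]
        mime = content.get("mimeType", "")
        size = content.get("size", 0)
        if "application/json" in mime and size > min_bytes:
            targets.append((entry["request"]["url"], size))
    # Largest responses first -- almost always the data you want
    return sorted(targets, key=lambda t: -t[1])
```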

Reading the Request: Headers, Tokens, and Signatures

Raw endpoint discovery is only step one. Most AJAX APIs in 2026 carry at least one of the following:

  1. A short-lived bearer token generated at page load
  2. A CSRF token tied to the session cookie
  3. A request signature (HMAC or hash of timestamp + endpoint + body)
  4. A device fingerprint header (x-device-id, x-client-hash, etc.)

The token is usually in localStorage or injected into the page as a window.__STATE__ variable. Search the page source for the endpoint path (e.g. /api/listings) and trace backwards to where the token gets set. In many React apps it lives in the initial Redux or Zustand store dump.
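Once you know where the blob lives, pulling the token out is a small parsing job. A sketch under two assumptions: the store is inlined as window.__STATE__ = {...}; and the token sits at a hypothetical auth.token key path (trace your target's bundle to find the real one):

```python
import json
import re

# Non-greedy match up to the first "};" works for flat store dumps;
# deeply nested blobs with "};" inside strings need a real JS parser.
STATE_RE = re.compile(r"window\.__STATE__\s*=\s*(\{.*?\})\s*;", re.DOTALL)

def token_from_html(html):
    """Extract a bearer token from an inlined window.__STATE__ store dump.

    The auth.token key path is hypothetical -- adjust it for your target.
    """
    m = STATE_RE.search(html)
    if not m:
        return None
    state = json.loads(m.group(1))
    return state.get("auth", {}).get("token")
```

Fetch the landing page with your HTTP client, run it through this, and you have the token without ever rendering the page.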

Signatures are harder. If you see a header like x-sig: a3f9c2... that changes every request, the site is computing it client-side. Open Sources in DevTools, search for the header name, and find the signing function. It is almost always a few lines of crypto using timestamp + nonce + secret. The secret is usually hardcoded or derived from a public key in the JS bundle. Deobfuscate with de4js or prettier, then replicate in Python with hmac.
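Replicating such a scheme in Python is then mechanical. A hedged sketch assuming HMAC-SHA256 over timestamp + nonce + path + body with a secret lifted from the bundle (the secret value, concatenation order, and header names here are all assumptions; match what the deobfuscated signing function actually does):

```python
import hashlib
import hmac
import time
import uuid

# Secret recovered from the JS bundle -- hypothetical placeholder value
SECRET = b"k3y-from-bundle"

def sign_request(path, body=""):
    """Replicate a client-side x-sig header.

    Assumes the site signs timestamp + nonce + path + body with
    HMAC-SHA256; get the concatenation order wrong and the server
    will reject the request.
    """
    ts = str(int(time.time() * 1000))
    nonce = uuid.uuid4().hex
    msg = (ts + nonce + path + body).encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return {"x-sig": sig, "x-ts": ts, "x-nonce": nonce}
```

Send the returned headers alongside the request; the server recomputes the same HMAC from x-ts and x-nonce and compares.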

For Remix-based targets or apps using React Server Components, the data loading pattern is structurally different — check Reverse Engineering Remix and React Server Components for Scraping (2026) before spending time hunting for a REST API that may not exist.

Replicating Sessions at Scale

Once you have a working curl replay of the target endpoint, the next challenge is session management. AJAX APIs typically require:

  • A valid session cookie (session_id, cf_clearance, etc.)
  • A matching User-Agent and sec-ch-ua header set
  • A consistent IP across the session (cookie pinned to IP in many cases)

Build your scraper to treat sessions as first-class objects. Initialize a session by hitting the login or landing page, capture all Set-Cookie headers, and replay them on every subsequent AJAX call. Use httpx with cookies=session.cookies rather than passing raw header strings, which is brittle.

For anti-bot layers like Cloudflare, the cf_clearance cookie expires in 30 minutes and is tied to the TLS fingerprint of the client that solved the challenge. That means your AJAX replayer must use the same HTTP client (and TLS stack) as the browser that fetched the cookie. This is where Scraping JavaScript-Heavy Sites with Stagehand and Browserbase (2026) becomes relevant — if cookie acquisition is the bottleneck, a managed browser gets you the cookie; the AJAX replay handles everything after.

Pagination, Rate Limits, and Incremental Crawling

Most AJAX APIs paginate with either cursor tokens or offset integers. Cursor-based pagination is better for you as a scraper because you can resume mid-crawl without recalculating offsets after deletions.

Common patterns and how to handle them:

  • next_cursor in response body: store the cursor after each page, request with ?cursor= on the next call
  • X-Next-Page-Token header: same as above, just pulled from headers instead
  • offset + limit: calculate total from total_count field in first response, then fan out with concurrent requests
  • GraphQL pageInfo.endCursor: treat exactly like cursor-based REST

Rate limits show up as 429 Too Many Requests with a Retry-After header, or more aggressively as soft blocks returning empty arrays with HTTP 200. Always log response sizes per page — a sudden drop from 50 items to 0 with a 200 status is a silent block, not an empty result set.

For incremental crawling, filter on an updated_at or created_at parameter if the API exposes one. Most listing APIs support ?since=. Run a full crawl once, then schedule delta fetches every few hours. This cuts request volume by 80-90% on mature datasets and keeps you well under per-IP rate limits.

If the site uses GraphQL, the introspection query ({ __schema { types { name } } }) is often left enabled on staging environments and occasionally on production. It hands you the full schema for free.

import httpx, json

resp = httpx.post(
    "https://target.com/graphql",
    json={"query": "{ __schema { queryType { fields { name description } } } }"},
    headers={"Authorization": f"Bearer {token}"}
)
print(json.dumps(resp.json(), indent=2))

Bottom Line

Reverse engineering AJAX calls is faster and cheaper than browser automation for data-heavy endpoints: intercept first with mitmproxy, replicate the signature and session logic, then scale with plain HTTP. Reserve headless browsers for cookie acquisition only. The DRT reverse-engineering series covers the full stack from network-layer interception to framework-specific quirks, so whatever framework your target runs on, the rest of the series has you covered.
