AJAX-heavy sites are the most common scraping target in 2026, and also the most misunderstood. Engineers reach for Playwright or Puppeteer by default, burn through browser credits, and still get rate-limited because they never looked at what the page was actually doing under the hood. Reverse engineering AJAX calls is faster, cheaper, and more reliable than driving a browser when you know where to look.
## Why Browser Automation Is Often the Wrong Starting Point
Headless browsers solve the rendering problem but ignore the network layer. A site loading 40 API calls on mount is not asking you to execute JavaScript — it is asking you to replicate those calls. When you script a browser instead, you pay the cost of DOM parsing, JavaScript execution, and bot detection surface area for every single request.
The smarter path is intercepting the real HTTP traffic, identifying which endpoints carry the data you need, and calling them directly. If you are scraping a SPA framework like Next.js, this pairs well with techniques from Reverse Engineering Single Page Apps Powered by Next.js for Scraping (2026), where the JSON hydration payload often makes direct API calls redundant entirely.
## Intercepting AJAX Traffic: The Right Tools
Start with the browser’s own DevTools. Open the Network tab, filter by XHR/Fetch, reload the page, and sort by size descending. The largest responses are almost always the data you want.
For programmatic interception at scale, four tools dominate:
| Tool | Best For | Overhead | Anti-Bot Risk |
|---|---|---|---|
| mitmproxy | Scripted HTTPS interception | Low | Low (no browser) |
| Playwright network events | Capturing calls during render | Medium | Medium |
| Charles Proxy | Manual session analysis | None (GUI) | Low |
| Frida | Mobile/native app APIs | High | Low |
mitmproxy is the workhorse. Run it as a transparent proxy, record a session, then export the HAR file. You will have every request, response header, cookie, and timing in one place.
```bash
mitmproxy --mode transparent --save-stream-file session.mitm
# Later: export to HAR for analysis (native in mitmproxy 10.1+)
mitmdump -nr session.mitm --set hardump=session.har
```

Once you have the HAR, parse it for JSON responses with `content-type: application/json` and payloads over 1 KB. Those are your targets.
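A minimal triage script for that step, assuming the `session.har` export above (field names follow the HAR 1.2 spec):

```python
import json

# Surface the endpoints worth replaying: application/json, over 1 KB.
with open("session.har") as f:
    har = json.load(f)

for entry in har["log"]["entries"]:
    content = entry["response"]["content"]
    if "application/json" in content.get("mimeType", "") and content.get("size", 0) > 1024:
        print(f"{content['size']:>9}  {entry['request']['method']}  {entry['request']['url']}")
```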
## Reading the Request: Headers, Tokens, and Signatures
Raw endpoint discovery is only step one. Most AJAX APIs in 2026 carry at least one of the following:
- A short-lived bearer token generated at page load
- A CSRF token tied to the session cookie
- A request signature (HMAC or hash of timestamp + endpoint + body)
- A device fingerprint header (`x-device-id`, `x-client-hash`, etc.)
The token is usually in `localStorage` or injected into the page as a `window.__STATE__` variable. Search the page source for the endpoint path (e.g. `/api/listings`) and trace backwards to where the token gets set. In many React apps it lives in the initial Redux or Zustand store dump.
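As a hypothetical sketch (the variable name and the `auth.token` path are assumptions; trace the real location in DevTools for your target):

```python
import json
import re

import httpx

# Pull a bearer token out of the page's initial state dump.
# Both the window.__STATE__ name and the auth.token path are assumed;
# this also only works if the dump is strict JSON.
html = httpx.get("https://target.com/listings").text
match = re.search(r"window\.__STATE__\s*=\s*(\{.*?\});", html, re.DOTALL)
if match:
    state = json.loads(match.group(1))
    token = state["auth"]["token"]
```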
Signatures are harder. If you see a header like `x-sig: a3f9c2...` that changes every request, the site is computing it client-side. Open Sources in DevTools, search for the header name, and find the signing function. It is almost always a few lines of crypto using timestamp + nonce + secret. The secret is usually hardcoded or derived from a public key in the JS bundle. Deobfuscate with de4js or prettier, then replicate in Python with `hmac`.
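A hedged reimplementation of a typical signer in Python (the secret, field order, and digest are assumptions; read the real scheme out of the deobfuscated bundle):

```python
import hashlib
import hmac
import time
import uuid

# Assumed scheme: HMAC-SHA256 over timestamp + nonce + endpoint + body.
SECRET = b"value-recovered-from-the-js-bundle"  # placeholder, not a real secret

def sign(endpoint: str, body: str) -> dict:
    ts = str(int(time.time() * 1000))
    nonce = uuid.uuid4().hex
    message = f"{ts}{nonce}{endpoint}{body}".encode()
    return {
        "x-sig": hmac.new(SECRET, message, hashlib.sha256).hexdigest(),
        "x-ts": ts,
        "x-nonce": nonce,
    }
```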
For Remix-based targets or apps using React Server Components, the data loading pattern is structurally different — check Reverse Engineering Remix and React Server Components for Scraping (2026) before spending time hunting for a REST API that may not exist.
## Replicating Sessions at Scale
Once you have a working curl replay of the target endpoint, the next challenge is session management. AJAX APIs typically require:
- A valid session cookie (`session_id`, `cf_clearance`, etc.)
- A matching `User-Agent` and `sec-ch-ua` header set
- A consistent IP across the session (the cookie is pinned to IP in many cases)
Build your scraper to treat sessions as first-class objects. Initialize a session by hitting the login or landing page, capture all `Set-Cookie` headers, and replay them on every subsequent AJAX call. Use httpx with `cookies=session.cookies` rather than passing raw header strings, which is brittle.
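In practice `httpx.Client` does the capture-and-replay for you; a minimal sketch, using the example endpoint from earlier:

```python
import httpx

# httpx.Client persists Set-Cookie headers from the landing page
# onto every subsequent request automatically.
client = httpx.Client(headers={"User-Agent": "Mozilla/5.0 ..."}, timeout=20)
client.get("https://target.com/")  # session cookies land in client.cookies
resp = client.get("https://target.com/api/listings", params={"page": 1})
print(resp.json())
```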
For anti-bot layers like Cloudflare, the `cf_clearance` cookie expires in 30 minutes and is tied to the TLS fingerprint of the client that solved the challenge. That means your AJAX replayer must use the same HTTP client (and TLS stack) as the browser that fetched the cookie. This is where Scraping JavaScript-Heavy Sites with Stagehand and Browserbase (2026) becomes relevant: if cookie acquisition is the bottleneck, a managed browser gets you the cookie; the AJAX replay handles everything after.
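One way to keep the TLS fingerprint consistent on the replay side (this library is a suggestion, not something the text above prescribes) is a TLS-impersonating client such as curl_cffi:

```python
from curl_cffi import requests

# Replay the browser-acquired cf_clearance cookie behind a Chrome-like
# TLS fingerprint. The impersonate target should match the browser that
# actually solved the challenge; the cookie value here is a placeholder.
resp = requests.get(
    "https://target.com/api/listings",
    cookies={"cf_clearance": "value-from-the-managed-browser"},
    impersonate="chrome",
)
print(resp.status_code)
```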
## Pagination, Rate Limits, and Incremental Crawling
Most AJAX APIs paginate with either cursor tokens or offset integers. Cursor-based pagination is better for you as a scraper because you can resume mid-crawl without recalculating offsets after deletions.
Common patterns and how to handle them (a cursor-loop sketch follows the list):
- `next_cursor` in response body: store the cursor after each page, request with `?cursor=` on the next call
- `X-Next-Page-Token` header: same as above, just pulled from headers instead
- `offset` + `limit`: calculate the total from the `total_count` field in the first response, then fan out with concurrent requests
- GraphQL `pageInfo.endCursor`: treat exactly like cursor-based REST
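A minimal resume-safe cursor loop, assuming a JSON body with `items` and `next_cursor` fields (names vary per API; check the HAR):

```python
from typing import Iterator

import httpx

def crawl(client: httpx.Client, url: str, cursor: str | None = None) -> Iterator[dict]:
    # Accepting a starting cursor is what makes the crawl resumable mid-run.
    while True:
        params = {"cursor": cursor} if cursor else {}
        page = client.get(url, params=params).json()
        yield from page["items"]
        cursor = page.get("next_cursor")  # persist this after each page
        if not cursor:
            break
```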
Rate limits show up as `429 Too Many Requests` with a `Retry-After` header, or more aggressively as soft blocks returning empty arrays with HTTP 200. Always log response sizes per page: a sudden drop from 50 items to 0 with a 200 status is a silent block, not an empty result set.
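A hedged fetch wrapper covering both cases (the `items` field matches the pagination sketch above):

```python
import time

import httpx

def fetch_page(client: httpx.Client, url: str, params: dict) -> list:
    resp = client.get(url, params=params)
    if resp.status_code == 429:
        # Honour the server's backoff hint; 30 s is an assumed fallback.
        time.sleep(int(resp.headers.get("Retry-After", "30")))
        resp = client.get(url, params=params)
    items = resp.json().get("items", [])
    if resp.status_code == 200 and not items:
        print(f"possible soft block: HTTP 200 with 0 items at {url} {params}")
    return items
```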
For incremental crawling, filter on an `updated_at` or `created_at` parameter if the API exposes one. Most listing APIs support `?since=`. Run a full crawl once, then schedule delta fetches every few hours. This cuts request volume by 80-90% on mature datasets and keeps you well under per-IP rate limits.
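The delta fetch itself is one parameter, assuming the API honours `?since=` as described (persist the watermark between runs):

```python
from datetime import datetime, timezone

import httpx

last_run = "2026-01-01T00:00:00Z"  # placeholder watermark loaded from storage
resp = httpx.get("https://target.com/api/listings", params={"since": last_run})
items = resp.json()["items"]
# On success, advance the watermark for the next scheduled run.
last_run = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```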
If the site uses GraphQL, the introspection query (`{ __schema { types { name } } }`) is often left enabled on staging environments and occasionally on production. It hands you the full schema for free.
```python
import httpx, json

token = "..."  # bearer token captured from the page, as above

resp = httpx.post(
    "https://target.com/graphql",
    json={"query": "{ __schema { queryType { fields { name description } } } }"},
    headers={"Authorization": f"Bearer {token}"},
)
print(json.dumps(resp.json(), indent=2))
```

## Bottom Line
Reverse engineering AJAX calls is faster and cheaper than browser automation for data-heavy endpoints: intercept first with mitmproxy, replicate the signature and session logic, then scale with plain HTTP. Reserve headless browsers for cookie acquisition only. DRT covers the full stack from network-layer interception to framework-specific quirks, so whatever framework your target runs, the reverse-engineering series has you covered.