GraphQL has quietly become one of the harder targets in web scraping. A REST endpoint's URL tells you what you're getting; a GraphQL endpoint tells you nothing up front. Scraping GraphQL APIs in 2026 means understanding the schema, constructing valid queries, and navigating rate limits that are often tighter and more opaque than anything you'll hit on a REST service.
Start with Introspection (While You Still Can)
Introspection is GraphQL’s built-in schema documentation system. Send a standard introspection query and the server hands you every type, field, argument, and relationship in the API. It’s the fastest way to map an unfamiliar endpoint.
```graphql
query IntrospectionQuery {
  __schema {
    queryType { name }
    types {
      name
      kind
      fields {
        name
        type { name kind ofType { name kind } }
      }
    }
  }
}
```

Post this to the endpoint as JSON (`{"query": "..."}`) with a `Content-Type: application/json` header. Tools like graphql-inspector and Insomnia can parse the response and render a full schema tree.
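A minimal sketch of posting it with Python's standard library — the endpoint is a placeholder and the query is trimmed to just type names:

```python
import json
import urllib.request

# Trimmed-down introspection query: enough to enumerate type names
INTROSPECTION_QUERY = "query { __schema { types { name kind } } }"

def extract_type_names(payload):
    """Return user-defined type names from an introspection response,
    skipping GraphQL's built-in __-prefixed meta types."""
    types = payload["data"]["__schema"]["types"]
    return [t["name"] for t in types if not t["name"].startswith("__")]

def fetch_type_names(endpoint):
    """POST the introspection query as JSON and list the schema's types."""
    body = json.dumps({"query": INTROSPECTION_QUERY}).encode("utf-8")
    req = urllib.request.Request(
        endpoint, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=15) as resp:
        payload = json.load(resp)
    if "errors" in payload:
        # Introspection likely disabled -- fall back to probing
        raise RuntimeError(payload["errors"])
    return extract_type_names(payload)
```

In practice you would feed the full `IntrospectionQuery` from above through the same `fetch_type_names` plumbing and walk the richer response.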
The catch: production APIs increasingly disable introspection. Shopify, GitHub, and most large SaaS platforms turned it off years ago. If you hit {"errors":[{"message":"GraphQL introspection is not allowed"}]}, you have two fallback paths: look for a public developer portal that exports the schema, or reconstruct it through field-level probing (send queries with one field at a time and infer types from error messages). For the pillar-level breakdown of both approaches, see GraphQL API Scraping: Introspection & Query Guide.
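Field-level probing leans on the server's own error messages: graphql-js-based servers answer unknown fields with "Did you mean …?" suggestions you can harvest to rebuild field names. A small parser for that format — the message shape is an assumption; other server implementations word their errors differently:

```python
import re

def suggestions_from_error(message):
    """Extract 'Did you mean' field suggestions from a graphql-js-style
    unknown-field error, e.g.:
      Cannot query field "user" on type "Query". Did you mean "users" or "userById"?
    Returns [] when the server offers no hints."""
    m = re.search(r'Did you mean (.+)\?', message)
    if not m:
        return []
    # The suggestion list is a series of double-quoted names
    return re.findall(r'"([^"]+)"', m.group(1))
```

Probing one guessed field per request and feeding every suggestion back into the queue converges on a usable field map surprisingly quickly.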
Crafting Queries That Actually Work
Once you have a schema (or a partial map), the goal is writing the leanest query that returns what you need. Over-fetching wastes tokens and flags you faster. Under-fetching means multiple round trips.
A few patterns that hold up in 2026:
- Pagination via cursor: most modern APIs use `first` + `after` cursor pagination, not page numbers. Always check for `pageInfo { hasNextPage endCursor }` in the response.
- Aliases: if you need the same field with different arguments, alias them in one query rather than making multiple requests.
- Fragments: for repeated field sets across queries, use fragments to keep queries DRY and reduce payload size.
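All three patterns can live in one query; the field and argument names below are illustrative, not from any real schema:

```graphql
fragment ItemFields on Item {
  id
  name
  createdAt
}

query CompareShelves($cursor: String) {
  # Aliases: the same field with different arguments, one round trip
  newest: items(first: 20, orderBy: CREATED_AT, after: $cursor) {
    pageInfo { hasNextPage endCursor }
    nodes { ...ItemFields }
  }
  popular: items(first: 20, orderBy: POPULARITY) {
    nodes { ...ItemFields }
  }
}
```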
Numbered workflow for a new target:
1. Confirm the endpoint (`/graphql`, `/api/graphql`, `/v1/graphql` are common).
2. Fetch or reconstruct the schema.
3. Identify the root query type and the collections you need.
4. Write a paginated query with the minimum field set.
5. Test with a small `first: 5` limit before scaling.
6. Add retry logic before going to production volume.
Variable handling matters more than it looks. Never string-interpolate user-controlled or loop-generated values into query strings. Use GraphQL variables (`$cursor: String`) both for safety and because some WAFs flag inline dynamic values as anomalous.
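A minimal sketch of the safe pattern (the helper name is mine):

```python
import json

QUERY = "query Items($cursor: String) { items(after: $cursor) { nodes { id } } }"

def build_payload(query, **variables):
    """Serialize a GraphQL request with values carried in the variables map,
    never spliced into the query text itself."""
    return json.dumps({"query": query, "variables": variables})

# Safe: quotes or braces in the cursor value cannot alter the query
# structure, and the WAF sees the same static query string every request.
body = build_payload(QUERY, cursor='abc"123')
```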
Rate Limits and How GraphQL Makes Them Harder
REST rate limits are usually per-endpoint per-minute. GraphQL rate limits in 2026 are mostly complexity-based or cost-based, which means a single expensive query can exhaust your quota faster than 100 simple ones.
| Platform | Rate limit model | Reset window | Notes |
|---|---|---|---|
| GitHub GraphQL | Point-based (5000/hr) | 1 hour | Complexity visible in response extensions |
| Shopify | Bucket (1000 pts) + restore | Per request | X-Shopify-Shop-Api-Call-Limit header |
| Contentful | Request count | 1 minute | No complexity scoring |
| Hasura Cloud | Configurable depth/complexity | Varies | Depth limit default 10 |
| Hygraph | Request count + field count | 1 hour | Undocumented complexity weights |
GitHub returns remaining complexity in `extensions.rateLimit` on every response. Shopify restores points at 50/second. If you're not reading those headers and extensions, you're flying blind and will hit 429s unpredictably.
Practical mitigation: implement an adaptive delay that backs off when remaining quota drops below 20%, rather than a fixed sleep between requests. For comparison, WebSocket-based scraping has different but equally tricky timing constraints — the patterns in Scraping WebSocket-Based Apps: Patterns for Real-Time Data (2026) apply when your target uses GraphQL subscriptions over WS.
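One way to sketch that adaptive backoff — the 20% threshold and ramp are illustrative, and `remaining_quota` assumes the `extensions.rateLimit` response shape described above:

```python
def adaptive_delay(remaining, limit, base_delay=0.5, max_delay=30.0):
    """Return a sleep interval that grows as quota runs out: fixed base
    pacing while more than 20% of the quota remains, then a linear ramp
    toward max_delay as the remaining fraction approaches zero."""
    fraction = (remaining / limit) if limit else 0.0
    if fraction > 0.2:
        return base_delay
    if fraction <= 0.0:
        return max_delay
    return min(max_delay, base_delay + ((0.2 - fraction) / 0.2) * max_delay)

def remaining_quota(response_json):
    """Pull (remaining, limit) out of extensions.rateLimit; returns
    (None, None) when the server sends no cost accounting."""
    rl = response_json.get("extensions", {}).get("rateLimit") or {}
    return rl.get("remaining"), rl.get("limit")
```

Call `remaining_quota` on every response and feed the result into `adaptive_delay` before the next request.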
Anti-Bot Bypass Considerations
GraphQL endpoints sit behind the same WAF and bot detection stacks as REST. In practice the fingerprinting is often lighter because fewer tools target GraphQL, but that’s changing.
The main signals to manage:
- Headers: always send `Accept`, `Origin`, `Referer`, and a realistic `User-Agent`. GraphQL clients from browsers include all of these; missing headers are a strong bot signal.
- Content-Type: must be `application/json`. Some servers also accept `multipart/form-data` for file uploads, but that's not relevant for scraping.
- Persisted queries: Apollo and Relay clients often use persisted queries (a hash instead of the full query string). If the server enforces this, you'll get `PersistedQueryNotFound`; register your query first via `extensions: {"persistedQuery": {"version": 1, "sha256Hash": "..."}}`.
- Query depth limits: most hardened APIs reject queries deeper than 10-12 levels. Keep your queries flat.
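Sketching that registration flow under Apollo's automatic-persisted-queries convention (the helper name is mine):

```python
import hashlib

def persisted_query_payload(query, include_query=False):
    """Build an Apollo-style persisted-query body: the sha256 of the query
    text rides in extensions; the full text is attached only when the
    server has answered PersistedQueryNotFound and the query needs
    registering."""
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
    payload = {
        "extensions": {"persistedQuery": {"version": 1, "sha256Hash": digest}}
    }
    if include_query:
        payload["query"] = query
    return payload
```

Send the hash-only body first; on `PersistedQueryNotFound`, retry once with `include_query=True` and the server should accept the hash alone from then on.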
If the endpoint requires authentication, GraphQL tokens are usually JWTs passed as `Authorization: Bearer <token>`. Session-based auth follows the same cookie flow as any other scraping target. For APIs that expose their data structure via JSON-LD on the surrounding pages, Scraping JSON-LD Structured Data: Schema.org Extraction at Scale (2026) can help you cross-reference what the GraphQL API returns against what's embedded in the HTML.
Tooling Choices in 2026
Python’s gql library (v3+) handles schema introspection, transports (HTTP, WebSocket, aiohttp), and retry logic cleanly. For high-volume async workloads, pair it with httpx and asyncio:
```python
import asyncio

import httpx

ENDPOINT = "https://api.example.com/graphql"
HEADERS = {"Authorization": "Bearer TOKEN", "Content-Type": "application/json"}

QUERY = """
query GetItems($cursor: String) {
  items(first: 100, after: $cursor) {
    pageInfo { hasNextPage endCursor }
    nodes { id name createdAt }
  }
}
"""

async def fetch_page(client, cursor=None):
    """Fetch one page; the cursor travels as a GraphQL variable."""
    resp = await client.post(
        ENDPOINT,
        json={"query": QUERY, "variables": {"cursor": cursor}},
        headers=HEADERS,
    )
    resp.raise_for_status()
    return resp.json()["data"]["items"]

async def scrape_all():
    """Walk the cursor until pageInfo reports no further pages."""
    async with httpx.AsyncClient(timeout=30) as client:
        cursor, results = None, []
        while True:
            page = await fetch_page(client, cursor)
            results.extend(page["nodes"])
            if not page["pageInfo"]["hasNextPage"]:
                break
            cursor = page["pageInfo"]["endCursor"]
            await asyncio.sleep(0.5)  # fixed pacing; adapt to quota in production
        return results
```

For feed-like data that updates frequently, check whether the target also exposes an RSS or Atom feed before building a GraphQL poller. The overhead comparison in RSS and Atom Feed Aggregation at Scale 2026: Tooling and Rate Patterns is worth reading. Feeds are often rate-limit-free. Similarly, if the target pushes updates via SSE rather than requiring you to poll, Scraping Server-Sent Events (SSE) Streams: Live Data Patterns (2026) covers the lighter-weight alternative.
Bottom line
GraphQL scraping is not dramatically harder than REST scraping, but it punishes sloppy tooling more. Read the schema before writing queries, track complexity-based quotas in response extensions rather than assuming per-minute limits, and keep queries shallow and paginated. DRT covers this protocol category in depth because GraphQL is now the default API layer for most modern SaaS products — you will hit it whether you plan to or not.
Related guides on dataresearchtools.com
- Scraping WebSocket-Based Apps: Patterns for Real-Time Data (2026)
- Scraping Server-Sent Events (SSE) Streams: Live Data Patterns (2026)
- Scraping JSON-LD Structured Data: Schema.org Extraction at Scale (2026)
- RSS and Atom Feed Aggregation at Scale 2026: Tooling and Rate Patterns
- Pillar: GraphQL API Scraping: Introspection & Query Guide