GraphQL API Scraping: Introspection & Query Guide
GraphQL APIs are increasingly replacing REST endpoints across the web. Unlike REST, where you scrape multiple endpoints, GraphQL gives you a single endpoint where you request exactly the data you need. This makes GraphQL APIs simultaneously easier and harder to scrape — easier because you get structured data in one call, harder because you need to understand the schema.
This guide covers discovering GraphQL endpoints, introspection, building queries, and handling pagination.
Identifying GraphQL Endpoints
GraphQL APIs typically use:
- A single endpoint: `/graphql`, `/api/graphql`, or `/gql`
- POST requests with a JSON body containing `query` and `variables`
- A `Content-Type` of `application/json`
```python
import httpx

async def detect_graphql(base_url, proxy=None):
    """Detect GraphQL endpoints on a website."""
    common_paths = [
        '/graphql', '/api/graphql', '/gql', '/query',
        '/api/gql', '/graphql/v1', '/v1/graphql',
    ]
    async with httpx.AsyncClient(proxy=proxy, timeout=10) as client:
        for path in common_paths:
            url = f"{base_url.rstrip('/')}{path}"
            try:
                # Try a minimal introspection query
                response = await client.post(url, json={
                    'query': '{ __typename }'
                })
                if response.status_code == 200:
                    data = response.json()
                    if 'data' in data or 'errors' in data:
                        print(f"GraphQL found: {url}")
                        return url
            except Exception:
                continue
    return None
```
Schema Introspection
GraphQL’s introspection system lets you discover the entire API schema:
```python
INTROSPECTION_QUERY = """
query IntrospectionQuery {
  __schema {
    types {
      name
      kind
      fields {
        name
        type {
          name
          kind
          ofType { name kind }
        }
        args {
          name
          type { name kind }
        }
      }
    }
    queryType { name }
    mutationType { name }
  }
}
"""
```
```python
async def introspect_schema(graphql_url, proxy=None):
    """Discover the full GraphQL schema."""
    async with httpx.AsyncClient(proxy=proxy) as client:
        response = await client.post(graphql_url, json={
            'query': INTROSPECTION_QUERY
        })
        schema = response.json()
    # Extract useful types (skip internal __ types)
    types = schema['data']['__schema']['types']
    user_types = [t for t in types if not t['name'].startswith('__')]
    for t in user_types:
        if t['fields']:
            print(f"\nType: {t['name']} ({t['kind']})")
            for field in t['fields']:
                # Wrapped types (NON_NULL, LIST) have name=None; unwrap via ofType
                field_type = field['type']['name'] or (field['type']['ofType'] or {}).get('name')
                print(f"  {field['name']}: {field_type}")
    return schema
```
Building Scraping Queries
Once you know the schema, build targeted queries:
```python
class GraphQLScraper:
    """Scrape data from GraphQL APIs."""

    def __init__(self, endpoint, proxy=None):
        self.endpoint = endpoint
        self.client = httpx.AsyncClient(proxy=proxy, timeout=30)

    async def query(self, query_string, variables=None):
        """Execute a GraphQL query."""
        payload = {'query': query_string}
        if variables:
            payload['variables'] = variables
        response = await self.client.post(
            self.endpoint,
            json=payload,
            headers={
                'Content-Type': 'application/json',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            }
        )
        data = response.json()
        if 'errors' in data:
            print(f"GraphQL errors: {data['errors']}")
        return data.get('data', {})

    async def scrape_paginated(self, query_template, page_size=50, max_pages=100):
        """Handle cursor-based pagination."""
        all_items = []
        cursor = None
        for page in range(max_pages):
            variables = {
                'first': page_size,
                'after': cursor,
            }
            data = await self.query(query_template, variables)
            if not data:
                break  # errors or an empty response: stop paginating
            # Extract edges and page info (Relay-style pagination)
            connection = list(data.values())[0]  # first (and only) root field
            edges = connection.get('edges', [])
            page_info = connection.get('pageInfo', {})
            items = [edge['node'] for edge in edges]
            all_items.extend(items)
            print(f"Page {page + 1}: {len(items)} items (total: {len(all_items)})")
            if not page_info.get('hasNextPage', False):
                break
            cursor = page_info.get('endCursor')
        return all_items
```
```python
# Usage
async def main():
    scraper = GraphQLScraper(
        endpoint='https://api.example.com/graphql',
        proxy='http://user:pass@proxy.example.com:8080',
    )
    # Cursor-based pagination query
    query = """
    query GetProducts($first: Int!, $after: String) {
      products(first: $first, after: $after) {
        edges {
          node {
            id
            name
            price
            category
            description
          }
          cursor
        }
        pageInfo {
          hasNextPage
          endCursor
        }
      }
    }
    """
    products = await scraper.scrape_paginated(query, page_size=100)
    print(f"Total products: {len(products)}")

if __name__ == '__main__':
    import asyncio
    asyncio.run(main())
```
Handling Anti-Scraping Measures
Query Complexity Limits
Many GraphQL APIs limit query complexity:
```python
# Instead of one deep query, break the work into multiple simpler queries.

# BAD: complex nested query that may be rejected
bad_query = """
{
  products(first: 100) {
    edges {
      node {
        id name price
        reviews(first: 50) {
          edges { node { text rating user { name } } }
        }
        relatedProducts(first: 10) {
          edges { node { id name } }
        }
      }
    }
  }
}
"""

# GOOD: separate simpler queries
product_query = """
query GetProducts($first: Int!, $after: String) {
  products(first: $first, after: $after) {
    edges { node { id name price } }
    pageInfo { hasNextPage endCursor }
  }
}
"""

review_query = """
query GetReviews($productId: ID!, $first: Int!) {
  product(id: $productId) {
    reviews(first: $first) {
      edges { node { text rating } }
    }
  }
}
"""
```
Persisted Queries
Some APIs only accept pre-registered query hashes:
```python
async def try_persisted_query(client, endpoint, query_hash, variables):
    """Use the persisted query extension."""
    response = await client.post(endpoint, json={
        'extensions': {
            'persistedQuery': {
                'version': 1,
                'sha256Hash': query_hash,
            }
        },
        'variables': variables,
    })
    return response.json()
```
Real-World Examples
Scraping Shopify Stores (GraphQL Storefront API)
```python
async def scrape_shopify_store(store_url, proxy=None):
    endpoint = f"{store_url}/api/2024-01/graphql.json"
    async with httpx.AsyncClient(proxy=proxy) as client:
        query = """
        {
          products(first: 50) {
            edges {
              node {
                title
                handle
                priceRange {
                  minVariantPrice { amount currencyCode }
                }
                images(first: 1) {
                  edges { node { url } }
                }
              }
            }
          }
        }
        """
        response = await client.post(endpoint, json={'query': query}, headers={
            'X-Shopify-Storefront-Access-Token': 'public-token-here',
            'Content-Type': 'application/json',
        })
        return response.json()
```
Internal Links
- AJAX Request Interception — discover GraphQL endpoints via network interception
- mitmproxy Tutorial — capture and analyze GraphQL traffic
- Web Scraping Architecture — design patterns for API-based scrapers
- Bandwidth Optimization — GraphQL reduces bandwidth by requesting only needed fields
- Building a Web Scraping Dashboard — visualize scraped GraphQL data
FAQ
Can I scrape any GraphQL API?
Not all GraphQL APIs have introspection enabled. Production APIs often disable it. You can still discover the schema through browser DevTools by observing queries the frontend makes, then replicate those queries in your scraper.
Is GraphQL scraping faster than REST scraping?
Generally yes. GraphQL lets you request exactly the fields you need in a single call, reducing both the number of requests and the data transferred. A single GraphQL query can replace 5-10 REST API calls.
How do I handle rate limiting on GraphQL APIs?
GraphQL rate limiting is often based on query complexity rather than request count. Simplify your queries, reduce the number of requested fields, and lower pagination page sizes. Add delays between requests as with any scraping.
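As a sketch, a retry wrapper that backs off when a query is rejected; the 429 status code and the "complexity" error-message check are assumptions, so adjust both to the API you target:

```python
import asyncio

async def query_with_backoff(client, endpoint, payload, max_retries=5):
    """POST a GraphQL payload, backing off exponentially on rate limits."""
    delay = 1.0
    for attempt in range(max_retries):
        response = await client.post(endpoint, json=payload)
        if response.status_code == 429:  # assumed rate-limit status
            await asyncio.sleep(delay)
            delay *= 2  # exponential backoff: 1s, 2s, 4s, ...
            continue
        data = response.json()
        # Some APIs report complexity limits as errors with HTTP 200
        if any('complexity' in str(e).lower() for e in data.get('errors', [])):
            await asyncio.sleep(delay)
            delay *= 2
            continue
        return data
    raise RuntimeError(f"Gave up after {max_retries} attempts")
```

The wrapper takes the client as an argument, so it drops into the `GraphQLScraper.query` method above with minimal changes.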
What tools can I use to explore GraphQL schemas?
GraphiQL (built into many GraphQL APIs), Apollo Studio, Insomnia, and Postman all support GraphQL schema exploration. For automated discovery, use introspection queries via httpx or requests.
How do I handle authentication with GraphQL APIs?
GraphQL APIs use the same authentication methods as REST — Bearer tokens, API keys, session cookies. Authenticate first (usually via a REST endpoint or mutation), then include the token in your GraphQL request headers.
Related Reading
- AJAX Request Interception: Scraping API Calls Directly
- Azure Functions for Serverless Web Scraping: the Complete Guide
- Build an Anti-Detection Test Suite: Verify Browser Stealth
- Build a News Crawler in Python: Step-by-Step Tutorial
- How to Configure Proxies on iPhone and Android
- How to Use Proxies in Node.js (Axios, Fetch, Puppeteer)