GraphQL API Scraping: Introspection & Query Guide

GraphQL APIs are increasingly replacing REST endpoints across the web. Unlike REST, where you scrape multiple endpoints, GraphQL gives you a single endpoint where you request exactly the data you need. This makes GraphQL APIs simultaneously easier and harder to scrape — easier because you get structured data in one call, harder because you need to understand the schema.

This guide covers discovering GraphQL endpoints, introspection, building queries, and handling pagination.

Identifying GraphQL Endpoints

GraphQL APIs typically use:

  • A single endpoint: /graphql, /api/graphql, or /gql
  • POST requests with JSON body containing query and variables
  • Content-Type: application/json

A quick way to confirm an endpoint is to POST a trivial query ({ __typename }) to common paths and check for a GraphQL-shaped response:

import httpx

async def detect_graphql(base_url, proxy=None):
    """Detect GraphQL endpoints on a website."""
    common_paths = [
        '/graphql', '/api/graphql', '/gql', '/query',
        '/api/gql', '/graphql/v1', '/v1/graphql',
    ]
    
    async with httpx.AsyncClient(proxy=proxy, timeout=10) as client:
        for path in common_paths:
            url = f"{base_url.rstrip('/')}{path}"
            try:
                # Try introspection query
                response = await client.post(url, json={
                    'query': '{ __typename }'
                })
                if response.status_code == 200:
                    data = response.json()
                    if 'data' in data or 'errors' in data:
                        print(f"GraphQL found: {url}")
                        return url
            except Exception:
                continue
    return None

Schema Introspection

GraphQL’s introspection system lets you discover the entire API schema:

INTROSPECTION_QUERY = """
query IntrospectionQuery {
  __schema {
    types {
      name
      kind
      fields {
        name
        type {
          name
          kind
          ofType { name kind }
        }
        args {
          name
          type { name kind }
        }
      }
    }
    queryType { name }
    mutationType { name }
  }
}
"""

async def introspect_schema(graphql_url, proxy=None):
    """Discover the full GraphQL schema."""
    async with httpx.AsyncClient(proxy=proxy) as client:
        response = await client.post(graphql_url, json={
            'query': INTROSPECTION_QUERY
        })
        schema = response.json()
        if 'data' not in schema or schema['data'] is None:
            # Introspection is often disabled on production APIs
            print(f"Introspection failed: {schema.get('errors')}")
            return None
        
        # Extract useful types (skip internal __* types)
        types = schema['data']['__schema']['types']
        user_types = [t for t in types if not t['name'].startswith('__')]
        
        for t in user_types:
            if t['fields']:
                print(f"\nType: {t['name']} ({t['kind']})")
                for field in t['fields']:
                    type_info = field['type']
                    # Wrapped types (NON_NULL/LIST) have name=None; unwrap via ofType
                    field_type = type_info['name'] or (type_info['ofType'] or {}).get('name') or '?'
                    print(f"  {field['name']}: {field_type}")
        
        return schema

Building Scraping Queries

Once you know the schema, build targeted queries:

class GraphQLScraper:
    """Scrape data from GraphQL APIs."""
    
    def __init__(self, endpoint, proxy=None):
        self.endpoint = endpoint
        self.client = httpx.AsyncClient(proxy=proxy, timeout=30)
    
    async def query(self, query_string, variables=None):
        """Execute a GraphQL query."""
        payload = {'query': query_string}
        if variables:
            payload['variables'] = variables
        
        response = await self.client.post(
            self.endpoint,
            json=payload,
            headers={
                'Content-Type': 'application/json',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            }
        )
        
        data = response.json()
        if 'errors' in data:
            print(f"GraphQL errors: {data['errors']}")
        return data.get('data', {})
    
    async def scrape_paginated(self, query_template, page_size=50, max_pages=100):
        """Handle cursor-based pagination."""
        all_items = []
        cursor = None
        
        for page in range(max_pages):
            variables = {
                'first': page_size,
                'after': cursor,
            }
            
            data = await self.query(query_template, variables)
            
            # Extract edges and page info (assumes Relay-style pagination
            # with a single top-level field in the response)
            if not data:
                break
            connection = next(iter(data.values()))
            edges = connection.get('edges', [])
            page_info = connection.get('pageInfo', {})
            
            items = [edge['node'] for edge in edges]
            all_items.extend(items)
            
            print(f"Page {page + 1}: {len(items)} items (total: {len(all_items)})")
            
            if not page_info.get('hasNextPage', False):
                break
            cursor = page_info.get('endCursor')
        
        return all_items

# Usage
async def main():
    scraper = GraphQLScraper(
        endpoint='https://api.example.com/graphql',
        proxy='http://user:pass@proxy.example.com:8080'
    )
    
    # Cursor-based pagination query
    query = """
    query GetProducts($first: Int!, $after: String) {
        products(first: $first, after: $after) {
            edges {
                node {
                    id
                    name
                    price
                    category
                    description
                }
                cursor
            }
            pageInfo {
                hasNextPage
                endCursor
            }
        }
    }
    """
    
    products = await scraper.scrape_paginated(query, page_size=100)
    print(f"Total products: {len(products)}")
    await scraper.client.aclose()

if __name__ == '__main__':
    import asyncio
    asyncio.run(main())

Handling Anti-Scraping Measures

Query Complexity Limits

Many GraphQL APIs limit query complexity:

# Instead of one deep query, break into multiple simpler queries
# BAD: Complex nested query that may be rejected
bad_query = """
{
    products(first: 100) {
        edges {
            node {
                id name price
                reviews(first: 50) {
                    edges { node { text rating user { name } } }
                }
                relatedProducts(first: 10) {
                    edges { node { id name } }
                }
            }
        }
    }
}
"""

# GOOD: Separate simpler queries
product_query = """
query GetProducts($first: Int!, $after: String) {
    products(first: $first, after: $after) {
        edges { node { id name price } }
        pageInfo { hasNextPage endCursor }
    }
}
"""

review_query = """
query GetReviews($productId: ID!, $first: Int!) {
    product(id: $productId) {
        reviews(first: $first) {
            edges { node { text rating } }
        }
    }
}
"""

Persisted Queries

Some APIs only accept pre-registered query hashes:

async def try_persisted_query(client, endpoint, query_hash, variables):
    """Use persisted query extension."""
    response = await client.post(endpoint, json={
        'extensions': {
            'persistedQuery': {
                'version': 1,
                'sha256Hash': query_hash,
            }
        },
        'variables': variables,
    })
    return response.json()
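If the target implements Apollo-style automatic persisted queries (APQ), you don't need to know the hash in advance: compute the sha256 of the query yourself, send the hash-only payload, and fall back to sending the full query text when the server answers with the PersistedQueryNotFound error. A sketch under that assumption — apq_payload and query_with_apq are illustrative helper names, and the client can be any httpx.AsyncClient-style object:

```python
import hashlib


def apq_payload(query, variables=None):
    """Build the hash-only payload for an automatic persisted query (APQ)."""
    digest = hashlib.sha256(query.encode('utf-8')).hexdigest()
    payload = {
        'extensions': {
            'persistedQuery': {'version': 1, 'sha256Hash': digest},
        }
    }
    if variables:
        payload['variables'] = variables
    return payload


async def query_with_apq(client, endpoint, query, variables=None):
    """Send the hash first; on PersistedQueryNotFound, retry with full text."""
    payload = apq_payload(query, variables)
    response = await client.post(endpoint, json=payload)
    data = response.json()

    # Apollo-style servers signal an unknown hash with this error message
    messages = [e.get('message') for e in data.get('errors', [])]
    if 'PersistedQueryNotFound' in messages:
        payload['query'] = query  # registers hash + query together
        response = await client.post(endpoint, json=payload)
        data = response.json()
    return data
```

Note that servers with APQ disabled (or with a fixed allowlist of query hashes) will reject the second request too; in that case your only option is replaying hashes captured from the site's own frontend.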

Real-World Examples

Scraping Shopify Stores (GraphQL Storefront API)

async def scrape_shopify_store(store_url, proxy=None):
    endpoint = f"{store_url}/api/2024-01/graphql.json"
    
    async with httpx.AsyncClient(proxy=proxy) as client:
        query = """
        {
            products(first: 50) {
                edges {
                    node {
                        title
                        handle
                        priceRange {
                            minVariantPrice { amount currencyCode }
                        }
                        images(first: 1) {
                            edges { node { url } }
                        }
                    }
                }
            }
        }
        """
        
        response = await client.post(endpoint, json={'query': query}, headers={
            'X-Shopify-Storefront-Access-Token': 'public-token-here',
            'Content-Type': 'application/json',
        })
        
        return response.json()

FAQ

Can I scrape any GraphQL API?

Not all GraphQL APIs have introspection enabled. Production APIs often disable it. You can still discover the schema through browser DevTools by observing queries the frontend makes, then replicate those queries in your scraper.
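Replicating an observed query is mostly copy-and-paste: grab the JSON payload (operationName, query, variables) from the request body in the Network tab and resend it. A minimal sketch — the operation name, fields, and payload below are hypothetical, and the client is any httpx.AsyncClient-style object carrying whatever cookies or headers the browser sent:

```python
# Payload copied from the browser's Network tab (hypothetical example)
captured_payload = {
    'operationName': 'GetProducts',
    'variables': {'first': 20, 'after': None},
    'query': (
        'query GetProducts($first: Int!, $after: String) {'
        '  products(first: $first, after: $after) {'
        '    edges { node { id name } }'
        '    pageInfo { hasNextPage endCursor }'
        '  }'
        '}'
    ),
}


async def replay_captured(client, endpoint, payload):
    """Replay a captured GraphQL request.

    If the API needs a session, copy the browser's Cookie / Authorization
    headers onto the client before calling this.
    """
    response = await client.post(endpoint, json=payload)
    return response.json()
```

Once a replayed query works, you can edit its variables (page size, cursor) to paginate without ever seeing the schema.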

Is GraphQL scraping faster than REST scraping?

Generally yes. GraphQL lets you request exactly the fields you need in a single call, reducing both the number of requests and the data transferred. A single GraphQL query can replace 5-10 REST API calls.

How do I handle rate limiting on GraphQL APIs?

GraphQL rate limiting is often based on query complexity rather than request count. Simplify your queries, reduce the number of requested fields, and lower pagination page sizes. Add delays between requests as with any scraping.
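When you do get throttled, most servers answer with HTTP 429 and sometimes a Retry-After header. One reasonable pattern — this is a generic sketch, not any particular API's contract — is exponential backoff with jitter, honouring Retry-After when present:

```python
import asyncio
import random


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff with jitter: ~1s, ~2s, ~4s, ... capped at `cap`."""
    return min(cap, base * (2 ** attempt)) * (0.5 + random.random() / 2)


async def query_with_retries(client, endpoint, payload, max_attempts=5):
    """Retry on HTTP 429, honouring Retry-After when the server sends it."""
    for attempt in range(max_attempts):
        response = await client.post(endpoint, json=payload)
        if response.status_code != 429:
            return response.json()
        retry_after = response.headers.get('Retry-After')
        delay = float(retry_after) if retry_after else backoff_delay(attempt)
        await asyncio.sleep(delay)
    raise RuntimeError(f"Rate limited after {max_attempts} attempts")
```

For complexity-based limits the same loop works, but the lasting fix is to shrink the query itself (fewer fields, smaller page sizes) rather than retry harder.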

What tools can I use to explore GraphQL schemas?

GraphiQL (built into many GraphQL APIs), Apollo Studio, Insomnia, and Postman all support GraphQL schema exploration. For automated discovery, use introspection queries via httpx or requests.

How do I handle authentication with GraphQL APIs?

GraphQL APIs use the same authentication methods as REST — Bearer tokens, API keys, session cookies. Authenticate first (usually via a REST endpoint or mutation), then include the token in your GraphQL request headers.
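In practice that usually means attaching an Authorization header to every query. A small sketch, assuming Bearer-token auth — the helper names are illustrative, and the client is any httpx.AsyncClient-style object:

```python
def auth_headers(token, scheme='Bearer'):
    """Headers for an authenticated GraphQL request."""
    return {
        'Authorization': f'{scheme} {token}',
        'Content-Type': 'application/json',
    }


async def authed_query(client, endpoint, token, query, variables=None):
    """Run a query with auth headers attached."""
    payload = {'query': query}
    if variables:
        payload['variables'] = variables
    response = await client.post(endpoint, json=payload,
                                 headers=auth_headers(token))
    return response.json()
```

Setting the headers once on the client instance (rather than per request) works equally well and keeps cookies and tokens in one place.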

