ScrapeGraphAI: LLM-Powered Web Scraping

What if you could scrape any website by simply describing what data you want in plain English? That is exactly what ScrapeGraphAI delivers. This open-source Python library uses large language models and a graph-based pipeline architecture to turn natural language prompts into structured data extraction — no CSS selectors, no XPath, no brittle parsing code.

ScrapeGraphAI stands out from other AI web scrapers because of its unique architecture: instead of treating scraping as a single operation, it breaks the process into a directed graph of nodes, where each node handles a specific task (fetching, parsing, extracting, merging). This modular approach makes it both powerful and customizable.

What Is ScrapeGraphAI?

ScrapeGraphAI is an open-source Python library (MIT license) that creates web scraping pipelines powered by large language models. Created by Marco Vinciguerra and Lorenzo Padoan, it uses a graph-based architecture where the scraping process flows through connected nodes — each responsible for a step like fetching HTML, cleaning content, or extracting data.

Key Differentiators

| Feature | Description |
| --- | --- |
| Natural language prompts | Describe what you want in plain English |
| Graph pipeline architecture | Modular, customizable processing flows |
| Multiple LLM support | OpenAI, Anthropic, Ollama, Groq, Google, and more |
| Multiple graph types | SmartScraper, Search, Speech, and custom graphs |
| Schema support | Pydantic and JSON schemas for structured output |
| Local model support | Run entirely locally with Ollama |
| Open source | MIT license, fully free to use |

How It Differs from Other AI Scrapers

While tools like Firecrawl focus on URL-to-markdown conversion and Crawl4ai emphasizes async crawling with optional LLM extraction, ScrapeGraphAI puts the LLM at the center of the entire pipeline. The LLM doesn’t just extract data — it drives the scraping process itself.

Installation & Setup

Basic Installation

pip install scrapegraphai

Full Installation (with all LLM providers)

pip install "scrapegraphai[all]"  # quotes prevent shell bracket expansion (required in zsh)

Install Playwright for Browser Rendering

playwright install chromium

Environment Setup

Set your LLM API keys:

# For OpenAI
export OPENAI_API_KEY="sk-your-key-here"

# For Anthropic
export ANTHROPIC_API_KEY="sk-ant-your-key-here"

# For Google
export GOOGLE_API_KEY="your-google-key"

Core Concepts: Graph Pipelines

What Is a Graph Pipeline?

In ScrapeGraphAI, a “graph” is a directed acyclic graph (DAG) where each node performs a specific operation:

FetchNode → ParseNode → RAGNode → GenerateAnswerNode
  • FetchNode — Downloads the web page HTML
  • ParseNode — Cleans and processes the HTML content
  • RAGNode — Chunks content and creates embeddings for retrieval
  • GenerateAnswerNode — Uses the LLM to extract data based on your prompt
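
The flow above can be sketched as a plain function pipeline. This is a simplified illustration of the concept only — the real library implements these as node classes wired into a graph, and the function bodies here are stubs, not ScrapeGraphAI's actual logic:

```python
import re

# A toy linear "graph": each node is a function that transforms shared state,
# mirroring the FetchNode -> ParseNode -> RAGNode -> GenerateAnswerNode flow.

def fetch_node(state):
    # The real FetchNode downloads the page; here we stub the HTML.
    state["html"] = "<html><body><h1>Hello</h1><p>World</p></body></html>"
    return state

def parse_node(state):
    # Strip tags to plain words (the real ParseNode does proper cleaning).
    state["words"] = re.sub(r"<[^>]+>", " ", state["html"]).split()
    return state

def rag_node(state):
    # Select content relevant to the prompt (stubbed as a simple filter).
    state["chunks"] = [w for w in state["words"] if w.istitle()]
    return state

def generate_answer_node(state):
    # The real node calls the LLM; here we just assemble the result.
    state["answer"] = {"headline": " ".join(state["chunks"])}
    return state

PIPELINE = [fetch_node, parse_node, rag_node, generate_answer_node]

def run(state=None):
    state = state or {}
    for node in PIPELINE:
        state = node(state)
    return state["answer"]

print(run())  # {'headline': 'Hello World'}
```

Because each step only reads and writes shared state, any node can be swapped out without touching the others — the property the graph architecture relies on.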

Why Graphs?

The graph architecture provides several advantages:

  1. Modularity — Each step is independent and replaceable
  2. Transparency — You can inspect what each node produces
  3. Customization — Add, remove, or modify nodes for custom workflows
  4. Reusability — Build once, apply to different sites with different prompts

SmartScraperGraph: Basic Usage

The SmartScraperGraph is the most common graph type. It handles the full pipeline from URL to structured data.

Simple Example

from scrapegraphai.graphs import SmartScraperGraph

graph = SmartScraperGraph(
    prompt="Extract the main headline, author, and publication date",
    source="https://example.com/article",
    config={
        "llm": {
            "model": "openai/gpt-4o-mini",
            "api_key": "sk-your-key"
        }
    }
)

result = graph.run()
print(result)
# Output: {"headline": "...", "author": "...", "date": "..."}

Extracting Lists of Items

graph = SmartScraperGraph(
    prompt="Extract all products with their name, price, rating, and whether they are in stock",
    source="https://example.com/products",
    config={
        "llm": {
            "model": "openai/gpt-4o-mini",
            "api_key": "sk-your-key"
        }
    }
)

result = graph.run()
for product in result.get("products", []):
    print(f"{product['name']}: ${product['price']} ({product['rating']}★)")

Scraping from Local HTML

You can also process local HTML files or raw HTML strings:

# From a local file
graph = SmartScraperGraph(
    prompt="Extract all contact information",
    source="path/to/page.html",
    config={"llm": {"model": "openai/gpt-4o-mini", "api_key": "sk-key"}}
)

# From a raw string
html_content = "<html><body><h1>John Doe</h1><p>Email: john@example.com</p></body></html>"
graph = SmartScraperGraph(
    prompt="Extract the person's name and email",
    source=html_content,
    config={"llm": {"model": "openai/gpt-4o-mini", "api_key": "sk-key"}}
)

Supported LLM Providers

ScrapeGraphAI works with virtually any LLM:

OpenAI

config = {
    "llm": {
        "model": "openai/gpt-4o-mini",
        "api_key": "sk-your-key"
    }
}

Anthropic Claude

config = {
    "llm": {
        "model": "anthropic/claude-sonnet-4-20250514",
        "api_key": "sk-ant-your-key"
    }
}

Google Gemini

config = {
    "llm": {
        "model": "google_genai/gemini-pro",
        "api_key": "your-google-key"
    }
}

Groq (Fast Inference)

config = {
    "llm": {
        "model": "groq/llama3-70b-8192",
        "api_key": "your-groq-key"
    }
}

Ollama (Local, Free)

config = {
    "llm": {
        "model": "ollama/llama3.1",
        "temperature": 0,
        "base_url": "http://localhost:11434"
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434"
    }
}

Advanced Graph Types

SearchGraph

Searches the web first, then extracts data from the results:

from scrapegraphai.graphs import SearchGraph

graph = SearchGraph(
    prompt="Find the current price of Bitcoin and Ethereum",
    config={
        "llm": {
            "model": "openai/gpt-4o-mini",
            "api_key": "sk-your-key"
        },
        "search_engine": "google"
    }
)

result = graph.run()
print(result)

SmartScraperMultiGraph

Scrape multiple URLs with a single prompt:

from scrapegraphai.graphs import SmartScraperMultiGraph

urls = [
    "https://example.com/product/1",
    "https://example.com/product/2",
    "https://example.com/product/3",
]

graph = SmartScraperMultiGraph(
    prompt="Extract the product name, price, and description",
    source=urls,
    config={
        "llm": {
            "model": "openai/gpt-4o-mini",
            "api_key": "sk-your-key"
        }
    }
)

result = graph.run()

SpeechGraph

Converts scraped data to audio narration:

from scrapegraphai.graphs import SpeechGraph

graph = SpeechGraph(
    prompt="Summarize the main points of this article",
    source="https://example.com/article",
    config={
        "llm": {
            "model": "openai/gpt-4o-mini",
            "api_key": "sk-your-key"
        },
        "tts_model": "openai/tts-1"
    }
)

result = graph.run()
# result includes audio file path

Using Local LLMs with Ollama

For completely free, private scraping, use Ollama:

Setup

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull models
ollama pull llama3.1
ollama pull nomic-embed-text

# Start server
ollama serve

Configuration

from scrapegraphai.graphs import SmartScraperGraph

graph = SmartScraperGraph(
    prompt="Extract all article titles and authors",
    source="https://example.com/blog",
    config={
        "llm": {
            "model": "ollama/llama3.1",
            "temperature": 0,
            "base_url": "http://localhost:11434"
        },
        "embeddings": {
            "model": "ollama/nomic-embed-text",
            "base_url": "http://localhost:11434"
        },
        "verbose": True
    }
)

result = graph.run()

Performance Notes

Local models are slower than cloud APIs but provide:

  • Zero API costs
  • Complete data privacy
  • No rate limits
  • Offline operation

For production use with high throughput, cloud APIs are recommended. For development, prototyping, or privacy-sensitive tasks, local models are ideal.

Schema-Based Extraction

For consistent, typed output, define Pydantic schemas:

from pydantic import BaseModel, Field
from typing import List, Optional
from scrapegraphai.graphs import SmartScraperGraph

class Review(BaseModel):
    author: str = Field(description="Name of the reviewer")
    rating: float = Field(description="Rating out of 5")
    text: str = Field(description="Review text content")
    date: Optional[str] = Field(description="Date of the review")

class ProductReviews(BaseModel):
    product_name: str
    average_rating: float
    total_reviews: int
    reviews: List[Review]

graph = SmartScraperGraph(
    prompt="Extract product reviews with ratings and details",
    source="https://example.com/product/reviews",
    config={
        "llm": {
            "model": "openai/gpt-4o-mini",
            "api_key": "sk-your-key"
        }
    },
    schema=ProductReviews
)

result = graph.run()
# result is validated against the ProductReviews schema

Proxy Integration

For scraping at scale or accessing geo-restricted content, configure proxies:

graph = SmartScraperGraph(
    prompt="Extract pricing information",
    source="https://example.com/pricing",
    config={
        "llm": {
            "model": "openai/gpt-4o-mini",
            "api_key": "sk-your-key"
        },
        "loader_kwargs": {
            "proxy": {
                "server": "http://proxy-server:8080",
                "username": "user",
                "password": "pass"
            }
        },
        "headless": True
    }
)

For rotating proxies, use a residential proxy provider with a single rotating endpoint:

config = {
    "llm": {"model": "openai/gpt-4o-mini", "api_key": "sk-key"},
    "loader_kwargs": {
        "proxy": {
            "server": "http://gate.provider.com:7777",
            "username": "customer-id",
            "password": "password"
        }
    }
}

ScrapeGraphAI vs Other Tools

| Feature | ScrapeGraphAI | Crawl4ai | Firecrawl |
| --- | --- | --- | --- |
| Approach | LLM-centric graphs | Browser + optional LLM | API + built-in LLM |
| Natural language | Yes (core feature) | No (schema-based) | No (schema-based) |
| Cost | Free + LLM costs | Free + LLM costs | Credit-based |
| Local models | Yes (Ollama) | Yes (Ollama) | No (self-host only) |
| Graph pipelines | Yes | No | No |
| Async support | Limited | Full async | API-based |
| Browser rendering | Via Playwright | Via Playwright | Built-in |
| Best for | Prompt-based extraction | High-volume crawling | Clean markdown output |

For a detailed comparison of the top two open-source options, see our Crawl4ai vs Firecrawl guide.

Common Use Cases

Price Monitoring

graph = SmartScraperGraph(
    prompt="Extract all product prices, including original price and sale price if available",
    source="https://competitor.com/products",
    config=llm_config
)

Job Listing Extraction

graph = SmartScraperGraph(
    prompt="Extract job listings with title, company, location, salary range, and required skills",
    source="https://jobboard.com/search?q=python",
    config=llm_config
)

News Aggregation

graph = SmartScraperGraph(
    prompt="Extract all news headlines, summaries, authors, and publication dates",
    source="https://news-site.com",
    config=llm_config
)

Research Data Collection

graph = SmartScraperGraph(
    prompt="Extract research paper titles, authors, abstracts, and citation counts",
    source="https://scholar.example.com/results",
    config=llm_config
)

Building a Complete Pipeline

Here’s a full example combining multiple ScrapeGraphAI features into a production-ready data collection pipeline:

import json
import time
from datetime import datetime
from scrapegraphai.graphs import SmartScraperGraph, SmartScraperMultiGraph
from pydantic import BaseModel, Field
from typing import List, Optional

class CompetitorProduct(BaseModel):
    name: str
    price: float
    currency: str = "USD"
    features: List[str] = []
    rating: Optional[float] = None
    url: Optional[str] = None

class CompetitorPage(BaseModel):
    company_name: str
    products: List[CompetitorProduct]

def build_monitoring_pipeline(competitor_urls: list[str], output_file: str):
    """Monitor competitor pricing across multiple sites."""
    config = {
        "llm": {
            "model": "openai/gpt-4o-mini",
            "api_key": "sk-your-key",
            "temperature": 0
        },
        "loader_kwargs": {
            "proxy": {
                "server": "http://gate.provider.com:7777",
                "username": "customer-id",
                "password": "password"
            }
        },
        "headless": True,
        "verbose": False
    }

    all_results = []

    for url in competitor_urls:
        try:
            graph = SmartScraperGraph(
                prompt="Extract all products with their names, prices, key features, and ratings. Include the currency.",
                source=url,
                config=config,
                schema=CompetitorPage
            )

            result = graph.run()
            result_dict = result if isinstance(result, dict) else result.model_dump()
            result_dict["source_url"] = url
            result_dict["scraped_at"] = datetime.now().isoformat()
            all_results.append(result_dict)

            print(f"Scraped {url}: {len(result_dict.get('products', []))} products")
            time.sleep(3)  # Rate limiting

        except Exception as e:
            print(f"Error scraping {url}: {e}")
            all_results.append({
                "source_url": url,
                "error": str(e),
                "scraped_at": datetime.now().isoformat()
            })

    # Save results
    with open(output_file, "w") as f:
        json.dump(all_results, f, indent=2, default=str)

    print(f"Saved {len(all_results)} results to {output_file}")
    return all_results

# Usage
results = build_monitoring_pipeline(
    competitor_urls=[
        "https://competitor-a.com/pricing",
        "https://competitor-b.com/products",
        "https://competitor-c.com/plans",
    ],
    output_file="competitor_data.json"
)

Post-Processing Results

After scraping, you can analyze the collected data:

import pandas as pd

def analyze_competitor_data(results: list[dict]) -> pd.DataFrame:
    """Flatten competitor data into a DataFrame for analysis."""
    rows = []
    for result in results:
        if "error" in result:
            continue
        for product in result.get("products", []):
            rows.append({
                "competitor": result.get("company_name", "Unknown"),
                "product": product.get("name"),
                "price": product.get("price"),
                "currency": product.get("currency", "USD"),
                "rating": product.get("rating"),
                "source": result.get("source_url"),
                "scraped_at": result.get("scraped_at"),
            })

    df = pd.DataFrame(rows)
    print("\nPrice Summary by Competitor:")
    print(df.groupby("competitor")["price"].describe())
    return df

Troubleshooting

Common Issues

“Model not found” error: Ensure your model string matches the provider format. Use openai/gpt-4o-mini, not just gpt-4o-mini.

Empty results:

  • Check if the page requires JavaScript rendering (add Playwright config)
  • Try a more specific prompt
  • Verify the URL is accessible

Rate limiting: Add delays between requests and consider using proxy rotation.
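
A small retry helper with exponential backoff can wrap any graph run. This is generic Python, not a ScrapeGraphAI API — `run_fn` is simply a zero-argument callable such as `graph.run`:

```python
import time

def run_with_retries(run_fn, max_attempts=3, base_delay=2.0):
    """Call run_fn, retrying with exponential backoff on failure.

    run_fn: any zero-argument callable, e.g. graph.run.
    Delays are base_delay, 2*base_delay, 4*base_delay, ...
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return run_fn()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)

# Usage: result = run_with_retries(graph.run)
```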

High API costs:

  • Use gpt-4o-mini instead of gpt-4o for most tasks
  • Switch to Ollama for local inference
  • Use schemas to reduce output tokens

Timeout errors: Increase the timeout in the loader configuration:

config = {
    "llm": {"model": "openai/gpt-4o-mini", "api_key": "sk-key"},
    "loader_kwargs": {"timeout": 60}
}

FAQ

Is ScrapeGraphAI free?

The library itself is free and open source (MIT license). You pay only for LLM API calls if using cloud providers like OpenAI or Anthropic. Using Ollama with local models makes it entirely free.

How accurate is ScrapeGraphAI compared to CSS-based scraping?

For well-structured pages, CSS-based scraping is more deterministic. ScrapeGraphAI excels when pages are inconsistent, when you don’t know the HTML structure, or when you need to extract nuanced information that’s hard to capture with selectors. Accuracy depends on the LLM model used — GPT-4o provides the best accuracy, while smaller models may miss edge cases.

Can ScrapeGraphAI handle JavaScript-rendered pages?

Yes, when configured with Playwright for browser rendering. Add the headless browser configuration to handle SPAs and dynamic content.

How does ScrapeGraphAI handle large pages?

It uses chunking and RAG (Retrieval-Augmented Generation) internally. Large pages are split into chunks, embedded, and the most relevant chunks are selected for LLM processing. This keeps token usage manageable even for very long pages.
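
The chunking idea can be illustrated with a minimal sliding-window splitter. This is a sketch of the concept only — the library's actual chunking, embedding, and retrieval logic is more sophisticated:

```python
def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into overlapping windows so no chunk exceeds chunk_size.

    Overlap keeps context that straddles a chunk boundary retrievable.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

page = "x" * 2500
print([len(c) for c in chunk_text(page)])  # [1000, 1000, 700]
```

Each chunk would then be embedded, and only the chunks most similar to the prompt are passed to the LLM — which is why token usage stays bounded regardless of page length.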

Can I use ScrapeGraphAI in production?

Yes, but be mindful of LLM costs at scale. For high-volume production scraping, consider combining ScrapeGraphAI with caching, proxy rotation, and rate limiting. For very high volumes, Crawl4ai may be more suitable due to its async architecture.
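
The caching mentioned above can be as simple as a JSON file cache keyed by a hash of the prompt and source. This is an illustrative pattern, not a built-in library feature — `run_fn` is any zero-argument callable such as `graph.run`, and `scrape_cache` is an arbitrary directory name:

```python
import hashlib
import json
from pathlib import Path

def cached_run(run_fn, prompt, source, cache_dir=Path("scrape_cache")):
    """Return a cached result for (prompt, source), or compute and store it."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(f"{prompt}|{source}".encode()).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())  # cache hit: skip LLM call
    result = run_fn()
    path.write_text(json.dumps(result, default=str))
    return result

# Usage: result = cached_run(graph.run, prompt, url)
```

Repeated runs against unchanged pages then cost nothing; pair this with a TTL or cache-busting step if the target data changes frequently.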
