ScrapeGraphAI: LLM-Powered Web Scraping
What if you could scrape any website by simply describing what data you want in plain English? That is exactly what ScrapeGraphAI delivers. This open-source Python library uses large language models and a graph-based pipeline architecture to turn natural language prompts into structured data extraction — no CSS selectors, no XPath, no brittle parsing code.
ScrapeGraphAI stands out from other AI web scrapers because of its unique architecture: instead of treating scraping as a single operation, it breaks the process into a directed graph of nodes, where each node handles a specific task (fetching, parsing, extracting, merging). This modular approach makes it both powerful and customizable.
Table of Contents
- What Is ScrapeGraphAI?
- Installation & Setup
- Core Concepts: Graph Pipelines
- SmartScraperGraph: Basic Usage
- Supported LLM Providers
- Advanced Graph Types
- Using Local LLMs with Ollama
- Schema-Based Extraction
- Proxy Integration
- ScrapeGraphAI vs Other Tools
- Common Use Cases
- Troubleshooting
- FAQ
What Is ScrapeGraphAI?
ScrapeGraphAI is an open-source Python library (MIT license) that creates web scraping pipelines powered by large language models. Created by Marco Vinciguerra and Lorenzo Padoan, it uses a graph-based architecture where the scraping process flows through connected nodes — each responsible for a step like fetching HTML, cleaning content, or extracting data.
Key Differentiators
| Feature | Description |
|---|---|
| Natural language prompts | Describe what you want in English |
| Graph pipeline architecture | Modular, customizable processing flows |
| Multiple LLM support | OpenAI, Anthropic, Ollama, Groq, Google, and more |
| Multiple graph types | SmartScraper, Search, Speech, and custom graphs |
| Schema support | Pydantic and JSON schema for structured output |
| Local model support | Run entirely locally with Ollama |
| Open source | MIT license, fully free to use |
How It Differs from Other AI Scrapers
While tools like Firecrawl focus on URL-to-markdown conversion and Crawl4ai emphasizes async crawling with optional LLM extraction, ScrapeGraphAI puts the LLM at the center of the entire pipeline. The LLM doesn’t just extract data — it drives the scraping process itself.
Installation & Setup
Basic Installation
pip install scrapegraphai
Full Installation (with all LLM providers)
pip install scrapegraphai[all]
Install Playwright for Browser Rendering
playwright install chromium
Environment Setup
Set your LLM API keys:
# For OpenAI
export OPENAI_API_KEY="sk-your-key-here"
# For Anthropic
export ANTHROPIC_API_KEY="sk-ant-your-key-here"
# For Google
export GOOGLE_API_KEY="your-google-key"
Core Concepts: Graph Pipelines
What Is a Graph Pipeline?
In ScrapeGraphAI, a “graph” is a directed acyclic graph (DAG) where each node performs a specific operation:
FetchNode → ParseNode → RAGNode → GenerateAnswerNode
- FetchNode — Downloads the web page HTML
- ParseNode — Cleans and processes the HTML content
- RAGNode — Chunks content and creates embeddings for retrieval
- GenerateAnswerNode — Uses the LLM to extract data based on your prompt
Why Graphs?
The graph architecture provides several advantages:
- Modularity — Each step is independent and replaceable
- Transparency — You can inspect what each node produces
- Customization — Add, remove, or modify nodes for custom workflows
- Reusability — Build once, apply to different sites with different prompts
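To see this modularity in practice, here is a minimal sketch of wiring a pipeline by hand, modeled on the project's custom-graph examples. The node constructors, the BaseGraph signature, and the LangChain ChatOpenAI model object are assumptions based on those examples; exact signatures vary between releases, so check the version you have installed.
# Assumed API, following the library's custom-graph examples; may differ by version.
from langchain_openai import ChatOpenAI
from scrapegraphai.graphs import BaseGraph
from scrapegraphai.nodes import FetchNode, ParseNode, GenerateAnswerNode

llm = ChatOpenAI(model="gpt-4o-mini", api_key="sk-your-key")

fetch = FetchNode(input="url | local_dir", output=["doc"])
parse = ParseNode(input="doc", output=["parsed_doc"], node_config={"chunk_size": 4096})
answer = GenerateAnswerNode(
    input="user_prompt & (parsed_doc | doc)",
    output=["answer"],
    node_config={"llm_model": llm},
)

graph = BaseGraph(
    nodes=[fetch, parse, answer],
    edges=[(fetch, parse), (parse, answer)],
    entry_point=fetch,
)

# execute() returns the final state plus execution metadata
state, _ = graph.execute({
    "user_prompt": "Extract the main headline",
    "url": "https://example.com/article",
})
print(state.get("answer"))
A RAGNode can be slotted between the parse and answer steps in the same way when pages are too large to fit in context.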
SmartScraperGraph: Basic Usage
The SmartScraperGraph is the most common graph type. It handles the full pipeline from URL to structured data.
Simple Example
from scrapegraphai.graphs import SmartScraperGraph
graph = SmartScraperGraph(
prompt="Extract the main headline, author, and publication date",
source="https://example.com/article",
config={
"llm": {
"model": "openai/gpt-4o-mini",
"api_key": "sk-your-key"
}
}
)
result = graph.run()
print(result)
# Output: {"headline": "...", "author": "...", "date": "..."}Extracting Lists of Items
graph = SmartScraperGraph(
prompt="Extract all products with their name, price, rating, and whether they are in stock",
source="https://example.com/products",
config={
"llm": {
"model": "openai/gpt-4o-mini",
"api_key": "sk-your-key"
}
}
)
result = graph.run()
for product in result.get("products", []):
print(f"{product['name']}: ${product['price']} ({product['rating']}★)")Scraping from Local HTML
You can also process local HTML files or raw HTML strings:
# From a local file
graph = SmartScraperGraph(
prompt="Extract all contact information",
source="path/to/page.html",
config={"llm": {"model": "openai/gpt-4o-mini", "api_key": "sk-key"}}
)
# From a raw string
html_content = "<html><body><h1>John Doe</h1><p>Email: john@example.com</p></body></html>"
graph = SmartScraperGraph(
prompt="Extract the person's name and email",
source=html_content,
config={"llm": {"model": "openai/gpt-4o-mini", "api_key": "sk-key"}}
)
Supported LLM Providers
ScrapeGraphAI works with virtually any LLM:
OpenAI
config = {
"llm": {
"model": "openai/gpt-4o-mini",
"api_key": "sk-your-key"
}
}
Anthropic Claude
config = {
"llm": {
"model": "anthropic/claude-sonnet-4-20250514",
"api_key": "sk-ant-your-key"
}
}
Google Gemini
config = {
"llm": {
"model": "google_genai/gemini-pro",
"api_key": "your-google-key"
}
}
Groq (Fast Inference)
config = {
"llm": {
"model": "groq/llama3-70b-8192",
"api_key": "your-groq-key"
}
}
Ollama (Local, Free)
config = {
"llm": {
"model": "ollama/llama3.1",
"temperature": 0,
"base_url": "http://localhost:11434"
},
"embeddings": {
"model": "ollama/nomic-embed-text",
"base_url": "http://localhost:11434"
}
}
Advanced Graph Types
SearchGraph
Searches the web first, then extracts data from the results:
from scrapegraphai.graphs import SearchGraph
graph = SearchGraph(
prompt="Find the current price of Bitcoin and Ethereum",
config={
"llm": {
"model": "openai/gpt-4o-mini",
"api_key": "sk-your-key"
},
"search_engine": "google"
}
)
result = graph.run()
print(result)
SmartScraperMultiGraph
Scrape multiple URLs with a single prompt:
from scrapegraphai.graphs import SmartScraperMultiGraph
urls = [
"https://example.com/product/1",
"https://example.com/product/2",
"https://example.com/product/3",
]
graph = SmartScraperMultiGraph(
prompt="Extract the product name, price, and description",
source=urls,
config={
"llm": {
"model": "openai/gpt-4o-mini",
"api_key": "sk-your-key"
}
}
)
result = graph.run()
SpeechGraph
Converts scraped data to audio narration:
from scrapegraphai.graphs import SpeechGraph
graph = SpeechGraph(
prompt="Summarize the main points of this article",
source="https://example.com/article",
config={
"llm": {
"model": "openai/gpt-4o-mini",
"api_key": "sk-your-key"
},
"tts_model": "openai/tts-1"
}
)
result = graph.run()
# result includes audio file path
Using Local LLMs with Ollama
For completely free, private scraping, use Ollama:
Setup
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull models
ollama pull llama3.1
ollama pull nomic-embed-text
# Start server
ollama serve
Configuration
from scrapegraphai.graphs import SmartScraperGraph
graph = SmartScraperGraph(
prompt="Extract all article titles and authors",
source="https://example.com/blog",
config={
"llm": {
"model": "ollama/llama3.1",
"temperature": 0,
"base_url": "http://localhost:11434"
},
"embeddings": {
"model": "ollama/nomic-embed-text",
"base_url": "http://localhost:11434"
},
"verbose": True
}
)
result = graph.run()
Performance Notes
Local models are slower than cloud APIs but provide:
- Zero API costs
- Complete data privacy
- No rate limits
- Offline operation
For production use with high throughput, cloud APIs are recommended. For development, prototyping, or privacy-sensitive tasks, local models are ideal.
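One pattern that follows from this trade-off is a small switch between the two configurations. The make_config helper below is hypothetical; it simply reuses the Ollama and OpenAI settings shown above:
def make_config(use_local: bool) -> dict:
    """Return a local Ollama config for development, or a cloud config for production."""
    if use_local:
        return {
            "llm": {
                "model": "ollama/llama3.1",
                "temperature": 0,
                "base_url": "http://localhost:11434",
            },
            "embeddings": {
                "model": "ollama/nomic-embed-text",
                "base_url": "http://localhost:11434",
            },
        }
    return {"llm": {"model": "openai/gpt-4o-mini", "api_key": "sk-your-key"}}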
Schema-Based Extraction
For consistent, typed output, define Pydantic schemas:
from pydantic import BaseModel, Field
from typing import List, Optional
from scrapegraphai.graphs import SmartScraperGraph
class Review(BaseModel):
author: str = Field(description="Name of the reviewer")
rating: float = Field(description="Rating out of 5")
text: str = Field(description="Review text content")
    date: Optional[str] = Field(default=None, description="Date of the review")
class ProductReviews(BaseModel):
product_name: str
average_rating: float
total_reviews: int
reviews: List[Review]
graph = SmartScraperGraph(
prompt="Extract product reviews with ratings and details",
source="https://example.com/product/reviews",
config={
"llm": {
"model": "openai/gpt-4o-mini",
"api_key": "sk-your-key"
}
},
schema=ProductReviews
)
result = graph.run()
# result is validated against the ProductReviews schema
Proxy Integration
For scraping at scale or accessing geo-restricted content, configure proxies:
graph = SmartScraperGraph(
prompt="Extract pricing information",
source="https://example.com/pricing",
config={
"llm": {
"model": "openai/gpt-4o-mini",
"api_key": "sk-your-key"
},
"loader_kwargs": {
"proxy": {
"server": "http://proxy-server:8080",
"username": "user",
"password": "pass"
}
},
"headless": True
}
)
For rotating proxies, use a residential proxy provider with a single rotating endpoint:
config = {
"llm": {"model": "openai/gpt-4o-mini", "api_key": "sk-key"},
"loader_kwargs": {
"proxy": {
"server": "http://gate.provider.com:7777",
"username": "customer-id",
"password": "password"
}
}
}
ScrapeGraphAI vs Other Tools
| Feature | ScrapeGraphAI | Crawl4ai | Firecrawl |
|---|---|---|---|
| Approach | LLM-centric graphs | Browser + optional LLM | API + built-in LLM |
| Natural language | Yes (core feature) | No (schema-based) | No (schema-based) |
| Cost | Free + LLM costs | Free + LLM costs | Credit-based |
| Local models | Yes (Ollama) | Yes (Ollama) | No (self-host only) |
| Graph pipelines | Yes | No | No |
| Async support | Limited | Full async | API-based |
| Browser rendering | Via Playwright | Via Playwright | Built-in |
| Best for | Prompt-based extraction | High-volume crawling | Clean markdown output |
For a detailed comparison of these two alternatives, see our Crawl4ai vs Firecrawl guide.
Common Use Cases
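Each snippet below reuses a single llm_config dictionary; a minimal definition, matching the OpenAI configuration shown earlier:
# Shared LLM config for the use-case snippets that follow
llm_config = {
    "llm": {
        "model": "openai/gpt-4o-mini",
        "api_key": "sk-your-key",
    }
}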
Price Monitoring
graph = SmartScraperGraph(
prompt="Extract all product prices, including original price and sale price if available",
source="https://competitor.com/products",
config=llm_config
)
Job Listing Extraction
graph = SmartScraperGraph(
prompt="Extract job listings with title, company, location, salary range, and required skills",
source="https://jobboard.com/search?q=python",
config=llm_config
)
News Aggregation
graph = SmartScraperGraph(
prompt="Extract all news headlines, summaries, authors, and publication dates",
source="https://news-site.com",
config=llm_config
)
Research Data Collection
graph = SmartScraperGraph(
prompt="Extract research paper titles, authors, abstracts, and citation counts",
source="https://scholar.example.com/results",
config=llm_config
)
Building a Complete Pipeline
Here’s a full example combining multiple ScrapeGraphAI features into a production-ready data collection pipeline:
import json
import time
from datetime import datetime
from scrapegraphai.graphs import SmartScraperGraph
from pydantic import BaseModel, Field
from typing import List, Optional
class CompetitorProduct(BaseModel):
name: str
price: float
currency: str = "USD"
features: List[str] = []
rating: Optional[float] = None
url: Optional[str] = None
class CompetitorPage(BaseModel):
company_name: str
products: List[CompetitorProduct]
def build_monitoring_pipeline(competitor_urls: list[str], output_file: str):
"""Monitor competitor pricing across multiple sites."""
config = {
"llm": {
"model": "openai/gpt-4o-mini",
"api_key": "sk-your-key",
"temperature": 0
},
"loader_kwargs": {
"proxy": {
"server": "http://gate.provider.com:7777",
"username": "customer-id",
"password": "password"
}
},
"headless": True,
"verbose": False
}
all_results = []
for url in competitor_urls:
try:
graph = SmartScraperGraph(
prompt="Extract all products with their names, prices, key features, and ratings. Include the currency.",
source=url,
config=config,
schema=CompetitorPage
)
result = graph.run()
result_dict = result if isinstance(result, dict) else result.model_dump()
result_dict["source_url"] = url
result_dict["scraped_at"] = datetime.now().isoformat()
all_results.append(result_dict)
print(f"Scraped {url}: {len(result_dict.get('products', []))} products")
time.sleep(3) # Rate limiting
except Exception as e:
print(f"Error scraping {url}: {e}")
all_results.append({
"source_url": url,
"error": str(e),
"scraped_at": datetime.now().isoformat()
})
# Save results
with open(output_file, "w") as f:
json.dump(all_results, f, indent=2, default=str)
print(f"Saved {len(all_results)} results to {output_file}")
return all_results
# Usage
results = build_monitoring_pipeline(
competitor_urls=[
"https://competitor-a.com/pricing",
"https://competitor-b.com/products",
"https://competitor-c.com/plans",
],
output_file="competitor_data.json"
)
Post-Processing Results
After scraping, you can analyze the collected data:
import pandas as pd
def analyze_competitor_data(results: list[dict]) -> pd.DataFrame:
"""Flatten competitor data into a DataFrame for analysis."""
rows = []
for result in results:
if "error" in result:
continue
for product in result.get("products", []):
rows.append({
"competitor": result.get("company_name", "Unknown"),
"product": product.get("name"),
"price": product.get("price"),
"currency": product.get("currency", "USD"),
"rating": product.get("rating"),
"source": result.get("source_url"),
"scraped_at": result.get("scraped_at"),
})
df = pd.DataFrame(rows)
print(f"\nPrice Summary by Competitor:")
print(df.groupby("competitor")["price"].describe())
    return df
Troubleshooting
Common Issues
“Model not found” error: Ensure your model string matches the provider format. Use openai/gpt-4o-mini, not just gpt-4o-mini.
Empty results:
- Check if the page requires JavaScript rendering (add Playwright config)
- Try a more specific prompt
- Verify the URL is accessible
Rate limiting: Add delays between requests and consider using proxy rotation.
High API costs:
- Use gpt-4o-mini instead of gpt-4o for most tasks
- Switch to Ollama for local inference
- Use schemas to reduce output tokens
Timeout errors: Increase the timeout in the loader configuration:
config = {
"llm": {"model": "openai/gpt-4o-mini", "api_key": "sk-key"},
"loader_kwargs": {"timeout": 60}
}
FAQ
Is ScrapeGraphAI free?
The library itself is free and open source (MIT license). You pay only for LLM API calls if using cloud providers like OpenAI or Anthropic. Using Ollama with local models makes it entirely free.
How accurate is ScrapeGraphAI compared to CSS-based scraping?
For well-structured pages, CSS-based scraping is more deterministic. ScrapeGraphAI excels when pages are inconsistent, when you don’t know the HTML structure, or when you need to extract nuanced information that’s hard to capture with selectors. Accuracy depends on the LLM model used — GPT-4o provides the best accuracy, while smaller models may miss edge cases.
Can ScrapeGraphAI handle JavaScript-rendered pages?
Yes, when configured with Playwright for browser rendering. Add the headless browser configuration to handle SPAs and dynamic content.
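For example, a minimal config that turns on Playwright-backed rendering, using the same headless flag that appears in the proxy examples above:
# Requires `playwright install chromium` (see Installation & Setup)
config = {
    "llm": {"model": "openai/gpt-4o-mini", "api_key": "sk-your-key"},
    "headless": True,
}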
How does ScrapeGraphAI handle large pages?
It uses chunking and RAG (Retrieval-Augmented Generation) internally. Large pages are split into chunks, embedded, and the most relevant chunks are selected for LLM processing. This keeps token usage manageable even for very long pages.
Can I use ScrapeGraphAI in production?
Yes, but be mindful of LLM costs at scale. For high-volume production scraping, consider combining ScrapeGraphAI with caching, proxy rotation, and rate limiting. For very high volumes, Crawl4ai may be more suitable due to its async architecture.
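As one example of the caching mentioned above, here is a minimal on-disk cache wrapped around graph.run(). The cached_run helper is hypothetical, not part of the library:
# Hypothetical helper: cache graph.run() results on disk, keyed by URL + prompt
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".scrape_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_run(graph, cache_key: str):
    """Return a cached result if one exists; otherwise run the graph and store the result."""
    path = CACHE_DIR / (hashlib.sha256(cache_key.encode()).hexdigest() + ".json")
    if path.exists():
        return json.loads(path.read_text())
    result = graph.run()
    path.write_text(json.dumps(result, default=str))
    return result

# Usage: repeated calls with the same key skip the fetch and LLM call entirely
# result = cached_run(graph, cache_key="https://example.com/article::extract headline")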
Related Reading
- AI Web Scraper with Python: Build Your Own
- Best AI Web Scrapers 2026: Complete Comparison
- Agentic Browsers Explained: Browserbase, Browser Use, and Proxy Infrastructure
- Agentic Browsers Explained: The Future of AI + Proxies in 2026
- How AI Agents Use Proxies for Real-Time Web Data Collection in 2026
- Mobile Proxies for AI Data Collection: Web Scraping for Training Data