ScrapeGraphAI Tutorial: AI-Powered Scraping Without Selectors (2026)


scrapegraphai is an open-source python library that scrapes any website by describing what you want in plain english. it sends the rendered html to an llm, which extracts structured json without you writing css or xpath selectors. install it with pip install scrapegraphai, supply an openai api key (or point it at a local ollama model), give it a url, and it returns parsed data. it is useful for one-off scrapes, prototype work, and small sites where selectors break weekly.

this tutorial covers install, the four pipeline types, proxy and headless integration, and when to use it versus a traditional scrapy or playwright stack.

what scrapegraphai is

scrapegraphai (github: scrapegraph-ai/scrapegraph-ai) is a graph-based web scraping framework that uses an llm to extract data instead of selectors.

the workflow:

  1. you provide a url and a natural-language prompt (“get all product names and prices”).
  2. scrapegraphai fetches the page (with optional headless browser).
  3. it cleans and chunks the html.
  4. an llm parses the chunks into structured json matching your prompt.

no css, xpath, or regex. selector drift on the target site does not break your scraper unless the page structure changes so much the llm cannot find the data.

installation in 2026

pip install scrapegraphai
playwright install chromium

the playwright install is needed for smartscrapergraph runs that use a real browser. for static-html scraping (no js), the playwright step is optional.

set your llm api key as an environment variable:

export OPENAI_API_KEY="sk-..."

scrapegraphai supports openai, anthropic, groq, and local ollama out of the box. for cost-conscious development, use ollama with a 7b model.
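the examples below hardcode the api key for brevity; in practice, read it from the environment variable you just exported. a minimal sketch (make_graph_config is a hypothetical helper, not part of the library):

```python
import os

def make_graph_config(model="openai/gpt-4o-mini"):
    # read the api key from the environment instead of hardcoding it in source
    api_key = os.environ.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return {
        "llm": {"api_key": api_key, "model": model},
        "verbose": False,
        "headless": True,
    }
```

the returned dict has the same shape as the graph_config used throughout this tutorial, so you can pass it straight to any graph constructor.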

your first scrape: smartscraper graph

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "api_key": "sk-...",
        "model": "openai/gpt-4o-mini",
    },
    "verbose": False,
    "headless": True,
}

scraper = SmartScraperGraph(
    prompt="list all article titles and their authors on this page",
    source="https://hnrss.org/frontpage",
    config=graph_config,
)

result = scraper.run()
print(result)

output (truncated):

{
    "articles": [
        {"title": "show hn: a new approach to ai scraping", "author": "alex"},
        {"title": "rust 1.78 released", "author": "rust-lang team"},
        ...
    ]
}

no selectors. the llm read the rendered page and returned what you asked for in json.

the four main graph types

smartscrapergraph

single-page extraction. you give it a url and a prompt, it returns json. this is the workhorse for 80 percent of use cases.

searchgraph

google-search-driven scraping. you give it a query, it searches google, picks top results, and runs smartscraper on each.

from scrapegraphai.graphs import SearchGraph

graph = SearchGraph(
    prompt="find python scraping libraries with examples",
    config=graph_config,
)
result = graph.run()

useful for research-style scraping where you do not have a fixed url list.

speechgraph

extracts data and converts the result to audio via tts. useful for accessibility apps. less commonly used but it ships in the library.

smartscrapermultigraph

batch version of smartscraper. give it a list of urls, run the same prompt against each in parallel.

from scrapegraphai.graphs import SmartScraperMultiGraph

urls = [
    "https://example.com/product/1",
    "https://example.com/product/2",
    "https://example.com/product/3",
]

graph = SmartScraperMultiGraph(
    prompt="extract product name, price, and stock status",
    source=urls,
    config=graph_config,
)
result = graph.run()

the multi-graph is concurrent under the hood. it is the right pick for scraping a list of similar pages.

adding proxies

production scrapers need proxies. scrapegraphai accepts a proxy in the config:

graph_config = {
    "llm": {
        "api_key": "sk-...",
        "model": "openai/gpt-4o-mini",
    },
    "loader_kwargs": {
        "proxy": {
            "server": "http://proxy.example.com:8080",
            "username": "user",
            "password": "pass",
        },
    },
    "headless": True,
}

this passes the proxy to playwright, which routes both the page fetch and any sub-resources through the proxy.

for proxy rotation across many requests, wrap your scrape calls in a loop and switch the config per call. for the full pattern see our python proxy rotation guide.
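a minimal rotation sketch along those lines — the proxy endpoints are placeholders, and configs_for is a hypothetical helper you would write yourself:

```python
import itertools

# placeholder proxy pool; swap in your real endpoints and credentials
PROXIES = [
    {"server": "http://proxy1.example.com:8080", "username": "user", "password": "pass"},
    {"server": "http://proxy2.example.com:8080", "username": "user", "password": "pass"},
]

def configs_for(urls, llm_config):
    # yield (url, config) pairs, cycling through the proxy pool per request
    pool = itertools.cycle(PROXIES)
    for url in urls:
        yield url, {
            "llm": llm_config,
            "loader_kwargs": {"proxy": next(pool)},
            "headless": True,
        }

# each pair would then be run as:
#   SmartScraperGraph(prompt="...", source=url, config=cfg).run()
```

the scraper calls themselves are commented out since each run costs llm tokens; the point is the config-per-call pattern.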

using local llms with ollama

api costs add up fast on real workloads. each smartscraper run sends the rendered html to the llm, which can be 5,000 to 50,000 tokens. for high-volume scraping, run a local model.

ollama pull llama3.1:8b
ollama serve

then update config:

graph_config = {
    "llm": {
        "model": "ollama/llama3.1",
        "temperature": 0,
        "format": "json",
        "model_tokens": 8192,
        "base_url": "http://localhost:11434",
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",
    },
    "verbose": False,
}

an rtx 4060 ti or an m2 max can run llama3.1 8b at usable speed for scraping. the trade-off is extraction quality: gpt-4o-mini is more reliable on messy pages than an 8b local model.

for cost-free development and prototyping, ollama is the right choice. for production, gpt-4o-mini at $0.15 per million input tokens is usually cheaper than running a gpu.

handling js-heavy and login-walled sites

smartscraper uses playwright by default with headless: true. for sites that require login, pass your session cookies as request headers through loader_kwargs:

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {"api_key": "sk-...", "model": "openai/gpt-4o-mini"},
    "loader_kwargs": {
        "extra_http_headers": {
            "cookie": "session=abc123; user_token=xyz",
        },
    },
    "headless": True,
}

for sites with strong anti-bot (cloudflare, datadome, perimeterx), pair scrapegraphai with a residential proxy. the llm can still parse the rendered page, but you need a browser the target lets through. see our best web scraping apis comparison for managed options that bundle this.

when to use scrapegraphai vs traditional scrapy

| scenario | use scrapegraphai | use scrapy/playwright |
| --- | --- | --- |
| one-off research scrape | yes | overkill |
| 10 to 100 pages, low frequency | yes | works either way |
| 10,000+ pages per day | maybe (cost-sensitive) | yes |
| schema is stable and well-known | overkill | yes |
| schema changes weekly | yes | painful with selectors |
| target site uses heavy js | yes | yes (with playwright) |
| budget under $5 per scrape job | yes (gpt-4o-mini) | yes |
| budget under $0.50 per scrape job | scrapegraphai with ollama | yes |

for high-volume production scraping with a stable schema, a hand-coded scrapy spider is still cheaper and more reliable. for quick scrapes, prototypes, or sites where the html structure shifts, scrapegraphai saves significant time.

cost math for openai api

gpt-4o-mini in 2026 is roughly $0.15 per million input tokens and $0.60 per million output tokens.

a typical product-page scrape sends 10,000 input tokens (cleaned html) and outputs 500 tokens (json). cost per page:

  • input: 10,000 / 1,000,000 * $0.15 = $0.0015
  • output: 500 / 1,000,000 * $0.60 = $0.0003
  • total: roughly $0.0018 per page

for 1000 pages, $1.80. for 100,000 pages, $180. budget llm cost into your scrape estimate.
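the math above as a quick estimator (scrape_cost is an assumed helper; prices are the gpt-4o-mini rates quoted in this section):

```python
IN_PRICE = 0.15 / 1_000_000   # $ per input token (gpt-4o-mini, 2026)
OUT_PRICE = 0.60 / 1_000_000  # $ per output token

def scrape_cost(pages, input_tokens=10_000, output_tokens=500):
    # per-page llm cost times page count, rounded to cents
    per_page = input_tokens * IN_PRICE + output_tokens * OUT_PRICE
    return round(pages * per_page, 2)

print(scrape_cost(1_000))    # 1.8
print(scrape_cost(100_000))  # 180.0
```

adjust input_tokens to match your pages: a heavy product page can easily send 30k+ tokens after cleaning, which triples the estimate.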

debugging tips

set "verbose": True in the graph config to see every llm call and intermediate output. this is the fastest way to figure out why a prompt is not extracting what you expect.

start prompts simple. “list all product names” works better than a 5-clause instruction with edge cases. add complexity once the basic prompt works.

inspect the cleaned html scrapegraphai sends to the llm. it strips scripts, styles, and a lot of noise. if your target data lives in a script tag or renders late, you may need to render the page yourself (for example with playwright) and pass the resulting html to the graph.

for stable schemas, define a pydantic model and pass it as the schema arg. the llm will fill the model exactly, which improves consistency.

from pydantic import BaseModel
from typing import List

class Product(BaseModel):
    name: str
    price: float
    in_stock: bool

class ProductList(BaseModel):
    products: List[Product]

scraper = SmartScraperGraph(
    prompt="extract all products",
    source="https://example.com/shop",
    config=graph_config,
    schema=ProductList,
)

official docs at the scrapegraphai github.

faq

what is scrapegraphai used for?

ai-powered web scraping. you describe what you want in english and an llm extracts structured json from any url, no css or xpath needed.

is scrapegraphai free?

yes, the library is open source. you pay only for the llm api you use (openai, anthropic, etc.). with ollama and a local model, the entire stack is free.

does scrapegraphai handle javascript-rendered pages?

yes. it uses playwright under the hood with headless: true by default. for sites that need to scroll or click before content loads, you can extend the loader to run custom js.

how does scrapegraphai compare to firecrawl?

firecrawl is a managed scraping api. scrapegraphai is a self-hosted python library. firecrawl handles the infra and proxies. scrapegraphai gives you full control and lower cost at scale, but you wire up your own browser and proxies.

can i use scrapegraphai with proxies?

yes. pass proxy details in loader_kwargs.proxy. it routes through playwright. for rotation across many requests, swap the proxy per call or wrap in a custom session pool.

what is the cost per page using scrapegraphai with gpt-4o-mini?

roughly $0.0018 per page for a typical product page (10k input tokens, 500 output tokens). 1000 pages costs about $1.80. for high-volume production, run ollama locally to drop llm cost to zero.

the bottom line

scrapegraphai is the right tool for prototype scrapes, schema-flexible jobs, and sites where selectors break too often to maintain. with gpt-4o-mini, the cost is around $0.002 per page, which beats most managed scraping apis at small scale.

for high-volume production with a stable target, scrapy with proper selectors is still cheaper and faster. but for the long tail of “i need to scrape this once and i do not want to write selectors,” scrapegraphai is the fastest path from url to json in 2026.

start with smartscrapergraph, add proxies once you scale, and switch to ollama if api costs become the bottleneck. the library is actively developed and the api is stable enough to depend on.
