Crawl4ai: Open Source AI Crawler Tutorial

The web scraping landscape has shifted dramatically toward AI-powered tools, and Crawl4ai stands out as the leading open-source option. With over 40,000 GitHub stars, zero API costs, and full local execution, it has become the go-to crawler for developers who want AI-enhanced data extraction without vendor lock-in.

This comprehensive guide covers everything you need to build production-grade scraping pipelines with Crawl4ai — from basic setup to advanced techniques like proxy rotation, LLM-powered extraction, and session management for complex multi-page workflows.

What Is Crawl4ai?

Crawl4ai is a free, open-source Python library that combines headless browser automation with intelligent content extraction. Built on Playwright for browser rendering and compatible with any LLM for structured data extraction, it gives you the power of paid scraping APIs without the recurring costs.

Feature Overview

Feature | Details
Cost | Free and open source (Apache 2.0)
Browser Engine | Playwright (Chromium-based)
Async Support | Full asyncio integration
LLM Integration | OpenAI, Anthropic, Ollama, any OpenAI-compatible API
Output Formats | Markdown, structured JSON, raw HTML
JavaScript Handling | Full rendering with custom script execution
Proxy Support | HTTP, HTTPS, SOCKS5 with rotation
Session Management | Persistent sessions for multi-step flows
Anti-Detection | Stealth mode, custom headers, fingerprint management

Crawl4ai vs Paid Alternatives

The primary advantage of Crawl4ai is cost — there are no API credits to purchase, no rate limits imposed by a third-party service, and your data stays on your infrastructure. The tradeoff is that you manage the infrastructure yourself, including browser instances, proxy rotation, and scaling.

For teams already running their own servers, this is often a net positive. For those who prefer managed services, tools like Firecrawl offer a more hands-off approach at a per-credit cost.

Installation & Setup

System Requirements

  • Python 3.9 or higher
  • 2 GB RAM minimum (4 GB recommended for concurrent crawling)
  • Chromium browser (installed automatically by Playwright)

Basic Installation

pip install crawl4ai

After installation, set up the browser:

crawl4ai-setup

This downloads and configures the Chromium browser that Crawl4ai uses for rendering.
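
If anything looks off, current releases also ship a diagnostic command that checks the browser installation and runs a quick test crawl (verify it exists in your version):

crawl4ai-doctor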

Installation with LLM Support

If you plan to use LLM-powered extraction:

pip install crawl4ai[all]

Docker Installation

For production deployments, Docker provides a consistent environment:

docker pull unclecode/crawl4ai
docker run -p 11235:11235 unclecode/crawl4ai

The Docker image exposes a REST API on port 11235 that mirrors the Python API.
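
For example, a crawl can be submitted with a plain HTTP request. This sketch assumes a /crawl endpoint accepting a JSON body with a urls list; the REST schema has changed between releases, so check the docs for the version you deploy:

import requests

# Assumption: the container exposes POST /crawl accepting {"urls": [...]}.
response = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": ["https://example.com"]},
    timeout=60,
)
response.raise_for_status()
print(response.json())  # mirrors the fields of the Python CrawlResult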

Verifying Installation

import asyncio
from crawl4ai import AsyncWebCrawler

async def test():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(f"Status: {result.success}")
        print(f"Content length: {len(result.markdown)}")

asyncio.run(test())

Core Crawling Concepts

Basic Page Crawling

Every crawl starts with the AsyncWebCrawler context manager and the arun method:

import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_page():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://example.com/pricing",
            bypass_cache=True
        )

        if result.success:
            # Clean markdown content
            print(result.markdown)

            # Extracted links
            print(result.links)

            # Page metadata
            print(result.metadata)
        else:
            print(f"Error: {result.error_message}")

asyncio.run(crawl_page())

The CrawlResult Object

Every arun() call returns a CrawlResult with these key attributes:

Attribute | Type | Description
success | bool | Whether the crawl succeeded
markdown | str | Clean markdown content
cleaned_html | str | Cleaned HTML (boilerplate removed)
html | str | Raw page HTML
links | dict | Internal and external links
media | dict | Images, videos, and audio found
metadata | dict | Page title, description, etc.
extracted_content | str | Structured data (if an extraction strategy was used)
error_message | str | Error details if the crawl failed
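
In practice you will read a handful of these fields per crawl. A short sketch, assuming links splits into "internal"/"external" lists and media groups assets under keys like "images", which matches current releases:

if result.success:
    print(result.metadata.get("title"))

    # Internal links discovered on the page
    for link in result.links.get("internal", []):
        print(link)

    # Image assets found during rendering
    for image in result.media.get("images", []):
        print(image)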

Crawling Multiple URLs

Crawl4ai supports batch crawling with concurrency control:

import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_batch():
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
    ]

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=urls,
            max_concurrent=5
        )

        for result in results:
            if result.success:
                print(f"{result.url}: {len(result.markdown)} chars")

asyncio.run(crawl_batch())

JavaScript Execution

For pages that require interaction before content loads:

result = await crawler.arun(
    url="https://example.com/dynamic-content",
    js_code=[
        "document.querySelector('.load-more').click()",
        "await new Promise(r => setTimeout(r, 2000))",
        "document.querySelector('.show-all').click()"
    ],
    wait_for="css:.results-loaded"
)

The wait_for parameter accepts CSS selectors or JavaScript expressions to ensure the page is fully loaded before extraction.
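
Both forms are shown below; the css: and js: prefixes follow the documented syntax, but verify against your installed version:

# Wait for a selector to appear in the DOM
result = await crawler.arun(
    url="https://example.com/feed",
    wait_for="css:.feed-item"
)

# Wait until a JavaScript predicate returns true
result = await crawler.arun(
    url="https://example.com/feed",
    wait_for="js:() => document.querySelectorAll('.feed-item').length > 10"
)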

Extraction Strategies

Crawl4ai provides multiple extraction strategies depending on your needs.

CSS-Based Extraction

For well-structured pages where you know the DOM layout:

from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
import json

schema = {
    "name": "Product Listing",
    "baseSelector": "div.product-card",
    "fields": [
        {"name": "title", "selector": "h2.product-title", "type": "text"},
        {"name": "price", "selector": "span.price", "type": "text"},
        {"name": "url", "selector": "a.product-link", "type": "attribute", "attribute": "href"},
        {"name": "image", "selector": "img", "type": "attribute", "attribute": "src"}
    ]
}

strategy = JsonCssExtractionStrategy(schema)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com/products",
        extraction_strategy=strategy
    )

    products = json.loads(result.extracted_content)
    for product in products:
        print(f"{product['title']}: {product['price']}")

Nested Structures

JsonCssExtractionStrategy also handles nested structures, such as a list of tags inside each article:

from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Articles",
    "baseSelector": "article.post",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "author", "selector": ".author-name", "type": "text"},
        {"name": "date", "selector": "time", "type": "attribute", "attribute": "datetime"},
        {
            "name": "tags",
            "selector": ".tag",
            "type": "list",
            "fields": [
                {"name": "tag_name", "selector": "", "type": "text"}
            ]
        }
    ]
}

strategy = JsonCssExtractionStrategy(schema)

LLM-Powered Structured Extraction

The most powerful feature of Crawl4ai is its ability to use any LLM to extract structured data from unstructured web content.

Setting Up LLM Extraction

from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel
from typing import List

class Product(BaseModel):
    name: str
    price: float
    currency: str
    rating: float
    review_count: int
    in_stock: bool

class ProductPage(BaseModel):
    products: List[Product]

strategy = LLMExtractionStrategy(
    provider="openai/gpt-4o-mini",
    api_token="sk-your-api-key",
    schema=ProductPage.model_json_schema(),
    instruction="Extract all product information from this page. Convert prices to numeric values."
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com/products",
        extraction_strategy=strategy
    )

    products = ProductPage.model_validate_json(result.extracted_content)
    for p in products.products:
        print(f"{p.name}: {p.currency}{p.price} ({p.rating}★)")

Using Local LLMs with Ollama

To avoid API costs entirely, use a local model:

strategy = LLMExtractionStrategy(
    provider="ollama/llama3.1",
    api_token="no-token",
    api_base="http://localhost:11434/v1",
    schema=ProductPage.model_json_schema(),
    instruction="Extract product details from the page content."
)

Using Anthropic Claude

strategy = LLMExtractionStrategy(
    provider="anthropic/claude-sonnet-4-20250514",
    api_token="sk-ant-your-key",
    schema=ProductPage.model_json_schema(),
    instruction="Extract all products with their details."
)

Proxy Integration

For any serious scraping project, proxies are essential to avoid IP blocks and access geo-restricted content.

Single Proxy Configuration

async with AsyncWebCrawler(
    proxy="http://username:password@proxy-server:8080"
) as crawler:
    result = await crawler.arun(url="https://target-site.com")

SOCKS5 Proxy Support

async with AsyncWebCrawler(
    proxy="socks5://username:password@proxy-server:1080"
) as crawler:
    result = await crawler.arun(url="https://target-site.com")

Rotating Proxies

For large-scale scraping, rotate through a pool of residential proxies or mobile proxies:

import random

proxy_pool = [
    "http://user:pass@proxy1:8080",
    "http://user:pass@proxy2:8080",
    "http://user:pass@proxy3:8080",
]

async def crawl_with_rotation(urls):
    for url in urls:
        proxy = random.choice(proxy_pool)
        async with AsyncWebCrawler(proxy=proxy) as crawler:
            result = await crawler.arun(url=url)
            if result.success:
                yield result

Using Proxy Provider APIs

Many proxy providers offer rotating endpoints that handle rotation automatically:

# Single endpoint that rotates automatically
async with AsyncWebCrawler(
    proxy="http://customer-id:password@gate.provider.com:7777"
) as crawler:
    # Each request gets a different IP
    for url in urls:
        result = await crawler.arun(url=url)

Session Management & Multi-Page Flows

Persistent Sessions

For workflows that require maintaining state across pages (login flows, pagination):

async with AsyncWebCrawler() as crawler:
    # Step 1: Login
    await crawler.arun(
        url="https://example.com/login",
        session_id="my_session",
        js_code=[
            "document.querySelector('#email').value = 'user@example.com'",
            "document.querySelector('#password').value = 'password123'",
            "document.querySelector('form').submit()"
        ],
        wait_for="css:.dashboard"
    )

    # Step 2: Navigate to data page (session cookies preserved)
    result = await crawler.arun(
        url="https://example.com/dashboard/data",
        session_id="my_session"
    )

    print(result.markdown)

Handling Pagination

async def crawl_paginated(base_url, max_pages=10):
    all_content = []

    async with AsyncWebCrawler() as crawler:
        for page in range(1, max_pages + 1):
            result = await crawler.arun(
                url=f"{base_url}?page={page}",
                session_id="pagination_session",
                extraction_strategy=my_strategy
            )

            if not result.success or not result.extracted_content:
                break

            all_content.extend(json.loads(result.extracted_content))

    return all_content

Chunking Strategies

When feeding content to LLMs, you need to break pages into manageable chunks.

Available Chunking Methods

from crawl4ai.chunking_strategy import (
    RegexChunking,
    SlidingWindowChunking,
    OverlappingWindowChunking,
    FixedLengthWordChunking
)

# Split by headers or paragraphs
regex_chunker = RegexChunking(patterns=[r'\n#{1,3}\s'])

# Fixed-size sliding window
sliding_chunker = SlidingWindowChunking(
    window_size=500,
    step=250
)

# Overlapping windows for context preservation
overlap_chunker = OverlappingWindowChunking(
    window_size=1000,
    overlap=200
)

These chunking strategies work with the LLM extraction strategy to process long pages efficiently.
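
In the legacy-style API used throughout this guide, the chunker is passed to arun() alongside the extraction strategy. A sketch, assuming a chunking_strategy parameter as in older releases (newer ones configure chunking on the strategy itself):

result = await crawler.arun(
    url="https://example.com/long-article",
    chunking_strategy=overlap_chunker,
    extraction_strategy=strategy
)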

Advanced Configuration

Browser Configuration

from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_config = BrowserConfig(
    headless=True,
    browser_type="chromium",
    viewport_width=1920,
    viewport_height=1080,
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    ignore_https_errors=True,
    java_script_enabled=True
)

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(url="https://example.com")

Crawl Configuration

from crawl4ai import CrawlerRunConfig

run_config = CrawlerRunConfig(
    word_count_threshold=50,        # Minimum words per content block
    bypass_cache=True,               # Don't use cached results
    page_timeout=30000,              # 30 second timeout
    wait_for="css:.content-loaded",  # Wait for this selector
    screenshot=True,                 # Capture screenshot
    pdf=True,                        # Generate PDF
    remove_overlay_elements=True     # Remove popups/modals
)

result = await crawler.arun(
    url="https://example.com",
    config=run_config
)
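
When screenshot or pdf are enabled, the captures come back on the result object. A sketch, assuming screenshot is a base64-encoded string and pdf is raw bytes, as in current releases:

import base64

if result.screenshot:
    with open("page.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))

if result.pdf:
    with open("page.pdf", "wb") as f:
        f.write(result.pdf)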

Custom Headers and Cookies

result = await crawler.arun(
    url="https://example.com",
    headers={
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": "https://google.com"
    },
    cookies=[
        {"name": "session", "value": "abc123", "domain": "example.com"}
    ]
)

Performance Optimization

Memory Management

When crawling many pages, manage browser instances carefully:

async def crawl_large_batch(urls, batch_size=10):
    results = []

    for i in range(0, len(urls), batch_size):
        batch = urls[i:i + batch_size]

        async with AsyncWebCrawler() as crawler:
            batch_results = await crawler.arun_many(
                urls=batch,
                max_concurrent=batch_size
            )
            results.extend(batch_results)

    return results

Caching Strategy

Crawl4ai includes built-in caching. Use it wisely:

# First crawl: results are cached
result = await crawler.arun(url="https://example.com")

# Second crawl: uses cache (fast)
result = await crawler.arun(url="https://example.com")

# Force fresh crawl
result = await crawler.arun(url="https://example.com", bypass_cache=True)

Rate Limiting

Be respectful to target sites and avoid triggering anti-bot systems:

import asyncio

async def crawl_with_rate_limit(urls, delay=2.0):
    async with AsyncWebCrawler() as crawler:
        for url in urls:
            result = await crawler.arun(url=url)
            if result.success:
                yield result
            await asyncio.sleep(delay)

Production Best Practices

Error Handling and Retries

import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_with_retry(url, max_retries=3):
    async with AsyncWebCrawler() as crawler:
        for attempt in range(max_retries):
            try:
                result = await crawler.arun(
                    url=url,
                    bypass_cache=True,
                    page_timeout=30000
                )

                if result.success:
                    return result

            except Exception as e:
                print(f"Attempt {attempt + 1} failed: {e}")

            await asyncio.sleep(2 ** attempt)  # Exponential backoff

    return None

Logging and Monitoring

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("crawl4ai_pipeline")

async def monitored_crawl(urls):
    success_count = 0
    fail_count = 0

    async with AsyncWebCrawler(verbose=True) as crawler:
        for url in urls:
            result = await crawler.arun(url=url)

            if result.success:
                success_count += 1
                logger.info(f"OK: {url} ({len(result.markdown)} chars)")
            else:
                fail_count += 1
                logger.error(f"FAIL: {url} — {result.error_message}")

    logger.info(f"Done: {success_count} success, {fail_count} failed")

Combining with Data Pipelines

Crawl4ai works well with existing data processing tools:

import pandas as pd
import json
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def scrape_to_dataframe(urls, schema):
    strategy = LLMExtractionStrategy(
        provider="openai/gpt-4o-mini",
        api_token="sk-your-key",
        schema=schema
    )

    all_records = []

    async with AsyncWebCrawler() as crawler:
        for url in urls:
            result = await crawler.arun(url=url, extraction_strategy=strategy)
            if result.extracted_content:
                records = json.loads(result.extracted_content)
                all_records.extend(records)

    return pd.DataFrame(all_records)

FAQ

Is Crawl4ai really free?

Yes. Crawl4ai is licensed under Apache 2.0 and is completely free to use, modify, and distribute. The only costs you might incur are for LLM API calls if you use the LLM extraction strategy with a paid provider. You can avoid even those costs by using local models through Ollama.

How does Crawl4ai compare to Firecrawl?

Crawl4ai is free and self-hosted, while Firecrawl is a managed service with per-credit pricing. Crawl4ai gives you more control and zero ongoing costs but requires you to manage infrastructure. Firecrawl provides a simpler API experience with built-in scaling. See our detailed Crawl4ai vs Firecrawl comparison for a full breakdown.

Can Crawl4ai handle JavaScript-rendered pages?

Yes. Crawl4ai uses Playwright with Chromium under the hood, so it fully renders JavaScript-heavy pages. You can also execute custom JavaScript before extraction and wait for specific elements to load.

Do I need proxies with Crawl4ai?

For small-scale projects or public data, you may not need proxies. For any serious scraping operation — especially against sites with anti-bot protection — rotating residential proxies or mobile proxies are strongly recommended. Check our proxy provider comparisons for options.

Can I use Crawl4ai for RAG pipelines?

Absolutely. Crawl4ai’s clean markdown output is ideal for RAG pipelines. Pair it with a vector database and an embedding model to build a knowledge base from live web data.
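
A rough sketch of that flow is below. Only the Crawl4ai calls come from this guide; the commented-out upsert stands in for whatever embedding model and vector store you choose:

import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.chunking_strategy import OverlappingWindowChunking

async def build_corpus(urls):
    chunker = OverlappingWindowChunking(window_size=1000, overlap=200)
    corpus = []

    async with AsyncWebCrawler() as crawler:
        for url in urls:
            result = await crawler.arun(url=url)
            if result.success:
                for chunk in chunker.chunk(result.markdown):
                    # Hypothetical: vector_db.upsert(embed(chunk), {"url": url})
                    corpus.append({"url": url, "text": chunk})

    return corpus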

