Crawl4ai: Open Source AI Crawler Tutorial
The web scraping landscape has shifted dramatically toward AI-powered tools, and Crawl4ai stands out as the leading open-source option. With over 40,000 GitHub stars, zero API costs, and full local execution, it has become the go-to crawler for developers who want AI-enhanced data extraction without vendor lock-in.
This comprehensive guide covers everything you need to build production-grade scraping pipelines with Crawl4ai — from basic setup to advanced techniques like proxy rotation, LLM-powered extraction, and session management for complex multi-page workflows.
Table of Contents
- What Is Crawl4ai?
- Installation & Setup
- Core Crawling Concepts
- Extraction Strategies
- LLM-Powered Structured Extraction
- Proxy Integration
- Session Management & Multi-Page Flows
- Chunking Strategies
- Advanced Configuration
- Performance Optimization
- Production Best Practices
- FAQ
What Is Crawl4ai?
Crawl4ai is a free, open-source Python library that combines headless browser automation with intelligent content extraction. Built on Playwright for browser rendering and compatible with any LLM for structured data extraction, it gives you the power of paid scraping APIs without the recurring costs.
Feature Overview
| Feature | Details |
|---|---|
| Cost | Free and open source (Apache 2.0) |
| Browser Engine | Playwright (Chromium-based) |
| Async Support | Full asyncio integration |
| LLM Integration | OpenAI, Anthropic, Ollama, any OpenAI-compatible API |
| Output Formats | Markdown, structured JSON, raw HTML |
| JavaScript Handling | Full rendering with custom script execution |
| Proxy Support | HTTP, HTTPS, SOCKS5 with rotation |
| Session Management | Persistent sessions for multi-step flows |
| Anti-Detection | Stealth mode, custom headers, fingerprint management |
Crawl4ai vs Paid Alternatives
The primary advantage of Crawl4ai is cost — there are no API credits to purchase, no rate limits imposed by a third-party service, and your data stays on your infrastructure. The tradeoff is that you manage the infrastructure yourself, including browser instances, proxy rotation, and scaling.
For teams already running their own servers, this is often a net positive. For those who prefer managed services, tools like Firecrawl offer a more hands-off approach at a per-credit cost.
Installation & Setup
System Requirements
- Python 3.9 or higher
- 2 GB RAM minimum (4 GB recommended for concurrent crawling)
- Chromium browser (installed automatically by Playwright)
Basic Installation
```bash
pip install crawl4ai
```

After installation, set up the browser:

```bash
crawl4ai-setup
```

This downloads and configures the Chromium browser that Crawl4ai uses for rendering.
Installation with LLM Support
If you plan to use LLM-powered extraction:
```bash
pip install crawl4ai[all]
```

Docker Installation
For production deployments, Docker provides a consistent environment:
```bash
docker pull unclecode/crawl4ai
docker run -p 11235:11235 unclecode/crawl4ai
```

The Docker image exposes a REST API on port 11235 that mirrors the Python API.
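For a quick smoke test of the containerized service you can post a request from Python. This is a minimal sketch that assumes a /crawl endpoint accepting a JSON body with a urls list; the exact request and response schema can vary between image versions, so verify against the Docker documentation before relying on it:

```python
import requests

# Assumed endpoint and payload shape; confirm against the crawl4ai Docker docs.
resp = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": ["https://example.com"]},
    timeout=60,
)
print(resp.json())
```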
Verifying Installation
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def test():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(f"Status: {result.success}")
        print(f"Content length: {len(result.markdown)}")

asyncio.run(test())
```

Core Crawling Concepts
Basic Page Crawling
Every crawl starts with the AsyncWebCrawler context manager and the arun method:
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_page():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://example.com/pricing",
            bypass_cache=True
        )
        if result.success:
            # Clean markdown content
            print(result.markdown)
            # Extracted links
            print(result.links)
            # Page metadata
            print(result.metadata)
        else:
            print(f"Error: {result.error_message}")

asyncio.run(crawl_page())
```

The CrawlResult Object
Every arun() call returns a CrawlResult with these key attributes:
| Attribute | Type | Description |
|---|---|---|
| success | bool | Whether the crawl succeeded |
| markdown | str | Clean markdown content |
| cleaned_html | str | Cleaned HTML (boilerplate removed) |
| html | str | Raw page HTML |
| links | dict | Internal and external links |
| media | dict | Images, videos, and audio found |
| metadata | dict | Page title, description, etc. |
| extracted_content | str | Structured data (if an extraction strategy was used) |
| error_message | str | Error details if the crawl failed |
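The links and media dictionaries group their entries by category. A short sketch of inspecting them; the "internal", "external", and "images" keys reflect recent Crawl4ai output, so verify against your installed version:

```python
# Inspect links and media returned from a previous crawl result.
# Key names ("internal", "images") are assumptions based on recent releases.
for link in result.links.get("internal", []):
    print(link.get("href"))

for image in result.media.get("images", []):
    print(image.get("src"))
```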
Crawling Multiple URLs
Crawl4ai supports batch crawling with concurrency control:
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_batch():
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
    ]
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=urls,
            max_concurrent=5
        )
        for result in results:
            if result.success:
                print(f"{result.url}: {len(result.markdown)} chars")

asyncio.run(crawl_batch())
```

JavaScript Execution
For pages that require interaction before content loads:
```python
result = await crawler.arun(
    url="https://example.com/dynamic-content",
    js_code=[
        "document.querySelector('.load-more').click()",
        "await new Promise(r => setTimeout(r, 2000))",
        "document.querySelector('.show-all').click()"
    ],
    wait_for="css:.results-loaded"
)
```

The wait_for parameter accepts CSS selectors or JavaScript expressions to ensure the page is fully loaded before extraction.
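JavaScript conditions use the js: prefix. A small sketch; the selector and item count here are illustrative assumptions, not taken from the page above:

```python
# Wait until a JavaScript expression evaluates to true before extracting.
# The '.product-card' selector and the count of 10 are placeholder assumptions.
result = await crawler.arun(
    url="https://example.com/dynamic-content",
    wait_for="js:() => document.querySelectorAll('.product-card').length >= 10"
)
```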
Extraction Strategies
Crawl4ai provides multiple extraction strategies depending on your needs.
CSS-Based Extraction
For well-structured pages where you know the DOM layout:
```python
from crawl4ai.extraction_strategy import CssExtractionStrategy
import json

schema = {
    "name": "Product Listing",
    "baseSelector": "div.product-card",
    "fields": [
        {"name": "title", "selector": "h2.product-title", "type": "text"},
        {"name": "price", "selector": "span.price", "type": "text"},
        {"name": "url", "selector": "a.product-link", "type": "attribute", "attribute": "href"},
        {"name": "image", "selector": "img", "type": "attribute", "attribute": "src"}
    ]
}

strategy = CssExtractionStrategy(schema)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com/products",
        extraction_strategy=strategy
    )
    products = json.loads(result.extracted_content)
    for product in products:
        print(f"{product['title']}: {product['price']}")
```

JsonCssExtractionStrategy
A more flexible variant that handles nested structures:
```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Articles",
    "baseSelector": "article.post",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "author", "selector": ".author-name", "type": "text"},
        {"name": "date", "selector": "time", "type": "attribute", "attribute": "datetime"},
        {
            "name": "tags",
            "selector": ".tag",
            "type": "list",
            "fields": [
                {"name": "tag_name", "selector": "", "type": "text"}
            ]
        }
    ]
}

strategy = JsonCssExtractionStrategy(schema)
```

LLM-Powered Structured Extraction
The most powerful feature of Crawl4ai is its ability to use any LLM to extract structured data from unstructured web content.
Setting Up LLM Extraction
```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel
from typing import List

class Product(BaseModel):
    name: str
    price: float
    currency: str
    rating: float
    review_count: int
    in_stock: bool

class ProductPage(BaseModel):
    products: List[Product]

strategy = LLMExtractionStrategy(
    provider="openai/gpt-4o-mini",
    api_token="sk-your-api-key",
    schema=ProductPage.model_json_schema(),
    instruction="Extract all product information from this page. Convert prices to numeric values."
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com/products",
        extraction_strategy=strategy
    )
    products = ProductPage.model_validate_json(result.extracted_content)
    for p in products.products:
        print(f"{p.name}: {p.currency}{p.price} ({p.rating}★)")
```

Using Local LLMs with Ollama
To avoid API costs entirely, use a local model:
```python
strategy = LLMExtractionStrategy(
    provider="ollama/llama3.1",
    api_token="no-token",
    api_base="http://localhost:11434/v1",
    schema=ProductPage.model_json_schema(),
    instruction="Extract product details from the page content."
)
```

Using Anthropic Claude
```python
strategy = LLMExtractionStrategy(
    provider="anthropic/claude-sonnet-4-20250514",
    api_token="sk-ant-your-key",
    schema=ProductPage.model_json_schema(),
    instruction="Extract all products with their details."
)
```

Proxy Integration
For any serious scraping project, proxies are essential to avoid IP blocks and access geo-restricted content.
Single Proxy Configuration
```python
async with AsyncWebCrawler(
    proxy="http://username:password@proxy-server:8080"
) as crawler:
    result = await crawler.arun(url="https://target-site.com")
```

SOCKS5 Proxy Support
```python
async with AsyncWebCrawler(
    proxy="socks5://username:password@proxy-server:1080"
) as crawler:
    result = await crawler.arun(url="https://target-site.com")
```

Rotating Proxies
For large-scale scraping, rotate through a pool of residential proxies or mobile proxies:
```python
import random

proxy_pool = [
    "http://user:pass@proxy1:8080",
    "http://user:pass@proxy2:8080",
    "http://user:pass@proxy3:8080",
]

async def crawl_with_rotation(urls):
    for url in urls:
        proxy = random.choice(proxy_pool)
        async with AsyncWebCrawler(proxy=proxy) as crawler:
            result = await crawler.arun(url=url)
            if result.success:
                yield result
```

Using Proxy Provider APIs
Many proxy providers offer rotating endpoints that handle rotation automatically:
```python
# Single endpoint that rotates automatically
async with AsyncWebCrawler(
    proxy="http://customer-id:password@gate.provider.com:7777"
) as crawler:
    # Each request gets a different IP
    for url in urls:
        result = await crawler.arun(url=url)
```

Session Management & Multi-Page Flows
Persistent Sessions
For workflows that require maintaining state across pages (login flows, pagination):
```python
async with AsyncWebCrawler() as crawler:
    # Step 1: Login
    await crawler.arun(
        url="https://example.com/login",
        session_id="my_session",
        js_code=[
            "document.querySelector('#email').value = 'user@example.com'",
            "document.querySelector('#password').value = 'password123'",
            "document.querySelector('form').submit()"
        ],
        wait_for="css:.dashboard"
    )

    # Step 2: Navigate to data page (session cookies preserved)
    result = await crawler.arun(
        url="https://example.com/dashboard/data",
        session_id="my_session"
    )
    print(result.markdown)
```

Handling Pagination
```python
import json

async def crawl_paginated(base_url, max_pages=10):
    all_content = []
    async with AsyncWebCrawler() as crawler:
        for page in range(1, max_pages + 1):
            result = await crawler.arun(
                url=f"{base_url}?page={page}",
                session_id="pagination_session",
                extraction_strategy=my_strategy
            )
            if not result.success or not result.extracted_content:
                break
            all_content.extend(json.loads(result.extracted_content))
    return all_content
```

Chunking Strategies
When feeding content to LLMs, you need to break pages into manageable chunks.
Available Chunking Methods
```python
from crawl4ai.chunking_strategy import (
    RegexChunking,
    SlidingWindowChunking,
    OverlappingWindowChunking,
    FixedLengthWordChunking
)

# Split by headers or paragraphs
regex_chunker = RegexChunking(patterns=[r'\n#{1,3}\s'])

# Fixed-size sliding window
sliding_chunker = SlidingWindowChunking(
    window_size=500,
    step=250
)

# Overlapping windows for context preservation
overlap_chunker = OverlappingWindowChunking(
    window_size=1000,
    overlap=200
)
```

These chunking strategies work with the LLM extraction strategy to process long pages efficiently.
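As a rough sketch of how the two fit together, a chunker can be passed to arun() next to the extraction strategy. This assumes the chunking_strategy argument accepted by earlier Crawl4ai releases; the URL and window sizes are placeholders:

```python
# Minimal sketch: combine chunking with LLM extraction in one crawl.
# Assumes arun() accepts a chunking_strategy argument, as in earlier releases.
from crawl4ai.chunking_strategy import OverlappingWindowChunking

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com/long-article",   # placeholder URL
        extraction_strategy=strategy,              # LLM strategy defined earlier
        chunking_strategy=OverlappingWindowChunking(window_size=1000, overlap=200)
    )
```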
Advanced Configuration
Browser Configuration
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_config = BrowserConfig(
    headless=True,
    browser_type="chromium",
    viewport_width=1920,
    viewport_height=1080,
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    ignore_https_errors=True,
    java_script_enabled=True
)

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(url="https://example.com")
```

Crawl Configuration
```python
from crawl4ai import CrawlerRunConfig

run_config = CrawlerRunConfig(
    word_count_threshold=50,          # Minimum words per content block
    bypass_cache=True,                # Don't use cached results
    page_timeout=30000,               # 30 second timeout
    wait_for="css:.content-loaded",   # Wait for this selector
    screenshot=True,                  # Capture screenshot
    pdf=True,                         # Generate PDF
    remove_overlay_elements=True      # Remove popups/modals
)

result = await crawler.arun(
    url="https://example.com",
    config=run_config
)
```

Custom Headers and Cookies
```python
result = await crawler.arun(
    url="https://example.com",
    headers={
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": "https://google.com"
    },
    cookies=[
        {"name": "session", "value": "abc123", "domain": "example.com"}
    ]
)
```

Performance Optimization
Memory Management
When crawling many pages, manage browser instances carefully:
```python
async def crawl_large_batch(urls, batch_size=10):
    results = []
    for i in range(0, len(urls), batch_size):
        batch = urls[i:i + batch_size]
        async with AsyncWebCrawler() as crawler:
            batch_results = await crawler.arun_many(
                urls=batch,
                max_concurrent=batch_size
            )
            results.extend(batch_results)
    return results
```

Caching Strategy
Crawl4ai includes built-in caching. Use it wisely:
```python
# First crawl: results are cached
result = await crawler.arun(url="https://example.com")

# Second crawl: uses cache (fast)
result = await crawler.arun(url="https://example.com")

# Force fresh crawl
result = await crawler.arun(url="https://example.com", bypass_cache=True)
```

Rate Limiting
Be respectful to target sites and avoid triggering anti-bot systems:
```python
import asyncio

async def crawl_with_rate_limit(urls, delay=2.0):
    async with AsyncWebCrawler() as crawler:
        for url in urls:
            result = await crawler.arun(url=url)
            if result.success:
                yield result
            await asyncio.sleep(delay)
```

Production Best Practices
Error Handling and Retries
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_with_retry(url, max_retries=3):
    async with AsyncWebCrawler() as crawler:
        for attempt in range(max_retries):
            try:
                result = await crawler.arun(
                    url=url,
                    bypass_cache=True,
                    page_timeout=30000
                )
                if result.success:
                    return result
            except Exception as e:
                print(f"Attempt {attempt + 1} failed: {e}")
                await asyncio.sleep(2 ** attempt)  # Exponential backoff
    return None
```

Logging and Monitoring
```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("crawl4ai_pipeline")

async def monitored_crawl(urls):
    success_count = 0
    fail_count = 0
    async with AsyncWebCrawler(verbose=True) as crawler:
        for url in urls:
            result = await crawler.arun(url=url)
            if result.success:
                success_count += 1
                logger.info(f"OK: {url} ({len(result.markdown)} chars)")
            else:
                fail_count += 1
                logger.error(f"FAIL: {url} — {result.error_message}")
    logger.info(f"Done: {success_count} success, {fail_count} failed")
```

Combining with Data Pipelines
Crawl4ai works well with existing data processing tools:
```python
import pandas as pd
import json
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def scrape_to_dataframe(urls, schema):
    strategy = LLMExtractionStrategy(
        provider="openai/gpt-4o-mini",
        api_token="sk-your-key",
        schema=schema
    )
    all_records = []
    async with AsyncWebCrawler() as crawler:
        for url in urls:
            result = await crawler.arun(url=url, extraction_strategy=strategy)
            if result.extracted_content:
                records = json.loads(result.extracted_content)
                all_records.extend(records)
    return pd.DataFrame(all_records)
```

FAQ
Is Crawl4ai really free?
Yes. Crawl4ai is licensed under Apache 2.0 and is completely free to use, modify, and distribute. The only costs you might incur are for LLM API calls if you use the LLM extraction strategy with a paid provider. You can avoid even those costs by using local models through Ollama.
How does Crawl4ai compare to Firecrawl?
Crawl4ai is free and self-hosted, while Firecrawl is a managed service with per-credit pricing. Crawl4ai gives you more control and zero ongoing costs but requires you to manage infrastructure. Firecrawl provides a simpler API experience with built-in scaling. See our detailed Crawl4ai vs Firecrawl comparison for a full breakdown.
Can Crawl4ai handle JavaScript-rendered pages?
Yes. Crawl4ai uses Playwright with Chromium under the hood, so it fully renders JavaScript-heavy pages. You can also execute custom JavaScript before extraction and wait for specific elements to load.
Do I need proxies with Crawl4ai?
For small-scale projects or public data, you may not need proxies. For any serious scraping operation — especially against sites with anti-bot protection — rotating residential proxies or mobile proxies are strongly recommended. Check our proxy provider comparisons for options.
Can I use Crawl4ai for RAG pipelines?
Absolutely. Crawl4ai’s clean markdown output is ideal for RAG pipelines. Pair it with a vector database and an embedding model to build a knowledge base from live web data.
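A minimal ingestion sketch follows, using OpenAI embeddings and ChromaDB purely as illustrative choices and assuming the chunker exposes a chunk() method; swap in whatever embedding model and vector store you already run:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.chunking_strategy import OverlappingWindowChunking
from openai import OpenAI   # illustrative embedding provider
import chromadb             # illustrative vector store

async def ingest(url: str):
    # 1. Crawl the page to clean markdown
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)

    # 2. Chunk the markdown for embedding (chunk() assumed; window sizes are placeholders)
    chunker = OverlappingWindowChunking(window_size=500, overlap=100)
    chunks = chunker.chunk(result.markdown)

    # 3. Embed and store (model and collection names are placeholders)
    client = OpenAI()
    embeddings = [
        item.embedding
        for item in client.embeddings.create(
            model="text-embedding-3-small", input=chunks
        ).data
    ]
    collection = chromadb.Client().get_or_create_collection("web_knowledge_base")
    collection.add(
        ids=[f"{url}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embeddings,
    )

asyncio.run(ingest("https://example.com/docs"))
```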
Related Reading
- AI Web Scraper with Python: Build Your Own
- Best AI Web Scrapers 2026: Complete Comparison
- Agentic Browsers Explained: Browserbase, Browser Use, and Proxy Infrastructure
- Agentic Browsers Explained: The Future of AI + Proxies in 2026
- How AI Agents Use Proxies for Real-Time Web Data Collection in 2026
- Mobile Proxies for AI Data Collection: Web Scraping for Training Data