What Is Crawl4ai? Setup & Complete Guide
Crawl4ai is a free, open-source Python framework for AI-powered web crawling and data extraction. It has rapidly grown to become one of the most starred web scraping repositories on GitHub, and for good reason — it delivers capabilities that rival paid services while costing nothing to run.
This guide explains what Crawl4ai is, how it works, how to set it up, and when you should (and shouldn’t) use it.
Crawl4ai Overview
Crawl4ai was created to solve a specific problem: getting clean, structured data from websites into AI pipelines without the typical pain of web scraping. It combines a headless browser for rendering JavaScript, intelligent algorithms for identifying main content, and optional LLM integration for extracting structured data.
What Makes It Different
Traditional web scrapers return raw HTML. You then spend hours writing CSS selectors, handling edge cases, and cleaning the output. When the target website changes its layout, your selectors break and you start over.
Crawl4ai takes a different approach:
- Renders the page in a real browser (Playwright + Chromium)
- Identifies main content automatically, stripping navigation, ads, and boilerplate
- Converts to clean markdown that’s immediately useful
- Optionally extracts structured data using LLMs or CSS selectors
The result is clean data from any website with minimal code and minimal maintenance.
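To make "minimal code" concrete, here is roughly the smallest useful crawl (the URL is a placeholder):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        # One call: render the page, strip boilerplate, convert to markdown
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)

asyncio.run(main())
```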
Key Capabilities
| Capability | Details |
|---|---|
| Browser Rendering | Full Chromium via Playwright — handles any JavaScript framework |
| Smart Extraction | Automatic content identification and cleaning |
| Markdown Output | Clean, formatted markdown from any web page |
| CSS Extraction | Schema-based structured extraction using CSS selectors |
| LLM Extraction | Use OpenAI, Anthropic, Ollama, or any LiteLLM model |
| Deep Crawling | BFS and DFS strategies for multi-page crawling |
| Session Management | Persistent sessions for login flows and pagination |
| Async Architecture | Built on asyncio for high-performance concurrent crawling |
| Caching | Built-in page cache to avoid redundant downloads |
| Proxy Support | HTTP and SOCKS proxy configuration |
| Docker Support | Ready-made container for deployment |
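Several of the capabilities above are switched on through `CrawlerRunConfig`. A minimal sketch of the built-in caching, assuming the `CacheMode` enum exported by recent releases (check your installed version):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def cached_crawl():
    # Reuse the built-in page cache while iterating on extraction logic;
    # switch to CacheMode.BYPASS to force a fresh download.
    config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print(result.markdown[:500])

asyncio.run(cached_crawl())
```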
How Crawl4ai Works
Here’s the architecture in simplified form:
```text
Your Code → Crawl4ai → Playwright Browser → Target Website
                         ↓
             Content Extraction (auto or LLM)
                         ↓
             Clean Output (markdown, JSON, or both)
```

The Extraction Pipeline
1. URL Input — You provide one or more URLs
2. Browser Launch — Crawl4ai starts a headless Chromium instance
3. Page Load — The browser navigates to the URL, executing all JavaScript
4. Wait Conditions — Optional waiting for specific elements or time delays
5. JS Execution — Optional custom JavaScript (scrolling, clicking, etc.; see the sketch after this list)
6. HTML Capture — The fully rendered DOM is captured
7. Content Filtering — Main content is identified; boilerplate is removed
8. Markdown Conversion — HTML is converted to clean markdown
9. Optional Extraction — CSS or LLM strategies extract structured fields
10. Output — Clean markdown and/or structured JSON returned to your code
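Steps 4 and 5 correspond to the `wait_for` and `js_code` options on `CrawlerRunConfig`. A minimal sketch, assuming a listings page that loads content dynamically (the selector and URL are illustrative):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def crawl_dynamic():
    config = CrawlerRunConfig(
        # Step 4: wait until the target element is present in the DOM
        wait_for="css:.results-list",
        # Step 5: run custom JavaScript before the HTML is captured
        js_code="window.scrollTo(0, document.body.scrollHeight);",
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/listings", config=config)
        print(result.markdown[:500])

asyncio.run(crawl_dynamic())
```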
Setting Up Crawl4ai
System Requirements
- Python 3.9 or higher
- 2GB+ RAM (4GB+ recommended for concurrent crawling)
- macOS, Linux, or Windows
Installation
Standard installation:
```bash
pip install crawl4ai
crawl4ai-setup  # Downloads Chromium browser
```

With all optional dependencies:
```bash
pip install "crawl4ai[all]"
crawl4ai-setup
```

Docker installation:
```bash
docker pull unclecode/crawl4ai:latest
docker run -d -p 11235:11235 --name crawl4ai unclecode/crawl4ai:latest
```
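Once the container is running, you talk to it over HTTP instead of importing the library. The REST schema varies between image versions, so treat this as a hedged sketch: it assumes a `/crawl` endpoint accepting a JSON list of URLs, which is how recent images behave. Check the API docs served by your container.

```python
import requests

# Assumed payload shape; verify against the docs for your image version.
resp = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": ["https://example.com"]},
)
resp.raise_for_status()
print(resp.json())
```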
Quick Verification

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def verify():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://httpbin.org/html")
        assert result.success, f"Failed: {result.error_message}"
        assert len(result.markdown) > 100, "Content too short"
        print("Crawl4ai is working correctly!")
        print(f"Extracted {len(result.markdown)} characters of markdown")

asyncio.run(verify())
```

Practical Examples
Example 1: Scrape a Blog Post
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def scrape_blog():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/blog/interesting-article"
        )
        if result.success:
            print(f"Title: {result.metadata.get('title')}")
            print(f"Word count: {len(result.markdown.split())}")
            print(f"\n{result.markdown[:2000]}")
            # Save to file
            with open("article.md", "w") as f:
                f.write(result.markdown)

asyncio.run(scrape_blog())
```

Example 2: Extract Product Data
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_products():
    schema = {
        "name": "Products",
        "baseSelector": ".product-item",
        "fields": [
            {"name": "name", "selector": ".product-name", "type": "text"},
            {"name": "price", "selector": ".product-price", "type": "text"},
            {"name": "rating", "selector": ".stars", "type": "attribute", "attribute": "data-rating"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/shop",
            extraction_strategy=JsonCssExtractionStrategy(schema)
        )
        products = json.loads(result.extracted_content)
        for p in products:
            print(f"{p['name']}: {p['price']} (rating: {p['rating']})")

asyncio.run(extract_products())
```

Example 3: Crawl Documentation for RAG
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def build_docs_index():
    strategy = BFSDeepCrawlStrategy(
        max_depth=3,
        max_pages=200,
        include_patterns=["/docs/*"]
    )
    config = CrawlerRunConfig(deep_crawl_strategy=strategy)
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun(
            url="https://docs.example.com",
            config=config
        )
        documents = []
        for r in results:
            if r.success and len(r.markdown) > 100:
                documents.append({
                    "url": r.url,
                    "title": r.metadata.get("title", ""),
                    "content": r.markdown
                })
        with open("docs_index.json", "w") as f:
            json.dump(documents, f, indent=2)
        print(f"Indexed {len(documents)} documentation pages")

asyncio.run(build_docs_index())
```

For a complete walkthrough of building AI knowledge bases, see our RAG pipeline with web scraping guide.
Example 4: LLM-Powered Extraction with Ollama
from crawl4ai.extraction_strategy import LLMExtractionStrategy
async def smart_extract():
strategy = LLMExtractionStrategy(
provider="ollama/llama3.2",
instruction="Extract all job listings with title, company, location, salary range, and required skills as a list",
base_url="http://localhost:11434"
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com/jobs",
extraction_strategy=strategy
)
jobs = json.loads(result.extracted_content)
for job in jobs:
print(f"{job['title']} at {job['company']} — {job.get('salary_range', 'Not listed')}")
asyncio.run(smart_extract())Configuring Proxies
For scraping at scale or accessing region-specific content:
```python
# Single proxy
async with AsyncWebCrawler(
    proxy="http://user:pass@proxy.example.com:8080"
) as crawler:
    result = await crawler.arun(url="https://target-site.com")

# Or via environment variable:
# export HTTP_PROXY=http://user:pass@proxy.example.com:8080
```

Using residential proxies significantly improves success rates on sites with anti-bot protection. For recommendations, see our guide on proxy setup for web scraping.
Performance Tips
- Use `arun_many` for batch processing — much faster than sequential `arun` calls (see the sketch after this list)
- Enable caching for development — avoid re-downloading pages while iterating on extraction logic
- Limit concurrency — start with 3-5 concurrent tabs and increase based on your system's capacity
- Use `fit_markdown` — it contains only the main content, whereas `markdown` may include navigation elements
- Set appropriate wait conditions — use `wait_for="css:.target-element"` instead of fixed delays when possible
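Combining the first three tips, a hedged sketch using `arun_many` with caching enabled (exact concurrency controls vary by release, so check your version's defaults):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def batch_crawl(urls):
    config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
    async with AsyncWebCrawler() as crawler:
        # Crawls the batch concurrently instead of looping over arun()
        results = await crawler.arun_many(urls=urls, config=config)
        for r in results:
            print(r.url, len(r.markdown) if r.success else r.error_message)

asyncio.run(batch_crawl([f"https://example.com/page/{i}" for i in range(10)]))
```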
When to Use Crawl4ai (and When Not To)
Crawl4ai is ideal when:
- You want a free, self-hosted solution
- Data privacy is important (nothing leaves your machine)
- You need LLM-powered extraction with local models
- You’re building AI/ML pipelines that need clean training data
- You want full control over the crawling infrastructure
Consider alternatives when:
- You need the simplest possible setup (try Firecrawl)
- You need advanced anti-bot bypasses out of the box
- You’re scraping millions of pages monthly (consider Scrapy for raw throughput)
- You want a visual, no-code scraping solution
Crawl4ai vs Firecrawl
This is the most common comparison. Here’s the quick version:
| Factor | Crawl4ai | Firecrawl |
|---|---|---|
| Cost | Free forever | Free tier + paid plans |
| Hosting | Self-hosted | Cloud or self-hosted |
| Anti-Bot | Basic | Advanced |
| LLM Flexibility | Any model (local or cloud) | Cloud-only on hosted |
| Ease of Setup | Moderate | Very easy |
| Community | Large open-source | Growing |
For the full comparison, read our Crawl4ai vs Firecrawl deep dive.
Frequently Asked Questions
What is Crawl4ai used for?
Crawl4ai is used for extracting clean, structured data from websites for AI applications. Common use cases include building knowledge bases for RAG systems, collecting training data for machine learning models, competitive intelligence gathering, content aggregation, and automated research. Its ability to output clean markdown makes it particularly popular for feeding data into LLM-based applications.
Is Crawl4ai better than Scrapy?
They serve different niches. Scrapy is better for high-volume, traditional web scraping where you need maximum throughput and distributed crawling. Crawl4ai is better when you need JavaScript rendering, AI-powered extraction, or clean markdown output for LLM pipelines. Many teams use both — Scrapy for simple HTML pages at scale, and Crawl4ai for complex, dynamic sites.
Does Crawl4ai work on Windows?
Yes. Crawl4ai supports Windows, macOS, and Linux. The installation process is the same across platforms: pip install crawl4ai followed by crawl4ai-setup. Docker installation is also available for all platforms.
Can Crawl4ai scrape sites behind a login?
Yes. Crawl4ai supports session management, which lets you persist browser state (cookies, localStorage) across multiple page loads. You can automate login flows using JavaScript execution, then continue crawling authenticated pages within the same session.
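A minimal sketch of that pattern, assuming the `session_id` and `js_code` options on `CrawlerRunConfig` and hypothetical login-form selectors:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

# Hypothetical selectors; adapt to the target site's login form.
LOGIN_JS = """
document.querySelector('#email').value = 'user@example.com';
document.querySelector('#password').value = 'secret';
document.querySelector('form').submit();
"""

async def crawl_behind_login():
    async with AsyncWebCrawler() as crawler:
        # First request: perform the login inside a named session
        await crawler.arun(
            url="https://example.com/login",
            config=CrawlerRunConfig(session_id="auth", js_code=LOGIN_JS),
        )
        # Later requests reuse the same browser state (cookies, localStorage)
        result = await crawler.arun(
            url="https://example.com/dashboard",
            config=CrawlerRunConfig(session_id="auth"),
        )
        print(result.markdown[:500])

asyncio.run(crawl_behind_login())
```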
How fast is Crawl4ai?
Speed depends on your hardware, internet connection, and target site. In benchmark tests, Crawl4ai can process 10-50 pages per minute with a single browser instance, and more with concurrent instances. It’s not as fast as raw HTTP scrapers like Scrapy (which can do hundreds per second), but it handles JavaScript-heavy sites that simpler tools cannot.
Conclusion
Crawl4ai is a powerful, free tool that democratizes AI-powered web scraping. It’s the right choice for developers who want full control, data privacy, and the flexibility to use any LLM for extraction.
The setup takes about 5 minutes, and the examples in this guide should have you extracting useful data within the hour. As your needs grow, explore the deep crawling, session management, and LLM extraction features to build sophisticated data pipelines.
For more on AI scraping tools, see our best AI web scrapers comparison or our in-depth Crawl4ai tutorial.
Related Reading
- AI Web Scraper with Python: Build Your Own
- Best AI Web Scrapers 2026: Complete Comparison
- Agentic Browsers Explained: Browserbase, Browser Use, and Proxy Infrastructure
- Agentic Browsers Explained: The Future of AI + Proxies in 2026
- How AI Agents Use Proxies for Real-Time Web Data Collection in 2026
- Mobile Proxies for AI Data Collection: Web Scraping for Training Data