What Is Crawl4ai? Setup & Complete Guide

Crawl4ai is a free, open-source Python framework for AI-powered web crawling and data extraction. It has rapidly grown to become one of the most starred web scraping repositories on GitHub, and for good reason — it delivers capabilities that rival paid services while costing nothing to run.

This guide explains what Crawl4ai is, how it works, how to set it up, and when you should (and shouldn’t) use it.

Crawl4ai Overview

Crawl4ai was created to solve a specific problem: getting clean, structured data from websites into AI pipelines without the typical pain of web scraping. It combines a headless browser for rendering JavaScript, intelligent algorithms for identifying main content, and optional LLM integration for extracting structured data.

What Makes It Different

Traditional web scrapers return raw HTML. You then spend hours writing CSS selectors, handling edge cases, and cleaning the output. When the target website changes its layout, your selectors break and you start over.

Crawl4ai takes a different approach:

  1. Renders the page in a real browser (Playwright + Chromium)
  2. Identifies main content automatically, stripping navigation, ads, and boilerplate
  3. Converts to clean markdown that’s immediately useful
  4. Optionally extracts structured data using LLMs or CSS selectors

The result is clean data from any website with minimal code and minimal maintenance.
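
In practice, those four steps collapse into a few lines. Here is a minimal sketch (the URL is a placeholder) that fetches a page and prints the resulting markdown:

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        # One call: render the page, strip boilerplate, convert to markdown
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)

asyncio.run(main())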

Key Capabilities

Capability | Details
Browser Rendering | Full Chromium via Playwright — handles any JavaScript framework
Smart Extraction | Automatic content identification and cleaning
Markdown Output | Clean, formatted markdown from any web page
CSS Extraction | Schema-based structured extraction using CSS selectors
LLM Extraction | Use OpenAI, Anthropic, Ollama, or any LiteLLM model
Deep Crawling | BFS and DFS strategies for multi-page crawling
Session Management | Persistent sessions for login flows and pagination
Async Architecture | Built on asyncio for high-performance concurrent crawling
Caching | Built-in page cache to avoid redundant downloads
Proxy Support | HTTP and SOCKS proxy configuration
Docker Support | Ready-made container for deployment

How Crawl4ai Works

Here’s the architecture in simplified form:

Your Code → Crawl4ai → Playwright Browser → Target Website
                ↓
        Content Extraction (auto or LLM)
                ↓
        Clean Output (markdown, JSON, or both)

The Extraction Pipeline

  1. URL Input — You provide one or more URLs
  2. Browser Launch — Crawl4ai starts a headless Chromium instance
  3. Page Load — The browser navigates to the URL, executing all JavaScript
  4. Wait Conditions — Optional waiting for specific elements or time delays
  5. JS Execution — Optional custom JavaScript (scrolling, clicking, etc.)
  6. HTML Capture — The fully rendered DOM is captured
  7. Content Filtering — Main content is identified; boilerplate is removed
  8. Markdown Conversion — HTML is converted to clean markdown
  9. Optional Extraction — CSS or LLM strategies extract structured fields
  10. Output — Clean markdown and/or structured JSON returned to your code
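
Most of these steps map directly onto options of CrawlerRunConfig. The sketch below is illustrative only; the selector, script, and URL are placeholders:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    config = CrawlerRunConfig(
        wait_for="css:.article-body",   # step 4: wait for a specific element
        js_code="window.scrollTo(0, document.body.scrollHeight);",  # step 5: custom JS
        excluded_tags=["nav", "footer"],  # step 7: drop boilerplate tags
        cache_mode=CacheMode.BYPASS,      # always fetch a fresh copy of the page
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/article", config=config)
        print(result.markdown[:500])

asyncio.run(main())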

Setting Up Crawl4ai

System Requirements

  • Python 3.9 or higher
  • 2GB+ RAM (4GB+ recommended for concurrent crawling)
  • macOS, Linux, or Windows

Installation

Standard installation:

pip install crawl4ai
crawl4ai-setup  # Downloads Chromium browser

With all optional dependencies:

pip install crawl4ai[all]
crawl4ai-setup

Docker installation:

docker pull unclecode/crawl4ai:latest
docker run -d -p 11235:11235 --name crawl4ai unclecode/crawl4ai:latest

Quick Verification

import asyncio
from crawl4ai import AsyncWebCrawler

async def verify():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://httpbin.org/html")
        assert result.success, f"Failed: {result.error_message}"
        assert len(result.markdown) > 100, "Content too short"
        print("Crawl4ai is working correctly!")
        print(f"Extracted {len(result.markdown)} characters of markdown")

asyncio.run(verify())

Practical Examples

Example 1: Scrape a Blog Post

import asyncio
from crawl4ai import AsyncWebCrawler

async def scrape_blog():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/blog/interesting-article"
        )

        if result.success:
            print(f"Title: {result.metadata.get('title')}")
            print(f"Word count: {len(result.markdown.split())}")
            print(f"\n{result.markdown[:2000]}")

            # Save to file
            with open("article.md", "w") as f:
                f.write(result.markdown)

asyncio.run(scrape_blog())

Example 2: Extract Product Data

import asyncio
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_products():
    schema = {
        "name": "Products",
        "baseSelector": ".product-item",
        "fields": [
            {"name": "name", "selector": ".product-name", "type": "text"},
            {"name": "price", "selector": ".product-price", "type": "text"},
            {"name": "rating", "selector": ".stars", "type": "attribute", "attribute": "data-rating"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/shop",
            extraction_strategy=JsonCssExtractionStrategy(schema)
        )

        products = json.loads(result.extracted_content)
        for p in products:
            print(f"{p['name']}: {p['price']} (rating: {p['rating']})")

asyncio.run(extract_products())

Example 3: Crawl Documentation for RAG

import asyncio
import json

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.deep_crawling.filters import FilterChain, URLPatternFilter

async def build_docs_index():
    # Restrict the crawl to documentation URLs via a URL pattern filter
    strategy = BFSDeepCrawlStrategy(
        max_depth=3,
        max_pages=200,
        filter_chain=FilterChain([URLPatternFilter(patterns=["*/docs/*"])])
    )

    config = CrawlerRunConfig(deep_crawl_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun(
            url="https://docs.example.com",
            config=config
        )

    documents = []
    for r in results:
        if r.success and len(r.markdown) > 100:
            documents.append({
                "url": r.url,
                "title": r.metadata.get("title", ""),
                "content": r.markdown
            })

    with open("docs_index.json", "w") as f:
        json.dump(documents, f, indent=2)

    print(f"Indexed {len(documents)} documentation pages")

asyncio.run(build_docs_index())

For a complete walkthrough of building AI knowledge bases, see our RAG pipeline with web scraping guide.

Example 4: LLM-Powered Extraction with Ollama

import asyncio
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def smart_extract():
    strategy = LLMExtractionStrategy(
        provider="ollama/llama3.2",
        instruction="Extract all job listings with title, company, location, salary range, and required skills as a list",
        base_url="http://localhost:11434"
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/jobs",
            extraction_strategy=strategy
        )

        jobs = json.loads(result.extracted_content)
        for job in jobs:
            print(f"{job['title']} at {job['company']} — {job.get('salary_range', 'Not listed')}")

asyncio.run(smart_extract())

Configuring Proxies

For scraping at scale or accessing region-specific content:

# Single proxy
async with AsyncWebCrawler(
    proxy="http://user:pass@proxy.example.com:8080"
) as crawler:
    result = await crawler.arun(url="https://target-site.com")

# Or via environment variable
# export HTTP_PROXY=http://user:pass@proxy.example.com:8080

Using residential proxies significantly improves success rates on sites with anti-bot protection. For recommendations, see our guide on proxy setup for web scraping.
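
If you prefer the config-object style, recent releases also accept the proxy on BrowserConfig. A sketch, assuming a current crawl4ai version and a placeholder proxy URL:

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def main():
    # Define the proxy once on the browser configuration
    browser_config = BrowserConfig(
        proxy="http://user:pass@proxy.example.com:8080"
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://target-site.com")
        print(result.success)

asyncio.run(main())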

Performance Tips

  1. Use arun_many for batch processing — much faster than sequential arun calls
  2. Enable caching for development — avoid re-downloading pages while iterating on extraction logic
  3. Limit concurrency — start with 3-5 concurrent tabs and increase based on your system’s capacity
  4. Use fit_markdown — this contains only the main content, versus markdown which may include navigation elements
  5. Set appropriate wait conditions — use wait_for="css:.target-element" instead of fixed delays when possible
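
A short sketch combining tips 1, 2, and 5 (the URLs and selector are placeholders): arun_many crawls the batch concurrently, the cache stays enabled while you iterate, and wait_for targets a concrete element instead of a fixed delay.

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def batch_crawl(urls):
    config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,   # tip 2: reuse cached pages while iterating
        wait_for="css:.main-content",   # tip 5: wait for a real element, not a timer
    )

    async with AsyncWebCrawler() as crawler:
        # tip 1: arun_many processes the whole batch concurrently
        results = await crawler.arun_many(urls=urls, config=config)

    for r in results:
        if r.success:
            print(r.url, len(r.markdown))

asyncio.run(batch_crawl([
    "https://example.com/page-1",
    "https://example.com/page-2",
    "https://example.com/page-3",
]))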

When to Use Crawl4ai (and When Not To)

Crawl4ai is ideal when:

  • You want a free, self-hosted solution
  • Data privacy is important (nothing leaves your machine)
  • You need LLM-powered extraction with local models
  • You’re building AI/ML pipelines that need clean training data
  • You want full control over the crawling infrastructure

Consider alternatives when:

  • You need the simplest possible setup (try Firecrawl)
  • You need advanced anti-bot bypasses out of the box
  • You’re scraping millions of pages monthly (consider Scrapy for raw throughput)
  • You want a visual, no-code scraping solution

Crawl4ai vs Firecrawl

This is the most common comparison. Here’s the quick version:

Factor | Crawl4ai | Firecrawl
Cost | Free forever | Free tier + paid plans
Hosting | Self-hosted | Cloud or self-hosted
Anti-Bot | Basic | Advanced
LLM Flexibility | Any model (local or cloud) | Cloud-only on hosted
Ease of Setup | Moderate | Very easy
Community | Large open-source | Growing

For the full comparison, read our Crawl4ai vs Firecrawl deep dive.

Frequently Asked Questions

What is Crawl4ai used for?

Crawl4ai is used for extracting clean, structured data from websites for AI applications. Common use cases include building knowledge bases for RAG systems, collecting training data for machine learning models, competitive intelligence gathering, content aggregation, and automated research. Its ability to output clean markdown makes it particularly popular for feeding data into LLM-based applications.

Is Crawl4ai better than Scrapy?

They serve different niches. Scrapy is better for high-volume, traditional web scraping where you need maximum throughput and distributed crawling. Crawl4ai is better when you need JavaScript rendering, AI-powered extraction, or clean markdown output for LLM pipelines. Many teams use both — Scrapy for simple HTML pages at scale, and Crawl4ai for complex, dynamic sites.

Does Crawl4ai work on Windows?

Yes. Crawl4ai supports Windows, macOS, and Linux. The installation process is the same across platforms: pip install crawl4ai followed by crawl4ai-setup. Docker installation is also available for all platforms.

Can Crawl4ai scrape sites behind a login?

Yes. Crawl4ai supports session management, which lets you persist browser state (cookies, localStorage) across multiple page loads. You can automate login flows using JavaScript execution, then continue crawling authenticated pages within the same session.
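
A minimal sketch of that pattern, with a hypothetical login form and selectors; the shared session_id keeps cookies and localStorage alive between the two calls:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def crawl_behind_login():
    # Hypothetical login form: fill credentials and submit via custom JS
    login_js = """
        document.querySelector('#username').value = 'user@example.com';
        document.querySelector('#password').value = 'secret';
        document.querySelector('form').submit();
    """

    async with AsyncWebCrawler() as crawler:
        # Log in and keep the browser state under a named session
        await crawler.arun(
            url="https://example.com/login",
            config=CrawlerRunConfig(session_id="auth", js_code=login_js),
        )

        # Reuse the same session for authenticated pages
        result = await crawler.arun(
            url="https://example.com/dashboard",
            config=CrawlerRunConfig(session_id="auth"),
        )
        print(result.markdown[:500])

asyncio.run(crawl_behind_login())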

How fast is Crawl4ai?

Speed depends on your hardware, internet connection, and target site. In benchmark tests, Crawl4ai can process 10-50 pages per minute with a single browser instance, and more with concurrent instances. It’s not as fast as raw HTTP scrapers like Scrapy (which can do hundreds per second), but it handles JavaScript-heavy sites that simpler tools cannot.

Conclusion

Crawl4ai is a powerful, free tool that democratizes AI-powered web scraping. It’s the right choice for developers who want full control, data privacy, and the flexibility to use any LLM for extraction.

The setup takes about 5 minutes, and the examples in this guide should have you extracting useful data within the hour. As your needs grow, explore the deep crawling, session management, and LLM extraction features to build sophisticated data pipelines.

For more on AI scraping tools, see our best AI web scrapers comparison or our in-depth Crawl4ai tutorial.


Related Reading

Related: Walk through a full project in our Crawl4AI setup guide for 2026.
