Firecrawl: Complete Guide to AI Web Scraping

Web scraping has entered a new era. Traditional scrapers break when websites change their layout, struggle with JavaScript-heavy pages, and return messy HTML that requires hours of parsing. Firecrawl solves all of these problems by combining headless browser rendering with AI-powered content extraction to deliver clean, structured data from any website.

Whether you’re building a RAG pipeline, training an AI model, or collecting competitive intelligence, Firecrawl has become the go-to tool for developers who need reliable, clean web data without the maintenance headache.

This guide covers everything you need to know about Firecrawl — from initial setup to advanced features like batch crawling, structured extraction, and integration with AI workflows.

What Is Firecrawl?

Firecrawl is an API-first web scraping and crawling platform developed by Mendable. It converts any website into clean markdown or structured data that’s ready for LLM consumption. Unlike traditional scraping tools that return raw HTML, Firecrawl handles JavaScript rendering, pagination, anti-bot protection, and content cleaning automatically.

Key Features at a Glance

| Feature | Description |
| --- | --- |
| Scrape Mode | Extract content from a single URL as markdown or structured data |
| Crawl Mode | Recursively crawl entire websites following internal links |
| Map Mode | Discover all URLs on a website without extracting content |
| Extract Mode | Pull structured data using LLM-powered schema extraction |
| Batch Scrape | Process thousands of URLs concurrently |
| JavaScript Rendering | Full Chromium-based rendering for dynamic content |
| Anti-Bot Handling | Built-in stealth techniques for protected sites |
| Clean Markdown | Automatic removal of ads, navigation, and boilerplate |

Getting Started with Firecrawl

Installation

Firecrawl offers SDKs for Python, Node.js, Go, and Rust. Here’s how to get started with the most popular options.

Python SDK:

pip install firecrawl-py

Node.js SDK:

npm install @mendable/firecrawl-js

Self-Hosted (Docker):

git clone https://github.com/mendableai/firecrawl.git
cd firecrawl
docker compose up -d

API Key Setup

Sign up at firecrawl.dev to get your API key. The free tier includes 500 credits per month, which is enough for testing and small projects.

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-your-api-key-here")
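
If you prefer to keep the key out of your source code, you can read it from an environment variable instead. The variable name FIRECRAWL_API_KEY below is just a convention used in this sketch, not something the SDK requires:

import os
from firecrawl import FirecrawlApp

# Assumes you have exported the key yourself, e.g.
#   export FIRECRAWL_API_KEY="fc-your-api-key-here"
app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])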

Core Scraping Modes

1. Scrape Mode — Single Page Extraction

The most basic operation extracts clean content from a single URL:

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-your-key")

# Basic scrape — returns markdown
result = app.scrape_url("https://example.com/blog/article", {
    "formats": ["markdown"]
})

print(result["markdown"])

You can also request multiple formats simultaneously:

result = app.scrape_url("https://example.com/pricing", {
    "formats": ["markdown", "html", "links", "screenshot"],
    "waitFor": 3000,  # Wait 3s for JS to load
    "timeout": 30000
})

# Access different formats
markdown_content = result["markdown"]
raw_html = result["html"]
page_links = result["links"]
screenshot_url = result["screenshot"]

2. Crawl Mode — Full Website Crawling

Crawl mode follows internal links recursively to extract content from an entire website:

# Start an async crawl
crawl_result = app.crawl_url("https://example.com", {
    "limit": 100,           # Max pages to crawl
    "maxDepth": 3,          # How deep to follow links
    "includePaths": ["/blog/*", "/docs/*"],
    "excludePaths": ["/admin/*", "/login/*"],
    "formats": ["markdown"],
    "scrapeOptions": {
        "waitFor": 2000
    }
})

# Process results
for page in crawl_result["data"]:
    print(f"URL: {page['metadata']['url']}")
    print(f"Title: {page['metadata']['title']}")
    print(f"Content length: {len(page['markdown'])}")
    print("---")

For large crawls, use the async pattern with polling:

# Start crawl (returns immediately)
crawl_job = app.async_crawl_url("https://example.com", {
    "limit": 500
})

job_id = crawl_job["id"]
print(f"Crawl job started: {job_id}")

# Check status
import time
while True:
    status = app.check_crawl_status(job_id)
    print(f"Status: {status['status']} — {status.get('completed', 0)} pages done")
    if status["status"] == "completed":
        break
    time.sleep(10)

# Get all results
all_pages = status["data"]

3. Map Mode — URL Discovery

Map mode quickly discovers all accessible URLs on a website without downloading their content:

map_result = app.map_url("https://example.com", {
    "search": "pricing",  # Optional: filter URLs containing this term
    "limit": 5000
})

urls = map_result["links"]
print(f"Found {len(urls)} URLs")

for url in urls[:20]:
    print(url)

This is extremely useful for planning a targeted crawl — discover the site structure first, then scrape only the pages you need.
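
As a rough sketch of that workflow, the mapped URLs can be fed straight into a batch scrape. The "/docs/" filter and the 25-URL cap below are arbitrary choices for illustration:

# Step 1: discover URLs that mention "docs"
map_result = app.map_url("https://example.com", {
    "search": "docs",
    "limit": 1000
})

# Step 2: keep only the documentation pages and scrape just those
docs_urls = [u for u in map_result["links"] if "/docs/" in u]

pages = app.batch_scrape_urls(docs_urls[:25], {
    "formats": ["markdown"]
})

for page in pages["data"]:
    print(page["metadata"]["title"], len(page["markdown"]))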

4. Extract Mode — Structured Data Extraction

Extract mode uses LLMs to pull structured data according to a schema you define:

from pydantic import BaseModel
from typing import List, Optional

class ProductInfo(BaseModel):
    name: str
    price: float
    currency: str
    description: str
    features: List[str]
    availability: Optional[str] = None  # default so the field is truly optional

result = app.scrape_url("https://example.com/product/widget-pro", {
    "formats": ["extract"],
    "extract": {
        "schema": ProductInfo.model_json_schema(),
        "prompt": "Extract the product information from this page"
    }
})

product = result["extract"]
print(f"Product: {product['name']}")
print(f"Price: {product['currency']}{product['price']}")
print(f"Features: {', '.join(product['features'])}")

You can also use a natural language prompt without a schema:

result = app.scrape_url("https://example.com/about", {
    "formats": ["extract"],
    "extract": {
        "prompt": "Extract the company name, founding year, number of employees, and headquarters location"
    }
})

Advanced Features

Batch Scraping

Process hundreds or thousands of URLs concurrently:

urls = [
    "https://example.com/product/1",
    "https://example.com/product/2",
    "https://example.com/product/3",
    # ... hundreds more
]

batch_result = app.batch_scrape_urls(urls, {
    "formats": ["markdown", "extract"],
    "extract": {
        "prompt": "Extract product name, price, and rating"
    }
})

for page in batch_result["data"]:
    print(page["extract"])

Handling Authentication

For pages that require login:

result = app.scrape_url("https://example.com/dashboard", {
    "formats": ["markdown"],
    "headers": {
        "Cookie": "session=your-session-cookie",
        "Authorization": "Bearer your-token"
    }
})

Using Proxies with Firecrawl

While Firecrawl’s cloud service handles proxy rotation internally, the self-hosted version lets you configure your own proxies for maximum control:

# In your .env file for self-hosted Firecrawl
PROXY_SERVER=http://your-proxy:8080
PROXY_USERNAME=user
PROXY_PASSWORD=pass

For enterprise-grade scraping that requires residential or mobile proxies, dedicated proxy infrastructure gives you the most reliable results. Combining Firecrawl’s extraction capabilities with high-quality rotating proxies provides both clean data output and consistent access to protected sites.

Webhook Integration

Set up webhooks to receive results as they complete:

crawl_job = app.async_crawl_url("https://example.com", {
    "limit": 100,
    "webhook": "https://your-server.com/webhook/firecrawl"
})

Your webhook endpoint will receive POST requests with the crawl results as each page is processed.
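
The exact payload shape depends on your Firecrawl version, so treat the field names below ("type", "data") as assumptions to verify against what your endpoint actually receives. A minimal Flask receiver might look like this:

from flask import Flask, request

server = Flask(__name__)

@server.post("/webhook/firecrawl")
def firecrawl_webhook():
    event = request.get_json(force=True)
    # "type" and "data" are assumed field names; inspect a real payload first
    print("Received event:", event.get("type"))
    for page in event.get("data", []):
        print(page.get("metadata", {}).get("url"))
    return {"ok": True}, 200

if __name__ == "__main__":
    server.run(port=8000)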

Firecrawl Pricing (2026)

| Plan | Credits/Month | Price | Best For |
| --- | --- | --- | --- |
| Free | 500 | $0 | Testing and personal projects |
| Hobby | 3,000 | $16/mo | Small projects and MVPs |
| Standard | 100,000 | $83/mo | Production applications |
| Growth | 500,000 | $333/mo | High-volume scraping |
| Enterprise | Custom | Custom | Large-scale operations |

One credit equals one page scrape. Crawl mode uses one credit per page discovered and scraped.
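
Because the pricing model is per page, budgeting is straightforward arithmetic. Here is a tiny helper, using the plan sizes from the table above, for estimating which tier covers a recurring job:

# Rough credit budgeting: 1 page scraped = 1 credit
PLANS = {"Free": 500, "Hobby": 3_000, "Standard": 100_000, "Growth": 500_000}

def monthly_credits(pages_per_run: int, runs_per_month: int) -> int:
    return pages_per_run * runs_per_month

needed = monthly_credits(pages_per_run=300, runs_per_month=30)  # e.g. 300 pages every day
fitting = [name for name, credits in PLANS.items() if credits >= needed]
print(f"{needed} credits/month -> smallest plan: {fitting[0] if fitting else 'Enterprise'}")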

Self-Hosting Firecrawl

For teams that need full control over their scraping infrastructure:

# Clone the repository
git clone https://github.com/mendableai/firecrawl.git
cd firecrawl

# Configure environment
cp .env.example .env
# Edit .env with your settings (OpenAI API key for extract mode, etc.)

# Start with Docker Compose
docker compose up -d

Self-hosting advantages:

  • No credit limits — scrape as much as you need
  • Data privacy — scraped content never leaves your infrastructure
  • Custom proxy configuration — use your own proxy infrastructure
  • Lower cost at scale — no per-page pricing

Self-hosting disadvantages:

  • You manage the infrastructure
  • Need your own proxy rotation solution
  • Anti-bot bypass capabilities may be limited compared to the cloud version

Real-World Use Cases

Building a RAG Knowledge Base

from firecrawl import FirecrawlApp
import json

app = FirecrawlApp(api_key="fc-your-key")

# Crawl documentation site
result = app.crawl_url("https://docs.example.com", {
    "limit": 200,
    "formats": ["markdown"],
    "includePaths": ["/docs/*"]
})

# Save for RAG ingestion
documents = []
for page in result["data"]:
    documents.append({
        "content": page["markdown"],
        "metadata": {
            "url": page["metadata"]["url"],
            "title": page["metadata"]["title"]
        }
    })

with open("knowledge_base.json", "w") as f:
    json.dump(documents, f, indent=2)

For a deeper dive into this workflow, check out our guide on building RAG pipelines with web scraping.

Competitive Price Monitoring

competitors = [
    "https://competitor1.com/pricing",
    "https://competitor2.com/plans",
    "https://competitor3.com/pricing"
]

results = app.batch_scrape_urls(competitors, {
    "formats": ["extract"],
    "extract": {
        "prompt": "Extract all pricing tiers with plan names, monthly prices, annual prices, and included features"
    }
})

for page in results["data"]:
    print(f"\n{page['metadata']['url']}:")
    print(json.dumps(page["extract"], indent=2))

Content Aggregation Pipeline

# Step 1: Discover URLs
map_result = app.map_url("https://news-site.com", {
    "search": "artificial intelligence",
    "limit": 50
})

# Step 2: Scrape relevant articles
articles = app.batch_scrape_urls(map_result["links"][:20], {
    "formats": ["markdown", "extract"],
    "extract": {
        "prompt": "Extract article title, author, publish date, and a 2-sentence summary"
    }
})

Firecrawl vs Traditional Scrapers

| Aspect | Firecrawl | BeautifulSoup/Scrapy | Selenium/Playwright |
| --- | --- | --- | --- |
| Setup Time | Minutes | Hours | Hours |
| JS Rendering | Built-in | Not supported | Built-in |
| Anti-Bot Bypass | Built-in | Manual | Manual |
| Output Quality | Clean markdown | Raw HTML | Raw HTML |
| LLM Extraction | Built-in | Requires coding | Requires coding |
| Maintenance | Low | High | High |
| Cost | Pay per page | Free (+ infra) | Free (+ infra) |
| Scalability | Cloud-managed | Self-managed | Self-managed |

Tips for Getting the Most Out of Firecrawl

  1. Use Map before Crawl — Discover the site structure first, then target specific sections
  2. Set appropriate wait times — Dynamic sites need waitFor values of 2000-5000ms
  3. Use includePaths/excludePaths — Don’t waste credits on irrelevant pages
  4. Leverage Extract mode — Let the LLM parse complex layouts instead of writing custom selectors
  5. Batch your requests — Batch scraping is faster and more efficient than individual calls
  6. Cache results — Store scraped data locally to avoid redundant API calls (see the sketch after this list)
  7. Monitor your credit usage — Set up alerts before you hit plan limits
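
For tip 6, a local cache can be as simple as a JSON file keyed by URL. This sketch assumes scrape_url returns a JSON-serializable dict, as in the examples above, and the cache file name is arbitrary:

import json
import os

CACHE_FILE = "scrape_cache.json"

def cached_scrape(app, url, options):
    """Return a previously saved result for this URL, or scrape it once and store it."""
    cache = {}
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            cache = json.load(f)

    if url in cache:
        return cache[url]  # no credit spent

    result = app.scrape_url(url, options)
    cache[url] = result
    with open(CACHE_FILE, "w") as f:
        json.dump(cache, f)
    return result

# Spends a credit only the first time a given URL is scraped
page = cached_scrape(app, "https://example.com/pricing", {"formats": ["markdown"]})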

Frequently Asked Questions

Is Firecrawl free to use?

Firecrawl offers a free tier with 500 credits per month, enough for testing and small projects. For production use, paid plans start at $16/month. You can also self-host the open-source version for unlimited usage, though you’ll need to provide your own infrastructure and OpenAI API key for extract mode.

Can Firecrawl scrape JavaScript-heavy websites?

Yes. Firecrawl uses a headless Chromium browser to render JavaScript before extracting content. This means it handles React, Vue, Angular, and other SPA frameworks. You can use the waitFor parameter to give dynamic content time to load before extraction.

How does Firecrawl compare to Crawl4ai?

Both are AI-powered scrapers, but they take different approaches. Firecrawl is an API service (with self-host option) focused on clean markdown output and LLM extraction. Crawl4ai is a fully open-source Python library that runs locally. Firecrawl is easier to start with; Crawl4ai gives more control and costs nothing to run. See our detailed Crawl4ai vs Firecrawl comparison for a thorough breakdown.

Does Firecrawl handle CAPTCHAs?

The cloud version of Firecrawl includes anti-bot capabilities that handle many common protection mechanisms. For heavily protected sites, you may need to combine Firecrawl with specialized anti-detect browser tools or residential proxies for the best results.

Can I use Firecrawl for large-scale scraping?

Absolutely. Firecrawl’s batch scraping mode, async crawling, and webhook support are designed for high-volume operations. The Growth plan supports 500,000 pages per month, and Enterprise plans offer custom limits. For the largest operations, self-hosting removes all credit restrictions.

Conclusion

Firecrawl represents a significant shift in how developers approach web scraping. By combining headless browser rendering with AI-powered extraction, it eliminates most of the pain points that make traditional scraping fragile and time-consuming.

For teams building AI applications, RAG pipelines, or data collection workflows, Firecrawl’s ability to deliver clean, structured data from any website — with minimal code and zero maintenance — makes it a compelling choice. The free tier is generous enough to evaluate the platform, and self-hosting provides a cost-effective path for high-volume needs.

Start with the Python SDK, experiment with scrape and extract modes, and you’ll quickly see why Firecrawl has become one of the most popular AI web scraping tools in 2026.


