Firecrawl: Complete Guide to AI Web Scraping
Web scraping has entered a new era. Traditional scrapers break when websites change their layout, struggle with JavaScript-heavy pages, and return messy HTML that requires hours of parsing. Firecrawl solves all of these problems by combining headless browser rendering with AI-powered content extraction to deliver clean, structured data from any website.
Whether you’re building a RAG pipeline, training an AI model, or collecting competitive intelligence, Firecrawl has become the go-to tool for developers who need reliable, clean web data without the maintenance headache.
This guide covers everything you need to know about Firecrawl — from initial setup to advanced features like batch crawling, structured extraction, and integration with AI workflows.
What Is Firecrawl?
Firecrawl is an API-first web scraping and crawling platform developed by Mendable. It converts any website into clean markdown or structured data that’s ready for LLM consumption. Unlike traditional scraping tools that return raw HTML, Firecrawl handles JavaScript rendering, pagination, anti-bot bypasses, and content cleaning automatically.
Key Features at a Glance
| Feature | Description |
|---|---|
| Scrape Mode | Extract content from a single URL as markdown or structured data |
| Crawl Mode | Recursively crawl entire websites following internal links |
| Map Mode | Discover all URLs on a website without extracting content |
| Extract Mode | Pull structured data using LLM-powered schema extraction |
| Batch Scrape | Process thousands of URLs concurrently |
| JavaScript Rendering | Full Chromium-based rendering for dynamic content |
| Anti-Bot Handling | Built-in stealth techniques for protected sites |
| Clean Markdown | Automatic removal of ads, navigation, and boilerplate |
Getting Started with Firecrawl
Installation
Firecrawl offers SDKs for Python, Node.js, Go, and Rust. Here’s how to get started with the most popular options.
Python SDK:

```bash
pip install firecrawl-py
```

Node.js SDK:

```bash
npm install @mendable/firecrawl-js
```

Self-Hosted (Docker):

```bash
git clone https://github.com/mendableai/firecrawl.git
cd firecrawl
docker compose up -d
```

API Key Setup
Sign up at firecrawl.dev to get your API key. The free tier includes 500 credits per month, which is enough for testing and small projects.
```python
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-your-api-key-here")
```

Core Scraping Modes
1. Scrape Mode — Single Page Extraction
The most basic operation extracts clean content from a single URL:
```python
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-your-key")

# Basic scrape — returns markdown
result = app.scrape_url("https://example.com/blog/article", {
    "formats": ["markdown"]
})

print(result["markdown"])
```

You can also request multiple formats simultaneously:
```python
result = app.scrape_url("https://example.com/pricing", {
    "formats": ["markdown", "html", "links", "screenshot"],
    "waitFor": 3000,   # Wait 3s for JS to load
    "timeout": 30000
})

# Access different formats
markdown_content = result["markdown"]
raw_html = result["html"]
page_links = result["links"]
screenshot_url = result["screenshot"]
```

2. Crawl Mode — Full Website Crawling
Crawl mode follows internal links recursively to extract content from an entire website:
```python
# Start a crawl (the async variant is shown below)
crawl_result = app.crawl_url("https://example.com", {
    "limit": 100,        # Max pages to crawl
    "maxDepth": 3,       # How deep to follow links
    "includePaths": ["/blog/*", "/docs/*"],
    "excludePaths": ["/admin/*", "/login/*"],
    "formats": ["markdown"],
    "scrapeOptions": {
        "waitFor": 2000
    }
})

# Process results
for page in crawl_result["data"]:
    print(f"URL: {page['metadata']['url']}")
    print(f"Title: {page['metadata']['title']}")
    print(f"Content length: {len(page['markdown'])}")
    print("---")
```

For large crawls, use the async pattern with polling:
```python
import time

# Start crawl (returns immediately)
crawl_job = app.async_crawl_url("https://example.com", {
    "limit": 500
})

job_id = crawl_job["id"]
print(f"Crawl job started: {job_id}")

# Poll until the crawl finishes
while True:
    status = app.check_crawl_status(job_id)
    print(f"Status: {status['status']} — {status.get('completed', 0)} pages done")
    if status["status"] == "completed":
        break
    time.sleep(10)

# Get all results
all_pages = status["data"]
```

3. Map Mode — URL Discovery
Map mode quickly discovers all accessible URLs on a website without downloading their content:
```python
map_result = app.map_url("https://example.com", {
    "search": "pricing",   # Optional: filter URLs containing this term
    "limit": 5000
})

urls = map_result["links"]
print(f"Found {len(urls)} URLs")
for url in urls[:20]:
    print(url)
```

This is extremely useful for planning a targeted crawl — discover the site structure first, then scrape only the pages you need.
4. Extract Mode — Structured Data Extraction
Extract mode uses LLMs to pull structured data according to a schema you define:
```python
from typing import List, Optional

from pydantic import BaseModel

class ProductInfo(BaseModel):
    name: str
    price: float
    currency: str
    description: str
    features: List[str]
    availability: Optional[str]

result = app.scrape_url("https://example.com/product/widget-pro", {
    "formats": ["extract"],
    "extract": {
        "schema": ProductInfo.model_json_schema(),
        "prompt": "Extract the product information from this page"
    }
})

product = result["extract"]
print(f"Product: {product['name']}")
print(f"Price: {product['currency']}{product['price']}")
print(f"Features: {', '.join(product['features'])}")
```

You can also use a natural language prompt without a schema:
```python
result = app.scrape_url("https://example.com/about", {
    "formats": ["extract"],
    "extract": {
        "prompt": "Extract the company name, founding year, number of employees, and headquarters location"
    }
})
```

Advanced Features
Batch Scraping
Process hundreds or thousands of URLs concurrently:
```python
urls = [
    "https://example.com/product/1",
    "https://example.com/product/2",
    "https://example.com/product/3",
    # ... hundreds more
]

batch_result = app.batch_scrape_urls(urls, {
    "formats": ["markdown", "extract"],
    "extract": {
        "prompt": "Extract product name, price, and rating"
    }
})

for page in batch_result["data"]:
    print(page["extract"])
```

Handling Authentication
For pages that require login:
```python
result = app.scrape_url("https://example.com/dashboard", {
    "formats": ["markdown"],
    "headers": {
        "Cookie": "session=your-session-cookie",
        "Authorization": "Bearer your-token"
    }
})
```

Using Proxies with Firecrawl
While Firecrawl’s cloud service handles proxy rotation internally, the self-hosted version lets you configure your own proxies for maximum control:
```bash
# In your .env file for self-hosted Firecrawl
PROXY_SERVER=http://your-proxy:8080
PROXY_USERNAME=user
PROXY_PASSWORD=pass
```

For enterprise-grade scraping that requires residential or mobile proxies, dedicated proxy infrastructure gives you the most reliable results. Combining Firecrawl’s extraction capabilities with high-quality rotating proxies provides both clean data output and consistent access to protected sites.
Webhook Integration
Set up webhooks to receive results as they complete:
```python
crawl_job = app.async_crawl_url("https://example.com", {
    "limit": 100,
    "webhook": "https://your-server.com/webhook/firecrawl"
})
```

Your webhook endpoint will receive POST requests with the crawl results as each page is processed.
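The exact payload shape depends on your Firecrawl version, so treat the following as a rough sketch: it assumes each POST body carries a `type` field (e.g. `crawl.page` per finished page, `crawl.completed` at the end) and a `data` list of scraped pages. These field names are assumptions; check the webhook documentation for the real schema.

```python
import json

# Pages collected so far, keyed by URL
pages = {}

def handle_firecrawl_event(payload: dict) -> int:
    """Process one webhook POST body; returns how many pages were stored."""
    stored = 0
    if payload.get("type") == "crawl.page":          # assumed event type
        for page in payload.get("data", []):
            url = page["metadata"]["url"]
            pages[url] = page["markdown"]
            stored += 1
    elif payload.get("type") == "crawl.completed":   # assumed event type
        print(f"Crawl finished with {len(pages)} pages")
    return stored

# Simulated webhook delivery
body = json.dumps({
    "type": "crawl.page",
    "data": [{"metadata": {"url": "https://example.com/a"}, "markdown": "# A"}]
})
handle_firecrawl_event(json.loads(body))
```

In a real deployment this function would sit behind a small HTTP endpoint that parses the request body and returns 200 quickly, deferring any heavy processing to a queue.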
Firecrawl Pricing (2026)
| Plan | Credits/Month | Price | Best For |
|---|---|---|---|
| Free | 500 | $0 | Testing and personal projects |
| Hobby | 3,000 | $16/mo | Small projects and MVPs |
| Standard | 100,000 | $83/mo | Production applications |
| Growth | 500,000 | $333/mo | High-volume scraping |
| Enterprise | Custom | Custom | Large-scale operations |
One credit equals one page scrape. Crawl mode uses one credit per page discovered and scraped.
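Since one credit maps to one scraped page, picking a plan is simple arithmetic. A quick sketch, with the credit allowances copied from the table above:

```python
# (plan name, credits per month, price in USD) from the pricing table
PLANS = [
    ("Free", 500, 0),
    ("Hobby", 3_000, 16),
    ("Standard", 100_000, 83),
    ("Growth", 500_000, 333),
]

def cheapest_plan(pages_per_month: int):
    """Return the smallest plan whose credits cover the monthly page volume."""
    for name, credits, price in PLANS:
        if pages_per_month <= credits:
            return name, price
    return "Enterprise", None  # custom pricing above 500k pages
```

For example, `cheapest_plan(50_000)` lands on Standard, since Hobby's 3,000 credits fall short.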
Self-Hosting Firecrawl
For teams that need full control over their scraping infrastructure:
```bash
# Clone the repository
git clone https://github.com/mendableai/firecrawl.git
cd firecrawl

# Configure environment
cp .env.example .env
# Edit .env with your settings (OpenAI API key for extract mode, etc.)

# Start with Docker Compose
docker compose up -d
```

Self-hosting advantages:
- No credit limits — scrape as much as you need
- Data privacy — scraped content never leaves your infrastructure
- Custom proxy configuration — use your own proxy infrastructure
- Lower cost at scale — no per-page pricing
Self-hosting disadvantages:
- You manage the infrastructure
- Need your own proxy rotation solution
- Anti-bot bypass capabilities may be limited compared to the cloud version
Real-World Use Cases
Building a RAG Knowledge Base
```python
import json

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-your-key")

# Crawl documentation site
result = app.crawl_url("https://docs.example.com", {
    "limit": 200,
    "formats": ["markdown"],
    "includePaths": ["/docs/*"]
})

# Save for RAG ingestion
documents = []
for page in result["data"]:
    documents.append({
        "content": page["markdown"],
        "metadata": {
            "url": page["metadata"]["url"],
            "title": page["metadata"]["title"]
        }
    })

with open("knowledge_base.json", "w") as f:
    json.dump(documents, f, indent=2)
```

For a deeper dive into this workflow, check out our guide on building RAG pipelines with web scraping.
Competitive Price Monitoring
```python
competitors = [
    "https://competitor1.com/pricing",
    "https://competitor2.com/plans",
    "https://competitor3.com/pricing"
]

results = app.batch_scrape_urls(competitors, {
    "formats": ["extract"],
    "extract": {
        "prompt": "Extract all pricing tiers with plan names, monthly prices, annual prices, and included features"
    }
})

for page in results["data"]:
    print(f"\n{page['metadata']['url']}:")
    print(json.dumps(page["extract"], indent=2))
```

Content Aggregation Pipeline
```python
# Step 1: Discover URLs
map_result = app.map_url("https://news-site.com", {
    "search": "artificial intelligence",
    "limit": 50
})

# Step 2: Scrape relevant articles
articles = app.batch_scrape_urls(map_result["links"][:20], {
    "formats": ["markdown", "extract"],
    "extract": {
        "prompt": "Extract article title, author, publish date, and a 2-sentence summary"
    }
})
```

Firecrawl vs Traditional Scrapers
| Aspect | Firecrawl | BeautifulSoup/Scrapy | Selenium/Playwright |
|---|---|---|---|
| Setup Time | Minutes | Hours | Hours |
| JS Rendering | Built-in | Not supported | Built-in |
| Anti-Bot Bypass | Built-in | Manual | Manual |
| Output Quality | Clean markdown | Raw HTML | Raw HTML |
| LLM Extraction | Built-in | Requires coding | Requires coding |
| Maintenance | Low | High | High |
| Cost | Pay per page | Free (+ infra) | Free (+ infra) |
| Scalability | Cloud-managed | Self-managed | Self-managed |
Tips for Getting the Most Out of Firecrawl
- Use Map before Crawl — Discover the site structure first, then target specific sections
- Set appropriate wait times — Dynamic sites need `waitFor` values of 2000-5000ms
- Use includePaths/excludePaths — Don’t waste credits on irrelevant pages
- Leverage Extract mode — Let the LLM parse complex layouts instead of writing custom selectors
- Batch your requests — Batch scraping is faster and more efficient than individual calls
- Cache results — Store scraped data locally to avoid redundant API calls
- Monitor your credit usage — Set up alerts before you hit plan limits
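The caching tip above can be sketched as a thin file-based wrapper. `cached_scrape` and `fake_scrape` are hypothetical helpers for illustration; any function that calls the API (e.g. a lambda around `app.scrape_url`) can be passed in:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("scrape_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_scrape(url: str, scrape) -> dict:
    """Return a cached result for `url`, calling `scrape(url)` only on a miss."""
    key = hashlib.sha256(url.encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    result = scrape(url)              # the real API call happens only here
    path.write_text(json.dumps(result))
    return result

# Usage with a stand-in scraper that records how often it is called
calls = []
def fake_scrape(u):
    calls.append(u)
    return {"markdown": f"content of {u}"}

first = cached_scrape("https://example.com/a", fake_scrape)
second = cached_scrape("https://example.com/a", fake_scrape)  # served from disk
```

For production you would likely add a TTL or an explicit invalidation path, since pages change; this sketch caches forever.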
Frequently Asked Questions
Is Firecrawl free to use?
Firecrawl offers a free tier with 500 credits per month, enough for testing and small projects. For production use, paid plans start at $16/month. You can also self-host the open-source version for unlimited usage, though you’ll need to provide your own infrastructure and OpenAI API key for extract mode.
Can Firecrawl scrape JavaScript-heavy websites?
Yes. Firecrawl uses a headless Chromium browser to render JavaScript before extracting content. This means it handles React, Vue, Angular, and other SPA frameworks. You can use the waitFor parameter to give dynamic content time to load before extraction.
How does Firecrawl compare to Crawl4ai?
Both are AI-powered scrapers, but they take different approaches. Firecrawl is an API service (with self-host option) focused on clean markdown output and LLM extraction. Crawl4ai is a fully open-source Python library that runs locally. Firecrawl is easier to start with; Crawl4ai gives more control and costs nothing to run. See our detailed Crawl4ai vs Firecrawl comparison for a thorough breakdown.
Does Firecrawl handle CAPTCHAs?
The cloud version of Firecrawl includes anti-bot capabilities that handle many common protection mechanisms. For heavily protected sites, you may need to combine Firecrawl with specialized anti-detect browser tools or residential proxies for the best results.
Can I use Firecrawl for large-scale scraping?
Absolutely. Firecrawl’s batch scraping mode, async crawling, and webhook support are designed for high-volume operations. The Growth plan supports 500,000 pages per month, and Enterprise plans offer custom limits. For the largest operations, self-hosting removes all credit restrictions.
Conclusion
Firecrawl represents a significant shift in how developers approach web scraping. By combining headless browser rendering with AI-powered extraction, it eliminates most of the pain points that make traditional scraping fragile and time-consuming.
For teams building AI applications, RAG pipelines, or data collection workflows, Firecrawl’s ability to deliver clean, structured data from any website — with minimal code and zero maintenance — makes it a compelling choice. The free tier is generous enough to evaluate the platform, and self-hosting provides a cost-effective path for high-volume needs.
Start with the Python SDK, experiment with scrape and extract modes, and you’ll quickly see why Firecrawl has become one of the most popular AI web scraping tools in 2026.
Related Reading
- AI Web Scraper with Python: Build Your Own
- Best AI Web Scrapers 2026: Complete Comparison
- Agentic Browsers Explained: Browserbase, Browser Use, and Proxy Infrastructure
- Agentic Browsers Explained: The Future of AI + Proxies in 2026
- How AI Agents Use Proxies for Real-Time Web Data Collection in 2026
- Mobile Proxies for AI Data Collection: Web Scraping for Training Data