how to use crawl4ai for llm-ready web scraping (python tutorial 2026)
crawl4ai is an open-source python library that turns any webpage into clean, llm-ready markdown in one async call. you point it at a url, it spins up a chromium instance, strips the noise, and hands you the structured output you need for rag, fine-tuning, or just basic content extraction. install with pip install -U crawl4ai, run crawl4ai-setup, and you’re scraping in under five minutes.
most python scrapers were built before language models existed. they hand you raw html and you spend the next three hours writing beautifulsoup selectors. crawl4ai flipped that workflow. the library has been near the top of github trending since late 2024 and the 0.5.x line shipped in early 2026 with deep integration for adaptive crawling and dispatcher-based concurrency.
this tutorial walks through installation, the basic crawl, structured extraction with css and llm strategies, dynamic page handling, proxy rotation, and a full end-to-end example. if you’ve used playwright before, this will feel familiar. if you haven’t, that’s fine too. crawl4ai handles the browser plumbing for you.
why crawl4ai instead of beautifulsoup or scrapy
scrapy is still the right pick if you’re crawling millions of pages and you have time to write spiders. beautifulsoup is fine for static html. crawl4ai sits in a different lane. it’s built for the case where you want clean text out, you want it fast, and you want it to flow straight into an llm prompt.
three things make it different from older tools.
first, the default output is markdown, not html. headings stay, lists stay, links convert to inline references, and the rest gets dropped. you don’t write a single selector to get readable text.
second, it ships with a real browser by default. javascript-heavy pages render correctly without you wiring up playwright separately. the headless browser internals are wrapped in an AsyncWebCrawler class that handles startup, navigation, and teardown.
third, it’s async-first. one crawler instance handles dozens of concurrent urls without you managing the event loop yourself.
installing crawl4ai
the install is two commands. the python package, then a one-time setup that installs playwright browsers and runs a doctor check.
pip install -U crawl4ai
crawl4ai-setup
if the setup script throws errors about missing system libraries on linux, run the diagnostic:
crawl4ai-doctor
on macos and windows the playwright install usually just works. on a fresh ubuntu container you may need apt-get install -y libnss3 libatk1.0-0 libatk-bridge2.0-0 libcups2 libdrm2 libxkbcommon0 libxcomposite1 libxdamage1 libxfixes3 libxrandr2 libgbm1 libpango-1.0-0 libcairo2 libasound2 before the browsers will launch.
verify the install:
python -c "import crawl4ai; print(crawl4ai.__version__)"
you should see 0.5.x or higher. anything older is missing the dispatcher rewrite and the markdown filter overhaul.
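if you want a script to enforce that minimum instead of checking by eye, a minimal sketch follows. version_at_least is a hypothetical helper, not part of crawl4ai, and it compares only the leading numeric parts of the version string.

```python
import re


def version_at_least(version: str, minimum: tuple) -> bool:
    """compare the leading numeric components of `version` against `minimum`."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break  # stop at suffixes such as "post4"
        parts.append(int(match.group()))
    # pad so that "1" compares like "1.0"
    while len(parts) < len(minimum):
        parts.append(0)
    return tuple(parts[: len(minimum)]) >= tuple(minimum)


print(version_at_least("0.5.2", (0, 5)))
print(version_at_least("0.4.248", (0, 5)))
```

in a real script you would pass crawl4ai.__version__ as the first argument and exit early when the check fails.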
your first crawl
the simplest possible script. one url in, markdown out.
import asyncio

from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://news.ycombinator.com")
        # print only the first 2,000 characters
        print(result.markdown[:2000])

asyncio.run(main())
run that. you’ll get the front page of hacker news rendered as markdown, with story titles as links and the table layout stripped away. no selectors, no parsing, no cleanup pass.
the result object holds more than markdown. it also has result.html (the raw rendered html), result.cleaned_html (post-filter), result.media (a dict of images, audio, video extracted from the page), result.links (internal and external link lists), and result.metadata (title, description, og tags).
controlling the crawl with browser and run configs
the defaults are sensible but you’ll outgrow them fast. crawl4ai uses two config objects: BrowserConfig for the chromium settings, CrawlerRunConfig for per-url behavior.
import asyncio
import base64

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_cfg = BrowserConfig(
        headless=True,
        viewport_width=1920,
        viewport_height=1080,
        user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 14_0) AppleWebKit/537.36",
    )
    run_cfg = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,  # always fetch a fresh copy
        wait_until="networkidle",
        page_timeout=30000,  # milliseconds
        screenshot=True,
    )
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://example.com/dynamic-page",
            config=run_cfg,
        )
        print(result.markdown)
        # the screenshot comes back base64-encoded
        if result.screenshot:
            with open("page.png", "wb") as f:
                f.write(base64.b64decode(result.screenshot))

asyncio.run(main())
wait_until="networkidle" matters for spas. without it, the crawler can read the page before the javascript has rendered, and your markdown comes back empty or partial.
cache_mode=CacheMode.BYPASS forces a fresh fetch. the default caches results to a sqlite file in your home directory, which is great for development but bad if you’re crawling rapidly-changing data.
extracting structured data with css selectors
raw markdown is great for rag pipelines. for everything else you usually want fields. product price, author name, publish date. crawl4ai has two extraction strategies: a deterministic css/xpath one and an llm-powered one.
css first. it’s fast, free, and predictable.
import asyncio
import json

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "hn_stories",
    "baseSelector": "tr.athing",
    "fields": [
        {"name": "title", "selector": "span.titleline > a", "type": "text"},
        {"name": "url", "selector": "span.titleline > a", "type": "attribute", "attribute": "href"},
        {"name": "rank", "selector": "span.rank", "type": "text"},
    ],
}

async def main():
    cfg = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(schema, verbose=False),
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://news.ycombinator.com", config=cfg)
        # extracted_content is a json string, one object per matched row
        stories = json.loads(result.extracted_content)
        for s in stories[:5]:
            print(s)

asyncio.run(main())
the schema describes what to extract. baseSelector finds repeated rows. each row gets the listed fields pulled out. you get json back, ready for a database or a csv.
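to make the "ready for a csv" step concrete, here is a minimal stdlib sketch. rows_to_csv is a hypothetical helper, and the sample string only stands in for result.extracted_content; it assumes the extraction returned a json list of flat objects like the schema above produces.

```python
import csv
import io
import json


def rows_to_csv(extracted_content: str, fieldnames: list) -> str:
    """turn the json string from a css extraction into csv text."""
    rows = json.loads(extracted_content)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        # missing fields become empty cells instead of raising
        writer.writerow({name: row.get(name, "") for name in fieldnames})
    return buffer.getvalue()


# stand-in for result.extracted_content
sample = json.dumps([
    {"title": "Show HN: a thing", "url": "https://example.com/a", "rank": "1."},
    {"title": "Another story", "url": "https://example.com/b"},
])
csv_text = rows_to_csv(sample, ["rank", "title", "url"])
print(csv_text)
```

rows that lack a field (a selector that matched nothing) still produce a line, which keeps the csv aligned with the source page.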
extracting with an llm when selectors won’t hold
some sites change layouts often. some pages have data scattered in prose. that’s where the llm strategy earns its keep.
import asyncio
import os

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel

class Article(BaseModel):
    headline: str
    author: str
    published_date: str
    summary: str

async def main():
    llm_cfg = LLMConfig(
        provider="openai/gpt-4o-mini",
        api_token=os.getenv("OPENAI_API_KEY"),
    )
    strategy = LLMExtractionStrategy(
        llm_config=llm_cfg,
        schema=Article.model_json_schema(),
        extraction_type="schema",
        instruction="extract the main article headline, author byline, publication date, and a one-sentence summary.",
    )
    cfg = CrawlerRunConfig(extraction_strategy=strategy)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.theverge.com/some-article-url",
            config=cfg,
        )
        print(result.extracted_content)

asyncio.run(main())
gpt-4o-mini is cheap, fast, and accurate enough for most extraction work. the openai pricing page lists current per-token costs. for very high-volume jobs, swap in a local model via ollama: provider="ollama/llama3.1:8b".
a word of caution. llm extraction is non-deterministic. for production pipelines on stable sites, use css selectors. reserve the llm path for messy sources and one-off jobs.
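because the output can vary between runs, it helps to check each record before storing it. the sketch below is one way to do that with the stdlib only. split_valid_records and REQUIRED_FIELDS are hypothetical names, and the check (every field present as a non-empty string) is an assumption about what "valid" means for the Article schema above.

```python
import json

REQUIRED_FIELDS = ("headline", "author", "published_date", "summary")


def split_valid_records(extracted_content):
    """return (valid, rejected) lists from an llm extraction json string."""
    try:
        data = json.loads(extracted_content)
    except (TypeError, json.JSONDecodeError):
        # not json at all: reject the whole payload
        return [], [extracted_content]
    # a single object is treated as a one-item list
    records = data if isinstance(data, list) else [data]
    valid, rejected = [], []
    for record in records:
        ok = isinstance(record, dict) and all(
            isinstance(record.get(field), str) and record[field].strip()
            for field in REQUIRED_FIELDS
        )
        (valid if ok else rejected).append(record)
    return valid, rejected


good = {"headline": "h", "author": "a", "published_date": "2025-01-01", "summary": "s"}
bad = {"headline": "h", "author": ""}
valid, rejected = split_valid_records(json.dumps([good, bad]))
print(len(valid), len(rejected))
```

logging the rejected list per url gives you an early signal when a prompt or model change starts degrading extraction quality.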
handling javascript, infinite scroll, and clicks
modern sites love to hide their content behind scroll triggers and click handlers. crawl4ai exposes js_code and wait_for for this.
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

# scroll to the bottom five times, pausing 1.5 s so new items can load
scroll_js = """
(async () => {
    for (let i = 0; i < 5; i++) {
        window.scrollTo(0, document.body.scrollHeight);
        await new Promise(r => setTimeout(r, 1500));
    }
})();
"""

async def main():
    cfg = CrawlerRunConfig(
        js_code=scroll_js,
        wait_for="css:.product-card:nth-child(50)",
        page_timeout=60000,
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example-shop.com/category",
            config=cfg,
        )
        # class names do not survive into markdown, so count them in the rendered html
        # (a rough count: it matches every occurrence of the string)
        print(f"loaded {result.html.count('product-card')} product cards")

asyncio.run(main())
the js scrolls five times, the wait_for blocks until the 50th product card appears, and only then does the markdown extraction run.
adding proxies for blocked sites
once you scale past a few hundred requests, you’ll start hitting rate limits and ip blocks. crawl4ai accepts a proxy in the browser config.
from crawl4ai import BrowserConfig

browser_cfg = BrowserConfig(
    headless=True,
    proxy_config={
        "server": "http://proxy.example.com:8080",
        "username": "user",
        "password": "pass",
    },
)
for rotating residential proxies you’ll typically point at a single endpoint that rotates the exit ip on every request. mobile proxy networks like singapore mobile proxy work the same way. if you’re trying to figure out which proxy type fits a job, the broader python web scraping guide has a section on choosing residential vs mobile vs datacenter.
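if your provider hands you a list of endpoints instead of one rotating endpoint, you can rotate on your side. the sketch below is an assumption about how you might do that, not a crawl4ai feature: proxy_config_cycle is a hypothetical helper that yields dicts in the same shape as the proxy_config above, one per call to next().

```python
from itertools import cycle


def proxy_config_cycle(servers, username, password):
    """return an endless iterator of proxy_config-shaped dicts."""
    if not servers:
        raise ValueError("at least one proxy server is required")
    return (
        {"server": server, "username": username, "password": password}
        for server in cycle(servers)
    )


pool = proxy_config_cycle(
    ["http://proxy-a.example.com:8080", "http://proxy-b.example.com:8080"],
    username="user",
    password="pass",
)
first, second, third = next(pool), next(pool), next(pool)
print(first["server"], second["server"], third["server"])
```

each dict can then be passed as proxy_config when you build a new BrowserConfig for the next batch of urls.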
crawling many urls in parallel
the arun_many method takes a list and a dispatcher. the memory-adaptive dispatcher is the sane default.
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher

urls = [f"https://example.com/page/{i}" for i in range(1, 101)]

async def main():
    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=80.0,  # pause new sessions above 80% memory use
        max_session_permit=10,  # at most 10 concurrent sessions
    )
    cfg = CrawlerRunConfig()
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls=urls, config=cfg, dispatcher=dispatcher)
        for r in results:
            if r.success:
                print(r.url, len(r.markdown))

asyncio.run(main())
the dispatcher caps concurrent sessions at 10 and pauses if memory crosses 80%. those two settings are the difference between a script that runs cleanly overnight and one that crashes your laptop at 2am.
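to see what a session cap means in plain asyncio terms, here is a conceptual sketch using asyncio.Semaphore. it is not how MemoryAdaptiveDispatcher is implemented and it ignores the memory check entirely; crawl_all and fake_fetch are hypothetical names, and fake_fetch stands in for a real page fetch.

```python
import asyncio


async def crawl_all(urls, fetch, max_concurrent=10):
    """run fetch(url) for every url with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(url):
        async with semaphore:
            return await fetch(url)

    # gather returns results in the same order as the input urls
    return await asyncio.gather(*(guarded(u) for u in urls))


async def demo():
    in_flight = 0
    peak = 0

    async def fake_fetch(url):
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)  # record the highest concurrency seen
        await asyncio.sleep(0.01)
        in_flight -= 1
        return url

    urls = [f"https://example.com/page/{i}" for i in range(1, 31)]
    results = await crawl_all(urls, fake_fetch, max_concurrent=10)
    return results, peak


results, peak = asyncio.run(demo())
print(len(results), peak)
```

the semaphore keeps the peak at or below the cap no matter how many urls you queue, which is the behavior the max_session_permit setting is described as providing.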
cleaning the markdown for llm input
the default markdown is good. it’s not perfect. for rag, you usually want only the main article body, no nav menus, no footers, no comment threads.
crawl4ai exposes a content filter for exactly that.
from crawl4ai import CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

md_gen = DefaultMarkdownGenerator(
    content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed"),
)
cfg = CrawlerRunConfig(markdown_generator=md_gen)
the pruning filter scores each block by text density and link ratio, drops anything below the threshold, and gives you a tighter result.markdown.fit_markdown field optimized for embedding.
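to build intuition for threshold-based pruning, here is a toy version. it is not the library's scoring: block_score is a made-up formula that looks only at link ratio (it skips text density), and the blocks are hand-written (text, link_text) pairs rather than parsed html.

```python
def block_score(text: str, link_text: str) -> float:
    """toy score in [0, 1]: share of a block's characters that are not link text."""
    total = len(text.strip())
    if total == 0:
        return 0.0
    linked = min(len(link_text.strip()), total)
    return 1.0 - linked / total


def prune(blocks, threshold=0.48):
    """keep the text of blocks whose toy score meets the threshold."""
    return [text for text, link_text in blocks if block_score(text, link_text) >= threshold]


blocks = [
    # a nav bar: every character is link text
    ("home pricing docs blog login", "home pricing docs blog login"),
    # an article sentence: no links
    ("crawl4ai renders the page in a browser and converts it to markdown.", ""),
    # mostly prose with one short link
    ("read more on our blog", "our blog"),
]
kept = prune(blocks)
print(kept)
```

raising the threshold drops more borderline blocks, which is the same trade-off you tune with the threshold argument of the real filter.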
a complete production-ready example
putting it together. crawl 50 urls, extract structured data with css, write the results to a json lines file, and record which urls failed so you can retry them.
import asyncio
import json
from pathlib import Path

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher

SCHEMA = {
    "name": "products",
    "baseSelector": "div.product",
    "fields": [
        {"name": "name", "selector": "h2.product-name", "type": "text"},
        {"name": "price", "selector": "span.price", "type": "text"},
        {"name": "stock", "selector": "span.availability", "type": "text"},
    ],
}
URLS = [f"https://shop.example.com/category/page-{i}" for i in range(1, 51)]

async def main():
    browser_cfg = BrowserConfig(headless=True)
    run_cfg = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(SCHEMA),
        wait_until="networkidle",
        page_timeout=30000,
    )
    dispatcher = MemoryAdaptiveDispatcher(max_session_permit=8)
    out = Path("products.jsonl")
    failures = []
    saved = 0
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        results = await crawler.arun_many(urls=URLS, config=run_cfg, dispatcher=dispatcher)
    # write one json object per line, tagged with the page it came from
    with out.open("w", encoding="utf-8") as f:
        for r in results:
            if r.success and r.extracted_content:
                products = json.loads(r.extracted_content)
                for p in products:
                    p["source_url"] = r.url
                    f.write(json.dumps(p) + "\n")
                    saved += 1
            else:
                failures.append(r.url)
    # count as we write instead of reopening the file
    print(f"saved {saved} products. failures: {len(failures)}")

asyncio.run(main())
that script is the skeleton of a real production scraper. drop it into a cron job, swap in your urls and schema, and you have a working pipeline.
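the script records failed urls but does not retry them. one way to add a retry pass is a small backoff wrapper like the sketch below. retry_async is a hypothetical helper, not a crawl4ai api, and flaky only simulates a fetch that fails twice before succeeding.

```python
import asyncio


async def retry_async(operation, attempts=3, base_delay=1.0):
    """await operation() up to `attempts` times, doubling the delay between tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return await operation()
        except Exception as error:  # narrow this to the errors you expect
            last_error = error
            if attempt < attempts - 1:
                await asyncio.sleep(base_delay * 2 ** attempt)
    raise last_error


async def demo():
    calls = 0

    async def flaky():
        nonlocal calls
        calls += 1
        if calls < 3:
            raise RuntimeError("temporary failure")
        return "ok"

    value = await retry_async(flaky, attempts=3, base_delay=0.01)
    return value, calls


value, calls = asyncio.run(demo())
print(value, calls)
```

in the scraper, operation could be a coroutine that re-crawls the urls in the failures list and raises if any of them still fail.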
faq
is crawl4ai free?
yes. it’s mit-licensed and open-source. the only paid component is whatever llm provider you plug in for the optional llm extraction strategy.
does crawl4ai bypass cloudflare?
not by default. it uses standard chromium. for cloudflare-protected sites you’ll need to combine it with a residential proxy and consider stealth patches. the cloudflare turnstile bypass guide covers the techniques.
can crawl4ai scrape behind a login?
yes. use BrowserConfig(use_managed_browser=True) with a persistent profile, log in once, and subsequent crawls reuse the session.
how is crawl4ai different from firecrawl?
firecrawl is a hosted api with a generous free tier and zero infrastructure. crawl4ai is a python library you run yourself. for a side-by-side, see the firecrawl vs crawl4ai vs jina comparison.
what python versions does crawl4ai support?
3.10 and above. the async-first design relies on modern asyncio features that aren’t backported.
can i use crawl4ai with rag frameworks like langchain?
yes, and it’s actually the killer use case. the markdown output drops straight into a langchain document loader. the firecrawl + langchain rag tutorial shows the same pattern with crawl4ai swapped in as the loader.
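as a rough idea of what that hand-off involves, the sketch below splits markdown at headings into dicts with page_content and metadata keys, the two fields a langchain Document carries. markdown_to_chunks is a hypothetical helper and uses no langchain imports; it also treats any line starting with # as a heading, so markdown with # comments inside code blocks would be split incorrectly.

```python
import re


def markdown_to_chunks(markdown: str, source_url: str):
    """split markdown at headings into dicts with page_content and metadata."""
    chunks = []
    heading = ""
    lines = []

    def flush():
        # store the section collected so far, skipping empty ones
        body = "\n".join(lines).strip()
        if body:
            chunks.append({
                "page_content": body,
                "metadata": {"source": source_url, "heading": heading},
            })

    for line in markdown.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)", line)
        if match:
            flush()
            heading = match.group(1).strip()
            lines = [line]
        else:
            lines.append(line)
    flush()
    return chunks


sample_md = "intro text\n# first\nbody one\n## second\nbody two"
chunks = markdown_to_chunks(sample_md, "https://example.com/post")
print([c["metadata"]["heading"] for c in chunks])
```

each dict maps onto a Document's page_content and metadata, so converting the list for a vector store is a one-line comprehension.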
conclusion
crawl4ai is the cleanest way to get llm-ready text out of the modern web with python. one async call, one markdown blob, no selector pain unless you want it. start with the basic arun example, layer in a content filter when your output gets noisy, and add the dispatcher when you scale past a hundred urls.
the official repo at github.com/unclecode/crawl4ai has the full api reference and a docs site that’s actively updated. star it if you find it useful. the project moves fast and issues get answered.