CrewAI for scraping pipelines: complete 2026 guide
CrewAI scraping pipeline patterns turn the abstract “give the LLM some tools and pray” approach into a structured org chart of specialized agents who each do one job well. By early 2026, CrewAI has reached version 0.86, the Crew DSL has stabilized, and the framework’s role-based architecture maps surprisingly cleanly onto the way real scraping teams actually think about work: someone scouts targets, someone fetches, someone extracts, someone validates.
This guide builds an end-to-end CrewAI scraping pipeline for monitoring competitor prices across multiple ecommerce sites. We define agents, tasks, tools, the hierarchical process, and the integration points with your proxy pool, your database, and your alerting system. By the end you will have a working pipeline plus the knowledge to scale it past prototype.
Why CrewAI shines for scraping
CrewAI’s central abstraction is the Crew, a group of Agents that share Tools and execute Tasks under a Process. The framework forces you to think about division of labor up front, and that constraint produces cleaner pipelines than a single mega-agent.
For scraping, the role split that works in production is:
| Role | Responsibility | Typical tools |
|---|---|---|
| Scout | Discover URLs to scrape | Search APIs, sitemap crawler |
| Fetcher | Pull HTML for each URL with proxy rotation | HTTP client, Playwright |
| Extractor | Parse structured data from HTML | LLM with JSON Schema |
| Validator | QA the extraction against business rules | Pydantic, custom validators |
| Reporter | Format and dispatch results | Database writer, notifier |
This shape mirrors how a competent human team handles scraping. CrewAI lets you express it directly.
Installing the stack
pip install crewai==0.86.0 crewai-tools==0.20.0 langchain-openai==0.2.10 \
playwright==1.49.0 pydantic==2.9.2 httpx==0.27.2
playwright install chromium
export OPENAI_API_KEY="sk-..."
If you prefer Anthropic:
pip install langchain-anthropic==0.2.10
export ANTHROPIC_API_KEY="sk-ant-..."
Defining agents
Agents are declarative. You give them a role, a goal, a backstory (which is more important than it sounds because the LLM uses it for tone and decision style), and a set of tools.
from crewai import Agent
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.1)

scout = Agent(
    role="Web Scout",
    goal="Find every product URL for a competitor's catalog",
    backstory=(
        "Senior research analyst with a decade of experience mapping ecommerce "
        "catalogs. Methodical, exhaustive, never misses a category page."
    ),
    llm=llm,
    verbose=True,
    allow_delegation=False,
)

fetcher = Agent(
    role="HTTP Fetcher",
    goal="Reliably fetch page HTML through a rotating proxy pool",
    backstory=(
        "Pragmatic engineer who treats every fetch as adversarial. Always retries, "
        "always rotates IPs, always respects rate limits."
    ),
    llm=llm,
    verbose=True,
    allow_delegation=False,
)

extractor = Agent(
    role="Data Extractor",
    goal="Pull title, price, currency, and stock status from product HTML",
    backstory=(
        "Detail-obsessed analyst who treats malformed JSON as a personal insult. "
        "Returns clean structured records or explicit nulls, never guesses."
    ),
    llm=llm,
    verbose=True,
    allow_delegation=False,
)
Notice how the backstory does most of the work. CrewAI agents read the backstory before every task, and a good backstory shifts behavior more reliably than instruction tweaks in the task itself.
Building tools
CrewAI tools are either plain Python functions decorated with @tool or subclasses of BaseTool. The LLM decides when to call a tool by reading its docstring (for @tool functions) or its description field (for BaseTool subclasses), so write that text like a help message.
from crewai.tools import BaseTool
from pydantic import BaseModel, Field
import httpx
import os
import random

# Comma-separated proxy URLs; empty entries are dropped, so an unset variable yields []
PROXIES = [p for p in os.environ.get("PROXY_POOL", "").split(",") if p]

class FetchInput(BaseModel):
    url: str = Field(..., description="The URL to fetch")
    timeout_s: int = Field(30, description="Request timeout in seconds")

class FetchTool(BaseTool):
    name: str = "fetch_url"
    description: str = (
        "Fetch a URL through the rotating proxy pool. Returns HTML or an error. "
        "Use for any HTTP fetch in this pipeline."
    )
    args_schema: type[BaseModel] = FetchInput

    def _run(self, url: str, timeout_s: int = 30) -> str:
        proxy = random.choice(PROXIES) if PROXIES else None
        try:
            with httpx.Client(proxy=proxy, timeout=timeout_s, follow_redirects=True) as c:
                r = c.get(url, headers={"User-Agent": "Mozilla/5.0"})
        except httpx.HTTPError as exc:
            # Return the failure as text so the agent can decide whether to retry
            return f"ERROR {type(exc).__name__}: {exc}"
        # Cap the body so one huge page cannot flood the prompt
        return f"HTTP {r.status_code}\n\n{r.text[:200000]}"

fetch_tool = FetchTool()
For Playwright-based fetching of JavaScript-heavy sites, expose a parallel render_url tool:
from playwright.sync_api import sync_playwright

class RenderTool(BaseTool):
    name: str = "render_url"
    description: str = (
        "Render a URL in headless Chromium and return the HTML after JS executes. "
        "Use only when fetch_url returns insufficient data because of JS-driven content."
    )
    args_schema: type[BaseModel] = FetchInput

    def _run(self, url: str, timeout_s: int = 30) -> str:
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            try:
                page = browser.new_page()
                # Playwright timeouts are in milliseconds
                page.goto(url, wait_until="networkidle", timeout=timeout_s * 1000)
                return page.content()
            finally:
                # Close the browser even when navigation fails
                browser.close()

render_tool = RenderTool()
Attach the tools to the right agents:
fetcher.tools = [fetch_tool, render_tool]
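The render_url description tells the agent to use it only when fetch_url returns insufficient data. That judgement can be made deterministic instead of leaving it to the LLM. The following is a minimal sketch using only the standard library; the names `visible_text_length` and `needs_render` are hypothetical, and the 1000-character threshold is an assumption borrowed from the fetch task description.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text that is not inside <script>, <style>, or <noscript>."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside the skipped elements
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text_length(html: str) -> int:
    """Number of characters of human-visible text in the document."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return sum(len(chunk) for chunk in parser.chunks)


def needs_render(html: str, min_chars: int = 1000) -> bool:
    """True when the page looks like a JS-only shell (assumed threshold)."""
    return visible_text_length(html) < min_chars
```

A fetch tool could call `needs_render` on its own response and fall through to the Playwright path itself, which saves one LLM round trip per JavaScript-heavy page.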
Defining tasks
Tasks express what each agent should do and what they should produce. Every task gets an expected_output description that doubles as the validation hint for the LLM.
from crewai import Task

scout_task = Task(
    description=(
        "Discover all product URLs for {target_site} in the {category} category. "
        "Use the sitemap if available. Return a JSON array of URLs."
    ),
    expected_output="JSON array of strings, each a fully qualified product URL",
    agent=scout,
)

# Curly braces in a description are treated as input placeholders at kickoff,
# so the JSON shapes below are described in words rather than written literally.
fetch_task = Task(
    description=(
        "For each URL in the prior task output, fetch the HTML using fetch_url. "
        "If the response contains less than 1000 characters of body content or "
        "looks like a JS-only shell, retry with render_url. "
        "Return a JSON object that maps each URL to its HTML."
    ),
    expected_output="JSON object mapping URL to raw HTML",
    agent=fetcher,
    context=[scout_task],
)

extract_task = Task(
    description=(
        "For each URL and HTML pair, extract title, price (number), currency (3-letter code), "
        "and in_stock (boolean). Return a JSON array of records."
    ),
    expected_output=(
        "JSON array of objects with keys: url, title, price, currency, in_stock"
    ),
    agent=extractor,
    context=[fetch_task],
)
The context field is how data flows between tasks. CrewAI passes prior task outputs as text to the next agent’s prompt.
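Because task output travels as plain text, any code that consumes it outside the crew has to recover JSON from a string that may be wrapped in Markdown fences or surrounded by commentary. Below is a best-effort sketch using only the standard library; `parse_json_output` is a hypothetical helper, not part of CrewAI.

```python
import json
import re

# Build the three-backtick fence marker indirectly to keep this listing readable
_TICKS = "`" * 3
_FENCE = re.compile(_TICKS + r"(?:json)?\s*(.*?)" + _TICKS, re.DOTALL | re.IGNORECASE)


def parse_json_output(text: str):
    """Best-effort recovery of a JSON value from LLM text output."""
    # 1. Try the whole string, then the body of any fenced block
    candidates = [text.strip()]
    candidates += [match.strip() for match in _FENCE.findall(text)]
    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue

    # 2. Last resort: decode from the first '[' or '{' that starts valid JSON
    decoder = json.JSONDecoder()
    for index, char in enumerate(text):
        if char in "[{":
            try:
                value, _end = decoder.raw_decode(text, index)
                return value
            except json.JSONDecodeError:
                continue
    raise ValueError("no JSON value found in task output")
```

Raising on failure rather than returning an empty list keeps a malformed extraction from being mistaken for an empty catalog.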
Assembling the crew
from crewai import Crew, Process

crew = Crew(
    agents=[scout, fetcher, extractor],
    tasks=[scout_task, fetch_task, extract_task],
    process=Process.sequential,
    verbose=True,
)

result = crew.kickoff(inputs={
    "target_site": "https://www.lazada.sg",
    "category": "ergonomic-keyboards",
})
print(result.raw)
Sequential is the default and the right pick for a linear scraping pipeline. For more complex workflows where one agent should orchestrate others, use Process.hierarchical and provide a manager LLM.
Hierarchical process for adaptive pipelines
The hierarchical process puts a manager agent in charge. The manager decides which worker to call, in what order, with what arguments. It fits when the sequence of steps depends on the input.
from crewai import Crew, Process
from langchain_openai import ChatOpenAI

manager_llm = ChatOpenAI(model="gpt-4o", temperature=0)

crew = Crew(
    agents=[scout, fetcher, extractor],
    tasks=[scout_task, fetch_task, extract_task],
    process=Process.hierarchical,
    manager_llm=manager_llm,
    verbose=True,
)
The manager dispatches dynamically and is the right choice when you cannot enumerate the steps up front. Cost is higher because every dispatch decision is an LLM call. Reserve for genuine adaptivity, not as the default.
Comparing CrewAI to LangGraph and AutoGen
| Dimension | CrewAI | LangGraph | AutoGen |
|---|---|---|---|
| Mental model | Org chart of roles | State machine | Group chat |
| Best fit for scraping | Multi-step pipelines with clear roles | Branching state-driven workflows | Conversational extraction |
| Tool definition | BaseTool subclass | LangChain Tool | Function decorator |
| Persistence | Manual | Built-in checkpointer | Manual |
| Learning curve | Lowest | Moderate | Steep |
| Production maturity | High | High | Moderate |
| Native MCP | Via wrappers | Via wrappers | Yes |
CrewAI is the framework to pick when the pipeline shape mirrors a team org chart and you want the fastest possible path from idea to working code. LangGraph wins for pipelines with non-linear branching and long-running state. AutoGen wins for conversational extraction where multiple agents debate the right answer.
For the LangGraph alternative, see our scraping with LangGraph agents guide. For AutoGen, see Multi-agent scraping with AutoGen in 2026.
Adding proxy rotation
CrewAI itself does not own the network layer; your tools do. The pattern is to keep a proxy pool in env or in a small singleton, and pick from it inside every fetch tool. We showed this pattern above; here is the production refinement that adds health checking.
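import os
import time
from collections import defaultdict

class ProxyPool:
    def __init__(self, proxies):
        self.proxies = proxies
        self.failures = defaultdict(int)     # running failure score per proxy
        self.last_used = defaultdict(float)  # timestamp of the latest pick or report

    def pick(self):
        # Bench any proxy with 3 or more recorded failures
        candidates = [p for p in self.proxies if self.failures[p] < 3]
        if not candidates:
            # Every proxy is benched: forgive all and start over
            self.failures.clear()
            candidates = self.proxies
        # Least recently used first
        proxy = min(candidates, key=lambda p: self.last_used[p])
        # Stamp at pick time so back-to-back picks rotate through the pool
        self.last_used[proxy] = time.time()
        return proxy

    def report(self, proxy, success):
        self.last_used[proxy] = time.time()
        if not success:
            self.failures[proxy] += 1
        else:
            self.failures[proxy] = max(0, self.failures[proxy] - 1)

# PROXY_POOL is a comma-separated list of proxy URLs
pool = ProxyPool(os.environ["PROXY_POOL"].split(","))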
For ASEAN ecommerce scraping with mobile IPs that pass strict carrier-level checks, a Singapore mobile proxy plugs into this pool directly.
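The pool only helps if every fetch reports its outcome back. One way a fetch tool might wrap the pick/report cycle in a retry loop is sketched below. `fetch_with_rotation` and `fetch_fn` are hypothetical names; `pool` is any object with the same `pick()` and `report(proxy, success)` methods as the ProxyPool above.

```python
import time
from typing import Callable, Optional


def fetch_with_rotation(
    pool,
    fetch_fn: Callable[[str, Optional[str]], str],
    url: str,
    attempts: int = 3,
    backoff_s: float = 0.5,
) -> str:
    """Try `url` through up to `attempts` proxies, reporting each outcome.

    `fetch_fn(url, proxy)` is assumed to return the response body or raise
    on failure.
    """
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        proxy = pool.pick()
        try:
            body = fetch_fn(url, proxy)
        except Exception as exc:  # a real tool would catch httpx.HTTPError
            pool.report(proxy, success=False)
            last_error = exc
            if attempt < attempts - 1:
                # Exponential backoff between attempts: 0.5s, 1s, 2s, ...
                time.sleep(backoff_s * (2 ** attempt))
            continue
        pool.report(proxy, success=True)
        return body
    raise RuntimeError(f"all {attempts} attempts failed for {url}") from last_error
```

Keeping the retry loop in Python rather than in the agent's reasoning means a blocked proxy costs a few hundred milliseconds instead of an extra LLM call.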
Validator and Reporter agents in detail
The five-role split listed at the top of the article needs two agents we have not yet shown in code. Adding them turns a fragile demo into a production-grade pipeline.
from crewai import Agent

validator = Agent(
    role="Data Quality Auditor",
    goal=(
        "Reject any extracted record that violates business rules. Price must be > 0 "
        "and < 100000. Currency must be a valid ISO 4217 code. Title must be non-empty "
        "and under 500 chars. in_stock must be a bool. Flag suspicious values for review."
    ),
    backstory=(
        "Former data engineer who spent two years cleaning a B2B product catalog. "
        "Allergic to silently incorrect data. Will refuse to pass through anything "
        "that smells wrong, and will document why."
    ),
    llm=llm,
    verbose=True,
)

reporter = Agent(
    role="Insights Reporter",
    goal=(
        "Compare today's extracted prices against yesterday's and emit a Slack-ready "
        "summary highlighting price drops over 10%, new SKUs, and out-of-stock changes."
    ),
    backstory=(
        "Pricing analyst with a journalism background. Writes summaries that a busy "
        "merchandising manager can act on in 30 seconds."
    ),
    llm=llm,
    verbose=True,
)
The Validator agent in particular pays for itself the first time the Extractor mistakes a postcode field for a price and tries to write 94025 to your price_usd column.
Custom Python validation tool
LLMs are fine for fuzzy validation but bad at strict rule checking. Pair the Validator agent with a deterministic tool that does the unforgiving work.
from crewai.tools import BaseTool
from pydantic import BaseModel, ValidationError, Field as PField
from typing import Literal

class ProductRecord(BaseModel):
    url: str
    title: str = PField(min_length=1, max_length=500)
    price: float = PField(gt=0, lt=100000)
    # Only the currencies this pipeline expects; anything else is rejected
    currency: Literal["USD", "SGD", "EUR", "JPY", "GBP", "INR", "MYR", "THB", "IDR", "VND"]
    in_stock: bool

class ValidateInput(BaseModel):
    records: list

class StrictValidatorTool(BaseTool):
    name: str = "strict_validate"
    description: str = (
        "Run deterministic validation on a list of product records. Returns the "
        "subset that passes plus a list of errors for the rejected records."
    )
    args_schema: type[BaseModel] = ValidateInput

    def _run(self, records: list) -> dict:
        passed, errors = [], []
        for r in records:
            try:
                # Pydantic enforces every field constraint in one pass
                passed.append(ProductRecord(**r).model_dump())
            except ValidationError as e:
                errors.append({"record": r, "errors": e.errors()})
        return {"passed": passed, "errors": errors}
The Validator agent calls strict_validate and uses its output to decide what to forward to the Reporter.
Adding async execution and concurrency
Sequential is the default, but real production crews need parallelism. CrewAI exposes kickoff_async on the crew and async_execution=True on individual tasks. Use them when the units of work are independent.
import asyncio
from crewai import Crew, Process

# validate_task and report_task are assumed to be defined like the tasks shown earlier,
# with descriptions that reference a {target_url} input
crew = Crew(
    agents=[scout, fetcher, extractor, validator, reporter],
    tasks=[scout_task, fetch_task, extract_task, validate_task, report_task],
    process=Process.sequential,
)

# Process N URLs in parallel by launching N concurrent crew runs
async def scrape_many(urls):
    sem = asyncio.Semaphore(10)  # cap concurrent crews

    async def one(url):
        async with sem:
            return await crew.kickoff_async(inputs={"target_url": url})

    return await asyncio.gather(*(one(u) for u in urls))
The Semaphore is the single most important line. Without it, 1000 URLs spawn 1000 simultaneous Crew instances, each holding a Playwright browser, and the host runs out of memory in 30 seconds.
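The effect of the cap is easy to check in isolation. The sketch below swaps the crew for a stand-in coroutine (`fake_crew_run` is hypothetical and sleeps instead of scraping) and records how many jobs were ever in flight at once.

```python
import asyncio


async def run_bounded(jobs, limit: int):
    """Run coroutine factories with at most `limit` in flight; return results and the peak."""
    sem = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def guarded(job):
        nonlocal in_flight, peak
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await job()
            finally:
                in_flight -= 1

    # gather preserves input order in its result list
    results = await asyncio.gather(*(guarded(job) for job in jobs))
    return results, peak


async def fake_crew_run(i: int) -> int:
    # Stand-in for crew.kickoff_async(...); sleeps instead of scraping
    await asyncio.sleep(0.01)
    return i


if __name__ == "__main__":
    jobs = [lambda i=i: fake_crew_run(i) for i in range(50)]
    results, peak = asyncio.run(run_bounded(jobs, limit=10))
    print(len(results), "jobs finished; peak concurrency:", peak)
```

All 50 jobs complete, but the peak never exceeds the limit. Tuning that one number against available memory is usually simpler than sizing a worker pool.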
Wiring CrewAI into a warehouse
The Reporter agent should not be writing directly to your warehouse. Give it a thin tool that calls a typed function in your data layer.
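import asyncpg  # used by the data-layer helper, not by the tool itself
from crewai.tools import BaseTool

class WarehouseWriteTool(BaseTool):
    name: str = "warehouse_write"
    description: str = "Append validated product records to the warehouse."

    def _run(self, records: list) -> dict:
        # _sync_warehouse_write is your own data-layer function (not shown here):
        # a synchronous wrapper around an async pool
        return {"written": _sync_warehouse_write(records)}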
The reason to keep this thin: if the LLM hallucinates a malformed record that slips through the Validator, the warehouse layer catches it via the column-type contract. Defense in depth.
Cost benchmarks
For a sequential scrape of 100 product URLs with three agents:
| Setup | LLM cost | Wall clock |
|---|---|---|
| GPT-4o-mini all agents | $0.45 | 8 min |
| GPT-4o all agents | $7.20 | 9 min |
| Mixed: 4o-mini scout/fetcher, 4o extractor | $1.80 | 8.5 min |
| Claude Haiku all agents | $0.55 | 7 min |
| Claude Sonnet all agents | $7.80 | 8 min |
The mixed setup is the value sweet spot. Use cheap models for orchestration agents, expensive models only for the agent that needs to read messy HTML and produce clean JSON.
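To see what the table means at volume, the per-100-URL figures can be projected forward. This is simple arithmetic on the numbers above and assumes cost per URL stays constant as volume grows, which real pipelines only approximate.

```python
# LLM cost in USD per 100 URLs, copied from the benchmark table
COST_PER_100_URLS = {
    "gpt-4o-mini all agents": 0.45,
    "gpt-4o all agents": 7.20,
    "mixed (4o-mini + 4o extractor)": 1.80,
    "claude haiku all agents": 0.55,
    "claude sonnet all agents": 7.80,
}


def projected_cost(setup: str, urls: int) -> float:
    """Linear projection; assumes cost per URL does not change with scale."""
    return round(COST_PER_100_URLS[setup] / 100 * urls, 2)


if __name__ == "__main__":
    for setup in COST_PER_100_URLS:
        print(f"{setup:32s} ${projected_cost(setup, 10_000):>8.2f} per 10,000 URLs")
```

At 10,000 URLs the mixed setup projects to $180 against $720 for GPT-4o everywhere, which is where the gap starts to matter on a monthly bill.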
Cost levers in priority order
The fastest wins on CrewAI cost, in the order they pay off:
- Trim HTML before passing to Extractor. Strip scripts, styles, comments. Cuts tokens by 40 to 70 percent.
- Use GPT-4o-mini for Scout, Fetcher, Validator, Reporter. Reserve GPT-4o or Sonnet for Extractor only.
- Cache extractions by HTML hash. Same page extracted twice should not pay LLM twice.
- Cap max_iter per agent to 8. Default 25 lets confused agents loop expensively.
- Switch verbose off in production. Verbose mode includes all intermediate thought tokens in the trace which the agent re-reads.
Combined, these cut typical per-page cost by 60 to 80 percent on a tuned pipeline versus a default-config baseline.
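The first and third levers combine naturally: trim the HTML, then key a cache on the hash of the trimmed result. The sketch below uses only the standard library. `trim_html` and `ExtractionCache` are hypothetical names, and the regex-based stripping is an assumption that is adequate for cutting tokens but is not a full HTML parser.

```python
import hashlib
import re

_DROP_BLOCKS = re.compile(
    r"<(script|style|noscript)\b[^>]*>.*?</\1\s*>", re.DOTALL | re.IGNORECASE
)
_COMMENTS = re.compile(r"<!--.*?-->", re.DOTALL)
_WHITESPACE = re.compile(r"\s+")


def trim_html(html: str) -> str:
    """Strip scripts, styles, and comments, then collapse whitespace."""
    html = _DROP_BLOCKS.sub("", html)
    html = _COMMENTS.sub("", html)
    return _WHITESPACE.sub(" ", html).strip()


class ExtractionCache:
    """In-memory cache keyed by the SHA-256 of the trimmed HTML."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    def get_or_extract(self, html: str, extract_fn):
        trimmed = trim_html(html)
        key = hashlib.sha256(trimmed.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache miss: pay for the extraction once and remember the result
        result = extract_fn(trimmed)
        self._store[key] = result
        return result
```

Hashing after trimming means cosmetic changes such as rotated tracking scripts or whitespace do not invalidate the cache. A production version would back the store with Redis or a database table rather than a dict.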
Memory and learning
CrewAI agents can be configured with memory that persists across runs. Two flavors: short-term (within a crew run) and long-term (across runs, persisted to a SQLite file in the example below).
from crewai.memory import LongTermMemory
from crewai.memory.storage.ltm_sqlite_storage import LTMSQLiteStorage

extractor.memory = True

crew = Crew(
    agents=[scout, fetcher, extractor],
    tasks=[scout_task, fetch_task, extract_task],
    process=Process.sequential,
    memory=True,
    long_term_memory=LongTermMemory(
        storage=LTMSQLiteStorage(db_path="./crew_ltm.db")
    ),
)
For scraping, long-term memory pays off when the same agent learns site-specific quirks across runs. After a few iterations, the extractor remembers that a certain Lazada page renders price in a non-standard div.
Production deployment
Run CrewAI under a worker queue (Celery, RQ, or a Postgres-backed queue) and treat each crew kickoff as a job. Set hard timeouts, log every agent step, and persist the result to your warehouse.
For long-running crews, set step_callback and task_callback to stream progress to your observability stack. The callbacks fire after every agent step and every task completion respectively.
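A pair of callbacks that emit one JSON line per event might look like the sketch below. The payload type passed to each callback differs between callbacks and CrewAI versions, so this version makes no assumptions about its attributes and records only the class name and a truncated string form. The names `to_json_line`, `step_logger`, and `task_logger` are hypothetical.

```python
import json
import logging
import sys
import time

logger = logging.getLogger("crew.steps")
logger.setLevel(logging.INFO)
if not logger.handlers:
    logger.addHandler(logging.StreamHandler(sys.stdout))


def to_json_line(event: str, payload) -> str:
    """Serialize a callback payload to a single JSON line."""
    return json.dumps(
        {
            "ts": round(time.time(), 3),
            "event": event,
            "type": type(payload).__name__,
            # Truncate so one verbose step cannot flood the log pipeline
            "text": str(payload)[:2000],
        }
    )


def step_logger(step_output) -> None:
    logger.info(to_json_line("agent_step", step_output))


def task_logger(task_output) -> None:
    logger.info(to_json_line("task_done", task_output))
```

These would be passed as `step_callback=step_logger` and `task_callback=task_logger` when constructing the Crew. Because `json.dumps` escapes newlines, each event stays on one line and can be shipped to a log aggregator without multiline parsing.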
The official CrewAI documentation covers deployment options in depth, including the managed CrewAI Plus platform.
Frequently asked questions
Can CrewAI agents call MCP servers?
Not natively in 0.86. Wrap the MCP server in a Python tool that translates BaseTool calls to MCP JSON-RPC. The community has at least three open-source bridges; pick the one most actively maintained.
How do I prevent runaway agent loops?
Set max_iter on each agent (default 25). For task-level safety, set max_execution_time in seconds. Both options cap cost and clock.
Can I run CrewAI without OpenAI or Anthropic?
Yes. Any LangChain LLM works, including Ollama, vLLM, LM Studio, and Bedrock. Quality drops with smaller open-source models, especially on the extraction agent which needs strong JSON Schema adherence.
Does CrewAI handle parallel agent execution?
Sequential and hierarchical processes are single-threaded. For genuine parallelism, run multiple crew invocations under asyncio or a worker pool.
What about debugging when an agent goes off-script?
Set verbose=True on agents and crew. The terminal trace shows every thought, action, and observation. For richer logging, integrate with LangSmith.
Are CrewAI tasks idempotent? Can I safely retry?
Tasks themselves are idempotent only if your tools are idempotent. The framework will not deduplicate side effects. Wrap the side-effecting step (warehouse write, alert dispatch) with an idempotency key based on the input.
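As a rough illustration of the idempotency-key idea, the sketch below hashes the record content and lets a primary-key constraint reject replays. It uses SQLite from the standard library as a stand-in for the warehouse; the `prices` table and both function names are hypothetical.

```python
import hashlib
import json
import sqlite3


def idempotency_key(record: dict) -> str:
    """Stable hash of the record content; key order does not matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def write_once(conn: sqlite3.Connection, record: dict) -> bool:
    """Insert the record unless its key was seen before. Returns True if written."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS prices (key TEXT PRIMARY KEY, payload TEXT NOT NULL)"
    )
    cur = conn.execute(
        "INSERT OR IGNORE INTO prices (key, payload) VALUES (?, ?)",
        (idempotency_key(record), json.dumps(record, sort_keys=True)),
    )
    conn.commit()
    # rowcount is 1 for a fresh insert and 0 when the key already existed
    return cur.rowcount == 1
```

For daily price monitoring, include the scrape date in the record before hashing. Otherwise an unchanged price on the following day would be treated as a replay and dropped.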
Can the same crew handle multiple sites with different layouts?
Yes, but the Extractor agent benefits from a per-site backstory or per-site few-shot examples. The pattern that works is one Crew per site, sharing the Scout/Fetcher/Validator/Reporter agents and swapping only the Extractor.
How does CrewAI compare to writing the same pipeline in plain LangChain?
CrewAI is roughly 30 percent fewer lines of code for typical 3-to-5 agent pipelines and the role-based abstraction makes the intent clearer in code review. Plain LangChain wins when you need fine-grained control over the prompt assembly per turn.
Can I use CrewAI for crawling, not just scraping?
Yes. Add a Crawler agent with a BFS tool that takes seed URLs and depth, and pass the discovered URL list to the Fetcher. CrewAI’s role abstraction handles the recursion via the same Crawler agent re-invoked per depth.
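The BFS logic itself is worth keeping deterministic inside the tool rather than leaving the traversal to the LLM. A minimal sketch follows; `bfs_crawl` and `get_links` are hypothetical names, and `get_links(url)` is assumed to fetch a page and return the URLs it links to.

```python
from collections import deque
from typing import Callable, Iterable, List


def bfs_crawl(
    seeds: Iterable[str],
    get_links: Callable[[str], Iterable[str]],
    max_depth: int = 2,
    max_pages: int = 500,
) -> List[str]:
    """Breadth-first discovery from `seeds`, visiting each URL at most once."""
    seen = set()
    order = []
    queue = deque()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            queue.append((seed, 0))

    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # do not expand pages at the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order
```

The `max_pages` cap matters as much as `max_depth`: a single faceted-navigation page can link to thousands of near-duplicate URLs at depth one.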
Common production gotchas
A handful of issues bite teams during their first month.
The Extractor agent silently truncates HTML when the input exceeds the LLM’s context window. Always pre-trim with a deterministic rule (strip scripts, styles, hidden divs, and whitespace) before sending to the LLM, and log the trimmed size.
CrewAI’s verbose=True writes to stdout in a hard-to-parse format. For production, replace with a structured logging callback that emits JSON lines you can grep and ship to your log aggregator.
Tool argument schemas are inferred at class load time. Adding a field after the agent has already been constructed is a no-op until you reload. Restart workers on tool changes.
In hierarchical mode the manager LLM makes every dispatch decision, and on small jobs that single model choice dominates the cost. Start with manager_llm set to GPT-4o-mini and move to a stronger model only when the dispatch decision is hard.
The crew’s kickoff method blocks the calling thread. For async workers, use kickoff_async and await it. Mixing the two in the same process leads to nested event loop errors.
For broader patterns on building agentic scraping in 2026, browse the AI agentic proxies category.