Scrapy Cloud vs Crawlee Cloud in 2026
Scrapy Cloud vs Crawlee Cloud is the comparison every Python or Node scraping team faces eventually. Both platforms come from Apify (which acquired Scrapinghub’s Scrapy Cloud business in 2021 and integrated Crawlee, its own Node-first scraping framework, into the same hosted offering). They share underlying infrastructure but differ in framework, language ecosystem, pricing model, and the tooling they expose. Picking the right one is mostly about which framework your team standardizes on and which language has better library coverage for your targets.
This guide compares both platforms feature by feature in 2026, prices the realistic cost at common workloads, and walks through deployment for each. The benchmarks come from actually running production scrapers on both during 2025-2026 with similar workloads. By the end you will know which platform fits your project, what the real costs are, and how to deploy without surprises.
What each platform is
Scrapy Cloud is Apify’s hosted platform for Scrapy spiders. You write Scrapy in Python, use shub deploy to push to the platform, and pay for compute units (run-time) and data items processed. It has a long history (the hosted platform dates to 2010), is used by teams at large enterprises, and integrates with the Scrapy ecosystem (item pipelines, middleware, schedulers).
Crawlee Cloud is Apify’s hosted platform for Crawlee, the JavaScript/TypeScript-first scraping framework that Apify itself maintains. You write Crawlee in Node or TypeScript, deploy via the Apify CLI, and pay for actor runs (compute time) and storage. It is newer (Crawlee was released in 2022; the hosted version stabilized in 2024) but actively developed.
Both run on the same Apify cloud underneath, so deployment, scheduling, monitoring, and storage primitives are similar. The main difference is the framework you write in.
For Apify’s official platform docs, see docs.apify.com.
Pricing comparison
As of mid-2026:
| dimension | Scrapy Cloud | Crawlee Cloud |
|---|---|---|
| compute cost | $0.40/CU (1 CU = 1 GB RAM for 1 hr) | $0.40/CU (same) |
| storage cost | $0.20/GB/month | $0.20/GB/month (same) |
| dataset reads | $0.02/1k records | $0.02/1k records (same) |
| starter plan | $39/month (1 CU) | $39/month (1 CU) |
| free tier | $5 credit/month | $5 credit/month |
Pricing is identical because they share infrastructure. The cost differences come from runtime efficiency, which depends on your code:
| workload | Scrapy time | Crawlee time | cost diff |
|---|---|---|---|
| 100k HTML pages, simple parse | 8 hr | 6 hr | Crawlee 25% cheaper |
| 100k pages with Playwright | 12 hr | 10 hr | Crawlee 17% cheaper |
| 10k pages with custom middleware | 0.5 hr | 0.6 hr | Scrapy slightly cheaper |
| 1M pages, no JS | 80 hr | 60 hr | Crawlee 25% cheaper |
Crawlee tends to run faster for browser-heavy workloads because its Playwright integration is more efficient. Scrapy can be faster for HTML-only workloads with heavy custom middleware where its async model shines.
For a 1M-page/month workload, expect ~$80-120/month on either platform.
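To sanity-check an estimate like that before committing, here is a back-of-envelope cost model using the prices from the tables above, assuming 1 CU = 1 GB RAM for 1 hour; the run-hours, RAM, storage, and read volumes are illustrative, not measured:

// Hypothetical cost sketch; prices taken from the pricing table above.
const CU_PRICE = 0.40;      // $ per compute unit (1 GB RAM x 1 hr)
const STORAGE_PRICE = 0.20; // $ per GB-month
const READ_PRICE = 0.02;    // $ per 1k dataset records read

function monthlyCost(runHours: number, ramGb: number, storageGb: number, readsThousands: number): number {
  return runHours * ramGb * CU_PRICE + storageGb * STORAGE_PRICE + readsThousands * READ_PRICE;
}

// 1M HTML-only pages/month: ~80 hr runtime at 2 GB RAM, ~10 GB of items, read once.
console.log(monthlyCost(80, 2, 10, 1000).toFixed(2)); // "86.00" — inside the ~$80-120 band above

Plugging in your own test-run numbers (see the operational checklist later) is more useful than any published benchmark.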
Framework comparison
| dimension | Scrapy | Crawlee |
|---|---|---|
| primary language | Python | TypeScript / JavaScript |
| years in production | since 2008 | since 2022 |
| async model | Twisted (legacy), asyncio (modern) | Node async/await native |
| browser integration | scrapy-playwright | built-in PlaywrightCrawler/PuppeteerCrawler |
| middleware system | very rich, mature | growing |
| deduplication | Request fingerprinter | RequestQueue with built-in dedup |
| pipelines | item pipelines (post-process) | dataset push (simpler) |
| spider lifecycle hooks | yes (signals) | yes (lifecycle handlers) |
| stats collection | Scrapy Stats | built-in metrics |
| community size | very large | growing |
| TypeScript support | n/a | native |
Scrapy is the more mature framework with a richer middleware ecosystem. Crawlee is newer but designed from scratch for browser scraping, which gives it cleaner APIs for that use case.
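To make the deduplication row concrete: Crawlee’s RequestQueue derives a uniqueKey from each normalized URL, so re-adding a URL you have already enqueued is a no-op. A minimal sketch (the URL is a placeholder):

import { RequestQueue } from "crawlee";

const queue = await RequestQueue.open();

// The second add is deduplicated against the first by uniqueKey.
const first = await queue.addRequest({ url: "https://example.com/p/1" });
const second = await queue.addRequest({ url: "https://example.com/p/1" });
console.log(first.wasAlreadyPresent, second.wasAlreadyPresent); // false true

Scrapy’s request fingerprinter achieves the same thing, but you configure it rather than getting it from the queue itself.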
For a Scrapy + Playwright deep dive, see Scrapy + Playwright integration in 2026.
Deploy: Scrapy Cloud
pip install shub
shub login # paste API key from Apify console
# In your Scrapy project
shub deploy
# Select project from list
Wait a minute for the build, then run via the web UI or CLI:
shub schedule scraper-name
shub items <run-id> # download items as JSON
scrapinghub.yml config:
projects:
  default: 12345
stacks:
  default: scrapy:2.11-py310
requirements:
  file: requirements.txt
Scrapy Cloud has a few stacks (Python versions and Scrapy versions). Pick the latest unless you have specific constraints.
Deploy: Crawlee Cloud
npm install -g apify-cli
apify login # paste API key
# In your Crawlee project
apify push
The Apify CLI wraps the project as an “actor” (Apify’s term for a deployable scraping unit) and pushes it to the platform.
actor.json config:
{
  "actorSpecification": 1,
  "name": "my-crawler",
  "version": "0.0",
  "buildTag": "latest",
  "input": {
    "title": "Crawler Input",
    "type": "object",
    "properties": {
      "startUrls": {
        "title": "Start URLs",
        "type": "array",
        "editor": "requestListSources"
      }
    }
  }
}
Run via the web UI or CLI:
apify call my-crawler --input='{"startUrls":[{"url":"https://example.com"}]}'
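Inside the actor, the input declared in that schema is read at runtime. A minimal sketch assuming the standard apify SDK pattern (the type annotation and handler body are illustrative):

import { Actor } from "apify";
import { CheerioCrawler } from "crawlee";

await Actor.init();

// Matches the "startUrls" property declared in actor.json above.
const input = await Actor.getInput<{ startUrls: { url: string }[] }>();
const startUrls = input?.startUrls ?? [];

const crawler = new CheerioCrawler({
  async requestHandler({ request, log }) {
    log.info(request.url);
  },
});

await crawler.run(startUrls.map((s) => s.url));
await Actor.exit();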
Scrapy Cloud workflow
A typical Scrapy Cloud project:
# myspider.py
import scrapy

class MySpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]
    custom_settings = {
        "DOWNLOAD_DELAY": 1,
        "CONCURRENT_REQUESTS": 16,
    }

    def parse(self, response):
        for product in response.css("article.product"):
            yield {
                "title": product.css("h2::text").get(),
                "price": product.css(".price::text").get(),
                "url": product.css("a::attr(href)").get(),
            }
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
scrapy_cloud.yml:
project: 12345
stack: scrapy:2.11-py310
requirements:
  file: requirements.txt
The platform handles scheduling, logs, and item storage. Items are written to Scrapy Cloud’s Items API, which you can pull via REST or download as CSV/JSON.
Crawlee Cloud workflow
// src/main.ts
import { CheerioCrawler, Dataset } from "crawlee";

const crawler = new CheerioCrawler({
  async requestHandler({ request, $, log, enqueueLinks }) {
    log.info(`Processing ${request.url}`);
    const products = $("article.product")
      .map((_, el) => ({
        title: $(el).find("h2").text().trim(),
        price: $(el).find(".price").text().trim(),
        url: $(el).find("a").attr("href"),
      }))
      .get();
    await Dataset.pushData(products);
    await enqueueLinks({
      selector: "a.next",
      label: "PAGINATION",
    });
  },
  maxRequestsPerCrawl: 1000,
  maxConcurrency: 10,
});

await crawler.run(["https://example.com/products"]);
Crawlee’s Dataset is the cloud-side equivalent of Scrapy items. enqueueLinks is the simpler equivalent of response.follow.
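The label passed to enqueueLinks surfaces as request.label. With a single requestHandler you branch on it yourself; Crawlee’s router is the cleaner alternative. A minimal sketch of the router variant (handler bodies are placeholders):

import { CheerioCrawler, createCheerioRouter } from "crawlee";

const router = createCheerioRouter();

// Default handler: listing pages enqueue the next page with a label.
router.addDefaultHandler(async ({ enqueueLinks }) => {
  await enqueueLinks({ selector: "a.next", label: "PAGINATION" });
});

// Requests enqueued with label "PAGINATION" are routed here.
router.addHandler("PAGINATION", async ({ request, log }) => {
  log.info(`Paginated page: ${request.url}`);
});

const crawler = new CheerioCrawler({ requestHandler: router });
await crawler.run(["https://example.com/products"]);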
Browser support
| feature | Scrapy Cloud | Crawlee Cloud |
|---|---|---|
| Playwright | via scrapy-playwright | native PlaywrightCrawler |
| Puppeteer | not supported | native PuppeteerCrawler |
| Headless Chrome | yes | yes |
| Headless Firefox | yes | yes |
| Mobile emulation | manual config | built-in helpers |
| Browser pool / context reuse | manual | built-in |
| Stealth (patchright integration) | manual via custom middleware | recipe in docs |
Crawlee has more out-of-the-box ergonomics for browser scraping. Scrapy gets you there but with more configuration.
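To illustrate the ergonomics gap: this is roughly the complete setup for a browser crawl in Crawlee, with browser pooling and context reuse handled behind PlaywrightCrawler (the target URL is a placeholder):

import { PlaywrightCrawler } from "crawlee";

// Browser pool, context reuse, and retries are managed by the crawler itself.
const crawler = new PlaywrightCrawler({
  launchContext: { launchOptions: { headless: true } },
  async requestHandler({ page, request, log }) {
    log.info(`${request.url}: ${await page.title()}`);
  },
});

await crawler.run(["https://example.com"]);

The scrapy-playwright equivalent works, but needs handler installation, per-request meta flags, and settings changes before the first page renders.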
Storage
Both platforms expose:
- Datasets: structured data from your scrapers (rows of items)
- Key-value stores: arbitrary blobs (HTML snapshots, screenshots, JSON configs)
- Request queues: URL queues for crawl coordination
Pricing is the same. APIs are very similar. For most purposes the storage layer is interchangeable.
# Scrapy: writing to a dataset
yield {"title": title, "price": price}

// Crawlee: writing to a dataset
await Dataset.pushData({ title, price });

# Scrapy: reading items from a previous run via API
import requests

resp = requests.get(
    f"https://api.apify.com/v2/datasets/{dataset_id}/items",
    headers={"Authorization": f"Bearer {API_KEY}"},
)
// Crawlee: reading items via SDK
import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: API_KEY });
const { items } = await client.dataset(datasetId).listItems();
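For long-term exports (the operational checklist below recommends keeping durable data outside Apify datasets), page through the dataset rather than pulling everything in one request. A sketch with the Node apify-client, assuming APIFY_TOKEN and DATASET_ID environment variables:

import { ApifyClient } from "apify-client";
import { createWriteStream } from "node:fs";

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
const out = createWriteStream("items.jsonl");

// Pull the dataset in fixed-size chunks instead of one giant response.
const pageSize = 1000;
let offset = 0;
for (;;) {
  const { items } = await client.dataset(process.env.DATASET_ID!).listItems({ offset, limit: pageSize });
  if (items.length === 0) break;
  for (const item of items) out.write(JSON.stringify(item) + "\n");
  offset += items.length;
}
out.end();

From there, loading the JSONL into S3 or Postgres is a standard pipeline step.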
Scheduling
Both support cron-style scheduling via the Apify console or API.
Scrapy Cloud:
shub schedule myspider --frequency 'every 1 hour'
Crawlee Cloud / Apify:
// Via Apify SDK
import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: API_KEY });
await client.actor("my-actor").schedules.create({
  cronExpression: "0 * * * *", // hourly
  timezone: "UTC",
});
For more complex orchestration (DAGs, dependencies between scrapers), use external schedulers like Prefect or Dagster. See building scraping pipelines with Prefect 3 and building scraping pipelines with Dagster.
Proxy support
Both platforms include Apify Proxy:
- Datacenter proxies: included in plans, per-GB billed beyond free tier
- Residential proxies: $8-12/GB
- Smart Proxy: rotating residential with automatic anti-bot evasion
Proxy group configuration in a Scrapy spider:

# In the spider class
custom_settings = {
    "DOWNLOAD_DELAY": 1,
    "DOWNLOADER_MIDDLEWARES": {
        "scrapinghub_proxy.ScrapinghubProxyMiddleware": 410,
    },
    "PROXY_GROUPS": ["RESIDENTIAL"],
}
In Crawlee:
import { CheerioCrawler, ProxyConfiguration } from "crawlee";

const proxyConfiguration = new ProxyConfiguration({
  groups: ["RESIDENTIAL"],
});
const crawler = new CheerioCrawler({
  proxyConfiguration,
  // ...
});
Apify Proxy is convenient but more expensive than dedicated providers. For high volume, route through Bright Data, Oxylabs, or self-hosted instead. See best residential proxy providers 2026.
When to choose Scrapy Cloud
Pick Scrapy Cloud when:
- Your team is Python-first and already uses Scrapy
- You have existing Scrapy spiders to migrate
- You need rich middleware (cookies, retries, custom headers per host)
- You want item pipelines for post-processing (validation, dedup, DB write)
- You run long-lived spiders that benefit from Twisted/asyncio-style concurrency
- You do heavy HTML parsing where Scrapy’s async model is faster
When to choose Crawlee Cloud
Pick Crawlee Cloud when:
- Your team is JavaScript/TypeScript-first
- Your scraping is browser-heavy (Crawlee’s Playwright integration is cleaner)
- You want simpler APIs (less to learn than Scrapy)
- You want native TypeScript support
- You operate Apify actors for other use cases already
When to use neither
Pick self-hosted when:
- You have ops capacity and want to avoid platform lock-in
- Volume is high enough that hosted costs dominate compute (5M+ pages/month)
- You need custom infrastructure (specific GPU, specific OS, edge deployment)
- You want full control of proxy egress
For self-hosted patterns, see self-hosted proxy infrastructure and building scraping pipelines with Prefect 3.
Comparison: hosted vs self-hosted
| dimension | hosted (Apify) | self-hosted |
|---|---|---|
| ops burden | none | high |
| time to first run | minutes | days |
| cost at low volume | low | high (fixed VM cost) |
| cost at high volume | medium | low |
| scaling | automatic | manual |
| compliance | Apify’s audits | yours |
| customization | limited | unlimited |
| break-even | cheaper below ~5M pages/month | cheaper above ~5M pages/month |
For most teams under 5M pages/month, hosted is cheaper and faster. Above that, the calculation shifts toward self-hosted.
Migration: Scrapy → Crawlee or vice versa
If you decide to switch frameworks, the migration involves rewriting spiders. Scrapy items map to Crawlee dataset records, Scrapy middleware maps to Crawlee request handlers, but the syntax is entirely different. Plan a 2-4 week project for a moderate-sized spider suite.
Common migration concerns:
- Custom middleware: rewrite as Crawlee request hooks or pre-navigation hooks (see the sketch after this list)
- Item pipelines: move post-processing into the request handler or into a downstream Apify actor
- Custom selectors: cheerio (Crawlee) is similar to BeautifulSoup but not identical
- Stats and monitoring: Apify SDK provides equivalent metrics
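For the middleware item above, a rough sketch of what a Scrapy downloader middleware’s header-tweaking duties look like as a Crawlee pre-navigation hook; the header value and navigation option are placeholders, not a drop-in migration:

import { PlaywrightCrawler } from "crawlee";

const crawler = new PlaywrightCrawler({
  // Rough stand-in for a Scrapy downloader middleware that mutates each request.
  preNavigationHooks: [
    async ({ request }, gotoOptions) => {
      request.headers = { ...request.headers, "Accept-Language": "en-US" };
      if (gotoOptions) gotoOptions.waitUntil = "domcontentloaded";
    },
  ],
  async requestHandler({ page, log }) {
    log.info(await page.title());
  },
});

await crawler.run(["https://example.com"]);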
Operational checklist
For deciding between platforms in 2026:
- Match framework to team language (Python → Scrapy, Node → Crawlee)
- Estimate monthly compute cost based on test runs
- Verify proxy bandwidth budget separately
- Test deployment workflow with a real spider
- Set up scheduling for your most frequent jobs
- Use external storage (S3, Postgres) for long-term data, not Apify datasets
- Monitor compute unit consumption weekly
- Plan migration to self-hosted at the volume break-even
FAQ
Q: can I run both Scrapy and Crawlee on the same Apify account?
Yes. They are different actor types but share the same compute pool, storage, and billing.
Q: which is faster for the same workload?
Crawlee tends to be 15-25% faster for browser-heavy workloads, Scrapy can be faster for pure HTML scraping with heavy middleware. Both are within the same cost ballpark.
Q: does Crawlee support Python?
A Python port of Crawlee exists (“crawlee-python”) but is less mature than the TypeScript original. For Python in 2026, Scrapy is still the more mature choice.
Q: can I use Apify Proxy with my own scraper code?
Yes. Apify Proxy is exposed as standard HTTP/HTTPS endpoints with credentials. Any scraper that supports HTTP proxies can use it.
Q: what about open-source self-hosted Apify (Apify Open Source)?
Apify open-sourced parts of the platform around 2024. You can run a limited subset on your own infrastructure but not the full hosted experience. For most teams, the hosted offering is more practical until volume justifies the build.
Common pitfalls in production hosted scraping
The first failure mode that catches teams migrating from self-hosted is compute unit (CU) billing surprise. Apify’s CU pricing meters not just CPU time but the memory-time product. A spider configured with 4GB RAM that runs for 10 minutes consumes more CU than a spider configured with 1GB RAM that runs for 30 minutes, even though actual peak memory usage may be similar. The fix is to right-size memory allocation per spider via the actor.json config. Default to 1GB and only bump it if you observe OOM kills in the logs. The CU savings from halving memory often outweigh the marginal latency increase from tighter memory pressure.
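Memory can also be set per run from the client side. A sketch with the Node apify-client, assuming an APIFY_TOKEN env var (the actor name and input are placeholders; the same option is exposed in the console’s run form):

import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start at 1 GB; raise only after observing OOM kills in the run log.
const run = await client.actor("my-crawler").call(
  { startUrls: [{ url: "https://example.com" }] },
  { memory: 1024, timeout: 3600 }, // memory in MB; CU burn scales with memory x run time
);
console.log(run.status);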
The second pitfall is hitting the dataset row limit on the free tier. Apify’s free tier caps datasets at 10,000 records per dataset, which scrapers commonly hit during development without realizing. The Dataset.pushData() call returns success but the underlying storage rejects rows beyond the limit, leading to silent data loss: test runs with 5K records succeed, production runs with 50K records lose 80 percent of rows. The fix is to either upgrade off the free tier (the Personal plan at $49/month removes the limit) or to chunk results into multiple datasets via Dataset.open(name) with a rolling name like ${date}-${shard}.
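A sketch of that rolling-name sharding with Crawlee’s named datasets; the shard-advancing logic (tracking row counts against the cap) is left out, and the pushed record is a placeholder:

import { Dataset } from "crawlee";

// One named dataset per day and shard keeps each dataset under the row cap.
const date = new Date().toISOString().slice(0, 10); // e.g. "2026-05-11"
const shard = 0; // advance when the current shard nears the limit
const dataset = await Dataset.open(`${date}-${shard}`);
await dataset.pushData({ title: "widget", price: "9.99" });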
The third pitfall is Apify Proxy IP exhaustion under per-host rate limits. Apify Proxy’s residential pool serves all customers from a shared IP set. If you hit a target site at 50 requests/second, Apify’s pool may rotate through 200 IPs in an hour, and the target’s per-IP rate limiter starts returning 429s on previously fresh IPs because other Apify customers also hit them earlier in the day. The mitigation for high-volume targets is to use a dedicated residential pool (Apify offers it at higher per-GB cost) or to bypass Apify Proxy entirely and route through your own residential provider via the proxyUrl field on individual requests.
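One way to do the bypass with Crawlee’s own primitives is to hand ProxyConfiguration an explicit list of your provider’s endpoints instead of Apify Proxy groups; the hostnames and credentials below are placeholders:

import { CheerioCrawler, ProxyConfiguration } from "crawlee";

// Third-party residential endpoints, rotated across requests by Crawlee.
const proxyConfiguration = new ProxyConfiguration({
  proxyUrls: [
    "http://user:pass@gw1.provider.example:8000",
    "http://user:pass@gw2.provider.example:8000",
  ],
});

const crawler = new CheerioCrawler({
  proxyConfiguration,
  async requestHandler({ request, log }) {
    log.info(request.url);
  },
});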
Real-world example: cost analysis migration from Apify to self-hosted
A scraper team running 8 million product pages per month across 40 ecommerce sites tracked their costs on Apify Scrapy Cloud over six months and decided to migrate to self-hosted in month seven. The detailed cost breakdown that drove the decision:
| month | apify CU | proxy GB | apify total | self-host equivalent |
|---|---|---|---|---|
| jan | 4,200 | 320 | $2,840 | $1,100 |
| feb | 4,800 | 380 | $3,260 | $1,100 |
| mar | 5,100 | 410 | $3,510 | $1,100 |
| apr | 5,400 | 450 | $3,790 | $1,150 |
| may | 6,200 | 520 | $4,420 | $1,200 |
| jun | 7,100 | 600 | $5,080 | $1,250 |
The self-host equivalent included: a 16-vCPU VM ($380/month), 50 residential IPs from a budget provider at $4/GB ($200-2400/month based on usage), Postgres for results ($120/month), Prometheus + Grafana monitoring ($80/month), and one full-time engineer maintaining the stack at 0.2 FTE (allocated cost $400/month).
Migration took 8 weeks and consumed $32,000 in engineering time. After migration, monthly cost stabilized at $1,250 versus the projected Apify cost of $5,800+ by month seven. Payback period: 7.5 months from migration completion. The team retained Apify Scrapy Cloud for two specific spiders that benefited from Apify’s anti-detection-tuned proxy pool (those spiders accounted for 8 percent of total volume), so they did not fully zero their Apify bill.
The lessons: Apify is meaningfully cheaper than self-hosted up to roughly 3-5 million pages/month. Above that, self-hosted wins decisively. Hybrid (most volume self-hosted, anti-bot-heavy targets on Apify) is a viable middle ground that captures both economies.
Comparison: Scrapy Cloud vs Crawlee Cloud feature parity matrix
A 2026-current feature comparison across the two platforms on Apify’s shared infrastructure:
| feature | Scrapy Cloud | Crawlee Cloud |
|---|---|---|
| max concurrent runs | 32 (Personal), unlimited (Team+) | 32 (Personal), unlimited (Team+) |
| max RAM per actor | 32GB | 32GB |
| storage retention | 7 days (free), 30 days (paid) | 7 days (free), 30 days (paid) |
| scheduled runs | yes | yes |
| webhook callbacks | yes | yes |
| dataset format | JSON, CSV, XML, Excel | JSON, CSV, XML, Excel |
| key-value store | yes | yes |
| live console | yes | yes |
| log retention | 14 days | 14 days |
| proxy: datacenter | included | included |
| proxy: residential | $8-12/GB | $8-12/GB |
| browser support | Playwright via scrapy-playwright | native Playwright + Puppeteer |
| TypeScript support | no (Python only) | yes (first-class) |
| AI agent integration | manual | Apify Agents native (2025+) |
| Webhooks-as-trigger | yes | yes |
| Apify SDK access | yes (Python) | yes (Node/Bun) |
| Container deploy | yes (custom Dockerfile) | yes (custom Dockerfile) |
The platforms are at near-feature-parity in 2026. The choice between them is almost entirely about your team’s language preference. The exception: Apify Agents (their AI agent framework launched 2025) is Node-first and integrates more cleanly with Crawlee. Teams building AI-driven scrapers tend to lean toward Crawlee for that integration.
Detection: when hosted is the wrong choice
Five signals that your scraping workload should NOT live on Apify:
- Custom TLS fingerprinting required: Apify’s HTTP egress uses their pool’s fingerprint; you cannot inject curl_cffi or tls-client at the network layer. Move to self-hosted.
- Hardware GPU for inference inside the spider: Apify does not provide GPU instances. Use Modal, RunPod, or self-hosted with GPUs.
- Compliance requires data residency: Apify operates in EU and US regions only. If your data must stay in Singapore, Brazil, or other regions, self-host or pick a regional provider.
- Real-time webhooks under 500ms p99: Apify actor cold starts can exceed 5 seconds. For real-time response APIs, run a long-lived service on Cloud Run or Fargate instead.
- Custom kernel-level networking: Apify runs in a managed container environment; you cannot install custom iptables rules, custom DNS, or kernel modules. Self-host on a VM if you need this.
If any two of these apply, the answer is self-hosted regardless of volume.
Wrapping up
Scrapy Cloud vs Crawlee Cloud is mostly a framework choice in 2026. Same underlying infrastructure, same pricing, similar features. Pick Scrapy Cloud if your team is Python-first; pick Crawlee Cloud if your team is Node-first. Both pay off at low to medium volumes versus self-hosting, then break even and lose to self-hosted around 5M+ pages/month. Pair this with our Scrapy + Playwright integration and self-hosted proxy infrastructure writeups for the full hosted-vs-self-hosted picture, and browse the dev-tools-projects category on DRT for related infrastructure deep-dives.
Related comparison: See how Bright Data stacks up against a dedicated Singapore mobile network in our Singapore Mobile Proxy vs Bright Data comparison.
last updated: May 11, 2026