Scrapy Cloud vs Crawlee Cloud in 2026
Scrapy Cloud vs Crawlee Cloud is the comparison every Python or Node scraping team faces eventually. Both platforms come from Apify (which acquired Scrapinghub’s Scrapy Cloud business in 2021 and integrated Crawlee, its own Node-first scraping framework, into the same hosted offering). They share underlying infrastructure but differ in framework, language ecosystem, pricing model, and the tooling they expose. Picking the right one is mostly about which framework your team standardizes on and which language has better library coverage for your targets.
This guide compares both platforms feature by feature in 2026, prices the realistic cost at common workloads, and walks through deployment for each. The benchmarks come from actually running production scrapers on both during 2025-2026 with similar workloads. By the end you will know which platform fits your project, what the real costs are, and how to deploy without surprises.
What each platform is
Scrapy Cloud is Apify’s hosted platform for Scrapy spiders. You write Scrapy in Python, use shub deploy to push to the platform, and pay for compute units (run-time) and data items processed. It has a long history (the hosted platform dates to 2010), is used by teams at large enterprises, and integrates with the Scrapy ecosystem (item pipelines, middleware, schedulers).
Crawlee Cloud is Apify’s hosted platform for Crawlee, the JavaScript/TypeScript-first scraping framework that Apify itself maintains. You write Crawlee in Node or TypeScript, deploy via the Apify CLI, and pay for actor runs (compute time) and storage. It is newer (Crawlee was released in 2022; the hosted version stabilized in 2024) but actively developed.
Both run on the same Apify cloud underneath, so deployment, scheduling, monitoring, and storage primitives are similar. The main difference is the framework you write in.
For Apify’s official platform docs, see docs.apify.com.
Pricing comparison
As of mid-2026:
| dimension | Scrapy Cloud | Crawlee Cloud |
|---|---|---|
| compute cost | $0.40/CU (1 CU = 1 GB RAM for 1 hr) | $0.40/CU (same) |
| storage cost | $0.20/GB/month | $0.20/GB/month (same) |
| dataset reads | $0.02/1k records | $0.02/1k records (same) |
| starter plan | $39/month (1 CU) | $39/month (1 CU) |
| free tier | $5 credit/month | $5 credit/month |
Pricing is identical because they share infrastructure. The cost differences come from runtime efficiency, which depends on your code:
| workload | Scrapy time | Crawlee time | cost diff |
|---|---|---|---|
| 100k HTML pages, simple parse | 8 hr | 6 hr | Crawlee 25% cheaper |
| 100k pages with Playwright | 12 hr | 10 hr | Crawlee 17% cheaper |
| 10k pages with custom middleware | 0.5 hr | 0.6 hr | Scrapy slightly cheaper |
| 1M pages, no JS | 80 hr | 60 hr | Crawlee 25% cheaper |
Crawlee tends to run faster for browser-heavy workloads because its Playwright integration is more efficient. Scrapy can be faster for HTML-only workloads with heavy custom middleware where its async model shines.
For a 1M-page/month workload, expect ~$80-120/month on either platform.
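To sanity-check an estimate like that before committing, here is a back-of-envelope cost model using the prices from the tables above, assuming 1 CU = 1 GB RAM for 1 hour; the run-hours, RAM, storage, and read volumes are illustrative, not measured:

// Hypothetical cost sketch; prices taken from the pricing table above.
const CU_PRICE = 0.40;      // $ per compute unit (1 GB RAM x 1 hr)
const STORAGE_PRICE = 0.20; // $ per GB-month
const READ_PRICE = 0.02;    // $ per 1k dataset records read

function monthlyCost(runHours: number, ramGb: number, storageGb: number, readsThousands: number): number {
  return runHours * ramGb * CU_PRICE + storageGb * STORAGE_PRICE + readsThousands * READ_PRICE;
}

// 1M HTML-only pages/month: ~80 hr runtime at 2 GB RAM, ~10 GB of items, read once.
console.log(monthlyCost(80, 2, 10, 1000).toFixed(2)); // "86.00" — inside the ~$80-120 band above

Plugging in your own test-run numbers (see the operational checklist later) is more useful than any published benchmark.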
Framework comparison
| dimension | Scrapy | Crawlee |
|---|---|---|
| primary language | Python | TypeScript / JavaScript |
| years in production | since 2008 | since 2022 |
| async model | Twisted (legacy), asyncio (modern) | Node async/await native |
| browser integration | scrapy-playwright | built-in PlaywrightCrawler/PuppeteerCrawler |
| middleware system | very rich, mature | growing |
| deduplication | Request fingerprinter | RequestQueue with built-in dedup |
| pipelines | item pipelines (post-process) | dataset push (simpler) |
| spider lifecycle hooks | yes (signals) | yes (lifecycle handlers) |
| stats collection | Scrapy Stats | built-in metrics |
| community size | very large | growing |
| TypeScript support | n/a | native |
Scrapy is the more mature framework with a richer middleware ecosystem. Crawlee is newer but designed from scratch for browser scraping, which gives it cleaner APIs for that use case.
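To make the deduplication row concrete: Crawlee’s RequestQueue derives a uniqueKey from each normalized URL, so re-adding a URL you have already enqueued is a no-op. A minimal sketch (the URL is a placeholder):

import { RequestQueue } from "crawlee";

const queue = await RequestQueue.open();

// The second add is deduplicated against the first by uniqueKey.
const first = await queue.addRequest({ url: "https://example.com/p/1" });
const second = await queue.addRequest({ url: "https://example.com/p/1" });
console.log(first.wasAlreadyPresent, second.wasAlreadyPresent); // false true

Scrapy’s request fingerprinter achieves the same thing, but you configure it rather than getting it from the queue itself.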
For a Scrapy + Playwright deep dive, see Scrapy + Playwright integration in 2026.
Deploy: Scrapy Cloud
pip install shub
shub login # paste API key from Apify console
# In your Scrapy project
shub deploy
# Select project from list
Wait a minute for the build, then run via the web UI or CLI:
shub schedule scraper-name
shub items <run-id> # download items as JSON
scrapinghub.yml config:
projects:
  default: 12345
stacks:
  default: scrapy:2.11-py310
requirements:
  file: requirements.txt
Scrapy Cloud has a few stacks (Python versions and Scrapy versions). Pick the latest unless you have specific constraints.
Deploy: Crawlee Cloud
npm install -g apify-cli
apify login # paste API key
# In your Crawlee project
apify push
The Apify CLI wraps the project as an “actor” (Apify’s term for a deployable scraping unit) and pushes it to the platform.
actor.json config:
{
  "actorSpecification": 1,
  "name": "my-crawler",
  "version": "0.0",
  "buildTag": "latest",
  "input": {
    "title": "Crawler Input",
    "type": "object",
    "properties": {
      "startUrls": {
        "title": "Start URLs",
        "type": "array",
        "editor": "requestListSources"
      }
    }
  }
}
Run via the web UI or CLI:
apify call my-crawler --input='{"startUrls":[{"url":"https://example.com"}]}'
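Inside the actor, the input declared in that schema is read at runtime. A minimal sketch assuming the standard apify SDK pattern (the type annotation and handler body are illustrative):

import { Actor } from "apify";
import { CheerioCrawler } from "crawlee";

await Actor.init();

// Matches the "startUrls" property declared in actor.json above.
const input = await Actor.getInput<{ startUrls: { url: string }[] }>();
const startUrls = input?.startUrls ?? [];

const crawler = new CheerioCrawler({
  async requestHandler({ request, log }) {
    log.info(request.url);
  },
});

await crawler.run(startUrls.map((s) => s.url));
await Actor.exit();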
Scrapy Cloud workflow
A typical Scrapy Cloud project:
# myspider.py
import scrapy

class MySpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]
    custom_settings = {
        "DOWNLOAD_DELAY": 1,
        "CONCURRENT_REQUESTS": 16,
    }

    def parse(self, response):
        for product in response.css("article.product"):
            yield {
                "title": product.css("h2::text").get(),
                "price": product.css(".price::text").get(),
                "url": product.css("a::attr(href)").get(),
            }
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
scrapy_cloud.yml:
project: 12345
stack: scrapy:2.11-py310
requirements:
  file: requirements.txt
The platform handles scheduling, logs, and item storage. Items are written to Scrapy Cloud’s Items API, which you can pull via REST or download as CSV/JSON.
Crawlee Cloud workflow
// src/main.ts
import { CheerioCrawler, Dataset } from "crawlee";

const crawler = new CheerioCrawler({
  async requestHandler({ request, $, log, enqueueLinks }) {
    log.info(`Processing ${request.url}`);
    const products = $("article.product")
      .map((_, el) => ({
        title: $(el).find("h2").text().trim(),
        price: $(el).find(".price").text().trim(),
        url: $(el).find("a").attr("href"),
      }))
      .get();
    await Dataset.pushData(products);
    await enqueueLinks({
      selector: "a.next",
      label: "PAGINATION",
    });
  },
  maxRequestsPerCrawl: 1000,
  maxConcurrency: 10,
});

await crawler.run(["https://example.com/products"]);
Crawlee’s Dataset is the cloud-side equivalent of Scrapy items. enqueueLinks is the simpler equivalent of response.follow.
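The label passed to enqueueLinks surfaces as request.label. With a single requestHandler you branch on it yourself; Crawlee’s router is the cleaner alternative. A minimal sketch of the router variant (handler bodies are placeholders):

import { CheerioCrawler, createCheerioRouter } from "crawlee";

const router = createCheerioRouter();

// Default handler: listing pages enqueue the next page with a label.
router.addDefaultHandler(async ({ enqueueLinks }) => {
  await enqueueLinks({ selector: "a.next", label: "PAGINATION" });
});

// Requests enqueued with label "PAGINATION" are routed here.
router.addHandler("PAGINATION", async ({ request, log }) => {
  log.info(`Paginated page: ${request.url}`);
});

const crawler = new CheerioCrawler({ requestHandler: router });
await crawler.run(["https://example.com/products"]);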
Browser support
| feature | Scrapy Cloud | Crawlee Cloud |
|---|---|---|
| Playwright | via scrapy-playwright | native PlaywrightCrawler |
| Puppeteer | not supported | native PuppeteerCrawler |
| Headless Chrome | yes | yes |
| Headless Firefox | yes | yes |
| Mobile emulation | manual config | built-in helpers |
| Browser pool / context reuse | manual | built-in |
| Stealth (patchright integration) | manual via custom middleware | recipe in docs |
Crawlee has more out-of-the-box ergonomics for browser scraping. Scrapy gets you there but with more configuration.
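To illustrate the ergonomics gap: this is roughly the complete setup for a browser crawl in Crawlee, with browser pooling and context reuse handled behind PlaywrightCrawler (the target URL is a placeholder):

import { PlaywrightCrawler } from "crawlee";

// Browser pool, context reuse, and retries are managed by the crawler itself.
const crawler = new PlaywrightCrawler({
  launchContext: { launchOptions: { headless: true } },
  async requestHandler({ page, request, log }) {
    log.info(`${request.url}: ${await page.title()}`);
  },
});

await crawler.run(["https://example.com"]);

The scrapy-playwright equivalent works, but needs handler installation, per-request meta flags, and settings changes before the first page renders.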
Storage
Both platforms expose:
- Datasets: structured data from your scrapers (rows of items)
- Key-value stores: arbitrary blobs (HTML snapshots, screenshots, JSON configs)
- Request queues: URL queues for crawl coordination
Pricing is the same. APIs are very similar. For most purposes the storage layer is interchangeable.
# Scrapy: writing to a dataset
yield {"title": title, "price": price}

// Crawlee: writing to a dataset
await Dataset.pushData({ title, price });

# Scrapy: reading items from a previous run via API
import requests

resp = requests.get(
    f"https://api.apify.com/v2/datasets/{dataset_id}/items",
    headers={"Authorization": f"Bearer {API_KEY}"},
)
// Crawlee: reading items via SDK
import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: API_KEY });
const { items } = await client.dataset(datasetId).listItems();
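For long-term exports (the operational checklist below recommends keeping durable data outside Apify datasets), page through the dataset rather than pulling everything in one request. A sketch with the Node apify-client, assuming APIFY_TOKEN and DATASET_ID environment variables:

import { ApifyClient } from "apify-client";
import { createWriteStream } from "node:fs";

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
const out = createWriteStream("items.jsonl");

// Pull the dataset in fixed-size chunks instead of one giant response.
const pageSize = 1000;
let offset = 0;
for (;;) {
  const { items } = await client.dataset(process.env.DATASET_ID!).listItems({ offset, limit: pageSize });
  if (items.length === 0) break;
  for (const item of items) out.write(JSON.stringify(item) + "\n");
  offset += items.length;
}
out.end();

From there, loading the JSONL into S3 or Postgres is a standard pipeline step.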
Scheduling
Both support cron-style scheduling via the Apify console or API.
Scrapy Cloud:
shub schedule myspider --frequency 'every 1 hour'
Crawlee Cloud / Apify:
// Via Apify SDK
import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: API_KEY });
await client.actor("my-actor").schedules.create({
  cronExpression: "0 * * * *", // hourly
  timezone: "UTC",
});
For more complex orchestration (DAGs, dependencies between scrapers), use external schedulers like Prefect or Dagster. See building scraping pipelines with Prefect 3 and building scraping pipelines with Dagster.
Proxy support
Both platforms include Apify Proxy:
- Datacenter proxies: included in plans, per-GB billed beyond free tier
- Residential proxies: $8-12/GB
- Smart Proxy: rotating residential with automatic anti-bot evasion
Proxy group configuration in a Scrapy spider:

# In the spider class
custom_settings = {
    "DOWNLOAD_DELAY": 1,
    "DOWNLOADER_MIDDLEWARES": {
        "scrapinghub_proxy.ScrapinghubProxyMiddleware": 410,
    },
    "PROXY_GROUPS": ["RESIDENTIAL"],
}
In Crawlee:
import { CheerioCrawler, ProxyConfiguration } from "crawlee";

const proxyConfiguration = new ProxyConfiguration({
  groups: ["RESIDENTIAL"],
});
const crawler = new CheerioCrawler({
  proxyConfiguration,
  // ...
});
Apify Proxy is convenient but more expensive than dedicated providers. For high volume, route through Bright Data, Oxylabs, or self-hosted instead. See best residential proxy providers 2026.
When to choose Scrapy Cloud
Pick Scrapy Cloud when:
- Your team is Python-first and already uses Scrapy
- You have existing Scrapy spiders to migrate
- You need rich middleware (cookies, retries, custom headers per host)
- You want item pipelines for post-processing (validation, dedup, DB write)
- You run long-lived spiders that benefit from Twisted/asyncio-style concurrency
- You do heavy HTML parsing where Scrapy’s async model is faster
When to choose Crawlee Cloud
Pick Crawlee Cloud when:
- Your team is JavaScript/TypeScript-first
- Your scraping is browser-heavy (Crawlee’s Playwright integration is cleaner)
- You want simpler APIs (less to learn than Scrapy)
- You want native TypeScript support
- You operate Apify actors for other use cases already
When to use neither
Pick self-hosted when:
- You have ops capacity and want to avoid platform lock-in
- Volume is high enough that hosted costs dominate compute (5M+ pages/month)
- You need custom infrastructure (specific GPU, specific OS, edge deployment)
- You want full control of proxy egress
For self-hosted patterns, see self-hosted proxy infrastructure and building scraping pipelines with Prefect 3.
Comparison: hosted vs self-hosted
| dimension | hosted (Apify) | self-hosted |
|---|---|---|
| ops burden | none | high |
| time to first run | minutes | days |
| cost at low volume | low | high (fixed VM cost) |
| cost at high volume | medium | low |
| scaling | automatic | manual |
| compliance | Apify’s audits | yours |
| customization | limited | unlimited |
| break-even | cheaper below ~5M pages/month | cheaper above ~5M pages/month |
For most teams under 5M pages/month, hosted is cheaper and faster. Above that, the calculation shifts toward self-hosted.
Migration: Scrapy → Crawlee or vice versa
If you decide to switch frameworks, the migration involves rewriting spiders. Scrapy items map to Crawlee dataset records, Scrapy middleware maps to Crawlee request handlers, but the syntax is entirely different. Plan a 2-4 week project for a moderate-sized spider suite.
Common migration concerns:
- Custom middleware: rewrite as Crawlee request hooks or pre-navigation hooks (see the sketch after this list)
- Item pipelines: move post-processing into the request handler or into a downstream Apify actor
- Custom selectors: cheerio (Crawlee) is similar to BeautifulSoup but not identical
- Stats and monitoring: Apify SDK provides equivalent metrics
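For the middleware item above, a rough sketch of what a Scrapy downloader middleware’s header-tweaking duties look like as a Crawlee pre-navigation hook; the header value and navigation option are placeholders, not a drop-in migration:

import { PlaywrightCrawler } from "crawlee";

const crawler = new PlaywrightCrawler({
  // Rough stand-in for a Scrapy downloader middleware that mutates each request.
  preNavigationHooks: [
    async ({ request }, gotoOptions) => {
      request.headers = { ...request.headers, "Accept-Language": "en-US" };
      if (gotoOptions) gotoOptions.waitUntil = "domcontentloaded";
    },
  ],
  async requestHandler({ page, log }) {
    log.info(await page.title());
  },
});

await crawler.run(["https://example.com"]);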
Operational checklist
For deciding between platforms in 2026:
- Match framework to team language (Python → Scrapy, Node → Crawlee)
- Estimate monthly compute cost based on test runs
- Verify proxy bandwidth budget separately
- Test deployment workflow with a real spider
- Set up scheduling for your most frequent jobs
- Use external storage (S3, Postgres) for long-term data, not Apify datasets
- Monitor compute unit consumption weekly
- Plan migration to self-hosted at the volume break-even
FAQ
Q: can I run both Scrapy and Crawlee on the same Apify account?
Yes. They are different actor types but share the same compute pool, storage, and billing.
Q: which is faster for the same workload?
Crawlee tends to be 15-25% faster for browser-heavy workloads, Scrapy can be faster for pure HTML scraping with heavy middleware. Both are within the same cost ballpark.
Q: does Crawlee support Python?
A Python port of Crawlee exists (“crawlee-python”) but is less mature than the TypeScript original. For Python in 2026, Scrapy is still the more mature choice.
Q: can I use Apify Proxy with my own scraper code?
Yes. Apify Proxy is exposed as standard HTTP/HTTPS endpoints with credentials. Any scraper that supports HTTP proxies can use it.
Q: what about open-source self-hosted Apify (Apify Open Source)?
Apify open-sourced parts of the platform around 2024. You can run a limited subset on your own infrastructure but not the full hosted experience. For most teams, the hosted offering is more practical until volume justifies the build.
Common pitfalls in production hosted scraping
The first failure mode that catches teams migrating from self-hosted is compute unit (CU) billing surprise. Apify’s CU pricing meters not just CPU time but the memory-time product. A spider configured with 4GB RAM that runs for 10 minutes consumes more CU than a spider configured with 1GB RAM that runs for 30 minutes, even though actual peak memory usage may be similar. The fix is to right-size memory allocation per spider via the actor.json config. Default to 1GB and only bump it if you observe OOM kills in the logs. The CU savings from halving memory often outweigh the marginal latency increase from tighter memory pressure.
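Memory can also be set per run from the client side. A sketch with the Node apify-client, assuming an APIFY_TOKEN env var (the actor name and input are placeholders; the same option is exposed in the console’s run form):

import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start at 1 GB; raise only after observing OOM kills in the run log.
const run = await client.actor("my-crawler").call(
  { startUrls: [{ url: "https://example.com" }] },
  { memory: 1024, timeout: 3600 }, // memory in MB; CU burn scales with memory x run time
);
console.log(run.status);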
The second pitfall is hitting the dataset row limit on the free tier. Apify’s free tier caps datasets at 10,000 records per dataset, which scrapers commonly hit during development without realizing. The Dataset.pushData() call returns success but the underlying storage rejects rows beyond the limit, leading to silent data loss: test runs with 5K records succeed, production runs with 50K records lose 80 percent of rows. The fix is to either upgrade off the free tier (the Personal plan at $49/month removes the limit) or to chunk results into multiple datasets via Dataset.open(name) with a rolling name like ${date}-${shard}.
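A sketch of that rolling-name sharding with Crawlee’s named datasets; the shard-advancing logic (tracking row counts against the cap) is left out, and the pushed record is a placeholder:

import { Dataset } from "crawlee";

// One named dataset per day and shard keeps each dataset under the row cap.
const date = new Date().toISOString().slice(0, 10); // e.g. "2026-05-11"
const shard = 0; // advance when the current shard nears the limit
const dataset = await Dataset.open(`${date}-${shard}`);
await dataset.pushData({ title: "widget", price: "9.99" });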
The third pitfall is Apify Proxy IP exhaustion under per-host rate limits. Apify Proxy’s residential pool serves all customers from a shared IP set. If you hit a target site at 50 requests/second, Apify’s pool may rotate through 200 IPs in an hour, and the target’s per-IP rate limiter starts returning 429s on previously fresh IPs because other Apify customers also hit them earlier in the day. The mitigation for high-volume targets is to use a dedicated residential pool (Apify offers it at higher per-GB cost) or to bypass Apify Proxy entirely and route through your own residential provider via the proxyUrl field on individual requests.
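One way to do the bypass with Crawlee’s own primitives is to hand ProxyConfiguration an explicit list of your provider’s endpoints instead of Apify Proxy groups; the hostnames and credentials below are placeholders:

import { CheerioCrawler, ProxyConfiguration } from "crawlee";

// Third-party residential endpoints, rotated across requests by Crawlee.
const proxyConfiguration = new ProxyConfiguration({
  proxyUrls: [
    "http://user:pass@gw1.provider.example:8000",
    "http://user:pass@gw2.provider.example:8000",
  ],
});

const crawler = new CheerioCrawler({
  proxyConfiguration,
  async requestHandler({ request, log }) {
    log.info(request.url);
  },
});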
Real-world example: cost analysis migration from Apify to self-hosted
A scraper team running 8 million product pages per month across 40 ecommerce sites tracked their costs on Apify Scrapy Cloud over six months and decided to migrate to self-hosted in month seven. The detailed cost breakdown that drove the decision:
| month | apify CU | proxy GB | apify total | self-host equivalent |
|---|---|---|---|---|
| jan | 4,200 | 320 | $2,840 | $1,100 |
| feb | 4,800 | 380 | $3,260 | $1,100 |
| mar | 5,100 | 410 | $3,510 | $1,100 |
| apr | 5,400 | 450 | $3,790 | $1,150 |
| may | 6,200 | 520 | $4,420 | $1,200 |
| jun | 7,100 | 600 | $5,080 | $1,250 |
The self-host equivalent included: a 16-vCPU VM ($380/month), 50 residential IPs from a budget provider at $4/GB ($200-2400/month based on usage), Postgres for results ($120/month), Prometheus + Grafana monitoring ($80/month), and one full-time engineer maintaining the stack at 0.2 FTE (allocated cost $400/month).
Migration took 8 weeks and consumed $32,000 in engineering time. After migration, monthly cost stabilized at $1,250 versus the projected Apify cost of $5,800+ by month seven. Payback period: 7.5 months from migration completion. The team retained Apify Scrapy Cloud for two specific spiders that benefited from Apify’s anti-detection-tuned proxy pool (those spiders accounted for 8 percent of total volume), so they did not fully zero their Apify bill.
The lessons: Apify is meaningfully cheaper than self-hosted up to roughly 3-5 million pages/month. Above that, self-hosted wins decisively. Hybrid (most volume self-hosted, anti-bot-heavy targets on Apify) is a viable middle ground that captures both economies.
Comparison: Scrapy Cloud vs Crawlee Cloud feature parity matrix
A 2026-current feature comparison across the two platforms on Apify’s shared infrastructure:
| feature | Scrapy Cloud | Crawlee Cloud |
|---|---|---|
| max concurrent runs | 32 (Personal), unlimited (Team+) | 32 (Personal), unlimited (Team+) |
| max RAM per actor | 32GB | 32GB |
| storage retention | 7 days (free), 30 days (paid) | 7 days (free), 30 days (paid) |
| scheduled runs | yes | yes |
| webhook callbacks | yes | yes |
| dataset format | JSON, CSV, XML, Excel | JSON, CSV, XML, Excel |
| key-value store | yes | yes |
| live console | yes | yes |
| log retention | 14 days | 14 days |
| proxy: datacenter | included | included |
| proxy: residential | $8-12/GB | $8-12/GB |
| browser support | Playwright via scrapy-playwright | native Playwright + Puppeteer |
| TypeScript support | no (Python only) | yes (first-class) |
| AI agent integration | manual | Apify Agents native (2025+) |
| Webhooks-as-trigger | yes | yes |
| Apify SDK access | yes (Python) | yes (Node/Bun) |
| Container deploy | yes (custom Dockerfile) | yes (custom Dockerfile) |
The platforms are at near-feature-parity in 2026. The choice between them is almost entirely about your team’s language preference. The exception: Apify Agents (their AI agent framework launched 2025) is Node-first and integrates more cleanly with Crawlee. Teams building AI-driven scrapers tend to lean toward Crawlee for that integration.
Detection: when hosted is the wrong choice
Five signals that your scraping workload should NOT live on Apify:
- Custom TLS fingerprinting required: Apify’s HTTP egress uses their pool’s fingerprint; you cannot inject curl_cffi or tls-client at the network layer. Move to self-hosted.
- Hardware GPU for inference inside the spider: Apify does not provide GPU instances. Use Modal, RunPod, or self-hosted with GPUs.
- Compliance requires data residency: Apify operates in EU and US regions only. If your data must stay in Singapore, Brazil, or other regions, self-host or pick a regional provider.
- Real-time webhooks under 500ms p99: Apify actor cold starts can exceed 5 seconds. For real-time response APIs, run a long-lived service on Cloud Run or Fargate instead.
- Custom kernel-level networking: Apify runs in a managed container environment; you cannot install custom iptables rules, custom DNS, or kernel modules. Self-host on a VM if you need this.
If any two of these apply, the answer is self-hosted regardless of volume.
Wrapping up
Scrapy Cloud vs Crawlee Cloud is mostly a framework choice in 2026. Same underlying infrastructure, same pricing, similar features. Pick Scrapy Cloud if your team is Python-first; pick Crawlee Cloud if your team is Node-first. Both pay off at low to medium volumes versus self-hosting, then break even and lose to self-hosted around 5M+ pages/month. Pair this with our Scrapy + Playwright integration and self-hosted proxy infrastructure writeups for the full hosted-vs-self-hosted picture, and browse the dev-tools-projects category on DRT for related infrastructure deep-dives.
Related comparison: See how Bright Data stacks up against a dedicated Singapore mobile network in our Singapore Mobile Proxy vs Bright Data comparison.
last updated: May 11, 2026