Elixir is an unusual choice for web scraping, but Crawly makes a compelling case: if you’re already running a BEAM-based stack or need to manage hundreds of concurrent crawl jobs without melting a server, the Elixir web scraping ecosystem deserves a serious look in 2026. Crawly is a high-level scraping framework built on top of Elixir and OTP, drawing clear inspiration from Python’s Scrapy while leaning fully into actor-model concurrency, fault tolerance, and distributed-by-default primitives.
Why the BEAM runtime matters for scraping
Most scraping runtimes handle concurrency through threads (Go, Java) or event loops (Node.js, Python asyncio). The BEAM virtual machine takes a different approach: millions of lightweight processes, each with its own heap, supervised by OTP trees that restart crashed workers automatically.
For scrapers, this has concrete benefits. A single Elixir node can hold 10,000+ concurrent HTTP connections without the context-switching overhead that kills thread-based crawlers at scale. More importantly, if a pipeline stage crashes on a malformed response, the supervisor restarts just that worker, not the whole crawl. Compare that to Go, where Colly v2 production patterns require careful goroutine management and explicit error recovery to avoid cascade failures.
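To see the isolation model in miniature, here is a plain-OTP sketch (no Crawly involved; the supervisor name and inputs are illustrative): each "page" is processed in its own supervised task, and the one that deliberately crashes doesn't disturb its siblings or the caller.

```elixir
# Each body is parsed in its own process; a crash is contained there.
{:ok, _sup} = Task.Supervisor.start_link(name: Demo.TaskSupervisor)

results =
  ["<h1>ok</h1>", nil, "<h1>also ok</h1>"]
  |> Enum.map(fn body ->
    # async_nolink: a crash in the task won't crash the caller
    Task.Supervisor.async_nolink(Demo.TaskSupervisor, fn ->
      String.length(body)  # raises on nil, killing only this process
    end)
  end)
  |> Enum.map(fn task ->
    case Task.yield(task, 1_000) || Task.shutdown(task) do
      {:ok, len} -> {:parsed, len}
      {:exit, _reason} -> :crashed
      nil -> :timeout
    end
  end)

IO.inspect(results)  # => [{:parsed, 11}, :crashed, {:parsed, 16}]
```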
Setting up Crawly
Add Crawly to your mix.exs:
```elixir
# mix.exs
defp deps do
  [
    {:crawly, "~> 0.17"},
    {:floki, ">= 0.30.0"}
  ]
end
```

A minimal spider looks like this:
```elixir
defmodule MyScraper.Spider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://example.com"

  @impl Crawly.Spider
  def init() do
    [start_urls: ["https://example.com/listings"]]
  end

  @impl Crawly.Spider
  def parse_item(response) do
    {:ok, document} = Floki.parse_document(response.body)

    items =
      document
      |> Floki.find(".listing-card")
      |> Enum.map(fn card ->
        %{
          title: card |> Floki.find("h2") |> Floki.text(),
          price: card |> Floki.find(".price") |> Floki.text(),
          url: response.request_url
        }
      end)

    # Resolve relative pagination hrefs, then wrap them in
    # Crawly.Request structs so the engine can schedule them.
    requests =
      document
      |> Floki.find("a.next-page")
      |> Floki.attribute("href")
      |> Crawly.Utils.build_absolute_urls(response.request_url)
      |> Crawly.Utils.requests_from_urls()

    %{items: items, requests: requests}
  end
end
```

Floki is the de facto HTML parsing library for Elixir, with a CSS selector API that feels close to Python’s BeautifulSoup. The spider pattern is nearly identical to Scrapy’s: define seed URLs, parse pages, and yield items plus follow-up requests.
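With the spider compiled, starting and stopping a crawl is a single call to the engine (Crawly.Engine.start_spider/1 and stop_spider/1 are Crawly's documented entry points):

```elixir
# In an iex -S mix session:
Crawly.Engine.start_spider(MyScraper.Spider)

# ...later, once enough items have been collected:
Crawly.Engine.stop_spider(MyScraper.Spider)
```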
Crawly’s middleware and pipeline system
Crawly’s architecture mirrors Scrapy’s middleware stack. Request middlewares run before fetching (for proxy rotation, rate limiting, headers), and item pipelines process extracted data before storage.
Useful built-in middlewares include:
- `Crawly.Middlewares.DomainFilter` restricts crawls to the base domain
- `Crawly.Middlewares.UniqueRequest` deduplicates URLs using a bloom filter
- `Crawly.Middlewares.RobotsTxt` honors robots.txt automatically
- `Crawly.Middlewares.RequestOptions` attaches custom headers and proxy settings per request
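Enabling them is a matter of listing the stack in config/config.exs; middlewares run top to bottom, each seeing the request the previous one produced. A plausible stack for the spider above (the user-agent string is illustrative):

```elixir
# config/config.exs -- request middlewares, applied top to bottom
config :crawly,
  middlewares: [
    Crawly.Middlewares.DomainFilter,   # stay on base_url()'s domain
    Crawly.Middlewares.UniqueRequest,  # drop already-seen URLs
    Crawly.Middlewares.RobotsTxt,      # skip disallowed paths
    {Crawly.Middlewares.UserAgent, user_agents: ["MyScraper/1.0"]},
    {Crawly.Middlewares.RequestOptions, [timeout: 30_000]}
  ]
```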
For pipelines, you can write extracted items directly to CSV, JSON, or SQLite, or stream them to an external queue. Configuring a JSON file pipeline takes a few lines in config/config.exs:

```elixir
# config/config.exs -- encoded items are appended to a .json file under /tmp
config :crawly,
  pipelines: [
    Crawly.Pipelines.JSONEncoder,
    {Crawly.Pipelines.WriteToFile, folder: "/tmp", extension: "json"}
  ]
```

This pipeline composability is where Crawly pulls ahead of ad-hoc Elixir HTTP scripts: it gives you the same structured data flow you’d expect from a mature scraping framework.
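Custom pipelines are small modules implementing the Crawly.Pipeline behaviour: run/3 receives each item plus shared state, and returning false drops the item. A sketch of a validation stage (the module name and :price field are made up for illustration):

```elixir
defmodule MyScraper.Pipelines.RequirePrice do
  @moduledoc "Illustrative pipeline: drop items scraped without a price."
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(item, state, _opts \\ []) do
    case item do
      %{price: price} when price not in [nil, ""] ->
        {item, state}   # pass the item downstream unchanged

      _ ->
        {false, state}  # false drops the item from the pipeline
    end
  end
end
```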
Crawly vs other framework options: a direct comparison
| Framework | Language | Concurrency model | JS rendering | Distributed | Best for |
|---|---|---|---|---|---|
| Crawly | Elixir | BEAM actors + OTP | No (external) | Yes (native) | High-volume pipelines, BEAM stacks |
| Scrapy | Python | Twisted async | Splash/Playwright | Via Scrapyd | General scraping, large ecosystem |
| Colly v2 | Go | Goroutines | No | Manual | Fast, low-memory crawlers |
| Crawlee | Node.js | Async/await | Playwright native | No | JS-heavy sites, dev ergonomics |
| Bun scraping | JS/TS | Event loop | Playwright | No | Fast prototyping, Node.js replacement |
If your target sites are JavaScript-heavy, Crawly needs help: you’d run a headless browser sidecar (Playwright or Puppeteer) and feed the rendered HTML back to Crawly’s parser. That’s more setup than a framework with native JS support; the Cypress vs Playwright comparison linked below covers those options. Similarly, if you’re evaluating runtime speed for pure HTTP scraping, Bun’s benchmarks against Node.js are worth reviewing, though Elixir’s concurrency advantage is less about raw throughput per core and more about graceful scaling across thousands of simultaneous connections.
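If you do go the sidecar route, note that Crawly’s fetcher layer is pluggable; the project documents a Splash-based fetcher that routes each URL through a local rendering service before parsing. A minimal sketch, assuming a Splash container listening on port 8050:

```elixir
# config/config.exs -- fetch pages through a local Splash renderer
# (assumes: docker run -p 8050:8050 scrapinghub/splash)
config :crawly,
  fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html"]}
```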
Scaling, proxies, and anti-bot concerns
Crawly throttles requests per domain, and spiders can override the global value through the optional override_settings/0 callback. The global limit is a single line of config:

```elixir
# config/config.exs
config :crawly,
  concurrent_requests_per_domain: 8
```

Request timeouts are attached through the RequestOptions middleware shown above. For proxy rotation, you inject a custom request middleware that randomly selects from a pool. Because each Crawly request is handled by an independent OTP process, proxy assignment is stateless and safe to parallelize without locks.
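A sketch of such a middleware, assuming a hardcoded pool (the module name and proxy hosts are illustrative, and :proxy is the underlying HTTPoison option); request middlewares implement the same Crawly.Pipeline behaviour as item pipelines:

```elixir
defmodule MyScraper.Middlewares.ProxyRotation do
  @moduledoc "Illustrative middleware: assign a random proxy to each request."
  @behaviour Crawly.Pipeline

  @proxies ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]

  @impl Crawly.Pipeline
  def run(request, state, _opts \\ []) do
    # Each request runs in its own process, so a random pick here
    # needs no locks or shared mutable state.
    options = Keyword.put(request.options, :proxy, Enum.random(@proxies))
    {%Crawly.Request{request | options: options}, state}
  end
end
```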
Anti-bot bypass is not Crawly’s strength out of the box. It has no built-in fingerprint rotation, TLS fingerprint normalization, or browser impersonation. For sites running Cloudflare, PerimeterX, or DataDome, you’ll want to either route requests through a residential proxy service with built-in rotation, or feed rendered HTML from a headless Chrome pipeline into Crawly’s item parsing stage.
For large-scale pipelines that need intelligent retry logic, session management, or AI-assisted navigation on dynamic pages, it’s worth reviewing how agent frameworks like Mastra handle scraping orchestration — especially when the target site structure changes frequently enough that hardcoded selectors break every few weeks.
Deploying Crawly on multiple nodes
One area where Crawly genuinely exceeds most alternatives is distributed crawling. Because it’s built on Erlang’s distributed networking primitives, connecting two Crawly nodes takes three steps:

- Start both nodes with the same cookie: `iex --name node1@host1 --cookie secret -S mix`
- Connect them from one node’s shell: `Node.connect(:"node2@host2")`
- Crawly automatically distributes spider tasks across the connected nodes
No external message broker is needed. The Redis-backed deduplication queues, Celery workers, and Kafka topics that Python and Go setups bolt on for coordination simply aren’t required; in Elixir, distribution is built into the runtime.
Bottom line
Crawly is the right choice if you’re building a high-concurrency scraping pipeline on an Elixir/Phoenix stack, need OTP-supervised fault tolerance without external orchestration tooling, and are scraping mostly server-rendered HTML. It’s the wrong choice if you need seamless JS rendering support, a large plugin ecosystem, or a team already fluent in Python or Go. DRT will keep covering framework-level tradeoffs like this as the scraping tooling landscape evolves through 2026.
Related guides on dataresearchtools.com
- Go Web Scraping with Colly v2: Production Patterns for 2026
- Go Web Scraping with chromedp: Headless Chrome in Pure Go (2026)
- Web Scraping with Bun: Faster Than Node.js for Scrapers in 2026?
- Cypress vs Playwright for Web Scraping: When to Pick Each (2026)
- Mastra AI Agent Framework for Web Scraping: Build Intelligent Scrapers