Elixir Web Scraping with Crawly: BEAM Concurrency for Scrapers (2026)

Elixir is an unusual choice for web scraping, but Crawly makes a compelling case: if you’re already running a BEAM-based stack or need to manage hundreds of concurrent crawl jobs without melting a server, the Elixir web scraping ecosystem deserves a serious look in 2026. Crawly is a high-level scraping framework built on top of Elixir and OTP, drawing clear inspiration from Python’s Scrapy while leaning fully into actor-model concurrency, fault tolerance, and distributed-by-default primitives.

Why the BEAM runtime matters for scraping

Most scraping runtimes handle concurrency through OS threads (Java), goroutines (Go), or event loops (Node.js, Python asyncio). The BEAM virtual machine takes a different approach: millions of lightweight processes, each with its own isolated heap, supervised by OTP trees that restart crashed workers automatically.

For scrapers, this has concrete benefits. A single Elixir node can handle 10,000+ concurrent HTTP connections without the context-switching overhead that kills thread-based crawlers at scale. More importantly, if a pipeline stage crashes on a malformed response, the supervisor restarts just that worker process and the rest of the crawl keeps running. Compare that to Go, where Colly v2 production patterns require careful goroutine management and explicit error recovery to avoid cascade failures.
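As a rough illustration of why this matters (a plain-Elixir sketch, no Crawly involved), the BEAM can juggle ten thousand concurrent tasks — stand-ins for in-flight requests — with trivial overhead:

```elixir
# Spawn 10_000 lightweight BEAM processes, one per simulated request.
# Each task gets its own heap and scheduler slot; no thread pool tuning needed.
results =
  1..10_000
  |> Task.async_stream(fn i -> {i, :ok} end, max_concurrency: 10_000)
  |> Enum.map(fn {:ok, {_i, status}} -> status end)

IO.puts("completed: #{length(results)}")
```

On a typical laptop this finishes in well under a second; the same pattern with OS threads would exhaust most systems long before 10,000.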

Setting up Crawly

Add Crawly to your mix.exs:

defp deps do
  [
    {:crawly, "~> 0.17"},
    {:floki, ">= 0.30.0"}
  ]
end

A minimal spider looks like this:

defmodule MyScraper.Spider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://example.com"

  @impl Crawly.Spider
  def init() do
    [start_urls: ["https://example.com/listings"]]
  end

  @impl Crawly.Spider
  def parse_item(response) do
    {:ok, document} = Floki.parse_document(response.body)

    items =
      document
      |> Floki.find(".listing-card")
      |> Enum.map(fn card ->
        %{
          title: card |> Floki.find("h2") |> Floki.text(),
          price: card |> Floki.find(".price") |> Floki.text(),
          url: response.request_url
        }
      end)

    next_requests =
      document
      |> Floki.find("a.next-page")
      |> Floki.attribute("href")
      |> Crawly.Utils.build_absolute_urls(response.request_url)
      |> Crawly.Utils.requests_from_urls()

    %{items: items, requests: next_requests}
  end
end

Floki is the de facto HTML parsing library for Elixir, using a CSS selector API that feels close to Python’s BeautifulSoup. The spider pattern here is nearly identical to Scrapy: define seed URLs, parse pages, yield items and follow-up requests.
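Floki also works fine outside a spider. This standalone script (using Mix.install, available since Elixir 1.12; the HTML fragment is invented for illustration) shows the same selector API on its own:

```elixir
# Fetch Floki as a script dependency and parse an inline HTML fragment.
Mix.install([{:floki, ">= 0.30.0"}])

html = ~s(<div class="listing-card"><h2>Loft</h2><span class="price">$900</span></div>)
{:ok, doc} = Floki.parse_document(html)

# CSS selectors, BeautifulSoup-style.
title = doc |> Floki.find(".listing-card h2") |> Floki.text()
price = doc |> Floki.find(".listing-card .price") |> Floki.text()

IO.puts("#{title} — #{price}")
```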

Crawly’s middleware and pipeline system

Crawly’s architecture mirrors Scrapy’s middleware stack. Request middlewares run before fetching (for proxy rotation, rate limiting, headers), and item pipelines process extracted data before storage.

Useful built-in middlewares include:

  • Crawly.Middlewares.DomainFilter — restricts crawls to the base domain
  • Crawly.Middlewares.UniqueRequest — deduplicates requests so each URL is fetched only once
  • Crawly.Middlewares.RobotsTxt — honors robots.txt automatically
  • Crawly.Middlewares.RequestOptions — attaches custom headers and proxy settings per request
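Custom middlewares follow the same contract as the built-ins: a module implementing the Crawly.Pipeline behaviour's run/3 callback, returning the (possibly modified) request plus state. As a sketch — the module name and header value here are invented for illustration — a middleware that tags every outgoing request might look like:

```elixir
defmodule MyScraper.Middlewares.TagRequests do
  @moduledoc "Illustrative middleware: adds a static header to every request."
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(request, state, _opts \\ []) do
    # Prepend our header to whatever headers the request already carries.
    headers = [{"x-crawler-id", "my-scraper"} | request.headers]
    {%{request | headers: headers}, state}
  end
end
```

Enable it by adding the module to the middlewares list in config/config.exs, alongside the built-ins above.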

For pipelines, you can write extracted items directly to CSV, JSON, SQLite, or stream to an external queue. Configuring a JSON-lines output pipeline takes a few lines in config/config.exs:

config :crawly,
  pipelines: [
    Crawly.Pipelines.JSONEncoder,
    {Crawly.Pipelines.WriteToFile, folder: "/tmp", extension: "jl"}
  ]

This pipeline composability is where Crawly pulls ahead of ad-hoc Elixir HTTP scripts. It gives you the same structured data flow you’d expect from a mature scraping framework.

Crawly vs other framework options: a direct comparison

| Framework | Language | Concurrency model | JS rendering | Distributed | Best for |
| --- | --- | --- | --- | --- | --- |
| Crawly | Elixir | BEAM actors + OTP | No (external) | Yes (native) | High-volume pipelines, BEAM stacks |
| Scrapy | Python | Twisted async | Splash/Playwright | Via Scrapyd | General scraping, large ecosystem |
| Colly v2 | Go | Goroutines | No | Manual | Fast, low-memory crawlers |
| Crawlee | Node.js | Async/await | Playwright native | No | JS-heavy sites, dev ergonomics |
| Bun scraping | JS/TS | Event loop | Playwright | No | Fast prototyping, Node.js replacement |

If your target sites are JavaScript-heavy, Crawly needs help: you’d run a headless browser sidecar (Playwright or Puppeteer) and proxy the rendered HTML back to Crawly’s parser. That’s more setup than using a framework with native JS support, like what’s covered in Cypress vs Playwright for web scraping comparisons. Similarly, if you’re evaluating runtime speed for pure HTTP scraping, Bun’s benchmarks against Node.js are worth reviewing — though Elixir’s concurrency advantage is less about raw throughput per core and more about graceful scaling across thousands of simultaneous connections.

Scaling, proxies, and anti-bot concerns

Crawly’s rate limiter runs per spider and is configured in config/config.exs:

config :crawly,
  concurrent_requests_per_domain: 8,
  fetcher: {Crawly.Fetchers.HTTPoisonFetcher, [recv_timeout: 30_000]}

For proxy rotation, you write a custom middleware (or configure Crawly.Middlewares.RequestOptions) that selects a proxy from a pool for each outgoing request. Because each Crawly request is handled by an independent OTP process, proxy assignment is stateless and safe to parallelize without locks.
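A minimal sketch of such a rotating middleware, assuming the default HTTPoison-based fetcher (the proxy hosts below are placeholders, not real endpoints):

```elixir
defmodule MyScraper.Middlewares.ProxyRotator do
  @moduledoc "Illustrative middleware: picks a random proxy for each request."
  @behaviour Crawly.Pipeline

  # Placeholder pool; in production, load this from config or an external service.
  @proxies [
    {"proxy1.example.com", 8080},
    {"proxy2.example.com", 8080}
  ]

  @impl Crawly.Pipeline
  def run(request, state, _opts \\ []) do
    {host, port} = Enum.random(@proxies)
    # HTTPoison/hackney-style proxy option; adjust the shape for your fetcher.
    options = [{:proxy, {host, port}} | request.options]
    {%{request | options: options}, state}
  end
end
```

Because the middleware is pure (no shared mutable state), every concurrent request process can run it independently.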

Anti-bot bypass is not Crawly’s strength out of the box. It has no built-in fingerprint rotation, TLS fingerprint normalization, or browser impersonation. For sites running Cloudflare, PerimeterX, or DataDome, you’ll want to either route requests through a residential proxy service with built-in rotation, or feed rendered HTML from a headless Chrome pipeline into Crawly’s item parsing stage.

For large-scale pipelines that need intelligent retry logic, session management, or AI-assisted navigation on dynamic pages, it’s worth reviewing how agent frameworks like Mastra handle scraping orchestration — especially when the target site structure changes frequently enough that hardcoded selectors break every few weeks.

Deploying Crawly on multiple nodes

One area where Crawly genuinely exceeds most alternatives is distributed crawling. Because it’s built on Erlang’s distributed networking primitives, connecting two Crawly nodes takes three steps:

  1. Start both nodes with the same cookie: iex --name node1@host1 --cookie secret -S mix
  2. Connect them: Node.connect(:"node2@host2")
  3. Start spiders on remote nodes with standard distributed-Erlang calls, e.g. :erpc.call(:"node2@host2", Crawly.Engine, :start_spider, [MyScraper.Spider])

No external message broker needed. Redis-backed deduplication queues, Celery workers, and Kafka topics are optional extras in Python/Go setups. In Elixir, this is built into the runtime.

Bottom line

Crawly is the right choice if you’re building a high-concurrency scraping pipeline on an Elixir/Phoenix stack, need OTP-supervised fault tolerance without external orchestration tooling, and are scraping mostly server-rendered HTML. It’s the wrong choice if you need seamless JS rendering support, a large plugin ecosystem, or a team already fluent in Python or Go. DRT will keep covering framework-level tradeoffs like this as the scraping tooling landscape evolves through 2026.
