Best Node.js scraping libraries 2026

Node.js scraping in 2026 occupies a different ecosystem than Python's. Node's event-loop architecture is async-first, so concurrency comes essentially for free. Browser automation is a stronger native fit because Puppeteer began as a Node-only library, and the story of a JavaScript runtime controlling JavaScript pages is uniquely tight. The Node scraping market has consolidated around a shorter list of high-quality libraries than Python's, but each library is more polished and the gaps between them are smaller. The four-layer model still applies: HTTP client, browser automation, HTML parser, framework. Picking the right combination per layer determines whether your scraper handles 100 or 10,000 requests per second.

This guide ranks the Node.js scraping libraries actually worth using in 2026, with honest performance comparisons, clear use case mapping, and the gotchas that surprise developers coming from Python.

HTTP clients

undici

Node's modern HTTP client, developed by the Node.js core team. It is 2-3x faster than every alternative for high-concurrency workloads. The standard fetch global in modern Node uses undici under the hood.

import { fetch } from 'undici';

async function scrape(url) {
  const resp = await fetch(url, {
    headers: { 'User-Agent': 'Mozilla/5.0 ...' },
  });
  return resp.text();
}

Best for: any new project, high-concurrency workloads, the default unless you have specific needs.

got

The popular HTTP client before undici took over. Excellent retry, redirect, and cookie handling. Slightly slower than undici but more feature-complete out of the box.

import got from 'got';

const html = await got('https://example.com', {
  retry: { limit: 3 },
  timeout: { request: 10000 },
}).text();

Best for: existing got codebases, projects that benefit from got’s batteries-included extras.

axios

The classic. A promise-based API that everyone knows. Slower than both undici and got, but still ubiquitous thanks to years of legacy use.

Best for: existing axios codebases, teams that already know its API.

node-fetch

The original Node fetch polyfill, now mostly obsolete since native fetch landed in Node 18+.

Best for: legacy projects, nothing else.

Browser automation

Playwright

Same library as Python; the JavaScript version is the reference implementation. The API feels cleaner in JavaScript thanks to first-class TypeScript definitions and autocompletion. The best browser automation framework in any language.

import { chromium } from 'playwright';

const browser = await chromium.launch();
const context = await browser.newContext({
  userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
});
const page = await context.newPage();
await page.goto('https://target.example.com');
const title = await page.locator('h1.product-title').textContent();
await browser.close();

Best for: most modern browser automation in Node, multi-browser needs.

Puppeteer

The Google-maintained Chrome automation library. Slightly cleaner Chrome-specific features than Playwright, similar overall capability. The puppeteer-extra plugin ecosystem (especially stealth plugin) is more mature than Playwright’s equivalents.

Best for: Chrome-only workflows, projects using puppeteer-extra plugins.

Crawlee

Apify’s scraping framework with built-in browser support. Wraps Playwright/Puppeteer with crawler-style ergonomics. We cover it under frameworks below.

HTML parsers

Cheerio

The jQuery-syntax server-side parser. The dominant Node HTML parser. Fast (built on parse5 or htmlparser2), familiar API for anyone who used jQuery.

import * as cheerio from 'cheerio';

const $ = cheerio.load(html);
const titles = $('h2.product-title').map((i, el) => $(el).text()).get();

Best for: most HTML parsing in Node, jQuery-familiar developers.

parse5

The lower-level HTML parser that Cheerio uses under the hood. Direct use is rare but available for custom AST work.

Best for: custom HTML manipulation, building higher-level tools.

htmlparser2

Streaming HTML parser, very fast on large documents. Used by Cheerio when configured for it. Use it directly when you need stream-based parsing.

Best for: parsing very large HTML documents in stream mode.

linkedom

Modern alternative offering full DOM API (not just jQuery-style). If your code expects document.querySelector semantics, linkedom feels native.

import { parseHTML } from 'linkedom';

const { document } = parseHTML(html);
const titles = Array.from(document.querySelectorAll('h2.product-title')).map(el => el.textContent);

Best for: developers who prefer DOM API over jQuery API, code shared between client and server.

Frameworks

Crawlee

Apify’s modern scraping framework. The Node version is the original; the Python port came later. Excellent abstractions for HTTP and browser scraping with the same Crawler interface, built-in queue management, dedupe, retry logic, and proxy rotation.

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
  async requestHandler({ request, $, pushData }) {
    console.log(`Scraping ${request.url}`);
    const titles = $('h2.product-title').map((i, el) => $(el).text()).get();
    await pushData({ url: request.url, titles });
  },
  maxRequestsPerCrawl: 1000,
  maxConcurrency: 10,
});

await crawler.run(['https://shop.example.com/page/1']);

Best for: most modern Node scraping projects that need framework benefits.

x-ray

Older declarative scraping framework. Still works but rarely chosen for new projects.

Best for: legacy x-ray codebases.

Apify SDK

Crawlee’s parent SDK with additional Actor and platform features. Right choice if deploying to Apify cloud.

Best for: Apify platform deployments.

Comparison table

| library | layer | speed | learning curve | best for |
|---|---|---|---|---|
| undici | HTTP | fastest | easy | most new projects |
| got | HTTP | fast | easy | retry-heavy needs |
| axios | HTTP | mid | easy | legacy codebases |
| node-fetch | HTTP | mid | easy | nothing in 2026 |
| Playwright | browser | mid | medium | most browser automation |
| Puppeteer | browser | mid | medium | Chrome-only, stealth plugins |
| Cheerio | parser | fast | easy | most parsing |
| parse5 | parser | fast | hard | custom AST work |
| htmlparser2 | parser | fastest | medium | very large docs, streaming |
| linkedom | parser | fast | easy | DOM-API preference |
| Crawlee | framework | fast | medium | modern crawler projects |
| x-ray | framework | mid | easy | legacy |

Decision matrix: solopreneur, SMB, enterprise

| profile | scale | recommended stack | reasoning |
|---|---|---|---|
| Solopreneur learning | <10k pages/day | native fetch + Cheerio | Zero dependencies, modern defaults |
| Indie scraper | <500k pages/day | undici + Cheerio + p-limit | Best HTTP perf, simple flow control |
| Indie JS-heavy | <100k pages/day | Playwright + Cheerio post-parse | Browser only when needed |
| SMB crawler | 500k-10M pages/day | Crawlee CheerioCrawler | Framework manages queue, dedupe, retry |
| SMB anti-detect | 100k-1M pages/day | Puppeteer + puppeteer-extra-plugin-stealth | Stealth ecosystem maturity |
| Enterprise | 10M+ pages/day | Crawlee on K8s + custom middleware | Volume justifies platform investment |
| Hybrid HTTP/JS | varies | Crawlee (CheerioCrawler + PlaywrightCrawler) | Same dataset across two modes |

The Node ecosystem rewards convergence: most teams end up on undici + Cheerio for HTTP and Playwright + stealth plugins for browser. Crawlee adds value above 500k pages/day; below that, hand-rolled async with p-limit is simpler and fast enough.
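Below the Crawlee threshold, the hand-rolled flow control is only a few lines. Here is a minimal p-limit-style concurrency limiter sketched with nothing but the standard library (the function name is illustrative; in production just use the real p-limit package):

```javascript
// Minimal p-limit-style concurrency limiter (illustrative sketch).
// Runs at most `max` tasks at once; excess tasks wait in a FIFO queue.
function createLimiter(max) {
  let active = 0;
  const queue = [];
  const runNext = () => {
    if (active >= max || queue.length === 0) return;
    active++;
    const { task, resolve, reject } = queue.shift();
    task().then(resolve, reject).finally(() => {
      active--;
      runNext(); // a slot freed up: start the next queued task
    });
  };
  // Usage: limit(() => fetch(url)) — returns a promise for the task's result.
  return task => new Promise((resolve, reject) => {
    queue.push({ task, resolve, reject });
    runNext();
  });
}
```

A typical call site looks like `const limit = createLimiter(20); await Promise.all(urls.map(u => limit(() => fetch(u))));`.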

Migration path: axios + cheerio to undici + Cheerio

Most legacy Node scrapers run on axios because it was the dominant HTTP client of the 2018-2022 era. Modernizing to undici is straightforward and yields a 2-3x throughput improvement:

  1. Replace axios.get(url, opts) with await fetch(url, opts) from undici. The API differs slightly (response body via .text() / .json() instead of .data).
  2. Replace axios interceptors with explicit retry wrappers. undici's fetch does not expose an axios-style interceptor system; use a small wrapper function for retry, logging, and metrics.
  3. Update timeout handling to use AbortSignal.timeout(ms) instead of axios’s timeout option.
  4. Benchmark the same workload before and after. Expect 2-3x improvement on concurrent request throughput.
  5. Keep axios for any code that uses interceptors heavily (auth refresh patterns, request signing) where the cost of unwinding interceptor logic outweighs the perf gain.
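Steps 2 and 3 can be combined into one small wrapper. A sketch (names are illustrative), with the fetch implementation injected so it works with undici's fetch, the native global, or a test stub:

```javascript
// Retry wrapper standing in for axios interceptors (illustrative sketch).
// fetchImpl is injectable: undici's fetch, the global fetch, or a stub.
function withRetry(fetchImpl, { retries = 3, baseDelayMs = 1000 } = {}) {
  return async (url, opts = {}) => {
    let lastErr;
    for (let attempt = 0; attempt < retries; attempt++) {
      try {
        return await fetchImpl(url, {
          signal: AbortSignal.timeout(10_000), // replaces axios's `timeout` option
          ...opts,
        });
      } catch (err) {
        lastErr = err;
        if (attempt < retries - 1) {
          // exponential backoff: baseDelayMs, 2x, 4x ...
          await new Promise(r => setTimeout(r, 2 ** attempt * baseDelayMs));
        }
      }
    }
    throw lastErr;
  };
}
```

Call sites then read `const rfetch = withRetry(fetch); const html = await (await rfetch(url)).text();`.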

A typical Node scraper migration completes in a day. The performance gain often unblocks scaling work that was on the roadmap for distributed infrastructure.

Performance benchmarks

Same workload as the Python benchmarks: 10,000 simple HTML pages from a local mirror, single Node process.

| stack | total time | requests/sec |
|---|---|---|
| native fetch (50 concurrency) | 8s | 1250 |
| undici (50 concurrency) | 7s | 1428 |
| got (50 concurrency) | 11s | 909 |
| axios (50 concurrency) | 14s | 714 |
| Playwright (50 concurrent contexts) | 88s | 113 |
| Crawlee CheerioCrawler | 9s | 1111 |

Node beats Python on raw HTTP throughput thanks to its event loop architecture. The browser automation gap is similar in both languages because the bottleneck is browser execution, not the host runtime.

Cost worked example

For a 100k-pages-per-day Node scraping workload on mixed protected and unprotected targets:

  • 1 small VPS ($20/mo, 4 vCPU, 8 GB)
  • undici + Cheerio + p-limit stack (free, Node only)
  • node-libcurl when TLS impersonation is needed (free, requires native build)
  • Smartproxy/Decodo residential proxies (~$50/mo for 5 GB)
  • PostgreSQL on a hosted instance ($25/mo)
  • Optional: ZenRows fallback for hard surfaces (~$69/mo)

Total: about $95-165/month depending on the API fallback. Node throughput is higher than Python on raw HTTP, which lets you pack more work into the same VPS; expect to need ~30% less compute capacity than the equivalent Python deployment.

The other Node-specific cost dimension is RAM. Node processes typically run 200-300 MB at scraping idle and grow with concurrent contexts. For a single-process scraper, 8 GB RAM is plenty; for distributed multi-worker setups, prefer many small workers over few large ones to limit blast radius from leaks.

Stack recommendations

Small project, scripts: native fetch + Cheerio. Built into modern Node, no dependencies, fast.

Medium project, no JS needs: undici + Cheerio. Fastest HTTP client plus the standard parser. Add retries via a small wrapper function (the Node equivalent of Python's tenacity).

JavaScript-heavy targets: Playwright + Cheerio (parse the extracted HTML with Cheerio for speed instead of using Playwright’s slower DOM querying).

Large crawler with link-following: Crawlee. Built-in queue management saves you from writing your own.

Anti-bot heavy targets: Puppeteer with puppeteer-extra-plugin-stealth. The stealth plugin ecosystem is more mature for Puppeteer than for Playwright in Node.

Hybrid HTTP + browser: Crawlee with multiple crawler classes (CheerioCrawler for HTTP-only pages, PlaywrightCrawler for JS-heavy pages, both writing to the same dataset).

Crawlee deep dive

Crawlee deserves a closer look because it has matured into the de-facto Node scraping framework. Its three crawler classes cover the spectrum:

  • CheerioCrawler: HTTP-only, uses got-scraping and Cheerio under the hood. Fast, low-resource. The right default for HTTP scraping.
  • PlaywrightCrawler: full browser automation with Playwright. Highest resource cost but handles any JavaScript.
  • PuppeteerCrawler: same as Playwright but using Puppeteer. Choose this if your team prefers Puppeteer’s API or uses puppeteer-extra plugins.

All three share the same RequestQueue, Dataset, and KeyValueStore abstractions, which means you can switch between HTTP and browser modes per request without changing your data layer. A typical pattern is to start with CheerioCrawler, fall back to PlaywrightCrawler when the HTML is missing the data you need, and store both kinds of results in the same Dataset.

Crawlee’s RequestQueue supports SQLite, MongoDB, and the Apify cloud as backends. SQLite works for single-process crawlers; MongoDB works for distributed crawlers across machines. The cloud backend gives you a managed queue with no operational overhead.

Modern async patterns

Node’s async syntax is cleaner than Python’s for typical scraping patterns:

import { fetch } from 'undici';
import * as cheerio from 'cheerio';
import pLimit from 'p-limit';

async function fetchPage(url, retries = 3) {
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      const resp = await fetch(url, {
        signal: AbortSignal.timeout(15000),
      });
      if (resp.status === 200) {
        return await resp.text();
      }
      if (resp.status === 429 || resp.status === 503) {
        await new Promise(r => setTimeout(r, Math.pow(2, attempt) * 1000));
        continue;
      }
      return null;
    } catch (err) {
      if (attempt === retries - 1) throw err;
      await new Promise(r => setTimeout(r, Math.pow(2, attempt) * 1000));
    }
  }
  return null; // all retries exhausted (e.g. repeated 429/503 responses)
}

function parseProducts(html) {
  if (!html) return [];
  const $ = cheerio.load(html);
  return $('div.product-card').map((i, el) => ({
    name: $(el).find('h2.title').text(),
    price: $(el).find('span.price').text(),
  })).get();
}

async function scrapeAll(urls, concurrency = 20) {
  const limit = pLimit(concurrency);
  const results = await Promise.all(
    urls.map(url => limit(async () => {
      const html = await fetchPage(url);
      return parseProducts(html);
    }))
  );
  return results.flat();
}

This pattern handles 1000+ pages per minute on a modest VPS with retries and concurrency control built in.

Persistence and storage in Node

Node scrapers benefit from a few storage patterns specific to JavaScript ecosystems:

  • better-sqlite3 for synchronous local storage. Faster than the async sqlite3 library for write-heavy workloads because it avoids async overhead.
  • Knex or Prisma for typed Postgres access. Both work well; Prisma’s TypeScript types are stronger but Knex is lighter.
  • Crawlee KeyValueStore + Dataset. When using Crawlee, prefer its built-in storage abstractions; they handle large blobs and structured records cleanly.
  • DuckDB for in-process analytics. When you want to query scraped data without a database server, DuckDB's Node bindings let you run SQL over Parquet files or local arrays.

For very large output volumes, stream writes to S3 / R2 with @aws-sdk/client-s3 MultipartUpload rather than collecting everything in memory and uploading at the end.

Common mistakes to avoid

Using axios in 2026: it works but is slower than undici and got. New projects should default to undici.

Forgetting AbortSignal.timeout: Node’s native fetch does not have a default timeout. Without one, your scraper hangs on slow targets indefinitely.
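A small helper makes the timeout impossible to forget (a sketch; the helper name and 15-second default are illustrative):

```javascript
// Route every request through one helper so no call site can skip the timeout.
async function fetchWithTimeout(url, { timeoutMs = 15_000, ...opts } = {}) {
  const resp = await fetch(url, {
    signal: AbortSignal.timeout(timeoutMs), // aborts the request after timeoutMs
    ...opts,
  });
  if (!resp.ok) throw new Error(`HTTP ${resp.status} for ${url}`);
  return resp.text();
}
```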

Loading huge HTML strings into Cheerio at once: for documents over 10 MB, use htmlparser2 in streaming mode.

Running too many browser contexts in one Node process: Node memory grows fast with many Playwright contexts. Stay under 50 concurrent contexts per process.

Ignoring back-pressure in Crawlee: Crawlee’s queues can grow unboundedly if you push faster than you consume. Set maxRequestsPerCrawl and maxConcurrency appropriately.

We cover the Python equivalent in our best Python scraping libraries 2026 review.

Common gotchas

  • undici keep-alive defaults. undici defaults to keep-alive connections. For one-off scripts, this can leave the process hanging waiting for sockets. Use Agent({ keepAliveTimeout: 1 }) or call agent.close() at script end.
  • Cheerio re-parse cost. Each call to cheerio.load() re-parses the HTML. For many extractions on the same document, parse once and pass $ around.
  • Playwright newPage vs newContext. newPage() reuses the parent context’s cookies; newContext() creates a fresh storage state. Use newContext() per scrape to isolate cookies; many subtle bugs come from cookie cross-contamination.
  • Crawlee request handler errors. A throw inside requestHandler retries the request by default. If the error is permanent (404, parse failure), set request.noRetry = true to skip the retry queue.
  • JSON parsing with native fetch. await resp.json() throws on empty body; wrap in try/catch or check resp.ok first.
  • EventEmitter memory leaks. Browser launches that emit console, request, or response events accumulate listeners if you do not clean them up. Use page.removeAllListeners() before close or use named handler functions you can remove explicitly.
  • TLS hardening on undici. Some targets refuse TLS 1.2 connections. undici defaults to negotiating up; if you see handshake errors, force connect: { tls: { minVersion: 'TLSv1.3' } }.
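The JSON gotcha above is worth a defensive helper. A sketch (the function name is illustrative) that returns null instead of throwing on bad bodies:

```javascript
// Defensive JSON extraction for native fetch responses.
// Returns null instead of throwing on non-2xx, empty, or malformed bodies.
async function safeJson(resp) {
  if (!resp.ok) return null;
  const text = await resp.text();
  if (!text) return null; // resp.json() would throw here
  try {
    return JSON.parse(text);
  } catch {
    return null; // malformed body (HTML error page, truncated response, ...)
  }
}
```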

TypeScript vs JavaScript

For new projects, TypeScript is the right choice. The HTTP and parsing libraries all ship with strong type definitions, and the Playwright API in TypeScript is a markedly better developer experience than in plain JavaScript.

import { chromium, Browser, Page } from 'playwright';

async function scrape(browser: Browser, url: string): Promise<string | null> {
  const page: Page = await browser.newPage();
  try {
    await page.goto(url);
    return await page.locator('h1').textContent();
  } finally {
    await page.close();
  }
}

The autocompletion and refactoring support pay off within the first week of any non-trivial project.

For an authoritative reference, see the Node.js documentation on the global fetch API, which covers the standard HTTP client.

Bun and Deno alternatives

Bun and Deno both ship with built-in fetch and run all the libraries above. Bun is notably faster than Node for HTTP-heavy workloads (about 1.5-2x in our testing). Deno’s permission model is interesting for scraping isolation but the ecosystem is smaller.

For most teams, Node remains the right default in 2026 because library compatibility is broadest. We cover the alternatives in our forthcoming guides on Bun and Deno scraping.

FAQ

Q: Cheerio or jQuery selectors?
Cheerio implements jQuery-style selectors server-side. The API is essentially identical. Use Cheerio in Node; do not import actual jQuery server-side.

Q: should I use Crawlee or write my own crawler?
For projects under 1000 pages, write your own with undici + Cheerio + p-limit. For larger crawls with link-following, dedupe, and retry needs, Crawlee saves significant code.

Q: Puppeteer or Playwright?
Playwright is technically better for new projects: cleaner API, multi-browser, better auto-waiting. Puppeteer has the puppeteer-extra-plugin-stealth ecosystem advantage which still matters for some anti-detect work.

Q: how do I handle TLS fingerprinting in Node?
Node does not have a great equivalent to Python’s curl_cffi yet. The closest options are node-libcurl (libcurl bindings for Node) or routing through a proxy that handles TLS fingerprinting on your behalf.

Q: is JSDOM useful for scraping?
JSDOM is heavier than Cheerio because it implements a much larger slice of the DOM API, including script execution. For scraping where you do not need JavaScript execution, Cheerio is faster. JSDOM is the right choice when you want to execute scripts against a parsed document without a full browser.

Q: how do I integrate proxies?
With undici, use a ProxyAgent. With Playwright, pass proxy to chromium.launch(). Crawlee has built-in proxy rotation across a pool. Avoid manual proxy management; use the built-in tools wherever possible.

Q: is Bun production-ready for scraping?
Yes for most use cases. Bun’s built-in fetch and HTML parser are excellent. The remaining gaps are around obscure npm packages with native dependencies that have not been compiled for Bun. Test your dependency tree before committing to Bun in production.

Q: what is the cleanest way to handle pagination?
Wrap your fetch in an async generator that yields pages until a stop condition. Async generators in Node compose nicely with for await loops and avoid materializing all pages in memory.
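A sketch of that pattern, with the page fetcher injected so the stop condition stays testable (names and the query-string scheme are illustrative):

```javascript
// Async generator pagination: yields one page of items at a time and stops
// at the first empty page. fetchPage(url) should return an array of items.
async function* paginate(baseUrl, fetchPage, maxPages = 1000) {
  for (let page = 1; page <= maxPages; page++) {
    const items = await fetchPage(`${baseUrl}?page=${page}`);
    if (!items || items.length === 0) return; // stop condition: empty page
    yield items;
  }
}
```

Consumption composes naturally: `for await (const items of paginate('https://shop.example.com/products', fetchPage)) { /* process one page */ }` never materializes the full result set in memory.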

Closing

The Node.js scraping stack in 2026 is mature and stable. undici for HTTP, Cheerio for parsing, Playwright for browser automation, Crawlee for crawler frameworks. The ecosystem moves slower than Python’s but each piece is more polished. Match the stack to the workload and Node will outperform Python on raw HTTP throughput while matching it on browser automation. For broader scraping infrastructure see our dev-tools-projects category hub.
