Picking the right HTML parser for your Node.js scraper is one of those decisions that looks trivial until you’re burning 4 GB of RAM on a 10,000-page crawl. Cheerio, JSDom, and Linkedom all let you query HTML with CSS selectors in Node.js, but they make very different tradeoffs around speed, correctness, and memory footprint — and choosing wrong will cost you.
What each library actually does
Cheerio is a thin jQuery-style API layered over an HTML parser (parse5 by default in Cheerio 1.x, or htmlparser2 as a faster, more lenient opt-in). it does not execute JavaScript, it does not build a real DOM, and it does not care about CSS rendering. it just parses markup and gives you a traversal API. that’s it.
JSDom simulates a full browser environment in Node. it runs JavaScript, fires events, maintains a living DOM, and implements a large chunk of the Web APIs. it’s what tools like Jest use to fake a browser in tests. for scraping, this power usually works against you.
Linkedom sits between the two. it implements enough of the DOM standard to run querySelector and basic DOM APIs, but it skips JavaScript execution and keeps memory usage low. think of it as “the DOM spec, minus the browser.”
Speed and memory: the numbers that matter
Here’s a rough benchmark comparison across a 5 MB HTML document (a realistic news index page with ~2,000 nodes):
| Library | Parse time | Peak memory | JS execution |
|---|---|---|---|
| Cheerio (htmlparser2) | ~8ms | ~18 MB | No |
| Cheerio (parse5 mode) | ~22ms | ~28 MB | No |
| Linkedom | ~14ms | ~22 MB | No |
| JSDom | ~120ms | ~180 MB | Yes |
JSDom is roughly 15x slower and 10x heavier than Cheerio on htmlparser2 for parse-only workloads. if your scraper doesn’t need JavaScript rendering, JSDom is almost never the right answer. for JavaScript-heavy SPAs you should be reaching for Playwright or Puppeteer anyway; the Python-side comparison in Pyppeteer vs Playwright Python: Which to Use in 2026 gives a good sense of what browser automation actually costs at scale.
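numbers like these shift with the document and the machine, so it is worth re-measuring on your own pages. a minimal sketch of a timing harness using only Node built-ins; `measureParse` is a hypothetical helper name, and the `parse` callback is whatever wraps your library of choice (for example `(html) => cheerio.load(html)`). the heap delta it reports is a rough signal, not the peak-memory figure from the table:

```javascript
// Rough parse benchmark: average wall time plus heap growth for one parser.
// `measureParse` is an illustrative helper, not part of any library.
function measureParse(label, parse, html, runs = 20) {
  parse(html); // warm-up run so JIT compilation does not dominate the timing

  const heapBefore = process.memoryUsage().heapUsed;
  const start = performance.now();
  let last;
  for (let i = 0; i < runs; i++) {
    last = parse(html); // keep a reference so the work is not discarded
  }
  const elapsedMs = performance.now() - start;
  const heapAfter = process.memoryUsage().heapUsed;

  return {
    label,
    avgMs: elapsedMs / runs,
    // heapUsed deltas are noisy because GC can run mid-loop
    heapDeltaMb: (heapAfter - heapBefore) / (1024 * 1024),
    lastResult: last,
  };
}

// Stand-in parser so the sketch runs without any dependency installed.
const fakeParse = (html) => html.split('<').length - 1;
const sample = '<ul>' + '<li>item</li>'.repeat(1000) + '</ul>';
console.log(measureParse('tag-count stand-in', fakeParse, sample));
```

swap `fakeParse` for each real parser and run every library against the same HTML string, ideally in separate processes so one library's garbage does not pollute the next measurement.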
When to use Cheerio
Cheerio is the default choice for static HTML scraping in Node.js. the API is familiar, the ecosystem is mature, and htmlparser2 is lenient enough to handle the broken HTML you’ll find on real sites.
```javascript
import * as cheerio from 'cheerio';
import { fetch } from 'undici';

// Fetch the page and load the markup into Cheerio.
const html = await (await fetch('https://example.com/products')).text();
const $ = cheerio.load(html);

// Collect the trimmed text of every price element.
const prices = [];
$('.product-card .price').each((_, el) => {
  prices.push($(el).text().trim());
});
console.log(prices);
```

Cheerio works well when:
- you’re scraping static or server-rendered pages
- you need to process thousands of documents per minute
- your team already knows jQuery selectors
- you’re running in a memory-constrained environment (cheap VPS, Lambda)
one caveat: htmlparser2 is forgiving to a fault. if you’ve opted into it for speed and are getting wrong results on deeply malformed HTML, go back to parse5, which is what a plain `cheerio.load(html)` uses in Cheerio 1.x (htmlparser2 is the opt-in, via the `xml` option or the `cheerio/slim` build). parse5 is spec-compliant and handles edge cases htmlparser2 silently mishandles. if you’re exploring the broader parser landscape, Selectolax: The Fastest HTML Parser You’re Not Using in 2026 covers the Python equivalent with similar speed-vs-correctness tradeoffs.
When Linkedom makes sense
Linkedom’s value proposition is correctness without the JSDom weight. if your scraping logic relies on DOM APIs beyond what Cheerio exposes — element.closest(), MutationObserver stubs, document.createElement for re-serialization — Linkedom handles these without loading a full browser runtime.
it’s also a clean fit for Worker Threads workloads. because Linkedom avoids the global state JSDom introduces, you can safely instantiate it inside Node.js worker threads and parse documents in parallel without hitting concurrency bugs.
numbered setup for a Linkedom worker pipeline:
1. spawn N worker threads (one per CPU core)
2. pass raw HTML strings via `workerData`
3. parse with `parseHTML(html)` inside each worker
4. return structured objects (not DOM nodes) back to main thread
5. aggregate results in main thread
this pattern keeps memory per-worker predictable and avoids the GC pressure that JSDom creates when many documents are live simultaneously. if you’re running this on Bun instead of Node, the performance gap widens further — Web Scraping with Bun: Faster Than Node.js for Scrapers in 2026? benchmarks the runtime difference directly.
When JSDom is actually justified
JSDom earns its place in two specific scenarios:
- test environments where you need `window`, `document`, and event simulation for code that runs in both browser and Node (Jest's jsdom test environment is exactly this)
- scraping targets that do light client-side rendering via inline `<script>` tags, where the rendered content isn't worth spinning up a full headless browser but needs actual JS execution to materialize
even in the second case, be honest about the tradeoff. JSDom's JavaScript engine is not a real browser. it will fail on anything using modern browser APIs, Web Workers, or complex module graphs. sites that genuinely need JS execution usually need Playwright. JSDom in a scraping context is mostly for the narrow band of sites that use simple document.write-style rendering.
for stateful workflows where JSDom's session handling might seem appealing -- managing cookies across pages, submitting forms -- the Python library MechanicalSoup Library Review 2026: When Cookies + Forms Matter covers the right mental model for that problem, even if you're working in a different language.
Handling real-world scraping pain points
whichever parser you choose, the parsing layer is rarely where scrapers break. what actually breaks them:
- encoding issues: always decode response bytes before passing to the parser. `undici`'s `.text()` decodes the body as UTF-8, but many sites serve latin-1 or windows-1252. check `Content-Type` headers and use `iconv-lite` when needed
- chunked/streaming HTML: Cheerio and Linkedom both accept full strings. stream parsing requires htmlparser2's streaming API directly, or a Rust-based option -- html5ever vs lol-html: Rust HTML Parsing Compared (2026) covers lol-html's streaming model, which is particularly relevant if you're processing large responses at scale
- selector specificity bugs: Cheerio's CSS selector support (via `css-select`) covers most of what you need but misses some pseudo-classes. Linkedom relies on the same `css-select` package, so verify unusual selectors against the library you actually use rather than assuming one is more complete
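the encoding point is the easiest one to get wrong silently, so here is a minimal sketch of charset-aware decoding. it uses Node's built-in `TextDecoder` instead of `iconv-lite`, which assumes a standard Node build with full ICU data; `charsetFromContentType` and `decodeBody` are hypothetical helper names:

```javascript
// Pull the charset label out of a Content-Type header, e.g.
// "text/html; charset=windows-1252". Falls back to UTF-8 when absent.
function charsetFromContentType(contentType) {
  const match = /charset\s*=\s*"?([^";\s]+)/i.exec(contentType ?? '');
  return match ? match[1].toLowerCase() : 'utf-8';
}

// Decode raw response bytes using the declared charset.
// TextDecoder throws a RangeError for labels it does not know, in which
// case this sketch falls back to UTF-8 (iconv-lite covers more encodings).
function decodeBody(bytes, contentType) {
  const label = charsetFromContentType(contentType);
  try {
    return new TextDecoder(label).decode(bytes);
  } catch (err) {
    if (!(err instanceof RangeError)) throw err;
    return new TextDecoder('utf-8').decode(bytes);
  }
}

// "café" encoded as windows-1252: the é is the single byte 0xE9.
const bytes = Uint8Array.from([0x63, 0x61, 0x66, 0xe9]);
console.log(decodeBody(bytes, 'text/html; charset=windows-1252')); // café
console.log(decodeBody(bytes, 'text/html')); // mis-decoded: ends in U+FFFD
```

in a scraper you would call it as `decodeBody(new Uint8Array(await res.arrayBuffer()), res.headers.get('content-type'))` instead of `res.text()`. pages that declare their charset only in a `<meta>` tag need an extra sniffing step this sketch does not attempt.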
one pattern worth adopting regardless of library: always validate your parsed output against a schema before writing to a database. a one-line if (!price || isNaN(parseFloat(price))) check catches 80% of the silent failures that corrupt your dataset when a site changes its HTML structure.
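a slightly fuller version of that check, as a sketch: `validateProduct` and the `name`/`price` fields are illustrative, and the price cleanup assumes dot-decimal formatting (a value like `1.299,00` would need locale-aware handling):

```javascript
// Validate one scraped record before it goes anywhere near the database.
// Field names (name, price) are illustrative; use your own schema.
function validateProduct(record) {
  const errors = [];

  const name = typeof record.name === 'string' ? record.name.trim() : '';
  if (name === '') errors.push('name is missing or empty');

  // Strip currency symbols and thousands separators, then require a real number.
  const rawPrice = String(record.price ?? '');
  const price = Number.parseFloat(rawPrice.replace(/[^0-9.-]/g, ''));
  if (!Number.isFinite(price) || price < 0) {
    errors.push(`price is not a usable number: ${JSON.stringify(record.price)}`);
  }

  return errors.length === 0
    ? { ok: true, value: { name, price } }
    : { ok: false, errors };
}

console.log(validateProduct({ name: 'Widget', price: '$1,299.00' }));
console.log(validateProduct({ name: 'Widget', price: 'Sold out' }));
```

rejected records are worth logging with the source URL: a sudden spike in failures is usually the first sign that the site changed its markup.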
Bottom line
for most Node.js scraping work in 2026, Cheerio is the right default: fast, battle-tested, and easy to maintain. reach for Linkedom when you need real DOM APIs without JSDom's overhead, especially in multi-threaded pipelines. reserve JSDom for test environments and the rare inline-script rendering case. we cover the full Node.js scraping stack at DRT, including runtime benchmarks and library-level comparisons, for engineers who want to make these calls with real numbers.
Related guides on dataresearchtools.com
- Selectolax: The Fastest HTML Parser You're Not Using in 2026
- html5ever vs lol-html: Rust HTML Parsing Compared (2026)
- Pyppeteer vs Playwright Python: Which to Use in 2026
- MechanicalSoup Library Review 2026: When Cookies + Forms Matter
- Web Scraping with Bun: Faster Than Node.js for Scrapers in 2026?