Cheerio.js Web Scraping: Node.js HTML Parsing
Cheerio is the BeautifulSoup of Node.js. It provides jQuery-style selectors for parsing and manipulating HTML on the server side, without running a browser. Combined with an HTTP client like Axios or Got, Cheerio is among the fastest scraping options in the Node.js ecosystem — typically 10-50x faster than Puppeteer or Playwright for static pages.
This tutorial covers Cheerio’s API, common scraping patterns, and integration with HTTP clients for production scraping.
Table of Contents
- Why Cheerio
- Installation
- Basic Usage
- Selectors
- Extracting Data
- Traversing the DOM
- Manipulating HTML
- Integration with Axios
- Handling Pagination
- Proxy Integration
- Performance Tips
- Complete Example
- FAQ
Why Cheerio
- Speed — No browser rendering overhead, pure HTML parsing
- jQuery syntax — Familiar API if you know jQuery
- Low memory — Builds a lightweight node tree instead of a full browser DOM
- Server-side — Runs on Node.js, no browser dependencies
- Lightweight — Tiny package with minimal dependencies
Use Cheerio when pages are static HTML. For JavaScript-rendered pages, use Puppeteer or Playwright.
Installation
npm install cheerio axios
Basic Usage
const cheerio = require('cheerio');
const html = `
<html>
<body>
<div class="products">
<div class="product" data-id="1">
<h2>Laptop</h2>
<span class="price">$999.99</span>
</div>
<div class="product" data-id="2">
<h2>Tablet</h2>
<span class="price">$499.99</span>
</div>
</div>
</body>
</html>
`;
const $ = cheerio.load(html);
// jQuery-style selectors
$('.product').each((i, el) => {
const name = $(el).find('h2').text();
const price = $(el).find('.price').text();
const id = $(el).attr('data-id');
console.log(`${name}: ${price} (ID: ${id})`);
});
Selectors
Cheerio supports all standard CSS selectors:
const $ = cheerio.load(html);
// By tag
$('h2');
// By class
$('.product');
// By ID
$('#main-content');
// By attribute
$('a[href]');
$('input[type="text"]');
$('[data-id="123"]');
// Hierarchical
$('div.products > .product');
$('.product h2');
$('.product + .product');
// Pseudo selectors
$('.product:first-child');
$('.product:last-child');
$('.product:nth-child(2)');
$('tr:even');
$('tr:odd');
// Multiple selectors
$('h1, h2, h3');
// Contains text
$('span:contains("Price")');
// Has child
$('.product:has(.price)');
// Not
$('.product:not(.sold-out)');Extracting Data
const $ = cheerio.load(html);
// Text content
const title = $('h1').text(); // All text including children
const ownText = $('h1').contents().first().text(); // Text of the first child node only
// Attributes
const href = $('a.link').attr('href');
const dataId = $('.product').attr('data-id');
const classes = $('.product').attr('class');
// HTML content
const innerHTML = $('.content').html();
const outerHTML = $.html('.content');
// Data attributes
const id = $('.product').data('id');
// Check existence
const exists = $('.element').length > 0;
// Multiple elements as array
const allPrices = $('.price').map((i, el) => $(el).text()).get();
const allLinks = $('a').map((i, el) => $(el).attr('href')).get();
// First/last
const firstProduct = $('.product').first().text();
const lastProduct = $('.product').last().text();
// Nth element
const secondProduct = $('.product').eq(1).text();
Traversing the DOM
const $ = cheerio.load(html);
const product = $('.product').first();
// Children
product.children();
product.children('.price');
// Parent
product.parent();
product.parents('.wrapper');
product.closest('.products');
// Siblings
product.next();
product.prev();
product.siblings();
// Find descendants
product.find('h2');
product.find('.price');
// Filter
$('.product').filter((i, el) => {
return parseFloat($(el).find('.price').text().replace('$', '')) > 500;
});
// Each iteration
$('.product').each((index, element) => {
const $el = $(element);
console.log(`${index}: ${$el.find('h2').text()}`);
});
Manipulating HTML
Useful for cleaning HTML before extraction:
const $ = cheerio.load(html);
// Remove elements
$('script, style, nav, footer').remove();
// Get clean text
const cleanText = $('body').text().trim();
// Add/modify attributes
$('a').attr('target', '_blank');
// Replace content
$('.price').text('REDACTED');
// Wrap/unwrap
$('h2').wrap('<div class="title-wrapper"></div>');
Integration with Axios
const axios = require('axios');
const cheerio = require('cheerio');
async function scrape(url) {
const { data } = await axios.get(url, {
headers: {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
'Accept': 'text/html',
},
timeout: 30000,
});
const $ = cheerio.load(data);
const books = [];
$('article.product_pod').each((i, el) => {
books.push({
title: $(el).find('h3 a').attr('title'),
price: $(el).find('.price_color').text(),
rating: $(el).find('p.star-rating').attr('class').replace('star-rating ', ''),
});
});
return books;
}
scrape('https://books.toscrape.com/')
.then(books => console.log(`Found ${books.length} books`))
.catch(console.error);
For a complete Axios + Cheerio stack guide, see our Axios + Cheerio tutorial.
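In production, transient failures (timeouts, 5xx responses, dropped connections) are common. A small retry wrapper can guard any Axios call; this is a sketch with illustrative names, not an Axios feature.

```javascript
// Generic retry helper: runs an async request function, retrying on
// failure with a fixed delay between attempts. `fn` can be any
// promise-returning function, e.g. () => axios.get(url).
async function withRetry(fn, { attempts = 3, delayMs = 1000 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < attempts) {
        await new Promise(resolve => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}

// Usage sketch: const { data } = await withRetry(() => axios.get(url));
```

Keeping the helper generic means the same wrapper works for Axios, Got, or native `fetch()` without modification.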
Handling Pagination
const axios = require('axios');
const cheerio = require('cheerio');
async function scrapeAllPages() {
const allBooks = [];
for (let page = 1; page <= 50; page++) {
const url = `https://books.toscrape.com/catalogue/page-${page}.html`;
try {
const { data } = await axios.get(url, { timeout: 30000 });
const $ = cheerio.load(data);
const books = $('article.product_pod').map((i, el) => ({
title: $(el).find('h3 a').attr('title'),
price: $(el).find('.price_color').text(),
})).get();
if (books.length === 0) break;
allBooks.push(...books);
console.log(`Page ${page}: ${books.length} books`);
// Polite delay
await new Promise(r => setTimeout(r, 1000));
} catch (err) {
console.log(`Error on page ${page}: ${err.message}`);
break;
}
}
return allBooks;
}
scrapeAllPages().then(books => {
console.log(`Total: ${books.length} books`);
require('fs').writeFileSync('books.json', JSON.stringify(books, null, 2));
});
Proxy Integration
const axios = require('axios');
const cheerio = require('cheerio');
async function scrapeWithProxy(url) {
const { data } = await axios.get(url, {
proxy: {
host: 'proxy.example.com',
port: 8080,
auth: {
username: 'user',
password: 'pass',
},
},
timeout: 30000,
});
const $ = cheerio.load(data);
return $('body').text();
}
// With rotating proxies
const proxies = [
{ host: 'proxy1.example.com', port: 8080 },
{ host: 'proxy2.example.com', port: 8080 },
];
async function scrapeWithRotation(url) {
const proxy = proxies[Math.floor(Math.random() * proxies.length)];
const { data } = await axios.get(url, {
proxy: { ...proxy, auth: { username: 'user', password: 'pass' } },
});
return cheerio.load(data);
}
For proxy types, see our web scraping proxy guide and proxy glossary.
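Random selection can route several consecutive requests through the same proxy. A round-robin picker spreads requests evenly across the pool; the code below is an illustrative sketch, not a library API.

```javascript
// Round-robin picker: cycles through the proxy list so each proxy
// handles an even share of requests, unlike random selection.
function makeProxyRotator(proxies) {
  let index = 0;
  return function nextProxy() {
    const proxy = proxies[index];
    index = (index + 1) % proxies.length;
    return proxy;
  };
}

const nextProxy = makeProxyRotator([
  { host: 'proxy1.example.com', port: 8080 },
  { host: 'proxy2.example.com', port: 8080 },
]);
console.log(nextProxy().host); // proxy1.example.com
console.log(nextProxy().host); // proxy2.example.com
console.log(nextProxy().host); // proxy1.example.com (wraps around)
```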
Performance Tips
- Skip unnecessary parsing — Only load the HTML you need
- Use specific selectors — `.product h3 a` is faster than `$('a')`
- Remove scripts/styles first — Reduces parsing time
- Use `.get()` for arrays — `.map().get()` is faster than manual array building
- Stream large HTML — Use `htmlparser2` for very large documents
// Pre-clean HTML before full parsing
const $ = cheerio.load(data, {
xmlMode: false,
decodeEntities: true,
});
// Remove noise before extraction
$('script, style, noscript, iframe').remove();
Complete Example
const axios = require('axios');
const cheerio = require('cheerio');
const fs = require('fs');
class BookScraper {
constructor() {
this.client = axios.create({
headers: {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
},
timeout: 30000,
});
this.books = [];
}
async scrapePage(url) {
const { data } = await this.client.get(url);
const $ = cheerio.load(data);
$('article.product_pod').each((i, el) => {
const $el = $(el);
this.books.push({
title: $el.find('h3 a').attr('title'),
price: parseFloat($el.find('.price_color').text().replace('£', '')),
rating: $el.find('p.star-rating').attr('class').replace('star-rating ', ''),
available: $el.find('.instock').length > 0,
});
});
const nextHref = $('li.next a').attr('href');
return nextHref ? new URL(nextHref, url).href : null;
}
async scrapeAll() {
let url = 'https://books.toscrape.com/catalogue/page-1.html';
while (url) {
try {
const nextUrl = await this.scrapePage(url);
console.log(`Scraped: ${url} (${this.books.length} total)`);
url = nextUrl;
await new Promise(r => setTimeout(r, 1000));
} catch (err) {
console.error(`Failed: ${url} — ${err.message}`);
break;
}
}
return this.books;
}
save(filename) {
fs.writeFileSync(filename, JSON.stringify(this.books, null, 2));
console.log(`Saved ${this.books.length} books to ${filename}`);
}
}
const scraper = new BookScraper();
scraper.scrapeAll().then(() => scraper.save('books.json'));
FAQ
When should I use Cheerio instead of Puppeteer?
Use Cheerio for static HTML pages that do not require JavaScript rendering. Cheerio processes pages 10-50x faster than Puppeteer because it parses raw HTML without launching a browser. Only use Puppeteer or Playwright when the page loads content dynamically with JavaScript.
Is Cheerio the same as jQuery?
Cheerio implements a subset of jQuery’s API for server-side use. The selector syntax and DOM manipulation methods are identical, but Cheerio does not include AJAX, animations, or event handling — it focuses purely on HTML parsing and manipulation.
Can Cheerio execute JavaScript?
No. Cheerio only parses static HTML. If a page requires JavaScript to render content, you need a headless browser like Puppeteer or Playwright. Always check the page source first — if the data is in the HTML, Cheerio can extract it.
What is the best HTTP client to pair with Cheerio?
Axios is the most popular choice. Got is a good alternative with more features (retries, pagination). For the lightest option, Node’s built-in fetch() (available in Node 18+) works without dependencies.
Explore more Node.js scraping: Axios + Cheerio stack, Puppeteer tutorial. For proxy setup, see our web scraping proxy guide.
External Resources:
- Cheerio Documentation
- Cheerio GitHub Repository
- Axios Documentation
Related Reading
- aiohttp + BeautifulSoup: Async Python Scraping
- Axios + Cheerio: Lightweight Node.js Scraping
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)
- API vs Web Scraping: When You Need Proxies (and When You Don’t)
- ASEAN Data Protection Laws: A Web Scraping Compliance Matrix
- How to Build an Ethical Web Scraping Policy for Your Company