Cheerio.js Web Scraping: Node.js HTML Parsing

Cheerio is the BeautifulSoup of Node.js. It provides jQuery-style selectors for parsing and manipulating HTML on the server side, without running a browser. Combined with an HTTP client like Axios or Got, Cheerio is among the fastest scraping options in the Node.js ecosystem, often 10-50x faster than Puppeteer or Playwright on static pages.

This tutorial covers Cheerio’s API, common scraping patterns, and integration with HTTP clients for production scraping.

Why Cheerio

  • Speed — No browser rendering overhead, pure HTML parsing
  • jQuery syntax — Familiar API if you know jQuery
  • Low memory — Builds a lightweight document model instead of a full browser DOM
  • Server-side — Runs on Node.js, no browser dependencies
  • Lightweight — Tiny package with minimal dependencies

Use Cheerio when pages are static HTML. For JavaScript-rendered pages, use Puppeteer or Playwright.

Installation

npm install cheerio axios

Basic Usage

const cheerio = require('cheerio');

const html = `
<html>
<body>
    <div class="products">
        <div class="product" data-id="1">
            <h2>Laptop</h2>
            <span class="price">$999.99</span>
        </div>
        <div class="product" data-id="2">
            <h2>Tablet</h2>
            <span class="price">$499.99</span>
        </div>
    </div>
</body>
</html>
`;

const $ = cheerio.load(html);

// jQuery-style selectors
$('.product').each((i, el) => {
    const name = $(el).find('h2').text();
    const price = $(el).find('.price').text();
    const id = $(el).attr('data-id');
    console.log(`${name}: ${price} (ID: ${id})`);
});

Selectors

Cheerio supports all standard CSS selectors:

const $ = cheerio.load(html);

// By tag
$('h2');

// By class
$('.product');

// By ID
$('#main-content');

// By attribute
$('a[href]');
$('input[type="text"]');
$('[data-id="123"]');

// Hierarchical
$('div.products > .product');
$('.product h2');
$('.product + .product');

// Pseudo selectors
$('.product:first-child');
$('.product:last-child');
$('.product:nth-child(2)');
$('tr:even');
$('tr:odd');

// Multiple selectors
$('h1, h2, h3');

// Contains text
$('span:contains("Price")');

// Has child
$('.product:has(.price)');

// Not
$('.product:not(.sold-out)');

Extracting Data

const $ = cheerio.load(html);

// Text content
const title = $('h1').text();                    // All text including children
const ownText = $('h1').contents().first().text(); // Own text only

// Attributes
const href = $('a.link').attr('href');
const dataId = $('.product').attr('data-id');
const classes = $('.product').attr('class');

// HTML content
const innerHTML = $('.content').html();
const outerHTML = $.html('.content');

// Data attributes
const id = $('.product').data('id');

// Check existence
const exists = $('.element').length > 0;

// Multiple elements as array
const allPrices = $('.price').map((i, el) => $(el).text()).get();
const allLinks = $('a').map((i, el) => $(el).attr('href')).get();

// First/last
const firstProduct = $('.product').first().text();
const lastProduct = $('.product').last().text();

// Nth element
const secondProduct = $('.product').eq(1).text();

Traversing the DOM

const $ = cheerio.load(html);

const product = $('.product').first();

// Children
product.children();
product.children('.price');

// Parent
product.parent();
product.parents('.wrapper');
product.closest('.products');

// Siblings
product.next();
product.prev();
product.siblings();

// Find descendants
product.find('h2');
product.find('.price');

// Filter
$('.product').filter((i, el) => {
    return parseFloat($(el).find('.price').text().replace('$', '')) > 500;
});

// Each iteration
$('.product').each((index, element) => {
    const $el = $(element);
    console.log(`${index}: ${$el.find('h2').text()}`);
});

Manipulating HTML

Useful for cleaning HTML before extraction:

const $ = cheerio.load(html);

// Remove elements
$('script, style, nav, footer').remove();

// Get clean text
const cleanText = $('body').text().trim();

// Add/modify attributes
$('a').attr('target', '_blank');

// Replace content
$('.price').text('REDACTED');

// Wrap/unwrap
$('h2').wrap('<div class="title-wrapper"></div>');

Integration with Axios

const axios = require('axios');
const cheerio = require('cheerio');

async function scrape(url) {
    const { data } = await axios.get(url, {
        headers: {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Accept': 'text/html',
        },
        timeout: 30000,
    });

    const $ = cheerio.load(data);

    const books = [];
    $('article.product_pod').each((i, el) => {
        books.push({
            title: $(el).find('h3 a').attr('title'),
            price: $(el).find('.price_color').text(),
            rating: $(el).find('p.star-rating').attr('class').replace('star-rating ', ''),
        });
    });

    return books;
}

scrape('https://books.toscrape.com/')
    .then(books => console.log(`Found ${books.length} books`))
    .catch(console.error);

For a complete Axios + Cheerio stack guide, see our Axios + Cheerio tutorial.
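Production scrapers should also retry transient failures. A minimal, dependency-free retry wrapper with exponential backoff (the helper name and default delays are our own; wrap any async request function with it):

```javascript
// Retry an async function with exponential backoff.
// `attempts` and `baseDelayMs` are illustrative defaults.
async function withRetry(fn, attempts = 3, baseDelayMs = 500) {
    let lastErr;
    for (let i = 0; i < attempts; i++) {
        try {
            return await fn();
        } catch (err) {
            lastErr = err;
            if (i < attempts - 1) {
                // Wait 500ms, 1000ms, 2000ms, ... between attempts
                await new Promise(r => setTimeout(r, baseDelayMs * 2 ** i));
            }
        }
    }
    throw lastErr;
}

// Usage with the scrape() function above:
// const books = await withRetry(() => scrape('https://books.toscrape.com/'));
```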

Handling Pagination

const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeAllPages() {
    const allBooks = [];

    for (let page = 1; page <= 50; page++) {
        const url = `https://books.toscrape.com/catalogue/page-${page}.html`;

        try {
            const { data } = await axios.get(url, { timeout: 30000 });
            const $ = cheerio.load(data);

            const books = $('article.product_pod').map((i, el) => ({
                title: $(el).find('h3 a').attr('title'),
                price: $(el).find('.price_color').text(),
            })).get();

            if (books.length === 0) break;
            allBooks.push(...books);
            console.log(`Page ${page}: ${books.length} books`);

            // Polite delay
            await new Promise(r => setTimeout(r, 1000));

        } catch (err) {
            console.log(`Error on page ${page}: ${err.message}`);
            break;
        }
    }

    return allBooks;
}

scrapeAllPages().then(books => {
    console.log(`Total: ${books.length} books`);
    require('fs').writeFileSync('books.json', JSON.stringify(books, null, 2));
});
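Sequential page-by-page scraping is polite but slow. When the target site allows it, a small worker pool can fetch several pages in parallel while capping concurrency; this is a dependency-free sketch, and the limit of 3 is an example:

```javascript
// Run async tasks with at most `limit` in flight at once.
// Results come back in the same order as the input tasks.
async function runPool(tasks, limit = 3) {
    const results = new Array(tasks.length);
    let next = 0;
    async function worker() {
        while (next < tasks.length) {
            const i = next++; // safe: no await between read and increment
            results[i] = await tasks[i]();
        }
    }
    await Promise.all(
        Array.from({ length: Math.min(limit, tasks.length) }, worker)
    );
    return results;
}

// Usage sketch: one task per page URL
// const tasks = urls.map(url => () => scrapePage(url));
// const pages = await runPool(tasks, 3);
```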

Proxy Integration

const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeWithProxy(url) {
    const { data } = await axios.get(url, {
        proxy: {
            host: 'proxy.example.com',
            port: 8080,
            auth: {
                username: 'user',
                password: 'pass',
            },
        },
        timeout: 30000,
    });

    const $ = cheerio.load(data);
    return $('body').text();
}

// With rotating proxies
const proxies = [
    { host: 'proxy1.example.com', port: 8080 },
    { host: 'proxy2.example.com', port: 8080 },
];

async function scrapeWithRotation(url) {
    const proxy = proxies[Math.floor(Math.random() * proxies.length)];
    const { data } = await axios.get(url, {
        proxy: { ...proxy, auth: { username: 'user', password: 'pass' } },
    });
    return cheerio.load(data);
}

For proxy types, see our web scraping proxy guide and proxy glossary.

Performance Tips

  1. Skip unnecessary parsing — Only load the HTML you need
  2. Use specific selectors — $('.product h3 a') is faster than filtering every $('a') match
  3. Remove scripts/styles first — Reduces parsing time
  4. Use .get() for arrays — .map().get() is faster than manual array building
  5. Stream large HTML — Use htmlparser2 for very large documents
// Pre-clean HTML before full parsing
const $ = cheerio.load(data, {
    xmlMode: false,
    decodeEntities: true,
});

// Remove noise before extraction
$('script, style, noscript, iframe').remove();

Complete Example

const axios = require('axios');
const cheerio = require('cheerio');
const fs = require('fs');

class BookScraper {
    constructor() {
        this.client = axios.create({
            headers: {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
            },
            timeout: 30000,
        });
        this.books = [];
    }

    async scrapePage(url) {
        const { data } = await this.client.get(url);
        const $ = cheerio.load(data);

        $('article.product_pod').each((i, el) => {
            const $el = $(el);
            this.books.push({
                title: $el.find('h3 a').attr('title'),
                price: parseFloat($el.find('.price_color').text().replace('£', '')),
                rating: $el.find('p.star-rating').attr('class').replace('star-rating ', ''),
                available: $el.find('.instock').length > 0,
            });
        });

        const nextHref = $('li.next a').attr('href');
        return nextHref ? new URL(nextHref, url).href : null;
    }

    async scrapeAll() {
        let url = 'https://books.toscrape.com/catalogue/page-1.html';

        while (url) {
            try {
                const nextUrl = await this.scrapePage(url);
                console.log(`Scraped: ${url} (${this.books.length} total)`);
                url = nextUrl;
                await new Promise(r => setTimeout(r, 1000));
            } catch (err) {
                console.error(`Failed: ${url} — ${err.message}`);
                break;
            }
        }

        return this.books;
    }

    save(filename) {
        fs.writeFileSync(filename, JSON.stringify(this.books, null, 2));
        console.log(`Saved ${this.books.length} books to ${filename}`);
    }
}

const scraper = new BookScraper();
scraper.scrapeAll().then(() => scraper.save('books.json'));

FAQ

When should I use Cheerio instead of Puppeteer?

Use Cheerio for static HTML pages that do not require JavaScript rendering. Cheerio processes pages 10-50x faster than Puppeteer because it parses raw HTML without launching a browser. Only use Puppeteer or Playwright when the page loads content dynamically with JavaScript.

Is Cheerio the same as jQuery?

Cheerio implements a subset of jQuery’s API for server-side use. The selector syntax and DOM manipulation methods are identical, but Cheerio does not include AJAX, animations, or event handling — it focuses purely on HTML parsing and manipulation.

Can Cheerio execute JavaScript?

No. Cheerio only parses static HTML. If a page requires JavaScript to render content, you need a headless browser like Puppeteer or Playwright. Always check the page source first — if the data is in the HTML, Cheerio can extract it.

What is the best HTTP client to pair with Cheerio?

Axios is the most popular choice. Got is a good alternative with more features (retries, pagination). For the lightest option, Node’s built-in fetch() (available in Node 18+) works without dependencies.


Explore more Node.js scraping: Axios + Cheerio stack, Puppeteer tutorial. For proxy setup, see our web scraping proxy guide.
