Cheerio.js Web Scraping: Node.js HTML Parsing

Cheerio is the BeautifulSoup of Node.js. It provides jQuery-style selectors for parsing and manipulating HTML on the server side, without running a browser. Combined with an HTTP client like Axios or Got, Cheerio is among the fastest scraping options in the Node.js ecosystem, often 10-50x faster than Puppeteer or Playwright on static pages.

This tutorial covers Cheerio’s API, common scraping patterns, and integration with HTTP clients for production scraping.

Why Cheerio

  • Speed — No browser rendering overhead, pure HTML parsing
  • jQuery syntax — Familiar API if you know jQuery
  • Low memory — Builds a lightweight document model instead of a full browser DOM
  • Server-side — Runs on Node.js, no browser dependencies
  • Lightweight — Tiny package with minimal dependencies

Use Cheerio when pages are static HTML. For JavaScript-rendered pages, use Puppeteer or Playwright.

Installation

npm install cheerio axios

Basic Usage

const cheerio = require('cheerio');

const html = `
<html>
<body>
    <div class="products">
        <div class="product" data-id="1">
            <h2>Laptop</h2>
            <span class="price">$999.99</span>
        </div>
        <div class="product" data-id="2">
            <h2>Tablet</h2>
            <span class="price">$499.99</span>
        </div>
    </div>
</body>
</html>
`;

const $ = cheerio.load(html);

// jQuery-style selectors
$('.product').each((i, el) => {
    const name = $(el).find('h2').text();
    const price = $(el).find('.price').text();
    const id = $(el).attr('data-id');
    console.log(`${name}: ${price} (ID: ${id})`);
});

Selectors

Cheerio supports all standard CSS selectors:

const $ = cheerio.load(html);

// By tag
$('h2');

// By class
$('.product');

// By ID
$('#main-content');

// By attribute
$('a[href]');
$('input[type="text"]');
$('[data-id="123"]');

// Hierarchical
$('div.products > .product');
$('.product h2');
$('.product + .product');

// Pseudo selectors
$('.product:first-child');
$('.product:last-child');
$('.product:nth-child(2)');
$('tr:even');
$('tr:odd');

// Multiple selectors
$('h1, h2, h3');

// Contains text
$('span:contains("Price")');

// Has child
$('.product:has(.price)');

// Not
$('.product:not(.sold-out)');

Extracting Data

const $ = cheerio.load(html);

// Text content
const title = $('h1').text();                    // All text including children
const ownText = $('h1').contents().first().text(); // Own text only

// Attributes
const href = $('a.link').attr('href');
const dataId = $('.product').attr('data-id');
const classes = $('.product').attr('class');

// HTML content
const innerHTML = $('.content').html();
const outerHTML = $.html('.content');

// Data attributes
const id = $('.product').data('id');

// Check existence
const exists = $('.element').length > 0;

// Multiple elements as array
const allPrices = $('.price').map((i, el) => $(el).text()).get();
const allLinks = $('a').map((i, el) => $(el).attr('href')).get();

// First/last
const firstProduct = $('.product').first().text();
const lastProduct = $('.product').last().text();

// Nth element
const secondProduct = $('.product').eq(1).text();

Traversing the DOM

const $ = cheerio.load(html);

const product = $('.product').first();

// Children
product.children();
product.children('.price');

// Parent
product.parent();
product.parents('.wrapper');
product.closest('.products');

// Siblings
product.next();
product.prev();
product.siblings();

// Find descendants
product.find('h2');
product.find('.price');

// Filter
$('.product').filter((i, el) => {
    return parseFloat($(el).find('.price').text().replace('$', '')) > 500;
});

// Each iteration
$('.product').each((index, element) => {
    const $el = $(element);
    console.log(`${index}: ${$el.find('h2').text()}`);
});

Manipulating HTML

Useful for cleaning HTML before extraction:

const $ = cheerio.load(html);

// Remove elements
$('script, style, nav, footer').remove();

// Get clean text
const cleanText = $('body').text().trim();

// Add/modify attributes
$('a').attr('target', '_blank');

// Replace content
$('.price').text('REDACTED');

// Wrap/unwrap
$('h2').wrap('<div class="title-wrapper"></div>');

Integration with Axios

const axios = require('axios');
const cheerio = require('cheerio');

async function scrape(url) {
    const { data } = await axios.get(url, {
        headers: {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Accept': 'text/html',
        },
        timeout: 30000,
    });

    const $ = cheerio.load(data);

    const books = [];
    $('article.product_pod').each((i, el) => {
        books.push({
            title: $(el).find('h3 a').attr('title'),
            price: $(el).find('.price_color').text(),
            rating: $(el).find('p.star-rating').attr('class').replace('star-rating ', ''),
        });
    });

    return books;
}

scrape('https://books.toscrape.com/')
    .then(books => console.log(`Found ${books.length} books`))
    .catch(console.error);

For a complete Axios + Cheerio stack guide, see our Axios + Cheerio tutorial.
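Production scrapers should also retry transient failures. A minimal, dependency-free retry wrapper with exponential backoff (the helper name and default delays are our own; wrap any async request function with it):

```javascript
// Retry an async function with exponential backoff.
// `attempts` and `baseDelayMs` are illustrative defaults.
async function withRetry(fn, attempts = 3, baseDelayMs = 500) {
    let lastErr;
    for (let i = 0; i < attempts; i++) {
        try {
            return await fn();
        } catch (err) {
            lastErr = err;
            if (i < attempts - 1) {
                // Wait 500ms, 1000ms, 2000ms, ... between attempts
                await new Promise(r => setTimeout(r, baseDelayMs * 2 ** i));
            }
        }
    }
    throw lastErr;
}

// Usage with the scrape() function above:
// const books = await withRetry(() => scrape('https://books.toscrape.com/'));
```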

Handling Pagination

const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeAllPages() {
    const allBooks = [];

    for (let page = 1; page <= 50; page++) {
        const url = `https://books.toscrape.com/catalogue/page-${page}.html`;

        try {
            const { data } = await axios.get(url, { timeout: 30000 });
            const $ = cheerio.load(data);

            const books = $('article.product_pod').map((i, el) => ({
                title: $(el).find('h3 a').attr('title'),
                price: $(el).find('.price_color').text(),
            })).get();

            if (books.length === 0) break;
            allBooks.push(...books);
            console.log(`Page ${page}: ${books.length} books`);

            // Polite delay
            await new Promise(r => setTimeout(r, 1000));

        } catch (err) {
            console.log(`Error on page ${page}: ${err.message}`);
            break;
        }
    }

    return allBooks;
}

scrapeAllPages().then(books => {
    console.log(`Total: ${books.length} books`);
    require('fs').writeFileSync('books.json', JSON.stringify(books, null, 2));
});
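Sequential page-by-page scraping is polite but slow. When the target site allows it, a small worker pool can fetch several pages in parallel while capping concurrency; this is a dependency-free sketch, and the limit of 3 is an example:

```javascript
// Run async tasks with at most `limit` in flight at once.
// Results come back in the same order as the input tasks.
async function runPool(tasks, limit = 3) {
    const results = new Array(tasks.length);
    let next = 0;
    async function worker() {
        while (next < tasks.length) {
            const i = next++; // safe: no await between read and increment
            results[i] = await tasks[i]();
        }
    }
    await Promise.all(
        Array.from({ length: Math.min(limit, tasks.length) }, worker)
    );
    return results;
}

// Usage sketch: one task per page URL
// const tasks = urls.map(url => () => scrapePage(url));
// const pages = await runPool(tasks, 3);
```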

Proxy Integration

const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeWithProxy(url) {
    const { data } = await axios.get(url, {
        proxy: {
            host: 'proxy.example.com',
            port: 8080,
            auth: {
                username: 'user',
                password: 'pass',
            },
        },
        timeout: 30000,
    });

    const $ = cheerio.load(data);
    return $('body').text();
}

// With rotating proxies
const proxies = [
    { host: 'proxy1.example.com', port: 8080 },
    { host: 'proxy2.example.com', port: 8080 },
];

async function scrapeWithRotation(url) {
    const proxy = proxies[Math.floor(Math.random() * proxies.length)];
    const { data } = await axios.get(url, {
        proxy: { ...proxy, auth: { username: 'user', password: 'pass' } },
    });
    return cheerio.load(data);
}

For proxy types, see our web scraping proxy guide and proxy glossary.

Performance Tips

  1. Skip unnecessary parsing — Only load the HTML you need
  2. Use specific selectors — $('.product h3 a') is faster than filtering every $('a') match
  3. Remove scripts/styles first — Reduces parsing time
  4. Use .get() for arrays — .map().get() is faster than manual array building
  5. Stream large HTML — Use htmlparser2 for very large documents
// Pre-clean HTML before full parsing
const $ = cheerio.load(data, {
    xmlMode: false,
    decodeEntities: true,
});

// Remove noise before extraction
$('script, style, noscript, iframe').remove();

Complete Example

const axios = require('axios');
const cheerio = require('cheerio');
const fs = require('fs');

class BookScraper {
    constructor() {
        this.client = axios.create({
            headers: {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
            },
            timeout: 30000,
        });
        this.books = [];
    }

    async scrapePage(url) {
        const { data } = await this.client.get(url);
        const $ = cheerio.load(data);

        $('article.product_pod').each((i, el) => {
            const $el = $(el);
            this.books.push({
                title: $el.find('h3 a').attr('title'),
                price: parseFloat($el.find('.price_color').text().replace('£', '')),
                rating: $el.find('p.star-rating').attr('class').replace('star-rating ', ''),
                available: $el.find('.instock').length > 0,
            });
        });

        const nextHref = $('li.next a').attr('href');
        return nextHref ? new URL(nextHref, url).href : null;
    }

    async scrapeAll() {
        let url = 'https://books.toscrape.com/catalogue/page-1.html';

        while (url) {
            try {
                const nextUrl = await this.scrapePage(url);
                console.log(`Scraped: ${url} (${this.books.length} total)`);
                url = nextUrl;
                await new Promise(r => setTimeout(r, 1000));
            } catch (err) {
                console.error(`Failed: ${url} — ${err.message}`);
                break;
            }
        }

        return this.books;
    }

    save(filename) {
        fs.writeFileSync(filename, JSON.stringify(this.books, null, 2));
        console.log(`Saved ${this.books.length} books to ${filename}`);
    }
}

const scraper = new BookScraper();
scraper.scrapeAll().then(() => scraper.save('books.json'));

FAQ

When should I use Cheerio instead of Puppeteer?

Use Cheerio for static HTML pages that do not require JavaScript rendering. Cheerio processes pages 10-50x faster than Puppeteer because it parses raw HTML without launching a browser. Only use Puppeteer or Playwright when the page loads content dynamically with JavaScript.

Is Cheerio the same as jQuery?

Cheerio implements a subset of jQuery’s API for server-side use. The selector syntax and DOM manipulation methods are identical, but Cheerio does not include AJAX, animations, or event handling — it focuses purely on HTML parsing and manipulation.

Can Cheerio execute JavaScript?

No. Cheerio only parses static HTML. If a page requires JavaScript to render content, you need a headless browser like Puppeteer or Playwright. Always check the page source first — if the data is in the HTML, Cheerio can extract it.

What is the best HTTP client to pair with Cheerio?

Axios is the most popular choice. Got is a good alternative with more features (retries, pagination). For the lightest option, Node’s built-in fetch() (available in Node 18+) works without dependencies.


Explore more Node.js scraping: Axios + Cheerio stack, Puppeteer tutorial. For proxy setup, see our web scraping proxy guide.
