web scraping with Node.js: axios, cheerio, puppeteer complete guide (2026)
Node.js is the second most popular language for web scraping after Python, and in 2026 it has caught up on tooling. you get axios for HTTP requests, cheerio for fast HTML parsing, and puppeteer for headless Chrome control. this guide walks through all three with real code, proxy rotation, and the anti-bot patterns that actually work today.
why scrape with Node.js in 2026
Node.js wins when your scraper needs to share code with a frontend, run inside an existing Express or Next.js app, or handle thousands of concurrent requests over a single event loop. its async model is genuinely faster than threaded Python for I/O-bound jobs.
Node also gets first-class treatment from puppeteer (maintained by the Chrome team at Google), and new Chrome DevTools Protocol features tend to land in its Node tooling before anywhere else. if you scrape JavaScript-heavy sites at scale, Node is the path of least resistance.
the trade-off is the parsing ecosystem. Python’s BeautifulSoup is more forgiving than cheerio when HTML is broken, and pandas is way ahead of anything in JS land for downstream data work. pick Node when concurrency or browser automation matter most.
install the stack
start with a fresh project. you only need three core packages and one helper for proxy support.
```bash
mkdir scraper && cd scraper
npm init -y
npm install axios cheerio puppeteer https-proxy-agent
```
axios handles plain HTTP. cheerio parses HTML with a jQuery-like API. puppeteer launches headless Chrome. https-proxy-agent lets you route axios through HTTP or HTTPS proxies without external config.
| library | use for | speed | handles JS |
|---|---|---|---|
| axios + cheerio | static HTML, APIs, RSS | very fast | no |
| puppeteer | SPAs, login flows, screenshots | slow | yes |
| playwright | same as puppeteer + multi-browser | slow | yes |
most production scrapers use both. axios handles 80% of pages, puppeteer handles the JavaScript-rendered 20%.
scrape static HTML with axios + cheerio
here’s the simplest possible scraper. it pulls book titles and prices from books.toscrape.com (a sandbox site built specifically for scraping practice).
```js
const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeBooks() {
  const { data } = await axios.get('https://books.toscrape.com/');
  const $ = cheerio.load(data);
  const books = [];
  $('article.product_pod').each((i, el) => {
    books.push({
      title: $(el).find('h3 a').attr('title'),
      price: $(el).find('.price_color').text(),
      stock: $(el).find('.availability').text().trim(),
    });
  });
  return books;
}

scrapeBooks().then(console.log);
```
axios returns the raw HTML. cheerio loads it into a DOM-like object you query with CSS selectors. the each loop builds an array of objects you can write to JSON or a database.
this pattern handles thousands of requests per minute on a single machine. for sites without anti-bot defenses, it’s all you need.
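hitting that throughput means running requests concurrently instead of awaiting them one at a time, but with a cap so you don't open thousands of sockets at once. here's a minimal stdlib-only sketch of a concurrency limiter (the helper name mapWithConcurrency is ours, not a library API; production code often uses a package like p-limit instead):

```js
// run an async fn over items with at most `limit` calls in flight at once
async function mapWithConcurrency(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0;
  async function worker() {
    while (next < items.length) {
      const i = next++; // claim the next index before awaiting
      results[i] = await fn(items[i], i);
    }
  }
  // spawn `limit` workers that drain the shared queue
  const workers = Array.from({ length: Math.min(limit, items.length) }, worker);
  await Promise.all(workers);
  return results;
}

// usage sketch: fetch many pages, at most 10 at a time
// const pages = await mapWithConcurrency(urls, 10, (u) => axios.get(u));
```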
handle pagination
most listings span multiple pages. wrap the scraper in a loop that follows the next-page link until it disappears.
```js
async function scrapeAllPages() {
  let url = 'https://books.toscrape.com/';
  const all = [];
  while (url) {
    const { data } = await axios.get(url);
    const $ = cheerio.load(data);
    $('article.product_pod').each((i, el) => {
      all.push({
        title: $(el).find('h3 a').attr('title'),
        price: $(el).find('.price_color').text(),
      });
    });
    const next = $('li.next a').attr('href');
    url = next ? new URL(next, url).href : null;
  }
  return all;
}
```
URL resolution matters. relative links like catalogue/page-2.html break if you concatenate them naively. the URL constructor resolves them against the current page's URL as the base.
for sites with offset pagination (?page=1, ?page=2), increment the query param until you get an empty result set. for cursor-based APIs, follow the next-cursor token.
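for the offset case, the loop is simple enough to sketch. fetchPage here is a placeholder you'd implement with axios + cheerio as shown above; the stop condition is the empty result set:

```js
// walk ?page=1, ?page=2, ... until a page comes back empty.
// fetchPage(url) is a placeholder: fetch and parse one page into an
// array of items (e.g. with axios + cheerio). maxPages is a safety cap.
async function scrapeOffsetPages(baseUrl, fetchPage, maxPages = 1000) {
  const all = [];
  for (let page = 1; page <= maxPages; page++) {
    const url = new URL(baseUrl);
    url.searchParams.set('page', String(page));
    const items = await fetchPage(url.href);
    if (items.length === 0) break; // empty result set = past the last page
    all.push(...items);
  }
  return all;
}
```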
scrape JavaScript-rendered pages with puppeteer
axios fetches HTML that the server returns. it does not run JavaScript. modern SPAs (React, Vue, Next.js) render content client-side, so axios returns an empty shell.
puppeteer launches a real Chrome instance, runs the JavaScript, and gives you the rendered DOM.
```js
const puppeteer = require('puppeteer');

async function scrapeQuotes() {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://quotes.toscrape.com/js/', {
    waitUntil: 'networkidle2',
  });
  const quotes = await page.$$eval('.quote', (els) =>
    els.map((el) => ({
      text: el.querySelector('.text').innerText,
      author: el.querySelector('.author').innerText,
    }))
  );
  await browser.close();
  return quotes;
}
```
waitUntil: 'networkidle2' tells puppeteer to wait until no more than two network connections have been active for 500ms. on sites with constant background traffic (analytics pings, polling XHRs), that state never arrives and goto times out, so use waitForSelector('.quote') instead to wait for the specific content you need.
$$eval runs a function inside the browser, so you query the live DOM with regular querySelector calls. the result serializes back to Node.
if you’re picking between puppeteer and the alternatives, see our headless browser deep dive and the selenium vs playwright vs puppeteer comparison.
handle infinite scroll
many listing pages load more results as you scroll. simulate that with page.evaluate.
```js
async function autoScroll(page) {
  await page.evaluate(async () => {
    await new Promise((resolve) => {
      let total = 0;
      const distance = 300;
      const timer = setInterval(() => {
        const { scrollHeight } = document.body;
        window.scrollBy(0, distance);
        total += distance;
        if (total >= scrollHeight) {
          clearInterval(timer);
          resolve();
        }
      }, 200);
    });
  });
}
```
call autoScroll(page) before extracting data. tune the distance (300px) and interval (200ms) to match how the target site loads chunks. too fast and you'll miss content; too slow and you waste time.
use proxies (essential for any real target)
the moment you scrape Amazon, Google, LinkedIn, or any high-value target, you need rotating proxies. a single IP making 100 requests per minute gets blocked within seconds.
axios with a single proxy:
```js
const { HttpsProxyAgent } = require('https-proxy-agent');

const agent = new HttpsProxyAgent('http://user:pass@proxy.example.com:8000');
const { data } = await axios.get('https://example.com', {
  httpsAgent: agent,
  httpAgent: agent,
  proxy: false, // let the agent do the routing instead of axios's own proxy handling
  timeout: 15000,
});
```
for a rotating residential pool, you typically get a single gateway endpoint and the provider rotates IPs per request. mobile proxy gateways work the same way.
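if your provider hands you a list of static endpoints instead of a rotating gateway, you can round-robin through them yourself. a minimal sketch (makeProxyRotator is our helper name, and the proxy URLs in the usage lines are placeholders):

```js
// cycle through a fixed list of proxy URLs, one per request
function makeProxyRotator(proxyUrls) {
  let i = 0;
  return () => {
    const url = proxyUrls[i];
    i = (i + 1) % proxyUrls.length;
    return url;
  };
}

// usage sketch with axios + https-proxy-agent:
// const next = makeProxyRotator([
//   'http://user:pass@proxy1.example.com:8000',
//   'http://user:pass@proxy2.example.com:8000',
// ]);
// const agent = new HttpsProxyAgent(next());
// await axios.get(url, { httpsAgent: agent, proxy: false });
```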
puppeteer with a proxy:
```js
const browser = await puppeteer.launch({
  args: ['--proxy-server=http://proxy.example.com:8000'],
});
const page = await browser.newPage();
await page.authenticate({ username: 'user', password: 'pass' });
```
proxy auth in puppeteer is page-level, not browser-level. call page.authenticate before page.goto.
for a deeper walkthrough on rotation strategies, see rotating proxies and unlimited bandwidth.
avoid common anti-bot triggers
generic blocks happen because your scraper looks nothing like a real browser. fix the obvious tells first.
set a real user agent. axios defaults to axios/1.x which screams bot. rotate through 3-5 modern Chrome strings.
```js
const headers = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
  'Accept-Language': 'en-US,en;q=0.9',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
};

await axios.get(url, { headers });
```
add randomized delays between requests. 1-3 seconds is enough for most sites. use await new Promise(r => setTimeout(r, 1500 + Math.random() * 1500)).
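both habits fit in a couple of tiny helpers. a sketch (the UA strings are examples you should refresh periodically; the helper names are ours):

```js
// a small pool of modern Chrome user-agent strings to rotate through
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
];

function randomUserAgent() {
  return USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
}

// base 1000ms plus up to 2000ms of jitter -> the 1-3s range suggested above
function jitteredDelayMs(baseMs = 1000, jitterMs = 2000) {
  return baseMs + Math.random() * jitterMs;
}

// usage sketch between requests:
// await new Promise((r) => setTimeout(r, jitteredDelayMs()));
// await axios.get(url, { headers: { 'User-Agent': randomUserAgent() } });
```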
for puppeteer, install puppeteer-extra-plugin-stealth. it patches the navigator.webdriver flag and dozens of other fingerprint tells in one line.
```js
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());
```
stealth alone won’t beat enterprise anti-bot like Cloudflare, DataDome, or Akamai. for those, you need residential proxies plus stealth plus careful request timing. for tougher targets, also see how Python scrapers handle this.
handle errors and retries
network errors are normal. wrap requests in retry logic so a single 503 doesn’t kill your job.
```js
async function fetchWithRetry(url, retries = 3) {
  for (let i = 0; i < retries; i++) {
    try {
      return await axios.get(url, { timeout: 15000 });
    } catch (e) {
      if (i === retries - 1) throw e;
      await new Promise((r) => setTimeout(r, 2 ** i * 1000));
    }
  }
}
```
exponential backoff (1s, 2s, 4s) is the standard. for 429 responses, respect the Retry-After header if the server sets it.
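honoring Retry-After is one extra check before computing the backoff. a sketch of the delay calculation (this handles the numeric-seconds form of the header; Retry-After can also be an HTTP date, which this sketch falls through on):

```js
// delay before retry `attempt` (0-based): honor a numeric Retry-After
// header if present, otherwise exponential backoff (1s, 2s, 4s, ...)
function retryDelayMs(attempt, headers = {}) {
  const retryAfter = Number(headers['retry-after']);
  if (Number.isFinite(retryAfter) && retryAfter > 0) {
    return retryAfter * 1000; // server told us exactly how long to wait
  }
  return 2 ** attempt * 1000;
}
```

to plug it into a retry loop, pass the failed response's headers when you have them, e.g. retryDelayMs(i, e.response?.headers ?? {}).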
log everything. failed scrapes are signal, not noise. a sudden spike in 403s means your IP got flagged or the site changed defenses.
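a minimal status tally is enough to make those spikes visible. a sketch (makeStatusTally is our helper name, not a library API):

```js
// count responses by status code so a spike in 403s or 429s is easy to spot
function makeStatusTally() {
  const counts = {};
  return {
    record(status) {
      counts[status] = (counts[status] || 0) + 1;
    },
    snapshot() {
      return { ...counts }; // copy, so callers can't mutate internal state
    },
  };
}

// usage sketch:
// const tally = makeStatusTally();
// tally.record(res.status);              // on success
// tally.record(e.response?.status ?? 0); // on axios errors (0 = network-level)
// console.log(tally.snapshot());
```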
save data
for small jobs, write JSON.
```js
const fs = require('fs');

fs.writeFileSync('books.json', JSON.stringify(books, null, 2));
```
for thousands of rows, stream to a CSV or a database. the csv-stringify package handles escaping. for Postgres, use pg with batch inserts of 500-1000 rows per transaction. for cloud, push to S3 or BigQuery.
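csv-stringify handles the escaping for you; if you want to see what correct escaping actually involves, here's a stdlib-only sketch of the quoting rules (toCsvRow is our helper name):

```js
// RFC 4180-style escaping: wrap a field in quotes if it contains a
// comma, quote, or newline, and double any embedded quotes
function toCsvRow(values) {
  return values
    .map((v) => {
      const s = String(v ?? '');
      return /[",\n\r]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
    })
    .join(',');
}

// usage sketch:
// const lines = [toCsvRow(['title', 'price']),
//                ...books.map((b) => toCsvRow([b.title, b.price]))];
// fs.writeFileSync('books.csv', lines.join('\n'));
```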
never store credentials in your scraper code. use dotenv (npm install dotenv) and a .env file that’s gitignored.
faq
is web scraping with Node.js faster than Python?
for I/O-bound concurrent scraping, yes. Node.js handles thousands of simultaneous connections on a single thread thanks to its event loop. Python needs asyncio or threading to match it, and even then the GIL limits CPU work.
do I need puppeteer or is axios enough?
axios + cheerio is enough for any page that returns full HTML on the first request. open the target site with JavaScript disabled in your browser. if the content you want is still visible, axios works. if the page goes blank, you need puppeteer.
what’s the difference between puppeteer and playwright?
playwright was built at Microsoft by much of the original puppeteer team; it adds multi-browser support (Chromium, Firefox, WebKit) and better auto-waiting. puppeteer primarily drives Chrome but has tighter Chrome DevTools integration. both work for scraping. our comparison guide breaks down which to pick.
can I run Node.js scrapers serverless?
yes for axios + cheerio (small footprint, fits in Lambda or Vercel functions). puppeteer is harder because Chrome binaries are 200MB+. use puppeteer-core with @sparticuz/chromium (the maintained successor to chrome-aws-lambda), or run puppeteer on a small VM instead.
how do I avoid getting blocked?
rotate residential or mobile proxies, randomize user agents, add 1-3 second delays between requests, and use stealth plugins for puppeteer. for tough targets like Cloudflare or DataDome, you’ll also need to defeat TLS fingerprinting. for any of those defenses, residential IPs are non-negotiable.
is web scraping legal in 2026?
scraping public data is generally legal in the US after the hiQ v. LinkedIn rulings, but terms of service violations and CFAA risk still exist. EU GDPR adds personal-data restrictions. always check the target site’s robots.txt and ToS, and consult a lawyer for commercial use.
conclusion
Node.js gives you a complete scraping stack in three packages. axios for fast HTTP, cheerio for HTML parsing, puppeteer for JavaScript-heavy pages. add a residential proxy pool, stealth plugins, and retry logic, and you have a production scraper.
start with the static stack (axios + cheerio) and only add puppeteer when you actually need it. headless Chrome is one to two orders of magnitude slower per page than a plain HTTP request and burns far more proxy bandwidth. the cheapest reliable scraper is the one that does the least work per page.
if your scraping needs grow, look at distributed runners like Apify, BullMQ for queues, and managed scraping APIs that handle proxies and CAPTCHAs for you. but for most jobs, the three libraries in this guide will get you 90% of the way there.