Web Scraping with JavaScript/Node.js: Full Tutorial
JavaScript isn’t just for building websites — it’s one of the most powerful languages for scraping them too. With Node.js running server-side JavaScript, you get access to an ecosystem of libraries purpose-built for extracting data from the web. Whether you need to parse static HTML or render JavaScript-heavy single-page applications, Node.js has you covered.
This tutorial walks you through every approach to web scraping with JavaScript, from lightweight HTTP requests to full browser automation. By the end, you’ll know which tool to use for any scraping scenario and have working code you can adapt to your projects.
Prerequisites
Before starting, make sure you have:
- Node.js 18+ installed (download from nodejs.org)
- npm (comes with Node.js)
- A code editor (VS Code recommended)
- Basic JavaScript/ES6 knowledge (async/await, destructuring)
- A terminal or command prompt
Verify your installation:
node --version # Should show v18.x or higher
npm --version   # Should show 9.x or higher
Project Setup
Create a new project directory and initialize it:
mkdir js-scraper
cd js-scraper
npm init -y
Add "type": "module" to your package.json to use ES module imports (this also enables the top-level await used in later examples):
{
"name": "js-scraper",
"version": "1.0.0",
"type": "module",
"dependencies": {}
}
Approach 1: Axios + Cheerio (Static Pages)
For pages that don’t require JavaScript rendering, the combination of Axios (HTTP client) and Cheerio (HTML parser) is fast and lightweight.
Installation
npm install axios cheerio
Basic Example: Scraping a Product Page
import axios from 'axios';
import * as cheerio from 'cheerio';
async function scrapeProducts(url) {
try {
const { data } = await axios.get(url, {
headers: {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}
});
const $ = cheerio.load(data);
const products = [];
$('.product-card').each((index, element) => {
const name = $(element).find('.product-title').text().trim();
const price = $(element).find('.price').text().trim();
const link = $(element).find('a').attr('href');
const rating = $(element).find('.rating').attr('data-score');
products.push({
name,
price,
link,
rating: parseFloat(rating) || null
});
});
return products;
} catch (error) {
console.error(`Error scraping ${url}:`, error.message);
return [];
}
}
const products = await scrapeProducts('https://example.com/products');
console.log(JSON.stringify(products, null, 2));
Handling Pagination
async function scrapeAllPages(baseUrl, maxPages = 10) {
const allProducts = [];
for (let page = 1; page <= maxPages; page++) {
const url = `${baseUrl}?page=${page}`;
console.log(`Scraping page ${page}...`);
const { data } = await axios.get(url, {
headers: {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
});
const $ = cheerio.load(data);
const products = [];
$('.product-card').each((i, el) => {
products.push({
name: $(el).find('.title').text().trim(),
price: $(el).find('.price').text().trim()
});
});
if (products.length === 0) {
console.log('No more products found. Stopping.');
break;
}
allProducts.push(...products);
// Respectful delay between requests
await new Promise(resolve => setTimeout(resolve, 1500));
}
return allProducts;
}
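A quick usage sketch (the URL is a placeholder for your target site):
const allProducts = await scrapeAllPages('https://example.com/products', 20);
console.log(`Collected ${allProducts.length} products`);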
Saving Data to CSV
import { writeFileSync } from 'fs';
function saveToCSV(data, filename) {
if (data.length === 0) return;
const headers = Object.keys(data[0]).join(',');
const rows = data.map(item =>
Object.values(item).map(val =>
`"${String(val).replace(/"/g, '""')}"`
).join(',')
);
const csv = [headers, ...rows].join('\n');
writeFileSync(filename, csv, 'utf-8');
console.log(`Saved ${data.length} records to ${filename}`);
}
const results = await scrapeProducts('https://example.com/products');
saveToCSV(results, 'products.csv');
Approach 2: Puppeteer (Browser Automation)
When pages load content dynamically with JavaScript, you need a real browser. Puppeteer controls headless Chrome.
Installation
npm install puppeteer
Basic Puppeteer Scraping
import puppeteer from 'puppeteer';
async function scrapeDynamicPage(url) {
const browser = await puppeteer.launch({
headless: 'new',
args: ['--no-sandbox', '--disable-setuid-sandbox']
});
const page = await browser.newPage();
// Set a realistic viewport and user agent
await page.setViewport({ width: 1920, height: 1080 });
await page.setUserAgent(
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
);
await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });
// Wait for dynamic content to load
await page.waitForSelector('.product-list', { timeout: 10000 });
// Extract data from the rendered page
const products = await page.evaluate(() => {
const items = document.querySelectorAll('.product-card');
return Array.from(items).map(item => ({
name: item.querySelector('.title')?.textContent?.trim(),
price: item.querySelector('.price')?.textContent?.trim(),
image: item.querySelector('img')?.src,
available: !item.classList.contains('out-of-stock')
}));
});
await browser.close();
return products;
}
const results = await scrapeDynamicPage('https://example.com/spa-products');
console.log(results);
Handling Infinite Scroll
async function scrapeInfiniteScroll(url, maxScrolls = 10) {
const browser = await puppeteer.launch({ headless: 'new' });
const page = await browser.newPage();
await page.goto(url, { waitUntil: 'networkidle2' });
let previousHeight = 0;
let scrollCount = 0;
while (scrollCount < maxScrolls) {
// Scroll to bottom
await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
// Wait for new content to load
await new Promise(resolve => setTimeout(resolve, 2000));
const currentHeight = await page.evaluate(() => document.body.scrollHeight);
if (currentHeight === previousHeight) {
console.log('Reached end of content');
break;
}
previousHeight = currentHeight;
scrollCount++;
console.log(`Scroll ${scrollCount}/${maxScrolls}`);
}
// Extract all loaded items
const items = await page.evaluate(() => {
return Array.from(document.querySelectorAll('.feed-item')).map(item => ({
title: item.querySelector('h3')?.textContent?.trim(),
description: item.querySelector('p')?.textContent?.trim(),
link: item.querySelector('a')?.href
}));
});
await browser.close();
return items;
}
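Usage mirrors the earlier examples (placeholder URL):
const feedItems = await scrapeInfiniteScroll('https://example.com/feed', 15);
console.log(`Collected ${feedItems.length} feed items`);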
Intercepting API Calls
One of the most efficient Puppeteer techniques is intercepting the underlying API calls that populate a page:
async function interceptAPIData(url) {
const browser = await puppeteer.launch({ headless: 'new' });
const page = await browser.newPage();
const apiResponses = [];
// Listen for API responses
page.on('response', async (response) => {
const reqUrl = response.url();
if (reqUrl.includes('/api/products') || reqUrl.includes('/graphql')) {
try {
const json = await response.json();
apiResponses.push(json);
} catch (e) {
// Not JSON, skip
}
}
});
await page.goto(url, { waitUntil: 'networkidle2' });
await browser.close();
return apiResponses;
}
Approach 3: Playwright (Modern Alternative)
Playwright offers multi-browser support (Chromium, Firefox, WebKit) and more reliable auto-waiting.
Installation
npm install playwright
Playwright Scraping Example
import { chromium } from 'playwright';
async function scrapeWithPlaywright(url) {
const browser = await chromium.launch({ headless: true });
const context = await browser.newContext({
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
viewport: { width: 1920, height: 1080 }
});
const page = await context.newPage();
await page.goto(url, { waitUntil: 'domcontentloaded' });
// Playwright's auto-waiting makes this more reliable
await page.waitForSelector('.product-grid');
const products = await page.$$eval('.product-card', cards =>
cards.map(card => ({
name: card.querySelector('.name')?.textContent?.trim(),
price: card.querySelector('.price')?.textContent?.trim(),
url: card.querySelector('a')?.href
}))
);
await browser.close();
return products;
}
Using Proxies with Node.js Scrapers
For any serious scraping project, proxies are essential to avoid IP blocks. Here’s how to integrate them with each approach.
Axios with Proxy
import axios from 'axios';
import { HttpsProxyAgent } from 'https-proxy-agent';
const proxyAgent = new HttpsProxyAgent('http://user:pass@proxy-server:8080');
const { data } = await axios.get('https://example.com', {
httpsAgent: proxyAgent,
headers: { 'User-Agent': 'Mozilla/5.0 ...' }
});
Puppeteer with Proxy
const browser = await puppeteer.launch({
headless: 'new',
args: ['--proxy-server=http://proxy-server:8080']
});
const page = await browser.newPage();
await page.authenticate({
username: 'proxy_user',
password: 'proxy_pass'
});
Playwright with Proxy
const browser = await chromium.launch({
proxy: {
server: 'http://proxy-server:8080',
username: 'proxy_user',
password: 'proxy_pass'
}
});
For production scraping, rotating residential proxies give you the best success rates. See our guide on choosing the right proxy type for your use case.
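As a minimal sketch of rotation with Axios (the proxy URLs below are hypothetical placeholders for your provider's endpoints), you can pick a different agent per request:
import axios from 'axios';
import { HttpsProxyAgent } from 'https-proxy-agent';

// Hypothetical proxy pool: replace with your provider's endpoints
const proxies = [
  'http://user:pass@proxy1.example.com:8080',
  'http://user:pass@proxy2.example.com:8080',
  'http://user:pass@proxy3.example.com:8080'
];

function randomProxyAgent() {
  const proxyUrl = proxies[Math.floor(Math.random() * proxies.length)];
  return new HttpsProxyAgent(proxyUrl);
}

// Each request goes out through a randomly chosen proxy
const { data } = await axios.get('https://example.com', {
  httpsAgent: randomProxyAgent(),
  headers: { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36' }
});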
Building a Production-Ready Scraper
Here’s a complete, robust scraper with error handling, retries, and rate limiting:
import axios from 'axios';
import * as cheerio from 'cheerio';
import { writeFileSync } from 'fs';
class WebScraper {
constructor(options = {}) {
this.maxRetries = options.maxRetries || 3;
this.delayMs = options.delayMs || 1500;
this.timeout = options.timeout || 15000;
this.results = [];
}
async delay(ms) {
return new Promise(resolve => setTimeout(resolve, ms));
}
async fetchPage(url, retries = 0) {
try {
const response = await axios.get(url, {
timeout: this.timeout,
headers: {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate, br',
'Connection': 'keep-alive'
}
});
return response.data;
} catch (error) {
if (retries < this.maxRetries) {
const backoff = Math.pow(2, retries) * 1000;
console.log(`Retry ${retries + 1}/${this.maxRetries} for ${url} in ${backoff}ms`);
await this.delay(backoff);
return this.fetchPage(url, retries + 1);
}
throw error;
}
}
parsePage(html) {
const $ = cheerio.load(html);
// Override this method for your specific scraping logic
return [];
}
async scrapeUrls(urls) {
for (const url of urls) {
try {
console.log(`Scraping: ${url}`);
const html = await this.fetchPage(url);
const data = this.parsePage(html);
this.results.push(...data);
await this.delay(this.delayMs);
} catch (error) {
console.error(`Failed to scrape ${url}: ${error.message}`);
}
}
return this.results;
}
saveJSON(filename) {
writeFileSync(filename, JSON.stringify(this.results, null, 2));
console.log(`Saved ${this.results.length} records to ${filename}`);
}
}
// Usage
const scraper = new WebScraper({ delayMs: 2000, maxRetries: 3 });
// Override parsePage for your target site
scraper.parsePage = function(html) {
const $ = cheerio.load(html);
const items = [];
$('.listing').each((i, el) => {
items.push({
title: $(el).find('h2').text().trim(),
price: $(el).find('.price').text().trim(),
location: $(el).find('.location').text().trim()
});
});
return items;
};
const urls = Array.from({ length: 5 }, (_, i) => `https://example.com/listings?page=${i + 1}`);
await scraper.scrapeUrls(urls);
scraper.saveJSON('listings.json');
Handling Common Anti-Scraping Measures
Rotating User Agents
const userAgents = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
];
function randomUserAgent() {
return userAgents[Math.floor(Math.random() * userAgents.length)];
}
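You can then attach a fresh User-Agent to each request, for example with Axios:
const { data } = await axios.get('https://example.com', {
  headers: { 'User-Agent': randomUserAgent() }
});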
Handling CAPTCHAs and Cloudflare
If you encounter Cloudflare protection, you have several options:
- Use a headless browser (Puppeteer/Playwright) instead of HTTP requests
- Use residential proxies to appear as a real user — see our residential proxy guide
- Use stealth plugins like puppeteer-extra-plugin-stealth
Install the plugin and wire it into Puppeteer:
npm install puppeteer-extra puppeteer-extra-plugin-stealth
import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';
puppeteer.use(StealthPlugin());
const browser = await puppeteer.launch({ headless: 'new' });
Choosing the Right Tool
| Feature | Axios + Cheerio | Puppeteer | Playwright |
|---|---|---|---|
| Speed | Fastest | Slow | Medium |
| JS Rendering | No | Yes | Yes |
| Memory Usage | Low | High | Medium |
| Multi-browser | N/A | Primarily Chrome | Chromium, Firefox, WebKit |
| Learning Curve | Easy | Medium | Medium |
| Best For | Static HTML | Chrome automation | Cross-browser |
Use Axios + Cheerio when pages don’t need JavaScript to render content. Use Puppeteer when you need Chrome specifically or have existing Puppeteer code. Use Playwright for new projects needing browser automation — it’s more modern and reliable.
Common Pitfalls and Troubleshooting
1. “Request failed with status code 403”
The site is blocking your request. Add proper headers, use a realistic User-Agent, or switch to a browser-based approach.
2. “Navigation timeout exceeded”
The page took too long to load. Increase the timeout or use waitUntil: 'domcontentloaded' instead of networkidle2.
3. Empty results from page.evaluate()
Make sure you’re waiting for content to load. Use waitForSelector() before extracting data.
4. Memory issues with Puppeteer
Close pages and browsers when done. For large-scale scraping, process URLs in batches and restart the browser periodically, as sketched below.
5. Rate limiting (429 errors)
Add delays between requests, rotate IP addresses with proxy rotation, and respect robots.txt.
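Expanding on point 4, here is a minimal batching sketch (the batch size and per-page logic are illustrative; swap in your own extraction code):
import puppeteer from 'puppeteer';

async function scrapeInBatches(urls, batchSize = 20) {
  const results = [];
  for (let i = 0; i < urls.length; i += batchSize) {
    // A fresh browser per batch keeps memory usage bounded
    const browser = await puppeteer.launch({ headless: 'new' });
    for (const url of urls.slice(i, i + batchSize)) {
      const page = await browser.newPage();
      try {
        await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 30000 });
        results.push({ url, title: await page.title() });
      } catch (error) {
        console.error(`Failed to scrape ${url}: ${error.message}`);
      } finally {
        await page.close(); // Release the tab even on failure
      }
    }
    await browser.close(); // Restart between batches
  }
  return results;
}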
FAQ
Is web scraping with JavaScript legal?
Web scraping public data is generally legal, but you should respect website terms of service, robots.txt rules, and applicable data protection laws like GDPR. Avoid scraping personal data without consent. See our web scraping compliance guide for details.
Should I use Puppeteer or Playwright for web scraping?
For new projects in 2026, Playwright is generally the better choice. It has better auto-waiting, multi-browser support, and more consistent behavior. Puppeteer remains excellent for Chrome-specific tasks. See our Puppeteer vs Playwright comparison for a detailed breakdown.
How fast can I scrape with Node.js?
With Axios + Cheerio, you can process hundreds of pages per minute. Browser-based scraping (Puppeteer/Playwright) is slower — typically 5-20 pages per minute depending on complexity. Always add delays between requests to be respectful.
How do I handle JavaScript-rendered content?
Use Puppeteer or Playwright to launch a real browser that executes JavaScript. Alternatively, check if the site has an API that returns the data directly — intercepting network requests is often more efficient than parsing rendered HTML.
Can I use Node.js for large-scale scraping?
Yes, but you’ll need proxy rotation, proper error handling, and potentially a distributed architecture. For enterprise-scale needs, consider using a dedicated scraping infrastructure with rotating residential proxies.
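One way to approach this with plain Node.js, as a hedged sketch: process URLs in concurrency-limited chunks. Here fetchAndParse is a stand-in for any of the per-page scrapers built earlier.
async function scrapeConcurrently(urls, concurrency = 5, delayMs = 1000) {
  const results = [];
  for (let i = 0; i < urls.length; i += concurrency) {
    const chunk = urls.slice(i, i + concurrency);
    // allSettled keeps one failed URL from aborting the whole chunk
    const settled = await Promise.allSettled(chunk.map(url => fetchAndParse(url)));
    for (const outcome of settled) {
      if (outcome.status === 'fulfilled') results.push(outcome.value);
    }
    // Pause between chunks to stay polite
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }
  return results;
}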
Next Steps
Now that you have a solid foundation in JavaScript web scraping, explore these related guides:
- Puppeteer Web Scraping: Complete Tutorial for advanced browser automation
- Cheerio.js Web Scraping Guide for deep-dive HTML parsing
- Axios + Cheerio: Lightweight Scraping for optimized static scraping
- Best Python Web Scraping Libraries to compare with the Python ecosystem
- Web Scraping Proxy Guide for scaling your scrapers with proxies
Related Reading
- aiohttp + BeautifulSoup: Async Python Scraping
- Axios + Cheerio: Lightweight Node.js Scraping
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)
- API vs Web Scraping: When You Need Proxies (and When You Don’t)
- ASEAN Data Protection Laws: A Web Scraping Compliance Matrix
- How to Build an Ethical Web Scraping Policy for Your Company