Web Scraping with JavaScript/Node.js: Full Tutorial
JavaScript isn’t just for building websites — it’s one of the most powerful languages for scraping them too. With Node.js running server-side JavaScript, you get access to an ecosystem of libraries purpose-built for extracting data from the web. Whether you need to parse static HTML or render JavaScript-heavy single-page applications, Node.js has you covered.
This tutorial walks you through every approach to web scraping with JavaScript, from lightweight HTTP requests to full browser automation. By the end, you’ll know which tool to use for any scraping scenario and have working code you can adapt to your projects.
Prerequisites
Before starting, make sure you have:
- Node.js 18+ installed (download from nodejs.org)
- npm (comes with Node.js)
- A code editor (VS Code recommended)
- Basic JavaScript/ES6 knowledge (async/await, destructuring)
- A terminal or command prompt
Verify your installation:
node --version # Should show v18.x or higher
npm --version   # Should show 9.x or higher
Project Setup
Create a new project directory and initialize it:
mkdir js-scraper
cd js-scraper
npm init -y
Add "type": "module" to your package.json to use ES module imports (this also enables the top-level await used in later examples):
{
"name": "js-scraper",
"version": "1.0.0",
"type": "module",
"dependencies": {}
}
Approach 1: Axios + Cheerio (Static Pages)
For pages that don’t require JavaScript rendering, the combination of Axios (HTTP client) and Cheerio (HTML parser) is fast and lightweight.
Installation
npm install axios cheerio
Basic Example: Scraping a Product Page
import axios from 'axios';
import * as cheerio from 'cheerio';
async function scrapeProducts(url) {
try {
const { data } = await axios.get(url, {
headers: {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}
});
const $ = cheerio.load(data);
const products = [];
$('.product-card').each((index, element) => {
const name = $(element).find('.product-title').text().trim();
const price = $(element).find('.price').text().trim();
const link = $(element).find('a').attr('href');
const rating = $(element).find('.rating').attr('data-score');
products.push({
name,
price,
link,
rating: parseFloat(rating) || null
});
});
return products;
} catch (error) {
console.error(`Error scraping ${url}:`, error.message);
return [];
}
}
const products = await scrapeProducts('https://example.com/products');
console.log(JSON.stringify(products, null, 2));
Handling Pagination
async function scrapeAllPages(baseUrl, maxPages = 10) {
const allProducts = [];
for (let page = 1; page <= maxPages; page++) {
const url = `${baseUrl}?page=${page}`;
console.log(`Scraping page ${page}...`);
const { data } = await axios.get(url, {
headers: {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
});
const $ = cheerio.load(data);
const products = [];
$('.product-card').each((i, el) => {
products.push({
name: $(el).find('.title').text().trim(),
price: $(el).find('.price').text().trim()
});
});
if (products.length === 0) {
console.log('No more products found. Stopping.');
break;
}
allProducts.push(...products);
// Respectful delay between requests
await new Promise(resolve => setTimeout(resolve, 1500));
}
return allProducts;
}
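A quick usage sketch (the URL is a placeholder for your target site):
const allProducts = await scrapeAllPages('https://example.com/products', 20);
console.log(`Collected ${allProducts.length} products`);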
Saving Data to CSV
import { writeFileSync } from 'fs';
function saveToCSV(data, filename) {
if (data.length === 0) return;
const headers = Object.keys(data[0]).join(',');
const rows = data.map(item =>
Object.values(item).map(val =>
`"${String(val).replace(/"/g, '""')}"`
).join(',')
);
const csv = [headers, ...rows].join('\n');
writeFileSync(filename, csv, 'utf-8');
console.log(`Saved ${data.length} records to ${filename}`);
}
const results = await scrapeProducts('https://example.com/products');
saveToCSV(results, 'products.csv');
Approach 2: Puppeteer (Browser Automation)
When pages load content dynamically with JavaScript, you need a real browser. Puppeteer controls headless Chrome.
Installation
npm install puppeteer
Basic Puppeteer Scraping
import puppeteer from 'puppeteer';
async function scrapeDynamicPage(url) {
const browser = await puppeteer.launch({
headless: 'new',
args: ['--no-sandbox', '--disable-setuid-sandbox']
});
const page = await browser.newPage();
// Set a realistic viewport and user agent
await page.setViewport({ width: 1920, height: 1080 });
await page.setUserAgent(
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
);
await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });
// Wait for dynamic content to load
await page.waitForSelector('.product-list', { timeout: 10000 });
// Extract data from the rendered page
const products = await page.evaluate(() => {
const items = document.querySelectorAll('.product-card');
return Array.from(items).map(item => ({
name: item.querySelector('.title')?.textContent?.trim(),
price: item.querySelector('.price')?.textContent?.trim(),
image: item.querySelector('img')?.src,
available: !item.classList.contains('out-of-stock')
}));
});
await browser.close();
return products;
}
const results = await scrapeDynamicPage('https://example.com/spa-products');
console.log(results);
Handling Infinite Scroll
async function scrapeInfiniteScroll(url, maxScrolls = 10) {
const browser = await puppeteer.launch({ headless: 'new' });
const page = await browser.newPage();
await page.goto(url, { waitUntil: 'networkidle2' });
let previousHeight = 0;
let scrollCount = 0;
while (scrollCount < maxScrolls) {
// Scroll to bottom
await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
// Wait for new content to load
await new Promise(resolve => setTimeout(resolve, 2000));
const currentHeight = await page.evaluate(() => document.body.scrollHeight);
if (currentHeight === previousHeight) {
console.log('Reached end of content');
break;
}
previousHeight = currentHeight;
scrollCount++;
console.log(`Scroll ${scrollCount}/${maxScrolls}`);
}
// Extract all loaded items
const items = await page.evaluate(() => {
return Array.from(document.querySelectorAll('.feed-item')).map(item => ({
title: item.querySelector('h3')?.textContent?.trim(),
description: item.querySelector('p')?.textContent?.trim(),
link: item.querySelector('a')?.href
}));
});
await browser.close();
return items;
}
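Usage mirrors the earlier examples (placeholder URL):
const feedItems = await scrapeInfiniteScroll('https://example.com/feed', 15);
console.log(`Collected ${feedItems.length} feed items`);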
Intercepting API Calls
One of the most efficient Puppeteer techniques is intercepting the underlying API calls that populate a page:
async function interceptAPIData(url) {
const browser = await puppeteer.launch({ headless: 'new' });
const page = await browser.newPage();
const apiResponses = [];
// Listen for API responses
page.on('response', async (response) => {
const reqUrl = response.url();
if (reqUrl.includes('/api/products') || reqUrl.includes('/graphql')) {
try {
const json = await response.json();
apiResponses.push(json);
} catch (e) {
// Not JSON, skip
}
}
});
await page.goto(url, { waitUntil: 'networkidle2' });
await browser.close();
return apiResponses;
}
Approach 3: Playwright (Modern Alternative)
Playwright offers multi-browser support (Chromium, Firefox, WebKit) and more reliable auto-waiting.
Installation
npm install playwright
Playwright Scraping Example
import { chromium } from 'playwright';
async function scrapeWithPlaywright(url) {
const browser = await chromium.launch({ headless: true });
const context = await browser.newContext({
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
viewport: { width: 1920, height: 1080 }
});
const page = await context.newPage();
await page.goto(url, { waitUntil: 'domcontentloaded' });
// Playwright's auto-waiting makes this more reliable
await page.waitForSelector('.product-grid');
const products = await page.$$eval('.product-card', cards =>
cards.map(card => ({
name: card.querySelector('.name')?.textContent?.trim(),
price: card.querySelector('.price')?.textContent?.trim(),
url: card.querySelector('a')?.href
}))
);
await browser.close();
return products;
}
Using Proxies with Node.js Scrapers
For any serious scraping project, proxies are essential to avoid IP blocks. Here’s how to integrate them with each approach.
Axios with Proxy
import axios from 'axios';
import { HttpsProxyAgent } from 'https-proxy-agent';
const proxyAgent = new HttpsProxyAgent('http://user:pass@proxy-server:8080');
const { data } = await axios.get('https://example.com', {
httpsAgent: proxyAgent,
headers: { 'User-Agent': 'Mozilla/5.0 ...' }
});
Puppeteer with Proxy
const browser = await puppeteer.launch({
headless: 'new',
args: ['--proxy-server=http://proxy-server:8080']
});
const page = await browser.newPage();
await page.authenticate({
username: 'proxy_user',
password: 'proxy_pass'
});
Playwright with Proxy
const browser = await chromium.launch({
proxy: {
server: 'http://proxy-server:8080',
username: 'proxy_user',
password: 'proxy_pass'
}
});
For production scraping, rotating residential proxies give you the best success rates. See our guide on choosing the right proxy type for your use case.
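As a minimal sketch of rotation with Axios (the proxy URLs below are hypothetical placeholders for your provider's endpoints), you can pick a different agent per request:
import axios from 'axios';
import { HttpsProxyAgent } from 'https-proxy-agent';

// Hypothetical proxy pool: replace with your provider's endpoints
const proxies = [
  'http://user:pass@proxy1.example.com:8080',
  'http://user:pass@proxy2.example.com:8080',
  'http://user:pass@proxy3.example.com:8080'
];

function randomProxyAgent() {
  const proxyUrl = proxies[Math.floor(Math.random() * proxies.length)];
  return new HttpsProxyAgent(proxyUrl);
}

// Each request goes out through a randomly chosen proxy
const { data } = await axios.get('https://example.com', {
  httpsAgent: randomProxyAgent(),
  headers: { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36' }
});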
Building a Production-Ready Scraper
Here’s a complete, robust scraper with error handling, retries, and rate limiting:
import axios from 'axios';
import * as cheerio from 'cheerio';
import { writeFileSync } from 'fs';
class WebScraper {
constructor(options = {}) {
this.maxRetries = options.maxRetries || 3;
this.delayMs = options.delayMs || 1500;
this.timeout = options.timeout || 15000;
this.results = [];
}
async delay(ms) {
return new Promise(resolve => setTimeout(resolve, ms));
}
async fetchPage(url, retries = 0) {
try {
const response = await axios.get(url, {
timeout: this.timeout,
headers: {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate, br',
'Connection': 'keep-alive'
}
});
return response.data;
} catch (error) {
if (retries < this.maxRetries) {
const backoff = Math.pow(2, retries) * 1000;
console.log(`Retry ${retries + 1}/${this.maxRetries} for ${url} in ${backoff}ms`);
await this.delay(backoff);
return this.fetchPage(url, retries + 1);
}
throw error;
}
}
parsePage(html) {
const $ = cheerio.load(html);
// Override this method for your specific scraping logic
return [];
}
async scrapeUrls(urls) {
for (const url of urls) {
try {
console.log(`Scraping: ${url}`);
const html = await this.fetchPage(url);
const data = this.parsePage(html);
this.results.push(...data);
await this.delay(this.delayMs);
} catch (error) {
console.error(`Failed to scrape ${url}: ${error.message}`);
}
}
return this.results;
}
saveJSON(filename) {
writeFileSync(filename, JSON.stringify(this.results, null, 2));
console.log(`Saved ${this.results.length} records to ${filename}`);
}
}
// Usage
const scraper = new WebScraper({ delayMs: 2000, maxRetries: 3 });
// Override parsePage for your target site
scraper.parsePage = function(html) {
const $ = cheerio.load(html);
const items = [];
$('.listing').each((i, el) => {
items.push({
title: $(el).find('h2').text().trim(),
price: $(el).find('.price').text().trim(),
location: $(el).find('.location').text().trim()
});
});
return items;
};
const urls = Array.from({ length: 5 }, (_, i) => `https://example.com/listings?page=${i + 1}`);
await scraper.scrapeUrls(urls);
scraper.saveJSON('listings.json');
Handling Common Anti-Scraping Measures
Rotating User Agents
const userAgents = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
];
function randomUserAgent() {
return userAgents[Math.floor(Math.random() * userAgents.length)];
}
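You can then attach a fresh User-Agent to each request, for example with Axios:
const { data } = await axios.get('https://example.com', {
  headers: { 'User-Agent': randomUserAgent() }
});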
Handling CAPTCHAs and Cloudflare
If you encounter Cloudflare protection, you have several options:
- Use a headless browser (Puppeteer/Playwright) instead of HTTP requests
- Use residential proxies to appear as a real user — see our residential proxy guide
- Use stealth plugins like puppeteer-extra-plugin-stealth
Install the plugin and wire it into Puppeteer:
npm install puppeteer-extra puppeteer-extra-plugin-stealth
import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';
puppeteer.use(StealthPlugin());
const browser = await puppeteer.launch({ headless: 'new' });
Choosing the Right Tool
| Feature | Axios + Cheerio | Puppeteer | Playwright |
|---|---|---|---|
| Speed | Fastest | Slow | Medium |
| JS Rendering | No | Yes | Yes |
| Memory Usage | Low | High | Medium |
| Multi-browser | N/A | Primarily Chrome | Chromium, Firefox, WebKit |
| Learning Curve | Easy | Medium | Medium |
| Best For | Static HTML | Chrome automation | Cross-browser |
Use Axios + Cheerio when pages don’t need JavaScript to render content. Use Puppeteer when you need Chrome specifically or have existing Puppeteer code. Use Playwright for new projects needing browser automation — it’s more modern and reliable.
Common Pitfalls and Troubleshooting
1. “Request failed with status code 403”
The site is blocking your request. Add proper headers, use a realistic User-Agent, or switch to a browser-based approach.
2. “Navigation timeout exceeded”
The page took too long to load. Increase the timeout or use waitUntil: 'domcontentloaded' instead of networkidle2.
3. Empty results from page.evaluate()
Make sure you’re waiting for content to load. Use waitForSelector() before extracting data.
4. Memory issues with Puppeteer
Close pages and browsers when done. For large-scale scraping, process URLs in batches and restart the browser periodically, as sketched below.
5. Rate limiting (429 errors)
Add delays between requests, rotate IP addresses with proxy rotation, and respect robots.txt.
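Expanding on point 4, here is a minimal batching sketch (the batch size and per-page logic are illustrative; swap in your own extraction code):
import puppeteer from 'puppeteer';

async function scrapeInBatches(urls, batchSize = 20) {
  const results = [];
  for (let i = 0; i < urls.length; i += batchSize) {
    // A fresh browser per batch keeps memory usage bounded
    const browser = await puppeteer.launch({ headless: 'new' });
    for (const url of urls.slice(i, i + batchSize)) {
      const page = await browser.newPage();
      try {
        await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 30000 });
        results.push({ url, title: await page.title() });
      } catch (error) {
        console.error(`Failed to scrape ${url}: ${error.message}`);
      } finally {
        await page.close(); // Release the tab even on failure
      }
    }
    await browser.close(); // Restart between batches
  }
  return results;
}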
FAQ
Is web scraping with JavaScript legal?
Web scraping public data is generally legal, but you should respect website terms of service, robots.txt rules, and applicable data protection laws like GDPR. Avoid scraping personal data without consent. See our web scraping compliance guide for details.
Should I use Puppeteer or Playwright for web scraping?
For new projects in 2026, Playwright is generally the better choice. It has better auto-waiting, multi-browser support, and more consistent behavior. Puppeteer remains excellent for Chrome-specific tasks. See our Puppeteer vs Playwright comparison for a detailed breakdown.
How fast can I scrape with Node.js?
With Axios + Cheerio, you can process hundreds of pages per minute. Browser-based scraping (Puppeteer/Playwright) is slower — typically 5-20 pages per minute depending on complexity. Always add delays between requests to be respectful.
How do I handle JavaScript-rendered content?
Use Puppeteer or Playwright to launch a real browser that executes JavaScript. Alternatively, check if the site has an API that returns the data directly — intercepting network requests is often more efficient than parsing rendered HTML.
Can I use Node.js for large-scale scraping?
Yes, but you’ll need proxy rotation, proper error handling, and potentially a distributed architecture. For enterprise-scale needs, consider using a dedicated scraping infrastructure with rotating residential proxies.
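One way to approach this with plain Node.js, as a hedged sketch: process URLs in concurrency-limited chunks. Here fetchAndParse is a stand-in for any of the per-page scrapers built earlier.
async function scrapeConcurrently(urls, concurrency = 5, delayMs = 1000) {
  const results = [];
  for (let i = 0; i < urls.length; i += concurrency) {
    const chunk = urls.slice(i, i + concurrency);
    // allSettled keeps one failed URL from aborting the whole chunk
    const settled = await Promise.allSettled(chunk.map(url => fetchAndParse(url)));
    for (const outcome of settled) {
      if (outcome.status === 'fulfilled') results.push(outcome.value);
    }
    // Pause between chunks to stay polite
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }
  return results;
}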
Next Steps
Now that you have a solid foundation in JavaScript web scraping, explore these related guides:
- Puppeteer Web Scraping: Complete Tutorial for advanced browser automation
- Cheerio.js Web Scraping Guide for deep-dive HTML parsing
- Axios + Cheerio: Lightweight Scraping for optimized static scraping
- Best Python Web Scraping Libraries to compare with the Python ecosystem
- Web Scraping Proxy Guide for scaling your scrapers with proxies
Related Reading
- aiohttp + BeautifulSoup: Async Python Scraping
- Axios + Cheerio: Lightweight Node.js Scraping
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)
- API vs Web Scraping: When You Need Proxies (and When You Don’t)
- ASEAN Data Protection Laws: A Web Scraping Compliance Matrix
- How to Build an Ethical Web Scraping Policy for Your Company