n8n Web Scraping: Complete Workflow Guide

n8n (pronounced “n-eight-n”) is an open-source workflow automation platform that has become a favorite among developers and no-code builders for its flexibility and self-hosting capability. While it’s known for connecting apps and automating business processes, n8n is also a powerful tool for building web scraping and data collection pipelines.

What makes n8n compelling for scraping is the visual workflow builder — you can chain together HTTP requests, HTML parsing, data transformation, and output nodes without writing a traditional scraper script. And when you need more power, you can add custom JavaScript or Python code nodes, integrate with AI scraping tools like Firecrawl, or connect to proxy services for reliable data collection.

Why Use n8n for Web Scraping?

Advantages Over Traditional Scripts

  • Visual workflows: See the entire scraping pipeline at a glance
  • No-code friendly: Build scrapers without programming (with code options when needed)
  • Built-in scheduling: Schedule scraping jobs with cron triggers
  • 200+ integrations: Send data directly to Google Sheets, databases, Slack, etc.
  • Error handling: Built-in retry logic and error workflows
  • Self-hostable: Run on your own servers for full control
  • Community workflows: Import pre-built scraping workflows from the community

When n8n Is the Right Choice

  • You want a visual, maintainable scraping pipeline
  • You need to scrape data and immediately process or store it somewhere
  • Your scraping needs are moderate (not millions of pages)
  • You’re already using n8n for other automations
  • You want scheduling without writing cron jobs

When to Use Something Else

  • You need to scrape millions of pages (use Crawl4ai or Scrapy)
  • You need real-time, sub-second scraping
  • Your scraping requires complex browser interactions (use Browser Use)

Setting Up n8n

Option 1: n8n Cloud (Easiest)

Sign up at n8n.io for a managed instance. Free tier includes 5 workflows with 50 executions per day.

Option 2: Self-Hosted with Docker

docker run -it --rm \
  --name n8n \
  -p 5678:5678 \
  -v n8n_data:/home/node/.n8n \
  docker.n8n.io/n8nio/n8n

Access n8n at http://localhost:5678.

Option 3: Self-Hosted with npm

npm install -g n8n
n8n start

Basic Scraping Workflow

Workflow: Fetch a Web Page

The simplest scraping workflow uses the HTTP Request node:

  1. Add a Manual Trigger node
  2. Add an HTTP Request node
  3. Configure the URL and method

HTTP Request node settings:

  • Method: GET
  • URL: https://example.com/page-to-scrape
  • Response Format: String (to get raw HTML)
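
With Response Format set to String, the raw HTML lands in the item's data field, which is how later examples in this guide reference it:

// In an expression field of a downstream node
{{ $json.data }}

// In a Code node
const html = $input.first().json.data;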

Extracting Data with the HTML Node

After fetching the page, use the HTML node (formerly HTML Extract) to pull specific data:

HTML node settings:

  • Source Data: From previous node
  • Extraction Values:
      • Key: title | CSS Selector: h1 | Return: Text
      • Key: price | CSS Selector: .price | Return: Text
      • Key: description | CSS Selector: .product-description | Return: Text

Complete Basic Workflow

[Manual Trigger] → [HTTP Request] → [HTML Extract] → [Google Sheets]

This workflow fetches a page, extracts data, and saves it to a Google Sheet — no code required.
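
A successful run produces one item with the extracted fields, along these lines (values illustrative):

[
  {
    "title": "Example Product",
    "price": "$19.99",
    "description": "A short product description."
  }
]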

Parsing HTML Content

Using CSS Selectors

The HTML node supports standard CSS selectors:

h1                    → First h1 heading
.product-card         → Elements with class "product-card"
#main-content p       → Paragraphs inside #main-content
a[href^="https"]      → Links starting with https
table tr td:nth-child(2) → Second column of table rows

Extracting Multiple Items

To extract a list of items (e.g., all products on a page):

  1. Set the HTML node to Extract multiple values
  2. Use a base selector for the repeating container (e.g., .product-card)
  3. Define field selectors relative to the container

Using the Code Node for Complex Parsing

When CSS selectors aren’t enough, use a Code node with JavaScript:

// Code node - Parse complex HTML
// Requires self-hosted n8n with cheerio installed and allow-listed
// via NODE_FUNCTION_ALLOW_EXTERNAL=cheerio
const cheerio = require('cheerio');
const html = $input.first().json.data;
// Use a distinct name to avoid shadowing n8n's built-in $ helper
const $page = cheerio.load(html);

const products = [];
$page('.product-card').each((i, el) => {
  products.push({
    name: $page(el).find('h3').text().trim(),
    price: parseFloat($page(el).find('.price').text().replace('$', '')),
    rating: parseFloat($page(el).find('.rating').attr('data-value')),
    url: $page(el).find('a').attr('href'),
  });
});

return products.map(p => ({ json: p }));

Scraping Multiple Pages

Using the Split In Batches Node

To scrape a list of URLs:

[Manual Trigger] → [Set URLs] → [Split In Batches] → [HTTP Request] → [HTML Extract] → [Merge]

  1. Set node: Define an array of URLs
  2. Split In Batches: Process 1-5 URLs at a time
  3. HTTP Request: Fetch each URL
  4. HTML Extract: Parse each page
  5. Merge: Combine all results

URL List Example

// Code node - Generate URL list
const urls = [];
for (let i = 1; i <= 20; i++) {
  urls.push({
    json: {
      url: `https://example.com/category/page/${i}`
    }
  });
}
return urls;

Handling Pagination

Auto-Pagination with Loop

[Trigger] → [Set Page=1] → [HTTP Request] → [HTML Extract] → [Check Has Next] → [Increment Page] → (loop back to HTTP Request)

The "Check Has Next" step is a Code node:

// Code node - Check for next page
const items = $input.all();
const hasProducts = items.length > 0 && items[0].json.products?.length > 0;
const currentPage = $('Set Page').first().json.page;

if (hasProducts && currentPage < 50) {
  return [{ json: { page: currentPage + 1, continue: true } }];
} else {
  return [{ json: { continue: false } }];
}
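
To close the loop, route this output through an IF node on {{ $json.continue }} and connect the true branch back to the HTTP Request node; n8n keeps re-running that path until continue is false or the 50-page cap is reached.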

Using Wait Nodes for Rate Limiting

Add a Wait node between requests to avoid overwhelming the target server:

[HTTP Request] → [Wait 2s] → [Next Request]
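
If you want jitter rather than a fixed interval, a Code node can sleep for a randomized delay instead (a minimal sketch; the Wait node is the simpler default):

// Code node - Random delay between 1 and 3 seconds
const delayMs = 1000 + Math.floor(Math.random() * 2000);
await new Promise(resolve => setTimeout(resolve, delayMs));
return $input.all();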

JavaScript-Rendered Pages

Standard HTTP requests can’t handle JavaScript-rendered pages. You have several options in n8n:

Option 1: Use a Headless Browser Service

Services like ScrapingBee or Browserless handle rendering:

// HTTP Request node
// URL: https://app.scrapingbee.com/api/v1/
// Query Parameters:
//   api_key: your-api-key
//   url: https://target-site.com
//   render_js: true

Option 2: Use Firecrawl

Firecrawl handles JavaScript rendering and returns clean markdown. See our n8n + Firecrawl integration guide for the complete setup.

Option 3: Use a Code Node with Puppeteer

For self-hosted n8n with Puppeteer installed:

// Code node - Render a JavaScript-heavy page
// Requires puppeteer installed in the n8n environment and
// allow-listed via NODE_FUNCTION_ALLOW_EXTERNAL=puppeteer
const puppeteer = require('puppeteer');

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.goto('https://example.com/spa', { waitUntil: 'networkidle0' });
const html = await page.content();
await browser.close();

return [{ json: { html } }];

Using Proxies in n8n

For reliable scraping, especially at scale, proxies prevent IP blocks.

HTTP Request with Proxy

In the HTTP Request node, configure the proxy under Options:

  • Proxy URL: http://username:password@proxy-server:8080

Rotating Proxies

Use a code node to select proxies:

// Code node - Select random proxy
const proxies = [
  'http://user:pass@proxy1:8080',
  'http://user:pass@proxy2:8080',
  'http://user:pass@proxy3:8080',
];

const proxy = proxies[Math.floor(Math.random() * proxies.length)];

return [{
  json: {
    ...($input.first().json),
    proxy
  }
}];

Then reference {{ $json.proxy }} in the HTTP Request node’s proxy settings.

Using Residential Proxies

For scraping protected sites, residential proxies provide the best success rates. Most providers offer a single rotating endpoint:

http://customer-id:password@gate.provider.com:7777

Configure this as your proxy URL in the HTTP Request node, and each request automatically uses a different IP.

Scheduling Scraping Jobs

Cron Trigger

Replace the Manual Trigger with a Schedule Trigger:

  • Every hour: 0 * * * *
  • Daily at 9 AM: 0 9 * * *
  • Every Monday at midnight: 0 0 * * 1
  • Every 15 minutes: */15 * * * *

Webhook Trigger

Trigger scraping from external events:

[Webhook] → [HTTP Request] → [Parse] → [Store]

Call the webhook URL from any application to trigger the scraping workflow.
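
For example, from Node.js (the path after /webhook/ is whatever you configure on the Webhook node; scrape is hypothetical here):

// Trigger the scraping workflow from an external application
const res = await fetch('https://your-n8n-host/webhook/scrape', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ url: 'https://example.com/page-to-scrape' }),
});
console.log(await res.json());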

Storing Scraped Data

Google Sheets

The Google Sheets node can append rows directly:

[HTML Extract] → [Google Sheets: Append Row]

PostgreSQL / MySQL

[HTML Extract] → [Postgres: Insert]

Airtable

[HTML Extract] → [Airtable: Create Record]

JSON File

// Code node - Save to file (self-hosted only)
// Built-in modules must be allow-listed, e.g. NODE_FUNCTION_ALLOW_BUILTIN=fs
const fs = require('fs');
const data = $input.all().map(item => item.json);
fs.writeFileSync('/data/scrape_results.json', JSON.stringify(data, null, 2));
return [{ json: { saved: true, count: data.length } }];

AI-Enhanced Scraping with n8n

Using OpenAI for Data Extraction

Add an OpenAI node after fetching HTML to extract structured data:

[HTTP Request] → [OpenAI: Chat] → [Parse JSON] → [Store]

OpenAI node prompt:

Extract the following from this web page HTML:
- Product name
- Price
- Rating
- Availability

Return as JSON array.

HTML: {{ $json.data }}
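
Models often wrap JSON in markdown fences, so make the Parse JSON step defensive. A sketch in a Code node, assuming the completion arrives in message.content (the exact property depends on your OpenAI node version):

// Code node - Parse the model's JSON output
const raw = $input.first().json.message?.content ?? '';
// Strip markdown code fences the model may add
const cleaned = raw.replace(/^```(json)?|```$/gm, '').trim();
const products = JSON.parse(cleaned);
return products.map(p => ({ json: p }));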

Using the AI Agent Node

n8n’s built-in AI Agent node can handle complex extraction tasks:

[HTTP Request] → [AI Agent] → [Store]

Configure the AI Agent with tools for parsing, searching, and extracting.

Error Handling & Retries

Built-in Retry

On the HTTP Request node:

  • On Error: Continue (returns error info)
  • Retry On Fail: Yes
  • Max Retries: 3
  • Wait Between Retries: 2000ms

Error Workflow

Create a separate error handling workflow:

[Error Trigger] → [Slack: Send Message] → [Log Error]

Custom Error Handling

// Code node - Handle errors
const items = $input.all();
const successful = [];
const failed = [];

for (const item of items) {
  if (item.json.error) {
    failed.push(item);
  } else {
    successful.push(item);
  }
}

// The Code node has a single output, so pass everything through
// and route on {{ $json.error }} with a downstream IF node
return [...successful, ...failed];

Real-World Workflow Examples

Price Monitor

[Schedule: Daily 9AM] → [HTTP Request: Product Page] → [HTML: Extract Price]
→ [Compare with Previous] → [IF Price Changed] → [Slack: Notify] + [Google Sheets: Log]
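
The Compare with Previous step can be a small Code node. A sketch assuming the last logged price was merged into each item as previousPrice (field names hypothetical):

// Code node - Flag price changes (field names hypothetical)
const { price, previousPrice } = $input.first().json;
return [{
  json: {
    price,
    previousPrice,
    changed: price !== previousPrice,
  },
}];

An IF node on {{ $json.changed }} then routes changed prices to the Slack notification.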

Content Aggregator

[Schedule: Hourly] → [Code: Generate RSS URLs] → [Split In Batches]
→ [HTTP Request] → [HTML: Extract Articles] → [Filter New] → [Airtable: Store]
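
The Filter New step can use workflow static data, which persists between production executions (but not manual test runs). A minimal sketch, assuming each article item carries a url field:

// Code node - Keep only articles not seen in earlier runs
const staticData = $getWorkflowStaticData('global');
staticData.seenUrls = staticData.seenUrls || [];

const fresh = $input.all().filter(
  item => !staticData.seenUrls.includes(item.json.url)
);

// Remember the new URLs for the next execution
staticData.seenUrls.push(...fresh.map(item => item.json.url));

return fresh;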

Competitor Monitoring

[Schedule: Weekly] → [HTTP Request: Competitor Pages] → [AI: Extract Data]
→ [Compare with Last Week] → [Generate Report] → [Email: Send Report]

FAQ

Is n8n free for web scraping?

n8n’s open-source version (self-hosted) is free with no execution limits. The cloud version has a free tier with 50 executions per day. For serious scraping, self-hosting is recommended as there are no usage restrictions.

Can n8n handle JavaScript-rendered pages?

Not natively with the HTTP Request node. You need to use an external service (ScrapingBee, Browserless) or integrate with Firecrawl. Self-hosted n8n with custom Docker images can include Puppeteer for direct browser rendering.

How many pages can n8n scrape?

There’s no hard limit for self-hosted n8n. Cloud instances have execution limits based on your plan. For very large-scale scraping (millions of pages), dedicated tools like Crawl4ai or Scrapy are more appropriate. n8n works best for moderate-scale, scheduled scraping.

Do I need proxies with n8n scraping?

For small-scale personal projects, you may not. For anything scraping more than a few dozen pages per day, or scraping sites with anti-bot protection, residential proxies are recommended. See our proxy guide for details.

Can I sell my n8n scraping workflows?

You can share and sell workflow templates, but be aware of the data collection legalities. Ensure your workflows comply with the target site’s terms of service and applicable laws. See our web scraping compliance guide for more information.

