n8n Web Scraping: Complete Workflow Guide
n8n (pronounced “n-eight-n”) is an open-source workflow automation platform that has become a favorite among developers and no-code builders for its flexibility and self-hosting capability. While it’s known for connecting apps and automating business processes, n8n is also a powerful tool for building web scraping and data collection pipelines.
What makes n8n compelling for scraping is the visual workflow builder — you can chain together HTTP requests, HTML parsing, data transformation, and output nodes without writing a traditional scraper script. And when you need more power, you can add custom JavaScript or Python code nodes, integrate with AI scraping tools like Firecrawl, or connect to proxy services for reliable data collection.
Table of Contents
- Why Use n8n for Web Scraping?
- Setting Up n8n
- Basic Scraping Workflow
- Parsing HTML Content
- Scraping Multiple Pages
- Handling Pagination
- JavaScript-Rendered Pages
- Using Proxies in n8n
- Scheduling Scraping Jobs
- Storing Scraped Data
- AI-Enhanced Scraping with n8n
- Error Handling & Retries
- Real-World Workflow Examples
- FAQ
Why Use n8n for Web Scraping?
Advantages Over Traditional Scripts
| Advantage | Description |
|---|---|
| Visual workflows | See the entire scraping pipeline at a glance |
| No-code friendly | Build scrapers without programming (with code options when needed) |
| Built-in scheduling | Schedule scraping jobs with cron triggers |
| 200+ integrations | Send data directly to Google Sheets, databases, Slack, etc. |
| Error handling | Built-in retry logic and error workflows |
| Self-hostable | Run on your own servers for full control |
| Community workflows | Import pre-built scraping workflows from the community |
When n8n Is the Right Choice
- You want a visual, maintainable scraping pipeline
- You need to scrape data and immediately process or store it somewhere
- Your scraping needs are moderate (not millions of pages)
- You’re already using n8n for other automations
- You want scheduling without writing cron jobs
When to Use Something Else
- You need to scrape millions of pages (use Crawl4ai or Scrapy)
- You need real-time, sub-second scraping
- Your scraping requires complex browser interactions (use Browser Use)
Setting Up n8n
Option 1: n8n Cloud (Easiest)
Sign up at n8n.io for a managed instance. Free tier includes 5 workflows with 50 executions per day.
Option 2: Self-Hosted with Docker
```bash
docker run -it --rm \
  --name n8n \
  -p 5678:5678 \
  -v n8n_data:/home/node/.n8n \
  docker.n8n.io/n8nio/n8n
```

Access n8n at http://localhost:5678.
Option 3: Self-Hosted with npm
```bash
npm install -g n8n
n8n start
```

Basic Scraping Workflow
Workflow: Fetch a Web Page
The simplest scraping workflow uses the HTTP Request node:
- Add a Manual Trigger node
- Add an HTTP Request node
- Configure the URL and method
HTTP Request node settings:
- Method: GET
- URL: `https://example.com/page-to-scrape`
- Response Format: String (to get raw HTML)
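The raw page body is what every downstream node parses. With Response Format set to String, it arrives on the item's `data` field, which a Code node can sanity-check (a minimal sketch, not required for the workflow):

```javascript
// Code node - confirm the raw HTML arrived
// (with Response Format: String, the page body is on json.data)
const html = $input.first().json.data;
return [{ json: { length: html.length, looksLikeHtml: html.includes('<') } }];
```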
Extracting Data with the HTML Node
After fetching the page, use the HTML node (formerly HTML Extract) to pull specific data:
HTML node settings:
- Source Data: From previous node
- Extraction Values:
  - Key: `title` | CSS Selector: `h1` | Return: Text
  - Key: `price` | CSS Selector: `.price` | Return: Text
  - Key: `description` | CSS Selector: `.product-description` | Return: Text
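With these extraction values, each item the node emits looks roughly like this (values are illustrative):

```json
{
  "title": "Example Product",
  "price": "$19.99",
  "description": "A short product description."
}
```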
Complete Basic Workflow
```
[Manual Trigger] → [HTTP Request] → [HTML Extract] → [Google Sheets]
```

This workflow fetches a page, extracts data, and saves it to a Google Sheet — no code required.
Parsing HTML Content
Using CSS Selectors
The HTML node supports standard CSS selectors:
```
h1                        → First h1 heading
.product-card             → Elements with class "product-card"
#main-content p           → Paragraphs inside #main-content
a[href^="https"]          → Links starting with https
table tr td:nth-child(2)  → Second column of table rows
```

Extracting Multiple Items
To extract a list of items (e.g., all products on a page):
- Set the HTML node to Extract multiple values
- Use a base selector for the repeating container (e.g., `.product-card`)
- Define field selectors relative to the container
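The node then returns one item per matched container, for example (illustrative values):

```json
[
  { "name": "Product A", "price": "$10.00" },
  { "name": "Product B", "price": "$12.50" }
]
```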
Using the Code Node for Complex Parsing
When CSS selectors aren’t enough, use a Code node with JavaScript:
```javascript
// Code node - Parse complex HTML
// Requires cheerio to be allowed on self-hosted n8n
// (set NODE_FUNCTION_ALLOW_EXTERNAL=cheerio)
const cheerio = require('cheerio');

const html = $input.first().json.data;
const $ = cheerio.load(html);

const products = [];
$('.product-card').each((i, el) => {
  products.push({
    name: $(el).find('h3').text().trim(),
    price: parseFloat($(el).find('.price').text().replace('$', '')),
    rating: parseFloat($(el).find('.rating').attr('data-value')),
    url: $(el).find('a').attr('href'),
  });
});

return products.map(p => ({ json: p }));
```

Scraping Multiple Pages
Using the Split In Batches Node
To scrape a list of URLs:
```
[Manual Trigger] → [Set URLs] → [Split In Batches] → [HTTP Request] → [HTML Extract] → [Merge]
```

- Set node: Define an array of URLs
- Split In Batches: Process 1-5 URLs at a time
- HTTP Request: Fetch each URL
- HTML Extract: Parse each page
- Merge: Combine all results
URL List Example
```javascript
// Code node - Generate URL list
const urls = [];
for (let i = 1; i <= 20; i++) {
  urls.push({
    json: {
      url: `https://example.com/category/page/${i}`,
    },
  });
}
return urls;
```

Handling Pagination
Auto-Pagination with Loop
```
[Trigger] → [Set Page=1] → [HTTP Request] → [HTML Extract] → [Check Has Next] → [Increment Page] → (loop back to HTTP Request)
```

```javascript
// Code node - Check for next page
const items = $input.all();
const hasProducts = items.length > 0 && items[0].json.products?.length > 0;
// Read the current page number from the 'Set Page' node
const currentPage = $('Set Page').first().json.page;

if (hasProducts && currentPage < 50) {
  return [{ json: { page: currentPage + 1, continue: true } }];
} else {
  return [{ json: { continue: false } }];
}
```

Using Wait Nodes for Rate Limiting
Add a Wait node between requests to avoid overwhelming the target server:
```
[HTTP Request] → [Wait 2s] → [Next Request]
```
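If you prefer a randomized delay over a fixed Wait node, a Code node can sleep inline (a minimal sketch; the Code node supports await):

```javascript
// Code node - randomized 1-3 second delay (sketch)
const delayMs = 1000 + Math.floor(Math.random() * 2000);
await new Promise((resolve) => setTimeout(resolve, delayMs));
// Pass all items through unchanged
return $input.all();
```

JavaScript-Rendered Pages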
Standard HTTP requests can’t handle JavaScript-rendered pages. You have several options in n8n:
Option 1: Use a Headless Browser Service
Services like ScrapingBee or Browserless handle rendering:
```
// HTTP Request node
// URL: https://app.scrapingbee.com/api/v1/
// Query Parameters:
//   api_key: your-api-key
//   url: https://target-site.com
//   render_js: true
```
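The same call can also be made from a Code node if you want the HTML in-flow (a sketch, assuming n8n runs on Node 18+ where fetch is available; use your own api_key):

```javascript
// Code node - fetch a rendered page via the API above (sketch)
const params = new URLSearchParams({
  api_key: 'your-api-key',        // your ScrapingBee key
  url: 'https://target-site.com', // page to render
  render_js: 'true',
});
const response = await fetch(`https://app.scrapingbee.com/api/v1/?${params}`);
const html = await response.text();
return [{ json: { html } }];
```

Option 2: Use Firecrawl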
Firecrawl handles JavaScript rendering and returns clean markdown. See our n8n + Firecrawl integration guide for the complete setup.
Option 3: Use a Code Node with Puppeteer
For self-hosted n8n with Puppeteer installed:
```javascript
// Code node - render the page with Puppeteer
// (self-hosted n8n with Puppeteer installed)
const puppeteer = require('puppeteer');

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.goto('https://example.com/spa', { waitUntil: 'networkidle0' });
const html = await page.content();
await browser.close();

return [{ json: { html } }];
```

Using Proxies in n8n
For reliable scraping, especially at scale, proxies prevent IP blocks.
HTTP Request with Proxy
In the HTTP Request node, configure the proxy under Options:
- Proxy URL: `http://username:password@proxy-server:8080`
Rotating Proxies
Use a code node to select proxies:
```javascript
// Code node - Select random proxy
const proxies = [
  'http://user:pass@proxy1:8080',
  'http://user:pass@proxy2:8080',
  'http://user:pass@proxy3:8080',
];
const proxy = proxies[Math.floor(Math.random() * proxies.length)];

return [{
  json: {
    ...$input.first().json,
    proxy,
  },
}];
```

Then reference `{{ $json.proxy }}` in the HTTP Request node's proxy settings.
Using Residential Proxies
For scraping protected sites, residential proxies provide the best success rates. Most providers offer a single rotating endpoint:
```
http://customer-id:password@gate.provider.com:7777
```

Configure this as your proxy URL in the HTTP Request node, and each request automatically uses a different IP.
Scheduling Scraping Jobs
Cron Trigger
Replace the Manual Trigger with a Schedule Trigger:
- Every hour: `0 * * * *`
- Daily at 9 AM: `0 9 * * *`
- Every Monday at midnight: `0 0 * * 1`
- Every 15 minutes: `*/15 * * * *`
Webhook Trigger
Trigger scraping from external events:
```
[Webhook] → [HTTP Request] → [Parse] → [Store]
```

Call the webhook URL from any application to trigger the scraping workflow.
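For example, from any Node.js application (a sketch; the `scrape-products` path is hypothetical, use the production URL shown on your Webhook node):

```javascript
// Trigger the n8n scraping workflow from outside (hypothetical webhook path)
await fetch('https://your-n8n-host/webhook/scrape-products', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ url: 'https://example.com/page-to-scrape' }),
});
```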
Storing Scraped Data
Google Sheets
The Google Sheets node can append rows directly:
```
[HTML Extract] → [Google Sheets: Append Row]
```
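If the extracted fields don't match your sheet's column headers, a small Code node can reshape items first (a sketch; the column names are assumptions):

```javascript
// Code node - rename fields to match sheet columns (hypothetical names)
return $input.all().map(item => ({
  json: {
    Name: item.json.title,
    Price: item.json.price,
    'Scraped At': new Date().toISOString(),
  },
}));
```

PostgreSQL / MySQL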
```
[HTML Extract] → [Postgres: Insert]
```

Airtable
```
[HTML Extract] → [Airtable: Create Record]
```

JSON File
```javascript
// Code node - Save to file (self-hosted n8n with filesystem access)
const fs = require('fs');

const data = $input.all().map(item => item.json);
fs.writeFileSync('/data/scrape_results.json', JSON.stringify(data, null, 2));

return [{ json: { saved: true, count: data.length } }];
```

AI-Enhanced Scraping with n8n
Using OpenAI for Data Extraction
Add an OpenAI node after fetching HTML to extract structured data:
```
[HTTP Request] → [OpenAI: Chat] → [Parse JSON] → [Store]
```

OpenAI node prompt:
```
Extract the following from this web page HTML:
- Product name
- Price
- Rating
- Availability

Return as JSON array.

HTML: {{ $json.data }}
```
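The [Parse JSON] step can be a Code node that turns the model's reply into items (a sketch; the field holding the reply varies by OpenAI node version, so check your node's actual output first):

```javascript
// Code node - parse the model's JSON reply into items (sketch)
const reply = $input.first().json.message?.content ?? $input.first().json.text;
// Models often wrap JSON in markdown fences; strip them before parsing
const cleaned = reply.replace(/^`{3}(json)?\s*|`{3}\s*$/gm, '').trim();
return JSON.parse(cleaned).map(p => ({ json: p }));
```

Using the AI Agent Node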
n8n’s built-in AI Agent node can handle complex extraction tasks:
```
[HTTP Request] → [AI Agent] → [Store]
```

Configure the AI Agent with tools for parsing, searching, and extracting.
Error Handling & Retries
Built-in Retry
On the HTTP Request node:
- On Error: Continue (returns error info)
- Retry On Fail: Yes
- Max Retries: 3
- Wait Between Retries: 2000ms
Error Workflow
Create a separate error handling workflow:
```
[Error Trigger] → [Slack: Send Message] → [Log Error]
```

Custom Error Handling
```javascript
// Code node - Handle errors
const items = $input.all();
const successful = [];
const failed = [];

for (const item of items) {
  if (item.json.error) {
    failed.push(item);
  } else {
    successful.push(item);
  }
}

// Output 1: successful items, Output 2: failed items
return [successful, failed];
```

Real-World Workflow Examples
Price Monitor
```
[Schedule: Daily 9AM] → [HTTP Request: Product Page] → [HTML: Extract Price]
→ [Compare with Previous] → [IF Price Changed] → [Slack: Notify] + [Google Sheets: Log]
```
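The [Compare with Previous] step can be a Code node (a sketch; 'Get Previous Price' is a hypothetical node that reads the last logged price back from the sheet):

```javascript
// Code node - Compare with Previous (node name is hypothetical)
const current = $input.first().json.price;
const previous = $('Get Previous Price').first().json.price;

return [{
  json: { current, previous, changed: current !== previous },
}];
```

Content Aggregator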
```
[Schedule: Hourly] → [Code: Generate RSS URLs] → [Split In Batches]
→ [HTTP Request] → [HTML: Extract Articles] → [Filter New] → [Airtable: Store]
```
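[Filter New] can likewise be a Code node that drops already-stored articles (a sketch; 'Get Known URLs' is a hypothetical node that loads previously saved URLs from Airtable):

```javascript
// Code node - Filter New (node name is hypothetical)
const known = new Set($('Get Known URLs').all().map(i => i.json.url));
return $input.all().filter(item => !known.has(item.json.url));
```

Competitor Monitoring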
```
[Schedule: Weekly] → [HTTP Request: Competitor Pages] → [AI: Extract Data]
→ [Compare with Last Week] → [Generate Report] → [Email: Send Report]
```

FAQ
Is n8n free for web scraping?
n8n’s open-source version (self-hosted) is free with no execution limits. The cloud version has a free tier with 50 executions per day. For serious scraping, self-hosting is recommended as there are no usage restrictions.
Can n8n handle JavaScript-rendered pages?
Not natively with the HTTP Request node. You need to use an external service (ScrapingBee, Browserless) or integrate with Firecrawl. Self-hosted n8n with custom Docker images can include Puppeteer for direct browser rendering.
How many pages can n8n scrape?
There’s no hard limit for self-hosted n8n. Cloud instances have execution limits based on your plan. For very large-scale scraping (millions of pages), dedicated tools like Crawl4ai or Scrapy are more appropriate. n8n works best for moderate-scale, scheduled scraping.
Do I need proxies with n8n scraping?
For small-scale personal projects, you may not. For anything scraping more than a few dozen pages per day, or scraping sites with anti-bot protection, residential proxies are recommended. See our proxy guide for details.
Can I sell my n8n scraping workflows?
You can share and sell workflow templates, but be aware of the data collection legalities. Ensure your workflows comply with the target site’s terms of service and applicable laws. See our web scraping compliance guide for more information.
Related Reading
- AI Web Scraper with Python: Build Your Own
- Best AI Web Scrapers 2026: Complete Comparison
- Agentic Browsers Explained: Browserbase, Browser Use, and Proxy Infrastructure
- Agentic Browsers Explained: The Future of AI + Proxies in 2026
- How AI Agents Use Proxies for Real-Time Web Data Collection in 2026
- Mobile Proxies for AI Data Collection: Web Scraping for Training Data