n8n Web Scraping and Deep Research Automation: The Complete Guide
n8n has become one of the most popular workflow automation platforms for developers who want full control over their automations without vendor lock-in. its open-source nature, self-hosting capability, and visual workflow builder make it an excellent choice for building deep research pipelines that combine web scraping, data processing, and AI analysis.
this guide walks you through building production-ready n8n workflows that scrape data from multiple sources, route it through proxy infrastructure, process it with AI models, and deliver structured research outputs. whether you are monitoring competitors, gathering market intelligence, or building automated research assistants, these patterns will save you significant development time.
Why n8n for Web Scraping Automation
before jumping into workflows, here is why n8n works well for scraping and research automation compared to writing standalone scripts:
- visual debugging. when a scraping workflow breaks at step 7 of 15, you can see exactly where it failed and inspect the data at each node. this beats reading through logs from a monolithic Python script.
- built-in scheduling. n8n’s cron triggers let you schedule scraping jobs without setting up external schedulers or cron jobs.
- error handling at the node level. you can retry individual steps, add fallback paths, and route errors to notification channels without writing try/except blocks everywhere.
- self-hosted control. unlike cloud-only automation tools, self-hosted n8n means your scraped data never passes through third-party servers.
- community nodes. the n8n community has built nodes for proxy services, AI providers, databases, and messaging platforms that plug directly into your workflows.
Setting Up n8n for Scraping Workflows
Self-Hosted Installation
for scraping workloads, I recommend self-hosting n8n with Docker. this gives you full control over network configuration, proxy setup, and resource allocation.
# create a docker-compose.yml for n8n
mkdir n8n-scraping && cd n8n-scraping

# docker-compose.yml
version: '3.8'
services:
  n8n:
    image: n8nio/n8n:latest
    ports:
      - "5678:5678"
    environment:
      - N8N_BASIC_AUTH_ACTIVE=true
      - N8N_BASIC_AUTH_USER=admin
      - N8N_BASIC_AUTH_PASSWORD=your_secure_password
      - EXECUTIONS_TIMEOUT=600
      - EXECUTIONS_TIMEOUT_MAX=1200
      - N8N_PAYLOAD_SIZE_MAX=256
    volumes:
      - n8n_data:/home/node/.n8n
      - ./scripts:/home/node/scripts
    restart: unless-stopped
volumes:
  n8n_data:
the key settings here are EXECUTIONS_TIMEOUT (in seconds; set higher for long scraping jobs) and N8N_PAYLOAD_SIZE_MAX (in MB; increased here to handle large scraped datasets). launch the container:
docker-compose up -d
Installing Required Community Nodes
after launching n8n, install these community nodes through the UI (Settings > Community Nodes):
- n8n-nodes-puppeteer for browser-based scraping
- n8n-nodes-cheerio for HTML parsing
- @n8n/n8n-nodes-langchain for AI/LLM integration (bundled with n8n since v1.19, so recent versions do not need a separate install)
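one note on the cheerio option: you can also call cheerio directly from a Code node without the community node, but the self-hosted sandbox only allows external modules that you explicitly permit. if you go that route, add these entries to the environment section of the docker-compose.yml above:
      - NODE_FUNCTION_ALLOW_EXTERNAL=cheerio   # allow require('cheerio') in Code nodes
      - NODE_FUNCTION_ALLOW_BUILTIN=*          # optional: also allow Node.js built-in modules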
Workflow 1: Basic Web Scraping with Proxy Rotation
let us start with a foundational workflow that scrapes a list of URLs through rotating proxies and stores the results.
Workflow Structure
[Schedule Trigger] → [Read URL List] → [HTTP Request with Proxy] → [HTML Parse] → [Store Results]
Step-by-Step Setup
1. Schedule Trigger node:
set to run daily at a time when target sites have lower traffic. for ecommerce monitoring, early morning (UTC 05:00) often works well.
2. Read URL List from Google Sheets or a database:
connect a Google Sheets node or Postgres node that contains your target URLs.
3. HTTP Request node with proxy configuration:
in the HTTP Request node, configure the proxy via the node's Proxy option (found under Options in recent versions):
Proxy URL: http://username:password@proxy.provider.com:port
for rotating residential proxies, you typically get a gateway endpoint that handles rotation automatically:
http://user-zone-residential:pass@gateway.proxyservice.com:7777
4. HTML parsing with the Code node:
// n8n Code node - parse HTML response
// requires NODE_FUNCTION_ALLOW_EXTERNAL=cheerio in the n8n environment (self-hosted)
const cheerio = require('cheerio');

// the HTTP Request node stores the response body in "data" when Response Format is String
const html = $input.first().json.data;
const $html = cheerio.load(html);   // named $html to avoid shadowing n8n's built-in $ helper
const results = [];

// the selectors below are examples - adjust them to the markup of your target site
$html('.product-card').each((i, el) => {
  results.push({
    title: $html(el).find('.product-title').text().trim(),
    price: $html(el).find('.product-price').text().trim(),
    url: $html(el).find('a').attr('href'),
    scraped_at: new Date().toISOString()
  });
});

// return one n8n item per product
return results.map(item => ({ json: item }));
5. Store in a Postgres database or Google Sheets:
map the parsed fields to your storage destination; a small normalization step before the insert (sketched below) keeps the data clean.
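a minimal sketch of that normalization Code node, assuming the field names from the parsing step above (price_numeric is a new, illustrative column):
// n8n Code node - normalize parsed items before storing them
return $input.all()
  .map(item => item.json)
  .filter(item => item.title && item.url)   // drop rows where parsing failed
  .map(item => ({
    json: {
      ...item,
      // strip currency symbols so the price can be stored in a numeric column
      price_numeric: parseFloat(String(item.price).replace(/[^0-9.]/g, '')) || null
    }
  }));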
Workflow 2: Deep Research Pipeline with AI Analysis
this is where n8n truly shines. you can build a multi-stage research pipeline that gathers data from several sources and uses AI to synthesize findings.
Workflow Architecture
[Trigger: Topic Input]
↓
[Split into Research Paths]
├─→ [Google Search API] → [Scrape Top Results]
├─→ [Reddit API Search] → [Scrape Threads]
├─→ [Academic Search] → [Scrape Abstracts]
↓
[Merge All Data]
↓
[AI Analysis (Claude/GPT)]
↓
[Generate Report]
↓
[Deliver via Email/Slack]
Building the Research Trigger
use a Webhook node to accept research topics:
{
  "topic": "residential proxy market trends 2026",
  "depth": "comprehensive",
  "sources": ["web", "reddit", "academic"],
  "max_results_per_source": 10
}
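directly after the Webhook node, a small Code node can validate the payload and fill in defaults; it also lifts the fields out of the webhook's body wrapper so later nodes can read $json.topic directly. a minimal sketch, with field names matching the example payload above:
// n8n Code node - validate and normalize the incoming research request
const body = $input.first().json.body || $input.first().json;  // webhook data usually arrives under "body"
if (!body.topic || typeof body.topic !== 'string') {
  throw new Error('missing required field: topic');
}
return [{
  json: {
    topic: body.topic.trim(),
    depth: body.depth || 'standard',
    sources: Array.isArray(body.sources) ? body.sources : ['web'],
    max_results_per_source: Number(body.max_results_per_source) || 10
  }
}];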
The Google Search Scraping Path
// n8n Code node - build search URLs
const topic = $input.first().json.topic;
const encodedTopic = encodeURIComponent(topic);
// use a SERP API or scrape Google through proxies
const searchUrl = `https://serpapi.com/search.json?q=${encodedTopic}&num=10&api_key=${$env.SERPAPI_KEY}`;
return [{ json: { url: searchUrl, topic: topic } }];
after getting search results, use a Split In Batches node to scrape each result URL individually. this prevents overwhelming target servers and lets you add delays between requests.
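between the SERP API call and the Split In Batches node, a Code node can fan the response out into one item per result. a minimal sketch, assuming SerpAPI's organic_results response field:
// n8n Code node - one item per organic search result
const results = $input.first().json.organic_results || [];
return results.map(r => ({
  json: {
    url: r.link,
    title: r.title,
    snippet: r.snippet,
    source_type: 'web'
  }
}));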
The Reddit Research Path
// n8n Code node - search Reddit via API
const topic = $input.first().json.topic;
const encodedTopic = encodeURIComponent(topic);
const redditSearchUrl = `https://www.reddit.com/search.json?q=${encodedTopic}&sort=relevance&t=year&limit=10`;
return [{ json: { url: redditSearchUrl } }];
configure the HTTP Request node for Reddit with appropriate headers:
User-Agent: n8n-research-bot/1.0
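a follow-up Code node can flatten Reddit's listing response into the same item shape used by the other research paths (fields follow Reddit's public JSON listing format):
// n8n Code node - flatten Reddit search results into research items
const listing = $input.first().json;
const posts = (listing.data && listing.data.children) ? listing.data.children : [];
return posts.map(p => ({
  json: {
    title: p.data.title,
    content: p.data.selftext || '',
    url: `https://www.reddit.com${p.data.permalink}`,
    subreddit: p.data.subreddit,
    score: p.data.score,
    source_type: 'reddit'
  }
}));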
Merging and AI Analysis
after all research paths complete, use a Merge node (mode: Append) to combine all scraped content into a single dataset. then pass it to an AI node.
// n8n Code node - prepare context for AI analysis
const allData = $input.all().map(item => item.json);
const context = allData.map((item, i) => {
  return `Source ${i + 1} (${item.source_type}): ${item.title}\n${item.content}\n---`;
}).join('\n\n');
const prompt = `you are a research analyst. analyze the following sources about "${$input.first().json.topic}" and produce a structured research report with:
1. executive summary (3-5 sentences)
2. key findings (bulleted list)
3. data points and statistics mentioned
4. conflicting information or debates
5. gaps in the available information
6. recommended next steps for deeper research
sources:
${context}`;
return [{ json: { prompt: prompt, context_length: context.length } }];
connect this to an OpenAI node or HTTP Request node pointing to the Claude API:
// n8n Code node - call Claude API for analysis
// note: global fetch is only available on newer Node.js runtimes; if it is not
// exposed in your Code node sandbox, use an HTTP Request node for this call instead
const prompt = $input.first().json.prompt;
const response = await fetch('https://api.anthropic.com/v1/messages', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'x-api-key': $env.ANTHROPIC_API_KEY,
    'anthropic-version': '2023-06-01'
  },
  body: JSON.stringify({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 4096,
    messages: [{ role: 'user', content: prompt }]
  })
});
if (!response.ok) {
  throw new Error(`Claude API request failed: ${response.status}`);
}
const data = await response.json();
return [{ json: { report: data.content[0].text } }];
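the final stage in the architecture is delivery. a small Code node can shape the AI output into fields that an Email or Slack node can map directly (the subject format below is just an example):
// n8n Code node - shape the report for the delivery node
const report = $input.first().json.report || '';
return [{
  json: {
    subject: `Automated research report - ${new Date().toISOString().slice(0, 10)}`,
    body: report,
    generated_at: new Date().toISOString()
  }
}];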
Workflow 3: Competitive Intelligence Monitor
this workflow monitors competitor websites for changes and alerts you when something significant happens.
Monitoring Configuration
// n8n Code node - define monitoring targets
const competitors = [
  {
    name: "Competitor A",
    url: "https://competitor-a.com/pricing",
    selectors: {
      prices: ".pricing-card .amount",
      features: ".feature-list li",
      plans: ".plan-name"
    }
  },
  {
    name: "Competitor B",
    url: "https://competitor-b.com/pricing",
    selectors: {
      prices: ".price-value",
      features: ".features span",
      plans: ".plan-title"
    }
  }
];

return competitors.map(c => ({ json: c }));
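each competitor item then goes through an HTTP Request node (routed through your proxy, with Response Format set to String) and a Code node that applies the configured selectors. a minimal sketch, assuming cheerio is allowed via NODE_FUNCTION_ALLOW_EXTERNAL, that items come back in the same order they went in, and that the targets node above is named 'Monitoring Targets' (adjust to your actual node name):
// n8n Code node - extract fields using each competitor's selectors
const cheerio = require('cheerio');
const targets = $('Monitoring Targets').all();          // hypothetical node name for the config node above
return $input.all().map((item, i) => {
  const config = targets[i].json;
  const $page = cheerio.load(item.json.data);           // HTML body from the HTTP Request node
  const extract = sel => $page(sel).map((j, el) => $page(el).text().trim()).get();
  return {
    json: {
      name: config.name,
      current: {
        prices: extract(config.selectors.prices),
        features: extract(config.selectors.features),
        plans: extract(config.selectors.plans)
      },
      scraped_at: new Date().toISOString()
    }
  };
});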
Change Detection Logic
// n8n Code node - compare current vs previous scrape
const currentData = $input.first().json.current;
const previousData = $input.first().json.previous;
const changes = [];

// strip currency symbols so price strings like "$29/mo" can be compared numerically
const toNumber = value => parseFloat(String(value).replace(/[^0-9.]/g, ''));

// compare prices
currentData.prices.forEach((price, i) => {
  if (previousData.prices[i] && price !== previousData.prices[i]) {
    changes.push({
      type: 'price_change',
      field: `Plan ${i + 1}`,
      old_value: previousData.prices[i],
      new_value: price,
      direction: toNumber(price) > toNumber(previousData.prices[i]) ? 'increase' : 'decrease'
    });
  }
});

// compare features
const newFeatures = currentData.features.filter(f => !previousData.features.includes(f));
const removedFeatures = previousData.features.filter(f => !currentData.features.includes(f));
if (newFeatures.length > 0) {
  changes.push({ type: 'new_features', items: newFeatures });
}
if (removedFeatures.length > 0) {
  changes.push({ type: 'removed_features', items: removedFeatures });
}

return [{ json: { changes, has_changes: changes.length > 0, competitor: $input.first().json.name } }];
Alerting
connect a Switch node after change detection:
- if has_changes is true, route to the Slack/Email notification path (a message-formatting sketch follows below)
- if false, route to a No Operation node (skip alerting)
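on the alert path, a small Code node can turn the change objects into a readable message for the Slack or Email node (field names follow the change detection output above):
// n8n Code node - format detected changes into an alert message
const { competitor, changes } = $input.first().json;
const lines = changes.map(c => {
  if (c.type === 'price_change') {
    return `- ${c.field}: ${c.old_value} -> ${c.new_value} (${c.direction})`;
  }
  return `- ${c.type.replace('_', ' ')}: ${c.items.join(', ')}`;
});
return [{ json: { message: `Changes detected for ${competitor}:\n${lines.join('\n')}` } }];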
Proxy Integration Patterns for n8n
Pattern 1: Gateway Proxy (Simplest)
most proxy providers offer a single gateway endpoint that handles rotation internally.
in the HTTP Request node settings:
Proxy: http://user:pass@gate.smartproxy.com:7777
Pattern 2: Proxy Rotation via Code Node
for providers that give you a list of proxy IPs, rotate them in a Code node, then reference the selected proxy in the downstream HTTP Request node's Proxy option with an expression such as {{ $json.proxy }}:
// n8n Code node - proxy rotation
const proxies = [
  'http://user:pass@proxy1.example.com:8080',
  'http://user:pass@proxy2.example.com:8080',
  'http://user:pass@proxy3.example.com:8080',
];
const url = $input.first().json.url;
const proxyIndex = Math.floor(Math.random() * proxies.length);
const selectedProxy = proxies[proxyIndex];
return [{ json: { url, proxy: selectedProxy } }];
Pattern 3: Geo-Targeted Requests
when you need results from specific countries:
// n8n Code node - geo-targeted proxy selection
const targetCountry = $input.first().json.country || 'us';
// most proxy providers support country targeting in the username
const proxy = `http://user-country-${targetCountry}:pass@gate.provider.com:7777`;
return [{ json: { ...($input.first().json), proxy } }];
Error Handling and Retry Logic
Handling Failed Scrapes
create a separate error-handling workflow that starts with an Error Trigger node, then select it as the Error Workflow in your scraping workflow's settings. this catches failures at any node:
// n8n Code node - handle scraping errors
// the Error Trigger payload nests details under execution.error; fall back to message for safety
const payload = $input.first().json;
const errorMessage = (payload.execution && payload.execution.error && payload.execution.error.message) || payload.message || 'unknown';

let action = 'log';
let retryDelay = 0;

if (errorMessage.includes('403') || errorMessage.includes('blocked')) {
  action = 'retry_with_different_proxy';
  retryDelay = 30000; // 30 seconds
} else if (errorMessage.includes('429') || errorMessage.includes('rate limit')) {
  action = 'retry_with_delay';
  retryDelay = 60000; // 1 minute
} else if (errorMessage.includes('timeout')) {
  action = 'retry_after_short_delay';
  retryDelay = 5000; // 5 seconds
}

return [{ json: { action, retryDelay, originalError: errorMessage } }];
Rate Limiting Between Requests
use a Wait node between scraping iterations to respect rate limits:
[Split In Batches] → [HTTP Request] → [Wait 2-5 seconds] → [Process Response] → [Loop Back]
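if you want jittered delays instead of a fixed wait, a Code node can compute a random delay that the Wait node reads via an expression such as {{ $json.delay_seconds }}:
// n8n Code node - jittered delay between 2 and 5 seconds
const delaySeconds = 2 + Math.random() * 3;
return [{ json: { ...$input.first().json, delay_seconds: Math.round(delaySeconds * 10) / 10 } }];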
Performance Optimization Tips
1. Use Sub-Workflows for Reusable Scraping Logic
create a sub-workflow called “Scrape URL with Proxy” that accepts a URL and returns parsed content. call this from any main workflow to avoid duplicating scraping logic.
2. Batch Processing
instead of scraping URLs one at a time, use the Split In Batches node with a batch size of 5 to 10 (depending on your proxy plan’s concurrent connection limit).
3. Caching Responses
add a Redis or database check before scraping to see if you already have recent data for a URL:
// n8n Code node - build a cache key for the URL
const url = $input.first().json.url;
const cacheKey = `scrape_cache:${Buffer.from(url).toString('base64')}`;
const maxAge = 3600 * 24; // cache TTL in seconds (24 hours)

// pass the key to a Redis node (get operation); an IF node can then route cache hits
// straight to processing and cache misses to the scraping branch, where a Redis set
// with this TTL stores the fresh result
return [{ json: { cacheKey, url, maxAge } }];
4. Parallel Execution
n8n executes nodes sequentially within a single workflow run, but you can still parallelize: the HTTP Request node can send multiple incoming items concurrently via its Batching options, and you can fan out work to sub-workflows with the Execute Workflow node when you do not need to wait for completion. spread concurrent requests across different domains to avoid hitting the same server simultaneously.
Common Deep Research Workflow Templates
Market Research Template
[Input: Industry/Topic]
→ [Search Google, Bing, DuckDuckGo in parallel]
→ [Scrape top 20 results through proxies]
→ [Extract key data points with AI]
→ [Cross-reference statistics]
→ [Generate market brief]
→ [Email to stakeholders]
Competitor Pricing Monitor
[Cron: Every 6 hours]
→ [Load competitor URL list]
→ [Scrape pricing pages with proxies]
→ [Compare with last scrape]
→ [If changes detected: Alert via Slack]
→ [Store in database for trend analysis]
Lead Research Enrichment
[Input: Company list from CRM]
→ [Scrape company websites for key info]
→ [Search LinkedIn (via proxy) for decision makers]
→ [Scrape news mentions]
→ [AI: Generate company brief]
→ [Update CRM records]
Security Considerations
when running scraping workflows in n8n, keep these security practices in mind:
- store proxy credentials in n8n’s credential store, not in plaintext within nodes.
- use environment variables for API keys and sensitive configuration.
- restrict n8n access with authentication and network-level controls.
- log scraping activity for compliance and debugging purposes.
- respect robots.txt and terms of service for target websites (a simple automated check is sketched below).
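the robots.txt point is straightforward to automate: fetch the target site's /robots.txt with an HTTP Request node, then run a naive check in a Code node before scraping. the sketch below only handles blanket Disallow rules and ignores per-agent groups and Allow overrides, so treat it as a starting point:
// n8n Code node - naive robots.txt Disallow check for a target path
const robotsTxt = $input.first().json.data || '';       // robots.txt body from the HTTP Request node
const targetPath = $input.first().json.path || '/';     // path you intend to scrape
const disallowedRules = robotsTxt
  .split('\n')
  .map(line => line.trim())
  .filter(line => line.toLowerCase().startsWith('disallow:'))
  .map(line => line.slice('disallow:'.length).trim())
  .filter(rule => rule.length > 0);
const blocked = disallowedRules.some(rule => targetPath.startsWith(rule));
return [{ json: { path: targetPath, blocked } }];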
Conclusion
n8n transforms web scraping from isolated scripts into maintainable, visual workflows that non-technical team members can understand and modify. the combination of proxy integration, AI analysis, and workflow automation makes it possible to build research pipelines that would otherwise require a dedicated engineering team.
start with the basic scraping workflow to get comfortable with n8n’s node system, then graduate to the deep research pipeline as your needs grow. the investment in setting up proper workflows pays off quickly when you need to run the same research process repeatedly or hand it off to colleagues who are not comfortable with Python scripts.