Scrapy Proxy Middleware: Rotate Mobile Proxies for Large-Scale Crawls
Scrapy is built for scale. While libraries like requests and browser automation tools handle pages one at a time (or in small batches), Scrapy’s asynchronous architecture processes hundreds of concurrent requests through a pipeline of middlewares, spiders, and item processors. It is the standard framework for crawling millions of pages.
But Scrapy’s power creates a problem: sending hundreds of concurrent requests from a single IP address will trigger rate limits and blocks almost immediately. You need proxy rotation integrated directly into Scrapy’s request pipeline, and the built-in proxy support is minimal.
This guide covers building custom downloader middleware for proxy rotation, configuring popular proxy middleware packages, managing proxy pools with health checks, and optimizing concurrent requests when routing through mobile proxies.
Scrapy Architecture and Where Proxies Fit
Understanding Scrapy’s request flow is essential for implementing proxy rotation correctly.
The Request Pipeline
- Spider generates Request objects
- Spider Middleware processes requests (optional modifications)
- Scheduler queues requests
- Downloader Middleware processes requests before they are sent (this is where proxy rotation happens)
- Downloader sends the HTTP request
- Downloader Middleware processes the response
- Spider receives the response and extracts data
Proxy rotation belongs in the Downloader Middleware layer. Each request passes through your middleware before being sent, giving you the opportunity to assign a proxy, modify headers, and handle proxy-specific errors.
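If those hook points are unfamiliar, here is a minimal sketch of the three methods a downloader middleware can implement (the class name MyProxyMiddleware is illustrative, not part of Scrapy); the middlewares in the rest of this guide build on exactly these hooks:

```python
# A minimal sketch of the downloader-middleware hooks Scrapy calls.
class MyProxyMiddleware:
    def process_request(self, request, spider):
        # Called for every outgoing request; return None to continue normally.
        request.meta['proxy'] = 'http://user:pass@proxy-host:port'

    def process_response(self, request, response, spider):
        # Called for every response; must return a Response or a new Request.
        return response

    def process_exception(self, request, exception, spider):
        # Called on download errors; return a Request to retry, or None to
        # let other middlewares handle the exception.
        return None
```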
Built-in Proxy Support
Scrapy has minimal built-in proxy support via the HttpProxyMiddleware. It reads the proxy from the proxy meta key on the request or from environment variables:
```python
# Set the proxy via request meta
yield scrapy.Request(
    url='https://example.com',
    meta={'proxy': 'http://user:pass@proxy-host:port'}
)
```

```bash
# Or via environment variables
export http_proxy=http://user:pass@proxy-host:port
export https_proxy=http://user:pass@proxy-host:port
```

This works for a single proxy, but it does not handle rotation, health checks, or pool management. For production crawls, you need custom middleware.
Building Custom Proxy Rotation Middleware
Basic Round-Robin Middleware
```python
# middlewares.py
import itertools
import logging

logger = logging.getLogger(__name__)


class ProxyRotationMiddleware:
    def __init__(self, proxy_list):
        self.proxies = proxy_list
        self.proxy_cycle = itertools.cycle(self.proxies)

    @classmethod
    def from_crawler(cls, crawler):
        proxy_list = crawler.settings.getlist('PROXY_LIST')
        if not proxy_list:
            raise ValueError("PROXY_LIST setting is required")
        return cls(proxy_list)

    def process_request(self, request, spider):
        proxy = next(self.proxy_cycle)
        request.meta['proxy'] = proxy
        logger.debug(f"Using proxy: {proxy} for {request.url}")

    def process_exception(self, request, exception, spider):
        logger.warning(f"Proxy failed for {request.url}: {exception}")
        # Retry with the next proxy; dont_filter stops the scheduler's
        # dupefilter from discarding the rescheduled request
        request.meta['proxy'] = next(self.proxy_cycle)
        request.dont_filter = True
        return request
```

Settings Configuration
```python
# settings.py
PROXY_LIST = [
    'http://user:pass@mobile-proxy1.example.com:8080',
    'http://user:pass@mobile-proxy2.example.com:8080',
    'http://user:pass@mobile-proxy3.example.com:8080',
]

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ProxyRotationMiddleware': 350,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
}
```

The priority number matters. Your custom middleware (350) must run before Scrapy’s HttpProxyMiddleware (400), which reads the proxy meta key your middleware sets.
Advanced Middleware with Health Tracking
Production crawls need proxy health monitoring. Remove failing proxies from rotation and re-add them after a cooldown:
```python
# middlewares.py
import random
import time
import logging
from collections import defaultdict

logger = logging.getLogger(__name__)


class SmartProxyMiddleware:
    def __init__(self, proxy_list, max_failures=5, cooldown=300):
        self.all_proxies = proxy_list
        self.active_proxies = list(proxy_list)
        self.failed_proxies = {}  # proxy -> (failure_count, last_failure_time)
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.stats = defaultdict(lambda: {'success': 0, 'failure': 0})

    @classmethod
    def from_crawler(cls, crawler):
        proxy_list = crawler.settings.getlist('PROXY_LIST')
        max_failures = crawler.settings.getint('PROXY_MAX_FAILURES', 5)
        cooldown = crawler.settings.getint('PROXY_COOLDOWN', 300)
        return cls(proxy_list, max_failures, cooldown)

    def _recover_proxies(self):
        """Re-add proxies whose cooldown has expired."""
        now = time.time()
        recovered = []
        for proxy, (failures, last_time) in list(self.failed_proxies.items()):
            if now - last_time > self.cooldown:
                self.active_proxies.append(proxy)
                recovered.append(proxy)
                logger.info(f"Recovered proxy: {proxy}")
        for proxy in recovered:
            del self.failed_proxies[proxy]

    def _get_proxy(self):
        self._recover_proxies()
        if not self.active_proxies:
            logger.warning("No active proxies. Resetting all proxies.")
            self.active_proxies = list(self.all_proxies)
            self.failed_proxies.clear()
        return random.choice(self.active_proxies)

    def _mark_failure(self, proxy):
        self.stats[proxy]['failure'] += 1
        failures = self.failed_proxies.get(proxy, (0, None))[0] + 1
        self.failed_proxies[proxy] = (failures, time.time())
        if failures >= self.max_failures and proxy in self.active_proxies:
            self.active_proxies.remove(proxy)
            logger.warning(f"Removed proxy after {failures} failures: {proxy}")

    def _mark_success(self, proxy):
        self.stats[proxy]['success'] += 1
        if proxy in self.failed_proxies:
            del self.failed_proxies[proxy]

    def process_request(self, request, spider):
        proxy = self._get_proxy()
        request.meta['proxy'] = proxy
        request.meta['_proxy_used'] = proxy

    def process_response(self, request, response, spider):
        proxy = request.meta.get('_proxy_used')
        if proxy:
            if response.status in (403, 429, 503):
                self._mark_failure(proxy)
                # Retry with a different proxy, capped so a hard-blocked
                # URL cannot retry forever
                retries = request.meta.get('_proxy_retries', 0)
                if retries < 3:
                    new_request = request.copy()
                    new_request.meta['_proxy_retries'] = retries + 1
                    new_request.dont_filter = True
                    return new_request
            else:
                self._mark_success(proxy)
        return response

    def process_exception(self, request, exception, spider):
        proxy = request.meta.get('_proxy_used')
        if proxy:
            self._mark_failure(proxy)
        # Reschedule with a different proxy, using the same retry cap;
        # returning None lets Scrapy's RetryMiddleware take over
        retries = request.meta.get('_proxy_retries', 0)
        if retries < 3:
            new_request = request.copy()
            new_request.meta['_proxy_retries'] = retries + 1
            new_request.dont_filter = True
            return new_request
```

Settings for Advanced Middleware
```python
# settings.py
PROXY_LIST = [
    'http://user:pass@mobile-proxy1.example.com:8080',
    'http://user:pass@mobile-proxy2.example.com:8080',
    'http://user:pass@mobile-proxy3.example.com:8080',
]
PROXY_MAX_FAILURES = 5
PROXY_COOLDOWN = 300  # seconds

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.SmartProxyMiddleware': 350,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
}

# Retry settings
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
```

Using scrapy-rotating-proxies
If you prefer a ready-made solution, the scrapy-rotating-proxies package handles rotation and health tracking out of the box:
```bash
pip install scrapy-rotating-proxies
```

Configuration
```python
# settings.py
ROTATING_PROXY_LIST = [
    'http://user:pass@mobile-proxy1.example.com:8080',
    'http://user:pass@mobile-proxy2.example.com:8080',
    'http://user:pass@mobile-proxy3.example.com:8080',
]
# Or load from a file
# ROTATING_PROXY_LIST_PATH = '/path/to/proxy_list.txt'

ROTATING_PROXY_PAGE_RETRY_TIMES = 5
ROTATING_PROXY_CLOSE_SPIDER = False  # Don't close the spider if all proxies are dead
ROTATING_PROXY_BACKOFF_BASE = 300  # Backoff time for dead proxies

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
```

Custom Ban Detection
By default, scrapy-rotating-proxies treats most non-200 responses as a ban. Customize this with a ban detection policy:
```python
# middlewares.py
from rotating_proxies.policy import BanDetectionPolicy


class CustomBanPolicy(BanDetectionPolicy):
    NOT_BAN_STATUSES = {200, 301, 302, 404}
    NOT_BAN_EXCEPTIONS = set()

    def response_is_ban(self, request, response):
        if response.status not in self.NOT_BAN_STATUSES:
            return True
        # Check for soft bans
        if b'captcha' in response.body.lower():
            return True
        if b'access denied' in response.body.lower():
            return True
        return False

    def exception_is_ban(self, request, exception):
        return True
```

```python
# settings.py
ROTATING_PROXY_BAN_POLICY = 'myproject.middlewares.CustomBanPolicy'
```

Proxy Pool Management
Loading Proxies from External Sources
For production crawls, proxy lists should be dynamic — loaded from a database or API rather than hardcoded:
```python
# middlewares.py
import logging
import random
import time

import requests as http_requests

logger = logging.getLogger(__name__)


class DynamicProxyMiddleware:
    def __init__(self, proxy_api_url):
        self.proxy_api_url = proxy_api_url
        self.proxies = []
        self.last_refresh = 0
        self.refresh_interval = 300  # Refresh every 5 minutes

    @classmethod
    def from_crawler(cls, crawler):
        api_url = crawler.settings.get('PROXY_API_URL')
        return cls(api_url)

    def _refresh_proxies(self):
        now = time.time()
        if now - self.last_refresh > self.refresh_interval:
            try:
                # Note: this is a blocking call inside Scrapy's async engine;
                # the refresh interval keeps it rare, so keep the timeout short
                response = http_requests.get(self.proxy_api_url, timeout=10)
                self.proxies = response.json()['proxies']
                self.last_refresh = now
                logger.info(f"Refreshed proxy pool: {len(self.proxies)} proxies")
            except Exception as e:
                logger.error(f"Failed to refresh proxies: {e}")

    def process_request(self, request, spider):
        self._refresh_proxies()
        if self.proxies:
            proxy = random.choice(self.proxies)
            request.meta['proxy'] = proxy
```
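One assumption worth flagging: the middleware above expects the (hypothetical) PROXY_API_URL endpoint to return JSON with a top-level proxies array of proxy URLs, for example:

```json
{
  "proxies": [
    "http://user:pass@mobile-proxy1.example.com:8080",
    "http://user:pass@mobile-proxy2.example.com:8080"
  ]
}
```

Adjust the parsing in _refresh_proxies to match your provider’s actual response format.

Proxy Usage Statistics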
Track which proxies perform best:
```python
# extensions.py
import json
from collections import defaultdict

from scrapy import signals


class ProxyStatsExtension:
    def __init__(self):
        self.stats = defaultdict(lambda: {
            'requests': 0,
            'success': 0,
            'failures': 0,
            'avg_response_time': 0,
            'total_response_time': 0
        })

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.response_received, signal=signals.response_received)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def response_received(self, response, request, spider):
        # _proxy_used is stamped on the request by SmartProxyMiddleware
        proxy = request.meta.get('_proxy_used')
        if not proxy:
            return
        entry = self.stats[proxy]
        entry['requests'] += 1
        if response.status < 400:
            entry['success'] += 1
        else:
            entry['failures'] += 1
        # download_latency is set by Scrapy's downloader for each response
        entry['total_response_time'] += request.meta.get('download_latency', 0)
        entry['avg_response_time'] = entry['total_response_time'] / entry['requests']

    def spider_closed(self, spider):
        report = json.dumps(dict(self.stats), indent=2)
        with open('proxy_stats.json', 'w') as f:
            f.write(report)
        spider.logger.info("Proxy stats written to proxy_stats.json")
```
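To activate the extension, register it in the EXTENSIONS setting (the import path assumes the class lives at myproject/extensions.py):

```python
# settings.py
EXTENSIONS = {
    'myproject.extensions.ProxyStatsExtension': 500,
}
```

Optimizing Concurrent Requests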
Concurrency Settings for Proxy Scraping
```python
# settings.py
# Total concurrent requests across all domains
CONCURRENT_REQUESTS = 16

# Concurrent requests per domain
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Download delay (seconds between requests to the same domain)
DOWNLOAD_DELAY = 1

# Randomize download delay (0.5x to 1.5x of DOWNLOAD_DELAY)
RANDOMIZE_DOWNLOAD_DELAY = True

# AutoThrottle (adjusts speed based on server response)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0
```

Matching Concurrency to Proxy Pool Size
A common mistake is setting concurrency higher than your proxy pool can support. If you have 3 mobile proxy ports, sending 50 concurrent requests means each proxy handles ~17 simultaneous connections. This looks unnatural and triggers rate limits.
Rule of thumb: Set CONCURRENT_REQUESTS to 3-5x the number of proxy ports. With 3 mobile proxies, target 9-15 concurrent requests.
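One way to keep the two in sync is to derive the concurrency setting from the pool itself; a small sketch, assuming the PROXY_LIST shown earlier:

```python
# settings.py
PROXY_LIST = [
    'http://user:pass@mobile-proxy1.example.com:8080',
    'http://user:pass@mobile-proxy2.example.com:8080',
    'http://user:pass@mobile-proxy3.example.com:8080',
]

# Roughly 4 concurrent requests per proxy port, within the 3-5x rule of thumb
CONCURRENT_REQUESTS = 4 * len(PROXY_LIST)  # 12 with three proxies
```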
AutoThrottle for Adaptive Speed
Scrapy’s AutoThrottle extension automatically adjusts crawl speed based on observed server latency:
```python
# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2  # Initial delay
AUTOTHROTTLE_MAX_DELAY = 30  # Maximum delay if the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 3.0  # Target parallel requests per server
AUTOTHROTTLE_DEBUG = True  # Log throttle adjustments
```

AutoThrottle is especially useful with mobile proxies because mobile connection speeds vary. The extension adapts to real-time network conditions.
Complete Spider Example
Here is a full Scrapy spider with proxy middleware, item pipeline, and error handling:
```python
# spiders/product_spider.py
import scrapy


class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products?page=1']
    # Let 403 responses reach parse() so we can log blocks; without this,
    # Scrapy's HttpErrorMiddleware filters them out before the callback
    handle_httpstatus_list = [403]

    custom_settings = {
        'CONCURRENT_REQUESTS': 10,
        'DOWNLOAD_DELAY': 1.5,
        'RANDOMIZE_DOWNLOAD_DELAY': True,
        'RETRY_TIMES': 3,
    }

    def parse(self, response):
        # Check for blocks
        if response.status == 403 or b'access denied' in response.body.lower():
            self.logger.warning(f"Blocked on {response.url}")
            return

        # Extract products
        products = response.css('.product-card')
        if not products:
            self.logger.warning(f"No products found on {response.url}")
            return

        for product in products:
            yield {
                'title': product.css('.title::text').get('').strip(),
                'price': product.css('.price::text').get('').strip(),
                'url': response.urljoin(product.css('a::attr(href)').get()),
                'source_url': response.url,
            }

        # Follow pagination
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Pipeline for Data Processing
```python
# pipelines.py
import json

from scrapy.exceptions import DropItem


class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open('products.jsonl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + '\n'
        self.file.write(line)
        return item


class DataValidationPipeline:
    def process_item(self, item, spider):
        if not item.get('title'):
            raise DropItem(f"Missing title: {item}")
        if not item.get('price'):
            raise DropItem(f"Missing price: {item}")
        return item
```

Full Settings
```python
# settings.py
BOT_NAME = 'product_scraper'
SPIDER_MODULES = ['product_scraper.spiders']

# Proxy configuration
PROXY_LIST = [
    'http://user:pass@mobile-proxy1.example.com:8080',
    'http://user:pass@mobile-proxy2.example.com:8080',
    'http://user:pass@mobile-proxy3.example.com:8080',
]

DOWNLOADER_MIDDLEWARES = {
    'product_scraper.middlewares.SmartProxyMiddleware': 350,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}

ITEM_PIPELINES = {
    'product_scraper.pipelines.DataValidationPipeline': 100,
    'product_scraper.pipelines.JsonWriterPipeline': 200,
}

# Request settings
CONCURRENT_REQUESTS = 10
CONCURRENT_REQUESTS_PER_DOMAIN = 3
DOWNLOAD_DELAY = 1.5
RANDOMIZE_DOWNLOAD_DELAY = True
DOWNLOAD_TIMEOUT = 30

# Retry
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]

# AutoThrottle
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_MAX_DELAY = 15
AUTOTHROTTLE_TARGET_CONCURRENCY = 3.0

# Caching (useful for development)
HTTPCACHE_ENABLED = False

# Respect robots.txt
ROBOTSTXT_OBEY = True

# Default headers (UserAgentMiddleware is disabled above, so the
# User-Agent is set here)
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
}
```

When to Use Scrapy vs. Other Tools
Scrapy is the right choice when:
- You are crawling thousands to millions of pages
- You need structured pipelines for data processing
- The target does not require JavaScript rendering
- You want built-in rate limiting, retry logic, and statistics
Scrapy is not the right choice when:
- The target requires JavaScript rendering (use Playwright or Puppeteer)
- You are scraping a small number of pages (use Python requests)
- You need to bypass heavy JavaScript challenges (see our Cloudflare bypass guide)
For JavaScript-heavy targets, consider scrapy-playwright or scrapy-splash to add rendering capability to Scrapy.
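As an example, here is a minimal scrapy-playwright setup based on that package’s documented settings; rendering is then opt-in per request via meta:

```python
# settings.py -- minimal scrapy-playwright configuration
DOWNLOAD_HANDLERS = {
    'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
    'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
}
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
```

```python
# In a spider: opt in to browser rendering for a single request
yield scrapy.Request(
    url='https://example.com/js-heavy-page',
    meta={'playwright': True},
)
```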
Conclusion
Scrapy’s middleware architecture makes it the ideal framework for large-scale proxy-based scraping. The custom SmartProxyMiddleware shown in this guide handles rotation, health tracking, and automatic recovery — everything you need for production crawls.
The critical factor is proxy quality. DataResearchTools mobile proxies provide CGNAT-based IP addresses that anti-bot systems are reluctant to block outright, because each address is shared by many real mobile users. Combined with Scrapy’s efficient request pipeline, you get a scraping system that scales to millions of pages while maintaining high success rates.
Connect your Scrapy project to mobile proxies and start crawling at scale with confidence.
- Mobile Proxies for Web Scraping: Why They Work When Others Don’t
- How to Use Mobile Proxies with Puppeteer for Web Scraping
- Selenium Proxy Setup: Complete Guide for Web Scraping
- Playwright Proxy Configuration: Step-by-Step Scraping Guide
- Python Requests + Proxies: Scraping Setup from Scratch
- Headless Browser + Proxy Setup: The Anti-Detection Stack
- How to Scrape Amazon Product Data with Proxies: 2026 Python Guide
- How to Scrape Bing Search Results with Python and Proxies
Related Reading
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)
- API vs Web Scraping: When You Need Proxies (and When You Don’t)
- aiohttp + BeautifulSoup: Async Python Scraping
- ASEAN Data Protection Laws: A Web Scraping Compliance Matrix
- Axios + Cheerio: Lightweight Node.js Scraping
- How to Build an Ethical Web Scraping Policy for Your Company