Scrapy Proxy Middleware: Rotate Mobile Proxies for Large-Scale Crawls

Scrapy is built for scale. While libraries like requests and browser automation tools handle pages one at a time (or in small batches), Scrapy’s asynchronous architecture processes hundreds of concurrent requests through a pipeline of middlewares, spiders, and item processors. It is the standard framework for crawling millions of pages.

But Scrapy’s power creates a problem: sending hundreds of concurrent requests from a single IP address will trigger rate limits and blocks almost immediately. You need proxy rotation integrated directly into Scrapy’s request pipeline, and the built-in proxy support is minimal.

This guide covers building custom downloader middleware for proxy rotation, configuring popular proxy middleware packages, managing proxy pools with health checks, and optimizing concurrent requests when routing through mobile proxies.

Scrapy Architecture and Where Proxies Fit

Understanding Scrapy’s request flow is essential for implementing proxy rotation correctly.

The Request Pipeline

  1. Spider generates Request objects
  2. Spider Middleware processes requests (optional modifications)
  3. Scheduler queues requests
  4. Downloader Middleware processes requests before they are sent (this is where proxy rotation happens)
  5. Downloader sends the HTTP request
  6. Downloader Middleware processes the response
  7. Spider receives the response and extracts data

Proxy rotation belongs in the Downloader Middleware layer. Each request passes through your middleware before being sent, giving you the opportunity to assign a proxy, modify headers, and handle proxy-specific errors.
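Concretely, a downloader middleware is just a class exposing hooks that Scrapy calls for each request. Here is a minimal sketch (the class name and proxy URL are placeholders, not part of any real project):

```python
class MinimalProxyMiddleware:
    """Bare-bones downloader middleware: Scrapy calls process_request
    for each outgoing request before the downloader sends it."""

    def process_request(self, request, spider):
        # Setting request.meta['proxy'] is how a proxy is assigned;
        # returning None lets the request continue through the chain.
        request.meta['proxy'] = 'http://user:pass@proxy-host:8080'
        return None
```

Registered in DOWNLOADER_MIDDLEWARES, this would route every request through one hard-coded proxy; the sections below replace the hard-coded value with rotation logic.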

Built-in Proxy Support

Scrapy has minimal built-in proxy support via the HttpProxyMiddleware. It reads the proxy from the proxy meta key on the request or from environment variables:

# Set proxy via request meta
yield scrapy.Request(
    url='https://example.com',
    meta={'proxy': 'http://user:pass@proxy-host:port'}
)
# Or via environment variable
export http_proxy=http://user:pass@proxy-host:port
export https_proxy=http://user:pass@proxy-host:port

This works for a single proxy, but it does not handle rotation, health checks, or pool management. For production crawls, you need custom middleware.

Building Custom Proxy Rotation Middleware

Basic Round-Robin Middleware

# middlewares.py
import itertools
import logging

logger = logging.getLogger(__name__)

class ProxyRotationMiddleware:
    def __init__(self, proxy_list):
        self.proxies = proxy_list
        self.proxy_cycle = itertools.cycle(self.proxies)

    @classmethod
    def from_crawler(cls, crawler):
        proxy_list = crawler.settings.getlist('PROXY_LIST')
        if not proxy_list:
            raise ValueError("PROXY_LIST setting is required")
        return cls(proxy_list)

    def process_request(self, request, spider):
        proxy = next(self.proxy_cycle)
        request.meta['proxy'] = proxy
        logger.debug(f"Using proxy: {proxy} for {request.url}")

    def process_exception(self, request, exception, spider):
        logger.warning(f"Proxy failed for {request.url}: {exception}")
        # Returning a request re-queues it, and process_request will assign
        # the next proxy in the cycle. Cap the retries so a dead target
        # cannot loop forever (this path bypasses RetryMiddleware's counter).
        retries = request.meta.get('proxy_retries', 0)
        if retries >= len(self.proxies):
            return None
        retry = request.copy()
        retry.meta['proxy_retries'] = retries + 1
        retry.dont_filter = True
        return retry
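The rotation itself rests on itertools.cycle, which repeats the list indefinitely. A quick standalone check of the wrap-around behavior:

```python
import itertools

proxies = ['http://p1:8080', 'http://p2:8080', 'http://p3:8080']
proxy_cycle = itertools.cycle(proxies)

# Four requests: the fourth wraps back around to the first proxy
assigned = [next(proxy_cycle) for _ in range(4)]
assert assigned == ['http://p1:8080', 'http://p2:8080',
                    'http://p3:8080', 'http://p1:8080']
```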

Settings Configuration

# settings.py
PROXY_LIST = [
    'http://user:pass@mobile-proxy1.example.com:8080',
    'http://user:pass@mobile-proxy2.example.com:8080',
    'http://user:pass@mobile-proxy3.example.com:8080',
]

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ProxyRotationMiddleware': 350,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
}

The priority number matters. Your custom middleware (350) must run before Scrapy’s HttpProxyMiddleware (400), which reads the proxy meta key your middleware sets.

Advanced Middleware with Health Tracking

Production crawls need proxy health monitoring. Remove failing proxies from rotation and re-add them after a cooldown:

# middlewares.py
import random
import time
import logging
from collections import defaultdict

logger = logging.getLogger(__name__)

class SmartProxyMiddleware:
    def __init__(self, proxy_list, max_failures=5, cooldown=300):
        self.all_proxies = proxy_list
        self.active_proxies = list(proxy_list)
        self.failed_proxies = {}  # proxy -> (failure_count, last_failure_time)
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.stats = defaultdict(lambda: {'success': 0, 'failure': 0})

    @classmethod
    def from_crawler(cls, crawler):
        proxy_list = crawler.settings.getlist('PROXY_LIST')
        max_failures = crawler.settings.getint('PROXY_MAX_FAILURES', 5)
        cooldown = crawler.settings.getint('PROXY_COOLDOWN', 300)
        return cls(proxy_list, max_failures, cooldown)

    def _recover_proxies(self):
        """Re-add proxies that have cooled down."""
        now = time.time()
        recovered = []
        for proxy, (failures, last_time) in list(self.failed_proxies.items()):
            if now - last_time > self.cooldown:
                self.active_proxies.append(proxy)
                recovered.append(proxy)
                logger.info(f"Recovered proxy: {proxy}")

        for proxy in recovered:
            del self.failed_proxies[proxy]

    def _get_proxy(self):
        self._recover_proxies()

        if not self.active_proxies:
            logger.warning("No active proxies. Resetting all proxies.")
            self.active_proxies = list(self.all_proxies)
            self.failed_proxies.clear()

        return random.choice(self.active_proxies)

    def _mark_failure(self, proxy):
        self.stats[proxy]['failure'] += 1

        if proxy in self.failed_proxies:
            failures, _ = self.failed_proxies[proxy]
            self.failed_proxies[proxy] = (failures + 1, time.time())
        else:
            self.failed_proxies[proxy] = (1, time.time())

        failures = self.failed_proxies[proxy][0]
        if failures >= self.max_failures and proxy in self.active_proxies:
            self.active_proxies.remove(proxy)
            logger.warning(f"Removed proxy after {failures} failures: {proxy}")

    def _mark_success(self, proxy):
        self.stats[proxy]['success'] += 1
        if proxy in self.failed_proxies:
            del self.failed_proxies[proxy]

    def process_request(self, request, spider):
        proxy = self._get_proxy()
        request.meta['proxy'] = proxy
        request.meta['_proxy_used'] = proxy

    def process_response(self, request, response, spider):
        proxy = request.meta.get('_proxy_used')
        if proxy:
            if response.status in (403, 429, 503):
                self._mark_failure(proxy)
                # Retry with a different proxy, but bound the retries so a
                # hard-blocked URL cannot loop forever
                retries = request.meta.get('_proxy_retries', 0)
                if retries < self.max_failures:
                    new_request = request.copy()
                    new_request.meta['_proxy_retries'] = retries + 1
                    new_request.dont_filter = True
                    return new_request
            else:
                self._mark_success(proxy)
        return response

    def process_exception(self, request, exception, spider):
        proxy = request.meta.get('_proxy_used')
        if proxy:
            self._mark_failure(proxy)
        # Re-queue the request; process_request will assign a fresh proxy.
        # Bound the retries here as well.
        retries = request.meta.get('_proxy_retries', 0)
        if retries >= self.max_failures:
            return None
        retry = request.copy()
        retry.meta['_proxy_retries'] = retries + 1
        retry.dont_filter = True
        return retry
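The failure/cooldown bookkeeping can be exercised in isolation. Here is a minimal sketch of the same state machine, with the class and method names being illustrative rather than part of the middleware:

```python
import time

class FailureTracker:
    """Tracks per-proxy failure counts and re-admits proxies after a cooldown."""

    def __init__(self, proxies, max_failures=2, cooldown=1.0):
        self.active = list(proxies)
        self.failed = {}  # proxy -> (failure_count, last_failure_time)
        self.max_failures = max_failures
        self.cooldown = cooldown

    def mark_failure(self, proxy):
        count, _ = self.failed.get(proxy, (0, 0.0))
        self.failed[proxy] = (count + 1, time.time())
        # Evict the proxy from rotation once it hits the failure threshold
        if self.failed[proxy][0] >= self.max_failures and proxy in self.active:
            self.active.remove(proxy)

    def recover(self):
        # Re-admit any proxy whose cooldown window has elapsed
        now = time.time()
        for proxy, (count, last) in list(self.failed.items()):
            if now - last > self.cooldown:
                self.active.append(proxy)
                del self.failed[proxy]

tracker = FailureTracker(['http://p1:8080', 'http://p2:8080'],
                         max_failures=2, cooldown=0.1)
tracker.mark_failure('http://p1:8080')
tracker.mark_failure('http://p1:8080')  # second failure hits max_failures
assert tracker.active == ['http://p2:8080']

time.sleep(0.2)  # wait past the cooldown
tracker.recover()
assert 'http://p1:8080' in tracker.active
```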

Settings for Advanced Middleware

# settings.py
PROXY_LIST = [
    'http://user:pass@mobile-proxy1.example.com:8080',
    'http://user:pass@mobile-proxy2.example.com:8080',
    'http://user:pass@mobile-proxy3.example.com:8080',
]

PROXY_MAX_FAILURES = 5
PROXY_COOLDOWN = 300  # seconds

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.SmartProxyMiddleware': 350,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
}

# Retry settings
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]

Using scrapy-rotating-proxies

If you prefer a ready-made solution, the scrapy-rotating-proxies package handles rotation and health tracking out of the box:

pip install scrapy-rotating-proxies

Configuration

# settings.py
ROTATING_PROXY_LIST = [
    'http://user:pass@mobile-proxy1.example.com:8080',
    'http://user:pass@mobile-proxy2.example.com:8080',
    'http://user:pass@mobile-proxy3.example.com:8080',
]

# Or load from a file
# ROTATING_PROXY_LIST_PATH = '/path/to/proxy_list.txt'

ROTATING_PROXY_PAGE_RETRY_TIMES = 5
ROTATING_PROXY_CLOSE_SPIDER = False  # Don't close spider if all proxies are dead
ROTATING_PROXY_BACKOFF_BASE = 300  # Backoff time for dead proxies

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

Custom Ban Detection

By default, scrapy-rotating-proxies considers any non-200 response as a ban. Customize this with a ban detection policy:

# middlewares.py
from rotating_proxies.policy import BanDetectionPolicy

class CustomBanPolicy(BanDetectionPolicy):
    NOT_BAN_STATUSES = {200, 301, 302, 404}
    NOT_BAN_EXCEPTIONS = set()

    def response_is_ban(self, request, response):
        if response.status not in self.NOT_BAN_STATUSES:
            return True
        # Check for soft bans
        if b'captcha' in response.body.lower():
            return True
        if b'access denied' in response.body.lower():
            return True
        return False

    def exception_is_ban(self, request, exception):
        return True

# settings.py
ROTATING_PROXY_BAN_POLICY = 'myproject.middlewares.CustomBanPolicy'
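The body checks above are plain case-insensitive substring matches on the raw response bytes. The same logic in isolation (the function name is illustrative):

```python
def looks_like_soft_ban(body: bytes) -> bool:
    # Mirror of the checks in CustomBanPolicy: lowercase the raw body
    # and look for common block-page markers
    lowered = body.lower()
    return b'captcha' in lowered or b'access denied' in lowered

assert looks_like_soft_ban(b'<html>Please solve this CAPTCHA</html>')
assert looks_like_soft_ban(b'<h1>Access Denied</h1>')
assert not looks_like_soft_ban(b'<div class="product-card">Widget</div>')
```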

Proxy Pool Management

Loading Proxies from External Sources

For production crawls, proxy lists should be dynamic — loaded from a database or API rather than hardcoded:

# middlewares.py
import logging
import random
import time

import requests as http_requests

logger = logging.getLogger(__name__)

class DynamicProxyMiddleware:
    def __init__(self, proxy_api_url):
        self.proxy_api_url = proxy_api_url
        self.proxies = []
        self.last_refresh = 0
        self.refresh_interval = 300  # Refresh every 5 minutes

    @classmethod
    def from_crawler(cls, crawler):
        api_url = crawler.settings.get('PROXY_API_URL')
        return cls(api_url)

    def _refresh_proxies(self):
        now = time.time()
        if now - self.last_refresh > self.refresh_interval:
            try:
                response = http_requests.get(self.proxy_api_url, timeout=10)
                self.proxies = response.json()['proxies']
                self.last_refresh = now
                logger.info(f"Refreshed proxy pool: {len(self.proxies)} proxies")
            except Exception as e:
                logger.error(f"Failed to refresh proxies: {e}")

    def process_request(self, request, spider):
        self._refresh_proxies()
        if self.proxies:
            proxy = random.choice(self.proxies)
            request.meta['proxy'] = proxy

Proxy Usage Statistics

Track which proxies perform best:

# extensions.py
import json
from collections import defaultdict

from scrapy import signals

class ProxyStatsExtension:
    """Collects per-proxy counters and writes them out when the spider closes.
    Enable it in settings.py:
    EXTENSIONS = {'myproject.extensions.ProxyStatsExtension': 500}
    """

    def __init__(self):
        self.stats = defaultdict(lambda: {
            'requests': 0,
            'success': 0,
            'failures': 0,
        })

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.response_received, signal=signals.response_received)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def response_received(self, response, request, spider):
        proxy = request.meta.get('proxy')
        if proxy:
            self.stats[proxy]['requests'] += 1
            outcome = 'success' if response.status < 400 else 'failures'
            self.stats[proxy][outcome] += 1

    def spider_closed(self, spider):
        with open('proxy_stats.json', 'w') as f:
            json.dump(dict(self.stats), f, indent=2)
        spider.logger.info("Proxy stats written to proxy_stats.json")

Optimizing Concurrent Requests

Concurrency Settings for Proxy Scraping

# settings.py

# Total concurrent requests across all domains
CONCURRENT_REQUESTS = 16

# Concurrent requests per domain
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Download delay (seconds between requests to same domain)
DOWNLOAD_DELAY = 1

# Randomize download delay (0.5x to 1.5x of DOWNLOAD_DELAY)
RANDOMIZE_DOWNLOAD_DELAY = True

# Auto-throttle (adjusts speed based on server response)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0

Matching Concurrency to Proxy Pool Size

A common mistake is setting concurrency higher than your proxy pool can support. If you have 3 mobile proxy ports, sending 50 concurrent requests means each proxy handles ~17 simultaneous connections. This looks unnatural and triggers rate limits.

Rule of thumb: Set CONCURRENT_REQUESTS to 3-5x the number of proxy ports. With 3 mobile proxies, target 9-15 concurrent requests.
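As a quick sanity check, the arithmetic behind the rule of thumb (the helper function is illustrative, not part of Scrapy):

```python
def recommended_concurrency(proxy_ports: int, factor: int = 4) -> int:
    """Rule of thumb: 3-5x the number of proxy ports; 4x picks the middle."""
    return proxy_ports * factor

# 3 mobile proxy ports -> 12 concurrent requests, inside the 9-15 target band
assert recommended_concurrency(3) == 12
assert 9 <= recommended_concurrency(3, factor=3) <= 15
assert 9 <= recommended_concurrency(3, factor=5) <= 15
```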

AutoThrottle for Adaptive Speed

Scrapy’s AutoThrottle extension automatically adjusts request speed based on server response times and latency:

# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2       # Initial delay
AUTOTHROTTLE_MAX_DELAY = 30        # Maximum delay if server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 3.0  # Target parallel requests per server
AUTOTHROTTLE_DEBUG = True          # Log throttle adjustments

AutoThrottle is especially useful with mobile proxies because mobile connection speeds vary. The extension adapts to real-time network conditions.
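Under the hood, AutoThrottle steers the per-slot download delay toward observed latency divided by AUTOTHROTTLE_TARGET_CONCURRENCY (see the Scrapy AutoThrottle documentation for the full algorithm). A quick illustration of what the setting implies:

```python
def autothrottle_target_delay(latency: float, target_concurrency: float) -> float:
    # AutoThrottle adjusts the download delay toward
    # observed latency / AUTOTHROTTLE_TARGET_CONCURRENCY
    return latency / target_concurrency

# A 1.5 s average response time with TARGET_CONCURRENCY = 3.0
# steers toward a 0.5 s delay between requests to that server
assert autothrottle_target_delay(1.5, 3.0) == 0.5
```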

Complete Spider Example

Here is a full Scrapy spider with proxy middleware, item pipeline, and error handling:

# spiders/product_spider.py
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products?page=1']

    custom_settings = {
        'CONCURRENT_REQUESTS': 10,
        'DOWNLOAD_DELAY': 1.5,
        'RANDOMIZE_DOWNLOAD_DELAY': True,
        'RETRY_TIMES': 3,
    }

    def parse(self, response):
        # Check for blocks
        if response.status == 403 or b'access denied' in response.body.lower():
            self.logger.warning(f"Blocked on {response.url}")
            return

        # Extract products
        products = response.css('.product-card')
        if not products:
            self.logger.warning(f"No products found on {response.url}")
            return

        for product in products:
            yield {
                'title': product.css('.title::text').get('').strip(),
                'price': product.css('.price::text').get('').strip(),
                'url': response.urljoin(product.css('a::attr(href)').get()),
                'source_url': response.url,
            }

        # Follow pagination
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Pipeline for Data Processing

# pipelines.py
import json

from scrapy.exceptions import DropItem

class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open('products.jsonl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + '\n'
        self.file.write(line)
        return item

class DataValidationPipeline:
    def process_item(self, item, spider):
        if not item.get('title'):
            raise DropItem(f"Missing title: {item}")
        if not item.get('price'):
            raise DropItem(f"Missing price: {item}")
        return item

Full Settings

# settings.py
BOT_NAME = 'product_scraper'
SPIDER_MODULES = ['product_scraper.spiders']

# Proxy configuration
PROXY_LIST = [
    'http://user:pass@mobile-proxy1.example.com:8080',
    'http://user:pass@mobile-proxy2.example.com:8080',
    'http://user:pass@mobile-proxy3.example.com:8080',
]

DOWNLOADER_MIDDLEWARES = {
    'product_scraper.middlewares.SmartProxyMiddleware': 350,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}

ITEM_PIPELINES = {
    'product_scraper.pipelines.DataValidationPipeline': 100,
    'product_scraper.pipelines.JsonWriterPipeline': 200,
}

# Request settings
CONCURRENT_REQUESTS = 10
CONCURRENT_REQUESTS_PER_DOMAIN = 3
DOWNLOAD_DELAY = 1.5
RANDOMIZE_DOWNLOAD_DELAY = True
DOWNLOAD_TIMEOUT = 30

# Retry
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]

# AutoThrottle
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_MAX_DELAY = 15
AUTOTHROTTLE_TARGET_CONCURRENCY = 3.0

# Caching (useful for development)
HTTPCACHE_ENABLED = False

# Respect robots.txt
ROBOTSTXT_OBEY = True

# User agent
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
}

When to Use Scrapy vs. Other Tools

Scrapy is the right choice when:

  • You are crawling thousands to millions of pages
  • You need structured pipelines for data processing
  • The target does not require JavaScript rendering
  • You want built-in rate limiting, retry logic, and statistics

Scrapy is not the right choice when:

  • The target requires JavaScript rendering to display its content
  • You only need a handful of pages, where a simple requests script is less overhead
  • You need full browser interaction such as logins, clicks, or infinite scroll

For JavaScript-heavy targets, consider scrapy-playwright or scrapy-splash to add rendering capability to Scrapy.

Conclusion

Scrapy’s middleware architecture makes it the ideal framework for large-scale proxy-based scraping. The custom SmartProxyMiddleware shown in this guide handles rotation, health tracking, and automatic recovery — everything you need for production crawls.

The critical factor is proxy quality. DataResearchTools mobile proxies provide CGNAT-based IP addresses that anti-bot systems are reluctant to block aggressively, because each address is shared by large numbers of real mobile users. Combined with Scrapy’s efficient request pipeline, you get a scraping system that scales to millions of pages while maintaining high success rates.

Connect your Scrapy project to mobile proxies and start crawling at scale with confidence.

