how to use proxies with scrapy: middleware, rotation, and headers (2026)
scrapy supports proxies three ways: per-request meta, the built-in httpproxymiddleware, and custom rotating middleware. for a single proxy, set request.meta["proxy"]. for rotation, write a downloader middleware that picks a fresh proxy per request and tracks dead ones. for production, pair the rotating middleware with header spoofing and a retry policy. this tutorial gives you working code for all three patterns plus the gotchas that bite at scale.
we cover the basics, then build a production-ready rotating middleware with health checks and exponential backoff.
the simplest pattern: per-request proxy
set proxy in request.meta. scrapy’s built-in httpproxymiddleware (enabled by default) reads it.
import scrapy


class SimpleSpider(scrapy.Spider):
    name = "simple"
    start_urls = ["https://httpbin.org/ip"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta={"proxy": "http://user:pass@1.2.3.4:8080"},
            )

    def parse(self, response):
        self.logger.info(f"saw ip: {response.json()}")
this is the right pattern for jobs with one or two static proxies. for rotation, build a middleware.
env-based proxy via http_proxy
if you want every request to go through one proxy without touching code, scrapy honors the http_proxy and https_proxy env vars:
export HTTP_PROXY="http://user:pass@1.2.3.4:8080"
export HTTPS_PROXY="http://user:pass@1.2.3.4:8080"
scrapy crawl simple
this works for ci pipelines and one-off runs. for fine-grained control, use the middleware approach below.
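to confirm what a run will pick up, note that httpproxymiddleware reads these variables through python's standard urllib proxy detection. the sketch below sets placeholder values in-process and prints what urllib.request.getproxies() reports; the proxy url is an example, not a working endpoint.

```python
import os
from urllib.request import getproxies

# placeholder values; substitute your own proxy url
os.environ["http_proxy"] = "http://user:pass@1.2.3.4:8080"
os.environ["https_proxy"] = "http://user:pass@1.2.3.4:8080"

# getproxies() maps each url scheme to the proxy configured for it
proxies = getproxies()
print("http  ->", proxies.get("http"))
print("https ->", proxies.get("https"))
```

if a scheme prints None, the variable was not exported into the environment the crawl runs in.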
rotating proxy middleware
create myproject/middlewares.py:
import logging
import random
import time
from collections import defaultdict

logger = logging.getLogger(__name__)


class RotatingProxyMiddleware:
    """rotating proxy with health tracking and exponential cooldown."""

    def __init__(self, proxies, cooldown_sec=300):
        self.proxies = list(proxies)
        self.cooldown_sec = cooldown_sec
        self.bad_until = defaultdict(float)  # proxy -> timestamp it is usable again
        self.fail_count = defaultdict(int)   # proxy -> consecutive failures
        if not self.proxies:
            raise ValueError("rotating proxy middleware: no proxies configured")

    @classmethod
    def from_crawler(cls, crawler):
        proxies = crawler.settings.getlist("ROTATING_PROXIES")
        cooldown = crawler.settings.getint("ROTATING_PROXY_COOLDOWN_SEC", 300)
        return cls(proxies=proxies, cooldown_sec=cooldown)

    def get_proxy(self):
        now = time.time()
        live = [p for p in self.proxies if self.bad_until[p] < now]
        if not live:
            logger.warning("all proxies cooling down. resetting.")
            self.bad_until.clear()
            live = self.proxies
        return random.choice(live)

    def mark_bad(self, proxy):
        self.fail_count[proxy] += 1
        # cooldown doubles with each consecutive failure: 300s, 600s, 1200s, ...
        cooldown = self.cooldown_sec * (2 ** (self.fail_count[proxy] - 1))
        self.bad_until[proxy] = time.time() + cooldown
        logger.info(f"proxy {proxy} marked bad. cooldown {cooldown}s.")

    def mark_good(self, proxy):
        self.fail_count[proxy] = 0

    def process_request(self, request, spider):
        assigned = request.meta.get("_rotating_proxy")
        # keep an earlier assignment (for example across redirects) unless this
        # is a retry or the proxy has been marked bad in the meantime
        if (
            assigned
            and not request.meta.get("retry_times")
            and self.bad_until[assigned] < time.time()
        ):
            return
        proxy = self.get_proxy()
        request.meta["proxy"] = proxy
        # httpproxymiddleware may strip the credentials from meta["proxy"],
        # so remember the original string under our own key for health tracking
        request.meta["_rotating_proxy"] = proxy

    def process_response(self, request, response, spider):
        proxy = request.meta.get("_rotating_proxy")
        if not proxy:
            return response
        if response.status in (407, 502, 503, 504):
            self.mark_bad(proxy)
        elif 200 <= response.status < 400:
            self.mark_good(proxy)
        return response

    def process_exception(self, request, exception, spider):
        proxy = request.meta.get("_rotating_proxy")
        if proxy:
            self.mark_bad(proxy)
enable in settings.py:
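DOWNLOADER_MIDDLEWARES = {
    # lower numbers run first on the request path. the rotating middleware must
    # set request.meta["proxy"] before HttpProxyMiddleware (750) reads it and
    # turns the embedded credentials into a Proxy-Authorization header.
    "myproject.middlewares.RotatingProxyMiddleware": 740,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 750,
}
ROTATING_PROXIES = [
    "http://user:pass@1.2.3.4:8080",
    "http://user:pass@5.6.7.8:8080",
    "http://user:pass@9.10.11.12:8080",
]
ROTATING_PROXY_COOLDOWN_SEC = 300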
the middleware picks a fresh proxy per request, marks dead proxies on 407/502/503/504 responses or exceptions, and applies exponential cooldown so a flaky proxy comes back online after a short rest.
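to see how quickly an exponential cooldown grows, here is a small standalone sketch. the multiplier is a parameter (pass whatever your mark_bad uses), and cap_sec is an optional extra that the middleware above does not have; both names are illustrative.

```python
def cooldown_schedule(base_sec, multiplier, failures, cap_sec=None):
    """return the cooldown in seconds after each consecutive failure."""
    schedule = []
    for n in range(1, failures + 1):
        cooldown = base_sec * multiplier ** (n - 1)
        if cap_sec is not None:
            # optional ceiling so a proxy is never benched indefinitely
            cooldown = min(cooldown, cap_sec)
        schedule.append(cooldown)
    return schedule


print(cooldown_schedule(300, 2, 5))                # [300, 600, 1200, 2400, 4800]
print(cooldown_schedule(300, 2, 5, cap_sec=3600))  # [300, 600, 1200, 2400, 3600]
```

without a cap, a proxy that keeps failing is benched for longer each time, which is usually what you want for dead proxies but can starve a small pool.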
sticky session middleware for login flows
some scrapes need the same proxy across multiple requests (login then crawl). hash the session id to a fixed proxy:
import hashlib


class StickyProxyMiddleware:
    def __init__(self, proxies):
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist("STICKY_PROXIES"))

    def process_request(self, request, spider):
        session_id = request.meta.get("session_id")
        if not session_id:
            return
        h = hashlib.md5(session_id.encode()).hexdigest()
        idx = int(h, 16) % len(self.proxies)
        request.meta["proxy"] = self.proxies[idx]
usage in spider:
yield scrapy.Request(
    "https://example.com/dashboard",
    meta={"session_id": "user_abc"},
    callback=self.parse_dashboard,
)
every request with session_id="user_abc" gets the same proxy. a different session id is hashed independently, so it usually lands on a different proxy, though two ids can share one when the pool is small.
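the mapping can be checked outside scrapy. this sketch repeats the hashing step as a plain function (proxy_for_session and the proxy urls are illustrative names and placeholders) and shows that the result is stable per session id, while distinct ids are not guaranteed distinct proxies.

```python
import hashlib

# placeholder pool for illustration
PROXIES = [
    "http://user:pass@1.2.3.4:8080",
    "http://user:pass@5.6.7.8:8080",
    "http://user:pass@9.10.11.12:8080",
]


def proxy_for_session(session_id, proxies=PROXIES):
    """map a session id to a fixed slot in the proxy list."""
    digest = hashlib.md5(session_id.encode()).hexdigest()
    return proxies[int(digest, 16) % len(proxies)]


# same id, same proxy, on every call and in every process
print(proxy_for_session("user_abc") == proxy_for_session("user_abc"))

# four ids over three proxies: at least two ids must share a proxy
ids = ["user_a", "user_b", "user_c", "user_d"]
print(len({proxy_for_session(s) for s in ids}), "distinct proxies for", len(ids), "ids")
```

note that the mapping depends on the list order and length, so adding or removing a proxy reshuffles existing sessions.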
for the deeper architecture pattern across multiple workers, see our proxy load balancing architecture guide.
header spoofing alongside proxies
a fresh ip with stale headers fingerprints obviously. pair the rotating middleware with rotating user agents and accept-language headers:
import random


class RotatingHeadersMiddleware:
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        request.headers["Accept-Language"] = "en-US,en;q=0.9"
        # only advertise "br" if a brotli package is installed;
        # otherwise scrapy cannot decode brotli-compressed responses
        request.headers["Accept-Encoding"] = "gzip, deflate, br"
enable below the proxy middleware in settings.py:
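DOWNLOADER_MIDDLEWARES = {
    # the rotating middleware runs before HttpProxyMiddleware (750) so the
    # credentials in the proxy url become a Proxy-Authorization header
    "myproject.middlewares.RotatingProxyMiddleware": 740,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 750,
    "myproject.middlewares.RotatingHeadersMiddleware": 770,
}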
for finer fingerprint control (tls, http2, browser headers), use a managed scraping api or a headless browser. plain http requests cannot fully spoof a chrome client.
scrapy retry settings
scrapy ships with a retry middleware. configure it to match the rotating proxy logic:
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [403, 408, 429, 500, 502, 503, 504]
DOWNLOAD_TIMEOUT = 15
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 8
DOWNLOAD_DELAY = 0.5
RANDOMIZE_DOWNLOAD_DELAY = True
RETRY_HTTP_CODES = [403, 408, 429, 500, 502, 503, 504] retries common rate-limit and proxy-failure responses. combined with the rotating middleware, each retry picks a fresh proxy.
CONCURRENT_REQUESTS_PER_DOMAIN = 8 is conservative. tune up for tolerant targets, down for strict ones. the rotating middleware does not rate-limit; that is the autothrottle’s job.
autothrottle for rate-limit safety
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 60.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0
AUTOTHROTTLE_DEBUG = False
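autothrottle adjusts the download delay from observed response latency, so it backs off when the target slows down, and error responses are never allowed to shorten the delay. with rotating proxies, this keeps a burst of traffic from getting your full pool blocked at once.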
handling 407 proxy auth required
if you see 407 proxy authentication required errors, three checks:
- proxy url format is http://user:pass@host:port exactly. no leading whitespace, no url-encoded user.
- some providers require username sessions (user-session-abc123). use the full session-username from your dashboard.
- scrapy’s httpproxymiddleware does not always pass the basic-auth header automatically for some legacy versions. if you hit this, add proxy-authorization explicitly:
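from base64 import b64encode
from urllib.parse import unquote, urlsplit


class ProxyAuthMiddleware:
    def process_request(self, request, spider):
        proxy = request.meta.get("proxy")
        if not proxy:
            return
        # urlsplit copes with "@" or ":" inside the password,
        # which naive string splitting does not
        parts = urlsplit(proxy)
        if not parts.username:
            return
        creds = f"{unquote(parts.username)}:{unquote(parts.password or '')}"
        token = b64encode(creds.encode()).decode()
        request.headers["Proxy-Authorization"] = f"Basic {token}"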
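httpproxymiddleware in current scrapy releases sets this header itself whenever it sees credentials in request.meta["proxy"], so you only need this snippet on older versions or when the proxy is assigned after httpproxymiddleware has already run.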
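the first check above is easy to automate before a crawl starts. a minimal sketch, assuming plain http or https proxy urls; proxy_url_problems is a hypothetical helper, not part of scrapy.

```python
from urllib.parse import urlsplit


def proxy_url_problems(url):
    """return a list of format problems found in a proxy url."""
    problems = []
    if url != url.strip():
        problems.append("leading or trailing whitespace")
    parts = urlsplit(url.strip())
    if parts.scheme not in ("http", "https"):
        problems.append("scheme should be http or https")
    if not parts.username or parts.password is None:
        problems.append("missing user:pass@ credentials")
    try:
        port = parts.port
    except ValueError:  # raised for a non-numeric or out-of-range port
        port = None
    if not parts.hostname or port is None:
        problems.append("missing host or port")
    return problems


for candidate in [
    "http://user:pass@1.2.3.4:8080",
    " http://user:pass@1.2.3.4:8080",
    "http://1.2.3.4:8080",
]:
    print(repr(candidate), proxy_url_problems(candidate))
```

running this over ROTATING_PROXIES at startup catches copy-paste mistakes before they show up as 407s mid-crawl.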
benchmark: 10,000 pages with rotating proxies
across 10,000 pages of a tolerant ecommerce target, with a 50-proxy residential pool, the configuration above completed in roughly 22 minutes on a single mac workstation. that is around 7.5 requests per second sustained.
failed requests (mostly 503s) hit 4 percent. retries succeeded 92 percent of the time. proxies marked bad: 11 of 50 over the run. all 11 came back online within an hour as cooldown expired.
scaling to 100,000 pages, the same config runs in 3 to 4 hours. for higher throughput, run multiple scrapy processes against the same proxy pool with a shared bad-proxy state stored in redis.
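the throughput and scaling figures follow from simple arithmetic, assuming the sustained rate stays constant as the job grows:

```python
pages = 10_000
minutes = 22

# sustained requests per second over the 10,000-page run
rps = pages / (minutes * 60)
print(f"{rps:.2f} req/s")  # 7.58 req/s

# projected wall time for 100,000 pages at the same rate
hours_100k = 100_000 / rps / 3600
print(f"{hours_100k:.2f} h")  # 3.67 h
```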
production checklist
four items separate hobby spiders from production scrapy deployments.
shared bad-proxy state. for multi-worker setups, store the bad-proxy list in redis instead of in-process memory. otherwise each worker re-discovers the same dead proxies independently.
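one way to sketch the shared state is a key per bad proxy with a ttl equal to its cooldown, so redis expires the entry for you. the class below is an assumption-laden outline: SharedBadProxyState and the key prefix are made-up names, and client is expected to behave like a redis-py client (its set(name, value, ex=...) and exists(name) methods).

```python
class SharedBadProxyState:
    """bad-proxy registry backed by a redis-like client (hypothetical helper)."""

    def __init__(self, client, prefix="badproxy:"):
        self.client = client
        self.prefix = prefix

    def mark_bad(self, proxy, cooldown_sec):
        # the key expires on its own when the cooldown ends
        self.client.set(self.prefix + proxy, "1", ex=int(cooldown_sec))

    def is_bad(self, proxy):
        return bool(self.client.exists(self.prefix + proxy))

    def live(self, proxies):
        """return the proxies that no worker has benched."""
        return [p for p in proxies if not self.is_bad(p)]


# usage, assuming the redis package and a reachable server:
#   import redis
#   state = SharedBadProxyState(redis.Redis(host="localhost", port=6379))
#   state.mark_bad("http://1.2.3.4:8080", 300)
```

proxy urls that embed credentials end up in redis key names with this layout, so consider hashing the url or keying on host:port instead.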
per-domain proxy pools. for sites that ban entire ranges, segment your proxy pool by target domain. keep a clean residential pool for hard targets and reuse a cheaper datacenter pool for tolerant ones.
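a minimal sketch of pool selection by target domain. the domain names, pool contents, and the pool_for helper are placeholders for illustration.

```python
import random
from urllib.parse import urlsplit

# placeholder pools: a residential pool for one hard target, datacenter for the rest
POOLS = {
    "hard-target.example": ["http://user:pass@10.0.0.1:8080"],
    "default": ["http://user:pass@1.2.3.4:8080", "http://user:pass@5.6.7.8:8080"],
}


def pool_for(url, pools=POOLS):
    """return the proxy pool for a url, matching the host or a parent domain."""
    host = urlsplit(url).hostname or ""
    for domain, pool in pools.items():
        if domain != "default" and (host == domain or host.endswith("." + domain)):
            return pool
    return pools["default"]


print(random.choice(pool_for("https://shop.hard-target.example/p/1")))
print(random.choice(pool_for("https://example.org/")))
```

in a rotating middleware, the pool lookup would replace the single self.proxies list inside get_proxy.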
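playwright integration. for js-heavy targets, use scrapy-playwright. the browser does not read request.meta["proxy"]; instead the proxy is set per browser context via request.meta["playwright_context_kwargs"]["proxy"], so your rotation logic has to choose the context as well as the proxy.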
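a sketch of building that meta from a proxy url. playwright_proxy_meta is a hypothetical helper; the meta keys and the server/username/password proxy dict follow scrapy-playwright and playwright conventions as we understand them, so check them against the versions you run.

```python
from urllib.parse import unquote, urlsplit


def playwright_proxy_meta(proxy_url, context_name):
    """build request.meta entries that route one browser context through a proxy."""
    parts = urlsplit(proxy_url)
    proxy = {"server": f"{parts.scheme}://{parts.hostname}:{parts.port}"}
    if parts.username:
        proxy["username"] = unquote(parts.username)
        proxy["password"] = unquote(parts.password or "")
    return {
        "playwright": True,
        # context kwargs only apply when the named context is first created,
        # so use one context name per proxy
        "playwright_context": context_name,
        "playwright_context_kwargs": {"proxy": proxy},
    }


meta = playwright_proxy_meta("http://user:pass@1.2.3.4:8080", "proxy-1")
print(meta["playwright_context_kwargs"]["proxy"])
```

pass the returned dict as meta= on the scrapy.Request, merged with any other meta the spider needs.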
logging. log every request with its proxy, latency, and status code, including the final status after retries. for postmortems on broken scrapes, this is what you analyze.
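a minimal logging middleware along these lines might look like the sketch below. ProxyLogMiddleware, redact, the logger name, and the _log_t0 meta key are all illustrative; the class uses only the standard downloader-middleware hooks.

```python
import logging
import time
from urllib.parse import urlsplit

logger = logging.getLogger("proxylog")


def redact(proxy_url):
    """drop credentials so they never reach the log."""
    if not proxy_url:
        return "-"
    parts = urlsplit(proxy_url)
    return f"{parts.scheme}://{parts.netloc.rpartition('@')[2]}"


class ProxyLogMiddleware:
    """hypothetical downloader middleware: one log line per response."""

    def process_request(self, request, spider):
        request.meta["_log_t0"] = time.monotonic()

    def process_response(self, request, response, spider):
        started = request.meta.get("_log_t0")
        latency = time.monotonic() - started if started is not None else float("nan")
        logger.info(
            "url=%s proxy=%s status=%d latency=%.3fs retries=%d",
            request.url,
            redact(request.meta.get("proxy")),
            response.status,
            latency,
            request.meta.get("retry_times", 0),
        )
        return response
```

the measured latency includes time spent in any middlewares that run after this one, so place it close to the downloader if you want network time only.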
for the broader python scraping context see our web scraping with python guide.
faq
what is the easiest way to add a proxy in scrapy?
set request.meta["proxy"] = "http://user:pass@host:port" per request. scrapy’s built-in httpproxymiddleware, which is enabled by default, handles the rest.
does scrapy support proxy rotation out of the box?
no. scrapy’s httpproxymiddleware uses one proxy per request based on request.meta. for rotation across requests, write a downloader middleware (full code in this tutorial) or install scrapy-rotating-proxies from pypi.
how do i use socks5 proxies with scrapy?
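not out of the box, as far as we know. scrapy’s built-in download handler speaks http and https proxies only, so a socks5:// url in request.meta["proxy"] will not work on its own. common workarounds are a local http-to-socks bridge (for example privoxy) in front of the socks5 proxy, or a custom download handler built on a socks library such as txsocksx.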
why am i getting 407 errors with scrapy proxies?
usually wrong credentials format. confirm http://user:pass@host:port exactly. for residential providers using session-id auth, paste the full session-username (e.g. user-session-abc123) in the user field.
should i use scrapy-rotating-proxies or write my own middleware?
scrapy-rotating-proxies is fine for simple rotation. for production with custom health checks, sticky sessions, or per-domain pools, write your own. the middleware in this tutorial is around 50 lines and gives full control.
how do i debug scrapy proxy issues?
run with -L DEBUG to see every request and proxy assignment. log the response status and request.meta["proxy"] in your spider’s parse methods. for tls or auth issues, run the same proxy against curl -x first to isolate scrapy from the proxy itself. official docs at the scrapy reference.
the bottom line
scrapy’s proxy story is built on three pieces: per-request meta, the built-in httpproxymiddleware, and your custom rotating middleware. with the middleware in this tutorial plus header rotation and autothrottle, you have a production-grade scraper that survives dead proxies, rate limits, and the long tail of target-specific failures.
for jobs above 100,000 pages or with strict anti-bot, pair this stack with residential proxies and a shared redis bad-proxy state. for lighter jobs, the in-process version above is enough.
start with the per-request pattern, add the rotating middleware once you have more than 5 proxies, and add sticky sessions when you hit your first login flow. each layer composes cleanly with scrapy’s existing machinery.