Proxy Load Balancing: Architecture & Implementation
Sending all your traffic through a single proxy is like routing every car through one toll booth. Proxy load balancing distributes requests across multiple proxies, maximizing throughput, minimizing detection, and ensuring reliability when one proxy goes down.
This guide covers load balancing algorithms, health checking, failover strategies, and production implementations for proxy pools.
Why Load Balance Proxies?
Without load balancing, you face:
- Single point of failure: One proxy goes down, everything stops
- Rate limiting: One IP making too many requests triggers blocks
- Uneven wear: Some proxies get hammered while others sit idle
- Geographic bottlenecks: All requests originate from one location
- Poor performance: No way to route to the fastest proxy
Load balancing solves all of these by distributing traffic intelligently across your proxy pool.
Load Balancing Algorithms
Round Robin
The simplest approach — cycle through proxies in order:
import itertools

class RoundRobinBalancer:
    def __init__(self, proxies):
        self.proxies = proxies
        self.cycle = itertools.cycle(proxies)

    def get_proxy(self):
        return next(self.cycle)

    def add_proxy(self, proxy):
        self.proxies.append(proxy)
        self.cycle = itertools.cycle(self.proxies)

    def remove_proxy(self, proxy):
        self.proxies.remove(proxy)
        self.cycle = itertools.cycle(self.proxies)
Pros: Simple, even distribution
Cons: Doesn’t account for proxy speed or health
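A quick usage sketch (the proxy URLs are placeholders):

balancer = RoundRobinBalancer([
    "http://proxy1:8080",
    "http://proxy2:8080",
    "http://proxy3:8080",
])
proxy = balancer.get_proxy()  # proxy1, then proxy2, proxy3, proxy1, ...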
Weighted Round Robin
Assign more traffic to faster or more reliable proxies:
import itertools

class WeightedRoundRobinBalancer:
    def __init__(self, proxy_weights):
        """proxy_weights: dict of {proxy_url: weight}"""
        self.proxy_weights = proxy_weights
        self._build_sequence()

    def _build_sequence(self):
        # A proxy with weight N appears N times in the rotation
        self.sequence = []
        for proxy, weight in self.proxy_weights.items():
            self.sequence.extend([proxy] * weight)
        self.cycle = itertools.cycle(self.sequence)

    def get_proxy(self):
        return next(self.cycle)

    def update_weight(self, proxy, weight):
        self.proxy_weights[proxy] = weight
        self._build_sequence()
# Usage
balancer = WeightedRoundRobinBalancer({
    "http://fast-proxy:8080": 5,     # Gets 5x traffic
    "http://medium-proxy:8080": 3,   # Gets 3x traffic
    "http://slow-proxy:8080": 1,     # Gets 1x traffic
})
Least Connections
Route to the proxy with fewest active requests:
import threading

class LeastConnectionsBalancer:
    def __init__(self, proxies):
        # Map each proxy to its count of in-flight requests
        self.proxies = {proxy: 0 for proxy in proxies}
        self.lock = threading.Lock()

    def get_proxy(self):
        with self.lock:
            proxy = min(self.proxies, key=self.proxies.get)
            self.proxies[proxy] += 1
            return proxy

    def release_proxy(self, proxy):
        with self.lock:
            self.proxies[proxy] = max(0, self.proxies[proxy] - 1)
Best for: Long-running requests where some proxies are slower than others.
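The connection count only drops when release_proxy is called, so wrap each request in try/finally (a minimal sketch; the URL and proxy list are placeholders):

import requests

balancer = LeastConnectionsBalancer(["http://proxy1:8080", "http://proxy2:8080"])

proxy = balancer.get_proxy()
try:
    response = requests.get("https://example.com",
                            proxies={"http": proxy, "https": proxy}, timeout=30)
finally:
    balancer.release_proxy(proxy)  # Release even if the request failed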
Latency-Based
Route to the proxy with the lowest recent response time:
import random
import time
from collections import defaultdict

class LatencyBasedBalancer:
    def __init__(self, proxies):
        self.proxies = proxies
        self.latencies = defaultdict(list)
        self.window_size = 10

    def get_proxy(self):
        if not all(self.latencies[p] for p in self.proxies):
            # Not enough data yet — fall back to random choice
            return random.choice(self.proxies)
        avg_latencies = {
            proxy: sum(times[-self.window_size:]) / len(times[-self.window_size:])
            for proxy, times in self.latencies.items()
            if times
        }
        return min(avg_latencies, key=avg_latencies.get)

    def record_latency(self, proxy, latency):
        self.latencies[proxy].append(latency)
        # Keep only recent measurements
        if len(self.latencies[proxy]) > 100:
            self.latencies[proxy] = self.latencies[proxy][-50:]
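The balancer only improves on random choice once it has measurements, so time every request and report back (a sketch; URL and proxies are placeholders):

import time
import requests

balancer = LatencyBasedBalancer(["http://proxy1:8080", "http://proxy2:8080"])

proxy = balancer.get_proxy()
start = time.time()
response = requests.get("https://example.com",
                        proxies={"http": proxy, "https": proxy}, timeout=30)
balancer.record_latency(proxy, time.time() - start)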
Random with Health Checks
Simple but effective — randomly select from healthy proxies:
import random

class RandomHealthyBalancer:
    def __init__(self, proxies):
        self.healthy_proxies = set(proxies)
        self.unhealthy_proxies = set()

    def get_proxy(self):
        if not self.healthy_proxies:
            raise Exception("No healthy proxies available")
        return random.choice(list(self.healthy_proxies))

    def mark_unhealthy(self, proxy):
        self.healthy_proxies.discard(proxy)
        self.unhealthy_proxies.add(proxy)

    def mark_healthy(self, proxy):
        self.unhealthy_proxies.discard(proxy)
        self.healthy_proxies.add(proxy)
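Callers are responsible for reporting outcomes; in practice you would pair this with the active health checks shown later to move proxies back to healthy (a sketch; URL and proxies are placeholders):

import requests

balancer = RandomHealthyBalancer(["http://proxy1:8080", "http://proxy2:8080"])

proxy = balancer.get_proxy()
try:
    requests.get("https://example.com",
                 proxies={"http": proxy, "https": proxy}, timeout=30)
except requests.RequestException:
    balancer.mark_unhealthy(proxy)  # Skip this proxy until it recovers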
Production-Grade Proxy Load Balancer
Here’s a complete implementation combining multiple strategies:
import time
import random
import logging
import threading
from collections import defaultdict
from dataclasses import dataclass

logger = logging.getLogger(__name__)

@dataclass
class ProxyStats:
    total_requests: int = 0
    successful_requests: int = 0
    failed_requests: int = 0
    total_latency: float = 0
    last_used: float = 0
    last_failure: float = 0
    consecutive_failures: int = 0
    is_healthy: bool = True

class ProxyLoadBalancer:
    def __init__(self, proxies, strategy="weighted_random",
                 health_check_interval=60, max_failures=3):
        self.stats = {proxy: ProxyStats() for proxy in proxies}
        self.strategy = strategy
        self.max_failures = max_failures
        self.health_check_interval = health_check_interval
        self.lock = threading.Lock()
        self.domain_usage = defaultdict(lambda: defaultdict(float))

    def get_proxy(self, domain=None):
        """Get best proxy based on strategy and health."""
        with self.lock:
            healthy = [p for p, s in self.stats.items() if s.is_healthy]
            if not healthy:
                self._recover_proxies()
                healthy = [p for p, s in self.stats.items() if s.is_healthy]
                if not healthy:
                    raise Exception("All proxies are unhealthy")
            if domain:
                healthy = self._filter_by_domain(healthy, domain)
            if self.strategy == "round_robin":
                return self._round_robin(healthy)
            elif self.strategy == "weighted_random":
                return self._weighted_random(healthy)
            elif self.strategy == "least_latency":
                return self._least_latency(healthy)
            else:
                return random.choice(healthy)

    def _weighted_random(self, proxies):
        """Select proxy weighted by success rate."""
        weights = []
        for proxy in proxies:
            stats = self.stats[proxy]
            if stats.total_requests == 0:
                weights.append(1.0)  # No history yet: give it a fair chance
            else:
                success_rate = stats.successful_requests / stats.total_requests
                # Floor at 0.1 so weak proxies still get occasional traffic
                weights.append(max(0.1, success_rate))
        return random.choices(proxies, weights=weights, k=1)[0]

    def _least_latency(self, proxies):
        """Select proxy with lowest average latency."""
        def avg_latency(proxy):
            stats = self.stats[proxy]
            if stats.successful_requests == 0:
                return float("inf")
            return stats.total_latency / stats.successful_requests
        return min(proxies, key=avg_latency)

    def _round_robin(self, proxies):
        """Round robin by picking the least recently used proxy."""
        oldest = min(proxies, key=lambda p: self.stats[p].last_used)
        self.stats[oldest].last_used = time.time()
        return oldest

    def _filter_by_domain(self, proxies, domain):
        """Avoid using the same proxy for the same domain too frequently."""
        now = time.time()
        filtered = [
            p for p in proxies
            if now - self.domain_usage[domain][p] > 5
        ]
        return filtered or proxies

    def report_success(self, proxy, latency, domain=None):
        """Record successful request."""
        with self.lock:
            stats = self.stats[proxy]
            stats.total_requests += 1
            stats.successful_requests += 1
            stats.total_latency += latency
            stats.last_used = time.time()
            stats.consecutive_failures = 0
            if domain:
                self.domain_usage[domain][proxy] = time.time()

    def report_failure(self, proxy, domain=None):
        """Record failed request and potentially mark unhealthy."""
        with self.lock:
            stats = self.stats[proxy]
            stats.total_requests += 1
            stats.failed_requests += 1
            stats.last_failure = time.time()
            stats.consecutive_failures += 1
            if stats.consecutive_failures >= self.max_failures:
                stats.is_healthy = False
                logger.warning(f"Proxy {proxy} marked unhealthy "
                               f"after {self.max_failures} consecutive failures")

    def _recover_proxies(self):
        """Try to recover unhealthy proxies after cooldown."""
        now = time.time()
        for proxy, stats in self.stats.items():
            if not stats.is_healthy:
                if now - stats.last_failure > self.health_check_interval:
                    stats.is_healthy = True
                    stats.consecutive_failures = 0
                    logger.info(f"Proxy {proxy} recovered")

    def get_stats(self):
        """Return stats for all proxies."""
        return {
            proxy: {
                "healthy": stats.is_healthy,
                "success_rate": (
                    stats.successful_requests / stats.total_requests
                    if stats.total_requests > 0 else 0
                ),
                "avg_latency": (
                    stats.total_latency / stats.successful_requests
                    if stats.successful_requests > 0 else 0
                ),
                "total_requests": stats.total_requests
            }
            for proxy, stats in self.stats.items()
        }

    def add_proxy(self, proxy):
        with self.lock:
            self.stats[proxy] = ProxyStats()

    def remove_proxy(self, proxy):
        with self.lock:
            del self.stats[proxy]
Using the Load Balancer
import time
import requests
from urllib.parse import urlparse

balancer = ProxyLoadBalancer(
    proxies=[
        "http://proxy1:8080",
        "http://proxy2:8080",
        "http://proxy3:8080",
        "http://proxy4:8080",
        "http://proxy5:8080",
    ],
    strategy="weighted_random",
    max_failures=3
)

def scrape(url):
    domain = urlparse(url).netloc
    proxy = balancer.get_proxy(domain=domain)
    start = time.time()
    try:
        response = requests.get(
            url,
            proxies={"http": proxy, "https": proxy},
            timeout=30
        )
        response.raise_for_status()
        latency = time.time() - start
        balancer.report_success(proxy, latency, domain)
        return response
    except Exception:
        balancer.report_failure(proxy, domain)
        raise
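Because scrape() re-raises after reporting the failure, a thin wrapper can retry through a different proxy, which the balancer will now favor (a sketch; max_attempts is an illustrative parameter):

def scrape_with_retries(url, max_attempts=3):
    last_error = None
    for _ in range(max_attempts):
        try:
            # Each retry re-selects a proxy; failed proxies are down-weighted
            return scrape(url)
        except Exception as e:
            last_error = e
    raise last_error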
HAProxy Configuration for Proxy Load Balancing
For infrastructure-level load balancing, use HAProxy:
# haproxy.cfg
global
    maxconn 4096
    log stdout format raw local0

defaults
    mode tcp
    timeout connect 10s
    timeout client 30s
    timeout server 30s
    retries 3

frontend proxy_frontend
    bind *:8080
    default_backend proxy_pool

backend proxy_pool
    balance roundrobin
    option tcp-check
    server proxy1 proxy1.example.com:8080 check inter 30s fall 3 rise 2 weight 100
    server proxy2 proxy2.example.com:8080 check inter 30s fall 3 rise 2 weight 100
    server proxy3 proxy3.example.com:8080 check inter 30s fall 3 rise 2 weight 50
    server proxy4 proxy4.example.com:8080 check inter 30s fall 3 rise 2 backup

listen stats
    bind *:8404
    mode http
    stats enable
    stats uri /stats
    stats refresh 10s
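With this config, your scraper treats HAProxy as a single proxy endpoint while HAProxy fans connections out across the pool (a sketch assuming HAProxy runs locally on port 8080):

import requests

# HAProxy's frontend looks like one proxy; the backend spreads the load
haproxy = "http://localhost:8080"
response = requests.get(
    "https://example.com",
    proxies={"http": haproxy, "https": haproxy},
    timeout=30,
)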
Nginx Load Balancing for Proxies
# nginx.conf
upstream proxy_pool {
    least_conn;
    server proxy1.example.com:8080 weight=5 max_fails=3 fail_timeout=30s;
    server proxy2.example.com:8080 weight=3 max_fails=3 fail_timeout=30s;
    server proxy3.example.com:8080 weight=1 max_fails=3 fail_timeout=30s;
    server proxy4.example.com:8080 backup;
}

server {
    listen 8080;
    location / {
        proxy_pass http://proxy_pool;
        proxy_connect_timeout 10s;
        proxy_read_timeout 30s;
        proxy_next_upstream error timeout http_502 http_503;
        proxy_next_upstream_tries 3;
    }
}
Note that stock Nginx does not implement the CONNECT method, so this setup only load balances plain-HTTP proxy traffic; for HTTPS tunneled through forward proxies, HAProxy's TCP mode is the safer choice.
Health Checking Strategies
Active Health Checks
Periodically test proxies with known-good requests:
import asyncio
import aiohttp

class ProxyHealthChecker:
    def __init__(self, balancer, check_url="http://httpbin.org/ip",
                 interval=60):
        self.balancer = balancer
        self.check_url = check_url
        self.interval = interval

    async def check_proxy(self, proxy):
        try:
            async with aiohttp.ClientSession() as session:
                async with session.get(
                    self.check_url,
                    proxy=proxy,
                    timeout=aiohttp.ClientTimeout(total=10)
                ) as response:
                    if response.status == 200:
                        return True
        except Exception:
            pass
        return False

    async def run_health_checks(self):
        while True:
            for proxy in list(self.balancer.stats.keys()):
                healthy = await self.check_proxy(proxy)
                if healthy:
                    self.balancer.stats[proxy].is_healthy = True
                    self.balancer.stats[proxy].consecutive_failures = 0
                else:
                    self.balancer.report_failure(proxy)
            await asyncio.sleep(self.interval)
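One way to run the checker is as a background task alongside your async scraping code (a sketch; do_scraping is a placeholder for your own coroutine):

import asyncio

async def main():
    checker = ProxyHealthChecker(balancer, interval=60)
    health_task = asyncio.create_task(checker.run_health_checks())
    try:
        await do_scraping()  # Placeholder for your async workload
    finally:
        health_task.cancel()  # Stop checking on shutdown

asyncio.run(main())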
Passive Health Checks
Monitor actual request outcomes (already built into the ProxyLoadBalancer above via report_success and report_failure). This is more efficient than active checking because it generates no extra traffic, though it only detects a dead proxy once a real request fails through it.
Geographic Load Balancing
Route requests to geographically appropriate proxies:
from urllib.parse import urlparse

class GeoProxyBalancer:
    def __init__(self):
        # us_proxies, eu_proxies, asia_proxies are your region-specific proxy lists
        self.geo_pools = {
            "us": ProxyLoadBalancer(us_proxies),
            "eu": ProxyLoadBalancer(eu_proxies),
            "asia": ProxyLoadBalancer(asia_proxies),
        }
        self.domain_regions = {
            "amazon.com": "us",
            "amazon.co.uk": "eu",
            "amazon.co.jp": "asia",
        }

    def get_proxy(self, url):
        domain = urlparse(url).netloc
        region = self.domain_regions.get(domain, "us")  # Default to the US pool
        return self.geo_pools[region].get_proxy(domain)
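A quick usage sketch (the URL is a placeholder; note that netloc must match a domain_regions key exactly, so "www." prefixes would need normalizing):

geo = GeoProxyBalancer()
proxy = geo.get_proxy("https://amazon.co.uk/some-product")  # Routed via the EU pool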
FAQ
Which load balancing algorithm is best for web scraping?
Weighted random with health checks works best for most scraping. It distributes load while accounting for proxy quality, and the randomness makes traffic patterns less detectable.
How many proxies should I load balance across?
For light scraping (1K pages/day), 5-10 proxies suffice. For heavy scraping (100K+/day), use 50-100+ proxies. With residential proxy providers, the provider handles the pool — you just connect to their gateway.
Should I use application-level or infrastructure-level load balancing?
Application-level (Python code) gives you more control over retry logic and per-domain policies. Infrastructure-level (HAProxy/Nginx) is simpler and offloads work from your scraper. Many production setups use both.
How do I test my load balancer?
Use a test endpoint like httpbin.org/ip to verify that different proxies are being used. Log which proxy handles each request and check the distribution matches your strategy.
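For example, a quick distribution check might look like this (a sketch; assumes a configured balancer as above and working proxies):

from collections import Counter

seen = Counter()
for _ in range(100):
    proxy = balancer.get_proxy()
    response = requests.get("http://httpbin.org/ip",
                            proxies={"http": proxy, "https": proxy}, timeout=10)
    seen[response.json()["origin"]] += 1  # httpbin echoes the exit IP

print(seen)  # Each exit IP's share should roughly match your strategy's weights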
What happens when all proxies fail?
Implement a circuit breaker that pauses scraping and alerts you. After a cooldown period, gradually re-test proxies. Always have backup proxy sources configured.
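A minimal sketch of that pattern (the cooldown value and alerting hook are illustrative):

import time

class PoolCircuitBreaker:
    def __init__(self, balancer, cooldown=300):
        self.balancer = balancer
        self.cooldown = cooldown
        self.tripped_at = None

    def get_proxy(self, domain=None):
        # While the circuit is open, refuse immediately instead of hammering dead proxies
        if self.tripped_at and time.time() - self.tripped_at < self.cooldown:
            raise RuntimeError("Circuit open: proxy pool cooling down")
        try:
            proxy = self.balancer.get_proxy(domain=domain)
            self.tripped_at = None  # Pool recovered, close the circuit
            return proxy
        except Exception:
            self.tripped_at = time.time()
            # Hook in your alerting here (email, PagerDuty, etc.)
            raise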
Conclusion
Proxy load balancing is essential for any serious scraping operation. Start with weighted random selection for simplicity, add health checks for reliability, and implement domain-aware routing to avoid detection. The ProxyLoadBalancer class in this guide handles all of these patterns and can be dropped into any Python scraping project.