Proxy Connection Pooling: Maximize Throughput & Reduce Overhead

Every new TCP connection through a proxy pays a three-way handshake to the proxy, a CONNECT round trip while the proxy opens its own connection to the target, and usually a TLS handshake on top — typically 100-300ms of overhead per request. Connection pooling reuses established connections, eliminating this overhead and delivering 3-10x throughput improvements.

This guide covers how connection pooling works with proxy servers, optimal pool sizing, and implementations across popular languages.

Why Connection Pooling Matters

Without Pooling (new connection per request):
Request 1: TCP handshake (50ms) → TLS (80ms) → CONNECT (30ms) → GET (100ms) = 260ms
Request 2: TCP handshake (50ms) → TLS (80ms) → CONNECT (30ms) → GET (100ms) = 260ms
Request 3: TCP handshake (50ms) → TLS (80ms) → CONNECT (30ms) → GET (100ms) = 260ms
Total: 780ms for 3 requests

With Pooling (reuse connections):
Request 1: TCP handshake (50ms) → TLS (80ms) → CONNECT (30ms) → GET (100ms) = 260ms
Request 2: GET (100ms) = 100ms  (reused connection)
Request 3: GET (100ms) = 100ms  (reused connection)
Total: 460ms for 3 requests (41% faster)

For 1,000 requests, the difference compounds:

  • Without pooling: ~260 seconds
  • With pooling: ~100 seconds (2.6x faster)
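Consistent with the per-request breakdown above, these totals can be reproduced in a few lines (the timings are the illustrative figures from the breakdown, not measurements):

```python
HANDSHAKE_MS = 50 + 80 + 30  # TCP + TLS + CONNECT, figures from the breakdown above
GET_MS = 100

def total_ms(n_requests: int, pooled: bool) -> int:
    """Sequential total under the illustrative timings above."""
    if pooled:
        return HANDSHAKE_MS + n_requests * GET_MS  # Handshake paid once
    return n_requests * (HANDSHAKE_MS + GET_MS)    # Handshake paid every time

print(total_ms(3, pooled=False))     # 780
print(total_ms(3, pooled=True))      # 460
print(total_ms(1000, pooled=True))   # 100160 (~100s)
```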

Python: httpx Connection Pooling

import httpx
import asyncio
import time

class PooledProxyScraper:
    """Scraper with optimized connection pooling through proxy."""

    def __init__(self, proxy_url: str):
        self.proxy_url = proxy_url
        self.client = httpx.AsyncClient(
            proxy=proxy_url,
            limits=httpx.Limits(
                max_connections=100,          # Total pool size
                max_keepalive_connections=20, # Keep-alive connections
                keepalive_expiry=30,          # Seconds before expiry
            ),
            timeout=httpx.Timeout(
                connect=10,
                read=30,
                write=10,
                pool=5,  # Wait for available connection
            ),
            http2=True,  # HTTP/2 multiplexing (requires the httpx[http2] extra)
        )

    async def scrape(self, urls: list) -> list:
        """Scrape URLs using pooled connections."""
        tasks = [self._fetch(url) for url in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)

    async def _fetch(self, url: str):
        response = await self.client.get(url)
        return {
            "url": url,
            "status": response.status_code,
            "size": len(response.content),
        }

    async def close(self):
        await self.client.aclose()

# Benchmark: pooled vs unpooled
async def benchmark():
    urls = [f"https://httpbin.org/get?i={i}" for i in range(100)]
    proxy = "http://user:pass@proxy.example.com:8080"

    # Pooled
    scraper = PooledProxyScraper(proxy)
    start = time.time()
    results = await scraper.scrape(urls)
    pooled_time = time.time() - start
    await scraper.close()

    # Unpooled (new client per request)
    start = time.time()
    for url in urls:
        async with httpx.AsyncClient(proxy=proxy) as client:
            await client.get(url)
    unpooled_time = time.time() - start

    print(f"Pooled:   {pooled_time:.1f}s ({len(urls)/pooled_time:.1f} req/s)")
    print(f"Unpooled: {unpooled_time:.1f}s ({len(urls)/unpooled_time:.1f} req/s)")
    print(f"Speedup:  {unpooled_time/pooled_time:.1f}x")

asyncio.run(benchmark())

Pool Sizing Guidelines

Pool Size Formula:
optimal_pool_size = concurrent_requests * (1 + avg_latency / avg_processing_time)

Example:
- 20 concurrent scrapers
- 200ms average proxy latency
- 50ms processing time per response

optimal_pool_size = 20 * (1 + 200/50) = 20 * 5 = 100 connections

Rules of Thumb:
┌──────────────────────────┬──────────────────┐
│ Scenario                 │ Pool Size        │
├──────────────────────────┼──────────────────┤
│ Light scraping (1-5 rps) │ 10-20            │
│ Medium (5-50 rps)        │ 20-50            │
│ Heavy (50-200 rps)       │ 50-100           │
│ Enterprise (200+ rps)    │ 100-500          │
└──────────────────────────┴──────────────────┘
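The formula above translates directly into a small helper (a sketch; the function name is mine):

```python
def optimal_pool_size(concurrent: int, latency_ms: float, processing_ms: float) -> int:
    """Pool size formula from above: each worker spends latency/processing
    of its time waiting on the network, so extra connections keep it busy."""
    return round(concurrent * (1 + latency_ms / processing_ms))

print(optimal_pool_size(20, 200, 50))  # 100, matching the worked example
```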

Dynamic Pool Sizing

class DynamicPool:
    """Automatically adjust pool size based on demand."""

    def __init__(self, proxy_url, min_size=10, max_size=200):
        self.proxy_url = proxy_url
        self.min_size = min_size
        self.max_size = max_size
        self.current_size = min_size
        self._rebuild_client()

    def _rebuild_client(self):
        self.client = httpx.AsyncClient(
            proxy=self.proxy_url,
            limits=httpx.Limits(
                max_connections=self.current_size,
                max_keepalive_connections=self.current_size // 2,
            ),
        )

    async def adjust_pool(self, pending_requests: int):
        """Scale pool based on pending requests."""
        target = min(max(pending_requests * 2, self.min_size), self.max_size)

        if target != self.current_size:
            # aclose() aborts in-flight requests — only resize between batches
            await self.client.aclose()
            self.current_size = target
            self._rebuild_client()
            print(f"Pool resized to {self.current_size}")
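The scaling rule inside adjust_pool is just a clamped doubling; isolated as a pure function for illustration (the function name is mine):

```python
def target_pool_size(pending: int, min_size: int = 10, max_size: int = 200) -> int:
    """Clamped scaling rule used by DynamicPool.adjust_pool above:
    twice the pending requests, bounded by min_size and max_size."""
    return min(max(pending * 2, min_size), max_size)

print(target_pool_size(3))    # 10  (floor applies)
print(target_pool_size(40))   # 80
print(target_pool_size(500))  # 200 (ceiling applies)
```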

Connection Pooling with aiohttp

import aiohttp
import asyncio

async def scrape_with_aiohttp_pool(urls, proxy_url):
    """aiohttp has built-in connection pooling via TCPConnector."""
    connector = aiohttp.TCPConnector(
        limit=100,              # Total connection limit
        limit_per_host=10,      # Per-host limit
        ttl_dns_cache=300,      # DNS cache TTL
        keepalive_timeout=30,   # Keep-alive timeout
        enable_cleanup_closed=True,
    )

    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = []
        for url in urls:
            tasks.append(session.get(url, proxy=proxy_url))

        responses = await asyncio.gather(*tasks)
        results = []
        for resp in responses:
            results.append({
                "status": resp.status,
                "size": len(await resp.read()),
            })
            resp.release()  # No-op after a full read() (aiohttp auto-releases), kept for clarity

        return results

Node.js Connection Pooling

const { HttpsProxyAgent } = require('https-proxy-agent');

// Create agent with connection pooling
const proxyAgent = new HttpsProxyAgent('http://user:pass@proxy.com:8080', {
    keepAlive: true,
    keepAliveMsecs: 30000,
    maxSockets: 100,
    maxFreeSockets: 20,
    timeout: 30000,
});

// Use with fetch or axios
const axios = require('axios');
const instance = axios.create({
    httpAgent: proxyAgent,
    httpsAgent: proxyAgent,
    timeout: 30000,
});

async function scrapeWithPool(urls) {
    const results = await Promise.all(
        urls.map(url => instance.get(url).catch(err => ({ error: err.message })))
    );
    return results;
}

Go Connection Pooling

package main

import (
    "fmt"
    "io"
    "net/http"
    "net/url"
    "time"
)

func createPooledClient(proxyURL string) *http.Client {
    proxy, err := url.Parse(proxyURL)
    if err != nil {
        panic(err) // An invalid proxy URL is a programming error here
    }

    transport := &http.Transport{
        Proxy:                 http.ProxyURL(proxy),
        MaxIdleConns:          100,
        MaxIdleConnsPerHost:   20,
        MaxConnsPerHost:       50,
        IdleConnTimeout:       30 * time.Second,
        TLSHandshakeTimeout:   10 * time.Second,
        ResponseHeaderTimeout: 10 * time.Second,
        DisableKeepAlives:     false, // Enable keep-alive
    }

    return &http.Client{
        Transport: transport,
        Timeout:   30 * time.Second,
    }
}

func main() {
    client := createPooledClient("http://user:pass@proxy.com:8080")

    urls := []string{
        "https://httpbin.org/get",
        "https://httpbin.org/ip",
    }

    for _, u := range urls {
        resp, err := client.Get(u)
        if err != nil {
            fmt.Printf("Error: %v\n", err)
            continue
        }
        body, _ := io.ReadAll(resp.Body)
        resp.Body.Close() // Important: close the body to return the connection
        fmt.Printf("%s: %d bytes\n", u, len(body))
    }
}

Common Pooling Mistakes

Mistake 1: Not Reading Streamed Response Bodies

Note: httpx's plain .get() reads the body eagerly, so this pitfall only bites streaming requests — the connection stays checked out until the streamed body is read or closed.

# WRONG — streamed body never consumed; connection stays checked out
async with httpx.AsyncClient(proxy=proxy) as client:
    request = client.build_request("GET", url)
    response = await client.send(request, stream=True)
    # Body never read or closed — connection cannot be reused

# RIGHT — read (or close) the stream to return the connection
async with httpx.AsyncClient(proxy=proxy) as client:
    async with client.stream("GET", url) as response:
        _ = await response.aread()  # Connection returned to pool

Mistake 2: Creating New Clients Per Request

# WRONG — no connection reuse
for url in urls:
    async with httpx.AsyncClient(proxy=proxy) as client:  # New pool each time
        await client.get(url)

# RIGHT — share the client
async with httpx.AsyncClient(proxy=proxy) as client:  # One pool
    for url in urls:
        await client.get(url)  # Connections reused

Mistake 3: Pool Too Small

# WRONG — pool bottleneck, requests queue up
client = httpx.AsyncClient(
    proxy=proxy,
    limits=httpx.Limits(max_connections=5)  # Too small for 100 concurrent tasks
)

# RIGHT — size pool to match concurrency
client = httpx.AsyncClient(
    proxy=proxy,
    limits=httpx.Limits(max_connections=100)
)

Monitoring Pool Health

class PoolMonitor:
    """Monitor connection pool utilization.

    Relies on httpx/httpcore private attributes, which may change
    between versions — treat this as a debugging aid, not a stable API.
    """

    def __init__(self, client: httpx.AsyncClient):
        self.client = client

    def get_pool_stats(self):
        pool = self.client._transport._pool  # httpcore.AsyncConnectionPool (private)
        connections = pool.connections
        return {
            "total_connections": len(connections),
            "idle_connections": sum(1 for c in connections if c.is_idle()),
            "queued_requests": len(pool._requests),  # Waiting for a free connection
        }

FAQ

What is the difference between connection pooling and HTTP/2 multiplexing?

Connection pooling reuses TCP connections sequentially — one request at a time per connection. HTTP/2 multiplexing sends multiple requests simultaneously over a single connection. They complement each other: pool HTTP/2 connections for the best performance.

How long should I keep idle connections alive?

30-60 seconds is typical. Too short and you waste connections; too long and you hold resources for proxies that may rotate your IP. Match the keep-alive timeout to your scraping pattern — continuous scraping benefits from longer timeouts.

Does connection pooling work with rotating proxies?

It depends. If the proxy gateway handles rotation on the server side, pooling works perfectly — you maintain connections to the gateway while the gateway rotates exit IPs. If rotation requires connecting to different proxy servers, each server gets its own pool entry.

Can connection pooling cause IP detection issues?

Yes, if you reuse connections too aggressively to the same target. A single connection sending hundreds of requests looks automated. Balance pooling efficiency with natural browsing patterns by limiting requests-per-connection.
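One way to limit requests-per-connection is to count and recycle; a minimal, transport-agnostic sketch (the class name and cap are illustrative, not from any library):

```python
class ConnectionRecycler:
    """Track requests served per connection and signal when a
    connection has hit its cap and should be closed and replaced."""

    def __init__(self, max_requests_per_conn: int = 50):
        self.max_requests = max_requests_per_conn
        self.counts: dict = {}

    def should_recycle(self, conn_id: str) -> bool:
        """Record one request on conn_id; True once the cap is reached."""
        self.counts[conn_id] = self.counts.get(conn_id, 0) + 1
        if self.counts[conn_id] >= self.max_requests:
            del self.counts[conn_id]  # Reset for the replacement connection
            return True
        return False
```

Wire should_recycle into your fetch loop and rebuild the client (or drop the connection) whenever it returns True.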

How do I handle connection pool exhaustion?

Set a pool timeout to wait briefly for available connections before failing. Monitor pool utilization and increase the max size if requests frequently wait. For spiky workloads, use dynamic pool sizing that scales with demand.
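The fail-fast behavior can be sketched with a plain asyncio.Semaphore standing in for the pool (the request itself is stubbed out; names are illustrative):

```python
import asyncio

async def fetch_with_pool_guard(sem, url, timeout_s=5.0):
    """Wait up to timeout_s for a pool slot; fail fast instead of
    queueing forever when the pool is exhausted."""
    try:
        await asyncio.wait_for(sem.acquire(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return {"url": url, "error": "pool exhausted"}
    try:
        await asyncio.sleep(0)  # Stand-in for the real pooled request
        return {"url": url, "status": 200}
    finally:
        sem.release()

async def demo():
    free = asyncio.Semaphore(1)
    print(await fetch_with_pool_guard(free, "https://example.com"))
    exhausted = asyncio.Semaphore(0)  # No slots available
    print(await fetch_with_pool_guard(exhausted, "https://example.com", timeout_s=0.01))

asyncio.run(demo())
```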

