Python Requests + Proxies: Scraping Setup from Scratch

Not every scraping job needs a headless browser. When the target serves content in the initial HTML response — no JavaScript rendering required — the Python requests library with proxy rotation is faster, lighter, and cheaper than browser-based approaches. You can scrape thousands of pages per minute with minimal server resources.

This guide builds a complete proxy-enabled scraper from scratch using Python’s requests library. It covers proxy configuration, session management, rotation logic, retry strategies, HTTPS and SOCKS5 support, and a production-ready scraper class you can adapt to any project.

When to Use Requests vs. Browser Automation

Use requests for:

  • Server-rendered HTML pages (content visible in page source)
  • REST APIs and JSON endpoints
  • Sites without JavaScript challenges
  • High-volume scraping where speed matters
  • RSS feeds, XML sitemaps, and structured data endpoints

Use browser automation (Puppeteer, Playwright, Selenium) for:

  • Single-page applications (React, Angular, Vue)
  • Sites with JavaScript challenges (Cloudflare, Akamai)
  • Pages that require interaction (clicks, scrolls, form submissions)
  • Sites that validate browser fingerprints

If you are unsure, check the target site’s page source (Ctrl+U in your browser). If the data you need is in the raw HTML, requests will work. If the page source is mostly JavaScript with a nearly empty body, you need a browser.
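
You can run the same check programmatically. Here is a minimal sketch that fetches the raw HTML and looks for a string you expect on the rendered page (the URL and marker text are placeholders):

import requests

url = 'https://example.com/products'  # placeholder target
marker = 'Add to cart'  # text you expect to see on the rendered page

html = requests.get(url, timeout=15).text
if marker in html:
    print('Server-rendered: requests will work')
else:
    print('Likely JavaScript-rendered: use a browser')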

Basic Proxy Configuration

HTTP Proxy

import requests

proxies = {
    'http': 'http://proxy-host:proxy-port',
    'https': 'http://proxy-host:proxy-port'
}

response = requests.get('https://httpbin.org/ip', proxies=proxies)
print(response.json())

Both http and https keys are needed. The http key handles HTTP requests, and the https key handles HTTPS requests. Despite the https value starting with http://, the actual connection to the target site is still encrypted — the proxy acts as a tunnel via HTTP CONNECT.

Authenticated Proxy

proxies = {
    'http': 'http://username:password@proxy-host:proxy-port',
    'https': 'http://username:password@proxy-host:proxy-port'
}

response = requests.get('https://httpbin.org/ip', proxies=proxies)
print(response.json())

Credentials are passed inline in the URL. Special characters in passwords need URL encoding:

from urllib.parse import quote

password = quote('p@ss!word#123')
proxies = {
    'http': f'http://username:{password}@proxy-host:proxy-port',
    'https': f'http://username:{password}@proxy-host:proxy-port'
}

SOCKS5 Support

SOCKS5 proxies require the requests[socks] extra (which installs PySocks):

pip install requests[socks]

proxies = {
    'http': 'socks5h://username:password@proxy-host:proxy-port',
    'https': 'socks5h://username:password@proxy-host:proxy-port'
}

response = requests.get('https://httpbin.org/ip', proxies=proxies)

Use socks5h:// (with the h) to route DNS queries through the proxy. Without the h, DNS resolution happens locally, which can leak your real DNS queries.

Session Objects: Why They Matter

The requests.Session object persists settings across requests: cookies, headers, and connection pooling. For scraping, sessions are critical because they maintain state and reuse TCP connections for better performance.

Basic Session with Proxy

import requests

session = requests.Session()
session.proxies = {
    'http': 'http://user:pass@proxy-host:port',
    'https': 'http://user:pass@proxy-host:port'
}
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br'
})

# All requests through this session use the proxy and headers
response = session.get('https://example.com/page1')
response2 = session.get('https://example.com/page2')

Connection Pooling

Sessions reuse TCP connections to the same host, which significantly reduces latency for sequential requests to the same domain:

from requests.adapters import HTTPAdapter

session = requests.Session()
adapter = HTTPAdapter(
    pool_connections=10,
    pool_maxsize=20,
    max_retries=0  # We handle retries ourselves
)
session.mount('http://', adapter)
session.mount('https://', adapter)

Proxy Rotation

A single proxy IP will get rate-limited or blocked on most targets. Rotation is essential.

Simple Round-Robin Rotation

import requests
import itertools

proxy_list = [
    {'http': 'http://user:pass@proxy1:port', 'https': 'http://user:pass@proxy1:port'},
    {'http': 'http://user:pass@proxy2:port', 'https': 'http://user:pass@proxy2:port'},
    {'http': 'http://user:pass@proxy3:port', 'https': 'http://user:pass@proxy3:port'},
]

proxy_cycle = itertools.cycle(proxy_list)

def scrape_with_rotation(url):
    proxy = next(proxy_cycle)
    response = requests.get(url, proxies=proxy, timeout=15)
    return response

Weighted Random Rotation

Not all proxies perform equally. Weight the selection toward better-performing proxies:

import random

proxy_pool = [
    {'proxy': {'http': 'http://user:pass@proxy1:port', 'https': 'http://user:pass@proxy1:port'}, 'weight': 3, 'failures': 0},
    {'proxy': {'http': 'http://user:pass@proxy2:port', 'https': 'http://user:pass@proxy2:port'}, 'weight': 2, 'failures': 0},
    {'proxy': {'http': 'http://user:pass@proxy3:port', 'https': 'http://user:pass@proxy3:port'}, 'weight': 1, 'failures': 0},
]

def get_weighted_proxy():
    active_proxies = [p for p in proxy_pool if p['failures'] < 5]
    if not active_proxies:
        # Reset all proxies if all have failed
        for p in proxy_pool:
            p['failures'] = 0
        active_proxies = proxy_pool

    weights = [p['weight'] for p in active_proxies]
    selected = random.choices(active_proxies, weights=weights, k=1)[0]
    return selected

def mark_proxy_failure(proxy_entry):
    proxy_entry['failures'] += 1
    proxy_entry['weight'] = max(1, proxy_entry['weight'] - 1)

def mark_proxy_success(proxy_entry):
    proxy_entry['failures'] = 0
    proxy_entry['weight'] = min(5, proxy_entry['weight'] + 1)
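
Here is a minimal sketch of how these pieces fit together in a request loop — successes reward a proxy and failures demote it (the URL is a placeholder):

import requests

def scrape_weighted(url):
    entry = get_weighted_proxy()
    try:
        response = requests.get(url, proxies=entry['proxy'], timeout=15)
        response.raise_for_status()
        mark_proxy_success(entry)  # reset failures, bump weight
        return response
    except requests.RequestException:
        mark_proxy_failure(entry)  # count the failure, lower weight
        return None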

Rotating Gateway (Simplest)

If your mobile proxy provider offers a rotating gateway, the provider handles IP rotation and you do not need client-side rotation logic:

# Single endpoint -- provider rotates IP per request
proxies = {
    'http': 'http://user:pass@rotating-gateway.provider.com:port',
    'https': 'http://user:pass@rotating-gateway.provider.com:port'
}

session = requests.Session()
session.proxies = proxies

for url in urls:
    response = session.get(url, timeout=15)
    # Each request may use a different IP
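
To confirm the gateway is actually rotating, hit an IP echo endpoint a few times and compare the results:

# Each response should report a different origin IP
for _ in range(3):
    r = session.get('https://httpbin.org/ip', timeout=15)
    print(r.json()['origin'])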

Retry Logic and Error Handling

Network errors, proxy timeouts, and temporary blocks are inevitable. Robust retry logic keeps your scraper running.

Exponential Backoff Retry

import random
import time
import requests
from requests.exceptions import ProxyError, ConnectionError, Timeout, HTTPError

def scrape_with_retry(url, proxies, max_retries=3, base_delay=1):
    for attempt in range(max_retries):
        try:
            response = requests.get(
                url,
                proxies=proxies,
                timeout=15,
                headers={
                    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                                  'Chrome/120.0.0.0 Safari/537.36'
                }
            )
            response.raise_for_status()

            # Check for soft blocks (200 status but CAPTCHA content)
            if 'captcha' in response.text.lower() or 'access denied' in response.text.lower():
                delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
                print(f"Soft block detected on {url}. Retrying in {delay:.1f}s")
                time.sleep(delay)
                continue

            return response

        except (ProxyError, ConnectionError, Timeout) as e:
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"Attempt {attempt + 1} failed: {e}. Retrying in {delay:.1f}s")
            time.sleep(delay)

        except HTTPError as e:
            if response.status_code == 429:  # Rate limited
                delay = base_delay * (2 ** attempt) * 2
                print(f"Rate limited. Waiting {delay:.1f}s")
                time.sleep(delay)
            elif response.status_code in (403, 503):  # Blocked
                print(f"Blocked ({response.status_code}). Rotating proxy.")
                return None  # Signal to caller to rotate proxy
            else:
                raise

    return None
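
When scrape_with_retry returns None, the caller is expected to switch proxies. A minimal caller sketch, assuming the proxy_cycle iterator from the round-robin example above:

def scrape_rotating(url, max_proxy_switches=3):
    # Try up to max_proxy_switches different proxies for one URL
    for _ in range(max_proxy_switches):
        proxy = next(proxy_cycle)
        response = scrape_with_retry(url, proxy)
        if response is not None:
            return response
    return None  # every proxy failed or was blocked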

Timeout Configuration

Set both connection and read timeouts separately:

# (connection_timeout, read_timeout) in seconds
response = requests.get(url, proxies=proxies, timeout=(5, 15))

  • Connection timeout: How long to wait for the proxy to accept the connection. 5 seconds is reasonable for mobile proxies.
  • Read timeout: How long to wait for the response data. 15-30 seconds for complex pages.

Handling HTTPS Properly

SSL Verification

Always keep SSL verification enabled in production. Disabling it opens you to man-in-the-middle attacks:

# Do this
response = requests.get(url, proxies=proxies, verify=True)

# Avoid this unless debugging
response = requests.get(url, proxies=proxies, verify=False)

If you encounter SSL errors with specific proxies, it usually means the proxy is intercepting HTTPS traffic. Use a different proxy or contact the provider.
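
One way to screen proxies for this is to catch requests.exceptions.SSLError against a known-good HTTPS endpoint and drop any proxy that triggers it — a minimal sketch:

import requests

def proxy_passes_ssl(proxy, test_url='https://httpbin.org/ip'):
    # An SSLError here often means the proxy is intercepting HTTPS
    try:
        requests.get(test_url, proxies=proxy, timeout=10, verify=True)
        return True
    except requests.exceptions.SSLError:
        return False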

Certificate Pinning

Certificate pinning causes SSL errors only when a proxy intercepts and re-signs HTTPS traffic. The requests library handles standard HTTPS tunneling correctly (the proxy opens a CONNECT tunnel and never sees the encrypted content), so pinning is usually not an issue with properly configured HTTP CONNECT proxies.

Building a Complete Proxy Scraper

Here is a production-ready scraper class that combines everything:

import requests
from requests.adapters import HTTPAdapter
import random
import time
import logging
from urllib.parse import quote
from concurrent.futures import ThreadPoolExecutor, as_completed

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class ProxyRotatingScraper:
    def __init__(self, proxies, max_retries=3, concurrency=5):
        self.proxies = proxies
        self.max_retries = max_retries
        self.concurrency = concurrency
        self.session = self._create_session()

    def _create_session(self):
        session = requests.Session()
        adapter = HTTPAdapter(pool_connections=20, pool_maxsize=20)
        session.mount('http://', adapter)
        session.mount('https://', adapter)
        session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/120.0.0.0 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br',
            'Connection': 'keep-alive'
        })
        return session

    def _get_proxy(self):
        proxy = random.choice(self.proxies)
        return {
            'http': f"http://{proxy['user']}:{quote(proxy['pass'])}@{proxy['host']}:{proxy['port']}",
            'https': f"http://{proxy['user']}:{quote(proxy['pass'])}@{proxy['host']}:{proxy['port']}"
        }

    def scrape(self, url):
        for attempt in range(self.max_retries):
            proxy = self._get_proxy()
            try:
                time.sleep(random.uniform(0.5, 2.0))

                response = self.session.get(
                    url,
                    proxies=proxy,
                    timeout=(5, 15)
                )

                if response.status_code == 200:
                    if self._is_blocked(response):
                        logger.warning(f"Soft block on {url}, attempt {attempt + 1}")
                        continue
                    return {'url': url, 'status': 200, 'content': response.text, 'success': True}

                elif response.status_code == 429:
                    delay = 2 ** attempt + random.uniform(0, 1)
                    logger.warning(f"Rate limited on {url}, waiting {delay:.1f}s")
                    time.sleep(delay)

                elif response.status_code in (403, 503):
                    logger.warning(f"Blocked ({response.status_code}) on {url}")
                    continue

            except (requests.exceptions.ProxyError,
                    requests.exceptions.ConnectionError,
                    requests.exceptions.Timeout) as e:
                logger.error(f"Network error on {url}: {e}")
                continue

        logger.error(f"All {self.max_retries} attempts failed for {url}")
        return {'url': url, 'status': None, 'content': None, 'success': False}

    def _is_blocked(self, response):
        indicators = ['captcha', 'access denied', 'please verify', 'blocked']
        text_lower = response.text[:2000].lower()
        return any(ind in text_lower for ind in indicators)

    def scrape_many(self, urls):
        results = []
        with ThreadPoolExecutor(max_workers=self.concurrency) as executor:
            futures = {executor.submit(self.scrape, url): url for url in urls}
            for future in as_completed(futures):
                result = future.result()
                results.append(result)
                if result['success']:
                    logger.info(f"OK: {result['url']}")
        return results

# Usage
proxies = [
    {'host': 'mobile-proxy.example.com', 'port': '8080', 'user': 'user', 'pass': 'pass'},
]

scraper = ProxyRotatingScraper(proxies, max_retries=3, concurrency=5)

urls = [f'https://example.com/page/{i}' for i in range(1, 101)]
results = scraper.scrape_many(urls)

successful = sum(1 for r in results if r['success'])
logger.info(f"Results: {successful}/{len(urls)} successful")

Advanced Patterns

Async Scraping with aiohttp

For maximum throughput, use aiohttp for async HTTP requests:

import aiohttp
import asyncio
import random

async def scrape_async(session, url, proxy_url, semaphore):
    async with semaphore:
        try:
            await asyncio.sleep(random.uniform(0.5, 1.5))
            async with session.get(
                url,
                proxy=proxy_url,
                timeout=aiohttp.ClientTimeout(total=15),
                headers={'User-Agent': 'Mozilla/5.0 ...'}
            ) as response:
                content = await response.text()
                return {'url': url, 'success': True, 'content': content}
        except Exception as e:
            return {'url': url, 'success': False, 'error': str(e)}

async def main():
    proxy_url = 'http://user:pass@mobile-proxy:port'
    semaphore = asyncio.Semaphore(10)
    urls = [f'https://example.com/page/{i}' for i in range(100)]

    # Share one ClientSession across all requests so connections are pooled
    async with aiohttp.ClientSession() as session:
        tasks = [scrape_async(session, url, proxy_url, semaphore) for url in urls]
        results = await asyncio.gather(*tasks)

    successful = sum(1 for r in results if r['success'])
    print(f"{successful}/{len(urls)} successful")

asyncio.run(main())

Parsing with BeautifulSoup

from bs4 import BeautifulSoup  # pip install beautifulsoup4 lxml

result = scraper.scrape('https://example.com/products')
if result['success']:
    soup = BeautifulSoup(result['content'], 'lxml')
    products = soup.select('.product-card')
    for product in products:
        title = product.select_one('.title').text.strip()
        price = product.select_one('.price').text.strip()
        print(f'{title}: {price}')

Respecting robots.txt

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def can_fetch(url, user_agent='*'):
    parsed = urlparse(url)
    robots_url = f'{parsed.scheme}://{parsed.netloc}/robots.txt'

    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # Note: fetches robots.txt directly, not through your proxy
    return rp.can_fetch(user_agent, url)
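
To wire this into the scraper above, filter URLs before fetching (the user agent string is illustrative):

urls = [f'https://example.com/page/{i}' for i in range(1, 101)]
allowed = [url for url in urls if can_fetch(url, user_agent='MyScraper/1.0')]
results = scraper.scrape_many(allowed)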

When Requests Is Not Enough

The requests library hits its limits when:

  • The target renders its content with JavaScript (single-page applications)
  • JavaScript challenges from Cloudflare or Akamai block plain HTTP clients
  • The site validates browser or TLS fingerprints
  • Pages require interaction such as clicks, scrolls, or form submissions

In those cases, switch to browser automation (Puppeteer, Playwright, Selenium) and keep routing traffic through the same proxies.

Conclusion

Python requests with proxy rotation is the fastest and most resource-efficient way to scrape sites that serve content in HTML. The combination of session management, connection pooling, retry logic, and proxy rotation creates a scraper that runs reliably at scale.

The proxy you use determines your success rate. Mobile proxies from DataResearchTools provide the IP trust scores needed to scrape even moderately protected sites without triggering blocks. For heavily protected targets that require JavaScript rendering, pair our mobile proxies with browser automation.

Get started with mobile proxies for your Python scraping projects and reduce your block rate from day one.

