Python Requests + Proxies: Scraping Setup from Scratch
Not every scraping job needs a headless browser. When the target serves content in the initial HTML response — no JavaScript rendering required — the Python requests library with proxy rotation is faster, lighter, and cheaper than browser-based approaches. You can scrape thousands of pages per minute with minimal server resources.
This guide builds a complete proxy-enabled scraper from scratch using Python’s requests library. It covers proxy configuration, session management, rotation logic, retry strategies, HTTPS and SOCKS5 support, and a production-ready scraper class you can adapt to any project.
When to Use Requests vs. Browser Automation
Use requests for:
- Server-rendered HTML pages (content visible in page source)
- REST APIs and JSON endpoints
- Sites without JavaScript challenges
- High-volume scraping where speed matters
- RSS feeds, XML sitemaps, and structured data endpoints
Use browser automation (Puppeteer, Playwright, Selenium) for:
- Single-page applications (React, Angular, Vue)
- Sites with JavaScript challenges (Cloudflare, Akamai)
- Pages that require interaction (clicks, scrolls, form submissions)
- Sites that validate browser fingerprints
If you are unsure, check the target site's page source (Ctrl+U in your browser). If the data you need is in the raw HTML, requests will work. If the page source is mostly script tags wrapped around a nearly empty body, you need a browser.
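You can run the same check programmatically. This sketch fetches the page with requests and looks for a string you can see in the rendered page; the URL and marker text are placeholders for your own target:

import requests

url = 'https://example.com/products'   # Placeholder target
marker = 'Widget Pro'                  # Text visible on the rendered page

html = requests.get(url, timeout=15).text
if marker in html:
    print('Data is in the raw HTML -- requests is enough')
else:
    print('Data is likely rendered by JavaScript -- use a browser')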
Basic Proxy Configuration
HTTP Proxy
import requests

proxies = {
    'http': 'http://proxy-host:proxy-port',
    'https': 'http://proxy-host:proxy-port'
}

response = requests.get('https://httpbin.org/ip', proxies=proxies)
print(response.json())

Both http and https keys are needed: requests selects the proxy entry by the scheme of the URL being fetched, so the http key handles HTTP requests and the https key handles HTTPS requests. Despite the https value starting with http://, the connection to the target site is still encrypted end to end; the proxy only opens a tunnel via HTTP CONNECT and relays the encrypted bytes.
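To see the scheme-based selection in isolation, here is a small sketch that defines only an https key, so plain-HTTP requests bypass the proxy entirely:

import requests

proxies = {'https': 'http://proxy-host:proxy-port'}  # No 'http' key defined

requests.get('http://httpbin.org/ip', proxies=proxies)   # Goes direct -- no matching key
requests.get('https://httpbin.org/ip', proxies=proxies)  # Tunneled through the proxy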
Authenticated Proxy
proxies = {
    'http': 'http://username:password@proxy-host:proxy-port',
    'https': 'http://username:password@proxy-host:proxy-port'
}

response = requests.get('https://httpbin.org/ip', proxies=proxies)
print(response.json())

Credentials are passed inline in the URL. Special characters in passwords need URL encoding:
from urllib.parse import quote

password = quote('p@ss!word#123')
proxies = {
    'http': f'http://username:{password}@proxy-host:proxy-port',
    'https': f'http://username:{password}@proxy-host:proxy-port'
}

SOCKS5 Support
SOCKS5 proxies require the requests[socks] extra (which installs PySocks):
pip install requests[socks]

proxies = {
    'http': 'socks5h://username:password@proxy-host:proxy-port',
    'https': 'socks5h://username:password@proxy-host:proxy-port'
}

response = requests.get('https://httpbin.org/ip', proxies=proxies)

Use socks5h:// (with the h) to route DNS queries through the proxy. Without the h, DNS resolution happens locally, which can leak your real DNS queries.
Session Objects: Why They Matter
The requests.Session object persists settings across requests: cookies, headers, and connection pooling. For scraping, sessions are critical because they maintain state and reuse TCP connections for better performance.
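You can see the cookie persistence in isolation using httpbin.org's cookie-setting endpoint:

import requests

session = requests.Session()
session.get('https://httpbin.org/cookies/set/session_id/abc123')
print(session.cookies.get('session_id'))  # 'abc123' -- sent automatically on later requests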
Basic Session with Proxy
import requests

session = requests.Session()
session.proxies = {
    'http': 'http://user:pass@proxy-host:port',
    'https': 'http://user:pass@proxy-host:port'
}
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br'
})

# All requests through this session use the proxy and headers
response = session.get('https://example.com/page1')
response2 = session.get('https://example.com/page2')

Connection Pooling
Sessions reuse TCP connections to the same host, which significantly reduces latency for sequential requests to the same domain:
from requests.adapters import HTTPAdapter

session = requests.Session()
adapter = HTTPAdapter(
    pool_connections=10,  # Number of connection pools to cache (one per host)
    pool_maxsize=20,      # Max connections kept alive per pool
    max_retries=0         # We handle retries ourselves
)
session.mount('http://', adapter)
session.mount('https://', adapter)

Proxy Rotation
A single proxy IP will get rate-limited or blocked on most targets. Rotation is essential.
Simple Round-Robin Rotation
import requests
import itertools

proxy_list = [
    {'http': 'http://user:pass@proxy1:port', 'https': 'http://user:pass@proxy1:port'},
    {'http': 'http://user:pass@proxy2:port', 'https': 'http://user:pass@proxy2:port'},
    {'http': 'http://user:pass@proxy3:port', 'https': 'http://user:pass@proxy3:port'},
]
proxy_cycle = itertools.cycle(proxy_list)

def scrape_with_rotation(url):
    proxy = next(proxy_cycle)
    response = requests.get(url, proxies=proxy, timeout=15)
    return response

Weighted Random Rotation
Not all proxies perform equally. Weight the selection toward better-performing proxies:
import random

proxy_pool = [
    {'proxy': {'http': 'http://user:pass@proxy1:port', 'https': 'http://user:pass@proxy1:port'}, 'weight': 3, 'failures': 0},
    {'proxy': {'http': 'http://user:pass@proxy2:port', 'https': 'http://user:pass@proxy2:port'}, 'weight': 2, 'failures': 0},
    {'proxy': {'http': 'http://user:pass@proxy3:port', 'https': 'http://user:pass@proxy3:port'}, 'weight': 1, 'failures': 0},
]

def get_weighted_proxy():
    active_proxies = [p for p in proxy_pool if p['failures'] < 5]
    if not active_proxies:
        # Reset all proxies if all have failed
        for p in proxy_pool:
            p['failures'] = 0
        active_proxies = proxy_pool
    weights = [p['weight'] for p in active_proxies]
    selected = random.choices(active_proxies, weights=weights, k=1)[0]
    return selected

def mark_proxy_failure(proxy_entry):
    proxy_entry['failures'] += 1
    proxy_entry['weight'] = max(1, proxy_entry['weight'] - 1)

def mark_proxy_success(proxy_entry):
    proxy_entry['failures'] = 0
    proxy_entry['weight'] = min(5, proxy_entry['weight'] + 1)
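A minimal wrapper shows how the three helpers fit together; scrape_weighted is a hypothetical name and url stands in for your target:

import requests

def scrape_weighted(url):
    entry = get_weighted_proxy()
    try:
        response = requests.get(url, proxies=entry['proxy'], timeout=15)
        response.raise_for_status()
        mark_proxy_success(entry)   # Healthy proxy: raise its weight
        return response
    except requests.RequestException:
        mark_proxy_failure(entry)   # Failing proxy: lower its weight
        return None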
Rotating Gateway (Simplest)
If your mobile proxy provider offers a rotating gateway, the provider handles IP rotation and you do not need client-side rotation logic:
# Single endpoint -- provider rotates IP per request
proxies = {
    'http': 'http://user:pass@rotating-gateway.provider.com:port',
    'https': 'http://user:pass@rotating-gateway.provider.com:port'
}

session = requests.Session()
session.proxies = proxies

for url in urls:
    response = session.get(url, timeout=15)
    # Each request may use a different IP

Retry Logic and Error Handling
Network errors, proxy timeouts, and temporary blocks are inevitable. Robust retry logic keeps your scraper running.
Exponential Backoff Retry
import time
import random
import requests
from requests.exceptions import ProxyError, ConnectionError, Timeout, HTTPError

def scrape_with_retry(url, proxies, max_retries=3, base_delay=1):
    for attempt in range(max_retries):
        try:
            response = requests.get(
                url,
                proxies=proxies,
                timeout=15,
                headers={
                    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                                  'Chrome/120.0.0.0 Safari/537.36'
                }
            )
            response.raise_for_status()
            # Check for soft blocks (200 status but CAPTCHA content)
            if 'captcha' in response.text.lower() or 'access denied' in response.text.lower():
                print(f"Soft block detected on {url}. Rotating proxy.")
                return None  # Signal to caller to rotate proxy
            return response
        except (ProxyError, ConnectionError, Timeout) as e:
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"Attempt {attempt + 1} failed: {e}. Retrying in {delay:.1f}s")
            time.sleep(delay)
        except HTTPError:
            if response.status_code == 429:  # Rate limited
                delay = base_delay * (2 ** attempt) * 2
                print(f"Rate limited. Waiting {delay:.1f}s")
                time.sleep(delay)
            elif response.status_code in (403, 503):  # Blocked
                print(f"Blocked ({response.status_code}). Rotating proxy.")
                return None  # Signal to caller to rotate proxy
            else:
                raise
    return None
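Since scrape_with_retry returns None to signal a block, the caller owns rotation. A minimal caller, reusing proxy_list and proxy_cycle from the round-robin section above:

response = None
for _ in range(len(proxy_list)):
    proxies = next(proxy_cycle)
    response = scrape_with_retry('https://example.com/page1', proxies)
    if response is not None:
        break  # None means blocked -- move on to the next proxy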
Timeout Configuration
Set both connection and read timeouts separately:
# (connection_timeout, read_timeout) in seconds
response = requests.get(url, proxies=proxies, timeout=(5, 15))

- Connection timeout: How long to wait for the proxy to accept the connection. 5 seconds is reasonable for mobile proxies.
- Read timeout: How long to wait for the response data. 15-30 seconds for complex pages.
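The two phases also fail with distinct exceptions (ConnectTimeout and ReadTimeout), so you can treat a dead proxy differently from a slow target. A sketch, assuming url and proxies from the earlier examples:

import requests
from requests.exceptions import ConnectTimeout, ReadTimeout

try:
    response = requests.get(url, proxies=proxies, timeout=(5, 15))
except ConnectTimeout:
    pass  # The proxy never accepted the connection -- likely dead, rotate now
except ReadTimeout:
    pass  # Connected, but the target responded too slowly -- a retry may succeed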
Handling HTTPS Properly
SSL Verification
Always keep SSL verification enabled in production. Disabling it opens you to man-in-the-middle attacks:
# Do this
response = requests.get(url, proxies=proxies, verify=True)

# Avoid this unless debugging
response = requests.get(url, proxies=proxies, verify=False)

If you encounter SSL errors with specific proxies, it usually means the proxy is intercepting HTTPS traffic. Use a different proxy or contact the provider.
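If your provider runs a TLS-inspecting gateway and publishes its own CA certificate, you can keep verification on by pointing verify at that bundle instead of disabling it. A sketch; the file path is hypothetical:

# Trust the provider's CA bundle rather than turning verification off
response = requests.get(url, proxies=proxies, verify='/path/to/provider-ca-bundle.pem')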
Certificate Pinning
Some targets use certificate pinning, which causes SSL errors when traffic goes through a proxy. The requests library handles standard HTTPS tunneling correctly (the proxy creates a tunnel and does not see the encrypted content), so certificate pinning is usually not an issue with properly configured HTTP CONNECT proxies.
Building a Complete Proxy Scraper
Here is a production-ready scraper class that combines everything:
import requests
from requests.adapters import HTTPAdapter
import random
import time
import logging
from urllib.parse import quote
from concurrent.futures import ThreadPoolExecutor, as_completed

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class ProxyRotatingScraper:
    def __init__(self, proxies, max_retries=3, concurrency=5):
        self.proxies = proxies
        self.max_retries = max_retries
        self.concurrency = concurrency
        self.session = self._create_session()

    def _create_session(self):
        session = requests.Session()
        adapter = HTTPAdapter(pool_connections=20, pool_maxsize=20)
        session.mount('http://', adapter)
        session.mount('https://', adapter)
        session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/120.0.0.0 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br',
            'Connection': 'keep-alive'
        })
        return session

    def _get_proxy(self):
        proxy = random.choice(self.proxies)
        return {
            'http': f"http://{proxy['user']}:{quote(proxy['pass'])}@{proxy['host']}:{proxy['port']}",
            'https': f"http://{proxy['user']}:{quote(proxy['pass'])}@{proxy['host']}:{proxy['port']}"
        }

    def scrape(self, url):
        for attempt in range(self.max_retries):
            proxy = self._get_proxy()
            try:
                time.sleep(random.uniform(0.5, 2.0))  # Jitter to avoid a mechanical request rhythm
                response = self.session.get(
                    url,
                    proxies=proxy,
                    timeout=(5, 15)
                )
                if response.status_code == 200:
                    if self._is_blocked(response):
                        logger.warning(f"Soft block on {url}, attempt {attempt + 1}")
                        continue
                    return {'url': url, 'status': 200, 'content': response.text, 'success': True}
                elif response.status_code == 429:
                    delay = 2 ** attempt + random.uniform(0, 1)
                    logger.warning(f"Rate limited on {url}, waiting {delay:.1f}s")
                    time.sleep(delay)
                elif response.status_code in (403, 503):
                    logger.warning(f"Blocked ({response.status_code}) on {url}")
                    continue
            except (requests.exceptions.ProxyError,
                    requests.exceptions.ConnectionError,
                    requests.exceptions.Timeout) as e:
                logger.error(f"Network error on {url}: {e}")
                continue
        logger.error(f"All {self.max_retries} attempts failed for {url}")
        return {'url': url, 'status': None, 'content': None, 'success': False}

    def _is_blocked(self, response):
        indicators = ['captcha', 'access denied', 'please verify', 'blocked']
        text_lower = response.text[:2000].lower()
        return any(ind in text_lower for ind in indicators)

    def scrape_many(self, urls):
        results = []
        with ThreadPoolExecutor(max_workers=self.concurrency) as executor:
            futures = {executor.submit(self.scrape, url): url for url in urls}
            for future in as_completed(futures):
                result = future.result()
                results.append(result)
                if result['success']:
                    logger.info(f"OK: {result['url']}")
        return results

# Usage
proxies = [
    {'host': 'mobile-proxy.example.com', 'port': '8080', 'user': 'user', 'pass': 'pass'},
]
scraper = ProxyRotatingScraper(proxies, max_retries=3, concurrency=5)
urls = [f'https://example.com/page/{i}' for i in range(1, 101)]
results = scraper.scrape_many(urls)

successful = sum(1 for r in results if r['success'])
logger.info(f"Results: {successful}/{len(urls)} successful")

Advanced Patterns
Async Scraping with aiohttp
For maximum throughput, use aiohttp for async HTTP requests:
import aiohttp
import asyncio
import random

async def scrape_async(url, proxy_url, semaphore):
    async with semaphore:
        try:
            await asyncio.sleep(random.uniform(0.5, 1.5))
            async with aiohttp.ClientSession() as session:
                async with session.get(
                    url,
                    proxy=proxy_url,
                    timeout=aiohttp.ClientTimeout(total=15),
                    headers={'User-Agent': 'Mozilla/5.0 ...'}
                ) as response:
                    content = await response.text()
                    return {'url': url, 'success': True, 'content': content}
        except Exception as e:
            return {'url': url, 'success': False, 'error': str(e)}

async def main():
    proxy_url = 'http://user:pass@mobile-proxy:port'
    semaphore = asyncio.Semaphore(10)
    urls = [f'https://example.com/page/{i}' for i in range(100)]
    tasks = [scrape_async(url, proxy_url, semaphore) for url in urls]
    results = await asyncio.gather(*tasks)
    successful = sum(1 for r in results if r['success'])
    print(f"{successful}/{len(urls)} successful")

asyncio.run(main())

Parsing with BeautifulSoup
from bs4 import BeautifulSoup

result = scraper.scrape('https://example.com/products')
if result['success']:
    soup = BeautifulSoup(result['content'], 'lxml')  # The 'lxml' parser requires: pip install lxml
    products = soup.select('.product-card')
    for product in products:
        title = product.select_one('.title').text.strip()
        price = product.select_one('.price').text.strip()
        print(f'{title}: {price}')

Respecting robots.txt
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def can_fetch(url, user_agent='*'):
    parsed = urlparse(url)
    robots_url = f'{parsed.scheme}://{parsed.netloc}/robots.txt'
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp.can_fetch(user_agent, url)
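A minimal guard that combines can_fetch with the scraper class from earlier; the user agent string is a placeholder:

url = 'https://example.com/products'
if can_fetch(url, user_agent='MyScraperBot'):
    result = scraper.scrape(url)
else:
    print(f'robots.txt disallows fetching {url}')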
When Requests Is Not Enough
The requests library hits its limits when:
- The target requires JavaScript rendering. Switch to Playwright or Puppeteer.
- The target presents JavaScript challenges (Cloudflare, Akamai). See our guides on bypassing Cloudflare and bypassing Akamai.
- You need to scrape thousands of pages per minute across many domains. Consider Scrapy with proxy middleware.
Conclusion
Python requests with proxy rotation is the fastest and most resource-efficient way to scrape sites that serve content in HTML. The combination of session management, connection pooling, retry logic, and proxy rotation creates a scraper that runs reliably at scale.
The proxy you use determines your success rate. Mobile proxies from DataResearchTools provide the IP trust scores needed to scrape even moderately protected sites without triggering blocks. For heavily protected targets that require JavaScript rendering, pair our mobile proxies with browser automation.
Get started with mobile proxies for your Python scraping projects and reduce your block rate from day one.
Related Reading
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)
- API vs Web Scraping: When You Need Proxies (and When You Don’t)
- aiohttp + BeautifulSoup: Async Python Scraping
- ASEAN Data Protection Laws: A Web Scraping Compliance Matrix
- Axios + Cheerio: Lightweight Node.js Scraping
- How to Build an Ethical Web Scraping Policy for Your Company