Web Scraping with Python: The Complete 2026 Guide

Python is the undisputed king of web scraping. Its rich ecosystem of libraries — from simple HTTP clients to full browser automation frameworks — makes it possible to extract data from virtually any website, whether it serves static HTML or relies entirely on JavaScript rendering.

This guide covers everything you need to go from zero to production-grade web scraping with Python. We’ll walk through every major library, show real code examples for common scraping patterns, and teach you how to handle the challenges that trip up most scrapers: JavaScript rendering, pagination, anti-bot systems, and IP blocking.

Table of Contents

  1. Why Python for Web Scraping?
  2. Setting Up Your Environment
  3. HTTP Fundamentals for Scrapers
  4. Requests + BeautifulSoup: The Classic Stack
  5. HTTPX: The Modern HTTP Client
  6. lxml: Speed-First HTML Parsing
  7. Scrapy: Industrial-Scale Scraping
  8. Selenium: Browser Automation
  9. Playwright: The Modern Alternative to Selenium
  10. Handling JavaScript-Rendered Content
  11. Pagination and Crawling Patterns
  12. Using Proxies with Python Scrapers
  13. Handling Anti-Bot Systems
  14. Storing Scraped Data
  15. Real-World Example: Scraping a Product Page
  16. Performance Optimization
  17. Legal and Ethical Considerations
  18. Python Scraping Libraries Comparison
  19. FAQ

Why Python for Web Scraping?

Python dominates web scraping for several reasons:

  • Low barrier to entry — Clean syntax makes scraping scripts readable and maintainable
  • Massive library ecosystem — From requests to scrapy, there’s a mature tool for every scraping pattern
  • Strong data processing pipeline — Seamless integration with pandas, SQLAlchemy, and data science tools
  • Async support — Modern Python (3.11+) handles thousands of concurrent requests efficiently
  • Community — The largest web scraping community means abundant tutorials, StackOverflow answers, and open-source tools

Other languages (Node.js, Go, Rust) have their niches, but Python remains the default for 90%+ of scraping projects in 2026.

Setting Up Your Environment

Prerequisites

  • Python 3.10+ (we recommend 3.12 or later)
  • pip or uv for package management
  • A code editor (VS Code, PyCharm, or similar)

Install Core Libraries

# Create a virtual environment
python -m venv scraper-env
source scraper-env/bin/activate   # macOS/Linux
scraper-env\Scripts\activate      # Windows

# Install the essentials
pip install requests beautifulsoup4 lxml httpx playwright scrapy

# Install Playwright browsers
playwright install chromium

Recommended Project Structure

my-scraper/
├── scraper/
│   ├── __init__.py
│   ├── client.py         # HTTP client with proxy/retry logic
│   ├── parsers/
│   │   ├── __init__.py
│   │   └── product.py    # Page-specific parsing logic
│   └── storage.py        # Save to CSV/DB/JSON
├── config.py             # Proxy credentials, target URLs
├── main.py               # Entry point
└── requirements.txt

HTTP Fundamentals for Scrapers

Before diving into libraries, understand what happens when you scrape a page:

  1. Your script sends an HTTP GET request to the target URL
  2. The server responds with HTML, JSON, or a redirect
  3. You parse the response to extract the data you need
  4. Anti-bot systems may intervene with CAPTCHAs, blocks, or JavaScript challenges

Key HTTP concepts every scraper must handle:

| Concept | Why It Matters |
| --- | --- |
| User-Agent | Servers check this header; a missing or bot-like UA triggers blocks |
| Cookies | Many sites require session cookies for proper page rendering |
| Status Codes | 200 = success, 403 = blocked, 429 = rate limited, 503 = anti-bot challenge |
| Redirects | Login walls and geo-redirects change the page you actually receive |
| Headers | Referer, Accept-Language, and other headers affect what content is served |
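
As a minimal illustration of these concepts, the sketch below sends browser-like headers, follows redirects, and branches on the status codes listed above. The target URL is just a placeholder:

import httpx

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.google.com/",
}

# follow_redirects=True resolves login walls and geo-redirects automatically
response = httpx.get("https://example.com", headers=headers, follow_redirects=True, timeout=15)

if response.status_code == 200:
    print("OK:", len(response.text), "bytes of HTML")
elif response.status_code in (403, 429, 503):
    print("Blocked or rate limited:", response.status_code)  # back off, rotate IP/UA
else:
    print("Unexpected status:", response.status_code)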

Requests + BeautifulSoup: The Classic Stack

The requests + BeautifulSoup combination is the starting point for most Python scrapers. It’s simple, readable, and handles 80% of scraping tasks.

Basic Example: Fetching and Parsing a Page

import requests
from bs4 import BeautifulSoup

# Set headers to mimic a real browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

url = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
response = requests.get(url, headers=headers, timeout=15)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, "lxml")

    # Extract product data
    title = soup.select_one("h1").text
    price = soup.select_one(".price_color").text
    availability = soup.select_one(".availability").text.strip()
    description = soup.select_one("#product_description ~ p").text

    print(f"Title: {title}")
    print(f"Price: {price}")
    print(f"Availability: {availability}")
    print(f"Description: {description[:100]}...")
else:
    print(f"Failed with status: {response.status_code}")

Using Sessions for Cookie Persistence

session = requests.Session()
session.headers.update(headers)

# First request sets cookies
session.get("https://example.com")

# Subsequent requests carry cookies automatically
response = session.get("https://example.com/data-page")

CSS Selectors vs. XPath

BeautifulSoup supports CSS selectors natively via .select(). For XPath, use lxml instead.

# CSS Selectors (BeautifulSoup)
titles = soup.select("div.product-card h2.title")
prices = soup.select("span.price[data-currency='USD']")
links = soup.select("a.product-link[href]")

# Get attribute values
for link in links:
    print(link["href"])

When to use Requests + BeautifulSoup:

  • Static HTML pages
  • Simple scraping tasks
  • Prototyping and one-off scripts
  • When you don’t need JavaScript rendering

HTTPX: The Modern HTTP Client

HTTPX is a modern replacement for requests that supports async operations and HTTP/2 and ships with a more robust feature set. In 2026, it’s the preferred HTTP client for new scraping projects.

Sync Usage (Drop-in Requests Replacement)

import httpx

response = httpx.get(
    "https://books.toscrape.com",
    headers=headers,
    timeout=15,
    follow_redirects=True,
)

Async Usage (High-Concurrency Scraping)

import asyncio
import httpx
from bs4 import BeautifulSoup

async def scrape_page(client: httpx.AsyncClient, url: str) -> dict:
    response = await client.get(url, timeout=15)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, "lxml")
        title = soup.select_one("h1").text
        return {"url": url, "title": title}
    return {"url": url, "error": response.status_code}

async def main():
    urls = [
        "https://books.toscrape.com/catalogue/page-1.html",
        "https://books.toscrape.com/catalogue/page-2.html",
        "https://books.toscrape.com/catalogue/page-3.html",
        # ... more URLs
    ]
    async with httpx.AsyncClient(headers=headers) as client:
        tasks = [scrape_page(client, url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        for result in results:
            print(result)

asyncio.run(main())

HTTP/2 Support

# HTTP/2 can improve performance on sites that support it
# (requires the optional extra: pip install "httpx[http2]")
async with httpx.AsyncClient(http2=True) as client:
    response = await client.get("https://example.com")

When to use HTTPX:

  • New projects (prefer over requests)
  • High-concurrency scraping (async)
  • When you need HTTP/2
  • When building production scraping systems

lxml: Speed-First HTML Parsing

lxml is the fastest HTML/XML parser available in Python. While BeautifulSoup is more forgiving with malformed HTML, lxml is 5–10x faster for large documents.

Parsing with lxml Directly

from lxml import html
import httpx

response = httpx.get("https://books.toscrape.com", headers=headers)
tree = html.fromstring(response.text)

# XPath selectors
titles = tree.xpath("//article[@class='product_pod']/h3/a/@title")
prices = tree.xpath("//article[@class='product_pod']//p[@class='price_color']/text()")

for title, price in zip(titles, prices):
    print(f"{title}: {price}")

XPath vs. CSS Selectors

# XPath — more powerful, can select by text content
tree.xpath("//div[contains(@class, 'product')]//span[contains(text(), '$')]")

# XPath — navigate up the tree (not possible with CSS)
tree.xpath("//span[@class='price']/ancestor::div[@class='product']//h2/text()")

# CSS via lxml's cssselect
from lxml.cssselect import CSSSelector
sel = CSSSelector("div.product h2")
elements = sel(tree)

When to use lxml:

  • Parsing large HTML documents where speed matters
  • When you need XPath (navigating up the DOM, text-based selection)
  • As the parser engine inside BeautifulSoup (BeautifulSoup(html, "lxml"))

Scrapy: Industrial-Scale Scraping

Scrapy is a full-featured scraping framework — not just a library. It handles request scheduling, concurrency, retries, data pipelines, and middleware out of the box.

Creating a Scrapy Project

scrapy startproject bookstore
cd bookstore
scrapy genspider books books.toscrape.com

Writing a Spider

# bookstore/spiders/books.py
import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://books.toscrape.com/catalogue/page-1.html"]

    def parse(self, response):
        # Extract book data from listing page
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css(".price_color::text").get(),
                "rating": book.css("p.star-rating::attr(class)").get(),
                "url": response.urljoin(book.css("h3 a::attr(href)").get()),
            }

        # Follow pagination
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Running the Spider

# Output to JSON
scrapy crawl books -O books.json

# Output to CSV
scrapy crawl books -O books.csv

Scrapy Settings for Production

# bookstore/settings.py

# Respect robots.txt
ROBOTSTXT_OBEY = True

# Concurrency
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8
DOWNLOAD_DELAY = 1  # seconds between requests to same domain

# Retry configuration
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]

# User-Agent rotation
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"

# Enable AutoThrottle for adaptive rate limiting
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0

Scrapy with Proxy Middleware

# bookstore/middlewares.py
class ProxyMiddleware:
    def __init__(self):
        self.proxy_url = "http://user:pass@gate.provider.com:7777"

    def process_request(self, request, spider):
        request.meta["proxy"] = self.proxy_url
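
A custom middleware only runs once it is registered. A minimal sketch of the corresponding settings entry, assuming the class above lives in bookstore/middlewares.py (the priority value 350 is arbitrary but typical):

# bookstore/settings.py
DOWNLOADER_MIDDLEWARES = {
    "bookstore.middlewares.ProxyMiddleware": 350,
}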

When to use Scrapy:

  • Large-scale crawling (thousands to millions of pages)
  • Projects that need structured pipelines (crawl -> parse -> clean -> store; a minimal pipeline is sketched after this list)
  • When you need built-in concurrency, retries, and rate limiting
  • Team projects where a framework enforces structure
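
As referenced in the list above, item pipelines are where Scrapy cleans and stores data. A minimal sketch; the class name, cleaning logic, and priority are illustrative, not part of the generated project:

# bookstore/pipelines.py
class CleanPricePipeline:
    def process_item(self, item, spider):
        # Normalize the scraped price string, e.g. "£51.77" -> 51.77
        item["price"] = float(item["price"].replace("£", "").strip())
        return item

# Enable it in bookstore/settings.py
ITEM_PIPELINES = {
    "bookstore.pipelines.CleanPricePipeline": 300,
}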

Selenium: Browser Automation

Selenium automates a real browser — it can click buttons, fill forms, scroll pages, and execute JavaScript. It’s essential for sites that require full browser rendering.

Basic Selenium Setup

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Configure Chrome
options = Options()
options.add_argument("--headless=new")  # Run without visible browser
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--window-size=1920,1080")
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")

driver = webdriver.Chrome(options=options)

try:
    driver.get("https://example.com/products")

    # Wait for dynamic content to load
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".product-card"))
    )

    # Extract data
    products = driver.find_elements(By.CSS_SELECTOR, ".product-card")
    for product in products:
        title = product.find_element(By.CSS_SELECTOR, "h2").text
        price = product.find_element(By.CSS_SELECTOR, ".price").text
        print(f"{title}: {price}")
finally:
    driver.quit()

Interacting with Pages

# Click a button
button = driver.find_element(By.CSS_SELECTOR, "button.load-more")
button.click()

# Fill a search form
search_input = driver.find_element(By.CSS_SELECTOR, "input[name='q']")
search_input.clear()
search_input.send_keys("proxy providers")
search_input.submit()

# Scroll to bottom (trigger infinite scroll)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# Wait for new content after scroll
import time
time.sleep(2)

Using Proxies with Selenium

from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--proxy-server=http://gate.provider.com:7777")

# For authenticated proxies, use a browser extension or selenium-wire:
# pip install selenium-wire
from seleniumwire import webdriver

proxy_options = {
    "proxy": {
        "http": "http://user:pass@gate.provider.com:7777",
        "https": "http://user:pass@gate.provider.com:7777",
    }
}

driver = webdriver.Chrome(
    options=options,
    seleniumwire_options=proxy_options,
)

When to use Selenium:

  • Sites that require JavaScript rendering and cannot be scraped with HTTP clients
  • When you need to interact with the page (click, scroll, fill forms)
  • Legacy projects already built on Selenium
  • When you need screenshots or PDF generation

Playwright: The Modern Alternative to Selenium

Playwright is Microsoft’s browser automation framework and has rapidly become the preferred choice over Selenium for new projects in 2026. It’s faster, more reliable, and has better async support.

Basic Playwright Usage

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        viewport={"width": 1920, "height": 1080},
    )
    page = context.new_page()
    page.goto("https://example.com/products")

    # Wait for content (Playwright auto-waits, but explicit waits are available)
    page.wait_for_selector(".product-card")

    # Extract data
    products = page.query_selector_all(".product-card")
    for product in products:
        title = product.query_selector("h2").inner_text()
        price = product.query_selector(".price").inner_text()
        print(f"{title}: {price}")

    browser.close()

Async Playwright (High Concurrency)

import asyncio
from playwright.async_api import async_playwright

async def scrape_page(browser, url):
    context = await browser.new_context()
    page = await context.new_page()
    await page.goto(url, wait_until="networkidle")
    title = await page.title()
    content = await page.inner_text("body")
    await context.close()
    return {"url": url, "title": title, "length": len(content)}

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        urls = [f"https://books.toscrape.com/catalogue/page-{i}.html" for i in range(1, 11)]
        tasks = [scrape_page(browser, url) for url in urls]
        results = await asyncio.gather(*tasks)
        for result in results:
            print(result)
        await browser.close()

asyncio.run(main())

Playwright with Proxies

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        proxy={
            "server": "http://gate.provider.com:7777",
            "username": "your_user",
            "password": "your_pass",
        },
    )
    page = browser.new_page()
    page.goto("https://httpbin.org/ip")
    print(page.inner_text("body"))
    browser.close()

Intercepting Network Requests

# Block images and CSS for faster scraping
def route_handler(route):
    if route.request.resource_type in ["image", "stylesheet", "font"]:
        route.abort()
    else:
        route.continue_()

page.route("**/*", route_handler)
page.goto("https://example.com")

When to use Playwright:

  • Any project that would otherwise use Selenium (Playwright is strictly better in 2026)
  • JavaScript-heavy SPAs (React, Vue, Angular sites)
  • When you need request interception or network monitoring
  • Stealth scraping (Playwright is harder for sites to detect than Selenium)

Handling JavaScript-Rendered Content

Many modern websites render content with JavaScript — meaning the initial HTML response is mostly empty. Here’s how to handle each scenario:

Detect If JS Rendering Is Needed

import httpx
from bs4 import BeautifulSoup

response = httpx.get("https://target-site.com", headers=headers)
soup = BeautifulSoup(response.text, "lxml")

# If the content you need is missing, JS rendering is required
target_data = soup.select(".product-price")
if not target_data:
    print("Content not in initial HTML — JS rendering needed")

Strategy 1: Find the API Endpoint

Before reaching for a browser, check if the site loads data from an API. This is faster and more reliable.

# Open browser DevTools > Network tab > XHR/Fetch
# Look for JSON API calls that contain the data you need

import httpx

# Often the site's frontend calls an internal API
api_url = "https://target-site.com/api/products?page=1&limit=20"
response = httpx.get(api_url, headers=headers)
data = response.json()

for product in data["products"]:
    print(f"{product['name']}: ${product['price']}")

This technique bypasses JS rendering entirely and is 10x faster. Check our web scraping guides for site-specific API discovery tips.

Strategy 2: Use Playwright for Full Rendering

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://spa-site.com/products")
    page.wait_for_selector("[data-product-id]")  # Wait for React/Vue to render

    # Now parse the fully rendered HTML
    html = page.content()
    # ... parse with BeautifulSoup or lxml

Strategy 3: Execute JS Manually

# If only a small JS snippet needs to run
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")

    # Execute JavaScript and get the result
    data = page.evaluate("""
        () => {
            return Array.from(document.querySelectorAll('.product')).map(el => ({
                name: el.querySelector('h2').textContent,
                price: el.querySelector('.price').textContent,
            }));
        }
    """)
    print(data)

Pagination and Crawling Patterns

Pattern 1: Next-Page Links

import httpx
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def scrape_paginated(base_url):
    all_items = []
    url = base_url

    while url:
        response = httpx.get(url, headers=headers, timeout=15)
        soup = BeautifulSoup(response.text, "lxml")

        # Extract items from current page
        items = soup.select(".product-card")
        for item in items:
            all_items.append({
                "title": item.select_one("h2").text,
                "price": item.select_one(".price").text,
            })

        # Find next page link (resolve relative hrefs against the current URL)
        next_link = soup.select_one("a.next-page")
        url = urljoin(url, next_link["href"]) if next_link else None

        print(f"Scraped {len(items)} items, total: {len(all_items)}")

    return all_items

data = scrape_paginated("https://books.toscrape.com/catalogue/page-1.html")

Pattern 2: Page Number Parameters

import httpx
from bs4 import BeautifulSoup

all_items = []

for page_num in range(1, 51):  # Pages 1-50
    url = f"https://example.com/products?page={page_num}"
    response = httpx.get(url, headers=headers)
    if response.status_code != 200:
        break

    soup = BeautifulSoup(response.text, "lxml")
    items = soup.select(".product")
    if not items:
        break  # No more items = last page reached

    # parse_item() is your own per-item parsing helper
    all_items.extend([parse_item(item) for item in items])

Pattern 3: Infinite Scroll (Playwright)

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://infinite-scroll-site.com")

    previous_height = 0
    while True:
        # Scroll to bottom
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(2000)  # Wait for new content

        current_height = page.evaluate("document.body.scrollHeight")
        if current_height == previous_height:
            break  # No more content loaded
        previous_height = current_height

    # Parse all loaded content
    items = page.query_selector_all(".item")
    print(f"Loaded {len(items)} items via infinite scroll")

Pattern 4: API-Based Pagination (Cursor/Offset)

import httpx

all_data = []
cursor = None

while True:
    params = {"limit": 100}
    if cursor:
        params["cursor"] = cursor

    response = httpx.get("https://api.example.com/products", params=params, headers=headers)
    data = response.json()
    all_data.extend(data["results"])

    cursor = data.get("next_cursor")
    if not cursor:
        break

print(f"Total items: {len(all_data)}")

Using Proxies with Python Scrapers

Proxies are essential for any serious scraping project. Without them, you’ll face IP bans within minutes on most commercial websites. For a detailed comparison of providers, see our best proxy providers guide.

Proxies with Requests/HTTPX

import httpx

# Rotating residential proxy
proxy_url = "http://user:pass@gate.provider.com:7777"

# HTTPX
response = httpx.get(
    "https://target.com",
    proxy=proxy_url,
    headers=headers,
    timeout=30,
)
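
For completeness, the equivalent with requests looks like this; note that requests takes a proxies dict rather than a single proxy argument:

import requests

proxies = {
    "http": "http://user:pass@gate.provider.com:7777",
    "https": "http://user:pass@gate.provider.com:7777",
}

response = requests.get("https://target.com", proxies=proxies, headers=headers, timeout=30)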

Proxy Rotation Pattern

import httpx
import random

PROXY_LIST = [
    "http://user:pass@us.provider.com:7777",
    "http://user:pass@uk.provider.com:7777",
    "http://user:pass@de.provider.com:7777",
]

def get_with_rotation(url, max_retries=3):
    for attempt in range(max_retries):
        proxy = random.choice(PROXY_LIST)
        try:
            response = httpx.get(url, proxy=proxy, headers=headers, timeout=20)
            if response.status_code == 200:
                return response
            elif response.status_code in (403, 429):
                continue  # Try different proxy
        except httpx.TimeoutException:
            continue
    return None

Proxy Authentication Methods

# Method 1: URL-embedded credentials
proxy = "http://username:password@gate.provider.com:7777"

# Method 2: Country/city targeting via username
# Most providers encode targeting in the username
proxy = "http://user-country-us-city-newyork:pass@gate.provider.com:7777"

# Method 3: Session ID for sticky sessions
proxy = "http://user-session-abc123:pass@gate.provider.com:7777"

For detailed proxy setup instructions across different tools and platforms, see our proxy setup guides and proxy integration tutorials.

Handling Anti-Bot Systems

Modern websites use sophisticated anti-bot systems like Cloudflare, PerimeterX, DataDome, and Akamai Bot Manager. Here’s how to handle them:

Level 1: Header Rotation

import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]

def get_random_headers():
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": random.choice(["en-US,en;q=0.9", "en-GB,en;q=0.9"]),
        "Accept-Encoding": "gzip, deflate, br",
        "Sec-Ch-Ua": '"Chromium";v="124", "Google Chrome";v="124"',
        "Sec-Ch-Ua-Mobile": "?0",
        "Sec-Ch-Ua-Platform": '"Windows"',
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
    }

Level 2: Rate Limiting and Delays

import time
import random

def polite_request(url, session, min_delay=1, max_delay=3):
    """Add human-like delays between requests."""
    time.sleep(random.uniform(min_delay, max_delay))
    return session.get(url, timeout=15)

Level 3: Playwright Stealth

# pip install playwright-stealth
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Apply stealth patches to avoid detection
    stealth_sync(page)

    page.goto("https://bot-protected-site.com")

Level 4: Use a Web Unlocker Service

For heavily protected sites, the most reliable approach is to use a provider’s web unlocker service — Bright Data’s Web Unlocker, Oxylabs’ Web Scraper API, or Smartproxy’s Site Unblocker handle CAPTCHAs, fingerprinting, and JavaScript challenges automatically.

import httpx

# Bright Data Web Unlocker example
response = httpx.get(
    "https://heavily-protected-site.com/products",
    proxy="http://brd-customer-XXXX-zone-unblocker:PASS@brd.superproxy.io:33335",
    timeout=60,  # Unblockers need more time
)

For more anti-bot strategies, see our guides on anti-detect browsers and proxy troubleshooting.

Storing Scraped Data

CSV Output

import csv

data = [
    {"title": "Product A", "price": "$29.99", "url": "https://..."},
    {"title": "Product B", "price": "$49.99", "url": "https://..."},
]

with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price", "url"])
    writer.writeheader()
    writer.writerows(data)

JSON Output

import json

with open("products.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2, ensure_ascii=False)

SQLite Database

import sqlite3

conn = sqlite3.connect("products.db")
cursor = conn.cursor()

cursor.execute("""
    CREATE TABLE IF NOT EXISTS products (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        title TEXT NOT NULL,
        price TEXT,
        url TEXT UNIQUE,
        scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")

for item in data:
    cursor.execute(
        "INSERT OR IGNORE INTO products (title, price, url) VALUES (?, ?, ?)",
        (item["title"], item["price"], item["url"])
    )

conn.commit()
conn.close()

Pandas DataFrame

import pandas as pd

df = pd.DataFrame(data)
df.to_csv("products.csv", index=False)
df.to_json("products.json", orient="records", indent=2)
df.to_excel("products.xlsx", index=False)
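
For production databases, the same DataFrame can be written to SQL via SQLAlchemy. A minimal sketch, assuming a local PostgreSQL instance with illustrative credentials (requires pip install sqlalchemy psycopg2-binary):

import pandas as pd
from sqlalchemy import create_engine

# Connection string is illustrative; substitute your own host and credentials
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/scraping")

df = pd.DataFrame(data)
df.to_sql("products", engine, if_exists="append", index=False)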

Real-World Example: Scraping a Product Page

Here’s a complete, production-ready example that scrapes product data from an e-commerce listing page with proxy rotation, error handling, and data storage.

"""

Production-ready product scraper with proxy rotation and error handling.

"""

import httpx

import json

import time

import random

from bs4 import BeautifulSoup

from dataclasses import dataclass, asdict

from typing import Optional

--- Configuration ---

PROXY_URL = "http://user:pass@gate.provider.com:7777"

HEADERS = {

"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",

"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,/;q=0.8",

"Accept-Language": "en-US,en;q=0.9",

}

--- Data Model ---

@dataclass

class Product:

title: str

price: str

rating: Optional[str]

availability: str

url: str

--- Scraping Logic ---

def scrape_listing_page(client: httpx.Client, url: str) -> list[dict]:

"""Scrape all products from a single listing page."""

response = client.get(url, timeout=20)

response.raise_for_status()

soup = BeautifulSoup(response.text, "lxml")

products = []

for card in soup.select("article.product_pod"):

product = Product(

title=card.select_one("h3 a")["title"],

price=card.select_one(".price_color").text,

rating=card.select_one("p.star-rating")["class"][1], # e.g., "Three"

availability="In stock" if card.select_one(".instock") else "Out of stock",

url=card.select_one("h3 a")["href"],

)

products.append(asdict(product))

return products

def scrape_all_pages(base_url: str, max_pages: int = 50) -> list[dict]:

"""Scrape all pages with retry logic and rate limiting."""

all_products = []

with httpx.Client(headers=HEADERS, proxy=PROXY_URL, follow_redirects=True) as client:

for page_num in range(1, max_pages + 1):

url = f"{base_url}/catalogue/page-{page_num}.html"

for attempt in range(3): # Retry up to 3 times

try:

products = scrape_listing_page(client, url)

if not products:

print(f"No products on page {page_num} — stopping.")

return all_products

all_products.extend(products)

print(f"Page {page_num}: {len(products)} products (total: {len(all_products)})")

# Polite delay

time.sleep(random.uniform(1, 2.5))

break

except httpx.HTTPStatusError as e:

if e.response.status_code == 404:

return all_products # Last page reached

print(f"HTTP {e.response.status_code} on page {page_num}, attempt {attempt + 1}")

time.sleep(5)

except httpx.TimeoutException:

print(f"Timeout on page {page_num}, attempt {attempt + 1}")

time.sleep(5)

return all_products

--- Main ---

if __name__ == "__main__":

products = scrape_all_pages("https://books.toscrape.com")

# Save results

with open("products.json", "w", encoding="utf-8") as f:

json.dump(products, f, indent=2)

print(f"\nDone! Scraped {len(products)} products.")

Performance Optimization

1. Async Concurrency

The biggest performance gain comes from making requests concurrently rather than sequentially.

import asyncio
import httpx

SEMAPHORE = asyncio.Semaphore(10)  # Limit to 10 concurrent requests

async def fetch(client, url):
    async with SEMAPHORE:
        response = await client.get(url, timeout=20)
        return response

async def main():
    urls = [f"https://example.com/page/{i}" for i in range(1, 1001)]
    async with httpx.AsyncClient(headers=HEADERS, proxy=PROXY_URL) as client:
        tasks = [fetch(client, url) for url in urls]
        responses = await asyncio.gather(*tasks, return_exceptions=True)

    successful = [r for r in responses if isinstance(r, httpx.Response) and r.status_code == 200]
    print(f"Success: {len(successful)}/{len(urls)}")

asyncio.run(main())

2. Resource Blocking in Playwright

# Block unnecessary resources to speed up rendering
page.route("**/*.{png,jpg,jpeg,gif,svg,woff,woff2}", lambda route: route.abort())
page.route("**/*analytics*", lambda route: route.abort())
page.route("**/*tracking*", lambda route: route.abort())

3. Connection Pooling

# HTTPX and requests.Session reuse TCP connections automatically
# Just ensure you use a client/session object across requests
async with httpx.AsyncClient(limits=httpx.Limits(max_connections=20)) as client:
    # All requests share the connection pool
    pass

4. Parsing Optimization

# Use the lxml parser (faster than html.parser)
soup = BeautifulSoup(html_string, "lxml")  # 5-10x faster than "html.parser"

# Or use lxml directly for maximum speed
from lxml import html
tree = html.fromstring(html_string)

Performance Comparison

| Method | Requests/minute | Best For |
| --- | --- | --- |
| Sync requests | ~30-60 | Simple scripts |
| Async HTTPX (10 concurrent) | ~300-600 | Medium-scale static sites |
| Async HTTPX (50 concurrent) | ~1,000-2,000 | High-volume API scraping |
| Scrapy | ~500-2,000 | Large crawling projects |
| Playwright (1 browser) | ~10-30 | JS-rendered sites |
| Playwright (5 contexts) | ~50-100 | Parallel JS rendering |

Legal and Ethical Considerations

Web scraping exists in a legal gray area that varies by jurisdiction. Key points:

  1. Check Terms of Service — Many sites explicitly prohibit scraping
  2. Respect robots.txt — It’s not legally binding everywhere but is considered best practice (a quick programmatic check is sketched after this list)
  3. Don’t overload servers — Rate limit your requests to avoid causing damage
  4. Public vs. private data — Scraping publicly accessible data is generally safer legally
  5. Personal data (GDPR/CCPA) — Scraping personal information carries significant legal risk
  6. The CFAA and CFAA-like laws — Unauthorized access to computer systems is a crime in most jurisdictions
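
Points 2 and 3 are easy to automate. A minimal sketch using the standard library's urllib.robotparser to check a URL before fetching it; the site and user agent are illustrative:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://books.toscrape.com/robots.txt")
rp.read()

# Check whether our bot may fetch a given path before requesting it
if rp.can_fetch("MyScraperBot/1.0", "https://books.toscrape.com/catalogue/page-1.html"):
    print("Allowed by robots.txt, proceeding politely.")
else:
    print("Disallowed by robots.txt, skipping this URL.")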

For a comprehensive analysis of the legal landscape, read our pillar guide: Is Web Scraping Legal? The Complete 2026 Guide. We also cover web scraping compliance in depth.

Python Scraping Libraries Comparison

| Library | Type | JS Support | Speed | Learning Curve | Best For |
| --- | --- | --- | --- | --- | --- |
| requests | HTTP client | ❌ | Fast | Easy | Simple scripts, APIs |
| HTTPX | HTTP client | ❌ | Fast | Easy | Modern projects, async |
| BeautifulSoup | Parser | ❌ | Medium | Easy | Beginners, small projects |
| lxml | Parser | ❌ | Fastest | Medium | Large documents, XPath |
| Scrapy | Framework | ❌* | Fast | Steep | Large-scale crawling |
| Selenium | Browser | ✅ | Slow | Medium | Legacy, form interaction |
| Playwright | Browser | ✅ | Medium | Medium | JS sites, stealth |

*Scrapy can integrate with Playwright via scrapy-playwright for JS rendering.

Decision Flowchart

  1. Is the data in the initial HTML? → Use HTTPX + BeautifulSoup/lxml
  2. Does the site load data from a JSON API? → Use HTTPX to call the API directly
  3. Does the site require JavaScript rendering? → Use Playwright
  4. Are you scraping millions of pages? → Use Scrapy (+ scrapy-playwright if JS needed)
  5. Are you dealing with heavy anti-bot protection? → Use Playwright with stealth + residential proxies

FAQ

What is the best Python library for web scraping in 2026?

For most projects, HTTPX + BeautifulSoup is the best starting combination. HTTPX provides async support and HTTP/2 out of the box, and BeautifulSoup with the lxml parser handles 90% of parsing needs. For JavaScript-rendered sites, add Playwright. For large-scale crawls, use Scrapy.

Is web scraping with Python legal?

It depends on what you scrape, how you scrape it, and your jurisdiction. Scraping publicly available, non-personal data while respecting rate limits is generally considered acceptable. However, violating a site’s Terms of Service, scraping personal data, or circumventing access controls can create legal risk. Read our full legal guide for details.

How do I scrape a website that uses JavaScript?

Use Playwright (recommended) or Selenium to render the JavaScript in a headless browser. Before reaching for a browser, check the site’s Network tab in DevTools — many JS-rendered sites load data from API endpoints that you can call directly with HTTPX, which is much faster.

How do I avoid getting blocked while web scraping?

Use a combination of: (1) rotating residential proxies to distribute requests across many IPs, (2) realistic headers including a current User-Agent, (3) rate limiting with random delays between requests, (4) session management with cookies, and (5) Playwright with stealth patches for heavily protected sites. See our proxy provider comparison for provider recommendations.

How fast can Python scrape websites?

With async HTTPX and 50 concurrent connections, Python can scrape 1,000-2,000 static pages per minute. Browser-based scraping (Playwright) is slower at 10-100 pages per minute depending on parallelism. The main bottleneck is usually anti-bot rate limits, not Python’s speed.

Should I use Scrapy or Playwright for my scraping project?

Use Scrapy for large-scale crawling of static sites (thousands to millions of pages) — it handles scheduling, retries, and pipelines automatically. Use Playwright for JavaScript-rendered sites or when you need browser interaction. For JS-heavy sites at scale, combine them with the scrapy-playwright plugin.

How do I handle CAPTCHAs in Python web scraping?

Options ranked by reliability: (1) Use a web unlocker service like Bright Data’s Web Unlocker that solves CAPTCHAs automatically, (2) use a CAPTCHA-solving API (2Captcha, Anti-Captcha), (3) rotate proxies to avoid triggering CAPTCHAs in the first place. Mobile proxies have the lowest CAPTCHA rates due to their high trust scores.

Can I scrape data and store it in a database?

Yes. Python integrates with every major database. For simple projects, use SQLite (built into Python). For production, use PostgreSQL with SQLAlchemy or MongoDB for unstructured data. Scrapy has built-in pipeline support for database storage. Pandas DataFrames also export directly to SQL databases.

Next Steps

Now that you understand the Python scraping ecosystem, here’s where to go next:

  1. Choose the right proxies — Read our best proxy providers comparison to pick a provider that fits your budget and targets
  2. Understand proxy types — Our proxy types guide explains when to use residential vs. datacenter vs. mobile
  3. Learn site-specific techniques — Browse our scraping tutorials for guides on scraping specific platforms like Amazon, Google Maps, and LinkedIn
  4. Set up anti-detect browsers — For account management, see our anti-detect browser guides
  5. Stay compliant — Review our web scraping legal guide and compliance resources

last updated: March 12, 2026
