Web Scraping with Python: The Complete 2026 Guide
Python is the undisputed king of web scraping. Its rich ecosystem of libraries — from simple HTTP clients to full browser automation frameworks — makes it possible to extract data from virtually any website, whether it serves static HTML or relies entirely on JavaScript rendering.
This guide covers everything you need to go from zero to production-grade web scraping with Python. We’ll walk through every major library, show real code examples for common scraping patterns, and teach you how to handle the challenges that trip up most scrapers: JavaScript rendering, pagination, anti-bot systems, and IP blocking.
—
Table of Contents
- Why Python for Web Scraping?
- Setting Up Your Environment
- HTTP Fundamentals for Scrapers
- Requests + BeautifulSoup: The Classic Stack
- HTTPX: The Modern HTTP Client
- lxml: Speed-First HTML Parsing
- Scrapy: Industrial-Scale Scraping
- Selenium: Browser Automation
- Playwright: The Modern Alternative to Selenium
- Handling JavaScript-Rendered Content
- Pagination and Crawling Patterns
- Using Proxies with Python Scrapers
- Handling Anti-Bot Systems
- Storing Scraped Data
- Real-World Example: Scraping a Product Page
- Performance Optimization
- Legal and Ethical Considerations
- Python Scraping Libraries Comparison
- FAQ
—
Why Python for Web Scraping?
Python dominates web scraping for several reasons:
- Low barrier to entry — Clean syntax makes scraping scripts readable and maintainable
- Massive library ecosystem — From requests to scrapy, there’s a mature tool for every scraping pattern
- Strong data processing pipeline — Seamless integration with pandas, SQLAlchemy, and data science tools
- Async support — Modern Python (3.11+) handles thousands of concurrent requests efficiently
- Community — The largest web scraping community means abundant tutorials, StackOverflow answers, and open-source tools
Other languages (Node.js, Go, Rust) have their niches, but Python remains the default choice for the vast majority of scraping projects in 2026.
—
Setting Up Your Environment
Prerequisites
- Python 3.10+ (we recommend 3.12 or later)
- pip or uv for package management
- A code editor (VS Code, PyCharm, or similar)
Install Core Libraries
# Create a virtual environment
python -m venv scraper-env
source scraper-env/bin/activate # macOS/Linux
scraper-env\Scripts\activate # Windows
# Install the essentials
pip install requests beautifulsoup4 lxml httpx playwright scrapy
# Install Playwright browsers
playwright install chromium
Recommended Project Structure
my-scraper/
├── scraper/
│ ├── __init__.py
│ ├── client.py # HTTP client with proxy/retry logic
│ ├── parsers/
│ │ ├── __init__.py
│ │ └── product.py # Page-specific parsing logic
│ └── storage.py # Save to CSV/DB/JSON
├── config.py # Proxy credentials, target URLs
├── main.py # Entry point
└── requirements.txt
—
HTTP Fundamentals for Scrapers
Before diving into libraries, understand what happens when you scrape a page:
- Your script sends an HTTP GET request to the target URL
- The server responds with HTML, JSON, or a redirect
- You parse the response to extract the data you need
- Anti-bot systems may intervene with CAPTCHAs, blocks, or JavaScript challenges
Key HTTP concepts every scraper must handle:
| Concept | Why It Matters |
|---|---|
| User-Agent | Servers check this header; a missing or bot-like UA triggers blocks |
| Cookies | Many sites require session cookies for proper page rendering |
| Status Codes | 200 = success, 403 = blocked, 429 = rate limited, 503 = anti-bot challenge |
| Redirects | Login walls and geo-redirects change the page you actually receive |
| Headers | Referer, Accept-Language, and other headers affect what content is served |
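To see these concepts working together, here is a minimal sketch (the URL and header values are placeholders) that sends a request with a browser-like User-Agent and branches on the status codes listed above:
import time
import requests
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
response = requests.get("https://example.com/products", headers=headers, timeout=15)
if response.status_code == 200:
    html = response.text  # success: ready to parse
elif response.status_code == 429:
    # Rate limited: honor Retry-After if the server provides it
    time.sleep(int(response.headers.get("Retry-After", 30)))
elif response.status_code in (403, 503):
    print("Blocked or challenged; rotate IPs and headers before retrying")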
—
Requests + BeautifulSoup: The Classic Stack
The requests + BeautifulSoup combination is the starting point for most Python scrapers. It’s simple, readable, and handles 80% of scraping tasks.
Basic Example: Fetching and Parsing a Page
import requests
from bs4 import BeautifulSoup
# Set headers to mimic a real browser
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,/;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
}
url = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
response = requests.get(url, headers=headers, timeout=15)
if response.status_code == 200:
soup = BeautifulSoup(response.text, "lxml")
# Extract product data
title = soup.select_one("h1").text
price = soup.select_one(".price_color").text
availability = soup.select_one(".availability").text.strip()
description = soup.select_one("#product_description ~ p").text
print(f"Title: {title}")
print(f"Price: {price}")
print(f"Availability: {availability}")
print(f"Description: {description[:100]}...")
else:
print(f"Failed with status: {response.status_code}")
Using Sessions for Cookie Persistence
session = requests.Session()
session.headers.update(headers)
# First request sets cookies
session.get("https://example.com")
# Subsequent requests carry cookies automatically
response = session.get("https://example.com/data-page")
CSS Selectors vs. XPath
BeautifulSoup supports CSS selectors natively via .select(). For XPath, use lxml instead.
# CSS Selectors (BeautifulSoup)
titles = soup.select("div.product-card h2.title")
prices = soup.select("span.price[data-currency='USD']")
links = soup.select("a.product-link[href]")
# Get attribute values
for link in links:
print(link["href"])
When to use Requests + BeautifulSoup:
- Static HTML pages
- Simple scraping tasks
- Prototyping and one-off scripts
- When you don’t need JavaScript rendering
—
HTTPX: The Modern HTTP Client
HTTPX is a modern replacement for requests, with async support, HTTP/2, and a more robust feature set. In 2026, it’s the preferred HTTP client for new scraping projects.
Sync Usage (Drop-in Requests Replacement)
import httpx
response = httpx.get(
"https://books.toscrape.com",
headers=headers,
timeout=15,
follow_redirects=True,
)
Async Usage (High-Concurrency Scraping)
import asyncio
import httpx
from bs4 import BeautifulSoup
async def scrape_page(client: httpx.AsyncClient, url: str) -> dict:
response = await client.get(url, timeout=15)
if response.status_code == 200:
soup = BeautifulSoup(response.text, "lxml")
title = soup.select_one("h1").text
return {"url": url, "title": title}
return {"url": url, "error": response.status_code}
async def main():
urls = [
"https://books.toscrape.com/catalogue/page-1.html",
"https://books.toscrape.com/catalogue/page-2.html",
"https://books.toscrape.com/catalogue/page-3.html",
# ... more URLs
]
async with httpx.AsyncClient(headers=headers) as client:
tasks = [scrape_page(client, url) for url in urls]
results = await asyncio.gather(*tasks, return_exceptions=True)
for result in results:
print(result)
asyncio.run(main())
HTTP/2 Support
# HTTP/2 can improve performance on sites that support it
async with httpx.AsyncClient(http2=True) as client:
response = await client.get("https://example.com")
When to use HTTPX:
- New projects (prefer over requests)
- High-concurrency scraping (async)
- When you need HTTP/2
- When building production scraping systems
—
lxml: Speed-First HTML Parsing
lxml is the fastest HTML/XML parser available in Python. While BeautifulSoup is more forgiving with malformed HTML, lxml is 5–10x faster for large documents.
Parsing with lxml Directly
from lxml import html
import httpx
response = httpx.get("https://books.toscrape.com", headers=headers)
tree = html.fromstring(response.text)
# XPath selectors
titles = tree.xpath("//article[@class='product_pod']/h3/a/@title")
prices = tree.xpath("//article[@class='product_pod']//p[@class='price_color']/text()")
for title, price in zip(titles, prices):
print(f"{title}: {price}")
XPath vs. CSS Selectors
# XPath — more powerful, can select by text content
tree.xpath("//div[contains(@class, 'product')]//span[contains(text(), '$')]")
# XPath — navigate up the tree (not possible with CSS)
tree.xpath("//span[@class='price']/ancestor::div[@class='product']//h2/text()")
# CSS via lxml's cssselect
from lxml.cssselect import CSSSelector
sel = CSSSelector("div.product h2")
elements = sel(tree)
When to use lxml:
- Parsing large HTML documents where speed matters
- When you need XPath (navigating up the DOM, text-based selection)
- As the parser engine inside BeautifulSoup (BeautifulSoup(html, "lxml"))
—
Scrapy: Industrial-Scale Scraping
Scrapy is a full-featured scraping framework — not just a library. It handles request scheduling, concurrency, retries, data pipelines, and middleware out of the box.
Creating a Scrapy Project
scrapy startproject bookstore
cd bookstore
scrapy genspider books books.toscrape.com
Writing a Spider
# bookstore/spiders/books.py
import scrapy
class BooksSpider(scrapy.Spider):
name = "books"
start_urls = ["https://books.toscrape.com/catalogue/page-1.html"]
def parse(self, response):
# Extract book data from listing page
for book in response.css("article.product_pod"):
yield {
"title": book.css("h3 a::attr(title)").get(),
"price": book.css(".price_color::text").get(),
"rating": book.css("p.star-rating::attr(class)").get(),
"url": response.urljoin(book.css("h3 a::attr(href)").get()),
}
# Follow pagination
next_page = response.css("li.next a::attr(href)").get()
if next_page:
yield response.follow(next_page, callback=self.parse)
Running the Spider
# Output to JSON
scrapy crawl books -O books.json
# Output to CSV
scrapy crawl books -O books.csv
Scrapy Settings for Production
# bookstore/settings.py
# Respect robots.txt
ROBOTSTXT_OBEY = True
# Concurrency
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8
DOWNLOAD_DELAY = 1 # seconds between requests to same domain
# Retry configuration
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
# User-Agent rotation
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
# Enable AutoThrottle for adaptive rate limiting
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0
Scrapy with Proxy Middleware
# bookstore/middlewares.py
import random
class ProxyMiddleware:
def __init__(self):
self.proxy_url = "http://user:pass@gate.provider.com:7777"
def process_request(self, request, spider):
request.meta["proxy"] = self.proxy_url
When to use Scrapy:
- Large-scale crawling (thousands to millions of pages)
- Projects that need structured pipelines (crawl -> parse -> clean -> store)
- When you need built-in concurrency, retries, and rate limiting
- Team projects where a framework enforces structure
—
Selenium: Browser Automation
Selenium automates a real browser — it can click buttons, fill forms, scroll pages, and execute JavaScript. It’s essential for sites that require full browser rendering.
Basic Selenium Setup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Configure Chrome
options = Options()
options.add_argument("--headless=new") # Run without visible browser
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--window-size=1920,1080")
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
driver = webdriver.Chrome(options=options)
try:
driver.get("https://example.com/products")
# Wait for dynamic content to load
WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.CSS_SELECTOR, ".product-card"))
)
# Extract data
products = driver.find_elements(By.CSS_SELECTOR, ".product-card")
for product in products:
title = product.find_element(By.CSS_SELECTOR, "h2").text
price = product.find_element(By.CSS_SELECTOR, ".price").text
print(f"{title}: {price}")
finally:
driver.quit()
Interacting with Pages
# Click a button
button = driver.find_element(By.CSS_SELECTOR, "button.load-more")
button.click()
# Fill a search form
search_input = driver.find_element(By.CSS_SELECTOR, "input[name='q']")
search_input.clear()
search_input.send_keys("proxy providers")
search_input.submit()
# Scroll to bottom (trigger infinite scroll)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
# Wait for new content after scroll
import time
time.sleep(2)
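A fixed sleep either wastes time or proves too short. A hedged alternative, reusing the WebDriverWait import from the setup above and assuming the loaded items match a .product-card selector (a placeholder), is to wait until the item count actually grows:
previous_count = len(driver.find_elements(By.CSS_SELECTOR, ".product-card"))
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
# Wait until more items have been appended to the page
WebDriverWait(driver, 10).until(
    lambda d: len(d.find_elements(By.CSS_SELECTOR, ".product-card")) > previous_count
)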
Using Proxies with Selenium
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--proxy-server=http://gate.provider.com:7777")
# For authenticated proxies, use a browser extension or selenium-wire
# pip install selenium-wire
from seleniumwire import webdriver
proxy_options = {
"proxy": {
"http": "http://user:pass@gate.provider.com:7777",
"https": "http://user:pass@gate.provider.com:7777",
}
}
driver = webdriver.Chrome(
options=options,
seleniumwire_options=proxy_options
)
When to use Selenium:
- Sites that require JavaScript rendering and cannot be scraped with HTTP clients
- When you need to interact with the page (click, scroll, fill forms)
- Legacy projects already built on Selenium
- When you need screenshots or PDF generation (see the sketch below)
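On that last point, a minimal sketch using the driver from the setup above; the file names are placeholders, and PDF export relies on Chrome's DevTools Protocol (Chromium-based drivers only):
import base64
# Capture a screenshot of the current viewport
driver.save_screenshot("page.png")
# Print the page to PDF via the Chrome DevTools Protocol
pdf = driver.execute_cdp_cmd("Page.printToPDF", {"printBackground": True})
with open("page.pdf", "wb") as f:
    f.write(base64.b64decode(pdf["data"]))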
—
Playwright: The Modern Alternative to Selenium
Playwright is Microsoft’s browser automation framework and has rapidly become the preferred choice over Selenium for new projects in 2026. It’s faster, more reliable, and has better async support.
Basic Playwright Usage
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
context = browser.new_context(
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
viewport={"width": 1920, "height": 1080},
)
page = context.new_page()
page.goto("https://example.com/products")
# Wait for content (Playwright auto-waits, but explicit waits are available)
page.wait_for_selector(".product-card")
# Extract data
products = page.query_selector_all(".product-card")
for product in products:
title = product.query_selector("h2").inner_text()
price = product.query_selector(".price").inner_text()
print(f"{title}: {price}")
browser.close()
Async Playwright (High Concurrency)
import asyncio
from playwright.async_api import async_playwright
async def scrape_page(browser, url):
context = await browser.new_context()
page = await context.new_page()
await page.goto(url, wait_until="networkidle")
title = await page.title()
content = await page.inner_text("body")
await context.close()
return {"url": url, "title": title, "length": len(content)}
async def main():
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
urls = [f"https://books.toscrape.com/catalogue/page-{i}.html" for i in range(1, 11)]
tasks = [scrape_page(browser, url) for url in urls]
results = await asyncio.gather(*tasks)
for result in results:
print(result)
await browser.close()
asyncio.run(main())
Playwright with Proxies
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch(
headless=True,
proxy={
"server": "http://gate.provider.com:7777",
"username": "your_user",
"password": "your_pass",
}
)
page = browser.new_page()
page.goto("https://httpbin.org/ip")
print(page.inner_text("body"))
browser.close()
Intercepting Network Requests
# Block images and CSS for faster scraping
def route_handler(route):
if route.request.resource_type in ["image", "stylesheet", "font"]:
route.abort()
else:
route.continue_()
page.route("*/", route_handler)
page.goto("https://example.com")
When to use Playwright:
- Any project that would otherwise use Selenium (Playwright is strictly better in 2026)
- JavaScript-heavy SPAs (React, Vue, Angular sites)
- When you need request interception or network monitoring
- Stealth scraping (Playwright is harder for sites to detect than Selenium)
—
Handling JavaScript-Rendered Content
Many modern websites render content with JavaScript — meaning the initial HTML response is mostly empty. Here’s how to handle each scenario:
Detect If JS Rendering Is Needed
import httpx
from bs4 import BeautifulSoup
response = httpx.get("https://target-site.com", headers=headers)
soup = BeautifulSoup(response.text, "lxml")
# If the content you need is missing, JS rendering is required
target_data = soup.select(".product-price")
if not target_data:
print("Content not in initial HTML — JS rendering needed")
Strategy 1: Find the API Endpoint
Before reaching for a browser, check if the site loads data from an API. This is faster and more reliable.
# Open browser DevTools > Network tab > XHR/Fetch
# Look for JSON API calls that contain the data you need
import httpx
# Often the site's frontend calls an internal API
api_url = "https://target-site.com/api/products?page=1&limit=20"
response = httpx.get(api_url, headers=headers)
data = response.json()
for product in data["products"]:
print(f"{product['name']}: ${product['price']}")
This technique bypasses JS rendering entirely and is typically an order of magnitude faster. Check our web scraping guides for site-specific API discovery tips.
Strategy 2: Use Playwright for Full Rendering
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page()
page.goto("https://spa-site.com/products")
page.wait_for_selector("[data-product-id]") # Wait for React/Vue to render
# Now parse the fully rendered HTML
html = page.content()
# ... parse with BeautifulSoup or lxml
Strategy 3: Execute JS Manually
# If only a small JS snippet needs to run
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page()
page.goto("https://example.com")
# Execute JavaScript and get the result
data = page.evaluate("""
() => {
return Array.from(document.querySelectorAll('.product')).map(el => ({
name: el.querySelector('h2').textContent,
price: el.querySelector('.price').textContent,
}));
}
""")
print(data)
—
Pagination and Crawling Patterns
Pattern 1: Next-Page Links
import httpx
from urllib.parse import urljoin
from bs4 import BeautifulSoup
def scrape_paginated(base_url):
all_items = []
url = base_url
while url:
response = httpx.get(url, headers=headers, timeout=15)
soup = BeautifulSoup(response.text, "lxml")
# Extract items from current page
items = soup.select(".product-card")
for item in items:
all_items.append({
"title": item.select_one("h2").text,
"price": item.select_one(".price").text,
})
# Find next page link
next_link = soup.select_one("a.next-page")
url = urljoin(url, next_link["href"]) if next_link else None
print(f"Scraped {len(items)} items, total: {len(all_items)}")
return all_items
data = scrape_paginated("https://books.toscrape.com/catalogue/page-1.html")
Pattern 2: Page Number Parameters
import httpx
from bs4 import BeautifulSoup
all_items = []
for page_num in range(1, 51): # Pages 1-50
url = f"https://example.com/products?page={page_num}"
response = httpx.get(url, headers=headers)
if response.status_code != 200:
break
soup = BeautifulSoup(response.text, "lxml")
items = soup.select(".product")
if not items:
break # No more items = last page reached
all_items.extend([parse_item(item) for item in items])
Pattern 3: Infinite Scroll (Playwright)
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page()
page.goto("https://infinite-scroll-site.com")
previous_height = 0
while True:
# Scroll to bottom
page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
page.wait_for_timeout(2000) # Wait for new content
current_height = page.evaluate("document.body.scrollHeight")
if current_height == previous_height:
break # No more content loaded
previous_height = current_height
# Parse all loaded content
items = page.query_selector_all(".item")
print(f"Loaded {len(items)} items via infinite scroll")
Pattern 4: API-Based Pagination (Cursor/Offset)
import httpx
all_data = []
cursor = None
while True:
params = {"limit": 100}
if cursor:
params["cursor"] = cursor
response = httpx.get("https://api.example.com/products", params=params, headers=headers)
data = response.json()
all_data.extend(data["results"])
cursor = data.get("next_cursor")
if not cursor:
break
print(f"Total items: {len(all_data)}")
—
Using Proxies with Python Scrapers
Proxies are essential for any serious scraping project. Without them, you’ll face IP bans within minutes on most commercial websites. For a detailed comparison of providers, see our best proxy providers guide.
Proxies with Requests/HTTPX
import httpx
# Rotating residential proxy
proxy_url = "http://user:pass@gate.provider.com:7777"
# HTTPX
response = httpx.get(
"https://target.com",
proxy=proxy_url,
headers=headers,
timeout=30,
)
Proxy Rotation Pattern
import httpx
import random
PROXY_LIST = [
"http://user:pass@us.provider.com:7777",
"http://user:pass@uk.provider.com:7777",
"http://user:pass@de.provider.com:7777",
]
def get_with_rotation(url, max_retries=3):
for attempt in range(max_retries):
proxy = random.choice(PROXY_LIST)
try:
response = httpx.get(url, proxy=proxy, headers=headers, timeout=20)
if response.status_code == 200:
return response
elif response.status_code in (403, 429):
continue # Try different proxy
except httpx.TimeoutException:
continue
return None
Proxy Authentication Methods
# Method 1: URL-embedded credentials
proxy = "http://username:password@gate.provider.com:7777"
# Method 2: Country/city targeting via username
# Most providers encode targeting in the username
proxy = "http://user-country-us-city-newyork:pass@gate.provider.com:7777"
# Method 3: Session ID for sticky sessions
proxy = "http://user-session-abc123:pass@gate.provider.com:7777"
For detailed proxy setup instructions across different tools and platforms, see our proxy setup guides and proxy integration tutorials.
—
Handling Anti-Bot Systems
Modern websites use sophisticated anti-bot systems like Cloudflare, PerimeterX, DataDome, and Akamai Bot Manager. Here’s how to handle them:
Level 1: Header Rotation
import random
USER_AGENTS = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]
def get_random_headers():
return {
"User-Agent": random.choice(USER_AGENTS),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,/;q=0.8",
"Accept-Language": random.choice(["en-US,en;q=0.9", "en-GB,en;q=0.9"]),
"Accept-Encoding": "gzip, deflate, br",
"Sec-Ch-Ua": '"Chromium";v="124", "Google Chrome";v="124"',
"Sec-Ch-Ua-Mobile": "?0",
"Sec-Ch-Ua-Platform": '"Windows"',
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
}
Level 2: Rate Limiting and Delays
import time
import random
def polite_request(url, session, min_delay=1, max_delay=3):
"""Add human-like delays between requests."""
time.sleep(random.uniform(min_delay, max_delay))
return session.get(url, timeout=15)
Level 3: Playwright Stealth
# pip install playwright-stealth
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page()
# Apply stealth patches to avoid detection
stealth_sync(page)
page.goto("https://bot-protected-site.com")
Level 4: Use a Web Unlocker Service
For heavily protected sites, the most reliable approach is to use a provider’s web unlocker service — Bright Data’s Web Unlocker, Oxylabs’ Web Scraper API, or Smartproxy’s Site Unblocker handle CAPTCHAs, fingerprinting, and JavaScript challenges automatically.
import httpx
# Bright Data Web Unlocker example
response = httpx.get(
"https://heavily-protected-site.com/products",
proxy="http://brd-customer-XXXX-zone-unblocker:PASS@brd.superproxy.io:33335",
timeout=60, # Unblockers need more time
)
For more anti-bot strategies, see our guides on anti-detect browsers and proxy troubleshooting.
—
Storing Scraped Data
CSV Output
import csv
data = [
{"title": "Product A", "price": "$29.99", "url": "https://..."},
{"title": "Product B", "price": "$49.99", "url": "https://..."},
]
with open("products.csv", "w", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=["title", "price", "url"])
writer.writeheader()
writer.writerows(data)
JSON Output
import json
with open("products.json", "w", encoding="utf-8") as f:
json.dump(data, f, indent=2, ensure_ascii=False)
SQLite Database
import sqlite3
conn = sqlite3.connect("products.db")
cursor = conn.cursor()
cursor.execute("""
CREATE TABLE IF NOT EXISTS products (
id INTEGER PRIMARY KEY AUTOINCREMENT,
title TEXT NOT NULL,
price TEXT,
url TEXT UNIQUE,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
for item in data:
cursor.execute(
"INSERT OR IGNORE INTO products (title, price, url) VALUES (?, ?, ?)",
(item["title"], item["price"], item["url"])
)
conn.commit()
conn.close()
Pandas DataFrame
import pandas as pd
df = pd.DataFrame(data)
df.to_csv("products.csv", index=False)
df.to_json("products.json", orient="records", indent=2)
df.to_excel("products.xlsx", index=False)
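For production databases, the same DataFrame can be written through SQLAlchemy. A minimal sketch, assuming a local PostgreSQL instance and the psycopg2 driver are installed (the connection string is a placeholder):
from sqlalchemy import create_engine
engine = create_engine("postgresql+psycopg2://user:pass@localhost:5432/scraping")
df.to_sql("products", engine, if_exists="append", index=False)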
—
Real-World Example: Scraping a Product Page
Here’s a complete, production-ready example that scrapes product data from an e-commerce listing page with proxy rotation, error handling, and data storage.
"""
Production-ready product scraper with proxy rotation and error handling.
"""
import httpx
import json
import time
import random
from bs4 import BeautifulSoup
from dataclasses import dataclass, asdict
from typing import Optional
# --- Configuration ---
PROXY_URL = "http://user:pass@gate.provider.com:7777"
HEADERS = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,/;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
}
# --- Data Model ---
@dataclass
class Product:
title: str
price: str
rating: Optional[str]
availability: str
url: str
# --- Scraping Logic ---
def scrape_listing_page(client: httpx.Client, url: str) -> list[dict]:
"""Scrape all products from a single listing page."""
response = client.get(url, timeout=20)
response.raise_for_status()
soup = BeautifulSoup(response.text, "lxml")
products = []
for card in soup.select("article.product_pod"):
product = Product(
title=card.select_one("h3 a")["title"],
price=card.select_one(".price_color").text,
rating=card.select_one("p.star-rating")["class"][1], # e.g., "Three"
availability="In stock" if card.select_one(".instock") else "Out of stock",
url=card.select_one("h3 a")["href"],
)
products.append(asdict(product))
return products
def scrape_all_pages(base_url: str, max_pages: int = 50) -> list[dict]:
"""Scrape all pages with retry logic and rate limiting."""
all_products = []
with httpx.Client(headers=HEADERS, proxy=PROXY_URL, follow_redirects=True) as client:
for page_num in range(1, max_pages + 1):
url = f"{base_url}/catalogue/page-{page_num}.html"
for attempt in range(3): # Retry up to 3 times
try:
products = scrape_listing_page(client, url)
if not products:
print(f"No products on page {page_num} — stopping.")
return all_products
all_products.extend(products)
print(f"Page {page_num}: {len(products)} products (total: {len(all_products)})")
# Polite delay
time.sleep(random.uniform(1, 2.5))
break
except httpx.HTTPStatusError as e:
if e.response.status_code == 404:
return all_products # Last page reached
print(f"HTTP {e.response.status_code} on page {page_num}, attempt {attempt + 1}")
time.sleep(5)
except httpx.TimeoutException:
print(f"Timeout on page {page_num}, attempt {attempt + 1}")
time.sleep(5)
return all_products
# --- Main ---
if __name__ == "__main__":
products = scrape_all_pages("https://books.toscrape.com")
# Save results
with open("products.json", "w", encoding="utf-8") as f:
json.dump(products, f, indent=2)
print(f"\nDone! Scraped {len(products)} products.")
—
Performance Optimization
1. Async Concurrency
The biggest performance gain comes from making requests concurrently rather than sequentially.
import asyncio
import httpx
SEMAPHORE = asyncio.Semaphore(10) # Limit to 10 concurrent requests
async def fetch(client, url):
async with SEMAPHORE:
response = await client.get(url, timeout=20)
return response
async def main():
urls = [f"https://example.com/page/{i}" for i in range(1, 1001)]
async with httpx.AsyncClient(headers=HEADERS, proxy=PROXY_URL) as client:
tasks = [fetch(client, url) for url in urls]
responses = await asyncio.gather(*tasks, return_exceptions=True)
successful = [r for r in responses if isinstance(r, httpx.Response) and r.status_code == 200]
print(f"Success: {len(successful)}/{len(urls)}")
2. Resource Blocking in Playwright
# Block unnecessary resources to speed up rendering
page.route("*/.{png,jpg,jpeg,gif,svg,woff,woff2}", lambda route: route.abort())
page.route("*/analytics", lambda route: route.abort())
page.route("*/tracking", lambda route: route.abort())
3. Connection Pooling
# HTTPX and requests.Session reuse TCP connections automatically
# Just ensure you use a client/session object across requests
async with httpx.AsyncClient(limits=httpx.Limits(max_connections=20)) as client:
# All requests share the connection pool
pass
4. Parsing Optimization
# Use lxml parser (faster than html.parser)
soup = BeautifulSoup(html, "lxml") # 5-10x faster than "html.parser"
# Or use lxml directly for maximum speed
from lxml import html
tree = html.fromstring(html_string)
Performance Comparison
| Method | Requests/minute | Best For |
|---|---|---|
| Sync requests | ~30-60 | Simple scripts |
| Async HTTPX (10 concurrent) | ~300-600 | Medium-scale static sites |
| Async HTTPX (50 concurrent) | ~1,000-2,000 | High-volume API scraping |
| Scrapy | ~500-2,000 | Large crawling projects |
| Playwright (1 browser) | ~10-30 | JS-rendered sites |
| Playwright (5 contexts) | ~50-100 | Parallel JS rendering |
—
Legal and Ethical Considerations
Web scraping exists in a legal gray area that varies by jurisdiction. Key points:
- Check Terms of Service — Many sites explicitly prohibit scraping
- Respect robots.txt — It’s not legally binding everywhere but is considered best practice
- Don’t overload servers — Rate limit your requests to avoid causing damage
- Public vs. private data — Scraping publicly accessible data is generally safer legally
- Personal data (GDPR/CCPA) — Scraping personal information carries significant legal risk
- The CFAA and CFAA-like laws — Unauthorized access to computer systems is a crime in most jurisdictions
For a comprehensive analysis of the legal landscape, read our pillar guide: Is Web Scraping Legal? The Complete 2026 Guide. We also cover web scraping compliance in depth.
—
Python Scraping Libraries Comparison
| Library | Type | JS Support | Speed | Learning Curve | Best For |
|---|---|---|---|---|---|
| requests | HTTP client | ❌ | Fast | Easy | Simple scripts, APIs |
| HTTPX | HTTP client | ❌ | Fast | Easy | Modern projects, async |
| BeautifulSoup | Parser | ❌ | Medium | Easy | Beginners, small projects |
| lxml | Parser | ❌ | Fastest | Medium | Large documents, XPath |
| Scrapy | Framework | ❌* | Fast | Steep | Large-scale crawling |
| Selenium | Browser | ✅ | Slow | Medium | Legacy, form interaction |
| Playwright | Browser | ✅ | Medium | Medium | JS sites, stealth |
*Scrapy can integrate with Playwright via scrapy-playwright for JS rendering.
Decision Flowchart
- Is the data in the initial HTML? → Use HTTPX + BeautifulSoup/lxml
- Does the site load data from a JSON API? → Use HTTPX to call the API directly
- Does the site require JavaScript rendering? → Use Playwright
- Are you scraping millions of pages? → Use Scrapy (+ scrapy-playwright if JS needed)
- Are you dealing with heavy anti-bot protection? → Use Playwright with stealth + residential proxies
—
FAQ
What is the best Python library for web scraping in 2026?
For most projects, HTTPX + BeautifulSoup is the best starting combination. HTTPX provides async support and HTTP/2 out of the box, and BeautifulSoup with the lxml parser handles 90% of parsing needs. For JavaScript-rendered sites, add Playwright. For large-scale crawls, use Scrapy.
Is web scraping with Python legal?
It depends on what you scrape, how you scrape it, and your jurisdiction. Scraping publicly available, non-personal data while respecting rate limits is generally considered acceptable. However, violating a site’s Terms of Service, scraping personal data, or circumventing access controls can create legal risk. Read our full legal guide for details.
How do I scrape a website that uses JavaScript?
Use Playwright (recommended) or Selenium to render the JavaScript in a headless browser. Before reaching for a browser, check the site’s Network tab in DevTools — many JS-rendered sites load data from API endpoints that you can call directly with HTTPX, which is much faster.
How do I avoid getting blocked while web scraping?
Use a combination of: (1) rotating residential proxies to distribute requests across many IPs, (2) realistic headers including a current User-Agent, (3) rate limiting with random delays between requests, (4) session management with cookies, and (5) Playwright with stealth patches for heavily protected sites. See our proxy provider comparison for provider recommendations.
How fast can Python scrape websites?
With async HTTPX and 50 concurrent connections, Python can scrape 1,000-2,000 static pages per minute. Browser-based scraping (Playwright) is slower at 10-100 pages per minute depending on parallelism. The main bottleneck is usually anti-bot rate limits, not Python’s speed.
Should I use Scrapy or Playwright for my scraping project?
Use Scrapy for large-scale crawling of static sites (thousands to millions of pages) — it handles scheduling, retries, and pipelines automatically. Use Playwright for JavaScript-rendered sites or when you need browser interaction. For JS-heavy sites at scale, combine them with the scrapy-playwright plugin.
How do I handle CAPTCHAs in Python web scraping?
Options ranked by reliability: (1) Use a web unlocker service like Bright Data’s Web Unlocker that solves CAPTCHAs automatically, (2) use a CAPTCHA-solving API (2Captcha, Anti-Captcha), (3) rotate proxies to avoid triggering CAPTCHAs in the first place. Mobile proxies have the lowest CAPTCHA rates due to their high trust scores.
Can I scrape data and store it in a database?
Yes. Python integrates with every major database. For simple projects, use SQLite (built into Python). For production, use PostgreSQL with SQLAlchemy or MongoDB for unstructured data. Scrapy has built-in pipeline support for database storage. Pandas DataFrames also export directly to SQL databases.
—
Next Steps
Now that you understand the Python scraping ecosystem, here’s where to go next:
- Choose the right proxies — Read our best proxy providers comparison to pick a provider that fits your budget and targets
- Understand proxy types — Our proxy types guide explains when to use residential vs. datacenter vs. mobile
- Learn site-specific techniques — Browse our scraping tutorials for guides on scraping specific platforms like Amazon, Google Maps, and LinkedIn
- Set up anti-detect browsers — For account management, see our anti-detect browser guides
- Stay compliant — Review our web scraping legal guide and compliance resources
last updated: March 12, 2026