Web Scraping with Python: The Complete 2026 Guide
Python is the undisputed king of web scraping. Its rich ecosystem of libraries — from simple HTTP clients to full browser automation frameworks — makes it possible to extract data from virtually any website, whether it serves static HTML or relies entirely on JavaScript rendering.
This guide covers everything you need to go from zero to production-grade web scraping with Python. We’ll walk through every major library, show real code examples for common scraping patterns, and teach you how to handle the challenges that trip up most scrapers: JavaScript rendering, pagination, anti-bot systems, and IP blocking.
—
Table of Contents
- Why Python for Web Scraping?
- Setting Up Your Environment
- HTTP Fundamentals for Scrapers
- Requests + BeautifulSoup: The Classic Stack
- HTTPX: The Modern HTTP Client
- lxml: Speed-First HTML Parsing
- Scrapy: Industrial-Scale Scraping
- Selenium: Browser Automation
- Playwright: The Modern Alternative to Selenium
- Handling JavaScript-Rendered Content
- Pagination and Crawling Patterns
- Using Proxies with Python Scrapers
- Handling Anti-Bot Systems
- Storing Scraped Data
- Real-World Example: Scraping a Product Page
- Performance Optimization
- Legal and Ethical Considerations
- Python Scraping Libraries Comparison
- FAQ
—
Why Python for Web Scraping?
Python dominates web scraping for several reasons:
- Low barrier to entry — Clean syntax makes scraping scripts readable and maintainable
- Massive library ecosystem — From requests to scrapy, there’s a mature tool for every scraping pattern
- Strong data processing pipeline — Seamless integration with pandas, SQLAlchemy, and data science tools
- Async support — Modern Python (3.11+) handles thousands of concurrent requests efficiently
- Community — The largest web scraping community means abundant tutorials, StackOverflow answers, and open-source tools
Other languages (Node.js, Go, Rust) have their niches, but Python remains the default choice for the vast majority of scraping projects in 2026.
—
Setting Up Your Environment
Prerequisites
- Python 3.10+ (we recommend 3.12 or later)
- pip or uv for package management
- A code editor (VS Code, PyCharm, or similar)
Install Core Libraries
# Create a virtual environment
python -m venv scraper-env
source scraper-env/bin/activate # macOS/Linux
scraper-env\Scripts\activate # Windows
# Install the essentials
pip install requests beautifulsoup4 lxml httpx playwright scrapy
# Install Playwright browsers
playwright install chromium
Recommended Project Structure
my-scraper/
├── scraper/
│ ├── __init__.py
│ ├── client.py # HTTP client with proxy/retry logic
│ ├── parsers/
│ │ ├── __init__.py
│ │ └── product.py # Page-specific parsing logic
│ └── storage.py # Save to CSV/DB/JSON
├── config.py # Proxy credentials, target URLs
├── main.py # Entry point
└── requirements.txt
—
HTTP Fundamentals for Scrapers
Before diving into libraries, understand what happens when you scrape a page:
- Your script sends an HTTP GET request to the target URL
- The server responds with HTML, JSON, or a redirect
- You parse the response to extract the data you need
- Anti-bot systems may intervene with CAPTCHAs, blocks, or JavaScript challenges
Key HTTP concepts every scraper must handle:
| Concept | Why It Matters |
|---|---|
| User-Agent | Servers check this header; a missing or bot-like UA triggers blocks |
| Cookies | Many sites require session cookies for proper page rendering |
| Status Codes | 200 = success, 403 = blocked, 429 = rate limited, 503 = anti-bot challenge |
| Redirects | Login walls and geo-redirects change the page you actually receive |
| Headers | Referer, Accept-Language, and other headers affect what content is served |
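To see these concepts working together, here is a minimal sketch (the URL and header values are placeholders) that sends a request with a browser-like User-Agent and branches on the status codes listed above:
import time
import requests
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
response = requests.get("https://example.com/products", headers=headers, timeout=15)
if response.status_code == 200:
    html = response.text  # success: ready to parse
elif response.status_code == 429:
    # Rate limited: honor Retry-After if the server provides it
    time.sleep(int(response.headers.get("Retry-After", 30)))
elif response.status_code in (403, 503):
    print("Blocked or challenged; rotate IPs and headers before retrying")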
—
Requests + BeautifulSoup: The Classic Stack
The requests + BeautifulSoup combination is the starting point for most Python scrapers. It’s simple, readable, and handles 80% of scraping tasks.
Basic Example: Fetching and Parsing a Page
import requests
from bs4 import BeautifulSoup
# Set headers to mimic a real browser
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,/;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
}
url = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
response = requests.get(url, headers=headers, timeout=15)
if response.status_code == 200:
soup = BeautifulSoup(response.text, "lxml")
# Extract product data
title = soup.select_one("h1").text
price = soup.select_one(".price_color").text
availability = soup.select_one(".availability").text.strip()
description = soup.select_one("#product_description ~ p").text
print(f"Title: {title}")
print(f"Price: {price}")
print(f"Availability: {availability}")
print(f"Description: {description[:100]}...")
else:
print(f"Failed with status: {response.status_code}")
Using Sessions for Cookie Persistence
session = requests.Session()
session.headers.update(headers)
# First request sets cookies
session.get("https://example.com")
# Subsequent requests carry cookies automatically
response = session.get("https://example.com/data-page")
CSS Selectors vs. XPath
BeautifulSoup supports CSS selectors natively via .select(). For XPath, use lxml instead.
# CSS Selectors (BeautifulSoup)
titles = soup.select("div.product-card h2.title")
prices = soup.select("span.price[data-currency='USD']")
links = soup.select("a.product-link[href]")
# Get attribute values
for link in links:
print(link["href"])
When to use Requests + BeautifulSoup:
- Static HTML pages
- Simple scraping tasks
- Prototyping and one-off scripts
- When you don’t need JavaScript rendering
—
HTTPX: The Modern HTTP Client
HTTPX is a modern replacement for requests, with async support, HTTP/2, and a more robust feature set. In 2026, it’s the preferred HTTP client for new scraping projects.
Sync Usage (Drop-in Requests Replacement)
import httpx
response = httpx.get(
"https://books.toscrape.com",
headers=headers,
timeout=15,
follow_redirects=True,
)
Async Usage (High-Concurrency Scraping)
import asyncio
import httpx
from bs4 import BeautifulSoup
async def scrape_page(client: httpx.AsyncClient, url: str) -> dict:
response = await client.get(url, timeout=15)
if response.status_code == 200:
soup = BeautifulSoup(response.text, "lxml")
title = soup.select_one("h1").text
return {"url": url, "title": title}
return {"url": url, "error": response.status_code}
async def main():
urls = [
"https://books.toscrape.com/catalogue/page-1.html",
"https://books.toscrape.com/catalogue/page-2.html",
"https://books.toscrape.com/catalogue/page-3.html",
# ... more URLs
]
async with httpx.AsyncClient(headers=headers) as client:
tasks = [scrape_page(client, url) for url in urls]
results = await asyncio.gather(*tasks, return_exceptions=True)
for result in results:
print(result)
asyncio.run(main())
HTTP/2 Support
# HTTP/2 can improve performance on sites that support it
async with httpx.AsyncClient(http2=True) as client:
response = await client.get("https://example.com")
When to use HTTPX:
- New projects (prefer over requests)
- High-concurrency scraping (async)
- When you need HTTP/2
- When building production scraping systems
—
lxml: Speed-First HTML Parsing
lxml is the fastest HTML/XML parser available in Python. While BeautifulSoup is more forgiving with malformed HTML, lxml is 5–10x faster for large documents.
Parsing with lxml Directly
from lxml import html
import httpx
response = httpx.get("https://books.toscrape.com", headers=headers)
tree = html.fromstring(response.text)
# XPath selectors
titles = tree.xpath("//article[@class='product_pod']/h3/a/@title")
prices = tree.xpath("//article[@class='product_pod']//p[@class='price_color']/text()")
for title, price in zip(titles, prices):
print(f"{title}: {price}")
XPath vs. CSS Selectors
# XPath — more powerful, can select by text content
tree.xpath("//div[contains(@class, 'product')]//span[contains(text(), '$')]")
# XPath — navigate up the tree (not possible with CSS)
tree.xpath("//span[@class='price']/ancestor::div[@class='product']//h2/text()")
# CSS via lxml's cssselect
from lxml.cssselect import CSSSelector
sel = CSSSelector("div.product h2")
elements = sel(tree)
When to use lxml:
- Parsing large HTML documents where speed matters
- When you need XPath (navigating up the DOM, text-based selection)
- As the parser engine inside BeautifulSoup (BeautifulSoup(html, "lxml"))
—
Scrapy: Industrial-Scale Scraping
Scrapy is a full-featured scraping framework — not just a library. It handles request scheduling, concurrency, retries, data pipelines, and middleware out of the box.
Creating a Scrapy Project
scrapy startproject bookstore
cd bookstore
scrapy genspider books books.toscrape.com
Writing a Spider
# bookstore/spiders/books.py
import scrapy
class BooksSpider(scrapy.Spider):
name = "books"
start_urls = ["https://books.toscrape.com/catalogue/page-1.html"]
def parse(self, response):
# Extract book data from listing page
for book in response.css("article.product_pod"):
yield {
"title": book.css("h3 a::attr(title)").get(),
"price": book.css(".price_color::text").get(),
"rating": book.css("p.star-rating::attr(class)").get(),
"url": response.urljoin(book.css("h3 a::attr(href)").get()),
}
# Follow pagination
next_page = response.css("li.next a::attr(href)").get()
if next_page:
yield response.follow(next_page, callback=self.parse)
Running the Spider
# Output to JSON
scrapy crawl books -O books.json
# Output to CSV
scrapy crawl books -O books.csv
Scrapy Settings for Production
# bookstore/settings.py
# Respect robots.txt
ROBOTSTXT_OBEY = True
# Concurrency
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8
DOWNLOAD_DELAY = 1 # seconds between requests to same domain
# Retry configuration
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
# User-Agent rotation
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
# Enable AutoThrottle for adaptive rate limiting
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0
Scrapy with Proxy Middleware
# bookstore/middlewares.py
import random
class ProxyMiddleware:
def __init__(self):
self.proxy_url = "http://user:pass@gate.provider.com:7777"
def process_request(self, request, spider):
request.meta["proxy"] = self.proxy_url
When to use Scrapy:
- Large-scale crawling (thousands to millions of pages)
- Projects that need structured pipelines (crawl -> parse -> clean -> store)
- When you need built-in concurrency, retries, and rate limiting
- Team projects where a framework enforces structure
—
Selenium: Browser Automation
Selenium automates a real browser — it can click buttons, fill forms, scroll pages, and execute JavaScript. It’s essential for sites that require full browser rendering.
Basic Selenium Setup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Configure Chrome
options = Options()
options.add_argument("--headless=new") # Run without visible browser
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--window-size=1920,1080")
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
driver = webdriver.Chrome(options=options)
try:
driver.get("https://example.com/products")
# Wait for dynamic content to load
WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.CSS_SELECTOR, ".product-card"))
)
# Extract data
products = driver.find_elements(By.CSS_SELECTOR, ".product-card")
for product in products:
title = product.find_element(By.CSS_SELECTOR, "h2").text
price = product.find_element(By.CSS_SELECTOR, ".price").text
print(f"{title}: {price}")
finally:
driver.quit()
Interacting with Pages
# Click a button
button = driver.find_element(By.CSS_SELECTOR, "button.load-more")
button.click()
# Fill a search form
search_input = driver.find_element(By.CSS_SELECTOR, "input[name='q']")
search_input.clear()
search_input.send_keys("proxy providers")
search_input.submit()
# Scroll to bottom (trigger infinite scroll)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
# Wait for new content after scroll
import time
time.sleep(2)
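A fixed sleep either wastes time or proves too short. A hedged alternative, reusing the WebDriverWait import from the setup above and assuming the loaded items match a .product-card selector (a placeholder), is to wait until the item count actually grows:
previous_count = len(driver.find_elements(By.CSS_SELECTOR, ".product-card"))
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
# Wait until more items have been appended to the page
WebDriverWait(driver, 10).until(
    lambda d: len(d.find_elements(By.CSS_SELECTOR, ".product-card")) > previous_count
)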
Using Proxies with Selenium
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--proxy-server=http://gate.provider.com:7777")
# For authenticated proxies, use a browser extension or selenium-wire
# pip install selenium-wire
from seleniumwire import webdriver
proxy_options = {
"proxy": {
"http": "http://user:pass@gate.provider.com:7777",
"https": "http://user:pass@gate.provider.com:7777",
}
}
driver = webdriver.Chrome(
options=options,
seleniumwire_options=proxy_options
)
When to use Selenium:
- Sites that require JavaScript rendering and cannot be scraped with HTTP clients
- When you need to interact with the page (click, scroll, fill forms)
- Legacy projects already built on Selenium
- When you need screenshots or PDF generation (see the sketch below)
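On that last point, a minimal sketch using the driver from the setup above; the file names are placeholders, and PDF export relies on Chrome's DevTools Protocol (Chromium-based drivers only):
import base64
# Capture a screenshot of the current viewport
driver.save_screenshot("page.png")
# Print the page to PDF via the Chrome DevTools Protocol
pdf = driver.execute_cdp_cmd("Page.printToPDF", {"printBackground": True})
with open("page.pdf", "wb") as f:
    f.write(base64.b64decode(pdf["data"]))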
—
Playwright: The Modern Alternative to Selenium
Playwright is Microsoft’s browser automation framework and has rapidly become the preferred choice over Selenium for new projects in 2026. It’s faster, more reliable, and has better async support.
Basic Playwright Usage
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
context = browser.new_context(
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
viewport={"width": 1920, "height": 1080},
)
page = context.new_page()
page.goto("https://example.com/products")
# Wait for content (Playwright auto-waits, but explicit waits are available)
page.wait_for_selector(".product-card")
# Extract data
products = page.query_selector_all(".product-card")
for product in products:
title = product.query_selector("h2").inner_text()
price = product.query_selector(".price").inner_text()
print(f"{title}: {price}")
browser.close()
Async Playwright (High Concurrency)
import asyncio
from playwright.async_api import async_playwright
async def scrape_page(browser, url):
context = await browser.new_context()
page = await context.new_page()
await page.goto(url, wait_until="networkidle")
title = await page.title()
content = await page.inner_text("body")
await context.close()
return {"url": url, "title": title, "length": len(content)}
async def main():
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
urls = [f"https://books.toscrape.com/catalogue/page-{i}.html" for i in range(1, 11)]
tasks = [scrape_page(browser, url) for url in urls]
results = await asyncio.gather(*tasks)
for result in results:
print(result)
await browser.close()
asyncio.run(main())
Playwright with Proxies
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch(
headless=True,
proxy={
"server": "http://gate.provider.com:7777",
"username": "your_user",
"password": "your_pass",
}
)
page = browser.new_page()
page.goto("https://httpbin.org/ip")
print(page.inner_text("body"))
browser.close()
Intercepting Network Requests
# Block images and CSS for faster scraping
def route_handler(route):
if route.request.resource_type in ["image", "stylesheet", "font"]:
route.abort()
else:
route.continue_()
page.route("*/", route_handler)
page.goto("https://example.com")
When to use Playwright:
- Any project that would otherwise use Selenium (Playwright is strictly better in 2026)
- JavaScript-heavy SPAs (React, Vue, Angular sites)
- When you need request interception or network monitoring
- Stealth scraping (Playwright is harder for sites to detect than Selenium)
—
Handling JavaScript-Rendered Content
Many modern websites render content with JavaScript — meaning the initial HTML response is mostly empty. Here’s how to handle each scenario:
Detect If JS Rendering Is Needed
import httpx
from bs4 import BeautifulSoup
response = httpx.get("https://target-site.com", headers=headers)
soup = BeautifulSoup(response.text, "lxml")
# If the content you need is missing, JS rendering is required
target_data = soup.select(".product-price")
if not target_data:
print("Content not in initial HTML — JS rendering needed")
Strategy 1: Find the API Endpoint
Before reaching for a browser, check if the site loads data from an API. This is faster and more reliable.
# Open browser DevTools > Network tab > XHR/Fetch
# Look for JSON API calls that contain the data you need
import httpx
# Often the site's frontend calls an internal API
api_url = "https://target-site.com/api/products?page=1&limit=20"
response = httpx.get(api_url, headers=headers)
data = response.json()
for product in data["products"]:
print(f"{product['name']}: ${product['price']}")
This technique bypasses JS rendering entirely and is typically an order of magnitude faster. Check our web scraping guides for site-specific API discovery tips.
Strategy 2: Use Playwright for Full Rendering
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page()
page.goto("https://spa-site.com/products")
page.wait_for_selector("[data-product-id]") # Wait for React/Vue to render
# Now parse the fully rendered HTML
html = page.content()
# ... parse with BeautifulSoup or lxml
Strategy 3: Execute JS Manually
# If only a small JS snippet needs to run
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page()
page.goto("https://example.com")
# Execute JavaScript and get the result
data = page.evaluate("""
() => {
return Array.from(document.querySelectorAll('.product')).map(el => ({
name: el.querySelector('h2').textContent,
price: el.querySelector('.price').textContent,
}));
}
""")
print(data)
—
Pagination and Crawling Patterns
Pattern 1: Next-Page Links
import httpx
from urllib.parse import urljoin
from bs4 import BeautifulSoup
def scrape_paginated(base_url):
all_items = []
url = base_url
while url:
response = httpx.get(url, headers=headers, timeout=15)
soup = BeautifulSoup(response.text, "lxml")
# Extract items from current page
items = soup.select(".product-card")
for item in items:
all_items.append({
"title": item.select_one("h2").text,
"price": item.select_one(".price").text,
})
# Find next page link
next_link = soup.select_one("a.next-page")
url = urljoin(url, next_link["href"]) if next_link else None
print(f"Scraped {len(items)} items, total: {len(all_items)}")
return all_items
data = scrape_paginated("https://books.toscrape.com/catalogue/page-1.html")
Pattern 2: Page Number Parameters
import httpx
from bs4 import BeautifulSoup
all_items = []
for page_num in range(1, 51): # Pages 1-50
url = f"https://example.com/products?page={page_num}"
response = httpx.get(url, headers=headers)
if response.status_code != 200:
break
soup = BeautifulSoup(response.text, "lxml")
items = soup.select(".product")
if not items:
break # No more items = last page reached
all_items.extend([parse_item(item) for item in items])
Pattern 3: Infinite Scroll (Playwright)
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page()
page.goto("https://infinite-scroll-site.com")
previous_height = 0
while True:
# Scroll to bottom
page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
page.wait_for_timeout(2000) # Wait for new content
current_height = page.evaluate("document.body.scrollHeight")
if current_height == previous_height:
break # No more content loaded
previous_height = current_height
# Parse all loaded content
items = page.query_selector_all(".item")
print(f"Loaded {len(items)} items via infinite scroll")
Pattern 4: API-Based Pagination (Cursor/Offset)
import httpx
all_data = []
cursor = None
while True:
params = {"limit": 100}
if cursor:
params["cursor"] = cursor
response = httpx.get("https://api.example.com/products", params=params, headers=headers)
data = response.json()
all_data.extend(data["results"])
cursor = data.get("next_cursor")
if not cursor:
break
print(f"Total items: {len(all_data)}")
—
Using Proxies with Python Scrapers
Proxies are essential for any serious scraping project. Without them, you’ll face IP bans within minutes on most commercial websites. For a detailed comparison of providers, see our best proxy providers guide.
Proxies with Requests/HTTPX
import httpx
# Rotating residential proxy
proxy_url = "http://user:pass@gate.provider.com:7777"
# HTTPX
response = httpx.get(
"https://target.com",
proxy=proxy_url,
headers=headers,
timeout=30,
)
Proxy Rotation Pattern
import httpx
import random
PROXY_LIST = [
"http://user:pass@us.provider.com:7777",
"http://user:pass@uk.provider.com:7777",
"http://user:pass@de.provider.com:7777",
]
def get_with_rotation(url, max_retries=3):
for attempt in range(max_retries):
proxy = random.choice(PROXY_LIST)
try:
response = httpx.get(url, proxy=proxy, headers=headers, timeout=20)
if response.status_code == 200:
return response
elif response.status_code in (403, 429):
continue # Try different proxy
except httpx.TimeoutException:
continue
return None
Proxy Authentication Methods
# Method 1: URL-embedded credentials
proxy = "http://username:password@gate.provider.com:7777"
# Method 2: Country/city targeting via username
# Most providers encode targeting in the username
proxy = "http://user-country-us-city-newyork:pass@gate.provider.com:7777"
# Method 3: Session ID for sticky sessions
proxy = "http://user-session-abc123:pass@gate.provider.com:7777"
For detailed proxy setup instructions across different tools and platforms, see our proxy setup guides and proxy integration tutorials.
—
Handling Anti-Bot Systems
Modern websites use sophisticated anti-bot systems like Cloudflare, PerimeterX, DataDome, and Akamai Bot Manager. Here’s how to handle them:
Level 1: Header Rotation
import random
USER_AGENTS = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]
def get_random_headers():
return {
"User-Agent": random.choice(USER_AGENTS),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,/;q=0.8",
"Accept-Language": random.choice(["en-US,en;q=0.9", "en-GB,en;q=0.9"]),
"Accept-Encoding": "gzip, deflate, br",
"Sec-Ch-Ua": '"Chromium";v="124", "Google Chrome";v="124"',
"Sec-Ch-Ua-Mobile": "?0",
"Sec-Ch-Ua-Platform": '"Windows"',
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
}
Level 2: Rate Limiting and Delays
import time
import random
def polite_request(url, session, min_delay=1, max_delay=3):
"""Add human-like delays between requests."""
time.sleep(random.uniform(min_delay, max_delay))
return session.get(url, timeout=15)
Level 3: Playwright Stealth
# pip install playwright-stealth
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page()
# Apply stealth patches to avoid detection
stealth_sync(page)
page.goto("https://bot-protected-site.com")
Level 4: Use a Web Unlocker Service
For heavily protected sites, the most reliable approach is to use a provider’s web unlocker service — Bright Data’s Web Unlocker, Oxylabs’ Web Scraper API, or Smartproxy’s Site Unblocker handle CAPTCHAs, fingerprinting, and JavaScript challenges automatically.
import httpx
# Bright Data Web Unlocker example
response = httpx.get(
"https://heavily-protected-site.com/products",
proxy="http://brd-customer-XXXX-zone-unblocker:PASS@brd.superproxy.io:33335",
timeout=60, # Unblockers need more time
)
For more anti-bot strategies, see our guides on anti-detect browsers and proxy troubleshooting.
—
Storing Scraped Data
CSV Output
import csv
data = [
{"title": "Product A", "price": "$29.99", "url": "https://..."},
{"title": "Product B", "price": "$49.99", "url": "https://..."},
]
with open("products.csv", "w", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=["title", "price", "url"])
writer.writeheader()
writer.writerows(data)
JSON Output
import json
with open("products.json", "w", encoding="utf-8") as f:
json.dump(data, f, indent=2, ensure_ascii=False)
SQLite Database
import sqlite3
conn = sqlite3.connect("products.db")
cursor = conn.cursor()
cursor.execute("""
CREATE TABLE IF NOT EXISTS products (
id INTEGER PRIMARY KEY AUTOINCREMENT,
title TEXT NOT NULL,
price TEXT,
url TEXT UNIQUE,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
for item in data:
cursor.execute(
"INSERT OR IGNORE INTO products (title, price, url) VALUES (?, ?, ?)",
(item["title"], item["price"], item["url"])
)
conn.commit()
conn.close()
Pandas DataFrame
import pandas as pd
df = pd.DataFrame(data)
df.to_csv("products.csv", index=False)
df.to_json("products.json", orient="records", indent=2)
df.to_excel("products.xlsx", index=False)
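For production databases, the same DataFrame can be written through SQLAlchemy. A minimal sketch, assuming a local PostgreSQL instance and the psycopg2 driver are installed (the connection string is a placeholder):
from sqlalchemy import create_engine
engine = create_engine("postgresql+psycopg2://user:pass@localhost:5432/scraping")
df.to_sql("products", engine, if_exists="append", index=False)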
—
Real-World Example: Scraping a Product Page
Here’s a complete, production-ready example that scrapes product data from an e-commerce listing page with proxy rotation, error handling, and data storage.
"""
Production-ready product scraper with proxy rotation and error handling.
"""
import httpx
import json
import time
import random
from bs4 import BeautifulSoup
from dataclasses import dataclass, asdict
from typing import Optional
# --- Configuration ---
PROXY_URL = "http://user:pass@gate.provider.com:7777"
HEADERS = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,/;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
}
# --- Data Model ---
@dataclass
class Product:
title: str
price: str
rating: Optional[str]
availability: str
url: str
# --- Scraping Logic ---
def scrape_listing_page(client: httpx.Client, url: str) -> list[dict]:
"""Scrape all products from a single listing page."""
response = client.get(url, timeout=20)
response.raise_for_status()
soup = BeautifulSoup(response.text, "lxml")
products = []
for card in soup.select("article.product_pod"):
product = Product(
title=card.select_one("h3 a")["title"],
price=card.select_one(".price_color").text,
rating=card.select_one("p.star-rating")["class"][1], # e.g., "Three"
availability="In stock" if card.select_one(".instock") else "Out of stock",
url=card.select_one("h3 a")["href"],
)
products.append(asdict(product))
return products
def scrape_all_pages(base_url: str, max_pages: int = 50) -> list[dict]:
"""Scrape all pages with retry logic and rate limiting."""
all_products = []
with httpx.Client(headers=HEADERS, proxy=PROXY_URL, follow_redirects=True) as client:
for page_num in range(1, max_pages + 1):
url = f"{base_url}/catalogue/page-{page_num}.html"
for attempt in range(3): # Retry up to 3 times
try:
products = scrape_listing_page(client, url)
if not products:
print(f"No products on page {page_num} — stopping.")
return all_products
all_products.extend(products)
print(f"Page {page_num}: {len(products)} products (total: {len(all_products)})")
# Polite delay
time.sleep(random.uniform(1, 2.5))
break
except httpx.HTTPStatusError as e:
if e.response.status_code == 404:
return all_products # Last page reached
print(f"HTTP {e.response.status_code} on page {page_num}, attempt {attempt + 1}")
time.sleep(5)
except httpx.TimeoutException:
print(f"Timeout on page {page_num}, attempt {attempt + 1}")
time.sleep(5)
return all_products
# --- Main ---
if __name__ == "__main__":
products = scrape_all_pages("https://books.toscrape.com")
# Save results
with open("products.json", "w", encoding="utf-8") as f:
json.dump(products, f, indent=2)
print(f"\nDone! Scraped {len(products)} products.")
—
Performance Optimization
1. Async Concurrency
The biggest performance gain comes from making requests concurrently rather than sequentially.
import asyncio
import httpx
SEMAPHORE = asyncio.Semaphore(10) # Limit to 10 concurrent requests
async def fetch(client, url):
async with SEMAPHORE:
response = await client.get(url, timeout=20)
return response
async def main():
urls = [f"https://example.com/page/{i}" for i in range(1, 1001)]
async with httpx.AsyncClient(headers=HEADERS, proxy=PROXY_URL) as client:
tasks = [fetch(client, url) for url in urls]
responses = await asyncio.gather(*tasks, return_exceptions=True)
successful = [r for r in responses if isinstance(r, httpx.Response) and r.status_code == 200]
print(f"Success: {len(successful)}/{len(urls)}")
2. Resource Blocking in Playwright
# Block unnecessary resources to speed up rendering
page.route("*/.{png,jpg,jpeg,gif,svg,woff,woff2}", lambda route: route.abort())
page.route("*/analytics", lambda route: route.abort())
page.route("*/tracking", lambda route: route.abort())
3. Connection Pooling
# HTTPX and requests.Session reuse TCP connections automatically
# Just ensure you use a client/session object across requests
async with httpx.AsyncClient(limits=httpx.Limits(max_connections=20)) as client:
# All requests share the connection pool
pass
4. Parsing Optimization
# Use lxml parser (faster than html.parser)
soup = BeautifulSoup(html, "lxml") # 5-10x faster than "html.parser"
# Or use lxml directly for maximum speed
from lxml import html
tree = html.fromstring(html_string)
Performance Comparison
| Method | Requests/minute | Best For |
|---|---|---|
| Sync requests | ~30-60 | Simple scripts |
| Async HTTPX (10 concurrent) | ~300-600 | Medium-scale static sites |
| Async HTTPX (50 concurrent) | ~1,000-2,000 | High-volume API scraping |
| Scrapy | ~500-2,000 | Large crawling projects |
| Playwright (1 browser) | ~10-30 | JS-rendered sites |
| Playwright (5 contexts) | ~50-100 | Parallel JS rendering |
—
Legal and Ethical Considerations
Web scraping exists in a legal gray area that varies by jurisdiction. Key points:
- Check Terms of Service — Many sites explicitly prohibit scraping
- Respect robots.txt — It’s not legally binding everywhere but is considered best practice
- Don’t overload servers — Rate limit your requests to avoid causing damage
- Public vs. private data — Scraping publicly accessible data is generally safer legally
- Personal data (GDPR/CCPA) — Scraping personal information carries significant legal risk
- The CFAA and CFAA-like laws — Unauthorized access to computer systems is a crime in most jurisdictions
For a comprehensive analysis of the legal landscape, read our pillar guide: Is Web Scraping Legal? The Complete 2026 Guide. We also cover web scraping compliance in depth.
—
Python Scraping Libraries Comparison
| Library | Type | JS Support | Speed | Learning Curve | Best For |
|---|---|---|---|---|---|
| requests | HTTP client | ❌ | Fast | Easy | Simple scripts, APIs |
| HTTPX | HTTP client | ❌ | Fast | Easy | Modern projects, async |
| BeautifulSoup | Parser | ❌ | Medium | Easy | Beginners, small projects |
| lxml | Parser | ❌ | Fastest | Medium | Large documents, XPath |
| Scrapy | Framework | ❌* | Fast | Steep | Large-scale crawling |
| Selenium | Browser | ✅ | Slow | Medium | Legacy, form interaction |
| Playwright | Browser | ✅ | Medium | Medium | JS sites, stealth |
*Scrapy can integrate with Playwright via scrapy-playwright for JS rendering.
Decision Flowchart
- Is the data in the initial HTML? → Use HTTPX + BeautifulSoup/lxml
- Does the site load data from a JSON API? → Use HTTPX to call the API directly
- Does the site require JavaScript rendering? → Use Playwright
- Are you scraping millions of pages? → Use Scrapy (+ scrapy-playwright if JS needed)
- Are you dealing with heavy anti-bot protection? → Use Playwright with stealth + residential proxies
—
FAQ
What is the best Python library for web scraping in 2026?
For most projects, HTTPX + BeautifulSoup is the best starting combination. HTTPX provides async support and HTTP/2 out of the box, and BeautifulSoup with the lxml parser handles 90% of parsing needs. For JavaScript-rendered sites, add Playwright. For large-scale crawls, use Scrapy.
Is web scraping with Python legal?
It depends on what you scrape, how you scrape it, and your jurisdiction. Scraping publicly available, non-personal data while respecting rate limits is generally considered acceptable. However, violating a site’s Terms of Service, scraping personal data, or circumventing access controls can create legal risk. Read our full legal guide for details.
How do I scrape a website that uses JavaScript?
Use Playwright (recommended) or Selenium to render the JavaScript in a headless browser. Before reaching for a browser, check the site’s Network tab in DevTools — many JS-rendered sites load data from API endpoints that you can call directly with HTTPX, which is much faster.
How do I avoid getting blocked while web scraping?
Use a combination of: (1) rotating residential proxies to distribute requests across many IPs, (2) realistic headers including a current User-Agent, (3) rate limiting with random delays between requests, (4) session management with cookies, and (5) Playwright with stealth patches for heavily protected sites. See our proxy provider comparison for provider recommendations.
How fast can Python scrape websites?
With async HTTPX and 50 concurrent connections, Python can scrape 1,000-2,000 static pages per minute. Browser-based scraping (Playwright) is slower at 10-100 pages per minute depending on parallelism. The main bottleneck is usually anti-bot rate limits, not Python’s speed.
Should I use Scrapy or Playwright for my scraping project?
Use Scrapy for large-scale crawling of static sites (thousands to millions of pages) — it handles scheduling, retries, and pipelines automatically. Use Playwright for JavaScript-rendered sites or when you need browser interaction. For JS-heavy sites at scale, combine them with the scrapy-playwright plugin.
How do I handle CAPTCHAs in Python web scraping?
Options ranked by reliability: (1) Use a web unlocker service like Bright Data’s Web Unlocker that solves CAPTCHAs automatically, (2) use a CAPTCHA-solving API (2Captcha, Anti-Captcha), (3) rotate proxies to avoid triggering CAPTCHAs in the first place. Mobile proxies have the lowest CAPTCHA rates due to their high trust scores.
Can I scrape data and store it in a database?
Yes. Python integrates with every major database. For simple projects, use SQLite (built into Python). For production, use PostgreSQL with SQLAlchemy or MongoDB for unstructured data. Scrapy has built-in pipeline support for database storage. Pandas DataFrames also export directly to SQL databases.
—
Next Steps
Now that you understand the Python scraping ecosystem, here’s where to go next:
- Choose the right proxies — Read our best proxy providers comparison to pick a provider that fits your budget and targets
- Understand proxy types — Our proxy types guide explains when to use residential vs. datacenter vs. mobile
- Learn site-specific techniques — Browse our scraping tutorials for guides on scraping specific platforms like Amazon, Google Maps, and LinkedIn
- Set up anti-detect browsers — For account management, see our anti-detect browser guides
- Stay compliant — Review our web scraping legal guide and compliance resources
last updated: March 12, 2026