Web Scraping with Python: Complete 2026 Guide

Python dominates web scraping. Its readable syntax, massive library ecosystem, and mature frameworks make it the default language for everything from quick data pulls to enterprise-scale crawling. The majority of scraping projects are written in Python, and for good reason: no other language offers the same breadth of purpose-built scraping tools.

This guide covers every major Python scraping approach, from simple HTTP requests to headless browser automation. You’ll learn which tool fits which job, how to handle real-world challenges like JavaScript rendering and anti-bot systems, and how to integrate proxies for reliable large-scale scraping.

Table of Contents

  1. Why Python for Web Scraping
  2. Setting Up Your Environment
  3. Method 1: Requests + BeautifulSoup
  4. Method 2: HTTPX for Modern HTTP
  5. Method 3: Scrapy for Large Projects
  6. Method 4: Selenium for Browser Automation
  7. Method 5: Playwright for Modern Sites
  8. Handling Common Challenges
  9. Proxy Integration
  10. Storing Scraped Data
  11. Best Practices
  12. FAQ

Why Python for Web Scraping

Python’s web scraping advantage comes down to three things:

  1. Library ecosystem — Requests, BeautifulSoup, Scrapy, Selenium, Playwright, HTTPX, lxml, Parsel — every scraping need has a battle-tested solution
  2. Readability — Scraping scripts are often maintained by non-specialists. Python’s clean syntax makes them accessible
  3. Data pipeline integration — Python connects directly to pandas, databases, CSV/JSON exports, and machine learning tools

The language handles everything from a 10-line script that grabs a single table to a distributed crawler processing millions of pages daily.

Setting Up Your Environment

Create a dedicated virtual environment for your scraping project:

python -m venv scraping-env
source scraping-env/bin/activate  # On Windows: scraping-env\Scripts\activate

# Install the core libraries
pip install requests beautifulsoup4 lxml "httpx[http2]" parsel scrapy
pip install selenium playwright

# Install Playwright browsers
playwright install chromium

Verify everything works:

import requests
from bs4 import BeautifulSoup
print("Ready to scrape!")

Method 1: Requests + BeautifulSoup

The simplest and most common Python scraping stack. Requests handles HTTP, BeautifulSoup handles parsing.

Best for: Static HTML pages, simple data extraction, beginners.

import requests
from bs4 import BeautifulSoup

# Fetch a page
url = "https://books.toscrape.com/"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()

# Parse HTML
soup = BeautifulSoup(response.text, "lxml")

# Extract book data
books = []
for article in soup.select("article.product_pod"):
    title = article.h3.a["title"]
    price = article.select_one(".price_color").text
    rating = article.p["class"][1]  # e.g., "Three"
    books.append({"title": title, "price": price, "rating": rating})

for book in books[:5]:
    print(f"{book['title']} — {book['price']} ({book['rating']} stars)")

Pagination

import requests
from bs4 import BeautifulSoup

base_url = "https://books.toscrape.com/catalogue/page-{}.html"
all_books = []

for page in range(1, 51):
    url = base_url.format(page)
    response = requests.get(url, timeout=30)

    if response.status_code != 200:
        break

    soup = BeautifulSoup(response.text, "lxml")
    articles = soup.select("article.product_pod")

    for article in articles:
        all_books.append({
            "title": article.h3.a["title"],
            "price": article.select_one(".price_color").text,
        })

    print(f"Page {page}: {len(articles)} books")

print(f"Total: {len(all_books)} books")

For a deeper dive into BeautifulSoup, see our Beautiful Soup tutorial.

Method 2: HTTPX for Modern HTTP

HTTPX is a modern alternative to Requests with a largely compatible API. It supports async, HTTP/2, and connection pooling out of the box (HTTP/2 support requires the httpx[http2] extra installed in the setup step above).

Best for: High-performance scraping, async pipelines, HTTP/2 sites.

import httpx
from bs4 import BeautifulSoup

# Synchronous usage (drop-in Requests replacement)
with httpx.Client(http2=True, follow_redirects=True) as client:
    response = client.get("https://books.toscrape.com/", timeout=30)
    soup = BeautifulSoup(response.text, "lxml")
    titles = [a["title"] for a in soup.select("article.product_pod h3 a")]
    print(titles[:5])

Async Scraping with HTTPX

import asyncio
import httpx
from bs4 import BeautifulSoup

async def scrape_page(client, url):
    response = await client.get(url, timeout=30)
    soup = BeautifulSoup(response.text, "lxml")
    return [
        {"title": a["title"], "url": url}
        for a in soup.select("article.product_pod h3 a")
    ]

async def main():
    urls = [
        f"https://books.toscrape.com/catalogue/page-{i}.html"
        for i in range(1, 11)
    ]

    async with httpx.AsyncClient(http2=True) as client:
        tasks = [scrape_page(client, url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)

    all_books = []
    for result in results:
        if isinstance(result, list):
            all_books.extend(result)

    print(f"Scraped {len(all_books)} books from 10 pages concurrently")

asyncio.run(main())

For more on the HTTPX + Parsel stack, see our HTTPX + Parsel guide.

Method 3: Scrapy for Large Projects

Scrapy is a full-featured scraping framework with built-in concurrency, pipelines, middleware, and export.

Best for: Large-scale projects, crawling entire sites, production systems.

# quickstart.py — run with: scrapy runspider quickstart.py -o books.json
import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css(".price_color::text").get(),
                "url": response.urljoin(book.css("h3 a::attr(href)").get()),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

Scrapy handles pagination, concurrency, retries, and data export automatically; the settings sketch below shows the main knobs. For a full walkthrough, see our Scrapy tutorial.
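
These behaviors are configured in your project's settings.py. A minimal sketch of the relevant options (these are standard Scrapy settings; the values are illustrative):

# settings.py
CONCURRENT_REQUESTS = 16     # how many requests run in parallel
DOWNLOAD_DELAY = 1           # seconds to wait between requests to the same domain
RETRY_ENABLED = True
RETRY_TIMES = 3              # retry failed responses up to 3 times
AUTOTHROTTLE_ENABLED = True  # adapt the request rate to server response times
FEEDS = {"books.json": {"format": "json"}}  # export items automatically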

Method 4: Selenium for Browser Automation

Selenium controls a real browser, making it ideal for JavaScript-heavy sites and interaction-based scraping.

Best for: Sites requiring login, clicking, scrolling, or complex JavaScript rendering.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options

# Configure headless Chrome
options = Options()
options.add_argument("--headless=new")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")

driver = webdriver.Chrome(options=options)

try:
    driver.get("https://books.toscrape.com/")

    # Wait for content to load
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "article.product_pod"))
    )

    # Extract data
    books = driver.find_elements(By.CSS_SELECTOR, "article.product_pod")
    for book in books:
        title = book.find_element(By.CSS_SELECTOR, "h3 a").get_attribute("title")
        price = book.find_element(By.CSS_SELECTOR, ".price_color").text
        print(f"{title}: {price}")
finally:
    driver.quit()

For complete Selenium coverage, see our Selenium web scraping tutorial.

Method 5: Playwright for Modern Sites

Playwright is a newer browser automation library that generally offers better performance and reliability than Selenium for modern web applications.

Best for: SPAs, modern React/Vue/Angular sites, stealth scraping.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    page.goto("https://books.toscrape.com/")
    page.wait_for_selector("article.product_pod")

    books = page.query_selector_all("article.product_pod")
    for book in books:
        title = book.query_selector("h3 a").get_attribute("title")
        price = book.query_selector(".price_color").text_content()
        print(f"{title}: {price}")

    browser.close()

Async Playwright

import asyncio
from playwright.async_api import async_playwright

async def scrape():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        await page.goto("https://books.toscrape.com/")
        await page.wait_for_selector("article.product_pod")

        books = await page.query_selector_all("article.product_pod")
        for book in books:
            title = await (await book.query_selector("h3 a")).get_attribute("title")
            price = await (await book.query_selector(".price_color")).text_content()
            print(f"{title}: {price}")

        await browser.close()

asyncio.run(scrape())

For more details, see our Playwright web scraping guide.

Handling Common Challenges

Rate Limiting and Delays

import time
import random
import requests
from bs4 import BeautifulSoup

def polite_scrape(urls, min_delay=1, max_delay=3):
    results = []
    session = requests.Session()
    session.headers.update({"User-Agent": "Mozilla/5.0 (compatible; research bot)"})

    for url in urls:
        try:
            response = session.get(url, timeout=30)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, "lxml")
            results.append({"url": url, "title": soup.title.string if soup.title else None})
        except requests.RequestException as e:
            results.append({"url": url, "error": str(e)})

        time.sleep(random.uniform(min_delay, max_delay))

    return results

Handling JavaScript-Rendered Content

If a site loads data via JavaScript, check the network tab first. Many SPAs fetch data from APIs:

import requests

# Instead of rendering JS, call the API directly
api_url = "https://api.example.com/products?page=1&limit=50"
headers = {
    "Accept": "application/json",
    "User-Agent": "Mozilla/5.0",
}
response = requests.get(api_url, headers=headers, timeout=30)
data = response.json()

for product in data.get("results", []):
    print(product["name"], product["price"])

Retry Logic

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(
    total=3,
    backoff_factor=1,
    status_forcelist=[429, 500, 502, 503, 504],
)
session.mount("https://", HTTPAdapter(max_retries=retries))
session.mount("http://", HTTPAdapter(max_retries=retries))

response = session.get("https://books.toscrape.com/", timeout=30)

Proxy Integration

Proxies are essential for large-scale scraping to avoid IP blocks and access geo-restricted content. Learn more about proxy types in our proxy glossary.

With Requests

import requests

proxies = {
    "http": "http://user:pass@proxy.example.com:8080",
    "https": "http://user:pass@proxy.example.com:8080",
}

response = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=30)
print(response.json())

With HTTPX

import httpx

proxy = "http://user:pass@proxy.example.com:8080"
with httpx.Client(proxy=proxy) as client:
    response = client.get("https://httpbin.org/ip")
    print(response.json())

With Scrapy

# settings.py
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 110,
}

# In your spider
def start_requests(self):
    yield scrapy.Request(
        url="https://example.com",
        meta={"proxy": "http://user:pass@proxy.example.com:8080"},
    )

Rotating Proxies

import random
import requests

proxy_list = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]

def get_with_rotating_proxy(url, max_retries=3):
    for attempt in range(max_retries):
        proxy = random.choice(proxy_list)
        try:
            response = requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                timeout=30,
            )
            response.raise_for_status()
            return response
        except requests.RequestException:
            continue
    raise RuntimeError(f"Failed after {max_retries} attempts")

For proxy setup guides, see our web scraping proxy guide.

Storing Scraped Data

CSV

import csv

data = [{"title": "Book 1", "price": "$9.99"}, {"title": "Book 2", "price": "$14.99"}]

with open("books.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(data)

JSON

import json

with open("books.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2, ensure_ascii=False)

SQLite

import sqlite3

conn = sqlite3.connect("books.db")
cursor = conn.cursor()
cursor.execute("CREATE TABLE IF NOT EXISTS books (title TEXT, price TEXT)")

for item in data:
    cursor.execute("INSERT INTO books VALUES (?, ?)", (item["title"], item["price"]))

conn.commit()
conn.close()

Pandas DataFrame

import pandas as pd

df = pd.DataFrame(data)
df.to_csv("books.csv", index=False)
df.to_excel("books.xlsx", index=False)
df.to_json("books.json", orient="records", indent=2)

Best Practices

  1. Respect robots.txt — Check /robots.txt before scraping. Use urllib.robotparser to parse it programmatically (see the sketch after this list)
  2. Set reasonable delays — 1-3 seconds between requests minimum. Match the site’s capacity
  3. Use sessions — Reuse requests.Session() or httpx.Client() for connection pooling
  4. Handle errors gracefully — Implement retries, timeouts, and logging
  5. Rotate user agents — Vary your User-Agent header to avoid pattern detection (also covered in the sketch after this list)
  6. Use proxies for scale — Rotate through residential or datacenter proxies for large projects
  7. Cache responses — Save raw HTML during development to avoid re-fetching
  8. Check the API first — Many sites have public or semi-public APIs that are faster and more reliable than scraping
  9. Store raw data — Save the complete response before parsing, so you can re-parse without re-fetching
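
A minimal sketch combining practices 1 and 5, using only the standard library and Requests (the URLs and User-Agent strings are illustrative):

import random
import urllib.robotparser

import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

# Practice 1: parse robots.txt before fetching
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://books.toscrape.com/robots.txt")
rp.read()

url = "https://books.toscrape.com/catalogue/page-1.html"
if rp.can_fetch("*", url):
    # Practice 5: pick a different User-Agent per request
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    response = requests.get(url, headers=headers, timeout=30)
    print(response.status_code)
else:
    print("Disallowed by robots.txt")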

FAQ

What is the best Python library for web scraping?

It depends on your use case. For simple static pages, Requests + BeautifulSoup is the fastest to get started. For large projects with many pages, Scrapy provides built-in concurrency and data pipelines. For JavaScript-heavy sites, Playwright offers the best performance and reliability. See our Python web scraping libraries comparison for a detailed breakdown.

Is web scraping legal in Python?

Web scraping legality depends on what you scrape, not the language. Generally, scraping publicly available data is legal, but you should respect terms of service, avoid scraping personal data without consent, and comply with laws like GDPR and CFAA. Check our web scraping compliance guides for detailed legal guidance.

How do I scrape a website that uses JavaScript?

You have three options: (1) Check the browser’s Network tab for API calls — many SPAs load data from JSON endpoints you can call directly with Requests. (2) Use a headless browser like Playwright or Selenium to render JavaScript. (3) Use a combination like Scrapy + Playwright for large-scale JS scraping.
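
For option (3), a minimal sketch built on the third-party scrapy-playwright package (pip install scrapy-playwright; the settings follow its documentation, and the target URL is a placeholder):

import scrapy

class JsSpider(scrapy.Spider):
    name = "js"
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }

    def start_requests(self):
        # meta={"playwright": True} routes this request through a real browser
        yield scrapy.Request("https://example.com", meta={"playwright": True})

    def parse(self, response):
        # response.text now contains the JavaScript-rendered HTML
        yield {"title": response.css("title::text").get()}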

How do I avoid getting blocked while scraping?

Use rotating proxies, vary your User-Agent headers, add random delays between requests, and respect the site’s robots.txt. For heavily protected sites, consider using anti-detect browser configurations or residential proxies.

How fast can Python scrape websites?

With async libraries like HTTPX or Scrapy’s built-in concurrency, Python can process hundreds of pages per second on static sites. Browser-based scraping (Selenium, Playwright) is slower — typically 1-10 pages per second — due to rendering overhead. The bottleneck is usually network latency and rate limiting, not Python’s speed.


For more scraping tutorials, explore our web scraping proxy guides and proxy glossary.
