Web Scraping with Python: Complete 2026 Guide
Python dominates web scraping. Its readable syntax, massive library ecosystem, and mature frameworks make it the default language for everything from quick data pulls to enterprise-scale crawling. It is the most widely used language for the job, and for good reason — no other language offers the same breadth of scraping tools.
This guide covers every major Python scraping approach, from simple HTTP requests to headless browser automation. You’ll learn which tool fits which job, how to handle real-world challenges like JavaScript rendering and anti-bot systems, and how to integrate proxies for reliable large-scale scraping.
Table of Contents
- Why Python for Web Scraping
- Setting Up Your Environment
- Method 1: Requests + BeautifulSoup
- Method 2: HTTPX for Modern HTTP
- Method 3: Scrapy for Large Projects
- Method 4: Selenium for Browser Automation
- Method 5: Playwright for Modern Sites
- Handling Common Challenges
- Proxy Integration
- Storing Scraped Data
- Best Practices
- FAQ
Why Python for Web Scraping
Python’s web scraping advantage comes down to three things:
- Library ecosystem — Requests, BeautifulSoup, Scrapy, Selenium, Playwright, HTTPX, lxml, Parsel — every scraping need has a battle-tested solution
- Readability — Scraping scripts are often maintained by non-specialists. Python’s clean syntax makes them accessible
- Data pipeline integration — Python connects directly to pandas, databases, CSV/JSON exports, and machine learning tools
The language handles everything from a 10-line script that grabs a single table to a distributed crawler processing millions of pages daily.
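As a taste of the small end of that spectrum, here’s a minimal sketch that pulls every HTML table on a page into pandas DataFrames. It assumes pandas and lxml are installed, and the Wikipedia URL is just an illustrative target:
import io
import pandas as pd
import requests
# Fetch the page, then let pandas parse every <table> element into a DataFrame
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
html = requests.get(url, timeout=30).text
tables = pd.read_html(io.StringIO(html))  # returns a list of DataFrames
print(f"Found {len(tables)} tables; the first has {len(tables[0])} rows")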
Setting Up Your Environment
Create a dedicated virtual environment for your scraping project:
python -m venv scraping-env
source scraping-env/bin/activate # On Windows: scraping-env\Scripts\activate
# Install the core libraries
pip install requests beautifulsoup4 lxml "httpx[h2]" parsel scrapy
pip install selenium playwright
# Install Playwright browsers
playwright install chromium
Verify everything works:
import requests
from bs4 import BeautifulSoup
print("Ready to scrape!")Method 1: Requests + BeautifulSoup
The simplest and most common Python scraping stack. Requests handles HTTP, BeautifulSoup handles parsing.
Best for: Static HTML pages, simple data extraction, beginners.
import requests
from bs4 import BeautifulSoup
# Fetch a page
url = "https://books.toscrape.com/"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()
# Parse HTML
soup = BeautifulSoup(response.text, "lxml")
# Extract book data
books = []
for article in soup.select("article.product_pod"):
title = article.h3.a["title"]
price = article.select_one(".price_color").text
rating = article.p["class"][1] # e.g., "Three"
books.append({"title": title, "price": price, "rating": rating})
for book in books[:5]:
print(f"{book['title']} — {book['price']} ({book['rating']} stars)")Pagination
import requests
from bs4 import BeautifulSoup
base_url = "https://books.toscrape.com/catalogue/page-{}.html"
all_books = []
for page in range(1, 51):
url = base_url.format(page)
response = requests.get(url, timeout=30)
if response.status_code != 200:
break
soup = BeautifulSoup(response.text, "lxml")
articles = soup.select("article.product_pod")
for article in articles:
all_books.append({
"title": article.h3.a["title"],
"price": article.select_one(".price_color").text,
})
print(f"Page {page}: {len(articles)} books")
print(f"Total: {len(all_books)} books")For a deeper dive into BeautifulSoup, see our Beautiful Soup tutorial.
Method 2: HTTPX for Modern HTTP
HTTPX is a modern alternative to Requests. It supports async, HTTP/2, and connection pooling out of the box (HTTP/2 needs the h2 extra, installed above as httpx[h2]).
Best for: High-performance scraping, async pipelines, HTTP/2 sites.
import httpx
from bs4 import BeautifulSoup
# Synchronous usage (drop-in Requests replacement)
with httpx.Client(http2=True, follow_redirects=True) as client:
response = client.get("https://books.toscrape.com/", timeout=30)
soup = BeautifulSoup(response.text, "lxml")
titles = [a["title"] for a in soup.select("article.product_pod h3 a")]
print(titles[:5])
Async Scraping with HTTPX
import asyncio
import httpx
from bs4 import BeautifulSoup
async def scrape_page(client, url):
response = await client.get(url, timeout=30)
soup = BeautifulSoup(response.text, "lxml")
return [
{"title": a["title"], "url": url}
for a in soup.select("article.product_pod h3 a")
]
async def main():
urls = [
f"https://books.toscrape.com/catalogue/page-{i}.html"
for i in range(1, 11)
]
async with httpx.AsyncClient(http2=True) as client:
tasks = [scrape_page(client, url) for url in urls]
results = await asyncio.gather(*tasks, return_exceptions=True)
all_books = []
for result in results:
if isinstance(result, list):
all_books.extend(result)
print(f"Scraped {len(all_books)} books from 10 pages concurrently")
asyncio.run(main())
For more on the HTTPX + Parsel stack, see our HTTPX + Parsel guide.
Method 3: Scrapy for Large Projects
Scrapy is a full-featured scraping framework with built-in concurrency, item pipelines, middleware, and data export.
Best for: Large-scale projects, crawling entire sites, production systems.
# quickstart.py — run with: scrapy runspider quickstart.py -o books.json
import scrapy
class BooksSpider(scrapy.Spider):
name = "books"
start_urls = ["https://books.toscrape.com/"]
def parse(self, response):
for book in response.css("article.product_pod"):
yield {
"title": book.css("h3 a::attr(title)").get(),
"price": book.css(".price_color::text").get(),
"url": response.urljoin(book.css("h3 a::attr(href)").get()),
}
next_page = response.css("li.next a::attr(href)").get()
if next_page:
yield response.follow(next_page, self.parse)
Scrapy handles pagination, concurrency, retries, and data export automatically. For a full walkthrough, see our Scrapy tutorial.
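Those defaults are configurable per spider. Here’s a hedged sketch of the settings you’d most often tune via custom_settings (the values are illustrative, not recommendations):
import scrapy

class TunedBooksSpider(scrapy.Spider):
    name = "books_tuned"
    start_urls = ["https://books.toscrape.com/"]

    # Per-spider overrides of Scrapy's project-wide settings
    custom_settings = {
        "CONCURRENT_REQUESTS": 8,      # parallel requests in flight
        "DOWNLOAD_DELAY": 0.5,         # base delay between requests (seconds)
        "AUTOTHROTTLE_ENABLED": True,  # adapt the delay to server response times
        "RETRY_TIMES": 3,              # retries for failed/5xx responses
        "FEEDS": {"books.json": {"format": "json"}},  # built-in data export
    }

    def parse(self, response):
        for book in response.css("article.product_pod"):
            yield {"title": book.css("h3 a::attr(title)").get()}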
Method 4: Selenium for Browser Automation
Selenium controls a real browser, making it ideal for JavaScript-heavy sites and interaction-based scraping.
Best for: Sites requiring login, clicking, scrolling, or complex JavaScript rendering.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
# Configure headless Chrome
options = Options()
options.add_argument("--headless=new")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
driver = webdriver.Chrome(options=options)
try:
driver.get("https://books.toscrape.com/")
# Wait for content to load
WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.CSS_SELECTOR, "article.product_pod"))
)
# Extract data
books = driver.find_elements(By.CSS_SELECTOR, "article.product_pod")
for book in books:
title = book.find_element(By.CSS_SELECTOR, "h3 a").get_attribute("title")
price = book.find_element(By.CSS_SELECTOR, ".price_color").text
print(f"{title}: {price}")
finally:
driver.quit()
For complete Selenium coverage, see our Selenium web scraping tutorial.
Method 5: Playwright for Modern Sites
Playwright is a newer browser automation library that often delivers better performance and reliability than Selenium on modern web applications.
Best for: SPAs, modern React/Vue/Angular sites, stealth scraping.
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page()
page.goto("https://books.toscrape.com/")
page.wait_for_selector("article.product_pod")
books = page.query_selector_all("article.product_pod")
for book in books:
title = book.query_selector("h3 a").get_attribute("title")
price = book.query_selector(".price_color").text_content()
print(f"{title}: {price}")
browser.close()
Async Playwright
import asyncio
from playwright.async_api import async_playwright
async def scrape():
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
page = await browser.new_page()
await page.goto("https://books.toscrape.com/")
await page.wait_for_selector("article.product_pod")
books = await page.query_selector_all("article.product_pod")
for book in books:
title = await (await book.query_selector("h3 a")).get_attribute("title")
price = await (await book.query_selector(".price_color")).text_content()
print(f"{title}: {price}")
await browser.close()
asyncio.run(scrape())
For more details, see our Playwright web scraping guide.
Handling Common Challenges
Rate Limiting and Delays
import time
import random
import requests
from bs4 import BeautifulSoup
def polite_scrape(urls, min_delay=1, max_delay=3):
results = []
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 (compatible; research bot)"})
for url in urls:
try:
response = session.get(url, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "lxml")
results.append({"url": url, "title": soup.title.string if soup.title else None})
except requests.RequestException as e:
results.append({"url": url, "error": str(e)})
time.sleep(random.uniform(min_delay, max_delay))
return results
Handling JavaScript-Rendered Content
If a site loads data via JavaScript, check the browser’s Network tab first. Many SPAs fetch data from APIs:
import requests
# Instead of rendering JS, call the API directly
api_url = "https://api.example.com/products?page=1&limit=50"
headers = {
"Accept": "application/json",
"User-Agent": "Mozilla/5.0",
}
response = requests.get(api_url, headers=headers)
data = response.json()
for product in data.get("results", []):
print(product["name"], product["price"])Retry Logic
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
session = requests.Session()
retries = Retry(
total=3,
backoff_factor=1,
status_forcelist=[429, 500, 502, 503, 504],
)
session.mount("https://", HTTPAdapter(max_retries=retries))
session.mount("http://", HTTPAdapter(max_retries=retries))
response = session.get("https://books.toscrape.com/", timeout=30)
Proxy Integration
Proxies are essential for large-scale scraping to avoid IP blocks and access geo-restricted content. Learn more about proxy types in our proxy glossary.
With Requests
import requests
proxies = {
"http": "http://user:pass@proxy.example.com:8080",
"https": "http://user:pass@proxy.example.com:8080",
}
response = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=30)
print(response.json())
With HTTPX
import httpx
proxy = "http://user:pass@proxy.example.com:8080"
with httpx.Client(proxy=proxy) as client:
response = client.get("https://httpbin.org/ip")
print(response.json())
With Scrapy
# settings.py
DOWNLOADER_MIDDLEWARES = {
"scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 110,
}
# In your spider
def start_requests(self):
yield scrapy.Request(
url="https://example.com",
meta={"proxy": "http://user:pass@proxy.example.com:8080"},
)
Rotating Proxies
import random
import requests
proxy_list = [
"http://user:pass@proxy1.example.com:8080",
"http://user:pass@proxy2.example.com:8080",
"http://user:pass@proxy3.example.com:8080",
]
def get_with_rotating_proxy(url, max_retries=3):
for attempt in range(max_retries):
proxy = random.choice(proxy_list)
try:
response = requests.get(
url,
proxies={"http": proxy, "https": proxy},
timeout=30,
)
response.raise_for_status()
return response
except requests.RequestException:
continue
raise Exception(f"Failed after {max_retries} attempts")
For proxy setup guides, see our web scraping proxy guide.
Storing Scraped Data
CSV
import csv
data = [{"title": "Book 1", "price": "$9.99"}, {"title": "Book 2", "price": "$14.99"}]
with open("books.csv", "w", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=["title", "price"])
writer.writeheader()
writer.writerows(data)
JSON
import json
with open("books.json", "w", encoding="utf-8") as f:
json.dump(data, f, indent=2, ensure_ascii=False)
SQLite
import sqlite3
conn = sqlite3.connect("books.db")
cursor = conn.cursor()
cursor.execute("CREATE TABLE IF NOT EXISTS books (title TEXT, price TEXT)")
for item in data:
cursor.execute("INSERT INTO books VALUES (?, ?)", (item["title"], item["price"]))
conn.commit()
conn.close()
Pandas DataFrame
import pandas as pd
df = pd.DataFrame(data)
df.to_csv("books.csv", index=False)
df.to_excel("books.xlsx", index=False)  # requires openpyxl
df.to_json("books.json", orient="records", indent=2)
Best Practices
- Respect robots.txt — Check /robots.txt before scraping. Use urllib.robotparser to parse it programmatically (see the sketch after this list)
- Set reasonable delays — 1-3 seconds between requests minimum. Match the site’s capacity
- Use sessions — Reuse requests.Session() or httpx.Client() for connection pooling
- Handle errors gracefully — Implement retries, timeouts, and logging
- Rotate user agents — Vary your User-Agent header to avoid pattern detection
- Use proxies for scale — Rotate through residential or datacenter proxies for large projects
- Cache responses — Save raw HTML during development to avoid re-fetching
- Check the API first — Many sites have public or semi-public APIs that are faster and more reliable than scraping
- Store raw data — Save the complete response before parsing, so you can re-parse without re-fetching
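For the first and fifth habits above, a minimal sketch combining a robots.txt check (urllib.robotparser ships with Python) with a rotating User-Agent. The UA strings are placeholders; substitute real, current browser values:
import random
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser
import requests

# Placeholder User-Agent strings; swap in real, current browser values
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

def allowed(url, user_agent="*"):
    # Build the robots.txt URL from the target's scheme and host, then check it
    parts = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)

url = "https://books.toscrape.com/catalogue/page-1.html"
if allowed(url):
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    print(requests.get(url, headers=headers, timeout=30).status_code)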
FAQ
What is the best Python library for web scraping?
It depends on your use case. For simple static pages, Requests + BeautifulSoup is the fastest to get started. For large projects with many pages, Scrapy provides built-in concurrency and data pipelines. For JavaScript-heavy sites, Playwright offers the best performance and reliability. See our Python web scraping libraries comparison for a detailed breakdown.
Is web scraping legal in Python?
Web scraping legality depends on what you scrape, not the language. Generally, scraping publicly available data is legal, but you should respect terms of service, avoid scraping personal data without consent, and comply with laws like GDPR and CFAA. Check our web scraping compliance guides for detailed legal guidance.
How do I scrape a website that uses JavaScript?
You have three options: (1) Check the browser’s Network tab for API calls — many SPAs load data from JSON endpoints you can call directly with Requests. (2) Use a headless browser like Playwright or Selenium to render JavaScript. (3) Use a combination like Scrapy + Playwright for large-scale JS scraping.
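For option (3), a minimal sketch assuming the third-party scrapy-playwright plugin (pip install scrapy-playwright, then playwright install chromium):
import scrapy

class JsBooksSpider(scrapy.Spider):
    name = "js_books"

    # Route downloads through Playwright instead of Scrapy's default handler
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }

    def start_requests(self):
        # meta={"playwright": True} tells the plugin to render this request in a browser
        yield scrapy.Request("https://books.toscrape.com/", meta={"playwright": True})

    def parse(self, response):
        for book in response.css("article.product_pod"):
            yield {"title": book.css("h3 a::attr(title)").get()}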
How do I avoid getting blocked while scraping?
Use rotating proxies, vary your User-Agent headers, add random delays between requests, and respect the site’s robots.txt. For heavily protected sites, consider using anti-detect browser configurations or residential proxies.
How fast can Python scrape websites?
With async libraries like HTTPX or Scrapy’s built-in concurrency, Python can process hundreds of pages per second on static sites. Browser-based scraping (Selenium, Playwright) is slower — typically 1-10 pages per second — due to rendering overhead. The bottleneck is usually network latency and rate limiting, not Python’s speed.
For more scraping tutorials, explore our web scraping proxy guides and proxy glossary.
External Resources:
- Python Requests Documentation
- BeautifulSoup Documentation
- Scrapy Documentation
Related Reading
- aiohttp + BeautifulSoup: Async Python Scraping
- Axios + Cheerio: Lightweight Node.js Scraping
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)
- API vs Web Scraping: When You Need Proxies (and When You Don’t)
- ASEAN Data Protection Laws: A Web Scraping Compliance Matrix
- How to Build an Ethical Web Scraping Policy for Your Company