Web Scraping with Python: Complete Tutorial for Beginners (2026)
Python is the most popular language for web scraping, and for good reason: its ecosystem of libraries makes it possible to build a working scraper in under 20 lines of code.
This tutorial takes you from zero to a fully functional web scraping project. You’ll learn the core libraries, handle common challenges like pagination and JavaScript rendering, use proxies to avoid blocks, and store your data in usable formats.
Prerequisites: Basic Python knowledge (variables, loops, functions). Python 3.8+ installed on your machine.
Table of Contents
- Setting Up Your Environment
- Your First Scraper: Requests + BeautifulSoup
- Selecting Elements: CSS Selectors & XPath
- Handling Pagination
- Scraping JavaScript-Rendered Pages
- Using Proxies to Avoid Blocks
- Storing Data: CSV, JSON, and Databases
- Building a Real-World Project
- Scaling Up with Scrapy
- Error Handling & Best Practices
- Ethical Scraping Guidelines
- Next Steps
Setting Up Your Environment {#setup}
Install Python and pip
If you don’t already have Python installed, download it from python.org. Verify your installation:
python --version # Should show 3.8+
pip --version # Package manager
Create a Virtual Environment
Always use a virtual environment to keep your project dependencies isolated:
# Create a new project directory
mkdir my-scraper && cd my-scraper
# Create virtual environment
python -m venv venv
# Activate it
# macOS/Linux:
source venv/bin/activate
# Windows:
venv\Scripts\activate
Install Core Libraries
pip install requests beautifulsoup4 lxml pandas
| Library | Purpose |
|---|---|
| requests | Sending HTTP requests |
| beautifulsoup4 | Parsing HTML and extracting data |
| lxml | Fast HTML/XML parser (used as BS4 backend) |
| pandas | Data manipulation and export |
For JavaScript-heavy sites, you’ll also need:
pip install selenium playwright
playwright install chromium
Your First Scraper: Requests + BeautifulSoup {#first-scraper}
Let’s start with the most common pattern: fetching a web page and extracting data from its HTML.
Step 1: Fetch the Page
import requests
url = "https://books.toscrape.com/"
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
response = requests.get(url, headers=headers)
# Check if request was successful
if response.status_code == 200:
html = response.text
print(f"Fetched {len(html)} characters")
else:
print(f"Failed with status code: {response.status_code}")Why set a User-Agent? Websites can reject requests that don’t have a browser-like User-Agent header. Always set one that mimics a real browser.
Step 2: Parse the HTML
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
# Find all book containers
books = soup.select("article.product_pod")
print(f"Found {len(books)} books")Step 3: Extract Data
results = []
for book in books:
title = book.select_one("h3 a")["title"]
price = book.select_one(".price_color").text
rating = book.select_one("p.star-rating")["class"][1] # e.g., "Three"
availability = book.select_one(".availability").text.strip()
results.append({
"title": title,
"price": price,
"rating": rating,
"in_stock": "In stock" in availability
})
# Print first 3 results
for book in results[:3]:
print(book)
Output:
{'title': 'A Light in the Attic', 'price': '£51.77', 'rating': 'Three', 'in_stock': True}
{'title': 'Tipping the Velvet', 'price': '£53.74', 'rating': 'One', 'in_stock': True}
{'title': 'Soumission', 'price': '£50.10', 'rating': 'One', 'in_stock': True}
Complete First Scraper
Here’s the full script in one piece:
import requests
from bs4 import BeautifulSoup
def scrape_books(url):
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
response = requests.get(url, headers=headers)
response.raise_for_status()
soup = BeautifulSoup(response.text, "lxml")
books = soup.select("article.product_pod")
results = []
for book in books:
results.append({
"title": book.select_one("h3 a")["title"],
"price": book.select_one(".price_color").text,
"rating": book.select_one("p.star-rating")["class"][1],
"in_stock": "In stock" in book.select_one(".availability").text
})
return results
if __name__ == "__main__":
books = scrape_books("https://books.toscrape.com/")
print(f"Scraped {len(books)} books")
for book in books[:5]:
print(f" {book['title']} - {book['price']}")Selecting Elements: CSS Selectors & XPath {#selectors}
Knowing how to target the right HTML elements is the most important scraping skill.
CSS Selectors (Recommended)
CSS selectors are the preferred method for most scrapers. They’re concise, readable, and familiar to anyone who has written CSS.
| Selector | Example | Matches |
|---|---|---|
| Tag | soup.select("h1") | All h1 elements |
| Class | soup.select(".price") | Elements with class="price" |
| ID | soup.select("#main") | Element with id="main" |
| Descendant | soup.select("div.product .price") | .price inside div.product |
| Attribute | soup.select("a[href*='product']") | Links containing “product” in href |
| nth-child | soup.select("tr:nth-child(2)") | Second table row |
| Multiple | soup.select("h1, h2, h3") | All h1, h2, and h3 elements |
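To make these concrete, here is a short sketch that applies several of the selector types above to the books.toscrape.com page from the first scraper. The class names (product_pod, price_color) come from that site; the "catalogue" attribute match is an assumption about its link structure.
```python
import requests
from bs4 import BeautifulSoup

# Fetch the same demo page used in the first scraper
url = "https://books.toscrape.com/"
html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text
soup = BeautifulSoup(html, "lxml")

# Tag selector: every <h3> on the page
headings = soup.select("h3")

# Class selector: every element with class="price_color"
prices = soup.select(".price_color")

# Descendant selector: links nested inside the book containers
links = soup.select("article.product_pod h3 a")

# Attribute selector: links whose href contains "catalogue" (assumed link structure)
catalogue_links = soup.select("a[href*='catalogue']")

print(len(headings), len(prices), len(links), len(catalogue_links))
```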
XPath (For Complex Selections)
XPath is more powerful than CSS selectors for certain patterns. Use it with lxml:
from lxml import html
tree = html.fromstring(response.text)
# Select by text content
prices = tree.xpath('//span[contains(@class, "price")]/text()')
# Select parent element
parent = tree.xpath('//span[@class="price"]/..')
# Select following sibling
description = tree.xpath('//h3/following-sibling::p/text()')
Pro Tips for Finding Selectors
- Use browser DevTools — Right-click an element, choose “Inspect,” and look at its HTML structure
- Copy selector from DevTools — Right-click the element in the Elements panel and choose “Copy > Copy selector”
- Test selectors in the console — Run document.querySelectorAll(".your-selector") in the browser console before coding
- Prefer stable selectors — IDs and semantic class names are more reliable than positional selectors
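As a quick illustration of that last tip, the sketch below (with invented markup) contrasts a positional selector with a semantic one:
```python
from bs4 import BeautifulSoup

# Invented markup, for illustration only
html = """
<div><span>Ad</span><span>Banner</span></div>
<div class="product"><span class="title">Book</span><span class="price">£9.99</span></div>
"""
soup = BeautifulSoup(html, "lxml")

# Brittle: breaks as soon as the page gains or loses a <div> or <span>
positional = soup.select_one("div:nth-of-type(2) > span:nth-of-type(2)")

# Stable: keyed to semantic class names that survive layout changes
semantic = soup.select_one("div.product span.price")

print(positional.text, semantic.text)  # £9.99 £9.99
```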
Handling Pagination {#pagination}
Most real-world scraping involves multiple pages. Here are the common patterns:
Pattern 1: Next Page Link
import requests
from bs4 import BeautifulSoup
import time
def scrape_all_pages(base_url):
all_results = []
url = base_url
while url:
print(f"Scraping: {url}")
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(response.text, "lxml")
# Extract data from current page
for item in soup.select(".product-card"):
all_results.append({
"name": item.select_one(".title").text.strip(),
"price": item.select_one(".price").text.strip()
})
# Find next page link
next_link = soup.select_one("a.next")
if next_link:
url = next_link["href"]
if not url.startswith("http"):
url = base_url.rsplit("/", 1)[0] + "/" + url
else:
url = None # No more pages
time.sleep(1) # Be polite - wait between requests
return all_results
Pattern 2: Page Numbers
def scrape_numbered_pages(base_url, total_pages):
all_results = []
for page in range(1, total_pages + 1):
url = f"{base_url}?page={page}"
print(f"Scraping page {page}/{total_pages}")
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(response.text, "lxml")
items = soup.select(".product-card")
if not items:
break # No more results
for item in items:
all_results.append({
"name": item.select_one(".title").text.strip(),
"price": item.select_one(".price").text.strip()
})
time.sleep(1)
return all_results
Pattern 3: Infinite Scroll (API Calls)
Many modern websites load data via API calls when you scroll. Inspect the Network tab in DevTools to find these:
def scrape_api_pagination(api_url):
all_results = []
offset = 0
limit = 20
while True:
response = requests.get(
api_url,
params={"offset": offset, "limit": limit},
headers={"User-Agent": "Mozilla/5.0"}
)
data = response.json()
if not data.get("results"):
break
all_results.extend(data["results"])
offset += limit
print(f"Fetched {len(all_results)} items so far...")
time.sleep(1)
return all_results
Scraping JavaScript-Rendered Pages {#javascript}
Many modern websites use JavaScript frameworks (React, Vue, Angular) that render content in the browser. Plain requests + BeautifulSoup won’t work because they only see the initial HTML before JavaScript executes.
Option 1: Playwright (Recommended)
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
def scrape_js_page(url):
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page()
# Navigate and wait for content to load
page.goto(url)
page.wait_for_selector(".product-card", timeout=10000)
# Get the fully rendered HTML
html = page.content()
browser.close()
# Parse with BeautifulSoup as usual
soup = BeautifulSoup(html, "lxml")
products = []
for card in soup.select(".product-card"):
products.append({
"name": card.select_one(".title").text.strip(),
"price": card.select_one(".price").text.strip()
})
return products
Option 2: Selenium
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
def scrape_with_selenium(url):
options = Options()
options.add_argument("--headless")
options.add_argument("--no-sandbox")
driver = webdriver.Chrome(options=options)
driver.get(url)
# Wait for dynamic content
WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.CSS_SELECTOR, ".product-card"))
)
products = []
cards = driver.find_elements(By.CSS_SELECTOR, ".product-card")
for card in cards:
products.append({
"name": card.find_element(By.CSS_SELECTOR, ".title").text,
"price": card.find_element(By.CSS_SELECTOR, ".price").text
})
driver.quit()
return products
Option 3: Find the Hidden API
Before reaching for a headless browser, check if the website has a hidden API. Open DevTools, go to the Network tab, filter by XHR/Fetch, and look for JSON responses. If the data comes from an API, you can call it directly with requests — much faster and more efficient.
# Instead of rendering JS, call the API directly
response = requests.get(
"https://example.com/api/products",
params={"category": "electronics", "page": 1},
headers={
"User-Agent": "Mozilla/5.0",
"Accept": "application/json"
}
)
data = response.json()
Using Proxies to Avoid Blocks {#proxies}
Once you’re scraping more than a few hundred pages, you’ll start getting blocked. Proxies distribute your requests across multiple IP addresses to avoid detection.
Basic Proxy Usage with Requests
import requests
proxies = {
"http": "http://username:password@proxy-gateway.provider.com:7777",
"https": "http://username:password@proxy-gateway.provider.com:7777"
}
response = requests.get(
"https://example.com",
proxies=proxies,
timeout=30
)
Rotating Proxies with a Proxy List
import requests
import random
proxy_list = [
"http://user:pass@gate.provider.com:7777",
"http://user:pass@gate.provider.com:7778",
"http://user:pass@gate.provider.com:7779",
]
def get_with_proxy(url, max_retries=3):
for attempt in range(max_retries):
proxy = random.choice(proxy_list)
try:
response = requests.get(
url,
proxies={"http": proxy, "https": proxy},
timeout=30,
headers={"User-Agent": "Mozilla/5.0"}
)
if response.status_code == 200:
return response
except requests.exceptions.RequestException as e:
print(f"Attempt {attempt + 1} failed: {e}")
return None
Using Proxies with Playwright
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch(
headless=True,
proxy={
"server": "http://proxy-gateway.provider.com:7777",
"username": "your_username",
"password": "your_password"
}
)
page = browser.new_page()
page.goto("https://example.com")
print(page.content())
browser.close()
Anti-Detection Tips
Beyond proxies, these techniques help avoid blocks:
- Rotate User-Agent strings — Use a list of real browser User-Agents
- Random delays — Add 1-5 second random delays between requests
- Respect rate limits — If you get 429 responses, slow down
- Handle cookies — Use requests.Session() to maintain cookies like a real browser
- Mimic browser headers — Include Accept, Accept-Language, and Referer headers
import random
import time
USER_AGENTS = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/120.0.0.0",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/120.0.0.0",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
]
def polite_request(url, session=None):
if session is None:
session = requests.Session()
headers = {
"User-Agent": random.choice(USER_AGENTS),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.5",
}
time.sleep(random.uniform(1, 3))
return session.get(url, headers=headers, timeout=30)
For a complete guide to proxy selection, see What Is a Residential Proxy?
Storing Data: CSV, JSON, and Databases {#storing-data}
CSV (Simple Tabular Data)
import csv
def save_to_csv(data, filename="output.csv"):
if not data:
return
keys = data[0].keys()
with open(filename, "w", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=keys)
writer.writeheader()
writer.writerows(data)
print(f"Saved {len(data)} records to {filename}")JSON (Nested/Complex Data)
import json
def save_to_json(data, filename="output.json"):
with open(filename, "w", encoding="utf-8") as f:
json.dump(data, f, indent=2, ensure_ascii=False)
print(f"Saved {len(data)} records to {filename}")Pandas DataFrame (Analysis-Ready)
import pandas as pd
def save_with_pandas(data, filename="output"):
df = pd.DataFrame(data)
# Clean and transform
df["price"] = df["price"].str.replace("£", "").astype(float)
df["in_stock"] = df["in_stock"].astype(bool)
# Export to multiple formats
df.to_csv(f"{filename}.csv", index=False)
df.to_json(f"{filename}.json", orient="records", indent=2)
df.to_excel(f"{filename}.xlsx", index=False)
print(df.describe())
return df
SQLite Database (Persistent Storage)
import sqlite3
def save_to_database(data, db_name="scraping.db", table="products"):
conn = sqlite3.connect(db_name)
cursor = conn.cursor()
# Create table dynamically based on data keys
if data:
columns = ", ".join(f"{key} TEXT" for key in data[0].keys())
cursor.execute(f"CREATE TABLE IF NOT EXISTS {table} ({columns})")
placeholders = ", ".join("?" * len(data[0]))
for record in data:
cursor.execute(
f"INSERT INTO {table} VALUES ({placeholders})",
list(record.values())
)
conn.commit()
print(f"Saved {len(data)} records to {db_name}")
conn.close()
Building a Real-World Project {#real-world-project}
Let’s build a complete scraper that collects book data from all 50 pages of books.toscrape.com:
"""
Complete web scraper: Books to Scrape
Collects all books across all pages with error handling,
rate limiting, and data export.
"""
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random
import logging
from urllib.parse import urljoin
# Setup logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(message)s")
logger = logging.getLogger(__name__)
BASE_URL = "https://books.toscrape.com/"
RATING_MAP = {
"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5
}
def create_session():
"""Create a requests session with default headers."""
session = requests.Session()
session.headers.update({
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
"Accept": "text/html,application/xhtml+xml",
"Accept-Language": "en-US,en;q=0.9",
})
return session
def scrape_page(session, url):
"""Scrape a single page and return book data + next page URL."""
time.sleep(random.uniform(0.5, 1.5))
try:
response = session.get(url, timeout=30)
response.raise_for_status()
except requests.exceptions.RequestException as e:
logger.error(f"Failed to fetch {url}: {e}")
return [], None
soup = BeautifulSoup(response.text, "lxml")
books = []
for article in soup.select("article.product_pod"):
try:
title = article.select_one("h3 a")["title"]
price_text = article.select_one(".price_color").text
price = float(price_text.replace("£", "").replace("Â", ""))
rating_class = article.select_one("p.star-rating")["class"][1]
rating = RATING_MAP.get(rating_class, 0)
in_stock = "In stock" in article.select_one(".availability").text
detail_url = urljoin(url, article.select_one("h3 a")["href"])
books.append({
"title": title,
"price_gbp": price,
"rating": rating,
"in_stock": in_stock,
"detail_url": detail_url
})
except (AttributeError, KeyError, ValueError) as e:
logger.warning(f"Failed to parse a book: {e}")
continue
# Find next page
next_btn = soup.select_one("li.next a")
next_url = urljoin(url, next_btn["href"]) if next_btn else None
return books, next_url
def scrape_all_books():
"""Scrape all books from all pages."""
session = create_session()
all_books = []
url = BASE_URL
page_num = 1
while url:
logger.info(f"Scraping page {page_num}: {url}")
books, next_url = scrape_page(session, url)
all_books.extend(books)
logger.info(f" Found {len(books)} books (total: {len(all_books)})")
url = next_url
page_num += 1
return all_books
def analyze_and_export(books):
"""Analyze scraped data and export to files."""
df = pd.DataFrame(books)
# Analysis
logger.info(f"\n{'='*50}")
logger.info(f"Total books scraped: {len(df)}")
logger.info(f"Price range: £{df['price_gbp'].min():.2f} - £{df['price_gbp'].max():.2f}")
logger.info(f"Average price: £{df['price_gbp'].mean():.2f}")
logger.info(f"Average rating: {df['rating'].mean():.1f}/5")
logger.info(f"In stock: {df['in_stock'].sum()} / {len(df)}")
# Export
df.to_csv("books_data.csv", index=False)
df.to_json("books_data.json", orient="records", indent=2)
logger.info(f"Exported to books_data.csv and books_data.json")
return df
if __name__ == "__main__":
books = scrape_all_books()
df = analyze_and_export(books)
This scraper demonstrates all the core concepts: session management, error handling, pagination, data cleaning, and export.
Scaling Up with Scrapy {#scrapy}
When your scraping needs outgrow simple scripts, Scrapy provides a production-ready framework:
pip install scrapy
scrapy startproject bookstore
# bookstore/spiders/books_spider.py
import scrapy
class BooksSpider(scrapy.Spider):
name = "books"
start_urls = ["https://books.toscrape.com/"]
custom_settings = {
"DOWNLOAD_DELAY": 1,
"CONCURRENT_REQUESTS": 4,
"FEEDS": {
"books.json": {"format": "json", "overwrite": True},
"books.csv": {"format": "csv", "overwrite": True},
}
}
def parse(self, response):
for book in response.css("article.product_pod"):
yield {
"title": book.css("h3 a::attr(title)").get(),
"price": book.css(".price_color::text").get(),
"rating": book.css("p.star-rating::attr(class)").get().split()[-1],
"url": response.urljoin(book.css("h3 a::attr(href)").get()),
}
next_page = response.css("li.next a::attr(href)").get()
if next_page:
yield response.follow(next_page, self.parse)
Run it:
cd bookstore
scrapy crawl books
Scrapy handles concurrency, rate limiting, retries, and data export automatically. For large projects, it’s worth the learning investment. See our guide to web scraping tools for a full comparison.
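Those behaviours can also be configured project-wide instead of per spider. A minimal settings.py sketch with illustrative values (tune them for your target site):
```python
# bookstore/settings.py (excerpt) - illustrative starting points, not recommendations
BOT_NAME = "bookstore"

# Throttle the request rate automatically based on server response times
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10

# Retry transient failures (timeouts, 5xx responses)
RETRY_ENABLED = True
RETRY_TIMES = 3

# Stay polite: modest concurrency, fixed delay, honor robots.txt
CONCURRENT_REQUESTS_PER_DOMAIN = 4
DOWNLOAD_DELAY = 1
ROBOTSTXT_OBEY = True
```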
Error Handling & Best Practices {#best-practices}
Robust Error Handling
import requests
from requests.exceptions import RequestException, Timeout, ConnectionError
import time
def resilient_request(url, max_retries=3, backoff_factor=2):
"""Make a request with retries and exponential backoff."""
for attempt in range(max_retries):
try:
response = requests.get(url, timeout=30, headers={
"User-Agent": "Mozilla/5.0"
})
if response.status_code == 200:
return response
elif response.status_code == 429:
# Rate limited - wait longer
wait_time = backoff_factor ** (attempt + 2)
print(f"Rate limited. Waiting {wait_time}s...")
time.sleep(wait_time)
elif response.status_code >= 500:
# Server error - retry
time.sleep(backoff_factor ** attempt)
else:
print(f"Unexpected status: {response.status_code}")
return None
except Timeout:
print(f"Timeout on attempt {attempt + 1}")
except ConnectionError:
print(f"Connection error on attempt {attempt + 1}")
time.sleep(backoff_factor ** attempt)
except RequestException as e:
print(f"Request failed: {e}")
return None
print(f"All {max_retries} attempts failed for {url}")
return None
Best Practices Checklist
- Always set timeouts — Never let requests hang indefinitely
- Use sessions — Reuse connections for better performance
- Handle encoding — Use response.encoding and handle Unicode properly
- Log everything — Record URLs scraped, errors, and timing
- Save progress — Write data incrementally, not just at the end (see the sketch after this checklist)
- Validate data — Check for empty fields and malformed data
- Monitor performance — Track success rates and response times
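As a sketch of the "save progress" and "validate data" points, here is one way to append validated records to a CSV as each page finishes, so a crash mid-run does not lose everything. The filename and required fields are assumptions for the example.
```python
import csv
import os

REQUIRED_FIELDS = ["title", "price", "rating"]  # assumed schema for this example

def append_records(records, filename="progress.csv"):
    """Validate records and append them to a CSV file incrementally."""
    write_header = not os.path.exists(filename)
    with open(filename, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=REQUIRED_FIELDS)
        if write_header:
            writer.writeheader()
        for record in records:
            # Skip malformed rows instead of corrupting the output file
            if not all(record.get(field) for field in REQUIRED_FIELDS):
                continue
            writer.writerow({field: record[field] for field in REQUIRED_FIELDS})

# Call this after each page instead of waiting for the full run to finish
append_records([{"title": "Example", "price": "£9.99", "rating": 3}])
```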
Ethical Scraping Guidelines {#ethics}
Web scraping carries responsibilities. Follow these guidelines to scrape ethically:
- Check robots.txt — Read https://example.com/robots.txt before scraping (see the sketch after this list)
- Respect rate limits — Add delays between requests (1-3 seconds minimum)
- Don’t overload servers — Limit concurrent requests
- Identify yourself — Use a User-Agent that includes contact info for large projects
- Avoid personal data — Don’t collect PII unless you have a legal basis
- Check the ToS — Read the website’s Terms of Service
- Use APIs when available — If the site offers an API, prefer it over scraping
- Cache results — Don’t re-scrape pages unnecessarily
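For the robots.txt check, Python's standard library already includes a parser. A minimal sketch, assuming your scraper identifies itself with its own User-Agent string:
```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_to_fetch(url, user_agent="my-scraper"):
    """Return True if the site's robots.txt permits fetching this URL."""
    parts = urlparse(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    parser = RobotFileParser()
    parser.set_url(robots_url)
    parser.read()
    return parser.can_fetch(user_agent, url)

if allowed_to_fetch("https://books.toscrape.com/catalogue/page-2.html"):
    print("robots.txt allows this URL")
```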
For a deep dive into the legal side, read our guide: Is Web Scraping Legal?
Next Steps {#next-steps}
You now have the foundation to build web scrapers for any project. Here’s where to go next:
- Practice — Build scrapers for sites you actually need data from
- Learn Scrapy — Graduate to the full framework for production projects
- Master Playwright — Essential for JavaScript-heavy sites
- Set up proxies — Required once you scale beyond testing (Web Scraping Proxy Guide)
- Explore scraping tools — See our comparison of 15 tools to find the right fit
- Check compliance — Use our Data Collection Compliance Checker for your specific use case
Happy scraping.
Related Reading
- Best Proxy Providers 2026: Ultimate Comparison Guide
- 15 Best Web Scraping Tools in 2026: Expert Comparison
- 10 Myths About Web Scraping That Need to Die in 2026
- Are Proxies Legal? Understanding the Law Around Proxy Servers
- 403 Forbidden Error: What It Means & How to Fix It
- 407 Proxy Authentication Required: Fix Guide