Web Scraping with Python: Complete Tutorial for Beginners (2026)

Python is the most popular language for web scraping, and for good reason: its ecosystem of libraries makes it possible to build a working scraper in under 20 lines of code.

This tutorial takes you from zero to a fully functional web scraping project. You’ll learn the core libraries, handle common challenges like pagination and JavaScript rendering, use proxies to avoid blocks, and store your data in usable formats.

Prerequisites: Basic Python knowledge (variables, loops, functions). Python 3.8+ installed on your machine.

Table of Contents

- [Setting Up Your Environment](#setup)
- [Your First Scraper: Requests + BeautifulSoup](#first-scraper)
- [Selecting Elements: CSS Selectors & XPath](#selectors)
- [Handling Pagination](#pagination)
- [Scraping JavaScript-Rendered Pages](#javascript)
- [Using Proxies to Avoid Blocks](#proxies)
- [Storing Data: CSV, JSON, and Databases](#storing-data)
- [Building a Real-World Project](#real-world-project)
- [Scaling Up with Scrapy](#scrapy)
- [Error Handling & Best Practices](#best-practices)
- [Ethical Scraping Guidelines](#ethics)
- [Next Steps](#next-steps)

Setting Up Your Environment {#setup}

Install Python and pip

If you don’t already have Python installed, download it from python.org. Verify your installation:

python --version   # Should show 3.8+
pip --version      # Package manager

Create a Virtual Environment

Always use a virtual environment to keep your project dependencies isolated:

# Create a new project directory
mkdir my-scraper && cd my-scraper

# Create virtual environment
python -m venv venv

# Activate it
# macOS/Linux:
source venv/bin/activate
# Windows:
venv\Scripts\activate

Install Core Libraries

pip install requests beautifulsoup4 lxml pandas
- requests: sending HTTP requests
- beautifulsoup4: parsing HTML and extracting data
- lxml: fast HTML/XML parser (used as the BeautifulSoup backend)
- pandas: data manipulation and export

For JavaScript-heavy sites, you’ll also need:

pip install selenium playwright
playwright install chromium

Your First Scraper: Requests + BeautifulSoup {#first-scraper}

Let’s start with the most common pattern: fetching a web page and extracting data from its HTML.

Step 1: Fetch the Page

import requests

url = "https://books.toscrape.com/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}

response = requests.get(url, headers=headers)

# Check if request was successful
if response.status_code == 200:
    html = response.text
    print(f"Fetched {len(html)} characters")
else:
    print(f"Failed with status code: {response.status_code}")

Why set a User-Agent? Websites can reject requests that don’t have a browser-like User-Agent header. Always set one that mimics a real browser.

Step 2: Parse the HTML

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")

# Find all book containers
books = soup.select("article.product_pod")
print(f"Found {len(books)} books")

Step 3: Extract Data

results = []

for book in books:
    title = book.select_one("h3 a")["title"]
    price = book.select_one(".price_color").text
    rating = book.select_one("p.star-rating")["class"][1]  # e.g., "Three"
    availability = book.select_one(".availability").text.strip()

    results.append({
        "title": title,
        "price": price,
        "rating": rating,
        "in_stock": "In stock" in availability
    })

# Print first 3 results
for book in results[:3]:
    print(book)

Output:

{'title': 'A Light in the Attic', 'price': '£51.77', 'rating': 'Three', 'in_stock': True}
{'title': 'Tipping the Velvet', 'price': '£53.74', 'rating': 'One', 'in_stock': True}
{'title': 'Soumission', 'price': '£50.10', 'rating': 'One', 'in_stock': True}

Complete First Scraper

Here’s the full script in one piece:

import requests
from bs4 import BeautifulSoup

def scrape_books(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    }

    response = requests.get(url, headers=headers)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "lxml")
    books = soup.select("article.product_pod")

    results = []
    for book in books:
        results.append({
            "title": book.select_one("h3 a")["title"],
            "price": book.select_one(".price_color").text,
            "rating": book.select_one("p.star-rating")["class"][1],
            "in_stock": "In stock" in book.select_one(".availability").text
        })

    return results

if __name__ == "__main__":
    books = scrape_books("https://books.toscrape.com/")
    print(f"Scraped {len(books)} books")
    for book in books[:5]:
        print(f"  {book['title']} - {book['price']}")

Selecting Elements: CSS Selectors & XPath {#selectors}

Knowing how to target the right HTML elements is the most important scraping skill.

CSS Selectors (Recommended)

CSS selectors are the preferred method for most scrapers. They’re concise, readable, and familiar to anyone who has written CSS.

- Tag: soup.select("h1") selects all h1 elements
- Class: soup.select(".price") selects elements with class="price"
- ID: soup.select("#main") selects the element with id="main"
- Descendant: soup.select("div.product .price") selects .price elements inside div.product
- Attribute: soup.select("a[href*='product']") selects links whose href contains "product"
- nth-child: soup.select("tr:nth-child(2)") selects the second table row
- Multiple: soup.select("h1, h2, h3") selects all h1, h2, and h3 elements

XPath (For Complex Selections)

XPath is more powerful than CSS selectors for certain patterns. Use it with lxml:

from lxml import html

tree = html.fromstring(response.text)

# Select by text content
prices = tree.xpath('//span[contains(@class, "price")]/text()')

# Select parent element
parent = tree.xpath('//span[@class="price"]/..')

# Select following sibling
description = tree.xpath('//h3/following-sibling::p/text()')

Pro Tips for Finding Selectors

  1. Use browser DevTools — Right-click an element, choose “Inspect,” and look at its HTML structure
  2. Copy selector from DevTools — Right-click the element in the Elements panel and choose “Copy > Copy selector”
  3. Test selectors in the console — Run document.querySelectorAll(".your-selector") in the browser console before coding
  4. Prefer stable selectors — IDs and semantic class names are more reliable than positional selectors

Handling Pagination {#pagination}

Most real-world scraping involves multiple pages. Here are the common patterns:

Pattern 1: Next Page Link

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import time

def scrape_all_pages(base_url):
    all_results = []
    url = base_url

    while url:
        print(f"Scraping: {url}")
        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
        soup = BeautifulSoup(response.text, "lxml")

        # Extract data from current page
        for item in soup.select(".product-card"):
            all_results.append({
                "name": item.select_one(".title").text.strip(),
                "price": item.select_one(".price").text.strip()
            })

        # Find the next page link and resolve it against the *current* URL,
        # not the base URL - deeper pages may use different relative paths
        next_link = soup.select_one("a.next")
        url = urljoin(url, next_link["href"]) if next_link else None

        time.sleep(1)  # Be polite - wait between requests

    return all_results
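Resolving relative next-page hrefs against the current URL is a common source of pagination bugs. The standard library's urllib.parse.urljoin handles relative, root-relative, and absolute hrefs uniformly (the URLs below are just illustrations):

```python
from urllib.parse import urljoin

current = "https://books.toscrape.com/catalogue/page-1.html"

# Relative href: replaces the last path segment
print(urljoin(current, "page-2.html"))
# https://books.toscrape.com/catalogue/page-2.html

# Root-relative href: keeps only the scheme and host
print(urljoin(current, "/catalogue/page-3.html"))
# https://books.toscrape.com/catalogue/page-3.html

# Absolute href: returned as-is
print(urljoin(current, "https://books.toscrape.com/index.html"))
# https://books.toscrape.com/index.html
```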

Pattern 2: Page Numbers

def scrape_numbered_pages(base_url, total_pages):
    all_results = []

    for page in range(1, total_pages + 1):
        url = f"{base_url}?page={page}"
        print(f"Scraping page {page}/{total_pages}")

        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
        soup = BeautifulSoup(response.text, "lxml")

        items = soup.select(".product-card")
        if not items:
            break  # No more results

        for item in items:
            all_results.append({
                "name": item.select_one(".title").text.strip(),
                "price": item.select_one(".price").text.strip()
            })

        time.sleep(1)

    return all_results

Pattern 3: Infinite Scroll (API Calls)

Many modern websites load data via API calls when you scroll. Inspect the Network tab in DevTools to find these:

def scrape_api_pagination(api_url):
    all_results = []
    offset = 0
    limit = 20

    while True:
        response = requests.get(
            api_url,
            params={"offset": offset, "limit": limit},
            headers={"User-Agent": "Mozilla/5.0"}
        )
        data = response.json()

        if not data.get("results"):
            break

        all_results.extend(data["results"])
        offset += limit

        print(f"Fetched {len(all_results)} items so far...")
        time.sleep(1)

    return all_results

Scraping JavaScript-Rendered Pages {#javascript}

Many modern websites use JavaScript frameworks (React, Vue, Angular) that render content in the browser. Plain requests + BeautifulSoup won’t work because they only see the initial HTML before JavaScript executes.

Option 1: Playwright (Recommended)

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

def scrape_js_page(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Navigate and wait for content to load
        page.goto(url)
        page.wait_for_selector(".product-card", timeout=10000)

        # Get the fully rendered HTML
        html = page.content()
        browser.close()

    # Parse with BeautifulSoup as usual
    soup = BeautifulSoup(html, "lxml")
    products = []
    for card in soup.select(".product-card"):
        products.append({
            "name": card.select_one(".title").text.strip(),
            "price": card.select_one(".price").text.strip()
        })

    return products

Option 2: Selenium

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def scrape_with_selenium(url):
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")

    driver = webdriver.Chrome(options=options)
    driver.get(url)

    # Wait for dynamic content
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".product-card"))
    )

    products = []
    cards = driver.find_elements(By.CSS_SELECTOR, ".product-card")
    for card in cards:
        products.append({
            "name": card.find_element(By.CSS_SELECTOR, ".title").text,
            "price": card.find_element(By.CSS_SELECTOR, ".price").text
        })

    driver.quit()
    return products

Option 3: Find the Hidden API

Before reaching for a headless browser, check if the website has a hidden API. Open DevTools, go to the Network tab, filter by XHR/Fetch, and look for JSON responses. If the data comes from an API, you can call it directly with requests — much faster and more efficient.

# Instead of rendering JS, call the API directly
response = requests.get(
    "https://example.com/api/products",
    params={"category": "electronics", "page": 1},
    headers={
        "User-Agent": "Mozilla/5.0",
        "Accept": "application/json"
    }
)
data = response.json()

Using Proxies to Avoid Blocks {#proxies}

Once you scrape at any real volume, many websites will start blocking your IP address. Proxies distribute your requests across multiple IPs so that no single address exceeds a site's rate thresholds.

Basic Proxy Usage with Requests

import requests

proxies = {
    "http": "http://username:password@proxy-gateway.provider.com:7777",
    "https": "http://username:password@proxy-gateway.provider.com:7777"
}

response = requests.get(
    "https://example.com",
    proxies=proxies,
    timeout=30
)

Rotating Proxies with a Proxy List

import requests
import random

proxy_list = [
    "http://user:pass@gate.provider.com:7777",
    "http://user:pass@gate.provider.com:7778",
    "http://user:pass@gate.provider.com:7779",
]

def get_with_proxy(url, max_retries=3):
    for attempt in range(max_retries):
        proxy = random.choice(proxy_list)
        try:
            response = requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                timeout=30,
                headers={"User-Agent": "Mozilla/5.0"}
            )
            if response.status_code == 200:
                return response
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")

    return None

Using Proxies with Playwright

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        proxy={
            "server": "http://proxy-gateway.provider.com:7777",
            "username": "your_username",
            "password": "your_password"
        }
    )
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.content())
    browser.close()

Anti-Detection Tips

Beyond proxies, these techniques help avoid blocks:

  1. Rotate User-Agent strings — Use a list of real browser User-Agents
  2. Random delays — Add 1-5 second random delays between requests
  3. Respect rate limits — If you get 429 responses, slow down
  4. Handle cookies — Use requests.Session() to maintain cookies like a real browser
  5. Mimic browser headers — Include Accept, Accept-Language, and Referer headers
Combining several of these:

import requests
import random
import time

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/120.0.0.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/120.0.0.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def polite_request(url, session=None):
    if session is None:
        session = requests.Session()

    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
    }

    time.sleep(random.uniform(1, 3))
    return session.get(url, headers=headers, timeout=30)

For a complete guide to proxy selection, see What Is a Residential Proxy?


Storing Data: CSV, JSON, and Databases {#storing-data}

CSV (Simple Tabular Data)

import csv

def save_to_csv(data, filename="output.csv"):
    if not data:
        return

    keys = data[0].keys()
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=keys)
        writer.writeheader()
        writer.writerows(data)

    print(f"Saved {len(data)} records to {filename}")

JSON (Nested/Complex Data)

import json

def save_to_json(data, filename="output.json"):
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

    print(f"Saved {len(data)} records to {filename}")

Pandas DataFrame (Analysis-Ready)

import pandas as pd

def save_with_pandas(data, filename="output"):
    df = pd.DataFrame(data)

    # Clean and transform
    df["price"] = df["price"].str.replace("£", "").astype(float)
    df["in_stock"] = df["in_stock"].astype(bool)

    # Export to multiple formats
    df.to_csv(f"{filename}.csv", index=False)
    df.to_json(f"{filename}.json", orient="records", indent=2)
    df.to_excel(f"{filename}.xlsx", index=False)

    print(df.describe())
    return df

SQLite Database (Persistent Storage)

import sqlite3

def save_to_database(data, db_name="scraping.db", table="products"):
    conn = sqlite3.connect(db_name)
    cursor = conn.cursor()

    # Create table dynamically based on data keys
    if data:
        columns = ", ".join(f"{key} TEXT" for key in data[0].keys())
        cursor.execute(f"CREATE TABLE IF NOT EXISTS {table} ({columns})")

        placeholders = ", ".join("?" * len(data[0]))
        cursor.executemany(
            f"INSERT INTO {table} VALUES ({placeholders})",
            [list(record.values()) for record in data]
        )

    conn.commit()
    print(f"Saved {len(data)} records to {db_name}")
    conn.close()

Building a Real-World Project {#real-world-project}

Let’s build a complete scraper that collects book data from all 50 pages of books.toscrape.com:

"""
Complete web scraper: Books to Scrape
Collects all books across all pages with error handling,
rate limiting, and data export.
"""

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random
import logging
from urllib.parse import urljoin

# Setup logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(message)s")
logger = logging.getLogger(__name__)

BASE_URL = "https://books.toscrape.com/"

RATING_MAP = {
    "One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5
}

def create_session():
    """Create a requests session with default headers."""
    session = requests.Session()
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.9",
    })
    return session

def scrape_page(session, url):
    """Scrape a single page and return book data + next page URL."""
    time.sleep(random.uniform(0.5, 1.5))

    try:
        response = session.get(url, timeout=30)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        logger.error(f"Failed to fetch {url}: {e}")
        return [], None

    soup = BeautifulSoup(response.text, "lxml")
    books = []

    for article in soup.select("article.product_pod"):
        try:
            title = article.select_one("h3 a")["title"]
            price_text = article.select_one(".price_color").text
            price = float(price_text.replace("£", "").replace("Â", ""))  # "Â" strips a common mojibake artifact
            rating_class = article.select_one("p.star-rating")["class"][1]
            rating = RATING_MAP.get(rating_class, 0)
            in_stock = "In stock" in article.select_one(".availability").text
            detail_url = urljoin(url, article.select_one("h3 a")["href"])

            books.append({
                "title": title,
                "price_gbp": price,
                "rating": rating,
                "in_stock": in_stock,
                "detail_url": detail_url
            })
        except (AttributeError, KeyError, ValueError) as e:
            logger.warning(f"Failed to parse a book: {e}")
            continue

    # Find next page
    next_btn = soup.select_one("li.next a")
    next_url = urljoin(url, next_btn["href"]) if next_btn else None

    return books, next_url

def scrape_all_books():
    """Scrape all books from all pages."""
    session = create_session()
    all_books = []
    url = BASE_URL
    page_num = 1

    while url:
        logger.info(f"Scraping page {page_num}: {url}")
        books, next_url = scrape_page(session, url)
        all_books.extend(books)
        logger.info(f"  Found {len(books)} books (total: {len(all_books)})")

        url = next_url
        page_num += 1

    return all_books

def analyze_and_export(books):
    """Analyze scraped data and export to files."""
    df = pd.DataFrame(books)

    # Analysis
    logger.info(f"\n{'='*50}")
    logger.info(f"Total books scraped: {len(df)}")
    logger.info(f"Price range: £{df['price_gbp'].min():.2f} - £{df['price_gbp'].max():.2f}")
    logger.info(f"Average price: £{df['price_gbp'].mean():.2f}")
    logger.info(f"Average rating: {df['rating'].mean():.1f}/5")
    logger.info(f"In stock: {df['in_stock'].sum()} / {len(df)}")

    # Export
    df.to_csv("books_data.csv", index=False)
    df.to_json("books_data.json", orient="records", indent=2)
    logger.info(f"Exported to books_data.csv and books_data.json")

    return df

if __name__ == "__main__":
    books = scrape_all_books()
    df = analyze_and_export(books)

This scraper demonstrates all the core concepts: session management, error handling, pagination, data cleaning, and export.


Scaling Up with Scrapy {#scrapy}

When your scraping needs outgrow simple scripts, Scrapy provides a production-ready framework:

pip install scrapy
scrapy startproject bookstore

Then create a spider:

# bookstore/spiders/books_spider.py
import scrapy
import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://books.toscrape.com/"]

    custom_settings = {
        "DOWNLOAD_DELAY": 1,
        "CONCURRENT_REQUESTS": 4,
        "FEEDS": {
            "books.json": {"format": "json", "overwrite": True},
            "books.csv": {"format": "csv", "overwrite": True},
        }
    }

    def parse(self, response):
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css(".price_color::text").get(),
                "rating": book.css("p.star-rating::attr(class)").get().split()[-1],
                "url": response.urljoin(book.css("h3 a::attr(href)").get()),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

Run it:

cd bookstore
scrapy crawl books

Scrapy handles concurrency, rate limiting, retries, and data export automatically. For large projects, it’s worth the learning investment. See our guide to web scraping tools for a full comparison.
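Beyond the per-spider custom_settings shown above, project-wide defaults live in bookstore/settings.py. A hedged sketch — the setting names are real Scrapy options, but the values are illustrative and should be tuned per target site:

```python
# bookstore/settings.py -- illustrative values, tune per target site
ROBOTSTXT_OBEY = True        # honor robots.txt automatically
DOWNLOAD_DELAY = 1           # base delay (seconds) between requests
AUTOTHROTTLE_ENABLED = True  # adapt the delay to server response times
RETRY_ENABLED = True
RETRY_TIMES = 3              # retry a failed request up to 3 more times
DEFAULT_REQUEST_HEADERS = {
    "Accept-Language": "en-US,en;q=0.9",
}
```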


Error Handling & Best Practices {#best-practices}

Robust Error Handling

import requests
from requests.exceptions import RequestException, Timeout, ConnectionError
import time

def resilient_request(url, max_retries=3, backoff_factor=2):
    """Make a request with retries and exponential backoff."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=30, headers={
                "User-Agent": "Mozilla/5.0"
            })

            if response.status_code == 200:
                return response
            elif response.status_code == 429:
                # Rate limited - wait longer
                wait_time = backoff_factor ** (attempt + 2)
                print(f"Rate limited. Waiting {wait_time}s...")
                time.sleep(wait_time)
            elif response.status_code >= 500:
                # Server error - retry
                time.sleep(backoff_factor ** attempt)
            else:
                print(f"Unexpected status: {response.status_code}")
                return None

        except Timeout:
            print(f"Timeout on attempt {attempt + 1}")
        except ConnectionError:
            print(f"Connection error on attempt {attempt + 1}")
            time.sleep(backoff_factor ** attempt)
        except RequestException as e:
            print(f"Request failed: {e}")
            return None

    print(f"All {max_retries} attempts failed for {url}")
    return None

Best Practices Checklist

  1. Always set timeouts — Never let requests hang indefinitely
  2. Use sessions — Reuse connections for better performance
  3. Handle encoding — Use response.encoding and handle Unicode properly
  4. Log everything — Record URLs scraped, errors, and timing
  5. Save progress — Write data incrementally, not just at the end
  6. Validate data — Check for empty fields and malformed data
  7. Monitor performance — Track success rates and response times

Ethical Scraping Guidelines {#ethics}

Web scraping carries responsibilities. Follow these guidelines to scrape ethically:

  1. Check robots.txt — Read https://example.com/robots.txt before scraping
  2. Respect rate limits — Add delays between requests (1-3 seconds minimum)
  3. Don’t overload servers — Limit concurrent requests
  4. Identify yourself — Use a User-Agent that includes contact info for large projects
  5. Avoid personal data — Don’t collect PII unless you have a legal basis
  6. Check the ToS — Read the website’s Terms of Service
  7. Use APIs when available — If the site offers an API, prefer it over scraping
  8. Cache results — Don’t re-scrape pages unnecessarily
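Item 1 can be automated with the standard library's urllib.robotparser. A minimal sketch — the robots.txt content and user agent string here are illustrative; in a real scraper you would load the file from the site:

```python
from urllib.robotparser import RobotFileParser

# In a real scraper:
#   parser.set_url("https://example.com/robots.txt"); parser.read()
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("MyScraper/1.0", "https://example.com/products"))   # True
print(parser.can_fetch("MyScraper/1.0", "https://example.com/private/x"))  # False
```

Check the result before every crawl section, not just once at startup, since rules can differ per path.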

For a deep dive into the legal side, read our guide: Is Web Scraping Legal?


Next Steps {#next-steps}

You now have the foundation to build web scrapers for any project. Here’s where to go next:

  1. Practice — Build scrapers for sites you actually need data from
  2. Learn Scrapy — Graduate to the full framework for production projects
  3. Master Playwright — Essential for JavaScript-heavy sites
  4. Set up proxies — Required once you scale beyond testing (Web Scraping Proxy Guide)
  5. Explore scraping tools — See our comparison of 15 tools to find the right fit
  6. Check compliance — Use our Data Collection Compliance Checker for your specific use case

Happy scraping.

