Beautiful Soup Tutorial: Python HTML Parsing Guide

Beautiful Soup is the most popular HTML parsing library in Python. It turns messy, real-world HTML into a navigable tree that you can search with CSS selectors, element names, or attributes. Combined with the Requests library for fetching pages, it forms the simplest web scraping stack available.

This tutorial covers everything from basic tag extraction to advanced techniques for parsing complex, broken HTML. Every example uses real patterns you’ll encounter in production scraping.
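
To see the stack end to end, here is a minimal fetch-and-parse sketch (the URL and the h1 lookup are placeholders; point it at a page you are allowed to scrape):

import requests
from bs4 import BeautifulSoup

# Fetch a page, then hand the HTML to Beautiful Soup
response = requests.get("https://example.com", timeout=10)
response.raise_for_status()  # fail loudly on HTTP errors

soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.string)     # the page's <title> text
print(soup.find("h1").text)  # first <h1> on the page (placeholder lookup)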

Prerequisites

  • Python 3.8+ installed
  • Basic Python knowledge (lists, dictionaries, loops)
  • Understanding of HTML structure (tags, attributes, nesting)
  • A terminal or command prompt

Installation

pip install beautifulsoup4 requests lxml
  • beautifulsoup4 — the parsing library itself
  • requests — for fetching web pages
  • lxml — a fast HTML/XML parser (optional but recommended)

Verify installation:

from bs4 import BeautifulSoup
print(BeautifulSoup("<p>Hello</p>", "html.parser").p.text)
# Output: Hello

Basic Usage

Parsing HTML

from bs4 import BeautifulSoup

html = """
<html>
<head><title>My Store</title></head>
<body>
    <div class="products">
        <div class="product" data-id="1">
            <h2>Laptop</h2>
            <span class="price">$999.99</span>
            <p class="description">High-performance laptop</p>
        </div>
        <div class="product" data-id="2">
            <h2>Tablet</h2>
            <span class="price">$499.99</span>
            <p class="description">10-inch display tablet</p>
        </div>
    </div>
</body>
</html>
"""

soup = BeautifulSoup(html, "lxml")  # or "html.parser"

# Get the page title
print(soup.title.string)  # "My Store"

# Find the first product
product = soup.find("div", class_="product")
print(product.h2.text)  # "Laptop"
print(product.find("span", class_="price").text)  # "$999.99"

Choosing a Parser

Beautiful Soup supports multiple parsers:

Parser         Install                 Speed     Handles Broken HTML
html.parser    Built-in                Medium    Decent
lxml           pip install lxml        Fastest   Good
html5lib       pip install html5lib    Slowest   Best

Recommendation: Use lxml for speed on well-formed HTML. Use html5lib only for severely broken HTML. html.parser works fine for simple tasks.

# Fast parsing with lxml
soup = BeautifulSoup(html, "lxml")

# Spec-compliant HTML5 parsing (parses like a browser)
soup = BeautifulSoup(html, "html5lib")

# Built-in parser (no extra install)
soup = BeautifulSoup(html, "html.parser")

Finding Elements

find() and find_all()

# Find the first matching element
first_product = soup.find("div", class_="product")

# Find all matching elements
all_products = soup.find_all("div", class_="product")

# Find by attribute
product_1 = soup.find("div", attrs={"data-id": "1"})

# Find by ID
main = soup.find("div", id="main-content")

# Find with multiple criteria
specific = soup.find("span", class_="price", string="$999.99")

# Limit results
first_three = soup.find_all("div", class_="product", limit=3)

CSS Selectors with select()

CSS selectors are often more readable and powerful:

# Select by class
products = soup.select("div.product")

# Select by ID
main = soup.select_one("#main-content")

# Nested selection
prices = soup.select("div.product span.price")

# Attribute selectors
links = soup.select("a[href^='https']")  # href starts with https
data_items = soup.select("[data-id]")      # has data-id attribute

# Nth child
second_product = soup.select_one("div.product:nth-of-type(2)")

# Direct child vs descendant
direct_children = soup.select("div.products > div")  # direct children only
all_descendants = soup.select("div.products div")     # all nested divs

# Multiple selectors
headings = soup.select("h1, h2, h3")

Navigating the Tree

product = soup.find("div", class_="product")

# Children
for child in product.children:
    if child.name:  # skip NavigableStrings (plain text nodes such as whitespace)
        print(child.name, child.text.strip())

# Parent
parent = product.parent
print(parent.name)  # "div" (the products container)

# Siblings
next_product = product.find_next_sibling("div", class_="product")
prev_product = product.find_previous_sibling("div", class_="product")

# All next siblings
for sibling in product.find_next_siblings("div"):
    print(sibling.h2.text)

Extracting Data

Getting Text

# Get text of an element
title = soup.find("h2").text           # includes child text
title = soup.find("h2").get_text()     # same as .text
title = soup.find("h2").string         # only if there's a single text node

# Get text with separator
full_text = soup.get_text(separator=" ", strip=True)

# Strip whitespace
clean = soup.find("p").get_text(strip=True)

Getting Attributes

link = soup.find("a")

# Get attribute
href = link.get("href")        # Returns None if missing
href = link["href"]            # Raises KeyError if missing
href = link.attrs.get("href") # Same as .get()

# Get all attributes
print(link.attrs)  # {'href': '/page', 'class': ['nav-link'], 'id': 'link1'}

# Check if attribute exists
if link.has_attr("data-id"):
    print(link["data-id"])

Extracting Tables

table = soup.find("table")
rows = []

for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all(["td", "th"])]
    rows.append(cells)

# First row is typically headers
headers = rows[0]
data = rows[1:]

# Convert to list of dictionaries
records = [dict(zip(headers, row)) for row in data]
print(records)
# [{'Name': 'Alice', 'Age': '30', 'City': 'NYC'}, ...]

Complete Web Scraping Example

Here’s a full scraping workflow with Requests + Beautiful Soup:

import requests
from bs4 import BeautifulSoup
import csv
import time

def scrape_books():
    base_url = "https://books.toscrape.com/catalogue/page-{}.html"
    all_books = []

    for page_num in range(1, 51):
        url = base_url.format(page_num)
        print(f"Scraping page {page_num}...")

        response = requests.get(url, headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        })

        if response.status_code != 200:
            print(f"Failed to fetch page {page_num}: {response.status_code}")
            break

        soup = BeautifulSoup(response.text, "lxml")
        books = soup.select("article.product_pod")

        for book in books:
            title = book.select_one("h3 a")["title"]
            price = book.select_one(".price_color").text.strip()
            rating_class = book.select_one("p.star-rating")["class"][1]
            availability = book.select_one(".instock") is not None
            link = book.select_one("h3 a")["href"]

            rating_map = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

            all_books.append({
                "title": title,
                "price": float(price.replace("£", "")),
                "rating": rating_map.get(rating_class, 0),
                "available": availability,
                "url": f"https://books.toscrape.com/catalogue/{link}",
            })

        time.sleep(1)  # Respectful delay

    return all_books


def save_to_csv(books, filename="books.csv"):
    if not books:
        return

    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=books[0].keys())
        writer.writeheader()
        writer.writerows(books)
    print(f"Saved {len(books)} books to {filename}")


if __name__ == "__main__":
    books = scrape_books()
    save_to_csv(books)

Advanced Techniques

Searching with Regular Expressions

import re

# Find tags with class matching a pattern
items = soup.find_all("div", class_=re.compile(r"product-\d+"))

# Find tags with text matching a pattern
prices = soup.find_all(string=re.compile(r"\$\d+\.\d{2}"))

# Find links with specific URL patterns
api_links = soup.find_all("a", href=re.compile(r"/api/v\d+/"))

Custom Filter Functions

# Find all tags with exactly 2 CSS classes
def has_two_classes(tag):
    return tag.has_attr("class") and len(tag["class"]) == 2

results = soup.find_all(has_two_classes)

# Find divs with a specific data attribute value range
def price_range(tag):
    if tag.name == "div" and tag.has_attr("data-price"):
        price = float(tag["data-price"])
        return 10 <= price <= 100
    return False

affordable = soup.find_all(price_range)

Modifying the Parse Tree

# Remove unwanted elements before extracting content
for script in soup.find_all("script"):
    script.decompose()

for style in soup.find_all("style"):
    style.decompose()

# Remove ads and navigation
for ad in soup.select(".ad-banner, .sidebar-ad, nav"):
    ad.decompose()

# Now extract clean content
content = soup.select_one("article.main-content")
clean_text = content.get_text(separator="\n", strip=True)

Handling Broken HTML

# Real-world HTML is often messy
broken_html = """
<div class="item">
    <p>Price: <b>$29.99</p></b>
    <img src="photo.jpg" alt="Product>
    <a href="/buy">Buy now
</div>
"""

# html.parser may struggle here; html5lib gives the best results
soup = BeautifulSoup(broken_html, "html5lib")
print(soup.find("b").text)  # "$29.99"

Using Proxies with Requests

For large-scale scraping, you’ll need proxy rotation to avoid IP blocks:

import requests

proxies = {
    "http": "http://user:pass@proxy-server:8080",
    "https": "http://user:pass@proxy-server:8080",
}

response = requests.get(
    "https://example.com",
    proxies=proxies,
    headers={"User-Agent": "Mozilla/5.0 ..."},
    timeout=15
)

soup = BeautifulSoup(response.text, "lxml")

For rotating proxies across many requests, check our proxy rotation guide and residential proxy setup.
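
A minimal round-robin sketch, assuming you already have a pool of proxy URLs (the addresses and credentials below are placeholders):

import itertools
import requests

# Placeholder proxy pool -- replace with your own endpoints
proxy_pool = itertools.cycle([
    "http://user:pass@proxy1.example:8080",
    "http://user:pass@proxy2.example:8080",
    "http://user:pass@proxy3.example:8080",
])

def fetch(url):
    proxy = next(proxy_pool)  # rotate to the next proxy on every request
    return requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        timeout=15,
    )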

Common Pitfalls and Troubleshooting

1. AttributeError: 'NoneType' object has no attribute 'text'

The element wasn’t found. Always check if find() returned None before accessing attributes:

element = soup.find("div", class_="price")
price = element.text if element else "N/A"

2. Getting empty results when you can see content in the browser

The content is loaded by JavaScript. Beautiful Soup only parses the initial HTML. You need a browser automation tool like Selenium or Playwright.
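
A minimal sketch with Playwright (pip install playwright, then playwright install to download a browser); the URL is a placeholder:

from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")  # placeholder URL
    html = page.content()             # HTML after JavaScript has run
    browser.close()

# Hand the rendered HTML to Beautiful Soup as usual
soup = BeautifulSoup(html, "lxml")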

3. class is a Python reserved word

Use class_ (with underscore) in find() and find_all():

soup.find("div", class_="product")  # Correct

4. Encoding issues with special characters

Specify encoding when reading:

response.encoding = response.apparent_encoding
soup = BeautifulSoup(response.text, "lxml")

5. Performance is slow on large HTML documents

Switch from html.parser to lxml. For very large files, consider using lxml directly or Parsel instead of Beautiful Soup. See our lxml vs BeautifulSoup comparison.
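
Beautiful Soup also ships SoupStrainer, which builds the tree only for tags you care about and skips everything else. A sketch, reusing the product markup from earlier:

from bs4 import BeautifulSoup, SoupStrainer

# Parse only <div class="product"> elements; the rest of the document is never built
only_products = SoupStrainer("div", class_="product")
soup = BeautifulSoup(html, "lxml", parse_only=only_products)

for product in soup.find_all("div", class_="product"):
    print(product.h2.text)

Note that parse_only works with html.parser and lxml but is ignored by html5lib, which always parses the whole document.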

FAQ

Is Beautiful Soup enough for web scraping?

Beautiful Soup handles HTML parsing, but you still need an HTTP library (like Requests) to fetch pages. For simple projects, Requests + Beautiful Soup is all you need. For complex projects with many pages, consider Scrapy, which bundles everything together.

Can Beautiful Soup parse XML?

Yes. Use the lxml-xml parser (or just xml):

soup = BeautifulSoup(xml_string, "lxml-xml")

How does Beautiful Soup compare to Scrapy?

Beautiful Soup is a parser — it only handles HTML parsing. Scrapy is a full framework with HTTP requests, concurrency, data pipelines, and more. Use Beautiful Soup for quick scripts; use Scrapy for structured projects. See our detailed comparison.

What’s the difference between .text and .string?

.text (or .get_text()) concatenates all text within an element, including children. .string returns the text only if the element has a single text node — it returns None if there are child elements.
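
A quick illustration:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Total: <b>$5</b></p>", "html.parser")

print(soup.p.text)    # "Total: $5" -- all text, children included
print(soup.p.string)  # None -- <p> contains a child element, not a single text node
print(soup.b.string)  # "$5" -- <b> holds exactly one text node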

Can Beautiful Soup handle dynamic/JavaScript pages?

No. Beautiful Soup only parses static HTML. For JavaScript-rendered content, use Selenium, Playwright, or Puppeteer to render the page first, then pass the rendered HTML to Beautiful Soup (see the Playwright sketch under Common Pitfalls above).
