Beautiful Soup Tutorial: Python HTML Parsing Guide
Beautiful Soup is the most popular HTML parsing library in Python. It turns messy, real-world HTML into a navigable tree that you can search with CSS selectors, element names, or attributes. Combined with the Requests library for fetching pages, it forms the simplest web scraping stack available.
This tutorial covers everything from basic tag extraction to advanced techniques for parsing complex, broken HTML. Every example uses real patterns you’ll encounter in production scraping.
Prerequisites
- Python 3.8+ installed
- Basic Python knowledge (lists, dictionaries, loops)
- Understanding of HTML structure (tags, attributes, nesting)
- A terminal or command prompt
Installation
pip install beautifulsoup4 requests lxml
- beautifulsoup4 — the parsing library itself
- requests — for fetching web pages
- lxml — a fast HTML/XML parser (optional but recommended)
Verify installation:
from bs4 import BeautifulSoup
print(BeautifulSoup("<p>Hello</p>", "html.parser").p.text)
# Output: Hello
Basic Usage
Parsing HTML
from bs4 import BeautifulSoup
html = """
<html>
<head><title>My Store</title></head>
<body>
<div class="products">
<div class="product" data-id="1">
<h2>Laptop</h2>
<span class="price">$999.99</span>
<p class="description">High-performance laptop</p>
</div>
<div class="product" data-id="2">
<h2>Tablet</h2>
<span class="price">$499.99</span>
<p class="description">10-inch display tablet</p>
</div>
</div>
</body>
</html>
"""
soup = BeautifulSoup(html, "lxml") # or "html.parser"
# Get the page title
print(soup.title.string) # "My Store"
# Find the first product
product = soup.find("div", class_="product")
print(product.h2.text) # "Laptop"
print(product.find("span", class_="price").text) # "$999.99"
Choosing a Parser
Beautiful Soup supports multiple parsers:
| Parser | Install | Speed | Handles Broken HTML |
|---|---|---|---|
| html.parser | Built-in | Medium | Decent |
| lxml | pip install lxml | Fastest | Good |
| html5lib | pip install html5lib | Slowest | Best |
Recommendation: Use lxml for speed on well-formed HTML. Use html5lib only for severely broken HTML. html.parser works fine for simple tasks.
# Fast parsing with lxml
soup = BeautifulSoup(html, "lxml")
# Strict HTML5 parsing
soup = BeautifulSoup(html, "html5lib")
# Built-in parser (no extra install)
soup = BeautifulSoup(html, "html.parser")
Finding Elements
find() and find_all()
# Find the first matching element
first_product = soup.find("div", class_="product")
# Find all matching elements
all_products = soup.find_all("div", class_="product")
# Find by attribute
product_1 = soup.find("div", attrs={"data-id": "1"})
# Find by ID
main = soup.find("div", id="main-content")
# Find with multiple criteria
specific = soup.find("span", class_="price", string="$999.99")
# Limit results
first_three = soup.find_all("div", class_="product", limit=3)
CSS Selectors with select()
CSS selectors are often more readable and powerful:
# Select by class
products = soup.select("div.product")
# Select by ID
main = soup.select_one("#main-content")
# Nested selection
prices = soup.select("div.product span.price")
# Attribute selectors
links = soup.select("a[href^='https']") # href starts with https
data_items = soup.select("[data-id]") # has data-id attribute
# Nth child
second_product = soup.select_one("div.product:nth-of-type(2)")
# Direct child vs descendant
direct_children = soup.select("div.products > div") # direct children only
all_descendants = soup.select("div.products div") # all nested divs
# Multiple selectors
headings = soup.select("h1, h2, h3")
Navigating the Tree
product = soup.find("div", class_="product")
# Children
for child in product.children:
    if child.name:  # skip NavigableString nodes (whitespace)
        print(child.name, child.text.strip())
# Parent
parent = product.parent
print(parent.name) # "div" (the products container)
# Siblings
next_product = product.find_next_sibling("div", class_="product")
prev_product = product.find_previous_sibling("div", class_="product")
# All next siblings
for sibling in product.find_next_siblings("div"):
    print(sibling.h2.text)
Extracting Data
Getting Text
# Get text of an element
title = soup.find("h2").text # includes child text
title = soup.find("h2").get_text() # same as .text
title = soup.find("h2").string # only if there's a single text node
# Get text with separator
full_text = soup.get_text(separator=" ", strip=True)
# Strip whitespace
clean = soup.find("p").get_text(strip=True)
Getting Attributes
link = soup.find("a")
# Get attribute
href = link.get("href") # Returns None if missing
href = link["href"] # Raises KeyError if missing
href = link.attrs.get("href") # Same as .get()
# Get all attributes
print(link.attrs) # {'href': '/page', 'class': ['nav-link'], 'id': 'link1'}
# Check if attribute exists
if link.has_attr("data-id"):
    print(link["data-id"])
Extracting Tables
table = soup.find("table")
rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all(["td", "th"])]
    rows.append(cells)
# First row is typically headers
headers = rows[0]
data = rows[1:]
# Convert to list of dictionaries
records = [dict(zip(headers, row)) for row in data]
print(records)
# [{'Name': 'Alice', 'Age': '30', 'City': 'NYC'}, ...]
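If you already use pandas, read_html() offers a shortcut: it parses every table in a document into a DataFrame in one call (using lxml or html5lib under the hood). A minimal sketch, assuming the table HTML is in a string named html:
from io import StringIO
import pandas as pd
# One DataFrame per <table> found in the document
tables = pd.read_html(StringIO(html))
df = tables[0]
print(df.head())
Complete Web Scraping Example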
Here’s a full scraping workflow with Requests + Beautiful Soup:
import requests
from bs4 import BeautifulSoup
import csv
import time
def scrape_books():
    base_url = "https://books.toscrape.com/catalogue/page-{}.html"
    all_books = []
    for page_num in range(1, 51):
        url = base_url.format(page_num)
        print(f"Scraping page {page_num}...")
        response = requests.get(url, headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        })
        if response.status_code != 200:
            print(f"Failed to fetch page {page_num}: {response.status_code}")
            break
        response.encoding = response.apparent_encoding  # avoid mojibake in "£" prices (see pitfall 4)
        soup = BeautifulSoup(response.text, "lxml")
        books = soup.select("article.product_pod")
        rating_map = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
        for book in books:
            title = book.select_one("h3 a")["title"]
            price = book.select_one(".price_color").text.strip()
            rating_class = book.select_one("p.star-rating")["class"][1]
            availability = book.select_one(".instock") is not None
            link = book.select_one("h3 a")["href"]
            all_books.append({
                "title": title,
                "price": float(price.replace("£", "")),
                "rating": rating_map.get(rating_class, 0),
                "available": availability,
                "url": f"https://books.toscrape.com/catalogue/{link}",
            })
        time.sleep(1)  # Respectful delay between pages
    return all_books

def save_to_csv(books, filename="books.csv"):
    if not books:
        return
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=books[0].keys())
        writer.writeheader()
        writer.writerows(books)
    print(f"Saved {len(books)} books to {filename}")

if __name__ == "__main__":
    books = scrape_books()
    save_to_csv(books)
Advanced Techniques
Searching with Regular Expressions
import re
# Find tags with class matching a pattern
items = soup.find_all("div", class_=re.compile(r"product-\d+"))
# Find tags with text matching a pattern
prices = soup.find_all(string=re.compile(r"\$\d+\.\d{2}"))
# Find links with specific URL patterns
api_links = soup.find_all("a", href=re.compile(r"/api/v\d+/"))
Custom Filter Functions
# Find all tags with exactly 2 CSS classes
def has_two_classes(tag):
    return tag.has_attr("class") and len(tag["class"]) == 2
results = soup.find_all(has_two_classes)
# Find divs with a specific data attribute value range
def price_range(tag):
    if tag.name == "div" and tag.has_attr("data-price"):
        price = float(tag["data-price"])
        return 10 <= price <= 100
    return False
affordable = soup.find_all(price_range)
Modifying the Parse Tree
# Remove unwanted elements before extracting text
for script in soup.find_all("script"):
    script.decompose()
for style in soup.find_all("style"):
    style.decompose()
# Remove ads and navigation
for ad in soup.select(".ad-banner, .sidebar-ad, nav"):
    ad.decompose()
# Now extract clean content
content = soup.select_one("article.main-content")
clean_text = content.get_text(separator="\n", strip=True)
Handling Broken HTML
# Real-world HTML is often messy
broken_html = """
<div class="item">
<p>Price: <b>$29.99</p></b>
<img src="photo.jpg" alt="Product>
<a href="/buy">Buy now
</div>
"""
# html.parser may struggle with this; html5lib gives the best results
soup = BeautifulSoup(broken_html, "html5lib")
print(soup.find("b").text)  # "$29.99"
Using Proxies with Requests
For large-scale scraping, you’ll need proxy rotation to avoid IP blocks:
import requests
proxies = {
"http": "http://user:pass@proxy-server:8080",
"https": "http://user:pass@proxy-server:8080",
}
response = requests.get(
"https://example.com",
proxies=proxies,
headers={"User-Agent": "Mozilla/5.0 ..."},
timeout=15
)
soup = BeautifulSoup(response.text, "lxml")
For rotating proxies across many requests, check our proxy rotation guide and residential proxy setup.
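A minimal rotation sketch: cycle through a pool of proxies, one per request. The proxy URLs below are placeholders; substitute your provider's endpoints.
import itertools
import requests
from bs4 import BeautifulSoup
# Placeholder endpoints -- replace with real proxy URLs
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]
proxy_cycle = itertools.cycle(PROXY_POOL)
def fetch(url):
    proxy = next(proxy_cycle)  # round-robin through the pool
    response = requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        headers={"User-Agent": "Mozilla/5.0 ..."},
        timeout=15,
    )
    response.raise_for_status()
    return BeautifulSoup(response.text, "lxml")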
Common Pitfalls and Troubleshooting
1. AttributeError: 'NoneType' object has no attribute 'text'
The element wasn’t found. Always check if find() returned None before accessing attributes:
element = soup.find("div", class_="price")
price = element.text if element else "N/A"
2. Getting empty results when you can see content in the browser
The content is loaded by JavaScript. Beautiful Soup only parses the initial HTML. You need a browser automation tool like Selenium or Playwright.
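For example, a minimal sketch with Playwright's sync API (assumes pip install playwright and playwright install chromium):
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")  # placeholder URL
    html = page.content()  # the DOM after JavaScript has run
    browser.close()
soup = BeautifulSoup(html, "lxml")  # parse the rendered HTML as usual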
3. class is a Python reserved word
Use class_ (with underscore) in find() and find_all():
soup.find("div", class_="product") # Correct4. Encoding issues with special characters
Specify encoding when reading:
response.encoding = response.apparent_encoding
soup = BeautifulSoup(response.text, "lxml")
5. Performance is slow on large HTML documents
Switch from html.parser to lxml. For very large files, consider using lxml directly or Parsel instead of Beautiful Soup. See our lxml vs BeautifulSoup comparison.
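For scale, a quick sketch of querying a document with lxml directly, skipping the Beautiful Soup layer (element names here match the earlier store example):
from lxml import html as lxml_html
tree = lxml_html.fromstring(html)  # html: your raw HTML string
# XPath takes the place of find()/select()
titles = tree.xpath('//div[@class="product"]/h2/text()')
prices = tree.xpath('//span[@class="price"]/text()')
print(titles, prices)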
FAQ
Is Beautiful Soup enough for web scraping?
Beautiful Soup handles HTML parsing, but you still need an HTTP library (like Requests) to fetch pages. For simple projects, Requests + Beautiful Soup is all you need. For complex projects with many pages, consider Scrapy which bundles everything together.
Can Beautiful Soup parse XML?
Yes. Use the lxml-xml parser (or just xml):
soup = BeautifulSoup(xml_string, "lxml-xml")
How does Beautiful Soup compare to Scrapy?
Beautiful Soup is a parser — it only handles HTML parsing. Scrapy is a full framework with HTTP requests, concurrency, data pipelines, and more. Use Beautiful Soup for quick scripts; use Scrapy for structured projects. See our detailed comparison.
What’s the difference between .text and .string?
.text (or .get_text()) concatenates all text within an element, including children. .string returns the text only if the element has a single text node — it returns None if there are child elements.
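A quick illustration:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p>Total: <b>$5</b></p>", "html.parser")
print(soup.p.text)      # "Total: $5" -- all text, including the <b> child
print(soup.p.string)    # None -- <p> has more than one child node
print(soup.p.b.string)  # "$5" -- <b> wraps a single text node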
Can Beautiful Soup handle dynamic/JavaScript pages?
No. Beautiful Soup only parses static HTML. For JavaScript-rendered content, use Selenium, Playwright, or Puppeteer to render the page first, then pass the rendered HTML to Beautiful Soup.
Next Steps
- Scrapy vs BeautifulSoup: When to Use Each
- lxml vs BeautifulSoup: Speed Comparison
- HTTPX + Parsel: Modern Python Scraping Stack
- aiohttp + BeautifulSoup: Async Scraping
- Best Python Web Scraping Libraries
Related Reading
- aiohttp + BeautifulSoup: Async Python Scraping
- Axios + Cheerio: Lightweight Node.js Scraping
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)
- API vs Web Scraping: When You Need Proxies (and When You Don’t)
- ASEAN Data Protection Laws: A Web Scraping Compliance Matrix
- How to Build an Ethical Web Scraping Policy for Your Company