lxml vs BeautifulSoup: Speed Comparison
lxml and BeautifulSoup are Python’s two most popular HTML parsing libraries, but they serve different audiences. BeautifulSoup prioritizes ease of use and tolerance for broken HTML. lxml prioritizes raw speed and powerful XPath queries. Choosing between them affects your scraping project’s performance, code style, and reliability.
This comparison includes real benchmarks, API comparisons, and clear guidance on when each library is the right choice.
Table of Contents
- Quick Comparison
- Speed Benchmarks
- API Comparison
- Selector Comparison
- Handling Broken HTML
- Memory Usage
- When to Use Each
- Using Both Together
- FAQ
Quick Comparison
| Feature | lxml | BeautifulSoup |
|---|---|---|
| Speed | Very fast (C implementation) | Slower (Python layer) |
| CSS selectors | Via cssselect | Built-in |
| XPath | Full support | Not supported |
| Broken HTML | Less tolerant | Very tolerant |
| API style | Element tree / XPath | Pythonic / tag-based |
| Learning curve | Medium | Easy |
| Memory usage | Lower | Higher |
| Dependencies | C library (libxml2) | Pure Python option |
| Best for | Performance, XPath | Beginners, messy HTML |
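Both libraries install from PyPI. A typical setup, assuming you want the fast lxml backend plus CSS selector support for lxml (provided by the separate cssselect package), is:
pip install beautifulsoup4 lxml cssselect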
Speed Benchmarks
Parsing the same 500KB HTML document 100 times:
import time
from bs4 import BeautifulSoup
from lxml import html

# Load HTML
with open("large_page.html", "r") as f:
    html_content = f.read()

# Benchmark BeautifulSoup
start = time.time()
for _ in range(100):
    soup = BeautifulSoup(html_content, "lxml")  # Using lxml parser backend
    titles = [a.text for a in soup.select("h3 a")]
bs4_time = time.time() - start

# Benchmark lxml
start = time.time()
for _ in range(100):
    tree = html.fromstring(html_content)
    titles = tree.xpath("//h3/a/text()")
lxml_time = time.time() - start

print(f"BeautifulSoup: {bs4_time:.2f}s")
print(f"lxml: {lxml_time:.2f}s")
print(f"lxml is {bs4_time / lxml_time:.1f}x faster")
Typical Results
| Operation | BeautifulSoup (lxml parser) | lxml directly | Speed ratio |
|---|---|---|---|
| Parse 500KB HTML | 45ms | 8ms | 5.6x |
| Extract 100 elements (CSS) | 12ms | 2ms | 6x |
| Extract 100 elements (XPath) | N/A | 1.5ms | — |
| Parse 5MB HTML | 850ms | 120ms | 7x |
| 10,000 small pages | 28s | 5s | 5.6x |
lxml is 5-10x faster than BeautifulSoup for parsing and extraction. The gap widens with larger documents.
Real-World Impact
For a scraper processing 10,000 pages:
- BeautifulSoup: ~28s parsing time + network time
- lxml: ~5s parsing time + network time
- Savings: 23 seconds (parsing alone)
For most scrapers, network latency dominates, so the parsing speed difference only matters for large-scale or parse-heavy workloads.
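To check where your own scraper's time goes, time the fetch and the parse separately. A minimal sketch using the same demo site as the extraction example later in this article (requests must be installed):
import time
import requests
from bs4 import BeautifulSoup

start = time.perf_counter()
response = requests.get("https://books.toscrape.com/")  # network-bound step
fetch_time = time.perf_counter() - start

start = time.perf_counter()
soup = BeautifulSoup(response.text, "lxml")  # CPU-bound step
parse_time = time.perf_counter() - start

print(f"fetch: {fetch_time:.3f}s  parse: {parse_time:.3f}s")
# If fetch dwarfs parse, switching parsers will not move the bottleneck.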
API Comparison
Parsing HTML
# BeautifulSoup
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_string, "lxml")
# lxml
from lxml import html
tree = html.fromstring(html_string)
Finding Elements
# BeautifulSoup — multiple methods
soup.find("div", class_="product") # First match
soup.find_all("div", class_="product") # All matches
soup.select("div.product") # CSS selector
soup.select_one("div.product") # First match via CSS
# lxml — XPath or CSS
tree.xpath("//div[@class='product']") # XPath
tree.cssselect("div.product") # CSS (via cssselect)Extracting Text
# BeautifulSoup
element = soup.find("h2")
text = element.text # All text including children
text = element.string  # The single string child, or None if nested
text = element.get_text(strip=True) # Stripped text
# lxml
element = tree.xpath("//h2")[0]
text = element.text  # Text before the first child element only
text = element.text_content() # All text including children
text = element.xpath("string()")  # XPath string extraction
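The naming is easy to confuse across the two libraries: BeautifulSoup's .text and lxml's .text_content() both gather all descendant text, while lxml's .text and BeautifulSoup's .string are far narrower. A minimal sketch on a nested snippet makes the difference concrete:
from bs4 import BeautifulSoup
from lxml import html

snippet = "<h2>Sale: <span>50% off</span></h2>"

el = BeautifulSoup(snippet, "lxml").h2
print(el.text)    # "Sale: 50% off" (all descendant text)
print(el.string)  # None (h2 has more than one child node)

node = html.fromstring(snippet)
print(node.text)            # "Sale: " (text before the first child element)
print(node.text_content())  # "Sale: 50% off"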
Extracting Attributes
# BeautifulSoup
href = soup.find("a")["href"]
href = soup.find("a").get("href", "")
all_hrefs = [a["href"] for a in soup.find_all("a", href=True)]
# lxml
href = tree.xpath("//a/@href") # Returns list of all hrefs
single = tree.xpath("//a/@href")[0] # First href
element = tree.xpath("//a")[0]
href = element.get("href")  # Attribute on element
Full Extraction Example
import requests
from bs4 import BeautifulSoup
from lxml import html as lxml_html
response = requests.get("https://books.toscrape.com/")
# BeautifulSoup approach
soup = BeautifulSoup(response.text, "lxml")
bs4_books = []
for book in soup.select("article.product_pod"):
    bs4_books.append({
        "title": book.select_one("h3 a")["title"],
        "price": book.select_one(".price_color").text,
    })
# lxml approach
tree = lxml_html.fromstring(response.content)
lxml_books = []
for book in tree.xpath("//article[contains(@class, 'product_pod')]"):
    lxml_books.append({
        "title": book.xpath(".//h3/a/@title")[0],
        "price": book.xpath(".//*[contains(@class, 'price_color')]/text()")[0],
    })

# Both produce identical results
assert len(bs4_books) == len(lxml_books)
Selector Comparison
CSS Selectors
# BeautifulSoup — native CSS selector support
soup.select("div.product h3 a")
soup.select("div.product > .price")
soup.select("[data-id='123']")
soup.select("tr:nth-child(odd)")
# lxml — CSS via cssselect
from lxml.cssselect import CSSSelector
sel = CSSSelector("div.product h3 a")
results = sel(tree)
# Or inline
tree.cssselect("div.product h3 a")XPath (lxml only)
XPath is lxml’s superpower — it handles complex queries that CSS cannot express:
# Conditional selection
tree.xpath("//div[@class='product'][.//span[number(substring(text(),2)) < 50]]")
# Sibling navigation
tree.xpath("//h2[text()='Laptop']/following-sibling::span/text()")
# Text contains
tree.xpath("//div[contains(text(), 'In Stock')]")
# Position-based
tree.xpath("//table/tr[position() > 1]") # Skip header row
# Multiple conditions
tree.xpath("//a[@href and @class='active']")
# Aggregate functions
tree.xpath("count(//div[@class='product'])")
# String manipulation
tree.xpath("normalize-space(//p[@class='description'])")BeautifulSoup has no XPath support. If your parsing requires XPath, you must use lxml.
Handling Broken HTML
broken_html = "<div><p>Unclosed paragraph<p>Another<div>Nested wrong</p></div>"

# BeautifulSoup: very forgiving
from bs4 import BeautifulSoup
soup = BeautifulSoup(broken_html, "html.parser")
print(soup.prettify())
# Reconstructs a reasonable tree structure

# lxml: less forgiving but still handles most cases
from lxml import html as lxml_html
tree = lxml_html.fromstring(broken_html)
print(lxml_html.tostring(tree, pretty_print=True).decode())
# May produce a different structure than BeautifulSoup
BeautifulSoup with the html.parser backend is the most tolerant of broken HTML. It is also the only option that does not require C dependencies.
Tolerance ranking (a comparison sketch follows the list):
- BeautifulSoup + html.parser (most tolerant)
- BeautifulSoup + html5lib (standards-compliant)
- BeautifulSoup + lxml (fast but less tolerant)
- lxml directly (least tolerant of severely broken HTML)
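The sketch below feeds the same broken snippet to each BeautifulSoup backend so you can compare the repaired trees directly (html5lib must be installed separately):
from bs4 import BeautifulSoup

broken_html = "<div><p>Unclosed paragraph<p>Another<div>Nested wrong</p></div>"

for parser in ("html.parser", "html5lib", "lxml"):
    soup = BeautifulSoup(broken_html, parser)
    print(f"--- {parser} ---")
    print(soup.prettify())
# Each backend repairs the bad nesting differently; diff the output to see how.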
Memory Usage
Parsing a 10MB HTML file:
| Parser | Peak Memory | Parse Time |
|---|---|---|
| BeautifulSoup + html.parser | ~250MB | 4.2s |
| BeautifulSoup + lxml | ~180MB | 1.8s |
| lxml directly | ~90MB | 0.8s |
lxml uses roughly half the memory of BeautifulSoup because it doesn’t build the additional Python object layer that BeautifulSoup adds on top of the parse tree.
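If you want to sanity-check these numbers on your own files, peak memory is easy to read on Unix-like systems via the resource module (ru_maxrss is kilobytes on Linux, bytes on macOS); a rough sketch:
import resource
from lxml import html

with open("large_page.html", "r") as f:  # same file as the benchmark
    tree = html.fromstring(f.read())

peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak_kb / 1024:.0f} MB")  # assuming Linux (KB units)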
When to Use Each
Use BeautifulSoup When:
- You are a beginner learning web scraping
- The HTML is severely broken or malformed
- You want the simplest possible API
- You need to work without C dependencies (html.parser backend)
- Parsing speed is not a bottleneck
- You are working in a Jupyter notebook for quick exploration
# BeautifulSoup shines here: quick, readable, forgiving
from bs4 import BeautifulSoup
soup = BeautifulSoup(messy_html, "html.parser")
prices = [el.text for el in soup.select(".price")]
Full tutorial: Beautiful Soup tutorial.
Use lxml When:
- Parsing speed is critical (large documents, many pages)
- You need XPath queries
- You are processing XML alongside HTML
- Memory usage matters (large files)
- You are comfortable with XPath syntax
# lxml shines here: fast, powerful XPath
from lxml import html
tree = html.fromstring(large_html)
# Complex query that CSS cannot express
products = tree.xpath(
    "//div[@class='product'][.//span[@class='price']"
    "[number(substring(text(),2)) < 50]]//h3/text()"
)
Consider Parsel When:
You want lxml speed with a cleaner API — Parsel wraps lxml with both CSS and XPath support:
from parsel import Selector
sel = Selector(text=html_string)
titles = sel.css("h3 a::attr(title)").getall() # CSS
titles = sel.xpath("//h3/a/@title").getall() # XPath
isbns = sel.css(".details").re(r"ISBN:\s*([\d-]+)")  # Regex
See our HTTPX + Parsel guide for more.
Using Both Together
The hybrid approach: use lxml as BeautifulSoup’s parser backend for speed, with BeautifulSoup’s API for convenience:
from bs4 import BeautifulSoup
# Use lxml as the parser backend — faster than html.parser
soup = BeautifulSoup(html_string, "lxml")
# BeautifulSoup API, lxml speed
products = soup.select("div.product")
This gives you ~70% of lxml’s speed with 100% of BeautifulSoup’s API convenience.
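If a BeautifulSoup-based scraper later needs a single XPath query, you can bridge without rewriting it: serialize the soup and re-parse with lxml. A sketch, again with html_string as a placeholder:
from bs4 import BeautifulSoup
from lxml import html

soup = BeautifulSoup(html_string, "lxml")
# ... existing BeautifulSoup extraction ...

tree = html.fromstring(str(soup))  # re-parse the serialized, cleaned-up markup
titles = tree.xpath("//h3/a/@title")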
FAQ
Is lxml always faster than BeautifulSoup?
For parsing and extraction, yes — typically 5-10x faster. However, when you include network latency (which dominates most scraping workflows), the parsing speed difference may not matter. It only becomes significant when parsing large documents or processing thousands of pages.
Can I use XPath with BeautifulSoup?
No. BeautifulSoup does not support XPath. If you need XPath, use lxml directly or use Parsel (which supports both CSS and XPath). As a workaround, you can serialize a BeautifulSoup tree with str(soup) and re-parse it with lxml when a single XPath query is needed.
Should beginners start with lxml or BeautifulSoup?
Start with BeautifulSoup. Its API is more intuitive, its documentation is better for beginners, and its tolerance for broken HTML means fewer frustrating errors. Move to lxml when parsing speed becomes a bottleneck.
Does using lxml as BeautifulSoup’s parser give me lxml speed?
Partially. Using BeautifulSoup(html, "lxml") uses lxml for parsing (fast) but wraps results in BeautifulSoup objects (slower). You get faster parsing but slower element access compared to using lxml directly. It is a good middle ground.
What about html5lib?
html5lib is the most standards-compliant parser but also the slowest (10-100x slower than lxml). Use it only when you need the exact same parsing behavior as a web browser, which is rare in scraping.
Learn each library: BeautifulSoup tutorial, Python scraping libraries. For proxy setup, see our web scraping proxy guide.
External Resources:
- BeautifulSoup Documentation
- lxml Documentation
- Parsel Documentation
Related Reading
- aiohttp + BeautifulSoup: Async Python Scraping
- Axios + Cheerio: Lightweight Node.js Scraping
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)
- API vs Web Scraping: When You Need Proxies (and When You Don’t)
- ASEAN Data Protection Laws: A Web Scraping Compliance Matrix
- How to Build an Ethical Web Scraping Policy for Your Company