lxml vs BeautifulSoup: Speed Comparison

lxml and BeautifulSoup are Python’s two most popular HTML parsing libraries, but they serve different audiences. BeautifulSoup prioritizes ease of use and tolerance for broken HTML. lxml prioritizes raw speed and powerful XPath queries. Choosing between them affects your scraping project’s performance, code style, and reliability.

This comparison includes real benchmarks, API comparisons, and clear guidance on when each library is the right choice.

Quick Comparison

| Feature | lxml | BeautifulSoup |
|---|---|---|
| Speed | Very fast (C implementation) | Slower (Python layer) |
| CSS selectors | Via cssselect | Built-in |
| XPath | Full support | Not supported |
| Broken HTML | Less tolerant | Very tolerant |
| API style | Element tree / XPath | Pythonic / tag-based |
| Learning curve | Medium | Easy |
| Memory usage | Lower | Higher |
| Dependencies | C library (libxml2) | Pure Python option |
| Best for | Performance, XPath | Beginners, messy HTML |

Speed Benchmarks

Parsing the same 500KB HTML document 100 times:

import time
from bs4 import BeautifulSoup
from lxml import html

# Load HTML
with open("large_page.html", "r") as f:
    html_content = f.read()

# Benchmark BeautifulSoup (perf_counter is monotonic, better for timing than time.time)
start = time.perf_counter()
for _ in range(100):
    soup = BeautifulSoup(html_content, "lxml")  # Using lxml parser backend
    titles = [a.text for a in soup.select("h3 a")]
bs4_time = time.perf_counter() - start

# Benchmark lxml
start = time.perf_counter()
for _ in range(100):
    tree = html.fromstring(html_content)
    titles = tree.xpath("//h3/a/text()")
lxml_time = time.perf_counter() - start

print(f"BeautifulSoup: {bs4_time:.2f}s")
print(f"lxml:          {lxml_time:.2f}s")
print(f"lxml is {bs4_time / lxml_time:.1f}x faster")

Typical Results

| Operation | BeautifulSoup (lxml parser) | lxml directly | Speed ratio |
|---|---|---|---|
| Parse 500KB HTML | 45ms | 8ms | 5.6x |
| Extract 100 elements (CSS) | 12ms | 2ms | 6x |
| Extract 100 elements (XPath) | N/A | 1.5ms | N/A |
| Parse 5MB HTML | 850ms | 120ms | 7x |
| 10,000 small pages | 28s | 5s | 5.6x |

lxml is 5-10x faster than BeautifulSoup for parsing and extraction. The gap widens with larger documents.
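A self-contained variant of the benchmark above, using a synthetic document generated in memory so it runs without a saved `large_page.html`. Absolute times are machine-dependent; only the ratio is meaningful:

```python
import time
from bs4 import BeautifulSoup
from lxml import html

# Synthetic page: 1,000 headline links, roughly the shape the benchmark assumes
html_content = "<html><body>" + "".join(
    f"<h3><a href='/item/{i}'>Title {i}</a></h3>" for i in range(1000)
) + "</body></html>"

start = time.perf_counter()
for _ in range(20):
    soup = BeautifulSoup(html_content, "lxml")
    bs4_titles = [a.text for a in soup.select("h3 a")]
bs4_time = time.perf_counter() - start

start = time.perf_counter()
for _ in range(20):
    tree = html.fromstring(html_content)
    lxml_titles = tree.xpath("//h3/a/text()")
lxml_time = time.perf_counter() - start

assert bs4_titles == lxml_titles  # both extract the same 1,000 titles
print(f"lxml is {bs4_time / lxml_time:.1f}x faster on this machine")
```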

Real-World Impact

For a scraper processing 10,000 pages:

  • BeautifulSoup: ~28s parsing time + network time
  • lxml: ~5s parsing time + network time
  • Savings: 23 seconds (parsing alone)

For most scrapers, network latency dominates, so the parsing speed difference only matters for large-scale or parse-heavy workloads.
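To put that in perspective, here is a rough sequential-fetch budget. The 200 ms round-trip figure is a hypothetical assumption, not a benchmark:

```python
pages = 10_000
network_ms = 200.0       # hypothetical average round-trip per page
bs4_parse_ms = 2.8       # ~28s / 10,000 pages, from the benchmark results
lxml_parse_ms = 0.5      # ~5s / 10,000 pages

bs4_total_s = pages * (network_ms + bs4_parse_ms) / 1000
lxml_total_s = pages * (network_ms + lxml_parse_ms) / 1000
print(f"BeautifulSoup: {bs4_total_s:.0f}s total, lxml: {lxml_total_s:.0f}s total")
# Parsing saves ~23s, but both runs are dominated by ~2,000s of network time
```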

API Comparison

Parsing HTML

# BeautifulSoup
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_string, "lxml")

# lxml
from lxml import html
tree = html.fromstring(html_string)

Finding Elements

# BeautifulSoup — multiple methods
soup.find("div", class_="product")          # First match
soup.find_all("div", class_="product")      # All matches
soup.select("div.product")                   # CSS selector
soup.select_one("div.product")               # First match via CSS

# lxml — XPath or CSS
tree.xpath("//div[@class='product']")        # XPath
tree.cssselect("div.product")               # CSS (via cssselect)

Extracting Text

# BeautifulSoup
element = soup.find("h2")
text = element.text                          # All text including children
text = element.string                        # Only if the element has a single string child; otherwise None
text = element.get_text(strip=True)          # All text, whitespace-stripped

# lxml
element = tree.xpath("//h2")[0]
text = element.text                          # Direct text only
text = element.text_content()                # All text including children
text = element.xpath("string()")             # XPath string extraction
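The difference between these accessors is easiest to see on a nested element. A small sketch using a made-up snippet:

```python
from bs4 import BeautifulSoup
from lxml import html

snippet = "<h2>Price: <span>$10</span> only</h2>"

# BeautifulSoup
soup = BeautifulSoup(snippet, "html.parser")
h2 = soup.find("h2")
print(h2.text)     # "Price: $10 only"  (all descendant text)
print(h2.string)   # None  (more than one child node, so .string gives up)

# lxml
el = html.fromstring(snippet)
print(el.text)            # "Price: "  (only the text before the first child)
print(el.text_content())  # "Price: $10 only"
```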

Extracting Attributes

# BeautifulSoup
href = soup.find("a")["href"]
href = soup.find("a").get("href", "")
all_hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# lxml
href = tree.xpath("//a/@href")              # Returns list of all hrefs
single = tree.xpath("//a/@href")[0]          # First href
element = tree.xpath("//a")[0]
href = element.get("href")                   # Attribute on element

Full Extraction Example

import requests
from bs4 import BeautifulSoup
from lxml import html as lxml_html

response = requests.get("https://books.toscrape.com/")

# BeautifulSoup approach
soup = BeautifulSoup(response.text, "lxml")
bs4_books = []
for book in soup.select("article.product_pod"):
    bs4_books.append({
        "title": book.select_one("h3 a")["title"],
        "price": book.select_one(".price_color").text,
    })

# lxml approach
tree = lxml_html.fromstring(response.content)
lxml_books = []
for book in tree.xpath("//article[contains(@class, 'product_pod')]"):
    lxml_books.append({
        "title": book.xpath(".//h3/a/@title")[0],
        "price": book.xpath(".//*[contains(@class, 'price_color')]/text()")[0],
    })

# Both produce identical results
assert len(bs4_books) == len(lxml_books)

Selector Comparison

CSS Selectors

# BeautifulSoup — native CSS selector support
soup.select("div.product h3 a")
soup.select("div.product > .price")
soup.select("[data-id='123']")
soup.select("tr:nth-child(odd)")

# lxml — CSS via cssselect
from lxml.cssselect import CSSSelector
sel = CSSSelector("div.product h3 a")
results = sel(tree)
# Or inline
tree.cssselect("div.product h3 a")

XPath (lxml only)

XPath is lxml’s superpower — it handles complex queries that CSS cannot express:

# Conditional selection
tree.xpath("//div[@class='product'][.//span[number(substring(text(),2)) < 50]]")

# Sibling navigation
tree.xpath("//h2[text()='Laptop']/following-sibling::span/text()")

# Text contains
tree.xpath("//div[contains(text(), 'In Stock')]")

# Position-based
tree.xpath("//table/tr[position() > 1]")  # Skip header row

# Multiple conditions
tree.xpath("//a[@href and @class='active']")

# Aggregate functions
tree.xpath("count(//div[@class='product'])")

# String manipulation
tree.xpath("normalize-space(//p[@class='description'])")

BeautifulSoup has no XPath support. If your parsing requires XPath, you must use lxml.
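A few of the queries above, run against a small inline sample (the markup is hypothetical, chosen to exercise position-based, sibling, and aggregate queries):

```python
from lxml import html

doc = html.fromstring("""
<body>
  <table>
    <tr><th>Product</th><th>Price</th></tr>
    <tr><td>Laptop</td><td>$999</td></tr>
    <tr><td>Mouse</td><td>$25</td></tr>
  </table>
  <h2>Laptop</h2><span>$999</span>
</body>
""")

# Position-based: skip the header row
data_rows = doc.xpath("//table/tr[position() > 1]")
print(len(data_rows))  # 2

# Sibling navigation: the span following the matching h2
print(doc.xpath("//h2[text()='Laptop']/following-sibling::span/text()"))  # ['$999']

# Aggregate: count() returns a float, not a node list
print(doc.xpath("count(//td)"))  # 4.0
```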

Handling Broken HTML

broken_html = "<div><p>Unclosed paragraph<p>Another<div>Nested wrong</p></div>"

# BeautifulSoup — very forgiving
soup = BeautifulSoup(broken_html, "html.parser")
print(soup.prettify())
# Reconstructs a reasonable tree structure

# lxml — less forgiving but still handles most cases
tree = lxml_html.fromstring(broken_html)
print(lxml_html.tostring(tree, pretty_print=True).decode())
# May produce different structure than BeautifulSoup

BeautifulSoup with the html.parser backend is the most tolerant of broken HTML. It is also the only option that does not require C dependencies.

Tolerance ranking:

  1. BeautifulSoup + html.parser (most tolerant)
  2. BeautifulSoup + html5lib (standards-compliant)
  3. BeautifulSoup + lxml (fast but less tolerant)
  4. lxml directly (least tolerant of severely broken HTML)

Memory Usage

Parsing a 10MB HTML file:

| Parser | Peak Memory | Parse Time |
|---|---|---|
| BeautifulSoup + html.parser | ~250MB | 4.2s |
| BeautifulSoup + lxml | ~180MB | 1.8s |
| lxml directly | ~90MB | 0.8s |

lxml uses roughly half the memory of BeautifulSoup because it doesn’t build the additional Python object layer that BeautifulSoup adds on top of the parse tree.
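You can make that Python object layer visible with `tracemalloc`, with one caveat: it tracks only Python-level allocations, so libxml2's C-side memory is invisible to it. The sketch below (synthetic document, numbers will vary) therefore illustrates BeautifulSoup's per-node object overhead rather than total process memory:

```python
import tracemalloc
from bs4 import BeautifulSoup

# Synthetic document with 5,000 elements
html_content = "<ul>" + "<li class='item'>text</li>" * 5000 + "</ul>"

tracemalloc.start()
soup = BeautifulSoup(html_content, "html.parser")
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"Python-level peak while building the soup: {peak / 1e6:.1f} MB")
```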

When to Use Each

Use BeautifulSoup When:

  • You are a beginner learning web scraping
  • The HTML is severely broken or malformed
  • You want the simplest possible API
  • You need to work without C dependencies (html.parser backend)
  • Parsing speed is not a bottleneck
  • You are working in a Jupyter notebook for quick exploration
# BeautifulSoup shines here: quick, readable, forgiving
from bs4 import BeautifulSoup
soup = BeautifulSoup(messy_html, "html.parser")
prices = [el.text for el in soup.select(".price")]

Full tutorial: Beautiful Soup tutorial.

Use lxml When:

  • Parsing speed is critical (large documents, many pages)
  • You need XPath queries
  • You are processing XML alongside HTML
  • Memory usage matters (large files)
  • You are comfortable with XPath syntax
# lxml shines here: fast, powerful XPath
from lxml import html
tree = html.fromstring(large_html)
# Complex query that CSS cannot express
products = tree.xpath(
    "//div[@class='product'][.//span[@class='price']"
    "[number(substring(text(),2)) < 50]]//h3/text()"
)

Consider Parsel When:

You want lxml speed with a cleaner API — Parsel wraps lxml with both CSS and XPath support:

from parsel import Selector
sel = Selector(text=html_string)
titles = sel.css("h3 a::attr(title)").getall()     # CSS
titles = sel.xpath("//h3/a/@title").getall()        # XPath
isbns = sel.css(".details").re(r"ISBN:\s*([\d-]+)")  # Regex

See our HTTPX + Parsel guide for more.

Using Both Together

The hybrid approach: use lxml as BeautifulSoup’s parser backend for speed, with BeautifulSoup’s API for convenience:

from bs4 import BeautifulSoup

# Use lxml as the parser backend — faster than html.parser
soup = BeautifulSoup(html_string, "lxml")

# BeautifulSoup API, lxml speed
products = soup.select("div.product")

This gives you ~70% of lxml’s speed with 100% of BeautifulSoup’s API convenience.

FAQ

Is lxml always faster than BeautifulSoup?

For parsing and extraction, yes — typically 5-10x faster. However, when you include network latency (which dominates most scraping workflows), the parsing speed difference may not matter. It only becomes significant when parsing large documents or processing thousands of pages.

Can I use XPath with BeautifulSoup?

No. BeautifulSoup does not support XPath. If you need XPath, use lxml directly or use Parsel (which supports both CSS and XPath). Note that passing "lxml" as BeautifulSoup's parser backend only speeds up parsing; it does not expose the underlying lxml tree or add XPath support.

Should beginners start with lxml or BeautifulSoup?

Start with BeautifulSoup. Its API is more intuitive, its documentation is better for beginners, and its tolerance for broken HTML means fewer frustrating errors. Move to lxml when parsing speed becomes a bottleneck.

Does using lxml as BeautifulSoup’s parser give me lxml speed?

Partially. Using BeautifulSoup(html, "lxml") uses lxml for parsing (fast) but wraps results in BeautifulSoup objects (slower). You get faster parsing but slower element access compared to using lxml directly. It is a good middle ground.

What about html5lib?

html5lib is the most standards-compliant parser but also the slowest (10-100x slower than lxml). Use it only when you need the exact same parsing behavior as a web browser, which is rare in scraping.
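If you do need browser-identical parsing, html5lib plugs into BeautifulSoup like any other backend (assuming `pip install html5lib`):

```python
from bs4 import BeautifulSoup

# html5lib closes the implicit <p> tags exactly as a browser would
soup = BeautifulSoup("<p>one<p>two", "html5lib")
print([p.text for p in soup.find_all("p")])  # ['one', 'two']
```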


Learn each library: BeautifulSoup tutorial, Python scraping libraries. For proxy setup, see our web scraping proxy guide.
