lxml vs BeautifulSoup: Speed Comparison
lxml and BeautifulSoup are Python’s two most popular HTML parsing libraries, but they serve different audiences. BeautifulSoup prioritizes ease of use and tolerance for broken HTML. lxml prioritizes raw speed and powerful XPath queries. Choosing between them affects your scraping project’s performance, code style, and reliability.
This comparison includes real benchmarks, API comparisons, and clear guidance on when each library is the right choice.
Table of Contents
- Quick Comparison
- Speed Benchmarks
- API Comparison
- Selector Comparison
- Handling Broken HTML
- Memory Usage
- When to Use Each
- Using Both Together
- FAQ
Quick Comparison
| Feature | lxml | BeautifulSoup |
|---|---|---|
| Speed | Very fast (C implementation) | Slower (Python layer) |
| CSS selectors | Via cssselect | Built-in |
| XPath | Full support | Not supported |
| Broken HTML | Less tolerant | Very tolerant |
| API style | Element tree / XPath | Pythonic / tag-based |
| Learning curve | Medium | Easy |
| Memory usage | Lower | Higher |
| Dependencies | C library (libxml2) | Pure Python option |
| Best for | Performance, XPath | Beginners, messy HTML |
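Both libraries install from PyPI. A typical setup, assuming you want the fast lxml backend plus CSS selector support for lxml (provided by the separate cssselect package), is:
pip install beautifulsoup4 lxml cssselect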
Speed Benchmarks
Parsing the same 500KB HTML document 100 times:
import time
from bs4 import BeautifulSoup
from lxml import html

# Load HTML
with open("large_page.html", "r") as f:
    html_content = f.read()

# Benchmark BeautifulSoup
start = time.time()
for _ in range(100):
    soup = BeautifulSoup(html_content, "lxml")  # Using lxml parser backend
    titles = [a.text for a in soup.select("h3 a")]
bs4_time = time.time() - start

# Benchmark lxml
start = time.time()
for _ in range(100):
    tree = html.fromstring(html_content)
    titles = tree.xpath("//h3/a/text()")
lxml_time = time.time() - start

print(f"BeautifulSoup: {bs4_time:.2f}s")
print(f"lxml: {lxml_time:.2f}s")
print(f"lxml is {bs4_time / lxml_time:.1f}x faster")
Typical Results
| Operation | BeautifulSoup (lxml parser) | lxml directly | Speed ratio |
|---|---|---|---|
| Parse 500KB HTML | 45ms | 8ms | 5.6x |
| Extract 100 elements (CSS) | 12ms | 2ms | 6x |
| Extract 100 elements (XPath) | N/A | 1.5ms | — |
| Parse 5MB HTML | 850ms | 120ms | 7x |
| 10,000 small pages | 28s | 5s | 5.6x |
lxml is 5-10x faster than BeautifulSoup for parsing and extraction. The gap widens with larger documents.
Real-World Impact
For a scraper processing 10,000 pages:
- BeautifulSoup: ~28s parsing time + network time
- lxml: ~5s parsing time + network time
- Savings: 23 seconds (parsing alone)
For most scrapers, network latency dominates, so the parsing speed difference only matters for large-scale or parse-heavy workloads.
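To check where your own scraper's time goes, time the fetch and the parse separately. A minimal sketch using the same demo site as the extraction example later in this article (requests must be installed):
import time
import requests
from bs4 import BeautifulSoup

start = time.perf_counter()
response = requests.get("https://books.toscrape.com/")  # network-bound step
fetch_time = time.perf_counter() - start

start = time.perf_counter()
soup = BeautifulSoup(response.text, "lxml")  # CPU-bound step
parse_time = time.perf_counter() - start

print(f"fetch: {fetch_time:.3f}s  parse: {parse_time:.3f}s")
# If fetch dwarfs parse, switching parsers will not move the bottleneck.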
API Comparison
Parsing HTML
# BeautifulSoup
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_string, "lxml")
# lxml
from lxml import html
tree = html.fromstring(html_string)
Finding Elements
# BeautifulSoup — multiple methods
soup.find("div", class_="product") # First match
soup.find_all("div", class_="product") # All matches
soup.select("div.product") # CSS selector
soup.select_one("div.product") # First match via CSS
# lxml — XPath or CSS
tree.xpath("//div[@class='product']") # XPath
tree.cssselect("div.product") # CSS (via cssselect)Extracting Text
# BeautifulSoup
element = soup.find("h2")
text = element.text # All text including children
text = element.string  # The single string child, or None if nested
text = element.get_text(strip=True) # Stripped text
# lxml
element = tree.xpath("//h2")[0]
text = element.text  # Text before the first child element only
text = element.text_content() # All text including children
text = element.xpath("string()")  # XPath string extraction
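The naming is easy to confuse across the two libraries: BeautifulSoup's .text and lxml's .text_content() both gather all descendant text, while lxml's .text and BeautifulSoup's .string are far narrower. A minimal sketch on a nested snippet makes the difference concrete:
from bs4 import BeautifulSoup
from lxml import html

snippet = "<h2>Sale: <span>50% off</span></h2>"

el = BeautifulSoup(snippet, "lxml").h2
print(el.text)    # "Sale: 50% off" (all descendant text)
print(el.string)  # None (h2 has more than one child node)

node = html.fromstring(snippet)
print(node.text)            # "Sale: " (text before the first child element)
print(node.text_content())  # "Sale: 50% off"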
Extracting Attributes
# BeautifulSoup
href = soup.find("a")["href"]
href = soup.find("a").get("href", "")
all_hrefs = [a["href"] for a in soup.find_all("a", href=True)]
# lxml
href = tree.xpath("//a/@href") # Returns list of all hrefs
single = tree.xpath("//a/@href")[0] # First href
element = tree.xpath("//a")[0]
href = element.get("href")  # Attribute on element
Full Extraction Example
import requests
from bs4 import BeautifulSoup
from lxml import html as lxml_html
response = requests.get("https://books.toscrape.com/")
# BeautifulSoup approach
soup = BeautifulSoup(response.text, "lxml")
bs4_books = []
for book in soup.select("article.product_pod"):
    bs4_books.append({
        "title": book.select_one("h3 a")["title"],
        "price": book.select_one(".price_color").text,
    })
# lxml approach
tree = lxml_html.fromstring(response.content)
lxml_books = []
for book in tree.xpath("//article[contains(@class, 'product_pod')]"):
    lxml_books.append({
        "title": book.xpath(".//h3/a/@title")[0],
        "price": book.xpath(".//*[contains(@class, 'price_color')]/text()")[0],
    })

# Both produce identical results
assert len(bs4_books) == len(lxml_books)
Selector Comparison
CSS Selectors
# BeautifulSoup — native CSS selector support
soup.select("div.product h3 a")
soup.select("div.product > .price")
soup.select("[data-id='123']")
soup.select("tr:nth-child(odd)")
# lxml — CSS via cssselect
from lxml.cssselect import CSSSelector
sel = CSSSelector("div.product h3 a")
results = sel(tree)
# Or inline
tree.cssselect("div.product h3 a")XPath (lxml only)
XPath is lxml’s superpower — it handles complex queries that CSS cannot express:
# Conditional selection
tree.xpath("//div[@class='product'][.//span[number(substring(text(),2)) < 50]]")
# Sibling navigation
tree.xpath("//h2[text()='Laptop']/following-sibling::span/text()")
# Text contains
tree.xpath("//div[contains(text(), 'In Stock')]")
# Position-based
tree.xpath("//table/tr[position() > 1]") # Skip header row
# Multiple conditions
tree.xpath("//a[@href and @class='active']")
# Aggregate functions
tree.xpath("count(//div[@class='product'])")
# String manipulation
tree.xpath("normalize-space(//p[@class='description'])")BeautifulSoup has no XPath support. If your parsing requires XPath, you must use lxml.
Handling Broken HTML
broken_html = "<div><p>Unclosed paragraph<p>Another<div>Nested wrong</p></div>"

# BeautifulSoup: very forgiving
from bs4 import BeautifulSoup
soup = BeautifulSoup(broken_html, "html.parser")
print(soup.prettify())
# Reconstructs a reasonable tree structure

# lxml: less forgiving but still handles most cases
from lxml import html as lxml_html
tree = lxml_html.fromstring(broken_html)
print(lxml_html.tostring(tree, pretty_print=True).decode())
# May produce a different structure than BeautifulSoup
BeautifulSoup with the html.parser backend is the most tolerant of broken HTML. It is also the only option that does not require C dependencies.
Tolerance ranking (a comparison sketch follows the list):
- BeautifulSoup + html.parser (most tolerant)
- BeautifulSoup + html5lib (standards-compliant)
- BeautifulSoup + lxml (fast but less tolerant)
- lxml directly (least tolerant of severely broken HTML)
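The sketch below feeds the same broken snippet to each BeautifulSoup backend so you can compare the repaired trees directly (html5lib must be installed separately):
from bs4 import BeautifulSoup

broken_html = "<div><p>Unclosed paragraph<p>Another<div>Nested wrong</p></div>"

for parser in ("html.parser", "html5lib", "lxml"):
    soup = BeautifulSoup(broken_html, parser)
    print(f"--- {parser} ---")
    print(soup.prettify())
# Each backend repairs the bad nesting differently; diff the output to see how.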
Memory Usage
Parsing a 10MB HTML file:
| Parser | Peak Memory | Parse Time |
|---|---|---|
| BeautifulSoup + html.parser | ~250MB | 4.2s |
| BeautifulSoup + lxml | ~180MB | 1.8s |
| lxml directly | ~90MB | 0.8s |
lxml uses roughly half the memory of BeautifulSoup because it doesn’t build the additional Python object layer that BeautifulSoup adds on top of the parse tree.
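If you want to sanity-check these numbers on your own files, peak memory is easy to read on Unix-like systems via the resource module (ru_maxrss is kilobytes on Linux, bytes on macOS); a rough sketch:
import resource
from lxml import html

with open("large_page.html", "r") as f:  # same file as the benchmark
    tree = html.fromstring(f.read())

peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak_kb / 1024:.0f} MB")  # assuming Linux (KB units)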
When to Use Each
Use BeautifulSoup When:
- You are a beginner learning web scraping
- The HTML is severely broken or malformed
- You want the simplest possible API
- You need to work without C dependencies (html.parser backend)
- Parsing speed is not a bottleneck
- You are working in a Jupyter notebook for quick exploration
# BeautifulSoup shines here: quick, readable, forgiving
from bs4 import BeautifulSoup
soup = BeautifulSoup(messy_html, "html.parser")
prices = [el.text for el in soup.select(".price")]
Full tutorial: Beautiful Soup tutorial.
Use lxml When:
- Parsing speed is critical (large documents, many pages)
- You need XPath queries
- You are processing XML alongside HTML
- Memory usage matters (large files)
- You are comfortable with XPath syntax
# lxml shines here: fast, powerful XPath
from lxml import html
tree = html.fromstring(large_html)
# Complex query that CSS cannot express
products = tree.xpath(
    "//div[@class='product'][.//span[@class='price']"
    "[number(substring(text(),2)) < 50]]//h3/text()"
)
Consider Parsel When:
You want lxml speed with a cleaner API — Parsel wraps lxml with both CSS and XPath support:
from parsel import Selector
sel = Selector(text=html_string)
titles = sel.css("h3 a::attr(title)").getall() # CSS
titles = sel.xpath("//h3/a/@title").getall() # XPath
isbns = sel.css(".details").re(r"ISBN:\s*([\d-]+)")  # Regex
See our HTTPX + Parsel guide for more.
Using Both Together
The hybrid approach: use lxml as BeautifulSoup’s parser backend for speed, with BeautifulSoup’s API for convenience:
from bs4 import BeautifulSoup
# Use lxml as the parser backend — faster than html.parser
soup = BeautifulSoup(html_string, "lxml")
# BeautifulSoup API, lxml speed
products = soup.select("div.product")
This gives you ~70% of lxml’s speed with 100% of BeautifulSoup’s API convenience.
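If a BeautifulSoup-based scraper later needs a single XPath query, you can bridge without rewriting it: serialize the soup and re-parse with lxml. A sketch, again with html_string as a placeholder:
from bs4 import BeautifulSoup
from lxml import html

soup = BeautifulSoup(html_string, "lxml")
# ... existing BeautifulSoup extraction ...

tree = html.fromstring(str(soup))  # re-parse the serialized, cleaned-up markup
titles = tree.xpath("//h3/a/@title")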
FAQ
Is lxml always faster than BeautifulSoup?
For parsing and extraction, yes — typically 5-10x faster. However, when you include network latency (which dominates most scraping workflows), the parsing speed difference may not matter. It only becomes significant when parsing large documents or processing thousands of pages.
Can I use XPath with BeautifulSoup?
No. BeautifulSoup does not support XPath. If you need XPath, use lxml directly or use Parsel (which supports both CSS and XPath). As a workaround, you can serialize a BeautifulSoup tree with str(soup) and re-parse it with lxml when a single XPath query is needed.
Should beginners start with lxml or BeautifulSoup?
Start with BeautifulSoup. Its API is more intuitive, its documentation is better for beginners, and its tolerance for broken HTML means fewer frustrating errors. Move to lxml when parsing speed becomes a bottleneck.
Does using lxml as BeautifulSoup’s parser give me lxml speed?
Partially. Using BeautifulSoup(html, "lxml") uses lxml for parsing (fast) but wraps results in BeautifulSoup objects (slower). You get faster parsing but slower element access compared to using lxml directly. It is a good middle ground.
What about html5lib?
html5lib is the most standards-compliant parser but also the slowest (10-100x slower than lxml). Use it only when you need the exact same parsing behavior as a web browser, which is rare in scraping.
Learn each library: BeautifulSoup tutorial, Python scraping libraries. For proxy setup, see our web scraping proxy guide.
External Resources:
- BeautifulSoup Documentation
- lxml Documentation
- Parsel Documentation
Related Reading
- aiohttp + BeautifulSoup: Async Python Scraping
- Axios + Cheerio: Lightweight Node.js Scraping
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)
- API vs Web Scraping: When You Need Proxies (and When You Don’t)
- ASEAN Data Protection Laws: A Web Scraping Compliance Matrix
- How to Build an Ethical Web Scraping Policy for Your Company