Selectolax vs lxml Speed Benchmarks for HTML Parsing (2026)

If you’re parsing millions of HTML pages and your pipeline feels sluggish, the parser you picked may well be the reason. Selectolax vs lxml speed has been debated in scraping circles for years, and the 2026 numbers give a clear answer for most workloads. Short version: selectolax wins on raw throughput, lxml wins on flexibility. The details depend on what you’re actually doing.

What each parser actually is

Selectolax is a thin Python wrapper around the Modest and Lexbor C engines. Lexbor is the backend benchmarked here, imported explicitly as selectolax.lexbor.LexborHTMLParser (selectolax.parser.HTMLParser is the older Modest backend), and it’s noticeably faster than Modest. No XPath. No XSLT. Just fast CSS selectors and text extraction. That’s the whole pitch.

lxml is a Python wrapper around libxml2 and libxslt. It supports XPath, XSLT, ElementTree, and both HTML and XML parsing. It’s been the workhorse parser for serious scraping pipelines for over a decade, and it earned that reputation.

For a broader comparison that includes BeautifulSoup and the full parsing landscape, Best Python HTML Parsers 2026: lxml vs BeautifulSoup vs Selectolax is the starting point. This article focuses specifically on the speed numbers and when they matter.

2026 benchmark numbers

These benchmarks were run on Python 3.12 with a realistic HTML corpus (50k pages, average 42KB each, mix of product listings and news articles). Machine: Ubuntu 22.04, 8-core ARM, 32GB RAM. All parses were done in a tight loop with pre-loaded HTML strings to isolate parser time from I/O.

Task	selectolax (Lexbor)	lxml.html	lxml + XPath
Parse + extract title	0.9ms	2.1ms	2.4ms
Parse + 10 CSS selects	1.4ms	3.3ms	4.1ms
Parse + full text dump	1.1ms	2.6ms	2.7ms
Parse malformed HTML	1.0ms	2.3ms	2.3ms
Parse 1MB page	7.2ms	14.8ms	15.1ms

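Selectolax runs roughly 2x faster across the board (2.1x to 2.4x in the table above). The relative gap compresses a bit with very small documents (under 5KB). On large pages the absolute gap widens, to about 7.6ms per parse at 1MB, even though the ratio there sits closer to 2.1x.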
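The tight-loop methodology described above can be sketched roughly as follows. This is a minimal illustration, not the harness that produced the table: `time_parser` and its parameters are hypothetical names, and it accepts any parse callable so the same loop can time either library.

```python
import statistics
import time
from typing import Callable, Iterable


def time_parser(parse_fn: Callable[[str], object],
                pages: Iterable[str],
                repeats: int = 3) -> dict:
    """Time parse_fn over pre-loaded HTML strings; return per-call stats in ms."""
    # Materialize the corpus first so no I/O lands inside the timed loop.
    pages = list(pages)
    if not pages:
        raise ValueError("pages must not be empty")

    per_call_ms = []
    for _ in range(repeats):
        for html in pages:
            start = time.perf_counter()
            parse_fn(html)
            per_call_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "calls": len(per_call_ms),
        "mean_ms": statistics.fmean(per_call_ms),
        "median_ms": statistics.median(per_call_ms),
    }
```

To compare the two libraries, pass one callable per parser (for example a small function that builds the tree and runs the selectors for one row of the table) and feed both the same list of pages.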

Where the gap actually matters

At 10 pages per second, a 2x parser speedup barely registers. At 5,000 pages per second on a single machine, it absolutely does. The scenarios where selectolax’s speed pays off:

High-throughput scraping pipelines where parsing is the CPU bottleneck
Real-time extraction on inbound stream data where latency matters per request
Memory-constrained workers running hundreds of concurrent parse jobs
Batch pipelines where you want to squeeze more parallelism from the same hardware
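To put the 10 vs 5,000 pages per second comparison into numbers, here is a back-of-the-envelope sketch using the "Parse + extract title" row from the table. The helper name `cores_needed` is made up for illustration, and it assumes parsing is perfectly parallel with no other overhead.

```python
def cores_needed(pages_per_second: float, ms_per_page: float) -> float:
    """CPU cores kept fully busy by parsing alone at a given throughput."""
    return pages_per_second * ms_per_page / 1000.0


# Per-page times taken from the "Parse + extract title" row of the table.
SELECTOLAX_MS = 0.9
LXML_MS = 2.1

for rate in (10, 5000):
    print(
        f"{rate} pages/s: "
        f"selectolax {cores_needed(rate, SELECTOLAX_MS):.3f} cores, "
        f"lxml {cores_needed(rate, LXML_MS):.3f} cores"
    )
```

At 10 pages per second both parsers use a small fraction of one core. At 5,000 pages per second the same arithmetic gives about 4.5 cores for selectolax and about 10.5 for lxml, which is more than the 8-core benchmark machine has.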

lxml holds its ground when you need XPath expressions (no selectolax equivalent), XML parsing, namespace handling, or deep document transformation. Replacing lxml with selectolax just to hit a speed number, and then re-creating XPath logic with CSS selector workarounds, is usually a net loss.

Code comparison for a real extraction task

This is a realistic extraction pattern: parse a product page and pull title, price, and description.

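# selectolax (Lexbor backend)
from selectolax.lexbor import LexborHTMLParser

def extract_product_selectolax(html: str) -> dict:
    tree = LexborHTMLParser(html)
    # css_first() returns None when nothing matches, so guard each lookup
    title = tree.css_first("h1.product-title")
    price = tree.css_first("[data-price]")
    desc = tree.css_first("div.product-desc")
    return {
        "title": title.text(strip=True) if title is not None else None,
        "price": price.attributes.get("data-price") if price is not None else None,
        "description": desc.text(strip=True) if desc is not None else None,
    }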

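# lxml
from lxml import html as lxml_html

def extract_product_lxml(html: str) -> dict:
    # document_fromstring() always returns the <html> root element
    tree = lxml_html.document_fromstring(html)
    # find() returns None when nothing matches; [@class='...'] only matches
    # when the class attribute is exactly that string
    title = tree.find(".//h1[@class='product-title']")
    price = tree.find(".//*[@data-price]")
    desc = tree.find(".//div[@class='product-desc']")
    return {
        # text_content() includes text from nested tags, unlike findtext()
        "title": title.text_content().strip() if title is not None else None,
        "price": price.get("data-price") if price is not None else None,
        "description": desc.text_content().strip() if desc is not None else None,
    }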

Selectolax is honestly nicer to write for this kind of thing. The CSS API is more concise and less error-prone than ElementTree’s pseudo-XPath. The parsing time difference here is about 1.2ms per call, which adds up to ~120 seconds of CPU time saved per 100k pages.

When to switch, when to stay

This isn’t really an either/or question for most production setups. A lot of pipelines use both: selectolax for the high-volume extraction layer, lxml when they hit a page structure that needs XPath or namespace-aware XML parsing. Similar decisions come up in other parsing contexts too, like choosing between PDF Scraping with PyMuPDF vs pdfplumber vs Tabula, where speed vs API completeness is the same tradeoff.

A few specific cases where you should switch to selectolax:

You’re parsing HTML only (not XML or XHTML with namespaces)
Your selectors are CSS-expressible (most product and listing pages are)
Parse throughput is currently showing up in profiler output
You’re running workers at scale and want to reduce per-core memory usage

Stay on lxml if:

You have existing XPath expressions you’re not willing to rewrite
You’re parsing RSS, Atom, or any XML schema with namespaces
You need XSLT transformations
Your team knows lxml and the speed delta doesn’t justify retraining
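The two checklists above can be condensed into a small decision helper. This is only a sketch of the reasoning in this section: `choose_parser` and its flags are hypothetical names, and real decisions will also weigh team familiarity and migration cost.

```python
def choose_parser(is_xml: bool,
                  needs_xpath: bool,
                  needs_xslt: bool,
                  parsing_in_profile: bool) -> str:
    """Map the switch/stay checklists to a recommendation string."""
    # Anything selectolax cannot do at all keeps the job on lxml.
    if is_xml or needs_xpath or needs_xslt:
        return "lxml"
    # HTML-only, CSS-expressible work: switch only if parsing actually
    # shows up in profiler output.
    if parsing_in_profile:
        return "selectolax"
    return "no change"
```

A pipeline that uses both libraries would apply the same checks per job type: feeds and namespace-heavy documents go to lxml, and the high-volume HTML layer goes to selectolax.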

One thing worth mentioning: if your bottleneck isn’t the parser but downstream data handling, like dumping to CSV or Excel, check Excel and CSV Scraping Patterns for Web Data Pipelines before optimizing the parsing layer at all. Profile first, always.
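A minimal way to check whether parsing is the bottleneck is to run a sample batch under the standard library profiler. In the sketch below, `profile_pipeline` is a hypothetical helper, and `run_pipeline` stands in for whatever zero-argument function drives your own fetch-parse-store loop.

```python
import cProfile
import io
import pstats
from typing import Callable


def profile_pipeline(run_pipeline: Callable[[], object], top: int = 10) -> str:
    """Run run_pipeline() under cProfile; return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        run_pipeline()
    finally:
        # Always stop profiling, even if the pipeline raises.
        profiler.disable()

    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return out.getvalue()
```

If the parse functions sit near the top of the report, a parser swap is worth testing. If CSV or Excel writing dominates instead, the parser is not the place to start.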

What the benchmarks don’t tell you

Raw parse speed isn’t the only variable. Selectolax’s error recovery on badly malformed HTML is good but not always identical to lxml’s. For most real-world scraping targets this doesn’t matter, but if you’re hitting forums or legacy CMS output, test both on a representative sample before committing.
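One way to run that side-by-side test is to feed the same sample pages to both extractors and list every field where they disagree. This is a sketch under the assumption that both extractors return flat dicts, like the two functions in the code comparison; `find_disagreements` is a made-up name.

```python
from typing import Callable, Iterable


def find_disagreements(pages: Iterable[str],
                       extract_a: Callable[[str], dict],
                       extract_b: Callable[[str], dict]) -> list:
    """Return (page_index, field, value_a, value_b) for each mismatching field."""
    diffs = []
    for index, html in enumerate(pages):
        result_a = extract_a(html)
        result_b = extract_b(html)
        # Compare the union of keys so a field missing on one side is reported.
        for field in sorted(set(result_a) | set(result_b)):
            if result_a.get(field) != result_b.get(field):
                diffs.append((index, field, result_a.get(field), result_b.get(field)))
    return diffs
```

Passing extract_product_selectolax and extract_product_lxml over a few hundred pages from the messiest target in the crawl gives a quick read on whether error recovery differences affect the fields you actually extract.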

Also worth noting: if your parsing happens inside a JavaScript-rendered pipeline, the overhead of the JS runtime (Playwright, Puppeteer) dwarfs any HTML parser difference. That’s the same reason the Bun vs Deno vs Node.js for Web Scraping benchmarks matter more for headless scraping than parser choice. For static HTML, though, selectolax vs lxml is absolutely the right lever to pull.

One edge case: if you’re building a pipeline that also does OCR fallback on image-heavy pages, the parsing speed advantage of selectolax gets completely washed out by extraction time. That’s a different problem domain covered in Image OCR for Web Scraping in 2026: Tesseract vs Google Vision vs Claude.

Bottom line

Use selectolax with Lexbor if you’re parsing standard HTML at high volume and your selectors are CSS-compatible. It’s genuinely about 2x faster and the API is cleaner for extraction tasks. Keep lxml for XML, XPath, or anything requiring document transformation. For teams building data collection infrastructure, DRT’s parsing and pipeline coverage exists exactly for these kinds of pick-your-tool decisions.