If you’re still defaulting to BeautifulSoup for HTML parsing in Python, you’re leaving serious performance on the table. Selectolax is a Python binding for the Modest and Lexbor HTML5 parsers, and in most real-world scraping benchmarks it runs 10x to 30x faster than BeautifulSoup while consuming a fraction of the memory. In 2026, with scraping pipelines processing millions of pages per day and cloud compute costs dominating margins, that gap is no longer academic.
What Selectolax Actually Is
Selectolax wraps two C-based parsers: Modest (the original backend, using a Myhtml fork) and Lexbor (the newer, actively maintained default since selectolax 0.12). Lexbor is a standalone C99 HTML5 parser with zero external dependencies, conformant to the HTML5 spec, and built for speed. You get CSS selector support, attribute access, text extraction, and tree traversal — all through a clean Python API that deliberately strips out the features you rarely need in a scraping context.
It is not a drop-in replacement for BeautifulSoup. There is no XPath, no support for XML documents, and encoding detection is limited to the detect_encoding option on the older Modest-based HTMLParser. If you need those, you want lxml. But for the 90% use case (parse HTML, find nodes, extract text and attributes) selectolax covers it completely and does it faster.
Benchmarks: How Fast Is It?
The numbers below are from a 2025 community benchmark parsing a 250KB e-commerce product page 10,000 times on a single core (Python 3.12, Linux):
| Parser | Time (s) | Peak RSS (MB) | CSS Selectors |
|---|---|---|---|
| BeautifulSoup4 + lxml | 38.2 | 214 | Yes (via select()) |
| BeautifulSoup4 + html.parser | 112.7 | 198 | Yes |
| lxml.html (direct) | 9.1 | 87 | Optional (needs the cssselect package; XPath is native) |
| selectolax (Lexbor) | 3.4 | 41 | Yes |
| html5-parser | 6.8 | 102 | No |
Selectolax wins on both speed and memory. The memory advantage matters at scale: if you’re running 50 concurrent Playwright workers, each holding parsed DOM trees in flight, and each parse context peaks near the figures above, that is roughly 2GB in total at 41MB apiece versus more than 10GB at 214MB. That is the difference between a 4GB VPS running comfortably and one thrashing swap.
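To check figures like these against your own pages, a small timing harness is enough. The sketch below is not the benchmark quoted above; the function names are hypothetical, it only measures parse time (not selector queries or peak RSS), and it silently skips any parser that is not installed.

```python
import time
from typing import Callable


def time_parser(parse: Callable[[str], object], html: str, runs: int = 1000) -> float:
    """Return total wall-clock seconds spent calling parse(html) `runs` times."""
    start = time.perf_counter()
    for _ in range(runs):
        parse(html)
    return time.perf_counter() - start


def available_parsers() -> dict[str, Callable[[str], object]]:
    """Collect whichever parsers are importable; missing libraries are skipped."""
    parsers: dict[str, Callable[[str], object]] = {}
    try:
        from selectolax.lexbor import LexborHTMLParser
        parsers["selectolax (Lexbor)"] = LexborHTMLParser
    except ImportError:
        pass
    try:
        from bs4 import BeautifulSoup
        parsers["BeautifulSoup4 + html.parser"] = lambda html: BeautifulSoup(html, "html.parser")
    except ImportError:
        pass
    return parsers


def compare(html: str, runs: int = 1000) -> dict[str, float]:
    """Time every available parser on the same document."""
    return {name: time_parser(parse, html, runs) for name, parse in available_parsers().items()}
```

Calling compare() with the HTML of a representative page from your own crawl gives a more trustworthy ratio than any published table, since results depend heavily on document size and structure.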
For teams already evaluating Rust-based alternatives at the infrastructure layer, it’s worth reading html5ever vs lol-html: Rust HTML Parsing Compared (2026) — the performance ceiling for pure Rust parsers sits even higher, though Python bindings add overhead.
Basic Usage and Real Patterns
The API is minimal by design. Here’s a realistic extraction pattern for scraping product listings:
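    # HTMLParser is the Modest-based class; selectolax.lexbor.LexborHTMLParser
    # exposes the same css()/css_first() calls for the Lexbor backend.
    from selectolax.parser import HTMLParser
    import httpx


    def extract_products(html: str) -> list[dict]:
        """Return title, price, and URL for every product card on the page."""
        tree = HTMLParser(html)
        results = []
        for card in tree.css("div.product-card"):
            title_node = card.css_first("h2.product-title")
            price_node = card.css_first("span[data-price]")
            # Skip cards that are missing either required field
            if title_node is None or price_node is None:
                continue
            link_node = card.css_first("a")
            results.append({
                "title": title_node.text(strip=True),
                "price": price_node.attributes.get("data-price"),
                # url stays None when the card contains no link
                "url": link_node.attributes.get("href") if link_node is not None else None,
            })
        return results


    resp = httpx.get("https://example.com/products")
    resp.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    products = extract_products(resp.text)

A few patterns worth knowing: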
- Use css_first() instead of css()[0] to avoid index errors on missing nodes; css_first() returns None when nothing matches
- node.text(strip=True) is the safe choice for most extraction; deep=True, which is already the default, includes the text of all descendant nodes
- node.attributes returns a plain dict, so no .get_attribute() call is required
- tree.css() returns a list of every match rather than a lazy generator, so keep selectors narrow on very large documents
If your pipeline already uses MechanicalSoup for session-aware form handling, selectolax pairs well as the fast read-only parser for the non-form pages in the same crawl.
Where Selectolax Fits in a 2026 Scraping Stack
Selectolax belongs in the parsing layer — it does not fetch pages, handle JavaScript, or manage sessions. The right mental model is:
- Fetch HTML (httpx, aiohttp, or a browser automation layer)
- Pass the raw HTML string to HTMLParser(html)
- Extract data with CSS selectors
- Discard the tree (no serialization, no caching)
For JavaScript-heavy sites, you still need a headless browser upstream. Playwright is the default choice in 2026 over Pyppeteer for Python teams — use Playwright to get the rendered HTML, then hand it directly to selectolax rather than letting Playwright’s own DOM methods do the extraction. The combo is significantly faster than running page.query_selector_all() in a loop.
When the pipeline is Node.js rather than Python, you’re looking at a different parser landscape entirely. The equivalent speed-versus-ergonomics tradeoff is covered in Cheerio vs JSDom vs Linkedom for Node.js scrapers — Cheerio is the selectolax equivalent in the Node ecosystem for most purposes.
Gotchas and Limitations to Know
Selectolax is not magic. A few real-world issues that bite teams:
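- Encoding: selectolax works best with text that is already correctly decoded. The Lexbor backend treats its input as UTF-8, so if a page’s charset was misdetected upstream (common with older Asian e-commerce sites), you’ll get garbled text. Decode explicitly with chardet or charset-normalizer before parsing.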
- Malformed HTML tolerance: Lexbor is spec-compliant, which means it corrects malformed HTML the same way browsers do. Most of the time this is what you want. Occasionally it restructures the tree in ways that break a selector that worked in BeautifulSoup’s more lenient mode.
- No XPath: If a target site’s structure requires positional XPath axes (like following-sibling::td[2]), lxml.html is still the right call. Attempting to replicate complex XPath logic with CSS selectors is usually the wrong move.
- Thread safety: HTMLParser instances are not thread-safe. Create one per parse call, not one shared instance. Given how cheap construction is, this is not a performance concern.
- Limited tree editing: selectolax can drop nodes with decompose() and serialize a node through its html property, but it is not built for heavy document rewriting. If you need to restructure and re-emit HTML, use lxml or html5lib.
The library is actively maintained (0.13.x as of early 2026), pip-installable with no system dependencies on Linux/macOS/Windows, and available as a pre-built wheel for all major Python versions including 3.12 and 3.13.
Bottom Line
If your Python scraping pipeline touches more than a few thousand pages per day and you’re still using BeautifulSoup as the default parser, swap the parsing layer to selectolax with Lexbor and benchmark it on your actual workload — most teams see a 10x throughput gain with minimal code changes. The API surface is small enough to learn in an afternoon. DRT covers the full parser landscape across languages and runtimes, so if selectolax doesn’t fit your specific constraints, the comparison guides above will point you to what does.