Selectolax: The Fastest HTML Parser You're Not Using in 2026

If you’re still defaulting to BeautifulSoup for HTML parsing in Python, you’re leaving serious performance on the table. Selectolax is a Python binding for the Modest and Lexbor HTML5 parsers, and in most real-world scraping benchmarks it runs 10x to 30x faster than BeautifulSoup while consuming a fraction of the memory. In 2026, with scraping pipelines processing millions of pages per day and cloud compute costs dominating margins, that gap is no longer academic.

What Selectolax Actually Is

Selectolax wraps two C-based parsers: Modest (the original backend, built on the MyHTML parser) and Lexbor (the newer, actively maintained backend). The Modest backend is exposed as selectolax.parser.HTMLParser and the Lexbor backend as selectolax.lexbor.LexborHTMLParser; the two share most of their API. Lexbor is a standalone C99 HTML5 parser with zero external dependencies, conformant to the HTML5 spec, and built for speed. You get CSS selector support, attribute access, text extraction, and tree traversal, all through a small Python API that deliberately leaves out the features you rarely need in a scraping context.

It is not a drop-in replacement for BeautifulSoup. There is no XPath, only basic encoding detection (in the Modest-backed HTMLParser, when it is given bytes), and no support for XML documents. If you need those, you want lxml. But for the 90% use case (parse HTML, find nodes, extract text and attributes) selectolax covers it completely and does it faster.

Benchmarks: How Fast Is It?

The numbers below are from a 2025 community benchmark parsing a 250KB e-commerce product page 10,000 times on a single core (Python 3.12, Linux):

Parser	Time (s)	Peak RSS (MB)	CSS Selectors
BeautifulSoup4 + lxml	38.2	214	Yes (via select())
BeautifulSoup4 + html.parser	112.7	198	Yes
lxml.html (direct)	9.1	87	XPath natively; CSS via the cssselect package
selectolax (Lexbor)	3.4	41	Yes
html5-parser	6.8	102	No

Selectolax wins on both speed and memory. The memory advantage matters at scale: if you’re running 50 concurrent Playwright workers, each holding parsed DOM trees in flight, and each one peaks near the figures above, that is roughly 50 × 41 MB ≈ 2 GB versus 50 × 214 MB ≈ 10.7 GB in aggregate. That is the difference between a 4GB VPS running comfortably and one thrashing swap.

For teams already evaluating Rust-based alternatives at the infrastructure layer, it’s worth reading html5ever vs lol-html: Rust HTML Parsing Compared (2026) — the performance ceiling for pure Rust parsers sits even higher, though Python bindings add overhead.

Basic Usage and Real Patterns

The API is minimal by design. Here’s a realistic extraction pattern for scraping product listings:

from selectolax.parser import HTMLParser
import httpx

def extract_products(html: str) -> list[dict]:
    """Return title, price, and URL for each product card on the page."""
    tree = HTMLParser(html)
    results = []
    for card in tree.css("div.product-card"):
        title_node = card.css_first("h2.product-title")
        price_node = card.css_first("span[data-price]")
        # css_first() returns None on no match; skip cards missing a required field.
        if title_node is None or price_node is None:
            continue
        link_node = card.css_first("a")  # look the link up once, not twice
        results.append({
            "title": title_node.text(strip=True),
            "price": price_node.attributes.get("data-price"),
            "url": link_node.attributes.get("href") if link_node is not None else None,
        })
    return results

resp = httpx.get("https://example.com/products")
resp.raise_for_status()  # fail on 4xx/5xx instead of parsing an error page
products = extract_products(resp.text)

A few patterns worth knowing:

Use css_first() instead of css()[0] to avoid index errors on missing nodes
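node.text() includes text from all descendants by default (deep=True); pass deep=False to get only the node’s own direct text. strip=True trims whitespace from each text fragment, which makes it a sensible choice for leaf nodes like titles and prices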
node.attributes returns a plain dict — no .get_attribute() call required
tree.css() returns a list of every match, so the full result set is materialized upfront; use css_first() when you only need the first hit

If your pipeline already uses MechanicalSoup for session-aware form handling, selectolax pairs well as the fast read-only parser for the non-form pages in the same crawl.

Where Selectolax Fits in a 2026 Scraping Stack

Selectolax belongs in the parsing layer — it does not fetch pages, handle JavaScript, or manage sessions. The right mental model is:

Fetch HTML (httpx, aiohttp, or a browser automation layer)
Pass raw HTML string to HTMLParser(html)
Extract data with CSS selectors
Discard the tree (no serialization, no caching)

For JavaScript-heavy sites, you still need a headless browser upstream. Playwright is the default choice in 2026 over Pyppeteer for Python teams — use Playwright to get the rendered HTML, then hand it directly to selectolax rather than letting Playwright’s own DOM methods do the extraction. The combo is significantly faster than running page.query_selector_all() in a loop.

When the pipeline is Node.js rather than Python, you’re looking at a different parser landscape entirely. The equivalent speed-versus-ergonomics tradeoff is covered in Cheerio vs JSDom vs Linkedom for Node.js scrapers — Cheerio is the selectolax equivalent in the Node ecosystem for most purposes.

Gotchas and Limitations to Know

Selectolax is not magic. A few real-world issues that bite teams:

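Encoding: selectolax works best when you hand it a correctly decoded str. If the upstream decode was wrong (common with older Asian e-commerce sites that serve legacy encodings such as Shift_JIS or GBK), you’ll get garbled text. Decode the raw bytes explicitly, using the declared charset or a detector such as chardet or charset-normalizer, before parsing.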
Malformed HTML tolerance: Lexbor is spec-compliant, which means it corrects malformed HTML the same way browsers do. Most of the time this is what you want. Occasionally it restructures the tree in ways that break a selector that worked in BeautifulSoup’s more lenient mode.
No XPath: If a target site’s structure requires positional XPath axes (like following-sibling::td[2]), lxml.html is still the right call. Attempting to replicate complex XPath logic with CSS selectors is usually the wrong move.
Thread safety: HTMLParser instances are not thread-safe. Create one per parse call, not one shared instance. Given how cheap construction is, this is not a performance concern.
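Limited tree editing: selectolax can serialize a tree or node back to HTML through the .html property and offers a handful of mutation helpers such as decompose() and unwrap(), but it is not designed for building or heavily restructuring documents. If you need to modify and re-emit HTML extensively, use lxml or html5lib.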

The library is actively maintained, pip-installable with no system dependencies on Linux/macOS/Windows, and available as a pre-built wheel for all major Python versions including 3.12 and 3.13.

Bottom Line

If your Python scraping pipeline touches more than a few thousand pages per day and you’re still using BeautifulSoup as the default parser, swap the parsing layer to selectolax with Lexbor and benchmark it on your actual workload. The benchmark above suggests roughly a 10x reduction in parse time with minimal code changes, though the end-to-end gain depends on how much of your pipeline is parse-bound rather than network-bound. The API surface is small enough to learn in an afternoon. DRT covers the full parser landscape across languages and runtimes, so if selectolax doesn’t fit your specific constraints, the comparison guides above will point you to what does.
