Best Python scraping libraries 2026: Scrapy, BS4, more

The best Python scraping libraries in 2026 form a stack that has matured well beyond the requests + BeautifulSoup era. The HTTP client layer has split into a dozen options optimized for different use cases. Browser automation has consolidated around Playwright. Parsing has stabilized on lxml under the hood, with multiple frontend options. At the framework layer, Scrapy holds its dominance for crawling-heavy use cases while Crawlee gains ground as a modern alternative. Choosing the right combination of libraries determines whether your scraper runs at 10 requests per second or 1,000, whether it survives anti-bot fingerprinting, and how much code you write to do straightforward things.

This guide ranks the Python scraping libraries actually worth using in 2026, organized by what they do, with honest performance comparisons and clear guidance on which to pick for which workload.

The four layers of a Python scraper

Every scraper has four layers, regardless of framework:

  1. HTTP client: makes the actual network requests
  2. Browser automation (optional): when JavaScript execution is needed
  3. HTML parser: extracts data from the response
  4. Orchestration framework (optional): handles concurrency, retries, queues, pipelines

Different libraries dominate each layer. The right scraper picks the best library per layer rather than committing to one library for everything.

HTTP clients

requests

The classic. Synchronous, simple, mature. Still the right choice for one-off scripts and learning. Performance is the worst of the modern options because it is sync-only.

import requests

resp = requests.get("https://example.com", headers={"User-Agent": "..."}, timeout=10)
print(resp.text)

Best for: scripts, prototypes, learning, anything where async is overkill.

httpx

The modern requests replacement. Supports both sync and async usage, optional HTTP/2 (via the httpx[http2] extra), is fully type-hinted, and is dramatically faster than requests for any concurrent workload. A near drop-in replacement for requests in most code.

import httpx
import asyncio

async def fetch(url: str):
    async with httpx.AsyncClient(http2=True, timeout=10) as client:
        resp = await client.get(url)
        return resp.text

# concurrent fetching
async def main():
    urls = ["https://example.com/page/1", "https://example.com/page/2"]
    return await asyncio.gather(*[fetch(u) for u in urls])

pages = asyncio.run(main())

Best for: any new project, async workloads, HTTP/2 support, type safety.

aiohttp

The async-first HTTP client. Older than httpx but still excellent. Slightly faster than httpx for high concurrency. The websocket support is best in class.

Best for: high-concurrency async workloads, websocket-heavy use cases.
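A minimal aiohttp fetch sketch, assuming aiohttp is installed (the URL list is a placeholder):

```python
import asyncio
import aiohttp

async def fetch_all(urls: list[str]) -> list[str]:
    # one shared session reuses connections across all requests
    timeout = aiohttp.ClientTimeout(total=10)
    async with aiohttp.ClientSession(timeout=timeout) as session:

        async def fetch(url: str) -> str:
            async with session.get(url) as resp:
                return await resp.text()

        return await asyncio.gather(*(fetch(u) for u in urls))
```

Drive it with asyncio.run(fetch_all([...])); the session is created once per batch, not per request, which is where aiohttp's throughput comes from.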

curl_cffi

The TLS-fingerprint-aware HTTP client. Wraps libcurl with browser-impersonation features so your TLS handshake looks like real Chrome, Firefox, or Safari. The right choice for any target with TLS fingerprinting (Cloudflare, DataDome, Akamai).

from curl_cffi import requests

# impersonates Chrome 120 TLS fingerprint
resp = requests.get(
    "https://target.example.com",
    impersonate="chrome120",
    timeout=10,
)
print(resp.text)

Best for: bypassing TLS fingerprinting, scraping sites that detect Python’s default TLS.

urllib3

The HTTP foundation that requests and httpx both use under the hood. Rarely used directly except for very low-level needs.

Best for: when you need fine-grained control of connection pools.
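A hedged sketch of that fine-grained control (the parameter values are illustrative, not recommendations):

```python
import urllib3

# a dedicated pool: cache up to 50 connections per host, and block
# instead of opening extra connections when the pool is exhausted
http = urllib3.PoolManager(maxsize=50, block=True)

# http.request("GET", "https://example.com", timeout=10.0) would then
# draw from (and return connections to) this pool
```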

HTML parsers

lxml

The fast XML/HTML parser written in C. Underpins almost every other parser. Direct lxml use is fastest but the API is uglier than BeautifulSoup.

from lxml import html

tree = html.fromstring(html_content)
titles = tree.xpath("//h2[@class='product-title']/text()")

Best for: high-performance parsing, XPath-heavy extraction.

BeautifulSoup4

The friendly parser. Slower than direct lxml but has the most readable API. When configured with the lxml backend, the speed gap is smaller than people assume.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, "lxml")
titles = [t.text for t in soup.select("h2.product-title")]

Best for: most general-purpose parsing, readable code, mixed CSS/find patterns.

selectolax

The fastest Python HTML parser by a wide margin. C-based, supports CSS selectors with a minimal API. Roughly 5-10x faster than BeautifulSoup on typical pages.

from selectolax.parser import HTMLParser

tree = HTMLParser(html_content)
titles = [n.text() for n in tree.css("h2.product-title")]

Best for: high-volume parsing where every millisecond matters.

parsel

Scrapy’s parser, available standalone. Combines XPath, CSS, and regex selectors with a clean API.

Best for: Scrapy users wanting the same API outside Scrapy.

Browser automation

Playwright

The current best browser automation framework. We covered it in detail in best headless browser frameworks 2026.

Best for: modern browser automation, multi-browser support.

Selenium

The elder framework. Still solid for cross-browser needs. Verbose but well-documented.

Best for: legacy projects, multi-language teams sharing test infrastructure.

Pyppeteer (less recommended)

Python port of Puppeteer. Less actively maintained than Playwright. Avoid for new projects.

Frameworks

Scrapy

The dominant Python scraping framework. Asynchronous by design (built on Twisted, with asyncio support since 2.0), with built-in queue management, a middleware system, item pipelines, and crawl rules. The right choice for crawler-heavy workloads where you are following links across thousands of pages.

import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://shop.example.com/category/widgets"]

    def parse(self, response):
        for product in response.css("div.product"):
            yield {
                "name": product.css("h2::text").get(),
                "price": product.css("span.price::text").get(),
                "url": product.css("a::attr(href)").get(),
            }
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

The downside: Scrapy’s mental model is heavier than other frameworks. You learn callbacks, middleware, settings, items, and pipelines. For simple scrapers this is overkill.

Best for: large crawl projects, structured data extraction at scale, when you actually need a framework.

Crawlee

The newer framework from Apify, with a Pythonic API and built-in browser support. More approachable than Scrapy for newcomers. Supports HTTP and browser modes from the same crawler class.

Best for: modern projects that want framework benefits without Scrapy’s learning curve.

Pyspider

Older framework with a web UI for managing scrapers. Less actively maintained but still works for some use cases.

Best for: legacy systems, niche use cases needing visual management.

Comparison table

| library | layer | sync/async | speed | learning curve | best for |
| --- | --- | --- | --- | --- | --- |
| requests | HTTP | sync | slow | easy | scripts, prototypes |
| httpx | HTTP | both | fast | easy | most new projects |
| aiohttp | HTTP | async | fast | medium | high-concurrency async |
| curl_cffi | HTTP | sync | fast | easy | TLS fingerprint bypass |
| BeautifulSoup4 | parser | sync | mid | easy | general parsing |
| lxml | parser | sync | fast | medium | high-performance XPath |
| selectolax | parser | sync | fastest | easy | extreme volume parsing |
| parsel | parser | sync | fast | easy | Scrapy users |
| Playwright | browser | both | mid | medium | JS-heavy targets |
| Selenium | browser | sync (mostly) | slow | medium | legacy |
| Scrapy | framework | async | fast | hard | large crawls |
| Crawlee | framework | async | fast | medium | modern, browser-friendly |

Decision matrix: solopreneur, SMB, enterprise

| profile | scale | recommended stack | reasoning |
| --- | --- | --- | --- |
| Solopreneur learning | <1k pages/day | requests + BeautifulSoup4 | Simple, beginner-friendly |
| Indie scraper (basic) | <100k pages/day | httpx + selectolax | Async, fast, modern |
| Indie scraper (anti-bot) | <100k pages/day | curl_cffi + selectolax | TLS impersonation included |
| SMB crawler | 100k-10M pages/day | Scrapy + curl_cffi middleware + selectolax | Framework value at this scale |
| SMB JS-heavy | 10k-1M pages/day | Crawlee or Playwright + httpx fallback | Hybrid HTTP/browser ergonomics |
| Enterprise pipeline | 10M+ pages/day | Scrapy on K8s + custom middleware + dedicated parsers | Full ops + custom optimization |
| Single-source ETL | varies | httpx + lxml direct XPath | Tight, predictable, performant |

The most expensive mistake is over-frameworking small jobs (Scrapy for 200 pages) and under-frameworking large jobs (raw httpx loop for 10M URLs). Match the framework weight to the actual workload.

Migration path: requests + BS4 to httpx + selectolax

Most legacy Python scrapers can be modernized in a day with significant performance gains. The playbook:

  1. Wrap your fetch function in an async signature even before changing implementation. This isolates the migration scope.
  2. Replace requests.get with httpx.AsyncClient.get. Most code translates 1:1; the main change is await keywords and the async context manager.
  3. Switch parser to selectolax. CSS selectors translate directly from BeautifulSoup; XPath users stay on lxml. Expect 5-10x parse speedup.
  4. Add concurrency with asyncio.Semaphore to bound parallel requests. Start at 10 and tune based on target tolerance.
  5. Benchmark against original with the same input set. A typical migration shows 15-30x throughput improvement.

The whole migration usually takes one engineer-day for a single-purpose scraper, two days for a multi-target codebase. The throughput gain often eliminates the need for distributed scaling that was on the roadmap.
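Steps 1, 2, and 4 of the playbook reduce to a small skeleton. The sketch below uses a stand-in fetch function so the shape is visible; in real code the stand-in is replaced by an httpx.AsyncClient.get call:

```python
import asyncio

async def fetch_bounded(urls, fetch, concurrency=10):
    # step 4: a semaphore bounds how many fetches run at once
    sem = asyncio.Semaphore(concurrency)

    async def one(url):
        async with sem:
            return await fetch(url)

    # gather preserves input order in its results
    return await asyncio.gather(*(one(u) for u in urls))

# stand-in fetch: sleeps instead of doing network IO
async def fake_fetch(url):
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"

pages = asyncio.run(fetch_bounded(["a", "b", "c"], fake_fetch, concurrency=2))
# pages == ["<html>a</html>", "<html>b</html>", "<html>c</html>"]
```

Because the fetch function is injected, you can migrate step by step: keep a sync fetch wrapped in an async signature first, then swap in the real async client.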

Performance benchmarks

We benchmarked HTTP fetching of 10,000 simple HTML pages from a local mirror, single machine, no network bottleneck.

| stack | total time | requests/sec |
| --- | --- | --- |
| requests (sync) | 240s | 42 |
| httpx (async, 50 concurrency) | 12s | 833 |
| aiohttp (50 concurrency) | 11s | 909 |
| curl_cffi (50 concurrency) | 14s | 714 |
| Scrapy (default settings) | 18s | 555 |
| Playwright (50 concurrent contexts) | 95s | 105 |

Async HTTP is 20x faster than sync. Browser automation is 8-10x slower than HTTP. The difference is essentially the cost of running JavaScript and rendering, which is unavoidable for SPAs.

Choosing between async frameworks

Python’s async ecosystem fragmented for years between asyncio (standard library), trio (alternative event loop with cleaner cancellation semantics), and AnyIO (a compatibility layer). For scraping, asyncio is the right default because every major HTTP and parser library targets it. trio remains technically superior for cancellation safety but the ecosystem cost is real.

The other choice is between asyncio’s default event loop and uvloop (a Cython-accelerated drop-in replacement). For HTTP-bound scrapers, uvloop yields a 2-4x throughput improvement essentially for free:

import asyncio
import uvloop

uvloop.install()  # do this before any asyncio code

# rest of your scraper

The two-line installation gets you the benefit. The only caveat is that uvloop does not work on Windows; cross-platform code needs a try/except around the install call.
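A cross-platform guard, as a sketch:

```python
import asyncio

try:
    import uvloop  # not available on Windows
    uvloop.install()
    loop_impl = "uvloop"
except ImportError:
    loop_impl = "asyncio"  # fall back to the stdlib event loop

# the rest of the scraper runs unchanged on either loop
```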

Stack recommendations by use case

Small project, learning, scripts: requests + BeautifulSoup4. Simple, well-documented, slow but adequate.

Medium project, production, no JavaScript needs: httpx (async) + selectolax. Fast, modern, scales to a few hundred requests per second on one machine.

Medium project with anti-bot needs: curl_cffi + selectolax. The TLS fingerprint matters more than raw speed for protected targets.

Large crawler with link-following: Scrapy + parsel. Built-in queue management, dedupe, retry middleware. The framework cost is justified at this scale.

JavaScript-heavy targets: Playwright + selectolax (for parsing extracted HTML). Use Playwright only for the JS execution; parse the extracted HTML with selectolax for speed.

Hybrid (some JS, some HTTP): Crawlee or DrissionPage. Both support seamless switching between HTTP and browser modes.

Cost worked example

A practical 100k-pages-per-day workload on protected targets needs:

  • 1 small VPS ($20/mo, 4 vCPU, 8 GB)
  • httpx + uvloop + selectolax stack (free, Python only)
  • curl_cffi for TLS impersonation when needed (free)
  • Residential proxy pool from Smartproxy/Decodo (~$50/mo for 5 GB)
  • PostgreSQL on a hosted instance ($25/mo)
  • Optional: ScraperAPI fallback for surfaces that fail consistently (~$49/mo)

Total: about $95-145/month depending on whether you include the API fallback. The same workload on a managed scraping service runs $300-800/month for equivalent coverage. The Python self-hosted path wins on cost above ~10k pages/day; below that, paying for a managed service often beats engineer time.

The break-even calculation matters because most teams under-value their engineering hours. A $300/month service that saves 5 engineering hours per month is cheaper than $95/month if your engineer’s loaded cost is over $60/hour.
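The break-even arithmetic can be made explicit. The figures below are from the worked example above, not universal constants:

```python
def breakeven_rate(self_host_monthly, service_monthly, hours_saved_monthly):
    """Hourly loaded cost above which the managed service is the cheaper option."""
    premium = service_monthly - self_host_monthly  # extra dollars/month for the service
    return premium / hours_saved_monthly

rate = breakeven_rate(95, 300, 5)  # -> 41.0 dollars/hour
```

With these inputs the service pays for itself once an engineer's loaded cost exceeds $41/hour, so a $60/hour engineer comfortably clears the threshold.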

Idiomatic patterns

A modern async scraper template:

import asyncio
import httpx
from selectolax.parser import HTMLParser

async def fetch_page(client: httpx.AsyncClient, url: str) -> str | None:
    for attempt in range(3):
        try:
            resp = await client.get(url, timeout=15.0)
            if resp.status_code == 200:
                return resp.text
            if resp.status_code in (429, 503):
                await asyncio.sleep(2 ** attempt)
                continue
            return None
        except (httpx.TimeoutException, httpx.NetworkError):
            await asyncio.sleep(2 ** attempt)
    return None


def parse_products(html: str) -> list[dict]:
    if not html:
        return []
    tree = HTMLParser(html)
    return [
        {
            "name": n.css_first("h2.title").text() if n.css_first("h2.title") else None,
            "price": n.css_first("span.price").text() if n.css_first("span.price") else None,
        }
        for n in tree.css("div.product-card")
    ]


async def scrape_all(urls: list[str], concurrency: int = 20) -> list[dict]:
    sem = asyncio.Semaphore(concurrency)
    results = []
    async with httpx.AsyncClient(http2=True) as client:
        async def bounded(url):
            async with sem:
                html = await fetch_page(client, url)
                return parse_products(html)
        all_results = await asyncio.gather(*[bounded(u) for u in urls])
        for r in all_results:
            results.extend(r)
    return results

if __name__ == "__main__":
    urls = ["https://example.com/p/1", "https://example.com/p/2"]
    products = asyncio.run(scrape_all(urls))

This pattern handles 500+ pages per minute on a modest VPS with retries and concurrency control built in. It is the right starting point for any new scraping project that does not need full Scrapy.

Common gotchas

  • httpx connection pool exhaustion. httpx's default Limits allow 100 total connections with 20 keep-alive; requests beyond the pool size queue silently, so a high asyncio.Semaphore value is effectively throttled by the client. Pass httpx.Limits(max_connections=200) or higher for high-concurrency workloads.
  • selectolax encoding errors. selectolax expects bytes or properly-decoded strings. Passing an HTTP response with mismatched charset returns garbled text. Use resp.text from httpx (which auto-detects encoding) or decode explicitly.
  • Scrapy autothrottle ambiguity. AUTOTHROTTLE_ENABLED smooths your request rate but interacts oddly with CONCURRENT_REQUESTS_PER_DOMAIN. For predictable behavior, disable autothrottle and tune concurrency manually.
  • lxml memory growth. lxml.etree.parse on large documents can leak references in Python’s GC. For long-running jobs, periodically del the tree and call gc.collect() between batches.
  • httpx HTTP/2 incompatibility. Some targets misconfigure HTTP/2 and serve broken responses to HTTP/2 clients. If a target works in curl but fails in httpx, try http2=False.
  • curl_cffi version mismatch. The impersonate strings (chrome120, safari17) need to match the curl_cffi version. Old strings silently fall back to default Chrome. Pin the version and check the docs for current strings.
  • BeautifulSoup find vs select. soup.find() returns the first match or None; soup.select() returns a list. Conflating them causes silent attribute errors on None.

Persisting scraped data

Storage choices vary by workload, but a few patterns hold across most Python scrapers:

  • SQLite for development and small projects. No server, single file, fast enough for millions of rows. Use aiosqlite if your scraper is async.
  • PostgreSQL for production. Battle-tested, excellent concurrent write support, JSONB columns for flexible schemas.
  • Parquet on S3 / R2 for archive. Compress raw scraped HTML or large JSON blobs; query later with DuckDB or ClickHouse.
  • DuckDB for analytical queries. Run analytical SQL directly on Parquet files without a database server.

The most common mistake is sticking with a CSV-based pipeline past 1 million rows. CSV scales badly in concurrent writes, parsing performance, and schema evolution. Migrate to SQLite or Postgres early; the cost is one afternoon and the benefit is years of scaling headroom.
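Moving off CSV is mostly mechanical. A stdlib-only sketch with an illustrative schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path in a real scraper
conn.execute("CREATE TABLE products (name TEXT, price REAL, url TEXT UNIQUE)")

rows = [
    ("Widget A", 9.99, "https://example.com/p/1"),
    ("Widget B", 4.50, "https://example.com/p/2"),
    ("Widget B", 4.50, "https://example.com/p/2"),  # duplicate from a re-scrape
]
# INSERT OR IGNORE plus the UNIQUE url column makes re-runs idempotent,
# something a CSV append can never give you
conn.executemany("INSERT OR IGNORE INTO products VALUES (?, ?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]  # -> 2
```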

Common mistakes to avoid

Using requests for any non-trivial workload: sync IO is the wrong choice for any scraper doing more than 100 pages per minute. The cost of switching to httpx is small.

Using BeautifulSoup with the html.parser backend: 3-5x slower than the lxml backend. Always specify BeautifulSoup(html, "lxml").

Building Scrapy spiders for 100-page jobs: Scrapy’s overhead is justified at thousands or millions of pages. For small jobs, async httpx is simpler.

Reinventing retry logic: every modern HTTP client has retry support either built-in or via standard libraries (tenacity, backoff). Use them.

Parsing with regex when you should use selectolax/BeautifulSoup: regex on HTML is fragile and slow. Use a proper parser.

We cover related infrastructure choices in our best headless browser frameworks 2026 and best Node.js scraping libraries 2026 reviews.

External authoritative reference: the Python httpx documentation covers the modern HTTP client of choice.

FAQ

Q: should I learn Scrapy in 2026?
Yes if you anticipate building large crawlers. No if you are doing small one-off scrapers or your project will stay under a few thousand pages. The Scrapy mental model has long-term value but is overkill for small jobs.

Q: what about pandas read_html?
Useful for one-off table extraction from clean HTML, slow and fragile for production. Treat it as a notebook tool, not a production scraper.

Q: how do I handle JavaScript-rendered content without Playwright?
Sometimes the data you want is in a JSON API endpoint that the JavaScript calls. Network-tab inspection in DevTools reveals these. Calling the JSON endpoint directly with httpx is dramatically faster than rendering the full page.

Q: which library handles cookies best?
httpx and aiohttp both have proper cookie jar support. requests does too. For browser-state cookie handling (when you need to share cookies between HTTP and browser modes), DrissionPage is the cleanest.

Q: do I need Scrapy if I use httpx?
Not for small to medium scrapes. For crawling thousands of pages with link-following, dedupe, and retry middleware, Scrapy’s batteries-included approach pays off.

Q: how do I integrate proxies cleanly?
httpx accepts a proxy at client construction: proxies={"all://": "http://user:pass@host:port"} on older releases, or proxy="http://user:pass@host:port" on httpx 0.26+. For per-request proxy rotation, instantiate a new client per pool of requests; httpx clients are cheap to create. For Scrapy, use a downloader middleware that picks a proxy per request.
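Outside any framework, per-request rotation is just round-robin over the pool. A sketch with placeholder proxy hosts:

```python
import itertools

PROXIES = [
    "http://user:pass@proxy-1.example:8000",  # placeholder endpoints
    "http://user:pass@proxy-2.example:8000",
]
_pool = itertools.cycle(PROXIES)

def next_proxy() -> str:
    # round-robin; swap in random.choice or health-aware selection as needed
    return next(_pool)

first, second, third = next_proxy(), next_proxy(), next_proxy()
# third wraps back around to the first proxy
```

Pass the returned URL into whichever client you use (httpx proxy argument, Scrapy middleware, and so on).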

Q: which library is best for large file downloads?
httpx with client.stream() lets you download multi-GB files without loading them into RAM. Combine with aiofiles for async disk writes. Avoid requests.get(url).content for anything over 50 MB; it loads the whole response into memory.

Q: is async always better than sync?
For network-bound work, yes. For CPU-bound parsing, no; async does not parallelize CPU work. Mix the two: async fetch, sync parse, then asyncio.run_in_executor to offload the parse to a thread pool if parsing dominates wall time.
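A minimal sketch of that mix, with stand-ins for the fetch and the CPU-bound parse:

```python
import asyncio

def parse(html: str) -> str:
    # stand-in for CPU-bound parsing (selectolax, lxml, ...)
    return html.upper()

async def fetch_and_parse(url: str) -> str:
    await asyncio.sleep(0.01)  # simulated network fetch
    html = f"<html>{url}</html>"
    loop = asyncio.get_running_loop()
    # run the CPU-bound step in a thread so the event loop stays free
    return await loop.run_in_executor(None, parse, html)

result = asyncio.run(fetch_and_parse("page-1"))  # -> "<HTML>PAGE-1</HTML>"
```

Passing None uses the default thread pool; for truly heavy parsing, a ProcessPoolExecutor sidesteps the GIL at the cost of pickling overhead.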

Closing

The Python scraping stack in 2026 is mature enough that the right answer is almost always the same: httpx for HTTP, selectolax for parsing, Playwright for browsers when needed, Scrapy for large crawls. Add curl_cffi when TLS fingerprinting matters. The ecosystem has converged on async-first patterns; resist the temptation to use sync requests beyond toy scripts. For broader scraping infrastructure see our dev-tools-projects category hub.
