Best Python scraping libraries 2026: Scrapy, BS4, more
The best Python scraping libraries in 2026 form a stack that has matured significantly since the requests + BeautifulSoup era. The HTTP client layer has split into a dozen options optimized for different use cases. Browser automation has consolidated around Playwright. Parsing has stabilized on lxml under the hood, with multiple frontend options. At the framework layer, Scrapy holds its dominance for crawling-heavy use cases while Crawlee gains ground as a modern alternative. Choosing the right combination of libraries determines whether your scraper runs at 10 requests per second or 1,000, whether it survives anti-bot fingerprinting, and how much code you write to do straightforward things.
This guide ranks the Python scraping libraries actually worth using in 2026, organized by what they do, with honest performance comparisons and clear guidance on which to pick for which workload.
The four layers of a Python scraper
Every scraper has four layers, regardless of framework:
- HTTP client: makes the actual network requests
- Browser automation (optional): when JavaScript execution is needed
- HTML parser: extracts data from the response
- Orchestration framework (optional): handles concurrency, retries, queues, pipelines
Different libraries dominate each layer. The right scraper picks the best library per layer rather than committing to one library for everything.
HTTP clients
requests
The classic. Synchronous, simple, mature. Still the right choice for one-off scripts and learning. Performance is the worst of the modern options because it is sync-only.
import requests
resp = requests.get("https://example.com", headers={"User-Agent": "..."}, timeout=10)
print(resp.text)
Best for: scripts, prototypes, learning, anything where async is overkill.
httpx
The modern requests replacement. Supports both sync and async, offers optional HTTP/2, is fully type-hinted, and is dramatically faster than requests for any concurrent workload. A near drop-in replacement for requests in most code.
import httpx
import asyncio

async def fetch(url: str) -> str:
    # http2=True requires the optional extra: pip install "httpx[http2]"
    async with httpx.AsyncClient(http2=True, timeout=10) as client:
        resp = await client.get(url)
        return resp.text

# concurrent fetching
async def main():
    urls = ["https://example.com/page/1", "https://example.com/page/2"]
    return await asyncio.gather(*[fetch(u) for u in urls])
Best for: any new project, async workloads, HTTP/2 support, type safety.
aiohttp
The async-first HTTP client. Older than httpx but still excellent, and slightly faster than httpx at high concurrency. Its WebSocket support is best in class.
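A minimal sketch of concurrent fetching with aiohttp, using illustrative URLs:

import asyncio
import aiohttp

async def fetch_all(urls: list[str]) -> list[str]:
    timeout = aiohttp.ClientTimeout(total=10)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async def fetch(url: str) -> str:
            async with session.get(url) as resp:
                return await resp.text()
        return await asyncio.gather(*[fetch(u) for u in urls])

# asyncio.run(fetch_all(["https://example.com/page/1", "https://example.com/page/2"]))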
Best for: high-concurrency async workloads, websocket-heavy use cases.
curl_cffi
The TLS-fingerprint-aware HTTP client. Wraps libcurl with browser-impersonation features so your TLS handshake looks like real Chrome, Firefox, or Safari. The right choice for any target with TLS fingerprinting (Cloudflare, DataDome, Akamai).
from curl_cffi import requests
# impersonates Chrome 120 TLS fingerprint
resp = requests.get(
    "https://target.example.com",
    impersonate="chrome120",
    timeout=10,
)
print(resp.text)
Best for: bypassing TLS fingerprinting, scraping sites that detect Python’s default TLS.
urllib3
The low-level HTTP foundation that requests uses under the hood (httpx is built on httpcore instead). Rarely used directly except for very low-level needs.
Best for: when you need fine-grained control of connection pools.
HTML parsers
lxml
The fast XML/HTML parser built on the libxml2 C library. Underpins almost every other parser. Direct lxml use is fastest, but the API is less friendly than BeautifulSoup's.
from lxml import html
tree = html.fromstring(html_content)
titles = tree.xpath("//h2[@class='product-title']/text()")
Best for: high-performance parsing, XPath-heavy extraction.
BeautifulSoup4
The friendly parser. Slower than direct lxml but has the most readable API. When configured with lxml as the underlying parser (as in the example below), the speed gap is smaller than people assume; note that the out-of-the-box default is the slower html.parser backend.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, "lxml")
titles = [t.text for t in soup.select("h2.product-title")]
Best for: most general-purpose parsing, readable code, mixed CSS/find patterns.
selectolax
The fastest Python HTML parser by a wide margin. C-based, supports CSS selectors with a minimal API. Roughly 5-10x faster than BeautifulSoup on typical pages.
from selectolax.parser import HTMLParser
tree = HTMLParser(html_content)
titles = [n.text() for n in tree.css("h2.product-title")]
Best for: high-volume parsing where every millisecond matters.
parsel
Scrapy’s parser, available standalone. Combines XPath, CSS, and regex selectors with a clean API.
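A minimal sketch, assuming html_content holds a fetched page as in the earlier parser examples:

from parsel import Selector

sel = Selector(text=html_content)
titles_css = sel.css("h2.product-title::text").getall()
titles_xpath = sel.xpath("//h2[@class='product-title']/text()").getall()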
Best for: Scrapy users wanting the same API outside Scrapy.
Browser automation
Playwright
The current best browser automation framework. We covered it in detail in best headless browser frameworks 2026.
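A minimal sync-API sketch against an illustrative URL (run playwright install once to download the browsers):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com", wait_until="networkidle")
    html = page.content()  # fully rendered HTML; hand it to a fast parser
    browser.close()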
Best for: modern browser automation, multi-browser support.
Selenium
The elder framework. Still solid for cross-browser needs. Verbose but well-documented.
Best for: legacy projects, multi-language teams sharing test infrastructure.
Pyppeteer (less recommended)
Python port of Puppeteer. Less actively maintained than Playwright. Avoid for new projects.
Frameworks
Scrapy
The dominant Python scraping framework. Async by design (built on Twisted, with asyncio support since 2.0), with built-in queue management, a middleware system, item pipelines, and crawl rules. The right choice for crawler-heavy workloads where you are following links across thousands of pages.
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://shop.example.com/category/widgets"]

    def parse(self, response):
        for product in response.css("div.product"):
            yield {
                "name": product.css("h2::text").get(),
                "price": product.css("span.price::text").get(),
                "url": product.css("a::attr(href)").get(),
            }
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
The downside: Scrapy’s mental model is heavier than that of other frameworks. You learn callbacks, middleware, settings, items, and pipelines. For simple scrapers this is overkill.
Best for: large crawl projects, structured data extraction at scale, when you actually need a framework.
Crawlee
The newer framework from Apify, with a Pythonic API and built-in browser support. More approachable than Scrapy for newcomers. Supports HTTP and browser modes from the same crawler class.
Best for: modern projects that want framework benefits without Scrapy’s learning curve.
Pyspider
Older framework with a web UI for managing scrapers. No longer actively maintained, but it still works for some existing deployments.
Best for: legacy systems, niche use cases needing visual management.
Comparison table
| library | layer | sync/async | speed | learning curve | best for |
|---|---|---|---|---|---|
| requests | HTTP | sync | slow | easy | scripts, prototypes |
| httpx | HTTP | both | fast | easy | most new projects |
| aiohttp | HTTP | async | fast | medium | high-concurrency async |
| curl_cffi | HTTP | both | fast | easy | TLS fingerprint bypass |
| BeautifulSoup4 | parser | sync | mid | easy | general parsing |
| lxml | parser | sync | fast | medium | high-performance XPath |
| selectolax | parser | sync | fastest | easy | extreme volume parsing |
| parsel | parser | sync | fast | easy | Scrapy users |
| Playwright | browser | both | mid | medium | JS-heavy targets |
| Selenium | browser | sync (mostly) | slow | medium | legacy |
| Scrapy | framework | async | fast | hard | large crawls |
| Crawlee | framework | async | fast | medium | modern, browser-friendly |
Decision matrix: solopreneur, SMB, enterprise
| profile | scale | recommended stack | reasoning |
|---|---|---|---|
| Solopreneur learning | <1k pages/day | requests + BeautifulSoup4 | Simple, beginner-friendly |
| Indie scraper (basic) | <100k pages/day | httpx + selectolax | Async, fast, modern |
| Indie scraper (anti-bot) | <100k pages/day | curl_cffi + selectolax | TLS impersonation included |
| SMB crawler | 100k-10M pages/day | Scrapy + curl_cffi middleware + selectolax | Framework value at this scale |
| SMB JS-heavy | 10k-1M pages/day | Crawlee or Playwright + httpx fallback | Hybrid HTTP/browser ergonomics |
| Enterprise pipeline | 10M+ pages/day | Scrapy on K8s + custom middleware + dedicated parsers | Full ops + custom optimization |
| Single-source ETL | varies | httpx + lxml direct XPath | Tight, predictable, performant |
The most expensive mistake is over-frameworking small jobs (Scrapy for 200 pages) and under-frameworking large jobs (raw httpx loop for 10M URLs). Match the framework weight to the actual workload.
Migration path: requests + BS4 to httpx + selectolax
Most legacy Python scrapers can be modernized in a day with significant performance gains. The playbook:
- Wrap your fetch function in an async signature even before changing implementation. This isolates the migration scope.
- Replace requests.get with httpx.AsyncClient.get. Most code translates 1:1; the main changes are await keywords and the async context manager.
- Switch the parser to selectolax. CSS selectors translate directly from BeautifulSoup; XPath users stay on lxml. Expect a 5-10x parse speedup.
- Add concurrency with asyncio.Semaphore to bound parallel requests. Start at 10 and tune based on target tolerance (see the sketch after this list).
- Benchmark against the original with the same input set. A typical migration shows a 15-30x throughput improvement.
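A minimal sketch of the migrated fetch loop under those steps, with illustrative URLs and a concurrency bound of 10:

import asyncio
import httpx

async def fetch(client: httpx.AsyncClient, sem: asyncio.Semaphore, url: str) -> str:
    # Was: resp = requests.get(url, timeout=10)
    async with sem:
        resp = await client.get(url, timeout=10)
        return resp.text

async def main() -> list[str]:
    sem = asyncio.Semaphore(10)  # step 4: bound parallel requests
    urls = [f"https://example.com/p/{i}" for i in range(100)]
    async with httpx.AsyncClient() as client:
        return await asyncio.gather(*[fetch(client, sem, u) for u in urls])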
The whole migration usually takes one engineer-day for a single-purpose scraper, two days for a multi-target codebase. The throughput gain often eliminates the need for distributed scaling that was on the roadmap.
Performance benchmarks
We benchmarked HTTP fetching of 10,000 simple HTML pages from a local mirror, single machine, no network bottleneck.
| stack | total time | requests/sec |
|---|---|---|
| requests (sync) | 240s | 42 |
| httpx (async, 50 concurrency) | 12s | 833 |
| aiohttp (50 concurrency) | 11s | 909 |
| curl_cffi (50 concurrency) | 14s | 714 |
| Scrapy (default settings) | 18s | 555 |
| Playwright (50 concurrent contexts) | 95s | 105 |
Async HTTP is 20x faster than sync. Browser automation is 8-10x slower than HTTP. The difference is essentially the cost of running JavaScript and rendering, which is unavoidable for SPAs.
Choosing between async frameworks
Python’s async ecosystem fragmented for years between asyncio (standard library), trio (alternative event loop with cleaner cancellation semantics), and AnyIO (a compatibility layer). For scraping, asyncio is the right default because every major HTTP and parser library targets it. trio remains technically superior for cancellation safety but the ecosystem cost is real.
The other choice is between asyncio’s default event loop and uvloop (a Cython-accelerated drop-in replacement). For HTTP-bound scrapers, uvloop yields a 2-4x throughput improvement essentially for free:
import asyncio
import uvloop
uvloop.install() # do this before any asyncio code
# rest of your scraper
The two-line installation gets you the benefit. The only caveat is that uvloop does not work on Windows; cross-platform code needs a try/except around the install call.
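A cross-platform guard along those lines, as a sketch:

try:
    import uvloop  # not installable on Windows
    uvloop.install()
except ImportError:
    pass  # fall back to the default asyncio event loop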
Stack recommendations by use case
Small project, learning, scripts: requests + BeautifulSoup4. Simple, well-documented, slow but adequate.
Medium project, production, no JavaScript needs: httpx (async) + selectolax. Fast, modern, scales to a few hundred requests per second on one machine.
Medium project with anti-bot needs: curl_cffi + selectolax. The TLS fingerprint matters more than raw speed for protected targets.
Large crawler with link-following: Scrapy + parsel. Built-in queue management, dedupe, retry middleware. The framework cost is justified at this scale.
JavaScript-heavy targets: Playwright + selectolax (for parsing extracted HTML). Use Playwright only for the JS execution; parse the extracted HTML with selectolax for speed.
Hybrid (some JS, some HTTP): Crawlee or DrissionPage. Both support seamless switching between HTTP and browser modes.
Cost worked example
A practical 100k-pages-per-day workload on protected targets needs:
- 1 small VPS ($20/mo, 4 vCPU, 8 GB)
- httpx + uvloop + selectolax stack (free, Python only)
- curl_cffi for TLS impersonation when needed (free)
- Residential proxy pool from Smartproxy/Decodo (~$50/mo for 5 GB)
- PostgreSQL on a hosted instance ($25/mo)
- Optional: ScraperAPI fallback for surfaces that fail consistently (~$49/mo)
Total: about $95-145/month depending on whether you include the API fallback. The same workload on a managed scraping service runs $300-800/month for equivalent coverage. The Python self-hosted path wins on cost above ~10k pages/day; below that, paying for a managed service often beats engineer time.
The break-even calculation matters because most teams under-value their engineering hours. A $300/month service that saves 5 engineering hours per month is cheaper than the $95/month stack once your engineer’s loaded cost exceeds roughly $41/hour (the $205 difference divided by the 5 hours saved), a threshold virtually every team clears.
Idiomatic patterns
A modern async scraper template:
import asyncio
import httpx
from selectolax.parser import HTMLParser

async def fetch_page(client: httpx.AsyncClient, url: str) -> str | None:
    # Retry up to three times with exponential backoff on timeouts,
    # network errors, and throttling responses.
    for attempt in range(3):
        try:
            resp = await client.get(url, timeout=15.0)
            if resp.status_code == 200:
                return resp.text
            if resp.status_code in (429, 503):
                await asyncio.sleep(2 ** attempt)
                continue
            return None
        except (httpx.TimeoutException, httpx.NetworkError):
            await asyncio.sleep(2 ** attempt)
    return None

def parse_products(html: str | None) -> list[dict]:
    if not html:
        return []
    tree = HTMLParser(html)
    return [
        {
            "name": n.css_first("h2.title").text() if n.css_first("h2.title") else None,
            "price": n.css_first("span.price").text() if n.css_first("span.price") else None,
        }
        for n in tree.css("div.product-card")
    ]

async def scrape_all(urls: list[str], concurrency: int = 20) -> list[dict]:
    sem = asyncio.Semaphore(concurrency)
    results = []
    async with httpx.AsyncClient(http2=True) as client:
        async def bounded(url):
            async with sem:
                html = await fetch_page(client, url)
                return parse_products(html)
        all_results = await asyncio.gather(*[bounded(u) for u in urls])
    for r in all_results:
        results.extend(r)
    return results

if __name__ == "__main__":
    urls = ["https://example.com/p/1", "https://example.com/p/2"]
    products = asyncio.run(scrape_all(urls))
This pattern handles 500+ pages per minute on a modest VPS with retries and concurrency control built in. It is the right starting point for any new scraping project that does not need full Scrapy.
Common gotchas
- httpx connection pool exhaustion. The default limits on httpx.AsyncClient allow 100 total connections but only 20 keep-alive connections, so at high concurrency the pool, not your asyncio.Semaphore, can become the bottleneck. Pass httpx.Limits(max_connections=200, max_keepalive_connections=50) for high-concurrency workloads (see the sketch after this list).
- selectolax encoding errors. selectolax expects bytes or properly decoded strings. Passing an HTTP response with a mismatched charset returns garbled text. Use resp.text from httpx (which auto-detects encoding) or decode explicitly.
- Scrapy autothrottle ambiguity. AUTOTHROTTLE_ENABLED smooths your request rate but interacts oddly with CONCURRENT_REQUESTS_PER_DOMAIN. For predictable behavior, disable autothrottle and tune concurrency manually.
- lxml memory growth. lxml.etree.parse on large documents can hold memory that Python’s GC is slow to reclaim. For long-running jobs, periodically del the tree and call gc.collect() between batches.
- httpx HTTP/2 incompatibility. Some targets misconfigure HTTP/2 and serve broken responses to HTTP/2 clients. If a target works in curl but fails in httpx, try http2=False.
- curl_cffi version mismatch. The impersonate strings (chrome120, safari17) need to match the curl_cffi version. Old strings silently fall back to default Chrome. Pin the version and check the docs for current strings.
- BeautifulSoup find vs select. soup.find() returns the first match or None; soup.select() returns a list. Conflating them causes silent attribute errors on None.
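A sketch of the pool-limit fix from the first gotcha, with illustrative numbers:

import asyncio
import httpx

limits = httpx.Limits(max_connections=200, max_keepalive_connections=50)

async def main() -> None:
    sem = asyncio.Semaphore(50)
    async with httpx.AsyncClient(limits=limits, timeout=15.0) as client:
        ...  # the pool no longer throttles below the semaphore bound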
Persisting scraped data
Storage choices vary by workload, but a few patterns hold across most Python scrapers:
- SQLite for development and small projects. No server, single file, fast enough for millions of rows. Use aiosqlite if your scraper is async (see the sketch after this list).
- PostgreSQL for production. Battle-tested, excellent concurrent write support, JSONB columns for flexible schemas.
- Parquet on S3 / R2 for archive. Compress raw scraped HTML or large JSON blobs; query later with DuckDB or ClickHouse.
- DuckDB for analytical queries. Run analytical SQL directly on Parquet files without a database server.
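A minimal async persistence sketch with aiosqlite, assuming a hypothetical products table and row keys matching your parser’s output:

import aiosqlite

async def save_products(rows: list[dict]) -> None:
    async with aiosqlite.connect("scrape.db") as db:
        await db.execute(
            "CREATE TABLE IF NOT EXISTS products (name TEXT, price TEXT)"
        )
        await db.executemany(
            "INSERT INTO products (name, price) VALUES (:name, :price)",
            rows,
        )
        await db.commit()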
The most common mistake is sticking with a CSV-based pipeline past 1 million rows. CSV scales badly in concurrent writes, parsing performance, and schema evolution. Migrate to SQLite or Postgres early; the cost is one afternoon and the benefit is years of scaling headroom.
Common mistakes to avoid
Using requests for any non-trivial workload: sync IO is the wrong choice for any scraper doing more than 100 pages per minute. The cost of switching to httpx is small.
Using BeautifulSoup with the html.parser backend: 3-5x slower than the lxml backend. Always specify BeautifulSoup(html, "lxml").
Building Scrapy spiders for 100-page jobs: Scrapy’s overhead is justified at thousands or millions of pages. For small jobs, async httpx is simpler.
Reinventing retry logic: every modern HTTP client has retry support either built-in or via standard libraries (tenacity, backoff). Use them.
Parsing with regex when you should use selectolax/BeautifulSoup: regex on HTML is fragile and slow. Use a proper parser.
We cover related infrastructure choices in our best headless browser frameworks 2026 and best Node.js scraping libraries 2026 reviews.
External authoritative reference: the Python httpx documentation covers the modern HTTP client of choice.
FAQ
Q: should I learn Scrapy in 2026?
Yes if you anticipate building large crawlers. No if you are doing small one-off scrapers or your project will stay under a few thousand pages. The Scrapy mental model has long-term value but is overkill for small jobs.
Q: what about pandas read_html?
Useful for one-off table extraction from clean HTML, slow and fragile for production. Treat it as a notebook tool, not a production scraper.
Q: how do I handle JavaScript-rendered content without Playwright?
Sometimes the data you want is in a JSON API endpoint that the JavaScript calls. Network-tab inspection in DevTools reveals these. Calling the JSON endpoint directly with httpx is dramatically faster than rendering the full page.
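A sketch with a hypothetical endpoint found via the Network tab:

import httpx

resp = httpx.get(
    "https://shop.example.com/api/products?page=1",  # hypothetical JSON endpoint
    headers={"Accept": "application/json"},
    timeout=10,
)
products = resp.json()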
Q: which library handles cookies best?
httpx and aiohttp both have proper cookie jar support. requests does too. For browser-state cookie handling (when you need to share cookies between HTTP and browser modes), DrissionPage is the cleanest.
Q: do I need Scrapy if I use httpx?
Not for small to medium scrapes. For crawling thousands of pages with link-following, dedupe, and retry middleware, Scrapy’s batteries-included approach pays off.
Q: how do I integrate proxies cleanly?
Recent httpx releases take proxy="http://user:pass@host:port" on the client (the older proxies={"all://": ...} dict form is deprecated). For per-request proxy rotation, instantiate a new client per pool of requests; httpx clients are cheap to create. For Scrapy, use a downloader middleware that picks a proxy per request.
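A sketch with hypothetical proxy credentials, showing both the single-proxy and mounted-transport forms:

import httpx

# Route all traffic through one upstream proxy.
client = httpx.Client(proxy="http://user:pass@proxy.example.com:8000")

# Or mount a proxied transport for specific URL patterns.
mounted = httpx.Client(
    mounts={"all://": httpx.HTTPTransport(proxy="http://user:pass@proxy.example.com:8000")}
)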
Q: which library is best for large file downloads?
httpx with client.stream() lets you download multi-GB files without loading them into RAM. Combine with aiofiles for async disk writes. Avoid requests.get(url).content for anything over 50 MB; it loads the whole response into memory.
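A sketch of that streaming pattern, assuming aiofiles is installed and using an illustrative URL:

import asyncio
import aiofiles
import httpx

async def download(url: str, path: str) -> None:
    async with httpx.AsyncClient() as client:
        async with client.stream("GET", url) as resp:
            async with aiofiles.open(path, "wb") as f:
                async for chunk in resp.aiter_bytes(chunk_size=1024 * 1024):
                    await f.write(chunk)

# asyncio.run(download("https://example.com/big-file.zip", "big-file.zip"))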
Q: is async always better than sync?
For network-bound work, yes. For CPU-bound parsing, no; async does not parallelize CPU work. Mix the two: async fetch, sync parse, and offload the parse to a thread pool with asyncio.to_thread (or loop.run_in_executor) if parsing dominates wall time.
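A sketch of that split using asyncio.to_thread (Python 3.9+):

import asyncio
import httpx
from selectolax.parser import HTMLParser

def parse_titles(html: str) -> list[str]:
    # CPU-bound parse runs in a worker thread, off the event loop
    return [n.text() for n in HTMLParser(html).css("h2")]

async def fetch_and_parse(client: httpx.AsyncClient, url: str) -> list[str]:
    resp = await client.get(url, timeout=10)  # network-bound: await it
    return await asyncio.to_thread(parse_titles, resp.text)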
Closing
The Python scraping stack in 2026 is mature enough that the right answer is almost always the same: httpx for HTTP, selectolax for parsing, Playwright for browsers when needed, Scrapy for large crawls. Add curl_cffi when TLS fingerprinting matters. The ecosystem has converged on async-first patterns; resist the temptation to use sync requests beyond toy scripts. For broader scraping infrastructure see our dev-tools-projects category hub.