HTTPX + Parsel: Modern Python Scraping Stack

HTTPX + Parsel is the modern replacement for the classic Requests + BeautifulSoup stack. HTTPX brings async support, HTTP/2, and connection pooling. Parsel — Scrapy’s selector engine — provides the speed of lxml with a clean CSS/XPath/regex API. Together, they deliver the best performance-to-simplicity ratio for Python scraping projects that don’t need a full framework like Scrapy.

This tutorial covers setup, synchronous and async scraping, selector patterns, proxy integration, and real-world examples.

Why HTTPX + Parsel

Feature              | Requests + BS4 | HTTPX + Parsel
---------------------|----------------|---------------
Async                | No             | Yes
HTTP/2               | No             | Yes
Parsing speed        | Medium         | Fast (lxml)
CSS selectors        | Yes            | Yes
XPath                | No             | Yes
Regex extraction     | No             | Yes
Connection pooling   | Basic          | Advanced
Drop-in for Requests | N/A            | Yes

HTTPX is API-compatible with Requests, so migration is straightforward. Parsel is 2-5x faster than BeautifulSoup because it uses lxml directly.
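
Because the two APIs align, migration is usually just an import swap. A minimal before/after sketch:

# Before: Requests
import requests
response = requests.get("https://books.toscrape.com/", timeout=30)

# After: HTTPX, same call shape and same Response attributes
import httpx
response = httpx.get("https://books.toscrape.com/", timeout=30)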

Installation

pip install "httpx[http2]" parsel
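
To verify the install, import both packages and print their versions:

python -c "import httpx, parsel; print(httpx.__version__, parsel.__version__)"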

HTTPX Basics

Synchronous (Drop-in Requests Replacement)

import httpx

# Simple GET
response = httpx.get("https://books.toscrape.com/", timeout=30)
print(response.status_code)
print(response.text[:200])

# With client (recommended for multiple requests)
with httpx.Client(
    http2=True,
    follow_redirects=True,
    timeout=30,
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
) as client:
    response = client.get("https://books.toscrape.com/")
    print(f"HTTP version: {response.http_version}")

Async

import httpx
import asyncio

async def fetch(url):
    async with httpx.AsyncClient(http2=True) as client:
        response = await client.get(url, timeout=30)
        return response.text

html = asyncio.run(fetch("https://books.toscrape.com/"))
print(f"Page length: {len(html)}")

POST and JSON

import httpx

with httpx.Client() as client:
    # Form POST
    response = client.post("https://example.com/search",
                           data={"query": "web scraping", "page": "1"})

    # JSON POST
    response = client.post("https://api.example.com/search",
                           json={"query": "web scraping"})

    # JSON response
    data = response.json()
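
HTTPX also builds query strings for you via the params argument, which keeps URLs readable:

import httpx

with httpx.Client() as client:
    # The query string (?query=...&page=1) is URL-encoded automatically
    response = client.get("https://example.com/search",
                          params={"query": "web scraping", "page": "1"})
    print(response.url)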

Parsel Basics

from parsel import Selector

html = """
<html>
<body>
    <div class="products">
        <div class="product" data-id="1">
            <h2>Laptop</h2>
            <span class="price">$999.99</span>
            <p class="desc">High-performance laptop</p>
        </div>
        <div class="product" data-id="2">
            <h2>Tablet</h2>
            <span class="price">$499.99</span>
            <p class="desc">10-inch display</p>
        </div>
    </div>
</body>
</html>
"""

sel = Selector(text=html)

# CSS selectors
titles = sel.css("h2::text").getall()
print(titles)  # ['Laptop', 'Tablet']

# XPath
prices = sel.xpath("//span[@class='price']/text()").getall()
print(prices)  # ['$999.99', '$499.99']

# Attributes
ids = sel.css(".product::attr(data-id)").getall()
print(ids)  # ['1', '2']

# Single result
first_title = sel.css("h2::text").get()
print(first_title)  # 'Laptop'

# With default value
missing = sel.css(".nonexistent::text").get(default="N/A")
print(missing)  # 'N/A'
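
Selectors nest, so you can iterate matches and query relative to each one; a matched element's attributes are also available through the .attrib mapping:

# Nested selection plus the .attrib shortcut
for product in sel.css(".product"):
    print(product.attrib["data-id"], product.css("h2::text").get())
# 1 Laptop
# 2 Tablet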

Combining HTTPX + Parsel

import httpx
from parsel import Selector

with httpx.Client(
    http2=True,
    headers={"User-Agent": "Mozilla/5.0"},
    timeout=30,
) as client:
    response = client.get("https://books.toscrape.com/")
    sel = Selector(text=response.text)

    books = []
    for product in sel.css("article.product_pod"):
        books.append({
            "title": product.css("h3 a::attr(title)").get(),
            "price": product.css(".price_color::text").get(),
            "rating": product.css("p.star-rating::attr(class)").get("").replace("star-rating ", ""),
            "url": response.url.join(product.css("h3 a::attr(href)").get()),
        })

    for book in books[:5]:
        print(f"{book['title']}: {book['price']}")

Async Scraping

Concurrent Page Scraping

import httpx
import asyncio
from parsel import Selector

async def scrape_page(client, url):
    response = await client.get(url)
    sel = Selector(text=response.text)

    return [
        {
            "title": p.css("h3 a::attr(title)").get(),
            "price": p.css(".price_color::text").get(),
        }
        for p in sel.css("article.product_pod")
    ]

async def main():
    urls = [
        f"https://books.toscrape.com/catalogue/page-{i}.html"
        for i in range(1, 51)
    ]

    async with httpx.AsyncClient(
        http2=True,
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=30,
    ) as client:
        # Process in batches of 10
        all_books = []
        for i in range(0, len(urls), 10):
            batch = urls[i:i + 10]
            tasks = [scrape_page(client, url) for url in batch]
            results = await asyncio.gather(*tasks, return_exceptions=True)

            for result in results:
                if isinstance(result, list):
                    all_books.extend(result)

            print(f"Batch {i // 10 + 1}: {len(all_books)} books total")
            await asyncio.sleep(0.5)  # Rate limit between batches

    print(f"Total: {len(all_books)} books")

asyncio.run(main())

With Semaphore for Rate Limiting

import httpx
import asyncio
from parsel import Selector

async def scrape_with_limit(client, url, semaphore):
    async with semaphore:
        response = await client.get(url)
        sel = Selector(text=response.text)
        await asyncio.sleep(0.5)  # Per-request delay

        return {
            "url": str(response.url),
            "title": sel.css("title::text").get(),
            "links": len(sel.css("a[href]").getall()),
        }

async def main():
    urls = [f"https://books.toscrape.com/catalogue/page-{i}.html" for i in range(1, 51)]
    semaphore = asyncio.Semaphore(5)  # Max 5 concurrent requests

    async with httpx.AsyncClient(http2=True, timeout=30) as client:
        tasks = [scrape_with_limit(client, url, semaphore) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)

    pages = [r for r in results if isinstance(r, dict)]
    print(f"Scraped {len(pages)} pages")

asyncio.run(main())

Advanced Selectors

CSS Pseudo-Selectors in Parsel

from parsel import Selector

sel = Selector(text=html)  # reuses the sample `html` from Parsel Basics

# Text content
sel.css("h2::text").get()

# Attribute value
sel.css("a::attr(href)").get()

# All text (including children)
sel.css(".product *::text").getall()

XPath Power Features

# Continues with `sel` from the sample above.
# Conditional selection: products whose numeric price (after the "$") is under 500
sel.xpath("//div[@class='product'][.//span[@class='price'][number(substring(text(),2)) < 500]]")

# Following sibling
sel.xpath("//h2[text()='Laptop']/following-sibling::span[@class='price']/text()").get()

# Contains
sel.xpath("//div[contains(@class, 'product')]")

# Position
sel.xpath("//div[@class='product'][1]")  # First product
sel.xpath("//div[@class='product'][last()]")  # Last product

# Normalize whitespace
sel.xpath("normalize-space(//p[@class='desc'])")

Regex Extraction

# Extract patterns from text (".product-page" is an illustrative selector)
sel.css(".product-page").re(r"ISBN:\s*([\d-]+)")
sel.css(".product-page").re_first(r"\$(\d+\.\d{2})")

# Extract numbers
prices = sel.css(".price::text").re(r"[\d.]+")  # ['999.99', '499.99']

Proxy Integration

Synchronous

import httpx
from parsel import Selector

proxy = "http://user:pass@proxy.example.com:8080"

with httpx.Client(proxy=proxy, timeout=30) as client:
    response = client.get("https://httpbin.org/ip")
    print(response.json())

Async

# Inside an async function, reusing `proxy` from above:
async with httpx.AsyncClient(proxy=proxy, timeout=30) as client:
    response = await client.get("https://httpbin.org/ip")
    print(response.json())

Rotating Proxies

import httpx
import random
import asyncio
from parsel import Selector

proxies = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]

async def scrape_with_rotation(url):
    proxy = random.choice(proxies)
    async with httpx.AsyncClient(proxy=proxy, timeout=30) as client:
        response = await client.get(url)
        sel = Selector(text=response.text)
        return sel.css("title::text").get()
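
To drive this over several pages (with the placeholder proxy URLs swapped for working ones), gather the coroutines. Note the trade-off: a fresh client per request enables per-request rotation but gives up connection reuse.

async def main():
    urls = [f"https://books.toscrape.com/catalogue/page-{i}.html" for i in range(1, 4)]
    titles = await asyncio.gather(*(scrape_with_rotation(url) for url in urls))
    print(titles)

asyncio.run(main())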

For proxy types, see our web scraping proxy guide and proxy glossary.

Error Handling and Retries

import httpx
import asyncio
from parsel import Selector

async def fetch_with_retry(client, url, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = await client.get(url)
            response.raise_for_status()
            return response
        except httpx.HTTPStatusError as e:
            if e.response.status_code == 429:
                wait = 2 ** attempt
                print(f"Rate limited, waiting {wait}s...")
                await asyncio.sleep(wait)
            elif e.response.status_code >= 500:
                await asyncio.sleep(1)
            else:
                raise
        except (httpx.ConnectTimeout, httpx.ReadTimeout):
            await asyncio.sleep(1)

    raise Exception(f"Failed after {max_retries} attempts: {url}")
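
fetch_with_retry drops in wherever you would otherwise call client.get directly:

async def main():
    async with httpx.AsyncClient(http2=True, timeout=30) as client:
        response = await fetch_with_retry(client, "https://books.toscrape.com/")
        sel = Selector(text=response.text)
        print(sel.css("title::text").get())

asyncio.run(main())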

Complete Example

import httpx
import asyncio
import json
from parsel import Selector
from dataclasses import dataclass, asdict

@dataclass
class Book:
    title: str
    price: float
    rating: str
    availability: str
    url: str

async def scrape_bookstore():
    all_books = []

    async with httpx.AsyncClient(
        http2=True,
        headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
        timeout=30,
        follow_redirects=True,
    ) as client:
        for page in range(1, 51):
            url = f"https://books.toscrape.com/catalogue/page-{page}.html"

            try:
                response = await client.get(url)
                if response.status_code != 200:
                    break

                sel = Selector(text=response.text)
                products = sel.css("article.product_pod")

                if not products:
                    break

                for p in products:
                    price_text = p.css(".price_color::text").get("")
                    price = float(price_text.replace("£", "").strip()) if price_text else 0.0
                    availability_parts = p.css(".availability::text").getall()

                    book = Book(
                        title=p.css("h3 a::attr(title)").get(""),
                        price=price,
                        rating=p.css("p.star-rating::attr(class)").get("").replace("star-rating ", ""),
                        availability=availability_parts[-1].strip() if availability_parts else "",
                        url=str(response.url.join(p.css("h3 a::attr(href)").get(""))),
                    )
                    all_books.append(book)

                print(f"Page {page}: {len(products)} books ({len(all_books)} total)")
                await asyncio.sleep(0.5)

            except Exception as e:
                print(f"Error on page {page}: {e}")
                break

    # Save
    with open("books.json", "w") as f:
        json.dump([asdict(b) for b in all_books], f, indent=2)

    print(f"\nTotal: {len(all_books)} books saved")
    return all_books

asyncio.run(scrape_bookstore())

FAQ

Why use HTTPX instead of Requests?

HTTPX provides everything Requests does plus async support, HTTP/2, and better connection pooling. If you are starting a new project, HTTPX is the better choice. Migration from Requests is trivial since the APIs are nearly identical.

Why use Parsel instead of BeautifulSoup?

Parsel is 2-5x faster than BeautifulSoup (it uses lxml under the hood), supports both CSS and XPath selectors, includes regex extraction, and is the same selector engine used by Scrapy. BeautifulSoup is more forgiving with severely broken HTML.
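
A rough micro-benchmark sketch, assuming beautifulsoup4 is also installed (absolute numbers vary by page and machine):

import timeit

import httpx
from bs4 import BeautifulSoup
from parsel import Selector

html = httpx.get("https://books.toscrape.com/").text

bs4_time = timeit.timeit(
    lambda: [a["title"] for a in BeautifulSoup(html, "html.parser").select("article.product_pod h3 a")],
    number=50,
)
parsel_time = timeit.timeit(
    lambda: Selector(text=html).css("article.product_pod h3 a::attr(title)").getall(),
    number=50,
)
print(f"BeautifulSoup: {bs4_time:.2f}s, Parsel: {parsel_time:.2f}s")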

Can I use HTTPX + Parsel for JavaScript-heavy sites?

No. HTTPX only fetches raw HTML — it does not execute JavaScript. For JS-rendered content, first check the Network tab for API endpoints. If none exist, use Playwright or Scrapy + Playwright.
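
When you do find an endpoint, HTTPX can query it directly; a sketch against a hypothetical JSON API:

import httpx

# Hypothetical endpoint discovered in the browser's Network tab
response = httpx.get(
    "https://example.com/api/products",
    params={"page": 1},
    headers={"User-Agent": "Mozilla/5.0"},
)
for item in response.json().get("products", []):
    print(item.get("title"))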

How does this compare to Scrapy?

HTTPX + Parsel is lighter and more flexible — you write regular Python code without a framework’s project structure. Scrapy provides built-in concurrency management, pipelines, middleware, and URL deduplication. Use HTTPX + Parsel for medium projects (100-10,000 pages) and Scrapy for larger ones.


Explore related stacks: aiohttp + BeautifulSoup, Python scraping libraries. For proxy setup, see our web scraping proxy guide.
