aiohttp + BeautifulSoup: Async Python Scraping

The aiohttp + BeautifulSoup stack is Python's high-performance scraping combination for static pages. aiohttp provides asynchronous HTTP requests that can process hundreds of pages concurrently on Python's asyncio event loop, while BeautifulSoup handles the parsing. Together, they can deliver roughly 10-50x the throughput of synchronous Requests + BeautifulSoup, without the overhead of a full framework like Scrapy.

This tutorial covers async fundamentals, concurrent scraping patterns, rate limiting, error handling, and proxy integration.

Table of Contents

  - Why aiohttp + BeautifulSoup
  - Installation
  - Async Fundamentals
  - Basic Async Scraping
  - Concurrent Scraping
  - Rate Limiting
  - Error Handling and Retries
  - Proxy Integration
  - Session Management
  - Complete Production Scraper
  - FAQ

Why aiohttp + BeautifulSoup

Feature | Requests + BS4 | aiohttp + BS4 | Scrapy
Concurrency | Manual threads | Native async | Built-in
Speed (100 pages) | ~100s | ~10s | ~8s
Memory | Low | Low | Medium
Code complexity | Simple | Medium | Complex
Learning curve | Easy | Medium | Hard
Project structure | Script | Script | Framework

aiohttp + BeautifulSoup sits between the simplicity of Requests and the power of Scrapy. It is ideal for medium projects (100-10,000 pages) where you want high concurrency without framework overhead.

Installation

pip install aiohttp beautifulsoup4 lxml aiohttp-socks

Async Fundamentals

import asyncio
import aiohttp
from bs4 import BeautifulSoup

# Basic async request
async def fetch_page(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            html = await response.text()
            return html

# Run it
html = asyncio.run(fetch_page("https://books.toscrape.com/"))
print(f"Page length: {len(html)}")

Key Concepts

  1. async def — Defines a coroutine (async function)
  2. await — Pauses execution until the async operation completes
  3. async with — Async context manager for sessions and responses
  4. asyncio.run() — Runs the async event loop
  5. asyncio.gather() — Runs multiple coroutines concurrently
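
Putting these pieces together, here is a minimal sketch that fetches two pages from books.toscrape.com (the demo site used throughout this tutorial) concurrently with asyncio.gather():

import asyncio
import aiohttp

async def fetch(session, url):
    # await suspends this coroutine while the response arrives,
    # letting the event loop run the other fetch in the meantime
    async with session.get(url) as response:
        return await response.text()

async def demo():
    urls = [
        "https://books.toscrape.com/",
        "https://books.toscrape.com/catalogue/page-2.html",
    ]
    async with aiohttp.ClientSession() as session:
        # gather() schedules both coroutines at once and waits for all results
        pages = await asyncio.gather(*(fetch(session, url) for url in urls))
    for url, html in zip(urls, pages):
        print(f"{url}: {len(html)} characters")

asyncio.run(demo())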

Basic Async Scraping

import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def scrape_books(url):
    async with aiohttp.ClientSession(
        headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
    ) as session:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as response:
            html = await response.text()

    soup = BeautifulSoup(html, "lxml")
    books = []

    for article in soup.select("article.product_pod"):
        books.append({
            "title": article.select_one("h3 a")["title"],
            "price": article.select_one(".price_color").text,
        })

    return books

books = asyncio.run(scrape_books("https://books.toscrape.com/"))
for book in books[:5]:
    print(f"{book['title']}: {book['price']}")

Concurrent Scraping

Basic Concurrent Fetching

import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def fetch_and_parse(session, url):
    try:
        async with session.get(url) as response:
            html = await response.text()
            soup = BeautifulSoup(html, "lxml")

            return [
                {
                    "title": a.select_one("h3 a")["title"],
                    "price": a.select_one(".price_color").text,
                }
                for a in soup.select("article.product_pod")
            ]
    except Exception as e:
        print(f"Error on {url}: {e}")
        return []

async def main():
    urls = [
        f"https://books.toscrape.com/catalogue/page-{i}.html"
        for i in range(1, 51)
    ]

    async with aiohttp.ClientSession(
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=aiohttp.ClientTimeout(total=30),
    ) as session:
        tasks = [fetch_and_parse(session, url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)

    all_books = []
    for result in results:
        if isinstance(result, list):
            all_books.extend(result)

    print(f"Scraped {len(all_books)} books from {len(urls)} pages")

asyncio.run(main())

Controlled Concurrency with Semaphore

import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def fetch_with_limit(session, url, semaphore):
    async with semaphore:
        async with session.get(url) as response:
            html = await response.text()
            await asyncio.sleep(0.5)  # Per-request delay
            return url, html

async def main():
    urls = [
        f"https://books.toscrape.com/catalogue/page-{i}.html"
        for i in range(1, 51)
    ]

    semaphore = asyncio.Semaphore(5)  # Max 5 concurrent requests
    all_books = []

    async with aiohttp.ClientSession(
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=aiohttp.ClientTimeout(total=30),
    ) as session:
        tasks = [fetch_with_limit(session, url, semaphore) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)

    for result in results:
        if isinstance(result, tuple):
            url, html = result
            soup = BeautifulSoup(html, "lxml")
            for a in soup.select("article.product_pod"):
                all_books.append({
                    "title": a.select_one("h3 a")["title"],
                    "price": a.select_one(".price_color").text,
                })

    print(f"Total: {len(all_books)} books")

asyncio.run(main())

Batch Processing

async def scrape_in_batches(urls, batch_size=10, delay=1.0):
    all_results = []

    async with aiohttp.ClientSession(
        headers={"User-Agent": "Mozilla/5.0"},
    ) as session:
        for i in range(0, len(urls), batch_size):
            batch = urls[i:i + batch_size]
            tasks = [fetch_and_parse(session, url) for url in batch]
            results = await asyncio.gather(*tasks, return_exceptions=True)

            for result in results:
                if isinstance(result, list):
                    all_results.extend(result)

            print(f"Batch {i // batch_size + 1}: {len(all_results)} total items")
            await asyncio.sleep(delay)

    return all_results
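
A brief usage sketch, assuming the imports and the fetch_and_parse coroutine from the concurrent example above are in scope:

async def run_batches():
    urls = [
        f"https://books.toscrape.com/catalogue/page-{i}.html"
        for i in range(1, 51)
    ]
    # 5 batches of 10 pages each, with a 1-second pause between batches
    books = await scrape_in_batches(urls, batch_size=10, delay=1.0)
    print(f"Scraped {len(books)} books")

asyncio.run(run_batches())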

Rate Limiting

Per-Domain Rate Limiter

import asyncio
import time
from collections import defaultdict

class RateLimiter:
    def __init__(self, requests_per_second=2):
        self.delay = 1.0 / requests_per_second
        self.last_request = defaultdict(float)
        self.lock = asyncio.Lock()

    async def acquire(self, domain):
        async with self.lock:
            elapsed = time.time() - self.last_request[domain]
            if elapsed < self.delay:
                await asyncio.sleep(self.delay - elapsed)
            self.last_request[domain] = time.time()


# Usage
rate_limiter = RateLimiter(requests_per_second=2)

async def fetch_rate_limited(session, url):
    from urllib.parse import urlparse
    domain = urlparse(url).netloc

    await rate_limiter.acquire(domain)

    async with session.get(url) as response:
        return await response.text()

Error Handling and Retries

import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def fetch_with_retry(session, url, max_retries=3):
    for attempt in range(1, max_retries + 1):
        try:
            async with session.get(url) as response:
                if response.status == 200:
                    return await response.text()
                elif response.status == 429:
                    wait = 2 ** attempt
                    print(f"Rate limited on {url}, waiting {wait}s...")
                    await asyncio.sleep(wait)
                elif response.status >= 500:
                    await asyncio.sleep(1)
                else:
                    print(f"HTTP {response.status}: {url}")
                    return None

        except aiohttp.ClientError as e:
            if attempt == max_retries:
                print(f"Failed after {max_retries} attempts: {url} — {e}")
                return None
            await asyncio.sleep(1)

        except asyncio.TimeoutError:
            if attempt == max_retries:
                print(f"Timeout after {max_retries} attempts: {url}")
                return None
            await asyncio.sleep(1)

    return None
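
To plug fetch_with_retry into the concurrent pattern shown earlier, wrap it in asyncio.gather() and filter out the failures. A short sketch, reusing the same books.toscrape.com URLs:

async def main():
    urls = [
        f"https://books.toscrape.com/catalogue/page-{i}.html"
        for i in range(1, 11)
    ]
    async with aiohttp.ClientSession(
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=aiohttp.ClientTimeout(total=30),
    ) as session:
        pages = await asyncio.gather(*(fetch_with_retry(session, url) for url in urls))

    # fetch_with_retry returns None for pages that exhausted their retries
    succeeded = [p for p in pages if p is not None]
    print(f"Fetched {len(succeeded)}/{len(urls)} pages")

asyncio.run(main())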

Proxy Integration

Single Proxy

import aiohttp

async def fetch_with_proxy(url):
    proxy = "http://user:pass@proxy.example.com:8080"

    async with aiohttp.ClientSession() as session:
        async with session.get(url, proxy=proxy) as response:
            return await response.text()

Rotating Proxies

import random

import aiohttp

proxies = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]

async def fetch_rotating(session, url):
    proxy = random.choice(proxies)
    try:
        async with session.get(url, proxy=proxy, timeout=aiohttp.ClientTimeout(total=30)) as response:
            if response.status == 200:
                return await response.text()
    except Exception:
        pass
    return None

SOCKS Proxy

pip install aiohttp-socks
import asyncio
import aiohttp
from aiohttp_socks import ProxyConnector

async def fetch_via_socks(url):
    connector = ProxyConnector.from_url("socks5://user:pass@proxy.example.com:1080")
    async with aiohttp.ClientSession(connector=connector) as session:
        async with session.get(url) as response:
            print(await response.json())

asyncio.run(fetch_via_socks("https://httpbin.org/ip"))

For proxy types, see our web scraping proxy guide and proxy glossary.

Session Management

Cookie Handling

import aiohttp
from bs4 import BeautifulSoup

async def login_and_scrape():
    jar = aiohttp.CookieJar()

    async with aiohttp.ClientSession(cookie_jar=jar) as session:
        # Login
        await session.post("https://example.com/login", data={
            "username": "user",
            "password": "pass",
        })

        # Cookies are maintained
        async with session.get("https://example.com/dashboard") as response:
            html = await response.text()
            soup = BeautifulSoup(html, "lxml")
            print(soup.select_one("h1").text)

Custom Headers per Request

async def fetch_with_custom_headers(url):
    headers = {
        "User-Agent": "Mozilla/5.0",
        "Referer": "https://example.com/",
        "Accept-Language": "en-US,en;q=0.9",
    }
    async with aiohttp.ClientSession() as session:
        # Headers passed to get() apply to this request only,
        # overriding any session-level defaults
        async with session.get(url, headers=headers) as response:
            return await response.text()

Complete Production Scraper

import asyncio
import aiohttp
import json
import time
import logging
from bs4 import BeautifulSoup
from dataclasses import dataclass, asdict
from typing import List

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@dataclass
class Book:
    title: str
    price: float
    rating: str
    available: bool
    page: int

class AsyncBookScraper:
    def __init__(self, concurrency=5, delay=0.5, max_retries=3):
        self.concurrency = concurrency
        self.delay = delay
        self.max_retries = max_retries
        self.books: List[Book] = []
        self.errors = 0

    async def fetch(self, session, url, semaphore):
        async with semaphore:
            for attempt in range(1, self.max_retries + 1):
                try:
                    async with session.get(url) as response:
                        if response.status == 200:
                            html = await response.text()
                            await asyncio.sleep(self.delay)
                            return html
                        elif response.status == 429:
                            await asyncio.sleep(2 ** attempt)
                except (aiohttp.ClientError, asyncio.TimeoutError):
                    if attempt == self.max_retries:
                        self.errors += 1
                        logger.warning(f"Failed: {url}")
                    await asyncio.sleep(1)
            return None

    def parse_page(self, html: str, page_num: int) -> List[Book]:
        soup = BeautifulSoup(html, "lxml")
        books = []

        for article in soup.select("article.product_pod"):
            price_text = article.select_one(".price_color").text
            price = float(price_text.replace("£", "").strip())

            rating_class = article.select_one("p.star-rating")["class"]
            rating = [c for c in rating_class if c != "star-rating"][0]

            books.append(Book(
                title=article.select_one("h3 a")["title"],
                price=price,
                rating=rating,
                available="In stock" in article.select_one(".availability").text,
                page=page_num,
            ))

        return books

    async def scrape(self, total_pages=50):
        start_time = time.time()
        semaphore = asyncio.Semaphore(self.concurrency)

        urls = {
            f"https://books.toscrape.com/catalogue/page-{i}.html": i
            for i in range(1, total_pages + 1)
        }

        async with aiohttp.ClientSession(
            headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
            timeout=aiohttp.ClientTimeout(total=30),
        ) as session:
            tasks = {
                url: self.fetch(session, url, semaphore)
                for url in urls
            }

            results = await asyncio.gather(*tasks.values(), return_exceptions=True)

            for url, result in zip(tasks.keys(), results):
                if isinstance(result, str):
                    page_books = self.parse_page(result, urls[url])
                    self.books.extend(page_books)

        elapsed = time.time() - start_time
        logger.info(
            f"Scraped {len(self.books)} books from {total_pages} pages "
            f"in {elapsed:.1f}s ({self.errors} errors)"
        )

    def save(self, filename="books.json"):
        with open(filename, "w") as f:
            json.dump([asdict(b) for b in self.books], f, indent=2)
        logger.info(f"Saved to {filename}")

    def stats(self):
        if not self.books:
            return
        avg = sum(b.price for b in self.books) / len(self.books)
        logger.info(f"Books: {len(self.books)}, Avg price: £{avg:.2f}")


async def main():
    scraper = AsyncBookScraper(concurrency=5, delay=0.3)
    await scraper.scrape(total_pages=50)
    scraper.stats()
    scraper.save()

asyncio.run(main())

FAQ

When should I use aiohttp instead of Requests?

Use aiohttp when you need to scrape many pages concurrently. For scraping under 50 pages sequentially, Requests is simpler. For 50+ pages where speed matters, aiohttp’s async concurrency provides 5-20x faster completion times.

How does aiohttp compare to HTTPX?

HTTPX offers both sync and async APIs, HTTP/2 support, and a Requests-compatible interface. aiohttp is async-only but generally achieves higher throughput at very high concurrency. For most projects, HTTPX is the better choice thanks to its simpler API; reach for aiohttp when you need maximum async performance.

Can I use lxml instead of BeautifulSoup with aiohttp?

Yes. Replace BeautifulSoup parsing with lxml or Parsel for 2-5x faster parsing. See our lxml vs BeautifulSoup comparison and HTTPX + Parsel guide.
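
As a rough sketch of the swap, targeting the same books.toscrape.com markup used above, the BeautifulSoup parsing step might become:

from lxml import html as lxml_html

def parse_books(html: str):
    # Parse the raw HTML string fetched with aiohttp, exactly as before
    tree = lxml_html.fromstring(html)
    books = []
    for article in tree.xpath("//article[@class='product_pod']"):
        books.append({
            "title": article.xpath(".//h3/a/@title")[0],
            "price": article.xpath(".//p[@class='price_color']/text()")[0],
        })
    return books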

How many concurrent requests can aiohttp handle?

aiohttp can handle thousands of concurrent connections, but you should limit concurrency with a semaphore to avoid overwhelming target servers. A practical limit is 5-20 concurrent requests per domain, with appropriate delays.

Should I use aiohttp + BeautifulSoup or Scrapy?

For medium projects (100-10,000 pages), aiohttp + BeautifulSoup gives you concurrency without framework complexity. For larger projects (10,000+ pages), Scrapy provides built-in pipelines, middleware, and crawl management.


Explore related stacks: HTTPX + Parsel, Python scraping libraries. For proxy setup, see our web scraping proxy guide.
