aiohttp + BeautifulSoup: Async Python Scraping

The aiohttp + BeautifulSoup stack is Python's high-performance scraping combination for static pages. aiohttp provides asynchronous HTTP requests that can process hundreds of pages concurrently on Python's asyncio event loop, while BeautifulSoup handles the parsing. Together, they can deliver roughly 10-50x the throughput of synchronous Requests + BeautifulSoup, without the overhead of a full framework like Scrapy.

This tutorial covers async fundamentals, concurrent scraping patterns, rate limiting, error handling, and proxy integration.

Table of Contents

  - Why aiohttp + BeautifulSoup
  - Installation
  - Async Fundamentals
  - Basic Async Scraping
  - Concurrent Scraping
  - Rate Limiting
  - Error Handling and Retries
  - Proxy Integration
  - Session Management
  - Complete Production Scraper
  - FAQ

Why aiohttp + BeautifulSoup

Feature | Requests + BS4 | aiohttp + BS4 | Scrapy
Concurrency | Manual threads | Native async | Built-in
Speed (100 pages) | ~100s | ~10s | ~8s
Memory | Low | Low | Medium
Code complexity | Simple | Medium | Complex
Learning curve | Easy | Medium | Hard
Project structure | Script | Script | Framework

aiohttp + BeautifulSoup sits between the simplicity of Requests and the power of Scrapy. It is ideal for medium projects (100-10,000 pages) where you want high concurrency without framework overhead.

Installation

pip install aiohttp beautifulsoup4 lxml aiohttp-socks

Async Fundamentals

import asyncio
import aiohttp
from bs4 import BeautifulSoup

# Basic async request
async def fetch_page(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            html = await response.text()
            return html

# Run it
html = asyncio.run(fetch_page("https://books.toscrape.com/"))
print(f"Page length: {len(html)}")

Key Concepts

  1. async def — Defines a coroutine (async function)
  2. await — Pauses execution until the async operation completes
  3. async with — Async context manager for sessions and responses
  4. asyncio.run() — Runs the async event loop
  5. asyncio.gather() — Runs multiple coroutines concurrently
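
Putting these pieces together, here is a minimal sketch that fetches two pages from books.toscrape.com (the demo site used throughout this tutorial) concurrently with asyncio.gather():

import asyncio
import aiohttp

async def fetch(session, url):
    # await suspends this coroutine while the response arrives,
    # letting the event loop run the other fetch in the meantime
    async with session.get(url) as response:
        return await response.text()

async def demo():
    urls = [
        "https://books.toscrape.com/",
        "https://books.toscrape.com/catalogue/page-2.html",
    ]
    async with aiohttp.ClientSession() as session:
        # gather() schedules both coroutines at once and waits for all results
        pages = await asyncio.gather(*(fetch(session, url) for url in urls))
    for url, html in zip(urls, pages):
        print(f"{url}: {len(html)} characters")

asyncio.run(demo())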

Basic Async Scraping

import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def scrape_books(url):
    async with aiohttp.ClientSession(
        headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
    ) as session:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as response:
            html = await response.text()

    soup = BeautifulSoup(html, "lxml")
    books = []

    for article in soup.select("article.product_pod"):
        books.append({
            "title": article.select_one("h3 a")["title"],
            "price": article.select_one(".price_color").text,
        })

    return books

books = asyncio.run(scrape_books("https://books.toscrape.com/"))
for book in books[:5]:
    print(f"{book['title']}: {book['price']}")

Concurrent Scraping

Basic Concurrent Fetching

import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def fetch_and_parse(session, url):
    try:
        async with session.get(url) as response:
            html = await response.text()
            soup = BeautifulSoup(html, "lxml")

            return [
                {
                    "title": a.select_one("h3 a")["title"],
                    "price": a.select_one(".price_color").text,
                }
                for a in soup.select("article.product_pod")
            ]
    except Exception as e:
        print(f"Error on {url}: {e}")
        return []

async def main():
    urls = [
        f"https://books.toscrape.com/catalogue/page-{i}.html"
        for i in range(1, 51)
    ]

    async with aiohttp.ClientSession(
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=aiohttp.ClientTimeout(total=30),
    ) as session:
        tasks = [fetch_and_parse(session, url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)

    all_books = []
    for result in results:
        if isinstance(result, list):
            all_books.extend(result)

    print(f"Scraped {len(all_books)} books from {len(urls)} pages")

asyncio.run(main())

Controlled Concurrency with Semaphore

import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def fetch_with_limit(session, url, semaphore):
    async with semaphore:
        async with session.get(url) as response:
            html = await response.text()
            await asyncio.sleep(0.5)  # Per-request delay
            return url, html

async def main():
    urls = [
        f"https://books.toscrape.com/catalogue/page-{i}.html"
        for i in range(1, 51)
    ]

    semaphore = asyncio.Semaphore(5)  # Max 5 concurrent requests
    all_books = []

    async with aiohttp.ClientSession(
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=aiohttp.ClientTimeout(total=30),
    ) as session:
        tasks = [fetch_with_limit(session, url, semaphore) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)

    for result in results:
        if isinstance(result, tuple):
            url, html = result
            soup = BeautifulSoup(html, "lxml")
            for a in soup.select("article.product_pod"):
                all_books.append({
                    "title": a.select_one("h3 a")["title"],
                    "price": a.select_one(".price_color").text,
                })

    print(f"Total: {len(all_books)} books")

asyncio.run(main())

Batch Processing

async def scrape_in_batches(urls, batch_size=10, delay=1.0):
    all_results = []

    async with aiohttp.ClientSession(
        headers={"User-Agent": "Mozilla/5.0"},
    ) as session:
        for i in range(0, len(urls), batch_size):
            batch = urls[i:i + batch_size]
            tasks = [fetch_and_parse(session, url) for url in batch]
            results = await asyncio.gather(*tasks, return_exceptions=True)

            for result in results:
                if isinstance(result, list):
                    all_results.extend(result)

            print(f"Batch {i // batch_size + 1}: {len(all_results)} total items")
            await asyncio.sleep(delay)

    return all_results
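
A brief usage sketch, assuming the imports and the fetch_and_parse coroutine from the concurrent example above are in scope:

async def run_batches():
    urls = [
        f"https://books.toscrape.com/catalogue/page-{i}.html"
        for i in range(1, 51)
    ]
    # 5 batches of 10 pages each, with a 1-second pause between batches
    books = await scrape_in_batches(urls, batch_size=10, delay=1.0)
    print(f"Scraped {len(books)} books")

asyncio.run(run_batches())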

Rate Limiting

Per-Domain Rate Limiter

import asyncio
import time
from collections import defaultdict

class RateLimiter:
    def __init__(self, requests_per_second=2):
        self.delay = 1.0 / requests_per_second
        self.last_request = defaultdict(float)
        self.lock = asyncio.Lock()

    async def acquire(self, domain):
        async with self.lock:
            elapsed = time.time() - self.last_request[domain]
            if elapsed < self.delay:
                await asyncio.sleep(self.delay - elapsed)
            self.last_request[domain] = time.time()


# Usage
rate_limiter = RateLimiter(requests_per_second=2)

async def fetch_rate_limited(session, url):
    from urllib.parse import urlparse
    domain = urlparse(url).netloc

    await rate_limiter.acquire(domain)

    async with session.get(url) as response:
        return await response.text()

Error Handling and Retries

import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def fetch_with_retry(session, url, max_retries=3):
    for attempt in range(1, max_retries + 1):
        try:
            async with session.get(url) as response:
                if response.status == 200:
                    return await response.text()
                elif response.status == 429:
                    wait = 2 ** attempt
                    print(f"Rate limited on {url}, waiting {wait}s...")
                    await asyncio.sleep(wait)
                elif response.status >= 500:
                    await asyncio.sleep(1)
                else:
                    print(f"HTTP {response.status}: {url}")
                    return None

        except aiohttp.ClientError as e:
            if attempt == max_retries:
                print(f"Failed after {max_retries} attempts: {url} — {e}")
                return None
            await asyncio.sleep(1)

        except asyncio.TimeoutError:
            if attempt == max_retries:
                print(f"Timeout after {max_retries} attempts: {url}")
                return None
            await asyncio.sleep(1)

    return None
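
To plug fetch_with_retry into the concurrent pattern shown earlier, wrap it in asyncio.gather() and filter out the failures. A short sketch, reusing the same books.toscrape.com URLs:

async def main():
    urls = [
        f"https://books.toscrape.com/catalogue/page-{i}.html"
        for i in range(1, 11)
    ]
    async with aiohttp.ClientSession(
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=aiohttp.ClientTimeout(total=30),
    ) as session:
        pages = await asyncio.gather(*(fetch_with_retry(session, url) for url in urls))

    # fetch_with_retry returns None for pages that exhausted their retries
    succeeded = [p for p in pages if p is not None]
    print(f"Fetched {len(succeeded)}/{len(urls)} pages")

asyncio.run(main())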

Proxy Integration

Single Proxy

import aiohttp

async def fetch_with_proxy(url):
    proxy = "http://user:pass@proxy.example.com:8080"

    async with aiohttp.ClientSession() as session:
        async with session.get(url, proxy=proxy) as response:
            return await response.text()

Rotating Proxies

import random

import aiohttp

proxies = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]

async def fetch_rotating(session, url):
    proxy = random.choice(proxies)
    try:
        async with session.get(url, proxy=proxy, timeout=aiohttp.ClientTimeout(total=30)) as response:
            if response.status == 200:
                return await response.text()
    except Exception:
        pass
    return None

SOCKS Proxy

pip install aiohttp-socks
import asyncio
import aiohttp
from aiohttp_socks import ProxyConnector

async def fetch_via_socks(url):
    connector = ProxyConnector.from_url("socks5://user:pass@proxy.example.com:1080")
    async with aiohttp.ClientSession(connector=connector) as session:
        async with session.get(url) as response:
            print(await response.json())

asyncio.run(fetch_via_socks("https://httpbin.org/ip"))

For proxy types, see our web scraping proxy guide and proxy glossary.

Session Management

Cookie Handling

import aiohttp
from bs4 import BeautifulSoup

async def login_and_scrape():
    jar = aiohttp.CookieJar()

    async with aiohttp.ClientSession(cookie_jar=jar) as session:
        # Login
        await session.post("https://example.com/login", data={
            "username": "user",
            "password": "pass",
        })

        # Cookies are maintained
        async with session.get("https://example.com/dashboard") as response:
            html = await response.text()
            soup = BeautifulSoup(html, "lxml")
            print(soup.select_one("h1").text)

Custom Headers per Request

async def fetch_with_custom_headers(url):
    headers = {
        "User-Agent": "Mozilla/5.0",
        "Referer": "https://example.com/",
        "Accept-Language": "en-US,en;q=0.9",
    }
    async with aiohttp.ClientSession() as session:
        # Headers passed to get() apply to this request only,
        # overriding any session-level defaults
        async with session.get(url, headers=headers) as response:
            return await response.text()

Complete Production Scraper

import asyncio
import aiohttp
import json
import time
import logging
from bs4 import BeautifulSoup
from dataclasses import dataclass, asdict
from typing import List

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@dataclass
class Book:
    title: str
    price: float
    rating: str
    available: bool
    page: int

class AsyncBookScraper:
    def __init__(self, concurrency=5, delay=0.5, max_retries=3):
        self.concurrency = concurrency
        self.delay = delay
        self.max_retries = max_retries
        self.books: List[Book] = []
        self.errors = 0

    async def fetch(self, session, url, semaphore):
        async with semaphore:
            for attempt in range(1, self.max_retries + 1):
                try:
                    async with session.get(url) as response:
                        if response.status == 200:
                            html = await response.text()
                            await asyncio.sleep(self.delay)
                            return html
                        elif response.status == 429:
                            await asyncio.sleep(2 ** attempt)
                except (aiohttp.ClientError, asyncio.TimeoutError):
                    if attempt == self.max_retries:
                        self.errors += 1
                        logger.warning(f"Failed: {url}")
                    await asyncio.sleep(1)
            return None

    def parse_page(self, html: str, page_num: int) -> List[Book]:
        soup = BeautifulSoup(html, "lxml")
        books = []

        for article in soup.select("article.product_pod"):
            price_text = article.select_one(".price_color").text
            price = float(price_text.replace("£", "").strip())

            rating_class = article.select_one("p.star-rating")["class"]
            rating = [c for c in rating_class if c != "star-rating"][0]

            books.append(Book(
                title=article.select_one("h3 a")["title"],
                price=price,
                rating=rating,
                available="In stock" in article.select_one(".availability").text,
                page=page_num,
            ))

        return books

    async def scrape(self, total_pages=50):
        start_time = time.time()
        semaphore = asyncio.Semaphore(self.concurrency)

        urls = {
            f"https://books.toscrape.com/catalogue/page-{i}.html": i
            for i in range(1, total_pages + 1)
        }

        async with aiohttp.ClientSession(
            headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
            timeout=aiohttp.ClientTimeout(total=30),
        ) as session:
            tasks = {
                url: self.fetch(session, url, semaphore)
                for url in urls
            }

            results = await asyncio.gather(*tasks.values(), return_exceptions=True)

            for url, result in zip(tasks.keys(), results):
                if isinstance(result, str):
                    page_books = self.parse_page(result, urls[url])
                    self.books.extend(page_books)

        elapsed = time.time() - start_time
        logger.info(
            f"Scraped {len(self.books)} books from {total_pages} pages "
            f"in {elapsed:.1f}s ({self.errors} errors)"
        )

    def save(self, filename="books.json"):
        with open(filename, "w") as f:
            json.dump([asdict(b) for b in self.books], f, indent=2)
        logger.info(f"Saved to {filename}")

    def stats(self):
        if not self.books:
            return
        avg = sum(b.price for b in self.books) / len(self.books)
        logger.info(f"Books: {len(self.books)}, Avg price: £{avg:.2f}")


async def main():
    scraper = AsyncBookScraper(concurrency=5, delay=0.3)
    await scraper.scrape(total_pages=50)
    scraper.stats()
    scraper.save()

asyncio.run(main())

FAQ

When should I use aiohttp instead of Requests?

Use aiohttp when you need to scrape many pages concurrently. For scraping under 50 pages sequentially, Requests is simpler. For 50+ pages where speed matters, aiohttp’s async concurrency provides 5-20x faster completion times.

How does aiohttp compare to HTTPX?

HTTPX offers both sync and async APIs, HTTP/2 support, and a Requests-compatible interface. aiohttp is async-only but generally achieves higher throughput at very high concurrency. For most projects, HTTPX is the better choice thanks to its simpler API; reach for aiohttp when you need maximum async performance.

Can I use lxml instead of BeautifulSoup with aiohttp?

Yes. Replace BeautifulSoup parsing with lxml or Parsel for 2-5x faster parsing. See our lxml vs BeautifulSoup comparison and HTTPX + Parsel guide.
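
As a rough sketch of the swap, targeting the same books.toscrape.com markup used above, the BeautifulSoup parsing step might become:

from lxml import html as lxml_html

def parse_books(html: str):
    # Parse the raw HTML string fetched with aiohttp, exactly as before
    tree = lxml_html.fromstring(html)
    books = []
    for article in tree.xpath("//article[@class='product_pod']"):
        books.append({
            "title": article.xpath(".//h3/a/@title")[0],
            "price": article.xpath(".//p[@class='price_color']/text()")[0],
        })
    return books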

How many concurrent requests can aiohttp handle?

aiohttp can handle thousands of concurrent connections, but you should limit concurrency with a semaphore to avoid overwhelming target servers. A practical limit is 5-20 concurrent requests per domain, with appropriate delays.

Should I use aiohttp + BeautifulSoup or Scrapy?

For medium projects (100-10,000 pages), aiohttp + BeautifulSoup gives you concurrency without framework complexity. For larger projects (10,000+ pages), Scrapy provides built-in pipelines, middleware, and crawl management.


Explore related stacks: HTTPX + Parsel, Python scraping libraries. For proxy setup, see our web scraping proxy guide.
