aiohttp + BeautifulSoup: Async Python Scraping
The aiohttp + BeautifulSoup stack is Python’s high-performance scraping combination for static pages. aiohttp provides asynchronous HTTP requests that can process hundreds of pages concurrently using Python’s asyncio event loop. BeautifulSoup handles the parsing. Together, they deliver 10-50x the throughput of synchronous Requests + BeautifulSoup — without the overhead of a full framework like Scrapy.
This tutorial covers async fundamentals, concurrent scraping patterns, rate limiting, error handling, and proxy integration.
Table of Contents
- Why aiohttp + BeautifulSoup
- Installation
- Async Fundamentals
- Basic Async Scraping
- Concurrent Scraping
- Rate Limiting
- Error Handling and Retries
- Proxy Integration
- Session Management
- Complete Production Scraper
- FAQ
Why aiohttp + BeautifulSoup
| Feature | Requests + BS4 | aiohttp + BS4 | Scrapy |
|---|---|---|---|
| Concurrency | Manual threads | Native async | Built-in |
| Speed (100 pages) | ~100s | ~10s | ~8s |
| Memory | Low | Low | Medium |
| Code complexity | Simple | Medium | Complex |
| Learning curve | Easy | Medium | Hard |
| Project structure | Script | Script | Framework |
aiohttp + BeautifulSoup sits between the simplicity of Requests and the power of Scrapy. It is ideal for medium projects (100-10,000 pages) where you want high concurrency without framework overhead.
Installation
pip install aiohttp beautifulsoup4 lxml aiohttp-socks
Async Fundamentals
import asyncio
import aiohttp
from bs4 import BeautifulSoup
# Basic async request
async def fetch_page(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            html = await response.text()
            return html
# Run it
html = asyncio.run(fetch_page("https://books.toscrape.com/"))
print(f"Page length: {len(html)}")Key Concepts
- async def — Defines a coroutine (async function)
- await — Pauses execution until the async operation completes
- async with — Async context manager for sessions and responses
- asyncio.run() — Runs the async event loop
- asyncio.gather() — Runs multiple coroutines concurrently
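asyncio.gather() is the piece that makes concurrency happen. As a minimal sketch (reusing the imports and the fetch_page coroutine defined above; later sections share one session across all requests instead of opening one per call), you can fan fetch_page out over several URLs like this:
# Fan out the fetch_page coroutine from above over several URLs;
# gather runs them concurrently and returns results in the same order
urls = [
    f"https://books.toscrape.com/catalogue/page-{i}.html"
    for i in range(1, 4)
]

async def fetch_all(urls):
    return await asyncio.gather(*(fetch_page(url) for url in urls))

pages = asyncio.run(fetch_all(urls))
print([len(html) for html in pages])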
Basic Async Scraping
import asyncio
import aiohttp
from bs4 import BeautifulSoup
async def scrape_books(url):
    async with aiohttp.ClientSession(
        headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
    ) as session:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as response:
            html = await response.text()
    soup = BeautifulSoup(html, "lxml")
    books = []
    for article in soup.select("article.product_pod"):
        books.append({
            "title": article.select_one("h3 a")["title"],
            "price": article.select_one(".price_color").text,
        })
    return books
books = asyncio.run(scrape_books("https://books.toscrape.com/"))
for book in books[:5]:
print(f"{book['title']}: {book['price']}")Concurrent Scraping
Basic Concurrent Fetching
import asyncio
import aiohttp
from bs4 import BeautifulSoup
async def fetch_and_parse(session, url):
    try:
        async with session.get(url) as response:
            html = await response.text()
            soup = BeautifulSoup(html, "lxml")
            return [
                {
                    "title": a.select_one("h3 a")["title"],
                    "price": a.select_one(".price_color").text,
                }
                for a in soup.select("article.product_pod")
            ]
    except Exception as e:
        print(f"Error on {url}: {e}")
        return []
async def main():
    urls = [
        f"https://books.toscrape.com/catalogue/page-{i}.html"
        for i in range(1, 51)
    ]
    async with aiohttp.ClientSession(
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=aiohttp.ClientTimeout(total=30),
    ) as session:
        tasks = [fetch_and_parse(session, url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)
    all_books = []
    for result in results:
        if isinstance(result, list):
            all_books.extend(result)
    print(f"Scraped {len(all_books)} books from {len(urls)} pages")
asyncio.run(main())
Controlled Concurrency with Semaphore
import asyncio
import aiohttp
from bs4 import BeautifulSoup
async def fetch_with_limit(session, url, semaphore):
    async with semaphore:
        async with session.get(url) as response:
            html = await response.text()
            await asyncio.sleep(0.5)  # Per-request delay
            return url, html
async def main():
    urls = [
        f"https://books.toscrape.com/catalogue/page-{i}.html"
        for i in range(1, 51)
    ]
    semaphore = asyncio.Semaphore(5)  # Max 5 concurrent requests
    all_books = []
    async with aiohttp.ClientSession(
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=aiohttp.ClientTimeout(total=30),
    ) as session:
        tasks = [fetch_with_limit(session, url, semaphore) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)
    for result in results:
        if isinstance(result, tuple):
            url, html = result
            soup = BeautifulSoup(html, "lxml")
            for a in soup.select("article.product_pod"):
                all_books.append({
                    "title": a.select_one("h3 a")["title"],
                    "price": a.select_one(".price_color").text,
                })
    print(f"Total: {len(all_books)} books")
asyncio.run(main())
Batch Processing
async def scrape_in_batches(urls, batch_size=10, delay=1.0):
    all_results = []
    async with aiohttp.ClientSession(
        headers={"User-Agent": "Mozilla/5.0"},
    ) as session:
        for i in range(0, len(urls), batch_size):
            batch = urls[i:i + batch_size]
            tasks = [fetch_and_parse(session, url) for url in batch]
            results = await asyncio.gather(*tasks, return_exceptions=True)
            for result in results:
                if isinstance(result, list):
                    all_results.extend(result)
            print(f"Batch {i // batch_size + 1}: {len(all_results)} total items")
            await asyncio.sleep(delay)
    return all_results
Rate Limiting
Per-Domain Rate Limiter
import asyncio
import time
from collections import defaultdict
class RateLimiter:
    def __init__(self, requests_per_second=2):
        self.delay = 1.0 / requests_per_second
        self.last_request = defaultdict(float)
        self.lock = asyncio.Lock()

    async def acquire(self, domain):
        async with self.lock:
            elapsed = time.time() - self.last_request[domain]
            if elapsed < self.delay:
                await asyncio.sleep(self.delay - elapsed)
            self.last_request[domain] = time.time()
# Usage
rate_limiter = RateLimiter(requests_per_second=2)
async def fetch_rate_limited(session, url):
    from urllib.parse import urlparse
    domain = urlparse(url).netloc
    await rate_limiter.acquire(domain)
    async with session.get(url) as response:
        return await response.text()
Error Handling and Retries
import asyncio
import aiohttp
from bs4 import BeautifulSoup
async def fetch_with_retry(session, url, max_retries=3):
    for attempt in range(1, max_retries + 1):
        try:
            async with session.get(url) as response:
                if response.status == 200:
                    return await response.text()
                elif response.status == 429:
                    wait = 2 ** attempt
                    print(f"Rate limited on {url}, waiting {wait}s...")
                    await asyncio.sleep(wait)
                elif response.status >= 500:
                    await asyncio.sleep(1)
                else:
                    print(f"HTTP {response.status}: {url}")
                    return None
        except aiohttp.ClientError as e:
            if attempt == max_retries:
                print(f"Failed after {max_retries} attempts: {url} — {e}")
                return None
            await asyncio.sleep(1)
        except asyncio.TimeoutError:
            if attempt == max_retries:
                print(f"Timeout after {max_retries} attempts: {url}")
                return None
            await asyncio.sleep(1)
    return None
Proxy Integration
Single Proxy
import aiohttp
async def fetch_with_proxy(url):
proxy = "http://user:pass@proxy.example.com:8080"
async with aiohttp.ClientSession() as session:
async with session.get(url, proxy=proxy) as response:
return await response.text()Rotating Proxies
import random
import aiohttp
proxies = [
"http://user:pass@proxy1.example.com:8080",
"http://user:pass@proxy2.example.com:8080",
"http://user:pass@proxy3.example.com:8080",
]
async def fetch_rotating(session, url):
    proxy = random.choice(proxies)
    try:
        async with session.get(url, proxy=proxy, timeout=aiohttp.ClientTimeout(total=30)) as response:
            if response.status == 200:
                return await response.text()
    except Exception:
        pass
    return None
SOCKS Proxy
pip install aiohttp-socks
import aiohttp
from aiohttp_socks import ProxyConnector
connector = ProxyConnector.from_url("socks5://user:pass@proxy.example.com:1080")
async with aiohttp.ClientSession(connector=connector) as session:
async with session.get("https://httpbin.org/ip") as response:
print(await response.json())For proxy types, see our web scraping proxy guide and proxy glossary.
Session Management
Cookie Handling
import aiohttp
from bs4 import BeautifulSoup
async def login_and_scrape():
    jar = aiohttp.CookieJar()
    async with aiohttp.ClientSession(cookie_jar=jar) as session:
        # Login
        await session.post("https://example.com/login", data={
            "username": "user",
            "password": "pass",
        })
        # Cookies are maintained
        async with session.get("https://example.com/dashboard") as response:
            html = await response.text()
            soup = BeautifulSoup(html, "lxml")
            print(soup.select_one("h1").text)
Custom Headers per Request
async with aiohttp.ClientSession() as session:
    headers = {
        "User-Agent": "Mozilla/5.0",
        "Referer": "https://example.com/",
        "Accept-Language": "en-US,en;q=0.9",
    }
    async with session.get(url, headers=headers) as response:
        html = await response.text()
Complete Production Scraper
import asyncio
import aiohttp
import json
import time
import logging
from bs4 import BeautifulSoup
from dataclasses import dataclass, asdict
from typing import List, Optional
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
@dataclass
class Book:
    title: str
    price: float
    rating: str
    available: bool
    page: int
class AsyncBookScraper:
    def __init__(self, concurrency=5, delay=0.5, max_retries=3):
        self.concurrency = concurrency
        self.delay = delay
        self.max_retries = max_retries
        self.books: List[Book] = []
        self.errors = 0

    async def fetch(self, session, url, semaphore):
        async with semaphore:
            for attempt in range(1, self.max_retries + 1):
                try:
                    async with session.get(url) as response:
                        if response.status == 200:
                            html = await response.text()
                            await asyncio.sleep(self.delay)
                            return html
                        elif response.status == 429:
                            await asyncio.sleep(2 ** attempt)
                except (aiohttp.ClientError, asyncio.TimeoutError):
                    if attempt == self.max_retries:
                        self.errors += 1
                        logger.warning(f"Failed: {url}")
                    await asyncio.sleep(1)
            return None

    def parse_page(self, html: str, page_num: int) -> List[Book]:
        soup = BeautifulSoup(html, "lxml")
        books = []
        for article in soup.select("article.product_pod"):
            price_text = article.select_one(".price_color").text
            price = float(price_text.replace("£", "").strip())
            rating_class = article.select_one("p.star-rating")["class"]
            rating = [c for c in rating_class if c != "star-rating"][0]
            books.append(Book(
                title=article.select_one("h3 a")["title"],
                price=price,
                rating=rating,
                available="In stock" in article.select_one(".availability").text,
                page=page_num,
            ))
        return books

    async def scrape(self, total_pages=50):
        start_time = time.time()
        semaphore = asyncio.Semaphore(self.concurrency)
        urls = {
            f"https://books.toscrape.com/catalogue/page-{i}.html": i
            for i in range(1, total_pages + 1)
        }
        async with aiohttp.ClientSession(
            headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
            timeout=aiohttp.ClientTimeout(total=30),
        ) as session:
            tasks = {
                url: self.fetch(session, url, semaphore)
                for url in urls
            }
            results = await asyncio.gather(*tasks.values(), return_exceptions=True)
        for url, result in zip(tasks.keys(), results):
            if isinstance(result, str):
                page_books = self.parse_page(result, urls[url])
                self.books.extend(page_books)
        elapsed = time.time() - start_time
        logger.info(
            f"Scraped {len(self.books)} books from {total_pages} pages "
            f"in {elapsed:.1f}s ({self.errors} errors)"
        )

    def save(self, filename="books.json"):
        with open(filename, "w") as f:
            json.dump([asdict(b) for b in self.books], f, indent=2)
        logger.info(f"Saved to {filename}")

    def stats(self):
        if not self.books:
            return
        avg = sum(b.price for b in self.books) / len(self.books)
        logger.info(f"Books: {len(self.books)}, Avg price: £{avg:.2f}")

async def main():
    scraper = AsyncBookScraper(concurrency=5, delay=0.3)
    await scraper.scrape(total_pages=50)
    scraper.stats()
    scraper.save()

asyncio.run(main())
FAQ
When should I use aiohttp instead of Requests?
Use aiohttp when you need to scrape many pages concurrently. For fewer than about 50 pages, sequential Requests is simpler. For 50+ pages where speed matters, aiohttp’s async concurrency delivers 5-20x faster completion times.
How does aiohttp compare to HTTPX?
HTTPX offers both sync and async APIs, HTTP/2 support, and a Requests-compatible interface. aiohttp is async-only but sustains higher throughput at very high concurrency. For most projects, HTTPX is the better choice thanks to its simpler API; use aiohttp when you need maximum async performance.
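As a rough illustration of the API difference (a minimal sketch, not a full comparison), the same fetch with HTTPX looks like this:
import asyncio
import httpx

async def fetch(url):
    # httpx.AsyncClient mirrors the Requests interface, but is awaitable
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        return response.text

html = asyncio.run(fetch("https://books.toscrape.com/"))
print(len(html))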
Can I use lxml instead of BeautifulSoup with aiohttp?
Yes. Replace BeautifulSoup parsing with lxml or Parsel for 2-5x faster parsing. See our lxml vs BeautifulSoup comparison and HTTPX + Parsel guide.
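For example, here is a minimal sketch of the same book extraction with lxml and XPath instead of BeautifulSoup (assuming the html string was fetched with aiohttp as in the examples above):
from lxml import html as lxml_html

def parse_books_lxml(html):
    tree = lxml_html.fromstring(html)
    books = []
    for article in tree.xpath('//article[@class="product_pod"]'):
        books.append({
            # @title reads the attribute directly from the <a> element
            "title": article.xpath('.//h3/a/@title')[0],
            "price": article.xpath('.//p[@class="price_color"]/text()')[0],
        })
    return books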
How many concurrent requests can aiohttp handle?
aiohttp can handle thousands of concurrent connections, but you should limit concurrency with a semaphore to avoid overwhelming target servers. A practical limit is 5-20 concurrent requests per domain, with appropriate delays.
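Besides the semaphore pattern shown earlier, aiohttp’s connector can cap connections at the transport level. A minimal sketch:
import asyncio
import aiohttp

async def main():
    # Cap total connections and per-host connections at the connector level;
    # this complements (rather than replaces) an asyncio.Semaphore and delays
    connector = aiohttp.TCPConnector(limit=20, limit_per_host=5)
    async with aiohttp.ClientSession(connector=connector) as session:
        async with session.get("https://books.toscrape.com/") as response:
            print(response.status)

asyncio.run(main())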
Should I use aiohttp + BeautifulSoup or Scrapy?
For medium projects (100-10,000 pages), aiohttp + BeautifulSoup gives you concurrency without framework complexity. For larger projects (10,000+ pages), Scrapy provides built-in pipelines, middleware, and crawl management.
Explore related stacks: HTTPX + Parsel, Python scraping libraries. For proxy setup, see our web scraping proxy guide.
External Resources:
- aiohttp Documentation
- BeautifulSoup Documentation
- Python asyncio Documentation
Related Reading
- Axios + Cheerio: Lightweight Node.js Scraping
- Axios Retry for Web Scraping in Node.js: the Complete Guide
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)
- API vs Web Scraping: When You Need Proxies (and When You Don’t)
- ASEAN Data Protection Laws: A Web Scraping Compliance Matrix
- How to Build an Ethical Web Scraping Policy for Your Company