Scrapy Tutorial: Complete Web Scraping Framework
Scrapy is one of the most powerful web scraping frameworks in Python. While libraries like BeautifulSoup handle parsing and Requests handles HTTP, Scrapy bundles everything — HTTP requests, HTML parsing, data pipelines, concurrency, and export — into a single, battle-tested framework. If you’re scraping more than a handful of pages, Scrapy is a strong default choice.
This tutorial takes you from zero to a production-ready Scrapy project, covering spiders, items, pipelines, middleware, and proxy integration.
Prerequisites
- Python 3.9+ installed
- Basic Python knowledge (classes, generators, decorators)
- Familiarity with HTML and CSS selectors
- A terminal or command prompt
python --version  # Should show 3.9+

Installation
Create a virtual environment and install Scrapy:
python -m venv scraper-env
source scraper-env/bin/activate # On Windows: scraper-env\Scripts\activate
pip install scrapy

Verify installation:
scrapy version  # Should show Scrapy 2.x

Creating a Scrapy Project
scrapy startproject bookstore
cd bookstore

This creates the following structure:
bookstore/
    scrapy.cfg
    bookstore/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

Your First Spider
Generate a spider:
scrapy genspider books books.toscrape.com

Now edit bookstore/spiders/books.py:
import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css(".price_color::text").get(),
                "rating": book.css("p.star-rating::attr(class)").get().split()[-1],
                "available": book.css(".instock.availability::text").getall()[-1].strip(),
                "url": response.urljoin(book.css("h3 a::attr(href)").get()),
            }

        # Follow pagination
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Run your spider:
scrapy crawl books -o books.json

This outputs all scraped data to books.json. Scrapy handles pagination, concurrent requests, and data serialization automatically.
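The spider resolves relative href attributes with response.urljoin before yielding them. You can reproduce that resolution with the standard library's urllib.parse.urljoin, which is handy when debugging why a followed link went to the wrong place:

```python
from urllib.parse import urljoin

# Resolving a relative href against the page it appeared on,
# the same way response.urljoin does inside the spider.
print(urljoin("https://books.toscrape.com/", "catalogue/page-2.html"))
# https://books.toscrape.com/catalogue/page-2.html

# The result depends on the current page's path, which is why you
# resolve against response.url rather than a hardcoded base:
print(urljoin("https://books.toscrape.com/catalogue/page-2.html", "page-3.html"))
# https://books.toscrape.com/catalogue/page-3.html
```

response.follow does the same resolution for you, which is why the pagination code above can pass the raw relative href directly.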
Defining Items
Items give structure to your scraped data. Edit bookstore/items.py:
import scrapy


class BookItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    rating = scrapy.Field()
    available = scrapy.Field()
    url = scrapy.Field()
    description = scrapy.Field()
    upc = scrapy.Field()
    category = scrapy.Field()

Update your spider to use items:
import scrapy

from bookstore.items import BookItem


class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        for book in response.css("article.product_pod"):
            detail_url = response.urljoin(book.css("h3 a::attr(href)").get())
            yield scrapy.Request(detail_url, callback=self.parse_book)

        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_book(self, response):
        item = BookItem()
        item["title"] = response.css("h1::text").get()
        item["price"] = response.css(".price_color::text").get()

        table = response.css("table.table-striped")
        item["upc"] = table.css("tr:nth-child(1) td::text").get()
        item["available"] = table.css("tr:nth-child(6) td::text").get()

        breadcrumb = response.css("ul.breadcrumb li a::text").getall()
        item["category"] = breadcrumb[-1] if breadcrumb else None

        item["description"] = response.css("#product_description + p::text").get()
        item["url"] = response.url

        rating_map = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
        rating_class = response.css("p.star-rating::attr(class)").get()
        item["rating"] = rating_map.get(rating_class.split()[-1], 0) if rating_class else 0

        yield item

CSS Selectors vs XPath
Scrapy supports both CSS selectors and XPath. Here are equivalent examples:
# CSS Selectors
response.css("h1::text").get()
response.css("div.product p::text").getall()
response.css("a::attr(href)").get()
response.css("div#main .content::text").get()
# XPath equivalents
response.xpath("//h1/text()").get()
response.xpath("//div[@class='product']/p/text()").getall()
response.xpath("//a/@href").get()
response.xpath("//div[@id='main']//*[contains(@class,'content')]/text()").get()

XPath is more powerful for complex queries. For example, selecting by text content:
# Find links containing "Next"
response.xpath("//a[contains(text(), 'Next')]/@href").get()
# Select parent of a specific element
response.xpath("//span[@class='price']/..").get()

Item Pipelines
Pipelines process items after extraction — cleaning data, validating, deduplicating, and storing. Edit bookstore/pipelines.py:
import re
import sqlite3

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


class CleanPricePipeline:
    """Convert price strings like '£51.77' to float values."""

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        price_str = adapter.get("price") or ""
        price_clean = re.sub(r"[^\d.]", "", price_str)
        adapter["price"] = float(price_clean) if price_clean else 0.0
        return item


class DuplicateFilterPipeline:
    """Drop duplicate items based on UPC."""

    def __init__(self):
        self.seen_upcs = set()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        upc = adapter.get("upc")
        if upc in self.seen_upcs:
            raise DropItem(f"Duplicate UPC: {upc}")
        self.seen_upcs.add(upc)
        return item


class SQLitePipeline:
    """Store items in a SQLite database."""

    def open_spider(self, spider):
        self.conn = sqlite3.connect("books.db")
        self.cursor = self.conn.cursor()
        self.cursor.execute("""
            CREATE TABLE IF NOT EXISTS books (
                upc TEXT PRIMARY KEY,
                title TEXT,
                price REAL,
                rating INTEGER,
                category TEXT,
                available TEXT,
                description TEXT,
                url TEXT
            )
        """)
        self.conn.commit()

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        self.cursor.execute("""
            INSERT OR REPLACE INTO books
            (upc, title, price, rating, category, available, description, url)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?)
        """, (
            adapter.get("upc"),
            adapter.get("title"),
            adapter.get("price"),
            adapter.get("rating"),
            adapter.get("category"),
            adapter.get("available"),
            adapter.get("description"),
            adapter.get("url"),
        ))
        self.conn.commit()
        return item

Enable the pipelines in bookstore/settings.py:

ITEM_PIPELINES = {
    "bookstore.pipelines.CleanPricePipeline": 100,
    "bookstore.pipelines.DuplicateFilterPipeline": 200,
    "bookstore.pipelines.SQLitePipeline": 300,
}

Numbers determine execution order — lower runs first.
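Pipeline logic is easy to sanity-check without running a crawl. A minimal sketch that replicates CleanPricePipeline's regex on a plain dict (standing in for ItemAdapter):

```python
import re

def clean_price(item: dict) -> dict:
    # Same regex as CleanPricePipeline: strip everything except digits and dots.
    price_str = item.get("price") or ""
    price_clean = re.sub(r"[^\d.]", "", price_str)
    item["price"] = float(price_clean) if price_clean else 0.0
    return item

print(clean_price({"price": "£51.77"})["price"])  # 51.77
print(clean_price({"price": None})["price"])      # 0.0
```

Keeping the transformation in a small pure function like this also makes it trivial to cover with unit tests.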
Configuring Settings
Key settings in bookstore/settings.py:
# Be respectful
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1.5
CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 4
# Identify your scraper
USER_AGENT = "BookstoreScraper/1.0 (+https://yoursite.com)"
# Enable AutoThrottle for adaptive delays
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
# Retry configuration
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
# Caching for development
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 86400
HTTPCACHE_DIR = "httpcache"
# Logging
LOG_LEVEL = "INFO"
LOG_FILE = "scraper.log"
# Output encoding
FEED_EXPORT_ENCODING = "utf-8"

Middleware: Proxy Rotation
For scraping at scale, rotating proxies prevent IP bans. Create a custom middleware in bookstore/middlewares.py:
import random


class RotatingProxyMiddleware:
    """Rotate through a list of proxy servers."""

    def __init__(self, proxy_list):
        self.proxies = proxy_list

    @classmethod
    def from_crawler(cls, crawler):
        proxy_list = crawler.settings.getlist("PROXY_LIST", [])
        return cls(proxy_list)

    def process_request(self, request, spider):
        if self.proxies:
            proxy = random.choice(self.proxies)
            request.meta["proxy"] = proxy
            spider.logger.debug(f"Using proxy: {proxy}")


class RandomUserAgentMiddleware:
    """Rotate User-Agent headers randomly."""

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 Safari/605.1.15",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)

Enable in settings:
DOWNLOADER_MIDDLEWARES = {
    "bookstore.middlewares.RandomUserAgentMiddleware": 400,
    "bookstore.middlewares.RotatingProxyMiddleware": 410,
}

PROXY_LIST = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]

For production scraping, rotating residential proxies provide much better success rates than datacenter proxies.
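random.choice can pick the same proxy several times in a row. If you want a strictly even spread instead, itertools.cycle gives round-robin rotation; a sketch with hypothetical proxy URLs:

```python
from itertools import cycle

# Hypothetical endpoints; substitute your own PROXY_LIST entries.
proxies = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

rotation = cycle(proxies)  # endless iterator that wraps around the list

picks = [next(rotation) for _ in range(5)]
print(picks)  # the 4th pick wraps back to the first proxy
```

In the middleware above, you would build the cycle in __init__ and call next(self.rotation) inside process_request in place of random.choice.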
Scrapy Shell: Interactive Testing
The Scrapy shell lets you test selectors interactively:
scrapy shell "https://books.toscrape.com/"

Inside the shell:
>>> response.css("article.product_pod h3 a::attr(title)").getall()
['A Light in the ...', 'Tipping the Velvet', ...]
>>> response.css(".price_color::text").getall()
['£51.77', '£53.74', ...]
>>> response.xpath("//li[@class='next']/a/@href").get()
'catalogue/page-2.html'

CrawlSpider: Rule-Based Crawling
For sites where you want to follow specific link patterns automatically:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class AutoCrawlSpider(CrawlSpider):
    name = "autocrawl"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    rules = (
        # Follow category links
        Rule(LinkExtractor(restrict_css=".side_categories a")),
        # Follow pagination links
        Rule(LinkExtractor(restrict_css=".next a")),
        # Extract data from book detail pages
        Rule(
            LinkExtractor(restrict_css="article.product_pod h3 a"),
            callback="parse_book",
        ),
    )

    def parse_book(self, response):
        yield {
            "title": response.css("h1::text").get(),
            "price": response.css(".price_color::text").get(),
            "category": response.css("ul.breadcrumb li:nth-child(3) a::text").get(),
            "url": response.url,
        }

Exporting Data
Scrapy supports multiple output formats:
# JSON
scrapy crawl books -o output.json
# CSV
scrapy crawl books -o output.csv
# JSON Lines (better for large datasets)
scrapy crawl books -o output.jsonl
# Multiple outputs simultaneously
scrapy crawl books -o output.json -o output.csv

Configure in settings for more control:

FEEDS = {
    "output/books_%(time)s.json": {
        "format": "json",
        "encoding": "utf-8",
        "indent": 2,
        "overwrite": True,
    },
    "output/books_%(time)s.csv": {
        "format": "csv",
    },
}

Running Scrapy from a Script
You don’t always need the command line. Run Scrapy from a Python script:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def run_spider():
    process = CrawlerProcess(get_project_settings())
    process.crawl("books")
    process.start()


if __name__ == "__main__":
    run_spider()

Or with custom settings:

from scrapy.crawler import CrawlerProcess

from bookstore.spiders.books import BooksSpider

process = CrawlerProcess(settings={
    "FEEDS": {"output.json": {"format": "json"}},
    "DOWNLOAD_DELAY": 2,
    "LOG_LEVEL": "INFO",
})
process.crawl(BooksSpider)
process.start()

Common Pitfalls and Troubleshooting
1. “Filtered duplicate request” in logs
Scrapy deduplicates URLs by default. If you need to visit the same URL with different parameters, set dont_filter=True:
yield scrapy.Request(url, callback=self.parse, dont_filter=True)

2. Spider finishes with zero items
Check that your CSS selectors match the actual HTML. Use scrapy shell to test selectors. The site may be loading content via JavaScript — Scrapy doesn’t render JS by default. Consider Scrapy + Playwright integration.
3. Getting blocked (403/429 errors)
Enable AutoThrottle, add download delays, rotate User-Agents, and use proxy rotation. Start with generous delays and tighten them gradually.
4. Memory issues on large crawls
Use JSON Lines format (-o output.jsonl) instead of JSON. JSON requires the entire dataset in memory; JSON Lines writes one record per line.
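The difference is easy to see with the standard library: JSON Lines writes and reads one record at a time, so memory use stays flat however large the crawl. A small in-memory sketch:

```python
import io
import json

records = [{"title": f"Book {i}", "price": float(i)} for i in range(3)]

# JSON Lines: each record is a self-contained JSON object on its own line,
# so records can be appended as items arrive.
buf = io.StringIO()
for rec in records:
    buf.write(json.dumps(rec) + "\n")

# Reading back is line by line; no need to load the whole file at once.
buf.seek(0)
loaded = [json.loads(line) for line in buf]
print(loaded[1]["title"])  # Book 1
```

A plain JSON array, by contrast, must be parsed as a single document, which is what forces the whole dataset into memory.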
5. Encoding errors in output
Set FEED_EXPORT_ENCODING = "utf-8" in settings and make sure your pipelines handle text encoding properly.
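The underlying behavior is reproducible with the standard library: by default json.dumps escapes non-ASCII characters, which is the escaping that FEED_EXPORT_ENCODING = "utf-8" turns off in Scrapy's exporters:

```python
import json

item = {"price": "£51.77"}

print(json.dumps(item))                      # £ escaped as \u00a3
print(json.dumps(item, ensure_ascii=False))  # real UTF-8 output
```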
FAQ
Is Scrapy better than BeautifulSoup?
They solve different problems. Scrapy is a complete framework (HTTP, parsing, pipelines, concurrency), while BeautifulSoup is only an HTML parser. For simple one-off scripts, BeautifulSoup + Requests is simpler. For structured projects with multiple pages, Scrapy is far more capable. See our Scrapy vs BeautifulSoup comparison for details.
Can Scrapy handle JavaScript-rendered pages?
Not natively, but you can integrate it with Playwright or Splash for JS rendering. The scrapy-playwright library makes this seamless. See our Scrapy + Playwright guide.
How fast is Scrapy?
Scrapy's asynchronous Twisted engine can sustain hundreds to thousands of pages per minute. With 16 concurrent requests and a 0.5-second delay, the theoretical ceiling is about 32 pages per second (16 / 0.5), though real-world throughput is lower once network latency, parsing time, and per-domain limits are factored in. Always respect rate limits.
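A back-of-the-envelope throughput estimate (a simplification: DOWNLOAD_DELAY applies per domain slot, so a single site will see far less than this ceiling):

```python
concurrent_requests = 16
download_delay = 0.5  # seconds between requests per slot

# Theoretical ceiling: every slot completes one request per delay window.
max_pages_per_second = concurrent_requests / download_delay
print(max_pages_per_second)       # 32.0
print(max_pages_per_second * 60)  # 1920.0 pages per minute
```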
Can I deploy Scrapy to the cloud?
Yes. Scrapy can run on any server. Popular options include Scrapy Cloud (Zyte), AWS Lambda with layers, Docker containers, or simple VPS deployments with cron jobs.
How do I scrape sites that require login?
Use Scrapy’s FormRequest to submit login credentials, then the session cookies will persist for subsequent requests:
from scrapy import FormRequest

def start_requests(self):
    yield FormRequest(
        "https://example.com/login",
        formdata={"username": "user", "password": "pass"},
        callback=self.after_login,
    )

Next Steps
- Scrapy + Playwright: Advanced JS Scraping
- Scrapy vs BeautifulSoup: When to Use Each
- Building a Web Crawler in Python
- Best Python Web Scraping Libraries
- Web Scraping Proxy Integration
- aiohttp + BeautifulSoup: Async Python Scraping
- Axios + Cheerio: Lightweight Node.js Scraping
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)
- API vs Web Scraping: When You Need Proxies (and When You Don’t)
- ASEAN Data Protection Laws: A Web Scraping Compliance Matrix
- How to Build an Ethical Web Scraping Policy for Your Company