Scrapy Tutorial: Complete Web Scraping Framework

Scrapy is the most powerful web scraping framework in Python. While libraries like BeautifulSoup handle parsing and Requests handle HTTP, Scrapy bundles everything — HTTP requests, HTML parsing, data pipelines, concurrency, and export — into a single, battle-tested framework. If you’re scraping more than a handful of pages, Scrapy should be your default choice.

This tutorial takes you from zero to a production-ready Scrapy project, covering spiders, items, pipelines, middleware, and proxy integration.

Prerequisites

  • Python 3.9+ installed
  • Basic Python knowledge (classes, generators, decorators)
  • Familiarity with HTML and CSS selectors
  • A terminal or command prompt
Check your Python version:

python --version  # Should show 3.9+

Installation

Create a virtual environment and install Scrapy:

python -m venv scraper-env
source scraper-env/bin/activate  # On Windows: scraper-env\Scripts\activate
pip install scrapy

Verify installation:

scrapy version  # Should show Scrapy 2.x

Creating a Scrapy Project

scrapy startproject bookstore
cd bookstore

This creates the following structure:

bookstore/
    scrapy.cfg
    bookstore/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

Your First Spider

Generate a spider:

scrapy genspider books books.toscrape.com

Now edit bookstore/spiders/books.py:

import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css(".price_color::text").get(),
                "rating": book.css("p.star-rating::attr(class)").get().split()[-1],
                "available": book.css(".instock.availability::text").getall()[-1].strip(),
                "url": response.urljoin(book.css("h3 a::attr(href)").get()),
            }

        # Follow pagination
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Run your spider:

scrapy crawl books -o books.json

This writes all scraped data to books.json. The spider follows the pagination links, while Scrapy schedules the requests, runs them concurrently, and serializes the output automatically.

Defining Items

Items give structure to your scraped data. Edit bookstore/items.py:

import scrapy


class BookItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    rating = scrapy.Field()
    available = scrapy.Field()
    url = scrapy.Field()
    description = scrapy.Field()
    upc = scrapy.Field()
    category = scrapy.Field()

Update your spider to use items:

from bookstore.items import BookItem


class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        for book in response.css("article.product_pod"):
            detail_url = response.urljoin(book.css("h3 a::attr(href)").get())
            yield scrapy.Request(detail_url, callback=self.parse_book)

        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_book(self, response):
        item = BookItem()
        item["title"] = response.css("h1::text").get()
        item["price"] = response.css(".price_color::text").get()

        table = response.css("table.table-striped")
        item["upc"] = table.css("tr:nth-child(1) td::text").get()
        item["available"] = table.css("tr:nth-child(6) td::text").get()

        breadcrumb = response.css("ul.breadcrumb li a::text").getall()
        item["category"] = breadcrumb[-1] if breadcrumb else None

        item["description"] = response.css("#product_description + p::text").get()
        item["url"] = response.url

        rating_map = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
        rating_class = response.css("p.star-rating::attr(class)").get()
        item["rating"] = rating_map.get(rating_class.split()[-1], 0) if rating_class else 0

        yield item
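
If you need to carry data from the listing page into the detail callback (for example, a listing-page price), cb_kwargs passes extra keyword arguments to the callback. A minimal sketch reusing the spider above; listing_price is an illustrative value, not part of BookItem:

def parse(self, response):
    for book in response.css("article.product_pod"):
        detail_url = response.urljoin(book.css("h3 a::attr(href)").get())
        # Pass listing-page data to the detail callback as a keyword argument
        yield scrapy.Request(
            detail_url,
            callback=self.parse_book,
            cb_kwargs={"listing_price": book.css(".price_color::text").get()},
        )

def parse_book(self, response, listing_price):
    # listing_price arrives here alongside the detail-page response
    ...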

CSS Selectors vs XPath

Scrapy supports both CSS selectors and XPath. Here are equivalent examples:

# CSS Selectors
response.css("h1::text").get()
response.css("div.product p::text").getall()
response.css("a::attr(href)").get()
response.css("div#main .content::text").get()

# XPath equivalents
response.xpath("//h1/text()").get()
response.xpath("//div[contains(@class, 'product')]//p/text()").getall()
response.xpath("//a/@href").get()
response.xpath("//div[@id='main']//*[contains(@class, 'content')]/text()").get()

XPath is more powerful for complex queries. For example, selecting by text content:

# Find links containing "Next"
response.xpath("//a[contains(text(), 'Next')]/@href").get()

# Select parent of a specific element
response.xpath("//span[@class='price']/..").get()

Item Pipelines

Pipelines process items after extraction — cleaning data, validating, deduplicating, and storing. Edit bookstore/pipelines.py:

import re
import sqlite3
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


class CleanPricePipeline:
    """Convert price strings like '£51.77' to float values."""

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        price_str = adapter.get("price", "")
        price_clean = re.sub(r"[^\d.]", "", price_str)
        adapter["price"] = float(price_clean) if price_clean else 0.0
        return item


class DuplicateFilterPipeline:
    """Drop duplicate items based on UPC."""

    def __init__(self):
        self.seen_upcs = set()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        upc = adapter.get("upc")
        if upc in self.seen_upcs:
            raise DropItem(f"Duplicate UPC: {upc}")
        self.seen_upcs.add(upc)
        return item


class SQLitePipeline:
    """Store items in a SQLite database."""

    def open_spider(self, spider):
        self.conn = sqlite3.connect("books.db")
        self.cursor = self.conn.cursor()
        self.cursor.execute("""
            CREATE TABLE IF NOT EXISTS books (
                upc TEXT PRIMARY KEY,
                title TEXT,
                price REAL,
                rating INTEGER,
                category TEXT,
                available TEXT,
                description TEXT,
                url TEXT
            )
        """)
        self.conn.commit()

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        self.cursor.execute("""
            INSERT OR REPLACE INTO books
            (upc, title, price, rating, category, available, description, url)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?)
        """, (
            adapter.get("upc"),
            adapter.get("title"),
            adapter.get("price"),
            adapter.get("rating"),
            adapter.get("category"),
            adapter.get("available"),
            adapter.get("description"),
            adapter.get("url"),
        ))
        self.conn.commit()
        return item

Enable the pipelines in bookstore/settings.py:

ITEM_PIPELINES = {
    "bookstore.pipelines.CleanPricePipeline": 100,
    "bookstore.pipelines.DuplicateFilterPipeline": 200,
    "bookstore.pipelines.SQLitePipeline": 300,
}

Numbers determine execution order — lower runs first.

Configuring Settings

Key settings in bookstore/settings.py:

# Be respectful
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1.5
CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Identify your scraper
USER_AGENT = "BookstoreScraper/1.0 (+https://yoursite.com)"

# Enable AutoThrottle for adaptive delays
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

# Retry configuration
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]

# Caching for development
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 86400
HTTPCACHE_DIR = "httpcache"

# Logging
LOG_LEVEL = "INFO"
LOG_FILE = "scraper.log"

# Output encoding
FEED_EXPORT_ENCODING = "utf-8"
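
Project-wide settings can also be overridden per spider through the custom_settings class attribute, which helps when one spider needs gentler throttling than the rest of the project. A brief sketch in the spider module (the values are illustrative):

class BooksSpider(scrapy.Spider):
    name = "books"
    # Per-spider overrides take precedence over the project settings.py
    custom_settings = {
        "DOWNLOAD_DELAY": 3,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
    }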

Middleware: Proxy Rotation

For scraping at scale, rotating proxies prevent IP bans. Create a custom middleware in bookstore/middlewares.py:

import random


class RotatingProxyMiddleware:
    """Rotate through a list of proxy servers."""

    def __init__(self, proxy_list):
        self.proxies = proxy_list

    @classmethod
    def from_crawler(cls, crawler):
        proxy_list = crawler.settings.getlist("PROXY_LIST", [])
        middleware = cls(proxy_list)
        return middleware

    def process_request(self, request, spider):
        if self.proxies:
            proxy = random.choice(self.proxies)
            request.meta["proxy"] = proxy
            spider.logger.debug(f"Using proxy: {proxy}")


class RandomUserAgentMiddleware:
    """Rotate User-Agent headers randomly."""

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 Safari/605.1.15",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)

Enable in settings:

DOWNLOADER_MIDDLEWARES = {
    "bookstore.middlewares.RandomUserAgentMiddleware": 400,
    "bookstore.middlewares.RotatingProxyMiddleware": 410,
}

PROXY_LIST = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]

For production scraping, rotating residential proxies provide much better success rates than datacenter proxies.
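
Many proxy providers expose a single rotating gateway that changes the exit IP on every request, in which case PROXY_LIST needs only that one entry. A sketch with a hypothetical hostname:

PROXY_LIST = [
    # Hypothetical gateway address; the provider rotates the exit IP per request
    "http://user:pass@rotating-gateway.provider.example:8000",
]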

Scrapy Shell: Interactive Testing

The Scrapy shell lets you test selectors interactively:

scrapy shell "https://books.toscrape.com/"

Inside the shell:

>>> response.css("article.product_pod h3 a::attr(title)").getall()
['A Light in the ...', 'Tipping the Velvet', ...]

>>> response.css(".price_color::text").getall()
['£51.77', '£53.74', ...]

>>> response.xpath("//li[@class='next']/a/@href").get()
'catalogue/page-2.html'
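
The shell also provides helpers such as fetch() to load a different URL into the session and view() to open the current response in your browser:

>>> fetch("https://books.toscrape.com/catalogue/page-2.html")
>>> view(response)  # opens the downloaded HTML in your default browser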

CrawlSpider: Rule-Based Crawling

For sites where you want to follow specific link patterns automatically:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class AutoCrawlSpider(CrawlSpider):
    name = "autocrawl"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    rules = (
        # Follow category links
        Rule(LinkExtractor(restrict_css=".side_categories a")),
        # Follow pagination links
        Rule(LinkExtractor(restrict_css=".next a")),
        # Extract data from book detail pages
        Rule(
            LinkExtractor(restrict_css="article.product_pod h3 a"),
            callback="parse_book"
        ),
    )

    def parse_book(self, response):
        yield {
            "title": response.css("h1::text").get(),
            "price": response.css(".price_color::text").get(),
            "category": response.css("ul.breadcrumb li:nth-child(3) a::text").get(),
            "url": response.url,
        }

Exporting Data

Scrapy supports multiple output formats. Note that -o appends to an existing file; use -O to overwrite it:

# JSON
scrapy crawl books -o output.json

# CSV
scrapy crawl books -o output.csv

# JSON Lines (better for large datasets)
scrapy crawl books -o output.jsonl

# Multiple outputs simultaneously
scrapy crawl books -o output.json -o output.csv

Configure in settings for more control:

FEEDS = {
    "output/books_%(time)s.json": {
        "format": "json",
        "encoding": "utf-8",
        "indent": 2,
        "overwrite": True,
    },
    "output/books_%(time)s.csv": {
        "format": "csv",
    },
}

Running Scrapy from a Script

You don’t always need the command line. Run Scrapy from a Python script:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def run_spider():
    process = CrawlerProcess(get_project_settings())
    process.crawl("books")
    process.start()


if __name__ == "__main__":
    run_spider()

Or with custom settings:

from scrapy.crawler import CrawlerProcess


process = CrawlerProcess(settings={
    "FEEDS": {"output.json": {"format": "json"}},
    "DOWNLOAD_DELAY": 2,
    "LOG_LEVEL": "INFO",
})

process.crawl(BooksSpider)
process.start()

Common Pitfalls and Troubleshooting

1. “Filtered duplicate request” in logs

Scrapy deduplicates URLs by default. If you need to visit the same URL with different parameters, set dont_filter=True:

yield scrapy.Request(url, callback=self.parse, dont_filter=True)

2. Spider finishes with zero items

Check that your CSS selectors match the actual HTML. Use scrapy shell to test selectors. The site may be loading content via JavaScript — Scrapy doesn’t render JS by default. Consider Scrapy + Playwright integration.
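
If the data only appears after JavaScript runs, scrapy-playwright plugs a browser into Scrapy’s download handlers. A minimal configuration sketch, assuming the scrapy-playwright package and Playwright browsers are installed:

# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# In the spider, opt individual requests into browser rendering
yield scrapy.Request(url, meta={"playwright": True}, callback=self.parse)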

3. Getting blocked (403/429 errors)

Enable AutoThrottle, add download delays, rotate User-Agents, and use proxy rotation. Start with generous delays and tighten them gradually.
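
As a starting point when a site is already blocking you, slow the crawl down substantially and let AutoThrottle adapt from there. A conservative sketch; tune the numbers to the target site:

DOWNLOAD_DELAY = 5
CONCURRENT_REQUESTS_PER_DOMAIN = 1
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]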

4. Memory issues on large crawls

Use JSON Lines format (-o output.jsonl) instead of JSON. JSON requires the entire dataset in memory; JSON Lines writes one record per line.
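
Downstream processing benefits the same way: a .jsonl file can be read one record at a time instead of being loaded whole. A quick sketch:

import json

with open("output.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        # Handle one scraped item at a time; the full dataset never sits in memory
        print(record["title"])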

5. Encoding errors in output

Set FEED_EXPORT_ENCODING = "utf-8" in settings and make sure your pipelines handle text encoding properly.

FAQ

Is Scrapy better than BeautifulSoup?

They solve different problems. Scrapy is a complete framework (HTTP, parsing, pipelines, concurrency), while BeautifulSoup is only an HTML parser. For simple one-off scripts, BeautifulSoup + Requests is simpler. For structured projects with multiple pages, Scrapy is far more capable. See our Scrapy vs BeautifulSoup comparison for details.

Can Scrapy handle JavaScript-rendered pages?

Not natively, but you can integrate it with Playwright or Splash for JS rendering. The scrapy-playwright library makes this seamless. See our Scrapy + Playwright guide.

How fast is Scrapy?

Scrapy can process hundreds of pages per minute thanks to its asynchronous Twisted engine. In practice, throughput is capped by your politeness settings rather than by Scrapy itself: a 0.5-second per-domain DOWNLOAD_DELAY limits you to roughly 2 pages per second on a single site, while broad crawls across many domains can keep 16 or more requests in flight at once. Always respect rate limits.

Can I deploy Scrapy to the cloud?

Yes. Scrapy can run on any server. Popular options include Scrapy Cloud (Zyte), AWS Lambda with layers, Docker containers, or simple VPS deployments with cron jobs.

How do I scrape sites that require login?

Use Scrapy’s FormRequest to submit login credentials; the session cookies then persist for subsequent requests:

from scrapy import FormRequest

# Inside your spider class:
def start_requests(self):
    yield FormRequest("https://example.com/login",
        formdata={"username": "user", "password": "pass"},
        callback=self.after_login)

def after_login(self, response):
    self.logger.info("Logged in with status %s", response.status)
