Scrapy + Playwright: Advanced JS Scraping

Scrapy excels at large-scale crawling but cannot render JavaScript. Playwright excels at browser automation but lacks crawling infrastructure. The scrapy-playwright plugin combines both — you get Scrapy’s spiders, pipelines, middleware, and concurrency management with Playwright’s JavaScript rendering, network interception, and page interaction.

This tutorial covers installation, configuration, handling JS-rendered pages, page interaction, performance optimization, and proxy integration.

Why Combine Scrapy and Playwright

| Challenge | Scrapy Alone | Playwright Alone | Scrapy + Playwright |
| --- | --- | --- | --- |
| JS rendering | Cannot | Can | Can |
| Concurrency | Built-in | Manual | Built-in |
| Data pipelines | Built-in | Manual | Built-in |
| Rate limiting | Built-in | Manual | Built-in |
| Link following | Built-in | Manual | Built-in |
| Page interaction | Cannot | Can | Can |
| Scale (1M+ pages) | Excellent | Poor | Good |

Installation

pip install scrapy scrapy-playwright
playwright install chromium

Basic Configuration

Add to your Scrapy project’s settings.py:

# settings.py

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Playwright settings
PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": True,
}

# Default Scrapy settings
CONCURRENT_REQUESTS = 4
DOWNLOAD_DELAY = 1
ROBOTSTXT_OBEY = True

Your First Scrapy-Playwright Spider

import scrapy
from scrapy_playwright.page import PageMethod

class JSBookSpider(scrapy.Spider):
    name = "js_books"

    def start_requests(self):
        yield scrapy.Request(
            url="https://books.toscrape.com/",
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    # Wait for the product grid to load before parsing
                    PageMethod("wait_for_selector", "article.product_pod"),
                ],
            },
        )

    def parse(self, response):
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css(".price_color::text").get(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(
                next_page,
                self.parse,
                meta={"playwright": True},
            )

Run it:

scrapy runspider js_books.py -o books.json

Mixing Playwright and Regular Requests

Not every page needs JavaScript rendering. Only use Playwright where needed:

class MixedSpider(scrapy.Spider):
    name = "mixed"

    # Placeholder URL lists; replace with real targets
    static_urls = ["https://example.com/about"]
    js_urls = ["https://example.com/app"]

    def start_requests(self):
        # Static pages — no Playwright (faster)
        for url in self.static_urls:
            yield scrapy.Request(url, callback=self.parse_static)

        # JS pages — use Playwright
        for url in self.js_urls:
            yield scrapy.Request(
                url,
                callback=self.parse_js,
                meta={"playwright": True},
            )

    def parse_static(self, response):
        # Regular Scrapy parsing — fast
        yield {"title": response.css("h1::text").get()}

    def parse_js(self, response):
        # Playwright-rendered — slower but handles JS
        yield {"title": response.css("h1::text").get()}

Page Interaction

Clicking Buttons

from scrapy_playwright.page import PageMethod

class LoadMoreSpider(scrapy.Spider):
    name = "load_more"

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.com/products",
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_selector", ".product-card"),
                    PageMethod("click", "button.load-more"),
                    PageMethod("wait_for_timeout", 2000),
                    PageMethod("click", "button.load-more"),
                    PageMethod("wait_for_timeout", 2000),
                ],
            },
        )

    def parse(self, response):
        for product in response.css(".product-card"):
            yield {
                "name": product.css("h3::text").get(),
                "price": product.css(".price::text").get(),
            }

Filling Forms

def start_requests(self):
    yield scrapy.Request(
        url="https://example.com/search",
        meta={
            "playwright": True,
            "playwright_page_methods": [
                PageMethod("fill", "input[name='q']", "web scraping"),
                PageMethod("click", "button[type='submit']"),
                PageMethod("wait_for_selector", ".results"),
            ],
        },
    )

Scrolling for Infinite Scroll

def start_requests(self):
    yield scrapy.Request(
        url="https://example.com/feed",
        meta={
            "playwright": True,
            "playwright_page_methods": [
                PageMethod("wait_for_selector", ".feed-item"),
                PageMethod("evaluate", "window.scrollTo(0, document.body.scrollHeight)"),
                PageMethod("wait_for_timeout", 2000),
                PageMethod("evaluate", "window.scrollTo(0, document.body.scrollHeight)"),
                PageMethod("wait_for_timeout", 2000),
                PageMethod("evaluate", "window.scrollTo(0, document.body.scrollHeight)"),
                PageMethod("wait_for_timeout", 2000),
            ],
        },
    )

Advanced Page Interaction with Callback

For complex interactions, request the page object with playwright_include_page and drive it directly in an async callback:

import scrapy
from scrapy_playwright.page import PageMethod

class AdvancedSpider(scrapy.Spider):
    name = "advanced"

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.com/products",
            meta={
                "playwright": True,
                "playwright_include_page": True,
            },
            callback=self.parse_with_page,
        )

    async def parse_with_page(self, response):
        page = response.meta["playwright_page"]

        try:
            # Complex interaction logic
            while True:
                load_more = page.locator("button.load-more")
                if await load_more.count() == 0:
                    break
                await load_more.click()
                await page.wait_for_timeout(2000)

            # Get final HTML
            html = await page.content()
            sel = scrapy.Selector(text=html)

            for product in sel.css(".product-card"):
                yield {
                    "name": product.css("h3::text").get(),
                    "price": product.css(".price::text").get(),
                }

        finally:
            await page.close()

Network Interception

Block heavy resources for faster scraping:

# settings.py

PLAYWRIGHT_ABORT_REQUEST = lambda request: request.resource_type in [
    "image", "stylesheet", "font", "media",
]

Or per-spider:

def start_requests(self):
    yield scrapy.Request(
        url="https://example.com",
        meta={
            "playwright": True,
            "playwright_page_methods": [
                PageMethod("route", "**/*.{png,jpg,jpeg,gif,css,woff}", lambda route: route.abort()),
            ],
        },
    )

Handling SPAs

React/Vue/Angular Applications

class SPASpider(scrapy.Spider):
    name = "spa"

    def start_requests(self):
        yield scrapy.Request(
            url="https://spa-example.com/products",
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    # Wait for React/Vue to finish rendering
                    PageMethod("wait_for_load_state", "networkidle"),
                    # Or wait for specific elements
                    PageMethod("wait_for_selector", "[data-loaded='true']"),
                ],
            },
        )

    def parse(self, response):
        for item in response.css("[data-testid='product-card']"):
            yield {
                "name": item.css("[data-testid='name']::text").get(),
                "price": item.css("[data-testid='price']::text").get(),
            }

Capturing API Data

class APICapture(scrapy.Spider):
    name = "api_capture"

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.com/products",
            meta={
                "playwright": True,
                "playwright_include_page": True,
            },
            callback=self.capture_api,
        )

    async def capture_api(self, response):
        page = response.meta["playwright_page"]

        api_data = []

        async def handle_response(resp):
            if "/api/products" in resp.url and resp.status == 200:
                try:
                    data = await resp.json()
                    api_data.append(data)
                except Exception:
                    pass

        page.on("response", handle_response)
        await page.reload()
        await page.wait_for_timeout(5000)

        for data in api_data:
            for product in data.get("products", []):
                yield product

        await page.close()

Performance Optimization

1. Minimize Playwright Usage

# Only use Playwright when JavaScript rendering is needed
custom_settings = {
    "CONCURRENT_REQUESTS": 16,  # For regular requests
    "PLAYWRIGHT_MAX_CONTEXTS": 4,  # Limit browser contexts
}

2. Block Resources

# settings.py
PLAYWRIGHT_ABORT_REQUEST = lambda req: req.resource_type in [
    "image", "stylesheet", "font", "media", "other",
]

3. Use Multiple Browser Contexts

# settings.py
PLAYWRIGHT_MAX_CONTEXTS = 8
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 4

4. Reuse Contexts

PLAYWRIGHT_CONTEXTS = {
    "default": {
        "viewport": {"width": 1280, "height": 720},
        "user_agent": "Mozilla/5.0",
    },
}

Proxy Integration

Global Proxy

# settings.py
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": True,
    "proxy": {
        "server": "http://proxy.example.com:8080",
        "username": "user",
        "password": "pass",
    },
}

Per-Request Proxy

PLAYWRIGHT_CONTEXTS = {
    "proxy1": {
        "proxy": {
            "server": "http://proxy1.example.com:8080",
            "username": "user",
            "password": "pass",
        },
    },
    "proxy2": {
        "proxy": {
            "server": "http://proxy2.example.com:8080",
            "username": "user",
            "password": "pass",
        },
    },
}

# In your spider
def start_requests(self):
    yield scrapy.Request(
        url="https://example.com",
        meta={
            "playwright": True,
            "playwright_context": "proxy1",
        },
    )

For proxy types, see our web scraping proxy guide and proxy glossary.

Complete Example

import scrapy
from scrapy_playwright.page import PageMethod

class EcommerceSpider(scrapy.Spider):
    name = "ecommerce"
    allowed_domains = ["books.toscrape.com"]

    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "PLAYWRIGHT_BROWSER_TYPE": "chromium",
        "PLAYWRIGHT_LAUNCH_OPTIONS": {"headless": True},
        "CONCURRENT_REQUESTS": 4,
        "DOWNLOAD_DELAY": 1,
        "FEEDS": {
            "books.json": {"format": "json", "overwrite": True},
        },
    }

    def start_requests(self):
        yield scrapy.Request(
            url="https://books.toscrape.com/",
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_selector", "article.product_pod"),
                ],
            },
        )

    def parse(self, response):
        for book in response.css("article.product_pod"):
            detail_url = book.css("h3 a::attr(href)").get()
            yield response.follow(
                detail_url,
                callback=self.parse_detail,
                meta={
                    "playwright": True,
                    "playwright_page_methods": [
                        PageMethod("wait_for_selector", ".product_main"),
                    ],
                },
            )

        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(
                next_page,
                self.parse,
                meta={"playwright": True},
            )

    def parse_detail(self, response):
        yield {
            "title": response.css("h1::text").get(),
            "price": response.css(".price_color::text").get(),
            "availability": response.css(".availability::text").getall()[-1].strip(),
            "description": response.css("#product_description + p::text").get(),
            "upc": response.css("td::text").get(),
            "url": response.url,
        }

FAQ

When should I use Scrapy + Playwright vs plain Scrapy?

Use Scrapy + Playwright when the target site renders content with JavaScript (React, Angular, Vue SPAs). If the HTML source contains all the data you need, plain Scrapy is faster and lighter. Always check the page source first before adding Playwright.
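That source check is easy to script; a minimal sketch, where the sample HTML strings are placeholders standing in for a fetched page source:

```python
def needs_playwright(raw_html: str, expected_text: str) -> bool:
    """Return True if the expected content is absent from the raw HTML,
    which suggests it is rendered client-side by JavaScript."""
    return expected_text not in raw_html

# Sample sources standing in for real fetched pages:
static_html = "<html><body><h1>Product list</h1></body></html>"
spa_html = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'

print(needs_playwright(static_html, "Product list"))  # False: plain Scrapy is enough
print(needs_playwright(spa_html, "Product list"))     # True: consider Playwright
```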

Does Scrapy + Playwright support all Playwright features?

Most features are supported through PageMethod and playwright_include_page. For very complex interactions, use playwright_include_page=True to access the Playwright page object directly in your callback.

How does it compare to Scrapy + Splash?

Scrapy + Playwright is the modern replacement for Scrapy + Splash. Playwright renders pages with a real, up-to-date browser engine (Chromium, Firefox, or WebKit) rather than Splash's aging QtWebKit, supports more interaction types, and does not require running a separate Splash server. Splash is largely unmaintained and no longer recommended for most use cases.

What is the performance impact of adding Playwright?

Significant. Regular Scrapy requests process at 50-200+ pages/second. With Playwright, expect 2-10 pages/second depending on page complexity and resource blocking. Minimize Playwright usage to only the pages that need it.
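To make the gap concrete, here is a back-of-envelope estimate using rates in the middle of those ranges as assumptions:

```python
def crawl_hours(pages: int, pages_per_second: float) -> float:
    """Estimated wall-clock hours to fetch `pages` at a steady rate."""
    return pages / pages_per_second / 3600

# 100,000 pages at the assumed rates:
print(round(crawl_hours(100_000, 100), 1))  # plain Scrapy at ~100 pages/s -> 0.3 hours
print(round(crawl_hours(100_000, 5), 1))    # Playwright at ~5 pages/s -> 5.6 hours
```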


Learn more: Scrapy tutorial, Playwright tutorial, Python scraping libraries.
