Scrapy + Playwright integration in 2026

Scrapy + Playwright integration is the standard answer when Scrapy’s speed and pipeline ergonomics meet pages that need a real browser to render. Scrapy alone runs at thousands of pages per second through asynchronous Twisted callbacks but cannot execute JavaScript. Playwright can drive a real browser through any page but is a single-page-at-a-time tool that does not give you Scrapy’s middleware, item pipelines, or deduplication. The scrapy-playwright plugin bridges them: Scrapy still handles the orchestration, scheduling, and pipelines, but pages route through Playwright when needed.

This guide covers scrapy-playwright 0.0.40+ on Scrapy 2.11+ in 2026, the setup steps, the page coroutines pattern, proxy and stealth integration, and production patterns that scale. Code is Python 3.12 throughout. By the end you will have a Scrapy project that can mix HTML-only requests (fast) and Playwright-rendered requests (slow but JS-capable) in one spider, with shared middleware and pipelines.

Why this combination wins

Scrapy alone is unbeatable for HTML-only scraping at scale. Its async model, request prioritization, deduplication, and pipeline architecture are mature. The weakness is that any JavaScript-rendered page returns empty HTML.

Playwright alone handles JS perfectly but has no concept of crawl orchestration. You write your own request queue, deduplication, retry logic, and item processing.

scrapy-playwright lets each page in a Scrapy spider opt into Playwright via a metadata flag. Pages without the flag stay as fast HTML requests. Pages with the flag route through a Playwright pool, render fully, and return rendered HTML to your callback as if Scrapy fetched it directly.

For Scrapy’s official docs, see Scrapy 2.11+ documentation. For Playwright’s, see Playwright Python documentation.

Installation

pip install scrapy scrapy-playwright playwright
playwright install chromium

Verify versions:

scrapy version       # 2.11.0 or later
python -c "import scrapy_playwright; print(scrapy_playwright.__version__)"
# 0.0.40 or later

Project setup

Create a Scrapy project:

scrapy startproject dynamic_scraper
cd dynamic_scraper

Edit dynamic_scraper/settings.py to enable scrapy-playwright:

# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": True,
    "args": [
        "--disable-blink-features=AutomationControlled",
        "--disable-features=IsolateOrigins",
    ],
}

# How many concurrent Playwright contexts to maintain
PLAYWRIGHT_MAX_CONTEXTS = 8
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 30000

# Reasonable defaults for mixed crawls
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 4
DOWNLOAD_DELAY = 0
RETRY_TIMES = 3

# Standard middleware for header rotation, etc
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    "scrapy_user_agents.middlewares.RandomUserAgentMiddleware": 400,
    "dynamic_scraper.middlewares.ProxyMiddleware": 410,
}

# Item pipeline (whatever you need)
ITEM_PIPELINES = {
    "dynamic_scraper.pipelines.JsonExportPipeline": 300,
}

The download handlers route every HTTP request through scrapy-playwright. Pages without the playwright meta flag are still fetched via regular HTTP (scrapy-playwright detects this automatically), so you do not pay the browser cost for pages that do not need it.

A first spider mixing HTML and Playwright requests

Most real spiders mix request types. The product listing might be static HTML, but the product detail might require JavaScript. scrapy-playwright handles this cleanly:

# spiders/products.py
import scrapy
from scrapy_playwright.page import PageMethod


class ProductSpider(scrapy.Spider):
    name = "products"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        # Listing page is plain HTML, no Playwright needed
        for product_url in response.css("a.product-link::attr(href)").getall():
            yield response.follow(
                product_url,
                callback=self.parse_product,
                meta={
                    "playwright": True,
                    "playwright_include_page": False,
                    "playwright_page_methods": [
                        PageMethod("wait_for_selector", "div.product-detail"),
                    ],
                },
            )

        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_product(self, response):
        yield {
            "title": response.css("h1.product-title::text").get(),
            "price": response.css("span.price-current::text").get(),
            "stock": response.css("div.stock-info::text").get(),
            "description": response.css("div.product-description").get(),
        }

Key points:

The listing page is fetched normally (no playwright: True in meta)
Product detail pages are fetched via Playwright (playwright: True)
playwright_page_methods runs on the Playwright page before returning the HTML
wait_for_selector ensures the JavaScript-rendered detail is present before scrapy-playwright captures the HTML

This pattern keeps the spider’s overall structure conventional while opting specific pages into the heavyweight rendering path.

Interacting with the page

Some scraping requires interaction (click a “Show more” button, scroll to load infinite content, fill a search form). For this, request the actual Playwright Page object back via playwright_include_page=True:

import scrapy
from scrapy_playwright.page import PageMethod


class InfiniteScrollSpider(scrapy.Spider):
    name = "infinite"
    start_urls = ["https://example.com/feed"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta={
                    "playwright": True,
                    "playwright_include_page": True,
                    "playwright_page_methods": [
                        PageMethod("wait_for_selector", "article.feed-item"),
                    ],
                },
                callback=self.parse_feed,
                errback=self.errback_close_page,
            )

    async def parse_feed(self, response):
        page = response.meta["playwright_page"]
        # Scroll to load more items
        for _ in range(5):
            await page.evaluate("window.scrollBy(0, document.body.scrollHeight)")
            await page.wait_for_timeout(2000)
        html = await page.content()
        await page.close()

        # Parse the now-loaded HTML with Scrapy selectors
        from scrapy.http import HtmlResponse
        new_response = HtmlResponse(url=response.url, body=html, encoding="utf-8")
        for article in new_response.css("article.feed-item"):
            yield {
                "title": article.css("h2::text").get(),
                "url": article.css("a::attr(href)").get(),
                "summary": article.css("p.summary::text").get(),
            }

    async def errback_close_page(self, failure):
        page = failure.request.meta.get("playwright_page")
        if page:
            await page.close()

The playwright_include_page=True means the Page object is attached to the response. Your callback can interact with it (scroll, click, wait) before extracting HTML. Always close the page in your callback or errback to avoid leaking browser contexts.

Proxy integration

Pass proxies to Playwright either at the launch level (one proxy for all contexts) or per-context (different proxy per request). Per-context is the more flexible pattern.

# settings.py addition
PLAYWRIGHT_CONTEXTS = {
    "default": {
        "proxy": {
            "server": "http://squid.internal:3128",
            "username": "scraper_user",
            "password": "secret",
        },
    },
}

For per-request proxies via a custom middleware:

# middlewares.py
import random


class ProxyRotationMiddleware:
    PROXIES = [
        "http://user1:pass1@proxy1.provider.com:8080",
        "http://user2:pass2@proxy2.provider.com:8080",
        "http://user3:pass3@proxy3.provider.com:8080",
    ]

    def process_request(self, request, spider):
        if request.meta.get("playwright"):
            # For Playwright requests, set the context with a specific proxy
            proxy = random.choice(self.PROXIES)
            user_pass, host_port = proxy.split("//")[1].split("@")
            user, password = user_pass.split(":")
            host, port = host_port.split(":")
            request.meta["playwright_context"] = f"proxy_{hash(proxy)}"
            request.meta["playwright_context_kwargs"] = {
                "proxy": {
                    "server": f"http://{host}:{port}",
                    "username": user,
                    "password": password,
                },
            }
        else:
            # For regular HTTP requests, set the standard proxy meta
            request.meta["proxy"] = random.choice(self.PROXIES)

scrapy-playwright caches contexts by name, so reusing the same proxy uses the same context. This avoids spawning a new browser context per request, which is expensive.

Stealth: integrating patchright

For sites that fingerprint, use patchright (Playwright stealth fork) instead of vanilla Playwright. patchright is a drop-in replacement, so you can swap it in via setting the playwright module:

# In your settings.py or as an environment variable
import patchright as playwright_module

PLAYWRIGHT_BROWSER_TYPE = "chromium"
# scrapy-playwright will use whatever playwright module is installed

Or install patchright as the playwright package (it shadows vanilla playwright):

pip install patchright
playwright install chromium

After install, scrapy-playwright automatically uses patchright’s launch logic, which applies stealth patches.

For fine-grained stealth on a per-request basis, inject scripts via init_script:

yield scrapy.Request(
    url,
    meta={
        "playwright": True,
        "playwright_context_kwargs": {
            "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                          "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        },
        "playwright_page_init_script": """
            Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
            Object.defineProperty(navigator, 'plugins', { get: () => [1, 2, 3] });
        """,
    },
)

For the full picture on browser fingerprinting, see canvas fingerprinting bypass and WebGL fingerprinting bypass.

Performance: how much does Playwright slow you down

Throughput comparison on a single worker:

request type	requests/sec
pure Scrapy HTML	200-500
scrapy-playwright (cached context, simple page)	5-10
scrapy-playwright (new context, complex SPA)	1-2
scrapy-playwright with full humanization	0.3-0.5

The cost is real. Use Playwright only for pages that need it. The mixed pattern (HTML for listings, Playwright for details) is the right balance for most sites.

To scale Playwright throughput:

Increase PLAYWRIGHT_MAX_CONTEXTS (each context costs ~100 MB RAM)
Reuse contexts across requests by giving them stable names
Run multiple Scrapy instances on different machines
Keep navigation timeout tight (30 seconds is usually enough)

Comparison: scrapy-playwright vs alternatives

approach	strengths	weaknesses
scrapy-playwright	mixed HTML/JS in one project, native Scrapy pipelines	adds Playwright overhead even for simple sites
scrapy-splash	older, lighter than Playwright	Splash is essentially abandoned, JS engine is dated
scrapy + selenium-wire	works	slow, fragile
scrapy + standalone Playwright service	clean separation	extra service to operate
pure Playwright	most JS-capable	no Scrapy pipelines, write your own everything
Crawlee (Node)	similar pattern in Node ecosystem	requires switching off Python

For most Python teams in 2026, scrapy-playwright is the right pick. For Node teams, Crawlee. For very simple JS sites, scrapy-splash still works but is on its way out.

For the alternative Crawlee path, see Scrapy Cloud vs Crawlee Cloud in 2026.

Item pipelines and Playwright

Item pipelines work identically whether the request was HTML-only or Playwright. The pipeline sees a Scrapy Item (or dict) and processes it. No changes needed:

# pipelines.py
import json
from datetime import datetime


class JsonExportPipeline:
    def open_spider(self, spider):
        self.file = open(f"output/{spider.name}_{datetime.now():%Y%m%d_%H%M%S}.jsonl", "w")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

This is one of the main wins of scrapy-playwright over rolling your own: you get to keep all your existing Scrapy pipelines and middleware.

Common patterns

Wait for specific elements before extracting:

yield scrapy.Request(
    url,
    meta={
        "playwright": True,
        "playwright_page_methods": [
            PageMethod("wait_for_selector", "div.product-loaded"),
            PageMethod("wait_for_load_state", "networkidle"),
        ],
    },
)

Click a button to load content:

yield scrapy.Request(
    url,
    meta={
        "playwright": True,
        "playwright_page_methods": [
            PageMethod("click", "button.show-more"),
            PageMethod("wait_for_selector", "div.expanded-content"),
        ],
    },
)

Take a screenshot for debugging:

yield scrapy.Request(
    url,
    meta={
        "playwright": True,
        "playwright_page_methods": [
            PageMethod("screenshot", path="debug.png", full_page=True),
        ],
    },
)

Intercept network requests:

async def parse(self, response):
    page = response.meta["playwright_page"]
    # Capture all XHR responses
    captured_data = []
    page.on("response", lambda r: captured_data.append(r.url))
    await page.wait_for_load_state("networkidle")
    await page.close()
    yield {"all_xhr_urls": captured_data}

Production deployment

A production scrapy-playwright deployment looks like:

Compute: container with Chromium installed (use playwright-base images)
Concurrency: tune CONCURRENT_REQUESTS based on RAM (each Playwright context is ~100 MB)
Proxy: routed through Squid or commercial pool
Stealth: patchright as drop-in
Monitoring: Scrapy stats + Playwright context counter
Persistence: items to a database or queue, not just files
Scheduling: scrapyd, scrapy-cluster, or Airflow

A reasonable container Dockerfile:

FROM mcr.microsoft.com/playwright/python:v1.45.0-jammy

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

ENV PLAYWRIGHT_BROWSERS_PATH=/ms-playwright

CMD ["scrapy", "crawl", "products"]

Use the playwright-python base image to skip installing Chromium and dependencies separately.

Operational checklist

For production scrapy-playwright deployments in 2026:

Scrapy 2.11+, scrapy-playwright 0.0.40+
Use playwright-python base image for containers
patchright for stealth
Per-request opt-in to Playwright via meta flag
Always close pages in callbacks and errbacks
Use named contexts to reuse across requests
Tune CONCURRENT_REQUESTS based on RAM
Set navigation timeouts (30s default is reasonable)
Log per-spider stats including Playwright vs HTML request count
Pair with proxy rotation middleware
Monitor RAM usage closely (Playwright contexts leak if not closed)

FAQ

Q: can I use Playwright with Scrapy without scrapy-playwright?
You can, but it is painful. You would have to write your own download handler, manage the browser lifecycle, and integrate with Scrapy’s async model. scrapy-playwright handles all this for you.

Q: what about scrapy-splash?
Splash is essentially unmaintained as of 2024-2025. The JS engine is older and many modern SPAs do not render correctly. Use scrapy-playwright unless you have legacy reasons to stay on Splash.

Q: how do I handle Playwright authentication and cookies?
Pass cookies via the request meta or set them on the context:

meta={"playwright": True, "playwright_context_kwargs": {"storage_state": "auth.json"}}

auth.json is generated by saving Playwright’s context.storage_state() after login.

Q: can I run scrapy-playwright on AWS Lambda?
Technically yes via Playwright’s Lambda layers, but the cold start is brutal (10+ seconds per invocation just to launch Chromium). For Lambda, prefer dedicated browser services like Browserbase or run scrapy-playwright on Fargate or EKS instead.

Q: does scrapy-playwright support Firefox?
Yes. Set PLAYWRIGHT_BROWSER_TYPE = "firefox" and run playwright install firefox. Most patches work on Firefox too, though some stealth libraries are Chromium-focused.

Common pitfalls in production scrapy-playwright

The first failure mode that catches teams off guard is the page-object leak. scrapy-playwright passes the open Page to your callback via response.meta["playwright_page"], but if your parse function raises an exception or yields requests without closing the page, the browser context retains a reference. After 50-100 leaked pages, Chromium hits its per-context limit (typically 30 tabs depending on RAM) and new requests stall waiting for tab slots. The fix is a try/finally pattern in every parse method that touches playwright_page:

async def parse_product(self, response):
    page = response.meta.get("playwright_page")
    try:
        await page.wait_for_selector(".product-loaded", timeout=10000)
        title = await page.text_content("h1")
        yield {"title": title}
    except Exception as e:
        self.logger.error(f"parse failed: {e}")
        raise
    finally:
        if page:
            await page.close()

Add an errback that also closes the page on download failure, otherwise pages leak from network errors that never reach your parse callback.

The second pitfall is the headers-vs-Playwright disconnect. Setting headers on a scrapy.Request does not automatically pass them to Playwright when playwright=True. Scrapy treats those headers as the eventual download request, but Playwright bypasses Scrapy’s downloader entirely and uses its own. The result: your User-Agent rotation middleware appears to set a Chrome 124 UA but Playwright launches with whatever UA the browser instance has. The fix is to set headers via playwright_context_kwargs on each request:

yield scrapy.Request(
    url,
    meta={
        "playwright": True,
        "playwright_context_kwargs": {
            "user_agent": rotated_ua,
            "extra_http_headers": rotated_headers,
        },
    },
)

Each new context spawn now gets the rotated UA and headers. Beware: spawning a new context per request defeats Playwright’s connection reuse, so pair this with named contexts to amortize the cost across same-UA requests.

The third pitfall is silent JavaScript navigation that bypasses Scrapy’s URL tracking. When a page does window.location.href = "/next-page" mid-load, Playwright follows it and ends up at a URL different from what you originally requested. Scrapy’s response.url reflects the post-navigation URL, but your duplicate-detection middleware only saw the original URL. You can end up scraping the same final page repeatedly through different entry-point URLs. The fix is to track final URLs in your DupeFilter:

from scrapy.dupefilters import RFPDupeFilter
import hashlib

class FinalUrlDupeFilter(RFPDupeFilter):
    def request_seen(self, request):
        # If response has been received, dedupe on response.url too
        final_url = request.meta.get("playwright_final_url")
        if final_url:
            fp = hashlib.sha256(final_url.encode()).hexdigest()
            if fp in self.fingerprints:
                return True
            self.fingerprints.add(fp)
        return super().request_seen(request)

Set playwright_final_url in your parse callback after capturing page.url, then re-yield the request through the dedupe filter.

Real-world example: 50,000-product crawl with mixed HTML and Playwright

A scrapy-playwright project crawled an electronics retailer with 50,000 products across 800 category pages. Initial implementation used Playwright for everything and took 18 hours per full crawl with 90 percent CPU usage on a 16-core machine. The optimization that cut this to 2.5 hours was a tiered request strategy that classified URL patterns and used the cheapest method per pattern:

class TieredSpider(scrapy.Spider):
    name = "products_tiered"

    def start_requests(self):
        # Category pages: pure HTML, server-rendered
        for cat_url in self.category_urls:
            yield scrapy.Request(cat_url, callback=self.parse_category)

    def parse_category(self, response):
        # Extract product URLs from server-rendered HTML
        for url in response.css(".product-link::attr(href)").getall():
            full_url = response.urljoin(url)
            # Tier 1: standard product page (HTML works)
            if "/p/" in full_url:
                yield scrapy.Request(full_url, callback=self.parse_product_html)
            # Tier 2: SPA product detail (Playwright needed)
            elif "/spa-product/" in full_url:
                yield scrapy.Request(
                    full_url,
                    meta={
                        "playwright": True,
                        "playwright_page_methods": [
                            PageMethod("wait_for_selector", ".price-loaded"),
                        ],
                    },
                    callback=self.parse_product_playwright,
                )

    def parse_product_html(self, response):
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
            "price": response.css(".price::text").get(),
        }

    async def parse_product_playwright(self, response):
        page = response.meta["playwright_page"]
        try:
            title = await page.text_content("h1")
            price = await page.text_content(".price")
            yield {"url": response.url, "title": title, "price": price}
        finally:
            await page.close()

Of the 50,000 products, only 8,000 needed Playwright (the SPA product pages). The remaining 42,000 went through pure Scrapy HTML at 200+ requests/second. Playwright handled the 8,000 SPA pages at 5 requests/second. Total throughput: 5.5 hours for the HTML tier and 27 minutes for the SPA tier, running concurrently in 2.5 hours total. The lesson: classify your URLs by render type and route accordingly. Default-to-Playwright is the most expensive choice you can make.

Comparison: scrapy-playwright settings tuning by site type

Different site types need different tuning profiles. A reference table from a 2026 production deployment:

site type	CONCURRENT_REQUESTS	PLAYWRIGHT_MAX_CONTEXTS	nav timeout	wait strategy
ecommerce SPA	16	8	30s	wait_for_selector on price
news article	32	16	15s	wait_for_load_state domcontentloaded
listing with infinite scroll	8	4	60s	manual scroll + wait_for_response
dashboard behind login	4	2	45s	wait_for_url + wait_for_selector
pricing page (heavy JS)	16	8	30s	wait_for_function on data-loaded attr

For listing pages with infinite scroll, use a manual scroll loop:

async def scroll_until_done(page, max_scrolls=20):
    for i in range(max_scrolls):
        prev_height = await page.evaluate("document.body.scrollHeight")
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        await page.wait_for_timeout(1500)
        new_height = await page.evaluate("document.body.scrollHeight")
        if new_height == prev_height:
            break  # no more content loaded

Without the scroll-until-done pattern, you get only the first viewport’s worth of items even when the listing has 500+. Pair with a per-spider MAX_SCROLL_ITERATIONS setting so a runaway page does not loop forever.

Memory profiling: detecting Playwright leaks

Production scrapy-playwright deployments suffer from RAM growth that takes hours to manifest. The cause is almost always context reuse without bounded lifetime. Each long-lived context accumulates DOM nodes, IndexedDB entries, and service worker state. After 4-6 hours of constant scraping through one context, RAM usage doubles. The detection pattern:

import psutil
import os

class MemoryMonitorMiddleware:
    def __init__(self):
        self.process = psutil.Process(os.getpid())
        self.baseline_mb = self.process.memory_info().rss / 1024 / 1024
        self.requests_processed = 0

    def process_response(self, request, response, spider):
        self.requests_processed += 1
        if self.requests_processed % 100 == 0:
            current_mb = self.process.memory_info().rss / 1024 / 1024
            growth = current_mb - self.baseline_mb
            spider.logger.info(
                f"requests={self.requests_processed} "
                f"rss={current_mb:.0f}MB growth={growth:.0f}MB"
            )
            if growth > 2000:  # 2GB growth, recycle contexts
                spider.logger.warning("memory bloat detected, recycling contexts")
                # Trigger context recycling in your context manager
        return response

Recycle contexts every 1000-2000 requests by closing them and letting scrapy-playwright spawn fresh ones. This caps RAM growth at the cost of one cold-context launch per recycle (about 800ms). For long-running spiders, this is a worthwhile trade.

Wrapping up

Scrapy + Playwright is the right combination for any Python scraping project that mixes simple HTML pages with JavaScript-heavy pages. scrapy-playwright glues them cleanly enough that you keep all of Scrapy’s pipeline ergonomics while getting Playwright’s rendering capability where you need it. Pair this guide with our Scrapy Cloud vs Crawlee Cloud and best Python scraping libraries 2026 writeups, and browse the framework-tutorials category on DRT for related tutorials.