Scrapy + Playwright integration in 2026
Scrapy + Playwright integration is the standard answer when Scrapy’s speed and pipeline ergonomics meet pages that need a real browser to render. Scrapy alone sustains hundreds of requests per second through asynchronous Twisted callbacks but cannot execute JavaScript. Playwright can drive a real browser through any page but handles one page at a time and gives you none of Scrapy’s middleware, item pipelines, or deduplication. The scrapy-playwright plugin bridges them: Scrapy still handles the orchestration, scheduling, and pipelines, while pages route through Playwright only when needed.
This guide covers scrapy-playwright 0.0.40+ on Scrapy 2.11+ in 2026, the setup steps, the page coroutines pattern, proxy and stealth integration, and production patterns that scale. Code is Python 3.12 throughout. By the end you will have a Scrapy project that can mix HTML-only requests (fast) and Playwright-rendered requests (slow but JS-capable) in one spider, with shared middleware and pipelines.
Why this combination wins
Scrapy alone is unbeatable for HTML-only scraping at scale. Its async model, request prioritization, deduplication, and pipeline architecture are mature. The weakness: a JavaScript-rendered page comes back as an unrendered HTML shell without the content you want.
Playwright alone handles JS perfectly but has no concept of crawl orchestration. You write your own request queue, deduplication, retry logic, and item processing.
scrapy-playwright lets each page in a Scrapy spider opt into Playwright via a metadata flag. Pages without the flag stay as fast HTML requests. Pages with the flag route through a Playwright pool, render fully, and return rendered HTML to your callback as if Scrapy fetched it directly.
For Scrapy’s official docs, see Scrapy 2.11+ documentation. For Playwright’s, see Playwright Python documentation.
Installation
pip install scrapy scrapy-playwright playwright
playwright install chromium
Verify versions:
scrapy version # 2.11.0 or later
python -c "import scrapy_playwright; print(scrapy_playwright.__version__)"
# 0.0.40 or later
Project setup
Create a Scrapy project:
scrapy startproject dynamic_scraper
cd dynamic_scraper
Edit dynamic_scraper/settings.py to enable scrapy-playwright:
# settings.py
DOWNLOAD_HANDLERS = {
"http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
"https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {
"headless": True,
"args": [
"--disable-blink-features=AutomationControlled",
"--disable-features=IsolateOrigins",
],
}
# How many concurrent Playwright contexts to maintain
PLAYWRIGHT_MAX_CONTEXTS = 8
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 30000
# Reasonable defaults for mixed crawls
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 4
DOWNLOAD_DELAY = 0
RETRY_TIMES = 3
# Standard middleware for header rotation, etc
DOWNLOADER_MIDDLEWARES = {
"scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
"scrapy_user_agents.middlewares.RandomUserAgentMiddleware": 400,
"dynamic_scraper.middlewares.ProxyMiddleware": 410,
}
# Item pipeline (whatever you need)
ITEM_PIPELINES = {
"dynamic_scraper.pipelines.JsonExportPipeline": 300,
}
The download handlers route every HTTP request through scrapy-playwright. Pages without the playwright meta flag are still fetched via regular HTTP (scrapy-playwright detects this automatically), so you do not pay the browser cost for pages that do not need it.
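For example, inside any spider callback the two paths differ only in meta (URLs here are placeholders):
yield scrapy.Request("https://example.com/sitemap")  # plain HTTP, fast path
yield scrapy.Request(
    "https://example.com/app",
    meta={"playwright": True},  # rendered in a real browser
)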
A first spider mixing HTML and Playwright requests
Most real spiders mix request types. The product listing might be static HTML, but the product detail might require JavaScript. scrapy-playwright handles this cleanly:
# spiders/products.py
import scrapy
from scrapy_playwright.page import PageMethod
class ProductSpider(scrapy.Spider):
name = "products"
allowed_domains = ["example.com"]
start_urls = ["https://example.com/products"]
def parse(self, response):
# Listing page is plain HTML, no Playwright needed
for product_url in response.css("a.product-link::attr(href)").getall():
yield response.follow(
product_url,
callback=self.parse_product,
meta={
"playwright": True,
"playwright_include_page": False,
"playwright_page_methods": [
PageMethod("wait_for_selector", "div.product-detail"),
],
},
)
next_page = response.css("a.next::attr(href)").get()
if next_page:
yield response.follow(next_page, callback=self.parse)
def parse_product(self, response):
yield {
"title": response.css("h1.product-title::text").get(),
"price": response.css("span.price-current::text").get(),
"stock": response.css("div.stock-info::text").get(),
"description": response.css("div.product-description").get(),
}
Key points:
- The listing page is fetched normally (no playwright: True in meta)
- Product detail pages are fetched via Playwright (playwright: True)
- playwright_page_methods runs on the Playwright page before the HTML is returned
- wait_for_selector ensures the JavaScript-rendered detail is present before scrapy-playwright captures the HTML
This pattern keeps the spider’s overall structure conventional while opting specific pages into the heavyweight rendering path.
Interacting with the page
Some scraping requires interaction (click a “Show more” button, scroll to load infinite content, fill a search form). For this, request the actual Playwright Page object back via playwright_include_page=True:
import scrapy
from scrapy.http import HtmlResponse
from scrapy_playwright.page import PageMethod
class InfiniteScrollSpider(scrapy.Spider):
name = "infinite"
start_urls = ["https://example.com/feed"]
def start_requests(self):
for url in self.start_urls:
yield scrapy.Request(
url,
meta={
"playwright": True,
"playwright_include_page": True,
"playwright_page_methods": [
PageMethod("wait_for_selector", "article.feed-item"),
],
},
callback=self.parse_feed,
errback=self.errback_close_page,
)
async def parse_feed(self, response):
page = response.meta["playwright_page"]
# Scroll to load more items
for _ in range(5):
await page.evaluate("window.scrollBy(0, document.body.scrollHeight)")
await page.wait_for_timeout(2000)
html = await page.content()
await page.close()
        # Parse the now-loaded HTML with Scrapy selectors
        new_response = HtmlResponse(url=response.url, body=html, encoding="utf-8")
for article in new_response.css("article.feed-item"):
yield {
"title": article.css("h2::text").get(),
"url": article.css("a::attr(href)").get(),
"summary": article.css("p.summary::text").get(),
}
async def errback_close_page(self, failure):
page = failure.request.meta.get("playwright_page")
if page:
await page.close()
With playwright_include_page=True, the Page object is attached to the response. Your callback can interact with it (scroll, click, wait) before extracting HTML. Always close the page in your callback or errback to avoid leaking browser contexts.
Proxy integration
Pass proxies to Playwright either at the launch level (one proxy for all contexts) or per-context (different proxy per request). Per-context is the more flexible pattern.
# settings.py addition
PLAYWRIGHT_CONTEXTS = {
"default": {
"proxy": {
"server": "http://squid.internal:3128",
"username": "scraper_user",
"password": "secret",
},
},
}
For per-request proxies via a custom middleware:
# middlewares.py
import random
class ProxyRotationMiddleware:
PROXIES = [
"http://user1:pass1@proxy1.provider.com:8080",
"http://user2:pass2@proxy2.provider.com:8080",
"http://user3:pass3@proxy3.provider.com:8080",
]
def process_request(self, request, spider):
if request.meta.get("playwright"):
# For Playwright requests, set the context with a specific proxy
proxy = random.choice(self.PROXIES)
user_pass, host_port = proxy.split("//")[1].split("@")
user, password = user_pass.split(":")
host, port = host_port.split(":")
request.meta["playwright_context"] = f"proxy_{hash(proxy)}"
request.meta["playwright_context_kwargs"] = {
"proxy": {
"server": f"http://{host}:{port}",
"username": user,
"password": password,
},
}
else:
# For regular HTTP requests, set the standard proxy meta
request.meta["proxy"] = random.choice(self.PROXIES)
scrapy-playwright caches contexts by name, so reusing the same proxy uses the same context. This avoids spawning a new browser context per request, which is expensive.
Stealth: integrating patchright
For sites that fingerprint, use patchright (a stealth fork of Playwright) instead of vanilla Playwright. patchright mirrors Playwright’s API, but it ships under its own package name, and scrapy-playwright imports the playwright module by name, so the swap does not happen automatically. One approach is to alias the module before scrapy-playwright imports it; the sketch below assumes patchright’s module layout matches vanilla Playwright (it aims to be a drop-in fork):
# settings.py (or any module imported before scrapy-playwright)
# Sketch: alias patchright under the "playwright" name so scrapy-playwright's
# imports resolve to the patched implementation.
import sys
import patchright
import patchright.async_api
sys.modules["playwright"] = patchright
sys.modules["playwright.async_api"] = patchright.async_api
Install patchright and its browser build:
pip install patchright
patchright install chromium
With the alias in place, browser launches go through patchright, which applies its stealth patches at launch.
For fine-grained stealth on a per-request basis, inject scripts through a page init callback. scrapy-playwright’s playwright_page_init_callback meta key runs a coroutine for each newly created page before navigation, which is the place to call Playwright’s add_init_script:
async def hide_automation(page, request):
    # Runs in every new document before the page's own scripts execute
    await page.add_init_script("""
        Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
        Object.defineProperty(navigator, 'plugins', { get: () => [1, 2, 3] });
    """)

yield scrapy.Request(
    url,
    meta={
        "playwright": True,
        "playwright_context_kwargs": {
            "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        },
        "playwright_page_init_callback": hide_automation,
    },
)
For the full picture on browser fingerprinting, see canvas fingerprinting bypass and WebGL fingerprinting bypass.
Performance: how much does Playwright slow you down?
Throughput comparison on a single worker:
| request type | requests/sec |
|---|---|
| pure Scrapy HTML | 200-500 |
| scrapy-playwright (cached context, simple page) | 5-10 |
| scrapy-playwright (new context, complex SPA) | 1-2 |
| scrapy-playwright with full humanization | 0.3-0.5 |
The cost is real. Use Playwright only for pages that need it. The mixed pattern (HTML for listings, Playwright for details) is the right balance for most sites.
To scale Playwright throughput:
- Increase PLAYWRIGHT_MAX_CONTEXTS (each context costs ~100 MB RAM); see the settings sketch after this list
- Reuse contexts across requests by giving them stable names
- Run multiple Scrapy instances on different machines
- Keep navigation timeout tight (30 seconds is usually enough)
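A tuning sketch for a box with ~16 GB free, using illustrative numbers (PLAYWRIGHT_MAX_PAGES_PER_CONTEXT caps concurrent tabs per context):
# settings.py -- illustrative numbers, not benchmarks
PLAYWRIGHT_MAX_CONTEXTS = 16           # ~1.6 GB of browser contexts
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 4   # cap concurrent tabs per context
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 30000
CONCURRENT_REQUESTS = 64               # HTML requests stay cheap; browser requests
                                       # are bounded by the context/page caps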
Comparison: scrapy-playwright vs alternatives
| approach | strengths | weaknesses |
|---|---|---|
| scrapy-playwright | mixed HTML/JS in one project, native Scrapy pipelines | adds Playwright overhead even for simple sites |
| scrapy-splash | older, lighter than Playwright | Splash is essentially abandoned, JS engine is dated |
| scrapy + selenium-wire | works | slow, fragile |
| scrapy + standalone Playwright service | clean separation | extra service to operate |
| pure Playwright | most JS-capable | no Scrapy pipelines, write your own everything |
| Crawlee (Node) | similar pattern in Node ecosystem | requires switching off Python |
For most Python teams in 2026, scrapy-playwright is the right pick. For Node teams, Crawlee. For very simple JS sites, scrapy-splash still works but is on its way out.
For the alternative Crawlee path, see Scrapy Cloud vs Crawlee Cloud in 2026.
Item pipelines and Playwright
Item pipelines work identically whether the request was HTML-only or Playwright. The pipeline sees a Scrapy Item (or dict) and processes it. No changes needed:
# pipelines.py
import json
import os
from datetime import datetime

class JsonExportPipeline:
    def open_spider(self, spider):
        os.makedirs("output", exist_ok=True)  # ensure the export dir exists
        self.file = open(f"output/{spider.name}_{datetime.now():%Y%m%d_%H%M%S}.jsonl", "w")
def close_spider(self, spider):
self.file.close()
def process_item(self, item, spider):
line = json.dumps(dict(item)) + "\n"
self.file.write(line)
return item
This is one of the main wins of scrapy-playwright over rolling your own: you get to keep all your existing Scrapy pipelines and middleware.
Common patterns
Wait for specific elements before extracting:
yield scrapy.Request(
url,
meta={
"playwright": True,
"playwright_page_methods": [
PageMethod("wait_for_selector", "div.product-loaded"),
PageMethod("wait_for_load_state", "networkidle"),
],
},
)
Click a button to load content:
yield scrapy.Request(
url,
meta={
"playwright": True,
"playwright_page_methods": [
PageMethod("click", "button.show-more"),
PageMethod("wait_for_selector", "div.expanded-content"),
],
},
)
Take a screenshot for debugging:
yield scrapy.Request(
url,
meta={
"playwright": True,
"playwright_page_methods": [
PageMethod("screenshot", path="debug.png", full_page=True),
],
},
)
Intercept network requests:
async def parse(self, response):
    page = response.meta["playwright_page"]
    captured_data = []

    def on_response(r):
        # Only record XHR/fetch responses, not every document/image/script
        if r.request.resource_type in ("xhr", "fetch"):
            captured_data.append(r.url)

    page.on("response", on_response)
    await page.wait_for_load_state("networkidle")
    await page.close()
    yield {"all_xhr_urls": captured_data}
Production deployment
A production scrapy-playwright deployment looks like:
- Compute: container with Chromium installed (use playwright-base images)
- Concurrency: tune CONCURRENT_REQUESTS based on RAM (each Playwright context is ~100 MB)
- Proxy: routed through Squid or commercial pool
- Stealth: patchright as drop-in
- Monitoring: Scrapy stats + Playwright context counter
- Persistence: items to a database or queue, not just files
- Scheduling: scrapyd, scrapy-cluster, or Airflow
A reasonable container Dockerfile:
FROM mcr.microsoft.com/playwright/python:v1.45.0-jammy
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
ENV PLAYWRIGHT_BROWSERS_PATH=/ms-playwright
CMD ["scrapy", "crawl", "products"]
Use the playwright-python base image to skip installing Chromium and dependencies separately.
Operational checklist
For production scrapy-playwright deployments in 2026:
- Scrapy 2.11+, scrapy-playwright 0.0.40+
- Use playwright-python base image for containers
- patchright for stealth
- Per-request opt-in to Playwright via meta flag
- Always close pages in callbacks and errbacks
- Use named contexts to reuse across requests
- Tune CONCURRENT_REQUESTS based on RAM
- Set navigation timeouts (30s default is reasonable)
- Log per-spider stats including Playwright vs HTML request count (see the middleware sketch after this list)
- Pair with proxy rotation middleware
- Monitor RAM usage closely (Playwright contexts leak if not closed)
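The stats item above is a one-middleware job; a sketch with a hypothetical middleware name, registered in DOWNLOADER_MIDDLEWARES like any other:
# middlewares.py -- hypothetical, counts Playwright vs HTML requests in Scrapy stats
class RequestTypeStatsMiddleware:
    def process_request(self, request, spider):
        key = ("request_type/playwright"
               if request.meta.get("playwright")
               else "request_type/http")
        spider.crawler.stats.inc_value(key)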
FAQ
Q: can I use Playwright with Scrapy without scrapy-playwright?
You can, but it is painful. You would have to write your own download handler, manage the browser lifecycle, and integrate with Scrapy’s async model. scrapy-playwright handles all this for you.
Q: what about scrapy-splash?
Splash is essentially unmaintained as of 2024-2025. The JS engine is older and many modern SPAs do not render correctly. Use scrapy-playwright unless you have legacy reasons to stay on Splash.
Q: how do I handle Playwright authentication and cookies?
Pass cookies via the request meta or set them on the context:
meta={"playwright": True, "playwright_context_kwargs": {"storage_state": "auth.json"}}
auth.json is generated by saving Playwright’s context.storage_state() after login.
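A one-time login sketch with plain Playwright’s sync API; the URL and selectors are placeholders for your login flow:
# generate_auth.py -- run once, then ship auth.json with the spider
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://example.com/login")
    page.fill("#username", "user")
    page.fill("#password", "pass")
    page.click("button[type=submit]")
    page.wait_for_url("**/account")               # wait until login completes
    page.context.storage_state(path="auth.json")  # persist cookies + local storage
    browser.close()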
Q: can I run scrapy-playwright on AWS Lambda?
Technically yes via Playwright’s Lambda layers, but the cold start is brutal (10+ seconds per invocation just to launch Chromium). For Lambda, prefer dedicated browser services like Browserbase or run scrapy-playwright on Fargate or EKS instead.
Q: does scrapy-playwright support Firefox?
Yes. Set PLAYWRIGHT_BROWSER_TYPE = "firefox" and run playwright install firefox. Most patches work on Firefox too, though some stealth libraries are Chromium-focused.
Common pitfalls in production scrapy-playwright
The first failure mode that catches teams off guard is the page-object leak. scrapy-playwright passes the open Page to your callback via response.meta["playwright_page"], but if your parse function raises an exception or yields requests without closing the page, the browser context retains a reference. After 50-100 leaked pages, Chromium hits its per-context limit (typically 30 tabs depending on RAM) and new requests stall waiting for tab slots. The fix is a try/finally pattern in every parse method that touches playwright_page:
async def parse_product(self, response):
    page = response.meta.get("playwright_page")  # set when playwright_include_page=True
try:
await page.wait_for_selector(".product-loaded", timeout=10000)
title = await page.text_content("h1")
yield {"title": title}
except Exception as e:
self.logger.error(f"parse failed: {e}")
raise
finally:
if page:
await page.close()
Add an errback that also closes the page on download failure, otherwise pages leak from network errors that never reach your parse callback.
The second pitfall is the headers-vs-Playwright disconnect. Headers set on a scrapy.Request are applied, by default, only to Playwright’s navigation request; subresource requests and the browser’s own fingerprint are untouched, so navigator.userAgent and client hints still report whatever UA the browser instance launched with. The result: your User-Agent rotation middleware appears to set a Chrome 124 UA, but JavaScript on the page sees the real one, and the mismatch is a detection signal. The fix is to set the UA at the context level via playwright_context_kwargs on each request:
yield scrapy.Request(
url,
meta={
"playwright": True,
"playwright_context_kwargs": {
"user_agent": rotated_ua,
"extra_http_headers": rotated_headers,
},
},
)
Each new context spawn now gets the rotated UA and headers. Beware: spawning a new context per request defeats Playwright’s connection reuse, so pair this with named contexts to amortize the cost across same-UA requests.
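A naming sketch: derive a stable context name from the UA so same-UA requests share one cached context (md5 is just a short stable key here, not security):
import hashlib

ua_key = hashlib.md5(rotated_ua.encode()).hexdigest()[:8]
request.meta["playwright_context"] = f"ua_{ua_key}"
request.meta["playwright_context_kwargs"] = {"user_agent": rotated_ua}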
The third pitfall is silent JavaScript navigation that bypasses Scrapy’s URL tracking. When a page does window.location.href = "/next-page" mid-load, Playwright follows it and ends up at a URL different from what you originally requested. Scrapy’s response.url reflects the post-navigation URL, but your duplicate-detection middleware only saw the original URL. You can end up scraping the same final page repeatedly through different entry-point URLs. The fix is to track final URLs in your DupeFilter:
from scrapy.dupefilters import RFPDupeFilter
import hashlib
class FinalUrlDupeFilter(RFPDupeFilter):
def request_seen(self, request):
# If response has been received, dedupe on response.url too
final_url = request.meta.get("playwright_final_url")
if final_url:
fp = hashlib.sha256(final_url.encode()).hexdigest()
if fp in self.fingerprints:
return True
self.fingerprints.add(fp)
return super().request_seen(request)
Set playwright_final_url in your parse callback after capturing page.url, then re-yield the request through the dedupe filter.
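A sketch of that re-yield step: the copy is dropped by the base filter (same request fingerprint), but the final URL’s fingerprint gets recorded along the way:
async def parse(self, response):
    page = response.meta["playwright_page"]
    final_url = page.url  # post-JS-navigation URL
    await page.close()
    if final_url != response.url:
        yield response.request.replace(
            meta={**response.request.meta, "playwright_final_url": final_url},
        )
    # ... normal extraction continues here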
Real-world example: 50,000-product crawl with mixed HTML and Playwright
A scrapy-playwright project crawled an electronics retailer with 50,000 products across 800 category pages. Initial implementation used Playwright for everything and took 18 hours per full crawl with 90 percent CPU usage on a 16-core machine. The optimization that cut this to 2.5 hours was a tiered request strategy that classified URL patterns and used the cheapest method per pattern:
class TieredSpider(scrapy.Spider):
name = "products_tiered"
def start_requests(self):
        # Category pages: pure HTML, server-rendered.
        # category_urls is assumed to be supplied (spider argument, file, etc.)
        for cat_url in self.category_urls:
yield scrapy.Request(cat_url, callback=self.parse_category)
def parse_category(self, response):
# Extract product URLs from server-rendered HTML
for url in response.css(".product-link::attr(href)").getall():
full_url = response.urljoin(url)
# Tier 1: standard product page (HTML works)
if "/p/" in full_url:
yield scrapy.Request(full_url, callback=self.parse_product_html)
# Tier 2: SPA product detail (Playwright needed)
elif "/spa-product/" in full_url:
yield scrapy.Request(
full_url,
                    meta={
                        "playwright": True,
                        "playwright_include_page": True,  # callback uses the page object
                        "playwright_page_methods": [
                            PageMethod("wait_for_selector", ".price-loaded"),
                        ],
                    },
callback=self.parse_product_playwright,
)
def parse_product_html(self, response):
yield {
"url": response.url,
"title": response.css("h1::text").get(),
"price": response.css(".price::text").get(),
}
async def parse_product_playwright(self, response):
page = response.meta["playwright_page"]
try:
title = await page.text_content("h1")
price = await page.text_content(".price")
yield {"url": response.url, "title": title, "price": price}
finally:
await page.close()
Of the 50,000 products, only 8,000 needed Playwright (the SPA product pages). The remaining 42,000 went through pure Scrapy HTML requests, fast in bursts but bounded in practice by per-domain concurrency limits, which made the HTML tier the long pole of the crawl. Playwright handled the 8,000 SPA pages at 5 requests/second, finishing in about 27 minutes. The two tiers ran concurrently and the full crawl completed in 2.5 hours. The lesson: classify your URLs by render type and route accordingly. Default-to-Playwright is the most expensive choice you can make.
Comparison: scrapy-playwright settings tuning by site type
Different site types need different tuning profiles. A reference table from a 2026 production deployment:
| site type | CONCURRENT_REQUESTS | PLAYWRIGHT_MAX_CONTEXTS | nav timeout | wait strategy |
|---|---|---|---|---|
| ecommerce SPA | 16 | 8 | 30s | wait_for_selector on price |
| news article | 32 | 16 | 15s | wait_for_load_state domcontentloaded |
| listing with infinite scroll | 8 | 4 | 60s | manual scroll + wait_for_response |
| dashboard behind login | 4 | 2 | 45s | wait_for_url + wait_for_selector |
| pricing page (heavy JS) | 16 | 8 | 30s | wait_for_function on data-loaded attr |
For listing pages with infinite scroll, use a manual scroll loop:
async def scroll_until_done(page, max_scrolls=20):
for i in range(max_scrolls):
prev_height = await page.evaluate("document.body.scrollHeight")
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await page.wait_for_timeout(1500)
new_height = await page.evaluate("document.body.scrollHeight")
if new_height == prev_height:
break # no more content loaded
Without the scroll-until-done pattern, you get only the first viewport’s worth of items even when the listing has 500+. Pair with a per-spider MAX_SCROLL_ITERATIONS setting so a runaway page does not loop forever.
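Wiring that hypothetical setting through is two lines inside the parse callback (the setting name is illustrative; self.settings is the spider’s bound settings object):
# inside an async parse callback, after getting the page object
max_scrolls = self.settings.getint("MAX_SCROLL_ITERATIONS", 20)
await scroll_until_done(page, max_scrolls=max_scrolls)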
Memory profiling: detecting Playwright leaks
Production scrapy-playwright deployments suffer from RAM growth that takes hours to manifest. The cause is almost always context reuse without bounded lifetime. Each long-lived context accumulates DOM nodes, IndexedDB entries, and service worker state. After 4-6 hours of constant scraping through one context, RAM usage doubles. The detection pattern:
import psutil
import os
class MemoryMonitorMiddleware:
def __init__(self):
self.process = psutil.Process(os.getpid())
self.baseline_mb = self.process.memory_info().rss / 1024 / 1024
self.requests_processed = 0
def process_response(self, request, response, spider):
self.requests_processed += 1
if self.requests_processed % 100 == 0:
current_mb = self.process.memory_info().rss / 1024 / 1024
growth = current_mb - self.baseline_mb
spider.logger.info(
f"requests={self.requests_processed} "
f"rss={current_mb:.0f}MB growth={growth:.0f}MB"
)
if growth > 2000: # 2GB growth, recycle contexts
spider.logger.warning("memory bloat detected, recycling contexts")
# Trigger context recycling in your context manager
return response
Recycle contexts every 1000-2000 requests by closing them and letting scrapy-playwright spawn fresh ones. This caps RAM growth at the cost of one cold-context launch per recycle (about 800ms). For long-running spiders, this is a worthwhile trade.
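A recycling sketch from inside a callback, assuming playwright_include_page=True and a ctx_counts dict initialized in the spider’s __init__; recent scrapy-playwright versions recreate a closed named context on the next request that asks for it, but verify against your version:
async def parse(self, response):
    page = response.meta["playwright_page"]
    ctx_name = response.meta.get("playwright_context", "default")
    self.ctx_counts[ctx_name] = self.ctx_counts.get(ctx_name, 0) + 1
    try:
        yield {"url": response.url, "title": await page.title()}
    finally:
        if self.ctx_counts[ctx_name] >= 1500:
            self.ctx_counts[ctx_name] = 0
            await page.context.close()  # close the whole context, not just the tab
        else:
            await page.close()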
Wrapping up
Scrapy + Playwright is the right combination for any Python scraping project that mixes simple HTML pages with JavaScript-heavy pages. scrapy-playwright glues them cleanly enough that you keep all of Scrapy’s pipeline ergonomics while getting Playwright’s rendering capability where you need it. Pair this guide with our Scrapy Cloud vs Crawlee Cloud and best Python scraping libraries 2026 writeups, and browse the framework-tutorials category on DRT for related tutorials.