Scrapy + Playwright: Advanced JS Scraping
Scrapy excels at large-scale crawling but cannot render JavaScript. Playwright excels at browser automation but lacks crawling infrastructure. The scrapy-playwright plugin combines both — you get Scrapy’s spiders, pipelines, middleware, and concurrency management with Playwright’s JavaScript rendering, network interception, and page interaction.
This tutorial covers installation, configuration, handling JS-rendered pages, page interaction, performance optimization, and proxy integration.
Table of Contents
- Why Combine Scrapy and Playwright
- Installation
- Basic Configuration
- Your First Scrapy-Playwright Spider
- Page Interaction
- Network Interception
- Handling SPAs
- Performance Optimization
- Proxy Integration
- Complete Example
- FAQ
Why Combine Scrapy and Playwright
| Challenge | Scrapy Alone | Playwright Alone | Scrapy + Playwright |
|---|---|---|---|
| JS rendering | Cannot | Can | Can |
| Concurrency | Built-in | Manual | Built-in |
| Data pipelines | Built-in | Manual | Built-in |
| Rate limiting | Built-in | Manual | Built-in |
| Link following | Built-in | Manual | Built-in |
| Page interaction | Cannot | Can | Can |
| Scale (1M+ pages) | Excellent | Poor | Good |
Installation
```bash
pip install scrapy scrapy-playwright
playwright install chromium
```
Basic Configuration
Add to your Scrapy project’s settings.py:
```python
# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Playwright settings
PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": True,
}

# Default Scrapy settings
CONCURRENT_REQUESTS = 4
DOWNLOAD_DELAY = 1
ROBOTSTXT_OBEY = True
```
Your First Scrapy-Playwright Spider
```python
import scrapy
from scrapy_playwright.page import PageMethod


class JSBookSpider(scrapy.Spider):
    name = "js_books"

    def start_requests(self):
        yield scrapy.Request(
            url="https://books.toscrape.com/",
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    # Wait for content to load
                    PageMethod("wait_for_selector", "article.product_pod"),
                ],
            },
        )

    def parse(self, response):
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css(".price_color::text").get(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(
                next_page,
                self.parse,
                meta={"playwright": True},
            )
```
Run it:
```bash
scrapy runspider js_books.py -o books.json
```
Mixing Playwright and Regular Requests
Not every page needs JavaScript rendering. Only use Playwright where needed:
```python
class MixedSpider(scrapy.Spider):
    name = "mixed"

    # Example URL lists — replace with your own
    static_urls = ["https://example.com/static-page"]
    js_urls = ["https://example.com/js-page"]

    def start_requests(self):
        # Static pages — no Playwright (faster)
        for url in self.static_urls:
            yield scrapy.Request(url, callback=self.parse_static)

        # JS pages — use Playwright
        for url in self.js_urls:
            yield scrapy.Request(
                url,
                callback=self.parse_js,
                meta={"playwright": True},
            )

    def parse_static(self, response):
        # Regular Scrapy parsing — fast
        yield {"title": response.css("h1::text").get()}

    def parse_js(self, response):
        # Playwright-rendered — slower but handles JS
        yield {"title": response.css("h1::text").get()}
```
Page Interaction
Clicking Buttons
```python
import scrapy
from scrapy_playwright.page import PageMethod


class LoadMoreSpider(scrapy.Spider):
    name = "load_more"

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.com/products",
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_selector", ".product-card"),
                    PageMethod("click", "button.load-more"),
                    PageMethod("wait_for_timeout", 2000),
                    PageMethod("click", "button.load-more"),
                    PageMethod("wait_for_timeout", 2000),
                ],
            },
        )

    def parse(self, response):
        for product in response.css(".product-card"):
            yield {
                "name": product.css("h3::text").get(),
                "price": product.css(".price::text").get(),
            }
```
Filling Forms
```python
def start_requests(self):
    yield scrapy.Request(
        url="https://example.com/search",
        meta={
            "playwright": True,
            "playwright_page_methods": [
                PageMethod("fill", "input[name='q']", "web scraping"),
                PageMethod("click", "button[type='submit']"),
                PageMethod("wait_for_selector", ".results"),
            ],
        },
    )
```
Scrolling for Infinite Scroll
```python
def start_requests(self):
    yield scrapy.Request(
        url="https://example.com/feed",
        meta={
            "playwright": True,
            "playwright_page_methods": [
                PageMethod("wait_for_selector", ".feed-item"),
                PageMethod("evaluate", "window.scrollTo(0, document.body.scrollHeight)"),
                PageMethod("wait_for_timeout", 2000),
                PageMethod("evaluate", "window.scrollTo(0, document.body.scrollHeight)"),
                PageMethod("wait_for_timeout", 2000),
                PageMethod("evaluate", "window.scrollTo(0, document.body.scrollHeight)"),
                PageMethod("wait_for_timeout", 2000),
            ],
        },
    )
```
Advanced Page Interaction with Callback
For complex interactions, set playwright_include_page=True and drive the Playwright page object directly in an async callback:
```python
import scrapy


class AdvancedSpider(scrapy.Spider):
    name = "advanced"

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.com/products",
            meta={
                "playwright": True,
                "playwright_include_page": True,
            },
            callback=self.parse_with_page,
        )

    async def parse_with_page(self, response):
        page = response.meta["playwright_page"]
        try:
            # Keep clicking "load more" until the button disappears
            while True:
                load_more = page.locator("button.load-more")
                if await load_more.count() == 0:
                    break
                await load_more.click()
                await page.wait_for_timeout(2000)

            # Get final HTML
            html = await page.content()
            sel = scrapy.Selector(text=html)
            for product in sel.css(".product-card"):
                yield {
                    "name": product.css("h3::text").get(),
                    "price": product.css(".price::text").get(),
                }
        finally:
            await page.close()
```
Network Interception
Block heavy resources for faster scraping:
```python
# settings.py
PLAYWRIGHT_ABORT_REQUEST = lambda request: request.resource_type in [
    "image", "stylesheet", "font", "media",
]
```
Or per-spider:
```python
def start_requests(self):
    yield scrapy.Request(
        url="https://example.com",
        meta={
            "playwright": True,
            "playwright_page_methods": [
                PageMethod("route", "**/*.{png,jpg,jpeg,gif,css,woff}", lambda route: route.abort()),
            ],
        },
    )
```
Handling SPAs
React/Vue/Angular Applications
```python
class SPASpider(scrapy.Spider):
    name = "spa"

    def start_requests(self):
        yield scrapy.Request(
            url="https://spa-example.com/products",
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    # Wait for React/Vue to finish rendering
                    PageMethod("wait_for_load_state", "networkidle"),
                    # Or wait for specific elements
                    PageMethod("wait_for_selector", "[data-loaded='true']"),
                ],
            },
        )

    def parse(self, response):
        for item in response.css("[data-testid='product-card']"):
            yield {
                "name": item.css("[data-testid='name']::text").get(),
                "price": item.css("[data-testid='price']::text").get(),
            }
```
Capturing API Data
```python
class APICapture(scrapy.Spider):
    name = "api_capture"

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.com/products",
            meta={
                "playwright": True,
                "playwright_include_page": True,
            },
            callback=self.capture_api,
        )

    async def capture_api(self, response):
        page = response.meta["playwright_page"]
        api_data = []

        async def handle_response(resp):
            if "/api/products" in resp.url and resp.status == 200:
                try:
                    data = await resp.json()
                    api_data.append(data)
                except Exception:
                    pass

        # Listen for API responses, then reload so they fire again
        page.on("response", handle_response)
        await page.reload()
        await page.wait_for_timeout(5000)

        for data in api_data:
            for product in data.get("products", []):
                yield product

        await page.close()
```
Performance Optimization
1. Minimize Playwright Usage
```python
# Only use Playwright when JavaScript rendering is needed
custom_settings = {
    "CONCURRENT_REQUESTS": 16,     # For regular requests
    "PLAYWRIGHT_MAX_CONTEXTS": 4,  # Limit browser contexts
}
```
2. Block Resources
```python
# settings.py
PLAYWRIGHT_ABORT_REQUEST = lambda req: req.resource_type in [
    "image", "stylesheet", "font", "media", "other",
]
```
3. Use Multiple Browser Contexts
```python
# settings.py
PLAYWRIGHT_MAX_CONTEXTS = 8
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 4
```
4. Reuse Contexts
```python
PLAYWRIGHT_CONTEXTS = {
    "default": {
        "viewport": {"width": 1280, "height": 720},
        "user_agent": "Mozilla/5.0",
    },
}
```
Proxy Integration
Global Proxy
```python
# settings.py
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": True,
    "proxy": {
        "server": "http://proxy.example.com:8080",
        "username": "user",
        "password": "pass",
    },
}
```
Per-Request Proxy
```python
# settings.py
PLAYWRIGHT_CONTEXTS = {
    "proxy1": {
        "proxy": {
            "server": "http://proxy1.example.com:8080",
            "username": "user",
            "password": "pass",
        },
    },
    "proxy2": {
        "proxy": {
            "server": "http://proxy2.example.com:8080",
            "username": "user",
            "password": "pass",
        },
    },
}
```
In your spider, select a context per request:
```python
def start_requests(self):
    yield scrapy.Request(
        url="https://example.com",
        meta={
            "playwright": True,
            "playwright_context": "proxy1",
        },
    )
```
For proxy types, see our web scraping proxy guide and proxy glossary.
Complete Example
```python
import scrapy
from scrapy_playwright.page import PageMethod


class EcommerceSpider(scrapy.Spider):
    name = "ecommerce"
    allowed_domains = ["books.toscrape.com"]

    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "PLAYWRIGHT_BROWSER_TYPE": "chromium",
        "PLAYWRIGHT_LAUNCH_OPTIONS": {"headless": True},
        "CONCURRENT_REQUESTS": 4,
        "DOWNLOAD_DELAY": 1,
        "FEEDS": {
            "books.json": {"format": "json", "overwrite": True},
        },
    }

    def start_requests(self):
        yield scrapy.Request(
            url="https://books.toscrape.com/",
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_selector", "article.product_pod"),
                ],
            },
        )

    def parse(self, response):
        for book in response.css("article.product_pod"):
            detail_url = book.css("h3 a::attr(href)").get()
            yield response.follow(
                detail_url,
                callback=self.parse_detail,
                meta={
                    "playwright": True,
                    "playwright_page_methods": [
                        PageMethod("wait_for_selector", ".product_main"),
                    ],
                },
            )

        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(
                next_page,
                self.parse,
                meta={"playwright": True},
            )

    def parse_detail(self, response):
        yield {
            "title": response.css("h1::text").get(),
            "price": response.css(".price_color::text").get(),
            "availability": response.css(".availability::text").getall()[-1].strip(),
            "description": response.css("#product_description + p::text").get(),
            # The first table cell on this site's detail page is the UPC
            "upc": response.css("td::text").get(),
            "url": response.url,
        }
```
FAQ
When should I use Scrapy + Playwright vs plain Scrapy?
Use Scrapy + Playwright when the target site renders content with JavaScript (React, Angular, Vue SPAs). If the HTML source already contains all the data you need, plain Scrapy is faster and lighter, so check the page source before adding Playwright.
Does Scrapy + Playwright support all Playwright features?
Most features are supported through PageMethod and playwright_include_page. For very complex interactions, use playwright_include_page=True to access the Playwright page object directly in your callback.
How does it compare to Scrapy + Splash?
Scrapy + Playwright is the modern replacement for Scrapy + Splash. Playwright renders pages more accurately (real browser vs simulated), supports more interaction types, and does not require running a separate Splash server. Splash is deprecated for most use cases.
What is the performance impact of adding Playwright?
Significant. Regular Scrapy requests process at 50-200+ pages/second. With Playwright, expect 2-10 pages/second depending on page complexity and resource blocking. Minimize Playwright usage to only the pages that need it.
Learn more: Scrapy tutorial, Playwright tutorial, Python scraping libraries.
External Resources:
- scrapy-playwright GitHub
- Scrapy Documentation
- Playwright Documentation
Related Reading
- aiohttp + BeautifulSoup: Async Python Scraping
- Axios + Cheerio: Lightweight Node.js Scraping
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)
- API vs Web Scraping: When You Need Proxies (and When You Don’t)
- ASEAN Data Protection Laws: A Web Scraping Compliance Matrix
- How to Build an Ethical Web Scraping Policy for Your Company