Scraping JavaScript-Rendered Pages: Complete Guide
Modern websites rely heavily on JavaScript to render content. Product listings, reviews, prices, and even basic page text are often loaded dynamically after the initial HTML is served. Traditional HTTP scrapers see an empty shell. This guide covers the main techniques for extracting data from JavaScript-heavy pages.
The JavaScript Rendering Problem
When you fetch a page with requests.get(), you get the raw HTML before any JavaScript executes. For static sites, that’s fine. For JavaScript-rendered pages, the HTML looks like this:
<html>
<body>
<div id="root"></div>
<script src="/bundle.js"></script>
</body>
</html>
All the actual content is injected by JavaScript. Your scraper sees an empty <div id="root"> and nothing else.
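You can confirm this with a quick check; a minimal sketch, assuming a hypothetical product page and CSS class:

import requests

# Fetch the raw HTML, before any JavaScript runs
html = requests.get("https://example.com/products").text

# Content the browser renders is absent from the raw source
print('class="product-card"' in html)  # False on client-side rendered sites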
Approach 1: Headless Browsers
The most reliable method — run a real browser without a visible window.
Playwright (Recommended)
from playwright.sync_api import sync_playwright

def scrape_with_playwright(url, proxy=None):
    with sync_playwright() as p:
        browser_options = {"headless": True}
        if proxy:
            browser_options["proxy"] = {"server": proxy}
        browser = p.chromium.launch(**browser_options)
        context = browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36"
        )
        page = context.new_page()
        # Navigate and wait for content
        page.goto(url, wait_until="networkidle")
        # Wait for a specific element
        page.wait_for_selector(".product-card", timeout=10000)
        # Extract data
        products = page.evaluate("""
            () => {
                return Array.from(document.querySelectorAll('.product-card')).map(card => ({
                    title: card.querySelector('.title')?.textContent?.trim(),
                    price: card.querySelector('.price')?.textContent?.trim(),
                    link: card.querySelector('a')?.href
                }));
            }
        """)
        browser.close()
        return products
# Usage
data = scrape_with_playwright(
    "https://example.com/products",
    proxy="http://user:pass@proxy:8080"
)
Selenium
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
def scrape_with_selenium(url, proxy=None):
    options = Options()
    options.add_argument("--headless=new")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--disable-gpu")
    options.add_argument("--window-size=1920,1080")
    if proxy:
        options.add_argument(f"--proxy-server={proxy}")

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait for content to load
        WebDriverWait(driver, 15).until(
            EC.presence_of_all_elements_located(
                (By.CSS_SELECTOR, ".product-card")
            )
        )
        # Extract data
        cards = driver.find_elements(By.CSS_SELECTOR, ".product-card")
        products = []
        for card in cards:
            products.append({
                "title": card.find_element(By.CSS_SELECTOR, ".title").text,
                "price": card.find_element(By.CSS_SELECTOR, ".price").text,
                "link": card.find_element(By.TAG_NAME, "a").get_attribute("href")
            })
        return products
    finally:
        driver.quit()
Choosing Between Playwright and Selenium
| Feature | Playwright | Selenium |
|---|---|---|
| Speed | Faster | Slower |
| API design | Modern, async-native | Older, synchronous |
| Browser support | Chromium, Firefox, WebKit | Chrome, Firefox, Edge, Safari |
| Auto-wait | Built-in | Manual waits needed |
| Network interception | Easy | Complex |
| Community/ecosystem | Growing | Massive |
Recommendation: Use Playwright for new projects. Use Selenium if you need specific browser compatibility or have existing Selenium infrastructure.
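To illustrate the async-native row in the table, here is a minimal sketch of the same kind of fetch with Playwright's async API (the URL is a placeholder):

import asyncio
from playwright.async_api import async_playwright

async def fetch_html(url):
    # Same calls as the sync API, but awaited, which suits high concurrency
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        html = await page.content()
        await browser.close()
        return html

html = asyncio.run(fetch_html("https://example.com/products"))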
Approach 2: Intercept API Calls
Many JavaScript sites fetch data from internal APIs. Intercept these calls to get structured data directly:
With Playwright Network Interception
from playwright.sync_api import sync_playwright

def intercept_api_calls(url):
    api_responses = []

    def handle_response(response):
        if "/api/" in response.url and response.status == 200:
            try:
                data = response.json()
                api_responses.append({
                    "url": response.url,
                    "data": data
                })
            except Exception:
                pass

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.on("response", handle_response)
        page.goto(url, wait_until="networkidle")
        browser.close()

    return api_responses
With Browser DevTools (Manual Discovery)
- Open Chrome DevTools (F12)
- Go to the Network tab
- Filter by XHR/Fetch
- Load the page and interact with it
- Look for API endpoints returning JSON data
- Copy the URL and headers — you can call these directly
Once you find the API endpoint, skip the browser entirely:
import requests

def scrape_via_api(api_url, proxy=None):
    headers = {
        "Accept": "application/json",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0",
        "Referer": "https://example.com/products",
        "X-Requested-With": "XMLHttpRequest"
    }
    proxies = {"http": proxy, "https": proxy} if proxy else None
    response = requests.get(api_url, headers=headers, proxies=proxies)
    return response.json()
Approach 3: Rendering Services
Third-party services render JavaScript for you and return the final HTML:
Using a Rendering API
import requests
from bs4 import BeautifulSoup

def scrape_with_renderer(url, api_key):
    """Use a JavaScript rendering service."""
    response = requests.get(
        "https://render-api.example.com/render",
        params={
            "url": url,
            "render_js": True,
            "wait_for": ".product-card",
            "timeout": 30
        },
        headers={"Authorization": f"Bearer {api_key}"}
    )
    html = response.text
    # Now parse with BeautifulSoup as normal
    soup = BeautifulSoup(html, "lxml")
    return soup
Splash (Self-Hosted Renderer)
Run your own rendering service with Splash:
# docker-compose.yml
services:
  splash:
    image: scrapinghub/splash:latest
    ports:
      - "8050:8050"
    command: --maxrss 4000 --max-timeout 120
import requests

def scrape_with_splash(url, splash_url="http://localhost:8050"):
    response = requests.get(
        f"{splash_url}/render.html",
        params={
            "url": url,
            "wait": 3,
            "timeout": 30,
            "resource_timeout": 10,
            "images": 0  # Don't load images
        }
    )
    return response.text
Approach 4: Server-Side Rendering Detection
Some frameworks (Next.js, Nuxt.js) use server-side rendering. The initial HTML already contains the content. Check before using a headless browser:
import requests
from bs4 import BeautifulSoup

def check_ssr(url):
    """Check if a page has server-side rendered content."""
    response = requests.get(url, headers={
        "User-Agent": "Mozilla/5.0 Chrome/120.0.0.0"
    })
    soup = BeautifulSoup(response.text, "lxml")
    # Check for content in the initial HTML
    body_text = soup.body.get_text(strip=True) if soup.body else ""
    if len(body_text) > 100:
        print(f"SSR detected: {len(body_text)} chars of content")
        return True
    else:
        print(f"Client-side rendered: only {len(body_text)} chars")
        return False
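As a sketch of how the check can drive tool choice, here is a hypothetical dispatcher that falls back to the scrape_with_playwright function from Approach 1:

import requests
from bs4 import BeautifulSoup

def scrape_adaptive(url):
    # Cheap HTTP fetch and parse when the HTML is server-rendered,
    # headless browser only when it is not
    if check_ssr(url):
        html = requests.get(url).text
        return BeautifulSoup(html, "lxml")
    return scrape_with_playwright(url)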
Waiting Strategies
The biggest challenge with JS scraping is knowing when the page is ready.
Wait for Network Idle
# Playwright
page.goto(url, wait_until="networkidle")
Waits until there have been no network connections for at least 500 ms.
Wait for Specific Element
# Playwright
page.wait_for_selector("#product-list .item", state="visible", timeout=15000)
# Selenium
WebDriverWait(driver, 15).until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, "#product-list .item"))
)
Wait for JavaScript Variable
# Wait until a JS variable is populated
page.wait_for_function("window.__DATA__ !== undefined", timeout=10000)
data = page.evaluate("window.__DATA__")
Custom Wait with Polling
import time
def wait_for_content(page, selector, max_wait=15):
start = time.time()
while time.time() - start < max_wait:
count = page.locator(selector).count()
if count > 0:
return True
page.wait_for_timeout(500)
raise TimeoutError(f"Content not found: {selector}")
Performance Optimization
Block Unnecessary Resources
Speed up rendering by blocking images, stylesheets, fonts, and tracking scripts:
def block_resources(route, request):
    blocked_types = {"image", "stylesheet", "font", "media"}
    blocked_domains = ["google-analytics.com", "facebook.net", "doubleclick.net"]
    if request.resource_type in blocked_types:
        route.abort()
    elif any(domain in request.url for domain in blocked_domains):
        route.abort()
    else:
        route.continue_()

page.route("**/*", block_resources)
This can reduce page load time by 50-70%.
Reuse Browser Instances
Don’t launch a new browser for every page:
import threading
from playwright.sync_api import sync_playwright

class BrowserPool:
    def __init__(self, pool_size=5):
        self.playwright = sync_playwright().start()
        self.browsers = [
            self.playwright.chromium.launch(headless=True)
            for _ in range(pool_size)
        ]
        self.available = list(self.browsers)
        self.lock = threading.Lock()

    def get_browser(self):
        with self.lock:
            if self.available:
                return self.available.pop()
            # Track overflow browsers so close_all() can clean them up too
            browser = self.playwright.chromium.launch(headless=True)
            self.browsers.append(browser)
            return browser

    def return_browser(self, browser):
        with self.lock:
            self.available.append(browser)

    def close_all(self):
        for browser in self.browsers:
            browser.close()
        self.playwright.stop()
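A usage sketch, single-threaded for clarity (Playwright's sync API objects are tied to the thread that started them, so sharing pooled browsers across threads needs care):

pool = BrowserPool(pool_size=3)
browser = pool.get_browser()
try:
    page = browser.new_page()
    page.goto("https://example.com/products")
    html = page.content()
    page.close()
finally:
    pool.return_browser(browser)
pool.close_all()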
Common Frameworks and How to Scrape Them
React Applications
React apps often embed initial state in the HTML:
import re
import json

def extract_react_state(html):
    # Look for __NEXT_DATA__ (Next.js)
    match = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
    if match:
        return json.loads(match.group(1))
    # Look for window.__INITIAL_STATE__
    match = re.search(r'window\.__INITIAL_STATE__\s*=\s*({.*?});', html, re.DOTALL)
    if match:
        return json.loads(match.group(1))
    return None
Vue.js / Nuxt.js
def extract_nuxt_data(html):
    match = re.search(
        r'window\.__NUXT__\s*=\s*\(function\(.*?\)\{return\s*({.*?})\}',
        html, re.DOTALL
    )
    if match:
        return json.loads(match.group(1))
    return None
Angular Applications
Angular apps are usually tightly coupled to their DOM structure and rarely embed extractable state in the initial HTML. Use a headless browser, or better, intercept the app's API calls (Approach 2).
FAQ
When should I use a headless browser vs. API interception?
Try API interception first — it’s 10x faster and uses fewer resources. Use headless browsers only when you can’t find or replicate the API calls, or when the site uses complex authentication flows.
How much memory does headless Chrome use?
Each Chrome instance uses 100-300MB of RAM. With multiple tabs, it can quickly reach 1GB+. Use resource blocking and limit concurrent pages.
Can I run headless browsers in Docker?
Yes, both Playwright and Selenium work well in Docker. Use the official Playwright Docker image or install Chrome in your Dockerfile. See our Docker scraping guide for details.
How do I handle CAPTCHAs on JavaScript pages?
CAPTCHAs are a separate challenge. Use residential proxies to reduce CAPTCHA triggers, implement CAPTCHA-solving services, or avoid patterns that trigger CAPTCHAs in the first place.
Is headless browser scraping slower than regular HTTP scraping?
Yes, significantly — 5-20x slower per page. Headless browsers need to download, parse, and execute all JavaScript. Optimize by blocking resources, reusing browser instances, and running multiple browsers in parallel.
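As a rough sketch of the parallel option, each worker thread can start its own Playwright instance (URLs are placeholders; the sync API cannot be shared across threads):

from concurrent.futures import ThreadPoolExecutor
from playwright.sync_api import sync_playwright

def fetch_page(url):
    # One Playwright instance per thread
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
        return html

urls = [
    "https://example.com/products?page=1",
    "https://example.com/products?page=2",
]
with ThreadPoolExecutor(max_workers=4) as executor:
    pages = list(executor.map(fetch_page, urls))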
Conclusion
JavaScript-rendered pages require different tools than static HTML, but they’re far from unscrape-able. Start by checking for API endpoints (fastest approach), try server-side rendering detection (no browser needed), and fall back to headless browsers (Playwright preferred) when necessary. Combined with proxy rotation and rate limiting, you can reliably scrape even the most JavaScript-heavy sites.