Scraping JavaScript-Rendered Pages: Complete Guide

Modern websites rely heavily on JavaScript to render content. Product listings, reviews, prices, and even basic page text are often loaded dynamically after the initial HTML is served. Traditional HTTP scrapers see an empty shell. This guide covers the main techniques for extracting data from JavaScript-heavy pages.

The JavaScript Rendering Problem

When you fetch a page with requests.get(), you get the raw HTML before any JavaScript executes. For static sites, that’s fine. For JavaScript-rendered pages, the HTML looks like this:

<html>
  <body>
    <div id="root"></div>
    <script src="/bundle.js"></script>
  </body>
</html>

All the actual content is injected by JavaScript. Your scraper sees an empty <div id="root">, not the products, articles, or data you need.
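
You can confirm this quickly with a plain HTTP fetch; a minimal sketch, assuming a client-rendered page at a placeholder URL:

import requests
from bs4 import BeautifulSoup

# Fetch the raw HTML -- no JavaScript runs here
html = requests.get("https://example.com/app").text
soup = BeautifulSoup(html, "lxml")

root = soup.find(id="root")
print(root)                             # <div id="root"></div>
print(len(soup.get_text(strip=True)))   # near zero: nothing to extract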

Approach 1: Headless Browsers

The most reliable method — run a real browser without a visible window.

Playwright (Recommended)

from playwright.sync_api import sync_playwright

def scrape_with_playwright(url, proxy=None):
    with sync_playwright() as p:
        browser_options = {"headless": True}
        if proxy:
            browser_options["proxy"] = {"server": proxy}

        browser = p.chromium.launch(**browser_options)
        context = browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36"
        )
        page = context.new_page()

        # Navigate and wait for content
        page.goto(url, wait_until="networkidle")

        # Wait for specific element
        page.wait_for_selector(".product-card", timeout=10000)

        # Extract data from the rendered DOM
        products = page.evaluate("""
            () => {
                return Array.from(document.querySelectorAll('.product-card')).map(card => ({
                    title: card.querySelector('.title')?.textContent?.trim(),
                    price: card.querySelector('.price')?.textContent?.trim(),
                    link: card.querySelector('a')?.href
                }));
            }
        """)

        browser.close()
        return products

# Usage
data = scrape_with_playwright(
    "https://example.com/products",
    proxy="http://user:pass@proxy:8080"
)

Selenium

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def scrape_with_selenium(url, proxy=None):
    options = Options()
    options.add_argument("--headless=new")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--disable-gpu")
    options.add_argument("--window-size=1920,1080")
    if proxy:
        options.add_argument(f"--proxy-server={proxy}")

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)

        # Wait for content to load
        WebDriverWait(driver, 15).until(
            EC.presence_of_all_elements_located(
                (By.CSS_SELECTOR, ".product-card")
            )
        )

        # Extract data
        cards = driver.find_elements(By.CSS_SELECTOR, ".product-card")
        products = []
        for card in cards:
            products.append({
                "title": card.find_element(By.CSS_SELECTOR, ".title").text,
                "price": card.find_element(By.CSS_SELECTOR, ".price").text,
                "link": card.find_element(By.TAG_NAME, "a").get_attribute("href")
            })
        return products
    finally:
        driver.quit()

Choosing Between Playwright and Selenium

| Feature | Playwright | Selenium |
|---|---|---|
| Speed | Faster | Slower |
| API design | Modern, async-native | Older, synchronous |
| Browser support | Chromium, Firefox, WebKit | Chrome, Firefox, Edge, Safari |
| Auto-wait | Built-in | Manual waits needed |
| Network interception | Easy | Complex |
| Community/ecosystem | Growing | Massive |

Recommendation: Use Playwright for new projects. Use Selenium if you need specific browser compatibility or have existing Selenium infrastructure.

Approach 2: Intercept API Calls

Many JavaScript sites fetch data from internal APIs. Intercept these calls to get structured data directly:

With Playwright Network Interception

from playwright.sync_api import sync_playwright

def intercept_api_calls(url):
    api_responses = []

    def handle_response(response):
        # Keep only successful JSON responses from the site's internal API
        if "/api/" in response.url and response.status == 200:
            try:
                data = response.json()
                api_responses.append({
                    "url": response.url,
                    "data": data
                })
            except Exception:
                pass  # response body wasn't JSON

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.on("response", handle_response)
        page.goto(url, wait_until="networkidle")
        browser.close()

    return api_responses
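
A quick usage sketch (the URL is a placeholder); printing the captured endpoints is a good first step in discovering which API calls carry the data:

responses = intercept_api_calls("https://example.com/products")
for r in responses:
    print(r["url"])  # inspect which endpoints the page called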

With Browser DevTools (Manual Discovery)

  1. Open Chrome DevTools (F12)
  2. Go to the Network tab
  3. Filter by XHR/Fetch
  4. Load the page and interact with it
  5. Look for API endpoints returning JSON data
  6. Copy the URL and headers — you can call these directly

Once you find the API endpoint, skip the browser entirely:

import requests

def scrape_via_api(api_url, proxy=None):
    headers = {
        "Accept": "application/json",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0",
        "Referer": "https://example.com/products",
        "X-Requested-With": "XMLHttpRequest"
    }
    proxies = {"http": proxy, "https": proxy} if proxy else None
    response = requests.get(api_url, headers=headers, proxies=proxies)
    return response.json()

Approach 3: Rendering Services

Third-party services render JavaScript for you and return the final HTML:

Using a Rendering API

import requests
from bs4 import BeautifulSoup

def scrape_with_renderer(url, api_key):
    """Use a JavaScript rendering service."""
    response = requests.get(
        "https://render-api.example.com/render",
        params={
            "url": url,
            "render_js": True,
            "wait_for": ".product-card",
            "timeout": 30
        },
        headers={"Authorization": f"Bearer {api_key}"}
    )
    html = response.text

    # Now parse with BeautifulSoup as normal
    soup = BeautifulSoup(html, "lxml")
    return soup

Splash (Self-Hosted Renderer)

Run your own rendering service with Splash:

# docker-compose.yml
services:
  splash:
    image: scrapinghub/splash:latest
    ports:
      - "8050:8050"
    command: --maxrss 4000 --max-timeout 120

import requests

def scrape_with_splash(url, splash_url="http://localhost:8050"):
    response = requests.get(
        f"{splash_url}/render.html",
        params={
            "url": url,
            "wait": 3,
            "timeout": 30,
            "resource_timeout": 10,
            "images": 0  # don't load images
        }
    )
    return response.text

Approach 4: Server-Side Rendering Detection

Some frameworks (Next.js, Nuxt.js) use server-side rendering. The initial HTML already contains the content. Check before using a headless browser:

import requests
from bs4 import BeautifulSoup

def check_ssr(url):
    """Check if a page has server-side rendered content."""
    response = requests.get(url, headers={
        "User-Agent": "Mozilla/5.0 Chrome/120.0.0.0"
    })
    soup = BeautifulSoup(response.text, "lxml")

    # Check for content in the initial HTML
    body_text = soup.body.get_text(strip=True) if soup.body else ""
    if len(body_text) > 100:
        print(f"SSR detected: {len(body_text)} chars of content")
        return True
    else:
        print(f"Client-side rendered: only {len(body_text)} chars")
        return False
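
With that check in place, you can route each URL to the cheapest scraper that works. A minimal dispatch sketch, reusing check_ssr and the scrape_with_playwright function from earlier; parse_static is a hypothetical static-HTML parser:

def scrape_page(url):
    # Use plain HTTP when the server already rendered the content,
    # and fall back to a headless browser only when it didn't.
    if check_ssr(url):
        html = requests.get(url).text
        return parse_static(html)  # hypothetical static-HTML parser
    return scrape_with_playwright(url)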

Waiting Strategies

The biggest challenge with JS scraping is knowing when the page is ready.

Wait for Network Idle

# Playwright
page.goto(url, wait_until="networkidle")

This waits until there have been no network connections for at least 500 ms.

Wait for Specific Element

# Playwright
page.wait_for_selector("#product-list .item", state="visible", timeout=15000)

# Selenium
WebDriverWait(driver, 15).until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, "#product-list .item"))
)

Wait for JavaScript Variable

# Wait until a JS variable is populated
page.wait_for_function("window.__DATA__ !== undefined", timeout=10000)
data = page.evaluate("window.__DATA__")

Custom Wait with Polling

import time

def wait_for_content(page, selector, max_wait=15):
    start = time.time()
    while time.time() - start < max_wait:
        count = page.locator(selector).count()
        if count > 0:
            return True
        page.wait_for_timeout(500)
    raise TimeoutError(f"Content not found: {selector}")

Performance Optimization

Block Unnecessary Resources

Speed up rendering by blocking images, stylesheets, fonts, media, and tracking domains:

def block_resources(route, request):
    blocked_types = {"image", "stylesheet", "font", "media"}
    blocked_domains = ["google-analytics.com", "facebook.net", "doubleclick.net"]

    if request.resource_type in blocked_types:
        route.abort()
    elif any(domain in request.url for domain in blocked_domains):
        route.abort()
    else:
        route.continue_()

page.route("**/*", block_resources)

This can reduce page load time by 50-70%.
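
You can also register the handler once at the context level so every page opened from that context inherits the blocking; a short sketch, assuming a browser launched as in the earlier examples:

context = browser.new_context()
context.route("**/*", block_resources)  # applies to all pages from this context
page = context.new_page()
page.goto("https://example.com/products")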

Reuse Browser Instances

Don’t launch a new browser for every page:

import threading
from playwright.sync_api import sync_playwright

class BrowserPool:
    def __init__(self, pool_size=5):
        self.playwright = sync_playwright().start()
        self.browsers = [
            self.playwright.chromium.launch(headless=True)
            for _ in range(pool_size)
        ]
        self.available = list(self.browsers)
        self.lock = threading.Lock()

    def get_browser(self):
        with self.lock:
            if self.available:
                return self.available.pop()
            # Pool exhausted: launch an extra browser on demand
            return self.playwright.chromium.launch(headless=True)

    def return_browser(self, browser):
        with self.lock:
            self.available.append(browser)

    def close_all(self):
        for browser in self.browsers:
            browser.close()
        self.playwright.stop()
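
A usage sketch; fetch_rendered_html is a hypothetical helper and the URL is a placeholder:

pool = BrowserPool(pool_size=3)

def fetch_rendered_html(url):
    browser = pool.get_browser()
    try:
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        page.close()
        return html
    finally:
        pool.return_browser(browser)  # hand the browser back for reuse

try:
    html = fetch_rendered_html("https://example.com/products")
finally:
    pool.close_all()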

Common Frameworks and How to Scrape Them

React Applications

React apps often embed initial state in the HTML:

import re

import json

def extract_react_state(html):

# Look for __NEXT_DATA__ (Next.js)

match = re.search(r'<script id="__NEXT_DATA__"[^>]>(.?)</script>', html)

if match:

return json.loads(match.group(1))

# Look for window.__INITIAL_STATE__

match = re.search(r'window\.__INITIAL_STATE__\s=\s({.*?});', html, re.DOTALL)

if match:

return json.loads(match.group(1))

return None
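
When the state is embedded this way, a plain HTTP request is enough and no browser is needed. A short usage sketch (the URL is a placeholder):

import requests

html = requests.get("https://example.com/products").text
state = extract_react_state(html)
if state:
    # Next.js conventionally nests page data under props.pageProps
    page_props = state.get("props", {}).get("pageProps", {})
    print(page_props.keys())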

Vue.js / Nuxt.js

def extract_nuxt_data(html):
    match = re.search(
        r'window\.__NUXT__\s*=\s*\(function\(.*?\)\{return\s*({.*?})\}',
        html, re.DOTALL
    )
    if match:
        return json.loads(match.group(1))
    return None

Angular Applications

Angular apps rarely embed a serialized state object in the initial HTML, and their rendered DOM is usually tightly coupled to component structure. Intercepting the app's API calls (Approach 2) is often the best strategy; fall back to a headless browser and DOM extraction when you can't replicate the calls, as in the sketch below.
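
A minimal sketch under those assumptions, reusing intercept_api_calls from Approach 2; the app-root selector is the Angular CLI default and may differ per app:

def scrape_angular(url):
    # Prefer the structured JSON the app fetches over parsing its DOM
    api_data = intercept_api_calls(url)
    if api_data:
        return [item["data"] for item in api_data]

    # Fall back to rendering and DOM extraction
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        page.wait_for_selector("app-root *", timeout=10000)  # wait for content inside the root component
        html = page.content()
        browser.close()
    return html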

FAQ

When should I use a headless browser vs. API interception?

Try API interception first — it’s 10x faster and uses fewer resources. Use headless browsers only when you can’t find or replicate the API calls, or when the site uses complex authentication flows.

How much memory does headless Chrome use?

Each Chrome instance uses 100-300MB of RAM. With multiple tabs, it can quickly reach 1GB+. Use resource blocking and limit concurrent pages.

Can I run headless browsers in Docker?

Yes, both Playwright and Selenium work well in Docker. Use the official Playwright Docker image or install Chrome in your Dockerfile. See our Docker scraping guide for details.

How do I handle CAPTCHAs on JavaScript pages?

CAPTCHAs are a separate challenge. Use residential proxies to reduce CAPTCHA triggers, implement CAPTCHA-solving services, or avoid patterns that trigger CAPTCHAs in the first place.

Is headless browser scraping slower than regular HTTP scraping?

Yes, significantly — 5-20x slower per page. Headless browsers need to download, parse, and execute all JavaScript. Optimize by blocking resources, reusing browser instances, and running multiple browsers in parallel.

Conclusion

JavaScript-rendered pages require different tools than static HTML, but they’re far from unscrape-able. Start by checking for API endpoints (fastest approach), try server-side rendering detection (no browser needed), and fall back to headless browsers (Playwright preferred) when necessary. Combined with proxy rotation and rate limiting, you can reliably scrape even the most JavaScript-heavy sites.
