Scraping JavaScript-Rendered Pages: Complete Guide

Modern websites rely heavily on JavaScript to render content. Product listings, reviews, prices, and even basic page text are often loaded dynamically after the initial HTML is served. Traditional HTTP scrapers see an empty shell. This guide covers the main techniques for extracting data from JavaScript-heavy pages.

The JavaScript Rendering Problem

When you fetch a page with requests.get(), you get the raw HTML before any JavaScript executes. For static sites, that’s fine. For JavaScript-rendered pages, the HTML looks like this:

<html>
  <body>
    <div id="root"></div>
    <script src="/bundle.js"></script>
  </body>
</html>

All the actual content is injected by JavaScript. Your scraper sees an empty <div id="root">, not the products, articles, or data you need.
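
You can confirm this quickly with a plain HTTP fetch; a minimal sketch, assuming a client-rendered page at a placeholder URL:

import requests
from bs4 import BeautifulSoup

# Fetch the raw HTML -- no JavaScript runs here
html = requests.get("https://example.com/app").text
soup = BeautifulSoup(html, "lxml")

root = soup.find(id="root")
print(root)                             # <div id="root"></div>
print(len(soup.get_text(strip=True)))   # near zero: nothing to extract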

Approach 1: Headless Browsers

The most reliable method — run a real browser without a visible window.

Playwright (Recommended)

from playwright.sync_api import sync_playwright

def scrape_with_playwright(url, proxy=None):
    with sync_playwright() as p:
        browser_options = {"headless": True}
        if proxy:
            browser_options["proxy"] = {"server": proxy}

        browser = p.chromium.launch(**browser_options)
        context = browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36"
        )
        page = context.new_page()

        # Navigate and wait for content
        page.goto(url, wait_until="networkidle")

        # Wait for specific element
        page.wait_for_selector(".product-card", timeout=10000)

        # Extract data from the rendered DOM
        products = page.evaluate("""
            () => {
                return Array.from(document.querySelectorAll('.product-card')).map(card => ({
                    title: card.querySelector('.title')?.textContent?.trim(),
                    price: card.querySelector('.price')?.textContent?.trim(),
                    link: card.querySelector('a')?.href
                }));
            }
        """)

        browser.close()
        return products

# Usage
data = scrape_with_playwright(
    "https://example.com/products",
    proxy="http://user:pass@proxy:8080"
)

Selenium

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def scrape_with_selenium(url, proxy=None):
    options = Options()
    options.add_argument("--headless=new")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--disable-gpu")
    options.add_argument("--window-size=1920,1080")
    if proxy:
        options.add_argument(f"--proxy-server={proxy}")

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)

        # Wait for content to load
        WebDriverWait(driver, 15).until(
            EC.presence_of_all_elements_located(
                (By.CSS_SELECTOR, ".product-card")
            )
        )

        # Extract data
        cards = driver.find_elements(By.CSS_SELECTOR, ".product-card")
        products = []
        for card in cards:
            products.append({
                "title": card.find_element(By.CSS_SELECTOR, ".title").text,
                "price": card.find_element(By.CSS_SELECTOR, ".price").text,
                "link": card.find_element(By.TAG_NAME, "a").get_attribute("href")
            })
        return products
    finally:
        driver.quit()

Choosing Between Playwright and Selenium

| Feature | Playwright | Selenium |
|---|---|---|
| Speed | Faster | Slower |
| API design | Modern, async-native | Older, synchronous |
| Browser support | Chromium, Firefox, WebKit | Chrome, Firefox, Edge, Safari |
| Auto-wait | Built-in | Manual waits needed |
| Network interception | Easy | Complex |
| Community/ecosystem | Growing | Massive |

Recommendation: Use Playwright for new projects. Use Selenium if you need specific browser compatibility or have existing Selenium infrastructure.

Approach 2: Intercept API Calls

Many JavaScript sites fetch data from internal APIs. Intercept these calls to get structured data directly:

With Playwright Network Interception

from playwright.sync_api import sync_playwright

def intercept_api_calls(url):
    api_responses = []

    def handle_response(response):
        # Keep only successful JSON responses from the site's internal API
        if "/api/" in response.url and response.status == 200:
            try:
                data = response.json()
                api_responses.append({
                    "url": response.url,
                    "data": data
                })
            except Exception:
                pass  # response body wasn't JSON

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.on("response", handle_response)
        page.goto(url, wait_until="networkidle")
        browser.close()

    return api_responses
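
A quick usage sketch (the URL is a placeholder); printing the captured endpoints is a good first step in discovering which API calls carry the data:

responses = intercept_api_calls("https://example.com/products")
for r in responses:
    print(r["url"])  # inspect which endpoints the page called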

With Browser DevTools (Manual Discovery)

  1. Open Chrome DevTools (F12)
  2. Go to the Network tab
  3. Filter by XHR/Fetch
  4. Load the page and interact with it
  5. Look for API endpoints returning JSON data
  6. Copy the URL and headers — you can call these directly

Once you find the API endpoint, skip the browser entirely:

import requests

def scrape_via_api(api_url, proxy=None):
    headers = {
        "Accept": "application/json",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0",
        "Referer": "https://example.com/products",
        "X-Requested-With": "XMLHttpRequest"
    }
    proxies = {"http": proxy, "https": proxy} if proxy else None
    response = requests.get(api_url, headers=headers, proxies=proxies)
    return response.json()

Approach 3: Rendering Services

Third-party services render JavaScript for you and return the final HTML:

Using a Rendering API

import requests
from bs4 import BeautifulSoup

def scrape_with_renderer(url, api_key):
    """Use a JavaScript rendering service."""
    response = requests.get(
        "https://render-api.example.com/render",
        params={
            "url": url,
            "render_js": True,
            "wait_for": ".product-card",
            "timeout": 30
        },
        headers={"Authorization": f"Bearer {api_key}"}
    )
    html = response.text

    # Now parse with BeautifulSoup as normal
    soup = BeautifulSoup(html, "lxml")
    return soup

Splash (Self-Hosted Renderer)

Run your own rendering service with Splash:

# docker-compose.yml
services:
  splash:
    image: scrapinghub/splash:latest
    ports:
      - "8050:8050"
    command: --maxrss 4000 --max-timeout 120

import requests

def scrape_with_splash(url, splash_url="http://localhost:8050"):
    response = requests.get(
        f"{splash_url}/render.html",
        params={
            "url": url,
            "wait": 3,
            "timeout": 30,
            "resource_timeout": 10,
            "images": 0  # don't load images
        }
    )
    return response.text

Approach 4: Server-Side Rendering Detection

Some frameworks (Next.js, Nuxt.js) use server-side rendering. The initial HTML already contains the content. Check before using a headless browser:

import requests
from bs4 import BeautifulSoup

def check_ssr(url):
    """Check if a page has server-side rendered content."""
    response = requests.get(url, headers={
        "User-Agent": "Mozilla/5.0 Chrome/120.0.0.0"
    })
    soup = BeautifulSoup(response.text, "lxml")

    # Check for content in the initial HTML
    body_text = soup.body.get_text(strip=True) if soup.body else ""
    if len(body_text) > 100:
        print(f"SSR detected: {len(body_text)} chars of content")
        return True
    else:
        print(f"Client-side rendered: only {len(body_text)} chars")
        return False
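
With that check in place, you can route each URL to the cheapest scraper that works. A minimal dispatch sketch, reusing check_ssr and the scrape_with_playwright function from earlier; parse_static is a hypothetical static-HTML parser:

def scrape_page(url):
    # Use plain HTTP when the server already rendered the content,
    # and fall back to a headless browser only when it didn't.
    if check_ssr(url):
        html = requests.get(url).text
        return parse_static(html)  # hypothetical static-HTML parser
    return scrape_with_playwright(url)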

Waiting Strategies

The biggest challenge with JS scraping is knowing when the page is ready.

Wait for Network Idle

# Playwright
page.goto(url, wait_until="networkidle")

This waits until there have been no network connections for at least 500 ms.

Wait for Specific Element

# Playwright
page.wait_for_selector("#product-list .item", state="visible", timeout=15000)

# Selenium
WebDriverWait(driver, 15).until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, "#product-list .item"))
)

Wait for JavaScript Variable

# Wait until a JS variable is populated
page.wait_for_function("window.__DATA__ !== undefined", timeout=10000)
data = page.evaluate("window.__DATA__")

Custom Wait with Polling

import time

def wait_for_content(page, selector, max_wait=15):
    start = time.time()
    while time.time() - start < max_wait:
        count = page.locator(selector).count()
        if count > 0:
            return True
        page.wait_for_timeout(500)
    raise TimeoutError(f"Content not found: {selector}")

Performance Optimization

Block Unnecessary Resources

Speed up rendering by blocking images, stylesheets, fonts, media, and tracking domains:

def block_resources(route, request):
    blocked_types = {"image", "stylesheet", "font", "media"}
    blocked_domains = ["google-analytics.com", "facebook.net", "doubleclick.net"]

    if request.resource_type in blocked_types:
        route.abort()
    elif any(domain in request.url for domain in blocked_domains):
        route.abort()
    else:
        route.continue_()

page.route("**/*", block_resources)

This can reduce page load time by 50-70%.
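
You can also register the handler once at the context level so every page opened from that context inherits the blocking; a short sketch, assuming a browser launched as in the earlier examples:

context = browser.new_context()
context.route("**/*", block_resources)  # applies to all pages from this context
page = context.new_page()
page.goto("https://example.com/products")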

Reuse Browser Instances

Don’t launch a new browser for every page:

import threading
from playwright.sync_api import sync_playwright

class BrowserPool:
    def __init__(self, pool_size=5):
        self.playwright = sync_playwright().start()
        self.browsers = [
            self.playwright.chromium.launch(headless=True)
            for _ in range(pool_size)
        ]
        self.available = list(self.browsers)
        self.lock = threading.Lock()

    def get_browser(self):
        with self.lock:
            if self.available:
                return self.available.pop()
            # Pool exhausted: launch an extra browser on demand
            return self.playwright.chromium.launch(headless=True)

    def return_browser(self, browser):
        with self.lock:
            self.available.append(browser)

    def close_all(self):
        for browser in self.browsers:
            browser.close()
        self.playwright.stop()
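
A usage sketch; fetch_rendered_html is a hypothetical helper and the URL is a placeholder:

pool = BrowserPool(pool_size=3)

def fetch_rendered_html(url):
    browser = pool.get_browser()
    try:
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        page.close()
        return html
    finally:
        pool.return_browser(browser)  # hand the browser back for reuse

try:
    html = fetch_rendered_html("https://example.com/products")
finally:
    pool.close_all()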

Common Frameworks and How to Scrape Them

React Applications

React apps often embed initial state in the HTML:

import re

import json

def extract_react_state(html):

# Look for __NEXT_DATA__ (Next.js)

match = re.search(r'<script id="__NEXT_DATA__"[^>]>(.?)</script>', html)

if match:

return json.loads(match.group(1))

# Look for window.__INITIAL_STATE__

match = re.search(r'window\.__INITIAL_STATE__\s=\s({.*?});', html, re.DOTALL)

if match:

return json.loads(match.group(1))

return None
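
When the state is embedded this way, a plain HTTP request is enough and no browser is needed. A short usage sketch (the URL is a placeholder):

import requests

html = requests.get("https://example.com/products").text
state = extract_react_state(html)
if state:
    # Next.js conventionally nests page data under props.pageProps
    page_props = state.get("props", {}).get("pageProps", {})
    print(page_props.keys())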

Vue.js / Nuxt.js

def extract_nuxt_data(html):
    match = re.search(
        r'window\.__NUXT__\s*=\s*\(function\(.*?\)\{return\s*({.*?})\}',
        html, re.DOTALL
    )
    if match:
        return json.loads(match.group(1))
    return None

Angular Applications

Angular apps rarely embed a serialized state object in the initial HTML, and their rendered DOM is usually tightly coupled to component structure. Intercepting the app's API calls (Approach 2) is often the best strategy; fall back to a headless browser and DOM extraction when you can't replicate the calls, as in the sketch below.
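
A minimal sketch under those assumptions, reusing intercept_api_calls from Approach 2; the app-root selector is the Angular CLI default and may differ per app:

def scrape_angular(url):
    # Prefer the structured JSON the app fetches over parsing its DOM
    api_data = intercept_api_calls(url)
    if api_data:
        return [item["data"] for item in api_data]

    # Fall back to rendering and DOM extraction
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        page.wait_for_selector("app-root *", timeout=10000)  # wait for content inside the root component
        html = page.content()
        browser.close()
    return html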

FAQ

When should I use a headless browser vs. API interception?

Try API interception first — it’s 10x faster and uses fewer resources. Use headless browsers only when you can’t find or replicate the API calls, or when the site uses complex authentication flows.

How much memory does headless Chrome use?

Each Chrome instance uses 100-300MB of RAM. With multiple tabs, it can quickly reach 1GB+. Use resource blocking and limit concurrent pages.

Can I run headless browsers in Docker?

Yes, both Playwright and Selenium work well in Docker. Use the official Playwright Docker image or install Chrome in your Dockerfile. See our Docker scraping guide for details.

How do I handle CAPTCHAs on JavaScript pages?

CAPTCHAs are a separate challenge. Use residential proxies to reduce CAPTCHA triggers, implement CAPTCHA-solving services, or avoid patterns that trigger CAPTCHAs in the first place.

Is headless browser scraping slower than regular HTTP scraping?

Yes, significantly — 5-20x slower per page. Headless browsers need to download, parse, and execute all JavaScript. Optimize by blocking resources, reusing browser instances, and running multiple browsers in parallel.

Conclusion

JavaScript-rendered pages require different tools than static HTML, but they’re far from unscrape-able. Start by checking for API endpoints (fastest approach), try server-side rendering detection (no browser needed), and fall back to headless browsers (Playwright preferred) when necessary. Combined with proxy rotation and rate limiting, you can reliably scrape even the most JavaScript-heavy sites.
