Headless Chrome Optimization: Memory & Speed

Headless Chrome is the go-to tool for scraping JavaScript-rendered pages, but it’s a resource hog. A single Chrome instance can consume 300MB+ of RAM, and running multiple instances tanks your server. This guide covers the optimization techniques that matter most for making headless Chrome faster, leaner, and more stable for scraping at scale.

Chrome Launch Flags for Scraping

Start with the right flags to minimize resource usage:

from playwright.sync_api import sync_playwright

def create_optimized_browser(proxy=None):
    p = sync_playwright().start()
    args = [
        "--no-sandbox",
        "--disable-dev-shm-usage",
        "--disable-gpu",
        "--disable-software-rasterizer",
        "--disable-extensions",
        "--disable-background-networking",
        "--disable-background-timer-throttling",
        "--disable-backgrounding-occluded-windows",
        "--disable-breakpad",
        "--disable-component-extensions-with-background-pages",
        "--disable-component-update",
        "--disable-default-apps",
        "--disable-features=TranslateUI",
        "--disable-hang-monitor",
        "--disable-ipc-flooding-protection",
        "--disable-popup-blocking",
        "--disable-prompt-on-repost",
        "--disable-renderer-backgrounding",
        "--disable-sync",
        "--force-color-profile=srgb",
        "--metrics-recording-only",
        "--no-first-run",
        "--password-store=basic",
        "--use-mock-keychain",
        "--disable-blink-features=AutomationControlled",
    ]
    launch_options = {
        "headless": True,
        "args": args,
    }
    if proxy:
        launch_options["proxy"] = {"server": proxy}
    browser = p.chromium.launch(**launch_options)
    return p, browser
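
A quick usage sketch for the helper above; the proxy URL is a placeholder:

p, browser = create_optimized_browser(proxy="http://user:pass@proxy.example.com:8000")  # placeholder proxy
page = browser.new_page()
page.goto("https://example.com")
print(page.title())
browser.close()
p.stop()  # stop the Playwright driver started inside the helper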

Selenium Equivalent

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def create_optimized_driver(proxy=None):
    options = Options()
    options.add_argument("--headless=new")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--disable-gpu")
    options.add_argument("--disable-extensions")
    options.add_argument("--disable-infobars")
    options.add_argument("--disable-notifications")
    options.add_argument("--disable-background-networking")
    options.add_argument("--disable-default-apps")
    options.add_argument("--disable-sync")
    options.add_argument("--disable-translate")
    options.add_argument("--no-first-run")
    # Saves memory but can be unstable on complex pages; drop it if you see crashes
    options.add_argument("--single-process")
    options.add_argument("--disable-logging")
    options.add_argument("--disable-blink-features=AutomationControlled")
    # Memory optimization
    options.add_argument("--js-flags=--max-old-space-size=512")
    options.add_argument("--disable-features=site-per-process")
    if proxy:
        options.add_argument(f"--proxy-server={proxy}")
    # Disable image loading and notification prompts
    prefs = {
        "profile.managed_default_content_settings.images": 2,
        "profile.default_content_setting_values.notifications": 2,
    }
    options.add_experimental_option("prefs", prefs)
    driver = webdriver.Chrome(options=options)
    return driver
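
And the matching usage sketch for the Selenium driver:

driver = create_optimized_driver()
try:
    driver.get("https://example.com")
    print(driver.title)
finally:
    driver.quit()  # kills chromedriver and all Chrome child processes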

Memory Optimization

1. Block Unnecessary Resources

The biggest memory saver — prevent Chrome from loading images, stylesheets, fonts, media, and tracking scripts:

# Playwright resource blocking
async def block_heavy_resources(route, request):
    blocked_types = {"image", "stylesheet", "font", "media", "texttrack", "eventsource", "websocket"}
    blocked_patterns = [
        "google-analytics.com", "googletagmanager.com",
        "facebook.net", "doubleclick.net",
        "hotjar.com", "intercom.io",
        ".woff", ".woff2", ".ttf",
        ".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico",
        ".mp4", ".webm", ".ogg",
    ]
    if request.resource_type in blocked_types:
        await route.abort()
        return
    for pattern in blocked_patterns:
        if pattern in request.url:
            await route.abort()
            return
    await route.continue_()

# Apply to every request on the page
await page.route("**/*", block_heavy_resources)

Impact: Reduces memory usage by 40-60% and page load time by 50-70%.

2. Limit Tab/Page Count

Each tab consumes 50-150MB. Reuse pages instead of creating new ones:

import asyncio

class PagePool:
    def __init__(self, browser, max_pages=5):
        self.browser = browser
        self.max_pages = max_pages
        self.pages = []
        self.available = []
        self.lock = asyncio.Lock()

    async def get_page(self):
        async with self.lock:
            if self.available:
                return self.available.pop()
            if len(self.pages) < self.max_pages:
                page = await self.browser.new_page()
                self.pages.append(page)
                return page
        # Wait outside the lock so release_page() can acquire it
        while True:
            async with self.lock:
                if self.available:
                    return self.available.pop()
            await asyncio.sleep(0.1)

    async def release_page(self, page):
        # Clear page state before reusing
        await page.goto("about:blank")
        async with self.lock:
            self.available.append(page)
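
Here's a minimal sketch of the pool in use with async Playwright; the URLs are placeholders:

import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        pool = PagePool(browser, max_pages=5)

        async def fetch_title(url):
            page = await pool.get_page()
            try:
                await page.goto(url, wait_until="domcontentloaded")
                return await page.title()
            finally:
                await pool.release_page(page)

        print(await asyncio.gather(
            fetch_title("https://example.com"),
            fetch_title("https://example.org"),
        ))
        await browser.close()

asyncio.run(main())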

3. Clear Browser Cache Periodically

# Playwright - clear context storage
async def clear_context(context):
    await context.clear_cookies()
    await context.clear_permissions()

# Create a fresh context periodically
async def get_fresh_context(browser, proxy=None):
    context_options = {
        "viewport": {"width": 1920, "height": 1080},
        "user_agent": "Mozilla/5.0 Chrome/120.0.0.0",
    }
    if proxy:
        context_options["proxy"] = {"server": proxy}
    return await browser.new_context(**context_options)
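
One way to put this to work is a sketch that swaps in a fresh context every N pages; the interval of 50 is an assumption to tune for your workload:

async def scrape_with_fresh_contexts(browser, urls, pages_per_context=50):
    results = []
    context = await get_fresh_context(browser)
    for i, url in enumerate(urls):
        # Replace the context periodically to drop cookies, cache, and storage
        if i > 0 and i % pages_per_context == 0:
            await context.close()
            context = await get_fresh_context(browser)
        page = await context.new_page()
        try:
            await page.goto(url, wait_until="domcontentloaded")
            results.append({"url": url, "title": await page.title()})
        finally:
            await page.close()
    await context.close()
    return results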

4. Monitor Memory Usage

import psutil

def get_chrome_memory():
    """Get total memory (MB) used by all Chrome processes."""
    total_mb = 0
    for proc in psutil.process_iter(["name", "memory_info"]):
        if "chrome" in (proc.info["name"] or "").lower():
            total_mb += proc.info["memory_info"].rss / 1024 / 1024
    return total_mb

def check_memory_limit(max_mb=2048):
    """Return True if Chrome's memory use exceeds the limit."""
    current = get_chrome_memory()
    if current > max_mb:
        return True  # Signal to restart browser
    return False
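
Here's a sketch of check_memory_limit() driving a restart loop; it assumes p is a started async_playwright() instance as in the examples below:

async def scrape_with_memory_guard(p, urls, max_mb=2048):
    browser = await p.chromium.launch(headless=True)
    titles = []
    for url in urls:
        # Relaunch the browser once Chrome's total footprint crosses the limit
        if check_memory_limit(max_mb):
            await browser.close()
            browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        try:
            await page.goto(url, wait_until="domcontentloaded")
            titles.append(await page.title())
        finally:
            await page.close()
    await browser.close()
    return titles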

5. Use /tmp for Disk Cache in Docker

# docker-compose.yml - allocate tmpfs for Chrome cache
volumes:
  - type: tmpfs
    target: /tmp
    tmpfs:
      size: 536870912  # 512MB

Speed Optimization

1. Parallel Page Loading

import asyncio
from playwright.async_api import async_playwright

async def scrape_urls_parallel(urls, max_concurrent=10, proxy=None):
    async with async_playwright() as p:
        launch_options = {"headless": True}
        if proxy:
            launch_options["proxy"] = {"server": proxy}
        browser = await p.chromium.launch(**launch_options)
        semaphore = asyncio.Semaphore(max_concurrent)

        async def scrape_single(url):
            async with semaphore:
                page = await browser.new_page()
                try:
                    await page.route("**/*", block_heavy_resources)
                    await page.goto(url, wait_until="domcontentloaded", timeout=30000)
                    title = await page.title()
                    content = await page.inner_text("body")
                    return {"url": url, "title": title, "content": content[:2000]}
                except Exception as e:
                    return {"url": url, "error": str(e)}
                finally:
                    await page.close()

        results = await asyncio.gather(
            *[scrape_single(url) for url in urls]
        )
        await browser.close()
        return results

# Run
results = asyncio.run(scrape_urls_parallel(
    ["https://example1.com", "https://example2.com", ...],
    max_concurrent=10,
))

2. Use domcontentloaded Instead of networkidle

# Slow - waits for ALL network requests to finish
await page.goto(url, wait_until="networkidle")

# Fast - DOM is ready, JS may still be loading
await page.goto(url, wait_until="domcontentloaded")

# Then wait only for the element you need
await page.wait_for_selector("#product-data", timeout=10000)

Impact: 2-5x faster page loads.

3. Set Navigation Timeout

# Don't wait forever for slow pages
page.set_default_navigation_timeout(15000)  # 15 seconds
page.set_default_timeout(10000)  # 10 seconds for all other actions

4. Viewport Size Matters

Smaller viewports render faster and use less memory:

# For data extraction (no visual rendering needed)
context = browser.new_context(
    viewport={"width": 800, "height": 600},
    device_scale_factor=1,
)

5. Disable JavaScript When Possible

If the content is server-side rendered, disable JS entirely:

context = browser.new_context(java_script_enabled=False)
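
A quick way to find out whether a page is server-side rendered: load it with JavaScript off and check whether the element you need exists. The selector here is a placeholder:

async def is_server_side_rendered(browser, url, selector="#product-data"):
    # If the selector shows up without JS, you can scrape the cheap way
    context = await browser.new_context(java_script_enabled=False)
    page = await context.new_page()
    try:
        await page.goto(url, wait_until="domcontentloaded")
        return await page.query_selector(selector) is not None
    finally:
        await context.close()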

Browser Instance Management

Browser Pool Pattern

import asyncio
from playwright.async_api import async_playwright

class BrowserManager:
    def __init__(self, pool_size=3, max_pages_per_browser=50):
        self.pool_size = pool_size
        self.max_pages_per_browser = max_pages_per_browser
        self.browsers = []
        self.page_counts = {}
        self.playwright = None

    async def start(self):
        self.playwright = await async_playwright().start()
        for _ in range(self.pool_size):
            browser = await self._create_browser()
            self.browsers.append(browser)
            self.page_counts[id(browser)] = 0

    async def _create_browser(self):
        return await self.playwright.chromium.launch(
            headless=True,
            args=["--no-sandbox", "--disable-dev-shm-usage"],
        )

    async def get_page(self, proxy=None):
        # Find the browser with the lowest page count
        browser = min(
            self.browsers,
            key=lambda b: self.page_counts.get(id(b), 0),
        )
        browser_id = id(browser)
        # Recycle the browser if it has served too many pages
        if self.page_counts[browser_id] >= self.max_pages_per_browser:
            idx = self.browsers.index(browser)
            await browser.close()
            browser = await self._create_browser()
            self.browsers[idx] = browser
            browser_id = id(browser)
            self.page_counts[browser_id] = 0
        context_options = {}
        if proxy:
            context_options["proxy"] = {"server": proxy}
        context = await browser.new_context(**context_options)
        page = await context.new_page()
        self.page_counts[browser_id] += 1
        return page, context

    async def release(self, page, context):
        await page.close()
        await context.close()

    async def stop(self):
        for browser in self.browsers:
            await browser.close()
        if self.playwright:
            await self.playwright.stop()

Graceful Browser Restart

async def scrape_with_restart(urls, batch_size=100):
    """Restart browsers every N pages to prevent memory leaks."""
    manager = BrowserManager(pool_size=3)
    await manager.start()
    results = []
    for i in range(0, len(urls), batch_size):
        batch = urls[i:i + batch_size]
        batch_results = await asyncio.gather(
            *[scrape_page(manager, url) for url in batch]
        )
        results.extend(batch_results)
        # Check memory and restart the pool if needed
        if get_chrome_memory() > 2048:
            await manager.stop()
            await manager.start()
    await manager.stop()
    return results
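
The loop above assumes a scrape_page(manager, url) coroutine that isn't defined elsewhere in this guide; a minimal sketch might look like this:

async def scrape_page(manager, url):
    page, context = await manager.get_page()
    try:
        await page.goto(url, wait_until="domcontentloaded", timeout=15000)
        return {"url": url, "title": await page.title()}
    except Exception as e:
        return {"url": url, "error": str(e)}
    finally:
        await manager.release(page, context)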

Stealth Optimization

Make headless Chrome harder to detect:

async def create_stealth_page(browser, proxy=None):
    context = await browser.new_context(
        viewport={"width": 1920, "height": 1080},
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36",
        locale="en-US",
        timezone_id="America/New_York",
        permissions=["geolocation"],
        geolocation={"longitude": -73.935242, "latitude": 40.730610},
        color_scheme="light",
        proxy={"server": proxy} if proxy else None,
    )
    page = await context.new_page()
    # Patch the most common headless fingerprints before any page script runs
    await page.add_init_script("""
        // Hide navigator.webdriver
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined
        });
        // Restore the window.chrome object that headless Chrome lacks
        window.chrome = {
            runtime: {},
            loadTimes: function() {},
            csi: function() {},
            app: {}
        };
        // Align the permissions API with a real browser
        const originalQuery = window.navigator.permissions.query;
        window.navigator.permissions.query = (parameters) =>
            parameters.name === 'notifications'
                ? Promise.resolve({ state: Notification.permission })
                : originalQuery(parameters);
        // Report a non-empty plugins list
        Object.defineProperty(navigator, 'plugins', {
            get: () => [1, 2, 3, 4, 5]
        });
        // Report consistent languages
        Object.defineProperty(navigator, 'languages', {
            get: () => ['en-US', 'en']
        });
    """)
    return page, context

Performance Benchmarks

Typical results on a 4-core, 8GB RAM server:

| Configuration | Pages/min | Memory | CPU |
| --- | --- | --- | --- |
| No optimization | 5-10 | 2GB+ | 90% |
| Resource blocking | 15-25 | 800MB | 60% |
| + Parallel (5 pages) | 50-80 | 1.2GB | 75% |
| + Page reuse | 60-100 | 1GB | 70% |
| + domcontentloaded | 80-150 | 900MB | 65% |
| All optimizations | 100-200 | 800MB | 60% |

FAQ

How many Chrome instances can I run on one server?

Rule of thumb: allocate 300-500MB RAM per instance. A 16GB server can comfortably run 20-30 instances with resource blocking enabled. Monitor with htop and adjust.

Should I use headless: “new” or headless: true?

Chrome’s new headless mode (--headless=new) is closer to a real browser and harder to detect. Use it when available (Chrome 112+).

Why does Chrome crash with “out of memory” in Docker?

Chrome uses /dev/shm for shared memory, which defaults to 64MB in Docker. Either increase it with --shm-size=2g or use --disable-dev-shm-usage to use /tmp instead.

How do I handle pages that never finish loading?

Set strict timeouts and use domcontentloaded instead of networkidle. Some pages have perpetual WebSocket connections that prevent networkidle from ever firing.
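
For example, a small navigation helper (a sketch using Playwright's async API) that keeps whatever rendered before the timeout:

from playwright.async_api import TimeoutError as PlaywrightTimeoutError

async def goto_with_fallback(page, url, timeout_ms=15000):
    try:
        await page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
    except PlaywrightTimeoutError:
        pass  # keep the partial DOM that loaded before the timeout
    return await page.content()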

Is Puppeteer faster than Playwright for headless Chrome?

Performance is nearly identical since both control Chrome via the same protocol. Playwright has slightly better API design for concurrent operations and built-in auto-waiting.

Conclusion

Optimized headless Chrome can scrape 100-200 pages per minute on modest hardware. The key optimizations are resource blocking (40-60% memory savings), parallel page loading (5-10x throughput), and browser instance recycling (prevents memory leaks). Apply these techniques with proxy rotation for a fast, reliable, and stealthy scraping setup.
