Headless Chrome Optimization: Memory & Speed
Headless Chrome is the go-to tool for scraping JavaScript-rendered pages, but it’s a resource hog. A single Chrome instance can consume 300MB+ of RAM, and running multiple instances tanks your server. This guide covers every optimization technique to make headless Chrome faster, leaner, and more stable for scraping at scale.
Chrome Launch Flags for Scraping
Start with the right flags to minimize resource usage:
```python
from playwright.sync_api import sync_playwright

def create_optimized_browser(proxy=None):
    p = sync_playwright().start()
    args = [
        "--no-sandbox",
        "--disable-dev-shm-usage",
        "--disable-gpu",
        "--disable-software-rasterizer",
        "--disable-extensions",
        "--disable-background-networking",
        "--disable-background-timer-throttling",
        "--disable-backgrounding-occluded-windows",
        "--disable-breakpad",
        "--disable-component-extensions-with-background-pages",
        "--disable-component-update",
        "--disable-default-apps",
        "--disable-features=TranslateUI",
        "--disable-hang-monitor",
        "--disable-ipc-flooding-protection",
        "--disable-popup-blocking",
        "--disable-prompt-on-repost",
        "--disable-renderer-backgrounding",
        "--disable-sync",
        "--force-color-profile=srgb",
        "--metrics-recording-only",
        "--no-first-run",
        "--password-store=basic",
        "--use-mock-keychain",
        "--disable-blink-features=AutomationControlled",
    ]
    launch_options = {
        "headless": True,
        "args": args,
    }
    if proxy:
        launch_options["proxy"] = {"server": proxy}
    browser = p.chromium.launch(**launch_options)
    return p, browser
```
Selenium Equivalent
```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def create_optimized_driver(proxy=None):
    options = Options()
    options.add_argument("--headless=new")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--disable-gpu")
    options.add_argument("--disable-extensions")
    options.add_argument("--disable-infobars")
    options.add_argument("--disable-notifications")
    options.add_argument("--disable-background-networking")
    options.add_argument("--disable-default-apps")
    options.add_argument("--disable-sync")
    options.add_argument("--disable-translate")
    options.add_argument("--no-first-run")
    # Saves RAM but can crash on complex pages - drop it if you see instability
    options.add_argument("--single-process")
    options.add_argument("--disable-logging")
    options.add_argument("--disable-blink-features=AutomationControlled")
    # Memory optimization
    options.add_argument("--js-flags=--max-old-space-size=512")
    options.add_argument("--disable-features=site-per-process")
    if proxy:
        options.add_argument(f"--proxy-server={proxy}")
    # Disable image loading and notification prompts
    prefs = {
        "profile.managed_default_content_settings.images": 2,
        "profile.default_content_setting_values.notifications": 2,
    }
    options.add_experimental_option("prefs", prefs)
    driver = webdriver.Chrome(options=options)
    return driver
```
Memory Optimization
1. Block Unnecessary Resources
The biggest memory saver — prevent Chrome from loading images, fonts, and tracking scripts:
```python
# Playwright resource blocking
async def block_heavy_resources(route, request):
    blocked_types = {"image", "stylesheet", "font", "media", "texttrack", "eventsource", "websocket"}
    blocked_patterns = [
        "google-analytics.com", "googletagmanager.com",
        "facebook.net", "doubleclick.net",
        "hotjar.com", "intercom.io",
        ".woff", ".woff2", ".ttf",
        ".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico",
        ".mp4", ".webm", ".ogg",
    ]
    if request.resource_type in blocked_types:
        await route.abort()
        return
    for pattern in blocked_patterns:
        if pattern in request.url:
            await route.abort()
            return
    await route.continue_()

# Apply to the page ("**/*" matches every request URL)
await page.route("**/*", block_heavy_resources)
```
Impact: Reduces memory usage by 40-60% and page load time by 50-70%.
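The allow/deny decision in the handler above can be factored into a pure function, which makes it unit-testable and reusable outside Playwright. A sketch with abbreviated lists (`should_block` is an illustrative helper, not a library API):

```python
BLOCKED_TYPES = {"image", "stylesheet", "font", "media"}
BLOCKED_PATTERNS = ["google-analytics.com", "doubleclick.net", ".woff2", ".png"]

def should_block(resource_type: str, url: str) -> bool:
    """Mirror of the route handler's decision, minus the Playwright plumbing."""
    if resource_type in BLOCKED_TYPES:
        return True
    return any(pattern in url for pattern in BLOCKED_PATTERNS)
```

The async handler then reduces to `await route.abort() if should_block(request.resource_type, request.url) else await route.continue_()`.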
2. Limit Tab/Page Count
Each tab consumes 50-150MB. Reuse pages instead of creating new ones:
```python
import asyncio

class PagePool:
    def __init__(self, browser, max_pages=5):
        self.browser = browser
        self.max_pages = max_pages
        self.pages = []
        self.available = []
        self.lock = asyncio.Lock()

    async def get_page(self):
        while True:
            async with self.lock:
                if self.available:
                    return self.available.pop()
                if len(self.pages) < self.max_pages:
                    page = await self.browser.new_page()
                    self.pages.append(page)
                    return page
            # All pages busy - wait OUTSIDE the lock so release_page can run
            await asyncio.sleep(0.1)

    async def release_page(self, page):
        # Clear page state before reusing
        await page.goto("about:blank")
        async with self.lock:
            self.available.append(page)
```
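Because the pool is plain asyncio, its logic can be sanity-checked without launching Chrome by substituting stub objects. `StubBrowser` and `StubPage` are illustrative test doubles, and a deadlock-safe pool is inlined so the snippet runs standalone:

```python
import asyncio

class StubPage:
    async def goto(self, url):
        self.url = url  # mimics page.goto for the reset in release_page

class StubBrowser:
    def __init__(self):
        self.created = 0
    async def new_page(self):
        self.created += 1
        return StubPage()

class PagePool:
    def __init__(self, browser, max_pages=5):
        self.browser = browser
        self.max_pages = max_pages
        self.pages = []
        self.available = []
        self.lock = asyncio.Lock()

    async def get_page(self):
        while True:
            async with self.lock:
                if self.available:
                    return self.available.pop()
                if len(self.pages) < self.max_pages:
                    page = await self.browser.new_page()
                    self.pages.append(page)
                    return page
            await asyncio.sleep(0.05)  # all pages busy; wait outside the lock

    async def release_page(self, page):
        await page.goto("about:blank")
        async with self.lock:
            self.available.append(page)

async def demo():
    browser = StubBrowser()
    pool = PagePool(browser, max_pages=3)

    async def worker():
        page = await pool.get_page()
        await asyncio.sleep(0.01)  # simulate scraping work
        await pool.release_page(page)

    await asyncio.gather(*[worker() for _ in range(20)])
    return browser.created

created = asyncio.run(demo())  # 20 tasks share at most 3 pages
```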
3. Clear Browser Cache Periodically
```python
# Playwright - clear context storage
async def clear_context(context):
    await context.clear_cookies()
    await context.clear_permissions()

# Create a fresh context periodically
async def get_fresh_context(browser, proxy=None):
    context_options = {
        "viewport": {"width": 1920, "height": 1080},
        "user_agent": "Mozilla/5.0 Chrome/120.0.0.0",
    }
    if proxy:
        context_options["proxy"] = {"server": proxy}
    return await browser.new_context(**context_options)
```
4. Monitor Memory Usage
```python
import psutil

def get_chrome_memory():
    """Get total memory (MB) used by all Chrome processes."""
    total_mb = 0
    for proc in psutil.process_iter(["name", "memory_info"]):
        name = proc.info["name"] or ""  # name can be None for some processes
        if "chrome" in name.lower():
            total_mb += proc.info["memory_info"].rss / 1024 / 1024
    return total_mb

def check_memory_limit(max_mb=2048):
    """Return True (restart the browser) if memory exceeds the limit."""
    return get_chrome_memory() > max_mb
```
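One way to wire the check into a scraping loop is to inject the memory reader and restart action as callables, so the loop itself is testable without Chrome. A sketch (`run_with_memory_guard` and its parameters are illustrative names, not a library API):

```python
def run_with_memory_guard(items, process, memory_mb, restart,
                          max_mb=2048, check_every=10):
    """Process items one by one; call restart() whenever memory exceeds the budget.

    process(item) does the scraping, memory_mb() reports current Chrome memory
    (e.g. get_chrome_memory above), restart() recreates the browser.
    """
    results = []
    for i, item in enumerate(items, 1):
        results.append(process(item))
        if i % check_every == 0 and memory_mb() > max_mb:
            restart()
    return results
```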
5. Use /tmp for Disk Cache in Docker
```yaml
# docker-compose: allocate tmpfs for Chrome cache
volumes:
  - type: tmpfs
    target: /tmp
    tmpfs:
      size: 536870912  # 512MB
```
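The equivalent for plain `docker run`, combining the tmpfs mount with the larger `/dev/shm` discussed in the FAQ (the image name is a placeholder):

```shell
# Bigger /dev/shm plus a tmpfs-backed /tmp for Chrome's cache
docker run --shm-size=2g --tmpfs /tmp:size=512m my-scraper-image
```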
Speed Optimization
1. Parallel Page Loading
```python
import asyncio
from playwright.async_api import async_playwright

async def scrape_urls_parallel(urls, max_concurrent=10, proxy=None):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        semaphore = asyncio.Semaphore(max_concurrent)

        async def scrape_single(url):
            async with semaphore:
                page = await browser.new_page()
                try:
                    # block_heavy_resources is the handler defined earlier
                    await page.route("**/*", block_heavy_resources)
                    await page.goto(url, wait_until="domcontentloaded", timeout=30000)
                    title = await page.title()
                    content = await page.inner_text("body")
                    return {"url": url, "title": title, "content": content[:2000]}
                except Exception as e:
                    return {"url": url, "error": str(e)}
                finally:
                    await page.close()

        results = await asyncio.gather(
            *[scrape_single(url) for url in urls]
        )
        await browser.close()
        return results

# Run
results = asyncio.run(scrape_urls_parallel(
    ["https://example1.com", "https://example2.com", ...],
    max_concurrent=10,
))
```
2. Use domcontentloaded Instead of networkidle
```python
# Slow - waits for ALL network requests to finish
await page.goto(url, wait_until="networkidle")

# Fast - DOM is ready, JS may still be loading
await page.goto(url, wait_until="domcontentloaded")

# Then wait only for the element you need
await page.wait_for_selector("#product-data", timeout=10000)
```
Impact: 2-5x faster page loads.
3. Set Navigation Timeout
```python
# Don't wait forever for slow pages
page.set_default_navigation_timeout(15000)  # 15 seconds
page.set_default_timeout(10000)  # 10 seconds for all actions
```
4. Viewport Size Matters
Smaller viewports render faster and use less memory:
```python
# For data extraction (no visual rendering needed)
context = browser.new_context(
    viewport={"width": 800, "height": 600},
    device_scale_factor=1,
)
```
5. Disable JavaScript When Possible
If the content is server-side rendered, disable JS entirely:
```python
context = browser.new_context(java_script_enabled=False)
```
Browser Instance Management
Browser Pool Pattern
```python
import asyncio
from playwright.async_api import async_playwright

class BrowserManager:
    def __init__(self, pool_size=3, max_pages_per_browser=50):
        self.pool_size = pool_size
        self.max_pages_per_browser = max_pages_per_browser
        self.browsers = []
        self.page_counts = {}
        self.playwright = None

    async def start(self):
        self.playwright = await async_playwright().start()
        for _ in range(self.pool_size):
            browser = await self._create_browser()
            self.browsers.append(browser)
            self.page_counts[id(browser)] = 0

    async def _create_browser(self):
        return await self.playwright.chromium.launch(
            headless=True,
            args=["--no-sandbox", "--disable-dev-shm-usage"],
        )

    async def get_page(self, proxy=None):
        # Find the browser with the lowest page count
        browser = min(
            self.browsers,
            key=lambda b: self.page_counts.get(id(b), 0)
        )
        browser_id = id(browser)
        # Recycle the browser if it has served too many pages
        if self.page_counts[browser_id] >= self.max_pages_per_browser:
            idx = self.browsers.index(browser)
            await browser.close()
            self.page_counts.pop(browser_id, None)  # drop the stale counter
            browser = await self._create_browser()
            self.browsers[idx] = browser
            browser_id = id(browser)
            self.page_counts[browser_id] = 0
        context_options = {}
        if proxy:
            context_options["proxy"] = {"server": proxy}
        context = await browser.new_context(**context_options)
        page = await context.new_page()
        self.page_counts[browser_id] += 1
        return page, context

    async def release(self, page, context):
        await page.close()
        await context.close()

    async def stop(self):
        for browser in self.browsers:
            await browser.close()
        # Reset state so start() can be called again after a stop()
        self.browsers.clear()
        self.page_counts.clear()
        if self.playwright:
            await self.playwright.stop()
```
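The selection-and-recycling policy inside `get_page` can be pulled out as a pure function and tested on its own; `pick_browser` is an illustrative helper, not part of the class:

```python
def pick_browser(page_counts, max_pages_per_browser):
    """Given {browser_id: pages_served}, return the least-loaded browser id
    and whether it should be recycled before serving another page."""
    browser_id = min(page_counts, key=page_counts.get)
    needs_recycle = page_counts[browser_id] >= max_pages_per_browser
    return browser_id, needs_recycle
```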
Graceful Browser Restart
```python
async def scrape_with_restart(urls, batch_size=100):
    """Restart browsers every N pages to prevent memory leaks."""
    manager = BrowserManager(pool_size=3)
    await manager.start()
    results = []
    for i in range(0, len(urls), batch_size):
        batch = urls[i:i + batch_size]
        # scrape_page is your own helper that uses manager.get_page()/release()
        batch_results = await asyncio.gather(
            *[scrape_page(manager, url) for url in batch]
        )
        results.extend(batch_results)
        # Check memory and restart if needed
        if get_chrome_memory() > 2048:
            await manager.stop()
            await manager.start()
    await manager.stop()
    return results
```
Stealth Optimization
Make headless Chrome undetectable:
```python
async def create_stealth_page(browser, proxy=None):
    context = await browser.new_context(
        viewport={"width": 1920, "height": 1080},
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36",
        locale="en-US",
        timezone_id="America/New_York",
        permissions=["geolocation"],
        geolocation={"longitude": -73.935242, "latitude": 40.730610},
        color_scheme="light",
    )
    page = await context.new_page()
    # Override navigator.webdriver
    await page.add_init_script("""
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined
        });

        // Override the chrome property
        window.chrome = {
            runtime: {},
            loadTimes: function() {},
            csi: function() {},
            app: {}
        };

        // Override permissions
        const originalQuery = window.navigator.permissions.query;
        window.navigator.permissions.query = (parameters) =>
            parameters.name === 'notifications'
                ? Promise.resolve({ state: Notification.permission })
                : originalQuery(parameters);

        // Override plugins
        Object.defineProperty(navigator, 'plugins', {
            get: () => [1, 2, 3, 4, 5]
        });

        // Override languages
        Object.defineProperty(navigator, 'languages', {
            get: () => ['en-US', 'en']
        });
    """)
    return page, context
```
Performance Benchmarks
Typical results on a 4-core, 8GB RAM server:
| Configuration | Pages/min | Memory | CPU |
|---|---|---|---|
| No optimization | 5-10 | 2GB+ | 90% |
| Resource blocking | 15-25 | 800MB | 60% |
| + Parallel (5 pages) | 50-80 | 1.2GB | 75% |
| + Page reuse | 60-100 | 1GB | 70% |
| + domcontentloaded | 80-150 | 900MB | 65% |
| All optimizations | 100-200 | 800MB | 60% |
FAQ
How many Chrome instances can I run on one server?
Rule of thumb: allocate 300-500MB RAM per instance. A 16GB server can comfortably run 20-30 instances with resource blocking enabled. Monitor with htop and adjust.
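That rule of thumb is easy to turn into a quick budget check. A sketch; the 2GB reserve for the OS and your own process is an assumption you should tune:

```python
def max_instances(total_ram_mb, per_instance_mb=500, reserve_mb=2048):
    """How many Chrome instances fit in RAM, leaving headroom for everything else."""
    return max(0, (total_ram_mb - reserve_mb) // per_instance_mb)
```

For a 16GB server, `max_instances(16 * 1024)` gives 28 at the conservative 500MB-per-instance end, consistent with the 20-30 range above.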
Should I use `headless: "new"` or `headless: true`?
Chrome’s new headless mode (--headless=new) is closer to a real browser and harder to detect. Use it when available (Chrome 112+).
Why does Chrome crash with “out of memory” in Docker?
Chrome uses /dev/shm for shared memory, which defaults to 64MB in Docker. Either increase it with --shm-size=2g or use --disable-dev-shm-usage to use /tmp instead.
How do I handle pages that never finish loading?
Set strict timeouts and use domcontentloaded instead of networkidle. Some pages have perpetual WebSocket connections that prevent networkidle from ever firing.
Is Puppeteer faster than Playwright for headless Chrome?
Performance is nearly identical since both control Chrome via the same protocol. Playwright has slightly better API design for concurrent operations and built-in auto-waiting.
Conclusion
Optimized headless Chrome can scrape 100-200 pages per minute on modest hardware. The key optimizations are resource blocking (40-60% memory savings), parallel page loading (5-10x throughput), and browser instance recycling (prevents memory leaks). Apply these techniques with proxy rotation for a fast, reliable, and stealthy scraping setup.