SimilarWeb’s public-facing pages expose traffic estimates, top pages, referral sources, and keyword data without an API key. this guide covers the scraping approach, the data available for free vs. paid, and how to structure a competitive intelligence pipeline.
what data similarweb exposes publicly
SimilarWeb’s website overview pages at similarweb.com/website/{domain}/ show monthly visits, visit duration, pages per visit, bounce rate, traffic by channel (direct, search, social, referral, email, display), and top countries. this data is available without a login for the most recent month.
behind login (free account), you get 3 months of history and some keyword data. the paid tiers unlock 12-36 months of history, full keyword lists, competitor comparisons, and API access. this guide focuses on the publicly accessible data and what you can extract without an account.
site structure and request patterns
SimilarWeb uses React Server Components and Next.js. the initial page load delivers server-rendered HTML with traffic summary data embedded in a __NEXT_DATA__ JSON blob. this is the most reliable extraction target; it is structured JSON, not scraped HTML attributes.
```python
from curl_cffi import requests as cffi_requests
import json
from lxml import html

def scrape_similarweb(domain):
    url = f"https://www.similarweb.com/website/{domain}/"
    # impersonate Chrome's TLS/HTTP2 fingerprint; this also sets a matching
    # User-Agent, so don't override it with a manual one
    s = cffi_requests.Session(impersonate="chrome120")
    resp = s.get(url, headers={
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": "https://www.google.com/",
    })
    if resp.status_code != 200:
        return {"error": resp.status_code}
    tree = html.fromstring(resp.text)
    # the server-rendered traffic summary lives in the __NEXT_DATA__ script tag
    next_data_els = tree.xpath('//script[@id="__NEXT_DATA__"]')
    if not next_data_els:
        return {"error": "no __NEXT_DATA__ found"}
    data = json.loads(next_data_els[0].text_content())
    try:
        site_data = data["props"]["pageProps"]["site"]
        return {
            "domain": domain,
            "monthly_visits": site_data.get("engagements", {}).get("visits"),
            "bounce_rate": site_data.get("engagements", {}).get("bounceRate"),
        }
    except (KeyError, TypeError):
        return {"error": "structure changed, check __NEXT_DATA__ schema"}

result = scrape_similarweb("shopify.com")
print(result)
```

bot protection on similarweb
SimilarWeb uses Akamai Bot Manager, which is a tier above DataDome in terms of detection sophistication. Akamai checks: sensor data collected by injected JavaScript (keyboard events, mouse movement patterns, timing), TLS fingerprint, HTTP/2 frame order, and IP reputation.
for low-volume scraping (under 100 domains/day), curl-cffi with a residential proxy and proper headers passes Akamai’s first-party checks on most requests. for higher volumes, you will start seeing CAPTCHAs or blocks. at that point, you need Playwright with stealth patches on residential IPs.
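before reaching for a full browser, it is worth handling blocks gracefully: back off and retry (rotating the proxy between attempts) when a 403 or 429 comes back. a minimal sketch, assuming a caller-supplied fetch function that returns an object with a status code (names here are illustrative):

```python
import random
import time

def fetch_with_backoff(fetch, domain, max_retries=4, base_delay=2.0):
    """Retry a blocked request with exponential backoff and jitter.

    `fetch` is any callable returning an object with a .status_code;
    in a real pipeline you would also rotate the residential proxy
    between attempts.
    """
    resp = None
    for attempt in range(max_retries):
        resp = fetch(domain)
        if resp.status_code not in (403, 429):
            return resp
        # exponential backoff (base, 2x, 4x, ...) plus jitter so retries
        # from parallel workers don't land at the same instant
        time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
    return resp
```

the jitter matters: synchronized retries from several workers look like a burst, which is exactly the pattern a rate limiter keys on.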
alternative: the unofficial api
SimilarWeb has an internal API that the frontend calls for dynamic data. the endpoints follow the pattern https://data.similarweb.com/api/v1/data?domain={domain}. these endpoints require a session cookie from a logged-in browser session. they are undocumented and change without notice.
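a sketch of how such a request might be assembled, following the endpoint pattern above — the cookie contents and any extra query parameters are assumptions and must be captured from a logged-in browser session's network tab:

```python
def build_unofficial_request(domain, session_cookie):
    """Assemble URL and headers for SimilarWeb's internal data endpoint.

    The URL follows the pattern the frontend calls; the cookie must be
    copied from a logged-in browser session, and both can change
    without notice.
    """
    url = f"https://data.similarweb.com/api/v1/data?domain={domain}"
    headers = {
        "Cookie": session_cookie,  # copied verbatim from the browser
        "Referer": f"https://www.similarweb.com/website/{domain}/",
        "Accept": "application/json",
    }
    return url, headers
```

treat this as a starting point for inspection, not a stable integration: a schema or auth change on SimilarWeb's side silently breaks it.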
for a more stable approach, use their official API if you have a paid account. if you are scraping SimilarWeb data seriously, the paid API is cheaper than the engineering cost of maintaining an unofficial scraper.
building a competitive intelligence pipeline
a practical pipeline: collect SimilarWeb estimates for 500 competitor domains weekly. store in PostgreSQL with timestamps. track month-over-month traffic changes. flag domains with 20%+ growth or decline for manual review.
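the "flag for review" step is plain arithmetic on consecutive snapshots; a small helper (hypothetical, assuming you query the two most recent monthly visit counts per domain) might look like:

```python
def flag_traffic_change(prev_visits, curr_visits, threshold=0.20):
    """Return the fractional month-over-month change and whether it
    crosses the review threshold in either direction (growth or decline)."""
    if not prev_visits:
        return None, False  # no baseline month yet
    change = (curr_visits - prev_visits) / prev_visits
    return change, abs(change) >= threshold
```

running this over each domain's last two snapshots after the weekly collection gives you the review queue.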
```python
import time
import psycopg2
from datetime import datetime, timezone

domains = ["competitor1.com", "competitor2.com", "competitor3.com"]

conn = psycopg2.connect("postgresql://user:pass@localhost/competitive_intel")
cur = conn.cursor()

for domain in domains:
    data = scrape_similarweb(domain)
    if "error" not in data:
        sql = (
            "INSERT INTO traffic_snapshots "
            "(domain, monthly_visits, bounce_rate, collected_at) "
            "VALUES (%s, %s, %s, %s)"
        )
        cur.execute(sql, (domain, data["monthly_visits"],
                          data["bounce_rate"], datetime.now(timezone.utc)))
        conn.commit()
    time.sleep(3)  # pace requests to stay under rate thresholds

cur.close()
conn.close()
```

data accuracy and limitations
SimilarWeb’s traffic estimates are derived from a panel of browser extensions, ISP data partnerships, and web crawls. for sites with under 50,000 monthly visits, the estimates have high variance and are often unreliable. for sites over 500,000 monthly visits, the estimates are typically within 20-30% of actual traffic. use SimilarWeb for directional competitive intelligence, not precise traffic numbers. see what is web scraping for general scraping fundamentals.
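when presenting these numbers downstream, it helps to attach a crude reliability label. a sketch using the rough thresholds from this section — treat them as heuristics, not SimilarWeb guarantees:

```python
def estimate_reliability(monthly_visits):
    """Label a SimilarWeb visit estimate by how much to trust it,
    using the rule-of-thumb thresholds discussed above."""
    if monthly_visits is None:
        return "unknown"
    if monthly_visits < 50_000:
        return "low"     # high variance, often unreliable
    if monthly_visits < 500_000:
        return "medium"  # directional at best
    return "high"        # typically within 20-30% of actual traffic
```

storing the label alongside each snapshot keeps later consumers from over-trusting small-site estimates.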
sources and further reading
- SimilarWeb methodology documentation
- Akamai Bot Manager overview
- SimilarWeb official API documentation