How to Scrape Robo-Advisor Performance Data (Wealthfront, Betterment) (2026)

Robo-advisor performance data is some of the most valuable fintech intelligence you can collect, and also some of the most aggressively protected. Wealthfront and Betterment both publish historical portfolio returns, allocation breakdowns, and projected growth figures on their public-facing sites, but neither offers a public API for that data. If you’re building a competitive analysis tool, a fintech dashboard, or a model that tracks net return drift across platforms, you need to scrape it.

What Data Is Actually Accessible

Before writing a single line of code, map what each platform exposes publicly versus behind login.

Wealthfront publishes historical performance charts for its Classic, Socially Responsible, and Direct Indexing portfolios on its public marketing pages. The data is rendered via JavaScript, loaded from an internal API endpoint (/api/portfolio_performance), and typically returns JSON with annualized returns by risk score (1-10) and time horizon.

Betterment is similar: portfolio performance by allocation (stock/bond ratio) is available on public pages, pulled via XHR calls to endpoints like /api/v1/portfolio/performance. Their data includes net-of-fee return comparisons against benchmarks like S&P 500 and a 60/40 blend.

Neither platform gates this data behind authentication. The challenge is anti-bot protection, not login.

Anti-Bot Stack You’re Up Against

Both platforms use a layered defense:

Platform      WAF          Bot Detection               Fingerprinting
Wealthfront   Cloudflare   Cloudflare Bot Management   TLS fingerprint + JS challenge
Betterment    Akamai       Akamai Bot Manager          Canvas + audio fingerprint
Both          CDN-level    Behavioral scoring          Header anomaly checks

Cloudflare Bot Management on Wealthfront scores each request for browser-like behavior. A raw requests call gets blocked at the JS challenge stage before the performance endpoint even loads. Betterment’s Akamai layer checks for inconsistencies between declared browser headers and actual TLS handshake patterns, which kills most curl-based scrapers immediately.

The practical implication: you need either a real browser (Playwright/Puppeteer with stealth plugins) or a proxy network that passes TLS fingerprinting checks. For the latter, residential IPs with proper JA3/JA4 fingerprint rotation are the baseline requirement. The Bright Data vs Decodo (Smartproxy) 2026: Full Pricing + Performance breakdown is worth reading before you commit to a proxy provider for fintech targets specifically.
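If you go the browser route, wiring a residential proxy into Playwright is a one-argument change at launch. A minimal sketch, assuming a hypothetical provider endpoint (the hostname, port, and credentials below are placeholders, not a real service):

```python
def residential_proxy_settings(host: str, port: int, user: str, password: str) -> dict:
    """Build the proxy dict that Playwright's launch() accepts.

    The host/port/credential format is provider-specific; the values
    passed in are placeholders for whatever your provider issues.
    """
    return {
        "server": f"http://{host}:{port}",
        "username": user,
        "password": password,
    }


async def launch_through_proxy():
    # Imported here so the helper above stays usable without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        # Chromium performs the TLS handshake itself, so the JA3/JA4
        # fingerprint is a genuine browser's; the proxy only contributes
        # a residential exit IP.
        browser = await p.chromium.launch(
            headless=True,
            proxy=residential_proxy_settings(
                "proxy.example-provider.com", 8000, "customer-user", "secret"
            ),
        )
        # ... run the same XHR interception as the Wealthfront script below ...
        await browser.close()
```

Because the browser owns the TLS handshake in this setup, the fingerprinting problem shifts from your HTTP client to your IP reputation, which is exactly what residential exits address.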

Extraction Approach: Intercept the XHR, Skip the DOM

Parsing the rendered DOM is fragile here. Both platforms update their frontend regularly. The more durable approach is intercepting the background XHR/fetch calls that load performance JSON directly.

With Playwright:

import asyncio
import json

from playwright.async_api import async_playwright

async def scrape_wealthfront_performance():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            viewport={"width": 1280, "height": 800}
        )
        page = await context.new_page()

        # Collect every response whose URL hits the performance endpoint.
        captured = []
        def on_response(r):
            if "portfolio_performance" in r.url:
                captured.append(r)
        page.on("response", on_response)

        # "networkidle" waits for XHR/fetch traffic to settle, so the
        # performance call has fired by the time goto() returns.
        await page.goto("https://www.wealthfront.com/historical-returns",
                        wait_until="networkidle")

        for r in captured:
            if r.status == 200:
                try:
                    data = await r.json()
                except ValueError:
                    continue  # matched the URL filter but isn't JSON; skip it
                print(json.dumps(data, indent=2))

        await browser.close()

asyncio.run(scrape_wealthfront_performance())

This captures the raw JSON before any frontend transformation. Once you have the endpoint URL and response structure, you can attempt direct HTTP requests with matching headers to reduce browser overhead on subsequent runs.

Key headers to replicate for Betterment’s Akamai layer:

  • sec-ch-ua matching the UA string exactly
  • sec-fetch-site: same-origin
  • x-requested-with absent (Betterment’s XHR does not send this)
  • Cookie session values from an initial page load
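As a sketch of that header set, assuming a session cookie captured from an initial page load (the sec-ch-ua value below is illustrative and must be kept in sync with whatever Chrome version your user-agent string claims):

```python
def betterment_xhr_headers(user_agent: str, session_cookie: str) -> dict:
    """Headers mirroring what Betterment's own frontend XHR sends.

    x-requested-with is deliberately absent, and the sec-ch-ua brand
    list must agree with the version claimed in `user_agent`.
    """
    return {
        "User-Agent": user_agent,
        # Keep this in sync with the Chrome version in user_agent:
        "sec-ch-ua": '"Chromium";v="120", "Not_A Brand";v="8"',
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": '"Windows"',
        "sec-fetch-site": "same-origin",
        "sec-fetch-mode": "cors",
        "sec-fetch-dest": "empty",
        "Accept": "application/json",
        "Referer": "https://www.betterment.com/",
        "Cookie": session_cookie,
    }
```

Feed this dict to whatever HTTP client you use for the lightweight follow-up requests; a single mismatched client-hint header is enough for Akamai to flag the session.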

Scheduling and Rate Management

Performance data on both platforms updates monthly, not daily, so aggressive polling is pointless and accelerates IP bans. A reasonable production schedule:

  1. Run full browser-based extraction once per month (aligned to Betterment’s typical update cycle around the 10th of each month)
  2. Cache the JSON response with a timestamp
  3. Run a lightweight HEAD or conditional GET check weekly to detect content changes
  4. Only trigger full re-scrape when ETag or Last-Modified headers change
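Steps 3 and 4 reduce to a small decision function plus a conditional GET. A stdlib-only sketch, where the URL is whichever performance endpoint you captured earlier:

```python
import urllib.error
import urllib.request


def should_rescrape(cached_etag, response_status, response_etag):
    """Decide whether a full browser run is needed from the weekly probe."""
    if cached_etag is None:          # nothing cached yet
        return True
    if response_status == 304:       # server says: unchanged
        return False
    return response_etag != cached_etag  # changed, or server dropped the ETag


def weekly_probe(url, cached_etag):
    """Lightweight check: conditional GET with If-None-Match, body discarded."""
    req = urllib.request.Request(url, headers={"If-None-Match": cached_etag or ""})
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return should_rescrape(cached_etag, resp.status, resp.headers.get("ETag"))
    except urllib.error.HTTPError as e:
        # urllib raises on 304 Not Modified; treat it as "unchanged"
        return should_rescrape(cached_etag, e.code, e.headers.get("ETag"))
```

If the endpoint turns out not to send ETag or Last-Modified at all, fall back to hashing the response body and comparing digests week over week.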

For comparison: How to Scrape Mortgage Rate Aggregators Daily (2026) covers a much higher-frequency use case where daily delta tracking matters. Robo-advisor data doesn’t need that cadence.

If you’re running this at scale across multiple fintech targets, residential proxies with sticky sessions work better than rotating IPs here. Rotating IPs mid-session breaks Akamai’s behavioral scoring and triggers re-challenges. Sticky sessions that persist for 10-15 minutes per scrape run are the right configuration.
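One common way providers expose sticky sessions is a session token embedded in the proxy username; the exact "-session-" syntax below is illustrative, so check your provider's docs for the real format:

```python
import uuid


def sticky_session_credentials(base_user: str, password: str) -> dict:
    """Mint proxy credentials with one stable session token per scrape run.

    Reusing a single token keeps the same residential exit IP for the
    whole 10-15 minute run; minting a new token per request would rotate
    IPs mid-session and trip Akamai's behavioral scoring.
    """
    session_id = uuid.uuid4().hex[:12]
    return {
        "username": f"{base_user}-session-{session_id}",
        "password": password,
    }
```

Call this once at the start of each monthly run and pass the result into your proxy configuration, not once per request.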

Normalizing and Comparing the Output

Wealthfront returns annualized returns indexed by risk score (1-10). Betterment returns them indexed by stock allocation percentage (0-100%). They’re not directly comparable without a mapping layer.

A workable normalization:

  • Wealthfront Risk 5 ≈ Betterment 60% stocks (moderate allocation)
  • Wealthfront Risk 7 ≈ Betterment 80% stocks (growth)
  • Wealthfront Risk 9 ≈ Betterment 95% stocks (aggressive)
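A minimal sketch of that mapping layer, using the three anchor points above with linear interpolation in between (the interpolation is an assumption for illustration, not a published equivalence):

```python
# Anchor points from the comparison above: Wealthfront risk score ->
# approximate Betterment stock allocation (%).
RISK_TO_STOCK_PCT = {5: 60, 7: 80, 9: 95}


def wealthfront_risk_to_betterment_stock(risk: float) -> float:
    """Interpolate a Betterment stock % for a Wealthfront risk score.

    Scores outside the 5-9 anchor range are clamped to the nearest anchor.
    """
    anchors = sorted(RISK_TO_STOCK_PCT.items())
    lo_risk, lo_stock = anchors[0]
    hi_risk, hi_stock = anchors[-1]
    if risk <= lo_risk:
        return float(lo_stock)
    if risk >= hi_risk:
        return float(hi_stock)
    # Linear interpolation between the two surrounding anchors.
    for (r0, s0), (r1, s1) in zip(anchors, anchors[1:]):
        if r0 <= risk <= r1:
            return s0 + (s1 - s0) * (risk - r0) / (r1 - r0)
```

With both datasets projected onto a stock-percentage axis, net-return comparisons at matched allocations become a simple join on the mapped key.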

Both platforms report net-of-fee returns, which matters. Wealthfront’s fee (0.25% AUM) is baked into their published numbers. Betterment’s (also 0.25%) is likewise included. Gross return figures aren’t published, which limits how granularly you can decompose performance attribution.

For adjacent fintech scraping projects, the data normalization challenges in How to Scrape Bank Rate Comparison Sites at Scale (2026) are similar: different providers structure the same underlying metric differently and you need a canonical schema layer. The same pattern applies here.

Handling Schema Drift

Both platforms have restructured their internal API responses at least twice in the past 18 months. Build schema validation into your pipeline from day one, not as an afterthought.

Useful patterns:

  • Use pydantic models with model_config = ConfigDict(extra='ignore') so new fields don’t break your parser
  • Log raw responses to cold storage (S3 or local disk) before transformation, so you can replay when the schema changes
  • Set up an alert on KeyError or field-missing failures rather than silently returning null data
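A minimal pydantic v2 sketch of the first and third patterns together; the field names are illustrative, not Wealthfront's actual response schema:

```python
from pydantic import BaseModel, ConfigDict


class PerformancePoint(BaseModel):
    """One row of the performance JSON (field names are assumptions)."""

    # extra="ignore": new upstream fields pass through without breaking parsing.
    model_config = ConfigDict(extra="ignore")

    risk_score: int
    annualized_return: float


def parse_rows(raw_rows: list[dict]) -> list[PerformancePoint]:
    # A renamed or missing required field raises ValidationError here,
    # loudly, instead of silently producing null data downstream.
    return [PerformancePoint.model_validate(row) for row in raw_rows]
```

Wrap the parse_rows call in your alerting path so a ValidationError pages you the day the schema drifts, while unknown extra fields stay a non-event.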

This is the same discipline required when scraping personal loan or insurance aggregators. How to Scrape Personal Loan Aggregators in 2026 and How to Scrape Insurance Quote Aggregators Programmatically (2026) both cover schema drift handling in detail for similarly volatile fintech endpoints.

Bottom line

Robo-advisor performance scraping is technically achievable with Playwright plus a residential proxy layer, but it rewards patience over aggression: monthly runs, sticky sessions, and XHR interception beat DOM parsing every time. Pick a proxy provider that handles TLS fingerprinting before you hit Akamai or Cloudflare. DRT covers the full stack of fintech data collection infrastructure if you need to go deeper on the proxy and anti-bot side.
