Most DeFi data worth having sits behind a mix of open JSON APIs, on-chain RPC calls, and a handful of aggregator endpoints that nobody documents properly. If you need to scrape DeFi protocol data (TVL history, live yield rates, vault token compositions), the good news is that roughly 90% of it is publicly accessible without authentication. The bad news is that schemas change constantly, rate limits are aggressive, and the same metric can mean four different things depending on which aggregator you ask.
## Where DeFi Data Actually Lives
The three layers that matter are on-chain state (direct RPC), protocol-level APIs (each protocol runs its own), and aggregators (DefiLlama, Zapper, DeBridge Analytics). Each has a different update frequency and reliability profile.
| Source | Update Frequency | Auth Required | Rate Limit |
|---|---|---|---|
| DefiLlama API | 1-5 min | No | ~300 req/min |
| Protocol subgraphs (The Graph) | Per block (~12s) | API key (free tier) | 1000 req/day free |
| Direct RPC (Alchemy/Infura) | Real-time | API key | Varies by plan |
| Zapper API | 5-15 min | API key (waitlist) | Strict |
| Yearn v3 API | Per block | No | Generous |
DefiLlama is the starting point for most TVL and yield pipelines. Its REST API returns clean JSON and covers 3,000+ protocols across 200+ chains. The /tvl/{protocol} endpoint returns a protocol's current TVL as a single figure; for the full time series going back to protocol launch, plus token allocations, per-chain breakdowns, and 24h/7d change figures, use /protocol/{slug}.
## Pulling TVL and Yield Data from DefiLlama
The API is unauthenticated and well-structured. A basic Python pipeline looks like this:
```python
import httpx
import time

BASE = "https://api.llama.fi"
YIELDS_BASE = "https://yields.llama.fi"  # yield endpoints live on a separate host

def fetch_protocol_tvl(slug: str) -> dict:
    """Full TVL history and current breakdown for one protocol."""
    r = httpx.get(f"{BASE}/protocol/{slug}", timeout=15)
    r.raise_for_status()
    return r.json()

def fetch_yields(min_tvl_usd: float = 1_000_000) -> list[dict]:
    """All tracked pools, filtered to a minimum TVL."""
    r = httpx.get(f"{YIELDS_BASE}/pools", timeout=30)
    r.raise_for_status()
    pools = r.json()["data"]
    return [p for p in pools if (p.get("tvlUsd") or 0) >= min_tvl_usd]

# Space out requests to respect DefiLlama's soft rate limit
slugs = ["aave-v3", "compound-v3", "morpho", "fluid"]
results = []
for slug in slugs:
    results.append(fetch_protocol_tvl(slug))
    time.sleep(0.3)
```

The /pools endpoint returns APY, TVL, chain, project, and underlying tokens for every tracked pool (roughly 8,000+ rows). Filter on apyBase, apyReward, and ilRisk to build screeners. For historical yield curves, use /chart/{pool_id} on the yields host with the UUID from the pool listing.
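To pull one of those historical curves, a small helper on top of the same client works. This is a minimal sketch: the pool ID argument is whatever UUID the pool field of a /pools row contains, shown here only as a placeholder.

```python
# Sketch: historical APY/TVL series for a single pool. Pass a real UUID
# taken from the "pool" field of a /pools row; "<pool-uuid>" is a placeholder.
def fetch_yield_history(pool_id: str) -> list[dict]:
    r = httpx.get(f"{YIELDS_BASE}/chart/{pool_id}", timeout=30)
    r.raise_for_status()
    return r.json()["data"]

# history = fetch_yield_history("<pool-uuid>")
```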
## Vault Compositions via The Graph
DefiLlama gives you totals. If you need token-level vault compositions (what percentage of a Yearn vault is in stETH vs. USDC, for example), you need subgraph queries. The Graph hosts subgraphs for Aave, Compound, Uniswap, Balancer, and dozens more at gateway.thegraph.com.
The free tier gives you 1,000 queries per day per API key. For high-frequency needs, self-hosting a graph node against an archive RPC is the right call: expensive upfront, but it removes the rate ceiling entirely. This is similar to the architecture pattern used for How to Scrape Crypto Exchange Order Books at Sub-Second Frequency (2026), where the latency and throughput requirements push you toward owning your own data pipeline rather than relying on third-party endpoints.
A GraphQL query for Aave v3 reserve compositions:

```graphql
{
  reserves(first: 50, orderBy: totalLiquidity, orderDirection: desc) {
    id
    symbol
    totalLiquidity
    liquidityRate
    variableBorrowRate
    utilizationRate
    totalATokenSupply
  }
}
```

Run this against https://gateway.thegraph.com/api/{key}/subgraphs/id/{subgraph_id}. The Aave v3 mainnet subgraph ID is GqzP4Xaehti8KSfQmv3ZctFSjnSUYZ4En5NRsiTbvZpz (verify on The Graph Explorer before hardcoding; these IDs migrate).
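A minimal Python sketch for running that query through the gateway; the API key is a placeholder, and the _meta block implements the indexer freshness check discussed below:

```python
# Sketch: POST a GraphQL query to The Graph gateway with httpx.
# GRAPH_KEY is a placeholder; re-verify SUBGRAPH_ID in Graph Explorer.
import httpx

GRAPH_KEY = "your-api-key"
SUBGRAPH_ID = "GqzP4Xaehti8KSfQmv3ZctFSjnSUYZ4En5NRsiTbvZpz"
URL = f"https://gateway.thegraph.com/api/{GRAPH_KEY}/subgraphs/id/{SUBGRAPH_ID}"

QUERY = """
{
  reserves(first: 50, orderBy: totalLiquidity, orderDirection: desc) {
    id symbol totalLiquidity liquidityRate
  }
  _meta { block { number } }  # indexer head block, for freshness checks
}
"""

def fetch_reserves() -> dict:
    r = httpx.post(URL, json={"query": QUERY}, timeout=30)
    r.raise_for_status()
    payload = r.json()
    # GraphQL endpoints return HTTP 200 even on query errors, so check the body
    if "errors" in payload:
        raise RuntimeError(payload["errors"])
    return payload["data"]
```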
## Handling Schema Drift and Protocol-Specific Quirks
DeFi protocols rename fields, deprecate endpoints, and silently break backward compatibility far more often than, say, a government health API (the CDC and WHO data infrastructure tends to be stable for years). Treat every DeFi schema as provisional.
Practical defenses:
- Pin your response parsing to explicit field access (`pool["tvlUsd"]`) and log `KeyError` exceptions to a separate error store; schema breaks then show up as spikes (see the sketch after this list)
- Store raw JSON alongside your normalized rows. When the schema changes, you can backfill from the raw archive without re-fetching
- Version your collection jobs and tag each row with `source_schema_version` so you know which extraction logic produced it
- For subgraph data, always request `_meta { block { number } }` alongside your query; this tells you how far behind the indexer is, which matters for TVL freshness
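A minimal sketch combining the first three defenses. The directory layout, field names, and version tag are illustrative assumptions, not a fixed convention:

```python
# Sketch: raw archive plus normalized rows, with loud schema-break logging.
import json
import logging
import time
from pathlib import Path

SCHEMA_VERSION = "2026-01"  # bump whenever extraction logic changes
RAW_DIR = Path("raw_snapshots")  # hypothetical archive location
schema_errors = logging.getLogger("schema_breaks")

def archive_and_normalize(pools: list[dict]) -> list[dict]:
    # Store the raw payload first, so a schema break never loses data
    RAW_DIR.mkdir(exist_ok=True)
    (RAW_DIR / f"{int(time.time())}.json").write_text(json.dumps(pools))
    rows = []
    for pool in pools:
        try:
            rows.append({
                "pool_id": pool["pool"],      # explicit access: a renamed
                "tvl_usd": pool["tvlUsd"],    # field raises loudly instead
                "apy_base": pool["apyBase"],  # of producing silent nulls
                "source_schema_version": SCHEMA_VERSION,
            })
        except KeyError as exc:
            schema_errors.warning("schema break, missing field: %s", exc)
    return rows
```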
Protocol-specific quirks to know:
- Curve Finance TVL is multi-chain and reported per-gauge, not per-pool. Use the Curve API at `api.curve.finance/api/getPools` rather than DefiLlama for composition detail.
- Morpho Blue vault shares are not ERC-4626 compatible on all chains. Check `asset()` vs `totalAssets()` via direct RPC before computing APY (see the RPC sketch after this list).
- Pendle's yield-splitting creates two token types (PT and YT) per pool. Aggregators often mislabel APY because they conflate the two.
- EigenLayer restaking positions are not reflected in standard TVL because the underlying ETH is already counted by Lido/Rocket Pool. Double-counting is common in dashboards.
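A sketch of that Morpho check using web3.py; the RPC URL and vault address are placeholders, and the two-entry ABI covers only the view functions being probed:

```python
# Sketch: probe whether a vault answers the ERC-4626 view functions before
# trusting share-based APY math. RPC URL and vault address are placeholders.
from web3 import Web3

ERC4626_ABI = [
    {"name": "asset", "type": "function", "stateMutability": "view",
     "inputs": [], "outputs": [{"type": "address"}]},
    {"name": "totalAssets", "type": "function", "stateMutability": "view",
     "inputs": [], "outputs": [{"type": "uint256"}]},
]

w3 = Web3(Web3.HTTPProvider("https://rpc.example.com"))  # placeholder RPC

def is_erc4626(vault_address: str) -> bool:
    vault = w3.eth.contract(
        address=Web3.to_checksum_address(vault_address), abi=ERC4626_ABI
    )
    try:
        vault.functions.asset().call()
        vault.functions.totalAssets().call()
        return True
    except Exception:
        # A revert here means the vault is not ERC-4626 on this chain;
        # fall back to protocol-specific share accounting
        return False
```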
## Scaling and Storage Considerations
A full DefiLlama yield scrape (8,000+ pools) returns roughly 4MB of JSON per run. At 15-minute intervals that is 384MB/day, or about 140GB/year uncompressed. Compress with zstd before storage and you drop that to 15-20GB/year. For time-series queries, TimescaleDB handles this workload well: partition by collected_at and index on (protocol, chain, pool_id).
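A sketch of the compression step, assuming the zstandard package; the ~20x ratio typical for repetitive JSON is what drives the 140GB to 15-20GB/year figure:

```python
# Sketch: zstd-compress one raw snapshot before archiving, and read it back.
import json
import zstandard as zstd

def write_compressed_snapshot(pools: list[dict], path: str) -> None:
    raw = json.dumps(pools).encode("utf-8")
    compressed = zstd.ZstdCompressor(level=9).compress(raw)
    with open(path, "wb") as f:
        f.write(compressed)

def read_compressed_snapshot(path: str) -> list[dict]:
    with open(path, "rb") as f:
        raw = zstd.ZstdDecompressor().decompress(f.read())
    return json.loads(raw)
```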
For workloads where you're also pulling research context alongside financial data (say, matching protocol launches to academic whitepapers), the indexing patterns from scraping OpenAlex research paper metadata at scale apply directly. Both domains involve large, structured JSON APIs with cursor-based pagination and similar freshness requirements.
One thing worth noting: scraping DeFi aggregators is categorically simpler than scraping commercial B2B platforms, where data access is gated and the legal landscape is murkier. If you've worked through the access strategies in scraping ZoomInfo without an account, DeFi APIs will feel refreshingly open; most protocol teams actively want their data consumed and indexed.
Rate limiting is your main operational concern. DefiLlama enforces soft limits around 300 requests/minute. Add exponential backoff with jitter on 429 responses and cap concurrent requests at 10. For subgraph endpoints, a queue with a 100ms inter-request delay stays comfortably inside free-tier limits.
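A minimal sketch of that retry policy; the retry cap and base delay are illustrative choices, not API guarantees:

```python
# Sketch: GET with exponential backoff plus jitter on HTTP 429.
import random
import time
import httpx

def get_with_backoff(url: str, max_retries: int = 5) -> httpx.Response:
    delay = 1.0
    for _ in range(max_retries):
        r = httpx.get(url, timeout=30)
        if r.status_code != 429:
            r.raise_for_status()
            return r
        # Jitter desynchronizes retries across concurrent workers
        time.sleep(delay + random.uniform(0, delay))
        delay *= 2
    raise RuntimeError(f"still rate-limited after {max_retries} retries: {url}")
```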
For teams building AI training corpora that include financial protocol documentation alongside biomedical literature (a growing pattern in multi-domain LLM work), the same async fetching and deduplication architecture from PubMed Central open access scraping works cleanly — the latency profiles are similar and both benefit from async batching with httpx.AsyncClient.
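A compact version of that async batching pattern; the concurrency cap of 10 matches the limit discussed above:

```python
# Sketch: bounded-concurrency async fetching with httpx.AsyncClient.
import asyncio
import httpx

async def fetch_json_batch(urls: list[str], max_concurrent: int = 10) -> list[dict]:
    sem = asyncio.Semaphore(max_concurrent)
    async with httpx.AsyncClient(timeout=30) as client:
        async def fetch(url: str) -> dict:
            async with sem:  # cap in-flight requests
                r = await client.get(url)
                r.raise_for_status()
                return r.json()
        return await asyncio.gather(*(fetch(u) for u in urls))
```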
## Bottom line
Start with DefiLlama for TVL and yield aggregates, add The Graph subgraphs for token-level vault compositions, and go direct to protocol APIs (Curve, Yearn, Morpho) only when the aggregators fall short. Store raw JSON with versioned schemas from day one; you will need to backfill. dataresearchtools.com covers this class of structured public API scraping in depth, including the infrastructure and proxy patterns that keep high-frequency collection pipelines stable at scale.
## Related guides on dataresearchtools.com
- How to Scrape Public Health Data: CDC, WHO, ECDC Sources (2026)
- How to Scrape Crypto Exchange Order Books at Sub-Second Frequency (2026)
- How to Scrape OpenAlex Research Paper Metadata at Scale (2026)
- How to Scrape PubMed Central Open Access Articles for AI Training (2026)
- How to Scrape ZoomInfo Without Account: Public Data Strategies (2026)