How to Scrape World Bank Open Data API in 2026

The World Bank Open Data API is one of the most valuable free data sources for macroeconomic research, development finance analysis, and cross-country comparisons — but extracting it at scale without hitting pagination traps or undocumented rate limits takes more than a simple requests.get(). This guide covers the full pipeline: endpoints, pagination, rate limit behavior, and production-grade scraping patterns for 2026.

What the World Bank API Actually Gives You

The World Bank exposes its data through a REST API at api.worldbank.org/v2/. The v2 endpoint covers over 16,000 indicators across 200+ economies, time series going back to 1960, and metadata on countries, lending types, income groups, and topics.

The API supports JSON and XML responses. JSON is cleaner to parse. Key endpoints:

GET /v2/country/{countryCode}/indicator/{indicatorCode} — time series for one indicator, one country
GET /v2/indicator — full indicator catalog
GET /v2/country — country metadata with income and region classification
GET /v2/sources — data source catalog (World Development Indicators, Doing Business, etc.)

Response format uses a two-element array: index 0 is pagination metadata, index 1 is the data array. Always check index 0 first.

Pagination and Rate Limits

The API defaults to 50 records per page. For indicators with long time series across many countries, this creates hundreds of pages. Use per_page=1000 to reduce round trips (the max is 32767 but responses above 1000 rows slow down noticeably).

import requests
import time

BASE = "https://api.worldbank.org/v2"

def fetch_indicator(indicator: str, country: str = "all", per_page: int = 1000):
    page = 1
    rows = []
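    while True:
        params = {
            "format": "json",
            "per_page": per_page,
            "page": page,
            "mrv": 20,  # most recent 20 values (20 years for annual series)
        }
        r = requests.get(
            f"{BASE}/country/{country}/indicator/{indicator}",
            params=params,
            timeout=30,  # don't hang forever on a stalled connection
        )
        r.raise_for_status()
        payload = r.json()
        if len(payload) < 2:
            # Errors (e.g. an unknown indicator) arrive as a one-element list.
            raise ValueError(f"World Bank API error: {payload}")
        meta, data = payload
        if not data:
            break
        rows.extend(data)
        if page >= meta["pages"]:
            break
        page += 1
        time.sleep(0.3)  # no published rate limit; 0.3s has been safe in practice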
    return rows
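
To get a feel for what per_page buys you, the page count is just the row total divided by the page size, rounded up. The sketch below uses 11842 rows, the same illustrative total that appears in the sample metadata later in this guide; pages_needed is a hypothetical helper name, not something the API provides.

```python
import math


def pages_needed(total_rows: int, per_page: int) -> int:
    """Number of requests needed to page through total_rows records."""
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    return math.ceil(total_rows / per_page)


total = 11842  # illustrative row count, not a live figure
print(pages_needed(total, 50))    # requests at the default page size
print(pages_needed(total, 1000))  # requests with per_page=1000
```

With these numbers the default page size needs 237 requests and per_page=1000 needs 12, which at 0.3s per request is the difference between over a minute of sleeping and a few seconds.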

The World Bank does not publish a formal rate limit, but aggressive parallel scraping (10+ concurrent threads) will trigger HTTP 429s or silent connection resets. In practice, one request every 200-300ms per IP works reliably. If you are bulk-downloading the full WDI catalog (16K indicators × 200 countries), route requests through a proxy rotation layer — the same approach covered in How to Scrape Twitter/X Data Without API Limits Using Proxies.
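
If you do run more than one worker, one way to keep the combined rate at roughly one request every 300ms is a shared throttle that every thread calls before sending. This is a minimal sketch, not a World Bank recommendation: the Throttle class, its 0.3s default, and the injectable clock and sleep arguments (there to make it testable) are all assumptions of this example.

```python
import threading
import time


class Throttle:
    """Enforce a minimum interval between calls, shared across threads."""

    def __init__(self, min_interval=0.3, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_allowed = 0.0  # earliest time the next request may start

    def wait(self) -> float:
        """Block until the next request slot is free; return seconds slept."""
        with self._lock:
            now = self._clock()
            delay = max(0.0, self._next_allowed - now)
            # Reserve the following slot before releasing the lock.
            self._next_allowed = max(now, self._next_allowed) + self.min_interval
        if delay:
            self._sleep(delay)
        return delay
```

Create one Throttle per process and call throttle.wait() immediately before each requests.get(). Because slots are reserved under the lock, ten threads still produce one request per interval rather than ten at once.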

Indicator Selection and Data Quality Traps

Not all 16,000 indicators are equally populated. Many have coverage gaps for smaller economies or years before 1990. Before building a pipeline around an indicator, check its metadata:

GET /v2/indicator/NY.GDP.MKTP.CD?format=json

Look at sourceNote and sourceOrganization to understand methodology, and always check for null values in value fields — the API returns null for missing years rather than omitting them.
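
Metadata tells you about methodology, but not how sparse the series actually is. A quick way to check is to fetch the indicator and measure the share of non-null values per country. The sketch below assumes each row carries countryiso3code and value keys, which is what the v2 JSON rows look like in practice; coverage_by_country is a hypothetical helper, and the sample rows are placeholders, not real data.

```python
from collections import defaultdict


def coverage_by_country(rows):
    """Return {country_code: fraction of rows with a non-null value}."""
    seen = defaultdict(int)
    filled = defaultdict(int)
    for row in rows:
        code = row["countryiso3code"]
        seen[code] += 1
        if row["value"] is not None:
            filled[code] += 1
    return {code: filled[code] / seen[code] for code in seen}


# Placeholder rows shaped like API output (values are made up).
sample = [
    {"countryiso3code": "AAA", "date": "2021", "value": None},
    {"countryiso3code": "AAA", "date": "2020", "value": 1.0},
    {"countryiso3code": "BBB", "date": "2021", "value": 2.0},
]
print(coverage_by_country(sample))
```

Feed it the output of fetch_indicator() and drop or flag any country below a coverage threshold that makes sense for your analysis.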

Common quality issues to handle:

Null values: expected, not an error. Filter on the consumer side.
Revision lag: WDI data for the most recent year is often preliminary and gets revised 6-12 months later. If you cache aggressively, you will serve stale numbers.
Country code mismatches: World Bank 3-letter codes match ISO 3166-1 alpha-3 for most economies (e.g., CHN, USA, KOR), but not all. XKX (Kosovo) and CHI (Channel Islands), for example, are not official ISO codes. Keep a mapping table.
Aggregate regions: codes like WLD, EAP, SSA are regional aggregates, not countries. Decide upfront whether to include them.
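
The null, country-code, and aggregate issues can all be handled in one parse-time pass. This is a sketch under a few assumptions: that entries from /v2/country mark aggregates with a region value of "Aggregates", that data rows carry countryiso3code, date, and value keys, and that you supply your own code_map. The function names and the placeholder codes in the example are hypothetical.

```python
def aggregate_codes(countries):
    """Collect codes of entries that the country metadata marks as aggregates."""
    return {
        c["id"]
        for c in countries
        if (c.get("region") or {}).get("value") == "Aggregates"
    }


def clean_rows(rows, aggregates, code_map=None):
    """Drop null values and aggregates, and remap country codes."""
    code_map = code_map or {}
    cleaned = []
    for row in rows:
        code = row["countryiso3code"]
        if row["value"] is None or code in aggregates:
            continue
        cleaned.append(
            {
                "country": code_map.get(code, code),  # fall back to the original code
                "date": row["date"],
                "value": row["value"],
            }
        )
    return cleaned


# Placeholder metadata and rows (not real API output).
countries = [
    {"id": "WLD", "region": {"value": "Aggregates"}},
    {"id": "AAA", "region": {"value": "Some Region"}},
]
rows = [
    {"countryiso3code": "WLD", "date": "2020", "value": 1.0},
    {"countryiso3code": "AAA", "date": "2021", "value": None},
    {"countryiso3code": "AAA", "date": "2020", "value": 2.0},
    {"countryiso3code": "OLD", "date": "2020", "value": 3.0},
]
print(clean_rows(rows, aggregate_codes(countries), {"OLD": "NEW"}))
```

Deriving the aggregate set from metadata rather than hard-coding WLD, EAP, SSA and friends means new aggregates are excluded automatically when the catalog changes.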

For comparison, the How to Scrape EU Open Data Portal at Scale (2026) article covers similar null-handling patterns for statistical APIs that return sparse datasets.

Bulk Download vs. API: Which to Use

For full-catalog extraction, the World Bank also publishes bulk CSV/Excel exports via their data catalog. The API is better for targeted, parameterized queries; the bulk download is better when you need everything.

Approach	Best For	Pagination	Rate Limit Risk	Freshness
REST API (`/v2/`)	Targeted indicators, filtered by year/country	Yes (handle manually)	Moderate	Near real-time
Bulk CSV (databank.worldbank.org)	Full WDI catalog, offline pipelines	No	None	Monthly updates
OData endpoint (legacy)	BI tools expecting OData	Yes	Low	Matches API
WDI Data Archive ZIP	Historical snapshots	No	None	Annual

For production pipelines that need fresh data on a schedule, use the REST API with caching (Redis or filesystem). For one-time research datasets or training data, pull the bulk CSV and parse locally. This mirrors the architecture described in How to Scrape data.gov US Federal Data Sources (2026), where the same bulk-vs-API tradeoff applies.
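
For the filesystem option, a cache only needs a payload plus the time it was fetched. The sketch below is one way to do that with the standard library; FileCache, its one-week default, and the key-to-filename scheme are assumptions of this example, and the now arguments exist only so expiry can be tested without waiting.

```python
import json
import time
from pathlib import Path


class FileCache:
    """Tiny JSON file cache with a maximum age per entry."""

    def __init__(self, root, max_age_seconds=7 * 24 * 3600):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.max_age = max_age_seconds

    def _path(self, key):
        # Indicator codes contain dots but no slashes; guard anyway.
        return self.root / f"{key.replace('/', '_')}.json"

    def get(self, key, now=None):
        """Return the cached payload, or None if missing or expired."""
        path = self._path(key)
        if not path.exists():
            return None
        entry = json.loads(path.read_text(encoding="utf-8"))
        now = time.time() if now is None else now
        if now - entry["fetched_at"] > self.max_age:
            return None
        return entry["payload"]

    def put(self, key, payload, now=None):
        """Store payload along with the time it was fetched."""
        now = time.time() if now is None else now
        entry = {"fetched_at": now, "payload": payload}
        self._path(key).write_text(json.dumps(entry), encoding="utf-8")
```

A scheduled job then becomes: call cache.get(indicator), and only hit the API (followed by cache.put) when it returns None.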

Structuring a Production Pipeline

A reliable World Bank scraper needs three components: a fetch layer, a deduplication cache, and a refresh scheduler.

Setup steps for a minimal production pipeline:

1. Pull the full indicator list once (GET /v2/indicator?per_page=20000&format=json) and store it locally. This is your scraping manifest.
2. For each indicator in scope, check your cache for the last fetch timestamp. Skip if under your refresh interval (weekly is usually enough for WDI).
3. Fetch with exponential backoff on 429 or 5xx. The World Bank API occasionally returns 503 during peak hours (UTC 14:00-16:00 is the worst window in practice).
4. Store raw JSON responses before parsing. Schema changes in government APIs are rare but happen, and raw storage lets you reparse without re-fetching.
5. Normalize country codes to ISO 3166-1 alpha-3 at parse time, not query time.
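
The backoff and raw-storage steps can be sketched as below. Everything here is an assumption of the example rather than a prescribed design: the function names, the 1s base delay and 60s cap, the jitter, and the page-NNNN.json file layout. The send argument is any zero-argument callable returning (status_code, body_text), which keeps the retry logic independent of the HTTP library.

```python
import random
import time
from pathlib import Path


def backoff_delays(retries=5, base=1.0, cap=60.0):
    """Exponential delays: base, 2*base, 4*base, ... capped at cap seconds."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


def is_retryable(status):
    """Retry on 429 and on any 5xx, as described in the steps above."""
    return status == 429 or 500 <= status < 600


def fetch_with_backoff(send, retries=5, sleep=time.sleep, jitter=random.random):
    """Call send() until it returns a non-retryable status or retries run out."""
    delays = backoff_delays(retries)
    status = None
    for attempt in range(retries + 1):
        status, body = send()
        if not is_retryable(status):
            return status, body
        if attempt < retries:
            # Up to 1s of jitter so parallel workers don't retry in lockstep.
            sleep(delays[attempt] + jitter())
    raise RuntimeError(f"gave up after {retries} retries, last status {status}")


def store_raw(root, indicator, page, body):
    """Write the untouched response body to disk before any parsing."""
    path = Path(root) / indicator / f"page-{page:04d}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(body, encoding="utf-8")
    return path
```

With requests, send would be a small function that performs the GET and returns (r.status_code, r.text). Call store_raw on the returned body first, then parse from the stored text.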

For similar pipeline patterns applied to OECD’s SDMX-based API (which is more complex), see How to Scrape OECD Open Data in 2026. If your project requires multi-source government data aggregation across regions, How to Scrape Australian Bureau of Statistics Data in 2026 covers a complementary Asia-Pacific source with its own quirks around SDMX-JSON formatting.

Handling the Two-Element JSON Response

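The World Bank’s response wrapper trips up engineers on first integration. For a successful query, the top-level JSON is a list with two elements, not a dict. Error responses, such as an unknown indicator code, come back as a one-element list holding a message object instead: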

response = r.json()
# response[0] = {"page": 1, "pages": 12, "per_page": 1000, "total": 11842}
# response[1] = [{...}, {...}, ...]  # actual data rows

meta = response[0]
data = response[1] or []  # can be null if no results

If the query returns zero results (valid indicator, no data for the requested country/year range), response[1] will be null, not an empty list. Always coerce with or [].
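
A small helper keeps this unwrapping in one place. This is a sketch, assuming error payloads look like a one-element list whose only item has a "message" key; parse_envelope and WorldBankError are hypothetical names introduced here.

```python
class WorldBankError(Exception):
    """Raised when the payload is an error message or has an unexpected shape."""


def parse_envelope(payload):
    """Split a decoded response into (meta, rows), coercing null rows to []."""
    if not isinstance(payload, list) or not payload:
        raise WorldBankError(f"unexpected payload: {payload!r}")
    meta = payload[0]
    if isinstance(meta, dict) and "message" in meta:
        # Error responses carry a message object instead of pagination metadata.
        raise WorldBankError(str(meta["message"]))
    data = payload[1] if len(payload) > 1 else None
    return meta, data or []


# Zero-result query: index 1 is null, so rows comes back as an empty list.
meta, rows = parse_envelope([{"page": 1, "pages": 1, "per_page": 1000, "total": 0}, None])
print(meta["total"], rows)
```

Calling parse_envelope(r.json()) in place of direct indexing means a bad indicator code raises a readable error instead of an IndexError deep in the pipeline.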

Bottom Line

The World Bank Open Data API is stable, well-documented, and genuinely free with no authentication required — making it one of the easier government data sources to scrape at scale compared to the auth-walled alternatives. Use the REST API for fresh targeted pulls, fall back to bulk CSV for full-catalog snapshots, and add a proxy rotation layer if you are running high-frequency parallel jobs. DRT covers this full stack of government and institutional data sources in depth, so if the World Bank is one node in a larger data collection pipeline, the surrounding guides here will close the gaps.