How to Scrape ClinicalTrials.gov Public Trial Registry (2026)

ClinicalTrials.gov is one of the most structured public datasets in biomedical research — over 450,000 registered studies, updated daily, and freely accessible without authentication. Scraping it programmatically is legal, well documented, and increasingly common for pharma competitive intelligence, AI training pipelines, and drug pipeline tracking. This guide covers the official API, the bulk download approach, field structure, and what actually breaks at scale.

The ClinicalTrials.gov API v2 (the right starting point)

The registry migrated from the legacy classic API to the v2 REST API in 2023. If you are still hitting clinicaltrials.gov/api/query/ endpoints, you are on deprecated infrastructure that has since been retired. The current base URL is:

https://clinicaltrials.gov/api/v2/studies

The v2 API uses simple query parameters. A minimal Python request:

import httpx

# Recruiting semaglutide trials for type 2 diabetes.
params = {
    "query.cond": "type 2 diabetes",       # condition/disease search
    "query.intr": "semaglutide",           # intervention/treatment search
    "filter.overallStatus": "RECRUITING",
    "pageSize": 100,
    "format": "json",
}

resp = httpx.get("https://clinicaltrials.gov/api/v2/studies", params=params)
resp.raise_for_status()
data = resp.json()
studies = data["studies"]
next_token = data.get("nextPageToken")  # absent on the last page

Pagination is token-based, not offset-based: store nextPageToken from each response and pass it as pageToken in the next call. The API returns up to 1,000 records per page when pageSize=1000 is set, though 100-200 is safer for memory. Each study object is deeply nested — protocolSection, resultsSection, derivedSection, and hasResults are the top-level keys.

Rate limits are not publicly documented, but in practice the v2 API tolerates 3-5 requests per second before you start seeing 429s. Add a 0.3s sleep between calls and you will rarely hit them.
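
Putting pagination and throttling together, a minimal sketch (fetch_all_pages is an illustrative helper name, and the 5s backoff on a 429 is a guess, not a documented retry policy):

import time

import httpx

BASE = "https://clinicaltrials.gov/api/v2/studies"

def fetch_all_pages(params: dict, delay: float = 0.3) -> list[dict]:
    """Follow nextPageToken until the API stops returning one."""
    studies, token = [], None
    while True:
        page_params = dict(params)
        if token:
            page_params["pageToken"] = token
        resp = httpx.get(BASE, params=page_params, timeout=30)
        if resp.status_code == 429:
            time.sleep(5)  # throttled: back off, then retry the same page
            continue
        resp.raise_for_status()
        data = resp.json()
        studies.extend(data["studies"])
        token = data.get("nextPageToken")
        if not token:
            return studies
        time.sleep(delay)  # stay under the informal rate limit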

Bulk download vs. live API: when to use each

For full-corpus jobs — building an AI training set, a competitive intelligence database, or anything requiring all 450K+ studies — the bulk download is faster and more reliable than paginating the API.

| method | best for | update lag | format |
| --- | --- | --- | --- |
| v2 REST API | filtered queries, targeted monitoring, daily deltas | real-time | JSON |
| bulk download (NDJSON) | full corpus, initial load, ML training | 24h | NDJSON |
| bulk download (XML) | legacy pipeline compatibility | 24h | ZIP of XML files |
| web scraper (Playwright) | fields not exposed in the API | real-time | HTML |
For XML, the full corpus is available as a ZIP at https://clinicaltrials.gov/AllPublicXML.zip. For NDJSON, use the API's streaming full export — uncompressed it exceeds 50GB for the full corpus, so pipe it straight into a streaming parser rather than loading it into memory.
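
A sketch of that streaming parse, assuming the export has already been saved to studies.ndjson with one JSON object per line (the filename is illustrative):

import json

def iter_studies(path: str = "studies.ndjson"):
    """Yield one study dict at a time without holding the 50GB+ file in memory."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)

# Example: count recruiting studies in a single pass.
recruiting = sum(
    1
    for s in iter_studies()
    if s["protocolSection"]["statusModule"]["overallStatus"] == "RECRUITING"
)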

If you are working with other public research repositories, the same bulk-vs-API tradeoff applies — see the approach used in How to Scrape OpenAlex Research Paper Metadata at Scale (2026), where the snapshot-file strategy cuts API calls by 90%+.

Key fields and schema gotchas

The v2 schema is richer than the legacy one, but it has landmines.

Commonly useful fields (a flattening sketch follows the list):

  • protocolSection.identificationModule.nctId — the canonical trial ID (NCT number)
  • protocolSection.statusModule.overallStatus — RECRUITING, COMPLETED, TERMINATED, etc.
  • protocolSection.designModule.phases — array, can be null for observational studies
  • protocolSection.armsInterventionsModule.interventions — drug names, doses
  • protocolSection.eligibilityModule.eligibilityCriteria — free text blob, not structured
  • resultsSection.outcomeMeasuresModule — only present if results have been posted
  • derivedSection.interventionBrowseModule.meshes — MeSH terms for the interventions
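
A None-safe flattening sketch for those paths; dig and flatten_study are hypothetical helper names, not part of any client library:

def dig(obj, *keys, default=None):
    """Walk nested dict keys, returning default at the first missing or null level."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj if obj is not None else default

def flatten_study(study: dict) -> dict:
    proto = study.get("protocolSection", {})
    return {
        "nct_id": dig(proto, "identificationModule", "nctId"),
        "status": dig(proto, "statusModule", "overallStatus"),
        "phases": dig(proto, "designModule", "phases", default=[]),
        "interventions": [
            i.get("name")
            for i in dig(proto, "armsInterventionsModule", "interventions", default=[])
        ],
        "eligibility": dig(proto, "eligibilityModule", "eligibilityCriteria"),
        "has_results": study.get("hasResults", False),
    }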

Schema gotchas to handle:

  1. phases is an array that can contain "NA" as a string value, not just null — filter explicitly (see the sketch after this list)
  2. sponsor names live at two levels — leadSponsor and collaborators — so org-level roll-ups need deduplication
  3. eligibilityCriteria is unstructured prose; extracting age ranges and inclusion/exclusion bullets requires an NLP pass or regex, not a direct field read
  4. results data is sparse — only ~30% of completed trials have posted results in the registry
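
A sketch of handling the first two gotchas, assuming the v2 sponsorCollaboratorsModule layout (the helper names are illustrative):

def real_phases(study: dict) -> list[str]:
    """Drop null and the literal string "NA" from the phases array."""
    design = study.get("protocolSection", {}).get("designModule", {})
    return [p for p in (design.get("phases") or []) if p and p != "NA"]

def all_sponsors(study: dict) -> list[str]:
    """Merge leadSponsor and collaborators, deduplicated by name."""
    module = study.get("protocolSection", {}).get("sponsorCollaboratorsModule", {})
    names = [module.get("leadSponsor", {}).get("name")]
    names += [c.get("name") for c in module.get("collaborators", [])]
    return sorted({n for n in names if n})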

If you are building a multi-source biomedical pipeline alongside ClinicalTrials.gov, How to Scrape PubMed Central Open Access Articles for AI Training (2026) covers structured NLP extraction from a similarly messy free-text corpus.

Handling rate limits and IP blocks at scale

The ClinicalTrials.gov API is more permissive than most commercial registries, but large-scale monitoring jobs — checking status changes across 10,000+ trials daily — do eventually trigger throttling or temporary blocks at the IP level.

Practical mitigation:

  • use a User-Agent header that identifies your project (NLM recommends this in their API docs)
  • cache GET responses with ETags — the server returns 304 Not Modified for unchanged studies, cutting bandwidth significantly
  • for full corpus refreshes, prefer the 24h bulk file over live API calls
  • if you need real-time monitoring at scale, rotating residential or datacenter proxies smooth out the request volume; the tradeoffs and provider options for this use case are covered in detail at Proxies for Clinical Trial Monitoring: Track ClinicalTrials.gov at Scale

The ETag caching from the second bullet in practice (the in-memory cache dict is illustrative; persist it for real jobs):

# ETag-based conditional fetch to skip unchanged studies
import httpx

cache = {}  # nct_id -> etag

def fetch_study(nct_id: str) -> dict | None:
    headers = {}
    if nct_id in cache:
        headers["If-None-Match"] = cache[nct_id]

    resp = httpx.get(
        f"https://clinicaltrials.gov/api/v2/studies/{nct_id}",
        headers=headers,
        timeout=10,
    )
    if resp.status_code == 304:
        return None  # no change
    resp.raise_for_status()
    cache[nct_id] = resp.headers.get("ETag", "")
    return resp.json()

For preprint-heavy pipelines where data freshness matters similarly, How to Scrape arXiv Preprint Metadata and PDFs Programmatically (2026) applies the same conditional-fetch pattern against the arXiv OAI-PMH endpoint.

Storage and downstream processing

A full ClinicalTrials.gov corpus in NDJSON is around 15-20GB compressed. Most pipelines benefit from flattening the nested JSON into a columnar format at ingest time.

Recommended stack (a DuckDB example follows the list):

  • DuckDB for local exploration and one-off queries on NDJSON without loading into memory
  • PostgreSQL + JSONB for production if you need indexed search across the eligibilityCriteria text or MeSH term arrays
  • Parquet if the corpus feeds a training pipeline — partition by overallStatus to avoid scanning completed trials in daily delta jobs
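
For the DuckDB route, a sketch assuming the bulk export sits at studies.ndjson (read_json_auto detects newline-delimited JSON and infers the nested structure, so no separate load step is needed):

import duckdb

# Count recruiting trials per phase straight off the raw NDJSON export.
result = duckdb.sql("""
    SELECT
        unnest(protocolSection.designModule.phases) AS phase,
        count(*) AS n
    FROM read_json_auto('studies.ndjson')
    WHERE protocolSection.statusModule.overallStatus = 'RECRUITING'
    GROUP BY phase
    ORDER BY n DESC
""").fetchall()
print(result)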

For competitive intelligence use cases — tracking competitor drug pipeline changes, new study registrations in a therapeutic area, or sponsor activity — a simple change-detection layer on top of daily bulk file diffs is usually enough. Diff by nctId, flag status transitions (e.g. RECRUITING -> TERMINATED), and push alerts to Slack or a dashboard.
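
A sketch of that diff layer, assuming each daily snapshot has been flattened into an {nct_id: status} dict (names illustrative):

def status_transitions(yesterday: dict[str, str], today: dict[str, str]) -> list[tuple]:
    """Return (nct_id, old_status, new_status) for every changed or new trial."""
    changes = []
    for nct_id, new_status in today.items():
        old_status = yesterday.get(nct_id)
        if old_status is None:
            changes.append((nct_id, None, new_status))  # newly registered
        elif old_status != new_status:
            changes.append((nct_id, old_status, new_status))
    return changes

prev = {"NCT00000001": "RECRUITING", "NCT00000002": "COMPLETED"}
curr = {"NCT00000001": "TERMINATED", "NCT00000002": "COMPLETED", "NCT00000003": "RECRUITING"}
print(status_transitions(prev, curr))
# [('NCT00000001', 'RECRUITING', 'TERMINATED'), ('NCT00000003', None, 'RECRUITING')]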

The same pattern scales to adjacent data sources: How to Scrape ProductHunt Launch Data and Maker Profiles (2026) shows how daily delta tracking works for product launches — structurally the same problem.

Bottom line

Use the v2 REST API for targeted queries and delta monitoring; use the bulk NDJSON download for full-corpus work. The schema is well structured by public-registry standards but requires explicit null handling and NLP post-processing for free-text fields. This is one of the most scraper-friendly public datasets in the health space. DRT will continue covering public data infrastructure like this as it evolves.
