How to Scrape ClinicalTrials.gov Public Trial Registry (2026)

ClinicalTrials.gov is one of the most structured public datasets in biomedical research — over 450,000 registered studies, updated daily, and freely accessible without authentication. Scraping it programmatically is legal, well documented, and increasingly common for pharma competitive intelligence, AI training pipelines, and drug pipeline tracking. This guide covers the official API, the bulk download approach, field structure, and what actually breaks at scale.

The ClinicalTrials.gov API v2 (the right starting point)

The registry migrated from the legacy classic API to the v2 REST API in 2023. If you are still hitting clinicaltrials.gov/api/query/ endpoints, you are on deprecated infrastructure that has since been retired. The current base URL is:

https://clinicaltrials.gov/api/v2/studies

The v2 API uses simple query parameters. A minimal Python request:

import httpx

# Recruiting semaglutide trials for type 2 diabetes.
params = {
    "query.cond": "type 2 diabetes",       # condition/disease search
    "query.intr": "semaglutide",           # intervention/treatment search
    "filter.overallStatus": "RECRUITING",
    "pageSize": 100,
    "format": "json",
}

resp = httpx.get("https://clinicaltrials.gov/api/v2/studies", params=params)
resp.raise_for_status()
data = resp.json()
studies = data["studies"]
next_token = data.get("nextPageToken")  # absent on the last page

Pagination is token-based, not offset-based: store nextPageToken from each response and pass it as pageToken in the next call. The API returns up to 1,000 records per page when pageSize=1000 is set, though 100-200 is safer for memory. Each study object is deeply nested — protocolSection, resultsSection, derivedSection, and hasResults are the top-level keys.

Rate limits are not publicly documented, but in practice the v2 API tolerates 3-5 requests per second before you start seeing 429s. Add a 0.3s sleep between calls and you will rarely hit them.
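
Putting pagination and throttling together, a minimal sketch (fetch_all_pages is an illustrative helper name, and the 5s backoff on a 429 is a guess, not a documented retry policy):

import time

import httpx

BASE = "https://clinicaltrials.gov/api/v2/studies"

def fetch_all_pages(params: dict, delay: float = 0.3) -> list[dict]:
    """Follow nextPageToken until the API stops returning one."""
    studies, token = [], None
    while True:
        page_params = dict(params)
        if token:
            page_params["pageToken"] = token
        resp = httpx.get(BASE, params=page_params, timeout=30)
        if resp.status_code == 429:
            time.sleep(5)  # throttled: back off, then retry the same page
            continue
        resp.raise_for_status()
        data = resp.json()
        studies.extend(data["studies"])
        token = data.get("nextPageToken")
        if not token:
            return studies
        time.sleep(delay)  # stay under the informal rate limit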

Bulk download vs. live API: when to use each

For full-corpus jobs — building an AI training set, a competitive intelligence database, or anything requiring all 450K+ studies — the bulk download is faster and more reliable than paginating the API.

| method | best for | update lag | format |
| --- | --- | --- | --- |
| v2 REST API | filtered queries, targeted monitoring, daily deltas | real-time | JSON |
| bulk download (NDJSON) | full corpus, initial load, ML training | 24h | NDJSON |
| bulk download (XML) | legacy pipeline compatibility | 24h | ZIP of XML files |
| web scraper (Playwright) | fields not exposed in the API | real-time | HTML |
For XML, the full corpus is available as a ZIP at https://clinicaltrials.gov/AllPublicXML.zip. For NDJSON, use the API's streaming full export — uncompressed it exceeds 50GB for the full corpus, so pipe it straight into a streaming parser rather than loading it into memory.
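
A sketch of that streaming parse, assuming the export has already been saved to studies.ndjson with one JSON object per line (the filename is illustrative):

import json

def iter_studies(path: str = "studies.ndjson"):
    """Yield one study dict at a time without holding the 50GB+ file in memory."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)

# Example: count recruiting studies in a single pass.
recruiting = sum(
    1
    for s in iter_studies()
    if s["protocolSection"]["statusModule"]["overallStatus"] == "RECRUITING"
)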

If you are working with other public research repositories, the same bulk-vs-API tradeoff applies — see the approach used in How to Scrape OpenAlex Research Paper Metadata at Scale (2026), where the snapshot-file strategy cuts API calls by 90%+.

Key fields and schema gotchas

The v2 schema is richer than the legacy one, but it has landmines.

Commonly useful fields (a flattening sketch follows the list):

  • protocolSection.identificationModule.nctId — the canonical trial ID (NCT number)
  • protocolSection.statusModule.overallStatus — RECRUITING, COMPLETED, TERMINATED, etc.
  • protocolSection.designModule.phases — array, can be null for observational studies
  • protocolSection.armsInterventionsModule.interventions — drug names, doses
  • protocolSection.eligibilityModule.eligibilityCriteria — free text blob, not structured
  • resultsSection.outcomeMeasuresModule — only present if results have been posted
  • derivedSection.interventionBrowseModule.meshes — MeSH terms for the interventions
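
A None-safe flattening sketch for those paths; dig and flatten_study are hypothetical helper names, not part of any client library:

def dig(obj, *keys, default=None):
    """Walk nested dict keys, returning default at the first missing or null level."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj if obj is not None else default

def flatten_study(study: dict) -> dict:
    proto = study.get("protocolSection", {})
    return {
        "nct_id": dig(proto, "identificationModule", "nctId"),
        "status": dig(proto, "statusModule", "overallStatus"),
        "phases": dig(proto, "designModule", "phases", default=[]),
        "interventions": [
            i.get("name")
            for i in dig(proto, "armsInterventionsModule", "interventions", default=[])
        ],
        "eligibility": dig(proto, "eligibilityModule", "eligibilityCriteria"),
        "has_results": study.get("hasResults", False),
    }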

Schema gotchas to handle:

  1. phases is an array that can contain "NA" as a string value, not just null — filter explicitly (see the sketch after this list)
  2. sponsor names live at two levels — leadSponsor and collaborators — so org-level roll-ups need deduplication
  3. eligibilityCriteria is unstructured prose; extracting age ranges and inclusion/exclusion bullets requires an NLP pass or regex, not a direct field read
  4. results data is sparse — only ~30% of completed trials have posted results in the registry
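
A sketch of handling the first two gotchas, assuming the v2 sponsorCollaboratorsModule layout (the helper names are illustrative):

def real_phases(study: dict) -> list[str]:
    """Drop null and the literal string "NA" from the phases array."""
    design = study.get("protocolSection", {}).get("designModule", {})
    return [p for p in (design.get("phases") or []) if p and p != "NA"]

def all_sponsors(study: dict) -> list[str]:
    """Merge leadSponsor and collaborators, deduplicated by name."""
    module = study.get("protocolSection", {}).get("sponsorCollaboratorsModule", {})
    names = [module.get("leadSponsor", {}).get("name")]
    names += [c.get("name") for c in module.get("collaborators", [])]
    return sorted({n for n in names if n})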

If you are building a multi-source biomedical pipeline alongside ClinicalTrials.gov, How to Scrape PubMed Central Open Access Articles for AI Training (2026) covers structured NLP extraction from a similarly messy free-text corpus.

Handling rate limits and IP blocks at scale

The ClinicalTrials.gov API is more permissive than most commercial registries, but large-scale monitoring jobs — checking status changes across 10,000+ trials daily — do eventually trigger throttling or temporary blocks at the IP level.

Practical mitigation:

  • use a User-Agent header that identifies your project (NLM recommends this in their API docs)
  • cache GET responses with ETags — the server returns 304 Not Modified for unchanged studies, cutting bandwidth significantly
  • for full corpus refreshes, prefer the 24h bulk file over live API calls
  • if you need real-time monitoring at scale, rotating residential or datacenter proxies smooth out the request volume; the tradeoffs and provider options for this use case are covered in detail at Proxies for Clinical Trial Monitoring: Track ClinicalTrials.gov at Scale

The ETag caching from the second bullet in practice (the in-memory cache dict is illustrative; persist it for real jobs):

# ETag-based conditional fetch to skip unchanged studies
import httpx

cache = {}  # nct_id -> etag

def fetch_study(nct_id: str) -> dict | None:
    headers = {}
    if nct_id in cache:
        headers["If-None-Match"] = cache[nct_id]

    resp = httpx.get(
        f"https://clinicaltrials.gov/api/v2/studies/{nct_id}",
        headers=headers,
        timeout=10,
    )
    if resp.status_code == 304:
        return None  # no change
    resp.raise_for_status()
    cache[nct_id] = resp.headers.get("ETag", "")
    return resp.json()

For preprint-heavy pipelines where data freshness matters similarly, How to Scrape arXiv Preprint Metadata and PDFs Programmatically (2026) applies the same conditional-fetch pattern against the arXiv OAI-PMH endpoint.

Storage and downstream processing

A full ClinicalTrials.gov corpus in NDJSON is around 15-20GB compressed. Most pipelines benefit from flattening the nested JSON into a columnar format at ingest time.

Recommended stack (a DuckDB example follows the list):

  • DuckDB for local exploration and one-off queries on NDJSON without loading into memory
  • PostgreSQL + JSONB for production if you need indexed search across the eligibilityCriteria text or MeSH term arrays
  • Parquet if the corpus feeds a training pipeline — partition by overallStatus to avoid scanning completed trials in daily delta jobs
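
For the DuckDB route, a sketch assuming the bulk export sits at studies.ndjson (read_json_auto detects newline-delimited JSON and infers the nested structure, so no separate load step is needed):

import duckdb

# Count recruiting trials per phase straight off the raw NDJSON export.
result = duckdb.sql("""
    SELECT
        unnest(protocolSection.designModule.phases) AS phase,
        count(*) AS n
    FROM read_json_auto('studies.ndjson')
    WHERE protocolSection.statusModule.overallStatus = 'RECRUITING'
    GROUP BY phase
    ORDER BY n DESC
""").fetchall()
print(result)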

For competitive intelligence use cases — tracking competitor drug pipeline changes, new study registrations in a therapeutic area, or sponsor activity — a simple change-detection layer on top of daily bulk file diffs is usually enough. Diff by nctId, flag status transitions (e.g. RECRUITING -> TERMINATED), and push alerts to Slack or a dashboard.
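
A sketch of that diff layer, assuming each daily snapshot has been flattened into an {nct_id: status} dict (names illustrative):

def status_transitions(yesterday: dict[str, str], today: dict[str, str]) -> list[tuple]:
    """Return (nct_id, old_status, new_status) for every changed or new trial."""
    changes = []
    for nct_id, new_status in today.items():
        old_status = yesterday.get(nct_id)
        if old_status is None:
            changes.append((nct_id, None, new_status))  # newly registered
        elif old_status != new_status:
            changes.append((nct_id, old_status, new_status))
    return changes

prev = {"NCT00000001": "RECRUITING", "NCT00000002": "COMPLETED"}
curr = {"NCT00000001": "TERMINATED", "NCT00000002": "COMPLETED", "NCT00000003": "RECRUITING"}
print(status_transitions(prev, curr))
# [('NCT00000001', 'RECRUITING', 'TERMINATED'), ('NCT00000003', None, 'RECRUITING')]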

The same pattern scales to adjacent data sources: How to Scrape ProductHunt Launch Data and Maker Profiles (2026) shows how daily delta tracking works for product launches — structurally the same problem.

Bottom line

Use the v2 REST API for targeted queries and delta monitoring; use the bulk NDJSON download for full-corpus work. The schema is well structured by public-registry standards but requires explicit null handling and NLP post-processing for free-text fields. This is one of the most scraper-friendly public datasets in the health space. DRT will continue covering public data infrastructure like this as it evolves.
