Scraping legal records: court dockets, case databases
Scrape court dockets and you tap into one of the most analytically valuable but operationally challenging public-information ecosystems on the web. Federal court records sit behind PACER (the Public Access to Court Electronic Records system), state court records spread across roughly 80 distinct state and county systems with no shared schema, and the public-good aggregators (CourtListener, Justia) try to consolidate everything into a single searchable layer. The scraping landscape is shaped by three things: PACER’s per-page billing model that constrains the economics of comprehensive federal scraping, state court diversity that requires per-jurisdiction adapters, and a strong public-good ecosystem (RECAP, Free Law Project) that handles much of the heavy lifting through federated contributions.
This guide focuses on practical patterns for federal court data via PACER and CourtListener, and on the high-volume state systems that matter most for commercial legal intelligence applications.
PACER and the RECAP archive
PACER is the canonical source for U.S. federal court records covering 94 district courts, 13 circuit courts, and 90+ bankruptcy courts. Access is billed at $0.10 per page, capped at $3.00 for a single document, and fees are waived for any quarter in which an account accrues $30 or less. Written opinions are free to read, but other document downloads are billed.
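To make the billing arithmetic concrete, here is a minimal sketch of a cost estimator. It assumes the published fee rules ($0.10 per page, a $3.00 cap per document, and a waiver when a quarter totals $30 or less) and ignores the items the cap does not cover, such as transcripts and name searches. The function names are illustrative, not part of any PACER tooling.

```python
from decimal import Decimal

PER_PAGE = Decimal("0.10")        # per-page fee
DOC_CAP = Decimal("3.00")         # per-document cap (equivalent to 30 pages)
QUARTERLY_WAIVER = Decimal("30")  # fees waived if the quarter totals $30 or less


def document_cost(pages: int) -> Decimal:
    """Fee for one capped document with the given page count."""
    return min(PER_PAGE * pages, DOC_CAP)


def quarterly_bill(page_counts: list[int]) -> Decimal:
    """Amount owed for a quarter, given the page count of each document fetched."""
    total = sum((document_cost(p) for p in page_counts), Decimal("0"))
    # The waiver is all-or-nothing: above the threshold the full total is owed.
    return Decimal("0") if total <= QUARTERLY_WAIVER else total
```

Running the estimator over a planned crawl list before fetching anything gives a rough upper bound on spend, which is useful when deciding whether RECAP coverage is good enough for a given case set.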
The RECAP project (the name is PACER spelled backwards) is a Free Law Project initiative that builds a free archive of PACER documents through a browser extension that uploads pages users have already paid for. The RECAP archive (accessible through CourtListener) contains tens of millions of federal documents that are freely searchable and downloadable. For most federal docket research, RECAP is the cheapest and easiest entry point.
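import httpx
from typing import Optional

# v4 is the current CourtListener REST API version at the time of writing; v3 is deprecated.
CL_API = "https://www.courtlistener.com/api/rest/v4"

async def search_courtlistener(query: str, court: Optional[str] = None) -> list:
    """Search the RECAP archive and return the first page of results."""
    url = f"{CL_API}/search/"
    params = {"q": query, "type": "r"}  # r = RECAP archive (federal dockets and filings)
    if court:
        params["court"] = court  # CourtListener court ID, e.g. "nysd"
    async with httpx.AsyncClient(timeout=30) as c:
        r = await c.get(url, params=params)
        if r.status_code == 200:
            return r.json().get("results", [])
        # Any non-200 response (including throttling) yields an empty list;
        # log r.status_code in production so failures are not silent.
        return []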
CourtListener exposes a free REST API with reasonable rate limits for non-commercial use. For commercial use, paid tiers unlock higher rate limits and bulk export.
State court systems and their diversity
State court systems are the harder problem. There is no centralized state court records system; each of the 50 states (plus DC) operates its own architecture. Some states (New York, Pennsylvania) have unified state-wide systems. Others (California, Florida) delegate to county-level systems with no unified portal.
The high-volume states for commercial legal intelligence are roughly:
| State | System | Access pattern |
|---|---|---|
| California | Multiple county portals | Per-county scraping |
| Texas | TexasFile (paid), county-level free | Hybrid |
| Florida | Florida Court Clerks (per-county) | Per-county scraping |
| New York | eCourts and NYSCEF | State-wide API and scraping |
| Illinois | Cook County primary | Per-county |
| Pennsylvania | UJS Web Portal | State-wide search |
For a national legal intelligence product, building per-jurisdiction adapters for the top 20-30 states covers roughly 80% of commercially relevant cases. The remaining long tail requires either heroic scraping effort or partnering with a commercial aggregator like UniCourt or Trellis.
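One way to structure per-jurisdiction adapters is a shared interface plus a registry, so the rest of the pipeline never needs to know which court system it is talking to. The sketch below is an assumption about how such a layer could look; `CourtAdapter`, `DocketEntry`, and `DemoAdapter` are hypothetical names, and the demo adapter returns canned data rather than scraping anything.

```python
from dataclasses import dataclass
from typing import Iterator, Protocol


@dataclass(frozen=True)
class DocketEntry:
    court_id: str
    case_number: str
    seq: int
    title: str


class CourtAdapter(Protocol):
    """Interface every jurisdiction-specific scraper is expected to satisfy."""
    jurisdiction: str

    def fetch_docket(self, case_number: str) -> Iterator[DocketEntry]: ...


ADAPTERS: dict[str, CourtAdapter] = {}


def register(adapter: CourtAdapter) -> None:
    ADAPTERS[adapter.jurisdiction] = adapter


class DemoAdapter:
    """Stand-in adapter that yields canned data instead of scraping."""
    jurisdiction = "demo"

    def fetch_docket(self, case_number: str) -> Iterator[DocketEntry]:
        yield DocketEntry("demo-court", case_number, 1, "Complaint")


register(DemoAdapter())


def fetch(jurisdiction: str, case_number: str) -> list[DocketEntry]:
    """Dispatch to the adapter registered for the jurisdiction."""
    if jurisdiction not in ADAPTERS:
        raise KeyError(f"no adapter registered for {jurisdiction!r}")
    return list(ADAPTERS[jurisdiction].fetch_docket(case_number))
```

Each real adapter owns its own HTTP session, parsing rules, and rate limits; only the `DocketEntry` shape is shared.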
Docket parsing and entity extraction
A docket entry includes a sequence number, a filed date, the document title, the filer (attorney or pro se party), and a link to the underlying document if filed electronically. Parsing the structured docket sequence is straightforward; extracting the legal entities (parties, attorneys, law firms, judges) requires named entity recognition tuned to legal text.
import spacy

nlp = spacy.load("en_core_web_lg")  # general-purpose English model

def extract_legal_entities(docket_text: str) -> dict:
    """Group the entities spaCy finds in docket text by label."""
    doc = nlp(docket_text)
    return {
        "persons": [ent.text for ent in doc.ents if ent.label_ == "PERSON"],
        "organizations": [ent.text for ent in doc.ents if ent.label_ == "ORG"],
        "dates": [ent.text for ent in doc.ents if ent.label_ == "DATE"],
    }
For specialized legal NER, the Free Law Project publishes models trained on legal text. These models recognize judges, attorney names, and law firm names with substantially higher accuracy than general-purpose NER models on the same text.
Schema for legal docket snapshots
CREATE TABLE docket_snapshot (
    snapshot_at TIMESTAMP NOT NULL,
    court_id VARCHAR(32) NOT NULL,
    case_number VARCHAR(64) NOT NULL,
    docket_entry_seq INT NOT NULL,
    filed_at TIMESTAMP,
    document_title TEXT,
    filer_name TEXT,
    document_url TEXT,
    PRIMARY KEY (snapshot_at, court_id, case_number, docket_entry_seq)
);

CREATE INDEX docket_case_idx ON docket_snapshot(court_id, case_number);
For analytics on attorney activity or law firm activity, build a derived table that aggregates docket entries per attorney and per firm per week. Filing velocity per attorney reveals workload patterns; filing velocity per firm reveals competitive positioning.
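As a rough illustration of the weekly aggregation, the sketch below counts filings per filer per ISO week in plain Python. It assumes the input is a list of deduplicated docket entries (one dict per entry, not one per snapshot row) with the `filer_name` and `filed_at` fields from the schema above; the function name is hypothetical.

```python
from collections import Counter
from datetime import date


def weekly_filing_velocity(entries: list[dict]) -> Counter:
    """Count filings per (filer_name, ISO year, ISO week)."""
    counts: Counter = Counter()
    for e in entries:
        if not e.get("filer_name") or e.get("filed_at") is None:
            continue  # skip rows that cannot be attributed to a filer and week
        iso = e["filed_at"].isocalendar()  # (ISO year, ISO week, ISO weekday)
        counts[(e["filer_name"], iso[0], iso[1])] += 1
    return counts
```

The same grouping works per firm once attorney names have been resolved to firms; in a warehouse the equivalent is a GROUP BY on the filer and a week-truncated `filed_at`.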
Detecting and routing around bot challenges
When court systems and PACER flag your traffic, the response is usually a Cloudflare or vendor interrogation page rather than a clean HTTP error. Your scraper needs to detect this content swap explicitly. Look for the cf-mitigated response header, cookies whose names begin with __cf_chl_, or HTML containing the text "Just a moment...".
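def is_challenged(response) -> bool:
    """Return True if the response looks like a bot challenge, not real content."""
    # Blocked or temporarily refused outright.
    if response.status_code in (403, 503):
        return True
    # Cloudflare marks mitigated responses with this header.
    if "cf-mitigated" in response.headers:
        return True
    # Challenge cookies are set when an interrogation page is served.
    if "__cf_chl_" in response.headers.get("set-cookie", ""):
        return True
    # Fall back to scanning the start of the body for interstitial text.
    body = response.text[:2000].lower()
    return "just a moment" in body or "checking your browser" in body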
When you detect a challenge, do not retry on the same IP for at least 30 minutes. Mark that IP as cooling and route subsequent requests to a different IP in your pool. Aggressive retries on a flagged IP cause the cooling window to extend and can lead to long-term blacklisting of your subnet. For pages that absolutely must be fetched, have a fallback path that uses a headless browser. Most production setups maintain a 95/5 split between the lightweight HTTP path and the browser fallback path.
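The cooling rule can be sketched as a small pool that remembers when each IP becomes usable again. This is a minimal in-memory illustration, not a production proxy manager; `CoolingPool` and its method names are hypothetical, and the `now` parameter exists only so the behavior can be tested without waiting.

```python
import time
from typing import Optional


class CoolingPool:
    """Track which IPs are cooling after a challenge and pick a usable one."""

    def __init__(self, ips: list[str], cooldown_seconds: float = 30 * 60):
        self.ips = list(ips)
        self.cooldown = cooldown_seconds
        self.cooling_until: dict[str, float] = {}

    def mark_challenged(self, ip: str, now: Optional[float] = None) -> None:
        """Put an IP on cooldown starting from `now`."""
        now = time.time() if now is None else now
        self.cooling_until[ip] = now + self.cooldown

    def pick(self, now: Optional[float] = None) -> Optional[str]:
        """Return the first IP that is not cooling, or None if all are cooling."""
        now = time.time() if now is None else now
        for ip in self.ips:
            if self.cooling_until.get(ip, 0.0) <= now:
                return ip
        return None
```

When `pick` returns None, the caller can either wait or hand the URL to the headless-browser fallback path.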
Operational monitoring and alerting
Every production scraper needs three monitoring layers regardless of vertical. The first is per-IP success rate over a 5-minute window, alerting if any IP drops below 80%. The second is parser error rate, alerting if more than 1% of fetched pages fail to extract the canonical fields. The third is data freshness, alerting if your downstream consumers see snapshots more than 24 hours old.
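import time
from collections import deque

class IPHealthTracker:
    """Rolling per-IP success rate over a fixed time window."""

    def __init__(self, window_seconds: int = 300):
        self.window = window_seconds
        self.events = {}  # ip -> deque of (timestamp, success)

    def _prune(self, bucket: deque, now: float) -> None:
        # Drop events that have aged out of the window.
        while bucket and bucket[0][0] < now - self.window:
            bucket.popleft()

    def record(self, ip: str, success: bool) -> None:
        bucket = self.events.setdefault(ip, deque())
        now = time.time()
        bucket.append((now, success))
        self._prune(bucket, now)

    def success_rate(self, ip: str) -> float:
        bucket = self.events.get(ip)
        if bucket:
            self._prune(bucket, time.time())  # ignore stale events when reading
        if not bucket:
            return 1.0  # no recent traffic: treat as healthy
        return sum(1 for _, ok in bucket if ok) / len(bucket)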
Wire this into Prometheus or your existing observability stack so the on-call engineer sees IP degradation as it happens rather than after the daily snapshot fails.
Pipeline orchestration and scheduling
For any non-trivial legal records scraping operation, a dedicated orchestration layer is the difference between a script you babysit and a service that runs unattended. The two strong open-source choices in 2026 are Prefect 3 and Dagster. Both handle DAG dependencies, retries, observability, secret management, and dynamic fan-out across IPs and sources.
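from prefect import flow, task

def crawl_one_page(source_id: str, page: int):
    """Placeholder: replace with your fetch-and-parse logic for one page."""
    raise NotImplementedError

@task(retries=3, retry_delay_seconds=60)
def fetch_source(source_id: str, page: int):
    return crawl_one_page(source_id, page)

@flow(name="legal-records-daily-sweep")
def daily_sweep(source_ids: list):
    futures = []
    for sid in source_ids:
        for page in range(1, 30):  # pages 1 through 29
            futures.append(fetch_source.submit(sid, page))
    # Block until every submitted task has finished.
    return [f.result() for f in futures]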
Run the flow on a cadence aligned to how dynamic the underlying data is. For legal records where records change intraday, a 4-6 hour cadence catches meaningful movements. For longer-cycle data, daily is sufficient and the cost saving is meaningful.
Data quality monitoring patterns
Beyond per-IP success rate, every snapshot should pass a small battery of data quality checks before being considered authoritative. Structural checks verify that every required field is present and of the expected type. A snapshot row missing the canonical identifier is not a real snapshot. Distributional checks compare the current snapshot against recent history. If today’s snapshot has 30% fewer records than yesterday, something broke either in collection or in the upstream source. Semantic checks compare related fields for consistency.
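def get_yesterday_avg_size() -> float:
    """Placeholder: return the recent average snapshot size from your store."""
    raise NotImplementedError

def quality_check(snapshot: list[dict]) -> list[str]:
    errors = []
    if not snapshot:
        errors.append("empty snapshot")
        return errors
    avg_yesterday = get_yesterday_avg_size()
    # Distributional check: flag a drop of more than 30% against recent history.
    if len(snapshot) < avg_yesterday * 0.7:
        errors.append(
            f"snapshot size {len(snapshot)} is below 70% of baseline {avg_yesterday:.0f}"
        )
    return errors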
Run quality checks as a separate flow that gates promotion of the snapshot from staging to production. A snapshot that fails quality checks should be quarantined for human review, not silently published.
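The structural and semantic checks described above can be sketched per row as follows. The required fields are taken from the docket_snapshot schema; the specific semantic rules (a filing cannot postdate the snapshot that captured it, sequence numbers are not negative) are illustrative assumptions, and both timestamps are assumed to be naive datetimes in the same timezone.

```python
from datetime import datetime

# Required fields and their expected Python types.
REQUIRED = {"court_id": str, "case_number": str, "docket_entry_seq": int}


def structural_errors(row: dict) -> list[str]:
    """Check that every required field is present and of the expected type."""
    errors = []
    for field, expected in REQUIRED.items():
        if not isinstance(row.get(field), expected):
            errors.append(f"{field}: missing or not {expected.__name__}")
    return errors


def semantic_errors(row: dict) -> list[str]:
    """Check related fields against each other for consistency."""
    errors = []
    filed, snap = row.get("filed_at"), row.get("snapshot_at")
    if isinstance(filed, datetime) and isinstance(snap, datetime) and filed > snap:
        errors.append("filed_at is later than snapshot_at")
    seq = row.get("docket_entry_seq")
    if isinstance(seq, int) and seq < 0:
        errors.append("docket_entry_seq is negative")
    return errors
```

A gating flow would run both functions over every row and quarantine the snapshot if the share of rows with errors exceeds a tolerance you choose.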
Cost optimization strategies
Proxy bandwidth is usually the dominant cost in a production scraping operation. Three optimization patterns consistently reduce cost without hurting data quality. The first is request deduplication: serve cached responses when consumers ask for the same record within the same hour. The second is conditional GET using ETag or If-Modified-Since headers when supported. The third is selective field hydration when the upstream API supports field selection.
For workloads above 100 GB of monthly proxy bandwidth, these three optimizations together reduce cost by 40-60% without changing the analytical output. The engineering effort to implement them is modest and the payback period is usually under a month at production volume.
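The first two optimizations can share one small cache. The sketch below is an in-memory illustration under stated assumptions: `ResponseCache` is a hypothetical helper, the one-hour TTL mirrors the deduplication window mentioned above, and the actual HTTP call is left to the caller. Selective field hydration depends on the upstream API and is not shown.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class CachedResponse:
    body: bytes
    etag: Optional[str]
    fetched_at: float


class ResponseCache:
    """In-memory cache keyed by URL (a sketch; not persistent or thread-safe)."""

    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl = ttl_seconds
        self.entries: dict[str, CachedResponse] = {}

    def store(self, url: str, body: bytes, etag: Optional[str],
              now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self.entries[url] = CachedResponse(body, etag, now)

    def fresh_body(self, url: str, now: Optional[float] = None) -> Optional[bytes]:
        """Deduplication: return the cached body if it is younger than the TTL."""
        now = time.time() if now is None else now
        hit = self.entries.get(url)
        if hit and now - hit.fetched_at < self.ttl:
            return hit.body
        return None

    def conditional_headers(self, url: str) -> dict[str, str]:
        """Conditional GET: send the stored ETag so the server can answer 304."""
        hit = self.entries.get(url)
        return {"If-None-Match": hit.etag} if hit and hit.etag else {}

    def revalidated(self, url: str, now: Optional[float] = None) -> bytes:
        """Call after a 304 Not Modified: reset the age and reuse the body."""
        now = time.time() if now is None else now
        hit = self.entries[url]
        hit.fetched_at = now
        return hit.body
```

The intended call order is: try `fresh_body`; on a miss, send the request with `conditional_headers`; on a 304 call `revalidated`, otherwise `store` the new body and ETag.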
End-to-end pipeline architecture
A production-grade scraping pipeline has four layers: collection, parsing, storage, and serving. The collection layer handles the network conversation and knows nothing about data shape. The parsing layer transforms raw bytes into structured records and owns the schema. The storage layer holds the canonical snapshots in a query-optimized format like DuckDB, ClickHouse, or BigQuery. The serving layer exposes the data to consumers and should be denormalized and pre-aggregated where possible.
Decoupling these layers also enables independent scaling. The collection layer is bound by proxy capacity and network bandwidth. The parsing layer is CPU-bound. The storage layer is bound by I/O and disk capacity. The serving layer is bound by query concurrency. Each layer can scale horizontally without coupling to the others.
Legal and compliance considerations
Public legal records data is generally treated as fair to scrape in most jurisdictions, but dockets name parties, attorneys, and judges, so collect only the personal data your use case needs and prefer identifiers, structured attributes, and aggregates where they suffice. Where a system requires an account (as PACER does), access it only with your own credentials and within its terms.
For commercial deployment, document your basis for processing, your data retention period, and your purpose limitation. Most data protection regimes treat scraped public data more favorably when there is a clear lawful basis and the data is not used for direct marketing to identified individuals. The OECD guidance on AI training data sourcing remains a useful starting point for documenting your approach.
Sample analytics queries on the collected dataset
Once your snapshots are landing reliably, the analytics layer is where the value materializes. A few queries that consistently come up across legal records datasets:
-- Volume trend over the last 30 days
SELECT date_trunc('day', snapshot_at) AS day, COUNT(*) AS records
FROM docket_snapshot
WHERE snapshot_at > now() - interval '30 days'
GROUP BY 1 ORDER BY 1;

-- New cases first seen in the last 14 days
SELECT court_id, case_number, MIN(snapshot_at) AS first_seen
FROM docket_snapshot
GROUP BY court_id, case_number
HAVING MIN(snapshot_at) > now() - interval '14 days'
ORDER BY first_seen DESC;

-- Court distribution
SELECT court_id, COUNT(*) AS records
FROM docket_snapshot
WHERE snapshot_at > now() - interval '7 days'
GROUP BY court_id
ORDER BY records DESC;
Add a practice-area share view and a court concentration view and you have a solid foundation for a legal records intelligence product. The collection layer is the prerequisite; the analytics layer is where you create defensible value.
Versioning your scraper for source evolution
Every legal records source evolves its schema regularly. New fields appear, old fields are deprecated, and display logic changes. Stamp every snapshot row with the scraper version that produced it. Downstream analytics can filter by version when they need consistent semantics across a time range, or join across versions when they want long-running trend analysis.
Pair this with a small registry table that documents what each scraper version did differently. When a downstream user asks why a particular metric jumped on a specific date, the version registry usually has the answer. This habit pays for itself dramatically the first time a parser change introduces a subtle metric drift.
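A minimal sketch of version stamping and the registry lookup follows. The version numbers and registry notes are made-up placeholders, and `stamp` and `explain` are hypothetical helper names; in practice the registry would live in a database table alongside the snapshots.

```python
SCRAPER_VERSION = "2.1.0"  # hypothetical current version

# Hypothetical registry: what each scraper version did differently.
VERSION_REGISTRY = {
    "2.0.0": "initial docket parser",
    "2.1.0": "filer_name now taken from the attorney block instead of the entry text",
}


def stamp(rows: list[dict], version: str = SCRAPER_VERSION) -> list[dict]:
    """Return copies of the rows tagged with the scraper version that produced them."""
    return [{**row, "scraper_version": version} for row in rows]


def explain(version: str) -> str:
    """Look up what changed in a given scraper version."""
    return VERSION_REGISTRY.get(version, "no registry entry")
```

Stamping at write time is cheap; reconstructing which parser produced a row after the fact usually is not.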
Caching strategy and incremental crawls
Full daily snapshots scale linearly with source size, which becomes expensive at multi-million record scale. Most production deployments shift from full snapshots to incremental refreshes after the initial ramp. The pattern uses three signals to decide what to refetch on each cycle: freshness deadline, volatility, and business priority. Records that downstream users actually query get higher refresh priority than dormant records that nobody has looked at in months.
Priority-driven scheduling reduces total request volume by 60-80% compared to blind full snapshots, while keeping the data fresh on the records that actually matter to the business.
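One possible way to combine the three signals is a single refresh score per record. The formula and weights below are illustrative assumptions, not a standard: staleness relative to the freshness deadline is scaled up by volatility and business priority, and a record becomes due once its score reaches 1.0. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecordState:
    key: str
    age_hours: float       # time since the record was last fetched
    deadline_hours: float  # freshness deadline for this record
    volatility: float      # 0..1, share of recent fetches that found changes
    priority: float        # 0..1, business priority (for example from query logs)


def refresh_score(r: RecordState) -> float:
    """Staleness relative to the deadline, boosted by volatility and priority."""
    staleness = r.age_hours / r.deadline_hours
    return staleness * (1.0 + r.volatility + r.priority)


def select_for_refresh(records: list[RecordState], budget: int) -> list[str]:
    """Return the keys of the highest-scoring due records, up to the request budget."""
    due = [r for r in records if refresh_score(r) >= 1.0]
    due.sort(key=refresh_score, reverse=True)
    return [r.key for r in due[:budget]]
```

With this shape, a dormant record is refetched only when its deadline passes, while an active, frequently queried case is refetched well before its deadline.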
Building a litigation analytics dashboard
The most common analytical product on top of legal docket scraping is a litigation analytics dashboard that tracks new filings per court per practice area per week, attorney and firm activity rankings, and judge assignment patterns. The dashboard layer sits on top of the snapshot store and pre-computes the most common views for fast serving.
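def attorney_activity(start_date, end_date):
    # `db` is assumed to be an existing database client exposing query(sql, params).
    # The same docket entry appears once per snapshot, so deduplicate before counting.
    return db.query("""
        SELECT filer_name, COUNT(*) AS filing_count
        FROM (
            SELECT DISTINCT court_id, case_number, docket_entry_seq, filer_name
            FROM docket_snapshot
            WHERE filed_at BETWEEN %s AND %s
        ) AS entries
        GROUP BY filer_name
        ORDER BY filing_count DESC
        LIMIT 100
    """, [start_date, end_date])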
For commercial legal intelligence products, the headline metrics are practice-area-specific filing trends (patent litigation in the Eastern District of Texas, product liability in California, antitrust in the Southern District of New York). Each of these views informs both law firm business development and litigation finance investment decisions.
Layer in win-rate analytics by attorney, by firm, and by judge for the deepest commercial intelligence. Win rates require parsing case outcomes from disposition orders, which is itself a non-trivial NLP problem. The Free Law Project publishes models that classify dispositions with reasonable accuracy.
International legal scraping notes
Outside the U.S., court access varies dramatically. The UK courts publish judgments through BAILII (free and comprehensive). The Court of Justice of the European Union (CJEU) publishes through CURIA (free, structured). Common-law countries like Australia and Canada have similar judgment-publication frameworks. Civil-law countries vary widely; some publish judgments comprehensively, others restrict access to parties.
For multi-jurisdictional legal research, the dataset shape is judgment-and-opinion-centric rather than docket-centric because the docket-tracking culture is much stronger in the U.S. than elsewhere. Build separate adapters for the docket-tracking jurisdictions and the judgment-publishing jurisdictions, with a shared canonical schema for cross-jurisdictional queries.
The personal data dimension is more constrained outside the U.S. Under the EU GDPR, the names of private individuals appearing in court records are personal data, so collecting and republishing them needs a lawful basis. The CJEU’s right-to-be-forgotten decisions interact with court-record scraping in nuanced ways. For commercial deployment of EU-facing legal scraping products, specialized counsel is essential.
Common pitfalls when scraping court dockets
Three issues recur across PACER and state court scrapers. The first is sealed-document leakage. PACER serves sealed documents with a ‘sealed’ flag in the metadata but the underlying file is sometimes still downloadable due to clerk error. Hard-code a metadata check before any download and treat any sealed-flag positive as a non-fetch, even if the URL resolves.
The second is docket-entry numbering inconsistency. Federal courts number docket entries sequentially within a case but renumber after consolidation or transfer. State courts use their own conventions. A scraper that joins on docket entry number across consolidations loses the chronology. Use the docket entry timestamp as the secondary sort key.
The third is OCR-quality drift on scanned filings. Older filings (pre-2010 in most districts) are scanned PDFs with variable OCR quality. A keyword search for case-relevant terms misses 5-15% of older documents because the OCR layer dropped or misread the term. Run a second-pass OCR (Textract, Google Vision) on critical historical documents and store both OCR layers for cross-validation.
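Two of these guards are small enough to sketch directly. The metadata field names `is_sealed` and `restricted` are assumptions for illustration; use whatever your source actually exposes. The keyword helper assumes you have stored the text of each OCR pass as a separate string.

```python
def should_download(meta: dict) -> bool:
    """Refuse any document whose metadata marks it sealed or restricted.

    `is_sealed` and `restricted` are hypothetical field names.
    """
    return not (meta.get("is_sealed") or meta.get("restricted"))


def keyword_hit(term: str, ocr_layers: list[str]) -> bool:
    """True if any stored OCR layer contains the term (case-insensitive)."""
    needle = term.casefold()
    return any(needle in layer.casefold() for layer in ocr_layers)
```

Call `should_download` before the fetch, not after, so a sealed file never reaches disk even when its URL resolves. Searching across both OCR layers recovers terms that only one engine read correctly.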
FAQ
Is scraping court dockets legal?
Court records are public records and are explicitly intended to be accessible. PACER access requires registration and billing; state court access varies. The legal scraping question is generally not whether you can access the data but whether your access pattern complies with the system’s terms and rate limits.
What about sealed cases and protective orders?
Sealed cases and documents under protective orders are not public and should never be scraped. The access systems generally enforce this through their access controls; if you somehow encounter sealed material, do not store or process it.
Can I scrape attorney names and use them for marketing?
Attorney names are public information published in court dockets. Using them for general legal-vertical analytics is fair. Using them for direct marketing to attorneys is regulated by state bar rules and CAN-SPAM equivalents; consult specialized counsel before building marketing lists from docket data.
How do I handle the PACER per-page billing?
For comprehensive federal coverage, the RECAP archive via CourtListener is dramatically cheaper than direct PACER access. For real-time coverage of specific cases, direct PACER access is needed. Budget roughly $1,000-3,000 per month per analyst-equivalent for active federal docket monitoring.
What about international court systems?
UK court records have varying public-access levels through the Royal Courts of Justice and county courts. EU court records also vary widely. The patterns transfer with per-jurisdiction adapters but the legal landscape (especially around personal data in court records) is more restrictive than in the U.S.
Are PACER fees still applicable in 2026?
Yes. Fees are waived for any quarter in which an account accrues $30 or less. The RECAP archive mirrors documents that users have already purchased, which reduces repeat downloads of the same document.
How do I monitor a specific case for new filings?
Poll the docket index endpoint hourly during business hours and emit a row for each new docket entry detected. Diff against the previous snapshot rather than re-scraping the full case.
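The diff step can be as simple as a set comparison on a stable key. This sketch assumes each entry is a dict with the `docket_entry_seq` and `filed_at` fields from the schema above; the function name is hypothetical. Because entry numbers can change after consolidation or transfer, a renumbered entry will show up as new under this key, so treat the output as candidates to review in those cases.

```python
def new_entries(previous: list[dict], current: list[dict]) -> list[dict]:
    """Return entries in `current` whose (seq, filed_at) key is absent from `previous`."""
    def key(entry: dict) -> tuple:
        return (entry["docket_entry_seq"], entry.get("filed_at"))

    seen = {key(e) for e in previous}
    return [e for e in current if key(e) not in seen]
```

Each returned entry becomes one emitted row or alert; the current list then replaces the previous one for the next poll.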