If you need to scrape court records at scale, the good news is that US federal court documents are public record — the bad news is that PACER will charge you $0.10 per page if you do it naively. Knowing which sources to hit, in what order, and with what tooling is the difference between a $12 research bill and a $1,200 one. This guide covers the legal framework, the three main data sources, a working Python setup, and the proxy strategy you’ll need to avoid throttling.
What “legal” actually means for court record scraping
Public record doctrine in the US means federal court filings are legally accessible to anyone. The Electronic Public Access (EPA) program was created specifically to enable this. What the law does not protect you from is PACER’s ToS, rate limits, and per-page fees — or a court clerk deciding your IP is hammering their system.
The key rule: you cannot scrape non-public court systems (sealed records, juvenile records, state systems with access controls). Federal PACER documents, state court portals with open public access, and aggregators like CourtListener are all fair game for automated collection. Always check the specific court system’s robots.txt and ToS before pulling. Courts running Tyler Technologies’ Odyssey platform, for example, have varying access policies depending on the state, and some explicitly prohibit automated access. This same due-diligence mindset applies when you’re working with other government datasets — similar nuances show up when you scrape public health data from CDC, WHO, and ECDC sources.
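Before pointing a crawler at any portal, a quick programmatic check of its robots.txt costs nothing. Here is a minimal sketch using Python's standard-library `urllib.robotparser`; the portal URL and user agent string are placeholders, not real endpoints:

```python
from urllib.robotparser import RobotFileParser

# Placeholder URL: substitute the actual court portal you are targeting.
PORTAL = "https://courts.example.gov"

rp = RobotFileParser()
rp.set_url(f"{PORTAL}/robots.txt")
rp.read()

# can_fetch() evaluates the robots.txt rules for your user agent string.
if rp.can_fetch("legal-data-bot/1.0", f"{PORTAL}/search"):
    print("robots.txt permits automated access to /search")
else:
    print("robots.txt disallows this path; check the ToS before proceeding")
```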
PACER vs CourtListener vs Justia: which source to use
The three main sources for federal court data each have a different cost and coverage profile:
| Source | Cost | API Available | Coverage | Rate Limit | Data Format |
|---|---|---|---|---|---|
| PACER | $0.10/page (waived under $30/quarter) | No native REST API | All federal courts | Aggressive throttle | PDF, HTML dockets |
| CourtListener | Free | Yes (REST + bulk) | PACER + state appellate | 5,000 req/day (free tier) | JSON, PDF |
| Justia | Free | No (HTML only) | Federal + some state | Informal (no stated limit) | HTML |
CourtListener, run by the non-profit Free Law Project, is the default choice for most engineering teams. It mirrors a substantial portion of PACER, offers a documented REST API, and has a bulk data download option for case metadata going back decades. PACER direct is unavoidable only when you need documents that haven’t been mirrored yet — newly filed cases in the last 24-72 hours, or obscure district courts with low filing volume.
Justia works as a fallback for case summaries and citations, but the HTML structure changes without notice and they have no official API contract. It is not a reliable production source.
Setting up CourtListener API access
Registration is free at courtlistener.com. After confirming your email, generate a token from the API settings page. The base endpoint is https://www.courtlistener.com/api/rest/v3/.
```python
import requests

TOKEN = "your_token_here"
BASE = "https://www.courtlistener.com/api/rest/v3"

headers = {"Authorization": f"Token {TOKEN}"}

# Search dockets: Supreme Court cases filed in 2025, newest first.
params = {
    "court": "scotus",
    "filed_after": "2025-01-01",
    "order_by": "-date_filed",
    "page_size": 20,
}

resp = requests.get(f"{BASE}/dockets/", headers=headers, params=params)
resp.raise_for_status()
data = resp.json()

for docket in data["results"]:
    print(docket["case_name"], docket["date_filed"], docket["absolute_url"])
```

The dockets endpoint returns case metadata. To pull actual documents, follow the clusters and opinions endpoints from each docket; PDFs are linked directly and can be downloaded with the same auth header. The free tier caps at 5,000 requests per day — for bulk historical pulls, use their bulk data downloads (case metadata CSVs are available by year and court, and they consume no API quota).
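The usual chain is docket → clusters → opinions. Here is a minimal sketch of that traversal; the `clusters`, `sub_opinions`, `download_url`, and `id` field names reflect my reading of the v3 API schema, so verify them against the browsable API before building on this:

```python
import os

import requests

TOKEN = "your_token_here"
headers = {"Authorization": f"Token {TOKEN}"}

def download_docket_pdfs(docket: dict, out_dir: str = "pdfs") -> None:
    """Follow a docket's clusters to its opinions and save linked PDFs.

    The clusters/sub_opinions/download_url field names are assumptions
    based on the v3 API schema; confirm them in the browsable API.
    """
    os.makedirs(out_dir, exist_ok=True)
    for cluster_url in docket.get("clusters", []):
        cluster = requests.get(cluster_url, headers=headers).json()
        for opinion_url in cluster.get("sub_opinions", []):
            opinion = requests.get(opinion_url, headers=headers).json()
            pdf_url = opinion.get("download_url")
            if not pdf_url:
                continue  # not every opinion has a mirrored document
            pdf = requests.get(pdf_url, headers=headers)
            pdf.raise_for_status()
            path = os.path.join(out_dir, f"opinion_{opinion['id']}.pdf")
            with open(path, "wb") as f:
                f.write(pdf.content)
```

Pass any docket dict from the listing loop above.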
For PACER direct access, the community-maintained juriscraper library (Python) handles login, session cookies, and docket parsing for most district courts. It is not officially supported by PACER and breaks occasionally when courts update their login flows, so pin your version and test on deploy.
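For illustration, a docket pull with juriscraper looks roughly like the sketch below, based on the library's documented `PacerSession` and `DocketReport` classes. The keys in `report.data` come from the library's examples and may shift between versions — another reason to pin:

```python
from juriscraper.pacer import DocketReport, PacerSession

# Credentials for a real PACER account; every query here can incur fees.
session = PacerSession(username="your_pacer_user", password="your_pacer_pass")
session.login()

# "cand" is the PACER court ID for the Northern District of California.
report = DocketReport("cand", session)
report.query("186730")  # PACER-internal case ID (illustrative value)

docket = report.data  # parsed dict: case metadata, parties, docket entries
for entry in docket["docket_entries"]:
    print(entry["document_number"], entry["description"])
```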
Step-by-step workflow for a production pipeline
- Start with CourtListener bulk CSVs to build your case index (free, no rate limit pressure).
- Filter to cases you need by court, date range, or party name using pandas or DuckDB.
- For each case requiring full documents, check CourtListener’s opinion/document endpoints first — if the document is mirrored, download it there.
- For documents not in CourtListener, queue them for PACER retrieval. Use a PACER account with billing monitoring; the $30/quarter fee waiver applies per account, so some teams maintain multiple accounts for cost splitting.
- Parse PDFs with `pdfplumber` for structured extraction (tables, case metadata) or `pypdf` for raw text, as sketched after this list. For scanned PDFs, `pytesseract` with `pdf2image` adds OCR — expect 2-4 seconds per page.
- Store extracted metadata in Postgres with a `jsonb` column for variable fields; store raw PDFs in S3 or equivalent object storage.
- Deduplicate on PACER document ID or CourtListener cluster ID before inserting.
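A minimal extraction sketch combining these libraries; the 100-character threshold and the function name are my own heuristics, not anything the libraries prescribe:

```python
import pdfplumber
import pytesseract
from pdf2image import convert_from_path

def extract_text(pdf_path: str) -> str:
    """Try native text extraction first; fall back to OCR for scans."""
    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() returns None on pages with no text layer.
        text = "\n".join(page.extract_text() or "" for page in pdf.pages)

    if len(text.strip()) > 100:  # heuristic: a real text layer is present
        return text

    # OCR path for image-scanned PDFs: render pages, then run Tesseract.
    # Budget roughly 2-4 seconds per page.
    images = convert_from_path(pdf_path)
    return "\n".join(pytesseract.image_to_string(image) for image in images)
```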
The workflow for court records maps cleanly to other government data pipelines. If you’re also pulling regulatory approval data, the approach covered in how to scrape the FDA drug approval database programmatically shares the same PDF-heavy extraction pattern.
Proxy and IP strategy
Court websites are generally not Cloudflare-protected, but they do throttle by IP. PACER in particular will lock out IPs that fire more than roughly 30 requests per minute sustained. CourtListener’s free tier rate limit is enforced by API token, not IP, so proxies are less critical there.
Where proxies matter most is scraping state court portals (Odyssey, Tyler’s eCourt, state-run HTML systems) that have no API. For those, residential proxies with US state-level targeting give you the most reliable session continuity. The full proxy selection framework for legal database scraping is covered in the pillar guide on scraping court records and legal databases with proxies.
A few practical rules for PACER IP management (a minimal throttling sketch follows the list):
- Rotate IPs between docket requests, not between page fetches within a session (PACER uses session cookies)
- Use a minimum 2-second delay between requests per IP
- Avoid scraping between 9am and 5pm Eastern — court system load is highest then and throttling is more aggressive
- Log HTTP 503 and 429 responses separately; 503 usually means the court’s server is down, not that you’re blocked
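A minimal sketch applying those rules with a plain `requests` session; the delay constant and logger names are illustrative choices, not anything PACER specifies:

```python
import logging
import time

import requests

MIN_DELAY = 2.0  # seconds between requests per IP, per the rules above

throttle_log = logging.getLogger("pacer.throttled")  # 429: you are rate limited
outage_log = logging.getLogger("pacer.outage")       # 503: their server is down

def polite_get(session: requests.Session, url: str) -> requests.Response | None:
    """Fetch one URL with a fixed delay, logging 429 and 503 separately."""
    time.sleep(MIN_DELAY)
    resp = session.get(url)
    if resp.status_code == 429:
        throttle_log.warning("rate limited on %s, back off this IP", url)
        return None
    if resp.status_code == 503:
        outage_log.info("court server unavailable for %s, retry later", url)
        return None
    resp.raise_for_status()
    return resp
```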
Common data quality pitfalls
- Party name inconsistency: “Smith v. Jones” and “Jones v. Smith” may be the same case in different courts. Normalize on PACER case ID.
- PDF version drift: PACER serves both native PDFs and image-scanned PDFs for older cases. Check for extractable text before assuming `pdfplumber` will work.
- Docket entry gaps: CourtListener’s mirror is not real-time. Entries can lag 24-72 hours for lower-traffic courts.
- Jurisdiction confusion: federal district courts, circuit courts, and bankruptcy courts each have separate PACER systems. A case filed in SDNY is not the same system as the Second Circuit.
The data quality challenges here are structurally similar to other public datasets. Scraping K-12 school district data involves the same identifier normalization problems across state and local government sources. And if you’re building multi-source financial intelligence pipelines, scraping crypto exchange order books is a useful contrast — high-frequency, structured, API-first — compared to the low-frequency, PDF-heavy court record world.
Bottom line
Start with CourtListener’s free API and bulk downloads for anything historical — PACER direct access is only worth the cost for fresh filings or documents not yet mirrored. Use `juriscraper` for PACER automation, `pdfplumber` for extraction, and residential US proxies only for state court portals where session continuity matters. DRT covers this and adjacent government data verticals in depth, so bookmark the site if you’re building legal or regulatory data pipelines.
Related guides on dataresearchtools.com
- How to Scrape K-12 School District Data and Test Scores (2026)
- How to Scrape FDA Drug Approval Database Programmatically (2026)
- How to Scrape Public Health Data: CDC, WHO, ECDC Sources (2026)
- How to Scrape Crypto Exchange Order Books at Sub-Second Frequency (2026)
- Pillar: Scraping Court Records and Legal Databases with Proxies