How to Scrape Public University Course Catalogs at Scale (2026)

Public university course catalogs are among the richest untapped sources for education market research, edtech product development, and workforce analytics — but scraping them at scale requires navigating a patchwork of static HTML, JavaScript-rendered SPAs, rate limits, and the occasional CAPTCHA gate. This guide covers the tools, patterns, and gotchas for scraping public university course catalogs reliably in 2026.

Why Course Catalog Data Is Worth the Effort

Course catalogs expose structured data that’s hard to get anywhere else: program names, credit requirements, course descriptions, prerequisites, instructors, section availability, and tuition breakdowns. Edtech companies use it to map curriculum gaps. Employers scrape it to forecast talent pipelines. Researchers use it to track how fast AI-adjacent courses are proliferating across institutions.

The data isn’t behind a login wall at most public universities (public institutions in the US are generally subject to open-records norms), but it’s also not available via a clean API. You’re almost always scraping rendered HTML or a poorly documented REST endpoint meant for the university’s own frontend.

Understanding the Target: Three Catalog Architectures

Before writing a single line of scraper code, audit the target. University course systems fall into three broad types:

Architecture                  | Common Vendors                        | Scraping Approach
------------------------------|---------------------------------------|------------------------------
Static HTML / server-rendered | Banner, PeopleSoft legacy             | requests + BeautifulSoup
JavaScript SPA                | Courseleaf, Kuali, modern Banner      | Playwright or Puppeteer
Unofficial JSON API           | many modern Banner installs, Workday  | direct HTTP to API endpoints

The fastest wins come from the third category. Inspect DevTools network traffic before assuming you need a headless browser. Many universities running modern Banner expose endpoints like /StudentRegistrationSsb/ssb/courseSearch/get_courses?term=202501&subject=CS that return paginated JSON directly. A simple requests loop will outperform Playwright by 10x on these targets.
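As a minimal sketch, assuming a Banner-style install (the host below is a placeholder, and some installs require a term-selection request first to set a session cookie before this endpoint responds):

import requests

# Placeholder host; the path follows the Banner pattern quoted above, but the
# exact parameters and response shape vary between installs.
url = "https://banner.example.edu/StudentRegistrationSsb/ssb/courseSearch/get_courses"

resp = requests.get(
    url,
    params={"term": "202501", "subject": "CS", "pageOffset": 0, "pageMaxSize": 50},
    timeout=30,
)
resp.raise_for_status()
payload = resp.json()
print(payload.get("totalCount"), "courses reported for CS")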

For JavaScript SPAs (Courseleaf is common at flagship state schools), you need a real browser. Playwright with page.wait_for_selector('.course-list') is the safest pattern.

Scraping at Scale: Concurrency, Proxies, and Rate Limits

Scraping 400+ universities means running dozens of concurrent scrapers against independent domains. A few practical realities:

  • Most university web servers are under-resourced. Hitting a single domain faster than 2-3 requests per second will either get you throttled or degrade service for real students.
  • IP-level rate limiting is common. A single residential or datacenter IP will get soft-blocked after a few hundred requests on many campuses.
  • Some universities (particularly those running Cloudflare-protected portals) will challenge headless browsers directly.

For multi-institution crawls, rotating US mobile or residential proxies produce the fewest blocks. If you’re collecting data for an edtech product or academic research project, the Education Data Proxy: Collect University Course Data at Scale guide covers proxy selection and rotation strategy specifically for education domains.

A basic Playwright scraper with rotating proxy support looks like this:

from playwright.async_api import async_playwright
import asyncio

PROXY = {
    "server": "http://your-proxy-host:port",
    "username": "user",
    "password": "pass",
}

async def scrape_catalog(url: str):
    async with async_playwright() as p:
        # Route the entire browser session through the rotating proxy.
        browser = await p.chromium.launch(proxy=PROXY)
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        # Wait for the SPA to render the course list before querying it.
        await page.wait_for_selector(".course-block")
        courses = await page.query_selector_all(".course-block")
        results = []
        for course in courses:
            # eval_on_selector runs the JS expression against the first <h3> in the block.
            title = await course.eval_on_selector("h3", "el => el.textContent")
            results.append(title.strip())
        await browser.close()
        return results

# Example invocation (placeholder URL):
# asyncio.run(scrape_catalog("https://catalog.example.edu/courses"))

For bulk runs, wrap this in a task queue (Celery + Redis or simple asyncio semaphores) and limit concurrency per domain to 2-3 workers max.
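A minimal per-domain throttle with asyncio semaphores, assuming the scrape_catalog function above, could look like this sketch:

import asyncio
from collections import defaultdict
from urllib.parse import urlparse

# One semaphore per university domain, capped at 2 concurrent tasks each, so
# different domains run in parallel but no single campus gets hammered.
domain_limits = defaultdict(lambda: asyncio.Semaphore(2))

async def scrape_politely(url: str):
    domain = urlparse(url).netloc
    async with domain_limits[domain]:
        return await scrape_catalog(url)

async def crawl(urls: list[str]):
    return await asyncio.gather(*(scrape_politely(u) for u in urls))

# Example: asyncio.run(crawl(["https://catalog.example.edu/courses"]))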

Handling Pagination, Filters, and Term Cycles

Course catalogs paginate aggressively. Courseleaf-based systems often return 20-25 results per page and require explicit page query params or POST bodies with filter state. Banner JSON endpoints use pageOffset and pageMaxSize parameters.

A numbered approach for a complete crawl (a sketch follows the list):

  1. Fetch the list of subjects or departments first (usually a dropdown or endpoint like /get_subject?term=202501)
  2. Iterate each subject with a loop, incrementing pageOffset by pageMaxSize until the returned totalCount is exhausted
  3. Store raw JSON responses before parsing, so a schema change doesn’t force a full re-crawl
  4. Schedule re-crawls at semester boundaries (typically late October for spring, late March for fall) since catalogs publish early and update through add/drop
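A sketch of that loop against a Banner-style JSON API (the host, endpoint names, and field names are placeholders following the pattern described earlier; adjust them to what DevTools shows for your target):

import json
import pathlib
import requests

BASE = "https://banner.example.edu/StudentRegistrationSsb/ssb"  # placeholder host
RAW_DIR = pathlib.Path("raw")  # raw responses are stored before any parsing
PAGE_SIZE = 50

session = requests.Session()

def crawl_term(term: str):
    # Step 1: fetch the subject list.
    subjects = session.get(
        f"{BASE}/courseSearch/get_subject", params={"term": term}, timeout=30
    ).json()
    for subj in subjects:
        code = subj["code"]
        offset, total = 0, None
        # Step 2: page through each subject until totalCount is exhausted.
        while total is None or offset < total:
            resp = session.get(
                f"{BASE}/courseSearch/get_courses",
                params={"term": term, "subject": code,
                        "pageOffset": offset, "pageMaxSize": PAGE_SIZE},
                timeout=30,
            )
            resp.raise_for_status()
            payload = resp.json()
            total = payload.get("totalCount", 0)
            # Guard against the silent empty-term failure mode described below.
            if total == 0:
                print(f"warning: {code} returned totalCount=0 for term {term}")
                break
            # Step 3: persist raw JSON keyed by term/subject/offset before parsing.
            out = RAW_DIR / term / code
            out.mkdir(parents=True, exist_ok=True)
            (out / f"{offset}.json").write_text(json.dumps(payload))
            offset += PAGE_SIZE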

Term handling is where scrapers break silently. A catalog will return an empty result set for an expired term with no error. Always validate that totalCount > 0 before treating a subject page as scraped.

Legal and Ethical Considerations

Public university course catalogs are government-published educational records. Scraping them for research, analytics, or commercial edtech products sits in a very different legal position than scraping private platforms. There’s no ToS login gate to agree to, and the content is created with public funds.

That said, you’re still subject to CFAA “unauthorized access” framing if you circumvent a technical access control. Don’t bypass CAPTCHAs on portals that explicitly restrict automated access in a posted ToS. Don’t hammer infrastructure to the point of service disruption.

The legal landscape for public data scraping is better understood now than it was three years ago. For comparison, the posture around scraping other government systems — like FDA databases or court records — is covered in detail at How to Scrape FDA Drug Approval Database Programmatically (2026) and How to Scrape Court Records and PACER Documents Legally (2026). The principles carry over: public data is generally fair game, infrastructure abuse is not.

If you’re building a cross-sector education pipeline, How to Scrape K-12 School District Data and Test Scores (2026) covers the lower-ed equivalent, where state department portals have their own quirks. If you’re pulling health-adjacent program data (nursing, public health, pharmacy), the overlap with How to Scrape Public Health Data: CDC, WHO, ECDC Sources (2026) is worth reading for context on government data freshness and caching patterns.

Storage and Normalization

Raw catalog data from 400 universities is messy. Course codes follow no standard (“CSCI 101” vs “CS-101” vs “CS 1010”). Credit hours use different scales. Prerequisites are plain text strings at most schools.

Practical storage recommendations:

  • Store raw scraped JSON/HTML in object storage (S3 or B2) keyed by institution_id/term/subject/page
  • Normalize into a relational schema with tables: institutions, terms, subjects, courses, sections
  • Use fuzzy matching (rapidfuzz, Levenshtein) to deduplicate course titles across institutions (see the sketch after this list)
  • Tag courses with a taxonomy (CIP codes work well) using a small classifier or Claude API batch calls
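A minimal fuzzy-dedup sketch with rapidfuzz; the 85 threshold is a starting point to tune, not a fixed rule:

from rapidfuzz import fuzz

# Toy example: group near-duplicate course titles with token_sort_ratio,
# which ignores word order when comparing strings.
titles = [
    "Introduction to Computer Science",
    "Intro to Computer Science",
    "Principles of Microeconomics",
]

clusters: list[list[str]] = []
for title in titles:
    for cluster in clusters:
        # Compare against the cluster's first title.
        if fuzz.token_sort_ratio(title, cluster[0]) >= 85:
            cluster.append(title)
            break
    else:
        clusters.append([title])

# The first two titles land in one cluster at this threshold; the third stays separate.
print(clusters)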

A DuckDB + Parquet stack handles 5-10 million course records comfortably on a single machine for analysis. Only move to Postgres or BigQuery when you’re serving live queries to multiple users.
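For the analysis layer, a DuckDB query straight over Parquet files might look like this (the file layout and column names are placeholders, not a fixed schema):

import duckdb

con = duckdb.connect()
# Count AI-adjacent courses per institution directly from Parquet, no server needed.
df = con.execute("""
    SELECT institution_id, COUNT(*) AS ai_courses
    FROM read_parquet('courses/*.parquet')
    WHERE title ILIKE '%machine learning%'
       OR title ILIKE '%artificial intelligence%'
    GROUP BY institution_id
    ORDER BY ai_courses DESC
    LIMIT 20
""").df()
print(df)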

Bottom Line

University course catalog scraping rewards teams who audit the target architecture first, keep concurrency low per domain, and invest in clean normalization on the back end. Start with JSON API targets, reserve Playwright for SPA-heavy systems, and rotate residential US proxies to avoid soft blocks. DRT covers the full education data collection stack from K-12 through graduate programs, so bookmark this publication if you’re building anything serious in this space.
