Hacker News front page data is some of the most signal-rich content on the internet, and scraping it without hitting API limits is easier than most engineers assume. The official Firebase API caps you at 500 requests per 10 seconds per IP, which sounds generous until you’re pulling full item trees, tracking score velocity, or running a sentiment pipeline across 30 days of top stories. Here’s how to do it right in 2026.
Why the Official API Falls Short
The Algolia HN Search API and the Firebase endpoint (https://hacker-news.firebaseio.com/v0) are the two official options. Algolia is excellent for historical queries but reflects indexed data, not live scores. Firebase is real-time but chatty: fetching a single front-page story takes one call for the item ID list, one call for the story itself, then recursively one call per comment. A 30-comment story costs 31 requests minimum, not counting the ID-list fetch.
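You can see the fan-out in a few lines. A minimal sketch assuming the `requests` library, with error handling and backoff omitted:

```python
import requests

BASE = "https://hacker-news.firebaseio.com/v0"

def fetch_tree(item_id: int, counter: list) -> dict:
    """Fetch one item and recurse into its 'kids' (comment IDs)."""
    counter[0] += 1
    item = requests.get(f"{BASE}/item/{item_id}.json", timeout=10).json()
    if item is None:          # deleted items come back as JSON null
        return {}
    item["children"] = [fetch_tree(kid, counter) for kid in item.get("kids", [])]
    return item

counter = [0]
top_ids = requests.get(f"{BASE}/topstories.json", timeout=10).json()  # +1 request
story = fetch_tree(top_ids[0], counter)
print(f"{counter[0]} requests for one story's full comment tree")
```

Run against a typical front-page story, the counter lands in the dozens to hundreds; multiply by 30 stories on a 5-minute schedule and the rate limit stops looking generous.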
At scale this is a problem. Rate limits aside, Firebase enforces connection throttling during peak hours (roughly 06:00–10:00 UTC, when the US morning crowd votes). Engineers building trend detectors or competitive intelligence tools for tech launches — similar to what’s needed when you scrape ProductHunt launch data and maker profiles — run into this wall fast.
The practical alternative is direct HTML scraping of news.ycombinator.com, combined with smart caching and IP rotation.
Scraping the Front Page HTML
HN’s HTML is clean, minimal, and remarkably stable. The front page renders as a single HTML table: each story is a `<tr class="athing">` row carrying the rank, title, and outbound link, followed by a subtext row with the score, author, age, and comment count.
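Here's a minimal parse of those rows, assuming `requests` and BeautifulSoup. The selectors (`tr.athing`, `span.titleline`, `td.subtext`) match HN's markup as of this writing, but verify them against the live page before depending on them:

```python
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

resp = requests.get(
    "https://news.ycombinator.com/",
    headers={"User-Agent": "hn-research-bot/0.1 (contact: you@example.com)"},  # placeholder UA
    timeout=10,
)
soup = BeautifulSoup(resp.text, "html.parser")

rows = []
for thing in soup.select("tr.athing"):
    title_link = thing.select_one("span.titleline > a")
    subtext = thing.find_next_sibling("tr").select_one("td.subtext")
    score = subtext.select_one("span.score")    # absent on job posts
    author = subtext.select_one("a.hnuser")
    rows.append({
        "item_id": int(thing["id"]),
        "rank": int(thing.select_one("span.rank").text.rstrip(".")),
        "title": title_link.text,
        "domain": urlparse(title_link["href"]).netloc,  # empty for Ask/Show HN self posts
        "score": int(score.text.split()[0]) if score else 0,
        "author": author.text if author else None,
    })
```

Parsing is the easy part. The real decision is how you fetch the HTML: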
| Method | Cost | Block Risk | Data Freshness |
|---|---|---|---|
| Firebase API | Free | Low | Real-time |
| HTML scrape, single IP | Free | High at scale | Real-time |
| HTML scrape + datacenter proxies | ~$0.50/GB | Medium | Real-time |
| HTML scrape + residential proxies | ~$3-8/GB | Low | Real-time |
| Algolia Search API | Free | Very low | 1-5 min lag |
For most research workloads, Algolia covers 80% of use cases. For live score tracking or front-page monitoring at sub-minute resolution, residential rotation is the right call. The same proxy infrastructure that handles HN works well when you scrape public university course catalogs at scale, where content freshness matters less but volume matters more.
Avoid free proxy lists. They’re flagged by HN’s reverse-DNS checks within minutes.
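If you do rotate, the mechanics are simple. A sketch with `requests`, where the gateway URLs are placeholders for whatever your provider issues:

```python
import itertools

import requests

# Hypothetical rotating-gateway endpoints; substitute your provider's.
PROXIES = itertools.cycle([
    "http://user:pass@gw1.example-residential.net:8000",
    "http://user:pass@gw2.example-residential.net:8000",
])

def fetch_front_page() -> str:
    """Fetch the front page through the next proxy in the pool."""
    proxy = next(PROXIES)
    resp = requests.get(
        "https://news.ycombinator.com/",
        proxies={"http": proxy, "https": proxy},
        headers={"User-Agent": "hn-research-bot/0.1 (contact: you@example.com)"},
        timeout=15,
    )
    resp.raise_for_status()
    return resp.text
```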
What Data to Collect and Store
A full HN front-page row contains more than just title and URL. The fields worth capturing:
- `item_id`: the numeric HN ID, permanent and stable
- `rank`: position 1-30 at time of scrape
- `score`: current upvotes
- `comments_count`: the `descendants` field in Firebase, or parsed from the "N comments" link
- `author`: the submitter's username
- `domain`: extracted from the URL (useful for source-frequency analysis)
- `scraped_at`: UTC timestamp with second precision
Score velocity (score delta over time) is the high-value derived field. Store snapshots every 5 minutes and you can compute it trivially. Stories that jump from 50 to 200 points in 15 minutes are almost always front-page bound — that’s the signal that makes HN scraping worth doing over just reading the site.
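With 5-minute snapshots in Postgres (the `hn_snapshots` schema is defined in the next section), a window function gets you velocity directly. A sketch assuming `psycopg2`:

```python
import psycopg2

# LAG() compares each snapshot with the previous one for the same item_id,
# so score_delta is the points gained between consecutive snapshots.
VELOCITY_SQL = """
SELECT item_id, scraped_at,
       score - LAG(score) OVER w AS score_delta,
       EXTRACT(EPOCH FROM scraped_at - LAG(scraped_at) OVER w) / 60 AS minutes_elapsed
FROM hn_snapshots
WHERE scraped_at > now() - interval '1 hour'
WINDOW w AS (PARTITION BY item_id ORDER BY scraped_at)
ORDER BY score_delta DESC NULLS LAST
LIMIT 10;
"""

with psycopg2.connect("dbname=hn") as conn, conn.cursor() as cur:
    cur.execute(VELOCITY_SQL)
    for item_id, ts, delta, minutes in cur.fetchall():
        print(item_id, ts, delta, minutes)
```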
For researchers doing academic or technical content monitoring, this velocity signal pairs well with metadata pipelines like those used to scrape arXiv preprint metadata and PDFs programmatically, where tracking what gets shared on HN vs. what gets submitted to arXiv reveals real knowledge diffusion patterns.
Storage, Dedup, and Query Design
Write to Postgres with a composite primary key on (item_id, scraped_at), where scraped_at is rounded to your snapshot interval. This makes upserts idempotent and lets you re-run scrapers without duplicate rows.
Numbered steps for setting up a minimal snapshot pipeline (a runnable sketch follows the list):

1. Create table `hn_snapshots (item_id bigint, rank int, score int, comments int, domain text, scraped_at timestamptz, PRIMARY KEY (item_id, scraped_at))`
2. Add an index on `(scraped_at DESC, score DESC)` for time-range queries
3. Add an index on `domain` if you're doing source-frequency analysis
4. Schedule the scraper every 5 minutes via cron or a simple scheduler like APScheduler
5. Archive rows older than 90 days to a cheaper store (Parquet on S3 or TimescaleDB compression)
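Here's a minimal sketch of steps 1-3 plus the idempotent write, assuming `psycopg2`. Bucketing `scraped_at` to the 5-minute interval is what makes re-runs collide with existing rows instead of duplicating them:

```python
from datetime import datetime, timezone

import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS hn_snapshots (
    item_id    bigint,
    rank       int,
    score      int,
    comments   int,
    domain     text,
    scraped_at timestamptz,
    PRIMARY KEY (item_id, scraped_at)
);
CREATE INDEX IF NOT EXISTS hn_snap_time   ON hn_snapshots (scraped_at DESC, score DESC);
CREATE INDEX IF NOT EXISTS hn_snap_domain ON hn_snapshots (domain);
"""

UPSERT = """
INSERT INTO hn_snapshots (item_id, rank, score, comments, domain, scraped_at)
VALUES (%(item_id)s, %(rank)s, %(score)s, %(comments)s, %(domain)s, %(scraped_at)s)
ON CONFLICT (item_id, scraped_at) DO UPDATE
SET rank = EXCLUDED.rank, score = EXCLUDED.score, comments = EXCLUDED.comments;
"""

def bucket(ts: datetime, minutes: int = 5) -> datetime:
    """Round a timestamp down to the snapshot interval."""
    return ts.replace(minute=ts.minute - ts.minute % minutes, second=0, microsecond=0)

def write_snapshot(conn, rows: list[dict]) -> None:
    """Upsert one front-page snapshot; safe to re-run within the same bucket."""
    ts = bucket(datetime.now(timezone.utc))
    with conn.cursor() as cur:
        cur.executemany(UPSERT, [{**r, "scraped_at": ts} for r in rows])
    conn.commit()
```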
TimescaleDB is the cleanest option if you’re already running Postgres — its hypertable chunking makes the time-range queries fast without manual partitioning. For K-12 or education researchers who scrape K-12 school district data and test scores and need to correlate with community engagement trends on forums like HN, this same schema pattern works across datasets.
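If you go the TimescaleDB route, the conversion and policies below replace manual partitioning and the archive cron job. The function names are from TimescaleDB 2.x; note that `add_retention_policy` drops old chunks outright, so export to Parquet first if you need the archive:

```python
import psycopg2

TIMESCALE_SETUP = """
-- Convert the existing table; migrate_data handles rows already present.
SELECT create_hypertable('hn_snapshots', 'scraped_at',
                         if_not_exists => TRUE, migrate_data => TRUE);
-- Compress chunks older than a week, segmented by story for better ratios.
ALTER TABLE hn_snapshots SET (timescaledb.compress,
                              timescaledb.compress_segmentby = 'item_id');
SELECT add_compression_policy('hn_snapshots', INTERVAL '7 days');
-- Drops chunks past 90 days; only enable once the export path exists.
SELECT add_retention_policy('hn_snapshots', INTERVAL '90 days');
"""

with psycopg2.connect("dbname=hn") as conn, conn.cursor() as cur:
    cur.execute(TIMESCALE_SETUP)
    conn.commit()
```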
For a complete breakdown of every HN data field and pagination across pages 2-30 (where you catch stories that have fallen off the front), the DRT pillar guide on how to scrape Hacker News data covers the full item tree structure and comment threading logic.
Staying Within Reasonable Limits
HN is a small operation run by Y Combinator. There’s no published scraping policy, but the community expectation is that you don’t hammer the site. Practical guidelines, implemented in the sketch after this list:
- Keep request rate below 1 per 5 seconds from any single IP
- Cache aggressively: front page changes slowly, no need to re-fetch on every pipeline stage
- Use Algolia for any historical backfill (it’s designed for it)
- Set a real User-Agent string with contact info if you’re running a production research tool
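A sketch tying those guidelines together, assuming `requests`; the User-Agent string and contact address are placeholders:

```python
import hashlib
import time

import requests

SESSION = requests.Session()
SESSION.headers["User-Agent"] = "hn-research-bot/0.1 (contact: you@example.com)"

_last_fetch = 0.0
_last_hash = None

def polite_fetch(url: str, min_interval: float = 5.0) -> str | None:
    """Fetch url, keeping at least min_interval seconds between requests.
    Returns None when the body is byte-identical to the previous fetch,
    so downstream parsing and writes can be skipped."""
    global _last_fetch, _last_hash
    wait = min_interval - (time.monotonic() - _last_fetch)
    if wait > 0:
        time.sleep(wait)
    resp = SESSION.get(url, timeout=10)
    _last_fetch = time.monotonic()
    resp.raise_for_status()
    digest = hashlib.sha256(resp.content).hexdigest()
    if digest == _last_hash:
        return None  # page unchanged since last fetch
    _last_hash = digest
    return resp.text
```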
The Firebase API is the polite default for low-volume work. Direct HTML scraping is appropriate when you need sub-minute freshness, custom field extraction, or volume that exceeds Firebase’s effective throttle.
Bottom Line
For most engineers, a hybrid setup works best: Algolia for history and search, direct HTML scraping with a residential proxy pool for live front-page monitoring, and Postgres or TimescaleDB for snapshots. The Firebase API is fine for prototyping but breaks down under production load. dataresearchtools.com covers the full stack of scraping patterns like this — from single-page extractions to distributed pipelines — so check the related guides if you’re building something more complex than a cron job.
Related guides on dataresearchtools.com
- How to Scrape arXiv Preprint Metadata and PDFs Programmatically (2026)
- How to Scrape ProductHunt Launch Data and Maker Profiles (2026)
- How to Scrape Public University Course Catalogs at Scale (2026)
- How to Scrape K-12 School District Data and Test Scores (2026)
- Pillar: How to Scrape Hacker News Data