Hacker News front page data is some of the most signal-rich content on the internet, and scraping it without hitting API limits is easier than most engineers assume. The official Firebase API caps you at 500 requests per 10 seconds per IP, which sounds generous until you’re pulling full item trees, tracking score velocity, or running a sentiment pipeline across 30 days of top stories. Here’s how to do it right in 2026.
Why the Official API Falls Short
The Algolia HN Search API and the Firebase endpoint (https://hacker-news.firebaseio.com/v0) are the two official options. Algolia is excellent for historical queries but reflects indexed data, not live scores. Firebase is real-time but chatty: fetching a single front-page story takes one call for the item ID list, one call for the story itself, then recursively one call per comment. A 30-comment story costs 31 requests minimum, not counting the ID-list fetch.
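You can see the fan-out in a few lines. A minimal sketch assuming the `requests` library, with error handling and backoff omitted:

```python
import requests

BASE = "https://hacker-news.firebaseio.com/v0"

def fetch_tree(item_id: int, counter: list) -> dict:
    """Fetch one item and recurse into its 'kids' (comment IDs)."""
    counter[0] += 1
    item = requests.get(f"{BASE}/item/{item_id}.json", timeout=10).json()
    if item is None:          # deleted items come back as JSON null
        return {}
    item["children"] = [fetch_tree(kid, counter) for kid in item.get("kids", [])]
    return item

counter = [0]
top_ids = requests.get(f"{BASE}/topstories.json", timeout=10).json()  # +1 request
story = fetch_tree(top_ids[0], counter)
print(f"{counter[0]} requests for one story's full comment tree")
```

Run against a typical front-page story, the counter lands in the dozens to hundreds; multiply by 30 stories on a 5-minute schedule and the rate limit stops looking generous.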
At scale this is a problem. Rate limits aside, Firebase enforces connection throttling during peak hours (roughly 06:00–10:00 UTC, when the US morning crowd votes). Engineers building trend detectors or competitive intelligence tools for tech launches — similar to what’s needed when you scrape ProductHunt launch data and maker profiles — run into this wall fast.
The practical alternative is direct HTML scraping of news.ycombinator.com, combined with smart caching and IP rotation.
Scraping the Front Page HTML
HN’s HTML is clean, minimal, and remarkably stable. The front page renders as a single HTML table: each story is a `<tr class="athing">` row carrying the rank, title, and outbound link, followed by a subtext row with the score, author, age, and comment count.
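Here's a minimal parse of those rows, assuming `requests` and BeautifulSoup. The selectors (`tr.athing`, `span.titleline`, `td.subtext`) match HN's markup as of this writing, but verify them against the live page before depending on them:

```python
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

resp = requests.get(
    "https://news.ycombinator.com/",
    headers={"User-Agent": "hn-research-bot/0.1 (contact: you@example.com)"},  # placeholder UA
    timeout=10,
)
soup = BeautifulSoup(resp.text, "html.parser")

rows = []
for thing in soup.select("tr.athing"):
    title_link = thing.select_one("span.titleline > a")
    subtext = thing.find_next_sibling("tr").select_one("td.subtext")
    score = subtext.select_one("span.score")    # absent on job posts
    author = subtext.select_one("a.hnuser")
    rows.append({
        "item_id": int(thing["id"]),
        "rank": int(thing.select_one("span.rank").text.rstrip(".")),
        "title": title_link.text,
        "domain": urlparse(title_link["href"]).netloc,  # empty for Ask/Show HN self posts
        "score": int(score.text.split()[0]) if score else 0,
        "author": author.text if author else None,
    })
```

Parsing is the easy part. The real decision is how you fetch the HTML: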
| Method | Cost | Block Risk | Data Freshness |
|---|---|---|---|
| Firebase API | Free | Low | Real-time |
| HTML scrape, single IP | Free | High at scale | Real-time |
| HTML scrape + datacenter proxies | ~$0.50/GB | Medium | Real-time |
| HTML scrape + residential proxies | ~$3-8/GB | Low | Real-time |
| Algolia Search API | Free | Very low | 1-5 min lag |
For most research workloads, Algolia covers 80% of use cases. For live score tracking or front-page monitoring at sub-minute resolution, residential rotation is the right call. The same proxy infrastructure that handles HN works well when you scrape public university course catalogs at scale, where content freshness matters less but volume matters more.
Avoid free proxy lists. They’re flagged by HN’s reverse-DNS checks within minutes.
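If you do rotate, the mechanics are simple. A sketch with `requests`, where the gateway URLs are placeholders for whatever your provider issues:

```python
import itertools

import requests

# Hypothetical rotating-gateway endpoints; substitute your provider's.
PROXIES = itertools.cycle([
    "http://user:pass@gw1.example-residential.net:8000",
    "http://user:pass@gw2.example-residential.net:8000",
])

def fetch_front_page() -> str:
    """Fetch the front page through the next proxy in the pool."""
    proxy = next(PROXIES)
    resp = requests.get(
        "https://news.ycombinator.com/",
        proxies={"http": proxy, "https": proxy},
        headers={"User-Agent": "hn-research-bot/0.1 (contact: you@example.com)"},
        timeout=15,
    )
    resp.raise_for_status()
    return resp.text
```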
What Data to Collect and Store
A full HN front-page row contains more than just title and URL. The fields worth capturing:
- `item_id`: the numeric HN ID, permanent and stable
- `rank`: position 1-30 at time of scrape
- `score`: current upvotes
- `comments_count`: the `descendants` field in Firebase, or parsed from the "N comments" link
- `author`: the submitter's username
- `domain`: extracted from the URL (useful for source-frequency analysis)
- `scraped_at`: UTC timestamp with second precision
Score velocity (score delta over time) is the high-value derived field. Store snapshots every 5 minutes and you can compute it trivially. Stories that jump from 50 to 200 points in 15 minutes are almost always front-page bound — that’s the signal that makes HN scraping worth doing over just reading the site.
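With 5-minute snapshots in Postgres (the `hn_snapshots` schema is defined in the next section), a window function gets you velocity directly. A sketch assuming `psycopg2`:

```python
import psycopg2

# LAG() compares each snapshot with the previous one for the same item_id,
# so score_delta is the points gained between consecutive snapshots.
VELOCITY_SQL = """
SELECT item_id, scraped_at,
       score - LAG(score) OVER w AS score_delta,
       EXTRACT(EPOCH FROM scraped_at - LAG(scraped_at) OVER w) / 60 AS minutes_elapsed
FROM hn_snapshots
WHERE scraped_at > now() - interval '1 hour'
WINDOW w AS (PARTITION BY item_id ORDER BY scraped_at)
ORDER BY score_delta DESC NULLS LAST
LIMIT 10;
"""

with psycopg2.connect("dbname=hn") as conn, conn.cursor() as cur:
    cur.execute(VELOCITY_SQL)
    for item_id, ts, delta, minutes in cur.fetchall():
        print(item_id, ts, delta, minutes)
```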
For researchers doing academic or technical content monitoring, this velocity signal pairs well with metadata pipelines like those used to scrape arXiv preprint metadata and PDFs programmatically, where tracking what gets shared on HN vs. what gets submitted to arXiv reveals real knowledge diffusion patterns.
Storage, Dedup, and Query Design
Write to Postgres with a composite primary key on (item_id, scraped_at), where scraped_at is rounded to your snapshot interval. This makes upserts idempotent and lets you re-run scrapers without duplicate rows.
Numbered steps for setting up a minimal snapshot pipeline (a runnable sketch follows the list):

1. Create table `hn_snapshots (item_id bigint, rank int, score int, comments int, domain text, scraped_at timestamptz, PRIMARY KEY (item_id, scraped_at))`
2. Add an index on `(scraped_at DESC, score DESC)` for time-range queries
3. Add an index on `domain` if you're doing source-frequency analysis
4. Schedule the scraper every 5 minutes via cron or a simple scheduler like APScheduler
5. Archive rows older than 90 days to a cheaper store (Parquet on S3 or TimescaleDB compression)
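Here's a minimal sketch of steps 1-3 plus the idempotent write, assuming `psycopg2`. Bucketing `scraped_at` to the 5-minute interval is what makes re-runs collide with existing rows instead of duplicating them:

```python
from datetime import datetime, timezone

import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS hn_snapshots (
    item_id    bigint,
    rank       int,
    score      int,
    comments   int,
    domain     text,
    scraped_at timestamptz,
    PRIMARY KEY (item_id, scraped_at)
);
CREATE INDEX IF NOT EXISTS hn_snap_time   ON hn_snapshots (scraped_at DESC, score DESC);
CREATE INDEX IF NOT EXISTS hn_snap_domain ON hn_snapshots (domain);
"""

UPSERT = """
INSERT INTO hn_snapshots (item_id, rank, score, comments, domain, scraped_at)
VALUES (%(item_id)s, %(rank)s, %(score)s, %(comments)s, %(domain)s, %(scraped_at)s)
ON CONFLICT (item_id, scraped_at) DO UPDATE
SET rank = EXCLUDED.rank, score = EXCLUDED.score, comments = EXCLUDED.comments;
"""

def bucket(ts: datetime, minutes: int = 5) -> datetime:
    """Round a timestamp down to the snapshot interval."""
    return ts.replace(minute=ts.minute - ts.minute % minutes, second=0, microsecond=0)

def write_snapshot(conn, rows: list[dict]) -> None:
    """Upsert one front-page snapshot; safe to re-run within the same bucket."""
    ts = bucket(datetime.now(timezone.utc))
    with conn.cursor() as cur:
        cur.executemany(UPSERT, [{**r, "scraped_at": ts} for r in rows])
    conn.commit()
```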
TimescaleDB is the cleanest option if you’re already running Postgres — its hypertable chunking makes the time-range queries fast without manual partitioning. For K-12 or education researchers who scrape K-12 school district data and test scores and need to correlate with community engagement trends on forums like HN, this same schema pattern works across datasets.
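If you go the TimescaleDB route, the conversion and policies below replace manual partitioning and the archive cron job. The function names are from TimescaleDB 2.x; note that `add_retention_policy` drops old chunks outright, so export to Parquet first if you need the archive:

```python
import psycopg2

TIMESCALE_SETUP = """
-- Convert the existing table; migrate_data handles rows already present.
SELECT create_hypertable('hn_snapshots', 'scraped_at',
                         if_not_exists => TRUE, migrate_data => TRUE);
-- Compress chunks older than a week, segmented by story for better ratios.
ALTER TABLE hn_snapshots SET (timescaledb.compress,
                              timescaledb.compress_segmentby = 'item_id');
SELECT add_compression_policy('hn_snapshots', INTERVAL '7 days');
-- Drops chunks past 90 days; only enable once the export path exists.
SELECT add_retention_policy('hn_snapshots', INTERVAL '90 days');
"""

with psycopg2.connect("dbname=hn") as conn, conn.cursor() as cur:
    cur.execute(TIMESCALE_SETUP)
    conn.commit()
```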
For a complete breakdown of every HN data field and pagination across pages 2-30 (where you catch stories that have fallen off the front), the DRT pillar guide on how to scrape Hacker News data covers the full item tree structure and comment threading logic.
Staying Within Reasonable Limits
HN is a small operation run by Y Combinator. There’s no published scraping policy, but the community expectation is that you don’t hammer the site. Practical guidelines, implemented in the sketch after this list:
- Keep request rate below 1 per 5 seconds from any single IP
- Cache aggressively: front page changes slowly, no need to re-fetch on every pipeline stage
- Use Algolia for any historical backfill (it’s designed for it)
- Set a real User-Agent string with contact info if you’re running a production research tool
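A sketch tying those guidelines together, assuming `requests`; the User-Agent string and contact address are placeholders:

```python
import hashlib
import time

import requests

SESSION = requests.Session()
SESSION.headers["User-Agent"] = "hn-research-bot/0.1 (contact: you@example.com)"

_last_fetch = 0.0
_last_hash = None

def polite_fetch(url: str, min_interval: float = 5.0) -> str | None:
    """Fetch url, keeping at least min_interval seconds between requests.
    Returns None when the body is byte-identical to the previous fetch,
    so downstream parsing and writes can be skipped."""
    global _last_fetch, _last_hash
    wait = min_interval - (time.monotonic() - _last_fetch)
    if wait > 0:
        time.sleep(wait)
    resp = SESSION.get(url, timeout=10)
    _last_fetch = time.monotonic()
    resp.raise_for_status()
    digest = hashlib.sha256(resp.content).hexdigest()
    if digest == _last_hash:
        return None  # page unchanged since last fetch
    _last_hash = digest
    return resp.text
```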
The Firebase API is the polite default for low-volume work. Direct HTML scraping is appropriate when you need sub-minute freshness, custom field extraction, or volume that exceeds Firebase’s effective throttle.
Bottom Line
For most engineers, a hybrid setup works best: Algolia for history and search, direct HTML scraping with a residential proxy pool for live front-page monitoring, and Postgres or TimescaleDB for snapshots. The Firebase API is fine for prototyping but breaks down under production load. dataresearchtools.com covers the full stack of scraping patterns like this — from single-page extractions to distributed pipelines — so check the related guides if you’re building something more complex than a cron job.
Related guides on dataresearchtools.com
- How to Scrape arXiv Preprint Metadata and PDFs Programmatically (2026)
- How to Scrape ProductHunt Launch Data and Maker Profiles (2026)
- How to Scrape Public University Course Catalogs at Scale (2026)
- How to Scrape K-12 School District Data and Test Scores (2026)
- Pillar: How to Scrape Hacker News Data