How to Scrape Hacker News Data

Hacker News (HN) is one of the most influential technology communities on the internet, with thousands of submissions and comments posted daily. Scraping Hacker News data provides valuable insight into technology trends, startup sentiment, and developer tool popularity. Unlike most websites, HN offers a free, official API that makes data collection straightforward.

This guide covers both the official Hacker News API approach (recommended) and HTML scraping methods, along with practical examples for data analysis, trend detection, and monitoring.

Why Scrape Hacker News?

Hacker News data is valuable for several use cases:

  • Trend analysis — Track which technologies, frameworks, and tools are gaining or losing popularity
  • Competitive intelligence — Monitor mentions of your product or competitors
  • Content research — Find popular topics for blog posts and marketing content
  • Sentiment analysis — Gauge developer sentiment toward specific technologies
  • Recruitment — Identify active developers and their interests from “Who is Hiring” threads
  • Market research — Track startup launches and funding announcements

What Data Is Available?

| Data Type | API Access | Volume |
| --- | --- | --- |
| Stories (posts) | Yes | ~1,000/day |
| Comments | Yes | ~5,000-10,000/day |
| User profiles | Yes | ~500K+ users |
| Votes/points | Yes (per item) | Real-time |
| Job postings | Yes | Monthly "Who is Hiring" |
| Historical data | Yes (all items) | 30M+ items since 2007 |
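The "Real-time" row reflects the fact that scores and comment counts change continuously; the API exposes this through its updates endpoint, which lists recently changed items. A quick sketch to inspect what the API actually returns:

```python
import requests

BASE_URL = "https://hacker-news.firebaseio.com/v0"

# updates.json lists IDs of recently changed items and changed user profiles
changed = requests.get(f"{BASE_URL}/updates.json", timeout=10).json()
print(f"{len(changed['items'])} recently changed items")

# Inspect the raw fields of one changed item
item = requests.get(f"{BASE_URL}/item/{changed['items'][0]}.json", timeout=10).json()
print(sorted(item.keys()))
```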

Method 1: Official Hacker News API (Recommended)

The Hacker News API is free, requires no authentication, and imposes no documented rate limits. Always prefer the API over HTML scraping when possible.

API Basics

```python
import requests
import json

BASE_URL = "https://hacker-news.firebaseio.com/v0"

def get_item(item_id):
    """Fetch a single item (story, comment, job, poll) by ID."""
    url = f"{BASE_URL}/item/{item_id}.json"
    response = requests.get(url, timeout=10)
    return response.json()

def get_top_stories():
    """Get IDs of the current top 500 stories."""
    url = f"{BASE_URL}/topstories.json"
    response = requests.get(url, timeout=10)
    return response.json()

def get_new_stories():
    """Get IDs of the newest 500 stories."""
    url = f"{BASE_URL}/newstories.json"
    response = requests.get(url, timeout=10)
    return response.json()

def get_best_stories():
    """Get IDs of the best 500 stories."""
    url = f"{BASE_URL}/beststories.json"
    response = requests.get(url, timeout=10)
    return response.json()

def get_user(username):
    """Fetch a user profile."""
    url = f"{BASE_URL}/user/{username}.json"
    response = requests.get(url, timeout=10)
    return response.json()

# Example: get the current top story
top_ids = get_top_stories()
top_story = get_item(top_ids[0])
print(json.dumps(top_story, indent=2))
```

Fetching Top Stories with Details

```python
import requests
import concurrent.futures
from datetime import datetime

BASE_URL = "https://hacker-news.firebaseio.com/v0"

def fetch_item(item_id):
    """Fetch a single item from the API."""
    try:
        response = requests.get(f"{BASE_URL}/item/{item_id}.json", timeout=10)
        return response.json()
    except Exception as e:
        print(f"Error fetching item {item_id}: {e}")
        return None

def fetch_top_stories(limit=30):
    """Fetch top stories with full details using concurrent requests."""
    # Get the top story IDs
    response = requests.get(f"{BASE_URL}/topstories.json", timeout=10)
    story_ids = response.json()[:limit]

    # Fetch story details concurrently
    stories = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        future_to_id = {executor.submit(fetch_item, sid): sid for sid in story_ids}
        for future in concurrent.futures.as_completed(future_to_id):
            result = future.result()
            if result:
                stories.append({
                    'id': result.get('id'),
                    'title': result.get('title'),
                    'url': result.get('url', ''),
                    'score': result.get('score', 0),
                    'by': result.get('by', ''),
                    'time': datetime.fromtimestamp(result.get('time', 0)).isoformat(),
                    'descendants': result.get('descendants', 0),  # comment count
                    'type': result.get('type'),
                })

    # Sort by score
    stories.sort(key=lambda x: x['score'], reverse=True)
    return stories

# Fetch and display the top stories
stories = fetch_top_stories(50)
for i, story in enumerate(stories[:10], 1):
    print(f"{i}. [{story['score']}pts] {story['title']}")
    print(f"   By: {story['by']} | Comments: {story['descendants']}")
    print(f"   URL: {story['url']}\n")
```

Scraping Comments from a Story

```python
def fetch_comments(story_id, max_depth=3):
    """Recursively fetch all comments for a story."""
    story = fetch_item(story_id)
    if not story or 'kids' not in story:
        return []

    comments = []

    def fetch_comment_tree(comment_ids, depth=0):
        if depth >= max_depth:
            return
        with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
            futures = {executor.submit(fetch_item, cid): cid for cid in comment_ids}
            for future in concurrent.futures.as_completed(futures):
                comment = future.result()
                if comment and comment.get('type') == 'comment' and not comment.get('deleted'):
                    comments.append({
                        'id': comment['id'],
                        'by': comment.get('by', '[deleted]'),
                        'text': comment.get('text', ''),
                        'time': datetime.fromtimestamp(comment.get('time', 0)).isoformat(),
                        'depth': depth,
                        'parent': comment.get('parent'),
                    })
                    # Fetch replies
                    if 'kids' in comment:
                        fetch_comment_tree(comment['kids'], depth + 1)

    fetch_comment_tree(story['kids'])
    comments.sort(key=lambda x: x['time'])
    return comments

# Example: get all comments from a story
comments = fetch_comments(38792446, max_depth=5)
print(f"Found {len(comments)} comments")
for c in comments[:5]:
    indent = "  " * c['depth']
    text_preview = c['text'][:100].replace('\n', ' ')
    print(f"{indent}[{c['by']}]: {text_preview}...")
```

Method 2: Async API Scraping for Large-Scale Collection

For collecting large volumes of data, use async requests:

```python
import asyncio
import aiohttp
import json

BASE_URL = "https://hacker-news.firebaseio.com/v0"

async def fetch_item_async(session, item_id):
    """Fetch a single item asynchronously."""
    try:
        async with session.get(f"{BASE_URL}/item/{item_id}.json") as response:
            return await response.json()
    except Exception:
        return None

async def fetch_stories_batch(story_ids, batch_size=50):
    """Fetch multiple stories in batches."""
    all_stories = []
    async with aiohttp.ClientSession() as session:
        for i in range(0, len(story_ids), batch_size):
            batch = story_ids[i:i + batch_size]
            tasks = [fetch_item_async(session, sid) for sid in batch]
            results = await asyncio.gather(*tasks)
            for result in results:
                if result and result.get('type') == 'story':
                    all_stories.append({
                        'id': result['id'],
                        'title': result.get('title', ''),
                        'url': result.get('url', ''),
                        'score': result.get('score', 0),
                        'by': result.get('by', ''),
                        'time': result.get('time', 0),
                        'comments': result.get('descendants', 0),
                    })
            print(f"Fetched {min(i + batch_size, len(story_ids))}/{len(story_ids)}")
            await asyncio.sleep(0.5)  # small delay between batches
    return all_stories

async def main():
    # Get top, new, and best story IDs
    async with aiohttp.ClientSession() as session:
        async with session.get(f"{BASE_URL}/topstories.json") as resp:
            top_ids = await resp.json()
        async with session.get(f"{BASE_URL}/newstories.json") as resp:
            new_ids = await resp.json()
        async with session.get(f"{BASE_URL}/beststories.json") as resp:
            best_ids = await resp.json()

    # Combine and deduplicate
    all_ids = list(set(top_ids + new_ids + best_ids))
    print(f"Total unique story IDs: {len(all_ids)}")

    # Fetch all stories
    stories = await fetch_stories_batch(all_ids)
    print(f"Successfully fetched {len(stories)} stories")

    # Save to file
    with open('hn_stories.json', 'w') as f:
        json.dump(stories, f, indent=2)

asyncio.run(main())
```

Method 3: HTML Scraping (Alternative)

If you need data the API does not expose (such as past front pages at news.ycombinator.com/front, or the page exactly as rendered), you can scrape the HTML directly:

```python
import requests
from bs4 import BeautifulSoup

def scrape_hn_frontpage():
    """Scrape the Hacker News front page HTML."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    }
    response = requests.get('https://news.ycombinator.com/', headers=headers, timeout=15)
    soup = BeautifulSoup(response.text, 'lxml')

    stories = []
    rows = soup.select('tr.athing')
    for row in rows:
        story = {}

        # Title and URL
        title_link = row.select_one('span.titleline a')
        if title_link:
            story['title'] = title_link.text
            story['url'] = title_link.get('href', '')

        # Rank
        rank_el = row.select_one('span.rank')
        story['rank'] = rank_el.text.strip('.') if rank_el else None

        # Score and metadata live in the next sibling row
        subtext = row.find_next_sibling('tr')
        if subtext:
            score_el = subtext.select_one('span.score')
            story['score'] = score_el.text if score_el else '0 points'
            user_el = subtext.select_one('a.hnuser')
            story['by'] = user_el.text if user_el else None

            # Comment count
            links = subtext.select('a')
            for link in links:
                if 'comment' in link.text or 'discuss' in link.text:
                    story['comments'] = link.text
                    break

        stories.append(story)

    return stories

frontpage = scrape_hn_frontpage()
for s in frontpage[:5]:
    print(f"#{s['rank']} [{s.get('score', 'N/A')}] {s['title']}")

Practical Applications

Technology Trend Tracker

```python
from collections import Counter

def analyze_tech_trends(stories):
    """Analyze technology mentions in HN story titles."""
    tech_keywords = {
        'AI/ML': ['ai', 'machine learning', 'llm', 'gpt', 'claude', 'openai', 'transformer', 'neural'],
        'Rust': ['rust', 'rustlang', 'cargo'],
        'Go': ['golang', ' go '],
        'Python': ['python', 'django', 'flask', 'fastapi'],
        'JavaScript': ['javascript', 'typescript', 'nodejs', 'node.js', 'react', 'vue', 'nextjs'],
        'Databases': ['postgres', 'sqlite', 'mysql', 'redis', 'mongodb'],
        'Cloud': ['aws', 'azure', 'gcp', 'kubernetes', 'docker', 'k8s'],
        'Security': ['security', 'vulnerability', 'hack', 'breach', 'zero-day'],
        'Crypto': ['bitcoin', 'ethereum', 'crypto', 'blockchain', 'web3'],
    }

    trend_counts = Counter()
    trend_scores = Counter()
    for story in stories:
        title_lower = story.get('title', '').lower()
        for category, keywords in tech_keywords.items():
            if any(kw in title_lower for kw in keywords):
                trend_counts[category] += 1
                trend_scores[category] += story.get('score', 0)

    print("\n=== Technology Trends ===")
    print(f"{'Category':<20} {'Mentions':>10} {'Total Score':>12} {'Avg Score':>10}")
    print("-" * 55)
    for category, count in trend_counts.most_common():
        avg = trend_scores[category] / count if count > 0 else 0
        print(f"{category:<20} {count:>10} {trend_scores[category]:>12} {avg:>10.1f}")

# Use with the stories fetched earlier
analyze_tech_trends(stories)
```

Brand Monitoring

```python
import time
import requests

# Reuses BASE_URL and fetch_item() defined in Method 1
def monitor_brand_mentions(brand_keywords, check_interval=300):
    """Monitor Hacker News for brand mentions."""
    seen_ids = set()
    while True:
        try:
            response = requests.get(f"{BASE_URL}/newstories.json", timeout=10)
            new_ids = response.json()[:100]  # check the latest 100 stories
            for story_id in new_ids:
                if story_id in seen_ids:
                    continue
                seen_ids.add(story_id)
                story = fetch_item(story_id)
                if not story:
                    continue
                title = story.get('title', '').lower()
                url = story.get('url', '').lower()
                for keyword in brand_keywords:
                    if keyword.lower() in title or keyword.lower() in url:
                        print(f"\n{'='*60}")
                        print(f"BRAND MENTION DETECTED: {keyword}")
                        print(f"Title: {story.get('title')}")
                        print(f"URL: {story.get('url')}")
                        print(f"Score: {story.get('score', 0)}")
                        print(f"HN Link: https://news.ycombinator.com/item?id={story_id}")
                        print(f"{'='*60}")
        except Exception as e:
            print(f"Error checking HN: {e}")
        time.sleep(check_interval)

# Monitor for your brand
monitor_brand_mentions(['YourBrand', 'yourproduct.com'])
```
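The official Firebase API has no full-text search, so the polling approach above only catches keywords in the titles and URLs of new stories. The Algolia-hosted HN Search API (hn.algolia.com) searches stories and comments; a minimal sketch:

```python
import requests

def search_hn(query, tags='story'):
    """Full-text search via the Algolia HN Search API, newest first."""
    response = requests.get(
        'https://hn.algolia.com/api/v1/search_by_date',
        params={'query': query, 'tags': tags},  # tags: 'story', 'comment', ...
        timeout=10,
    )
    return response.json().get('hits', [])

for hit in search_hn('YourBrand')[:5]:
    print(hit.get('created_at'), hit.get('title'), hit.get('objectID'))
```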

Who is Hiring Data Extraction

```python
from bs4 import BeautifulSoup

# Reuses fetch_comments() defined above
def scrape_who_is_hiring(story_id):
    """Extract job postings from 'Ask HN: Who is Hiring?' threads."""
    comments = fetch_comments(story_id, max_depth=1)  # only top-level comments
    jobs = []
    for comment in comments:
        if comment['depth'] == 0 and comment.get('text'):
            text = comment['text']
            # Simple parsing: top-level comments are job postings
            job = {
                'raw_text': text,
                'posted_by': comment['by'],
                'posted_at': comment['time'],
            }

            # Try to extract the company name (usually the first line)
            lines = text.split('<p>')
            if lines:
                first_line = BeautifulSoup(lines[0], 'lxml').text
                job['company_line'] = first_line.strip()

            # Check for common markers
            text_lower = text.lower()
            job['remote'] = 'remote' in text_lower
            job['onsite'] = 'onsite' in text_lower or 'on-site' in text_lower
            job['visa'] = 'visa' in text_lower

            jobs.append(job)
    return jobs
```
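The monthly threads are submitted by the whoishiring account, so you can locate the current one programmatically instead of pasting an ID by hand. A sketch that reuses `fetch_item` from Method 1 (it assumes the user's `submitted` list is ordered newest-first, which matches the API's behavior):

```python
import requests

def find_latest_hiring_thread():
    """Find the most recent 'Who is hiring?' thread via the whoishiring user."""
    user = requests.get(f"{BASE_URL}/user/whoishiring.json", timeout=10).json()
    # The account posts three threads a month; scan the latest submissions
    for item_id in user.get('submitted', [])[:10]:
        item = fetch_item(item_id)
        if item and 'who is hiring' in item.get('title', '').lower():
            return item
    return None

thread = find_latest_hiring_thread()
if thread:
    jobs = scrape_who_is_hiring(thread['id'])
    print(f"{len(jobs)} postings in: {thread['title']}")
```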

Using BigQuery for Historical Analysis

For large-scale historical analysis, use the Hacker News dataset on Google BigQuery (free tier available):

```sql
-- Top domains submitted to HN in the last year
SELECT
  NET.REG_DOMAIN(url) AS domain,
  COUNT(*) AS submissions,
  AVG(score) AS avg_score,
  SUM(descendants) AS total_comments
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'story'
  AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 365 DAY)
  AND url IS NOT NULL
GROUP BY domain
ORDER BY submissions DESC
LIMIT 50;
```

This is far more efficient than scraping millions of items through the API.
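If you would rather stay in Python, the same query runs through the google-cloud-bigquery client library (this assumes you have a Google Cloud project and application-default credentials configured):

```python
from google.cloud import bigquery

client = bigquery.Client()  # picks up your default GCP project and credentials

query = """
SELECT NET.REG_DOMAIN(url) AS domain, COUNT(*) AS submissions
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'story' AND url IS NOT NULL
GROUP BY domain
ORDER BY submissions DESC
LIMIT 10
"""

for row in client.query(query).result():  # result() waits for the job to finish
    print(row.domain, row.submissions)
```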

Frequently Asked Questions

Does Hacker News have rate limits?

The official Firebase API does not document strict rate limits, but it is good practice to limit yourself to a few requests per second. For heavy use, add small delays between requests and use concurrent connections responsibly. The API is free and does not require authentication.
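A minimal sketch of one way to throttle; the 0.2-second interval (about five requests per second) is an arbitrary conservative choice, not a documented limit:

```python
import time
import requests

class ThrottledClient:
    """requests wrapper that enforces a minimum interval between calls."""

    def __init__(self, min_interval=0.2):
        self.min_interval = min_interval
        self._last = 0.0

    def get(self, url):
        wait = self._last + self.min_interval - time.monotonic()
        if wait > 0:
            time.sleep(wait)  # back off until the interval has elapsed
        self._last = time.monotonic()
        return requests.get(url, timeout=10)

client = ThrottledClient()
item = client.get("https://hacker-news.firebaseio.com/v0/item/1.json").json()
print(item.get('title'))
```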

Can I get historical Hacker News data?

Yes, through several methods: (1) The API gives access to all items ever posted (30M+) by incrementing item IDs, (2) Google BigQuery has the complete HN dataset updated regularly, (3) Third-party datasets are available on Kaggle and academic repositories.
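For the ID-increment approach, the maxitem endpoint returns the largest item ID currently assigned, and every ID from 1 up to it can be fetched. A small sketch of a backfill slice (scale it up with the async batching from Method 2):

```python
import requests

BASE_URL = "https://hacker-news.firebaseio.com/v0"

# maxitem.json returns the highest item ID assigned so far
max_id = requests.get(f"{BASE_URL}/maxitem.json", timeout=10).json()

# Walk the 20 most recent IDs; deleted items come back as None
for item_id in range(max_id - 20, max_id + 1):
    item = requests.get(f"{BASE_URL}/item/{item_id}.json", timeout=10).json()
    if item:
        print(item_id, item.get('type'), item.get('title', ''))
```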

Is it legal to scrape Hacker News?

Hacker News provides an official API specifically for programmatic access, making API-based collection explicitly allowed. HTML scraping is also generally acceptable as long as you respect reasonable rate limits. All HN content is publicly available [link to: is-web-scraping-legal].

How can I track trending topics on Hacker News?

Fetch the top stories periodically (every 15-30 minutes), analyze titles for keyword frequency, and track score trajectories over time. Compare against historical baselines to identify topics gaining unusual traction. The async API approach shown above makes this efficient.
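One lightweight implementation is to snapshot top-story scores on an interval and diff consecutive snapshots; a sketch that reuses `fetch_top_stories` from Method 1:

```python
import time

def track_score_velocity(interval=900, snapshots=2):
    """Print the stories that gained the most points between snapshots."""
    previous = {}
    for n in range(snapshots):
        current = {s['id']: s for s in fetch_top_stories(50)}
        if previous:
            # Score gained by each story present in both snapshots
            gains = [
                (current[sid]['score'] - previous[sid]['score'], current[sid]['title'])
                for sid in current.keys() & previous.keys()
            ]
            for gain, title in sorted(gains, reverse=True)[:5]:
                print(f"+{gain} pts  {title}")
        previous = current
        if n < snapshots - 1:
            time.sleep(interval)  # 15 minutes between snapshots by default

track_score_velocity()
```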

Can I use Hacker News data for machine learning?

Yes, HN data is popular for NLP research, including sentiment analysis, topic modeling, text classification (story vs. comment vs. job posting), and engagement prediction (predicting which stories will reach the front page). The BigQuery dataset is the most practical source for ML training data.

last updated: March 12, 2026
