How to Scrape Hacker News Data
Hacker News (HN) is one of the most influential technology communities on the internet, with thousands of submissions and comments posted daily. Scraping Hacker News data provides valuable insight into technology trends, startup sentiment, and the popularity of developer tools. Unlike most websites, HN offers a free, official API that makes data collection straightforward.
This guide covers both the official Hacker News API approach (recommended) and HTML scraping methods, along with practical examples for data analysis, trend detection, and monitoring.
Why Scrape Hacker News?
Hacker News data is valuable for several use cases:
- Trend analysis — Track which technologies, frameworks, and tools are gaining or losing popularity
- Competitive intelligence — Monitor mentions of your product or competitors
- Content research — Find popular topics for blog posts and marketing content
- Sentiment analysis — Gauge developer sentiment toward specific technologies
- Recruitment — Identify active developers and their interests from “Who is Hiring” threads
- Market research — Track startup launches and funding announcements
What Data Is Available?
| Data Type | API Access | Volume |
|---|---|---|
| Stories (posts) | Yes | ~1,000/day |
| Comments | Yes | ~5,000-10,000/day |
| User profiles | Yes | ~500K+ users |
| Votes/points | Yes (per item) | Real-time |
| Job postings | Yes | Monthly “Who is Hiring” |
| Historical data | Yes (all items) | 30M+ items since 2007 |
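Each item, whether a story, comment, job, or poll, is returned by the API as a single JSON object. A story typically carries the fields below, where descendants is the total comment count and kids holds the IDs of its top-level comments (the values shown are illustrative placeholders, not a real record):
{
  "id": 40000001,
  "type": "story",
  "by": "example_user",
  "time": 1703275200,
  "title": "Example story title",
  "url": "https://example.com/post",
  "score": 312,
  "descendants": 147,
  "kids": [40000057, 40000112]
}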
Method 1: Official Hacker News API (Recommended)
The Hacker News API is free, requires no authentication, and imposes no strict documented rate limits. Prefer the API over HTML scraping whenever possible.
API Basics
import requests
import json
from datetime import datetime
BASE_URL = "https://hacker-news.firebaseio.com/v0"
def get_item(item_id):
"""Fetch a single item (story, comment, job, poll) by ID."""
url = f"{BASE_URL}/item/{item_id}.json"
response = requests.get(url, timeout=10)
return response.json()
def get_top_stories():
"""Get IDs of current top 500 stories."""
url = f"{BASE_URL}/topstories.json"
response = requests.get(url, timeout=10)
return response.json()
def get_new_stories():
"""Get IDs of newest 500 stories."""
url = f"{BASE_URL}/newstories.json"
response = requests.get(url, timeout=10)
return response.json()
def get_best_stories():
"""Get IDs of best 500 stories."""
url = f"{BASE_URL}/beststories.json"
response = requests.get(url, timeout=10)
return response.json()
def get_user(username):
"""Fetch user profile."""
url = f"{BASE_URL}/user/{username}.json"
response = requests.get(url, timeout=10)
return response.json()
Example: Get top story
top_ids = get_top_stories()
top_story = get_item(top_ids[0])
print(json.dumps(top_story, indent=2))
Fetching Top Stories with Details
import requests
import concurrent.futures
import json
from datetime import datetime
BASE_URL = "https://hacker-news.firebaseio.com/v0"
def fetch_item(item_id):
"""Fetch a single item from the API."""
try:
response = requests.get(f"{BASE_URL}/item/{item_id}.json", timeout=10)
return response.json()
except Exception as e:
print(f"Error fetching item {item_id}: {e}")
return None
def fetch_top_stories(limit=30):
"""Fetch top stories with full details using concurrent requests."""
# Get top story IDs
response = requests.get(f"{BASE_URL}/topstories.json", timeout=10)
story_ids = response.json()[:limit]
# Fetch story details concurrently
stories = []
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
future_to_id = {executor.submit(fetch_item, sid): sid for sid in story_ids}
for future in concurrent.futures.as_completed(future_to_id):
result = future.result()
if result:
stories.append({
'id': result.get('id'),
'title': result.get('title'),
'url': result.get('url', ''),
'score': result.get('score', 0),
'by': result.get('by', ''),
'time': datetime.fromtimestamp(result.get('time', 0)).isoformat(),
'descendants': result.get('descendants', 0), # comment count
'type': result.get('type'),
})
# Sort by score
stories.sort(key=lambda x: x['score'], reverse=True)
return stories
Example: Fetch and display top stories
stories = fetch_top_stories(50)
for i, story in enumerate(stories[:10], 1):
print(f"{i}. [{story['score']}pts] {story['title']}")
print(f" By: {story['by']} | Comments: {story['descendants']}")
print(f" URL: {story['url']}\n")
Scraping Comments from a Story
def fetch_comments(story_id, max_depth=3):
    """Recursively fetch comments for a story, down to max_depth levels of nesting."""
story = fetch_item(story_id)
if not story or 'kids' not in story:
return []
comments = []
def fetch_comment_tree(comment_ids, depth=0):
if depth >= max_depth:
return
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
futures = {executor.submit(fetch_item, cid): cid for cid in comment_ids}
for future in concurrent.futures.as_completed(futures):
comment = future.result()
if comment and comment.get('type') == 'comment' and not comment.get('deleted'):
comments.append({
'id': comment['id'],
'by': comment.get('by', '[deleted]'),
'text': comment.get('text', ''),
'time': datetime.fromtimestamp(comment.get('time', 0)).isoformat(),
'depth': depth,
'parent': comment.get('parent'),
})
# Fetch replies
if 'kids' in comment:
fetch_comment_tree(comment['kids'], depth + 1)
fetch_comment_tree(story['kids'])
comments.sort(key=lambda x: x['time'])
return comments
Example: Get all comments from a story
comments = fetch_comments(38792446, max_depth=5)
print(f"Found {len(comments)} comments")
for c in comments[:5]:
indent = " " * c['depth']
text_preview = c['text'][:100].replace('\n', ' ')
print(f"{indent}[{c['by']}]: {text_preview}...")
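Note that the text field returned by the API is HTML: paragraphs are separated by <p> tags and characters are HTML-escaped. If you want plain text for analysis, strip the markup first; here is a minimal sketch using BeautifulSoup (the same library used for HTML scraping below):
from bs4 import BeautifulSoup

def clean_comment_text(html_text):
    """Convert HN comment HTML into plain text, keeping paragraph breaks."""
    soup = BeautifulSoup(html_text.replace('<p>', '\n\n'), 'lxml')
    return soup.get_text().strip()

# Example: strip markup from the comments fetched above
for c in comments[:3]:
    print(clean_comment_text(c['text'])[:120])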
Method 2: Async API Scraping for Large-Scale Collection
For collecting large volumes of data, use async requests:
import asyncio
import aiohttp
import json
from datetime import datetime
BASE_URL = "https://hacker-news.firebaseio.com/v0"
async def fetch_item_async(session, item_id):
"""Fetch a single item asynchronously."""
try:
async with session.get(f"{BASE_URL}/item/{item_id}.json") as response:
return await response.json()
except Exception:
return None
async def fetch_stories_batch(story_ids, batch_size=50):
"""Fetch multiple stories in batches."""
all_stories = []
async with aiohttp.ClientSession() as session:
for i in range(0, len(story_ids), batch_size):
batch = story_ids[i:i + batch_size]
tasks = [fetch_item_async(session, sid) for sid in batch]
results = await asyncio.gather(*tasks)
for result in results:
if result and result.get('type') == 'story':
all_stories.append({
'id': result['id'],
'title': result.get('title', ''),
'url': result.get('url', ''),
'score': result.get('score', 0),
'by': result.get('by', ''),
'time': result.get('time', 0),
'comments': result.get('descendants', 0),
})
print(f"Fetched {min(i + batch_size, len(story_ids))}/{len(story_ids)}")
await asyncio.sleep(0.5) # Small delay between batches
return all_stories
async def main():
# Get top, new, and best story IDs
async with aiohttp.ClientSession() as session:
async with session.get(f"{BASE_URL}/topstories.json") as resp:
top_ids = await resp.json()
async with session.get(f"{BASE_URL}/newstories.json") as resp:
new_ids = await resp.json()
async with session.get(f"{BASE_URL}/beststories.json") as resp:
best_ids = await resp.json()
# Combine and deduplicate
all_ids = list(set(top_ids + new_ids + best_ids))
print(f"Total unique story IDs: {len(all_ids)}")
# Fetch all stories
stories = await fetch_stories_batch(all_ids)
print(f"Successfully fetched {len(stories)} stories")
# Save to file
with open('hn_stories.json', 'w') as f:
json.dump(stories, f, indent=2)
asyncio.run(main())
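Once hn_stories.json is written, downstream analysis can use whatever tooling you prefer. A small sketch loading the file into pandas (assuming pandas is installed) to inspect the highest-scoring stories:
import json
import pandas as pd

with open('hn_stories.json') as f:
    df = pd.DataFrame(json.load(f))

# Convert the Unix timestamp into a proper datetime column
df['time'] = pd.to_datetime(df['time'], unit='s')

# Show the ten highest-scoring stories
print(df.sort_values('score', ascending=False)[['title', 'score', 'comments']].head(10))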
Method 3: HTML Scraping (Alternative)
If you need data the API does not expose directly (such as a story's live front-page rank), you can scrape the HTML directly:
import requests
from bs4 import BeautifulSoup
import time
def scrape_hn_frontpage():
"""Scrape the Hacker News front page HTML."""
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
}
response = requests.get('https://news.ycombinator.com/', headers=headers, timeout=15)
soup = BeautifulSoup(response.text, 'lxml')
stories = []
rows = soup.select('tr.athing')
for row in rows:
story = {}
# Title and URL
title_link = row.select_one('span.titleline a')
if title_link:
story['title'] = title_link.text
story['url'] = title_link.get('href', '')
# Rank
rank_el = row.select_one('span.rank')
story['rank'] = rank_el.text.strip('.') if rank_el else None
# Score and metadata from the next sibling row
subtext = row.find_next_sibling('tr')
if subtext:
score_el = subtext.select_one('span.score')
story['score'] = score_el.text if score_el else '0 points'
user_el = subtext.select_one('a.hnuser')
story['by'] = user_el.text if user_el else None
# Comment count
links = subtext.select('a')
for link in links:
if 'comment' in link.text or 'discuss' in link.text:
story['comments'] = link.text
break
stories.append(story)
return stories
frontpage = scrape_hn_frontpage()
for s in frontpage[:5]:
print(f"#{s['rank']} [{s.get('score', 'N/A')}] {s['title']}")
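The front page only lists 30 stories; further pages are served at /news?p=2, /news?p=3, and so on. A sketch that paginates through several listing pages, reusing the same selectors (keep the delay so you stay well within polite request rates):
def scrape_hn_pages(num_pages=3, delay=2):
    """Scrape multiple HN listing pages via the ?p= pagination parameter."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    }
    all_stories = []
    for page in range(1, num_pages + 1):
        response = requests.get(f'https://news.ycombinator.com/news?p={page}',
                                headers=headers, timeout=15)
        soup = BeautifulSoup(response.text, 'lxml')
        for row in soup.select('tr.athing'):
            title_link = row.select_one('span.titleline a')
            if title_link:
                all_stories.append({
                    'title': title_link.text,
                    'url': title_link.get('href', ''),
                })
        time.sleep(delay)  # polite delay between page loads
    return all_stories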
Practical Applications
Technology Trend Tracker
import re
from collections import Counter
def analyze_tech_trends(stories):
"""Analyze technology mentions in HN story titles."""
tech_keywords = {
'AI/ML': ['ai', 'machine learning', 'llm', 'gpt', 'claude', 'openai', 'transformer', 'neural'],
'Rust': ['rust', 'rustlang', 'cargo'],
'Go': ['golang', ' go '],
'Python': ['python', 'django', 'flask', 'fastapi'],
'JavaScript': ['javascript', 'typescript', 'nodejs', 'node.js', 'react', 'vue', 'nextjs'],
'Databases': ['postgres', 'sqlite', 'mysql', 'redis', 'mongodb'],
'Cloud': ['aws', 'azure', 'gcp', 'kubernetes', 'docker', 'k8s'],
'Security': ['security', 'vulnerability', 'hack', 'breach', 'zero-day'],
'Crypto': ['bitcoin', 'ethereum', 'crypto', 'blockchain', 'web3'],
}
trend_counts = Counter()
trend_scores = Counter()
for story in stories:
title_lower = story.get('title', '').lower()
for category, keywords in tech_keywords.items():
if any(kw in title_lower for kw in keywords):
trend_counts[category] += 1
trend_scores[category] += story.get('score', 0)
print("\n=== Technology Trends ===")
print(f"{'Category':<20} {'Mentions':>10} {'Total Score':>12} {'Avg Score':>10}")
print("-" * 55)
for category, count in trend_counts.most_common():
avg = trend_scores[category] / count if count > 0 else 0
print(f"{category:<20} {count:>10} {trend_scores[category]:>12} {avg:>10.1f}")
Example: Use with previously fetched stories
analyze_tech_trends(stories)
Brand Monitoring
def monitor_brand_mentions(brand_keywords, check_interval=300):
"""Monitor Hacker News for brand mentions."""
import time
seen_ids = set()
while True:
try:
response = requests.get(f"{BASE_URL}/newstories.json", timeout=10)
new_ids = response.json()[:100] # Check latest 100 stories
for story_id in new_ids:
if story_id in seen_ids:
continue
seen_ids.add(story_id)
story = fetch_item(story_id)
if not story:
continue
title = story.get('title', '').lower()
url = story.get('url', '').lower()
for keyword in brand_keywords:
if keyword.lower() in title or keyword.lower() in url:
print(f"\n{'='*60}")
print(f"BRAND MENTION DETECTED: {keyword}")
print(f"Title: {story.get('title')}")
print(f"URL: {story.get('url')}")
print(f"Score: {story.get('score', 0)}")
print(f"HN Link: https://news.ycombinator.com/item?id={story_id}")
print(f"{'='*60}")
except Exception as e:
print(f"Error checking HN: {e}")
time.sleep(check_interval)
Example: Monitor for your brand
monitor_brand_mentions(['YourBrand', 'yourproduct.com'])
Who is Hiring Data Extraction
def scrape_who_is_hiring(story_id):
"""Extract job postings from 'Ask HN: Who is Hiring?' threads."""
comments = fetch_comments(story_id, max_depth=1) # Only top-level comments
jobs = []
for comment in comments:
if comment['depth'] == 0 and comment.get('text'):
text = comment['text']
# Simple parsing — top-level comments are job postings
job = {
'raw_text': text,
'posted_by': comment['by'],
'posted_at': comment['time'],
}
# Try to extract company name (usually first line)
lines = text.split('<p>')
if lines:
first_line = BeautifulSoup(lines[0], 'lxml').text
job['company_line'] = first_line.strip()
# Check for common markers
text_lower = text.lower()
job['remote'] = 'remote' in text_lower
job['onsite'] = 'onsite' in text_lower or 'on-site' in text_lower
job['visa'] = 'visa' in text_lower
jobs.append(job)
return jobs
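To call this you need the item ID of a specific "Who is Hiring?" thread. The threads are posted monthly by the whoishiring account, so one way to find the latest one programmatically is the separate, also free Algolia-powered HN Search API (hn.algolia.com); a sketch, assuming its search_by_date endpoint and tags parameter behave as documented:
def find_latest_hiring_thread():
    """Find the most recent 'Ask HN: Who is hiring?' thread via the HN Search API."""
    response = requests.get(
        'https://hn.algolia.com/api/v1/search_by_date',
        params={'query': 'who is hiring', 'tags': 'story,author_whoishiring'},
        timeout=10,
    )
    for hit in response.json().get('hits', []):
        if 'who is hiring' in hit.get('title', '').lower():
            return int(hit['objectID'])
    return None

# Example usage
thread_id = find_latest_hiring_thread()
if thread_id:
    jobs = scrape_who_is_hiring(thread_id)
    print(f"Extracted {len(jobs)} job postings from thread {thread_id}")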
Using BigQuery for Historical Analysis
For large-scale historical analysis, use the Hacker News dataset on Google BigQuery (free tier available):
-- Top domains submitted to HN in the last year
SELECT
NET.REG_DOMAIN(url) AS domain,
COUNT(*) AS submissions,
AVG(score) AS avg_score,
SUM(descendants) AS total_comments
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'story'
AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 365 DAY)
AND url IS NOT NULL
GROUP BY domain
ORDER BY submissions DESC
LIMIT 50;
This is far more efficient than scraping millions of items through the API.
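If you prefer to run queries from Python rather than the BigQuery console, the google-cloud-bigquery client library can execute the same SQL (this assumes the package is installed and Google Cloud credentials are configured in your environment):
from google.cloud import bigquery

client = bigquery.Client()  # picks up your default Google Cloud credentials
query = """
    SELECT NET.REG_DOMAIN(url) AS domain,
           COUNT(*) AS submissions,
           AVG(score) AS avg_score
    FROM `bigquery-public-data.hacker_news.full`
    WHERE type = 'story' AND url IS NOT NULL
      AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 365 DAY)
    GROUP BY domain
    ORDER BY submissions DESC
    LIMIT 20
"""
for row in client.query(query).result():
    print(f"{row.domain}: {row.submissions} submissions, avg score {row.avg_score:.1f}")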
Frequently Asked Questions
Does Hacker News have rate limits?
The official Firebase API does not document strict rate limits, but it is good practice to limit yourself to a few requests per second. For heavy use, add small delays between requests and use concurrent connections responsibly. The API is free and does not require authentication.
Can I get historical Hacker News data?
Yes, through several methods: (1) The API gives access to all items ever posted (30M+) by incrementing item IDs, (2) Google BigQuery has the complete HN dataset updated regularly, (3) Third-party datasets are available on Kaggle and academic repositories.
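For the ID-incrementing approach, the API exposes the current largest item ID at /v0/maxitem.json; walking backwards from it (or forwards from a saved checkpoint) covers every item ever posted. A minimal sketch reusing the helpers defined earlier:
def iterate_recent_items(count=1000):
    """Yield the most recent items by walking backwards from the max item ID."""
    max_id = requests.get(f"{BASE_URL}/maxitem.json", timeout=10).json()
    for item_id in range(max_id, max_id - count, -1):
        item = fetch_item(item_id)
        if item:
            yield item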
Is it legal to scrape Hacker News?
Hacker News provides an official API specifically for programmatic access, making API-based collection explicitly allowed. HTML scraping is also generally acceptable as long as you respect reasonable rate limits. All HN content is publicly available [link to: is-web-scraping-legal].
How can I track trending topics on Hacker News?
Fetch the top stories periodically (every 15-30 minutes), analyze titles for keyword frequency, and track score trajectories over time. Compare against historical baselines to identify topics gaining unusual traction. The async API approach shown above makes this efficient.
Can I use Hacker News data for machine learning?
Yes, HN data is popular for NLP research, including sentiment analysis, topic modeling, text classification (story vs. comment vs. job posting), and engagement prediction (predicting which stories will reach the front page). The BigQuery dataset is the most practical source for ML training data.
last updated: March 12, 2026