Scraping job board data for talent intelligence in 2026
Scrape job boards and you build the foundation for one of the most commercially valuable analytical products: talent intelligence. Hiring trends are leading indicators for company growth, market entry, technology adoption, and competitive positioning. A consistent dataset of job postings across LinkedIn, Indeed, Glassdoor, Wellfound (formerly AngelList), and the dozens of niche boards lets you answer questions like which companies are scaling their engineering teams, which technology stacks are gaining adoption, and which competitors are expanding into new geographies. The scraping landscape is shaped by three things: aggressive bot defenses on LinkedIn specifically, an aggregation problem because the same job appears on multiple boards, and a normalization problem because job titles and skills are unstructured free text.
This guide focuses on the major U.S.-anchored job boards but the patterns transfer to European boards (StepStone, Welcome to the Jungle, Otta) and Asian boards (JobStreet, Naukri).
Source taxonomy and posting identifiers
The job board ecosystem has three distinct source types with different scraping characteristics.
Aggregator boards consolidate postings from many companies into a single browseable catalogue. LinkedIn Jobs, Indeed, Glassdoor, and ZipRecruiter are the dominant aggregators. They expose listing search APIs (mostly undocumented) and have aggressive bot defenses because their business model depends on the data being a moat.
Direct company career pages are the long-tail source. Most companies use one of a handful of ATS platforms (Greenhouse, Lever, Workday, Ashby, SmartRecruiters) and each ATS has a consistent URL structure. Scraping direct career pages is dramatically easier than scraping aggregators because the bot defenses are minimal.
Niche boards target specific verticals (Dice for engineering, Wellfound for startups, We Work Remotely for remote roles, Built In for city-specific tech hubs). These tend to have moderate defenses and rich structured data.
import httpx

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept": "text/html,application/json",
    "Accept-Language": "en-US,en;q=0.9",
}

async def scrape_greenhouse_company(slug: str, proxy: str) -> list[dict]:
    """Return the open postings for one Greenhouse board, or [] on failure."""
    # Greenhouse serves its public Job Board API from the boards-api host.
    url = f"https://boards-api.greenhouse.io/v1/boards/{slug}/jobs"
    async with httpx.AsyncClient(proxy=proxy, headers=HEADERS, timeout=20) as c:
        r = await c.get(url)
        if r.status_code == 200:
            return r.json().get("jobs", [])
        return []
Greenhouse and Lever both expose public API endpoints per company that return structured job postings. For companies on these platforms, the public API is dramatically more reliable than scraping aggregators. The Greenhouse pattern alone covers thousands of mid-market and large companies.
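Because the two platforms return differently shaped JSON, it helps to map both onto one record shape before anything downstream sees them. The sketch below is a minimal illustration, not either vendor's official schema: the field names it reads (`title`, `location.name`, `absolute_url` for Greenhouse; `text`, `categories.location`, `hostedUrl` for Lever) are assumptions to verify against a live payload, and the output keys are hypothetical.

```python
def normalize_greenhouse(job: dict, company: str) -> dict:
    """Map one Greenhouse job dict onto a shared record shape."""
    return {
        "company": company,
        "source": "greenhouse",
        "source_id": str(job.get("id", "")),
        "title": (job.get("title") or "").strip(),
        # Assumed shape: "location" is an object with a "name" key.
        "location": (job.get("location") or {}).get("name"),
        "url": job.get("absolute_url"),
    }


def normalize_lever(posting: dict, company: str) -> dict:
    """Map one Lever posting dict onto the same record shape."""
    return {
        "company": company,
        "source": "lever",
        "source_id": str(posting.get("id", "")),
        # Assumed shape: Lever carries the title in "text".
        "title": (posting.get("text") or "").strip(),
        "location": (posting.get("categories") or {}).get("location"),
        "url": posting.get("hostedUrl"),
    }
```

Using `.get` with defaults means a posting with a missing field yields `None` for that key rather than raising, so one malformed posting does not abort a whole company sweep.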
LinkedIn-specific considerations
LinkedIn has the strictest bot defenses in the talent intelligence ecosystem and is the only major source where scraping requires authentication for meaningful coverage. The hiQ Labs ruling clarified some of the legal status of public LinkedIn scraping, but LinkedIn’s terms of service still explicitly prohibit it and the company actively pursues commercial scrapers.
For ethical and risk-conscious operations, the practical pattern is to use the LinkedIn Talent Insights or LinkedIn Sales Navigator APIs (paid, contracted access) rather than unauthorized scraping. For analytical use cases that absolutely require LinkedIn data, residential or mobile IPs combined with hand-warmed accounts are the technical baseline, but the legal and account-risk picture is meaningfully worse than other sources.
Proxy strategy across the major boards
| Source | Recommended proxy | Tolerance per IP |
|---|---|---|
| Indeed | U.S. residential | 30 req/min per IP |
| Glassdoor | U.S. residential | 30 req/min per IP |
| LinkedIn | U.S. residential or mobile, with auth | 10 req/min per IP, hand-warmed account |
| Greenhouse / Lever | Datacenter | 100 req/min per IP |
| Wellfound | U.S. residential | 60 req/min per IP |
| ZipRecruiter | U.S. residential | 30 req/min per IP |
For workloads under 50,000 postings per day, a small U.S. residential pool covers everything except LinkedIn. For LinkedIn coverage, the proxy economics shift toward dedicated mobile inventory because the per-account session lifetime is short.
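One way to enforce the per-IP tolerances in the table is to space requests evenly for each (source, IP) pair. The following is a minimal sketch under that assumption: `PerIPThrottle` and the lowercase source keys are hypothetical names, and the budgets are copied from the table rather than measured.

```python
import time

# Requests per minute per IP, copied from the table above.
REQS_PER_MIN = {
    "indeed": 30, "glassdoor": 30, "linkedin": 10,
    "greenhouse": 100, "lever": 100, "wellfound": 60, "ziprecruiter": 30,
}


class PerIPThrottle:
    """Spaces requests evenly so each (source, ip) pair stays under budget."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.next_allowed = {}  # (source, ip) -> earliest time of next request

    def wait_time(self, source: str, ip: str) -> float:
        """Reserve the next slot and return how many seconds to sleep first."""
        interval = 60.0 / REQS_PER_MIN[source]
        now = self.clock()
        key = (source, ip)
        ready_at = max(now, self.next_allowed.get(key, now))
        self.next_allowed[key] = ready_at + interval
        return ready_at - now
```

A caller would `await asyncio.sleep(throttle.wait_time("indeed", ip))` before each request. The clock is injectable so the spacing logic can be tested without real sleeps.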
Job posting deduplication
The same job posting frequently appears on 5-10 different boards. The deduplication problem is harder than ecommerce SKU dedup because job titles are highly variable and the company-and-location tuple is the only stable signal across sources. The standard approach uses a three-pass funnel:
The first pass groups by exact match on company name plus job title plus location. This catches the easy duplicates where the recruiter posted the same text everywhere.
The second pass groups by company plus a normalized job title plus a 25-mile location radius. Normalization removes seniority adjectives (senior, junior, II, III), removes hyphens and parenthetical clarifications, and standardizes common synonyms (engineer/developer, manager/lead).
The third pass groups by company plus job description text similarity using a sentence-embedding model (typically a small open-source model from Sentence Transformers). Cosine similarity above 0.85 indicates likely duplicates.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def description_similarity(desc_a: str, desc_b: str) -> float:
    """Cosine similarity between the embeddings of two job descriptions."""
    emb_a = model.encode(desc_a)
    emb_b = model.encode(desc_b)
    return float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))
After dedup, attach the source set as an attribute on the canonical posting so you preserve “this same job appeared on Indeed, LinkedIn, and Glassdoor”. The cross-source presence is itself a useful analytical signal.
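The title normalization from the second pass and the source-set attachment can be sketched together. This is an illustration under simplifying assumptions: the word lists are deliberately tiny, location is matched exactly rather than within a 25-mile radius (which would need geocoding), and the input keys `company`, `title`, `location`, and `source` are hypothetical.

```python
import re
from collections import defaultdict

# Illustrative lists only; a real pipeline would maintain much larger ones.
SENIORITY = {"senior", "sr", "junior", "jr", "ii", "iii"}
SYNONYMS = {"developer": "engineer", "lead": "manager"}


def normalize_title(title: str) -> str:
    """Drop parentheticals, punctuation, and seniority words; unify synonyms."""
    text = re.sub(r"\([^)]*\)", " ", title.lower())  # parenthetical clarifications
    text = re.sub(r"[^a-z0-9+#]+", " ", text)        # hyphens and other punctuation
    words = [SYNONYMS.get(w, w) for w in text.split() if w not in SENIORITY]
    return " ".join(words)


def merge_duplicates(postings: list[dict]) -> list[dict]:
    """Group by (company, normalized title, location) and collect the sources."""
    groups = defaultdict(list)
    for p in postings:
        key = (
            p["company"].strip().lower(),
            normalize_title(p["title"]),
            p["location"].strip().lower(),
        )
        groups[key].append(p)
    merged = []
    for members in groups.values():
        canonical = dict(members[0])  # first sighting becomes the canonical row
        canonical["source_set"] = sorted({m["source"] for m in members})
        merged.append(canonical)
    return merged
```

Postings that survive this pass as separate groups are the candidates for the embedding comparison in the third pass.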
Schema for job posting snapshots
CREATE TABLE job_posting_snapshot (
    snapshot_at TIMESTAMP NOT NULL,
    canonical_id UUID NOT NULL,
    company_name TEXT NOT NULL,
    title TEXT NOT NULL,
    location TEXT,
    country VARCHAR(2),
    source_set TEXT[],
    employment_type VARCHAR(32),
    posted_at DATE,
    skills TEXT[],
    description_excerpt TEXT,
    PRIMARY KEY (snapshot_at, canonical_id)
);
For talent intelligence analytics, the most valuable derived signal is the time-series of postings per company per role family. Aggregating at company plus role family plus week reveals hiring acceleration before it shows up in headcount data publicly.
Detecting and routing around bot challenges
When LinkedIn, Indeed, and other job boards flag your traffic, the response is usually a Cloudflare or vendor interrogation page rather than a clean HTTP error. Your scraper needs to detect this content swap explicitly. Look for the cf-mitigated response header, the presence of __cf_chl_ cookies, or HTML containing “Just a moment...”.
def is_challenged(response) -> bool:
    """Return True if the response looks like a bot challenge, not real content."""
    if response.status_code in (403, 503):
        return True
    if "cf-mitigated" in response.headers:
        return True
    if "__cf_chl_" in response.headers.get("set-cookie", ""):
        return True
    # Challenge pages announce themselves near the top of the HTML.
    body = response.text[:2000].lower()
    return "just a moment" in body or "checking your browser" in body
When you detect a challenge, do not retry on the same IP for at least 30 minutes. Mark that IP as cooling and route subsequent requests to a different IP in your pool. Aggressive retries on a flagged IP cause the cooling window to extend and can lead to long-term blacklisting of your subnet. For pages that absolutely must be fetched, have a fallback path that uses a headless browser. Most production setups maintain a 95/5 split between the lightweight HTTP path and the browser fallback path.
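The cooling rule can be sketched as a small pool that remembers when each IP becomes usable again. This is a minimal sketch, assuming an in-memory pool and a hypothetical `CoolingPool` class; the 30-minute window is the minimum stated above, not a tuned value.

```python
import time

COOLDOWN_SECONDS = 30 * 60  # the 30-minute minimum from the text


class CoolingPool:
    """Hands out IPs while skipping any that were recently challenged."""

    def __init__(self, ips, clock=time.monotonic):
        self.ips = list(ips)
        self.clock = clock
        self.cooling_until = {}  # ip -> time at which it may be used again

    def mark_challenged(self, ip: str) -> None:
        """Start (or restart) the cooling window for a flagged IP."""
        self.cooling_until[ip] = self.clock() + COOLDOWN_SECONDS

    def pick(self):
        """Return the first IP that is not cooling, or None if all are cooling."""
        now = self.clock()
        for ip in self.ips:
            until = self.cooling_until.get(ip)
            if until is None or until <= now:
                return ip
        return None
```

When `pick()` returns `None`, the caller can either wait or hand the URL to the headless-browser fallback path.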
Operational monitoring and alerting
Every production scraper needs three monitoring layers regardless of vertical. The first is per-IP success rate over a 5-minute window, alerting if any IP drops below 80%. The second is parser error rate, alerting if more than 1% of fetched pages fail to extract the canonical fields. The third is data freshness, alerting if your downstream consumers see snapshots more than 24 hours old.
import time
from collections import deque

class IPHealthTracker:
    """Tracks per-IP success rate over a sliding time window."""

    def __init__(self, window_seconds: int = 300):
        self.window = window_seconds
        self.events = {}  # ip -> deque of (timestamp, success)

    def record(self, ip: str, success: bool):
        bucket = self.events.setdefault(ip, deque())
        now = time.time()
        bucket.append((now, success))
        # Drop events that have aged out of the window.
        while bucket and bucket[0][0] < now - self.window:
            bucket.popleft()

    def success_rate(self, ip: str) -> float:
        bucket = self.events.get(ip)
        if not bucket:
            return 1.0  # no data yet: treat the IP as healthy
        successes = sum(1 for _, ok in bucket if ok)
        return successes / len(bucket)
Wire this into Prometheus or your existing observability stack so the on-call engineer sees IP degradation as it happens rather than after the daily snapshot fails. For long-running operations, IP rotation triggered by the health tracker is more reliable than fixed rotation schedules.
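The tracker above covers only the first monitoring layer. The other two, parser error rate and data freshness, can be sketched as plain functions that return alert strings. The thresholds come from the text; the function names and the choice of `company_name` and `title` as the canonical fields are assumptions.

```python
from datetime import datetime, timedelta, timezone

# Assumed canonical fields: the NOT NULL text columns of the snapshot schema.
REQUIRED_FIELDS = ("company_name", "title")


def parser_error_rate(records: list[dict]) -> float:
    """Fraction of parsed pages missing at least one canonical field."""
    if not records:
        return 0.0
    failed = sum(1 for r in records if any(not r.get(f) for f in REQUIRED_FIELDS))
    return failed / len(records)


def monitoring_alerts(records, latest_snapshot_at, now=None) -> list[str]:
    """Return alert messages for the parser and freshness layers."""
    now = now or datetime.now(timezone.utc)
    alerts = []
    rate = parser_error_rate(records)
    if rate > 0.01:
        alerts.append(f"parser error rate {rate:.1%} exceeds 1%")
    if now - latest_snapshot_at > timedelta(hours=24):
        alerts.append("latest snapshot is more than 24 hours old")
    return alerts
```

Returning a list rather than raising keeps the function easy to wire into whichever alerting channel the observability stack already uses.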
Pipeline orchestration and scheduling
For any non-trivial talent intelligence scraping operation, a dedicated orchestration layer is the difference between a script you babysit and a service that runs unattended. The two strong open-source choices in 2026 are Prefect 3 and Dagster.
from prefect import flow, task

@task(retries=3, retry_delay_seconds=60)
def fetch_source(source_id: str, page: int):
    # crawl_one_page is your own fetch-and-parse function (not shown here).
    return crawl_one_page(source_id, page)

@flow(name="talent-intelligence-daily-sweep")
def daily_sweep(source_ids: list[str]):
    futures = []
    for sid in source_ids:
        for page in range(1, 30):  # pages 1 through 29
            futures.append(fetch_source.submit(sid, page))
    return [f.result() for f in futures]
Run the flow on a cadence aligned to how dynamic the underlying data is. Job postings mostly change on a daily rhythm, so a daily sweep is sufficient for most talent intelligence use cases. Move to a 4-6 hour cadence only for sources where postings appear and disappear intraday.
Data quality monitoring patterns
Beyond per-IP success rate, every snapshot should pass a small battery of data quality checks before being considered authoritative. Structural checks verify that every required field is present and of the expected type. Distributional checks compare the current snapshot against recent history. Semantic checks compare related fields for consistency.
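def quality_check(snapshot: list[dict]) -> list[str]:
    """Distributional check: flag snapshots far smaller than recent history."""
    errors = []
    if not snapshot:
        errors.append("empty snapshot")
        return errors
    # get_yesterday_avg_size is your own lookup against snapshot history (not shown).
    avg_yesterday = get_yesterday_avg_size()
    if len(snapshot) < avg_yesterday * 0.7:
        errors.append(f"snapshot size {len(snapshot)} is more than 30% below yesterday's average")
    return errors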
Run quality checks as a separate flow that gates promotion of the snapshot from staging to production. A snapshot that fails quality checks should be quarantined for human review.
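The function above is a distributional check. Structural and semantic checks operate per row, and a sketch of both follows. The field names mirror the snapshot schema, but the specific rules (a posting cannot be dated after its snapshot, country must be a two-letter code) are illustrative assumptions rather than a complete rule set.

```python
from datetime import date, datetime

# Assumed in-memory types for the required columns of the snapshot schema.
EXPECTED_TYPES = {"canonical_id": str, "company_name": str, "title": str}


def structural_errors(row: dict) -> list[str]:
    """Required fields must be present, non-empty, and of the expected type."""
    errors = []
    for field, expected in EXPECTED_TYPES.items():
        value = row.get(field)
        if not isinstance(value, expected) or not value:
            errors.append(f"{field}: missing or not a {expected.__name__}")
    return errors


def semantic_errors(row: dict) -> list[str]:
    """Related fields must agree with each other."""
    errors = []
    posted, snap = row.get("posted_at"), row.get("snapshot_at")
    if posted and snap and posted > snap.date():
        errors.append("posted_at is later than snapshot_at")
    country = row.get("country")
    if country and not (len(country) == 2 and country.isalpha()):
        errors.append("country is not a two-letter code")
    return errors
```

A gating flow could quarantine the snapshot when the share of rows with any error crosses a threshold, rather than failing on the first bad row.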
Cost optimization strategies
Proxy bandwidth is usually the dominant cost in a production scraping operation. Three optimization patterns consistently reduce cost without hurting data quality. The first is request deduplication: serve cached responses when consumers ask for the same record within the same hour. The second is conditional GET using ETag or If-Modified-Since headers. The third is selective field hydration when the upstream API supports it.
For workloads above 100 GB of monthly proxy bandwidth, these three optimizations together reduce cost by 40-60% without changing the analytical output.
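The first two patterns, request deduplication and conditional GET, can share one small cache. The sketch below is an in-memory illustration with a hypothetical `ResponseCache` class. It assumes the upstream server returns an `ETag` header and answers `304 Not Modified` to `If-None-Match`, which not every job board does.

```python
import time


class ResponseCache:
    """One-hour TTL cache plus ETag bookkeeping for conditional GETs."""

    def __init__(self, ttl_seconds: float = 3600, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}  # url -> (stored_at, etag, body)

    def fresh_body(self, url: str):
        """Return the cached body if it is younger than the TTL, else None."""
        entry = self.entries.get(url)
        if entry and self.clock() - entry[0] < self.ttl:
            return entry[2]
        return None

    def conditional_headers(self, url: str) -> dict:
        """Headers to send when revalidating a stale entry."""
        entry = self.entries.get(url)
        return {"If-None-Match": entry[1]} if entry and entry[1] else {}

    def handle_response(self, url: str, status: int, etag, body):
        """Store a new body, or reuse the old one on 304 Not Modified."""
        if status == 304 and url in self.entries:
            _, old_etag, old_body = self.entries[url]
            self.entries[url] = (self.clock(), old_etag, old_body)
            return old_body
        self.entries[url] = (self.clock(), etag, body)
        return body
```

The intended call order is: try `fresh_body`; on a miss, send the request with `conditional_headers`; then pass the status, `ETag`, and body to `handle_response`. A 304 costs only headers in proxy bandwidth, which is where the saving comes from.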
End-to-end pipeline architecture
A production-grade scraping pipeline has four layers: collection, parsing, storage, and serving. The collection layer handles the network conversation and knows nothing about data shape. The parsing layer transforms raw bytes into structured records and owns the schema. The storage layer holds the canonical snapshots in a query-optimized format like DuckDB or ClickHouse. The serving layer exposes the data to consumers and should be denormalized and pre-aggregated.
Decoupling these layers enables independent scaling. The collection layer is bound by proxy capacity. The parsing layer is CPU-bound. The storage layer is bound by I/O. The serving layer is bound by query concurrency.
Legal and compliance considerations
Public talent intelligence data is generally treated as fair to scrape in most jurisdictions, but always confine your collection to non-personal data: identifiers, structured attributes, and aggregates. Avoid collecting personally identifying details, and avoid pulling any data behind a login.
For commercial deployment, document your basis for processing, your data retention period, and your purpose limitation. These map onto the GDPR's lawful-basis, storage-limitation, and purpose-limitation principles, which remain a useful starting point for documenting your approach even outside the EU.
Sample analytics queries
-- Volume trend over the last 30 days
SELECT date_trunc('day', snapshot_at) AS day, COUNT(*) AS records
FROM job_posting_snapshot
WHERE snapshot_at > now() - interval '30 days'
GROUP BY 1 ORDER BY 1;

-- New postings first seen in the last 14 days
SELECT canonical_id, MIN(snapshot_at) AS first_seen
FROM job_posting_snapshot
GROUP BY canonical_id
HAVING MIN(snapshot_at) > now() - interval '14 days';

-- Source distribution (one row per source in each posting's source_set)
SELECT src AS source, COUNT(*) AS records
FROM job_posting_snapshot, unnest(source_set) AS src
WHERE snapshot_at > now() - interval '7 days'
GROUP BY src
ORDER BY records DESC;
Add a role-family share view, a source concentration view, and a salary-band volatility view (where salary data is available) and you have a solid foundation for a talent intelligence product.
Versioning your scraper for source evolution
Every talent intelligence source evolves its schema regularly. New fields appear, old fields are deprecated, and display logic changes. Stamp every snapshot row with the scraper version that produced it. Downstream analytics can filter by version when they need consistent semantics across a time range. Pair this with a small registry table that documents what each scraper version did differently so debugging unexpected metric jumps becomes tractable.
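Version stamping needs very little code. The sketch below assumes an in-memory registry and hypothetical version strings and change notes; in practice the registry would live in a table next to the snapshots.

```python
SCRAPER_VERSION = "2026.01.0"  # hypothetical version string

# Hypothetical registry: what each scraper version did differently.
VERSION_REGISTRY = {
    "2025.12.0": "initial schema",
    "2026.01.0": "location split into city and country",
}


def stamp(rows: list[dict], version: str = SCRAPER_VERSION) -> list[dict]:
    """Return copies of the rows tagged with the producing scraper version."""
    if version not in VERSION_REGISTRY:
        raise ValueError(f"unregistered scraper version: {version}")
    return [{**row, "scraper_version": version} for row in rows]


def rows_for_versions(rows: list[dict], versions: set[str]) -> list[dict]:
    """Keep only rows produced by versions with compatible semantics."""
    return [r for r in rows if r.get("scraper_version") in versions]
```

Refusing to stamp with an unregistered version forces the registry entry to be written at release time, which is when the change is easiest to describe.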
Building a hiring-velocity dashboard from the dataset
The most common analytical product on top of job board scraping is a hiring-velocity dashboard that tracks postings per company per role family per week. For analytical depth, layer in geography (city or country), seniority, and remote-friendly classification. The combination of these dimensions produces a 5-7 dimensional cube that supports most talent intelligence questions.
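import pandas as pd

def hiring_velocity(df: pd.DataFrame) -> pd.DataFrame:
    """Distinct postings and locations per company, role family, and week."""
    # Expects 'company', 'role_family', 'week', 'canonical_id', 'location' columns.
    return df.groupby(['company', 'role_family', 'week']).agg(
        new_postings=('canonical_id', 'nunique'),
        unique_locations=('location', 'nunique'),
    ).reset_index()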
The headline metric is week-over-week new-posting count per company. A company that posted 5 engineering roles in week 1 and 50 in week 4 is in active scaling mode. A company that posted 50 in week 1 and 5 in week 4 may be hitting a hiring freeze. Both signals are leading indicators of broader business state.
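The week-over-week comparison can be sketched in plain Python on top of the aggregated counts. This assumes weeks are consecutive integer indexes. The 2x and 0.5x thresholds in `classify` are illustrative, not values from the dataset.

```python
def week_over_week(counts: dict[tuple[str, int], int]) -> dict[tuple[str, int], float]:
    """Ratio of each week's new postings to the previous week's, per company."""
    ratios = {}
    for (company, week), current in counts.items():
        previous = counts.get((company, week - 1))
        if previous:  # skip missing or zero baselines
            ratios[(company, week)] = current / previous
    return ratios


def classify(ratio: float, up: float = 2.0, down: float = 0.5) -> str:
    """Label a week-over-week ratio; thresholds are assumptions to tune."""
    if ratio >= up:
        return "scaling"
    if ratio <= down:
        return "cooling"
    return "steady"
```

Single-week ratios are noisy for companies with only a handful of postings, so a production dashboard would likely smooth over several weeks before labeling.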
For sector-level analysis, aggregate at the SIC or NAICS classification level. Hiring velocity by sector reveals macro-economic shifts before they show up in employment statistics.
Skills taxonomy and demand tracking
After dedup, the next analytical step is normalizing skills. Job descriptions mention thousands of distinct skill phrases that map to a smaller canonical taxonomy. ESCO is the European reference taxonomy with roughly 13,000 skills. O*NET is the U.S. equivalent with similar coverage. For most practical applications, a custom taxonomy of 500-1,000 high-frequency skills is sufficient.
The pipeline is: extract skill phrases from job descriptions using a named entity recognition model, map each phrase to a canonical skill via fuzzy match against the taxonomy, then aggregate at the skill plus week plus geography level to produce demand trends.
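import spacy

nlp = spacy.load("en_core_web_sm")
# The stock model has no SKILL label, so register one. An EntityRuler with a few
# illustrative patterns is shown; a fine-tuned NER model is the production route.
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([{"label": "SKILL", "pattern": [{"LOWER": s}]} for s in ("python", "kubernetes", "sql")])

def extract_skill_phrases(description: str) -> list[str]:
    doc = nlp(description)
    return [ent.text for ent in doc.ents if ent.label_ == "SKILL"]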
Demand trends per skill are valuable to a wide range of consumers: technology vendors tracking adoption of their stack, training companies positioning their curriculum, and recruiters pricing their candidates.
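The mapping and aggregation steps of the pipeline can be sketched with the standard library. The five-entry taxonomy is a hypothetical stand-in for a real 500-1,000 skill list, and `difflib` string similarity is one simple choice of fuzzy matcher, not the only reasonable one.

```python
import difflib
from collections import Counter

# Hypothetical miniature taxonomy; a real one would have hundreds of entries.
TAXONOMY = ["python", "kubernetes", "postgresql", "react", "terraform"]


def canonical_skill(phrase: str, cutoff: float = 0.8):
    """Map a raw phrase to its closest taxonomy entry, or None if nothing is close."""
    matches = difflib.get_close_matches(
        phrase.strip().lower(), TAXONOMY, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


def skill_demand(rows) -> Counter:
    """Count mentions per (skill, week, geography) from (week, geo, phrase) rows."""
    counts = Counter()
    for week, geo, phrase in rows:
        skill = canonical_skill(phrase)
        if skill is not None:
            counts[(skill, week, geo)] += 1
    return counts
```

Character-level matching handles misspellings and truncations but not aliases such as "k8s" for Kubernetes, which need an explicit alias table.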
Compensation intelligence considerations
Salary disclosure rules are evolving rapidly. Several U.S. states require explicit salary range disclosure on postings, and EU regulation is moving the same direction with the Pay Transparency Directive. For postings with explicit salary, capture the range and the currency. For postings without, third-party estimates from Glassdoor, Levels.fyi, or company-supplied benchmarks can be joined as a derived signal.
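Capturing the range and currency from posting text can be sketched with a regular expression. The pattern below is a minimal illustration that covers only symbol-prefixed ranges such as "$120,000 - $150,000" or "€90k to €110k". Hourly rates, currency codes, and single-figure salaries are out of scope, and the symbol-to-currency mapping is an assumption (a dollar sign is not always USD).

```python
import re

CURRENCY_SYMBOLS = {"$": "USD", "€": "EUR", "£": "GBP"}  # assumed mapping

SALARY_RANGE = re.compile(
    r"(?P<cur>[$€£])\s*(?P<low>\d[\d,]*)(?P<low_k>k)?"
    r"\s*(?:-|\u2013|to)\s*"
    r"[$€£]?\s*(?P<high>\d[\d,]*)(?P<high_k>k)?",
    re.IGNORECASE,
)


def parse_salary_range(text: str):
    """Return {'currency', 'min', 'max'} for the first range found, else None."""
    m = SALARY_RANGE.search(text)
    if not m:
        return None

    def amount(digits: str, k_suffix) -> int:
        value = int(digits.replace(",", ""))
        return value * 1000 if k_suffix else value

    low = amount(m["low"], m["low_k"])
    high = amount(m["high"], m["high_k"])
    if low > high:
        return None  # probably not a salary range after all
    return {"currency": CURRENCY_SYMBOLS[m["cur"]], "min": low, "max": high}
```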
Compensation data is the most personal-data-adjacent slice of talent intelligence. Even though postings themselves are public, the inferences drawn about specific companies and specific roles can be commercially sensitive. Document your basis for processing and your use limitations clearly.
Common pitfalls when scraping job boards
Three failure patterns are nearly universal across LinkedIn, Indeed, Glassdoor, and the specialized boards. The first is duplicate-posting inflation. The same role often appears on 3-7 boards with different posting IDs. Recruiter tools cross-post automatically. A scraper that counts postings rather than unique roles overstates demand by 2-4x in hot verticals. Use a hash of (employer_id, normalized_title, location) as the canonical role key and treat per-board postings as children.
The second is reposting-bias in time-to-fill metrics. Many ATS systems repost an unfilled role every 30 days to keep it ranked. A scraper that treats each repost as a new posting undercounts the true days-to-fill. Detect reposts by tracking the same role-key over time and merge the gap windows.
The third is salary-band inference noise. Where stated salary ranges are absent (most US postings outside California, Colorado, Washington, NY), inferred bands from third-party estimators (Levels.fyi, Glassdoor) carry wide error bars. Treat inferred bands as one noisy signal among several, not as ground truth, when reporting compensation trends.
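The repost merging described in the second pitfall is an interval-merge problem. The sketch below assumes each posting ID for a role key has a first-seen and last-seen date. The 7-day gap tolerance is a hypothetical default, since the right value depends on how often you snapshot.

```python
from datetime import date, timedelta


def merge_reposts(sightings: list[tuple[date, date]], max_gap_days: int = 7):
    """Merge (first_seen, last_seen) windows of one role key across small gaps."""
    merged = []
    for start, end in sorted(sightings):
        if merged and start - merged[-1][1] <= timedelta(days=max_gap_days):
            # Close enough to the previous window: treat it as a repost.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def days_open(sightings: list[tuple[date, date]], max_gap_days: int = 7) -> list[int]:
    """Days each merged opening stayed live, as a lower bound on days-to-fill."""
    return [(end - start).days for start, end in merge_reposts(sightings, max_gap_days)]
```

Two 30-day postings of the same role separated by a one-day gap then count as one opening of roughly 60 days instead of two openings of 30.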
FAQ
Is scraping LinkedIn legal after the hiQ ruling?
In hiQ Labs v. LinkedIn, the Ninth Circuit held that scraping publicly accessible LinkedIn data likely does not violate the Computer Fraud and Abuse Act. The district court later found that hiQ had breached LinkedIn’s User Agreement, and LinkedIn still actively enforces its terms of service through contract claims and technical countermeasures. For commercial use, the safer pattern is licensed access through LinkedIn’s official APIs rather than unauthorized scraping.
How do I extract structured skills from unstructured job descriptions?
The standard approach uses a named entity recognition model trained on a skills taxonomy like ESCO (the European framework) or O*NET (the U.S. equivalent). Open-source NER models from spaCy and Hugging Face handle this well after fine-tuning on a small annotated corpus.
Can I track salary information from job postings?
Salary disclosure varies by jurisdiction. Several U.S. states (California, New York, Colorado, Washington) require salary range disclosure, and EU regulation is moving in the same direction. For postings without explicit salary, third-party estimates from Glassdoor or Levels.fyi can be joined as a derived signal.
How fresh is the data on aggregator sites?
Indeed and Glassdoor refresh their indexes hourly. LinkedIn refreshes faster for the highlighted postings but slower for the long tail. Direct ATS APIs are real-time. For talent intelligence use cases, daily snapshots are sufficient because hiring decisions move on weekly or monthly cycles.
What about international job boards?
The patterns transfer with minor adjustments. JobStreet (Southeast Asia), Naukri (India), and StepStone (Europe) all expose similar API surfaces with similar bot defenses. Plan for per-region proxy sourcing because country-specific IPs improve success rates significantly.
Does LinkedIn block scrapers more aggressively in 2026?
Yes. The post-hiQ enforcement layer escalated through 2024-2025. Public-profile scraping remains legally defensible; production volume requires careful rate management and rotating fingerprints.
How do I separate hiring-manager intent from agency reposting?
Cluster postings by employer + role-key and discount postings where the contact is a known staffing agency. The signal-to-noise ratio improves substantially after this filter.
Which job boards are most useful for executive-search intelligence vs volume hiring?
LinkedIn dominates executive and director-level postings, especially when combined with profile-change signals. Indeed and ZipRecruiter dominate volume hiring. Specialized boards (Wellfound for startups, Dice for tech infrastructure, Built In for regional tech) carry signal that the generalists miss. A balanced talent-intelligence stack pulls from at least one generalist and two specialist sources per vertical of interest.