How to Scrape Job Listings at Scale with Rotating Proxies
Job market data is one of the most valuable datasets for recruitment agencies, HR tech companies, and market researchers. Understanding which companies are hiring, what skills are in demand, and how compensation trends are shifting requires collecting data from job boards at scale. This guide covers the technical approach to scraping major job platforms using rotating proxies, structuring the data, and handling the anti-bot protections these sites employ.
Why Scrape Job Listings
Before diving into the technical implementation, here are the primary use cases:
- Talent intelligence: Track hiring trends across industries and competitors.
- Salary benchmarking: Collect compensation data from job postings to inform salary decisions.
- Market research: Understand which technologies and skills are growing in demand.
- Lead generation: Identify companies that are actively hiring as sales prospects.
- Academic research: Analyze labor market dynamics at scale.
- Job aggregation: Build job search engines that pull from multiple sources.
Job Board Landscape and Their Protections
Each job board has different levels of anti-scraping protection. Understanding these helps you plan your approach.
Indeed
Indeed is one of the largest job boards globally. Their protections include:
- Rate limiting: Aggressive throttling kicks in after a few dozen rapid requests (see the backoff sketch after this list).
- CAPTCHA: Cloudflare-based challenges for suspicious traffic.
- IP blocking: Quick to block datacenter IPs.
- JavaScript rendering: Some content loads dynamically.
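In practice, this throttling means any Indeed scraper needs retry logic. Below is one illustrative pattern, not Indeed-specific code: on a 403 or 429 response it backs off exponentially with jitter and retries through a fresh proxy. The get_session argument is a placeholder for whatever proxy-rotating session factory you use, such as the ProxyManager built later in this guide.
import random
import time
import requests
def fetch_with_backoff(get_session, url, max_retries=4):
    """Fetch a URL, backing off exponentially and rotating proxies when throttled."""
    for attempt in range(max_retries):
        session = get_session()  # placeholder: any factory returning a proxied requests.Session
        try:
            response = session.get(url, timeout=30)
            if response.status_code == 200:
                return response
            if response.status_code in (403, 429):
                # Throttled or blocked: wait 1s, 2s, 4s, ... plus jitter, then retry on a new proxy
                time.sleep(2 ** attempt + random.uniform(0, 1))
                continue
            return None  # other status codes: give up on this URL
        except requests.RequestException:
            time.sleep(2 ** attempt)
    return None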
LinkedIn
LinkedIn has the most aggressive anti-scraping measures:
- Login walls: Most job data requires authentication.
- Rate limiting: Very strict, even for logged-in users.
- Legal action: LinkedIn has sued scraping companies (though the legality of scraping public data remains contested after the hiQ Labs v. LinkedIn ruling).
- Browser fingerprinting: Advanced detection of automated browsers.
Glassdoor
Glassdoor sits in the middle:
- Cloudflare protection: Standard bot detection.
- Rate limiting: Moderate.
- Salary and review data: These pages are more heavily protected and require more careful scraping.
Regional Job Boards (SEA)
Southeast Asian job boards like JobStreet, Kalibrr, and JobsDB have varying protections:
- Generally less aggressive than US-based platforms.
- May require proxies from the specific country to access local listings (see the sketch after this list).
- DataResearchTools mobile proxies with SEA exit IPs are particularly useful for these platforms.
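Country targeting is usually expressed in the proxy credentials or the gateway hostname. The exact syntax varies by provider, so the endpoints and username flags below are illustrative placeholders rather than real DataResearchTools parameters:
# Hypothetical country-targeted endpoints -- check your provider's
# documentation for the real syntax; these values are placeholders.
SEA_PROXIES = {
    'sg': 'http://user-country-sg:pass@gate.dataresearchtools.com:5432',
    'th': 'http://user-country-th:pass@gate.dataresearchtools.com:5432',
    'ph': 'http://user-country-ph:pass@gate.dataresearchtools.com:5432',
}
def proxies_for(country_code):
    """Build a requests-style proxy dict with an exit IP in the given country."""
    endpoint = SEA_PROXIES[country_code]
    return {'http': endpoint, 'https': endpoint}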
Setting Up Your Scraping Infrastructure
Proxy Configuration
For job scraping, rotating mobile proxies provide the best success rate. Here is a basic setup using Python:
import requests
from itertools import cycle
import time
import random
class ProxyManager:
def __init__(self, proxy_endpoints):
self.proxy_pool = cycle(proxy_endpoints)
self.current_proxy = None
def get_proxy(self):
self.current_proxy = next(self.proxy_pool)
return {
'http': self.current_proxy,
'https': self.current_proxy,
}
def get_session(self):
session = requests.Session()
proxy = self.get_proxy()
session.proxies.update(proxy)
session.headers.update({
'User-Agent': self._random_user_agent(),
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9',
'Accept-Language': 'en-US,en;q=0.9',
'Accept-Encoding': 'gzip, deflate, br',
})
return session
def _random_user_agent(self):
agents = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15',
]
return random.choice(agents)
# Initialize with DataResearchTools proxy endpoints
proxy_manager = ProxyManager([
'http://user-s1:pass@gate.dataresearchtools.com:5432',
'http://user-s2:pass@gate.dataresearchtools.com:5432',
'http://user-s3:pass@gate.dataresearchtools.com:5432',
])
Scraping Indeed Job Listings
import json
import random
import time
import requests
from bs4 import BeautifulSoup
class IndeedScraper:
def __init__(self, proxy_manager):
self.proxy_manager = proxy_manager
self.base_url = "https://www.indeed.com"
def search_jobs(self, query, location, num_pages=5):
all_jobs = []
for page in range(num_pages):
start = page * 10
url = f"{self.base_url}/jobs?q={query}&l={location}&start={start}"
session = self.proxy_manager.get_session()
try:
response = session.get(url, timeout=30)
if response.status_code == 200:
jobs = self._parse_search_results(response.text)
all_jobs.extend(jobs)
print(f"Page {page + 1}: Found {len(jobs)} jobs")
elif response.status_code == 403:
print(f"Page {page + 1}: Blocked (403). Switching proxy...")
time.sleep(5)
continue
else:
print(f"Page {page + 1}: Status {response.status_code}")
except requests.RequestException as e:
print(f"Page {page + 1}: Error - {e}")
# Random delay between pages
time.sleep(random.uniform(2, 5))
return all_jobs
def _parse_search_results(self, html):
soup = BeautifulSoup(html, 'html.parser')
jobs = []
for card in soup.select('div.job_seen_beacon'):
job = {}
title_elem = card.select_one('h2.jobTitle a')
if title_elem:
job['title'] = title_elem.get_text(strip=True)
job['url'] = self.base_url + title_elem.get('href', '')
company_elem = card.select_one('[data-testid="company-name"]')
if company_elem:
job['company'] = company_elem.get_text(strip=True)
location_elem = card.select_one('[data-testid="text-location"]')
if location_elem:
job['location'] = location_elem.get_text(strip=True)
salary_elem = card.select_one('.salary-snippet-container')
if salary_elem:
job['salary'] = salary_elem.get_text(strip=True)
snippet_elem = card.select_one('.job-snippet')
if snippet_elem:
job['description_snippet'] = snippet_elem.get_text(strip=True)
if job.get('title'):
jobs.append(job)
return jobs
def get_job_details(self, job_url):
session = self.proxy_manager.get_session()
try:
response = session.get(job_url, timeout=30)
if response.status_code == 200:
return self._parse_job_detail(response.text)
except requests.RequestException as e:
print(f"Error fetching job detail: {e}")
return None
def _parse_job_detail(self, html):
soup = BeautifulSoup(html, 'html.parser')
detail = {}
desc_elem = soup.select_one('#jobDescriptionText')
if desc_elem:
detail['full_description'] = desc_elem.get_text(strip=True)
# Extract structured data if available
script_tags = soup.select('script[type="application/ld+json"]')
for script in script_tags:
try:
data = json.loads(script.string)
                if isinstance(data, dict) and data.get('@type') == 'JobPosting':
detail['structured_data'] = data
except (json.JSONDecodeError, TypeError):
pass
return detail
# Usage
scraper = IndeedScraper(proxy_manager)
jobs = scraper.search_jobs("software engineer", "Singapore", num_pages=10)
print(f"Total jobs found: {len(jobs)}")Scraping with Playwright for JavaScript-Heavy Sites
Some job boards require JavaScript rendering. Use Playwright with proxies:
from playwright.sync_api import sync_playwright
import json
def scrape_with_playwright(url, proxy_config):
with sync_playwright() as p:
browser = p.chromium.launch(
headless=True,
proxy={
'server': f'http://{proxy_config["host"]}:{proxy_config["port"]}',
'username': proxy_config['username'],
'password': proxy_config['password'],
}
)
context = browser.new_context(
viewport={'width': 1920, 'height': 1080},
user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
)
page = context.new_page()
try:
page.goto(url, wait_until='networkidle', timeout=60000)
# Wait for job listings to load
page.wait_for_selector('.job-card', timeout=10000)
# Extract job data
jobs = page.evaluate('''() => {
const cards = document.querySelectorAll('.job-card');
return Array.from(cards).map(card => ({
title: card.querySelector('.job-title')?.textContent?.trim(),
company: card.querySelector('.company-name')?.textContent?.trim(),
location: card.querySelector('.job-location')?.textContent?.trim(),
}));
}''')
return jobs
finally:
browser.close()
proxy_config = {
'host': 'gate.dataresearchtools.com',
'port': 5432,
'username': 'your_username',
'password': 'your_password',
}
jobs = scrape_with_playwright('https://example-job-board.com/jobs', proxy_config)
Data Structuring and Storage
Standard Job Data Schema
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import Optional, List
@dataclass
class JobListing:
title: str
company: str
location: str
url: str
source: str # "indeed", "glassdoor", etc.
scraped_at: str
salary_min: Optional[float] = None
salary_max: Optional[float] = None
salary_currency: Optional[str] = None
employment_type: Optional[str] = None # full-time, part-time, contract
experience_level: Optional[str] = None # entry, mid, senior
description: Optional[str] = None
skills: Optional[List[str]] = None
posted_date: Optional[str] = None
def to_dict(self):
        return asdict(self)
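The schema stores salary_min and salary_max separately, but job boards publish salaries as free text such as "$4,000 - $6,000 a month". Before populating those fields you need a normalizer. Here is a deliberately simplified sketch that only extracts the numeric range and ignores currency and pay-period conversion:
import re
def parse_salary(text):
    """Extract (salary_min, salary_max) from a free-text salary string.
    Handles simple patterns like "$4,000 - $6,000 a month"; returns
    (None, None) when no number is present. Currency and pay-period
    normalization are intentionally left out of this sketch.
    """
    numbers = [float(n.replace(',', '')) for n in re.findall(r'\d[\d,]*(?:\.\d+)?', text)]
    if not numbers:
        return None, None
    if len(numbers) == 1:
        return numbers[0], numbers[0]
    return min(numbers), max(numbers)
Saving to CSV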
import csv
from dataclasses import asdict
def save_jobs_csv(jobs, filename):
if not jobs:
return
fieldnames = jobs[0].keys() if isinstance(jobs[0], dict) else list(asdict(jobs[0]).keys())
with open(filename, 'w', newline='', encoding='utf-8') as f:
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
for job in jobs:
row = job if isinstance(job, dict) else asdict(job)
writer.writerow(row)
print(f"Saved {len(jobs)} jobs to {filename}")Saving to SQLite
import sqlite3
from datetime import datetime
def save_jobs_db(jobs, db_path='jobs.db'):
conn = sqlite3.connect(db_path)
cursor = conn.cursor()
cursor.execute('''
CREATE TABLE IF NOT EXISTS job_listings (
id INTEGER PRIMARY KEY AUTOINCREMENT,
title TEXT NOT NULL,
company TEXT,
location TEXT,
url TEXT UNIQUE,
source TEXT,
salary_min REAL,
salary_max REAL,
salary_currency TEXT,
employment_type TEXT,
description TEXT,
posted_date TEXT,
scraped_at TEXT DEFAULT CURRENT_TIMESTAMP
)
''')
for job in jobs:
try:
cursor.execute('''
INSERT OR IGNORE INTO job_listings
(title, company, location, url, source, salary_min, salary_max, description, scraped_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
''', (
job.get('title'), job.get('company'), job.get('location'),
job.get('url'), job.get('source'), job.get('salary_min'),
job.get('salary_max'), job.get('description'),
datetime.now().isoformat()
))
except sqlite3.IntegrityError:
pass # Duplicate URL, skip
conn.commit()
    conn.close()
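With listings in SQLite, plain SQL already answers basic market questions. As an illustration, this helper (not part of the pipeline above) ranks companies by number of scraped listings:
def top_hiring_companies(db_path='jobs.db', limit=10):
    """Return (company, openings) rows ordered by listing count."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute('''
        SELECT company, COUNT(*) AS openings
        FROM job_listings
        WHERE company IS NOT NULL
        GROUP BY company
        ORDER BY openings DESC
        LIMIT ?
    ''', (limit,)).fetchall()
    conn.close()
    return rows
Scaling Strategies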
Concurrent Scraping
Use asyncio for concurrent job scraping:
import asyncio
import random
import aiohttp
from itertools import cycle
async def fetch_job_page(session, url, proxy):
try:
async with session.get(url, proxy=proxy, timeout=aiohttp.ClientTimeout(total=30)) as response:
if response.status == 200:
return await response.text()
return None
except Exception as e:
print(f"Error: {e}")
return None
async def scrape_concurrent(urls, proxies, max_concurrent=5):
semaphore = asyncio.Semaphore(max_concurrent)
proxy_cycle = cycle(proxies)
async def bounded_fetch(url):
async with semaphore:
proxy = next(proxy_cycle)
async with aiohttp.ClientSession() as session:
result = await fetch_job_page(session, url, proxy)
await asyncio.sleep(random.uniform(1, 3))
return (url, result)
tasks = [bounded_fetch(url) for url in urls]
    return await asyncio.gather(*tasks)
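Because scrape_concurrent is a coroutine, it needs an event loop to run. A minimal driver might look like this; the URL list and proxy endpoints are illustrative placeholders:
# Illustrative driver -- swap in your real URLs and proxy endpoints
urls = [
    f"https://www.indeed.com/jobs?q=software+engineer&l=Singapore&start={p * 10}"
    for p in range(10)
]
proxies = [
    'http://user-s1:pass@gate.dataresearchtools.com:5432',
    'http://user-s2:pass@gate.dataresearchtools.com:5432',
]
results = asyncio.run(scrape_concurrent(urls, proxies, max_concurrent=5))
pages = [html for url, html in results if html]
print(f"Fetched {len(pages)} of {len(urls)} pages")
Scheduling Regular Scrapes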
For ongoing job market monitoring, schedule scrapes using cron or a task scheduler:
# scrape_jobs.py - long-running scheduler process (alternatively, drop the loop and trigger via cron)
import schedule
import time
def daily_scrape():
scraper = IndeedScraper(proxy_manager)
queries = [
("software engineer", "Singapore"),
("data scientist", "Bangkok"),
("product manager", "Manila"),
("devops engineer", "Jakarta"),
]
all_jobs = []
for query, location in queries:
jobs = scraper.search_jobs(query, location, num_pages=5)
all_jobs.extend(jobs)
time.sleep(10) # Pause between queries
save_jobs_db(all_jobs)
print(f"Daily scrape complete: {len(all_jobs)} jobs collected")
schedule.every().day.at("02:00").do(daily_scrape)
while True:
schedule.run_pending()
    time.sleep(60)
Deduplication
Job listings appear on multiple boards and persist for days or weeks. Deduplicate effectively:
import hashlib
def generate_job_id(job):
"""Create a unique identifier based on job content."""
key = f"{job.get('title', '')}-{job.get('company', '')}-{job.get('location', '')}".lower()
return hashlib.md5(key.encode()).hexdigest()
def deduplicate_jobs(jobs):
seen = set()
unique_jobs = []
for job in jobs:
job_id = generate_job_id(job)
if job_id not in seen:
seen.add(job_id)
job['dedup_id'] = job_id
unique_jobs.append(job)
print(f"Deduplicated: {len(jobs)} -> {len(unique_jobs)} unique jobs")
    return unique_jobs
Why Mobile Proxies Matter for Job Scraping
Job boards are among the most aggressively protected websites. Here is why DataResearchTools mobile proxies are the right choice:
- High success rate: Mobile carrier IPs from Singapore, Thailand, and the Philippines are rarely flagged by job boards.
- Geo-specific listings: Access job boards that show different results based on your location. A Singapore mobile IP shows you Singapore-specific listings with local salary data.
- Sustainable scraping: Mobile IPs rotate naturally, reducing the risk of permanent bans.
- Fewer CAPTCHAs: Job boards rarely present CAPTCHAs to mobile IPs, unlike datacenter IPs, which trigger them frequently.
Ethical and Legal Considerations
Job scraping operates in a legal gray area. Key points to consider:
- Respect robots.txt: Check the site’s robots.txt file (see the sketch after this list). While not legally binding in all jurisdictions, following it demonstrates good faith.
- Rate limiting: Do not overwhelm servers. Add delays between requests.
- Public data only: Scrape publicly available job listings, not behind login walls (unless you have legitimate credentials).
- Data usage: Be transparent about how you use collected data.
- GDPR and privacy: Job listings may contain personal data. Ensure compliance with relevant data protection regulations.
- Terms of service: Review and understand each platform’s ToS before scraping.
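For the robots.txt check, Python's standard library is enough. This minimal sketch fetches robots.txt directly; in production you would route the request through your proxies and cache the parsed result per domain:
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser
def is_allowed(url, user_agent='*'):
    """Return True if the site's robots.txt permits fetching this URL."""
    parsed = urlparse(url)
    robots = RobotFileParser(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    robots.read()  # downloads and parses robots.txt
    return robots.can_fetch(user_agent, url)
# Example: skip a URL the site asks crawlers to avoid
# if is_allowed("https://www.indeed.com/jobs?q=engineer"):
#     ...proceed with the request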
Conclusion
Scraping job listings at scale requires a combination of robust proxy infrastructure, careful parsing logic, and respectful scraping practices. Rotating mobile proxies from DataResearchTools provide the reliability needed for sustained data collection from job boards that actively block datacenter IPs. Whether you are building a talent intelligence platform, conducting salary research, or aggregating listings from Southeast Asian job markets, the techniques in this guide give you a solid foundation for production-grade job scraping.
Related Reading
- Proxies for HR Tech: Salary Benchmarking & Talent Intelligence
- How to Build an Automated Lead Scraping Pipeline with Proxies
- Building a B2B Contact Enrichment Pipeline with Mobile Proxies
- aiohttp + BeautifulSoup: Async Python Scraping
- How to Scrape AliExpress Product Data Without Getting Blocked
- Amazon Buy Box Monitoring: Proxy Setup for Continuous Tracking