How to Scrape Job Listings at Scale with Rotating Proxies

Job market data is one of the most valuable datasets for recruitment agencies, HR tech companies, and market researchers. Understanding which companies are hiring, what skills are in demand, and how compensation trends are shifting requires collecting data from job boards at scale. This guide covers the technical approach to scraping major job platforms using rotating proxies, structuring the data, and handling the anti-bot protections these sites employ.

Why Scrape Job Listings

Before diving into the technical implementation, here are the primary use cases:

  • Talent intelligence: Track hiring trends across industries and competitors.
  • Salary benchmarking: Collect compensation data from job postings to inform salary decisions.
  • Market research: Understand which technologies and skills are growing in demand.
  • Lead generation: Identify companies that are actively hiring as sales prospects.
  • Academic research: Analyze labor market dynamics at scale.
  • Job aggregation: Build job search engines that pull from multiple sources.

Job Board Landscape and Their Protections

Each job board has different levels of anti-scraping protection. Understanding these helps you plan your approach.

Indeed

Indeed is one of the largest job boards globally. Their protections include:

  • Rate limiting: Aggressive throttling after a few dozen requests (a retry-with-backoff sketch follows this list).
  • CAPTCHA: Cloudflare-based challenges for suspicious traffic.
  • IP blocking: Quick to block datacenter IPs.
  • JavaScript rendering: Some content loads dynamically.
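
The rate limiting in particular rewards patience: a throttled or blocked request is usually recoverable if you back off and try again through a different IP. Here is a minimal retry helper with exponential backoff, written as a sketch against the ProxyManager defined later in this guide (pass its get_session method as the factory, e.g. fetch_with_backoff(proxy_manager.get_session, url)):

import random
import time

import requests

def fetch_with_backoff(get_session, url, max_retries=4):
    """Retry with exponential backoff. get_session() should return a fresh
    requests.Session already routed through a new proxy (see ProxyManager below)."""
    for attempt in range(max_retries):
        session = get_session()
        try:
            response = session.get(url, timeout=30)
            if response.status_code == 200:
                return response
            if response.status_code in (403, 429):
                # Blocked or throttled: wait 2^attempt seconds plus jitter,
                # then retry through the next proxy in the pool
                time.sleep(2 ** attempt + random.uniform(0, 1))
        except requests.RequestException:
            time.sleep(2 ** attempt)
    return None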

LinkedIn

LinkedIn has the most aggressive anti-scraping measures:

  • Login walls: Most job data requires authentication.
  • Rate limiting: Very strict, even for logged-in users.
  • Legal action: LinkedIn has sued scraping companies (though the legality of scraping public data remains contested after the hiQ Labs v. LinkedIn ruling).
  • Browser fingerprinting: Advanced detection of automated browsers.

Glassdoor

Glassdoor sits between these two extremes:

  • Cloudflare protection: Standard bot detection.
  • Rate limiting: Moderate.
  • Review data: Salary and review data requires more careful scraping.

Regional Job Boards (SEA)

Southeast Asian job boards like JobStreet, Kalibrr, and JobsDB have varying protections:

  • Generally less aggressive than US-based platforms.
  • May require proxies from the specific country to access local listings (a configuration sketch follows this list).
  • DataResearchTools mobile proxies with SEA exit IPs are particularly useful for these platforms.
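
Country targeting is often just a different endpoint or username parameter. The sketch below is illustrative only: the "-country-xx" username suffix is an assumed convention, not a documented parameter, so check DataResearchTools' documentation for the real naming:

# Hypothetical country-targeted endpoints; the "-country-xx" username
# suffix is an assumed convention, not a documented parameter
SEA_PROXIES = {
    'sg': 'http://user-country-sg:pass@gate.dataresearchtools.com:5432',
    'th': 'http://user-country-th:pass@gate.dataresearchtools.com:5432',
    'ph': 'http://user-country-ph:pass@gate.dataresearchtools.com:5432',
}

def proxies_for(country_code):
    """Build a requests-style proxies dict for a given SEA country."""
    endpoint = SEA_PROXIES[country_code]
    return {'http': endpoint, 'https': endpoint}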

Setting Up Your Scraping Infrastructure

Proxy Configuration

For job scraping, rotating mobile proxies provide the best success rate. Here is a basic setup using Python:

import requests
from itertools import cycle
import time
import random

class ProxyManager:
    def __init__(self, proxy_endpoints):
        self.proxy_pool = cycle(proxy_endpoints)
        self.current_proxy = None

    def get_proxy(self):
        self.current_proxy = next(self.proxy_pool)
        return {
            'http': self.current_proxy,
            'https': self.current_proxy,
        }

    def get_session(self):
        session = requests.Session()
        proxy = self.get_proxy()
        session.proxies.update(proxy)
        session.headers.update({
            'User-Agent': self._random_user_agent(),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9',
            'Accept-Language': 'en-US,en;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br',
        })
        return session

    def _random_user_agent(self):
        agents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15',
        ]
        return random.choice(agents)

# Initialize with DataResearchTools proxy endpoints
proxy_manager = ProxyManager([
    'http://user-s1:pass@gate.dataresearchtools.com:5432',
    'http://user-s2:pass@gate.dataresearchtools.com:5432',
    'http://user-s3:pass@gate.dataresearchtools.com:5432',
])

Scraping Indeed Job Listings

import json
import random
import time
from urllib.parse import quote_plus

import requests
from bs4 import BeautifulSoup

class IndeedScraper:
    def __init__(self, proxy_manager):
        self.proxy_manager = proxy_manager
        self.base_url = "https://www.indeed.com"

    def search_jobs(self, query, location, num_pages=5):
        all_jobs = []

        for page in range(num_pages):
            start = page * 10
            url = f"{self.base_url}/jobs?q={query}&l={location}&start={start}"

            session = self.proxy_manager.get_session()

            try:
                response = session.get(url, timeout=30)

                if response.status_code == 200:
                    jobs = self._parse_search_results(response.text)
                    all_jobs.extend(jobs)
                    print(f"Page {page + 1}: Found {len(jobs)} jobs")
                elif response.status_code == 403:
                    print(f"Page {page + 1}: Blocked (403). Backing off; the next request uses a fresh proxy...")
                    time.sleep(5)
                    continue
                else:
                    print(f"Page {page + 1}: Status {response.status_code}")

            except requests.RequestException as e:
                print(f"Page {page + 1}: Error - {e}")

            # Random delay between pages
            time.sleep(random.uniform(2, 5))

        return all_jobs

    def _parse_search_results(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        jobs = []

        for card in soup.select('div.job_seen_beacon'):
            job = {}

            title_elem = card.select_one('h2.jobTitle a')
            if title_elem:
                job['title'] = title_elem.get_text(strip=True)
                job['url'] = self.base_url + title_elem.get('href', '')

            company_elem = card.select_one('[data-testid="company-name"]')
            if company_elem:
                job['company'] = company_elem.get_text(strip=True)

            location_elem = card.select_one('[data-testid="text-location"]')
            if location_elem:
                job['location'] = location_elem.get_text(strip=True)

            salary_elem = card.select_one('.salary-snippet-container')
            if salary_elem:
                job['salary'] = salary_elem.get_text(strip=True)

            snippet_elem = card.select_one('.job-snippet')
            if snippet_elem:
                job['description_snippet'] = snippet_elem.get_text(strip=True)

            if job.get('title'):
                jobs.append(job)

        return jobs

    def get_job_details(self, job_url):
        session = self.proxy_manager.get_session()

        try:
            response = session.get(job_url, timeout=30)
            if response.status_code == 200:
                return self._parse_job_detail(response.text)
        except requests.RequestException as e:
            print(f"Error fetching job detail: {e}")

        return None

    def _parse_job_detail(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        detail = {}

        desc_elem = soup.select_one('#jobDescriptionText')
        if desc_elem:
            detail['full_description'] = desc_elem.get_text(strip=True)

        # Extract structured data if available
        script_tags = soup.select('script[type="application/ld+json"]')
        for script in script_tags:
            try:
                data = json.loads(script.string or '')
                if isinstance(data, dict) and data.get('@type') == 'JobPosting':
                    detail['structured_data'] = data
            except (json.JSONDecodeError, TypeError):
                pass

        return detail

# Usage
scraper = IndeedScraper(proxy_manager)
jobs = scraper.search_jobs("software engineer", "Singapore", num_pages=10)
print(f"Total jobs found: {len(jobs)}")

Scraping with Playwright for JavaScript-Heavy Sites

Some job boards require JavaScript rendering. Use Playwright with proxies:

from playwright.sync_api import sync_playwright

def scrape_with_playwright(url, proxy_config):
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            proxy={
                'server': f'http://{proxy_config["host"]}:{proxy_config["port"]}',
                'username': proxy_config['username'],
                'password': proxy_config['password'],
            }
        )

        context = browser.new_context(
            viewport={'width': 1920, 'height': 1080},
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
        )

        page = context.new_page()

        try:
            page.goto(url, wait_until='networkidle', timeout=60000)

            # Wait for job listings to load
            page.wait_for_selector('.job-card', timeout=10000)

            # Extract job data
            jobs = page.evaluate('''() => {
                const cards = document.querySelectorAll('.job-card');
                return Array.from(cards).map(card => ({
                    title: card.querySelector('.job-title')?.textContent?.trim(),
                    company: card.querySelector('.company-name')?.textContent?.trim(),
                    location: card.querySelector('.job-location')?.textContent?.trim(),
                }));
            }''')

            return jobs
        finally:
            browser.close()

proxy_config = {
    'host': 'gate.dataresearchtools.com',
    'port': 5432,
    'username': 'your_username',
    'password': 'your_password',
}

jobs = scrape_with_playwright('https://example-job-board.com/jobs', proxy_config)

Data Structuring and Storage

Standard Job Data Schema

from dataclasses import dataclass, asdict
from datetime import datetime
from typing import Optional, List

@dataclass
class JobListing:
    title: str
    company: str
    location: str
    url: str
    source: str  # "indeed", "glassdoor", etc.
    scraped_at: str
    salary_min: Optional[float] = None
    salary_max: Optional[float] = None
    salary_currency: Optional[str] = None
    employment_type: Optional[str] = None  # full-time, part-time, contract
    experience_level: Optional[str] = None  # entry, mid, senior
    description: Optional[str] = None
    skills: Optional[List[str]] = None
    posted_date: Optional[str] = None

    def to_dict(self):
        return asdict(self)
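
The Indeed scraper above captures salary as raw text ("$80,000 - $100,000 a year"), while the schema wants numeric bounds. A regex-based parser bridges the two. This is a sketch that handles the common dash-separated range; you will need to extend it per board, currency, and shorthand like "80k":

import re

def parse_salary_range(text):
    """Extract (min, max) floats from strings like '$80,000 - $100,000 a year'.
    Returns (None, None) when no number is found; a single figure fills both bounds."""
    numbers = [float(n.replace(',', '')) for n in re.findall(r'\d[\d,]*(?:\.\d+)?', text)]
    if not numbers:
        return None, None
    if len(numbers) == 1:
        return numbers[0], numbers[0]
    return min(numbers), max(numbers)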

Saving to CSV

import csv
from dataclasses import asdict

def save_jobs_csv(jobs, filename):
    if not jobs:
        return

    # Normalize to dicts first, then take the union of keys so rows with
    # optional fields (e.g. salary) don't raise ValueError in DictWriter
    rows = [job if isinstance(job, dict) else asdict(job) for job in jobs]
    fieldnames = sorted({key for row in rows for key in row})

    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval='')
        writer.writeheader()
        writer.writerows(rows)

    print(f"Saved {len(jobs)} jobs to {filename}")

Saving to SQLite

import sqlite3
from datetime import datetime

def save_jobs_db(jobs, db_path='jobs.db'):
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()

    cursor.execute('''
        CREATE TABLE IF NOT EXISTS job_listings (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            title TEXT NOT NULL,
            company TEXT,
            location TEXT,
            url TEXT UNIQUE,
            source TEXT,
            salary_min REAL,
            salary_max REAL,
            salary_currency TEXT,
            employment_type TEXT,
            description TEXT,
            posted_date TEXT,
            scraped_at TEXT DEFAULT CURRENT_TIMESTAMP
        )
    ''')

    for job in jobs:
        try:
            cursor.execute('''
                INSERT OR IGNORE INTO job_listings
                (title, company, location, url, source, salary_min, salary_max, description, scraped_at)
                VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
            ''', (
                job.get('title'), job.get('company'), job.get('location'),
                job.get('url'), job.get('source'), job.get('salary_min'),
                job.get('salary_max'), job.get('description'),
                datetime.now().isoformat()
            ))
        except sqlite3.IntegrityError:
            pass  # Duplicate URL, skip

    conn.commit()
    conn.close()

Scaling Strategies

Concurrent Scraping

Use asyncio for concurrent job scraping:

import asyncio
import random
from itertools import cycle

import aiohttp

async def fetch_job_page(session, url, proxy):
    try:
        async with session.get(url, proxy=proxy, timeout=aiohttp.ClientTimeout(total=30)) as response:
            if response.status == 200:
                return await response.text()
            return None
    except Exception as e:
        print(f"Error: {e}")
        return None

async def scrape_concurrent(urls, proxies, max_concurrent=5):
    semaphore = asyncio.Semaphore(max_concurrent)
    proxy_cycle = cycle(proxies)

    async def bounded_fetch(url):
        async with semaphore:
            proxy = next(proxy_cycle)
            async with aiohttp.ClientSession() as session:
                result = await fetch_job_page(session, url, proxy)
                await asyncio.sleep(random.uniform(1, 3))
                return (url, result)

    tasks = [bounded_fetch(url) for url in urls]
    return await asyncio.gather(*tasks)
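
A quick way to drive it, assuming a list of detail-page URLs gathered from the search scraper:

urls = [job['url'] for job in jobs if job.get('url')]
proxies = [
    'http://user-s1:pass@gate.dataresearchtools.com:5432',
    'http://user-s2:pass@gate.dataresearchtools.com:5432',
]
results = asyncio.run(scrape_concurrent(urls, proxies, max_concurrent=5))
pages = {url: html for url, html in results if html}
print(f"Fetched {len(pages)} of {len(urls)} pages")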

Scheduling Regular Scrapes

For ongoing job market monitoring, schedule scrapes using cron or a task scheduler:

# scrape_jobs.py - run as a long-lived process (a plain-cron alternative follows this block)
import schedule
import time

def daily_scrape():
    scraper = IndeedScraper(proxy_manager)

    queries = [
        ("software engineer", "Singapore"),
        ("data scientist", "Bangkok"),
        ("product manager", "Manila"),
        ("devops engineer", "Jakarta"),
    ]

    all_jobs = []
    for query, location in queries:
        jobs = scraper.search_jobs(query, location, num_pages=5)
        all_jobs.extend(jobs)
        time.sleep(10)  # Pause between queries

    save_jobs_db(all_jobs)
    print(f"Daily scrape complete: {len(all_jobs)} jobs collected")

schedule.every().day.at("02:00").do(daily_scrape)

while True:
    schedule.run_pending()
    time.sleep(60)
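
If you would rather let cron own the schedule, drop the schedule loop, call daily_scrape() directly at the bottom of the script, and register an entry like this (paths are illustrative):

# crontab entry: run every day at 02:00
0 2 * * * /usr/bin/python3 /opt/scrapers/scrape_jobs.py >> /var/log/job_scrape.log 2>&1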

Deduplication

Job listings appear on multiple boards and persist for days or weeks. Deduplicate effectively:

import hashlib

def generate_job_id(job):
    """Create a unique identifier based on job content."""
    key = f"{job.get('title', '')}-{job.get('company', '')}-{job.get('location', '')}".lower()
    return hashlib.md5(key.encode()).hexdigest()

def deduplicate_jobs(jobs):
    seen = set()
    unique_jobs = []

    for job in jobs:
        job_id = generate_job_id(job)
        if job_id not in seen:
            seen.add(job_id)
            job['dedup_id'] = job_id
            unique_jobs.append(job)

    print(f"Deduplicated: {len(jobs)} -> {len(unique_jobs)} unique jobs")
    return unique_jobs
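
The hash is only as stable as its inputs: "Acme Corp" on one board and "acme corp." on another produce different IDs. A small normalizer worth running on each field before hashing (the suffix list is a starting point, not exhaustive):

import re

def normalize_key(text):
    """Lowercase, drop common company suffixes and punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r'\b(inc|llc|ltd|pte|co|corp)\b\.?', ' ', text)
    text = re.sub(r'[^\w\s]', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()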

Why Mobile Proxies Matter for Job Scraping

Job boards are among the most aggressively protected websites. Here is why DataResearchTools mobile proxies are the right choice:

  • High success rate: Mobile carrier IPs from Singapore, Thailand, and the Philippines are rarely flagged by job boards, since carrier-grade NAT puts thousands of legitimate users behind each IP.
  • Geo-specific listings: Access job boards that show different results based on your location. A Singapore mobile IP shows you Singapore-specific listings with local salary data.
  • Sustainable scraping: Mobile IPs rotate naturally, reducing the risk of permanent bans.
  • Fewer CAPTCHAs: Job boards rarely present CAPTCHAs to mobile IPs, unlike datacenter IPs, which trigger them frequently.

Ethical and Legal Considerations

Job scraping operates in a legal gray area. Key points to consider:

  • Respect robots.txt: Check the site’s robots.txt file. While not legally binding in all jurisdictions, following it demonstrates good faith (see the checker sketch after this list).
  • Rate limiting: Do not overwhelm servers. Add delays between requests.
  • Public data only: Scrape publicly available job listings, not behind login walls (unless you have legitimate credentials).
  • Data usage: Be transparent about how you use collected data.
  • GDPR and privacy: Job listings may contain personal data. Ensure compliance with relevant data protection regulations.
  • Terms of service: Review and understand each platform’s ToS before scraping.
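
Python's standard library covers the robots.txt check mentioned above. A minimal gate to call once per domain before scraping it:

from urllib.robotparser import RobotFileParser

def allowed_by_robots(base_url, path, user_agent='*'):
    """Return True if robots.txt permits fetching path (permissive on fetch errors)."""
    parser = RobotFileParser()
    parser.set_url(f"{base_url}/robots.txt")
    try:
        parser.read()
    except OSError:
        return True  # robots.txt unreachable; proceed, but consider logging it
    return parser.can_fetch(user_agent, f"{base_url}{path}")

# Example: check before hitting a search page
if allowed_by_robots('https://www.indeed.com', '/jobs'):
    pass  # safe to proceed with the scrape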

Conclusion

Scraping job listings at scale requires a combination of robust proxy infrastructure, careful parsing logic, and respectful scraping practices. Rotating mobile proxies from DataResearchTools provide the reliability needed for sustained data collection from job boards that actively block datacenter IPs. Whether you are building a talent intelligence platform, conducting salary research, or aggregating listings from Southeast Asian job markets, the techniques in this guide give you a solid foundation for production-grade job scraping.

