Proxies for HR Tech: Salary Benchmarking & Talent Intelligence

HR technology companies and recruitment agencies depend on accurate, up-to-date salary and talent data to serve their clients. Salary benchmarking helps organizations set competitive compensation packages, while talent intelligence reveals where skilled professionals are located and what they expect. Collecting this data at scale requires sophisticated proxy infrastructure. This guide covers the data sources, collection strategies, and technical approaches that power modern HR tech platforms.

The Value of Salary and Talent Data

Salary benchmarking and talent intelligence are not just HR buzzwords — they drive real business decisions:

  • Compensation planning: Companies need to know the market rate for every role to attract and retain talent without overpaying.
  • Offer competitiveness: Recruiters need current salary data to make offers that candidates accept.
  • Geographic arbitrage: Remote-first companies use salary data across regions to optimize hiring costs.
  • Workforce planning: Understanding talent supply and demand helps plan hiring timelines.
  • Diversity and equity: Salary data analysis reveals pay gaps that need addressing.

In Southeast Asia, where labor markets differ significantly between countries like Singapore, Thailand, and the Philippines, localized salary data is especially valuable. DataResearchTools mobile proxies with SEA exit IPs enable collecting this region-specific data effectively.

Key Data Sources for Salary Benchmarking

Job Board Salary Data

Job postings increasingly include salary ranges. These are the most abundant and timely sources:

  • Indeed: Many listings include salary ranges or “estimated salary” data.
  • Glassdoor: User-reported salaries plus employer-posted compensation.
  • LinkedIn: Salary insights and posted ranges on job listings.
  • JobStreet: The dominant job board in Southeast Asia with local salary data.
  • Levels.fyi: Tech-focused compensation data with detailed breakdowns.
  • Payscale: Crowdsourced salary reports.

Government and Public Datasets

Some countries publish official wage statistics:

  • US Bureau of Labor Statistics (BLS): Occupational Employment and Wage Statistics (OEWS).
  • Singapore Ministry of Manpower: Annual wage surveys.
  • Thailand National Statistical Office: Labor force surveys.

These datasets are free but updated infrequently (usually annually).

Company Review Sites

  • Glassdoor: Salary reports by company and role.
  • Blind: Anonymous salary sharing, popular in tech.
  • Comparably: Compensation comparisons.
  • InHerSight: Compensation with a focus on gender equity.

Professional Networks

  • LinkedIn: Company pages reveal headcount changes, hiring velocity, and role distribution.
  • GitHub: Open-source contributions can indicate technology adoption trends.
  • Stack Overflow Developer Survey: Annual salary data by technology and region.

Technical Architecture for Salary Data Collection

System Overview

A production salary benchmarking system typically includes:

  1. Data collection layer: Scrapers with proxy rotation.
  2. Parsing and extraction: Structured data extraction from raw HTML.
  3. Normalization: Standardizing job titles, locations, and salary formats.
  4. Storage: Database with deduplication and versioning.
  5. Analysis: Statistical processing and visualization.

Proxy Infrastructure

Salary data sources are among the most protected websites. Your proxy setup needs to handle:

  • High volume: Thousands of pages per day.
  • Multiple sources: Each source has different protection levels.
  • Geo-targeting: Country-specific salary data requires country-specific proxies.
  • Session management: Some sources require multi-page navigation.

import requests
from dataclasses import dataclass
from typing import Dict, List, Optional
import random

@dataclass
class ProxyEndpoint:
    url: str
    country: str
    sessions_used: int = 0

class HRProxyManager:
    def __init__(self, endpoints: List[ProxyEndpoint]):
        self.endpoints = endpoints
        self.country_map: Dict[str, List[ProxyEndpoint]] = {}

        for ep in endpoints:
            self.country_map.setdefault(ep.country, []).append(ep)

    def get_proxy(self, target_country: Optional[str] = None) -> dict:
        """Get a proxy, optionally filtered by country."""
        if target_country and target_country in self.country_map:
            pool = self.country_map[target_country]
        else:
            pool = self.endpoints

        endpoint = random.choice(pool)
        endpoint.sessions_used += 1

        return {
            'http': endpoint.url,
            'https': endpoint.url,
        }

# DataResearchTools endpoints for different SEA countries
proxy_manager = HRProxyManager([
    ProxyEndpoint('http://user-sg:pass@gate.dataresearchtools.com:5432', 'SG'),
    ProxyEndpoint('http://user-th:pass@gate.dataresearchtools.com:5432', 'TH'),
    ProxyEndpoint('http://user-ph:pass@gate.dataresearchtools.com:5432', 'PH'),
    ProxyEndpoint('http://user-id:pass@gate.dataresearchtools.com:5432', 'ID'),
    ProxyEndpoint('http://user-my:pass@gate.dataresearchtools.com:5432', 'MY'),
])

Salary Data Extraction

import re
from typing import Optional

class SalaryParser:
    """Parse salary information from job listing text."""

    CURRENCY_PATTERNS = {
        'USD': r'\$|USD|US\$',
        'SGD': r'S\$|SGD',
        'THB': r'฿|THB|baht',
        'PHP': r'₱|PHP|peso',
        'IDR': r'Rp|IDR|rupiah',
        'MYR': r'RM|MYR|ringgit',
    }

    def parse_salary(self, text: str) -> Optional[dict]:
        """Extract salary range from text."""
        if not text:
            return None

        # Detect currency
        currency = self._detect_currency(text)

        # Try to extract salary range
        # Note: a character class like [-–to]+ would match stray "t"/"o"
        # characters, so the separator is an explicit alternation. Listings
        # often repeat the currency before the second number
        # ("S$5,000 - S$7,000"), so an optional currency token is allowed.
        range_match = re.search(
            r'(\d{1,3}(?:[,\.]\d{3})*(?:\.\d{2})?)'
            r'\s*(?:[-–—]|to)\s*'
            r'(?:S\$|US\$|\$|฿|₱|Rp|RM|[A-Z]{3})?\s*'
            r'(\d{1,3}(?:[,\.]\d{3})*(?:\.\d{2})?)',
            text
        )

        if range_match:
            min_salary = self._parse_number(range_match.group(1))
            max_salary = self._parse_number(range_match.group(2))

            # Detect period (monthly, yearly, hourly)
            period = self._detect_period(text)

            return {
                'min': min_salary,
                'max': max_salary,
                'currency': currency,
                'period': period,
                'raw_text': text.strip(),
            }

        # Try single salary value
        single_match = re.search(r'(\d{1,3}(?:[,\.]\d{3})*(?:\.\d{2})?)', text)
        if single_match:
            salary = self._parse_number(single_match.group(1))
            period = self._detect_period(text)

            return {
                'min': salary,
                'max': salary,
                'currency': currency,
                'period': period,
                'raw_text': text.strip(),
            }

        return None

    def _detect_currency(self, text: str) -> str:
        for currency, pattern in self.CURRENCY_PATTERNS.items():
            if re.search(pattern, text, re.IGNORECASE):
                return currency
        return 'USD'  # Default

    def _detect_period(self, text: str) -> str:
        text_lower = text.lower()
        if any(w in text_lower for w in ['year', 'annual', 'per annum', 'p.a.']):
            return 'yearly'
        if any(w in text_lower for w in ['month', 'monthly', 'per month']):
            return 'monthly'
        if any(w in text_lower for w in ['hour', 'hourly', 'per hour']):
            return 'hourly'
        if any(w in text_lower for w in ['day', 'daily', 'per day']):
            return 'daily'
        return 'yearly'  # Default assumption

    def _parse_number(self, text: str) -> float:
        cleaned = text.replace(',', '').replace(' ', '')
        # Dot-grouped thousands ("1.000.000", common for IDR) would
        # otherwise parse as a decimal
        if re.fullmatch(r'\d{1,3}(?:\.\d{3})+', cleaned):
            cleaned = cleaned.replace('.', '')
        try:
            return float(cleaned)
        except ValueError:
            return 0.0

    def normalize_to_annual(self, salary_data: dict) -> dict:
        """Convert any salary to annual equivalent."""
        if not salary_data:
            return salary_data

        multipliers = {
            'yearly': 1,
            'monthly': 12,
            'daily': 260,  # Working days per year
            'hourly': 2080,  # Working hours per year
        }

        period = salary_data.get('period', 'yearly')
        multiplier = multipliers.get(period, 1)

        return {
            **salary_data,
            'annual_min': salary_data['min'] * multiplier,
            'annual_max': salary_data['max'] * multiplier,
        }
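
Before wiring this into a scraper, it helps to sanity-check the range pattern in isolation. The sketch below follows the shape used in parse_salary; one refinement worth testing is that listings often repeat the currency symbol before the second number ("S$5,000 - S$7,000"), so an optional currency token sits between the separator and the second figure.

```python
import re

# Range pattern sketch: number, separator ("-", "–", or "to"),
# optional repeated currency token, second number.
RANGE_RE = re.compile(
    r'(\d{1,3}(?:[,\.]\d{3})*(?:\.\d{2})?)'
    r'\s*(?:[-–—]|to)\s*'
    r'(?:S\$|US\$|\$|฿|₱|Rp|RM|[A-Z]{3})?\s*'
    r'(\d{1,3}(?:[,\.]\d{3})*(?:\.\d{2})?)'
)

m = RANGE_RE.search('S$5,000 - S$7,000 per month')
print(m.group(1), m.group(2))  # 5,000 7,000
```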

Job Title Normalization

One of the biggest challenges in salary benchmarking is normalizing job titles:

class TitleNormalizer:
    """Normalize job titles for consistent comparisons."""

    TITLE_MAP = {
        # Software Engineering
        'software engineer': 'Software Engineer',
        'software developer': 'Software Engineer',
        'sde': 'Software Engineer',
        'full stack developer': 'Full Stack Engineer',
        'fullstack developer': 'Full Stack Engineer',
        'frontend developer': 'Frontend Engineer',
        'front-end developer': 'Frontend Engineer',
        'backend developer': 'Backend Engineer',
        'back-end developer': 'Backend Engineer',

        # Data
        'data scientist': 'Data Scientist',
        'data analyst': 'Data Analyst',
        'data engineer': 'Data Engineer',
        'ml engineer': 'Machine Learning Engineer',
        'machine learning engineer': 'Machine Learning Engineer',

        # Management
        'engineering manager': 'Engineering Manager',
        'tech lead': 'Technical Lead',
        'technical lead': 'Technical Lead',
        'vp of engineering': 'VP Engineering',
        'cto': 'CTO',
    }

    LEVEL_PATTERNS = [
        (r'\b(intern|internship)\b', 'Intern'),
        (r'\bjunior\b|entry.level|\bI\b$|\b1\b$', 'Junior'),
        (r'\bmid.?level\b|\bII\b$|\b2\b$', 'Mid'),
        (r'\bsenior\b|\bsr\.?\b|\bIII\b$|\b3\b$', 'Senior'),
        (r'\bstaff\b|\bIV\b$|\b4\b$', 'Staff'),
        (r'\bprincipal\b|\bV\b$|\b5\b$', 'Principal'),
        (r'\blead\b', 'Lead'),
        (r'\bmanager\b|\bdirector\b', 'Manager'),
        (r'\bvp\b|\bvice president\b', 'VP'),
        (r'\bc-level\b|\bchief\b|\bcto\b|\bcio\b', 'Executive'),
    ]

    def normalize(self, title: str) -> dict:
        title_lower = title.lower().strip()

        # Match base title
        base_title = title  # Default to original
        for pattern, normalized in self.TITLE_MAP.items():
            if pattern in title_lower:
                base_title = normalized
                break

        # Detect seniority level; the Roman-numeral patterns (III, IV, V)
        # need a case-insensitive search against the lowercased title
        level = 'Mid'  # Default
        for pattern, level_name in self.LEVEL_PATTERNS:
            if re.search(pattern, title_lower, re.IGNORECASE):
                level = level_name
                break

        return {
            'original_title': title,
            'normalized_title': base_title,
            'level': level,
        }
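
One subtlety in the level table is easy to miss: normalize() lowercases the title before matching, while the Roman-numeral patterns are uppercase, so the search only works case-insensitively. A standalone check, using a Senior alternation like the table entry:

```python
import re

# Roman-numeral levels only hit a lowercased title with re.IGNORECASE
senior = r'\bsenior\b|\bsr\.?\b|\bIII\b$|\b3\b$'

assert re.search(senior, 'software engineer iii', re.IGNORECASE)
assert not re.search(senior, 'software engineer iii')  # case-sensitive miss
assert re.search(senior, 'sr. backend developer', re.IGNORECASE)
```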

Building a Salary Benchmarking Pipeline

End-to-End Pipeline

import json
import sqlite3
from datetime import datetime

class SalaryBenchmarkPipeline:
    def __init__(self, proxy_manager, db_path='salary_data.db'):
        self.proxy_manager = proxy_manager
        self.salary_parser = SalaryParser()
        self.title_normalizer = TitleNormalizer()
        self.db_path = db_path
        self._init_db()

    def _init_db(self):
        conn = sqlite3.connect(self.db_path)
        conn.execute('''
            CREATE TABLE IF NOT EXISTS salary_records (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                source TEXT,
                original_title TEXT,
                normalized_title TEXT,
                level TEXT,
                company TEXT,
                location TEXT,
                country TEXT,
                salary_min REAL,
                salary_max REAL,
                annual_min REAL,
                annual_max REAL,
                currency TEXT,
                period TEXT,
                posted_date TEXT,
                scraped_at TEXT,
                raw_url TEXT UNIQUE
            )
        ''')
        conn.commit()
        conn.close()

    def process_job(self, raw_job: dict) -> dict:
        """Process a raw job listing into a salary record."""
        # Normalize title
        title_info = self.title_normalizer.normalize(raw_job.get('title', ''))

        # Parse salary
        salary_text = raw_job.get('salary', '') or raw_job.get('salary_text', '')
        salary_data = self.salary_parser.parse_salary(salary_text)

        if salary_data:
            salary_data = self.salary_parser.normalize_to_annual(salary_data)

        record = {
            'source': raw_job.get('source', 'unknown'),
            'original_title': title_info['original_title'],
            'normalized_title': title_info['normalized_title'],
            'level': title_info['level'],
            'company': raw_job.get('company', ''),
            'location': raw_job.get('location', ''),
            'country': raw_job.get('country', ''),
            'salary_min': salary_data['min'] if salary_data else None,
            'salary_max': salary_data['max'] if salary_data else None,
            'annual_min': salary_data.get('annual_min') if salary_data else None,
            'annual_max': salary_data.get('annual_max') if salary_data else None,
            'currency': salary_data['currency'] if salary_data else None,
            'period': salary_data['period'] if salary_data else None,
            'posted_date': raw_job.get('posted_date', ''),
            'scraped_at': datetime.now().isoformat(),
            'raw_url': raw_job.get('url', ''),
        }

        return record

    def save_record(self, record: dict):
        conn = sqlite3.connect(self.db_path)
        try:
            conn.execute('''
                INSERT OR IGNORE INTO salary_records
                (source, original_title, normalized_title, level, company,
                 location, country, salary_min, salary_max, annual_min,
                 annual_max, currency, period, posted_date, scraped_at, raw_url)
                VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
            ''', tuple(record.values()))  # key order in process_job matches the column list
            conn.commit()
        finally:
            conn.close()

    def generate_benchmark(self, title: str, country: str) -> dict:
        """Generate a salary benchmark for a specific role and country."""
        conn = sqlite3.connect(self.db_path)
        cursor = conn.execute('''
            SELECT annual_min, annual_max, currency, level, company
            FROM salary_records
            WHERE normalized_title = ?
            AND country = ?
            AND annual_min IS NOT NULL
            ORDER BY scraped_at DESC
            LIMIT 100
        ''', (title, country))

        records = cursor.fetchall()
        conn.close()

        if not records:
            return {'error': 'No data available'}

        all_salaries = []
        for min_sal, max_sal, currency, level, company in records:
            mid = (min_sal + max_sal) / 2
            all_salaries.append(mid)

        all_salaries.sort()
        n = len(all_salaries)

        return {
            'title': title,
            'country': country,
            'sample_size': n,
            'p25': all_salaries[int(n * 0.25)],
            'p50_median': all_salaries[int(n * 0.50)],
            'p75': all_salaries[int(n * 0.75)],
            'p90': all_salaries[int(n * 0.90)] if n > 10 else None,
            'min': all_salaries[0],
            'max': all_salaries[-1],
            'currency': records[0][2],
        }
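
The percentile lookups above use a nearest-rank index, which is adequate at these sample sizes. For interpolated quartiles, the standard library's statistics.quantiles does the same job; a quick check on toy numbers:

```python
import statistics

# Interpolated quartiles on a small toy sample (annual USD figures)
salaries = [60000, 65000, 70000, 80000, 95000]
p25, p50, p75 = statistics.quantiles(salaries, n=4)
print(p25, p50, p75)  # 62500.0 70000.0 87500.0
```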

Talent Intelligence Use Cases

Competitor Hiring Analysis

Track what competitors are hiring for to understand their strategy:

from bs4 import BeautifulSoup  # needed for _categorize_roles below

class CompetitorTracker:
    def __init__(self, proxy_manager, companies: List[str]):
        self.proxy_manager = proxy_manager
        self.companies = companies

    def get_hiring_snapshot(self, company: str) -> dict:
        """Get current open positions for a company."""
        session = requests.Session()
        session.proxies = self.proxy_manager.get_proxy()
        session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
        })

        # Search Indeed for company's open positions
        url = f"https://www.indeed.com/cmp/{company}/jobs"
        response = session.get(url, timeout=30)

        if response.status_code == 200:
            # Parse and categorize roles
            return self._categorize_roles(response.text, company)

        return {'company': company, 'error': f'Status {response.status_code}'}

    def _categorize_roles(self, html: str, company: str) -> dict:
        soup = BeautifulSoup(html, 'html.parser')

        departments = {
            'engineering': 0,
            'sales': 0,
            'marketing': 0,
            'product': 0,
            'operations': 0,
            'hr': 0,
            'finance': 0,
            'other': 0,
        }

        job_cards = soup.select('.job_seen_beacon, .jobsearch-ResultsList li')
        for card in job_cards:
            title_elem = card.select_one('h2, .jobTitle')
            if title_elem:
                title = title_elem.get_text(strip=True).lower()
                categorized = False
                for dept, keywords in {
                    'engineering': ['engineer', 'developer', 'devops', 'sre', 'architect'],
                    'sales': ['sales', 'account executive', 'business development'],
                    'marketing': ['marketing', 'content', 'seo', 'growth'],
                    'product': ['product manager', 'product owner', 'ux', 'designer'],
                    'operations': ['operations', 'logistics', 'supply chain'],
                    'hr': ['recruiter', 'human resources', 'people ops', 'talent'],
                    'finance': ['finance', 'accounting', 'controller', 'treasury'],
                }.items():
                    if any(kw in title for kw in keywords):
                        departments[dept] += 1
                        categorized = True
                        break
                if not categorized:
                    departments['other'] += 1

        return {
            'company': company,
            'total_openings': sum(departments.values()),
            'departments': departments,
            'snapshot_date': datetime.now().isoformat(),
        }
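
A single snapshot is only a point-in-time view; the strategic signal comes from diffing snapshots taken weeks apart. A minimal diff over the dict shape get_hiring_snapshot returns (toy numbers below, not scraped data):

```python
def diff_snapshots(old: dict, new: dict) -> dict:
    """Net change in openings per department between two snapshots."""
    return {
        dept: new['departments'][dept] - old['departments'].get(dept, 0)
        for dept in new['departments']
    }

last_month = {'departments': {'engineering': 12, 'sales': 5}}
this_month = {'departments': {'engineering': 20, 'sales': 3}}
print(diff_snapshots(last_month, this_month))  # {'engineering': 8, 'sales': -2}
```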

Skills Demand Analysis

Extract and analyze skills mentioned in job postings:

class SkillsAnalyzer:
    TECH_SKILLS = {
        'languages': ['python', 'java', 'javascript', 'typescript', 'golang', 'go', 'rust', 'c++', 'c#', 'ruby', 'php', 'kotlin', 'swift'],
        'frameworks': ['react', 'angular', 'vue', 'django', 'flask', 'spring', 'node.js', 'express', 'fastapi', 'next.js'],
        'databases': ['postgresql', 'mysql', 'mongodb', 'redis', 'elasticsearch', 'dynamodb', 'cassandra'],
        'cloud': ['aws', 'azure', 'gcp', 'google cloud', 'kubernetes', 'docker', 'terraform'],
        'data': ['spark', 'kafka', 'airflow', 'hadoop', 'snowflake', 'databricks', 'tableau', 'power bi'],
        'ai_ml': ['machine learning', 'deep learning', 'tensorflow', 'pytorch', 'nlp', 'computer vision', 'llm', 'generative ai'],
    }

    def extract_skills(self, description: str) -> dict:
        desc_lower = description.lower()
        found_skills = {}

        for category, skills in self.TECH_SKILLS.items():
            # Boundary checks stop short names from matching inside longer
            # ones ("go" in "django", "java" in "javascript")
            found = [
                skill for skill in skills
                if re.search(
                    r'(?<![a-z0-9])' + re.escape(skill) + r'(?![a-z0-9])',
                    desc_lower
                )
            ]
            if found:
                found_skills[category] = found

        return found_skills

    def aggregate_skills(self, job_descriptions: List[str]) -> dict:
        """Analyze skill frequency across multiple job descriptions."""
        skill_counts = {}

        for desc in job_descriptions:
            skills = self.extract_skills(desc)
            for category, skill_list in skills.items():
                for skill in skill_list:
                    key = f"{category}/{skill}"
                    skill_counts[key] = skill_counts.get(key, 0) + 1

        total = len(job_descriptions)
        skill_percentages = {
            skill: {
                'count': count,
                'percentage': round(count / total * 100, 1)
            }
            for skill, count in sorted(skill_counts.items(), key=lambda x: -x[1])
        }

        return skill_percentages
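
The aggregation in miniature, run on three toy postings, shows the shape of the output:

```python
from collections import Counter

# Count skill mentions across toy postings and convert to percentages
postings = [
    'python and aws experience required',
    'python, kubernetes, terraform',
    'java backend, aws preferred',
]
counts = Counter()
for p in postings:
    for skill in ('python', 'java', 'aws'):
        if skill in p:
            counts[skill] += 1

pct = {s: round(c / len(postings) * 100, 1) for s, c in counts.items()}
print(pct)  # {'python': 66.7, 'aws': 66.7, 'java': 33.3}
```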

Why Mobile Proxies Are Essential for HR Data

HR data collection faces unique challenges that make mobile proxies from DataResearchTools the right choice:

  • Job boards block aggressively: Indeed, LinkedIn, and Glassdoor all invest heavily in bot detection. Mobile IPs pass through these defenses reliably.
  • Localized data requires local IPs: Salary data varies dramatically by country. A Singapore mobile proxy shows you Singapore salary ranges, while a Thailand proxy shows Thai compensation data.
  • Session persistence: Multi-page scraping (search results, then job detail pages) requires maintaining the same IP. DataResearchTools sticky sessions handle this.
  • Volume: HR tech platforms need to collect thousands of data points daily. Mobile proxy pools provide enough IP diversity for sustained high-volume collection.

Data Quality and Validation

Salary Outlier Detection

import statistics
from typing import List

def detect_outliers(salaries: List[float], threshold: float = 2.0) -> List[float]:
    """Remove salary outliers using z-score method."""
    if len(salaries) < 5:
        return salaries

    mean = statistics.mean(salaries)
    stdev = statistics.stdev(salaries)

    if stdev == 0:
        return salaries

    return [s for s in salaries if abs((s - mean) / stdev) <= threshold]
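
A worked example of the same z-score filter, with the arithmetic inlined: one 250k entry among salaries clustered near 60k lands beyond two standard deviations and is dropped.

```python
import statistics

salaries = [60000.0, 62000.0, 65000.0, 63000.0, 61000.0, 250000.0]
mean = statistics.mean(salaries)    # 93500.0
stdev = statistics.stdev(salaries)  # ≈ 76688
kept = [s for s in salaries if abs((s - mean) / stdev) <= 2.0]
print(kept)  # [60000.0, 62000.0, 65000.0, 63000.0, 61000.0]
```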

Currency Conversion

For comparing salaries across SEA countries:

# Approximate conversion rates to USD (refresh from a live FX source)
TO_USD = {
    'SGD': 0.74,
    'THB': 0.028,
    'PHP': 0.018,
    'IDR': 0.000063,
    'MYR': 0.22,
    'USD': 1.0,
}

def convert_to_usd(amount: float, currency: str) -> float:
    rate = TO_USD.get(currency, 1.0)
    return round(amount * rate, 2)
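
A comparison sketch using the illustrative rates above (placeholder offers, not market data):

```python
TO_USD = {'SGD': 0.74, 'THB': 0.028, 'MYR': 0.22}

# Same-seniority offers in local currency, normalized for comparison
offers = {'SGD': 120000, 'THB': 1800000, 'MYR': 180000}
in_usd = {cur: round(amt * TO_USD[cur], 2) for cur, amt in offers.items()}
print(in_usd)  # {'SGD': 88800.0, 'THB': 50400.0, 'MYR': 39600.0}
```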

Conclusion

Salary benchmarking and talent intelligence require reliable, large-scale data collection from protected websites. The combination of smart proxy rotation, structured data extraction, and rigorous normalization creates a data pipeline that delivers actionable insights. DataResearchTools mobile proxies provide the infrastructure backbone for this kind of work, offering the geo-targeting, reliability, and IP diversity that HR data collection demands. Whether you are building an HR tech product or conducting internal compensation research, these techniques and tools give you the data foundation to make informed decisions about talent and compensation in Southeast Asian and global markets.

