How to Scrape Glassdoor Reviews and Salary Data

Glassdoor contains one of the largest collections of employer-contributed workplace data on the internet, with over 100 million reviews, salary reports, and interview experiences across millions of companies. For HR tech platforms, recruitment agencies, and labor market researchers, this data is a goldmine for understanding compensation trends, employer reputation, and hiring patterns.

However, Glassdoor aggressively protects its data behind login walls, rate limits, and bot detection systems. This guide walks through building a reliable Glassdoor scraper using Python and mobile proxy rotation to extract reviews, salaries, and interview data at scale.

Understanding Glassdoor’s Anti-Scraping Measures

Glassdoor employs multiple layers of protection that make it one of the more challenging sites to scrape:

Mandatory authentication. Unlike most websites, Glassdoor requires users to log in before viewing more than a handful of reviews or any salary data. This “give to get” model means scrapers must maintain authenticated sessions.

Session fingerprinting. Glassdoor tracks browser fingerprints, cookie state, and behavioral patterns within sessions. Rapid navigation or inconsistent browser characteristics trigger account restrictions.

Content gating. After viewing a few pages, Glassdoor prompts users to contribute their own review or salary before accessing more data. Automated sessions hit this wall quickly.

IP-based rate limiting. Repeated requests from the same IP address result in temporary blocks and CAPTCHA challenges. This is where web scraping proxies become essential.
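When a block does occur, backing off exponentially before retrying is more effective than hammering the same endpoint. The sketch below is illustrative: `fetch_page` is a placeholder for your own driver logic, and the CAPTCHA substring check is a crude stand-in for real block detection.

```python
import random
import time


def fetch_with_backoff(fetch_page, url, max_retries=4, base_delay=5):
    """Retry a page fetch with exponential backoff plus jitter when a
    block page is detected. `fetch_page` takes a URL and returns raw HTML."""
    for attempt in range(max_retries):
        html = fetch_page(url)
        if "captcha" not in html.lower():  # crude block heuristic
            return html
        delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
        print(f"Blocked on attempt {attempt + 1}; backing off {delay:.1f}s")
        time.sleep(delay)
    raise RuntimeError(f"Still blocked after {max_retries} attempts: {url}")
```

Pairing this retry logic with proxy rotation, covered below, means a temporary block on one IP rarely stalls the whole job.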

Setting Up the Environment

Install the necessary Python packages:

pip install selenium webdriver-manager beautifulsoup4 pandas lxml

Managing Authenticated Sessions with Proxies

The key challenge with Glassdoor scraping is maintaining authenticated sessions across proxy rotations. You need to authenticate once per proxy, then use that session for multiple page requests:

import time
import random
import json
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd


class GlassdoorSessionManager:
    """Manages authenticated Glassdoor sessions across proxy rotation."""

    def __init__(self, proxy_list, credentials):
        self.proxy_list = proxy_list
        self.credentials = credentials
        self.active_sessions = {}

    def create_driver(self, proxy):
        """Create a configured Selenium driver with proxy."""
        chrome_options = Options()
        chrome_options.add_argument(f"--proxy-server={proxy}")
        chrome_options.add_argument("--headless=new")
        chrome_options.add_argument("--disable-blink-features=AutomationControlled")
        chrome_options.add_argument("--no-sandbox")
        chrome_options.add_argument("--window-size=1920,1080")
        chrome_options.add_argument(
            "--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
        )
        chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])

        service = Service(ChromeDriverManager().install())
        driver = webdriver.Chrome(service=service, options=chrome_options)

        driver.execute_cdp_cmd(
            "Page.addScriptToEvaluateOnNewDocument",
            {"source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"},
        )

        return driver

    def authenticate(self, driver):
        """Log into Glassdoor with the configured credentials."""
        driver.get("https://www.glassdoor.com/profile/login_input.htm")
        time.sleep(random.uniform(2, 4))

        try:
            email_field = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.ID, "inlineUserEmail"))
            )
            email_field.clear()
            email_field.send_keys(self.credentials["email"])
            time.sleep(random.uniform(0.5, 1.5))

            continue_btn = driver.find_element(
                By.CSS_SELECTOR, "button[type='submit']"
            )
            continue_btn.click()
            time.sleep(random.uniform(2, 3))

            password_field = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.ID, "inlineUserPassword"))
            )
            password_field.clear()
            password_field.send_keys(self.credentials["password"])
            time.sleep(random.uniform(0.5, 1.0))

            login_btn = driver.find_element(
                By.CSS_SELECTOR, "button[type='submit']"
            )
            login_btn.click()
            time.sleep(random.uniform(3, 5))

            return "profile" in driver.current_url or "member" in driver.current_url

        except Exception as e:
            print(f"Authentication failed: {e}")
            return False

    def get_authenticated_driver(self):
        """Return an authenticated driver using the next available proxy."""
        for proxy in self.proxy_list:
            if proxy in self.active_sessions:
                continue

            driver = self.create_driver(proxy)
            if self.authenticate(driver):
                self.active_sessions[proxy] = driver
                print(f"Authenticated session established via {proxy}")
                return driver, proxy
            else:
                driver.quit()

        raise RuntimeError("No proxies available for authentication")

    def release_session(self, proxy):
        """Close and release a proxy session."""
        if proxy in self.active_sessions:
            self.active_sessions[proxy].quit()
            del self.active_sessions[proxy]

Scraping Company Reviews

With authenticated sessions ready, build the review scraper that extracts structured review data:

class GlassdoorReviewScraper:
    """Extracts company reviews from Glassdoor."""

    def __init__(self, session_manager):
        self.session_manager = session_manager

    def scrape_reviews(self, company_url, max_pages=10):
        """Scrape reviews for a company across multiple pages."""
        all_reviews = []
        driver, proxy = self.session_manager.get_authenticated_driver()

        try:
            for page in range(1, max_pages + 1):
                page_url = self._build_page_url(company_url, page)
                driver.get(page_url)

                WebDriverWait(driver, 15).until(
                    EC.presence_of_element_located(
                        (By.CSS_SELECTOR, ".review-details__review-details-module__review")
                    )
                )

                time.sleep(random.uniform(2, 4))

                # Expand truncated reviews
                self._expand_all_reviews(driver)

                soup = BeautifulSoup(driver.page_source, "html.parser")
                reviews = self._parse_reviews(soup)
                all_reviews.extend(reviews)

                print(f"Page {page}: extracted {len(reviews)} reviews")

                if not reviews:
                    break

                time.sleep(random.uniform(3, 6))

        except Exception as e:
            print(f"Scraping error: {e}")
        finally:
            self.session_manager.release_session(proxy)

        return all_reviews

    def _build_page_url(self, base_url, page):
        """Add pagination parameter to the company review URL."""
        if "_P" in base_url:
            return base_url.rsplit("_P", 1)[0] + f"_P{page}.htm"
        return base_url.replace(".htm", f"_P{page}.htm")

    def _expand_all_reviews(self, driver):
        """Click 'Continue reading' links to expand full review text."""
        try:
            expand_buttons = driver.find_elements(
                By.CSS_SELECTOR, ".review-details__review-details-module__continueReading"
            )
            for btn in expand_buttons:
                try:
                    driver.execute_script("arguments[0].click();", btn)
                    time.sleep(random.uniform(0.3, 0.8))
                except Exception:
                    pass
        except Exception:
            pass

    def _parse_reviews(self, soup):
        """Extract structured data from review elements."""
        reviews = []
        review_elements = soup.select(
            ".review-details__review-details-module__review"
        )

        for element in review_elements:
            review = {}

            # Overall rating
            rating_el = element.select_one(".review-details__review-details-module__overallRating")
            if rating_el:
                review["overall_rating"] = rating_el.get_text(strip=True)

            # Review title
            title_el = element.select_one("a.review-details__review-details-module__titleHeading")
            review["title"] = title_el.get_text(strip=True) if title_el else None

            # Pros
            pros_el = element.select_one("[data-testid='pros']")
            review["pros"] = pros_el.get_text(strip=True) if pros_el else None

            # Cons
            cons_el = element.select_one("[data-testid='cons']")
            review["cons"] = cons_el.get_text(strip=True) if cons_el else None

            # Job title
            job_el = element.select_one(".review-details__review-details-module__employee")
            review["job_title"] = job_el.get_text(strip=True) if job_el else None

            # Date
            date_el = element.select_one(".review-details__review-details-module__reviewDate")
            review["date"] = date_el.get_text(strip=True) if date_el else None

            # Location
            location_el = element.select_one(".review-details__review-details-module__location")
            review["location"] = location_el.get_text(strip=True) if location_el else None

            # Recommendation
            recommend_el = element.select_one(".review-details__review-details-module__recommend")
            review["recommends"] = recommend_el.get_text(strip=True) if recommend_el else None

            if review.get("title"):
                reviews.append(review)

        return reviews

Scraping Salary Data

Salary information on Glassdoor is structured differently from reviews and requires navigating to a separate section of each company profile:

class GlassdoorSalaryScraper:
    """Extracts salary data from Glassdoor company profiles."""

    def __init__(self, session_manager):
        self.session_manager = session_manager

    def scrape_salaries(self, company_salary_url, max_pages=5):
        """Scrape salary data for all available job titles."""
        all_salaries = []
        driver, proxy = self.session_manager.get_authenticated_driver()

        try:
            for page in range(1, max_pages + 1):
                page_url = company_salary_url.replace(".htm", f"_P{page}.htm")
                driver.get(page_url)

                WebDriverWait(driver, 15).until(
                    EC.presence_of_element_located(
                        (By.CSS_SELECTOR, "[data-testid='salary-row']")
                    )
                )

                time.sleep(random.uniform(2, 4))
                soup = BeautifulSoup(driver.page_source, "html.parser")
                salaries = self._parse_salaries(soup)
                all_salaries.extend(salaries)

                print(f"Salary page {page}: {len(salaries)} entries")

                if not salaries:
                    break

                time.sleep(random.uniform(3, 6))

        except Exception as e:
            print(f"Salary scraping error: {e}")
        finally:
            self.session_manager.release_session(proxy)

        return all_salaries

    def _parse_salaries(self, soup):
        """Extract salary entries from the page."""
        salaries = []
        rows = soup.select("[data-testid='salary-row']")

        for row in rows:
            salary = {}

            title_el = row.select_one("[data-testid='salary-job-title']")
            salary["job_title"] = title_el.get_text(strip=True) if title_el else None

            pay_el = row.select_one("[data-testid='salary-pay']")
            salary["total_pay"] = pay_el.get_text(strip=True) if pay_el else None

            base_el = row.select_one("[data-testid='salary-base']")
            salary["base_salary"] = base_el.get_text(strip=True) if base_el else None

            additional_el = row.select_one("[data-testid='salary-additional']")
            salary["additional_pay"] = additional_el.get_text(strip=True) if additional_el else None

            count_el = row.select_one("[data-testid='salary-count']")
            salary["data_points"] = count_el.get_text(strip=True) if count_el else None

            if salary.get("job_title"):
                salaries.append(salary)

        return salaries

Extracting Interview Data

Glassdoor’s interview section provides unique data about hiring processes, difficulty levels, and common questions:

def scrape_interviews(session_manager, company_interview_url, max_pages=5):
    """Scrape interview experiences from a company profile."""
    all_interviews = []
    driver, proxy = session_manager.get_authenticated_driver()

    try:
        for page in range(1, max_pages + 1):
            page_url = company_interview_url.replace(".htm", f"_P{page}.htm")
            driver.get(page_url)
            time.sleep(random.uniform(3, 5))

            soup = BeautifulSoup(driver.page_source, "html.parser")
            interviews = []

            interview_cards = soup.select(".interview-details__interview-details-module__interviewDetails")

            for card in interview_cards:
                interview = {}

                role_el = card.select_one(".interview-details__interview-details-module__jobTitle")
                interview["role"] = role_el.get_text(strip=True) if role_el else None

                offer_el = card.select_one(".interview-details__interview-details-module__offer")
                interview["offer_status"] = offer_el.get_text(strip=True) if offer_el else None

                experience_el = card.select_one(".interview-details__interview-details-module__experience")
                interview["experience"] = experience_el.get_text(strip=True) if experience_el else None

                difficulty_el = card.select_one(".interview-details__interview-details-module__difficulty")
                interview["difficulty"] = difficulty_el.get_text(strip=True) if difficulty_el else None

                description_el = card.select_one(".interview-details__interview-details-module__description")
                interview["description"] = description_el.get_text(strip=True) if description_el else None

                if interview.get("role"):
                    interviews.append(interview)

            all_interviews.extend(interviews)
            print(f"Interview page {page}: {len(interviews)} entries")

            if not interviews:
                break

            time.sleep(random.uniform(3, 6))

    except Exception as e:
        print(f"Interview scraping error: {e}")
    finally:
        session_manager.release_session(proxy)

    return all_interviews

Putting It All Together

Here is a complete execution script that scrapes reviews, salaries, and interviews for a target company:

def main():
    proxies = [
        "http://user:pass@proxy1.example.com:8080",
        "http://user:pass@proxy2.example.com:8080",
        "http://user:pass@proxy3.example.com:8080",
    ]

    credentials = {
        "email": "your-glassdoor-email@example.com",
        "password": "your-password",
    }

    session_mgr = GlassdoorSessionManager(proxies, credentials)

    # Scrape Google reviews
    review_scraper = GlassdoorReviewScraper(session_mgr)
    reviews = review_scraper.scrape_reviews(
        "https://www.glassdoor.com/Reviews/Google-Reviews-E9079.htm",
        max_pages=20,
    )

    # Scrape salary data
    salary_scraper = GlassdoorSalaryScraper(session_mgr)
    salaries = salary_scraper.scrape_salaries(
        "https://www.glassdoor.com/Salary/Google-Salaries-E9079.htm",
        max_pages=10,
    )

    # Scrape interview data
    interviews = scrape_interviews(
        session_mgr,
        "https://www.glassdoor.com/Interview/Google-Interview-Questions-E9079.htm",
        max_pages=10,
    )

    # Export all data
    pd.DataFrame(reviews).to_csv("glassdoor_reviews.csv", index=False)
    pd.DataFrame(salaries).to_csv("glassdoor_salaries.csv", index=False)
    pd.DataFrame(interviews).to_csv("glassdoor_interviews.csv", index=False)

    print(f"Reviews: {len(reviews)}, Salaries: {len(salaries)}, Interviews: {len(interviews)}")


if __name__ == "__main__":
    main()

Session Management Best Practices

Use multiple accounts. Glassdoor limits how much data a single account can access. Rotate between several accounts, each assigned to a different proxy, to distribute the load.
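One simple way to enforce this pairing is to zip accounts with proxies and cycle through the pairs round-robin. The account dictionaries below are illustrative placeholders:

```python
from itertools import cycle


class AccountRotator:
    """Pairs each account with a dedicated proxy and hands the pairs
    out round-robin, so no account ever appears behind two IPs."""

    def __init__(self, accounts, proxies):
        if len(accounts) != len(proxies):
            raise ValueError("need exactly one proxy per account")
        self._pairs = cycle(list(zip(accounts, proxies)))

    def next_pair(self):
        """Return the next (account, proxy) tuple in rotation."""
        return next(self._pairs)
```

Feeding `next_pair()` into `GlassdoorSessionManager` keeps each account's traffic consistent from Glassdoor's perspective.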

Maintain session persistence. Keep cookies and session state within each proxy-account pair. Logging in repeatedly from the same proxy raises flags. Authenticate once, then reuse that session for as many pages as possible.
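One way to persist state across runs is to dump the driver's cookies to disk after login and restore them before scraping. `get_cookies` and `add_cookie` are standard Selenium calls; stripping `sameSite` is a workaround for values some Chrome versions reject on re-add, and may not be needed in your setup:

```python
import json


def save_cookies(driver, path):
    """Persist the driver's cookies so a later run can resume the session."""
    with open(path, "w") as f:
        json.dump(driver.get_cookies(), f)


def load_cookies(driver, path):
    """Restore saved cookies into a fresh driver. The driver must already
    be on the target domain before cookies can be added."""
    with open(path) as f:
        for cookie in json.load(f):
            cookie.pop("sameSite", None)  # some values are rejected on re-add
            driver.add_cookie(cookie)
```

Call `save_cookies` right after a successful `authenticate`, and try `load_cookies` before attempting a fresh login on the next run.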

Mimic human browsing patterns. Add randomized delays between page loads, occasionally visit non-target pages, and vary scroll behavior. Glassdoor’s behavioral analysis detects mechanistic navigation patterns.
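A minimal sketch of human-like behavior is to scroll the page in a few uneven increments with randomized pauses before moving on. The scroll distances and timing ranges below are illustrative guesses, not calibrated values:

```python
import random
import time


def browse_like_human(driver, scroll_steps=(3, 7), step_pause=(0.4, 1.6), dwell=(2.0, 6.0)):
    """Scroll the page in a random number of uneven jumps with randomized
    pauses, then dwell briefly, roughly imitating a reader."""
    steps = random.randint(*scroll_steps)
    for _ in range(steps):
        driver.execute_script(
            "window.scrollBy(0, arguments[0]);", random.randint(300, 900)
        )
        time.sleep(random.uniform(*step_pause))
    time.sleep(random.uniform(*dwell))
```

Calling this between `driver.get` and parsing in the scrapers above makes each page visit look less mechanical than a fixed `time.sleep`.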

Handle content gates gracefully. When Glassdoor prompts for a contribution, switch to a different account-proxy pair rather than attempting to bypass the gate programmatically.

Data Quality Considerations

Glassdoor data requires careful cleaning and validation:

  • Review dates should be parsed into standard date formats for time-series analysis
  • Salary ranges often include text like “per year” or “per month” that needs normalization
  • Some reviews may be duplicated across pages due to Glassdoor’s sorting algorithm
  • Job titles in salary data may use internal company titles rather than standardized roles
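As a sketch of this cleanup, assuming the column names produced by the salary scraper above and pay strings shaped like "$120,000 per year" (check both against real output before relying on them):

```python
import re

import pandas as pd


def clean_salaries(df):
    """Deduplicate salary rows and normalize pay strings into a numeric
    column plus a pay-period column."""
    def to_number(text):
        if not isinstance(text, str):
            return None
        match = re.search(r"\$?([\d,]+)", text.replace("K", ",000"))
        return int(match.group(1).replace(",", "")) if match else None

    df = df.drop_duplicates(subset=["job_title", "total_pay"])
    df["total_pay_usd"] = df["total_pay"].map(to_number)
    df["period"] = df["total_pay"].str.extract(
        r"(per year|per month|per hour)", expand=False
    )
    return df
```

The same dedupe-then-normalize pattern applies to the review and interview exports before any time-series or aggregate analysis.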

Conclusion

Scraping Glassdoor effectively requires authenticated sessions, careful proxy rotation, and patient data collection. The combination of login walls and behavioral detection makes this one of the more technically demanding scraping targets, but the data’s value for HR analytics and competitive intelligence justifies the effort.

Mobile proxies are particularly valuable for Glassdoor because they route traffic through carrier-grade NAT addresses shared by thousands of real users, which makes them expensive to block outright and helps them survive Glassdoor’s IP reputation checks. For additional scraping techniques, visit our web scraping proxy guides and check the proxy glossary for definitions of key terms used in this article.

