How to Scrape Glassdoor Reviews and Salary Data
Glassdoor contains one of the largest collections of employer-contributed workplace data on the internet, with over 100 million reviews, salary reports, and interview experiences across millions of companies. For HR tech platforms, recruitment agencies, and labor market researchers, this data is a goldmine for understanding compensation trends, employer reputation, and hiring patterns.
However, Glassdoor aggressively protects its data behind login walls, rate limits, and bot detection systems. This guide walks through building a reliable Glassdoor scraper using Python and mobile proxy rotation to extract reviews, salaries, and interview data at scale.
Understanding Glassdoor’s Anti-Scraping Measures
Glassdoor employs multiple layers of protection that make it one of the more challenging sites to scrape:
Mandatory authentication. Unlike most websites, Glassdoor requires users to log in before viewing more than a handful of reviews or any salary data. This “give to get” model means scrapers must maintain authenticated sessions.
Session fingerprinting. Glassdoor tracks browser fingerprints, cookie state, and behavioral patterns within sessions. Rapid navigation or inconsistent browser characteristics trigger account restrictions.
Content gating. After viewing a few pages, Glassdoor prompts users to contribute their own review or salary before accessing more data. Automated sessions hit this wall quickly.
IP-based rate limiting. Repeated requests from the same IP address result in temporary blocks and CAPTCHA challenges. This is where web scraping proxies become essential.
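Before building anything, it helps to have a way to recognize when one of these defenses has triggered. Here is a minimal, hypothetical sketch: the marker strings and backoff timings are assumptions you would tune against the actual block and CAPTCHA pages you encounter:

import time

# Assumed block markers -- the exact strings Glassdoor serves on a
# rate-limit or CAPTCHA page vary, so treat these as placeholders to tune.
BLOCK_MARKERS = ["captcha", "unusual activity", "are you a robot"]

def is_blocked(page_source: str) -> bool:
    """Heuristic check for a rate-limit or CAPTCHA interstitial."""
    lowered = page_source.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)

def fetch_with_backoff(driver, url, max_retries=3):
    """Retry a page load with exponential backoff when a block is detected."""
    for attempt in range(max_retries):
        driver.get(url)
        if not is_blocked(driver.page_source):
            return driver.page_source
        time.sleep(2 ** attempt * 10)  # 10s, 20s, 40s between retries
    raise RuntimeError(f"Blocked on {url} after {max_retries} attempts")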
Setting Up the Environment
Install the necessary Python packages:
pip install selenium webdriver-manager beautifulsoup4 pandas lxml
Managing Authenticated Sessions with Proxies
The key challenge with Glassdoor scraping is maintaining authenticated sessions across proxy rotations. You need to authenticate once per proxy, then use that session for multiple page requests:
import time
import random
import json
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd


class GlassdoorSessionManager:
    """Manages authenticated Glassdoor sessions across proxy rotation."""

    def __init__(self, proxy_list, credentials):
        self.proxy_list = proxy_list
        self.credentials = credentials
        self.active_sessions = {}

    def create_driver(self, proxy):
        """Create a configured Selenium driver with proxy."""
        chrome_options = Options()
        chrome_options.add_argument(f"--proxy-server={proxy}")
        chrome_options.add_argument("--headless=new")
        chrome_options.add_argument("--disable-blink-features=AutomationControlled")
        chrome_options.add_argument("--no-sandbox")
        chrome_options.add_argument("--window-size=1920,1080")
        chrome_options.add_argument(
            "--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
        )
        chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
        service = Service(ChromeDriverManager().install())
        driver = webdriver.Chrome(service=service, options=chrome_options)
        driver.execute_cdp_cmd(
            "Page.addScriptToEvaluateOnNewDocument",
            {"source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"},
        )
        return driver

    def authenticate(self, driver):
        """Log into Glassdoor with the configured credentials."""
        driver.get("https://www.glassdoor.com/profile/login_input.htm")
        time.sleep(random.uniform(2, 4))
        try:
            email_field = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.ID, "inlineUserEmail"))
            )
            email_field.clear()
            email_field.send_keys(self.credentials["email"])
            time.sleep(random.uniform(0.5, 1.5))
            continue_btn = driver.find_element(By.CSS_SELECTOR, "button[type='submit']")
            continue_btn.click()
            time.sleep(random.uniform(2, 3))
            password_field = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.ID, "inlineUserPassword"))
            )
            password_field.clear()
            password_field.send_keys(self.credentials["password"])
            time.sleep(random.uniform(0.5, 1.0))
            login_btn = driver.find_element(By.CSS_SELECTOR, "button[type='submit']")
            login_btn.click()
            time.sleep(random.uniform(3, 5))
            return "profile" in driver.current_url or "member" in driver.current_url
        except Exception as e:
            print(f"Authentication failed: {e}")
            return False

    def get_authenticated_driver(self):
        """Return an authenticated driver using the next available proxy."""
        for proxy in self.proxy_list:
            if proxy in self.active_sessions:
                continue
            driver = self.create_driver(proxy)
            if self.authenticate(driver):
                self.active_sessions[proxy] = driver
                print(f"Authenticated session established via {proxy}")
                return driver, proxy
            else:
                driver.quit()
        raise Exception("No proxies available for authentication")

    def release_session(self, proxy):
        """Close and release a proxy session."""
        if proxy in self.active_sessions:
            self.active_sessions[proxy].quit()
            del self.active_sessions[proxy]
Scraping Company Reviews
With authenticated sessions ready, build the review scraper that extracts structured review data:
class GlassdoorReviewScraper:
    """Extracts company reviews from Glassdoor."""

    def __init__(self, session_manager):
        self.session_manager = session_manager

    def scrape_reviews(self, company_url, max_pages=10):
        """Scrape reviews for a company across multiple pages."""
        all_reviews = []
        driver, proxy = self.session_manager.get_authenticated_driver()
        try:
            for page in range(1, max_pages + 1):
                page_url = self._build_page_url(company_url, page)
                driver.get(page_url)
                WebDriverWait(driver, 15).until(
                    EC.presence_of_element_located(
                        (By.CSS_SELECTOR, ".review-details__review-details-module__review")
                    )
                )
                time.sleep(random.uniform(2, 4))
                # Expand truncated reviews
                self._expand_all_reviews(driver)
                soup = BeautifulSoup(driver.page_source, "html.parser")
                reviews = self._parse_reviews(soup)
                all_reviews.extend(reviews)
                print(f"Page {page}: extracted {len(reviews)} reviews")
                if not reviews:
                    break
                time.sleep(random.uniform(3, 6))
        except Exception as e:
            print(f"Scraping error: {e}")
        finally:
            self.session_manager.release_session(proxy)
        return all_reviews

    def _build_page_url(self, base_url, page):
        """Add pagination parameter to the company review URL."""
        if "_P" in base_url:
            return base_url.rsplit("_P", 1)[0] + f"_P{page}.htm"
        return base_url.replace(".htm", f"_P{page}.htm")

    def _expand_all_reviews(self, driver):
        """Click 'Continue reading' links to expand full review text."""
        try:
            expand_buttons = driver.find_elements(
                By.CSS_SELECTOR, ".review-details__review-details-module__continueReading"
            )
            for btn in expand_buttons:
                try:
                    driver.execute_script("arguments[0].click();", btn)
                    time.sleep(random.uniform(0.3, 0.8))
                except Exception:
                    pass
        except Exception:
            pass

    def _parse_reviews(self, soup):
        """Extract structured data from review elements."""
        reviews = []
        review_elements = soup.select(".review-details__review-details-module__review")
        for element in review_elements:
            review = {}
            # Overall rating
            rating_el = element.select_one(".review-details__review-details-module__overallRating")
            if rating_el:
                review["overall_rating"] = rating_el.get_text(strip=True)
            # Review title
            title_el = element.select_one("a.review-details__review-details-module__titleHeading")
            review["title"] = title_el.get_text(strip=True) if title_el else None
            # Pros
            pros_el = element.select_one("[data-testid='pros']")
            review["pros"] = pros_el.get_text(strip=True) if pros_el else None
            # Cons
            cons_el = element.select_one("[data-testid='cons']")
            review["cons"] = cons_el.get_text(strip=True) if cons_el else None
            # Job title
            job_el = element.select_one(".review-details__review-details-module__employee")
            review["job_title"] = job_el.get_text(strip=True) if job_el else None
            # Date
            date_el = element.select_one(".review-details__review-details-module__reviewDate")
            review["date"] = date_el.get_text(strip=True) if date_el else None
            # Location
            location_el = element.select_one(".review-details__review-details-module__location")
            review["location"] = location_el.get_text(strip=True) if location_el else None
            # Recommendation
            recommend_el = element.select_one(".review-details__review-details-module__recommend")
            review["recommends"] = recommend_el.get_text(strip=True) if recommend_el else None
            if review.get("title"):
                reviews.append(review)
        return reviews
Scraping Salary Data
Salary information on Glassdoor is structured differently from reviews and requires navigating to a separate section of each company profile:
class GlassdoorSalaryScraper:
    """Extracts salary data from Glassdoor company profiles."""

    def __init__(self, session_manager):
        self.session_manager = session_manager

    def scrape_salaries(self, company_salary_url, max_pages=5):
        """Scrape salary data for all available job titles."""
        all_salaries = []
        driver, proxy = self.session_manager.get_authenticated_driver()
        try:
            for page in range(1, max_pages + 1):
                page_url = company_salary_url.replace(".htm", f"_P{page}.htm")
                driver.get(page_url)
                WebDriverWait(driver, 15).until(
                    EC.presence_of_element_located(
                        (By.CSS_SELECTOR, "[data-testid='salary-row']")
                    )
                )
                time.sleep(random.uniform(2, 4))
                soup = BeautifulSoup(driver.page_source, "html.parser")
                salaries = self._parse_salaries(soup)
                all_salaries.extend(salaries)
                print(f"Salary page {page}: {len(salaries)} entries")
                if not salaries:
                    break
                time.sleep(random.uniform(3, 6))
        except Exception as e:
            print(f"Salary scraping error: {e}")
        finally:
            self.session_manager.release_session(proxy)
        return all_salaries

    def _parse_salaries(self, soup):
        """Extract salary entries from the page."""
        salaries = []
        rows = soup.select("[data-testid='salary-row']")
        for row in rows:
            salary = {}
            title_el = row.select_one("[data-testid='salary-job-title']")
            salary["job_title"] = title_el.get_text(strip=True) if title_el else None
            pay_el = row.select_one("[data-testid='salary-pay']")
            salary["total_pay"] = pay_el.get_text(strip=True) if pay_el else None
            base_el = row.select_one("[data-testid='salary-base']")
            salary["base_salary"] = base_el.get_text(strip=True) if base_el else None
            additional_el = row.select_one("[data-testid='salary-additional']")
            salary["additional_pay"] = additional_el.get_text(strip=True) if additional_el else None
            count_el = row.select_one("[data-testid='salary-count']")
            salary["data_points"] = count_el.get_text(strip=True) if count_el else None
            if salary.get("job_title"):
                salaries.append(salary)
        return salaries
Extracting Interview Data
Glassdoor’s interview section provides unique data about hiring processes, difficulty levels, and common questions:
def scrape_interviews(session_manager, company_interview_url, max_pages=5):
    """Scrape interview experiences from a company profile."""
    all_interviews = []
    driver, proxy = session_manager.get_authenticated_driver()
    try:
        for page in range(1, max_pages + 1):
            page_url = company_interview_url.replace(".htm", f"_P{page}.htm")
            driver.get(page_url)
            time.sleep(random.uniform(3, 5))
            soup = BeautifulSoup(driver.page_source, "html.parser")
            interviews = []
            interview_cards = soup.select(".interview-details__interview-details-module__interviewDetails")
            for card in interview_cards:
                interview = {}
                role_el = card.select_one(".interview-details__interview-details-module__jobTitle")
                interview["role"] = role_el.get_text(strip=True) if role_el else None
                offer_el = card.select_one(".interview-details__interview-details-module__offer")
                interview["offer_status"] = offer_el.get_text(strip=True) if offer_el else None
                experience_el = card.select_one(".interview-details__interview-details-module__experience")
                interview["experience"] = experience_el.get_text(strip=True) if experience_el else None
                difficulty_el = card.select_one(".interview-details__interview-details-module__difficulty")
                interview["difficulty"] = difficulty_el.get_text(strip=True) if difficulty_el else None
                description_el = card.select_one(".interview-details__interview-details-module__description")
                interview["description"] = description_el.get_text(strip=True) if description_el else None
                if interview.get("role"):
                    interviews.append(interview)
            all_interviews.extend(interviews)
            print(f"Interview page {page}: {len(interviews)} entries")
            if not interviews:
                break
            time.sleep(random.uniform(3, 6))
    except Exception as e:
        print(f"Interview scraping error: {e}")
    finally:
        session_manager.release_session(proxy)
    return all_interviews
Putting It All Together
Here is a complete execution script that scrapes reviews, salaries, and interviews for a target company:
def main():
    proxies = [
        "http://user:pass@proxy1.example.com:8080",
        "http://user:pass@proxy2.example.com:8080",
        "http://user:pass@proxy3.example.com:8080",
    ]
    credentials = {
        "email": "your-glassdoor-email@example.com",
        "password": "your-password",
    }
    session_mgr = GlassdoorSessionManager(proxies, credentials)

    # Scrape Google reviews
    review_scraper = GlassdoorReviewScraper(session_mgr)
    reviews = review_scraper.scrape_reviews(
        "https://www.glassdoor.com/Reviews/Google-Reviews-E9079.htm",
        max_pages=20,
    )

    # Scrape salary data
    salary_scraper = GlassdoorSalaryScraper(session_mgr)
    salaries = salary_scraper.scrape_salaries(
        "https://www.glassdoor.com/Salary/Google-Salaries-E9079.htm",
        max_pages=10,
    )

    # Scrape interview data
    interviews = scrape_interviews(
        session_mgr,
        "https://www.glassdoor.com/Interview/Google-Interview-Questions-E9079.htm",
        max_pages=10,
    )

    # Export all data
    pd.DataFrame(reviews).to_csv("glassdoor_reviews.csv", index=False)
    pd.DataFrame(salaries).to_csv("glassdoor_salaries.csv", index=False)
    pd.DataFrame(interviews).to_csv("glassdoor_interviews.csv", index=False)
    print(f"Reviews: {len(reviews)}, Salaries: {len(salaries)}, Interviews: {len(interviews)}")


if __name__ == "__main__":
    main()
Session Management Best Practices
Use multiple accounts. Glassdoor limits how much data a single account can access. Rotate between several accounts, each assigned to a different proxy, to distribute the load.
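A minimal sketch of that rotation, assuming you supply your own account and proxy lists (both placeholders here). Pinning each account to a single proxy keeps its apparent location consistent across runs:

import itertools

class AccountRotator:
    """Cycles through fixed account-proxy pairs so each account always
    appears from the same IP address."""

    def __init__(self, accounts, proxies):
        # zip() pins each account to one proxy for the life of the run
        self.pairs = list(zip(accounts, proxies))
        self._cycle = itertools.cycle(self.pairs)

    def next_pair(self):
        """Return the next (credentials, proxy) pair in round-robin order."""
        return next(self._cycle)

# Usage sketch: build one GlassdoorSessionManager per pair
# creds, proxy = rotator.next_pair()
# session_mgr = GlassdoorSessionManager([proxy], creds)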
Maintain session persistence. Keep cookies and session state within each proxy-account pair. Logging in repeatedly from the same proxy raises flags. Authenticate once, then reuse that session for as many pages as possible.
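One way to persist sessions between runs is to export Selenium's cookies to disk and restore them instead of re-authenticating. This is a generic Selenium pattern, not anything Glassdoor-specific:

import json

def save_cookies(driver, path):
    """Persist the current session's cookies to disk."""
    with open(path, "w") as f:
        json.dump(driver.get_cookies(), f)

def load_cookies(driver, path):
    """Restore a saved session instead of logging in again.
    Selenium requires being on the target domain before adding cookies."""
    driver.get("https://www.glassdoor.com")
    with open(path) as f:
        for cookie in json.load(f):
            # Some exported fields can be rejected on re-import;
            # skip problem cookies rather than failing the whole restore.
            cookie.pop("sameSite", None)
            try:
                driver.add_cookie(cookie)
            except Exception:
                continue
    driver.refresh()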
Mimic human browsing patterns. Add randomized delays between page loads, occasionally visit non-target pages, and vary scroll behavior. Glassdoor’s behavioral analysis detects mechanistic navigation patterns.
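A sketch of what that can look like in Selenium. The scroll distances, pause ranges, detour probability, and detour URL are all arbitrary values to tune, not measured thresholds:

import random
import time

def human_scroll(driver, steps=None):
    """Scroll the page in small, randomized increments with pauses,
    rather than jumping straight to the bottom."""
    steps = steps or random.randint(4, 8)
    for _ in range(steps):
        driver.execute_script(
            "window.scrollBy(0, arguments[0]);", random.randint(250, 600)
        )
        time.sleep(random.uniform(0.4, 1.2))

def idle_detour(driver):
    """Occasionally visit a neutral page between targets to break up
    the navigation pattern. The detour URL is illustrative."""
    if random.random() < 0.2:  # roughly 1 in 5 page loads
        driver.get("https://www.glassdoor.com/blog/")
        time.sleep(random.uniform(3, 7))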
Handle content gates gracefully. When Glassdoor prompts for a contribution, switch to a different account-proxy pair rather than attempting to bypass the gate programmatically.
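A rough sketch of that switch, reusing the GlassdoorSessionManager from earlier. The "contentOverlay" gate marker is an assumption; inspect the actual gate markup before relying on it:

def scrape_with_gate_handling(session_manager, urls):
    """Swap to a fresh account-proxy session when a contribution gate
    appears, instead of trying to dismiss the gate itself."""
    driver, proxy = session_manager.get_authenticated_driver()
    results = []
    for url in urls:
        driver.get(url)
        if "contentOverlay" in driver.page_source:  # assumed gate indicator
            session_manager.release_session(proxy)
            driver, proxy = session_manager.get_authenticated_driver()
            driver.get(url)  # retry once on the new session
        results.append(driver.page_source)
    session_manager.release_session(proxy)
    return results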
Data Quality Considerations
Glassdoor data requires careful cleaning and validation; a pandas sketch of these steps follows the list:
- Review dates should be parsed into standard date formats for time-series analysis
- Salary ranges often include text like “per year” or “per month” that needs normalization
- Some reviews may be duplicated across pages due to Glassdoor’s sorting algorithm
- Job titles in salary data may use internal company titles rather than standardized roles
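As an illustration, here is a hedged pandas sketch of those cleaning steps, keyed to the column names the scrapers above produce. The salary regex is an assumption about the pay-string format and will need adjusting to what you actually collect:

import re
import pandas as pd

def clean_salaries(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize the salary CSV produced above."""
    df = df.drop_duplicates(subset=["job_title", "total_pay"])

    def to_number(value):
        # Pull the first dollar figure out of strings like "$120K /yr"
        if not isinstance(value, str):
            return None
        match = re.search(r"\$?([\d,.]+)\s*(K)?", value)
        if not match:
            return None
        number = float(match.group(1).replace(",", ""))
        return number * 1000 if match.group(2) else number

    df["base_salary_usd"] = df["base_salary"].apply(to_number)
    return df

def clean_reviews(df: pd.DataFrame) -> pd.DataFrame:
    """Parse review dates and drop cross-page duplicates."""
    df = df.drop_duplicates(subset=["title", "date", "job_title"])
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    return df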
Conclusion
Scraping Glassdoor effectively requires authenticated sessions, careful proxy rotation, and patient data collection. The combination of login walls and behavioral detection makes this one of the more technically demanding scraping targets, but the data’s value for HR analytics and competitive intelligence justifies the effort.
Mobile proxies are particularly valuable for Glassdoor because they provide residential-grade IP addresses that survive Glassdoor’s IP reputation checks. For additional scraping techniques, visit our web scraping proxy guides and check the proxy glossary for definitions of key terms used in this article.
Related Reading
- How to Scrape Amazon Product Data with Proxies: 2026 Python Guide
- How to Scrape Bing Search Results with Python and Proxies
- aiohttp + BeautifulSoup: Async Python Scraping
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)
- API vs Web Scraping: When You Need Proxies (and When You Don’t)
- ASEAN Data Protection Laws: A Web Scraping Compliance Matrix