How to Scrape Trustpilot Reviews for Sentiment Analysis
Trustpilot hosts over 200 million reviews across more than 900,000 businesses, making it one of the most important consumer opinion platforms in the world. For brand managers, competitive intelligence teams, and market researchers, Trustpilot reviews provide direct insight into customer satisfaction, product quality, and service issues that no internal metrics can replicate.
Combining Trustpilot review scraping with automated sentiment analysis creates a powerful monitoring system that tracks brand perception over time, identifies emerging customer complaints, and benchmarks against competitors. This guide covers building both components using Python, mobile proxy rotation, and basic NLP techniques.
Trustpilot’s Structure and Challenges
Trustpilot organizes reviews by business domain, with each company having a dedicated page at trustpilot.com/review/{domain}. The platform presents several scraping challenges:
Server-side rendering with hydration. Trustpilot uses a React-based frontend but renders initial content server-side. This means basic HTTP requests can capture review text, but some interactive elements require JavaScript execution.
Rate limiting. Trustpilot monitors request rates and blocks IPs that make too many requests. This is where web scraping proxies become necessary for any meaningful data collection.
Structured data in HTML. Trustpilot embeds JSON-LD structured data in its pages, providing clean review data without requiring complex HTML parsing. This is a significant advantage for scrapers; a quick way to inspect that data is sketched just after this list.
Pagination limits. Trustpilot limits how far back you can paginate through reviews, typically capping at around 100 pages (2,000 reviews) per business.
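To illustrate the first and third points, here is a minimal sketch that fetches a review page with a plain HTTP request and prints the top-level keys of each embedded JSON-LD block. The domain example.com and the generic User-Agent are placeholders; the full scraper below builds on this same pattern.

```python
# Minimal sketch: confirm a review page is server-rendered and carries JSON-LD.
# "example.com" is a placeholder; substitute a domain actually reviewed on Trustpilot.
import json

import requests
from bs4 import BeautifulSoup

html = requests.get(
    "https://www.trustpilot.com/review/example.com",
    headers={"User-Agent": "Mozilla/5.0"},
    timeout=15,
).text
soup = BeautifulSoup(html, "lxml")
for script in soup.select('script[type="application/ld+json"]'):
    try:
        data = json.loads(script.string)
    except (json.JSONDecodeError, TypeError):
        continue
    # Print just the top-level keys to see what structured data is available
    print(list(data.keys()) if isinstance(data, dict) else f"list of {len(data)} items")
```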
Setting Up the Environment
Install the required packages:

```bash
pip install requests beautifulsoup4 pandas textblob vaderSentiment lxml
```

The textblob and vaderSentiment packages provide the sentiment analysis capabilities. VADER (Valence Aware Dictionary and sEntiment Reasoner) is particularly well suited to short-text sentiment analysis like reviews.
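Before wiring VADER into the pipeline, it helps to see what it returns for a single review-style sentence (the sentence here is just an illustration):

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

vader = SentimentIntensityAnalyzer()
# compound ranges from -1.0 (most negative) to +1.0 (most positive)
scores = vader.polarity_scores("Fast delivery, but customer support was terrible.")
print(scores)  # dict with 'neg', 'neu', 'pos', and 'compound' keys
```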
Building the Trustpilot Review Scraper
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
import json
import time
import random
import re
from datetime import datetime


class TrustpilotProxyPool:
    """Manages proxy rotation for Trustpilot scraping."""

    def __init__(self, proxy_list):
        self.proxies = proxy_list
        self.index = 0
        self.blocked = set()

    def get_proxy(self):
        """Return the next available proxy, recycling the pool if all are blocked."""
        available = [p for p in self.proxies if p not in self.blocked]
        if not available:
            # Every proxy has been marked blocked; reset and reuse the full pool
            self.blocked.clear()
            available = self.proxies
        proxy = available[self.index % len(available)]
        self.index += 1
        return {"http": proxy, "https": proxy}

    def mark_blocked(self, proxy_dict):
        """Mark a proxy as temporarily blocked."""
        self.blocked.add(proxy_dict.get("http", ""))
class TrustpilotScraper:
    """Scrapes reviews from Trustpilot business pages."""

    BASE_URL = "https://www.trustpilot.com/review"

    def __init__(self, proxy_pool):
        self.proxy_pool = proxy_pool
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": (
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/120.0.0.0 Safari/537.36"
            ),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.5",
        })

    def scrape_company_reviews(self, domain, max_pages=50, stars=None):
        """Scrape reviews for a company domain, page by page."""
        all_reviews = []
        for page in range(1, max_pages + 1):
            url = f"{self.BASE_URL}/{domain}?page={page}"
            if stars:
                url += f"&stars={stars}"
            proxy = self.proxy_pool.get_proxy()
            try:
                response = self.session.get(url, proxies=proxy, timeout=15)
                if response.status_code == 200:
                    reviews = self._extract_reviews(response.text, domain)
                    if not reviews:
                        print(f"No more reviews found on page {page}")
                        break
                    all_reviews.extend(reviews)
                    print(f"Page {page}: {len(reviews)} reviews (total: {len(all_reviews)})")
                elif response.status_code in (403, 429):
                    print(f"Blocked on page {page}, rotating proxy...")
                    self.proxy_pool.mark_blocked(proxy)
                    time.sleep(random.uniform(10, 20))
                    continue
                else:
                    print(f"HTTP {response.status_code} on page {page}")
                    break
            except requests.RequestException as e:
                print(f"Request error on page {page}: {e}")
                self.proxy_pool.mark_blocked(proxy)
            # Polite delay between page requests
            time.sleep(random.uniform(2, 5))
        return all_reviews

    def _extract_reviews(self, html, domain):
        """Extract review data from page HTML using JSON-LD, then HTML parsing."""
        soup = BeautifulSoup(html, "lxml")
        # Method 1: try JSON-LD structured data
        json_ld_reviews = self._extract_from_json_ld(soup)
        if json_ld_reviews:
            return json_ld_reviews
        # Method 2: fall back to HTML parsing
        return self._extract_from_html(soup, domain)

    def _extract_from_json_ld(self, soup):
        """Extract reviews from JSON-LD structured data."""
        reviews = []
        for script in soup.select('script[type="application/ld+json"]'):
            try:
                data = json.loads(script.string)
                if isinstance(data, dict) and data.get("@type") == "LocalBusiness":
                    review_list = data.get("review", [])
                    for review_data in review_list:
                        review = {
                            "author": review_data.get("author", {}).get("name", ""),
                            "rating": review_data.get("reviewRating", {}).get("ratingValue"),
                            "date": review_data.get("datePublished"),
                            "title": review_data.get("headline", ""),
                            "body": review_data.get("reviewBody", ""),
                        }
                        reviews.append(review)
            except (json.JSONDecodeError, TypeError):
                continue
        return reviews
    def _extract_from_html(self, soup, domain):
        """Extract reviews by parsing HTML elements."""
        reviews = []
        review_cards = soup.select(
            "article.paper_paper__1PY90, "
            "[data-service-review-card-paper], "
            "div.styles_reviewCardInner__EwDQw"
        )
        for card in review_cards:
            review = {"domain": domain}
            # Rating (from star image or data attribute)
            rating_el = card.select_one(
                "[data-service-review-rating], "
                "img[alt*='star']"
            )
            if rating_el:
                rating_attr = rating_el.get("data-service-review-rating")
                if rating_attr:
                    review["rating"] = int(rating_attr)
                else:
                    alt_text = rating_el.get("alt", "")
                    match = re.search(r"(\d)", alt_text)
                    review["rating"] = int(match.group(1)) if match else None
            else:
                review["rating"] = None
            # Title
            title_el = card.select_one(
                "h2[data-service-review-title-typography], "
                "a[data-review-title-typography]"
            )
            review["title"] = title_el.get_text(strip=True) if title_el else ""
            # Body
            body_el = card.select_one(
                "p[data-service-review-text-typography], "
                "[data-review-body]"
            )
            review["body"] = body_el.get_text(strip=True) if body_el else ""
            # Author
            author_el = card.select_one(
                "span[data-consumer-name-typography], "
                "[data-consumer-name]"
            )
            review["author"] = author_el.get_text(strip=True) if author_el else ""
            # Date
            date_el = card.select_one("time")
            if date_el:
                review["date"] = date_el.get("datetime") or date_el.get_text(strip=True)
            else:
                review["date"] = None
            # Company reply
            reply_el = card.select_one("[data-service-review-business-reply-text]")
            review["company_reply"] = reply_el.get_text(strip=True) if reply_el else None
            # Keep only cards that yielded some text
            if review.get("title") or review.get("body"):
                reviews.append(review)
        return reviews
    def scrape_company_stats(self, domain):
        """Scrape overall company rating and review statistics."""
        url = f"{self.BASE_URL}/{domain}"
        proxy = self.proxy_pool.get_proxy()
        try:
            response = self.session.get(url, proxies=proxy, timeout=15)
            if response.status_code != 200:
                return None
            soup = BeautifulSoup(response.text, "lxml")
            stats = {"domain": domain}
            # Overall rating from JSON-LD aggregateRating
            for script in soup.select('script[type="application/ld+json"]'):
                try:
                    data = json.loads(script.string)
                    if isinstance(data, dict) and "aggregateRating" in data:
                        agg = data["aggregateRating"]
                        stats["overall_rating"] = agg.get("ratingValue")
                        stats["total_reviews"] = agg.get("reviewCount")
                        break
                except (json.JSONDecodeError, TypeError):
                    continue
            # Trust score displayed on the page
            score_el = soup.select_one("[data-rating-typography]")
            stats["trust_score"] = score_el.get_text(strip=True) if score_el else None
            return stats
        except Exception as e:
            print(f"Stats scrape error: {e}")
            return None
```

Building the Sentiment Analysis Pipeline
With reviews collected, apply sentiment analysis to extract quantified insights:
```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
from collections import Counter


class ReviewSentimentAnalyzer:
    """Analyzes sentiment in collected Trustpilot reviews."""

    def __init__(self):
        self.vader = SentimentIntensityAnalyzer()

    def analyze_reviews(self, reviews):
        """Add sentiment scores to a list of review dictionaries."""
        for review in reviews:
            text = f"{review.get('title', '')} {review.get('body', '')}".strip()
            if text:
                # VADER sentiment
                vader_scores = self.vader.polarity_scores(text)
                review["vader_compound"] = vader_scores["compound"]
                review["vader_positive"] = vader_scores["pos"]
                review["vader_negative"] = vader_scores["neg"]
                review["vader_neutral"] = vader_scores["neu"]
                # TextBlob sentiment
                blob = TextBlob(text)
                review["textblob_polarity"] = round(blob.sentiment.polarity, 4)
                review["textblob_subjectivity"] = round(blob.sentiment.subjectivity, 4)
                # Classify sentiment
                review["sentiment_label"] = self._classify_sentiment(
                    vader_scores["compound"]
                )
            else:
                review["vader_compound"] = 0
                review["sentiment_label"] = "neutral"
        return reviews

    @staticmethod
    def _classify_sentiment(compound_score):
        """Classify a compound score using VADER's conventional ±0.05 thresholds."""
        if compound_score >= 0.05:
            return "positive"
        elif compound_score <= -0.05:
            return "negative"
        return "neutral"
    def extract_themes(self, reviews, top_n=20):
        """Extract common themes from review text using keyword frequency."""
        # Common stop words to filter out
        stop_words = {
            "the", "a", "an", "is", "are", "was", "were", "be", "been",
            "being", "have", "has", "had", "do", "does", "did", "will",
            "would", "could", "should", "may", "might", "shall", "can",
            "to", "of", "in", "for", "on", "with", "at", "by", "from",
            "up", "about", "into", "through", "during", "before", "after",
            "above", "below", "between", "out", "off", "over", "under",
            "again", "further", "then", "once", "i", "me", "my", "we",
            "our", "you", "your", "he", "she", "it", "they", "them",
            "this", "that", "these", "those", "and", "but", "or", "so",
            "not", "no", "very", "just", "also", "than", "too", "all",
            "each", "every", "both", "few", "more", "most", "other",
            "some", "such", "only", "own", "same", "as", "if", "when",
            "which", "who", "what", "how", "there", "here", "get", "got",
        }
        positive_words = Counter()
        negative_words = Counter()
        for review in reviews:
            text = f"{review.get('title', '')} {review.get('body', '')}".lower()
            words = re.findall(r"\b[a-z]{3,}\b", text)
            filtered = [w for w in words if w not in stop_words]
            if review.get("sentiment_label") == "positive":
                positive_words.update(filtered)
            elif review.get("sentiment_label") == "negative":
                negative_words.update(filtered)
        return {
            "positive_themes": positive_words.most_common(top_n),
            "negative_themes": negative_words.most_common(top_n),
        }
    def sentiment_over_time(self, reviews_df):
        """Track monthly sentiment trends from a reviews DataFrame."""
        if "date" not in reviews_df.columns:
            return None
        reviews_df["date_parsed"] = pd.to_datetime(
            reviews_df["date"], errors="coerce"
        )
        reviews_df = reviews_df.dropna(subset=["date_parsed"])
        # Resample to month-end buckets and aggregate
        monthly = reviews_df.set_index("date_parsed").resample("M").agg({
            "vader_compound": "mean",
            "rating": "mean",
            "title": "count",
        }).rename(columns={"title": "review_count"})
        return monthly
```

Competitive Sentiment Comparison
Compare sentiment across multiple competitors:
```python
class CompetitiveSentimentTracker:
    """Compares review sentiment across competitor businesses."""

    def __init__(self, scraper, analyzer):
        self.scraper = scraper
        self.analyzer = analyzer

    def compare_companies(self, domains, max_reviews_each=500):
        """Scrape and analyze reviews for multiple companies."""
        comparison = {}
        for domain in domains:
            print(f"\nAnalyzing: {domain}")
            # Scrape reviews (roughly 20 reviews per page)
            reviews = self.scraper.scrape_company_reviews(
                domain, max_pages=max_reviews_each // 20
            )
            if not reviews:
                print(f"No reviews found for {domain}")
                continue
            # Analyze sentiment
            analyzed = self.analyzer.analyze_reviews(reviews)
            # Calculate summary stats
            df = pd.DataFrame(analyzed)
            summary = {
                "total_reviews": len(df),
                "avg_rating": df["rating"].mean() if "rating" in df.columns else None,
                "avg_sentiment": df["vader_compound"].mean(),
                "positive_pct": (df["sentiment_label"] == "positive").mean() * 100,
                "negative_pct": (df["sentiment_label"] == "negative").mean() * 100,
                "neutral_pct": (df["sentiment_label"] == "neutral").mean() * 100,
            }
            # Extract themes
            themes = self.analyzer.extract_themes(analyzed)
            summary["top_positive_themes"] = [
                t[0] for t in themes["positive_themes"][:5]
            ]
            summary["top_negative_themes"] = [
                t[0] for t in themes["negative_themes"][:5]
            ]
            comparison[domain] = summary
            # Pause between companies to avoid hammering the site
            time.sleep(random.uniform(5, 10))
        return comparison

    def generate_report(self, comparison):
        """Generate a comparison report from analyzed data."""
        report_data = []
        for domain, stats in comparison.items():
            row = {"domain": domain}
            row.update(stats)
            report_data.append(row)
        df = pd.DataFrame(report_data)
        if not df.empty:
            df = df.sort_values("avg_sentiment", ascending=False)
        return df
```

Running the Complete Pipeline
```python
def main():
    # Replace with your own proxy credentials and endpoints
    proxies = [
        "http://user:pass@proxy1.example.com:8080",
        "http://user:pass@proxy2.example.com:8080",
        "http://user:pass@proxy3.example.com:8080",
        "http://user:pass@proxy4.example.com:8080",
    ]
    pool = TrustpilotProxyPool(proxies)
    scraper = TrustpilotScraper(pool)
    analyzer = ReviewSentimentAnalyzer()

    # Scrape a single company
    domain = "example.com"
    reviews = scraper.scrape_company_reviews(domain, max_pages=20)

    # Analyze sentiment
    analyzed = analyzer.analyze_reviews(reviews)
    df = pd.DataFrame(analyzed)
    df.to_csv(f"trustpilot_{domain.replace('.', '_')}_reviews.csv", index=False)

    # Summary statistics
    print(f"\nResults for {domain}:")
    print(f"Total reviews: {len(df)}")
    print(f"Average rating: {df['rating'].mean():.2f}")
    print(f"Average sentiment: {df['vader_compound'].mean():.3f}")
    print(f"Positive: {(df['sentiment_label'] == 'positive').sum()}")
    print(f"Negative: {(df['sentiment_label'] == 'negative').sum()}")
    print(f"Neutral: {(df['sentiment_label'] == 'neutral').sum()}")

    # Theme extraction
    themes = analyzer.extract_themes(analyzed)
    print(f"\nTop positive themes: {themes['positive_themes'][:10]}")
    print(f"Top negative themes: {themes['negative_themes'][:10]}")

    # Sentiment over time
    monthly = analyzer.sentiment_over_time(df)
    if monthly is not None:
        monthly.to_csv(f"trustpilot_{domain.replace('.', '_')}_monthly.csv")
        print("\nMonthly sentiment trend exported")

    # Competitive comparison
    tracker = CompetitiveSentimentTracker(scraper, analyzer)
    competitors = ["competitor1.com", "competitor2.com", "competitor3.com"]
    comparison = tracker.compare_companies(competitors, max_reviews_each=200)
    report = tracker.generate_report(comparison)
    if not report.empty:
        print("\nCompetitive Comparison:")
        print(report[["domain", "total_reviews", "avg_rating", "avg_sentiment",
                      "positive_pct", "negative_pct"]].to_string())
        report.to_csv("trustpilot_competitive_report.csv", index=False)


if __name__ == "__main__":
    main()
```

Advanced Sentiment Analysis Techniques
Aspect-Based Sentiment
Go beyond overall sentiment to understand sentiment about specific aspects of a business:
```python
def aspect_sentiment(reviews, aspects):
    """Calculate sentiment for specific aspects mentioned in reviews."""
    vader = SentimentIntensityAnalyzer()
    aspect_scores = {aspect: [] for aspect in aspects}
    for review in reviews:
        text = f"{review.get('title', '')} {review.get('body', '')}".lower()
        for aspect in aspects:
            if aspect.lower() in text:
                # Score only the sentences that mention the aspect
                sentences = text.split(".")
                for sentence in sentences:
                    if aspect.lower() in sentence:
                        score = vader.polarity_scores(sentence)
                        aspect_scores[aspect].append(score["compound"])
    # Calculate averages per aspect
    results = {}
    for aspect, scores in aspect_scores.items():
        if scores:
            results[aspect] = {
                "avg_sentiment": round(sum(scores) / len(scores), 3),
                "mention_count": len(scores),
                "positive_mentions": sum(1 for s in scores if s > 0.05),
                "negative_mentions": sum(1 for s in scores if s < -0.05),
            }
    return results
```

Usage example for a proxy service:
```python
aspects = [
    "speed", "customer service", "support", "price", "reliability",
    "connection", "bandwidth", "uptime", "documentation", "setup",
]
results = aspect_sentiment(analyzed_reviews, aspects)
# Sort aspects from most negative to most positive average sentiment
for aspect, data in sorted(results.items(), key=lambda x: x[1]["avg_sentiment"]):
    print(f"{aspect}: sentiment={data['avg_sentiment']}, mentions={data['mention_count']}")
```

Best Practices for Trustpilot Scraping
Leverage JSON-LD first. Trustpilot’s structured data provides the cleanest review extraction. Only fall back to HTML parsing when JSON-LD is incomplete.
Respect pagination limits. Trustpilot caps accessible reviews at approximately 2,000 per business. Plan your data collection scope accordingly.
Filter by star rating. Use the stars URL parameter to scrape reviews of specific ratings. This is useful for targeted negative review analysis.
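For example, with the TrustpilotScraper defined earlier, a one-star-only pull looks like this (example.com is a placeholder domain):

```python
# Collect only 1-star reviews for targeted complaint analysis
one_star = scraper.scrape_company_reviews("example.com", max_pages=10, stars=1)
```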
Monitor for page structure changes. Trustpilot updates its frontend regularly. Build your scraper with multiple fallback selectors to handle variations in CSS class names.
Pair with proxy rotation. Even moderate scraping volumes require proxy rotation. A pool of 3-5 mobile proxies handles most Trustpilot collection tasks effectively.
Conclusion
Trustpilot review scraping combined with sentiment analysis creates a powerful brand monitoring system. The platform’s JSON-LD structured data makes extraction reliable, while VADER sentiment analysis provides quick, accurate sentiment classification without requiring model training.
For businesses in competitive markets, automated Trustpilot monitoring reveals customer perception shifts before they impact sales. Combined with mobile proxy rotation for reliable data collection, this approach scales from single-company monitoring to industry-wide competitive analysis.
For related techniques, explore our web scraping proxy tutorials and social media scraping guides. The proxy glossary provides definitions for proxy concepts referenced throughout this article.
Related Reading
- How to Scrape Amazon Product Data with Proxies: 2026 Python Guide
- How to Scrape Bing Search Results with Python and Proxies
- aiohttp + BeautifulSoup: Async Python Scraping
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)
- API vs Web Scraping: When You Need Proxies (and When You Don’t)
- ASEAN Data Protection Laws: A Web Scraping Compliance Matrix