How to Scrape Trustpilot Reviews for Sentiment Analysis

Trustpilot hosts over 200 million reviews across more than 900,000 businesses, making it one of the most important consumer opinion platforms in the world. For brand managers, competitive intelligence teams, and market researchers, Trustpilot reviews provide direct insight into customer satisfaction, product quality, and service issues that no internal metrics can replicate.

Combining Trustpilot review scraping with automated sentiment analysis creates a powerful monitoring system that tracks brand perception over time, identifies emerging customer complaints, and benchmarks against competitors. This guide covers building both components using Python, mobile proxy rotation, and basic NLP techniques.

Trustpilot’s Structure and Challenges

Trustpilot organizes reviews by business domain, with each company having a dedicated page at trustpilot.com/review/{domain}. The platform presents several scraping challenges:

Server-side rendering with hydration. Trustpilot uses a React-based frontend but renders initial content server-side. This means basic HTTP requests can capture review text, but some interactive elements require JavaScript execution.

Rate limiting. Trustpilot monitors request rates and blocks IPs that make too many requests. This is where web scraping proxies become necessary for any meaningful data collection.

Structured data in HTML. Trustpilot embeds JSON-LD structured data in its pages, providing clean review data without requiring complex HTML parsing. This is a significant advantage for scrapers.

Pagination limits. Trustpilot limits how far back you can paginate through reviews, typically capping at around 100 pages (2,000 reviews) per business.
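
Because the review data arrives in the server-rendered HTML, the JSON-LD payload can be pulled out with a plain HTTP fetch and no browser automation. The sketch below demonstrates the idea against a stub HTML snippet (the fields shown are simplified for illustration; the live page embeds a richer payload):

```python
import json
import re

# Stub of the kind of JSON-LD block Trustpilot embeds in its review pages
# (field names simplified for illustration)
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "LocalBusiness",
 "aggregateRating": {"ratingValue": 4.2, "reviewCount": 1834},
 "review": [{"author": {"name": "Jane"},
             "reviewRating": {"ratingValue": 5},
             "reviewBody": "Fast delivery and great support."}]}
</script>
</head><body></body></html>
"""

def extract_json_ld(html):
    """Return the first JSON-LD payload found in the page, or None."""
    match = re.search(
        r'<script type="application/ld\+json">(.*?)</script>',
        html,
        re.DOTALL,
    )
    return json.loads(match.group(1)) if match else None

data = extract_json_ld(SAMPLE_HTML)
print(data["aggregateRating"]["reviewCount"])  # 1834
```

The full scraper below does the same thing with BeautifulSoup, which is more robust than a regex once multiple script tags are involved.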

Setting Up the Environment

pip install requests beautifulsoup4 pandas textblob vaderSentiment lxml

The textblob and vaderSentiment packages provide the sentiment analysis capabilities. VADER (Valence Aware Dictionary and sEntiment Reasoner) is particularly well-suited for short-text sentiment analysis like reviews.

Building the Trustpilot Review Scraper

import requests
from bs4 import BeautifulSoup
import pandas as pd
import json
import time
import random
import re
from datetime import datetime


class TrustpilotProxyPool:
    """Manages proxy rotation for Trustpilot scraping."""

    def __init__(self, proxy_list):
        self.proxies = proxy_list
        self.index = 0
        self.blocked = set()

    def get_proxy(self):
        """Return the next available proxy."""
        available = [p for p in self.proxies if p not in self.blocked]
        if not available:
            self.blocked.clear()
            available = self.proxies

        proxy = available[self.index % len(available)]
        self.index += 1
        return {"http": proxy, "https": proxy}

    def mark_blocked(self, proxy_dict):
        """Mark a proxy as temporarily blocked."""
        self.blocked.add(proxy_dict.get("http", ""))


class TrustpilotScraper:
    """Scrapes reviews from Trustpilot business pages."""

    BASE_URL = "https://www.trustpilot.com/review"

    def __init__(self, proxy_pool):
        self.proxy_pool = proxy_pool
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": (
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/120.0.0.0 Safari/537.36"
            ),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.5",
        })

    def scrape_company_reviews(self, domain, max_pages=50, stars=None):
        """Scrape all reviews for a company domain."""
        all_reviews = []

        page = 1
        retries = 0

        while page <= max_pages:
            url = f"{self.BASE_URL}/{domain}?page={page}"
            if stars:
                url += f"&stars={stars}"

            proxy = self.proxy_pool.get_proxy()

            try:
                response = self.session.get(url, proxies=proxy, timeout=15)

                if response.status_code == 200:
                    reviews = self._extract_reviews(response.text, domain)

                    if not reviews:
                        print(f"No more reviews found on page {page}")
                        break

                    all_reviews.extend(reviews)
                    print(f"Page {page}: {len(reviews)} reviews (total: {len(all_reviews)})")
                    page += 1
                    retries = 0

                elif response.status_code in (403, 429):
                    # Retry the same page with a fresh proxy instead of skipping it
                    print(f"Blocked on page {page}, rotating proxy...")
                    self.proxy_pool.mark_blocked(proxy)
                    retries += 1
                    if retries >= 3:
                        print(f"Giving up on page {page} after {retries} attempts")
                        break
                    time.sleep(random.uniform(10, 20))
                    continue

                else:
                    print(f"HTTP {response.status_code} on page {page}")
                    break

            except requests.RequestException as e:
                print(f"Request error on page {page}: {e}")
                self.proxy_pool.mark_blocked(proxy)
                retries += 1
                if retries >= 3:
                    break

            time.sleep(random.uniform(2, 5))

        return all_reviews

    def _extract_reviews(self, html, domain):
        """Extract review data from page HTML using JSON-LD and HTML parsing."""
        soup = BeautifulSoup(html, "lxml")

        # Method 1: Try JSON-LD structured data
        json_ld_reviews = self._extract_from_json_ld(soup)
        if json_ld_reviews:
            return json_ld_reviews

        # Method 2: Fall back to HTML parsing
        return self._extract_from_html(soup, domain)

    def _extract_from_json_ld(self, soup):
        """Extract reviews from JSON-LD structured data."""
        reviews = []

        for script in soup.select('script[type="application/ld+json"]'):
            try:
                data = json.loads(script.string)

                if isinstance(data, dict) and data.get("@type") == "LocalBusiness":
                    review_list = data.get("review", [])
                    for review_data in review_list:
                        review = {
                            "author": review_data.get("author", {}).get("name", ""),
                            "rating": review_data.get("reviewRating", {}).get("ratingValue"),
                            "date": review_data.get("datePublished"),
                            "title": review_data.get("headline", ""),
                            "body": review_data.get("reviewBody", ""),
                        }
                        reviews.append(review)

            except (json.JSONDecodeError, TypeError):
                continue

        return reviews

    def _extract_from_html(self, soup, domain):
        """Extract reviews by parsing HTML elements."""
        reviews = []

        review_cards = soup.select(
            "article.paper_paper__1PY90, "
            "[data-service-review-card-paper], "
            "div.styles_reviewCardInner__EwDQw"
        )

        for card in review_cards:
            review = {"domain": domain}

            # Rating (from star image or data attribute)
            rating_el = card.select_one(
                "[data-service-review-rating], "
                "img[alt*='star']"
            )
            if rating_el:
                rating_attr = rating_el.get("data-service-review-rating")
                if rating_attr:
                    review["rating"] = int(rating_attr)
                else:
                    alt_text = rating_el.get("alt", "")
                    match = re.search(r"(\d)", alt_text)
                    review["rating"] = int(match.group(1)) if match else None
            else:
                review["rating"] = None

            # Title
            title_el = card.select_one(
                "h2[data-service-review-title-typography], "
                "a[data-review-title-typography]"
            )
            review["title"] = title_el.get_text(strip=True) if title_el else ""

            # Body
            body_el = card.select_one(
                "p[data-service-review-text-typography], "
                "[data-review-body]"
            )
            review["body"] = body_el.get_text(strip=True) if body_el else ""

            # Author
            author_el = card.select_one(
                "span[data-consumer-name-typography], "
                "[data-consumer-name]"
            )
            review["author"] = author_el.get_text(strip=True) if author_el else ""

            # Date
            date_el = card.select_one("time")
            if date_el:
                review["date"] = date_el.get("datetime") or date_el.get_text(strip=True)
            else:
                review["date"] = None

            # Company reply
            reply_el = card.select_one("[data-service-review-business-reply-text]")
            review["company_reply"] = reply_el.get_text(strip=True) if reply_el else None

            if review.get("title") or review.get("body"):
                reviews.append(review)

        return reviews

    def scrape_company_stats(self, domain):
        """Scrape overall company rating and review statistics."""
        url = f"{self.BASE_URL}/{domain}"
        proxy = self.proxy_pool.get_proxy()

        try:
            response = self.session.get(url, proxies=proxy, timeout=15)
            if response.status_code != 200:
                return None

            soup = BeautifulSoup(response.text, "lxml")
            stats = {"domain": domain}

            # Overall rating
            for script in soup.select('script[type="application/ld+json"]'):
                try:
                    data = json.loads(script.string)
                    if isinstance(data, dict) and "aggregateRating" in data:
                        agg = data["aggregateRating"]
                        stats["overall_rating"] = agg.get("ratingValue")
                        stats["total_reviews"] = agg.get("reviewCount")
                        break
                except (json.JSONDecodeError, TypeError):
                    continue

            # Trust score
            score_el = soup.select_one("[data-rating-typography]")
            stats["trust_score"] = score_el.get_text(strip=True) if score_el else None

            return stats

        except Exception as e:
            print(f"Stats scrape error: {e}")
            return None

Building the Sentiment Analysis Pipeline

With reviews collected, apply sentiment analysis to extract quantified insights:

import re

import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
from collections import Counter


class ReviewSentimentAnalyzer:
    """Analyzes sentiment in collected Trustpilot reviews."""

    def __init__(self):
        self.vader = SentimentIntensityAnalyzer()

    def analyze_reviews(self, reviews):
        """Add sentiment scores to a list of review dictionaries."""
        for review in reviews:
            text = f"{review.get('title', '')} {review.get('body', '')}".strip()

            if text:
                # VADER sentiment
                vader_scores = self.vader.polarity_scores(text)
                review["vader_compound"] = vader_scores["compound"]
                review["vader_positive"] = vader_scores["pos"]
                review["vader_negative"] = vader_scores["neg"]
                review["vader_neutral"] = vader_scores["neu"]

                # TextBlob sentiment
                blob = TextBlob(text)
                review["textblob_polarity"] = round(blob.sentiment.polarity, 4)
                review["textblob_subjectivity"] = round(blob.sentiment.subjectivity, 4)

                # Classify sentiment
                review["sentiment_label"] = self._classify_sentiment(
                    vader_scores["compound"]
                )
            else:
                # Keep columns consistent for reviews with no text
                review["vader_compound"] = 0.0
                review["vader_positive"] = 0.0
                review["vader_negative"] = 0.0
                review["vader_neutral"] = 0.0
                review["textblob_polarity"] = 0.0
                review["textblob_subjectivity"] = 0.0
                review["sentiment_label"] = "neutral"

        return reviews

    @staticmethod
    def _classify_sentiment(compound_score):
        """Classify compound score into sentiment categories."""
        if compound_score >= 0.05:
            return "positive"
        elif compound_score <= -0.05:
            return "negative"
        return "neutral"

    def extract_themes(self, reviews, top_n=20):
        """Extract common themes from review text using keyword frequency."""
        # Common stop words to filter out
        stop_words = {
            "the", "a", "an", "is", "are", "was", "were", "be", "been",
            "being", "have", "has", "had", "do", "does", "did", "will",
            "would", "could", "should", "may", "might", "shall", "can",
            "to", "of", "in", "for", "on", "with", "at", "by", "from",
            "up", "about", "into", "through", "during", "before", "after",
            "above", "below", "between", "out", "off", "over", "under",
            "again", "further", "then", "once", "i", "me", "my", "we",
            "our", "you", "your", "he", "she", "it", "they", "them",
            "this", "that", "these", "those", "and", "but", "or", "so",
            "not", "no", "very", "just", "also", "than", "too", "all",
            "each", "every", "both", "few", "more", "most", "other",
            "some", "such", "only", "own", "same", "as", "if", "when",
            "which", "who", "what", "how", "there", "here", "get", "got",
        }

        positive_words = Counter()
        negative_words = Counter()

        for review in reviews:
            text = f"{review.get('title', '')} {review.get('body', '')}".lower()
            words = re.findall(r"\b[a-z]{3,}\b", text)
            filtered = [w for w in words if w not in stop_words]

            if review.get("sentiment_label") == "positive":
                positive_words.update(filtered)
            elif review.get("sentiment_label") == "negative":
                negative_words.update(filtered)

        return {
            "positive_themes": positive_words.most_common(top_n),
            "negative_themes": negative_words.most_common(top_n),
        }

    def sentiment_over_time(self, reviews_df):
        """Track sentiment trends over time."""
        if "date" not in reviews_df.columns:
            return None

        reviews_df["date_parsed"] = pd.to_datetime(
            reviews_df["date"], errors="coerce"
        )
        reviews_df = reviews_df.dropna(subset=["date_parsed"])

        monthly = reviews_df.set_index("date_parsed").resample("M").agg({
            "vader_compound": "mean",
            "rating": "mean",
            "title": "count",
        }).rename(columns={"title": "review_count"})

        return monthly
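
The resampling step at the heart of sentiment_over_time can be illustrated standalone on a few rows of synthetic data (the dates and scores below are invented for the example):

```python
import pandas as pd

# Hypothetical reviews spanning two months
df = pd.DataFrame({
    "date": ["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-25"],
    "vader_compound": [0.8, -0.4, 0.6, 0.2],
    "rating": [5, 2, 4, 3],
    "title": ["a", "b", "c", "d"],
})
df["date_parsed"] = pd.to_datetime(df["date"])

# Group by calendar month: mean sentiment, mean rating, review count
monthly = df.set_index("date_parsed").resample("M").agg({
    "vader_compound": "mean",
    "rating": "mean",
    "title": "count",
}).rename(columns={"title": "review_count"})

print(monthly)
```

Each month collapses to one row: January averages to 0.2 compound sentiment over 2 reviews, February to 0.4 over 2 reviews.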

Competitive Sentiment Comparison

Compare sentiment across multiple competitors:

class CompetitiveSentimentTracker:
    """Compares review sentiment across competitor businesses."""

    def __init__(self, scraper, analyzer):
        self.scraper = scraper
        self.analyzer = analyzer

    def compare_companies(self, domains, max_reviews_each=500):
        """Scrape and analyze reviews for multiple companies."""
        comparison = {}

        for domain in domains:
            print(f"\nAnalyzing: {domain}")

            # Scrape reviews
            reviews = self.scraper.scrape_company_reviews(
                domain, max_pages=max_reviews_each // 20
            )

            if not reviews:
                print(f"No reviews found for {domain}")
                continue

            # Analyze sentiment
            analyzed = self.analyzer.analyze_reviews(reviews)

            # Calculate summary stats
            df = pd.DataFrame(analyzed)
            summary = {
                "total_reviews": len(df),
                "avg_rating": df["rating"].mean() if "rating" in df.columns else None,
                "avg_sentiment": df["vader_compound"].mean(),
                "positive_pct": (df["sentiment_label"] == "positive").mean() * 100,
                "negative_pct": (df["sentiment_label"] == "negative").mean() * 100,
                "neutral_pct": (df["sentiment_label"] == "neutral").mean() * 100,
            }

            # Extract themes
            themes = self.analyzer.extract_themes(analyzed)
            summary["top_positive_themes"] = [
                t[0] for t in themes["positive_themes"][:5]
            ]
            summary["top_negative_themes"] = [
                t[0] for t in themes["negative_themes"][:5]
            ]

            comparison[domain] = summary

            time.sleep(random.uniform(5, 10))

        return comparison

    def generate_report(self, comparison):
        """Generate a comparison report from analyzed data."""
        report_data = []

        for domain, stats in comparison.items():
            row = {"domain": domain}
            row.update(stats)
            report_data.append(row)

        df = pd.DataFrame(report_data)

        if not df.empty:
            df = df.sort_values("avg_sentiment", ascending=False)

        return df

Running the Complete Pipeline

def main():
    proxies = [
        "http://user:pass@proxy1.example.com:8080",
        "http://user:pass@proxy2.example.com:8080",
        "http://user:pass@proxy3.example.com:8080",
        "http://user:pass@proxy4.example.com:8080",
    ]

    pool = TrustpilotProxyPool(proxies)
    scraper = TrustpilotScraper(pool)
    analyzer = ReviewSentimentAnalyzer()

    # Scrape a single company
    domain = "example.com"
    reviews = scraper.scrape_company_reviews(domain, max_pages=20)

    # Analyze sentiment
    analyzed = analyzer.analyze_reviews(reviews)
    df = pd.DataFrame(analyzed)
    df.to_csv(f"trustpilot_{domain.replace('.', '_')}_reviews.csv", index=False)

    # Summary statistics
    print(f"\nResults for {domain}:")
    print(f"Total reviews: {len(df)}")
    print(f"Average rating: {df['rating'].mean():.2f}")
    print(f"Average sentiment: {df['vader_compound'].mean():.3f}")
    print(f"Positive: {(df['sentiment_label'] == 'positive').sum()}")
    print(f"Negative: {(df['sentiment_label'] == 'negative').sum()}")
    print(f"Neutral: {(df['sentiment_label'] == 'neutral').sum()}")

    # Theme extraction
    themes = analyzer.extract_themes(analyzed)
    print(f"\nTop positive themes: {themes['positive_themes'][:10]}")
    print(f"Top negative themes: {themes['negative_themes'][:10]}")

    # Sentiment over time
    monthly = analyzer.sentiment_over_time(df)
    if monthly is not None:
        monthly.to_csv(f"trustpilot_{domain.replace('.', '_')}_monthly.csv")
        print("\nMonthly sentiment trend exported")

    # Competitive comparison
    tracker = CompetitiveSentimentTracker(scraper, analyzer)
    competitors = ["competitor1.com", "competitor2.com", "competitor3.com"]
    comparison = tracker.compare_companies(competitors, max_reviews_each=200)
    report = tracker.generate_report(comparison)

    if not report.empty:
        print("\nCompetitive Comparison:")
        print(report[["domain", "total_reviews", "avg_rating", "avg_sentiment",
                       "positive_pct", "negative_pct"]].to_string())
        report.to_csv("trustpilot_competitive_report.csv", index=False)


if __name__ == "__main__":
    main()

Advanced Sentiment Analysis Techniques

Aspect-Based Sentiment

Go beyond overall sentiment to understand sentiment about specific aspects of a business:

def aspect_sentiment(reviews, aspects):
    """Calculate sentiment for specific aspects mentioned in reviews."""
    vader = SentimentIntensityAnalyzer()
    aspect_scores = {aspect: [] for aspect in aspects}

    for review in reviews:
        text = f"{review.get('title', '')} {review.get('body', '')}".lower()

        for aspect in aspects:
            if aspect.lower() in text:
                # Find sentences mentioning the aspect
                sentences = text.split(".")
                for sentence in sentences:
                    if aspect.lower() in sentence:
                        score = vader.polarity_scores(sentence)
                        aspect_scores[aspect].append(score["compound"])

    # Calculate averages
    results = {}
    for aspect, scores in aspect_scores.items():
        if scores:
            results[aspect] = {
                "avg_sentiment": round(sum(scores) / len(scores), 3),
                "mention_count": len(scores),
                "positive_mentions": sum(1 for s in scores if s > 0.05),
                "negative_mentions": sum(1 for s in scores if s < -0.05),
            }

    return results

Usage example for a proxy service:

aspects = [
    "speed", "customer service", "support", "price", "reliability",
    "connection", "bandwidth", "uptime", "documentation", "setup",
]
results = aspect_sentiment(analyzed_reviews, aspects)
for aspect, data in sorted(results.items(), key=lambda x: x[1]["avg_sentiment"]):
    print(f"{aspect}: sentiment={data['avg_sentiment']}, mentions={data['mention_count']}")

Best Practices for Trustpilot Scraping

Leverage JSON-LD first. Trustpilot’s structured data provides the cleanest review extraction. Only fall back to HTML parsing when JSON-LD is incomplete.

Respect pagination limits. Trustpilot caps accessible reviews at approximately 2,000 per business. Plan your data collection scope accordingly.

Filter by star rating. Use the stars URL parameter to scrape reviews of specific ratings. This is useful for targeted negative review analysis.
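
A small URL builder makes the star filter explicit. Note the repeated-parameter form for multiple ratings is an assumption worth verifying against the live site before relying on it:

```python
def review_url(domain, page=1, stars=None):
    """Build a Trustpilot review URL, optionally filtered by star ratings."""
    url = f"https://www.trustpilot.com/review/{domain}?page={page}"
    if stars:
        # Repeat the parameter once per rating, e.g. &stars=1&stars=2
        url += "".join(f"&stars={s}" for s in stars)
    return url

print(review_url("example.com", page=2, stars=[1, 2]))
# https://www.trustpilot.com/review/example.com?page=2&stars=1&stars=2
```

Scraping only 1- and 2-star reviews this way keeps a complaint-monitoring pipeline focused and reduces the number of pages fetched per run.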

Monitor for page structure changes. Trustpilot updates its frontend regularly. Build your scraper with multiple fallback selectors to handle variations in CSS class names.

Pair with proxy rotation. Even moderate scraping volumes require proxy rotation. A pool of 3-5 mobile proxies handles most Trustpilot collection tasks effectively.

Conclusion

Trustpilot review scraping combined with sentiment analysis creates a powerful brand monitoring system. The platform’s JSON-LD structured data makes extraction reliable, while VADER sentiment analysis provides quick, accurate sentiment classification without requiring model training.

For businesses in competitive markets, automated Trustpilot monitoring reveals customer perception shifts before they impact sales. Combined with mobile proxy rotation for reliable data collection, this approach scales from single-company monitoring to industry-wide competitive analysis.

For related techniques, explore our web scraping proxy tutorials and social media scraping guides. The proxy glossary provides definitions for proxy concepts referenced throughout this article.

