How to Scrape G2 Reviews with Proxies in 2026

G2 is the world’s leading B2B software review platform, hosting millions of verified reviews across thousands of product categories. For SaaS companies, investors, and market researchers, G2 data is a goldmine of competitive intelligence. However, scraping G2 at any meaningful scale requires proxy infrastructure to bypass their anti-bot protections.

This guide covers how to extract G2 review data using Python with proxy rotation, from individual product reviews to large-scale category analysis.

Why G2 Data Matters

G2 reviews drive real business decisions in the B2B software space:

  • Competitive intelligence — Understand how customers rate your competitors, what they love, and what they hate
  • Product development — Mine review text for feature requests and pain points
  • Market mapping — Identify all players in a software category with their positioning
  • Sales enablement — Build battlecards using competitor weakness data
  • Investment research — Evaluate SaaS companies based on customer sentiment trends
  • Content marketing — Create comparison content backed by real user data
  • Win/loss analysis — Understand why customers choose one product over another

Data Points to Extract

G2 provides rich structured data on every product:

| Data Point | Source | Use Case |
| --- | --- | --- |
| Overall rating | Product page | Quick comparison |
| Review text | Review cards | Sentiment analysis |
| Star breakdown | Rating distribution | Quality assessment |
| Reviewer info | Review metadata | Company size, industry |
| Pros and cons | Structured fields | Feature comparison |
| Alternatives listed | Comparison section | Competitive mapping |
| Category ranking | Grid reports | Market positioning |
| Implementation rating | Specific metric | Ease of adoption |
| Support rating | Specific metric | Service quality |
| Feature ratings | Individual scores | Detailed comparison |
| Review date | Timestamp | Trend analysis |
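The per-review fields above map naturally onto a small record type. A sketch of one possible container (the field names are illustrative, chosen to match the parser later in this guide):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class G2Review:
    """Container for the per-review fields listed above."""
    title: Optional[str] = None
    rating: Optional[str] = None
    pros: Optional[str] = None
    cons: Optional[str] = None
    reviewer: Optional[str] = None
    company_size: Optional[str] = None
    date: Optional[str] = None
```

Using a dataclass instead of raw dicts makes missing fields explicit and catches typos in field names early.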

Understanding G2’s Anti-Bot Defenses

G2 employs several protective measures:

  1. Cloudflare protection — G2 sits behind Cloudflare, which provides bot detection, JavaScript challenges, and IP reputation scoring
  2. Rate limiting — Aggressive request throttling per IP
  3. JavaScript rendering — Review content loaded dynamically
  4. Session validation — Cookie and token checks across page loads
  5. CAPTCHA triggers — Cloudflare Turnstile challenges for suspicious traffic
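These defenses usually surface as challenge pages rather than clean errors, so it helps to detect them explicitly before parsing. A minimal heuristic sketch; the marker strings below appear in typical Cloudflare interstitials today but are assumptions and not guaranteed to stay stable:

```python
def looks_like_cloudflare_challenge(status_code: int, body: str) -> bool:
    """Heuristic check for a Cloudflare interstitial page.

    The marker strings are common in Cloudflare challenge HTML but
    may change over time -- treat them as best-effort signals.
    """
    if status_code in (403, 503):
        markers = ("just a moment", "cf-chl", "challenge-platform", "turnstile")
        lower = body.lower()
        return any(m in lower for m in markers)
    return False
```

A scraper can call this on every response and rotate its proxy/session whenever it returns True instead of trying to parse a challenge page as review HTML.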

Setting Up Your Environment

pip install requests beautifulsoup4 lxml fake-useragent cloudscraper

We use cloudscraper to handle Cloudflare’s JavaScript challenges automatically.

Python Code: Scraping G2 Reviews

import cloudscraper
from bs4 import BeautifulSoup
import json
import time
import random
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class G2Scraper:
    def __init__(self, proxy_list: list):
        self.proxy_list = proxy_list
        self.base_url = "https://www.g2.com"
        self.reviews = []

    def get_proxy(self) -> dict:
        proxy = random.choice(self.proxy_list)
        return {"http": f"http://{proxy}", "https": f"http://{proxy}"}

    def create_scraper_session(self):
        """Create a cloudscraper session to handle Cloudflare."""
        scraper = cloudscraper.create_scraper(
            browser={
                "browser": "chrome",
                "platform": "windows",
                "desktop": True
            }
        )
        return scraper

    def scrape_product_reviews(self, product_slug: str, max_pages: int = 20):
        """Scrape all reviews for a G2 product."""
        session = self.create_scraper_session()

        page = 1
        blocked_retries = 0
        while page <= max_pages:
            url = f"{self.base_url}/products/{product_slug}/reviews?page={page}"
            logger.info(f"Scraping reviews page {page}: {url}")

            try:
                response = session.get(
                    url,
                    proxies=self.get_proxy(),
                    timeout=30
                )

                if response.status_code == 200:
                    page_reviews = self.parse_reviews_page(response.text)
                    if not page_reviews:
                        logger.info(f"No more reviews on page {page}")
                        break
                    self.reviews.extend(page_reviews)
                    logger.info(f"Extracted {len(page_reviews)} reviews from page {page}")
                    blocked_retries = 0
                elif response.status_code == 403:
                    logger.warning("Cloudflare block detected -- rotating proxy")
                    session = self.create_scraper_session()
                    time.sleep(random.uniform(10, 20))
                    blocked_retries += 1
                    if blocked_retries < 3:
                        continue  # retry the same page with a fresh session
                    logger.error(f"Skipping page {page} after repeated blocks")
                    blocked_retries = 0
                else:
                    logger.error(f"Status {response.status_code}")

            except Exception as e:
                logger.error(f"Request failed: {e}")
                session = self.create_scraper_session()

            page += 1
            time.sleep(random.uniform(4, 8))

    def parse_reviews_page(self, html: str) -> list:
        """Parse review data from G2 reviews page."""
        soup = BeautifulSoup(html, "lxml")
        reviews = []

        review_cards = soup.select("[class*='review'], [id*='review']")

        for card in review_cards:
            review = {}

            # Star rating
            rating_el = card.select_one("[class*='stars'], [class*='rating']")
            if rating_el:
                # G2 uses star icons; count filled stars or read aria-label
                aria = rating_el.get("aria-label", "")
                if "out of" in aria:
                    review["rating"] = aria.split(" out of")[0].strip()

            # Review title
            title_el = card.select_one("h3, [class*='review-title']")
            if title_el:
                review["title"] = title_el.get_text(strip=True)

            # What do you like best
            pros_el = card.select_one("[class*='like-best'], [data-testid*='like']")
            if pros_el:
                review["pros"] = pros_el.get_text(strip=True)

            # What do you dislike
            cons_el = card.select_one("[class*='dislike'], [data-testid*='dislike']")
            if cons_el:
                review["cons"] = cons_el.get_text(strip=True)

            # Reviewer details
            reviewer_el = card.select_one("[class*='reviewer'], [class*='user-info']")
            if reviewer_el:
                review["reviewer"] = reviewer_el.get_text(strip=True)

            # Company size
            company_el = card.select_one("[class*='company-size'], [class*='segment']")
            if company_el:
                review["company_size"] = company_el.get_text(strip=True)

            # Date
            date_el = card.select_one("time, [class*='date']")
            if date_el:
                review["date"] = date_el.get("datetime", date_el.get_text(strip=True))

            if review.get("title") or review.get("pros"):
                reviews.append(review)

        return reviews

    def scrape_product_info(self, product_slug: str) -> dict:
        """Scrape product overview information."""
        session = self.create_scraper_session()
        url = f"{self.base_url}/products/{product_slug}/reviews"

        try:
            response = session.get(
                url,
                proxies=self.get_proxy(),
                timeout=30
            )

            if response.status_code != 200:
                return {}

            soup = BeautifulSoup(response.text, "lxml")
            info = {}

            # Product name
            name_el = soup.select_one("h1, [class*='product-name']")
            if name_el:
                info["name"] = name_el.get_text(strip=True)

            # Overall rating
            rating_el = soup.select_one("[class*='overall-rating'], [class*='avg-rating']")
            if rating_el:
                info["overall_rating"] = rating_el.get_text(strip=True)

            # Total reviews
            count_el = soup.select_one("[class*='review-count'], [class*='total-reviews']")
            if count_el:
                info["total_reviews"] = count_el.get_text(strip=True)

            # Category
            category_el = soup.select_one("[class*='category-link'], [class*='breadcrumb']")
            if category_el:
                info["category"] = category_el.get_text(strip=True)

            # Alternatives
            alternatives = []
            alt_els = soup.select("[class*='alternative'] a, [class*='competitor'] a")
            for alt in alt_els:
                alternatives.append(alt.get_text(strip=True))
            info["alternatives"] = alternatives

            return info

        except Exception as e:
            logger.error(f"Product info scrape failed: {e}")
            return {}

    def scrape_category(self, category_slug: str, max_pages: int = 5) -> list:
        """Scrape all products in a G2 category."""
        session = self.create_scraper_session()
        products = []

        for page in range(1, max_pages + 1):
            url = f"{self.base_url}/categories/{category_slug}?page={page}"
            logger.info(f"Scraping category page {page}")

            try:
                response = session.get(
                    url,
                    proxies=self.get_proxy(),
                    timeout=30
                )

                if response.status_code == 200:
                    soup = BeautifulSoup(response.text, "lxml")
                    cards = soup.select("[class*='product-card'], [class*='listing']")
                    for card in cards:
                        name_el = card.select_one("a[href*='/products/']")
                        if name_el:
                            products.append({
                                "name": name_el.get_text(strip=True),
                                # href may look like /products/<slug>/reviews -- keep only the slug
                                "slug": name_el["href"].split("/products/")[1].split("/")[0],
                                "url": self.base_url + name_el["href"]
                            })

                    if not cards:
                        break

            except Exception as e:
                logger.error(f"Category scrape failed: {e}")

            time.sleep(random.uniform(3, 7))

        return products


# Usage
if __name__ == "__main__":
    proxies = [
        "user:pass@residential1.proxy.com:8080",
        "user:pass@residential2.proxy.com:8080",
        "user:pass@residential3.proxy.com:8080",
    ]

    scraper = G2Scraper(proxy_list=proxies)

    # Scrape reviews for a specific product (pass the bare slug;
    # the scraper appends /reviews itself)
    scraper.scrape_product_reviews("slack", max_pages=10)

    # Get product info
    info = scraper.scrape_product_info("slack")

    print(f"Product: {info.get('name')}")
    print(f"Total reviews scraped: {len(scraper.reviews)}")

    with open("g2_reviews.json", "w") as f:
        json.dump({
            "product_info": info,
            "reviews": scraper.reviews
        }, f, indent=2)

Proxy Rotation Strategy for G2

G2’s Cloudflare protection requires a strategic approach to proxy usage:

  1. Use residential proxies exclusively — Cloudflare flags datacenter IPs immediately
  2. Rotate IPs frequently for independent requests — When hitting unrelated pages, rotating every 2-3 requests helps avoid Cloudflare’s behavioral scoring
  3. Sticky sessions for detail pages — When scraping a single product’s reviews across pages, keep the same IP for 3-5 pages before rotating; mid-session IP changes can invalidate Cloudflare cookies
  4. US-based IPs preferred — G2 is primarily a US platform; US residential IPs get the least friction
  5. Pool size — Maintain a pool of at least 50 residential IPs for sustained scraping
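Note that the `get_proxy` method in the scraper above picks a random proxy on every request, which does not implement sticky sessions. A minimal rotator that holds each IP for a fixed number of requests could look like this (the pool contents and per-IP count are illustrative):

```python
import itertools

class StickyProxyRotator:
    """Cycle through a proxy pool, keeping each IP for several requests.

    `requests_per_ip` maps to the sticky-session advice above: the same
    proxy is reused for N consecutive requests (e.g. N pages of one
    product's reviews) before moving to the next IP in the pool.
    """

    def __init__(self, proxy_list: list, requests_per_ip: int = 3):
        self._cycle = itertools.cycle(proxy_list)
        self.requests_per_ip = requests_per_ip
        self._current = next(self._cycle)
        self._used = 0

    def get(self) -> dict:
        # Advance to the next proxy once the current one is exhausted
        if self._used >= self.requests_per_ip:
            self._current = next(self._cycle)
            self._used = 0
        self._used += 1
        return {"http": f"http://{self._current}",
                "https": f"http://{self._current}"}
```

Swapping this in for `get_proxy` keeps pagination on one IP while still rotating across products.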

Calculate bandwidth costs for your G2 scraping project with our proxy cost calculator.

Advanced Techniques

Extracting JSON-LD Data

G2 embeds structured data in JSON-LD format on product pages:

def extract_structured_data(html: str) -> list:
    """Extract JSON-LD structured data from G2 pages."""
    soup = BeautifulSoup(html, "lxml")
    scripts = soup.find_all("script", type="application/ld+json")
    data = []
    for script in scripts:
        try:
            parsed = json.loads(script.string)
            data.append(parsed)
        except (json.JSONDecodeError, TypeError):
            continue
    return data
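Once the JSON-LD blocks are parsed, you can pull out specific nodes such as schema.org’s aggregateRating. The payload below is hypothetical; real G2 markup may nest or name things differently:

```python
# Hypothetical JSON-LD payload for illustration -- real G2 pages may differ.
sample = [{
    "@type": "Product",
    "name": "ExampleApp",
    "aggregateRating": {"ratingValue": "4.5", "reviewCount": "1200"}
}]

def find_aggregate_rating(blocks: list) -> dict:
    """Return the first schema.org aggregateRating node in parsed JSON-LD."""
    for block in blocks:
        # A single script tag can hold either one object or a list of them
        items = block if isinstance(block, list) else [block]
        for item in items:
            if isinstance(item, dict) and "aggregateRating" in item:
                return item["aggregateRating"]
    return {}
```

This is often more reliable than CSS selectors, since structured data changes less frequently than page markup.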

Sentiment Analysis Pipeline

Once you have review data, run sentiment analysis to quantify opinions:

from collections import Counter

def analyze_sentiment(reviews: list) -> dict:
    """Basic sentiment analysis on G2 reviews."""
    positive_keywords = ["love", "great", "excellent", "easy", "intuitive", "powerful"]
    negative_keywords = ["slow", "buggy", "expensive", "confusing", "lacking", "poor"]

    pos_count = 0
    neg_count = 0

    for review in reviews:
        text = (review.get("pros", "") + " " + review.get("cons", "")).lower()
        for kw in positive_keywords:
            if kw in text:
                pos_count += 1
        for kw in negative_keywords:
            if kw in text:
                neg_count += 1

    return {
        "positive_mentions": pos_count,
        "negative_mentions": neg_count,
        "sentiment_ratio": pos_count / max(neg_count, 1)
    }

Troubleshooting

Problem: Cloudflare challenges blocking every request

  • Use the cloudscraper library instead of plain requests; it handles JavaScript challenges.
  • If still blocked, switch to a headless browser (Playwright) with stealth plugins.
  • Verify proxy quality — low-reputation IPs trigger Cloudflare more aggressively.

Problem: Reviews page returns empty content

  • G2 lazy-loads reviews via JavaScript. Check if the initial HTML contains review data or if it requires JS execution.
  • Look for API endpoints in network traffic that return review JSON directly.

Problem: Getting different data than what the browser shows

  • G2 may serve different content based on authentication state. Some review details are gated behind G2 login.
  • Use browser cookies from a logged-in session to access full review content.

Problem: Rate limited after a small number of requests

  • Increase delays between requests to 5-10 seconds.
  • Rotate both IP and User-Agent on every request.
  • Spread scraping across different times of day.
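User-Agent rotation can be as simple as picking from a pool per request. A minimal sketch; the UA strings below are examples, and the fake-useragent package installed earlier can generate them dynamically instead:

```python
import random

# Small static pool for illustration; fake-useragent can supply fresh ones.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]

def rotating_headers() -> dict:
    """Build request headers with a randomly chosen User-Agent."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }
```

Pass the result as `headers=rotating_headers()` on each `session.get` call so the IP and UA change together.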

Verify your proxy IP is clean using our IP lookup tool.

Legal and Ethical Considerations

Scraping G2 reviews involves several legal considerations:

  • G2’s Terms of Service — G2 prohibits automated scraping in their ToS. Commercial use of scraped data could expose you to legal claims.
  • Review ownership — Reviews are written by users but licensed to G2. Republishing full review text may infringe on G2’s rights.
  • Personal data — Reviewer names, titles, and company affiliations constitute personal data under GDPR and CCPA. Handle this data with care.
  • Fair use — Aggregating and analyzing review data for research purposes may fall under fair use, but this varies by jurisdiction.
  • G2 API alternatives — G2 offers official API access for some data. Consider using official channels for commercial applications.
  • Competitive use — Using scraped competitor reviews in marketing materials could raise unfair competition claims.

Always consult with a legal professional before scraping review platforms at scale.

Conclusion

G2 reviews are one of the most valuable data sources for B2B competitive intelligence. Scraping them requires Cloudflare bypass capabilities, residential proxies, and careful rate management. The cloudscraper library handles most Cloudflare challenges, but for the most reliable results, consider combining it with headless browser automation. Start with a specific product or category and expand your scraping scope gradually as you refine your approach.

