How to Scrape Amazon Product Reviews in 2026

Amazon is the world’s largest e-commerce platform, hosting hundreds of millions of product reviews that shape purchasing decisions for consumers worldwide. For product researchers, brand managers, competitive analysts, and sentiment analysis teams, scraping Amazon reviews provides direct insight into customer satisfaction, product issues, and market trends.

This guide covers how to scrape Amazon product reviews using Python, handle their sophisticated anti-bot protections, and integrate proxies for reliable large-scale extraction.

What Data Can You Extract from Amazon Reviews?

Amazon reviews contain detailed data:

  • Review text (title, body content)
  • Star rating (1-5 stars)
  • Reviewer information (name, verified purchase badge)
  • Review date
  • Helpful vote count
  • Product variant (color, size purchased)
  • Images and videos attached to reviews
  • Aggregate rating breakdown (star distribution)
  • “Most helpful” and “Most recent” sorted reviews

Example JSON Output

{
  "asin": "B09V3KXJPB",
  "product_name": "Apple AirPods Pro (2nd Generation)",
  "overall_rating": 4.7,
  "total_reviews": 125430,
  "review": {
    "id": "R2ABCDEFGHIJK",
    "title": "Best noise canceling earbuds I've owned",
    "body": "The ANC is noticeably better than the first generation...",
    "rating": 5,
    "date": "March 5, 2026",
    "reviewer": "Tech Enthusiast",
    "verified_purchase": true,
    "helpful_votes": 342,
    "variant": "USB-C",
    "images": ["https://images-na.ssl-images-amazon.com/..."]
  }
}
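If you are collecting these fields programmatically, a small dataclass keeps records consistent before serializing them. This is a sketch; the field names are illustrative and simply mirror the JSON example above:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AmazonReview:
    """One review record; field names mirror the JSON example above."""
    review_id: Optional[str] = None
    title: Optional[str] = None
    body: Optional[str] = None
    rating: Optional[float] = None
    date: Optional[str] = None
    reviewer: Optional[str] = None
    verified_purchase: bool = False
    helpful_votes: int = 0
    variant: Optional[str] = None
    images: List[str] = field(default_factory=list)
```

Defaulting every field to a safe empty value means a partially parsed review (a title with no body, a missing vote count) still produces a valid record.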

Prerequisites

pip install requests beautifulsoup4 lxml fake-useragent

Amazon runs some of the most aggressive anti-bot protections of any major site. For anything beyond a handful of requests, rotating residential proxies are essential.

Scraping Amazon Reviews with Requests

Amazon renders review pages server-side, making requests-based scraping viable with proper headers and proxies.

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
import json
import time
import random
import re

class AmazonReviewScraper:
    def __init__(self, proxy_url=None, domain="com"):
        self.session = requests.Session()
        self.ua = UserAgent()
        self.proxy_url = proxy_url
        self.base_url = f"https://www.amazon.{domain}"

    def _get_headers(self):
        return {
            "User-Agent": self.ua.random,
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept-Encoding": "gzip, deflate, br",
            "Referer": self.base_url,
            "Connection": "keep-alive",
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-Mode": "navigate",
        }

    def _get_proxies(self):
        if self.proxy_url:
            return {"http": self.proxy_url, "https": self.proxy_url}
        return None

    def scrape_reviews(self, asin, max_pages=10, sort_by="recent"):
        """Scrape reviews for a product by ASIN."""
        all_reviews = []
        sort_param = "recent" if sort_by == "recent" else "helpful"

        for page in range(1, max_pages + 1):
            url = f"{self.base_url}/product-reviews/{asin}/ref=cm_cr_getr_d_paging_btm_next_{page}?pageNumber={page}&sortBy={sort_param}"

            try:
                response = self.session.get(
                    url,
                    headers=self._get_headers(),
                    proxies=self._get_proxies(),
                    timeout=30
                )
                response.raise_for_status()

                # Check for Amazon's interstitial "Robot Check" page. Matching the
                # bare words "captcha"/"robot" would false-positive on reviews that
                # merely mention them (e.g. robot vacuums)
                if ("api-services-support@amazon.com" in response.text
                        or "Enter the characters you see below" in response.text):
                    print(f"CAPTCHA detected on page {page}. Rotating proxy recommended.")
                    time.sleep(30)
                    continue

                soup = BeautifulSoup(response.text, "lxml")
                reviews = self._parse_reviews(soup)

                if not reviews:
                    print(f"No more reviews found on page {page}")
                    break

                all_reviews.extend(reviews)
                print(f"Page {page}: {len(reviews)} reviews (total: {len(all_reviews)})")

                time.sleep(random.uniform(3, 7))

            except requests.RequestException as e:
                print(f"Error on page {page}: {e}")
                time.sleep(10)
                continue

        return all_reviews

    def _parse_reviews(self, soup):
        """Parse individual reviews from the reviews page."""
        reviews = []
        review_divs = soup.select("div[data-hook='review']")

        for div in review_divs:
            try:
                review = {}

                # Title
                title_elem = div.select_one("a[data-hook='review-title'] span:last-child, a[data-hook='review-title']")
                review["title"] = title_elem.get_text(strip=True) if title_elem else None

                # Rating
                rating_elem = div.select_one("i[data-hook='review-star-rating'] span, i[data-hook='cmps-review-star-rating'] span")
                if rating_elem:
                    rating_text = rating_elem.get_text(strip=True)
                    match = re.search(r'(\d+\.?\d*)', rating_text)
                    review["rating"] = float(match.group(1)) if match else None

                # Body
                body_elem = div.select_one("span[data-hook='review-body'] span")
                review["body"] = body_elem.get_text(strip=True) if body_elem else None

                # Date
                date_elem = div.select_one("span[data-hook='review-date']")
                review["date"] = date_elem.get_text(strip=True) if date_elem else None

                # Reviewer
                reviewer_elem = div.select_one("span.a-profile-name")
                review["reviewer"] = reviewer_elem.get_text(strip=True) if reviewer_elem else None

                # Verified purchase
                verified_elem = div.select_one("span[data-hook='avp-badge']")
                review["verified_purchase"] = verified_elem is not None

                # Helpful votes
                helpful_elem = div.select_one("span[data-hook='helpful-vote-statement']")
                if helpful_elem:
                    # Strip thousands separators ("1,234 people found this helpful")
                    text = helpful_elem.get_text(strip=True).replace(",", "")
                    match = re.search(r'(\d+)', text)
                    # A single vote renders as "One person found this helpful",
                    # which contains no digits, so default to 1
                    review["helpful_votes"] = int(match.group(1)) if match else 1
                else:
                    review["helpful_votes"] = 0

                # Review ID
                review["review_id"] = div.get("id")

                if review.get("body") or review.get("title"):
                    reviews.append(review)

            except Exception:
                # Skip malformed review blocks rather than aborting the page
                continue

        return reviews

    def get_product_rating_summary(self, asin):
        """Get the overall rating and breakdown for a product."""
        url = f"{self.base_url}/product-reviews/{asin}"

        try:
            response = self.session.get(
                url,
                headers=self._get_headers(),
                proxies=self._get_proxies(),
                timeout=30
            )
            response.raise_for_status()
            soup = BeautifulSoup(response.text, "lxml")

            summary = {}

            # Overall rating
            overall = soup.select_one("span[data-hook='rating-out-of-text']")
            if overall:
                match = re.search(r'(\d+\.?\d*)', overall.get_text())
                summary["overall_rating"] = float(match.group(1)) if match else None

            # Total reviews
            total = soup.select_one("div[data-hook='total-review-count'] span")
            if total:
                text = total.get_text(strip=True).replace(",", "")
                match = re.search(r'(\d+)', text)
                summary["total_reviews"] = int(match.group(1)) if match else None

            # Star breakdown
            breakdown = {}
            star_rows = soup.select("table#histogramTable tr")
            for row in star_rows:
                star_label = row.select_one("td:first-child a")
                pct = row.select_one("td:nth-child(3) a")
                if star_label and pct:
                    star = star_label.get_text(strip=True)
                    percentage = pct.get_text(strip=True)
                    breakdown[star] = percentage

            summary["breakdown"] = breakdown

            return summary

        except Exception as e:
            print(f"Error: {e}")
            return None


# Usage
if __name__ == "__main__":
    scraper = AmazonReviewScraper(proxy_url="http://user:pass@proxy:port")

    # Get rating summary
    summary = scraper.get_product_rating_summary("B09V3KXJPB")
    print(json.dumps(summary, indent=2))

    # Scrape reviews
    reviews = scraper.scrape_reviews("B09V3KXJPB", max_pages=5)
    print(f"Total reviews scraped: {len(reviews)}")

    with open("amazon_reviews.json", "w") as f:
        json.dump(reviews, f, indent=2)

Handling Amazon Anti-Bot Protections

1. CAPTCHA

Amazon presents CAPTCHAs frequently. Use residential proxies with rotation and implement exponential backoff when CAPTCHAs are detected.
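A minimal backoff helper could look like the following sketch; the 30-second base and 600-second cap are illustrative starting points, not Amazon-specific thresholds:

```python
import random

def captcha_backoff(attempt, base=30.0, cap=600.0):
    """Exponential backoff with jitter for consecutive CAPTCHA hits.

    attempt is the 0-based count of consecutive CAPTCHA pages seen.
    """
    delay = min(cap, base * (2 ** attempt))
    # Jitter avoids retrying on a predictable schedule
    return random.uniform(delay / 2, delay)
```

On each CAPTCHA detection you would sleep for `captcha_backoff(attempt)` seconds, switch proxies, and reset `attempt` to zero after the first successful page.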

2. Rate Limiting

Amazon blocks IPs after moderate activity. Use 3-7 second delays, rotate proxies every 5-10 requests, and implement session rotation.
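The "rotate every 5-10 requests" rule can be sketched as a small round-robin helper; the class name and window size here are illustrative:

```python
import itertools
import random

class ProxyRotator:
    """Cycle through a proxy pool, switching after every 5-10 requests."""

    def __init__(self, proxy_urls):
        self._pool = itertools.cycle(proxy_urls)
        self._current = next(self._pool)
        self._remaining = random.randint(5, 10)

    def get(self):
        """Return the proxy URL for the next request, rotating when the window expires."""
        if self._remaining <= 0:
            self._current = next(self._pool)
            self._remaining = random.randint(5, 10)
        self._remaining -= 1
        return self._current
```

Plugging this in would mean calling `rotator.get()` inside the scraper's `_get_proxies` instead of reusing one fixed `proxy_url`.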

3. Bot Detection

Amazon uses sophisticated fingerprinting. Rotate user agents, maintain consistent cookies within sessions, and vary request patterns.
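Note that the scraper above picks a fresh `self.ua.random` on every request; pairing a changing user agent with persistent session cookies is itself a fingerprint signal. A common refinement is to pin one identity per session and only rotate it when you rotate the session. A sketch, with an intentionally small illustrative user-agent pool:

```python
import random

# Illustrative pool; in practice keep this list larger and current
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15",
]

class SessionIdentity:
    """Pin one user agent for the lifetime of a session so headers and cookies agree."""

    def __init__(self):
        self.user_agent = random.choice(USER_AGENTS)

    def headers(self):
        return {
            "User-Agent": self.user_agent,
            "Accept-Language": "en-US,en;q=0.9",
        }
```

Creating one `SessionIdentity` per `requests.Session` keeps each session's fingerprint internally consistent.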

4. Regional Variations

Amazon operates region-specific domains (amazon.com, amazon.co.uk, etc.). Use proxies from the target region for accurate data.
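The scraper's `domain` constructor parameter already supports this; a small lookup table (the mapping below is illustrative) makes the pairing of marketplace and proxy region explicit:

```python
# Illustrative mapping of marketplaces to Amazon domains and proxy regions
MARKETPLACES = {
    "US": {"domain": "com", "proxy_country": "us"},
    "UK": {"domain": "co.uk", "proxy_country": "gb"},
    "DE": {"domain": "de", "proxy_country": "de"},
    "JP": {"domain": "co.jp", "proxy_country": "jp"},
}

def base_url_for(market):
    """Build the marketplace base URL, mirroring the scraper's f-string."""
    return f"https://www.amazon.{MARKETPLACES[market]['domain']}"
```

For example, scraping the German marketplace would pair `AmazonReviewScraper(domain=MARKETPLACES["DE"]["domain"])` with a proxy exiting in Germany.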

Proxy Recommendations

Proxy Type              Success Rate   Best For
Residential Rotating    75-85%         Review scraping
Mobile                  85-95%         Bypassing CAPTCHAs
ISP                     70-80%         Consistent sessions
Datacenter              15-25%         Not recommended

Rotating residential proxies are essential for Amazon review scraping.

Legal Considerations

  1. Terms of Service: Amazon’s ToS prohibits automated data collection.
  2. Review Copyright: Reviews are copyrighted by their authors.
  3. Legal History: Amazon has pursued legal action against scraping operations, and broader case law on scraping public data (such as hiQ Labs v. LinkedIn) remains unsettled.
  4. Commercial Use: Get legal counsel before using scraped reviews commercially.

See our web scraping compliance guide for details.

Frequently Asked Questions

How many Amazon reviews can I scrape per day?

With rotating residential proxies and 3-7 second delays, expect to scrape 5,000-15,000 reviews per day. CAPTCHAs may reduce throughput.

Can I scrape Amazon reviews across different countries?

Yes. Change the domain parameter (com, co.uk, de, co.jp, etc.) to access reviews from different Amazon marketplaces. Use proxies from the target country.

How do I handle Amazon CAPTCHAs?

Rotate proxies immediately when CAPTCHAs appear. Use residential proxies with clean IP reputation. If CAPTCHAs persist, switch to mobile proxies or implement CAPTCHA-solving services.

Can I scrape Amazon review images?

Yes. Review images are embedded in the review HTML with direct CDN URLs that can be downloaded.
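A minimal downloader sketch is below. It uses the standard library for brevity; in practice you would route these requests through the same proxies and delays as the review scraper itself. The function names are illustrative:

```python
import os
import urllib.request
from urllib.parse import urlparse

def image_filename(url):
    """Derive a local filename from the path portion of a CDN URL."""
    return os.path.basename(urlparse(url).path)

def download_review_images(image_urls, out_dir="review_images"):
    """Save each image locally and return the written file paths."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for url in image_urls:
        path = os.path.join(out_dir, image_filename(url))
        with urllib.request.urlopen(url, timeout=30) as resp, open(path, "wb") as f:
            f.write(resp.read())
        saved.append(path)
    return saved
```

The URLs collected in each review's `images` list can be passed straight to `download_review_images`.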

Advanced Techniques

Handling Pagination

Most websites paginate their results. Implement robust pagination handling:

def scrape_all_pages(scraper, base_url, max_pages=20):
    all_data = []
    for page in range(1, max_pages + 1):
        url = f"{base_url}?page={page}"
        # scraper.search is a placeholder for whatever method fetches one page
        results = scraper.search(url)
        if not results:
            break
        all_data.extend(results)
        print(f"Page {page}: {len(results)} items (total: {len(all_data)})")
        time.sleep(random.uniform(2, 5))
    return all_data

Data Validation and Cleaning

Always validate scraped data before storage:

def validate_data(item):
    required_fields = ["title", "url"]
    for field in required_fields:
        if not item.get(field):
            return False
    return True

def clean_text(text):
    if not text:
        return None
    # Remove extra whitespace
    import re
    text = re.sub(r'\s+', ' ', text).strip()
    # Remove HTML entities
    import html
    text = html.unescape(text)
    return text

# Apply to results
cleaned = [item for item in results if validate_data(item)]
for item in cleaned:
    item["title"] = clean_text(item.get("title"))

Monitoring and Alerting

Build monitoring into your scraping pipeline:

import logging
from datetime import datetime

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

class ScrapingMonitor:
    def __init__(self):
        self.start_time = datetime.now()
        self.requests = 0
        self.errors = 0
        self.items = 0

    def log_request(self, success=True):
        self.requests += 1
        if not success:
            self.errors += 1
        if self.requests % 50 == 0:
            elapsed = (datetime.now() - self.start_time).total_seconds()
            rate = self.requests / max(elapsed, 1) * 60
            logger.info(f"Requests: {self.requests}, Errors: {self.errors}, "
                       f"Items: {self.items}, Rate: {rate:.1f}/min")

    def log_item(self, count=1):
        self.items += count

Error Handling and Retry Logic

Implement robust error handling:

import time
from requests.exceptions import RequestException

def retry_request(func, max_retries=3, base_delay=5):
    for attempt in range(max_retries):
        try:
            return func()
        except RequestException as e:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed: {e}. Retrying in {delay}s...")
            time.sleep(delay)
    return None

Data Storage Options

Choose the right storage for your scraping volume:

import json
import csv
import sqlite3

class DataStorage:
    def __init__(self, db_path="scraped_data.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute('''CREATE TABLE IF NOT EXISTS items
            (id TEXT PRIMARY KEY, title TEXT, url TEXT, data JSON, scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP)''')

    def save(self, item):
        self.conn.execute(
            "INSERT OR REPLACE INTO items (id, title, url, data) VALUES (?, ?, ?, ?)",
            (item.get("id"), item.get("title"), item.get("url"), json.dumps(item))
        )
        self.conn.commit()

    def export_json(self, output_path):
        cursor = self.conn.execute("SELECT data FROM items")
        items = [json.loads(row[0]) for row in cursor.fetchall()]
        with open(output_path, "w") as f:
            json.dump(items, f, indent=2)

    def export_csv(self, output_path):
        cursor = self.conn.execute("SELECT * FROM items")
        rows = cursor.fetchall()
        with open(output_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["id", "title", "url", "data", "scraped_at"])
            writer.writerows(rows)

General Scraping FAQs

How often should I scrape data?

The optimal frequency depends on how often the source data changes. For real-time data (stock prices, news), scrape every few minutes. For product listings, daily or weekly is usually sufficient. For reviews, weekly scraping captures new feedback without excessive load.

What happens if my IP gets blocked?

If you receive 403 or 429 status codes, your IP is likely blocked. Switch to a different proxy, implement exponential backoff, and slow your request rate. Rotating residential proxies automatically switch IPs to prevent blocks.

Should I use headless browsers or HTTP requests?

Use HTTP requests (with BeautifulSoup or similar) whenever possible, since they are faster and use fewer resources. Switch to headless browsers (Selenium, Playwright) only when JavaScript rendering is required for the data you need.

How do I handle CAPTCHAs?

CAPTCHAs indicate aggressive bot detection. To minimize them: use residential or mobile proxies, implement realistic delays, rotate user agents, and maintain consistent session behavior. For persistent CAPTCHAs, consider CAPTCHA-solving services as a last resort.

Can I scrape data commercially?

The legality of commercial scraping depends on the platform’s ToS, the type of data collected, and your jurisdiction. Public data is generally more permissible, but always consult legal counsel for commercial use cases. See our compliance guide.

Conclusion

Amazon review scraping requires careful handling of their aggressive anti-bot systems. Requests-based scraping works well for server-rendered review pages when paired with residential proxy rotation and proper rate limiting. Focus on rotating proxies and user agents for sustained access.

For more e-commerce scraping guides, visit our e-commerce proxy guide and proxy provider comparisons.

