How to Scrape Amazon Product Reviews in 2026
Amazon is the world’s largest e-commerce platform, hosting hundreds of millions of product reviews that shape purchasing decisions for shoppers worldwide. For product researchers, brand managers, competitive analysts, and sentiment analysis teams, scraping Amazon reviews provides unmatched insight into customer satisfaction, product issues, and market trends.
This guide covers how to scrape Amazon product reviews using Python, handle Amazon’s sophisticated anti-bot protections, and integrate proxies for reliable large-scale extraction.
What Data Can You Extract from Amazon Reviews?
Amazon reviews contain detailed data:
- Review text (title, body content)
- Star rating (1-5 stars)
- Reviewer information (name, verified purchase badge)
- Review date
- Helpful vote count
- Product variant (color, size purchased)
- Images and videos attached to reviews
- Aggregate rating breakdown (star distribution)
- “Most helpful” and “Most recent” sorted reviews
Example JSON Output
{
"asin": "B09V3KXJPB",
"product_name": "Apple AirPods Pro (2nd Generation)",
"overall_rating": 4.7,
"total_reviews": 125430,
"review": {
"id": "R2ABCDEFGHIJK",
"title": "Best noise canceling earbuds I've owned",
"body": "The ANC is noticeably better than the first generation...",
"rating": 5,
"date": "March 5, 2026",
"reviewer": "Tech Enthusiast",
"verified_purchase": true,
"helpful_votes": 342,
"variant": "USB-C",
"images": ["https://images-na.ssl-images-amazon.com/..."]
}
}
Prerequisites
pip install requests beautifulsoup4 lxml fake-useragent selenium
Amazon has some of the most aggressive anti-bot protections of any major e-commerce platform. Residential proxies are essential for reliable access at scale.
Method 1: Scraping Amazon Reviews with Requests
Amazon renders review pages server-side, making requests-based scraping viable with proper headers and proxies.
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
import json
import time
import random
import re
class AmazonReviewScraper:
def __init__(self, proxy_url=None, domain="com"):
self.session = requests.Session()
self.ua = UserAgent()
self.proxy_url = proxy_url
self.base_url = f"https://www.amazon.{domain}"
def _get_headers(self):
return {
"User-Agent": self.ua.random,
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Referer": self.base_url,
"Connection": "keep-alive",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
}
def _get_proxies(self):
if self.proxy_url:
return {"http": self.proxy_url, "https": self.proxy_url}
return None
def scrape_reviews(self, asin, max_pages=10, sort_by="recent"):
"""Scrape reviews for a product by ASIN."""
all_reviews = []
sort_param = "recent" if sort_by == "recent" else "helpful"
for page in range(1, max_pages + 1):
url = f"{self.base_url}/product-reviews/{asin}/ref=cm_cr_getr_d_paging_btm_next_{page}?pageNumber={page}&sortBy={sort_param}"
try:
response = self.session.get(
url,
headers=self._get_headers(),
proxies=self._get_proxies(),
timeout=30
)
response.raise_for_status()
# Check for CAPTCHA
if "captcha" in response.text.lower() or "robot" in response.text.lower():
print(f"CAPTCHA detected on page {page}. Rotating proxy recommended.")
time.sleep(30)
continue
soup = BeautifulSoup(response.text, "lxml")
reviews = self._parse_reviews(soup)
if not reviews:
print(f"No more reviews found on page {page}")
break
all_reviews.extend(reviews)
print(f"Page {page}: {len(reviews)} reviews (total: {len(all_reviews)})")
time.sleep(random.uniform(3, 7))
except requests.RequestException as e:
print(f"Error on page {page}: {e}")
time.sleep(10)
continue
return all_reviews
def _parse_reviews(self, soup):
"""Parse individual reviews from the reviews page."""
reviews = []
review_divs = soup.select("div[data-hook='review']")
for div in review_divs:
try:
review = {}
# Title
title_elem = div.select_one("a[data-hook='review-title'] span:last-child, a[data-hook='review-title']")
review["title"] = title_elem.get_text(strip=True) if title_elem else None
# Rating
rating_elem = div.select_one("i[data-hook='review-star-rating'] span, i[data-hook='cmps-review-star-rating'] span")
if rating_elem:
rating_text = rating_elem.get_text(strip=True)
match = re.search(r'(\d+\.?\d*)', rating_text)
review["rating"] = float(match.group(1)) if match else None
# Body
body_elem = div.select_one("span[data-hook='review-body'] span")
review["body"] = body_elem.get_text(strip=True) if body_elem else None
# Date
date_elem = div.select_one("span[data-hook='review-date']")
review["date"] = date_elem.get_text(strip=True) if date_elem else None
# Reviewer
reviewer_elem = div.select_one("span.a-profile-name")
review["reviewer"] = reviewer_elem.get_text(strip=True) if reviewer_elem else None
# Verified purchase
verified_elem = div.select_one("span[data-hook='avp-badge']")
review["verified_purchase"] = verified_elem is not None
# Helpful votes
helpful_elem = div.select_one("span[data-hook='helpful-vote-statement']")
if helpful_elem:
text = helpful_elem.get_text(strip=True)
match = re.search(r'(\d+)', text)
review["helpful_votes"] = int(match.group(1)) if match else 1
else:
review["helpful_votes"] = 0
# Review ID
review["review_id"] = div.get("id")
if review.get("body") or review.get("title"):
reviews.append(review)
except Exception as e:
continue
return reviews
def get_product_rating_summary(self, asin):
"""Get the overall rating and breakdown for a product."""
url = f"{self.base_url}/product-reviews/{asin}"
try:
response = self.session.get(
url,
headers=self._get_headers(),
proxies=self._get_proxies(),
timeout=30
)
response.raise_for_status()
soup = BeautifulSoup(response.text, "lxml")
summary = {}
# Overall rating
overall = soup.select_one("span[data-hook='rating-out-of-text']")
if overall:
match = re.search(r'(\d+\.?\d*)', overall.get_text())
summary["overall_rating"] = float(match.group(1)) if match else None
# Total reviews
total = soup.select_one("div[data-hook='total-review-count'] span")
if total:
text = total.get_text(strip=True).replace(",", "")
match = re.search(r'(\d+)', text)
summary["total_reviews"] = int(match.group(1)) if match else None
# Star breakdown
breakdown = {}
star_rows = soup.select("table#histogramTable tr")
for row in star_rows:
star_label = row.select_one("td:first-child a")
pct = row.select_one("td:nth-child(3) a")
if star_label and pct:
star = star_label.get_text(strip=True)
percentage = pct.get_text(strip=True)
breakdown[star] = percentage
summary["breakdown"] = breakdown
return summary
except Exception as e:
print(f"Error: {e}")
return None
# Usage
if __name__ == "__main__":
scraper = AmazonReviewScraper(proxy_url="http://user:pass@proxy:port")
# Get rating summary
summary = scraper.get_product_rating_summary("B09V3KXJPB")
print(json.dumps(summary, indent=2))
# Scrape reviews
reviews = scraper.scrape_reviews("B09V3KXJPB", max_pages=5)
print(f"Total reviews scraped: {len(reviews)}")
with open("amazon_reviews.json", "w") as f:
json.dump(reviews, f, indent=2)
Handling Amazon Anti-Bot Protections
1. CAPTCHA
Amazon presents CAPTCHAs frequently. Use residential proxies with rotation and implement exponential backoff when CAPTCHAs are detected.
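A minimal sketch of that pattern, reusing the AmazonReviewScraper class above; the fetch_with_backoff helper and the proxy_pool list are illustrative additions, not part of the original code:
import time
import random
def fetch_with_backoff(scraper, url, proxy_pool, max_attempts=4, base_delay=15):
    # Retry a review page with a fresh proxy and exponential backoff whenever a CAPTCHA appears.
    for attempt in range(max_attempts):
        scraper.proxy_url = random.choice(proxy_pool)  # new exit IP on every attempt
        response = scraper.session.get(url, headers=scraper._get_headers(),
                                       proxies=scraper._get_proxies(), timeout=30)
        if "captcha" not in response.text.lower():
            return response
        wait = base_delay * (2 ** attempt)  # 15s, 30s, 60s, 120s
        print(f"CAPTCHA on attempt {attempt + 1}; rotating proxy and waiting {wait}s")
        time.sleep(wait)
    return None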
2. Rate Limiting
Amazon blocks IPs after moderate activity. Use 3-7 second delays, rotate proxies every 5-10 requests, and implement session rotation.
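One way to wire that into the class above is a small subclass that swaps both the proxy and the session after a short batch of requests; the proxy_pool list and the rotation threshold below are assumptions for the sketch:
import random
import requests
class RotatingAmazonReviewScraper(AmazonReviewScraper):
    def __init__(self, proxy_pool, domain="com", rotate_every=7):
        super().__init__(proxy_url=random.choice(proxy_pool), domain=domain)
        self.proxy_pool = proxy_pool
        self.rotate_every = rotate_every
        self._request_count = 0
    def _maybe_rotate(self):
        # Fresh IP and a fresh cookie jar roughly every 5-10 requests.
        self._request_count += 1
        if self._request_count % self.rotate_every == 0:
            self.proxy_url = random.choice(self.proxy_pool)
            self.session = requests.Session()
Call self._maybe_rotate() right after each self.session.get(...) in scrape_reviews so rotation happens on the request path.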
3. Bot Detection
Amazon uses sophisticated fingerprinting. Rotate user agents, maintain consistent cookies within sessions, and vary request patterns.
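One concrete adjustment to the _get_headers method above: pick a user agent once per session and reuse it until the session rotates, instead of randomizing on every request (the _session_ua attribute is introduced here purely for illustration):
def _get_headers(self):
    # Keep the fingerprint consistent for the lifetime of a session,
    # then let session rotation pick a new user agent.
    if not hasattr(self, "_session_ua"):
        self._session_ua = self.ua.random
    return {
        "User-Agent": self._session_ua,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": self.base_url,
        "Connection": "keep-alive",
    }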
4. Regional Variations
Amazon operates region-specific domains (amazon.com, amazon.co.uk, etc.). Use proxies from the target region for accurate data.
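For example, a sketch of pulling the same ASIN from the German marketplace; the proxy hostname below is a placeholder for a Germany-located endpoint from your provider:
de_scraper = AmazonReviewScraper(
    proxy_url="http://user:pass@de.residential-proxy.example:8000",  # placeholder endpoint
    domain="de",
)
de_reviews = de_scraper.scrape_reviews("B09V3KXJPB", max_pages=3)
The data-hook selectors generally carry over between marketplaces, but dates, badges, and helpful-vote text come back localized, so any locale-sensitive parsing may need adjustment.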
Proxy Recommendations
| Proxy Type | Success Rate | Best For |
|---|---|---|
| Residential Rotating | 75-85% | Review scraping |
| Mobile | 85-95% | Bypassing CAPTCHAs |
| ISP | 70-80% | Consistent sessions |
| Datacenter | 15-25% | Not recommended |
Rotating residential proxies are essential for Amazon review scraping.
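Most rotating residential providers expose a single gateway endpoint that assigns a new exit IP per request or per sticky session, and you pass it to the scraper exactly like a static proxy. The hostnames and session syntax below are placeholders; check your provider's documentation:
# New residential exit IP on (roughly) every request.
rotating_scraper = AmazonReviewScraper(
    proxy_url="http://username:password@rotating.residential-provider.example:10000"
)
# Sticky-session style: keep one IP for a batch of pages, then change the
# session label to force a new IP (syntax varies by provider).
sticky_scraper = AmazonReviewScraper(
    proxy_url="http://username-session-abc123:password@rotating.residential-provider.example:10000"
)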
Legal Considerations
- Terms of Service: Amazon’s ToS prohibits automated data collection.
- Review Copyright: Reviews are copyrighted by their authors.
- Legal History: Amazon has taken legal action against scraping operations, and broader case law on scraping public data (such as hiQ Labs v. LinkedIn) is still evolving.
- Commercial Use: Get legal counsel before using scraped reviews commercially.
See our web scraping compliance guide for details.
Frequently Asked Questions
How many Amazon reviews can I scrape per day?
With rotating residential proxies and 3-7 second delays, expect to scrape 5,000-15,000 reviews per day. CAPTCHAs may reduce throughput.
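As a rough back-of-the-envelope check (every number below is an assumption, not a measurement): Amazon returns about ten reviews per page, so daily volume mostly depends on your per-page delay and how often CAPTCHAs stall a worker.
def estimate_daily_reviews(avg_delay_s=5.0, avg_request_s=2.0, captcha_penalty_s=30.0,
                           captcha_rate=0.15, reviews_per_page=10, active_hours=8):
    # Hypothetical single-worker estimate; tune the inputs to match what you observe.
    seconds_per_page = avg_request_s + avg_delay_s + captcha_rate * captcha_penalty_s
    pages_per_day = active_hours * 3600 / seconds_per_page
    return int(pages_per_day * reviews_per_page)
print(estimate_daily_reviews())                                   # ~25,000 with optimistic defaults
print(estimate_daily_reviews(active_hours=4, captcha_rate=0.3))   # ~9,000, inside the range above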
Can I scrape Amazon reviews across different countries?
Yes. Change the domain parameter (com, co.uk, de, co.jp, etc.) to access reviews from different Amazon marketplaces. Use proxies from the target country.
How do I handle Amazon CAPTCHAs?
Rotate proxies immediately when CAPTCHAs appear. Use residential proxies with clean IP reputation. If CAPTCHAs persist, switch to mobile proxies or implement CAPTCHA-solving services.
Can I scrape Amazon review images?
Yes. Review images are embedded in the review HTML with direct CDN URLs that can be downloaded.
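A sketch of collecting those URLs from a parsed review block, reusing the BeautifulSoup elements that _parse_reviews iterates over; the data-hook and the thumbnail-suffix stripping are heuristics to verify against current markup:
import re
def extract_review_images(review_div):
    urls = []
    for img in review_div.select("img[data-hook='review-image-tile']"):
        src = img.get("src") or img.get("data-src")
        if src:
            # Thumbnails usually carry a size modifier such as ._SY88. before the
            # extension; stripping it typically yields the full-resolution image.
            urls.append(re.sub(r'\._[^.]*\.', '.', src))
    return urls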
Advanced Techniques
Handling Pagination
Most websites paginate their results. Implement robust pagination handling:
# Note: scraper.search below is a generic placeholder for your own fetch-and-parse method.
def scrape_all_pages(scraper, base_url, max_pages=20):
all_data = []
for page in range(1, max_pages + 1):
url = f"{base_url}?page={page}"
results = scraper.search(url)
if not results:
break
all_data.extend(results)
print(f"Page {page}: {len(results)} items (total: {len(all_data)})")
time.sleep(random.uniform(2, 5))
return all_data
Data Validation and Cleaning
Always validate scraped data before storage:
def validate_data(item):
required_fields = ["title", "url"]
for field in required_fields:
if not item.get(field):
return False
return True
def clean_text(text):
if not text:
return None
# Remove extra whitespace
import re
text = re.sub(r'\s+', ' ', text).strip()
# Remove HTML entities
import html
text = html.unescape(text)
return text
# Apply to results
cleaned = [item for item in results if validate_data(item)]
for item in cleaned:
item["title"] = clean_text(item.get("title"))Monitoring and Alerting
Build monitoring into your scraping pipeline:
import logging
from datetime import datetime
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
class ScrapingMonitor:
def __init__(self):
self.start_time = datetime.now()
self.requests = 0
self.errors = 0
self.items = 0
def log_request(self, success=True):
self.requests += 1
if not success:
self.errors += 1
if self.requests % 50 == 0:
elapsed = (datetime.now() - self.start_time).total_seconds()
rate = self.requests / max(elapsed, 1) * 60
logger.info(f"Requests: {self.requests}, Errors: {self.errors}, "
f"Items: {self.items}, Rate: {rate:.1f}/min")
def log_item(self, count=1):
self.items += count
Error Handling and Retry Logic
Implement robust error handling:
import time
from requests.exceptions import RequestException
def retry_request(func, max_retries=3, base_delay=5):
for attempt in range(max_retries):
try:
return func()
except RequestException as e:
if attempt == max_retries - 1:
raise
delay = base_delay * (2 ** attempt)
print(f"Attempt {attempt + 1} failed: {e}. Retrying in {delay}s...")
time.sleep(delay)
return None
Data Storage Options
Choose the right storage for your scraping volume:
import json
import csv
import sqlite3
class DataStorage:
def __init__(self, db_path="scraped_data.db"):
self.conn = sqlite3.connect(db_path)
self.conn.execute('''CREATE TABLE IF NOT EXISTS items
(id TEXT PRIMARY KEY, title TEXT, url TEXT, data JSON, scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP)''')
def save(self, item):
self.conn.execute(
"INSERT OR REPLACE INTO items (id, title, url, data) VALUES (?, ?, ?, ?)",
(item.get("id"), item.get("title"), item.get("url"), json.dumps(item))
)
self.conn.commit()
def export_json(self, output_path):
cursor = self.conn.execute("SELECT data FROM items")
items = [json.loads(row[0]) for row in cursor.fetchall()]
with open(output_path, "w") as f:
json.dump(items, f, indent=2)
def export_csv(self, output_path):
cursor = self.conn.execute("SELECT * FROM items")
rows = cursor.fetchall()
with open(output_path, "w", newline="") as f:
writer = csv.writer(f)
writer.writerow(["id", "title", "url", "data", "scraped_at"])
writer.writerows(rows)
Frequently Asked Questions
How often should I scrape data?
The optimal frequency depends on how often the source data changes. For real-time data (stock prices, news), scrape every few minutes. For product listings, daily or weekly is usually sufficient. For reviews, weekly scraping captures new feedback without excessive load.
What happens if my IP gets blocked?
If you receive 403 or 429 status codes, your IP is likely blocked. Switch to a different proxy, implement exponential backoff, and slow your request rate. Rotating residential proxies automatically switch IPs to prevent blocks.
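A small sketch of that check, in the same spirit as the retry_request helper above; the proxy_pool list is your own collection of proxy URLs, not something defined earlier:
import time
import random
def get_with_block_handling(session, url, headers, proxy_pool, max_retries=3):
    # Treat 403/429 as a block signal: switch proxy and back off before retrying.
    for attempt in range(max_retries):
        proxy = random.choice(proxy_pool)
        response = session.get(url, headers=headers,
                               proxies={"http": proxy, "https": proxy}, timeout=30)
        if response.status_code not in (403, 429):
            return response
        wait = 10 * (2 ** attempt)  # 10s, 20s, 40s
        print(f"Blocked with HTTP {response.status_code}; rotating proxy and waiting {wait}s")
        time.sleep(wait)
    return None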
Should I use headless browsers or HTTP requests?
Use HTTP requests (with BeautifulSoup or similar) whenever possible — they are faster and use less resources. Switch to headless browsers (Selenium, Playwright) only when JavaScript rendering is required for the data you need.
How do I handle CAPTCHAs?
CAPTCHAs indicate aggressive bot detection. To minimize them: use residential or mobile proxies, implement realistic delays, rotate user agents, and maintain consistent session behavior. For persistent CAPTCHAs, consider CAPTCHA-solving services as a last resort.
Can I scrape data commercially?
The legality of commercial scraping depends on the platform’s ToS, the type of data collected, and your jurisdiction. Public data is generally more permissible, but always consult legal counsel for commercial use cases. See our compliance guide.
Conclusion
Amazon review scraping requires careful handling of their aggressive anti-bot systems. Requests-based scraping works well for server-rendered review pages when paired with residential proxy rotation and proper rate limiting. Focus on rotating proxies and user agents for sustained access.
For more e-commerce scraping guides, visit our e-commerce proxy guide and proxy provider comparisons.
Related Reading
- How to Scrape AliExpress Product Data
- How to Scrape Apollo.io Contact Data in 2026
- aiohttp + BeautifulSoup: Async Python Scraping
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)
- API vs Web Scraping: When You Need Proxies (and When You Don’t)
- ASEAN Data Protection Laws: A Web Scraping Compliance Matrix