How to Scrape Amazon Product Data with Proxies: 2026 Python Guide
Amazon is the world’s largest e-commerce marketplace, hosting hundreds of millions of product listings across dozens of categories. For businesses conducting competitive analysis, price monitoring, or market research, extracting Amazon product data programmatically is not just useful — it is essential.
However, Amazon employs some of the most sophisticated anti-scraping defenses on the internet. Without a proper proxy infrastructure, your scraper will be blocked within minutes. In this guide, we walk through a complete Python-based approach to scraping Amazon product data using rotating mobile proxies and industry best practices.
Why You Need Proxies to Scrape Amazon
Amazon invests heavily in bot detection. Their systems analyze request patterns, headers, IP reputation, and behavioral signals to identify automated traffic. Here is what happens when you scrape without proxies:
- IP bans: Amazon blocks your IP address after detecting unusual request volumes.
- CAPTCHAs: You encounter verification challenges that halt automation.
- Misleading data: Amazon may serve altered prices or product details to suspected bots.
- Legal risk: Aggressive scraping from a single IP draws unwanted attention.
Mobile proxies are particularly effective for Amazon scraping because they route traffic through real mobile carrier IPs. These addresses are shared by thousands of legitimate users, making it nearly impossible for Amazon to block them without affecting real customers.
Setting Up Your Environment
Before writing any code, install the necessary Python packages:
```bash
pip install requests beautifulsoup4 lxml fake-useragent
```

You will also need access to a rotating proxy service. For this tutorial, we use a residential or mobile proxy endpoint that handles rotation automatically.
Building the Amazon Scraper
Step 1: Configure Proxy and Headers
```python
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
import time
import random
import json

# Proxy configuration
PROXY_HOST = "proxy.dataresearchtools.com"
PROXY_PORT = "8080"
PROXY_USER = "your_username"
PROXY_PASS = "your_password"

proxies = {
    "http": f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}",
    "https": f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}",
}

ua = UserAgent()


def get_headers():
    """Generate realistic browser headers for each request."""
    return {
        "User-Agent": ua.random,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
    }
```

Step 2: Fetch Product Pages with Retry Logic
```python
def fetch_page(url, max_retries=3):
    """Fetch a page with retry logic and proxy rotation."""
    for attempt in range(max_retries):
        try:
            response = requests.get(
                url,
                headers=get_headers(),
                proxies=proxies,
                timeout=30,
            )
            if response.status_code == 200:
                # Check for CAPTCHA page
                if "captcha" in response.text.lower():
                    print(f"CAPTCHA detected on attempt {attempt + 1}")
                    time.sleep(random.uniform(5, 15))
                    continue
                return response.text
            elif response.status_code == 503:
                print(f"Service unavailable, retrying ({attempt + 1}/{max_retries})")
                time.sleep(random.uniform(3, 8))
            else:
                print(f"Status {response.status_code} on attempt {attempt + 1}")
                time.sleep(random.uniform(2, 5))
        except requests.exceptions.RequestException as e:
            print(f"Request error: {e}")
            time.sleep(random.uniform(3, 8))
    return None
```

Step 3: Parse Product Data
```python
def parse_product_page(html):
    """Extract structured product data from an Amazon product page."""
    soup = BeautifulSoup(html, "lxml")
    product = {}

    # Product title
    title_tag = soup.select_one("#productTitle")
    product["title"] = title_tag.get_text(strip=True) if title_tag else None

    # Price
    price_tag = soup.select_one("span.a-price span.a-offscreen")
    product["price"] = price_tag.get_text(strip=True) if price_tag else None

    # Rating
    rating_tag = soup.select_one("span[data-hook='rating-out-of-text']")
    if not rating_tag:
        rating_tag = soup.select_one("#acrPopover span.a-size-base")
    product["rating"] = rating_tag.get_text(strip=True) if rating_tag else None

    # Review count
    review_tag = soup.select_one("#acrCustomerReviewText")
    product["review_count"] = review_tag.get_text(strip=True) if review_tag else None

    # Availability
    avail_tag = soup.select_one("#availability span")
    product["availability"] = avail_tag.get_text(strip=True) if avail_tag else None

    # Product features / bullet points
    feature_tags = soup.select("#feature-bullets ul li span.a-list-item")
    product["features"] = [f.get_text(strip=True) for f in feature_tags if f.get_text(strip=True)]

    # ASIN from URL or page
    asin_tag = soup.select_one("input#ASIN")
    product["asin"] = asin_tag["value"] if asin_tag else None

    # Brand
    brand_tag = soup.select_one("#bylineInfo")
    product["brand"] = brand_tag.get_text(strip=True) if brand_tag else None

    # Images
    img_tags = soup.select("#altImages ul li img")
    product["images"] = [img.get("src") for img in img_tags if img.get("src") and "sprite" not in img.get("src", "")]

    return product
```

Step 4: Scrape Search Results
```python
def scrape_search_results(keyword, num_pages=3):
    """Scrape Amazon search results for a given keyword."""
    all_products = []
    for page in range(1, num_pages + 1):
        url = f"https://www.amazon.com/s?k={keyword.replace(' ', '+')}&page={page}"
        print(f"Scraping search page {page} for '{keyword}'...")
        html = fetch_page(url)
        if not html:
            print(f"Failed to fetch page {page}")
            continue
        soup = BeautifulSoup(html, "lxml")
        items = soup.select("div[data-asin][data-component-type='s-search-result']")
        for item in items:
            asin = item.get("data-asin", "")
            if not asin:
                continue
            title_tag = item.select_one("h2 a span")
            price_whole = item.select_one("span.a-price-whole")
            price_frac = item.select_one("span.a-price-fraction")
            rating_tag = item.select_one("span.a-icon-alt")
            product = {
                "asin": asin,
                "title": title_tag.get_text(strip=True) if title_tag else None,
                "price": f"{price_whole.get_text(strip=True)}{price_frac.get_text(strip=True)}" if price_whole and price_frac else None,
                "rating": rating_tag.get_text(strip=True) if rating_tag else None,
                "url": f"https://www.amazon.com/dp/{asin}",
            }
            all_products.append(product)
        # Respectful delay between pages
        time.sleep(random.uniform(2, 5))
    return all_products
```

Step 5: Run the Complete Pipeline
```python
def main():
    # Scrape search results
    keyword = "wireless headphones"
    search_results = scrape_search_results(keyword, num_pages=3)
    print(f"Found {len(search_results)} products in search results")

    # Scrape individual product pages for detailed data
    detailed_products = []
    for i, product in enumerate(search_results[:10]):  # Limit to first 10
        print(f"Scraping product {i + 1}: {product['asin']}")
        html = fetch_page(product["url"])
        if html:
            details = parse_product_page(html)
            details["search_data"] = product
            detailed_products.append(details)
        time.sleep(random.uniform(3, 7))  # Respectful delay

    # Save results
    with open("amazon_products.json", "w", encoding="utf-8") as f:
        json.dump(detailed_products, f, indent=2, ensure_ascii=False)
    print(f"Saved {len(detailed_products)} detailed products")


if __name__ == "__main__":
    main()
```

Handling Amazon’s Anti-Bot Defenses
Amazon’s detection systems are multi-layered. Here are the key strategies to avoid blocks:
Request Throttling
Never send requests faster than a human would browse. Implement random delays between 2 and 7 seconds per request. For large-scale operations, distribute your scraping across longer time windows.
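As a concrete sketch, a small wrapper can enforce that delay on every call. The `polite_get` helper below is illustrative rather than part of the pipeline above; adjust the bounds to your risk tolerance:

```python
import time
import random

import requests

MIN_DELAY, MAX_DELAY = 2.0, 7.0  # seconds, per the guideline above

def polite_get(url, **kwargs):
    """Sleep a random, human-like interval before each request."""
    time.sleep(random.uniform(MIN_DELAY, MAX_DELAY))
    return requests.get(url, timeout=30, **kwargs)
```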
Header Rotation
Amazon checks for consistent User-Agent strings and missing headers. Rotate your User-Agent with each request and always include standard browser headers like Accept-Language and Accept-Encoding.
Session Management
Create new sessions periodically rather than reusing the same session for hundreds of requests. Each new session should pair with a fresh proxy IP.
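A minimal sketch of that pattern, reusing `get_headers()` and `proxies` from Step 1; the 25-request rotation threshold and the `product_urls` list are illustrative placeholders:

```python
import requests

def make_session():
    """Start a fresh session with new headers; a rotating proxy
    endpoint will typically hand it a new exit IP as well."""
    session = requests.Session()
    session.headers.update(get_headers())  # from Step 1
    session.proxies.update(proxies)        # from Step 1
    return session

session = make_session()
for i, url in enumerate(product_urls):   # product_urls: your URL list
    if i and i % 25 == 0:                 # rotate every 25 requests (arbitrary)
        session = make_session()
    response = session.get(url, timeout=30)
```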
Proxy Quality Matters
Not all proxies are equal for Amazon scraping. Datacenter proxies are detected almost instantly. Residential and mobile proxies provide the highest success rates because they use IP addresses assigned to real internet service providers and mobile carriers.
Structuring Your Extracted Data
For e-commerce data collection at scale, maintaining a clean data structure is critical. Here is a recommended schema:
```python
product_schema = {
    "asin": "B09V3KXJPB",
    "title": "Product Name",
    "price": "$29.99",
    "currency": "USD",
    "rating": "4.5 out of 5 stars",
    "review_count": "2,847 ratings",
    "availability": "In Stock",
    "brand": "Brand Name",
    "features": ["Feature 1", "Feature 2"],
    "category": "Electronics > Headphones",
    "images": ["url1", "url2"],
    "scraped_at": "2026-03-09T12:00:00Z",
}
```

Scaling Your Amazon Scraper
When you need to scrape thousands or millions of products, single-threaded scraping becomes impractical. Consider these scaling strategies:
- Concurrent requests: Use Python’s `concurrent.futures.ThreadPoolExecutor` to run multiple requests simultaneously, each through a different proxy (see the sketch after this list).
- Queue-based architecture: Use Redis or RabbitMQ to manage a queue of URLs to scrape, with multiple worker processes consuming from the queue.
- Database storage: Replace JSON file output with a proper database like PostgreSQL for efficient querying and deduplication.
- Scheduled runs: Set up cron jobs or task schedulers to run your scraper at regular intervals for ongoing price monitoring.
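Here is a minimal sketch of the concurrent approach from the first bullet, reusing `fetch_page` from Step 2. The `max_workers=8` value is an arbitrary starting point; keep it low enough that your aggregate request rate still respects the throttling advice above:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_many(urls, max_workers=8):
    """Fetch several pages in parallel, each via the rotating proxy."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(fetch_page, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            results[url] = future.result()  # None on failure (see Step 2)
    return results
```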
Legal and Ethical Considerations
Web scraping exists in a complex legal landscape. While scraping publicly available data is generally permissible, there are important boundaries:
- Respect robots.txt: Check Amazon’s robots.txt file and understand which paths they restrict.
- Terms of Service: Amazon’s ToS prohibits scraping. Understand the risks before proceeding.
- Rate limiting: Never overwhelm Amazon’s servers. Aggressive scraping can constitute a denial-of-service attack.
- Personal data: Avoid collecting personal information about individual sellers or reviewers.
- Data usage: Use scraped data for legitimate business purposes like market research and competitive analysis.
Common Pitfalls and Solutions
| Problem | Cause | Solution |
|---|---|---|
| Empty responses | IP blocked | Switch to mobile proxies |
| CAPTCHA pages | Too many requests | Increase delays, improve proxy rotation |
| Missing prices | Dynamic rendering | Use headless browser or look for JSON-LD data |
| Stale data | Cached responses | Add cache-busting query parameters |
| Inconsistent HTML | A/B testing | Build multiple parser fallbacks |
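For the missing-prices row, one fallback worth trying before reaching for a full headless browser is JSON-LD structured data. Not every Amazon page template exposes it, so treat the generic extractor below as opportunistic rather than guaranteed:

```python
import json

from bs4 import BeautifulSoup

def extract_json_ld(html):
    """Return any JSON-LD blobs embedded in the page (may be empty)."""
    soup = BeautifulSoup(html, "lxml")
    blobs = []
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            blobs.append(json.loads(tag.string or ""))
        except json.JSONDecodeError:
            continue
    return blobs
```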
Conclusion
Scraping Amazon product data requires a thoughtful approach combining proper proxy infrastructure, realistic request patterns, and robust parsing logic. The Python code examples in this guide provide a solid foundation for building your own Amazon scraper.
The most critical factor in successful Amazon scraping is your proxy infrastructure. Mobile proxies from DataResearchTools provide the highest success rates by routing your requests through genuine mobile carrier IP addresses that Amazon cannot easily distinguish from real user traffic.
For related scraping guides, explore our tutorials on web scraping with proxies and check our proxy glossary for technical terminology used throughout this article.
Related Reading
- How to Scrape Bing Search Results with Python and Proxies
- How to Scrape Booking.com Hotel Prices with Proxy Rotation
- aiohttp + BeautifulSoup: Async Python Scraping
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)
- API vs Web Scraping: When You Need Proxies (and When You Don’t)
- ASEAN Data Protection Laws: A Web Scraping Compliance Matrix