How to Scrape Amazon Product Data with Proxies: 2026 Python Guide
Amazon is the world’s largest e-commerce marketplace, hosting hundreds of millions of product listings across dozens of categories. For businesses conducting competitive analysis, price monitoring, or market research, extracting Amazon product data programmatically is not just useful — it is essential.
However, Amazon employs some of the most sophisticated anti-scraping defenses on the internet. Without a proper proxy infrastructure, your scraper will be blocked within minutes. In this guide, we walk through a complete Python-based approach to scraping Amazon product data using rotating mobile proxies and industry best practices.
Why You Need Proxies to Scrape Amazon
Amazon invests heavily in bot detection. Their systems analyze request patterns, headers, IP reputation, and behavioral signals to identify automated traffic. Here is what happens when you scrape without proxies:
- IP bans: Amazon blocks your IP address after detecting unusual request volumes.
- CAPTCHAs: You encounter verification challenges that halt automation.
- Misleading data: Amazon may serve altered prices or product details to suspected bots.
- Legal risk: Aggressive scraping from a single IP draws unwanted attention.
Mobile proxies are particularly effective for Amazon scraping because they route traffic through real mobile carrier IPs. These addresses are shared by thousands of legitimate users, making it nearly impossible for Amazon to block them without affecting real customers.
Setting Up Your Environment
Before writing any code, install the necessary Python packages:
```bash
pip install requests beautifulsoup4 lxml fake-useragent
```

You will also need access to a rotating proxy service. For this tutorial, we use a residential or mobile proxy endpoint that handles rotation automatically.
Building the Amazon Scraper
Step 1: Configure Proxy and Headers
```python
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
import time
import random
import json

# Proxy configuration
PROXY_HOST = "proxy.dataresearchtools.com"
PROXY_PORT = "8080"
PROXY_USER = "your_username"
PROXY_PASS = "your_password"

proxies = {
    "http": f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}",
    "https": f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}",
}

ua = UserAgent()


def get_headers():
    """Generate realistic browser headers for each request."""
    return {
        "User-Agent": ua.random,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
    }
```

Step 2: Fetch Product Pages with Retry Logic
```python
def fetch_page(url, max_retries=3):
    """Fetch a page with retry logic and proxy rotation."""
    for attempt in range(max_retries):
        try:
            response = requests.get(
                url,
                headers=get_headers(),
                proxies=proxies,
                timeout=30,
            )
            if response.status_code == 200:
                # Check for CAPTCHA page
                if "captcha" in response.text.lower():
                    print(f"CAPTCHA detected on attempt {attempt + 1}")
                    time.sleep(random.uniform(5, 15))
                    continue
                return response.text
            elif response.status_code == 503:
                print(f"Service unavailable, retrying ({attempt + 1}/{max_retries})")
                time.sleep(random.uniform(3, 8))
            else:
                print(f"Status {response.status_code} on attempt {attempt + 1}")
                time.sleep(random.uniform(2, 5))
        except requests.exceptions.RequestException as e:
            print(f"Request error: {e}")
            time.sleep(random.uniform(3, 8))
    return None
```

Step 3: Parse Product Data
```python
def parse_product_page(html):
    """Extract structured product data from an Amazon product page."""
    soup = BeautifulSoup(html, "lxml")
    product = {}

    # Product title
    title_tag = soup.select_one("#productTitle")
    product["title"] = title_tag.get_text(strip=True) if title_tag else None

    # Price
    price_tag = soup.select_one("span.a-price span.a-offscreen")
    product["price"] = price_tag.get_text(strip=True) if price_tag else None

    # Rating
    rating_tag = soup.select_one("span[data-hook='rating-out-of-text']")
    if not rating_tag:
        rating_tag = soup.select_one("#acrPopover span.a-size-base")
    product["rating"] = rating_tag.get_text(strip=True) if rating_tag else None

    # Review count
    review_tag = soup.select_one("#acrCustomerReviewText")
    product["review_count"] = review_tag.get_text(strip=True) if review_tag else None

    # Availability
    avail_tag = soup.select_one("#availability span")
    product["availability"] = avail_tag.get_text(strip=True) if avail_tag else None

    # Product features / bullet points
    feature_tags = soup.select("#feature-bullets ul li span.a-list-item")
    product["features"] = [f.get_text(strip=True) for f in feature_tags if f.get_text(strip=True)]

    # ASIN from URL or page
    asin_tag = soup.select_one("input#ASIN")
    product["asin"] = asin_tag["value"] if asin_tag else None

    # Brand
    brand_tag = soup.select_one("#bylineInfo")
    product["brand"] = brand_tag.get_text(strip=True) if brand_tag else None

    # Images
    img_tags = soup.select("#altImages ul li img")
    product["images"] = [img.get("src") for img in img_tags if img.get("src") and "sprite" not in img.get("src", "")]

    return product
```

Step 4: Scrape Search Results
```python
def scrape_search_results(keyword, num_pages=3):
    """Scrape Amazon search results for a given keyword."""
    all_products = []
    for page in range(1, num_pages + 1):
        url = f"https://www.amazon.com/s?k={keyword.replace(' ', '+')}&page={page}"
        print(f"Scraping search page {page} for '{keyword}'...")
        html = fetch_page(url)
        if not html:
            print(f"Failed to fetch page {page}")
            continue
        soup = BeautifulSoup(html, "lxml")
        items = soup.select("div[data-asin][data-component-type='s-search-result']")
        for item in items:
            asin = item.get("data-asin", "")
            if not asin:
                continue
            title_tag = item.select_one("h2 a span")
            price_whole = item.select_one("span.a-price-whole")
            price_frac = item.select_one("span.a-price-fraction")
            rating_tag = item.select_one("span.a-icon-alt")
            product = {
                "asin": asin,
                "title": title_tag.get_text(strip=True) if title_tag else None,
                "price": f"{price_whole.get_text(strip=True)}{price_frac.get_text(strip=True)}" if price_whole and price_frac else None,
                "rating": rating_tag.get_text(strip=True) if rating_tag else None,
                "url": f"https://www.amazon.com/dp/{asin}",
            }
            all_products.append(product)
        # Respectful delay between pages
        time.sleep(random.uniform(2, 5))
    return all_products
```

Step 5: Run the Complete Pipeline
```python
def main():
    # Scrape search results
    keyword = "wireless headphones"
    search_results = scrape_search_results(keyword, num_pages=3)
    print(f"Found {len(search_results)} products in search results")

    # Scrape individual product pages for detailed data
    detailed_products = []
    for i, product in enumerate(search_results[:10]):  # Limit to first 10
        print(f"Scraping product {i + 1}: {product['asin']}")
        html = fetch_page(product["url"])
        if html:
            details = parse_product_page(html)
            details["search_data"] = product
            detailed_products.append(details)
        time.sleep(random.uniform(3, 7))  # Respectful delay

    # Save results
    with open("amazon_products.json", "w", encoding="utf-8") as f:
        json.dump(detailed_products, f, indent=2, ensure_ascii=False)
    print(f"Saved {len(detailed_products)} detailed products")


if __name__ == "__main__":
    main()
```

Handling Amazon’s Anti-Bot Defenses
Amazon’s detection systems are multi-layered. Here are the key strategies to avoid blocks:
Request Throttling
Never send requests faster than a human would browse. Implement random delays between 2 and 7 seconds per request. For large-scale operations, distribute your scraping across longer time windows.
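As a concrete sketch, a small wrapper can enforce that delay on every call. The `polite_get` helper below is illustrative rather than part of the pipeline above; adjust the bounds to your risk tolerance:

```python
import time
import random

import requests

MIN_DELAY, MAX_DELAY = 2.0, 7.0  # seconds, per the guideline above

def polite_get(url, **kwargs):
    """Sleep a random, human-like interval before each request."""
    time.sleep(random.uniform(MIN_DELAY, MAX_DELAY))
    return requests.get(url, timeout=30, **kwargs)
```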
Header Rotation
Amazon checks for consistent User-Agent strings and missing headers. Rotate your User-Agent with each request and always include standard browser headers like Accept-Language and Accept-Encoding.
Session Management
Create new sessions periodically rather than reusing the same session for hundreds of requests. Each new session should pair with a fresh proxy IP.
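A minimal sketch of that pattern, reusing `get_headers()` and `proxies` from Step 1; the 25-request rotation threshold and the `product_urls` list are illustrative placeholders:

```python
import requests

def make_session():
    """Start a fresh session with new headers; a rotating proxy
    endpoint will typically hand it a new exit IP as well."""
    session = requests.Session()
    session.headers.update(get_headers())  # from Step 1
    session.proxies.update(proxies)        # from Step 1
    return session

session = make_session()
for i, url in enumerate(product_urls):   # product_urls: your URL list
    if i and i % 25 == 0:                 # rotate every 25 requests (arbitrary)
        session = make_session()
    response = session.get(url, timeout=30)
```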
Proxy Quality Matters
Not all proxies are equal for Amazon scraping. Datacenter proxies are detected almost instantly. Residential and mobile proxies provide the highest success rates because they use IP addresses assigned to real internet service providers and mobile carriers.
Structuring Your Extracted Data
For e-commerce data collection at scale, maintaining a clean data structure is critical. Here is a recommended schema:
```python
product_schema = {
    "asin": "B09V3KXJPB",
    "title": "Product Name",
    "price": "$29.99",
    "currency": "USD",
    "rating": "4.5 out of 5 stars",
    "review_count": "2,847 ratings",
    "availability": "In Stock",
    "brand": "Brand Name",
    "features": ["Feature 1", "Feature 2"],
    "category": "Electronics > Headphones",
    "images": ["url1", "url2"],
    "scraped_at": "2026-03-09T12:00:00Z",
}
```

Scaling Your Amazon Scraper
When you need to scrape thousands or millions of products, single-threaded scraping becomes impractical. Consider these scaling strategies:
- Concurrent requests: Use Python’s `concurrent.futures.ThreadPoolExecutor` to run multiple requests simultaneously, each through a different proxy (see the sketch after this list).
- Queue-based architecture: Use Redis or RabbitMQ to manage a queue of URLs to scrape, with multiple worker processes consuming from the queue.
- Database storage: Replace JSON file output with a proper database like PostgreSQL for efficient querying and deduplication.
- Scheduled runs: Set up cron jobs or task schedulers to run your scraper at regular intervals for ongoing price monitoring.
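Here is a minimal sketch of the concurrent approach from the first bullet, reusing `fetch_page` from Step 2. The `max_workers=8` value is an arbitrary starting point; keep it low enough that your aggregate request rate still respects the throttling advice above:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_many(urls, max_workers=8):
    """Fetch several pages in parallel, each via the rotating proxy."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(fetch_page, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            results[url] = future.result()  # None on failure (see Step 2)
    return results
```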
Legal and Ethical Considerations
Web scraping exists in a complex legal landscape. While scraping publicly available data is generally permissible, there are important boundaries:
- Respect robots.txt: Check Amazon’s robots.txt file and understand which paths they restrict.
- Terms of Service: Amazon’s ToS prohibits scraping. Understand the risks before proceeding.
- Rate limiting: Never overwhelm Amazon’s servers. Aggressive scraping can constitute a denial-of-service attack.
- Personal data: Avoid collecting personal information about individual sellers or reviewers.
- Data usage: Use scraped data for legitimate business purposes like market research and competitive analysis.
Common Pitfalls and Solutions
| Problem | Cause | Solution |
|---|---|---|
| Empty responses | IP blocked | Switch to mobile proxies |
| CAPTCHA pages | Too many requests | Increase delays, improve proxy rotation |
| Missing prices | Dynamic rendering | Use headless browser or look for JSON-LD data |
| Stale data | Cached responses | Add cache-busting query parameters |
| Inconsistent HTML | A/B testing | Build multiple parser fallbacks |
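For the missing-prices row, one fallback worth trying before reaching for a full headless browser is JSON-LD structured data. Not every Amazon page template exposes it, so treat the generic extractor below as opportunistic rather than guaranteed:

```python
import json

from bs4 import BeautifulSoup

def extract_json_ld(html):
    """Return any JSON-LD blobs embedded in the page (may be empty)."""
    soup = BeautifulSoup(html, "lxml")
    blobs = []
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            blobs.append(json.loads(tag.string or ""))
        except json.JSONDecodeError:
            continue
    return blobs
```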
Conclusion
Scraping Amazon product data requires a thoughtful approach combining proper proxy infrastructure, realistic request patterns, and robust parsing logic. The Python code examples in this guide provide a solid foundation for building your own Amazon scraper.
The most critical factor in successful Amazon scraping is your proxy infrastructure. Mobile proxies from DataResearchTools provide the highest success rates by routing your requests through genuine mobile carrier IP addresses that Amazon cannot easily distinguish from real user traffic.
For related scraping guides, explore our tutorials on web scraping with proxies and check our proxy glossary for technical terminology used throughout this article.
Related Reading
- How to Scrape Bing Search Results with Python and Proxies
- How to Scrape Booking.com Hotel Prices with Proxy Rotation
- aiohttp + BeautifulSoup: Async Python Scraping
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)
- API vs Web Scraping: When You Need Proxies (and When You Don’t)
- ASEAN Data Protection Laws: A Web Scraping Compliance Matrix