Amazon Best Sellers data is one of the most commercially valuable signals in e-commerce intelligence, and scraping it across all 18 active Amazon marketplaces is harder than it looks. Each locale runs on a separate domain, uses localized anti-bot fingerprinting, and has its own ASIN catalog — meaning a scraper that works on amazon.com will fail silently on amazon.co.jp or amazon.com.br within hours.
What You’re Actually Scraping
Amazon Best Sellers pages follow a predictable URL pattern:
https://www.amazon.{tld}/Best-Sellers/{category}/zgbs/{node_id}

Each page returns up to 50 ranked ASINs per category node, paginated across two pages (ranks 1-50 and 51-100). The data you want per ASIN:
- Rank (1-100 within node)
- ASIN and product title
- Price (locale currency)
- Star rating and review count
- Sponsored flag (boolean — many scrapers miss this)
- Badge labels (“Amazon’s Choice”, “#1 New Release”)
The sponsored flag matters. Best Sellers pages increasingly mix organic rank with promoted listings, and conflating them will corrupt your rank-tracking dataset.
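As a sketch, the fields above map to a record like the following. The class shape, field names, and the pg pagination parameter are the author's assumptions for illustration, not an Amazon-documented schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class BestSellerItem:
    # One ranked listing from a Best Sellers page (field names are illustrative)
    rank: int                # 1-100 within the category node
    asin: str
    title: str
    price: Optional[float]   # locale currency; may be absent until JS renders
    currency: str
    stars: Optional[float]
    review_count: Optional[int]
    is_sponsored: bool       # keep promoted listings separable from organic rank
    badges: list = field(default_factory=list)  # e.g. "Amazon's Choice"

def best_sellers_url(tld: str, category: str, node_id: str, page: int = 1) -> str:
    # Assumes the second page (ranks 51-100) is addressed with a pg query parameter
    base = f"https://www.amazon.{tld}/Best-Sellers/{category}/zgbs/{node_id}"
    return base if page == 1 else f"{base}?pg={page}"
```

Storing is_sponsored at the record level, rather than filtering sponsored listings out at scrape time, lets you audit how much promoted inventory each node carries.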
The 18 Marketplace Map
Amazon operates 18 public-facing marketplaces as of 2026. Not all are equal in scraping difficulty:
| Marketplace | TLD | Anti-bot Tier | Requires Local IP |
|---|---|---|---|
| US | .com | High | No (but helps) |
| UK | .co.uk | High | No |
| Germany | .de | High | No |
| Japan | .co.jp | Very High | Yes |
| India | .in | Medium | No |
| Brazil | .com.br | Medium | Yes |
| Mexico | .com.mx | Medium | No |
| Australia | .com.au | Medium | No |
| Canada | .ca | High | No |
| France | .fr | High | No |
| Italy | .it | Medium | No |
| Spain | .es | Medium | No |
| Netherlands | .nl | Medium | No |
| Sweden | .se | Low | No |
| Poland | .pl | Low | No |
| Saudi Arabia | .sa | Low | No |
| UAE | .ae | Low | No |
| Singapore | .sg | Low | No |
Japan and Brazil are the two that will block you fastest without residential or mobile IPs from the target country. Japan specifically rate-limits aggressively and serves CAPTCHAs within 3-5 requests if you’re on a datacenter IP.
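The table above translates directly into a per-marketplace config map. The tier labels and the local-IP flag mirror the table; the pool names are illustrative labels for this sketch, not any vendor's API:

```python
# Per-marketplace scraping config mirroring the table above
MARKETPLACES = {
    "com":    {"tier": "high",      "local_ip": False},  # US
    "co.uk":  {"tier": "high",      "local_ip": False},  # UK
    "de":     {"tier": "high",      "local_ip": False},  # Germany
    "co.jp":  {"tier": "very_high", "local_ip": True},   # Japan
    "in":     {"tier": "medium",    "local_ip": False},  # India
    "com.br": {"tier": "medium",    "local_ip": True},   # Brazil
    "com.mx": {"tier": "medium",    "local_ip": False},  # Mexico
    "com.au": {"tier": "medium",    "local_ip": False},  # Australia
    "ca":     {"tier": "high",      "local_ip": False},  # Canada
    "fr":     {"tier": "high",      "local_ip": False},  # France
    "it":     {"tier": "medium",    "local_ip": False},  # Italy
    "es":     {"tier": "medium",    "local_ip": False},  # Spain
    "nl":     {"tier": "medium",    "local_ip": False},  # Netherlands
    "se":     {"tier": "low",       "local_ip": False},  # Sweden
    "pl":     {"tier": "low",       "local_ip": False},  # Poland
    "sa":     {"tier": "low",       "local_ip": False},  # Saudi Arabia
    "ae":     {"tier": "low",       "local_ip": False},  # UAE
    "sg":     {"tier": "low",       "local_ip": False},  # Singapore
}

def proxy_pool_for(tld: str) -> str:
    # Japan and Brazil need in-country residential/mobile IPs; the rest
    # can run on a shared residential pool
    return "in_country_mobile" if MARKETPLACES[tld]["local_ip"] else "shared_residential"
```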
Parsing Strategy: HTML vs. SP-API vs. Third-Party
You have three realistic options:
- Direct HTML scraping — highest fidelity, most fragile, requires proxy rotation and browser fingerprinting
- Amazon SP-API (Selling Partner API) — structured data, but requires an active seller account and doesn’t expose Best Sellers rank cleanly across all nodes
- Third-party aggregators (Rainforest API, Keepa, DataForSEO) — easiest to operationalize, costs $0.002-$0.02 per ASIN depending on freshness
For competitive intelligence at scale, direct HTML scraping with a rotating proxy layer gives you the freshest data and the widest node coverage. SP-API is better for sellers who need their own rank tracking tied to inventory operations.
If you’re already running proxy-dependent scrapers for other targets — like scraping Walmart Marketplace seller data — you can reuse that infrastructure directly. The same rotating IP pool, session management logic, and retry handlers transfer cleanly.
The Anti-Bot Stack You’ll Actually Face
Amazon runs a layered defense in 2026:
- TLS fingerprinting via BoringSSL — curl and requests fail on most locales without a matching TLS profile
- Browser fingerprint checks (canvas, WebGL, font enumeration) on JavaScript-rendered pages
- Behavioral analysis — consistent timing patterns trigger blocks faster than random delays
- Geographic IP scoring — datacenter ASNs get a higher suspicion score than residential
The practical fix: use a headless browser (Playwright with stealth patches) or a dedicated scraping browser like Browserless or Apify Actors, combined with residential or mobile proxies. For Japan and Brazil specifically, you need in-country mobile IPs. The same logic applies when scraping other commerce platforms with geo-restricted pricing — mobile proxies used for insurance quote comparison demonstrate this pattern clearly: local mobile IP plus rotating session equals consistent access.
A minimal Playwright config for Amazon scraping:
```python
from playwright.async_api import async_playwright

async def scrape_best_sellers(url: str, proxy: dict) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            proxy=proxy,
            args=["--disable-blink-features=AutomationControlled"],
        )
        ctx = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            locale="en-US",
            timezone_id="America/New_York",
        )
        page = await ctx.new_page()
        await page.goto(url, wait_until="domcontentloaded")
        content = await page.content()
        await browser.close()
        return content
```

Set locale and timezone_id to match the target marketplace country, not your proxy IP’s country. Mismatches are a detectable signal.
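One way to keep those two settings consistent per market is a small lookup table. The entries below are a sketch covering a subset of marketplaces, with values chosen by the author rather than taken from any Amazon documentation:

```python
# Browser context settings matched to the marketplace, not the proxy exit IP
CONTEXT_BY_TLD = {
    "com":    {"locale": "en-US", "timezone_id": "America/New_York"},
    "co.uk":  {"locale": "en-GB", "timezone_id": "Europe/London"},
    "de":     {"locale": "de-DE", "timezone_id": "Europe/Berlin"},
    "co.jp":  {"locale": "ja-JP", "timezone_id": "Asia/Tokyo"},
    "com.br": {"locale": "pt-BR", "timezone_id": "America/Sao_Paulo"},
}

def context_kwargs(tld: str) -> dict:
    # Fall back to US settings for markets not listed in this sketch
    return CONTEXT_BY_TLD.get(tld, CONTEXT_BY_TLD["com"])
```

The returned dict can be splatted straight into new_context, e.g. `browser.new_context(**context_kwargs("co.jp"))`.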
Structuring a Multi-Marketplace Pipeline
Running 18 markets in parallel is the right architecture, but naive parallelism gets you blocked. The structure that works:
- One session pool per marketplace — don’t reuse US session cookies on .co.uk
- Stagger requests per node — 2-5 second jitter between category pages within a single market
- Checkpoint by ASIN hash — if a page returns fewer than 40 ASINs, treat it as a soft block and retry with a fresh session, not the same one
- Deduplicate sponsored ASINs — store an is_sponsored boolean at ingest and filter downstream
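The soft-block checkpoint can be sketched as a retry wrapper. The 40-ASIN threshold comes from the rule above; the fetch and new_session hooks are caller-supplied placeholders for this illustration:

```python
import random
import time

SOFT_BLOCK_THRESHOLD = 40  # a healthy page should carry up to 50 ASINs

def fetch_node(fetch, new_session, url: str, max_retries: int = 3,
               jitter: tuple = (2.0, 5.0)) -> list:
    """Fetch a category page, treating a short ASIN list as a soft block.

    fetch(session, url) -> list of ASINs, and new_session() -> session,
    are caller-supplied hooks (illustrative signatures).
    """
    session = new_session()
    asins: list = []
    for _ in range(max_retries):
        asins = fetch(session, url)
        if len(asins) >= SOFT_BLOCK_THRESHOLD:
            return asins
        session = new_session()                 # fresh session, never the same one
        time.sleep(random.uniform(*jitter))     # jitter between attempts in a market
    return asins  # partial data; caller decides whether to flag the node
```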
For the data model, store raw HTML in object storage (S3/R2) and parse to structured rows separately. Amazon’s HTML structure changes without notice; having the raw payload means you can re-parse without re-scraping.
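A sketch of that raw-then-parse split: write each payload under a deterministic object key so re-parsing never requires re-scraping. The key layout and the content-hash suffix are the author's assumptions, not a prescribed scheme:

```python
import hashlib
from datetime import datetime, timezone

def raw_key(tld: str, node_id: str, page: int, html: str) -> str:
    """Object-storage key for one raw Best Sellers payload (illustrative layout)."""
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    # Short content hash distinguishes re-scrapes of the same node within a second
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()[:12]
    return f"raw/amazon/{tld}/{node_id}/pg{page}/{ts}-{digest}.html"
```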
This pipeline architecture is similar to what you’d build for ATS platform scraping at scale — the same session isolation and checkpoint logic that makes iCIMS career site scraping and Taleo scraping at scale reliable also applies to marketplace data pipelines. The underlying problem — maintaining session integrity across distributed workers — is the same class of challenge.
For brand-level research, Best Sellers data pairs well with Amazon Brand Registry public page data, which gives you trademark registration dates and brand owner identities to enrich your ASIN-level records.
Handling Failures at Scale
Common failure modes and how to handle them:
- 503 / captcha page returned as 200 — parse the response body for a title containing “Robot Check” before processing
- Redirect to sign-in page — session expired; rotate to a fresh session, do not retry with the same credentials
- Missing rank badges — normal on low-traffic nodes; don’t treat as parse failure
- Price not rendered — JavaScript-dependent; ensure the page fully loads before extracting, or use wait_for_selector on the price element
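The first two failure modes can be caught with a pre-parse classifier. The “Robot Check” marker comes from the list above; the other marker strings and the signin-URL check are illustrative assumptions about what those pages contain:

```python
CAPTCHA_MARKERS = (
    "Robot Check",                          # classic captcha page title
    "Enter the characters you see below",   # assumed captcha body text
)

def classify_response(status: int, final_url: str, body: str) -> str:
    """Classify a fetched page before handing it to the parser."""
    if any(marker in body for marker in CAPTCHA_MARKERS):
        return "captcha"          # often served with HTTP 200
    if "/ap/signin" in final_url:
        return "signin_redirect"  # rotate to a fresh session
    if status == 503:
        return "throttled"
    return "ok"
```

Anything other than "ok" should skip parsing entirely and feed the retry and session-rotation logic instead.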
Rate your proxy health by marketplace separately. A proxy pool that performs well on amazon.com can be effectively blocked on amazon.co.jp. Monitor block rates per (proxy_asn, marketplace) tuple and drop underperforming ASNs from that market’s pool automatically.
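That per-(proxy_asn, marketplace) accounting can be a simple counter table. The 30% drop threshold and 50-sample minimum here are illustrative values, not tuned recommendations:

```python
from collections import defaultdict

class ProxyHealth:
    """Track block rates per (proxy_asn, marketplace) tuple."""

    def __init__(self, drop_threshold: float = 0.3, min_samples: int = 50):
        # (asn, tld) -> [total requests, blocked requests]
        self.stats = defaultdict(lambda: [0, 0])
        self.drop_threshold = drop_threshold
        self.min_samples = min_samples

    def record(self, asn: str, tld: str, blocked: bool) -> None:
        counters = self.stats[(asn, tld)]
        counters[0] += 1
        counters[1] += int(blocked)

    def should_drop(self, asn: str, tld: str) -> bool:
        requests, blocks = self.stats[(asn, tld)]
        # Only drop an ASN from this market's pool once there are enough samples;
        # the same ASN may stay healthy on other marketplaces
        return requests >= self.min_samples and blocks / requests > self.drop_threshold
```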
Bottom Line
Scraping Amazon Best Sellers across 18 marketplaces is tractable in 2026 if you treat each locale as a separate target with its own proxy pool, session state, and block-rate monitoring. The two non-negotiables: residential or mobile IPs for Japan and Brazil, and a browser fingerprint that doesn’t expose automation. DRT covers the full stack of e-commerce and job board scraping infrastructure — if this article was useful, the rest of the scrape-target library will be too.