Magento powers a surprising chunk of mid-market and enterprise ecommerce, and scraping it in 2026 means knowing which version you're dealing with, whether the store exposes its REST or GraphQL API, and how aggressive the bot mitigation is. The platform is actually more scraper-friendly than most, if you know where to look.
Detect the Magento Version First
Before writing a single line of scraper code, confirm that the target is Magento and which generation it runs. Magento 1 is EOL but still running on thousands of stores. Magento 2 (Adobe Commerce / Open Source) is the default target.
Quick fingerprint signals:
- /skin/frontend/ in asset paths = Magento 1
- /static/version[hash]/frontend/ = Magento 2
- X-Magento-Cache-Id response header = Magento 2
- Mage.Cookies in page source = Magento 1
curl -sI https://example.com/ | grep -i magento
curl -s https://example.com/ | grep -o 'static/version[^/]*'

Magento 1 stores rarely expose a usable public API (the legacy SOAP and REST interfaces sit behind credentials), so they need pure HTML parsing (covered below). Magento 2 is the main event.
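The same signals can be checked in code once you have the response headers and page body. This is a minimal sketch, not a definitive detector: the function name `fingerprint_magento` and its return labels are made up for illustration, and it only knows the four signals listed above, so a customized theme or a CDN that strips headers can slip past it.

```python
import re


def fingerprint_magento(headers: dict, html: str) -> str:
    """Return 'magento2', 'magento1', or 'unknown' (labels are illustrative)."""
    # Header names are case-insensitive, so normalize before comparing.
    header_names = {name.lower() for name in headers}

    # Magento 2 signals: the cache-id header or a versioned static asset path.
    if "x-magento-cache-id" in header_names:
        return "magento2"
    if re.search(r"/static/version[^/]*/frontend/", html):
        return "magento2"

    # Magento 1 signals: the legacy skin path or the Mage.Cookies JS object.
    if "/skin/frontend/" in html or "Mage.Cookies" in html:
        return "magento1"

    return "unknown"
```

Pass it `dict(response.headers)` and `response.text` from whatever HTTP client you use. An "unknown" result means none of the four signals matched, not that the store isn't Magento.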
Magento 2 REST and GraphQL APIs
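Magento 2 ships with a full REST API and a GraphQL endpoint. Many stores leave at least the catalog data readable without auth: GraphQL catalog queries are open by default because headless storefronts need them for page rendering, while anonymous REST access depends on the store's Web API security setting, so expect a 401 on some targets.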
REST API
The base path is /rest/V1/. Common public endpoints:
| Endpoint | Returns |
|---|---|
| /rest/V1/products?searchCriteria[pageSize]=50 | product list with full attributes |
| /rest/V1/products/{sku} | single product detail |
| /rest/V1/categories | full category tree |
| /rest/V1/products/{sku}/media | image URLs |
| /rest/V1/configurable-products/{sku}/children | variant SKUs |
import httpx

BASE = "https://example.com/rest/V1"

# searchCriteria is passed as bracketed query parameters
params = {
    "searchCriteria[pageSize]": 100,
    "searchCriteria[currentPage]": 1,
    # a stable sort keeps pages from shifting between requests
    "searchCriteria[sortOrders][0][field]": "id",
    "searchCriteria[sortOrders][0][direction]": "ASC",
}
r = httpx.get(f"{BASE}/products", params=params, timeout=15)
r.raise_for_status()  # fail early on 401/403 instead of parsing an error body
data = r.json()
products = data["items"]
total = data["total_count"]

Paginate by incrementing currentPage until len(products) < pageSize. total_count tells you the full catalog size upfront, so you can size your job queue before firing a single extra request.
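The pagination rule can be sketched as a generator. This is an illustration under assumptions: `iter_products` and the `fetch_page` callable are hypothetical names, and `fetch_page(page, page_size)` is expected to return the decoded JSON body of /rest/V1/products for that page. The sketch also stops once total_count is reached, which covers the case where the catalog size is an exact multiple of the page size and no short page ever arrives.

```python
from typing import Callable, Iterator


def iter_products(
    fetch_page: Callable[[int, int], dict], page_size: int = 100
) -> Iterator[dict]:
    """Yield every product, requesting one page at a time.

    fetch_page is a hypothetical callable: (page, page_size) -> decoded JSON.
    """
    page = 1
    seen = 0
    while True:
        data = fetch_page(page, page_size)
        items = data.get("items") or []
        yield from items
        seen += len(items)

        total = data.get("total_count")
        # Stop on a short page, or once total_count says we have everything.
        if len(items) < page_size or (total is not None and seen >= total):
            break
        page += 1
```

With httpx, `fetch_page` can be a small wrapper that sets searchCriteria[currentPage] and searchCriteria[pageSize] and returns `r.json()`. Keeping the HTTP call outside the loop makes it easy to add retries or rate limiting without touching the pagination logic.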
GraphQL
Magento 2.3+ has a GraphQL endpoint at /graphql. It's often faster than REST for storefront data because you pull exactly what you need in one round-trip.
{
  products(search: "", pageSize: 50, currentPage: 1) {
    total_count
    items {
      sku
      name
      price_range {
        minimum_price { regular_price { value currency } }
      }
      categories { id name url_key }
    }
  }
}

POST that as {"query": "..."} to /graphql. No auth is needed for catalog data on most stores. GraphQL also handles bundle product structures and layered navigation filters in one shot, which REST fumbles.
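To make the POST body concrete, here is a small sketch that builds the JSON payload and unpacks the response. The helper names `build_payload` and `extract_items` are hypothetical, and the query is trimmed to two fields for brevity. It passes paging values as GraphQL variables instead of editing the query string, and it checks the top-level `errors` key because GraphQL servers commonly report failures inside a 200 response.

```python
import json

# Trimmed version of the catalog query, with paging moved into variables.
PRODUCTS_QUERY = """
query Products($pageSize: Int!, $currentPage: Int!) {
  products(search: "", pageSize: $pageSize, currentPage: $currentPage) {
    total_count
    items { sku name }
  }
}
"""


def build_payload(page: int, page_size: int = 50) -> str:
    """Serialize the request body to POST to /graphql."""
    return json.dumps(
        {
            "query": PRODUCTS_QUERY,
            "variables": {"pageSize": page_size, "currentPage": page},
        }
    )


def extract_items(response_body: dict) -> list:
    """Return the product items, raising if the response carries errors."""
    errors = response_body.get("errors")
    if errors:
        raise RuntimeError(errors[0].get("message", "GraphQL error"))
    return response_body["data"]["products"]["items"]
```

Send the payload with a Content-Type: application/json header, for example `httpx.post(url, content=build_payload(1), headers={"Content-Type": "application/json"})`, then pass `r.json()` to `extract_items`.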
If you're used to the structured API approach from other platforms, How to Scrape BigCommerce Stores Programmatically (2026) covers a similar REST-first pattern that maps cleanly to Magento's field structure.
HTML Scraping for Magento 1 and API-Blocked Stores
Some stores disable the API entirely, put it behind OAuth, or just run Magento 1. In those cases, fall back to HTML parsing. Magento's frontend is consistent enough that a few selectors cover most themes.
Useful selectors on the default Luma and Blank themes:
- product list items: .product-item
- product name: .product-item-link
- price: .price (or .special-price .price for sale items)
- SKU on PDP: [itemprop="sku"]
- pagination: the rel="next" link
Magento 1 uses .product-name and .price-box but keeps the same microdata itemprop pattern.
Numbered extraction flow for a category page:
1. fetch the category URL, parse the .product-item nodes
2. extract href from .product-item-link for each PDP URL
3. fetch each PDP, extract [itemprop="sku"], [itemprop="price"], and [itemprop="image"]
4. check for the rel="next" link and iterate
5. on configurable products, pull the [data-role="swatch-options"] JSON blob for full variant data without extra requests
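The category-page part of that flow (collecting PDP links and finding the next page) can be sketched with only the standard library. The names `CategoryPageParser` and `parse_category` are hypothetical, and the sketch assumes Luma-style markup: product links carry the product-item-link class, and the next page is advertised by an element with rel="next". Themes that paginate differently will need another rule.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CategoryPageParser(HTMLParser):
    """Collect PDP URLs and the rel="next" URL from one category page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.product_urls = []
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        classes = (attr.get("class") or "").split()
        rels = (attr.get("rel") or "").split()
        if tag == "a" and "product-item-link" in classes:
            # Resolve relative links against the category URL.
            self.product_urls.append(urljoin(self.base_url, href))
        elif "next" in rels and self.next_url is None:
            self.next_url = urljoin(self.base_url, href)


def parse_category(html: str, base_url: str):
    """Return (list of PDP URLs, next page URL or None)."""
    parser = CategoryPageParser(base_url)
    parser.feed(html)
    return parser.product_urls, parser.next_url
```

Loop by fetching `next_url` until it comes back as None. A dedicated HTML library with CSS selector support would shorten this, but the standard-library parser keeps the sketch dependency-free.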
That JSON blob in step 5 is the real shortcut. Magento inlines the full variant matrix as a JavaScript object on the page. You get all variant prices and attribute combinations without touching the REST API at all, which matters when you're dealing with a catalog where every parent SKU has a dozen children.
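One way to pull that blob out is sketched below. It assumes the store follows the stock Magento 2 convention of placing component config in script blocks of type text/x-magento-init, keyed by a selector that mentions swatch-options. That convention, and the names `MagentoInitParser` and `find_swatch_config`, are assumptions to verify against the target's page source, since custom themes can move or rename the block.

```python
import json
from html.parser import HTMLParser


class MagentoInitParser(HTMLParser):
    """Collect the JSON bodies of text/x-magento-init script blocks."""

    def __init__(self):
        super().__init__()
        self._in_init = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "text/x-magento-init":
            self._in_init = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_init:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_init:
            self._in_init = False
            try:
                self.blocks.append(json.loads("".join(self._chunks)))
            except json.JSONDecodeError:
                pass  # skip blocks that are not valid JSON


def find_swatch_config(html: str):
    """Return the config under the swatch-options selector, or None."""
    parser = MagentoInitParser()
    parser.feed(html)
    for block in parser.blocks:
        if not isinstance(block, dict):
            continue
        for selector, config in block.items():
            if "swatch-options" in selector:
                return config
    return None
```

The returned dict is whatever the store embedded, so inspect its keys on a real page before hard-coding paths into it.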
This is the same embedded JSON island pattern described in How to Scrape WooCommerce Stores 2026: Pattern Recognition Approach, where most structured data is embedded in the page markup.