How to Scrape Magento Stores in 2026: API and HTML Patterns

Magento powers a surprising chunk of mid-market and enterprise ecommerce, and scraping it in 2026 means knowing which version you’re dealing with, whether the store exposes its REST or GraphQL API, and how aggressive the bot mitigation is. the platform is actually more scraper-friendly than most, if you know where to look.

Detect the Magento Version First

before writing a single line of scraper code, confirm the target is Magento and which generation. Magento 1 is EOL but still running on thousands of stores. Magento 2 (Adobe Commerce / Open Source) is the default target.

quick fingerprint signals:

  • /skin/frontend/ in asset paths = Magento 1
  • /static/version[hash]/frontend/ = Magento 2
  • X-Magento-Cache-Id response header = Magento 2
  • Mage.Cookies in page source = Magento 1

curl -sI https://example.com/ | grep -i magento
curl -s https://example.com/ | grep -o 'static/version[^/]*'
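
the same signals can be checked in code once the page is fetched. a minimal sketch, assuming you already have the response body and headers in hand; the function name detect_magento is hypothetical, and the checks are only the four fingerprints listed above, so a heavily customised theme may slip through.

```python
import re
from typing import Mapping, Optional


def detect_magento(html: str, headers: Mapping[str, str]) -> Optional[str]:
    """Return "2", "1", or None using the fingerprint signals listed above."""
    header_names = {name.lower() for name in headers}

    # Magento 2: versioned static asset path or the cache-id response header.
    if re.search(r"/static/version[^/]*/frontend/", html):
        return "2"
    if "x-magento-cache-id" in header_names:
        return "2"

    # Magento 1: legacy skin path or the Mage.Cookies JavaScript object.
    if "/skin/frontend/" in html or "Mage.Cookies" in html:
        return "1"

    return None  # no known signal matched
```

a None result only means none of these signals matched, not that the store is definitely running something else.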

Magento 1 stores have no official API, so they need pure HTML parsing (covered below). Magento 2 is the main event.

Magento 2 REST and GraphQL APIs

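Magento 2 ships with a full REST API and a GraphQL endpoint. many stores leave at least the catalog endpoints publicly accessible without auth, typically because a headless or PWA storefront needs them for page rendering. anonymous REST access is a store-level setting, so a 401 on the catalog endpoints means it is switched off, not that the store isn't Magento 2.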

REST API

the base path is /rest/V1/. common public endpoints:

Endpoint | Returns
/rest/V1/products?searchCriteria[pageSize]=50 | product list with full attributes
/rest/V1/products/{sku} | single product detail
/rest/V1/categories | full category tree
/rest/V1/products/{sku}/media | image URLs
/rest/V1/configurable-products/{sku}/children | variant SKUs

import httpx

BASE = "https://example.com/rest/V1"
params = {
    "searchCriteria[pageSize]": 100,
    "searchCriteria[currentPage]": 1,
    "searchCriteria[sortOrders][0][field]": "id",
    "searchCriteria[sortOrders][0][direction]": "ASC",
}
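r = httpx.get(f"{BASE}/products", params=params, timeout=15)
r.raise_for_status()  # fail loudly on 401/403 instead of parsing an error body
data = r.json()
products = data["items"]     # one page of product objects
total = data["total_count"]  # size of the full result set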

paginate by incrementing currentPage until len(products) < pageSize. total_count tells you the full catalog size upfront, so you can size your job queue before firing a single extra request.
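
a sketch of that loop, assuming the response shape shown above (items and total_count). iter_products and fetch_page are hypothetical names; fetch_page is whatever function you write to request one page and return the decoded JSON. this version stops on total_count rather than on a short page, so it does not depend on how the store responds to a page number past the end.

```python
from typing import Callable, Iterator


def iter_products(
    fetch_page: Callable[[int], dict],
    max_pages: int = 10_000,
) -> Iterator[dict]:
    """Yield products page by page until total_count items have been seen."""
    seen = 0
    for page in range(1, max_pages + 1):
        data = fetch_page(page)
        items = data.get("items") or []
        total = data.get("total_count", 0)

        yield from items
        seen += len(items)

        # Stop on an empty page or once the whole catalog has been yielded.
        if not items or seen >= total:
            break


# Hypothetical wiring with httpx, reusing BASE and params from the snippet above:
#
# def fetch_page(page):
#     query = {**params, "searchCriteria[currentPage]": page}
#     return httpx.get(f"{BASE}/products", params=query, timeout=15).json()
#
# for product in iter_products(fetch_page):
#     print(product["sku"])
```

max_pages is only a safety cap in case a store reports an inconsistent total_count.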

GraphQL

Magento 2.3+ has a GraphQL endpoint at /graphql. it's often faster than REST for storefront data because you pull exactly what you need in one round-trip.

{
  products(search: "", pageSize: 50, currentPage: 1) {
    total_count
    items {
      sku
      name
      price_range {
        minimum_price { regular_price { value currency } }
      }
      categories { id name url_key }
    }
  }
}

POST that as {"query": "..."} to /graphql. no auth needed for catalog data on most stores. GraphQL also handles bundled product structures and layered navigation filters in one shot, which REST fumbles.
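
building that POST body looks roughly like this. a minimal sketch using only the standard library so it runs anywhere; build_graphql_request and CATALOG_QUERY are hypothetical names, the query is a trimmed version of the one above with the page numbers moved into GraphQL variables, and the Content-Type header is an assumption about what the endpoint expects.

```python
import json
import urllib.request

# Trimmed catalog query; page size and page number are passed as variables.
CATALOG_QUERY = """
query ($pageSize: Int!, $page: Int!) {
  products(search: "", pageSize: $pageSize, currentPage: $page) {
    total_count
    items { sku name }
  }
}
"""


def build_graphql_request(
    store_url: str, page: int, page_size: int = 50
) -> urllib.request.Request:
    """Build (but do not send) a POST request for one page of catalog data."""
    body = json.dumps(
        {
            "query": CATALOG_QUERY,
            "variables": {"pageSize": page_size, "page": page},
        }
    ).encode("utf-8")
    return urllib.request.Request(
        f"{store_url.rstrip('/')}/graphql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending it (network call, not executed here):
# with urllib.request.urlopen(build_graphql_request("https://example.com", 1), timeout=15) as resp:
#     items = json.load(resp)["data"]["products"]["items"]
```

the same payload works unchanged with httpx.post(url, json=...) if you prefer to stay on the client used in the REST example.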

if you're used to the structured API approach from other platforms, How to Scrape BigCommerce Stores Programmatically (2026) covers a similar REST-first pattern that maps cleanly to Magento's field structure.

HTML Scraping for Magento 1 and API-Blocked Stores

some stores disable the API entirely, put it behind OAuth, or just run Magento 1. fall back to HTML parsing. Magento's frontend is consistent enough that a few selectors cover most themes.

useful selectors on default Luma and blank themes:

  • product list items: .product-item
  • product name: .product-item-link
  • price: .price (or .special-price .price for sale items)
  • SKU on PDP: [itemprop="sku"]
  • pagination: the rel="next" link

Magento 1 uses .product-name and .price-box but keeps the same microdata itemprop pattern.
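
a sketch of how the Magento 2 selectors above translate into a parser, using the standard library's html.parser so there is nothing to install. CategoryPageParser and parse_category are hypothetical names, and accepting rel="next" on either a link or an a element is an assumption; check where your target theme actually puts it.

```python
from html.parser import HTMLParser
from typing import List, Optional


class CategoryPageParser(HTMLParser):
    """Collect PDP links (.product-item-link) and the rel="next" URL."""

    def __init__(self) -> None:
        super().__init__()
        self.product_urls: List[str] = []
        self.next_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        href = attr.get("href")
        classes = (attr.get("class") or "").split()
        rels = (attr.get("rel") or "").split()

        # Product name links carry the PDP URL.
        if tag == "a" and "product-item-link" in classes and href:
            self.product_urls.append(href)

        # Keep the first rel="next" target found on a <link> or <a>.
        if tag in ("a", "link") and "next" in rels and href:
            if self.next_url is None:
                self.next_url = href


def parse_category(html: str) -> CategoryPageParser:
    """Feed one category page and return the populated parser."""
    parser = CategoryPageParser()
    parser.feed(html)
    return parser
```

a CSS-selector library would be shorter, but the class and attribute checks here are the same ones you would express as .product-item-link and [rel="next"].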

numbered extraction flow for a category page:

  1. fetch the category URL, parse the .product-item nodes
  2. extract href from .product-item-link for each PDP URL
  3. fetch each PDP, extract [itemprop="sku"], [itemprop="price"], and [itemprop="image"]
  4. check for a rel="next" link and iterate
  5. on configurable products, pull the [data-role="swatch-options"] JSON blob for full variant data without extra requests

that JSON blob in step 5 is the real shortcut. Magento inlines the full variant matrix as a JavaScript object on the page. you get all variant prices and attribute combinations without touching the REST API at all, which matters when you're dealing with a catalog where every parent SKU has a dozen children.
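
pulling that blob out can be sketched as below. this assumes the variant data sits in a script block of type text/x-magento-init, keyed by a selector string that contains swatch-options; that layout is an assumption about Luma-style themes, not something guaranteed on every store, so inspect the page source first. extract_swatch_configs is a hypothetical helper name.

```python
import json
import re
from typing import Any, Dict, List

# Matches the body of every <script type="text/x-magento-init"> block.
INIT_SCRIPT = re.compile(
    r'<script[^>]*type="text/x-magento-init"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def extract_swatch_configs(html: str) -> List[Dict[str, Any]]:
    """Return the JSON values attached to any swatch-options selector key."""
    found: List[Dict[str, Any]] = []
    for match in INIT_SCRIPT.finditer(html):
        try:
            blob = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed or templated script blocks
        if not isinstance(blob, dict):
            continue
        for selector, config in blob.items():
            if "swatch-options" in selector:
                found.append(config)
    return found
```

the returned dictionaries are whatever the page embedded, so walk them interactively once before hard-coding key paths for prices and attribute combinations.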

this is the same embedded JSON island pattern described in How to Scrape WooCommerce Stores 2026: Pattern Recognition Approach, where most structured data lives inside
