How to Scrape Shopify Store Product Catalogs

Shopify powers over 4 million active online stores, making it the dominant e-commerce platform worldwide. For competitive intelligence teams, dropshippers, market researchers, and pricing analysts, access to Shopify store product data provides critical insights into competitor pricing, product assortment, and inventory strategies.

What makes Shopify particularly interesting from a scraping perspective is its built-in /products.json endpoint. Every Shopify store exposes a JSON API that returns structured product data without requiring any HTML parsing. Combined with mobile proxy rotation for scale, this makes Shopify one of the most efficient e-commerce platforms to scrape.

The Shopify /products.json Endpoint

Every Shopify store has a public JSON endpoint at {store-url}/products.json that returns product data in a clean, structured format. This endpoint is part of Shopify’s storefront architecture and is accessible without authentication on most stores.

The endpoint supports pagination through a page parameter and returns up to 250 products per page via the limit parameter (the default page size is smaller, so set limit=250 explicitly). Here is the basic structure:

https://example-store.myshopify.com/products.json?limit=250&page=1

The response includes:

  • Product title, description, and vendor
  • All variant details (size, color, price, SKU)
  • Image URLs
  • Product type and tags
  • Creation and update timestamps
  • Availability status

This structured approach eliminates the need for HTML parsing entirely, making Shopify scraping significantly more reliable than most e-commerce scraping targets.
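As a quick illustration, a trimmed response can be flattened into rows without touching any HTML. The sample below is invented to show the shape of the data, not taken from a real store:

```python
# Trimmed, invented sample of the /products.json response shape.
sample = {
    "products": [
        {
            "id": 1,
            "title": "Basic Tee",
            "vendor": "Acme",
            "product_type": "Shirts",
            "tags": ["cotton", "summer"],
            "variants": [
                {"id": 10, "title": "Small", "price": "19.99", "sku": "TEE-S"},
                {"id": 11, "title": "Large", "price": "21.99", "sku": "TEE-L"},
            ],
            "images": [{"src": "https://cdn.shopify.com/s/files/tee.jpg"}],
        }
    ]
}

# Flatten to one row per variant -- no HTML parsing required.
rows = [
    {"title": p["title"], "variant": v["title"], "price": float(v["price"])}
    for p in sample["products"]
    for v in p["variants"]
]
```

Note that prices arrive as strings in the JSON, so they need an explicit float conversion before any numeric analysis.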

Setting Up the Environment

pip install requests pandas tqdm

Building the Shopify Product Scraper

The core scraper leverages the JSON endpoint with proxy rotation for handling large numbers of stores:

import requests
import pandas as pd
import time
import random
import json
from datetime import datetime
from tqdm import tqdm


class ShopifyProxyPool:
    """Manages proxy rotation for Shopify scraping."""

    def __init__(self, proxy_list):
        self.proxies = proxy_list
        self.index = 0
        self.cooldown = {}

    def get_proxy(self):
        """Return the next proxy, skipping those in cooldown."""
        now = time.time()
        available = [
            p for p in self.proxies
            if p not in self.cooldown or now > self.cooldown[p]
        ]

        if not available:
            self.cooldown.clear()
            available = self.proxies

        proxy = available[self.index % len(available)]
        self.index += 1
        return {"http": proxy, "https": proxy}

    def set_cooldown(self, proxy_dict, seconds=60):
        """Put a proxy on cooldown after a failure."""
        proxy_url = proxy_dict.get("http", "")
        self.cooldown[proxy_url] = time.time() + seconds


class ShopifyScraper:
    """Scrapes product data from Shopify stores using the JSON API."""

    def __init__(self, proxy_pool):
        self.proxy_pool = proxy_pool
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": (
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/120.0.0.0 Safari/537.36"
            ),
            "Accept": "application/json",
        })

    def scrape_store(self, store_url, max_products=None):
        """Scrape all products from a single Shopify store."""
        store_url = store_url.rstrip("/")
        all_products = []
        page = 1
        limit = 250

        while True:
            url = f"{store_url}/products.json?limit={limit}&page={page}"
            proxy = self.proxy_pool.get_proxy()

            try:
                response = self.session.get(url, proxies=proxy, timeout=15)

                if response.status_code == 200:
                    data = response.json()
                    products = data.get("products", [])

                    if not products:
                        break

                    parsed = [self._parse_product(p, store_url) for p in products]
                    all_products.extend(parsed)

                    print(f"Page {page}: {len(products)} products from {store_url}")

                    if max_products and len(all_products) >= max_products:
                        all_products = all_products[:max_products]
                        break

                    page += 1
                    time.sleep(random.uniform(1, 2))

                elif response.status_code == 429:
                    print(f"Rate limited on {store_url}, cooling down proxy...")
                    self.proxy_pool.set_cooldown(proxy, seconds=30)
                    time.sleep(random.uniform(5, 10))

                elif response.status_code == 430:
                    # Shopify-specific: too many requests
                    print(f"Shopify 430 error from {store_url}, backing off...")
                    self.proxy_pool.set_cooldown(proxy, seconds=60)
                    time.sleep(random.uniform(10, 20))

                else:
                    print(f"HTTP {response.status_code} from {store_url}")
                    break

            except requests.RequestException as e:
                print(f"Request error for {store_url}: {e}")
                self.proxy_pool.set_cooldown(proxy, seconds=30)
                break

        return all_products

    def _parse_product(self, product_data, store_url):
        """Parse a product JSON object into a flat structure."""
        product = {
            "store_url": store_url,
            "product_id": product_data.get("id"),
            "title": product_data.get("title"),
            "vendor": product_data.get("vendor"),
            "product_type": product_data.get("product_type"),
            "handle": product_data.get("handle"),
            "product_url": f"{store_url}/products/{product_data.get('handle', '')}",
            "description_html": product_data.get("body_html", ""),
            "tags": ", ".join(product_data.get("tags", [])),
            "created_at": product_data.get("created_at"),
            "updated_at": product_data.get("updated_at"),
            "published_at": product_data.get("published_at"),
        }

        # Extract variant data
        variants = product_data.get("variants", [])
        if variants:
            prices = [float(v.get("price", 0)) for v in variants if v.get("price")]
            product["min_price"] = min(prices) if prices else None
            product["max_price"] = max(prices) if prices else None
            product["variant_count"] = len(variants)
            product["total_inventory"] = sum(
                v.get("inventory_quantity", 0) for v in variants
                if v.get("inventory_quantity") is not None
            )

            # First variant details
            first_variant = variants[0]
            product["primary_price"] = first_variant.get("price")
            product["compare_at_price"] = first_variant.get("compare_at_price")
            product["sku"] = first_variant.get("sku")
            product["weight"] = first_variant.get("weight")
            product["requires_shipping"] = first_variant.get("requires_shipping")

        # Extract images
        images = product_data.get("images", [])
        product["image_count"] = len(images)
        product["primary_image_url"] = images[0].get("src") if images else None

        return product

    def scrape_store_detailed(self, store_url):
        """Scrape products with full variant-level detail."""
        store_url = store_url.rstrip("/")
        all_variants = []
        page = 1

        while True:
            url = f"{store_url}/products.json?limit=250&page={page}"
            proxy = self.proxy_pool.get_proxy()

            try:
                response = self.session.get(url, proxies=proxy, timeout=15)
                if response.status_code != 200:
                    break

                products = response.json().get("products", [])
                if not products:
                    break

                for product in products:
                    for variant in product.get("variants", []):
                        variant_data = {
                            "store_url": store_url,
                            "product_id": product["id"],
                            "product_title": product["title"],
                            "product_type": product.get("product_type"),
                            "vendor": product.get("vendor"),
                            "variant_id": variant["id"],
                            "variant_title": variant.get("title"),
                            "price": variant.get("price"),
                            "compare_at_price": variant.get("compare_at_price"),
                            "sku": variant.get("sku"),
                            "available": variant.get("available"),
                            "inventory_quantity": variant.get("inventory_quantity"),
                            "weight": variant.get("weight"),
                            "option1": variant.get("option1"),
                            "option2": variant.get("option2"),
                            "option3": variant.get("option3"),
                        }
                        all_variants.append(variant_data)

                page += 1
                time.sleep(random.uniform(1, 2))

            except requests.RequestException as e:
                print(f"Request error for {store_url}: {e}")
                break

        return all_variants

Scraping Multiple Competitor Stores

For competitive analysis, scrape product data from multiple stores in a single pipeline:

class ShopifyCompetitorTracker:
    """Tracks product data across multiple Shopify competitor stores."""

    def __init__(self, scraper):
        self.scraper = scraper

    def scrape_competitors(self, store_urls):
        """Scrape all products from a list of competitor stores."""
        all_products = []

        for i, url in enumerate(store_urls):
            print(f"\n[{i + 1}/{len(store_urls)}] Scraping: {url}")

            try:
                products = self.scraper.scrape_store(url)
                all_products.extend(products)
                print(f"  Collected {len(products)} products")
            except Exception as e:
                print(f"  Failed: {e}")

            # Longer delay between stores
            time.sleep(random.uniform(3, 8))

        return all_products

    def price_comparison(self, store_urls):
        """Compare pricing across competitor stores."""
        all_data = self.scrape_competitors(store_urls)
        df = pd.DataFrame(all_data)

        if df.empty:
            return df

        # Analysis by store
        summary = df.groupby("store_url").agg({
            "product_id": "count",
            "min_price": ["mean", "min", "max"],
            "variant_count": "mean",
        }).round(2)

        return summary

    def find_common_products(self, store_urls):
        """Find products that appear across multiple stores (by vendor/type)."""
        all_data = self.scrape_competitors(store_urls)
        df = pd.DataFrame(all_data)

        if df.empty:
            return df

        # Group by vendor + product_type to find overlap
        vendor_counts = df.groupby(["vendor", "product_type"]).agg({
            "store_url": "nunique",
            "product_id": "count",
            "min_price": "mean",
        }).reset_index()

        # Products available in multiple stores
        overlap = vendor_counts[vendor_counts["store_url"] > 1]
        return overlap.sort_values("store_url", ascending=False)

Monitoring Price Changes Over Time

For ongoing competitive monitoring, store snapshots and track changes:

class ShopifyPriceMonitor:
    """Monitors price changes across Shopify stores over time."""

    def __init__(self, scraper, data_dir="shopify_data"):
        self.scraper = scraper
        self.data_dir = data_dir

    def take_snapshot(self, store_url):
        """Take a price snapshot of a store."""
        import os
        os.makedirs(self.data_dir, exist_ok=True)

        products = self.scraper.scrape_store(store_url)
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        store_name = store_url.split("//")[-1].split(".")[0]

        filename = f"{self.data_dir}/{store_name}_{timestamp}.json"
        with open(filename, "w") as f:
            json.dump(products, f, indent=2, default=str)

        print(f"Snapshot saved: {filename} ({len(products)} products)")
        return filename

    def compare_snapshots(self, file_old, file_new):
        """Compare two snapshots to find price changes."""
        with open(file_old) as f:
            old_data = json.load(f)
        with open(file_new) as f:
            new_data = json.load(f)

        old_prices = {p["product_id"]: p for p in old_data}
        new_prices = {p["product_id"]: p for p in new_data}

        changes = []

        for pid, new_product in new_prices.items():
            if pid in old_prices:
                old_product = old_prices[pid]
                old_price = old_product.get("primary_price")
                new_price = new_product.get("primary_price")

                if old_price and new_price and old_price != new_price:
                    changes.append({
                        "product_id": pid,
                        "title": new_product["title"],
                        "old_price": float(old_price),
                        "new_price": float(new_price),
                        "change_pct": round(
                            (float(new_price) - float(old_price)) / float(old_price) * 100, 2
                        ),
                    })

        # New products
        new_products = [
            new_prices[pid] for pid in new_prices
            if pid not in old_prices
        ]

        # Removed products
        removed_products = [
            old_prices[pid] for pid in old_prices
            if pid not in new_prices
        ]

        return {
            "price_changes": changes,
            "new_products": len(new_products),
            "removed_products": len(removed_products),
            "total_changes": len(changes),
        }

Running the Complete Pipeline

def main():
    proxies = [
        "http://user:pass@proxy1.example.com:8080",
        "http://user:pass@proxy2.example.com:8080",
        "http://user:pass@proxy3.example.com:8080",
    ]

    pool = ShopifyProxyPool(proxies)
    scraper = ShopifyScraper(pool)

    # Scrape a single store
    products = scraper.scrape_store("https://example-store.myshopify.com")
    df = pd.DataFrame(products)
    df.to_csv("shopify_products.csv", index=False)
    print(f"Total products: {len(products)}")

    # Price summary
    if not df.empty and "min_price" in df.columns:
        df["min_price"] = pd.to_numeric(df["min_price"], errors="coerce")
        print(f"\nPrice range: ${df['min_price'].min():.2f} - ${df['min_price'].max():.2f}")
        print(f"Average price: ${df['min_price'].mean():.2f}")
        print(f"Median price: ${df['min_price'].median():.2f}")

    # Competitor analysis
    tracker = ShopifyCompetitorTracker(scraper)
    competitor_stores = [
        "https://store-one.myshopify.com",
        "https://store-two.myshopify.com",
        "https://store-three.myshopify.com",
    ]

    all_competitor_data = tracker.scrape_competitors(competitor_stores)
    competitor_df = pd.DataFrame(all_competitor_data)
    competitor_df.to_csv("competitor_products.csv", index=False)

    # Detailed variant-level data
    variants = scraper.scrape_store_detailed("https://example-store.myshopify.com")
    variant_df = pd.DataFrame(variants)
    variant_df.to_csv("shopify_variants.csv", index=False)
    print(f"\nTotal variants: {len(variants)}")


if __name__ == "__main__":
    main()

Discovering Shopify Stores to Scrape

Finding competitor Shopify stores is part of the research process. Several indicators reveal whether a site runs on Shopify:

def is_shopify_store(url, proxy_pool):
    """Check if a URL is a Shopify store."""
    proxy = proxy_pool.get_proxy()
    try:
        response = requests.get(
            f"{url}/products.json?limit=1",
            proxies=proxy,
            timeout=10,
        )
        if response.status_code == 200:
            data = response.json()
            return "products" in data
    except Exception:
        pass
    return False

You can also identify Shopify stores by checking for cdn.shopify.com in page source or the presence of Shopify.theme JavaScript variables.
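That page-source check reduces to a small heuristic. This is a sketch with a non-exhaustive marker list; fetching the page and handling errors is left to the caller:

```python
def looks_like_shopify(html_text):
    """Heuristic: does raw page HTML contain common Shopify markers?"""
    markers = ("cdn.shopify.com", "Shopify.theme", "myshopify.com")
    return any(marker in html_text for marker in markers)
```

Combine this with the /products.json probe above: the JSON check is authoritative, while the marker check is a cheap first-pass filter over pages you have already fetched.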

Handling Shopify-Specific Challenges

Rate limiting (HTTP 430). Shopify returns a non-standard HTTP 430 status when rate limiting takes effect. This is different from the standard 429. Implement specific handling for both status codes.

Private/password-protected stores. Some Shopify stores require a password to access. These stores return a redirect to the password page instead of product data. Detect this and skip these stores.
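One way to detect this is to request with redirect following disabled and inspect the Location header. The check itself is a pure function, sketched here assuming the /password redirect convention described above:

```python
def is_password_redirect(status_code, location):
    """True when a response looks like a Shopify password-page redirect.

    Intended for use with requests.get(..., allow_redirects=False),
    passing response.status_code and response.headers.get("Location").
    """
    return status_code in (301, 302) and "/password" in (location or "")
```

Keeping the logic free of network calls makes it trivial to unit-test and to reuse across scrapers.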

Custom domain mapping. Many Shopify stores use custom domains rather than *.myshopify.com. The /products.json endpoint works on custom domains as well.

Large catalogs. Stores with thousands of products require careful pagination. Shopify’s JSON endpoint has historically used page-based pagination, but newer implementations may use cursor-based pagination. Handle both patterns.
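Cursor-based pagination typically surfaces as a Link response header carrying a page_info token. The parser below is a sketch; the header format is assumed from Shopify's REST pagination conventions and may differ on some storefronts:

```python
import re


def next_page_info(link_header):
    """Extract the page_info cursor for rel="next" from a Link header.

    Returns None when no next page is advertised.
    """
    if not link_header:
        return None
    for part in link_header.split(","):
        if 'rel="next"' in part:
            match = re.search(r"[?&]page_info=([^&>\s]+)", part)
            if match:
                return match.group(1)
    return None
```

In a scraping loop, fall back to page-based pagination when the Link header is absent, and switch to passing page_info once it appears.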

Inventory tracking disabled. Not all stores expose inventory quantities. When inventory_quantity is null, the product may still be available, but you cannot determine stock levels.
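A small helper can keep tracked and untracked variants separate, so null quantities are not silently counted as zero stock. This is a sketch operating on the variant dictionaries shown earlier:

```python
def stock_summary(variants):
    """Summarize stock for variants whose inventory_quantity may be null."""
    tracked = [
        v["inventory_quantity"]
        for v in variants
        if v.get("inventory_quantity") is not None
    ]
    return {
        "tracked_variants": len(tracked),
        "untracked_variants": len(variants) - len(tracked),
        "known_stock": sum(tracked),
    }
```

Reporting the untracked count alongside known stock makes it explicit when a store's totals are only a lower bound.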

Conclusion

Shopify’s built-in JSON API makes it one of the most scraper-friendly e-commerce platforms on the internet. The structured data format eliminates HTML parsing headaches, and the predictable endpoint structure makes building reliable scrapers straightforward.

With mobile proxy rotation to handle rate limits, you can monitor competitor pricing across hundreds of Shopify stores automatically. For more e-commerce scraping strategies, explore our e-commerce proxy guides and the proxy glossary for technical reference.

