How to Scrape Shopify Store Product Catalogs
Shopify powers over 4 million active online stores, making it the dominant e-commerce platform worldwide. For competitive intelligence teams, dropshippers, market researchers, and pricing analysts, access to Shopify store product data provides critical insights into competitor pricing, product assortment, and inventory strategies.
What makes Shopify particularly interesting from a scraping perspective is its built-in /products.json endpoint. Every Shopify store exposes a JSON API that returns structured product data without requiring any HTML parsing. Combined with mobile proxy rotation for scale, this makes Shopify one of the most efficient e-commerce platforms to scrape.
The Shopify /products.json Endpoint
Every Shopify store has a public JSON endpoint at {store-url}/products.json that returns product data in a clean, structured format. This endpoint is part of Shopify’s storefront architecture and is accessible without authentication on most stores.
The endpoint supports pagination through a page parameter and returns up to 250 products per page via the limit parameter. Here is the basic structure:

https://example-store.myshopify.com/products.json?limit=250&page=1

The response includes:
- Product title, description, and vendor
- All variant details (size, color, price, SKU)
- Image URLs
- Product type and tags
- Creation and update timestamps
- Availability status
This structured approach eliminates HTML parsing entirely, making Shopify significantly more reliable to scrape than most other e-commerce targets.
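A quick way to see the shape of the data is to request a single page directly. The sketch below fetches a few products from a placeholder store and prints each title with its first variant's price:

import requests

# Minimal sketch: fetch one page of products from a (placeholder) store.
url = "https://example-store.myshopify.com/products.json"
response = requests.get(url, params={"limit": 5, "page": 1}, timeout=10)
response.raise_for_status()

for product in response.json().get("products", []):
    variants = product.get("variants", [])
    first_price = variants[0].get("price") if variants else None
    print(product["title"], "-", first_price)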
Setting Up the Environment
pip install requests pandas tqdm

Building the Shopify Product Scraper
The core scraper leverages the JSON endpoint with proxy rotation for handling large numbers of stores:
import requests
import pandas as pd
import time
import random
import json
import os
from datetime import datetime
from tqdm import tqdm
class ShopifyProxyPool:
"""Manages proxy rotation for Shopify scraping."""
def __init__(self, proxy_list):
self.proxies = proxy_list
self.index = 0
self.cooldown = {}
def get_proxy(self):
"""Return the next proxy, skipping those in cooldown."""
now = time.time()
available = [
p for p in self.proxies
if p not in self.cooldown or now > self.cooldown[p]
]
if not available:
self.cooldown.clear()
available = self.proxies
proxy = available[self.index % len(available)]
self.index += 1
return {"http": proxy, "https": proxy}
def set_cooldown(self, proxy_dict, seconds=60):
"""Put a proxy on cooldown after a failure."""
proxy_url = proxy_dict.get("http", "")
self.cooldown[proxy_url] = time.time() + seconds
class ShopifyScraper:
"""Scrapes product data from Shopify stores using the JSON API."""
def __init__(self, proxy_pool):
self.proxy_pool = proxy_pool
self.session = requests.Session()
self.session.headers.update({
"User-Agent": (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/120.0.0.0 Safari/537.36"
),
"Accept": "application/json",
})
def scrape_store(self, store_url, max_products=None):
"""Scrape all products from a single Shopify store."""
store_url = store_url.rstrip("/")
all_products = []
page = 1
limit = 250
while True:
url = f"{store_url}/products.json?limit={limit}&page={page}"
proxy = self.proxy_pool.get_proxy()
try:
response = self.session.get(url, proxies=proxy, timeout=15)
if response.status_code == 200:
data = response.json()
products = data.get("products", [])
if not products:
break
parsed = [self._parse_product(p, store_url) for p in products]
all_products.extend(parsed)
print(f"Page {page}: {len(products)} products from {store_url}")
if max_products and len(all_products) >= max_products:
all_products = all_products[:max_products]
break
page += 1
time.sleep(random.uniform(1, 2))
elif response.status_code == 429:
print(f"Rate limited on {store_url}, cooling down proxy...")
self.proxy_pool.set_cooldown(proxy, seconds=30)
time.sleep(random.uniform(5, 10))
elif response.status_code == 430:
# Shopify-specific: too many requests
print(f"Shopify 430 error, backing off...")
self.proxy_pool.set_cooldown(proxy, seconds=60)
time.sleep(random.uniform(10, 20))
else:
print(f"HTTP {response.status_code} from {store_url}")
break
except requests.RequestException as e:
print(f"Request error for {store_url}: {e}")
self.proxy_pool.set_cooldown(proxy, seconds=30)
break
return all_products
def _parse_product(self, product_data, store_url):
"""Parse a product JSON object into a flat structure."""
product = {
"store_url": store_url,
"product_id": product_data.get("id"),
"title": product_data.get("title"),
"vendor": product_data.get("vendor"),
"product_type": product_data.get("product_type"),
"handle": product_data.get("handle"),
"product_url": f"{store_url}/products/{product_data.get('handle', '')}",
"description_html": product_data.get("body_html", ""),
"tags": ", ".join(product_data.get("tags", [])),
"created_at": product_data.get("created_at"),
"updated_at": product_data.get("updated_at"),
"published_at": product_data.get("published_at"),
}
# Extract variant data
variants = product_data.get("variants", [])
if variants:
prices = [float(v.get("price", 0)) for v in variants if v.get("price")]
product["min_price"] = min(prices) if prices else None
product["max_price"] = max(prices) if prices else None
product["variant_count"] = len(variants)
product["total_inventory"] = sum(
v.get("inventory_quantity", 0) for v in variants
if v.get("inventory_quantity") is not None
)
# First variant details
first_variant = variants[0]
product["primary_price"] = first_variant.get("price")
product["compare_at_price"] = first_variant.get("compare_at_price")
product["sku"] = first_variant.get("sku")
product["weight"] = first_variant.get("weight")
product["requires_shipping"] = first_variant.get("requires_shipping")
# Extract images
images = product_data.get("images", [])
product["image_count"] = len(images)
product["primary_image_url"] = images[0].get("src") if images else None
return product
def scrape_store_detailed(self, store_url):
"""Scrape products with full variant-level detail."""
store_url = store_url.rstrip("/")
all_variants = []
page = 1
while True:
url = f"{store_url}/products.json?limit=250&page={page}"
proxy = self.proxy_pool.get_proxy()
try:
response = self.session.get(url, proxies=proxy, timeout=15)
if response.status_code != 200:
break
products = response.json().get("products", [])
if not products:
break
for product in products:
for variant in product.get("variants", []):
variant_data = {
"store_url": store_url,
"product_id": product["id"],
"product_title": product["title"],
"product_type": product.get("product_type"),
"vendor": product.get("vendor"),
"variant_id": variant["id"],
"variant_title": variant.get("title"),
"price": variant.get("price"),
"compare_at_price": variant.get("compare_at_price"),
"sku": variant.get("sku"),
"available": variant.get("available"),
"inventory_quantity": variant.get("inventory_quantity"),
"weight": variant.get("weight"),
"option1": variant.get("option1"),
"option2": variant.get("option2"),
"option3": variant.get("option3"),
}
all_variants.append(variant_data)
page += 1
time.sleep(random.uniform(1, 2))
except Exception as e:
print(f"Error: {e}")
break
        return all_variants

Scraping Multiple Competitor Stores
For competitive analysis, scrape product data from multiple stores in a single pipeline:
class ShopifyCompetitorTracker:
"""Tracks product data across multiple Shopify competitor stores."""
def __init__(self, scraper):
self.scraper = scraper
def scrape_competitors(self, store_urls):
"""Scrape all products from a list of competitor stores."""
all_products = []
for i, url in enumerate(store_urls):
print(f"\n[{i + 1}/{len(store_urls)}] Scraping: {url}")
try:
products = self.scraper.scrape_store(url)
all_products.extend(products)
print(f" Collected {len(products)} products")
except Exception as e:
print(f" Failed: {e}")
# Longer delay between stores
time.sleep(random.uniform(3, 8))
return all_products
def price_comparison(self, store_urls):
"""Compare pricing across competitor stores."""
all_data = self.scrape_competitors(store_urls)
df = pd.DataFrame(all_data)
if df.empty:
return df
# Analysis by store
summary = df.groupby("store_url").agg({
"product_id": "count",
"min_price": ["mean", "min", "max"],
"variant_count": "mean",
}).round(2)
return summary
def find_common_products(self, store_urls):
"""Find products that appear across multiple stores (by vendor/type)."""
all_data = self.scrape_competitors(store_urls)
df = pd.DataFrame(all_data)
if df.empty:
return df
# Group by vendor + product_type to find overlap
vendor_counts = df.groupby(["vendor", "product_type"]).agg({
"store_url": "nunique",
"product_id": "count",
"min_price": "mean",
}).reset_index()
# Products available in multiple stores
overlap = vendor_counts[vendor_counts["store_url"] > 1]
        return overlap.sort_values("store_url", ascending=False)

Monitoring Price Changes Over Time
For ongoing competitive monitoring, store snapshots and track changes:
class ShopifyPriceMonitor:
"""Monitors price changes across Shopify stores over time."""
def __init__(self, scraper, data_dir="shopify_data"):
self.scraper = scraper
self.data_dir = data_dir
def take_snapshot(self, store_url):
"""Take a price snapshot of a store."""
        os.makedirs(self.data_dir, exist_ok=True)
products = self.scraper.scrape_store(store_url)
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
store_name = store_url.split("//")[-1].split(".")[0]
filename = f"{self.data_dir}/{store_name}_{timestamp}.json"
with open(filename, "w") as f:
json.dump(products, f, indent=2, default=str)
print(f"Snapshot saved: {filename} ({len(products)} products)")
return filename
def compare_snapshots(self, file_old, file_new):
"""Compare two snapshots to find price changes."""
with open(file_old) as f:
old_data = json.load(f)
with open(file_new) as f:
new_data = json.load(f)
old_prices = {p["product_id"]: p for p in old_data}
new_prices = {p["product_id"]: p for p in new_data}
changes = []
for pid, new_product in new_prices.items():
if pid in old_prices:
old_product = old_prices[pid]
old_price = old_product.get("primary_price")
new_price = new_product.get("primary_price")
                if old_price and new_price and float(old_price) != float(new_price):
changes.append({
"product_id": pid,
"title": new_product["title"],
"old_price": float(old_price),
"new_price": float(new_price),
"change_pct": round(
(float(new_price) - float(old_price)) / float(old_price) * 100, 2
),
})
# New products
new_products = [
new_prices[pid] for pid in new_prices
if pid not in old_prices
]
# Removed products
removed_products = [
old_prices[pid] for pid in old_prices
if pid not in new_prices
]
return {
"price_changes": changes,
"new_products": len(new_products),
"removed_products": len(removed_products),
"total_changes": len(changes),
        }

Running the Complete Pipeline
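Before assembling the full pipeline, here is a minimal sketch of how the monitor might be driven; the store URL is a placeholder, and it assumes pool and scraper are built as in main() below:

# Sketch: take two snapshots over time, then diff them.
monitor = ShopifyPriceMonitor(scraper)
snapshot_a = monitor.take_snapshot("https://example-store.myshopify.com")
# ... wait hours or days between runs ...
snapshot_b = monitor.take_snapshot("https://example-store.myshopify.com")
report = monitor.compare_snapshots(snapshot_a, snapshot_b)
print(report["total_changes"], "price changes detected")

The full pipeline below ties the scraper, competitor tracker, and CSV exports together: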
def main():
proxies = [
"http://user:pass@proxy1.example.com:8080",
"http://user:pass@proxy2.example.com:8080",
"http://user:pass@proxy3.example.com:8080",
]
pool = ShopifyProxyPool(proxies)
scraper = ShopifyScraper(pool)
# Scrape a single store
products = scraper.scrape_store("https://example-store.myshopify.com")
df = pd.DataFrame(products)
df.to_csv("shopify_products.csv", index=False)
print(f"Total products: {len(products)}")
# Price summary
if not df.empty and "min_price" in df.columns:
df["min_price"] = pd.to_numeric(df["min_price"], errors="coerce")
print(f"\nPrice range: ${df['min_price'].min():.2f} - ${df['min_price'].max():.2f}")
print(f"Average price: ${df['min_price'].mean():.2f}")
print(f"Median price: ${df['min_price'].median():.2f}")
# Competitor analysis
tracker = ShopifyCompetitorTracker(scraper)
competitor_stores = [
"https://store-one.myshopify.com",
"https://store-two.myshopify.com",
"https://store-three.myshopify.com",
]
all_competitor_data = tracker.scrape_competitors(competitor_stores)
competitor_df = pd.DataFrame(all_competitor_data)
competitor_df.to_csv("competitor_products.csv", index=False)
# Detailed variant-level data
variants = scraper.scrape_store_detailed("https://example-store.myshopify.com")
variant_df = pd.DataFrame(variants)
variant_df.to_csv("shopify_variants.csv", index=False)
print(f"\nTotal variants: {len(variants)}")
if __name__ == "__main__":
    main()

Discovering Shopify Stores to Scrape
Finding competitor Shopify stores is part of the research process. Several indicators reveal whether a site runs on Shopify:
def is_shopify_store(url, proxy_pool):
"""Check if a URL is a Shopify store."""
proxy = proxy_pool.get_proxy()
try:
response = requests.get(
f"{url}/products.json?limit=1",
proxies=proxy,
timeout=10,
)
if response.status_code == 200:
data = response.json()
return "products" in data
except Exception:
pass
    return False

You can also identify Shopify stores by checking for cdn.shopify.com in the page source or for the Shopify.theme JavaScript variable, as sketched below.
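A heuristic version of that check might look like this (the helper name is ours; either signal alone can be a false positive, e.g. a site that merely embeds Shopify-hosted assets):

def looks_like_shopify(url, timeout=10):
    """Heuristic: look for Shopify fingerprints in the page source."""
    try:
        html = requests.get(url, timeout=timeout).text
    except requests.RequestException:
        return False
    # Either string alone is a strong but not conclusive signal.
    return "cdn.shopify.com" in html or "Shopify.theme" in html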
Handling Shopify-Specific Challenges
Rate limiting (HTTP 430). Shopify returns a non-standard HTTP 430 status when rate limiting takes effect. This is different from the standard 429. Implement specific handling for both status codes.
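One way to centralize that handling is a small backoff helper. This is a sketch with arbitrarily chosen retry counts and delays, reusing the time and random imports from the scraper above:

RETRYABLE_STATUSES = {429, 430}

def get_with_backoff(session, url, max_retries=4, base_delay=5, **kwargs):
    """Retry a GET with exponential backoff on 429 and Shopify's 430."""
    response = None
    for attempt in range(max_retries):
        response = session.get(url, **kwargs)
        if response.status_code not in RETRYABLE_STATUSES:
            return response
        # Exponential backoff with jitter: 5s, 10s, 20s, 40s (plus noise).
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 2))
    return response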
Private/password-protected stores. Some Shopify stores require a password to access. These stores return a redirect to the password page instead of product data. Detect this and skip these stores.
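A sketch of that detection, assuming the storefront redirects /products.json to its /password page (typical behavior for password-protected Shopify stores):

def is_password_protected(store_url, timeout=10):
    """Detect Shopify's password page via the redirect it issues."""
    response = requests.get(
        f"{store_url.rstrip('/')}/products.json?limit=1",
        allow_redirects=False,
        timeout=timeout,
    )
    location = response.headers.get("Location", "")
    return response.status_code in (301, 302) and "/password" in location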
Custom domain mapping. Many Shopify stores use custom domains rather than *.myshopify.com. The /products.json endpoint works on custom domains as well.
Large catalogs. Stores with thousands of products require careful pagination. Shopify’s JSON endpoint has historically used page-based pagination, but newer implementations may use cursor-based pagination. Handle both patterns.
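A pagination loop that prefers a cursor link when one is present and otherwise falls back to page numbers might look like this; the Link-header handling is an assumption, since most storefronts respond to the classic page parameter:

def fetch_all_products(store_url, session, timeout=15):
    """Paginate /products.json via page numbers or a Link-header cursor."""
    store_url = store_url.rstrip("/")
    products, page = [], 1
    url = f"{store_url}/products.json?limit=250&page={page}"
    while url:
        response = session.get(url, timeout=timeout)
        if response.status_code != 200:
            break
        batch = response.json().get("products", [])
        if not batch:
            break
        products.extend(batch)
        # requests parses the Link header into response.links; prefer a
        # rel="next" cursor URL if the store provides one.
        next_link = response.links.get("next", {}).get("url")
        if next_link:
            url = next_link
        else:
            page += 1
            url = f"{store_url}/products.json?limit=250&page={page}"
    return products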
Inventory tracking disabled. Not all stores expose inventory quantities. When inventory_quantity is null, the product may still be available, but you cannot determine stock levels.
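When quantities are hidden, you can still derive a coarse stock signal from the boolean available flag, as in this sketch:

def stock_signal(variant):
    """Derive a coarse stock signal when inventory tracking is off."""
    qty = variant.get("inventory_quantity")
    if qty is not None:
        return "in_stock" if qty > 0 else "out_of_stock"
    # No quantity exposed: fall back to the availability flag.
    return "available" if variant.get("available") else "unavailable"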
Conclusion
Shopify’s built-in JSON API makes it one of the most scraper-friendly e-commerce platforms on the internet. The structured data format eliminates HTML parsing headaches, and the predictable endpoint structure makes building reliable scrapers straightforward.
With mobile proxy rotation to handle rate limits, you can monitor competitor pricing across hundreds of Shopify stores automatically. For more e-commerce scraping strategies, explore our e-commerce proxy guides and the proxy glossary for technical reference.
Related Reading
- How to Scrape Amazon Product Data with Proxies: 2026 Python Guide
- How to Scrape Bing Search Results with Python and Proxies
- aiohttp + BeautifulSoup: Async Python Scraping
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)
- API vs Web Scraping: When You Need Proxies (and When You Don’t)
- ASEAN Data Protection Laws: A Web Scraping Compliance Matrix