How to Scrape Costco Product Data in 2026
Costco Wholesale is the third-largest retailer in the world, with over 870 warehouse locations and a massive online storefront. Known for its membership model and bulk pricing, Costco carries a curated selection of products at competitive prices. For pricing analysts, competitor intelligence teams, and supply chain researchers, scraping Costco provides insights into wholesale pricing, product availability, and market trends.
This guide covers how to scrape Costco product data using Python, handle their anti-bot measures, and integrate proxies for reliable extraction.
What Data Can You Extract from Costco?
Costco’s website contains valuable product information:
- Product titles and descriptions
- Pricing (member price, non-member price, instant savings)
- Product specifications and dimensions
- Availability (online, in-warehouse, delivery options)
- Customer ratings and reviews
- Brand information
- Item numbers and model numbers
- Product images
- Shipping and delivery information
- Quantity and pack size details
Example JSON Output
```json
{
  "item_number": "1234567",
  "title": "Kirkland Signature Extra Virgin Olive Oil, 2L, 2-count",
  "price": 24.99,
  "member_only": true,
  "currency": "USD",
  "rating": 4.7,
  "review_count": 3421,
  "brand": "Kirkland Signature",
  "delivery": {
    "available": true,
    "shipping": "Free Shipping"
  },
  "specifications": {
    "Pack Size": "2-count",
    "Volume": "2L each",
    "Origin": "Italy"
  },
  "categories": ["Grocery", "Pantry", "Oils & Vinegars"],
  "url": "https://www.costco.com/kirkland-olive-oil.product.1234567.html"
}
```

Prerequisites

```bash
pip install requests beautifulsoup4 lxml fake-useragent playwright
playwright install chromium
```

Costco’s website is protected by sophisticated anti-bot systems. Residential proxies with US IP addresses are essential.
Method 1: Scraping Costco with Playwright
Costco’s website relies heavily on JavaScript rendering and has strong bot detection. Playwright is the recommended approach.
```python
import asyncio
import json
import random
from urllib.parse import quote_plus

from playwright.async_api import async_playwright


class CostcoScraper:
    def __init__(self, proxy=None):
        self.proxy = proxy

    async def search_products(self, query, max_pages=3):
        """Search Costco and extract product data."""
        async with async_playwright() as p:
            browser_args = {"headless": True}
            if self.proxy:
                browser_args["proxy"] = {
                    "server": self.proxy["server"],
                    "username": self.proxy.get("username"),
                    "password": self.proxy.get("password"),
                }
            browser = await p.chromium.launch(**browser_args)
            context = await browser.new_context(
                viewport={"width": 1920, "height": 1080},
                user_agent=(
                    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                    "AppleWebKit/537.36 (KHTML, like Gecko) "
                    "Chrome/120.0.0.0 Safari/537.36"
                ),
            )
            page = await context.new_page()
            all_products = []
            for pg in range(1, max_pages + 1):
                url = (
                    "https://www.costco.com/CatalogSearch"
                    f"?dept=All&keyword={quote_plus(query)}&currentPage={pg}"
                )
                try:
                    await page.goto(url, wait_until="networkidle", timeout=60000)
                    await asyncio.sleep(3)
                    # Scroll to load lazy content
                    for _ in range(6):
                        await page.evaluate("window.scrollBy(0, 600)")
                        await asyncio.sleep(0.5)
                    # Extract product data
                    products = await page.evaluate("""
                        () => {
                            const items = [];
                            const cards = document.querySelectorAll('.product-tile, [class*="product-card"]');
                            cards.forEach(card => {
                                const title = card.querySelector('.description a, [class*="product-title"]');
                                const price = card.querySelector('.price, [class*="product-price"]');
                                const rating = card.querySelector('[class*="star-rating"], [class*="reviews"]');
                                const link = card.querySelector('a[href*=".product."]');
                                items.push({
                                    title: title ? title.innerText.trim() : null,
                                    price: price ? price.innerText.trim() : null,
                                    rating: rating ? rating.innerText.trim() : null,
                                    url: link ? link.href : null
                                });
                            });
                            return items;
                        }
                    """)
                    all_products.extend(products)
                    print(f"Page {pg}: Found {len(products)} products")
                except Exception as e:
                    print(f"Error on page {pg}: {e}")
                await asyncio.sleep(random.uniform(3, 7))
            await browser.close()
            return all_products

    async def scrape_product_page(self, url):
        """Scrape detailed product information."""
        async with async_playwright() as p:
            browser_args = {"headless": True}
            if self.proxy:
                browser_args["proxy"] = {"server": self.proxy["server"]}
            browser = await p.chromium.launch(**browser_args)
            page = await browser.new_page()
            await page.goto(url, wait_until="networkidle", timeout=60000)
            await asyncio.sleep(2)
            # Extract JSON-LD structured data
            product = await page.evaluate("""
                () => {
                    const scripts = document.querySelectorAll('script[type="application/ld+json"]');
                    for (const script of scripts) {
                        try {
                            const data = JSON.parse(script.textContent);
                            if (data['@type'] === 'Product') return data;
                        } catch {}
                    }
                    // Fallback to DOM extraction
                    const title = document.querySelector('h1[itemprop="name"], h1');
                    const price = document.querySelector('[class*="your-price"] span, .price');
                    const desc = document.querySelector('[itemprop="description"]');
                    return {
                        title: title ? title.innerText.trim() : null,
                        price: price ? price.innerText.trim() : null,
                        description: desc ? desc.innerText.trim() : null
                    };
                }
            """)
            await browser.close()
            return product


# Usage
proxy_config = {
    "server": "http://proxy-server:port",
    "username": "user",
    "password": "pass"
}
scraper = CostcoScraper(proxy=proxy_config)
results = asyncio.run(scraper.search_products("olive oil", max_pages=2))
print(json.dumps(results[:5], indent=2))
```

Method 2: Using Costco’s Internal API
Costco’s frontend fetches catalog data from internal endpoints. Requesting these endpoints directly with a plain HTTP client is faster and lighter than driving a full browser, though the search endpoint still returns HTML that must be parsed.
```python
import json

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent


class CostcoAPIScraper:
    def __init__(self, proxy_url=None):
        self.session = requests.Session()
        self.ua = UserAgent()
        self.proxy_url = proxy_url

    def _get_headers(self):
        return {
            "User-Agent": self.ua.random,
            "Accept": "application/json, text/plain, */*",
            "Accept-Language": "en-US,en;q=0.9",
            "Referer": "https://www.costco.com/",
            "Origin": "https://www.costco.com",
            "Connection": "keep-alive",
        }

    def _get_proxies(self):
        if self.proxy_url:
            return {"http": self.proxy_url, "https": self.proxy_url}
        return None

    def search_products(self, query, page_size=24, page=1):
        """Query Costco's catalog search endpoint and parse the returned HTML."""
        url = "https://www.costco.com/CatalogSearch"
        params = {
            "keyword": query,
            "pageSize": page_size,
            "currentPage": page,
            "responseGroup": "Large",
        }
        try:
            response = self.session.get(
                url,
                params=params,
                headers=self._get_headers(),
                proxies=self._get_proxies(),
                timeout=30,
            )
            response.raise_for_status()
            soup = BeautifulSoup(response.text, "lxml")
            products = []
            for card in soup.select(".product-tile, .product"):
                title = card.select_one(".description a")
                price = card.select_one(".price")
                products.append({
                    "title": title.get_text(strip=True) if title else None,
                    "price": price.get_text(strip=True) if price else None,
                    "url": "https://www.costco.com" + title["href"]
                           if title and title.get("href") else None,
                })
            return products
        except requests.RequestException as e:
            print(f"Error: {e}")
            return []


# Usage
scraper = CostcoAPIScraper(proxy_url="http://user:pass@proxy:port")
results = scraper.search_products("kirkland supplements")
print(json.dumps(results[:5], indent=2))
```

Handling Costco Anti-Bot Protections
1. Bot Detection (Akamai/PerimeterX)
Costco uses advanced bot detection that analyzes browser fingerprints, mouse behavior, and request patterns. Use stealth browser configurations and residential proxies.
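As a starting point for a hardened configuration, the helper below builds Playwright launch options that soften the most obvious automation signals. The Chromium switch names are standard, but treat this as a sketch to extend with your own fingerprint tuning, not a guaranteed bypass:

```python
def stealth_launch_options(proxy_server=None):
    """Build kwargs for Playwright's chromium.launch() that reduce
    obvious automation signals."""
    options = {
        "headless": True,
        "args": [
            # Prevents Chromium from exposing navigator.webdriver = true
            "--disable-blink-features=AutomationControlled",
            "--disable-dev-shm-usage",
            "--no-first-run",
        ],
    }
    if proxy_server:
        options["proxy"] = {"server": proxy_server}
    return options
```

Pass the result directly to the launch call, e.g. `browser = await p.chromium.launch(**stealth_launch_options("http://proxy:8080"))`.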
2. Membership Gating
Some Costco prices and products are only visible to members. Use authenticated sessions with cookies from a valid login for full access.
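One way to reuse a manual login is to export the session cookies from your browser's dev tools and replay them in Playwright. The helper below converts plain name/value pairs into the list-of-dicts shape that `context.add_cookies()` accepts; the cookie names shown are placeholders, since the real ones are specific to your session:

```python
def to_playwright_cookies(cookie_pairs, domain=".costco.com"):
    """Convert {name: value} pairs exported after a manual login into
    the format Playwright's context.add_cookies() expects."""
    return [
        {"name": name, "value": value, "domain": domain, "path": "/"}
        for name, value in cookie_pairs.items()
    ]

# Placeholder cookie names -- inspect a real logged-in session for the actual ones
session_cookies = to_playwright_cookies({"SESSION_ID": "abc123"})
```

Call `await context.add_cookies(session_cookies)` before navigating so member-gated pages load with your authenticated session.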
3. Rate Limiting
Costco limits request frequency aggressively. Keep delays at 3-7 seconds between requests and rotate proxies every 10-15 requests.
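A small helper can enforce both rules at once. The rotation threshold and delay window below mirror the numbers above; this is a generic sketch not tied to any particular proxy provider:

```python
import itertools
import random
import time


class ProxyRotator:
    """Cycle through a proxy pool, switching to the next proxy after a
    fixed number of requests."""

    def __init__(self, proxies, rotate_every=12):
        self._pool = itertools.cycle(proxies)
        self.rotate_every = rotate_every
        self._count = 0
        self.current = next(self._pool)

    def get(self):
        """Return the proxy for the next request, rotating when the
        threshold is exceeded."""
        self._count += 1
        if self._count > self.rotate_every:
            self.current = next(self._pool)
            self._count = 1
        return self.current


def polite_delay(low=3.0, high=7.0):
    """Sleep a random interval between requests."""
    time.sleep(random.uniform(low, high))
```

Call `rotator.get()` before each request and `polite_delay()` after it; `itertools.cycle` loops back to the first proxy once the pool is exhausted.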
4. Geographic Restrictions
Costco content varies by warehouse location. Set your ZIP code via cookies to get accurate regional pricing.
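In practice that means adding a location cookie to the browser context before loading product pages. The cookie name below is a hypothetical placeholder; check your own browser's dev tools to find the cookie Costco actually uses for your region:

```python
def zip_cookie(zip_code, name="userLocation"):
    """Build a location cookie for regional pricing. The cookie name is a
    placeholder -- inspect a real session to find the actual one."""
    return {
        "name": name,
        "value": str(zip_code),
        "domain": ".costco.com",
        "path": "/",
    }
```

Apply it with `await context.add_cookies([zip_cookie("98052")])` before the first navigation.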
Proxy Recommendations for Costco
| Proxy Type | Success Rate | Best For |
|---|---|---|
| US Residential | 75-85% | General product scraping |
| Mobile Proxies | 85-95% | Bypassing bot detection |
| ISP Proxies | 70-80% | Price monitoring |
| Datacenter | 10-20% | Not recommended |
US residential proxies are essential for Costco scraping. The site aggressively blocks datacenter IPs and non-US traffic.
Legal Considerations
- Terms of Service: Costco’s ToS prohibits automated scraping.
- Member-Only Data: Scraping member-only pricing may have additional legal implications.
- Rate Limiting: Respect server capacity and implement proper delays.
- Commercial Use: Get legal counsel before using scraped data commercially.
See our web scraping compliance guide for details.
Frequently Asked Questions
Does Costco have a public API?
No. Costco does not offer a public API for product data. Web scraping with browser-based tools like Playwright is the primary extraction method.
Can I scrape Costco without membership?
You can scrape publicly visible product listings without membership. However, member-only pricing and certain product categories require authenticated access.
Why does Costco block my scraper so quickly?
Costco uses PerimeterX/Akamai bot detection that checks browser fingerprints, JavaScript execution, and behavioral patterns. Use stealth Playwright configurations with residential proxies and human-like delays.
How often do Costco prices change?
Costco prices change less frequently than competitors like Amazon. Weekly monitoring is typically sufficient for most pricing intelligence use cases.
Advanced Techniques
Handling Pagination
Most websites paginate their results. Implement robust pagination handling:
```python
import random
import time


def scrape_all_pages(scraper, base_url, max_pages=20):
    all_data = []
    for page in range(1, max_pages + 1):
        url = f"{base_url}?page={page}"
        results = scraper.search(url)
        if not results:
            break  # an empty page signals the end of the result set
        all_data.extend(results)
        print(f"Page {page}: {len(results)} items (total: {len(all_data)})")
        time.sleep(random.uniform(2, 5))
    return all_data
```

Data Validation and Cleaning
Always validate scraped data before storage:
```python
import html
import re


def validate_data(item):
    required_fields = ["title", "url"]
    for field in required_fields:
        if not item.get(field):
            return False
    return True


def clean_text(text):
    if not text:
        return None
    # Collapse runs of whitespace
    text = re.sub(r"\s+", " ", text).strip()
    # Decode HTML entities (&amp; -> &, etc.)
    return html.unescape(text)


# Apply to results
cleaned = [item for item in results if validate_data(item)]
for item in cleaned:
    item["title"] = clean_text(item.get("title"))
```

Monitoring and Alerting
Build monitoring into your scraping pipeline:
```python
import logging
from datetime import datetime

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)


class ScrapingMonitor:
    def __init__(self):
        self.start_time = datetime.now()
        self.requests = 0
        self.errors = 0
        self.items = 0

    def log_request(self, success=True):
        self.requests += 1
        if not success:
            self.errors += 1
        if self.requests % 50 == 0:
            elapsed = (datetime.now() - self.start_time).seconds
            rate = self.requests / max(elapsed, 1) * 60
            logger.info(f"Requests: {self.requests}, Errors: {self.errors}, "
                        f"Items: {self.items}, Rate: {rate:.1f}/min")

    def log_item(self, count=1):
        self.items += count
```

Error Handling and Retry Logic
Implement robust error handling:
```python
import time

from requests.exceptions import RequestException


def retry_request(func, max_retries=3, base_delay=5):
    for attempt in range(max_retries):
        try:
            return func()
        except RequestException as e:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)  # exponential backoff
            print(f"Attempt {attempt + 1} failed: {e}. Retrying in {delay}s...")
            time.sleep(delay)
    return None
```

Data Storage Options
Choose the right storage for your scraping volume:
```python
import csv
import json
import sqlite3


class DataStorage:
    def __init__(self, db_path="scraped_data.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            """CREATE TABLE IF NOT EXISTS items
               (id TEXT PRIMARY KEY, title TEXT, url TEXT, data JSON,
                scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP)"""
        )

    def save(self, item):
        self.conn.execute(
            "INSERT OR REPLACE INTO items (id, title, url, data) VALUES (?, ?, ?, ?)",
            (item.get("id"), item.get("title"), item.get("url"), json.dumps(item)),
        )
        self.conn.commit()

    def export_json(self, output_path):
        cursor = self.conn.execute("SELECT data FROM items")
        items = [json.loads(row[0]) for row in cursor.fetchall()]
        with open(output_path, "w") as f:
            json.dump(items, f, indent=2)

    def export_csv(self, output_path):
        cursor = self.conn.execute("SELECT * FROM items")
        with open(output_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["id", "title", "url", "data", "scraped_at"])
            writer.writerows(cursor.fetchall())
```

Frequently Asked Questions
How often should I scrape data?
The optimal frequency depends on how often the source data changes. For real-time data (stock prices, news), scrape every few minutes. For product listings, daily or weekly is usually sufficient. For reviews, weekly scraping captures new feedback without excessive load.
What happens if my IP gets blocked?
If you receive 403 or 429 status codes, your IP is likely blocked. Switch to a different proxy, implement exponential backoff, and slow your request rate. Rotating residential proxies automatically switch IPs to prevent blocks.
Should I use headless browsers or HTTP requests?
Use HTTP requests (with BeautifulSoup or similar) whenever possible — they are faster and use fewer resources. Switch to headless browsers (Selenium, Playwright) only when JavaScript rendering is required for the data you need.
How do I handle CAPTCHAs?
CAPTCHAs indicate aggressive bot detection. To minimize them: use residential or mobile proxies, implement realistic delays, rotate user agents, and maintain consistent session behavior. For persistent CAPTCHAs, consider CAPTCHA-solving services as a last resort.
Can I scrape data commercially?
The legality of commercial scraping depends on the platform’s ToS, the type of data collected, and your jurisdiction. Public data is generally more permissible, but always consult legal counsel for commercial use cases. See our compliance guide.
Conclusion
Costco’s strong anti-bot protections make it one of the more challenging retail sites to scrape. Playwright with stealth configurations and US residential proxies provides the most reliable approach. Focus on JSON-LD extraction for structured product data and implement careful rate limiting.
For more retail scraping guides, visit our e-commerce proxy guide and proxy comparison tools.
Related Reading
- How to Scrape AliExpress Product Data
- How to Scrape Amazon Product Reviews in 2026
- aiohttp + BeautifulSoup: Async Python Scraping
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)
- API vs Web Scraping: When You Need Proxies (and When You Don’t)
- ASEAN Data Protection Laws: A Web Scraping Compliance Matrix