How to Scrape Costco Product Data in 2026

Costco Wholesale is the third-largest retailer in the world, with over 870 warehouse locations and a massive online storefront. Known for its membership model and bulk pricing, Costco carries a curated selection of products at competitive prices. For pricing analysts, competitor intelligence teams, and supply chain researchers, scraping Costco provides insights into wholesale pricing, product availability, and market trends.

This guide covers how to scrape Costco product data using Python, handle their anti-bot measures, and integrate proxies for reliable extraction.

What Data Can You Extract from Costco?

Costco’s website contains valuable product information:

  • Product titles and descriptions
  • Pricing (member price, non-member price, instant savings)
  • Product specifications and dimensions
  • Availability (online, in-warehouse, delivery options)
  • Customer ratings and reviews
  • Brand information
  • Item numbers and model numbers
  • Product images
  • Shipping and delivery information
  • Quantity and pack size details

Example JSON Output

{
  "item_number": "1234567",
  "title": "Kirkland Signature Extra Virgin Olive Oil, 2L, 2-count",
  "price": 24.99,
  "member_only": true,
  "currency": "USD",
  "rating": 4.7,
  "review_count": 3421,
  "brand": "Kirkland Signature",
  "delivery": {
    "available": true,
    "shipping": "Free Shipping"
  },
  "specifications": {
    "Pack Size": "2-count",
    "Volume": "2L each",
    "Origin": "Italy"
  },
  "categories": ["Grocery", "Pantry", "Oils & Vinegars"],
  "url": "https://www.costco.com/kirkland-olive-oil.product.1234567.html"
}
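The price field above is numeric, but values scraped from the page usually arrive as display strings such as "$24.99" or "$24.99 after $5 OFF". A small normalization helper (an illustrative sketch, not tied to Costco's markup) keeps downstream code clean:

```python
import re

def parse_price(text):
    """Extract the first dollar amount from a scraped price string as a float."""
    if not text:
        return None
    # Match e.g. "$24.99", "1,299.99", "$5"
    match = re.search(r"\$?(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)", text)
    if not match:
        return None
    return float(match.group(1).replace(",", ""))
```

For example, parse_price("$1,299.99 after $200 OFF") returns 1299.99, taking the first amount and ignoring the instant-savings suffix.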

Prerequisites

pip install requests beautifulsoup4 lxml fake-useragent playwright
playwright install chromium

Costco’s website is protected by sophisticated anti-bot systems. Residential proxies with US IP addresses are essential.

Method 1: Scraping Costco with Playwright

Costco’s website relies heavily on JavaScript rendering and has strong bot detection. Playwright is the recommended approach.

import asyncio
from playwright.async_api import async_playwright
import json
import random

class CostcoScraper:
    def __init__(self, proxy=None):
        self.proxy = proxy

    async def search_products(self, query, max_pages=3):
        """Search Costco and extract product data."""
        async with async_playwright() as p:
            browser_args = {"headless": True}
            if self.proxy:
                browser_args["proxy"] = {
                    "server": self.proxy["server"],
                    "username": self.proxy.get("username"),
                    "password": self.proxy.get("password"),
                }

            browser = await p.chromium.launch(**browser_args)
            context = await browser.new_context(
                viewport={"width": 1920, "height": 1080},
                user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
            )
            page = await context.new_page()
            all_products = []

            for pg in range(1, max_pages + 1):
                url = f"https://www.costco.com/CatalogSearch?dept=All&keyword={query}&currentPage={pg}"

                try:
                    await page.goto(url, wait_until="networkidle", timeout=60000)
                    await asyncio.sleep(3)

                    # Scroll to load lazy content
                    for _ in range(6):
                        await page.evaluate("window.scrollBy(0, 600)")
                        await asyncio.sleep(0.5)

                    # Extract product data
                    products = await page.evaluate("""
                        () => {
                            const items = [];
                            const cards = document.querySelectorAll('.product-tile, [class*="product-card"]');
                            cards.forEach(card => {
                                const title = card.querySelector('.description a, [class*="product-title"]');
                                const price = card.querySelector('.price, [class*="product-price"]');
                                const rating = card.querySelector('[class*="star-rating"], [class*="reviews"]');
                                const link = card.querySelector('a[href*=".product."]');

                                if (!title && !link) return;  // skip non-product tiles
                                items.push({
                                    title: title ? title.innerText.trim() : null,
                                    price: price ? price.innerText.trim() : null,
                                    rating: rating ? rating.innerText.trim() : null,
                                    url: link ? link.href : null
                                });
                            });
                            return items;
                        }
                    """)

                    all_products.extend(products)
                    print(f"Page {pg}: Found {len(products)} products")

                except Exception as e:
                    print(f"Error on page {pg}: {e}")

                await asyncio.sleep(random.uniform(3, 7))

            await browser.close()
            return all_products

    async def scrape_product_page(self, url):
        """Scrape detailed product information."""
        async with async_playwright() as p:
            browser_args = {"headless": True}
            if self.proxy:
                browser_args["proxy"] = {
                    "server": self.proxy["server"],
                    "username": self.proxy.get("username"),
                    "password": self.proxy.get("password"),
                }

            browser = await p.chromium.launch(**browser_args)
            page = await browser.new_page()

            await page.goto(url, wait_until="networkidle", timeout=60000)
            await asyncio.sleep(2)

            # Extract JSON-LD structured data
            product = await page.evaluate("""
                () => {
                    const scripts = document.querySelectorAll('script[type="application/ld+json"]');
                    for (const script of scripts) {
                        try {
                            const data = JSON.parse(script.textContent);
                            if (data['@type'] === 'Product') return data;
                        } catch {}
                    }

                    // Fallback to DOM extraction
                    const title = document.querySelector('h1[itemprop="name"], h1');
                    const price = document.querySelector('[class*="your-price"] span, .price');
                    const desc = document.querySelector('[itemprop="description"]');

                    return {
                        title: title ? title.innerText.trim() : null,
                        price: price ? price.innerText.trim() : null,
                        description: desc ? desc.innerText.trim() : null
                    };
                }
            """)

            await browser.close()
            return product


# Usage
proxy_config = {
    "server": "http://proxy-server:port",
    "username": "user",
    "password": "pass"
}

scraper = CostcoScraper(proxy=proxy_config)
results = asyncio.run(scraper.search_products("olive oil", max_pages=2))
print(json.dumps(results[:5], indent=2))

Method 2: Requesting Costco’s Search Pages Directly

Instead of driving a full browser, you can request Costco’s search pages with plain HTTP and parse the returned HTML. This is faster and cheaper than Playwright, but also more likely to be blocked, since no JavaScript runs to satisfy the site’s bot-detection challenges.

import requests
from fake_useragent import UserAgent
import json
import time
import random

class CostcoAPIScraper:
    def __init__(self, proxy_url=None):
        self.session = requests.Session()
        self.ua = UserAgent()
        self.proxy_url = proxy_url

    def _get_headers(self):
        return {
            "User-Agent": self.ua.random,
            "Accept": "application/json, text/plain, */*",
            "Accept-Language": "en-US,en;q=0.9",
            "Referer": "https://www.costco.com/",
            "Origin": "https://www.costco.com",
            "Connection": "keep-alive",
        }

    def _get_proxies(self):
        if self.proxy_url:
            return {"http": self.proxy_url, "https": self.proxy_url}
        return None

    def search_products(self, query, page_size=24, page=1):
        """Search Costco via internal API."""
        url = "https://www.costco.com/CatalogSearch"
        params = {
            "keyword": query,
            "pageSize": page_size,
            "currentPage": page,
            "responseGroup": "Large",
        }

        try:
            response = self.session.get(
                url,
                params=params,
                headers=self._get_headers(),
                proxies=self._get_proxies(),
                timeout=30
            )
            response.raise_for_status()

            # Parse the HTML response for product data
            from bs4 import BeautifulSoup
            from urllib.parse import urljoin

            soup = BeautifulSoup(response.text, "lxml")

            products = []
            cards = soup.select(".product-tile, .product")
            for card in cards:
                title = card.select_one(".description a")
                price = card.select_one(".price")
                products.append({
                    "title": title.get_text(strip=True) if title else None,
                    "price": price.get_text(strip=True) if price else None,
                    # urljoin handles both relative and absolute hrefs safely
                    "url": urljoin("https://www.costco.com", title["href"]) if title and title.get("href") else None
                })

            return products

        except requests.RequestException as e:
            print(f"Error: {e}")
            return []


# Usage
scraper = CostcoAPIScraper(proxy_url="http://user:pass@proxy:port")
results = scraper.search_products("kirkland supplements")
print(json.dumps(results[:5], indent=2))

Handling Costco Anti-Bot Protections

1. Bot Detection (Akamai/PerimeterX)

Costco uses advanced bot detection that analyzes browser fingerprints, mouse behavior, and request patterns. Use stealth browser configurations and residential proxies.
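A common starting point is to launch Chromium without the automation flag and patch navigator.webdriver before any page script runs. These are widely shared stealth tweaks, not a guaranteed bypass; a sketch of the launch options to pass to Playwright:

```python
def stealth_launch_options(proxy=None):
    """Build Playwright chromium.launch() kwargs with common stealth tweaks."""
    options = {
        "headless": True,
        # Stops Chromium advertising itself as automation-controlled
        "args": ["--disable-blink-features=AutomationControlled"],
    }
    if proxy:
        options["proxy"] = proxy
    return options

# Script to register with context.add_init_script(), hiding the
# navigator.webdriver flag that headless browsers expose by default
STEALTH_INIT_SCRIPT = (
    "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
)
```

In the scraper above, these would be used as `p.chromium.launch(**stealth_launch_options(proxy))` followed by `await context.add_init_script(STEALTH_INIT_SCRIPT)`.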

2. Membership Gating

Some Costco prices and products are only visible to members. Use authenticated sessions with cookies from a valid login for full access.
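One way to reuse an authenticated session is to export cookies from a logged-in browser (e.g. via Playwright's `context.cookies()`) and load them back before scraping. A minimal loader, assuming a JSON list of cookie dicts on disk:

```python
import json

def load_cookies(path):
    """Load cookies exported from an authenticated session.

    Expects the JSON list format produced by context.cookies() in
    Playwright (name/value/domain/path keys), and strips any fields
    that context.add_cookies() does not accept.
    """
    with open(path) as f:
        cookies = json.load(f)
    allowed = {"name", "value", "domain", "path", "expires",
               "httpOnly", "secure", "sameSite"}
    return [{k: v for k, v in c.items() if k in allowed} for c in cookies]

# In the scraper, after creating the browser context:
#   await context.add_cookies(load_cookies("costco_cookies.json"))
```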

3. Rate Limiting

Costco limits request frequency aggressively. Keep delays at 3-7 seconds between requests and rotate proxies every 10-15 requests.
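The delay-and-rotate policy above can be sketched as a small helper; the proxy URLs are placeholders:

```python
import random
import time
from itertools import cycle

class ProxyRotator:
    """Cycle through a proxy pool every N requests, with jittered delays."""

    def __init__(self, proxy_urls, rotate_every=12):
        self._pool = cycle(proxy_urls)
        self._rotate_every = rotate_every
        self._count = 0
        self._current = next(self._pool)

    def get_proxy(self):
        """Return the active proxy, advancing the pool every `rotate_every` calls."""
        if self._count and self._count % self._rotate_every == 0:
            self._current = next(self._pool)
        self._count += 1
        return self._current

    @staticmethod
    def polite_delay(low=3.0, high=7.0):
        """Sleep for a random interval to mimic human pacing."""
        time.sleep(random.uniform(low, high))
```

Calling `get_proxy()` before each request and `polite_delay()` after it enforces both rules from this section in one place.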

4. Geographic Restrictions

Costco content varies by warehouse location. Set your ZIP code via cookies to get accurate regional pricing.
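Seeding the ZIP code can be done by adding a cookie to the browser context before navigating. The cookie name below is an assumption; inspect a real session in your browser's DevTools to confirm the key Costco currently uses:

```python
def zip_code_cookie(zip_code, name="invCheckPostalCode"):
    """Build a cookie dict for Playwright's context.add_cookies().

    `name` is an assumed cookie key -- verify it against a live
    Costco session before relying on it.
    """
    return {
        "name": name,
        "value": zip_code,
        "domain": ".costco.com",
        "path": "/",
    }

# Usage, before page.goto():
#   await context.add_cookies([zip_code_cookie("98101")])
```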

Proxy Recommendations for Costco

Proxy Type        Success Rate   Best For
US Residential    75-85%         General product scraping
Mobile Proxies    85-95%         Bypassing bot detection
ISP Proxies       70-80%         Price monitoring
Datacenter        10-20%         Not recommended

US residential proxies are essential for Costco scraping. The site aggressively blocks datacenter IPs and non-US traffic.

Legal Considerations

  1. Terms of Service: Costco’s ToS prohibits automated scraping.
  2. Member-Only Data: Scraping member-only pricing may have additional legal implications.
  3. Rate Limiting: Respect server capacity and implement proper delays.
  4. Commercial Use: Get legal counsel before using scraped data commercially.

See our web scraping compliance guide for details.

Frequently Asked Questions

Does Costco have a public API?

No. Costco does not offer a public API for product data. Web scraping with browser-based tools like Playwright is the primary extraction method.

Can I scrape Costco without membership?

You can scrape publicly visible product listings without membership. However, member-only pricing and certain product categories require authenticated access.

Why does Costco block my scraper so quickly?

Costco uses PerimeterX/Akamai bot detection that checks browser fingerprints, JavaScript execution, and behavioral patterns. Use stealth Playwright configurations with residential proxies and human-like delays.

How often do Costco prices change?

Costco prices change less frequently than competitors like Amazon. Weekly monitoring is typically sufficient for most pricing intelligence use cases.
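For weekly monitoring, a simple diff between two price snapshots is usually enough; a sketch, assuming snapshots keyed by item number:

```python
def price_changes(old_snapshot, new_snapshot):
    """Compare two {item_number: price} snapshots; report changed and new items."""
    changed, added = [], []
    for item, new_price in new_snapshot.items():
        old_price = old_snapshot.get(item)
        if old_price is None:
            added.append(item)          # item not seen last week
        elif old_price != new_price:
            changed.append({"item": item, "old": old_price, "new": new_price})
    return {"changed": changed, "added": added}
```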

Advanced Techniques

Handling Pagination

Most websites paginate their results. Implement robust pagination handling:

import time
import random

def scrape_all_pages(scraper, base_url, max_pages=20):
    all_data = []
    for page in range(1, max_pages + 1):
        url = f"{base_url}?page={page}"
        results = scraper.search(url)
        if not results:
            break
        all_data.extend(results)
        print(f"Page {page}: {len(results)} items (total: {len(all_data)})")
        time.sleep(random.uniform(2, 5))
    return all_data

Data Validation and Cleaning

Always validate scraped data before storage:

def validate_data(item):
    required_fields = ["title", "url"]
    for field in required_fields:
        if not item.get(field):
            return False
    return True

import html
import re

def clean_text(text):
    if not text:
        return None
    # Collapse runs of whitespace, then decode HTML entities
    text = re.sub(r'\s+', ' ', text).strip()
    return html.unescape(text)

# Apply to results
cleaned = [item for item in results if validate_data(item)]
for item in cleaned:
    item["title"] = clean_text(item.get("title"))

Monitoring and Alerting

Build monitoring into your scraping pipeline:

import logging
from datetime import datetime

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

class ScrapingMonitor:
    def __init__(self):
        self.start_time = datetime.now()
        self.requests = 0
        self.errors = 0
        self.items = 0

    def log_request(self, success=True):
        self.requests += 1
        if not success:
            self.errors += 1
        if self.requests % 50 == 0:
            elapsed = (datetime.now() - self.start_time).total_seconds()
            rate = self.requests / max(elapsed, 1) * 60
            logger.info(f"Requests: {self.requests}, Errors: {self.errors}, "
                       f"Items: {self.items}, Rate: {rate:.1f}/min")

    def log_item(self, count=1):
        self.items += count

Error Handling and Retry Logic

Implement robust error handling:

import time
from requests.exceptions import RequestException

def retry_request(func, max_retries=3, base_delay=5):
    for attempt in range(max_retries):
        try:
            return func()
        except RequestException as e:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed: {e}. Retrying in {delay}s...")
            time.sleep(delay)
    return None

Data Storage Options

Choose the right storage for your scraping volume:

import json
import csv
import sqlite3

class DataStorage:
    def __init__(self, db_path="scraped_data.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute('''CREATE TABLE IF NOT EXISTS items
            (id TEXT PRIMARY KEY, title TEXT, url TEXT, data JSON, scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP)''')

    def save(self, item):
        self.conn.execute(
            "INSERT OR REPLACE INTO items (id, title, url, data) VALUES (?, ?, ?, ?)",
            (item.get("id"), item.get("title"), item.get("url"), json.dumps(item))
        )
        self.conn.commit()

    def export_json(self, output_path):
        cursor = self.conn.execute("SELECT data FROM items")
        items = [json.loads(row[0]) for row in cursor.fetchall()]
        with open(output_path, "w") as f:
            json.dump(items, f, indent=2)

    def export_csv(self, output_path):
        cursor = self.conn.execute("SELECT * FROM items")
        rows = cursor.fetchall()
        with open(output_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["id", "title", "url", "data", "scraped_at"])
            writer.writerows(rows)

General Web Scraping FAQs

How often should I scrape data?

The optimal frequency depends on how often the source data changes. For real-time data (stock prices, news), scrape every few minutes. For product listings, daily or weekly is usually sufficient. For reviews, weekly scraping captures new feedback without excessive load.

What happens if my IP gets blocked?

If you receive 403 or 429 status codes, your IP is likely blocked. Switch to a different proxy, implement exponential backoff, and slow your request rate. Rotating residential proxies automatically switch IPs to prevent blocks.

Should I use headless browsers or HTTP requests?

Use HTTP requests (with BeautifulSoup or similar) whenever possible; they are faster and consume fewer resources. Switch to headless browsers (Selenium, Playwright) only when the data you need requires JavaScript rendering.

How do I handle CAPTCHAs?

CAPTCHAs indicate aggressive bot detection. To minimize them: use residential or mobile proxies, implement realistic delays, rotate user agents, and maintain consistent session behavior. For persistent CAPTCHAs, consider CAPTCHA-solving services as a last resort.

Can I scrape data commercially?

The legality of commercial scraping depends on the platform’s ToS, the type of data collected, and your jurisdiction. Public data is generally more permissible, but always consult legal counsel for commercial use cases. See our compliance guide.

Conclusion

Costco’s strong anti-bot protections make it one of the more challenging retail sites to scrape. Playwright with stealth configurations and US residential proxies provides the most reliable approach. Focus on JSON-LD extraction for structured product data and implement careful rate limiting.

For more retail scraping guides, visit our e-commerce proxy guide and proxy comparison tools.

