How to Scrape Shopee Product Data
Shopee is the leading e-commerce platform in Southeast Asia, operating in markets including Singapore, Malaysia, Thailand, Indonesia, Vietnam, the Philippines, and Taiwan. With hundreds of millions of active users, Shopee data is invaluable for cross-border sellers, market researchers, and competitive analysts operating in the APAC region.
This guide covers how to scrape Shopee product data with Python, using Shopee's internal APIs, anti-bot bypass techniques, and region-specific considerations.
What Data Can You Extract from Shopee?
Shopee listings contain rich marketplace data:
- Product titles and descriptions
- Pricing (with flash deals, vouchers, and discounts)
- Seller information (shop name, rating, response rate)
- Product ratings and review counts
- Sales volume and historical data
- Category classification
- Product images and videos
- Shipping options by region
- Variation details (color, size, model)
- Stock levels
Example JSON Output
```json
{
  "item_id": 12345678901,
  "shop_id": 98765432,
  "title": "Wireless Bluetooth Earbuds TWS - Premium Quality",
  "price": {
    "min": 5.99,
    "max": 12.99,
    "currency": "SGD",
    "discount_percentage": 45
  },
  "rating": {
    "average": 4.8,
    "count": 15432,
    "stars": {"5": 12000, "4": 2500, "3": 600, "2": 200, "1": 132}
  },
  "sold": 85000,
  "stock": 4523,
  "seller": {
    "shop_name": "TechHub.SG Official",
    "rating": 4.9,
    "response_rate": 98,
    "response_time": "within hours",
    "joined": "2020-03-15",
    "follower_count": 125000
  },
  "categories": ["Mobile & Gadgets", "Earphones & Headphones", "TWS Earbuds"],
  "shipping": {
    "free_shipping": true,
    "estimated_delivery": "2-4 days"
  },
  "variations": [
    {"name": "Color", "options": ["Black", "White", "Blue"]},
    {"name": "Model", "options": ["Standard", "Pro", "Pro Max"]}
  ],
  "country": "SG",
  "url": "https://shopee.sg/product/98765432/12345678901"
}
```
Prerequisites
```shell
pip install requests beautifulsoup4 selenium fake-useragent lxml
```
Shopee uses aggressive anti-bot systems. Residential proxies from Southeast Asian countries are essential for reliable scraping.
Method 1: Shopee Internal API Scraping
Shopee’s frontend communicates with internal APIs that return well-structured JSON. This is the most efficient scraping method.
```python
import requests
from fake_useragent import UserAgent
import json
import time
import random

class ShopeeScraper:
    COUNTRY_DOMAINS = {
        "SG": "shopee.sg",
        "MY": "shopee.com.my",
        "TH": "shopee.co.th",
        "ID": "shopee.co.id",
        "VN": "shopee.vn",
        "PH": "shopee.ph",
        "TW": "shopee.tw",
        "BR": "shopee.com.br",
    }

    def __init__(self, country="SG", proxy_url=None):
        self.session = requests.Session()
        self.ua = UserAgent()
        self.proxy_url = proxy_url
        self.country = country
        self.domain = self.COUNTRY_DOMAINS.get(country, "shopee.sg")
        self.base_url = f"https://{self.domain}"
        self.api_base = f"https://{self.domain}/api/v4"

    def _get_headers(self):
        return {
            "User-Agent": self.ua.random,
            "Accept": "application/json",
            "Accept-Language": "en-US,en;q=0.9",
            "Referer": f"{self.base_url}/",
            "X-Requested-With": "XMLHttpRequest",
            "If-None-Match-": "*",
            "Connection": "keep-alive",
        }

    def _get_proxies(self):
        if self.proxy_url:
            return {"http": self.proxy_url, "https": self.proxy_url}
        return None

    def _init_session(self):
        """Initialize session by visiting the homepage first."""
        self.session.get(
            self.base_url,
            headers={**self._get_headers(), "Accept": "text/html"},
            proxies=self._get_proxies(),
            timeout=30,
        )
        time.sleep(2)

    def search_products(self, keyword, page=0, limit=60):
        """Search Shopee products via the API."""
        url = f"{self.api_base}/search/search_items"
        params = {
            "by": "relevancy",
            "keyword": keyword,
            "limit": limit,
            "newest": page * limit,
            "order": "desc",
            "page_type": "search",
            "scenario": "PAGE_GLOBAL_SEARCH",
            "version": 2,
        }
        try:
            response = self.session.get(
                url,
                params=params,
                headers=self._get_headers(),
                proxies=self._get_proxies(),
                timeout=30,
            )
            response.raise_for_status()
            data = response.json()
            items = data.get("items", [])
            products = []
            for item in items:
                item_data = item.get("item_basic", {})
                product = {
                    "item_id": item_data.get("itemid"),
                    "shop_id": item_data.get("shopid"),
                    "title": item_data.get("name"),
                    "price_min": item_data.get("price_min", 0) / 100000,
                    "price_max": item_data.get("price_max", 0) / 100000,
                    "rating": item_data.get("item_rating", {}).get("rating_star"),
                    "rating_count": item_data.get("item_rating", {}).get("rating_count", []),
                    "sold": item_data.get("sold"),
                    "historical_sold": item_data.get("historical_sold"),
                    "stock": item_data.get("stock"),
                    "image": f"https://cf.shopee.sg/file/{item_data.get('image')}",
                    "shop_location": item_data.get("shop_location"),
                    "url": f"{self.base_url}/product/{item_data.get('shopid')}/{item_data.get('itemid')}",
                }
                products.append(product)
            return products
        except requests.RequestException as e:
            print(f"Search error: {e}")
            return []

    def get_product_detail(self, shop_id, item_id):
        """Get detailed product information."""
        url = f"{self.api_base}/item/get"
        params = {
            "itemid": item_id,
            "shopid": shop_id,
        }
        try:
            response = self.session.get(
                url,
                params=params,
                headers=self._get_headers(),
                proxies=self._get_proxies(),
                timeout=30,
            )
            response.raise_for_status()
            data = response.json().get("data", {})
            product = {
                "item_id": data.get("itemid"),
                "shop_id": data.get("shopid"),
                "title": data.get("name"),
                "description": data.get("description"),
                "price_min": data.get("price_min", 0) / 100000,
                "price_max": data.get("price_max", 0) / 100000,
                "stock": data.get("stock"),
                "sold": data.get("sold"),
                "historical_sold": data.get("historical_sold"),
                "rating": data.get("item_rating", {}).get("rating_star"),
                "categories": [
                    cat.get("display_name") for cat in data.get("categories", [])
                ],
                "attributes": [
                    {"name": attr.get("name"), "value": attr.get("value")}
                    for attr in data.get("attributes", [])
                ],
                "models": [
                    {
                        "name": model.get("name"),
                        "price": model.get("price", 0) / 100000,
                        "stock": model.get("stock"),
                    }
                    for model in data.get("models", [])
                ],
            }
            return product
        except requests.RequestException as e:
            print(f"Product detail error: {e}")
            return None

    def get_shop_info(self, shop_id):
        """Get seller/shop information."""
        url = f"{self.api_base}/shop/get_shop_detail"
        params = {"shopid": shop_id}
        try:
            response = self.session.get(
                url,
                params=params,
                headers=self._get_headers(),
                proxies=self._get_proxies(),
                timeout=30,
            )
            response.raise_for_status()
            data = response.json().get("data", {})
            return {
                "shop_id": data.get("shopid"),
                "shop_name": data.get("name"),
                "rating": data.get("rating_star"),
                "item_count": data.get("item_count"),
                "follower_count": data.get("follower_count"),
                "response_rate": data.get("response_rate"),
                "response_time": data.get("response_time"),
                "country": data.get("country"),
            }
        except requests.RequestException as e:
            print(f"Shop info error: {e}")
            return None

    def get_reviews(self, shop_id, item_id, offset=0, limit=20):
        """Get product reviews."""
        url = f"{self.api_base}/item/get_ratings"
        params = {
            "itemid": item_id,
            "shopid": shop_id,
            "offset": offset,
            "limit": limit,
            "type": 0,  # 0=all, 1-5=star rating
        }
        try:
            response = self.session.get(
                url,
                params=params,
                headers=self._get_headers(),
                proxies=self._get_proxies(),
                timeout=30,
            )
            response.raise_for_status()
            data = response.json().get("data", {})
            reviews = []
            for rating in data.get("ratings", []):
                reviews.append({
                    "rating": rating.get("rating_star"),
                    "comment": rating.get("comment"),
                    "author": rating.get("author_username"),
                    "time": rating.get("ctime"),
                    "variation": rating.get("product_items", [{}])[0].get("model_name") if rating.get("product_items") else None,
                })
            return reviews
        except requests.RequestException as e:
            print(f"Reviews error: {e}")
            return []

# Usage
if __name__ == "__main__":
    scraper = ShopeeScraper(country="SG", proxy_url="http://user:pass@proxy:port")
    scraper._init_session()

    # Search products
    results = scraper.search_products("wireless earbuds", page=0)
    print(f"Found {len(results)} products")

    # Get product details
    for product in results[:3]:
        details = scraper.get_product_detail(product["shop_id"], product["item_id"])
        print(json.dumps(details, indent=2))
        time.sleep(random.uniform(2, 5))
```
Method 2: Selenium for Dynamic Pages
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
import time

class ShopeeSeleniumScraper:
    def __init__(self, country="SG", proxy=None):
        options = Options()
        options.add_argument("--headless=new")
        options.add_argument("--no-sandbox")
        options.add_argument("--disable-blink-features=AutomationControlled")
        if proxy:
            options.add_argument(f"--proxy-server={proxy}")
        self.driver = webdriver.Chrome(options=options)
        domains = {"SG": "shopee.sg", "MY": "shopee.com.my", "ID": "shopee.co.id"}
        self.domain = domains.get(country, "shopee.sg")

    def search_products(self, query):
        url = f"https://{self.domain}/search?keyword={query}"
        self.driver.get(url)
        WebDriverWait(self.driver, 20).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".shopee-search-item-result__item"))
        )
        # Scroll to trigger lazy loading of all products
        for _ in range(5):
            self.driver.execute_script("window.scrollBy(0, 800);")
            time.sleep(1)
        products = self.driver.execute_script("""
            const items = document.querySelectorAll('.shopee-search-item-result__item');
            return Array.from(items).map(item => {
                const title = item.querySelector('[data-sqe="name"]');
                const price = item.querySelector('[class*="price"]');
                const link = item.querySelector('a');
                const sold = item.querySelector('[class*="sold"]');
                return {
                    title: title ? title.innerText.trim() : null,
                    price: price ? price.innerText.trim() : null,
                    url: link ? 'https://""" + self.domain + """' + link.getAttribute('href') : null,
                    sold: sold ? sold.innerText.trim() : null
                };
            });
        """)
        return products

    def close(self):
        self.driver.quit()
```
Handling Shopee’s Anti-Bot Protections
1. Anti-Fraud System
Shopee uses a custom anti-fraud system that tracks:
- Request frequency and patterns
- Browser fingerprinting
- Geographic consistency between IP and domain
```python
# Ensure geographic consistency:
# if scraping shopee.sg, use Singapore-based proxies
proxy_sg = "http://user:pass@sg-residential.proxy:port"
```
2. SPC Token / CSRF Protection
Shopee requires specific tokens for API access:
```python
def get_csrf_token(session, base_url):
    """Get a CSRF token from Shopee's cookies."""
    session.get(base_url)
    cookies = session.cookies.get_dict()
    return cookies.get("csrftoken") or cookies.get("SPC_CDS")
```
3. Rate Limiting
Shopee limits API requests per IP. Use rotating proxies:
```python
import itertools

class ProxyRotator:
    def __init__(self, proxies):
        self.proxy_cycle = itertools.cycle(proxies)

    def get_next(self):
        proxy = next(self.proxy_cycle)
        return {"http": proxy, "https": proxy}
```
Proxy Recommendations for Shopee
| Proxy Type | Success Rate | Best For |
|---|---|---|
| SEA Residential | 85-90% | All Shopee domains |
| Mobile (SEA) | 90-95% | High-volume scraping |
| ISP (Country-specific) | 75-85% | Single-country focus |
| Datacenter | 20-30% | Not recommended |
Geographic matching is critical for Shopee. Use Southeast Asian proxies that match the Shopee domain you’re scraping. For example, use Singapore IPs for shopee.sg and Indonesian IPs for shopee.co.id.
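As a sketch of this matching, the helper below picks a proxy pool keyed by the domain's country. The proxy URLs and pool structure are placeholders — substitute whatever country-targeted endpoints your provider exposes:

```python
# Map each Shopee domain to the country whose IPs it expects.
SHOPEE_DOMAIN_COUNTRY = {
    "shopee.sg": "SG",
    "shopee.com.my": "MY",
    "shopee.co.th": "TH",
    "shopee.co.id": "ID",
    "shopee.vn": "VN",
    "shopee.ph": "PH",
    "shopee.tw": "TW",
}

def proxy_for_domain(domain, proxy_pools):
    """Return a requests-style proxy dict matched to the domain's country.

    `proxy_pools` maps country codes to proxy URLs (hypothetical format).
    """
    country = SHOPEE_DOMAIN_COUNTRY.get(domain)
    if country is None or country not in proxy_pools:
        raise ValueError(f"No matching proxy pool for {domain}")
    proxy_url = proxy_pools[country]
    return {"http": proxy_url, "https": proxy_url}

# Usage: a Singapore pool for shopee.sg (placeholder URL)
pools = {"SG": "http://user:pass@sg.proxy.example:8000"}
proxies = proxy_for_domain("shopee.sg", pools)
```

Failing loudly when no matching pool exists is deliberate: silently falling back to a mismatched country is exactly what trips Shopee's geographic checks.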
Legal Considerations
- Terms of Service: Shopee prohibits automated data collection across all regional platforms.
- Data Privacy: Each Shopee country operates under different privacy laws (PDPA in Singapore, PDPA in Thailand, PDP Law in Indonesia).
- Cross-Border Data: Moving scraped data across borders may violate data localization requirements.
- Seller Data: Seller personal information is protected under regional privacy laws.
- Competition Law: Using scraped data for anti-competitive practices is illegal in all Shopee markets.
Our PDPA compliance guide covers Southeast Asian data protection requirements.
Rate Limiting Best Practices
- API requests: Maximum 1-2 per second per IP
- Search pages: 1 request every 3-5 seconds
- Product pages: 1 request every 4-6 seconds
- Rotate IPs: Every 20-30 requests
- Country-specific limits: Some Shopee domains are more aggressive than others (ID and VN are strictest)
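The limits above can be enforced with a small per-endpoint throttle. This is a minimal sketch — the endpoint names and delay ranges mirror the guidance in this section and should be tuned per Shopee domain:

```python
import random
import time

# Per-endpoint delay ranges in seconds, taken from the limits above.
DELAY_RANGES = {
    "api": (0.5, 1.0),       # 1-2 API requests per second
    "search": (3.0, 5.0),    # 1 search page every 3-5 seconds
    "product": (4.0, 6.0),   # 1 product page every 4-6 seconds
}

class Throttle:
    def __init__(self):
        self.last_request = {}  # endpoint -> monotonic timestamp

    def wait(self, endpoint):
        """Block until enough time has passed since the last request
        to this endpoint, using a randomized delay from its range."""
        low, high = DELAY_RANGES.get(endpoint, (3.0, 5.0))
        delay = random.uniform(low, high)
        elapsed = time.monotonic() - self.last_request.get(endpoint, 0.0)
        if elapsed < delay:
            time.sleep(delay - elapsed)
        self.last_request[endpoint] = time.monotonic()
```

Calling `throttle.wait("search")` before each search request keeps that endpoint inside its budget without slowing down unrelated endpoints; the randomized delay also avoids the fixed-interval pattern that anti-fraud systems flag.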
Advanced Techniques
Handling Pagination
Most websites paginate their results. Implement robust pagination handling:
```python
import random
import time

def scrape_all_pages(scraper, base_url, max_pages=20):
    all_data = []
    for page in range(1, max_pages + 1):
        url = f"{base_url}?page={page}"
        results = scraper.search(url)
        if not results:
            break
        all_data.extend(results)
        print(f"Page {page}: {len(results)} items (total: {len(all_data)})")
        time.sleep(random.uniform(2, 5))
    return all_data
```
Data Validation and Cleaning
Always validate scraped data before storage:
```python
import html
import re

def validate_data(item):
    required_fields = ["title", "url"]
    for field in required_fields:
        if not item.get(field):
            return False
    return True

def clean_text(text):
    if not text:
        return None
    # Collapse extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    # Decode HTML entities
    text = html.unescape(text)
    return text

# Apply to results
cleaned = [item for item in results if validate_data(item)]
for item in cleaned:
    item["title"] = clean_text(item.get("title"))
```
Monitoring and Alerting
Build monitoring into your scraping pipeline:
```python
import logging
from datetime import datetime

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

class ScrapingMonitor:
    def __init__(self):
        self.start_time = datetime.now()
        self.requests = 0
        self.errors = 0
        self.items = 0

    def log_request(self, success=True):
        self.requests += 1
        if not success:
            self.errors += 1
        if self.requests % 50 == 0:
            elapsed = (datetime.now() - self.start_time).seconds
            rate = self.requests / max(elapsed, 1) * 60
            logger.info(f"Requests: {self.requests}, Errors: {self.errors}, "
                        f"Items: {self.items}, Rate: {rate:.1f}/min")

    def log_item(self, count=1):
        self.items += count
```
Error Handling and Retry Logic
Implement robust error handling:
```python
import time
from requests.exceptions import RequestException

def retry_request(func, max_retries=3, base_delay=5):
    for attempt in range(max_retries):
        try:
            return func()
        except RequestException as e:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed: {e}. Retrying in {delay}s...")
            time.sleep(delay)
    return None
```
Data Storage Options
Choose the right storage for your scraping volume:
```python
import csv
import json
import sqlite3

class DataStorage:
    def __init__(self, db_path="scraped_data.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute('''CREATE TABLE IF NOT EXISTS items
            (id TEXT PRIMARY KEY, title TEXT, url TEXT, data JSON,
             scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP)''')

    def save(self, item):
        self.conn.execute(
            "INSERT OR REPLACE INTO items (id, title, url, data) VALUES (?, ?, ?, ?)",
            (item.get("id"), item.get("title"), item.get("url"), json.dumps(item))
        )
        self.conn.commit()

    def export_json(self, output_path):
        cursor = self.conn.execute("SELECT data FROM items")
        items = [json.loads(row[0]) for row in cursor.fetchall()]
        with open(output_path, "w") as f:
            json.dump(items, f, indent=2)

    def export_csv(self, output_path):
        cursor = self.conn.execute("SELECT * FROM items")
        rows = cursor.fetchall()
        with open(output_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["id", "title", "url", "data", "scraped_at"])
            writer.writerows(rows)
```
Frequently Asked Questions
How often should I scrape data?
The optimal frequency depends on how often the source data changes. For real-time data (stock prices, news), scrape every few minutes. For product listings, daily or weekly is usually sufficient. For reviews, weekly scraping captures new feedback without excessive load.
What happens if my IP gets blocked?
If you receive 403 or 429 status codes, your IP is likely blocked. Switch to a different proxy, implement exponential backoff, and slow your request rate. Rotating residential proxies automatically switch IPs to prevent blocks.
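As a sketch of that recovery pattern, the helper below treats 403/429 as block signals, rotates to the next proxy, and backs off exponentially. The request function is injected (any HTTP client that returns a status code works), so the names here are illustrative:

```python
import time

# Status codes that indicate the current IP is blocked or rate-limited.
BLOCK_CODES = {403, 429}

def fetch_with_rotation(do_request, proxies, max_attempts=4, base_delay=5):
    """Try the request through successive proxies, backing off
    exponentially (base_delay * 2^attempt) after each block signal.

    `do_request(proxy)` must return a (status_code, body) tuple.
    """
    for attempt in range(max_attempts):
        proxy = proxies[attempt % len(proxies)]
        status, body = do_request(proxy)
        if status not in BLOCK_CODES:
            return body
        delay = base_delay * (2 ** attempt)
        print(f"Blocked ({status}) via {proxy}; retrying in {delay}s")
        time.sleep(delay)
    return None
```

Rotating the proxy and backing off together matters: backing off alone keeps hammering a burned IP, while rotating alone carries the same aggressive request pattern to a fresh one.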
Should I use headless browsers or HTTP requests?
Use HTTP requests (with BeautifulSoup or similar) whenever possible — they are faster and use fewer resources. Switch to headless browsers (Selenium, Playwright) only when JavaScript rendering is required for the data you need.
How do I handle CAPTCHAs?
CAPTCHAs indicate aggressive bot detection. To minimize them: use residential or mobile proxies, implement realistic delays, rotate user agents, and maintain consistent session behavior. For persistent CAPTCHAs, consider CAPTCHA-solving services as a last resort.
Can I scrape data commercially?
The legality of commercial scraping depends on the platform’s ToS, the type of data collected, and your jurisdiction. Public data is generally more permissible, but always consult legal counsel for commercial use cases. See our compliance guide.
Conclusion
Shopee’s internal APIs make it one of the more accessible marketplaces to scrape, but geographic matching and anti-fraud systems require careful handling. The API approach provides clean, structured data without the overhead of HTML parsing.
For reliable Shopee scraping, pair your setup with country-specific residential proxies and respect rate limits. Visit our e-commerce proxy guide for more APAC marketplace scraping strategies.
Related Reading
- How to Scrape AliExpress Product Data
- How to Scrape Amazon Product Reviews in 2026
- aiohttp + BeautifulSoup: Async Python Scraping
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)
- API vs Web Scraping: When You Need Proxies (and When You Don’t)
- ASEAN Data Protection Laws: A Web Scraping Compliance Matrix