How to Scrape AliExpress Product Data

AliExpress is one of the world’s largest online retail marketplaces, owned by Alibaba Group and connecting international buyers with Chinese manufacturers and sellers. With over 150 million active buyers and millions of product listings, AliExpress is a goldmine for dropshippers, price comparison tools, and e-commerce researchers.

This guide walks you through scraping AliExpress product data with Python, handling their robust anti-bot systems, and building a scalable data extraction pipeline.

What Data Can You Extract from AliExpress?

AliExpress product pages contain rich, structured data:

  • Product titles and descriptions
  • Pricing (with bulk discounts and flash deals)
  • Seller information (store name, rating, years on platform)
  • Product ratings and review counts
  • Shipping options and costs by destination
  • Product specifications and attributes
  • Order count and popularity metrics
  • Product images and videos
  • Variation details (colors, sizes, styles)

Example JSON Output

{
  "product_id": "1005006789012345",
  "title": "Wireless Bluetooth Earbuds TWS Noise Canceling",
  "price": {
    "current": 15.99,
    "original": 39.99,
    "discount_percentage": 60,
    "currency": "USD"
  },
  "rating": 4.7,
  "review_count": 8432,
  "orders": 25600,
  "seller": {
    "store_name": "TechGadget Official Store",
    "store_rating": 96.2,
    "followers": 45000,
    "years_on_platform": 5
  },
  "shipping": {
    "free_shipping": true,
    "estimated_delivery": "15-25 days",
    "ship_from": "China"
  },
  "specifications": {
    "Brand": "Generic",
    "Bluetooth Version": "5.3",
    "Battery Life": "6 hours",
    "Waterproof Rating": "IPX5"
  },
  "categories": ["Consumer Electronics", "Earphones & Headphones", "TWS Earbuds"],
  "url": "https://www.aliexpress.com/item/1005006789012345.html"
}

Prerequisites

Install the required Python packages:

pip install requests beautifulsoup4 selenium webdriver-manager fake-useragent lxml

AliExpress has strong anti-bot protections, so rotating residential proxies are essential for any meaningful scraping operation.
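One simple way to rotate proxies is to cycle through a pool and hand each request the next endpoint. The sketch below is a minimal illustration; the proxy URLs are placeholders, not real endpoints:

```python
import itertools

class ProxyRotator:
    """Cycle through a pool of proxy URLs, one per request."""

    def __init__(self, proxy_urls):
        self._pool = itertools.cycle(proxy_urls)

    def next_proxies(self):
        # Returns the dict shape expected by requests' `proxies=` argument
        proxy = next(self._pool)
        return {"http": proxy, "https": proxy}

# Placeholder endpoints for illustration only
rotator = ProxyRotator([
    "http://user:pass@proxy1:8080",
    "http://user:pass@proxy2:8080",
])
```

Each call to `next_proxies()` advances the cycle, so consecutive requests leave through different exit IPs.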

Method 1: Scraping with Requests and BeautifulSoup

AliExpress renders significant content via JavaScript, but search result metadata can be captured from embedded JSON data.

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
import json
import re
import time
import random

class AliExpressScraper:
    def __init__(self, proxy_url=None):
        self.session = requests.Session()
        self.ua = UserAgent()
        self.proxy_url = proxy_url
        self.base_url = "https://www.aliexpress.com"

    def _get_headers(self):
        return {
            "User-Agent": self.ua.random,
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept-Encoding": "gzip, deflate, br",
            "Referer": "https://www.aliexpress.com/",
            "Connection": "keep-alive",
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-Mode": "navigate",
        }

    def _get_proxies(self):
        if self.proxy_url:
            return {"http": self.proxy_url, "https": self.proxy_url}
        return None

    def search_products(self, query, max_pages=5):
        """Scrape AliExpress search results."""
        all_products = []

        for page in range(1, max_pages + 1):
            url = f"{self.base_url}/wholesale?SearchText={query.replace(' ', '+')}&page={page}"

            try:
                response = self.session.get(
                    url,
                    headers=self._get_headers(),
                    proxies=self._get_proxies(),
                    timeout=30
                )
                response.raise_for_status()

                products = self._extract_search_data(response.text)
                all_products.extend(products)
                print(f"Page {page}: Found {len(products)} products")

                time.sleep(random.uniform(3, 7))

            except requests.RequestException as e:
                print(f"Error on page {page}: {e}")
                continue

        return all_products

    def _extract_search_data(self, html):
        """Extract product data from embedded JavaScript."""
        products = []

        # AliExpress often embeds product data in script tags
        pattern = r'_init_data_\s*=\s*{\s*data:\s*({.*?})\s*}'
        match = re.search(pattern, html, re.DOTALL)

        if match:
            try:
                data = json.loads(match.group(1))
                items = data.get("data", {}).get("root", {}).get("fields", {}).get("mods", {}).get("itemList", {}).get("content", [])

                for item in items:
                    product = {
                        "product_id": item.get("productId"),
                        "title": item.get("title", {}).get("displayTitle"),
                        "price": item.get("prices", {}).get("salePrice", {}).get("minPrice"),
                        "original_price": item.get("prices", {}).get("originalPrice", {}).get("minPrice"),
                        "rating": item.get("evaluation", {}).get("starRating"),
                        "orders": item.get("trade", {}).get("tradeDesc"),
                        "store_name": item.get("store", {}).get("storeName"),
                        "url": f"https://www.aliexpress.com/item/{item.get('productId')}.html",
                        "image": item.get("image", {}).get("imgUrl"),
                        "free_shipping": item.get("logistics", {}).get("freeShipping", False),
                    }
                    products.append(product)

            except (json.JSONDecodeError, KeyError) as e:
                print(f"Error parsing JSON data: {e}")

        # Fallback: parse HTML directly
        if not products:
            soup = BeautifulSoup(html, "lxml")
            products = self._parse_html_search(soup)

        return products

    def _parse_html_search(self, soup):
        """Fallback HTML parsing for search results."""
        products = []
        cards = soup.select("div[class*='product-card']")

        for card in cards:
            try:
                title_elem = card.select_one("h3, [class*='title']")
                price_elem = card.select_one("[class*='price']")
                link_elem = card.select_one("a[href*='/item/']")

                product = {
                    "title": title_elem.get_text(strip=True) if title_elem else None,
                    "price": price_elem.get_text(strip=True) if price_elem else None,
                    "url": link_elem["href"] if link_elem else None,
                }
                products.append(product)
            except Exception:
                continue

        return products

    def scrape_product_detail(self, product_id):
        """Scrape a single product page for detailed data."""
        url = f"{self.base_url}/item/{product_id}.html"

        try:
            response = self.session.get(
                url,
                headers=self._get_headers(),
                proxies=self._get_proxies(),
                timeout=30
            )
            response.raise_for_status()

            soup = BeautifulSoup(response.text, "lxml")

            # Try to extract structured data
            script_tags = soup.find_all("script", type="application/ld+json")
            for script in script_tags:
                try:
                    data = json.loads(script.string or "")
                    if isinstance(data, dict) and data.get("@type") == "Product":
                        return {
                            "title": data.get("name"),
                            "description": data.get("description"),
                            "price": data.get("offers", {}).get("price"),
                            "currency": data.get("offers", {}).get("priceCurrency"),
                            "rating": data.get("aggregateRating", {}).get("ratingValue"),
                            "review_count": data.get("aggregateRating", {}).get("reviewCount"),
                            "image": data.get("image"),
                        }
                except json.JSONDecodeError:
                    continue

            # Extract from embedded page data
            return self._extract_product_page_data(response.text)

        except requests.RequestException as e:
            print(f"Error scraping product {product_id}: {e}")
            return None

    def _extract_product_page_data(self, html):
        """Extract product details from page scripts."""
        pattern = r'window\.runParams\s*=\s*({.*?});'
        match = re.search(pattern, html, re.DOTALL)

        if match:
            try:
                data = json.loads(match.group(1))
                action_data = data.get("data", {}).get("actionModule", {})
                title_data = data.get("data", {}).get("titleModule", {})
                price_data = data.get("data", {}).get("priceModule", {})

                return {
                    "title": title_data.get("subject"),
                    "product_id": action_data.get("productId"),
                    "price": price_data.get("formattedActivityPrice"),
                    "original_price": price_data.get("formattedPrice"),
                }
            except (json.JSONDecodeError, KeyError):
                pass

        return None


# Usage
if __name__ == "__main__":
    scraper = AliExpressScraper(proxy_url="http://user:pass@proxy:port")
    results = scraper.search_products("wireless earbuds", max_pages=3)

    for product in results[:3]:
        if product.get("product_id"):
            details = scraper.scrape_product_detail(product["product_id"])
            print(json.dumps(details, indent=2))
            time.sleep(random.uniform(4, 8))
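The HTML fallback above returns raw price text such as "US $15.99". A small helper to normalize those strings into floats can make downstream comparison easier; display formats vary by locale, so treat the regex below as an assumption rather than a complete parser:

```python
import re

def parse_price(price_text):
    """Extract a float from displayed price text like 'US $15.99'.

    Returns None when no numeric amount can be found.
    """
    if not price_text:
        return None
    # Drop thousands separators, then grab the first decimal number
    cleaned = price_text.replace(",", "")
    match = re.search(r"(\d+(?:\.\d+)?)", cleaned)
    return float(match.group(1)) if match else None
```

For example, `parse_price("US $15.99")` yields `15.99`, while strings with no digits yield `None`.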

Method 2: Scraping AliExpress with Selenium

AliExpress relies heavily on JavaScript rendering, so Selenium is often necessary for full data extraction.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
import json
import time
import random

class AliExpressSeleniumScraper:
    def __init__(self, proxy=None):
        chrome_options = Options()
        chrome_options.add_argument("--headless=new")
        chrome_options.add_argument("--no-sandbox")
        chrome_options.add_argument("--disable-blink-features=AutomationControlled")
        chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])

        if proxy:
            chrome_options.add_argument(f"--proxy-server={proxy}")

        self.driver = webdriver.Chrome(options=chrome_options)
        self.driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
            "source": """
                Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
                window.chrome = {runtime: {}};
            """
        })

    def search_and_scrape(self, query, max_pages=3):
        """Search AliExpress and scrape product data."""
        products = []

        for page in range(1, max_pages + 1):
            url = f"https://www.aliexpress.com/wholesale?SearchText={query.replace(' ', '+')}&page={page}"
            self.driver.get(url)

            # Wait for products to load
            try:
                WebDriverWait(self.driver, 20).until(
                    EC.presence_of_element_located((By.CSS_SELECTOR, "[class*='product-card'], [class*='search-item']"))
                )
            except Exception:
                print(f"Timeout on page {page}")
                continue

            # Scroll to load all products
            self._lazy_scroll()

            # Extract product data via JavaScript
            page_products = self.driver.execute_script("""
                const products = [];
                const cards = document.querySelectorAll('[class*="product-card"], [class*="search-item"]');
                cards.forEach(card => {
                    const title = card.querySelector('h3, [class*="title"]');
                    const price = card.querySelector('[class*="price"]');
                    const link = card.querySelector('a[href*="/item/"]');
                    const rating = card.querySelector('[class*="star"]');

                    products.push({
                        title: title ? title.innerText.trim() : null,
                        price: price ? price.innerText.trim() : null,
                        url: link ? link.href : null,
                        rating: rating ? rating.innerText.trim() : null
                    });
                });
                return products;
            """)

            products.extend(page_products)
            print(f"Page {page}: {len(page_products)} products")
            time.sleep(random.uniform(3, 6))

        return products

    def _lazy_scroll(self):
        """Gradually scroll the page to trigger lazy-loaded content."""
        last_height = self.driver.execute_script("return document.body.scrollHeight")
        while True:
            self.driver.execute_script("window.scrollBy(0, 800);")
            time.sleep(1)
            new_height = self.driver.execute_script("return document.body.scrollHeight")
            if new_height == last_height:
                break
            last_height = new_height

    def close(self):
        self.driver.quit()
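Paginated search results often repeat items across pages. A small standalone helper (not part of the scraper class above, just an illustration) can de-duplicate by URL before further processing:

```python
def dedupe_by_url(products):
    """Remove duplicate product dicts, keyed on their 'url' field.

    Items without a URL are kept as-is, since they cannot be compared.
    """
    seen = set()
    unique = []
    for product in products:
        url = product.get("url")
        if url and url in seen:
            continue  # already collected on an earlier page
        if url:
            seen.add(url)
        unique.append(product)
    return unique
```

Running the scraper's combined page results through `dedupe_by_url` before saving avoids inflated product counts.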

Handling AliExpress Anti-Bot Protections

AliExpress has some of the most aggressive anti-bot systems among e-commerce sites:

1. Akamai Bot Manager

AliExpress uses Akamai’s bot detection, which analyzes browser fingerprints, mouse movements, and behavioral patterns. Countermeasures include:

  • Use undetected-chromedriver for Selenium-based scraping
  • Simulate human-like mouse movements and scrolling
  • Use residential proxies that look like real users

pip install undetected-chromedriver

import undetected_chromedriver as uc

driver = uc.Chrome(headless=True)
driver.get("https://www.aliexpress.com/")

2. CAPTCHA and Slide Verification

AliExpress frequently presents slide CAPTCHAs. Strategies:

  • Reduce request frequency to avoid triggering
  • Use mobile proxies for cleaner IP reputation
  • Implement CAPTCHA-solving services as a fallback
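Before falling back to a solving service, it helps to detect that a challenge page was served at all. The check below is a heuristic sketch; the markers are assumptions based on strings that commonly appear in AliExpress challenge pages and may change at any time:

```python
# Heuristic markers (assumptions, not a stable contract with AliExpress)
CAPTCHA_MARKERS = (
    "captcha",           # generic challenge pages
    "punish",            # AliExpress "punish" redirect URLs
    "slide to verify",   # slide-verification widget text
)

def looks_like_captcha(html):
    """Heuristically detect a challenge page so the scraper can back off."""
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)
```

When this returns True, pausing the scraper and rotating to a fresh IP is usually cheaper than solving the challenge.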

3. Cookie and Session Management

Maintain proper cookies to appear as a returning visitor:

# Save and load cookies between sessions
import pickle

# Save cookies
pickle.dump(driver.get_cookies(), open("aliexpress_cookies.pkl", "wb"))

# Load cookies
cookies = pickle.load(open("aliexpress_cookies.pkl", "rb"))
for cookie in cookies:
    driver.add_cookie(cookie)

Proxy Recommendations for AliExpress

Proxy Type           | Success Rate | Best For
---------------------|--------------|------------------------------
Residential Rotating | 70-80%       | Search results, basic scraping
Mobile Proxies       | 90%+         | High-volume, anti-bot bypass
ISP Proxies          | 75-85%       | Session-based scraping
Datacenter           | 20-30%       | Not recommended

For AliExpress, mobile proxies offer the highest success rates: carrier-grade NAT means each mobile IP is shared by many real users, so blocking one risks blocking legitimate traffic. For budget-conscious projects, rotating residential proxies are a solid alternative.

Legal Considerations

  1. Terms of Service: AliExpress explicitly prohibits scraping in their ToS. Proceed at your own risk.
  2. Data Protection: Chinese data protection laws (PIPL) may apply to seller data. Avoid collecting personal information.
  3. Intellectual Property: Product images and descriptions are protected by copyright.
  4. Rate Limits: Aggressive scraping can be considered a denial-of-service attack. Always use respectful rate limiting.
  5. Commercial Use: Consult legal counsel before using scraped data for commercial purposes.

See our web scraping legal guide for detailed compliance information.

Rate Limiting Best Practices

AliExpress is particularly sensitive to scraping patterns:

  1. Minimum 3-5 second delays between requests
  2. Rotate IPs every 5-10 requests
  3. Rotate user agents on every request
  4. Limit to 200-500 requests per hour per IP
  5. Implement exponential backoff on errors

import random
import time

def smart_rate_limit(request_count, base_delay=3):
    """Adaptive rate limiting based on request count."""
    if request_count % 50 == 0:
        # Take a longer break every 50 requests
        time.sleep(random.uniform(30, 60))
    elif request_count % 10 == 0:
        time.sleep(random.uniform(10, 20))
    else:
        time.sleep(random.uniform(base_delay, base_delay * 2))
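Item 5 above (exponential backoff) is not covered by `smart_rate_limit`, which only spaces out successful requests. A common sketch for backoff on errors doubles the delay per failed attempt and adds full jitter; the base and cap values here are illustrative choices, not tuned figures:

```python
import random

def backoff_delay(attempt, base=3.0, cap=120.0):
    """Exponential backoff with full jitter: ~3s, ~6s, ~12s, ... capped at 2 minutes."""
    delay = min(cap, base * (2 ** attempt))
    # Full jitter spreads retries out so concurrent workers do not re-synchronize
    return random.uniform(0, delay)
```

On each failed attempt you would call `time.sleep(backoff_delay(attempt))` before retrying, resetting `attempt` to zero after a success.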

Conclusion

Scraping AliExpress requires more sophisticated techniques than many other e-commerce sites due to its aggressive anti-bot protections. By combining Selenium with undetected-chromedriver, rotating residential or mobile proxies, and careful rate limiting, you can build a reliable data pipeline.

For best results, use high-quality proxy services optimized for e-commerce scraping. Check out our proxy comparison guides to find the best provider for your AliExpress scraping needs.

