How to Scrape Used Car Listings from Carousell and Mudah

Carousell and Mudah are two of the largest online classifieds platforms in Southeast Asia. Together, they host millions of used car listings across Singapore, Malaysia, the Philippines, Indonesia, and beyond. For automotive researchers, dealers, and data-driven businesses, these platforms represent an invaluable source of real-time market data.

This guide walks you through the process of scraping used car listings from both platforms, covering technical setup, proxy configuration, data extraction strategies, and common challenges you will encounter along the way.

Why Scrape Carousell and Mudah for Vehicle Data

Market Intelligence

Used car pricing in Southeast Asia is highly dynamic. Prices fluctuate with supply, demand, government policies such as Singapore's Certificate of Entitlement (COE) system, currency movements, and seasonal trends. Scraping these platforms gives you access to real-time pricing data that neither platform exposes through an official public API.

Competitive Analysis

Car dealers and resellers use scraped listing data to understand competitor pricing, identify undervalued vehicles, and spot inventory gaps in the market. A dealer in Kuala Lumpur can monitor every Mudah listing in their area to ensure their prices remain competitive.

Research and Analytics

Researchers studying automotive market trends, depreciation curves, and consumer behavior rely on large datasets that can only be built through systematic data collection from these platforms.

Understanding the Technical Landscape

Carousell Architecture

Carousell operates as a mobile-first platform with a web interface. Key technical characteristics include:

  • Single Page Application (SPA): Built with React, Carousell renders content dynamically using JavaScript. This means traditional HTTP scraping may miss content that loads asynchronously.
  • GraphQL API: Carousell uses GraphQL for its internal API, which can be accessed directly once you identify the correct endpoints and query structures.
  • Anti-bot measures: Rate limiting, device fingerprinting, and session validation are all employed to detect automated access.
  • Mobile API: The mobile app communicates through dedicated API endpoints that often return cleaner, more structured data than the web interface.

Mudah Architecture

Mudah (mudah.my) serves primarily the Malaysian market with a more traditional web architecture:

  • Server-side rendering: Much of Mudah’s content is rendered server-side, making basic HTTP scraping more straightforward.
  • Structured data: Listings include structured metadata that can be parsed from HTML without JavaScript rendering.
  • Rate limiting: Mudah implements IP-based rate limiting that triggers after sustained request volumes.
  • Regional filtering: Listings are organized by Malaysian states and regions, with location data embedded in URLs.
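That location-in-URL convention can be sketched with a small helper. The base path matches the `mudah.my/malaysia/cars-for-sale` pattern used later in this guide, but the individual region slugs below are illustrative assumptions, not a verified list:

```python
BASE = "https://www.mudah.my"

def region_url(region_slug: str, page: int = 1) -> str:
    """Build a cars-for-sale listing URL for a Malaysian region slug.

    Slugs like "kuala-lumpur" or "penang" are assumptions; confirm them by
    browsing the site and noting the path segment for each state/region.
    """
    return f"{BASE}/{region_slug}/cars-for-sale?o={page}"
```

Swapping the slug lets you shard a scrape by state, which also spreads request volume across distinct URL paths.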

Setting Up Your Proxy Infrastructure

Both Carousell and Mudah will block your IP address if you send too many requests without proxies. For reliable scraping, you need a proxy solution that meets these requirements:

Geographic Requirements

  • Carousell: Use proxies from Singapore, Malaysia, the Philippines, or Indonesia depending on which country’s listings you need. Carousell shows different listings based on your apparent location.
  • Mudah: Malaysian proxies are essential since Mudah serves the Malaysian market exclusively; Singapore IPs may load the site but can trigger additional verification.

Proxy Type Recommendations

For Carousell, mobile proxies from DataResearchTools are the strongest choice. Since Carousell is mobile-first, traffic from mobile IPs closely matches the platform’s expected user behavior. Mobile proxies are also far less likely to trip the device fingerprinting checks that catch datacenter and many residential proxies.

For Mudah, residential proxies work well for basic listing scraping, while mobile proxies are recommended for higher-volume operations or when accessing listing details that require session continuity.

Configuration Example

from dataresearchtools_proxy import ProxyManager

# Configure proxies for Carousell (Singapore)
carousell_proxy = ProxyManager(
    provider="dataresearchtools",
    country="SG",
    proxy_type="mobile",
    rotation="per_request"
)

# Configure proxies for Mudah (Malaysia)
mudah_proxy = ProxyManager(
    provider="dataresearchtools",
    country="MY",
    proxy_type="mobile",
    rotation="sticky",
    session_duration=300  # 5-minute sticky sessions
)
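Downstream code (requests, Playwright) expects proxy credentials in specific shapes. Assuming the manager above can hand back host, port, and credentials, a small hypothetical helper can format them into the dict that requests expects:

```python
def to_requests_proxies(host: str, port: int, username: str, password: str) -> dict:
    """Format proxy credentials as the proxies dict used by requests.

    The host/port/credential fields are assumed to come from whatever your
    proxy provider's client exposes; the names here are illustrative.
    """
    proxy_url = f"http://{username}:{password}@{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}
```

The same URL string (split into `server`, `username`, `password`) also feeds Playwright's `proxy` launch option used in the next section.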

Scraping Carousell Used Car Listings

Method 1: Web Scraping with Headless Browser

Because Carousell uses client-side rendering, a headless browser approach is often necessary:

from playwright.sync_api import sync_playwright
import json

def scrape_carousell_cars(proxy_config):
    with sync_playwright() as p:
        browser = p.chromium.launch(
            proxy={
                "server": proxy_config["server"],
                "username": proxy_config["username"],
                "password": proxy_config["password"]
            }
        )

        page = browser.new_page()
        page.set_extra_http_headers({
            "Accept-Language": "en-SG,en;q=0.9"
        })

        # Navigate to car listings
        page.goto("https://www.carousell.sg/categories/cars-159/")
        page.wait_for_selector('[data-testid="listing-card"]')

        listings = []
        cards = page.query_selector_all('[data-testid="listing-card"]')

        for card in cards:
            # Guard each selector: card markup varies and elements may be absent
            title_el = card.query_selector("p")
            price_el = card.query_selector('[data-testid="listing-price"]')
            link_el = card.query_selector("a")
            listings.append({
                "title": title_el.inner_text() if title_el else None,
                "price": price_el.inner_text() if price_el else None,
                "url": link_el.get_attribute("href") if link_el else None,
            })

        browser.close()
        return listings

Method 2: GraphQL API Direct Access

For faster and more efficient data collection, you can query Carousell’s GraphQL API directly:

import requests

def query_carousell_api(proxy, search_params):
    url = "https://www.carousell.sg/api/2.0/graphql/"

    # NOTE: the operation name, endpoint path, and variable shape here are
    # illustrative; capture the real query from your browser's network
    # inspector, since Carousell updates its internal API without notice.
    query = {
        "operationName": "searchProducts",
        "variables": {
            "categoryId": 159,  # Cars category
            "country": "SG",
            "count": 40,
            "filters": search_params
        }
    }

    headers = {
        "Content-Type": "application/json",
        "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X)"
    }

    response = requests.post(url, json=query, headers=headers, proxies=proxy)
    return response.json()

Data Points Available from Carousell

Each Carousell car listing typically contains:

  • Vehicle make and model
  • Year of manufacture
  • Asking price
  • Mileage (odometer reading)
  • Transmission type
  • Fuel type
  • COE expiry date (Singapore listings)
  • Seller type (dealer or individual)
  • Listing date and last updated time
  • Location
  • Photos (URLs)
  • Description text
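To keep records from both platforms comparable downstream, it helps to normalize scraped fields into one record type. A minimal sketch (field names are this guide's own convention, not either platform's schema):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CarListing:
    platform: str                       # "carousell" or "mudah"
    listing_id: str
    title: str
    price: Optional[float] = None       # normalized numeric price
    year: Optional[int] = None
    mileage_km: Optional[int] = None
    transmission: Optional[str] = None
    fuel_type: Optional[str] = None
    seller_type: Optional[str] = None   # "dealer" or "individual"/"owner"
    location: Optional[str] = None
    url: Optional[str] = None
    photo_urls: list = field(default_factory=list)
```

Optional fields default to None so partially populated search-result cards and fully populated detail pages can share one type.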

Scraping Mudah Used Car Listings

Basic HTTP Scraping

Mudah’s server-rendered pages make basic HTTP scraping viable:

import requests
from bs4 import BeautifulSoup

def scrape_mudah_cars(proxy, page_num=1):
    url = f"https://www.mudah.my/malaysia/cars-for-sale?o={page_num}"

    headers = {
        "User-Agent": "Mozilla/5.0 (Linux; Android 12; SM-G991B)",
        "Accept-Language": "en-MY,en;q=0.9,ms;q=0.8"
    }

    response = requests.get(url, headers=headers, proxies=proxy)
    soup = BeautifulSoup(response.text, "html.parser")

    listings = []
    # NOTE: these CSS class names are illustrative; verify selectors against
    # the live page markup, which changes periodically.
    for item in soup.select(".listing-item"):
        listing = {
            "title": item.select_one(".listing-title").get_text(strip=True),
            "price": item.select_one(".listing-price").get_text(strip=True),
            "location": item.select_one(".listing-location").get_text(strip=True),
            "year": item.select_one(".listing-year").get_text(strip=True),
            "url": item.select_one("a")["href"],
        }
        listings.append(listing)

    return listings

Paginating Through Results

Mudah organizes listings with pagination. To collect comprehensive data, you need to iterate through all available pages:

import random
import time

def scrape_all_mudah_listings(proxy_manager):
    all_listings = []
    page = 1

    while True:
        proxy = proxy_manager.get_proxy()
        listings = scrape_mudah_cars(proxy, page)

        if not listings:
            break

        all_listings.extend(listings)
        page += 1

        # Respectful delay between pages
        time.sleep(random.uniform(2, 5))

    return all_listings

Data Points Available from Mudah

Mudah car listings typically include:

  • Vehicle make, model, and variant
  • Year of manufacture
  • Price in Malaysian Ringgit
  • Mileage
  • Transmission type
  • Fuel type
  • Body type
  • Color
  • Seller location (state and area)
  • Seller type (dealer or owner)
  • Listing date
  • Contact information
  • Photos

Handling Common Challenges

Challenge 1: Dynamic Content Loading

Carousell uses infinite scroll to load additional listings. When using headless browsers, you need to simulate scrolling:

async def scroll_and_collect(page, max_listings=200):
    collected = 0
    previous_count = 0

    while collected < max_listings:
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        await page.wait_for_timeout(2000)

        cards = await page.query_selector_all('[data-testid="listing-card"]')
        collected = len(cards)

        if collected == previous_count:
            break  # No new listings loaded

        previous_count = collected

    return cards

Challenge 2: CAPTCHA Encounters

Both platforms may present CAPTCHAs during scraping. Strategies to minimize CAPTCHA frequency:

  • Use mobile proxies from DataResearchTools, which carry high trust scores and rarely trigger CAPTCHAs
  • Implement realistic browsing patterns with variable delays
  • Rotate user agents alongside proxy rotation
  • When a CAPTCHA appears, switch to a new IP immediately rather than attempting to solve it
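The switch-IP-instead-of-solving policy from the last bullet can be sketched with an injected fetch function. The marker strings are a heuristic assumption; check what the actual challenge pages of each platform contain:

```python
CAPTCHA_MARKERS = ("captcha", "recaptcha", "verify you are human")

def looks_like_captcha(html: str) -> bool:
    """Cheap heuristic: marker strings that commonly appear on challenge pages."""
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)

def fetch_with_rotation(url, get_proxy, fetch, max_attempts=3):
    """Retry through fresh proxies instead of solving challenges.

    get_proxy() returns a proxy config and fetch(url, proxy) returns page
    HTML; both are injected so the policy is independent of the HTTP client.
    """
    for _ in range(max_attempts):
        proxy = get_proxy()
        html = fetch(url, proxy)
        if not looks_like_captcha(html):
            return html
    return None  # every attempt hit a challenge page
```

Because the proxy getter and fetcher are parameters, the same retry policy wraps the requests-based Mudah scraper or a headless-browser fetch unchanged.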

Challenge 3: Price Format Variations

Car prices on these platforms come in various formats that need normalization:

import re

def normalize_price(price_text, currency="SGD"):
    # Strip currency symbols and thousands separators; the currency argument
    # is informational only -- record it alongside the returned value
    clean = re.sub(r'[^\d.]', '', price_text)

    try:
        return float(clean) if clean else None
    except ValueError:
        return None  # e.g. "1.2.3" survives cleaning but is not a float

Challenge 4: Listing Deduplication

Both platforms allow sellers to relist vehicles, creating duplicates. Implement deduplication based on:

  • Vehicle images (perceptual hashing)
  • Seller ID combined with vehicle description
  • Price and specification matching algorithms
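Of the three signals above, seller ID combined with the description text is the cheapest to compute. A sketch, normalizing case and whitespace before hashing so trivial relist edits still collide:

```python
import hashlib

def listing_fingerprint(seller_id: str, description: str) -> str:
    """Hash seller ID plus a case/whitespace-normalized description."""
    normalized = " ".join(description.lower().split())
    return hashlib.sha256(f"{seller_id}|{normalized}".encode()).hexdigest()

def dedupe(listings):
    """Keep the first listing seen for each fingerprint."""
    seen, unique = set(), []
    for listing in listings:
        fp = listing_fingerprint(listing["seller_id"], listing["description"])
        if fp not in seen:
            seen.add(fp)
            unique.append(listing)
    return unique
```

Perceptual image hashing catches relists where the seller rewrites the description; it is worth layering on top once this baseline is in place.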

Building a Complete Data Pipeline

A production-grade pipeline for Carousell and Mudah scraping should include these components:

Scheduler

Run your scrapers at regular intervals, typically every 4-6 hours for active markets. Avoid peak hours when anti-bot systems are most sensitive.
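A minimal jittered scheduler along those lines (the interval matches the 4-6 hour suggestion above; the jitter avoids a perfectly regular request signature):

```python
import random
import time

def run_on_schedule(job, base_interval_hours=4, jitter_minutes=30, max_cycles=None):
    """Run `job` repeatedly with a jittered interval between runs.

    max_cycles bounds the loop for testing; None means run indefinitely.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        job()
        cycles += 1
        delay = base_interval_hours * 3600 + random.uniform(-jitter_minutes, jitter_minutes) * 60
        time.sleep(max(0.0, delay))
```

In production you would more likely hand this to cron or a task queue, but the jitter idea carries over either way.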

Data Storage

Store raw scraped data in a structured format. A database schema for car listings might include:

  • Listing ID (platform-specific)
  • Platform source
  • Vehicle make and model
  • Year
  • Price
  • Mileage
  • Location
  • Seller type
  • Scrape timestamp
  • Listing URL
  • Raw HTML or JSON (for reprocessing)
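The schema above maps directly onto a table. A sketch using SQLite, keeping one row per (platform, listing, scrape) so price history is preserved across cycles:

```python
import sqlite3

# Column names mirror the schema bullets above; raw_payload stores the
# original HTML or JSON for later reprocessing.
SCHEMA = """
CREATE TABLE IF NOT EXISTS listings (
    listing_id  TEXT NOT NULL,
    platform    TEXT NOT NULL,
    make        TEXT,
    model       TEXT,
    year        INTEGER,
    price       REAL,
    mileage_km  INTEGER,
    location    TEXT,
    seller_type TEXT,
    scraped_at  TEXT NOT NULL,
    url         TEXT,
    raw_payload TEXT,
    PRIMARY KEY (platform, listing_id, scraped_at)
)
"""

def open_store(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn
```

SQLite is enough for a single-scraper setup; the same DDL transfers to Postgres with minor type changes once you scale out.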

Change Detection

Track price changes, new listings, and removed listings between scrape cycles. This change data is often more valuable than the raw listings themselves.
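With each cycle's listings keyed by ID, change detection reduces to set operations over the two snapshots:

```python
def diff_cycles(previous: dict, current: dict):
    """Compare two scrape cycles, each mapping listing ID -> price.

    Returns (new_ids, removed_ids, price_changes), where price_changes maps
    listing ID -> (old_price, new_price).
    """
    new_ids = current.keys() - previous.keys()
    removed_ids = previous.keys() - current.keys()
    price_changes = {
        lid: (previous[lid], current[lid])
        for lid in previous.keys() & current.keys()
        if previous[lid] != current[lid]
    }
    return new_ids, removed_ids, price_changes
```

Mapping IDs to full records instead of bare prices works the same way; only the inequality check changes.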

Data Validation

Implement validation rules to filter out obviously incorrect data such as unrealistic prices, impossible mileage figures, or incomplete listings.
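A sketch of such rules; the thresholds are illustrative defaults and should be tuned per market and currency:

```python
import datetime

def is_plausible(listing, min_price=1_000, max_price=2_000_000,
                 min_year=1970, max_mileage_km=500_000):
    """Reject rows with missing or implausible core fields.

    Expects a dict with optional "price", "year", and "mileage_km" keys.
    """
    max_year = datetime.date.today().year + 1  # allow next-model-year listings
    price = listing.get("price")
    year = listing.get("year")
    mileage = listing.get("mileage_km")
    if price is None or not (min_price <= price <= max_price):
        return False
    if year is not None and not (min_year <= year <= max_year):
        return False
    if mileage is not None and not (0 <= mileage <= max_mileage_km):
        return False
    return True
```

Rather than discarding rejected rows, it is worth flagging and storing them: "RM 1" listings, for example, often signal auction-style or contact-for-price posts.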

Scaling Your Operation

As your data needs grow, consider these scaling strategies:

  • Parallel scraping: Run multiple scraper instances simultaneously using different proxy sessions from DataResearchTools
  • Geographic distribution: Scrape different regions in parallel rather than sequentially
  • Incremental updates: After your initial full scrape, switch to monitoring only new and changed listings
  • API preference: Where available, prefer API access over HTML scraping for higher throughput and cleaner data
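The parallel-scraping idea can be sketched with a thread pool; `scrape_page` here is a placeholder for either platform's page scraper, each call carrying its own proxy session:

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_pages_parallel(page_numbers, scrape_page, max_workers=4):
    """Fan page scrapes out across a thread pool.

    scrape_page(n) should fetch page n and return a list of listings;
    pool.map preserves the input page order in the combined output.
    """
    all_listings = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for page_listings in pool.map(scrape_page, page_numbers):
            all_listings.extend(page_listings)
    return all_listings
```

Keep `max_workers` modest: each concurrent worker multiplies your request rate against the same target, so pair this with per-request proxy rotation.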

Conclusion

Scraping used car listings from Carousell and Mudah provides access to rich automotive market data across Southeast Asia. The key to successful, sustainable scraping lies in using the right proxy infrastructure, implementing respectful scraping patterns, and building robust data pipelines.

DataResearchTools mobile proxies are particularly effective for these platforms because they match the mobile-first user behavior that both Carousell and Mudah expect. With proper proxy rotation, geographic targeting, and session management, you can build comprehensive used car market datasets that power pricing intelligence, competitive analysis, and market research across the region.

