How to Build a Car Price Comparison API with Proxy Infrastructure

Car price comparison tools are among the most valuable products in the automotive technology space. Consumers use them to find the best deals, dealers use them for competitive intelligence, and financial institutions use them for vehicle valuations. Behind every car price comparison API is a data collection engine that aggregates pricing information from dozens of sources.

This guide walks you through building a car price comparison API from the ground up, focusing on the proxy infrastructure needed to collect reliable pricing data at scale across Southeast Asian markets.

System Architecture Overview

A car price comparison API consists of four main components:

[Data Collection Layer]  -->  [Data Processing Layer]  -->  [Storage Layer]  -->  [API Layer]
   (Scrapers + Proxies)       (Normalization + Matching)    (Database + Cache)    (REST/GraphQL)

Each component must be designed for reliability, accuracy, and performance. The data collection layer is where proxy infrastructure plays its critical role.

Designing the Data Collection Layer

Source Selection

For a comprehensive Southeast Asian car price comparison, you need data from these source categories:

Marketplace Platforms:

  • Carousell (multi-country)
  • Mudah (Malaysia)
  • OLX (Indonesia, Philippines)
  • Kaidee (Thailand)
  • Cho Tot (Vietnam)

Dealer Aggregators:

  • SGCarMart (Singapore)
  • Carlist.my (Malaysia)
  • One2Car (Thailand)

Certified Pre-Owned Platforms:

  • Carro (multi-country)
  • Carsome (multi-country)

New Car Pricing:

  • Manufacturer websites
  • Dealer websites
  • Financial comparison sites

Proxy Infrastructure

Your data collection must route through appropriate proxies for each source. DataResearchTools mobile proxies provide the geographic coverage needed to access all these sources from their native countries:

from uuid import uuid4

class PriceCollectionProxyManager:
    def __init__(self, api_key):
        self.api_key = api_key
        self.endpoint = "proxy.dataresearchtools.com"

    def get_source_proxy(self, source_name, country):
        session_id = f"{source_name}-{uuid4().hex[:8]}"
        auth = f"{self.api_key}:country-{country}-type-mobile-session-{session_id}"
        return {
            "http": f"http://{auth}@{self.endpoint}:8080",
            "https": f"http://{auth}@{self.endpoint}:8080"
        }
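The manager only assembles proxy URLs, so the format is easy to verify in isolation. This standalone sketch mirrors the same construction; the endpoint, port, and auth-string syntax are the placeholder values from the snippet above, not confirmed DataResearchTools conventions:

```python
from uuid import uuid4

def build_proxy_config(api_key: str, source_name: str, country: str,
                       endpoint: str = "proxy.dataresearchtools.com",
                       port: int = 8080) -> dict:
    # A per-source session ID pins all requests for one scrape run to one sticky IP
    session_id = f"{source_name}-{uuid4().hex[:8]}"
    auth = f"{api_key}:country-{country}-type-mobile-session-{session_id}"
    proxy_url = f"http://{auth}@{endpoint}:{port}"
    return {"http": proxy_url, "https": proxy_url}
```

The resulting dict plugs directly into the `proxies` argument of `requests.request`.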

Scraper Framework

Build a unified scraper framework that handles all sources:

import logging

import requests
from abc import ABC, abstractmethod
from bs4 import BeautifulSoup

logger = logging.getLogger(__name__)

class BaseCarScraper(ABC):
    def __init__(self, proxy_manager, country):
        self.proxy_manager = proxy_manager
        self.country = country
        self.source_name = self.__class__.__name__

    @abstractmethod
    def search(self, make=None, model=None, year_from=None, year_to=None, page=1):
        pass

    @abstractmethod
    def get_listing_detail(self, listing_id):
        pass

    def get_proxy(self):
        return self.proxy_manager.get_source_proxy(self.source_name, self.country)

    def make_request(self, url, method="GET", **kwargs):
        proxy = self.get_proxy()
        headers = kwargs.pop("headers", {})
        headers.setdefault("User-Agent", get_random_mobile_ua())  # your rotating mobile User-Agent helper

        try:
            response = requests.request(
                method, url, proxies=proxy, headers=headers, timeout=30, **kwargs
            )
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException as e:
            logger.error(f"Request failed for {self.source_name}: {e}")
            return None


class SGCarMartScraper(BaseCarScraper):
    def search(self, make=None, model=None, year_from=None, year_to=None, page=1):
        url = "https://www.sgcarmart.com/used_cars/listing.php"
        params = {"RPG": 40, "page": page}  # query parameter names are site-specific and may change
        if make:
            params["MOD"] = make
        if year_from:
            params["YRF"] = year_from
        if year_to:
            params["YRT"] = year_to

        response = self.make_request(url, params=params)
        if response:
            return self.parse_search_results(response.text)
        return []

    def parse_search_results(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        listings = []
        for item in soup.select('.listing_table tr'):
            listing = self.extract_listing(item)  # field-extraction helper, omitted for brevity
            if listing.get("title"):
                listing["source"] = "sgcarmart"
                listing["country"] = "SG"
                listings.append(listing)
        return listings

    def get_listing_detail(self, listing_id):
        url = f"https://www.sgcarmart.com/used_cars/info.php?ID={listing_id}"
        response = self.make_request(url)
        if response:
            return self.parse_detail_page(response.text)  # detail parser, omitted for brevity
        return None
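`make_request` above gives up after a single failure. In production you would normally retry transient errors with exponential backoff, ideally rotating the proxy session between attempts. A minimal helper for computing the delay schedule (the base and cap values here are arbitrary defaults, not recommendations from any source):

```python
def backoff_delays(max_attempts: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    # Delay before each retry: base * 2^attempt, capped so one slow
    # source cannot stall the whole collection cycle
    return [min(cap, base * (2 ** attempt)) for attempt in range(max_attempts)]
```

In `make_request` you would loop over these delays, sleeping between attempts and requesting a fresh proxy session on each retry.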

Data Processing Layer

Price Normalization

Different sources format prices differently. Normalize everything into a consistent format:

class PriceNormalizer:
    CURRENCY_MAP = {
        "SG": {"currency": "SGD", "symbols": ["S$", "SGD", "$"]},
        "MY": {"currency": "MYR", "symbols": ["RM", "MYR"]},
        "TH": {"currency": "THB", "symbols": ["฿", "THB", "บาท"]},
        "ID": {"currency": "IDR", "symbols": ["Rp", "IDR"]},
        "PH": {"currency": "PHP", "symbols": ["₱", "PHP", "P"]},
        "VN": {"currency": "VND", "symbols": ["₫", "VND", "đ"]},
    }

    def __init__(self, exchange_rate_provider):
        self.fx = exchange_rate_provider

    def normalize(self, price_str, country):
        currency_info = self.CURRENCY_MAP.get(country, {})
        currency = currency_info.get("currency", "USD")

        # Remove currency symbols
        clean = price_str
        for symbol in currency_info.get("symbols", []):
            clean = clean.replace(symbol, "")

        # Remove formatting
        clean = clean.strip().replace(",", "").replace(" ", "")

        # Handle Indonesian shorthand ("150 jt" / "150 juta" = 150 million rupiah)
        lowered = clean.lower()
        if country == "ID" and ("jt" in lowered or "juta" in lowered):
            lowered = lowered.replace("juta", "").replace("jt", "")
            try:
                amount = float(lowered) * 1_000_000
            except ValueError:
                return None
        else:
            # More than one "." means it is a thousands separator, not a decimal point
            if clean.count('.') > 1:
                clean = clean.replace('.', '')
            try:
                amount = float(clean)
            except ValueError:
                return None

        usd_amount = self.fx.convert(amount, currency, "USD")

        return {
            "amount_local": amount,
            "currency": currency,
            "amount_usd": usd_amount
        }

Vehicle Matching

Match the same vehicle across different sources for accurate comparison:

class VehicleMatcher:
    def __init__(self):
        self.make_aliases = self.load_make_aliases()
        self.model_aliases = self.load_model_aliases()

    def normalize_vehicle(self, listing):
        make = self.normalize_make(listing.get("make", ""))
        model = self.normalize_model(make, listing.get("model", ""))
        year = self.extract_year(listing)

        return {
            "normalized_make": make,
            "normalized_model": model,
            "year": year,
            "key": f"{make}|{model}|{year}".lower()
        }

    def normalize_make(self, raw_make):
        raw_lower = raw_make.strip().lower()
        return self.make_aliases.get(raw_lower, raw_make.strip().title())

    def normalize_model(self, make, raw_model):
        key = f"{make.lower()}_{raw_model.strip().lower()}"
        return self.model_aliases.get(key, raw_model.strip())

    def load_make_aliases(self):
        return {
            "merc": "Mercedes-Benz",
            "mercedes": "Mercedes-Benz",
            "mercedes benz": "Mercedes-Benz",
            "mercedes-benz": "Mercedes-Benz",
            "benz": "Mercedes-Benz",
            "vw": "Volkswagen",
            "volkswagon": "Volkswagen",
            "chevy": "Chevrolet",
            "beemer": "BMW",
            "bmw": "BMW",
        }

    def find_matches(self, listing, all_listings):
        normalized = self.normalize_vehicle(listing)
        matches = []

        for candidate in all_listings:
            candidate_norm = self.normalize_vehicle(candidate)
            if candidate_norm["key"] == normalized["key"]:
                matches.append(candidate)

        return matches
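`find_matches` rescans every listing per query, which is quadratic over a full catalog. For batch comparison it is cheaper to bucket all listings by their normalized key in one pass. A sketch, assuming each listing already carries the `key` produced by `normalize_vehicle`:

```python
from collections import defaultdict

def group_listings_by_key(listings: list[dict]) -> dict[str, list[dict]]:
    # One pass: every listing with the same make|model|year key
    # lands in the same comparison bucket
    groups = defaultdict(list)
    for listing in listings:
        groups[listing["key"]].append(listing)
    return dict(groups)
```

Each bucket then holds the cross-source price set for one vehicle, ready for the summary statistics in the API layer.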

Deduplication

Remove duplicate listings that appear on multiple platforms:

class ListingDeduplicator:
    def deduplicate(self, listings):
        # Group by VIN if available
        vin_groups = {}
        no_vin = []

        for listing in listings:
            vin = listing.get("vin")
            if vin and len(vin) == 17:
                if vin not in vin_groups:
                    vin_groups[vin] = []
                vin_groups[vin].append(listing)
            else:
                no_vin.append(listing)

        # For VIN-matched groups, keep the listing with most data
        deduplicated = []
        for vin, group in vin_groups.items():
            best = max(group, key=self.data_completeness_score)
            best["other_sources"] = [
                {"source": l["source"], "price": l["price"], "url": l.get("url")}
                for l in group if l is not best
            ]
            deduplicated.append(best)

        # For listings without VIN, use fuzzy matching
        deduplicated.extend(self.fuzzy_deduplicate(no_vin))

        return deduplicated

    def data_completeness_score(self, listing):
        score = 0
        for field in ["price", "mileage", "year", "make", "model", "photos", "description", "vin"]:
            if listing.get(field):
                score += 1
        return score
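`fuzzy_deduplicate` is referenced above but not shown. One possible pairwise heuristic, sketched here with `difflib` from the standard library, treats two listings as the same vehicle when their titles are nearly identical and their prices sit within a small tolerance; both thresholds are arbitrary choices you would tune against labeled data:

```python
from difflib import SequenceMatcher

def is_probable_duplicate(a: dict, b: dict,
                          title_threshold: float = 0.85,
                          price_tolerance: float = 0.02) -> bool:
    # The same car cross-posted to two platforms usually has a
    # near-identical title and a price within a couple of percent
    title_sim = SequenceMatcher(
        None, a.get("title", "").lower(), b.get("title", "").lower()
    ).ratio()
    if title_sim < title_threshold:
        return False
    pa, pb = a.get("price_usd"), b.get("price_usd")
    if not pa or not pb:
        return True  # titles match; no price to disagree on
    return abs(pa - pb) / max(pa, pb) <= price_tolerance
```

`fuzzy_deduplicate` would then cluster the no-VIN listings by this predicate, ideally after first bucketing by normalized make/model/year so the pairwise comparisons stay tractable.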

Storage Layer

Database Schema

CREATE TABLE vehicles (
    vehicle_id SERIAL PRIMARY KEY,
    normalized_make VARCHAR(100) NOT NULL,
    normalized_model VARCHAR(200) NOT NULL,
    year INTEGER NOT NULL,
    variant VARCHAR(200),
    body_type VARCHAR(50),
    fuel_type VARCHAR(30),
    transmission VARCHAR(20),
    engine_cc INTEGER,
    UNIQUE (normalized_make, normalized_model, year, variant)
);

CREATE TABLE listings (
    listing_id SERIAL PRIMARY KEY,
    vehicle_id INTEGER REFERENCES vehicles(vehicle_id),
    source_platform VARCHAR(50) NOT NULL,
    source_listing_id VARCHAR(200),
    country VARCHAR(5) NOT NULL,
    region VARCHAR(100),
    price_local DECIMAL(15, 2),
    currency VARCHAR(5),
    price_usd DECIMAL(15, 2),
    mileage_km INTEGER,
    color VARCHAR(50),
    seller_type VARCHAR(20),
    listing_url VARCHAR(500),
    image_urls TEXT[],
    vin VARCHAR(17),
    first_seen TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    last_seen TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    last_price_change TIMESTAMP,
    is_active BOOLEAN DEFAULT true,
    raw_data JSONB,
    UNIQUE (source_platform, source_listing_id)
);

CREATE TABLE price_snapshots (
    id SERIAL PRIMARY KEY,
    listing_id INTEGER REFERENCES listings(listing_id),
    price_local DECIMAL(15, 2),
    price_usd DECIMAL(15, 2),
    recorded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Indexes for API queries
CREATE INDEX idx_listings_vehicle ON listings(vehicle_id, is_active);
CREATE INDEX idx_listings_country_price ON listings(country, price_usd) WHERE is_active;
CREATE INDEX idx_vehicles_make_model ON vehicles(normalized_make, normalized_model, year);

Caching Strategy

Cache frequently requested price comparisons:

import json

class PriceCache:
    def __init__(self, redis_client, default_ttl=3600):
        self.redis = redis_client
        self.default_ttl = default_ttl

    def _key(self, make, model, year, country):
        # Lowercase so "Toyota" and "toyota" hit the same cache entry
        return f"price:{make}:{model}:{year}:{country or 'all'}".lower()

    def get_price_comparison(self, make, model, year, country=None):
        cached = self.redis.get(self._key(make, model, year, country))
        return json.loads(cached) if cached else None

    def set_price_comparison(self, make, model, year, country, data):
        self.redis.setex(
            self._key(make, model, year, country), self.default_ttl, json.dumps(data)
        )

API Layer

REST API Design

import statistics

from fastapi import FastAPI, Query
from typing import Optional

app = FastAPI(title="Car Price Comparison API")

@app.get("/api/v1/prices")
async def get_price_comparison(
    make: str = Query(..., description="Vehicle make"),
    model: str = Query(..., description="Vehicle model"),
    year: Optional[int] = Query(None, description="Model year"),
    country: Optional[str] = Query(None, description="Country code (SG, MY, TH, ID, PH, VN)"),
    mileage_max: Optional[int] = Query(None, description="Maximum mileage in km"),
):
    # Check cache first
    cached = cache.get_price_comparison(make, model, year, country)
    if cached:
        return cached

    # Query database
    query = build_price_query(make, model, year, country, mileage_max)
    listings = db.execute(query)

    # Calculate statistics
    result = {
        "query": {
            "make": make,
            "model": model,
            "year": year,
            "country": country,
        },
        "summary": calculate_price_summary(listings),
        "by_country": group_by_country(listings) if not country else None,
        "by_source": group_by_source(listings),
        "listings": [format_listing(l) for l in listings[:50]],
        "total_listings": len(listings),
        "data_freshness": get_data_freshness(),
    }

    # Cache result
    cache.set_price_comparison(make, model, year, country, result)

    return result


def calculate_price_summary(listings):
    if not listings:
        return None

    prices = [l["price_usd"] for l in listings if l.get("price_usd")]
    if not prices:
        return None

    return {
        "average_price_usd": round(statistics.mean(prices), 2),
        "median_price_usd": round(statistics.median(prices), 2),
        "min_price_usd": round(min(prices), 2),
        "max_price_usd": round(max(prices), 2),
        "price_std_dev": round(statistics.stdev(prices), 2) if len(prices) > 1 else 0,
        "sample_size": len(prices),
    }


@app.get("/api/v1/prices/history")
async def get_price_history(
    make: str,
    model: str,
    year: int,
    country: str,
    days: int = Query(90, le=365),
):
    history = db.get_price_history(make, model, year, country, days)

    return {
        "query": {"make": make, "model": model, "year": year, "country": country},
        "history": [
            {
                "date": entry["date"].isoformat(),
                "avg_price_usd": entry["avg_price"],
                "listing_count": entry["count"],
            }
            for entry in history
        ],
        "trend": calculate_trend(history),
    }
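`calculate_trend` is assumed in the history endpoint above. A minimal version compares the oldest and newest daily averages; this is a first/last comparison rather than a regression, and the one-percent threshold is an arbitrary choice:

```python
def calculate_trend(history: list[dict], threshold: float = 0.01) -> dict:
    # history is ordered oldest-first; each entry carries avg_price
    prices = [h["avg_price"] for h in history if h.get("avg_price")]
    if len(prices) < 2:
        return {"direction": "stable", "change_pct": 0.0}
    change = (prices[-1] - prices[0]) / prices[0]
    if change > threshold:
        direction = "rising"
    elif change < -threshold:
        direction = "falling"
    else:
        direction = "stable"
    return {"direction": direction, "change_pct": round(change * 100, 2)}
```

For noisier markets you would replace the first/last comparison with a least-squares slope over the whole window.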


@app.get("/api/v1/market/overview")
async def get_market_overview(country: str):
    return {
        "country": country,
        "total_active_listings": db.count_active_listings(country),
        "top_makes": db.get_top_makes(country, limit=10),
        "price_segments": db.get_price_segments(country),
        "avg_days_on_market": db.get_avg_dom(country),
        "new_listings_24h": db.count_new_listings(country, hours=24),
    }

API Authentication and Rate Limiting

from fastapi import Depends, HTTPException
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials

security = HTTPBearer()

async def verify_api_key(credentials: HTTPAuthorizationCredentials = Depends(security)):
    api_key = credentials.credentials
    user = db.get_user_by_api_key(api_key)

    if not user:
        raise HTTPException(status_code=401, detail="Invalid API key")

    # Check rate limit
    if not rate_limiter.check(api_key, user["plan_limit"]):
        raise HTTPException(status_code=429, detail="Rate limit exceeded")

    return user
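The `rate_limiter.check` call above is left abstract. For a single-process deployment it can be sketched as a sliding-window limiter with one deque of timestamps per key; a production API would back this with Redis so limits hold across workers:

```python
import time
from collections import defaultdict, deque

class SlidingWindowRateLimiter:
    def __init__(self, window_seconds: float = 60.0):
        self.window = window_seconds
        self.hits = defaultdict(deque)

    def check(self, api_key: str, limit: int) -> bool:
        # Drop timestamps that have aged out of the window,
        # then admit the request only if the key is under its limit
        now = time.monotonic()
        q = self.hits[api_key]
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= limit:
            return False
        q.append(now)
        return True
```

Instantiate it once at startup and pass the per-plan limit from the user record into each `check` call.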

Data Collection Scheduling

Orchestrating Regular Updates

class CollectionOrchestrator:
    def __init__(self, proxy_manager):
        self.proxy_manager = proxy_manager
        self.scrapers = self.initialize_scrapers()

    def run_collection_cycle(self):
        results = {"success": 0, "failed": 0, "total_listings": 0}

        for source_name, scraper in self.scrapers.items():
            try:
                listings = scraper.collect_all()
                processed = self.process_and_store(listings, source_name)
                results["success"] += 1
                results["total_listings"] += len(processed)
            except Exception as e:
                logger.error(f"Collection failed for {source_name}: {e}")
                results["failed"] += 1

        # Invalidate relevant caches
        cache.flush_stale_entries()

        return results

    def process_and_store(self, listings, source):
        normalizer = PriceNormalizer(exchange_rate_provider)
        matcher = VehicleMatcher()
        deduplicator = ListingDeduplicator()

        # Normalize prices
        for listing in listings:
            listing["normalized_price"] = normalizer.normalize(
                listing.get("price", ""), listing.get("country", "")
            )

        # Match to canonical vehicles
        for listing in listings:
            listing["vehicle_match"] = matcher.normalize_vehicle(listing)

        # Deduplicate
        unique_listings = deduplicator.deduplicate(listings)

        # Store
        db.upsert_listings(unique_listings)

        return unique_listings
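The orchestrator runs one cycle; to run it continuously you also need a schedule. One simple approach is to stagger sources evenly across the cycle so scrapes never burst at the same moment, which spreads load on both the proxy pool and the target sites. A sketch (the cycle length is an arbitrary example):

```python
def build_stagger_schedule(source_names: list[str], cycle_minutes: int = 60) -> dict[str, float]:
    # Each source starts at a fixed minute-offset within the cycle
    if not source_names:
        return {}
    interval = cycle_minutes / len(source_names)
    return {name: round(i * interval, 2) for i, name in enumerate(source_names)}
```

A cron job or scheduler loop would then launch each scraper at its offset every cycle.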

Monetization Strategies

API Plans

Structure your API plans based on usage:

  • Free tier: 100 requests/day, basic price data, limited to 1 country
  • Starter: 5,000 requests/day, full price data, all countries, $99/month
  • Professional: 50,000 requests/day, price history, analytics, webhooks, $499/month
  • Enterprise: Unlimited requests, raw data access, custom integrations, custom pricing
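The tiers above map naturally onto a config that the `verify_api_key` dependency can consult when enforcing limits. The names and shape here are illustrative, taken from the list:

```python
PLAN_LIMITS = {
    "free":         {"requests_per_day": 100,   "countries": 1,    "history": False},
    "starter":      {"requests_per_day": 5000,  "countries": None, "history": False},  # None = all countries
    "professional": {"requests_per_day": 50000, "countries": None, "history": True},
    "enterprise":   {"requests_per_day": None,  "countries": None, "history": True},   # None = unlimited
}

def daily_limit(plan: str) -> float:
    # Translate the plan's quota into a number the rate limiter can compare against
    limit = PLAN_LIMITS[plan]["requests_per_day"]
    return float("inf") if limit is None else limit
```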

Value-Added Features

  • Price alerts: Notify users when a vehicle drops below a target price
  • Market reports: Weekly/monthly market analysis by segment
  • Valuation API: Instant vehicle valuation based on market data
  • Dealer analytics: Dashboard for dealer customers showing competitive position

Conclusion

Building a car price comparison API requires a solid foundation of data collection infrastructure powered by reliable proxies. The quality of your API depends entirely on the breadth, freshness, and accuracy of your underlying data.

DataResearchTools mobile proxies provide the infrastructure needed to collect pricing data reliably across all major Southeast Asian automotive platforms. With carrier-grade IPs in every target market, geographic precision for accessing country-specific pricing, and the bandwidth to support continuous data collection cycles, DataResearchTools ensures your price comparison API always serves current, accurate market data.

The combination of comprehensive data collection, intelligent processing, and a well-designed API creates a product that serves the entire automotive ecosystem, from individual consumers seeking the best deal to enterprise clients building their own automotive intelligence platforms.
