How to Build a Car Price Comparison API with Proxy Infrastructure
Car price comparison tools are among the most valuable products in the automotive technology space. Consumers use them to find the best deals, dealers use them for competitive intelligence, and financial institutions use them for vehicle valuations. Behind every car price comparison API is a data collection engine that aggregates pricing information from dozens of sources.
This guide walks you through building a car price comparison API from the ground up, focusing on the proxy infrastructure needed to collect reliable pricing data at scale across Southeast Asian markets.
System Architecture Overview
A car price comparison API consists of four main components:
[Data Collection Layer] --> [Data Processing Layer] --> [Storage Layer] --> [API Layer]
 (Scrapers + Proxies)      (Normalization + Matching)   (Database + Cache)  (REST/GraphQL)

Each component must be designed for reliability, accuracy, and performance. The data collection layer is where proxy infrastructure plays its critical role.
Designing the Data Collection Layer
Source Selection
For a comprehensive Southeast Asian car price comparison, you need data from these source categories (a minimal registry sketch follows the list):
Marketplace Platforms:
- Carousell (multi-country)
- Mudah (Malaysia)
- OLX (Indonesia, Philippines)
- Kaidee (Thailand)
- Cho Tot (Vietnam)
Dealer Aggregators:
- SGCarMart (Singapore)
- Carlist.my (Malaysia)
- One2Car (Thailand)
Certified Pre-Owned Platforms:
- Carro (multi-country)
- Carsome (multi-country)
New Car Pricing:
- Manufacturer websites
- Dealer websites
- Financial comparison sites
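One convenient way to wire these categories into your collector is a simple source registry. The structure below is an illustrative sketch, not a required format, and the country lists are placeholders to adjust for your actual coverage:

# Illustrative source registry mapping each platform to its category and
# the country codes it serves; country lists are placeholders to adjust.
SOURCE_REGISTRY = {
    "carousell": {"category": "marketplace", "countries": ["SG", "MY", "PH"]},
    "mudah":     {"category": "marketplace", "countries": ["MY"]},
    "olx":       {"category": "marketplace", "countries": ["ID", "PH"]},
    "kaidee":    {"category": "marketplace", "countries": ["TH"]},
    "chotot":    {"category": "marketplace", "countries": ["VN"]},
    "sgcarmart": {"category": "dealer_aggregator", "countries": ["SG"]},
    "carlist":   {"category": "dealer_aggregator", "countries": ["MY"]},
    "one2car":   {"category": "dealer_aggregator", "countries": ["TH"]},
    "carro":     {"category": "certified_preowned", "countries": ["SG", "MY", "TH", "ID"]},
    "carsome":   {"category": "certified_preowned", "countries": ["MY", "ID", "TH", "SG"]},
}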
Proxy Infrastructure
Your data collection must route through appropriate proxies for each source. DataResearchTools mobile proxies provide the geographic coverage needed to access all these sources from their native countries:
from uuid import uuid4

class PriceCollectionProxyManager:
    def __init__(self, api_key):
        self.api_key = api_key
        self.endpoint = "proxy.dataresearchtools.com"

    def get_source_proxy(self, source_name, country):
        # One sticky session per source keeps each scraper on a stable IP
        session_id = f"{source_name}-{uuid4().hex[:8]}"
        auth = f"{self.api_key}:country-{country}-type-mobile-session-{session_id}"
        return {
            "http": f"http://{auth}@{self.endpoint}:8080",
            "https": f"http://{auth}@{self.endpoint}:8080",
        }
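Using the manager is then a one-liner per request; the source name and country code here are illustrative:

proxy_manager = PriceCollectionProxyManager(api_key="YOUR_API_KEY")

# Route a Thai dealer-aggregator scraper through a Thailand mobile IP;
# the returned dict is passed as the `proxies` argument to requests
proxies = proxy_manager.get_source_proxy("one2car", "th")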
Scraper Framework
Build a unified scraper framework that handles all sources:
import logging
from abc import ABC, abstractmethod

import requests
from bs4 import BeautifulSoup

logger = logging.getLogger(__name__)

class BaseCarScraper(ABC):
    def __init__(self, proxy_manager, country):
        self.proxy_manager = proxy_manager
        self.country = country
        self.source_name = self.__class__.__name__

    @abstractmethod
    def search(self, make=None, model=None, year_from=None, year_to=None, page=1):
        pass

    @abstractmethod
    def get_listing_detail(self, listing_id):
        pass

    def get_proxy(self):
        return self.proxy_manager.get_source_proxy(self.source_name, self.country)

    def make_request(self, url, method="GET", **kwargs):
        proxy = self.get_proxy()
        headers = kwargs.pop("headers", {})
        # get_random_mobile_ua() is an assumed helper returning a random
        # mobile User-Agent string
        headers.setdefault("User-Agent", get_random_mobile_ua())
        try:
            response = requests.request(
                method, url, proxies=proxy, headers=headers, timeout=30, **kwargs
            )
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException as e:
            logger.error(f"Request failed for {self.source_name}: {e}")
            return None

class SGCarMartScraper(BaseCarScraper):
    def search(self, make=None, model=None, year_from=None, year_to=None, page=1):
        url = "https://www.sgcarmart.com/used_cars/listing.php"
        # Site-specific query parameters
        params = {"RPG": 40, "AVession": page}
        if make:
            params["MOD"] = make
        if year_from:
            params["YRF"] = year_from
        if year_to:
            params["YRT"] = year_to
        response = self.make_request(url, params=params)
        if response:
            return self.parse_search_results(response.text)
        return []

    def parse_search_results(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        listings = []
        for item in soup.select('.listing_table tr'):
            # extract_listing() holds the site-specific field parsing (omitted)
            listing = self.extract_listing(item)
            if listing.get("title"):
                listing["source"] = "sgcarmart"
                listing["country"] = "SG"
                listings.append(listing)
        return listings

    def get_listing_detail(self, listing_id):
        url = f"https://www.sgcarmart.com/used_cars/info.php?ID={listing_id}"
        response = self.make_request(url)
        if response:
            return self.parse_detail_page(response.text)
        return None
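A quick smoke test, assuming the proxy manager from earlier is configured and the parsing helpers are implemented:

scraper = SGCarMartScraper(proxy_manager, country="sg")

# Pull the first page of Toyota listings from 2018 onward
results = scraper.search(make="Toyota", year_from=2018, page=1)
print(f"Found {len(results)} listings")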
Data Processing Layer
Price Normalization
Different sources format prices differently. Normalize everything into a consistent format:
class PriceNormalizer:
    CURRENCY_MAP = {
        "SG": {"currency": "SGD", "symbols": ["S$", "SGD", "$"]},
        "MY": {"currency": "MYR", "symbols": ["RM", "MYR"]},
        "TH": {"currency": "THB", "symbols": ["฿", "THB", "บาท"]},
        "ID": {"currency": "IDR", "symbols": ["Rp", "IDR"]},
        "PH": {"currency": "PHP", "symbols": ["₱", "PHP", "P"]},
        "VN": {"currency": "VND", "symbols": ["₫", "VND", "đ"]},
    }

    def __init__(self, exchange_rate_provider):
        self.fx = exchange_rate_provider

    def normalize(self, price_str, country):
        currency_info = self.CURRENCY_MAP.get(country, {})
        currency = currency_info.get("currency", "USD")
        # Remove currency symbols
        clean = price_str
        for symbol in currency_info.get("symbols", []):
            clean = clean.replace(symbol, "")
        # Remove formatting
        clean = clean.strip().replace(",", "").replace(" ", "")
        # Handle Indonesian pricing ("jt"/"juta" both denote millions)
        if country == "ID" and ("jt" in clean.lower() or "juta" in clean.lower()):
            clean = clean.lower().replace("juta", "").replace("jt", "")
            try:
                amount = float(clean) * 1000000
            except ValueError:
                return None  # unparseable price
        else:
            # Treat repeated dots as thousands separators
            if clean.count('.') > 1:
                clean = clean.replace('.', '')
            try:
                amount = float(clean)
            except ValueError:
                return None  # unparseable price

        usd_amount = self.fx.convert(amount, currency, "USD")
        return {
            "amount_local": amount,
            "currency": currency,
            "amount_usd": usd_amount,
        }
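The normalizer only assumes its exchange-rate provider exposes a convert() method. A minimal sketch backed by a static rate table follows; the rates are placeholders, and in production you would pull live rates from an FX feed:

class StaticExchangeRateProvider:
    # Placeholder rates (local currency units per USD); replace with a live feed
    RATES_PER_USD = {"SGD": 1.35, "MYR": 4.7, "THB": 36.0,
                     "IDR": 15500.0, "PHP": 56.0, "VND": 24500.0, "USD": 1.0}

    def convert(self, amount, from_currency, to_currency):
        usd = amount / self.RATES_PER_USD[from_currency]
        return round(usd * self.RATES_PER_USD[to_currency], 2)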
Vehicle Matching
Match the same vehicle across different sources for accurate comparison:
import re

class VehicleMatcher:
    def __init__(self):
        self.make_aliases = self.load_make_aliases()
        self.model_aliases = self.load_model_aliases()

    def normalize_vehicle(self, listing):
        make = self.normalize_make(listing.get("make", ""))
        model = self.normalize_model(make, listing.get("model", ""))
        year = self.extract_year(listing)
        return {
            "normalized_make": make,
            "normalized_model": model,
            "year": year,
            "key": f"{make}|{model}|{year}".lower(),
        }

    def normalize_make(self, raw_make):
        raw_lower = raw_make.strip().lower()
        return self.make_aliases.get(raw_lower, raw_make.strip().title())

    def normalize_model(self, make, raw_model):
        key = f"{make.lower()}_{raw_model.strip().lower()}"
        return self.model_aliases.get(key, raw_model.strip())

    def extract_year(self, listing):
        # Prefer an explicit year field; otherwise scan the title for a
        # plausible model year
        if listing.get("year"):
            return int(listing["year"])
        match = re.search(r"\b(19[89]\d|20[0-4]\d)\b", listing.get("title", ""))
        return int(match.group(1)) if match else None

    def load_make_aliases(self):
        return {
            "merc": "Mercedes-Benz",
            "mercedes": "Mercedes-Benz",
            "mercedes benz": "Mercedes-Benz",
            "mercedes-benz": "Mercedes-Benz",
            "benz": "Mercedes-Benz",
            "vw": "Volkswagen",
            "volkswagon": "Volkswagen",
            "chevy": "Chevrolet",
            "beemer": "BMW",
            "bmw": "BMW",
        }

    def load_model_aliases(self):
        # Maps "make_rawmodel" to a canonical model name; seed with the
        # variant spellings you observe per market (entries illustrative)
        return {
            "toyota_vios j": "Vios",
            "honda_city e": "City",
        }

    def find_matches(self, listing, all_listings):
        normalized = self.normalize_vehicle(listing)
        matches = []
        for candidate in all_listings:
            candidate_norm = self.normalize_vehicle(candidate)
            if candidate_norm["key"] == normalized["key"]:
                matches.append(candidate)
        return matches
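In practice the same car surfaces under many spellings; the matcher collapses them to one key:

matcher = VehicleMatcher()

print(matcher.normalize_vehicle({"make": "merc", "model": "C200", "year": 2020})["key"])
# mercedes-benz|c200|2020
print(matcher.normalize_vehicle({"make": "Mercedes Benz", "model": "C200", "year": 2020})["key"])
# mercedes-benz|c200|2020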
Deduplication
Remove duplicate listings that appear on multiple platforms:
class ListingDeduplicator:
    def deduplicate(self, listings):
        # Group by VIN if available
        vin_groups = {}
        no_vin = []
        for listing in listings:
            vin = listing.get("vin")
            if vin and len(vin) == 17:
                if vin not in vin_groups:
                    vin_groups[vin] = []
                vin_groups[vin].append(listing)
            else:
                no_vin.append(listing)
        # For VIN-matched groups, keep the listing with most data
        deduplicated = []
        for vin, group in vin_groups.items():
            best = max(group, key=lambda x: self.data_completeness_score(x))
            best["other_sources"] = [
                {"source": l["source"], "price": l["price"], "url": l.get("url")}
                for l in group if l != best
            ]
            deduplicated.append(best)
        # For listings without VIN, use fuzzy matching
        deduplicated.extend(self.fuzzy_deduplicate(no_vin))
        return deduplicated

    def data_completeness_score(self, listing):
        score = 0
        for field in ["price", "mileage", "year", "make", "model", "photos", "description", "vin"]:
            if listing.get(field):
                score += 1
        return score
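fuzzy_deduplicate is left undefined above. One minimal approach, added as a method on ListingDeduplicator and assuming listings carry make, model, year, mileage, and country, is to treat near-identical listings in the same country as duplicates:

    def fuzzy_deduplicate(self, listings):
        # Minimal sketch: bucket by make/model/year/country, then treat
        # listings whose mileage differs by under 1% as the same vehicle
        seen = []
        unique = []
        for listing in listings:
            key = (listing.get("make"), listing.get("model"),
                   listing.get("year"), listing.get("country"))
            mileage = listing.get("mileage") or 0
            is_dup = any(
                k == key and abs(m - mileage) <= max(m, 1) * 0.01
                for k, m in seen
            )
            if not is_dup:
                seen.append((key, mileage))
                unique.append(listing)
        return unique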
Storage Layer
Database Schema
CREATE TABLE vehicles (
    vehicle_id SERIAL PRIMARY KEY,
    normalized_make VARCHAR(100) NOT NULL,
    normalized_model VARCHAR(200) NOT NULL,
    year INTEGER NOT NULL,
    variant VARCHAR(200),
    body_type VARCHAR(50),
    fuel_type VARCHAR(30),
    transmission VARCHAR(20),
    engine_cc INTEGER,
    UNIQUE (normalized_make, normalized_model, year, variant)
);

CREATE TABLE listings (
    listing_id SERIAL PRIMARY KEY,
    vehicle_id INTEGER REFERENCES vehicles(vehicle_id),
    source_platform VARCHAR(50) NOT NULL,
    source_listing_id VARCHAR(200),
    country VARCHAR(5) NOT NULL,
    region VARCHAR(100),
    price_local DECIMAL(15, 2),
    currency VARCHAR(5),
    price_usd DECIMAL(15, 2),
    mileage_km INTEGER,
    color VARCHAR(50),
    seller_type VARCHAR(20),
    listing_url VARCHAR(500),
    image_urls TEXT[],
    vin VARCHAR(17),
    first_seen TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    last_seen TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    last_price_change TIMESTAMP,
    is_active BOOLEAN DEFAULT true,
    raw_data JSONB,
    UNIQUE (source_platform, source_listing_id)
);

CREATE TABLE price_snapshots (
    id SERIAL PRIMARY KEY,
    listing_id INTEGER REFERENCES listings(listing_id),
    price_local DECIMAL(15, 2),
    price_usd DECIMAL(15, 2),
    recorded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Indexes for API queries
CREATE INDEX idx_listings_vehicle ON listings(vehicle_id, is_active);
CREATE INDEX idx_listings_country_price ON listings(country, price_usd) WHERE is_active;
CREATE INDEX idx_vehicles_make_model ON vehicles(normalized_make, normalized_model, year);
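These indexes back the API's hot path. A representative comparison query against the schema above, cheapest active listings first, looks like:

-- Active listings for one vehicle, cheapest first, across all countries
SELECT l.country, l.source_platform, l.price_usd, l.mileage_km, l.listing_url
FROM listings l
JOIN vehicles v ON v.vehicle_id = l.vehicle_id
WHERE v.normalized_make = 'Toyota'
  AND v.normalized_model = 'Vios'
  AND v.year = 2021
  AND l.is_active
ORDER BY l.price_usd ASC
LIMIT 50;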
Caching Strategy
Cache frequently requested price comparisons:
import json

class PriceCache:
    def __init__(self, redis_client, default_ttl=3600):
        self.redis = redis_client
        self.default_ttl = default_ttl

    def get_price_comparison(self, make, model, year, country=None):
        cache_key = f"price:{make}:{model}:{year}:{country or 'all'}"
        cached = self.redis.get(cache_key)
        if cached:
            return json.loads(cached)
        return None

    def set_price_comparison(self, make, model, year, country, data):
        cache_key = f"price:{make}:{model}:{year}:{country or 'all'}"
        self.redis.setex(cache_key, self.default_ttl, json.dumps(data))
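The collection orchestrator shown later calls cache.flush_stale_entries(). One simple interpretation, added as a method on PriceCache and assuming all comparison keys share the price: prefix, is to drop them all after each collection cycle:

    def flush_stale_entries(self, pattern="price:*"):
        # Blunt but safe: delete every cached comparison after a collection
        # cycle; SCAN avoids blocking Redis the way KEYS would
        for key in self.redis.scan_iter(match=pattern):
            self.redis.delete(key)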
API Layer
REST API Design
import statistics
from typing import Optional

from fastapi import FastAPI, Query

app = FastAPI(title="Car Price Comparison API")

@app.get("/api/v1/prices")
async def get_price_comparison(
    make: str = Query(..., description="Vehicle make"),
    model: str = Query(..., description="Vehicle model"),
    year: Optional[int] = Query(None, description="Model year"),
    country: Optional[str] = Query(None, description="Country code (SG, MY, TH, ID)"),
    mileage_max: Optional[int] = Query(None, description="Maximum mileage in km"),
):
    # Check cache first (`cache` and `db` are module-level singletons
    # wired up at application startup)
    cached = cache.get_price_comparison(make, model, year, country)
    if cached:
        return cached
    # Query database
    query = build_price_query(make, model, year, country, mileage_max)
    listings = db.execute(query)
    # Calculate statistics
    result = {
        "query": {
            "make": make,
            "model": model,
            "year": year,
            "country": country,
        },
        "summary": calculate_price_summary(listings),
        "by_country": group_by_country(listings) if not country else None,
        "by_source": group_by_source(listings),
        "listings": [format_listing(l) for l in listings[:50]],
        "total_listings": len(listings),
        "data_freshness": get_data_freshness(),
    }
    # Cache result
    cache.set_price_comparison(make, model, year, country, result)
    return result

def calculate_price_summary(listings):
    if not listings:
        return None
    prices = [l["price_usd"] for l in listings if l.get("price_usd")]
    if not prices:
        return None
    return {
        "average_price_usd": round(statistics.mean(prices), 2),
        "median_price_usd": round(statistics.median(prices), 2),
        "min_price_usd": round(min(prices), 2),
        "max_price_usd": round(max(prices), 2),
        "price_std_dev": round(statistics.stdev(prices), 2) if len(prices) > 1 else 0,
        "sample_size": len(prices),
    }
@app.get("/api/v1/prices/history")
async def get_price_history(
make: str,
model: str,
year: int,
country: str,
days: int = Query(90, le=365),
):
history = db.get_price_history(make, model, year, country, days)
return {
"query": {"make": make, "model": model, "year": year, "country": country},
"history": [
{
"date": entry["date"].isoformat(),
"avg_price_usd": entry["avg_price"],
"listing_count": entry["count"],
}
for entry in history
],
"trend": calculate_trend(history),
}
@app.get("/api/v1/market/overview")
async def get_market_overview(country: str):
return {
"country": country,
"total_active_listings": db.count_active_listings(country),
"top_makes": db.get_top_makes(country, limit=10),
"price_segments": db.get_price_segments(country),
"avg_days_on_market": db.get_avg_dom(country),
"new_listings_24h": db.count_new_listings(country, hours=24),
}API Authentication and Rate Limiting
from fastapi import Depends, HTTPException
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials

security = HTTPBearer()

async def verify_api_key(credentials: HTTPAuthorizationCredentials = Depends(security)):
    api_key = credentials.credentials
    user = db.get_user_by_api_key(api_key)
    if not user:
        raise HTTPException(status_code=401, detail="Invalid API key")
    # Check rate limit
    if not rate_limiter.check(api_key, user["plan_limit"]):
        raise HTTPException(status_code=429, detail="Rate limit exceeded")
    return user
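rate_limiter.check() above is left undefined; a minimal fixed-window implementation on Redis, as one possible interpretation:

class RateLimiter:
    def __init__(self, redis_client, window_seconds=86400):
        self.redis = redis_client
        self.window = window_seconds

    def check(self, api_key, plan_limit):
        # Fixed-window counter: one Redis key per API key per window
        key = f"ratelimit:{api_key}"
        count = self.redis.incr(key)
        if count == 1:
            self.redis.expire(key, self.window)
        return count <= plan_limit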
Data Collection Scheduling
Orchestrating Regular Updates
class CollectionOrchestrator:
    def __init__(self, proxy_manager):
        self.proxy_manager = proxy_manager
        self.scrapers = self.initialize_scrapers()

    def initialize_scrapers(self):
        # One scraper instance per source; extend as sources are added
        return {
            "sgcarmart": SGCarMartScraper(self.proxy_manager, "sg"),
        }

    def run_collection_cycle(self):
        results = {"success": 0, "failed": 0, "total_listings": 0}
        for source_name, scraper in self.scrapers.items():
            try:
                listings = scraper.collect_all()
                processed = self.process_and_store(listings, source_name)
                results["success"] += 1
                results["total_listings"] += len(processed)
            except Exception as e:
                logger.error(f"Collection failed for {source_name}: {e}")
                results["failed"] += 1
        # Invalidate relevant caches
        cache.flush_stale_entries()
        return results

    def process_and_store(self, listings, source):
        normalizer = PriceNormalizer(exchange_rate_provider)
        matcher = VehicleMatcher()
        deduplicator = ListingDeduplicator()
        # Normalize prices
        for listing in listings:
            listing["normalized_price"] = normalizer.normalize(
                listing.get("price", ""), listing.get("country", "")
            )
        # Match to canonical vehicles
        for listing in listings:
            listing["vehicle_match"] = matcher.normalize_vehicle(listing)
        # Deduplicate
        unique_listings = deduplicator.deduplicate(listings)
        # Store
        db.upsert_listings(unique_listings)
        return unique_listings
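Nothing above dictates cadence. A simple approach is a long-running loop (or a cron entry) that spaces full cycles a few hours apart; a minimal sketch, with the interval as an assumption to tune per source:

import time

def run_forever(orchestrator, interval_hours=6):
    while True:
        results = orchestrator.run_collection_cycle()
        logger.info(f"Cycle complete: {results}")
        time.sleep(interval_hours * 3600)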
Monetization Strategies
API Plans
Structure your API plans based on usage; a configuration sketch follows the list:
- Free tier: 100 requests/day, basic price data, limited to 1 country
- Starter: 5,000 requests/day, full price data, all countries, $99/month
- Professional: 50,000 requests/day, price history, analytics, webhooks, $499/month
- Enterprise: Unlimited requests, raw data access, custom integrations, custom pricing
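Encoded as configuration, the tiers above might look like this (the structure is illustrative; the numbers come from the list, with None standing in for unlimited):

PLAN_LIMITS = {
    "free":         {"requests_per_day": 100,   "countries": 1,     "history": False},
    "starter":      {"requests_per_day": 5000,  "countries": "all", "history": False},
    "professional": {"requests_per_day": 50000, "countries": "all", "history": True},
    "enterprise":   {"requests_per_day": None,  "countries": "all", "history": True},
}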
Value-Added Features
- Price alerts: Notify users when a vehicle drops below a target price (a sketch follows this list)
- Market reports: Weekly/monthly market analysis by segment
- Valuation API: Instant vehicle valuation based on market data
- Dealer analytics: Dashboard for dealer customers showing competitive position
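Price alerts fall out of the price_snapshots table naturally. A minimal check, assuming a hypothetical alerts store with target prices and a notify() helper:

def check_price_alerts(db, notify):
    # Compare each active alert's target against the latest snapshot
    for alert in db.get_active_alerts():
        latest = db.get_latest_price(alert["listing_id"])
        if latest is not None and latest <= alert["target_price_usd"]:
            notify(alert["user_id"],
                   f"Listing {alert['listing_id']} dropped to ${latest:,.2f}")
            db.mark_alert_triggered(alert["alert_id"])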
Conclusion
Building a car price comparison API requires a solid foundation of data collection infrastructure powered by reliable proxies. The quality of your API depends entirely on the breadth, freshness, and accuracy of your underlying data.
DataResearchTools mobile proxies provide the infrastructure needed to collect pricing data reliably across all major Southeast Asian automotive platforms. With carrier-grade IPs in every target market, geographic precision for accessing country-specific pricing, and the bandwidth to support continuous data collection cycles, DataResearchTools ensures your price comparison API always serves current, accurate market data.
The combination of comprehensive data collection, intelligent processing, and a well-designed API creates a product that serves the entire automotive ecosystem, from individual consumers seeking the best deal to enterprise clients building their own automotive intelligence platforms.
Related Reading
- Automotive Inventory Tracking Across Multiple Dealer Websites
- Automotive Review Aggregation Using Proxy Networks
- aiohttp + BeautifulSoup: Async Python Scraping
- How to Scrape AliExpress Product Data Without Getting Blocked
- Amazon Buy Box Monitoring: Proxy Setup for Continuous Tracking
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)