How to Scrape Vehicle VIN History and Specifications Data
Every vehicle manufactured for sale has a unique Vehicle Identification Number (VIN), a 17-character code that identifies that specific car. The VIN itself encodes manufacturing details and core specifications, and it is the key under which ownership history, accident records, service history, and recall information are filed. For automotive businesses, accessing this data at scale is essential for vehicle valuation, fraud detection, risk assessment, and market analysis.
This guide covers how to extract VIN-related data from various sources using proxy infrastructure, including VIN decoding, history report extraction, and specification database building.
Understanding VIN Structure
A standard 17-character VIN encodes specific information in each position:
- Positions 1-3 (WMI): World Manufacturer Identifier – identifies the manufacturer and country of origin
- Positions 4-8 (VDS): Vehicle Descriptor Section – describes the vehicle type, body style, engine, and transmission
- Position 9: Check digit – mathematical validation of the VIN
- Position 10: Model year
- Position 11: Assembly plant
- Positions 12-17: Sequential production number
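The position layout above can be sketched as a small splitter. This is a minimal illustration rather than a full decoder: `split_vin` and `model_year_candidates` are hypothetical helper names, and the model-year table assumes the standard 30-year cycle of year codes (A stands for 1980 and again for 2010), so a single code maps to more than one possible year.

```python
# Hypothetical helpers illustrating the position layout; not a full decoder.

# Model-year codes in cycle order; I, O, Q, U, Z and 0 are not used.
YEAR_CODES = "ABCDEFGHJKLMNPRSTVWXY123456789"


def split_vin(vin):
    """Split a 17-character VIN into the sections listed above."""
    vin = vin.strip().upper()
    if len(vin) != 17:
        raise ValueError("a standard VIN has exactly 17 characters")
    return {
        "wmi": vin[0:3],             # positions 1-3
        "vds": vin[3:8],             # positions 4-8
        "check_digit": vin[8],       # position 9
        "model_year_code": vin[9],   # position 10
        "plant_code": vin[10],       # position 11
        "serial": vin[11:17],        # positions 12-17
    }


def model_year_candidates(code, first_cycle_start=1980, cycles=2):
    """Return the years a model-year code can stand for.

    The code sequence repeats every 30 years, so the result is a list.
    """
    if len(code) != 1:
        return []
    index = YEAR_CODES.find(code.upper())
    if index == -1:
        return []
    return [first_cycle_start + index + 30 * n for n in range(cycles)]
```

For the sample VIN `1M8GDM9AXKP042788`, the splitter yields the WMI `1M8` and the model-year code `K`, which maps to 1989 or 2019. Resolving that ambiguity needs other context, such as the decoder's response.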
For Southeast Asian markets, VIN structures may vary slightly depending on whether the vehicle was manufactured locally or imported.
Data Sources for VIN Information
Public VIN Decoders
Several services decode VINs into readable specifications:
- NHTSA VIN Decoder API: Free, official US government API that decodes North American VINs
- Manufacturer databases: Some manufacturers provide VIN lookup tools on their websites
- Third-party decoders: Commercial services that provide comprehensive VIN decoding
Vehicle History Services
These services aggregate vehicle history data:
- CARFAX: Accident history, service records, ownership changes (North America)
- AutoCheck: Similar to CARFAX with different data partnerships
- SGCarMart VIN Check: Singapore-specific vehicle history
- PUSPAKOM (Malaysia): Malaysian vehicle inspection records
Government Databases
Public registration data varies by country:
- LTA (Singapore): Vehicle registration and deregistration records
- JPJ (Malaysia): Road tax and registration information
- Department of Land Transport (Thailand): Vehicle registration data
Setting Up VIN Data Collection
NHTSA API Integration
The NHTSA VIN decoder is free and does not require proxies for moderate use, but high-volume decoding benefits from proxy rotation to avoid rate limits:
import requests


class NHTSADecoder:
    """Thin client for the NHTSA vPIC VIN decoding API."""

    def __init__(self, proxy_manager=None):
        self.base_url = "https://vpic.nhtsa.dot.gov/api/vehicles"
        self.proxy_manager = proxy_manager

    def _proxies(self):
        # proxy_manager.get_proxy is expected to return a requests-style
        # proxies dict, e.g. {"http": "...", "https": "..."}
        if self.proxy_manager:
            return self.proxy_manager.get_proxy("US")
        return {}

    def decode_vin(self, vin):
        url = f"{self.base_url}/DecodeVin/{vin}?format=json"
        response = requests.get(url, proxies=self._proxies(), timeout=30)
        response.raise_for_status()
        return self.parse_nhtsa_response(response.json())

    def decode_batch(self, vin_list):
        """Decode up to 50 VINs in a single request"""
        if len(vin_list) > 50:
            raise ValueError("decode_batch accepts at most 50 VINs per call")
        url = f"{self.base_url}/DecodeVINValuesBatch/"
        response = requests.post(
            url,
            data={"format": "json", "data": ";".join(vin_list)},
            proxies=self._proxies(),
            timeout=60,
        )
        response.raise_for_status()
        return response.json()

    def parse_nhtsa_response(self, data):
        # DecodeVin returns a list of {"Variable": ..., "Value": ...} entries;
        # keep only the variables that carry a non-empty value
        decoded = {}
        for item in data.get("Results", []):
            value = item.get("Value")
            if value and value.strip():
                decoded[item.get("Variable")] = value.strip()
        return {
            "make": decoded.get("Make"),
            "model": decoded.get("Model"),
            "year": decoded.get("Model Year"),
            "body_class": decoded.get("Body Class"),
            "drive_type": decoded.get("Drive Type"),
            "engine_cylinders": decoded.get("Engine Number of Cylinders"),
            "engine_displacement": decoded.get("Displacement (L)"),
            "fuel_type": decoded.get("Fuel Type - Primary"),
            "transmission": decoded.get("Transmission Style"),
            "plant_country": decoded.get("Plant Country"),
            "plant_city": decoded.get("Plant City"),
            "gvwr": decoded.get("Gross Vehicle Weight Rating From"),
            "doors": decoded.get("Doors"),
            "seats": decoded.get("Number of Seats"),
        }
Scraping Vehicle History Services
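Commercial vehicle history services use aggressive anti-bot measures, so requests to them generally need to go through proxies. The free NHTSA recall and complaint endpoints are a useful starting point before paying for full reports: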
import requests


class VehicleHistoryScraper:
    def __init__(self, proxy_manager):
        self.proxy_manager = proxy_manager

    def scrape_history_preview(self, vin, service="generic"):
        """Scrape free preview information from vehicle history services"""
        proxy = self.proxy_manager.get_proxy("US")
        session = requests.Session()
        session.proxies.update(proxy)
        session.headers.update({
            # get_random_ua() is assumed to be defined elsewhere and to
            # return a User-Agent string
            "User-Agent": get_random_ua(),
            "Accept-Language": "en-US,en;q=0.9",
        })
        # Many services offer free preview data
        return {
            "recalls": self.check_recalls(vin, session),
            "complaints": self.check_complaints(vin, session),
        }

    def _query_by_vehicle(self, url, vin, session):
        # The public NHTSA recall and complaint endpoints are queried by
        # make, model, and model year, so decode the VIN first. Results
        # therefore cover the model line, not this individual vehicle.
        decoded = NHTSADecoder(self.proxy_manager).decode_vin(vin)
        params = {
            "make": decoded.get("make"),
            "model": decoded.get("model"),
            "modelYear": decoded.get("year"),
        }
        response = session.get(url, params=params, timeout=30)
        if response.status_code == 200:
            return response.json().get("results", [])
        return []

    def check_recalls(self, vin, session):
        """Check NHTSA recalls for the vehicle a VIN decodes to"""
        url = "https://api.nhtsa.gov/recalls/recallsByVehicle"
        return self._query_by_vehicle(url, vin, session)

    def check_complaints(self, vin, session):
        """Check NHTSA complaints database"""
        url = "https://api.nhtsa.gov/complaints/complaintsByVehicle"
        return self._query_by_vehicle(url, vin, session)
Scraping Singapore Vehicle Data
For Singapore-specific VIN and vehicle data:
import requests
from bs4 import BeautifulSoup


class SGVehicleScraper:
    def __init__(self, proxy_manager):
        self.proxy_manager = proxy_manager

    def scrape_sgcarmart_vehicle(self, listing_url):
        """Extract vehicle details from SGCarMart listing"""
        proxy = self.proxy_manager.get_proxy("SG")
        session = requests.Session()
        session.proxies.update(proxy)
        session.headers.update({
            # get_random_mobile_ua() is assumed to be defined elsewhere
            "User-Agent": get_random_mobile_ua(),
        })
        response = session.get(listing_url, timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")

        # Collect label/value pairs from the specification table rows
        details = {}
        for row in soup.select(".car-spec-table tr, .vehicle-info tr"):
            cells = row.select("td")
            if len(cells) >= 2:
                key = cells[0].get_text(strip=True).lower()
                details[key] = cells[1].get_text(strip=True)
        return {
            "registration_date": details.get("reg date"),
            "coe_expiry": details.get("coe expiry date"),
            "coe_category": details.get("coe category"),
            "arf": details.get("arf"),
            "omv": details.get("omv"),
            "engine_cc": details.get("engine capacity"),
            "power_kw": details.get("power"),
            "road_tax": details.get("road tax"),
            "mileage": details.get("mileage"),
            "owners": details.get("no. of owners"),
            "vehicle_type": details.get("vehicle type"),
        }
Building a VIN Specification Database
Data Collection Pipeline
import random
import time

import requests


class VINDatabaseBuilder:
    def __init__(self, proxy_manager, db):
        self.proxy_manager = proxy_manager
        self.db = db
        self.nhtsa = NHTSADecoder(proxy_manager)
        self.history = VehicleHistoryScraper(proxy_manager)

    def process_vin(self, vin):
        # Check if already decoded
        existing = self.db.get_vin_data(vin)
        if existing:
            return existing
        # Decode VIN
        specs = self.nhtsa.decode_vin(vin)
        # Enrich with additional data
        session = requests.Session()
        session.proxies.update(self.proxy_manager.get_proxy("US"))
        specs["recalls"] = self.history.check_recalls(vin, session)
        specs["vin"] = vin
        # Store in database
        self.db.save_vin_data(specs)
        return specs

    def process_batch(self, vin_list, batch_size=50):
        results = []
        for i in range(0, len(vin_list), batch_size):
            batch = vin_list[i:i + batch_size]
            # Filter out already decoded VINs
            new_vins = [v for v in batch if not self.db.get_vin_data(v)]
            if not new_vins:
                continue
            batch_results = self.nhtsa.decode_batch(new_vins)
            # Batch results are flat per-VIN dicts whose keys differ from the
            # parse_nhtsa_response output; normalize before saving if needed
            for result in batch_results.get("Results", []):
                self.db.save_vin_data(result)
                results.append(result)
            # Pause between batch calls to stay under rate limits
            time.sleep(random.uniform(1, 3))
        return results
Database Schema
CREATE TABLE vin_specifications (
    vin VARCHAR(17) PRIMARY KEY,
    make VARCHAR(100),
    model VARCHAR(200),
    year INTEGER,
    trim VARCHAR(200),
    body_class VARCHAR(100),
    drive_type VARCHAR(50),
    engine_cylinders INTEGER,
    engine_displacement_l DECIMAL(3, 1),
    fuel_type VARCHAR(50),
    transmission VARCHAR(50),
    plant_country VARCHAR(100),
    plant_city VARCHAR(100),
    doors INTEGER,
    seats INTEGER,
    gvwr_kg DECIMAL(8, 2),
    decoded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE vin_recalls (
    id SERIAL PRIMARY KEY,
    vin VARCHAR(17) REFERENCES vin_specifications(vin),
    recall_number VARCHAR(50),
    recall_date DATE,
    component VARCHAR(200),
    summary TEXT,
    consequence TEXT,
    remedy TEXT,
    fetched_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE vin_history_events (
    id SERIAL PRIMARY KEY,
    vin VARCHAR(17) REFERENCES vin_specifications(vin),
    event_date DATE,
    event_type VARCHAR(50),
    description TEXT,
    location VARCHAR(200),
    source VARCHAR(100),
    fetched_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
VIN Data Applications
Vehicle Valuation
Use VIN-decoded specifications to support accurate vehicle valuations:
import statistics

import numpy as np


class VINBasedValuation:
    def __init__(self, db):
        # db is the same VIN store used by VINDatabaseBuilder
        self.db = db

    def estimate_value(self, vin, market_data):
        specs = self.db.get_vin_data(vin)
        if not specs:
            return None
        # Find comparable vehicles in market data
        comparables = self.find_comparables(specs, market_data)
        if len(comparables) < 3:
            return {"confidence": "low", "message": "Insufficient comparable data"}
        prices = [c["price"] for c in comparables]
        return {
            "estimated_value": statistics.median(prices),
            # Interquartile range of comparable prices
            "range_low": float(np.percentile(prices, 25)),
            "range_high": float(np.percentile(prices, 75)),
            "comparable_count": len(comparables),
            "confidence": "high" if len(comparables) >= 10 else "medium",
        }

    def find_comparables(self, specs, market_data):
        # Same make and model, within one model year, same transmission
        year = int(specs["year"])
        return [v for v in market_data
                if v["make"] == specs["make"]
                and v["model"] == specs["model"]
                and abs(int(v["year"]) - year) <= 1
                and v.get("transmission") == specs.get("transmission")]
Fraud Detection
VIN data helps detect common types of automotive fraud:
class VINFraudDetector:
    def __init__(self, db):
        # db must provide find_active_listings_with_vin(vin)
        self.db = db

    def check_for_fraud(self, listing, vin_data):
        flags = []
        # Check VIN validity
        if not self.validate_vin_checksum(listing["vin"]):
            flags.append({"type": "INVALID_VIN", "severity": "high"})
        # Check if listing specs match VIN data
        if vin_data:
            if listing.get("year") and str(listing["year"]) != str(vin_data.get("year")):
                flags.append({
                    "type": "YEAR_MISMATCH",
                    "severity": "high",
                    "detail": f"Listed as {listing['year']}, VIN indicates {vin_data.get('year')}",
                })
            # vin_data["make"] may be None when the decoder returned no value
            decoded_make = (vin_data.get("make") or "").lower()
            if listing.get("make") and listing["make"].lower() != decoded_make:
                flags.append({
                    "type": "MAKE_MISMATCH",
                    "severity": "high",
                })
        # Check for cloned VINs (same VIN in multiple active listings)
        duplicates = self.db.find_active_listings_with_vin(listing["vin"])
        if len(duplicates) > 1:
            flags.append({
                "type": "DUPLICATE_VIN",
                "severity": "medium",
                "detail": f"VIN found in {len(duplicates)} active listings",
            })
        return flags

    def validate_vin_checksum(self, vin):
        # The check digit is mandatory for North American VINs; VINs from
        # other markets may not use it, so treat a failure there as a
        # signal rather than proof of an invalid VIN.
        vin = vin.upper()
        if len(vin) != 17:
            return False
        transliteration = {
            'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7,
            'H': 8, 'J': 1, 'K': 2, 'L': 3, 'M': 4, 'N': 5, 'P': 7,
            'R': 9, 'S': 2, 'T': 3, 'U': 4, 'V': 5, 'W': 6, 'X': 7,
            'Y': 8, 'Z': 9
        }
        weights = [8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2]
        total = 0
        for char, weight in zip(vin, weights):
            if char.isdigit():
                value = int(char)
            elif char in transliteration:
                value = transliteration[char]
            else:
                # I, O, Q and non-alphanumerics never appear in a valid VIN
                return False
            total += value * weight
        # Remainder 10 is written as 'X' in position 9
        check = total % 11
        check_char = 'X' if check == 10 else str(check)
        return vin[8] == check_char
Market Research
Aggregate VIN data to understand market composition:
import statistics
from collections import Counter


def analyze_market_by_vin(vin_list, proxy_manager):
    """Analyze a set of VINs to understand market composition"""
    decoder = NHTSADecoder(proxy_manager)
    makes = Counter()
    countries_of_origin = Counter()
    fuel_types = Counter()
    body_types = Counter()
    engine_sizes = []
    decoded_count = 0
    for vin in vin_list:
        data = decoder.decode_vin(vin)
        if not data:
            continue
        decoded_count += 1
        if data.get("make"):
            makes[data["make"]] += 1
        if data.get("plant_country"):
            countries_of_origin[data["plant_country"]] += 1
        if data.get("fuel_type"):
            fuel_types[data["fuel_type"]] += 1
        if data.get("body_class"):
            body_types[data["body_class"]] += 1
        if data.get("engine_displacement"):
            try:
                engine_sizes.append(float(data["engine_displacement"]))
            except ValueError:
                pass  # skip non-numeric displacement values
    return {
        "top_makes": makes.most_common(10),
        "origin_countries": dict(countries_of_origin),
        "fuel_type_distribution": dict(fuel_types),
        "body_type_distribution": dict(body_types),
        "avg_engine_size": statistics.mean(engine_sizes) if engine_sizes else None,
        # Count only the VINs that actually returned decoded data
        "total_decoded": decoded_count,
    }
Proxy Best Practices for VIN Data Collection
Rate Management
VIN databases and history services have strict rate limits. Best practices:
- Batch NHTSA requests (up to 50 VINs per batch call)
- Space requests 2-5 seconds apart for other services
- Use DataResearchTools mobile proxies for services that detect and block datacenter IPs
- Implement exponential backoff on rate limit responses
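The batching and backoff practices above could be combined roughly as follows. This is a sketch under assumptions: `chunked`, `backoff_delay`, and `call_with_backoff` are illustrative names, HTTP 429 and 503 are a common but not universal rate-limit signal, and `fetch` stands for any callable that performs one request and returns an object with a `status_code` attribute.

```python
import random
import time


def chunked(items, size=50):
    """Yield consecutive slices of at most `size` items (50 per batch call)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def backoff_delay(attempt, base=2.0, cap=60.0):
    """Exponential delay in seconds for a zero-based attempt, capped at `cap`."""
    return min(cap, base * (2 ** attempt))


def call_with_backoff(fetch, max_attempts=5, retry_statuses=(429, 503),
                      sleep=time.sleep):
    """Call `fetch()` until it returns a response that is not rate-limited.

    `fetch` is a hypothetical zero-argument callable returning an object
    with a `status_code` attribute (for example a requests.Response).
    """
    response = None
    for attempt in range(max_attempts):
        response = fetch()
        if response.status_code not in retry_statuses:
            return response
        if attempt < max_attempts - 1:
            # Jitter keeps parallel workers from retrying in lockstep.
            sleep(backoff_delay(attempt) + random.uniform(0, 1))
    # Still rate-limited after max_attempts; the caller decides what to do.
    return response
```

With the defaults, the waits grow as roughly 2, 4, 8, and 16 seconds plus jitter. Passing `sleep` as a parameter makes the retry logic testable without real delays.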
Geographic Considerations
- Use US proxies for NHTSA and North American VIN services
- Use Singapore proxies for LTA and SGCarMart
- Use Malaysian proxies for JPJ and Malaysian dealer sites
- DataResearchTools provides the geographic coverage needed for multi-region VIN data collection
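One way to encode the region pairing above is a lookup table keyed by data source. The source labels and the `proxy_region_for` helper are made up for illustration; the country codes follow the `get_proxy("US")` and `get_proxy("SG")` calls used earlier, and `"MY"` is assumed to be the equivalent code for Malaysia.

```python
# Illustrative mapping from a data-source label to the proxy country to use.
PROXY_REGION_BY_SOURCE = {
    "nhtsa": "US",
    "carfax": "US",
    "autocheck": "US",
    "lta": "SG",
    "sgcarmart": "SG",
    "jpj": "MY",
    "puspakom": "MY",
}


def proxy_region_for(source, default=None):
    """Return the proxy country code for a source label, case-insensitively."""
    region = PROXY_REGION_BY_SOURCE.get(source.strip().lower(), default)
    if region is None:
        raise KeyError(f"no proxy region configured for source {source!r}")
    return region
```

A scraper could then call `proxy_manager.get_proxy(proxy_region_for("sgcarmart"))` instead of hard-coding the country at each call site.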
Session Management
For services that require login or session continuity:
- Use sticky sessions from DataResearchTools to maintain a consistent IP throughout a session
- Rotate sessions between different VIN lookups
- Clear cookies between sessions to avoid tracking
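The session practices above might look like the following sketch. It assumes a `requests.Session`-like object (anything with `proxies`, `cookies`, and `close`) and a requests-style proxies dict that points at one sticky endpoint; `lookup_session` and `run_lookups` are hypothetical names, and how a sticky endpoint is obtained depends on the proxy provider.

```python
from contextlib import contextmanager


@contextmanager
def lookup_session(session_factory, proxy):
    """Yield a fresh session pinned to one proxy, then discard its state.

    `session_factory` is any zero-argument callable returning a
    requests.Session-like object; `proxy` is a requests-style proxies dict.
    """
    session = session_factory()
    session.proxies.update(proxy)  # same exit IP for every request in the block
    try:
        yield session
    finally:
        session.cookies.clear()    # do not carry cookies into the next lookup
        session.close()


def run_lookups(vins, session_factory, next_proxy, lookup):
    """Run `lookup(session, vin)` with a new session and proxy per VIN."""
    results = {}
    for vin in vins:
        with lookup_session(session_factory, next_proxy()) as session:
            results[vin] = lookup(session, vin)
    return results
```

In practice `session_factory` would be `requests.Session` and `next_proxy` a callable that returns the next sticky proxy from the proxy manager. Each VIN lookup then keeps one IP for its duration and starts the next lookup with a clean cookie jar.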
Conclusion
VIN data is the backbone of automotive intelligence. From basic specification decoding to comprehensive vehicle history extraction, the ability to collect and analyze VIN data at scale powers valuations, fraud detection, and market research across the automotive industry.
DataResearchTools provides the proxy infrastructure needed to access VIN data sources reliably. With mobile proxies covering both North American and Southeast Asian markets, geographic targeting for region-specific databases, and the throughput to handle high-volume VIN processing, DataResearchTools supports the full spectrum of VIN data collection needs. Whether you are building a vehicle history tool, validating dealer inventory, or analyzing market composition, reliable proxy access to VIN databases is a fundamental requirement.
Related Reading
- Automotive Inventory Tracking Across Multiple Dealer Websites
- Automotive Review Aggregation Using Proxy Networks
- aiohttp + BeautifulSoup: Async Python Scraping
- How to Scrape AliExpress Product Data Without Getting Blocked
- Amazon Buy Box Monitoring: Proxy Setup for Continuous Tracking
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)