How to Scrape Vehicle VIN History and Specifications Data

Every vehicle manufactured for sale has a unique Vehicle Identification Number, a 17-character code that unlocks a wealth of information about that specific car. VIN data includes manufacturing details, specifications, ownership history, accident records, service history, and recall information. For automotive businesses, accessing this data at scale is essential for vehicle valuation, fraud detection, risk assessment, and market analysis.

This guide covers how to extract VIN-related data from various sources using proxy infrastructure, including VIN decoding, history report extraction, and specification database building.

Understanding VIN Structure

A standard 17-character VIN encodes specific information in each position:

  • Positions 1-3 (WMI): World Manufacturer Identifier – identifies the manufacturer and country of origin
  • Positions 4-8 (VDS): Vehicle Descriptor Section – describes the vehicle type, body style, engine, and transmission
  • Position 9: Check digit – mathematical validation of the VIN
  • Position 10: Model year
  • Position 11: Assembly plant
  • Positions 12-17: Sequential production number

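For Southeast Asian markets, VIN structures may vary depending on whether the vehicle was manufactured locally or imported. In particular, the position 9 check digit is mandatory for vehicles sold in North America, but manufacturers building for other regions do not always compute it, so a failed checksum on a non-North American VIN is not by itself proof that the VIN is invalid.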
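
The position ranges listed above map directly onto string slices. The following is a minimal sketch of that mapping; the `VINParts` and `split_vin` names are hypothetical, and the VIN passed in is a sample value used purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VINParts:
    wmi: str          # positions 1-3: World Manufacturer Identifier
    vds: str          # positions 4-8: Vehicle Descriptor Section
    check_digit: str  # position 9
    model_year: str   # position 10 (a code, not the year itself)
    plant: str        # position 11
    serial: str       # positions 12-17


def split_vin(vin: str) -> VINParts:
    """Split a 17-character VIN into the sections described in the list above."""
    vin = vin.strip().upper()
    if len(vin) != 17:
        raise ValueError(f"expected 17 characters, got {len(vin)}")
    return VINParts(
        wmi=vin[0:3],
        vds=vin[3:8],
        check_digit=vin[8],
        model_year=vin[9],
        plant=vin[10],
        serial=vin[11:17],
    )


parts = split_vin("1HGCM82633A004352")
print(parts)
```

Splitting only separates the fields. Turning the model year code or plant code into a readable value still requires a decoder such as the services described in the next section.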

Data Sources for VIN Information

Public VIN Decoders

Several services decode VINs into readable specifications:

  • NHTSA VIN Decoder API: Free, official US government API that decodes North American VINs
  • Manufacturer databases: Some manufacturers provide VIN lookup tools on their websites
  • Third-party decoders: Commercial services that provide comprehensive VIN decoding

Vehicle History Services

These services aggregate vehicle history data:

  • CARFAX: Accident history, service records, ownership changes (North America)
  • AutoCheck: Similar to CARFAX with different data partnerships
  • SGCarMart VIN Check: Singapore-specific vehicle history
  • PUSPAKOM (Malaysia): Malaysian vehicle inspection records

Government Databases

Public registration data varies by country:

  • LTA (Singapore): Vehicle registration and deregistration records
  • JPJ (Malaysia): Road tax and registration information
  • Department of Land Transport (Thailand): Vehicle registration data

Setting Up VIN Data Collection

NHTSA API Integration

The NHTSA VIN decoder is free and does not require proxies for moderate use, but high-volume decoding benefits from proxy rotation to avoid rate limits:

import requests

class NHTSADecoder:
    def __init__(self, proxy_manager=None):
        self.base_url = "https://vpic.nhtsa.dot.gov/api/vehicles"
        self.proxy_manager = proxy_manager

    def decode_vin(self, vin):
        url = f"{self.base_url}/DecodeVin/{vin}?format=json"

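        # proxy_manager is assumed to return a requests-style proxies dict,
        # e.g. {"http": "...", "https": "..."}; an empty dict means a direct connection
        proxies = {}
        if self.proxy_manager:
            proxies = self.proxy_manager.get_proxy("US")

        response = requests.get(url, proxies=proxies, timeout=30)
        response.raise_for_status()  # surface HTTP errors instead of parsing an error page
        data = response.json()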

        return self.parse_nhtsa_response(data)

    def decode_batch(self, vin_list):
        """Decode up to 50 VINs in a single request"""
        vin_string = ";".join(vin_list)
        url = f"{self.base_url}/DecodeVINValuesBatch/"

        proxies = {}
        if self.proxy_manager:
            proxies = self.proxy_manager.get_proxy("US")

        response = requests.post(
            url,
            data={"format": "json", "data": vin_string},
            proxies=proxies,
            timeout=60
        )

        return response.json()

    def parse_nhtsa_response(self, data):
        results = data.get("Results", [])
        decoded = {}

        for item in results:
            variable = item.get("Variable")
            value = item.get("Value")
            if value and value.strip():
                decoded[variable] = value.strip()

        return {
            "make": decoded.get("Make"),
            "model": decoded.get("Model"),
            "year": decoded.get("Model Year"),
            "body_class": decoded.get("Body Class"),
            "drive_type": decoded.get("Drive Type"),
            "engine_cylinders": decoded.get("Engine Number of Cylinders"),
            "engine_displacement": decoded.get("Displacement (L)"),
            "fuel_type": decoded.get("Fuel Type - Primary"),
            "transmission": decoded.get("Transmission Style"),
            "plant_country": decoded.get("Plant Country"),
            "plant_city": decoded.get("Plant City"),
            "gvwr": decoded.get("Gross Vehicle Weight Rating From"),
            "doors": decoded.get("Doors"),
            "seats": decoded.get("Number of Seats"),
        }
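
The decoder above relies on a `proxy_manager` object that the article does not define. As a rough sketch of the interface it appears to expect, the hypothetical `ProxyManager` below hands out proxies round-robin per country and returns them in the dict shape that `requests` accepts for its `proxies` argument. The class name, constructor, and example hostnames are all assumptions.

```python
import itertools


class ProxyManager:
    """Hypothetical round-robin proxy pool keyed by country code."""

    def __init__(self, pools):
        # pools maps a country code to a list of proxy URLs,
        # e.g. {"US": ["http://us1.example.com:8000"]}
        self._cycles = {
            country: itertools.cycle(urls)
            for country, urls in pools.items()
            if urls
        }

    def get_proxy(self, country):
        """Return a requests-style proxies dict for the next proxy in the pool."""
        if country not in self._cycles:
            raise KeyError(f"no proxies configured for {country!r}")
        url = next(self._cycles[country])
        return {"http": url, "https": url}


manager = ProxyManager({
    "US": ["http://us1.example.com:8000", "http://us2.example.com:8000"],
    "SG": ["http://sg1.example.com:8000"],
})
first = manager.get_proxy("US")
second = manager.get_proxy("US")
third = manager.get_proxy("US")  # wraps around to the first proxy again
```

An instance like this could then be passed as `NHTSADecoder(manager)`. A production pool would also need health checks and credentials, which are left out here.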

Scraping Vehicle History Services

Vehicle history services require proxies due to their aggressive anti-bot measures:

class VehicleHistoryScraper:
    def __init__(self, proxy_manager):
        self.proxy_manager = proxy_manager

    def scrape_history_preview(self, vin, service="generic"):
        """Scrape free preview information from vehicle history services"""
        proxy = self.proxy_manager.get_proxy("US")

        session = requests.Session()
        session.proxies.update(proxy)
        session.headers.update({
            # get_random_ua() is assumed to be a helper that returns a User-Agent string
            "User-Agent": get_random_ua(),
            "Accept-Language": "en-US,en;q=0.9"
        })

        # Many services offer free preview data
        results = {}

        # Check for recalls
        recalls = self.check_recalls(vin, session)
        results["recalls"] = recalls

        # Check for complaints
        complaints = self.check_complaints(vin, session)
        results["complaints"] = complaints

        return results

    def _vehicle_params(self, vin):
        """Decode a VIN into the make/model/year parameters the NHTSA APIs expect"""
        decoded = NHTSADecoder(self.proxy_manager).decode_vin(vin)
        return {
            "make": decoded.get("make"),
            "model": decoded.get("model"),
            "modelYear": decoded.get("year")
        }

    def check_recalls(self, vin, session):
        """Check NHTSA recalls for the vehicle a VIN decodes to"""
        # The documented recallsByVehicle endpoint is queried by make, model, and
        # model year, so results apply to the model line rather than this exact VIN
        url = "https://api.nhtsa.gov/recalls/recallsByVehicle"
        response = session.get(url, params=self._vehicle_params(vin), timeout=30)

        if response.status_code == 200:
            return response.json().get("results", [])
        return []

    def check_complaints(self, vin, session):
        """Check NHTSA complaints database"""
        url = "https://api.nhtsa.gov/complaints/complaintsByVehicle"
        response = session.get(url, params=self._vehicle_params(vin), timeout=30)

        if response.status_code == 200:
            return response.json().get("results", [])
        return []
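
The scrapers in this guide call `get_random_ua()` and `get_random_mobile_ua()` without defining them. One plausible shape for those helpers is sketched below. The User-Agent strings are illustrative examples of the usual format, not a maintained list, and the pool names are assumptions.

```python
import random

# Illustrative values only; a real pool should track current browser releases
DESKTOP_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

MOBILE_USER_AGENTS = [
    "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Mobile/15E148 Safari/604.1",
    "Mozilla/5.0 (Linux; Android 13; Pixel 7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36",
]


def get_random_ua():
    """Pick a desktop User-Agent string at random."""
    return random.choice(DESKTOP_USER_AGENTS)


def get_random_mobile_ua():
    """Pick a mobile User-Agent string at random."""
    return random.choice(MOBILE_USER_AGENTS)
```

Choosing the User-Agent once per session, as the scraper classes do, keeps it consistent across the requests made within that session.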

Scraping Singapore Vehicle Data

For Singapore-specific VIN and vehicle data:

import requests
from bs4 import BeautifulSoup

class SGVehicleScraper:
    def __init__(self, proxy_manager):
        self.proxy_manager = proxy_manager

    def scrape_sgcarmart_vehicle(self, listing_url):
        """Extract vehicle details from SGCarMart listing"""
        proxy = self.proxy_manager.get_proxy("SG")

        session = requests.Session()
        session.proxies.update(proxy)
        session.headers.update({
            # get_random_mobile_ua() is assumed to return a mobile browser User-Agent string
            "User-Agent": get_random_mobile_ua(),
        })

        response = session.get(listing_url, timeout=30)
        response.raise_for_status()  # do not parse an error or block page as a listing
        soup = BeautifulSoup(response.text, 'html.parser')

        details = {}
        spec_table = soup.select('.car-spec-table tr, .vehicle-info tr')

        for row in spec_table:
            cells = row.select('td')
            if len(cells) >= 2:
                key = cells[0].get_text(strip=True).lower()
                value = cells[1].get_text(strip=True)
                details[key] = value

        return {
            "registration_date": details.get("reg date"),
            "coe_expiry": details.get("coe expiry date"),
            "coe_category": details.get("coe category"),
            "arf": details.get("arf"),
            "omv": details.get("omv"),
            "engine_cc": details.get("engine capacity"),
            "power_kw": details.get("power"),
            "road_tax": details.get("road tax"),
            "mileage": details.get("mileage"),
            "owners": details.get("no. of owners"),
            "vehicle_type": details.get("vehicle type"),
        }
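
The scraper above returns every field as raw page text. Before storing or comparing values, it usually helps to convert the numeric ones. The sketch below shows one way to do that; the `parse_number` and `normalize_listing` helpers are hypothetical, and the sample strings are assumptions about how such values might be formatted, not captured from a real listing.

```python
import re

# First run of digits, allowing thousands separators and an optional decimal part
_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_number(text):
    """Return the first number found in a scraped string as a float, or None."""
    if not text:
        return None
    match = _NUMBER.search(text)
    if not match:
        return None
    return float(match.group().replace(",", ""))


def normalize_listing(details):
    """Convert selected scraped strings to numbers, leaving other fields untouched."""
    numeric_fields = ("arf", "omv", "engine_cc", "power_kw", "road_tax", "mileage", "owners")
    cleaned = dict(details)
    for field in numeric_fields:
        cleaned[field] = parse_number(details.get(field))
    return cleaned


# Made-up sample in the shape returned by scrape_sgcarmart_vehicle
sample = {
    "registration_date": "15-Mar-2019",
    "omv": "$23,456",
    "engine_cc": "1,598 cc",
    "mileage": "45,000 km",
    "owners": "2",
    "arf": "N.A.",
}
normalized = normalize_listing(sample)
```

Fields that are missing or hold placeholders such as "N.A." come back as `None`, so downstream code can tell "unknown" apart from zero.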

Building a VIN Specification Database

Data Collection Pipeline

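import random
import time

import requests

class VINDatabaseBuilder:
    def __init__(self, proxy_manager, db):
        self.proxy_manager = proxy_manager
        self.db = db
        self.nhtsa = NHTSADecoder(proxy_manager)
        # Reuse the recall lookup defined on VehicleHistoryScraper
        self.history = VehicleHistoryScraper(proxy_manager)

    def process_vin(self, vin):
        # Check if already decoded
        existing = self.db.get_vin_data(vin)
        if existing:
            return existing

        # Decode VIN
        specs = self.nhtsa.decode_vin(vin)

        # Enrich with additional data
        session = requests.Session()
        session.proxies.update(self.proxy_manager.get_proxy("US"))
        specs["recalls"] = self.history.check_recalls(vin, session)
        specs["vin"] = vin

        # Store in database
        self.db.save_vin_data(specs)

        return specs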

    def process_batch(self, vin_list, batch_size=50):
        results = []
        for i in range(0, len(vin_list), batch_size):
            batch = vin_list[i:i+batch_size]

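            # Filter out already decoded VINs
            new_vins = [v for v in batch if not self.db.get_vin_data(v)]

            if new_vins:
                batch_results = self.nhtsa.decode_batch(new_vins)
                # Batch results use NHTSA's flat format (one dict per VIN with keys
                # such as "Make" and "ModelYear"), not the parse_nhtsa_response shape
                for result in batch_results.get("Results", []):
                    self.db.save_vin_data(result)
                    results.append(result)

                # Pause only after a request was actually sent
                time.sleep(random.uniform(1, 3))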

        return results
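
Rows from the batch endpoint do not share key names with the dict that `parse_nhtsa_response` builds, so it is worth mapping them to one shape before saving. The sketch below assumes the flat-format key names shown in `FLAT_TO_SCHEMA`. These reflect my understanding of the NHTSA flat response and should be confirmed against a live response. The helper name and the sample row are made up for illustration.

```python
# Assumed flat-format keys on the left, field names used elsewhere in this guide on the right
FLAT_TO_SCHEMA = {
    "VIN": "vin",
    "Make": "make",
    "Model": "model",
    "ModelYear": "year",
    "BodyClass": "body_class",
    "DriveType": "drive_type",
    "EngineCylinders": "engine_cylinders",
    "DisplacementL": "engine_displacement",
    "FuelTypePrimary": "fuel_type",
    "TransmissionStyle": "transmission",
    "PlantCountry": "plant_country",
    "PlantCity": "plant_city",
    "Doors": "doors",
    "Seats": "seats",
}


def normalize_flat_result(row):
    """Map one flat batch row to the field names used by single-VIN decoding."""
    record = {}
    for source_key, target_key in FLAT_TO_SCHEMA.items():
        value = str(row.get(source_key) or "").strip()
        record[target_key] = value or None  # empty strings become None
    return record


# Made-up row for illustration, not real API output
row = {"VIN": "1HGCM82633A004352", "Make": "HONDA", "Model": " Accord ",
       "ModelYear": "2003", "Doors": "", "Seats": None}
record = normalize_flat_result(row)
```

With a helper like this, `process_batch` could call `self.db.save_vin_data(normalize_flat_result(result))` so that both code paths store the same structure.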

Database Schema

CREATE TABLE vin_specifications (
    vin VARCHAR(17) PRIMARY KEY,
    make VARCHAR(100),
    model VARCHAR(200),
    year INTEGER,
    trim VARCHAR(200),
    body_class VARCHAR(100),
    drive_type VARCHAR(50),
    engine_cylinders INTEGER,
    engine_displacement_l DECIMAL(3, 1),
    fuel_type VARCHAR(50),
    transmission VARCHAR(50),
    plant_country VARCHAR(100),
    plant_city VARCHAR(100),
    doors INTEGER,
    seats INTEGER,
    gvwr_kg DECIMAL(8, 2),
    decoded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE vin_recalls (
    id SERIAL PRIMARY KEY,
    vin VARCHAR(17) REFERENCES vin_specifications(vin),
    recall_number VARCHAR(50),
    recall_date DATE,
    component VARCHAR(200),
    summary TEXT,
    consequence TEXT,
    remedy TEXT,
    fetched_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE vin_history_events (
    id SERIAL PRIMARY KEY,
    vin VARCHAR(17) REFERENCES vin_specifications(vin),
    event_date DATE,
    event_type VARCHAR(50),
    description TEXT,
    location VARCHAR(200),
    source VARCHAR(100),
    fetched_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
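
The pipeline code calls `db.get_vin_data` and `db.save_vin_data` without showing how they reach these tables. The sketch below illustrates one possible pair of functions, using Python's built-in `sqlite3` as a stand-in for the PostgreSQL-style schema above and only a few of its columns. The function bodies, the reduced table, and the sample values are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE vin_specifications (
        vin TEXT PRIMARY KEY,
        make TEXT,
        model TEXT,
        year INTEGER,
        decoded_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")


def save_vin_data(conn, specs):
    """Insert a decoded VIN, or overwrite the stored row if the VIN already exists."""
    conn.execute(
        """
        INSERT INTO vin_specifications (vin, make, model, year)
        VALUES (:vin, :make, :model, :year)
        ON CONFLICT(vin) DO UPDATE SET
            make = excluded.make,
            model = excluded.model,
            year = excluded.year,
            decoded_at = CURRENT_TIMESTAMP
        """,
        specs,
    )
    conn.commit()


def get_vin_data(conn, vin):
    """Return the stored row for a VIN as a dict, or None if it was never decoded."""
    row = conn.execute(
        "SELECT vin, make, model, year FROM vin_specifications WHERE vin = ?",
        (vin,),
    ).fetchone()
    if row is None:
        return None
    return dict(zip(("vin", "make", "model", "year"), row))


# Made-up values for illustration
save_vin_data(conn, {"vin": "11111111111111111", "make": "ExampleMake", "model": "A", "year": 2020})
save_vin_data(conn, {"vin": "11111111111111111", "make": "ExampleMake", "model": "B", "year": 2020})
stored = get_vin_data(conn, "11111111111111111")
row_count = conn.execute("SELECT COUNT(*) FROM vin_specifications").fetchone()[0]
```

The upsert keeps `vin` unique when a VIN is decoded again. The `ON CONFLICT ... DO UPDATE` clause needs SQLite 3.24 or later, and PostgreSQL supports the same clause.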

VIN Data Applications

Vehicle Valuation

Use VIN-decoded specifications to support accurate vehicle valuations:

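import statistics

import numpy as np

class VINBasedValuation:
    def __init__(self, db):
        self.db = db  # same storage object used by VINDatabaseBuilder

    def estimate_value(self, vin, market_data):
        specs = self.db.get_vin_data(vin)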
        if not specs:
            return None

        # Find comparable vehicles in market data
        comparables = self.find_comparables(specs, market_data)

        if len(comparables) < 3:
            return {"confidence": "low", "message": "Insufficient comparable data"}

        prices = [c["price"] for c in comparables]
        return {
            "estimated_value": statistics.median(prices),
            "range_low": np.percentile(prices, 25),
            "range_high": np.percentile(prices, 75),
            "comparable_count": len(comparables),
            "confidence": "high" if len(comparables) >= 10 else "medium"
        }

    def find_comparables(self, specs, market_data):
        return [v for v in market_data
                if v["make"] == specs["make"]
                and v["model"] == specs["model"]
                and abs(v["year"] - int(specs["year"])) <= 1
                and v.get("transmission") == specs.get("transmission")]
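
The valuation above reports a median with a 25th to 75th percentile band. If NumPy is not already a dependency, the same three figures can be computed with the standard library alone. The sketch below uses a hypothetical `price_band` helper, and the prices are made-up sample values.

```python
import statistics


def price_band(prices):
    """Return (25th percentile, median, 75th percentile) for comparable prices."""
    if len(prices) < 2:
        raise ValueError("need at least two prices")
    # method="inclusive" treats the data as the full population and interpolates
    # linearly between neighbouring values
    q1, median, q3 = statistics.quantiles(prices, n=4, method="inclusive")
    return q1, median, q3


low, mid, high = price_band([30000, 10000, 50000, 20000, 40000])
```

For this sample the band is 20000 to 40000 around a median of 30000. The inclusive method should agree with the default linear interpolation of `np.percentile`, but it is worth checking both on your own data before swapping one for the other.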

Fraud Detection

VIN data helps detect common types of automotive fraud:

class VINFraudDetector:
    def __init__(self, db):
        self.db = db  # storage object that can look up active listings by VIN

    def check_for_fraud(self, listing, vin_data):
        flags = []

        # Check VIN validity
        if not self.validate_vin_checksum(listing["vin"]):
            flags.append({"type": "INVALID_VIN", "severity": "high"})

        # Check if listing specs match VIN data
        if vin_data:
            # Decoded fields can be None, so only compare values that are present
            decoded_year = vin_data.get("year")
            if listing.get("year") and decoded_year and str(listing["year"]) != str(decoded_year):
                flags.append({
                    "type": "YEAR_MISMATCH",
                    "severity": "high",
                    "detail": f"Listed as {listing['year']}, VIN indicates {decoded_year}"
                })

            decoded_make = (vin_data.get("make") or "").lower()
            if listing.get("make") and decoded_make and listing["make"].lower() != decoded_make:
                flags.append({
                    "type": "MAKE_MISMATCH",
                    "severity": "high"
                })

        # Check for cloned VINs (same VIN in multiple active listings)
        duplicates = self.db.find_active_listings_with_vin(listing["vin"])
        if len(duplicates) > 1:
            flags.append({
                "type": "DUPLICATE_VIN",
                "severity": "medium",
                "detail": f"VIN found in {len(duplicates)} active listings"
            })

        return flags

    def validate_vin_checksum(self, vin):
        vin = vin.upper()  # normalize once so the final comparison is case-insensitive
        if len(vin) != 17:
            return False

        transliteration = {
            'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7,
            'H': 8, 'J': 1, 'K': 2, 'L': 3, 'M': 4, 'N': 5, 'P': 7,
            'R': 9, 'S': 2, 'T': 3, 'U': 4, 'V': 5, 'W': 6, 'X': 7,
            'Y': 8, 'Z': 9
        }
        # Position 9 has weight 0 because it holds the check digit itself
        weights = [8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2]

        total = 0
        for i, char in enumerate(vin):
            if char.isdigit():
                value = int(char)
            elif char in transliteration:
                value = transliteration[char]
            else:
                return False  # I, O, Q and punctuation never appear in a valid VIN
            total += value * weights[i]

        check = total % 11
        check_char = 'X' if check == 10 else str(check)

        return vin[8] == check_char
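
To make the arithmetic concrete, take a VIN made of seventeen 1s: every character has value 1, so the weighted sum is just the sum of the weights, 89, and 89 mod 11 is 1, which matches the digit in position 9. The sketch below repeats the same calculation as a standalone function so it can be tried on other values. The `compute_check_digit` name is hypothetical, and the VINs are sample values for illustration.

```python
# Letter values and position weights, matching the tables in validate_vin_checksum
TRANSLITERATION = dict(zip(
    "ABCDEFGHJKLMNPRSTUVWXYZ",
    [1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 7, 9, 2, 3, 4, 5, 6, 7, 8, 9],
))
WEIGHTS = [8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2]


def compute_check_digit(vin):
    """Return the check character that position 9 should contain for this VIN."""
    vin = vin.upper()
    if len(vin) != 17:
        raise ValueError("VIN must be 17 characters")
    total = 0
    for char, weight in zip(vin, WEIGHTS):
        if char.isdigit():
            value = int(char)
        else:
            value = TRANSLITERATION[char]  # raises KeyError for I, O, Q
        total += value * weight
    remainder = total % 11
    return "X" if remainder == 10 else str(remainder)


all_ones = compute_check_digit("11111111111111111")
with_x = compute_check_digit("1M8GDM9AXKP042788")
sample = compute_check_digit("1HGCM82633A004352")
```

The second sample shows the remainder-10 case, where the expected character is the letter X rather than a digit.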

Market Research

Aggregate VIN data to understand market composition:

from collections import Counter
import statistics

def analyze_market_by_vin(vin_list, proxy_manager):
    """Analyze a set of VINs to understand market composition"""
    decoder = NHTSADecoder(proxy_manager)

    makes = Counter()
    countries_of_origin = Counter()
    fuel_types = Counter()
    body_types = Counter()
    engine_sizes = []
    decoded_count = 0

    for vin in vin_list:
        data = decoder.decode_vin(vin)
        # decode_vin always returns a dict, so count a VIN only if some field was filled in
        if data and any(data.values()):
            decoded_count += 1
            if data.get("make"):
                makes[data["make"]] += 1
            if data.get("plant_country"):
                countries_of_origin[data["plant_country"]] += 1
            if data.get("fuel_type"):
                fuel_types[data["fuel_type"]] += 1
            if data.get("body_class"):
                body_types[data["body_class"]] += 1
            if data.get("engine_displacement"):
                try:
                    engine_sizes.append(float(data["engine_displacement"]))
                except ValueError:
                    pass  # skip displacement values that are not plain numbers

    return {
        "top_makes": makes.most_common(10),
        "origin_countries": dict(countries_of_origin),
        "fuel_type_distribution": dict(fuel_types),
        "body_type_distribution": dict(body_types),
        "avg_engine_size": statistics.mean(engine_sizes) if engine_sizes else None,
        "total_decoded": decoded_count
    }

Proxy Best Practices for VIN Data Collection

Rate Management

VIN databases and history services have strict rate limits. Best practices:

  • Batch NHTSA requests (up to 50 VINs per batch call)
  • Space requests 2-5 seconds apart for other services
  • Use DataResearchTools mobile proxies for services that detect and block datacenter IPs
  • Implement exponential backoff on rate limit responses
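
The last point can be sketched as a small retry wrapper. This is one possible shape, not a prescribed implementation: `RateLimited` and `with_backoff` are hypothetical names, and the fetch function is expected to raise `RateLimited` itself when it sees a rate-limit response such as HTTP 429.

```python
import random
import time


class RateLimited(Exception):
    """Raised by a fetch function when the service signals a rate limit."""


def with_backoff(fetch, max_attempts=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each rate-limit error."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts, let the caller decide what to do
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads retries out so parallel workers do not retry in lockstep
            sleep(delay + random.uniform(0, delay / 2))


# Demonstration with a fake fetch that is rate limited twice, then succeeds
attempts = {"count": 0}


def flaky_fetch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimited("HTTP 429")
    return "decoded"


waits = []  # record the waits instead of actually sleeping
result = with_backoff(flaky_fetch, sleep=waits.append)
```

Passing `sleep` as a parameter keeps the wrapper testable without real delays. In production the default `time.sleep` applies.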

Geographic Considerations

  • Use US proxies for NHTSA and North American VIN services
  • Use Singapore proxies for LTA and SGCarMart
  • Use Malaysian proxies for JPJ and Malaysian dealer sites
  • DataResearchTools provides the geographic coverage needed for multi-region VIN data collection

Session Management

For services that require login or session continuity:

  • Use sticky sessions from DataResearchTools to maintain a consistent IP throughout a session
  • Rotate sessions between different VIN lookups
  • Clear cookies between sessions to avoid tracking

Conclusion

VIN data is the backbone of automotive intelligence. From basic specification decoding to comprehensive vehicle history extraction, the ability to collect and analyze VIN data at scale powers valuations, fraud detection, and market research across the automotive industry.

DataResearchTools provides the proxy infrastructure needed to access VIN data sources reliably. With mobile proxies covering both North American and Southeast Asian markets, geographic targeting for region-specific databases, and the throughput to handle high-volume VIN processing, DataResearchTools supports the full spectrum of VIN data collection needs. Whether you are building a vehicle history tool, validating dealer inventory, or analyzing market composition, reliable proxy access to VIN databases is a fundamental requirement.

