How to Scrape Craigslist Listings Across Multiple Cities

Craigslist remains one of the largest classified advertising platforms in the United States, with listings spanning housing, vehicles, jobs, services, and goods across hundreds of cities. For real estate analysts, market researchers, automotive dealers, and economic researchers, Craigslist data provides ground-level pricing signals that more polished platforms do not capture.

The unique challenge with Craigslist scraping is its geo-distributed architecture. Each city operates as a semi-independent subdomain with its own listings. Collecting data across multiple cities requires a scraper that can navigate this distributed structure efficiently while rotating proxies to avoid IP-based blocking.

This guide demonstrates how to build a multi-city Craigslist scraper using Python, BeautifulSoup, and mobile proxy rotation.

Understanding Craigslist’s Architecture

Craigslist uses a subdomain-based geographic structure:

  • newyork.craigslist.org for New York City
  • sfbay.craigslist.org for San Francisco Bay Area
  • losangeles.craigslist.org for Los Angeles
  • chicago.craigslist.org for Chicago

Each subdomain hosts the same category structure but contains entirely different listings. This means a comprehensive national dataset requires scraping each city independently.
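Because every city follows the same URL pattern, search URLs can be generated rather than hand-written. A minimal sketch (`apa` is the apartments category code used later in this guide; pagination uses the `s` offset parameter):

```python
# Build search URLs for one category across several city subdomains.
CITY_SUBDOMAINS = ["newyork", "sfbay", "losangeles", "chicago"]

def search_url(subdomain: str, category: str, offset: int = 0) -> str:
    """Return the search URL for a category on a given city subdomain."""
    base = f"https://{subdomain}.craigslist.org/search/{category}"
    return f"{base}?s={offset}" if offset else base

urls = [search_url(city, "apa") for city in CITY_SUBDOMAINS]
# e.g. https://newyork.craigslist.org/search/apa
```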

Craigslist’s anti-scraping measures include:

IP-based rate limiting. Craigslist blocks IPs that make too many requests in a short period. This is the primary defense mechanism, and it is where web scraping proxies become essential.

CAPTCHA challenges. Excessive requests trigger CAPTCHA pages that block automated access until solved.

No official API. Unlike most modern platforms, Craigslist does not offer a public API for data access.

Minimal JavaScript. Craigslist pages are largely static HTML, which actually makes parsing straightforward once you get past the rate limiting.
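Blocks do not always arrive as clean 403 responses; a CAPTCHA page can come back with a 200 status. A small helper can centralize that check. This is a sketch: the marker strings are assumptions and should be verified against the pages Craigslist actually serves when it challenges your IP:

```python
def looks_blocked(status_code: int, body: str) -> bool:
    """Heuristic block detection: a 403, or CAPTCHA markers in a 200 page.

    The marker strings below are assumptions -- check them against real
    challenge pages before relying on this in production.
    """
    if status_code == 403:
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in ("captcha", "blocked", "access denied"))
```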

Setting Up the Environment

pip install requests beautifulsoup4 pandas lxml tqdm

Building the Multi-City Craigslist Scraper

The scraper assigns proxies per city to maintain geographic consistency and avoid triggering rate limits:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random
import re
from datetime import datetime
from tqdm import tqdm


# Major US Craigslist city subdomains
CRAIGSLIST_CITIES = {
    "new_york": "newyork",
    "los_angeles": "losangeles",
    "chicago": "chicago",
    "houston": "houston",
    "phoenix": "phoenix",
    "philadelphia": "philadelphia",
    "san_antonio": "sanantonio",
    "san_diego": "sandiego",
    "dallas": "dallas",
    "san_francisco": "sfbay",
    "austin": "austin",
    "seattle": "seattle",
    "denver": "denver",
    "boston": "boston",
    "portland": "portland",
    "atlanta": "atlanta",
    "miami": "miami",
    "detroit": "detroit",
    "minneapolis": "minneapolis",
    "las_vegas": "lasvegas",
}


class CraigslistProxyManager:
    """Assigns dedicated proxies to cities for consistent scraping."""

    def __init__(self, proxy_list):
        self.proxies = proxy_list
        self.city_assignments = {}
        self.index = 0

    def get_proxy_for_city(self, city_code):
        """Return a consistent proxy for a given city."""
        if city_code not in self.city_assignments:
            proxy = self.proxies[self.index % len(self.proxies)]
            self.city_assignments[city_code] = proxy
            self.index += 1

        return {
            "http": self.city_assignments[city_code],
            "https": self.city_assignments[city_code],
        }

    def rotate_city_proxy(self, city_code):
        """Force rotation for a city that got blocked."""
        self.index += 1
        new_proxy = self.proxies[self.index % len(self.proxies)]
        self.city_assignments[city_code] = new_proxy
        return {"http": new_proxy, "https": new_proxy}


class CraigslistScraper:
    """Scrapes Craigslist listings across multiple cities."""

    def __init__(self, proxy_manager):
        self.proxy_manager = proxy_manager
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": (
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/120.0.0.0 Safari/537.36"
            ),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.5",
        })

    def scrape_category(self, city_code, category, max_results=500):
        """Scrape listings from a specific category in a specific city."""
        base_url = f"https://{city_code}.craigslist.org/search/{category}"
        all_listings = []
        offset = 0
        step = 120  # Craigslist shows 120 results per page
        consecutive_failures = 0
        max_failures = 3  # Give up on this city after repeated blocks/errors

        while offset < max_results and consecutive_failures < max_failures:
            url = f"{base_url}?s={offset}" if offset > 0 else base_url
            proxy = self.proxy_manager.get_proxy_for_city(city_code)

            try:
                response = self.session.get(url, proxies=proxy, timeout=15)

                if response.status_code == 200:
                    listings = self._parse_listing_page(response.text, city_code, category)

                    if not listings:
                        break

                    all_listings.extend(listings)
                    offset += step
                    consecutive_failures = 0

                    print(f"{city_code}/{category}: {len(all_listings)} listings (page {offset // step})")
                    time.sleep(random.uniform(3, 7))

                elif response.status_code == 403:
                    print(f"Blocked on {city_code}, rotating proxy...")
                    self.proxy_manager.rotate_city_proxy(city_code)
                    consecutive_failures += 1
                    time.sleep(random.uniform(10, 20))

                else:
                    print(f"HTTP {response.status_code} for {city_code}/{category}")
                    break

            except requests.RequestException as e:
                print(f"Request error for {city_code}: {e}")
                self.proxy_manager.rotate_city_proxy(city_code)
                consecutive_failures += 1
                time.sleep(random.uniform(5, 10))

        return all_listings[:max_results]

    def _parse_listing_page(self, html, city_code, category):
        """Extract listings from a Craigslist search results page."""
        soup = BeautifulSoup(html, "lxml")
        listings = []

        # Craigslist uses .cl-static-search-result or .result-row
        result_rows = soup.select(".cl-static-search-result, .result-row, li.cl-search-result")

        for row in result_rows:
            listing = self._parse_single_listing(row, city_code, category)
            if listing:
                listings.append(listing)

        return listings

    def _parse_single_listing(self, row, city_code, category):
        """Parse a single listing row from search results."""
        listing = {
            "city": city_code,
            "category": category,
            "scraped_at": datetime.now().isoformat(),
        }

        # Title and URL
        title_link = row.select_one("a.titlestring, a.result-title, a.posting-title")
        if title_link:
            listing["title"] = title_link.get_text(strip=True)
            listing["url"] = title_link.get("href", "")
            if listing["url"] and not listing["url"].startswith("http"):
                listing["url"] = f"https://{city_code}.craigslist.org{listing['url']}"
        else:
            return None

        # Price
        price_el = row.select_one(".priceinfo, .result-price, span.price")
        if price_el:
            price_text = price_el.get_text(strip=True)
            listing["price"] = self._clean_price(price_text)
            listing["price_raw"] = price_text
        else:
            listing["price"] = None
            listing["price_raw"] = None

        # Location/neighborhood
        hood_el = row.select_one(".result-hood, .neighborhood")
        listing["neighborhood"] = hood_el.get_text(strip=True).strip("() ") if hood_el else None

        # Date
        date_el = row.select_one("time, .result-date, .meta .date")
        if date_el:
            listing["posted_date"] = date_el.get("datetime") or date_el.get_text(strip=True)
        else:
            listing["posted_date"] = None

        # Extract listing ID from URL
        if listing.get("url"):
            id_match = re.search(r"/(\d+)\.html", listing["url"])
            listing["listing_id"] = id_match.group(1) if id_match else None

        return listing

    @staticmethod
    def _clean_price(price_text):
        """Extract numeric price from text like '$1,500'."""
        match = re.search(r"\d+\.?\d*", price_text.replace(",", ""))
        return float(match.group()) if match else None

    def scrape_listing_detail(self, url, city_code):
        """Scrape detailed information from a single listing page."""
        proxy = self.proxy_manager.get_proxy_for_city(city_code)

        try:
            response = self.session.get(url, proxies=proxy, timeout=15)
            if response.status_code != 200:
                return None

            soup = BeautifulSoup(response.text, "lxml")
            detail = {"url": url}

            # Full description
            body = soup.select_one("#postingbody")
            if body:
                # Remove the "QR Code" link text
                for tag in body.select(".print-information"):
                    tag.decompose()
                detail["description"] = body.get_text(strip=True)

            # Attributes (for housing: sqft, bedrooms, etc.)
            attrs = soup.select(".attrgroup span")
            detail["attributes"] = [a.get_text(strip=True) for a in attrs]

            # Images
            images = soup.select("#thumbs a")
            detail["image_urls"] = [img.get("href") for img in images if img.get("href")]
            detail["image_count"] = len(detail["image_urls"])

            # Map coordinates
            map_el = soup.select_one("#map")
            if map_el:
                detail["latitude"] = map_el.get("data-latitude")
                detail["longitude"] = map_el.get("data-longitude")

            # Posting info
            post_info = soup.select_one(".postinginfos")
            if post_info:
                detail["posting_info"] = post_info.get_text(strip=True)

            return detail

        except Exception as e:
            print(f"Detail scrape error: {e}")
            return None

Scraping Across Multiple Cities

The multi-city scraper coordinates data collection across all target cities:

class MultiCityScraper:
    """Coordinates scraping across multiple Craigslist cities."""

    def __init__(self, scraper, cities=None):
        self.scraper = scraper
        self.cities = cities or CRAIGSLIST_CITIES

    def scrape_national(self, category, max_per_city=200):
        """Scrape a category across all configured cities."""
        national_data = []

        city_list = list(self.cities.items())
        random.shuffle(city_list)  # Randomize order to distribute load

        for city_name, city_code in tqdm(city_list, desc=f"Scraping {category}"):
            try:
                listings = self.scraper.scrape_category(
                    city_code, category, max_results=max_per_city
                )
                for listing in listings:
                    listing["city_name"] = city_name
                national_data.extend(listings)

                print(f"{city_name}: {len(listings)} listings")

            except Exception as e:
                print(f"Error scraping {city_name}: {e}")

            # Delay between cities
            time.sleep(random.uniform(5, 15))

        return national_data

    def housing_market_analysis(self, max_per_city=500):
        """Collect housing rental data across cities for market analysis."""
        categories = {
            "apa": "apartments",
            "hou": "housing",
            "roo": "rooms",
        }

        all_housing = []

        for cat_code, cat_name in categories.items():
            print(f"\nScraping {cat_name} listings...")
            data = self.scrape_national(cat_code, max_per_city=max_per_city)
            for listing in data:
                listing["housing_type"] = cat_name
            all_housing.extend(data)

        return all_housing

    def vehicle_market_analysis(self, max_per_city=300):
        """Collect vehicle listing data across cities."""
        categories = {
            "cta": "cars_trucks",
            "mca": "motorcycles",
        }

        all_vehicles = []

        for cat_code, cat_name in categories.items():
            print(f"\nScraping {cat_name} listings...")
            data = self.scrape_national(cat_code, max_per_city=max_per_city)
            for listing in data:
                listing["vehicle_type"] = cat_name
            all_vehicles.extend(data)

        return all_vehicles

Analyzing the Collected Data

def analyze_housing_data(df):
    """Perform basic analysis on collected housing data."""
    # Filter to listings with prices
    priced = df[df["price"].notna() & (df["price"] > 0)].copy()

    # City-level summary
    city_summary = priced.groupby("city_name").agg({
        "price": ["count", "mean", "median", "min", "max"],
    }).round(2)

    city_summary.columns = [
        "listing_count", "avg_price", "median_price", "min_price", "max_price"
    ]
    city_summary = city_summary.sort_values("median_price", ascending=False)

    return city_summary


def find_price_outliers(df, std_multiplier=2):
    """Identify unusually priced listings that may represent deals or errors."""
    priced = df[df["price"].notna() & (df["price"] > 0)].copy()

    city_stats = priced.groupby("city_name")["price"].agg(["mean", "std"])

    outliers = []
    for _, row in priced.iterrows():
        city = row["city_name"]
        if city in city_stats.index:
            mean = city_stats.loc[city, "mean"]
            std = city_stats.loc[city, "std"]
            if std > 0 and abs(row["price"] - mean) > std_multiplier * std:
                row_dict = row.to_dict()
                row_dict["z_score"] = (row["price"] - mean) / std
                outliers.append(row_dict)

    return pd.DataFrame(outliers)

Running the Complete Pipeline

def main():
    proxies = [
        "http://user:pass@proxy1.example.com:8080",
        "http://user:pass@proxy2.example.com:8080",
        "http://user:pass@proxy3.example.com:8080",
        "http://user:pass@proxy4.example.com:8080",
        "http://user:pass@proxy5.example.com:8080",
    ]

    proxy_manager = CraigslistProxyManager(proxies)
    scraper = CraigslistScraper(proxy_manager)
    multi_city = MultiCityScraper(scraper)

    # Scrape apartment listings nationally
    housing_data = multi_city.scrape_national("apa", max_per_city=200)
    housing_df = pd.DataFrame(housing_data)
    housing_df.to_csv("craigslist_apartments_national.csv", index=False)

    # Analyze
    if not housing_df.empty:
        summary = analyze_housing_data(housing_df)
        print("\nHousing Market Summary by City:")
        print(summary.to_string())
        summary.to_csv("craigslist_housing_summary.csv")

        outliers = find_price_outliers(housing_df)
        if not outliers.empty:
            print(f"\nFound {len(outliers)} price outliers")
            outliers.to_csv("craigslist_price_outliers.csv", index=False)

    # Scrape vehicle listings
    vehicle_data = multi_city.vehicle_market_analysis(max_per_city=200)
    vehicle_df = pd.DataFrame(vehicle_data)
    vehicle_df.to_csv("craigslist_vehicles_national.csv", index=False)

    print(f"\nTotal housing listings: {len(housing_data)}")
    print(f"Total vehicle listings: {len(vehicle_data)}")


if __name__ == "__main__":
    main()

Proxy Strategy for Multi-City Scraping

The geographic distribution of Craigslist creates a natural alignment with proxy rotation strategy:

Assign proxies per city. By keeping one proxy dedicated to one city’s subdomain, you reduce the per-IP request volume on each subdomain. This is more effective than randomly rotating proxies across all cities.

Geographic proxy matching. When possible, use mobile proxies from geographic regions that match the Craigslist cities you are scraping. Requests from a New York mobile IP to newyork.craigslist.org appear more natural than requests from a foreign IP.

Stagger city scraping. Do not scrape all cities simultaneously. Process them sequentially or in small batches with randomized delays between cities. This distributes the load over time and reduces the chance of triggering site-wide rate limits.

Monitor for blocks. Craigslist blocks manifest as HTTP 403 responses or CAPTCHA redirect pages. Implement automatic proxy rotation when these are detected, and add exponential backoff before retrying.
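The backoff step can be sketched as a base delay doubled per consecutive failure, capped, with jitter so retries from multiple workers do not synchronize. The base and cap values here are illustrative:

```python
import random

def backoff_delay(attempt: int, base: float = 5.0, cap: float = 300.0) -> float:
    """Exponential backoff with jitter: base * 2^attempt seconds, capped,
    then scaled by +/- 25% so retries are not perfectly periodic."""
    delay = min(base * (2 ** attempt), cap)
    return delay * random.uniform(0.75, 1.25)

# Usage inside a retry loop: sleep for backoff_delay(attempt) after each
# consecutive block, and reset attempt to 0 after a successful request.
```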

Data Quality and Cleaning

Craigslist data requires significant cleaning:

  • Prices may be entered inconsistently (e.g., “$1500” vs “$1,500/mo” vs “$15” for an item worth $1,500)
  • Listings may be duplicated across nearby city subdomains
  • Spam and scam listings inflate certain categories
  • Neighborhood names are user-entered and inconsistent
  • Date formats may vary between the old and new Craigslist interfaces

Filter outliers aggressively and validate prices against expected ranges for each category.
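A first cleaning pass might deduplicate on the listing ID extracted earlier (which stays the same when a post appears on nearby subdomains) and drop out-of-range prices. A sketch with pandas; the price bounds are illustrative assumptions for monthly rents and should be tuned per category:

```python
import pandas as pd

def clean_housing_df(df: pd.DataFrame,
                     min_price: float = 200,
                     max_price: float = 20000) -> pd.DataFrame:
    """Drop cross-city duplicates and implausible prices.

    The bounds are illustrative assumptions for rental listings; rows with
    missing prices are dropped because between() treats NaN as False.
    """
    cleaned = df.drop_duplicates(subset="listing_id", keep="first")
    cleaned = cleaned[cleaned["price"].between(min_price, max_price)]
    return cleaned.reset_index(drop=True)
```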

Conclusion

Craigslist’s geo-distributed architecture makes it a unique scraping target that rewards careful proxy management and city-by-city data collection. The platform’s relatively simple HTML structure means the technical parsing is straightforward; the challenge lies in scale and rate limit management.

With a properly sized mobile proxy pool and per-city proxy assignment, you can build a comprehensive national Craigslist dataset for housing market analysis, vehicle pricing research, or job market intelligence. For related scraping techniques, explore our web scraping proxy guides and the proxy glossary for technical definitions.

