How to Scrape Craigslist Listings Across Multiple Cities
Craigslist remains one of the largest classified advertising platforms in the United States, with listings spanning housing, vehicles, jobs, services, and goods across hundreds of cities. For real estate analysts, market researchers, automotive dealers, and economic researchers, Craigslist data provides ground-level pricing signals that more polished platforms do not capture.
The unique challenge with Craigslist scraping is its geo-distributed architecture. Each city operates as a semi-independent subdomain with its own listings. Collecting data across multiple cities requires a scraper that can navigate this distributed structure efficiently while rotating proxies to avoid IP-based blocking.
This guide demonstrates how to build a multi-city Craigslist scraper using Python, BeautifulSoup, and mobile proxy rotation.
Understanding Craigslist’s Architecture
Craigslist uses a subdomain-based geographic structure:
- newyork.craigslist.org for New York City
- sfbay.craigslist.org for the San Francisco Bay Area
- losangeles.craigslist.org for Los Angeles
- chicago.craigslist.org for Chicago
Each subdomain hosts the same category structure but contains entirely different listings. This means a comprehensive national dataset requires scraping each city independently.
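Before building the full scraper, it helps to see how a search URL is assembled. A minimal sketch, assuming the URL pattern used throughout this guide (https://{city}.craigslist.org/search/{category}, an s= offset parameter, and 120 results per page):

def build_search_urls(city_code, category, pages=3, step=120):
    """Yield paginated search URLs for one city subdomain and category."""
    base_url = f"https://{city_code}.craigslist.org/search/{category}"
    for page in range(pages):
        offset = page * step
        yield base_url if offset == 0 else f"{base_url}?s={offset}"

# Example: the first three apartment ("apa") search pages for the Bay Area
for url in build_search_urls("sfbay", "apa"):
    print(url)
# https://sfbay.craigslist.org/search/apa
# https://sfbay.craigslist.org/search/apa?s=120
# https://sfbay.craigslist.org/search/apa?s=240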
Craigslist’s anti-scraping measures include:
IP-based rate limiting. Craigslist blocks IPs that make too many requests in a short period. This is the primary defense mechanism, and it is the reason web scraping proxies are essential.
CAPTCHA challenges. Excessive requests trigger CAPTCHA pages that block automated access until solved (see the block-detection sketch after this list).
No official API. Unlike most modern platforms, Craigslist does not offer a public API for data access.
Minimal JavaScript. Craigslist pages are largely static HTML, which actually makes parsing straightforward once you get past the rate limiting.
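Recognizing when a proxy has been blocked is the first practical building block. A minimal sketch, where the marker strings are assumptions about typical Craigslist block pages rather than an official list; tune them against the responses you actually receive:

BLOCK_MARKERS = ("captcha", "blocked", "too many requests")

def looks_blocked(response):
    """Heuristic check for rate-limit or CAPTCHA blocks (marker strings are assumptions)."""
    if response.status_code in (403, 429):
        return True
    return any(marker in response.text.lower() for marker in BLOCK_MARKERS)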
Setting Up the Environment
pip install requests beautifulsoup4 pandas lxml tqdm
Building the Multi-City Craigslist Scraper
The scraper assigns proxies per city to maintain geographic consistency and avoid triggering rate limits:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random
import re
from datetime import datetime
from urllib.parse import urljoin
from tqdm import tqdm
# Major US Craigslist city subdomains
CRAIGSLIST_CITIES = {
"new_york": "newyork",
"los_angeles": "losangeles",
"chicago": "chicago",
"houston": "houston",
"phoenix": "phoenix",
"philadelphia": "philadelphia",
"san_antonio": "sanantonio",
"san_diego": "sandiego",
"dallas": "dallas",
"san_francisco": "sfbay",
"austin": "austin",
"seattle": "seattle",
"denver": "denver",
"boston": "boston",
"portland": "portland",
"atlanta": "atlanta",
"miami": "miami",
"detroit": "detroit",
"minneapolis": "minneapolis",
"las_vegas": "lasvegas",
}
class CraigslistProxyManager:
"""Assigns dedicated proxies to cities for consistent scraping."""
def __init__(self, proxy_list):
self.proxies = proxy_list
self.city_assignments = {}
self.index = 0
def get_proxy_for_city(self, city_code):
"""Return a consistent proxy for a given city."""
if city_code not in self.city_assignments:
proxy = self.proxies[self.index % len(self.proxies)]
self.city_assignments[city_code] = proxy
self.index += 1
return {
"http": self.city_assignments[city_code],
"https": self.city_assignments[city_code],
}
def rotate_city_proxy(self, city_code):
"""Force rotation for a city that got blocked."""
self.index += 1
new_proxy = self.proxies[self.index % len(self.proxies)]
self.city_assignments[city_code] = new_proxy
return {"http": new_proxy, "https": new_proxy}
class CraigslistScraper:
"""Scrapes Craigslist listings across multiple cities."""
def __init__(self, proxy_manager):
self.proxy_manager = proxy_manager
self.session = requests.Session()
self.session.headers.update({
"User-Agent": (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/120.0.0.0 Safari/537.36"
),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.5",
})
def scrape_category(self, city_code, category, max_results=500):
"""Scrape listings from a specific category in a specific city."""
base_url = f"https://{city_code}.craigslist.org/search/{category}"
all_listings = []
offset = 0
step = 120 # Craigslist shows 120 results per page
while offset < max_results:
url = f"{base_url}?s={offset}" if offset > 0 else base_url
proxy = self.proxy_manager.get_proxy_for_city(city_code)
try:
response = self.session.get(url, proxies=proxy, timeout=15)
if response.status_code == 200:
listings = self._parse_listing_page(response.text, city_code, category)
if not listings:
break
all_listings.extend(listings)
offset += step
print(f"{city_code}/{category}: {len(all_listings)} listings (page {offset // step})")
time.sleep(random.uniform(3, 7))
elif response.status_code == 403:
print(f"Blocked on {city_code}, rotating proxy...")
self.proxy_manager.rotate_city_proxy(city_code)
time.sleep(random.uniform(10, 20))
else:
print(f"HTTP {response.status_code} for {city_code}/{category}")
break
except requests.RequestException as e:
print(f"Request error for {city_code}: {e}")
self.proxy_manager.rotate_city_proxy(city_code)
time.sleep(random.uniform(5, 10))
return all_listings[:max_results]
def _parse_listing_page(self, html, city_code, category):
"""Extract listings from a Craigslist search results page."""
soup = BeautifulSoup(html, "lxml")
listings = []
# Craigslist uses .cl-static-search-result or .result-row
result_rows = soup.select(".cl-static-search-result, .result-row, li.cl-search-result")
for row in result_rows:
listing = self._parse_single_listing(row, city_code, category)
if listing:
listings.append(listing)
return listings
def _parse_single_listing(self, row, city_code, category):
"""Parse a single listing row from search results."""
listing = {
"city": city_code,
"category": category,
"scraped_at": datetime.now().isoformat(),
}
# Title and URL
title_link = row.select_one("a.titlestring, a.result-title, a.posting-title")
if title_link:
listing["title"] = title_link.get_text(strip=True)
listing["url"] = title_link.get("href", "")
if listing["url"] and not listing["url"].startswith("http"):
listing["url"] = f"https://{city_code}.craigslist.org{listing['url']}"
else:
return None
# Price
price_el = row.select_one(".priceinfo, .result-price, span.price")
if price_el:
price_text = price_el.get_text(strip=True)
listing["price"] = self._clean_price(price_text)
listing["price_raw"] = price_text
else:
listing["price"] = None
listing["price_raw"] = None
# Location/neighborhood
hood_el = row.select_one(".result-hood, .neighborhood, .meta .subreddit")
listing["neighborhood"] = hood_el.get_text(strip=True).strip("() ") if hood_el else None
# Date
date_el = row.select_one("time, .result-date, .meta .date")
if date_el:
listing["posted_date"] = date_el.get("datetime") or date_el.get_text(strip=True)
else:
listing["posted_date"] = None
# Extract listing ID from URL
if listing.get("url"):
id_match = re.search(r"/(\d+)\.html", listing["url"])
listing["listing_id"] = id_match.group(1) if id_match else None
return listing
@staticmethod
def _clean_price(price_text):
"""Extract numeric price from text like '$1,500'."""
match = re.search(r"[\d,]+\.?\d*", price_text.replace(",", ""))
return float(match.group()) if match else None
def scrape_listing_detail(self, url, city_code):
"""Scrape detailed information from a single listing page."""
proxy = self.proxy_manager.get_proxy_for_city(city_code)
try:
response = self.session.get(url, proxies=proxy, timeout=15)
if response.status_code != 200:
return None
soup = BeautifulSoup(response.text, "lxml")
detail = {"url": url}
# Full description
body = soup.select_one("#postingbody")
if body:
# Remove the "QR Code" link text
for tag in body.select(".print-information"):
tag.decompose()
detail["description"] = body.get_text(strip=True)
# Attributes (for housing: sqft, bedrooms, etc.)
attrs = soup.select(".attrgroup span")
detail["attributes"] = [a.get_text(strip=True) for a in attrs]
# Images
images = soup.select("#thumbs a")
detail["image_urls"] = [img.get("href") for img in images if img.get("href")]
detail["image_count"] = len(detail["image_urls"])
# Map coordinates
map_el = soup.select_one("#map")
if map_el:
detail["latitude"] = map_el.get("data-latitude")
detail["longitude"] = map_el.get("data-longitude")
# Posting info
post_info = soup.select_one(".postinginfos")
if post_info:
detail["posting_info"] = post_info.get_text(strip=True)
return detail
except Exception as e:
print(f"Detail scrape error: {e}")
            return None
Scraping Across Multiple Cities
The multi-city scraper coordinates data collection across all target cities:
class MultiCityScraper:
"""Coordinates scraping across multiple Craigslist cities."""
def __init__(self, scraper, cities=None):
self.scraper = scraper
self.cities = cities or CRAIGSLIST_CITIES
def scrape_national(self, category, max_per_city=200):
"""Scrape a category across all configured cities."""
national_data = []
city_list = list(self.cities.items())
random.shuffle(city_list) # Randomize order to distribute load
for city_name, city_code in tqdm(city_list, desc=f"Scraping {category}"):
try:
listings = self.scraper.scrape_category(
city_code, category, max_results=max_per_city
)
for listing in listings:
listing["city_name"] = city_name
national_data.extend(listings)
print(f"{city_name}: {len(listings)} listings")
except Exception as e:
print(f"Error scraping {city_name}: {e}")
# Delay between cities
time.sleep(random.uniform(5, 15))
return national_data
def housing_market_analysis(self, max_per_city=500):
"""Collect housing rental data across cities for market analysis."""
categories = {
"apa": "apartments",
"hou": "housing",
"roo": "rooms",
}
all_housing = []
for cat_code, cat_name in categories.items():
print(f"\nScraping {cat_name} listings...")
data = self.scrape_national(cat_code, max_per_city=max_per_city)
for listing in data:
listing["housing_type"] = cat_name
all_housing.extend(data)
return all_housing
def vehicle_market_analysis(self, max_per_city=300):
"""Collect vehicle listing data across cities."""
categories = {
"cta": "cars_trucks",
"mca": "motorcycles",
}
all_vehicles = []
for cat_code, cat_name in categories.items():
print(f"\nScraping {cat_name} listings...")
data = self.scrape_national(cat_code, max_per_city=max_per_city)
for listing in data:
listing["vehicle_type"] = cat_name
all_vehicles.extend(data)
        return all_vehicles
Analyzing the Collected Data
def analyze_housing_data(df):
"""Perform basic analysis on collected housing data."""
# Filter to listings with prices
priced = df[df["price"].notna() & (df["price"] > 0)].copy()
# City-level summary
city_summary = priced.groupby("city_name").agg({
"price": ["count", "mean", "median", "min", "max"],
}).round(2)
city_summary.columns = [
"listing_count", "avg_price", "median_price", "min_price", "max_price"
]
city_summary = city_summary.sort_values("median_price", ascending=False)
return city_summary
def find_price_outliers(df, std_multiplier=2):
"""Identify unusually priced listings that may represent deals or errors."""
priced = df[df["price"].notna() & (df["price"] > 0)].copy()
city_stats = priced.groupby("city_name")["price"].agg(["mean", "std"])
outliers = []
for _, row in priced.iterrows():
city = row["city_name"]
if city in city_stats.index:
mean = city_stats.loc[city, "mean"]
std = city_stats.loc[city, "std"]
if std > 0 and abs(row["price"] - mean) > std_multiplier * std:
row_dict = row.to_dict()
row_dict["z_score"] = (row["price"] - mean) / std
outliers.append(row_dict)
    return pd.DataFrame(outliers)
Running the Complete Pipeline
def main():
proxies = [
"http://user:pass@proxy1.example.com:8080",
"http://user:pass@proxy2.example.com:8080",
"http://user:pass@proxy3.example.com:8080",
"http://user:pass@proxy4.example.com:8080",
"http://user:pass@proxy5.example.com:8080",
]
proxy_manager = CraigslistProxyManager(proxies)
scraper = CraigslistScraper(proxy_manager)
multi_city = MultiCityScraper(scraper)
# Scrape apartment listings nationally
housing_data = multi_city.scrape_national("apa", max_per_city=200)
housing_df = pd.DataFrame(housing_data)
housing_df.to_csv("craigslist_apartments_national.csv", index=False)
# Analyze
if not housing_df.empty:
summary = analyze_housing_data(housing_df)
print("\nHousing Market Summary by City:")
print(summary.to_string())
summary.to_csv("craigslist_housing_summary.csv")
outliers = find_price_outliers(housing_df)
if not outliers.empty:
print(f"\nFound {len(outliers)} price outliers")
outliers.to_csv("craigslist_price_outliers.csv", index=False)
# Scrape vehicle listings
vehicle_data = multi_city.vehicle_market_analysis(max_per_city=200)
vehicle_df = pd.DataFrame(vehicle_data)
vehicle_df.to_csv("craigslist_vehicles_national.csv", index=False)
print(f"\nTotal housing listings: {len(housing_data)}")
print(f"Total vehicle listings: {len(vehicle_data)}")
if __name__ == "__main__":
    main()
Proxy Strategy for Multi-City Scraping
Craigslist's geographic distribution maps naturally onto a proxy rotation strategy:
Assign proxies per city. By keeping one proxy dedicated to one city’s subdomain, you reduce the per-IP request volume on each subdomain. This is more effective than randomly rotating proxies across all cities.
Geographic proxy matching. When possible, use mobile proxies from geographic regions that match the Craigslist cities you are scraping. Requests from a New York mobile IP to newyork.craigslist.org appear more natural than requests from a foreign IP.
Stagger city scraping. Do not scrape all cities simultaneously. Process them sequentially or in small batches with randomized delays between cities. This distributes the load over time and reduces the chance of triggering site-wide rate limits.
Monitor for blocks. Craigslist blocks manifest as HTTP 403 responses or CAPTCHA redirect pages. Implement automatic proxy rotation when these are detected, and add exponential backoff before retrying.
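A minimal sketch of that rotate-and-back-off pattern, reusing the CraigslistScraper and CraigslistProxyManager classes defined earlier; max_retries and base_delay are illustrative values, not tuned recommendations:

import time
import requests

def fetch_with_backoff(scraper, proxy_manager, city_code, url, max_retries=4, base_delay=5):
    """Fetch one URL, rotating the city's proxy and backing off exponentially on blocks."""
    proxy = proxy_manager.get_proxy_for_city(city_code)
    for attempt in range(max_retries):
        try:
            response = scraper.session.get(url, proxies=proxy, timeout=15)
            if response.status_code == 200:
                return response
        except requests.RequestException:
            pass  # treat network failures like blocks: rotate and retry
        # Blocked or failed: switch proxies and wait twice as long each attempt.
        proxy = proxy_manager.rotate_city_proxy(city_code)
        time.sleep(base_delay * (2 ** attempt))
    return None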
Data Quality and Cleaning
Craigslist data requires significant cleaning:
- Prices may be entered inconsistently (e.g., “$1500” vs “$1,500/mo” vs “$15” for an item worth $1,500)
- Listings may be duplicated across nearby city subdomains
- Spam and scam listings inflate certain categories
- Neighborhood names are user-entered and inconsistent
- Date formats may vary between the old and new Craigslist interfaces
Filter outliers aggressively and validate prices against expected ranges for each category.
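A minimal cleaning sketch along those lines, run against the DataFrame produced by the pipeline above; the price bounds are illustrative assumptions for apartment rentals and should be adjusted per category:

import pandas as pd

def clean_listings(df, min_price=300, max_price=20000):
    """Deduplicate cross-posted listings and drop implausible prices (bounds are assumptions)."""
    cleaned = df[df["price"].notna()].copy()
    cleaned = cleaned[cleaned["price"].between(min_price, max_price)]
    # Cross-posted listings share a listing_id across nearby subdomains,
    # so keep only the first occurrence of each ID (rows without an ID are kept).
    has_id = cleaned["listing_id"].notna()
    deduped = cleaned[has_id].drop_duplicates(subset="listing_id")
    return pd.concat([deduped, cleaned[~has_id]], ignore_index=True)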
Conclusion
Craigslist’s geo-distributed architecture makes it a unique scraping target that rewards careful proxy management and city-by-city data collection. The platform’s relatively simple HTML structure means the technical parsing is straightforward; the challenge lies in scale and rate limit management.
With a properly sized mobile proxy pool and per-city proxy assignment, you can build a comprehensive national Craigslist dataset for housing market analysis, vehicle pricing research, or job market intelligence. For related scraping techniques, explore our web scraping proxy guides and the proxy glossary for technical definitions.
Related Reading
- How to Scrape Amazon Product Data with Proxies: 2026 Python Guide
- How to Scrape Bing Search Results with Python and Proxies
- aiohttp + BeautifulSoup: Async Python Scraping
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)
- API vs Web Scraping: When You Need Proxies (and When You Don’t)
- ASEAN Data Protection Laws: A Web Scraping Compliance Matrix