Scraping real estate listings: Zillow, Redfin, Rightmove

Scrape real estate listings and you tap into one of the most analytically rich datasets on the public web. Zillow alone publishes more than 135 million U.S. property records, Redfin overlaps with a similar national footprint, and Rightmove dominates the UK with more than a million active listings. Each platform exposes a different schema, a different anti-bot posture, and a different update cadence, but the core analytical questions are the same: what is on the market, how is it priced, how long does it sit, and how do those signals trend over time. The scraping landscape is shaped by three things: aggressive bot detection on Zillow specifically, an MLS-feed structure that constrains what can legally be republished, and geographic specificity in URL patterns that requires per-region scraping rather than a single global crawl.

This guide covers the U.S. (Zillow and Redfin) and the UK (Rightmove). The patterns transfer to Realtor.com, Trulia, and continental European portals like ImmoScout24 with minor adjustments.

Source taxonomy and listing identifiers

Each platform uses its own identifier scheme but they all anchor on a property address. Zillow uses a numeric zpid (Zillow Property ID) that persists for the life of the property. Redfin uses a numeric propertyId. Rightmove uses a numeric propertyId with a different namespace. Cross-platform deduplication relies on the canonical address rather than any platform-specific ID.

Address normalization is the first hard problem in real estate scraping. The same property can be listed as “123 Main St, Apt 4B, New York, NY 10001” or “123 Main Street #4B, New York 10001” or “123 MAIN ST APT 4B NEW YORK NY 10001”. Use a structured address parser like libpostal or the USPS API to normalize before deduplication.
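The normalization step can be sketched with a small rule-based function. This is only a stand-in for libpostal or the USPS API, which handle far more cases; the function name normalize_address and the abbreviation table are illustrative assumptions, not part of either tool.

```python
import re

# Illustrative abbreviation table; a real parser such as libpostal covers far more.
_ABBREVIATIONS = {
    "street": "st",
    "avenue": "ave",
    "road": "rd",
    "drive": "dr",
    "boulevard": "blvd",
    "apartment": "apt",
    "#": "apt",
}

def normalize_address(raw: str) -> str:
    """Lowercase, strip punctuation, and collapse common street/unit variants."""
    text = raw.lower()
    text = text.replace("#", " # ")      # split "#4B" into "#" and "4b"
    text = re.sub(r"[.,]", " ", text)    # drop commas and periods
    tokens = [_ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)
```

With this sketch the first and third variants above normalize to the same string, but the second still differs because it omits the state. Gaps like that are why a structured parser that resolves city, state, and postal code is the better long-term choice.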

import httpx

ZILLOW_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept": "application/json",
    "Accept-Language": "en-US,en;q=0.9",
}

async def fetch_zillow_property(zpid: int, proxy: str):
    url = f"https://www.zillow.com/graphql/?zpid={zpid}"
    payload = {
        "operationName": "FullPropertyDetailsQuery",
        "variables": {"zpid": zpid},
        "query": "query FullPropertyDetailsQuery($zpid: ID!) { property(zpid: $zpid) { zpid streetAddress city state zipcode price beds baths livingArea zestimate rentZestimate yearBuilt homeStatus } }",
    }
    async with httpx.AsyncClient(proxy=proxy, headers=ZILLOW_HEADERS, timeout=20) as c:
        r = await c.post(url, json=payload)
        if r.status_code == 200:
            return r.json().get("data", {}).get("property")
        return None

Zillow’s GraphQL endpoint is the canonical access path for structured property data. The response includes the canonical fields plus zestimate, rent zestimate, school assignments, and historical price data. Zillow gates access aggressively, so this endpoint requires U.S. residential proxies for sustained scraping.

Proxy strategy for the major platforms

Zillow has the most aggressive bot detection of any major real estate platform. The site fingerprints TLS, HTTP/2 frames, header order, and behavioral patterns. U.S. residential or mobile IPs are required, and even with clean IPs, request rates above 1 per 5 seconds per IP trigger soft blocks. Plan for a substantial proxy budget if you intend to scrape Zillow at scale.

Redfin is moderately defended. U.S. datacenter IPs sometimes work for short bursts, but residential is strongly preferred for sustained operation. Rate limits are similar to Zillow but the challenge frequency is lower.

Rightmove is the lightest of the three. UK datacenter IPs work for moderate workloads, and UK residential pools handle high-volume scraping comfortably. The platform does enforce rate limits per IP but does not aggressively challenge.

Platform | Recommended proxy | Tolerance per IP | Notes
Zillow | U.S. residential or mobile | 12 req/min | GraphQL endpoint, very strict
Redfin | U.S. residential | 30 req/min | API and HTML both work
Rightmove | UK residential or datacenter | 60 req/min | Public listing URLs
Realtor.com | U.S. residential | 30 req/min | Similar to Redfin
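One way to respect the per-IP tolerances in the table is to space requests evenly for each platform and IP pair. The sketch below is a minimal illustration; the class name PerIPThrottle, the platform keys, and the injectable clock are assumptions made for the example.

```python
import time

# Per-IP tolerances from the table above, in requests per minute.
TOLERANCE_PER_MIN = {"zillow": 12, "redfin": 30, "rightmove": 60, "realtor": 30}

class PerIPThrottle:
    """Tracks the earliest time each (platform, ip) pair may send its next request."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._next_allowed = {}

    def wait_time(self, platform: str, ip: str) -> float:
        """Seconds to wait before the next request; 0.0 means send now."""
        now = self._clock()
        return max(0.0, self._next_allowed.get((platform, ip), now) - now)

    def mark_sent(self, platform: str, ip: str) -> None:
        """Reserve the next slot, spacing requests evenly across the minute."""
        interval = 60.0 / TOLERANCE_PER_MIN[platform]
        now = self._clock()
        start = max(now, self._next_allowed.get((platform, ip), now))
        self._next_allowed[(platform, ip)] = start + interval
```

In an async scraper you would sleep for wait_time(...) seconds before each request and call mark_sent(...) immediately after sending it.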

Geographic crawl strategy

Real estate is hyperlocal, which means the natural unit of crawling is a geographic region (a city, a county, a school district, or a postcode). Each platform exposes region-level listing endpoints that you can iterate to enumerate active properties.

For Zillow, the region search endpoint accepts a polygon or a city name and returns listings within that region. For Rightmove, the search URL takes a locationIdentifier parameter that maps to their internal geographic taxonomy. Rightmove publishes the location identifier dictionary at a static endpoint that you can cache.

async def search_rightmove(location_id: str, page: int, proxy: str):
    url = "https://www.rightmove.co.uk/api/_search"
    params = {
        "locationIdentifier": location_id,
        "numberOfPropertiesPerPage": 24,
        "index": (page - 1) * 24,
        "sortType": 6,
        "channel": "BUY",
    }
    async with httpx.AsyncClient(proxy=proxy, timeout=20) as c:
        r = await c.get(url, params=params, headers={"User-Agent": "Mozilla/5.0"})
        if r.status_code == 200:
            return r.json().get("properties", [])
        return []

For comprehensive U.S. coverage, decompose by ZIP code. There are roughly 33,000 ZIP codes in the U.S., and a daily snapshot at the ZIP level gives you complete national coverage. For UK coverage, decompose by Rightmove location identifier (roughly 50,000 nodes). For most analytical use cases, the major metro areas (top 50 U.S. metros, top 20 UK regions) capture 80%+ of the meaningful market activity.
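The region counts above translate into a rough lower bound on proxy pool size. The sketch below is a back-of-the-envelope calculation; the function name and the figure of three result pages per region are assumptions, and the result ignores retries, cooldowns, and detail-page fetches.

```python
import math

def min_ips_for_sweep(regions: int, pages_per_region: int,
                      req_per_min_per_ip: int, window_hours: float) -> int:
    """Minimum IPs needed to finish one sweep inside the window at the per-IP tolerance."""
    total_requests = regions * pages_per_region
    per_ip_capacity = req_per_min_per_ip * 60 * window_hours
    return math.ceil(total_requests / per_ip_capacity)

# Assumed: 3 result pages per region, one sweep per 24 hours.
zillow_ips = min_ips_for_sweep(33_000, 3, 12, 24)      # 99,000 requests at 12 req/min
rightmove_ips = min_ips_for_sweep(50_000, 3, 60, 24)   # 150,000 requests at 60 req/min
```

Under these assumptions the floor is 6 IPs for a daily Zillow sweep and 2 for Rightmove. Real pools need to be considerably larger, because this assumes every IP runs at its limit around the clock without ever being challenged.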

Schema for cross-platform property snapshots

CREATE TABLE property_listing_snapshot (
    snapshot_at TIMESTAMP NOT NULL,
    canonical_address_hash VARCHAR(64) NOT NULL,
    source VARCHAR(16) NOT NULL,
    platform_id VARCHAR(64) NOT NULL,
    asking_price DECIMAL(14,2),
    currency VARCHAR(3),
    beds INT,
    baths DECIMAL(4,1),
    living_area_sqft INT,
    list_date DATE,
    days_on_market INT,
    status VARCHAR(16),
    PRIMARY KEY (snapshot_at, canonical_address_hash, source)
);

The canonical_address_hash is a SHA-256 of the normalized address tuple. This lets you deduplicate the same property across multiple platforms without storing the raw address as the join key. For analytics that need address text (rare), you can join back to a separate address dictionary table.
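A minimal sketch of the hash follows, assuming the normalized tuple is (street, unit, city, region, postal code). The field order, the separator, and the function signature are assumptions; what matters is that every platform adapter uses the same ones.

```python
import hashlib

def canonical_address_hash(street: str, unit: str, city: str,
                           region: str, postal_code: str) -> str:
    """SHA-256 hex digest (64 characters) of the normalized address tuple."""
    parts = [street, unit, city, region, postal_code]
    # Lowercase and collapse whitespace so trivial variants hash identically.
    cleaned = [" ".join(p.lower().split()) for p in parts]
    # Join with a unit separator so field boundaries cannot collide.
    key = "\x1f".join(cleaned)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```

The hex digest is exactly 64 characters, which matches the VARCHAR(64) column in the schema.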

Days-on-market and price-history derivation

The two most analytically valuable derived metrics are days-on-market and price-history. Most platforms expose days-on-market directly, but the value drifts over time because listings get re-listed or status-changed. The cleanest approach is to derive your own days-on-market from the first appearance of a property in your snapshots.
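Deriving days-on-market from first appearance can be sketched as follows. The input shape, a list of (address hash, snapshot date) pairs, is an assumption made for the example.

```python
from datetime import date

def derived_days_on_market(snapshots: list[tuple[str, date]],
                           as_of: date) -> dict[str, int]:
    """Days since each property first appeared in our own snapshots."""
    first_seen: dict[str, date] = {}
    for address_hash, seen_on in snapshots:
        # Keep the earliest date on which each property was observed.
        if address_hash not in first_seen or seen_on < first_seen[address_hash]:
            first_seen[address_hash] = seen_on
    return {h: (as_of - d).days for h, d in first_seen.items()}
```

Because the key is the address hash rather than the listing ID, a re-listed property keeps its original first-seen date instead of resetting to zero.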

For price-history, compare the asking_price across consecutive snapshots and emit a price-change event whenever it differs by more than 0.5% (to filter rounding noise). Aggregating price-change events at the metro level reveals the headline real estate market story far more clearly than averaging the listed prices.
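The price-change rule can be sketched as a pass over one property's time-ordered observations. The tuple layout and the function name are assumptions; the 0.5% threshold comes from the text above.

```python
def price_change_events(observations, threshold: float = 0.005):
    """Yield-style list of price changes for one property.

    observations: (snapshot_at, asking_price) pairs sorted by time.
    Returns (snapshot_at, old_price, new_price, relative_change) tuples.
    """
    events = []
    prev_price = None
    for snapshot_at, price in observations:
        if price is None:
            continue  # skip snapshots where the price failed to parse
        if prev_price is not None and prev_price > 0:
            change = (price - prev_price) / prev_price
            # Only emit an event when the move exceeds the noise threshold.
            if abs(change) > threshold:
                events.append((snapshot_at, prev_price, price, change))
        prev_price = price  # always compare against the previous snapshot
    return events
```

Negative values in the last tuple position are reductions, so the same output feeds both a price-cut counter and a price-increase counter.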

For broader pattern guidance on real estate scraping, see our residential proxy provider ranking and our headless browser frameworks ranking.

Detecting and routing around bot challenges

When Zillow and similar real estate sources flag your traffic, the response is usually a Cloudflare or vendor interrogation page rather than a clean HTTP error. Your scraper needs to detect this content swap explicitly. Look for the cf-mitigated response header, the presence of __cf_chl_ cookies, or HTML containing the text “Just a moment...”.

def is_challenged(response) -> bool:
    if response.status_code in (403, 503):
        return True
    if "cf-mitigated" in response.headers:
        return True
    if "__cf_chl_" in response.headers.get("set-cookie", ""):
        return True
    body = response.text[:2000].lower()
    return "just a moment" in body or "checking your browser" in body

When you detect a challenge, do not retry on the same IP for at least 30 minutes. Mark that IP as cooling and route subsequent requests to a different IP in your pool. Aggressive retries on a flagged IP cause the cooling window to extend and can lead to long-term blacklisting of your subnet. For pages that absolutely must be fetched, have a fallback path that uses a headless browser. Most production setups maintain a 95/5 split between the lightweight HTTP path and the browser fallback path.
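The cooling rule can be sketched as a small pool that skips any IP challenged in the last 30 minutes. The class name CoolingPool and the injectable clock are assumptions made for the example.

```python
import time

COOLDOWN_SECONDS = 30 * 60  # minimum rest after a challenge, per the text above

class CoolingPool:
    """Hands out IPs while skipping any that were recently challenged."""

    def __init__(self, ips, clock=time.monotonic):
        self._ips = list(ips)
        self._clock = clock
        self._cooling_until = {}

    def mark_challenged(self, ip: str) -> None:
        """Start (or restart) the cooldown window for this IP."""
        self._cooling_until[ip] = self._clock() + COOLDOWN_SECONDS

    def pick(self):
        """Return the first IP that is not cooling, or None if all are."""
        now = self._clock()
        for ip in self._ips:
            if ip not in self._cooling_until or self._cooling_until[ip] <= now:
                return ip
        return None
```

A None result is the signal to pause the job or hand the URL to the headless browser fallback rather than retrying on a flagged IP.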

Operational monitoring and alerting

Every production scraper needs three monitoring layers regardless of vertical. The first is per-IP success rate over a 5-minute window, alerting if any IP drops below 80%. The second is parser error rate, alerting if more than 1% of fetched pages fail to extract the canonical fields. The third is data freshness, alerting if your downstream consumers see snapshots more than 24 hours old.

import time
from collections import deque

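class IPHealthTracker:
    """Rolling per-IP success rate over a fixed time window."""

    def __init__(self, window_seconds: int = 300):
        self.window = window_seconds
        self.events = {}  # ip -> deque of (timestamp, success)

    def _prune(self, bucket: deque, now: float):
        # Drop events that have aged out of the window.
        while bucket and bucket[0][0] < now - self.window:
            bucket.popleft()

    def record(self, ip: str, success: bool):
        bucket = self.events.setdefault(ip, deque())
        now = time.time()
        bucket.append((now, success))
        self._prune(bucket, now)

    def success_rate(self, ip: str) -> float:
        bucket = self.events.get(ip)
        if bucket:
            # Prune on read as well, so an idle IP is not judged on stale events.
            self._prune(bucket, time.time())
        if not bucket:
            return 1.0  # no recent data: treat the IP as healthy
        successes = sum(1 for _, ok in bucket if ok)
        return successes / len(bucket)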

Wire this into Prometheus or your existing observability stack so the on-call engineer sees IP degradation as it happens rather than after the daily snapshot fails. For long-running operations, IP rotation triggered by the health tracker is more reliable than fixed rotation schedules.
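The other two monitoring layers, parser error rate and data freshness, can be sketched in a few lines. The list of required fields is an assumption based on the canonical fields named elsewhere in this guide; adjust it to whatever your parser must always produce.

```python
from datetime import datetime, timedelta

# Assumed set of canonical fields that every parsed record must carry.
REQUIRED_FIELDS = ("asking_price", "beds", "baths", "living_area_sqft")

def parser_error_rate(records: list[dict]) -> float:
    """Share of records missing at least one required field."""
    if not records:
        return 0.0
    failed = sum(
        1 for r in records
        if any(r.get(field) is None for field in REQUIRED_FIELDS)
    )
    return failed / len(records)

def is_stale(latest_snapshot_at: datetime, now: datetime,
             max_age: timedelta = timedelta(hours=24)) -> bool:
    """True when the newest snapshot is older than the freshness budget."""
    return now - latest_snapshot_at > max_age
```

Alert when parser_error_rate(...) exceeds 0.01 or when is_stale(...) returns True, matching the 1% and 24-hour thresholds described above.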

Pipeline orchestration and scheduling

For any non-trivial real estate scraping operation, a dedicated orchestration layer is the difference between a script you babysit and a service that runs unattended. The two strong open-source choices in 2026 are Prefect 3 and Dagster.

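from prefect import flow, task

@task(retries=3, retry_delay_seconds=60)
def fetch_source(source_id: str, page: int):
    # crawl_one_page is a placeholder for your own fetch-and-parse function.
    return crawl_one_page(source_id, page)

@flow(name="real-estate-daily-sweep")
def daily_sweep(source_ids: list[str]):
    futures = []
    for sid in source_ids:
        # range(1, 30) covers pages 1 through 29; adjust to the source's page depth.
        for page in range(1, 30):
            futures.append(fetch_source.submit(sid, page))
    # Block until every submitted task has finished and collect the results.
    return [f.result() for f in futures]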

Run the flow on a cadence aligned to how dynamic the underlying data is. For real estate where records change intraday, a 4-6 hour cadence catches meaningful movements. For longer-cycle data, daily is sufficient.

Data quality monitoring patterns

Beyond per-IP success rate, every snapshot should pass a small battery of data quality checks before being considered authoritative. Structural checks verify that every required field is present and of the expected type. Distributional checks compare the current snapshot against recent history. Semantic checks compare related fields for consistency.

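def quality_check(snapshot: list[dict]) -> list[str]:
    errors = []
    if not snapshot:
        errors.append("empty snapshot")
        return errors
    # get_yesterday_avg_size is a placeholder for a lookup against your snapshot history.
    avg_yesterday = get_yesterday_avg_size()
    # Distributional check: flag a drop of more than 30% in record count.
    if len(snapshot) < avg_yesterday * 0.7:
        errors.append(f"snapshot size {len(snapshot)} is more than 30% below yesterday's average")
    return errors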

Run quality checks as a separate flow that gates promotion of the snapshot from staging to production. A snapshot that fails quality checks should be quarantined for human review.
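The structural and semantic checks described above can be sketched per row. The expected-type table covers only a few columns of the snapshot schema, and the snapshot_date key is an assumption (a date derived from snapshot_at); both are illustrative rather than exhaustive.

```python
# Expected types for a subset of the snapshot schema columns (illustrative).
EXPECTED_TYPES = {
    "canonical_address_hash": str,
    "source": str,
    "platform_id": str,
    "asking_price": (int, float),
}

def structural_errors(row: dict) -> list[str]:
    """Report required fields that are missing or have the wrong type."""
    errors = []
    for field, expected in EXPECTED_TYPES.items():
        value = row.get(field)
        if value is None:
            errors.append(f"missing {field}")
        elif not isinstance(value, expected):
            errors.append(f"bad type for {field}")
    return errors

def semantic_errors(row: dict) -> list[str]:
    """Report related fields that are implausible or contradict each other."""
    errors = []
    price = row.get("asking_price")
    if isinstance(price, (int, float)) and price <= 0:
        errors.append("non-positive asking_price")
    # Works for date objects or ISO-8601 strings, as long as both use the same type.
    list_date, snapshot_date = row.get("list_date"), row.get("snapshot_date")
    if list_date and snapshot_date and list_date > snapshot_date:
        errors.append("list_date is after snapshot date")
    return errors
```

A gating flow would run both functions over every row and quarantine the snapshot if the share of rows with errors crosses a threshold you choose.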

Cost optimization strategies

Proxy bandwidth is usually the dominant cost in a production scraping operation. Three optimization patterns consistently reduce cost without hurting data quality. The first is request deduplication: serve cached responses when consumers ask for the same record within the same hour. The second is conditional GET using ETag or If-Modified-Since headers. The third is selective field hydration when the upstream API supports it.

For workloads above 100 GB of monthly proxy bandwidth, these three optimizations together reduce cost by 40-60% without changing the analytical output.
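The first optimization, request deduplication, can be sketched as a small time-to-live cache keyed by record. The class name TTLCache, the key format, and the injectable clock are assumptions; a production setup would more likely use Redis or a similar shared store.

```python
import time

class TTLCache:
    """In-memory cache whose entries expire after a fixed number of seconds."""

    def __init__(self, ttl_seconds: float = 3600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        """Return the cached value, or None if absent or expired."""
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # evict the expired entry
            return None
        return value

    def put(self, key, value) -> None:
        """Store a value that stays valid for ttl_seconds from now."""
        self._store[key] = (self._clock() + self._ttl, value)
```

Check the cache before spending proxy bandwidth: only when get(...) returns None does the scraper fetch the record and call put(...) with the result.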

End-to-end pipeline architecture

A production-grade scraping pipeline has four layers: collection, parsing, storage, and serving. The collection layer handles the network conversation and knows nothing about data shape. The parsing layer transforms raw bytes into structured records and owns the schema. The storage layer holds the canonical snapshots in a query-optimized format like DuckDB or ClickHouse. The serving layer exposes the data to consumers and should be denormalized and pre-aggregated.

Decoupling these layers enables independent scaling. The collection layer is bound by proxy capacity. The parsing layer is CPU-bound. The storage layer is bound by I/O. The serving layer is bound by query concurrency.

Legal and compliance considerations

Public real estate data is generally treated as fair to scrape in most jurisdictions, but always confine your collection to non-personal data: identifiers, structured attributes, and aggregates. Avoid collecting personally identifying details, and avoid pulling any data behind a login.

For commercial deployment, document your basis for processing, your data retention period, and your purpose limitation. The purpose-limitation and storage-limitation principles in the GDPR, and similar published frameworks, remain useful starting points for documenting your approach.

Sample analytics queries

-- Volume trend over the last 30 days
SELECT date_trunc('day', snapshot_at) AS day, COUNT(*) AS records
FROM property_listing_snapshot
WHERE snapshot_at > now() - interval '30 days'
GROUP BY 1 ORDER BY 1;

-- New properties first seen in the last 14 days
SELECT canonical_address_hash, MIN(snapshot_at) AS first_seen
FROM property_listing_snapshot
GROUP BY canonical_address_hash
HAVING MIN(snapshot_at) > now() - interval '14 days';

-- Source distribution
SELECT source, COUNT(*) AS records
FROM property_listing_snapshot
WHERE snapshot_at > now() - interval '7 days'
GROUP BY source
ORDER BY records DESC;

Add a category share view, a source concentration view, and a price-volatility view (where applicable) and you have a solid foundation for a real estate intelligence product.

Versioning your scraper for source evolution

Every real estate source evolves its schema regularly. New fields appear, old fields are deprecated, and display logic changes. Stamp every snapshot row with the scraper version that produced it. Downstream analytics can filter by version when they need consistent semantics across a time range. Pair this with a small registry table that documents what each scraper version did differently so debugging unexpected metric jumps becomes tractable.

Building a metro-level dashboard from the dataset

The most common analytical product on top of real estate scraping is a metro-level market-temperature dashboard. The dashboard tracks median asking price, median days-on-market, inventory count, and price-change-event count for each major metro on a daily basis. Across the top 100 U.S. metros, four metrics amount to only 400 data points per day, which is trivially storable and queryable.

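def daily_metro_summary(snap_df, metro_col):
    # snap_df is a pandas DataFrame of snapshot rows; snapshot_date is assumed
    # to be a date column derived from snapshot_at before calling this.
    return snap_df.groupby([metro_col, 'snapshot_date']).agg(
        median_price=('asking_price', 'median'),
        active_count=('canonical_address_hash', 'nunique'),
        median_dom=('days_on_market', 'median'),
    )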

The headline metric for most consumer-facing products is the median asking price trend. The headline metric for most professional-facing products is days-on-market because it leads price changes by 4-6 weeks. Build both views and let the consumer choose.

For richer analytics, layer in school district overlays, walkability scores, and crime data. Each of these is its own scraping or licensing challenge but the combined dataset is dramatically more valuable than property listings alone.

Cross-platform pricing arbitrage signal

When the same property is listed on Zillow, Redfin, and the listing brokerage’s own site at different prices, you have a small but meaningful signal of either listing-staleness or active price-testing by the agent. Track the cross-platform price-spread per property as a derived metric and surface the largest spreads to subscribers as alerts.

The pattern works particularly well in the UK where Rightmove and Zoopla often show different prices for the same property because different agents post the property on different platforms. The price spread itself is uninteresting, but the time-derivative of the spread (how the spread changes when one platform updates and the other lags) is a strong indicator of which platform is the agent’s primary channel.
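The spread and its change between snapshots can be sketched as follows. The function names and the dict-of-prices input are assumptions; the spread is expressed relative to the lowest listed price.

```python
from typing import Optional

def price_spread(prices_by_source: dict) -> Optional[float]:
    """Relative spread (max - min) / min across platforms for one property."""
    prices = [p for p in prices_by_source.values() if p is not None and p > 0]
    if len(prices) < 2:
        return None  # a spread needs at least two platforms
    return (max(prices) - min(prices)) / min(prices)

def spread_delta(previous: dict, current: dict) -> Optional[float]:
    """Change in spread between two consecutive snapshots of the same property."""
    before, after = price_spread(previous), price_spread(current)
    if before is None or after is None:
        return None
    return after - before
```

A spread that opens when one platform updates and then closes a snapshot or two later suggests that the platform which moved first is the agent's primary channel.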

International real estate scraping notes

Outside the U.S. and UK, the dominant real estate portals shift but the patterns transfer. ImmoScout24 dominates Germany, Idealista dominates Spain and Italy, SeLoger dominates France, and PropertyGuru dominates Singapore and Malaysia. Each has its own anti-bot posture and its own URL structure, but the canonical fields (address, price, beds, baths, area, list date) are universal.

For multi-country pipelines, build a per-country adapter pattern with a shared canonical schema. The shared schema is the integration point; the adapters handle the source-specific quirks. This pattern scales cleanly to 10-20 countries without becoming an unmaintainable mess.
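A minimal sketch of the adapter pattern follows. The adapter shown is hypothetical: its source name and raw field names (id, price_eur, area_sqm, and so on) are invented for illustration and do not describe any real portal's payload.

```python
from dataclasses import dataclass
from typing import Optional, Protocol

SQM_TO_SQFT = 10.7639  # square metres to square feet

@dataclass
class CanonicalListing:
    """Shared schema that every country adapter must produce."""
    source: str
    platform_id: str
    address: str
    asking_price: Optional[float]
    currency: str
    beds: Optional[int]
    baths: Optional[float]
    living_area_sqft: Optional[int]

class SourceAdapter(Protocol):
    """Interface each per-country adapter implements."""
    source: str
    def to_canonical(self, raw: dict) -> CanonicalListing: ...

class ExampleGermanAdapter:
    """Hypothetical adapter; the raw field names are invented for illustration."""
    source = "example_de"

    def to_canonical(self, raw: dict) -> CanonicalListing:
        area_sqm = raw.get("area_sqm")
        return CanonicalListing(
            source=self.source,
            platform_id=str(raw["id"]),
            address=raw["address"],
            asking_price=raw.get("price_eur"),
            currency="EUR",
            beds=raw.get("bedrooms"),
            baths=raw.get("bathrooms"),
            # Convert units here so downstream code only ever sees square feet.
            living_area_sqft=round(area_sqm * SQM_TO_SQFT) if area_sqm else None,
        )
```

Unit and currency quirks stay inside the adapter, so adding a country means writing one new class rather than touching the storage or analytics layers.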

Common pitfalls when scraping real estate listings

Three issues account for most production incidents. The first is listing-status staleness. MLS data flows through aggregators (Zillow, Realtor.com, Redfin) with 15-60 minute lags. A listing marked Active on the public site can be Pending or Closed in the underlying MLS feed. For absorption-rate or days-on-market analytics, cross-check the listing status against the public records layer rather than trusting the portal status field alone.

The second is duplicate-listing inflation. The same property frequently appears with different listing IDs across portals when the listing agreement changes (relisting after expiration). A scraper that deduplicates by listing id double-counts the property. Use the parcel ID or address hash as the canonical property key and treat listing id as a child.

The third is square-footage source ambiguity. The portal can pull square footage from the MLS, the county tax record, the appraisal, or the listing agent’s input, and these often disagree by 5-20%. Capture which source produced the value (sqft_source field where available) and prefer county-record values for analytics that compare across markets.
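The fix for the second pitfall, keying on the property rather than the listing, can be sketched as a grouping step. The row keys follow the snapshot schema in this guide; the function name is an assumption.

```python
from collections import defaultdict

def group_listings_by_property(rows: list[dict]) -> dict:
    """Map each property (address hash) to the set of listings that refer to it."""
    by_property = defaultdict(set)
    for row in rows:
        # The listing ID is a child of the property, never the primary key.
        listing = (row["source"], row["platform_id"])
        by_property[row["canonical_address_hash"]].add(listing)
    return dict(by_property)
```

Inventory counts then come from the number of keys, so a property relisted under a new ID still counts once.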

FAQ

Are Zillow and Redfin listings legal to scrape?
Property listings are mostly aggregated from MLS feeds, which are licensed datasets with redistribution restrictions. Public listing pages are generally fair to scrape for analytical use, but redistributing the listings or building a competing portal raises licensing complexity. Confine your collection to non-personal data and consult counsel for commercial use cases.

What about Zestimate values? Are those reliable?
Zestimates are Zillow’s automated valuation model output. They have well-documented accuracy issues, especially in rural and atypical-property segments. For most analytical purposes the asking price and the actual sold price (when available) are stronger signals than the Zestimate.

How do I track sold properties vs. active listings?
Both Zillow and Redfin expose sold listings as a separate filter. Snapshot active and sold separately and join on the property identifier to compute time-from-list-to-sale. This metric is one of the strongest indicators of market temperature.

Can I scrape MLS data directly?
MLS data is licensed through regional Multiple Listing Services and requires an agent or broker membership. Direct MLS scraping is generally not feasible legally. Public-facing portals like Zillow and Redfin remain the realistic source for analytical work.

Does Rightmove expose sold-price history?
Yes. Rightmove has a separate sold-price section that exposes UK Land Registry data. The data is available without aggressive bot defenses and is dramatically more comprehensive than U.S. equivalents because the UK Land Registry publishes sold prices publicly.

Is scraping Zillow’s public site legal in 2026?
Scraping public listing data for analytical purposes is widely accepted in the U.S. following hiQ Labs v. LinkedIn and related rulings. Reselling raw listings or contact information falls into a different regulatory zone and requires direct MLS or syndication agreements.

How do I track price reductions over time on a single listing?
Snapshot daily and store every price observation with the scrape timestamp. Compute price-reduction events by detecting downward changes in the rolling daily series.

To build broader real estate intelligence pipelines, browse the ecommerce scraping category for tooling reviews and framework deep dives.
