How to Scrape Zillow Real Estate Data Using Python and Proxies

Zillow is the dominant real estate platform in the United States, hosting data on over 100 million properties, including listings, price estimates (Zestimates), tax records, and transaction histories. For real estate investors, analysts, and proptech companies, Zillow data is an invaluable resource for market analysis, property valuation, and investment decision-making.

Extracting this data at scale requires navigating Zillow’s anti-scraping protections, which have become increasingly sophisticated. This guide provides a complete Python framework for scraping Zillow property data using residential proxies and robust parsing techniques.

Why Proxies Are Essential for Zillow Scraping

Zillow employs multiple layers of anti-bot protection:

  • IP-based rate limiting: Zillow tracks request volume per IP and blocks addresses exceeding normal browsing patterns.
  • Bot detection scripts: Client-side JavaScript checks for automation tools, headless browsers, and unusual browser environments.
  • CAPTCHA challenges: Suspicious traffic triggers reCAPTCHA verification.
  • Dynamic content loading: Property details load asynchronously, requiring JavaScript execution.
  • API authentication: Zillow’s internal APIs use authentication tokens that expire and rotate.

Using rotating residential or mobile proxies ensures your requests originate from legitimate ISP-assigned IP addresses that Zillow cannot easily distinguish from real home buyers browsing listings.
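A simple way to rotate proxies from client code is to cycle through a pool of gateway URLs and build a fresh `proxies` dict per request. The endpoints below are placeholders for illustration; substitute your provider's actual gateway addresses:

```python
import itertools

# Hypothetical proxy endpoints -- substitute your provider's gateway URLs.
PROXY_POOL = [
    "http://user:pass@gateway1.example.com:8080",
    "http://user:pass@gateway2.example.com:8080",
    "http://user:pass@gateway3.example.com:8080",
]

_proxy_cycle = itertools.cycle(PROXY_POOL)

def next_proxies():
    """Return a requests-style proxies dict for the next proxy in the pool."""
    proxy = next(_proxy_cycle)
    return {"http": proxy, "https": proxy}
```

Many residential proxy providers also offer a single rotating endpoint that assigns a new exit IP per request, in which case a static `proxies` dict is enough.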

Setting Up Your Environment

pip install requests beautifulsoup4 lxml pandas selenium webdriver-manager

Understanding Zillow’s Data Structure

Zillow exposes property data through several channels:

  1. Search results pages: Listings matching location and filter criteria.
  2. Individual property pages: Detailed information for specific addresses.
  3. Internal API endpoints: JSON data that powers the dynamic page content.

The most reliable approach combines search page scraping for discovery with API endpoint scraping for detailed data.

Building the Zillow Scraper

Step 1: Configure Proxy and Session

import requests
from bs4 import BeautifulSoup
import json
import time
import random
import re
import pandas as pd

class ZillowScraper:
    """Scrape Zillow property listings with proxy rotation."""

    SEARCH_URL = "https://www.zillow.com/search/GetSearchPageState.htm"
    BASE_URL = "https://www.zillow.com"

    def __init__(self, proxy_url):
        self.session = requests.Session()
        self.session.proxies = {
            "http": proxy_url,
            "https": proxy_url,
        }
        self.session.headers.update({
            "User-Agent": (
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/120.0.0.0 Safari/537.36"
            ),
            "Accept": "*/*",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept-Encoding": "gzip, deflate, br",
            "Referer": "https://www.zillow.com/",
            "Origin": "https://www.zillow.com",
        })

    def _make_request(self, url, params=None, max_retries=3):
        """Make a request with retry logic."""
        for attempt in range(max_retries):
            try:
                response = self.session.get(url, params=params, timeout=20)

                if response.status_code == 200:
                    return response
                elif response.status_code == 403:
                    print(f"Blocked (403). Rotating proxy recommended. Attempt {attempt + 1}")
                    time.sleep(random.uniform(10, 20))
                elif response.status_code == 429:
                    print(f"Rate limited. Waiting... Attempt {attempt + 1}")
                    time.sleep(random.uniform(30, 60))
                else:
                    print(f"Status {response.status_code}, attempt {attempt + 1}")
                    time.sleep(random.uniform(3, 8))

            except requests.exceptions.RequestException as e:
                print(f"Request error: {e}")
                time.sleep(random.uniform(5, 10))

        return None

Step 2: Search Properties by Location

    def search_properties(self, location, num_pages=5, filters=None):
        """Search for property listings in a given location."""
        all_listings = []

        # First, get the search page to obtain region bounds
        search_url = f"{self.BASE_URL}/{location.lower().replace(' ', '-').replace(',', '')}"
        response = self._make_request(search_url)
        if not response:
            print("Failed to load search page")
            return []

        # Extract search state from the page
        html = response.text
        listings_from_page = self._extract_listings_from_html(html)
        all_listings.extend(listings_from_page)
        print(f"Page 1: Found {len(listings_from_page)} listings")

        # Try to get more via the API endpoint
        search_state = self._extract_search_state(html)
        if search_state:
            for page in range(2, num_pages + 1):
                search_state["pagination"] = {"currentPage": page}
                params = {
                    "searchQueryState": json.dumps(search_state),
                    "wants": json.dumps({
                        "cat1": ["listResults", "mapResults"],
                        "cat2": ["total"],
                    }),
                    "requestId": random.randint(1, 100),
                }

                response = self._make_request(self.SEARCH_URL, params=params)
                if response:
                    try:
                        data = response.json()
                        results = (
                            data.get("cat1", {})
                            .get("searchResults", {})
                            .get("listResults", [])
                        )
                        for result in results:
                            listing = self._parse_api_listing(result)
                            if listing:
                                all_listings.append(listing)
                        print(f"Page {page}: Found {len(results)} listings")
                    except json.JSONDecodeError:
                        print(f"Failed to parse API response on page {page}")

                time.sleep(random.uniform(3, 7))

        return all_listings

    def _extract_listings_from_html(self, html):
        """Extract property listings from the search results HTML."""
        listings = []
        soup = BeautifulSoup(html, "lxml")

        # Look for the JSON data embedded in the page
        scripts = soup.find_all("script", {"type": "application/json"})
        for script in scripts:
            try:
                data = json.loads(script.string)
                # Navigate the nested structure to find listings
                results = self._find_listings_in_json(data)
                for result in results:
                    listing = self._parse_api_listing(result)
                    if listing:
                        listings.append(listing)
            except (json.JSONDecodeError, TypeError):
                continue

        # Fallback: parse HTML directly
        if not listings:
            cards = soup.select("article[data-test='property-card']")
            for card in cards:
                listing = self._parse_html_card(card)
                if listing:
                    listings.append(listing)

        return listings

    def _find_listings_in_json(self, data, results=None):
        """Recursively search for listing results in nested JSON."""
        if results is None:
            results = []

        if isinstance(data, dict):
            if "zpid" in data and "price" in data:
                results.append(data)
            for value in data.values():
                self._find_listings_in_json(value, results)
        elif isinstance(data, list):
            for item in data:
                self._find_listings_in_json(item, results)

        return results

    def _extract_search_state(self, html):
        """Extract the search query state from page HTML."""
        # Non-greedy match is a heuristic: it can truncate the object early
        # if the embedded JSON contains a '},"' sequence inside a nested value.
        match = re.search(r'"searchQueryState":({.*?}),"', html)
        if match:
            try:
                return json.loads(match.group(1))
            except json.JSONDecodeError:
                pass
        return None

Step 3: Parse Property Data

    def _parse_api_listing(self, data):
        """Parse a listing from API response data."""
        if not data:
            return None

        listing = {
            "zpid": data.get("zpid") or data.get("id"),
            "address": data.get("address"),
            "price": data.get("price") or data.get("unformattedPrice"),
            "price_formatted": data.get("formattedPrice"),
            "bedrooms": data.get("beds"),
            "bathrooms": data.get("baths"),
            "sqft": data.get("area") or data.get("livingArea"),
            "lot_size": data.get("lotAreaValue"),
            "lot_unit": data.get("lotAreaUnit"),
            "property_type": data.get("homeType") or data.get("propertyType"),
            "listing_status": data.get("statusType") or data.get("homeStatus"),
            "zestimate": data.get("zestimate"),
            "rent_zestimate": data.get("rentZestimate"),
            "latitude": data.get("latitude") or (data.get("latLong") or {}).get("latitude"),
            "longitude": data.get("longitude") or (data.get("latLong") or {}).get("longitude"),
            "url": data.get("detailUrl"),
            "image_url": data.get("imgSrc"),
            "days_on_zillow": data.get("daysOnZillow"),
            "listing_agent": data.get("brokerName"),
        }

        # Clean up URL
        if listing["url"] and not listing["url"].startswith("http"):
            listing["url"] = f"https://www.zillow.com{listing['url']}"

        return listing

    def _parse_html_card(self, card):
        """Fallback parser for HTML property cards."""
        listing = {}

        # Address
        addr = card.select_one("address")
        listing["address"] = addr.get_text(strip=True) if addr else None

        # Price
        price = card.select_one("span[data-test='property-card-price']")
        listing["price_formatted"] = price.get_text(strip=True) if price else None

        # Details (beds, baths, sqft)
        details = card.select("ul li")
        for detail in details:
            text = detail.get_text(strip=True).lower()
            if "bd" in text or "bed" in text:
                match = re.search(r"(\d+)", text)
                listing["bedrooms"] = match.group(1) if match else None
            elif "ba" in text or "bath" in text:
                match = re.search(r"([\d.]+)", text)
                listing["bathrooms"] = match.group(1) if match else None
            elif "sqft" in text:
                match = re.search(r"([\d,]+)", text)
                listing["sqft"] = match.group(1).replace(",", "") if match else None

        # URL
        link = card.select_one("a[data-test='property-card-link']")
        if link:
            href = link.get("href", "")
            listing["url"] = f"https://www.zillow.com{href}" if not href.startswith("http") else href

        return listing if listing.get("address") else None

Step 4: Extract Detailed Property Information

    def get_property_details(self, zpid):
        """Fetch detailed property information using Zillow's internal API."""
        url = f"{self.BASE_URL}/graphql/"

        # GraphQL query for property details
        payload = {
            "operationName": "ForSaleDoubleScrollFullRenderQuery",
            "variables": {
                "zpid": int(zpid),
                "contactFormRenderParameter": {
                    "zpid": str(zpid),
                    "platform": "desktop",
                    "isDoubleScroll": True,
                },
            },
            "query": """
                query ForSaleDoubleScrollFullRenderQuery($zpid: ID!) {
                    property(zpid: $zpid) {
                        zpid
                        streetAddress
                        city
                        state
                        zipcode
                        price
                        bedrooms
                        bathrooms
                        livingArea
                        lotSize
                        homeType
                        homeStatus
                        yearBuilt
                        description
                        zestimate
                        rentZestimate
                        taxAssessedValue
                        taxAssessedYear
                        priceHistory { date price event }
                        taxHistory { year taxPaid value }
                        schools { name rating distance level }
                        nearbyHomes { zpid price address }
                    }
                }
            """,
        }

        try:
            response = self.session.post(
                url,
                json=payload,
                timeout=20,
            )

            if response.status_code == 200:
                data = response.json()
                return data.get("data", {}).get("property")
        except Exception as e:
            print(f"Error fetching property details for zpid {zpid}: {e}")

        return None

Step 5: Run the Complete Pipeline

def main():
    proxy_url = "http://user:pass@proxy.dataresearchtools.com:8080"
    scraper = ZillowScraper(proxy_url)

    # Search multiple locations
    locations = [
        "San Francisco CA",
        "Austin TX",
        "Miami FL",
    ]

    all_listings = []
    for location in locations:
        print(f"\nSearching properties in {location}...")
        listings = scraper.search_properties(location, num_pages=3)
        for listing in listings:
            listing["search_location"] = location
        all_listings.extend(listings)
        print(f"Found {len(listings)} listings in {location}")
        time.sleep(random.uniform(5, 10))

    print(f"\nTotal listings: {len(all_listings)}")

    # Get detailed data for selected properties
    detailed = []
    for listing in all_listings[:20]:
        zpid = listing.get("zpid")
        if zpid:
            print(f"Fetching details for zpid {zpid}...")
            details = scraper.get_property_details(zpid)
            if details:
                detailed.append(details)
            time.sleep(random.uniform(3, 7))

    # Save all data
    with open("zillow_listings.json", "w", encoding="utf-8") as f:
        json.dump(all_listings, f, indent=2, ensure_ascii=False)

    with open("zillow_detailed.json", "w", encoding="utf-8") as f:
        json.dump(detailed, f, indent=2, ensure_ascii=False)

    # Analysis
    df = pd.DataFrame(all_listings)
    df.to_csv("zillow_listings.csv", index=False)

    if "price" in df.columns:
        numeric_prices = pd.to_numeric(df["price"], errors="coerce")
        print("\nPrice Statistics:")
        print(f"  Median: ${numeric_prices.median():,.0f}")
        print(f"  Mean: ${numeric_prices.mean():,.0f}")
        print(f"  Min: ${numeric_prices.min():,.0f}")
        print(f"  Max: ${numeric_prices.max():,.0f}")


if __name__ == "__main__":
    main()

Handling Zillow’s Anti-Bot Measures

Request Pacing

Zillow monitors request patterns closely. Follow these guidelines:

  • Space requests 3-7 seconds apart for search pages.
  • Wait 5-10 seconds between detail page fetches.
  • Add longer pauses (15-30 seconds) when switching between locations.
  • Limit total requests to under 500 per day per IP address.
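The pacing guidelines above can be centralized in a small helper so every call site uses the same randomized delay ranges. The category names and ranges here simply mirror the list above:

```python
import random
import time

# Suggested delay ranges (seconds) from the guidelines above.
DELAYS = {
    "search": (3, 7),
    "detail": (5, 10),
    "location_switch": (15, 30),
}

def pace(kind="search", sleep=time.sleep):
    """Sleep for a randomized interval appropriate to the request type.

    The `sleep` parameter is injectable so the helper can be tested
    without real delays. Returns the delay that was applied.
    """
    low, high = DELAYS[kind]
    delay = random.uniform(low, high)
    sleep(delay)
    return delay
```

Call `pace("detail")` after each property fetch and `pace("location_switch")` between metro areas.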

Cookie and Session Management

Zillow sets tracking cookies on initial page loads. Always visit the homepage or a search page first to establish a valid cookie session before making API calls.

def warm_up_session(scraper):
    """Establish a valid session by visiting the homepage first."""
    response = scraper._make_request("https://www.zillow.com/")
    if response:
        print("Session established successfully")
        time.sleep(random.uniform(2, 4))
    return response is not None

Handling CAPTCHAs

When Zillow serves a CAPTCHA, the best approach is to back off and try again with a different proxy IP. Attempting to solve CAPTCHAs programmatically adds complexity and cost. With quality residential proxies, CAPTCHAs should be rare.
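The back-off-and-rotate strategy can be sketched as a heuristic check plus a retry loop over a proxy pool. The CAPTCHA marker strings below are assumptions (inspect real blocked responses to confirm what Zillow actually serves), and `scraper_factory` is a hypothetical callable that builds a `ZillowScraper` for a given proxy URL:

```python
def looks_like_captcha(response):
    """Heuristic check for a CAPTCHA interstitial.

    Marker strings are assumptions; verify against real blocked responses.
    """
    if response is None:
        return True
    body = response.text.lower()
    return "captcha" in body or "px-captcha" in body

def fetch_with_rotation(scraper_factory, proxy_urls, url):
    """Try each proxy in turn until a non-CAPTCHA response comes back."""
    for proxy_url in proxy_urls:
        scraper = scraper_factory(proxy_url)
        response = scraper._make_request(url)
        if response is not None and not looks_like_captcha(response):
            return response
    return None
```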

Data Applications for Real Estate

The property data extracted from Zillow enables powerful analysis:

  • Investment analysis: Compare property prices, rental yields, and appreciation trends across markets.
  • Comparable analysis (comps): Find similar properties to establish fair market values.
  • Market timing: Track days-on-market and price reduction patterns to identify buyer’s or seller’s markets.
  • Neighborhood analysis: Aggregate school ratings, price distributions, and property types by zip code.
  • Portfolio monitoring: Track Zestimate changes for owned properties over time.
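As one concrete example, a basic comps filter over the scraped listings can match on bed count and living area within a tolerance band. This sketch assumes the column names produced by the pipeline above; the rows here are synthetic for illustration:

```python
import pandas as pd

def find_comps(df, target, sqft_tol=0.15, max_results=5):
    """Return listings with the same bed count and a living area within
    +/- sqft_tol of the target property."""
    sqft = pd.to_numeric(df["sqft"], errors="coerce")
    beds = pd.to_numeric(df["bedrooms"], errors="coerce")
    mask = (
        (beds == target["bedrooms"])
        & (sqft >= target["sqft"] * (1 - sqft_tol))
        & (sqft <= target["sqft"] * (1 + sqft_tol))
    )
    return df[mask].head(max_results)

# Synthetic example rows matching the scraper's output columns.
listings = pd.DataFrame({
    "address": ["12 Oak St", "34 Elm Ave", "56 Pine Rd"],
    "bedrooms": [3, 3, 4],
    "sqft": [1500, 1650, 2400],
    "price": [450000, 480000, 700000],
})
comps = find_comps(listings, {"bedrooms": 3, "sqft": 1600})
```

A production comps model would also weight by location, year built, and sale recency.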

For e-commerce and market intelligence applications beyond real estate, the same proxy infrastructure and scraping techniques apply.

Scaling Considerations

When scraping Zillow at scale:

  1. Distribute across IPs: Use a large pool of rotating proxies to distribute requests.
  2. Stagger timing: Run scraping during off-peak hours (late night/early morning) when detection systems may be less aggressive.
  3. Database storage: Use PostgreSQL or MongoDB for efficient querying and deduplication across multiple runs.
  4. Incremental updates: Track previously scraped properties by zpid and only fetch new or updated listings.
  5. Geographic partitioning: Break large metro areas into smaller zip code searches for more complete coverage.
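Incremental updates (point 4) can be implemented with a small SQLite store keyed by zpid, upserting on conflict so repeat runs refresh prices without creating duplicates. The schema here is a minimal sketch, not a production design:

```python
import sqlite3

def init_store(path=":memory:"):
    """Create (or open) a small SQLite store keyed by zpid."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listings ("
        "zpid TEXT PRIMARY KEY, price INTEGER, last_seen TEXT)"
    )
    return conn

def upsert_listing(conn, listing, seen_date):
    """Insert a new listing, or refresh price/last_seen if the zpid exists."""
    conn.execute(
        "INSERT INTO listings (zpid, price, last_seen) VALUES (?, ?, ?) "
        "ON CONFLICT(zpid) DO UPDATE SET price=excluded.price, "
        "last_seen=excluded.last_seen",
        (str(listing["zpid"]), listing.get("price"), seen_date),
    )

def is_new(conn, zpid):
    """True if this zpid has not been stored before."""
    row = conn.execute(
        "SELECT 1 FROM listings WHERE zpid = ?", (str(zpid),)
    ).fetchone()
    return row is None
```

Before fetching details, check `is_new(conn, zpid)` to skip properties already captured in a previous run.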

Legal Considerations

Zillow’s Terms of Use prohibit scraping. Additionally:

  • The data Zillow publishes may include copyrighted content (property descriptions, photos).
  • MLS (Multiple Listing Service) data shown on Zillow has its own licensing restrictions.
  • Fair Housing Act considerations apply to how you use housing data.

Use scraped real estate data responsibly and consult with legal professionals before deploying commercial scraping operations.

Conclusion

Scraping Zillow provides access to one of the richest real estate datasets available, enabling sophisticated market analysis and investment research. The combination of API-based data extraction and HTML parsing creates a resilient scraper that adapts to Zillow’s evolving page structure.

Success depends on your proxy infrastructure. Residential and mobile proxies from DataResearchTools provide the IP legitimacy needed to sustain high-volume Zillow scraping without blocks. For more scraping techniques, visit our web scraping guides and proxy glossary.

