How to Scrape Zillow Real Estate Data Using Python and Proxies
Zillow is the dominant real estate platform in the United States, hosting data on more than 100 million properties, including listings, price estimates (Zestimates), tax records, and transaction histories. For real estate investors, analysts, and proptech companies, Zillow data is an invaluable resource for market analysis, property valuation, and investment decision-making.
Extracting this data at scale requires navigating Zillow’s anti-scraping protections, which have become increasingly sophisticated. This guide provides a complete Python framework for scraping Zillow property data using residential proxies and robust parsing techniques.
Why Proxies Are Essential for Zillow Scraping
Zillow employs multiple layers of anti-bot protection:
- IP-based rate limiting: Zillow tracks request volume per IP and blocks addresses exceeding normal browsing patterns.
- Bot detection scripts: Client-side JavaScript checks for automation tools, headless browsers, and unusual browser environments.
- CAPTCHA challenges: Suspicious traffic triggers reCAPTCHA verification.
- Dynamic content loading: Property details load asynchronously, requiring JavaScript execution.
- API authentication: Zillow’s internal APIs use authentication tokens that expire and rotate.
Using rotating residential or mobile proxies ensures your requests originate from legitimate ISP-assigned IP addresses that Zillow cannot easily distinguish from real home buyers browsing listings.
Setting Up Your Environment
pip install requests beautifulsoup4 lxml pandas selenium webdriver-manager
The selenium and webdriver-manager packages are optional for this guide; they are only needed if you render JavaScript-heavy pages in a real browser.
Understanding Zillow’s Data Structure
Zillow exposes property data through several channels:
- Search results pages: Listings matching location and filter criteria.
- Individual property pages: Detailed information for specific addresses.
- Internal API endpoints: JSON data that powers the dynamic page content.
The most reliable approach combines search page scraping for discovery with API endpoint scraping for detailed data.
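For orientation, the searchQueryState object that drives Zillow’s internal search endpoint looks roughly like the sketch below. This is a reconstruction for illustration only; the schema is internal to Zillow, undocumented, and changes without notice.

# Illustrative shape of Zillow's searchQueryState (reconstructed, unofficial;
# actual field names and nesting may differ and can change at any time).
example_search_state = {
    "pagination": {"currentPage": 1},
    "mapBounds": {"west": -122.55, "east": -122.35, "south": 37.70, "north": 37.83},
    "isMapVisible": True,
    "isListVisible": True,
    "filterState": {"sortSelection": {"value": "globalrelevanceex"}},
}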
Building the Zillow Scraper
Step 1: Configure Proxy and Session
import requests
from bs4 import BeautifulSoup
import json
import time
import random
import re
import pandas as pd
class ZillowScraper:
"""Scrape Zillow property listings with proxy rotation."""
SEARCH_URL = "https://www.zillow.com/search/GetSearchPageState.htm"
BASE_URL = "https://www.zillow.com"
def __init__(self, proxy_url):
self.session = requests.Session()
self.session.proxies = {
"http": proxy_url,
"https": proxy_url,
}
self.session.headers.update({
"User-Agent": (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/120.0.0.0 Safari/537.36"
),
"Accept": "*/*",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Referer": "https://www.zillow.com/",
"Origin": "https://www.zillow.com",
})
def _make_request(self, url, params=None, max_retries=3):
"""Make a request with retry logic."""
for attempt in range(max_retries):
try:
response = self.session.get(url, params=params, timeout=20)
if response.status_code == 200:
return response
elif response.status_code == 403:
print(f"Blocked (403). Rotating proxy recommended. Attempt {attempt + 1}")
time.sleep(random.uniform(10, 20))
elif response.status_code == 429:
print(f"Rate limited. Waiting... Attempt {attempt + 1}")
time.sleep(random.uniform(30, 60))
else:
print(f"Status {response.status_code}, attempt {attempt + 1}")
time.sleep(random.uniform(3, 8))
except requests.exceptions.RequestException as e:
print(f"Request error: {e}")
time.sleep(random.uniform(5, 10))
        return None

Step 2: Search Properties by Location
def search_properties(self, location, num_pages=5, filters=None):
"""Search for property listings in a given location."""
all_listings = []
        # First, load the search page to obtain the embedded search state
search_url = f"{self.BASE_URL}/{location.lower().replace(' ', '-').replace(',', '')}"
response = self._make_request(search_url)
if not response:
print("Failed to load search page")
return []
# Extract search state from the page
html = response.text
listings_from_page = self._extract_listings_from_html(html)
all_listings.extend(listings_from_page)
print(f"Page 1: Found {len(listings_from_page)} listings")
# Try to get more via the API endpoint
search_state = self._extract_search_state(html)
if search_state:
for page in range(2, num_pages + 1):
search_state["pagination"] = {"currentPage": page}
params = {
"searchQueryState": json.dumps(search_state),
"wants": json.dumps({
"cat1": ["listResults", "mapResults"],
"cat2": ["total"],
}),
"requestId": random.randint(1, 100),
}
response = self._make_request(self.SEARCH_URL, params=params)
if response:
try:
data = response.json()
results = (
data.get("cat1", {})
.get("searchResults", {})
.get("listResults", [])
)
for result in results:
listing = self._parse_api_listing(result)
if listing:
all_listings.append(listing)
print(f"Page {page}: Found {len(results)} listings")
except json.JSONDecodeError:
print(f"Failed to parse API response on page {page}")
time.sleep(random.uniform(3, 7))
return all_listings
def _extract_listings_from_html(self, html):
"""Extract property listings from the search results HTML."""
listings = []
soup = BeautifulSoup(html, "lxml")
# Look for the JSON data embedded in the page
scripts = soup.find_all("script", {"type": "application/json"})
for script in scripts:
try:
data = json.loads(script.string)
# Navigate the nested structure to find listings
results = self._find_listings_in_json(data)
for result in results:
listing = self._parse_api_listing(result)
if listing:
listings.append(listing)
except (json.JSONDecodeError, TypeError):
continue
# Fallback: parse HTML directly
if not listings:
cards = soup.select("article[data-test='property-card']")
for card in cards:
listing = self._parse_html_card(card)
if listing:
listings.append(listing)
return listings
def _find_listings_in_json(self, data, results=None):
"""Recursively search for listing results in nested JSON."""
if results is None:
results = []
if isinstance(data, dict):
if "zpid" in data and "price" in data:
results.append(data)
for value in data.values():
self._find_listings_in_json(value, results)
elif isinstance(data, list):
for item in data:
self._find_listings_in_json(item, results)
return results
def _extract_search_state(self, html):
"""Extract the search query state from page HTML."""
match = re.search(r'"searchQueryState":({.*?}),"', html)
if match:
try:
return json.loads(match.group(1))
except json.JSONDecodeError:
pass
        return None

Step 3: Parse Property Data
def _parse_api_listing(self, data):
"""Parse a listing from API response data."""
if not data:
return None
listing = {
"zpid": data.get("zpid") or data.get("id"),
"address": data.get("address"),
"price": data.get("price") or data.get("unformattedPrice"),
"price_formatted": data.get("formattedPrice"),
"bedrooms": data.get("beds"),
"bathrooms": data.get("baths"),
"sqft": data.get("area") or data.get("livingArea"),
"lot_size": data.get("lotAreaValue"),
"lot_unit": data.get("lotAreaUnit"),
"property_type": data.get("homeType") or data.get("propertyType"),
"listing_status": data.get("statusType") or data.get("homeStatus"),
"zestimate": data.get("zestimate"),
"rent_zestimate": data.get("rentZestimate"),
"latitude": data.get("latitude") or data.get("latLong", {}).get("latitude"),
"longitude": data.get("longitude") or data.get("latLong", {}).get("longitude"),
"url": data.get("detailUrl"),
"image_url": data.get("imgSrc"),
"days_on_zillow": data.get("daysOnZillow"),
"listing_agent": data.get("brokerName"),
}
# Clean up URL
if listing["url"] and not listing["url"].startswith("http"):
listing["url"] = f"https://www.zillow.com{listing['url']}"
return listing
def _parse_html_card(self, card):
"""Fallback parser for HTML property cards."""
listing = {}
# Address
addr = card.select_one("address")
listing["address"] = addr.get_text(strip=True) if addr else None
# Price
price = card.select_one("span[data-test='property-card-price']")
listing["price_formatted"] = price.get_text(strip=True) if price else None
# Details (beds, baths, sqft)
details = card.select("ul li")
for detail in details:
text = detail.get_text(strip=True).lower()
if "bd" in text or "bed" in text:
listing["bedrooms"] = re.search(r"(\d+)", text)
listing["bedrooms"] = listing["bedrooms"].group(1) if listing["bedrooms"] else None
elif "ba" in text or "bath" in text:
listing["bathrooms"] = re.search(r"([\d.]+)", text)
listing["bathrooms"] = listing["bathrooms"].group(1) if listing["bathrooms"] else None
elif "sqft" in text:
listing["sqft"] = re.search(r"([\d,]+)", text)
listing["sqft"] = listing["sqft"].group(1).replace(",", "") if listing["sqft"] else None
# URL
link = card.select_one("a[data-test='property-card-link']")
if link:
href = link.get("href", "")
listing["url"] = f"https://www.zillow.com{href}" if not href.startswith("http") else href
        return listing if listing.get("address") else None

Step 4: Extract Detailed Property Information
def get_property_details(self, zpid):
"""Fetch detailed property information using Zillow's internal API."""
url = f"{self.BASE_URL}/graphql/"
# GraphQL query for property details
payload = {
"operationName": "ForSaleDoubleScrollFullRenderQuery",
"variables": {
"zpid": int(zpid),
"contactFormRenderParameter": {
"zpid": str(zpid),
"platform": "desktop",
"isDoubleScroll": True,
},
},
"query": """
query ForSaleDoubleScrollFullRenderQuery($zpid: ID!) {
property(zpid: $zpid) {
zpid
streetAddress
city
state
zipcode
price
bedrooms
bathrooms
livingArea
lotSize
homeType
homeStatus
yearBuilt
description
zestimate
rentZestimate
taxAssessedValue
taxAssessedYear
priceHistory { date price event }
taxHistory { year taxPaid value }
schools { name rating distance level }
nearbyHomes { zpid price address }
}
}
""",
}
try:
response = self.session.post(
url,
json=payload,
timeout=20,
)
if response.status_code == 200:
data = response.json()
return data.get("data", {}).get("property")
except Exception as e:
print(f"Error fetching property details for zpid {zpid}: {e}")
        return None

Step 5: Run the Complete Pipeline
def main():
proxy_url = "http://user:pass@proxy.dataresearchtools.com:8080"
scraper = ZillowScraper(proxy_url)
# Search multiple locations
locations = [
"San Francisco CA",
"Austin TX",
"Miami FL",
]
all_listings = []
for location in locations:
print(f"\nSearching properties in {location}...")
listings = scraper.search_properties(location, num_pages=3)
for listing in listings:
listing["search_location"] = location
all_listings.extend(listings)
print(f"Found {len(listings)} listings in {location}")
time.sleep(random.uniform(5, 10))
print(f"\nTotal listings: {len(all_listings)}")
# Get detailed data for selected properties
detailed = []
for listing in all_listings[:20]:
zpid = listing.get("zpid")
if zpid:
print(f"Fetching details for zpid {zpid}...")
details = scraper.get_property_details(zpid)
if details:
detailed.append(details)
time.sleep(random.uniform(3, 7))
# Save all data
with open("zillow_listings.json", "w", encoding="utf-8") as f:
json.dump(all_listings, f, indent=2, ensure_ascii=False)
with open("zillow_detailed.json", "w", encoding="utf-8") as f:
json.dump(detailed, f, indent=2, ensure_ascii=False)
# Analysis
df = pd.DataFrame(all_listings)
df.to_csv("zillow_listings.csv", index=False)
print(f"\nPrice Statistics:")
numeric_prices = pd.to_numeric(df["price"], errors="coerce")
print(f" Median: ${numeric_prices.median():,.0f}")
print(f" Mean: ${numeric_prices.mean():,.0f}")
print(f" Min: ${numeric_prices.min():,.0f}")
print(f" Max: ${numeric_prices.max():,.0f}")
if __name__ == "__main__":
    main()

Handling Zillow’s Anti-Bot Measures
Request Pacing
Zillow monitors request patterns closely. Follow these guidelines (a jittered pacing helper is sketched after the list):
- Space requests 3-7 seconds apart for search pages.
- Wait 5-10 seconds between detail page fetches.
- Add longer pauses (15-30 seconds) when switching between locations.
- Limit total requests to under 500 per day per IP address.
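A small helper can centralize this pacing. The ranges below mirror the guidelines above; they are judgment calls, not limits published by Zillow.

import random
import time

def pace(kind="search"):
    """Sleep for a jittered interval; ranges follow the guidelines above."""
    ranges = {
        "search": (3, 7),      # between search-page requests
        "detail": (5, 10),     # between detail-page fetches
        "location": (15, 30),  # when switching locations
    }
    low, high = ranges.get(kind, (3, 7))
    time.sleep(random.uniform(low, high))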
Cookie and Session Management
Zillow sets tracking cookies on initial page loads. Always visit the homepage or a search page first to establish a valid cookie session before making API calls.
def warm_up_session(scraper):
"""Establish a valid session by visiting the homepage first."""
response = scraper._make_request("https://www.zillow.com/")
if response:
print("Session established successfully")
time.sleep(random.uniform(2, 4))
    return response is not None

Handling CAPTCHAs
When Zillow serves a CAPTCHA, the best approach is to back off and try again with a different proxy IP. Attempting to solve CAPTCHAs programmatically adds complexity and cost. With quality residential proxies, CAPTCHAs should be rare.
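A minimal back-off sketch is shown below. It assumes you maintain a pool of proxy URLs; the pool and the challenge-marker strings are illustrative assumptions, not part of the scraper above.

import random
import time

def looks_like_captcha(response):
    """Heuristic CAPTCHA check; Zillow's challenge markers may change."""
    body = response.text.lower()
    return "captcha" in body or "px-captcha" in body

def fetch_with_backoff(scraper, url, proxy_pool):
    """Retry on a fresh proxy whenever a challenge page is served."""
    for proxy in proxy_pool:
        scraper.session.proxies = {"http": proxy, "https": proxy}
        response = scraper._make_request(url)
        if response and not looks_like_captcha(response):
            return response
        time.sleep(random.uniform(20, 40))  # back off before the next identity
    return None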
Data Applications for Real Estate
The property data extracted from Zillow enables powerful analysis:
- Investment analysis: Compare property prices, rental yields, and appreciation trends across markets.
- Comparable analysis (comps): Find similar properties to establish fair market values.
- Market timing: Track days-on-market and price reduction patterns to identify buyer’s or seller’s markets.
- Neighborhood analysis: Aggregate school ratings, price distributions, and property types by zip code (see the aggregation sketch after this list).
- Portfolio monitoring: Track Zestimate changes for owned properties over time.
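As an example of the neighborhood analysis, here is a short pandas sketch over the CSV produced by the pipeline above. Pulling the zip code out of the address string is a rough heuristic, and the column names assume the listing schema defined earlier.

import pandas as pd

df = pd.read_csv("zillow_listings.csv")
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df["sqft"] = pd.to_numeric(df["sqft"], errors="coerce")
# Rough heuristic: take a trailing 5-digit zip from the address string.
df["zip"] = df["address"].str.extract(r"(\d{5})(?:-\d{4})?\s*$", expand=False)

by_zip = df.groupby("zip").agg(
    listings=("zpid", "count"),
    median_price=("price", "median"),
    median_sqft=("sqft", "median"),
)
print(by_zip.sort_values("median_price", ascending=False).head(10))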
For e-commerce and market intelligence applications beyond real estate, the same proxy infrastructure and scraping techniques apply.
Scaling Considerations
When scraping Zillow at scale:
- Distribute across IPs: Use a large pool of rotating proxies to distribute requests.
- Stagger timing: Run scraping during off-peak hours (late night/early morning) when detection systems may be less aggressive.
- Database storage: Use PostgreSQL or MongoDB for efficient querying and deduplication across multiple runs.
- Incremental updates: Track previously scraped properties by zpid and only fetch new or updated listings (sketched below).
- Geographic partitioning: Break large metro areas into smaller zip code searches for more complete coverage.
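For the incremental-update point, the sketch below uses SQLite for brevity in place of PostgreSQL or MongoDB; the table and file names are illustrative.

import sqlite3

def load_seen_zpids(db_path="zillow.db"):
    """Return the set of zpids stored by earlier runs."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS listings (zpid TEXT PRIMARY KEY)")
    seen = {row[0] for row in conn.execute("SELECT zpid FROM listings")}
    conn.close()
    return seen

def record_listings(listings, db_path="zillow.db"):
    """Insert zpids, silently skipping ones already scraped."""
    conn = sqlite3.connect(db_path)
    conn.executemany(
        "INSERT OR IGNORE INTO listings (zpid) VALUES (?)",
        [(str(l["zpid"]),) for l in listings if l.get("zpid")],
    )
    conn.commit()
    conn.close()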
Legal Considerations
Zillow’s Terms of Use prohibit scraping. Additionally:
- The data Zillow publishes may include copyrighted content (property descriptions, photos).
- MLS (Multiple Listing Service) data shown on Zillow has its own licensing restrictions.
- Fair Housing Act considerations apply to how you use housing data.
Use scraped real estate data responsibly and consult with legal professionals before deploying commercial scraping operations.
Conclusion
Scraping Zillow provides access to one of the richest real estate datasets available, enabling sophisticated market analysis and investment research. The combination of API-based data extraction and HTML parsing creates a resilient scraper that adapts to Zillow’s evolving page structure.
Success depends on your proxy infrastructure. Residential and mobile proxies from DataResearchTools provide the IP legitimacy needed to sustain high-volume Zillow scraping without blocks. For more scraping techniques, visit our web scraping guides and proxy glossary.
Related Reading
- How to Scrape Amazon Product Data with Proxies: 2026 Python Guide
- How to Scrape Bing Search Results with Python and Proxies
- aiohttp + BeautifulSoup: Async Python Scraping
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)
- API vs Web Scraping: When You Need Proxies (and When You Don’t)
- ASEAN Data Protection Laws: A Web Scraping Compliance Matrix