How to Scrape Realtor.com Property Data in 2026 (Bypass Next.js Protection)
the cleanest way to scrape realtor.com in 2026 is to extract the embedded __NEXT_DATA__ json from each listing page. it contains everything the website renders, in structured form, with no parsing brittleness. you’ll need a residential proxy to avoid the akamai bot manager block, and python with httpx + parsel does the rest. this tutorial ships working code.
why next_data is the trick
realtor.com runs on next.js. every server-rendered page bakes a hidden <script id="__NEXT_DATA__" type="application/json"> block into the html. inside that block sits the entire react state for the page: full property details, agent info, school info, pricing history, photos, the whole structured tree.
if you parse the html directly (price from .price-display, address from .address-line), realtor.com will rename or restructure that markup every few months. your scraper breaks. if you parse __NEXT_DATA__, you get raw json from their backend, and they rarely change those keys because their own frontend depends on them.
we cover this pattern in depth in our javascript-rendered pages scraping guide, but realtor.com is the textbook case.
the anti-bot situation
realtor.com sits behind akamai bot manager and a custom rate limiter. behavior:
(1) datacenter ips: blocked at the cdn edge. you get a 403 with a captcha challenge page.
(2) residential or mobile ips with a clean fingerprint: 200 ok response, 50-200 requests per ip per hour before throttling.
(3) high-volume requests from the same ip: 429 rate limit, 5-15 minute cooldown.
practical implication: you need rotating residential ips with sticky sessions long enough to fetch a single page. you don't need a full headless browser; plain http requests with the right headers work fine.
installing dependencies
pip install httpx parsel orjson
httpx for async http, parsel for css selectors, orjson for fast json parsing. that’s it.
the basic listing fetcher
import httpx
import parsel
import orjson
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/126.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Cache-Control": "no-cache",
    "Pragma": "no-cache",
    "Sec-Ch-Ua": '"Chromium";v="126", "Not(A:Brand";v="24", "Google Chrome";v="126"',
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": '"macOS"',
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
}
def fetch_listing(url: str, proxy: str) -> dict:
    # httpx >= 0.26 takes proxy=; older releases used the proxies= kwarg
    with httpx.Client(
        proxy=proxy,
        headers=HEADERS,
        timeout=30,
        http2=True,
    ) as client:
        resp = client.get(url)
        resp.raise_for_status()
        sel = parsel.Selector(resp.text)
        data = sel.css("script#__NEXT_DATA__::text").get()
        if not data:
            raise ValueError("__NEXT_DATA__ missing - likely blocked")
        return orjson.loads(data)
if __name__ == "__main__":
    proxy = "http://user-session-abc:pwd@gate.provider.com:8000"
    data = fetch_listing(
        "https://www.realtor.com/realestateandhomes-detail/123-Main-St_Anytown_CA_90210_M12345-67890",
        proxy,
    )
    print(orjson.dumps(data, option=orjson.OPT_INDENT_2).decode())
key headers:
(1) the Sec-Ch-Ua block must match the user-agent. a mismatched user-agent and sec-ch-ua pair is one of the fastest ways to get flagged.
(2) http2 enabled. realtor.com serves pages over http/2, and bot managers flag plain http/1.1 client behavior on http/2 sites.
navigating next_data
the json structure looks like this (pruned):
{
  "props": {
    "pageProps": {
      "initialReduxState": {
        "propertyDetails": {
          "property": {
            "list_price": 750000,
            "address": {
              "line": "123 Main St",
              "city": "Anytown",
              "state": "CA",
              "postal_code": "90210"
            },
            "description": {
              "beds": 3,
              "baths": 2.5,
              "sqft": 1800,
              "year_built": 1985,
              "type": "single_family"
            },
            "photos": [...],
            "advertisers": [...],
            "schools": {...},
            "tax_history": [...]
          }
        }
      }
    }
  }
}
the exact path varies slightly between listing types (single family, condo, lot, rental). a robust extractor walks the tree:
def extract_property(data: dict) -> dict:
    page_props = data["props"]["pageProps"]
    redux = page_props.get("initialReduxState", {})
    prop = redux.get("propertyDetails", {}).get("property", {})
    if not prop:
        # fallback for newer page structures
        prop = page_props.get("property", {})
    advertisers = prop.get("advertisers") or [{}]  # guard against a missing or empty list
    return {
        "price": prop.get("list_price"),
        "address": prop.get("address", {}).get("line"),
        "city": prop.get("address", {}).get("city"),
        "state": prop.get("address", {}).get("state"),
        "postal_code": prop.get("address", {}).get("postal_code"),
        "beds": prop.get("description", {}).get("beds"),
        "baths": prop.get("description", {}).get("baths"),
        "sqft": prop.get("description", {}).get("sqft"),
        "year_built": prop.get("description", {}).get("year_built"),
        "property_type": prop.get("description", {}).get("type"),
        "photos": [p.get("href") for p in prop.get("photos") or []],
        "agent_name": advertisers[0].get("name"),
        "schools": prop.get("schools"),
        "tax_history": prop.get("tax_history"),
    }
now you have clean structured data ready for a database.
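putting the two functions together, with the same placeholder proxy url as in the basic fetcher:

url = "https://www.realtor.com/realestateandhomes-detail/123-Main-St_Anytown_CA_90210_M12345-67890"
proxy = "http://user-session-abc:pwd@gate.provider.com:8000"  # placeholder credentials
raw = fetch_listing(url, proxy)        # full __NEXT_DATA__ payload from the detail page
record = extract_property(raw)         # flat dict, one row per listing
print(record["price"], record["address"], record["beds"])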
scraping search results
the listing detail page is the easy part. search result pages also embed __NEXT_DATA__ with a list of properties:
def fetch_search_results(city: str, state: str, page: int, proxy: str) -> list:
    url = f"https://www.realtor.com/realestateandhomes-search/{city}_{state}/pg-{page}"
    with httpx.Client(proxy=proxy, headers=HEADERS, timeout=30, http2=True) as client:
        resp = client.get(url)
        resp.raise_for_status()
        sel = parsel.Selector(resp.text)
        raw = sel.css("script#__NEXT_DATA__::text").get()
        if not raw:
            raise ValueError("__NEXT_DATA__ missing - likely blocked")
        data = orjson.loads(raw)
        listings = data["props"]["pageProps"]["properties"]
        return [
            {
                "property_id": item.get("property_id"),
                "url": f"https://www.realtor.com{item.get('rdc_web_url', '')}",
                "list_price": item.get("list_price"),
                "address": item.get("address"),
            }
            for item in listings
        ]
a typical search returns 42 listings per page. multi-page pagination is just incrementing pg-N until the result list is empty.
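a minimal pagination loop as a sketch on top of fetch_search_results. it assumes an empty properties array past the last page, and uses a single proxy session for simplicity (in practice you'd rotate per page and add the retry logic covered below):

def crawl_city(city: str, state: str, proxy: str) -> list:
    all_listings = []
    page = 1
    while True:
        batch = fetch_search_results(city, state, page, proxy)
        if not batch:          # empty page means we ran past the last result
            break
        all_listings.extend(batch)
        page += 1
    return all_listings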
the proxy setup
residential rotating with 1-5 minute sticky sessions is what you want. one ip per page fetch keeps the request footprint tiny. avoid mobile (overkill, more expensive) and datacenter (blocked).
import uuid

def make_session_proxy() -> str:
    sid = uuid.uuid4().hex[:12]
    return f"http://user-country-us-session-{sid}:pwd@gate.provider.com:8000"
generate a fresh session per page. if a request fails, generate another and retry.
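a sketch of that retry loop, built on the fetch_listing and make_session_proxy helpers above (the session-parameter format in the proxy url is provider-specific):

def fetch_with_retry(url: str, attempts: int = 3) -> dict:
    last_error = None
    for _ in range(attempts):
        proxy = make_session_proxy()          # fresh residential session per attempt
        try:
            return fetch_listing(url, proxy)
        except (httpx.HTTPError, ValueError) as exc:
            last_error = exc                  # blocked or transport error: rotate and retry
    raise RuntimeError(f"all {attempts} attempts failed for {url}") from last_error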
handling 403 challenges
when akamai flags you, the response is a redirect to a challenge page or an html page with "unable to verify" in the body. detect both:

def looks_blocked(resp: httpx.Response) -> bool:
    if resp.status_code == 403 or resp.is_redirect:
        return True
    body = resp.text.lower()
    if "unable to verify" in body or "challenge" in body:
        return True
    if '<script id="__next_data__"' not in body:
        return True
    return False
retry with a fresh session. if you get blocked 3 times in a row from different sessions, the entire ip range is hot. wait 10-15 minutes before retrying.
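one way to wire that streak logic into a fetcher, shown as a sketch (the 3-strike threshold and 15-minute cooldown are the numbers from the paragraph above, not anything realtor.com documents):

import time

def fetch_with_cooldown(url: str, max_block_streak: int = 3) -> dict:
    blocked_streak = 0
    while True:
        proxy = make_session_proxy()
        with httpx.Client(proxy=proxy, headers=HEADERS, timeout=30, http2=True) as client:
            resp = client.get(url)
        if not looks_blocked(resp):
            sel = parsel.Selector(resp.text)
            return orjson.loads(sel.css("script#__NEXT_DATA__::text").get())
        blocked_streak += 1
        if blocked_streak >= max_block_streak:
            time.sleep(15 * 60)               # whole ip range looks hot: wait it out
            blocked_streak = 0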
throttling for politeness
even with rotating ips, hammering realtor.com is rude and gets your provider’s pool flagged. cap your request rate:
import asyncio
import random

async def throttled_fetch(url, proxy):
    # random jitter so parallel workers don't fire in lockstep
    await asyncio.sleep(random.uniform(2.0, 5.0))
    # async_fetch_listing is an async twin of fetch_listing - sketched below
    return await async_fetch_listing(url, proxy)
2-5 second delay per worker, 10-20 workers in parallel. you’ll fetch 200-400 pages/minute, which is plenty without abusing the site.
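async_fetch_listing isn't defined above, so here is a sketch of it plus a bounded worker pool, reusing the HEADERS and make_session_proxy helpers (the concurrency default sits in the 10-20 range mentioned above):

async def async_fetch_listing(url: str, proxy: str) -> dict:
    async with httpx.AsyncClient(proxy=proxy, headers=HEADERS, timeout=30, http2=True) as client:
        resp = await client.get(url)
        resp.raise_for_status()
        sel = parsel.Selector(resp.text)
        payload = sel.css("script#__NEXT_DATA__::text").get()
        if not payload:
            raise ValueError("__NEXT_DATA__ missing - likely blocked")
        return orjson.loads(payload)

async def crawl(urls: list[str], concurrency: int = 15) -> list:
    sem = asyncio.Semaphore(concurrency)          # cap parallel workers

    async def worker(url: str):
        async with sem:
            return await throttled_fetch(url, make_session_proxy())

    # exceptions come back in-place so one bad url doesn't kill the batch
    return await asyncio.gather(*(worker(u) for u in urls), return_exceptions=True)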
storing the data
a postgres table for structured fields, jsonb column for the full extracted payload (so you can backfill new fields later):
CREATE TABLE realtor_listings (
    property_id TEXT PRIMARY KEY,
    scraped_at TIMESTAMP NOT NULL DEFAULT NOW(),
    list_price NUMERIC(12, 2),
    address TEXT,
    city TEXT,
    state TEXT,
    postal_code TEXT,
    beds NUMERIC(4, 1),
    baths NUMERIC(4, 1),
    sqft INTEGER,
    year_built INTEGER,
    property_type TEXT,
    raw JSONB
);
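a minimal upsert sketch, assuming psycopg 3 and the extract_property output from earlier (property_id comes from the search-results scrape, not the flat dict; on conflict the row is overwritten with the latest scrape):

import psycopg
from psycopg.types.json import Jsonb

UPSERT = """
INSERT INTO realtor_listings
    (property_id, list_price, address, city, state, postal_code,
     beds, baths, sqft, year_built, property_type, raw)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
ON CONFLICT (property_id) DO UPDATE
    SET list_price = EXCLUDED.list_price,
        raw = EXCLUDED.raw,
        scraped_at = NOW();
"""

def save_listing(conn: psycopg.Connection, property_id: str, rec: dict, raw: dict) -> None:
    conn.execute(UPSERT, (
        property_id, rec["price"], rec["address"], rec["city"], rec["state"],
        rec["postal_code"], rec["beds"], rec["baths"], rec["sqft"],
        rec["year_built"], rec["property_type"], Jsonb(raw),
    ))
    conn.commit()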
if your data lands in bigquery instead of postgres, the same pattern works (we wrote a scraping to bigquery pipeline that fits this scraper directly).
scaling considerations
at 200 listings/min, scraping all active us listings (~1.5M) takes ~5 days. realistic budgets:
- proxy bandwidth: ~150kb per detail page, roughly 225mb per 1,500 listings. at $4/gb residential, that's about $1 per 1,500 listings.
- compute: a single python process running the async pool above handles ~200 pages/min. for more throughput, deploy 5-10 processes across regions.
- storage: a million listings is ~3gb in postgres with the jsonb column.
cheap relative to the data value. real estate scraping pipelines that produce $5k-50k/month in saas revenue spend $200-800/month on infrastructure.
legal and ethical notes
scraping public listings is widely accepted. the data is published for the world to see. but: realtor.com’s tos forbids automated access. they can block your ip range, send a cease-and-desist, or pursue legal action if you redistribute their data commercially.
(1) don’t republish realtor.com’s photos or copy. fair use for analysis is one thing, building a competing listings site is another.
(2) don’t scrape pii (broker phone numbers, emails) and resell it.
(3) respect throttling. if they ratelimit, back off.
your use case (price analysis, market trends, lead generation for buyers) is usually fine. competing directly with mls licensees is a fast way to get sued.
frequently asked questions
why doesn’t realtor.com just block next_data entirely?
their own website depends on it. next.js needs the embedded json to hydrate the page on the client, so removing it would break their own frontend and seo tooling. they could obfuscate keys, but they haven't, because the cost of breaking their own tooling outweighs the gain of slowing scrapers.
can i scrape with playwright instead of httpx?
you can, but it’s overkill. realtor.com’s next_data is in the initial server response, so you don’t need to wait for js execution. httpx is 10x faster and 50x cheaper.
what residential proxy provider works best for realtor.com?
any reputable residential pool with us ips. our proxy provider comparison ranks them. avoid datacenter ips and the smaller regional providers; your success rate will suffer.
how do i scrape realtor.com photos?
photo urls are in the next_data payload. download them through the same proxy pool, but at higher bandwidth cost. budget ~3mb per photo, ~30mb per listing.
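a rough sketch of that download step, reusing the photo urls collected by extract_property (the file naming scheme here is arbitrary):

from pathlib import Path

def download_photos(photo_urls: list, dest: Path, proxy: str) -> None:
    dest.mkdir(parents=True, exist_ok=True)
    with httpx.Client(proxy=proxy, headers=HEADERS, timeout=60) as client:
        for i, url in enumerate(photo_urls):
            resp = client.get(url)
            resp.raise_for_status()
            (dest / f"photo_{i}.jpg").write_bytes(resp.content)   # arbitrary naming scheme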
does realtor.com offer an official api?
not for public scraping. they license data through partnerships and via the underlying mls feeds. licensing fees are typically thousands of dollars per month plus per-record charges. scraping is the budget alternative.
will my scraper survive realtor.com html changes?
the next_data approach is far more stable than css selectors. expect an occasional key rename (roughly every 12-18 months) that needs a minor adjustment rather than a full rewrite.
final thoughts
realtor.com is one of the cleaner real estate scraping targets in 2026 if you go through the front door (__NEXT_DATA__ extraction with residential proxies). most failures we see are from people trying to use datacenter ips, parse the rendered html, or run headless browsers when they don’t need to. the lighter your stack, the faster and cheaper your scraper. ship the simple httpx version first, add complexity only when something specific breaks.