ImovelWeb is Brazil’s second-largest property portal, listing 3+ million active rental and sale properties across São Paulo, Rio de Janeiro, and every major metro. If you’re building a Brazilian real estate dataset — for investment analysis, price forecasting, or competitive research — scraping ImovelWeb is faster and more complete than any official data source. Here’s how to build a reliable pipeline in 2026.
What ImovelWeb Serves and How It Protects Itself
ImovelWeb runs on a React frontend with server-side rendering. Most listing pages load critical data (price, address, specs) inline in the HTML, which means you don’t need to execute JavaScript for basic fields. Detail pages hydrate additional data via XHR calls to their internal API, so a two-pass approach (static HTML for the listing index + XHR interception for full property detail) is the most efficient architecture.
Anti-bot defenses as of 2026:
- Cloudflare Turnstile on search result pages at high request volume
- rate limiting by IP: roughly 60-80 requests per minute before soft blocks appear
- user-agent and header fingerprinting on the detail page XHR endpoints
- cookie-based session tokens that expire after ~10 minutes of inactivity
There is no CAPTCHA on individual property pages at moderate volume, but aggressive crawling triggers 429s fast. The defense profile is similar to what you’d encounter on Realtor.com — if you’ve read How to Scrape Realtor.com Property Data in 2026 (Bypass Next.js Protection), the same proxy rotation and header hygiene principles apply directly here.
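The 60-80 requests/minute ceiling above translates into a simple client-side throttle. A minimal sketch — the 50 req/min target below is our own conservative choice to leave headroom, not a documented limit:

```python
import time

class RateLimiter:
    """Per-IP throttle to stay under ImovelWeb's ~60-80 req/min soft limit."""

    def __init__(self, max_per_minute: int = 50):
        self.min_interval = 60.0 / max_per_minute  # seconds between requests
        self.last_request = 0.0

    def wait(self) -> float:
        """Block until the next request slot opens; return the delay applied."""
        now = time.monotonic()
        delay = max(0.0, self.last_request + self.min_interval - now)
        if delay:
            time.sleep(delay)
        self.last_request = time.monotonic()
        return delay
```

Call `limiter.wait()` before each request on a given proxy IP; with one `RateLimiter` per IP, rotation and throttling stay independent.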
Parsing the HTML: Key Selectors
ImovelWeb listing pages use consistent CSS classes that have been stable through 2025-2026. The search results grid renders listing cards server-side, which is the cleanest extraction path.
```python
import httpx
from selectolax.parser import HTMLParser

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept-Language": "pt-BR,pt;q=0.9",
    "Accept": "text/html,application/xhtml+xml",
    "Referer": "https://www.imovelweb.com.br/",
}

def parse_listings(html: str) -> list[dict]:
    tree = HTMLParser(html)
    results = []
    for card in tree.css("div[data-qa='posting PROPERTY']"):
        price = card.css_first("div[data-qa='POSTING_CARD_PRICE']")
        address = card.css_first("div[data-qa='POSTING_CARD_LOCATION']")
        link = card.css_first("a[data-qa='posting PROPERTY']")
        results.append({
            "price": price.text(strip=True) if price else None,
            "address": address.text(strip=True) if address else None,
            "url": "https://www.imovelweb.com.br" + link.attributes.get("href", "") if link else None,
        })
    return results
```

Key attributes to extract from cards: `data-qa="POSTING_CARD_PRICE"`, `POSTING_CARD_FEATURES` (beds/baths/m²), `POSTING_CARD_LOCATION`, and the canonical listing URL. For the full detail page, the JSON-LD block under `<script type="application/ld+json">` contains structured RealEstateListing data including geo-coordinates and agent contact.
If you've worked through Google Shopping HTML Selectors 2026: sh-dgr__content and a8pemb Explained, the data-qa attribute pattern here is conceptually identical -- stable semantic hooks that survive minor redesigns.
Proxy Strategy and IP Requirements
Brazil is a geo-restricted target. ImovelWeb redirects non-Brazilian IPs to a regional landing page and degrades search results for international traffic. You need Brazilian residential or mobile IPs, not datacenter IPs from São Paulo AWS nodes -- those are fingerprinted and blocked within minutes.
Provider comparison for Brazilian residential IPs (2026):
| Provider | BR Residential | BR Mobile | Price/GB | Sticky Sessions |
|---|---|---|---|---|
| Bright Data | yes | yes | ~$8.40 | up to 30 min |
| Oxylabs | yes | yes | ~$8.00 | up to 30 min |
| Smartproxy | yes | limited | ~$7.00 | up to 10 min |
| IPRoyal | yes | no | ~$3.50 | up to 24h |
| SOAX | yes | yes | ~$6.00 | up to 30 min |
Mobile IPs are worth the premium for search result pages where Cloudflare Turnstile activates. For detail pages at moderate volume (under 20 req/min per IP), residential IPs are sufficient and cheaper. The same IP rotation logic used for review-site pipelines applies here -- see How Proxies Help Scrape Reviews at Scale: Yelp, Google, Trustpilot (2026) for rotation interval benchmarks that carry over directly.
Session stickiness matters for pagination: ImovelWeb sets a `gclid` and `_iw_session` cookie on the first request, and paginating without carrying that session forward causes result deduplication errors. Use sticky sessions of at least 5 minutes per spider thread.
Pipeline Architecture
A production ImovelWeb scraper has three stages:
- URL generation -- ImovelWeb uses a structured URL schema: `imovelweb.com.br/imoveis-venda-{city}-{neighborhood}.html`. Generate the full matrix of city/neighborhood/property-type combinations from their sitemap (`sitemap_index.xml` links to per-city sitemaps).
- Listing index crawl -- fetch paginated search results (up to page 50, ~25 listings/page). Store canonical URLs and card-level data to a staging table.
- Detail page enrichment -- for each canonical URL, fetch the full property page and extract JSON-LD, agent info, photo count, and the internal `postingId`. Use this ID to optionally call the XHR endpoint `/api/v3/posting/{postingId}` for fields not in the HTML (HOA fees, energy rating, floor number).
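The URL generation stage is a simple matrix expansion. A sketch — the city, neighborhood, and property-type slugs below are illustrative placeholders; in production you would read the real values from the per-city sitemaps:

```python
from itertools import product

# Placeholder slugs -- replace with values parsed from sitemap_index.xml.
CITIES = ["sao-paulo-sp", "rio-de-janeiro-rj"]
NEIGHBORHOODS = {
    "sao-paulo-sp": ["moema", "pinheiros"],
    "rio-de-janeiro-rj": ["copacabana"],
}
PROPERTY_TYPES = ["apartamentos", "casas"]

def seed_urls() -> list[str]:
    """Expand the city/neighborhood/property-type matrix into search URLs."""
    urls = []
    for city in CITIES:
        for hood, ptype in product(NEIGHBORHOODS[city], PROPERTY_TYPES):
            urls.append(
                f"https://www.imovelweb.com.br/{ptype}-venda-{city}-{hood}.html"
            )
    return urls
```

Feed the output into the staging table that the listing index crawl consumes, so each spider thread can claim a city+neighborhood slice.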
Infrastructure checklist:
- use `httpx` with an async connection pool (50-100 concurrent workers is safe with proxy rotation)
- retry on 429 with exponential backoff -- start at 5 seconds, cap at 60
- store raw HTML alongside parsed fields; ImovelWeb's selector names have shifted twice in the past 18 months
- checkpoint progress to a database by city+page; full Brazil crawls take 8-14 hours depending on proxy speed
for teams evaluating managed scraping tools that handle proxy integration natively, Tools That Integrate Proxies for B2B Data Collection at Scale (2026) covers platforms like Apify, ScrapeOps, and Zyte that can reduce infrastructure overhead significantly.
Common Errors and Fixes
| Error | Cause | Fix |
|---|---|---|
| 403 on search pages | no Brazilian IP or expired session | rotate to BR residential, reseed cookies |
| empty listing cards | JS hydration path used | switch to SSR HTML path, not Playwright |
| redirect to /en | non-BR IP detected | confirm proxy geo, set Accept-Language: pt-BR |
| duplicate listings | session cookie not carried across pages | enable sticky proxy sessions |
| 429 bursts | too many requests from one IP | drop to 15 req/min per IP, add jitter |
One non-obvious issue: ImovelWeb serves a stale cached page to requests missing a Referer header pointing back to their own domain. Always set `Referer: https://www.imovelweb.com.br/` even on direct detail page hits. This is the same header discipline required for B2B dataset extraction -- see Best Proxies for Extracting Jobs + B2B Datasets at Scale (2026) for a header template that works across multiple structured-data targets.
Bottom Line
ImovelWeb is scrapable at scale in 2026 with Brazilian residential proxies, the right data-qa selectors, and a staged pipeline that separates index crawls from detail enrichment. Skip datacenter IPs entirely -- they're blocked on first contact. For teams building recurring pipelines, pair the architecture above with a managed proxy rotation layer to avoid the maintenance overhead of IP pool management. DRT covers the full stack of scraping infrastructure decisions, from selector stability to proxy provider tradeoffs, so bookmark the site if you're building production data pipelines.
Related guides on dataresearchtools.com
- How Proxies Help Scrape Reviews at Scale: Yelp, Google, Trustpilot (2026)
- Best Proxies for Extracting Jobs + B2B Datasets at Scale (2026)
- Google Shopping HTML Selectors 2026: sh-dgr__content and a8pemb Explained
- Tools That Integrate Proxies for B2B Data Collection at Scale (2026)
- Pillar: How to Scrape Realtor.com Property Data in 2026 (Bypass Next.js Protection)