How to scrape OLX Brazil and LATAM marketplaces
Scrape OLX Brazil and you tap into the largest classifieds platform in Latin America, covering used cars, real estate, jobs, and general merchandise across Brazil, Argentina, Colombia, Peru, and Ecuador. OLX operates as a horizontal classifieds platform where individual sellers and small businesses publish ads, which makes the data shape fundamentally different from a marketplace like Mercado Libre. The scraping landscape is shaped by three things: a state-level geographic hierarchy that gates listings, an undocumented but stable JSON API behind the search experience, and Cloudflare bot protection that classifies non-Brazilian traffic aggressively.
This guide focuses on OLX Brazil at olx.com.br as the canonical example, with notes on how the patterns transfer to OLX Argentina, OLX Colombia, and the smaller LATAM properties.
Mapping OLX URL and listing structure
OLX Brazil URLs follow a regional hierarchy. National listings live under the root domain, but most search activity is filtered by state. URLs include the state code in the path: https://www.olx.com.br/sp/regiao-de-sao-paulo for Sao Paulo state, https://www.olx.com.br/rj for Rio de Janeiro, and so on for the 27 federative units. Within a state, listings are further filtered by category, sub-category, and city.
Individual ad URLs follow the pattern https://<region>.olx.com.br/<category>/<subcategory>/<title-slug>-<adId>. The trailing adId is the canonical identifier you should store as the primary key. Slugs change when sellers edit ads; adIds are stable for the life of the listing.
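If you ever need to recover the adId from a raw URL rather than from the API payload, a small helper like the following works under the URL shape described above (the example URL is illustrative only):

import re

# Assumed ad URL shape: https://<region>.olx.com.br/<category>/<subcategory>/<slug>-<adId>
AD_ID_RE = re.compile(r"-(\d+)(?:\?.*)?$")

def extract_ad_id(url: str) -> int | None:
    """Pull the trailing numeric adId out of an OLX ad URL; return None if absent."""
    match = AD_ID_RE.search(url.split("#")[0])
    return int(match.group(1)) if match else None

# Example (hypothetical URL):
# extract_ad_id("https://sp.olx.com.br/autos/carros/fiat-uno-2014-1234567890") -> 1234567890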
The OLX search experience is powered by a JSON API at https://www.olx.com.br/api/relevance/search. The endpoint accepts a category, subcategory, region, and pagination parameters and returns a structured listing payload. This is dramatically more reliable than HTML scraping because the API contract is stable while the visible markup changes regularly.
import httpx

SEARCH_API = "https://www.olx.com.br/api/relevance/search"

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept": "application/json",
    "Accept-Language": "pt-BR,pt;q=0.9",
}

async def search_olx(category_id: int, region_code: str, page: int, proxy: str):
    """Fetch one page of search results for a category within a region."""
    params = {
        "category": category_id,
        "region": region_code,
        "o": page,    # page offset parameter
        "lim": 50,    # results per page
    }
    async with httpx.AsyncClient(proxy=proxy, headers=HEADERS, timeout=20) as c:
        r = await c.get(SEARCH_API, params=params)
        if r.status_code == 200:
            return r.json().get("ad_list", [])
        return []
The response includes list_id (the canonical ad id), subject (title), body (description), price in BRL, category_id, category_name, state, city, seller_name, seller_phone_status, images, and created_at. For most analytical use cases this is sufficient and you do not need to fetch the individual ad detail page.
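A thin normalization step between the raw payload and storage keeps the parser honest about which fields you actually depend on. A minimal sketch, assuming the field names listed above; exact names and formats should be verified against live responses, and the price field in particular may arrive as a formatted string rather than a number:

from datetime import datetime, timezone

def normalize_ad(ad: dict) -> dict:
    """Map a raw ad_list entry onto the snapshot columns used later in this guide."""
    return {
        "snapshot_at": datetime.now(timezone.utc),
        "list_id": ad.get("list_id"),
        "state": ad.get("state"),
        "city": ad.get("city"),
        "category_id": ad.get("category_id"),
        "subject": ad.get("subject"),
        "price_brl": ad.get("price"),          # may need parsing if returned as "R$ ..." text
        "image_count": len(ad.get("images") or []),
        "created_at": ad.get("created_at"),
        "is_active": True,
    }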
Brazilian proxy strategy
OLX Brazil’s bot detection is aggressive against non-Brazilian IPs. Cloudflare profiles the visitor IP geography, and any IP outside Brazil receives elevated scrutiny. For sustained scraping, Brazilian residential or mobile IPs are required. Vivo, Claro, and TIM mobile pools work well. Brazilian residential pools through major providers also work but cost more per gigabyte than the mobile equivalent at scale.
For workloads under 10,000 ad reads per day, a small Brazilian residential pool with sticky 15-minute sessions is sufficient. For higher volumes, dedicated mobile ports are the cleaner path because they can sustain higher request rates without challenges.
| Region | Recommended proxy origin | Tolerance per IP |
|---|---|---|
| Sao Paulo state | Brazilian mobile or residential in SP | 200 req/hr per IP |
| Rio de Janeiro | Brazilian residential anywhere | 200 req/hr per IP |
| Northeast states | Brazilian residential | 250 req/hr per IP |
| South states | Brazilian residential | 250 req/hr per IP |
Cross-state IPs work fine for any state target because the OLX bot logic does not enforce intra-Brazil geographic consistency. The geographic check is at the country level only.
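One way to implement the sticky 15-minute sessions mentioned above is to embed a rotating session ID in the proxy username. This is a common provider convention but not universal, so the username format below is illustrative; check your provider's documentation:

import random
import string
import time

class StickyProxyPool:
    """Rotate session IDs embedded in the proxy username roughly every 15 minutes."""

    def __init__(self, host: str, port: int, user: str, password: str,
                 session_ttl: int = 900, size: int = 20):
        self.host, self.port, self.user, self.password = host, port, user, password
        self.session_ttl = session_ttl
        self.sessions = [self._new_session() for _ in range(size)]

    def _new_session(self) -> dict:
        sid = "".join(random.choices(string.ascii_lowercase + string.digits, k=8))
        return {"id": sid, "born": time.time()}

    def next(self) -> str:
        s = random.choice(self.sessions)
        if time.time() - s["born"] > self.session_ttl:
            s.update(self._new_session())   # session expired: rotate to a fresh exit IP
        # e.g. user-session-abc12345 pins the exit IP for the session's lifetime
        return f"http://{self.user}-session-{s['id']}:{self.password}@{self.host}:{self.port}"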
Crawling categories and pagination
OLX exposes roughly 80 top-level categories and 400 sub-categories across the Brazilian site. The full category tree is published at https://www.olx.com.br/api/categories and changes slowly enough that you can cache it weekly. For each category, the search API allows pagination up to roughly 100 pages of 50 ads each, for a maximum of 5,000 ads per query.
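A simple weekly file cache for the category tree keeps you from re-fetching an endpoint that rarely changes. A sketch, reusing the HEADERS constant from the search example above and assuming the categories endpoint returns JSON:

import json
import time
from pathlib import Path

import httpx

CATEGORIES_URL = "https://www.olx.com.br/api/categories"
CACHE_FILE = Path("olx_categories.json")
CACHE_TTL = 7 * 24 * 3600  # refresh weekly

def get_category_tree(proxy: str | None = None) -> dict:
    """Return the cached category tree, refetching if the cache is older than a week."""
    if CACHE_FILE.exists() and time.time() - CACHE_FILE.stat().st_mtime < CACHE_TTL:
        return json.loads(CACHE_FILE.read_text())
    with httpx.Client(proxy=proxy, headers=HEADERS, timeout=20) as c:
        resp = c.get(CATEGORIES_URL)
        resp.raise_for_status()
        CACHE_FILE.write_text(resp.text)
        return resp.json()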
For deeper coverage of large categories like used cars or real estate, decompose by state and then by city. The cities of Sao Paulo, Rio de Janeiro, Belo Horizonte, Salvador, Brasilia, Curitiba, and Recife collectively cover roughly 40% of total OLX listings; the remaining listings spread across hundreds of smaller cities.
async def crawl_state_category(state: str, category_id: int, proxy_pool, max_pages: int = 100):
    """Walk a state/category combination page by page until the API returns no more ads."""
    all_ads = []
    for page in range(1, max_pages + 1):
        proxy = proxy_pool.next()   # rotate to a fresh Brazilian IP for each page
        ads = await search_olx(category_id, state, page, proxy)
        if not ads:
            break                   # empty page means we ran past the last result
        all_ads.extend(ads)
    return all_ads
A national daily snapshot across all states and categories is roughly 10-15 million ad reads. That is achievable on a moderately sized mobile proxy pool with parallelism across 20-50 concurrent sessions. Plan for 8-12 hours of wall-clock time for the full sweep.
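A sketch of how to cap that concurrency for the full sweep, using the crawl_state_category coroutine above and a bounded semaphore:

import asyncio

async def national_sweep(states: list[str], category_ids: list[int],
                         proxy_pool, max_concurrency: int = 30):
    """Run every (state, category) pair with a bounded number of concurrent sessions."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(state: str, cid: int):
        async with sem:
            return await crawl_state_category(state, cid, proxy_pool)

    tasks = [bounded(s, c) for s in states for c in category_ids]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    # Drop failed (state, category) pairs; they can be retried on the next cycle.
    return [ads for ads in results if not isinstance(ads, Exception)]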
Phone number redaction and contact extraction
OLX masks seller phone numbers behind a click-to-reveal flow. The seller_phone_status field in the search response indicates whether a phone is published, but the actual number is only revealed by hitting the contact API with an authenticated session. From a privacy perspective, treat the phone number as personal data and avoid collecting it unless you have a clear lawful basis and a documented use case.
For analytics that need to dedupe sellers across many ads, hash a stable seller identifier (the OLX user_id is published in the ad detail) rather than collecting raw phone numbers. The hashed user_id is sufficient for seller concentration analysis without creating personal-data exposure.
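A minimal way to produce that hashed identifier, assuming you extract a user_id field from the ad detail; the salt keeps the hash from being trivially reversed by re-hashing a list of known IDs:

import hashlib

def hash_seller_id(user_id: str, salt: str) -> str:
    """Salted SHA-256 of the OLX user_id: good enough for dedupe, no raw identifier stored."""
    return hashlib.sha256((salt + str(user_id)).encode("utf-8")).hexdigest()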
Schema for OLX classifieds snapshots
CREATE TABLE olx_ad_snapshot (
snapshot_at TIMESTAMP NOT NULL,
list_id BIGINT NOT NULL,
state VARCHAR(2) NOT NULL,
city VARCHAR(64),
category_id INT,
subject TEXT,
price_brl DECIMAL(12,2),
seller_user_id_hash VARCHAR(64),
image_count INT,
created_at TIMESTAMP,
is_active BOOLEAN,
PRIMARY KEY (snapshot_at, list_id)
);
CREATE INDEX olx_state_cat_idx ON olx_ad_snapshot(state, category_id);
For longitudinal classifieds analytics, the most valuable derived metric is time-on-market: the number of days an ad stays active before being delisted (because the item sold or the seller withdrew). Compute this from the diff between consecutive snapshots. Time-on-market by category, by price band, and by state is the headline insight in any classifieds intelligence product.
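One way to derive time-on-market from the snapshots in application code, assuming each snapshot row is a dict with the columns from the table above and snapshot_at parsed as a datetime:

from collections import defaultdict

def time_on_market(snapshot_rows: list[dict]) -> dict[int, int]:
    """Days between the first and last snapshot in which each list_id was seen active.
    Ads still active in the latest snapshot are right-censored and should be flagged
    or excluded before aggregating."""
    seen = defaultdict(list)
    for row in snapshot_rows:
        if row.get("is_active"):
            seen[row["list_id"]].append(row["snapshot_at"])
    return {list_id: (max(ts) - min(ts)).days for list_id, ts in seen.items()}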
For broader pattern guidance on classifieds-style scrapers, see our residential proxy provider ranking and our Mercado Libre Mexico scraping guide.
Detecting and routing around CAPTCHA challenges on OLX
When OLX flags your traffic, the response is usually a Cloudflare or third-party challenge page rather than a clean HTTP error, so your scraper needs to detect this content swap explicitly. Look for the cf-mitigated response header, the presence of __cf_chl_ cookies, or HTML containing "Just a moment...". Treat any of these as a soft block.
def is_challenged(response) -> bool:
    """Detect Cloudflare challenge responses that arrive in place of listing data."""
    if response.status_code in (403, 503):
        return True
    if "cf-mitigated" in response.headers:
        return True
    if "__cf_chl_" in response.headers.get("set-cookie", ""):
        return True
    # Fall back to sniffing the interstitial page copy itself.
    body = response.text[:2000].lower()
    return "just a moment" in body or "checking your browser" in body
When you detect a challenge, do not retry on the same IP for at least 30 minutes. Mark that IP as cooling and route subsequent requests to a different IP in your pool. Aggressive retries on a flagged IP cause the cooling window to extend and can lead to long-term blacklisting of your subnet. For pages that absolutely must be fetched, have a fallback path that uses a headless browser with a real Brazil residential IP. The browser path costs more per page but solves the small percentage of challenges that the API path cannot handle. Most production setups maintain a 95/5 split between the lightweight HTTP path and the browser fallback path.
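A sketch of the cooling logic described above; the 30-minute window is this guide's recommendation, not a Cloudflare constant:

import time

class CooldownRegistry:
    """Track IPs that recently triggered a challenge and keep them out of rotation."""

    def __init__(self, cooldown_seconds: int = 1800):
        self.cooldown = cooldown_seconds
        self.flagged: dict[str, float] = {}

    def flag(self, ip: str):
        self.flagged[ip] = time.time()

    def is_cooling(self, ip: str) -> bool:
        flagged_at = self.flagged.get(ip)
        if flagged_at is None:
            return False
        if time.time() - flagged_at > self.cooldown:
            del self.flagged[ip]   # cooldown elapsed, IP can rotate back in
            return False
        return True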
Working with BRL pricing and FX normalization
Pricing on OLX is denominated in BRL, and any cross-market analysis requires careful FX normalization. The naive approach of converting at scrape time using a live FX feed introduces noise into your trend lines because exchange rate movements get conflated with real price changes. Store the price in local BRL and apply FX conversion at query time using a daily reference rate.
CREATE TABLE fx_rates (
rate_date DATE NOT NULL,
base_ccy VARCHAR(3) NOT NULL,
quote_ccy VARCHAR(3) NOT NULL,
rate DECIMAL(18,8) NOT NULL,
PRIMARY KEY (rate_date, base_ccy, quote_ccy)
);
Source the daily rates from a reliable feed such as the European Central Bank reference rates or your bank wholesale feed. Avoid scraping retail FX rates because they include the bank spread and produce inconsistent comparisons. For analyses that span multiple years, also account for currency revaluation events that occasionally happen in emerging markets.
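Query-time conversion can be done with a SQL join against fx_rates or in application code. A sketch of the latter, assuming you have loaded the daily rates into a dict keyed by date, quoted as BRL per USD:

from datetime import date
from decimal import Decimal

def brl_to_usd(price_brl: Decimal, on_date: date,
               rates: dict[date, Decimal]) -> Decimal | None:
    """Convert a stored BRL price using the reference rate for that snapshot date.
    Returns None when no rate is available rather than silently falling back."""
    rate = rates.get(on_date)   # e.g. Decimal("5.10") BRL per USD
    if rate is None:
        return None
    return (price_brl / rate).quantize(Decimal("0.01"))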
Comparing OLX to other regional marketplaces
| Marketplace | Country focus | Catalogue scale | Bot strictness |
|---|---|---|---|
| OLX | Brazil | Large | High |
| Mercado Libre | Brazil and wider LATAM | Medium | Medium |
| Webmotors | Brazil (vehicles) | Smaller | Lower |
Cross-marketplace analyses help separate platform-specific dynamics from genuine market trends. If a price drops on OLX but stays flat across the comparable competitors, that is a platform-driven event rather than a market-wide signal. Your scraping pipeline should ingest from at least three platforms in any market where you intend to publish category insights.
Operational monitoring and alerting
Every production scraper needs three monitoring layers regardless of target. The first is per-IP success rate over a 5-minute window, alerting if any IP drops below 80%. The second is parser error rate, alerting if more than 1% of fetched pages fail to extract the canonical fields. The third is data freshness, alerting if your downstream consumers see snapshots more than 24 hours old.
import time
from collections import deque

class IPHealthTracker:
    """Rolling per-IP success rate over a sliding time window."""

    def __init__(self, window_seconds: int = 300):
        self.window = window_seconds
        self.events: dict[str, deque] = {}

    def _prune(self, bucket: deque, now: float):
        # Drop events that have fallen out of the sliding window.
        while bucket and bucket[0][0] < now - self.window:
            bucket.popleft()

    def record(self, ip: str, success: bool):
        bucket = self.events.setdefault(ip, deque())
        now = time.time()
        bucket.append((now, success))
        self._prune(bucket, now)

    def success_rate(self, ip: str) -> float:
        bucket = self.events.get(ip)
        if not bucket:
            return 1.0   # no data yet: assume healthy
        self._prune(bucket, time.time())
        if not bucket:
            return 1.0
        successes = sum(1 for _, ok in bucket if ok)
        return successes / len(bucket)
Wire this into Prometheus or your existing observability stack so the on-call engineer sees IP degradation as it happens rather than after the daily snapshot fails. For long-running operations against OLX, IP rotation triggered by the health tracker is more reliable than fixed rotation schedules.
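One way to expose the tracker to Prometheus using the prometheus_client library; the metric name is our own choice, not a standard:

from prometheus_client import Gauge, start_http_server

ip_success_rate = Gauge(
    "scraper_ip_success_rate",
    "Rolling 5-minute success rate per proxy IP",
    ["ip"],
)

def export_health(tracker: IPHealthTracker):
    """Push the current success rate of every tracked IP into the gauge."""
    for ip in list(tracker.events):
        ip_success_rate.labels(ip=ip).set(tracker.success_rate(ip))

# start_http_server(9102)  # expose /metrics; call once at process start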
Legal and compliance considerations for Brazil
Public listing data is generally treated as fair to scrape in most jurisdictions, but Brazil has its own consumer protection rules and a dedicated personal data framework, the LGPD. Confine your collection to non-personal data: ad identifiers, prices, titles and descriptions, category metadata, and at most seller display names. Avoid collecting individual buyer reviews with names, phone numbers, or email addresses attached, and avoid pulling any data behind a login.
For commercial deployment, document your basis for processing, your data retention period, and your purpose limitation. Most data protection regimes, the LGPD included, treat scraped public data more favorably when there is a clear lawful basis and the data is not used for direct marketing to identified individuals. Published frameworks for responsible web data collection are useful starting points for documenting your approach.
Pipeline orchestration and scheduling
For any non-trivial scraping operation, a dedicated orchestration layer is the difference between a script you babysit and a service that runs unattended. The two strong open-source choices in 2026 are Prefect 3 and Dagster. Both handle the patterns you need: DAG dependencies, retries, observability, secret management, and dynamic fan-out across IPs and categories.
from prefect import flow, task

@task(retries=3, retry_delay_seconds=60)
def fetch_category(category_id: int, page: int):
    # crawl_one_page is your own fetch wrapper around the search API shown earlier
    return crawl_one_page(category_id, page)

@task
def store_pages(pages: list):
    # write_to_db is your own persistence layer for the snapshot table
    write_to_db(pages)

@flow(name="olx-daily-sweep")
def daily_sweep(category_ids: list):
    futures = []
    for cid in category_ids:
        for page in range(1, 50):
            futures.append(fetch_category.submit(cid, page))
    pages = [f.result() for f in futures]
    store_pages(pages)
Run the flow on a 6-hour or 24-hour schedule depending on how dynamic the underlying catalogue is. For seasonal markets like apparel where pricing changes daily, a 6-hour cadence catches the meaningful movements without driving up proxy costs unnecessarily. For long-tail categories like books or industrial supplies, daily is sufficient and the cost saving is meaningful.
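In Prefect 3 the cadence can be attached when serving the flow. A sketch; the deployment name, cron expression, and category IDs here are our own placeholders:

if __name__ == "__main__":
    # Serve the flow with a 6-hour cron schedule; switch to "0 3 * * *" for daily runs.
    daily_sweep.serve(
        name="olx-daily-sweep-6h",
        cron="0 */6 * * *",
        parameters={"category_ids": [1000, 2000, 3000]},  # illustrative category ids
    )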
Sample analytics queries on the collected dataset
Once your snapshots are landing reliably, the analytics layer is where the value materializes. A few queries that consistently come up across OLX datasets:
-- Top 50 ads by price movement in the last 7 days
SELECT list_id, MAX(price_brl) - MIN(price_brl) AS price_drop
FROM olx_ad_snapshot
WHERE snapshot_at > now() - interval '7 days'
GROUP BY list_id
HAVING MAX(price_brl) > MIN(price_brl)
ORDER BY price_drop DESC
LIMIT 50;
-- Delisting rate per category over the last 30 days
SELECT category_id,
    SUM(CASE WHEN is_active THEN 0 ELSE 1 END)::float / COUNT(*) AS delist_rate
FROM olx_ad_snapshot
WHERE snapshot_at > now() - interval '30 days'
GROUP BY category_id
ORDER BY delist_rate DESC;
-- New ads first seen in the last 14 days
SELECT list_id, MIN(snapshot_at) AS first_seen
FROM olx_ad_snapshot
GROUP BY list_id
HAVING MIN(snapshot_at) > now() - interval '14 days'
ORDER BY first_seen DESC;
These three queries alone power most of the dashboards a classifieds category analyst wants. Add a seller concentration view, a time-on-market view, and a category mix view and you have a classifieds intelligence product. The collection layer is the prerequisite; the analytics layer is where you create defensible value.
Building robust deduplication across noisy listings
When you scrape any ecommerce marketplace at scale, the long-tail catalogue is full of near-duplicate listings. The same physical product appears under different titles, different sellers, slight variations in pack size, and slightly different image sets. For analytics that try to compute brand share or category trends, deduplication is mandatory and it is harder than it looks.
The standard approach uses a three-pass funnel. The first pass groups by exact match on EAN or GTIN where present. The second pass groups by normalized title plus brand using a TF-IDF cosine similarity threshold of 0.85. The third pass groups by image hash similarity using perceptual hashing.
import imagehash
from PIL import Image

def perceptual_hash(image_path: str) -> str:
    """16x16 perceptual hash; near-duplicate images land within a small Hamming distance."""
    img = Image.open(image_path)
    return str(imagehash.phash(img, hash_size=16))
Tune the similarity thresholds against a hand-labeled gold set of 500 to 1,000 known duplicate clusters. Without a gold set, you will either over-merge (collapsing distinct variants) or under-merge (leaving the same product in many groups). Both failure modes break downstream analytics in subtle ways that take weeks to detect.
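For reference, the title-similarity pass (the second pass above) can be sketched with scikit-learn; the 0.85 threshold is the starting point suggested earlier and should be re-tuned against that gold set:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def title_duplicate_pairs(titles: list[str], threshold: float = 0.85) -> list[tuple[int, int]]:
    """Return index pairs of listings whose normalized titles exceed the similarity threshold.
    The quadratic pairwise scan is fine for a sketch; at scale, use nearest-neighbor indexing."""
    matrix = TfidfVectorizer(lowercase=True, strip_accents="unicode").fit_transform(titles)
    sims = cosine_similarity(matrix)
    pairs = []
    for i in range(len(titles)):
        for j in range(i + 1, len(titles)):
            if sims[i, j] >= threshold:
                pairs.append((i, j))
    return pairs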
Versioning your scraper for catalogue evolution
Every ecommerce site evolves its catalogue structure regularly. Stamp every snapshot row with the scraper version that produced it. When you deploy a new version of the parser, increment the version number. Downstream analytics can filter by version when they need consistent semantics across a time range, or join across versions when they want long-running trend analysis.
Caching strategy and incremental crawls
Full daily snapshots scale linearly with catalogue size, which becomes expensive at multi-million listing scale. Most production deployments shift from full snapshots to incremental refreshes after the initial ramp. The pattern uses three signals to decide what to refetch on each cycle. The first signal is the freshness deadline: every listing has a maximum staleness budget, and any listing older than its budget gets refreshed. The second signal is volatility: listings that have changed price recently get higher refresh priority. The third signal is business priority: listings that downstream users actually query get refreshed ahead of dormant listings that nobody has looked at in months.
This kind of priority-driven scheduler reduces total request volume by 60-80% compared to blind full snapshots, while keeping the data fresh on the listings that actually matter to the business.
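A sketch of the three-signal priority score; the weights and normalization constants here are arbitrary starting points to tune against your own refresh budget:

import time

def refresh_priority(listing: dict, now: float | None = None) -> float:
    """Score a listing for the next crawl cycle from staleness, volatility, and demand.
    Expected keys: last_crawled (epoch seconds), staleness_budget (seconds),
    price_changes_30d (int), queries_30d (int)."""
    now = now or time.time()
    staleness = (now - listing["last_crawled"]) / listing["staleness_budget"]
    volatility = min(listing.get("price_changes_30d", 0) / 5.0, 1.0)
    demand = min(listing.get("queries_30d", 0) / 100.0, 1.0)
    # Anything past its staleness budget (> 1.0) dominates; otherwise blend the signals.
    return staleness + 0.5 * volatility + 0.5 * demand

# Crawl the top-N listings by refresh_priority each cycle instead of the full catalogue.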
FAQ
Can I scrape all LATAM OLX countries with the same code?
The API contract is mostly consistent across OLX Argentina, OLX Colombia, and OLX Peru, but the parameter names and the category trees differ. Build a per-country adapter rather than assuming one code path covers everything. Brazilian-specific date parsing (DD/MM/YYYY) and currency formatting (R$ prefix) need country-specific handlers.
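A minimal shape for such a per-country adapter; the non-Brazilian entries are placeholders to be filled from each site's own search API, and the Brazilian parsers handle the DD/MM/YYYY and R$ conventions mentioned above:

from datetime import datetime
from decimal import Decimal

COUNTRY_CONFIG = {
    "br": {"domain": "www.olx.com.br", "currency": "BRL", "date_fmt": "%d/%m/%Y"},
    # "ar", "co", "pe": fill in domain, currency, and formats per country
}

def parse_brl_price(raw: str) -> Decimal:
    """Parse 'R$ 1.250,00' style strings: '.' as thousands separator, ',' as decimal."""
    digits = raw.replace("R$", "").strip().replace(".", "").replace(",", ".")
    return Decimal(digits)

def parse_br_date(raw: str) -> datetime:
    return datetime.strptime(raw, COUNTRY_CONFIG["br"]["date_fmt"])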
Is the OLX API officially documented?
No. The endpoints described here are the unauthenticated APIs that the public web site uses. They have been stable for several years but they are not contractually supported. Build defensively: log unknown response fields, alert on schema drift, and keep the parser version-stamped.
How does OLX handle ads from professional sellers vs. private sellers?
The ad object includes a professional_ad boolean flag that distinguishes dealer or business listings from individual private sellers. For analytics, separating professional from private listings is essential because the price dynamics and turnover characteristics are very different.
Can I track an ad across multiple snapshots to see if the price changed?
Yes. The list_id is stable for the life of the ad. Comparing the price field across consecutive snapshots gives you the price-change history. About 15-20% of ads see at least one price change before the ad is delisted.
What about OLX motors vs. general merchandise?
OLX Brazil operates a sub-property at autos.olx.com.br for vehicles with additional automotive-specific filters and an inventory model that overlaps with Webmotors. The same API patterns work but with vehicle-specific category IDs and additional fields like make, model, year, and mileage.
To build a broader LATAM classifieds intelligence stack, browse the ecommerce scraping category for tooling reviews, proxy comparisons, and framework deep dives that pair with the patterns above.