Scraping hotel availability and ADR data in 2026
Hotel rate data is one of the most analytically rich travel datasets. It supports use cases from competitive intelligence for hotel chains to market research for institutional investors in hospitality real estate. Average Daily Rate (ADR) is the headline metric in the hotel industry, and ADR by submarket is the foundation of hotel revenue management. Three things shape the scraping landscape: dominant aggregator sites (Booking.com, Expedia, Hotels.com) that expose rich availability data behind aggressive bot defenses; direct hotel chain sites that often expose better rates than aggregators; and a per-search dimensionality (arrival date × occupancy × length of stay) that creates a combinatorial explosion similar to flight search.
This guide focuses on practical patterns for hotel ADR research that produce useful intelligence without requiring full enterprise-scale infrastructure.
Source taxonomy and search patterns
The hotel pricing ecosystem has three distinct source types.
Aggregator sites (Booking.com, Expedia, Hotels.com, Agoda) consolidate inventory from hundreds of thousands of hotels worldwide. They expose powerful search interfaces with calendar-based availability and have aggressive bot defenses. Booking.com is the largest and the most heavily defended.
Direct hotel chain sites (marriott.com, hilton.com, ihg.com, accor.com) publish their own inventories and often have lower rates than aggregators because chains avoid aggregator commissions on direct bookings. The chain sites use brand-specific search APIs that are well-structured but rate-limited.
Independent hotel websites are the long-tail source. Most independent hotels use one of a handful of property management systems (Cloudbeds, Mews, Opera) that expose booking widgets with consistent structures. Scraping at this level is typically only worthwhile for specific submarkets where the major chains do not dominate.
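import httpx

BOOKING_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept": "application/json",
    "Accept-Language": "en-US,en;q=0.9",
}


async def search_booking(city_id: str, checkin: str, checkout: str, proxy: str) -> list:
    """Search one city for one stay and return the raw hotel records."""
    # Undocumented endpoint: the path and parameter names can change without notice.
    url = "https://www.booking.com/searchresults.json"
    params = {
        "dest_id": city_id,
        "dest_type": "city",
        "checkin": checkin,
        "checkout": checkout,
        "group_adults": 2,
        "no_rooms": 1,
        "selected_currency": "USD",
    }
    # The proxy= keyword needs a recent httpx release; older ones used proxies=.
    async with httpx.AsyncClient(proxy=proxy, headers=BOOKING_HEADERS, timeout=30) as c:
        r = await c.get(url, params=params)
    # A 200 can still carry an HTML challenge page, so guard the JSON parse.
    if r.status_code == 200 and "json" in r.headers.get("content-type", ""):
        return r.json().get("hotels", [])
    return []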
Booking.com’s search API is undocumented. The endpoint paths and parameter names change periodically, and the bot defenses are sophisticated. Plan for active maintenance and dedicated alerting on parser breakage.
ADR computation and submarket aggregation
Strictly, ADR is room revenue divided by rooms sold, a figure only the hotel itself knows. Scraped data yields an advertised-rate proxy: the average bookable rate across hotels of comparable class in a submarket. The standard hospitality classification uses the STR (Smith Travel Research) chain scale: Luxury, Upper Upscale, Upscale, Upper Midscale, Midscale, Economy. Each submarket plus chain scale combination produces a meaningful ADR signal.
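def compute_adr(hotel_rates: list[dict], chain_scale: str | None = None) -> float | None:
    """Mean advertised rate across hotels, optionally limited to one chain scale."""
    if chain_scale:
        hotel_rates = [h for h in hotel_rates if h.get("chain_scale") == chain_scale]
    # Skip hotels with no bookable rate (sold out, or the parse failed).
    rates = [h["rate"] for h in hotel_rates if h.get("rate")]
    if not rates:
        return None
    return sum(rates) / len(rates)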
For a meaningful ADR series, snapshot the same set of hotels across the same set of arrival dates daily. This produces a consistent panel that supports clean year-over-year and month-over-month comparisons. Ad hoc snapshots that include different hotels on different days produce noisy ADR series that conflate composition shifts with real rate changes.
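The consistent-panel idea can be sketched as a filter that keeps only hotels observed on every snapshot date. This is a minimal illustration, not a prescribed method: the row shape (`hotel_id`, `snapshot_date`, `rate`) and the function name `balanced_panel` are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date


def balanced_panel(rows: list[dict]) -> list[dict]:
    """Keep only hotels that have a rate on every snapshot date present in rows."""
    all_dates = {r["snapshot_date"] for r in rows}
    seen = defaultdict(set)
    for r in rows:
        if r.get("rate") is not None:
            seen[r["hotel_id"]].add(r["snapshot_date"])
    # A hotel is "complete" when it was priced on every date in the sample.
    complete = {hotel for hotel, dates in seen.items() if dates == all_dates}
    return [r for r in rows if r["hotel_id"] in complete]


# Illustrative rows (made-up rates): hotel B has no snapshot on 2 March.
rows = [
    {"hotel_id": "A", "snapshot_date": date(2026, 3, 1), "rate": 200.0},
    {"hotel_id": "A", "snapshot_date": date(2026, 3, 2), "rate": 210.0},
    {"hotel_id": "B", "snapshot_date": date(2026, 3, 1), "rate": 120.0},
]
panel = balanced_panel(rows)
```

With these made-up rows, the unfiltered average moves from 160 to 210 between the two days purely because hotel B dropped out, while the panel shows the real movement of 200 to 210.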
Search-space management for hotels
The hotel search space is large but more tractable than flight search. There are roughly 200,000 hotels listed on Booking.com globally, with the top 30 cities accounting for 40% of bookable inventory. For most analytical use cases, sampling 10,000-20,000 hotels across 50 priority markets, each at 7-14 arrival dates, produces a high-fidelity ADR dataset.
| Market segment | Hotels to track | Arrival dates (days ahead) | Frequency |
|---|---|---|---|
| Top 50 cities | 10,000+ | 7, 14, 30 | Daily |
| Secondary markets | 3,000-5,000 | 14, 30 | 3x weekly |
| Resort destinations | 2,000-3,000 | 30, 60, 90 | Weekly |
The frequency-by-segment approach optimizes proxy spend against analytical value. High-frequency cities support real-time competitive intelligence; lower-frequency markets support broader trend analysis.
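To make the size of the search space concrete, the sketch below expands hotels × lead times × length of stay into individual searches. The lead times come from the table above; the segment keys, the `SearchTask` type, and the default of one-night and three-night stays are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from itertools import product

# Lead times in days ahead per segment, copied from the table above.
SEGMENT_LEAD_DAYS = {
    "top_50_cities": (7, 14, 30),
    "secondary_markets": (14, 30),
    "resort_destinations": (30, 60, 90),
}


@dataclass(frozen=True)
class SearchTask:
    hotel_id: str
    checkin: date
    checkout: date


def build_tasks(hotel_ids, segment, today, los_values=(1, 3)):
    """Expand hotels x lead times x LOS into concrete searches."""
    tasks = []
    for hotel_id, lead, los in product(hotel_ids, SEGMENT_LEAD_DAYS[segment], los_values):
        checkin = today + timedelta(days=lead)
        tasks.append(SearchTask(hotel_id, checkin, checkin + timedelta(days=los)))
    return tasks


tasks = build_tasks(["h1", "h2"], "top_50_cities", date(2026, 3, 1))
```

Two hotels already produce 12 searches per sweep. At 10,000 hotels the same segment becomes 10,000 × 3 × 2 = 60,000 searches per day, which is why the sampling frequency is set per segment.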
Schema for hotel rate snapshots
CREATE TABLE hotel_rate_snapshot (
    snapshot_at        TIMESTAMP NOT NULL,
    hotel_id           VARCHAR(64) NOT NULL,
    source             VARCHAR(16) NOT NULL,
    arrival_date       DATE NOT NULL,
    los                INT NOT NULL,
    rate_usd           DECIMAL(10,2),
    currency           VARCHAR(3),
    cancellation       VARCHAR(32),
    breakfast_included BOOLEAN,
    available          BOOLEAN,
    PRIMARY KEY (snapshot_at, hotel_id, source, arrival_date, los)
);
The length-of-stay (LOS) dimension matters because hotels often price a one-night stay (LOS-1) differently per night from a three-night (LOS-3) or seven-night (LOS-7) stay. For most analytical use cases, snapshot LOS-1 (the canonical ADR signal) plus LOS-3 (the canonical leisure signal). LOS-7 is useful specifically for resort destinations.
For broader pattern guidance, see our residential proxy provider ranking and our headless browser frameworks ranking.
Detecting and routing around bot challenges
When hotel aggregators flag your traffic, the response is usually a Cloudflare or other vendor challenge page rather than a clean HTTP error. Your scraper often receives HTML where it expected JSON, so it needs to detect the swap explicitly. Look for the cf-mitigated response header, cookies whose names begin with __cf_chl_, or HTML containing "Just a moment...".
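def is_challenged(response) -> bool:
    """Heuristic check for a bot-challenge page on an httpx-style response."""
    # Challenge pages commonly arrive as 403 or 503.
    if response.status_code in (403, 503):
        return True
    # Cloudflare marks challenged responses with this header.
    if "cf-mitigated" in response.headers:
        return True
    if "__cf_chl_" in response.headers.get("set-cookie", ""):
        return True
    # Fall back to scanning the first 2,000 characters of the body.
    body = response.text[:2000].lower()
    return "just a moment" in body or "checking your browser" in body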
When you detect a challenge, do not retry on the same IP for at least 30 minutes. Mark that IP as cooling and route subsequent requests to a different IP in your pool. For pages that absolutely must be fetched, have a fallback path that uses a headless browser. Most production setups maintain a 95/5 split between the lightweight HTTP path and the browser fallback path.
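The cooling rule can be sketched as a small round-robin pool that skips any IP still inside its cooldown. This is one possible shape, not a reference implementation: the class name `CoolingPool` and the injectable `clock` (used here so the behaviour can be tested without waiting 30 minutes) are assumptions.

```python
import time


class CoolingPool:
    """Round-robin over proxies, skipping any that are cooling down."""

    def __init__(self, proxies, cooldown_seconds=30 * 60, clock=time.monotonic):
        self.proxies = list(proxies)
        self.cooldown = cooldown_seconds
        self.clock = clock
        self.cooling_until = {}  # proxy -> time at which it becomes usable again
        self._next = 0

    def mark_challenged(self, proxy):
        """Call this when is_challenged() fires for a response from this proxy."""
        self.cooling_until[proxy] = self.clock() + self.cooldown

    def acquire(self):
        """Return the next usable proxy, or None if every proxy is cooling."""
        now = self.clock()
        for _ in range(len(self.proxies)):
            proxy = self.proxies[self._next % len(self.proxies)]
            self._next += 1
            if self.cooling_until.get(proxy, 0.0) <= now:
                return proxy
        return None
```

A `None` from `acquire()` is a natural trigger for the headless-browser fallback path, since it means the lightweight path currently has no clean IP to use.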
Operational monitoring and alerting
Every production scraper needs three monitoring layers. The first is per-IP success rate over a 5-minute window, alerting if any IP drops below 80%. The second is parser error rate, alerting if more than 1% of fetched pages fail to extract the canonical fields. The third is data freshness, alerting if your downstream consumers see snapshots more than 24 hours old.
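import time
from collections import deque


class IPHealthTracker:
    """Sliding-window success rate per proxy IP."""

    def __init__(self, window_seconds: int = 300):
        self.window = window_seconds
        self.events: dict[str, deque] = {}

    def _prune(self, bucket: deque, now: float) -> None:
        # Drop events that have aged out of the window.
        while bucket and bucket[0][0] < now - self.window:
            bucket.popleft()

    def record(self, ip: str, success: bool) -> None:
        bucket = self.events.setdefault(ip, deque())
        now = time.time()
        bucket.append((now, success))
        self._prune(bucket, now)

    def success_rate(self, ip: str) -> float:
        bucket = self.events.get(ip)
        if bucket:
            # Prune on read too, so an idle IP is not judged on stale events.
            self._prune(bucket, time.time())
        if not bucket:
            return 1.0  # no recent data: treat the IP as healthy
        return sum(1 for _, ok in bucket if ok) / len(bucket)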
Wire this into Prometheus or your existing observability stack so the on-call engineer sees IP degradation as it happens rather than after the daily snapshot fails.
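The other two monitoring layers, parser error rate and data freshness, can be sketched with the standard library alone. The thresholds (1% and 24 hours) come from the text above; the choice of canonical fields and the function names are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumed canonical fields; adjust to whatever your parser must always extract.
REQUIRED_FIELDS = ("hotel_id", "arrival_date", "available")


def parser_error_rate(records):
    """Share of parsed records missing at least one canonical field."""
    if not records:
        return 0.0
    bad = sum(1 for r in records if any(r.get(f) is None for f in REQUIRED_FIELDS))
    return bad / len(records)


def is_stale(latest_snapshot_at, now, max_age=timedelta(hours=24)):
    """True when the newest snapshot is older than max_age."""
    return now - latest_snapshot_at > max_age


def alerts(records, latest_snapshot_at, now):
    """Names of the alerts that should fire for this batch."""
    fired = []
    if parser_error_rate(records) > 0.01:
        fired.append("parser_error_rate")
    if is_stale(latest_snapshot_at, now):
        fired.append("stale_data")
    return fired
```

An empty batch is reported as a 0% error rate here on purpose: a missing batch is a freshness problem, and the staleness check is the layer that should catch it.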
Pipeline orchestration and scheduling
For any non-trivial hotel pricing scraping operation, a dedicated orchestration layer is the difference between a script you babysit and a service that runs unattended. The two strong open-source choices in 2026 are Prefect 3 and Dagster.
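from prefect import flow, task


@task(retries=3, retry_delay_seconds=60)
def fetch_source(source_id: str, page: int):
    # crawl_one_page is your own fetch-and-parse function; it is not defined here.
    return crawl_one_page(source_id, page)


@flow(name="hotel-pricing-daily-sweep")
def daily_sweep(source_ids: list):
    futures = []
    for sid in source_ids:
        for page in range(1, 30):  # pages 1 through 29
            # .submit() hands the task to the flow's task runner and returns a future.
            futures.append(fetch_source.submit(sid, page))
    # Block until every page has been fetched or has exhausted its retries.
    return [f.result() for f in futures]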
Run the flow on a cadence aligned to how dynamic the underlying data is. For hotel pricing where records change intraday, a 4-6 hour cadence catches meaningful movements without driving up proxy costs. For longer-cycle data, daily is sufficient.
Data quality monitoring patterns
Beyond per-IP success rate, every snapshot should pass a small battery of data quality checks before being considered authoritative. Structural checks verify that every required field is present and of the expected type. Distributional checks compare the current snapshot against recent history. Semantic checks compare related fields for consistency.
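def quality_check(snapshot: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the snapshot passes."""
    errors = []
    if not snapshot:
        errors.append("empty snapshot")
        return errors
    # get_yesterday_avg_size is your own lookup of the recent average row count.
    avg_yesterday = get_yesterday_avg_size()
    # Distributional check: flag a drop of more than 30% against recent history.
    if len(snapshot) < avg_yesterday * 0.7:
        errors.append("snapshot size below threshold")
    return errors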
Run quality checks as a separate flow that gates promotion of the snapshot from staging to production. A snapshot that fails quality checks should be quarantined for human review.
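The structural and semantic checks named above can be sketched per row as follows. The field names follow the snapshot schema earlier in this guide; the specific rules (for example "available but no rate") and the function names are illustrative assumptions, not an exhaustive rule set.

```python
from datetime import date

# Expected Python types for the fields every row must carry (assumed).
EXPECTED_TYPES = {"hotel_id": str, "arrival_date": date, "los": int, "available": bool}


def structural_errors(row):
    """Required fields must be present and of the expected type."""
    return [
        f"{field}: expected {typ.__name__}"
        for field, typ in EXPECTED_TYPES.items()
        if not isinstance(row.get(field), typ)
    ]


def semantic_errors(row):
    """Related fields must agree with each other."""
    errors = []
    rate = row.get("rate_usd")
    if row.get("available") and rate is None:
        errors.append("available but no rate")
    if rate is not None and rate <= 0:
        errors.append("non-positive rate")
    if isinstance(row.get("los"), int) and row["los"] < 1:
        errors.append("los below 1")
    return errors


def passes_gate(snapshot):
    """True only if every row is structurally and semantically clean."""
    return all(not structural_errors(r) and not semantic_errors(r) for r in snapshot)
```

In a staging-to-production gate, a `False` from `passes_gate` would route the snapshot to quarantine rather than promotion.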
Cost optimization strategies
Proxy bandwidth is usually the dominant cost in a production scraping operation. Three optimization patterns consistently reduce cost without hurting data quality. The first is request deduplication: serve cached responses when consumers ask for the same record within the same hour. The second is conditional GET using ETag or If-Modified-Since headers when supported. The third is selective field hydration when the upstream API supports field selection.
For workloads above 100 GB of monthly proxy bandwidth, these three optimizations together reduce cost by 40-60% without changing the analytical output. The engineering effort is modest and the payback period is usually under a month at production volume.
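The first two patterns can be sketched in a few lines. The cache below approximates "the same hour" with a rolling one-hour TTL, which is an assumption; `HourlyCache` and `conditional_headers` are hypothetical names, and the `fetch` callable stands in for whatever function actually performs the request.

```python
import time


class HourlyCache:
    """Serve a cached value when the same key was fetched within ttl_seconds."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (fetched_at, value)

    def get_or_fetch(self, key, fetch):
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # deduplicated: no proxy bandwidth spent
        value = fetch()
        self._store[key] = (now, value)
        return value


def conditional_headers(etag=None, last_modified=None):
    """Build conditional-GET headers from validators saved on the previous fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers
```

When the server honours the conditional headers it answers 304 Not Modified with no body, and the previously stored record can be reused as-is.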
End-to-end pipeline architecture
A production-grade scraping pipeline has four layers: collection, parsing, storage, and serving. The collection layer handles the network conversation and knows nothing about data shape. The parsing layer transforms raw bytes into structured records and owns the schema. The storage layer holds the canonical snapshots in a query-optimized format like DuckDB or ClickHouse. The serving layer exposes the data to consumers and should be denormalized and pre-aggregated where possible.
Decoupling these layers also enables independent scaling. The collection layer is bound by proxy capacity and network bandwidth. The parsing layer is CPU-bound. The storage layer is bound by I/O and disk capacity. The serving layer is bound by query concurrency.
Legal and compliance considerations
Public hotel pricing data is generally treated as fair to scrape in most jurisdictions, but always confine your collection to non-personal data: identifiers, structured attributes, and aggregates. Avoid collecting personally identifying details, and avoid pulling any data behind a login.
For commercial deployment, document your basis for processing, your data retention period, and your purpose limitation. These concepts come from the EU's GDPR, and its lawful-basis, storage-limitation, and purpose-limitation principles remain a useful starting point for documenting your approach even when you collect no personal data.
Sample analytics queries on the collected dataset
-- Volume trend over the last 30 days
SELECT date_trunc('day', snapshot_at) AS day, COUNT(*) AS records
FROM hotel_rate_snapshot
WHERE snapshot_at > now() - interval '30 days'
GROUP BY 1
ORDER BY 1;

-- Source distribution over the last 7 days
SELECT source, COUNT(*) AS records
FROM hotel_rate_snapshot
WHERE snapshot_at > now() - interval '7 days'
GROUP BY source
ORDER BY records DESC;
Add a category share view, a source concentration view, and a price-volatility view (where applicable) and you have a solid foundation for a hotel pricing intelligence product.
Versioning your scraper for source evolution
Every hotel pricing source evolves its schema regularly. Stamp every snapshot row with the scraper version that produced it. Downstream analytics can filter by version when they need consistent semantics across a time range. Pair this with a small registry table that documents what each scraper version did differently.
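A minimal sketch of version stamping, assuming rows are plain dictionaries. The version strings and registry notes below are hypothetical placeholders, not real release history.

```python
SCRAPER_VERSION = "2026.03.1"  # hypothetical version identifier

# Registry documenting what each scraper version did differently (made-up notes).
VERSION_REGISTRY = {
    "2026.02.4": "rate_usd parsed from the search results card",
    "2026.03.1": "rate_usd parsed from the room detail payload",
}


def stamp(rows, version=SCRAPER_VERSION):
    """Attach the scraper version to every row; refuse unregistered versions."""
    if version not in VERSION_REGISTRY:
        raise ValueError(f"unregistered scraper version: {version}")
    return [{**row, "scraper_version": version} for row in rows]


def rows_for_versions(rows, versions):
    """Filter to rows produced by versions with compatible semantics."""
    allowed = set(versions)
    return [row for row in rows if row.get("scraper_version") in allowed]
```

Refusing unregistered versions keeps the registry honest: a new scraper release cannot write rows until someone has recorded what changed.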
Caching strategy and incremental crawls
Full daily snapshots scale linearly with source size, which becomes expensive at multi-million record scale. Most production deployments shift from full snapshots to incremental refreshes after the initial ramp using freshness deadline, volatility, and business priority signals. Priority-driven scheduling reduces total request volume by 60-80% compared to blind full snapshots.
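One way to combine the three signals is a single refresh score, with the crawl budget spent on the highest-scoring records. The formula and weights below are illustrative assumptions, as are the record field names; a real deployment would tune them against observed change rates.

```python
import heapq


def refresh_score(age_hours, deadline_hours, volatility, priority):
    """Higher score means refresh sooner. The weighting is illustrative only."""
    overdue = age_hours / deadline_hours  # above 1.0 means the freshness deadline has passed
    return overdue * (1.0 + volatility) * priority


def pick_refresh_batch(records, budget):
    """Return the `budget` records most in need of a refresh, best first."""
    return heapq.nlargest(
        budget,
        records,
        key=lambda r: refresh_score(
            r["age_hours"], r["deadline_hours"], r["volatility"], r["priority"]
        ),
    )


# Illustrative records with made-up signal values.
records = [
    {"id": "a", "age_hours": 30, "deadline_hours": 24, "volatility": 0.5, "priority": 1},
    {"id": "b", "age_hours": 2, "deadline_hours": 24, "volatility": 0.1, "priority": 1},
    {"id": "c", "age_hours": 12, "deadline_hours": 24, "volatility": 0.2, "priority": 3},
]
batch = pick_refresh_batch(records, budget=2)
```

Here record `a` is overdue and volatile, `c` is not yet due but carries high business priority, and `b` was fetched two hours ago, so a budget of two skips `b`.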
Building a market intelligence dashboard from the dataset
The most common analytical product on top of hotel scraping is a market intelligence dashboard that tracks ADR, occupancy proxy (using room-availability counts as a leading indicator), and per-segment rate movements. The headline metric is week-over-week ADR change per submarket per chain scale. A 5%+ week-over-week move in upper upscale Manhattan ADR is a real signal that warrants investigation.
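import pandas as pd


def adr_trend(df: pd.DataFrame) -> pd.DataFrame:
    """ADR, availability share and hotel count per submarket, chain scale and day."""
    # Expects submarket, chain_scale and snapshot_date joined onto the rate rows.
    return df.groupby(['submarket', 'chain_scale', 'snapshot_date']).agg(
        adr=('rate_usd', 'mean'),
        availability_pct=('available', lambda x: 100 * x.mean()),
        hotel_count=('hotel_id', 'nunique'),
    )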
For institutional investors in hospitality real estate, the dashboard also tracks new-supply signals: hotels appearing in the dataset for the first time. New supply lags real construction by 6-12 months because hotels appear on aggregators only when the operator opens bookings. The lag is itself useful information for capital allocators.
For revenue managers at specific hotels, the most useful view is the comp-set view: rates for a hand-curated set of competitive hotels in the same submarket, refreshed multiple times per day. The comp-set rate movements directly inform pricing decisions for the next 1-7 days.
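The headline week-over-week metric can be sketched without any dataframe library. The input shape (a chronological list of week label and ADR pairs for one submarket and chain scale) and the function name are assumptions; the 5% flag threshold is the one quoted above, and the sample ADR values are made up.

```python
def wow_change_pct(adr_by_week, threshold_pct=5.0):
    """Return (week, pct_change, flagged) for each week after the first.

    adr_by_week: chronological list of (week_label, adr) tuples.
    """
    out = []
    for (_, prev), (week, cur) in zip(adr_by_week, adr_by_week[1:]):
        change = 100.0 * (cur - prev) / prev
        out.append((week, round(change, 2), abs(change) >= threshold_pct))
    return out


# Illustrative series for one submarket and chain scale.
series = [("2026-W09", 400.0), ("2026-W10", 408.0), ("2026-W11", 432.0)]
moves = wow_change_pct(series)
# moves == [("2026-W10", 2.0, False), ("2026-W11", 5.88, True)]
```

Flagging on the absolute change catches sharp drops as well as spikes, since both warrant investigation.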
Forward-booking-curve analytics
The forward booking curve is the price progression for a given arrival date as the date approaches. Hotels typically run 3-5 distinct pricing tiers across the booking window: a deep-advance discount tier 60-90 days out, a moderate tier 30-60 days out, a regular tier 14-30 days out, a high-demand tier 7-14 days out, and a last-minute tier inside 7 days.
For competitive intelligence, computing the forward booking curve for a comp-set reveals how aggressively competitors are managing their advance-purchase windows. A competitor that flips from deep-advance to high-demand pricing 45 days out (rather than the typical 14 days) is signaling unusual demand. That signal is actionable for revenue managers.
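A forward booking curve can be rebuilt from stored snapshots by computing days-out for each observation of one arrival date. In the sketch below the tier boundaries follow the ranges listed above, read as half-open intervals; treating anything 60 or more days out as deep-advance, and the row field names, are assumptions.

```python
from datetime import date


def lead_time_tier(days_out):
    """Map days before arrival to the pricing tiers described above."""
    if days_out < 7:
        return "last_minute"
    if days_out < 14:
        return "high_demand"
    if days_out < 30:
        return "regular"
    if days_out < 60:
        return "moderate"
    return "deep_advance"  # assumption: anything 60+ days out


def booking_curve(snapshots, arrival):
    """Return [(days_out, rate_usd)] for one arrival date, furthest out first."""
    points = [
        ((s["arrival_date"] - s["snapshot_date"]).days, s["rate_usd"])
        for s in snapshots
        if s["arrival_date"] == arrival and s["rate_usd"] is not None
    ]
    return sorted(points, reverse=True)
```

Comparing the observed rate at each point with what its `lead_time_tier` would normally carry is one way to spot a competitor that has moved to high-demand pricing unusually early.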
Brand-driven rate variation
Within the same submarket and chain scale, brand-level rate variation is meaningful. Renaissance and Westin (both Marriott International brands in the upper upscale tier) often price differently in the same market because they target different guest segments. Tracking ADR by brand reveals which brands command a rate premium and how that premium evolves.
For brand strategy work at hotel chains, this brand-level ADR view is the most direct signal of brand health in a market.
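Brand premium can be expressed as each brand's mean rate relative to the mean of all hotels in the same submarket and chain scale. The sketch assumes the caller has already filtered rows to one such cell; the `brand` field, the function name, and the sample rates are illustrative.

```python
from collections import defaultdict
from statistics import mean


def brand_premium_pct(rows):
    """Percent premium of each brand's mean rate over the mean of all rows.

    Pass rows for a single submarket and chain scale so the baseline is comparable.
    """
    by_brand = defaultdict(list)
    for r in rows:
        by_brand[r["brand"]].append(r["rate_usd"])
    baseline = mean(r["rate_usd"] for r in rows)
    return {
        brand: round(100.0 * (mean(rates) - baseline) / baseline, 1)
        for brand, rates in by_brand.items()
    }


# Made-up rates for illustration only.
rows = [
    {"brand": "Westin", "rate_usd": 330.0},
    {"brand": "Westin", "rate_usd": 310.0},
    {"brand": "Renaissance", "rate_usd": 280.0},
    {"brand": "Renaissance", "rate_usd": 280.0},
]
premiums = brand_premium_pct(rows)
```

Tracking this figure per snapshot date gives the premium-over-time series the text describes.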
International hotel scraping notes
Outside the U.S., the dominant aggregators shift. Trip.com dominates China and Southeast Asia. Booking.com dominates Europe. Agoda has strong APAC presence. The patterns transfer with per-region adapters and per-region proxy sourcing because country-specific IPs improve success rates significantly. For pan-global hotel intelligence products, plan for 3-5 regional pipelines feeding a unified canonical schema.
OTA vs direct rate parity tracking
Rate parity between online travel agencies (OTAs) and direct chain sites is enforced through most-favored-nation (MFN) clauses, but parity violations happen routinely and have meaningful commercial impact. For a hotel chain compliance team, the daily question is: are any of our hotels showing lower rates on Booking.com, Expedia, or Hotels.com than on our own site for the same date and room type?
The pattern is to scrape the direct chain site and the major OTAs in parallel for the same hotel plus date plus room combination, then diff the rates. A meaningful violation is a 5%+ rate gap that persists for more than 24 hours. Smaller and shorter gaps are often arbitrage spreads that close quickly.
import asyncio


async def parity_check(hotel_id, arrival, los, threshold=0.05):
    # Fetch all three channels concurrently so the rates are comparable in time.
    direct, booking, expedia = await asyncio.gather(
        fetch_direct_rate(hotel_id, arrival, los),
        fetch_booking_rate(hotel_id, arrival, los),
        fetch_expedia_rate(hotel_id, arrival, los),
    )
    if not direct:
        return None  # no direct rate to compare against
    ota_rates = [r for r in (booking, expedia) if r]
    if any(r < direct * (1 - threshold) for r in ota_rates):
        alert_parity_violation(hotel_id, arrival, [direct, booking, expedia])
    return max([direct, *ota_rates]) - min([direct, *ota_rates])  # rate spread
For chains with thousands of properties, the parity-check pipeline runs continuously and produces a daily violation report that goes to the brand compliance team. The data itself is straightforward; the operational reliability requirements are the engineering challenge.
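The persistence condition (a 5%+ gap lasting more than 24 hours) needs history, not a single check. A minimal sketch, assuming observations for one hotel, date, and room are available as chronological tuples; the function name and tuple shape are hypothetical.

```python
from datetime import datetime, timedelta


def persistent_violation(observations, gap=0.05, min_duration=timedelta(hours=24)):
    """True if the OTA undercut the direct rate by at least `gap` continuously
    for longer than `min_duration`.

    observations: chronological (snapshot_at, direct_rate, ota_rate) tuples.
    """
    run_start = None
    for snapshot_at, direct, ota in observations:
        undercut = bool(direct and ota and ota < direct * (1 - gap))
        if not undercut:
            run_start = None  # the gap closed, so the run resets
            continue
        if run_start is None:
            run_start = snapshot_at
        if snapshot_at - run_start > min_duration:
            return True
    return False
```

Resetting the run whenever the gap closes is what separates a sustained violation from the short-lived spreads that close on their own.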
Group rate and corporate rate tracking
Group rates and corporate negotiated rates are typically not visible in standard search but show up under specific corporate codes or group blocks. Some hotel chains expose corporate rate search behind a known-employer dropdown that maps to internal account IDs.
For competitive corporate-rate intelligence, scrape the public group-block search and snapshot rates for the major corporate accounts that publish their negotiated rates. The dataset is sparse but provides a real signal of how corporate-rate competition is evolving in major business cities.
Common pitfalls when scraping hotel rates
Three issues recur. The first is room-type mismatching across OTAs. Booking.com, Expedia, and the brand site can describe the same room type with different names (‘King Deluxe’ vs ‘Premium King Room’). Joining on room name produces silent miscomparisons. Use a property + bed configuration + view + smoking-status hash as the canonical room key.
The second is rate-plan obfuscation. The displayed rate can be the refundable, non-refundable, or member-only rate depending on session state. The same room can show $180, $165, and $155 within the same scrape session. Capture the rate plan code on every snapshot, or your ADR calculations will jitter without explanation.
The third is taxes-and-fees normalization. Some markets show pre-tax rates with resort fees broken out at checkout (US), others show all-in pricing (EU). Time-series comparisons that ignore this difference attribute regional pricing differences to demand when they are structural. Always normalize to a tax-inclusive total and store the breakdown separately.
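The room-key and tax-normalization fixes can be sketched together as two small helpers. The key format, the 16-character truncation, and the argument names are assumptions for illustration; the amounts in the usage note are made up.

```python
import hashlib
from decimal import Decimal


def room_key(property_id, bed_config, view, smoking):
    """Canonical room key: hash of property + bed configuration + view + smoking status."""
    parts = [property_id, bed_config, view, "smoking" if smoking else "non-smoking"]
    canonical = "|".join(p.strip().lower() for p in parts)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def all_in_total(base, taxes=Decimal("0"), fees=Decimal("0"), tax_inclusive=False):
    """Nightly total including taxes and fees.

    When the source already quotes a tax-inclusive rate, only separately
    charged fees are added; store base, taxes and fees alongside the total.
    """
    if tax_inclusive:
        return base + fees
    return base + taxes + fees
```

For example, a US listing at a base of 180.00 with 26.10 in taxes and a 35.00 resort fee normalizes to 241.10, which is the figure to compare against an all-in EU price. Because the key is built from structured attributes rather than the marketing name, "King Deluxe" and "Premium King Room" resolve to the same key when the underlying attributes match.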
FAQ
Are hotel rates legal to scrape?
Hotel rate data is generally considered public commercial information. The aggregator sites have terms of service that prohibit unauthorized scraping; their enforcement focuses on competitive products. Confine your collection to non-personal data and consult counsel for commercial use cases.
How do hotel chains track rate parity?
Rate parity (the same rate across all distribution channels) is enforced through MFN clauses in agreements between chains and OTAs. Hotel chains use compliance monitoring tools that scrape the OTAs to detect parity violations. The same scraping patterns that support competitive intelligence also support parity compliance.
Can I scrape direct chain sites at scale?
Yes, with appropriate proxies and rate limits. Marriott, Hilton, and IHG have moderate bot defenses that respond well to U.S. residential IPs and reasonable request rates. Plan for per-chain rate limit management because the limits differ across chains.
What about Airbnb and short-term rental data?
Airbnb publishes a public listing search but actively prohibits scraping. Several specialized data providers (AirDNA, Mashvisor) license Airbnb data for commercial intelligence; for most analytical use cases, licensed data is the practical path.
How do I handle the LOS-pricing dimension?
Always snapshot at multiple LOS values for any rate research project. LOS-1 for the canonical ADR signal, LOS-3 for the leisure signal, LOS-7 for the resort signal. Single-LOS snapshots produce misleading rate trends in markets where the LOS pricing is meaningfully different.
How do I track competitive set rates without tipping off the OTA?
Distribute searches across residential proxies and vary the search anchor (date range, occupancy) so the request fingerprint matches a typical user. Rate-limit per IP to under 10 searches per hour.
What is the ideal sampling cadence for revenue management?
Hourly for the next 7 days, every 4-6 hours for 8-30 days out, and daily for 31-180 days out. This compresses cost while preserving the signal that matters for short-term yield calls.
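The cadence above maps directly onto a lookup by days-out. In this sketch the 4-6 hour band is pinned to 6 hours and day 0 (same-day arrival) is included in the hourly band; both are assumptions, as are the function names.

```python
from datetime import timedelta


def refresh_interval(days_out):
    """How often to re-scrape an arrival date that is `days_out` days away."""
    if days_out < 0 or days_out > 180:
        return None  # outside the tracked booking window
    if days_out <= 7:
        return timedelta(hours=1)
    if days_out <= 30:
        return timedelta(hours=6)  # upper end of the 4-6 hour band
    return timedelta(days=1)


def searches_per_day(max_days_out=180):
    """Daily searches for one hotel at one LOS under this cadence."""
    return sum(
        timedelta(days=1) / refresh_interval(d) for d in range(max_days_out + 1)
    )
```

Under these assumptions one hotel at one LOS needs 8 × 24 + 23 × 4 + 150 × 1 = 434 searches per day, against 181 × 24 = 4,344 if every arrival date were sampled hourly.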
To build broader hospitality intelligence pipelines, browse the ecommerce scraping category for tooling reviews and framework deep dives.