Scraping Airbnb Reviews with data-review-id Selector (2026 Guide)

Airbnb’s review section is one of the richest datasets in short-term rental research, but it sits behind a JavaScript-rendered page that blocks naive scrapers within minutes. The key to reliable extraction in 2026 is the data-review-id attribute, a stable HTML hook that Airbnb uses to identify each review card regardless of CSS class churn.

Why data-review-id Is the Right Selector

Airbnb’s frontend has been rebuilt several times. Class names like _1gjypya rotate with deploys, but the data-review-id attribute is tied to the underlying data model and has stayed consistent through multiple redesigns. Selecting on [data-review-id] anchors your parser to structure, not style.

Each review card looks roughly like this in the DOM:

<div data-review-id="1102847563982741504">
  <span data-testid="review-author">María G.</span>
  <span data-testid="review-date">April 2026</span>
  <div data-testid="review-body">
    Absolutely loved the place. Clean, quiet, and ...
  </div>
</div>

Your selector chain, shown here in Python with Playwright (Puppeteer's equivalent uses the same selectors):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    page.goto(listing_url)  # listing_url: the Airbnb listing page to scrape
    page.wait_for_selector('[data-review-id]')  # reviews render client-side
    for r in page.query_selector_all('[data-review-id]'):
        review_id = r.get_attribute('data-review-id')
        author = r.query_selector('[data-testid="review-author"]').inner_text()
        body = r.query_selector('[data-testid="review-body"]').inner_text()
        print(review_id, author, body)

The data-review-id value is the canonical review identifier you can use for deduplication and delta updates. Store it as a primary key from day one.

Rendering and Pagination Challenges

Airbnb loads reviews via GraphQL calls, and the review section does not appear in raw HTML responses. You need a headless browser or a tool that replays the underlying API. There are two practical approaches in 2026:

Headless browser (Playwright/Puppeteer): Accurate but slow. One listing with 50 reviews takes 8-15 seconds to fully render at 4G-equivalent bandwidth. Pagination requires clicking the “Show more reviews” button or intercepting the GraphQL call and replaying it with incremented cursors.

GraphQL endpoint replay: Faster and more scalable. Use browser devtools to capture the PdpReviews query, then replay it directly with requests or httpx. Paginate by incrementing the offset variable. This cuts render time to under 1 second per page but requires session cookies and is more brittle to schema changes.

For large-scale collection across thousands of listings, endpoint replay wins on cost and speed. For small-scale or one-off pulls, Playwright with [data-review-id] is simpler to maintain.
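The replay loop can be sketched as a generator that pages by offset. This is a sketch under assumptions: the `offset` variable name and the `"reviews"` response key are placeholders, since the real payload and response paths come from the PdpReviews query you capture in devtools. The `post` parameter is any callable that sends the payload and returns parsed JSON; in production it would wrap `httpx.post` with your captured endpoint, headers, and session cookies.

```python
def paginate_reviews(post, payload, page_size=24, max_pages=50):
    """Replay a captured GraphQL reviews query, paging by an offset variable.

    `post` takes the JSON payload dict and returns the parsed response dict.
    The "reviews" key and `offset` variable name are assumptions -- inspect
    the real PdpReviews response in devtools for the actual paths.
    """
    offset = 0
    for _ in range(max_pages):
        page = dict(payload, variables=dict(payload["variables"], offset=offset))
        batch = post(page).get("reviews", [])
        if not batch:
            break          # an empty page means we've walked past the last review
        yield from batch
        offset += page_size
```

Keeping the transport behind a callable also makes the pagination logic testable without hitting the network.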

Proxy Setup and Anti-Bot Avoidance

Airbnb runs Akamai Bot Manager and rate limits aggressively by IP. Without proxies, expect 403s or silent rate limiting after 10-20 requests from a single IP.

Residential proxies with sticky sessions are the standard choice here. Datacenter IPs are flagged immediately on Airbnb; mobile proxies work but are expensive for high-volume jobs. For a comparison of IP types across review-heavy targets, the guide on How Proxies Help Scrape Reviews at Scale: Yelp, Google, Trustpilot (2026) covers the tradeoffs in detail.

| Proxy type | Block rate (Airbnb) | Cost per GB | Best for |
| --- | --- | --- | --- |
| Datacenter | Very high | $0.50-1 | Not recommended |
| Residential rotating | Low | $3-8 | Bulk listing sweeps |
| Residential sticky | Very low | $5-10 | Session-bound scraping |
| Mobile (4G/5G) | Minimal | $15-30 | High-value targets only |

Sticky sessions matter here because Airbnb uses cookie-based session fingerprinting. Rotating your IP mid-session resets the fingerprint and triggers re-verification. Keep one IP for the full duration of a listing scrape, then rotate to a new one for the next listing.
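The one-session-per-listing pattern can be sketched as a small URL builder. Note the hedge: many residential providers encode the session in the proxy username (e.g. `user-session-<id>`), but the exact format is provider-specific, so treat this as a convention sketch rather than a universal API.

```python
import uuid

def sticky_proxy_url(host, port, user, password, session=None):
    """Build a sticky-session proxy URL for one listing's scrape.

    The username-embedded session id follows a common residential-provider
    convention; check your provider's docs for their actual format.
    """
    session = session or uuid.uuid4().hex[:8]
    return f"http://{user}-session-{session}:{password}@{host}:{port}"

# Reuse one URL for every request in a listing's scrape; build a fresh
# one (new session id) before moving on to the next listing.
```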

The proxy rotation logic that works for Airbnb reviews is similar to what you would use for local business data. The article on Scraping Google Maps Data with Proxies: Business Listings and Reviews (2026) covers the session management pattern in depth and is a useful reference if you are building a unified review pipeline across platforms.

Rate Limiting and Request Pacing

Airbnb’s rate limits are not published, but empirical testing in early 2026 suggests:

  • Safe pace: 1 listing per 4-8 seconds per session
  • Soft limit trigger: ~50 requests per hour from one IP
  • Hard block: typically at 80-120 requests per hour

Practical pacing rules:

  1. Randomize delay between requests using a uniform distribution (e.g., random.uniform(3, 7) seconds).
  2. Use one sticky proxy session per listing, not per request.
  3. Rotate user-agent strings from a realistic browser pool.
  4. Respect HTTP 429 responses by backing off for 60-120 seconds before retrying.
  5. Cap concurrent sessions at 5-10 to avoid subnet-level detection.
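Rules 1 and 4 can be folded into one request wrapper, sketched here. The `client` parameter is anything with a `.get(url)` method returning an object with a `.status_code` (for example an `httpx.Client` configured with your sticky proxy); the delay and backoff windows default to the values above.

```python
import random
import time

def paced_get(client, url, delay=(3, 7), backoff=(60, 120), max_retries=4):
    """GET with jittered pacing and 429 backoff, following the rules above."""
    for _ in range(max_retries):
        time.sleep(random.uniform(*delay))      # rule 1: randomized delay
        resp = client.get(url)
        if resp.status_code != 429:
            return resp                         # success, or a non-rate-limit error
        time.sleep(random.uniform(*backoff))    # rule 4: back off before retrying
    raise RuntimeError(f"rate-limited on {url} after {max_retries} attempts")
```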

If you are monitoring listing availability in addition to reviews, the proxy discipline is the same. The writeup on Do Proxies Help Daily Housing Listing Monitoring? Real-World Test has real latency and block-rate numbers from a sustained housing data pull that applies directly here.

Parsing and Storing Review Data

Once you are reliably pulling review cards, structure the output around the data-review-id as your canonical identifier. A minimal schema:

{
  "review_id": "1102847563982741504",  # from data-review-id
  "listing_id": "12345678",
  "author": "María G.",
  "date": "2026-04",
  "rating": 5,
  "body": "Absolutely loved the place...",
  "language": "en",
  "scraped_at": "2026-05-06T14:22:00Z"
}
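On the storage side, a minimal sketch using SQLite and the schema above: with review_id as the primary key, an upsert makes re-scrapes idempotent, so deduplication and delta updates fall out of the insert itself.

```python
import sqlite3

def upsert_reviews(conn, reviews):
    """Insert review dicts keyed by review_id; re-scraped rows update in place."""
    conn.execute("""CREATE TABLE IF NOT EXISTS reviews (
        review_id TEXT PRIMARY KEY, listing_id TEXT, author TEXT, date TEXT,
        rating INTEGER, body TEXT, language TEXT, scraped_at TEXT)""")
    conn.executemany("""INSERT INTO reviews VALUES
        (:review_id, :listing_id, :author, :date, :rating,
         :body, :language, :scraped_at)
        ON CONFLICT(review_id) DO UPDATE SET
            body = excluded.body, scraped_at = excluded.scraped_at""",
        reviews)
    conn.commit()
```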

A few extraction notes:

  • Ratings are rendered as SVG stars, not a numeric attribute. Count filled star elements or capture the aria-label text (e.g., “5 out of 5 stars”).
  • Review dates are relative strings like “3 weeks ago” on first load. If you need exact dates, the GraphQL response includes ISO timestamps; prefer the API replay approach for time-sensitive datasets.
  • Language detection matters if you are doing sentiment analysis. langdetect or lingua handles this well in a post-processing step.
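For the aria-label route, a small parser sketch. The "N out of 5" phrasing is an observed-markup assumption and may change or be localized, so the function returns None on a miss, letting callers fall back to counting filled star elements.

```python
import re

def rating_from_aria(label):
    """Extract a numeric rating from aria-label text such as '5 out of 5 stars'.

    Returns None when the pattern doesn't match (e.g. localized text),
    so the caller can fall back to counting star SVG elements.
    """
    m = re.search(r"(\d+(?:\.\d+)?)\s*out of\s*5", label, re.IGNORECASE)
    return float(m.group(1)) if m else None
```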

The same selector-anchoring strategy used here generalises across structured retail data. Scraping Google Shopping with sh-dgr__content Selector (2026 Guide) applies the same data-attribute anchoring pattern to product grids, and the two approaches can share the same proxy and session management layer.

For geographic spread in your dataset, residential proxies in the same country as the listing produce the lowest block rates. Airbnb serves localized content and applies stricter bot checks to cross-border traffic. If you are collecting UK listings from a US IP, expect more verification friction. The detailed breakdown of IP type by geography in Best Proxy Types for Scraping Google Maps and Local Pack (2026) maps out which proxy types win by region.

Bottom Line

Use [data-review-id] as your stable anchor, sticky residential proxies for session continuity, and GraphQL replay for anything over a few hundred listings. Playwright with data-attribute selectors is the right choice for smaller jobs where maintainability matters more than throughput. DRT covers this category of scrape-target infrastructure regularly, including selector stability, proxy pairing, and anti-bot patterns across review platforms.
