How to Scrape Skyscanner Flight Data at Scale (2026)

Skyscanner blocks more scrapers than almost any other travel site, and if you don’t know its 2026 anti-bot stack before you start, you’ll burn hours on dead sessions. Scraping Skyscanner flight data at scale is genuinely valuable: the platform aggregates fares across 1,200+ airlines, exposes multi-city fare calendars, and covers routes that Google Flights quietly omits. Price intelligence teams, travel startups, and affiliate researchers all need this data, and there’s no official bulk API for it.

Why Skyscanner Data Is Worth the Effort

Skyscanner’s internal search API returns structured JSON with fields that competitors don’t surface cleanly: cabin class breakdowns, stopover durations, fare basis codes, and the “cheapest month” calendar view. For route pricing models, that calendar endpoint alone is worth the engineering cost. Fare data refreshes every 10-15 minutes during peak hours, so stale snapshots from cached sources are useless for real-time comparison engines.

If you’re also pulling hotel pricing, How to Scrape Expedia Hotel Inventory in 2026 covers a comparable session-heavy target with a similar Akamai setup.

Skyscanner’s Anti-Bot Stack in 2026

Skyscanner runs Akamai Bot Manager v3 with sensor-data collection baked into the page JavaScript. Every search request carries a _abck cookie and a sensor blob that encodes mouse movement, canvas fingerprint, timezone, and WebGL renderer hash. Requests without a valid sensor value return HTTP 200 with an empty itineraries array, not a 403, so naive scripts fail silently.
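Because a rejected request still comes back as HTTP 200, the quickest sanity check is on the payload rather than the status code. A minimal sketch, assuming the content.results.itineraries shape used in the poll code later in this guide:

def looks_blocked(poll_json: dict) -> bool:
    # Akamai-rejected sessions return HTTP 200 with an empty itineraries map,
    # so inspect the payload instead of the status code. A route with genuinely
    # no availability will also trip this, so treat it as a signal to rotate
    # the session, not proof of a block.
    itineraries = poll_json.get("content", {}).get("results", {}).get("itineraries", {})
    return poll_json.get("status") == "RESULT_STATUS_COMPLETE" and not itineraries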

Rate limits kick in around 80-100 requests per IP per hour on residential ranges; datacenter IPs get blocked within 5-10 requests. The mobile app endpoint (flights.skyscanner.net/api/v3/flights/live/search) is slightly less hardened than the desktop path but still requires a valid x-skyscanner-deviceid header that rotates per session.
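The device id itself is just a UUID, but it needs to be fresh per session rather than hard-coded. A small sketch of per-session header construction (the rest of the header set appears further down):

import uuid

def new_session_headers(base: dict) -> dict:
    # Regenerate x-skyscanner-deviceid for every scraping session; reusing one
    # value across sessions is an easy correlation signal.
    return {**base, "x-skyscanner-deviceid": str(uuid.uuid4())}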

JS fingerprinting checks navigator.webdriver, chrome.runtime, and the presence of Playwright/Puppeteer artifacts. Stealth patches (playwright-extra’s stealth plugin) still work in May 2026, but Akamai updates its detection signatures monthly, so plan for ongoing maintenance.
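If you roll your own patching instead of leaning on a maintained plugin, the core idea is masking automation artifacts before any page script runs. A minimal Playwright-for-Python sketch covering only navigator.webdriver; a real deployment needs a proper stealth plugin, because Akamai checks far more than this one property:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)  # headless mode itself is a detectable signal
    context = browser.new_context()
    # Patch navigator.webdriver before any page script runs; a stealth plugin
    # covers dozens of similar artifacts (chrome.runtime, plugins, permissions,
    # WebGL vendor strings) that this sketch leaves untouched.
    context.add_init_script(
        "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
    )
    page = context.new_page()
    page.goto("https://www.skyscanner.net/")
    print(page.evaluate("navigator.webdriver"))  # should now evaluate to undefined/None
    browser.close()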

API vs HTML Scraping: Which Approach to Use

| Approach | Success Rate (cold) | Maintenance Load | Cost |
|---|---|---|---|
| Playwright + stealth proxy | 70-85% | High (monthly patches) | $$$ |
| Internal JSON API (XHR) | 85-95% with valid headers | Medium | $$ |
| Third-party scraping API (Oxylabs, Scrapfly) | 90-95%+ | Low | $$$$ |
| Puppeteer-cluster (no stealth) | 10-20% | Low (just broken) | $ |

The internal XHR approach hits the flights/live/search endpoint directly. Intercepting these with browser devtools, then replaying them with valid session cookies and headers, is the most cost-effective path for teams that can absorb a week of setup. For route data at high volume, a managed scraping API saves more money than it costs once you factor in proxy spend and engineer hours. Similar tradeoffs apply when you’re scraping Kayak flight and hotel data, which also sits behind Akamai.
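A sketch of the session-start call under those assumptions: the create path is inferred from the poll path shown later, and the payload field names mirror what a typical devtools capture looks like, so treat both as placeholders to be replaced with whatever your own capture shows.

import httpx

def start_search(origin: str, destination: str, date: str, cookies: dict, headers: dict) -> str:
    # Placeholder path: confirm the real create URL from your devtools capture.
    url = "https://www.skyscanner.net/api/v3/flights/live/search/create"
    payload = {  # field names are illustrative, not a documented schema
        "query": {
            "market": "UK",
            "locale": "en-GB",
            "currency": "GBP",
            "queryLegs": [
                {"originPlaceId": origin, "destinationPlaceId": destination, "date": date}
            ],
            "adults": 1,
            "cabinClass": "CABIN_CLASS_ECONOMY",
        }
    }
    r = httpx.post(url, json=payload, headers=headers, cookies=cookies, timeout=15)
    r.raise_for_status()
    # The session id comes back in the response body; the exact key is an assumption.
    return r.json()["sessionToken"]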

Proxy Strategy and Geo Requirements

Residential proxies are non-negotiable for Skyscanner at scale. Datacenter subnets are blocklisted aggressively, including most AWS, GCP, and Azure CIDR ranges. You need ISP-level residential IPs, ideally in the country matching the search origin, because Skyscanner returns different fare sets by geo. A UK IP querying London-Bangkok will return BA and Thai Airways fares that a US IP won’t see.

Rotate IPs per session, not per request. Skyscanner tracks session consistency, and mid-session IP switches trigger re-validation. Sticky sessions of 5-10 minutes with a fresh IP per new search thread work better than per-request rotation.
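One way to express that pattern is a helper that mints a sticky, geo-matched client per search thread. The username syntax for pinning country and session is provider-specific, so the format below is a placeholder rather than any vendor's documented API.

import uuid
import httpx

def sticky_client(country: str = "gb") -> httpx.Client:
    # One proxy session per search thread. Country and session id are encoded
    # in the proxy username here purely as an illustration of the sticky-session
    # pattern; check your provider's docs for the real syntax.
    session_id = uuid.uuid4().hex[:8]
    proxy = f"http://user-country-{country}-session-{session_id}:PASSWORD@proxy.example.com:8000"
    return httpx.Client(proxy=proxy, timeout=15)  # httpx >= 0.26; older releases use proxies=

Keep the same client for a search's entire create-and-poll cycle, then close it and mint a new one for the next thread.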

For Asia-Pacific route data, Singapore and Hong Kong residential pools perform well. How to Scrape Trip.com (Asia Inventory) at Scale covers geo-specific proxy considerations for that region in detail.

Parsing the Response and Capturing the Right Fields

The live search endpoint uses a polling model: POST to start a session, then GET /api/v3/flights/live/search/poll/{session_id} until status: "RESULT_STATUS_COMPLETE". Each poll adds itineraries to the response.

import httpx, time

HEADERS = {
    "x-skyscanner-deviceid": "your-uuid-v4-here",
    "x-skyscanner-market": "UK",
    "x-skyscanner-currency": "GBP",
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
    "referer": "https://www.skyscanner.net/",
}

def poll_search(session_id: str, cookies: dict) -> list:
    url = f"https://www.skyscanner.net/api/v3/flights/live/search/poll/{session_id}"
    itineraries = {}  # keyed by itinerary id so entries repeated across polls aren't duplicated
    for _ in range(12):  # 12 polls x 5s sleep = ~60s ceiling
        r = httpx.get(url, headers=HEADERS, cookies=cookies, timeout=10)
        r.raise_for_status()
        data = r.json()
        itineraries.update(data.get("content", {}).get("results", {}).get("itineraries", {}))
        if data.get("status") == "RESULT_STATUS_COMPLETE":
            break
        time.sleep(5)
    return list(itineraries.values())

Fields worth capturing from each itinerary: price.raw, legs[].stopCount, legs[].durationInMinutes, legs[].carriers.marketing[].name, and pricingOptions[].items[].deepLink for the booking URL. Store raw JSON alongside parsed rows so schema changes don’t force a re-scrape. If you want a reference for how structured scraping pipelines handle pagination and session state at scale, the How to Scrape Wikipedia Data at Scale guide walks through a clean polling-and-cursor pattern that maps well here.
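A sketch of flattening one itinerary into a storable row, using the field paths above (the exact nesting is an assumption, which is one more reason to keep the raw JSON alongside it):

def flatten_itinerary(it: dict) -> dict:
    # Field paths follow the list above; the nesting can change between
    # Skyscanner releases, so the raw blob is stored in the same row.
    legs = it.get("legs", [])
    first_option = (it.get("pricingOptions") or [{}])[0]
    first_item = (first_option.get("items") or [{}])[0]
    return {
        "price_raw": it.get("price", {}).get("raw"),
        "stop_counts": [leg.get("stopCount") for leg in legs],
        "durations_min": [leg.get("durationInMinutes") for leg in legs],
        "carriers": [
            c.get("name")
            for leg in legs
            for c in leg.get("carriers", {}).get("marketing", [])
        ],
        "deep_link": first_item.get("deepLink"),
        "raw": it,
    }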

Scaling Considerations

Run concurrent workers with a semaphore cap of 5-10 threads per proxy pool to stay inside rate limits. Use a task queue (Celery, RQ, or a PostgreSQL job table) rather than async fire-and-forget so failed polls retry cleanly.
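A minimal sketch of the semaphore cap with a thread pool; scrape_route is a placeholder for the full bootstrap/start/poll/store cycle, and in production you would drive it from the task queue rather than submitting ad hoc:

import threading
from concurrent.futures import ThreadPoolExecutor

# Cap concurrent Skyscanner searches per proxy pool; a ThreadPoolExecutor alone
# would happily run every queued route at once.
pool_slots = threading.Semaphore(8)

def scrape_route(origin: str, destination: str, date: str) -> None:
    # Placeholder for the full cycle: bootstrap session, start search, poll, store.
    ...

def run_route(origin: str, destination: str, date: str) -> None:
    with pool_slots:  # blocks until a slot in the proxy pool frees up
        scrape_route(origin, destination, date)

routes = [("LHR", "BKK", "2026-09-14"), ("LHR", "JFK", "2026-09-14")]
with ThreadPoolExecutor(max_workers=16) as ex:
    futures = [ex.submit(run_route, *route) for route in routes]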

Key fields to index and monitor:

  • origin + destination + departure_date + scraped_at (composite index for time-series queries)
  • price.raw delta vs prior scrape (flag >15% swings for alerting)
  • status from each poll response (distinguish timeout vs complete vs error)
  • legs[].carriers for carrier coverage auditing

Storage-wise, a single route-date pair returns 50-200 itineraries at ~2KB each. For 500 routes checked daily, that’s roughly 50-200MB/day uncompressed, manageable in Postgres with JSONB columns or Parquet files on S3.

Numbered steps for a reliable scrape run:

  1. Spin up a fresh browser session with stealth plugin and residential proxy
  2. Load the Skyscanner homepage to acquire the _abck cookie and session tokens (see the bootstrap sketch after this list)
  3. Submit the flight search via the page UI or replicated XHR call
  4. Capture the session_id from the POST response
  5. Poll the results endpoint every 5 seconds until status is complete or 12 polls elapse
  6. Parse and store itineraries, then close the session and release the proxy
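A condensed sketch of steps 1-4, reusing the hypothetical start_search helper from earlier and the poll_search function above; stealth patching and error handling are omitted for brevity:

from playwright.sync_api import sync_playwright

def bootstrap_session(proxy_server: str) -> dict:
    # Steps 1-2: fresh browser behind a residential proxy, homepage visit to
    # earn the Akamai _abck cookie. Add the init-script patch from the
    # fingerprinting section (or a stealth plugin) in real runs.
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False, proxy={"server": proxy_server})
        context = browser.new_context(locale="en-GB", timezone_id="Europe/London")
        page = context.new_page()
        page.goto("https://www.skyscanner.net/", wait_until="networkidle")
        cookies = {c["name"]: c["value"] for c in context.cookies()}
        browser.close()
    return cookies

# Steps 3-5: replay the XHR flow with the harvested cookies.
cookies = bootstrap_session("http://proxy.example.com:8000")
session_id = start_search("LHR", "BKK", "2026-09-14", cookies, HEADERS)
itineraries = poll_search(session_id, cookies)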

Fare calendar data (the cheapest-day-per-month view) is a separate, lighter endpoint worth scraping independently. It’s a single GET with no polling, and the payload is small. Hotels.com pricing across markets uses a similar calendar pattern if you want a comparable implementation reference.
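Since the calendar endpoint isn't reproduced here, the sketch below simply takes whatever GET URL you capture from devtools and reuses the HEADERS dict and httpx import from the poll example; the point is that there is no polling loop to manage.

def fetch_fare_calendar(calendar_url: str, cookies: dict) -> dict:
    # calendar_url is the cheapest-month GET endpoint captured from devtools;
    # unlike live search, one request returns the whole (small) payload.
    r = httpx.get(calendar_url, headers=HEADERS, cookies=cookies, timeout=10)
    r.raise_for_status()
    return r.json()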

Bottom line

Skyscanner scraping in 2026 is doable but requires residential proxies, stealth browser automation or valid session headers, and tolerance for monthly maintenance as Akamai signatures update. The internal polling API is your best lever once you have a valid session established. Start with the XHR replay approach before committing to a managed service. dataresearchtools.com covers the full travel scraping stack, including Skyscanner, Kayak, Expedia, and regional OTAs, with guides updated as anti-bot defenses evolve.
