Monitoring fares across 100 travel routes is a side project. Monitoring 10,000 routes is an engineering challenge that demands purpose-built infrastructure, intelligent proxy pool management, and a database architecture that can handle millions of price records without degrading query performance. This guide covers the practical steps to scale a travel fare monitoring operation from hobby-scale to production-grade, drawing on the same principles used by commercial fare aggregators.
Understanding the Scale Challenge
The difference between 100 routes and 10,000 routes is not just “do it 100 times more.” Scaling introduces problems that do not exist at smaller volumes:
| Challenge | At 100 Routes | At 10,000 Routes |
|---|---|---|
| Daily scrape requests | 500-1,000 | 50,000-100,000+ |
| Proxy bandwidth (monthly) | 15-30 GB | 1.5-3 TB |
| Database records per year | ~200K | ~20M+ |
| Scrape execution time (sequential) | 2-4 hours | 8-20 days (impossible) |
| Proxy cost (monthly) | $200-$400 | $3,000-$15,000 |
| Failure recovery | Manual review | Must be fully automated |
| Server infrastructure | Single machine | Distributed system |
Sequential scraping is not viable at scale. If each scrape takes 5 seconds (including page load, rendering, and data extraction), 100,000 daily requests run back-to-back would take 100,000 × 5 s = 500,000 seconds, or nearly 6 days. You need distributed, concurrent execution. For the foundational principles of scaling price monitoring operations, see our detailed guide on scaling price monitoring to 100K products.
Distributed Scraping Architecture
Worker-Queue Architecture
The most proven architecture for large-scale scraping uses a job queue with multiple workers; a minimal worker loop is sketched after the component list:
- Scheduler: Generates scrape jobs based on route priority, schedule, and last-scrape timestamp. Places jobs in a message queue.
- Message Queue: Holds pending scrape jobs. Redis, RabbitMQ, or AWS SQS all work. The queue decouples job creation from execution.
- Worker Pool: Multiple worker processes (or containers) pull jobs from the queue, execute the scrape, and write results to the database. Workers can scale horizontally.
- Result Processor: Validates scraped data, performs normalization, updates caches, and triggers alerts.
- Monitor: Tracks worker health, queue depth, success rates, and proxy performance. Triggers alerts when metrics degrade.
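A minimal sketch of the worker side of this architecture, assuming Redis lists as the queue and results channel; the queue keys, the job shape, and the `scrape_route` placeholder are all illustrative, not a prescribed interface:

```python
import json

import redis  # pip install redis

QUEUE = "fare:jobs"        # hypothetical key for pending scrape jobs
RESULTS = "fare:results"   # hypothetical key consumed by the result processor

r = redis.Redis(host="localhost", port=6379)

def scrape_route(job: dict) -> dict:
    """Placeholder for the actual scrape (Scrapy or Playwright call)."""
    raise NotImplementedError

def worker_loop() -> None:
    while True:
        # BLPOP blocks until a job arrives; the timeout lets the loop
        # periodically check for shutdown signals
        item = r.blpop(QUEUE, timeout=5)
        if item is None:
            continue
        job = json.loads(item[1])
        try:
            result = scrape_route(job)
            r.rpush(RESULTS, json.dumps(result))
        except Exception:
            # Failed jobs re-enter the queue with a retry counter so they
            # can be dropped (and alerted on) after repeated failures
            job["retries"] = job.get("retries", 0) + 1
            if job["retries"] < 3:
                r.rpush(QUEUE, json.dumps(job))

if __name__ == "__main__":
    worker_loop()
```

Because workers hold no state between jobs, scaling horizontally means simply starting more copies of this process.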
Sizing Your Worker Pool
The number of workers you need depends on your target throughput and per-scrape latency; the sizing arithmetic is sketched after the table:
| Daily Scrape Target | Avg. Scrape Time | Scraping Window | Workers Needed |
|---|---|---|---|
| 10,000 | 5 seconds | 12 hours | 2-3 |
| 50,000 | 5 seconds | 12 hours | 6-8 |
| 100,000 | 5 seconds | 12 hours | 12-15 |
| 100,000 | 10 seconds | 12 hours | 24-30 |
| 500,000 | 5 seconds | 12 hours | 60-75 |
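The arithmetic behind these rows is simple enough to encode, which makes it easy to re-run as your measured per-scrape times drift; this is a direct translation of the table's assumptions (12-hour window, 30% retry buffer):

```python
import math

def workers_needed(daily_scrapes: int, avg_scrape_seconds: float,
                   window_hours: float = 12, buffer: float = 0.30) -> int:
    """Workers = total scrape-seconds / window-seconds, padded for retries."""
    total_work = daily_scrapes * avg_scrape_seconds
    window_seconds = window_hours * 3600
    return math.ceil(total_work / window_seconds * (1 + buffer))

# 100,000 scrapes at 5 s each in a 12-hour window -> 16,
# consistent with the upper end of the table's 12-15 estimate
print(workers_needed(100_000, 5))
```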
These estimates include a 30% buffer for retries and failures. In practice, per-scrape time varies by target site: simple API-based scrapes (FlixBus) might take 2 seconds, while heavily defended sites requiring headless browsers (Booking.com, Eurostar) might take 15-20 seconds.
Containerized Deployment
Workers should be deployed as containers (Docker) for several reasons:
- Horizontal scaling: Spin up more workers during peak scraping windows, scale down during quiet periods
- Isolation: Each worker has its own headless browser instance, preventing memory leaks in one worker from affecting others
- Reproducibility: Workers have identical environments, eliminating “works on my machine” issues
- Recovery: Crashed containers are automatically restarted by the orchestrator (Kubernetes, Docker Swarm, ECS)
A typical worker container includes the scraping framework (Scrapy, Playwright), proxy client configuration, and result reporting. Memory allocation depends on whether you use headless browsers (512MB-1GB per container) or HTTP-only scraping (128-256MB per container). For server setup considerations specific to large-scale bot operations, see our guide on server setup for bot operations, which covers many overlapping infrastructure decisions.
Proxy Pool Management at Scale
Why a Single Proxy Provider Is Not Enough
At 10,000 routes, you are consuming enough proxy bandwidth that relying on a single provider creates critical risks:
- Provider downtime: Even a few hours of proxy provider downtime creates data gaps across thousands of routes
- IP pool exhaustion: High-volume usage burns through proxy IPs faster than providers can refresh them for your account
- Pricing leverage: A single provider knows you are locked in and has less incentive to offer competitive rates
- Geographic gaps: No single provider has the best coverage in every region
For detailed strategies on working with multiple proxy providers, see our guide on managing multiple proxy providers.
Multi-Provider Proxy Architecture
Build an abstraction layer between your scraping workers and your proxy providers (a routing sketch follows the table):
| Component | Function | Implementation |
|---|---|---|
| Proxy Router | Selects the best proxy for each request based on target site, geography, and provider performance | Custom middleware or commercial proxy manager |
| Performance Tracker | Records success rate, latency, and cost per proxy per target site | Time-series metrics (Prometheus, InfluxDB) |
| Budget Allocator | Distributes bandwidth budget across providers based on cost-effectiveness | Custom logic using performance data |
| Failover Logic | Automatically shifts traffic when a provider degrades | Circuit breaker pattern with automatic recovery |
| Cost Reporter | Tracks actual spend per provider per day/week/month | Dashboard pulling from provider APIs and internal logs |
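A sketch of the routing core with circuit-breaker failover, assuming hypothetical provider names and a hardcoded site-to-tier map; a production version would load both from configuration and weight selection by the performance tracker's success rates:

```python
import random
import time
from dataclasses import dataclass

@dataclass
class Provider:
    name: str
    tiers: set                # proxy tiers this provider can serve
    failures: int = 0
    open_until: float = 0.0   # circuit breaker: skip provider until this time

    def healthy(self) -> bool:
        return time.time() >= self.open_until

    def record_failure(self, threshold: int = 5, cooldown: float = 300) -> None:
        self.failures += 1
        if self.failures >= threshold:
            self.open_until = time.time() + cooldown  # open the circuit
            self.failures = 0

# Hypothetical tier map: which proxy class each target site needs
SITE_TIER = {"booking.com": "mobile",
             "expedia.com": "residential",
             "flixbus.com": "datacenter"}

PROVIDERS = [Provider("provider_a", {"mobile", "residential"}),
             Provider("provider_b", {"residential", "datacenter"})]

def pick_provider(site: str) -> Provider:
    tier = SITE_TIER.get(site, "residential")
    candidates = [p for p in PROVIDERS if tier in p.tiers and p.healthy()]
    if not candidates:
        raise RuntimeError(f"no healthy provider for tier {tier!r}")
    return random.choice(candidates)  # weight by success rate in production
```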
Proxy Allocation Strategy
Not all routes need the same proxy quality. Implement tiered allocation:
- Tier 1 (high-defense sites: Booking.com, Google, Eurostar): Use ISP or mobile proxies. These are the most expensive but necessary for reliable scraping.
- Tier 2 (moderate-defense sites: Expedia, Amtrak, Trainline): Rotating residential proxies provide a good balance of success rate and cost.
- Tier 3 (low-defense sites: FlixBus API, smaller operators): High-quality datacenter proxies may work. Test before committing.
At 10,000 routes, proxy cost optimization matters. A 10% improvement in proxy efficiency at this scale saves $300-$1,500 per month.
IP Rotation and Cooling
At scale, you will encounter the same target sites hundreds or thousands of times per day. Even with large proxy pools, IP reuse is inevitable. Implement IP lifecycle management (a tracker sketch follows this list):
- Per-site IP tracking: Record which IPs have been used for each target site and when
- Cooling periods: After using an IP for a site, wait a minimum time before reusing it (typically 30-60 minutes for aggressive sites)
- Ban detection: When an IP is blocked, mark it as “cooled” for that site for an extended period (2-24 hours)
- Cross-site independence: An IP banned on Booking.com can still be used for FlixBus. Track bans per-site, not globally.
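A minimal per-site cooling tracker, assuming Redis so the state is shared across all workers; key names and cooldown durations are illustrative, and Redis key expiry handles the cleanup for free:

```python
import redis

r = redis.Redis()

def mark_used(ip: str, site: str, cool_seconds: int = 1800) -> None:
    """Record IP use; the key expires when the 30-minute cooling period ends."""
    r.set(f"cool:{site}:{ip}", "used", ex=cool_seconds)

def mark_banned(ip: str, site: str, ban_seconds: int = 7200) -> None:
    """Longer cooldown after a detected block (2 hours here; tune per site)."""
    r.set(f"cool:{site}:{ip}", "banned", ex=ban_seconds)

def usable(ip: str, site: str) -> bool:
    """Bans are tracked per site: an IP cooling for Booking.com
    remains usable for FlixBus."""
    return r.get(f"cool:{site}:{ip}") is None
```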
Database Optimization for Time-Series Fare Data
Schema Design
At 10,000 routes with multiple daily scrapes, your database will accumulate millions of records per month. The schema must support both fast writes (ingesting scrape results) and fast reads (querying price history, generating alerts, serving comparison results).
A recommended schema approach (DDL sketched after the list):
- Routes table: Static data about monitored routes (origin, destination, operator, mode of transport). Updated infrequently.
- Fares table (time-series): The core table. Each record is one price observation: route_id, departure_date, fare_class, price, currency, availability, scrape_timestamp. This table grows continuously.
- Alerts table: Active price alerts with route, threshold, and notification preferences.
- Proxy metrics table: Performance data for proxy optimization. Separate from fare data to avoid slowing fare queries.
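A sketch of the two core tables as PostgreSQL DDL, run here through psycopg2; column names follow the record shape described above, while the types are reasonable defaults rather than a prescription:

```python
import psycopg2  # pip install psycopg2-binary

DDL = """
CREATE TABLE IF NOT EXISTS routes (
    route_id    SERIAL PRIMARY KEY,
    origin      TEXT NOT NULL,
    destination TEXT NOT NULL,
    operator    TEXT NOT NULL,
    mode        TEXT NOT NULL             -- air, rail, bus
);

CREATE TABLE IF NOT EXISTS fares (
    route_id         INT NOT NULL REFERENCES routes(route_id),
    departure_date   DATE NOT NULL,
    fare_class       TEXT NOT NULL,
    price            NUMERIC(10, 2) NOT NULL,
    currency         CHAR(3) NOT NULL,
    availability     TEXT,
    scrape_timestamp TIMESTAMPTZ NOT NULL
);
"""

with psycopg2.connect("dbname=fares") as conn:  # hypothetical DSN
    with conn.cursor() as cur:
        cur.execute(DDL)
```

Note that fares deliberately has no primary key: once it becomes a TimescaleDB hypertable (next section), any unique constraint would have to include the partition column anyway.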
PostgreSQL with TimescaleDB
For most teams, PostgreSQL with the TimescaleDB extension is the best balance of capability, performance, and operational simplicity for time-series fare data.
Key configuration for fare monitoring (the SQL equivalents are sketched after this list):
- Hypertable on fares table: Partition by scrape_timestamp with chunk intervals of 1 week (balances query performance with chunk management overhead)
- Compression policy: Compress chunks older than 2 weeks. Time-series compression in TimescaleDB achieves 90-95% compression on fare data, reducing storage costs dramatically.
- Retention policy: Keep detailed data for 12-24 months, then aggregate to daily min/max/median and drop the detail records.
- Continuous aggregates: Pre-compute daily and weekly price summaries as materialized views that update automatically. These power dashboards and trend analysis without hitting the raw data.
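These four items translate almost one-to-one into TimescaleDB SQL; a sketch, assuming the fares table above and TimescaleDB 2.x syntax (statements run individually because continuous aggregates cannot be created inside a transaction):

```python
import psycopg2

STATEMENTS = [
    # 1. Hypertable partitioned on scrape time, 1-week chunks
    """SELECT create_hypertable('fares', 'scrape_timestamp',
                                chunk_time_interval => INTERVAL '1 week')""",
    # 2. Compress chunks older than 2 weeks, segmented by route
    #    so per-route history reads stay fast
    """ALTER TABLE fares SET (timescaledb.compress,
                              timescaledb.compress_segmentby = 'route_id')""",
    """SELECT add_compression_policy('fares', INTERVAL '2 weeks')""",
    # 3. Drop raw chunks after 24 months (aggregate before they age out)
    """SELECT add_retention_policy('fares', INTERVAL '24 months')""",
    # 4. Continuous aggregate powering dashboards and trend queries
    """CREATE MATERIALIZED VIEW fares_daily
       WITH (timescaledb.continuous) AS
       SELECT route_id,
              time_bucket('1 day', scrape_timestamp) AS day,
              min(price) AS min_price,
              max(price) AS max_price
       FROM fares
       GROUP BY route_id, day""",
]

conn = psycopg2.connect("dbname=fares")  # hypothetical DSN
conn.autocommit = True  # required for continuous aggregate creation
with conn.cursor() as cur:
    for sql in STATEMENTS:
        cur.execute(sql)
```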
Indexing Strategy
Critical indexes for fare monitoring queries (sample DDL follows the table):
| Query Pattern | Required Index | Why |
|---|---|---|
| “Show price history for route X” | (route_id, scrape_timestamp DESC) | Most common query; needs fast range scan |
| “Find cheapest fare for destination Y on date Z” | (destination, departure_date, price) | Powers search results and alerts |
| “Which routes had price drops today?” | (scrape_timestamp, route_id) with partial index on recent data | Alert processing; only needs recent data |
| “Average price by operator for route X” | (route_id, operator, scrape_timestamp) | Competitive analysis queries |
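Two of these as concrete DDL against the fares table sketched earlier. One caveat the table glosses over: PostgreSQL disallows volatile functions like now() in partial-index predicates, so the "recent data" index needs a fixed cutoff that you refresh on a schedule:

```python
import psycopg2

INDEX_DDL = """
-- "Show price history for route X": fast range scan per route
CREATE INDEX IF NOT EXISTS idx_fares_route_time
    ON fares (route_id, scrape_timestamp DESC);

-- "Which routes had price drops today?": partial index over recent rows.
-- The cutoff must be a constant (now() is not allowed here), so recreate
-- this index periodically, e.g. from a nightly job.
CREATE INDEX IF NOT EXISTS idx_fares_recent
    ON fares (scrape_timestamp, route_id)
    WHERE scrape_timestamp > TIMESTAMPTZ '2025-01-01';
"""

with psycopg2.connect("dbname=fares") as conn:
    with conn.cursor() as cur:
        cur.execute(INDEX_DDL)
```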
Write Optimization
Ingesting 50,000-100,000 records per day requires attention to write performance (a COPY-based ingestion sketch follows this list):
- Batch inserts: Buffer scrape results and insert in batches of 500-1,000 records rather than one-at-a-time inserts
- Async writes: Workers push results to a queue; a dedicated ingestion process writes to the database. This decouples scrape speed from database write speed.
- COPY vs. INSERT: PostgreSQL COPY is 5-10x faster than INSERT for bulk loading. Use it for batch ingestion.
- Minimize indexes on the fares table: Every index slows writes. Only index what you actively query. Add indexes for new query patterns as needed, not preemptively.
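A sketch of the batch-and-COPY pattern with psycopg2, assuming results arrive as dicts from the results queue; the batch size and CSV staging are illustrative:

```python
import csv
import io

import psycopg2

COLUMNS = ("route_id", "departure_date", "fare_class", "price",
           "currency", "availability", "scrape_timestamp")

def flush_batch(conn, batch: list) -> None:
    """Bulk-load one batch of fare records via COPY instead of per-row INSERT."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for rec in batch:
        writer.writerow([rec[col] for col in COLUMNS])
    buf.seek(0)
    with conn.cursor() as cur:
        cur.copy_expert(
            f"COPY fares ({', '.join(COLUMNS)}) FROM STDIN WITH (FORMAT csv)",
            buf)
    conn.commit()

# Dedicated ingestion process: drain the results queue into a buffer and
# call flush_batch whenever it reaches ~1,000 records (or a time limit).
```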
Route Prioritization and Scheduling
Not All Routes Deserve Equal Attention
At 10,000 routes, you cannot (and should not) scrape every route at the same frequency. Implement priority-based scheduling:
| Priority Tier | Criteria | Scrape Frequency | Percentage of Routes |
|---|---|---|---|
| Critical | High traffic, high revenue, active price alerts | Every 2-4 hours | 5-10% |
| High | Popular routes, competitive markets | Every 6-12 hours | 15-25% |
| Standard | Moderate demand, stable pricing | Once daily | 40-50% |
| Low | Low demand, infrequent price changes | Every 2-3 days | 20-30% |
Priority should be dynamic. A route that suddenly shows price volatility should be automatically promoted to a higher scraping frequency. A route with no price changes for 2 weeks can be demoted.
Intelligent Scheduling
Beyond simple frequency tiers, optimize your scheduling with the following signals (a scheduling sketch follows the list):
- Price change detection: Routes where prices changed in the last scrape get scheduled for a follow-up scrape sooner
- Departure date proximity: Scrape more frequently as departure dates approach (prices change faster in the 2-week window before departure)
- Time-of-day optimization: Some travel sites update prices at specific times. Identify these patterns and schedule scrapes after price updates.
- Load spreading: Distribute scrapes evenly across the day rather than running everything at midnight. This keeps database write load steady and proxy usage smooth.
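A sketch of the next-scrape-time computation combining these signals with the tier table above; the halving factors and ±10% jitter are starting points to tune, not measured constants:

```python
import random
from datetime import datetime, timedelta

# Base intervals per priority tier (midpoints of the earlier table)
TIER_INTERVAL = {
    "critical": timedelta(hours=3),
    "high":     timedelta(hours=9),
    "standard": timedelta(hours=24),
    "low":      timedelta(days=2),
}

def next_scrape(tier: str, departure_date: datetime,
                price_changed_last_scrape: bool) -> datetime:
    interval = TIER_INTERVAL[tier]
    # Departure proximity: prices move fastest in the final 2 weeks
    if departure_date - datetime.utcnow() < timedelta(days=14):
        interval /= 2
    # Price change detection: follow up sooner after observed movement
    if price_changed_last_scrape:
        interval /= 2
    # Load spreading: jitter so scrapes don't cluster at round hours
    jitter_s = interval.total_seconds() * random.uniform(-0.1, 0.1)
    return datetime.utcnow() + interval + timedelta(seconds=jitter_s)
```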
Cost Management
Breaking Down Operating Costs
At scale, understanding and optimizing costs becomes essential for sustainability:
| Cost Category | Small (100 routes) | Medium (1,000 routes) | Large (10,000 routes) |
|---|---|---|---|
| Proxy bandwidth | $200-$400/mo | $800-$1,500/mo | $3,000-$15,000/mo |
| Server/compute | $20-$50/mo | $100-$300/mo | $500-$2,000/mo |
| Database storage | $10-$20/mo | $50-$100/mo | $200-$500/mo |
| Monitoring/alerting | Free tier | $20-$50/mo | $100-$300/mo |
| Engineering time | Part-time | Half-time | 1-2 full-time engineers |
| Total | $250-$500/mo | $1,000-$2,000/mo | $4,000-$18,000/mo |
Cost Optimization Strategies
- API-first scraping: Whenever possible, scrape API endpoints rather than rendering full pages. This reduces bandwidth by 80-95% and uses less compute.
- Conditional scraping: Check whether a page has changed before fully processing it. HTTP ETag and Last-Modified headers can short-circuit unchanged pages (see the sketch after this list).
- Proxy tier matching: Use the cheapest proxy that works for each target. Do not use mobile proxies for sites that work fine with rotating residential.
- Shared proxy pools: If you are monitoring multiple types of data (flights, hotels, ground transport), share proxy pools where target sites overlap.
- Spot instances: Run workers on cloud spot/preemptible instances for 60-80% compute savings. Workers are inherently fault-tolerant (failed jobs re-enter the queue).
- Right-size scraping frequency: Audit which routes genuinely need frequent scraping. Over-scraping is the single biggest source of wasted cost.
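For conditional scraping specifically, here is a minimal requests-based sketch. Whether a given travel site actually honors ETag/If-None-Match varies, so treat the 304 path as an optimization to verify per target, not a guarantee; the in-memory cache stands in for shared state such as Redis:

```python
import requests

etag_cache = {}  # url -> last seen ETag (use Redis or similar in production)

def fetch_if_changed(url: str):
    """Return the page body, or None if the server reports it unchanged (304)."""
    headers = {}
    if url in etag_cache:
        headers["If-None-Match"] = etag_cache[url]
    resp = requests.get(url, headers=headers, timeout=15)
    if resp.status_code == 304:
        return None  # unchanged: skip parsing, save bandwidth and compute
    etag = resp.headers.get("ETag")
    if etag:
        etag_cache[url] = etag
    return resp.content
```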
Monitoring and Alerting
Key Metrics to Track
At 10,000 routes, manual oversight is impossible. Automated monitoring must cover the following; an instrumentation sketch follows the list:
- Scrape success rate (per site, per hour): Alert when it drops below 80% for any site. A sudden drop usually means the site changed its anti-bot rules or layout.
- Queue depth: If jobs are accumulating faster than workers process them, you need more workers or your scrapes are getting slower.
- Data freshness (per route tier): Track the age of the most recent scrape per route. Critical routes with data older than their target freshness need attention.
- Proxy cost per successful scrape: This is your core efficiency metric. If it spikes, investigate whether proxies are failing more often or bandwidth per scrape has increased.
- Database write latency: Increasing write latency indicates index bloat, insufficient buffer pool, or storage bottlenecks.
- Price anomaly rate: Track how often scraped prices fail validation checks (negative prices, prices 10x above historical average). A spike in anomalies usually means a site layout change broke your parser.
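A sketch of worker-side instrumentation for the first two metrics using prometheus_client; the metric names and port are illustrative, and the 80% threshold itself belongs in your alerting rules rather than in code:

```python
from prometheus_client import Counter, Gauge, start_http_server

SCRAPES = Counter("scrapes_total", "Scrape attempts", ["site", "outcome"])
QUEUE_DEPTH = Gauge("scrape_queue_depth", "Pending jobs in the queue")

def record_scrape(site: str, success: bool) -> None:
    outcome = "success" if success else "failure"
    SCRAPES.labels(site=site, outcome=outcome).inc()

# The scheduler can refresh queue depth on each tick, e.g.:
#   QUEUE_DEPTH.set(redis_client.llen("fare:jobs"))

# Expose /metrics for Prometheus to scrape; alert rules then fire when
# per-site success rate over the last hour drops below 80%
start_http_server(9100)
```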
Incident Response
Common incidents at scale and their response playbooks:
| Incident | Detection | Response |
|---|---|---|
| Target site layout change | Parse errors spike; anomaly rate increases | Pause scraping for affected site; update parser; replay failed jobs |
| Proxy provider outage | Success rate drops for all sites simultaneously | Failover to backup provider; resume when primary recovers |
| Database write bottleneck | Queue depth grows; write latency increases | Increase batch size; check for lock contention; scale database |
| Worker crash loop | Worker restart count spikes | Check logs for OOM or browser crashes; increase container memory |
| Anti-bot escalation | Success rate drops for one site; other sites unaffected | Switch to higher-tier proxies; reduce request rate; update fingerprints |
Migration Path: 100 to 10,000 Routes
Phase 1: Solid Foundation (100-500 Routes)
- Single server, single proxy provider
- PostgreSQL without TimescaleDB (standard partitioning is sufficient)
- Cron-based scheduling
- Manual monitoring with basic alerting
- Focus on getting your parsers and normalization right
Phase 2: Early Scale (500-2,000 Routes)
- Add a second proxy provider
- Implement job queue (Redis Queue or RabbitMQ)
- Move to containerized workers (2-5 containers)
- Add TimescaleDB for time-series optimization
- Implement priority-based scheduling
- Build basic monitoring dashboard
Phase 3: Production Scale (2,000-10,000 Routes)
- 3+ proxy providers with automated failover
- Kubernetes or ECS for worker orchestration (10-30 containers)
- Dedicated database server with read replicas
- Comprehensive monitoring with PagerDuty or similar alerting
- Automated parser testing (catch layout changes before they affect production)
- Cost optimization becomes a regular operational activity
Phase 4: Large Scale (10,000+ Routes)
- Multi-region deployment for geographic proxy diversity and fault tolerance
- Database sharding or migration to a purpose-built time-series database
- Machine learning for anomaly detection and schedule optimization
- Dedicated SRE (Site Reliability Engineering) for the scraping infrastructure
- Formal vendor management for proxy providers
FAQ
What is the biggest bottleneck when scaling from 100 to 10,000 routes?
For most teams, the bottleneck is proxy management, not compute or database. At 100 routes, you can use a single proxy provider and not worry much about optimization. At 10,000 routes, proxy costs dominate your budget, and proxy failures are the primary cause of data gaps. Investing in a proper proxy abstraction layer with multi-provider support and intelligent routing pays for itself quickly through better success rates and lower per-scrape costs.
Should I build my own scraping infrastructure or use a commercial scraping API?
At small scale (under 500 routes), commercial scraping APIs like ScraperAPI, Zyte, or Bright Data’s Web Scraper IDE can be cost-effective and save development time. At 10,000 routes, the economics usually favor custom infrastructure. Commercial APIs charge per request, and at 50,000-100,000 daily requests, the monthly cost can exceed $5,000-$10,000. Custom infrastructure requires more engineering investment but gives you lower marginal costs, more control over data quality, and faster adaptation when target sites change.
How do I handle different target sites updating prices at different frequencies?
Track the price change frequency for each site empirically. After a few weeks of daily scraping, you will have data showing how often each site actually changes prices for each route. Some airlines update prices every few hours, while some bus operators only change prices daily or weekly. Use this data to set per-site scraping frequencies. There is no value in scraping a site 4 times per day if it only updates prices once. Adaptive scheduling based on observed change rates is one of the most effective cost optimizations at scale.
What happens when a target site completely blocks my scraping operation?
This is inevitable at scale. Your response should be layered: first, rotate to a different proxy provider. Second, review and update your request fingerprints (headers, TLS signature, browser profile). Third, reduce your request rate for that site. Fourth, try a different scraping approach (switch from headless browser to API interception or vice versa). If the site is genuinely blocked and you have exhausted technical options, consider whether you can get the same data from an aggregator site that is easier to scrape. Always maintain at least one backup data source for critical routes.
How much engineering time does maintaining a 10,000-route monitoring system require?
Once operational, plan for 1-2 full-time engineers dedicated to the system. Approximately 40% of their time goes to parser maintenance (target sites change layouts and anti-bot systems regularly), 25% to infrastructure operations (scaling, database maintenance, provider management), 20% to feature development (new sites, new data points, better alerting), and 15% to cost optimization and performance tuning. The common mistake is building the system and assuming it runs itself. Travel sites actively evolve their defenses, and a monitoring system without ongoing engineering attention degrades within weeks.