How to Scale Your Price Monitoring from 100 to 100,000 Products (2026)

Monitoring 100 products is a weekend project. Monitoring 100,000 is a full-scale engineering challenge. The techniques that work perfectly for a small scraping operation — a single script, one proxy provider, a local SQLite database — collapse entirely when you try to scale them by three orders of magnitude. At 100,000 products with daily price checks across 20 competitors, you are looking at 2 million requests per day. Factor in retries, variant-level tracking, and multiple daily checks, and that number easily climbs to 5–10 million. This guide covers the infrastructure, proxy management, database architecture, and operational practices you need to scale your price intelligence operation from a hobby project to an enterprise-grade system.

Understanding the Scale Challenge

Before diving into solutions, let us quantify what scaling to 100,000 products actually means:

| Metric | 100 Products | 10,000 Products | 100,000 Products |
| --- | --- | --- | --- |
| Daily requests (1 check, 5 competitors) | 500 | 50,000 | 500,000 |
| Daily requests (4 checks, 10 competitors) | 4,000 | 400,000 | 4,000,000 |
| Daily data rows generated | 500–4,000 | 50K–400K | 500K–4M |
| Monthly data storage (compressed) | ~50 MB | ~5 GB | ~50 GB |
| Yearly data storage | ~600 MB | ~60 GB | ~600 GB |
| Proxy bandwidth per day | ~250 MB | ~25 GB | ~250 GB |
| Required proxy pool size | 5–10 IPs | 100–500 IPs | 1,000–10,000 IPs |
| Compute resources | 1 laptop/VPS | 1–2 servers | 5–20 servers or cloud instances |

At 100,000 products, every inefficiency in your system is multiplied until it hurts. At 4 million requests per day, a parser that wastes 100ms per page costs you roughly 111 extra hours of compute daily, and a proxy with a 5% failure rate means 200,000 wasted requests per day. Scaling is not just about making things bigger; it is about making everything more efficient.

Distributed Scraping Architecture

Moving Beyond Single-Machine Scraping

A single machine can handle roughly 10,000–30,000 requests per day depending on page complexity, proxy latency, and rate limiting requirements. Beyond that, you need a distributed architecture. The standard approach uses a task queue pattern:

  • Scheduler: Generates scrape tasks based on product priority and timing rules
  • Task queue: Holds pending tasks (Redis, RabbitMQ, or AWS SQS)
  • Worker pool: Multiple worker processes across multiple machines that pull tasks from the queue, execute scrapes, and push results
  • Result store: Collects and processes completed scrape results
  • Database: Stores final structured pricing data

Task Queue Architecture

A task queue decouples the “what to scrape” from the “how to scrape,” allowing each component to scale independently:

# Simplified task queue architecture with Celery and Redis

# tasks.py
from celery import Celery
import requests
from bs4 import BeautifulSoup  # used inside parse_product()

# Helper functions referenced below (get_headers, parse_product,
# get_products_to_scrape, get_proxy_config) are assumed to be defined
# elsewhere in your codebase.
app = Celery("price_scraper", broker="redis://localhost:6379/0")

@app.task(bind=True, max_retries=3, default_retry_delay=60)
def scrape_product(self, url, retailer, proxy_config):
    """Scrape a single product page."""
    try:
        proxies = {
            "http": proxy_config["gateway"],
            "https": proxy_config["gateway"],
        }
        response = requests.get(url, proxies=proxies, timeout=20, headers=get_headers())

        if response.status_code == 200:
            data = parse_product(response.text, retailer)
            store_result.delay(url, retailer, data)
            return {"status": "success", "url": url}

        elif response.status_code in (403, 429, 503):
            # Block signals: back off and retry, ideally via a different proxy
            raise self.retry(exc=Exception(f"Blocked: {response.status_code}"))

        # Any other status code (404, 500, ...) is recorded as a plain failure
        return {"status": "failed", "code": response.status_code, "url": url}

    except requests.exceptions.RequestException as exc:
        raise self.retry(exc=exc)

@app.task
def store_result(url, retailer, data):
    """Store a scraped result in the database."""
    # Insert into PostgreSQL
    pass

# scheduler.py
def schedule_daily_scrapes():
    """Generate scrape tasks for all products."""
    products = get_products_to_scrape()  # From your product database
    for product in products:
        for competitor_url in product["competitor_urls"]:
            scrape_product.delay(
                url=competitor_url,
                retailer=product["retailer"],
                proxy_config=get_proxy_config(product["retailer"]),
            )

Worker Scaling Strategies

How you scale your worker pool depends on your infrastructure choice:

| Infrastructure | Scaling Approach | Pros | Cons |
| --- | --- | --- | --- |
| Dedicated servers | Add more machines, run multiple workers per machine | Predictable performance, low latency | Manual scaling, fixed cost |
| Cloud VMs (AWS EC2, GCP Compute) | Auto-scaling groups based on queue depth | Elastic scaling, pay-per-use | Costs can spike, cold start latency |
| Serverless (AWS Lambda, Cloud Functions) | One function invocation per scrape task | Zero idle cost, massive parallelism | Cold starts, IP reputation issues, timeout limits |
| Kubernetes | Horizontal pod autoscaling | Portable, fine-grained control | Operational complexity |

For most price monitoring operations at the 100,000-product scale, cloud VMs with auto-scaling provide the best balance of cost and simplicity. For the infrastructure fundamentals, our server setup guide covers the basics that apply to any high-volume scraping operation.
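
As a concrete illustration of queue-depth-based scaling, a small control loop can watch the Celery queue in Redis and resize an EC2 Auto Scaling group to match. A minimal sketch, assuming an Auto Scaling group named scraper-workers (hypothetical) and Celery's default Redis queue:

# autoscale.py: naive queue-depth autoscaler (sketch)
import time

import boto3
import redis

r = redis.Redis(host="localhost", port=6379, db=0)
asg = boto3.client("autoscaling")

TASKS_PER_WORKER = 500  # rough per-instance throughput assumption; tune for your workload

while True:
    # Celery's default broker queue is a Redis list named "celery"
    depth = r.llen("celery")
    desired = min(max(depth // TASKS_PER_WORKER, 2), 20)  # clamp between 2 and 20 instances
    asg.set_desired_capacity(
        AutoScalingGroupName="scraper-workers",  # hypothetical group name
        DesiredCapacity=desired,
    )
    time.sleep(300)  # re-evaluate every five minutes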

Proxy Pool Management at Scale

At 100,000 products, proxy management becomes a first-class engineering problem. You are no longer just configuring a single proxy gateway — you are managing a complex pool of IPs across multiple providers, regions, and quality tiers.

Multi-Provider Strategy

Relying on a single proxy provider at scale is risky. Provider outages, IP pool quality degradation, and bandwidth limits can cripple your entire operation. Use multiple providers and route traffic intelligently. For detailed guidance, see our article on managing multiple proxy providers.

| Provider Role | Proxy Type | Traffic Allocation | Use Case |
| --- | --- | --- | --- |
| Primary (Provider A) | Rotating residential | 50–60% | General product page scraping |
| Secondary (Provider B) | Rotating residential | 20–30% | Failover and load distribution |
| Premium (Provider C) | ISP/static residential | 10–15% | Difficult targets, session-based scraping |
| Backup (Provider D) | Datacenter | 5–10% | Easy targets, cost optimization |

Intelligent Proxy Routing

Not all target sites require the same proxy quality. Build a routing layer that assigns proxy types based on target difficulty:

from urllib.parse import urlparse

class ProxyRouter:
    """Routes requests to appropriate proxy providers based on target site."""

    def __init__(self, providers):
        # providers maps a tier key ("isp", "residential_primary",
        # "datacenter") to a provider client exposing get_proxy()
        self.providers = providers
        self.site_configs = {}
        self.provider_stats = {}

    def configure_site(self, domain, proxy_tier, rate_limit_rpm):
        """Set proxy requirements for a target site."""
        self.site_configs[domain] = {
            "tier": proxy_tier,       # "premium", "standard", "basic"
            "rate_limit": rate_limit_rpm,
            "last_request": 0,
        }

    def get_proxy(self, url):
        """Get the appropriate proxy for a URL."""
        domain = urlparse(url).netloc.replace("www.", "")

        config = self.site_configs.get(domain, {"tier": "standard"})
        tier = config["tier"]

        if tier == "premium":
            return self.providers["isp"].get_proxy()
        elif tier == "standard":
            return self.providers["residential_primary"].get_proxy()
        else:
            return self.providers["datacenter"].get_proxy()
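
Usage is then a one-time registration of each target at startup. A sketch, with hypothetical provider client objects that each expose a get_proxy() method:

router = ProxyRouter(providers={
    "isp": isp_client,                          # hypothetical provider clients
    "residential_primary": residential_client,
    "datacenter": datacenter_client,
})
router.configure_site("amazon.com", proxy_tier="premium", rate_limit_rpm=20)
router.configure_site("small-retailer.com", proxy_tier="basic", rate_limit_rpm=30)

proxy = router.get_proxy("https://www.amazon.com/dp/B00EXAMPLE")  # placeholder URL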

Proxy Health Monitoring

At scale, you need real-time visibility into proxy performance. Track these metrics per provider, per proxy IP, and per target site:

  • Success rate: Percentage of requests returning valid product data (not 403s, CAPTCHAs, or empty pages)
  • Response time: Average and P95 latency per proxy
  • Bandwidth usage: GB consumed per provider and per target
  • Cost per successful request: The true cost accounting for failures and retries
  • IP freshness: How frequently the provider is rotating IPs (detectable by logging the exit IP from response headers or test endpoints)

Build a dashboard (Grafana is excellent for this) that displays these metrics in real time. Set alerts for anomalies — a sudden drop in success rate for a specific target or provider requires immediate attention.
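
A minimal in-process version of that tracking, with the per-provider counters that would feed such a dashboard (the cost-per-GB figure is an illustrative assumption, not a quoted price):

from collections import defaultdict

class ProviderStats:
    """Per-provider health counters for dashboards and failover decisions."""

    def __init__(self, cost_per_gb):
        self.cost_per_gb = cost_per_gb  # illustrative; use your contracted rate
        self.requests = 0
        self.successes = 0
        self.bytes_used = 0

    def record(self, success, response_bytes):
        self.requests += 1
        if success:  # "success" means valid product data, not just HTTP 200
            self.successes += 1
        self.bytes_used += response_bytes

    @property
    def success_rate(self):
        return self.successes / self.requests if self.requests else 0.0

    @property
    def cost_per_success(self):
        # The true efficiency metric: total spend over valid data points
        spend = (self.bytes_used / 1e9) * self.cost_per_gb
        return spend / self.successes if self.successes else float("inf")

stats = defaultdict(lambda: ProviderStats(cost_per_gb=8.0))
stats["residential_primary"].record(success=True, response_bytes=480_000)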

Subnet Diversity

When using ISP or datacenter proxies, ensure your IPs come from diverse subnets. Using 100 IPs from the same /24 subnet is almost as bad as using a single IP — target sites analyze subnet patterns to identify coordinated scraping. Our guide on proxy subnets and diversity explains the technical details of why this matters and how to verify your subnet distribution.
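
A quick way to verify your own distribution is to group the pool by /24 network with Python's standard ipaddress module; a sketch:

import ipaddress
from collections import Counter

def subnet_distribution(ips, prefix=24):
    """Count how many proxy IPs fall into each /{prefix} subnet."""
    return Counter(
        ipaddress.ip_network(f"{ip}/{prefix}", strict=False) for ip in ips
    )

# Flag over-concentration: many IPs in one subnet looks like coordinated scraping
pool = ["203.0.113.10", "203.0.113.55", "198.51.100.7"]  # placeholder IPs
for subnet, count in subnet_distribution(pool).most_common():
    if count > 5:
        print(f"Warning: {count} IPs share {subnet}")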

Database Optimization for Price Data

Schema Design for Time-Series Price Data

Price data is time-series data — you are recording the same measurement (price) for the same entity (product at retailer) at regular intervals. Standard relational database schemas become inefficient at scale for this type of data. Consider these approaches:

Option 1: PostgreSQL with Partitioning

-- Partition price_records by month for efficient querying and cleanup
CREATE TABLE price_records (
    id BIGSERIAL,
    product_id INTEGER NOT NULL,
    retailer_id SMALLINT NOT NULL,
    price NUMERIC(10, 2),
    currency CHAR(3) DEFAULT 'USD',
    availability VARCHAR(20),
    scraped_at TIMESTAMP NOT NULL DEFAULT NOW(),
    PRIMARY KEY (id, scraped_at)
) PARTITION BY RANGE (scraped_at);

-- Create monthly partitions
CREATE TABLE price_records_2026_01 PARTITION OF price_records
    FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');
CREATE TABLE price_records_2026_02 PARTITION OF price_records
    FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');
-- ... continue for each month

-- Indexes for common query patterns
CREATE INDEX idx_price_product_time ON price_records (product_id, scraped_at DESC);
CREATE INDEX idx_price_retailer_time ON price_records (retailer_id, scraped_at DESC);

Option 2: TimescaleDB (PostgreSQL Extension)

TimescaleDB is a PostgreSQL extension specifically designed for time-series data. It automates partitioning, provides advanced compression, and offers time-series-specific query functions:

-- Create a hypertable for automatic time-based partitioning
SELECT create_hypertable('price_records', 'scraped_at',
    chunk_time_interval => INTERVAL '1 week');

-- Enable compression for older data
ALTER TABLE price_records SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'product_id, retailer_id'
);

-- Automatically compress data older than 30 days
SELECT add_compression_policy('price_records', INTERVAL '30 days');
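
Once the data lives in a hypertable, TimescaleDB's time_bucket function makes the typical roll-up queries straightforward. For example, a daily price summary per product over the last quarter:

-- Daily low/high/average price per product for the last 90 days
SELECT time_bucket('1 day', scraped_at) AS day,
       product_id,
       min(price) AS low_price,
       max(price) AS high_price,
       avg(price) AS avg_price
FROM price_records
WHERE scraped_at > NOW() - INTERVAL '90 days'
GROUP BY day, product_id
ORDER BY day DESC;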

Data Retention and Archival

At 500,000–4,000,000 rows per day, your database will grow fast. Implement a tiered storage strategy:

| Data Age | Storage | Granularity | Access Pattern |
| --- | --- | --- | --- |
| 0–30 days | Primary database (hot storage) | Full resolution (every observation) | Real-time queries, dashboards |
| 30–180 days | Primary database (compressed) | Full resolution, compressed | Historical analysis, trend reports |
| 180 days–2 years | Data warehouse or cold storage | Aggregated (daily min/max/avg) | Long-term trend analysis |
| 2+ years | Archive (S3, GCS) | Aggregated monthly | Compliance, rare lookups |
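
The step from full resolution to daily aggregates can be a scheduled job that writes into a summary table before the old partitions are dropped. A sketch, assuming a price_daily_summary table created for this purpose (hypothetical):

-- Roll observations older than 180 days up into daily aggregates
INSERT INTO price_daily_summary (product_id, retailer_id, day,
                                 min_price, max_price, avg_price)
SELECT product_id,
       retailer_id,
       date_trunc('day', scraped_at) AS day,
       min(price), max(price), avg(price)
FROM price_records
WHERE scraped_at < NOW() - INTERVAL '180 days'
GROUP BY product_id, retailer_id, day;

-- Then drop the aggregated full-resolution partitions, e.g.:
DROP TABLE IF EXISTS price_records_2025_06;  -- example partition name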

Scheduling and Prioritization

Not every product needs to be scraped at the same frequency. Smart scheduling significantly reduces your resource requirements without sacrificing data quality where it matters.

Priority Tiers

| Priority | Description | Scraping Frequency | % of Products |
| --- | --- | --- | --- |
| P1 — Critical | Top sellers, high-margin products, price-sensitive categories | Every 2–4 hours | 5–10% |
| P2 — Important | Core catalog, actively competitive categories | Every 8–12 hours | 20–30% |
| P3 — Standard | Full catalog breadth | Daily | 40–50% |
| P4 — Low | Long-tail products, stable-price categories | Every 2–3 days | 20–30% |

This tiered approach can reduce your total daily scrape volume by 40–60% compared to scraping everything at the same frequency, while actually improving data quality for the products that matter most.
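
A back-of-envelope check shows where that range comes from. Using rough midpoints of the table's ranges and assuming a flat baseline of four checks per day (both assumptions, not measurements):

# Expected checks per product per day under the tier mix above
tiers = {
    "P1": (0.075, 8.0),  # ~7.5% of products, roughly every 3 hours
    "P2": (0.250, 2.0),  # 25%, every 12 hours
    "P3": (0.450, 1.0),  # 45%, daily
    "P4": (0.225, 0.4),  # 22.5%, every 2.5 days
}
tiered = sum(share * checks for share, checks in tiers.values())
print(f"Tiered: {tiered:.2f} checks/product/day")   # ~1.64
print(f"Reduction vs 4/day: {1 - tiered / 4:.0%}")  # ~59%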

Adaptive Scheduling

Go beyond static tiers by using historical data to dynamically adjust scraping frequency. Products whose prices rarely change do not need frequent checks. Products with volatile pricing need more attention. Implement a feedback loop:

def calculate_scrape_priority(product_id):
    """Dynamically calculate how frequently a product should be scraped."""
    # Get recent price history
    recent_prices = get_price_history(product_id, days=30)

    if len(recent_prices) < 2:
        return "P2"  # Default for products with insufficient history

    # Calculate price volatility (coefficient of variation)
    prices = [p["price"] for p in recent_prices if p["price"]]
    if not prices:
        return "P3"

    mean_price = sum(prices) / len(prices)
    variance = sum((p - mean_price) ** 2 for p in prices) / len(prices)
    cv = (variance ** 0.5) / mean_price if mean_price > 0 else 0

    # Calculate revenue impact
    monthly_units = get_monthly_units_sold(product_id)
    revenue_impact = mean_price * monthly_units

    # High volatility + high revenue = highest priority
    if cv > 0.1 and revenue_impact > 10000:
        return "P1"
    elif cv > 0.05 or revenue_impact > 5000:
        return "P2"
    elif cv > 0.02 or revenue_impact > 1000:
        return "P3"
    else:
        return "P4"

Handling Anti-Bot Systems at Scale

At high volume, you will encounter every anti-bot system on the market. Your approach must be systematic rather than reactive. Build a target site classification system that documents the anti-bot measures used by each site and configures your scraping behavior accordingly.

Target Site Profiles

# site_profiles.yaml
amazon.com:
  anti_bot: "advanced"         # Amazon's custom system
  proxy_tier: "premium"
  rate_limit_rpm: 20
  requires_js: true
  captcha_frequency: "high"
  session_required: false
  custom_headers: true

target.com:
  anti_bot: "akamai"
  proxy_tier: "standard"
  rate_limit_rpm: 15
  requires_js: true
  captcha_frequency: "medium"
  session_required: false

small-retailer.com:
  anti_bot: "basic"
  proxy_tier: "basic"
  rate_limit_rpm: 30
  requires_js: false
  captcha_frequency: "none"
  session_required: false

This profile-driven approach lets your system automatically adjust its behavior for each target without manual intervention. When a site updates its protections, you update one profile rather than modifying scraper code.
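
Loading and applying these profiles at runtime is then a few lines. A sketch using PyYAML, with a deliberately conservative default for sites that have not been profiled yet:

import yaml

with open("site_profiles.yaml") as f:
    SITE_PROFILES = yaml.safe_load(f)

DEFAULT_PROFILE = {
    "anti_bot": "unknown",
    "proxy_tier": "premium",  # assume the worst until the site is profiled
    "rate_limit_rpm": 10,
    "requires_js": True,
}

def profile_for(domain):
    """Return the scraping profile for a domain, falling back to the default."""
    return SITE_PROFILES.get(domain, DEFAULT_PROFILE)

profile = profile_for("target.com")
delay_seconds = 60 / profile["rate_limit_rpm"]  # pace requests to the profile's limit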

Cost Management at Scale

At 100,000 products, costs are significant and need active management. Here is where your money goes and how to optimize each category:

Cost Breakdown

| Cost Category | Typical Monthly Cost (100K products) | Optimization Strategy |
| --- | --- | --- |
| Proxy bandwidth | $1,500–$5,000 | Tiered proxy usage, caching, conditional requests |
| Compute infrastructure | $500–$2,000 | Auto-scaling, spot instances, efficient code |
| Database storage | $200–$800 | Compression, data retention policies, aggregation |
| CAPTCHA solving | $100–$500 | Reduce CAPTCHA triggers with better fingerprinting |
| Monitoring and alerting | $50–$200 | Open-source tools (Grafana, Prometheus) |
| Engineering maintenance | $3,000–$10,000 | Automation, robust error handling, good documentation |
| Total | $5,350–$18,500 | |

Proxy Cost Optimization

Proxies are typically the largest variable cost. Optimize aggressively:

  • Use the cheapest proxy tier that works. Do not send traffic for easy sites through premium residential proxies when datacenter proxies would succeed
  • Cache unchanged pages. If a product page has not changed since your last check (detectable via ETag or Last-Modified headers), do not download the full page; see the conditional-request sketch after this list
  • Compress response data. Always request gzip/deflate encoding to reduce bandwidth
  • Avoid downloading images and assets. Configure your scrapers to fetch only the HTML, not images, CSS, or JavaScript files unless needed for rendering
  • Negotiate volume discounts. At scale, proxy providers will negotiate pricing. Get quotes from multiple providers and use them as leverage
  • Monitor cost per successful scrape. This is your true efficiency metric — total proxy cost divided by the number of valid data points collected
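
The caching point is worth a concrete example: a conditional request lets the server confirm a page is unchanged for the cost of headers alone. A sketch with requests, assuming you persist ETags between runs (the in-memory dict here stands in for Redis or your database):

import requests

etag_cache = {}  # url -> last seen ETag; persist this in practice

def fetch_if_changed(url, proxies=None):
    """Download a page only if it changed since the last check."""
    headers = {"Accept-Encoding": "gzip, deflate"}  # always request compression
    if url in etag_cache:
        headers["If-None-Match"] = etag_cache[url]

    response = requests.get(url, headers=headers, proxies=proxies, timeout=20)
    if response.status_code == 304:
        return None  # unchanged: almost no bandwidth spent
    if "ETag" in response.headers:
        etag_cache[url] = response.headers["ETag"]
    return response.text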

Reliability and Monitoring

At scale, you need comprehensive monitoring to detect and resolve issues before they cause data gaps:

Key Metrics to Monitor

  • Scrape completion rate: What percentage of scheduled scrapes completed successfully? Target 95%+
  • Data freshness: What is the average age of your most recent price data per product? Set alerts if any P1 product data is older than your target freshness
  • Queue depth: How many tasks are waiting in the queue? A growing queue indicates your workers cannot keep up
  • Error rate by type: Break down errors into categories (timeout, blocked, parse error, proxy error) to identify systemic issues
  • Cost per data point: Track this daily to detect cost anomalies
  • Parser accuracy: Sample and manually verify scraped data regularly to catch parser degradation
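
Data freshness in particular reduces to a simple query over your latest observations. A sketch, assuming a products table with a priority column (hypothetical schema):

-- P1 products whose newest price observation is older than 4 hours
SELECT p.product_id, max(r.scraped_at) AS last_seen
FROM products p
JOIN price_records r ON r.product_id = p.product_id
WHERE p.priority = 'P1'
GROUP BY p.product_id
HAVING max(r.scraped_at) < NOW() - INTERVAL '4 hours';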

Alerting Rules

  • Scrape completion rate drops below 90% for any P1 target: immediate alert
  • Queue depth exceeds 2x normal for more than 30 minutes: scale workers up
  • Any target site success rate drops below 50%: investigate anti-bot changes
  • Proxy provider success rate drops below 80%: failover to secondary provider
  • Daily cost exceeds 120% of budget: review traffic patterns for anomalies
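
If your observability stack is Prometheus-based, rules like these translate directly into alert definitions. A sketch of the first rule, with hypothetical metric names (scrape_success_total, scrape_scheduled_total) that your workers would need to export:

groups:
  - name: price-monitoring
    rules:
      - alert: P1CompletionRateLow
        # Hypothetical counters; instrument your workers to export them
        expr: >
          sum(rate(scrape_success_total{priority="P1"}[30m]))
            / sum(rate(scrape_scheduled_total{priority="P1"}[30m])) < 0.90
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "P1 scrape completion rate below 90%"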

Scaling Checklist: From 100 to 100,000 Products

Here is a phased roadmap for scaling your price monitoring system:

Phase 1: 100–1,000 Products

  • Single Python script with proxy rotation
  • SQLite or PostgreSQL database
  • One proxy provider
  • Manual scheduling (cron jobs)
  • CSV exports for analysis

Phase 2: 1,000–10,000 Products

  • Task queue (Celery + Redis)
  • PostgreSQL with proper indexing
  • Two proxy providers (primary + backup)
  • Priority-based scheduling
  • Basic monitoring (log analysis, success rate tracking)
  • Automated error handling and retries

Phase 3: 10,000–50,000 Products

  • Distributed worker pool across multiple machines
  • PostgreSQL with partitioning or TimescaleDB
  • Multi-provider proxy strategy with intelligent routing
  • Adaptive scheduling based on price volatility
  • Grafana dashboards for real-time monitoring
  • Automated alerting (PagerDuty, Slack)
  • Data retention policies

Phase 4: 50,000–100,000+ Products

  • Kubernetes or auto-scaling cloud infrastructure
  • TimescaleDB or dedicated time-series database
  • 3–4 proxy providers with automated failover
  • ML-driven scheduling and priority assignment
  • Comprehensive observability stack
  • Automated parser health checks and recovery
  • Cost optimization automation
  • Data quality scoring and validation pipelines

For the Python foundations you need to start building, see our Python price scraping tutorial. For choosing the right proxy setup, our proxy rotation strategies guide covers the patterns that work at every scale.

Frequently Asked Questions

How many proxies do I need for 100,000 products?

With rotating residential proxies using a gateway model, you do not need to manage individual IPs — the provider handles rotation from their pool. However, you need enough bandwidth to cover your daily request volume. At roughly 500KB per product page and 2–4 million requests per day, you are looking at 1–2 TB of proxy bandwidth per day, or roughly 30–60 TB per month. If using ISP or datacenter proxies with static IPs, aim for 1,000–5,000 IPs across diverse subnets to avoid per-IP rate limiting. The exact number depends on your target sites’ rate limits and your scraping frequency.

Should I use serverless (AWS Lambda) for large-scale scraping?

Serverless can work but has significant limitations for scraping at scale. Each Lambda invocation gets a new IP from AWS’s IP pool, which has poor reputation on most e-commerce sites. You need to route Lambda traffic through your proxy pool, adding latency. Cold starts add unpredictable delays. And Lambda’s 15-minute timeout limits your ability to handle complex pages or retry sequences. Serverless is best used as a supplement — for example, to handle overflow during peak scraping periods — rather than as your primary infrastructure.

How do I handle websites that change their HTML structure?

Parser breakage is inevitable at scale. Build resilience through multiple strategies: use structured data (JSON-LD, microdata) as your primary extraction method since it changes less frequently than HTML structure. Implement fallback parsers that try multiple CSS selectors. Add data validation that catches parsing errors quickly (prices outside expected ranges, missing required fields). Set up automated alerts when parser success rates drop. Invest in parser testing infrastructure that regularly validates your parsers against live pages.
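
A fallback chain for price extraction might look like the following sketch: JSON-LD first, then a prioritized list of CSS selectors (the selectors shown are placeholders you would tailor per site):

import json

from bs4 import BeautifulSoup

PRICE_SELECTORS = [".price-current", "span.product-price", "[itemprop='price']"]  # placeholders

def extract_price(html):
    """Try structured data first, then fall back through CSS selectors."""
    soup = BeautifulSoup(html, "html.parser")

    # 1. JSON-LD: machine-oriented by design, so it changes least often
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue
        offers = data.get("offers") if isinstance(data, dict) else None
        if isinstance(offers, dict) and "price" in offers:
            return float(offers["price"])

    # 2. CSS selectors, in descending order of confidence
    for selector in PRICE_SELECTORS:
        element = soup.select_one(selector)
        if element:
            text = element.get_text(strip=True).replace("$", "").replace(",", "")
            try:
                return float(text)
            except ValueError:
                continue

    return None  # explicit parse failure, so monitoring can count it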

What is the most cost-effective proxy strategy at scale?

A tiered approach where you match proxy quality to target difficulty. Use datacenter proxies (cheapest) for easy targets with minimal anti-bot protection. Use rotating residential proxies (moderate cost) for most mainstream retailers. Reserve ISP and mobile proxies (most expensive) for the hardest targets. This tiered approach can reduce proxy costs by 40–60% compared to using residential proxies for everything. Also negotiate volume discounts — most providers will offer 20–40% discounts for high-volume commitments.

How do I ensure data quality at scale?

Implement multiple layers of validation. First, validate at parse time — reject prices that are negative, zero, or outside a reasonable range for the product category. Second, validate against historical data — flag any price change greater than 50% for manual review, as these are often parsing errors. Third, run periodic audits where you manually check a sample of scraped data against live pages. Fourth, cross-validate by comparing the same product’s price from multiple sources. Build a data quality score for each scraping run and track it over time to detect gradual degradation.
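
The first two validation layers fit in a few lines. A sketch using the thresholds above, with the category price range assumed to come from a lookup you maintain:

def validate_price(price, previous_price, category_range):
    """Layered sanity checks; returns (is_valid, reason)."""
    # Layer 1: parse-time range checks
    if price is None or price <= 0:
        return False, "missing or non-positive price"
    low, high = category_range  # e.g. (5.0, 500.0); assumed per-category lookup
    if not low <= price <= high:
        return False, f"price {price} outside category range {category_range}"

    # Layer 2: historical comparison; moves over 50% are flagged for review
    if previous_price and abs(price - previous_price) / previous_price > 0.5:
        return False, f"price moved {price / previous_price - 1:+.0%} vs last check"

    return True, "ok"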
