How to Build a Digital Shelf Monitoring System with Proxies

Monitoring your products across e-commerce marketplaces is no longer something you can do manually. With thousands of SKUs listed across multiple platforms, each with its own search rankings, pricing, and content standards, brands need automated systems that collect and analyze digital shelf data continuously. Building such a system requires combining web scraping technology with proxy infrastructure to collect data reliably and at scale.

This guide walks through the architecture, tools, and implementation steps for building a digital shelf monitoring system from the ground up.

Why Build a Custom Monitoring System

Commercial digital shelf analytics platforms exist, and they serve many brands well. However, there are compelling reasons to build a custom system or augment a commercial solution with custom data collection:

  • Platform coverage gaps: Commercial tools may not cover niche or regional marketplaces like Tokopedia, Tiki, or local retail sites
  • Custom metrics: Your brand may need specific data points that off-the-shelf tools do not track
  • Cost efficiency: At high collection volumes, building your own infrastructure can be more cost-effective than per-SKU pricing models
  • Data ownership: Owning your data pipeline gives you full control over storage, processing, and integration with internal systems
  • Flexibility: Custom systems can be adapted quickly when platforms change or new monitoring needs emerge

System Architecture Overview

A digital shelf monitoring system consists of several interconnected components:

[Scheduler] → [Task Queue] → [Scraper Workers] → [Proxy Layer] → [Target Sites]
                                    ↓
                              [Raw Data Store]
                                    ↓
                              [Data Parser/Normalizer]
                                    ↓
                              [Analytics Database]
                                    ↓
                              [Dashboard/Alerts]

Each component plays a specific role, and the quality of the overall system depends on how well these components work together.

Component 1: The Proxy Layer

The proxy layer is arguably the most critical component. Without reliable proxies, your scrapers will be blocked, throttled, or served misleading data. The proxy layer sits between your scraper workers and the target marketplaces, rotating IP addresses and managing connection parameters.

Choosing the Right Proxy Type

For digital shelf monitoring, mobile proxies provide the best balance of reliability and authenticity. Here is why:

Mobile proxies route your requests through mobile carrier networks. The IP addresses are shared among thousands of real mobile users, which means marketplaces are extremely unlikely to block them, since doing so would also block legitimate customers. This is particularly relevant for monitoring Southeast Asian marketplaces, where mobile commerce dominates.

DataResearchTools specializes in mobile proxy infrastructure for SEA markets, providing carrier-level IP addresses from Singapore, Malaysia, Thailand, Indonesia, the Philippines, and Vietnam. This coverage is essential for collecting geo-accurate data from platforms like Shopee, Lazada, and Tokopedia, where product availability, pricing, and search rankings vary by country.

Proxy Configuration Best Practices

When integrating proxies into your monitoring system:

  1. Rotate IPs per request or session: Avoid making too many consecutive requests from the same IP
  2. Match proxy geography to target market: Use Malaysian proxies when monitoring Shopee Malaysia, Thai proxies for Lazada Thailand, and so on
  3. Implement sticky sessions where needed: Some platforms require consistent IPs within a single browsing session to avoid triggering security measures
  4. Set appropriate timeouts: Mobile connections can be slower than datacenter connections; configure timeouts of 30-60 seconds
  5. Handle proxy failures gracefully: Implement automatic retry logic with IP rotation when a request fails

Sample Proxy Integration (Python)

import requests

class ProxyManager:
    def __init__(self, proxy_endpoint, countries):
        self.proxy_endpoint = proxy_endpoint
        self.countries = countries

    def get_proxy(self, country='sg'):
        """Get a mobile proxy for a specific country."""
        # Substitute your own proxy credentials for user:pass
        return {
            'http': f'http://user:pass@{self.proxy_endpoint}:{self.get_port(country)}',
            'https': f'http://user:pass@{self.proxy_endpoint}:{self.get_port(country)}'
        }

    def get_port(self, country):
        port_map = {
            'sg': 10001,
            'my': 10002,
            'th': 10003,
            'id': 10004,
            'ph': 10005,
            'vn': 10006,
        }
        return port_map.get(country, 10001)

    def make_request(self, url, country='sg', max_retries=3):
        """Make a request through the proxy with retry logic."""
        for attempt in range(max_retries):
            try:
                proxy = self.get_proxy(country)
                response = requests.get(
                    url,
                    proxies=proxy,
                    timeout=45,
                    headers=self.get_headers()
                )
                if response.status_code == 200:
                    return response
            except requests.RequestException:
                # Connection failed; the next attempt retries through a fresh IP
                continue
        return None

    def get_headers(self):
        return {
            'User-Agent': 'Mozilla/5.0 (Linux; Android 13) AppleWebKit/537.36',
            'Accept-Language': 'en-US,en;q=0.9',
            'Accept': 'text/html,application/xhtml+xml'
        }
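
Point 3 of the best practices above (sticky sessions) deserves its own sketch. Many mobile proxy providers pin a session to one IP by encoding a session ID in the proxy username; the exact `user-session-<id>` format below is an assumption, so check your provider's documentation for the real convention.

```python
import uuid

def get_sticky_proxy(endpoint, port, user, password, session_id=None):
    """Build a proxy dict that keeps requests on one IP for a session.

    The username format ('user-session-<id>') is illustrative; providers
    differ, and some use a separate port range for sticky sessions instead.
    """
    session_id = session_id or uuid.uuid4().hex[:8]
    proxy_url = f'http://{user}-session-{session_id}:{password}@{endpoint}:{port}'
    return {'http': proxy_url, 'https': proxy_url}
```

Reuse the same session ID for every request within one simulated browsing session, then drop it to rotate to a fresh IP.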

Component 2: The Scheduler

The scheduler determines when and how often each monitoring task runs. Different data types require different collection frequencies:

Data Type          | Recommended Frequency | Rationale
Pricing            | Every 2-4 hours       | Prices change frequently, especially during promotions
Search rankings    | Daily                 | Rankings shift gradually; daily snapshots are sufficient
Content/listings   | Weekly                | Content changes are less frequent
Stock availability | Every 4-6 hours       | Stockouts can happen quickly during demand spikes
Reviews            | Daily                 | New reviews trickle in; daily collection captures trends

Use a task scheduler like Celery (Python), Bull (Node.js), or a managed service like AWS EventBridge to trigger collection jobs at defined intervals.

from celery import Celery
from celery.schedules import crontab

app = Celery('digital_shelf_monitor')

app.conf.beat_schedule = {
    'collect-pricing-data': {
        'task': 'tasks.collect_pricing',
        'schedule': crontab(minute=0, hour='*/4'),  # Every 4 hours
    },
    'collect-search-rankings': {
        'task': 'tasks.collect_search_rankings',
        'schedule': crontab(minute=0, hour=6),  # Daily at 6 AM
    },
    'collect-stock-status': {
        'task': 'tasks.collect_availability',
        'schedule': crontab(minute=0, hour='*/6'),  # Every 6 hours
    },
    'collect-reviews': {
        'task': 'tasks.collect_reviews',
        'schedule': crontab(minute=0, hour=8),  # Daily at 8 AM
    },
    'audit-content': {
        'task': 'tasks.audit_content',
        'schedule': crontab(minute=0, hour=3, day_of_week=1),  # Weekly
    },
}

Component 3: Scraper Workers

Scraper workers execute the actual data collection. Each worker picks up a task from the queue, makes requests through the proxy layer, and stores the collected data.

Platform-Specific Parsers

Each marketplace requires its own parser because page structures differ. Structure your parsers as modular components:

class ShopeeParser:
    def parse_product_page(self, html):
        """Extract product data from a Shopee product page."""
        return {
            'title': self._extract_title(html),
            'price': self._extract_price(html),
            'original_price': self._extract_original_price(html),
            'rating': self._extract_rating(html),
            'review_count': self._extract_review_count(html),
            'stock_status': self._extract_stock_status(html),
            'seller': self._extract_seller_info(html),
            'images': self._extract_images(html),
            'description': self._extract_description(html),
        }

    def parse_search_results(self, html, keyword):
        """Extract product rankings from search results."""
        results = []
        products = self._extract_search_items(html)
        for position, product in enumerate(products, 1):
            results.append({
                'keyword': keyword,
                'position': position,
                'product_id': product.get('id'),
                'title': product.get('title'),
                'price': product.get('price'),
                'is_sponsored': product.get('is_ad', False),
            })
        return results

class LazadaParser:
    def parse_product_page(self, html):
        """Extract product data from a Lazada product page."""
        # Lazada-specific parsing logic
        pass

class TokopediaParser:
    def parse_product_page(self, html):
        """Extract product data from a Tokopedia product page."""
        # Tokopedia-specific parsing logic
        pass
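
To make the `_extract_*` helpers concrete, here is one illustrative way to implement `_extract_title`-style logic: many product pages embed structured data as JSON-LD in a script tag, which is more stable than CSS selectors. The markup pattern below is a simplification; real pages need per-platform extraction that you should expect to maintain as layouts change.

```python
import json
import re

def extract_title(html):
    """Pull a product name from embedded JSON-LD, if present.

    This is a sketch: it handles only the simplest case of a single
    Product object in one script tag, which real pages may not match.
    """
    match = re.search(
        r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
        html, re.DOTALL
    )
    if match:
        try:
            data = json.loads(match.group(1))
            if data.get('@type') == 'Product':
                return data.get('name')
        except json.JSONDecodeError:
            pass
    return None
```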

Handling JavaScript-Rendered Content

Many modern marketplaces render content using JavaScript. For these cases, you may need headless browser automation instead of simple HTTP requests:

from playwright.async_api import async_playwright

async def collect_with_browser(url, proxy_config):
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            proxy={
                'server': proxy_config['server'],
                'username': proxy_config['username'],
                'password': proxy_config['password'],
            }
        )
        page = await browser.new_page()
        await page.goto(url, wait_until='networkidle')
        content = await page.content()
        await browser.close()
        return content

Rate Limiting and Politeness

Even with proxies, implement rate limiting to be a responsible data collector:

  • Add random delays between requests (2-5 seconds)
  • Respect robots.txt where applicable
  • Distribute requests across time rather than sending bursts
  • Monitor response codes and back off when you see increased error rates
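
The delay and backoff rules above can be sketched as a small pacing wrapper. `fetch_fn` is a stand-in for whatever request function you use (for example `ProxyManager.make_request`); the idea is that the delay window widens as consecutive errors accumulate.

```python
import random
import time

def polite_fetch(urls, fetch_fn, min_delay=2.0, max_delay=5.0):
    """Fetch URLs sequentially with random delays and error-driven backoff.

    fetch_fn is any callable taking a URL and returning an HTTP status
    code. Consecutive error responses widen the delay so the crawler
    backs off instead of hammering a struggling endpoint.
    """
    results = []
    error_streak = 0
    for url in urls:
        status = fetch_fn(url)
        results.append((url, status))
        if status >= 400:
            error_streak += 1
        else:
            error_streak = 0
        # Random jitter, scaled up by the number of consecutive errors
        delay = random.uniform(min_delay, max_delay) * (1 + error_streak)
        time.sleep(delay)
    return results
```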

Component 4: Data Storage and Processing

Raw Data Storage

Store the raw HTML or JSON responses from each collection run. This provides several benefits:

  • You can re-parse data if your parser has bugs
  • Historical raw data enables retrospective analysis
  • You can debug collection issues by examining raw responses

Use object storage (S3, Google Cloud Storage) for raw data, organized by date and platform:

raw_data/
  2026/03/09/
    shopee_sg/
      products/
        product_12345.html
      search/
        keyword_mobile_proxy.html
    lazada_my/
      products/
        product_67890.html
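
A small helper can generate object keys following this layout so every worker writes to a consistent path. The function name and parameters here are illustrative, not part of any library.

```python
from datetime import date

def raw_data_key(platform, country, category, identifier, day=None):
    """Build an object-storage key matching the layout above,
    e.g. raw_data/2026/03/09/shopee_sg/products/product_12345.html"""
    day = day or date.today()
    return (
        f'raw_data/{day.year}/{day.month:02d}/{day.day:02d}/'
        f'{platform}_{country}/{category}/{identifier}.html'
    )
```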

Structured Data Storage

After parsing, store structured data in a database optimized for time-series queries. PostgreSQL with TimescaleDB extension works well, as does ClickHouse for high-volume analytical workloads.

CREATE TABLE product_prices (
    collected_at TIMESTAMPTZ NOT NULL,
    platform VARCHAR(50) NOT NULL,
    country VARCHAR(5) NOT NULL,
    product_id VARCHAR(100) NOT NULL,
    price DECIMAL(12, 2),
    original_price DECIMAL(12, 2),
    currency VARCHAR(5),
    seller_id VARCHAR(100),
    in_stock BOOLEAN,
    PRIMARY KEY (collected_at, platform, country, product_id)
);

CREATE TABLE search_rankings (
    collected_at TIMESTAMPTZ NOT NULL,
    platform VARCHAR(50) NOT NULL,
    country VARCHAR(5) NOT NULL,
    keyword VARCHAR(500) NOT NULL,
    product_id VARCHAR(100) NOT NULL,
    position INTEGER,
    is_sponsored BOOLEAN,
    PRIMARY KEY (collected_at, platform, country, keyword, product_id)
);

Component 5: Analytics and Alerting

Dashboard Design

Build dashboards that surface the most actionable metrics:

  • Overview dashboard: High-level health across all products and platforms
  • Pricing dashboard: Current prices, MAP violations, competitive positioning
  • Search dashboard: Ranking trends, share of search, keyword performance
  • Content dashboard: Compliance scores, content gaps, listing quality
  • Availability dashboard: Stockout rates, fulfillment metrics

Use tools like Metabase, Grafana, or a custom web application to visualize the data.

Alert Configuration

Set up alerts for events that require immediate action:

class AlertManager:
    def check_price_violations(self, product_id, current_price, map_price):
        if current_price < map_price:
            self.send_alert(
                severity='high',
                message=f'MAP violation: {product_id} priced at {current_price}, '
                        f'MAP is {map_price}',
                channel='pricing-alerts'
            )

    def check_stock_status(self, product_id, was_in_stock, is_in_stock):
        if was_in_stock and not is_in_stock:
            self.send_alert(
                severity='medium',
                message=f'Stockout detected: {product_id}',
                channel='availability-alerts'
            )

    def check_ranking_drop(self, product_id, keyword, old_position, new_position):
        if new_position - old_position > 10:
            self.send_alert(
                severity='medium',
                message=f'Ranking drop: {product_id} for "{keyword}" '
                        f'dropped from #{old_position} to #{new_position}',
                channel='search-alerts'
            )
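
The `send_alert` method the class relies on could be implemented as a POST to a chat webhook. The payload shape below is an assumption (Slack-style); adapt it to whatever your team's alerting channel expects.

```python
import json
import urllib.request

def build_alert_payload(severity, message, channel):
    """Shape an alert message for a chat webhook.

    The field names here are illustrative; Slack, Teams, and other
    services each expect their own payload format.
    """
    return {
        'channel': channel,
        'text': f'[{severity.upper()}] {message}',
    }

def send_alert(webhook_url, severity, message, channel):
    """POST the alert to a webhook; one possible send_alert implementation."""
    payload = build_alert_payload(severity, message, channel)
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode(),
        headers={'Content-Type': 'application/json'},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```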

Deployment Considerations

Infrastructure Sizing

For a monitoring system tracking 1,000 products across 5 platforms:

  • Proxy bandwidth: Approximately 50-100 GB per month
  • Storage: 10-20 GB per month for raw data, 1-2 GB for structured data
  • Compute: 2-4 vCPUs for scraper workers, 1-2 vCPUs for scheduler and analytics
  • Database: PostgreSQL with 50 GB storage is sufficient for the first year
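
A back-of-envelope calculator helps sanity-check these figures for your own catalog. Actual bandwidth varies widely with page weight and collection frequency, so the `avg_page_kb` default below is only a placeholder; measure your real page sizes before committing to a proxy plan.

```python
def estimate_monthly_bandwidth_gb(products, platforms, collections_per_day,
                                  avg_page_kb=500):
    """Rough monthly proxy bandwidth: one page fetch per product,
    per platform, per collection run, over a 30-day month."""
    requests_per_month = products * platforms * collections_per_day * 30
    return requests_per_month * avg_page_kb / (1024 * 1024)
```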

Scaling Strategy

As your monitoring scope grows, scale horizontally by adding more scraper workers and distributing them across multiple proxy endpoints. DataResearchTools supports high-concurrency connections, making it straightforward to scale your data collection without hitting proxy capacity limits.

Monitoring the Monitor

Your monitoring system itself needs monitoring. Track:

  • Collection success rates per platform
  • Data freshness (time since last successful collection)
  • Parser error rates (indicating potential page structure changes)
  • Proxy health metrics (connection success rate, response times)
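
The first two metrics can be computed from a simple log of collection runs. The data shape here (a list of timestamp/success pairs) and the six-hour freshness cutoff are assumptions; tune the threshold per data type to match your collection schedule.

```python
from datetime import datetime, timedelta, timezone

def collection_health(runs, freshness_threshold=timedelta(hours=6)):
    """Summarize success rate and data freshness for one platform.

    runs is a list of (timestamp, succeeded) tuples for recent
    collection attempts; timestamps are timezone-aware UTC datetimes.
    """
    total = len(runs)
    successes = sum(1 for _, ok in runs if ok)
    last_success = max((ts for ts, ok in runs if ok), default=None)
    now = datetime.now(timezone.utc)
    return {
        'success_rate': successes / total if total else 0.0,
        'last_success': last_success,
        'stale': last_success is None or now - last_success > freshness_threshold,
    }
```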

Common Pitfalls and How to Avoid Them

Pitfall 1: Ignoring Platform Updates

Marketplaces regularly update their page structures. Build your parsers defensively and set up automated tests that catch parsing failures early.
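
One way to catch structure changes early is a smoke test that re-parses a stored fixture and flags any required field that comes back empty. The helper and stand-in parser below are illustrative; wire the real check up to your own parsers and saved raw pages.

```python
def smoke_test_parser(parse_fn, fixture_html, required_fields):
    """Run a parser against saved HTML and return the fields it failed
    to extract. Run in CI, and on a schedule against fresh production
    pages, so a marketplace redesign surfaces as a failing test rather
    than a silent data gap."""
    parsed = parse_fn(fixture_html)
    return [f for f in required_fields if parsed.get(f) in (None, '')]

# Example with a stand-in parser that fails to extract the price:
def toy_parser(html):
    return {'title': 'Widget', 'price': None}

missing = smoke_test_parser(toy_parser, '<html></html>', ['title', 'price'])
# 'price' comes back empty, so the check flags it
```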

Pitfall 2: Under-Investing in Proxy Quality

Cheap datacenter proxies will get blocked frequently, creating data gaps. Invest in quality mobile proxies from providers like DataResearchTools that offer genuine carrier IP addresses.

Pitfall 3: Collecting Too Much, Analyzing Too Little

It is easy to collect mountains of data without building the analytical capabilities to extract insights. Start with a focused set of metrics and expand gradually.

Pitfall 4: Not Accounting for Geographic Variation

Products displayed in Singapore may have different prices, availability, and rankings than the same products in Indonesia. Always use geo-targeted proxies to collect market-specific data.

Conclusion

Building a digital shelf monitoring system is a significant technical undertaking, but the competitive advantages it provides justify the investment. By combining reliable proxy infrastructure from DataResearchTools with well-designed scraping, parsing, and analytics components, brands can gain continuous visibility into their digital shelf performance across Southeast Asian marketplaces.

Start with the highest-priority platforms and metrics, prove the value with initial results, and then expand your monitoring scope systematically. The brands that build these capabilities gain an information advantage that compounds over time as they accumulate historical data and refine their strategies based on what the data reveals.

