How to Build a Digital Shelf Monitoring System with Proxies
Monitoring your products across e-commerce marketplaces is no longer something you can do manually. With thousands of SKUs listed across multiple platforms, each with its own search rankings, pricing, and content standards, brands need automated systems that collect and analyze digital shelf data continuously. Building such a system requires combining web scraping technology with proxy infrastructure to collect data reliably and at scale.
This guide walks through the architecture, tools, and implementation steps for building a digital shelf monitoring system from the ground up.
Why Build a Custom Monitoring System
Commercial digital shelf analytics platforms exist, and they serve many brands well. However, there are compelling reasons to build a custom system or augment a commercial solution with custom data collection:
- Platform coverage gaps: Commercial tools may not cover niche or regional marketplaces like Tokopedia, Tiki, or local retail sites
- Custom metrics: Your brand may need specific data points that off-the-shelf tools do not track
- Cost efficiency: At high collection volumes, building your own infrastructure can be more cost-effective than per-SKU pricing models
- Data ownership: Owning your data pipeline gives you full control over storage, processing, and integration with internal systems
- Flexibility: Custom systems can be adapted quickly when platforms change or new monitoring needs emerge
System Architecture Overview
A digital shelf monitoring system consists of several interconnected components:
[Scheduler] → [Task Queue] → [Scraper Workers] → [Proxy Layer] → [Target Sites]
                                    ↓
                            [Raw Data Store]
                                    ↓
                        [Data Parser/Normalizer]
                                    ↓
                         [Analytics Database]
                                    ↓
                          [Dashboard/Alerts]

Each component plays a specific role, and the quality of the overall system depends on how well these components work together.
Component 1: The Proxy Layer
The proxy layer is arguably the most critical component. Without reliable proxies, your scrapers will be blocked, throttled, or served misleading data. The proxy layer sits between your scraper workers and the target marketplaces, rotating IP addresses and managing connection parameters.
Choosing the Right Proxy Type
For digital shelf monitoring, mobile proxies provide the best balance of reliability and authenticity. Here is why:
Mobile proxies route your requests through mobile carrier networks. The IP addresses are shared among thousands of real mobile users, which means marketplaces are extremely unlikely to block them—doing so would block legitimate customers. This is particularly relevant for monitoring Southeast Asian marketplaces where mobile commerce dominates.
DataResearchTools specializes in mobile proxy infrastructure for SEA markets, providing carrier-level IP addresses from Singapore, Malaysia, Thailand, Indonesia, the Philippines, and Vietnam. This coverage is essential for collecting geo-accurate data from platforms like Shopee, Lazada, and Tokopedia, where product availability, pricing, and search rankings vary by country.
Proxy Configuration Best Practices
When integrating proxies into your monitoring system:
- Rotate IPs per request or session: Avoid making too many consecutive requests from the same IP
- Match proxy geography to target market: Use Malaysian proxies when monitoring Shopee Malaysia, Thai proxies for Lazada Thailand, and so on
- Implement sticky sessions where needed: Some platforms require consistent IPs within a single browsing session to avoid triggering security measures
- Set appropriate timeouts: Mobile connections can be slower than datacenter connections; configure timeouts of 30-60 seconds
- Handle proxy failures gracefully: Implement automatic retry logic with IP rotation when a request fails
Sample Proxy Integration (Python)
import requests
from itertools import cycle

class ProxyManager:
    def __init__(self, proxy_endpoint, countries):
        self.proxy_endpoint = proxy_endpoint
        self.countries = countries

    def get_proxy(self, country='sg'):
        """Get a mobile proxy for a specific country."""
        return {
            'http': f'http://user:pass@{self.proxy_endpoint}:{self.get_port(country)}',
            'https': f'http://user:pass@{self.proxy_endpoint}:{self.get_port(country)}'
        }

    def get_port(self, country):
        port_map = {
            'sg': 10001,
            'my': 10002,
            'th': 10003,
            'id': 10004,
            'ph': 10005,
            'vn': 10006,
        }
        return port_map.get(country, 10001)

    def make_request(self, url, country='sg', max_retries=3):
        """Make a request through the proxy with retry logic."""
        for attempt in range(max_retries):
            try:
                proxy = self.get_proxy(country)
                response = requests.get(
                    url,
                    proxies=proxy,
                    timeout=45,
                    headers=self.get_headers()
                )
                if response.status_code == 200:
                    return response
            except requests.RequestException:
                continue
        return None

    def get_headers(self):
        return {
            'User-Agent': 'Mozilla/5.0 (Linux; Android 13) AppleWebKit/537.36',
            'Accept-Language': 'en-US,en;q=0.9',
            'Accept': 'text/html,application/xhtml+xml'
        }

Component 2: The Scheduler
The scheduler determines when and how often each monitoring task runs. Different data types require different collection frequencies:
| Data Type | Recommended Frequency | Rationale |
|---|---|---|
| Pricing | Every 2-4 hours | Prices change frequently, especially during promotions |
| Search rankings | Daily | Rankings shift gradually; daily snapshots are sufficient |
| Content/listings | Weekly | Content changes are less frequent |
| Stock availability | Every 4-6 hours | Stockouts can happen quickly during demand spikes |
| Reviews | Daily | New reviews trickle in; daily collection captures trends |
Use a task scheduler like Celery (Python), Bull (Node.js), or a managed service like AWS EventBridge to trigger collection jobs at defined intervals.
from celery import Celery
from celery.schedules import crontab

app = Celery('digital_shelf_monitor')

app.conf.beat_schedule = {
    'collect-pricing-data': {
        'task': 'tasks.collect_pricing',
        'schedule': crontab(minute=0, hour='*/4'),  # Every 4 hours
    },
    'collect-search-rankings': {
        'task': 'tasks.collect_search_rankings',
        'schedule': crontab(minute=0, hour=6),  # Daily at 6 AM
    },
    'collect-stock-status': {
        'task': 'tasks.collect_availability',
        'schedule': crontab(minute=0, hour='*/6'),  # Every 6 hours
    },
    'collect-reviews': {
        'task': 'tasks.collect_reviews',
        'schedule': crontab(minute=0, hour=8),  # Daily at 8 AM
    },
    'audit-content': {
        'task': 'tasks.audit_content',
        'schedule': crontab(minute=0, hour=3, day_of_week=1),  # Weekly
    },
}

Component 3: Scraper Workers
Scraper workers execute the actual data collection. Each worker picks up a task from the queue, makes requests through the proxy layer, and stores the collected data.
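A worker in this design is just a task that ties the other components together: it pulls a job, fetches the page through a geo-matched proxy, archives the raw response, and parses it. Here is a minimal sketch, assuming the Celery app from the scheduler example above, the ProxyManager class from Component 1, the ShopeeParser class shown in the next subsection, and a hypothetical store_raw() helper (a version of which appears in Component 4); the proxy endpoint is a placeholder.

@app.task(bind=True, max_retries=3)
def collect_product(self, product_id, product_url, country='sg'):
    """Fetch one product page through the proxy layer, archive it, and parse it."""
    # 'proxy.example.com' is a placeholder endpoint; substitute your provider's gateway
    proxy_manager = ProxyManager('proxy.example.com', ['sg', 'my', 'th', 'id', 'ph', 'vn'])
    response = proxy_manager.make_request(product_url, country=country)
    if response is None:
        # All proxy retries failed; re-queue the task after a delay
        raise self.retry(countdown=300)
    store_raw('shopee', country, 'products', product_id, response.text)  # hypothetical helper
    return ShopeeParser().parse_product_page(response.text)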
Platform-Specific Parsers
Each marketplace requires its own parser because page structures differ. Structure your parsers as modular components:
class ShopeeParser:
    def parse_product_page(self, html):
        """Extract product data from a Shopee product page."""
        return {
            'title': self._extract_title(html),
            'price': self._extract_price(html),
            'original_price': self._extract_original_price(html),
            'rating': self._extract_rating(html),
            'review_count': self._extract_review_count(html),
            'stock_status': self._extract_stock_status(html),
            'seller': self._extract_seller_info(html),
            'images': self._extract_images(html),
            'description': self._extract_description(html),
        }

    def parse_search_results(self, html, keyword):
        """Extract product rankings from search results."""
        results = []
        products = self._extract_search_items(html)
        for position, product in enumerate(products, 1):
            results.append({
                'keyword': keyword,
                'position': position,
                'product_id': product.get('id'),
                'title': product.get('title'),
                'price': product.get('price'),
                'is_sponsored': product.get('is_ad', False),
            })
        return results

class LazadaParser:
    def parse_product_page(self, html):
        """Extract product data from a Lazada product page."""
        # Lazada-specific parsing logic
        pass

class TokopediaParser:
    def parse_product_page(self, html):
        """Extract product data from a Tokopedia product page."""
        # Tokopedia-specific parsing logic
        pass

Handling JavaScript-Rendered Content
Many modern marketplaces render content using JavaScript. For these cases, you may need headless browser automation instead of simple HTTP requests:
from playwright.async_api import async_playwright

async def collect_with_browser(url, proxy_config):
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            proxy={
                'server': proxy_config['server'],
                'username': proxy_config['username'],
                'password': proxy_config['password'],
            }
        )
        page = await browser.new_page()
        await page.goto(url, wait_until='networkidle')
        content = await page.content()
        await browser.close()
        return content

Rate Limiting and Politeness
Even with proxies, implement rate limiting to be a responsible data collector (a minimal throttling sketch follows the list):
- Add random delays between requests (2-5 seconds)
- Respect robots.txt where applicable
- Distribute requests across time rather than sending bursts
- Monitor response codes and back off when you see increased error rates
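The sketch below illustrates one way to apply these rules: a small throttle that sleeps for a random interval between requests and doubles its backoff when it sees throttling or server errors. The delay range and backoff cap are assumptions to tune for your targets.

import random
import time

class PoliteThrottle:
    """Adds a randomized delay between requests and backs off on errors."""

    def __init__(self, min_delay=2.0, max_delay=5.0):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.backoff = 1.0  # multiplier that grows when errors are observed

    def wait(self):
        # Sleep for a random interval, scaled by the current backoff factor
        time.sleep(random.uniform(self.min_delay, self.max_delay) * self.backoff)

    def record_response(self, status_code):
        # Double the backoff (capped) on throttling/server errors, relax it otherwise
        if status_code in (429, 500, 502, 503):
            self.backoff = min(self.backoff * 2, 8.0)
        else:
            self.backoff = max(self.backoff / 2, 1.0)

Call wait() before each request and record_response() after it, so a burst of 429s slows the whole worker down instead of triggering harder blocks.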
Component 4: Data Storage and Processing
Raw Data Storage
Store the raw HTML or JSON responses from each collection run. This provides several benefits:
- You can re-parse data if your parser has bugs
- Historical raw data enables retrospective analysis
- You can debug collection issues by examining raw responses
Use object storage (S3, Google Cloud Storage) for raw data, organized by date and platform:
raw_data/
    2026/03/09/
        shopee_sg/
            products/
                product_12345.html
            search/
                keyword_mobile_proxy.html
        lazada_my/
            products/
                product_67890.html
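A minimal upload sketch for this layout using boto3; the bucket name is a placeholder, and the key structure mirrors the directory tree above.

import boto3
from datetime import datetime, timezone

s3 = boto3.client('s3')

def store_raw(platform, country, category, identifier, body):
    """Write one raw response to object storage under the date/platform layout."""
    today = datetime.now(timezone.utc).strftime('%Y/%m/%d')
    key = f'raw_data/{today}/{platform}_{country}/{category}/{identifier}.html'
    s3.put_object(Bucket='digital-shelf-raw', Key=key, Body=body.encode('utf-8'))  # placeholder bucket
    return key

For example, store_raw('shopee', 'sg', 'products', 'product_12345', response.text) produces a key matching the first entry in the layout above.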
Structured Data Storage
After parsing, store structured data in a database optimized for time-series queries. PostgreSQL with the TimescaleDB extension works well, as does ClickHouse for high-volume analytical workloads.
CREATE TABLE product_prices (
    collected_at TIMESTAMPTZ NOT NULL,
    platform VARCHAR(50) NOT NULL,
    country VARCHAR(5) NOT NULL,
    product_id VARCHAR(100) NOT NULL,
    price DECIMAL(12, 2),
    original_price DECIMAL(12, 2),
    currency VARCHAR(5),
    seller_id VARCHAR(100),
    in_stock BOOLEAN,
    PRIMARY KEY (collected_at, platform, country, product_id)
);

CREATE TABLE search_rankings (
    collected_at TIMESTAMPTZ NOT NULL,
    platform VARCHAR(50) NOT NULL,
    country VARCHAR(5) NOT NULL,
    keyword VARCHAR(500) NOT NULL,
    product_id VARCHAR(100) NOT NULL,
    position INTEGER,
    is_sponsored BOOLEAN,
    PRIMARY KEY (collected_at, platform, country, keyword, product_id)
);

Component 5: Analytics and Alerting
Dashboard Design
Build dashboards that surface the most actionable metrics:
- Overview dashboard: High-level health across all products and platforms
- Pricing dashboard: Current prices, MAP violations, competitive positioning
- Search dashboard: Ranking trends, share of search (see the sketch below), keyword performance
- Content dashboard: Compliance scores, content gaps, listing quality
- Availability dashboard: Stockout rates, fulfillment metrics
Use tools like Metabase, Grafana, or a custom web application to visualize the data.
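To make the dashboard metrics concrete, here is a minimal sketch of one of them, share of search, computed from the search_rankings table defined above. The connection string, the brand product IDs, and the use of psycopg2 are assumptions; adapt them to your environment or express the same query directly in your dashboard tool.

import psycopg2

# Hypothetical connection string and brand product IDs
DSN = 'postgresql://monitor:secret@localhost:5432/digital_shelf'
BRAND_PRODUCT_IDS = ('12345', '67890')

def share_of_search(keyword, platform, country, top_n=20):
    """Fraction of the latest top-N results for a keyword occupied by our products."""
    query = """
        SELECT product_id
        FROM search_rankings
        WHERE keyword = %s AND platform = %s AND country = %s
          AND collected_at = (
              SELECT MAX(collected_at) FROM search_rankings
              WHERE keyword = %s AND platform = %s AND country = %s
          )
        ORDER BY position
        LIMIT %s
    """
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        cur.execute(query, (keyword, platform, country, keyword, platform, country, top_n))
        rows = [row[0] for row in cur.fetchall()]
    if not rows:
        return None
    return sum(1 for pid in rows if pid in BRAND_PRODUCT_IDS) / len(rows)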
Alert Configuration
Set up alerts for events that require immediate action:
class AlertManager:
    def check_price_violations(self, product_id, current_price, map_price):
        if current_price < map_price:
            self.send_alert(
                severity='high',
                message=f'MAP violation: {product_id} priced at {current_price}, '
                        f'MAP is {map_price}',
                channel='pricing-alerts'
            )

    def check_stock_status(self, product_id, was_in_stock, is_in_stock):
        if was_in_stock and not is_in_stock:
            self.send_alert(
                severity='medium',
                message=f'Stockout detected: {product_id}',
                channel='availability-alerts'
            )

    def check_ranking_drop(self, product_id, keyword, old_position, new_position):
        if new_position - old_position > 10:
            self.send_alert(
                severity='medium',
                message=f'Ranking drop: {product_id} for "{keyword}" '
                        f'dropped from #{old_position} to #{new_position}',
                channel='search-alerts'
            )

Deployment Considerations
Infrastructure Sizing
For a monitoring system tracking 1,000 products across 5 platforms:
- Proxy bandwidth: Approximately 50-100 GB per month
- Storage: 10-20 GB per month for raw data, 1-2 GB for structured data
- Compute: 2-4 vCPUs for scraper workers, 1-2 vCPUs for scheduler and analytics
- Database: PostgreSQL with 50 GB storage is sufficient for the first year
Scaling Strategy
As your monitoring scope grows, scale horizontally by adding more scraper workers and distributing them across multiple proxy endpoints. DataResearchTools supports high-concurrency connections, making it straightforward to scale your data collection without hitting proxy capacity limits.
Monitoring the Monitor
Your monitoring system itself needs monitoring. Track the following (a minimal health-tracking sketch follows the list):
- Collection success rates per platform
- Data freshness (time since last successful collection)
- Parser error rates (indicating potential page structure changes)
- Proxy health metrics (connection success rate, response times)
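A minimal in-memory sketch for the first two items, success rate and freshness; in practice you would push these figures to whatever metrics system backs your dashboards rather than keeping them in a Python object.

from collections import defaultdict
from datetime import datetime, timezone

class CollectionHealth:
    """Tracks collection success rates and data freshness per platform."""

    def __init__(self):
        self.attempts = defaultdict(int)
        self.successes = defaultdict(int)
        self.last_success = {}

    def record(self, platform, ok):
        self.attempts[platform] += 1
        if ok:
            self.successes[platform] += 1
            self.last_success[platform] = datetime.now(timezone.utc)

    def success_rate(self, platform):
        attempts = self.attempts[platform]
        return self.successes[platform] / attempts if attempts else None

    def staleness_minutes(self, platform):
        # Minutes since the last successful collection, or None if never succeeded
        last = self.last_success.get(platform)
        if last is None:
            return None
        return (datetime.now(timezone.utc) - last).total_seconds() / 60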
Common Pitfalls and How to Avoid Them
Pitfall 1: Ignoring Platform Updates
Marketplaces regularly update their page structures. Build your parsers defensively and set up automated tests that catch parsing failures early.
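One practical way to catch structure changes early is a fixture-based smoke test that runs your parsers against saved HTML. A minimal pytest sketch, assuming the ShopeeParser from Component 3, a fixtures directory of stored pages, and a hypothetical parsers module layout:

import pathlib
import pytest

from parsers import ShopeeParser  # assumed module layout

FIXTURES = pathlib.Path(__file__).parent / 'fixtures'

@pytest.mark.parametrize('fixture_name', ['shopee_product_basic.html'])
def test_product_page_fields(fixture_name):
    html = (FIXTURES / fixture_name).read_text(encoding='utf-8')
    data = ShopeeParser().parse_product_page(html)
    # Fail fast if any core field comes back empty after a layout change
    assert data['title']
    assert data['price'] is not None
    assert data['stock_status'] is not None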
Pitfall 2: Under-Investing in Proxy Quality
Cheap datacenter proxies will get blocked frequently, creating data gaps. Invest in quality mobile proxies from providers like DataResearchTools that offer genuine carrier IP addresses.
Pitfall 3: Collecting Too Much, Analyzing Too Little
It is easy to collect mountains of data without building the analytical capabilities to extract insights. Start with a focused set of metrics and expand gradually.
Pitfall 4: Not Accounting for Geographic Variation
Products displayed in Singapore may have different prices, availability, and rankings than the same products in Indonesia. Always use geo-targeted proxies to collect market-specific data.
Conclusion
Building a digital shelf monitoring system is a significant technical undertaking, but the competitive advantages it provides justify the investment. By combining reliable proxy infrastructure from DataResearchTools with well-designed scraping, parsing, and analytics components, brands can gain continuous visibility into their digital shelf performance across Southeast Asian marketplaces.
Start with the highest-priority platforms and metrics, prove the value with initial results, and then expand your monitoring scope systematically. The brands that build these capabilities gain an information advantage that compounds over time as they accumulate historical data and refine their strategies based on what the data reveals.
Related Reading
- Building an Automated Price Parity Monitor with Proxies
- Monitoring Buy Box Ownership Across Amazon, Shopee, and Lazada
- How to Scrape AliExpress Product Data Without Getting Blocked
- Amazon Buy Box Monitoring: Proxy Setup for Continuous Tracking
- AdsPower Proxy Setup: Multi-Account Browser Configuration
- AdsPower Tutorial: Team Browser Management Guide 2026