Building Custom Datasets with Proxies: A Practical Guide

Public datasets get you started, but custom datasets give you a competitive edge. Whether you are training a domain-specific language model, building a recommendation system, or developing a classification model for a niche market, custom data collected from targeted web sources often outperforms generic alternatives.

This guide covers the full process of building custom datasets using proxy-assisted web scraping, from initial planning through final validation.

Why Build Custom Datasets

Pre-built datasets like Common Crawl, Wikipedia dumps, or Kaggle repositories serve general purposes well. But they fall short when you need:

  • Domain specificity: Medical forums, legal databases, niche e-commerce catalogs, or regional content in Southeast Asian languages
  • Freshness: Pre-built datasets can be months or years old, missing recent trends, products, or terminology
  • Specific formats: Your model might need structured data in a particular schema that no existing dataset provides
  • Competitive advantage: If everyone trains on the same public data, models converge toward similar performance

Custom dataset creation through targeted scraping solves all of these problems, provided you have the infrastructure to do it reliably.

Phase 1: Dataset Design

Before writing any code, define exactly what you need.

Define Your Data Requirements

Answer these questions:

  1. What content types do you need? Text, images, structured records, or a combination?
  2. How much data is required? Order of magnitude matters. 10K records is a weekend project; 10M records requires robust infrastructure.
  3. What sources contain this data? List specific websites, platforms, and content types.
  4. What schema should each record follow? Define fields, data types, and validation rules.
  5. What quality thresholds apply? Minimum content length, language requirements, freshness constraints.

Create a Data Specification Document

Capture these decisions in a short specification that the rest of the pipeline can reference. For example:

dataset_name: "sea_product_reviews"
description: "Product reviews from Southeast Asian e-commerce platforms"
target_size: 500000
schema:
  - field: review_text
    type: string
    min_length: 50
    required: true
  - field: rating
    type: integer
    range: [1, 5]
    required: true
  - field: product_category
    type: string
    enum: [electronics, fashion, home, beauty, food]
    required: true
  - field: language
    type: string
    enum: [en, th, vi, id, ms, tl]
    required: true
  - field: source_url
    type: url
    required: true
  - field: collected_at
    type: datetime
    required: true
sources:
  - domain: example-shop.co.th
    expected_volume: 200000
    difficulty: medium
  - domain: example-market.com.vn
    expected_volume: 150000
    difficulty: high
  - domain: example-store.co.id
    expected_volume: 150000
    difficulty: medium

Phase 2: Proxy Strategy

Your proxy strategy should match your data sources and volume requirements.

Matching Proxy Types to Sources

  • Low difficulty (static sites, public APIs): datacenter proxies. Fast, cheap, and sufficient for unprotected sources.
  • Medium difficulty (dynamic sites, basic anti-bot): residential proxies. They mimic real user traffic patterns.
  • High difficulty (aggressive anti-bot, mobile-first platforms): mobile proxies. Highest trust score, rarely blocked.

For Southeast Asian e-commerce and social media platforms, mobile proxies are typically the best choice. These platforms aggressively detect and block datacenter IPs, and many serve different content to mobile versus desktop users. DataResearchTools mobile proxies route through real carrier networks in countries like Thailand, Vietnam, Indonesia, and the Philippines, giving you authentic mobile IP addresses that match each platform’s primary user base.

Geographic Proxy Selection

Many websites serve localized content based on the visitor’s IP location. If your dataset needs content specific to certain countries:

  • Use proxies located in the target country
  • Verify that the proxy IP resolves to the correct location using a geolocation API
  • Test that the target site actually serves different content based on location before committing to location-specific proxies

For example, the geolocation check from the second bullet can be automated like this:

import requests

def verify_proxy_location(proxy_url, expected_country):
    """Verify that a proxy IP is in the expected country."""
    try:
        response = requests.get(
            "https://ipapi.co/json/",
            proxies={"http": proxy_url, "https": proxy_url},
            timeout=10
        )
        data = response.json()
        return data.get("country_code") == expected_country
    except Exception:
        return False

Calculating Proxy Requirements

Estimate how many proxy IPs and how much bandwidth you will need:

Total pages to scrape: 500,000
Average page size: 200 KB
Total bandwidth: ~100 GB
Requests per IP before rotation: 50
Unique IPs needed: 10,000
Time to complete (at 5 req/sec): ~28 hours

This kind of estimation helps you choose the right proxy plan and set realistic timelines.
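
This arithmetic is easy to script. In the sketch below, the page size, rotation interval, and request rate are the assumptions from the example above; replace them with your own measurements:

def estimate_proxy_requirements(total_pages, avg_page_kb=200,
                                requests_per_ip=50, requests_per_sec=5):
    """Rough sizing for a scraping job; every default here is an assumption."""
    bandwidth_gb = total_pages * avg_page_kb / 1_000_000   # KB -> GB (decimal)
    unique_ips = total_pages // requests_per_ip
    hours = total_pages / requests_per_sec / 3600
    return {
        "bandwidth_gb": round(bandwidth_gb, 1),
        "unique_ips": unique_ips,
        "estimated_hours": round(hours, 1),
    }

print(estimate_proxy_requirements(500_000))
# {'bandwidth_gb': 100.0, 'unique_ips': 10000, 'estimated_hours': 27.8}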

Phase 3: Building the Scraper

Architecture Overview

A production dataset scraper has several components:

URL Generator -> Request Queue -> Proxy Manager -> Fetcher -> Parser -> Validator -> Storage
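
A minimal single-threaded sketch of how these components might fit together. The get_proxy, parse, validate, and store arguments are placeholders for the per-source logic covered in the rest of this phase:

import queue
import requests

def run_pipeline(urls, get_proxy, parse, validate, store, max_retries=3):
    """Minimal pipeline: queue -> proxy -> fetch -> parse -> validate -> store."""
    work = queue.Queue()
    for url in urls:
        work.put((url, 0))

    while not work.empty():
        url, attempts = work.get()
        proxy = get_proxy(url)  # pick a proxy suited to this source
        try:
            response = requests.get(url, proxies=proxy, timeout=15)
            response.raise_for_status()
        except requests.RequestException:
            if attempts + 1 < max_retries:
                work.put((url, attempts + 1))  # retry later, ideally via another proxy
            continue

        record = parse(response.text, url)  # e.g. the extract_review function below
        if record is None:
            continue
        is_valid, _errors = validate(record)
        if is_valid:
            store(record)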

URL Discovery

Before scraping content, you need a list of URLs to visit. Common strategies:

import requests
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup

def parse_sitemap(sitemap_url, proxy_config):
    """Extract URLs from a sitemap."""
    response = requests.get(sitemap_url, proxies=proxy_config, timeout=15)
    root = ET.fromstring(response.content)
    namespace = {"ns": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    urls = [loc.text for loc in root.findall(".//ns:loc", namespace)]
    return urls

def crawl_pagination(base_url, proxy_config, max_pages=100):
    """Follow pagination links to discover content pages."""
    urls = []
    for page in range(1, max_pages + 1):
        url = f"{base_url}?page={page}"
        response = requests.get(url, proxies=proxy_config, timeout=15)
        if response.status_code != 200:
            break
        soup = BeautifulSoup(response.text, "html.parser")
        links = soup.select("a.product-link")
        if not links:
            break
        urls.extend([link["href"] for link in links])
    return urls

Content Extraction

Write extractors specific to each source. Use CSS selectors or XPath for structured extraction:

from dataclasses import dataclass
from datetime import datetime
from bs4 import BeautifulSoup

@dataclass
class ReviewRecord:
    review_text: str
    rating: int
    product_category: str
    language: str
    source_url: str
    collected_at: str

def extract_review(html, url, source_config):
    """Extract a review record from raw HTML."""
    soup = BeautifulSoup(html, "html.parser")

    review_el = soup.select_one(source_config["review_selector"])
    rating_el = soup.select_one(source_config["rating_selector"])
    category_el = soup.select_one(source_config["category_selector"])

    if not review_el or not rating_el:
        return None

    text = review_el.get_text(strip=True)
    if len(text) < 50:
        return None

    return ReviewRecord(
        review_text=text,
        rating=int(rating_el.get("data-rating", 0)),
        product_category=category_el.get_text(strip=True) if category_el else "unknown",
        language=detect_language(text),  # assumed helper wrapping a detector such as langdetect
        source_url=url,
        collected_at=datetime.utcnow().isoformat()
    )

Handling Different Page Structures

Real-world websites vary in structure. Build adapters for each source:

SOURCE_CONFIGS = {
    "example-shop.co.th": {
        "review_selector": "div.review-content p",
        "rating_selector": "span.star-rating",
        "category_selector": "nav.breadcrumb a:nth-child(2)",
        "proxy_country": "TH",
        "proxy_type": "mobile"
    },
    "example-market.com.vn": {
        "review_selector": "div.comment-text",
        "rating_selector": "div.rating-stars",
        "category_selector": "span.category-label",
        "proxy_country": "VN",
        "proxy_type": "mobile"
    }
}
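
With a registry like this, the extraction step can stay generic and look up the right adapter from the URL's hostname. A small sketch, assuming the extract_review function from the previous section:

from urllib.parse import urlparse

def extract_for_url(html, url):
    """Look up the per-source adapter by hostname and delegate extraction."""
    domain = (urlparse(url).hostname or "").removeprefix("www.")
    config = SOURCE_CONFIGS.get(domain)
    if config is None:
        return None  # no adapter registered for this source
    return extract_review(html, url, config)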

Phase 4: Data Cleaning and Validation

Raw scraped data is messy. Cleaning is where most of the effort goes.

Text Cleaning Pipeline

import re
import unicodedata

def clean_text(text):
    """Clean and normalize scraped text."""
    # Normalize unicode
    text = unicodedata.normalize("NFKC", text)
    # Remove excessive whitespace
    text = re.sub(r"\s+", " ", text).strip()
    # Remove HTML entities that survived parsing
    text = re.sub(r"&[a-zA-Z]+;", " ", text)
    # Remove zero-width characters
    text = re.sub(r"[\u200b\u200c\u200d\ufeff]", "", text)
    return text

Deduplication

Duplicate content inflates dataset size without adding value and can cause models to overfit:

from datasketch import MinHash, MinHashLSH

def build_dedup_index(records, threshold=0.8):
    """Build a MinHash LSH index for near-duplicate detection."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)

    for i, record in enumerate(records):
        minhash = MinHash(num_perm=128)
        for word in record["review_text"].lower().split():
            minhash.update(word.encode("utf-8"))
        # Query before inserting: if a similar document is already indexed,
        # flag this record as a near-duplicate instead of adding it.
        if lsh.query(minhash):
            record["is_duplicate"] = True
        else:
            lsh.insert(f"doc_{i}", minhash)

    return lsh

Validation Rules

Apply your schema’s validation rules programmatically:

def validate_record(record, schema):
    """Validate a record against the dataset schema."""
    errors = []

    if len(record.review_text) < schema["review_text"]["min_length"]:
        errors.append(f"review_text too short: {len(record.review_text)}")

    if record.rating not in range(1, 6):
        errors.append(f"invalid rating: {record.rating}")

    if record.language not in schema["language"]["enum"]:
        errors.append(f"unsupported language: {record.language}")

    return len(errors) == 0, errors
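
Gluing the cleaning, deduplication, and validation steps together might look roughly like the sketch below. It assumes records arrive as the ReviewRecord objects from Phase 3 and that schema is a dictionary keyed by field name, as in the Phase 1 specification:

from dataclasses import asdict

def filter_records(review_records, schema):
    """Clean, deduplicate, and validate extracted records; return only the keepers."""
    # Work on dict copies so build_dedup_index can flag near-duplicates in place
    rows = [asdict(r) for r in review_records]
    for row in rows:
        row["review_text"] = clean_text(row["review_text"])
    build_dedup_index(rows)

    kept = []
    for record, row in zip(review_records, rows):
        if row.get("is_duplicate"):
            continue
        record.review_text = row["review_text"]  # keep the cleaned text
        is_valid, _errors = validate_record(record, schema)
        if is_valid:
            kept.append(record)
    return kept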

Phase 5: Storage and Distribution

Choosing a Format

For AI training datasets, the most common formats are:

  • JSONL: One JSON object per line. Simple, human-readable, widely supported.
  • Parquet: Columnar format. Efficient for large datasets, supports compression.
  • CSV: Universal compatibility but poor for nested data or text with special characters.

Minimal helpers for the first two formats might look like this:

import json
import pyarrow as pa
import pyarrow.parquet as pq

def save_jsonl(records, filepath):
    with open(filepath, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def save_parquet(records, filepath):
    table = pa.Table.from_pylist(records)
    pq.write_table(table, filepath, compression="snappy")

Dataset Documentation

Every custom dataset should include:

  • Description of what the data contains and how it was collected
  • Schema definition with field descriptions
  • Collection date range
  • Source websites (if appropriate to disclose)
  • Known limitations or biases
  • Size statistics (record count, file size, language distribution)

Practical Tips from the Field

Start Small and Iterate

Scrape 1,000 records first. Check quality manually. Fix extraction issues before scaling up. Discovering a parsing bug after scraping 500K pages means re-doing all of that work.

Use Proxy Resources Wisely

Mobile proxies like those from DataResearchTools offer high success rates but cost more than datacenter options. Use them strategically for sources that block cheaper alternatives. Route easy targets through datacenter proxies to optimize your budget.
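
One way to implement that routing is to key a proxy pool off the proxy_type already stored in SOURCE_CONFIGS. The gateway URLs below are placeholders, not real endpoints:

# Hypothetical proxy pools; replace with your provider's actual gateway URLs
PROXY_POOLS = {
    "datacenter": "http://user:pass@dc-gateway.example:8000",
    "residential": "http://user:pass@res-gateway.example:8000",
    "mobile": "http://user:pass@mobile-gateway.example:8000",
}

def proxy_for_source(domain):
    """Route each source to the cheapest proxy type that still works for it."""
    config = SOURCE_CONFIGS.get(domain, {})
    proxy_type = config.get("proxy_type", "datacenter")  # default to the cheapest tier
    gateway = PROXY_POOLS[proxy_type]
    return {"http": gateway, "https": gateway}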

Build for Resumability

Long-running scraping jobs will fail at some point due to network issues, proxy problems, or target site changes. Design your pipeline to resume from where it stopped:

  • Track which URLs have been successfully scraped
  • Store raw HTML separately from parsed data so you can re-parse without re-scraping
  • Use checkpointing to save progress at regular intervals (a minimal sketch follows below)
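
A minimal sketch of URL-level checkpointing using SQLite; the file name and table layout are arbitrary choices, not part of any particular framework:

import sqlite3

def init_checkpoint(db_path="scrape_progress.db"):
    """Create (or reopen) a small SQLite table that tracks completed URLs."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS done (url TEXT PRIMARY KEY)")
    conn.commit()
    return conn

def already_scraped(conn, url):
    """Check whether a URL was already collected in a previous run."""
    return conn.execute("SELECT 1 FROM done WHERE url = ?", (url,)).fetchone() is not None

def mark_scraped(conn, url):
    """Record a successfully scraped URL so a restart can skip it."""
    conn.execute("INSERT OR IGNORE INTO done (url) VALUES (?)", (url,))
    conn.commit()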

Monitor Data Distribution

As you collect data, monitor the distribution across categories, languages, and sources. Imbalanced datasets lead to biased models. If one source contributes 80% of your data, actively seek additional sources to balance things out.
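
A quick way to do that is to recompute the distributions at each checkpoint. The sketch below assumes records are stored as dictionaries using the schema fields from Phase 1:

from collections import Counter
from urllib.parse import urlparse

def distribution_report(records):
    """Summarize how collected records are spread across key dimensions."""
    total = len(records)
    dimensions = {
        "language": Counter(r["language"] for r in records),
        "category": Counter(r["product_category"] for r in records),
        "source": Counter(urlparse(r["source_url"]).hostname for r in records),
    }
    for name, counter in dimensions.items():
        print(f"\n{name} distribution:")
        for value, count in counter.most_common():
            print(f"  {value}: {count} ({count / total:.1%})")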

Conclusion

Building custom datasets with proxies is a systematic process that rewards careful planning. Define your requirements precisely, choose proxies that match your source difficulty, build robust extraction and validation pipelines, and invest heavily in data quality. The result is a dataset tailored exactly to your needs, giving your AI models an advantage that no public dataset can match.

