Building Custom Datasets with Proxies: A Practical Guide
Public datasets get you started, but custom datasets give you a competitive edge. Whether you are training a domain-specific language model, building a recommendation system, or developing a classification model for a niche market, custom data collected from targeted web sources often outperforms generic alternatives.
This guide covers the full process of building custom datasets using proxy-assisted web scraping, from initial planning through final validation.
Why Build Custom Datasets
Pre-built datasets like Common Crawl, Wikipedia dumps, or Kaggle repositories serve general purposes well. But they fall short when you need:
- Domain specificity: Medical forums, legal databases, niche e-commerce catalogs, or regional content in Southeast Asian languages
- Freshness: Pre-built datasets can be months or years old, missing recent trends, products, or terminology
- Specific formats: Your model might need structured data in a particular schema that no existing dataset provides
- Competitive advantage: If everyone trains on the same public data, models converge toward similar performance
Custom dataset creation through targeted scraping solves all of these problems, provided you have the infrastructure to do it reliably.
Phase 1: Dataset Design
Before writing any code, define exactly what you need.
Define Your Data Requirements
Answer these questions:
- What content types do you need? Text, images, structured records, or a combination?
- How much data is required? Order of magnitude matters. 10K records is a weekend project; 10M records requires robust infrastructure.
- What sources contain this data? List specific websites, platforms, and content types.
- What schema should each record follow? Define fields, data types, and validation rules.
- What quality thresholds apply? Minimum content length, language requirements, freshness constraints.
Create a Data Specification Document
dataset_name: "sea_product_reviews"
description: "Product reviews from Southeast Asian e-commerce platforms"
target_size: 500000
schema:
  - field: review_text
    type: string
    min_length: 50
    required: true
  - field: rating
    type: integer
    range: [1, 5]
    required: true
  - field: product_category
    type: string
    enum: [electronics, fashion, home, beauty, food]
    required: true
  - field: language
    type: string
    enum: [en, th, vi, id, ms, tl]
    required: true
  - field: source_url
    type: url
    required: true
  - field: collected_at
    type: datetime
    required: true
sources:
  - domain: example-shop.co.th
    expected_volume: 200000
    difficulty: medium
  - domain: example-market.com.vn
    expected_volume: 150000
    difficulty: high
  - domain: example-store.co.id
    expected_volume: 150000
    difficulty: medium
Phase 2: Proxy Strategy
Your proxy strategy should match your data sources and volume requirements.
Matching Proxy Types to Sources
| Source Difficulty | Proxy Type | Reasoning |
|---|---|---|
| Low (static sites, public APIs) | Datacenter | Fast, cheap, sufficient for unprotected sources |
| Medium (dynamic sites, basic anti-bot) | Residential | Mimics real user traffic patterns |
| High (aggressive anti-bot, mobile-first platforms) | Mobile | Highest trust score, rarely blocked |
For Southeast Asian e-commerce and social media platforms, mobile proxies are typically the best choice. These platforms aggressively detect and block datacenter IPs, and many serve different content to mobile versus desktop users. DataResearchTools mobile proxies route through real carrier networks in countries like Thailand, Vietnam, Indonesia, and the Philippines, giving you authentic mobile IP addresses that match each platform’s primary user base.
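To make that concrete, here is a minimal sketch of routing requests through a country-specific mobile proxy with the requests library. The gateway hostname, port, and credential format are placeholders for whatever your provider actually issues, not a documented DataResearchTools endpoint:
import requests

# Hypothetical gateway details; substitute the host, port, and
# username format your proxy provider actually gives you.
PROXY_USER = "customer-12345-country-th"
PROXY_PASS = "secret"
PROXY_HOST = "gateway.example-proxy.com:8000"

proxies = {
    "http": f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}",
    "https": f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}",
}

# A Thai mobile exit IP should see the same storefront a local shopper sees
response = requests.get("https://example-shop.co.th/product/123", proxies=proxies, timeout=15)
print(response.status_code)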
Geographic Proxy Selection
Many websites serve localized content based on the visitor’s IP location. If your dataset needs content specific to certain countries:
- Use proxies located in the target country
- Verify that the proxy IP resolves to the correct location using a geolocation API
- Test that the target site actually serves different content based on location before committing to location-specific proxies
import requests

def verify_proxy_location(proxy_url, expected_country):
    """Verify that a proxy IP is in the expected country."""
    try:
        response = requests.get(
            "https://ipapi.co/json/",
            proxies={"http": proxy_url, "https": proxy_url},
            timeout=10
        )
        data = response.json()
        return data.get("country_code") == expected_country
    except Exception:
        return False
Calculating Proxy Requirements
Estimate how many proxy IPs and how much bandwidth you will need:
Total pages to scrape: 500,000
Average page size: 200 KB
Total bandwidth: ~100 GB
Requests per IP before rotation: 50
Unique IPs needed: 10,000
Time to complete (at 5 req/sec): ~28 hours
This kind of estimation helps you choose the right proxy plan and set realistic timelines.
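As a quick sanity check, a few lines of Python reproduce these figures from your own inputs; the constants below simply mirror the example spec and are placeholders, not measurements:
total_pages = 500_000     # target record count from the spec
avg_page_kb = 200         # rough average page size; measure on a sample first
requests_per_ip = 50      # requests allowed per IP before rotating
req_per_sec = 5           # sustained request rate across the whole pool

bandwidth_gb = total_pages * avg_page_kb / 1_000_000
unique_ips = total_pages / requests_per_ip
hours = total_pages / req_per_sec / 3600

print(f"Bandwidth: ~{bandwidth_gb:.0f} GB, IPs: {unique_ips:,.0f}, time: ~{hours:.0f} h")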
Phase 3: Building the Scraper
Architecture Overview
A production dataset scraper has several components:
URL Generator -> Request Queue -> Proxy Manager -> Fetcher -> Parser -> Validator -> Storage
URL Discovery
Before scraping content, you need a list of URLs to visit. Common strategies:
import requests
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup

def parse_sitemap(sitemap_url, proxy_config):
    """Extract URLs from a sitemap."""
    response = requests.get(sitemap_url, proxies=proxy_config, timeout=15)
    root = ET.fromstring(response.content)
    namespace = {"ns": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    urls = [loc.text for loc in root.findall(".//ns:loc", namespace)]
    return urls

def crawl_pagination(base_url, proxy_config, max_pages=100):
    """Follow pagination links to discover content pages."""
    urls = []
    for page in range(1, max_pages + 1):
        url = f"{base_url}?page={page}"
        response = requests.get(url, proxies=proxy_config, timeout=15)
        if response.status_code != 200:
            break
        soup = BeautifulSoup(response.text, "html.parser")
        links = soup.select("a.product-link")
        if not links:
            break
        urls.extend([link["href"] for link in links])
    return urls
Content Extraction
Write extractors specific to each source. Use CSS selectors or XPath for structured extraction:
from dataclasses import dataclass
from datetime import datetime
from bs4 import BeautifulSoup

@dataclass
class ReviewRecord:
    review_text: str
    rating: int
    product_category: str
    language: str
    source_url: str
    collected_at: str

def extract_review(html, url, source_config):
    """Extract a review record from raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    review_el = soup.select_one(source_config["review_selector"])
    rating_el = soup.select_one(source_config["rating_selector"])
    category_el = soup.select_one(source_config["category_selector"])
    if not review_el or not rating_el:
        return None
    text = review_el.get_text(strip=True)
    if len(text) < 50:
        return None
    return ReviewRecord(
        review_text=text,
        rating=int(rating_el.get("data-rating", 0)),
        product_category=category_el.get_text(strip=True) if category_el else "unknown",
        # detect_language is a helper you supply, e.g. wrapping a library such as langdetect
        language=detect_language(text),
        source_url=url,
        collected_at=datetime.utcnow().isoformat()
    )
Handling Different Page Structures
Real-world websites vary in structure. Build adapters for each source:
SOURCE_CONFIGS = {
    "example-shop.co.th": {
        "review_selector": "div.review-content p",
        "rating_selector": "span.star-rating",
        "category_selector": "nav.breadcrumb a:nth-child(2)",
        "proxy_country": "TH",
        "proxy_type": "mobile"
    },
    "example-market.com.vn": {
        "review_selector": "div.comment-text",
        "rating_selector": "div.rating-stars",
        "category_selector": "span.category-label",
        "proxy_country": "VN",
        "proxy_type": "mobile"
    }
}
Phase 4: Data Cleaning and Validation
Raw scraped data is messy. Cleaning is where most of the effort goes.
Text Cleaning Pipeline
import re
import unicodedata

def clean_text(text):
    """Clean and normalize scraped text."""
    # Normalize unicode
    text = unicodedata.normalize("NFKC", text)
    # Remove excessive whitespace
    text = re.sub(r"\s+", " ", text).strip()
    # Remove HTML entities that survived parsing
    text = re.sub(r"&[a-zA-Z]+;", " ", text)
    # Remove zero-width characters
    text = re.sub(r"[\u200b\u200c\u200d\ufeff]", "", text)
    return text
Deduplication
Duplicate content inflates dataset size without adding value and can cause models to overfit:
from datasketch import MinHash, MinHashLSH

def build_dedup_index(records, threshold=0.8):
    """Build a MinHash LSH index and flag near-duplicate records."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    for i, record in enumerate(records):
        minhash = MinHash(num_perm=128)
        for word in record["review_text"].lower().split():
            minhash.update(word.encode("utf-8"))
        # Query before inserting: any hit means a near-duplicate is already indexed
        if lsh.query(minhash):
            record["is_duplicate"] = True
        else:
            lsh.insert(f"doc_{i}", minhash)
    return lsh
Validation Rules
Apply your schema’s validation rules programmatically:
def validate_record(record, schema):
    """Validate a record against the dataset schema."""
    errors = []
    if len(record.review_text) < schema["review_text"]["min_length"]:
        errors.append(f"review_text too short: {len(record.review_text)}")
    if record.rating not in range(1, 6):
        errors.append(f"invalid rating: {record.rating}")
    if record.language not in schema["language"]["enum"]:
        errors.append(f"unsupported language: {record.language}")
    return len(errors) == 0, errors
Phase 5: Storage and Distribution
Choosing a Format
For AI training datasets, the most common formats are:
- JSONL: One JSON object per line. Simple, human-readable, widely supported.
- Parquet: Columnar format. Efficient for large datasets, supports compression.
- CSV: Universal compatibility but poor for nested data or text with special characters.
import json
import pyarrow as pa
import pyarrow.parquet as pq

def save_jsonl(records, filepath):
    with open(filepath, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def save_parquet(records, filepath):
    table = pa.Table.from_pylist(records)
    pq.write_table(table, filepath, compression="snappy")
Dataset Documentation
Every custom dataset should include:
- Description of what the data contains and how it was collected
- Schema definition with field descriptions
- Collection date range
- Source websites (if appropriate to disclose)
- Known limitations or biases
- Size statistics (record count, file size, language distribution)
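Keeping a short dataset card next to the data files covers most of this. The example below follows the same YAML convention as the spec document; all statistics are made up for illustration:
# dataset_card.yaml (illustrative values only)
name: sea_product_reviews
version: 1.0.0
description: Product reviews scraped from Southeast Asian e-commerce platforms
collection_period: 2025-01-10 to 2025-02-02
record_count: 500000
size_on_disk: 1.2 GB (parquet, snappy)
languages: {th: 0.38, vi: 0.27, id: 0.22, en: 0.13}
known_limitations:
  - Electronics category over-represented relative to site traffic
  - Reviews shorter than 50 characters were dropped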
Practical Tips from the Field
Start Small and Iterate
Scrape 1,000 records first. Check quality manually. Fix extraction issues before scaling up. Discovering a parsing bug after scraping 500K pages means re-doing all of that work.
Use Proxy Resources Wisely
Mobile proxies like those from DataResearchTools offer high success rates but cost more than datacenter options. Use them strategically for sources that block cheaper alternatives. Route easy targets through datacenter proxies to optimize your budget.
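One way to apply this is a small routing helper that picks a proxy pool per source based on the difficulty rating in your spec. The pool endpoints below are placeholders, not real gateways:
# Placeholder pool endpoints; substitute your real datacenter/residential/mobile gateways
PROXY_POOLS = {
    "low": "http://user:pass@datacenter-pool.example:8000",
    "medium": "http://user:pass@residential-pool.example:8000",
    "high": "http://user:pass@mobile-pool.example:8000",
}

def proxy_for_source(source):
    """Return a requests-style proxies dict based on source difficulty."""
    url = PROXY_POOLS.get(source.get("difficulty", "medium"), PROXY_POOLS["medium"])
    return {"http": url, "https": url}

# Example: an easy target gets routed through the cheap datacenter pool
print(proxy_for_source({"domain": "static-blog.example", "difficulty": "low"}))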
Build for Resumability
Long-running scraping jobs will fail at some point due to network issues, proxy problems, or target site changes. Design your pipeline to resume from where it stopped:
- Track which URLs have been successfully scraped (see the sketch after this list)
- Store raw HTML separately from parsed data so you can re-parse without re-scraping
- Use checkpointing to save progress at regular intervals
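A minimal sketch of the URL-tracking idea, assuming a single-process scraper; the checkpoint file name is arbitrary:
import json
import os

CHECKPOINT_FILE = "scraped_urls.jsonl"  # one completed URL per line

def load_completed():
    """Return the set of URLs already scraped in earlier runs."""
    if not os.path.exists(CHECKPOINT_FILE):
        return set()
    with open(CHECKPOINT_FILE, encoding="utf-8") as f:
        return {json.loads(line)["url"] for line in f}

def mark_completed(url):
    """Append a URL to the checkpoint file as soon as its record is stored."""
    with open(CHECKPOINT_FILE, "a", encoding="utf-8") as f:
        f.write(json.dumps({"url": url}) + "\n")

# In the main loop: skip anything already done, checkpoint after each success
completed = load_completed()
for url in ["https://example.com/p/1", "https://example.com/p/2"]:
    if url in completed:
        continue
    # ... fetch, parse, and store the record here ...
    mark_completed(url)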
Monitor Data Distribution
As you collect data, monitor the distribution across categories, languages, and sources. Imbalanced datasets lead to biased models. If one source contributes 80% of your data, actively seek additional sources to balance things out.
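A simple way to spot imbalance is to report per-field shares on each collection batch; this sketch uses toy records purely for illustration:
from collections import Counter

def report_distribution(records, field):
    """Print the share of records per value of a given field."""
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    for value, count in counts.most_common():
        print(f"{field}={value}: {count} ({count / total:.1%})")

# Example with toy records; in practice run this on each collection batch
sample = [{"language": "th"}, {"language": "th"}, {"language": "vi"}]
report_distribution(sample, "language")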
Conclusion
Building custom datasets with proxies is a systematic process that rewards careful planning. Define your requirements precisely, choose proxies that match your source difficulty, build robust extraction and validation pipelines, and invest heavily in data quality. The result is a dataset tailored exactly to your needs, giving your AI models an advantage that no public dataset can match.
Related Reading
- Mobile Proxies for AI Data Collection: Web Scraping for Training Data
- Collecting LLM Fine-Tuning Data with Web Scraping and Proxies
- Agentic Browsers Explained: Browserbase, Browser Use, and Proxy Infrastructure
- Agentic Browsers Explained: The Future of AI + Proxies in 2026
- How AI Agents Use Proxies for Real-Time Web Data Collection in 2026
- AI Web Scraper with Python: Build Your Own