Collecting LLM Fine-Tuning Data with Web Scraping and Proxies
Fine-tuning a large language model transforms a general-purpose model into a specialist. Whether you want an LLM that writes marketing copy in Thai, answers technical support questions about your product, or generates legal summaries in Bahasa Indonesia, fine-tuning requires domain-specific data in the right format.
Public fine-tuning datasets exist, but they rarely cover specialized domains or regional languages well. Scraping your own fine-tuning data gives you control over domain focus, language coverage, and data quality. This guide explains how to collect, format, and validate fine-tuning data using web scraping with proxy infrastructure.
Understanding Fine-Tuning Data Requirements
Data Formats for Fine-Tuning
Modern LLM fine-tuning uses several data formats depending on the approach:
Instruction tuning requires prompt-response pairs:
{
  "instruction": "Summarize the key features of this product",
  "input": "The XR-500 wireless router features WiFi 6E support, tri-band connectivity with speeds up to 10 Gbps, and a built-in VPN server...",
  "output": "The XR-500 is a tri-band WiFi 6E router offering up to 10 Gbps speeds with built-in VPN capability."
}
Conversational fine-tuning uses multi-turn dialogue:
{
  "messages": [
    {"role": "system", "content": "You are a helpful customer service agent."},
    {"role": "user", "content": "My order hasn't arrived yet. Order number: 12345"},
    {"role": "assistant", "content": "I'll look into order #12345 for you. Could you confirm the shipping address?"}
  ]
}
Continued pre-training uses raw text in the target domain:
{"text": "The regulatory framework for digital payments in Southeast Asia has evolved significantly since 2020. Thailand's Bank of Thailand issued new guidelines..."}
Each format demands a different scraping and processing approach.
How Much Data Do You Need
The amount depends on your fine-tuning method:
| Approach | Typical Dataset Size | Quality Requirement |
|---|---|---|
| Full fine-tuning | 10K-100K+ examples | Moderate (volume compensates) |
| LoRA / QLoRA | 1K-10K examples | High (every example matters) |
| Prompt tuning | 100-1K examples | Very high (precision is critical) |
| Continued pre-training | 1M+ tokens | Moderate (domain relevance key) |
For parameter-efficient methods like LoRA, a smaller but carefully curated dataset outperforms a large noisy one. This means your scraping pipeline needs strong quality filters.
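To check a scraped dataset against the sizes in the table above, a short script can count examples and approximate token volume. This is a sketch: `dataset_stats` is a hypothetical helper, and the whitespace split only approximates what your model's tokenizer will report (real counts typically differ by 20-40%).

```python
import json

def dataset_stats(path):
    """Count examples and approximate tokens in a JSONL dataset.

    Uses a rough whitespace tokenizer as a stand-in for the real
    model tokenizer, which will report somewhat higher counts.
    """
    examples = 0
    approx_tokens = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            examples += 1
            # Join all field values and count whitespace-separated words
            text = " ".join(str(v) for v in record.values())
            approx_tokens += len(text.split())
    return {"examples": examples, "approx_tokens": approx_tokens}
```

Run this after each collection batch: if you are targeting LoRA with 1K-10K examples, it tells you early whether a source is worth continuing to scrape.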
Identifying Fine-Tuning Data Sources
Source Types by Use Case
For instruction-following models:
- Q&A sites (Stack Overflow, Quora, regional equivalents)
- FAQ pages on corporate websites
- Customer support forums
- How-to guides and tutorials
For conversational models:
- Public customer service chat logs
- Forum threads with question-answer patterns
- Reddit threads with clear question-response structure
- Community support pages
For domain-specific continued pre-training:
- Industry publications and journals
- Government regulatory documents
- Specialized blogs and news sites
- Product documentation and technical manuals
Evaluating Source Quality
Before scraping a source, assess its value:
import requests
from bs4 import BeautifulSoup

def evaluate_source(url, proxy_config, sample_size=20):
    """Evaluate a potential data source by sampling pages."""
    # discover_sample_urls is a site-specific helper (e.g. a sitemap or
    # category-page crawl) that returns candidate page URLs
    sample_urls = discover_sample_urls(url, proxy_config, sample_size)
    results = {
        "total_sampled": len(sample_urls),
        "avg_content_length": 0,
        "has_structured_qa": False,
        # Placeholders for additional passes (e.g. language detection)
        "language_distribution": {},
        "content_quality_score": 0,
    }
    content_lengths = []
    for sample_url in sample_urls:
        try:
            response = requests.get(sample_url, proxies=proxy_config, timeout=15)
        except requests.RequestException:
            continue
        if response.status_code != 200:
            continue
        soup = BeautifulSoup(response.text, "html.parser")
        text = soup.get_text(strip=True)
        content_lengths.append(len(text.split()))
        # Check for Q&A structure
        if soup.select("div.question, div.answer, div.reply"):
            results["has_structured_qa"] = True
    results["avg_content_length"] = (
        sum(content_lengths) / len(content_lengths) if content_lengths else 0
    )
    return results
Scraping Strategies for Different Data Types
Scraping Q&A Pairs for Instruction Tuning
Q&A websites are the richest source for instruction-tuning data. The challenge is extracting clean question-answer pairs from complex page structures:
def extract_qa_pairs(html, url):
    """Extract question-answer pairs from a Q&A page."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    question = soup.select_one("h1.question-title, div.question-text")
    if not question:
        return pairs
    question_text = question.get_text(strip=True)
    # Get accepted or top-voted answers
    for answer in soup.select("div.answer"):
        # Check whether the answer has upvotes or is accepted
        score_el = answer.select_one("span.vote-count")
        try:
            score = int(score_el.get_text(strip=True)) if score_el else 0
        except ValueError:
            score = 0
        is_accepted = "accepted" in answer.get("class", [])
        if score >= 2 or is_accepted:
            answer_body = answer.select_one("div.answer-body")
            if answer_body:
                pairs.append({
                    "instruction": question_text,
                    "input": "",
                    "output": answer_body.get_text(strip=True),
                    "source_url": url,
                    "quality_signals": {
                        "score": score,
                        "is_accepted": is_accepted,
                    },
                })
    return pairs
Scraping Conversations for Chat Fine-Tuning
Forum threads naturally contain multi-turn conversations:
def extract_conversation(thread_html, url):
    """Extract a conversation from a forum thread."""
    soup = BeautifulSoup(thread_html, "html.parser")
    posts = soup.select("div.post, div.message")
    if len(posts) < 2:
        return None
    messages = []
    for i, post in enumerate(posts):
        author_el = post.select_one("span.author, a.username")
        content_el = post.select_one("div.post-content, div.message-body")
        if not content_el:
            continue
        text = content_el.get_text(strip=True)
        author = author_el.get_text(strip=True) if author_el else f"user_{i}"
        # The first post is typically the question
        role = "user" if i == 0 else "assistant"
        # If the thread starter posts again, treat their later posts as user turns
        if messages and author == messages[0].get("author"):
            role = "user"
        messages.append({
            "role": role,
            "content": text,
            "author": author,
        })
    # Only keep conversations with a clear user-assistant pattern
    if len(messages) >= 2 and messages[0]["role"] == "user":
        return {
            "messages": [
                {"role": m["role"], "content": m["content"]}
                for m in messages
            ],
            "source_url": url,
        }
    return None
Scraping Domain Text for Continued Pre-Training
For continued pre-training, you need large volumes of clean domain text:
def extract_domain_text(html, url, min_words=100):
    """Extract clean domain text from a page."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove boilerplate
    for tag in soup(["script", "style", "nav", "footer", "header",
                     "aside", "iframe", "form"]):
        tag.decompose()
    # Remove ads and social widgets
    for cls in ["advertisement", "social-share", "newsletter-signup",
                "related-articles", "sidebar"]:
        for el in soup.select(f".{cls}, #{cls}"):
            el.decompose()
    main = soup.select_one("article, main, div.content") or soup.body
    if main is None:
        return None
    text = main.get_text(separator="\n", strip=True)
    words = text.split()
    if len(words) < min_words:
        return None
    return {
        "text": " ".join(words),
        "url": url,
        "word_count": len(words),
        "title": soup.title.get_text(strip=True) if soup.title else "",
    }
Proxy Strategy for Fine-Tuning Data Collection
Why Mobile Proxies Matter for LLM Data
Fine-tuning data often comes from platforms that aggressively protect their content:
- Q&A platforms like Stack Overflow and regional equivalents rate-limit scrapers
- Forums use anti-bot measures that block datacenter IPs
- Social platforms require mobile-like request patterns to avoid detection
Mobile proxies work well here because:
- Mobile IPs carry high trust scores because carrier-grade NAT puts many real users behind the same IP, making blanket blocking costly for sites
- Many Q&A and forum platforms have mobile-optimized versions that serve cleaner HTML
- Regional platforms in Southeast Asia primarily serve mobile users, so mobile traffic patterns look natural
DataResearchTools mobile proxies provide IPs from real mobile carriers across SEA countries. When scraping Thai tech forums or Vietnamese Q&A sites for fine-tuning data, using a mobile proxy from the same country ensures you see the same content and language that local users see.
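As a sketch of this country matching, a small helper can build a requests-style proxy config per country. The gateway hostname and the `country-` username syntax below are hypothetical placeholders; check your provider's documentation for its actual gateway format.

```python
def proxy_for_country(country_code, username, password):
    """Build a requests-style proxy config targeting one country.

    NOTE: gateway.example.com and the '-country-<code>' username
    convention are illustrative placeholders, not a real provider API.
    """
    gateway = (
        f"http://{username}-country-{country_code}:{password}"
        f"@gateway.example.com:8000"
    )
    # requests expects the same proxy URL keyed by scheme
    return {"http": gateway, "https": gateway}

# Example: route requests for a Thai forum through a Thai mobile IP
# proxies = proxy_for_country("th", "myuser", "mypass")
# requests.get("https://example-forum.th/threads", proxies=proxies)
```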
Proxy Rotation for Sustained Collection
Fine-tuning data collection usually runs over several days. Configure your rotation accordingly:
import random
import requests

class FineTuningProxyManager:
    def __init__(self, proxy_gateway, requests_per_ip=30):
        self.proxy_gateway = proxy_gateway
        self.requests_per_ip = requests_per_ip
        self.request_count = 0
        self.current_session = None

    def get_session(self):
        """Get a proxy session, rotating after N requests."""
        self.request_count += 1
        if self.current_session is None or self.request_count >= self.requests_per_ip:
            self.request_count = 0
            # Trigger rotation by creating a new session
            return self._new_session()
        return self.current_session

    def _new_session(self):
        session = requests.Session()
        session.proxies = {
            "http": self.proxy_gateway,
            "https": self.proxy_gateway,
        }
        session.headers.update({
            "User-Agent": self._mobile_user_agent(),
            "Accept-Language": "en-US,th;q=0.9,vi;q=0.8",
        })
        self.current_session = session
        return session

    def _mobile_user_agent(self):
        agents = [
            "Mozilla/5.0 (Linux; Android 14; SM-S918B) AppleWebKit/537.36",
            "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15",
            "Mozilla/5.0 (Linux; Android 13; Pixel 7) AppleWebKit/537.36",
        ]
        return random.choice(agents)
Quality Filtering for Fine-Tuning Data
Quality matters more than quantity for fine-tuning. Apply multiple filters:
Content Quality Filters
import re

def quality_filter(record, record_type="instruction"):
    """Apply quality filters to a scraped record."""
    if record_type == "instruction":
        instruction = record.get("instruction", "")
        output = record.get("output", "")
        # Minimum length requirements
        if len(instruction.split()) < 5:
            return False, "instruction too short"
        if len(output.split()) < 10:
            return False, "output too short"
        # The output should be longer than the instruction for most tasks
        if len(output) < len(instruction) * 0.5:
            return False, "output suspiciously short relative to instruction"
        # Check for placeholder or template responses
        placeholder_patterns = [
            r"please contact",
            r"click here",
            r"sign up to view",
            r"login required",
        ]
        for pattern in placeholder_patterns:
            if re.search(pattern, output, re.IGNORECASE):
                return False, f"output contains placeholder: {pattern}"
        return True, "passed"
    elif record_type == "conversation":
        messages = record.get("messages", [])
        if len(messages) < 2:
            return False, "too few messages"
        # Check for meaningful content in each turn
        for msg in messages:
            if len(msg["content"].split()) < 3:
                return False, "message too short"
        return True, "passed"
    return True, "no filter applied"
Deduplication at the Example Level
For fine-tuning data, semantic deduplication is important to prevent the model from memorizing repeated patterns:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def deduplicate_examples(records, field="instruction", threshold=0.85):
    """Remove near-duplicate examples based on cosine similarity."""
    texts = [r[field] for r in records]
    vectorizer = TfidfVectorizer(max_features=10000)
    tfidf_matrix = vectorizer.fit_transform(texts)
    unique_indices = []
    seen = set()
    for i in range(len(records)):
        if i in seen:
            continue
        unique_indices.append(i)
        if i + 1 >= len(records):
            break
        # Compare example i against every later example in one pass
        similarities = cosine_similarity(
            tfidf_matrix[i:i+1], tfidf_matrix[i+1:]
        )[0]
        for j, sim in enumerate(similarities):
            if sim >= threshold:
                seen.add(i + 1 + j)
    return [records[i] for i in unique_indices]
Formatting and Exporting Fine-Tuning Data
Converting to Standard Formats
Most fine-tuning frameworks expect specific formats:
import json

def export_alpaca_format(records, output_path):
    """Export in Alpaca/Stanford format for instruction tuning."""
    formatted = []
    for record in records:
        formatted.append({
            "instruction": record["instruction"],
            "input": record.get("input", ""),
            "output": record["output"],
        })
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(formatted, f, ensure_ascii=False, indent=2)

def export_sharegpt_format(records, output_path):
    """Export in ShareGPT format for conversational fine-tuning."""
    formatted = []
    for record in records:
        formatted.append({
            "conversations": [
                {"from": "human" if m["role"] == "user" else "gpt",
                 "value": m["content"]}
                for m in record["messages"]
            ]
        })
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(formatted, f, ensure_ascii=False, indent=2)

def export_jsonl(records, output_path):
    """Export as JSONL for frameworks that prefer line-delimited JSON."""
    with open(output_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
Dataset Splitting
Always create train/validation/test splits:
from sklearn.model_selection import train_test_split

def split_dataset(records, val_ratio=0.1, test_ratio=0.05):
    """Split records into train/validation/test sets."""
    train_val, test = train_test_split(records, test_size=test_ratio, random_state=42)
    # Rescale the validation ratio so it applies to the full dataset size
    train, val = train_test_split(
        train_val, test_size=val_ratio / (1 - test_ratio), random_state=42
    )
    return {
        "train": train,
        "validation": val,
        "test": test,
        "stats": {
            "train_size": len(train),
            "val_size": len(val),
            "test_size": len(test),
        },
    }
Putting It All Together
A complete fine-tuning data collection pipeline:
- Source identification: Find 5-10 high-quality sources for your domain
- Proxy setup: Configure mobile proxies for protected sources, datacenter for easy ones
- URL discovery: Crawl sitemaps, pagination, and search results to find relevant pages
- Content extraction: Build source-specific extractors for Q&A pairs, conversations, or raw text
- Quality filtering: Apply length, language, content, and deduplication filters
- Format conversion: Export in the format your fine-tuning framework expects
- Dataset splitting: Create train/val/test splits
- Validation: Manually review 100+ random examples to verify quality
The effort pays off in model performance. A well-curated dataset of 5,000 instruction-response pairs from your specific domain will produce a better fine-tuned model than 50,000 noisy examples from generic sources.
Conclusion
Collecting fine-tuning data through scraping is a precision task. Unlike pre-training data collection where volume is king, fine-tuning demands carefully selected, high-quality examples that represent exactly the behavior you want your model to learn. The right proxy infrastructure ensures reliable access to your data sources, and thorough quality filtering ensures that every example in your dataset teaches the model something valuable.
Related Reading
- Mobile Proxies for AI Data Collection: Web Scraping for Training Data
- Building Custom Datasets with Proxies: A Practical Guide
- Agentic Browsers Explained: Browserbase, Browser Use, and Proxy Infrastructure
- Agentic Browsers Explained: The Future of AI + Proxies in 2026
- How AI Agents Use Proxies for Real-Time Web Data Collection in 2026
- AI Web Scraper with Python: Build Your Own