Collecting LLM Fine-Tuning Data with Web Scraping and Proxies

Fine-tuning a large language model transforms a general-purpose model into a specialist. Whether you want an LLM that writes marketing copy in Thai, answers technical support questions about your product, or generates legal summaries in Bahasa Indonesia, fine-tuning requires domain-specific data in the right format.

Public fine-tuning datasets exist, but they rarely cover specialized domains or regional languages well. Scraping your own fine-tuning data gives you control over domain focus, language coverage, and data quality. This guide explains how to collect, format, and validate fine-tuning data using web scraping with proxy infrastructure.

Understanding Fine-Tuning Data Requirements

Data Formats for Fine-Tuning

Modern LLM fine-tuning uses several data formats depending on the approach:

Instruction tuning requires prompt-response pairs:

{
  "instruction": "Summarize the key features of this product",
  "input": "The XR-500 wireless router features WiFi 6E support, tri-band connectivity with speeds up to 10 Gbps, and a built-in VPN server...",
  "output": "The XR-500 is a tri-band WiFi 6E router offering up to 10 Gbps speeds with built-in VPN capability."
}

Conversational fine-tuning uses multi-turn dialogue:

{
  "messages": [
    {"role": "system", "content": "You are a helpful customer service agent."},
    {"role": "user", "content": "My order hasn't arrived yet. Order number: 12345"},
    {"role": "assistant", "content": "I'll look into order #12345 for you. Could you confirm the shipping address?"}
  ]
}

Continued pre-training uses raw text in the target domain:

{"text": "The regulatory framework for digital payments in Southeast Asia has evolved significantly since 2020. Thailand's Bank of Thailand issued new guidelines..."}

Each format demands a different scraping and processing approach.

How Much Data Do You Need?

The amount depends on your fine-tuning method:

Approach | Typical Dataset Size | Quality Requirement
Full fine-tuning | 10K-100K+ examples | Moderate (volume compensates)
LoRA / QLoRA | 1K-10K examples | High (every example matters)
Prompt tuning | 100-1K examples | Very high (precision is critical)
Continued pre-training | 1M+ tokens | Moderate (domain relevance is key)

For parameter-efficient methods like LoRA, a smaller but carefully curated dataset outperforms a large noisy one. This means your scraping pipeline needs strong quality filters.

Identifying Fine-Tuning Data Sources

Source Types by Use Case

For instruction-following models:

  • Q&A sites (Stack Overflow, Quora, regional equivalents)
  • FAQ pages on corporate websites
  • Customer support forums
  • How-to guides and tutorials

For conversational models:

  • Public customer service chat logs
  • Forum threads with question-answer patterns
  • Reddit threads with clear question-response structure
  • Community support pages

For domain-specific continued pre-training:

  • Industry publications and journals
  • Government regulatory documents
  • Specialized blogs and news sites
  • Product documentation and technical manuals

Evaluating Source Quality

Before scraping a source, assess its value:

import requests
from bs4 import BeautifulSoup

def evaluate_source(url, proxy_config, sample_size=20):
    """Evaluate a potential data source by sampling pages."""
    # discover_sample_urls is assumed to return candidate page URLs
    # (e.g. from the site's sitemap or category listings).
    sample_urls = discover_sample_urls(url, proxy_config, sample_size)
    results = {
        "total_sampled": len(sample_urls),
        "avg_content_length": 0,
        "language_distribution": {},   # populate with a language detector if needed
        "has_structured_qa": False,
        "content_quality_score": 0     # populate with your own scoring heuristic
    }

    content_lengths = []
    for sample_url in sample_urls:
        try:
            response = requests.get(
                sample_url,
                proxies=proxy_config,
                timeout=15
            )
        except requests.RequestException:
            continue
        if response.status_code != 200:
            continue

        soup = BeautifulSoup(response.text, "html.parser")
        text = soup.get_text(strip=True)
        content_lengths.append(len(text.split()))

        # Check for Q&A structure
        if soup.select("div.question, div.answer, div.reply"):
            results["has_structured_qa"] = True

    results["avg_content_length"] = (
        sum(content_lengths) / len(content_lengths) if content_lengths else 0
    )
    return results
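
A quick usage sketch, assuming a proxy gateway URL (the hostname and credentials below are placeholders) and a discover_sample_urls helper that pulls candidate pages from a sitemap:

# Placeholder gateway; substitute your own proxy endpoint and credentials.
proxy_config = {
    "http": "http://user:pass@gateway.example-proxy.com:8000",
    "https": "http://user:pass@gateway.example-proxy.com:8000",
}

report = evaluate_source("https://forum.example.co.th", proxy_config, sample_size=20)
print(report)
# e.g. {'total_sampled': 20, 'avg_content_length': 412.5, 'has_structured_qa': True, ...}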

Scraping Strategies for Different Data Types

Scraping Q&A Pairs for Instruction Tuning

Q&A websites are the richest source for instruction-tuning data. The challenge is extracting clean question-answer pairs from complex page structures:

def extract_qa_pairs(html, url):
    """Extract question-answer pairs from a Q&A page."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []

    question = soup.select_one("h1.question-title, div.question-text")
    if not question:
        return pairs

    question_text = question.get_text(strip=True)

    # Get accepted or top-voted answers
    answers = soup.select("div.answer")
    for answer in answers:
        # Check if answer has upvotes or is accepted
        score_el = answer.select_one("span.vote-count")
        score_text = score_el.get_text(strip=True) if score_el else "0"
        score = int(score_text) if score_text.lstrip("+-").isdigit() else 0

        is_accepted = "accepted" in answer.get("class", [])

        if score >= 2 or is_accepted:
            answer_text = answer.select_one("div.answer-body")
            if answer_text:
                pairs.append({
                    "instruction": question_text,
                    "input": "",
                    "output": answer_text.get_text(strip=True),
                    "source_url": url,
                    "quality_signals": {
                        "score": score,
                        "is_accepted": is_accepted
                    }
                })

    return pairs
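
A short usage sketch that fetches one question page through the proxy and accumulates pairs; the URL is a placeholder and the selectors inside extract_qa_pairs would be adapted to each target site:

qa_dataset = []

response = requests.get(
    "https://qa.example.com/questions/12345",   # placeholder question URL
    proxies=proxy_config,
    timeout=15,
)
if response.status_code == 200:
    qa_dataset.extend(extract_qa_pairs(response.text, response.url))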

Scraping Conversations for Chat Fine-Tuning

Forum threads naturally contain multi-turn conversations:

def extract_conversation(thread_html, url):
    """Extract a conversation from a forum thread."""
    soup = BeautifulSoup(thread_html, "html.parser")
    posts = soup.select("div.post, div.message")

    if len(posts) < 2:
        return None

    messages = []
    for i, post in enumerate(posts):
        author_el = post.select_one("span.author, a.username")
        content_el = post.select_one("div.post-content, div.message-body")

        if not content_el:
            continue

        text = content_el.get_text(strip=True)
        author = author_el.get_text(strip=True) if author_el else f"user_{i}"

        # First post is typically the question
        role = "user" if i == 0 else "assistant"

        # If the thread starter posts again, treat their later posts as the user
        if i > 0 and messages and author == messages[0].get("author"):
            role = "user"

        messages.append({
            "role": role,
            "content": text,
            "author": author
        })

    # Only keep conversations with clear user-assistant alternation
    if len(messages) >= 2 and messages[0]["role"] == "user":
        return {
            "messages": [
                {"role": m["role"], "content": m["content"]}
                for m in messages
            ],
            "source_url": url
        }

    return None
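
Applied over a list of thread URLs (placeholders below), reusing the proxy_config from earlier and keeping only threads that yield a clean conversation:

conversations = []

thread_urls = [
    "https://forum.example.com/threads/router-setup-help",    # placeholder
    "https://forum.example.com/threads/vpn-not-connecting",   # placeholder
]

for thread_url in thread_urls:
    resp = requests.get(thread_url, proxies=proxy_config, timeout=15)
    if resp.status_code != 200:
        continue
    conversation = extract_conversation(resp.text, thread_url)
    if conversation:
        conversations.append(conversation)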

Scraping Domain Text for Continued Pre-Training

For continued pre-training, you need large volumes of clean domain text:

def extract_domain_text(html, url, min_words=100):
    """Extract clean domain text from a page."""
    soup = BeautifulSoup(html, "html.parser")

    # Remove boilerplate
    for tag in soup(["script", "style", "nav", "footer", "header",
                     "aside", "iframe", "form"]):
        tag.decompose()

    # Remove ads and social widgets
    for cls in ["advertisement", "social-share", "newsletter-signup",
                "related-articles", "sidebar"]:
        for el in soup.select(f".{cls}, #{cls}"):
            el.decompose()

    main = soup.select_one("article, main, div.content") or soup.body
    if not main:
        return None

    text = main.get_text(separator="\n", strip=True)
    words = text.split()

    if len(words) < min_words:
        return None

    return {
        "text": " ".join(words),
        "url": url,
        "word_count": len(words),
        "title": soup.title.string if soup.title else ""
    }

Proxy Strategy for Fine-Tuning Data Collection

Why Mobile Proxies Matter for LLM Data

Fine-tuning data often comes from platforms that aggressively protect their content:

  • Q&A platforms like Stack Overflow and regional equivalents rate-limit scrapers
  • Forums use anti-bot measures that block datacenter IPs
  • Social platforms require mobile-like request patterns to avoid detection

Mobile proxies work well here because:

  1. Mobile IPs have high trust scores because carrier-grade NAT puts many legitimate users behind the same IP, so platforms rarely block them outright
  2. Many Q&A and forum platforms have mobile-optimized versions that serve cleaner HTML
  3. Regional platforms in Southeast Asia primarily serve mobile users, so mobile traffic patterns look natural

DataResearchTools mobile proxies provide IPs from real mobile carriers across SEA countries. When scraping Thai tech forums or Vietnamese Q&A sites for fine-tuning data, using a mobile proxy from the same country ensures you see the same content and language that local users see.
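
As an illustration, a per-country gateway configuration might look like the sketch below. The hostnames and port are placeholders, not real DataResearchTools endpoints; check your provider's dashboard for the actual format.

# Placeholder gateways keyed by country; substitute your provider's real endpoints.
COUNTRY_GATEWAYS = {
    "th": "http://user:pass@th.gateway.example:8000",
    "vn": "http://user:pass@vn.gateway.example:8000",
    "id": "http://user:pass@id.gateway.example:8000",
}

def proxy_for_country(country_code):
    """Build a requests-style proxy dict for the target country."""
    gateway = COUNTRY_GATEWAYS[country_code]
    return {"http": gateway, "https": gateway}

# Scrape a Thai forum through a Thai mobile IP so we see what local users see.
thai_proxies = proxy_for_country("th")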

Proxy Rotation for Sustained Collection

Fine-tuning data collection usually runs over several days. Configure your rotation accordingly:

import random

class FineTuningProxyManager:
    def __init__(self, proxy_gateway, requests_per_ip=30):
        self.proxy_gateway = proxy_gateway
        self.requests_per_ip = requests_per_ip
        self.request_count = 0
        self.current_session = None

    def get_session(self):
        """Get a proxy session, rotating after N requests."""
        self.request_count += 1
        if self.current_session is None or self.request_count >= self.requests_per_ip:
            self.request_count = 0
            # Trigger rotation by creating a new session
            return self._new_session()
        return self.current_session

    def _new_session(self):
        session = requests.Session()
        session.proxies = {
            "http": self.proxy_gateway,
            "https": self.proxy_gateway
        }
        session.headers.update({
            "User-Agent": self._mobile_user_agent(),
            "Accept-Language": "en-US,th;q=0.9,vi;q=0.8"
        })
        self.current_session = session
        return session

    def _mobile_user_agent(self):
        agents = [
            "Mozilla/5.0 (Linux; Android 14; SM-S918B) AppleWebKit/537.36",
            "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15",
            "Mozilla/5.0 (Linux; Android 13; Pixel 7) AppleWebKit/537.36"
        ]
        return random.choice(agents)
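
A sketch of how the manager might drive a long-running crawl; urls_to_scrape is an assumed list of discovered page URLs, and the delay range is a placeholder you would tune to the source's tolerance:

import time

manager = FineTuningProxyManager("http://user:pass@gateway.example:8000")

raw_pages = []
for page_url in urls_to_scrape:            # assumed list of discovered URLs
    session = manager.get_session()
    try:
        resp = session.get(page_url, timeout=15)
    except requests.RequestException:
        continue
    if resp.status_code == 200:
        raw_pages.append((page_url, resp.text))
    time.sleep(random.uniform(2, 5))       # polite delay between requests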

Quality Filtering for Fine-Tuning Data

Quality matters more than quantity for fine-tuning. Apply multiple filters:

Content Quality Filters

import re

def quality_filter(record, record_type="instruction"):
    """Apply quality filters to a scraped record."""
    if record_type == "instruction":
        instruction = record.get("instruction", "")
        output = record.get("output", "")

        # Minimum length requirements
        if len(instruction.split()) < 5:
            return False, "instruction too short"
        if len(output.split()) < 10:
            return False, "output too short"

        # Output should be longer than instruction for most tasks
        if len(output) < len(instruction) * 0.5:
            return False, "output suspiciously short relative to instruction"

        # Check for placeholder or template responses
        placeholder_patterns = [
            r"please contact",
            r"click here",
            r"sign up to view",
            r"login required"
        ]
        for pattern in placeholder_patterns:
            if re.search(pattern, output, re.IGNORECASE):
                return False, f"output contains placeholder: {pattern}"

        return True, "passed"

    elif record_type == "conversation":
        messages = record.get("messages", [])
        if len(messages) < 2:
            return False, "too few messages"

        # Check for meaningful content in each turn
        for msg in messages:
            if len(msg["content"].split()) < 3:
                return False, "message too short"

        return True, "passed"

    return True, "no filter applied"

Deduplication at the Example Level

For fine-tuning data, semantic deduplication is important to prevent the model from memorizing repeated patterns:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def deduplicate_examples(records, field="instruction", threshold=0.85):
    """Remove near-duplicate examples based on cosine similarity."""
    texts = [r[field] for r in records]

    vectorizer = TfidfVectorizer(max_features=10000)
    tfidf_matrix = vectorizer.fit_transform(texts)

    unique_indices = []
    seen = set()

    for i in range(len(records)):
        if i in seen:
            continue

        unique_indices.append(i)
        similarities = cosine_similarity(
            tfidf_matrix[i:i+1], tfidf_matrix[i+1:]
        )[0]

        for j, sim in enumerate(similarities):
            if sim >= threshold:
                seen.add(i + 1 + j)

    return [records[i] for i in unique_indices]

Formatting and Exporting Fine-Tuning Data

Converting to Standard Formats

Most fine-tuning frameworks expect specific formats:

import json

def export_alpaca_format(records, output_path):
    """Export in Alpaca/Stanford format for instruction tuning."""
    formatted = []
    for record in records:
        formatted.append({
            "instruction": record["instruction"],
            "input": record.get("input", ""),
            "output": record["output"]
        })

    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(formatted, f, ensure_ascii=False, indent=2)


def export_sharegpt_format(records, output_path):
    """Export in ShareGPT format for conversational fine-tuning."""
    formatted = []
    for record in records:
        formatted.append({
            "conversations": [
                {"from": "human" if m["role"] == "user" else "gpt",
                 "value": m["content"]}
                for m in record["messages"]
            ]
        })

    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(formatted, f, ensure_ascii=False, indent=2)


def export_jsonl(records, output_path):
    """Export as JSONL for frameworks that prefer line-delimited JSON."""
    with open(output_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

Dataset Splitting

Always create train/validation/test splits:

from sklearn.model_selection import train_test_split

def split_dataset(records, val_ratio=0.1, test_ratio=0.05):
    """Split records into train/validation/test sets."""
    train_val, test = train_test_split(records, test_size=test_ratio, random_state=42)
    train, val = train_test_split(train_val, test_size=val_ratio / (1 - test_ratio), random_state=42)

    return {
        "train": train,
        "validation": val,
        "test": test,
        "stats": {
            "train_size": len(train),
            "val_size": len(val),
            "test_size": len(test)
        }
    }
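
Tying splitting and export together, using the filtered records from the quality-filter step (the file names are arbitrary):

splits = split_dataset(filtered)
export_jsonl(splits["train"], "train.jsonl")
export_jsonl(splits["validation"], "validation.jsonl")
export_jsonl(splits["test"], "test.jsonl")
print(splits["stats"])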

Putting It All Together

A complete fine-tuning data collection pipeline:

  1. Source identification: Find 5-10 high-quality sources for your domain
  2. Proxy setup: Configure mobile proxies for protected sources, datacenter for easy ones
  3. URL discovery: Crawl sitemaps, pagination, and search results to find relevant pages
  4. Content extraction: Build source-specific extractors for Q&A pairs, conversations, or raw text
  5. Quality filtering: Apply length, language, content, and deduplication filters
  6. Format conversion: Export in the format your fine-tuning framework expects
  7. Dataset splitting: Create train/val/test splits
  8. Validation: Manually review 100+ random examples to verify quality

The effort pays off in model performance. A well-curated dataset of 5,000 instruction-response pairs from your specific domain will produce a better fine-tuned model than 50,000 noisy examples from generic sources.
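
As a sketch, the steps above can be chained into one function. It reuses the helpers defined earlier and the placeholder COUNTRY_GATEWAYS mapping, so treat it as a template rather than a drop-in implementation:

import time

def collect_instruction_dataset(source_urls, country_code="th"):
    """End-to-end sketch: crawl, extract, filter, deduplicate, split, export."""
    manager = FineTuningProxyManager(COUNTRY_GATEWAYS[country_code])
    records = []

    # URL discovery is source-specific; source_urls is assumed to already
    # be a list of question pages for one source.
    for url in source_urls:
        session = manager.get_session()
        try:
            resp = session.get(url, timeout=15)
        except requests.RequestException:
            continue
        if resp.status_code != 200:
            continue
        records.extend(extract_qa_pairs(resp.text, url))
        time.sleep(random.uniform(2, 5))   # polite delay between pages

    # Quality filtering and near-duplicate removal
    records = [r for r in records if quality_filter(r, "instruction")[0]]
    if not records:
        return {"train_size": 0, "val_size": 0, "test_size": 0}
    records = deduplicate_examples(records, field="instruction")

    # Split and export as JSONL for the fine-tuning framework
    splits = split_dataset(records)
    export_jsonl(splits["train"], "train.jsonl")
    export_jsonl(splits["validation"], "validation.jsonl")
    export_jsonl(splits["test"], "test.jsonl")
    return splits["stats"]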

Conclusion

Collecting fine-tuning data through scraping is a precision task. Unlike pre-training data collection where volume is king, fine-tuning demands carefully selected, high-quality examples that represent exactly the behavior you want your model to learn. The right proxy infrastructure ensures reliable access to your data sources, and thorough quality filtering ensures that every example in your dataset teaches the model something valuable.

