how to collect AI training data at scale: scraping, licensing, APIs
AI training data collection is one of the fastest growing use cases for web scraping in 2026. whether you’re fine-tuning a domain LLM, building a vector database for RAG, or training a custom vision model, you need clean, diverse, legally defensible data at meaningful scale. this guide covers the four sourcing paths (public datasets, custom scraping, commercial APIs, licensing), the infrastructure decisions, and the legal lines you can’t cross.
the four sources of training data
every AI dataset combines some mix of these. each has trade-offs.
| source | cost | scale | legal risk | quality |
|---|---|---|---|---|
| public datasets (Common Crawl, HF Hub) | free | very high | low | mixed |
| licensed datasets (Reuters, AP, academic) | high | medium | low | high |
| commercial APIs (Twitter/X, NYT, Reddit) | medium-high | high | medium | high |
| custom scraping | medium | very high | medium-high | variable |
most teams build a base from public datasets, fill gaps with custom scraping, and license sensitive verticals (legal, medical, financial) where ToS issues bite hardest. for the proxy infrastructure side, see proxies for ML and AI training data collection.
start with public datasets
before you scrape a single URL, check whether someone has already collected what you need. public datasets cover billions of pages and millions of images, and many of them come pre-cleaned and deduplicated.
Common Crawl publishes monthly snapshots of ~3-4 billion web pages in WARC format. it’s the foundation of most foundation models (GPT-3 and LLaMA both trained on filtered versions of it). access is free via S3 (s3://commoncrawl/) or HTTPS, but bandwidth costs add up if you pull the full corpus.
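a minimal sketch of pulling a single page out of Common Crawl through its public index server rather than S3 (the crawl ID below is illustrative; check commoncrawl.org for current snapshot names):

```python
import gzip
import json

import requests

# query a Common Crawl index for captures of a URL (crawl ID is illustrative)
CRAWL = 'CC-MAIN-2024-10'
resp = requests.get(
    f'https://index.commoncrawl.org/{CRAWL}-index',
    params={'url': 'example.com', 'output': 'json'},
    timeout=30,
)
record = json.loads(resp.text.splitlines()[0])

# fetch only that capture's byte range from the multi-GB WARC file
offset, length = int(record['offset']), int(record['length'])
warc = requests.get(
    f'https://data.commoncrawl.org/{record["filename"]}',
    headers={'Range': f'bytes={offset}-{offset + length - 1}'},
    timeout=60,
)
# each capture is an independently gzipped WARC record: headers, then HTML
print(gzip.decompress(warc.content).decode('utf-8', errors='replace')[:500])
```

the same filename/offset pairs work against s3://commoncrawl/ if you’d rather pull from within AWS.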
HuggingFace Hub hosts thousands of curated datasets including FineWeb (15T tokens of filtered Common Crawl), C4 (Google’s cleaned web text), and domain-specific datasets like PubMed, ArXiv, and StackOverflow dumps.
The Pile, RedPajama, Dolma are research-grade open datasets that have already been deduplicated and quality-filtered. start here if you’re training a foundation LLM.
Wikipedia + Wikidata dumps are free, structured, and regenerated on a regular dump schedule. excellent for factual grounding.
if your needs match these, save yourself months of infrastructure work and just download.
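for example, the HuggingFace datasets library can stream FineWeb row by row without downloading the full corpus first (sample-10BT is one of its published sample configs):

```python
from datasets import load_dataset

# stream a FineWeb sample instead of downloading terabytes up front
ds = load_dataset(
    'HuggingFaceFW/fineweb',
    name='sample-10BT',
    split='train',
    streaming=True,
)
for i, row in enumerate(ds):
    print(row['url'], row['text'][:120])
    if i == 4:
        break
```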
when custom scraping makes sense
scraping is the right choice when:
- you need data that’s fresh (yesterday, not last year’s snapshot)
- you need a niche vertical (specific industry forums, regional news, product reviews)
- you need structured data the public sets don’t preserve (HTML tables, schema.org markup, image alt text)
- the source is a real-time stream (social media, news feeds)
scraping at AI scale means tens of millions to billions of pages. infrastructure decisions compound at that volume.
scraping infrastructure for AI scale
three components define the budget: proxy bandwidth, compute, and storage.
proxies. residential proxies cost $4-15 per GB. at 50KB average per page, 1 billion pages = 50TB = $200K-$750K in proxy bandwidth alone. at AI scale, use high-bandwidth datacenter proxies for the easy ~70% of pages and reserve residential for the ~30% that need them. our provider comparison ranks options by per-GB cost.
compute. Python with asyncio + httpx can sustain a few hundred pages per second per process. at a steady 300 pages/second, 1 billion pages takes roughly 40 days on a single box. distributed runners (Apache Beam, Ray, Apify Actors) go faster at higher cost.
storage. raw HTML is 50-200KB per page. cleaned text is 5-20KB. 1 billion pages of raw HTML = 50-200TB. S3 standard at $23/TB/month = $1,150-$4,600 monthly. cheaper with S3 Glacier or Backblaze B2 for cold storage.
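a quick sanity check of those numbers (every rate here is an assumption carried over from the paragraphs above):

```python
# back-of-envelope budget for a 1B-page crawl; all rates are assumptions from above
pages = 1_000_000_000
avg_page_kb = 50                            # average transfer size per page
residential_low, residential_high = 4, 15   # USD per GB

total_gb = pages * avg_page_kb / 1_000_000  # KB -> GB
print(f'bandwidth: {total_gb / 1000:.0f} TB')
print(f'residential proxy cost: ${total_gb * residential_low:,.0f}'
      f' - ${total_gb * residential_high:,.0f}')

pages_per_sec = 300                         # sustained single-box throughput
print(f'wall-clock at {pages_per_sec} pages/s: {pages / pages_per_sec / 86_400:.0f} days')
```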
start with smaller batches (10M pages) to validate the pipeline before scaling.
a minimal AI scraping pipeline
```python
import asyncio
import json
from pathlib import Path

import httpx

async def fetch_one(client, url, sem):
    # the semaphore caps in-flight requests at the crawl's concurrency limit
    async with sem:
        try:
            r = await client.get(url, timeout=20)
            return {
                'url': url,
                'status': r.status_code,
                'html': r.text if r.status_code == 200 else None,
                'content_type': r.headers.get('content-type', ''),
            }
        except Exception as e:
            return {'url': url, 'error': str(e)}

async def crawl(urls, output_path, concurrency=100):
    sem = asyncio.Semaphore(concurrency)
    async with httpx.AsyncClient(
        # httpx >= 0.26 takes a single `proxy=`; older releases use `proxies={...}`
        proxy='http://user:pass@proxy.example.com:8000',
        http2=True,  # requires `pip install httpx[http2]`
        follow_redirects=True,
    ) as client:
        tasks = [fetch_one(client, url, sem) for url in urls]
        # stream each record to disk as it completes, so a crash loses almost
        # nothing and no results are left unwritten at the end of the run
        with open(output_path, 'a') as f:
            for coro in asyncio.as_completed(tasks):
                f.write(json.dumps(await coro) + '\n')

# usage
urls = Path('seed_urls.txt').read_text().splitlines()
asyncio.run(crawl(urls, 'pages.jsonl'))
```
JSON Lines is the right format for streaming AI data. one record per line, easy to filter and dedupe with jq or pandas. for the broader Python toolkit, see our Python web scraping guide.
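a first-pass filter over that output with pandas (the column names match the pipeline above):

```python
import pandas as pd

# load the crawl output; error records simply have NaN in the status/html columns
df = pd.read_json('pages.jsonl', lines=True)
kept = df[(df['status'] == 200) & df['html'].notna()].drop_duplicates(subset='url')
print(f'kept {len(kept)} of {len(df)} pages')
```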
clean the data
raw HTML is useless to a language model. you need to extract text, strip boilerplate, and remove low-quality content.
```python
from trafilatura import extract

def clean(html):
    """Extract the main text from raw HTML, dropping boilerplate."""
    return extract(
        html,
        include_comments=False,  # drop comment sections
        include_tables=False,    # drop table contents
        no_fallback=False,       # allow slower fallback extractors for recall
    )
```
trafilatura is the industry standard for HTML to clean text. it removes navigation, ads, footers, and cookie banners while preserving article structure. used by HuggingFace’s FineWeb pipeline.
then filter by quality:
- minimum length (300+ words for LLM training)
- language detection (langdetect or fasttext)
- duplicate detection (MinHash + LSH for fuzzy dedup)
- profanity and PII filters
- model-based quality classifier (FineWeb-Edu uses one)
at very large scale, deduplication matters more than collecting additional pages: ~30-50% of any web crawl is duplicate or near-duplicate content.
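a minimal fuzzy-dedup sketch using the datasketch library; the 0.8 similarity threshold and word-level shingles are starting-point assumptions to tune:

```python
from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    """Hash a document's unique tokens into a MinHash signature."""
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode('utf8'))
    return m

docs = [
    'the quick brown fox jumps over the lazy dog',
    'over the lazy dog jumps the quick brown fox',  # same tokens, different order
    'an entirely different document about proxies',
]

lsh = MinHashLSH(threshold=0.8, num_perm=128)  # Jaccard similarity cutoff
unique = []
for i, doc in enumerate(docs):
    sig = minhash(doc)
    if lsh.query(sig):       # a near-duplicate is already indexed
        continue
    lsh.insert(f'doc-{i}', sig)
    unique.append(doc)

print(f'{len(unique)} unique docs of {len(docs)}')
```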
sourcing options for vertical data
different domains have different rules.
news: NewsAPI, GDELT, Common Crawl news subset, or licensed feeds from Reuters/AP. NYT and WSJ explicitly prohibit AI training. licensing is the safe path. our news APIs comparison covers options.
social media: X/Twitter API ($200K+/year for the full firehose), Reddit API ($0.24 per 1k requests), Mastodon (free, smaller volume). don’t scrape social platforms outside their APIs in 2026; both X and Reddit have aggressively litigated against AI companies (see the legal section below).
academic: ArXiv (free, ~2.4M papers), PubMed (free, ~36M abstracts), CORE (free, ~280M open-access papers). all offer official bulk-download endpoints.
code: GitHub via gharchive.org (BigQuery dumps and hourly JSON archives), Software Heritage, the StackOverflow data dump. respect the license of each repo; only permissive licenses (MIT, Apache, BSD) are safe for training redistributable models. see the sketch after this list.
legal: licensed databases (Westlaw, LexisNexis) or PACER court data. court records are public, but the bulk-access mechanisms are clunky. CourtListener is the best free open dataset.
ecommerce/product: Amazon Product Advertising API (affiliates only) or Shopify product sitemaps (store by store). for breadth, you’ll mostly need custom scraping.
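for the code vertical, GH Archive’s hourly dumps follow a predictable URL pattern, so a sketch like this pulls one hour of public GitHub events (the date is illustrative):

```python
import gzip
import json
import urllib.request

# one hour of public GitHub activity; URL pattern is YYYY-MM-DD-H (UTC)
url = 'https://data.gharchive.org/2024-01-01-15.json.gz'
with urllib.request.urlopen(url) as resp, gzip.open(resp, 'rt', encoding='utf-8') as f:
    events = [json.loads(line) for line in f]

pushes = [e for e in events if e['type'] == 'PushEvent']
print(f'{len(events)} events, {len(pushes)} pushes')
```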
the legal layer
three legal regimes apply to AI training data in 2026.
copyright. training on copyrighted works without permission is contested. the US has had partial fair-use rulings (Authors Guild v. Google, 2015 Google Books). EU has explicit exceptions for “text and data mining” (Article 4 of the DSM Directive) but rights holders can opt out via robots.txt or specific tags. always check the AI-specific signals: ai.txt, robots.txt directives for GPTBot, ClaudeBot, Google-Extended.
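the standard library’s robotparser is enough to check those per-bot directives before you crawl (example.com and the path are placeholders):

```python
from urllib import robotparser

# check which AI crawlers a site allows before fetching anything
rp = robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

for agent in ('GPTBot', 'ClaudeBot', 'Google-Extended', '*'):
    print(agent, rp.can_fetch(agent, 'https://example.com/articles/some-page'))
```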
terms of service. most websites prohibit AI training in their ToS. enforcement is inconsistent but lawsuits are increasing. NYT v. OpenAI (2023), Getty v. Stability AI (2023), and Reddit v. Anthropic (2025) are the high-profile cases. ignore ToS at your peril.
privacy law. GDPR (EU), CCPA (California), and similar laws restrict processing personal data without lawful basis. scraping public profiles is not automatically GDPR-compliant. PII filters are mandatory if you train on web data and serve EU users.
the cleanest legal posture: train only on (a) public datasets that have explicit AI-training licenses, (b) data you own, (c) data you’ve licensed, or (d) data covered by a clear fair-use argument. consult a lawyer before scraping any commercial site for AI training.
best practices for AI dataset hygiene
| practice | why it matters |
|---|---|
| dedupe before training | duplicates make models memorize, hurt generalization |
| filter PII (names, emails, phone numbers) | privacy law, model output safety |
| balance domains | avoid overweighting one source |
| document provenance | required for audits and compliance |
| respect robots.txt and AI directives | legal defensibility |
| store raw + cleaned versions | for re-cleaning when filters improve |
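as a concrete example of the PII row, a crude regex-based first pass (real pipelines layer a model-based detector on top; these patterns are illustrative, not exhaustive):

```python
import re

# first-pass PII scrub: mask emails and phone-like numbers before training
# (a regex pass catches the obvious cases; it is not a substitute for a
# dedicated PII detection model)
EMAIL = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')
PHONE = re.compile(r'\+?\d[\d\s().-]{7,}\d')

def scrub(text):
    text = EMAIL.sub('[EMAIL]', text)
    return PHONE.sub('[PHONE]', text)

print(scrub('reach jane.doe@example.com or +1 (555) 123-4567'))
# -> reach [EMAIL] or [PHONE]
```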
most production AI datasets go through 10-20 cleaning stages, each dropping 5-30% of the data. expect the final dataset to be 10-30% the size of the raw crawl.
faq
can I just scrape the entire internet for training data?
no. you can scrape sites that allow it (per robots.txt and ai.txt), use public datasets, or license data. wholesale scraping of sites that prohibit it (in ToS or robots.txt) creates significant legal exposure, especially after the 2024-2025 AI lawsuits.
how much data do I actually need?
depends on the model. a 7B-parameter LLM benefits from 1-2T tokens of training data (~10TB cleaned text). a domain fine-tune needs only 100M-1B tokens (1-10GB). a RAG pipeline can work with megabytes if the retrieval is good. start small and scale up only when quality plateaus.
what’s the difference between scraping for AI vs scraping for analytics?
AI scraping prioritizes diversity, deduplication, and quality filtering over completeness. analytics scraping prioritizes structure (extracting specific fields like price or rating) over volume. AI pipelines clean text aggressively; analytics pipelines preserve structure.
should I use a managed scraping API for AI data?
for under 100M pages, yes. Bright Data, Apify, and ScraperAPI handle the proxy and CAPTCHA layer for you. above 100M pages, the math flips and self-hosting with residential proxies becomes cheaper.
do I need residential proxies for AI scraping?
depends on targets. Common Crawl, HuggingFace, ArXiv, and most news APIs work fine with datacenter IPs. Amazon, LinkedIn, Instagram require residential or mobile. our proxies for ML training page covers the typical mix.
how do I handle copyright in training data?
three options: license everything (expensive, clean), restrict to clearly permissive sources (limits scale), or rely on fair-use arguments (legally murky, requires legal counsel). foundation model labs increasingly take option 1 or 2 after the 2024-2025 lawsuits.
conclusion
AI training data collection in 2026 is a multi-layered problem: pick your sources (public, scraped, licensed, API), build infrastructure that scales (proxies, async runners, JSONL storage), clean aggressively (trafilatura, MinHash, quality classifiers), and respect the legal boundaries (robots.txt, ToS, privacy law).
most teams underinvest in cleaning and overinvest in volume. a 100GB dataset of clean, deduplicated, well-balanced text outperforms a 10TB dataset of raw crawl every time. focus on quality from the start.
if you’re building a domain model or RAG system, start with public datasets, fill gaps with targeted scraping, and license sensitive data. that path stays out of legal trouble while giving you 90% of the data quality the big labs have.