RAG Pipeline Data Collection: Scraping Sources with Mobile Proxies
Retrieval-Augmented Generation (RAG) has become the standard approach for grounding large language models in specific, up-to-date knowledge. Instead of fine-tuning a model on your data, RAG retrieves relevant documents at query time and feeds them to the LLM as context. But here is the challenge most tutorials skip: where does that knowledge base come from, and how do you keep it current?
For most production RAG systems, the answer involves scraping. Documentation sites, forums, knowledge bases, news sources, and industry publications all contain the domain knowledge your RAG pipeline needs. This guide covers the practical mechanics of collecting that data using mobile proxies.
Why RAG Needs Web Scraping
RAG pipelines depend on a corpus of documents that gets searched at query time. The quality of your RAG system is directly limited by the quality and coverage of this corpus. Common sources include:
- Official documentation: Product docs, API references, technical specifications
- Community forums: Stack Overflow, Reddit, specialized industry forums
- News and publications: Industry news sites, research paper repositories, blogs
- Government and regulatory sites: Legal texts, compliance documents, standards
- E-commerce platforms: Product descriptions, specifications, user reviews
Most of these sources do not offer convenient API access or downloadable data dumps. Scraping is the only practical way to build a comprehensive corpus, and proxies are necessary to scrape reliably at the volumes RAG pipelines require.
Designing Your RAG Corpus
Identify Authoritative Sources
Not all web content is equally valuable for RAG. Prioritize:
- Primary sources: Official documentation, original research, regulatory texts
- Expert content: Posts by verified experts on forums, peer-reviewed publications
- Recent content: For time-sensitive domains, freshness matters more than volume
- Diverse perspectives: Multiple sources on the same topic improve retrieval quality
Map Sources to Difficulty Levels
Each source requires a different scraping approach:
RAG_SOURCES = {
"docs.example-api.com": {
"type": "documentation",
"update_frequency": "weekly",
"anti_bot": "none",
"proxy_needed": "datacenter",
"estimated_pages": 5000
},
"forum.example-tech.com": {
"type": "forum",
"update_frequency": "daily",
"anti_bot": "moderate",
"proxy_needed": "residential",
"estimated_pages": 50000
},
"news.example-industry.co.th": {
"type": "news",
"update_frequency": "hourly",
"anti_bot": "aggressive",
"proxy_needed": "mobile",
"estimated_pages": 100000
}
}
For sources with aggressive anti-bot protection, particularly Southeast Asian news sites and forums that serve region-specific content, mobile proxies provide the most reliable access. DataResearchTools mobile proxies route through real carrier networks in the region, which means your requests look identical to regular mobile users browsing the same sites.
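A small dispatcher can turn a registry entry like the ones above into per-request proxy settings. This is a sketch: the gateway hostnames and credentials below are placeholders, not real endpoints, and the `proxy_config_for` helper is not part of any provider's API.

```python
# Placeholder gateways -- substitute your provider's real endpoints and credentials.
PROXY_GATEWAYS = {
    "datacenter": "http://user:pass@dc.proxy.example:8000",
    "residential": "http://user:pass@res.proxy.example:8000",
    "mobile": "http://user:pass@mobile.proxy.example:8000",
}

def proxy_config_for(source_config):
    """Map a source registry entry to a requests-style proxies dict."""
    gateway = PROXY_GATEWAYS[source_config["proxy_needed"]]
    return {"http": gateway, "https": gateway}
```

The returned dict can be passed directly as the `proxies=` argument to `requests.get`, so each scrape job picks up the right proxy tier from the registry automatically.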
Building the Collection Pipeline
Architecture for RAG Data Collection
RAG data collection differs from one-time dataset building. It requires:
- Continuous operation: New content appears constantly and must be captured
- Incremental updates: Only scrape new or changed pages, not the entire site every time
- Document-level tracking: Know exactly which documents are in your corpus and when they were last updated
- Metadata preservation: Store URL, title, publication date, author, and section structure alongside content
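The document-level tracking and metadata requirements above amount to one registry record per document. A minimal sketch, with illustrative field names rather than a fixed schema:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class CorpusDocument:
    """Bookkeeping for one document in an incrementally updated corpus."""
    url: str
    title: str = ""
    content_hash: str = ""               # hash of extracted text, for change detection
    last_scraped: Optional[datetime] = None
    source_metadata: dict = field(default_factory=dict)  # author, section, publish date

    def needs_refresh(self, new_hash: str) -> bool:
        """Re-index only when the extracted content has actually changed."""
        return new_hash != self.content_hash
```

Storing a content hash per document is what makes incremental updates cheap: unchanged pages are skipped before any chunking or embedding work happens.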
Source Registry -> Scheduler -> URL Discovery -> Change Detection -> Fetcher -> Parser -> Chunker -> Vector Store
URL Discovery for Different Source Types
Each source type has its own discovery pattern:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

class DocumentationCrawler:
"""Crawl documentation sites by following sidebar navigation."""
def __init__(self, base_url, proxy_config):
self.base_url = base_url
self.proxy_config = proxy_config
self.discovered_urls = set()
def discover(self):
response = requests.get(
self.base_url,
proxies=self.proxy_config,
timeout=15
)
soup = BeautifulSoup(response.text, "html.parser")
# Most doc sites have a sidebar with all page links
nav = soup.select_one("nav.sidebar, aside.docs-nav, div.toc")
if nav:
for link in nav.find_all("a", href=True):
url = urljoin(self.base_url, link["href"])
if url.startswith(self.base_url):
self.discovered_urls.add(url)
return self.discovered_urls
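Where a documentation site also publishes a sitemap, parsing it is cheaper than walking navigation menus. A standard-library sketch; the sitemap's location and completeness vary per site, so verify both before relying on it:

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace per the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap(xml_text):
    """Extract page URLs from a sitemap.xml document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS)]
```

Fetching `/sitemap.xml` through the same proxy configuration as the crawler and feeding the result to this parser gives you the full URL list in one request instead of many.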
class ForumCrawler:
"""Crawl forums by iterating through thread listings."""
def __init__(self, base_url, proxy_config):
self.base_url = base_url
self.proxy_config = proxy_config
def discover_threads(self, section_url, max_pages=50):
threads = []
for page in range(1, max_pages + 1):
url = f"{section_url}?page={page}"
response = requests.get(
url,
proxies=self.proxy_config,
timeout=15
)
if response.status_code != 200:
break
soup = BeautifulSoup(response.text, "html.parser")
thread_links = soup.select("a.thread-title, h3.topic a")
if not thread_links:
break
for link in thread_links:
threads.append({
"url": urljoin(self.base_url, link["href"]),
"title": link.get_text(strip=True)
})
        return threads
Change Detection
Re-scraping unchanged pages wastes proxy bandwidth and processing time. Implement change detection:
import requests
from datetime import datetime, timedelta
class ChangeDetector:
def __init__(self, db_connection):
self.db = db_connection
def should_scrape(self, url, proxy_config):
"""Check if a URL needs re-scraping."""
record = self.db.get(url)
if record is None:
return True # Never scraped before
# Check if enough time has passed
last_scraped = record["last_scraped"]
min_interval = timedelta(hours=record.get("min_interval_hours", 24))
if datetime.utcnow() - last_scraped < min_interval:
return False
# Use conditional requests to check for changes
headers = {}
if record.get("etag"):
headers["If-None-Match"] = record["etag"]
if record.get("last_modified"):
headers["If-Modified-Since"] = record["last_modified"]
try:
response = requests.head(
url,
proxies=proxy_config,
headers=headers,
timeout=10
)
if response.status_code == 304:
self.db.update_check_time(url)
return False
except requests.exceptions.RequestException:
pass
        return True
Content Extraction for RAG
Preserving Document Structure
RAG systems perform better when documents retain their hierarchical structure. Extract content with headings intact:
from bs4 import BeautifulSoup

def extract_structured_content(html, url):
"""Extract content preserving heading hierarchy."""
soup = BeautifulSoup(html, "html.parser")
# Remove non-content elements
for tag in soup(["script", "style", "nav", "footer", "header", "aside"]):
tag.decompose()
# Find main content area
main = soup.select_one("main, article, div.content, div.post-body")
if not main:
main = soup.body
sections = []
current_section = {"heading": "", "level": 0, "content": []}
for element in main.children:
if element.name in ["h1", "h2", "h3", "h4"]:
if current_section["content"]:
sections.append(current_section)
level = int(element.name[1])
current_section = {
"heading": element.get_text(strip=True),
"level": level,
"content": []
}
elif element.name in ["p", "ul", "ol", "pre", "table", "blockquote"]:
text = element.get_text(strip=True)
if text:
current_section["content"].append(text)
if current_section["content"]:
sections.append(current_section)
return {
"url": url,
"title": soup.title.string if soup.title else "",
"sections": sections,
"full_text": "\n\n".join(
s["heading"] + "\n" + "\n".join(s["content"]) for s in sections
)
    }
Handling Multi-Page Content
Forum threads and paginated articles often span multiple pages:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def scrape_full_thread(first_page_url, proxy_config):
"""Scrape all pages of a forum thread."""
all_posts = []
current_url = first_page_url
while current_url:
response = requests.get(
current_url,
proxies=proxy_config,
timeout=15
)
soup = BeautifulSoup(response.text, "html.parser")
posts = soup.select("div.post-content")
for post in posts:
all_posts.append({
"text": post.get_text(strip=True),
"author": post.select_one(".author").get_text(strip=True)
if post.select_one(".author") else "unknown"
})
# Find next page link
next_link = soup.select_one("a.pagination-next, a[rel='next']")
current_url = urljoin(first_page_url, next_link["href"]) if next_link else None
    return all_posts
Chunking Strategies for RAG
How you chunk scraped content directly affects retrieval quality. Different strategies suit different content types.
Fixed-Size Chunking
Simple but effective for uniform content:
def chunk_fixed(text, chunk_size=512, overlap=50):
"""Split text into fixed-size chunks with overlap."""
words = text.split()
chunks = []
start = 0
while start < len(words):
end = start + chunk_size
chunk = " ".join(words[start:end])
chunks.append(chunk)
start = end - overlap
    return chunks
Semantic Chunking
Respects document structure for better retrieval:
def chunk_by_sections(structured_content, max_chunk_size=1000):
"""Create chunks based on document sections."""
chunks = []
for section in structured_content["sections"]:
section_text = section["heading"] + "\n" + "\n".join(section["content"])
words = section_text.split()
if len(words) <= max_chunk_size:
chunks.append({
"text": section_text,
"heading": section["heading"],
"source_url": structured_content["url"],
"title": structured_content["title"]
})
else:
# Split large sections into smaller chunks
sub_chunks = chunk_fixed(section_text, max_chunk_size, overlap=100)
for i, sub_chunk in enumerate(sub_chunks):
chunks.append({
"text": sub_chunk,
"heading": f"{section['heading']} (part {i+1})",
"source_url": structured_content["url"],
"title": structured_content["title"]
})
    return chunks
Metadata-Enriched Chunks
Add metadata that helps the retriever find the right chunks:
import re
from datetime import datetime

def enrich_chunk(chunk, source_metadata):
    """Add metadata to a chunk for better retrieval.

    Assumes a detect_language(text) helper, e.g. a thin wrapper around langdetect.
    """
return {
**chunk,
"domain": source_metadata["domain"],
"content_type": source_metadata["type"],
"language": detect_language(chunk["text"]),
"scraped_at": datetime.utcnow().isoformat(),
"word_count": len(chunk["text"].split()),
"has_code": bool(re.search(r"```|def |class |import ", chunk["text"]))
    }
Proxy Configuration for Continuous Collection
RAG data collection runs continuously, which places different demands on proxy infrastructure compared to one-time scraping.
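One of those demands is pacing: a continuously running collector should throttle itself per source domain rather than bursting whenever the scheduler fires. A minimal per-domain rate limiter sketch (the two-second default is an arbitrary starting point, not a recommendation for any specific site):

```python
import time
from collections import defaultdict

class DomainRateLimiter:
    """Enforce a minimum delay between consecutive requests to the same domain."""
    def __init__(self, min_delay_seconds=2.0):
        self.min_delay = min_delay_seconds
        self.last_request = defaultdict(float)  # domain -> last request timestamp

    def wait(self, domain):
        """Block just long enough to honour the per-domain delay, then record the call."""
        elapsed = time.monotonic() - self.last_request[domain]
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self.last_request[domain] = time.monotonic()
```

Calling `limiter.wait(domain)` before each fetch keeps per-site request rates bounded even when many sources share one worker, which reduces both proxy burn and the chance of triggering anti-bot thresholds.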
Session Management
Some sites require maintaining sessions across multiple requests (e.g., logged-in forum access):
import time
import requests

class SessionProxy:
"""Maintain a session through a sticky proxy."""
def __init__(self, proxy_url, session_duration=300):
self.proxy_url = proxy_url
self.session = requests.Session()
self.session.proxies = {
"http": proxy_url,
"https": proxy_url
}
self.created_at = time.time()
self.session_duration = session_duration
def is_expired(self):
return time.time() - self.created_at > self.session_duration
def get(self, url, **kwargs):
if self.is_expired():
self.rotate()
return self.session.get(url, **kwargs)
    def rotate(self):
        # A fresh Session drops cookies; the exit IP only changes once the
        # sticky window at the proxy gateway has expired as well.
self.session = requests.Session()
self.session.proxies = {
"http": self.proxy_url,
"https": self.proxy_url
}
        self.created_at = time.time()
Scheduling Scrape Jobs
Different sources need different update frequencies:
import schedule
import threading
import time
def setup_rag_scheduler(sources, proxy_manager):
"""Schedule scraping jobs based on source update frequency."""
for source_name, config in sources.items():
frequency = config["update_frequency"]
proxy_type = config["proxy_needed"]
if frequency == "hourly":
schedule.every(1).hour.do(
scrape_source, source_name, proxy_manager.get(proxy_type)
)
elif frequency == "daily":
schedule.every(1).day.at("02:00").do(
scrape_source, source_name, proxy_manager.get(proxy_type)
)
elif frequency == "weekly":
schedule.every().monday.at("03:00").do(
scrape_source, source_name, proxy_manager.get(proxy_type)
)
# Run scheduler in background
def run_scheduler():
while True:
schedule.run_pending()
time.sleep(60)
thread = threading.Thread(target=run_scheduler, daemon=True)
    thread.start()
Handling Regional Content for RAG
When building RAG systems for Southeast Asian markets, geographic proxy selection determines what content you can access. Many regional platforms serve different content, languages, and pricing based on the user’s IP location.
DataResearchTools mobile proxies are available in multiple SEA countries, which lets you build a RAG corpus that reflects the actual content users in each country see. This is critical for applications like:
- Customer support bots that need to reference local product listings
- Compliance systems that track local regulatory content
- Market intelligence tools that monitor regional competitors
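Country targeting is usually expressed in the proxy username or gateway address. The exact syntax varies by provider, so the `user-country-XX` convention and the gateway hostname below are purely illustrative; check your provider's documentation for the real format:

```python
def geo_proxy(country_code, base_user="user", password="pass",
              gateway="mobile.proxy.example:8000"):
    """Build a requests-style proxies dict targeting one country.

    The 'user-country-XX' username pattern is a common provider convention,
    not a documented API -- verify it against your provider's docs.
    """
    url = f"http://{base_user}-country-{country_code.lower()}:{password}@{gateway}"
    return {"http": url, "https": url}
```

With a helper like this, the same collection code can scrape a source once per target country and tag each document with the country whose view of the site it represents.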
Language Handling
SEA content often mixes multiple languages within a single page. Handle this in your extraction:
from bs4 import BeautifulSoup
from langdetect import detect_langs

def extract_multilingual_content(html, url):
"""Extract content and detect languages present."""
soup = BeautifulSoup(html, "html.parser")
main_text = soup.get_text(strip=True)
detected = detect_langs(main_text)
languages = [{"lang": d.lang, "confidence": d.prob} for d in detected]
return {
"url": url,
"text": main_text,
"primary_language": languages[0]["lang"] if languages else "unknown",
"languages_detected": languages
    }
Quality Assurance for RAG Corpora
Relevance Scoring
Not every scraped page belongs in your RAG corpus. Score relevance before indexing:
def score_relevance(document, domain_keywords):
"""Score a document's relevance to your domain."""
text = document["text"].lower()
word_count = len(text.split())
if word_count < 50:
return 0.0
keyword_hits = sum(
text.count(kw.lower()) for kw in domain_keywords
)
density = keyword_hits / word_count
length_bonus = min(word_count / 500, 1.0)
    return min(density * 100 * length_bonus, 1.0)
Freshness Tracking
Stale data degrades RAG quality. Track document freshness and prioritize updates:
from datetime import datetime

def calculate_freshness_score(last_scraped, content_date, max_age_days=30):
"""Calculate how fresh a document is."""
now = datetime.utcnow()
if content_date:
age = (now - content_date).days
else:
age = (now - last_scraped).days
if age <= 1:
return 1.0
elif age <= 7:
return 0.8
elif age <= max_age_days:
return 0.5
else:
        return 0.2
Conclusion
Building a RAG pipeline’s knowledge base through scraping is an ongoing operational challenge, not a one-time task. The combination of targeted source selection, appropriate proxy infrastructure (with mobile proxies for the hardest targets), structured content extraction, and smart chunking determines how well your RAG system performs. Invest in change detection and scheduling to keep your corpus fresh, and always validate that scraped content meets your quality standards before it enters the retrieval index.
Related Reading
- Mobile Proxies for AI Data Collection: Web Scraping for Training Data
- Building Custom Datasets with Proxies: A Practical Guide
- Agentic Browsers Explained: Browserbase, Browser Use, and Proxy Infrastructure
- Agentic Browsers Explained: The Future of AI + Proxies in 2026
- How AI Agents Use Proxies for Real-Time Web Data Collection in 2026
- AI Web Scraper with Python: Build Your Own