RAG Pipeline Data Collection: Scraping Sources with Mobile Proxies

Retrieval Augmented Generation (RAG) has become the standard approach for grounding large language models in specific, up-to-date knowledge. Instead of fine-tuning a model on your data, RAG retrieves relevant documents at query time and feeds them to the LLM as context. But here is the challenge most tutorials skip: where does that knowledge base come from, and how do you keep it current?
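
To make the retrieval step concrete, here is a minimal sketch of the query-time flow; `embed`, `vector_store`, and `llm` are placeholders for whatever embedding model, index, and model client your stack uses:

def answer_with_rag(question, embed, vector_store, llm, k=5):
    """Minimal RAG loop: retrieve top-k chunks, then generate with them as context."""
    query_vector = embed(question)
    hits = vector_store.search(query_vector, top_k=k)  # placeholder search API
    context = "\n\n".join(hit["text"] for hit in hits)
    prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"
    return llm.generate(prompt)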

For most production RAG systems, the answer involves scraping. Documentation sites, forums, knowledge bases, news sources, and industry publications all contain the domain knowledge your RAG pipeline needs. This guide covers the practical mechanics of collecting that data using mobile proxies.

Why RAG Needs Web Scraping

RAG pipelines depend on a corpus of documents that gets searched at query time. The quality of your RAG system is directly limited by the quality and coverage of this corpus. Common sources include:

  • Official documentation: Product docs, API references, technical specifications
  • Community forums: Stack Overflow, Reddit, specialized industry forums
  • News and publications: Industry news sites, research paper repositories, blogs
  • Government and regulatory sites: Legal texts, compliance documents, standards
  • E-commerce platforms: Product descriptions, specifications, user reviews

Most of these sources do not offer convenient API access or downloadable data dumps. Scraping is the only practical way to build a comprehensive corpus, and proxies are necessary to scrape reliably at the volumes RAG pipelines require.

Designing Your RAG Corpus

Identify Authoritative Sources

Not all web content is equally valuable for RAG. Prioritize:

  1. Primary sources: Official documentation, original research, regulatory texts
  2. Expert content: Posts by verified experts on forums, peer-reviewed publications
  3. Recent content: For time-sensitive domains, freshness matters more than volume
  4. Diverse perspectives: Multiple sources on the same topic improve retrieval quality

Map Sources to Difficulty Levels

Each source requires a different scraping approach:

RAG_SOURCES = {
    "docs.example-api.com": {
        "type": "documentation",
        "update_frequency": "weekly",
        "anti_bot": "none",
        "proxy_needed": "datacenter",
        "estimated_pages": 5000
    },
    "forum.example-tech.com": {
        "type": "forum",
        "update_frequency": "daily",
        "anti_bot": "moderate",
        "proxy_needed": "residential",
        "estimated_pages": 50000
    },
    "news.example-industry.co.th": {
        "type": "news",
        "update_frequency": "hourly",
        "anti_bot": "aggressive",
        "proxy_needed": "mobile",
        "estimated_pages": 100000
    }
}

For sources with aggressive anti-bot protection, particularly Southeast Asian news sites and forums that serve region-specific content, mobile proxies provide the most reliable access. DataResearchTools mobile proxies route through real carrier networks in the region, which means your requests are indistinguishable from those of regular mobile users browsing the same sites.

Building the Collection Pipeline

Architecture for RAG Data Collection

RAG data collection differs from one-time dataset building. It requires:

  • Continuous operation: New content appears constantly and must be captured
  • Incremental updates: Only scrape new or changed pages, not the entire site every time
  • Document-level tracking: Know exactly which documents are in your corpus and when they were last updated (a record sketch follows the diagram below)
  • Metadata preservation: Store URL, title, publication date, author, and section structure alongside content

These requirements shape a pipeline like this:

Source Registry -> Scheduler -> URL Discovery -> Change Detection -> Fetcher -> Parser -> Chunker -> Vector Store
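
A minimal sketch of the per-document tracking record, assuming a simple key-value store keyed by URL; the field names are illustrative rather than a fixed schema, but they match what the change detector below expects:

from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class DocumentRecord:
    """Tracks one corpus document across incremental update runs."""
    url: str
    title: str = ""
    etag: Optional[str] = None           # from the ETag response header, if served
    last_modified: Optional[str] = None  # from the Last-Modified header, if served
    last_scraped: Optional[datetime] = None
    min_interval_hours: int = 24         # minimum time between re-scrapes
    chunk_ids: list = field(default_factory=list)  # chunks derived from this document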

URL Discovery for Different Source Types

Each source type has its own discovery pattern:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

class DocumentationCrawler:
    """Crawl documentation sites by following sidebar navigation."""

    def __init__(self, base_url, proxy_config):
        self.base_url = base_url
        self.proxy_config = proxy_config
        self.discovered_urls = set()

    def discover(self):
        response = requests.get(
            self.base_url,
            proxies=self.proxy_config,
            timeout=15
        )
        soup = BeautifulSoup(response.text, "html.parser")

        # Most doc sites have a sidebar with all page links
        nav = soup.select_one("nav.sidebar, aside.docs-nav, div.toc")
        if nav:
            for link in nav.find_all("a", href=True):
                url = urljoin(self.base_url, link["href"])
                if url.startswith(self.base_url):
                    self.discovered_urls.add(url)

        return self.discovered_urls


class ForumCrawler:
    """Crawl forums by iterating through thread listings."""

    def __init__(self, base_url, proxy_config):
        self.base_url = base_url
        self.proxy_config = proxy_config

    def discover_threads(self, section_url, max_pages=50):
        threads = []
        for page in range(1, max_pages + 1):
            url = f"{section_url}?page={page}"
            response = requests.get(
                url,
                proxies=self.proxy_config,
                timeout=15
            )
            if response.status_code != 200:
                break

            soup = BeautifulSoup(response.text, "html.parser")
            thread_links = soup.select("a.thread-title, h3.topic a")
            if not thread_links:
                break

            for link in thread_links:
                threads.append({
                    "url": urljoin(self.base_url, link["href"]),
                    "title": link.get_text(strip=True)
                })

        return threads

Change Detection

Re-scraping unchanged pages wastes proxy bandwidth and processing time. Implement change detection:

import requests
from datetime import datetime, timedelta

class ChangeDetector:
    def __init__(self, db_connection):
        self.db = db_connection

    def should_scrape(self, url, proxy_config):
        """Check if a URL needs re-scraping."""
        record = self.db.get(url)

        if record is None:
            return True  # Never scraped before

        # Check if enough time has passed
        last_scraped = record["last_scraped"]
        min_interval = timedelta(hours=record.get("min_interval_hours", 24))
        if datetime.utcnow() - last_scraped < min_interval:
            return False

        # Use conditional requests to check for changes
        headers = {}
        if record.get("etag"):
            headers["If-None-Match"] = record["etag"]
        if record.get("last_modified"):
            headers["If-Modified-Since"] = record["last_modified"]

        try:
            response = requests.head(
                url,
                proxies=proxy_config,
                headers=headers,
                timeout=10
            )
            if response.status_code == 304:
                self.db.update_check_time(url)
                return False
        except requests.exceptions.RequestException:
            pass

        return True

Content Extraction for RAG

Preserving Document Structure

RAG systems perform better when documents retain their hierarchical structure. Extract content with headings intact:

from bs4 import BeautifulSoup

def extract_structured_content(html, url):
    """Extract content preserving heading hierarchy."""
    soup = BeautifulSoup(html, "html.parser")

    # Remove non-content elements
    for tag in soup(["script", "style", "nav", "footer", "header", "aside"]):
        tag.decompose()

    # Find main content area
    main = soup.select_one("main, article, div.content, div.post-body")
    if not main:
        main = soup.body

    sections = []
    current_section = {"heading": "", "level": 0, "content": []}

    for element in main.children:
        if element.name in ["h1", "h2", "h3", "h4"]:
            if current_section["content"]:
                sections.append(current_section)
            level = int(element.name[1])
            current_section = {
                "heading": element.get_text(strip=True),
                "level": level,
                "content": []
            }
        elif element.name in ["p", "ul", "ol", "pre", "table", "blockquote"]:
            text = element.get_text(strip=True)
            if text:
                current_section["content"].append(text)

    if current_section["content"]:
        sections.append(current_section)

    return {
        "url": url,
        "title": soup.title.string if soup.title else "",
        "sections": sections,
        "full_text": "\n\n".join(
            s["heading"] + "\n" + "\n".join(s["content"]) for s in sections
        )
    }

Handling Multi-Page Content

Forum threads and paginated articles often span multiple pages:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def scrape_full_thread(first_page_url, proxy_config):
    """Scrape all pages of a forum thread."""
    all_posts = []
    visited = set()  # guard against pagination loops
    current_url = first_page_url

    while current_url and current_url not in visited:
        visited.add(current_url)
        response = requests.get(
            current_url,
            proxies=proxy_config,
            timeout=15
        )
        soup = BeautifulSoup(response.text, "html.parser")

        posts = soup.select("div.post-content")
        for post in posts:
            all_posts.append({
                "text": post.get_text(strip=True),
                "author": post.select_one(".author").get_text(strip=True)
                    if post.select_one(".author") else "unknown"
            })

        # Find next page link
        next_link = soup.select_one("a.pagination-next, a[rel='next']")
        current_url = urljoin(first_page_url, next_link["href"]) if next_link else None

    return all_posts

Chunking Strategies for RAG

How you chunk scraped content directly affects retrieval quality. Different strategies suit different content types.

Fixed-Size Chunking

Simple but effective for uniform content:

def chunk_fixed(text, chunk_size=512, overlap=50):
    """Split text into fixed-size word chunks with overlap."""
    assert overlap < chunk_size, "overlap must be smaller than chunk_size"
    words = text.split()
    chunks = []
    start = 0

    while start < len(words):
        end = start + chunk_size
        chunk = " ".join(words[start:end])
        chunks.append(chunk)
        start = end - overlap

    return chunks

Semantic Chunking

Respects document structure for better retrieval:

def chunk_by_sections(structured_content, max_chunk_size=1000):
    """Create chunks based on document sections."""
    chunks = []

    for section in structured_content["sections"]:
        section_text = section["heading"] + "\n" + "\n".join(section["content"])
        words = section_text.split()

        if len(words) <= max_chunk_size:
            chunks.append({
                "text": section_text,
                "heading": section["heading"],
                "source_url": structured_content["url"],
                "title": structured_content["title"]
            })
        else:
            # Split large sections into smaller chunks
            sub_chunks = chunk_fixed(section_text, max_chunk_size, overlap=100)
            for i, sub_chunk in enumerate(sub_chunks):
                chunks.append({
                    "text": sub_chunk,
                    "heading": f"{section['heading']} (part {i+1})",
                    "source_url": structured_content["url"],
                    "title": structured_content["title"]
                })

    return chunks

Metadata-Enriched Chunks

Add metadata that helps the retriever find the right chunks:

import re
from datetime import datetime
from langdetect import detect

def enrich_chunk(chunk, source_metadata):
    """Add metadata to a chunk for better retrieval."""
    return {
        **chunk,
        "domain": source_metadata["domain"],
        "content_type": source_metadata["type"],
        "language": detect(chunk["text"]),  # ISO 639-1 code from langdetect
        "scraped_at": datetime.utcnow().isoformat(),
        "word_count": len(chunk["text"].split()),
        "has_code": bool(re.search(r"```|def |class |import ", chunk["text"]))
    }
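
Putting extraction, chunking, and enrichment together, a minimal indexing path could look like the following; `html` and `url` come from the fetcher, and `vector_store.add` is a placeholder for your embedding and upsert step:

# Hypothetical glue code: extract -> chunk -> enrich -> index
doc = extract_structured_content(html, url)
for chunk in chunk_by_sections(doc):
    record = enrich_chunk(chunk, {"domain": "docs.example-api.com", "type": "documentation"})
    vector_store.add(record)  # placeholder for embedding + upsert into your index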

Proxy Configuration for Continuous Collection

RAG data collection runs continuously, which places different demands on proxy infrastructure compared to one-time scraping.

Session Management

Some sites require maintaining sessions across multiple requests (e.g., logged-in forum access):

import time
import requests

class SessionProxy:
    """Maintain a session through a sticky proxy."""

    def __init__(self, proxy_url, session_duration=300):
        self.proxy_url = proxy_url
        self.session = requests.Session()
        self.session.proxies = {
            "http": proxy_url,
            "https": proxy_url
        }
        self.created_at = time.time()
        self.session_duration = session_duration

    def is_expired(self):
        return time.time() - self.created_at > self.session_duration

    def get(self, url, **kwargs):
        if self.is_expired():
            self.rotate()
        return self.session.get(url, **kwargs)

    def rotate(self):
        self.session = requests.Session()
        self.session.proxies = {
            "http": self.proxy_url,
            "https": self.proxy_url
        }
        self.created_at = time.time()

Scheduling Scrape Jobs

Different sources need different update frequencies:

import threading
import time

import schedule

def setup_rag_scheduler(sources, proxy_manager):
    """Schedule scraping jobs based on source update frequency."""

    for source_name, config in sources.items():
        frequency = config["update_frequency"]
        proxy_type = config["proxy_needed"]

        if frequency == "hourly":
            schedule.every(1).hour.do(
                scrape_source, source_name, proxy_manager.get(proxy_type)
            )
        elif frequency == "daily":
            schedule.every(1).day.at("02:00").do(
                scrape_source, source_name, proxy_manager.get(proxy_type)
            )
        elif frequency == "weekly":
            schedule.every().monday.at("03:00").do(
                scrape_source, source_name, proxy_manager.get(proxy_type)
            )

    # Run scheduler in background
    def run_scheduler():
        while True:
            schedule.run_pending()
            time.sleep(60)

    thread = threading.Thread(target=run_scheduler, daemon=True)
    thread.start()

Handling Regional Content for RAG

When building RAG systems for Southeast Asian markets, geographic proxy selection determines what content you can access. Many regional platforms serve different content, languages, and pricing based on the user’s IP location.
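
One practical pattern is to key proxy endpoints by country and select the endpoint that matches the market you are collecting for. The gateway URL format below is purely illustrative; substitute your provider's actual connection string:

# Country-keyed proxy endpoints (URL format is hypothetical)
COUNTRY_PROXIES = {
    "th": "http://user:pass@th.gateway.example:8000",
    "vn": "http://user:pass@vn.gateway.example:8000",
    "id": "http://user:pass@id.gateway.example:8000",
}

def proxy_for_market(country_code):
    """Return a requests-style proxies dict for the target market."""
    proxy_url = COUNTRY_PROXIES[country_code]
    return {"http": proxy_url, "https": proxy_url}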

DataResearchTools mobile proxies are available in multiple SEA countries, which lets you build a RAG corpus that reflects the actual content users in each country see. This is critical for applications like:

  • Customer support bots that need to reference local product listings
  • Compliance systems that track local regulatory content
  • Market intelligence tools that monitor regional competitors

Language Handling

SEA content often mixes multiple languages within a single page. Handle this in your extraction:

from bs4 import BeautifulSoup
from langdetect import detect_langs

def extract_multilingual_content(html, url):
    """Extract content and detect languages present."""
    soup = BeautifulSoup(html, "html.parser")
    main_text = soup.get_text(" ", strip=True)  # space separator keeps adjacent words apart

    detected = detect_langs(main_text)
    languages = [{"lang": d.lang, "confidence": d.prob} for d in detected]

    return {
        "url": url,
        "text": main_text,
        "primary_language": languages[0]["lang"] if languages else "unknown",
        "languages_detected": languages
    }

Quality Assurance for RAG Corpora

Relevance Scoring

Not every scraped page belongs in your RAG corpus. Score relevance before indexing:

def score_relevance(document, domain_keywords):
    """Score a document's relevance to your domain."""
    text = document["text"].lower()
    word_count = len(text.split())

    if word_count < 50:
        return 0.0

    keyword_hits = sum(
        text.count(kw.lower()) for kw in domain_keywords
    )

    density = keyword_hits / word_count
    length_bonus = min(word_count / 500, 1.0)

    return min(density * 100 * length_bonus, 1.0)
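
In practice this score acts as a gate before indexing; the 0.3 threshold and keyword list below are arbitrary starting points to tune against your own corpus:

DOMAIN_KEYWORDS = ["api", "authentication", "rate limit"]  # example domain vocabulary

if score_relevance(document, DOMAIN_KEYWORDS) >= 0.3:  # threshold is a tunable assumption
    index_document(document)  # placeholder for the chunk-and-index step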

Freshness Tracking

Stale data degrades RAG quality. Track document freshness and prioritize updates:

def calculate_freshness_score(last_scraped, content_date, max_age_days=30):
    """Calculate how fresh a document is."""
    now = datetime.utcnow()

    if content_date:
        age = (now - content_date).days
    else:
        age = (now - last_scraped).days

    if age <= 1:
        return 1.0
    elif age <= 7:
        return 0.8
    elif age <= max_age_days:
        return 0.5
    else:
        return 0.2

Conclusion

Building a RAG pipeline’s knowledge base through scraping is an ongoing operational challenge, not a one-time task. The combination of targeted source selection, appropriate proxy infrastructure (with mobile proxies for the hardest targets), structured content extraction, and smart chunking determines how well your RAG system performs. Invest in change detection and scheduling to keep your corpus fresh, and always validate that scraped content meets your quality standards before it enters the retrieval index.

