RAG Pipeline with Web Scraping: Live Data for AI

Large language models have a fundamental limitation: their knowledge is frozen at training time. Ask GPT-4o about yesterday’s stock prices, a product launched last week, or the latest version of a library’s documentation, and you will get outdated or hallucinated answers.

Retrieval-Augmented Generation (RAG) solves this by fetching relevant information from external sources at query time and feeding it to the LLM as context. When you combine RAG with web scraping, you get a system that can answer questions with live, current data from any website — turning your AI from a static knowledge base into a dynamic research assistant.

This guide walks you through building a complete RAG pipeline powered by web scraping, from data collection to vector storage to intelligent retrieval.

What Is RAG?

Retrieval-Augmented Generation is a technique that enhances LLM responses by:

  1. Retrieving relevant documents from a knowledge base
  2. Augmenting the LLM’s prompt with those documents
  3. Generating a response that’s grounded in the retrieved information
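
In code, the whole loop fits in a few lines. Here is a minimal sketch of the pattern (the embed, vector_search, and generate callables are placeholders for the components built in the rest of this guide):

def rag_answer(question: str, embed, vector_search, generate) -> str:
    """The three RAG steps, with the components passed in as callables."""
    # 1. Retrieve: find stored chunks most similar to the question
    query_vector = embed(question)
    chunks = vector_search(query_vector, top_k=5)

    # 2. Augment: prepend the retrieved text to the prompt
    context = "\n\n".join(chunk["text"] for chunk in chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"

    # 3. Generate: the LLM answers grounded in that context
    return generate(prompt)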

Why RAG + Web Scraping?

| Approach | Pros | Cons |
| --- | --- | --- |
| LLM only | Simple | Outdated, hallucinations |
| RAG + static docs | Grounded responses | Content goes stale |
| RAG + web scraping | Live data, always current | More complex setup |

The combination of RAG with web scraping gives you the best of both worlds: the reasoning power of LLMs with access to live, current information from any website.

Architecture Overview

┌──────────────────────────────────────────────────────┐
│                  INGESTION PIPELINE                  │
│                                                      │
│  [Web Sources] → [Scraper] → [Chunker] → [Embedder]  │
│                                   ↓                  │
│                          [Vector Database]           │
└──────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────┐
│                    QUERY PIPELINE                    │
│                                                      │
│   [User Query] → [Embed Query] → [Vector Search]     │
│                                        ↓             │
│                            [Retrieve Top-K Docs]     │
│                                        ↓             │
│   [Augmented Prompt] → [LLM] → [Answer]              │
└──────────────────────────────────────────────────────┘

Components

| Component | Purpose | Options |
| --- | --- | --- |
| Web Scraper | Collect content from websites | Crawl4ai, Firecrawl, custom scripts |
| Chunker | Split content into manageable pieces | Token-based, semantic, recursive |
| Embedding Model | Convert text to vectors | OpenAI, Cohere, sentence-transformers |
| Vector Database | Store and search embeddings | Chroma, Pinecone, Weaviate, Qdrant |
| LLM | Generate responses | GPT-4o, Claude, Llama 3.1 |

Step 1: Web Scraping for Data Collection

Using Crawl4ai (Free, Open Source)

import asyncio
from datetime import datetime

from crawl4ai import AsyncWebCrawler

async def scrape_documentation(urls: list[str]) -> list[dict]:
    """Scrape multiple documentation pages."""
    documents = []
    
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls=urls, max_concurrent=5)
        
        for result in results:
            if result.success:
                documents.append({
                    "url": result.url,
                    "title": result.metadata.get("title", ""),
                    "content": result.markdown,
                    "scraped_at": datetime.now().isoformat()
                })
    
    return documents

# Crawl an entire documentation site
async def crawl_site(start_url: str, max_pages: int = 50) -> list[dict]:
    """Crawl a site following internal links."""
    documents = []
    
    async with AsyncWebCrawler() as crawler:
        # First, get all URLs
        result = await crawler.arun(url=start_url)
        internal_links = [
            link["href"] for link in result.links.get("internal", [])
        ][:max_pages]
        
        # Then scrape each page
        results = await crawler.arun_many(urls=internal_links, max_concurrent=5)
        
        for result in results:
            if result.success and len(result.markdown) > 100:
                documents.append({
                    "url": result.url,
                    "title": result.metadata.get("title", ""),
                    "content": result.markdown,
                })
    
    return documents

Using Firecrawl (Managed API)

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-your-key")

def crawl_with_firecrawl(url: str, limit: int = 50) -> list[dict]:
    """Crawl a site using Firecrawl."""
    result = app.crawl_url(url, params={
        "limit": limit,
        "scrapeOptions": {
            "formats": ["markdown"],
            "onlyMainContent": True
        }
    })
    
    documents = []
    for page in result.get("data", []):
        documents.append({
            "url": page.get("metadata", {}).get("sourceURL", ""),
            "title": page.get("metadata", {}).get("title", ""),
            "content": page.get("markdown", ""),
        })
    
    return documents

Using Proxies for Reliable Scraping

For scraping multiple sources reliably, use proxies:

async with AsyncWebCrawler(
    proxy="http://user:pass@proxy:8080"
) as crawler:
    results = await crawler.arun_many(urls=urls)

Step 2: Content Processing & Chunking

Why Chunking Matters

LLMs have limited context windows. You can’t feed an entire website into a prompt. Chunking splits content into pieces that:

  • Fit within embedding model limits (typically 512-8192 tokens)
  • Contain enough context to be meaningful
  • Are small enough for precise retrieval
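
Because these limits are expressed in tokens, it helps to measure content the same way before picking a chunk size. A quick check with tiktoken (assuming the tiktoken package; cl100k_base is the encoding used by OpenAI's recent models):

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Count tokens the way OpenAI embedding models do."""
    return len(encoding.encode(text))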

Recursive Character Splitting

from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_documents(documents: list[dict], chunk_size: int = 1000, overlap: int = 200) -> list[dict]:
    """Split documents into chunks with metadata."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        separators=["\n## ", "\n### ", "\n\n", "\n", " "]
    )
    
    chunks = []
    for doc in documents:
        splits = splitter.split_text(doc["content"])
        for i, split in enumerate(splits):
            chunks.append({
                "text": split,
                "metadata": {
                    "source_url": doc["url"],
                    "title": doc["title"],
                    "chunk_index": i,
                    "total_chunks": len(splits)
                }
            })
    
    return chunks

Semantic Chunking

For better retrieval quality, split by semantic boundaries:

def semantic_chunk(text: str, max_tokens: int = 500) -> list[str]:
    """Split by markdown headers for semantic coherence."""
    sections = []
    current_section = ""
    
    for line in text.split("\n"):
        if line.startswith("## ") and current_section:
            if len(current_section.split()) > 50:  # Minimum size
                sections.append(current_section.strip())
            current_section = line + "\n"
        else:
            current_section += line + "\n"
    
    if current_section.strip():
        sections.append(current_section.strip())
    
    # Further split sections that are too large
    # (chunk_size is measured in characters; roughly 4 characters per token)
    final_chunks = []
    splitter = RecursiveCharacterTextSplitter(chunk_size=max_tokens * 4)
    for section in sections:
        if len(section.split()) > max_tokens:  # word count as a rough token proxy
            final_chunks.extend(splitter.split_text(section))
        else:
            final_chunks.append(section)
    
    return final_chunks

Chunking Best Practices

| Parameter | Recommendation | Why |
| --- | --- | --- |
| Chunk size | 500-1,500 tokens | Balances context vs. precision |
| Overlap | 10-20% of chunk size | Preserves cross-boundary context |
| Separators | Headers > paragraphs > sentences | Maintains semantic coherence |
| Min size | 50 tokens | Avoids meaningless fragments |
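
A small post-processing pass can enforce the minimum-size rule from the table, dropping fragments too short to be useful retrieval targets (a sketch reusing the count_tokens helper from the chunking section above):

def filter_chunks(chunks: list[dict], min_tokens: int = 50) -> list[dict]:
    """Drop chunks too small to carry meaningful context."""
    return [c for c in chunks if count_tokens(c["text"]) >= min_tokens]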

Step 3: Embedding Generation

Using OpenAI Embeddings

from openai import OpenAI

client = OpenAI()

def generate_embeddings(chunks: list[dict], model: str = "text-embedding-3-small") -> list[dict]:
    """Generate embeddings for all chunks."""
    texts = [chunk["text"] for chunk in chunks]
    
    # Process in batches of 100
    all_embeddings = []
    for i in range(0, len(texts), 100):
        batch = texts[i:i + 100]
        response = client.embeddings.create(model=model, input=batch)
        embeddings = [item.embedding for item in response.data]
        all_embeddings.extend(embeddings)
    
    # Attach embeddings to chunks
    for chunk, embedding in zip(chunks, all_embeddings):
        chunk["embedding"] = embedding
    
    return chunks

Using Free Local Embeddings

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def generate_local_embeddings(chunks: list[dict]) -> list[dict]:
    """Generate embeddings locally (free)."""
    texts = [chunk["text"] for chunk in chunks]
    embeddings = model.encode(texts, show_progress_bar=True)
    
    for chunk, embedding in zip(chunks, embeddings):
        chunk["embedding"] = embedding.tolist()
    
    return chunks

Embedding Model Comparison

| Model | Dimensions | Cost | Quality |
| --- | --- | --- | --- |
| OpenAI text-embedding-3-small | 1536 | $0.02/1M tokens | Good |
| OpenAI text-embedding-3-large | 3072 | $0.13/1M tokens | Best |
| all-MiniLM-L6-v2 | 384 | Free (local) | Good |
| nomic-embed-text (Ollama) | 768 | Free (local) | Good |
| Cohere embed-v3 | 1024 | $0.10/1M tokens | Excellent |
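
The Ollama row in the table works much like the sentence-transformers example. A minimal sketch using the ollama Python package (assumes a local Ollama server with nomic-embed-text already pulled):

import ollama

def generate_ollama_embeddings(chunks: list[dict]) -> list[dict]:
    """Generate embeddings via a local Ollama server (free)."""
    for chunk in chunks:
        response = ollama.embeddings(model="nomic-embed-text", prompt=chunk["text"])
        chunk["embedding"] = response["embedding"]
    return chunks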

Step 4: Vector Database Storage

Using ChromaDB (Local, Free)

import chromadb

def create_vector_store(chunks: list[dict], collection_name: str = "web_rag"):
    """Store chunks in ChromaDB."""
    # Use a distinct name so this doesn't shadow the OpenAI `client` used below
    chroma_client = chromadb.PersistentClient(path="./chroma_db")
    
    collection = chroma_client.get_or_create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"}
    )
    
    # Add chunks
    collection.add(
        ids=[f"chunk_{i}" for i in range(len(chunks))],
        embeddings=[c["embedding"] for c in chunks],
        documents=[c["text"] for c in chunks],
        metadatas=[c["metadata"] for c in chunks]
    )
    
    return collection

def search_vector_store(collection, query: str, n_results: int = 5) -> list[dict]:
    """Search for relevant chunks."""
    # Embed the query with the same model used for ingestion
    # (`client` is the OpenAI client from Step 3)
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding
    
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results
    )
    
    return [
        {"text": doc, "metadata": meta, "distance": dist}
        for doc, meta, dist in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0]
        )
    ]

Using Pinecone (Cloud, Managed)

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="your-pinecone-key")

def create_pinecone_index(chunks: list[dict], index_name: str = "web-rag"):
    """Store chunks in Pinecone."""
    # Create a serverless index if it doesn't exist
    if index_name not in pc.list_indexes().names():
        pc.create_index(
            name=index_name,
            dimension=1536,  # Must match the embedding model's dimensions
            metric="cosine",
            spec=ServerlessSpec(cloud="aws", region="us-east-1")  # required in SDK v3+
        )
    
    index = pc.Index(index_name)
    
    # Upsert in batches
    batch_size = 100
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        vectors = [
            {
                "id": f"chunk_{i + j}",
                "values": chunk["embedding"],
                "metadata": {
                    **chunk["metadata"],
                    "text": chunk["text"][:1000]  # Store text in metadata
                }
            }
            for j, chunk in enumerate(batch)
        ]
        index.upsert(vectors=vectors)
    
    return index

Step 5: Retrieval & Generation

The RAG Query Function

from openai import OpenAI

client = OpenAI()

def rag_query(question: str, collection, n_results: int = 5) -> str:
    """Answer a question using RAG."""
    # Step 1: Retrieve relevant chunks
    relevant_chunks = search_vector_store(collection, question, n_results)
    
    # Step 2: Build context
    context = "\n\n---\n\n".join([
        f"Source: {chunk['metadata']['source_url']}\n{chunk['text']}"
        for chunk in relevant_chunks
    ])
    
    # Step 3: Generate answer
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": """You are a helpful assistant that answers questions based on 
the provided context. Only use information from the context to answer.
If the context doesn't contain the answer, say so.
Always cite your sources by mentioning the source URL."""
            },
            {
                "role": "user",
                "content": f"""Context:
{context}

Question: {question}

Answer based on the context above:"""
            }
        ],
        temperature=0.3
    )
    
    return response.choices[0].message.content

Example Usage

# Build the pipeline (top-level await works in notebooks; otherwise wrap in asyncio.run)
documents = await scrape_documentation([
    "https://docs.firecrawl.dev/introduction",
    "https://docs.firecrawl.dev/api/scrape",
    "https://docs.firecrawl.dev/api/crawl",
])

chunks = chunk_documents(documents)
chunks = generate_embeddings(chunks)
collection = create_vector_store(chunks)

# Query
answer = rag_query(
    "How do I use Firecrawl's extract mode with a custom schema?",
    collection
)
print(answer)

Complete Pipeline Implementation

All-in-One RAG Pipeline

import asyncio
from datetime import datetime
from crawl4ai import AsyncWebCrawler
from openai import OpenAI
import chromadb
from langchain.text_splitter import RecursiveCharacterTextSplitter

class WebRAGPipeline:
    def __init__(self, openai_key: str | None = None, db_path: str = "./rag_db"):
        self.llm_client = OpenAI(api_key=openai_key)
        self.chroma_client = chromadb.PersistentClient(path=db_path)
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000, chunk_overlap=200
        )
    
    async def ingest(self, urls: list[str], collection_name: str = "default"):
        """Scrape URLs and add to vector store."""
        # Scrape
        async with AsyncWebCrawler() as crawler:
            results = await crawler.arun_many(urls=urls, max_concurrent=5)
        
        documents = [
            {"url": r.url, "title": r.metadata.get("title", ""), "content": r.markdown}
            for r in results if r.success
        ]
        
        # Chunk
        chunks = []
        for doc in documents:
            splits = self.splitter.split_text(doc["content"])
            for i, split in enumerate(splits):
                chunks.append({
                    "text": split,
                    "metadata": {"source_url": doc["url"], "title": doc["title"], "chunk_index": i}
                })
        
        # Embed
        texts = [c["text"] for c in chunks]
        embeddings = []
        for i in range(0, len(texts), 100):
            batch = texts[i:i+100]
            resp = self.llm_client.embeddings.create(
                model="text-embedding-3-small", input=batch
            )
            embeddings.extend([d.embedding for d in resp.data])
        
        # Store
        collection = self.chroma_client.get_or_create_collection(collection_name)
        collection.add(
            ids=[f"chunk_{datetime.now().timestamp()}_{i}" for i in range(len(chunks))],
            embeddings=embeddings,
            documents=texts,
            metadatas=[c["metadata"] for c in chunks]
        )
        
        return len(chunks)
    
    def query(self, question: str, collection_name: str = "default", k: int = 5) -> str:
        """Query the RAG pipeline."""
        collection = self.chroma_client.get_collection(collection_name)
        
        # Embed query
        query_emb = self.llm_client.embeddings.create(
            model="text-embedding-3-small", input=question
        ).data[0].embedding
        
        # Search
        results = collection.query(query_embeddings=[query_emb], n_results=k)
        
        context = "\n\n---\n\n".join([
            f"Source: {meta['source_url']}\n{doc}"
            for doc, meta in zip(results["documents"][0], results["metadatas"][0])
        ])
        
        # Generate
        response = self.llm_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Answer using only the provided context. Cite sources."},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
            ],
            temperature=0.3
        )
        
        return response.choices[0].message.content


# Usage
pipeline = WebRAGPipeline()

# Ingest documentation
asyncio.run(pipeline.ingest([
    "https://docs.example.com/getting-started",
    "https://docs.example.com/api-reference",
    "https://docs.example.com/tutorials",
]))

# Query
answer = pipeline.query("How do I authenticate API requests?")
print(answer)

Keeping Data Fresh

Scheduled Re-Scraping

import asyncio
import schedule
import time
from datetime import datetime

def refresh_knowledge_base():
    """Re-scrape sources and update the vector store."""
    # `pipeline` and `source_urls` come from the setup above
    asyncio.run(pipeline.ingest(source_urls, collection_name="docs"))
    print(f"Knowledge base refreshed at {datetime.now()}")

# Run daily at 6 AM
schedule.every().day.at("06:00").do(refresh_knowledge_base)

while True:
    schedule.run_pending()
    time.sleep(60)

Incremental Updates

import hashlib

async def incremental_update(pipeline, urls, collection_name="default"):
    """Only re-scrape pages that have changed.

    Note: the hash check below only skips a page if ingest stored a
    "content_hash" field in chunk metadata; add it there if you use this.
    """
    collection = pipeline.chroma_client.get_collection(collection_name)
    
    for url in urls:
        # Check if we already have this URL
        existing = collection.get(
            where={"source_url": url},
            include=["metadatas"]
        )
        
        # Scrape the page
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(url=url)
        
        if result.success:
            # Compare content hash
            new_hash = hashlib.md5(result.markdown.encode()).hexdigest()
            
            if existing["ids"] and existing["metadatas"][0].get("content_hash") == new_hash:
                continue  # Content unchanged, skip
            
            # Delete old chunks for this URL
            if existing["ids"]:
                collection.delete(ids=existing["ids"])
            
            # Add new chunks
            await pipeline.ingest([url], collection_name)

Advanced Techniques

Hybrid Search (Vector + Keyword)

def hybrid_search(collection, query: str, k: int = 5) -> list:
    """Combine vector similarity with keyword matching."""
    # Vector search
    vector_results = search_vector_store(collection, query, k * 2)
    
    # Keyword filter
    keywords = query.lower().split()
    scored_results = []
    for result in vector_results:
        keyword_score = sum(
            1 for kw in keywords if kw in result["text"].lower()
        ) / len(keywords)
        
        combined_score = (1 - result["distance"]) * 0.7 + keyword_score * 0.3
        result["combined_score"] = combined_score
        scored_results.append(result)
    
    scored_results.sort(key=lambda x: x["combined_score"], reverse=True)
    return scored_results[:k]

Query Expansion

import json

def expand_query(query: str) -> list[str]:
    """Generate multiple search queries from the original."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": 'Generate 3 alternative phrasings of this search query. Return a JSON object with a "queries" key holding an array of strings.'
            },
            {"role": "user", "content": query}
        ],
        response_format={"type": "json_object"}
    )
    
    return json.loads(response.choices[0].message.content)["queries"]
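
To put the expansions to work, run the vector search once per phrasing and merge the results, deduplicating by text (a sketch built on search_vector_store from Step 4):

def multi_query_search(collection, query: str, k: int = 5) -> list[dict]:
    """Search with the original query plus its expansions, merged and deduplicated."""
    all_queries = [query] + expand_query(query)
    seen_texts = set()
    merged = []
    for q in all_queries:
        for result in search_vector_store(collection, q, k):
            if result["text"] not in seen_texts:
                seen_texts.add(result["text"])
                merged.append(result)
    # Keep the k closest matches across all query variants
    merged.sort(key=lambda r: r["distance"])
    return merged[:k]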

Re-Ranking

def rerank_results(query: str, results: list[dict], top_k: int = 3) -> list[dict]:
    """Use LLM to re-rank retrieved chunks by relevance."""
    context = "\n".join([
        f"[{i}] {r['text'][:200]}" for i, r in enumerate(results)
    ])
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": f'Rank these {len(results)} passages by relevance to the query. Return a JSON object with a "ranking" key: the passage indices in order of relevance.'
            },
            {
                "role": "user",
                "content": f"Query: {query}\n\nPassages:\n{context}"
            }
        ],
        response_format={"type": "json_object"}
    )
    
    indices = json.loads(response.choices[0].message.content)["ranking"]
    return [results[i] for i in indices[:top_k]]

Vector Database Options

| Database | Type | Free Tier | Best For |
| --- | --- | --- | --- |
| ChromaDB | Local/embedded | Fully free | Development, small projects |
| Pinecone | Cloud managed | 1 index, 100K vectors | Production, managed |
| Weaviate | Self-hosted/cloud | Self-hosted free | Full-featured, hybrid search |
| Qdrant | Self-hosted/cloud | Self-hosted free | Performance, filtering |
| pgvector | PostgreSQL extension | Free (with Postgres) | If you already use Postgres |
| Supabase | Cloud (pgvector) | 500MB free | Supabase users |
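
If Postgres is already in your stack, pgvector keeps chunks and embeddings alongside your other data. A minimal sketch (assumes the psycopg driver, a local database named rag with the pgvector extension available, and 1536-dimension embeddings):

import psycopg

def pgvector_search(query_embedding: list[float], k: int = 5):
    """Nearest-neighbor search with pgvector's cosine-distance operator (<=>)."""
    # pgvector accepts vectors as '[0.1,0.2,...]' literals
    vec_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with psycopg.connect("postgresql://localhost/rag") as conn:
        conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
        conn.execute("""
            CREATE TABLE IF NOT EXISTS chunks (
                id serial PRIMARY KEY,
                source_url text,
                content text,
                embedding vector(1536)
            )
        """)
        return conn.execute(
            "SELECT content, source_url FROM chunks "
            "ORDER BY embedding <=> %s::vector LIMIT %s",
            (vec_literal, k),
        ).fetchall()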

Production Considerations

Scaling

  • Scraping: Use proxy rotation for reliability at scale
  • Embedding: Batch embeddings and use async processing
  • Storage: Use a managed vector database for production
  • Querying: Cache common queries (see the sketch below), use streaming responses
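
For example, caching answers to repeated questions skips redundant embedding and LLM calls (a minimal in-memory sketch around the WebRAGPipeline.query method defined earlier; production setups would typically use Redis with a TTL):

query_cache: dict[str, str] = {}

def cached_query(pipeline, question: str) -> str:
    """Return the cached answer when the exact question was asked before."""
    key = question.strip().lower()
    if key not in query_cache:
        query_cache[key] = pipeline.query(question)
    return query_cache[key]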

Monitoring

Track these metrics:

  • Retrieval relevance (are the right chunks returned?)
  • Answer quality (are responses accurate?)
  • Latency (end-to-end query time)
  • Data freshness (when was the source last scraped?)
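
A lightweight starting point is logging latency and answer size on every query (a sketch; swap in your preferred logging or observability stack):

import logging
import time

logger = logging.getLogger("rag")

def monitored_query(pipeline, question: str) -> str:
    """Run a query and log end-to-end latency for later analysis."""
    start = time.perf_counter()
    answer = pipeline.query(question)
    latency = time.perf_counter() - start
    logger.info("question=%r latency=%.2fs answer_chars=%d",
                question, latency, len(answer))
    return answer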

Cost Estimation

| Component | Cost (1,000 pages) |
| --- | --- |
| Scraping (Crawl4ai) | Free |
| Scraping (Firecrawl) | ~$20 |
| Embeddings (OpenAI small) | ~$0.05 |
| Embeddings (local) | Free |
| Vector DB (Chroma) | Free |
| Vector DB (Pinecone) | Free tier |
| Query LLM (GPT-4o-mini) | ~$0.001/query |

FAQ

How many documents should I include in a RAG pipeline?

Start with 50-100 high-quality documents and expand based on retrieval quality. More documents improve coverage but can reduce precision. Focus on quality over quantity — well-chunked, relevant content performs better than a massive but noisy corpus.

What’s the best chunk size for RAG?

500-1,000 tokens is the sweet spot for most use cases. Smaller chunks (200-500) improve precision but lose context. Larger chunks (1,000-2,000) maintain context but may dilute relevant information. Experiment with your specific content.

Can I use RAG with local models?

Yes. Use Ollama for the LLM, sentence-transformers for embeddings, and ChromaDB for vector storage. The entire pipeline runs locally with zero API costs. Accuracy may be slightly lower than cloud models, but it’s excellent for privacy-sensitive applications.

How do I measure RAG quality?

Track retrieval precision (are the returned chunks relevant?), answer faithfulness (does the answer match the source?), and answer relevance (does it actually answer the question?). Tools like Ragas and DeepEval provide automated RAG evaluation metrics.

Which scraper works best for RAG pipelines?

Crawl4ai is ideal for free, high-volume scraping with clean markdown output. Firecrawl is better for sites that need JavaScript rendering and anti-bot handling. Both produce the clean markdown that RAG pipelines need. See our AI web scraper comparison for more options.

