RAG Pipeline with Web Scraping: Live Data for AI

Large language models have a fundamental limitation: their knowledge is frozen at training time. Ask GPT-4o about yesterday’s stock prices, a product launched last week, or the latest version of a library’s documentation, and you will get outdated or hallucinated answers.

Retrieval-Augmented Generation (RAG) solves this by fetching relevant information from external sources at query time and feeding it to the LLM as context. When you combine RAG with web scraping, you get a system that can answer questions with live, current data from any website — turning your AI from a static knowledge base into a dynamic research assistant.

This guide walks you through building a complete RAG pipeline powered by web scraping, from data collection to vector storage to intelligent retrieval.

What Is RAG?

Retrieval-Augmented Generation is a technique that enhances LLM responses by:

  1. Retrieving relevant documents from a knowledge base
  2. Augmenting the LLM’s prompt with those documents
  3. Generating a response that’s grounded in the retrieved information
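
In code, the whole loop fits in a few lines. Here is a minimal sketch of the pattern (the embed, vector_search, and generate callables are placeholders for the components built in the rest of this guide):

def rag_answer(question: str, embed, vector_search, generate) -> str:
    """The three RAG steps, with the components passed in as callables."""
    # 1. Retrieve: find stored chunks most similar to the question
    query_vector = embed(question)
    chunks = vector_search(query_vector, top_k=5)

    # 2. Augment: prepend the retrieved text to the prompt
    context = "\n\n".join(chunk["text"] for chunk in chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"

    # 3. Generate: the LLM answers grounded in that context
    return generate(prompt)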

Why RAG + Web Scraping?

| Approach | Pros | Cons |
| --- | --- | --- |
| LLM only | Simple | Outdated, hallucinations |
| RAG + static docs | Grounded responses | Content goes stale |
| RAG + web scraping | Live data, always current | More complex setup |

The combination of RAG with web scraping gives you the best of both worlds: the reasoning power of LLMs with access to live, current information from any website.

Architecture Overview

┌──────────────────────────────────────────────────────┐
│                  INGESTION PIPELINE                  │
│                                                      │
│  [Web Sources] → [Scraper] → [Chunker] → [Embedder]  │
│                                   ↓                  │
│                          [Vector Database]           │
└──────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────┐
│                    QUERY PIPELINE                    │
│                                                      │
│   [User Query] → [Embed Query] → [Vector Search]     │
│                                        ↓             │
│                            [Retrieve Top-K Docs]     │
│                                        ↓             │
│   [Augmented Prompt] → [LLM] → [Answer]              │
└──────────────────────────────────────────────────────┘

Components

| Component | Purpose | Options |
| --- | --- | --- |
| Web Scraper | Collect content from websites | Crawl4ai, Firecrawl, custom scripts |
| Chunker | Split content into manageable pieces | Token-based, semantic, recursive |
| Embedding Model | Convert text to vectors | OpenAI, Cohere, sentence-transformers |
| Vector Database | Store and search embeddings | Chroma, Pinecone, Weaviate, Qdrant |
| LLM | Generate responses | GPT-4o, Claude, Llama 3.1 |

Step 1: Web Scraping for Data Collection

Using Crawl4ai (Free, Open Source)

import asyncio
from datetime import datetime

from crawl4ai import AsyncWebCrawler

async def scrape_documentation(urls: list[str]) -> list[dict]:
    """Scrape multiple documentation pages."""
    documents = []
    
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls=urls, max_concurrent=5)
        
        for result in results:
            if result.success:
                documents.append({
                    "url": result.url,
                    "title": result.metadata.get("title", ""),
                    "content": result.markdown,
                    "scraped_at": datetime.now().isoformat()
                })
    
    return documents

# Crawl an entire documentation site
async def crawl_site(start_url: str, max_pages: int = 50) -> list[dict]:
    """Crawl a site following internal links."""
    documents = []
    
    async with AsyncWebCrawler() as crawler:
        # First, get all URLs
        result = await crawler.arun(url=start_url)
        internal_links = [
            link["href"] for link in result.links.get("internal", [])
        ][:max_pages]
        
        # Then scrape each page
        results = await crawler.arun_many(urls=internal_links, max_concurrent=5)
        
        for result in results:
            if result.success and len(result.markdown) > 100:
                documents.append({
                    "url": result.url,
                    "title": result.metadata.get("title", ""),
                    "content": result.markdown,
                })
    
    return documents

Using Firecrawl (Managed API)

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-your-key")

def crawl_with_firecrawl(url: str, limit: int = 50) -> list[dict]:
    """Crawl a site using Firecrawl."""
    result = app.crawl_url(url, params={
        "limit": limit,
        "scrapeOptions": {
            "formats": ["markdown"],
            "onlyMainContent": True
        }
    })
    
    documents = []
    for page in result.get("data", []):
        documents.append({
            "url": page.get("metadata", {}).get("sourceURL", ""),
            "title": page.get("metadata", {}).get("title", ""),
            "content": page.get("markdown", ""),
        })
    
    return documents

Using Proxies for Reliable Scraping

For scraping multiple sources reliably, use proxies:

async with AsyncWebCrawler(
    proxy="http://user:pass@proxy:8080"
) as crawler:
    results = await crawler.arun_many(urls=urls)

Step 2: Content Processing & Chunking

Why Chunking Matters

LLMs have limited context windows. You can’t feed an entire website into a prompt. Chunking splits content into pieces that:

  • Fit within embedding model limits (typically 512-8192 tokens)
  • Contain enough context to be meaningful
  • Are small enough for precise retrieval
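
Because these limits are expressed in tokens, it helps to measure content the same way before picking a chunk size. A quick check with tiktoken (assuming the tiktoken package; cl100k_base is the encoding used by OpenAI's recent models):

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Count tokens the way OpenAI embedding models do."""
    return len(encoding.encode(text))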

Recursive Character Splitting

from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_documents(documents: list[dict], chunk_size: int = 1000, overlap: int = 200) -> list[dict]:
    """Split documents into chunks with metadata."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        separators=["\n## ", "\n### ", "\n\n", "\n", " "]
    )
    
    chunks = []
    for doc in documents:
        splits = splitter.split_text(doc["content"])
        for i, split in enumerate(splits):
            chunks.append({
                "text": split,
                "metadata": {
                    "source_url": doc["url"],
                    "title": doc["title"],
                    "chunk_index": i,
                    "total_chunks": len(splits)
                }
            })
    
    return chunks

Semantic Chunking

For better retrieval quality, split by semantic boundaries:

def semantic_chunk(text: str, max_tokens: int = 500) -> list[str]:
    """Split by markdown headers for semantic coherence."""
    sections = []
    current_section = ""
    
    for line in text.split("\n"):
        if line.startswith("## ") and current_section:
            if len(current_section.split()) > 50:  # Minimum size
                sections.append(current_section.strip())
            current_section = line + "\n"
        else:
            current_section += line + "\n"
    
    if current_section.strip():
        sections.append(current_section.strip())
    
    # Further split sections that are too large
    # (chunk_size is measured in characters; roughly 4 characters per token)
    final_chunks = []
    splitter = RecursiveCharacterTextSplitter(chunk_size=max_tokens * 4)
    for section in sections:
        if len(section.split()) > max_tokens:  # word count as a rough token proxy
            final_chunks.extend(splitter.split_text(section))
        else:
            final_chunks.append(section)
    
    return final_chunks

Chunking Best Practices

| Parameter | Recommendation | Why |
| --- | --- | --- |
| Chunk size | 500-1,500 tokens | Balances context vs. precision |
| Overlap | 10-20% of chunk size | Preserves cross-boundary context |
| Separators | Headers > paragraphs > sentences | Maintains semantic coherence |
| Min size | 50 tokens | Avoids meaningless fragments |
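
A small post-processing pass can enforce the minimum-size rule from the table, dropping fragments too short to be useful retrieval targets (a sketch reusing the count_tokens helper from the chunking section above):

def filter_chunks(chunks: list[dict], min_tokens: int = 50) -> list[dict]:
    """Drop chunks too small to carry meaningful context."""
    return [c for c in chunks if count_tokens(c["text"]) >= min_tokens]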

Step 3: Embedding Generation

Using OpenAI Embeddings

from openai import OpenAI

client = OpenAI()

def generate_embeddings(chunks: list[dict], model: str = "text-embedding-3-small") -> list[dict]:
    """Generate embeddings for all chunks."""
    texts = [chunk["text"] for chunk in chunks]
    
    # Process in batches of 100
    all_embeddings = []
    for i in range(0, len(texts), 100):
        batch = texts[i:i + 100]
        response = client.embeddings.create(model=model, input=batch)
        embeddings = [item.embedding for item in response.data]
        all_embeddings.extend(embeddings)
    
    # Attach embeddings to chunks
    for chunk, embedding in zip(chunks, all_embeddings):
        chunk["embedding"] = embedding
    
    return chunks

Using Free Local Embeddings

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def generate_local_embeddings(chunks: list[dict]) -> list[dict]:
    """Generate embeddings locally (free)."""
    texts = [chunk["text"] for chunk in chunks]
    embeddings = model.encode(texts, show_progress_bar=True)
    
    for chunk, embedding in zip(chunks, embeddings):
        chunk["embedding"] = embedding.tolist()
    
    return chunks

Embedding Model Comparison

| Model | Dimensions | Cost | Quality |
| --- | --- | --- | --- |
| OpenAI text-embedding-3-small | 1536 | $0.02/1M tokens | Good |
| OpenAI text-embedding-3-large | 3072 | $0.13/1M tokens | Best |
| all-MiniLM-L6-v2 | 384 | Free (local) | Good |
| nomic-embed-text (Ollama) | 768 | Free (local) | Good |
| Cohere embed-v3 | 1024 | $0.10/1M tokens | Excellent |
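
The Ollama row in the table works much like the sentence-transformers example. A minimal sketch using the ollama Python package (assumes a local Ollama server with nomic-embed-text already pulled):

import ollama

def generate_ollama_embeddings(chunks: list[dict]) -> list[dict]:
    """Generate embeddings via a local Ollama server (free)."""
    for chunk in chunks:
        response = ollama.embeddings(model="nomic-embed-text", prompt=chunk["text"])
        chunk["embedding"] = response["embedding"]
    return chunks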

Step 4: Vector Database Storage

Using ChromaDB (Local, Free)

import chromadb

def create_vector_store(chunks: list[dict], collection_name: str = "web_rag"):
    """Store chunks in ChromaDB."""
    # Use a distinct name so this doesn't shadow the OpenAI `client` used below
    chroma_client = chromadb.PersistentClient(path="./chroma_db")
    
    collection = chroma_client.get_or_create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"}
    )
    
    # Add chunks
    collection.add(
        ids=[f"chunk_{i}" for i in range(len(chunks))],
        embeddings=[c["embedding"] for c in chunks],
        documents=[c["text"] for c in chunks],
        metadatas=[c["metadata"] for c in chunks]
    )
    
    return collection

def search_vector_store(collection, query: str, n_results: int = 5) -> list[dict]:
    """Search for relevant chunks."""
    # Embed the query with the same model used for ingestion
    # (`client` is the OpenAI client from Step 3)
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding
    
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results
    )
    
    return [
        {"text": doc, "metadata": meta, "distance": dist}
        for doc, meta, dist in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0]
        )
    ]

Using Pinecone (Cloud, Managed)

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="your-pinecone-key")

def create_pinecone_index(chunks: list[dict], index_name: str = "web-rag"):
    """Store chunks in Pinecone."""
    # Create a serverless index if it doesn't exist
    if index_name not in pc.list_indexes().names():
        pc.create_index(
            name=index_name,
            dimension=1536,  # Must match the embedding model's dimensions
            metric="cosine",
            spec=ServerlessSpec(cloud="aws", region="us-east-1")  # required in SDK v3+
        )
    
    index = pc.Index(index_name)
    
    # Upsert in batches
    batch_size = 100
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        vectors = [
            {
                "id": f"chunk_{i + j}",
                "values": chunk["embedding"],
                "metadata": {
                    **chunk["metadata"],
                    "text": chunk["text"][:1000]  # Store text in metadata
                }
            }
            for j, chunk in enumerate(batch)
        ]
        index.upsert(vectors=vectors)
    
    return index

Step 5: Retrieval & Generation

The RAG Query Function

from openai import OpenAI

client = OpenAI()

def rag_query(question: str, collection, n_results: int = 5) -> str:
    """Answer a question using RAG."""
    # Step 1: Retrieve relevant chunks
    relevant_chunks = search_vector_store(collection, question, n_results)
    
    # Step 2: Build context
    context = "\n\n---\n\n".join([
        f"Source: {chunk['metadata']['source_url']}\n{chunk['text']}"
        for chunk in relevant_chunks
    ])
    
    # Step 3: Generate answer
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": """You are a helpful assistant that answers questions based on 
the provided context. Only use information from the context to answer.
If the context doesn't contain the answer, say so.
Always cite your sources by mentioning the source URL."""
            },
            {
                "role": "user",
                "content": f"""Context:
{context}

Question: {question}

Answer based on the context above:"""
            }
        ],
        temperature=0.3
    )
    
    return response.choices[0].message.content

Example Usage

# Build the pipeline (top-level await works in notebooks; otherwise wrap in asyncio.run)
documents = await scrape_documentation([
    "https://docs.firecrawl.dev/introduction",
    "https://docs.firecrawl.dev/api/scrape",
    "https://docs.firecrawl.dev/api/crawl",
])

chunks = chunk_documents(documents)
chunks = generate_embeddings(chunks)
collection = create_vector_store(chunks)

# Query
answer = rag_query(
    "How do I use Firecrawl's extract mode with a custom schema?",
    collection
)
print(answer)

Complete Pipeline Implementation

All-in-One RAG Pipeline

import asyncio
from datetime import datetime
from crawl4ai import AsyncWebCrawler
from openai import OpenAI
import chromadb
from langchain.text_splitter import RecursiveCharacterTextSplitter

class WebRAGPipeline:
    def __init__(self, openai_key: str | None = None, db_path: str = "./rag_db"):
        self.llm_client = OpenAI(api_key=openai_key)
        self.chroma_client = chromadb.PersistentClient(path=db_path)
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000, chunk_overlap=200
        )
    
    async def ingest(self, urls: list[str], collection_name: str = "default"):
        """Scrape URLs and add to vector store."""
        # Scrape
        async with AsyncWebCrawler() as crawler:
            results = await crawler.arun_many(urls=urls, max_concurrent=5)
        
        documents = [
            {"url": r.url, "title": r.metadata.get("title", ""), "content": r.markdown}
            for r in results if r.success
        ]
        
        # Chunk
        chunks = []
        for doc in documents:
            splits = self.splitter.split_text(doc["content"])
            for i, split in enumerate(splits):
                chunks.append({
                    "text": split,
                    "metadata": {"source_url": doc["url"], "title": doc["title"], "chunk_index": i}
                })
        
        # Embed
        texts = [c["text"] for c in chunks]
        embeddings = []
        for i in range(0, len(texts), 100):
            batch = texts[i:i+100]
            resp = self.llm_client.embeddings.create(
                model="text-embedding-3-small", input=batch
            )
            embeddings.extend([d.embedding for d in resp.data])
        
        # Store
        collection = self.chroma_client.get_or_create_collection(collection_name)
        collection.add(
            ids=[f"chunk_{datetime.now().timestamp()}_{i}" for i in range(len(chunks))],
            embeddings=embeddings,
            documents=texts,
            metadatas=[c["metadata"] for c in chunks]
        )
        
        return len(chunks)
    
    def query(self, question: str, collection_name: str = "default", k: int = 5) -> str:
        """Query the RAG pipeline."""
        collection = self.chroma_client.get_collection(collection_name)
        
        # Embed query
        query_emb = self.llm_client.embeddings.create(
            model="text-embedding-3-small", input=question
        ).data[0].embedding
        
        # Search
        results = collection.query(query_embeddings=[query_emb], n_results=k)
        
        context = "\n\n---\n\n".join([
            f"Source: {meta['source_url']}\n{doc}"
            for doc, meta in zip(results["documents"][0], results["metadatas"][0])
        ])
        
        # Generate
        response = self.llm_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Answer using only the provided context. Cite sources."},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
            ],
            temperature=0.3
        )
        
        return response.choices[0].message.content


# Usage
pipeline = WebRAGPipeline()

# Ingest documentation
asyncio.run(pipeline.ingest([
    "https://docs.example.com/getting-started",
    "https://docs.example.com/api-reference",
    "https://docs.example.com/tutorials",
]))

# Query
answer = pipeline.query("How do I authenticate API requests?")
print(answer)

Keeping Data Fresh

Scheduled Re-Scraping

import asyncio
import schedule
import time
from datetime import datetime

def refresh_knowledge_base():
    """Re-scrape sources and update the vector store."""
    # `pipeline` and `source_urls` come from the setup above
    asyncio.run(pipeline.ingest(source_urls, collection_name="docs"))
    print(f"Knowledge base refreshed at {datetime.now()}")

# Run daily at 6 AM
schedule.every().day.at("06:00").do(refresh_knowledge_base)

while True:
    schedule.run_pending()
    time.sleep(60)

Incremental Updates

import hashlib

async def incremental_update(pipeline, urls, collection_name="default"):
    """Only re-scrape pages that have changed.

    Note: the hash check below only skips a page if ingest stored a
    "content_hash" field in chunk metadata; add it there if you use this.
    """
    collection = pipeline.chroma_client.get_collection(collection_name)
    
    for url in urls:
        # Check if we already have this URL
        existing = collection.get(
            where={"source_url": url},
            include=["metadatas"]
        )
        
        # Scrape the page
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(url=url)
        
        if result.success:
            # Compare content hash
            new_hash = hashlib.md5(result.markdown.encode()).hexdigest()
            
            if existing["ids"] and existing["metadatas"][0].get("content_hash") == new_hash:
                continue  # Content unchanged, skip
            
            # Delete old chunks for this URL
            if existing["ids"]:
                collection.delete(ids=existing["ids"])
            
            # Add new chunks
            await pipeline.ingest([url], collection_name)

Advanced Techniques

Hybrid Search (Vector + Keyword)

def hybrid_search(collection, query: str, k: int = 5) -> list:
    """Combine vector similarity with keyword matching."""
    # Vector search
    vector_results = search_vector_store(collection, query, k * 2)
    
    # Keyword filter
    keywords = query.lower().split()
    scored_results = []
    for result in vector_results:
        keyword_score = sum(
            1 for kw in keywords if kw in result["text"].lower()
        ) / len(keywords)
        
        combined_score = (1 - result["distance"]) * 0.7 + keyword_score * 0.3
        result["combined_score"] = combined_score
        scored_results.append(result)
    
    scored_results.sort(key=lambda x: x["combined_score"], reverse=True)
    return scored_results[:k]

Query Expansion

import json

def expand_query(query: str) -> list[str]:
    """Generate multiple search queries from the original."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": 'Generate 3 alternative phrasings of this search query. Return a JSON object with a "queries" key holding an array of strings.'
            },
            {"role": "user", "content": query}
        ],
        response_format={"type": "json_object"}
    )
    
    return json.loads(response.choices[0].message.content)["queries"]
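
To put the expansions to work, run the vector search once per phrasing and merge the results, deduplicating by text (a sketch built on search_vector_store from Step 4):

def multi_query_search(collection, query: str, k: int = 5) -> list[dict]:
    """Search with the original query plus its expansions, merged and deduplicated."""
    all_queries = [query] + expand_query(query)
    seen_texts = set()
    merged = []
    for q in all_queries:
        for result in search_vector_store(collection, q, k):
            if result["text"] not in seen_texts:
                seen_texts.add(result["text"])
                merged.append(result)
    # Keep the k closest matches across all query variants
    merged.sort(key=lambda r: r["distance"])
    return merged[:k]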

Re-Ranking

def rerank_results(query: str, results: list[dict], top_k: int = 3) -> list[dict]:
    """Use LLM to re-rank retrieved chunks by relevance."""
    context = "\n".join([
        f"[{i}] {r['text'][:200]}" for i, r in enumerate(results)
    ])
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": f'Rank these {len(results)} passages by relevance to the query. Return a JSON object with a "ranking" key: the passage indices in order of relevance.'
            },
            {
                "role": "user",
                "content": f"Query: {query}\n\nPassages:\n{context}"
            }
        ],
        response_format={"type": "json_object"}
    )
    
    indices = json.loads(response.choices[0].message.content)["ranking"]
    return [results[i] for i in indices[:top_k]]

Vector Database Options

| Database | Type | Free Tier | Best For |
| --- | --- | --- | --- |
| ChromaDB | Local/embedded | Fully free | Development, small projects |
| Pinecone | Cloud managed | 1 index, 100K vectors | Production, managed |
| Weaviate | Self-hosted/cloud | Self-hosted free | Full-featured, hybrid search |
| Qdrant | Self-hosted/cloud | Self-hosted free | Performance, filtering |
| pgvector | PostgreSQL extension | Free (with Postgres) | If you already use Postgres |
| Supabase | Cloud (pgvector) | 500MB free | Supabase users |
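
If Postgres is already in your stack, pgvector keeps chunks and embeddings alongside your other data. A minimal sketch (assumes the psycopg driver, a local database named rag with the pgvector extension available, and 1536-dimension embeddings):

import psycopg

def pgvector_search(query_embedding: list[float], k: int = 5):
    """Nearest-neighbor search with pgvector's cosine-distance operator (<=>)."""
    # pgvector accepts vectors as '[0.1,0.2,...]' literals
    vec_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with psycopg.connect("postgresql://localhost/rag") as conn:
        conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
        conn.execute("""
            CREATE TABLE IF NOT EXISTS chunks (
                id serial PRIMARY KEY,
                source_url text,
                content text,
                embedding vector(1536)
            )
        """)
        return conn.execute(
            "SELECT content, source_url FROM chunks "
            "ORDER BY embedding <=> %s::vector LIMIT %s",
            (vec_literal, k),
        ).fetchall()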

Production Considerations

Scaling

  • Scraping: Use proxy rotation for reliability at scale
  • Embedding: Batch embeddings and use async processing
  • Storage: Use a managed vector database for production
  • Querying: Cache common queries (see the sketch below), use streaming responses
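
For example, caching answers to repeated questions skips redundant embedding and LLM calls (a minimal in-memory sketch around the WebRAGPipeline.query method defined earlier; production setups would typically use Redis with a TTL):

query_cache: dict[str, str] = {}

def cached_query(pipeline, question: str) -> str:
    """Return the cached answer when the exact question was asked before."""
    key = question.strip().lower()
    if key not in query_cache:
        query_cache[key] = pipeline.query(question)
    return query_cache[key]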

Monitoring

Track these metrics:

  • Retrieval relevance (are the right chunks returned?)
  • Answer quality (are responses accurate?)
  • Latency (end-to-end query time)
  • Data freshness (when was the source last scraped?)
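
A lightweight starting point is logging latency and answer size on every query (a sketch; swap in your preferred logging or observability stack):

import logging
import time

logger = logging.getLogger("rag")

def monitored_query(pipeline, question: str) -> str:
    """Run a query and log end-to-end latency for later analysis."""
    start = time.perf_counter()
    answer = pipeline.query(question)
    latency = time.perf_counter() - start
    logger.info("question=%r latency=%.2fs answer_chars=%d",
                question, latency, len(answer))
    return answer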

Cost Estimation

| Component | Cost (1,000 pages) |
| --- | --- |
| Scraping (Crawl4ai) | Free |
| Scraping (Firecrawl) | ~$20 |
| Embeddings (OpenAI small) | ~$0.05 |
| Embeddings (local) | Free |
| Vector DB (Chroma) | Free |
| Vector DB (Pinecone) | Free tier |
| Query LLM (GPT-4o-mini) | ~$0.001/query |

FAQ

How many documents should I include in a RAG pipeline?

Start with 50-100 high-quality documents and expand based on retrieval quality. More documents improve coverage but can reduce precision. Focus on quality over quantity — well-chunked, relevant content performs better than a massive but noisy corpus.

What’s the best chunk size for RAG?

500-1,000 tokens is the sweet spot for most use cases. Smaller chunks (200-500) improve precision but lose context. Larger chunks (1,000-2,000) maintain context but may dilute relevant information. Experiment with your specific content.

Can I use RAG with local models?

Yes. Use Ollama for the LLM, sentence-transformers for embeddings, and ChromaDB for vector storage. The entire pipeline runs locally with zero API costs. Accuracy may be slightly lower than cloud models, but it’s excellent for privacy-sensitive applications.

How do I measure RAG quality?

Track retrieval precision (are the returned chunks relevant?), answer faithfulness (does the answer match the source?), and answer relevance (does it actually answer the question?). Tools like Ragas and DeepEval provide automated RAG evaluation metrics.

Which scraper works best for RAG pipelines?

Crawl4ai is ideal for free, high-volume scraping with clean markdown output. Firecrawl is better for sites that need JavaScript rendering and anti-bot handling. Both produce the clean markdown that RAG pipelines need. See our AI web scraper comparison for more options.

