RAG Pipeline with Web Scraping: Live Data for AI
Large language models have a fundamental limitation: their knowledge is frozen at training time. Ask GPT-4o about yesterday’s stock prices, a product launched last week, or the latest version of a library’s documentation, and you will get outdated or hallucinated answers.
Retrieval-Augmented Generation (RAG) solves this by fetching relevant information from external sources at query time and feeding it to the LLM as context. When you combine RAG with web scraping, you get a system that can answer questions with live, current data from any website — turning your AI from a static knowledge base into a dynamic research assistant.
This guide walks you through building a complete RAG pipeline powered by web scraping, from data collection to vector storage to intelligent retrieval.
Table of Contents
- What Is RAG?
- Architecture Overview
- Step 1: Web Scraping for Data Collection
- Step 2: Content Processing & Chunking
- Step 3: Embedding Generation
- Step 4: Vector Database Storage
- Step 5: Retrieval & Generation
- Complete Pipeline Implementation
- Keeping Data Fresh
- Advanced Techniques
- Vector Database Options
- Production Considerations
- FAQ
What Is RAG?
Retrieval-Augmented Generation is a technique that enhances LLM responses by:
- Retrieving relevant documents from a knowledge base
- Augmenting the LLM’s prompt with those documents
- Generating a response that’s grounded in the retrieved information
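In code, that loop is just a retrieval step wrapped around an ordinary LLM call. Here is a minimal sketch using a toy keyword-overlap retriever as a stand-in for the vector search built later in this guide (the sample documents are illustrative):
from openai import OpenAI
client = OpenAI()
knowledge_base = [
    "Firecrawl's scrape endpoint returns any page as clean markdown.",
    "Crawl4ai is an open-source crawler that outputs LLM-ready markdown.",
]
def naive_rag_answer(question: str) -> str:
    # 1. Retrieve: rank documents by word overlap with the question (toy retriever)
    query_words = set(question.lower().split())
    ranked = sorted(knowledge_base, key=lambda d: len(query_words & set(d.lower().split())), reverse=True)
    context = "\n".join(ranked[:2])
    # 2. Augment: put the retrieved text into the prompt
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    # 3. Generate: the LLM answers from the supplied context only
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content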
Why RAG + Web Scraping?
| Approach | Pros | Cons |
|---|---|---|
| LLM only | Simple | Outdated, hallucinations |
| RAG + static docs | Grounded responses | Content goes stale |
| RAG + web scraping | Live data, always current | More complex setup |
The combination of RAG with web scraping gives you the best of both worlds: the reasoning power of LLMs with access to live, current information from any website.
Architecture Overview
┌──────────────────────────────────────────────────────┐
│ INGESTION PIPELINE │
│ │
│ [Web Sources] → [Scraper] → [Chunker] → [Embedder] │
│ ↓ │
│ [Vector Database] │
└──────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────┐
│ QUERY PIPELINE │
│ │
│ [User Query] → [Embed Query] → [Vector Search] │
│ ↓ │
│ [Retrieve Top-K Docs] │
│ ↓ │
│ [Augmented Prompt] → [LLM] → [Answer]│
└──────────────────────────────────────────────────────┘
Components
| Component | Purpose | Options |
|---|---|---|
| Web Scraper | Collect content from websites | Crawl4ai, Firecrawl, custom scripts |
| Chunker | Split content into manageable pieces | Token-based, semantic, recursive |
| Embedding Model | Convert text to vectors | OpenAI, Cohere, sentence-transformers |
| Vector Database | Store and search embeddings | Chroma, Pinecone, Weaviate, Qdrant |
| LLM | Generate responses | GPT-4o, Claude, Llama 3.1 |
Step 1: Web Scraping for Data Collection
Using Crawl4ai (Free, Open Source)
import asyncio
from datetime import datetime
from crawl4ai import AsyncWebCrawler
async def scrape_documentation(urls: list[str]) -> list[dict]:
"""Scrape multiple documentation pages."""
documents = []
async with AsyncWebCrawler() as crawler:
results = await crawler.arun_many(urls=urls, max_concurrent=5)
for result in results:
if result.success:
documents.append({
"url": result.url,
"title": result.metadata.get("title", ""),
"content": result.markdown,
"scraped_at": datetime.now().isoformat()
})
return documents
# Crawl an entire documentation site
async def crawl_site(start_url: str, max_pages: int = 50) -> list[dict]:
"""Crawl a site following internal links."""
documents = []
async with AsyncWebCrawler() as crawler:
# First, get all URLs
result = await crawler.arun(url=start_url)
internal_links = [
link["href"] for link in result.links.get("internal", [])
][:max_pages]
# Then scrape each page
results = await crawler.arun_many(urls=internal_links, max_concurrent=5)
for result in results:
if result.success and len(result.markdown) > 100:
documents.append({
"url": result.url,
"title": result.metadata.get("title", ""),
"content": result.markdown,
})
return documents
Using Firecrawl (Managed API)
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key="fc-your-key")
def crawl_with_firecrawl(url: str, limit: int = 50) -> list[dict]:
"""Crawl a site using Firecrawl."""
result = app.crawl_url(url, params={
"limit": limit,
"scrapeOptions": {
"formats": ["markdown"],
"onlyMainContent": True
}
})
documents = []
for page in result.get("data", []):
documents.append({
"url": page.get("metadata", {}).get("sourceURL", ""),
"title": page.get("metadata", {}).get("title", ""),
"content": page.get("markdown", ""),
})
return documents
Using Proxies for Reliable Scraping
For scraping multiple sources reliably, use proxies:
async with AsyncWebCrawler(
proxy="http://user:pass@proxy:8080"
) as crawler:
results = await crawler.arun_many(urls=urls)
Step 2: Content Processing & Chunking
Why Chunking Matters
LLMs have limited context windows. You can’t feed an entire website into a prompt. Chunking splits content into pieces that:
- Fit within embedding model limits (typically 512-8192 tokens)
- Contain enough context to be meaningful
- Are small enough for precise retrieval
Recursive Character Splitting
from langchain.text_splitter import RecursiveCharacterTextSplitter
def chunk_documents(documents: list[dict], chunk_size: int = 1000, overlap: int = 200) -> list[dict]:
"""Split documents into chunks with metadata."""
splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=overlap,
separators=["\n## ", "\n### ", "\n\n", "\n", " "]
)
chunks = []
for doc in documents:
splits = splitter.split_text(doc["content"])
for i, split in enumerate(splits):
chunks.append({
"text": split,
"metadata": {
"source_url": doc["url"],
"title": doc["title"],
"chunk_index": i,
"total_chunks": len(splits)
}
})
return chunks
Semantic Chunking
For better retrieval quality, split by semantic boundaries:
def semantic_chunk(text: str, max_tokens: int = 500) -> list[str]:
"""Split by markdown headers for semantic coherence."""
sections = []
current_section = ""
for line in text.split("\n"):
if line.startswith("## ") and current_section:
if len(current_section.split()) > 50: # Minimum size
sections.append(current_section.strip())
current_section = line + "\n"
else:
current_section += line + "\n"
if current_section.strip():
sections.append(current_section.strip())
# Further split sections that are too large
final_chunks = []
splitter = RecursiveCharacterTextSplitter(chunk_size=max_tokens * 4)  # ~4 characters per token
for section in sections:
if len(section.split()) > max_tokens:
final_chunks.extend(splitter.split_text(section))
else:
final_chunks.append(section)
return final_chunks
Chunking Best Practices
| Parameter | Recommendation | Why |
|---|---|---|
| Chunk size | 500-1,500 tokens | Balances context vs precision |
| Overlap | 10-20% of chunk size | Preserves cross-boundary context |
| Separators | Headers > paragraphs > sentences | Maintains semantic coherence |
| Min size | 50 tokens | Avoids meaningless fragments |
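One caveat: RecursiveCharacterTextSplitter measures chunk_size in characters, while the recommendations above are in tokens. If you want token-based sizing, LangChain's splitter can count with tiktoken. A minimal sketch (the sample text is a placeholder):
from langchain.text_splitter import RecursiveCharacterTextSplitter

long_markdown = "## Section\n\nScraped markdown content goes here..."  # placeholder
# chunk_size and chunk_overlap are now counted in tokens rather than characters
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # tokenizer used by recent OpenAI models
    chunk_size=800,
    chunk_overlap=100,
)
token_chunks = splitter.split_text(long_markdown)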
Step 3: Embedding Generation
Using OpenAI Embeddings
from openai import OpenAI
client = OpenAI()
def generate_embeddings(chunks: list[dict], model: str = "text-embedding-3-small") -> list[dict]:
"""Generate embeddings for all chunks."""
texts = [chunk["text"] for chunk in chunks]
# Process in batches of 100
all_embeddings = []
for i in range(0, len(texts), 100):
batch = texts[i:i + 100]
response = client.embeddings.create(model=model, input=batch)
embeddings = [item.embedding for item in response.data]
all_embeddings.extend(embeddings)
# Attach embeddings to chunks
for chunk, embedding in zip(chunks, all_embeddings):
chunk["embedding"] = embedding
return chunks
Using Free Local Embeddings
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
def generate_local_embeddings(chunks: list[dict]) -> list[dict]:
"""Generate embeddings locally (free)."""
texts = [chunk["text"] for chunk in chunks]
embeddings = model.encode(texts, show_progress_bar=True)
for chunk, embedding in zip(chunks, embeddings):
chunk["embedding"] = embedding.tolist()
return chunksEmbedding Model Comparison
| Model | Dimensions | Cost | Quality |
|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | $0.02/1M tokens | Good |
| OpenAI text-embedding-3-large | 3072 | $0.13/1M tokens | Best |
| all-MiniLM-L6-v2 | 384 | Free (local) | Good |
| nomic-embed-text (Ollama) | 768 | Free (local) | Good |
| Cohere embed-v3 | 1024 | $0.10/1M tokens | Excellent |
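To use the Ollama row from the table, you can call Ollama's local REST API directly. A minimal sketch, assuming Ollama is running on its default port with the nomic-embed-text model already pulled:
import requests

def generate_ollama_embeddings(chunks: list[dict], model: str = "nomic-embed-text") -> list[dict]:
    """Generate embeddings locally via Ollama's REST API (free)."""
    for chunk in chunks:
        resp = requests.post(
            "http://localhost:11434/api/embeddings",
            json={"model": model, "prompt": chunk["text"]},
        )
        resp.raise_for_status()
        chunk["embedding"] = resp.json()["embedding"]
    return chunks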
Step 4: Vector Database Storage
Using ChromaDB (Local, Free)
import chromadb
def create_vector_store(chunks: list[dict], collection_name: str = "web_rag"):
"""Store chunks in ChromaDB."""
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
name=collection_name,
metadata={"hnsw:space": "cosine"}
)
# Add chunks
collection.add(
ids=[f"chunk_{i}" for i in range(len(chunks))],
embeddings=[c["embedding"] for c in chunks],
documents=[c["text"] for c in chunks],
metadatas=[c["metadata"] for c in chunks]
)
return collection
def search_vector_store(collection, query: str, n_results: int = 5) -> list[dict]:
"""Search for relevant chunks."""
# Generate query embedding (client here is the OpenAI client created in Step 3)
query_embedding = client.embeddings.create(
model="text-embedding-3-small",
input=query
).data[0].embedding
results = collection.query(
query_embeddings=[query_embedding],
n_results=n_results
)
return [
{"text": doc, "metadata": meta, "distance": dist}
for doc, meta, dist in zip(
results["documents"][0],
results["metadatas"][0],
results["distances"][0]
)
]
Using Pinecone (Cloud, Managed)
from pinecone import Pinecone, ServerlessSpec
pc = Pinecone(api_key="your-pinecone-key")
def create_pinecone_index(chunks: list[dict], index_name: str = "web-rag"):
"""Store chunks in Pinecone."""
# Create index if it doesn't exist
if index_name not in pc.list_indexes().names():
pc.create_index(
name=index_name,
dimension=1536,  # Must match the embedding model's output dimensions
metric="cosine",
spec=ServerlessSpec(cloud="aws", region="us-east-1")  # required by recent Pinecone SDKs; choose your own cloud/region
)
index = pc.Index(index_name)
# Upsert in batches
batch_size = 100
for i in range(0, len(chunks), batch_size):
batch = chunks[i:i + batch_size]
vectors = [
{
"id": f"chunk_{i + j}",
"values": chunk["embedding"],
"metadata": {
**chunk["metadata"],
"text": chunk["text"][:1000] # Store text in metadata
}
}
for j, chunk in enumerate(batch)
]
index.upsert(vectors=vectors)
return index
Step 5: Retrieval & Generation
The RAG Query Function
from openai import OpenAI
client = OpenAI()
def rag_query(question: str, collection, n_results: int = 5) -> str:
"""Answer a question using RAG."""
# Step 1: Retrieve relevant chunks
relevant_chunks = search_vector_store(collection, question, n_results)
# Step 2: Build context
context = "\n\n---\n\n".join([
f"Source: {chunk['metadata']['source_url']}\n{chunk['text']}"
for chunk in relevant_chunks
])
# Step 3: Generate answer
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": """You are a helpful assistant that answers questions based on
the provided context. Only use information from the context to answer.
If the context doesn't contain the answer, say so.
Always cite your sources by mentioning the source URL."""
},
{
"role": "user",
"content": f"""Context:
{context}
Question: {question}
Answer based on the context above:"""
}
],
temperature=0.3
)
return response.choices[0].message.content
Example Usage
# Build the pipeline (run inside an async context, or wrap the await in asyncio.run)
documents = await scrape_documentation([
"https://docs.firecrawl.dev/introduction",
"https://docs.firecrawl.dev/api/scrape",
"https://docs.firecrawl.dev/api/crawl",
])
chunks = chunk_documents(documents)
chunks = generate_embeddings(chunks)
collection = create_vector_store(chunks)
# Query
answer = rag_query(
"How do I use Firecrawl's extract mode with a custom schema?",
collection
)
print(answer)
Complete Pipeline Implementation
All-in-One RAG Pipeline
import asyncio
from datetime import datetime
from crawl4ai import AsyncWebCrawler
from openai import OpenAI
import chromadb
from langchain.text_splitter import RecursiveCharacterTextSplitter
class WebRAGPipeline:
def __init__(self, openai_key: str = None, db_path: str = "./rag_db"):
self.llm_client = OpenAI(api_key=openai_key)
self.chroma_client = chromadb.PersistentClient(path=db_path)
self.splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, chunk_overlap=200
)
async def ingest(self, urls: list[str], collection_name: str = "default"):
"""Scrape URLs and add to vector store."""
# Scrape
async with AsyncWebCrawler() as crawler:
results = await crawler.arun_many(urls=urls, max_concurrent=5)
documents = [
{"url": r.url, "title": r.metadata.get("title", ""), "content": r.markdown}
for r in results if r.success
]
# Chunk
chunks = []
for doc in documents:
splits = self.splitter.split_text(doc["content"])
for i, split in enumerate(splits):
chunks.append({
"text": split,
"metadata": {"source_url": doc["url"], "title": doc["title"]}
})
# Embed
texts = [c["text"] for c in chunks]
embeddings = []
for i in range(0, len(texts), 100):
batch = texts[i:i+100]
resp = self.llm_client.embeddings.create(
model="text-embedding-3-small", input=batch
)
embeddings.extend([d.embedding for d in resp.data])
# Store
collection = self.chroma_client.get_or_create_collection(collection_name)
collection.add(
ids=[f"chunk_{datetime.now().timestamp()}_{i}" for i in range(len(chunks))],
embeddings=embeddings,
documents=texts,
metadatas=[c["metadata"] for c in chunks]
)
return len(chunks)
def query(self, question: str, collection_name: str = "default", k: int = 5) -> str:
"""Query the RAG pipeline."""
collection = self.chroma_client.get_collection(collection_name)
# Embed query
query_emb = self.llm_client.embeddings.create(
model="text-embedding-3-small", input=question
).data[0].embedding
# Search
results = collection.query(query_embeddings=[query_emb], n_results=k)
context = "\n\n---\n\n".join([
f"Source: {meta['source_url']}\n{doc}"
for doc, meta in zip(results["documents"][0], results["metadatas"][0])
])
# Generate
response = self.llm_client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": "Answer using only the provided context. Cite sources."},
{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
],
temperature=0.3
)
return response.choices[0].message.content
# Usage
pipeline = WebRAGPipeline()
# Ingest documentation
asyncio.run(pipeline.ingest([
"https://docs.example.com/getting-started",
"https://docs.example.com/api-reference",
"https://docs.example.com/tutorials",
]))
# Query
answer = pipeline.query("How do I authenticate API requests?")
print(answer)
Keeping Data Fresh
Scheduled Re-Scraping
import schedule
import time
source_urls = ["https://docs.example.com/getting-started"]  # the pages to keep fresh
def refresh_knowledge_base():
"""Re-scrape sources and update the vector store."""
asyncio.run(pipeline.ingest(source_urls, collection_name="docs"))
print(f"Knowledge base refreshed at {datetime.now()}")
# Run daily at 6 AM
schedule.every().day.at("06:00").do(refresh_knowledge_base)
while True:
schedule.run_pending()
time.sleep(60)
Incremental Updates
import hashlib
async def incremental_update(pipeline, urls, collection_name="default"):
"""Only re-scrape pages that have changed."""
collection = pipeline.chroma_client.get_collection(collection_name)
for url in urls:
# Check if we already have this URL
existing = collection.get(
where={"source_url": url},
include=["metadatas"]
)
# Scrape the page
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url=url)
if result.success:
# Compare content hash (assumes ingest() also stores a "content_hash" field in each chunk's metadata)
new_hash = hashlib.md5(result.markdown.encode()).hexdigest()
if existing["ids"] and existing["metadatas"][0].get("content_hash") == new_hash:
continue # Content unchanged, skip
# Delete old chunks for this URL
if existing["ids"]:
collection.delete(ids=existing["ids"])
# Add new chunks
await pipeline.ingest([url], collection_name)
Advanced Techniques
Hybrid Search (Vector + Keyword)
def hybrid_search(collection, query: str, k: int = 5) -> list:
"""Combine vector similarity with keyword matching."""
# Vector search
vector_results = search_vector_store(collection, query, k * 2)
# Keyword filter
keywords = query.lower().split()
scored_results = []
for result in vector_results:
keyword_score = sum(
1 for kw in keywords if kw in result["text"].lower()
) / len(keywords)
combined_score = (1 - result["distance"]) * 0.7 + keyword_score * 0.3
result["combined_score"] = combined_score
scored_results.append(result)
scored_results.sort(key=lambda x: x["combined_score"], reverse=True)
return scored_results[:k]
Query Expansion
import json
def expand_query(query: str) -> list[str]:
"""Generate multiple search queries from the original."""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": "Generate 3 alternative phrasings of this search query. Return as JSON array of strings."
},
{"role": "user", "content": query}
],
response_format={"type": "json_object"}
)
return json.loads(response.choices[0].message.content)["queries"]
Re-Ranking
def rerank_results(query: str, results: list[dict], top_k: int = 3) -> list[dict]:
"""Use LLM to re-rank retrieved chunks by relevance."""
context = "\n".join([
f"[{i}] {r['text'][:200]}" for i, r in enumerate(results)
])
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": f"Rank these {len(results)} passages by relevance to the query. Return the indices in order of relevance as a JSON array."
},
{
"role": "user",
"content": f"Query: {query}\n\nPassages:\n{context}"
}
],
response_format={"type": "json_object"}
)
indices = json.loads(response.choices[0].message.content)["ranking"]
return [results[i] for i in indices[:top_k]]
Vector Database Options
| Database | Type | Free Tier | Best For |
|---|---|---|---|
| ChromaDB | Local/embedded | Fully free | Development, small projects |
| Pinecone | Cloud managed | 1 index, 100K vectors | Production, managed |
| Weaviate | Self-hosted/cloud | Self-hosted free | Full-featured, hybrid search |
| Qdrant | Self-hosted/cloud | Self-hosted free | Performance, filtering |
| pgvector | PostgreSQL extension | Free (with Postgres) | If you already use Postgres |
| Supabase | Cloud (pgvector) | 500MB free | Supabase users |
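If you already run Postgres, the pgvector row needs little more than the extension and a vector column. A minimal sketch using the psycopg driver; the database name, table layout, and 1536-dimension size (matching text-embedding-3-small) are illustrative:
import psycopg  # psycopg 3

conn = psycopg.connect("dbname=rag user=postgres")
conn.autocommit = True
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id bigserial PRIMARY KEY,
        source_url text,
        content text,
        embedding vector(1536)
    )
""")

def to_pgvector(embedding: list[float]) -> str:
    # pgvector accepts a '[x,y,...]' string literal cast to ::vector
    return "[" + ",".join(str(x) for x in embedding) + "]"

def insert_chunk(url: str, text: str, embedding: list[float]) -> None:
    conn.execute(
        "INSERT INTO chunks (source_url, content, embedding) VALUES (%s, %s, %s::vector)",
        (url, text, to_pgvector(embedding)),
    )

def search_chunks(query_embedding: list[float], k: int = 5) -> list[tuple]:
    # <=> is pgvector's cosine-distance operator (smaller = more similar)
    return conn.execute(
        "SELECT source_url, content FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
        (to_pgvector(query_embedding), k),
    ).fetchall()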
Production Considerations
Scaling
- Scraping: Use proxy rotation for reliability at scale
- Embedding: Batch embeddings and use async processing
- Storage: Use a managed vector database for production
- Querying: Cache common queries, use streaming responses
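For the query caching mentioned above, a small in-memory cache keyed by the normalized question goes a long way; swap it for Redis or similar in multi-process deployments. A sketch built on the WebRAGPipeline class from earlier:
import hashlib

_query_cache: dict[str, str] = {}

def cached_query(pipeline: WebRAGPipeline, question: str, collection_name: str = "default") -> str:
    """Serve repeated questions from memory instead of re-running retrieval and generation."""
    key = hashlib.sha256(question.strip().lower().encode()).hexdigest()
    if key not in _query_cache:
        _query_cache[key] = pipeline.query(question, collection_name)
    return _query_cache[key]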
Monitoring
Track these metrics:
- Retrieval relevance (are the right chunks returned?)
- Answer quality (are responses accurate?)
- Latency (end-to-end query time)
- Data freshness (when was the source last scraped?)
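Latency and freshness are the easiest of these to instrument directly. A minimal sketch that wraps the pipeline's query method and emits one structured log record per request (retrieval relevance and answer quality need an evaluation step, covered in the FAQ below):
import json
import logging
import time
from datetime import datetime

logging.basicConfig(level=logging.INFO)

def monitored_query(pipeline, question: str, collection_name: str = "default") -> str:
    start = time.perf_counter()
    answer = pipeline.query(question, collection_name)
    logging.info(json.dumps({
        "question": question,
        "latency_s": round(time.perf_counter() - start, 2),
        "collection": collection_name,
        "answered_at": datetime.now().isoformat(),
    }))
    return answer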
Cost Estimation
| Component | Cost (1,000 pages) |
|---|---|
| Scraping (Crawl4ai) | Free |
| Scraping (Firecrawl) | ~$20 |
| Embeddings (OpenAI small) | ~$0.05 |
| Embeddings (local) | Free |
| Vector DB (Chroma) | Free |
| Vector DB (Pinecone) | Free tier |
| Query LLM (GPT-4o-mini) | ~$0.001/query |
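As a rough sanity check on the embedding line: if 1,000 pages average around 2,000 tokens each, that is roughly 2M tokens, and at $0.02 per 1M tokens the small OpenAI model comes to about $0.04, in line with the ~$0.05 estimate above. The per-page token count is an assumption; long documentation pages can easily double it.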
FAQ
How many documents should I include in a RAG pipeline?
Start with 50-100 high-quality documents and expand based on retrieval quality. More documents improve coverage but can reduce precision. Focus on quality over quantity — well-chunked, relevant content performs better than a massive but noisy corpus.
What’s the best chunk size for RAG?
500-1,000 tokens is the sweet spot for most use cases. Smaller chunks (200-500) improve precision but lose context. Larger chunks (1,000-2,000) maintain context but may dilute relevant information. Experiment with your specific content.
Can I use RAG with local models?
Yes. Use Ollama for the LLM, sentence-transformers for embeddings, and ChromaDB for vector storage. The entire pipeline runs locally with zero API costs. Accuracy may be slightly lower than cloud models, but it’s excellent for privacy-sensitive applications.
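A minimal fully local sketch under those assumptions (Ollama running on its default port with llama3.1 pulled; the collection name is illustrative, and the stored chunks are assumed to be embedded with the same sentence-transformers model):
import chromadb
import requests
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
collection = chromadb.PersistentClient(path="./rag_db").get_or_create_collection("local_rag")

def local_rag_query(question: str, k: int = 5) -> str:
    # Embed the query locally and retrieve the closest chunks
    query_emb = embedder.encode([question])[0].tolist()
    results = collection.query(query_embeddings=[query_emb], n_results=k)
    context = "\n\n".join(results["documents"][0])
    # Generate the answer with a local model through Ollama's REST API
    resp = requests.post("http://localhost:11434/api/generate", json={
        "model": "llama3.1",
        "prompt": f"Answer using only this context:\n{context}\n\nQuestion: {question}",
        "stream": False,
    })
    resp.raise_for_status()
    return resp.json()["response"]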
How do I measure RAG quality?
Track retrieval precision (are the returned chunks relevant?), answer faithfulness (does the answer match the source?), and answer relevance (does it actually answer the question?). Tools like Ragas and DeepEval provide automated RAG evaluation metrics.
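For automated scoring, a minimal Ragas sketch looks roughly like the following; the exact API differs between Ragas versions (this follows the 0.1-style interface) and the sample data is illustrative:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

eval_data = Dataset.from_dict({
    "question": ["How do I authenticate API requests?"],
    "answer": ["Send a bearer token in the Authorization header."],
    "contexts": [["The API expects a bearer token in the Authorization header."]],
})
# Uses an LLM as the judge, so an OPENAI_API_KEY (or a configured judge model) is required
scores = evaluate(eval_data, metrics=[faithfulness, answer_relevancy])
print(scores)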
Which scraper works best for RAG pipelines?
Crawl4ai is ideal for free, high-volume scraping with clean markdown output. Firecrawl is better for sites that need JavaScript rendering and anti-bot handling. Both produce the clean markdown that RAG pipelines need. See our AI web scraper comparison for more options.
Related Reading
- AI Web Scraper with Python: Build Your Own
- Best AI Web Scrapers 2026: Complete Comparison
- Agentic Browsers Explained: Browserbase, Browser Use, and Proxy Infrastructure
- Agentic Browsers Explained: The Future of AI + Proxies in 2026
- How AI Agents Use Proxies for Real-Time Web Data Collection in 2026
- Mobile Proxies for AI Data Collection: Web Scraping for Training Data