Build a RAG Data Pipeline with Firecrawl and LangChain (Python 2026)

firecrawl crawls a website and returns clean markdown ready for embedding. langchain handles the chunking, embedding, vector storage, and retrieval-augmented question answering. together they let you build a production rag pipeline in under 100 lines of python: scrape a documentation site, chunk and embed the content, store it in a local vector database, and ask questions that get grounded answers with source citations. the full setup runs on your laptop in under 10 minutes.

retrieval-augmented generation is how most production llm apps work in 2026. you don’t ask a model to know everything; you give it the relevant context from your own data. for context built from web sources (docs sites, blogs, knowledge bases, sec filings), the bottleneck is getting clean, structured content out of the web. firecrawl removes that bottleneck.

this tutorial builds a working rag pipeline end-to-end. by the end you’ll have a chatbot that answers questions about any documentation site you point it at, with citations back to source urls.

what you’ll build

a python script that:
1. crawls a target documentation or content site with firecrawl, getting clean markdown
2. chunks the markdown into roughly 1000-character segments (about 250 tokens each) with langchain
3. embeds each chunk with openai or a local embedding model
4. stores embeddings in chromadb (local) or pinecone/qdrant (cloud)
5. accepts a user question, retrieves the top-k relevant chunks
6. passes those chunks plus the question to a chat model
7. returns the answer with source url citations

real production rag setups are more complex (re-ranking, query rewriting, hybrid search, evaluation) but this skeleton is the foundation everyone builds on.

prerequisites

you need:
– python 3.10+
– a firecrawl api key from firecrawl.dev. free tier gives 500 credits, enough for the tutorial.
– an openai api key, or you can swap in a local model later
– about 10 minutes

pip install firecrawl-py langchain langchain-openai langchain-community langchain-chroma chromadb tiktoken

set environment variables:

export FIRECRAWL_API_KEY="fc-your-key"
export OPENAI_API_KEY="sk-your-key"

step 1: crawl with firecrawl

firecrawl has two relevant endpoints: scrape_url for a single page, and crawl_url for a whole site. for rag, you almost always want the crawl.

import os
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])

crawl_result = app.crawl_url(
    "https://docs.python.org/3/library/asyncio.html",
    params={
        "limit": 50,
        "scrapeOptions": {
            "formats": ["markdown"],
            "onlyMainContent": True,
        },
    },
    poll_interval=5,
)

pages = crawl_result["data"]
print(f"crawled {len(pages)} pages")
for p in pages[:3]:
    print(" -", p["metadata"]["sourceURL"])

limit: 50 caps the crawl to 50 pages. for prototype work this is plenty. onlyMainContent: True strips navigation, headers, footers, and ad blocks, leaving clean article markdown.

each page in pages has:
– markdown: the cleaned content
– metadata.sourceURL: the original url
– metadata.title, metadata.description: page metadata
– metadata.statusCode: http status

if you only need a single page or a known list of urls, swap crawl_url for scrape_url in a loop, as in the sketch below.
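
a minimal sketch of that single-url path, assuming the same firecrawl-py client as above and a hand-picked url list; the exact return shape of scrape_url varies by sdk version, so check that it gives you markdown and metadata keys like the crawl output does:

urls = [
    "https://docs.python.org/3/library/asyncio-task.html",
    "https://docs.python.org/3/library/asyncio-queue.html",
]

pages = []
for url in urls:
    # one page per call: clean markdown plus metadata
    page = app.scrape_url(url, params={
        "formats": ["markdown"],
        "onlyMainContent": True,
    })
    pages.append(page)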

step 2: chunk the markdown

you can’t just throw a 50-page document at an embedding model. you need to split it into chunks. langchain’s RecursiveCharacterTextSplitter is the standard choice.

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    separators=["\n\n", "\n", ". ", " ", ""],
)

docs = []
for page in pages:
    if not page.get("markdown"):
        continue
    chunks = splitter.split_text(page["markdown"])
    for chunk in chunks:
        docs.append(Document(
            page_content=chunk,
            metadata={
                "source": page["metadata"]["sourceURL"],
                "title": page["metadata"].get("title", ""),
            },
        ))

print(f"created {len(docs)} chunks from {len(pages)} pages")

chunk_size=1000 means each chunk is up to 1000 characters (roughly 200-300 tokens). chunk_overlap=150 means each chunk shares 150 characters with the next, so context isn’t lost at chunk boundaries. these are sensible defaults for documentation content. for narrative text or papers, larger chunks (1500-2000 chars) work better.

the separators list tells the splitter where to break, in order of preference. paragraph breaks first, then line breaks, then sentences, then words.
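
tiktoken is already in the install list; if you want chunk sizes measured in tokens instead of characters, the splitter has a tiktoken-backed constructor. a sketch, with sizes that are illustrative rather than tuned:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# length is counted in cl100k_base tokens, not characters
token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=300,
    chunk_overlap=50,
)
token_chunks = token_splitter.split_text(pages[0]["markdown"])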

step 3: embed and store

embed the chunks and store them in chromadb. chroma runs locally with no setup.

from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings,
    persist_directory="./chroma_db",
    collection_name="asyncio_docs",
)
print(f"stored {vectorstore._collection.count()} embeddings in chroma")

text-embedding-3-small is openai’s cheap embedding model. $0.02 per million tokens in 2026. for 50 documentation pages with ~50k tokens total, the embedding cost is about $0.001. negligible.

for local embeddings (no api cost, runs on your machine), install langchain-huggingface and sentence-transformers, then swap to:

from langchain_huggingface import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")

bge-small-en-v1.5 is a 384-dim embedding model that runs fast on cpu and rivals openai’s small for english docs.

persist_directory="./chroma_db" saves the database to disk so you don’t re-embed on every run.

step 4: build the retrieval qa chain

the retrieval part. langchain has a few patterns for this. the modern lcel (langchain expression language) approach:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

prompt = ChatPromptTemplate.from_messages([
    ("system", "you are a helpful assistant. answer the question using only the context below. cite sources by url at the end. if the context doesn't have the answer, say so."),
    ("human", "context:\n{context}\n\nquestion: {question}"),
])

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def format_docs(docs):
    return "\n\n".join(
        f"[source: {d.metadata['source']}]\n{d.page_content}"
        for d in docs
    )

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

answer = chain.invoke("what is asyncio.gather and when should i use it?")
print(answer)

k=5 retrieves the top 5 most similar chunks. for documentation sites this is usually enough. for less structured content, k=10 or higher.

temperature=0 makes the output deterministic and keeps the model close to the retrieved context. it doesn’t eliminate hallucination, but for rag you almost always want low temperature.

the system prompt instructs the model to cite sources by url. you’ll see urls at the end of each answer, traceable back to the original docs page.

step 5: full working script

putting it all together. this is the entire pipeline in one file.

import os
from firecrawl import FirecrawlApp
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# configuration
TARGET_URL = "https://docs.python.org/3/library/asyncio.html"
CRAWL_LIMIT = 30
COLLECTION = "asyncio_rag"
PERSIST_DIR = "./chroma_db"

def crawl(url, limit):
    app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
    result = app.crawl_url(url, params={
        "limit": limit,
        "scrapeOptions": {"formats": ["markdown"], "onlyMainContent": True},
    }, poll_interval=5)
    return result["data"]

def to_documents(pages):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000, chunk_overlap=150,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    docs = []
    for page in pages:
        if not page.get("markdown"):
            continue
        for chunk in splitter.split_text(page["markdown"]):
            docs.append(Document(
                page_content=chunk,
                metadata={
                    "source": page["metadata"]["sourceURL"],
                    "title": page["metadata"].get("title", ""),
                },
            ))
    return docs

def build_or_load_vectorstore(docs):
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    if os.path.exists(PERSIST_DIR):
        return Chroma(
            collection_name=COLLECTION,
            embedding_function=embeddings,
            persist_directory=PERSIST_DIR,
        )
    return Chroma.from_documents(
        documents=docs,
        embedding=embeddings,
        persist_directory=PERSIST_DIR,
        collection_name=COLLECTION,
    )

def build_chain(vectorstore):
    retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
    prompt = ChatPromptTemplate.from_messages([
        ("system", "you are a helpful assistant. answer using only the context. cite sources by url. if the context doesn't have the answer, say you don't know."),
        ("human", "context:\n{context}\n\nquestion: {question}"),
    ])
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

    def format_docs(docs):
        return "\n\n".join(f"[source: {d.metadata['source']}]\n{d.page_content}" for d in docs)

    return (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt | llm | StrOutputParser()
    )

def main():
    if not os.path.exists(PERSIST_DIR):
        print("crawling and embedding...")
        pages = crawl(TARGET_URL, CRAWL_LIMIT)
        docs = to_documents(pages)
        vs = build_or_load_vectorstore(docs)
        print(f"indexed {vs._collection.count()} chunks")
    else:
        print("loading existing index...")
        vs = build_or_load_vectorstore(None)

    chain = build_chain(vs)

    print("\nask questions. ctrl-c to quit.\n")
    while True:
        try:
            q = input("you: ").strip()
            if not q:
                continue
            print(f"bot: {chain.invoke(q)}\n")
        except KeyboardInterrupt:
            break

if __name__ == "__main__":
    main()

run it:

python rag.py

first run crawls and embeds. subsequent runs reuse the persisted index. you get a working command-line chatbot grounded in your target site’s content.

production considerations

the script above works. it’s not production-ready. things you’d want to add:

re-ranking. the top-5 by vector similarity isn’t always the most relevant 5. add a reranker like cohere rerank or BAAI/bge-reranker-base to reorder retrieved chunks before passing to the llm. dramatic quality improvement.
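
a minimal sketch of the local option, assuming sentence-transformers is installed and bge-reranker-base used as a cross-encoder; it rescores the retrieved chunks against the question before they go into the prompt:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(question, candidates, top_n=5):
    # score every (question, chunk) pair, keep the best top_n
    scores = reranker.predict([(question, d.page_content) for d in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_n]]

# retrieve more than you need, then let the reranker pick
wide_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
question = "what is asyncio.gather and when should i use it?"
top_docs = rerank(question, wide_retriever.invoke(question))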

hybrid search. vector similarity misses keyword-exact matches. add bm25 search alongside vector search and combine results. langchain has EnsembleRetriever for this.
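
a sketch with langchain's EnsembleRetriever, assuming the rank_bm25 package is installed for the keyword side; the weights are a starting point, not a tuned value:

from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

bm25 = BM25Retriever.from_documents(docs)   # keyword-exact matching over the same chunks
bm25.k = 5
vector = vectorstore.as_retriever(search_kwargs={"k": 5})

# fuses both result lists with the given weights
hybrid = EnsembleRetriever(retrievers=[bm25, vector], weights=[0.4, 0.6])
results = hybrid.invoke("asyncio.gather return_exceptions")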

incremental updates. real docs sites change. add a scheduler that re-crawls weekly and updates only changed pages by comparing content hashes.
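
a sketch of the hash check with plain hashlib, using a json file as the record of the last crawl; the file name and layout are placeholders:

import hashlib, json, os

HASH_FILE = "./content_hashes.json"

def changed_pages(pages):
    old = json.load(open(HASH_FILE)) if os.path.exists(HASH_FILE) else {}
    new, changed = {}, []
    for page in pages:
        if not page.get("markdown"):
            continue
        url = page["metadata"]["sourceURL"]
        digest = hashlib.sha256(page["markdown"].encode()).hexdigest()
        new[url] = digest
        if old.get(url) != digest:
            changed.append(page)   # new or modified since the last crawl
    with open(HASH_FILE, "w") as f:
        json.dump(new, f)
    return changed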

rate limiting and retries. firecrawl’s free tier is 500 credits. crawls can take minutes. add backoff and retry logic for production reliability.
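
a minimal retry wrapper with exponential backoff, reusing the crawl() function from the full script; no extra dependency, and the delays are arbitrary starting values:

import time

def crawl_with_retries(url, limit, attempts=3):
    delay = 10
    for attempt in range(1, attempts + 1):
        try:
            return crawl(url, limit)
        except Exception as exc:   # network or firecrawl errors
            if attempt == attempts:
                raise
            print(f"crawl failed ({exc}), retrying in {delay}s")
            time.sleep(delay)
            delay *= 2             # back off: 10s, 20s, 40s, ...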

monitoring. log every query, every retrieved chunk, every llm response. you’ll learn what’s working and what isn’t only by looking at real interactions.
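
a sketch of the logging layer, writing one json line per query so real interactions can be replayed later; the file name is a placeholder, and note the retriever runs once here and again inside the chain, which is fine for a sketch but worth deduplicating in production:

import json, time

def answer_logged(chain, retriever, question, log_path="rag_log.jsonl"):
    retrieved = retriever.invoke(question)
    answer = chain.invoke(question)
    with open(log_path, "a") as f:
        f.write(json.dumps({
            "ts": time.time(),
            "question": question,
            "sources": [d.metadata["source"] for d in retrieved],
            "answer": answer,
        }) + "\n")
    return answer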

evaluation. create a set of 50-100 reference q&a pairs for your domain. run them through the pipeline weekly. measure accuracy. iterate on chunk size, k, and prompts based on the data.
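
a bare-bones version of that loop, using a crude string-containment check as the metric; real setups use an llm judge or a framework like ragas, and the eval file here is hypothetical:

import json

def evaluate(chain, eval_path="eval_set.json"):
    # eval_set.json: [{"question": "...", "expected": "asyncio.gather"}, ...]
    cases = json.load(open(eval_path))
    hits = 0
    for case in cases:
        answer = chain.invoke(case["question"]).lower()
        if case["expected"].lower() in answer:
            hits += 1
    print(f"{hits}/{len(cases)} answers contained the expected fact")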

for the broader landscape of scraping apis you might use instead of firecrawl, see the scraping apis comparison. for the side-by-side with crawl4ai and jina, see firecrawl vs crawl4ai vs jina.

swap-in alternatives

the architecture is modular. each component has alternatives.

crawler: firecrawl (hosted), crawl4ai (self-hosted), jina reader (free public api), playwright (full diy)

chunker: recursivecharacter (default), token-based splitting (more accurate token counts), semantic chunkers (langchain’s SemanticChunker, llamaindex’s node parsers)

embedding model: openai text-embedding-3-small ($0.02/1m tokens), text-embedding-3-large ($0.13/1m, better quality), voyage ai (hosted alternative), or local and free but slower: bge-small, bge-large

vector store: chroma (local file-based), qdrant (open source, scalable), pinecone (hosted, premium), pgvector (postgres extension), weaviate, milvus

chat model: gpt-4o-mini (cheap, fast), gpt-4o (better quality), claude 3.7 sonnet (best for reasoning), claude haiku (cheapest), local llama 3.1 via ollama (free, privacy)

mix and match based on cost, latency, privacy, and quality requirements. the langchain abstractions make swapping any one component a 1-2 line change.
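
for example, moving from chroma to qdrant is one import and one constructor swap, assuming qdrant-client is installed (newer langchain releases move this class into the langchain-qdrant package):

from langchain_community.vectorstores import Qdrant

vectorstore = Qdrant.from_documents(
    documents=docs,
    embedding=embeddings,
    location=":memory:",          # or url="http://localhost:6333" for a real instance
    collection_name="asyncio_rag",
)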

cost estimate at scale

for a small team building a docs chatbot:

  • firecrawl: $19/month covers the crawl
  • embeddings: $0.50 to embed 25m tokens of crawled text (about 5000 docs pages)
  • vector store: free if local chroma, $20/month for a hosted starter (qdrant cloud, pinecone)
  • llm queries: gpt-4o-mini at $0.15/$0.60 per 1m input/output tokens. with k=5 chunks, average query costs ~$0.001-0.003.

monthly all-in for a chatbot serving 10k queries on a 50-page site: under $50.

faq

why use firecrawl instead of just requests + beautifulsoup?
firecrawl handles javascript-rendered pages, anti-bot challenges, sitemap traversal, and clean markdown extraction in one api call. doing those four things yourself takes weeks.

how big should chunks be for rag?
500-1000 characters (roughly 100-250 tokens) is the sweet spot for most documentation content. larger for narrative or papers, smaller for q&a-style content.

which is better, openai or local embeddings?
openai text-embedding-3-small wins on quality and is cheap enough that cost is rarely the deciding factor. use local embeddings if privacy or air-gapped deployment matters. bge-large is the strongest local option in 2026.

can i use claude instead of gpt-4o-mini for the rag chain?
yes. swap ChatOpenAI for ChatAnthropic in the chain. claude haiku is comparable in cost to gpt-4o-mini. claude sonnet 3.7 produces stronger answers but costs ~5x more.
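
the swap in code, assuming the langchain-anthropic package is installed; the model id is whatever current haiku alias anthropic publishes:

from langchain_anthropic import ChatAnthropic

# drop-in replacement for ChatOpenAI in build_chain()
llm = ChatAnthropic(model="claude-3-5-haiku-latest", temperature=0)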

should i use chroma in production?
chroma is fine for under 1m vectors and a single-process app. for production scale (multiple replicas, millions of vectors, high-throughput retrieval), use qdrant, pinecone, or pgvector.

how do i handle pdf or docx documents in this pipeline?
firecrawl can ingest pdfs directly via scrape_url with formats: ["markdown"]. for docx, use unstructured or langchain_community.document_loaders.UnstructuredWordDocumentLoader. then feed the resulting text into the same chunking and embedding flow.
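
a sketch of the docx path, assuming the unstructured package (with its docx extras) is installed and a hypothetical report.docx file:

from langchain_community.document_loaders import UnstructuredWordDocumentLoader

loader = UnstructuredWordDocumentLoader("report.docx")
docx_docs = loader.load()                 # list of Document objects with text content

# reuse the splitter from step 2, then embed and store as before
chunks = splitter.split_documents(docx_docs)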

conclusion

firecrawl plus langchain is the fastest way to build a working rag pipeline in 2026. firecrawl handles the part that’s expensive to build (clean web content extraction). langchain handles the part that’s tedious to write (chunking, embedding, retrieval, prompting).

the script in this tutorial is the skeleton of every web-grounded rag system you’ll see in production. start there, measure quality on your real questions, then add re-ranking, hybrid search, and evaluation as the data demands. for the broader python web scraping foundation, the complete python guide covers everything underneath this stack.

ship a prototype this weekend. iterate on quality next week. that’s the rag playbook.
