Vector databases for scraping pipelines in 2026
Vector databases and scraping pipelines are inseparable in 2026. Almost every meaningful scraping operation that powers RAG, semantic search, recommendation, or AI-assisted analysis ends up writing embeddings to a vector store. The choice of vector database matters more than most teams initially recognise: it shapes ingestion throughput, query latency, hybrid retrieval support, operational overhead, cost economics, and the migration path when the system grows. The market consolidated around five serious options in 2024-2025, and the mid-2026 picture is clearer than ever. This guide walks through the production-grade vector databases, the comparison criteria that matter for scraping workloads, the deployment patterns that work, and a selection framework your team can apply.
The audience is the data engineer or platform owner choosing a vector database for a scraping-driven AI pipeline.
Why vector databases matter for scraping
Three reasons.
First, embedding storage and similarity search at scale require purpose-built infrastructure. A scraping operation that ingests 10 million documents produces tens of millions of vectors, because each document is split into several chunks and each chunk gets at least one embedding. Storing and querying these in a general-purpose database without a vector index does not scale.
Second, the retrieval pattern is different from traditional databases. Vector queries are nearest-neighbour searches over high-dimensional vectors, with hybrid sparse-plus-dense often required. The right database makes hybrid trivial; the wrong database makes it custom code.
Third, the operational characteristics matter: ingestion throughput, query latency at p99, memory footprint, replication, and cost per million vectors all shape the production experience.
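To make the scale argument concrete, here is a back-of-envelope sketch of the raw vector footprint. The figure of 3 chunks per document is an illustrative assumption, not a number from a specific pipeline, and the estimate ignores index and payload overhead.

```python
def raw_vector_bytes(num_docs: int, chunks_per_doc: float, dims: int,
                     bytes_per_value: int = 4) -> int:
    """Raw storage for the vectors alone (float32 by default)."""
    num_vectors = int(num_docs * chunks_per_doc)
    return num_vectors * dims * bytes_per_value


# Illustrative inputs: 10M documents, 3 chunks each, 1024-dim float32 vectors.
total = raw_vector_bytes(10_000_000, 3, 1024)
print(f"{total / 1e9:.1f} GB of raw float32 vectors")
```

Index structures and payloads add to this figure, so treat it as a lower bound when sizing memory.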
For the broader RAG context, see RAG over scraped data production patterns. For the MCP integration, see MCP for data engineers.
The 2026 vector database landscape
Five production-grade options:
| Database | Type | Open source | Hosted | Strength |
|---|---|---|---|---|
| Qdrant | Purpose-built | Yes (Apache 2.0) | Yes (Qdrant Cloud) | Performance + filters |
| Weaviate | Purpose-built | Yes (BSD) | Yes (Weaviate Cloud) | Modules + multi-modal |
| Pinecone | Purpose-built | No | Yes only | Operational simplicity |
| pgvector (Postgres extension) | Embedded | Yes | Yes (Supabase, Neon, RDS) | Postgres-native |
| Milvus | Purpose-built | Yes (Apache 2.0) | Yes (Zilliz Cloud) | Massive scale |
The choice between them is rarely about raw performance. All five can serve millions of queries per day. The choice is about operational fit, ecosystem, and the rest of your stack.
Qdrant: the performance and filtering favourite
Qdrant is a purpose-built vector database written in Rust. It launched in 2021 and matured through 2023-2025 to become the production favourite for performance-sensitive workloads.
Strengths:
– Excellent query performance at scale (millions of vectors).
– Strong payload filtering: combine vector search with metadata filters efficiently.
– Open source with a permissive licence (Apache 2.0).
– Mature client libraries (Python, TypeScript, Go, Rust).
– Hybrid search (dense + sparse) supported natively as of 2024.
Weaknesses:
– Self-hosting requires more operational sophistication than pgvector.
– Hosted Qdrant Cloud is reasonably priced but not the cheapest.
– Less ecosystem integration than Weaviate (modules) or pgvector (Postgres).
Best for: production scraping pipelines where filter combination and query performance matter; teams comfortable with self-hosted infrastructure.
A minimal Qdrant ingestion in Python:
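from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(host="qdrant.internal", port=6333)

# Create the collection only if it is missing; recreate_collection would
# drop any data already ingested.
if not client.collection_exists("docs"):
    client.create_collection(
        collection_name="docs",
        vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
    )

# rows is an iterable of (embedding, url, scraped_at, text) tuples.
# enumerate() IDs restart at 0 on every run and overwrite earlier points;
# use a stable ID (for example one derived from a content hash) in production.
client.upsert(
    collection_name="docs",
    points=[
        PointStruct(
            id=i,
            vector=embedding,
            payload={"url": url, "scraped_at": ts, "text": text[:200]},
        )
        for i, (embedding, url, ts, text) in enumerate(rows)
    ],
)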
Weaviate: the modules and multi-modal favourite
Weaviate is purpose-built, written in Go, and launched in 2019. It has a stronger orientation around modular pipelines (built-in embedding generation, reranking, summarisation) and multi-modal data (images, audio, video alongside text).
Strengths:
– Modules: built-in connectors for OpenAI, Cohere, Hugging Face, ColBERT, and many more.
– Multi-modal native: cleaner support for cross-modal queries.
– GraphQL query language: convenient for complex retrieval.
– Open source (BSD licence) with managed hosting.
Weaknesses:
– More opinionated; the modular design adds complexity to simple use cases.
– Performance similar to Qdrant but the operational characteristics differ.
– Smaller community than pgvector or Pinecone.
Best for: multi-modal pipelines, teams that benefit from built-in embedding/reranking modules, GraphQL-friendly stacks.
Pinecone: the operational-simplicity choice
Pinecone is the original commercial vector database, launched in 2019. It is hosted-only and proprietary. Its value proposition is operational simplicity: a managed service with predictable pricing and zero infrastructure ownership.
Strengths:
– Zero-ops: no self-hosting required.
– Predictable pricing model.
– Strong production reliability.
– Clean Python SDK, well-documented.
Weaknesses:
– Hosted-only; no self-host option.
– More expensive at scale than self-hosted alternatives.
– Closed source; no inspection of internals.
– Filter performance has historically lagged Qdrant.
Best for: teams that want to outsource vector database operations entirely; pre-production, smaller teams; situations where vendor lock-in is acceptable.
pgvector: the Postgres-native option
pgvector is a Postgres extension that adds vector data types and similarity search. It launched in 2021 and matured significantly from 2023 onward, when version 0.5.0 added HNSW index support, with further performance improvements through 2024-2025.
Strengths:
– Postgres native: reuse existing Postgres expertise, tooling, backups, observability.
– Cost-effective when Postgres is already in the stack.
– Transactional consistency with relational data.
– Hosted everywhere (RDS, Supabase, Neon, CloudSQL).
– Open source, free.
Weaknesses:
– Performance lower than purpose-built databases at large scale (50M+ vectors).
– Index types (IVFFlat, HNSW) and parameter tuning require expertise.
– No native sparse-plus-dense hybrid search; requires combining with full-text search manually.
Best for: teams already running Postgres, smaller corpora (under 50M vectors), use cases where transactional consistency with relational data is valuable.
A pgvector setup with HNSW:
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
    id BIGSERIAL PRIMARY KEY,
    embedding vector(1024),
    url TEXT,
    scraped_at TIMESTAMPTZ,
    text TEXT
);

CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);
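Querying that table from Python might look like the sketch below. It only builds the SQL and parameters, so it runs without a database. The helper name `to_pgvector_literal`, the `since` filter, and the three-dimensional example vector are illustrative assumptions; a real query vector would have 1024 dimensions to match the column.

```python
from typing import Sequence


def to_pgvector_literal(vec: Sequence[float]) -> str:
    """Format a sequence as a pgvector text literal such as '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


# <=> is pgvector's cosine-distance operator, which pairs with the
# vector_cosine_ops index defined on the embedding column.
NEAREST_CHUNKS_SQL = """
SELECT id, url, text, embedding <=> %(q)s::vector AS distance
FROM chunks
WHERE scraped_at >= %(since)s
ORDER BY embedding <=> %(q)s::vector
LIMIT %(k)s;
"""

params = {
    "q": to_pgvector_literal([0.1, 0.2, 0.3]),  # toy 3-dim vector for illustration
    "since": "2026-01-01",
    "k": 5,
}
# With a psycopg cursor this would run as: cursor.execute(NEAREST_CHUNKS_SQL, params)
```

Passing the vector as a bound parameter rather than interpolating it into the SQL string keeps the query safe to reuse across requests.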
Milvus: the massive-scale option
Milvus is open source, written in C++ and Go, and designed for extreme scale (billions of vectors). It launched in 2019 and matured through 2024-2025 into the standard for the largest deployments.
Strengths:
– Scale: production deployments at billions of vectors and 10K+ QPS.
– Distributed architecture: separate compute and storage scale independently.
– Multiple index types (IVF_FLAT, IVF_SQ8, HNSW, DiskANN).
– Strong China-region adoption with mature Mandarin documentation.
Weaknesses:
– Operational complexity: distributed Milvus requires real DevOps investment.
– For corpora under 100M vectors, the complexity is overkill.
– The hosted version (Zilliz Cloud) is mature but less ecosystem-adopted.
Best for: extreme-scale deployments, teams with mature DevOps, organisations with massive scraping operations producing billions of vectors.
For the deeper proxy infrastructure question, see self-hosted proxy infrastructure.
Comparison matrix
| Dimension | Qdrant | Weaviate | Pinecone | pgvector | Milvus |
|---|---|---|---|---|---|
| Open source | Yes | Yes | No | Yes | Yes |
| Self-hosting | Yes | Yes | No | Yes (via Postgres) | Yes |
| Managed hosting | Yes | Yes | Yes only | Yes (Supabase, Neon, RDS) | Yes (Zilliz) |
| Hybrid search native | Yes | Yes | Limited | Manual | Yes |
| Multi-modal | Limited | Strong | Limited | Limited | Strong |
| Filter performance | Excellent | Good | Moderate | Good | Excellent |
| Scale ceiling | 100M+ vectors | 100M+ vectors | 100M+ vectors | 50M vectors | 10B+ vectors |
| Ecosystem maturity | High | High | High | High (Postgres) | High |
| Best client lang | Python, TS, Go | Python, GraphQL | Python | SQL | Python, Java, Go |
| Cost at 10M vectors | Low (self) / Medium (cloud) | Low (self) / Medium (cloud) | Medium-High | Low (self) | Low (self) |
| Cost at 1B vectors | High (self) | High (self) | Highest (cloud only) | Not recommended | Best for scale |
Decision tree: pick a vector database
Q1: Is your team running Postgres already?
├── Yes -> Q2
└── No -> Q3
Q2: Will the corpus stay under 50M vectors for the next 18 months?
├── Yes -> pgvector. Reuse the stack.
└── No -> Q3
Q3: Is operational simplicity (no self-host) the priority?
├── Yes -> Pinecone (commercial); Qdrant Cloud or Weaviate Cloud (open).
└── No -> Q4
Q4: Is the corpus at 1B+ vectors or expected to be?
├── Yes -> Milvus.
└── No -> Q5
Q5: Are filter combinations central to your queries?
├── Yes -> Qdrant.
└── No -> Q6
Q6: Do you need built-in modules or strong multi-modal?
├── Yes -> Weaviate.
└── No -> Qdrant (sensible default).
The decision tree handles 80 percent of cases. Edge cases (regulatory data residency, specific cloud provider lock-in, language-specific embedding tooling) override.
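The same tree can be written as a small function, which is handy for recording the answers in an architecture decision record. This is a direct transcription of Q1 to Q6 above; the parameter names are invented for the sketch, and the edge cases mentioned above are deliberately not modelled.

```python
def pick_vector_db(*, runs_postgres: bool, under_50m_for_18_months: bool,
                   wants_zero_ops: bool, billion_scale: bool,
                   filter_heavy: bool, needs_modules_or_multimodal: bool) -> str:
    """Walk the decision tree top to bottom and return the first match."""
    if runs_postgres and under_50m_for_18_months:  # Q1 and Q2
        return "pgvector"
    if wants_zero_ops:                             # Q3
        return "Pinecone, Qdrant Cloud, or Weaviate Cloud"
    if billion_scale:                              # Q4
        return "Milvus"
    if filter_heavy:                               # Q5
        return "Qdrant"
    if needs_modules_or_multimodal:                # Q6
        return "Weaviate"
    return "Qdrant"                                # sensible default


choice = pick_vector_db(
    runs_postgres=True, under_50m_for_18_months=False,
    wants_zero_ops=False, billion_scale=False,
    filter_heavy=False, needs_modules_or_multimodal=True,
)
print(choice)
```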
Production deployment patterns
Three patterns for production deployment.
Pattern one: managed cloud, single region. Pinecone, Qdrant Cloud, Weaviate Cloud, Zilliz Cloud, or Supabase pgvector. Lowest operational overhead. Suitable for most teams.
Pattern two: self-hosted on Kubernetes. Qdrant, Weaviate, Milvus, or pgvector deployed via Helm charts on EKS/GKE/AKS. Medium operational overhead. Required for data-residency or cost-sensitive deployments.
Pattern three: self-hosted on bare metal. Same databases as pattern two, deployed directly on dedicated hardware. Highest operational overhead, lowest cost per vector. Required for the largest deployments where cloud egress and managed-service margins dominate.
For the broader infrastructure question, see building scraping pipelines with Prefect 3.
Operational characteristics
Beyond the headline benchmark numbers, the operational characteristics that matter:
| Characteristic | Why it matters |
|---|---|
| Ingestion throughput | Determines time to backfill a large corpus |
| Query p99 latency | Determines user-facing response time |
| Memory footprint | Determines hosting cost |
| Index build time | Determines time-to-first-query after data load |
| Replication and HA | Determines uptime SLO |
| Backup and restore | Determines disaster recovery RTO/RPO |
| Schema evolution | Determines pain when payload schema changes |
| Multi-tenancy | Determines whether you can run shared collections |
A team that picks a database without evaluating these often discovers them painfully in month two of production.
Hybrid search implementation
Hybrid retrieval (dense embedding + sparse keyword) is the production default in 2026. The implementation differs across databases:
| Database | Hybrid mechanism |
|---|---|
| Qdrant | Native sparse vectors; combine with dense via fusion |
| Weaviate | Native hybrid query mode |
| Pinecone | Sparse-dense indexes (sparse vectors with dense) |
| pgvector | Combine with Postgres full-text search; manual fusion |
| Milvus | Native hybrid via multi-field queries |
A Qdrant hybrid query:
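from qdrant_client.models import Fusion, FusionQuery, Prefetch

# Assumes "docs" was created with named vectors "dense" and "sparse";
# sparse_vector must be a models.SparseVector(indices=..., values=...).
results = client.query_points(
    collection_name="docs",
    prefetch=[
        Prefetch(query=dense_vector, using="dense", limit=20),    # top 20 by embedding
        Prefetch(query=sparse_vector, using="sparse", limit=20),  # top 20 by keyword weights
    ],
    query=FusionQuery(fusion=Fusion.RRF),  # merge both candidate lists
    limit=5,
)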
Reciprocal Rank Fusion (RRF) is the standard merger.
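For databases where fusion is manual (pgvector, for example), RRF is short enough to implement client-side. The sketch below is a plain-Python version: each document scores the sum of 1 / (k + rank) over the lists it appears in. The constant k=60 is a commonly used default, and a database's server-side implementation may differ in detail.

```python
from collections import defaultdict
from typing import Hashable, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[Hashable]],
                           k: int = 60, limit: int = 5) -> List[Hashable]:
    """Merge ranked ID lists; ranks start at 1, higher fused score wins."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by ID so output is deterministic.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], str(kv[0])))
    return [doc_id for doc_id, _ in ordered][:limit]


dense_hits = ["a", "b", "c", "d"]   # IDs ranked by vector similarity
sparse_hits = ["c", "a", "e"]       # IDs ranked by keyword score
fused = reciprocal_rank_fusion([dense_hits, sparse_hits], limit=3)
print(fused)
```

Because RRF uses only ranks, it needs no normalisation between cosine distances and keyword scores, which is the main reason it is a convenient default.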
For the deeper retrieval discussion, see RAG over scraped data.
Cost economics at scale
Rough cost benchmarks for storing and querying 100M vectors of 1024 dimensions in mid-2026:
| Option | Storage cost (monthly) | Query cost per 1M | Notes |
|---|---|---|---|
| Qdrant Cloud | USD 800-1500 | Included up to volume | Predictable |
| Weaviate Cloud | USD 900-1800 | Included up to volume | Module surcharges |
| Pinecone | USD 1200-2500 | USD 0.40 | Tier-based |
| pgvector on RDS | USD 600-1200 | Included | Suboptimal at this scale |
| Milvus self-hosted | USD 400-800 | Included | Plus DevOps overhead |
| Qdrant self-hosted | USD 300-600 | Included | Plus DevOps overhead |
Self-hosting wins on raw cost. Managed wins on total cost of ownership when DevOps capacity is constrained.
External references
The Qdrant documentation is at qdrant.tech/documentation. Weaviate’s documentation is at weaviate.io/developers/weaviate. Pinecone’s docs are at docs.pinecone.io. pgvector’s repository is at github.com/pgvector/pgvector. Milvus is at milvus.io. Vector database benchmarks are tracked at vectordbbench.com.
Migration patterns
Teams frequently migrate vector databases as scale grows. The pattern that works:
- Start with pgvector if Postgres exists, or Qdrant if not.
- Migrate when corpus or QPS exceeds the comfortable operating range (50M vectors for pgvector; 100M+ for purpose-built single-node).
- Plan migration as: dual-write during transition; read-cutover after validation; old database retired after 2 weeks.
- Embedding models do not need to change unless the migration coincides with a model upgrade. Vector dimensions must match.
A typical migration is one engineer-month of effort. The cost is real but predictable.
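The dual-write and read-cutover steps can be sketched as a thin wrapper around two stores. Everything here is hypothetical scaffolding: `InMemoryStore` is a toy stand-in for real clients, and the `upsert`/`search` interface is an assumption made for the example, not any vendor's API.

```python
from typing import Dict, List


class InMemoryStore:
    """Toy store for illustration: exact search by squared Euclidean distance."""

    def __init__(self) -> None:
        self.records: Dict[str, List[float]] = {}

    def upsert(self, records: List[dict]) -> None:
        for r in records:
            self.records[r["id"]] = r["vector"]

    def search(self, vector: List[float], k: int) -> List[str]:
        def dist(i: str) -> float:
            return sum((a - b) ** 2 for a, b in zip(self.records[i], vector))
        return sorted(self.records, key=dist)[:k]


class DualWriteStore:
    """Write to both stores; read from one, selected by a cutover flag."""

    def __init__(self, old, new, read_from_new: bool = False) -> None:
        self.old, self.new, self.read_from_new = old, new, read_from_new

    def upsert(self, records: List[dict]) -> None:
        self.old.upsert(records)
        self.new.upsert(records)

    def search(self, vector: List[float], k: int) -> List[str]:
        target = self.new if self.read_from_new else self.old
        return target.search(vector, k)

    def overlap(self, vector: List[float], k: int) -> float:
        """Validation aid: share of the old top-k that the new store also returns."""
        old_ids = set(self.old.search(vector, k))
        new_ids = set(self.new.search(vector, k))
        return len(old_ids & new_ids) / max(len(old_ids), 1)


store = DualWriteStore(InMemoryStore(), InMemoryStore())
store.upsert([{"id": "a", "vector": [0.0, 1.0]}, {"id": "b", "vector": [1.0, 0.0]}])
agreement = store.overlap([0.1, 0.9], k=1)
store.read_from_new = True  # read cutover once agreement looks acceptable
```

Tracking the overlap on a sample of real queries before flipping the flag gives the validation step a measurable exit criterion. Existing data still needs a separate backfill into the new store.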
FAQ
Which vector database is best for scraping pipelines?
Qdrant is the sensible default for most. pgvector if you already run Postgres at sub-50M scale. Pinecone if zero-ops is paramount.
Do I need a vector database if I use OpenAI embeddings?
Yes. The embeddings need to be stored and searched somewhere. OpenAI provides embeddings; vector databases provide retrieval.
Is pgvector good enough for production?
Yes, up to about 50M vectors. Beyond that, the index build times and query latencies push toward purpose-built options.
What about hybrid sparse-plus-dense search?
Native in Qdrant, Weaviate, and Milvus. Manual in pgvector. Limited in Pinecone unless using sparse-dense indexes.
Can I run multi-tenant collections?
All five support some form of multi-tenancy via collection separation or payload filtering. Implementation differs.
Extended vector database analysis
The vector database market consolidated around several production-ready options in 2026. The choice depends on scale, latency, and operational preference.
- pgvector on PostgreSQL. Best for teams already running Postgres. Comfortable to roughly 50M vectors with HNSW indexing, and able to stretch toward 100M with generous memory and careful tuning. Metadata filtering uses standard SQL.
- Qdrant. Rust-based, excellent filter performance, strong for hybrid search. Self-hosted or managed.
- Weaviate. Schema-driven, GraphQL surface, strong for multi-tenancy.
- Pinecone. Managed only, simplest operations, highest cost per vector.
- Milvus. High scale (billions of vectors), more operational complexity.
- LanceDB. Embedded, columnar, best for analytics-style workloads.
Production ingestion pattern with deduplication
import hashlib
from typing import Dict, List


class IngestionPipeline:
    """Embed and upsert only documents whose content hash is new."""

    def __init__(self, embedder, vector_store, dedupe_table):
        # Injected collaborators; each must expose the async methods used
        # below (embed_batch, upsert, exists/add).
        self.embedder = embedder
        self.vector_store = vector_store
        self.dedupe = dedupe_table

    def doc_hash(self, doc: Dict) -> str:
        # Hash URL plus text so a changed page at the same URL is re-ingested.
        content = f"{doc['url']}|{doc['text']}"
        return hashlib.sha256(content.encode()).hexdigest()

    async def ingest(self, docs: List[Dict]) -> int:
        new_docs = []
        seen = set()  # catches duplicates inside the same batch
        for doc in docs:
            h = self.doc_hash(doc)
            if h in seen or await self.dedupe.exists(h):
                continue
            seen.add(h)
            new_docs.append({**doc, "_hash": h})  # copy; leave the caller's dict alone
        if not new_docs:
            return 0

        embeddings = await self.embedder.embed_batch(
            [d["text"] for d in new_docs], batch_size=64
        )
        records = [
            {
                "id": d["_hash"],
                "vector": e,
                "metadata": {
                    "url": d["url"],
                    "scraped_at": d.get("scraped_at"),
                    "source": d.get("source"),
                    "purpose": d.get("purpose"),
                },
            }
            for d, e in zip(new_docs, embeddings)
        ]
        await self.vector_store.upsert(records)

        # Record hashes only after the upsert succeeds, so a failed batch is retried.
        for d in new_docs:
            await self.dedupe.add(d["_hash"])
        return len(new_docs)
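One practical wrinkle: some stores constrain ID formats. Qdrant, for example, accepts only unsigned integers or UUIDs as point IDs, so a raw SHA-256 hex string would be rejected. A minimal sketch of deriving a deterministic UUID from the content hash follows; the function names are invented for the example, and the choice of `uuid5` with the URL namespace is one reasonable option rather than a requirement.

```python
import hashlib
import uuid


def doc_hash(url: str, text: str) -> str:
    """Content hash over URL plus text, as in the pipeline above."""
    return hashlib.sha256(f"{url}|{text}".encode()).hexdigest()


def point_id_from_hash(content_hash: str) -> str:
    """Map a hex digest to a deterministic UUID string usable as a point ID."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, content_hash))


h = doc_hash("https://example.com/page", "hello world")
pid = point_id_from_hash(h)
```

Because the mapping is deterministic, re-ingesting the same content produces the same ID and the upsert stays idempotent.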
Index choice matters
The HNSW vs IVF-PQ vs DiskANN choice affects latency, recall, and memory.
- HNSW. In-memory graph index. Best recall and latency. RAM-bound.
- IVF-PQ. Quantised inverted file. Memory-efficient at moderate recall cost.
- DiskANN. Disk-based graph index. Scales beyond RAM. Slightly higher latency.
- ScaNN. Google’s hybrid. Strong recall and speed in benchmarks.
A 2026 production choice often pairs HNSW for the hot tier (last 30 days, in-memory) with IVF-PQ or DiskANN for the cold tier.
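Whichever index is chosen, recall should be measured rather than assumed. A minimal sketch: compute exact top-k by brute force on a sample, then compare against what the approximate index returns. The `approx` list below is a hard-coded stand-in for real index output.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def exact_top_k(query: Sequence[float], vectors: List[Sequence[float]], k: int) -> List[int]:
    """Brute-force ground truth: indices of the k nearest vectors."""
    order = sorted(range(len(vectors)), key=lambda i: cosine_distance(query, vectors[i]))
    return order[:k]


def recall_at_k(approx_ids: Sequence[int], exact_ids: Sequence[int]) -> float:
    """Share of the true top-k that the approximate result contains."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
truth = exact_top_k([1.0, 0.0], vectors, k=2)
approx = [0, 2]  # pretend output of an approximate index
print(recall_at_k(approx, truth))
```

Brute force is only feasible on a sample, but a few hundred sampled queries are usually enough to compare index parameters against each other.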
Filter performance: pre-filter vs post-filter
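Pre-filtering applies the metadata filter before the vector search; post-filtering applies it after. Pre-filtering always returns up to k matching results, but it gets slower as the filter matches a larger share of the corpus. Post-filtering is faster in that case, but it may return fewer than k results when matching rows are rare among the nearest neighbours. Selectivity below means the estimated fraction of rows that match the filter.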
A 2026 pattern is adaptive filtering. The query planner estimates filter selectivity and chooses pre or post per query.
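def adaptive_search(query_vec, filter_dict, k=10, selectivity_threshold=0.05):
    # estimate_selectivity and vector_store are placeholders for your own
    # planner statistics and client; the estimate is the fraction of rows
    # expected to match the filter (0.0 to 1.0).
    estimated = estimate_selectivity(filter_dict)
    if estimated < selectivity_threshold:
        # Few rows match: filter first, then search only the survivors.
        return vector_store.search(query_vec, k=k, filter=filter_dict, mode="pre")
    # Many rows match: over-fetch 3x, filter afterwards, trim to k.
    return vector_store.search(query_vec, k=k * 3, filter=filter_dict, mode="post")[:k]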
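The fewer-than-k failure mode is easy to reproduce with a toy in-memory example. The sketch below uses brute-force search over six one-dimensional vectors; the row layout and the function names are assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple[str, List[float], Dict[str, str]]  # (id, vector, payload)


def sq_dist(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pre_filter_search(rows: List[Row], query: List[float], k: int,
                      pred: Callable[[Dict[str, str]], bool]) -> List[str]:
    """Filter first, then rank only the matching rows."""
    candidates = [r for r in rows if pred(r[2])]
    candidates.sort(key=lambda r: sq_dist(r[1], query))
    return [r[0] for r in candidates[:k]]


def post_filter_search(rows: List[Row], query: List[float], k: int,
                       pred: Callable[[Dict[str, str]], bool],
                       overfetch: int = 1) -> List[str]:
    """Rank everything, keep the nearest k * overfetch, then filter."""
    nearest = sorted(rows, key=lambda r: sq_dist(r[1], query))[: k * overfetch]
    return [r[0] for r in nearest if pred(r[2])][:k]


rows: List[Row] = [
    ("a", [0.0], {"source": "news"}), ("b", [1.0], {"source": "news"}),
    ("c", [2.0], {"source": "news"}), ("d", [3.0], {"source": "blog"}),
    ("e", [4.0], {"source": "news"}), ("f", [5.0], {"source": "blog"}),
]
is_blog = lambda payload: payload["source"] == "blog"

pre = pre_filter_search(rows, [0.0], 2, is_blog)
post = post_filter_search(rows, [0.0], 2, is_blog)
post_overfetched = post_filter_search(rows, [0.0], 2, is_blog, overfetch=3)
```

Here the two nearest rows are both news, so plain post-filtering returns nothing, while over-fetching recovers both blog rows at the cost of ranking more candidates.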
Comparison: vector databases 2026
| DB | Max scale | Latency p95 | Filter perf | Best for |
|---|---|---|---|---|
| pgvector | 50-100M | 20-50ms | Excellent (SQL) | Postgres shops |
| Qdrant | 1B+ | 10-30ms | Excellent | Hybrid search |
| Weaviate | 500M+ | 15-40ms | Good | Multi-tenant |
| Pinecone | Multi-billion | 30-100ms | Moderate | Managed simplicity |
| Milvus | 10B+ | 10-50ms | Good | Massive scale |
| LanceDB | 100M+ | 20-60ms | Good | Embedded analytics |
Cost optimisation patterns
Vector database costs have three drivers:
- Storage (per million vectors per month).
- Compute for index building and query.
- Network egress for hosted services.
The 2026 cost-cutting patterns are:
- Quantisation (PQ, OPQ, scalar quantisation) to reduce vector size 4x to 32x.
- Dimensionality reduction via Matryoshka embeddings to truncate vectors at query time.
- Tiered storage with hot, warm, cold partitions.
- Embedding model swap to a smaller model for cost-sensitive workloads.
Additional FAQ
Should I use a vector DB or a SQL extension?
For most teams pgvector is enough up to roughly 50M vectors, and up to about 100M with careful tuning. Beyond that, consider a dedicated vector DB.
How do I handle incremental updates?
HNSW supports insert efficiently. Delete is harder; many production systems use tombstones plus periodic reindex.
What about hybrid search?
Most modern vector DBs support BM25-plus-vector fusion via reciprocal rank fusion or weighted scores. Use it.
How do I version embeddings?
When the embedding model changes you must re-embed. Maintain two indexes during the transition and dual-write.
The choice between embedded and dedicated vector databases
A team building a scraping-plus-RAG pipeline faces an early decision: embedded vector database or dedicated. Embedded options (LanceDB, Chroma in single-node mode, FAISS) live in the application process. Dedicated options (Qdrant, Weaviate, Milvus, Pinecone) live in their own service.
Embedded wins for prototyping, single-node deployments, and analytics-style workloads where the embedding store is read more than written. Dedicated wins for multi-service deployments, multi-team usage, high-write workloads, and operational requirements like backup and replication.
The 2026 pattern is to start embedded for the first 90 days of a project and migrate to dedicated when the project graduates to production. The migration is non-trivial but well-traveled. The embedded prototype validates the data model and the schema before the dedicated commitment.
A specific case worth calling out is pgvector. pgvector lives in PostgreSQL and benefits from the operational maturity of Postgres. For teams that already run Postgres, pgvector is often the right answer up to roughly 50 million vectors, and with generous memory and careful index tuning it can stretch toward 100 million. The savings from not operating an additional database service often outweigh the marginal performance benefits of a dedicated vector DB at moderate scale.
The embedding model decision
The embedding model is the foundation of every vector store. A change of embedding model requires re-embedding the entire corpus, which is expensive at scale. The model choice should therefore be considered carefully.
The MTEB leaderboard ranks models by retrieval quality on standard benchmarks. As of 2026, the state of the art for English scores above 70 on the MTEB average. Popular choices include the open-weight BGE family and Jina embeddings v3, plus the Cohere multilingual embeddings (proprietary but well regarded).
Dimensionality matters for storage and latency. A 1536-dimensional embedding takes 6 KB per vector at float32 (1536 values at 4 bytes each). A 768-dimensional embedding takes 3 KB. Smaller dimensions also support faster nearest-neighbour search. Matryoshka representation learning, introduced in 2022 and widely adopted by embedding providers from 2024, lets a single model produce embeddings that can be truncated to smaller dimensions with graceful quality degradation.
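The storage arithmetic and the truncation step can be sketched in a few lines. Truncating and re-normalising only preserves quality for models actually trained with a Matryoshka objective; applying it to an arbitrary embedding model is not expected to work well. The function names are invented for the example.

```python
import math
from typing import List, Sequence


def bytes_per_vector(dims: int, bytes_per_value: int = 4) -> int:
    """Raw size of one vector; float32 uses 4 bytes per value."""
    return dims * bytes_per_value


def truncate_and_normalise(vec: Sequence[float], dims: int) -> List[float]:
    """Keep the first `dims` values and rescale to unit length for cosine search."""
    head = list(vec[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head
    return [x / norm for x in head]


print(bytes_per_vector(1536), bytes_per_vector(768))
short = truncate_and_normalise([3.0, 4.0, 12.0, 84.0], dims=2)
```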
For multilingual scraping the choice narrows. Models trained on diverse language data perform better on cross-lingual queries. The strong choices are the open-weight BGE-M3 model and the Jina multilingual variants, along with the proprietary Cohere multilingual embeddings.
The chunking strategy decision
Chunking is the most underrated decision in vector pipelines. The chunk size, the chunking strategy, and the overlap all materially affect retrieval quality.
Fixed-size chunking (split every 512 tokens) is simple but breaks semantic boundaries. Document-structure-aware chunking respects headings, paragraphs, code blocks, and tables. Semantic chunking uses a small model to find sentence boundaries that preserve meaning.
The 2026 pattern is to use document-structure-aware chunking as the default, with semantic chunking for prose-heavy content where structure is weak. Fixed-size chunking remains useful as a fallback for content with no exploitable structure.
Overlap improves retrieval quality at the cost of storage. A typical 2026 default is 10-20 percent overlap. Below 10 percent the boundary effects degrade retrieval. Above 20 percent the marginal benefit is small.
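As a reference point, fixed-size chunking with overlap can be sketched as below. Whitespace splitting stands in for a real tokenizer, which is an assumption of the example; in practice the token counts should come from the embedding model's own tokenizer.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence, chunk_size: int = 512,
                 overlap_ratio: float = 0.15) -> List[list]:
    """Split tokens into windows of chunk_size that share overlap_ratio of their length."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # always >= 1 given the checks above
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


text = "one two three four five six seven eight nine ten"
chunks = chunk_tokens(text.split(), chunk_size=4, overlap_ratio=0.25)
for c in chunks:
    print(" ".join(c))
```

With a chunk size of 4 and 25 percent overlap, each chunk repeats the last token of the previous one, which is the boundary context the overlap is meant to preserve.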
Cost optimisation in production
A production vector store at scale costs real money. Three optimisation patterns drive material savings.
The first is quantisation. Standard float32 vectors can be compressed to int8 (4x reduction) or int4 (8x reduction) with small recall impact. Product quantisation goes further (32x reduction) with more recall impact. The 2026 best practice is to ship int8 quantised indexes for hot data.
The second is tiered storage. Hot data (last 30 days) lives in HNSW in-memory. Warm data (last year) lives in a quantised disk-backed index. Cold data (older) lives in compressed object storage with on-demand re-indexing. The tier boundaries are workload-specific.
The third is Matryoshka truncation. A model trained for Matryoshka representation produces embeddings that work at multiple dimensionalities. The vector store can keep the full-dimension vectors and serve queries against a truncated prefix when latency matters.
Each pattern stacks. A vector store using all three can be 50-100x cheaper than a naive deployment of the same workload, with modest recall impact. The 2026 best practice is to layer the techniques rather than choose one.
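To show where the 4x figure for int8 comes from, here is a minimal sketch of symmetric scalar quantisation on a single vector. Production engines typically calibrate ranges across the whole collection and handle this internally, so this is an illustration of the idea rather than any database's implementation.

```python
from typing import List, Sequence, Tuple


def quantise_int8(vec: Sequence[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one scale per vector."""
    scale = max(abs(x) for x in vec) / 127 or 1.0  # avoid a zero scale for all-zero vectors
    codes = [round(x / scale) for x in vec]
    return codes, scale


def dequantise(codes: Sequence[int], scale: float) -> List[float]:
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]


vec = [1.27, -0.64, 0.0, 0.32]
codes, scale = quantise_int8(vec)
restored = dequantise(codes, scale)
max_error = max(abs(a - b) for a, b in zip(vec, restored))
print(codes, f"max error {max_error:.4f}")
```

Each value shrinks from 4 bytes to 1, and the reconstruction error per value is bounded by half the scale, which is why the recall impact tends to be small for well-distributed embeddings.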
Next steps
If you have not picked a vector database yet, evaluate Qdrant and pgvector for your specific workload. An afternoon of prototyping with both will tell you more than weeks of comparison reading. For broader emerging-tech context, head to the DRT emerging-tech hub and pair this with the RAG over scraped data guide.
This guide is informational, not engineering or legal advice.