Vector Search for Scraped Data: pgvector vs Qdrant vs Weaviate (2026)

Vector search has become the connective tissue between scraped data and LLM applications. If you’re running a pipeline that collects product pages, documentation, job listings, or forum threads at scale, the database you embed into determines query latency, operational cost, and how well your retrieval actually performs. In 2026, three options dominate real production stacks: pgvector (Postgres extension), Qdrant (purpose-built Rust vector DB), and Weaviate (graph-aware vector store). Here’s an honest breakdown of when each one wins.

Why vector search matters for scraped data pipelines

Most scraped content is unstructured. A product description, a review thread, or a scraped knowledge base article has no natural primary key you can query against. Once you embed that text with a model like text-embedding-3-small or nomic-embed-text, you need somewhere fast to store and query 768- to 1536-dimensional float vectors alongside the raw payload.
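
"Querying" at this layer just means nearest-neighbor search under a similarity metric, most often cosine. As a minimal sketch of the score a vector store computes per candidate (pure Python, no ANN index, toy 3-dim vectors standing in for real embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors; real embeddings are 768-1536 dims
doc_vec = [0.2, 0.1, 0.7]
query_vec = [0.2, 0.1, 0.6]
score = cosine_similarity(doc_vec, query_vec)
```

An ANN index like HNSW exists to avoid computing this against every stored vector; the three databases below differ mainly in how they build, filter, and scale that index.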

If you’re already building RAG applications on scraped documentation, the vector database choice shapes every part of the architecture: how you batch upserts, how you filter by metadata, and whether you can do hybrid keyword-plus-vector search in a single query.

Head-to-head comparison

| Feature | pgvector (0.7+) | Qdrant 1.9 | Weaviate 1.25 |
| --- | --- | --- | --- |
| Storage backend | Postgres heap | Custom segment files | RocksDB + object store |
| Max dims (HNSW) | 2000 | 65535 | 65535 |
| Hybrid search | Manual BM25 + vector | Built-in sparse+dense | Built-in BM25+vector |
| Filtering | SQL WHERE (pre or post) | Payload index (fast pre-filter) | GraphQL where clause |
| Ops model | Your Postgres cluster | Standalone binary or cloud | Standalone binary or cloud |
| Horizontal scaling | Citus or read replicas | Native sharding + replication | Multi-node native |
| Managed cloud | Supabase, Neon, RDS | Qdrant Cloud | Weaviate Cloud |
| License | Apache 2.0 | Apache 2.0 | BSD-3 |

Latency at 10M vectors (p99, single-node, 1536 dims, ef=128): pgvector sits around 40-80ms, Qdrant 8-20ms, Weaviate 12-25ms. The gap is real and matters at query volume.

pgvector: when your existing Postgres stack is the right answer

pgvector makes the most sense when you already run Postgres and your vector workload is moderate, under 5M vectors with infrequent bulk loads. The killer feature is that you get joins for free. A scraped product with price, category, and embedding lives in one table, and you can filter on price < 50 AND category = 'electronics' before the ANN scan without moving data between systems.

SELECT id, title, 1 - (embedding <=> $1::vector) AS score
FROM scraped_products
WHERE category = 'electronics' AND price < 50
ORDER BY embedding <=> $1::vector
LIMIT 20;

The <=> cosine operator uses an HNSW index when one exists. Set hnsw.ef_search higher for recall, lower for speed. The tradeoff: at 10M+ vectors, index build time is painful (hours, not minutes), and parallel ingestion creates lock contention on the table. If your scraper runs continuous bulk upserts of embeddings, Postgres will feel that pressure.
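
One way to ease that write pressure is to buffer scraper output and upsert in fewer, larger transactions. A minimal batching sketch (the batch size and the idea of a per-batch multi-row upsert are illustrative assumptions, not pgvector requirements):

```python
def batched(rows, size=500):
    """Group scraped (id, embedding) rows into fixed-size batches,
    so each batch becomes one multi-row INSERT ... ON CONFLICT upsert
    instead of hundreds of single-row commits."""
    buf = []
    for row in rows:
        buf.append(row)
        if len(buf) >= size:
            yield buf
            buf = []
    if buf:
        yield buf
```

Each yielded batch would then go through a single multi-row statement (e.g., via psycopg's executemany), keeping per-row lock churn and transaction overhead down during continuous ingestion.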

Supabase ships pgvector out of the box, which is why it's the default starting point for many scraping-to-RAG projects. For anything that fits, it's operationally cheap and requires no additional infrastructure.

Qdrant: the production workhorse for high-volume scraping pipelines

Qdrant is the right call when you're storing tens of millions of vectors and need sub-20ms p99 with aggressive metadata filtering. Its payload indexing system lets you define typed indexes on JSON fields, so filtering on scraped metadata (source domain, scrape date, language, content type) happens before the HNSW scan, not after. That matters: with highly selective filters, post-filtering an ANN result set is a correctness trap, because the top-k nearest neighbors may contain few or none of the rows that pass the filter.

When you're scraping and chunking long-form articles for LLM context, you end up with many small chunks per source document. Qdrant's named vectors feature lets you store multiple embedding types per point (dense semantic + sparse BM25 weights) without duplicating the payload, which cuts storage by 30-40% in practice.
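
The chunking itself usually happens scraper-side, before anything reaches Qdrant. A minimal overlapping word-window sketch (window and overlap sizes are illustrative defaults, not recommendations from Qdrant):

```python
def chunk_words(text, window=200, overlap=40):
    """Split scraped article text into overlapping word windows,
    one chunk per embedding. Overlap preserves context across cuts."""
    words = text.split()
    step = window - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break
    return chunks
```

Every chunk from the same page would share payload fields like source URL and scrape date, which is exactly the metadata the payload indexes above are built on.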

from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

client = QdrantClient("localhost", port=6333)

# `embedding` is the query vector produced by your embedding model
results = client.search(
    collection_name="scraped_articles",
    query_vector=("dense", embedding),  # named dense vector
    query_filter=Filter(
        must=[FieldCondition(key="language", match=MatchValue(value="en"))]
    ),
    limit=10,
)

Operational notes worth knowing:

  • Qdrant snapshots are the backup primitive. Wire them into your cron or use Qdrant Cloud with S3-backed snapshots.
  • The gRPC interface is faster than REST for bulk operations. Use it for ingestion.
  • WAL + memmap storage means the first query after a restart on a large collection can take 10-30 seconds while memory-mapped segments page in. Account for this in health checks.
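
That cold-start note is worth automating: a readiness probe can fire cheap warm-up queries until latency settles, instead of passing a health check the moment the port opens. A sketch with an injected search callable (the callable and the thresholds are illustrative, not Qdrant API):

```python
import time

def wait_until_warm(run_search, target_ms=50.0, probes=5, timeout_s=60.0):
    """Repeatedly time run_search() and return True once `probes`
    consecutive calls all finish under target_ms; False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        ok = True
        for _ in range(probes):
            start = time.monotonic()
            run_search()  # e.g. a limit=1 query against the collection
            if (time.monotonic() - start) * 1000 > target_ms:
                ok = False
                break
        if ok:
            return True
    return False
```

Wiring this into the container's readiness check keeps the first real user query from eating the 10-30 second page-in cost.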

Weaviate: when you need graph context alongside vectors

Weaviate's edge case is when your scraped data has meaningful cross-document relationships you want to traverse. Think: scraped forum threads where replies reference other posts, or product catalogs where categories link to subcategories. Weaviate models this as a native object graph with vector-indexed classes and cross-references, so you can do nearText queries that propagate through linked objects.
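
As a sketch of what that looks like at the query layer, here is an illustrative Weaviate GraphQL query (held as a Python string; the ForumPost and ForumThread classes and the inThread cross-reference are hypothetical names for the forum example above):

```python
# Hypothetical schema: ForumPost objects carry an `inThread`
# cross-reference pointing at ForumThread objects.
NEAR_TEXT_QUERY = """
{
  Get {
    ForumPost(
      nearText: {concepts: ["gpu passthrough"]},
      limit: 5
    ) {
      content
      inThread {
        ... on ForumThread {
          title
        }
      }
    }
  }
}
"""
```

The `... on ForumThread` fragment is how Weaviate's GraphQL API resolves fields through a cross-reference, which is the graph traversal pgvector and Qdrant have no native equivalent for.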

Its auto-vectorization modules (OpenAI, Cohere, HuggingFace) mean you can configure the embedding provider at the schema level and skip managing embeddings explicitly. This is convenient but adds latency on write and couples your pipeline to the module API.

When building scrapers with LLM-driven auto-generated schemas, Weaviate's class/property schema maps naturally to Pydantic models. The mismatch is that Weaviate's schema is more rigid than document stores, so schema evolution during active scraping requires careful migration.
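
A sketch of that mapping, using stdlib dataclasses in place of Pydantic; the type table is deliberately partial and the ScrapedProduct class and fields are invented for illustration:

```python
from dataclasses import dataclass, fields

# Illustrative (partial) Python-type -> Weaviate dataType mapping
WEAVIATE_TYPES = {str: ["text"], int: ["int"], float: ["number"], bool: ["boolean"]}

@dataclass
class ScrapedProduct:
    title: str
    price: float
    in_stock: bool

def to_weaviate_class(model, class_name):
    """Derive a Weaviate class definition dict from a dataclass."""
    return {
        "class": class_name,
        "properties": [
            {"name": f.name, "dataType": WEAVIATE_TYPES[f.type]}
            for f in fields(model)
        ],
    }

schema = to_weaviate_class(ScrapedProduct, "ScrapedProduct")
```

The rigidity shows up exactly here: renaming price or changing its type means a schema migration in Weaviate, not just a new JSON key as with Qdrant payloads.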

For straightforward document retrieval without cross-references, Weaviate adds complexity you probably don't need. Qdrant or pgvector will be faster and simpler to operate.

Practical decision checklist

Use this to pick fast:

  1. Already on Postgres and under 5M vectors? Start with pgvector on Supabase or RDS.
  2. Need sub-20ms p99, 10M+ vectors, and rich metadata filtering? Use Qdrant.
  3. Need hybrid keyword+vector in one query without extra setup? Both Qdrant (sparse+dense) and Weaviate handle this natively. pgvector needs a BM25 workaround via pg_bm25 (ParadeDB).
  4. Storing multi-lingual scraped content with language-level filtering? Add a language payload index in Qdrant and filter before ANN.
  5. Running self-healing scraper pipelines where schema changes at runtime? See how LLMs handle selector breakage and make sure your vector store schema can absorb new fields gracefully.
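
The checklist above can be sketched as a tiny routing function (thresholds taken from this article; the function itself is illustrative, not a library API):

```python
def pick_vector_store(n_vectors, on_postgres, needs_graph, needs_fast_p99):
    """Route to a vector store using the checklist's thresholds."""
    if needs_graph:
        return "weaviate"   # cross-document traversal is first-class
    if on_postgres and n_vectors < 5_000_000 and not needs_fast_p99:
        return "pgvector"   # joins for free, no new infrastructure
    return "qdrant"         # default once scale or latency bites

choice = pick_vector_store(
    n_vectors=20_000_000, on_postgres=True,
    needs_graph=False, needs_fast_p99=True,
)
```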

For specialized scraping targets like Naver's Korean search engine, multilingual embedding models (e.g., multilingual-e5-large) paired with language-tagged payload filtering in Qdrant give the cleanest retrieval results across mixed CJK and Latin corpora.

Bottom line

For most scraped-data pipelines in 2026, Qdrant is the default choice once you cross 5M vectors: it's operationally simple, fast under load, and its filtering model is designed for exactly the kind of metadata scrapers produce. Stay on pgvector if Postgres is already your operational baseline and your scale fits. Use Weaviate only if cross-document graph traversal is a first-class requirement. DRT covers this infrastructure layer regularly, and the right answer here is almost always "start simple, migrate when the numbers force you."
