Web scraping pipelines hit a wall the moment you try to run aggregate queries over tens of millions of rows in Postgres or SQLite. ClickHouse fixes that. This guide walks through building a scraping-to-ClickHouse real-time analytics pipeline in 2026, covering ingestion patterns, schema design, and the tradeoffs you’ll hit in production.
Why ClickHouse for Web Scraping Data
Most scraped datasets are append-only, time-stamped, and read-heavy after the initial write. That’s exactly what ClickHouse’s MergeTree engine is optimized for. A query scanning 500 million e-commerce price records that would take 40+ seconds in Postgres finishes in under 2 seconds in ClickHouse on the same hardware.
Compare the realistic options for storing high-volume scraped data:
| Store | Best for | Writes/sec (single node) | Query speed (500M rows) | Cost |
|---|---|---|---|---|
| ClickHouse | Analytics, time-series, aggregates | 500K+ | <2s | Low (self-hosted) |
| Postgres | Transactional, relational joins | ~10K | 40-120s | Medium |
| MongoDB | Variable schema, document storage | ~50K | 10-30s | Medium-High |
| DuckDB | Local analytics, single-machine | N/A (in-process) | <3s | Free |
| BigQuery | Serverless analytics, team access | Managed | <5s | Per-query |
If your pipeline is local-only and you don’t need real-time, Scraping to DuckDB: Local Analytics Pipeline for Web Data (2026) is a lighter alternative. For distributed, multi-team access with live ingestion, ClickHouse wins.
Schema Design for Scraped Data
The single biggest mistake engineers make is designing ClickHouse tables like Postgres tables. ClickHouse doesn’t do row updates cheaply. Design for immutability from the start.
A solid schema for a price-tracking scraper:
```sql
CREATE TABLE product_prices (
    scraped_at DateTime,
    domain LowCardinality(String),
    product_id String,
    price_usd Float32,
    currency LowCardinality(String),
    in_stock UInt8,
    raw_url String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(scraped_at)
ORDER BY (domain, product_id, scraped_at);
```

Key decisions here:
- `LowCardinality(String)` on `domain` and `currency` compresses repeated values aggressively (10-20x savings on string columns with a small set of distinct values)
- `ORDER BY` determines the primary index and sort order; put your most common filter dimensions first
- Monthly partitioning makes dropping old data a one-line `ALTER TABLE ... DROP PARTITION` instead of a slow DELETE
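For example, expiring a month of data against the schema above is a single cheap statement; the partition ID follows the `toYYYYMM` format, and `202601` here is just an illustration:

```sql
-- Drops every row scraped in January 2026 in one operation:
-- no row-by-row DELETE, no mutation to wait on
ALTER TABLE product_prices DROP PARTITION 202601;
```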
Avoid Nullable columns unless you genuinely need to distinguish NULL from a default value. Use sentinel values (empty string, 0) instead; non-nullable columns are faster to scan and aggregate because ClickHouse doesn’t have to maintain a separate null mask for them.
Ingestion Patterns: From Scraper to ClickHouse
There are three practical ingestion paths, each with different latency and complexity tradeoffs.
Direct HTTP Insert (simplest)
ClickHouse exposes an HTTP interface on port 8123. You can POST batches directly from your scraper:
```python
import json

import httpx

CLICKHOUSE_URL = "http://localhost:8123/"

def insert_batch(rows: list[dict]):
    # JSONEachRow expects one JSON object per line
    payload = "\n".join(json.dumps(r) for r in rows)
    httpx.post(
        CLICKHOUSE_URL,
        params={"query": "INSERT INTO product_prices FORMAT JSONEachRow"},
        content=payload,
        headers={"Content-Type": "application/json"},
    ).raise_for_status()
```

Batch size matters. Inserting one row at a time creates thousands of small parts, which ClickHouse must merge constantly. Aim for batches of 10,000 to 100,000 rows. If your scraper is slow, buffer in Redis or a local queue first.
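For many pipelines a small in-process accumulator is enough to hit that batch size. A minimal sketch, assuming the `insert_batch` function above and an illustrative 10,000-row flush threshold:

```python
class BufferedInserter:
    """Accumulates scraped rows and flushes them to ClickHouse in batches."""

    def __init__(self, flush_size: int = 10_000):
        self.flush_size = flush_size
        self.rows: list[dict] = []

    def add(self, row: dict):
        self.rows.append(row)
        # Flush once the buffer reaches the target batch size
        if len(self.rows) >= self.flush_size:
            self.flush()

    def flush(self):
        if self.rows:
            insert_batch(self.rows)  # defined above
            self.rows = []
```

Call `flush()` one final time when the scrape run finishes so the last partial batch isn’t dropped.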
Kafka as a Buffer (production-grade)
For pipelines scraping multiple sites in parallel, Kafka decouples producers from ClickHouse. ClickHouse has a native Kafka table engine that consumes directly:
```sql
-- The Kafka engine table needs an explicit column list; this mirrors product_prices
CREATE TABLE product_prices_kafka_consumer (
    scraped_at DateTime,
    domain String,
    product_id String,
    price_usd Float32,
    currency String,
    in_stock UInt8,
    raw_url String
) ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'broker:9092',
    kafka_topic_list = 'scrape.prices',
    kafka_group_name = 'clickhouse-consumer',
    kafka_format = 'JSONEachRow';

CREATE MATERIALIZED VIEW prices_mv TO product_prices AS
SELECT * FROM product_prices_kafka_consumer;
```

This gives you at-least-once delivery and lets you scale scraping workers independently of the DB. Orchestrating those workers with something like Scraping with Dagster: Orchestrating Web Scraping at Scale (2026) pairs well here since Dagster’s asset model maps cleanly onto Kafka topic partitions.
Prefect for Periodic Batch Loads
If real-time isn’t a requirement and you’re running scheduled scrape jobs, a Prefect flow writing Parquet to S3 and loading via `clickhouse-client --query` is simpler to operate than Kafka. Scraping with Prefect: Modern Workflow Orchestration for Scrapers (2026) covers that orchestration layer in detail.
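The load itself can be a single statement executed through `clickhouse-client --query`. A minimal sketch using ClickHouse’s `s3` table function; the bucket, path, and file glob are placeholders, and private buckets need credential arguments added:

```sql
-- Load the run's Parquet exports from S3 directly into the target table
INSERT INTO product_prices
SELECT * FROM s3('https://my-bucket.s3.amazonaws.com/prices/2026-01-*.parquet', 'Parquet');
```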
Data Quality and Deduplication
Scrapers produce duplicates. A page scraped twice in 10 minutes generates two rows with identical product_id and close scraped_at. ClickHouse won’t silently deduplicate these.
Your options:
- Use `ReplacingMergeTree` with a `version` column so ClickHouse keeps only the latest row per key during merges (merges are async, so duplicates exist temporarily)
- Deduplicate at query time with `SELECT DISTINCT` or `argMax(price_usd, scraped_at)`
- Deduplicate upstream before insert using a bloom filter in Redis
For price tracking specifically, option 2 is usually fine. For pipeline data with strict uniqueness requirements (like audit logs or legal data), option 1 is safer.
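Query-time dedup looks like this in practice; a sketch against the product_prices table above that keeps the most recently scraped price per key:

```sql
-- Latest observed price per product, collapsing duplicate scrapes at read time
SELECT
    domain,
    product_id,
    argMax(price_usd, scraped_at) AS latest_price,
    max(scraped_at) AS last_seen
FROM product_prices
GROUP BY domain, product_id;
```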
Also worth considering: if your scraping pipeline includes variable-schema data (for example, scrapers hitting APIs that return different fields per endpoint), Scraping to MongoDB: Schema-Less Storage for Variable Web Data documents a pattern where Mongo handles the variable layer and a normalized view feeds ClickHouse.
Querying Patterns for Scraped Data
ClickHouse SQL is ANSI-compatible but has quirks. The most useful functions for scraped data analytics:
- `toStartOfHour(scraped_at)` / `toStartOfDay(scraped_at)` for time bucketing
- `quantile(0.95)(price_usd)` for P95 price calculations across millions of products
- `groupArray(price_usd)` to collect an array of values per product for trend charting
- `domain(raw_url)` to extract the host from raw scraped URLs without regex
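As a concrete example of `quantile` and `groupArray`, a sketch of per-product summaries over the last week (note that `groupArray` makes no ordering guarantee, so sort downstream if you need a strict time series):

```sql
-- P95 price and the raw price observations per product over the last 7 days
SELECT
    product_id,
    quantile(0.95)(price_usd) AS p95_price,
    groupArray(price_usd) AS price_observations
FROM product_prices
WHERE scraped_at >= now() - INTERVAL 7 DAY
GROUP BY product_id;
```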
A real query pulling daily median price per domain:
```sql
SELECT
    domain,
    toDate(scraped_at) AS day,
    median(price_usd) AS median_price,
    count() AS data_points
FROM product_prices
WHERE scraped_at >= now() - INTERVAL 30 DAY
GROUP BY domain, day
ORDER BY domain, day;
```

This scans 30 days of data across 500 million rows in under 3 seconds on a 4-core ClickHouse instance. The same query in Postgres on a comparable machine runs 45-90 seconds even with indexes.
For production pipelines where scraped data feeds business decisions, like the kind of real estate or market research use cases covered in Real Estate API vs Web Scraping: When to Build vs Buy Your Data Pipeline (2026), that latency difference determines whether analysts can run ad-hoc queries or need to wait for overnight reports.
Bottom Line
Use ClickHouse when your scraped dataset exceeds 50 million rows, you need aggregate queries back in seconds rather than minutes, and your write pattern is append-only. Start with direct HTTP batch inserts, and add Kafka only when your ingestion rate exceeds what a single writer thread can sustain. Design schemas with `MergeTree`, monthly partitioning, and `LowCardinality` on repeated strings from day one; retrofitting these later requires a full table rewrite. DRT covers more pipeline architecture patterns like this across the data collection stack.
Related guides on dataresearchtools.com
- Scraping to DuckDB: Local Analytics Pipeline for Web Data (2026)
- Scraping with Dagster: Orchestrating Web Scraping at Scale (2026)
- Scraping with Prefect: Modern Workflow Orchestration for Scrapers (2026)
- Scraping to MongoDB: Schema-Less Storage for Variable Web Data
- Pillar: Real Estate API vs Web Scraping: When to Build vs Buy Your Data Pipeline (2026)