Scraping to ClickHouse: Real-Time Analytics Pipeline for Web Data (2026)

Web scraping pipelines hit a wall the moment you try to run aggregate queries over tens of millions of rows in Postgres or SQLite. ClickHouse fixes that. This guide walks through building a scraping-to-ClickHouse real-time analytics pipeline in 2026, covering ingestion patterns, schema design, and the tradeoffs you’ll hit in production.

Why ClickHouse for Web Scraping Data

Most scraped datasets are append-only, time-stamped, and read-heavy after the initial write. That’s exactly what ClickHouse’s MergeTree engine is optimized for. A query scanning 500 million e-commerce price records that would take 40+ seconds in Postgres finishes in under 2 seconds in ClickHouse on the same hardware.

Compare the realistic options for storing high-volume scraped data:

| Store | Best for | Writes/sec (single node) | Query speed (500M rows) | Cost |
|---|---|---|---|---|
| ClickHouse | Analytics, time-series, aggregates | 500K+ | <2s | Low (self-hosted) |
| Postgres | Transactional, relational joins | ~10K | 40-120s | Medium |
| MongoDB | Variable schema, document storage | ~50K | 10-30s | Medium-High |
| DuckDB | Local analytics, single-machine | N/A (in-process) | <3s | Free |
| BigQuery | Serverless analytics, team access | Managed | <5s | Per-query |

If your pipeline is local-only and you don’t need real-time, Scraping to DuckDB: Local Analytics Pipeline for Web Data (2026) is a lighter alternative. For distributed, multi-team access with live ingestion, ClickHouse wins.

Schema Design for Scraped Data

The single biggest mistake engineers make is designing ClickHouse tables like Postgres tables. ClickHouse doesn’t do row updates cheaply. Design for immutability from the start.

A solid schema for a price-tracking scraper:

CREATE TABLE product_prices (
    scraped_at   DateTime,
    domain       LowCardinality(String),
    product_id   String,
    price_usd    Float32,
    currency     LowCardinality(String),
    in_stock     UInt8,
    raw_url      String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(scraped_at)
ORDER BY (domain, product_id, scraped_at);

Key decisions here:

  • LowCardinality(String) on domain and currency compresses repeated values aggressively (10-20x savings on low-cardinality string columns with few distinct values)
  • ORDER BY determines the primary index and sort order; put your most common filter dimensions first
  • Monthly partitioning makes dropping old data a one-line ALTER TABLE DROP PARTITION instead of a slow DELETE
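With monthly partitioning, expiring old data is a metadata-only operation. A minimal sketch of building that ALTER statement (the helper name is my own; partitions created by `toYYYYMM(scraped_at)` are identified by their YYYYMM value):

```python
def drop_partition_sql(table: str, year: int, month: int) -> str:
    """Build the ALTER statement that drops one monthly partition.

    PARTITION BY toYYYYMM(scraped_at) names partitions like '202601',
    so the partition ID is simply the YYYYMM value.
    """
    partition_id = f"{year}{month:02d}"
    return f"ALTER TABLE {table} DROP PARTITION '{partition_id}'"
```

The resulting string can be passed to any ClickHouse client; dropping a partition completes in milliseconds regardless of how many rows it holds, unlike a row-by-row DELETE.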

Avoid Nullable columns unless you genuinely need the NULL distinction. Use sentinel values (empty string, 0) instead; ClickHouse processes non-nullable columns much faster because it skips the separate null-mask it would otherwise maintain.
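A small helper along these lines can normalize rows before insert (a sketch; the field names match the schema above, but the specific sentinel choices are my assumptions):

```python
# Sentinel values used in place of NULL for non-nullable columns.
# The choices per column are assumptions, not from the schema itself.
SENTINELS = {"price_usd": 0.0, "currency": "", "in_stock": 0, "raw_url": ""}

def apply_sentinels(row: dict) -> dict:
    """Replace missing or None fields with non-nullable sentinel values."""
    cleaned = dict(row)
    for field, sentinel in SENTINELS.items():
        if cleaned.get(field) is None:
            cleaned[field] = sentinel
    return cleaned
```

Run this in the scraper just before serialization so the table never needs Nullable column types.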

Ingestion Patterns: From Scraper to ClickHouse

There are three practical ingestion paths, each with different latency and complexity tradeoffs.

Direct HTTP Insert (simplest)

ClickHouse exposes an HTTP interface on port 8123. You can POST batches directly from your scraper:

import httpx, json

CLICKHOUSE_URL = "http://localhost:8123/"

def insert_batch(rows: list[dict]) -> None:
    # JSONEachRow expects one JSON object per line (newline-delimited JSON)
    payload = "\n".join(json.dumps(r) for r in rows)
    httpx.post(
        CLICKHOUSE_URL,
        params={"query": "INSERT INTO product_prices FORMAT JSONEachRow"},
        content=payload,
        headers={"Content-Type": "application/x-ndjson"},
        timeout=30.0,
    ).raise_for_status()

Batch size matters. Inserting one row at a time creates thousands of small parts, which ClickHouse must merge constantly. Aim for batches of 10,000 to 100,000 rows. If your scraper is slow, buffer in Redis or a local queue first.
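One way to enforce that batching discipline is a small in-process buffer that flushes only when a size threshold is reached. A sketch (the class is my own; in practice `flush` would be the `insert_batch` function above):

```python
class InsertBuffer:
    """Accumulate rows and flush them to ClickHouse in large batches.

    Row-at-a-time inserts create many small MergeTree parts that the
    server must constantly merge; buffering to 10k+ rows per insert
    keeps background merge work cheap.
    """

    def __init__(self, flush, batch_size: int = 10_000):
        self.flush = flush          # callable taking a list of row dicts
        self.batch_size = batch_size
        self.rows: list[dict] = []

    def add(self, row: dict) -> None:
        self.rows.append(row)
        if len(self.rows) >= self.batch_size:
            self.close()

    def close(self) -> None:
        """Flush any buffered rows; call this on scraper shutdown too."""
        if self.rows:
            self.flush(self.rows)
            self.rows = []
```

Calling `close()` on shutdown matters: without it, the tail of the final batch is silently dropped.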

Kafka as a Buffer (production-grade)

For pipelines scraping multiple sites in parallel, Kafka decouples producers from ClickHouse. ClickHouse has a native Kafka table engine that consumes directly:

CREATE TABLE product_prices_kafka_consumer
ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'broker:9092',
    kafka_topic_list = 'scrape.prices',
    kafka_group_name = 'clickhouse-consumer',
    kafka_format = 'JSONEachRow';

CREATE MATERIALIZED VIEW prices_mv TO product_prices AS
SELECT * FROM product_prices_kafka_consumer;

This gives you at-least-once delivery and lets you scale scraping workers independently of the DB. Orchestrating those workers with something like Scraping with Dagster: Orchestrating Web Scraping at Scale (2026) pairs well here since Dagster’s asset model maps cleanly onto Kafka topic partitions.
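On the producer side, scrape workers just publish JSONEachRow-compatible lines to the scrape.prices topic. A sketch of the serialization step (the producer library itself, e.g. confluent-kafka, is an assumption and omitted; only the message encoding is shown):

```python
import json
from datetime import datetime, timezone

def to_kafka_message(record: dict) -> bytes:
    """Serialize one scraped record into the JSONEachRow line the
    Kafka table engine above will parse. DateTime values are sent as
    strings ClickHouse can parse ('YYYY-MM-DD HH:MM:SS', UTC)."""
    payload = dict(record)
    ts = payload.get("scraped_at")
    if isinstance(ts, datetime):
        payload["scraped_at"] = ts.astimezone(timezone.utc).strftime(
            "%Y-%m-%d %H:%M:%S"
        )
    return json.dumps(payload).encode("utf-8")
```

Each returned bytes value would be one Kafka message; the materialized view then lands it in product_prices with no extra code.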

Prefect for Periodic Batch Loads

If real-time isn’t a requirement and you’re running scheduled scrape jobs, a Prefect flow writing Parquet to S3 and loading via clickhouse-client --query is simpler to operate than Kafka. Scraping with Prefect: Modern Workflow Orchestration for Scrapers (2026) covers that orchestration layer in detail.

Data Quality and Deduplication

Scrapers produce duplicates. A page scraped twice in 10 minutes generates two rows with identical product_id and close scraped_at. ClickHouse won’t silently deduplicate these.

Your options:

  1. Use ReplacingMergeTree with a version column so ClickHouse keeps only the latest row per key during merges (merges are async, so duplicates exist temporarily)
  2. Deduplicate at query time with SELECT DISTINCT or argMax(price_usd, scraped_at)
  3. Deduplicate upstream before insert using a bloom filter in Redis

For price tracking specifically, option 2 is usually fine. For pipeline data with strict uniqueness requirements (like audit logs or legal data), option 1 is safer.
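For option 3, the core idea can be sketched with a process-local seen-set (in production this would live in Redis, e.g. as a bloom filter shared across workers; the in-memory version here is an illustration, not the production design):

```python
class SeenFilter:
    """Drop rows already seen within a coarse time bucket.

    Keyed on (product_id, time bucket) so a page re-scraped minutes
    later counts as a duplicate, while tomorrow's scrape of the same
    product still gets through.
    """

    def __init__(self, bucket_seconds: int = 3600):
        self.bucket_seconds = bucket_seconds
        self.seen: set[tuple[str, int]] = set()

    def is_new(self, product_id: str, scraped_at_unix: int) -> bool:
        key = (product_id, scraped_at_unix // self.bucket_seconds)
        if key in self.seen:
            return False
        self.seen.add(key)
        return True
```

The scraper calls `is_new()` before buffering a row; anything returning False never reaches ClickHouse.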

Also worth considering: if your pipeline ingests variable-schema data, for example scrapers hitting APIs that return different fields per endpoint, Scraping to MongoDB: Schema-Less Storage for Variable Web Data documents a pattern where Mongo handles the variable layer and a normalized view feeds ClickHouse.

Querying Patterns for Scraped Data

ClickHouse SQL is ANSI-compatible but has quirks. The most useful functions for scraped data analytics:

  • toStartOfHour(scraped_at) / toStartOfDay(scraped_at) for time bucketing
  • quantile(0.95)(price_usd) for P95 price calculations across millions of products
  • groupArray(price_usd) to collect an array of values per product for trend charting
  • domain(raw_url) to extract the host from raw scraped URLs without regex

A real query pulling daily median price per domain:

SELECT
    domain,
    toDate(scraped_at) AS day,
    median(price_usd) AS median_price,
    count() AS data_points
FROM product_prices
WHERE scraped_at >= now() - INTERVAL 30 DAY
GROUP BY domain, day
ORDER BY domain, day;

This scans 30 days of data across 500 million rows in under 3 seconds on a 4-core ClickHouse instance. The same query in Postgres on a comparable machine runs 45-90 seconds even with indexes.
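To run that query from application code, it helps to parameterize the trailing window. A sketch that builds the SQL string (the helper is my own; the resulting string can be passed to any ClickHouse client, such as clickhouse-connect or the HTTP interface shown earlier):

```python
def daily_median_sql(days: int = 30) -> str:
    """Build the daily-median-per-domain query for a trailing window."""
    return f"""
SELECT
    domain,
    toDate(scraped_at) AS day,
    median(price_usd) AS median_price,
    count() AS data_points
FROM product_prices
WHERE scraped_at >= now() - INTERVAL {int(days)} DAY
GROUP BY domain, day
ORDER BY domain, day
""".strip()
```

The `int(days)` cast is a cheap guard against injecting non-numeric input into the interval clause.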

For production pipelines where scraped data feeds business decisions, like the kind of real estate or market research use cases covered in Real Estate API vs Web Scraping: When to Build vs Buy Your Data Pipeline (2026), that latency difference determines whether analysts can run ad-hoc queries or need to wait for overnight reports.

Bottom Line

Use ClickHouse when your scraped dataset exceeds 50 million rows, you need sub-second aggregates, and your write pattern is append-only. Start with direct HTTP batch inserts, and add Kafka only when your ingestion rate exceeds what a single writer thread can sustain. Design schemas with MergeTree, monthly partitioning, and LowCardinality on repeated strings from day one; retrofitting these later requires a full table rewrite. DRT covers more pipeline architecture patterns like this across the data collection stack.
