Scraping crypto exchange order books in 2026

Pipelines that scrape crypto order books are among the more demanding data engineering problems on the public internet. A single Binance BTC/USDT depth feed pushes 50-200 updates per second during normal hours and spikes past 1,000 per second during news events. Multiply that by 200 trading pairs across 8 exchanges and you have a firehose that demands careful architecture decisions before you write a single line of code. The good news is that almost every major exchange exposes order book data through public websockets that do not require API keys for read access. The bad news is that the latency, message ordering, and gap recovery decisions you make on the first day of building will haunt the system for years.

This guide covers the practical mechanics of scraping crypto order books in 2026: which exchange feeds work, how to normalize depth across venues, the latency vs cost tradeoff for hosting, and the storage patterns that let small teams keep multi-month order book history without burning $20,000 a month on infrastructure.

Why scrape order books instead of buying the data

Three vendors dominate paid crypto market data: Kaiko, CryptoCompare, and Tardis.dev. They are excellent. They are also expensive: a full L2 order book history feed for 50 pairs across 5 exchanges runs $5,000-25,000 per month depending on tier. For a quant fund this is rounding error. For an indie research project, an algo trading hobbyist, or a startup building MEV tooling, it is the entire budget.

Self-collecting order book data costs roughly $200-500 per month in compute and storage if you are disciplined. The tradeoff is engineering time. The first month is hard. After that the pipeline runs itself with weekly babysitting. For research that depends on data going back further than the day you started collecting, you still need a vendor for the historical backfill, but ongoing collection is cheap.

What “order book” actually means at the wire level

Every exchange streams order book updates as either snapshots or deltas. A snapshot is the full current state of the book at a moment in time. A delta is a list of price-level changes since the last update. Most production feeds are delta-based with periodic snapshot resyncs.

A typical delta message looks like this from Binance:

{
  "e": "depthUpdate",
  "E": 1715000000000,
  "s": "BTCUSDT",
  "U": 12345600,
  "u": 12345610,
  "b": [["63500.00", "0.5"], ["63499.50", "0"]],
  "a": [["63501.00", "1.2"], ["63502.00", "0.8"]]
}

The b and a arrays are bid and ask updates. A quantity of 0 means delete that price level. The U and u fields are the first and last update IDs in this message, which you use to detect gaps. If you miss a message, you have to refetch the full snapshot from the REST endpoint and replay deltas from there.

This gap recovery logic is where most amateur scrapers break. A naive listener that just appends deltas without checking sequence numbers will silently corrupt the book within minutes.

import asyncio
import json

import requests
import websockets


class GapDetectedError(Exception):
    """Raised when a delta's first update ID skips past the last applied one."""


class BinanceDepthListener:
    def __init__(self, symbol: str):
        self.symbol = symbol.upper()
        self.bids = {}  # price -> qty
        self.asks = {}
        self.last_update_id = None

    async def fetch_snapshot(self):
        url = f"https://api.binance.com/api/v3/depth?symbol={self.symbol}&limit=1000"
        # requests is blocking; run it in a thread so the event loop keeps running
        resp = await asyncio.to_thread(requests.get, url, timeout=10)
        resp.raise_for_status()
        data = resp.json()
        self.last_update_id = data["lastUpdateId"]
        self.bids = {float(p): float(q) for p, q in data["bids"]}
        self.asks = {float(p): float(q) for p, q in data["asks"]}

    def apply_delta(self, msg):
        # Already covered by the snapshot or an earlier delta: ignore
        if msg["u"] <= self.last_update_id:
            return
        # First update ID skips ahead of what we applied: a message was missed
        if msg["U"] > self.last_update_id + 1:
            raise GapDetectedError(msg["U"], self.last_update_id)
        for price, qty in msg["b"]:
            p, q = float(price), float(qty)
            if q == 0:
                self.bids.pop(p, None)  # quantity 0 deletes the level
            else:
                self.bids[p] = q
        for price, qty in msg["a"]:
            p, q = float(price), float(qty)
            if q == 0:
                self.asks.pop(p, None)
            else:
                self.asks[p] = q
        self.last_update_id = msg["u"]

    async def run(self):
        url = f"wss://stream.binance.com:9443/ws/{self.symbol.lower()}@depth@100ms"
        async with websockets.connect(url) as ws:
            # Connect first so deltas queue up while the snapshot is fetched
            await self.fetch_snapshot()
            async for raw in ws:
                msg = json.loads(raw)
                try:
                    self.apply_delta(msg)
                except GapDetectedError:
                    # Book can no longer be trusted: rebuild from a fresh snapshot
                    await self.fetch_snapshot()

The @100ms suffix in the websocket URL is critical. Without it you get the default 1000ms depth stream, which is slow. The 100ms feed is the highest cadence Binance offers without paying for the institutional pro feed.

Exchange feed comparison

| exchange | websocket URL | best feed cadence | snapshot source | message ordering |
| --- | --- | --- | --- | --- |
| Binance Spot | wss://stream.binance.com:9443 | 100ms | REST, /api/v3/depth | sequence IDs |
| Binance Futures | wss://fstream.binance.com | 100ms | REST, /fapi/v1/depth | sequence IDs |
| Coinbase | wss://ws-feed.exchange.coinbase.com | per event | in stream | sequence per product |
| OKX | wss://ws.okx.com:8443 | 100ms | in stream | checksum per update |
| Bybit | wss://stream.bybit.com/v5/public/spot | per event | REST | update IDs |
| Kraken | wss://ws.kraken.com/v2 | per event | in stream | checksum per book |
| KuCoin | wss://ws-api.kucoin.com | per event | REST | sequence IDs |
| Bitget | wss://ws.bitget.com/v2 | 100ms | in stream | checksum per update |

Coinbase, OKX, and Kraken include the initial snapshot in the websocket stream when you subscribe. Binance, Bybit, and KuCoin require a separate REST call. The stream-included snapshot is faster to start with but harder to recover from on disconnect because you need to resubscribe to get a new one. The REST snapshot pattern is more flexible.

Latency budget

Order book data is only useful if you can act on it within the timeframe relevant to your strategy. For research and analytics, 100-500ms latency is fine. For market making or arbitrage, you need single-digit milliseconds. The hosting choice changes accordingly.

Binance hosts its main matching engine in AWS Tokyo. The lowest latency to its websocket is from a server in ap-northeast-1. Coinbase runs from AWS US-East-1 (Ashburn). OKX runs from Hong Kong. Bybit from AWS Singapore. If you want sub-10ms feeds you have to host in the same AZ as the exchange and pay for cross-connect or AWS DX.

For research-grade scraping, a $40/month VPS in Tokyo or Singapore from Hetzner, OVH, or Vultr gets you under 30ms to most exchanges. That is fine for everything except active trading strategies.

Normalizing depth across exchanges

Every exchange publishes its order book in slightly different formats. To do any cross-exchange analysis you need a common schema. The minimum viable normalized record:

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class NormalizedBook:
    exchange: str
    symbol: str  # canonical, like "BTC-USDT"
    captured_at_ms: int  # exchange timestamp
    received_at_ms: int  # local receive timestamp
    bids: List[Tuple[float, float]]  # sorted desc
    asks: List[Tuple[float, float]]  # sorted asc
    sequence: int

The two timestamps are critical. captured_at_ms is what the exchange reported. received_at_ms is when your collector got the message. The difference tells you network latency. Without both you cannot diagnose feed degradation.
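
As a rough illustration of how the two timestamps become a feed-health metric, here is a minimal sketch. The helper name `latency_stats` and the input shape (a list of `(captured_at_ms, received_at_ms)` pairs) are assumptions for this example, and the numbers are only meaningful if the collector clock is well synchronized.

```python
from statistics import median


def latency_stats(samples: list[tuple[int, int]]) -> dict:
    """Summarize feed lag from (captured_at_ms, received_at_ms) pairs."""
    # Lag per message = local receive time minus the exchange's timestamp.
    lags = sorted(received - captured for captured, received in samples)
    if not lags:
        return {"count": 0, "median_ms": None, "p99_ms": None, "max_ms": None}
    # Nearest-rank style p99, clamped to the last element for small samples.
    p99_index = min(len(lags) - 1, int(0.99 * len(lags)))
    return {
        "count": len(lags),
        "median_ms": median(lags),
        "p99_ms": lags[p99_index],
        "max_ms": lags[-1],
    }
```

Tracking the median and the tail separately is a deliberate choice: a stable median with a growing p99 usually points at intermittent stalls rather than a slower route.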

Symbol normalization is annoying but mechanical. Binance uses BTCUSDT, Coinbase uses BTC-USD, OKX uses BTC-USDT. Maintain a mapping table:

SYMBOL_MAP = {
    "binance": {"BTCUSDT": "BTC-USDT", "ETHUSDT": "ETH-USDT"},
    "coinbase": {"BTC-USD": "BTC-USD", "ETH-USD": "ETH-USD"},
    "okx": {"BTC-USDT": "BTC-USDT", "ETH-USDT": "ETH-USDT"},
}

Note that USD and USDT are not the same. Coinbase quotes against USD, most others against USDT. For arbitrage analysis you need to track this carefully and convert via the USDT/USD pair.
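
A minimal sketch of both steps follows. The function names `canonical_symbol` and `usdt_price_to_usd` are hypothetical, and the conversion assumes you already have a current USDT/USD rate from a venue that lists that pair.

```python
SYMBOL_MAP = {
    "binance": {"BTCUSDT": "BTC-USDT", "ETHUSDT": "ETH-USDT"},
    "coinbase": {"BTC-USD": "BTC-USD", "ETH-USD": "ETH-USD"},
    "okx": {"BTC-USDT": "BTC-USDT", "ETH-USDT": "ETH-USDT"},
}


def canonical_symbol(exchange: str, raw_symbol: str) -> str:
    """Map an exchange-native symbol to the canonical form, failing loudly."""
    try:
        return SYMBOL_MAP[exchange][raw_symbol]
    except KeyError:
        raise KeyError(f"no symbol mapping for {exchange}:{raw_symbol}") from None


def usdt_price_to_usd(price_usdt: float, usdt_usd_rate: float) -> float:
    """Convert a USDT-quoted price to USD using the USDT/USD rate."""
    return price_usdt * usdt_usd_rate
```

Raising on an unmapped symbol rather than passing it through keeps a newly listed pair from silently landing in storage under a non-canonical name.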

Storage: the real cost driver

Naive storage of every depth update across 50 pairs and 5 exchanges produces 100-500 GB per day. Compressed Parquet brings it down to 20-80 GB. Most operations cannot afford to keep raw deltas indefinitely, so the standard pattern is tiered:

  • Hot: last 7 days of full deltas in ClickHouse or DuckDB
  • Warm: last 90 days of 1-second OHLCV bars + L2 snapshots every minute
  • Cold: indefinite history of 1-minute snapshots in compressed Parquet on S3

This gets you to roughly $50-150/month in storage for 5-exchange, 50-pair coverage. The compression ratio matters: Parquet with ZSTD level 9 hits about 8:1 on order book data because most price levels do not change between snapshots.

import pyarrow as pa
import pyarrow.parquet as pq

def write_snapshot_batch(snapshots: list, path: str):
    table = pa.Table.from_pylist(snapshots)
    pq.write_table(
        table,
        path,
        compression="zstd",
        compression_level=9,
        use_dictionary=True,
    )

ClickHouse handles the hot path well. A modest 4-core server holds 30 days of full deltas for 50 pairs across 5 exchanges with room to query.
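
The per-minute snapshots for the warm and cold tiers have to be derived from the denser stream before they are written out. One way to do that is sketched below, assuming each snapshot is a dict with a `captured_at_ms` key (a hypothetical record shape) and that keeping the last snapshot in each interval is acceptable.

```python
def downsample_snapshots(snapshots: list[dict], interval_ms: int = 60_000) -> list[dict]:
    """Keep only the last snapshot seen in each time bucket."""
    latest_per_bucket: dict[int, dict] = {}
    for snap in sorted(snapshots, key=lambda s: s["captured_at_ms"]):
        bucket = snap["captured_at_ms"] // interval_ms
        # Later snapshots in the same bucket overwrite earlier ones.
        latest_per_bucket[bucket] = snap
    return [latest_per_bucket[b] for b in sorted(latest_per_bucket)]
```

The output list can be handed straight to a Parquet writer; buckets with no snapshot are simply absent rather than filled forward.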

Recovering from a corrupted book

The deepest source of bugs is when your in-memory book diverges from reality and you do not notice for hours. Defensive checks that run continuously in the background:

  1. Top-of-book sanity: every minute, fetch the top of book via REST and compare to your cached book’s top bid/ask. A divergence greater than 0.5% on a liquid pair means resync.
  2. Crossed book detection: the highest bid should never exceed the lowest ask. If it does, the book is corrupted; trigger a full snapshot resync.
  3. Negative quantity detection: quantities should always be positive. Any negative value indicates a delta-application bug.
  4. Sequence gap counter: track the rate of detected gaps per hour. A baseline of 0.5-2 gaps per hour per pair is normal. A sudden jump indicates network degradation worth investigating before it affects downstream consumers.
  5. Idle channel detection: if no message arrives for a pair in 30 seconds during a normally active hour, force a reconnect. Idle does not always mean closed; the socket can hang in a half-open state.

A simple watchdog process running these checks across all pairs adds about 3% overhead and catches >95% of the silent corruption cases that otherwise show up as bad model outputs days later.
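
Three of these checks (non-positive quantities, crossed book, top-of-book divergence) fit in one small function. This is a sketch under assumptions: `check_book` is a hypothetical name, the book is held as price-to-quantity dicts, and the REST top of book is fetched elsewhere and passed in.

```python
from typing import Optional


def check_book(
    bids: dict[float, float],
    asks: dict[float, float],
    rest_best_bid: Optional[float] = None,
    rest_best_ask: Optional[float] = None,
    max_divergence: float = 0.005,  # 0.5%, the threshold used above
) -> list[str]:
    """Return the names of failed sanity checks; an empty list means healthy."""
    problems = []
    # Zero should have deleted the level and negatives should never occur.
    if any(q <= 0 for q in list(bids.values()) + list(asks.values())):
        problems.append("non_positive_quantity")
    if not bids or not asks:
        return problems
    best_bid, best_ask = max(bids), min(asks)
    # A bid at or above the best ask cannot rest on a single venue's book.
    if best_bid >= best_ask:
        problems.append("crossed_book")
    # Compare our cached top of book with the REST reference, if supplied.
    for ours, theirs in ((best_bid, rest_best_bid), (best_ask, rest_best_ask)):
        if theirs is not None and abs(ours - theirs) / theirs > max_divergence:
            problems.append("top_of_book_divergence")
            break
    return problems
```

Any non-empty result would be treated as a trigger for a full snapshot resync.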

Connection management at scale

Each websocket connection costs file descriptors and memory. A single Python process with the websockets library can comfortably handle 100-200 simultaneous connections. Past that, you fragment across processes or move to a more efficient runtime.

For the 50-pair, 5-exchange scenario, the practical architecture is one process per exchange handling all that exchange’s pairs as a single multiplexed subscription where the protocol allows it. Binance and Coinbase both accept multi-symbol subscribe messages on a single connection. OKX requires one subscription per channel but supports multiplexing across symbols.

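import json

import websockets


async def process(stream: str, data: dict) -> None:
    # Placeholder: route `data` to the per-symbol book identified by `stream`
    pass


async def binance_multi_symbol(symbols: list[str]):
    streams = "/".join([f"{s.lower()}@depth@100ms" for s in symbols])
    url = f"wss://stream.binance.com:9443/stream?streams={streams}"
    async with websockets.connect(url, ping_interval=20) as ws:
        async for raw in ws:
            msg = json.loads(raw)
            # Combined streams wrap each payload as {"stream": ..., "data": ...}
            await process(msg["stream"], msg["data"])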

Reconnect logic must include exponential backoff plus full snapshot refetch. A 30-second disconnect during a busy hour means thousands of missed updates. Just resuming the websocket subscription will give you a corrupted book.
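
A minimal sketch of that reconnect loop is below. The names `backoff_delays`, `run_with_backoff`, and `run_session` are hypothetical; `run_session` stands for a coroutine that connects, refetches the snapshot, and streams until the socket drops. The 300-second cap and 10% jitter are assumptions, not exchange requirements.

```python
import asyncio
import random


def backoff_delays(base: float = 1.0, cap: float = 300.0, jitter: float = 0.1):
    """Yield delays of base, 2*base, 4*base, ... seconds, capped at `cap`."""
    attempt = 0
    while True:
        delay = min(cap, base * (2 ** attempt))
        # Jitter keeps many collectors from reconnecting in lockstep.
        yield delay * (1 + random.uniform(-jitter, jitter))
        if delay < cap:
            attempt += 1


async def run_with_backoff(run_session, max_sessions=None, sleep=asyncio.sleep):
    """Run sessions forever (or max_sessions times), backing off after failures."""
    delays = backoff_delays()
    sessions = 0
    while max_sessions is None or sessions < max_sessions:
        sessions += 1
        try:
            # Each session starts from a fresh snapshot, never a stale book.
            await run_session()
            delays = backoff_delays()  # clean exit: reset the backoff
        except Exception:  # sketch only; narrow to connection errors in real code
            pass
        await sleep(next(delays))
```

Making the snapshot refetch part of every session, rather than a separate recovery path, means there is no code path that resumes deltas on top of an old book.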

Proxy considerations

Public websocket endpoints generally do not enforce strict per-IP rate limits because they are designed for HFT clients with stable connections. You usually do not need a proxy for normal volume.

The exception is REST snapshot endpoints. Binance imposes a weight-based rate limit on REST: 6000 weight per minute per IP, and the depth endpoint at limit=1000 costs 50 weight. That allows about 120 snapshots per minute. If you have more than 50 pairs and a busy gap-recovery day, you can blow through the limit.

For that case, route REST snapshot traffic through a small datacenter proxy pool of 5-10 IPs. The websocket can stay on your direct connection. We cover the broader proxy strategy in our best datacenter proxy providers 2026 review.
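
Before reaching for proxies, it helps to track how much REST weight each IP has already spent. The class below is a sketch: `WeightBudget` is a hypothetical name, and it approximates the exchange's accounting with a local one-minute window, which will not line up exactly with the server's own window boundaries.

```python
import time


class WeightBudget:
    """Track REST request weight spent in the current one-minute window."""

    def __init__(self, limit_per_minute: int = 6000, clock=time.monotonic):
        self.limit = limit_per_minute
        self.clock = clock
        self.window_start = clock()
        self.used = 0

    def try_spend(self, weight: int) -> bool:
        """Reserve `weight` if it fits in this window; return False otherwise."""
        now = self.clock()
        if now - self.window_start >= 60:
            # A new window has started: reset the counter.
            self.window_start, self.used = now, 0
        if self.used + weight > self.limit:
            return False
        self.used += weight
        return True
```

With the defaults, `try_spend(50)` succeeds 120 times per window, matching the 120 snapshots per minute worked out above. One instance per outbound IP would extend this to a proxy pool.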

Storage cost worked example

A practical breakdown for the standard 50-pair, 5-exchange research deployment looks like this. Raw delta volume averages 18 GB per exchange per day during normal weeks, climbing to 50-80 GB during high-volatility events. Across 5 exchanges that is 90-400 GB per day uncompressed. After ZSTD-9 Parquet compression with column dictionaries, expect 11-50 GB per day landing in cold storage. At AWS S3 Standard pricing of $0.023 per GB-month and a 90-day rolling cold tier, total storage cost falls between $25 and $110 per month for the historical archive, plus another $40-80 per month for the ClickHouse hot tier on a 4 vCPU 8 GB box.
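
The archive arithmetic can be reproduced in a few lines. This sketch assumes a steady state with a full 90 days retained and the 8:1 compression ratio quoted earlier; `cold_storage_cost` is a hypothetical helper name.

```python
def cold_storage_cost(
    raw_gb_per_day: float,
    compression_ratio: float = 8.0,
    retention_days: int = 90,
    price_per_gb_month: float = 0.023,
) -> float:
    """Monthly cost in USD of a rolling archive at steady state."""
    compressed_gb_per_day = raw_gb_per_day / compression_ratio
    stored_gb = compressed_gb_per_day * retention_days
    return stored_gb * price_per_gb_month


low = cold_storage_cost(90)    # quiet weeks: 90 GB/day raw
high = cold_storage_cost(400)  # volatile weeks: 400 GB/day raw
```

Under these assumptions `low` comes to about $23 and `high` to about $104, in line with the $25-110 range above.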

Egress is the silent budget killer. If you ever need to move 5 TB of historical Parquet to a different cloud for a research project, AWS will bill roughly $450 in egress alone (about $0.09 per GB). Either run the analytics in the same region as the bucket or use Cloudflare R2 / Backblaze B2, both of which have free or near-free egress and price storage at $0.005-0.015 per GB-month. The R2 path saves real money once the archive grows past 1 TB.

Cross-exchange arbitrage signal extraction

The classic application of order book scraping is arbitrage detection. The simplest version: for every (base asset, quote asset) pair, find the highest bid across all exchanges and the lowest ask across all exchanges. If the highest bid is greater than the lowest ask plus fees, there is an opportunity (in theory).

In practice, transfer time, withdrawal fees, exchange-specific rules, and slippage eat most of the gap. But the same data lets you compute the cross-exchange spread distribution over time, which is genuinely useful for understanding market microstructure and for backtesting more sophisticated strategies.

def best_bid_ask(books: dict[str, NormalizedBook]) -> dict:
    best_bid = max((b.bids[0] for b in books.values() if b.bids), key=lambda x: x[0])
    best_ask = min((b.asks[0] for b in books.values() if b.asks), key=lambda x: x[0])
    return {
        "best_bid": best_bid[0],
        "best_ask": best_ask[0],
        "spread": best_ask[0] - best_bid[0],
    }
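
To make the "plus fees" condition concrete, the sketch below expresses the gap in basis points and subtracts taker fees on both legs. The function name `net_edge_bps` and the 10 bps default fee are assumptions for illustration; real fee schedules vary by venue and tier, and transfer costs and slippage are not modeled.

```python
def net_edge_bps(best_bid: float, best_ask: float, taker_fee_bps: float = 10.0) -> float:
    """Cross-venue edge in basis points after taker fees on both legs.

    best_bid is the highest bid on any venue (where you would sell);
    best_ask is the lowest ask on any venue (where you would buy).
    """
    mid = (best_bid + best_ask) / 2
    gross_bps = (best_bid - best_ask) / mid * 10_000
    # One taker fee to buy on the cheap venue, one to sell on the rich one.
    return gross_bps - 2 * taker_fee_bps
```

A positive result only says the quoted prices cross by more than the fees; whether the size at those levels is enough to trade is a separate question.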

Common gotchas

A few traps from real production deployments:

  • Binance’s lastUpdateId from the REST snapshot is occasionally lower than the U of the first delta you have buffered. In that case the snapshot predates your buffer and has to be refetched. Once the snapshot is recent enough, the official spec says to discard buffered deltas where u <= lastUpdateId and apply the rest. Keep at least 200 deltas buffered during reconnect, because the snapshot and the stream can be several seconds apart during a busy moment.
  • Coinbase’s match channel and the level2 channel are separate. To compute correct mid-price you need only level2; matches are useful for trade history but should not feed the book directly.
  • OKX checksums are CRC32 over a specific concatenation of the top 25 levels. If the checksum fails twice in a row, OKX expects you to resubscribe, not to reconnect the socket.
  • Bybit’s v5 endpoint changed sequence semantics from v3. Old code that assumed continuous sequence numbers across all symbols on one connection will silently corrupt because v5 sequences are per symbol.
  • Kraken occasionally sends a snapshot mid-stream without warning when their internal book reconciliation kicks off. If you see a snapshot event after subscribing, treat it as the new baseline and drop your existing book state.
  • Time synchronization on the collector matters. NTP drift of even 50 ms makes the received_at_ms minus captured_at_ms latency metric meaningless. Run chrony with multiple peers and monitor offset.
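
The Binance snapshot-versus-buffer gotcha in the first item can be sketched as a single reconciliation step. This assumes Binance Spot semantics, where each delta's U directly follows the previous delta's u; `reconcile_buffer` is a hypothetical helper name.

```python
from typing import Optional


def reconcile_buffer(last_update_id: int, buffered: list[dict]) -> Optional[list[dict]]:
    """Return the buffered deltas to apply on top of a REST snapshot.

    Returns None when the snapshot and buffer cannot be joined and a
    resync (fresh snapshot, fresh buffer) is needed.
    """
    # Drop deltas whose changes are already reflected in the snapshot.
    fresh = [d for d in buffered if d["u"] > last_update_id]
    if not fresh:
        return []  # snapshot is newer than everything buffered
    # The first remaining delta must overlap or directly follow the snapshot.
    if fresh[0]["U"] > last_update_id + 1:
        return None  # snapshot predates the buffer: there is a hole
    # The remaining deltas must be contiguous with one another.
    for prev, cur in zip(fresh, fresh[1:]):
        if cur["U"] != prev["u"] + 1:
            return None
    return fresh
```

Returning None instead of raising lets the caller decide whether to retry the snapshot immediately or back off first.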

Compliance and exchange terms of service

Most exchanges’ public market data is, by their own terms of service, freely usable for personal and commercial research. Redistributing the data as a real-time feed competing with the exchange’s institutional product is a different matter and usually requires a market data license.

Binance, Coinbase, and Kraken all explicitly permit personal trading and research use of public market data. Building a competing aggregator service that resells the data is gray area. For the typical research, alpha generation, or in-house analytics use case you are fine.

External authoritative reference: the Binance API documentation covers websocket spec and rate limits.

FAQ

Q: can I scrape order books without websockets?
Yes, you can poll the REST depth endpoint. But REST polling at 1-second cadence misses 99% of updates and consumes more rate-limit budget than the websocket equivalent. Websockets are the only sensible choice for production.

Q: how do I handle exchange downtime?
Most exchanges schedule maintenance windows in advance. Subscribe to their status APIs and stop reconnect attempts during announced windows to avoid wasting API budget. For unannounced outages, use exponential backoff capped at 5 minutes between retries.

Q: do I need to store every delta or are snapshots enough?
Depends on use case. For backtesting you want deltas because you need event-by-event playback. For analytics on spreads and depth, 1-second snapshots are usually sufficient and 100x cheaper to store.

Q: what about DEX order books?
DEXs publish state on-chain, not via websocket. You read it via RPC calls or via subgraph queries. Uniswap V3 and similar AMMs do not have order books at all; they have liquidity curves. dYdX and similar perp DEXs do have order books and expose them via gRPC. Different problem.

Q: how do I detect wash trading from order book data?
Look for orders that get placed and immediately taken by the same exchange’s matching engine, identical-size orders bouncing between two price levels, and trade volume spikes that do not correspond to depth movement. This is a deep topic; treat it as a separate downstream analysis on top of the raw scraped data.

Q: should I use a managed message broker like Kafka in front of the storage layer?
Only if you have multiple downstream consumers that need the live feed. For a single-consumer research pipeline, Kafka adds operational overhead without value. A simple in-process queue from listener to writer is enough. Kafka becomes worth the cost once you have a trading bot, an analytics dashboard, and a backtester all consuming the same feed concurrently.
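
For the single-consumer case, the in-process queue can be as small as the sketch below. All names here are hypothetical, and the listener iterates over a plain list as a stand-in for the websocket loop; `flush` would be whatever writes a batch to storage.

```python
import asyncio


async def listener(queue: asyncio.Queue, messages: list[dict]) -> None:
    for msg in messages:  # stand-in for `async for raw in ws`
        await queue.put(msg)  # blocks if the writer falls too far behind
    await queue.put(None)  # sentinel: no more messages


async def writer(queue: asyncio.Queue, batch_size: int, flush) -> None:
    batch = []
    while True:
        msg = await queue.get()
        if msg is None:
            break
        batch.append(msg)
        if len(batch) >= batch_size:
            flush(batch)
            batch = []
    if batch:
        flush(batch)  # write whatever is left at shutdown


async def pipeline(messages: list[dict], batch_size: int = 2) -> list[list[dict]]:
    queue: asyncio.Queue = asyncio.Queue(maxsize=10_000)
    flushed: list[list[dict]] = []
    await asyncio.gather(
        listener(queue, messages),
        writer(queue, batch_size, flushed.append),
    )
    return flushed
```

The bounded queue is the important design choice: when storage stalls, backpressure shows up as a slow listener you can monitor, not as unbounded memory growth.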

Q: do I need a colocation server for arbitrage research?
No. Colocation matters for execution, not research. A Tokyo VPS that gets order book data 30-50 ms after the matching engine is plenty for spotting historical opportunities and modeling spreads. Save the colocation budget for after you have validated the strategy.

Q: can a single Python process really keep up with a busy day?
With asyncio and uvloop, yes, up to about 200 simultaneous connections and several thousand messages per second. Past that, switch to Rust, Go, or split across processes per exchange. Most research workloads never hit that ceiling.

Closing

Scraping crypto order books at scale in 2026 is a tractable engineering problem if you respect the wire-level details: sequence numbers, gap recovery, snapshot resyncs, and the difference between exchange and local timestamps. The first month is the hard part. Once your collector survives a Sunday night Coinbase reconnect storm and a Binance maintenance window without corrupting state, you have a pipeline that collects data a vendor would charge thousands of dollars a month for, at the cost of a small VPS. For the broader market data infrastructure picture see our crypto-defi category hub.
