OpenAI’s Batch API cuts costs by 50% versus real-time calls — but for scraping pipelines, the decision is rarely that clean. Latency tolerance, pipeline architecture, and extraction failure rates all shift the math. If you’re routing LLM calls for web extraction in 2026, understanding when to batch versus when to stream is one of the highest-leverage optimizations available.
What the Batch API Actually Does
OpenAI’s Batch API accepts up to 50,000 requests per .jsonl file and processes them within 24 hours at half the token price. You POST a file, get a batch ID back, and poll for completion. There’s no streaming, no per-request latency SLA, and no priority queue.
For scraping specifically, this means you’re decoupling extraction from collection. Your crawler runs at full speed, writes raw HTML or JSON to a staging store, and a separate process batches those payloads into OpenAI every few hours. The extracted fields arrive asynchronously.
```python
import json

import openai

# staging_rows: raw payloads pulled from the staging store by the crawler
batch_requests = [
    {
        "custom_id": f"row-{i}",  # ties each result back to its source row
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": "Extract: title, price, sku as JSON."},
                {"role": "user", "content": row["html_snippet"]},
            ],
            "max_tokens": 256,
        },
    }
    for i, row in enumerate(staging_rows)
]

# One JSON object per line; the Batch API accepts up to 50,000 per file
with open("batch_input.jsonl", "w") as f:
    for r in batch_requests:
        f.write(json.dumps(r) + "\n")

batch_file = openai.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = openai.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
```

The 50% discount applies to both input and output tokens. On gpt-4o-mini at $0.15/1M input tokens (real-time), batch drops that to $0.075. For high-volume extraction, that’s material — but it’s not always the right choice.
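Submission returns immediately; completion is a poll loop plus a file download. A minimal sketch using the `batch` object from the block above (`batches.retrieve` and `files.content` are the corresponding calls in the 1.x Python client; the 60-second interval is an assumption, not a recommendation):

```python
import time

# Poll until the batch reaches a terminal state. Jobs run for minutes
# to hours, so poll sparingly.
while True:
    batch = openai.batches.retrieve(batch.id)
    if batch.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(60)

if batch.status == "completed":
    # output_file_id points at a .jsonl with one result (or error) per request
    content = openai.files.content(batch.output_file_id)
    with open("batch_output.jsonl", "wb") as f:
        f.write(content.read())
```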
Real-Time vs Batch: The Decision Matrix
The core tradeoff is pipeline coupling. Real-time fits when downstream systems need extracted data before the next step can run. Batch fits when extraction is a post-processing step on already-collected data.
| Dimension | Real-Time API | Batch API |
|---|---|---|
| Latency | 1-10s per call | Up to 24h turnaround |
| Cost (gpt-4o-mini) | $0.15/$0.60 per 1M tokens | $0.075/$0.30 per 1M tokens |
| Rate limits | Standard (tier-dependent) | Separate 2M token/day batch quota |
| Failure handling | Retry per request | Failed rows in output file |
| Use case fit | Live monitoring, triggered alerts | Bulk historical extraction |
| Streaming support | Yes | No |
If your scraping feeds a live dashboard — price monitoring, job board alerting, inventory tracking — batch is the wrong tool regardless of the cost savings. If you’re doing a one-time extraction pass on 200,000 product pages you’ve already crawled, batch is almost always correct.
Choosing the right model for each call type matters as much as real-time vs batch. The routing logic covered in Building an LLM Model Router for Scraping: Cheap vs Smart Trade-Offs (2026) shows how to classify extraction difficulty before routing — that classification layer applies to both real-time and batch queues.
Where Batch Breaks Down for Scrapers
Three failure modes that trip up scraping pipelines specifically:
- 24-hour SLA is a soft ceiling, not a guarantee. During peak hours OpenAI batch queues can back up. If your pipeline expects results within a business day, add buffer or fall back to real-time for time-sensitive subsets.
- Partial failures are silent by default. Batch output files contain a mix of successes and errors. You must parse every row’s `error` field — a 2% failure rate across 50,000 rows is 1,000 silently missing extractions (see the audit sketch after this list).
- Token counting is your responsibility. Batch files over the per-request token limit fail at the row level, not at submission. Pre-tokenize your HTML snippets with `tiktoken` in one pass before building the `.jsonl`.
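The second failure mode is cheap to guard against: bucket every output row by `custom_id` before anything downstream consumes the file. A minimal audit sketch, assuming the output has been downloaded to `batch_output.jsonl` (the file name is illustrative):

```python
import json

succeeded, failed = {}, {}
with open("batch_output.jsonl") as f:
    for line in f:
        row = json.loads(line)
        resp = row.get("response") or {}
        # A row can fail via an explicit error or a non-200 per-row response
        if row.get("error") or resp.get("status_code") != 200:
            failed[row["custom_id"]] = row.get("error")
        else:
            succeeded[row["custom_id"]] = resp["body"]

print(f"{len(succeeded)} ok, {len(failed)} failed")
# Failed custom_ids can be re-queued for a real-time retry pass.
```

For the third, a one-pass token check before the file is built keeps over-length rows out of the submission entirely. A sketch, assuming `tiktoken` maps gpt-4o-mini to its encoding in your installed version; the budget is illustrative, not an official limit:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o-mini")
MAX_INPUT_TOKENS = 8_000  # illustrative per-row budget, tune to your prompt

# Filter out staging rows whose HTML would fail at the row level
valid_rows = [
    r for r in staging_rows
    if len(enc.encode(r["html_snippet"])) <= MAX_INPUT_TOKENS
]
```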
A related optimization: if your HTML has stable structure (same e-commerce template, same job board layout), prompt caching can stack on top of batch pricing. Caching LLM Responses for Scrapers: Hit-Rate Patterns That Save 70% (2026) covers the cache key design patterns that work best here. Anthropic’s approach is structurally different — Anthropic Prompt Caching for Scraping Workflows: Real-World Savings (2026) benchmarks it against OpenAI’s implicit caching with real extraction workloads.
Cost Modeling for Real Workloads
Rough numbers for a 100,000-page product extraction job using gpt-4o-mini, assuming 800 input tokens and 200 output tokens per page:
- Total input tokens: 80M
- Total output tokens: 20M
- Real-time cost: (80 × $0.15) + (20 × $0.60) = $12 + $12 = $24
- Batch cost: (80 × $0.075) + (20 × $0.30) = $6 + $6 = $12
$12 saved per 100K pages. At scale that compounds fast. A crawler doing 5M pages/month saves ~$600/month on extraction alone, without any model or prompt changes.
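The same arithmetic as a reusable helper, handy for comparing providers before committing a crawl (a sketch; prices are dollars per 1M tokens):

```python
def extraction_cost(pages, in_tokens, out_tokens, in_price, out_price):
    """Total extraction cost in dollars; prices are per 1M tokens."""
    return pages * (in_tokens * in_price + out_tokens * out_price) / 1_000_000

realtime = extraction_cost(100_000, 800, 200, 0.15, 0.60)   # 24.0
batched  = extraction_cost(100_000, 800, 200, 0.075, 0.30)  # 12.0
```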
For a full per-token comparison across providers including Gemini Flash, Claude Haiku, and Mistral, Scraping Cost Per Token 2026: Comparing 9 LLMs for Web Extraction has the updated benchmark table. OpenAI batch is competitive but not always cheapest — Gemini Flash 2.0’s pricing undercuts it on input tokens even without a batch discount.
Architecting a Hybrid Pipeline
Most production scraping systems end up using both modes. A practical split:
- Real-time: triggered extractions (new listings, price drops, monitoring alerts), anything feeding a live UI or webhook
- Batch: bulk historical crawls, re-extraction passes when prompts change, enrichment jobs on existing datasets
The staging layer between crawler and extractor is the key piece. Write raw HTML to S3 or ClickHouse staging, tag each row with a priority flag, and route high-priority rows to real-time while low-priority rows accumulate for batch submission. For how a ClickHouse-backed pipeline handles this at the infrastructure level, Scraping to ClickHouse: Real-Time Analytics Pipeline for Web Data (2026) walks through the schema and ingestion patterns.
A minimal routing config looks like this:
```python
def route_extraction(row):
    # Fresh or high-priority rows go to the real-time API; everything
    # else accumulates in staging for the next batch submission.
    if row["priority"] == "high" or row["age_seconds"] < 300:
        return "realtime"
    return "batch"
```

Keep the logic simple. Over-engineering the router adds latency to the decision itself.
Bottom Line
Use the Batch API for any extraction workload that can tolerate multi-hour latency — it’s a straightforward 50% cut with no quality tradeoff. Reserve real-time for live monitoring and triggered pipelines where freshness matters. The hybrid approach is almost always correct at scale: batch as the default, real-time as the exception. DRT covers cost and architecture tradeoffs like this across the full LLM-for-scraping stack, with real numbers rather than vendor benchmarks.
Related guides on dataresearchtools.com
- Building an LLM Model Router for Scraping: Cheap vs Smart Trade-Offs (2026)
- Scraping Cost Per Token 2026: Comparing 9 LLMs for Web Extraction
- Caching LLM Responses for Scrapers: Hit-Rate Patterns That Save 70% (2026)
- Anthropic Prompt Caching for Scraping Workflows: Real-World Savings (2026)
- Pillar: Scraping to ClickHouse: Real-Time Analytics Pipeline for Web Data (2026)