LLM extraction patterns: structured output from messy HTML
Structured LLM extraction is the workhorse of modern scraping pipelines. Once the browser layer has rendered a page and given you HTML, the question is how to turn that messy DOM into clean JSON that your warehouse can ingest. In 2026 every major LLM provider ships strict JSON Schema mode, so the question is no longer “can I get JSON” but “what schema, what prompt, what model, what cost”.
This guide is the playbook. We cover schema design, prompt patterns, validation, retry strategy, cost control, and the model selection matrix across OpenAI, Anthropic, Google, and the open-source contenders. Every pattern is from production usage in 2026.
Why structured output matters
Three reasons.
First, downstream systems need typed data. A price field that is sometimes a number and sometimes a string with a currency symbol breaks every dashboard. Schema enforcement at the LLM boundary kills this class of bug.
Second, structured output is dramatically cheaper than freeform extraction over time. Freeform output requires post-processing logic that drifts with each new page format. Structured output forces the LLM to do the work once.
Third, structured output is the only path to reliable agentic loops. An agent that returns JSON can chain into the next step. An agent that returns prose breaks pipelines.
JSON Schema the right way
The single most common mistake in LLM extraction is loose schemas. A {"price": {"type": "number"}} field looks fine until the model omits the field entirely, validation passes (because nothing was listed in required), and your pipeline writes a row of garbage.
The right pattern is strict, required, and bounded.
schema = {
"type": "object",
"properties": {
"title": {"type": "string", "minLength": 1, "maxLength": 500},
"price": {"type": "number", "minimum": 0, "maximum": 1000000},
"currency": {"type": "string", "pattern": "^[A-Z]{3}$"},
"in_stock": {"type": "boolean"},
"sku": {"type": ["string", "null"]},
},
"required": ["title", "price", "currency", "in_stock", "sku"],
"additionalProperties": False,
}
additionalProperties: False and complete required lists are not optional. They are how you stop the model from inventing fields or skipping ones that should be present.
OpenAI’s Structured Outputs (GA in 2024) enforces strict schemas at decode time, and Anthropic’s tool use (which doubles as structured output) accepts the same JSON Schema shape. Google Gemini supports JSON mode with a similar shape via the responseSchema parameter.
Strict mode in OpenAI
from openai import AsyncOpenAI
import json
client = AsyncOpenAI()
async def extract_product(html: str) -> dict:
resp = await client.chat.completions.create(
model="gpt-4o-mini",
response_format={
"type": "json_schema",
"json_schema": {
"name": "product",
"schema": schema,
"strict": True,
},
},
messages=[
{"role": "system", "content": (
"Extract product data from the HTML. If a field is not present, "
"use null for sku. All other fields are required."
)},
{"role": "user", "content": html[:200000]},
],
)
return json.loads(resp.choices[0].message.content)
strict: True constrains the decoder so the model literally cannot output invalid JSON. Compliance is enforced at the token level. This is the gold standard.
Tool use in Anthropic
from anthropic import AsyncAnthropic
import json
client = AsyncAnthropic()
async def extract_product(html: str) -> dict:
resp = await client.messages.create(
model="claude-sonnet-4-5-20250929",
max_tokens=2000,
tools=[{
"name": "save_product",
"description": "Save the extracted product record",
"input_schema": schema,
}],
tool_choice={"type": "tool", "name": "save_product"},
messages=[{
"role": "user",
"content": f"Extract product data from this HTML:\n\n{html[:200000]}",
}],
)
return resp.content[0].input
tool_choice: tool forces Claude to call the tool, which is how you guarantee structured output. The input_schema is JSON Schema; Claude follows it closely, but conformance is not token-level enforced the way OpenAI’s strict mode is, so validate the result before writing it anywhere.
Prompt design for extraction
The system prompt matters more than people realize. Three rules from production.
First, name the entity explicitly. “Extract the product” is better than “extract structured data”. The model anchors on entity type.
Second, specify what to do when fields are missing. “Use null if not present” beats letting the model guess.
Third, if the HTML contains multiple candidates (multiple products, related items, ads), tell the model which one to extract. “The main product on this page” beats letting the model decide.
A solid system prompt template:
You are a precise data extractor. Extract the {entity} from the provided HTML.
Rules:
- Use the schema exactly. Do not add fields. Do not skip required fields.
- For missing fields, use null only if explicitly allowed.
- The entity to extract is: {entity_description}.
- Ignore related items, recommendations, advertisements, and footer content.
- Numeric fields must be numbers, not strings. Strip currency symbols and commas.
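To keep that template consistent across sites, it helps to hold it in code and fill it per entity. A minimal sketch (the constant and function names are illustrative, not from any library):

```python
# Hypothetical helper: render the extraction system prompt for a given entity.
SYSTEM_TEMPLATE = """You are a precise data extractor. Extract the {entity} from the provided HTML.
Rules:
- Use the schema exactly. Do not add fields. Do not skip required fields.
- For missing fields, use null only if explicitly allowed.
- The entity to extract is: {entity_description}.
- Ignore related items, recommendations, advertisements, and footer content.
- Numeric fields must be numbers, not strings. Strip currency symbols and commas."""


def build_system_prompt(entity: str, entity_description: str) -> str:
    return SYSTEM_TEMPLATE.format(entity=entity, entity_description=entity_description)
```

Centralizing the template means a wording fix propagates to every extractor at once instead of drifting per site.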
Gemini structured output
Google’s pattern uses responseSchema directly on the generation config:
import json
import google.generativeai as genai
# assumes genai.configure(api_key=...) has already been called
model = genai.GenerativeModel("gemini-1.5-flash-002")
resp = model.generate_content(
f"Extract the product from this HTML:\n\n{html[:1000000]}",
generation_config={
"response_mime_type": "application/json",
"response_schema": schema,
},
)
data = json.loads(resp.text)
Gemini’s response_schema accepts an OpenAPI-style subset of JSON Schema, so some keywords (such as additionalProperties) may need adjusting, but the semantics are close to OpenAI’s strict mode. The 2-million-token context window of Gemini 1.5 Pro is the only place where you can pass an entire site’s product catalog HTML as a single extraction call.
Pre-processing HTML
Sending raw 800KB HTML to the model wastes tokens and confuses extraction. Trim aggressively before extraction.
from bs4 import BeautifulSoup
import re
def trim_html(html: str) -> str:
soup = BeautifulSoup(html, "html.parser")
# remove scripts (except JSON-LD), styles, navigation, footer
for tag in soup(["style", "nav", "footer", "header", "iframe", "noscript"]):
tag.decompose()
for tag in soup("script"):
if tag.get("type") != "application/ld+json":
tag.decompose()
text = str(soup)
text = re.sub(r"\n\s*\n+", "\n\n", text)
return text[:200000]
For ecommerce, JSON-LD Product markup is gold. Many sites embed full product data in <script type="application/ld+json"> and you can extract it with zero LLM cost.
import json
from bs4 import BeautifulSoup
def try_jsonld_product(html: str) -> dict | None:
soup = BeautifulSoup(html, "html.parser")
for script in soup.find_all("script", type="application/ld+json"):
try:
data = json.loads(script.string or "")
if isinstance(data, dict) and data.get("@type") == "Product":
return data
if isinstance(data, list):
for item in data:
if isinstance(item, dict) and item.get("@type") == "Product":
return item
except json.JSONDecodeError:
continue
return None
Always try JSON-LD first. Fall back to LLM extraction only if it fails.
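Wiring the two paths together, the dispatcher can take the fast path and the fallback as callables. A sketch; in production the arguments would be try_jsonld_product and the trimmed-HTML LLM extractor above:

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def extract(
    html: str,
    try_jsonld: Callable[[str], Optional[dict]],
    llm_extract: Callable[[str], Awaitable[dict]],
) -> dict:
    """Try embedded JSON-LD first; fall back to LLM extraction only if it fails."""
    data = try_jsonld(html)
    if data is not None:
        return data  # zero-cost path: structured data was already on the page
    return await llm_extract(html)
```

Taking the functions as parameters keeps the routing logic testable without hitting any LLM API.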
Model selection matrix
For extraction specifically (not full agent loops):
| Model | Cost per 1k extractions | Quality on messy HTML | Best fit |
|---|---|---|---|
| GPT-4o-mini | $0.30 | High | Default for high-volume |
| GPT-4o | $5.00 | Highest | Hard cases, large schemas |
| Claude Haiku 3.5 | $0.40 | High | Default if Anthropic-native |
| Claude Sonnet 4.5 | $5.50 | Highest | Hard cases, large schemas |
| Gemini 1.5 Flash | $0.20 | High | Cost-sensitive volume |
| Gemini 1.5 Pro | $3.50 | Highest | Long context (2M tokens) |
| Llama 3.3 70B (self-host) | $0.05 | Medium-high | Privacy-critical |
| Qwen 2.5 72B (self-host) | $0.05 | Medium-high | Asia language pages |
GPT-4o-mini is the default pick in 2026 for English-language extraction at scale. It is cheap enough that you stop optimizing prompts to save tokens. Claude Haiku is the default pick if your stack is Anthropic-native. Gemini Flash is the cheapest of the strong options.
For multilingual extraction (Thai, Indonesian, Korean, Vietnamese), Gemini Pro and Claude Sonnet outperform GPT-4o on local-language pages. Anthropic and Google both invested heavily in Asian language quality through 2025.
Validation and retry
Schema enforcement is necessary but not sufficient. The model can return schema-valid garbage. Validate semantically.
from pydantic import BaseModel, Field, field_validator

class Product(BaseModel):
    title: str = Field(min_length=1, max_length=500)
    price: float = Field(gt=0, lt=1_000_000)
    currency: str = Field(pattern=r"^[A-Z]{3}$")
    in_stock: bool
    sku: str | None = None

    @field_validator("title")
    @classmethod
    def title_not_placeholder(cls, v: str) -> str:
        if v.lower() in ("loading", "untitled", "n/a", "..."):
            raise ValueError("placeholder title")
        return v
On validation failure, retry with a different model (escalate from 4o-mini to 4o) or with a hint in the prompt (“the previous extraction had price=0 which is invalid; try harder to find the actual price”).
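The escalation loop can be written once, model-agnostically, with the extractors and the validation check passed in as callables. A sketch; in practice the list would wrap gpt-4o-mini first and gpt-4o second:

```python
import asyncio
from typing import Awaitable, Callable


async def extract_with_escalation(
    html: str,
    extractors: list[Callable[[str], Awaitable[dict]]],
    validate: Callable[[dict], bool],
) -> dict:
    """Run extractors cheapest-first; return the first result that passes validation."""
    for extractor in extractors:
        result = await extractor(html)
        if validate(result):
            return result
        # validation failed: fall through to the next (stronger, pricier) model
    raise RuntimeError("all extractors failed validation")
```

Most pages succeed on the first extractor, so the average cost stays close to the cheap model's price.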
Two-pass extraction for hard pages
For very messy HTML, a two-pass extraction often beats a single-pass attempt.
Pass one: ask the model to find and extract just the relevant region.
Pass two: ask the model to extract the structured fields from that region.
async def two_pass_extract(html: str) -> dict:
# pass 1: locate
locate = await client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": "Find the main product section in the HTML and return only that section's HTML."},
{"role": "user", "content": html[:200000]},
],
)
region = locate.choices[0].message.content
# pass 2: extract
return await extract_product(region)
This pattern doubles cost but cuts noise enough that quality on hard pages goes up by 10-20 percent.
Adding context to the prompt
When the model is missing context (you know the page is about wireless mice, the model has to guess), supply it.
async def extract_with_context(html: str, hint: dict) -> dict:
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_schema", "json_schema": {"name": "product", "schema": schema, "strict": True}},
        messages=[
            {"role": "system", "content": "Extract the product."},
            {"role": "user", "content": f"Page context: {hint}\n\nHTML:\n{html[:200000]}"},
        ],
    )
    return json.loads(resp.choices[0].message.content)
Hint can include the URL, the breadcrumb category, the expected currency. The model uses these to disambiguate.
Few-shot examples in the prompt
For new sites where extraction quality is initially poor, two or three labeled examples in the prompt boost accuracy by 10 to 25 percent.
async def extract_with_examples(html: str, examples: list[tuple[str, dict]]) -> dict:
example_text = "\n\n".join(
f"Example HTML:\n{ex_html[:5000]}\nExtracted: {json.dumps(ex_data)}"
for ex_html, ex_data in examples
)
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_schema", "json_schema": {"name": "x", "schema": schema, "strict": True}},
        messages=[
            {"role": "system", "content": (
                "Extract the product. Here are examples of correct extractions:\n\n"
                + example_text
            )},
            {"role": "user", "content": html[:200000]},
        ],
    )
    return json.loads(resp.choices[0].message.content)
The cost increase is the example tokens (a few thousand input tokens) versus accuracy gains. Worth it for any site you scrape regularly.
Cost control patterns
Three patterns that cut extraction cost without sacrificing quality.
Cache by content hash. Hash the trimmed HTML. If you have seen it before, reuse the prior extraction. For sites that change rarely, this is huge.
Schema-first model selection. Start with the cheapest model. Validate. Escalate to a stronger model only on failure. Most pages succeed on the cheap path.
Sample then scale. For new sites, run 10 pages on the strong model and 10 pages on the cheap model. If results match, scale on cheap. If they diverge, stay on strong.
Comparison to other extraction patterns
| Pattern | Cost per 1k pages | Setup time | Adaptability |
|---|---|---|---|
| Hand-written CSS selectors | $0 | 4 hours per site | Low |
| XPath with auto-discovery | $0 | 1 hour per site | Low |
| LLM with strict schema | $0.30-$5 | 30 minutes per schema | High |
| Vision model (page screenshot) | $5-$20 | 30 minutes | Highest |
For a deeper look at vision-model extraction, see our scraping with vision models guide.
Real-world benchmark across 1000 product pages
We ran the same extraction across 1000 mixed product pages from Lazada, Shopee, Amazon, Best Buy, and Mercado Libre. Schema enforced strict; pre-processing applied. Numbers from March 2026:
| Model | Accuracy | Cost per 1000 pages | p50 latency |
|---|---|---|---|
| GPT-4o-mini | 96.4% | $0.30 | 1.2 s |
| GPT-4o | 98.1% | $5.20 | 1.8 s |
| Claude Haiku 3.5 | 95.7% | $0.45 | 1.4 s |
| Claude Sonnet 4.5 | 98.4% | $5.80 | 2.1 s |
| Gemini 1.5 Flash | 95.2% | $0.22 | 1.1 s |
| Gemini 1.5 Pro | 97.6% | $3.60 | 2.4 s |
| Llama 3.3 70B (vLLM) | 91.2% | $0.06 | 0.9 s |
Headline: GPT-4o-mini at $0.30 per 1000 pages with 96.4 percent accuracy is the value pick. Sonnet 4.5 wins on accuracy but the 19x cost is rarely justified unless the data is high-stakes.
The Llama row is the surprise. Self-hosted Llama 3.3 70B on a single H100 reaches 91 percent accuracy at one-fifth the cost. For high-volume teams with privacy requirements, this is the right pick despite the lower ceiling.
Storing extracted data
Extracted records should land in a typed schema. Postgres with JSONB plus extracted columns is the production pattern.
CREATE TABLE extractions (
id BIGSERIAL PRIMARY KEY,
source_url TEXT NOT NULL,
extracted_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
title TEXT NOT NULL,
price NUMERIC(12,2) NOT NULL,
currency CHAR(3) NOT NULL,
in_stock BOOLEAN NOT NULL,
raw_jsonb JSONB NOT NULL
);
Raw JSONB preserves the full extraction for reprocessing if your schema evolves. Typed columns give you the indices and analytics performance.
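The dual write (typed columns plus the raw record) looks like this, sketched with stdlib sqlite3 for illustration; in production substitute psycopg and a real JSONB column:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE extractions (
        id INTEGER PRIMARY KEY,
        source_url TEXT NOT NULL,
        title TEXT NOT NULL,
        price REAL NOT NULL,
        currency TEXT NOT NULL,
        in_stock INTEGER NOT NULL,
        raw_json TEXT NOT NULL  -- JSONB in Postgres
    )
""")

# Example record shape matching the extraction schema (values are illustrative).
record = {"title": "USB-C Hub", "price": 39.99, "currency": "USD", "in_stock": True, "sku": "HUB-7"}
conn.execute(
    "INSERT INTO extractions (source_url, title, price, currency, in_stock, raw_json)"
    " VALUES (?, ?, ?, ?, ?, ?)",
    ("https://example.com/p/1", record["title"], record["price"], record["currency"],
     int(record["in_stock"]), json.dumps(record)),
)
```

The typed columns serve queries and indices; the raw column is what lets you re-derive new columns later without re-scraping.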
Multi-entity extraction
Many pages contain multiple records (a search results page with 30 products, a job board with 50 listings). Two patterns work.
Pattern A, single call with array schema. Wrap the entity in an array.
schema = {
"type": "object",
"properties": {
"results": {
"type": "array",
"items": product_schema,
"minItems": 1,
"maxItems": 50,
}
},
"required": ["results"],
"additionalProperties": False,
}
The model returns all matches in one call. Cheaper than N calls but loses partial-success granularity if extraction fails midway.
Pattern B, find then extract. First call locates the items (returns N HTML snippets), second call (per snippet) extracts the structured fields. More expensive but more reliable on long pages.
For listings under 30 items, pattern A is fine. Above 30, pattern B wins because the single-call approach tends to lose items near the end of the response.
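With pattern B, the per-snippet results still need merging, since overlapping snippets can yield duplicates. A small dedupe helper (keying on sku is an assumption carried over from the product schema above):

```python
def merge_multi_entity(results: list[dict], key: str = "sku") -> list[dict]:
    """Merge per-snippet extraction results, deduping on a key field.
    Records with a missing key are kept as-is, since they cannot be matched."""
    seen: set[str] = set()
    merged: list[dict] = []
    for record in results:
        k = record.get(key)
        if k is None or k not in seen:
            merged.append(record)
            if k is not None:
                seen.add(k)
    return merged
```

First occurrence wins here; if snippets can disagree, prefer the extraction from the larger snippet instead.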
Schema evolution
Production extraction schemas evolve. Adding a field is easy; the model just produces nulls until you start populating. Removing a field is harder because old data still has it. Renaming a field requires migration.
Two practices that prevent pain:
Version your schema explicitly with a schema_version field. Old records keep their version; new records get the new one. Your warehouse can handle both.
Never delete fields. Mark them deprecated and stop reading them; old records keep the field, and new extractions simply leave it null.
from typing import Literal, Optional
from pydantic import BaseModel

class ProductV3(BaseModel):
schema_version: Literal["3.0"] = "3.0"
title: str
price: float
currency: str
in_stock: bool
sku: Optional[str]
# NEW in v3
primary_image_url: Optional[str] = None
# DEPRECATED in v3 (kept for back-compat reads)
seller_name: Optional[str] = None
The result: schema changes never break the warehouse, and you can re-extract historical data on the new schema lazily.
Production observability
Log every extraction with: source URL, model used, tokens consumed, schema name and version, validation result, retry count. This data lets you spot model regressions, cost spikes, and pages that consistently fail.
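One JSON line per extraction attempt is enough to power all three analyses. A sketch; the field names are a suggestion, not a standard:

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ExtractionLog:
    """One structured log line per extraction attempt."""
    source_url: str
    model: str
    input_tokens: int
    output_tokens: int
    schema_name: str
    schema_version: str
    valid: bool
    retry_count: int
    ts: float


def log_extraction(entry: ExtractionLog) -> str:
    # Emit one JSON line; ship it to whatever log pipeline you already run.
    return json.dumps(asdict(entry))
```

Aggregating these lines by model and schema_version is how regressions show up within hours instead of weeks.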
When to skip the LLM entirely
Three scenarios where the LLM is overkill:
Site embeds JSON-LD Product markup. Already structured, parseable in 5 lines. No LLM needed.
Site has a public API (or an obvious internal one). Hit the API directly.
Site has stable selectors that have not changed in 12 months. A traditional Playwright selector script costs nothing per page.
The LLM is the right tool when the data is in messy HTML with no machine-readable alternative and the page format changes often enough that selector maintenance is expensive.
Frequently asked questions
Why does the model sometimes return null when the field is clearly on the page?
Three causes: schema allows null (tighten it), the prompt allows guessing (forbid it), or the relevant region was trimmed off (trim less aggressively).
How do I extract from non-English pages?
Add a language hint to the system prompt (“the page is in Thai”). Use a model with strong multilingual training. Gemini Pro and Claude Sonnet outperform GPT-4o on Asian languages in 2026.
Can I extract from PDFs, images, or videos?
Yes for PDFs (most LLM APIs accept PDFs directly). Yes for images (vision models). Videos require frame extraction first.
How do I handle nested or repeated entities (a list of variants on a product page)?
Use array fields in the schema with items as object schemas. The model handles arbitrary length cleanly.
Should I use one schema per site or one global schema?
Global schema with optional fields is the production pattern. Per-site schemas explode in maintenance cost.
How do I handle currency conversion in extraction?
Extract the original currency and price as the model sees them. Convert to a canonical currency in a downstream step using a daily FX rate snapshot. Mixing currency conversion into the extraction prompt makes the model less reliable.
How do I extract dates and times reliably?
Use a string field with format: "date-time" (ISO 8601). Add a system prompt instruction “convert all dates to ISO 8601 in UTC”. The model handles timezone conversion better than most teams expect.
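The post-extraction normalization step is short with the stdlib. A sketch that assumes the model already emitted ISO-8601-ish strings, as the prompt instructs:

```python
from datetime import datetime, timezone


def to_utc_iso(raw: str) -> str:
    """Normalize an ISO-8601-ish timestamp string to UTC ISO 8601."""
    # fromisoformat on older Pythons does not accept a trailing 'Z'
    dt = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assume UTC when no offset is given
    return dt.astimezone(timezone.utc).isoformat()
```

Keeping this in code rather than in the prompt means a timezone bug is a one-line fix instead of a re-extraction.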
Is JSON mode the same as strict structured output?
No. JSON mode just guarantees the output parses as JSON. Strict structured output guarantees it matches your schema. Always prefer strict.
How do I extract from sites that change schemas often?
Use a “best effort” outer schema with a free-form additional_data object field that captures whatever the model finds beyond the strict fields, stored as JSONB downstream. This is how you keep extracting useful data through schema drift.
Can I extract relationships (this product is a variant of that product)?
Yes. Add a parent_sku field. Or for richer graphs, run a separate relationship extraction pass after collecting the entities.
Common production gotchas
A few patterns bite repeatedly:
The model returns a price like 1,299.99 as a string because the page showed it that way. Schema validation should reject strings in number fields, and the prompt should explicitly tell the model to strip commas and currency symbols.
For very long pages (over 200k chars), you exceed the context window. Pre-trim aggressively or chunk the page and run the extraction per chunk, merging results.
Caching by URL alone misses content updates. Cache by content hash of the trimmed HTML, not URL.
The additionalProperties: False constraint occasionally rejects model output that included a useful extra field. Decide consciously whether you want strictness (reject) or flexibility (allow and ignore).
Validation libraries differ in date handling. Pydantic v2’s date parser is stricter than v1. Pin the version.
For more patterns on the AI extraction stack, see the AI data collection category.