JSON-LD structured data is one of the cleanest extraction targets on the modern web. Sites embed Schema.org markup directly in `<script type="application/ld+json">` tags, giving you machine-readable product prices, review counts, event dates, and breadcrumb paths without touching a single rendered DOM element. At scale, scraping JSON-LD structured data is faster, cheaper, and far more stable than CSS selector scraping -- but there are real gotchas around multi-block pages, nested types, and sites that inject it client-side.
## Why JSON-LD Is Worth Targeting First
Schema.org adoption has grown significantly. As of 2026, roughly 44% of pages indexed by Google carry at least one JSON-LD block, concentrated in e-commerce, local business, recipes, events, and job listings. These are exactly the categories where structured data is most commercially valuable.
The extraction path is simple: parse the HTML, find all `<script type="application/ld+json">` tags, deserialize the JSON, and filter by `@type`. You skip layout changes, class renames, and A/B test variants entirely. Unlike scraping rendered state from a React or Vue SPA, JSON-LD usually lives in the raw HTML response, which means you can run it through a static HTTP fetcher without a headless browser -- and that cuts per-page cost by an order of magnitude.
The catch: some sites, particularly SPAs using Next.js or Nuxt with client-side hydration, inject the JSON-LD block post-render. You will not find it in the raw response. Test a sample of your targets with `curl` before committing to a headless-free pipeline.
## Extraction Patterns in Python
For static pages, BeautifulSoup or lxml is enough. Here is a tight extractor that handles multiple JSON-LD blocks per page (common on product pages that embed both Product and BreadcrumbList):
```python
import json

import httpx
from bs4 import BeautifulSoup


def extract_jsonld(url: str) -> list[dict]:
    r = httpx.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    soup = BeautifulSoup(r.text, "lxml")
    blocks = []
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(tag.string or "")
            if isinstance(data, list):
                blocks.extend(data)
            else:
                blocks.append(data)
        except json.JSONDecodeError:
            continue
    return blocks


def filter_by_type(blocks: list[dict], schema_type: str) -> list[dict]:
    return [b for b in blocks if schema_type in str(b.get("@type", ""))]
```
One subtlety: `@type` can be a string, a list, or occasionally nested inside a `@graph` array. The `str()` cast above is intentionally broad -- for production use, recurse into `@graph` explicitly.
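Handling `@graph` and list-valued `@type` explicitly can be sketched with two small helpers (a minimal sketch, separate from the extractor above; the function names are illustrative):

```python
def flatten_graph(block: dict) -> list[dict]:
    """Expand a JSON-LD block into its top-level entities.

    If the block is a @graph container, return the entities inside it;
    otherwise return the block itself as a one-element list.
    """
    graph = block.get("@graph")
    if isinstance(graph, list):
        return [node for node in graph if isinstance(node, dict)]
    return [block]


def has_type(block: dict, schema_type: str) -> bool:
    """Match @type whether it is a single string or a list of strings."""
    t = block.get("@type", [])
    types = [t] if isinstance(t, str) else t
    return schema_type in types
```

Run every extracted block through `flatten_graph` first, then filter with `has_type`; this avoids the false positives the broad `str()` match can produce (e.g. `"Product"` matching `"ProductGroup"`).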
For sites that require JavaScript rendering, Playwright with `page.content()` after `networkidle` is the reliable path. This mirrors patterns covered in Scraping WebSocket-Based Apps: Patterns for Real-Time Data (2026), where deferred content loads after the initial HTML is served.
## Schema Types and What You Actually Get
The most commercially useful Schema.org types by extraction ROI:
| `@type` | Key fields | Common use case |
|---|---|---|
| `Product` | `name`, `offers.price`, `offers.availability` | Price monitoring |
| `LocalBusiness` | `address`, `telephone`, `openingHours` | Business intelligence |
| `Review` / `AggregateRating` | `ratingValue`, `reviewCount` | Sentiment signals |
| `Event` | `startDate`, `location`, `offers.price` | Event data feeds |
| `JobPosting` | `title`, `baseSalary`, `datePosted` | Job market analytics |
| `BreadcrumbList` | `itemListElement` | Site taxonomy mapping |
Product pages are the most valuable and the most defended. Major marketplaces aggressively block scrapers regardless of which part of the page you extract, so rotating residential proxies are non-negotiable for volume runs. Static data types like LocalBusiness on independent sites are far more permissive.
## Handling Scale and Storage
A pipeline extracting JSON-LD from 500,000 pages per day needs to handle schema drift gracefully. Fields are optional, types nest arbitrarily, and publishers mis-implement the spec constantly -- `"price": "$29.99"` instead of a plain number, a missing `@context`, or `"@type": ["Product", "Thing"]` arrays.
This is where schema-less storage wins over relational tables. Scraping to MongoDB: Schema-Less Storage for Variable Web Data covers the design pattern in depth, but the short version: store the raw JSON-LD block as-is alongside a normalized subset of fields you actually query. You can always re-derive the structured fields; you cannot recover dropped data.
For feed-style sources that publish structured event or article data on a schedule, combining JSON-LD extraction with RSS and Atom Feed Aggregation at Scale 2026: Tooling and Rate Patterns gives you a low-cost polling loop where you only re-scrape pages that have actually changed.
## Anti-Bot Considerations and Rendering Edge Cases
Most anti-bot systems target behavioral signals, not payload content. JSON-LD scraping via plain HTTP requests is inherently low-fingerprint -- no canvas, no WebGL, no mouse events -- which reduces detection surface considerably.
What gets you blocked anyway:
- Request rate above the site's bot threshold (start at 1-2 req/s per domain, back off on 429s)
- Missing or obviously non-browser headers (`Accept`, `Accept-Language`, `Referer`)
- IP reputation (datacenter ranges get flagged immediately on high-value targets)
- Cookie requirements on sites behind a Cloudflare JS challenge or similar gate
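The back-off rule from the first bullet reduces to a small pure function (a sketch; the base delay and cap are illustrative choices, and full jitter is one of several reasonable strategies):

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter for 429/503 responses.

    attempt is 0-based: 0 -> up to 1s, 1 -> up to 2s, 2 -> up to 4s, ...
    The delay is capped so repeated failures never stall a worker forever.
    """
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

Sleep for `backoff_delay(attempt)` before each retry on a 429 or 503, and reset the attempt counter on the first successful response.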
For sites using HTMX-powered partial page updates, the JSON-LD block is typically in the initial HTML response, not in the HTMX swap fragment -- so static fetching still works, but you need the full-page URL, not the fragment endpoint.
On sites that stream content updates, checking for JSON-LD in Server-Sent Events (SSE) Streams: Live Data Patterns (2026) is rarely useful since SSE payloads are almost never Schema.org-formatted -- but knowing that the page uses SSE explains why a static fetch returns incomplete markup.
A numbered checklist for a production-ready JSON-LD pipeline:

1. Baseline with `curl` to determine whether the target embeds JSON-LD in raw HTML or requires JS rendering
2. Build a block extractor that handles both single objects and `@graph` arrays
3. Validate `@type` before writing to storage -- filter noise early
4. Store raw JSON alongside normalized fields to survive schema changes
5. Set per-domain rate limits and back off exponentially on 429/503 responses
6. Monitor field completeness weekly -- publishers update their markup without notice
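The completeness monitoring in the last item can start as a per-field presence ratio over a day's extracted blocks (a minimal sketch; which fields you track is your choice, and the emptiness check is deliberately simple):

```python
def field_completeness(blocks: list[dict], fields: list[str]) -> dict[str, float]:
    """Fraction of blocks in which each field is present and non-empty."""
    if not blocks:
        return {f: 0.0 for f in fields}
    return {
        f: sum(1 for b in blocks if b.get(f) not in (None, "", [])) / len(blocks)
        for f in fields
    }
```

A sudden drop in one field's ratio week-over-week is usually the first visible sign that a publisher changed its markup.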
## Bottom line
JSON-LD is the highest-signal, lowest-friction extraction target in modern web scraping when you pick the right site categories. Prioritize static pages serving Product, LocalBusiness, or JobPosting types, build a schema-flexible storage layer, and keep your rendering budget for the minority of targets that actually need it. DRT covers this class of structured-protocol scraping regularly because it represents where the real volume and reliability gains are in 2026.
## Related guides on dataresearchtools.com
- Scraping WebSocket-Based Apps: Patterns for Real-Time Data (2026)
- Scraping Server-Sent Events (SSE) Streams: Live Data Patterns (2026)
- RSS and Atom Feed Aggregation at Scale 2026: Tooling and Rate Patterns
- Scraping HTMX-Powered Sites: Different from React/Vue Patterns (2026)
- Pillar: Scraping to MongoDB: Schema-Less Storage for Variable Web Data