Building a self-healing scraper with LLM repair loops
A self-healing scraper in 2026 is the architecture pattern that finally answers the eternal pain of selector rot. Every scraping engineer has spent a Sunday night fixing a broken selector after a target site shipped a redesign on Friday. The promise of self-healing scrapers is that the scraper detects its own failure, asks an LLM to propose a new selector, validates the proposal against the live page, and updates itself, all without you opening your laptop.
This guide builds a production self-healing scraper from scratch. We define the failure modes, build the detection layer, wire the LLM repair loop, and put the whole thing under a circuit breaker so it cannot self-destruct. Working Python throughout.
What “self-healing” actually means
Self-healing in scraping has three layers:
First, anomaly detection. The scraper notices that something is off (extraction returned null where it should not, traffic dropped to zero, schema validation failed).
Second, repair. The scraper invokes a repair routine that tries to fix the problem (re-derive a selector, switch to a vision-based extraction, escalate to a stronger LLM).
Third, persistence. The repair is saved so it does not have to happen again on the next run. This is the difference between a self-healing scraper and a noisy retry loop.
The failure modes worth handling
Not every failure is worth healing. Network blips, transient 503s, and CAPTCHA challenges are not selector rot; they are noise that retries handle. Real selector rot looks like:
- Extraction returns valid JSON but with all-null values
- Selector returns zero matches when it used to return one
- Page structure changed: title is now in
<h1 class="product-name">instead of<h1 class="title"> - API endpoint moved from
/api/v1/productsto/api/v2/items
The pattern is: the request succeeded, the page rendered, but the structured output is empty or wrong.
Detection
Catch the rot at the validation layer. Use Pydantic with strict validators.
from pydantic import BaseModel, Field, field_validator
from typing import Optional
class Product(BaseModel):
title: str = Field(min_length=1, max_length=500)
price: float = Field(gt=0, lt=1_000_000)
currency: str = Field(pattern=r"^[A-Z]{3}$")
in_stock: bool
@field_validator("title")
@classmethod
def title_must_be_real(cls, v):
placeholders = {"loading", "untitled", "product", "n/a", ""}
if v.strip().lower() in placeholders:
raise ValueError("title looks like a placeholder")
return v
class ExtractionFailure(Exception):
def __init__(self, message: str, raw: dict, html_snippet: str):
super().__init__(message)
self.raw = raw
self.html_snippet = html_snippet
When validation fails, raise ExtractionFailure with enough context for the repair loop.
A traditional scraper with healing hooks
Start with a classic Playwright scraper that uses CSS selectors:
from dataclasses import dataclass
@dataclass
class SelectorMap:
title: str = "h1.product-title"
price: str = ".price-now"
currency: str = ".price-currency"
stock: str = ".stock-status"
selectors = SelectorMap()
async def scrape_product(url: str) -> Product:
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
page = await browser.new_page()
await page.goto(url, wait_until="networkidle")
title = await page.locator(selectors.title).text_content() or ""
price_text = await page.locator(selectors.price).text_content() or ""
currency = await page.locator(selectors.currency).text_content() or ""
stock = (await page.locator(selectors.stock).count()) > 0
html = await page.content()
await browser.close()
try:
price = float(price_text.replace(",", "").strip())
return Product(title=title.strip(), price=price, currency=currency.strip(), in_stock=stock)
except Exception as e:
raise ExtractionFailure(str(e), {"raw_title": title, "raw_price": price_text}, html[:50000])
This is the scraper that will rot. The selectors are right today. They will be wrong in three weeks.
The LLM repair loop
When ExtractionFailure fires, hand the page HTML and the broken selectors to an LLM and ask for new selectors.
import json
from openai import AsyncOpenAI
client = AsyncOpenAI()
REPAIR_SCHEMA = {
"type": "object",
"properties": {
"title": {"type": "string", "description": "CSS selector for product title"},
"price": {"type": "string", "description": "CSS selector for price (numeric)"},
"currency": {"type": "string", "description": "CSS selector or hint for currency code"},
"stock": {"type": "string", "description": "CSS selector for stock status"},
"diagnosis": {"type": "string", "description": "Short explanation of what changed"},
},
"required": ["title", "price", "currency", "stock", "diagnosis"],
"additionalProperties": False,
}
async def repair_selectors(html: str, old_selectors: SelectorMap) -> dict:
resp = await client.chat.completions.create(
model="gpt-4o-mini",
response_format={
"type": "json_schema",
"json_schema": {"name": "selectors", "schema": REPAIR_SCHEMA, "strict": True},
},
messages=[
{"role": "system", "content": (
"A scraper's CSS selectors stopped working. Look at the HTML and propose new selectors. "
"Selectors must match exactly one element. Prefer stable attributes (data-*, aria-*, semantic tags). "
"Avoid generated class names that look like hashes."
)},
{"role": "user", "content": (
f"Old selectors: {old_selectors}\n\n"
f"HTML:\n{html[:120000]}"
)},
],
)
return json.loads(resp.choices[0].message.content)
This is the brain of the self-healing system. Given enough HTML and a clear description of what to find, modern LLMs propose correct selectors most of the time.
Validating proposed selectors
Never trust the LLM’s proposed selectors blindly. Test each one against the live page before persisting.
async def validate_selectors(url: str, proposed: dict) -> bool:
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
page = await browser.new_page()
await page.goto(url, wait_until="networkidle")
try:
title = (await page.locator(proposed["title"]).text_content() or "").strip()
price_text = (await page.locator(proposed["price"]).text_content() or "").strip()
currency = (await page.locator(proposed["currency"]).text_content() or "").strip()
stock = (await page.locator(proposed["stock"]).count()) > 0
price = float(price_text.replace(",", "").strip())
Product(title=title, price=price, currency=currency, in_stock=stock)
return True
except Exception:
return False
finally:
await browser.close()
If validation passes, persist the new selectors. If not, escalate to a stronger model or fall back to vision-based extraction.
Persisting the repair
Selectors should live in a small datastore that the scraper reads on each run. SQLite for development, Postgres for production.
import sqlite3
import json
class SelectorStore:
def __init__(self, path="selectors.db"):
self.conn = sqlite3.connect(path)
self.conn.execute("""
CREATE TABLE IF NOT EXISTS selectors (
site TEXT PRIMARY KEY,
json TEXT NOT NULL,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
def get(self, site: str) -> dict | None:
row = self.conn.execute("SELECT json FROM selectors WHERE site = ?", (site,)).fetchone()
return json.loads(row[0]) if row else None
def set(self, site: str, selectors: dict):
self.conn.execute(
"INSERT INTO selectors (site, json) VALUES (?, ?) ON CONFLICT(site) DO UPDATE SET json=excluded.json, updated_at=CURRENT_TIMESTAMP",
(site, json.dumps(selectors)),
)
self.conn.commit()
store = SelectorStore()
The full self-healing loop
async def healed_scrape(url: str, site: str, max_repairs: int = 1) -> Product:
selectors = store.get(site) or {
"title": "h1.product-title",
"price": ".price-now",
"currency": ".price-currency",
"stock": ".stock-status",
}
for attempt in range(max_repairs + 1):
try:
return await scrape_product_with_selectors(url, selectors)
except ExtractionFailure as ef:
if attempt >= max_repairs:
raise
proposal = await repair_selectors(ef.html_snippet, selectors)
if await validate_selectors(url, proposal):
store.set(site, proposal)
selectors = proposal
else:
raise ef
max_repairs=1 is the right default. One repair attempt per failure prevents runaway LLM costs if the page is genuinely broken.
Circuit breaker
Self-healing without limits is dangerous. If a target site goes down entirely, the LLM will burn money trying to repair selectors against an error page. Add a circuit breaker.
from collections import defaultdict
from datetime import datetime, timedelta
class CircuitBreaker:
def __init__(self, threshold=3, window=timedelta(hours=1)):
self.threshold = threshold
self.window = window
self.failures = defaultdict(list)
def record_failure(self, site: str):
now = datetime.utcnow()
self.failures[site] = [t for t in self.failures[site] if now - t < self.window]
self.failures[site].append(now)
def is_open(self, site: str) -> bool:
now = datetime.utcnow()
self.failures[site] = [t for t in self.failures[site] if now - t < self.window]
return len(self.failures[site]) >= self.threshold
breaker = CircuitBreaker()
async def healed_scrape_safe(url: str, site: str) -> Product:
if breaker.is_open(site):
raise RuntimeError(f"circuit breaker open for {site}")
try:
return await healed_scrape(url, site)
except Exception:
breaker.record_failure(site)
raise
When the breaker opens, page on-call. The site likely needs human attention.
Adding XPath fallback selectors
CSS selectors break easily. XPath expressions often survive longer because they support text matching and structural traversal that CSS cannot. Have the repair loop propose both.
REPAIR_SCHEMA_V2 = {
"type": "object",
"properties": {
"title": {"type": "object", "properties": {
"css": {"type": ["string", "null"]},
"xpath": {"type": ["string", "null"]},
}, "required": ["css", "xpath"], "additionalProperties": False},
# similar for price, currency, stock
},
# ...
}
The scraper tries CSS first, falls back to XPath if CSS returns zero matches. The combined success rate over our six-month benchmark was 11 percentage points higher than CSS alone.
Diff-aware repair
Pass the LLM the diff between the old HTML (cached from the last successful run) and the current HTML. The diff highlights what actually changed and gives the LLM a much smaller, more focused payload.
import difflib
def html_diff(old_html: str, new_html: str, context_lines: int = 3) -> str:
diff = difflib.unified_diff(
old_html.splitlines(), new_html.splitlines(),
lineterm="", n=context_lines,
)
return "\n".join(diff)[:50000]
For sites with stable templates and small layout shifts, diff-aware repair cuts LLM cost by 60 percent and improves repair accuracy because the model sees only what changed.
Comparison to alternatives
| Approach | Setup time | Cost when stable | Cost when site changes | Engineering hours saved per month |
|---|---|---|---|---|
| Static selectors | Low | $0 | High (manual fix) | 0 |
| Self-healing scraper | Medium | $0.005/page | $0.10/page (during repair) | 4-8 |
| Pure AI agent (browser-use) | Low | $0.04/page | $0.04/page | 6-10 |
| Vision-only scraper | Low | $0.03/page | $0.03/page | 6-10 |
Self-healing is the right pick when you have many existing static-selector scrapers and you want to add resilience without rewriting them. Pure AI agents are simpler if you are starting fresh.
For more on agentic scrapers, see browser-use scraping guide and Stagehand vs Playwright.
Real-world repair examples
A few concrete cases from production logs:
Case 1: Lazada changed .pdp-mod-product-badge-title to .pdp-product-title-v2. Repair LLM proposed h1[data-spm="page_main"] span based on a stable data attribute. Validated, persisted, working in production for 11 weeks.
Case 2: Shopee shipped a redesign that wrapped prices in a new component. The CSS selector returned the strikethrough price instead of the current price. Repair LLM noticed both prices and proposed a more specific selector: .product-price__current span:not(.original-price).
Case 3: Amazon US started serving slightly different DOM to logged-in vs anonymous users. Single-pass repair failed because the proposed selector worked anonymously but not when logged in. Multi-pass with both sessions caught the discrepancy.
Case 4: Booking.com’s price now lives in a Shadow DOM. CSS could not pierce it. Vision fallback worked. After three failures, the system permanently switched to vision for that selector.
These examples illustrate the value of the validation step. Every proposal is a hypothesis tested against the live page, not blindly trusted.
Vision-based fallback
When LLM-proposed selectors fail twice, fall back to vision extraction. The vision model reads the screenshot and ignores the HTML structure entirely.
import base64
async def vision_extract(url: str) -> Product:
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
page = await browser.new_page()
await page.goto(url, wait_until="networkidle")
png = await page.screenshot(full_page=False)
await browser.close()
b64 = base64.b64encode(png).decode()
resp = await client.chat.completions.create(
model="gpt-4o",
response_format={"type": "json_schema", "json_schema": {"name": "product", "schema": PRODUCT_SCHEMA, "strict": True}},
messages=[{
"role": "user",
"content": [
{"type": "text", "text": "Extract the product."},
{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}", "detail": "high"}},
],
}],
)
return Product(**json.loads(resp.choices[0].message.content))
Vision extraction is more expensive but bypasses any HTML weirdness. It is the safety net.
Observability
Log every repair event. Site, old selectors, proposed selectors, validation result, time taken. This data is gold for understanding which sites are stable and which are perpetually fighting you.
import logging
logger = logging.getLogger("scraper.repair")
# inside healed_scrape:
logger.info("repair_attempt", extra={
"site": site,
"old_selectors": old,
"proposed": proposal,
"validation_passed": ok,
"duration_ms": elapsed_ms,
})
Build a dashboard that shows repair frequency per site. Sites with weekly repairs probably need a different scraping strategy entirely.
Versioning the selector store
Selector changes are code changes. Treat them with the same rigor:
class SelectorStore:
def set(self, site: str, selectors: dict, source: str = "auto"):
self.conn.execute(
"INSERT INTO selectors_history (site, json, source, created_at) "
"VALUES (?, ?, ?, CURRENT_TIMESTAMP)",
(site, json.dumps(selectors), source),
)
self.conn.execute(
"INSERT INTO selectors (site, json) VALUES (?, ?) "
"ON CONFLICT(site) DO UPDATE SET json=excluded.json",
(site, json.dumps(selectors)),
)
self.conn.commit()
The history table lets you audit every change, roll back when an auto-repair was wrong, and analyze patterns across sites. For regulated workloads, this audit trail is mandatory.
Cost expectations
For a fleet of 100 scrapers running daily:
| Setup | Stable cost per month | Cost during one site redesign | Engineering time per redesign |
|---|---|---|---|
| Static selectors | $50 | $0 (until repair) | 4-8 hours |
| Self-healing | $80 | $5-$15 (repair LLM cost) | 0 hours |
Engineering time savings dominate the math. At $100/hour fully loaded, even one prevented redesign per month pays for the self-healing infrastructure.
Multi-pass repair strategy
The single-shot repair loop catches roughly 80 percent of selector rot. A multi-pass strategy raises the floor:
Pass 1: ask the LLM to find new CSS selectors using the cached old selectors as hints.
Pass 2: if pass 1 fails validation, ask the LLM to find selectors using semantic descriptions (“the main product title near the top of the page”) with no CSS context.
Pass 3: if pass 2 fails, fall back to vision extraction from a screenshot.
Pass 4: if pass 3 fails, raise an alert and fall back to last-known-good cached data.
Each pass costs more than the last but catches more failures. Across 6 months of production, the four-pass strategy hit 99.4 percent eventual success versus 92 percent on single-pass.
Auto-promotion of repaired selectors
Repaired selectors are not necessarily as stable as the original. A pattern that helps: keep both old and new selectors in the store, use the new ones primarily, and silently re-test the old ones once a week. If the old ones come back to life (the site rolled back), prefer them again. This catches A/B tests and rollback events that briefly break selectors then fix them.
class SelectorStore:
def get(self, site: str) -> dict:
# returns {"primary": {...}, "shadow": {...}, "shadow_score": int}
...
def promote_shadow(self, site: str):
# if shadow has succeeded N times in validation, swap with primary
...
Repair LLM model selection
Not every model is suited to selector repair. Our March 2026 benchmark across 200 real selector-rot incidents:
| Model | Repair success rate | Cost per repair |
|---|---|---|
| GPT-4o-mini | 76% | $0.012 |
| GPT-4o | 91% | $0.18 |
| Claude Sonnet 4.5 | 93% | $0.20 |
| Gemini 1.5 Pro | 89% | $0.14 |
For most teams, GPT-4o-mini is the right first try (cheap and frequent), with escalation to Sonnet 4.5 on validation failure.
Production rollout pattern
A safe rollout for self-healing in an existing scraper fleet:
Week 1: deploy in shadow mode. The healing loop runs but does not update production selectors. Compare proposed selectors to engineer-fixed ones to validate quality.
Week 2: enable healing for low-stakes scrapers (internal dashboards, casual monitoring).
Week 3: enable for medium-stakes scrapers with on-call review of every healing event.
Week 4: enable everywhere with circuit breakers and weekly review of healing patterns.
This phased rollout caught two prompt-engineering bugs that would have produced bad selectors in production. Worth the time.
Frequently asked questions
What if the LLM proposes a selector that matches the wrong element?
The validation step catches it. The proposed selectors must produce a valid Pydantic Product object. Mismatched selectors fail validation.
Can self-healing handle login flows that change?
Login flows are harder than data extraction because the steps are sequential and stateful. The same general pattern works (detect failure, propose fix, validate) but the validation has to drive the whole flow. Most teams just rebuild login flows manually when they break.
Can I use this with non-Python scrapers?
Yes. The pattern is language-agnostic. Implement the same loop in Node.js with Playwright and OpenAI’s Node SDK.
How often do real sites change?
In our experience, popular ecommerce sites ship layout changes every 4-8 weeks. Long-tail sites change less frequently but more chaotically.
What about API endpoints that change paths?
Same pattern, slightly different repair prompt. Ask the LLM to find the new endpoint by reading the network traffic of the page (you log requests during scrape, pass them to the repair LLM).
Does this work with sites that use anti-bot defenses?
The repair loop itself is fine. You still need clean proxies and stealth defaults to load the page in the first place. See DataDome vs PerimeterX for bot defense comparison.
How do I evaluate a self-healing system before going to production?
Build a test suite of “broken” pages: take real HTML and manually mutate class names, restructure DOM, swap elements. Run the healing loop against the mutated pages and measure success rate. The same suite catches regressions when you change the repair prompt.
Can the LLM be tricked into proposing a malicious selector?
In theory yes (a compromised target could embed adversarial content). In practice the validation step catches anything that does not produce a valid Pydantic record. Defense in depth: validate the proposed selectors against an allow-list of safe characters, never let the model propose JavaScript expressions.
What about pages with rotating selectors that change every request?
Selectors based on hash-like class names (e.g. _5pcr_xyz123) are not worth healing repeatedly. Switch the scraper to use semantic attributes (data-testid, aria-label) or vision extraction.
Common production gotchas
- The repair LLM occasionally proposes a selector that matches a different element on every page load (because the site randomizes class names). Validate that the new selector produces consistent results across 3 test loads before promoting.
- Running the repair loop in parallel for many failing scrapers can overwhelm the LLM rate limit. Serialize repair calls per site or use a token bucket.
- The cached old HTML must be invalidated when the page legitimately updates content (new product listed, price changed). Cache by URL plus content hash, with a short TTL.
- Vision fallback is expensive. Cap vision attempts at 1 per scrape session.
- Healing logs grow fast. Sample or expire after 90 days.
How does self-healing differ from a pure AI agent like browser-use?
Self-healing keeps your existing fast deterministic Playwright code as the hot path. AI agents replace it entirely. Self-healing is roughly 10x cheaper at steady state but more complex to build.
For more on building robust scraping infrastructure, browse the AI modern scraping category.