Pydantic AI for Web Scraping: Type-Safe LLM Scrapers in 2026

Pydantic AI landed in late 2024 and by 2026 it’s become the go-to way to build type-safe, LLM-powered scrapers that actually return structured data instead of raw text blobs. If you’ve spent time wrestling with JSON parsing failures, hallucinated field names, or retry logic scattered across notebook cells, Pydantic AI for web scraping is worth a serious look.

What Pydantic AI Brings to Scraping Pipelines

Pydantic AI wraps LLM calls behind a typed interface. You define a Pydantic model for the data you want, pass it to the agent, and get back a validated Python object — not a string you have to parse yourself. The library handles retries, validation errors, and model switching out of the box.

For scraping this matters because the hardest part of LLM-assisted extraction isn’t prompting, it’s reliability. A scraper that works 90% of the time and silently drops 10% of records is worse than one that fails loudly. Pydantic AI’s validation layer forces the LLM to conform or retry, and when it can’t, it raises a typed exception you can catch and log.

Compare this to raw LLM calls or even Crawl4AI’s extraction mode, which gives you markdown and leaves structured parsing to you. Pydantic AI sits one layer above: you still feed it cleaned HTML or markdown, but the output contract is enforced.

Setting Up a Basic Pydantic AI Scraper

Install the stack:

pip install pydantic-ai httpx crawl4ai

Define your schema and agent:

from pydantic import BaseModel
from pydantic_ai import Agent
import httpx

class JobPosting(BaseModel):
    title: str
    company: str
    salary_range: str | None
    location: str
    remote: bool

agent = Agent(
    "openai:gpt-4o-mini",
    output_type=JobPosting,
    system_prompt="Extract the job posting details from the HTML. Return null for salary_range if not listed.",
)

async def scrape_job(url: str) -> JobPosting:
    async with httpx.AsyncClient() as client:
        html = (await client.get(url)).text
    result = await agent.run(html[:8000])  # trim to token budget
    return result.output

Here result.output is a validated JobPosting instance. If the LLM returns malformed JSON or omits a required field, Pydantic AI retries up to the configured limit before raising UnexpectedModelBehavior. No silent failures.

For the HTTP layer, the choice matters more than people think. If the target site uses TLS fingerprinting, plain httpx will get blocked. Comparing httpx, curl-cffi, and niquests shows curl-cffi as the 2026 default for anti-bot targets — it’s a drop-in replacement for the client.get() call above.

When to Use LLM Extraction vs. CSS Selectors

Not every scraper should use an LLM. Here’s an honest breakdown:

| Scenario | LLM extraction | CSS/XPath selectors |
| --- | --- | --- |
| Schema varies per site | yes | painful |
| Schema is stable, high volume | overkill | preferred |
| Unstructured text (reviews, bios) | yes | no |
| Price / SKU grids | marginal | preferred |
| JS-rendered SPAs | pair with browser | pair with browser |
| Cost sensitivity | ~$0.002/page (gpt-4o-mini) | ~$0 |

The cost column is the honest check. At $0.002 per page with gpt-4o-mini, a 100K page crawl costs $200 in LLM calls alone — before proxies or infra. For stable schemas at scale, selectors win. LLM extraction is the right call when the schema is inconsistent across sources or when you’re extracting meaning from prose, not structured fields.
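That back-of-envelope math is worth encoding before committing to a large crawl. A minimal budgeting helper (the per-page rates here are illustrative, not quoted prices):

```python
def crawl_cost(pages: int, per_page_usd: float = 0.002) -> float:
    """Estimated LLM spend for a crawl, excluding proxies and infra."""
    return pages * per_page_usd

# 100K pages at roughly gpt-4o-mini rates
print(crawl_cost(100_000))        # 200.0
# the same crawl routed to a 10x-cost model
print(crawl_cost(100_000, 0.02))  # 2000.0
```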

AutoScraper’s pattern-based approach sits between these two extremes — no selectors, no LLM costs, but it breaks on layout changes. Pydantic AI handles layout changes gracefully since it reads semantic content.

Handling JavaScript-Rendered Pages

Most 2026 targets require a browser. The standard pattern is to pair Pydantic AI with a browser layer that handles rendering, then pass the cleaned text to the agent.

Steps for a Playwright + Pydantic AI pipeline:

  1. Launch a browser context with Playwright (stealth mode, real user-agent)
  2. Navigate and wait for the target element or network idle
  3. Extract innerText or the full page HTML, trimmed to token budget
  4. Pass to the Pydantic AI agent for structured extraction
  5. Validate result, retry on ValidationError, log failures with the raw HTML for debugging

Playwright beats Puppeteer and Selenium for this use case in 2026 because its async API integrates cleanly with Pydantic AI’s async agent interface — no thread bridging, no sync wrappers.

For teams that want a managed crawl layer instead of raw Playwright, Crawlee for Python handles request queuing, retries, and session rotation, and can pipe rendered HTML directly into a Pydantic AI extraction step. It’s a good fit when you’re crawling hundreds of pages with structured output requirements.

Model Selection and Cost Control

The main levers:

  • gpt-4o-mini: default choice, fast, cheap, handles well-structured HTML reliably
  • claude-3-5-haiku: slightly better at prose extraction, similar cost tier
  • gpt-4o: for complex nested schemas or ambiguous content, 10x the cost
  • local models (ollama): zero API cost, 3-5x slower, accuracy drops on noisy HTML

Pydantic AI lets you swap models per agent or per run, so you can route simple extractions to mini and fall back to a stronger model on retry. A practical pattern: catch UnexpectedModelBehavior on the first run with mini, then retry once with gpt-4o before logging as a permanent failure.
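One way to sketch that escalation, with the retryable exception type injected so the routing logic stays library-agnostic (in a real pipeline you would pass pydantic_ai's UnexpectedModelBehavior and two configured Agent instances; the agent objects here are assumptions):

```python
async def run_with_escalation(primary, fallback, prompt, retryable: type[Exception]):
    """Try the cheap agent first; escalate to the stronger one on failure."""
    try:
        return await primary.run(prompt)
    except retryable:
        # second (and last) attempt with the stronger model; if this
        # raises too, the caller logs it as a permanent failure
        return await fallback.run(prompt)
```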

Keep prompts tight. Token bloat is the main cost driver. Strip scripts, styles, and navigation markup from the page before it reaches the agent; most of a rendered page's bytes carry no extractable meaning.
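Pre-trimming the markup doesn't need a third-party library. A stdlib-only sketch (the tag skip-list is an assumption to tune per target):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping tags that never carry extractable data."""
    SKIP = {"script", "style", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # >0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def strip_html(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```

Running the raw HTML through strip_html before the agent call typically cuts the token count by a large factor on script-heavy pages.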
