Claude AI Python Web Scraping Workflow: Complete Guide
Claude, built by Anthropic, is one of the most capable large language models for structured data extraction. Unlike search-augmented models, Claude excels at parsing raw HTML or text content you provide and extracting exactly the fields you need. This makes it a powerful component in web scraping pipelines where traditional CSS selectors struggle with inconsistent page layouts.
This guide shows you how to build a complete web scraping workflow that combines Python scraping libraries with Claude’s API for intelligent, adaptive data extraction.
Why Use Claude for Web Scraping
Traditional scraping breaks when websites change their HTML structure. A price that lived in a span.price element yesterday might move to a div.product-price today. Claude avoids this problem because it understands the semantic meaning of page content, not just its structure.
Here are the specific advantages:
- Handles unstructured data: Claude can extract structured JSON from messy HTML, plain text, or even OCR output
- Adapts to layout changes: no CSS selectors to maintain; if the data is visible on the page, Claude finds it
- Multi-format output: ask for JSON, CSV rows, or any custom format
- Context window: Claude’s large context window (200K tokens) can process entire pages at once
- Consistent schemas: Claude reliably outputs data matching your specified schema
The tradeoff is cost and speed: calling Claude’s API is slower and more expensive per page than parsing HTML with BeautifulSoup. The right approach is to use Claude selectively, reserving it for pages where traditional parsing is fragile or impractical.
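The hybrid pattern is easy to sketch: try a cheap selector pass first, and only route a page to Claude when the selector comes up empty. A minimal illustration (extract_price_fast and the selectors it tries are hypothetical names for this sketch, not part of any library):

```python
from bs4 import BeautifulSoup

def extract_price_fast(html):
    """Cheap CSS-selector pass. Returns the price string, or None to
    signal that this page should fall back to Claude-based extraction."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one("span.price, div.product-price")
    return node.get_text(strip=True) if node else None

# a page where the selector still works never touches the API
html_ok = '<span class="price">$19.99</span>'
# a redesigned page where it fails gets routed to Claude instead
html_changed = '<p>Price: $19.99</p>'

print(extract_price_fast(html_ok))       # $19.99
print(extract_price_fast(html_changed))  # None
```

Only the second kind of page pays the API cost; the rest stay on the fast path.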
Architecture Overview
The workflow has three stages:
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Fetch Layer │────>│ Clean Layer │────>│ Extract Layer │
│ │ │ │ │ │
│ - requests/httpx │ │ - strip scripts │ │ - Claude API │
│ - playwright │ │ - remove nav │ │ - structured │
│ - proxy rotation │ │ - to markdown │ │ JSON output │
└──────────────────┘ └──────────────────┘ └──────────────────┘
Prerequisites and Setup
Install the required packages:
pip install anthropic httpx beautifulsoup4 html2text playwright
python -m playwright install chromium
Set your API key (shown inline for demonstration; in production, set ANTHROPIC_API_KEY in your environment rather than hardcoding it):
import os
os.environ["ANTHROPIC_API_KEY"] = "your-claude-api-key"
Step 1: Build the Fetch Layer with Proxy Support
The fetch layer retrieves web pages. It supports both simple HTTP requests and browser-rendered pages for JavaScript-heavy sites.
import httpx
from playwright.sync_api import sync_playwright
from urllib.parse import urlparse
import random

class PageFetcher:
    def __init__(self, proxies=None):
        self.proxies = proxies or []
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
        }

    def _get_proxy(self):
        if not self.proxies:
            return None
        return random.choice(self.proxies)

    def fetch_simple(self, url):
        """Fetch a page with httpx (no JS rendering)."""
        proxy = self._get_proxy()
        with httpx.Client(
            proxy=proxy,
            headers=self.headers,
            timeout=30.0,
            follow_redirects=True
        ) as client:
            response = client.get(url)
            response.raise_for_status()
            return response.text

    def fetch_rendered(self, url):
        """Fetch a page with Playwright (full JS rendering)."""
        proxy = self._get_proxy()
        browser_args = {}
        if proxy:
            # Playwright expects credentials separately, not embedded in the URL
            parsed = urlparse(proxy)
            browser_args["proxy"] = {
                "server": f"{parsed.scheme}://{parsed.hostname}:{parsed.port}",
                "username": parsed.username,
                "password": parsed.password,
            }
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True, **browser_args)
            page = browser.new_page()
            page.set_extra_http_headers(self.headers)
            page.goto(url, wait_until="networkidle", timeout=30000)
            content = page.content()
            browser.close()
            return content
# usage with residential proxies
fetcher = PageFetcher(proxies=[
    "http://user:pass@us.residential-proxy.com:8080",
    "http://user:pass@eu.residential-proxy.com:8080",
])
html = fetcher.fetch_simple("https://example.com/products")
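Proxied fetches fail intermittently (timeouts, banned exit IPs, connection resets), so it is worth wrapping either fetch method in a retry loop. A minimal sketch with exponential backoff plus jitter; fetch_with_retry is a helper name of my own, not part of httpx or Playwright, and the fake fetcher below just demonstrates the behavior:

```python
import time
import random

def fetch_with_retry(fetch_fn, url, max_attempts=4, base_delay=1.0):
    """Retry a flaky fetch callable (e.g. fetcher.fetch_simple)
    with exponential backoff plus a little jitter."""
    for attempt in range(max_attempts):
        try:
            return fetch_fn(url)
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            print(f"fetch failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)

# demo: a fake fetcher that fails twice before succeeding
calls = {"n": 0}
def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("proxy timed out")
    return "<html>ok</html>"

print(fetch_with_retry(flaky_fetch, "https://example.com", base_delay=0.01))
```

In the real pipeline you would pass fetcher.fetch_simple or fetcher.fetch_rendered as fetch_fn.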
Step 2: Clean the HTML
Raw HTML is full of scripts, styles, navigation, and other noise that wastes Claude’s tokens and reduces extraction accuracy. Cleaning the content before sending it to Claude is essential.
from bs4 import BeautifulSoup
import html2text
import re
class ContentCleaner:
    def __init__(self):
        self.converter = html2text.HTML2Text()
        self.converter.ignore_links = False
        self.converter.ignore_images = True
        self.converter.body_width = 0  # no wrapping

    def clean(self, html, keep_tables=True):
        """Convert HTML to clean markdown for Claude."""
        soup = BeautifulSoup(html, "html.parser")
        # remove non-content elements
        for tag in soup(["script", "style", "nav", "footer", "header",
                         "aside", "iframe", "noscript", "svg", "form"]):
            tag.decompose()
        # remove hidden elements
        for tag in soup.find_all(style=re.compile(r"display\s*:\s*none")):
            tag.decompose()
        # remove comment sections, sidebars, and overlays
        for tag in soup.find_all(class_=re.compile(r"comment|sidebar|menu|popup|modal")):
            tag.decompose()
        # convert to markdown
        markdown = self.converter.handle(str(soup))
        # collapse excessive whitespace
        markdown = re.sub(r"\n{3,}", "\n\n", markdown)
        return markdown.strip()

    def extract_main_content(self, html):
        """Try to find and return only the main content area."""
        soup = BeautifulSoup(html, "html.parser")
        # look for common main content containers
        main = (
            soup.find("main") or
            soup.find("article") or
            soup.find(id=re.compile(r"content|main|product")) or
            soup.find(class_=re.compile(r"content|main|product"))
        )
        if main:
            return self.clean(str(main))
        return self.clean(html)
cleaner = ContentCleaner()
clean_content = cleaner.extract_main_content(html)
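Since Claude bills by input token, it helps to estimate the size of cleaned content before sending it. A rough budget check using the common ~4-characters-per-token heuristic (the helper names are mine, and real token counts vary by content, so treat this as a budget check rather than an exact figure):

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate: ~4 characters per token for English text."""
    return len(text) // chars_per_token

def truncate_to_budget(text, max_tokens=150_000, chars_per_token=4):
    """Trim content so its estimated token count fits the budget,
    leaving headroom in the 200K context window for prompt and output."""
    limit = max_tokens * chars_per_token
    return text if len(text) <= limit else text[:limit]

page = "word " * 10_000                # ~50,000 characters
print(estimate_tokens(page))          # 12500
print(len(truncate_to_budget(page)))  # 50000 (already under budget)
```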
Step 3: Build the Claude Extraction Layer
This is where the magic happens: we send cleaned content to Claude with a schema definition and get back structured data.
import anthropic
import json

class ClaudeExtractor:
    def __init__(self, api_key=None):
        self.client = anthropic.Anthropic(api_key=api_key)

    def extract(self, content, schema, instructions="", model="claude-sonnet-4-20250514"):
        """Extract structured data from content using Claude."""
        system_prompt = """You are a data extraction assistant. Your job is to extract structured data from web page content and return it as valid JSON.

RULES:
- Return ONLY valid JSON: no markdown code blocks, no explanations
- Follow the provided schema exactly
- Use null for fields where data is not found
- Do not invent or hallucinate data; only extract what is present in the content"""

        user_prompt = f"""Extract data from the following web page content.

SCHEMA:
{json.dumps(schema, indent=2)}

{f"ADDITIONAL INSTRUCTIONS: {instructions}" if instructions else ""}

WEB PAGE CONTENT:
{content}"""

        response = self.client.messages.create(
            model=model,
            max_tokens=4096,
            system=system_prompt,
            messages=[{"role": "user", "content": user_prompt}]
        )
        raw = response.content[0].text.strip()
        # handle potential markdown wrapping
        if raw.startswith("```"):
            raw = raw.split("\n", 1)[1].rsplit("```", 1)[0].strip()
        return json.loads(raw)

    def extract_batch(self, pages, schema, instructions=""):
        """Extract data from multiple pages."""
        results = []
        for i, page in enumerate(pages):
            try:
                data = self.extract(page["content"], schema, instructions)
                results.append({
                    "url": page.get("url", f"page_{i}"),
                    "data": data,
                    "success": True
                })
            except Exception as e:  # covers JSONDecodeError and API errors
                results.append({
                    "url": page.get("url", f"page_{i}"),
                    "error": str(e),
                    "success": False
                })
        return results
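Even with a strict system prompt, replies occasionally arrive fenced in markdown or padded with commentary. A more defensive parser than inline fence-stripping, shown as a standalone sketch (parse_model_json is my name for it, not an SDK function):

```python
import json

def parse_model_json(raw):
    """Defensively parse JSON from a model reply: strip markdown fences,
    then fall back to slicing from the first '{' to the last '}'."""
    text = raw.strip()
    if text.startswith("```"):
        # drop the opening fence line and the closing fence
        text = text.split("\n", 1)[1].rsplit("```", 1)[0].strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # model added commentary around the object; slice out the JSON
        start, end = text.find("{"), text.rfind("}")
        if start != -1 and end > start:
            return json.loads(text[start:end + 1])
        raise

print(parse_model_json('```json\n{"price": "$19.99"}\n```'))
print(parse_model_json('Here is the data: {"rating": 4.5} hope that helps'))
```

Swapping this in for the fence-stripping lines in extract() makes batch runs noticeably more resilient.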
Step 4: Put It All Together
Here is the complete workflow that fetches, cleans, and extracts data:
import json
import csv
import time
class ClaudeScrapingWorkflow:
    def __init__(self, claude_key, proxies=None):
        self.fetcher = PageFetcher(proxies=proxies)
        self.cleaner = ContentCleaner()
        self.extractor = ClaudeExtractor(api_key=claude_key)

    def scrape(self, url, schema, use_browser=False, instructions=""):
        """Full scraping pipeline for a single URL."""
        # fetch
        if use_browser:
            html = self.fetcher.fetch_rendered(url)
        else:
            html = self.fetcher.fetch_simple(url)
        # clean
        content = self.cleaner.extract_main_content(html)
        # check token budget (roughly 4 chars per token)
        estimated_tokens = len(content) // 4
        if estimated_tokens > 150000:
            # truncate to fit Claude's context window
            content = content[:600000]
            print(f"warning: content truncated from ~{estimated_tokens} tokens")
        # extract
        data = self.extractor.extract(content, schema, instructions)
        return data

    def scrape_multiple(self, urls, schema, use_browser=False,
                        instructions="", delay=2.0):
        """Scrape multiple URLs with a polite delay between requests."""
        results = []
        for i, url in enumerate(urls):
            print(f"[{i+1}/{len(urls)}] scraping: {url}")
            try:
                data = self.scrape(url, schema, use_browser, instructions)
                results.append({"url": url, "data": data, "success": True})
            except Exception as e:
                print(f"  error: {e}")
                results.append({"url": url, "error": str(e), "success": False})
            if i < len(urls) - 1:
                time.sleep(delay)
        return results

    def to_csv(self, results, output_file, flat_keys=None):
        """Save extraction results to CSV."""
        if not results or not any(r["success"] for r in results):
            print("no successful results to save")
            return
        # get keys from the first successful result
        first_success = next(r for r in results if r["success"])
        if flat_keys is None:
            flat_keys = list(first_success["data"].keys())
        with open(output_file, "w", newline="") as f:
            # extrasaction="ignore" tolerates rows with unexpected extra fields
            writer = csv.DictWriter(f, fieldnames=["url"] + flat_keys,
                                    extrasaction="ignore")
            writer.writeheader()
            for result in results:
                if result["success"]:
                    row = {"url": result["url"]}
                    row.update(result["data"])
                    writer.writerow(row)
        print(f"saved {sum(1 for r in results if r['success'])} rows to {output_file}")
# usage example: scrape product data
workflow = ClaudeScrapingWorkflow(
    claude_key="your-anthropic-api-key",
    proxies=[
        "http://user:pass@proxy1.example.com:8080",
        "http://user:pass@proxy2.example.com:8080",
    ]
)

product_schema = {
    "product_name": "string",
    "price": "string (include currency symbol)",
    "rating": "float (out of 5)",
    "review_count": "integer",
    "availability": "string (in stock / out of stock)",
    "key_features": ["string"],
    "description_summary": "string (max 200 chars)"
}

urls = [
    "https://example.com/product/1",
    "https://example.com/product/2",
    "https://example.com/product/3",
]

results = workflow.scrape_multiple(
    urls,
    product_schema,
    instructions="focus on the main product listing, ignore related products"
)
workflow.to_csv(results, "products.csv")
Advanced: Using Claude for Dynamic Schema Detection
One of Claude’s strengths is figuring out what data is available on a page without you specifying the schema in advance.
def auto_detect_schema(extractor, sample_content):
    """Let Claude analyze a page and suggest an extraction schema."""
    response = extractor.client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": f"""Analyze this web page content and identify all extractable data fields.

Return a JSON object where:
- keys are field names (snake_case)
- values describe the data type and what the field contains

WEB PAGE CONTENT:
{sample_content[:10000]}"""
        }]
    )
    raw = response.content[0].text.strip()
    if raw.startswith("```"):
        raw = raw.split("\n", 1)[1].rsplit("```", 1)[0].strip()
    return json.loads(raw)
# first, detect what data is available
sample_html = fetcher.fetch_simple("https://example.com/product/1")
sample_content = cleaner.extract_main_content(sample_html)
detected_schema = auto_detect_schema(extractor, sample_content)
print("detected fields:", json.dumps(detected_schema, indent=2))
# then use that schema for batch extraction
results = workflow.scrape_multiple(urls, detected_schema)
Cost Optimization Strategies
Claude API calls cost money. Here are practical ways to reduce costs:
1. Minimize input tokens
def aggressive_clean(html, max_chars=20000):
    """Aggressively reduce content size to save tokens."""
    soup = BeautifulSoup(html, "html.parser")
    # remove everything except text-bearing content
    for tag in soup(["script", "style", "nav", "footer", "header",
                     "aside", "iframe", "noscript", "svg", "form",
                     "img", "video", "audio"]):
        tag.decompose()
    # strip all attributes except href
    for tag in soup.find_all(True):
        attrs = dict(tag.attrs)
        for attr in attrs:
            if attr != "href":
                del tag[attr]
    text = soup.get_text(separator=" ", strip=True)
    return text[:max_chars]
2. Use the right model
# use Haiku for simple extractions (cheapest)
simple_data = extractor.extract(content, schema, model="claude-3-5-haiku-20241022")
# use Sonnet for complex pages (balanced)
complex_data = extractor.extract(content, schema, model="claude-sonnet-4-20250514")
# use Opus only for pages requiring deep reasoning
deep_data = extractor.extract(content, schema, model="claude-opus-4-20250514")
3. Cache results
import hashlib
import os

def cached_extract(extractor, content, schema, cache_dir="cache"):
    """Cache extraction results to avoid re-processing identical pages."""
    os.makedirs(cache_dir, exist_ok=True)
    # hash both content and schema so schema changes invalidate the cache
    cache_key = content + json.dumps(schema, sort_keys=True)
    content_hash = hashlib.md5(cache_key.encode()).hexdigest()
    cache_file = os.path.join(cache_dir, f"{content_hash}.json")
    if os.path.exists(cache_file):
        with open(cache_file) as f:
            return json.load(f)
    # extract and cache
    result = extractor.extract(content, schema)
    with open(cache_file, "w") as f:
        json.dump(result, f)
    return result
Error Handling and Validation
Claude does not always return perfect JSON. Here is how to handle common issues:
from pydantic import BaseModel, ValidationError
from typing import Optional

class ProductData(BaseModel):
    product_name: str
    price: Optional[str] = None
    rating: Optional[float] = None
    review_count: Optional[int] = None
    availability: Optional[str] = None

def validated_extract(extractor, content, schema):
    """Extract data and validate it against a Pydantic model."""
    raw_data = extractor.extract(content, schema)
    try:
        validated = ProductData(**raw_data)
        return validated.model_dump()
    except ValidationError as e:
        print(f"validation failed: {e}")
        # retry extraction with stricter instructions
        return extractor.extract(
            content, schema,
            instructions="Be very precise: price must be a string, rating a float, review_count an integer."
        )
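Transient API failures deserve the same treatment as malformed JSON. The anthropic SDK raises anthropic.RateLimitError when you hit rate limits; a generic backoff wrapper can absorb those (with_backoff is a sketch of mine, demonstrated with a stand-in error so it runs without an API key):

```python
import time

def with_backoff(fn, retryable=(Exception,), max_attempts=5, base_delay=2.0):
    """Call fn(), retrying retryable errors with exponential backoff.
    In the real workflow, pass retryable=(anthropic.RateLimitError,)
    so only transient API errors trigger a retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: re-raise the last error
            time.sleep(base_delay * (2 ** attempt))

# demo with a stand-in error in place of a rate-limit exception
attempts = {"n": 0}
def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 2:
        raise TimeoutError("simulated rate limit")
    return {"ok": True}

print(with_backoff(flaky_call, retryable=(TimeoutError,), base_delay=0.01))
```

Wrap the call to extractor.extract in scrape() with this to ride out rate-limit bursts during batch runs.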
When to Use Claude vs Traditional Parsing
| Scenario | Use Claude | Use BeautifulSoup/CSS |
|---|---|---|
| consistent HTML structure | no | yes |
| pages change layout frequently | yes | no |
| extracting from plain text | yes | no |
| scraping 10,000+ pages | no (cost) | yes |
| complex nested data | yes | depends |
| real-time price monitoring | no (speed) | yes |
| one-off research extraction | yes | no |
Conclusion
The Claude AI Python web scraping workflow is most powerful when you use it selectively. Let traditional scraping handle the high-volume, structured extraction, and bring in Claude for the pages where CSS selectors are too fragile, the data is unstructured, or you need to adapt to changing layouts without rewriting your parser.
The key insight is that Claude is not a replacement for BeautifulSoup or Scrapy. It is a layer you add on top of your existing scraping infrastructure to handle the 20% of pages that cause 80% of your maintenance headaches.