Claude’s API makes it practical to add AI-powered extraction and classification to Python scraping pipelines. this guide covers the complete workflow: scrape with requests/playwright, extract structure with Claude, handle large content efficiently, and keep costs low with the right model selection.
why Claude works well for scraping pipelines
Claude handles HTML extraction well for two reasons: it has a large context window (200K tokens in Claude 3.5+) that can process full page content without chunking, and its instruction-following is reliable enough for structured data extraction without extensive prompt engineering. you can send messy HTML and get clean JSON back consistently.
the practical use case is handling sites where selectors break frequently — job boards, e-commerce sites, news sites that redesign every 18 months. instead of maintaining fragile CSS selectors, you describe what you want in plain English and let Claude figure out where it is on the page.
basic setup
pip install anthropic requests beautifulsoup4 playwright
playwright install chromium
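a note on credentials: the anthropic SDK reads the ANTHROPIC_API_KEY environment variable when no api_key argument is passed, so you can keep the key out of source code entirely:

import anthropic

# no api_key argument needed if ANTHROPIC_API_KEY is set in the environment
client = anthropic.Anthropic()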
the core scrape-extract pattern
import anthropic
import requests
from bs4 import BeautifulSoup

client = anthropic.Anthropic(api_key='your-api-key')

def scrape_and_extract(url: str, extraction_instruction: str, model: str = 'claude-haiku-4-5') -> dict:
    # fetch the page with a browser-like user agent
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
    response = requests.get(url, headers=headers, timeout=15)
    response.raise_for_status()

    # strip non-content tags so Claude only sees the page's actual text
    soup = BeautifulSoup(response.text, 'html.parser')
    for tag in soup(['script', 'style', 'nav', 'footer', 'header', 'aside']):
        tag.decompose()
    clean_text = soup.get_text(separator='\n', strip=True)

    # truncate to keep token costs bounded on very long pages
    message = client.messages.create(
        model=model,
        max_tokens=2048,
        messages=[{
            'role': 'user',
            'content': extraction_instruction + '\n\npage content:\n' + clean_text[:15000]
        }]
    )
    return {
        'url': url,
        'extracted': message.content[0].text,
        'input_tokens': message.usage.input_tokens,
        'output_tokens': message.usage.output_tokens
    }

result = scrape_and_extract(
    'https://example.com/jobs',
    'extract all job titles and locations as a JSON array with title and location fields'
)
print(result['extracted'])
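the extracted field comes back as plain text, and models sometimes wrap JSON output in markdown fences. a small parsing helper (our own, not part of the SDK) makes the output safe to consume downstream:

import json

def parse_extracted_json(raw: str):
    # strip markdown fences if the model added them, then parse;
    # returns None instead of raising on malformed output
    cleaned = raw.strip()
    if cleaned.startswith('```'):
        cleaned = cleaned.split('\n', 1)[1]    # drop the opening fence line
        cleaned = cleaned.rsplit('```', 1)[0]  # drop the closing fence
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None

jobs = parse_extracted_json(result['extracted'])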
model selection for cost control
model choice is the biggest lever on per-page cost. for extraction from cleaned HTML, claude-haiku-4-5 is almost always sufficient and costs roughly 25x less than claude-opus-4-5. reserve Sonnet for pages with complex or ambiguous structure, and Opus for tasks that require reasoning across multiple pages or inferring a schema:
MODEL_TIERS = {
    'simple': 'claude-haiku-4-5',
    'standard': 'claude-sonnet-4-5',
    'complex': 'claude-opus-4-5'
}
handling JavaScript-rendered pages with Playwright
from playwright.sync_api import sync_playwright

def scrape_js_page(url: str) -> str:
    # render the page in headless Chromium so client-side content exists
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until='networkidle')
        content = page.content()
        browser.close()
        return content

def extract_from_js_page(url: str, instruction: str) -> dict:
    html = scrape_js_page(url)
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup(['script', 'style']):
        tag.decompose()
    text = soup.get_text(separator='\n', strip=True)[:15000]
    message = client.messages.create(
        model='claude-haiku-4-5',
        max_tokens=2048,
        messages=[{'role': 'user', 'content': instruction + '\n\n' + text}]
    )
    return {'extracted': message.content[0].text}
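waiting for networkidle means waiting on every resource, including images and fonts Claude never sees. Playwright's page.route can abort those requests, which usually makes rendering noticeably faster; a sketch:

def scrape_js_page_fast(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # abort image and font requests; we only need the page text
        page.route('**/*.{png,jpg,jpeg,gif,webp,woff,woff2}', lambda route: route.abort())
        page.goto(url, wait_until='networkidle')
        content = page.content()
        browser.close()
        return content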
batch processing with cost tracking
import time

def batch_scrape(urls: list, instruction: str) -> list:
    results = []
    total_in = 0
    total_out = 0
    for i, url in enumerate(urls):
        print(f'processing {i+1}/{len(urls)}: {url}')
        try:
            result = scrape_and_extract(url, instruction)
            results.append(result)
            total_in += result['input_tokens']
            total_out += result['output_tokens']
        except Exception as e:
            results.append({'url': url, 'error': str(e)})
        time.sleep(1)  # polite delay between pages
    # the per-million-token rates below are illustrative; check current pricing
    # for whichever model you pass in before trusting this number
    cost = (total_in / 1_000_000 * 0.25) + (total_out / 1_000_000 * 1.25)
    print(f'estimated cost: ${cost:.4f}')
    return results
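at batch scale you will eventually hit API rate limits. the SDK raises anthropic.RateLimitError on 429 responses, so a thin backoff wrapper around scrape_and_extract keeps the batch moving; a sketch:

def scrape_with_backoff(url: str, instruction: str, max_attempts: int = 4) -> dict:
    # exponential backoff on rate limits; the SDK also retries internally,
    # but backing off here spaces out whole page attempts
    for attempt in range(max_attempts):
        try:
            return scrape_and_extract(url, instruction)
        except anthropic.RateLimitError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...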
combining with proxy rotation
Claude handles the extraction side. proxies handle the access side. for sites that block scrapers, route your requests through rotating residential or mobile proxies before passing content to Claude:
import random

def scrape_with_proxy(url: str, proxy_list: list) -> str:
    # pick a random proxy per request so repeated hits come from different IPs
    proxy = random.choice(proxy_list)
    proxies = {'http': proxy, 'https': proxy}
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers, proxies=proxies, timeout=20)
    return response.text
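to tie the two halves together, fetch through a proxy and reuse the same cleanup-and-extract steps from earlier; a sketch:

def scrape_and_extract_via_proxy(url: str, instruction: str, proxy_list: list) -> dict:
    html = scrape_with_proxy(url, proxy_list)
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup(['script', 'style', 'nav', 'footer', 'header', 'aside']):
        tag.decompose()
    text = soup.get_text(separator='\n', strip=True)[:15000]
    message = client.messages.create(
        model='claude-haiku-4-5',
        max_tokens=2048,
        messages=[{'role': 'user', 'content': instruction + '\n\npage content:\n' + text}]
    )
    return {'url': url, 'extracted': message.content[0].text}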
pairing AI-powered scraping with the right proxy infrastructure matters. try our dedicated Singapore mobile proxy for clean, unblocked connections on your next project.
see our guides on proxy servers and SOCKS5 vs HTTP proxy for connection format details. for JS-heavy sites that need browser-level proxy config, see our Playwright vs Puppeteer comparison which covers proxy configuration for both.
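for quick reference, these are the two URL schemes requests accepts in its proxies mapping; host and credentials below are placeholders, and SOCKS support needs an extra install (pip install requests[socks]):

# HTTP proxy format
proxies = {'http': 'http://user:pass@proxy-host:8080',
           'https': 'http://user:pass@proxy-host:8080'}

# SOCKS5 proxy format
proxies = {'http': 'socks5://user:pass@proxy-host:1080',
           'https': 'socks5://user:pass@proxy-host:1080'}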
last updated: April 3, 2026