Claude AI + Python Web Scraping: Complete Workflow Guide

TL;DR
Claude’s API makes it practical to add AI-powered extraction and classification to Python scraping pipelines. this guide covers the complete workflow: scrape with requests/playwright, extract structure with Claude, handle large content efficiently, and keep costs low with the right model selection.

why Claude works well for scraping pipelines

Claude handles HTML extraction well for two reasons: it has a large context window (200K tokens on current models) that can process full page content without chunking, and its instruction-following is reliable enough for structured data extraction without extensive prompt engineering. you can send messy HTML and get clean JSON back consistently.

the practical use case is handling sites where selectors break frequently — job boards, e-commerce sites, news sites that redesign every 18 months. instead of maintaining fragile CSS selectors, you describe what you want in plain English and let Claude figure out where it is on the page.
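the "plain English" description works best when it pins down the output schema explicitly. a sketch of the kind of instruction that holds up well in practice (the wording here is illustrative, not a required format):

```python
# an example extraction instruction -- being explicit about the output
# schema makes downstream parsing far more reliable than a loose request.
JOB_INSTRUCTION = (
    'extract every job posting from the page content below. '
    'return ONLY a JSON array, no prose before or after it. each element '
    'must have exactly two fields: "title" (string) and '
    '"location" (string, or null if not listed on the page).'
)
```

the "ONLY a JSON array, no prose" clause is doing real work: it suppresses the conversational preamble the model otherwise tends to add.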

basic setup

pip install anthropic requests beautifulsoup4 playwright
playwright install chromium

the core scrape-extract pattern

import anthropic
import requests
from bs4 import BeautifulSoup

client = anthropic.Anthropic()  # reads the ANTHROPIC_API_KEY environment variable

def scrape_and_extract(url: str, extraction_instruction: str, model: str = 'claude-haiku-4-5') -> dict:
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
    response = requests.get(url, headers=headers, timeout=15)
    response.raise_for_status()  # fail fast on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(response.text, 'html.parser')
    for tag in soup(['script', 'style', 'nav', 'footer', 'header', 'aside']):
        tag.decompose()
    clean_text = soup.get_text(separator='\n', strip=True)
    message = client.messages.create(
        model=model,
        max_tokens=2048,
        messages=[{
            'role': 'user',
            'content': extraction_instruction + '\n\npage content:\n' + clean_text[:15000]  # truncate to cap token spend
        }]
    )
    return {
        'url': url,
        'extracted': message.content[0].text,
        'input_tokens': message.usage.input_tokens,
        'output_tokens': message.usage.output_tokens
    }

result = scrape_and_extract(
    'https://example.com/jobs',
    'extract all job titles and locations as a JSON array with title and location fields'
)
print(result['extracted'])
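scrape_and_extract returns the model's raw text, and even with a clear instruction Claude occasionally wraps the JSON in a markdown code fence. a small tolerant parser is worth having — this helper is my own addition, not part of the anthropic SDK:

```python
import json

def parse_extracted(text: str):
    """Parse a model response as JSON, tolerating optional markdown code fences."""
    cleaned = text.strip()
    if cleaned.startswith('```'):
        cleaned = cleaned.split('\n', 1)[-1]   # drop the ```json opening line
        cleaned = cleaned.rsplit('```', 1)[0]  # drop the closing fence
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None  # caller decides whether to retry, log, or skip the page
```

returning None on a parse failure (rather than raising) keeps a batch run alive when one page produces unparseable output.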

model selection for cost control

Claude model selection has a big impact on per-page cost. for extraction tasks on clean HTML, claude-haiku-4-5 is almost always sufficient and costs a small fraction of claude-opus-4-5 per token (check the current pricing page for exact rates — the gap between tiers shifts with each model generation). use Sonnet when the structure is complex or ambiguous. only use Opus for tasks that require reasoning across multiple pages or complex schema inference:

MODEL_TIERS = {
    'simple': 'claude-haiku-4-5',
    'standard': 'claude-sonnet-4-5',
    'complex': 'claude-opus-4-5'
}
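a tiny router keeps the tier decision out of the call sites. a minimal sketch (MODEL_TIERS is restated so the snippet stands alone; defaulting unknown tiers to Haiku is a choice, not a requirement):

```python
MODEL_TIERS = {
    'simple': 'claude-haiku-4-5',
    'standard': 'claude-sonnet-4-5',
    'complex': 'claude-opus-4-5',
}

def pick_model(tier: str) -> str:
    """Map a task complexity tier to a model ID, falling back to the cheapest."""
    return MODEL_TIERS.get(tier, MODEL_TIERS['simple'])

# route each page before calling scrape_and_extract:
# scrape_and_extract(url, instruction, model=pick_model('standard'))
```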

handling JavaScript-rendered pages with Playwright

from playwright.sync_api import sync_playwright

def scrape_js_page(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until='networkidle')
        content = page.content()
        browser.close()
    return content

def extract_from_js_page(url: str, instruction: str) -> dict:
    html = scrape_js_page(url)
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup(['script', 'style']):
        tag.decompose()
    text = soup.get_text(separator='\n', strip=True)[:15000]
    message = client.messages.create(
        model='claude-haiku-4-5',
        max_tokens=2048,
        messages=[{'role': 'user', 'content': instruction + '\n\n' + text}]
    )
    return {'extracted': message.content[0].text}

batch processing with cost tracking

import time

def batch_scrape(urls: list, instruction: str) -> list:
    results = []
    total_in = 0
    total_out = 0
    for i, url in enumerate(urls):
        print(f'processing {i+1}/{len(urls)}: {url}')
        try:
            result = scrape_and_extract(url, instruction)
            results.append(result)
            total_in += result['input_tokens']
            total_out += result['output_tokens']
        except Exception as e:
            results.append({'url': url, 'error': str(e)})
        time.sleep(1)
    # example per-million-token rates for the Haiku tier -- verify against current pricing
    cost = (total_in / 1_000_000 * 0.25) + (total_out / 1_000_000 * 1.25)
    print(f'estimated cost: ${cost:.4f}')
    return results
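in long batch runs, transient failures (flaky pages, API rate limits) are routine. a generic retry wrapper with exponential backoff is one way to handle them — this sketch catches broad Exception for brevity; in practice you would narrow it to requests.RequestException and the SDK's rate-limit errors:

```python
import time

def with_retries(fn, max_attempts: int = 4, base_delay: float = 2.0):
    """Call fn(), retrying failures with exponential backoff (2s, 4s, 8s, ...)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts -- surface the last error
            time.sleep(base_delay * (2 ** attempt))

# usage inside the batch loop:
# result = with_retries(lambda: scrape_and_extract(url, instruction))
```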

combining with proxy rotation

Claude handles the extraction side. proxies handle the access side. for sites that block scrapers, route your requests through rotating residential or mobile proxies before passing content to Claude:

import random

def scrape_with_proxy(url: str, proxy_list: list) -> str:
    proxy = random.choice(proxy_list)
    proxies = {'http': proxy, 'https': proxy}
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers, proxies=proxies, timeout=20)
    return response.text
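a single randomly chosen proxy fails often enough that a failover loop usually pays off. a minimal sketch building on the same approach (the attempt count and random ordering are design choices, not requirements):

```python
import random
import requests

def scrape_with_failover(url: str, proxy_list: list, attempts: int = 3) -> str:
    """Try up to `attempts` distinct randomly chosen proxies before giving up."""
    last_error = None
    for proxy in random.sample(proxy_list, min(attempts, len(proxy_list))):
        try:
            response = requests.get(
                url,
                headers={'User-Agent': 'Mozilla/5.0'},
                proxies={'http': proxy, 'https': proxy},
                timeout=20,
            )
            response.raise_for_status()
            return response.text
        except requests.RequestException as exc:
            last_error = exc  # remember the failure, move to the next proxy
    raise last_error
```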

pairing AI-powered scraping with the right proxy infrastructure matters. try our dedicated Singapore mobile proxy for clean, unblocked connections on your next project.

see our guides on proxy servers and SOCKS5 vs HTTP proxy for connection format details. for JS-heavy sites that need browser-level proxy config, see our Playwright vs Puppeteer comparison which covers proxy configuration for both.

last updated: April 3, 2026
