Stateful AI Agents for Web Scraping: Memory, Context, and Adaptation
Most web scrapers are stateless: they run, collect data, and forget everything about the process. Every session starts from zero. They do not remember that a site changed its layout last week, that a particular proxy provider performs poorly against Cloudflare, or that a target page started requiring authentication.
Stateful AI agents change this dynamic. By maintaining memory across sessions, these agents learn from past scraping runs, adapt to site changes automatically, and make informed decisions about how to approach each target. The result is scrapers that improve over time rather than breaking and waiting for manual fixes.
This guide covers the architecture, implementation, and practical applications of stateful AI agents for web scraping.
What Makes an AI Agent Stateful
A stateless scraper executes the same logic on every run, regardless of what happened in previous runs. A stateful agent maintains several types of memory:
Short-Term Memory (Working Memory)
Information relevant to the current scraping session:
– which pages have been visited in this session
– the current proxy and its performance so far
– error patterns encountered during this run
– context from previously scraped pages that informs how to handle the next one
Long-Term Memory (Persistent Memory)
Information that persists across sessions:
– site-specific scraping strategies that worked in the past
– proxy performance history per target domain
– selector patterns that successfully extracted data
– rate limits and optimal request timing for each site
– authentication flows and session management patterns
Episodic Memory
Records of specific past events:
– “on March 3rd, site X changed its layout and selectors broke”
– “proxy provider Y started returning CAPTCHAs for domain Z on February 15th”
– “the last successful scrape of site W used a 3-second delay with residential proxies”
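These tiers come together whenever the agent prepares a run. A minimal sketch of the idea (the class and field names here are illustrative, not part of the implementation in this guide):

```python
from dataclasses import dataclass, field

@dataclass
class MemorySnapshot:
    """illustrative container for the three memory tiers."""
    working: dict = field(default_factory=dict)    # current-session state
    long_term: dict = field(default_factory=dict)  # per-domain strategies
    episodes: list = field(default_factory=list)   # dated event records

def preflight_context(memory: MemorySnapshot, domain: str) -> dict:
    """gather everything the agent knows about a domain before scraping it."""
    return {
        "profile": memory.long_term.get(domain, {}),
        "recent_events": [e for e in memory.episodes
                          if e.get("domain") == domain][-3:],
        "session": memory.working,
    }

snap = MemorySnapshot(
    long_term={"example.com": {"proxy": "residential", "delay": 3.0}},
    episodes=[{"domain": "example.com", "event": "layout change broke selectors"}],
)
context = preflight_context(snap, "example.com")
```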
Architecture
┌─────────────────────────────────────────────┐
│                AI Agent Core                │
│  ┌──────────┐ ┌───────────┐ ┌────────────┐  │
│  │ Planning │ │ Execution │ │ Reflection │  │
│  │  Module  │ │  Module   │ │   Module   │  │
│  └─────┬────┘ └─────┬─────┘ └──────┬─────┘  │
│        │            │              │        │
│  ┌─────▼────────────▼──────────────▼─────┐  │
│  │            Memory Manager             │  │
│  │┌─────────┐ ┌───────────┐ ┌───────────┐│  │
│  ││ Working │ │ Long-Term │ │ Episodic  ││  │
│  ││ Memory  │ │  Memory   │ │  Memory   ││  │
│  │└─────────┘ └───────────┘ └───────────┘│  │
│  └───────────────────────────────────────┘  │
└──────────────────────┬──────────────────────┘
                       │
          ┌────────────┼────────────┐
          │            │            │
    ┌─────▼─────┐ ┌────▼────┐ ┌─────▼─────┐
    │   Proxy   │ │ Browser │ │  Storage  │
    │  Manager  │ │ Engine  │ │   Layer   │
    └───────────┘ └─────────┘ └───────────┘
Step 1: Build the Memory System
The memory system is the foundation. Here is a practical implementation using SQLite for persistence (the LLM-powered analysis layer comes later, in Step 4):
# memory.py
import sqlite3
import json
import hashlib
from datetime import datetime, timedelta
from typing import Optional
class AgentMemory:
"""persistent memory system for a scraping agent."""
def __init__(self, db_path="agent_memory.db"):
self.conn = sqlite3.connect(db_path)
self.conn.row_factory = sqlite3.Row
self._create_tables()
def _create_tables(self):
self.conn.executescript("""
CREATE TABLE IF NOT EXISTS site_profiles (
domain TEXT PRIMARY KEY,
anti_bot_level TEXT,
best_proxy_type TEXT,
optimal_delay REAL,
selectors TEXT,
last_layout_change TEXT,
auth_required INTEGER DEFAULT 0,
notes TEXT,
updated_at TEXT
);
CREATE TABLE IF NOT EXISTS scraping_episodes (
id INTEGER PRIMARY KEY AUTOINCREMENT,
domain TEXT NOT NULL,
timestamp TEXT NOT NULL,
proxy_type TEXT,
proxy_provider TEXT,
success_rate REAL,
avg_response_time REAL,
pages_scraped INTEGER,
errors TEXT,
strategy_used TEXT,
outcome TEXT,
lessons TEXT
);
CREATE TABLE IF NOT EXISTS proxy_performance (
id INTEGER PRIMARY KEY AUTOINCREMENT,
proxy_provider TEXT NOT NULL,
proxy_type TEXT NOT NULL,
target_domain TEXT NOT NULL,
success_rate REAL,
avg_latency REAL,
block_rate REAL,
captcha_rate REAL,
tested_at TEXT,
sample_size INTEGER
);
CREATE TABLE IF NOT EXISTS selector_history (
id INTEGER PRIMARY KEY AUTOINCREMENT,
domain TEXT NOT NULL,
field_name TEXT NOT NULL,
selector TEXT NOT NULL,
success_rate REAL,
first_seen TEXT,
last_working TEXT,
is_current INTEGER DEFAULT 1
);
CREATE TABLE IF NOT EXISTS working_memory (
key TEXT PRIMARY KEY,
value TEXT,
expires_at TEXT
);
""")
self.conn.commit()
# --- site profiles ---
def get_site_profile(self, domain):
"""retrieve stored knowledge about a domain."""
cursor = self.conn.execute(
"SELECT * FROM site_profiles WHERE domain = ?",
(domain,),
)
row = cursor.fetchone()
if row:
profile = dict(row)
if profile.get("selectors"):
profile["selectors"] = json.loads(profile["selectors"])
return profile
return None
def update_site_profile(self, domain, **kwargs):
"""update or create a site profile."""
existing = self.get_site_profile(domain)
if "selectors" in kwargs and isinstance(kwargs["selectors"], dict):
kwargs["selectors"] = json.dumps(kwargs["selectors"])
kwargs["updated_at"] = datetime.utcnow().isoformat()
if existing:
sets = ", ".join(f"{k} = ?" for k in kwargs)
values = list(kwargs.values()) + [domain]
self.conn.execute(
f"UPDATE site_profiles SET {sets} WHERE domain = ?",
values,
)
else:
kwargs["domain"] = domain
cols = ", ".join(kwargs.keys())
placeholders = ", ".join("?" * len(kwargs))
self.conn.execute(
f"INSERT INTO site_profiles ({cols}) VALUES ({placeholders})",
list(kwargs.values()),
)
self.conn.commit()
# --- episodes ---
def record_episode(self, domain, **kwargs):
"""record a scraping episode for future reference."""
kwargs["domain"] = domain
kwargs["timestamp"] = datetime.utcnow().isoformat()
if "errors" in kwargs and isinstance(kwargs["errors"], list):
kwargs["errors"] = json.dumps(kwargs["errors"])
cols = ", ".join(kwargs.keys())
placeholders = ", ".join("?" * len(kwargs))
self.conn.execute(
f"INSERT INTO scraping_episodes ({cols}) VALUES ({placeholders})",
list(kwargs.values()),
)
self.conn.commit()
def get_recent_episodes(self, domain, limit=10):
"""get recent scraping episodes for a domain."""
cursor = self.conn.execute(
"""SELECT * FROM scraping_episodes
WHERE domain = ?
ORDER BY timestamp DESC
LIMIT ?""",
(domain, limit),
)
return [dict(row) for row in cursor.fetchall()]
# --- proxy performance ---
def record_proxy_performance(self, provider, proxy_type,
domain, **metrics):
"""record proxy performance data."""
self.conn.execute(
"""INSERT INTO proxy_performance
(proxy_provider, proxy_type, target_domain,
success_rate, avg_latency, block_rate,
captcha_rate, tested_at, sample_size)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)""",
(
provider, proxy_type, domain,
metrics.get("success_rate"),
metrics.get("avg_latency"),
metrics.get("block_rate"),
metrics.get("captcha_rate"),
datetime.utcnow().isoformat(),
metrics.get("sample_size", 0),
),
)
self.conn.commit()
def get_best_proxy_for_domain(self, domain):
"""find the best performing proxy config for a domain."""
cursor = self.conn.execute(
"""SELECT proxy_provider, proxy_type,
AVG(success_rate) as avg_success,
AVG(avg_latency) as avg_lat
FROM proxy_performance
WHERE target_domain = ?
AND tested_at > ?
GROUP BY proxy_provider, proxy_type
ORDER BY avg_success DESC, avg_lat ASC
LIMIT 1""",
(domain, (datetime.utcnow() - timedelta(days=30)).isoformat()),
)
row = cursor.fetchone()
return dict(row) if row else None
# --- working memory ---
def set_working(self, key, value, ttl_minutes=60):
"""set a working memory value with expiration."""
expires = datetime.utcnow() + timedelta(minutes=ttl_minutes)
self.conn.execute(
"""INSERT OR REPLACE INTO working_memory
(key, value, expires_at) VALUES (?, ?, ?)""",
(key, json.dumps(value), expires.isoformat()),
)
self.conn.commit()
def get_working(self, key):
"""get a working memory value if not expired."""
cursor = self.conn.execute(
"SELECT value, expires_at FROM working_memory WHERE key = ?",
(key,),
)
row = cursor.fetchone()
if row and row["expires_at"] > datetime.utcnow().isoformat():
return json.loads(row["value"])
return None
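One detail in `get_working` deserves a note: it compares ISO-8601 timestamp strings directly with `>`. That works because consistently formatted ISO-8601 strings sort lexicographically in chronological order, so as long as every row is written by the same code (same format, no mixed timezones), no parsing is needed:

```python
from datetime import datetime, timedelta

now = datetime(2024, 3, 3, 12, 0, 0)
earlier = (now - timedelta(minutes=90)).isoformat()
later = (now + timedelta(minutes=90)).isoformat()

# string order agrees with chronological order
assert earlier < now.isoformat() < later
```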
Step 2: Build the Planning Module
The planning module uses memory to decide how to approach a scraping task. It consults past episodes, proxy performance data, and site profiles to construct a strategy:
# planner.py
from memory import AgentMemory
class ScrapingPlanner:
"""plan scraping strategy based on memory."""
def __init__(self, memory: AgentMemory):
self.memory = memory
def plan(self, domain, target_urls, task_description=""):
"""create a scraping plan for a domain."""
# retrieve everything we know about this domain
profile = self.memory.get_site_profile(domain)
episodes = self.memory.get_recent_episodes(domain, limit=5)
best_proxy = self.memory.get_best_proxy_for_domain(domain)
plan = {
"domain": domain,
"target_urls": target_urls,
"strategy": {},
}
# determine proxy strategy
if best_proxy:
plan["strategy"]["proxy"] = {
"provider": best_proxy["proxy_provider"],
"type": best_proxy["proxy_type"],
"reason": f"historically best performer with "
f"{best_proxy['avg_success']:.0%} success rate",
}
elif profile and profile.get("best_proxy_type"):
plan["strategy"]["proxy"] = {
"type": profile["best_proxy_type"],
"reason": "from site profile",
}
else:
plan["strategy"]["proxy"] = {
"type": "datacenter",
"reason": "default for unknown domain, "
"will upgrade if blocked",
}
# determine request timing
if profile and profile.get("optimal_delay"):
plan["strategy"]["delay"] = profile["optimal_delay"]
        elif episodes:
            # past episodes do not record the delay that was used, so fall
            # back to a safe default until the site profile stores one
            plan["strategy"]["delay"] = 2.0
else:
plan["strategy"]["delay"] = 2.5 # conservative for unknown sites
# determine anti-bot level
if profile and profile.get("anti_bot_level"):
plan["strategy"]["anti_bot"] = profile["anti_bot_level"]
else:
plan["strategy"]["anti_bot"] = "unknown"
# check for recent failures
if episodes:
recent_failures = [
e for e in episodes[:3]
if e.get("success_rate", 1.0) < 0.5
]
if recent_failures:
plan["strategy"]["caution"] = True
plan["strategy"]["caution_reason"] = (
f"recent failures detected. "
f"last failure: {recent_failures[0].get('lessons', 'unknown cause')}"
)
# determine selectors
if profile and profile.get("selectors"):
plan["strategy"]["selectors"] = profile["selectors"]
else:
plan["strategy"]["selectors"] = None # will need discovery
# estimate effort
plan["estimated_pages"] = len(target_urls)
plan["estimated_time_minutes"] = (
len(target_urls) * plan["strategy"]["delay"] / 60
)
return plan
def adapt_plan(self, plan, current_results):
"""adapt the plan based on mid-run results."""
success_rate = current_results.get("success_rate", 1.0)
error_types = current_results.get("error_types", {})
adaptations = []
# if getting blocked, escalate proxy
if success_rate < 0.7:
if plan["strategy"]["proxy"]["type"] == "datacenter":
plan["strategy"]["proxy"]["type"] = "residential"
adaptations.append(
"upgraded to residential proxy due to low success rate"
)
# increase delay
plan["strategy"]["delay"] *= 1.5
adaptations.append(
f"increased delay to {plan['strategy']['delay']:.1f}s"
)
# if getting CAPTCHAs, switch to residential + add delay
if error_types.get("captcha", 0) > 3:
plan["strategy"]["proxy"]["type"] = "residential"
plan["strategy"]["delay"] = max(plan["strategy"]["delay"], 4.0)
adaptations.append(
"switched to residential proxy and increased delay "
"due to CAPTCHAs"
)
# if getting rate limited (429s)
if error_types.get("429", 0) > 2:
plan["strategy"]["delay"] *= 2.0
adaptations.append(
f"doubled delay to {plan['strategy']['delay']:.1f}s "
f"due to rate limiting"
)
return plan, adaptations
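The rules in `adapt_plan` amount to a small escalation ladder: upgrade the proxy tier first, then stretch delays. The same logic as a standalone pure function (thresholds copied from the code above):

```python
def escalate(proxy_type: str, delay: float, success_rate: float,
             captcha_count: int, rate_limit_count: int) -> tuple[str, float]:
    """apply the adaptation ladder: proxy upgrade, then longer delays."""
    if success_rate < 0.7:
        if proxy_type == "datacenter":
            proxy_type = "residential"  # blocks usually target datacenter IPs
        delay *= 1.5                    # back off regardless of proxy tier
    if captcha_count > 3:
        proxy_type = "residential"
        delay = max(delay, 4.0)         # CAPTCHAs demand a slower pace
    if rate_limit_count > 2:
        delay *= 2.0                    # explicit 429s double the delay
    return proxy_type, delay
```

Keeping the rules pure like this makes them trivial to unit test, which matters once the thresholds start being tuned per site.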
Step 3: Build the Execution Module
The execution module carries out the plan while feeding results back into the memory system:
# executor.py
import requests
import time
from datetime import datetime
from bs4 import BeautifulSoup
class ScrapingExecutor:
"""execute scraping plans with memory feedback."""
def __init__(self, memory, proxy_manager):
self.memory = memory
self.proxy_manager = proxy_manager
self.session = requests.Session()
def execute(self, plan):
"""execute a scraping plan."""
domain = plan["domain"]
strategy = plan["strategy"]
# configure proxy
proxy_url = self.proxy_manager.get_proxy(
proxy_type=strategy["proxy"]["type"],
provider=strategy["proxy"].get("provider"),
)
if proxy_url:
self.session.proxies = {
"http": proxy_url,
"https": proxy_url,
}
delay = strategy["delay"]
results = []
errors = []
error_types = {}
# store working memory for this session
self.memory.set_working(
f"session_{domain}",
{"started": datetime.utcnow().isoformat(), "pages": 0},
ttl_minutes=120,
)
for i, url in enumerate(plan["target_urls"]):
try:
start = time.time()
response = self.session.get(url, timeout=30)
elapsed = time.time() - start
if response.status_code == 200:
# extract data using known selectors or discovery
data = self._extract(
response.text, url, strategy.get("selectors")
)
results.append({
"url": url,
"data": data,
"response_time": elapsed,
"status": "success",
})
else:
error_type = str(response.status_code)
error_types[error_type] = error_types.get(error_type, 0) + 1
errors.append({
"url": url,
"status_code": response.status_code,
})
# check for CAPTCHA
if self._is_captcha(response.text):
error_types["captcha"] = error_types.get("captcha", 0) + 1
except Exception as e:
error_types["exception"] = error_types.get("exception", 0) + 1
errors.append({"url": url, "error": str(e)})
time.sleep(delay)
# periodically check if we need to adapt
if (i + 1) % 10 == 0:
current_rate = len(results) / (i + 1)
if current_rate < 0.5 and i < len(plan["target_urls"]) - 10:
                    print(f"low success rate ({current_rate:.0%}), flagging for adaptation...")
                    # the actual adaptation happens in agent.py via planner.adapt_plan
# record episode
success_rate = len(results) / len(plan["target_urls"]) if plan["target_urls"] else 0
avg_time = (
sum(r["response_time"] for r in results) / len(results)
if results else 0
)
self.memory.record_episode(
domain=domain,
proxy_type=strategy["proxy"]["type"],
proxy_provider=strategy["proxy"].get("provider", ""),
success_rate=success_rate,
avg_response_time=avg_time,
pages_scraped=len(results),
errors=errors[:20], # cap stored errors
strategy_used=str(strategy),
outcome="success" if success_rate > 0.8 else "partial" if success_rate > 0.3 else "failure",
lessons=self._derive_lessons(success_rate, error_types),
)
# update site profile
self.memory.update_site_profile(
domain,
optimal_delay=delay,
best_proxy_type=strategy["proxy"]["type"],
)
# record proxy performance
if strategy["proxy"].get("provider"):
self.memory.record_proxy_performance(
provider=strategy["proxy"]["provider"],
proxy_type=strategy["proxy"]["type"],
domain=domain,
success_rate=success_rate,
avg_latency=avg_time,
block_rate=error_types.get("403", 0) / max(len(plan["target_urls"]), 1),
captcha_rate=error_types.get("captcha", 0) / max(len(plan["target_urls"]), 1),
sample_size=len(plan["target_urls"]),
)
return {
"results": results,
"errors": errors,
"success_rate": success_rate,
"error_types": error_types,
}
def _extract(self, html, url, selectors=None):
"""extract data from HTML, discovering selectors if needed."""
soup = BeautifulSoup(html, "html.parser")
if selectors:
return self._extract_with_selectors(soup, selectors)
# fallback to generic extraction
title = soup.find("title")
return {
"title": title.text.strip() if title else "",
"url": url,
}
def _extract_with_selectors(self, soup, selectors):
"""extract using known selectors."""
data = {}
for field, selector_list in selectors.items():
if isinstance(selector_list, list):
for sel in selector_list:
el = soup.select_one(sel)
if el:
data[field] = el.get_text(strip=True)
break
elif isinstance(selector_list, str):
el = soup.select_one(selector_list)
if el:
data[field] = el.get_text(strip=True)
return data
def _is_captcha(self, html):
"""detect if a response is a CAPTCHA page."""
captcha_indicators = [
"captcha", "recaptcha", "hcaptcha",
"challenge-platform", "cf-challenge",
"please verify you are human",
]
html_lower = html.lower()
return any(indicator in html_lower for indicator in captcha_indicators)
def _derive_lessons(self, success_rate, error_types):
"""derive lessons from the run for future reference."""
lessons = []
if success_rate > 0.95:
lessons.append("strategy worked well, no changes needed")
if error_types.get("captcha", 0) > 0:
lessons.append("CAPTCHAs detected, consider residential proxies")
if error_types.get("429", 0) > 0:
lessons.append("rate limiting encountered, increase delay")
if error_types.get("403", 0) > 5:
lessons.append("frequent blocks, site may have upgraded anti-bot")
if success_rate < 0.3:
lessons.append("critical failure, major strategy change needed")
return "; ".join(lessons) if lessons else "no specific lessons"
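Two small refinements worth noting: the repeated `error_types.get(key, 0) + 1` bookkeeping in `execute` is exactly what `collections.Counter` provides, and the CAPTCHA check is a plain case-insensitive substring scan. A standalone sketch of both:

```python
from collections import Counter

CAPTCHA_MARKERS = ("captcha", "recaptcha", "hcaptcha",
                   "challenge-platform", "cf-challenge")

def is_captcha(html: str) -> bool:
    """cheap heuristic: any known challenge marker in the page source."""
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)

# Counter replaces the manual get/+1 pattern for error tallies
error_types = Counter()
for status in (403, 429, 429, 200):
    if status != 200:
        error_types[str(status)] += 1
```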
Step 4: Add LLM-Powered Reflection
The reflection module uses an LLM to analyze patterns across multiple episodes and generate strategic insights:
# reflection.py
import json
from openai import OpenAI
class AgentReflector:
"""use LLM to reflect on scraping performance and generate insights."""
def __init__(self, memory, model="gpt-4o-mini"):
self.memory = memory
self.client = OpenAI()
self.model = model
def reflect_on_domain(self, domain):
"""analyze recent performance for a domain and suggest improvements."""
episodes = self.memory.get_recent_episodes(domain, limit=20)
profile = self.memory.get_site_profile(domain)
if not episodes:
return "no episodes to analyze."
context = {
"domain": domain,
"profile": profile,
"recent_episodes": episodes,
}
prompt = f"""analyze these web scraping episodes for {domain} and provide actionable insights.
data:
{json.dumps(context, indent=2, default=str)}
provide:
1. overall assessment of scraping reliability
2. specific recommendations for proxy type, request timing, and error handling
3. patterns or trends in the data (improving, degrading, etc.)
4. any anomalies that need investigation
be specific and actionable. reference actual data from the episodes."""
response = self.client.chat.completions.create(
model=self.model,
messages=[
{"role": "system", "content": "you are a web scraping optimization expert."},
{"role": "user", "content": prompt},
],
max_tokens=1000,
)
return response.choices[0].message.content
def generate_strategy_update(self, domain, current_issues):
"""generate updated scraping strategy based on current issues."""
profile = self.memory.get_site_profile(domain)
proxy_data = self.memory.get_best_proxy_for_domain(domain)
prompt = f"""a web scraper for {domain} is experiencing these issues:
{json.dumps(current_issues, indent=2)}
current profile: {json.dumps(profile, indent=2, default=str)}
best proxy data: {json.dumps(proxy_data, indent=2, default=str) if proxy_data else 'none'}
suggest a specific strategy update. output as JSON with these fields:
- proxy_type: string
- delay_seconds: float
- max_retries: int
- headers_strategy: string
- additional_recommendations: list of strings"""
response = self.client.chat.completions.create(
model=self.model,
messages=[
{"role": "system", "content": "you are a web scraping expert. respond only with valid JSON."},
{"role": "user", "content": prompt},
],
max_tokens=500,
)
try:
return json.loads(response.choices[0].message.content)
except json.JSONDecodeError:
return {"error": "could not parse LLM response"}
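The bare `json.loads` can fail even on good output: models sometimes wrap JSON in markdown fences despite the system prompt. A defensive parser (an illustrative helper, not part of the OpenAI SDK) is cheap insurance:

```python
import json
import re

def parse_llm_json(text: str):
    """parse JSON from an LLM reply, tolerating ```json fences around it."""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    candidate = match.group(1) if match else text.strip()
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None
```

On models that support it, requesting structured output (for example, the `response_format={"type": "json_object"}` option in the Chat Completions API) reduces how often this fallback is needed.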
Step 5: Tie It All Together
The main agent class coordinates planning, execution, and reflection:
# agent.py
from memory import AgentMemory
from planner import ScrapingPlanner
from executor import ScrapingExecutor
from reflection import AgentReflector
class StatefulScrapingAgent:
"""an AI agent that learns and adapts across scraping sessions."""
def __init__(self, db_path="agent_memory.db", proxy_manager=None):
self.memory = AgentMemory(db_path)
self.planner = ScrapingPlanner(self.memory)
self.executor = ScrapingExecutor(self.memory, proxy_manager)
self.reflector = AgentReflector(self.memory)
def scrape(self, domain, urls, task=""):
"""scrape with full planning, execution, and learning cycle."""
# plan
print(f"planning scraping strategy for {domain}...")
plan = self.planner.plan(domain, urls, task)
print(f" proxy: {plan['strategy']['proxy']['type']} "
f"({plan['strategy']['proxy']['reason']})")
print(f" delay: {plan['strategy']['delay']}s")
print(f" estimated time: {plan['estimated_time_minutes']:.0f} min")
# execute
print(f"\nexecuting scrape of {len(urls)} URLs...")
results = self.executor.execute(plan)
print(f" success rate: {results['success_rate']:.0%}")
print(f" pages scraped: {len(results['results'])}")
# adapt if needed during execution
if results["success_rate"] < 0.7:
print("\nlow success rate, adapting strategy...")
plan, adaptations = self.planner.adapt_plan(plan, results)
for a in adaptations:
print(f" adaptation: {a}")
# re-run failed URLs with adapted strategy
failed_urls = [
e["url"] for e in results["errors"]
if "url" in e
]
if failed_urls:
retry_results = self.executor.execute({
**plan,
"target_urls": failed_urls,
})
results["results"].extend(retry_results["results"])
# reflect
if len(self.memory.get_recent_episodes(domain)) >= 5:
print("\nreflecting on recent performance...")
insights = self.reflector.reflect_on_domain(domain)
print(f" insights: {insights[:200]}...")
return results
# usage
if __name__ == "__main__":
from proxy_manager import ProxyManager
agent = StatefulScrapingAgent(
proxy_manager=ProxyManager(config_path="proxies.json"),
)
results = agent.scrape(
domain="example.com",
urls=["https://example.com/page1", "https://example.com/page2"],
task="collect product listings",
)
Practical Use Cases
Monitoring Price Changes
A stateful agent remembers what prices looked like last time. It can detect when a site changes its price display format and adapt its selectors automatically, rather than silently returning wrong data.
Multi-Site Aggregation
When scraping 50 different sites for the same type of data, a stateful agent learns the optimal configuration for each site individually. Site A might need residential proxies with a 3-second delay, while site B works fine with datacenter proxies and no delay.
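The lookup pattern behind this is simple: global defaults plus learned per-site overrides. A minimal sketch (the domain names and values are illustrative):

```python
SITE_DEFAULTS = {"proxy": "datacenter", "delay": 0.0}

# learned per-site overrides; unseen domains fall through to the defaults
SITE_OVERRIDES = {
    "site-a.example": {"proxy": "residential", "delay": 3.0},
}

def config_for(domain: str) -> dict:
    """merge global defaults with whatever the agent has learned for a site."""
    return {**SITE_DEFAULTS, **SITE_OVERRIDES.get(domain, {})}
```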
Long-Running Research Projects
For research projects that scrape the same sources weekly or monthly over a long period, the agent builds institutional knowledge about each source: when sites typically go down for maintenance, when they update their layouts, and which proxy configurations have the best track record.
Limitations and Considerations
Memory bloat. Over time, the memory database can grow large. Implement a retention policy that archives old episodes and removes expired working-memory entries.
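A retention pass can be a single scheduled pair of queries against the schema from Step 1. A sketch (the 90-day window is an assumption; tune it to your needs):

```python
import sqlite3
from datetime import datetime, timedelta

def prune_memory(conn: sqlite3.Connection, keep_days: int = 90) -> int:
    """delete episodes past the retention window and expired working memory."""
    cutoff = (datetime.utcnow() - timedelta(days=keep_days)).isoformat()
    removed = conn.execute(
        "DELETE FROM scraping_episodes WHERE timestamp < ?", (cutoff,)
    ).rowcount
    conn.execute(
        "DELETE FROM working_memory WHERE expires_at < ?",
        (datetime.utcnow().isoformat(),),
    )
    conn.commit()
    return removed

# demo against a throwaway in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scraping_episodes (domain TEXT, timestamp TEXT)")
conn.execute("CREATE TABLE working_memory (key TEXT, value TEXT, expires_at TEXT)")
stale = (datetime.utcnow() - timedelta(days=120)).isoformat()
fresh = datetime.utcnow().isoformat()
conn.execute("INSERT INTO scraping_episodes VALUES ('a.example', ?)", (stale,))
conn.execute("INSERT INTO scraping_episodes VALUES ('a.example', ?)", (fresh,))
removed = prune_memory(conn)
```

Archiving instead of deleting is a one-line change: copy the matching rows into an `archive` table before the `DELETE`.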
LLM costs. The reflection module calls an LLM for analysis. For cost efficiency, run reflections periodically (say, weekly) rather than after every scraping session.
Cold start. A new agent with no memory starts from defaults. Seed the memory with known configurations for your most important target sites to shorten the learning period.
Overfitting. If a site temporarily relaxes its rate limits during low-traffic hours, the agent might learn an overly aggressive strategy that fails during peak hours. Include time-of-day context in your episodic memory.
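One way to capture that context is to tag every episode with a coarse time-of-day bucket, so strategies learned at 3 a.m. are not blindly reused at noon. The bucket boundaries below are an assumption, not a standard:

```python
from datetime import datetime

def traffic_bucket(ts: datetime) -> str:
    """coarse time-of-day bucket to attach to episodic memory records."""
    if ts.hour < 6:
        return "overnight"  # typically the lowest-traffic window
    if ts.hour < 18:
        return "daytime"
    return "evening"
```

Storing this alongside each episode lets the planner filter history to episodes from the same bucket as the upcoming run.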
Conclusion
Stateful AI agents represent a fundamental shift in how we approach web scraping. Instead of building brittle, static scrapers that break when sites change, you build agents that learn, adapt, and improve. The memory system turns every scraping run into training data that makes the next run better.
The implementation above totals about 600 lines of Python. The key insight is that the memory system does not need to be complex: a simple SQLite database with site profiles, episode logs, and proxy performance data delivers most of the benefit. The LLM-powered reflection layer adds strategic reasoning on top but is optional for many use cases.
Start by adding basic episode logging to your existing scrapers. Once you have a few weeks of data, the patterns will be obvious, and you will see exactly where stateful reasoning would have saved you time.