How to Use Proxies with OpenAI Agents SDK for Web Research
OpenAI’s Agents SDK has given developers a structured framework for building autonomous AI agents that can reason, use tools, and interact with external services. One of the most powerful capabilities these agents can have is web browsing — the ability to search the internet, read pages, and extract information in real time.
The problem is that any agent performing web research at scale runs into the same barriers that plague traditional web scraping: IP blocks, rate limits, CAPTCHAs, and geographic restrictions. Without proxy support, a research agent’s effectiveness degrades rapidly as target websites detect and block its requests.
This guide shows you exactly how to integrate mobile proxies with OpenAI’s Agents SDK, build robust web research tools, and manage the rate limiting challenges that come with autonomous web browsing.
OpenAI Agents SDK Overview
The OpenAI Agents SDK provides a structured way to build agents with:
- Tool use: Define custom functions the agent can call during its reasoning process.
- Handoffs: Transfer control between specialized agents.
- Guardrails: Apply input and output validation.
- Tracing: Monitor agent execution for debugging and optimization.
For web research, the key component is tool use. You define web browsing functions as tools, and the agent decides when and how to use them based on the task at hand.
from agents import Agent, Runner, function_tool
@function_tool
def search_web(query: str) -> str:
"""Search the web for information."""
# This is where proxy integration happens
pass
@function_tool
def read_page(url: str) -> str:
"""Read and extract content from a web page."""
pass
research_agent = Agent(
name="Web Researcher",
instructions=(
"You are a web research assistant. Use the search_web tool "
"to find relevant pages, then use read_page to extract "
"detailed information. Synthesize findings into a clear report."
),
tools=[search_web, read_page]
)
Adding Proxy Support for Web Tools
The proxy integration happens at the HTTP client level within your tool functions. Here is a complete implementation with proper proxy routing, error handling, and automatic retry logic.
Setting Up the Proxy Client
import httpx
from typing import Optional
import os
class ProxyClient:
"""HTTP client with proxy rotation and retry logic."""
def __init__(
self,
proxy_url: Optional[str] = None,
max_retries: int = 3
):
self.proxy_url = proxy_url or os.getenv("PROXY_URL")
self.max_retries = max_retries
self._client = None
self._request_count = 0
self._rotation_interval = 10
def _get_client(self) -> httpx.Client:
"""Get or create HTTP client, rotating proxy periodically."""
if (
self._client is None
or self._request_count % self._rotation_interval == 0
):
if self._client:
self._client.close()
self._client = httpx.Client(
proxy=self.proxy_url,
timeout=30.0,
follow_redirects=True,
headers={
"User-Agent": (
"Mozilla/5.0 (iPhone; CPU iPhone OS 17_4 like Mac OS X) "
"AppleWebKit/605.1.15 (KHTML, like Gecko) "
"Version/17.4 Mobile/15E148 Safari/604.1"
),
"Accept": (
"text/html,application/xhtml+xml,"
"application/xml;q=0.9,*/*;q=0.8"
),
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br"
}
)
return self._client
def get(self, url: str) -> httpx.Response:
"""Fetch URL with automatic retries and proxy rotation."""
last_error = None
for attempt in range(self.max_retries):
try:
client = self._get_client()
self._request_count += 1
response = client.get(url)
if response.status_code == 429:
# Rate limited - force proxy rotation
self._request_count = 0
continue
if response.status_code == 403:
# Blocked - rotate proxy and retry
self._request_count = 0
continue
return response
except Exception as e:
last_error = e
self._request_count = 0
raise last_error or Exception("All retry attempts failed")
# Global proxy client instance
proxy_client = ProxyClient(
proxy_url="http://user:pass@gateway.dataresearchtools.com:5000"
)
Building Proxy-Enabled Web Tools
Now wire the proxy client into the agent’s tool functions:
from bs4 import BeautifulSoup
from agents import function_tool
import json
from urllib.parse import quote_plus, urljoin
@function_tool
def search_web(query: str) -> str:
"""Search the web for a given query and return titles and URLs."""
try:
response = proxy_client.get(
f"https://www.google.com/search?q={query}&num=10"
)
soup = BeautifulSoup(response.text, "html.parser")
results = []
for g in soup.select("div.g"):
title_el = g.select_one("h3")
link_el = g.select_one("a[href]")
snippet_el = g.select_one("div.VwiC3b")
if title_el and link_el:
results.append({
"title": title_el.get_text(),
"url": link_el["href"],
"snippet": (
snippet_el.get_text() if snippet_el else ""
)
})
return json.dumps(results[:10], indent=2)
except Exception as e:
return f"Search failed: {str(e)}"
@function_tool
def read_page(url: str) -> str:
"""Read a web page and return its main text content."""
try:
response = proxy_client.get(url)
soup = BeautifulSoup(response.text, "html.parser")
# Remove non-content elements
for tag in soup(["script", "style", "nav", "footer",
"header", "aside", "iframe"]):
tag.decompose()
# Extract main content
main = soup.find("main") or soup.find("article") or soup.body
if not main:
return "Could not extract page content."
text = main.get_text(separator="\n", strip=True)
# Clean and truncate
lines = [line for line in text.splitlines() if line.strip()]
clean_text = "\n".join(lines)
# Truncate to ~4000 chars to fit in context window
if len(clean_text) > 4000:
clean_text = clean_text[:4000] + "\n\n[Content truncated...]"
return clean_text
except Exception as e:
return f"Failed to read page: {str(e)}"
@function_tool
def search_news(query: str) -> str:
"""Search for recent news articles on a topic."""
try:
response = proxy_client.get(
f"https://news.google.com/search?q={query}&hl=en-US&gl=US"
)
soup = BeautifulSoup(response.text, "html.parser")
articles = []
for article in soup.select("article")[:10]:
title_el = article.select_one("a")
time_el = article.select_one("time")
if title_el:
articles.append({
"title": title_el.get_text(strip=True),
"url": "https://news.google.com" + title_el.get("href", ""),
"time": (
time_el.get_text(strip=True)
if time_el else "Unknown"
)
})
return json.dumps(articles, indent=2)
except Exception as e:
return f"News search failed: {str(e)}"Building the Complete Research Agent
With proxy-enabled tools in place, you can build a research agent that autonomously investigates topics:
from agents import Agent, Runner
import asyncio
# Define the research agent
research_agent = Agent(
name="Research Agent",
model="gpt-4",
instructions="""You are an expert web researcher. Your task is to
thoroughly research topics by:
1. Searching for relevant information using search_web
2. Reading the most promising pages using read_page
3. Cross-referencing information across multiple sources
4. Synthesizing findings into a comprehensive report
Always cite your sources with URLs. Read at least 3-5 pages before
forming conclusions. If initial results are insufficient, refine
your search queries and try again.
Structure your final report with:
- Executive Summary
- Key Findings (with source citations)
- Analysis
- Recommendations (if applicable)
""",
tools=[search_web, read_page, search_news]
)
async def run_research(topic: str) -> str:
result = await Runner.run(
research_agent,
input=f"Research the following topic thoroughly: {topic}"
)
return result.final_output
# Run the agent
report = asyncio.run(
run_research("Impact of EU AI Act on web scraping practices 2026")
)
print(report)
Managing Rate Limits in Agent Workflows
AI agents present unique rate limiting challenges because their request patterns are unpredictable. A research agent might send 3 requests in quick succession, pause for 30 seconds while reasoning, then send 10 more. This burstiness triggers rate limiters designed for steady request flows.
Token Bucket Rate Limiter
Implement a token bucket that smooths out request bursts:
import time
import threading
class TokenBucket:
def __init__(self, rate: float, capacity: int):
self.rate = rate # tokens per second
self.capacity = capacity
self.tokens = capacity
self.last_refill = time.monotonic()
self.lock = threading.Lock()
def acquire(self):
with self.lock:
now = time.monotonic()
elapsed = now - self.last_refill
self.tokens = min(
self.capacity,
self.tokens + elapsed * self.rate
)
self.last_refill = now
if self.tokens < 1:
wait_time = (1 - self.tokens) / self.rate
time.sleep(wait_time)
self.tokens = 0
else:
self.tokens -= 1
# Limit to 2 requests per second with burst capacity of 5
rate_limiter = TokenBucket(rate=2.0, capacity=5)
# Integrate into proxy client
class RateLimitedProxyClient(ProxyClient):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.limiter = TokenBucket(rate=2.0, capacity=5)
def get(self, url: str) -> httpx.Response:
self.limiter.acquire()
        return super().get(url)
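Because the tool functions above read the module-level proxy_client at call time, rebinding that name to the rate-limited subclass applies the limit to every tool without further changes. A minimal sketch, assuming all of the snippets live in one module and reusing the placeholder gateway URL from earlier:
# Swap the global client: search_web, read_page, and search_news
# now pass through the token bucket automatically.
proxy_client = RateLimitedProxyClient(
    proxy_url="http://user:pass@gateway.dataresearchtools.com:5000"
)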
Domain-Specific Rate Limiting
Different sites have different tolerance levels. Apply per-domain rate limits to stay under each site’s threshold:
from collections import defaultdict
from urllib.parse import urlparse
class DomainRateLimiter:
def __init__(self):
self.domain_limiters = {}
self.default_rate = 1.0 # 1 request per second
# Custom rates for known domains
self.custom_rates = {
"google.com": 0.5,
"linkedin.com": 0.3,
"twitter.com": 0.2,
"reddit.com": 1.0,
"github.com": 2.0
}
def get_limiter(self, url: str) -> TokenBucket:
domain = urlparse(url).netloc.replace("www.", "")
if domain not in self.domain_limiters:
rate = self.custom_rates.get(domain, self.default_rate)
self.domain_limiters[domain] = TokenBucket(
rate=rate, capacity=3
)
return self.domain_limiters[domain]
def wait(self, url: str):
limiter = self.get_limiter(url)
        limiter.acquire()
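The limiter is not yet connected to the HTTP client. One way to wire it in is a small subclass that waits on the per-domain bucket before every request. This is a sketch that assumes the ProxyClient and DomainRateLimiter classes above are in scope; DomainAwareProxyClient is an illustrative name, not part of any SDK:
class DomainAwareProxyClient(ProxyClient):
    """ProxyClient that waits on a per-domain token bucket before each request."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.domain_limiter = DomainRateLimiter()
    def get(self, url: str) -> httpx.Response:
        # Block until the bucket for this URL's domain has a token available.
        self.domain_limiter.wait(url)
        return super().get(url)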
Building a Multi-Agent Research System
For complex research tasks, use multiple specialized agents with handoffs:
from agents import Agent, Runner
# Specialized search agent
search_agent = Agent(
name="Search Specialist",
instructions="""You specialize in finding relevant web pages.
Given a research question, generate multiple search queries
covering different angles. Use search_web for each query and
compile a list of the most relevant URLs with brief descriptions.""",
tools=[search_web, search_news]
)
# Deep reading agent
reader_agent = Agent(
name="Deep Reader",
instructions="""You specialize in extracting key information
from web pages. Given a URL and a research question, read the
page and extract all relevant facts, statistics, quotes, and
data points. Be thorough and precise.""",
tools=[read_page]
)
# Synthesis agent (no web tools needed)
synthesis_agent = Agent(
name="Research Synthesizer",
instructions="""You synthesize information from multiple sources
into a coherent research report. Organize findings thematically,
identify consensus and contradictions across sources, and
highlight key insights. Always cite sources.""",
tools=[]
)
# Orchestrator agent that hands off to specialists
orchestrator = Agent(
name="Research Orchestrator",
instructions="""You coordinate research by delegating to
specialized agents. For any research task:
1. Hand off to Search Specialist to find relevant pages
2. Hand off to Deep Reader for each important page
3. Hand off to Research Synthesizer for the final report""",
handoffs=[search_agent, reader_agent, synthesis_agent]
)
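Running the orchestrator works the same way as running the single research agent earlier. A minimal sketch; the topic string is only an example:
import asyncio
async def run_orchestrated_research(topic: str) -> str:
    result = await Runner.run(
        orchestrator,
        input=f"Research the following topic thoroughly: {topic}"
    )
    return result.final_output
report = asyncio.run(
    run_orchestrated_research("Proxy strategies for autonomous research agents")
)
print(report)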
Production Considerations
Logging and Monitoring
Track every web request your agents make for debugging and cost management:
import logging
from datetime import datetime
class RequestLogger:
def __init__(self):
self.logger = logging.getLogger("agent_requests")
self.stats = {
"total_requests": 0,
"successful": 0,
"failed": 0,
"bytes_transferred": 0
}
def log_request(self, url, status_code, response_size, duration):
self.stats["total_requests"] += 1
if 200 <= status_code < 400:
self.stats["successful"] += 1
else:
self.stats["failed"] += 1
self.stats["bytes_transferred"] += response_size
self.logger.info(
f"URL={url} status={status_code} "
f"size={response_size} duration={duration:.2f}s"
        )
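The logger only helps if every request actually passes through it. Below is one way to hook it into the proxy client, sketched under the assumption that the ProxyClient class from earlier is in scope; LoggedProxyClient is an illustrative name:
import logging
import time
logging.basicConfig(level=logging.INFO)  # make the INFO-level request log visible
request_logger = RequestLogger()
class LoggedProxyClient(ProxyClient):
    """ProxyClient that records every request through RequestLogger."""
    def get(self, url: str) -> httpx.Response:
        start = time.monotonic()
        response = super().get(url)
        request_logger.log_request(
            url=url,
            status_code=response.status_code,
            response_size=len(response.content),
            duration=time.monotonic() - start
        )
        return response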
Caching
Agents often revisit the same pages during a research session. Implement a request cache to avoid redundant proxy usage:
import hashlib
import time
from typing import Optional
class PageCache:
def __init__(self, max_size: int = 1000, ttl_seconds: int = 3600):
self.cache = {}
self.max_size = max_size
self.ttl = ttl_seconds
def get(self, url: str) -> Optional[str]:
key = hashlib.md5(url.encode()).hexdigest()
if key in self.cache:
content, timestamp = self.cache[key]
if time.time() - timestamp < self.ttl:
return content
del self.cache[key]
return None
def set(self, url: str, content: str):
if len(self.cache) >= self.max_size:
# Evict oldest entry
oldest = min(self.cache, key=lambda k: self.cache[k][1])
del self.cache[oldest]
key = hashlib.md5(url.encode()).hexdigest()
        self.cache[key] = (content, time.time())
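To use the cache, check it before going through the proxy and store successful responses afterwards. The sketch below wraps ProxyClient; CachedProxyClient is an illustrative name, and httpx.Response(200, text=...) hands cached HTML back to the existing tools unchanged:
class CachedProxyClient(ProxyClient):
    """ProxyClient that serves repeat GETs for the same URL from PageCache."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.page_cache = PageCache(max_size=500, ttl_seconds=1800)
    def get(self, url: str) -> httpx.Response:
        cached = self.page_cache.get(url)
        if cached is not None:
            # Serve the cached body without spending proxy bandwidth.
            return httpx.Response(status_code=200, text=cached)
        response = super().get(url)
        if response.status_code == 200:
            self.page_cache.set(url, response.text)
        return response
As with the rate-limited client, rebinding the global proxy_client to this class applies the cache to every tool.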
Why Mobile Proxies Are Best for Agent Workloads
AI agents browse in ways that look more suspicious to anti-bot systems than traditional scrapers do: they visit many different domains, follow unpredictable paths, and send requests in bursts. This makes proxy quality critical.
Mobile proxies outperform other proxy types for agent workloads because:
- Carrier-grade NAT: the same IP serves thousands of real mobile users simultaneously, so blocking it would cut them all off. Sites are therefore reluctant to issue blanket bans against mobile IPs.
- Natural traffic mixing: your agent’s requests blend with legitimate mobile traffic from the same IP, making them much harder to single out.
- High trust scores: mobile IPs consistently receive the highest trust scores from anti-bot services such as Cloudflare and PerimeterX.
For agents that need to access ecommerce sites, social media platforms, or search engines for SEO research, mobile proxies provide the reliability needed for autonomous operation. Learn more about proxy fundamentals in our proxy glossary or explore specialized AI data collection proxy configurations for advanced agent setups.
Related Reading
- Agentic Browsers Explained: Browserbase, Browser Use, and Proxy Infrastructure
- How AI Agents Use Proxies for Real-Time Web Data Collection in 2026
- Agentic Browsers Explained: The Future of AI + Proxies in 2026
- Mobile Proxies for AI Data Collection: Web Scraping for Training Data
- AI Web Scraper with Python: Build Your Own
- Best AI Web Scrapers 2026: Complete Comparison
last updated: April 3, 2026