How to Use Proxies with OpenAI Agents SDK for Web Research

OpenAI’s Agents SDK has given developers a structured framework for building autonomous AI agents that can reason, use tools, and interact with external services. One of the most powerful capabilities these agents can have is web browsing — the ability to search the internet, read pages, and extract information in real time.

The problem is that any agent performing web research at scale runs into the same barriers that plague traditional web scraping: IP blocks, rate limits, CAPTCHAs, and geographic restrictions. Without proxy support, a research agent’s effectiveness degrades rapidly as target websites detect and block its requests.

This guide shows you exactly how to integrate mobile proxies with OpenAI’s Agents SDK, build robust web research tools, and manage the rate limiting challenges that come with autonomous web browsing.

OpenAI Agents SDK Overview

The OpenAI Agents SDK provides a structured way to build agents with:

  • Tool use: Define custom functions the agent can call during its reasoning process.
  • Handoffs: Transfer control between specialized agents.
  • Guardrails: Apply input and output validation.
  • Tracing: Monitor agent execution for debugging and optimization.

For web research, the key component is tool use. You define web browsing functions as tools, and the agent decides when and how to use them based on the task at hand.

from agents import Agent, Runner, function_tool

@function_tool
def search_web(query: str) -> str:
    """Search the web for information."""
    # This is where proxy integration happens
    pass

@function_tool
def read_page(url: str) -> str:
    """Read and extract content from a web page."""
    pass

research_agent = Agent(
    name="Web Researcher",
    instructions=(
        "You are a web research assistant. Use the search_web tool "
        "to find relevant pages, then use read_page to extract "
        "detailed information. Synthesize findings into a clear report."
    ),
    tools=[search_web, read_page]
)

Adding Proxy Support for Web Tools

The proxy integration happens at the HTTP client level within your tool functions. Here is a complete implementation with proper proxy routing, error handling, and automatic retry logic.

Setting Up the Proxy Client

import httpx
from typing import Optional
import os

class ProxyClient:
    """HTTP client with proxy rotation and retry logic."""

    def __init__(
        self,
        proxy_url: Optional[str] = None,
        max_retries: int = 3
    ):
        self.proxy_url = proxy_url or os.getenv("PROXY_URL")
        self.max_retries = max_retries
        self._client = None
        self._request_count = 0
        self._rotation_interval = 10

    def _get_client(self) -> httpx.Client:
        """Get or create HTTP client, rotating proxy periodically."""
        if (
            self._client is None
            or self._request_count % self._rotation_interval == 0
        ):
            if self._client:
                self._client.close()
            self._client = httpx.Client(
                proxy=self.proxy_url,
                timeout=30.0,
                follow_redirects=True,
                headers={
                    "User-Agent": (
                        "Mozilla/5.0 (iPhone; CPU iPhone OS 17_4 like Mac OS X) "
                        "AppleWebKit/605.1.15 (KHTML, like Gecko) "
                        "Version/17.4 Mobile/15E148 Safari/604.1"
                    ),
                    "Accept": (
                        "text/html,application/xhtml+xml,"
                        "application/xml;q=0.9,*/*;q=0.8"
                    ),
                    "Accept-Language": "en-US,en;q=0.9",
                    "Accept-Encoding": "gzip, deflate, br"
                }
            )
        return self._client

    def get(self, url: str) -> httpx.Response:
        """Fetch URL with automatic retries and proxy rotation."""
        last_error = None

        for attempt in range(self.max_retries):
            try:
                client = self._get_client()
                self._request_count += 1
                response = client.get(url)

                if response.status_code in (429, 403):
                    # Rate limited or blocked - force proxy rotation and retry
                    self._request_count = 0
                    continue

                return response

            except Exception as e:
                last_error = e
                self._request_count = 0

        raise last_error or Exception("All retry attempts failed")

# Global proxy client instance
proxy_client = ProxyClient(
    proxy_url="http://user:pass@gateway.dataresearchtools.com:5000"
)
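One detail worth handling before you hardcode a proxy URL: if the username or password contains reserved URL characters, they must be percent-encoded or the URL will not parse. A small helper sketch (the function name `build_proxy_url` is illustrative, not part of the client above):

```python
from urllib.parse import quote

def build_proxy_url(user: str, password: str, host: str, port: int) -> str:
    """Assemble a proxy URL, percent-encoding credentials that may
    contain reserved characters such as '@' or ':'."""
    return (
        f"http://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}"
    )
```

For example, a password like `p@ss:1` is encoded as `p%40ss%3A1`, so the `@` that separates credentials from the host stays unambiguous.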

Building Proxy-Enabled Web Tools

Now wire the proxy client into the agent’s tool functions:

from bs4 import BeautifulSoup
from agents import function_tool
import json
import re

@function_tool
def search_web(query: str) -> str:
    """Search the web for a given query and return titles and URLs."""
    try:
        response = proxy_client.get(
            f"https://www.google.com/search?q={query}&num=10"
        )
        soup = BeautifulSoup(response.text, "html.parser")

        results = []
        # Google's result markup changes frequently; these selectors
        # may need updating over time.
        for g in soup.select("div.g"):
            title_el = g.select_one("h3")
            link_el = g.select_one("a[href]")
            snippet_el = g.select_one("div.VwiC3b")

            if title_el and link_el:
                results.append({
                    "title": title_el.get_text(),
                    "url": link_el["href"],
                    "snippet": (
                        snippet_el.get_text() if snippet_el else ""
                    )
                })

        return json.dumps(results[:10], indent=2)

    except Exception as e:
        return f"Search failed: {str(e)}"


@function_tool
def read_page(url: str) -> str:
    """Read a web page and return its main text content."""
    try:
        response = proxy_client.get(url)
        soup = BeautifulSoup(response.text, "html.parser")

        # Remove non-content elements
        for tag in soup(["script", "style", "nav", "footer",
                         "header", "aside", "iframe"]):
            tag.decompose()

        # Extract main content
        main = soup.find("main") or soup.find("article") or soup.body
        if not main:
            return "Could not extract page content."

        text = main.get_text(separator="\n", strip=True)

        # Clean and truncate
        lines = [line for line in text.splitlines() if line.strip()]
        clean_text = "\n".join(lines)

        # Truncate to ~4000 chars to fit in context window
        if len(clean_text) > 4000:
            clean_text = clean_text[:4000] + "\n\n[Content truncated...]"

        return clean_text

    except Exception as e:
        return f"Failed to read page: {str(e)}"


@function_tool
def search_news(query: str) -> str:
    """Search for recent news articles on a topic."""
    try:
        response = proxy_client.get(
            f"https://news.google.com/search?q={query}&hl=en-US&gl=US"
        )
        soup = BeautifulSoup(response.text, "html.parser")

        articles = []
        for article in soup.select("article")[:10]:
            title_el = article.select_one("a")
            time_el = article.select_one("time")
            if title_el:
                articles.append({
                    "title": title_el.get_text(strip=True),
                    # hrefs are relative ("./articles/..."), so strip the dot
                    "url": "https://news.google.com" + title_el.get("href", "").lstrip("."),
                    "time": (
                        time_el.get_text(strip=True)
                        if time_el else "Unknown"
                    )
                })

        return json.dumps(articles, indent=2)

    except Exception as e:
        return f"News search failed: {str(e)}"

Building the Complete Research Agent

With proxy-enabled tools in place, you can build a research agent that autonomously investigates topics:

from agents import Agent, Runner
import asyncio

# Define the research agent
research_agent = Agent(
    name="Research Agent",
    model="gpt-4",
    instructions="""You are an expert web researcher. Your task is to
    thoroughly research topics by:

    1. Searching for relevant information using search_web
    2. Reading the most promising pages using read_page
    3. Cross-referencing information across multiple sources
    4. Synthesizing findings into a comprehensive report

    Always cite your sources with URLs. Read at least 3-5 pages before
    forming conclusions. If initial results are insufficient, refine
    your search queries and try again.

    Structure your final report with:
    - Executive Summary
    - Key Findings (with source citations)
    - Analysis
    - Recommendations (if applicable)
    """,
    tools=[search_web, read_page, search_news]
)

async def run_research(topic: str) -> str:
    result = await Runner.run(
        research_agent,
        input=f"Research the following topic thoroughly: {topic}"
    )
    return result.final_output

# Run the agent
report = asyncio.run(
    run_research("Impact of EU AI Act on web scraping practices 2026")
)
print(report)

Managing Rate Limits in Agent Workflows

AI agents present unique rate limiting challenges because their request patterns are unpredictable. A research agent might send 3 requests in quick succession, pause for 30 seconds while reasoning, then send 10 more. This burstiness triggers rate limiters designed for steady request flows.

Token Bucket Rate Limiter

Implement a token bucket that smooths out request bursts:

import time
import threading

class TokenBucket:
    def __init__(self, rate: float, capacity: int):
        self.rate = rate  # tokens per second
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        with self.lock:
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(
                self.capacity,
                self.tokens + elapsed * self.rate
            )
            self.last_refill = now

            if self.tokens < 1:
                # Sleep until one token accrues. Sleeping while holding
                # the lock intentionally serializes concurrent callers.
                wait_time = (1 - self.tokens) / self.rate
                time.sleep(wait_time)
                self.tokens = 0
            else:
                self.tokens -= 1

# Limit to 2 requests per second with burst capacity of 5
rate_limiter = TokenBucket(rate=2.0, capacity=5)

# Integrate into proxy client
class RateLimitedProxyClient(ProxyClient):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.limiter = TokenBucket(rate=2.0, capacity=5)

    def get(self, url: str) -> httpx.Response:
        self.limiter.acquire()
        return super().get(url)
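The refill arithmetic is worth a sanity check: at rate=2.0 tokens per second and capacity=5, a fully drained bucket is full again after 2.5 seconds. The formula used inside acquire can be checked standalone (an illustrative helper, not part of the classes above):

```python
def tokens_after(elapsed: float, rate: float, capacity: float,
                 start: float = 0.0) -> float:
    """Token count after `elapsed` seconds of refill at `rate`
    tokens/second, capped at `capacity` (mirrors the math in acquire)."""
    return min(capacity, start + elapsed * rate)
```

This also shows why bursts are bounded: no matter how long the agent pauses to reason, the bucket never holds more than `capacity` tokens, so the next burst is capped at 5 back-to-back requests.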

Domain-Specific Rate Limiting

Different sites have different tolerance levels. Apply per-domain rate limits to stay under each site’s threshold:

from urllib.parse import urlparse

class DomainRateLimiter:
    def __init__(self):
        self.domain_limiters = {}
        self.default_rate = 1.0  # 1 request per second

        # Custom rates for known domains
        self.custom_rates = {
            "google.com": 0.5,
            "linkedin.com": 0.3,
            "twitter.com": 0.2,
            "reddit.com": 1.0,
            "github.com": 2.0
        }

    def get_limiter(self, url: str) -> TokenBucket:
        domain = urlparse(url).netloc.removeprefix("www.")

        if domain not in self.domain_limiters:
            rate = self.custom_rates.get(domain, self.default_rate)
            self.domain_limiters[domain] = TokenBucket(
                rate=rate, capacity=3
            )

        return self.domain_limiters[domain]

    def wait(self, url: str):
        limiter = self.get_limiter(url)
        limiter.acquire()
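The same idea can be sketched without a token bucket at all: a minimal per-domain pacer that records the last request time per normalized hostname and sleeps out the remainder of the interval. The names here (`domain_key`, `wait_for_domain`) are illustrative, not part of any SDK:

```python
import time
from urllib.parse import urlparse

# Last request timestamp per domain (illustrative module-level state)
_last_request: dict[str, float] = {}

def domain_key(url: str) -> str:
    """Normalize a URL to the hostname key used for rate limiting."""
    return urlparse(url).netloc.removeprefix("www.")

def wait_for_domain(url: str, min_interval: float = 1.0) -> None:
    """Sleep just long enough to keep at least `min_interval` seconds
    between consecutive requests to the same domain."""
    domain = domain_key(url)
    elapsed = time.monotonic() - _last_request.get(domain, 0.0)
    if elapsed < min_interval:
        time.sleep(min_interval - elapsed)
    _last_request[domain] = time.monotonic()
```

This trades the token bucket's burst allowance for simplicity; use it when a strict fixed interval per domain is acceptable.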

Building a Multi-Agent Research System

For complex research tasks, use multiple specialized agents with handoffs:

from agents import Agent, Runner

# Specialized search agent
search_agent = Agent(
    name="Search Specialist",
    instructions="""You specialize in finding relevant web pages.
    Given a research question, generate multiple search queries
    covering different angles. Use search_web for each query and
    compile a list of the most relevant URLs with brief descriptions.""",
    tools=[search_web, search_news]
)

# Deep reading agent
reader_agent = Agent(
    name="Deep Reader",
    instructions="""You specialize in extracting key information
    from web pages. Given a URL and a research question, read the
    page and extract all relevant facts, statistics, quotes, and
    data points. Be thorough and precise.""",
    tools=[read_page]
)

# Synthesis agent (no web tools needed)
synthesis_agent = Agent(
    name="Research Synthesizer",
    instructions="""You synthesize information from multiple sources
    into a coherent research report. Organize findings thematically,
    identify consensus and contradictions across sources, and
    highlight key insights. Always cite sources.""",
    tools=[]
)

# Orchestrator agent that hands off to specialists
orchestrator = Agent(
    name="Research Orchestrator",
    instructions="""You coordinate research by delegating to
    specialized agents. For any research task:
    1. Hand off to Search Specialist to find relevant pages
    2. Hand off to Deep Reader for each important page
    3. Hand off to Research Synthesizer for the final report""",
    handoffs=[search_agent, reader_agent, synthesis_agent]
)

Production Considerations

Logging and Monitoring

Track every web request your agents make for debugging and cost management:

import logging
from datetime import datetime

class RequestLogger:
    def __init__(self):
        self.logger = logging.getLogger("agent_requests")
        self.stats = {
            "total_requests": 0,
            "successful": 0,
            "failed": 0,
            "bytes_transferred": 0
        }

    def log_request(self, url, status_code, response_size, duration):
        self.stats["total_requests"] += 1
        if 200 <= status_code < 400:
            self.stats["successful"] += 1
        else:
            self.stats["failed"] += 1
        self.stats["bytes_transferred"] += response_size

        self.logger.info(
            f"URL={url} status={status_code} "
            f"size={response_size} duration={duration:.2f}s"
        )
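At the end of a research run, the stats dict can be rolled up into a one-line summary for dashboards or alerts. A small illustrative helper (not part of the logger class above):

```python
def summarize(stats: dict) -> str:
    """Render the request-stats dict as a human-readable summary line."""
    total = stats["total_requests"]
    if total == 0:
        return "no requests logged"
    success_rate = stats["successful"] / total * 100
    mb = stats["bytes_transferred"] / 1_000_000
    return f"{total} requests, {success_rate:.1f}% success, {mb:.2f} MB transferred"
```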

Caching

Agents often revisit the same pages during a research session. Implement a request cache to avoid redundant proxy usage:

import hashlib
import time
from typing import Optional

class PageCache:
    def __init__(self, max_size: int = 1000, ttl_seconds: int = 3600):
        self.cache = {}
        self.max_size = max_size
        self.ttl = ttl_seconds

    def get(self, url: str) -> Optional[str]:
        key = hashlib.md5(url.encode()).hexdigest()
        if key in self.cache:
            content, timestamp = self.cache[key]
            if time.time() - timestamp < self.ttl:
                return content
            del self.cache[key]
        return None

    def set(self, url: str, content: str):
        if len(self.cache) >= self.max_size:
            # Evict oldest entry
            oldest = min(self.cache, key=lambda k: self.cache[k][1])
            del self.cache[oldest]

        key = hashlib.md5(url.encode()).hexdigest()
        self.cache[key] = (content, time.time())
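Wiring a cache in front of the fetch path can be sketched as a wrapper that only calls through to the network on a miss. This standalone version (names are illustrative) accepts any fetch callable, so it would work with `proxy_client.get` or a plain HTTP function:

```python
import time
from typing import Callable

def make_cached_fetch(fetch: Callable[[str], str],
                      ttl_seconds: float = 3600.0) -> Callable[[str], str]:
    """Wrap `fetch` so repeat URLs within the TTL are served from
    memory instead of triggering another proxied request."""
    cache: dict[str, tuple[str, float]] = {}

    def cached_fetch(url: str) -> str:
        entry = cache.get(url)
        if entry is not None:
            content, stored_at = entry
            if time.time() - stored_at < ttl_seconds:
                return content  # cache hit: no network request
        content = fetch(url)
        cache[url] = (content, time.time())
        return content

    return cached_fetch
```

Every cache hit is a proxy request you did not pay for, which matters when agents re-read the same sources while reasoning.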

Why Mobile Proxies Are Best for Agent Workloads

AI agents have browsing patterns that look more suspicious to anti-bot systems than traditional scrapers. They visit many different domains, follow unpredictable paths, and make bursty requests. This makes proxy quality critical.

Mobile proxies outperform other proxy types for agent workloads because:

  1. Carrier-grade NAT means the same IP serves thousands of real mobile users simultaneously. Blocking the IP would block all of them, so sites cannot afford to be aggressive.
  2. Natural traffic mixing. Your agent’s requests blend with legitimate mobile traffic from the same IP, making them much harder to single out.
  3. High trust scores. Mobile IPs consistently receive the highest trust scores from anti-bot services such as Cloudflare and PerimeterX.

For agents that need to access ecommerce sites, social media platforms, or search engines for SEO research, mobile proxies provide the reliability needed for autonomous operation. Learn more about proxy fundamentals in our proxy glossary or explore specialized AI data collection proxy configurations for advanced agent setups.


last updated: April 3, 2026
