Google ADK Web Scraping with Proxy Integration

Google’s Agent Development Kit (ADK) is an open-source framework for building AI agents that can use tools, make decisions, and execute multi-step workflows. When you combine ADK agents with web scraping capabilities and proxy infrastructure, you get AI-powered data collection systems that can adapt to different websites, handle errors intelligently, and extract data more effectively than traditional scrapers.

This guide shows you how to build web scraping tools for Google ADK agents and integrate proxy rotation for reliable, large-scale data collection.

What is Google ADK?

Google ADK (Agent Development Kit) is a Python framework that lets you build AI agents powered by Google’s Gemini models (or other LLMs). These agents can:

  • use custom tools you define
  • make decisions about which tools to call and in what order
  • handle multi-step workflows with error recovery
  • maintain context across interactions

For web scraping, this means you can build an agent that understands what data you need, figures out how to extract it from different website structures, and handles anti-bot challenges automatically.

Setting Up Google ADK

pip install google-adk curl-cffi beautifulsoup4 lxml

You’ll need a Google API key with access to Gemini models.

import os
os.environ["GOOGLE_API_KEY"] = "your-api-key-here"

Building a Web Scraping Tool for ADK

ADK agents work by calling tools. Here’s how to create a web scraping tool that an agent can use.

from google.adk.agents import Agent
from google.adk.tools import FunctionTool
from curl_cffi import requests
from bs4 import BeautifulSoup
import json
import random

# proxy configuration
PROXY_LIST = [
    "http://user:pass@residential1.proxy.com:port",
    "http://user:pass@residential2.proxy.com:port",
    "http://user:pass@residential3.proxy.com:port",
]

def get_random_proxy():
    """select a random proxy from the pool"""
    proxy = random.choice(PROXY_LIST)
    return {"http": proxy, "https": proxy}

def scrape_webpage(url: str, extract_type: str = "text") -> str:
    """
    scrape a webpage and return its content.

    args:
        url: the URL to scrape
        extract_type: what to extract - "text" for all text, "links" for all links,
                      "tables" for HTML tables, "structured" for title + headings + paragraphs

    returns:
        extracted content as a string
    """
    session = requests.Session(impersonate="chrome124")
    proxy = get_random_proxy()

    try:
        response = session.get(
            url,
            headers={
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
                "Accept-Language": "en-US,en;q=0.9",
            },
            proxies=proxy,
            timeout=30,
        )

        if response.status_code != 200:
            return json.dumps({"error": f"HTTP {response.status_code}", "url": url})

        soup = BeautifulSoup(response.text, "lxml")

        # remove script and style elements
        for element in soup(["script", "style", "nav", "footer", "header"]):
            element.decompose()

        if extract_type == "text":
            text = soup.get_text(separator="\n", strip=True)
            # limit to first 5000 chars to avoid context overflow
            return text[:5000]

        elif extract_type == "links":
            links = []
            for a in soup.find_all("a", href=True):
                links.append({
                    "text": a.get_text(strip=True),
                    "href": a["href"],
                })
            return json.dumps(links[:100])  # limit to 100 links

        elif extract_type == "tables":
            tables = []
            for table in soup.find_all("table"):
                rows = []
                for tr in table.find_all("tr"):
                    cells = [td.get_text(strip=True) for td in tr.find_all(["td", "th"])]
                    if cells:
                        rows.append(cells)
                tables.append(rows)
            return json.dumps(tables)

        elif extract_type == "structured":
            result = {
                "title": soup.title.get_text(strip=True) if soup.title else "",
                "headings": [h.get_text(strip=True) for h in soup.find_all(["h1", "h2", "h3"])],
                "paragraphs": [p.get_text(strip=True) for p in soup.find_all("p")][:20],
            }
            return json.dumps(result, indent=2)

        # fall back to an error message for any unrecognized extract_type
        return json.dumps({"error": f"unknown extract_type: {extract_type}", "url": url})

    except Exception as e:
        return json.dumps({"error": str(e), "url": url})

# create the ADK tool
scrape_tool = FunctionTool(func=scrape_webpage)
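
Before handing the tool to an agent, it can help to call the underlying function directly and confirm that the proxy and parsing behave as expected. The URL below is only a placeholder.

# quick sanity check of the tool function before wiring it into an agent
if __name__ == "__main__":
    print(scrape_webpage("https://example.com", extract_type="structured"))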

Building the Scraping Agent

Now create an ADK agent that uses the scraping tool intelligently.

from google.adk.agents import Agent

scraping_agent = Agent(
    name="web_scraper",
    model="gemini-2.0-flash",
    description="an AI agent that scrapes websites and extracts structured data",
    instruction="""you are a web scraping assistant. when the user asks you to collect data
    from websites, use the scrape_webpage tool to fetch and extract the information.

    guidelines:
    - start with "structured" extract_type to understand the page layout
    - use "tables" for data that appears in tabular format
    - use "links" when looking for URLs or navigation
    - use "text" as a fallback for general content
    - if a scrape fails, try again with a different approach
    - always summarize the data you found in a clear format
    - when scraping multiple pages, process them one at a time
    """,
    tools=[scrape_tool],
)

Adding Advanced Proxy Management

For production use, you need more sophisticated proxy management.

import time
import random
from collections import defaultdict
from typing import Optional

class ProxyManager:
    def __init__(self, proxies: list):
        self.proxies = proxies
        self.proxy_stats = defaultdict(lambda: {"success": 0, "fail": 0, "last_used": 0})
        self.cooldown = 5  # seconds between uses of same proxy

    def get_proxy(self) -> dict:
        """get the best available proxy based on success rate and cooldown"""
        now = time.time()
        available = []

        for proxy in self.proxies:
            stats = self.proxy_stats[proxy]
            time_since_use = now - stats["last_used"]

            if time_since_use < self.cooldown:
                continue

            total = stats["success"] + stats["fail"]
            success_rate = stats["success"] / total if total > 0 else 0.5

            available.append((proxy, success_rate))

        if not available:
            # all proxies on cooldown, wait and pick random
            time.sleep(self.cooldown)
            proxy = random.choice(self.proxies)
        else:
            # weighted random selection favoring higher success rates
            available.sort(key=lambda x: x[1], reverse=True)
            # 70% chance of picking top proxy, 30% random
            if random.random() < 0.7 and available:
                proxy = available[0][0]
            else:
                proxy = random.choice([p for p, _ in available])

        self.proxy_stats[proxy]["last_used"] = time.time()
        return {"http": proxy, "https": proxy}

    def report_result(self, proxy_url: str, success: bool):
        """report whether a proxy request succeeded"""
        key = "success" if success else "fail"
        self.proxy_stats[proxy_url][key] += 1

    def get_stats(self) -> dict:
        """get performance stats for all proxies"""
        stats = {}
        for proxy in self.proxies:
            s = self.proxy_stats[proxy]
            total = s["success"] + s["fail"]
            rate = s["success"] / total if total > 0 else 0
            stats[proxy.split("@")[1] if "@" in proxy else proxy] = {
                "total": total,
                "success_rate": f"{rate:.1%}",
            }
        return stats

# initialize the proxy manager
proxy_manager = ProxyManager(PROXY_LIST)
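
The manager is meant to be used in a fetch-then-report loop inside each tool: request a proxy, make the call, and report whether it worked so future selections favor healthy proxies. A minimal sketch of that pattern (example.com is a stand-in URL):

# sketch of the intended call pattern: fetch a proxy, make the request, report the outcome
proxy = proxy_manager.get_proxy()
proxy_url = list(proxy.values())[0]
try:
    resp = requests.Session(impersonate="chrome124").get(
        "https://example.com", proxies=proxy, timeout=30
    )
    proxy_manager.report_result(proxy_url, resp.status_code == 200)
except Exception:
    proxy_manager.report_result(proxy_url, False)

print(proxy_manager.get_stats())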

Creating a Multi-Tool Scraping Agent

Real-world scraping needs multiple tools. Here’s an agent with scraping, search, and data processing capabilities.

import csv
import io

def search_google(query: str, num_results: int = 5) -> str:
    """
    search Google and return result URLs.

    args:
        query: the search query
        num_results: number of results to return (max 10)

    returns:
        JSON list of search results with title and URL
    """
    session = requests.Session(impersonate="chrome124")
    proxy = proxy_manager.get_proxy()

    try:
        response = session.get(
            "https://www.google.com/search",
            params={"q": query, "num": min(num_results, 10), "hl": "en", "gl": "us"},
            headers={
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            },
            proxies=proxy,
            timeout=30,
        )

        soup = BeautifulSoup(response.text, "lxml")
        results = []

        for div in soup.select("div.g"):
            title = div.select_one("h3")
            link = div.select_one("a")
            snippet = div.select_one("div.VwiC3b")

            if title and link:
                results.append({
                    "title": title.get_text(strip=True),
                    "url": link.get("href", ""),
                    "snippet": snippet.get_text(strip=True) if snippet else "",
                })

        proxy_url = list(proxy.values())[0]
        proxy_manager.report_result(proxy_url, len(results) > 0)

        return json.dumps(results[:num_results])

    except Exception as e:
        return json.dumps({"error": str(e)})

def save_data(data: str, filename: str, format: str = "json") -> str:
    """
    save extracted data to a file.

    args:
        data: the data to save (JSON string)
        filename: output filename
        format: "json" or "csv"

    returns:
        confirmation message
    """
    try:
        parsed = json.loads(data)

        if format == "csv" and isinstance(parsed, list) and len(parsed) > 0:
            output = io.StringIO()
            writer = csv.DictWriter(output, fieldnames=parsed[0].keys())
            writer.writeheader()
            writer.writerows(parsed)

            with open(filename, "w") as f:
                f.write(output.getvalue())
        else:
            with open(filename, "w") as f:
                json.dump(parsed, f, indent=2)

        return f"saved {len(parsed) if isinstance(parsed, list) else 1} records to {filename}"

    except Exception as e:
        return f"error saving data: {e}"

# create tools
search_tool = FunctionTool(func=search_google)
save_tool = FunctionTool(func=save_data)

# create the multi-tool agent
research_agent = Agent(
    name="data_researcher",
    model="gemini-2.0-flash",
    description="an AI agent that researches topics by searching and scraping the web",
    instruction="""you are a data research assistant that collects information from the web.

    workflow:
    1. search Google to find relevant pages for the user's query
    2. scrape the most promising results to extract detailed data
    3. organize the extracted data into a structured format
    4. save the data to a file if requested

    always use proxies (they're built into the tools) and be respectful of
    rate limits by not making too many requests in rapid succession.
    """,
    tools=[scrape_tool, search_tool, save_tool],
)

Running the Agent

from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService
from google.genai import types

async def run_scraping_task(task_description):
    """run the scraping agent with a specific task"""
    session_service = InMemorySessionService()
    runner = Runner(
        agent=research_agent,
        app_name="web_scraper",
        session_service=session_service,
    )

    session = await session_service.create_session(
        app_name="web_scraper",
        user_id="user_1",
    )

    # wrap the task text in a Content object, which run_async expects as new_message
    new_message = types.Content(role="user", parts=[types.Part(text=task_description)])

    # run_async is an async generator of events, so iterate over it directly
    results = []
    async for event in runner.run_async(
        session_id=session.id,
        user_id="user_1",
        new_message=new_message,
    ):
        if event.content and event.content.parts:
            for part in event.content.parts:
                if hasattr(part, "text") and part.text:
                    results.append(part.text)

    return "\n".join(results)

# example usage
import asyncio

result = asyncio.run(run_scraping_task(
    "find the top 10 proxy service providers and collect their pricing information. "
    "save the results to proxy_pricing.json"
))
print(result)

Error Handling and Retry Logic

ADK agents benefit from tools that handle errors gracefully.

def scrape_with_retry(url: str, extract_type: str = "text", max_retries: int = 3) -> str:
    """
    scrape a webpage with automatic retry and proxy rotation.

    args:
        url: the URL to scrape
        extract_type: what to extract - "text", "links", "tables", or "structured"
        max_retries: maximum number of retry attempts

    returns:
        extracted content or error message
    """
    errors = []

    for attempt in range(max_retries):
        session = requests.Session(impersonate="chrome124")
        proxy = proxy_manager.get_proxy()
        proxy_url = list(proxy.values())[0]

        try:
            response = session.get(
                url,
                headers={
                    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
                    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
                    "Accept-Language": "en-US,en;q=0.9",
                },
                proxies=proxy,
                timeout=30,
            )

            if response.status_code == 200:
                proxy_manager.report_result(proxy_url, True)
                soup = BeautifulSoup(response.text, "lxml")

                for element in soup(["script", "style"]):
                    element.decompose()

                if extract_type == "structured":
                    result = {
                        "title": soup.title.get_text(strip=True) if soup.title else "",
                        "headings": [h.get_text(strip=True) for h in soup.find_all(["h1", "h2", "h3"])],
                        "paragraphs": [p.get_text(strip=True) for p in soup.find_all("p")][:20],
                        "url": url,
                    }
                    return json.dumps(result, indent=2)
                else:
                    text = soup.get_text(separator="\n", strip=True)
                    return text[:5000]

            else:
                proxy_manager.report_result(proxy_url, False)
                errors.append(f"attempt {attempt + 1}: HTTP {response.status_code}")

        except Exception as e:
            proxy_manager.report_result(proxy_url, False)
            errors.append(f"attempt {attempt + 1}: {str(e)}")

        time.sleep(random.uniform(2, 5))

    return json.dumps({
        "error": "all retries failed",
        "attempts": errors,
        "url": url,
    })

retry_scrape_tool = FunctionTool(func=scrape_with_retry)
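
To put the retry logic to work, register retry_scrape_tool with an agent in place of the basic scrape_tool. A minimal sketch (the resilient_researcher name and its instruction text are illustrative, not part of the earlier agents):

# a researcher agent wired to the retry-aware scraper instead of the basic tool
resilient_agent = Agent(
    name="resilient_researcher",
    model="gemini-2.0-flash",
    description="a research agent that scrapes with automatic retries and proxy rotation",
    instruction="search for relevant pages, scrape them with scrape_with_retry, "
                "and summarize the extracted data in a clear format.",
    tools=[retry_scrape_tool, search_tool, save_tool],
)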

Monitoring Agent Performance

Track how your ADK scraping agent performs.

from datetime import datetime

class AgentMonitor:
    def __init__(self):
        self.tasks = []
        self.current_task = None

    def start_task(self, description):
        self.current_task = {
            "description": description,
            "start_time": datetime.now(),
            "tool_calls": [],
            "errors": [],
        }

    def log_tool_call(self, tool_name, url, success):
        if self.current_task:
            self.current_task["tool_calls"].append({
                "tool": tool_name,
                "url": url,
                "success": success,
                "timestamp": datetime.now().isoformat(),
            })

    def end_task(self):
        if self.current_task:
            self.current_task["end_time"] = datetime.now()
            self.current_task["duration"] = (
                self.current_task["end_time"] - self.current_task["start_time"]
            ).total_seconds()
            self.tasks.append(self.current_task)
            self.current_task = None

    def get_summary(self):
        total_calls = sum(len(t["tool_calls"]) for t in self.tasks)
        successful = sum(
            1 for t in self.tasks
            for c in t["tool_calls"]
            if c["success"]
        )

        return {
            "total_tasks": len(self.tasks),
            "total_tool_calls": total_calls,
            "success_rate": f"{successful/total_calls:.1%}" if total_calls else "N/A",
            "avg_duration": f"{sum(t['duration'] for t in self.tasks)/len(self.tasks):.1f}s" if self.tasks else "N/A",
            "proxy_stats": proxy_manager.get_stats(),
        }
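
A minimal usage sketch: create a monitor, wrap each agent run with start_task and end_task, and call log_tool_call from inside your tool functions. The wiring below is illustrative rather than taken from the agent code above.

# illustrative wiring: record one task with two tool calls, then print a summary
monitor = AgentMonitor()

monitor.start_task("collect pricing data from two vendor pages")
monitor.log_tool_call("scrape_webpage", "https://example.com/pricing", True)
monitor.log_tool_call("scrape_webpage", "https://example.org/plans", False)
monitor.end_task()

print(json.dumps(monitor.get_summary(), indent=2))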

Best Practices for ADK Scraping Agents

  1. Keep tool outputs concise – LLMs have context limits. Truncate scraped content to what’s needed rather than dumping entire pages.
  2. Use structured extraction – the “structured” extract type gives the agent organized data to work with rather than raw text.
  3. Build in rate limiting – proxy rotation helps, but also add delays between requests in your tools (see the sketch after this list).
  4. Handle errors in the tool, not the agent – return clear error messages from tools so the agent can decide what to do next.
  5. Test with simple tasks first – start with “scrape this one URL and extract the title” before building complex multi-page workflows.
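
For point 3, a lightweight way to add delays inside a tool is a per-domain timer. The polite_delay helper below is a name introduced here for illustration; you could call it at the top of scrape_webpage or scrape_with_retry before making the request.

from urllib.parse import urlparse

_last_request_times: dict = {}

def polite_delay(url: str, min_gap: float = 3.0):
    """sleep until at least min_gap seconds have passed since the last request to this domain"""
    domain = urlparse(url).netloc
    elapsed = time.time() - _last_request_times.get(domain, 0.0)
    if elapsed < min_gap:
        time.sleep(min_gap - elapsed)
    _last_request_times[domain] = time.time()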

When testing your proxy setup, use the Browser Fingerprint Tester to verify that your requests look like genuine browser traffic.

Summary

Google ADK provides a powerful framework for building AI agents that can scrape the web intelligently. By wrapping your scraping logic in ADK tools and adding proxy rotation, you get agents that:

  • adapt to different website structures automatically
  • handle errors and retries intelligently
  • use proxies to avoid detection and rate limiting
  • extract structured data from unstructured web pages
  • chain multiple scraping steps together for complex research tasks

The combination of LLM intelligence and proxy infrastructure makes ADK scraping agents more resilient than traditional scrapers, since the agent can reason about failures and try alternative approaches rather than just retrying the same request.
