Mastra.ai is a TypeScript-native AI agent framework that integrates web scraping as a tool within autonomous agent workflows.
AI agent frameworks are moving from experimental toys to production infrastructure. Mastra.ai is one of the more practical entries in the space — it is TypeScript-first, self-hostable, and designed around the idea that agents need tools, memory, and workflow orchestration to do real work. web scraping is a natural fit as a tool within this architecture.
this guide covers integrating web scraping capabilities into Mastra agents: building scraping tools, connecting them to agent workflows, and handling the async nature of scraping within Mastra’s execution model.
what mastra.ai is and why it matters for scraping
Mastra is an open-source TypeScript framework for building AI agents and workflows. it provides abstractions for tools (functions agents can call), memory (persistent agent state), and workflows (multi-step execution graphs). it integrates with major LLM providers including OpenAI, Anthropic, and Google.
for web scraping, Mastra solves the orchestration problem: instead of writing custom logic to decide when to scrape, what to do with the data, and how to handle failures, you define scraping as a tool and let the agent reason about when and how to use it. the agent can chain scraping with parsing, analysis, storage, and follow-up requests based on what it finds.
setting up mastra with scraping tools
import { Mastra, createTool } from '@mastra/core';
import { z } from 'zod';
import axios from 'axios';
import * as cheerio from 'cheerio';

// define a web scraping tool
const scrapePageTool = createTool({
  id: 'scrape-page',
  description: 'Fetches a URL and returns the page text content and links',
  inputSchema: z.object({
    url: z.string().url().describe('The URL to scrape'),
    selector: z.string().optional().describe('CSS selector to extract specific elements')
  }),
  outputSchema: z.object({
    title: z.string(),
    text: z.string(),
    links: z.array(z.string())
  }),
  execute: async ({ context }) => {
    const { url, selector } = context;
    const response = await axios.get(url, {
      headers: {
        'User-Agent': 'Mozilla/5.0 (compatible; ResearchBot/1.0)'
      },
      timeout: 15000
    });
    const $ = cheerio.load(response.data);
    const title = $('title').text().trim();
    const text = selector
      ? $(selector).text().trim()
      : $('body').text().replace(/\s+/g, ' ').trim().substring(0, 5000);
    const links = $('a[href]')
      .map((_, el) => $(el).attr('href'))
      .get()
      .filter(href => href && href.startsWith('http'))
      .slice(0, 20);
    return { title, text, links };
  }
});
// define a search-and-scrape tool
const searchAndScrapeTool = createTool({
  id: 'search-and-scrape',
  description: 'Searches for a query and scrapes the top results',
  inputSchema: z.object({
    query: z.string(),
    maxResults: z.number().default(3)
  }),
  outputSchema: z.object({
    results: z.array(z.object({
      url: z.string(),
      title: z.string(),
      snippet: z.string()
    }))
  }),
  execute: async ({ context }) => {
    // integrate with your search API here
    // returning a mock structure for illustration
    return { results: [] };
  }
});
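the scrape-page tool above truncates body text to 5,000 characters so tool output stays within the model's context budget. that hard cutoff can split a word in half; a small framework-agnostic sketch of a friendlier truncation (the helper name is mine, not a Mastra API) cuts at the last word boundary instead:

```typescript
// collapse whitespace and truncate to a character budget, cutting at the
// last word boundary so the agent never sees a half-word
function truncateForContext(text: string, maxChars = 5000): string {
  const collapsed = text.replace(/\s+/g, ' ').trim();
  if (collapsed.length <= maxChars) return collapsed;
  const cut = collapsed.slice(0, maxChars);
  const lastSpace = cut.lastIndexOf(' ');
  return (lastSpace > 0 ? cut.slice(0, lastSpace) : cut) + '…';
}
```

you could call this in place of the `.substring(0, 5000)` in the tool's execute function; the budget is a tunable tradeoff between completeness and token cost.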
building a research agent with scraping
import { Agent } from '@mastra/core';

const researchAgent = new Agent({
  name: 'web-research-agent',
  instructions: `you are a web research assistant. when given a research question,
use the scrape-page tool to gather information from relevant URLs.
synthesize what you find into a structured summary.
always cite the sources you used.`,
  model: {
    provider: 'ANTHROPIC',
    name: 'claude-sonnet-4-5',
    toolChoice: 'auto'
  },
  tools: {
    scrapePage: scrapePageTool,
    searchAndScrape: searchAndScrapeTool
  }
});

// use the agent
const result = await researchAgent.generate([
  {
    role: 'user',
    content: 'research the current pricing models of the top 5 residential proxy providers'
  }
]);

console.log(result.text);
adding proxy rotation to the scraping tool
for agents that scrape at scale or need to avoid IP-based blocking, integrate proxy rotation directly into the scraping tool. the agent does not need to know about proxies — the tool handles rotation transparently.
import { HttpsProxyAgent } from 'https-proxy-agent';

const proxyPool = process.env.PROXY_LIST?.split(',') || [];
let proxyIndex = 0;

const scrapeWithProxyTool = createTool({
  id: 'scrape-with-proxy',
  description: 'Scrapes a URL using a rotating proxy for reliability',
  inputSchema: z.object({ url: z.string().url() }),
  outputSchema: z.object({ content: z.string(), statusCode: z.number() }),
  execute: async ({ context }) => {
    // round-robin through the pool; fall back to a direct connection if empty
    const proxy = proxyPool[proxyIndex++ % proxyPool.length];
    const agent = proxy ? new HttpsProxyAgent(proxy) : undefined;
    const response = await axios.get(context.url, {
      // HttpsProxyAgent covers https targets; plain-http targets may need
      // HttpProxyAgent from 'http-proxy-agent' instead
      httpAgent: agent,
      httpsAgent: agent,
      timeout: 20000
    });
    const $ = cheerio.load(response.data);
    // strip page chrome before extracting readable text
    $('script, style, nav, footer').remove();
    const content = $('main, article, .content, body')
      .first()
      .text()
      .replace(/\s+/g, ' ')
      .trim();
    return { content, statusCode: response.status };
  }
});
learn about proxy server types and how to select them for agent scraping workloads. SOCKS5 vs HTTP proxy differences affect which proxy agent library you use in Node.js.
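since the agent library depends on the proxy scheme, one option is to dispatch on the proxy URL itself. a dependency-free sketch of that check (the function name is mine; the actual agent classes come from the 'https-proxy-agent' and 'socks-proxy-agent' packages):

```typescript
type ProxyKind = 'socks' | 'http';

// decide which proxy agent class a proxy URL needs, based on its scheme.
// node's URL parser accepts unknown schemes, so socks5:// parses fine.
function proxyKind(proxyUrl: string): ProxyKind {
  return new URL(proxyUrl).protocol.startsWith('socks') ? 'socks' : 'http';
}

// wiring it up inside a scraping tool:
// const agent = proxyKind(p) === 'socks' ? new SocksProxyAgent(p) : new HttpsProxyAgent(p);
```

this keeps the tool's interface unchanged: the agent still just passes a URL, and the tool picks the right transport.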
mastra workflows for multi-step scraping
Mastra’s workflow system lets you define deterministic multi-step pipelines alongside the agent’s autonomous reasoning. for scraping pipelines that have a known structure (e.g. search, then scrape top 10 results, then extract structured data), workflows are more reliable and cheaper to run than pure agent reasoning.
import { Workflow, Step } from '@mastra/core';

// each step wraps one unit of work; only searchStep is sketched here —
// scrapeResultsStep, extractDataStep, and summarizeStep follow the same shape
const searchStep = new Step({
  id: 'search',
  execute: async ({ context }) => {
    // look up the trigger's topic from context and call your search API
    return { urls: [] };
  }
});

const researchWorkflow = new Workflow({
  name: 'competitive-research',
  triggerSchema: z.object({ topic: z.string() })
});

researchWorkflow
  .step(searchStep)
  .then(scrapeResultsStep)
  .then(extractDataStep)
  .then(summarizeStep)
  .commit();
memory and deduplication
Mastra’s memory system lets agents remember what they have already scraped. configure memory with a vector store to prevent re-scraping the same URLs and to enable semantic search over previously collected data. this is particularly valuable for ongoing research tasks where the agent runs periodically to collect new information.
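configuring Mastra's memory and vector store is version-specific, but the deduplication logic itself is simple. a minimal framework-agnostic sketch (names are mine, not Mastra's API): normalize each URL to a canonical key, then consult a set before scraping.

```typescript
// canonicalize a URL so trivially different forms (fragment, query-parameter
// order, trailing slash) map to the same dedup key
function normalizeUrl(raw: string): string {
  const u = new URL(raw);
  u.hash = '';            // fragments don't change page content
  u.searchParams.sort();  // canonical query-parameter order
  const path = u.pathname.replace(/\/+$/, '') || '/';
  return `${u.protocol}//${u.host}${path}${u.search}`;
}

// in-memory scrape log; swap the Set for a database or vector store
// when the agent needs to remember across sessions
class ScrapeLog {
  private seen = new Set<string>();

  // true if the URL is new (and records it); false if already scraped
  markIfNew(url: string): boolean {
    const key = normalizeUrl(url);
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}
```

inside a scraping tool's execute function, checking `markIfNew` before fetching saves both bandwidth and LLM tokens on repeated runs.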
running AI agents that need web access? pair them with our dedicated mobile proxy for reliable, unblocked connections from Singapore carrier IPs.
understand the full web scraping pipeline to design agent tools that handle real-world HTML complexity rather than assuming clean, well-structured pages.
related guides
- what is web scraping? introduction and use cases
- what is a proxy server? complete guide
- SOCKS5 vs HTTP proxy: which should you use?
- what is a mobile proxy? use cases and benefits
last updated: April 3, 2026