Mastra AI Agent Framework for Web Scraping: Build Intelligent Scrapers
The era of writing rigid CSS selectors and brittle scraping scripts is giving way to something more flexible: AI agents that can understand web pages, adapt to layout changes, and make decisions about how to extract data. Mastra is a TypeScript-first AI agent framework that makes building these intelligent scraping agents surprisingly straightforward.
This guide covers how to use Mastra to build web scraping agents that combine LLM reasoning with traditional scraping tools, route requests through proxies, and handle the unpredictable nature of the modern web.
What Is Mastra?
Mastra is an open-source framework for building AI agents and workflows in TypeScript. It provides:
- Agent primitives with tool calling, memory, and structured output
- Workflow orchestration for multi-step processes
- Integrations with major LLM providers (OpenAI, Anthropic, Google)
- A tool system for giving agents capabilities like web browsing, API calls, and data processing
- RAG support for knowledge-augmented agents
- Observability with built-in logging and tracing
For web scraping, Mastra’s tool system is the key feature. You define scraping capabilities as tools, and the agent decides when and how to use them based on the task description.
Why Use AI Agents for Scraping
Traditional scrapers are deterministic: they follow a fixed sequence of steps. When a website changes its layout, the scraper breaks. AI agents approach scraping differently:
- Adaptive extraction. The agent understands what data you want and figures out where it is on the page, even if the layout changes
- Decision-making. Agents can decide whether to paginate, click into detail pages, or skip irrelevant content
- Error recovery. When something unexpected happens, the agent can reason about the problem and try alternative approaches
- Natural language instructions. You describe what you want in plain English instead of writing selectors
The tradeoff is speed and cost. AI agents are slower and more expensive per page than traditional scrapers. They shine for complex extraction tasks, one-off research, and situations where page structures are unpredictable.
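To make the difference concrete, here is a minimal sketch (the markup and selector are made up): the traditional version hardcodes a selector, while the agent version, using the scraperAgent built later in this guide, just states the goal.
// traditional scraping: a fixed selector that silently breaks when the markup changes
import * as cheerio from 'cheerio';
const html = '<div class="product-card"><span class="price">$19.99</span></div>';
const $ = cheerio.load(html);
console.log($('.product-card .price').text()); // "$19.99" -- until the class is renamed
// agent-based scraping: describe the goal in plain English (scraperAgent is built below)
// const result = await scraperAgent.generate('Extract each product name and price from the page');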
Setting Up Mastra for Scraping
Installation
# create a new project
mkdir mastra-scraper && cd mastra-scraper
npm init -y
# install mastra and dependencies
npm install @mastra/core @mastra/tools
npm install playwright cheerio
npm install @anthropic-ai/sdk # or openai
npm install https-proxy-agent
npm install dotenv
Project Structure
mastra-scraper/
src/
agents/
scraper-agent.ts
research-agent.ts
tools/
browse.ts
extract.ts
proxy.ts
workflows/
scrape-workflow.ts
index.ts
.env
package.json
tsconfig.json
Environment Configuration
# .env
ANTHROPIC_API_KEY=your_key_here
SERPAPI_KEY=your_serpapi_key
PROXY_GATEWAY=http://gate.proxyservice.com:7777
PROXY_USERNAME=your_user
PROXY_PASSWORD=your_pass
Building Scraping Tools
Mastra agents use tools to interact with the world. Here are the essential tools for a scraping agent.
Tool 1: Browse Web Page
This tool fetches a web page through a proxy and returns the HTML content:
// src/tools/browse.ts
import { createTool } from '@mastra/core';
import { z } from 'zod';
import { chromium } from 'playwright';
export const browseWebPage = createTool({
id: 'browse-web-page',
description: 'Fetches a web page and returns its HTML content. Supports JavaScript rendering and proxy routing.',
inputSchema: z.object({
url: z.string().url().describe('The URL to browse'),
waitForSelector: z.string().optional().describe('CSS selector to wait for before extracting content'),
country: z.string().optional().default('us').describe('Country code for geo-targeted proxy'),
}),
outputSchema: z.object({
html: z.string(),
title: z.string(),
url: z.string(),
statusCode: z.number(),
}),
execute: async ({ context }) => {
const { url, waitForSelector, country } = context;
const proxyServer = process.env.PROXY_GATEWAY;
const proxyUser = `${process.env.PROXY_USERNAME}-country-${country}`;
const proxyPass = process.env.PROXY_PASSWORD;
const browser = await chromium.launch({
headless: true,
proxy: proxyServer ? {
server: proxyServer,
username: proxyUser,
password: proxyPass,
} : undefined,
});
try {
const page = await browser.newPage({
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
viewport: { width: 1920, height: 1080 },
});
const response = await page.goto(url, { waitUntil: 'networkidle', timeout: 30000 });
if (waitForSelector) {
await page.waitForSelector(waitForSelector, { timeout: 10000 }).catch(() => {
console.log(`selector ${waitForSelector} not found, continuing anyway`);
});
}
const html = await page.content();
const title = await page.title();
return {
html,
title,
url: page.url(),
statusCode: response?.status() ?? 0, // report the real HTTP status instead of assuming 200
};
} finally {
await browser.close();
}
},
});
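Before wiring this into an agent, it is worth smoke-testing the tool on its own. A minimal sketch, assuming the same execute({ context }) calling convention the workflow steps later in this guide use:
// src/tools/browse-test.ts -- quick standalone check of the browse tool
import 'dotenv/config';
import { browseWebPage } from './browse';
async function smokeTest() {
  const page = await browseWebPage.execute({
    context: { url: 'https://example.com', country: 'us' },
  });
  console.log(page.statusCode, page.title, `${page.html.length} chars of HTML`);
}
smokeTest().catch(console.error);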
Tool 2: Extract Structured Data
This tool uses an LLM to extract structured data from HTML:
// src/tools/extract.ts
import { createTool } from '@mastra/core';
import { z } from 'zod';
import Anthropic from '@anthropic-ai/sdk';
export const extractData = createTool({
id: 'extract-structured-data',
description: 'Extracts structured data from HTML content using AI. Provide a description of what data to extract.',
inputSchema: z.object({
html: z.string().describe('HTML content to extract from'),
extractionPrompt: z.string().describe('Description of what data to extract'),
outputFormat: z.string().optional().default('json').describe('Output format: json or csv'),
}),
outputSchema: z.object({
data: z.any(),
itemCount: z.number(),
}),
execute: async ({ context }) => {
const { html, extractionPrompt, outputFormat } = context;
// clean HTML to reduce token usage
const cleanedHtml = cleanHtml(html);
const client = new Anthropic();
const response = await client.messages.create({
model: 'claude-sonnet-4-20250514',
max_tokens: 4096,
messages: [{
role: 'user',
content: `Extract data from this HTML content and return it as a JSON array.
Extraction instructions: ${extractionPrompt}
HTML content:
${cleanedHtml}
Return ONLY valid JSON. No explanations or markdown formatting.`
}],
});
const responseText = response.content[0].type === 'text' ? response.content[0].text : '';
try {
const data = JSON.parse(responseText);
const items = Array.isArray(data) ? data : [data];
return {
data: items,
itemCount: items.length,
};
} catch {
return {
data: [{ raw: responseText }],
itemCount: 1,
};
}
},
});
function cleanHtml(html: string): string {
// remove scripts, styles, and unnecessary elements
return html
.replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
.replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '')
.replace(/<nav[^>]*>[\s\S]*?<\/nav>/gi, '')
.replace(/<footer[^>]*>[\s\S]*?<\/footer>/gi, '')
.replace(/<header[^>]*>[\s\S]*?<\/header>/gi, '')
.replace(/<!--[\s\S]*?-->/g, '')
.replace(/\s+/g, ' ')
.trim();
}
Tool 3: Search the Web
This tool lets the agent discover URLs before scraping them:
// src/tools/search.ts
import { createTool } from '@mastra/core';
import { z } from 'zod';
export const searchWeb = createTool({
id: 'search-web',
description: 'Searches the web for a query and returns a list of URLs and snippets.',
inputSchema: z.object({
query: z.string().describe('Search query'),
numResults: z.number().optional().default(10).describe('Number of results to return'),
}),
outputSchema: z.object({
results: z.array(z.object({
title: z.string(),
url: z.string(),
snippet: z.string(),
})),
}),
execute: async ({ context }) => {
const { query, numResults } = context;
// using SerpAPI for reliable search results
const response = await fetch(
`https://serpapi.com/search.json?q=${encodeURIComponent(query)}&num=${numResults}&api_key=${process.env.SERPAPI_KEY}`
);
const data = await response.json();
const results = (data.organic_results || []).map((r: any) => ({
title: r.title || '',
url: r.link || '',
snippet: r.snippet || '',
}));
return { results };
},
});
Building the Scraping Agent
Now let's combine the tools into an intelligent scraping agent:
// src/agents/scraper-agent.ts
import { Agent } from '@mastra/core';
import { browseWebPage } from '../tools/browse';
import { extractData } from '../tools/extract';
import { searchWeb } from '../tools/search';
export const scraperAgent = new Agent({
name: 'web-scraper',
instructions: `You are an expert web scraping agent. Your job is to collect structured data from websites based on user requests.
Guidelines:
- Always use the browse tool to fetch pages before extracting data
- If you need to find the right URLs first, use the search tool
- Extract data in a clean, structured format
- Handle pagination by checking for "next page" links
- If a page fails to load, try once more before giving up
- Always report what data you collected and any issues encountered
- When scraping multiple pages, work through them systematically
- Respect rate limits by not scraping too aggressively`,
model: {
provider: 'ANTHROPIC',
name: 'claude-sonnet-4-20250514',
},
tools: {
browseWebPage,
extractData,
searchWeb,
},
});
Using the Agent
// src/index.ts
import { scraperAgent } from './agents/scraper-agent';
async function main() {
// example 1: scrape product data from a specific URL
const productResult = await scraperAgent.generate(
`Scrape all product listings from https://example-store.com/products.
For each product, extract:
- product name
- price
- rating (if available)
- number of reviews
Return the data as a JSON array.`
);
console.log('Products:', productResult.text);
// example 2: research-style scraping across multiple sources
const researchResult = await scraperAgent.generate(
`I need to research the current state of residential proxy pricing.
Search for "residential proxy pricing 2026" and scrape the top 5 results.
From each page, extract:
- provider name
- pricing tiers
- features included
- any bandwidth or request limits
Compile everything into a comparison table.`
);
console.log('Research:', researchResult.text);
}
main().catch(console.error);
Building a Multi-Step Scraping Workflow
For more complex scraping tasks, use Mastra’s workflow system:
// src/workflows/scrape-workflow.ts
import { Workflow, Step } from '@mastra/core';
import { z } from 'zod';
import { browseWebPage } from '../tools/browse';
import { extractData } from '../tools/extract';
import { searchWeb } from '../tools/search';
const findSourcesStep = new Step({
id: 'find-sources',
description: 'Find relevant URLs to scrape',
inputSchema: z.object({
topic: z.string(),
maxSources: z.number().default(5),
}),
execute: async ({ context }) => {
const searchResult = await searchWeb.execute({
context: {
query: context.topic,
numResults: context.maxSources,
}
});
return { urls: searchResult.results.map(r => r.url) };
},
});
const scrapePageStep = new Step({
id: 'scrape-page',
description: 'Scrape a single page',
inputSchema: z.object({
url: z.string(),
extractionPrompt: z.string(),
}),
execute: async ({ context }) => {
// browse the page
const pageContent = await browseWebPage.execute({
context: {
url: context.url,
country: 'us',
}
});
// extract data
const extracted = await extractData.execute({
context: {
html: pageContent.html,
extractionPrompt: context.extractionPrompt,
}
});
return {
url: context.url,
title: pageContent.title,
data: extracted.data,
itemCount: extracted.itemCount,
};
},
});
const aggregateStep = new Step({
id: 'aggregate-results',
description: 'Combine and deduplicate results from all pages',
inputSchema: z.object({
results: z.array(z.any()),
}),
execute: async ({ context }) => {
const allData = context.results.flatMap(r => r.data || []);
// simple deduplication by title/name
const seen = new Set();
const unique = allData.filter(item => {
const key = JSON.stringify(item);
if (seen.has(key)) return false;
seen.add(key);
return true;
});
return {
totalSources: context.results.length,
totalItems: unique.length,
data: unique,
};
},
});
export const researchWorkflow = new Workflow({
name: 'research-scrape',
steps: [findSourcesStep, scrapePageStep, aggregateStep],
});
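If you want to see the data flow explicitly, or pin behavior across Mastra versions whose workflow-runner APIs differ, the same pipeline can also be composed by hand; a sketch that reuses the steps' own execute functions:
// manual composition of the three steps, using the same execute({ context }) convention
export async function runResearchManually(topic: string, extractionPrompt: string) {
  const { urls } = await findSourcesStep.execute({
    context: { topic, maxSources: 5 },
  });
  // scrape sequentially to keep the request rate gentle
  const results = [];
  for (const url of urls) {
    results.push(await scrapePageStep.execute({ context: { url, extractionPrompt } }));
  }
  return aggregateStep.execute({ context: { results } });
}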
Proxy Integration Patterns
Pattern 1: Automatic Country Detection
// src/tools/proxy.ts
import { createTool } from '@mastra/core';
import { z } from 'zod';
export const detectCountry = createTool({
id: 'detect-country',
description: 'Detects the likely country of a URL based on its domain',
inputSchema: z.object({
url: z.string(),
}),
outputSchema: z.object({
country: z.string(),
countryCode: z.string(),
}),
execute: async ({ context }) => {
const url = new URL(context.url);
const tld = url.hostname.split('.').pop() || '';
const tldMap: Record<string, { country: string; code: string }> = {
'br': { country: 'Brazil', code: 'br' },
'mx': { country: 'Mexico', code: 'mx' },
'ar': { country: 'Argentina', code: 'ar' },
'de': { country: 'Germany', code: 'de' },
'uk': { country: 'United Kingdom', code: 'gb' },
'fr': { country: 'France', code: 'fr' },
'jp': { country: 'Japan', code: 'jp' },
'cn': { country: 'China', code: 'cn' },
'in': { country: 'India', code: 'in' },
};
const detected = tldMap[tld] || { country: 'United States', code: 'us' };
return { country: detected.country, countryCode: detected.code };
},
});
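The detected code can feed straight into the browse tool's country parameter; a small sketch combining the two tools:
// geo-aware browsing: detect the likely country, then route the proxy through it
import { detectCountry } from './proxy';
import { browseWebPage } from './browse';
export async function browseWithGeo(url: string) {
  const { countryCode } = await detectCountry.execute({ context: { url } });
  return browseWebPage.execute({ context: { url, country: countryCode } });
}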
Pattern 2: Smart Proxy Selection
// add to the scraper agent's tool set
export const selectProxy = createTool({
id: 'select-proxy',
description: 'Selects the best proxy type for a given target URL',
inputSchema: z.object({
url: z.string(),
antiBot: z.enum(['low', 'medium', 'high']).optional().default('medium'),
}),
outputSchema: z.object({
proxyType: z.string(),
proxyUrl: z.string(),
recommendation: z.string(),
}),
execute: async ({ context }) => {
const { url, antiBot } = context;
const gateway = (process.env.PROXY_GATEWAY || '').replace(/^https?:\/\//, ''); // strip the scheme; it is re-added when building the proxy URL below
const user = process.env.PROXY_USERNAME;
const pass = process.env.PROXY_PASSWORD;
let proxyType: string;
let recommendation: string;
if (antiBot === 'high') {
proxyType = 'residential';
recommendation = 'using residential proxies for high anti-bot protection';
} else if (antiBot === 'medium') {
proxyType = 'isp';
recommendation = 'using ISP proxies for balanced speed and stealth';
} else {
proxyType = 'datacenter';
recommendation = 'using datacenter proxies for maximum speed';
}
const proxyUrl = `http://${user}-type-${proxyType}:${pass}@${gateway}`;
return { proxyType, proxyUrl, recommendation };
},
});
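The returned proxyUrl is also usable outside Playwright. For lightweight requests that do not need JavaScript rendering, it plugs into the https-proxy-agent package installed earlier; a sketch (the target URL is a placeholder):
// plain HTTP request through the selected proxy, no browser needed
import https from 'https';
import { HttpsProxyAgent } from 'https-proxy-agent';
export async function fetchThroughProxy(targetUrl: string) {
  const { proxyUrl } = await selectProxy.execute({
    context: { url: targetUrl, antiBot: 'low' },
  });
  const agent = new HttpsProxyAgent(proxyUrl);
  https.get(targetUrl, { agent }, (res) => {
    console.log('status:', res.statusCode);
  });
}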
Real-World Use Cases
Use Case 1: Competitive Price Monitoring
const priceMonitorResult = await scraperAgent.generate(`
Monitor pricing for the following competitors:
- https://competitor1.com/pricing
- https://competitor2.com/plans
- https://competitor3.com/products
For each competitor, extract:
- all plan names and their prices
- key features per plan
- any free trial information
- enterprise/custom pricing availability
Compare the results and highlight differences.
`);
Use Case 2: Job Market Research
const jobResearch = await scraperAgent.generate(`
Search for "senior data engineer remote" job listings.
Scrape the top 5 search results.
From each listing, extract:
- job title
- company name
- salary range (if listed)
- required skills
- location/remote status
Summarize the common requirements and salary ranges.
`);
Use Case 3: Content Gap Analysis
const contentGaps = await scraperAgent.generate(`
I run a website about web scraping tools.
Search for "web scraping tutorial 2026" and scrape the top 10 results.
For each article, extract:
- title
- main topics covered
- word count estimate
- target audience (beginner/intermediate/advanced)
Identify topics that are NOT well covered by existing content.
`);
Performance and Cost Optimization
Reducing LLM Costs
AI-powered scraping is more expensive than traditional methods. Here are ways to minimize costs:
Clean HTML aggressively before sending it to the LLM. Remove navigation, footers, scripts, and styles. This can reduce token usage by 80% or more.
Use cheaper models for simple extraction. Not every page needs Claude Sonnet. For pages with well-structured HTML, a smaller model works fine.
Cache LLM responses. If you are scraping the same type of page repeatedly (like product pages), cache the extraction output after the first successful run; a simple variant is sketched after this list.
Batch related pages. Instead of sending each page individually, combine content from similar pages into a single LLM call.
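One simple version of caching is memoizing full extraction results per URL and prompt, so re-runs skip the LLM call entirely; a minimal in-memory sketch (caching learned extraction logic per page type is a more advanced variant):
// naive result cache: the same page + prompt never hits the LLM twice
import { extractData } from './tools/extract';
const extractionCache = new Map<string, { data: unknown; itemCount: number }>();
export async function cachedExtract(url: string, html: string, extractionPrompt: string) {
  const key = `${url}|${extractionPrompt}`;
  const hit = extractionCache.get(key);
  if (hit) return hit;
  const result = await extractData.execute({ context: { html, extractionPrompt } });
  extractionCache.set(key, result);
  return result;
}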
Handling Rate Limits
// add rate limiting to your agent's execution
class RateLimiter {
private queue: Array<() => Promise<void>> = [];
private processing = false;
private delayMs: number;
constructor(requestsPerMinute: number) {
this.delayMs = (60 * 1000) / requestsPerMinute;
}
async execute<T>(fn: () => Promise<T>): Promise<T> {
return new Promise((resolve, reject) => {
this.queue.push(async () => {
try {
const result = await fn();
resolve(result);
} catch (error) {
reject(error);
}
});
if (!this.processing) {
this.processQueue();
}
});
}
private async processQueue() {
this.processing = true;
while (this.queue.length > 0) {
const task = this.queue.shift();
if (task) {
await task();
await new Promise(r => setTimeout(r, this.delayMs));
}
}
this.processing = false;
}
}
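Usage is a thin wrapper around whatever call you want to throttle, for example capping page fetches at 10 per minute:
// throttle browses to 10 requests per minute
import { browseWebPage } from './tools/browse';
const limiter = new RateLimiter(10);
async function scrapeAll(urls: string[]) {
  for (const url of urls) {
    const page = await limiter.execute(() =>
      browseWebPage.execute({ context: { url, country: 'us' } })
    );
    console.log(page.title);
  }
}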
Limitations and When Not to Use AI Agents
AI scraping agents are not always the right choice:
- High-volume scraping. If you need to scrape millions of pages, traditional scrapers are faster and cheaper
- Simple, structured pages. If a page has a clear, consistent structure, CSS selectors are more efficient
- Real-time data needs. The added latency of LLM calls makes agents unsuitable for real-time applications
- Budget constraints. LLM API costs add up quickly for large scraping jobs
The sweet spot for AI scraping agents is:
- Complex, unstructured pages
- One-off research tasks
- Sites that change layouts frequently
- Extraction tasks that require understanding context
- Multi-source research that needs synthesis
Conclusion
Mastra provides a clean, TypeScript-native way to build AI scraping agents that go beyond simple data extraction. By combining LLM reasoning with traditional scraping tools and proxy infrastructure, you can build agents that adapt to changing websites and handle complex research tasks autonomously.
Start with a simple agent that has browse and extract tools, test it on a few pages, and add complexity as you understand the patterns. The workflow system becomes valuable when you have multi-step processes that need to run reliably and repeatedly.
The future of web scraping is hybrid: AI agents for complex, adaptive tasks, and traditional scrapers for high-volume, structured extraction. Mastra makes it straightforward to build the AI side of that equation.