Cline AI Agent Web Scraping with MCP Integration
Cline is an autonomous AI coding agent that runs inside VS Code. It can read your codebase, write code, execute terminal commands, and interact with your browser. When connected to MCP servers, Cline gains the ability to use external tools, including web scraping tools that can fetch, parse, and extract data from any website.
This combination of an AI agent with MCP-powered scraping tools creates something powerful: an AI that can autonomously build and run scrapers, inspect the results, fix errors, and iterate until the data is correct. This guide covers how to set up Cline with MCP for web scraping, configure proxies, and use the system for real data extraction tasks.
What is Cline
Cline (formerly Claude Dev) is a VS Code extension that gives an AI model the ability to:
- read and write files in your project
- execute terminal commands
- browse the web using a built-in browser
- use MCP tools that you configure
- ask for your approval before taking actions
Unlike a chatbot that just gives you code to copy, Cline actually executes the code, sees the output, and iterates. If a scraper fails, Cline can read the error, modify the code, and retry, all within a single conversation.
Setting Up Cline for Web Scraping
Step 1: Install Cline
Install the Cline extension from the VS Code marketplace:
- open VS Code
- go to Extensions (Ctrl+Shift+X)
- search for “Cline”
- click Install
- configure your API key (Anthropic, OpenAI, or other supported providers)
Step 2: Configure an MCP Scraping Server
Create an MCP server that provides web scraping tools to Cline:
# scraping_mcp_server.py
import asyncio
import json
import os

import httpx
from bs4 import BeautifulSoup
from mcp.server import Server
from mcp.types import Tool, TextContent

server = Server("web-scraper-mcp")

# proxy configuration (read from the environment so credentials stay out of the code)
PROXY_URL = os.environ.get("PROXY_URL", "http://user:pass@proxy.example.com:8080")
@server.list_tools()
async def list_tools():
    return [
        Tool(
            name="fetch_page",
            description="Fetch a web page and return its text content. Use for static pages.",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "The URL to fetch"},
                    "selector": {
                        "type": "string",
                        "description": "Optional CSS selector to extract specific content",
                        "default": "body"
                    }
                },
                "required": ["url"]
            }
        ),
        Tool(
            name="fetch_dynamic_page",
            description="Fetch a JavaScript-rendered page using a headless browser. Use for SPAs and dynamic content.",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "The URL to fetch"},
                    "wait_selector": {
                        "type": "string",
                        "description": "CSS selector to wait for before extracting content"
                    }
                },
                "required": ["url"]
            }
        ),
        Tool(
            name="search_google",
            description="Search Google and return the top results.",
            inputSchema={
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                    "num_results": {"type": "integer", "default": 10}
                },
                "required": ["query"]
            }
        ),
        Tool(
            name="extract_links",
            description="Extract all links from a web page.",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "The URL to extract links from"},
                    "filter_pattern": {
                        "type": "string",
                        "description": "Optional regex pattern to filter links"
                    }
                },
                "required": ["url"]
            }
        ),
        Tool(
            name="extract_table",
            description="Extract tabular data from a web page and return as JSON.",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "The URL containing the table"},
                    "table_index": {
                        "type": "integer",
                        "description": "Index of the table to extract (0-based)",
                        "default": 0
                    }
                },
                "required": ["url"]
            }
        )
    ]
@server.call_tool()
async def call_tool(name: str, arguments: dict):
    if name == "fetch_page":
        return await _fetch_page(arguments["url"], arguments.get("selector", "body"))
    elif name == "fetch_dynamic_page":
        return await _fetch_dynamic(arguments["url"], arguments.get("wait_selector"))
    elif name == "search_google":
        return await _search_google(arguments["query"], arguments.get("num_results", 10))
    elif name == "extract_links":
        return await _extract_links(arguments["url"], arguments.get("filter_pattern"))
    elif name == "extract_table":
        return await _extract_table(arguments["url"], arguments.get("table_index", 0))
    return [TextContent(type="text", text=f"unknown tool: {name}")]
async def _fetch_page(url: str, selector: str) -> list[TextContent]:
    """fetch a static page through proxy."""
    async with httpx.AsyncClient(
        proxies={"all://": PROXY_URL},
        timeout=30,
        follow_redirects=True
    ) as client:
        response = await client.get(url, headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        })
    soup = BeautifulSoup(response.text, "html.parser")
    # remove noise
    for tag in soup.find_all(["script", "style", "nav", "footer", "header"]):
        tag.decompose()
    if selector != "body":
        elements = soup.select(selector)
        text = "\n".join(el.get_text(strip=True) for el in elements)
    else:
        text = soup.get_text(separator="\n", strip=True)
    # truncate to avoid overwhelming the model
    if len(text) > 50000:
        text = text[:50000] + "\n... [truncated]"
    return [TextContent(type="text", text=text)]
async def _fetch_dynamic(url: str, wait_selector: str | None = None) -> list[TextContent]:
    """fetch a dynamic page using Playwright."""
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        # note: if the proxy requires auth, Playwright expects separate
        # "username"/"password" keys rather than credentials in the server URL
        browser = await p.chromium.launch(
            headless=True,
            proxy={"server": PROXY_URL}
        )
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        if wait_selector:
            await page.wait_for_selector(wait_selector, timeout=10000)
        content = await page.content()
        await browser.close()
    soup = BeautifulSoup(content, "html.parser")
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()
    text = soup.get_text(separator="\n", strip=True)
    if len(text) > 50000:
        text = text[:50000] + "\n... [truncated]"
    return [TextContent(type="text", text=text)]
async def _search_google(query: str, num_results: int) -> list[TextContent]:
    """search Google and return results."""
    from urllib.parse import quote

    url = f"https://www.google.com/search?q={quote(query)}&num={num_results}"
    async with httpx.AsyncClient(
        proxies={"all://": PROXY_URL},
        timeout=15
    ) as client:
        response = await client.get(url, headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        })
    soup = BeautifulSoup(response.text, "html.parser")
    results = []
    for g in soup.select(".g"):
        title_el = g.select_one("h3")
        link_el = g.select_one("a")
        snippet_el = g.select_one(".VwiC3b")
        if title_el and link_el:
            results.append({
                "title": title_el.get_text(strip=True),
                "url": link_el.get("href", ""),
                "snippet": snippet_el.get_text(strip=True) if snippet_el else ""
            })
    return [TextContent(type="text", text=json.dumps(results, indent=2))]
async def _extract_links(url: str, filter_pattern: str | None = None) -> list[TextContent]:
    """extract all links from a page."""
    import re

    async with httpx.AsyncClient(
        proxies={"all://": PROXY_URL},
        timeout=15
    ) as client:
        response = await client.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        href = a["href"]
        text = a.get_text(strip=True)
        if filter_pattern and not re.search(filter_pattern, href):
            continue
        links.append({"url": href, "text": text})
    return [TextContent(type="text", text=json.dumps(links, indent=2))]
async def _extract_table(url: str, table_index: int) -> list[TextContent]:
    """extract a table from a page as JSON."""
    async with httpx.AsyncClient(
        proxies={"all://": PROXY_URL},
        timeout=15
    ) as client:
        response = await client.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    tables = soup.find_all("table")
    if table_index >= len(tables):
        return [TextContent(type="text", text=f"table index {table_index} not found. page has {len(tables)} tables.")]
    table = tables[table_index]
    rows = []
    headers = []
    # extract headers
    header_row = table.find("thead")
    if header_row:
        headers = [th.get_text(strip=True) for th in header_row.find_all(["th", "td"])]
        # skip rows inside <thead> so the header is not repeated as a data row
        data_rows = [tr for tr in table.find_all("tr") if tr.find_parent("thead") is None]
    else:
        # the first row was consumed as headers
        first_row = table.find("tr")
        if first_row:
            headers = [th.get_text(strip=True) for th in first_row.find_all(["th", "td"])]
        data_rows = table.find_all("tr")[1:]
    # extract data rows
    for tr in data_rows:
        cells = [td.get_text(strip=True) for td in tr.find_all(["td", "th"])]
        if headers and len(cells) == len(headers):
            rows.append(dict(zip(headers, cells)))
        elif cells:
            rows.append(cells)
    return [TextContent(type="text", text=json.dumps(rows, indent=2))]
if __name__ == "__main__":
    from mcp.server.stdio import stdio_server

    async def main():
        async with stdio_server() as (read_stream, write_stream):
            await server.run(read_stream, write_stream, server.create_initialization_options())

    asyncio.run(main())
Step 3: Connect Cline to the MCP Server
Add the MCP server to your Cline configuration. In VS Code, open the Cline MCP settings and add:
{
  "mcpServers": {
    "web-scraper": {
      "command": "python",
      "args": ["/path/to/scraping_mcp_server.py"],
      "env": {
        "PROXY_URL": "http://user:pass@proxy.example.com:8080"
      }
    }
  }
}
After restarting Cline, it will have access to all the scraping tools defined in your MCP server.
Using Cline for Web Scraping Tasks
Example 1: Scraping Product Data
Prompt Cline with:
“use the web scraper tools to fetch the top 10 results from Google for ‘best residential proxies 2026’, then visit each result and extract the proxy providers mentioned, their pricing, and key features. save the results to a CSV file.”
Cline will:
- call search_google to get search results
- call fetch_page for each result URL
- analyze the content to extract provider names, pricing, and features
- write a Python script to save the data as CSV
- execute the script and verify the output
Example 2: Monitoring Competitor Prices
“write a scraper that checks the pricing page of brightdata.com, oxylabs.io, and smartproxy.com every day. use the fetch_dynamic_page tool since these pages use JavaScript. extract all pricing tiers and save to a JSON file with timestamps.”
Cline will use the MCP tools to fetch each page, analyze the pricing structures, and create an automated script.
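The "JSON file with timestamps" part of that script could follow a shape like this sketch. The file name and tier data are made up; the real values would come from the `fetch_dynamic_page` results:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

SNAPSHOT_FILE = Path("pricing_history.json")  # hypothetical output path

def append_snapshot(provider: str, tiers: list[dict]) -> None:
    """Append one timestamped pricing snapshot to a JSON history file."""
    history = json.loads(SNAPSHOT_FILE.read_text()) if SNAPSHOT_FILE.exists() else []
    history.append({
        "provider": provider,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "tiers": tiers,
    })
    SNAPSHOT_FILE.write_text(json.dumps(history, indent=2))

# example with made-up tiers
append_snapshot("example-provider", [{"name": "Starter", "price_usd": 49}])
```

Appending to a history file rather than overwriting lets the daily runs accumulate into a time series you can diff later.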
Example 3: Building a Dataset
“I need a dataset of all Python web scraping libraries on GitHub with more than 1000 stars. use the tools to search GitHub, extract repo details (name, stars, description, last updated), and save as a CSV.”
Proxy Configuration Best Practices
Using Rotating Residential Proxies
For Cline’s MCP scraping tools, residential proxies provide the best results:
# enhanced proxy configuration in the MCP server
import random

PROXY_POOL = [
    "http://user:pass@gate.smartproxy.com:7777",
    "http://user:pass@pr.oxylabs.io:7777",
    "http://user:pass@brd.superproxy.io:22225"
]

def get_proxy() -> str:
    """rotate through proxy pool."""
    return random.choice(PROXY_POOL)

# use in fetch functions
async def _fetch_page(url: str, selector: str) -> list[TextContent]:
    proxy = get_proxy()
    async with httpx.AsyncClient(
        proxies={"all://": proxy},
        timeout=30
    ) as client:
        response = await client.get(url)
        # ... rest of the function
Geo-Targeted Proxies
When scraping location-specific content, configure country-targeted proxies:
GEO_PROXIES = {
    "us": "http://user-country-us:pass@proxy.example.com:8080",
    "uk": "http://user-country-gb:pass@proxy.example.com:8080",
    "de": "http://user-country-de:pass@proxy.example.com:8080",
    "jp": "http://user-country-jp:pass@proxy.example.com:8080"
}

# add a geo parameter to the fetch tool
# (in a real server, merge this into the existing list_tools handler
# rather than defining a second one)
@server.list_tools()
async def list_tools():
    return [
        Tool(
            name="fetch_page_geo",
            description="Fetch a page using a proxy from a specific country.",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {"type": "string"},
                    "country": {
                        "type": "string",
                        "enum": ["us", "uk", "de", "jp"],
                        "description": "Country code for the proxy location"
                    }
                },
                "required": ["url", "country"]
            }
        )
    ]
Advanced Cline Scraping Workflows
Self-Correcting Scrapers
One of Cline’s biggest advantages is error recovery. When a scraper fails, Cline can:
- read the error message
- diagnose the problem (wrong selector, blocked request, timeout)
- modify the approach (try a different selector, switch proxy, add delays)
- retry and verify the fix
This makes Cline effective for scraping unfamiliar sites where you would normally spend time debugging.
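On the MCP server side, you can support this recovery loop with a small retry helper. This is a sketch, assuming transient failures surface as ordinary exceptions; the `flaky_fetch` stub below only simulates a blocked request:

```python
import asyncio

async def fetch_with_retry(fetch, url: str, attempts: int = 3, base_delay: float = 1.0):
    """Call an async fetch function, retrying with exponential backoff on failure."""
    last_error = None
    for attempt in range(attempts):
        try:
            return await fetch(url)
        except Exception as exc:  # in real code, catch specific network errors
            last_error = exc
            await asyncio.sleep(base_delay * (2 ** attempt))
    raise last_error

# demo with a stub that fails twice, then succeeds
calls = {"n": 0}

async def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated block")
    return f"content of {url}"

result = asyncio.run(fetch_with_retry(flaky_fetch, "https://example.com", base_delay=0.01))
```

Retries handle transient blocks on the tool side, while Cline's own iteration handles the harder failures (wrong selectors, changed page structure) at the agent level.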
Multi-Step Research Pipelines
Cline can chain multiple MCP tool calls to complete complex research tasks:
1. search_google("best proxy providers for web scraping")
2. fetch_page(result_1_url) -> extract provider names
3. fetch_page(provider_1_pricing_url) -> extract pricing
4. fetch_page(provider_1_reviews_url) -> extract ratings
5. repeat for each provider
6. compile into comparison table
7. save as CSV and generate summary
This kind of multi-step pipeline would require significant code in a traditional setup but can be driven entirely by natural language prompts in Cline.
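For comparison, the traditional version of that pipeline is an explicit orchestration loop. In this sketch the search and extraction functions are stubs standing in for the real MCP tool calls:

```python
import json

# stub stand-ins for the real search_google / fetch_page tool calls
def search_google(query: str) -> list[str]:
    return ["https://example.com/provider-a", "https://example.com/provider-b"]

def extract_provider(url: str) -> dict:
    name = url.rsplit("/", 1)[-1]
    return {"name": name, "pricing": "unknown", "rating": None}

def build_comparison(query: str) -> list[dict]:
    """Run the search, visit each result, and compile a comparison table."""
    return [extract_provider(url) for url in search_google(query)]

table = build_comparison("best proxy providers for web scraping")
print(json.dumps(table, indent=2))
```

With Cline, the loop structure, the error handling, and the "repeat for each provider" logic are all decided by the agent at run time instead of being written up front.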
Data Validation and Quality Checks
Ask Cline to validate extracted data:
“scrape the product listings from example.com/products. after extraction, verify that all prices are valid numbers, all URLs are reachable, and no required fields are missing. fix any issues and save the clean dataset.”
Cline will scrape, validate, identify problems, and correct them, all in one workflow.
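The validation step Cline generates might resemble this sketch. The field names are hypothetical, and URL reachability checks are omitted to keep the example offline:

```python
def validate_products(products: list[dict],
                      required: tuple[str, ...] = ("name", "price", "url")) -> tuple[list[dict], list[str]]:
    """Split records into clean rows and human-readable problem reports."""
    clean, problems = [], []
    for i, item in enumerate(products):
        missing = [f for f in required if not item.get(f)]
        if missing:
            problems.append(f"row {i}: missing {', '.join(missing)}")
            continue
        try:
            # normalize prices like "$9.99" into floats
            item = {**item, "price": float(str(item["price"]).lstrip("$"))}
        except ValueError:
            problems.append(f"row {i}: invalid price {item['price']!r}")
            continue
        clean.append(item)
    return clean, problems

clean, problems = validate_products([
    {"name": "Widget", "price": "$9.99", "url": "https://example.com/w"},
    {"name": "Gadget", "price": "n/a", "url": "https://example.com/g"},
    {"name": "", "price": "$5", "url": "https://example.com/x"},
])
```

Returning the problem reports alongside the clean rows gives the agent something concrete to act on when you ask it to "fix any issues."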
Security Considerations
When using Cline with MCP scraping tools, keep these security points in mind:
- API keys in MCP config: store proxy credentials and API keys in environment variables, not in the MCP server code
- approval mode: keep Cline’s approval mode enabled so you can review each tool call before it executes
- output review: always review the data Cline extracts before using it in production
- rate limiting: add rate limits to your MCP tools to prevent Cline from overwhelming target sites
- scope limitation: only expose the MCP tools that Cline needs. Do not give it access to destructive operations
# rate limiting in MCP tools
from datetime import datetime

LAST_REQUEST = {}
MIN_DELAY = 2  # seconds between requests to same domain

async def rate_limited_fetch(url: str):
    from urllib.parse import urlparse

    domain = urlparse(url).netloc
    now = datetime.now()
    if domain in LAST_REQUEST:
        elapsed = (now - LAST_REQUEST[domain]).total_seconds()
        if elapsed < MIN_DELAY:
            await asyncio.sleep(MIN_DELAY - elapsed)
    LAST_REQUEST[domain] = datetime.now()
    # proceed with fetch
Cline vs Other AI Agents for Scraping
| feature | Cline | OpenAI Codex CLI | Cursor Agent | Aider |
|---|---|---|---|---|
| MCP support | yes | limited | yes | no |
| browser access | yes | no | limited | no |
| terminal execution | yes | yes | yes | yes |
| file editing | yes | yes | yes | yes |
| approval workflow | yes | yes | limited | yes |
| proxy integration | via MCP | manual | via MCP | manual |
| best for | full workflow | code generation | code editing | code editing |
Cline’s strength is the combination of MCP tool access, browser automation, and terminal execution. This makes it the most capable option for end-to-end scraping workflows where the agent needs to fetch data, write code, and verify results.
Conclusion
Cline with MCP integration transforms web scraping from a purely coding task into a conversational workflow. You describe what data you need; Cline uses MCP tools to fetch and extract it, writes any necessary code, handles errors, and validates the output. The key to making this work well is a robust MCP server with properly configured proxy support and rate limiting. Start with the basic MCP server template in this guide, connect it to Cline, and experiment with increasingly complex scraping tasks. The agent handles the tedious parts while you focus on what data you actually need and how to use it.