Cline AI Agent Web Scraping with MCP Integration

Cline is an autonomous AI coding agent that runs inside VS Code. It can read your codebase, write code, execute terminal commands, and interact with your browser. When connected to MCP servers, Cline gains the ability to use external tools, including web scraping tools that can fetch, parse, and extract data from websites.

This combination of an AI agent with MCP-powered scraping tools creates something powerful: an AI that can autonomously build and run scrapers, inspect the results, fix errors, and iterate until the data is correct. This guide covers how to set up Cline with MCP for web scraping, configure proxies, and use the system for real data extraction tasks.

What Is Cline?

Cline (formerly Claude Dev) is a VS Code extension that gives an AI model the ability to:

  • Read and write files in your project
  • Execute terminal commands
  • Browse the web using a built-in browser
  • Use MCP tools that you configure
  • Ask for your approval before taking actions

Unlike a chatbot that just gives you code to copy, Cline actually executes the code, sees the output, and iterates. If a scraper fails, Cline can read the error, modify the code, and retry, all within a single conversation.

Setting Up Cline for Web Scraping

Step 1: Install Cline

Install the Cline extension from the VS Code marketplace:

  1. Open VS Code
  2. Go to Extensions (Ctrl+Shift+X)
  3. Search for “Cline”
  4. Click Install
  5. Configure your API key (Anthropic, OpenAI, or other supported providers)

Step 2: Configure an MCP Scraping Server

Create an MCP server that provides web scraping tools to Cline:

# scraping_mcp_server.py
import asyncio
import json
import os

import httpx
from bs4 import BeautifulSoup
from mcp.server import Server
from mcp.types import Tool, TextContent

server = Server("web-scraper-mcp")

# proxy configuration (read from the environment so credentials stay out of the code)
PROXY_URL = os.environ.get("PROXY_URL", "http://user:pass@proxy.example.com:8080")

@server.list_tools()
async def list_tools():
    return [
        Tool(
            name="fetch_page",
            description="Fetch a web page and return its text content. use for static pages.",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "The URL to fetch"},
                    "selector": {
                        "type": "string",
                        "description": "Optional CSS selector to extract specific content",
                        "default": "body"
                    }
                },
                "required": ["url"]
            }
        ),
        Tool(
            name="fetch_dynamic_page",
            description="Fetch a JavaScript-rendered page using a headless browser. use for SPAs and dynamic content.",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "The URL to fetch"},
                    "wait_selector": {
                        "type": "string",
                        "description": "CSS selector to wait for before extracting content"
                    }
                },
                "required": ["url"]
            }
        ),
        Tool(
            name="search_google",
            description="Search Google and return the top results.",
            inputSchema={
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                    "num_results": {"type": "integer", "default": 10}
                },
                "required": ["query"]
            }
        ),
        Tool(
            name="extract_links",
            description="Extract all links from a web page.",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "The URL to extract links from"},
                    "filter_pattern": {
                        "type": "string",
                        "description": "Optional regex pattern to filter links"
                    }
                },
                "required": ["url"]
            }
        ),
        Tool(
            name="extract_table",
            description="Extract tabular data from a web page and return as JSON.",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "The URL containing the table"},
                    "table_index": {
                        "type": "integer",
                        "description": "Index of the table to extract (0-based)",
                        "default": 0
                    }
                },
                "required": ["url"]
            }
        )
    ]

@server.call_tool()
async def call_tool(name: str, arguments: dict):
    if name == "fetch_page":
        return await _fetch_page(arguments["url"], arguments.get("selector", "body"))
    elif name == "fetch_dynamic_page":
        return await _fetch_dynamic(arguments["url"], arguments.get("wait_selector"))
    elif name == "search_google":
        return await _search_google(arguments["query"], arguments.get("num_results", 10))
    elif name == "extract_links":
        return await _extract_links(arguments["url"], arguments.get("filter_pattern"))
    elif name == "extract_table":
        return await _extract_table(arguments["url"], arguments.get("table_index", 0))

async def _fetch_page(url: str, selector: str) -> list[TextContent]:
    """fetch a static page through proxy."""
    async with httpx.AsyncClient(
        proxies={"all://": PROXY_URL},
        timeout=30,
        follow_redirects=True
    ) as client:
        response = await client.get(url, headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        })

    soup = BeautifulSoup(response.text, "html.parser")

    # remove noise
    for tag in soup.find_all(["script", "style", "nav", "footer", "header"]):
        tag.decompose()

    if selector != "body":
        elements = soup.select(selector)
        text = "\n".join(el.get_text(strip=True) for el in elements)
    else:
        text = soup.get_text(separator="\n", strip=True)

    # truncate to avoid overwhelming the model
    if len(text) > 50000:
        text = text[:50000] + "\n... [truncated]"

    return [TextContent(type="text", text=text)]

async def _fetch_dynamic(url: str, wait_selector: str | None = None) -> list[TextContent]:
    """Fetch a dynamic page using Playwright."""
    from urllib.parse import urlparse

    from playwright.async_api import async_playwright

    # Playwright expects proxy credentials as separate fields, not inline in the URL
    parsed = urlparse(PROXY_URL)

    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy={
                "server": f"{parsed.scheme}://{parsed.hostname}:{parsed.port}",
                "username": parsed.username or "",
                "password": parsed.password or "",
            }
        )
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")

        if wait_selector:
            await page.wait_for_selector(wait_selector, timeout=10000)

        content = await page.content()
        await browser.close()

    soup = BeautifulSoup(content, "html.parser")
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()

    text = soup.get_text(separator="\n", strip=True)

    if len(text) > 50000:
        text = text[:50000] + "\n... [truncated]"

    return [TextContent(type="text", text=text)]

async def _search_google(query: str, num_results: int) -> list[TextContent]:
    """search Google and return results."""
    from urllib.parse import quote

    url = f"https://www.google.com/search?q={quote(query)}&num={num_results}"

    async with httpx.AsyncClient(
        proxies={"all://": PROXY_URL},
        timeout=15
    ) as client:
        response = await client.get(url, headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        })

    soup = BeautifulSoup(response.text, "html.parser")
    results = []

    # Google's result markup changes frequently; these selectors may need updating
    for g in soup.select(".g"):
        title_el = g.select_one("h3")
        link_el = g.select_one("a")
        snippet_el = g.select_one(".VwiC3b")

        if title_el and link_el:
            results.append({
                "title": title_el.get_text(strip=True),
                "url": link_el.get("href", ""),
                "snippet": snippet_el.get_text(strip=True) if snippet_el else ""
            })

    return [TextContent(type="text", text=json.dumps(results, indent=2))]

async def _extract_links(url: str, filter_pattern: str = None) -> list[TextContent]:
    """extract all links from a page."""
    import re

    async with httpx.AsyncClient(
        proxies={"all://": PROXY_URL},
        timeout=15
    ) as client:
        response = await client.get(url)

    soup = BeautifulSoup(response.text, "html.parser")
    links = []

    for a in soup.find_all("a", href=True):
        href = a["href"]
        text = a.get_text(strip=True)

        if filter_pattern and not re.search(filter_pattern, href):
            continue

        links.append({"url": href, "text": text})

    return [TextContent(type="text", text=json.dumps(links, indent=2))]

async def _extract_table(url: str, table_index: int) -> list[TextContent]:
    """extract a table from a page as JSON."""
    async with httpx.AsyncClient(
        proxies={"all://": PROXY_URL},
        timeout=15
    ) as client:
        response = await client.get(url)

    soup = BeautifulSoup(response.text, "html.parser")
    tables = soup.find_all("table")

    if table_index >= len(tables):
        return [TextContent(type="text", text=f"table index {table_index} not found. page has {len(tables)} tables.")]

    table = tables[table_index]
    rows = []
    headers = []

    # extract headers
    header_row = table.find("thead")
    if header_row:
        headers = [th.get_text(strip=True) for th in header_row.find_all(["th", "td"])]
    else:
        first_row = table.find("tr")
        if first_row:
            headers = [th.get_text(strip=True) for th in first_row.find_all(["th", "td"])]

    # extract data rows, skipping any row that lives inside <thead>
    data_rows = [tr for tr in table.find_all("tr") if not tr.find_parent("thead")]
    if not header_row and headers:
        data_rows = data_rows[1:]  # the first <tr> was already consumed as the header

    for tr in data_rows:
        cells = [td.get_text(strip=True) for td in tr.find_all(["td", "th"])]
        if headers and len(cells) == len(headers):
            rows.append(dict(zip(headers, cells)))
        elif cells:
            rows.append(cells)

    return [TextContent(type="text", text=json.dumps(rows, indent=2))]


if __name__ == "__main__":
    from mcp.server.stdio import stdio_server

    async def main():
        async with stdio_server() as (read_stream, write_stream):
            await server.run(
                read_stream,
                write_stream,
                server.create_initialization_options()
            )

    asyncio.run(main())

Step 3: Connect Cline to the MCP Server

Add the MCP server to your Cline configuration. In VS Code, open the Cline MCP settings and add:

{
  "mcpServers": {
    "web-scraper": {
      "command": "python",
      "args": ["/path/to/scraping_mcp_server.py"],
      "env": {
        "PROXY_URL": "http://user:pass@proxy.example.com:8080"
      }
    }
  }
}

After restarting Cline, it will have access to all the scraping tools defined in your MCP server.

Using Cline for Web Scraping Tasks

Example 1: Researching Proxy Providers

Prompt Cline with:

“Use the web scraper tools to fetch the top 10 results from Google for ‘best residential proxies 2026’, then visit each result and extract the proxy providers mentioned, their pricing, and key features. Save the results to a CSV file.”

Cline will:

  1. Call search_google to get search results
  2. Call fetch_page for each result URL
  3. Analyze the content to extract provider names, pricing, and features
  4. Write a Python script to save the data as CSV (a sketch of that script follows this list)
  5. Execute the script and verify the output
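
To give a sense of step 4, here is a minimal sketch of the kind of CSV-writing script Cline might generate. The field names and the providers list are hypothetical placeholders, not output from a real run.

import csv

# hypothetical extracted data; in practice Cline fills this from the fetched pages
providers = [
    {"provider": "Example Proxy Co", "pricing": "$75/mo", "features": "rotating residential, geo-targeting"},
]

with open("proxy_providers.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["provider", "pricing", "features"])
    writer.writeheader()
    writer.writerows(providers)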

Example 2: Monitoring Competitor Prices

“Write a scraper that checks the pricing pages of brightdata.com, oxylabs.io, and smartproxy.com every day. Use the fetch_dynamic_page tool, since these pages use JavaScript. Extract all pricing tiers and save to a JSON file with timestamps.”

Cline will use the MCP tools to fetch each page, analyze the pricing structures, and create an automated script.
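
A minimal sketch of the snapshot-saving half of that script is below. The /pricing paths are assumptions, and the extraction half (driven by the fetch_dynamic_page tool) is omitted.

import json
from datetime import datetime, timezone

# assumed pricing-page URLs; the real paths may differ
TARGETS = [
    "https://brightdata.com/pricing",
    "https://oxylabs.io/pricing",
    "https://smartproxy.com/pricing",
]

def save_snapshot(tiers_by_site: dict) -> None:
    """Append a timestamped pricing snapshot to a JSON Lines file."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "pricing": tiers_by_site,
    }
    with open("pricing_history.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")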

Example 3: Building a Dataset

“I need a dataset of all Python web scraping libraries on GitHub with more than 1000 stars. Use the tools to search GitHub, extract repo details (name, stars, description, last updated), and save as a CSV.”
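
For this kind of task, Cline may skip HTML scraping entirely and use GitHub's public search API, which is a cleaner data source. A sketch under that assumption (unauthenticated requests are rate-limited, so a token may be needed for larger pulls):

import csv
import httpx

resp = httpx.get(
    "https://api.github.com/search/repositories",
    params={"q": "web scraping language:python stars:>1000", "sort": "stars", "per_page": 100},
    headers={"Accept": "application/vnd.github+json"},
    timeout=15,
)
repos = resp.json().get("items", [])

with open("scraping_libraries.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "stars", "description", "last_updated"])
    for repo in repos:
        writer.writerow([
            repo["full_name"],
            repo["stargazers_count"],
            repo["description"] or "",
            repo["updated_at"],
        ])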

Proxy Configuration Best Practices

Using Rotating Residential Proxies

For Cline’s MCP scraping tools, residential proxies give the most reliable results because they are less likely to be blocked:

# enhanced proxy configuration in the MCP server
import random

PROXY_POOL = [
    "http://user:pass@gate.smartproxy.com:7777",
    "http://user:pass@pr.oxylabs.io:7777",
    "http://user:pass@brd.superproxy.io:22225"
]

def get_proxy() -> str:
    """Pick a random proxy from the pool."""
    return random.choice(PROXY_POOL)

# use in fetch functions
async def _fetch_page(url: str, selector: str) -> list[TextContent]:
    proxy = get_proxy()
    async with httpx.AsyncClient(
        proxy=proxy,
        timeout=30
    ) as client:
        response = await client.get(url)
    # ... rest of the function
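
Rotation pairs naturally with retries: if a request fails or gets blocked, trying again through a different proxy often succeeds. A sketch along those lines (the attempt count and status check are illustrative defaults):

async def fetch_with_retry(url: str, max_attempts: int = 3) -> httpx.Response:
    """Retry a request, drawing a fresh proxy from the pool on each attempt."""
    last_error: Exception | None = None
    for _ in range(max_attempts):
        proxy = get_proxy()  # a different proxy each attempt
        try:
            async with httpx.AsyncClient(proxy=proxy, timeout=30) as client:
                response = await client.get(url)
                if response.status_code < 400:
                    return response
        except httpx.HTTPError as e:
            last_error = e
    raise RuntimeError(f"all {max_attempts} attempts failed for {url}") from last_error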

Geo-Targeted Proxies

When scraping location-specific content, configure country-targeted proxies:

GEO_PROXIES = {
    "us": "http://user-country-us:pass@proxy.example.com:8080",
    "uk": "http://user-country-gb:pass@proxy.example.com:8080",
    "de": "http://user-country-de:pass@proxy.example.com:8080",
    "jp": "http://user-country-jp:pass@proxy.example.com:8080"
}

# add a geo parameter to the fetch tool (in practice, append this Tool to the
# list returned by your existing list_tools() handler)
@server.list_tools()
async def list_tools():
    return [
        Tool(
            name="fetch_page_geo",
            description="Fetch a page using a proxy from a specific country.",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {"type": "string"},
                    "country": {
                        "type": "string",
                        "enum": ["us", "uk", "de", "jp"],
                        "description": "Country code for the proxy location"
                    }
                },
                "required": ["url", "country"]
            }
        )
    ]
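
The matching handler is straightforward: pick the proxy for the requested country, then reuse the same fetch-and-clean logic as the static tool. A sketch, assuming it lives in the same server file as the imports and helpers above:

async def _fetch_page_geo(url: str, country: str) -> list[TextContent]:
    """Fetch a page through a country-specific proxy."""
    proxy = GEO_PROXIES[country]
    async with httpx.AsyncClient(proxy=proxy, timeout=30, follow_redirects=True) as client:
        response = await client.get(url)

    soup = BeautifulSoup(response.text, "html.parser")
    text = soup.get_text(separator="\n", strip=True)
    return [TextContent(type="text", text=text[:50000])]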

Advanced Cline Scraping Workflows

Self-Correcting Scrapers

One of Cline’s biggest advantages is error recovery. When a scraper fails, Cline can:

  1. Read the error message
  2. Diagnose the problem (wrong selector, blocked request, timeout)
  3. Modify the approach (try a different selector, switch proxy, add delays)
  4. Retry and verify the fix

This makes Cline effective for scraping unfamiliar sites where you would normally spend time debugging.

Multi-Step Research Pipelines

Cline can chain multiple MCP tool calls to complete complex research tasks:

1. search_google("best proxy providers for web scraping")
2. fetch_page(result_1_url) -> extract provider names
3. fetch_page(provider_1_pricing_url) -> extract pricing
4. fetch_page(provider_1_reviews_url) -> extract ratings
5. repeat for each provider
6. compile into comparison table
7. save as CSV and generate summary

This kind of multi-step pipeline would require significant code in a traditional setup but can be driven entirely by natural language prompts in Cline.

Data Validation and Quality Checks

Ask Cline to validate extracted data:

“Scrape the product listings from example.com/products. After extraction, verify that all prices are valid numbers, all URLs are reachable, and no required fields are missing. Fix any issues and save the clean dataset.”

Cline will scrape, validate, identify problems, and correct them, all in one workflow.
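
Below is a sketch of the validation pass Cline might write for that prompt; the field names (price, url, name) are hypothetical and depend on the site:

import re
import httpx

def validate_listing(listing: dict) -> list[str]:
    """Return a list of problems found in one scraped listing."""
    problems = []
    if not re.fullmatch(r"\d+(\.\d{1,2})?", str(listing.get("price", ""))):
        problems.append("price is not a valid number")
    if not listing.get("name"):
        problems.append("missing required field: name")
    try:
        if httpx.head(listing["url"], timeout=10, follow_redirects=True).status_code >= 400:
            problems.append("url not reachable")
    except (KeyError, httpx.HTTPError):
        problems.append("url missing or request failed")
    return problems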

Security Considerations

When using Cline with MCP scraping tools, keep these security points in mind:

  1. API keys in MCP config: store proxy credentials and API keys in environment variables, not in the MCP server code
  2. Approval mode: keep Cline’s approval mode enabled so you can review each tool call before it executes
  3. Output review: always review the data Cline extracts before using it in production
  4. Rate limiting: add rate limits to your MCP tools to prevent Cline from overwhelming target sites (see the snippet below)
  5. Scope limitation: only expose the MCP tools that Cline needs; do not give it access to destructive operations

# rate limiting in MCP tools
from datetime import datetime, timedelta

LAST_REQUEST = {}
MIN_DELAY = 2  # seconds between requests to same domain

async def rate_limited_fetch(url: str):
    from urllib.parse import urlparse
    domain = urlparse(url).netloc

    now = datetime.now()
    if domain in LAST_REQUEST:
        elapsed = (now - LAST_REQUEST[domain]).total_seconds()
        if elapsed < MIN_DELAY:
            await asyncio.sleep(MIN_DELAY - elapsed)

    LAST_REQUEST[domain] = datetime.now()
    # proceed with fetch
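
One way to wire this in, assuming the helper is trimmed to a pure delay (everything above the final comment), is to await it at the top of each fetch function:

async def _fetch_page(url: str, selector: str) -> list[TextContent]:
    await rate_limited_fetch(url)  # enforce the per-domain delay first
    # ... existing httpx request and parsing logic follows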

Cline vs Other AI Agents for Scraping

Feature              Cline           OpenAI Codex CLI   Cursor Agent   Aider
MCP support          yes             limited            yes            no
Browser access       yes             no                 limited        no
Terminal execution   yes             yes                yes            yes
File editing         yes             yes                yes            yes
Approval workflow    yes             yes                limited        yes
Proxy integration    via MCP         manual             via MCP        manual
Best for             full workflow   code generation    code editing   code editing

Cline’s strength is the combination of MCP tool access, browser automation, and terminal execution. This makes it the most capable option for end-to-end scraping workflows where the agent needs to fetch data, write code, and verify results.

Conclusion

Cline with MCP integration transforms web scraping from a purely coding task into a conversational workflow. You describe what data you need; Cline uses MCP tools to fetch and extract it, writes any necessary code, handles errors, and validates the output. The key to making this work well is a robust MCP server with properly configured proxy support and rate limiting. Start with the basic MCP server template in this guide, connect it to Cline, and experiment with increasingly complex scraping tasks. The agent handles the tedious parts while you focus on what data you actually need and how to use it.
