Qwen Agent MCP Web Scraping: Complete Integration Guide

Qwen Agent is Alibaba’s AI agent framework built on top of the Qwen language model family. It supports tool calling, multi-step reasoning, and integration with external services through the Model Context Protocol (MCP). For web scraping, Qwen Agent offers an interesting combination: a capable open-source model that can run locally, combined with MCP tools for structured web data collection.

This guide walks through setting up Qwen Agent with MCP servers for web scraping, configuring proxy support, and building practical extraction pipelines.

Why Qwen Agent for Web Scraping

Qwen Agent has several advantages for data collection workflows:

  • Open-source and self-hostable: run the full agent locally without API costs, using Qwen models via Ollama or vLLM
  • Strong tool calling: Qwen2.5 models have excellent function calling capabilities, comparable to GPT-4o for structured tasks
  • Multilingual strength: Qwen excels at Chinese, Japanese, Korean, and other Asian languages, making it ideal for scraping non-English content
  • MCP compatibility: Qwen Agent supports MCP servers, letting you connect it to custom scraping tools
  • Cost efficiency: running Qwen locally eliminates per-token API costs, which is significant for high-volume extraction

Qwen Model Options for Scraping

| Model | Size | Use case | Memory needed |
| --- | --- | --- | --- |
| Qwen2.5-7B | 7B params | basic extraction, simple schemas | 8GB RAM / 6GB VRAM |
| Qwen2.5-14B | 14B params | complex extraction, multi-language | 16GB RAM / 12GB VRAM |
| Qwen2.5-32B | 32B params | advanced reasoning, nested schemas | 32GB RAM / 24GB VRAM |
| Qwen2.5-72B | 72B params | maximum accuracy, research tasks | 64GB RAM / 48GB VRAM |
| Qwen-Plus (API) | cloud | no local resources needed | n/a |

For most scraping tasks, the 14B model hits the sweet spot between accuracy and resource usage. The 7B model handles simple extraction reliably and runs on modest hardware.

Setting Up Qwen Agent

Option 1: Local Setup with Ollama

# install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# pull a Qwen model
ollama pull qwen2.5:14b

# verify the model is running
ollama list
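
Before wiring up the agent, it is worth confirming the model answers over Ollama’s OpenAI-compatible endpoint. A minimal sketch, assuming Ollama’s default port 11434 and the qwen2.5:14b tag pulled above:

# ollama_check.py -- sanity check the local Qwen model via Ollama's
# OpenAI-compatible API (assumes the default endpoint on localhost:11434)
import httpx

resp = httpx.post(
    "http://localhost:11434/v1/chat/completions",
    json={
        "model": "qwen2.5:14b",
        "messages": [{"role": "user", "content": "Reply with the single word: ready"}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])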

Option 2: Using Qwen API

If you prefer cloud-hosted Qwen, get an API key from DashScope (Alibaba Cloud):

# set up the Qwen API client
import os
os.environ["DASHSCOPE_API_KEY"] = "your-dashscope-api-key"
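
To verify the key works, you can call DashScope’s OpenAI-compatible endpoint directly (the same base URL used in the llm_cfg examples later in this guide). A minimal sketch:

# dashscope_check.py -- verify the DashScope key against the
# OpenAI-compatible endpoint used later in llm_cfg
import os
import httpx

resp = httpx.post(
    "https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['DASHSCOPE_API_KEY']}"},
    json={"model": "qwen-plus", "messages": [{"role": "user", "content": "ping"}]},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])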

Installing Qwen Agent Framework

pip install qwen-agent
pip install httpx beautifulsoup4 markdownify playwright
playwright install chromium

Building an MCP Scraping Server for Qwen

Create an MCP server that provides web scraping tools:

# qwen_scraping_mcp.py
import asyncio
import json
import re

import httpx
import markdownify
from bs4 import BeautifulSoup
from mcp.server import Server
from mcp.types import Tool, TextContent

server = Server("qwen-web-scraper")

# proxy configuration
PROXY_CONFIG = {
    "default": "http://user:pass@proxy.example.com:8080",
    "us": "http://user-country-us:pass@proxy.example.com:8080",
    "cn": "http://user-country-cn:pass@proxy.example.com:8080",
    "jp": "http://user-country-jp:pass@proxy.example.com:8080",
    "kr": "http://user-country-kr:pass@proxy.example.com:8080"
}

def get_proxy(country: str = "default") -> str:
    return PROXY_CONFIG.get(country, PROXY_CONFIG["default"])

@server.list_tools()
async def list_tools():
    return [
        Tool(
            name="scrape_page",
            description=(
                "Scrape a web page and return its content as clean markdown. "
                "supports proxy rotation and country targeting."
            ),
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "URL to scrape"},
                    "country": {
                        "type": "string",
                        "description": "Proxy country code (us, cn, jp, kr)",
                        "default": "default"
                    },
                    "extract_mode": {
                        "type": "string",
                        "enum": ["markdown", "text", "links", "tables"],
                        "default": "markdown"
                    }
                },
                "required": ["url"]
            }
        ),
        Tool(
            name="scrape_dynamic_page",
            description="Scrape a JavaScript-rendered page using headless browser with proxy.",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {"type": "string"},
                    "wait_for": {"type": "string", "description": "CSS selector to wait for"},
                    "country": {"type": "string", "default": "default"}
                },
                "required": ["url"]
            }
        ),
        Tool(
            name="search_and_scrape",
            description="Search the web and scrape top results.",
            inputSchema={
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "num_results": {"type": "integer", "default": 5},
                    "scrape_results": {
                        "type": "boolean",
                        "description": "whether to also scrape each result page",
                        "default": False
                    }
                },
                "required": ["query"]
            }
        ),
        Tool(
            name="extract_product_data",
            description="Extract structured product data (name, price, description, specs) from a product page URL.",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {"type": "string"},
                    "country": {"type": "string", "default": "default"}
                },
                "required": ["url"]
            }
        )
    ]

@server.call_tool()
async def call_tool(name: str, arguments: dict):
    if name == "scrape_page":
        return await _scrape_page(
            arguments["url"],
            arguments.get("country", "default"),
            arguments.get("extract_mode", "markdown")
        )
    elif name == "scrape_dynamic_page":
        return await _scrape_dynamic(
            arguments["url"],
            arguments.get("wait_for"),
            arguments.get("country", "default")
        )
    elif name == "search_and_scrape":
        return await _search_and_scrape(
            arguments["query"],
            arguments.get("num_results", 5),
            arguments.get("scrape_results", False)
        )
    elif name == "extract_product_data":
        return await _extract_product(
            arguments["url"],
            arguments.get("country", "default")
        )
    raise ValueError(f"unknown tool: {name}")

async def _scrape_page(url: str, country: str, mode: str) -> list[TextContent]:
    """scrape a static page through proxy."""
    proxy = get_proxy(country)

    async with httpx.AsyncClient(
        proxies={"all://": proxy},
        timeout=30,
        follow_redirects=True
    ) as client:
        response = await client.get(url, headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            "Accept-Language": "en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7"
        })
        response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")

    # clean noise
    for tag in soup.find_all(["script", "style", "nav", "footer", "header", "aside", "iframe"]):
        tag.decompose()

    if mode == "markdown":
        content = markdownify.markdownify(str(soup), strip=["img"])
        import re
        content = re.sub(r'\n{3,}', '\n\n', content)

    elif mode == "text":
        content = soup.get_text(separator="\n", strip=True)

    elif mode == "links":
        links = []
        for a in soup.find_all("a", href=True):
            links.append({"text": a.get_text(strip=True), "url": a["href"]})
        content = json.dumps(links, indent=2)

    elif mode == "tables":
        tables = []
        for table in soup.find_all("table"):
            rows = []
            for tr in table.find_all("tr"):
                cells = [td.get_text(strip=True) for td in tr.find_all(["td", "th"])]
                rows.append(cells)
            tables.append(rows)
        content = json.dumps(tables, indent=2)

    # truncate for context window
    if len(content) > 40000:
        content = content[:40000] + "\n\n[content truncated]"

    return [TextContent(type="text", text=content)]

async def _scrape_dynamic(url: str, wait_for: str | None = None, country: str = "default") -> list[TextContent]:
    """scrape a JavaScript-rendered page."""
    from urllib.parse import urlparse

    from playwright.async_api import async_playwright

    proxy = get_proxy(country)

    async with async_playwright() as p:
        # Chromium rejects credentials embedded in the proxy URL, so pass
        # username/password to Playwright separately
        parsed = urlparse(proxy)
        browser = await p.chromium.launch(
            headless=True,
            proxy={
                "server": f"{parsed.scheme}://{parsed.hostname}:{parsed.port}",
                "username": parsed.username or "",
                "password": parsed.password or ""
            }
        )
        context = await browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        )
        page = await context.new_page()

        await page.goto(url, wait_until="networkidle", timeout=30000)

        if wait_for:
            try:
                await page.wait_for_selector(wait_for, timeout=10000)
            except Exception:
                pass

        content = await page.content()
        await browser.close()

    soup = BeautifulSoup(content, "html.parser")
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()

    markdown = markdownify.markdownify(str(soup), strip=["img"])

    if len(markdown) > 40000:
        markdown = markdown[:40000] + "\n\n[content truncated]"

    return [TextContent(type="text", text=markdown)]

async def _search_and_scrape(query: str, num_results: int, scrape: bool) -> list[TextContent]:
    """search and optionally scrape results."""
    from urllib.parse import quote

    proxy = get_proxy("default")
    search_url = f"https://html.duckduckgo.com/html/?q={quote(query)}"

    async with httpx.AsyncClient(
        proxies={"all://": proxy},
        timeout=15
    ) as client:
        response = await client.get(search_url, headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
        })
        response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    results = []

    for item in soup.select(".result")[:num_results]:
        title_el = item.select_one(".result__title a")
        snippet_el = item.select_one(".result__snippet")

        if title_el:
            result = {
                "title": title_el.get_text(strip=True),
                "url": title_el.get("href", ""),
                "snippet": snippet_el.get_text(strip=True) if snippet_el else ""
            }

            if scrape and result["url"].startswith("http"):
                try:
                    page_content = await _scrape_page(result["url"], "default", "text")
                    result["content"] = page_content[0].text[:5000]
                except Exception as e:
                    result["content"] = f"error: {e}"

            results.append(result)

    return [TextContent(type="text", text=json.dumps(results, indent=2, ensure_ascii=False))]

async def _extract_product(url: str, country: str) -> list[TextContent]:
    """extract structured product data from a page."""
    page_data = await _scrape_page(url, country, "text")
    content = page_data[0].text

    # return the content for the Qwen model to extract structured data from
    extraction_prompt = {
        "page_content": content[:15000],
        "requested_fields": [
            "product_name",
            "price",
            "currency",
            "description",
            "specifications",
            "rating",
            "review_count",
            "availability",
            "brand",
            "category"
        ]
    }

    return [TextContent(type="text", text=json.dumps(extraction_prompt, indent=2, ensure_ascii=False))]

if __name__ == "__main__":
    from mcp.server.stdio import stdio_server

    async def main():
        async with stdio_server() as (read, write):
            await server.run(read, write, server.create_initialization_options())

    asyncio.run(main())
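
Before connecting the agent, you can exercise the server directly over stdio with the mcp client library. A minimal test sketch, assuming the same mcp package the server above uses:

# test_scraping_server.py -- drive the MCP server over stdio and call one tool
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    params = StdioServerParameters(command="python", args=["qwen_scraping_mcp.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("tools:", [t.name for t in tools.tools])
            result = await session.call_tool(
                "scrape_page",
                {"url": "https://example.com", "extract_mode": "text"}
            )
            print(result.content[0].text[:500])

asyncio.run(main())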

Connecting Qwen Agent to the MCP Server

Using qwen-agent Framework

# qwen_scraping_agent.py
from qwen_agent.agents import Assistant
from qwen_agent.tools.base import BaseTool, register_tool
import json

# register MCP tools as qwen-agent tools
@register_tool("web_scraper")
class WebScraperTool(BaseTool):
    description = "scrape a web page and return clean content"
    parameters = [{
        "name": "url",
        "type": "string",
        "description": "the URL to scrape",
        "required": True
    }, {
        "name": "mode",
        "type": "string",
        "description": "extraction mode: markdown, text, links, or tables",
        "required": False
    }]

    def call(self, params: str, **kwargs) -> str:
        import asyncio
        params = json.loads(params)
        url = params["url"]
        mode = params.get("mode", "text")

        # fetch and clean the page (mirrors the MCP server's scrape_page tool)
        result = asyncio.run(self._fetch(url, mode))
        return result

    async def _fetch(self, url: str, mode: str) -> str:
        import httpx
        from bs4 import BeautifulSoup

        proxy = "http://user:pass@proxy.example.com:8080"

        async with httpx.AsyncClient(
            proxies={"all://": proxy},
            timeout=30,
            follow_redirects=True
        ) as client:
            response = await client.get(url, headers={
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
            })

        soup = BeautifulSoup(response.text, "html.parser")
        for tag in soup.find_all(["script", "style", "nav", "footer"]):
            tag.decompose()

        return soup.get_text(separator="\n", strip=True)[:20000]

@register_tool("web_search")
class WebSearchTool(BaseTool):
    description = "search the web and return results"
    parameters = [{
        "name": "query",
        "type": "string",
        "description": "search query",
        "required": True
    }]

    def call(self, params: str, **kwargs) -> str:
        import asyncio
        params = json.loads(params)
        return asyncio.run(self._search(params["query"]))

    async def _search(self, query: str) -> str:
        import httpx
        from bs4 import BeautifulSoup
        from urllib.parse import quote

        proxy = "http://user:pass@proxy.example.com:8080"
        url = f"https://html.duckduckgo.com/html/?q={quote(query)}"

        async with httpx.AsyncClient(proxy=proxy, timeout=15) as client:
            response = await client.get(url, headers={
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
            })

        soup = BeautifulSoup(response.text, "html.parser")
        results = []

        for item in soup.select(".result")[:10]:
            title = item.select_one(".result__title a")
            snippet = item.select_one(".result__snippet")
            if title:
                results.append({
                    "title": title.get_text(strip=True),
                    "url": title.get("href", ""),
                    "snippet": snippet.get_text(strip=True) if snippet else ""
                })

        return json.dumps(results, indent=2)

# create the Qwen agent with scraping tools
def create_scraping_agent():
    """create a Qwen agent configured for web scraping."""

    # for local Qwen via Ollama
    llm_cfg = {
        "model": "qwen2.5:14b",
        "model_server": "http://localhost:11434/v1",
        "api_key": "ollama"
    }

    # or for Qwen API
    # llm_cfg = {
    #     "model": "qwen-plus",
    #     "model_server": "https://dashscope.aliyuncs.com/compatible-mode/v1",
    #     "api_key": os.environ["DASHSCOPE_API_KEY"]
    # }

    agent = Assistant(
        llm=llm_cfg,
        name="Web Scraping Agent",
        description="an agent that scrapes and extracts data from websites",
        function_list=["web_scraper", "web_search"]
    )

    return agent
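
Recent qwen-agent releases can also launch stdio MCP servers directly through an mcpServers entry in function_list, which avoids re-implementing the tools as native qwen-agent classes. A hedged sketch; verify the exact schema against your installed qwen-agent version:

# alternative: register the MCP server from earlier directly
# (assumes a qwen-agent version with MCP support; the schema may vary)
from qwen_agent.agents import Assistant

mcp_tools = [{
    "mcpServers": {
        "qwen-web-scraper": {
            "command": "python",
            "args": ["qwen_scraping_mcp.py"]
        }
    }
}]

mcp_agent = Assistant(
    llm={"model": "qwen2.5:14b", "model_server": "http://localhost:11434/v1", "api_key": "ollama"},
    function_list=mcp_tools
)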

Running the Agent

def run_scraping_task(task: str):
    """run a web scraping task using the Qwen agent."""
    agent = create_scraping_agent()

    messages = [{"role": "user", "content": task}]

    responses = []
    for response in agent.run(messages=messages):
        responses.append(response)

    # get the final response
    final = responses[-1] if responses else None
    return final

# example tasks
tasks = [
    "search for 'best residential proxy providers 2026' and extract the top 5 providers with their pricing",
    "scrape https://example.com/products and extract all product names and prices into a JSON list",
    "find the current pricing for Bright Data, Oxylabs, and SmartProxy by visiting their pricing pages"
]

for task in tasks:
    print(f"\ntask: {task}")
    result = run_scraping_task(task)
    print(f"result: {result}")

Qwen Agent for Multi-Language Scraping

Qwen’s strong multilingual capabilities make it excellent for scraping non-English websites:

Chinese E-Commerce Scraping

def scrape_chinese_products():
    agent = create_scraping_agent()

    task = (
        "scrape the product listings from the provided Chinese e-commerce URL. "
        "extract: product name (in both Chinese and English), price in RMB, "
        "seller name, and rating. return as a JSON array."
    )

    messages = [{"role": "user", "content": task}]

    for response in agent.run(messages=messages):
        if response[-1].get("role") == "assistant":
            print(response[-1]["content"])

Japanese Market Research

def research_japanese_market():
    agent = create_scraping_agent()

    task = (
        "search for 'ウェブスクレイピング ツール' (web scraping tools in Japanese) "
        "and analyze the top 5 results. extract the tool names, pricing if available, "
        "and key features. translate all content to English in the output."
    )

    messages = [{"role": "user", "content": task}]

    for response in agent.run(messages=messages):
        print(response[-1].get("content", ""))

Building a Data Pipeline with Qwen Agent

Combine Qwen Agent with data processing for a complete pipeline:

import csv
import json
from datetime import datetime, timezone

class QwenDataPipeline:
    """end-to-end data pipeline using Qwen Agent for extraction."""

    def __init__(self):
        self.agent = create_scraping_agent()
        self.results = []

    def extract_from_urls(self, urls: list[str], extraction_prompt: str) -> list:
        """extract data from multiple URLs using the Qwen agent."""
        for url in urls:
            task = f"{extraction_prompt}\n\nURL: {url}"
            messages = [{"role": "user", "content": task}]

            try:
                for response in self.agent.run(messages=messages):
                    last_msg = response[-1]
                    if last_msg.get("role") == "assistant":
                        content = last_msg["content"]
                        # try to parse as JSON
                        try:
                            data = json.loads(content)
                            data["_source_url"] = url
                            data["_extracted_at"] = datetime.utcnow().isoformat()
                            self.results.append(data)
                        except json.JSONDecodeError:
                            self.results.append({
                                "_source_url": url,
                                "_raw_content": content,
                                "_extracted_at": datetime.utcnow().isoformat()
                            })
            except Exception as e:
                self.results.append({
                    "_source_url": url,
                    "_error": str(e),
                    "_extracted_at": datetime.utcnow().isoformat()
                })

        return self.results

    def save_to_csv(self, filepath: str):
        """save results to CSV."""
        if not self.results:
            print("no results to save")
            return

        # collect all field names
        all_fields = set()
        for r in self.results:
            all_fields.update(r.keys())

        with open(filepath, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=sorted(all_fields))
            writer.writeheader()
            for row in self.results:
                writer.writerow(row)

        print(f"saved {len(self.results)} results to {filepath}")

    def save_to_json(self, filepath: str):
        """save results to JSON."""
        with open(filepath, "w", encoding="utf-8") as f:
            json.dump(self.results, f, indent=2, ensure_ascii=False)

        print(f"saved {len(self.results)} results to {filepath}")

# usage
pipeline = QwenDataPipeline()

urls = [
    "https://example.com/product/1",
    "https://example.com/product/2",
    "https://example.com/product/3"
]

prompt = (
    "use the web_scraper tool to fetch this page, then extract: "
    "product_name, price, description, and availability. "
    "return as a JSON object."
)

results = pipeline.extract_from_urls(urls, prompt)
pipeline.save_to_csv("products.csv")
pipeline.save_to_json("products.json")

Performance Optimization

Batch Processing with Concurrency Control

import asyncio

class BatchProcessor:
    def __init__(self, max_concurrent: int = 3, delay_seconds: float = 2.0):
        self.max_concurrent = max_concurrent
        self.delay = delay_seconds
        self.semaphore = asyncio.Semaphore(max_concurrent)

    async def process_batch(self, urls: list[str], processor_fn) -> list:
        """process multiple URLs with concurrency control."""
        async def process_one(url):
            async with self.semaphore:
                result = await processor_fn(url)
                await asyncio.sleep(self.delay)
                return result

        # gather preserves input order; appending from concurrent tasks would not
        return await asyncio.gather(*(process_one(url) for url in urls))
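
A usage sketch, reusing the _scrape_page helper from the MCP server module (the import assumes qwen_scraping_mcp.py sits on the Python path):

import asyncio
from qwen_scraping_mcp import _scrape_page  # assumed importable

async def fetch_text(url: str) -> str:
    result = await _scrape_page(url, "default", "text")
    return result[0].text

async def main():
    processor = BatchProcessor(max_concurrent=3, delay_seconds=2.0)
    urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
    pages = await processor.process_batch(urls, fetch_text)
    print(f"fetched {len(pages)} pages")

asyncio.run(main())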

Caching Extracted Data

import hashlib
import json
from pathlib import Path

class ExtractionCache:
    def __init__(self, cache_dir: str = ".qwen_cache"):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)

    def _key(self, url: str) -> str:
        return hashlib.md5(url.encode()).hexdigest()

    def get(self, url: str) -> dict | None:
        cache_file = self.cache_dir / f"{self._key(url)}.json"
        if cache_file.exists():
            return json.loads(cache_file.read_text())
        return None

    def set(self, url: str, data: dict):
        cache_file = self.cache_dir / f"{self._key(url)}.json"
        cache_file.write_text(json.dumps(data, indent=2, ensure_ascii=False))

    def clear(self):
        for f in self.cache_dir.glob("*.json"):
            f.unlink()
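
A usage sketch wrapping an arbitrary extraction callable with the cache, so repeated URLs skip the expensive agent call:

# check the cache before doing any agent work, store the result afterwards
cache = ExtractionCache()

def extract_with_cache(url: str, extract_fn) -> dict:
    cached = cache.get(url)
    if cached is not None:
        return cached
    data = extract_fn(url)  # e.g. an agent or pipeline call returning a dict
    cache.set(url, data)
    return data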

Qwen Agent vs Other AI Agents for Scraping

| Feature | Qwen Agent | Claude (Cline) | GPT-4o | DeepSeek |
| --- | --- | --- | --- | --- |
| Self-hosted | yes | no | no | yes |
| MCP support | yes | yes | via tools | limited |
| Chinese/CJK content | excellent | good | good | excellent |
| Cost (self-hosted) | free | n/a | n/a | free |
| API cost (per 1M tokens) | $0.50-$2.00 | $3.00-$15.00 | $2.50-$10.00 | $0.27-$1.10 |
| Extraction accuracy | high | very high | very high | high |
| Tool calling reliability | high | very high | very high | high |

Qwen Agent’s main advantages are self-hosting capability and multilingual strength. If you are scraping Chinese, Japanese, or Korean content, Qwen is an excellent choice. For English-only extraction where accuracy is paramount, Claude or GPT-4o may have a slight edge.

Conclusion

Qwen Agent with MCP integration provides a cost-effective, self-hostable solution for AI-powered web scraping. The framework’s tool calling capabilities let you build sophisticated scraping workflows where the AI handles content interpretation and data structuring while your MCP server handles the actual HTTP requests and proxy rotation. Start with the MCP server template and qwen-agent tool registration in this guide, then build specialized extraction pipelines for your specific data needs. The ability to run everything locally makes Qwen Agent particularly attractive for high-volume scraping where API costs would otherwise be prohibitive.
