Qwen Agent MCP Web Scraping: Complete Integration Guide
Qwen Agent is Alibaba’s AI agent framework built on top of the Qwen language model family. It supports tool calling, multi-step reasoning, and integration with external services through the Model Context Protocol (MCP). For web scraping, Qwen Agent offers an interesting combination: a capable open-source model that can run locally, paired with MCP tools for structured web data collection.
This guide walks through setting up Qwen Agent with MCP servers for web scraping, configuring proxy support, and building practical extraction pipelines.
Why Qwen Agent for Web Scraping
Qwen Agent has several advantages for data collection workflows:
- open-source and self-hostable: run the full agent locally without API costs, using Qwen models via Ollama or vLLM
- strong tool calling: Qwen2.5 models have excellent function calling capabilities, comparable to GPT-4o for structured tasks
- multilingual strength: Qwen excels at Chinese, Japanese, Korean, and other Asian languages, making it ideal for scraping non-English content
- MCP compatibility: Qwen Agent supports MCP servers, letting you connect it to custom scraping tools
- cost efficiency: running Qwen locally eliminates per-token API costs, which is significant for high-volume extraction
Qwen Model Options for Scraping
| model | size | use case | memory needed |
|---|---|---|---|
| Qwen2.5-7B | 7B params | basic extraction, simple schemas | 8GB RAM / 6GB VRAM |
| Qwen2.5-14B | 14B params | complex extraction, multi-language | 16GB RAM / 12GB VRAM |
| Qwen2.5-32B | 32B params | advanced reasoning, nested schemas | 32GB RAM / 24GB VRAM |
| Qwen2.5-72B | 72B params | maximum accuracy, research tasks | 64GB RAM / 48GB VRAM |
| Qwen-Plus (API) | cloud | no local resources needed | n/a |
For most scraping tasks, the 14B model hits the sweet spot between accuracy and resource usage. The 7B model handles simple extraction reliably and runs on modest hardware.
Setting Up Qwen Agent
Option 1: Local Setup with Ollama
# install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
# pull a Qwen model
ollama pull qwen2.5:14b
# verify the model is running
ollama list
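Before moving on, it can help to confirm the model actually answers through Ollama's OpenAI-compatible endpoint. The snippet below is a minimal sketch that assumes Ollama's default port (11434) and the qwen2.5:14b model pulled above; adjust both if your setup differs.
# quick sanity check against Ollama's OpenAI-compatible endpoint
# (assumes the default port 11434 and the qwen2.5:14b model pulled above)
import httpx

resp = httpx.post(
    "http://localhost:11434/v1/chat/completions",
    json={
        "model": "qwen2.5:14b",
        "messages": [{"role": "user", "content": "Reply with the single word: ready"}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])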
Option 2: Using Qwen API
If you prefer cloud-hosted Qwen, get an API key from DashScope (Alibaba Cloud):
# set up the Qwen API client
import os
os.environ["DASHSCOPE_API_KEY"] = "your-dashscope-api-key"
Installing Qwen Agent Framework
pip install qwen-agent
pip install mcp httpx beautifulsoup4 markdownify playwright
playwright install chromium
Building an MCP Scraping Server for Qwen
Create an MCP server that provides web scraping tools:
# qwen_scraping_mcp.py
import asyncio
import json
import httpx
from bs4 import BeautifulSoup
import markdownify
from mcp.server import Server
from mcp.types import Tool, TextContent
server = Server("qwen-web-scraper")
# proxy configuration
PROXY_CONFIG = {
"default": "http://user:pass@proxy.example.com:8080",
"us": "http://user-country-us:pass@proxy.example.com:8080",
"cn": "http://user-country-cn:pass@proxy.example.com:8080",
"jp": "http://user-country-jp:pass@proxy.example.com:8080",
"kr": "http://user-country-kr:pass@proxy.example.com:8080"
}
def get_proxy(country: str = "default") -> str:
return PROXY_CONFIG.get(country, PROXY_CONFIG["default"])
@server.list_tools()
async def list_tools():
return [
Tool(
name="scrape_page",
description=(
"Scrape a web page and return its content as clean markdown. "
"supports proxy rotation and country targeting."
),
inputSchema={
"type": "object",
"properties": {
"url": {"type": "string", "description": "URL to scrape"},
"country": {
"type": "string",
"description": "Proxy country code (us, cn, jp, kr)",
"default": "default"
},
"extract_mode": {
"type": "string",
"enum": ["markdown", "text", "links", "tables"],
"default": "markdown"
}
},
"required": ["url"]
}
),
Tool(
name="scrape_dynamic_page",
description="Scrape a JavaScript-rendered page using headless browser with proxy.",
inputSchema={
"type": "object",
"properties": {
"url": {"type": "string"},
"wait_for": {"type": "string", "description": "CSS selector to wait for"},
"country": {"type": "string", "default": "default"}
},
"required": ["url"]
}
),
Tool(
name="search_and_scrape",
description="Search the web and scrape top results.",
inputSchema={
"type": "object",
"properties": {
"query": {"type": "string"},
"num_results": {"type": "integer", "default": 5},
"scrape_results": {
"type": "boolean",
"description": "whether to also scrape each result page",
"default": False
}
},
"required": ["query"]
}
),
Tool(
name="extract_product_data",
description="Extract structured product data (name, price, description, specs) from a product page URL.",
inputSchema={
"type": "object",
"properties": {
"url": {"type": "string"},
"country": {"type": "string", "default": "default"}
},
"required": ["url"]
}
)
]
@server.call_tool()
async def call_tool(name: str, arguments: dict):
if name == "scrape_page":
return await _scrape_page(
arguments["url"],
arguments.get("country", "default"),
arguments.get("extract_mode", "markdown")
)
elif name == "scrape_dynamic_page":
return await _scrape_dynamic(
arguments["url"],
arguments.get("wait_for"),
arguments.get("country", "default")
)
elif name == "search_and_scrape":
return await _search_and_scrape(
arguments["query"],
arguments.get("num_results", 5),
arguments.get("scrape_results", False)
)
elif name == "extract_product_data":
return await _extract_product(
arguments["url"],
arguments.get("country", "default")
)
async def _scrape_page(url: str, country: str, mode: str) -> list[TextContent]:
"""scrape a static page through proxy."""
proxy = get_proxy(country)
async with httpx.AsyncClient(
proxies={"all://": proxy},
timeout=30,
follow_redirects=True
) as client:
response = await client.get(url, headers={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
"Accept-Language": "en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7"
})
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")
# clean noise
for tag in soup.find_all(["script", "style", "nav", "footer", "header", "aside", "iframe"]):
tag.decompose()
if mode == "markdown":
content = markdownify.markdownify(str(soup), strip=["img"])
import re
content = re.sub(r'\n{3,}', '\n\n', content)
elif mode == "text":
content = soup.get_text(separator="\n", strip=True)
elif mode == "links":
links = []
for a in soup.find_all("a", href=True):
links.append({"text": a.get_text(strip=True), "url": a["href"]})
content = json.dumps(links, indent=2)
elif mode == "tables":
tables = []
for table in soup.find_all("table"):
rows = []
for tr in table.find_all("tr"):
cells = [td.get_text(strip=True) for td in tr.find_all(["td", "th"])]
rows.append(cells)
tables.append(rows)
content = json.dumps(tables, indent=2)
# truncate for context window
if len(content) > 40000:
content = content[:40000] + "\n\n[content truncated]"
return [TextContent(type="text", text=content)]
async def _scrape_dynamic(url: str, wait_for: str = None, country: str = "default") -> list[TextContent]:
"""scrape a JavaScript-rendered page."""
from playwright.async_api import async_playwright
proxy = get_proxy(country)
async with async_playwright() as p:
browser = await p.chromium.launch(
headless=True,
proxy={"server": proxy}
)
context = await browser.new_context(
viewport={"width": 1920, "height": 1080},
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
)
page = await context.new_page()
await page.goto(url, wait_until="networkidle", timeout=30000)
if wait_for:
try:
await page.wait_for_selector(wait_for, timeout=10000)
except Exception:
pass
content = await page.content()
await browser.close()
soup = BeautifulSoup(content, "html.parser")
for tag in soup.find_all(["script", "style"]):
tag.decompose()
markdown = markdownify.markdownify(str(soup), strip=["img"])
if len(markdown) > 40000:
markdown = markdown[:40000] + "\n\n[content truncated]"
return [TextContent(type="text", text=markdown)]
async def _search_and_scrape(query: str, num_results: int, scrape: bool) -> list[TextContent]:
"""search and optionally scrape results."""
from urllib.parse import quote
proxy = get_proxy("default")
search_url = f"https://html.duckduckgo.com/html/?q={quote(query)}"
async with httpx.AsyncClient(
proxies={"all://": proxy},
timeout=15
) as client:
response = await client.get(search_url, headers={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
})
soup = BeautifulSoup(response.text, "html.parser")
results = []
for item in soup.select(".result")[:num_results]:
title_el = item.select_one(".result__title a")
snippet_el = item.select_one(".result__snippet")
if title_el:
result = {
"title": title_el.get_text(strip=True),
"url": title_el.get("href", ""),
"snippet": snippet_el.get_text(strip=True) if snippet_el else ""
}
if scrape and result["url"].startswith("http"):
try:
page_content = await _scrape_page(result["url"], "default", "text")
result["content"] = page_content[0].text[:5000]
except Exception as e:
result["content"] = f"error: {e}"
results.append(result)
return [TextContent(type="text", text=json.dumps(results, indent=2, ensure_ascii=False))]
async def _extract_product(url: str, country: str) -> list[TextContent]:
"""extract structured product data from a page."""
page_data = await _scrape_page(url, country, "text")
content = page_data[0].text
# return the content for the Qwen model to extract structured data from
extraction_prompt = {
"page_content": content[:15000],
"requested_fields": [
"product_name",
"price",
"currency",
"description",
"specifications",
"rating",
"review_count",
"availability",
"brand",
"category"
]
}
return [TextContent(type="text", text=json.dumps(extraction_prompt, indent=2, ensure_ascii=False))]
if __name__ == "__main__":
from mcp.server.stdio import stdio_server
async def main():
async with stdio_server() as (read, write):
            await server.run(read, write, server.create_initialization_options())
asyncio.run(main())
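Before connecting the agent, you can smoke-test the server over stdio with the MCP Python client. This is only a rough sketch: it assumes the file above is saved as qwen_scraping_mcp.py and that the client API of your installed mcp package matches.
# stdio smoke test for the scraping server (file name qwen_scraping_mcp.py is assumed)
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    params = StdioServerParameters(command="python", args=["qwen_scraping_mcp.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("available tools:", [t.name for t in tools.tools])
            result = await session.call_tool("scrape_page", {"url": "https://example.com"})
            print(result.content[0].text[:500])

asyncio.run(main())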
Connecting Qwen Agent to the MCP Server
Using qwen-agent Framework
# qwen_scraping_agent.py
from qwen_agent.agents import Assistant
from qwen_agent.tools.base import BaseTool, register_tool
import json
# register MCP tools as qwen-agent tools
@register_tool("web_scraper")
class WebScraperTool(BaseTool):
description = "scrape a web page and return clean content"
parameters = [{
"name": "url",
"type": "string",
"description": "the URL to scrape",
"required": True
}, {
"name": "mode",
"type": "string",
"description": "extraction mode: markdown, text, links, or tables",
"required": False
}]
def call(self, params: str, **kwargs) -> str:
import asyncio
params = json.loads(params)
url = params["url"]
mode = params.get("mode", "text")
# call the MCP server's scraping function
result = asyncio.run(self._fetch(url, mode))
return result
async def _fetch(self, url: str, mode: str) -> str:
import httpx
from bs4 import BeautifulSoup
proxy = "http://user:pass@proxy.example.com:8080"
async with httpx.AsyncClient(
proxies={"all://": proxy},
timeout=30,
follow_redirects=True
) as client:
response = await client.get(url, headers={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
})
soup = BeautifulSoup(response.text, "html.parser")
for tag in soup.find_all(["script", "style", "nav", "footer"]):
tag.decompose()
return soup.get_text(separator="\n", strip=True)[:20000]
@register_tool("web_search")
class WebSearchTool(BaseTool):
description = "search the web and return results"
parameters = [{
"name": "query",
"type": "string",
"description": "search query",
"required": True
}]
def call(self, params: str, **kwargs) -> str:
import asyncio
params = json.loads(params)
return asyncio.run(self._search(params["query"]))
async def _search(self, query: str) -> str:
import httpx
from bs4 import BeautifulSoup
from urllib.parse import quote
proxy = "http://user:pass@proxy.example.com:8080"
url = f"https://html.duckduckgo.com/html/?q={quote(query)}"
async with httpx.AsyncClient(proxies={"all://": proxy}, timeout=15) as client:
response = await client.get(url, headers={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
})
soup = BeautifulSoup(response.text, "html.parser")
results = []
for item in soup.select(".result")[:10]:
title = item.select_one(".result__title a")
snippet = item.select_one(".result__snippet")
if title:
results.append({
"title": title.get_text(strip=True),
"url": title.get("href", ""),
"snippet": snippet.get_text(strip=True) if snippet else ""
})
return json.dumps(results, indent=2)
# create the Qwen agent with scraping tools
def create_scraping_agent():
"""create a Qwen agent configured for web scraping."""
# for local Qwen via Ollama
llm_cfg = {
"model": "qwen2.5:14b",
"model_server": "http://localhost:11434/v1",
"api_key": "ollama"
}
# or for Qwen API
# llm_cfg = {
# "model": "qwen-plus",
# "model_server": "https://dashscope.aliyuncs.com/compatible-mode/v1",
# "api_key": os.environ["DASHSCOPE_API_KEY"]
# }
agent = Assistant(
llm=llm_cfg,
name="Web Scraping Agent",
description="an agent that scrapes and extracts data from websites",
function_list=["web_scraper", "web_search"]
)
return agent
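Recent qwen-agent releases can also attach MCP servers directly through function_list via an mcpServers entry, which lets the agent call the scraping server built earlier instead of re-implementing the tools natively. Treat the exact configuration shape below as an assumption and check the qwen-agent documentation for your installed version.
# alternative: attach the MCP scraping server directly (uses the Assistant import above;
# the "mcpServers" config shape is an assumption based on recent qwen-agent releases)
def create_mcp_scraping_agent():
    llm_cfg = {
        "model": "qwen2.5:14b",
        "model_server": "http://localhost:11434/v1",
        "api_key": "ollama"
    }
    tools = [{
        "mcpServers": {
            "qwen-web-scraper": {
                "command": "python",
                "args": ["qwen_scraping_mcp.py"]
            }
        }
    }]
    return Assistant(
        llm=llm_cfg,
        name="MCP Web Scraping Agent",
        description="an agent that scrapes websites through MCP tools",
        function_list=tools
    )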
Running the Agent
def run_scraping_task(task: str):
"""run a web scraping task using the Qwen agent."""
agent = create_scraping_agent()
messages = [{"role": "user", "content": task}]
responses = []
for response in agent.run(messages=messages):
responses.append(response)
# get the final response
final = responses[-1] if responses else None
return final
# example tasks
tasks = [
"search for 'best residential proxy providers 2026' and extract the top 5 providers with their pricing",
"scrape https://example.com/products and extract all product names and prices into a JSON list",
"find the current pricing for Bright Data, Oxylabs, and SmartProxy by visiting their pricing pages"
]
for task in tasks:
print(f"\ntask: {task}")
result = run_scraping_task(task)
print(f"result: {result}")
Qwen Agent for Multi-Language Scraping
Qwen’s strong multilingual capabilities make it excellent for scraping non-English websites:
Chinese E-Commerce Scraping
def scrape_chinese_products():
agent = create_scraping_agent()
task = (
"scrape the product listings from the provided Chinese e-commerce URL. "
"extract: product name (in both Chinese and English), price in RMB, "
"seller name, and rating. return as a JSON array."
)
messages = [{"role": "user", "content": task}]
for response in agent.run(messages=messages):
if response[-1].get("role") == "assistant":
print(response[-1]["content"])
Japanese Market Research
def research_japanese_market():
agent = create_scraping_agent()
task = (
"search for 'ウェブスクレイピング ツール' (web scraping tools in Japanese) "
"and analyze the top 5 results. extract the tool names, pricing if available, "
"and key features. translate all content to English in the output."
)
messages = [{"role": "user", "content": task}]
for response in agent.run(messages=messages):
print(response[-1].get("content", ""))
Building a Data Pipeline with Qwen Agent
Combine Qwen Agent with data processing for a complete pipeline:
import csv
import json
from datetime import datetime
class QwenDataPipeline:
"""end-to-end data pipeline using Qwen Agent for extraction."""
def __init__(self):
self.agent = create_scraping_agent()
self.results = []
def extract_from_urls(self, urls: list[str], extraction_prompt: str) -> list:
"""extract data from multiple URLs using the Qwen agent."""
for url in urls:
task = f"{extraction_prompt}\n\nURL: {url}"
messages = [{"role": "user", "content": task}]
try:
for response in self.agent.run(messages=messages):
last_msg = response[-1]
if last_msg.get("role") == "assistant":
content = last_msg["content"]
                        # try to parse as JSON; wrap non-dict payloads so metadata can be attached
                        try:
                            data = json.loads(content)
                            if not isinstance(data, dict):
                                data = {"items": data}
                            data["_source_url"] = url
                            data["_extracted_at"] = datetime.utcnow().isoformat()
                            self.results.append(data)
except json.JSONDecodeError:
self.results.append({
"_source_url": url,
"_raw_content": content,
"_extracted_at": datetime.utcnow().isoformat()
})
except Exception as e:
self.results.append({
"_source_url": url,
"_error": str(e),
"_extracted_at": datetime.utcnow().isoformat()
})
return self.results
def save_to_csv(self, filepath: str):
"""save results to CSV."""
if not self.results:
print("no results to save")
return
# collect all field names
all_fields = set()
for r in self.results:
all_fields.update(r.keys())
with open(filepath, "w", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=sorted(all_fields))
writer.writeheader()
for row in self.results:
writer.writerow(row)
print(f"saved {len(self.results)} results to {filepath}")
def save_to_json(self, filepath: str):
"""save results to JSON."""
with open(filepath, "w", encoding="utf-8") as f:
json.dump(self.results, f, indent=2, ensure_ascii=False)
print(f"saved {len(self.results)} results to {filepath}")
# usage
pipeline = QwenDataPipeline()
urls = [
"https://example.com/product/1",
"https://example.com/product/2",
"https://example.com/product/3"
]
prompt = (
"use the web_scraper tool to fetch this page, then extract: "
"product_name, price, description, and availability. "
"return as a JSON object."
)
results = pipeline.extract_from_urls(urls, prompt)
pipeline.save_to_csv("products.csv")
pipeline.save_to_json("products.json")
Performance Optimization
Batch Processing with Concurrency Control
import asyncio
class BatchProcessor:
def __init__(self, max_concurrent: int = 3, delay_seconds: float = 2.0):
self.max_concurrent = max_concurrent
self.delay = delay_seconds
self.semaphore = asyncio.Semaphore(max_concurrent)
async def process_batch(self, urls: list[str], processor_fn) -> list:
"""process multiple URLs with concurrency control."""
results = []
async def process_one(url):
async with self.semaphore:
result = await processor_fn(url)
results.append(result)
await asyncio.sleep(self.delay)
tasks = [process_one(url) for url in urls]
await asyncio.gather(*tasks)
return results
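BatchProcessor is generic: process_batch accepts any async callable that takes a URL. A minimal wiring sketch follows; fetch_text is a hypothetical helper and the proxy URL is a placeholder.
# example wiring for BatchProcessor (fetch_text is illustrative; swap in your own scraper)
import httpx
from bs4 import BeautifulSoup

async def fetch_text(url: str) -> dict:
    proxy = "http://user:pass@proxy.example.com:8080"  # placeholder credentials
    async with httpx.AsyncClient(proxies={"all://": proxy}, timeout=30, follow_redirects=True) as client:
        response = await client.get(url)
        soup = BeautifulSoup(response.text, "html.parser")
        return {"url": url, "text": soup.get_text(separator="\n", strip=True)[:20000]}

async def run_batch(urls: list[str]) -> list:
    processor = BatchProcessor(max_concurrent=3, delay_seconds=2.0)
    return await processor.process_batch(urls, fetch_text)

# results = asyncio.run(run_batch(["https://example.com/a", "https://example.com/b"]))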
Caching Extracted Data
import hashlib
import json
from pathlib import Path
class ExtractionCache:
def __init__(self, cache_dir: str = ".qwen_cache"):
self.cache_dir = Path(cache_dir)
self.cache_dir.mkdir(exist_ok=True)
def _key(self, url: str) -> str:
return hashlib.md5(url.encode()).hexdigest()
def get(self, url: str) -> dict | None:
cache_file = self.cache_dir / f"{self._key(url)}.json"
if cache_file.exists():
return json.loads(cache_file.read_text())
return None
def set(self, url: str, data: dict):
cache_file = self.cache_dir / f"{self._key(url)}.json"
cache_file.write_text(json.dumps(data, indent=2, ensure_ascii=False))
def clear(self):
for f in self.cache_dir.glob("*.json"):
f.unlink()
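A typical pattern is to check the cache before spending tokens on extraction. The wrapper below is a sketch; extract_fn stands in for whatever extraction call you already have, such as a pipeline or agent invocation.
# cache-first wrapper (extract_fn is any callable that takes a URL and returns a dict)
cache = ExtractionCache()

def scrape_with_cache(url: str, extract_fn) -> dict:
    cached = cache.get(url)
    if cached is not None:
        return cached
    data = extract_fn(url)
    cache.set(url, data)
    return data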
Qwen Agent vs Other AI Agents for Scraping
| feature | Qwen Agent | Claude (Cline) | GPT-4o | DeepSeek |
|---|---|---|---|---|
| self-hosted | yes | no | no | yes |
| MCP support | yes | yes | via tools | limited |
| Chinese/CJK content | excellent | good | good | excellent |
| cost (self-hosted) | free | n/a | n/a | free |
| API cost (per 1M tokens) | $0.50-$2.00 | $3.00-$15.00 | $2.50-$10.00 | $0.27-$1.10 |
| extraction accuracy | high | very high | very high | high |
| tool calling reliability | high | very high | very high | high |
Qwen Agent’s main advantages are self-hosting capability and multilingual strength. If you are scraping Chinese, Japanese, or Korean content, Qwen is an excellent choice. For English-only extraction where accuracy is paramount, Claude or GPT-4o may have a slight edge.
Conclusion
Qwen Agent with MCP integration provides a cost-effective, self-hostable solution for AI-powered web scraping. the framework’s tool calling capabilities let you build sophisticated scraping workflows where the AI handles content interpretation and data structuring while your MCP server handles the actual HTTP requests and proxy rotation. start with the MCP server template and qwen-agent tool registration in this guide, then build specialized extraction pipelines for your specific data needs. the ability to run everything locally makes Qwen Agent particularly attractive for high-volume scraping where API costs would otherwise be prohibitive.