Model Context Protocol (MCP) for data engineers in 2026
In eighteen months, MCP has shifted from a curiosity to a foundational protocol for data engineers. Anthropic announced the Model Context Protocol (MCP) in November 2024 as an open standard for connecting LLMs to external data sources and tools. By mid-2026, MCP is supported across Claude Desktop, Claude Code, ChatGPT, Cursor, Windsurf, Continue, Sourcegraph Cody, and a long list of independent agent frameworks. For data engineers running scraping, ETL, and RAG pipelines, MCP changed how agents access data, how data engineering work surfaces to non-engineers, and how downstream products integrate with proprietary datasets. This guide walks through what MCP actually is, the server and client architecture, worked Python implementations for common data-engineering use cases, deployment patterns, and where MCP fits versus alternatives like function calling or RAG-only architectures.
The audience is the data engineer or platform team responsible for making proprietary or scraped data available to AI agents in 2026.
What MCP actually is and is not
MCP is a JSON-RPC 2.0 based protocol that defines how an MCP host (typically an AI agent runtime like Claude Desktop) discovers and calls capabilities exposed by an MCP server. The capabilities come in three classes: tools (callable functions), resources (read-only data), and prompts (reusable prompt templates).
The protocol is intentionally minimal. It does not specify the LLM. It does not mandate a single transport (the spec defines stdio and HTTP-based transports, including SSE and streamable HTTP). It does not require Anthropic infrastructure (the spec is open, the SDKs are MIT-licensed). What it does is standardise the capability metadata and request/response shapes so that any client can talk to any server without bespoke integration.
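On the wire, every interaction is a JSON-RPC 2.0 message. As an illustration (the tool name and arguments are invented for this example), a host invoking a tool sends:

```json
{
  "jsonrpc": "2.0",
  "id": 7,
  "method": "tools/call",
  "params": {
    "name": "get_records",
    "arguments": {"source_id": "acme-news", "limit": 10}
  }
}
```

and the server replies with content blocks the host can hand to the model:

```json
{
  "jsonrpc": "2.0",
  "id": 7,
  "result": {
    "content": [{"type": "text", "text": "[{\"url\": \"...\"}]"}]
  }
}
```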
For a data engineer, MCP is most useful as a way to expose datasets, query interfaces, and pipeline triggers to AI agents in a way that does not require rebuilding integration per agent platform.
For the broader agent landscape, see the agentic browser revolution and AI agents as web users.
The MCP architecture in three roles
MCP has three roles: host, client, server.
The host is the agent runtime. Claude Desktop is a host. Claude Code is a host. Cursor is a host. The host is responsible for managing user trust, presenting the agent’s context to the user, and routing tool/resource calls to the appropriate server.
The client is a session within a host that connects to one specific MCP server. A host can have many concurrent clients, each connected to a different server.
The server is the process that exposes capabilities. A server can expose any combination of tools, resources, and prompts. Servers can run as local subprocesses (stdio transport) or as remote services (HTTP/SSE transport).
For a data engineer, the server is the unit of work. You write a server. You publish it. Your agent users (or your customers’ agent users) install or connect to it.
A minimal MCP server for a scraping pipeline
A typical scraping pipeline exposes four operations: list known sources, trigger a scrape, retrieve scraped records, and check pipeline status. Here is a minimal Python implementation using the official MCP Python SDK (the `mcp` package).
```python
from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import Tool, TextContent
import asyncio
import json

# db, scheduler, and monitor are your application's own layers
# (store client, job queue, pipeline monitor); they are assumed here.
server = Server("scraping-pipeline")


@server.list_tools()
async def list_tools():
    return [
        Tool(
            name="list_sources",
            description="List all known scraping sources.",
            inputSchema={"type": "object", "properties": {}},
        ),
        Tool(
            name="trigger_scrape",
            description="Queue a scrape for a specific source URL.",
            inputSchema={
                "type": "object",
                "properties": {
                    "source_id": {"type": "string"},
                    "max_pages": {"type": "integer", "default": 10},
                },
                "required": ["source_id"],
            },
        ),
        Tool(
            name="get_records",
            description="Fetch scraped records for a source within a date range.",
            inputSchema={
                "type": "object",
                "properties": {
                    "source_id": {"type": "string"},
                    "since": {"type": "string", "format": "date"},
                    "limit": {"type": "integer", "default": 100},
                },
                "required": ["source_id"],
            },
        ),
        Tool(
            name="pipeline_status",
            description="Show pipeline health and recent run summary.",
            inputSchema={"type": "object", "properties": {}},
        ),
    ]


@server.call_tool()
async def call_tool(name: str, arguments: dict):
    if name == "list_sources":
        result = await db.list_sources()
        return [TextContent(type="text", text=json.dumps(result))]
    if name == "trigger_scrape":
        job_id = await scheduler.enqueue(
            arguments["source_id"], arguments.get("max_pages", 10)
        )
        return [TextContent(type="text", text=f"Queued job {job_id}")]
    if name == "get_records":
        records = await db.fetch_records(
            arguments["source_id"],
            arguments.get("since"),
            arguments.get("limit", 100),
        )
        return [TextContent(type="text", text=json.dumps(records))]
    if name == "pipeline_status":
        status = await monitor.summary()
        return [TextContent(type="text", text=json.dumps(status))]
    # Fail loudly on unknown tools so agents get a clear error.
    raise ValueError(f"Unknown tool: {name}")


async def main():
    async with stdio_server() as (read, write):
        await server.run(read, write, server.create_initialization_options())


if __name__ == "__main__":
    asyncio.run(main())
```
The whole server is under a hundred lines of code. An agent connected to it can list sources, trigger scrapes, fetch records, and check status without any agent-specific integration code.
Resources versus tools: when to use each
MCP servers can expose resources alongside tools. The distinction is intentional and matters for how agents reason.
Tools are imperative: they take arguments and return results. They are appropriate for actions (trigger a scrape, send an email, write a file).
Resources are declarative: they have a URI and a content type, and the host can read them on demand. They are appropriate for browsable content (a database table, a file in a blob store, a record from a CRM).
For a scraping pipeline, the typical pattern is:
| Capability | Type | Why |
|---|---|---|
| trigger_scrape | Tool | It is an action with side effects |
| pipeline_status | Tool | It is a runtime query, not browsable |
| list_sources | Resource | Sources are a browsable list |
| get_records | Resource (per-source) | Records are browsable content |
Mixing tools and resources gives the agent a richer mental model of your data surface. Tools-only servers feel like a set of CLI commands. Resource-rich servers feel like a database the agent can query.
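To make the table concrete, here is a hedged sketch of exposing sources as resources with the same SDK and the same hypothetical `db` layer as above. The decorator names (`list_resources`, `read_resource`) follow the official Python SDK, but verify them against the SDK version you pin; the `scrape://` URI scheme is invented for this example.

```python
from mcp.types import Resource

@server.list_resources()
async def list_resources():
    # Each source becomes a browsable, addressable resource.
    sources = await db.list_sources()  # hypothetical data layer
    return [
        Resource(
            uri=f"scrape://sources/{s['id']}",
            name=s["name"],
            mimeType="application/json",
        )
        for s in sources
    ]

@server.read_resource()
async def read_resource(uri):
    # The host reads one source's records on demand.
    source_id = str(uri).rstrip("/").split("/")[-1]
    records = await db.fetch_records(source_id, None, 100)
    return json.dumps(records)
```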
Prompts: the under-used third capability
The third MCP capability, prompts, is the least understood. A prompt is a reusable template that a host can offer to its user. The user invokes it (typically via a slash command), and the host injects the rendered prompt into the conversation.
For a data engineer, prompts are useful for canonical workflows: “summarise yesterday’s scrape”, “diff today’s records against last week’s”, “draft a compliance report for source X”. The prompt definition lives on the server; the user invokes it by name; the host renders the templated content with arguments.
```python
from mcp.types import (
    Prompt,
    PromptArgument,
    PromptMessage,
    GetPromptResult,
    TextContent,
)


@server.list_prompts()
async def list_prompts():
    return [
        Prompt(
            name="summarise_yesterday",
            description="Summarise yesterday's scrape activity for the team.",
            arguments=[],
        ),
        Prompt(
            name="diff_records",
            description="Show changes in records since a specified date.",
            arguments=[
                PromptArgument(name="source_id", required=True),
                PromptArgument(name="since", required=True),
            ],
        ),
    ]


@server.get_prompt()
async def get_prompt(name: str, arguments: dict):
    if name == "summarise_yesterday":
        return GetPromptResult(
            messages=[
                PromptMessage(
                    role="user",
                    content=TextContent(
                        type="text",
                        text="Use the pipeline_status tool to get yesterday's "
                        "activity. Then summarise the new sources, "
                        "successful runs, and any failures.",
                    ),
                )
            ]
        )
    if name == "diff_records":
        return GetPromptResult(
            messages=[
                PromptMessage(
                    role="user",
                    content=TextContent(
                        type="text",
                        text=f"Use the get_records tool to fetch records for "
                        f"{arguments['source_id']} since {arguments['since']}, "
                        "then summarise what changed.",
                    ),
                )
            ]
        )
    raise ValueError(f"Unknown prompt: {name}")
```
Prompts close the loop: tools provide capability, resources provide browsable content, prompts provide canonical workflows.
Deployment patterns: stdio vs HTTP
MCP supports stdio and HTTP/SSE transports. The choice shapes deployment.
Stdio servers run as subprocesses spawned by the host. They are great for personal tools (the user installs the server and connects via Claude Desktop config). They are awful for shared infrastructure (every user runs their own instance, no shared state, no central observability).
HTTP/SSE servers run as long-lived services. They are great for shared infrastructure (one server, many users, shared state, central monitoring). They require authentication, networking, and operational ownership.
For a data engineering team, the typical deployment is HTTP/SSE behind your existing auth gateway. Add MCP to your existing service mesh; route /mcp/* to the MCP server; reuse your existing OIDC or token-based auth.
| Transport | Best for | Setup time | Operational overhead |
|---|---|---|---|
| Stdio | Personal/desktop tools | Minutes | None |
| HTTP/SSE | Team/shared infrastructure | Hours | Standard service ops |
| Streamable HTTP (2025 addition) | Hybrid; better browser support | Hours | Standard service ops |
The 2025 addition of streamable HTTP simplified browser-based hosts and is becoming the default for new HTTP servers.
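For the stdio case, wiring a local server into Claude Desktop is a single JSON config entry in `claude_desktop_config.json`. A minimal example, with an illustrative path to the scraping server above:

```json
{
  "mcpServers": {
    "scraping-pipeline": {
      "command": "python",
      "args": ["/path/to/scraping_server.py"]
    }
  }
}
```

The host spawns the process, speaks MCP over stdin/stdout, and tears it down with the session.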
MCP versus function calling: when to use each
Both MCP and function calling let an LLM invoke external capabilities. The difference is portability.
Function calling is per-LLM-provider. A tool defined for OpenAI’s function calling does not work with Anthropic’s tool use without translation. A change to one schema requires updates to all clients.
MCP is provider-neutral. A tool exposed via MCP works with any MCP-compatible host. The schema is declared once, used everywhere.
| Dimension | Function calling | MCP |
|---|---|---|
| Portability | Per provider | Cross provider |
| Discovery | Static registration | Dynamic at session start |
| Resources | Not standardised | First-class |
| Prompts | Not standardised | First-class |
| Agent platform reuse | Low | High |
| Maintenance overhead | Per provider | Once |
| 2026 ecosystem | Mature per provider | Rapidly growing cross provider |
A data engineering team that picks function calling locks itself into one provider. A team that picks MCP gets cross-provider reach for slightly more upfront work.
Worked use case: RAG over scraped data via MCP
A common pattern in 2026 is to expose a RAG corpus over scraped data through an MCP server, so any agent host can query the corpus naturally. The architecture:
- Scraping pipeline ingests source URLs, normalises HTML, embeds chunks, stores in a vector database.
- MCP server exposes a `search_corpus` tool with arguments for query and filters.
- Agent host connects to the MCP server; user asks a question; agent calls `search_corpus`; corpus returns relevant chunks; agent synthesises an answer.
A minimal Python tool implementation:
```python
@server.call_tool()
async def call_tool(name: str, arguments: dict):
    if name == "search_corpus":
        query = arguments["query"]
        filters = arguments.get("filters", {})
        # embed() and vectordb are your own embedding client and
        # vector store -- the MCP layer is agnostic to both.
        embedding = await embed(query)
        results = await vectordb.query(
            embedding, top_k=arguments.get("top_k", 5), filters=filters
        )
        return [TextContent(type="text", text=json.dumps(results))]
    raise ValueError(f"Unknown tool: {name}")
```
The MCP-facing code is under twenty lines. The vector database, embedder, and ingestion pipeline are independent. The server is the integration surface.
For the deeper RAG-over-scraped-data discussion, see RAG over scraped data production patterns and vector databases for scraping pipelines.
Security and trust model
MCP defines a trust boundary at the host-server connection. The server trusts the host to authenticate the user. The host trusts the server to honour its declared capabilities.
For HTTP transports, authentication is the server’s responsibility. The 2025 spec update added explicit guidance for OAuth 2.1 with PKCE, which is the recommended pattern for enterprise deployments. Bearer tokens work for service-to-service. Mutual TLS works for high-security environments.
Authorisation is the server’s responsibility. A server should enforce per-user permissions; the host’s user identity flows through the authentication layer. A scraping MCP server typically restricts trigger_scrape to authorised users while allowing get_records broadly.
Audit logging is the server’s responsibility. Every tool call should be logged with user identity, timestamp, arguments, and outcome. This is your defence against an agent calling the wrong thing on the wrong dataset.
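A minimal sketch of that logging discipline, assuming a `user_id` resolved by your auth layer and a standard-library logger as the sink (neither is part of the MCP SDK):

```python
import json
import logging
import time

audit = logging.getLogger("mcp.audit")


async def audited_call(user_id: str, name: str, arguments: dict, handler):
    """Wrap a tool handler so every call is logged with identity and outcome."""
    started = time.time()
    outcome = "unknown"
    try:
        result = await handler(name, arguments)
        outcome = "ok"
        return result
    except Exception as exc:
        outcome = f"error:{type(exc).__name__}"
        raise
    finally:
        audit.info(json.dumps({
            "user": user_id,
            "tool": name,
            "arguments": arguments,
            "outcome": outcome,
            "duration_ms": round((time.time() - started) * 1000),
            "ts": started,
        }))
```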
Deployment checklist
| Step | Owner | Done when |
|---|---|---|
| Define capabilities (tools, resources, prompts) | Engineering | Schema agreed, written |
| Implement server (stdio for prototype) | Engineering | Local Claude Desktop test passes |
| Migrate to HTTP for shared use | Engineering | Reachable behind auth gateway |
| Wire authentication (OAuth 2.1 / Bearer) | Security | Unauthenticated calls rejected |
| Wire authorisation (per-tool, per-user) | Security | Permission matrix enforced |
| Add audit logging | Engineering | Every call logged with identity |
| Document for users | Product | Runbook published |
| Publish for installation (registry / docs) | Marketing | Discoverable in MCP registry |
| Monitor and observe | Platform | Metrics dashboard live |
External references
The MCP specification is at modelcontextprotocol.io. The Python SDK is at github.com/modelcontextprotocol/python-sdk. The TypeScript SDK is at github.com/modelcontextprotocol/typescript-sdk. The reference server implementations are at github.com/modelcontextprotocol/servers.
Comparison: MCP vs LangChain Tools vs OpenAI Function Calling
| Dimension | MCP | LangChain Tools | OpenAI Function Calling |
|---|---|---|---|
| Cross-provider | Yes | Partial (with adapters) | No |
| Cross-host | Yes | LangChain runtime only | OpenAI agents only |
| Resource browsability | Yes | No | No |
| Prompts as first class | Yes | No | No |
| Streaming | Yes | Yes | Yes |
| Auth pattern | Spec-defined (OAuth 2.1) | App responsibility | OpenAI’s |
| 2026 ecosystem maturity | High and growing | Mature within LangChain | Mature within OpenAI |
| Discovery | Dynamic at session | Static at runtime | Static per request |
MCP is the cross-provider winner. LangChain Tools are the most flexible if you accept LangChain runtime lock-in. OpenAI function calling is the simplest if you only target OpenAI hosts.
FAQ
Do I need Anthropic infrastructure to use MCP?
No. The protocol is open, the SDKs are MIT-licensed, and any host can implement the protocol.
Can I expose existing REST APIs through MCP?
Yes. A thin MCP server can wrap any HTTP API and expose it as tools and resources.
How does MCP compare to OpenAPI?
MCP is at the agent integration layer; OpenAPI is at the HTTP API description layer. They are complementary; many MCP servers wrap OpenAPI-described services.
What is the right transport for production?
HTTP/SSE or streamable HTTP. Stdio is fine for personal/desktop scenarios, but production sharing needs HTTP.
How do I authenticate users on an HTTP MCP server?
OAuth 2.1 with PKCE is the recommended pattern. Bearer tokens work for service-to-service.
Extended MCP architecture analysis
The Model Context Protocol matured rapidly between its November 2024 launch and 2026. The protocol specification at v1.2 (early 2026) covers four primitives: tools, resources, prompts, and sampling. For data engineers, the tools and resources primitives are central: tools expose callable functions to a model, and resources expose readable, addressable data.
A data-engineering MCP server typically wraps three layers. First, connection management (database connections, API clients, cache). Second, the tool surface (query, insert, transform, validate). Third, the observability surface (logs, metrics, request IDs).
The protocol is transport-agnostic. The two common transports are stdio (process-spawned servers, lowest latency) and HTTP/SSE (network-deployed servers, fan-out across clients). For data pipelines, stdio is preferred for local agents and HTTP/SSE for shared infrastructure.
Production-ready MCP server pattern
```python
from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import Tool, TextContent
import asyncio
import asyncpg

server = Server("data-pipeline")
pool: asyncpg.Pool | None = None


@server.list_tools()
async def list_tools():
    return [
        Tool(
            name="run_query",
            description="Execute a read-only SQL query against the warehouse",
            inputSchema={
                "type": "object",
                "properties": {
                    "sql": {"type": "string"},
                    "limit": {"type": "integer", "default": 1000},
                },
                "required": ["sql"],
            },
        ),
        Tool(
            name="describe_table",
            description="Return schema for a warehouse table",
            inputSchema={
                "type": "object",
                "properties": {"table": {"type": "string"}},
                "required": ["table"],
            },
        ),
    ]


@server.call_tool()
async def call_tool(name: str, arguments: dict):
    if name == "run_query":
        sql = arguments["sql"].strip().rstrip(";")
        # Crude read-only gate; a production server should also connect
        # under a read-only database role rather than trust this check.
        if not sql.lower().startswith("select"):
            return [TextContent(type="text", text="Read-only access. SELECT only.")]
        # Coerce the limit and wrap the query, so a stray LIMIT or
        # trailing clause in the input cannot break the cap.
        limit = int(arguments.get("limit", 1000))
        async with pool.acquire() as conn:
            rows = await conn.fetch(f"SELECT * FROM ({sql}) AS q LIMIT {limit}")
        return [TextContent(type="text", text="\n".join(str(r) for r in rows))]
    if name == "describe_table":
        async with pool.acquire() as conn:
            rows = await conn.fetch(
                "SELECT column_name, data_type FROM information_schema.columns "
                "WHERE table_name = $1",
                arguments["table"],
            )
        return [TextContent(
            type="text",
            text="\n".join(f"{r['column_name']}: {r['data_type']}" for r in rows),
        )]
    raise ValueError(f"Unknown tool: {name}")


async def main():
    global pool
    pool = await asyncpg.create_pool("postgresql://...")  # connection string elided
    async with stdio_server() as (read, write):
        await server.run(read, write, server.create_initialization_options())


if __name__ == "__main__":
    asyncio.run(main())
```
Tool surface design patterns
Effective MCP tool surfaces follow five rules.
- Read and write are separate tools. Never bundle them.
- Every destructive tool requires a confirm token in the input schema.
- Pagination is explicit (cursor or offset/limit) rather than streaming everything.
- Errors return structured content with hints for the model on next steps.
- Long-running tools return a job ID and a separate status tool checks progress.
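Rules two and five are the least obvious in practice. Below is a minimal sketch of both, reusing the `server` and `TextContent` names and the hypothetical `scheduler` layer from the scraping example earlier; the in-memory token and job stores are illustrative only, and a production server would persist them.

```python
import json
import secrets

# Illustrative in-memory stores for confirm tokens and job state.
pending_confirms: dict[str, str] = {}
jobs: dict[str, dict] = {}


@server.call_tool()
async def call_tool(name: str, arguments: dict):
    if name == "drop_partition":
        partition = arguments["partition"]
        token = arguments.get("confirm_token")
        if token is None or token != pending_confirms.get(partition):
            # Rule 2: destructive tools demand a confirm-token round-trip.
            pending_confirms[partition] = secrets.token_hex(8)
            return [TextContent(
                type="text",
                text=f"Destructive operation. Re-call with "
                     f"confirm_token={pending_confirms[partition]}",
            )]
        # Rule 5: long-running work returns a job ID, not a blocking result.
        job_id = await scheduler.enqueue("drop_partition", partition)
        jobs[job_id] = {"status": "queued"}
        return [TextContent(type="text", text=f"Queued as job {job_id}")]
    if name == "job_status":
        job = jobs.get(arguments["job_id"], {"status": "unknown"})
        return [TextContent(type="text", text=json.dumps(job))]
    raise ValueError(f"Unknown tool: {name}")
```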
Comparison: MCP transport choices
| Transport | Latency | Fan-out | Best for |
|---|---|---|---|
| stdio | Low (process IPC) | One client | Local agents, dev tooling |
| HTTP/SSE | Moderate (network) | Many clients | Shared infrastructure |
| WebSocket (proposed) | Low | Bidirectional | Real-time dashboards |
Observability for MCP servers
Production MCP servers should emit four signals.
- Request count by tool name.
- Request latency histogram by tool name.
- Error rate by tool name and error class.
- Token consumption per request (input plus output).
A simple approach is OpenTelemetry plus a Prometheus exporter. The MCP request lifecycle maps cleanly onto OTel spans.
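A minimal sketch with the `opentelemetry-api` package, wrapping the tool dispatcher from the server above in a span; the attribute names here are illustrative, not an MCP semantic convention:

```python
from opentelemetry import trace

tracer = trace.get_tracer("mcp.data-pipeline")


async def traced_call_tool(name: str, arguments: dict):
    # One span per tool call; tool name and outcome become attributes.
    with tracer.start_as_current_span("mcp.tool_call") as span:
        span.set_attribute("mcp.tool.name", name)
        try:
            result = await call_tool(name, arguments)
            span.set_attribute("mcp.tool.outcome", "ok")
            return result
        except Exception as exc:
            span.set_attribute("mcp.tool.outcome", type(exc).__name__)
            span.record_exception(exc)
            raise
```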
Additional FAQ
Is MCP only for Anthropic models?
No. The protocol is open and other model vendors have shipped MCP support.
Can MCP replace REST APIs?
For agent-facing surfaces, yes. For human-facing surfaces, REST and GraphQL remain better fits.
How do I version an MCP tool surface?
Use semantic versioning on the server, expose the version in the initialization handshake, and add new tools rather than mutating existing ones.
What about authentication?
For HTTP transports, the spec's 2025 update recommends OAuth 2.1 with PKCE (see the security section above). Transport-level controls (TLS, mutual TLS, signed headers) complement it; stdio servers inherit the trust of the parent process.
When MCP wins versus when it loses
MCP is not the right answer for every data engineering integration. The protocol shines when the consumer is a model or an agent, and loses when the consumer is a deterministic application or a high-throughput batch job.
MCP wins when the access pattern is exploratory, when the schema is not known in advance, when the operations involve natural language, and when the consumer needs metadata to interpret the data. Examples include a model querying a warehouse for ad-hoc analysis, an agent investigating a customer support issue, and a copilot helping an analyst draft a report.
MCP loses when the access pattern is fixed, when the schema is well-known, when throughput requirements are high, and when latency budgets are tight. Examples include an ETL pipeline ingesting transactions, a real-time dashboard refreshing every second, and a microservice serving a known query at high QPS. For these cases REST, GraphQL, or direct database access remain the right choice.
The decision rule for a data engineering team is to expose the warehouse via REST or GraphQL for application consumers, and to layer MCP on top for agent and copilot consumers. The two surfaces share the underlying connection management and data layer, but expose different abstractions to different audiences.
Tool surface design beyond the basics
A first-pass MCP tool surface tends to expose run_query and describe_table. A production-quality tool surface goes further. The patterns that ship in 2026 include search_tables (for discovery when the agent does not know table names), suggest_join (for analytical queries that span tables), explain_query (for surfacing query plans), and validate_data (for checking data quality assumptions).
Each additional tool reduces the number of round-trips the agent needs to complete a task. An agent that can search, describe, query, and validate in four tool calls is materially more capable than an agent that can only query. The design goal is to anticipate the agent’s needs and expose primitives that satisfy them.
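As an example of the discovery pattern, here is a hedged sketch of a `search_tables` declaration. The description text matters as much as the schema, because it tells the agent when to reach for the tool; the wording here is illustrative.

```python
Tool(
    name="search_tables",
    description=(
        "Find warehouse tables whose names or column names match a "
        "keyword. Call this before describe_table when the exact "
        "table name is unknown."
    ),
    inputSchema={
        "type": "object",
        "properties": {
            "keyword": {"type": "string"},
            "max_results": {"type": "integer", "default": 10},
        },
        "required": ["keyword"],
    },
)
```

The handler can be a single query over `information_schema.columns`; what the agent plans against is the declaration.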
A countervailing concern is tool sprawl. An MCP server with too many tools confuses the agent. The 2026 sweet spot is somewhere between five and twenty tools, with clear non-overlapping purposes. Beyond twenty tools the agent’s planning quality degrades.
Security model for MCP servers
An MCP server that exposes warehouse access creates a powerful attack surface. The security model must address authentication, authorisation, audit, and rate limiting.
Authentication establishes who is connecting. For stdio transport, the parent process is implicitly trusted. For HTTP/SSE transport, the connection should require a token, ideally short-lived and tied to a specific agent identity.
Authorisation determines what the authenticated principal can do. The 2026 pattern is to map the agent’s permissions onto the same role-based model used for human users. An agent acting on behalf of an analyst inherits the analyst’s permissions, plus additional restrictions specific to agent traffic.
Audit captures every tool call with the principal, the arguments, the result summary, and the timestamp. The audit log is the forensic record for incident investigation. The 2026 best practice is to retain MCP audit logs for at least ninety days.
Rate limiting prevents runaway agents from consuming resources. Per-tool, per-principal, and per-server quotas should each be enforced. The 2026 default is conservative quotas that can be raised on request.
MCP versioning and evolution
MCP is a young protocol, and breaking changes are expected as it matures. The 2026 best practice for MCP server operators is to follow semantic versioning, expose the version in the initialization handshake, and support the previous major version for at least six months after a breaking change.
For tool surfaces the rule is to add new tools rather than mutate existing ones. A tool that needs a new argument should be deprecated and replaced with a v2 variant. Existing agents continue to use the v1 tool until they are updated.
For data shapes the rule is similar. A response schema that needs to add a field is safe. A response schema that needs to remove or rename a field requires a versioned response. Many MCP servers expose a content_version metadata field on responses to allow agents to detect and adapt.
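A sketch of that convention, assuming the server attaches a `content_version` field to every payload (the field name follows this article's suggestion, not an MCP standard):

```python
def versioned_payload(records: list[dict]) -> str:
    # Agents branch on content_version instead of sniffing for fields.
    return json.dumps({"content_version": "2.1", "records": records})

# Inside a tool handler:
# return [TextContent(type="text", text=versioned_payload(records))]
```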
Next steps
The fastest first step is to wrap your most-used internal data interface in a minimal MCP server, run it via Claude Desktop, and see how it changes the team’s interaction. The integration overhead is low, the leverage is high. For broader emerging-tech context, head to the DRT emerging-tech hub and pair this with the agentic browser revolution and RAG over scraped data guides.
This guide is informational, not engineering or legal advice.