Model Context Protocol (MCP) for data engineers in 2026

MCP for data engineers has shifted from a curiosity to a foundational protocol in eighteen months. Anthropic announced the Model Context Protocol (MCP) in November 2024 as an open standard for connecting LLMs to external data sources and tools. By mid-2026, MCP is supported across Claude Desktop, Claude Code, ChatGPT, Cursor, Windsurf, Continue, Sourcegraph Cody, and a long list of independent agent frameworks. For data engineers running scraping, ETL, and RAG pipelines, MCP changed how agents access data, how data engineering work surfaces to non-engineers, and how downstream products integrate with proprietary datasets. This guide walks through what MCP actually is, the server and client architecture, worked Python and TypeScript implementations for common data-engineering use cases, deployment patterns, and where MCP fits versus other patterns like function calling or RAG-only.

The audience is the data engineer or platform team responsible for making proprietary or scraped data available to AI agents in 2026.

What MCP actually is and is not

MCP is a JSON-RPC 2.0 based protocol that defines how an MCP host (typically an AI agent runtime like Claude Desktop) discovers and calls capabilities exposed by an MCP server. The capabilities come in three classes: tools (callable functions), resources (read-only data), and prompts (reusable prompt templates).

The protocol is intentionally minimal. It does not specify the LLM. It does not specify the transport at the application layer (the spec defines stdio, HTTP, and SSE transports). It does not require Anthropic infrastructure (the spec is open, the SDK is MIT-licensed). What it does is standardise the metadata format and request/response shape so that any client can talk to any server without bespoke integration.
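
Concretely, a tool invocation and its reply are plain JSON-RPC 2.0 envelopes. The sketch below shows the wire shape of a tools/call exchange; the tool name, arguments, and result values are illustrative, not from a real session:

```python
import json

# Illustrative JSON-RPC 2.0 envelope for an MCP tools/call request.
# The method name follows the MCP spec; tool name and arguments are examples.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "get_records",
        "arguments": {"source_id": "example-source", "limit": 10},
    },
}

# The matching response carries the result under the same request id.
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "content": [{"type": "text", "text": json.dumps([{"id": 1}])}],
        "isError": False,
    },
}

wire = json.dumps(request)
```

Any MCP client can produce this envelope and any MCP server can consume it, which is the whole point of standardising the shape.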

For a data engineer, MCP is most useful as a way to expose datasets, query interfaces, and pipeline triggers to AI agents in a way that does not require rebuilding integration per agent platform.

For the broader agent landscape, see the agentic browser revolution and AI agents as web users.

The MCP architecture in three roles

MCP has three roles: host, client, server.

The host is the agent runtime. Claude Desktop is a host. Claude Code is a host. Cursor is a host. The host is responsible for managing user trust, presenting the agent’s context to the user, and routing tool/resource calls to the appropriate server.

The client is a session within a host that connects to one specific MCP server. A host can have many concurrent clients, each connected to a different server.

The server is the process that exposes capabilities. A server can expose any combination of tools, resources, and prompts. Servers can run as local subprocesses (stdio transport) or as remote services (HTTP/SSE transport).

For a data engineer, the server is the unit of work. You write a server. You publish it. Your agent users (or your customers’ agent users) install or connect to it.

A minimal MCP server for a scraping pipeline

A typical scraping pipeline exposes four operations: list known sources, trigger a scrape, retrieve scraped records, and check pipeline status. Here is a minimal Python implementation using the official MCP Python SDK (the `mcp` package).

from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import Tool, TextContent
import asyncio
import json

# Assumed application objects: `db`, `scheduler`, and `monitor` stand in for
# your pipeline's own database, job queue, and monitoring clients. They are
# not part of MCP; wire in your own implementations.
server = Server("scraping-pipeline")

@server.list_tools()
async def list_tools():
    return [
        Tool(
            name="list_sources",
            description="List all known scraping sources.",
            inputSchema={"type": "object", "properties": {}},
        ),
        Tool(
            name="trigger_scrape",
            description="Queue a scrape for a specific source URL.",
            inputSchema={
                "type": "object",
                "properties": {
                    "source_id": {"type": "string"},
                    "max_pages": {"type": "integer", "default": 10},
                },
                "required": ["source_id"],
            },
        ),
        Tool(
            name="get_records",
            description="Fetch scraped records for a source within a date range.",
            inputSchema={
                "type": "object",
                "properties": {
                    "source_id": {"type": "string"},
                    "since": {"type": "string", "format": "date"},
                    "limit": {"type": "integer", "default": 100},
                },
                "required": ["source_id"],
            },
        ),
        Tool(
            name="pipeline_status",
            description="Show pipeline health and recent run summary.",
            inputSchema={"type": "object", "properties": {}},
        ),
    ]

@server.call_tool()
async def call_tool(name: str, arguments: dict):
    if name == "list_sources":
        result = await db.list_sources()
        return [TextContent(type="text", text=json.dumps(result))]
    elif name == "trigger_scrape":
        job_id = await scheduler.enqueue(
            arguments["source_id"], arguments.get("max_pages", 10)
        )
        return [TextContent(type="text", text=f"Queued job {job_id}")]
    elif name == "get_records":
        records = await db.fetch_records(
            arguments["source_id"],
            arguments.get("since"),
            arguments.get("limit", 100),
        )
        return [TextContent(type="text", text=json.dumps(records))]
    elif name == "pipeline_status":
        status = await monitor.summary()
        return [TextContent(type="text", text=json.dumps(status))]
    raise ValueError(f"Unknown tool: {name}")

async def main():
    async with stdio_server() as (read, write):
        await server.run(read, write, server.create_initialization_options())

if __name__ == "__main__":
    asyncio.run(main())

The server is roughly 60 lines of code. An agent connected to it can list sources, trigger scrapes, fetch records, and check status without any agent-specific integration code.

Resources versus tools: when to use each

MCP servers can expose resources alongside tools. The distinction is intentional and matters for how agents reason.

Tools are imperative: they take arguments and return results. They are appropriate for actions (trigger a scrape, send an email, write a file).

Resources are declarative: they have a URI and a content type, and the host can read them on demand. They are appropriate for browsable content (a database table, a file in a blob store, a record from a CRM).

For a scraping pipeline, the typical pattern is:

Capability | Type | Why
--- | --- | ---
trigger_scrape | Tool | It is an action with side effects
pipeline_status | Tool | It is a runtime query, not browsable
list_sources | Resource | Sources are a browsable list
get_records | Resource (per-source) | Records are browsable content

Mixing tools and resources gives the agent a richer mental model of your data surface. Tools-only servers feel like a set of CLI commands. Resource-rich servers feel like a database the agent can query.
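
One way to model the resource side is to give each source a stable URI so the host can list and read it on demand. A minimal sketch, assuming a custom scrape:// URI scheme (the scheme, helper names, and source list are illustrative):

```python
# Sketch: map pipeline sources onto browsable MCP resource descriptors.
# The scrape:// scheme and the hard-coded source list are illustrative.
SOURCES = [
    {"id": "acme-news", "name": "Acme News"},
    {"id": "widget-prices", "name": "Widget Prices"},
]

def list_source_resources():
    """Build Resource-shaped dicts a server would return from list_resources."""
    return [
        {
            "uri": f"scrape://sources/{s['id']}",
            "name": s["name"],
            "mimeType": "application/json",
        }
        for s in SOURCES
    ]

def parse_source_uri(uri: str) -> str:
    """Recover the source id from a scrape:// resource URI for read_resource."""
    prefix = "scrape://sources/"
    if not uri.startswith(prefix):
        raise ValueError(f"unknown resource URI: {uri}")
    return uri[len(prefix):]
```

The stable URI is what makes the surface feel like a database rather than a CLI: the agent can enumerate, remember, and re-read entries across turns.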

Prompts: the under-used third capability

The third MCP capability, prompts, is the least understood. A prompt is a reusable template that a host can offer to its user. The user invokes it (typically via a slash command), and the host injects the rendered prompt into the conversation.

For a data engineer, prompts are useful for canonical workflows: “summarise yesterday’s scrape”, “diff today’s records against last week’s”, “draft a compliance report for source X”. The prompt definition lives on the server; the user invokes it by name; the host renders the templated content with arguments.

from mcp.types import Prompt, PromptArgument, GetPromptResult, PromptMessage

@server.list_prompts()
async def list_prompts():
    return [
        Prompt(
            name="summarise_yesterday",
            description="Summarise yesterday's scrape activity for the team.",
            arguments=[],
        ),
        Prompt(
            name="diff_records",
            description="Show changes in records since a specified date.",
            arguments=[
                PromptArgument(name="source_id", required=True),
                PromptArgument(name="since", required=True),
            ],
        ),
    ]

@server.get_prompt()
async def get_prompt(name: str, arguments: dict):
    if name == "summarise_yesterday":
        return GetPromptResult(
            messages=[
                PromptMessage(
                    role="user",
                    content=TextContent(
                        type="text",
                        text="Use the pipeline_status tool to get yesterday's "
                             "activity. Then summarise the new sources, "
                             "successful runs, and any failures.",
                    ),
                )
            ]
        )
    # diff_records is rendered analogously from its source_id and since arguments

Prompts close the loop: tools provide capability, resources provide browsable content, prompts provide canonical workflows.

Deployment patterns: stdio vs HTTP

MCP supports stdio and HTTP/SSE transports. The choice shapes deployment.

Stdio servers run as subprocesses spawned by the host. They are great for personal tools (the user installs the server and connects via Claude Desktop config). They are awful for shared infrastructure (every user runs their own instance, no shared state, no central observability).

HTTP/SSE servers run as long-lived services. They are great for shared infrastructure (one server, many users, shared state, central monitoring). They require authentication, networking, and operational ownership.

For a data engineering team, the typical deployment is HTTP/SSE behind your existing auth gateway. Add MCP to your existing service mesh; route /mcp/* to the MCP server; reuse your existing OIDC or token-based auth.

Transport | Best for | Setup time | Operational overhead
--- | --- | --- | ---
Stdio | Personal/desktop tools | Minutes | None
HTTP/SSE | Team/shared infrastructure | Hours | Standard service ops
Streamable HTTP (2025 addition) | Hybrid; better browser support | Hours | Standard service ops

The 2025 addition of streamable HTTP simplified browser-based hosts and is becoming the default for new HTTP servers.

MCP versus function calling: when to use each

Both MCP and function calling let an LLM invoke external capabilities. The difference is portability.

Function calling is per-LLM-provider. A tool defined for OpenAI’s function calling does not work with Anthropic’s tool use without translation. A change to one schema requires updates to all clients.

MCP is provider-neutral. A tool exposed via MCP works with any MCP-compatible host. The schema is declared once, used everywhere.

Dimension | Function calling | MCP
--- | --- | ---
Portability | Per provider | Cross provider
Discovery | Static registration | Dynamic at session start
Resources | Not standardised | First-class
Prompts | Not standardised | First-class
Agent platform reuse | Low | High
Maintenance overhead | Per provider | Once
2026 ecosystem | Mature per provider | Rapidly growing cross provider

A data engineering team that picks function calling locks itself into one provider. A team that picks MCP gets cross-provider reach for slightly more upfront work.

Worked use case: RAG over scraped data via MCP

A common pattern in 2026 is to expose a RAG corpus over scraped data through an MCP server, so any agent host can query the corpus naturally. The architecture:

  1. Scraping pipeline ingests source URLs, normalises HTML, embeds chunks, stores in a vector database.
  2. MCP server exposes a search_corpus tool with arguments for query and filters.
  3. Agent host connects to MCP server; user asks a question; agent calls search_corpus; corpus returns relevant chunks; agent synthesises an answer.

A minimal Python tool implementation:

@server.call_tool()
async def call_tool(name: str, arguments: dict):
    if name == "search_corpus":
        query = arguments["query"]
        filters = arguments.get("filters", {})
        # `embed` and `vectordb` are your own embedding client and vector
        # store; they are application objects, not part of the MCP SDK.
        embedding = await embed(query)
        results = await vectordb.query(
            embedding, top_k=arguments.get("top_k", 5), filters=filters
        )
        return [TextContent(type="text", text=json.dumps(results))]

The MCP tool handler is about a dozen lines. The vector database, embedder, and ingestion pipeline are independent. The server is the integration surface.
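
Step 1 of the architecture above turns normalised text into overlapping chunks before embedding. A minimal sketch of a character-window chunker (the chunk size and overlap are illustrative defaults; production pipelines often chunk on sentence or token boundaries instead):

```python
def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into overlapping windows for embedding.

    Overlap preserves context across boundaries so retrieval does not
    lose sentences that straddle a cut point.
    """
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```
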

For the deeper RAG-over-scraped-data discussion, see RAG over scraped data production patterns and vector databases for scraping pipelines.

Security and trust model

MCP defines a trust boundary at the host-server connection. The server trusts the host to authenticate the user. The host trusts the server to honour its declared capabilities.

For HTTP transports, authentication is the server’s responsibility. The 2025 spec update added explicit guidance for OAuth 2.1 with PKCE, which is the recommended pattern for enterprise deployments. Bearer tokens work for service-to-service. Mutual TLS works for high-security environments.

Authorisation is the server’s responsibility. A server should enforce per-user permissions; the host’s user identity flows through the authentication layer. A scraping MCP server typically restricts trigger_scrape to authorised users while allowing get_records broadly.

Audit logging is the server’s responsibility. Every tool call should be logged with user identity, timestamp, arguments, and outcome. This is your defence against an agent calling the wrong thing on the wrong dataset.
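
A per-call audit record can be as simple as one structured log line per tool invocation. A sketch (the field names are illustrative; adapt them to your logging pipeline):

```python
import json
from datetime import datetime, timezone

def audit_record(user: str, tool: str, arguments: dict, outcome: str) -> str:
    """Serialise one tool call as a JSON log line with identity and timestamp."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "tool": tool,
        "arguments": arguments,
        "outcome": outcome,  # e.g. "ok", "denied", "error"
    }
    return json.dumps(entry, sort_keys=True)
```

Emitting this from the call_tool handler gives you the forensic trail before any heavier observability tooling is in place.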

Deployment checklist

Step | Owner | Done when
--- | --- | ---
Define capabilities (tools, resources, prompts) | Engineering | Schema agreed, written
Implement server (stdio for prototype) | Engineering | Local Claude Desktop test passes
Migrate to HTTP for shared use | Engineering | Reachable behind auth gateway
Wire authentication (OAuth 2.1 / Bearer) | Security | Unauthenticated calls rejected
Wire authorisation (per-tool, per-user) | Security | Permission matrix enforced
Add audit logging | Engineering | Every call logged with identity
Document for users | Product | Runbook published
Publish for installation (registry / docs) | Marketing | Discoverable in MCP registry
Monitor and observe | Platform | Metrics dashboard live

External references

The MCP specification is at modelcontextprotocol.io. The Python SDK is at github.com/modelcontextprotocol/python-sdk. The TypeScript SDK is at github.com/modelcontextprotocol/typescript-sdk. The reference MCP servers are at github.com/modelcontextprotocol/servers.

Comparison: MCP vs LangChain Tools vs OpenAI Function Calling

Dimension | MCP | LangChain Tools | OpenAI Function Calling
--- | --- | --- | ---
Cross-provider | Yes | Partial (with adapters) | No
Cross-host | Yes | LangChain runtime only | OpenAI agents only
Resource browsability | Yes | No | No
Prompts as first class | Yes | No | No
Streaming | Yes | Yes | Yes
Auth pattern | Spec-defined (OAuth 2.1) | App responsibility | OpenAI's
2026 ecosystem maturity | High and growing | Mature within LangChain | Mature within OpenAI
Discovery | Dynamic at session | Static at runtime | Static per request

MCP is the cross-provider winner. LangChain Tools are the most flexible if you accept LangChain runtime lock-in. OpenAI function calling is the simplest if you only target OpenAI hosts.

FAQ

Do I need Anthropic infrastructure to use MCP?
No. The protocol is open, the SDKs are MIT-licensed, and any host can implement the protocol.

Can I expose existing REST APIs through MCP?
Yes. A thin MCP server can wrap any HTTP API and expose it as tools and resources.

How does MCP compare to OpenAPI?
MCP is at the agent integration layer; OpenAPI is at the HTTP API description layer. They are complementary; many MCP servers wrap OpenAPI-described services.

What is the right transport for production?
HTTP/SSE or streamable HTTP. Stdio is fine for personal/desktop scenarios, but production sharing needs HTTP.

How do I authenticate users on an HTTP MCP server?
OAuth 2.1 with PKCE is the recommended pattern. Bearer tokens work for service-to-service.

Extended MCP architecture analysis

The Model Context Protocol matured rapidly between its November 2024 launch and 2026. The protocol specification at v1.2 (early 2026) covers four primitive types, namely tools, resources, prompts, and sampling. For data engineers the tools and resources primitives are central. Tools expose callable functions to a model. Resources expose readable, addressable data.

A data-engineering MCP server typically wraps three layers. First, connection management (database connections, API clients, cache). Second, the tool surface (query, insert, transform, validate). Third, the observability surface (logs, metrics, request IDs).

The protocol is transport-agnostic. The two common transports are stdio (process-spawned servers, lowest latency) and SSE plus HTTP (network-deployed servers, fan-out across clients). For data pipelines stdio is preferred for local agents and SSE for shared infrastructure.

Production-ready MCP server pattern

from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import Tool, TextContent
import asyncio
import asyncpg

server = Server("data-pipeline")
pool = None

@server.list_tools()
async def list_tools():
    return [
        Tool(
            name="run_query",
            description="Execute a read-only SQL query against the warehouse",
            inputSchema={
                "type": "object",
                "properties": {
                    "sql": {"type": "string"},
                    "limit": {"type": "integer", "default": 1000},
                },
                "required": ["sql"],
            },
        ),
        Tool(
            name="describe_table",
            description="Return schema for a warehouse table",
            inputSchema={
                "type": "object",
                "properties": {"table": {"type": "string"}},
                "required": ["table"],
            },
        ),
    ]

@server.call_tool()
async def call_tool(name, arguments):
    if name == "run_query":
        sql = arguments["sql"]
        if not sql.strip().lower().startswith("select"):
            # Naive guard; production should also reject multi-statement input.
            return [TextContent(type="text", text="Read-only access. SELECT only.")]
        limit = int(arguments.get("limit", 1000))  # coerce before interpolating
        async with pool.acquire() as conn:
            rows = await conn.fetch(f"{sql} LIMIT {limit}")
        return [TextContent(type="text", text="\n".join(str(r) for r in rows))]
    if name == "describe_table":
        async with pool.acquire() as conn:
            rows = await conn.fetch(
                "SELECT column_name, data_type FROM information_schema.columns WHERE table_name = $1",
                arguments["table"],
            )
        return [TextContent(type="text", text="\n".join(f"{r['column_name']}: {r['data_type']}" for r in rows))]
    raise ValueError(f"Unknown tool: {name}")

async def main():
    global pool
    pool = await asyncpg.create_pool("postgresql://...")
    async with stdio_server() as (read, write):
        await server.run(read, write, server.create_initialization_options())

if __name__ == "__main__":
    asyncio.run(main())

Tool surface design patterns

Effective MCP tool surfaces follow five rules.

  1. Read and write are separate tools. Never bundle them.
  2. Every destructive tool requires a confirm token in the input schema.
  3. Pagination is explicit (cursor or offset/limit) rather than streaming everything.
  4. Errors return structured content with hints for the model on next steps.
  5. Long-running tools return a job ID and a separate status tool checks progress.
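
Rule 5 can be sketched with an in-memory job store: the long-running tool returns a job ID immediately, and a companion status tool reports progress. The store and state names below are illustrative; production would back this with a queue or database:

```python
import uuid

# Illustrative in-memory job store; production would use a queue or database.
JOBS: dict[str, dict] = {}

def start_job(payload: dict) -> str:
    """Enqueue a long-running task and return its job ID immediately."""
    job_id = uuid.uuid4().hex
    JOBS[job_id] = {"state": "queued", "payload": payload}
    return job_id

def job_status(job_id: str) -> dict:
    """Companion status tool: report progress for a previously queued job."""
    return JOBS.get(job_id, {"state": "unknown"})
```

Splitting start and status keeps every tool call fast, which matters because hosts typically time out slow tool responses.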

Comparison: MCP transport choices

Transport | Latency | Fan-out | Best for
--- | --- | --- | ---
stdio | Low (process IPC) | One client | Local agents, dev tooling
SSE plus HTTP | Moderate (network) | Many clients | Shared infrastructure
WebSocket (proposed) | Low | Bidirectional | Real-time dashboards

Observability for MCP servers

Production MCP servers should emit four signals.

  1. Request count by tool name.
  2. Request latency histogram by tool name.
  3. Error rate by tool name and error class.
  4. Token consumption per request (input plus output).

A simple approach is OpenTelemetry plus a Prometheus exporter. The MCP request lifecycle maps cleanly onto OTel spans.
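
Before wiring up OpenTelemetry, the first three signals can be captured with a tiny in-process recorder wrapped around each tool handler. A sketch (the recorder is illustrative, a stand-in for real OTel instrumentation):

```python
import time
from collections import defaultdict

# Per-tool counts, latency samples, and error counts; a stand-in for OTel metrics.
REQUESTS = defaultdict(int)
LATENCIES = defaultdict(list)
ERRORS = defaultdict(int)

def record_call(tool: str, fn, *args, **kwargs):
    """Run a tool handler while recording count, latency, and errors by tool name."""
    REQUESTS[tool] += 1
    start = time.perf_counter()
    try:
        return fn(*args, **kwargs)
    except Exception:
        ERRORS[tool] += 1
        raise
    finally:
        LATENCIES[tool].append(time.perf_counter() - start)
```

The same wrapper shape maps directly onto an OTel span per call once you swap the dicts for real instruments.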

Additional FAQ

Is MCP only for Anthropic models?
No. The protocol is open and other model vendors have shipped MCP support.

Can MCP replace REST APIs?
For agent-facing surfaces yes. For human-facing surfaces REST and GraphQL remain better fits.

How do I version an MCP tool surface?
Use semantic versioning on the server, expose the version in the initialization handshake, and add new tools rather than mutating existing ones.

What about authentication?
Transport-level auth (TLS, mutual TLS, signed headers) remains common for service-to-service traffic, and the 2025 spec revision added OAuth 2.1 guidance for HTTP transports.

When MCP wins versus when it loses

MCP is not the right answer for every data engineering integration. The protocol shines when the consumer is a model or an agent, and loses when the consumer is a deterministic application or a high-throughput batch job.

MCP wins when the access pattern is exploratory, when the schema is not known in advance, when the operations involve natural language, and when the consumer needs metadata to interpret the data. Examples include a model querying a warehouse for ad-hoc analysis, an agent investigating a customer support issue, and a copilot helping an analyst draft a report.

MCP loses when the access pattern is fixed, when the schema is well-known, when throughput requirements are high, and when latency budgets are tight. Examples include an ETL pipeline ingesting transactions, a real-time dashboard refreshing every second, and a microservice serving a known query at high QPS. For these cases REST, GraphQL, or direct database access remain the right choice.

The decision rule for a data engineering team is to expose the warehouse via REST or GraphQL for application consumers, and to layer MCP on top for agent and copilot consumers. The two surfaces share the underlying connection management and data layer, but expose different abstractions to different audiences.
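
The decision rule can be sketched as one data-layer function reused by two thin adapters. The plain functions below stand in for a real REST route handler and a real MCP tool handler; the framework glue and the warehouse client are omitted:

```python
# One data-layer function, two thin surface adapters. The query function is a
# placeholder for real warehouse access via your connection pool.
def query_warehouse(sql: str, limit: int) -> list[dict]:
    """Shared data layer; in production this hits the warehouse."""
    return [{"sql": sql, "limit": limit}]  # placeholder result

def rest_handler(params: dict) -> dict:
    """REST surface for deterministic application consumers."""
    rows = query_warehouse(params["sql"], int(params.get("limit", 1000)))
    return {"rows": rows}

def mcp_tool_handler(arguments: dict) -> list[dict]:
    """MCP surface for agent consumers: same layer, text-content response shape."""
    rows = query_warehouse(arguments["sql"], int(arguments.get("limit", 1000)))
    return [{"type": "text", "text": str(rows)}]
```

Keeping the adapters thin means authorisation, audit, and connection management live once, in the shared layer, rather than twice.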

Tool surface design beyond the basics

A first-pass MCP tool surface tends to expose run_query and describe_table. A production-quality tool surface goes further. The patterns that ship in 2026 include search_tables (for discovery when the agent does not know table names), suggest_join (for analytical queries that span tables), explain_query (for surfacing query plans), and validate_data (for checking data quality assumptions).

Each additional tool reduces the number of round-trips the agent needs to complete a task. An agent that can search, describe, query, and validate in four tool calls is materially more capable than an agent that can only query. The design goal is to anticipate the agent’s needs and expose primitives that satisfy them.

A countervailing concern is tool sprawl. An MCP server with too many tools confuses the agent. The 2026 sweet spot is somewhere between five and twenty tools, with clear non-overlapping purposes. Beyond twenty tools the agent’s planning quality degrades.

Security model for MCP servers

An MCP server that exposes warehouse access creates a powerful attack surface. The security model must address authentication, authorisation, audit, and rate limiting.

Authentication establishes who is connecting. For stdio transport the parent process is implicitly trusted. For SSE plus HTTP transport the connection should require a token, ideally short-lived and tied to a specific agent identity.

Authorisation determines what the authenticated principal can do. The 2026 pattern is to map the agent’s permissions onto the same role-based model used for human users. An agent acting on behalf of an analyst inherits the analyst’s permissions, plus additional restrictions specific to agent traffic.

Audit captures every tool call with the principal, the arguments, the result summary, and the timestamp. The audit log is the forensic record for incident investigation. The 2026 best practice is to retain MCP audit logs for at least ninety days.

Rate limiting prevents runaway agents from consuming resources. Per-tool, per-principal, and per-server quotas should each be enforced. The 2026 default is conservative quotas that can be raised on request.
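 
Per-principal quotas can be enforced with a token bucket per (principal, tool) pair. A minimal sketch (capacity and refill rate are illustrative defaults):

```python
import time

class TokenBucket:
    """Allow up to `capacity` calls, refilled at `rate` tokens per second."""

    def __init__(self, capacity: float = 10, rate: float = 1.0):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        # Refill based on elapsed time, then spend one token if available.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A denied call should return a structured error telling the agent to back off, rather than a silent failure it may retry immediately.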

MCP versioning and evolution

MCP is a young protocol, and breaking changes are expected as it matures. The 2026 best practice for MCP server operators is to follow semantic versioning, expose the version in the initialization handshake, and support the previous major version for at least six months after a breaking change.

For tool surfaces the rule is to add new tools rather than mutate existing ones. A tool that needs a new argument should be deprecated and replaced with a v2 variant. Existing agents continue to use the v1 tool until they are updated.

For data shapes the rule is similar. A response schema that needs to add a field is safe. A response schema that needs to remove or rename a field requires a versioned response. Many MCP servers expose a content_version metadata field on responses to allow agents to detect and adapt.
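
The add-don't-mutate rule can be sketched as a tool registry that flags the v1 tool as deprecated while both versions remain callable. The registry shape and tool names are illustrative:

```python
# Illustrative registry: new tool versions are added, old ones flagged, never mutated.
TOOLS = {
    "run_query": {"version": 1, "deprecated": True, "replacement": "run_query_v2"},
    "run_query_v2": {"version": 2, "deprecated": False, "replacement": None},
}

def resolve_tool(name: str) -> dict:
    """Return the tool entry, surfacing the suggested replacement if deprecated."""
    entry = TOOLS[name]
    if entry["deprecated"]:
        return {**entry, "hint": f"deprecated; prefer {entry['replacement']}"}
    return entry
```
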

Next steps

The fastest first step is to wrap your most-used internal data interface in a minimal MCP server, run it via Claude Desktop, and see how it changes the team’s interaction. The integration overhead is low, the leverage is high. For broader emerging-tech context, head to the DRT emerging-tech hub and pair this with the agentic browser revolution and RAG over scraped data guides.

This guide is informational, not engineering or legal advice.
