Azure Functions for Web Scraping: Serverless Scraping on Microsoft Cloud

TL;DR
Azure Functions provides a cost-effective, scalable platform for web scraping workloads. This guide covers deployment patterns, proxy integration, state management with Azure Storage, and orchestration with Durable Functions for multi-step scraping pipelines.

Serverless scraping solves a real infrastructure problem: most scraping workloads are bursty and intermittent, but traditional servers charge for idle time. Azure Functions charges only for execution time, scales from one request to thousands automatically, and integrates natively with Azure’s storage and messaging services.

This guide covers the patterns that work well for scraping on Azure Functions, including the constraints you need to work around.

why azure functions for scraping

Azure Functions has a few advantages over AWS Lambda for scraping specifically. The Consumption plan includes 1 million free executions per month, which covers substantial scraping volume at zero cost. The Premium plan adds VNet integration, which lets you route scraping traffic through a specific IP range, useful for IP allowlisting. Function execution timeout on the Consumption plan is up to 10 minutes (extendable with Durable Functions), which is adequate for most page scraping tasks.

The main constraint is the lack of a built-in browser. Azure Functions runs Python or Node.js code but does not include Chromium. For JavaScript-rendered pages, you either call an external browser API service or deploy a custom Docker container with a browser installed.
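A minimal sketch of the external-API route, assuming a hypothetical rendering service; the RENDER_API_URL endpoint, the POST payload shape, and the response format are placeholders, not a specific vendor's API:

import os
import json
import urllib.request

# hypothetical rendering service; endpoint and payload shape are
# placeholders, not a specific vendor's API
RENDER_API_URL = os.environ.get("RENDER_API_URL", "https://render.example.com/render")

def fetch_rendered(url: str) -> str:
    payload = json.dumps({"url": url}).encode("utf-8")
    req = urllib.request.Request(
        RENDER_API_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        # assume the service returns the rendered HTML as the response body
        return resp.read().decode("utf-8", errors="replace")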

basic python scraping function

import azure.functions as func
import urllib.request
import json
import logging
import re

app = func.FunctionApp(http_auth_level=func.AuthLevel.FUNCTION)

@app.route(route="scrape")
def scrape_url(req: func.HttpRequest) -> func.HttpResponse:
    url = req.params.get('url')
    if not url:
        try:
            # get_json() raises ValueError on a missing or non-JSON body
            url = (req.get_json() or {}).get('url')
        except ValueError:
            url = None
    
    if not url:
        return func.HttpResponse(
            json.dumps({"error": "url parameter required"}),
            status_code=400,
            mimetype="application/json"
        )
    
    proxy_url = req.params.get('proxy')
    
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
    }
    
    if proxy_url:
        handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
        opener = urllib.request.build_opener(handler)
    else:
        opener = urllib.request.build_opener()
    
    try:
        request = urllib.request.Request(url, headers=headers)
        with opener.open(request, timeout=20) as response:
            html = response.read().decode('utf-8', errors='replace')
        
        # extract title
        title_match = re.search(r'<title[^>]*>(.*?)</title>', html, re.IGNORECASE | re.DOTALL)
        title = title_match.group(1).strip() if title_match else ""
        
        # strip tags for text extraction
        text = re.sub(r'<[^>]+>', ' ', html)
        text = re.sub(r'\s+', ' ', text).strip()[:3000]
        
        result = {"url": url, "title": title, "text": text, "length": len(html)}
        return func.HttpResponse(
            json.dumps(result),
            mimetype="application/json"
        )
    except Exception as e:
        logging.error(f"scrape failed for {url}: {str(e)}")
        return func.HttpResponse(
            json.dumps({"error": str(e), "url": url}),
            status_code=500,
            mimetype="application/json"
        )
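
With http_auth_level=func.AuthLevel.FUNCTION, callers must supply a function key, either as the code query parameter or the x-functions-key header. A sample invocation (app name and key are placeholders):

curl "https://<your-app>.azurewebsites.net/api/scrape?url=https://example.com&code=<function-key>"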

timer-triggered scraping for scheduled collection

import azure.functions as func
from azure.storage.blob import BlobServiceClient
import urllib.request
import datetime
import logging
import os

# every 6 hours; timer schedules use six-field NCRONTAB expressions
# ({second} {minute} {hour} {day} {month} {day-of-week});
# app is the FunctionApp instance declared in the first example
@app.timer_trigger(schedule="0 0 */6 * * *", arg_name="timer")
def scheduled_scrape(timer: func.TimerRequest) -> None:
    urls_to_scrape = [
        "https://example.com/prices",
        "https://competitor.com/catalog"
    ]
    
    connection_string = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
    blob_service = BlobServiceClient.from_connection_string(connection_string)
    container = blob_service.get_container_client("scrape-results")
    
    for url in urls_to_scrape:
        try:
            # scrape the URL
            req = urllib.request.Request(url, headers={
                "User-Agent": "Mozilla/5.0 (compatible; DataBot/1.0)"
            })
            with urllib.request.urlopen(req, timeout=20) as resp:
                content = resp.read().decode('utf-8')
            
            # store to blob storage
            timestamp = datetime.datetime.utcnow().strftime("%Y%m%d_%H%M%S")
            blob_name = f"{url.replace('https://', '').replace('/', '_')}_{timestamp}.html"
            blob_client = container.get_blob_client(blob_name)
            blob_client.upload_blob(content)
            
        except Exception as e:
            logging.error(f"failed to scrape {url}: {e}")

durable functions for multi-step scraping pipelines

Durable Functions solve the orchestration problem: coordinating a scrape that first fetches a sitemap, then scrapes each URL it finds, then processes the results. Standard functions cannot maintain state across calls; Durable Functions can.

import azure.durable_functions as df

def orchestrator_function(context: df.DurableOrchestrationContext):
    # step 1: fetch sitemap
    sitemap_urls = yield context.call_activity("fetch_sitemap", "https://example.com/sitemap.xml")
    
    # step 2: scrape all URLs in parallel (fan-out)
    scrape_tasks = [context.call_activity("scrape_page", url) for url in sitemap_urls[:50]]
    results = yield context.task_all(scrape_tasks)
    
    # step 3: store aggregated results
    yield context.call_activity("store_results", results)
    
    return {"scraped": len(results)}

main = df.Orchestrator.create(orchestrator_function)
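
The orchestrator only coordinates; the activities it names do the actual I/O, because orchestrator code must stay deterministic for replay. A minimal sketch of the scrape_page activity (v1 programming model; fetch_sitemap and store_results follow the same shape):

import urllib.request

def main(url: str) -> dict:
    # registered as the "scrape_page" activity via its function.json binding
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=20) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return {"url": url, "length": len(html)}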

proxy integration on azure

For reliable scraping, integrate proxy rotation into your Azure Functions. Store proxy credentials in Azure Key Vault and reference them as environment variables in your function app. Rotate through proxies using a counter stored in Azure Cache for Redis, which keeps the distribution consistent across concurrent function instances.
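
A minimal rotation sketch, assuming the redis-py package, a REDIS_URL connection string, and a comma-separated PROXY_LIST app setting (all placeholder names):

import os
import redis

# placeholder app setting; in practice back it with a Key Vault reference
PROXIES = os.environ["PROXY_LIST"].split(",")

r = redis.Redis.from_url(os.environ["REDIS_URL"])

def next_proxy() -> str:
    # INCR is atomic, so concurrent function instances share one counter
    # and cycle through the pool evenly
    counter = r.incr("proxy_rotation_counter")
    return PROXIES[counter % len(PROXIES)]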

For the most reliable results on tough targets, use mobile proxies accessed via HTTP from your Azure Function. The function’s outbound IP address does not matter when all requests route through an external proxy.

cold start and performance considerations

Azure Functions on the Consumption plan have cold start delays of 1-3 seconds after periods of inactivity. For scraping workloads this is generally acceptable, since an individual page scrape takes longer than the cold start. If cold starts are a problem, upgrade to the Premium plan, which keeps pre-warmed instances running.

Reuse HTTP connections by initializing a requests.Session() object at module level, outside the function handler; plain urllib.request opens a fresh connection per request, so a session is what actually gives you pooling. Azure Functions reuses warm instances, so module-level objects persist across invocations and save connection setup overhead. Learn more about scraping fundamentals in our guide on what web scraping is.
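
A minimal sketch, assuming the requests package is listed in requirements.txt:

import azure.functions as func
import requests

# created once per worker instance; warm instances reuse it across
# invocations, so pooled connections survive between requests
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"})

app = func.FunctionApp(http_auth_level=func.AuthLevel.FUNCTION)

@app.route(route="fetch")
def fetch(req: func.HttpRequest) -> func.HttpResponse:
    url = req.params.get("url")
    if not url:
        return func.HttpResponse("url parameter required", status_code=400)
    resp = session.get(url, timeout=20)
    return func.HttpResponse(resp.text, mimetype="text/html")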

cost estimation

Serverless scraping functions need proxies that just work. Pair your Azure setup with our dedicated Singapore mobile proxy for 4G/5G connections that minimize failed requests.

At Consumption plan pricing (as of 2026), 1 million function executions and 400,000 GB-seconds of execution time are free per month. A typical page scrape taking 3 seconds at 256 MB of memory uses 0.75 GB-seconds, which works out to roughly 533,000 free page scrapes per month before any billing begins. For most research and monitoring use cases, Azure Functions scraping runs entirely within the free tier.
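
The same arithmetic as a quick sanity check:

# free-tier math for the Consumption plan figures above
free_gb_seconds = 400_000                  # monthly free execution-time grant
gb_seconds_per_scrape = (256 / 1024) * 3   # 256 MB for 3 seconds = 0.75 GB-s
free_scrapes = free_gb_seconds / gb_seconds_per_scrape
print(f"~{free_scrapes:,.0f} free scrapes per month")  # ~533,333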

last updated: April 3, 2026
