Azure Functions provides a cost-effective, scalable platform for web scraping workloads. this guide covers deployment patterns, proxy integration, state management with Azure Storage, and orchestration with Durable Functions for multi-step scraping pipelines.
serverless scraping solves a real infrastructure problem: most scraping workloads are bursty and intermittent, but traditional servers charge for idle time. Azure Functions charges only for execution time, scales from one request to thousands automatically, and integrates natively with Azure’s storage and messaging services.
the sections below walk through the patterns that work well for scraping on Azure Functions, including the constraints you need to work around.
why azure functions for scraping
Azure Functions has a few advantages over AWS Lambda for scraping specifically. the Consumption plan includes 1 million free executions per month, which covers substantial scraping volume at zero cost. the Premium plan adds VNet integration, which lets you route scraping traffic through a specific IP range, useful for IP allowlisting. the execution timeout on the Consumption plan defaults to 5 minutes and can be raised to 10, which is adequate for most single-page scraping tasks; longer pipelines need Durable Functions.
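the timeout is configured in host.json; a minimal example raising it to the Consumption-plan maximum:

```json
{
  "version": "2.0",
  "functionTimeout": "00:10:00"
}
```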
the main constraint is the lack of a built-in browser. Azure Functions runs Python or Node.js code but does not include Chromium. for JavaScript-rendered pages, you need to either call an external browser API service or use a custom Docker container with a browser installed.
basic python scraping function
```python
import azure.functions as func
import urllib.request
import json
import logging
import re

app = func.FunctionApp(http_auth_level=func.AuthLevel.FUNCTION)

@app.route(route="scrape")
def scrape_url(req: func.HttpRequest) -> func.HttpResponse:
    url = req.params.get('url')
    if not url:
        try:
            url = req.get_json().get('url')
        except ValueError:
            url = None
    if not url:
        return func.HttpResponse(
            json.dumps({"error": "url parameter required"}),
            status_code=400,
            mimetype="application/json"
        )

    proxy_url = req.params.get('proxy')
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
    }

    if proxy_url:
        handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
        opener = urllib.request.build_opener(handler)
    else:
        opener = urllib.request.build_opener()

    try:
        request = urllib.request.Request(url, headers=headers)
        with opener.open(request, timeout=20) as response:
            html = response.read().decode('utf-8', errors='replace')

        # extract title
        title_match = re.search(r'<title[^>]*>(.*?)</title>', html, re.IGNORECASE | re.DOTALL)
        title = title_match.group(1).strip() if title_match else ""

        # strip tags for text extraction
        text = re.sub(r'<[^>]+>', ' ', html)
        text = re.sub(r'\s+', ' ', text).strip()[:3000]

        result = {"url": url, "title": title, "text": text, "length": len(html)}
        return func.HttpResponse(
            json.dumps(result),
            mimetype="application/json"
        )
    except Exception as e:
        logging.error(f"scrape failed for {url}: {str(e)}")
        return func.HttpResponse(
            json.dumps({"error": str(e), "url": url}),
            status_code=500,
            mimetype="application/json"
        )
```
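the title extraction can be sanity-checked locally before deploying; a quick check with a pattern like the handler's, on a made-up HTML snippet:

```python
import re

# hypothetical sample page for a local test
sample = "<html><head><title> Example Page </title></head><body><p>hi</p></body></html>"

match = re.search(r'<title[^>]*>(.*?)</title>', sample, re.IGNORECASE | re.DOTALL)
title = match.group(1).strip() if match else ""
print(title)  # Example Page
```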
timer-triggered scraping for scheduled collection
```python
import azure.functions as func
from azure.storage.blob import BlobServiceClient
import urllib.request
import datetime
import logging
import os

# reuses the `app` FunctionApp instance defined in the HTTP example above

@app.timer_trigger(schedule="0 0 */6 * * *", arg_name="timer")  # every 6 hours
def scheduled_scrape(timer: func.TimerRequest) -> None:
    urls_to_scrape = [
        "https://example.com/prices",
        "https://competitor.com/catalog"
    ]

    connection_string = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
    blob_service = BlobServiceClient.from_connection_string(connection_string)
    container = blob_service.get_container_client("scrape-results")

    for url in urls_to_scrape:
        try:
            # scrape the URL
            req = urllib.request.Request(url, headers={
                "User-Agent": "Mozilla/5.0 (compatible; DataBot/1.0)"
            })
            with urllib.request.urlopen(req, timeout=20) as resp:
                content = resp.read().decode('utf-8')

            # store to blob storage under a timestamped name
            timestamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%d_%H%M%S")
            blob_name = f"{url.replace('https://', '').replace('/', '_')}_{timestamp}.html"
            blob_client = container.get_blob_client(blob_name)
            blob_client.upload_blob(content)
        except Exception as e:
            logging.error(f"failed to scrape {url}: {e}")
```
durable functions for multi-step scraping pipelines
Durable Functions solve the orchestration problem: coordinating a scrape that requires fetching a sitemap, then scraping each URL found, then processing results. standard Functions cannot maintain state across calls; Durable Functions can.
```python
import azure.durable_functions as df

def orchestrator_function(context: df.DurableOrchestrationContext):
    # step 1: fetch sitemap
    sitemap_urls = yield context.call_activity("fetch_sitemap", "https://example.com/sitemap.xml")

    # step 2: scrape all URLs in parallel (fan-out)
    scrape_tasks = [context.call_activity("scrape_page", url) for url in sitemap_urls[:50]]
    results = yield context.task_all(scrape_tasks)

    # step 3: store aggregated results (fan-in)
    yield context.call_activity("store_results", results)
    return {"scraped": len(results)}

main = df.Orchestrator.create(orchestrator_function)
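the activity functions referenced by the orchestrator might look like the following sketch; the names match the orchestrator's `call_activity` strings, but the sitemap parsing and headers are illustrative assumptions, not a fixed API:

```python
import re
import urllib.request

def parse_sitemap(xml: str) -> list:
    # pull <loc> entries out of a sitemap document
    return re.findall(r"<loc>(.*?)</loc>", xml)

def fetch_sitemap(sitemap_url: str) -> list:
    with urllib.request.urlopen(sitemap_url, timeout=20) as resp:
        return parse_sitemap(resp.read().decode("utf-8"))

def scrape_page(url: str) -> dict:
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=20) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    return {"url": url, "length": len(body)}
```

in the v1 Durable Functions programming model, each activity lives in its own folder with a function.json binding and `main` pointing at the function.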
proxy integration on azure
for reliable scraping, integrate proxy rotation into your Azure Functions. store proxy credentials in Azure Key Vault and reference them as environment variables in your function app. rotate through proxies using a counter stored in Azure Cache for Redis to maintain consistent distribution across concurrent function instances.
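a minimal sketch of that rotation, assuming the `redis` package, a `REDIS_CONNECTION_STRING` app setting, and a proxy pool passed in by the caller (all names here are assumptions):

```python
import os

def pick_proxy(counter: int, proxies: list) -> str:
    # pure round-robin selection over the pool
    return proxies[counter % len(proxies)]

def next_proxy(proxies: list) -> str:
    # Redis INCR is atomic, so concurrent function instances advance
    # the shared counter without racing each other
    import redis  # Azure Cache for Redis speaks the standard protocol
    r = redis.from_url(os.environ["REDIS_CONNECTION_STRING"])
    return pick_proxy(r.incr("proxy:counter"), proxies)
```

the returned proxy URL can then be fed to the `proxy` parameter of the HTTP scraping function above.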
for the most reliable results on tough targets, use mobile proxies accessed via HTTP from your Azure Function. the function’s outbound IP address does not matter when all requests route through an external proxy.
cold start and performance considerations
Azure Functions on the Consumption plan has cold start delays after periods of inactivity, typically 1-3 seconds and longer for Python apps with heavy dependencies. for scraping workloads this is generally acceptable, since an individual page scrape usually takes longer than the cold start. if cold starts are a problem, upgrade to the Premium plan, which keeps pre-warmed instances ready.
reuse HTTP clients by initializing them at module level (outside the function handler): a requests.Session(), which pools connections, or a urllib.request opener. Azure Functions reuses warm instances, so module-level objects persist across invocations and save setup overhead. learn more about scraping fundamentals in our guide on what web scraping is.
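a minimal sketch of module-level reuse with urllib, matching the stdlib-only style of the examples above:

```python
import urllib.request

# created once per worker process at module load; warm instances
# reuse it across invocations instead of rebuilding it per request
OPENER = urllib.request.build_opener()
OPENER.addheaders = [("User-Agent", "Mozilla/5.0 (compatible; DataBot/1.0)")]

def fetch(url: str) -> str:
    with OPENER.open(url, timeout=20) as resp:
        return resp.read().decode("utf-8", errors="replace")
```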
cost estimation
serverless scraping functions need proxies that just work. pair your Azure setup with our dedicated Singapore mobile proxy for 4G/5G connections that minimize failed requests.
at Consumption plan pricing (as of 2026), 1 million function executions and 400,000 GB-seconds of execution are free per month. a typical page scrape taking 3 seconds with 256MB (0.25 GB) of memory uses 3 × 0.25 = 0.75 GB-seconds. that works out to roughly 533,000 free page scrapes per month before any billing begins, well within the execution-count grant too. for most research and monitoring use cases, Azure Functions scraping runs entirely within the free tier.
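the arithmetic behind that estimate, spelled out:

```python
# free-tier estimate for the Consumption plan
free_gb_seconds = 400_000          # monthly free execution grant
seconds_per_scrape = 3
memory_gb = 256 / 1024             # 256MB expressed in GB

gb_seconds_per_scrape = seconds_per_scrape * memory_gb
free_scrapes = int(free_gb_seconds / gb_seconds_per_scrape)
print(gb_seconds_per_scrape, free_scrapes)  # 0.75 533333
```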
related guides
- what is web scraping? introduction and use cases
- what is a proxy server? complete guide
- SOCKS5 vs HTTP proxy: which should you use?
- what is a mobile proxy? use cases and benefits
sources and further reading
- Azure Functions official documentation
- Azure Durable Functions overview
- Azure Functions pricing details
last updated: April 3, 2026