Google Cloud Serverless Scraping Pipeline: Complete Guide
running web scrapers on a dedicated server means paying for compute 24/7 even when your scrapers only run a few hours per day. serverless architecture flips this model: you pay only when your code executes, capacity scales automatically from 1 to 10,000 concurrent requests, and there are no servers to manage.
Google Cloud Platform offers a particularly strong set of services for building scraping pipelines. Cloud Functions handle the scraping logic, Pub/Sub manages the work queue, Cloud Scheduler triggers runs, BigQuery stores the results, and Cloud Storage holds raw HTML for reprocessing.
this guide walks you through building a production-grade serverless scraping pipeline on GCP from scratch.
Architecture Overview
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│    Cloud     │────>│   Pub/Sub    │────>│    Cloud     │
│  Scheduler   │     │    Topic     │     │   Function   │
│(cron trigger)│     │ (URL queue)  │     │  (scraper)   │
└──────────────┘     └──────────────┘     └───────┬──────┘
                                                  │
                                         ┌────────┴────────┐
                                         │                 │
                                   ┌─────▼─────┐     ┌─────▼─────┐
                                   │ BigQuery  │     │   Cloud   │
                                   │(structured│     │  Storage  │
                                   │   data)   │     │(raw HTML) │
                                   └───────────┘     └───────────┘
the flow works like this:
- Cloud Scheduler publishes URLs to a Pub/Sub topic on a schedule
- each URL triggers a Cloud Function instance
- the function scrapes the page, extracts data, and stores results
- structured data goes to BigQuery for querying
- raw HTML goes to Cloud Storage for reprocessing if your parser changes
Prerequisites
you will need:
- a GCP account with billing enabled
- the gcloud CLI installed and configured
- Python 3.10+
- basic familiarity with GCP services
install the required Python packages:
pip install google-cloud-pubsub google-cloud-bigquery \
google-cloud-storage google-cloud-functions-framework \
requests beautifulsoup4 lxml
Step 1: Set Up GCP Resources
create the necessary GCP resources using the gcloud CLI:
# set your project
export PROJECT_ID="your-project-id"
gcloud config set project $PROJECT_ID
# enable required APIs
gcloud services enable cloudfunctions.googleapis.com
gcloud services enable pubsub.googleapis.com
gcloud services enable cloudscheduler.googleapis.com
gcloud services enable bigquery.googleapis.com
gcloud services enable storage.googleapis.com
# create Pub/Sub topic and subscription
gcloud pubsub topics create scraping-urls
gcloud pubsub subscriptions create scraping-urls-sub \
--topic=scraping-urls \
--ack-deadline=300
# create a dead letter topic for failed messages
gcloud pubsub topics create scraping-dead-letter
gcloud pubsub subscriptions create scraping-dead-letter-sub \
--topic=scraping-dead-letter
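# optionally attach the dead letter topic to the main subscription so messages
# that keep failing are parked instead of retried forever (the Pub/Sub service
# account also needs publish rights on the dead letter topic)
gcloud pubsub subscriptions update scraping-urls-sub \
  --dead-letter-topic=scraping-dead-letter \
  --max-delivery-attempts=5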
# create Cloud Storage bucket for raw HTML
gsutil mb -l us-central1 gs://${PROJECT_ID}-scraping-raw
# create BigQuery dataset
bq mk --dataset ${PROJECT_ID}:scraping_data
Step 2: Define the BigQuery Schema
create a table to store your scraped data:
bq mk --table \
${PROJECT_ID}:scraping_data.listings \
'url:STRING,title:STRING,price:FLOAT,description:STRING,category:STRING,seller:STRING,location:STRING,scraped_at:TIMESTAMP,source_domain:STRING,raw_html_path:STRING,extraction_version:STRING'
or define a richer schema in code for more complex tables; the Cloud Function below writes the extra fields in this version, so create the listings table from it:
# schema.py
LISTINGS_SCHEMA = [
{"name": "url", "type": "STRING", "mode": "REQUIRED"},
{"name": "title", "type": "STRING", "mode": "NULLABLE"},
{"name": "price", "type": "FLOAT", "mode": "NULLABLE"},
{"name": "currency", "type": "STRING", "mode": "NULLABLE"},
{"name": "description", "type": "STRING", "mode": "NULLABLE"},
{"name": "category", "type": "STRING", "mode": "NULLABLE"},
{"name": "seller", "type": "STRING", "mode": "NULLABLE"},
{"name": "location", "type": "STRING", "mode": "NULLABLE"},
{"name": "images", "type": "STRING", "mode": "REPEATED"},
{"name": "metadata", "type": "JSON", "mode": "NULLABLE"},
{"name": "scraped_at", "type": "TIMESTAMP", "mode": "REQUIRED"},
{"name": "source_domain", "type": "STRING", "mode": "REQUIRED"},
{"name": "raw_html_path", "type": "STRING", "mode": "NULLABLE"},
{"name": "extraction_version", "type": "STRING", "mode": "REQUIRED"},
{"name": "proxy_used", "type": "STRING", "mode": "NULLABLE"},
{"name": "response_time_ms", "type": "INTEGER", "mode": "NULLABLE"},
]
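a minimal sketch of creating the table from this schema with the Python client; it assumes the schema.py module above and a reasonably recent google-cloud-bigquery (for the JSON column type), and it partitions by scraped_at and clusters on source_domain to keep later query costs down:

# create_table.py
from google.cloud import bigquery

from schema import LISTINGS_SCHEMA


def create_listings_table(dataset="scraping_data", table="listings"):
    client = bigquery.Client()
    schema = [
        bigquery.SchemaField(f["name"], f["type"], mode=f["mode"])
        for f in LISTINGS_SCHEMA
    ]
    table_obj = bigquery.Table(f"{client.project}.{dataset}.{table}", schema=schema)
    # daily partitions on scraped_at plus clustering on source_domain reduce
    # the bytes scanned by time- and domain-filtered queries
    table_obj.time_partitioning = bigquery.TimePartitioning(field="scraped_at")
    table_obj.clustering_fields = ["source_domain"]
    created = client.create_table(table_obj, exists_ok=True)
    print(f"created {created.full_table_id}")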
Step 3: Build the Scraping Cloud Function
this is the core of the pipeline. each Pub/Sub message contains a URL to scrape:
# main.py
import base64
import json
import time
import os
from datetime import datetime, timezone
import requests
from bs4 import BeautifulSoup
from google.cloud import bigquery, storage
# configuration
PROXY_URL = os.environ.get("PROXY_URL", "")
BQ_DATASET = os.environ.get("BQ_DATASET", "scraping_data")
BQ_TABLE = os.environ.get("BQ_TABLE", "listings")
GCS_BUCKET = os.environ.get("GCS_BUCKET", "")
EXTRACTION_VERSION = "v1.0"
def scrape_url(event, context):
"""Cloud Function triggered by Pub/Sub message."""
# decode the Pub/Sub message
message_data = base64.b64decode(event["data"]).decode("utf-8")
payload = json.loads(message_data)
url = payload["url"]
source = payload.get("source", "unknown")
parser_type = payload.get("parser", "generic")
print(f"scraping: {url}")
try:
# fetch the page
html, response_time = fetch_page(url)
if not html:
print(f"failed to fetch {url}")
return
# save raw HTML to Cloud Storage
raw_path = save_raw_html(url, html)
# parse the page
data = parse_page(html, parser_type)
if not data:
print(f"failed to parse {url}")
return
# enrich with metadata
data["url"] = url
data["scraped_at"] = datetime.now(timezone.utc).isoformat()
data["source_domain"] = source
data["raw_html_path"] = raw_path
data["extraction_version"] = EXTRACTION_VERSION
data["proxy_used"] = "residential" if PROXY_URL else "direct"
data["response_time_ms"] = int(response_time * 1000)
# save to BigQuery
save_to_bigquery(data)
print(f"successfully scraped and stored: {url}")
except Exception as e:
print(f"error processing {url}: {e}")
raise # let Pub/Sub retry
def fetch_page(url):
"""fetch a web page with proxy support."""
headers = {
"User-Agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/120.0.0.0 Safari/537.36"
),
"Accept": "text/html,application/xhtml+xml",
"Accept-Language": "en-US,en;q=0.9",
}
proxies = {}
if PROXY_URL:
proxies = {"http": PROXY_URL, "https": PROXY_URL}
start = time.time()
response = requests.get(
url,
headers=headers,
proxies=proxies,
timeout=30,
)
response_time = time.time() - start
if response.status_code != 200:
print(f"got status {response.status_code} for {url}")
return None, response_time
return response.text, response_time
def parse_page(html, parser_type="generic"):
"""extract structured data from HTML."""
soup = BeautifulSoup(html, "lxml")
parsers = {
"generic": parse_generic,
"ecommerce": parse_ecommerce,
"listing": parse_listing,
}
parser = parsers.get(parser_type, parse_generic)
return parser(soup)
def parse_generic(soup):
"""generic page parser."""
title = soup.find("title")
# remove non-content elements
for tag in soup.find_all(["script", "style", "nav", "footer"]):
tag.decompose()
body_text = soup.get_text(separator=" ", strip=True)[:10000]
return {
"title": title.text.strip() if title else "",
"description": body_text[:500],
"price": None,
"currency": None,
"category": None,
"seller": None,
"location": None,
"images": [],
"metadata": json.dumps({"parser": "generic"}),
}
def parse_ecommerce(soup):
"""ecommerce product page parser."""
data = parse_generic(soup)
# try common price selectors
price_selectors = [
".price", ".product-price", "[data-price]",
".current-price", ".sale-price",
]
for selector in price_selectors:
el = soup.select_one(selector)
if el:
import re
price_text = el.get_text(strip=True)
numbers = re.findall(r"[\d,.]+", price_text)
if numbers:
try:
data["price"] = float(
numbers[0].replace(",", "")
)
break
except ValueError:
continue
# extract images
images = soup.select("img[src*='product'], img[data-src*='product']")
data["images"] = [
img.get("src") or img.get("data-src")
for img in images[:10]
]
return data
def parse_listing(soup):
"""listing page parser for classifieds and marketplaces."""
data = parse_generic(soup)
# listing-specific extraction
seller = soup.select_one(
".seller-name, .dealer-name, [data-testid='seller']"
)
if seller:
data["seller"] = seller.get_text(strip=True)
location = soup.select_one(
".location, .seller-location, [data-testid='location']"
)
if location:
data["location"] = location.get_text(strip=True)
return data
def save_raw_html(url, html):
"""save raw HTML to Cloud Storage."""
if not GCS_BUCKET:
return ""
client = storage.Client()
bucket = client.bucket(GCS_BUCKET)
# create a path based on date and URL hash
import hashlib
url_hash = hashlib.md5(url.encode()).hexdigest()
date_path = datetime.now().strftime("%Y/%m/%d")
blob_path = f"raw/{date_path}/{url_hash}.html"
blob = bucket.blob(blob_path)
blob.upload_from_string(html, content_type="text/html")
return f"gs://{GCS_BUCKET}/{blob_path}"
def save_to_bigquery(data):
"""insert a row into BigQuery."""
client = bigquery.Client()
table_ref = f"{BQ_DATASET}.{BQ_TABLE}"
    # REPEATED fields accept Python lists directly; default missing images to []
    if not isinstance(data.get("images"), list):
        data["images"] = []
errors = client.insert_rows_json(table_ref, [data])
if errors:
print(f"BigQuery insert errors: {errors}")
raise Exception(f"failed to insert into BigQuery: {errors}")
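because save_raw_html keeps every page in Cloud Storage, a parser change does not force a re-crawl. a rough sketch of re-running the parsers over stored HTML; it assumes the raw/YYYY/MM/DD/... layout written above and imports from this main.py:

# reprocess.py
from google.cloud import storage

from main import parse_page  # reuse the same parsers


def reprocess(bucket_name, prefix, parser_type="listing"):
    """re-parse raw HTML saved under a date prefix, e.g. raw/2025/01/15/."""
    client = storage.Client()
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        html = blob.download_as_text()
        data = parse_page(html, parser_type)
        if data:
            # from here you could re-enrich the row and call save_to_bigquery,
            # ideally with a bumped extraction_version
            print(blob.name, data.get("title"))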
Step 4: Deploy the Cloud Function
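the deploy command uploads the directory containing main.py, so that directory also needs a requirements.txt for the build to install dependencies; something like:

# requirements.txt
requests
beautifulsoup4
lxml
google-cloud-bigquery
google-cloud-storage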
deploy with the gcloud CLI:
gcloud functions deploy scrape-url \
  --runtime python310 \
  --trigger-topic scraping-urls \
  --retry \
  --memory 512MB \
  --timeout 120s \
  --max-instances 50 \
  --set-env-vars "PROXY_URL=http://user:pass@residential.provider.com:8080,BQ_DATASET=scraping_data,BQ_TABLE=listings,GCS_BUCKET=${PROJECT_ID}-scraping-raw" \
  --entry-point scrape_url \
  --region us-central1
key settings:
- --max-instances 50: limits concurrency to prevent overwhelming target sites
- --timeout 120s: gives enough time for slow pages
- --memory 512MB: sufficient for HTML parsing
- --retry: re-delivers the Pub/Sub message when the function raises, which the error handling in scrape_url relies on
Step 5: Create the URL Publisher
this function publishes URLs to Pub/Sub for scraping:
# publisher.py
from google.cloud import pubsub_v1
import json
def publish_urls(urls, topic_name="scraping-urls",
source="manual", parser="generic"):
"""publish URLs to Pub/Sub for scraping."""
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(
"your-project-id", topic_name
)
futures = []
for url in urls:
message = json.dumps({
"url": url,
"source": source,
"parser": parser,
}).encode("utf-8")
future = publisher.publish(topic_path, message)
futures.append(future)
# wait for all messages to be published
results = [f.result() for f in futures]
print(f"published {len(results)} URLs to {topic_name}")
return results
def publish_from_sitemap(sitemap_url, topic_name="scraping-urls"):
"""parse a sitemap and publish all URLs."""
import requests
from bs4 import BeautifulSoup
response = requests.get(sitemap_url)
soup = BeautifulSoup(response.text, "xml")
urls = [loc.text for loc in soup.find_all("loc")]
print(f"found {len(urls)} URLs in sitemap")
publish_urls(urls, topic_name)
def publish_search_results(base_url, params, pages,
topic_name="scraping-urls"):
"""generate search result URLs and publish them."""
urls = []
for page in range(1, pages + 1):
url = f"{base_url}?{'&'.join(f'{k}={v}' for k, v in params.items())}&page={page}"
urls.append(url)
publish_urls(urls, topic_name, parser="listing")
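a quick usage sketch; the URLs and search parameters are placeholders:

# example: queue a single listing plus 20 pages of search results
if __name__ == "__main__":
    publish_urls(
        ["https://example.com/listings/12345"],
        source="example.com",
        parser="listing",
    )
    publish_search_results(
        "https://example.com/search",
        params={"category": "cars", "sort": "newest"},
        pages=20,
    )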
Step 6: Set Up Cloud Scheduler
automate your scraping runs with Cloud Scheduler:
# run daily at 2 AM UTC
gcloud scheduler jobs create pubsub daily-scrape \
--schedule="0 2 * * *" \
--topic=scraping-urls \
--message-body='{"url": "https://example.com/listings?page=1", "source": "example.com", "parser": "listing"}' \
--location=us-central1
for more complex schedules, use a Cloud Function as the scheduler target:
# scheduler_function.py
import functions_framework
from publisher import publish_urls
@functions_framework.http
def trigger_scrape(request):
"""HTTP-triggered function to publish scraping jobs."""
# generate URLs to scrape
urls = []
base = "https://example.com/listings"
for page in range(1, 51):
urls.append(f"{base}?page={page}")
publish_urls(urls, source="example.com", parser="listing")
return f"published {len(urls)} scraping jobs", 200
deploy and schedule it:
gcloud functions deploy trigger-scrape \
--runtime python310 \
--trigger-http \
--allow-unauthenticated \
--region us-central1
gcloud scheduler jobs create http daily-scrape-trigger \
--schedule="0 2 * * *" \
--uri="https://us-central1-your-project.cloudfunctions.net/trigger-scrape" \
--http-method=POST \
--location=us-central1
Step 7: Monitor and Alert
set up monitoring for your pipeline:
# monitoring.py
from google.cloud import bigquery
from datetime import datetime, timedelta
def check_pipeline_health():
"""check if the pipeline is running correctly."""
client = bigquery.Client()
# check records scraped in the last 24 hours
query = """
SELECT
COUNT(*) as total_records,
COUNT(DISTINCT source_domain) as domains_scraped,
AVG(response_time_ms) as avg_response_time,
COUNTIF(price IS NOT NULL) / COUNT(*) as price_coverage,
COUNTIF(title IS NOT NULL) / COUNT(*) as title_coverage
FROM `scraping_data.listings`
WHERE scraped_at > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
"""
result = list(client.query(query).result())[0]
print(f"last 24 hours:")
print(f" total records: {result.total_records}")
print(f" domains scraped: {result.domains_scraped}")
print(f" avg response time: {result.avg_response_time:.0f}ms")
print(f" price coverage: {result.price_coverage:.1%}")
print(f" title coverage: {result.title_coverage:.1%}")
# alert if something looks wrong
if result.total_records == 0:
send_alert("no records scraped in the last 24 hours")
elif result.avg_response_time > 10000:
send_alert(f"high avg response time: {result.avg_response_time:.0f}ms")
elif result.title_coverage < 0.5:
send_alert(f"low title coverage: {result.title_coverage:.1%}")
def send_alert(message):
"""send an alert (implement with your notification service)."""
print(f"ALERT: {message}")
# integrate with Slack, email, PagerDuty, etc.
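to run this check automatically, one option is to expose it as an HTTP function (mirroring the trigger function above) and point a Cloud Scheduler job at it; a sketch:

# monitoring_function.py
import functions_framework

from monitoring import check_pipeline_health


@functions_framework.http
def pipeline_health(request):
    """HTTP-triggered wrapper so Cloud Scheduler can run the health check."""
    check_pipeline_health()
    return "health check complete", 200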
Cost Optimization
serverless scraping on GCP is cost-effective but there are ways to optimize further:
Cloud Function Costs
- use the smallest memory allocation that works (256MB is often enough for simple scraping)
- set appropriate timeouts to avoid paying for hanging requests
- use --max-instances to cap concurrent executions
Pub/Sub Costs
- batch small URLs into single messages to reduce message count
- use message ordering only when necessary
BigQuery Costs
- partition your table by scraped_at date to reduce query costs
- use clustered tables on source_domain for faster filtered queries
- set table expiration for old data you do not need
Estimated Monthly Costs
for a pipeline scraping 100,000 pages per day:
| Service | Estimated Cost |
|---|---|
| Cloud Functions (512MB, 10s avg) | $45/month |
| Pub/Sub (100K messages/day) | $4/month |
| Cloud Storage (1TB raw HTML) | $20/month |
| BigQuery (10GB data, 50 queries/day) | $5/month |
| Cloud Scheduler | $0.10/month |
| Total | ~$75/month |
compare this to running an equivalent EC2 or Compute Engine instance 24/7, which would cost $150-300/month.
Proxy Integration in Serverless
integrating proxies with serverless functions requires some adjustments:
Rotating Proxy Gateway
the simplest approach is using a rotating proxy gateway. each Cloud Function request goes through the gateway, which rotates the IP automatically:
PROXY_URL = "http://user:pass@gate.provider.com:7777"
Session-Based Proxies
if you need sticky sessions (same IP for multiple requests), pass a session ID:
def get_session_proxy(session_id):
"""get a proxy URL with session pinning."""
return f"http://user-session-{session_id}:pass@gate.provider.com:7777"
Cost Considerations
proxy costs often exceed cloud infrastructure costs. for 100,000 pages per day at $10/GB residential proxy pricing and an average page size of 200KB, proxy costs would be approximately $200/day or $6,000/month. optimize by:
- using datacenter proxies for sites that do not require residential IPs
- caching pages that do not change frequently
- only scraping pages that have actually changed (use ETags or Last-Modified headers; see the sketch after this list)
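a sketch of the last point, using conditional requests so unchanged pages cost a 304 instead of a full download; it assumes you store each page's last ETag somewhere, for example alongside the row in BigQuery:

import requests

def fetch_if_changed(url, previous_etag=None, proxies=None):
    """return (html, etag); html is None when the page has not changed."""
    headers = {}
    if previous_etag:
        headers["If-None-Match"] = previous_etag
    response = requests.get(url, headers=headers, proxies=proxies, timeout=30)
    if response.status_code == 304:
        # not modified: no body is transferred, so almost no proxy bandwidth is spent
        return None, previous_etag
    return response.text, response.headers.get("ETag")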
Conclusion
a serverless scraping pipeline on GCP gives you automatic scaling, zero server management, and pay-per-use pricing. the combination of Cloud Functions for scraping, Pub/Sub for queuing, BigQuery for analysis, and Cloud Storage for raw data creates a robust, production-grade pipeline that handles everything from 100 to 1,000,000 pages per day.
the total infrastructure cost for most scraping workloads is under $100/month. the real cost driver is usually proxy services, so optimizing your proxy usage matters more than optimizing your cloud spend.