Google Cloud Serverless Scraping Pipeline: Complete Guide

Running web scrapers on a dedicated server means paying for compute 24/7 even when your scrapers only run a few hours per day. Serverless architecture flips this model: you pay only while your code executes, scale automatically from 1 to 10,000 concurrent requests, and eliminate server management entirely.

Google Cloud Platform offers a particularly strong set of services for building scraping pipelines. Cloud Functions handle the scraping logic, Pub/Sub manages the work queue, Cloud Scheduler triggers runs, BigQuery stores the results, and Cloud Storage holds raw HTML for reprocessing.

This guide walks you through building a production-grade serverless scraping pipeline on GCP from scratch.

Architecture Overview

┌────────────────┐     ┌──────────────┐     ┌──────────────┐
│     Cloud      │────>│   Pub/Sub    │────>│    Cloud     │
│   Scheduler    │     │    Topic     │     │   Function   │
│ (cron trigger) │     │  (URL queue) │     │  (scraper)   │
└────────────────┘     └──────────────┘     └──────┬───────┘
                                                   │
                                          ┌────────┴────────┐
                                          │                 │
                                    ┌─────▼─────┐    ┌──────▼──────┐
                                    │ BigQuery  │    │    Cloud    │
                                    │(structured│    │   Storage   │
                                    │   data)   │    │ (raw HTML)  │
                                    └───────────┘    └─────────────┘

The flow works like this:

  1. Cloud Scheduler publishes URLs to a Pub/Sub topic on a schedule
  2. Each URL triggers a Cloud Function instance
  3. The function scrapes the page, extracts data, and stores the results
  4. Structured data goes to BigQuery for querying
  5. Raw HTML goes to Cloud Storage so pages can be reprocessed if your parser changes

Prerequisites

You will need:

  • a GCP account with billing enabled
  • the gcloud CLI installed and configured
  • Python 3.10+
  • basic familiarity with GCP services

Install the required Python packages:

pip install google-cloud-pubsub google-cloud-bigquery \
    google-cloud-storage google-cloud-functions-framework \
    requests beautifulsoup4 lxml

Step 1: Set Up GCP Resources

Create the necessary GCP resources using the gcloud CLI:

# set your project
export PROJECT_ID="your-project-id"
gcloud config set project $PROJECT_ID

# enable required APIs
gcloud services enable cloudfunctions.googleapis.com
gcloud services enable pubsub.googleapis.com
gcloud services enable cloudscheduler.googleapis.com
gcloud services enable bigquery.googleapis.com
gcloud services enable storage.googleapis.com

# create Pub/Sub topic and subscription
gcloud pubsub topics create scraping-urls
gcloud pubsub subscriptions create scraping-urls-sub \
    --topic=scraping-urls \
    --ack-deadline=300

# create a dead letter topic for failed messages and attach it to the
# main subscription (the Pub/Sub service account also needs publish and
# subscribe permissions on these resources for dead-lettering to work)
gcloud pubsub topics create scraping-dead-letter
gcloud pubsub subscriptions create scraping-dead-letter-sub \
    --topic=scraping-dead-letter
gcloud pubsub subscriptions update scraping-urls-sub \
    --dead-letter-topic=scraping-dead-letter \
    --max-delivery-attempts=5

# create Cloud Storage bucket for raw HTML
gsutil mb -l us-central1 gs://${PROJECT_ID}-scraping-raw

# create BigQuery dataset
bq mk --dataset ${PROJECT_ID}:scraping_data

Step 2: Define the BigQuery Schema

Create a table to store your scraped data:

bq mk --table \
    ${PROJECT_ID}:scraping_data.listings \
    'url:STRING,title:STRING,price:FLOAT,currency:STRING,description:STRING,category:STRING,seller:STRING,location:STRING,scraped_at:TIMESTAMP,source_domain:STRING,raw_html_path:STRING,extraction_version:STRING,proxy_used:STRING,response_time_ms:INTEGER'

The inline flag syntax cannot express REPEATED or JSON columns, so define the full schema the scraper below expects (including the repeated images field and JSON metadata) in code instead:

# schema.py
LISTINGS_SCHEMA = [
    {"name": "url", "type": "STRING", "mode": "REQUIRED"},
    {"name": "title", "type": "STRING", "mode": "NULLABLE"},
    {"name": "price", "type": "FLOAT", "mode": "NULLABLE"},
    {"name": "currency", "type": "STRING", "mode": "NULLABLE"},
    {"name": "description", "type": "STRING", "mode": "NULLABLE"},
    {"name": "category", "type": "STRING", "mode": "NULLABLE"},
    {"name": "seller", "type": "STRING", "mode": "NULLABLE"},
    {"name": "location", "type": "STRING", "mode": "NULLABLE"},
    {"name": "images", "type": "STRING", "mode": "REPEATED"},
    {"name": "metadata", "type": "JSON", "mode": "NULLABLE"},
    {"name": "scraped_at", "type": "TIMESTAMP", "mode": "REQUIRED"},
    {"name": "source_domain", "type": "STRING", "mode": "REQUIRED"},
    {"name": "raw_html_path", "type": "STRING", "mode": "NULLABLE"},
    {"name": "extraction_version", "type": "STRING", "mode": "REQUIRED"},
    {"name": "proxy_used", "type": "STRING", "mode": "NULLABLE"},
    {"name": "response_time_ms", "type": "INTEGER", "mode": "NULLABLE"},
]
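
To create the table from this schema, here is a minimal sketch using the BigQuery Python client (create_table.py and the create_listings_table helper are names chosen for illustration, not part of the pipeline above):

# create_table.py
from google.cloud import bigquery

from schema import LISTINGS_SCHEMA


def create_listings_table(project_id, dataset="scraping_data", table="listings"):
    """create the listings table from LISTINGS_SCHEMA, partitioned by scraped_at."""
    client = bigquery.Client(project=project_id)

    schema = [
        bigquery.SchemaField(f["name"], f["type"], mode=f["mode"])
        for f in LISTINGS_SCHEMA
    ]

    table_ref = bigquery.Table(f"{project_id}.{dataset}.{table}", schema=schema)

    # partition by scraped_at to keep query costs down (see Cost Optimization)
    table_ref.time_partitioning = bigquery.TimePartitioning(field="scraped_at")

    client.create_table(table_ref, exists_ok=True)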

Step 3: Build the Scraping Cloud Function

This is the core of the pipeline. Each Pub/Sub message contains a URL to scrape:

# main.py
import base64
import hashlib
import json
import os
import re
import time
from datetime import datetime, timezone

import requests
from bs4 import BeautifulSoup
from google.cloud import bigquery, storage


# configuration
PROXY_URL = os.environ.get("PROXY_URL", "")
BQ_DATASET = os.environ.get("BQ_DATASET", "scraping_data")
BQ_TABLE = os.environ.get("BQ_TABLE", "listings")
GCS_BUCKET = os.environ.get("GCS_BUCKET", "")
EXTRACTION_VERSION = "v1.0"


def scrape_url(event, context):
    """Cloud Function triggered by Pub/Sub message."""

    # decode the Pub/Sub message
    message_data = base64.b64decode(event["data"]).decode("utf-8")
    payload = json.loads(message_data)

    url = payload["url"]
    source = payload.get("source", "unknown")
    parser_type = payload.get("parser", "generic")

    print(f"scraping: {url}")

    try:
        # fetch the page
        html, response_time = fetch_page(url)

        if not html:
            print(f"failed to fetch {url}")
            return

        # save raw HTML to Cloud Storage
        raw_path = save_raw_html(url, html)

        # parse the page
        data = parse_page(html, parser_type)

        if not data:
            print(f"failed to parse {url}")
            return

        # enrich with metadata
        data["url"] = url
        data["scraped_at"] = datetime.now(timezone.utc).isoformat()
        data["source_domain"] = source
        data["raw_html_path"] = raw_path
        data["extraction_version"] = EXTRACTION_VERSION
        data["proxy_used"] = "residential" if PROXY_URL else "direct"
        data["response_time_ms"] = int(response_time * 1000)

        # save to BigQuery
        save_to_bigquery(data)

        print(f"successfully scraped and stored: {url}")

    except Exception as e:
        print(f"error processing {url}: {e}")
        raise  # let Pub/Sub retry


def fetch_page(url):
    """fetch a web page with proxy support."""
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0.0.0 Safari/537.36"
        ),
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.9",
    }

    proxies = {}
    if PROXY_URL:
        proxies = {"http": PROXY_URL, "https": PROXY_URL}

    start = time.time()

    response = requests.get(
        url,
        headers=headers,
        proxies=proxies,
        timeout=30,
    )

    response_time = time.time() - start

    if response.status_code != 200:
        print(f"got status {response.status_code} for {url}")
        return None, response_time

    return response.text, response_time


def parse_page(html, parser_type="generic"):
    """extract structured data from HTML."""
    soup = BeautifulSoup(html, "lxml")

    parsers = {
        "generic": parse_generic,
        "ecommerce": parse_ecommerce,
        "listing": parse_listing,
    }

    parser = parsers.get(parser_type, parse_generic)
    return parser(soup)


def parse_generic(soup):
    """generic page parser."""
    title = soup.find("title")

    # remove non-content elements
    for tag in soup.find_all(["script", "style", "nav", "footer"]):
        tag.decompose()

    body_text = soup.get_text(separator=" ", strip=True)[:10000]

    return {
        "title": title.text.strip() if title else "",
        "description": body_text[:500],
        "price": None,
        "currency": None,
        "category": None,
        "seller": None,
        "location": None,
        "images": [],
        "metadata": json.dumps({"parser": "generic"}),
    }


def parse_ecommerce(soup):
    """ecommerce product page parser."""
    data = parse_generic(soup)

    # try common price selectors
    price_selectors = [
        ".price", ".product-price", "[data-price]",
        ".current-price", ".sale-price",
    ]

    for selector in price_selectors:
        el = soup.select_one(selector)
        if el:
            price_text = el.get_text(strip=True)
            numbers = re.findall(r"[\d,.]+", price_text)
            if numbers:
                try:
                    data["price"] = float(
                        numbers[0].replace(",", "")
                    )
                    break
                except ValueError:
                    continue

    # extract images
    images = soup.select("img[src*='product'], img[data-src*='product']")
    data["images"] = [
        img.get("src") or img.get("data-src")
        for img in images[:10]
    ]

    return data


def parse_listing(soup):
    """listing page parser for classifieds and marketplaces."""
    data = parse_generic(soup)

    # listing-specific extraction
    seller = soup.select_one(
        ".seller-name, .dealer-name, [data-testid='seller']"
    )
    if seller:
        data["seller"] = seller.get_text(strip=True)

    location = soup.select_one(
        ".location, .seller-location, [data-testid='location']"
    )
    if location:
        data["location"] = location.get_text(strip=True)

    return data


def save_raw_html(url, html):
    """save raw HTML to Cloud Storage."""
    if not GCS_BUCKET:
        return ""

    client = storage.Client()
    bucket = client.bucket(GCS_BUCKET)

    # create a path based on UTC date and URL hash
    url_hash = hashlib.md5(url.encode()).hexdigest()
    date_path = datetime.now(timezone.utc).strftime("%Y/%m/%d")
    blob_path = f"raw/{date_path}/{url_hash}.html"

    blob = bucket.blob(blob_path)
    blob.upload_from_string(html, content_type="text/html")

    return f"gs://{GCS_BUCKET}/{blob_path}"


def save_to_bigquery(data):
    """insert a row into BigQuery."""
    client = bigquery.Client()
    table_ref = f"{BQ_DATASET}.{BQ_TABLE}"

    # BigQuery REPEATED fields reject nulls, so drop empty image URLs
    if isinstance(data.get("images"), list):
        data["images"] = [img for img in data["images"] if img]

    errors = client.insert_rows_json(table_ref, [data])

    if errors:
        print(f"BigQuery insert errors: {errors}")
        raise Exception(f"failed to insert into BigQuery: {errors}")
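
Before deploying, you can smoke-test the function locally by faking a Pub/Sub event (test_local.py is a name chosen for illustration; it assumes main.py is in the same directory and your application-default credentials are configured):

# test_local.py
import base64
import json

from main import scrape_url

# build a fake Pub/Sub event with the same payload shape the publisher sends
payload = {"url": "https://example.com", "source": "example.com", "parser": "generic"}
event = {"data": base64.b64encode(json.dumps(payload).encode("utf-8"))}

scrape_url(event, None)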

Step 4: Deploy the Cloud Function
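
Cloud Functions installs dependencies from a requirements.txt in the function's source directory, so place one alongside main.py:

# requirements.txt
google-cloud-bigquery
google-cloud-storage
requests
beautifulsoup4
lxml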

Then deploy with the gcloud CLI:

gcloud functions deploy scrape-url \
    --runtime python310 \
    --trigger-topic scraping-urls \
    --memory 512MB \
    --timeout 120s \
    --max-instances 50 \
    --set-env-vars "PROXY_URL=http://user:pass@residential.provider.com:8080,BQ_DATASET=scraping_data,BQ_TABLE=listings,GCS_BUCKET=${PROJECT_ID}-scraping-raw" \
    --entry-point scrape_url \
    --region us-central1

Key settings:

  • --max-instances 50: limits concurrency to prevent overwhelming target sites
  • --timeout 120s: gives enough time for slow pages
  • --memory 512MB: sufficient for HTML parsing

Step 5: Create the URL Publisher

This function publishes URLs to Pub/Sub for scraping:

# publisher.py
import json
import os

import requests
from bs4 import BeautifulSoup
from google.cloud import pubsub_v1


def publish_urls(urls, topic_name="scraping-urls",
                 source="manual", parser="generic"):
    """publish URLs to Pub/Sub for scraping."""

    publisher = pubsub_v1.PublisherClient()
    # read the project from the environment, falling back to a placeholder
    topic_path = publisher.topic_path(
        os.environ.get("PROJECT_ID", "your-project-id"), topic_name
    )

    futures = []
    for url in urls:
        message = json.dumps({
            "url": url,
            "source": source,
            "parser": parser,
        }).encode("utf-8")

        future = publisher.publish(topic_path, message)
        futures.append(future)

    # wait for all messages to be published
    results = [f.result() for f in futures]
    print(f"published {len(results)} URLs to {topic_name}")

    return results


def publish_from_sitemap(sitemap_url, topic_name="scraping-urls"):
    """parse a sitemap and publish all URLs."""

    response = requests.get(sitemap_url, timeout=30)
    soup = BeautifulSoup(response.text, "xml")

    urls = [loc.text for loc in soup.find_all("loc")]
    print(f"found {len(urls)} URLs in sitemap")

    publish_urls(urls, topic_name)


def publish_search_results(base_url, params, pages,
                           topic_name="scraping-urls"):
    """generate search result URLs and publish them."""
    urls = []
    for page in range(1, pages + 1):
        query = "&".join(f"{k}={v}" for k, v in params.items())
        urls.append(f"{base_url}?{query}&page={page}")

    publish_urls(urls, topic_name, parser="listing")
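
For example, to queue the first ten pages of a hypothetical search (the base URL and query parameters here are illustrative):

if __name__ == "__main__":
    publish_search_results(
        "https://example.com/search",
        {"category": "electronics", "sort": "newest"},
        pages=10,
    )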

Step 6: Set Up Cloud Scheduler

Automate your scraping runs with Cloud Scheduler:

# run daily at 2 AM UTC
gcloud scheduler jobs create pubsub daily-scrape \
    --schedule="0 2 * * *" \
    --topic=scraping-urls \
    --message-body='{"url": "https://example.com/listings?page=1", "source": "example.com", "parser": "listing"}' \
    --location=us-central1

For more complex schedules, use a Cloud Function as the scheduler target:

# scheduler_function.py
import functions_framework
from publisher import publish_urls


@functions_framework.http
def trigger_scrape(request):
    """HTTP-triggered function to publish scraping jobs."""

    # generate URLs to scrape
    urls = []
    base = "https://example.com/listings"
    for page in range(1, 51):
        urls.append(f"{base}?page={page}")

    publish_urls(urls, source="example.com", parser="listing")

    return f"published {len(urls)} scraping jobs", 200

Deploy and schedule it:

gcloud functions deploy trigger-scrape \
    --runtime python310 \
    --trigger-http \
    --entry-point trigger_scrape \
    --allow-unauthenticated \
    --region us-central1

gcloud scheduler jobs create http daily-scrape-trigger \
    --schedule="0 2 * * *" \
    --uri="https://us-central1-your-project.cloudfunctions.net/trigger-scrape" \
    --http-method=POST \
    --location=us-central1

Step 7: Monitor and Alert

Set up monitoring for your pipeline:

# monitoring.py
from google.cloud import bigquery


def check_pipeline_health():
    """check if the pipeline is running correctly."""
    client = bigquery.Client()

    # check records scraped in the last 24 hours; SAFE_DIVIDE avoids a
    # division-by-zero error when the table has no recent rows
    query = """
    SELECT
        COUNT(*) AS total_records,
        COUNT(DISTINCT source_domain) AS domains_scraped,
        AVG(response_time_ms) AS avg_response_time,
        SAFE_DIVIDE(COUNTIF(price IS NOT NULL), COUNT(*)) AS price_coverage,
        SAFE_DIVIDE(COUNTIF(title IS NOT NULL), COUNT(*)) AS title_coverage
    FROM `scraping_data.listings`
    WHERE scraped_at > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
    """

    result = list(client.query(query).result())[0]

    # with zero rows the other aggregates come back NULL, so alert and bail early
    if result.total_records == 0:
        send_alert("no records scraped in the last 24 hours")
        return

    print("last 24 hours:")
    print(f"  total records: {result.total_records}")
    print(f"  domains scraped: {result.domains_scraped}")
    print(f"  avg response time: {result.avg_response_time:.0f}ms")
    print(f"  price coverage: {result.price_coverage:.1%}")
    print(f"  title coverage: {result.title_coverage:.1%}")

    # alert if something looks wrong
    if result.avg_response_time > 10000:
        send_alert(f"high avg response time: {result.avg_response_time:.0f}ms")
    elif result.title_coverage < 0.5:
        send_alert(f"low title coverage: {result.title_coverage:.1%}")


def send_alert(message):
    """send an alert (implement with your notification service)."""
    print(f"ALERT: {message}")
    # integrate with Slack, email, PagerDuty, etc.
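
If you use Slack, one possible send_alert implementation posts to an incoming webhook (SLACK_WEBHOOK_URL is an assumed environment variable, not part of the pipeline above):

import os
import requests


def send_alert(message):
    """send an alert to Slack via an incoming webhook."""
    print(f"ALERT: {message}")
    webhook = os.environ.get("SLACK_WEBHOOK_URL")
    if webhook:
        requests.post(
            webhook,
            json={"text": f"scraping pipeline: {message}"},
            timeout=10,
        )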

Cost Optimization

Serverless scraping on GCP is cost-effective, but there are ways to optimize further:

Cloud Function Costs

  • use the smallest memory allocation that works (256MB is often enough for simple scraping)
  • set appropriate timeouts to avoid paying for hanging requests
  • use --max-instances to cap concurrent executions

Pub/Sub Costs

  • batch small URLs into single messages to reduce message count (see the sketch after this list)
  • use message ordering only when necessary
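
A minimal batching sketch, assuming the scraper function is adapted to loop over payload["urls"] instead of reading a single payload["url"] (the publish_url_batches name and the 25-URL batch size are choices made here):

import json
import os

from google.cloud import pubsub_v1


def publish_url_batches(urls, batch_size=25, topic_name="scraping-urls"):
    """publish URLs in groups of batch_size per Pub/Sub message."""
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(
        os.environ.get("PROJECT_ID", "your-project-id"), topic_name
    )

    futures = []
    for i in range(0, len(urls), batch_size):
        message = json.dumps({"urls": urls[i:i + batch_size]}).encode("utf-8")
        futures.append(publisher.publish(topic_path, message))

    # block until every batch is accepted by Pub/Sub
    for future in futures:
        future.result()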

BigQuery Costs

  • partition your table by scraped_at date to reduce query costs
  • use clustered tables on source_domain for faster filtered queries
  • set table or partition expiration for old data you do not need (sketched below)
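
A sketch of that last point, assuming the table was created with time partitioning on scraped_at (as in the earlier create_table.py sketch); partitions older than 180 days are then deleted automatically:

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("scraping_data.listings")

if table.time_partitioning:
    # expire partitions after 180 days (the value is in milliseconds)
    table.time_partitioning.expiration_ms = 180 * 24 * 60 * 60 * 1000
    client.update_table(table, ["time_partitioning"])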

Estimated Monthly Costs

For a pipeline scraping 100,000 pages per day:

Service                                 Estimated Cost
Cloud Functions (512MB, 10s avg)        $45/month
Pub/Sub (100K messages/day)             $4/month
Cloud Storage (1TB raw HTML)            $20/month
BigQuery (10GB data, 50 queries/day)    $5/month
Cloud Scheduler                         $0.10/month
Total                                   ~$75/month

Compare this to running an equivalent EC2 or Compute Engine instance 24/7, which would cost $150-300/month.

Proxy Integration in Serverless

Integrating proxies with serverless functions requires some adjustments:

Rotating Proxy Gateway

The simplest approach is a rotating proxy gateway: each Cloud Function request goes through the gateway, which rotates the IP automatically:

PROXY_URL = "http://user:pass@gate.provider.com:7777"

Session-Based Proxies

If you need sticky sessions (the same IP for multiple requests), pass a session ID:

def get_session_proxy(session_id):
    """get a proxy URL with session pinning."""
    return f"http://user-session-{session_id}:pass@gate.provider.com:7777"

Cost Considerations

Proxy costs often exceed cloud infrastructure costs. For 100,000 pages per day at $10/GB residential proxy pricing and an average page size of 200KB, that is about 20GB of traffic per day, or roughly $200/day ($6,000/month). Optimize by:

  • using datacenter proxies for sites that do not require residential IPs
  • caching pages that do not change frequently
  • only scraping pages that have actually changed, using ETags or Last-Modified headers (sketched below)
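
A minimal sketch of conditional fetching with ETags; the in-memory dict is for illustration only, since Cloud Function instances are ephemeral you would persist ETags in Firestore or Cloud Storage:

import requests

etag_cache = {}  # illustration only; persist this in Firestore or GCS


def fetch_if_changed(url):
    """fetch a URL only if it has changed since the last scrape."""
    headers = {}
    if url in etag_cache:
        headers["If-None-Match"] = etag_cache[url]

    response = requests.get(url, headers=headers, timeout=30)

    if response.status_code == 304:
        return None  # unchanged: no body transferred, minimal proxy bandwidth

    if "ETag" in response.headers:
        etag_cache[url] = response.headers["ETag"]

    return response.text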

Conclusion

A serverless scraping pipeline on GCP gives you automatic scaling, zero server management, and pay-per-use pricing. The combination of Cloud Functions for scraping, Pub/Sub for queuing, BigQuery for analysis, and Cloud Storage for raw data creates a robust, production-grade pipeline that handles anything from 100 to 1,000,000 pages per day.

The total infrastructure cost for most scraping workloads is under $100/month. The real cost driver is usually proxy services, so optimizing your proxy usage matters more than optimizing your cloud spend.
