a Google Cloud serverless scraping pipeline uses Cloud Functions for the crawler, Cloud Scheduler for triggers, Cloud Storage for raw output, and BigQuery for querying results. the whole stack runs without managing any servers.
serverless architectures have become a default choice for scraping at scale: no servers to manage, you pay only for execution time, and the platform handles concurrency automatically. AWS Lambda gets most of the attention, but Google Cloud’s stack is equally capable and sometimes cheaper for data-heavy workloads that land in BigQuery.
architecture overview
Cloud Scheduler publishes a message to a Pub/Sub topic, which invokes a Cloud Function that scrapes the target and saves raw JSON to Cloud Storage. BigQuery reads from Cloud Storage for analysis. Pub/Sub decouples the trigger from execution and provides built-in retry logic.
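the Pub/Sub topic is also a convenient manual entry point: you can trigger a one-off run by publishing the same message the scheduler would send. a minimal sketch, assuming the topic is named scrape-trigger (as in the deploy command below) and that GOOGLE_CLOUD_PROJECT is set in your environment:

import json
import os

from google.cloud import pubsub_v1

# publish a one-off trigger message to the same topic Cloud Scheduler uses
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(os.environ['GOOGLE_CLOUD_PROJECT'], 'scrape-trigger')

payload = {'url': 'https://books.toscrape.com'}
future = publisher.publish(topic_path, data=json.dumps(payload).encode('utf-8'))
print(f'published message {future.result()}')  # result() blocks and returns the message ID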
setting up the project
gcloud services enable cloudfunctions.googleapis.com
gcloud services enable cloudscheduler.googleapis.com
gcloud services enable pubsub.googleapis.com
gcloud services enable bigquery.googleapis.com
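the commands above only enable APIs; the bucket (my-scraper-bucket) and the BigQuery dataset (scraper_dataset) used later still have to exist. a one-time setup sketch, assuming the same names and the us-central1 region used in the deploy command:

from google.cloud import bigquery, storage

# create the raw-output bucket
storage_client = storage.Client()
storage_client.create_bucket('my-scraper-bucket', location='us-central1')

# create the dataset the load job will write into
bq_client = bigquery.Client()
dataset = bigquery.Dataset(f'{bq_client.project}.scraper_dataset')
dataset.location = 'us-central1'
bq_client.create_dataset(dataset, exists_ok=True)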
the scraper cloud function

import base64
import json
from datetime import datetime

import functions_framework
import requests
from bs4 import BeautifulSoup
from google.cloud import storage


@functions_framework.cloud_event
def scrape_handler(cloud_event):
    # Pub/Sub delivers the scheduler's message body base64-encoded inside the CloudEvent
    message_data = base64.b64decode(cloud_event.data['message']['data']).decode()
    config = json.loads(message_data)
    url = config.get('url', 'https://books.toscrape.com')

    # route the request through a proxy; swap in your provider's credentials
    proxies = {
        'http': 'http://user:pass@your-proxy:port',
        'https': 'http://user:pass@your-proxy:port'
    }
    r = requests.get(url, proxies=proxies, timeout=30)

    soup = BeautifulSoup(r.text, 'html.parser')
    books = []
    for article in soup.select('article.product_pod'):
        books.append({
            'title': article.select_one('h3 a')['title'],
            'price': article.select_one('.price_color').text.strip(),
            'scraped_at': datetime.utcnow().isoformat()
        })

    # write newline-delimited JSON so the BigQuery load job below can read the files directly
    client = storage.Client()
    bucket = client.bucket('my-scraper-bucket')
    filename = f"books/{datetime.utcnow().strftime('%Y%m%d-%H%M%S')}.json"
    bucket.blob(filename).upload_from_string('\n'.join(json.dumps(b) for b in books))
    print(f'saved {len(books)} records')

deploying the function
gcloud functions deploy scrape-books \
--gen2 --runtime=python311 --region=us-central1 \
--entry-point=scrape_handler --trigger-topic=scrape-trigger \
  --timeout=300s --memory=512MB

gen2 functions have faster cold starts and support longer timeouts than gen1: up to 60 minutes for HTTP functions, though event-driven functions like this one are capped at 9 minutes. see our guide on what is a proxy server for proxy config in cloud environments.
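the deploy expects the source directory to contain main.py with the handler plus a requirements.txt. based on the imports in the function above, it needs roughly the following (version pins are up to you):

# requirements.txt
functions-framework
requests
beautifulsoup4
google-cloud-storage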
scheduling with cloud scheduler
gcloud scheduler jobs create pubsub scrape-daily \
--schedule="0 6 * * *" \
--topic=scrape-trigger \
--message-body='{"url": "https://books.toscrape.com"}' \
  --time-zone="UTC"

loading to bigquery
from google.cloud import bigquery

client = bigquery.Client()
table_id = 'my-project.scraper_dataset.books'

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField('title', 'STRING'),
        bigquery.SchemaField('price', 'STRING'),
        bigquery.SchemaField('scraped_at', 'TIMESTAMP'),
    ],
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition='WRITE_APPEND',
)

# load every NDJSON file under books/ straight from Cloud Storage
load_job = client.load_table_from_uri(
    'gs://my-scraper-bucket/books/*.json', table_id, job_config=job_config
)
load_job.result()  # block until the load finishes
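with data in the table you can query it straight from Python. a quick sanity check, reusing the project, dataset, and column names from this example:

from google.cloud import bigquery

client = bigquery.Client()

# count loaded rows per scrape day
query = """
    SELECT DATE(scraped_at) AS day, COUNT(*) AS rows_loaded
    FROM `my-project.scraper_dataset.books`
    GROUP BY day
    ORDER BY day DESC
"""
for row in client.query(query).result():
    print(row.day, row.rows_loaded)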
cost estimate

- Cloud Functions gen2: $0.40 per million invocations plus compute time
- Cloud Storage: $0.02/GB stored
- BigQuery: first 10GB/month queried is free
- Cloud Scheduler: first 3 jobs free
a daily scrape of 10,000 pages costs roughly $0.50-2.00 in GCP fees excluding proxy costs. combine with residential proxy rotation for protected targets. see our comparison at SOCKS5 vs HTTP proxy and our overview of what is web scraping.
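rather than hardcoding proxy credentials in the function as in the example above, you can inject them as environment variables at deploy time (gcloud functions deploy supports --set-env-vars). a sketch; the variable names here are only illustrative:

import os

import requests

# build the proxy URL from environment variables set at deploy time
# (PROXY_USER, PROXY_PASS, PROXY_HOST are illustrative names, not a standard)
proxy_url = f"http://{os.environ['PROXY_USER']}:{os.environ['PROXY_PASS']}@{os.environ['PROXY_HOST']}"
proxies = {'http': proxy_url, 'https': proxy_url}

r = requests.get('https://books.toscrape.com', proxies=proxies, timeout=30)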
last updated: April 1, 2026