a Google Cloud serverless scraping pipeline uses Cloud Functions for the crawler, Cloud Scheduler for triggers, Cloud Storage for raw output, and BigQuery for querying results. the whole stack runs without managing any servers.
serverless architectures have become a default choice for scraping at scale: no servers to manage, you pay only for execution time, and the platform handles concurrency automatically. AWS Lambda gets most of the attention, but Google Cloud’s stack is equally capable and sometimes cheaper for data-heavy workloads that land in BigQuery.
architecture overview
Cloud Scheduler publishes a message to a Pub/Sub topic, which invokes a Cloud Function that scrapes the target and saves raw JSON to Cloud Storage. BigQuery reads from Cloud Storage for analysis. Pub/Sub decouples the trigger from execution and provides built-in retry logic.
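the Pub/Sub topic is also a convenient manual entry point: you can trigger a one-off run by publishing the same message the scheduler would send. a minimal sketch, assuming the topic is named scrape-trigger (as in the deploy command below) and that GOOGLE_CLOUD_PROJECT is set in your environment:

import json
import os

from google.cloud import pubsub_v1

# publish a one-off trigger message to the same topic Cloud Scheduler uses
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(os.environ['GOOGLE_CLOUD_PROJECT'], 'scrape-trigger')

payload = {'url': 'https://books.toscrape.com'}
future = publisher.publish(topic_path, data=json.dumps(payload).encode('utf-8'))
print(f'published message {future.result()}')  # result() blocks and returns the message ID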
setting up the project
gcloud services enable cloudfunctions.googleapis.com
gcloud services enable cloudscheduler.googleapis.com
gcloud services enable pubsub.googleapis.com
gcloud services enable bigquery.googleapis.com
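the commands above only enable APIs; the bucket (my-scraper-bucket) and the BigQuery dataset (scraper_dataset) used later still have to exist. a one-time setup sketch, assuming the same names and the us-central1 region used in the deploy command:

from google.cloud import bigquery, storage

# create the raw-output bucket
storage_client = storage.Client()
storage_client.create_bucket('my-scraper-bucket', location='us-central1')

# create the dataset the load job will write into
bq_client = bigquery.Client()
dataset = bigquery.Dataset(f'{bq_client.project}.scraper_dataset')
dataset.location = 'us-central1'
bq_client.create_dataset(dataset, exists_ok=True)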
the scraper cloud function

import base64
import json
from datetime import datetime

import functions_framework
import requests
from bs4 import BeautifulSoup
from google.cloud import storage


@functions_framework.cloud_event
def scrape_handler(cloud_event):
    # Pub/Sub delivers the scheduler's message body base64-encoded inside the CloudEvent
    message_data = base64.b64decode(cloud_event.data['message']['data']).decode()
    config = json.loads(message_data)
    url = config.get('url', 'https://books.toscrape.com')

    # route the request through a proxy; swap in your provider's credentials
    proxies = {
        'http': 'http://user:pass@your-proxy:port',
        'https': 'http://user:pass@your-proxy:port'
    }
    r = requests.get(url, proxies=proxies, timeout=30)

    soup = BeautifulSoup(r.text, 'html.parser')
    books = []
    for article in soup.select('article.product_pod'):
        books.append({
            'title': article.select_one('h3 a')['title'],
            'price': article.select_one('.price_color').text.strip(),
            'scraped_at': datetime.utcnow().isoformat()
        })

    # write newline-delimited JSON so the BigQuery load job below can read the files directly
    client = storage.Client()
    bucket = client.bucket('my-scraper-bucket')
    filename = f"books/{datetime.utcnow().strftime('%Y%m%d-%H%M%S')}.json"
    bucket.blob(filename).upload_from_string('\n'.join(json.dumps(b) for b in books))
    print(f'saved {len(books)} records')

deploying the function
gcloud functions deploy scrape-books \
--gen2 --runtime=python311 --region=us-central1 \
--entry-point=scrape_handler --trigger-topic=scrape-trigger \
  --timeout=300s --memory=512MB

gen2 functions have faster cold starts and support longer timeouts than gen1: up to 60 minutes for HTTP functions, though event-driven functions like this one are capped at 9 minutes. see our guide on what is a proxy server for proxy config in cloud environments.
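the deploy expects the source directory to contain main.py with the handler plus a requirements.txt. based on the imports in the function above, it needs roughly the following (version pins are up to you):

# requirements.txt
functions-framework
requests
beautifulsoup4
google-cloud-storage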
scheduling with cloud scheduler
gcloud scheduler jobs create pubsub scrape-daily \
--schedule="0 6 * * *" \
--topic=scrape-trigger \
--message-body='{"url": "https://books.toscrape.com"}' \
  --time-zone="UTC"

loading to bigquery
from google.cloud import bigquery

client = bigquery.Client()
table_id = 'my-project.scraper_dataset.books'

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField('title', 'STRING'),
        bigquery.SchemaField('price', 'STRING'),
        bigquery.SchemaField('scraped_at', 'TIMESTAMP'),
    ],
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition='WRITE_APPEND',
)

# load every NDJSON file under books/ straight from Cloud Storage
load_job = client.load_table_from_uri(
    'gs://my-scraper-bucket/books/*.json', table_id, job_config=job_config
)
load_job.result()  # block until the load finishes
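with data in the table you can query it straight from Python. a quick sanity check, reusing the project, dataset, and column names from this example:

from google.cloud import bigquery

client = bigquery.Client()

# count loaded rows per scrape day
query = """
    SELECT DATE(scraped_at) AS day, COUNT(*) AS rows_loaded
    FROM `my-project.scraper_dataset.books`
    GROUP BY day
    ORDER BY day DESC
"""
for row in client.query(query).result():
    print(row.day, row.rows_loaded)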
cost estimate

- Cloud Functions gen2: $0.40 per million invocations plus compute time
- Cloud Storage: $0.02/GB stored
- BigQuery: first 10GB/month queried is free
- Cloud Scheduler: first 3 jobs free
a daily scrape of 10,000 pages costs roughly $0.50-2.00 in GCP fees excluding proxy costs. combine with residential proxy rotation for protected targets. see our comparison at SOCKS5 vs HTTP proxy and our overview of what is web scraping.
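rather than hardcoding proxy credentials in the function as in the example above, you can inject them as environment variables at deploy time (gcloud functions deploy supports --set-env-vars). a sketch; the variable names here are only illustrative:

import os

import requests

# build the proxy URL from environment variables set at deploy time
# (PROXY_USER, PROXY_PASS, PROXY_HOST are illustrative names, not a standard)
proxy_url = f"http://{os.environ['PROXY_USER']}:{os.environ['PROXY_PASS']}@{os.environ['PROXY_HOST']}"
proxies = {'http': proxy_url, 'https': proxy_url}

r = requests.get('https://books.toscrape.com', proxies=proxies, timeout=30)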
last updated: April 1, 2026