Python SERP Scraper Tutorial: Build a Google Rank Tracker with Proxies (2026)

Tracking where your website ranks for target keywords is one of the most fundamental SEO tasks, yet most rank tracking tools charge monthly fees that add up quickly — especially when you need to monitor hundreds or thousands of keywords. In this tutorial, you will build a fully functional Google rank tracker in Python that uses proxy rotation to avoid detection, parses live SERP HTML to extract ranking positions, stores historical data in a database, and runs on a schedule. By the end, you will have a tool that rivals commercial rank trackers at a fraction of the cost.

Prerequisites and Setup

Before diving into the code, make sure you have the following ready:

Required Python Knowledge

  • Comfortable with Python 3.10 or higher (the code uses the X | Y union syntax in type hints)
  • Basic understanding of HTTP requests and HTML parsing
  • Familiarity with pip and virtual environments

Required Libraries

Install the following packages in your virtual environment:

pip install requests beautifulsoup4 lxml sqlite-utils schedule fake-useragent curl_cffi
  • curl_cffi: HTTP requests with TLS fingerprinting. Avoids the TLS-based detection that blocks the standard requests library.
  • beautifulsoup4 + lxml: HTML parsing. Fast, reliable, well-documented.
  • sqlite-utils: Database storage. Lightweight, no server needed, easy querying.
  • schedule: Task scheduling. Simple cron-like scheduling in Python.
  • fake-useragent: User agent rotation. Generates realistic browser user agents.

Proxy Setup

You will need rotating residential proxies for this project. Your proxy provider should give you credentials in one of these formats:

# Format 1: Rotating gateway (recommended)
proxy = "http://username:password@gate.proxyprovider.com:7777"

# Format 2: List of individual IPs
proxies = [
    "http://user:pass@192.168.1.1:8080",
    "http://user:pass@192.168.1.2:8080",
    # ... more proxies
]

For a comprehensive overview of how proxies work with search engine scraping, see our guide on scraping Google search results with proxies.

Step 1: Project Structure

Organize your project with a clean structure that separates concerns:

rank-tracker/
├── config.py          # Configuration settings
├── scraper.py         # SERP fetching and parsing
├── proxy_manager.py   # Proxy rotation logic
├── database.py        # Data storage and retrieval
├── tracker.py         # Main tracking orchestration
├── scheduler.py       # Scheduling logic
└── keywords.txt       # Your keyword list (one per line)

This modular approach makes it easy to swap components — for example, changing proxy providers or switching from SQLite to PostgreSQL without rewriting the entire application.

Step 2: Configuration Module

Create a central configuration file that holds all your settings:

# config.py
import os

# Target domain to track
TARGET_DOMAIN = "yourdomain.com"

# Google domain and country settings
GOOGLE_DOMAIN = "google.com"
GOOGLE_COUNTRY = "us"
GOOGLE_LANGUAGE = "en"

# Proxy configuration
PROXY_HOST = os.getenv("PROXY_HOST", "gate.proxyprovider.com")
PROXY_PORT = os.getenv("PROXY_PORT", "7777")
PROXY_USER = os.getenv("PROXY_USER", "your_username")
PROXY_PASS = os.getenv("PROXY_PASS", "your_password")

# Scraping settings
RESULTS_PER_PAGE = 100  # Request 100 results per query
MAX_RETRIES = 3
REQUEST_DELAY_MIN = 3  # Minimum seconds between requests
REQUEST_DELAY_MAX = 7  # Maximum seconds between requests

# Database
DB_PATH = "rankings.db"

# Scheduling
TRACK_TIMES = ["06:00", "14:00", "22:00"]  # Times to run tracking

Step 3: Proxy Manager

The proxy manager handles rotation, health tracking, and failover. This is the component that keeps your scraper running reliably over time.

# proxy_manager.py
import random
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ProxyStats:
    total_requests: int = 0
    successful_requests: int = 0
    failed_requests: int = 0
    last_used: float = 0
    cooldown_until: float = 0

class ProxyManager:
    def __init__(self, proxy_list: list[str] | None = None, rotating_gateway: str | None = None):
        self.rotating_gateway = rotating_gateway
        self.proxy_list = proxy_list or []
        self.proxy_stats: dict[str, ProxyStats] = {}

        if self.proxy_list:
            for proxy in self.proxy_list:
                self.proxy_stats[proxy] = ProxyStats()

    def get_proxy(self) -> dict:
        """Get the next proxy to use."""
        if self.rotating_gateway:
            return {
                "http": self.rotating_gateway,
                "https": self.rotating_gateway
            }

        now = time.time()
        available = [
            p for p in self.proxy_list
            if self.proxy_stats[p].cooldown_until < now
        ]

        if not available:
            # All proxies cooling down — use the one with
            # the earliest cooldown expiry
            available = sorted(
                self.proxy_list,
                key=lambda p: self.proxy_stats[p].cooldown_until
            )

        # Select proxy with fewest recent requests
        proxy = min(available,
                    key=lambda p: self.proxy_stats[p].total_requests)

        self.proxy_stats[proxy].last_used = now
        self.proxy_stats[proxy].total_requests += 1

        return {"http": proxy, "https": proxy}

    def report_success(self, proxy_url: str):
        """Mark a proxy request as successful."""
        if proxy_url in self.proxy_stats:
            self.proxy_stats[proxy_url].successful_requests += 1

    def report_failure(self, proxy_url: str, cooldown_seconds: int = 60):
        """Mark a proxy request as failed and set cooldown."""
        if proxy_url in self.proxy_stats:
            stats = self.proxy_stats[proxy_url]
            stats.failed_requests += 1
            stats.cooldown_until = time.time() + cooldown_seconds

The proxy manager tracks success and failure rates per IP, implements cooldown periods for IPs that trigger blocks, and selects the least-used available proxy. This approach distributes load evenly and automatically avoids problematic IPs.
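
A quick usage sketch, with placeholder credentials, shows how the two modes fit together:

# Example usage of ProxyManager (credentials are placeholders)
from proxy_manager import ProxyManager

# Option A: single rotating gateway; the provider rotates IPs for you
gateway_manager = ProxyManager(
    rotating_gateway="http://username:password@gate.proxyprovider.com:7777"
)

# Option B: static list; rotation, cooldowns, and stats are handled client-side
list_manager = ProxyManager(proxy_list=[
    "http://user:pass@192.168.1.1:8080",
    "http://user:pass@192.168.1.2:8080",
])

proxy = list_manager.get_proxy()  # {"http": "...", "https": "..."}
# ... perform the request with this proxy ...
list_manager.report_success(proxy["http"])                        # on success
list_manager.report_failure(proxy["http"], cooldown_seconds=120)  # on a block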

Step 4: SERP Scraper

The scraper module handles fetching Google search results and parsing the HTML to extract ranking data. This is the most critical and most fragile part of the system, since Google frequently updates its HTML structure.

# scraper.py
import random
import time
from urllib.parse import urlencode, urlparse
from bs4 import BeautifulSoup
from curl_cffi import requests as curl_requests
from fake_useragent import UserAgent
import config

ua = UserAgent()

def build_google_url(keyword: str, page: int = 0) -> str:
    """Build a Google search URL with proper parameters."""
    params = {
        "q": keyword,
        "num": config.RESULTS_PER_PAGE,
        "hl": config.GOOGLE_LANGUAGE,
        "gl": config.GOOGLE_COUNTRY,
        "start": page * config.RESULTS_PER_PAGE,
    }
    return f"https://www.{config.GOOGLE_DOMAIN}/search?{urlencode(params)}"

def fetch_serp(url: str, proxy: dict) -> str | None:
    """Fetch a SERP page through a proxy."""
    headers = {
        "User-Agent": ua.random,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": f"{config.GOOGLE_LANGUAGE}-{config.GOOGLE_COUNTRY.upper()},{config.GOOGLE_LANGUAGE};q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
    }

    try:
        response = curl_requests.get(
            url,
            headers=headers,
            proxies=proxy,
            timeout=15,
            impersonate="chrome"  # TLS fingerprint matching
        )

        if response.status_code == 200:
            return response.text
        elif response.status_code == 429:
            print(f"Rate limited. Status: {response.status_code}")
            return None
        else:
            print(f"Unexpected status: {response.status_code}")
            return None

    except Exception as e:
        print(f"Request failed: {e}")
        return None

def parse_serp(html: str, target_domain: str) -> dict:
    """Parse Google SERP HTML and extract ranking data."""
    soup = BeautifulSoup(html, "lxml")
    results = []
    target_rank = None

    # Find all search result containers
    search_results = soup.select("div.g")

    position = 0
    for result in search_results:
        # Extract the URL
        link_tag = result.select_one("a[href]")
        if not link_tag:
            continue

        url = link_tag.get("href", "")
        if not url.startswith("http"):
            continue

        # Count positions only for containers that yield a usable result,
        # so rankings stay sequential even when some div.g blocks are skipped
        position += 1

        # Extract the title
        title_tag = result.select_one("h3")
        title = title_tag.get_text(strip=True) if title_tag else ""

        # Extract the snippet
        snippet_tag = result.select_one("div.VwiC3b")
        snippet = snippet_tag.get_text(strip=True) if snippet_tag else ""

        # Parse the domain
        parsed_url = urlparse(url)
        domain = parsed_url.netloc.replace("www.", "")

        result_data = {
            "position": position,
            "url": url,
            "domain": domain,
            "title": title,
            "snippet": snippet,
        }
        results.append(result_data)

        # Check if this is our target domain
        if target_domain in domain and target_rank is None:
            target_rank = position

    return {
        "results": results,
        "target_rank": target_rank,
        "total_results": len(results),
    }

Making the Parser Resilient

Google changes its HTML structure regularly. To build a parser that does not break every few weeks, follow these principles:

  • Use multiple selector strategies: Have fallback selectors for each element. If the primary CSS selector fails, try alternative selectors (see the sketch after this list).
  • Validate extracted data: Check that URLs look like URLs, positions are sequential, and titles are non-empty.
  • Store raw HTML: Always save the raw HTML response alongside parsed data. When your parser breaks, you can fix it and re-parse without re-scraping.
  • Monitor parse success rate: Track what percentage of responses produce valid results. A sudden drop in parse rate indicates a structural change.
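
Here is a minimal sketch of the first two principles. The fallback selector strings are illustrative examples, not a guaranteed match for Google's current markup:

# Illustrative fallback-selector helper (selector strings are examples only)
from bs4 import Tag

TITLE_SELECTORS = ["h3", "div[role='heading']"]
SNIPPET_SELECTORS = ["div.VwiC3b", "div[data-sncf]", "span.aCOpRe"]

def select_text(result: Tag, selectors: list[str]) -> str:
    """Return text from the first selector that matches and yields non-empty text."""
    for selector in selectors:
        tag = result.select_one(selector)
        if tag:
            text = tag.get_text(strip=True)
            if text:  # validate: only accept non-empty text
                return text
    return ""

# Inside parse_serp, the hardcoded lookups would become:
#   title = select_text(result, TITLE_SELECTORS)
#   snippet = select_text(result, SNIPPET_SELECTORS)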

Step 5: Database Layer

Store ranking data in SQLite for simplicity. For larger operations, the same schema works with PostgreSQL or MySQL.

# database.py
import sqlite3
from datetime import datetime
import config

def init_db():
    """Initialize the database with required tables."""
    conn = sqlite3.connect(config.DB_PATH)
    cursor = conn.cursor()

    cursor.execute("""
        CREATE TABLE IF NOT EXISTS rankings (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            keyword TEXT NOT NULL,
            position INTEGER,
            url TEXT,
            title TEXT,
            tracked_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
            google_domain TEXT DEFAULT 'google.com'
        )
    """)

    cursor.execute("""
        CREATE TABLE IF NOT EXISTS serp_snapshots (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            keyword TEXT NOT NULL,
            results_json TEXT,
            raw_html TEXT,
            tracked_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)

    cursor.execute("""
        CREATE INDEX IF NOT EXISTS idx_rankings_keyword
        ON rankings(keyword, tracked_at)
    """)

    conn.commit()
    conn.close()

def save_ranking(keyword: str, position: int | None,
                 url: str | None, title: str | None):
    """Save a single ranking result."""
    conn = sqlite3.connect(config.DB_PATH)
    cursor = conn.cursor()
    cursor.execute(
        "INSERT INTO rankings (keyword, position, url, title) "
        "VALUES (?, ?, ?, ?)",
        (keyword, position, url, title)
    )
    conn.commit()
    conn.close()

def get_ranking_history(keyword: str, days: int = 30) -> list:
    """Retrieve ranking history for a keyword."""
    conn = sqlite3.connect(config.DB_PATH)
    cursor = conn.cursor()
    cursor.execute(
        "SELECT position, tracked_at FROM rankings "
        "WHERE keyword = ? AND tracked_at >= datetime('now', ?) "
        "ORDER BY tracked_at",
        (keyword, f"-{days} days")
    )
    results = cursor.fetchall()
    conn.close()
    return results
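
A quick usage sketch of this layer (the keyword, URL, and title are placeholders):

# Example usage of the database layer (values are placeholders)
from database import init_db, save_ranking, get_ranking_history

init_db()
save_ranking("running shoes", 7,
             "https://yourdomain.com/running-shoes", "Best Running Shoes")

# Rows come back as (position, tracked_at) tuples, oldest first
for position, tracked_at in get_ranking_history("running shoes", days=7):
    print(tracked_at, position)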

Step 6: Main Tracker Orchestration

The tracker module ties everything together, managing the flow from keyword loading through scraping to data storage.

# tracker.py
import random
import time
import json
from proxy_manager import ProxyManager
from scraper import build_google_url, fetch_serp, parse_serp
from database import init_db, save_ranking
import config

def load_keywords(filepath: str = "keywords.txt") -> list[str]:
    """Load keywords from a text file, one per line."""
    with open(filepath) as f:
        return [line.strip() for line in f if line.strip()]

def track_keyword(keyword: str, proxy_manager: ProxyManager) -> dict | None:
    """Track ranking for a single keyword."""
    url = build_google_url(keyword)

    for attempt in range(config.MAX_RETRIES):
        proxy = proxy_manager.get_proxy()
        proxy_url = proxy.get("http", "")

        html = fetch_serp(url, proxy)

        if html:
            # Check for CAPTCHA or block pages
            if "detected unusual traffic" in html.lower():
                proxy_manager.report_failure(proxy_url, cooldown_seconds=300)
                continue

            proxy_manager.report_success(proxy_url)
            result = parse_serp(html, config.TARGET_DOMAIN)

            # Save to database
            target_url = None
            target_title = None
            if result["target_rank"]:
                for r in result["results"]:
                    if r["position"] == result["target_rank"]:
                        target_url = r["url"]
                        target_title = r["title"]
                        break

            save_ranking(keyword, result["target_rank"],
                        target_url, target_title)

            return result
        else:
            proxy_manager.report_failure(proxy_url)

        # Delay before retry
        time.sleep(random.uniform(2, 5))

    # All retries failed
    save_ranking(keyword, None, None, None)
    return None

def run_tracking():
    """Run a full tracking cycle for all keywords."""
    init_db()
    keywords = load_keywords()

    gateway = (
        f"http://{config.PROXY_USER}:{config.PROXY_PASS}"
        f"@{config.PROXY_HOST}:{config.PROXY_PORT}"
    )
    proxy_manager = ProxyManager(rotating_gateway=gateway)

    print(f"Starting tracking for {len(keywords)} keywords...")

    for i, keyword in enumerate(keywords):
        result = track_keyword(keyword, proxy_manager)

        if result:
            rank = result["target_rank"] or "Not found"
            print(f"[{i+1}/{len(keywords)}] '{keyword}': Position {rank}")
        else:
            print(f"[{i+1}/{len(keywords)}] '{keyword}': FAILED")

        # Random delay between keywords
        delay = random.uniform(
            config.REQUEST_DELAY_MIN,
            config.REQUEST_DELAY_MAX
        )
        time.sleep(delay)

    print("Tracking cycle complete.")

if __name__ == "__main__":
    run_tracking()

Step 7: Scheduling

Set up automatic tracking runs at specified times:

# scheduler.py
import schedule
import time
from tracker import run_tracking
import config

def setup_schedule():
    """Configure the tracking schedule."""
    for track_time in config.TRACK_TIMES:
        schedule.every().day.at(track_time).do(run_tracking)

    print(f"Scheduled tracking at: {', '.join(config.TRACK_TIMES)}")
    print("Waiting for next scheduled run...")

    while True:
        schedule.run_pending()
        time.sleep(60)

if __name__ == "__main__":
    setup_schedule()

For production deployments, consider using system-level schedulers (cron on Linux, Task Scheduler on Windows) or containerized solutions with Docker and a process manager. The Python schedule library is excellent for development but lacks persistence across restarts.

Extending the Tracker

Adding Email or Slack Alerts

Set up notifications for significant ranking changes. A useful threshold is alerting when a keyword moves more than 5 positions in either direction, when a keyword enters or exits the top 10, or when a keyword drops off page 1 entirely.
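
As a minimal sketch of that logic, comparing the two most recent rows in the rankings table (the detect_significant_change helper below is illustrative, not part of the modules above):

# alerts.py: illustrative ranking-change check (helper name is hypothetical)
import sqlite3
import config

def detect_significant_change(keyword: str) -> str | None:
    """Describe a significant move between the two most recent tracked positions."""
    conn = sqlite3.connect(config.DB_PATH)
    rows = conn.execute(
        "SELECT position FROM rankings WHERE keyword = ? "
        "AND position IS NOT NULL ORDER BY tracked_at DESC LIMIT 2",
        (keyword,),
    ).fetchall()
    conn.close()

    if len(rows) < 2:
        return None

    current, previous = rows[0][0], rows[1][0]
    if previous <= 10 < current:
        return f"'{keyword}' dropped out of the top 10 ({previous} -> {current})"
    if current <= 10 < previous:
        return f"'{keyword}' entered the top 10 ({previous} -> {current})"
    if abs(current - previous) > 5:
        return f"'{keyword}' moved more than 5 positions ({previous} -> {current})"
    return None

# Pass any non-None message to your email or Slack client of choice.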

Generating Reports

Query your SQLite database to generate weekly reports showing average position changes, keywords with the most improvement, keywords that need attention, and visibility score trends. For a broader look at how Python can be used for data-driven scraping projects, see our Python price scraping tutorial with proxies.
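
For example, here is a sketch of a weekly movement query against the rankings table, comparing each keyword's average position this week with the week before (one of many ways to slice the data):

# reports.py: illustrative weekly movement report
import sqlite3
import config

def weekly_movement() -> list[tuple]:
    """Average position this week vs. last week, biggest improvements first."""
    conn = sqlite3.connect(config.DB_PATH)
    rows = conn.execute("""
        SELECT keyword,
               AVG(CASE WHEN tracked_at >= datetime('now', '-7 days')
                        THEN position END) AS this_week,
               AVG(CASE WHEN tracked_at < datetime('now', '-7 days')
                        AND tracked_at >= datetime('now', '-14 days')
                        THEN position END) AS last_week
        FROM rankings
        WHERE position IS NOT NULL
        GROUP BY keyword
    """).fetchall()
    conn.close()

    # Keep keywords with data in both weeks; sort by improvement (positive = up)
    movers = [(kw, this, last, last - this)
              for kw, this, last in rows
              if this is not None and last is not None]
    return sorted(movers, key=lambda row: row[3], reverse=True)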

Multi-Location Tracking

To track rankings from different geographic locations, modify the configuration to support multiple Google domains and use geo-targeted proxies for each location. This is essential for businesses with a local SEO focus or international presence.
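
A minimal sketch of what that configuration change might look like (the LOCATIONS structure and the country suffix on the proxy username are assumptions; check your proxy provider's documentation for its actual geo-targeting syntax):

# config.py additions (illustrative): per-location settings
LOCATIONS = {
    "us": {"google_domain": "google.com", "gl": "us", "hl": "en"},
    "de": {"google_domain": "google.de",  "gl": "de", "hl": "de"},
    "fr": {"google_domain": "google.fr",  "gl": "fr", "hl": "fr"},
}

# Many residential providers accept a country tag in the proxy username,
# e.g. "username-country-de"; the exact format is provider-specific.
def proxy_for_country(country: str) -> str:
    return (
        f"http://{PROXY_USER}-country-{country}:{PROXY_PASS}"
        f"@{PROXY_HOST}:{PROXY_PORT}"
    )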

Common Pitfalls and How to Avoid Them

  • Not rotating user agents. Symptom: blocks after 50-100 requests. Solution: use fake-useragent or a curated list of 20+ real user agents.
  • Ignoring TLS fingerprinting. Symptom: blocks despite good proxies. Solution: use curl_cffi with impersonate="chrome" instead of standard requests.
  • Too-fast request rate. Symptom: CAPTCHAs and 429 responses. Solution: a minimum 3-second delay, random jitter, and backoff on errors.
  • Hardcoded CSS selectors. Symptom: parser breaks every few weeks. Solution: multiple fallback selectors and parse rate monitoring.
  • Not storing raw HTML. Symptom: cannot recover from parsing bugs. Solution: save raw HTML in the database alongside parsed data.
  • Single proxy provider. Symptom: total failure if the provider has issues. Solution: keep a backup proxy provider configured.

Frequently Asked Questions

Why not just use the Google Search API instead of scraping?

The Google Custom Search JSON API has significant limitations for rank tracking. It returns results from a custom search engine, not the actual Google SERP, so rankings do not match what real users see. The API is also limited to 10,000 queries per day (paid) and 100 queries per day (free), which is insufficient for tracking hundreds of keywords multiple times per day. Scraping the actual SERP provides real ranking data as users experience it.

How many keywords can this tracker handle?

With a single rotating proxy gateway and a 5-second average delay between requests (3,600 seconds ÷ 5 ≈ 720 queries per hour), you can track roughly 700 keywords per hour, or about 17,000 per day across the three scheduled tracking sessions, since each session has an eight-hour window before the next one starts. To scale beyond this, you can run multiple tracker instances in parallel with separate proxy connections, reduce delays (with better proxies), or distribute tracking across multiple servers.

Will Google detect and block this scraper?

With properly configured residential proxies, TLS fingerprinting via curl_cffi, realistic user agents, and appropriate delays, detection rates are very low — typically under 5% of requests trigger CAPTCHAs. The key is using quality residential or ISP proxies rather than datacenter proxies, and implementing proper rate limiting. The retry logic in our tracker handles the occasional detection gracefully.

Can I modify this to track Bing or other search engines?

Yes. The architecture is search-engine-agnostic. You would need to modify the URL builder for the target engine's search URL format, update the HTML parser for the target engine's result structure, and adjust rate limiting (most alternative engines are less aggressive than Google). The proxy manager, database, and scheduler modules work without modification.
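
As an illustration, a Bing version of the URL builder might look like the sketch below. Bing's q, count, and first parameters are standard, but verify the parser selectors against Bing's live markup before relying on it:

# scraper_bing.py: illustrative URL builder for Bing
from urllib.parse import urlencode
import config

def build_bing_url(keyword: str, page: int = 0) -> str:
    """Build a Bing search URL; 'first' is the 1-based offset of the first result."""
    params = {
        "q": keyword,
        "count": config.RESULTS_PER_PAGE,
        "first": page * config.RESULTS_PER_PAGE + 1,
    }
    return f"https://www.bing.com/search?{urlencode(params)}"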

How accurate is scraped ranking data compared to commercial tools?

Scraped ranking data is actually more accurate than most commercial tools because you control exactly when, where, and how the query is executed. Commercial tools often use cached data, approximate locations, or aggregate across multiple data sources. The main challenge with scraped data is consistency — you need to control for personalization (use clean sessions), location (use geo-targeted proxies), and timing (track at consistent times) to get comparable day-over-day data.
