Sitemap Scraping with Python: Build a Parser for Large Sites

Sitemap scraping is one of the cleanest ways to discover URLs before building a crawler. Instead of guessing paths, clicking through navigation, or crawling every internal link from the homepage, you can read the URLs a site already exposes for search engines. For large sites, this can save hours of crawl time and reduce unnecessary requests.

This tutorial shows how to build a Python sitemap parser that handles sitemap indexes, normal URL sitemaps, gzip-compressed files, XML namespaces, deduplication, and basic crawl prioritization.

What sitemap scraping is useful for

A sitemap is an XML file that lists URLs a site wants crawlers to discover. It may include metadata such as lastmod, changefreq, and priority. Many sites expose it at:

https://example.com/sitemap.xml

or reference it from:

https://example.com/robots.txt

Sitemap scraping is useful when you need to:

  • discover product, article, category, or location URLs
  • build a seed list for a focused crawler
  • monitor new content published by competitors
  • compare indexed URL inventory over time
  • avoid noisy full-site crawling

It is not a replacement for normal crawling. A sitemap may omit pages, include stale pages, or split URLs across many files. Treat it as a high-quality starting point.

Sitemap index vs URL sitemap

There are two common sitemap formats.

A sitemap index points to other sitemap files:

<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/post-sitemap.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://example.com/page-sitemap.xml</loc>
  </sitemap>
</sitemapindex>

A URL sitemap lists actual pages:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/article</loc>
    <lastmod>2026-04-26</lastmod>
  </url>
</urlset>

A robust parser needs to handle both.

Step 1: fetch robots.txt and find sitemap URLs

Many sites declare one or more sitemaps in robots.txt. Start there.

from urllib.parse import urljoin
import requests

def sitemaps_from_robots(site_url):
    robots_url = urljoin(site_url.rstrip("/") + "/", "robots.txt")
    response = requests.get(robots_url, timeout=20)
    response.raise_for_status()

    sitemaps = []
    for line in response.text.splitlines():
        if line.lower().startswith("sitemap:"):
            sitemaps.append(line.split(":", 1)[1].strip())
    return sitemaps

If robots.txt does not list a sitemap, fall back to /sitemap.xml.

Step 2: fetch XML and handle gzip

Some sitemap files end in .gz. The requests library transparently decodes HTTP-level compression (the Content-Encoding header), but a .xml.gz file is gzip-compressed as a payload rather than as transport encoding, so it usually arrives compressed and needs manual decompression.

import gzip
import requests

def fetch_sitemap_bytes(url):
    headers = {"User-Agent": "DRT-Sitemap-ResearchBot/1.0"}
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    data = response.content

    if url.endswith(".gz"):
        data = gzip.decompress(data)

    return data

Use a clear user agent for research and monitoring work. Do not pretend to be Googlebot.
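The extension check can misfire: some servers serve .gz files with Content-Encoding: gzip, in which case requests has already decompressed the body and gzip.decompress would raise. A more defensive sketch checks the gzip magic bytes instead of the URL; maybe_decompress is a hypothetical helper name, not part of any library:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"

def maybe_decompress(data):
    # Decompress only when the payload actually starts with the gzip
    # magic bytes, regardless of the URL's file extension.
    if data[:2] == GZIP_MAGIC:
        return gzip.decompress(data)
    return data
```

You can drop this in place of the endswith(".gz") check and it handles both cases.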

Step 3: parse sitemap XML with namespaces

Sitemap XML declares the sitemaps.org namespace on its root element, so lookups with bare tag names such as find("loc") return nothing. Use namespace-aware selectors.

import xml.etree.ElementTree as ET

NS = {"s": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap(data):
    root = ET.fromstring(data)
    tag = root.tag.lower()

    if tag.endswith("sitemapindex"):
        return {
            "type": "index",
            "sitemaps": [
                loc.text.strip()
                for loc in root.findall(".//s:sitemap/s:loc", NS)
                if loc.text
            ],
        }

    if tag.endswith("urlset"):
        urls = []
        for item in root.findall(".//s:url", NS):
            loc = item.find("s:loc", NS)
            lastmod = item.find("s:lastmod", NS)
            if loc is not None and loc.text:
                urls.append({
                    "url": loc.text.strip(),
                    "lastmod": lastmod.text.strip() if lastmod is not None and lastmod.text else "",
                })
        return {"type": "urlset", "urls": urls}

    return {"type": "unknown", "urls": []}
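As a quick sanity check, the sample urlset from earlier can be fed through the same namespace-aware lookups the parser relies on:

```python
import xml.etree.ElementTree as ET

NS = {"s": "http://www.sitemaps.org/schemas/sitemap/0.9"}

sample = b"""<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/article</loc>
    <lastmod>2026-04-26</lastmod>
  </url>
</urlset>"""

root = ET.fromstring(sample)
# The namespaced root tag is "{http://www.sitemaps.org/...}urlset",
# which is why the parser matches with endswith() rather than equality.
print(root.tag.endswith("urlset"))     # True
for url in root.findall(".//s:url", NS):
    print(url.find("s:loc", NS).text)  # https://example.com/article
```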

Step 4: recursively collect URLs

A sitemap index can point to dozens or hundreds of child sitemaps. Recursion is fine, but add limits so a broken sitemap cannot loop forever.

from collections import deque

def collect_urls(seed_sitemaps, max_sitemaps=500):
    seen_sitemaps = set()
    seen_urls = {}
    queue = deque(seed_sitemaps)

    while queue and len(seen_sitemaps) < max_sitemaps:
        sitemap_url = queue.popleft()
        if sitemap_url in seen_sitemaps:
            continue
        seen_sitemaps.add(sitemap_url)

        try:
            data = fetch_sitemap_bytes(sitemap_url)
            parsed = parse_sitemap(data)
        except Exception:
            # one unreachable or malformed child sitemap should not
            # abort the whole collection
            continue

        if parsed["type"] == "index":
            for child in parsed["sitemaps"]:
                if child not in seen_sitemaps:
                    queue.append(child)

        elif parsed["type"] == "urlset":
            for row in parsed["urls"]:
                seen_urls[row["url"]] = row

    return list(seen_urls.values())

The dictionary deduplicates URLs, keeping the last-seen entry when the same URL appears in more than one sitemap.

Step 5: prioritize what to crawl

Large sitemaps can contain hundreds of thousands of URLs. You usually do not want to crawl everything immediately. Prioritize by URL pattern and lastmod.

from datetime import date

def priority_score(row):
    url = row["url"]
    score = 0

    if "/blog/" in url or "/article/" in url:
        score += 30
    if "/product/" in url or "/p/" in url:
        score += 20
    if row.get("lastmod", "").startswith(str(date.today().year)):
        score += 10

    return score

urls = sorted(collect_urls(["https://example.com/sitemap.xml"]),
              key=priority_score,
              reverse=True)

For competitor monitoring, you might prioritize recent lastmod values. For market research, you might prioritize product, category, or location pages. For content analysis, you might prioritize blog and guide URLs.
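For the competitor-monitoring case, lastmod values in the W3C date format sort correctly as plain strings, so a lexicographic sort is enough as a sketch. The rows below are made-up example data:

```python
rows = [
    {"url": "https://example.com/a", "lastmod": "2024-11-02"},
    {"url": "https://example.com/b", "lastmod": "2026-01-15"},
    {"url": "https://example.com/c", "lastmod": ""},
]

# ISO-style date strings compare correctly as strings; rows with a
# missing lastmod get the empty string and sort to the end.
rows.sort(key=lambda r: r.get("lastmod", ""), reverse=True)
print([r["url"] for r in rows])  # newest first, missing lastmod last
```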

Complete minimal script

def discover_sitemap_urls(site_url):
    found = sitemaps_from_robots(site_url)
    if found:
        return found
    return [site_url.rstrip("/") + "/sitemap.xml"]

if __name__ == "__main__":
    site = "https://example.com"
    sitemaps = discover_sitemap_urls(site)
    urls = collect_urls(sitemaps)

    print(f"found {len(urls)} urls")
    for row in urls[:20]:
        print(row["url"], row.get("lastmod", ""))

This is enough for URL discovery. For production, add retries with backoff, structured logs, persistence, and per-domain request limits. Our rate limit backoff guide covers the retry layer in detail.
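The retry layer can be sketched as a small wrapper with exponential backoff. The with_retries name and the injectable sleep parameter are illustrative choices, not part of any library:

```python
import time

def with_retries(fetch, url, attempts=4, base_delay=1.0, sleep=time.sleep):
    # Retry a fetch callable with exponential backoff: 1s, 2s, 4s, ...
    # `sleep` is injectable so tests can run without real waiting.
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as error:
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise last_error
```

Wrapping the fetch from Step 2 is then a one-liner: with_retries(fetch_sitemap_bytes, sitemap_url).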

Common sitemap scraping problems

  • HTML instead of XML. The site may block your request or redirect to a landing page.
  • Huge files. Stream or chunk large files instead of loading everything in memory.
  • Stale URLs. Validate important URLs before treating them as live.
  • Multiple namespaces. News, image, and video sitemap extensions need extra parsing.
  • Bad lastmod values. Some sites update lastmod on every deploy, making it less useful.
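A cheap guard against the first problem is a heuristic check before parsing. This sketch only inspects the first bytes of the payload and will not catch every block page, but it filters the obvious HTML responses:

```python
def looks_like_xml(data):
    # A sitemap response should start with an XML declaration or one of
    # the two sitemap root tags; block pages and login redirects usually
    # start with "<!DOCTYPE html" or "<html".
    head = data.lstrip()[:100].lower()
    return (head.startswith(b"<?xml")
            or b"<urlset" in head
            or b"<sitemapindex" in head)
```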

Bottom line

Sitemap scraping is a low-noise way to discover URLs, build crawl seeds, and monitor site changes. Start with robots.txt, parse sitemap indexes recursively, deduplicate URLs, and prioritize the URLs that match your research goal. Then use a normal crawler for the pages that matter.

If you are building a full crawler after URL discovery, read our news crawler tutorial for scheduling, parsing, storage, and monitoring patterns.

Multi-Account Proxies: Setup, Types, Tools & Mistakes (2026)