Sitemap scraping is one of the cleanest ways to discover URLs before building a crawler. Instead of guessing paths, clicking through navigation, or crawling every internal link from the homepage, you can read the URLs a site already exposes for search engines. For large sites, this can save hours of crawl time and reduce unnecessary requests.
This tutorial shows how to build a Python sitemap parser that handles sitemap indexes, normal URL sitemaps, gzip-compressed files, XML namespaces, deduplication, and basic crawl prioritization.
What sitemap scraping is useful for
A sitemap is an XML file that lists URLs a site wants crawlers to discover. It may include metadata such as lastmod, changefreq, and priority. Many sites expose it at:
https://example.com/sitemap.xml
or reference it from:
https://example.com/robots.txt
Sitemap scraping is useful when you need to:
- discover product, article, category, or location URLs
- build a seed list for a focused crawler
- monitor new content published by competitors
- compare indexed URL inventory over time
- avoid noisy full-site crawling
It is not a replacement for normal crawling. A sitemap may omit pages, include stale pages, or split URLs across many files. Treat it as a high-quality starting point.
Sitemap index vs URL sitemap
There are two common sitemap formats.
A sitemap index points to other sitemap files:
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/post-sitemap.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://example.com/page-sitemap.xml</loc>
  </sitemap>
</sitemapindex>
A URL sitemap lists actual pages:
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/article</loc>
    <lastmod>2026-04-26</lastmod>
  </url>
</urlset>
A robust parser needs to handle both.
Step 1: fetch robots.txt and find sitemap URLs
Many sites declare one or more sitemaps in robots.txt. Start there.
from urllib.parse import urljoin
import requests

def sitemaps_from_robots(site_url):
    robots_url = urljoin(site_url.rstrip("/") + "/", "robots.txt")
    response = requests.get(robots_url, timeout=20)
    response.raise_for_status()
    sitemaps = []
    for line in response.text.splitlines():
        line = line.strip()
        if line.lower().startswith("sitemap:"):
            sitemaps.append(line.split(":", 1)[1].strip())
    return sitemaps
If robots.txt does not list a sitemap, fall back to /sitemap.xml.
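A quick way to see the fallback in action is the short usage sketch below; example.com is just a placeholder domain, and the complete script at the end wraps the same logic in a helper:

site = "https://example.com"  # placeholder domain
sitemaps = sitemaps_from_robots(site)
if not sitemaps:
    # Fall back to the conventional location when robots.txt lists nothing.
    sitemaps = [site.rstrip("/") + "/sitemap.xml"]
print(sitemaps)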
Step 2: fetch XML and handle gzip
Some sitemap files end in .gz. The requests library usually handles HTTP compression, but a .xml.gz file may still need manual decompression.
import gzip
import requests

def fetch_sitemap_bytes(url):
    headers = {"User-Agent": "DRT-Sitemap-ResearchBot/1.0"}
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    data = response.content
    if url.endswith(".gz"):
        data = gzip.decompress(data)
    return data
Use a clear user agent for research and monitoring work. Do not pretend to be Googlebot.
Step 3: parse sitemap XML with namespaces
Sitemap XML uses a namespace, so simple tag names often fail. Use namespace-aware selectors.
import xml.etree.ElementTree as ET

NS = {"s": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap(data):
    root = ET.fromstring(data)
    tag = root.tag.lower()
    if tag.endswith("sitemapindex"):
        return {
            "type": "index",
            "sitemaps": [
                loc.text.strip()
                for loc in root.findall(".//s:sitemap/s:loc", NS)
                if loc.text
            ],
        }
    if tag.endswith("urlset"):
        urls = []
        for item in root.findall(".//s:url", NS):
            loc = item.find("s:loc", NS)
            lastmod = item.find("s:lastmod", NS)
            if loc is not None and loc.text:
                urls.append({
                    "url": loc.text.strip(),
                    "lastmod": lastmod.text.strip() if lastmod is not None and lastmod.text else "",
                })
        return {"type": "urlset", "urls": urls}
    return {"type": "unknown", "urls": []}
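A quick sanity check helps confirm the namespace handling. The snippet below feeds the small urlset example from earlier into parse_sitemap as an inline byte string:

sample = b"""<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/article</loc>
    <lastmod>2026-04-26</lastmod>
  </url>
</urlset>"""
parsed = parse_sitemap(sample)
print(parsed["type"])      # urlset
print(parsed["urls"][0])   # {'url': 'https://example.com/article', 'lastmod': '2026-04-26'}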
Step 4: recursively collect URLs
A sitemap index can point to dozens or hundreds of child sitemaps. Walking them recursively is fine, but add limits so a broken or self-referencing sitemap cannot loop forever. The version below uses an explicit queue, which makes the limit easy to enforce.
def collect_urls(seed_sitemaps, max_sitemaps=500):
    seen_sitemaps = set()
    seen_urls = {}
    queue = list(seed_sitemaps)
    while queue and len(seen_sitemaps) < max_sitemaps:
        sitemap_url = queue.pop(0)
        if sitemap_url in seen_sitemaps:
            continue
        seen_sitemaps.add(sitemap_url)
        data = fetch_sitemap_bytes(sitemap_url)
        parsed = parse_sitemap(data)
        if parsed["type"] == "index":
            for child in parsed["sitemaps"]:
                if child not in seen_sitemaps:
                    queue.append(child)
        elif parsed["type"] == "urlset":
            for row in parsed["urls"]:
                seen_urls[row["url"]] = row
    return list(seen_urls.values())
The dictionary deduplicates URLs while keeping the most recently seen metadata for each URL.
Step 5: prioritize what to crawl
Large sitemaps can contain hundreds of thousands of URLs. You usually do not want to crawl everything immediately. Prioritize by URL pattern and lastmod.
from datetime import date

def priority_score(row):
    url = row["url"]
    score = 0
    if "/blog/" in url or "/article/" in url:
        score += 30
    if "/product/" in url or "/p/" in url:
        score += 20
    if row.get("lastmod", "").startswith(str(date.today().year)):
        score += 10
    return score

urls = sorted(collect_urls(["https://example.com/sitemap.xml"]),
              key=priority_score,
              reverse=True)
For competitor monitoring, you might prioritize recent lastmod values. For market research, you might prioritize product, category, or location pages. For content analysis, you might prioritize blog and guide URLs.
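For the monitoring case, one option is to rank purely on how recent lastmod is. The sketch below assumes lastmod values are ISO dates like 2026-04-26 and ignores anything it cannot parse:

from datetime import date, datetime

def recency_score(row):
    # Higher score for more recently modified URLs; 0 when lastmod is missing or unparseable.
    lastmod = row.get("lastmod", "")[:10]
    try:
        modified = datetime.strptime(lastmod, "%Y-%m-%d").date()
    except ValueError:
        return 0
    age_days = (date.today() - modified).days
    return max(0, 365 - age_days)

recent_first = sorted(urls, key=recency_score, reverse=True)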
Complete minimal script
def discover_sitemap_urls(site_url):
    found = sitemaps_from_robots(site_url)
    if found:
        return found
    return [site_url.rstrip("/") + "/sitemap.xml"]

if __name__ == "__main__":
    site = "https://example.com"
    sitemaps = discover_sitemap_urls(site)
    urls = collect_urls(sitemaps)
    print(f"found {len(urls)} urls")
    for row in urls[:20]:
        print(row["url"], row.get("lastmod", ""))
This is enough for URL discovery. For production, add retries with backoff, structured logs, persistence, and per-domain request limits. Our rate limit backoff guide covers the retry layer in detail.
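A minimal retry layer might look like the sketch below. The attempt count and delays are arbitrary placeholders, and a production version would also respect Retry-After headers and per-domain limits:

import time
import requests

def fetch_with_retries(url, attempts=3, backoff=2.0):
    # Retry transient failures with exponential backoff before giving up.
    for attempt in range(attempts):
        try:
            return fetch_sitemap_bytes(url)
        except requests.RequestException:
            if attempt == attempts - 1:
                raise
            time.sleep(backoff * (2 ** attempt))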
Common sitemap scraping problems
- HTML instead of XML. The site may block your request or redirect to a landing page; a small detection sketch follows this list.
- Huge files. Stream or chunk large files instead of loading everything in memory.
- Stale URLs. Validate important URLs before treating them as live.
- Multiple namespaces. News, image, and video sitemap extensions need extra parsing.
- Bad lastmod values. Some sites update lastmod on every deploy, making it less useful.
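For the first problem, a cheap guard is to reject responses that look like HTML before parsing. The sketch below only sniffs the first bytes and an optional Content-Type header, so treat it as a heuristic rather than full validation:

def looks_like_html(data, content_type=""):
    # Heuristic: blocked or redirected requests often return an HTML page instead of XML.
    head = data[:200].lstrip().lower()
    return ("html" in content_type.lower()
            or head.startswith(b"<!doctype html")
            or head.startswith(b"<html"))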
Bottom line
Sitemap scraping is a low-noise way to discover URLs, build crawl seeds, and monitor site changes. Start with robots.txt, parse sitemap indexes recursively, deduplicate URLs, and prioritize the URLs that match your research goal. Then use a normal crawler for the pages that matter.
If you are building a full crawler after URL discovery, read our news crawler tutorial for scheduling, parsing, storage, and monitoring patterns.