CSS Selectors for Web Scraping: Complete Reference
CSS selectors are the most intuitive way to extract data from HTML. If you have written any CSS, you already know the basics. This reference covers every CSS selector type with practical web scraping examples.
Selector Types
Basic Selectors
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
soup.select('div') # Element: all <div>
soup.select('.product') # Class: elements with class="product"
soup.select('#main') # ID: element with id="main"
soup.select('*') # Universal: all elements
soup.select('div.product') # Combined: div with class product
soup.select('div#main') # Combined: div with id main
Attribute Selectors
soup.select('[data-id]') # Has attribute
soup.select('[data-id="123"]') # Exact value
soup.select('[class~="product"]') # Word in list
soup.select('[class^="prod"]') # Starts with
soup.select('[class$="card"]') # Ends with
soup.select('[class*="oduct"]') # Contains substring
soup.select('[href*="amazon.com"]') # Links to Amazon
soup.select('[data-price]') # Has data-price attr
Combinators
# Descendant (space) — any depth
soup.select('div.products span.price')
# Child (>) — direct children only
soup.select('ul.menu > li')
# Adjacent sibling (+) — immediately after
soup.select('h2 + p') # First p after h2
# General sibling (~) — any following sibling
soup.select('h2 ~ p') # All p siblings after h2
Pseudo-classes
soup.select('li:first-child') # First child
soup.select('li:last-child') # Last child
soup.select('li:nth-child(2)') # Second child
soup.select('li:nth-child(odd)') # Odd children
soup.select('li:nth-child(even)') # Even children
soup.select('li:nth-child(3n)') # Every 3rd
soup.select('li:nth-child(3n+1)') # 1st, 4th, 7th...
soup.select('div:not(.hidden)') # Not hidden
soup.select('p:first-of-type') # First p in parent
soup.select('p:last-of-type') # Last p in parent
soup.select(':empty') # Empty elements
Common Scraping Patterns
E-Commerce Product Cards
def scrape_products(html):
    soup = BeautifulSoup(html, 'lxml')
    products = []
    for card in soup.select('div.product-card, article.product'):
        product = {}
        # Name — try multiple selectors
        name_el = card.select_one('h2, h3, .product-name, .title')
        product['name'] = name_el.get_text(strip=True) if name_el else ''
        # Price — prefer the sale price when both are present
        price_el = card.select_one('.sale-price, .special-price, .price--sale') \
            or card.select_one('.regular-price, .price--regular, .price')
        product['price'] = price_el.get_text(strip=True) if price_el else ''
        # Link
        link = card.select_one('a[href]')
        product['url'] = link['href'] if link else ''
        # Image — lazy-loaded images often keep the URL in data-src
        img = card.select_one('img[src], img[data-src]')
        product['image'] = (img.get('src') or img.get('data-src', '')) if img else ''
        products.append(product)
    return products
Table Extraction
import pandas as pd

def extract_table(html, table_selector='table'):
    soup = BeautifulSoup(html, 'lxml')
    table = soup.select_one(table_selector)
    if not table:
        return None
    # Extract headers
    headers = [th.get_text(strip=True) for th in table.select('thead th, tr:first-child th')]
    # Extract rows — prefer tbody; otherwise skip the header row
    body_rows = table.select('tbody tr') or table.select('tr:not(:first-child)')
    rows = []
    for tr in body_rows:
        cells = [td.get_text(strip=True) for td in tr.select('td')]
        if cells:
            rows.append(cells)
    return pd.DataFrame(rows, columns=headers if headers else None)
Pagination Links
from urllib.parse import urljoin

def extract_pagination(html, base_url):
    soup = BeautifulSoup(html, 'lxml')
    # Common pagination patterns
    selectors = [
        'nav.pagination a',
        '.pagination a',
        'ul.pager a',
        'a.next',
        'a[rel="next"]',
        '.page-numbers a',
    ]
    links = set()
    for selector in selectors:
        for a in soup.select(selector):
            href = a.get('href', '')
            if href:
                links.add(urljoin(base_url, href))
    return sorted(links)
Tips for Writing Robust Selectors
- Prefer data attributes over classes: [data-testid="price"] is more stable than .css-abc123
- Avoid deep nesting: div.product .price instead of div.container > div.row > div.col > div.product > span.price
- Use multiple fallback selectors: try .price, [data-price], span[itemprop="price"]
- Test in browser first: in the Chrome DevTools Console, run document.querySelectorAll('your-selector')
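The fallback idea from the tips above can be wrapped in a small helper. This is a sketch: `select_first` is a name invented here, not a Beautiful Soup API.

```python
from bs4 import BeautifulSoup

def select_first(root, selectors):
    """Return the first match from a list of selectors, tried in priority order."""
    for selector in selectors:
        el = root.select_one(selector)
        if el is not None:
            return el
    return None

html_doc = '<div><span class="css-9x8y7z" data-testid="price">$19.99</span></div>'
soup = BeautifulSoup(html_doc, 'lxml')
# Stable data attribute wins; class-based selectors are the fallback
el = select_first(soup, ['[data-testid="price"]', '[itemprop="price"]', '.price'])
print(el.get_text())  # -> $19.99
```

Keeping the selector list in priority order means the most stable hook is always tried first, and a site redesign that drops one class only degrades to the next fallback instead of breaking the scraper.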
Internal Links
- XPath for Web Scraping — when CSS selectors are not enough
- HTML Parsing Guide — choose the right parser library
- Beautiful Soup Tutorial — complete BS4 guide
- Scrapy Tutorial — CSS selectors in Scrapy
- Web Scraping with Python — complete Python guide
FAQ
Are CSS selectors enough for web scraping?
CSS selectors handle about 80% of scraping tasks. You need XPath for: selecting by text content, navigating to parent elements, and complex conditional selection. Most scrapers use CSS selectors as the primary method with XPath as a fallback.
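The parent-navigation case, for example, looks like this with lxml's XPath support (a minimal sketch; the markup is made up for illustration):

```python
from lxml import html

doc = html.fromstring('<div class="card"><span class="price">$9.99</span></div>')
# CSS has no parent selector, but XPath can step back up the tree
span = doc.xpath('//span[@class="price"]')[0]
card = span.getparent()   # equivalent to span.xpath('..')[0]
print(card.get('class'))  # -> card
```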
Which CSS selector method is fastest?
selectolax is the fastest CSS selector engine in Python. lxml’s cssselect is second. BeautifulSoup with lxml parser is third. For most projects, the parser speed difference is negligible compared to network latency.
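To check whether parser speed matters for your pages, a quick micro-benchmark of BeautifulSoup's two common backends is enough (timings vary by machine; the synthetic document below stands in for a real page):

```python
import timeit
from bs4 import BeautifulSoup

html_doc = '<ul>' + ''.join(f'<li class="item">item {i}</li>' for i in range(1000)) + '</ul>'

for parser in ('lxml', 'html.parser'):
    seconds = timeit.timeit(
        lambda: BeautifulSoup(html_doc, parser).select('li.item'),
        number=20,
    )
    print(f'{parser}: {seconds:.3f}s for 20 parse+select runs')
```

If both numbers are a small fraction of a single HTTP round trip, optimizing the parser is not where your time is going.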
How do I handle dynamically generated class names?
Modern frameworks (React, Vue, Angular) often generate random class names like .css-1a2b3c. Use attribute selectors ([data-testid="name"]), structural selectors (:nth-child), or partial class matching ([class*="product"]) instead.
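Concretely, against framework-generated markup the stable hooks look like this (the class names below are invented for illustration):

```python
from bs4 import BeautifulSoup

html_doc = '''
<div class="css-1a2b3c" data-testid="product-name">Ergo Keyboard</div>
<span class="sc-AxirZ price-tag-x9">$49.00</span>
'''
soup = BeautifulSoup(html_doc, 'lxml')
name = soup.select_one('[data-testid="product-name"]')  # survives class-name churn
price = soup.select_one('[class*="price"]')             # partial class match
print(name.get_text(), price.get_text())  # -> Ergo Keyboard $49.00
```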
Can I select elements by their text content with CSS?
No, standard CSS selectors cannot match by text content. Use XPath (//div[text()="Hello"]) or BeautifulSoup’s find(string="Hello") method for text-based selection. (BeautifulSoup’s Soup Sieve engine also offers a non-standard :-soup-contains() pseudo-class, but it will not work outside BeautifulSoup.)
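The BeautifulSoup route looks like this. Note that find(string=...) returns the text node itself, so .parent is needed to reach the enclosing tag:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><a href="/next">Next</a><a href="/prev">Prev</a></div>', 'lxml')
# find(string=...) matches the text node; .parent is the tag around it
next_link = soup.find(string='Next').parent
print(next_link['href'])  # -> /next
```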
How do I test CSS selectors before coding?
In Chrome DevTools, press Ctrl+F in the Elements panel and type your CSS selector to highlight matches. Or use Console: document.querySelectorAll('.your-selector').length to count matches.
Related Reading
- AJAX Request Interception: Scraping API Calls Directly
- Azure Functions for Serverless Web Scraping: the Complete Guide
- Build an Anti-Detection Test Suite: Verify Browser Stealth
- Build a News Crawler in Python: Step-by-Step Tutorial
- How to Configure Proxies on iPhone and Android
- How to Use Proxies in Node.js (Axios, Fetch, Puppeteer)