CSS Selectors for Web Scraping: Complete Reference

CSS selectors are the most intuitive way to extract data from HTML. If you have written any CSS, you already know the basics. This reference covers every CSS selector type with practical web scraping examples.

Selector Types

Basic Selectors

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

soup.select('div')              # Element: all <div>
soup.select('.product')         # Class: elements with class="product"
soup.select('#main')            # ID: element with id="main"
soup.select('*')                # Universal: all elements
soup.select('div.product')      # Combined: div with class product
soup.select('div#main')         # Combined: div with id main
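A minimal, self-contained run of the basic selectors (the HTML is made up for illustration; `html.parser` is used here so no extra parser is needed):

```python
from bs4 import BeautifulSoup

html = '''
<div id="main">
  <div class="product">Widget</div>
  <div class="product">Gadget</div>
  <span class="product">Accessory</span>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

print(len(soup.select('.product')))      # 3: any element with the class
print(len(soup.select('div.product')))   # 2: only divs with the class
print(soup.select_one('#main').name)     # div
```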

Attribute Selectors

soup.select('[data-id]')              # Has attribute
soup.select('[data-id="123"]')        # Exact value
soup.select('[class~="product"]')     # Word in list
soup.select('[class^="prod"]')        # Starts with
soup.select('[class$="card"]')        # Ends with
soup.select('[class*="oduct"]')       # Contains substring
soup.select('[href*="amazon.com"]')   # Links to Amazon
soup.select('[data-price]')           # Has data-price attr
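The same selectors against a small made-up snippet, to show what each operator actually matches:

```python
from bs4 import BeautifulSoup

html = '''
<a href="https://www.amazon.com/dp/B01">Buy</a>
<a href="https://example.com/page">Info</a>
<div class="product-card featured" data-id="123"></div>
<div class="price-card" data-id="456"></div>
'''
soup = BeautifulSoup(html, 'html.parser')

print(len(soup.select('[data-id]')))             # 2: both divs have the attribute
print(len(soup.select('[data-id="123"]')))       # 1: exact value
print(len(soup.select('[class~="featured"]')))   # 1: "featured" as a whole word
print(len(soup.select('[class^="prod"]')))       # 1: attribute VALUE starts with "prod"
print(len(soup.select('[href*="amazon.com"]')))  # 1: substring anywhere in href
```

Note that `^=`, `$=`, and `*=` test the raw attribute value as one string, while `~=` splits it into space-separated words first.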

Combinators

# Descendant (space) — any depth
soup.select('div.products span.price')

# Child (>) — direct children only
soup.select('ul.menu > li')

# Adjacent sibling (+) — immediately after
soup.select('h2 + p')  # The p immediately after each h2

# General sibling (~) — any following sibling
soup.select('h2 ~ p')  # All p siblings after h2
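The difference between the four combinators is easiest to see on a nested list (HTML invented for the example):

```python
from bs4 import BeautifulSoup

html = '''
<ul class="menu">
  <li>Home <ul><li>Sub</li></ul></li>
  <li>About</li>
</ul>
<h2>Title</h2>
<p>First paragraph</p>
<p>Second paragraph</p>
'''
soup = BeautifulSoup(html, 'html.parser')

print(len(soup.select('ul.menu li')))    # 3: descendant also matches the nested <li>
print(len(soup.select('ul.menu > li')))  # 2: child matches only the top level
print(len(soup.select('h2 + p')))        # 1: only the paragraph right after <h2>
print(len(soup.select('h2 ~ p')))        # 2: every following sibling paragraph
```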

Pseudo-classes

soup.select('li:first-child')         # First child
soup.select('li:last-child')          # Last child
soup.select('li:nth-child(2)')        # Second child
soup.select('li:nth-child(odd)')      # Odd children
soup.select('li:nth-child(even)')     # Even children
soup.select('li:nth-child(3n)')       # Every 3rd
soup.select('li:nth-child(3n+1)')     # 1st, 4th, 7th...
soup.select('div:not(.hidden)')       # Not hidden
soup.select('p:first-of-type')        # First p in parent
soup.select('p:last-of-type')         # Last p in parent
soup.select(':empty')                 # Empty elements
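A quick check of the nth-child arithmetic on a seven-item list:

```python
from bs4 import BeautifulSoup

html = '<ul>' + ''.join(f'<li>Item {i}</li>' for i in range(1, 8)) + '</ul>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.select_one('li:first-child').text)   # Item 1
print(soup.select_one('li:last-child').text)    # Item 7
print(len(soup.select('li:nth-child(3n+1)')))   # 3: items 1, 4, 7
print(len(soup.select('li:nth-child(odd)')))    # 4: items 1, 3, 5, 7
```

Remember that nth-child counts all children of the parent, not just elements matching the selector; `li:nth-child(2)` means "an li that is also the second child".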

Common Scraping Patterns

E-Commerce Product Cards

def scrape_products(html):
    soup = BeautifulSoup(html, 'lxml')
    products = []
    
    for card in soup.select('div.product-card, article.product'):
        product = {}
        
        # Name — try multiple selectors
        name_el = card.select_one('h2, h3, .product-name, .title')
        product['name'] = name_el.get_text(strip=True) if name_el else ''
        
        # Price — handle sale prices
        sale_price = card.select_one('.sale-price, .special-price, .price--sale')
        regular_price = card.select_one('.regular-price, .price--regular, .price')
        price_el = sale_price or regular_price
        product['price'] = price_el.get_text(strip=True) if price_el else ''
        
        # Link
        link = card.select_one('a[href]')
        product['url'] = link['href'] if link else ''
        
        # Image
        img = card.select_one('img[src], img[data-src]')
        # Parentheses matter: without them, img.get('src') runs even when img is None
        product['image'] = (img.get('src') or img.get('data-src', '')) if img else ''
        
        products.append(product)
    
    return products
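One subtlety in the function above: `select_one` with a comma-separated selector list returns the first match in document order, not in the order the selectors are listed. A quick sketch (made-up markup):

```python
from bs4 import BeautifulSoup

card = BeautifulSoup(
    '<div><span class="title">Span Title</span><h2>H2 Title</h2></div>',
    'html.parser')

# The span comes first in the document, so it wins even though
# 'h2' is listed first in the selector:
print(card.select_one('h2, .title').get_text())  # Span Title
```

If you need a true priority order, loop over the selectors yourself and return on the first hit.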

Table Extraction

import pandas as pd

def extract_table(html, table_selector='table'):
    soup = BeautifulSoup(html, 'lxml')
    table = soup.select_one(table_selector)
    
    if not table:
        return None
    
    # Extract headers
    headers = [th.get_text(strip=True) for th in table.select('thead th, tr:first-child th')]
    
    # Extract rows
    rows = []
    for tr in table.select('tbody tr, tr:not(:first-child)'):
        cells = [td.get_text(strip=True) for td in tr.select('td')]
        if cells:
            rows.append(cells)
    
    # Only apply headers when they match the row width; otherwise let pandas number the columns
    if headers and rows and len(headers) == len(rows[0]):
        return pd.DataFrame(rows, columns=headers)
    return pd.DataFrame(rows)
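For well-formed tables it can be worth trying pandas' own parser before hand-rolling selectors; `pd.read_html` returns one DataFrame per `<table>` found (it needs an HTML parser such as lxml installed, and newer pandas versions want a file-like object rather than a raw string):

```python
from io import StringIO
import pandas as pd

html = '''
<table>
  <thead><tr><th>Name</th><th>Price</th></tr></thead>
  <tbody>
    <tr><td>Widget</td><td>$10</td></tr>
    <tr><td>Gadget</td><td>$20</td></tr>
  </tbody>
</table>
'''
df = pd.read_html(StringIO(html))[0]
print(df.columns.tolist())  # ['Name', 'Price']
print(len(df))              # 2
```

The hand-rolled version above is still useful when the table needs cleanup that `read_html` can't do, such as skipping decorative rows.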

Pagination Links

from urllib.parse import urljoin

def extract_pagination(html, base_url):
    soup = BeautifulSoup(html, 'lxml')
    
    # Common pagination patterns
    selectors = [
        'nav.pagination a',
        '.pagination a',
        'ul.pager a',
        'a.next',
        'a[rel="next"]',
        '.page-numbers a',
    ]
    
    links = set()
    for selector in selectors:
        for a in soup.select(selector):
            href = a.get('href', '')
            if href:
                links.add(urljoin(base_url, href))
    
    return sorted(links)
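The `urljoin` call is what makes this work for both relative and absolute hrefs; it resolves each link against the page URL (URLs below are made up):

```python
from urllib.parse import urljoin

# A relative href is resolved against the page URL:
print(urljoin('https://example.com/shop?page=1', '/shop?page=2'))
# https://example.com/shop?page=2

# An already-absolute href passes through unchanged:
print(urljoin('https://example.com/shop?page=1', 'https://example.com/shop?page=3'))
# https://example.com/shop?page=3
```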

Tips for Writing Robust Selectors

  1. Prefer data attributes over classes: [data-testid="price"] is more stable than .css-abc123
  2. Avoid deep nesting: div.product .price instead of div.container > div.row > div.col > div.product > span.price
  3. Use multiple fallback selectors: Try .price, [data-price], span[itemprop="price"]
  4. Test in browser first: Chrome DevTools Console: document.querySelectorAll('your-selector')
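Tip 3 can be wrapped in a small helper that, unlike a comma-separated selector list, tries selectors in priority order (a sketch; the selector list and markup are illustrative):

```python
from bs4 import BeautifulSoup

def select_first(root, selectors):
    """Return the first element matched by any selector, tried in order."""
    for selector in selectors:
        el = root.select_one(selector)
        if el is not None:
            return el
    return None

card = BeautifulSoup('<div><span data-price="9.99">$9.99</span></div>', 'html.parser')
el = select_first(card, ['.price', '[data-price]', 'span[itemprop="price"]'])
print(el.get_text())  # $9.99
```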

FAQ

Are CSS selectors enough for web scraping?

CSS selectors handle about 80% of scraping tasks. You need XPath for: selecting by text content, navigating to parent elements, and complex conditional selection. Most scrapers use CSS selectors as the primary method with XPath as a fallback.
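Two of those XPath-only cases, text matching and stepping up to a parent, look like this with lxml (markup invented for the example):

```python
from lxml import html

doc = html.fromstring('<div><span>Label</span><span>$5</span></div>')

# XPath can match on text content, which CSS selectors cannot:
print(doc.xpath('//span[text()="$5"]/text()'))  # ['$5']

# And it can navigate back up to a parent element:
print(doc.xpath('//span[text()="$5"]/..')[0].tag)  # div
```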

Which CSS selector method is fastest?

selectolax is the fastest CSS selector engine in Python. lxml’s cssselect is second. BeautifulSoup with lxml parser is third. For most projects, the parser speed difference is negligible compared to network latency.

How do I handle dynamically generated class names?

Modern frameworks (React, Vue, Angular) often generate random class names like .css-1a2b3c. Use attribute selectors ([data-testid="name"]), structural selectors (:nth-child), or partial class matching ([class*="product"]) instead.
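Partial class matching in practice (the generated class names below are invented, as real ones would be):

```python
from bs4 import BeautifulSoup

# Framework-generated markup; the hash suffix changes between builds:
html = '<div class="ProductCard__price-sc-1x2y3z">$49</div><div class="css-9q8r7s">Ad</div>'
soup = BeautifulSoup(html, 'html.parser')

# Matching on the stable part of the name survives hash changes:
print(soup.select_one('[class*="ProductCard"]').get_text())  # $49
```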

Can I select elements by their text content with CSS?

No, CSS selectors cannot match by text content. Use XPath (//div[text()="Hello"]) or BeautifulSoup’s find(string="Hello") method for text-based selection.
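Both workarounds in a short sketch; note that `find(string=...)` returns the text node, so you step to `.parent` for the element. BeautifulSoup's soupsieve backend also ships a non-standard `:-soup-contains()` pseudo-class for substring matching:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p>Hello</p><p>World</p></div>', 'html.parser')

# Exact text match via find(string=...):
text_node = soup.find(string='Hello')
print(text_node.parent.name)  # p

# Substring match via the non-standard soupsieve pseudo-class:
print(soup.select_one('p:-soup-contains("Wor")').get_text())  # World
```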

How do I test CSS selectors before coding?

In Chrome DevTools, press Ctrl+F in the Elements panel and type your CSS selector to highlight matches. Or use Console: document.querySelectorAll('.your-selector').length to count matches.

