Regex for Web Scraping: Pattern Matching Guide

Regular expressions are the scalpel of web scraping — precise, fast, and dangerous if misused. While you should never parse HTML structure with regex, it excels at extracting specific data patterns (emails, prices, phone numbers) from text content after initial HTML parsing.

When to Use Regex (and When Not To)

Use regex for:

  • Extracting emails, phones, prices from text
  • Cleaning extracted text (removing whitespace, special chars)
  • Validating scraped data formats
  • Quick pattern matching on API responses

Never use regex for:

  • Parsing HTML structure (use CSS/XPath selectors)
  • Navigating DOM trees
  • Handling nested HTML tags
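The recommended workflow is: parse the HTML first, then run regex on the extracted text. Here is a minimal sketch using only the standard library (`html.parser` instead of a third-party parser; the sample HTML and the `emails_from_html` helper are illustrative):

```python
import re
from html.parser import HTMLParser

EMAIL = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

class TextExtractor(HTMLParser):
    """Collects only text nodes, ignoring tags and attributes."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def emails_from_html(html):
    # Step 1: let a real parser flatten the HTML to plain text
    parser = TextExtractor()
    parser.feed(html)
    text = ' '.join(parser.chunks)
    # Step 2: run regex on the plain text only
    return EMAIL.findall(text)

html = '<p>Contact: <a href="mailto:x@y.io">sales@example.com</a></p>'
print(emails_from_html(html))  # ['sales@example.com']
```

Note that the `mailto:` address hidden in the attribute is never seen by the regex, because the parser only hands over text nodes.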

Essential Scraping Patterns

import re

# Email extraction
EMAIL = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

# Price extraction (multiple formats)
PRICE = re.compile(r'\$[\d,]+\.?\d{0,2}|\d+[.,]\d{2}\s*(?:USD|EUR|GBP)')

# Phone numbers (US format; allows -, ., or space separators)
PHONE_US = re.compile(r'(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}')

# URL extraction
URL = re.compile(r'https?://[\w.-]+(?:/[\w./?%&=@#-]*)?')

# Date extraction (multiple formats)
DATE = re.compile(
    r'\d{4}[-/]\d{2}[-/]\d{2}'         # 2024-01-15
    r'|\d{2}[-/]\d{2}[-/]\d{4}'        # 01/15/2024
    r'|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\w*\s+\d{1,2},?\s+\d{4}'  # January 15, 2024
)

# SKU/Product codes
SKU = re.compile(r'[A-Z]{2,4}[-]?\d{3,8}')

# IP addresses (loose: does not reject octets above 255)
IP = re.compile(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b')

# Strip HTML tags (crude text cleanup, not structure parsing)
TAGS = re.compile(r'<[^>]+>')
clean_text = TAGS.sub('', html_content)  # html_content: raw HTML string

# Extract JSON from script tags
JSON_IN_SCRIPT = re.compile(r'<script[^>]*>\s*var\s+\w+\s*=\s*({.*?});?\s*</script>', re.DOTALL)

# Extract numbers
NUMBERS = re.compile(r'[\d,]+\.?\d*')
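A quick smoke test of a few of the patterns above (the sample string is made up; the patterns are copied here so the snippet runs standalone):

```python
import re

# Copies of the patterns defined above
EMAIL = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
PRICE = re.compile(r'\$[\d,]+\.?\d{0,2}|\d+[.,]\d{2}\s*(?:USD|EUR|GBP)')
URL = re.compile(r'https?://[\w.-]+(?:/[\w./?%&=@#-]*)?')

sample = "Total: $1,299.99. Questions? support@shop.example or https://shop.example/help"
print(PRICE.findall(sample))  # ['$1,299.99']
print(EMAIL.findall(sample))  # ['support@shop.example']
print(URL.findall(sample))    # ['https://shop.example/help']
```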

Practical Examples

def extract_product_data(text):
    """Extract structured data using regex patterns."""
    return {
        'prices': PRICE.findall(text),
        'emails': EMAIL.findall(text),
        'phones': PHONE_US.findall(text),
        'dates': DATE.findall(text),
        'urls': URL.findall(text),
    }

def clean_price(price_str):
    """Normalize price strings to float."""
    cleaned = re.sub(r'[^\d.]', '', price_str)
    try:
        return float(cleaned)
    except ValueError:
        return None

def extract_between(text, start, end):
    """Extract text between two markers."""
    pattern = re.compile(f'{re.escape(start)}(.*?){re.escape(end)}', re.DOTALL)
    matches = pattern.findall(text)
    return [m.strip() for m in matches]
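The two helpers in action (redefined here so the snippet is standalone; the inputs are made-up examples):

```python
import re

def clean_price(price_str):
    """Normalize price strings to float (same as the helper above)."""
    cleaned = re.sub(r'[^\d.]', '', price_str)
    try:
        return float(cleaned)
    except ValueError:
        return None

def extract_between(text, start, end):
    """Extract text between two literal markers (same as the helper above)."""
    pattern = re.compile(f'{re.escape(start)}(.*?){re.escape(end)}', re.DOTALL)
    return [m.strip() for m in pattern.findall(text)]

print(clean_price('$1,299.99'))  # 1299.99
print(clean_price('N/A'))        # None
print(extract_between('Name: <b>Ada</b>', '<b>', '</b>'))  # ['Ada']
```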

# Extract structured data from script tags
def extract_json_ld(html):
    """Extract JSON-LD structured data."""
    pattern = re.compile(
        r'<script\s+type="application/ld\+json"[^>]*>\s*(.*?)\s*</script>',
        re.DOTALL
    )
    matches = pattern.findall(html)
    import json
    return [json.loads(m) for m in matches if m.strip()]
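Applied to a sample page (the HTML below is illustrative; note the pattern assumes double-quoted attributes, so single-quoted `type='...'` would slip past it):

```python
import json
import re

def extract_json_ld(html):
    """Extract JSON-LD blobs from script tags, as defined above."""
    pattern = re.compile(
        r'<script\s+type="application/ld\+json"[^>]*>\s*(.*?)\s*</script>',
        re.DOTALL
    )
    return [json.loads(m) for m in pattern.findall(html) if m.strip()]

html = '''<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Widget", "offers": {"price": "19.99"}}
</script>
</head></html>'''

data = extract_json_ld(html)
print(data[0]['name'])  # Widget
```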

Named Groups for Structured Extraction

# Named groups make extraction cleaner
ADDRESS_PATTERN = re.compile(
    r'(?P<street>\d+\s+[\w\s]+(?:St|Ave|Blvd|Rd|Dr|Ln|Way))\.?\s*,?\s*'
    r'(?P<city>[A-Z][\w\s]+),\s*'
    r'(?P<state>[A-Z]{2})\s+'
    r'(?P<zip>\d{5}(?:-\d{4})?)'
)

text = "Visit us at 123 Main Street, New York, NY 10001"
match = ADDRESS_PATTERN.search(text)
if match:
    print(match.groupdict())
    # {'street': '123 Main Street', 'city': 'New York', 'state': 'NY', 'zip': '10001'}

Performance Tips

# Compile patterns once, reuse many times
COMPILED = re.compile(r'\d+')  # Do this once

# Use re.findall for all matches
all_numbers = COMPILED.findall(text)

# Use re.search for first match only
first = COMPILED.search(text)

# Use non-greedy matching (.*?) instead of greedy (.*)
# Greedy: re.compile(r'<div>(.*)</div>')     — spans from the first <div> to the last </div>
# Non-greedy: re.compile(r'<div>(.*?)</div>') — stops at the first </div>
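The difference is easy to see on an input with two tags (made-up sample string):

```python
import re

html = '<div>first</div> <div>second</div>'
greedy = re.compile(r'<div>(.*)</div>')
lazy = re.compile(r'<div>(.*?)</div>')

# Greedy swallows everything up to the LAST closing tag
print(greedy.findall(html))  # ['first</div> <div>second']
# Non-greedy stops at the first closing tag, yielding one match per element
print(lazy.findall(html))    # ['first', 'second']
```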

FAQ

Should I use regex to parse HTML?

No. Use CSS selectors or XPath for HTML structure. Regex is for extracting data patterns from text content that you have already extracted from HTML. The classic answer is: you cannot reliably parse nested structures with regular expressions.

How do I make regex patterns case-insensitive?

Use the re.IGNORECASE flag: re.compile(r'pattern', re.IGNORECASE). Or use the inline flag at the start of the pattern: (?i)pattern.
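Both forms in a two-line example (the pattern and sample strings are made up):

```python
import re

pattern = re.compile(r'in stock', re.IGNORECASE)
print(bool(pattern.search('Status: IN STOCK')))       # True
print(bool(re.search(r'(?i)in stock', 'In Stock')))   # True
```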

What is the fastest way to apply regex to millions of records?

Compile patterns once, use re.findall for simple extraction, and consider the third-party regex module (pip install regex), which adds features the standard library lacks and can be faster on some complex patterns. For very high throughput, consider a specialized engine such as Intel's Hyperscan or Rust's regex crate via bindings.

How do I handle multiline content?

Use the re.DOTALL flag to make . match newlines: re.compile(r'<div>(.*?)</div>', re.DOTALL). Use re.MULTILINE to make ^ and $ match line boundaries.
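The two flags solve different problems, as this made-up multiline sample shows:

```python
import re

text = 'intro\n<div>\nline one\nline two\n</div>'

# Without DOTALL, . stops at newlines, so the block is not matched at all
print(re.findall(r'<div>(.*?)</div>', text))             # []
# With DOTALL, . crosses newlines and the whole block is captured
print(re.findall(r'<div>(.*?)</div>', text, re.DOTALL))  # ['\nline one\nline two\n']
# MULTILINE makes ^ match at the start of every line, not just the string
print(re.findall(r'^line \w+', text, re.MULTILINE))      # ['line one', 'line two']
```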

Can regex extract data from minified JavaScript?

Yes, but it is fragile. Minified JS removes whitespace but preserves string literals. Use regex to extract specific variable assignments or function calls. For complex JS parsing, consider AST-based tools instead.
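A sketch of pulling one variable assignment out of minified JS. The blob and the variable name cfg are hypothetical, and the lazy {.*?} trick only works for flat objects: nested braces would cut the match short.

```python
import json
import re

# Hypothetical minified JS blob; "cfg" is an assumed variable name
minified = 'var a=1;var cfg={"apiKey":"k123","page":2};doThing(cfg);'

# Lazy match from the opening brace to the first "};" — flat objects only
match = re.search(r'var\s+cfg\s*=\s*({.*?});', minified)
if match:
    cfg = json.loads(match.group(1))
    print(cfg['apiKey'])  # k123
```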

