How to Scrape Company Emails from Websites with Proxies and Python

Finding decision-maker email addresses is the foundation of outbound B2B sales. While tools like Hunter.io and Apollo provide email databases, they often have incomplete coverage, especially for smaller companies or niche industries. Building your own email scraper gives you full control over data freshness, coverage, and cost.

This guide walks through building a production-ready email scraper using Python and mobile proxies that can extract email addresses from thousands of company websites per day.

Architecture Overview

A reliable email scraping system has five components, tied together in the sketch after this list:

  1. Input list — A CSV or database of target company domains.
  2. Crawler — A Python script that visits each domain’s contact pages, about pages, and footer sections.
  3. Proxy layer — Mobile proxies to avoid IP blocks when scraping at volume.
  4. Email extractor — Regex and heuristic patterns to find and validate email addresses.
  5. Output store — A database or CSV with extracted, validated emails linked to their source companies.
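
The sketch below shows how these pieces connect. `create_proxy_session`, `scrape_company_emails`, and `save_results` are built later in this guide; `read_domains` is a hypothetical CSV loader you would adapt to your own input format:

# Minimal end-to-end sketch of the five components.
# read_domains is a stand-in for your own input store (component 1).
import csv

def read_domains(path):
    """Load target domains from a one-column CSV."""
    with open(path, newline='') as f:
        return [row[0].strip() for row in csv.reader(f) if row]

def run_pipeline(input_csv, proxy_url, output_csv):
    session = create_proxy_session(proxy_url)                    # components 2 + 3
    results = [scrape_company_emails(d, session)                 # component 4
               for d in read_domains(input_csv)]
    save_results(results, output_csv)                            # component 5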

Setting Up the Project

Start with a clean Python environment and install dependencies:

pip install requests beautifulsoup4 lxml tldextract aiohttp

Proxy Configuration

Configure your mobile proxy as a reusable session:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_proxy_session(proxy_url):
    session = requests.Session()
    session.proxies = {
        "http": proxy_url,
        "https": proxy_url,
    }
    retry_strategy = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    })
    return session

proxy_url = "http://user:pass@gateway.dataresearchtools.com:5000"
session = create_proxy_session(proxy_url)
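
Before running a full crawl, it is worth confirming that traffic actually routes through the proxy. A quick sanity check, assuming the gateway above is live, is to hit an IP-echo service (ipify is just one example) and compare the result to your own IP:

# Sanity check: the IP returned should belong to the proxy, not your machine
resp = session.get("https://api.ipify.org?format=json", timeout=15)
print("Egress IP:", resp.json()["ip"])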

Building the Email Extractor

Core Email Pattern Matching

Email addresses follow predictable patterns, but a naive regex picks up false positives (image filenames, placeholder domains) and misses edge cases. Here is a more robust extraction function:

import re
from html import unescape

def extract_emails(html_content, domain=None):
    """Extract email addresses from HTML content"""
    # Decode HTML entities first
    text = unescape(html_content)

    # Remove JavaScript and CSS to avoid false positives
    text = re.sub(r'<script[^>]*>.*?</script>', '', text, flags=re.DOTALL)
    text = re.sub(r'<style[^>]*>.*?</style>', '', text, flags=re.DOTALL)

    # Primary email regex
    email_pattern = r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}'
    raw_emails = re.findall(email_pattern, text)

    # Clean and filter
    valid_emails = []
    blacklist_extensions = ['.png', '.jpg', '.gif', '.svg', '.css', '.js']
    blacklist_domains = ['example.com', 'email.com', 'test.com', 'yourcompany.com']

    for email in raw_emails:
        email = email.lower().strip('.')
        # Skip image files and example domains
        if any(email.endswith(ext) for ext in blacklist_extensions):
            continue
        if any(blocked in email for blocked in blacklist_domains):
            continue
        # Skip very long emails (likely false positives)
        if len(email) > 60:
            continue
        valid_emails.append(email)

    return list(set(valid_emails))
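
A quick check of the extractor on a small HTML snippet shows the filtering in action (the snippet and address are made up for illustration):

sample = '<footer>Reach us at <a href="mailto:Hello@Acme.com">Hello@Acme.com</a> or logo@2x.png</footer>'
print(extract_emails(sample))  # ['hello@acme.com'] — the .png match is filtered out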

Handling Obfuscated Emails

Many websites obfuscate email addresses to deter scrapers. Common patterns include "user [at] domain [dot] com" spellings, mailto: links whose visible text is obfuscated, and addresses assembled in JavaScript; the function below handles all three:

def extract_obfuscated_emails(html_content):
    """Handle common email obfuscation techniques"""
    emails = []

    # Pattern: "user [at] domain [dot] com"
    at_pattern = r'([a-zA-Z0-9._%+-]+)\s*[\[\(]?\s*at\s*[\]\)]?\s*([a-zA-Z0-9.-]+)\s*[\[\(]?\s*dot\s*[\]\)]?\s*([a-zA-Z]{2,})'
    for match in re.finditer(at_pattern, html_content, re.IGNORECASE):
        emails.append(f"{match.group(1)}@{match.group(2)}.{match.group(3)}")

    # Pattern: mailto: links (even if text shows obfuscated version)
    mailto_pattern = r'mailto:([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})'
    emails.extend(re.findall(mailto_pattern, html_content))

    # Pattern: JavaScript-constructed emails
    # e.g., var email = "user" + "@" + "domain.com"
    js_pattern = r'"([a-zA-Z0-9._%+-]+)"\s*\+\s*"@"\s*\+\s*"([a-zA-Z0-9.-]+\.[a-zA-Z]{2,})"'
    for match in re.finditer(js_pattern, html_content):
        emails.append(f"{match.group(1)}@{match.group(2)}")

    return list(set(emails))

Crawling Company Websites

Not every page on a company website contains email addresses. Focus on high-yield pages:

from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import tldextract
import time
import random

def find_contact_pages(base_url, session):
    """Find pages likely to contain email addresses"""
    contact_keywords = [
        'contact', 'about', 'team', 'staff', 'people',
        'support', 'help', 'reach', 'connect', 'get-in-touch',
        'impressum', 'imprint', 'legal'
    ]

    try:
        response = session.get(base_url, timeout=15)
        soup = BeautifulSoup(response.text, 'lxml')
    except Exception:
        return [base_url]

    contact_urls = [base_url]  # Always check homepage
    base_domain = tldextract.extract(base_url).registered_domain

    for link in soup.find_all('a', href=True):
        href = urljoin(base_url, link['href'])
        parsed = urlparse(href)

        # Stay on the same domain
        if tldextract.extract(href).registered_domain != base_domain:
            continue

        # Check if URL or link text matches contact keywords
        url_lower = href.lower()
        text_lower = link.get_text().lower()

        if any(kw in url_lower or kw in text_lower for kw in contact_keywords):
            contact_urls.append(href)

    return list(set(contact_urls))[:10]  # Limit to 10 pages per domain


def scrape_company_emails(domain, session):
    """Scrape emails from a single company domain"""
    base_url = f"https://{domain}"
    all_emails = []

    contact_pages = find_contact_pages(base_url, session)

    for url in contact_pages:
        try:
            time.sleep(random.uniform(1, 3))
            response = session.get(url, timeout=15)
            emails = extract_emails(response.text, domain)
            emails += extract_obfuscated_emails(response.text)
            all_emails.extend(emails)
        except Exception:
            continue

    # Keep emails whose address ends with the target domain (including subdomains)
    domain_emails = [e for e in all_emails
                     if e.endswith('@' + domain) or e.endswith('.' + domain)]
    other_emails = [e for e in all_emails if e not in domain_emails]

    return {
        "domain": domain,
        "domain_emails": list(set(domain_emails)),
        "other_emails": list(set(other_emails)),
    }
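
Wiring the crawler to the proxy session from earlier is then one call per domain. A small illustrative run (the domains are placeholders):

# Example run over a handful of domains using the session created earlier
domains = ["example-company.com", "another-company.io"]  # placeholder domains
results = [scrape_company_emails(d, session) for d in domains]
for r in results:
    print(r["domain"], "->", r["domain_emails"])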

Scaling with Async Requests

For processing thousands of domains, switch to asynchronous requests. This dramatically increases throughput while maintaining polite request rates. Understanding the fundamentals behind this approach is easier with a solid grasp of proxy concepts from our proxy glossary.

import aiohttp
import asyncio

async def async_scrape_domain(domain, proxy_url, semaphore):
    """Scrape a single domain with concurrency control"""
    async with semaphore:
        try:
            connector = aiohttp.TCPConnector(ssl=False)
            async with aiohttp.ClientSession(connector=connector) as client_session:
                url = f"https://{domain}"
                async with client_session.get(
                    url,
                    proxy=proxy_url,
                    timeout=aiohttp.ClientTimeout(total=15)
                ) as response:
                    html = await response.text()
                    emails = extract_emails(html, domain)
                    return {"domain": domain, "emails": emails}
        except Exception as e:
            return {"domain": domain, "emails": [], "error": str(e)}

async def batch_scrape(domains, proxy_url, max_concurrent=20):
    """Scrape multiple domains concurrently"""
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [async_scrape_domain(d, proxy_url, semaphore) for d in domains]
    results = await asyncio.gather(*tasks)
    return results
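
Kicking off a batch is then a single asyncio.run call. A minimal sketch, assuming the proxy URL from the setup section and placeholder domains:

# Example: scrape a list of domains concurrently
if __name__ == "__main__":
    domains = ["example-company.com", "another-company.io"]  # placeholder domains
    results = asyncio.run(batch_scrape(domains, proxy_url, max_concurrent=20))
    for r in results:
        print(r["domain"], len(r["emails"]), "emails found")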

Email Validation

Raw extracted emails need validation before they enter your outreach pipeline:

import dns.resolver
import dns.exception  # both provided by the dnspython package (pip install dnspython)

def validate_email_domain(email):
    """Check if the email domain has valid MX records"""
    domain = email.split('@')[1]
    try:
        mx_records = dns.resolver.resolve(domain, 'MX')
        return len(mx_records) > 0
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer, dns.exception.Timeout):
        return False

def classify_email(email):
    """Classify email as personal or generic"""
    generic_prefixes = [
        'info', 'contact', 'hello', 'support', 'sales',
        'admin', 'help', 'office', 'mail', 'enquiries',
        'team', 'general', 'service', 'billing'
    ]
    prefix = email.split('@')[0].lower()
    return "generic" if prefix in generic_prefixes else "personal"

Handling Anti-Scraping Measures

Company websites use various techniques to prevent email scraping:

Cloudflare and WAF Protection

Many company websites sit behind Cloudflare or similar WAFs. Mobile proxies help bypass IP-based blocks, but you may also need to handle JavaScript challenges:

# For Cloudflare-protected sites, use a browser automation fallback
# (requires: pip install playwright && playwright install chromium)
from playwright.sync_api import sync_playwright

def scrape_with_browser(url, proxy_config):
    with sync_playwright() as p:
        browser = p.chromium.launch(proxy={
            "server": proxy_config["server"],
            "username": proxy_config["username"],
            "password": proxy_config["password"],
        })
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        content = page.content()
        browser.close()
        return content
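
One way to decide when to invoke the browser path is to fall back only when the plain HTTP request appears blocked. The heuristic below (status codes plus a marker string in the body) is an assumption about how the block shows up, not a guaranteed Cloudflare detector; `proxy_config` follows the shape expected by `scrape_with_browser` above:

def fetch_page(url, session, proxy_config):
    """Try plain HTTP first; fall back to the browser on an apparent block."""
    try:
        response = session.get(url, timeout=15)
        # Heuristic block detection (assumption): blocked status or a challenge marker
        blocked = response.status_code in (403, 503) or "cf-challenge" in response.text.lower()
        if not blocked:
            return response.text
    except Exception:
        pass
    return scrape_with_browser(url, proxy_config)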

Rate Limiting Across Domains

Even though each request goes to a different company domain, every request exits through the same proxy IP, so you still need a global cap on request rate. Implement a simple global rate limiter:

import time

class RateLimiter:
    def __init__(self, requests_per_second=5):
        self.min_interval = 1.0 / requests_per_second
        self.last_request = 0

    def wait(self):
        elapsed = time.time() - self.last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request = time.time()
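
Dropping the limiter into the synchronous crawl loop keeps total request volume per second constant no matter how many domains are queued. For example:

# Example: cap the whole crawl at 5 requests per second
limiter = RateLimiter(requests_per_second=5)

def rate_limited_get(session, url):
    limiter.wait()
    return session.get(url, timeout=15)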

Output and Integration

Save results in a structured format that integrates with your CRM:

import csv

def save_results(results, output_file="leads_emails.csv"):
    with open(output_file, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['domain', 'email', 'type', 'source_url'])
        for result in results:
            for email in result.get('domain_emails', []):
                writer.writerow([
                    result['domain'],
                    email,
                    classify_email(email),
                    result.get('source_url', '')
                ])

Performance Benchmarks

With a properly configured mobile proxy and async scraping setup, you can expect:

  • 500-1,000 domains per hour with basic HTTP scraping
  • 100-200 domains per hour with browser-based scraping (for JS-heavy sites)
  • 60-70% email discovery rate across random business domains
  • 85-90% email discovery rate for domains with public contact pages

Legal and Ethical Guidelines

Email scraping operates in a legal gray area. Follow these practices:

  • Only scrape publicly displayed email addresses from company websites.
  • Never scrape personal social media profiles for email addresses.
  • Comply with CAN-SPAM, GDPR, and local data protection regulations.
  • Provide opt-out mechanisms in all outbound communications.
  • Store scraped data securely and delete it when no longer needed.

For teams conducting web scraping at scale, establishing clear data governance policies is just as important as the technical infrastructure.

Conclusion

Building a custom email scraper with Python and mobile proxies gives you a significant advantage over teams relying solely on third-party databases. You control the freshness of your data, can target niche industries that commercial tools underserve, and dramatically reduce your cost per lead. Start with the code patterns in this guide, validate your results against a known-good sample, and scale gradually as you refine your extraction logic.
