How to Scrape Company Emails from Websites with Proxies and Python
Finding decision-maker email addresses is the foundation of outbound B2B sales. While tools like Hunter.io and Apollo provide email databases, they often have incomplete coverage, especially for smaller companies or niche industries. Building your own email scraper gives you full control over data freshness, coverage, and cost.
This guide walks through building a production-ready email scraper using Python and mobile proxies that can extract email addresses from thousands of company websites per day.
Architecture Overview
A reliable email scraping system has five components:
- Input list — A CSV or database of target company domains.
- Crawler — A Python script that visits each domain’s contact pages, about pages, and footer sections.
- Proxy layer — Mobile proxies to avoid IP blocks when scraping at volume.
- Email extractor — Regex and heuristic patterns to find and validate email addresses.
- Output store — A database or CSV with extracted, validated emails linked to their source companies.
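Before building each piece, it helps to see how the five components chain together. The sketch below is illustrative only: `load_domains`, `run_pipeline`, and the stub scraper are placeholder names, with the real crawler, extractor, and proxy layers built in the rest of this guide.

```python
import csv
import io

# 1. Input list: target domains in CSV form (an in-memory sample here)
SAMPLE_CSV = "domain\nacme-widgets.example\nglobex.example\n"

def load_domains(csv_text):
    """Read target domains from a CSV with a 'domain' column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["domain"].strip() for row in reader if row.get("domain")]

def run_pipeline(csv_text, scrape_fn):
    """Crawl each domain and collect results; scrape_fn stands in for
    the crawler + extractor + proxy layers built later in this guide."""
    return [scrape_fn(domain) for domain in load_domains(csv_text)]

# Stub scraper so the skeleton runs without a network or proxy
results = run_pipeline(SAMPLE_CSV, lambda d: {"domain": d, "emails": [f"info@{d}"]})
```

Swapping the stub for a real scraping function is the only change needed once the later sections are in place.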
Setting Up the Project
Start with a clean Python environment and install dependencies:
```bash
pip install requests beautifulsoup4 lxml tldextract aiohttp dnspython playwright
```

Proxy Configuration
Configure your mobile proxy as a reusable session:
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_proxy_session(proxy_url):
    """Build a requests session that routes all traffic through the proxy
    and retries transient failures with exponential backoff."""
    session = requests.Session()
    session.proxies = {
        "http": proxy_url,
        "https": proxy_url,
    }
    retry_strategy = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    })
    return session

proxy_url = "http://user:pass@gateway.dataresearchtools.com:5000"
session = create_proxy_session(proxy_url)
```

Building the Email Extractor
Core Email Pattern Matching
Email addresses follow predictable patterns, but naive regex misses many edge cases. Here is a robust extraction function:
```python
import re
from html import unescape

def extract_emails(html_content, domain=None):
    """Extract email addresses from HTML content."""
    # Decode HTML entities first
    text = unescape(html_content)
    # Remove JavaScript and CSS to avoid false positives
    text = re.sub(r'<script[^>]*>.*?</script>', '', text, flags=re.DOTALL)
    text = re.sub(r'<style[^>]*>.*?</style>', '', text, flags=re.DOTALL)
    # Primary email regex
    email_pattern = r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}'
    raw_emails = re.findall(email_pattern, text)
    # Clean and filter
    valid_emails = []
    blacklist_extensions = ['.png', '.jpg', '.gif', '.svg', '.css', '.js']
    blacklist_domains = ['example.com', 'email.com', 'test.com', 'yourcompany.com']
    for email in raw_emails:
        email = email.lower().strip('.')
        # Skip image files and example domains
        if any(email.endswith(ext) for ext in blacklist_extensions):
            continue
        if any(bad in email for bad in blacklist_domains):
            continue
        # Skip very long emails (likely false positives)
        if len(email) > 60:
            continue
        valid_emails.append(email)
    return list(set(valid_emails))
```

Handling Obfuscated Emails
Many websites obfuscate email addresses to deter scraping. Common techniques include spelling out the address ("user [at] domain [dot] com"), hiding it behind a mailto: link, and assembling it with JavaScript string concatenation. The function below handles all three:
```python
def extract_obfuscated_emails(html_content):
    """Handle common email obfuscation techniques."""
    emails = []
    # Pattern: "user [at] domain [dot] com"
    at_pattern = r'([a-zA-Z0-9._%+-]+)\s*[\[\(]?\s*at\s*[\]\)]?\s*([a-zA-Z0-9.-]+)\s*[\[\(]?\s*dot\s*[\]\)]?\s*([a-zA-Z]{2,})'
    for match in re.finditer(at_pattern, html_content, re.IGNORECASE):
        emails.append(f"{match.group(1)}@{match.group(2)}.{match.group(3)}")
    # Pattern: mailto: links (even if the visible text is obfuscated)
    mailto_pattern = r'mailto:([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})'
    emails.extend(re.findall(mailto_pattern, html_content))
    # Pattern: JavaScript-constructed emails,
    # e.g. var email = "user" + "@" + "domain.com"
    js_pattern = r'"([a-zA-Z0-9._%+-]+)"\s*\+\s*"@"\s*\+\s*"([a-zA-Z0-9.-]+\.[a-zA-Z]{2,})"'
    for match in re.finditer(js_pattern, html_content):
        emails.append(f"{match.group(1)}@{match.group(2)}")
    return list(set(emails))
```

Crawling Company Websites
Not every page on a company website contains email addresses. Focus on high-yield pages:
```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import tldextract
import time
import random

def find_contact_pages(base_url, session):
    """Find pages likely to contain email addresses."""
    contact_keywords = [
        'contact', 'about', 'team', 'staff', 'people',
        'support', 'help', 'reach', 'connect', 'get-in-touch',
        'impressum', 'imprint', 'legal'
    ]
    try:
        response = session.get(base_url, timeout=15)
        soup = BeautifulSoup(response.text, 'lxml')
    except Exception:
        return [base_url]
    contact_urls = [base_url]  # Always check the homepage
    base_domain = tldextract.extract(base_url).registered_domain
    for link in soup.find_all('a', href=True):
        href = urljoin(base_url, link['href'])
        # Stay on the same domain
        if tldextract.extract(href).registered_domain != base_domain:
            continue
        # Check if the URL or link text matches a contact keyword
        url_lower = href.lower()
        text_lower = link.get_text().lower()
        if any(kw in url_lower or kw in text_lower for kw in contact_keywords):
            contact_urls.append(href)
    return list(set(contact_urls))[:10]  # Limit to 10 pages per domain
```
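The keyword matching at the heart of `find_contact_pages` can be pulled out and sanity-checked offline; the helper name below is illustrative:

```python
CONTACT_KEYWORDS = [
    'contact', 'about', 'team', 'staff', 'people',
    'support', 'help', 'reach', 'connect', 'get-in-touch',
    'impressum', 'imprint', 'legal'
]

def looks_like_contact_link(url, link_text=""):
    """True if either the URL or the anchor text contains a contact keyword."""
    url_lower, text_lower = url.lower(), link_text.lower()
    return any(kw in url_lower or kw in text_lower for kw in CONTACT_KEYWORDS)

looks_like_contact_link("https://acme.example/contact-us")        # True
looks_like_contact_link("https://acme.example/p/42", "Our Team")  # True (matches anchor text)
looks_like_contact_link("https://acme.example/blog/post-1")       # False
```

Checking the anchor text as well as the URL matters for sites that use opaque URLs like `/p/42` but descriptive link labels.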
```python
def scrape_company_emails(domain, session):
    """Scrape emails from a single company domain."""
    base_url = f"https://{domain}"
    all_emails = []
    contact_pages = find_contact_pages(base_url, session)
    for url in contact_pages:
        try:
            time.sleep(random.uniform(1, 3))  # Polite jitter between pages
            response = session.get(url, timeout=15)
            emails = extract_emails(response.text, domain)
            emails += extract_obfuscated_emails(response.text)
            all_emails.extend(emails)
        except Exception:
            continue
    # Separate emails on the target domain from third-party addresses
    domain_emails = [e for e in all_emails if domain in e]
    other_emails = [e for e in all_emails if domain not in e]
    return {
        "domain": domain,
        "domain_emails": list(set(domain_emails)),
        "other_emails": list(set(other_emails)),
    }
```

Scaling with Async Requests
For processing thousands of domains, switch to asynchronous requests. This dramatically increases throughput while maintaining polite request rates. Understanding the fundamentals behind this approach is easier with a solid grasp of proxy concepts from our proxy glossary.
```python
import aiohttp
import asyncio

async def async_scrape_domain(domain, proxy_url, semaphore):
    """Scrape a single domain with concurrency control"""
    async with semaphore:
        try:
            # ssl=False skips certificate verification; common for
            # read-only scraping, but be aware of the trade-off
            connector = aiohttp.TCPConnector(ssl=False)
            async with aiohttp.ClientSession(connector=connector) as client_session:
                url = f"https://{domain}"
                async with client_session.get(
                    url,
                    proxy=proxy_url,
                    timeout=aiohttp.ClientTimeout(total=15)
                ) as response:
                    html = await response.text()
                    emails = extract_emails(html, domain)
                    return {"domain": domain, "emails": emails}
        except Exception as e:
            return {"domain": domain, "emails": [], "error": str(e)}
```
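The `asyncio.Semaphore` pattern that caps concurrency is easiest to see in isolation. This stdlib-only sketch runs four workers but never lets more than two hold the semaphore at once:

```python
import asyncio

async def worker(name, sem, log):
    async with sem:  # At most two workers are inside this block at once
        log.append(("start", name))
        await asyncio.sleep(0)  # Yield control, as a real request would
        log.append(("end", name))

async def demo():
    sem = asyncio.Semaphore(2)
    log = []
    await asyncio.gather(*(worker(i, sem, log) for i in range(4)))
    return log

log = asyncio.run(demo())
# Every worker ran, but at most two were ever active simultaneously
```

This is exactly how `async_scrape_domain` above stays polite: no matter how many tasks `asyncio.gather` schedules, only `max_concurrent` of them issue requests at any moment.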
```python
async def batch_scrape(domains, proxy_url, max_concurrent=20):
    """Scrape multiple domains concurrently"""
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [async_scrape_domain(d, proxy_url, semaphore) for d in domains]
    return await asyncio.gather(*tasks)

# Usage: results = asyncio.run(batch_scrape(domains, proxy_url))
```

Email Validation
Raw extracted emails need validation before they enter your outreach pipeline:
```python
import dns.resolver
import dns.exception

def validate_email_domain(email):
    """Check if the email domain has valid MX records"""
    domain = email.split('@')[1]
    try:
        mx_records = dns.resolver.resolve(domain, 'MX')
        return len(mx_records) > 0
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer,
            dns.resolver.NoNameservers, dns.exception.Timeout):
        return False

def classify_email(email):
    """Classify an email as personal or generic"""
    generic_prefixes = [
        'info', 'contact', 'hello', 'support', 'sales',
        'admin', 'help', 'office', 'mail', 'enquiries',
        'team', 'general', 'service', 'billing'
    ]
    prefix = email.split('@')[0].lower()
    return "generic" if prefix in generic_prefixes else "personal"
```

Handling Anti-Scraping Measures
Company websites use various techniques to prevent email scraping:
Cloudflare and WAF Protection
Many company websites sit behind Cloudflare or similar WAFs. Mobile proxies help bypass IP-based blocks, but you may also need to handle JavaScript challenges:
```python
# For Cloudflare-protected sites, fall back to browser automation
# (requires: pip install playwright && playwright install chromium)
from playwright.sync_api import sync_playwright

def scrape_with_browser(url, proxy_config):
    with sync_playwright() as p:
        browser = p.chromium.launch(proxy={
            "server": proxy_config["server"],
            "username": proxy_config["username"],
            "password": proxy_config["password"],
        })
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        content = page.content()
        browser.close()
        return content
```

Rate Limiting Across Domains
Even though each request goes to a different company domain, your proxy IP still needs protection. Implement a global rate limiter:
```python
import time

class RateLimiter:
    """Enforce a global minimum interval between outgoing requests."""

    def __init__(self, requests_per_second=5):
        self.min_interval = 1.0 / requests_per_second
        self.last_request = 0.0

    def wait(self):
        """Sleep just long enough to respect the configured rate."""
        elapsed = time.time() - self.last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request = time.time()

# Usage: call limiter.wait() immediately before every session.get()
```

Output and Integration
Save results in a structured format that integrates with your CRM:
```python
import csv

def save_results(results, output_file="leads_emails.csv"):
    with open(output_file, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['domain', 'email', 'type', 'source_url'])
        for result in results:
            for email in result.get('domain_emails', []):
                writer.writerow([
                    result['domain'],
                    email,
                    classify_email(email),
                    # Optional: present only if your crawler records source URLs
                    result.get('source_url', '')
                ])
```

Performance Benchmarks
With a properly configured mobile proxy and async scraping setup, you can expect:
- 500-1,000 domains per hour with basic HTTP scraping
- 100-200 domains per hour with browser-based scraping (for JS-heavy sites)
- 60-70% email discovery rate across random business domains
- 85-90% email discovery rate for domains with public contact pages
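As a back-of-envelope planning check, these rates translate into wall-clock estimates like the following (using the midpoints of the ranges above; your numbers will vary with proxy quality and site mix):

```python
def estimate_hours(num_domains, domains_per_hour):
    """Rough wall-clock time to work through a domain list."""
    return num_domains / domains_per_hour

# 10,000 domains at ~750/hour (basic HTTP midpoint) vs ~150/hour (browser midpoint)
print(round(estimate_hours(10_000, 750), 1))  # 13.3
print(round(estimate_hours(10_000, 150), 1))  # 66.7
```

Numbers like these are useful for deciding up front how to split a list between the fast HTTP path and the slower browser fallback.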
Legal and Ethical Guidelines
Email scraping operates in a legal gray area. Follow these practices:
- Only scrape publicly displayed email addresses from company websites.
- Never scrape personal social media profiles for email addresses.
- Comply with CAN-SPAM, GDPR, and local data protection regulations.
- Provide opt-out mechanisms in all outbound communications.
- Store scraped data securely and delete it when no longer needed.
For teams conducting web scraping at scale, establishing clear data governance policies is just as important as the technical infrastructure.
Conclusion
Building a custom email scraper with Python and mobile proxies gives you a significant advantage over teams relying solely on third-party databases. You control the freshness of your data, can target niche industries that commercial tools underserve, and dramatically reduce your cost per lead. Start with the code patterns in this guide, validate your results against a known-good sample, and scale gradually as you refine your extraction logic.