How to Scrape Yelp Business Data for Local Lead Generation
Yelp hosts detailed listings for millions of local businesses across the United States, Canada, and other markets. For B2B companies selling to local businesses — marketing agencies, SaaS tools, payment processors, POS systems — Yelp is a goldmine of qualified lead data. Each listing contains the business name, phone number, address, website, operating hours, review count, rating, and business category.
This guide walks through the complete process of scraping Yelp for local lead generation using mobile proxies and Python.
What Makes Yelp Valuable for B2B Leads
Yelp data is uniquely useful for lead qualification because reviews provide signals that other directories lack:
- Review count indicates business maturity and customer volume
- Rating trends reveal businesses that may need marketing help (declining ratings)
- Response activity shows which business owners are actively engaged online
- Category specificity enables precise targeting (e.g., “Italian restaurants” vs. just “restaurants”)
- Photos and menu data indicate the business’s investment in online presence
A dental marketing agency, for example, can scrape all dental offices in a metro area, filter by those with fewer than 20 reviews (indicating weak online presence), and build a targeted outreach list.
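As a minimal sketch of that filter (assuming each scraped record is a dictionary with a review_count field, which is the shape the scraper built later in this guide produces):

def low_presence_leads(businesses, max_reviews=20):
    """Keep businesses whose small review footprint suggests a weak online presence."""
    return [b for b in businesses if b.get("review_count", 0) < max_reviews]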
Yelp’s Anti-Scraping Measures
Yelp actively combats scraping through multiple layers:
- IP-based rate limiting — Sending too many requests from one IP triggers blocks within minutes.
- JavaScript rendering — Critical page content loads via JavaScript, not static HTML.
- CAPTCHA challenges — Automated traffic triggers Yelp’s CAPTCHA system.
- Legal enforcement — Yelp has filed lawsuits against scraping operations (always review their Terms of Service).
- Honeypot traps — Hidden links and CSS-obfuscated elements designed to catch scrapers.
Mobile proxies address the IP-based detection, while browser automation handles JavaScript rendering. Always ensure your scraping activities comply with applicable laws and regulations.
Setting Up the Scraper
Dependencies
pip install playwright beautifulsoup4 lxml
playwright install chromium
Proxy Configuration
PROXY_CONFIG = {
    "server": "http://gateway.dataresearchtools.com:5000",
    "username": "your_user",
    "password": "your_pass",
}
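Before launching the full scraper, it can help to confirm that traffic actually exits through the proxy. The sketch below is one way to do that with Playwright, assuming the gateway accepts the username/password credentials shown above; httpbin.org/ip is used here only as a neutral endpoint that echoes the caller's IP.

# Optional sanity check: open a proxied browser and print the exit IP.
import asyncio
from playwright.async_api import async_playwright

async def check_proxy(proxy_config):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True, proxy=proxy_config)
        page = await browser.new_page()
        await page.goto("https://httpbin.org/ip", timeout=30000)
        print("Exit IP response:", await page.inner_text("body"))
        await browser.close()

asyncio.run(check_proxy(PROXY_CONFIG))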
Core Scraping Logic
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup
import asyncio
import random
import json
import re
async def scrape_yelp_search(category, location, proxy_config, max_pages=10):
    """Scrape Yelp search results for a category and location"""
    businesses = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False,
            proxy=proxy_config,
        )
        context = await browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
            locale="en-US",
        )
        page = await context.new_page()
        for page_num in range(max_pages):
            start = page_num * 10
            url = f"https://www.yelp.com/search?find_desc={category}&find_loc={location}&start={start}"
            await page.goto(url, wait_until="networkidle")
            await page.wait_for_timeout(random.randint(3000, 7000))
            # Check for CAPTCHA
            if await page.query_selector('[class*="captcha"]'):
                print(f"CAPTCHA detected at page {page_num}")
                break
            # Parse results
            html = await page.content()
            page_businesses = parse_yelp_results(html)
            if not page_businesses:
                break
            businesses.extend(page_businesses)
            print(f"Page {page_num + 1}: Found {len(page_businesses)} businesses")
            # Human-like delay between pages
            await page.wait_for_timeout(random.randint(5000, 12000))
        await browser.close()
    return businesses
def parse_yelp_results(html):
    """Parse business listings from Yelp search results HTML"""
    soup = BeautifulSoup(html, 'lxml')
    businesses = []
    # Yelp uses dynamic class names, so we look for structural patterns
    results = soup.find_all('div', {'data-testid': re.compile(r'serp-ia-card')})
    if not results:
        # Fallback to alternative selectors
        results = soup.select('[class*="businessName"]')
    for result in results:
        business = {}
        # Business name
        name_link = result.find('a', href=re.compile(r'/biz/'))
        if name_link:
            business['name'] = name_link.get_text(strip=True)
            business['yelp_url'] = f"https://www.yelp.com{name_link['href']}"
        # Rating
        rating_el = result.find('div', {'aria-label': re.compile(r'star rating')})
        if rating_el:
            label = rating_el.get('aria-label', '')
            match = re.search(r'([\d.]+)', label)
            if match:
                business['rating'] = float(match.group(1))
        # Review count
        review_el = result.find('span', string=re.compile(r'\d+ review'))
        if review_el:
            match = re.search(r'(\d+)', review_el.get_text())
            if match:
                business['review_count'] = int(match.group(1))
        # Category
        category_els = result.find_all('a', href=re.compile(r'/search\?cflt='))
        business['categories'] = [c.get_text(strip=True) for c in category_els]
        # Address
        address_el = result.find('address')
        if address_el:
            business['address'] = address_el.get_text(strip=True)
        # Phone (sometimes visible in search results)
        phone_el = result.find('p', string=re.compile(r'\(\d{3}\)'))
        if phone_el:
            business['phone'] = phone_el.get_text(strip=True)
        if business.get('name'):
            businesses.append(business)
    return businesses
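A minimal way to run the search scraper end to end is sketched below. The category and location values are only examples, and because they are interpolated straight into the URL, they should be URL-encoded first:

from urllib.parse import quote_plus

if __name__ == "__main__":
    results = asyncio.run(
        scrape_yelp_search(
            category=quote_plus("Dentists"),
            location=quote_plus("Austin, TX"),
            proxy_config=PROXY_CONFIG,
            max_pages=5,
        )
    )
    print(f"Collected {len(results)} businesses")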
Extracting Detailed Business Pages
Search results provide basic information. For complete lead data, visit individual business pages:
async def scrape_business_detail(page, yelp_url):
    """Scrape detailed information from a Yelp business page"""
    await page.goto(yelp_url, wait_until="networkidle")
    await page.wait_for_timeout(random.randint(3000, 6000))
    html = await page.content()
    soup = BeautifulSoup(html, 'lxml')
    details = {}
    # Phone number (reliably found on detail pages)
    phone_link = soup.find('a', href=re.compile(r'tel:'))
    if phone_link:
        details['phone'] = phone_link.get_text(strip=True)
    # Website
    website_link = soup.find('a', href=re.compile(r'biz_redir'))
    if website_link:
        details['website'] = website_link.get('href', '')
    # Full address
    address_el = soup.find('address')
    if address_el:
        details['full_address'] = address_el.get_text(strip=True)
    # Hours of operation
    hours_table = soup.find('table', {'class': re.compile(r'hour')})
    if hours_table:
        hours = {}
        rows = hours_table.find_all('tr')
        for row in rows:
            cells = row.find_all(['th', 'td'])
            if len(cells) >= 2:
                day = cells[0].get_text(strip=True)
                time_range = cells[1].get_text(strip=True)
                hours[day] = time_range
        details['hours'] = hours
    # Price range
    price_el = soup.find('span', string=re.compile(r'^[\$]{1,4}$'))
    if price_el:
        details['price_range'] = price_el.get_text(strip=True)
    # Owner response rate (indicates engaged business owner)
    response_el = soup.find(string=re.compile(r'Business owner.*responded'))
    if response_el:
        details['owner_responsive'] = True
    return details
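One way to tie the two steps together is to loop over the search results and enrich each record with its detail page, reusing a single browser session. This sketch assumes the imports and PROXY_CONFIG defined earlier; the limit value is only an illustration.

async def enrich_leads(businesses, proxy_config, limit=25):
    """Visit each business's Yelp page and merge detail fields into the search record."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False, proxy=proxy_config)
        page = await browser.new_page()
        for business in businesses[:limit]:
            if business.get("yelp_url"):
                details = await scrape_business_detail(page, business["yelp_url"])
                business.update(details)
                # Human-like pause between detail pages
                await page.wait_for_timeout(random.randint(4000, 9000))
        await browser.close()
    return businesses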
Lead Qualification Filters
Not every Yelp business is a good lead. Apply filters based on your target customer profile:
def qualify_yelp_lead(business, criteria):
    """Score and qualify a Yelp business as a lead"""
    score = 0
    reasons = []
    # Review count indicates business volume
    reviews = business.get('review_count', 0)
    if criteria.get('min_reviews') and reviews < criteria['min_reviews']:
        return None  # Too small
    if criteria.get('max_reviews') and reviews > criteria['max_reviews']:
        return None  # Too established (less likely to need services)
    if reviews < 20:
        score += 30  # Low online presence = opportunity
        reasons.append("Low review count - needs marketing help")
    elif reviews < 50:
        score += 20
        reasons.append("Moderate review count - growth potential")
    # Rating quality
    rating = business.get('rating', 0)
    if 3.0 <= rating <= 4.0:
        score += 20
        reasons.append("Mid-range rating - reputation management opportunity")
    # Website presence
    if not business.get('website'):
        score += 25
        reasons.append("No website listed - web design opportunity")
    else:
        score += 10
    # Phone number (required for outreach)
    if business.get('phone'):
        score += 15
    # Category match
    if criteria.get('categories'):
        if any(cat in business.get('categories', []) for cat in criteria['categories']):
            score += 10
    business['lead_score'] = score
    business['qualification_reasons'] = reasons
    return business if score >= criteria.get('min_score', 40) else None
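As an illustration, here is how the filter might be applied to the records gathered earlier. The criteria values below are hypothetical and should be tuned to your own offer:

criteria = {
    "min_reviews": 3,       # skip listings with almost no activity
    "max_reviews": 150,     # skip businesses that are already well established
    "categories": ["Dentists", "Cosmetic Dentists"],
    "min_score": 40,
}
qualified = [lead for lead in (qualify_yelp_lead(b, criteria) for b in businesses) if lead]
print(f"{len(qualified)} of {len(businesses)} businesses qualified")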
Handling Yelp’s Dynamic Content
Yelp frequently updates its frontend, which breaks selectors. Build resilience into your scraper. For foundational proxy concepts, review our proxy glossary.
Selector Fallback Strategy
class ResilientSelector:
    """Try multiple selectors in order of preference"""
    SELECTORS = {
        "business_name": [
            '[data-testid="serp-ia-card"] a[href*="/biz/"]',
            '.businessName a',
            'h3 a[href*="/biz/"]',
            'a[class*="business-name"]',
        ],
        "phone": [
            'a[href^="tel:"]',
            '[class*="phone"]',
            'p:has-text("(")',
        ],
        "rating": [
            '[aria-label*="star rating"]',
            '[class*="rating"]',
            'div[role="img"][aria-label*="star"]',
        ],
    }

    @staticmethod
    async def find(page, element_type):
        """Try selectors until one works"""
        for selector in ResilientSelector.SELECTORS.get(element_type, []):
            try:
                element = await page.query_selector(selector)
                if element:
                    return element
            except Exception:
                continue
        return None
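Used inside any of the async scraping functions above, the lookup might look like this (the rating read-out is only an example):

# Inside an async scraping function, after page.goto(...):
rating_el = await ResilientSelector.find(page, "rating")
if rating_el:
    label = await rating_el.get_attribute("aria-label")
    print("Rating label:", label)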
JSON-LD Extraction
Many Yelp pages include structured data in JSON-LD format, which is more stable than HTML selectors:
def extract_jsonld(html):
    """Extract structured data from JSON-LD scripts"""
    soup = BeautifulSoup(html, 'lxml')
    scripts = soup.find_all('script', type='application/ld+json')
    for script in scripts:
        try:
            data = json.loads(script.string)
            if isinstance(data, dict) and data.get('@type') == 'LocalBusiness':
                return {
                    'name': data.get('name'),
                    'phone': data.get('telephone'),
                    'address': data.get('address', {}).get('streetAddress'),
                    'city': data.get('address', {}).get('addressLocality'),
                    'state': data.get('address', {}).get('addressRegion'),
                    'zip': data.get('address', {}).get('postalCode'),
                    'rating': data.get('aggregateRating', {}).get('ratingValue'),
                    'review_count': data.get('aggregateRating', {}).get('reviewCount'),
                    'price_range': data.get('priceRange'),
                }
        except (json.JSONDecodeError, TypeError):
            # TypeError covers empty <script> tags where .string is None
            continue
    return None
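One way to combine the two approaches is to treat JSON-LD as the preferred source and the selector-based parser as the fallback. The helper below is a hypothetical sketch built on the functions defined above, not part of Yelp's markup or any library:

async def scrape_business_combined(page, yelp_url):
    """Prefer JSON-LD fields for a business page, keeping HTML-parsed fields as a fallback."""
    details = await scrape_business_detail(page, yelp_url)  # navigates and parses the HTML
    structured = extract_jsonld(await page.content())
    if structured:
        # JSON-LD values win where present; selector-based values fill the gaps
        details.update({k: v for k, v in structured.items() if v is not None})
    return details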
Scaling Across Markets
For national campaigns, run the same scrape across multiple metropolitan areas, pairing each city with a geo-matched proxy:
METRO_AREAS = [
    {"name": "New York, NY", "proxy_geo": "US-NY"},
    {"name": "Los Angeles, CA", "proxy_geo": "US-CA"},
    {"name": "Chicago, IL", "proxy_geo": "US-IL"},
    {"name": "Houston, TX", "proxy_geo": "US-TX"},
    {"name": "Phoenix, AZ", "proxy_geo": "US-AZ"},
    # Add more cities as needed
]

async def national_yelp_scrape(category, proxy_pool):
    """Scrape a business category across multiple metro areas"""
    all_leads = []
    for metro in METRO_AREAS:
        proxy = proxy_pool.get_proxy(geo=metro["proxy_geo"])
        leads = await scrape_yelp_search(
            category=category,
            location=metro["name"],
            proxy_config=proxy,
            max_pages=10
        )
        # Add metro context
        for lead in leads:
            lead['metro_area'] = metro["name"]
        all_leads.extend(leads)
        print(f"{metro['name']}: {len(leads)} businesses found")
        # Rest between cities
        await asyncio.sleep(random.uniform(30, 60))
    return all_leads
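The national_yelp_scrape function assumes a proxy_pool object that hands out geo-targeted proxy configs. That interface is provider-specific; the sketch below shows one minimal shape it could take, using a username suffix for geo targeting, which is a common but not universal convention, so adapt it to whatever your gateway actually expects.

class SimpleProxyPool:
    """Minimal stand-in for a provider-backed proxy pool (hypothetical interface)."""

    def __init__(self, server, username, password):
        self.server = server
        self.username = username
        self.password = password

    def get_proxy(self, geo=None):
        # Many gateways encode geo targeting in the username; check your provider's docs.
        username = f"{self.username}-geo-{geo}" if geo else self.username
        return {"server": self.server, "username": username, "password": self.password}

proxy_pool = SimpleProxyPool(
    server="http://gateway.dataresearchtools.com:5000",
    username="your_user",
    password="your_pass",
)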
Data Export and CRM Integration
Format your Yelp leads for import into CRM or outreach tools. For teams also pulling data from ecommerce platforms, maintaining consistent data formats across all sources simplifies pipeline management.
import csv

def export_qualified_leads(leads, output_file="yelp_leads.csv"):
    """Export qualified leads to CSV for CRM import"""
    fieldnames = [
        'name', 'phone', 'website', 'address', 'city', 'state',
        'categories', 'rating', 'review_count', 'price_range',
        'lead_score', 'qualification_reasons', 'yelp_url', 'metro_area'
    ]
    with open(output_file, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction='ignore')
        writer.writeheader()
        for lead in sorted(leads, key=lambda x: x.get('lead_score', 0), reverse=True):
            lead['categories'] = '; '.join(lead.get('categories', []))
            lead['qualification_reasons'] = '; '.join(lead.get('qualification_reasons', []))
            writer.writerow(lead)
    print(f"Exported {len(leads)} leads to {output_file}")
Performance and Expectations
| Metric | Expected Range |
|---|---|
| Search pages per hour | 30-60 |
| Detail pages per hour | 60-120 |
| Businesses per metro (10 pages) | 80-100 |
| Data completeness (name + phone) | 90%+ |
| CAPTCHA frequency | Every 100-200 pages |
| Proxy block rate (mobile) | Under 5% |
Conclusion
Yelp scraping with mobile proxies provides a reliable pipeline of local business leads that includes qualification signals (reviews, ratings, online presence) not available from other directories. The key to sustainable scraping is conservative pacing, robust selector strategies, and structured data extraction via JSON-LD when available. Start with a single category in your strongest market, validate lead quality through outreach conversion rates, and then expand systematically.
Related Reading
- How to Build an Automated Lead Scraping Pipeline with Proxies
- Building a B2B Contact Enrichment Pipeline with Mobile Proxies
- How to Scrape Job Listings at Scale with Rotating Proxies
- Proxies for HR Tech: Salary Benchmarking & Talent Intelligence
- aiohttp + BeautifulSoup: Async Python Scraping
- How to Scrape AliExpress Product Data Without Getting Blocked