How to Scrape Yelp Business Data for Local Lead Generation

Yelp hosts detailed listings for millions of local businesses across the United States, Canada, and other markets. For B2B companies selling to local businesses — marketing agencies, SaaS tools, payment processors, POS systems — Yelp is a goldmine of qualified lead data. Each listing contains the business name, phone number, address, website, operating hours, review count, rating, and business category.

This guide walks through the complete process of scraping Yelp for local lead generation using mobile proxies and Python.

What Makes Yelp Valuable for B2B Leads

Yelp data is uniquely useful for lead qualification because reviews provide signals that other directories lack:

  • Review count indicates business maturity and customer volume
  • Rating trends reveal businesses that may need marketing help (declining ratings)
  • Response activity shows which business owners are actively engaged online
  • Category specificity enables precise targeting (e.g., “Italian restaurants” vs. just “restaurants”)
  • Photos and menu data indicate the business’s investment in online presence

A dental marketing agency, for example, can scrape all dental offices in a metro area, filter by those with fewer than 20 reviews (indicating weak online presence), and build a targeted outreach list.

Yelp’s Anti-Scraping Measures

Yelp actively combats scraping through multiple layers:

  1. IP-based rate limiting — Too many requests from one IP trigger blocks within minutes.
  2. JavaScript rendering — Critical page content loads via JavaScript, not static HTML.
  3. CAPTCHA challenges — Automated traffic triggers Yelp’s CAPTCHA system.
  4. Legal enforcement — Yelp has filed lawsuits against scraping operations (always review their Terms of Service).
  5. Honeypot traps — Hidden links and CSS-obfuscated elements designed to catch scrapers.

Mobile proxies address the IP-based detection, while browser automation handles JavaScript rendering. Always ensure your scraping activities comply with applicable laws and regulations.

Setting Up the Scraper

Dependencies

pip install playwright beautifulsoup4 lxml
playwright install chromium

Proxy Configuration

PROXY_CONFIG = {
    "server": "http://gateway.dataresearchtools.com:5000",
    "username": "your_user",
    "password": "your_pass",
}
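
Providers often issue proxies as a single URL string rather than separate fields. A small helper (a sketch; the gateway hostname is the placeholder from the config above) converts that string into the dict shape Playwright's proxy option expects:

```python
from urllib.parse import urlparse

def proxy_url_to_config(proxy_url):
    """Convert a proxy URL like http://user:pass@host:port into
    the dict format Playwright's `proxy` launch option expects."""
    parsed = urlparse(proxy_url)
    config = {"server": f"{parsed.scheme}://{parsed.hostname}:{parsed.port}"}
    if parsed.username:
        config["username"] = parsed.username
    if parsed.password:
        config["password"] = parsed.password
    return config

# e.g. proxy_url_to_config("http://your_user:your_pass@gateway.dataresearchtools.com:5000")
```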

Core Scraping Logic

from playwright.async_api import async_playwright
from bs4 import BeautifulSoup
import asyncio
import random
import json
import re

async def scrape_yelp_search(category, location, proxy_config, max_pages=10):
    """Scrape Yelp search results for a category and location"""
    businesses = []

    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False,  # headless browsers are easier to fingerprint
            proxy=proxy_config,
        )
        context = await browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
            locale="en-US",
        )
        page = await context.new_page()

        for page_num in range(max_pages):
            start = page_num * 10
            url = f"https://www.yelp.com/search?find_desc={category}&find_loc={location}&start={start}"

            await page.goto(url, wait_until="networkidle")
            await page.wait_for_timeout(random.randint(3000, 7000))

            # Check for CAPTCHA
            if await page.query_selector('[class*="captcha"]'):
                print(f"CAPTCHA detected at page {page_num}")
                break

            # Parse results
            html = await page.content()
            page_businesses = parse_yelp_results(html)

            if not page_businesses:
                break

            businesses.extend(page_businesses)
            print(f"Page {page_num + 1}: Found {len(page_businesses)} businesses")

            # Human-like delay between pages
            await page.wait_for_timeout(random.randint(5000, 12000))

        await browser.close()

    return businesses


def parse_yelp_results(html):
    """Parse business listings from Yelp search results HTML"""
    soup = BeautifulSoup(html, 'lxml')
    businesses = []

    # Yelp uses dynamic class names, so we look for structural patterns
    results = soup.find_all('div', {'data-testid': re.compile(r'serp-ia-card')})

    if not results:
        # Fallback to alternative selectors
        results = soup.select('[class*="businessName"]')

    for result in results:
        business = {}

        # Business name
        name_link = result.find('a', href=re.compile(r'/biz/'))
        if name_link:
            business['name'] = name_link.get_text(strip=True)
            business['yelp_url'] = f"https://www.yelp.com{name_link['href']}"

        # Rating
        rating_el = result.find('div', {'aria-label': re.compile(r'star rating')})
        if rating_el:
            label = rating_el.get('aria-label', '')
            match = re.search(r'([\d.]+)', label)
            if match:
                business['rating'] = float(match.group(1))

        # Review count
        review_el = result.find('span', string=re.compile(r'\d+ review'))
        if review_el:
            match = re.search(r'(\d+)', review_el.get_text())
            if match:
                business['review_count'] = int(match.group(1))

        # Category
        category_els = result.find_all('a', href=re.compile(r'/search\?cflt='))
        business['categories'] = [c.get_text(strip=True) for c in category_els]

        # Address
        address_el = result.find('address')
        if address_el:
            business['address'] = address_el.get_text(strip=True)

        # Phone (sometimes visible in search results)
        phone_el = result.find('p', string=re.compile(r'\(\d{3}\)'))
        if phone_el:
            business['phone'] = phone_el.get_text(strip=True)

        if business.get('name'):
            businesses.append(business)

    return businesses
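
Sponsored placements and overlapping pagination mean the same listing can appear on more than one results page. A small dedup pass keyed on the yelp_url field captured above keeps the first copy of each business:

```python
def dedupe_businesses(businesses):
    """Remove duplicate listings, keeping the first occurrence.
    Falls back to the business name when no yelp_url was captured."""
    seen = set()
    unique = []
    for biz in businesses:
        key = biz.get('yelp_url') or biz.get('name')
        if key and key not in seen:
            seen.add(key)
            unique.append(biz)
    return unique
```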

Extracting Detailed Business Pages

Search results provide basic information. For complete lead data, visit individual business pages:

async def scrape_business_detail(page, yelp_url):
    """Scrape detailed information from a Yelp business page"""
    await page.goto(yelp_url, wait_until="networkidle")
    await page.wait_for_timeout(random.randint(3000, 6000))

    html = await page.content()
    soup = BeautifulSoup(html, 'lxml')
    details = {}

    # Phone number (reliably found on detail pages)
    phone_link = soup.find('a', href=re.compile(r'tel:'))
    if phone_link:
        details['phone'] = phone_link.get_text(strip=True)

    # Website
    website_link = soup.find('a', href=re.compile(r'biz_redir'))
    if website_link:
        details['website'] = website_link.get('href', '')

    # Full address
    address_el = soup.find('address')
    if address_el:
        details['full_address'] = address_el.get_text(strip=True)

    # Hours of operation
    hours_table = soup.find('table', {'class': re.compile(r'hour')})
    if hours_table:
        hours = {}
        rows = hours_table.find_all('tr')
        for row in rows:
            cells = row.find_all(['th', 'td'])
            if len(cells) >= 2:
                day = cells[0].get_text(strip=True)
                time_range = cells[1].get_text(strip=True)
                hours[day] = time_range
        details['hours'] = hours

    # Price range
    price_el = soup.find('span', string=re.compile(r'^[\$]{1,4}$'))
    if price_el:
        details['price_range'] = price_el.get_text(strip=True)

    # Owner response rate (indicates engaged business owner)
    response_el = soup.find(string=re.compile(r'Business owner.*responded'))
    if response_el:
        details['owner_responsive'] = True

    return details
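
To enrich an entire results list, loop over the collected URLs with the same conservative pacing used for search pages. This sketch injects the fetch coroutine (e.g., a wrapper around scrape_business_detail above) rather than hard-coding it, so the pacing logic stays in one place:

```python
import asyncio
import random

async def enrich_leads(businesses, fetch_detail, min_delay=5, max_delay=12):
    """Fetch detail-page data for each business sequentially, merging the
    results into each dict and sleeping a random interval between visits.
    `fetch_detail` is any coroutine that takes a Yelp URL and returns a dict."""
    for biz in businesses:
        url = biz.get('yelp_url')
        if not url:
            continue
        try:
            biz.update(await fetch_detail(url))
        except Exception as exc:
            print(f"Failed to enrich {url}: {exc}")
        await asyncio.sleep(random.uniform(min_delay, max_delay))
    return businesses
```

A failed detail page is logged and skipped rather than aborting the run, since partial enrichment is still usable lead data.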

Lead Qualification Filters

Not every Yelp business is a good lead. Apply filters based on your target customer profile:

def qualify_yelp_lead(business, criteria):
    """Score and qualify a Yelp business as a lead"""
    score = 0
    reasons = []

    # Review count indicates business volume
    reviews = business.get('review_count', 0)
    if criteria.get('min_reviews') and reviews < criteria['min_reviews']:
        return None  # Too small
    if criteria.get('max_reviews') and reviews > criteria['max_reviews']:
        return None  # Too established (less likely to need services)

    if reviews < 20:
        score += 30  # Low online presence = opportunity
        reasons.append("Low review count - needs marketing help")
    elif reviews < 50:
        score += 20
        reasons.append("Moderate review count - growth potential")

    # Rating quality
    rating = business.get('rating', 0)
    if 3.0 <= rating <= 4.0:
        score += 20
        reasons.append("Mid-range rating - reputation management opportunity")

    # Website presence
    if not business.get('website'):
        score += 25
        reasons.append("No website listed - web design opportunity")
    else:
        score += 10

    # Phone number (required for outreach)
    if business.get('phone'):
        score += 15

    # Category match
    if criteria.get('categories'):
        if any(cat in business.get('categories', []) for cat in criteria['categories']):
            score += 10

    business['lead_score'] = score
    business['qualification_reasons'] = reasons

    return business if score >= criteria.get('min_score', 40) else None
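
After scoring, it helps to bucket leads into outreach tiers so the highest-opportunity businesses are contacted first. The thresholds below are illustrative, roughly matched to the scoring weights above, and should be tuned against your own conversion data:

```python
def tier_leads(leads):
    """Group scored leads into outreach tiers by lead_score.
    Thresholds are illustrative; adjust them to your pipeline."""
    tiers = {"hot": [], "warm": [], "cold": []}
    for lead in leads:
        score = lead.get('lead_score', 0)
        if score >= 70:
            tiers["hot"].append(lead)
        elif score >= 50:
            tiers["warm"].append(lead)
        else:
            tiers["cold"].append(lead)
    return tiers
```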

Handling Yelp’s Dynamic Content

Yelp frequently updates its frontend, which breaks selectors. Build resilience into your scraper. For foundational proxy concepts, review our proxy glossary.

Selector Fallback Strategy

class ResilientSelector:
    """Try multiple selectors in order of preference"""

    SELECTORS = {
        "business_name": [
            '[data-testid="serp-ia-card"] a[href*="/biz/"]',
            '.businessName a',
            'h3 a[href*="/biz/"]',
            'a[class*="business-name"]',
        ],
        "phone": [
            'a[href^="tel:"]',
            '[class*="phone"]',
            'p:has-text("(")',
        ],
        "rating": [
            '[aria-label*="star rating"]',
            '[class*="rating"]',
            'div[role="img"][aria-label*="star"]',
        ],
    }

    @staticmethod
    async def find(page, element_type):
        """Try selectors until one works"""
        for selector in ResilientSelector.SELECTORS.get(element_type, []):
            try:
                element = await page.query_selector(selector)
                if element:
                    return element
            except Exception:
                continue
        return None

JSON-LD Extraction

Many Yelp pages include structured data in JSON-LD format, which is more stable than HTML selectors:

def extract_jsonld(html):
    """Extract structured data from JSON-LD scripts"""
    soup = BeautifulSoup(html, 'lxml')
    scripts = soup.find_all('script', type='application/ld+json')

    for script in scripts:
        try:
            data = json.loads(script.string)
            if isinstance(data, dict) and data.get('@type') == 'LocalBusiness':
                return {
                    'name': data.get('name'),
                    'phone': data.get('telephone'),
                    'address': data.get('address', {}).get('streetAddress'),
                    'city': data.get('address', {}).get('addressLocality'),
                    'state': data.get('address', {}).get('addressRegion'),
                    'zip': data.get('address', {}).get('postalCode'),
                    'rating': data.get('aggregateRating', {}).get('ratingValue'),
                    'review_count': data.get('aggregateRating', {}).get('reviewCount'),
                    'price_range': data.get('priceRange'),
                }
        except json.JSONDecodeError:
            continue
    return None
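
One caveat: JSON-LD payloads are not always a single object. Some pages emit a list of nodes or wrap them in an @graph array, which the function above would miss. A defensive variant (a sketch, operating on the raw script text) checks all three shapes before giving up:

```python
import json

def find_local_business(raw_jsonld):
    """Locate a LocalBusiness node inside a JSON-LD payload that may be
    a single object, a list of objects, or an object with an @graph array."""
    try:
        data = json.loads(raw_jsonld)
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict):
        nodes = data.get('@graph', [data])
    elif isinstance(data, list):
        nodes = data
    else:
        return None
    for node in nodes:
        if isinstance(node, dict) and node.get('@type') == 'LocalBusiness':
            return node
    return None
```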

Scaling Across Markets

For national campaigns, parallelize scraping across metropolitan areas:

METRO_AREAS = [
    {"name": "New York, NY", "proxy_geo": "US-NY"},
    {"name": "Los Angeles, CA", "proxy_geo": "US-CA"},
    {"name": "Chicago, IL", "proxy_geo": "US-IL"},
    {"name": "Houston, TX", "proxy_geo": "US-TX"},
    {"name": "Phoenix, AZ", "proxy_geo": "US-AZ"},
    # Add more cities as needed
]

async def national_yelp_scrape(category, proxy_pool):
    """Scrape a business category across multiple metro areas"""
    all_leads = []

    for metro in METRO_AREAS:
        proxy = proxy_pool.get_proxy(geo=metro["proxy_geo"])
        leads = await scrape_yelp_search(
            category=category,
            location=metro["name"],
            proxy_config=proxy,
            max_pages=10
        )

        # Add metro context
        for lead in leads:
            lead['metro_area'] = metro["name"]

        all_leads.extend(leads)
        print(f"{metro['name']}: {len(leads)} businesses found")

        # Rest between cities
        await asyncio.sleep(random.uniform(30, 60))

    return all_leads
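
The proxy_pool object is assumed here; its interface depends on your provider. A minimal geo-keyed round-robin version might look like the sketch below (many mobile proxy providers instead handle geo-targeting through username parameters, so check your provider's docs):

```python
from itertools import cycle

class ProxyPool:
    """Minimal round-robin pool of Playwright proxy configs keyed by geo."""

    def __init__(self, configs_by_geo):
        # configs_by_geo: {"US-NY": [playwright_proxy_dict, ...], ...}
        self._cycles = {geo: cycle(configs)
                        for geo, configs in configs_by_geo.items()}

    def get_proxy(self, geo):
        """Return the next proxy config for the requested geo, cycling forever."""
        if geo not in self._cycles:
            raise KeyError(f"No proxies configured for geo {geo}")
        return next(self._cycles[geo])
```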

Data Export and CRM Integration

Format your Yelp leads for import into CRM or outreach tools. For teams also pulling data from ecommerce platforms, maintaining consistent data formats across all sources simplifies pipeline management.

import csv

def export_qualified_leads(leads, output_file="yelp_leads.csv"):
    """Export qualified leads to CSV for CRM import"""
    fieldnames = [
        'name', 'phone', 'website', 'address', 'city', 'state',
        'categories', 'rating', 'review_count', 'price_range',
        'lead_score', 'qualification_reasons', 'yelp_url', 'metro_area'
    ]

    with open(output_file, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction='ignore')
        writer.writeheader()

        for lead in sorted(leads, key=lambda x: x.get('lead_score', 0), reverse=True):
            lead['categories'] = '; '.join(lead.get('categories', []))
            lead['qualification_reasons'] = '; '.join(lead.get('qualification_reasons', []))
            writer.writerow(lead)

    print(f"Exported {len(leads)} leads to {output_file}")
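
CRM imports go more smoothly when phone numbers arrive in one consistent format. A small normalizer for the US-style numbers Yelp displays (a sketch; for international formats, reach for a dedicated library such as phonenumbers):

```python
import re

def normalize_phone(raw):
    """Normalize a US phone string like '(415) 555-0132' to E.164 (+14155550132).
    Returns None when the input doesn't contain a 10-digit US number."""
    digits = re.sub(r'\D', '', raw or '')
    if len(digits) == 11 and digits.startswith('1'):
        digits = digits[1:]
    if len(digits) != 10:
        return None
    return f"+1{digits}"
```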

Performance and Expectations

Typical figures with mobile proxies and conservative pacing:

  • Search pages per hour: 30-60
  • Detail pages per hour: 60-120
  • Businesses per metro (10 pages): 80-100
  • Data completeness (name + phone): 90%+
  • CAPTCHA frequency: every 100-200 pages
  • Proxy block rate (mobile): under 5%

Conclusion

Yelp scraping with mobile proxies provides a reliable pipeline of local business leads that includes qualification signals (reviews, ratings, online presence) not available from other directories. The key to sustainable scraping is conservative pacing, robust selector strategies, and structured data extraction via JSON-LD when available. Start with a single category in your strongest market, validate lead quality through outreach conversion rates, and then expand systematically.
