How to Scrape Yelp Business Data in 2026
Yelp is the leading local business review platform in North America, with over 265 million cumulative reviews and 178 million monthly unique visitors. For local SEO agencies, market researchers, reputation management firms, and business intelligence teams, scraping Yelp provides essential insights into business performance, customer sentiment, and competitive positioning.
This guide covers how to extract Yelp data using both the official API and web scraping techniques with Python.
What Data Can You Extract from Yelp?
Yelp business pages contain rich data:
- Business listings (name, address, phone, hours, website)
- Reviews (text, rating, date, user info, useful/funny/cool votes)
- Aggregate ratings (overall, by category)
- Photos (user and business photos)
- Business categories and attributes
- Menu data (for restaurants)
- Price range indicators
- COVID-related updates and health measures
- Owner responses to reviews
Example JSON Output
{
  "business_id": "abc123-def456",
  "name": "Joe's Pizza",
  "rating": 4.5,
  "review_count": 1243,
  "price": "$$",
  "categories": ["Pizza", "Italian", "Delivery"],
  "address": "123 Main St, New York, NY 10001",
  "phone": "(212) 555-1234",
  "hours": {
    "Monday": "11:00 AM - 11:00 PM",
    "Tuesday": "11:00 AM - 11:00 PM"
  },
  "attributes": {
    "delivery": true,
    "takeout": true,
    "outdoor_seating": true
  },
  "recent_review": {
    "rating": 5,
    "text": "Best pizza in the city, hands down...",
    "date": "2026-03-05",
    "user": "FoodLover22"
  }
}
Prerequisites
pip install requests beautifulsoup4 lxml fake-useragent yelpapi
Method 1: Using the Yelp Fusion API (Recommended)
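Yelp's Fusion API covers business search, business details, autocomplete, and a small sample of reviews per business. Confirm the current quotas and pricing in Yelp's developer documentation before building on it.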
import time

import requests


class YelpAPIScraper:
    def __init__(self, api_key, proxy_url=None):
        self.api_key = api_key
        self.base_url = "https://api.yelp.com/v3"
        self.proxy_url = proxy_url
        self.session = requests.Session()

    def _get_headers(self):
        return {
            "Authorization": f"Bearer {self.api_key}",
            "Accept": "application/json",
        }

    def _get_proxies(self):
        if self.proxy_url:
            return {"http": self.proxy_url, "https": self.proxy_url}
        return None

    def search_businesses(self, term, location, limit=50, offset=0, sort_by="best_match"):
        """Search for businesses. Returns (businesses, total)."""
        params = {
            "term": term,
            "location": location,
            "limit": min(limit, 50),  # the API accepts at most 50 per request
            "offset": offset,
            "sort_by": sort_by,
        }
        response = self.session.get(
            f"{self.base_url}/businesses/search",
            headers=self._get_headers(),
            params=params,
            proxies=self._get_proxies(),
            timeout=30,
        )
        response.raise_for_status()
        data = response.json()
        businesses = []
        for biz in data.get("businesses", []):
            businesses.append({
                "id": biz.get("id"),
                "name": biz.get("name"),
                "rating": biz.get("rating"),
                "review_count": biz.get("review_count"),
                "price": biz.get("price"),
                "phone": biz.get("phone"),
                "address": ", ".join(biz.get("location", {}).get("display_address", [])),
                "categories": [c["title"] for c in biz.get("categories", [])],
                "coordinates": biz.get("coordinates"),
                "url": biz.get("url"),
                "image_url": biz.get("image_url"),
                "is_closed": biz.get("is_closed"),
                "distance": biz.get("distance"),
            })
        return businesses, data.get("total", 0)

    def search_all(self, term, location, max_results=200):
        """Paginate through search results, up to max_results."""
        all_businesses = []
        offset = 0
        while offset < min(max_results, 1000):
            businesses, total = self.search_businesses(term, location, limit=50, offset=offset)
            if not businesses:
                break
            all_businesses.extend(businesses)
            offset += 50
            print(f"Fetched {len(all_businesses)}/{min(total, max_results)}")
            if offset >= total:
                break  # no further pages to request
            time.sleep(0.5)
        # Pages arrive in blocks of 50, so trim any overshoot
        return all_businesses[:max_results]

    def get_business_details(self, business_id):
        """Get detailed business information."""
        response = self.session.get(
            f"{self.base_url}/businesses/{business_id}",
            headers=self._get_headers(),
            proxies=self._get_proxies(),
            timeout=30,
        )
        response.raise_for_status()
        return response.json()

    def get_reviews(self, business_id, sort_by="yelp_sort"):
        """Get reviews for a business (limited to 3 via API)."""
        params = {"sort_by": sort_by}
        response = self.session.get(
            f"{self.base_url}/businesses/{business_id}/reviews",
            headers=self._get_headers(),
            params=params,
            proxies=self._get_proxies(),
            timeout=30,
        )
        response.raise_for_status()
        data = response.json()
        return [{
            "rating": r.get("rating"),
            "text": r.get("text"),
            "time_created": r.get("time_created"),
            "user": r.get("user", {}).get("name"),
        } for r in data.get("reviews", [])]

    def autocomplete(self, text, latitude=None, longitude=None):
        """Get autocomplete suggestions."""
        params = {"text": text}
        # Compare against None so that a coordinate of 0.0 is still sent
        if latitude is not None and longitude is not None:
            params["latitude"] = latitude
            params["longitude"] = longitude
        response = self.session.get(
            f"{self.base_url}/autocomplete",
            headers=self._get_headers(),
            params=params,
            proxies=self._get_proxies(),
            timeout=30,
        )
        response.raise_for_status()
        return response.json()


# Usage
scraper = YelpAPIScraper(
    api_key="your_yelp_api_key",
    proxy_url="http://user:pass@proxy:port",
)

# Search businesses
businesses, total = scraper.search_businesses("pizza", "New York, NY", limit=10)
print(f"Found {total} total results")
for biz in businesses[:3]:
    print(f"  {biz['name']} - {biz['rating']}* ({biz['review_count']} reviews)")
    reviews = scraper.get_reviews(biz["id"])
    print(f"  Sample review: {reviews[0]['text'][:100]}..." if reviews else "  No reviews")
    time.sleep(0.5)
Method 2: Web Scraping for Full Reviews
The Yelp API only returns 3 reviews per business. Web scraping provides access to all reviews.
import json
import random
import time

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent


class YelpWebScraper:
    def __init__(self, proxy_url=None):
        self.session = requests.Session()
        self.ua = UserAgent()
        self.proxy_url = proxy_url

    def _get_headers(self):
        return {
            "User-Agent": self.ua.random,
            "Accept": "text/html,application/xhtml+xml",
            "Accept-Language": "en-US,en;q=0.9",
            "Referer": "https://www.yelp.com/",
        }

    def _get_proxies(self):
        if self.proxy_url:
            return {"http": self.proxy_url, "https": self.proxy_url}
        return None

    def scrape_reviews(self, business_url, max_pages=5):
        """Scrape reviews from a Yelp business page, up to max_pages pages."""
        reviews = []
        for page in range(max_pages):
            start = page * 10
            url = f"{business_url}?start={start}" if page > 0 else business_url
            try:
                response = self.session.get(
                    url,
                    headers=self._get_headers(),
                    proxies=self._get_proxies(),
                    timeout=30,
                )
                response.raise_for_status()
                soup = BeautifulSoup(response.text, "lxml")
                page_reviews = []

                # Extract JSON-LD review data
                for script in soup.find_all("script", type="application/ld+json"):
                    try:
                        # script.string is None for empty tags; "" fails cleanly below
                        data = json.loads(script.string or "")
                    except json.JSONDecodeError:
                        continue
                    # JSON-LD may be a single object or a list of objects
                    items = data if isinstance(data, list) else [data]
                    for item in items:
                        if not isinstance(item, dict) or item.get("@type") != "LocalBusiness":
                            continue
                        for review in item.get("review", []):
                            page_reviews.append({
                                "rating": review.get("reviewRating", {}).get("ratingValue"),
                                "text": review.get("description"),
                                "date": review.get("datePublished"),
                                "author": review.get("author"),
                            })

                # Fallback: parse review elements only when JSON-LD gave nothing
                # for this page, so the same review is not collected twice
                if not page_reviews:
                    review_elems = soup.select("[class*='review__'], li[class*='margin']")
                    for elem in review_elems:
                        rating = elem.select_one("[class*='star-rating'], [aria-label*='star']")
                        text = elem.select_one("p[class*='comment'], span[class*='text']")
                        date = elem.select_one("span[class*='date']")
                        if text:
                            page_reviews.append({
                                "rating": rating.get("aria-label") if rating else None,
                                "text": text.get_text(strip=True),
                                "date": date.get_text(strip=True) if date else None,
                            })

                reviews.extend(page_reviews)
                print(f"Page {page + 1}: Total reviews so far: {len(reviews)}")
                time.sleep(random.uniform(3, 6))
            except requests.RequestException as e:
                print(f"Error on page {page + 1}: {e}")
                continue
        return reviews


# Usage
scraper = YelpWebScraper(proxy_url="http://user:pass@proxy:port")
reviews = scraper.scrape_reviews("https://www.yelp.com/biz/joes-pizza-new-york", max_pages=5)
print(f"Collected {len(reviews)} reviews")
Yelp API Rate Limits
| Tier | Rate Limit | Daily Limit |
|---|---|---|
| Free | 5,000 requests/day | 5,000 |
| Fusion API | 5 QPS | Varies |
| GraphQL | Varies | Varies |
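To stay inside the limits in the table, requests can be paced on the client side. The following is a minimal sketch, assuming the 5 QPS and 5,000 requests/day figures apply to your key. `RateLimiter` and its parameters are hypothetical names, not part of any Yelp SDK, and the rolling 24-hour window is an assumption about when the daily quota resets.

```python
import time


class RateLimiter:
    """Client-side throttle: spaces calls evenly and enforces a daily budget."""

    def __init__(self, max_per_second=5, max_per_day=5000,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.max_per_day = max_per_day
        self.clock = clock  # injectable so the limiter can be tested without waiting
        self.sleep = sleep
        self.last_call = None
        self.day_start = clock()
        self.calls_today = 0

    def acquire(self):
        """Block until the next request is allowed; raise if the daily budget is spent."""
        now = self.clock()
        # Assumption: the quota resets 24 hours after the window started.
        if now - self.day_start >= 86400:
            self.day_start = now
            self.calls_today = 0
        if self.calls_today >= self.max_per_day:
            raise RuntimeError("Daily request budget exhausted")
        if self.last_call is not None:
            wait = self.min_interval - (now - self.last_call)
            if wait > 0:
                self.sleep(wait)
                now = self.clock()
        self.last_call = now
        self.calls_today += 1
```

Calling `limiter.acquire()` immediately before each API request keeps the spacing logic in one place instead of scattering `time.sleep()` calls through the scraper.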
Proxy Recommendations for Yelp
| Proxy Type | Success Rate | Best For |
|---|---|---|
| US Residential | 80-90% | Review scraping |
| ISP Proxies | 75-85% | API access + web |
| Mobile | 85-95% | Bypassing blocks |
| Datacenter | 40-50% | API only |
US residential proxies work best for Yelp web scraping. For API access, datacenter proxies are sufficient.
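When several proxies are available, rotating through them spreads requests across IPs. The sketch below is one possible approach, not a feature of `requests` itself: `ProxyPool`, `max_failures`, and the proxy URLs are hypothetical names chosen for illustration.

```python
import itertools


class ProxyPool:
    """Round-robin pool that skips proxies after repeated failures."""

    def __init__(self, proxy_urls, max_failures=3):
        if not proxy_urls:
            raise ValueError("proxy_urls must not be empty")
        self.max_failures = max_failures
        self.failures = {url: 0 for url in proxy_urls}
        self._cycle = itertools.cycle(list(self.failures))

    def get(self):
        """Return the next healthy proxy as a requests-style proxies dict."""
        for _ in range(len(self.failures)):
            url = next(self._cycle)
            if self.failures[url] < self.max_failures:
                return {"http": url, "https": url}
        raise RuntimeError("All proxies have exceeded the failure threshold")

    def report_failure(self, proxies):
        """Record a failed request (for example a 403 or a timeout)."""
        self.failures[proxies["https"]] += 1

    def report_success(self, proxies):
        """Reset the failure count after a successful request."""
        self.failures[proxies["https"]] = 0
```

The dict returned by `get()` can be passed directly as the `proxies=` argument of `requests.get()` or `Session.get()`.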
Legal Considerations
- API Terms: Yelp’s API has specific terms limiting commercial use and data storage.
- Terms of Service: Web scraping is prohibited in Yelp’s ToS.
- Legal History: Yelp has pursued legal action against scrapers in the past.
- Review Content: Reviews are copyrighted by their authors.
See our web scraping compliance guide for details.
Frequently Asked Questions
How do I get a Yelp API key?
Create a free account on the Yelp Developers site. Register a new app to receive an API key. The free tier provides 5,000 API calls per day.
Why does the Yelp API only return 3 reviews?
Yelp intentionally limits API review access to 3 per business to protect review content. For full review data, web scraping is necessary.
Can I scrape Yelp photos?
Yes. Business photos are accessible through both the API (limited) and web scraping. Photos are copyrighted, so do not republish without permission.
What’s the best way to monitor Yelp reviews?
Combine the Yelp API for new review detection with web scraping for full review text extraction. Schedule daily API checks and scrape full details for new reviews.
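One way to detect new reviews from the API is to compare each business's `review_count` against the value stored on the previous run. This is a minimal sketch under that assumption. `find_businesses_with_new_reviews` is a hypothetical helper, and a rising count is only a proxy for new reviews, since counts can also fall when reviews are removed.

```python
def find_businesses_with_new_reviews(previous_counts, current_businesses):
    """Return IDs of businesses whose review_count rose since the last check.

    previous_counts: dict mapping business id -> review_count from the last run.
    current_businesses: list of dicts with "id" and "review_count" keys,
        such as the records returned by search_businesses() in Method 1.
    """
    changed = []
    for biz in current_businesses:
        biz_id = biz.get("id")
        count = biz.get("review_count") or 0
        # Businesses not seen before are treated as changed so they get a first scrape.
        if biz_id not in previous_counts or count > previous_counts[biz_id]:
            changed.append(biz_id)
    return changed
```

Only the returned IDs then need a full-page scrape, which keeps the daily web scraping volume small.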
Advanced Techniques
Handling Pagination
Most websites paginate their results. Implement robust pagination handling:
import random
import time


def scrape_all_pages(scraper, base_url, max_pages=20):
    """Collect results page by page until a page comes back empty."""
    all_data = []
    for page in range(1, max_pages + 1):
        url = f"{base_url}?page={page}"
        # scraper.search() stands in for whatever method fetches and parses one page
        results = scraper.search(url)
        if not results:
            break
        all_data.extend(results)
        print(f"Page {page}: {len(results)} items (total: {len(all_data)})")
        time.sleep(random.uniform(2, 5))
    return all_data
Data Validation and Cleaning
Always validate scraped data before storage:
import html
import re


def validate_data(item):
    """Return True only if every required field is present and non-empty."""
    required_fields = ["title", "url"]
    for field in required_fields:
        if not item.get(field):
            return False
    return True


def clean_text(text):
    if not text:
        return None
    # Decode HTML entities first, so entities such as &nbsp; become whitespace
    text = html.unescape(text)
    # Then collapse runs of whitespace
    text = re.sub(r"\s+", " ", text).strip()
    return text


# Apply to results (the list of dicts returned by your scraper)
cleaned = [item for item in results if validate_data(item)]
for item in cleaned:
    item["title"] = clean_text(item.get("title"))
Monitoring and Alerting
Build monitoring into your scraping pipeline:
import logging
from datetime import datetime

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)


class ScrapingMonitor:
    def __init__(self):
        self.start_time = datetime.now()
        self.requests = 0
        self.errors = 0
        self.items = 0

    def log_request(self, success=True):
        self.requests += 1
        if not success:
            self.errors += 1
        if self.requests % 50 == 0:
            # total_seconds() keeps counting past 24 hours; .seconds would wrap around
            elapsed = (datetime.now() - self.start_time).total_seconds()
            rate = self.requests / max(elapsed, 1) * 60
            logger.info(f"Requests: {self.requests}, Errors: {self.errors}, "
                        f"Items: {self.items}, Rate: {rate:.1f}/min")

    def log_item(self, count=1):
        self.items += count
Error Handling and Retry Logic
Implement robust error handling:
import time

from requests.exceptions import RequestException


def retry_request(func, max_retries=3, base_delay=5):
    """Call func(), retrying with exponential backoff on request errors."""
    for attempt in range(max_retries):
        try:
            return func()
        except RequestException as e:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed: {e}. Retrying in {delay}s...")
            time.sleep(delay)
    return None
Data Storage Options
Choose the right storage for your scraping volume:
import csv
import json
import sqlite3


class DataStorage:
    def __init__(self, db_path="scraped_data.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute('''CREATE TABLE IF NOT EXISTS items
            (id TEXT PRIMARY KEY, title TEXT, url TEXT, data JSON, scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP)''')

    def save(self, item):
        self.conn.execute(
            "INSERT OR REPLACE INTO items (id, title, url, data) VALUES (?, ?, ?, ?)",
            (item.get("id"), item.get("title"), item.get("url"), json.dumps(item))
        )
        self.conn.commit()

    def export_json(self, output_path):
        cursor = self.conn.execute("SELECT data FROM items")
        items = [json.loads(row[0]) for row in cursor.fetchall()]
        with open(output_path, "w") as f:
            json.dump(items, f, indent=2)

    def export_csv(self, output_path):
        cursor = self.conn.execute("SELECT * FROM items")
        rows = cursor.fetchall()
        with open(output_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["id", "title", "url", "data", "scraped_at"])
            writer.writerows(rows)
Frequently Asked Questions
How often should I scrape data?
The optimal frequency depends on how often the source data changes. For real-time data (stock prices, news), scrape every few minutes. For product listings, daily or weekly is usually sufficient. For reviews, weekly scraping captures new feedback without excessive load.
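The frequency guidance above can be encoded as a small lookup so a scheduler can decide whether a source is due. This is only a sketch: the `SCRAPE_INTERVALS` keys, the five-minute value for real-time data, and the `is_due` helper are illustrative assumptions, not fixed recommendations.

```python
from datetime import datetime, timedelta

# Intervals loosely follow the guidance above; tune them to your sources.
SCRAPE_INTERVALS = {
    "realtime": timedelta(minutes=5),
    "listings": timedelta(days=1),
    "reviews": timedelta(weeks=1),
}


def is_due(data_type, last_scraped, now=None):
    """Return True if data of this type should be scraped again."""
    now = now or datetime.now()
    if last_scraped is None:
        return True  # never scraped before
    return now - last_scraped >= SCRAPE_INTERVALS[data_type]
```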
What happens if my IP gets blocked?
If you receive 403 or 429 status codes, your IP is likely blocked or rate limited: a 403 usually signals a block, while a 429 means too many requests. Switch to a different proxy, implement exponential backoff, and slow your request rate. Rotating residential proxies switch IPs automatically, which reduces the chance of a block.
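The backoff advice can be sketched as a delay calculator that prefers the server's `Retry-After` header when it is a number of seconds and otherwise falls back to exponential growth with a little jitter. `backoff_delay`, `is_blocked`, and the default values are hypothetical choices for illustration.

```python
import random

BLOCK_STATUS_CODES = {403, 429}


def is_blocked(status_code):
    """Return True for status codes that usually indicate blocking or rate limiting."""
    return status_code in BLOCK_STATUS_CODES


def backoff_delay(attempt, retry_after=None, base_delay=5, max_delay=300):
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            return min(float(retry_after), max_delay)
        except ValueError:
            pass  # Retry-After may be an HTTP date; fall back to exponential backoff
    delay = min(base_delay * (2 ** attempt), max_delay)
    # Add up to 10% jitter so parallel workers do not retry in lockstep
    return min(delay + random.uniform(0, delay * 0.1), max_delay)
```

In a request loop, the header value would come from `response.headers.get("Retry-After")`, and the result would be passed to `time.sleep()` before switching proxy and retrying.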
Should I use headless browsers or HTTP requests?
Use HTTP requests (with BeautifulSoup or similar) whenever possible: they are faster and use fewer resources. Switch to headless browsers (Selenium, Playwright) only when JavaScript rendering is required for the data you need.
How do I handle CAPTCHAs?
CAPTCHAs indicate aggressive bot detection. To minimize them: use residential or mobile proxies, implement realistic delays, rotate user agents, and maintain consistent session behavior. For persistent CAPTCHAs, consider CAPTCHA-solving services as a last resort.
Can I scrape data commercially?
The legality of commercial scraping depends on the platform’s ToS, the type of data collected, and your jurisdiction. Public data is generally more permissible, but always consult legal counsel for commercial use cases. See our compliance guide.
Conclusion
Yelp offers one of the more accessible data ecosystems among review platforms, with a generous free API for business search and basic review data. For comprehensive review scraping, combine the API with web scraping using residential proxies.
For more review platform guides, check our social media proxy guide and proxy provider comparisons.
Related Reading
- How to Scrape AliExpress Product Data
- How to Scrape Amazon Product Reviews in 2026
- aiohttp + BeautifulSoup: Async Python Scraping
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)
- API vs Web Scraping: When You Need Proxies (and When You Don’t)
- ASEAN Data Protection Laws: A Web Scraping Compliance Matrix
last updated: April 4, 2026