# How to Scrape Crunchbase Company Data in 2026
Crunchbase is the premier business data platform for tracking startups, venture capital, and private company information. With data on over 4 million companies, including funding rounds, acquisitions, leadership, and key metrics, Crunchbase is essential for investors, sales teams, market researchers, and startup ecosystem analysts.
This guide covers how to extract Crunchbase data using their API and web scraping techniques with Python.
## What Data Can You Extract?
Crunchbase data includes:
- Company profiles (description, founded date, headquarters, employee count)
- Funding rounds (amount, date, investors, series type)
- Investor profiles (portfolio, fund size, investment focus)
- Acquisitions (acquirer, price, date)
- Key people (founders, executives, board members)
- Company categories and industry tags
- Financial metrics (revenue ranges, valuation estimates)
- News and events
## Example JSON Output

```json
{
  "company": {
    "name": "Stripe",
    "short_description": "Financial infrastructure for the internet",
    "headquarters": "San Francisco, CA",
    "founded": "2010",
    "employee_count": "5001-10000",
    "categories": ["Fintech", "Payments", "SaaS"],
    "website": "https://stripe.com"
  },
  "funding": {
    "total_raised": "$8.7B",
    "last_round": {
      "type": "Series I",
      "amount": "$6.5B",
      "date": "2023-03-15",
      "valuation": "$50B"
    }
  }
}
```

## Prerequisites

```bash
pip install requests beautifulsoup4 lxml fake-useragent
```

## Method 1: Using the Crunchbase API
Crunchbase offers both free and paid API access:
```python
import requests
import json


class CrunchbaseScraper:
    def __init__(self, api_key, proxy_url=None):
        self.api_key = api_key
        self.base_url = "https://api.crunchbase.com/api/v4"
        self.proxy_url = proxy_url
        self.session = requests.Session()

    def _get_headers(self):
        return {
            "X-cb-user-key": self.api_key,
            "Content-Type": "application/json",
        }

    def _get_proxies(self):
        if self.proxy_url:
            return {"http": self.proxy_url, "https": self.proxy_url}
        return None

    def search_organizations(self, query, limit=25):
        """Search for companies/organizations."""
        url = f"{self.base_url}/autocompletes"
        params = {
            "query": query,
            "collection_ids": "organizations",
            "limit": limit,
        }
        response = self.session.get(
            url, headers=self._get_headers(),
            params=params, proxies=self._get_proxies(), timeout=30
        )
        response.raise_for_status()
        return response.json().get("entities", [])

    def get_organization(self, permalink):
        """Get detailed organization data."""
        url = f"{self.base_url}/entities/organizations/{permalink}"
        params = {
            "field_ids": "short_description,founded_on,location_identifiers,num_employees_enum,categories,funding_total,last_funding_type,investor_count",
        }
        response = self.session.get(
            url, headers=self._get_headers(),
            params=params, proxies=self._get_proxies(), timeout=30
        )
        response.raise_for_status()
        return response.json()

    def get_funding_rounds(self, org_permalink, limit=10):
        """Get funding rounds for a company."""
        url = f"{self.base_url}/entities/organizations/{org_permalink}/funding_rounds"
        params = {"limit": limit}
        response = self.session.get(
            url, headers=self._get_headers(),
            params=params, proxies=self._get_proxies(), timeout=30
        )
        response.raise_for_status()
        return response.json()

    def search_funding_rounds(self, min_amount=None, funded_after=None, limit=50):
        """Search for funding rounds with filters."""
        url = f"{self.base_url}/searches/funding_rounds"
        payload = {
            "field_ids": ["identifier", "funded_organization_identifier", "money_raised", "announced_on", "investment_type"],
            "limit": limit,
            "order": [{"field_id": "announced_on", "sort": "desc"}],
        }
        query = []
        if min_amount:
            query.append({
                "type": "predicate",
                "field_id": "money_raised",
                "operator_id": "gte",
                "values": [{"value": min_amount, "currency": "usd"}]
            })
        if funded_after:
            query.append({
                "type": "predicate",
                "field_id": "announced_on",
                "operator_id": "gte",
                "values": [funded_after]
            })
        if query:
            payload["query"] = query
        response = self.session.post(
            url, headers=self._get_headers(),
            json=payload, proxies=self._get_proxies(), timeout=30
        )
        response.raise_for_status()
        return response.json()


# Usage
scraper = CrunchbaseScraper(api_key="your_api_key")

# Search companies
results = scraper.search_organizations("stripe")
print(json.dumps(results[:3], indent=2))

# Get company details
org = scraper.get_organization("stripe")
print(json.dumps(org, indent=2))
```

## Method 2: Web Scraping Crunchbase
For data not available through the API:
```python
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
import json


class CrunchbaseWebScraper:
    def __init__(self, proxy_url=None):
        self.session = requests.Session()
        self.ua = UserAgent()
        self.proxy_url = proxy_url

    def _get_headers(self):
        return {
            "User-Agent": self.ua.random,
            "Accept": "text/html,application/xhtml+xml",
            "Accept-Language": "en-US,en;q=0.9",
        }

    def _get_proxies(self):
        if self.proxy_url:
            return {"http": self.proxy_url, "https": self.proxy_url}
        return None

    def scrape_company(self, company_slug):
        """Scrape company data from the Crunchbase website."""
        url = f"https://www.crunchbase.com/organization/{company_slug}"
        try:
            response = self.session.get(
                url, headers=self._get_headers(),
                proxies=self._get_proxies(), timeout=30
            )
            response.raise_for_status()
            soup = BeautifulSoup(response.text, "lxml")

            # Extract JSON-LD structured data embedded in the page
            scripts = soup.find_all("script", type="application/ld+json")
            for script in scripts:
                try:
                    data = json.loads(script.string)
                    if data.get("@type") == "Organization":
                        return data
                except json.JSONDecodeError:
                    continue
            return None
        except requests.RequestException as e:
            print(f"Error: {e}")
            return None
```

## Proxy Recommendations
| Proxy Type | Success Rate | Best For |
|---|---|---|
| Residential | 70-80% | Web scraping |
| Datacenter | 60-70% | API access |
| ISP | 65-75% | Consistent sessions |
For API access, any proxy type works. For web scraping, residential proxies are recommended.
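Whatever proxy type you choose, both scraper classes in this guide pass the URL through the same `proxies` mapping that `requests` expects. A standalone sketch of that plumbing (the endpoint below is a placeholder, not a real provider):

```python
def build_proxies(proxy_url):
    """Return the proxies mapping that requests expects, or None when unset."""
    if not proxy_url:
        return None
    return {"http": proxy_url, "https": proxy_url}

# Placeholder residential endpoint -- substitute your provider's host and credentials.
print(build_proxies("http://user:pass@residential.example-proxy.com:8000"))
```

Passing the resulting dict via the `proxies=` keyword routes both HTTP and HTTPS traffic through the same endpoint, which is what most residential providers expect.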
## Legal Considerations
- API Terms: Crunchbase API terms restrict data resale and competing product creation.
- Terms of Service: Web scraping is prohibited in Crunchbase’s ToS.
- Data Licensing: Crunchbase data is commercially licensed. Extensive use may require a paid data license.
- Commercial Use: Significant restrictions on commercial use of scraped data.
See our web scraping compliance guide for details.
## Handling Crunchbase Anti-Bot Protections

### 1. JavaScript Rendering

Crunchbase is a React-based single-page application, so simple HTTP requests return only minimal data. For web scraping (as opposed to API access), use Selenium or Playwright to render the full page:
```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time


class CrunchbaseSeleniumScraper:
    def __init__(self, proxy=None):
        options = Options()
        options.add_argument("--headless=new")
        options.add_argument("--no-sandbox")
        if proxy:
            options.add_argument(f"--proxy-server={proxy}")
        self.driver = webdriver.Chrome(options=options)

    def scrape_company_page(self, slug):
        url = f"https://www.crunchbase.com/organization/{slug}"
        self.driver.get(url)
        time.sleep(5)  # allow the SPA to finish rendering
        try:
            WebDriverWait(self.driver, 15).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, "h1, [class*='profile']"))
            )
        except Exception:
            return None

        # Pull fields out of the rendered DOM in a single JavaScript pass
        data = self.driver.execute_script('''
            const result = {};
            const title = document.querySelector("h1");
            result.name = title ? title.innerText.trim() : null;
            const desc = document.querySelector("[class*='description']");
            result.description = desc ? desc.innerText.trim() : null;
            const fields = document.querySelectorAll("[class*='field-type']");
            fields.forEach(field => {
                const label = field.querySelector("[class*='label']");
                const value = field.querySelector("[class*='value']");
                if (label && value) {
                    result[label.innerText.trim().toLowerCase().replace(/\\s+/g, "_")] = value.innerText.trim();
                }
            });
            return result;
        ''')
        return data

    def close(self):
        self.driver.quit()
```

### 2. Rate Limiting
The Crunchbase API enforces strict rate limits depending on your plan tier. Implement exponential backoff:
```python
import time
import requests


def api_request_with_backoff(func, max_retries=3, base_delay=2):
    """Retry an API call with exponential backoff on HTTP 429 responses."""
    for attempt in range(max_retries):
        try:
            return func()
        except requests.exceptions.HTTPError as e:
            if e.response.status_code == 429:
                delay = base_delay * (2 ** attempt)
                print(f"Rate limited. Waiting {delay}s...")
                time.sleep(delay)
            else:
                raise
    return None
```

### 3. Session Management
Crunchbase tracks browsing sessions closely. Rotate user agents and clear cookies between scraping batches. For API access, use a single persistent session to avoid triggering anomaly detection.
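A minimal sketch of that batch hygiene, assuming a hand-rolled user-agent pool in place of fake-useragent (the UA strings below are illustrative samples):

```python
import random
import requests

# Small pool of desktop user agents; fake-useragent can supply these instead.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
]


def fresh_batch_session():
    """Start a scraping batch with a clean session and a random user agent."""
    session = requests.Session()
    session.headers["User-Agent"] = random.choice(USER_AGENTS)
    return session


def reset_session(session):
    """Between batches: clear accumulated cookies and rotate the user agent."""
    session.cookies.clear()
    session.headers["User-Agent"] = random.choice(USER_AGENTS)
    return session
```

Note the asymmetry this models: web-scraping batches each get a reset identity, while API clients keep one persistent session for their whole run.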
## Data Storage and Export
For investment research and sales intelligence pipelines, store Crunchbase data in a structured database:
```python
import sqlite3
import json
import csv


class CrunchbaseDataStore:
    def __init__(self, db_path="crunchbase_data.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute('''CREATE TABLE IF NOT EXISTS companies
            (permalink TEXT PRIMARY KEY, name TEXT, description TEXT,
             headquarters TEXT, founded TEXT, employee_count TEXT,
             categories TEXT, total_funding TEXT, last_funding_type TEXT,
             data JSON, scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP)''')
        self.conn.execute('''CREATE TABLE IF NOT EXISTS funding_rounds
            (round_id TEXT PRIMARY KEY, company_permalink TEXT,
             round_type TEXT, amount TEXT, announced_on TEXT,
             investors TEXT, data JSON)''')

    def store_company(self, company):
        self.conn.execute(
            """INSERT OR REPLACE INTO companies
               (permalink, name, description, headquarters, founded, employee_count,
                categories, total_funding, last_funding_type, data)
               VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
            (company.get("permalink"), company.get("name"), company.get("short_description"),
             company.get("headquarters"), company.get("founded_on"),
             company.get("num_employees_enum"), json.dumps(company.get("categories", [])),
             company.get("funding_total"), company.get("last_funding_type"),
             json.dumps(company))
        )
        self.conn.commit()

    def export_csv(self, output_path):
        cursor = self.conn.execute("SELECT * FROM companies ORDER BY name")
        rows = cursor.fetchall()
        columns = [desc[0] for desc in cursor.description]
        with open(output_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(columns)
            writer.writerows(rows)
```

## Frequently Asked Questions
### Is the Crunchbase API free?
Crunchbase offers a free tier with limited access (basic search, autocomplete). Full API access requires a paid plan starting at several hundred dollars per month. The free tier is sufficient for small research projects, but serious data collection requires the Pro or Enterprise plan.
### Can I scrape Crunchbase without an API key?
Web scraping is technically possible, but Crunchbase deploys aggressive anti-bot protections (Cloudflare, CAPTCHAs), and its ToS prohibits scraping. The API is the recommended and more reliable data access method.
### How often does Crunchbase data update?
Crunchbase data is updated continuously by their research team and community contributors. Funding round data typically appears within 1-2 days of announcement. Company profile updates depend on community contributions and Crunchbase’s editorial team.
### What alternatives exist for company data?
PitchBook, CB Insights, and LinkedIn Sales Navigator are commercial alternatives. For free options, consider SEC EDGAR for public company data, AngelList for startup data, or OpenCorporates for basic company registration information.
### How do I track funding trends in a specific sector?
Use the Crunchbase search API with category filters and date ranges, store the results in a database over time, and analyze funding velocity, average round sizes, and investor patterns. The `search_funding_rounds` method above, with its `min_amount` and `funded_after` filters, is ideal for this use case.
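An illustrative sketch of the analysis step, assuming rounds have already been flattened into dicts with `announced_on` and `money_raised` keys (the raw API response nests these differently, so some preprocessing is assumed):

```python
from collections import defaultdict


def funding_velocity(rounds):
    """Group funding rounds by announcement month and sum the amounts raised."""
    by_month = defaultdict(lambda: {"count": 0, "total_usd": 0})
    for r in rounds:
        month = r["announced_on"][:7]  # "2026-01-15" -> "2026-01"
        by_month[month]["count"] += 1
        by_month[month]["total_usd"] += r.get("money_raised", 0)
    return dict(by_month)


# Hypothetical flattened rounds for demonstration
sample = [
    {"announced_on": "2026-01-15", "money_raised": 5_000_000},
    {"announced_on": "2026-01-28", "money_raised": 12_000_000},
    {"announced_on": "2026-02-03", "money_raised": 8_000_000},
]
print(funding_velocity(sample))
```

Running the same aggregation over successive monthly snapshots of a category filter is enough to chart funding velocity for a sector over time.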
## Conclusion
Crunchbase is best accessed through their official API, which provides structured data on companies, funding, and investors. Web scraping with Selenium serves as a supplement for data not available through the API. Use residential proxies for web scraping, respect API rate limits for sustained access, and store results in a database for longitudinal analysis of startup ecosystems and venture capital trends.
For more business data guides, visit our web scraping proxy guide and proxy provider comparisons.
## Related Reading
- How to Scrape AliExpress Product Data
- How to Scrape Amazon Product Reviews in 2026
- aiohttp + BeautifulSoup: Async Python Scraping
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)
- API vs Web Scraping: When You Need Proxies (and When You Don’t)
- ASEAN Data Protection Laws: A Web Scraping Compliance Matrix