How to Scrape Crunchbase Company Data in 2026

Crunchbase is the premier business data platform for tracking startups, venture capital, and private companies. With profiles for over 4 million companies, including funding rounds, acquisitions, leadership, and key metrics, it is essential for investors, sales teams, market researchers, and startup ecosystem analysts.

This guide covers how to extract Crunchbase data using their API and web scraping techniques with Python.

What Data Can You Extract?

Crunchbase data includes:

  • Company profiles (description, founded date, headquarters, employee count)
  • Funding rounds (amount, date, investors, series type)
  • Investor profiles (portfolio, fund size, investment focus)
  • Acquisitions (acquirer, price, date)
  • Key people (founders, executives, board members)
  • Company categories and industry tags
  • Financial metrics (revenue ranges, valuation estimates)
  • News and events

Example JSON Output

{
  "company": {
    "name": "Stripe",
    "short_description": "Financial infrastructure for the internet",
    "headquarters": "San Francisco, CA",
    "founded": "2010",
    "employee_count": "5001-10000",
    "categories": ["Fintech", "Payments", "SaaS"],
    "website": "https://stripe.com"
  },
  "funding": {
    "total_raised": "$8.7B",
    "last_round": {
      "type": "Series I",
      "amount": "$6.5B",
      "date": "2023-03-15",
      "valuation": "$50B"
    }
  }
}
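Funding amounts in this output are abbreviated strings like "$8.7B". A small helper (a sketch, not part of any Crunchbase library) can normalize them to numeric USD values for downstream analysis:

```python
def parse_money(value):
    """Convert abbreviated money strings like '$8.7B' or '$1,200' to USD floats."""
    multipliers = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}
    s = value.strip().lstrip("$").replace(",", "")
    suffix = s[-1].upper() if s else ""
    if suffix in multipliers:
        return float(s[:-1]) * multipliers[suffix]
    return float(s)
```

Run scraped strings through this once at ingestion time so stored values sort and aggregate correctly.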

Prerequisites

pip install requests beautifulsoup4 lxml fake-useragent

Method 1: Using Crunchbase API

Crunchbase offers both free and paid API access:

import requests
import json
import time

class CrunchbaseScraper:
    def __init__(self, api_key, proxy_url=None):
        self.api_key = api_key
        self.base_url = "https://api.crunchbase.com/api/v4"
        self.proxy_url = proxy_url
        self.session = requests.Session()

    def _get_headers(self):
        return {
            "X-cb-user-key": self.api_key,
            "Content-Type": "application/json",
        }

    def _get_proxies(self):
        if self.proxy_url:
            return {"http": self.proxy_url, "https": self.proxy_url}
        return None

    def search_organizations(self, query, limit=25):
        """Search for companies/organizations."""
        url = f"{self.base_url}/autocompletes"
        params = {
            "query": query,
            "collection_ids": "organizations",
            "limit": limit,
        }

        response = self.session.get(
            url, headers=self._get_headers(),
            params=params, proxies=self._get_proxies(), timeout=30
        )
        response.raise_for_status()
        return response.json().get("entities", [])

    def get_organization(self, permalink):
        """Get detailed organization data."""
        url = f"{self.base_url}/entities/organizations/{permalink}"
        params = {
            "field_ids": "short_description,founded_on,location_identifiers,num_employees_enum,categories,funding_total,last_funding_type,investor_count",
        }

        response = self.session.get(
            url, headers=self._get_headers(),
            params=params, proxies=self._get_proxies(), timeout=30
        )
        response.raise_for_status()
        return response.json()

    def get_funding_rounds(self, org_permalink, limit=10):
        """Get funding rounds for a company."""
        url = f"{self.base_url}/entities/organizations/{org_permalink}/funding_rounds"
        params = {"limit": limit}

        response = self.session.get(
            url, headers=self._get_headers(),
            params=params, proxies=self._get_proxies(), timeout=30
        )
        response.raise_for_status()
        return response.json()

    def search_funding_rounds(self, min_amount=None, funded_after=None, limit=50):
        """Search for funding rounds with filters."""
        url = f"{self.base_url}/searches/funding_rounds"
        payload = {
            "field_ids": ["identifier", "funded_organization_identifier", "money_raised", "announced_on", "investment_type"],
            "limit": limit,
            "order": [{"field_id": "announced_on", "sort": "desc"}],
        }

        query = []
        if min_amount:
            query.append({
                "type": "predicate",
                "field_id": "money_raised",
                "operator_id": "gte",
                "values": [{"value": min_amount, "currency": "usd"}]
            })
        if funded_after:
            query.append({
                "type": "predicate",
                "field_id": "announced_on",
                "operator_id": "gte",
                "values": [funded_after]
            })

        if query:
            payload["query"] = query

        response = self.session.post(
            url, headers=self._get_headers(),
            json=payload, proxies=self._get_proxies(), timeout=30
        )
        response.raise_for_status()
        return response.json()


# Usage
scraper = CrunchbaseScraper(api_key="your_api_key")

# Search companies
results = scraper.search_organizations("stripe")
print(json.dumps(results[:3], indent=2))

# Get company details
org = scraper.get_organization("stripe")
print(json.dumps(org, indent=2))

Method 2: Web Scraping Crunchbase

For data not available through the API:

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
import json

class CrunchbaseWebScraper:
    def __init__(self, proxy_url=None):
        self.session = requests.Session()
        self.ua = UserAgent()
        self.proxy_url = proxy_url

    def _get_headers(self):
        return {
            "User-Agent": self.ua.random,
            "Accept": "text/html,application/xhtml+xml",
            "Accept-Language": "en-US,en;q=0.9",
        }

    def _get_proxies(self):
        if self.proxy_url:
            return {"http": self.proxy_url, "https": self.proxy_url}
        return None

    def scrape_company(self, company_slug):
        """Scrape company data from the Crunchbase website."""
        url = f"https://www.crunchbase.com/organization/{company_slug}"

        try:
            response = self.session.get(
                url, headers=self._get_headers(),
                proxies=self._get_proxies(), timeout=30
            )
            response.raise_for_status()
            soup = BeautifulSoup(response.text, "lxml")

            # Extract JSON-LD
            scripts = soup.find_all("script", type="application/ld+json")
            for script in scripts:
                try:
                    data = json.loads(script.string)
                    if data.get("@type") == "Organization":
                        return data
                except json.JSONDecodeError:
                    continue

            return None

        except requests.RequestException as e:
            print(f"Error: {e}")
            return None
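The JSON-LD extraction step above can be unit-tested without a live request. Here is a standard-library-only sketch of the same logic, using html.parser instead of BeautifulSoup, with a made-up HTML sample standing in for a real page:

```python
import json
from html.parser import HTMLParser

class JsonLdCollector(HTMLParser):
    """Collects the text content of <script type="application/ld+json"> tags."""
    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks.append(data)

def extract_organization(html):
    """Return the first JSON-LD block whose @type is Organization, or None."""
    collector = JsonLdCollector()
    collector.feed(html)
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and data.get("@type") == "Organization":
            return data
    return None

# Hypothetical sample page for testing the parser offline
SAMPLE_HTML = '''<html><head>
<script type="application/ld+json">{"@type": "Organization", "name": "Stripe"}</script>
</head><body></body></html>'''
```

Keeping the parsing logic separate from the HTTP fetch makes it easy to regression-test against saved HTML fixtures when Crunchbase changes its markup.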

Proxy Recommendations

Proxy Type    Success Rate   Best For
Residential   70-80%         Web scraping
Datacenter    60-70%         API access
ISP           65-75%         Consistent sessions

For API access, any proxy type works. For web scraping, residential proxies are recommended.
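Whichever proxy type you choose, rotating through a pool spreads request volume across IPs. A minimal round-robin helper (a sketch; the proxy URLs are placeholders) that yields requests-style proxy dicts:

```python
import itertools

def proxy_pool(proxy_urls):
    """Cycle through proxy URLs forever, yielding requests-style proxy dicts."""
    for url in itertools.cycle(proxy_urls):
        yield {"http": url, "https": url}
```

Pass `next(pool)` as the `proxies` argument of each request to alternate IPs between calls.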

Legal Considerations

  1. API Terms: The Crunchbase API terms restrict data resale and the creation of competing products.
  2. Terms of Service: Web scraping is prohibited by Crunchbase’s ToS.
  3. Data Licensing: Crunchbase data is commercially licensed. Extensive use may require a paid data license.
  4. Commercial Use: Scraped data is subject to significant restrictions on commercial use.

See our web scraping compliance guide for details.

Handling Crunchbase Anti-Bot Protections

1. JavaScript Rendering

Crunchbase is a React-based single-page application. Simple HTTP requests return minimal data. For web scraping (as opposed to API access), use Selenium or Playwright to render the full page:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

class CrunchbaseSeleniumScraper:
    def __init__(self, proxy=None):
        options = Options()
        options.add_argument("--headless=new")
        options.add_argument("--no-sandbox")
        if proxy:
            options.add_argument(f"--proxy-server={proxy}")
        self.driver = webdriver.Chrome(options=options)

    def scrape_company_page(self, slug):
        url = f"https://www.crunchbase.com/organization/{slug}"
        self.driver.get(url)
        time.sleep(5)  # give the SPA time to render before the explicit wait below

        try:
            WebDriverWait(self.driver, 15).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, "h1, [class*='profile']"))
            )
        except Exception:
            return None

        data = self.driver.execute_script('''
            const result = {};
            const title = document.querySelector("h1");
            result.name = title ? title.innerText.trim() : null;

            const desc = document.querySelector("[class*='description']");
            result.description = desc ? desc.innerText.trim() : null;

            const fields = document.querySelectorAll("[class*='field-type']");
            fields.forEach(field => {
                const label = field.querySelector("[class*='label']");
                const value = field.querySelector("[class*='value']");
                if (label && value) {
                    result[label.innerText.trim().toLowerCase().replace(/\\s+/g, "_")] = value.innerText.trim();
                }
            });

            return result;
        ''')

        return data

    def close(self):
        self.driver.quit()

2. Rate Limiting

The Crunchbase API enforces strict rate limits depending on your plan tier. Implement exponential backoff:

import requests
import time

def api_request_with_backoff(func, max_retries=3, base_delay=2):
    for attempt in range(max_retries):
        try:
            return func()
        except requests.exceptions.HTTPError as e:
            if e.response.status_code == 429:
                delay = base_delay * (2 ** attempt)
                print(f"Rate limited. Waiting {delay}s...")
                time.sleep(delay)
            else:
                raise
    return None
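The helper above waits base_delay * 2**attempt seconds after each 429. If you want to inspect or tune that schedule before committing to it, it can be computed directly (a small sketch, separate from the snippet above):

```python
def backoff_schedule(max_retries=3, base_delay=2):
    """Exponential backoff delays: base_delay * 2**attempt for each retry attempt."""
    return [base_delay * (2 ** attempt) for attempt in range(max_retries)]
```

With the defaults, a request that keeps hitting rate limits waits 2s, 4s, then 8s before giving up.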

3. Session Management

Crunchbase tracks browsing sessions closely. Rotate user agents and clear cookies between scraping batches. For API access, use a single persistent session to avoid triggering anomaly detection.

Data Storage and Export

For investment research and sales intelligence pipelines, store Crunchbase data in a structured database:

import sqlite3
import json

class CrunchbaseDataStore:
    def __init__(self, db_path="crunchbase_data.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute('''CREATE TABLE IF NOT EXISTS companies
            (permalink TEXT PRIMARY KEY, name TEXT, description TEXT,
             headquarters TEXT, founded TEXT, employee_count TEXT,
             categories TEXT, total_funding TEXT, last_funding_type TEXT,
             data JSON, scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP)''')
        self.conn.execute('''CREATE TABLE IF NOT EXISTS funding_rounds
            (round_id TEXT PRIMARY KEY, company_permalink TEXT,
             round_type TEXT, amount TEXT, announced_on TEXT,
             investors TEXT, data JSON)''')

    def store_company(self, company):
        self.conn.execute(
            """INSERT OR REPLACE INTO companies
            (permalink, name, description, headquarters, founded, employee_count,
             categories, total_funding, last_funding_type, data)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
            (company.get("permalink"), company.get("name"), company.get("short_description"),
             company.get("headquarters"), company.get("founded_on"),
             company.get("num_employees_enum"), json.dumps(company.get("categories", [])),
             company.get("funding_total"), company.get("last_funding_type"),
             json.dumps(company))
        )
        self.conn.commit()

    def export_csv(self, output_path):
        import csv
        cursor = self.conn.execute("SELECT * FROM companies ORDER BY name")
        rows = cursor.fetchall()
        columns = [desc[0] for desc in cursor.description]
        with open(output_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(columns)
            writer.writerows(rows)
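Note that store_company expects flat keys, while the v4 API nests most fields under a properties object. A flattening helper along these lines can bridge the two; the exact response shape used here is an assumption, so verify it against real payloads:

```python
def flatten_org(entity):
    """Flatten an assumed v4 entity payload into the flat dict store_company expects.

    The nesting (properties -> identifier, founded_on.value, etc.) is an
    assumption about the API response shape, not a documented contract.
    """
    props = entity.get("properties", {})
    identifier = props.get("identifier", {})
    return {
        "permalink": identifier.get("permalink"),
        "name": identifier.get("value"),
        "short_description": props.get("short_description"),
        "founded_on": (props.get("founded_on") or {}).get("value"),
        "num_employees_enum": props.get("num_employees_enum"),
        "categories": [c.get("value") for c in props.get("categories", [])],
    }
```

Call flatten_org on each API response before handing it to the data store.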

Frequently Asked Questions

Is the Crunchbase API free?

Crunchbase offers a free tier with limited access (basic search, autocomplete). Full API access requires a paid plan starting at several hundred dollars per month. The free tier is sufficient for small research projects, but serious data collection requires the Pro or Enterprise plan.

Can I scrape Crunchbase without an API key?

Web scraping is technically possible, but Crunchbase deploys aggressive anti-bot protections, including Cloudflare and CAPTCHAs, and its ToS prohibits scraping. The API is the recommended and more reliable way to access the data.

How often does Crunchbase data update?

Crunchbase data is updated continuously by their research team and community contributors. Funding round data typically appears within 1-2 days of announcement. Company profile updates depend on community contributions and Crunchbase’s editorial team.

What alternatives exist for company data?

PitchBook, CB Insights, and LinkedIn Sales Navigator are commercial alternatives. For free options, consider SEC EDGAR for public company data, AngelList for startup data, or OpenCorporates for basic company registration information.

How do I track funding trends in a specific sector?

Use the Crunchbase search API with category filters and date ranges. Store results in a database over time and analyze funding velocity, average round sizes, and investor patterns. The search_funding_rounds method with min_amount and funded_after filters is ideal for this use case.
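Once rounds are stored, the aggregation itself is straightforward. A sketch that groups rounds by month and computes count and average size (field names follow the search example above, but treat them as assumptions about your stored schema):

```python
from collections import defaultdict
from statistics import mean

def funding_by_month(rounds):
    """Group funding rounds by YYYY-MM and compute count and average size in USD.

    Each round is a dict with 'announced_on' ('YYYY-MM-DD') and
    'money_raised' (a number, in USD).
    """
    buckets = defaultdict(list)
    for r in rounds:
        buckets[r["announced_on"][:7]].append(r["money_raised"])
    return {month: {"count": len(amounts), "avg_usd": mean(amounts)}
            for month, amounts in buckets.items()}
```

Comparing these monthly summaries across quarters gives a simple view of funding velocity in a sector.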

Conclusion

Crunchbase is best accessed through their official API, which provides structured data on companies, funding, and investors. Web scraping with Selenium serves as a supplement for data not available through the API. Use residential proxies for web scraping, respect API rate limits for sustained access, and store results in a database for longitudinal analysis of startup ecosystems and venture capital trends.

For more business data guides, visit our web scraping proxy guide and proxy provider comparisons.

