Scraping DAO governance and Snapshot data in 2026

Scraping DAO governance data in 2026 is a cleaner problem than most crypto scraping work because the underlying systems were designed to be public from day one. Snapshot.org publishes a full GraphQL API. Tally exposes its own GraphQL endpoint. The on-chain Governor contracts emit standardized events that any node can index. Compared to scraping a cagey marketplace or a fingerprinting-heavy social platform, governance data is right there waiting to be collected. The challenge is reconciling three layers: off-chain (Snapshot), on-chain (Governor contracts), and the human layer (Discord, forum threads, Twitter discussions) where most of the actual decision-making happens.

This guide covers the practical mechanics of building a DAO governance pipeline in 2026: how to use Snapshot’s GraphQL API at scale, how to read on-chain proposals from OpenZeppelin Governor and Compound-style contracts, and the analytics patterns that turn raw vote data into voter behavior intelligence.

The two-track structure of DAO governance

Almost every active DAO in 2026 runs governance on one of two tracks: off-chain via Snapshot, where votes are gasless signatures stored on IPFS, or on-chain via a Governor contract that records every vote on the blockchain. Many DAOs do both: a Snapshot signal vote first to gauge community sentiment, then an on-chain vote that actually executes the change.

This bifurcation matters for scrapers because the data lives in completely different places. Snapshot’s data is in Snapshot’s GraphQL database. On-chain proposal data is in Ethereum, Optimism, Arbitrum, Base, or wherever the Governor contract is deployed. A complete picture of a DAO’s governance requires both feeds.

Snapshot GraphQL: the easiest scraping target in crypto

Snapshot.org runs a public GraphQL endpoint at https://hub.snapshot.org/graphql with no authentication required for read operations. They rate-limit at roughly 60 requests per minute per IP, which is generous given the small payload size. The schema is documented and stable.
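Staying under a per-minute limit like the one described above is easiest with a small client-side throttle. The following is a minimal sketch, not part of any Snapshot SDK: the Throttle class and its parameters are hypothetical names, and the 60-per-minute default simply mirrors the rough figure quoted above.

```python
import time


class Throttle:
    """Spaces calls at least 60 / per_minute seconds apart."""

    def __init__(self, per_minute: float = 60, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / per_minute
        self._clock = clock   # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None     # time of the previous call, None before the first

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay
```

Calling throttle.wait() immediately before each requests.post keeps a single-process scraper at or below the configured rate.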

The two queries you actually use most:

import requests

SNAPSHOT_URL = "https://hub.snapshot.org/graphql"

def get_proposals(space: str, first: int = 100, skip: int = 0):
    query = """
    query GetProposals($space: String!, $first: Int!, $skip: Int!) {
      proposals(
        first: $first
        skip: $skip
        where: {space_in: [$space]}
        orderBy: "created"
        orderDirection: desc
      ) {
        id
        title
        body
        choices
        start
        end
        snapshot
        state
        author
        space { id name }
        scores
        scores_total
        votes
      }
    }
    """
    resp = requests.post(
        SNAPSHOT_URL,
        json={"query": query, "variables": {"space": space, "first": first, "skip": skip}},
        timeout=15,
    )
    resp.raise_for_status()  # surface HTTP 429/5xx instead of failing on a missing "data" key
    return resp.json()["data"]["proposals"]


def get_votes(proposal_id: str, first: int = 1000, skip: int = 0):
    query = """
    query GetVotes($proposal: String!, $first: Int!, $skip: Int!) {
      votes(
        first: $first
        skip: $skip
        where: {proposal: $proposal}
        orderBy: "vp"
        orderDirection: desc
      ) {
        id
        voter
        vp
        choice
        created
        reason
      }
    }
    """
    resp = requests.post(
        SNAPSHOT_URL,
        json={"query": query, "variables": {"proposal": proposal_id, "first": first, "skip": skip}},
        timeout=15,
    )
    resp.raise_for_status()  # surface HTTP 429/5xx instead of failing on a missing "data" key
    return resp.json()["data"]["votes"]

The vp field in the votes query is voting power, denominated in whatever the space’s strategy specifies (token balance, NFT ownership, delegated tokens, etc.). For most DAOs this maps directly to token holdings at the snapshot block.

A complete archive of a single DAO’s governance history requires paginating through proposals, then for each proposal paginating through votes. For a large DAO like Aave or Uniswap, this is 200-500 proposals and 50,000-200,000 individual votes. Pulling the full archive takes about an hour respecting rate limits.
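The pagination loop described above can be sketched as a small generator. This is an illustrative helper, not part of Snapshot's tooling: paginate and fetch_page are hypothetical names, and the only assumption is a first/skip style endpoint that returns a short page once the results run out.

```python
from typing import Callable, Iterator


def paginate(fetch_page: Callable[[int, int], list], page_size: int = 1000) -> Iterator[dict]:
    """Yield every record from a first/skip style endpoint.

    fetch_page(first, skip) must return a list; a page shorter than
    page_size signals the end of the result set.
    """
    skip = 0
    while True:
        page = fetch_page(page_size, skip)
        yield from page
        if len(page) < page_size:
            break
        skip += page_size
```

With the functions above, all votes on one proposal would be paginate(lambda first, skip: get_votes(proposal_id, first, skip)). Very large proposals may run into a server-side cap on skip; check the current hub documentation, since a cursor on the created timestamp is the usual workaround in that case.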

Tally for on-chain governance

Tally is the dominant interface for on-chain Governor contracts. They aggregate proposal data from OpenZeppelin Governor, Compound Bravo, and several variants across multiple chains. Their GraphQL API at https://api.tally.xyz/query requires an API key (free tier available, paid for higher rate limits).

Tally is the right tool when you need on-chain DAO data without running your own indexer. They handle the contract decoding, proposal state machine, and cross-chain aggregation. For a research project on, say, Optimism’s governance evolution, Tally is faster than building from scratch.

def tally_proposals(governor_id: str, api_key: str, first: int = 100):
    query = """
    query Proposals($input: ProposalsInput!) {
      proposals(input: $input) {
        nodes {
          ... on Proposal {
            id
            metadata { title description }
            status
            createdAt
            voteStats { votesCount support percent }
          }
        }
      }
    }
    """
    resp = requests.post(
        "https://api.tally.xyz/query",
        json={
            "query": query,
            "variables": {"input": {"filters": {"governorId": governor_id}, "page": {"limit": first}}},
        },
        headers={"Api-Key": api_key},
        timeout=15,
    )
    return resp.json()

Reading Governor contracts directly

For the highest fidelity and no dependence on a third-party indexer, read the Governor contract directly via RPC. OpenZeppelin’s Governor is the most common implementation and emits these events:

ProposalCreated when a new proposal is submitted
VoteCast when a delegate casts a vote
ProposalCanceled, ProposalQueued, ProposalExecuted for lifecycle changes

from web3 import Web3

w3 = Web3(Web3.HTTPProvider("https://eth-mainnet.g.alchemy.com/v2/YOUR_KEY"))

GOVERNOR_ABI = [...]  # OpenZeppelin Governor ABI
governor = w3.eth.contract(address="0x408ED6354d4973f66138C91495F2f2FCbd8724C3", abi=GOVERNOR_ABI)

def index_proposals_from_block(start_block: int, end_block: int):
    event_filter = governor.events.ProposalCreated.create_filter(
        from_block=start_block,
        to_block=end_block,
    )
    for event in event_filter.get_all_entries():
        yield {
            "proposal_id": event["args"]["proposalId"],
            "proposer": event["args"]["proposer"],
            "description": event["args"]["description"],
            "block_number": event["blockNumber"],
            "tx_hash": event["transactionHash"].hex(),
        }

The catch with reading historical events at scale is RPC rate limits. Many providers cap the block range of a single eth_getLogs request, commonly at 10,000 blocks or fewer, so multi-year history has to be fetched in chunks. Free-tier RPC providers will throttle hard if you try to backfill years of governance events; use a paid tier or run your own archive node.
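One way to batch a long backfill is to walk the block range in windows and narrow the window whenever the provider rejects or times out. The sketch below assumes a caller-supplied get_logs(from_block, to_block) function with inclusive bounds; fetch_logs_chunked, max_span, and min_span are hypothetical names, and the defaults are placeholders to tune per provider.

```python
from typing import Callable, Iterator


def fetch_logs_chunked(
    get_logs: Callable[[int, int], list],
    start_block: int,
    end_block: int,
    max_span: int = 10_000,
    min_span: int = 100,
) -> Iterator[dict]:
    """Yield logs for [start_block, end_block] in provider-sized windows.

    On any exception the window is halved and retried; after a success
    it grows back toward max_span.
    """
    span = max_span
    cursor = start_block
    while cursor <= end_block:
        to_block = min(cursor + span - 1, end_block)
        try:
            logs = get_logs(cursor, to_block)
        except Exception:
            if span <= min_span:
                raise  # even the smallest window fails, so give up
            span = max(min_span, span // 2)
            continue
        yield from logs
        cursor = to_block + 1
        span = min(max_span, span * 2)
```

In a web3.py setup, get_logs would typically wrap w3.eth.get_logs with the Governor address and the fromBlock/toBlock pair for the window. Catching a bare Exception is deliberately coarse here; a production indexer would distinguish timeouts from permanent errors.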

Comparison of governance data sources

source	auth	coverage	data freshness	best for
Snapshot GraphQL	none	all Snapshot spaces (off-chain votes)	seconds	off-chain governance, signal votes
Tally GraphQL	API key (free + paid)	all OpenZeppelin/Compound governors	minutes	on-chain governance, multi-chain
Direct RPC + ABI	RPC key	any chain you have RPC for	block-by-block	high-fidelity, custom contracts
Etherscan API	API key (free)	all Ethereum contracts	minutes	quick contract introspection, tx decoding
The Graph subgraphs	varies	indexed contracts only	minutes	when a community subgraph exists
Boardroom	none for read	aggregated DAO data	minutes	quick dashboards, multi-DAO comparison

For most research, Snapshot + Tally covers 80% of the data you need. Direct RPC is for the remaining 20% where you need block-level freshness or custom contract logic that aggregators do not understand.

Snapshot space discovery

Snapshot has 100,000+ registered spaces, of which maybe 3,000 are actively used. Filtering active from inactive is a useful preprocessing step that saves storage and rate-limit budget. The discovery query:

def list_active_spaces(min_proposals: int = 10):
    query = """
    query Spaces($first: Int!) {
      spaces(first: $first, orderBy: "proposalsCount", orderDirection: desc) {
        id
        name
        about
        network
        symbol
        followersCount
        proposalsCount
      }
    }
    """
    resp = requests.post(SNAPSHOT_URL, json={"query": query, "variables": {"first": 1000}}, timeout=15)
    spaces = resp.json()["data"]["spaces"]
    return [s for s in spaces if s["proposalsCount"] >= min_proposals]

The top 1,000 spaces by proposal count cover roughly 95% of governance activity. For a research project, scraping just those is usually enough. The long tail is interesting only for category-specific studies (e.g., NFT-collection DAOs, regional DAOs, gaming guilds).

Voter analytics: turning raw votes into intelligence

The most valuable thing you can do with governance data is voter behavior analysis. Every wallet’s voting history tells a story.

Common analytics derived from raw vote data:

Voter participation rate: percentage of proposals a wallet voted on relative to its eligibility window
Concentration: the share of total voting power held by the top 10, 100, 1000 voters
Whale alignment: how often a specific large wallet votes with the majority versus against
Coalition detection: clusters of wallets that consistently vote the same way (suggesting coordination, delegation chains, or shared sybil control)
Proposal heat: total voters and total VP that participated, normalized by space size
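Two of the metrics above, concentration and participation rate, reduce to a few lines each. This is a minimal sketch under simple assumptions: voting power arrives as a plain address-to-vp mapping for one proposal, and the eligibility window is already expressed as a set of proposal IDs. Both function names are illustrative.

```python
def top_n_share(voting_power: dict[str, float], n: int = 10) -> float:
    """Share of total voting power held by the n largest voters (0.0 to 1.0)."""
    total = sum(voting_power.values())
    if total == 0:
        return 0.0
    top = sorted(voting_power.values(), reverse=True)[:n]
    return sum(top) / total


def participation_rate(voted: set[str], eligible: set[str]) -> float:
    """Fraction of eligible proposals a wallet actually voted on."""
    if not eligible:
        return 0.0
    return len(voted & eligible) / len(eligible)
```

Running top_n_share with n set to 10, 100, and 1000 on each proposal gives the three concentration figures named in the list.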

A simple coalition detector using pairwise agreement rate over shared proposals:

from collections import defaultdict
from itertools import combinations

def detect_coalitions(votes_by_proposal: dict, min_overlap: float = 0.8):
    """votes_by_proposal: {proposal_id: {voter_address: choice_index}}"""
    voter_history = defaultdict(dict)
    for prop_id, voter_choices in votes_by_proposal.items():
        for voter, choice in voter_choices.items():
            voter_history[voter][prop_id] = choice

    coalitions = []
    voters = list(voter_history.keys())
    for a, b in combinations(voters, 2):
        common = set(voter_history[a].keys()) & set(voter_history[b].keys())
        if len(common) < 5:
            continue
        agreement = sum(1 for p in common if voter_history[a][p] == voter_history[b][p]) / len(common)
        if agreement >= min_overlap:
            coalitions.append((a, b, agreement, len(common)))
    return coalitions

This is naive (real coalition detection uses spectral clustering or community detection algorithms on a vote agreement graph) but it works well enough for first-pass exploration on small DAOs.

Sybil and delegation graph reconstruction

A useful extension to coalition detection is reconstructing the delegation graph for token-weighted DAOs. In most Governor deployments the voting token (OpenZeppelin’s ERC20Votes or a Compound-style token) exposes a delegates(address) view that returns the address each token holder has delegated to, and the Governor exposes a getVotes(address, blockNumber) view that returns effective voting power at a historical block (the token-side equivalents are getPastVotes and getPriorVotes). Walking these views for the top 10,000 token holders gives you a delegation-edge dataset that, combined with vote records, exposes:

Bridges: wallets that aggregate delegated voting power from many small holders, then vote as one bloc
Whale puppets: delegate addresses that always vote identically with one large delegator, suggesting controlled signing
Idle delegations: delegations to addresses that never cast votes, effectively removing those tokens from the active voting supply
Sybil clusters: groups of small wallets delegated to the same entity that all received their initial token transfer from a common funding source

Build the delegation graph with a network library like networkx and run weakly-connected-component analysis to surface clusters. For DAOs with snapshot-based delegation (where delegation snapshot is per-proposal), reconstruct the graph at each proposal block to capture the actual configuration that voted, not the current one.
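For readers who want to see what the weakly-connected-component step does before pulling in networkx, here is a dependency-free sketch using union-find. The function name and the (delegator, delegate) edge format are assumptions for illustration; networkx.weakly_connected_components is the library equivalent once the edges are loaded into a DiGraph.

```python
from collections import defaultdict


def delegation_clusters(edges: list[tuple[str, str]]) -> list[set[str]]:
    """Group addresses into weakly connected components.

    edges: (delegator, delegate) pairs. Direction is ignored, which is
    what "weakly connected" means for a directed graph.
    """
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for delegator, delegate in edges:
        root_a, root_b = find(delegator), find(delegate)
        if root_a != root_b:
            parent[root_a] = root_b

    clusters = defaultdict(set)
    for node in parent:
        clusters[find(node)].add(node)
    # Largest clusters first
    return sorted(clusters.values(), key=len, reverse=True)
```

Each returned set is one cluster of addresses linked by delegation; to analyze a specific proposal, feed in the edges as they stood at that proposal's block.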

Forum and Discord context

Governance does not happen only in the votes. The actual decision happens in forum threads, Discord channels, and Twitter discussions weeks before a proposal hits Snapshot. A complete pipeline pulls Discourse forum data via the official Discourse API, Discord channel data via bot integrations (with server admin permission), and Twitter via paid Twitter API or scraping.

For Discourse forums (Aave, Uniswap, Compound, Optimism all use Discourse for governance discussion), the API is at https://forum.example.org/posts.json and is enabled by default for read access. Polling categories every 10 minutes is sufficient.

def get_discourse_topics(forum_url: str, category_id: int, page: int = 0):
    url = f"{forum_url}/c/{category_id}.json"
    resp = requests.get(url, params={"page": page}, timeout=15)
    return resp.json()

For Discord, scraping public messages without bot permissions violates Discord ToS. The compliant approach is asking the DAO’s admins for bot permissions in the relevant channels. Most governance-focused DAOs grant this for legitimate research projects.

Cross-DAO benchmarking

Once you have several DAOs in your pipeline, the most interesting analysis is cross-DAO comparison. Useful benchmarks include:

Median quorum hit rate: what fraction of proposals reach quorum across DAOs of similar size? A DAO with a 30% quorum rate while peers run at 70% has either dead delegations or unhealthy proposal throughput.
Proposal velocity: proposals per month relative to treasury size. A $500M treasury with one proposal per month is under-active; a $5M treasury with twenty proposals per month is firefighting.
Author concentration: what share of proposals come from the top 5 proposal authors? Above 60% suggests a small council effectively governs and the broader voter base is rubber-stamping.
Voter retention: of voters who participated 90 days ago, what fraction still vote today? A normal range is 25-55%; below 20% indicates community burnout.

Most of these metrics fit naturally into a single dashboard fed by the storage schema below. The benchmark numbers themselves come from your own corpus once you have indexed 30+ DAOs; there is no universally accepted source of truth for “healthy” DAO governance.
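Author concentration and voter retention from the list above are both simple set and counter operations. The sketch below assumes you have already pulled the proposal authors for one DAO and the voter address sets for two time windows; the function names are illustrative, and the thresholds quoted above are left to the caller.

```python
from collections import Counter


def author_concentration(authors: list[str], top_n: int = 5) -> float:
    """Share of proposals written by the top_n most prolific authors."""
    if not authors:
        return 0.0
    counts = Counter(authors)
    top = sum(count for _, count in counts.most_common(top_n))
    return top / len(authors)


def voter_retention(earlier_voters: set[str], recent_voters: set[str]) -> float:
    """Fraction of the earlier cohort that also appears in the recent one."""
    if not earlier_voters:
        return 0.0
    return len(earlier_voters & recent_voters) / len(earlier_voters)
```

Pass one author entry per proposal (duplicates included) so that the counter reflects how many proposals each address wrote.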

Storage schema

CREATE TABLE dao_spaces (
    space_id TEXT PRIMARY KEY,
    name TEXT,
    network TEXT,
    members_count INTEGER,
    proposals_count INTEGER,
    treasury_usd NUMERIC
);

CREATE TABLE proposals (
    proposal_id TEXT PRIMARY KEY,
    space_id TEXT REFERENCES dao_spaces(space_id),
    source TEXT NOT NULL,  -- 'snapshot' | 'governor' | 'tally'
    title TEXT,
    body TEXT,
    author TEXT,
    state TEXT,
    created_at TIMESTAMPTZ,
    start_at TIMESTAMPTZ,
    end_at TIMESTAMPTZ,
    snapshot_block BIGINT,
    scores_total NUMERIC,
    votes_count INTEGER
);

CREATE TABLE votes (
    proposal_id TEXT REFERENCES proposals(proposal_id),
    voter_address TEXT NOT NULL,
    choice INTEGER NOT NULL,
    voting_power NUMERIC NOT NULL,
    created_at TIMESTAMPTZ NOT NULL,
    reason TEXT,
    PRIMARY KEY (proposal_id, voter_address)
);

CREATE INDEX ON votes (voter_address);
CREATE INDEX ON proposals (space_id, created_at DESC);

This schema handles both Snapshot and on-chain votes uniformly via the source field. For DAOs that run both, you have parallel proposal records and can analyze the relationship between off-chain signals and on-chain execution.
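As a sketch of how a Snapshot record could land in the proposals table, the mapper below converts the Unix-second timestamps into timezone-aware datetimes and tags the row with its source. The function name is hypothetical, and it assumes the record has the fields requested in get_proposals; created_at stays empty unless you also add created to that query's field list.

```python
from datetime import datetime, timezone


def snapshot_proposal_to_row(p: dict) -> dict:
    """Map one Snapshot proposal record onto the proposals table columns."""

    def ts(seconds: int) -> datetime:
        # Snapshot returns Unix seconds; TIMESTAMPTZ wants an aware datetime.
        return datetime.fromtimestamp(seconds, tz=timezone.utc)

    return {
        "proposal_id": p["id"],
        "space_id": p["space"]["id"],
        "source": "snapshot",
        "title": p["title"],
        "body": p["body"],
        "author": p["author"],
        "state": p["state"],
        "created_at": ts(p["created"]) if "created" in p else None,
        "start_at": ts(p["start"]),
        "end_at": ts(p["end"]),
        "snapshot_block": int(p["snapshot"]),  # int() accepts a string or a number
        "scores_total": p["scores_total"],
        "votes_count": p["votes"],
    }
```

The resulting dict can be passed to a parameterized INSERT with any Postgres driver. On-chain proposals would need their own mapper that sets source to 'governor' or 'tally'.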

Cost worked example

A practical research deployment covering ~150 active DAOs across Snapshot and 60 on-chain Governors costs:

Snapshot GraphQL ($0)
Tally API paid tier ($79/mo for 150 req/min)
Alchemy Growth tier ($49/mo) for on-chain events
1 small VPS for indexer ($25/mo)
Postgres on a hosted instance ($30/mo)
10 IPs of residential proxy for forum scraping ($20/mo)

Total: about $200/month for a complete cross-DAO governance dataset that competing services charge $1,500-3,000/month for. The break-even point is reached within the first week of operation.
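The arithmetic behind that total, using only the line items listed above (the dictionary keys are just labels chosen for this sketch):

```python
monthly_costs_usd = {
    "snapshot_graphql": 0,
    "tally_paid_tier": 79,
    "alchemy_growth": 49,
    "indexer_vps": 25,
    "hosted_postgres": 30,
    "residential_proxies": 20,
}

total = sum(monthly_costs_usd.values())
print(f"total: ${total}/month")

# Ratio against the quoted $1,500-3,000/month range for commercial services
low, high = 1500 / total, 3000 / total
print(f"{low:.1f}x to {high:.1f}x the self-hosted cost")
```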

Proxy considerations

Snapshot’s rate limit is permissive enough that proxies are usually unnecessary for read-only research. If you are pulling all 100,000+ Snapshot spaces continuously, distribute across 5-10 IPs with residential proxies.

For RPC calls, your bottleneck is your RPC provider, not IP-based rate limits. Use a paid tier or multiple free-tier keys. We cover provider selection in our best residential proxy providers 2026 review.

For Discord scraping (with bot permissions), the bot itself rate-limits per server, not per IP. Proxies do not help.

External authoritative reference: the Snapshot.js documentation covers the GraphQL schema, signing flow, and strategy types.

Common gotchas

Snapshot’s vp field is computed from balances at the proposal’s snapshot block, not at vote time. A wallet that votes after selling its tokens still gets credit for its snapshot-block balance. Do not double-discount.
The state field on Snapshot proposals returns pending, active, closed. There is no passed or failed; that is your interpretation based on scores against quorum and threshold defined in the space settings.
Governor proposal IDs differ between Compound Bravo (uint256 sequential) and OpenZeppelin (keccak hash of proposal params). Code that assumes one breaks on the other.
Multi-chain DAOs (Optimism, Arbitrum) often reuse Snapshot space slugs but have different on-chain Governors per chain. Make sure you are matching the right pair.
Tally’s voteStats returns percentages calculated against quorum, not against votes cast. A proposal at 60% support may show 30% in percent if quorum is half the eligible supply.
Reading historical events with eth_getLogs on free-tier RPCs frequently times out on busy contracts. Chunk by 1,000-block windows for old contracts and 10,000 for newer ones, with retry-with-narrower-window on timeout.
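Because Snapshot's state field stops at closed, the pass/fail label mentioned in the gotchas above has to be derived. The following is a minimal sketch under stated assumptions: a basic or single-choice vote, a simple plurality rule, and a quorum value you have read from the space settings yourself. The function name, the yes_index parameter, and the three labels are all hypothetical; spaces with other voting types or thresholds need their own rule.

```python
def proposal_outcome(scores: list[float], quorum: float, yes_index: int = 0) -> str:
    """Classify a closed proposal as 'passed', 'failed', or 'no_quorum'.

    Assumes choices[yes_index] is the approving option and that the
    option with the most voting power wins.
    """
    total = sum(scores)
    if total < quorum:
        return "no_quorum"
    yes = scores[yes_index]
    best_other = max((s for i, s in enumerate(scores) if i != yes_index), default=0.0)
    return "passed" if yes > best_other else "failed"
```

Storing this derived label in a separate column, rather than overwriting state, keeps the raw Snapshot value available for re-interpretation later.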

Common analytical questions

Once you have the raw data, the questions worth answering include:

Vote concentration over time. Is a DAO’s governance becoming more or less concentrated? Plot top-10 vote share per proposal over time.

Delegate effectiveness. For DAOs with delegation (most modern Governor implementations), how often do delegates show up to vote on the tokens delegated to them? An 80% delegate participation rate is healthy; a 20% rate means delegation is dead weight.

Proposal pass rate. What fraction of proposals reach quorum and pass? A pass rate of 95% suggests rubber-stamping; a pass rate of 30% suggests genuine contention.

Treasury follow-through. For DAOs that vote on treasury allocations, do the funds actually move on-chain after the vote passes? Reconciliation against the treasury wallet’s transactions reveals whether governance is real or theater.

We dig into related on-chain forensics in our guide on scraping crypto exchange order books.

FAQ

Q: do I need a node to scrape on-chain governance?
No. Hosted RPC providers (Alchemy, Infura, QuickNode) cover the read use case perfectly. You only need your own node if you are indexing every block in real time at scale or need access to internal traces.

Q: can I scrape historical Snapshot proposals from defunct DAOs?
Yes. Snapshot persists historical data even for spaces that are inactive. You can pull the full archive of any space that has not been deleted by the admins.

Q: how do I correlate Snapshot space to on-chain Governor contract?
Most large DAOs publish the mapping in their docs. There is no universal index. Tally maintains a manually curated list. For obscure DAOs you may need to ask in the project’s Discord.

Q: what about Solana DAOs?
Solana DAOs use Realms (governance.so) instead of Snapshot/Governor. Realms exposes data via Solana RPC and the SPL Governance program. Different stack, same patterns.

Q: is voter address pseudonymity a concern for analysis?
For pure on-chain analysis, addresses are public and attribution is fair game. For combining vote data with off-chain identity (Twitter handles, Discord usernames), be careful about privacy expectations. See our GDPR compliance for web scraping guide for the data minimization patterns.

Q: how do I track DAO treasury movements alongside governance votes?
Index Transfer events from the treasury wallet on the same chain as the Governor. Cross-reference transactions occurring within 7 days after a ProposalExecuted event against the proposal’s calldata to confirm execution matched intent. Surprisingly often, a proposal authorizes a transfer that never happens because the multisig signers fail to coordinate.
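The cross-referencing step can be sketched as a window join between executed proposals and observed transfers. Everything here is illustrative: the record shapes, the unmatched_executions name, and the assumption that you have already decoded each proposal's calldata into an expected recipient and amount (in integer base units, to avoid float comparison issues).

```python
from datetime import datetime, timedelta


def unmatched_executions(executions: list[dict], transfers: list[dict], window_days: int = 7) -> list[str]:
    """Return proposal IDs whose expected transfer never showed up.

    executions: {"proposal_id", "executed_at", "recipient", "amount"}
    transfers:  {"timestamp", "to", "amount"}
    A transfer matches when recipient and amount agree and it lands
    within window_days after execution.
    """
    window = timedelta(days=window_days)
    missing = []
    for ex in executions:
        matched = any(
            t["to"].lower() == ex["recipient"].lower()
            and t["amount"] == ex["amount"]
            and ex["executed_at"] <= t["timestamp"] <= ex["executed_at"] + window
            for t in transfers
        )
        if not matched:
            missing.append(ex["proposal_id"])
    return missing
```

The proposals this returns are candidates for manual review, not proof of failure: funds may have moved in tranches, through a different token, or outside the window.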

Q: what is the right cadence for polling Snapshot?
For active spaces, every 5 minutes catches new proposals and vote updates. For an archive, a one-time backfill plus daily incremental sync is enough. Snapshot itself does not push updates, so polling is the only option without running a custom relayer.

Q: do I need to verify Snapshot signatures?
Snapshot’s hub already verifies signatures before storing votes, so the data you pull is pre-validated. Re-verifying is overkill for analysis but useful for audit-grade research where you want to prove independently that every vote is signed by the wallet it claims.

Closing

DAO governance scraping in 2026 is one of the cleanest data engineering problems in crypto. Snapshot and Tally do most of the heavy lifting; on-chain data fills the gaps; forum and Discord context completes the picture. The hard part is not collection but interpretation: turning vote records into intelligence about who actually controls a DAO, how decisions get made, and whether governance is functioning or theater. For the broader crypto data infrastructure picture see our crypto-defi category hub.