How to Scrape Bluesky AT Protocol Posts in 2026 (Official + Workaround)

Bluesky’s AT Protocol is one of the few social platforms in 2026 that actively wants you to scrape it — the public firehose is open, the API is documented, and most endpoints don’t require authentication for read access. That said, “open” doesn’t mean “easy.” The firehose runs at several thousand events per second, the data model is unfamiliar if you’re coming from REST-style APIs, and the workarounds for bulk historical collection have their own sharp edges. Here’s a direct path through both the official route and the fallback options.

Understanding the AT Protocol Data Model

Before writing a single line of code, spend 20 minutes on the data model — it’ll save hours of confusion later.

AT Protocol uses three core primitives:

  • DID (Decentralized Identifier): a persistent identifier like did:plc:abc123xyz that survives handle (username) changes
  • NSID (Namespaced Identifier): a type identifier like app.bsky.feed.post that names a record schema
  • CID (Content Identifier): a hash-based pointer to a specific version of a record

Every Bluesky post is a record under the app.bsky.feed.post NSID, stored in a user’s Personal Data Server (PDS). The PDS for most users is bsky.social, but federated users can self-host. If you’re building scrapers for decentralized social data, the federation model is similar to what you’ll encounter with Mastodon’s ActivityPub architecture — multiple data sources, no single authoritative endpoint.
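These primitives combine into at:// URIs, which address individual records: at://&lt;DID or handle&gt;/&lt;collection NSID&gt;/&lt;record key&gt;. A minimal parser makes the structure concrete (the DID and record key below are illustrative, not real identifiers):

```python
def parse_at_uri(uri: str) -> dict:
    """Split an at:// URI into its authority (DID or handle),
    collection NSID, and record key (rkey)."""
    if not uri.startswith("at://"):
        raise ValueError("not an at:// URI")
    authority, collection, rkey = uri[len("at://"):].split("/", 2)
    return {"authority": authority, "collection": collection, "rkey": rkey}

# A post record URI: authority is the DID, collection is the post NSID
parts = parse_at_uri("at://did:plc:abc123xyz/app.bsky.feed.post/3k2a")
```

Every post URI you encounter in API responses decomposes this way, which is how you know which PDS to query and which record to fetch.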

The Official Route: AppView API and the Firehose

Bluesky exposes two official paths for data collection.

AppView REST API

The AppView API at public.api.bsky.app is the friendliest entry point. Most read endpoints are unauthenticated and return clean JSON. The rate limits are generous — around 3,000 requests per 5 minutes per IP for unauthenticated calls — and the response schemas are stable.

import httpx

BASE = "https://public.api.bsky.app/xrpc"

def get_author_feed(handle: str, limit: int = 50) -> list[dict]:
    """Fetch a user's most recent posts from the public AppView."""
    r = httpx.get(
        f"{BASE}/app.bsky.feed.getAuthorFeed",
        params={"actor": handle, "limit": limit},
        timeout=10,
    )
    r.raise_for_status()
    return r.json().get("feed", [])

Pagination uses a cursor field returned in each response. Pass it back as ?cursor= to walk backwards through a user’s post history. The API caps single-request limits at 100 records for most endpoints.
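The cursor loop is the same for every paginated endpoint, so it can be factored into a generic helper. A sketch, assuming a fetch callable that returns (items, next_cursor) — the helper name and shape are my own, not from the Bluesky SDK:

```python
from typing import Callable, Iterator, Optional

def paginate(
    fetch: Callable[[Optional[str]], tuple[list[dict], Optional[str]]],
    max_pages: int = 100,
) -> Iterator[dict]:
    """Walk a cursor-paginated endpoint until the cursor disappears
    or max_pages is reached (a safety cap against infinite loops)."""
    cursor = None
    for _ in range(max_pages):
        items, cursor = fetch(cursor)
        yield from items
        if not cursor:
            break
```

To use it with getAuthorFeed, the fetch callable would pass the cursor through as the `cursor` query parameter and return `(data["feed"], data.get("cursor"))`.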

The Relay Firehose

For real-time collection or large-scale crawls, the firehose at wss://bsky.network/xrpc/com.atproto.sync.subscribeRepos is the right tool. It streams every repo operation across the network as CAR (Content Addressable aRchive) encoded websocket frames. Decode the DAG-CBOR blocks, filter for app.bsky.feed.post creates, and you have a near-complete view of public posts.

The practical catch: at peak hours the firehose pushes 3,000 to 5,000 events per second. A naive Python consumer falls behind within minutes. Use the atproto SDK's built-in firehose client with a multi-process consumer pool, or route the stream through a Redis queue and process asynchronously.
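The filtering step, stripped of the transport layer, looks like this. This is a stdlib-only sketch: commit operations are represented as plain dicts with `action` and `path` fields (mirroring the repoOp shape in subscribeRepos messages), while the real stream requires a websocket client plus CAR/DAG-CBOR decoding, e.g. via the atproto SDK:

```python
def post_creates(ops: list[dict]) -> list[dict]:
    """Keep only 'create' operations whose record path belongs to the
    app.bsky.feed.post collection (paths look like '<nsid>/<rkey>')."""
    return [
        op for op in ops
        if op.get("action") == "create"
        and op.get("path", "").startswith("app.bsky.feed.post/")
    ]
```

Doing this filter as early as possible matters: likes, follows, and reposts dominate the event volume, so discarding them before any further decoding is the cheapest way to keep a consumer from falling behind.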

Workarounds for Historical and Bulk Collection

The firehose is real-time only — it has no replay window beyond a few hours. For historical data, you have several options.

| Method | Coverage | Auth Required | Rate Limit | Best For |
|---|---|---|---|---|
| getAuthorFeed pagination | Per-user posts | No | 3k req/5min | Profile-level research |
| searchPosts (AppView) | Full-text indexed | No | 300 req/5min | Keyword monitoring |
| PDS listRecords | All records by DID | No | Varies by PDS | Full user archive |
| Relay getRepo (CAR sync) | Full repo snapshots | No | Low, use sparingly | Historical audit |
| Third-party index (Smoke Signal, Skyfeed) | Cross-account search | API key | Varies | Volume keyword pulls |

For keyword-based collection at scale, app.bsky.feed.searchPosts is rate-limited tighter than getAuthorFeed. If you need volume, the Smoke Signal and Skyfeed indexers offer their own search APIs with higher throughput — check their current terms before hitting them in bulk.

This tradeoff between official limits and third-party indexers mirrors what you hit scraping other platforms. The approach for Threads public post collection follows the same pattern: official API first, unofficial indexer as overflow.

Handling DIDs, PDS Routing, and Federation

Federated users don’t store their data on bsky.social. To correctly resolve any DID to its PDS, call the DID resolution endpoint:

GET https://plc.directory/<did>

This returns a DID document containing the #atproto_pds service endpoint. Your scraper needs to route com.atproto.repo.* calls to that endpoint, not to bsky.social. A naive scraper that hardcodes the host will silently miss federated accounts — an important detail if your research covers non-Bluesky AT Protocol deployments.

  1. Resolve the handle to a DID via com.atproto.identity.resolveHandle
  2. Fetch the DID document from plc.directory or the identity’s own DID doc
  3. Extract the PDS service endpoint
  4. Call com.atproto.repo.listRecords on that PDS with the resolved DID

This four-step chain is the correct way to scrape any AT Protocol account regardless of which PDS hosts it. Skip steps 2 and 3 only if you’re 100% certain you’re targeting bsky.social-hosted accounts.

Proxy and Infrastructure Considerations

Bluesky’s rate limits are IP-based for unauthenticated calls. If you’re running parallel crawlers across thousands of DIDs, you will hit the ceiling on a single residential or datacenter IP. The pillar guide on Bluesky proxy infrastructure covers the specific proxy configurations that work reliably against public.api.bsky.app — residential rotating proxies outperform datacenter ones here because the AppView API does apply light fingerprinting on top of IP rate limits.

A few operational notes that matter at scale:

  • Bluesky does not currently block Tor exit nodes, but response latency is high and not worth the tradeoff for bulk collection
  • 429 responses include a Retry-After header — respect it, backoff exponentially, and do not retry immediately
  • If you’re also collecting from other platforms, the routing logic for Discord public server scraping and for Bluesky shares a common pattern: you’re querying distributed infrastructure with inconsistent rate enforcement per node

Authenticated API access (using an app password, not your account password) raises most rate limits by 3-5x and unlocks a few additional endpoints. For any production pipeline touching >10,000 accounts per day, create a dedicated bot account and authenticate all requests.
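Authentication goes through com.atproto.server.createSession, which exchanges an identifier and app password for tokens. A sketch of the request shape (the helper name is my own; the service URL and placeholder password are illustrative):

```python
def create_session_request(service: str, identifier: str,
                           app_password: str) -> tuple[str, dict]:
    """Build the URL and JSON body for a com.atproto.server.createSession
    POST. Use an app password generated in Bluesky settings — never the
    real account password."""
    url = f"{service}/xrpc/com.atproto.server.createSession"
    return url, {"identifier": identifier, "password": app_password}

# The response JSON carries accessJwt; send it on subsequent calls as
#   Authorization: Bearer <accessJwt>
url, body = create_session_request("https://bsky.social",
                                   "bot.example.com", "xxxx-xxxx-xxxx-xxxx")
```

The atproto SDK wraps this same exchange in `Client.login()`, but knowing the raw shape helps when you route authenticated calls to a federated PDS instead of bsky.social.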

Bottom Line

Bluesky is the easiest major social platform to scrape legally in 2026: the firehose is public, the API is well-documented, and federation means the data is explicitly designed to be portable. Start with the AppView REST API for targeted collection, add the firehose for real-time monitoring, and use PDS routing when you need full account archives across federated hosts. DRT will keep tracking AT Protocol API changes as the network scales toward mainstream adoption.
