Bluesky’s AT Protocol is one of the few social platforms in 2026 that actively wants you to scrape it — the public firehose is open, the API is documented, and most endpoints don’t require authentication for read access. That said, “open” doesn’t mean “easy.” The firehose runs at several thousand events per second, the data model is unfamiliar if you’re coming from REST-style APIs, and the workarounds for bulk historical collection have their own sharp edges. Here’s a direct path through both the official route and the fallback options.
Understanding the AT Protocol Data Model
Before writing a single line of code, spend 20 minutes on the data model — it’ll save hours of confusion later.
AT Protocol uses three core primitives:
- DID (Decentralized Identifier): a persistent identifier like `did:plc:abc123xyz` that survives username changes
- NSID (Namespaced Schema ID): a type identifier like `app.bsky.feed.post` that describes a record schema
- CID (Content Identifier): a hash-based pointer to a specific version of a record
Every Bluesky post is a record under the app.bsky.feed.post NSID, stored in a user’s Personal Data Server (PDS). The PDS for most users is bsky.social, but federated users can self-host. If you’re building scrapers for decentralized social data, the federation model is similar to what you’ll encounter with Mastodon’s ActivityPub architecture — multiple data sources, no single authoritative endpoint.
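To make the record model concrete, here is roughly the shape of one decoded post record. The field names follow the published `app.bsky.feed.post` lexicon; the values are invented for illustration:

```python
# Roughly the shape of one decoded post record (values invented).
record = {
    "$type": "app.bsky.feed.post",            # the NSID doubles as the record type
    "text": "hello from the atmosphere",
    "createdAt": "2026-01-15T12:00:00.000Z",  # ISO 8601 timestamp, required
    "langs": ["en"],                          # optional BCP-47 language tags
}

def is_post(rec: dict) -> bool:
    """True when a record's $type matches the post NSID."""
    return rec.get("$type") == "app.bsky.feed.post"
```

Checking `$type` like this is how you tell posts apart from likes, follows, and other record kinds that live in the same repo.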
The Official Route: AppView API and the Firehose
Bluesky exposes two official paths for data collection.
AppView REST API
The AppView API at public.api.bsky.app is the friendliest entry point. Most read endpoints are unauthenticated and return clean JSON. The rate limits are generous — around 3,000 requests per 5 minutes per IP for unauthenticated calls — and the response schemas are stable.
```python
import httpx

BASE = "https://public.api.bsky.app/xrpc"

def get_author_feed(handle: str, limit: int = 50) -> list[dict]:
    r = httpx.get(
        f"{BASE}/app.bsky.feed.getAuthorFeed",
        params={"actor": handle, "limit": limit},
        timeout=10,
    )
    r.raise_for_status()
    return r.json().get("feed", [])
```

Pagination uses a `cursor` field returned in each response. Pass it back as `?cursor=` to walk backwards through a user's post history. The API caps single-request limits at 100 records for most endpoints.
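The cursor walk can be sketched as a generator. The page-fetching function is injected here so the loop stays testable; in production it would wrap an `httpx.get` call against `getAuthorFeed`:

```python
from typing import Callable, Iterator

def walk_feed(fetch_page: Callable[[dict], dict], actor: str,
              page_size: int = 100) -> Iterator[dict]:
    """Yield every feed item for an actor by following the cursor.

    fetch_page takes query params and returns the decoded JSON body;
    in production it would wrap httpx.get on getAuthorFeed.
    """
    cursor = None
    while True:
        params = {"actor": actor, "limit": page_size}
        if cursor:
            params["cursor"] = cursor
        data = fetch_page(params)
        yield from data.get("feed", [])
        cursor = data.get("cursor")
        if not cursor:  # no cursor means the history is exhausted
            return
```

Stopping on a missing cursor rather than an empty page is the safer convention: some endpoints return a final partial page together with no cursor.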
The Relay Firehose
For real-time collection or large-scale crawls, the firehose at wss://bsky.network/xrpc/com.atproto.sync.subscribeRepos is the right tool. It streams every repo operation across the network as CAR (Content Addressable aRchive) encoded websocket messages. Decode the payload as DAG-CBOR, filter for app.bsky.feed.post creates, and you have a near-complete view of public posts.
The practical catch: at peak hours the firehose pushes 3,000 to 5,000 events per second. A naive single-threaded Python consumer falls behind within minutes. Use the atproto SDK's built-in firehose client with a multi-process consumer pool, or route the stream through a Redis queue and process asynchronously.
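The filtering step itself is simple. This is a minimal sketch, assuming the websocket frame has already been decoded into a commit dict with an `ops` list, which is the shape the atproto SDK's firehose client hands to your callback:

```python
POST_NSID = "app.bsky.feed.post"

def post_creates(commit: dict) -> list[str]:
    """Return record paths of new posts in one decoded commit message.

    Each op carries an action ('create', 'update', 'delete') and a path
    of the form '<collection NSID>/<record key>'.
    """
    return [
        op["path"]
        for op in commit.get("ops", [])
        if op.get("action") == "create"
        and op.get("path", "").startswith(POST_NSID + "/")
    ]
```

Keep this filter as cheap as possible: at firehose volume, anything heavier than a string prefix check belongs on the far side of the queue.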
Workarounds for Historical and Bulk Collection
The firehose is real-time only — it has no replay window beyond a few hours. For historical data, you have three options.
| Method | Coverage | Auth Required | Rate Limit | Best For |
|---|---|---|---|---|
| getAuthorFeed pagination | Per-user posts | No | 3k req/5min | Profile-level research |
| searchPosts (AppView) | Full-text indexed | No | 300 req/5min | Keyword monitoring |
| PDS listRecords | All records by DID | No | Varies by PDS | Full user archive |
| Relay getBlocks (CAR sync) | Full repo snapshots | No | Low, use sparingly | Historical audit |
| Third-party index (Smoke Signal, Skyfeed) | Cross-account search | API key | Varies | Volume keyword pulls |
For keyword-based collection at scale, app.bsky.feed.searchPosts is rate-limited tighter than getAuthorFeed. If you need volume, the Smoke Signal and Skyfeed indexers offer their own search APIs with higher throughput — check their current terms before hitting them in bulk.
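If you stay on the official endpoint, a client-side limiter sized to that 300 requests per 5 minutes budget keeps you under the ceiling. This is a minimal sliding-window sketch; the clock is passed in so the logic can be tested without sleeping, and the numbers should be tuned to whatever limit you actually observe:

```python
from collections import deque

class WindowLimiter:
    """Sliding-window limiter: at most max_calls per window_s seconds."""

    def __init__(self, max_calls: int = 300, window_s: float = 300.0):
        self.max_calls = max_calls
        self.window_s = window_s
        self.stamps: deque[float] = deque()  # timestamps of recent calls

    def acquire(self, now: float) -> float:
        """Record a call at `now`; return seconds to sleep first (0 if none)."""
        # Drop timestamps that have aged out of the window.
        while self.stamps and now - self.stamps[0] >= self.window_s:
            self.stamps.popleft()
        if len(self.stamps) < self.max_calls:
            self.stamps.append(now)
            return 0.0
        # Window is full: the next slot opens when the oldest stamp expires.
        wait = self.window_s - (now - self.stamps[0])
        self.stamps.popleft()
        self.stamps.append(now + wait)
        return wait
```

In a real crawler you would call `limiter.acquire(time.monotonic())` before each request and `time.sleep()` on the returned value.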
This tradeoff between official limits and third-party indexers mirrors what you hit scraping other platforms. The approach for Threads public post collection follows the same pattern: official API first, unofficial indexer as overflow.
Handling DIDs, PDS Routing, and Federation
Federated users don’t store their data on bsky.social. To correctly resolve any DID to its PDS, call the DID resolution endpoint:
```
GET https://plc.directory/<did>
```

This returns a DID document containing the `#atproto_pds` service endpoint. Your scraper needs to route com.atproto.repo.* calls to that endpoint, not to bsky.social. A naive scraper that hardcodes the host will silently miss federated accounts — an important detail if your research covers non-Bluesky AT Protocol deployments.
1. Resolve the handle to a DID via `com.atproto.identity.resolveHandle`
2. Fetch the DID document from `plc.directory` (or the identity's own DID doc)
3. Extract the PDS service endpoint
4. Call `com.atproto.repo.listRecords` on that PDS with the resolved DID
This four-step chain is the correct way to scrape any AT Protocol account regardless of which PDS hosts it. Skip steps 2 and 3 only if you're 100% certain you're targeting bsky.social-hosted accounts.
Proxy and Infrastructure Considerations
Bluesky’s rate limits are IP-based for unauthenticated calls. If you’re running parallel crawlers across thousands of DIDs, you will hit the ceiling on a single residential or datacenter IP. The pillar guide on Bluesky proxy infrastructure covers the specific proxy configurations that work reliably against public.api.bsky.app — residential rotating proxies outperform datacenter ones here because the AppView API does apply light fingerprinting on top of IP rate limits.
A few operational notes that matter at scale:
- Bluesky does not currently block Tor exit nodes, but response latency is high and not worth the tradeoff for bulk collection
- `429` responses include a `Retry-After` header — respect it, back off exponentially, and do not retry immediately
- If you're also collecting from other decentralized platforms, the federation routing logic for Discord public server scraping and for Bluesky shares a common pattern: you're querying distributed infrastructure with inconsistent rate enforcement per node
Authenticated API access (using an app password, not your account password) raises most rate limits by 3-5x and unlocks a few additional endpoints. For any production pipeline touching >10,000 accounts per day, create a dedicated bot account and authenticate all requests.
Bottom Line
Bluesky is the easiest major social platform to scrape legally in 2026: the firehose is public, the API is well-documented, and federation means the data is explicitly designed to be portable. Start with the AppView REST API for targeted collection, add the firehose for real-time monitoring, and use PDS routing when you need full account archives across federated hosts. DRT will keep tracking AT Protocol API changes as the network scales toward mainstream adoption.