How to Scrape Bluesky Data in 2026
Bluesky is a decentralized social media platform built on the AT Protocol (Authenticated Transfer Protocol). It began as a project initiated at Twitter by co-founder Jack Dorsey and is now run by an independent company. With its growing user base and open, decentralized architecture, Bluesky has become an increasingly important platform for social listening, trend analysis, and academic research.
Unlike most social platforms, Bluesky is genuinely scraper-friendly: the AT Protocol provides open API access to public data, making data collection straightforward and well-supported.
What Data Can You Extract from Bluesky?
Bluesky’s open protocol provides access to:
- Posts (skeets) with text, images, links, and embeds
- User profiles (handle, display name, bio, follower/following counts)
- Feeds and timelines (algorithmic and chronological)
- Reply threads and conversations
- Likes, reposts, and quote posts
- Custom feeds (community-curated algorithms)
- Moderation lists and labels
- Full network graph data
Example JSON Output
{
"uri": "at://did:plc:abc123/app.bsky.feed.post/xyz789",
"author": {
"handle": "techwriter.bsky.social",
"displayName": "Tech Writer",
"followers": 15000,
"following": 450
},
"text": "Just published a deep dive on how the AT Protocol compares to ActivityPub for decentralized social media.",
"created_at": "2026-03-08T14:30:00.000Z",
"likes": 342,
"reposts": 89,
"replies": 45,
"embed": {
"type": "external",
"title": "AT Protocol vs ActivityPub",
"url": "https://example.com/article"
}
}
Prerequisites
pip install atproto requests
Bluesky’s AT Protocol API is open and does not require proxies for most use cases. However, for high-volume scraping, residential proxies can help distribute requests.
Method 1: Using the AT Protocol SDK (Recommended)
The atproto Python SDK provides the cleanest interface to Bluesky data.
from atproto import Client
import json
from datetime import datetime
class BlueskyScraper:
def __init__(self, handle=None, password=None):
self.client = Client()
if handle and password:
self.client.login(handle, password)
def get_profile(self, handle):
"""Get user profile information."""
try:
profile = self.client.get_profile(handle)
return {
"did": profile.did,
"handle": profile.handle,
"display_name": profile.display_name,
"description": profile.description,
"followers_count": profile.followers_count,
"follows_count": profile.follows_count,
"posts_count": profile.posts_count,
"avatar": profile.avatar,
"created_at": profile.created_at,
}
except Exception as e:
print(f"Error getting profile: {e}")
return None
def get_author_feed(self, handle, limit=50):
"""Get posts from a specific user."""
posts = []
cursor = None
while len(posts) < limit:
try:
fetch_limit = min(limit - len(posts), 100)
response = self.client.get_author_feed(
handle,
limit=fetch_limit,
cursor=cursor
)
for item in response.feed:
post = item.post
post_data = {
"uri": post.uri,
"cid": post.cid,
"author_handle": post.author.handle,
"author_name": post.author.display_name,
"text": post.record.text if hasattr(post.record, 'text') else None,
"created_at": post.record.created_at if hasattr(post.record, 'created_at') else None,
"likes": post.like_count,
"reposts": post.repost_count,
"replies": post.reply_count,
"is_repost": item.reason is not None,
}
posts.append(post_data)
cursor = response.cursor
if not cursor:
break
except Exception as e:
print(f"Error fetching feed: {e}")
break
return posts
def search_posts(self, query, limit=100):
"""Search for posts containing a keyword."""
posts = []
cursor = None
while len(posts) < limit:
try:
fetch_limit = min(limit - len(posts), 100)
response = self.client.app.bsky.feed.search_posts(
params={"q": query, "limit": fetch_limit, "cursor": cursor}
)
for post in response.posts:
posts.append({
"uri": post.uri,
"author": post.author.handle,
"text": post.record.text if hasattr(post.record, 'text') else None,
"created_at": post.record.created_at if hasattr(post.record, 'created_at') else None,
"likes": post.like_count,
"reposts": post.repost_count,
"replies": post.reply_count,
})
cursor = response.cursor
if not cursor:
break
except Exception as e:
print(f"Error searching: {e}")
break
return posts
def get_post_thread(self, uri):
"""Get a full post thread with replies."""
try:
response = self.client.get_post_thread(uri)
thread = response.thread
result = {
"post": {
"uri": thread.post.uri,
"text": thread.post.record.text if hasattr(thread.post.record, 'text') else None,
"author": thread.post.author.handle,
"likes": thread.post.like_count,
},
"replies": []
}
if hasattr(thread, 'replies') and thread.replies:
for reply in thread.replies:
if hasattr(reply, 'post'):
result["replies"].append({
"uri": reply.post.uri,
"text": reply.post.record.text if hasattr(reply.post.record, 'text') else None,
"author": reply.post.author.handle,
"likes": reply.post.like_count,
})
return result
except Exception as e:
print(f"Error getting thread: {e}")
return None
def get_followers(self, handle, limit=100):
"""Get a user's followers."""
followers = []
cursor = None
while len(followers) < limit:
try:
response = self.client.get_followers(
handle,
limit=min(limit - len(followers), 100),
cursor=cursor
)
for follower in response.followers:
followers.append({
"handle": follower.handle,
"display_name": follower.display_name,
"description": follower.description,
})
cursor = response.cursor
if not cursor:
break
except Exception as e:
print(f"Error fetching followers: {e}")
break
return followers
# Usage
scraper = BlueskyScraper(handle="your.handle.bsky.social", password="your_app_password")
# Get profile
profile = scraper.get_profile("jay.bsky.team")
print(json.dumps(profile, indent=2, default=str))
# Get recent posts
posts = scraper.get_author_feed("jay.bsky.team", limit=20)
print(f"Fetched {len(posts)} posts")
# Search
results = scraper.search_posts("web scraping", limit=50)
print(f"Found {len(results)} posts about web scraping")Method 2: Direct HTTP API Access
Method 2: Direct HTTP API Access
For more control, access the AT Protocol API directly:
import requests
import json
class BlueskyHTTPScraper:
def __init__(self, proxy_url=None):
self.session = requests.Session()
self.base_url = "https://public.api.bsky.app"
self.proxy_url = proxy_url
def _get_proxies(self):
if self.proxy_url:
return {"http": self.proxy_url, "https": self.proxy_url}
return None
def get_profile(self, handle):
"""Get profile via public API (no auth required)."""
response = self.session.get(
f"{self.base_url}/xrpc/app.bsky.actor.getProfile",
params={"actor": handle},
proxies=self._get_proxies(),
timeout=30
)
response.raise_for_status()
return response.json()
def search_posts(self, query, limit=25):
"""Search posts via public API."""
response = self.session.get(
f"{self.base_url}/xrpc/app.bsky.feed.searchPosts",
params={"q": query, "limit": limit},
proxies=self._get_proxies(),
timeout=30
)
response.raise_for_status()
return response.json()
def get_feed(self, handle, limit=50, cursor=None):
"""Get author feed via public API."""
params = {"actor": handle, "limit": limit}
if cursor:
params["cursor"] = cursor
response = self.session.get(
f"{self.base_url}/xrpc/app.bsky.feed.getAuthorFeed",
params=params,
proxies=self._get_proxies(),
timeout=30
)
response.raise_for_status()
return response.json()
# Usage (no authentication needed for public data)
scraper = BlueskyHTTPScraper(proxy_url="http://user:pass@proxy:port")
profile = scraper.get_profile("jay.bsky.team")
print(json.dumps(profile, indent=2))
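The get_feed method returns the raw JSON from getAuthorFeed, which includes a cursor field for pagination. A minimal sketch of walking an author's feed with it (the 500-item cap is arbitrary):

```python
# Walk an author's feed page by page using the returned cursor
all_items = []
cursor = None
while True:
    page = scraper.get_feed("jay.bsky.team", limit=100, cursor=cursor)
    all_items.extend(page.get("feed", []))
    cursor = page.get("cursor")
    if not cursor or len(all_items) >= 500:  # arbitrary cap for this sketch
        break
print(f"Collected {len(all_items)} feed items")
```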
Handling Rate Limits
Bluesky’s public API has generous rate limits, but you should still be respectful:
| Endpoint | Rate Limit |
|---|---|
| Public API | ~3000 requests/5 minutes |
| Authenticated API | ~5000 requests/5 minutes |
| Search | ~100 requests/minute |
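When you do hit a limit, the API responds with HTTP 429. A minimal retry sketch against the public profile endpoint; get_with_backoff is a hypothetical helper, and the Retry-After header is treated as optional since it may not always be present:

```python
import time
import requests

def get_with_backoff(url, params, max_retries=5):
    """GET with retries on HTTP 429 (hypothetical helper)."""
    for attempt in range(max_retries):
        response = requests.get(url, params=params, timeout=30)
        if response.status_code != 429:
            response.raise_for_status()
            return response.json()
        # Prefer the server's Retry-After hint if present, otherwise back off exponentially
        retry_after = response.headers.get("Retry-After")
        delay = int(retry_after) if retry_after else 2 ** attempt
        time.sleep(delay)
    raise RuntimeError("Still rate limited after retries")

profile = get_with_backoff(
    "https://public.api.bsky.app/xrpc/app.bsky.actor.getProfile",
    {"actor": "bsky.app"},
)
```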
Proxy Recommendations
| Proxy Type | Necessity | Best For |
|---|---|---|
| Datacenter | Low priority | High-volume API calls |
| Residential | Optional | Distributed requests |
| None needed | Most cases | Standard data collection |
Bluesky’s open architecture means proxies are only needed for very high-volume operations. For most use cases, direct API access without proxies works well.
Legal Considerations
- Open Protocol: The AT Protocol is designed for open data access, making Bluesky one of the most legally accessible platforms to scrape.
- Terms of Service: While data is open, respect Bluesky’s ToS regarding commercial use and user privacy.
- Privacy: Despite open access, respect user privacy and comply with GDPR/CCPA.
- Rate Limits: Excessive requests may result in temporary IP blocks.
See our web scraping compliance guide for details.
Frequently Asked Questions
Is Bluesky data publicly accessible without authentication?
Yes. The public API (public.api.bsky.app) provides access to profiles, posts, and feeds without authentication. Authentication provides higher rate limits and access to additional features.
How does Bluesky’s openness compare to Twitter/X?
Bluesky’s AT Protocol is fundamentally open, providing free API access to public data. This is in stark contrast to Twitter/X, which charges thousands of dollars per month for API access. Bluesky is one of the most accessible social platforms for data collection.
Can I access the full Bluesky firehose?
Yes. The AT Protocol provides a firehose endpoint that streams all public events in real-time. This is available at the relay level and provides complete access to all public posts, likes, follows, and other actions.
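The atproto SDK ships a firehose client. A minimal sketch, under the assumption that you only want to see which post records are being created; decoding the actual record content requires unpacking the CAR blocks in each commit, which is omitted here:

```python
from atproto import FirehoseSubscribeReposClient, parse_subscribe_repos_message, models

client = FirehoseSubscribeReposClient()

def on_message(message):
    commit = parse_subscribe_repos_message(message)
    # Only repo commits carry record operations (creates, updates, deletes)
    if not isinstance(commit, models.ComAtprotoSyncSubscribeRepos.Commit):
        return
    for op in commit.ops:
        if op.action == "create" and op.path.startswith("app.bsky.feed.post/"):
            print(f"New post: at://{commit.repo}/{op.path}")

client.start(on_message)  # blocks until interrupted
```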
Do I need proxies for Bluesky scraping?
For most use cases, no. Bluesky’s generous rate limits and open architecture make direct API access sufficient. Proxies are only needed for extremely high-volume operations or when running multiple concurrent scrapers.
Advanced Techniques
Handling Pagination
Most APIs and websites paginate their results. Bluesky uses cursor-based pagination (shown in the methods above); for sites that paginate by page number, implement robust handling along these lines:
import random
import time

def scrape_all_pages(scraper, base_url, max_pages=20):
all_data = []
for page in range(1, max_pages + 1):
url = f"{base_url}?page={page}"
results = scraper.search(url)
if not results:
break
all_data.extend(results)
print(f"Page {page}: {len(results)} items (total: {len(all_data)})")
time.sleep(random.uniform(2, 5))
return all_data
Data Validation and Cleaning
Always validate scraped data before storage:
def validate_data(item):
required_fields = ["title", "url"]
for field in required_fields:
if not item.get(field):
return False
return True
def clean_text(text):
if not text:
return None
# Remove extra whitespace
import re
text = re.sub(r'\s+', ' ', text).strip()
# Remove HTML entities
import html
text = html.unescape(text)
return text
# Apply to results
cleaned = [item for item in results if validate_data(item)]
for item in cleaned:
item["title"] = clean_text(item.get("title"))Monitoring and Alerting
Build monitoring into your scraping pipeline:
import logging
from datetime import datetime
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
class ScrapingMonitor:
def __init__(self):
self.start_time = datetime.now()
self.requests = 0
self.errors = 0
self.items = 0
def log_request(self, success=True):
self.requests += 1
if not success:
self.errors += 1
if self.requests % 50 == 0:
elapsed = (datetime.now() - self.start_time).total_seconds()
rate = self.requests / max(elapsed, 1) * 60
logger.info(f"Requests: {self.requests}, Errors: {self.errors}, "
f"Items: {self.items}, Rate: {rate:.1f}/min")
def log_item(self, count=1):
self.items += count
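A sketch of wiring the monitor into a scraping loop; urls_to_scrape, fetch_page, and parse_items are placeholders for your own code:

```python
monitor = ScrapingMonitor()
for url in urls_to_scrape:  # placeholder iterable of target URLs
    try:
        items = parse_items(fetch_page(url))  # placeholder fetch/parse helpers
        monitor.log_item(len(items))
        monitor.log_request(success=True)
    except Exception:
        monitor.log_request(success=False)
```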
Error Handling and Retry Logic
Implement robust error handling:
import time
from requests.exceptions import RequestException
def retry_request(func, max_retries=3, base_delay=5):
for attempt in range(max_retries):
try:
return func()
except RequestException as e:
if attempt == max_retries - 1:
raise
delay = base_delay * (2 ** attempt)
print(f"Attempt {attempt + 1} failed: {e}. Retrying in {delay}s...")
time.sleep(delay)
return None
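Because retry_request catches requests.RequestException, it pairs naturally with the HTTP scraper from Method 2:

```python
# Retry a public-API call up to 3 times with exponential backoff
scraper = BlueskyHTTPScraper()
profile = retry_request(lambda: scraper.get_profile("jay.bsky.team"))
```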
Data Storage Options
Choose the right storage for your scraping volume:
import json
import csv
import sqlite3
class DataStorage:
def __init__(self, db_path="scraped_data.db"):
self.conn = sqlite3.connect(db_path)
self.conn.execute('''CREATE TABLE IF NOT EXISTS items
(id TEXT PRIMARY KEY, title TEXT, url TEXT, data JSON, scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP)''')
def save(self, item):
self.conn.execute(
"INSERT OR REPLACE INTO items (id, title, url, data) VALUES (?, ?, ?, ?)",
(item.get("id"), item.get("title"), item.get("url"), json.dumps(item))
)
self.conn.commit()
def export_json(self, output_path):
cursor = self.conn.execute("SELECT data FROM items")
items = [json.loads(row[0]) for row in cursor.fetchall()]
with open(output_path, "w") as f:
json.dump(items, f, indent=2)
def export_csv(self, output_path):
cursor = self.conn.execute("SELECT * FROM items")
rows = cursor.fetchall()
with open(output_path, "w", newline="") as f:
writer = csv.writer(f)
writer.writerow(["id", "title", "url", "data", "scraped_at"])
writer.writerows(rows)
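A usage sketch that persists the search results from Method 1 and exports them; the post uri is reused as both id and url, and the post text stands in for title:

```python
storage = DataStorage("bluesky_posts.db")
for post in results:  # results from scraper.search_posts() above
    storage.save({"id": post["uri"], "title": post["text"], "url": post["uri"], **post})
storage.export_json("bluesky_posts.json")
storage.export_csv("bluesky_posts.csv")
```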
Frequently Asked Questions
How often should I scrape data?
The optimal frequency depends on how often the source data changes. For real-time data (stock prices, news), scrape every few minutes. For product listings, daily or weekly is usually sufficient. For reviews, weekly scraping captures new feedback without excessive load.
What happens if my IP gets blocked?
If you receive 403 or 429 status codes, your IP is likely blocked. Switch to a different proxy, implement exponential backoff, and slow your request rate. Rotating residential proxies automatically switch IPs to prevent blocks.
Should I use headless browsers or HTTP requests?
Use HTTP requests (with BeautifulSoup or similar) whenever possible; they are faster and use fewer resources. Switch to headless browsers (Selenium, Playwright) only when JavaScript rendering is required for the data you need.
How do I handle CAPTCHAs?
CAPTCHAs indicate aggressive bot detection. To minimize them: use residential or mobile proxies, implement realistic delays, rotate user agents, and maintain consistent session behavior. For persistent CAPTCHAs, consider CAPTCHA-solving services as a last resort.
Can I scrape data commercially?
The legality of commercial scraping depends on the platform’s ToS, the type of data collected, and your jurisdiction. Public data is generally more permissible, but always consult legal counsel for commercial use cases. See our compliance guide.
Conclusion
Bluesky stands out as one of the most accessible social platforms for data collection, thanks to the AT Protocol’s open design. The Python SDK provides clean, structured access to profiles, posts, search, and social graphs. For most research and analysis use cases, direct API access without proxies is sufficient.
For more social media scraping strategies, explore our social media proxy guide and proxy provider comparisons.
Related Reading
- How to Scrape AliExpress Product Data
- How to Scrape Amazon Product Reviews in 2026
- aiohttp + BeautifulSoup: Async Python Scraping
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)
- API vs Web Scraping: When You Need Proxies (and When You Don’t)
- ASEAN Data Protection Laws: A Web Scraping Compliance Matrix