How to Scrape Mastodon Data in 2026
Mastodon is the leading decentralized social network built on the ActivityPub protocol, with millions of users distributed across thousands of independent servers (instances). For social media researchers, brand monitors, and data scientists, Mastodon provides a unique dataset of federated social interactions that can be collected through its well-documented, open API.
Unlike centralized platforms that restrict data access, Mastodon’s open-source design and standardized API make it one of the most accessible social platforms for data collection.
What Data Can You Extract from Mastodon?
Mastodon’s API provides access to:
- Posts (toots) with text, media, polls, and content warnings
- User profiles (username, bio, follower/following counts, joined date)
- Public timelines (local, federated, hashtag-based)
- Boost (reblog) and favorite counts
- Hashtag trends and usage statistics
- Instance information (rules, stats, features)
- Conversation threads
- Custom emoji and media attachments
Example JSON Output
```json
{
  "id": "109876543210987654",
  "account": {
    "username": "techwriter",
    "display_name": "Tech Writer",
    "instance": "mastodon.social",
    "followers_count": 8500,
    "following_count": 320,
    "statuses_count": 12400
  },
  "content": "<p>Just published my analysis of ActivityPub adoption rates across the Fediverse.</p>",
  "plain_text": "Just published my analysis of ActivityPub adoption rates across the Fediverse.",
  "created_at": "2026-03-08T14:30:00.000Z",
  "favourites_count": 89,
  "reblogs_count": 42,
  "replies_count": 15,
  "visibility": "public",
  "tags": ["fediverse", "activitypub"],
  "url": "https://mastodon.social/@techwriter/109876543210987654"
}
```
Prerequisites
```bash
pip install Mastodon.py requests beautifulsoup4 lxml
```
Method 1: Using Mastodon.py (Recommended)
The Mastodon.py library provides a complete Python interface to the Mastodon API.
```python
from mastodon import Mastodon
from bs4 import BeautifulSoup
import json
from datetime import datetime, timedelta


class MastodonScraper:
    def __init__(self, instance_url="https://mastodon.social", access_token=None):
        if access_token:
            self.api = Mastodon(
                access_token=access_token,
                api_base_url=instance_url
            )
        else:
            # Public API access (limited but no auth needed)
            self.api = Mastodon(
                api_base_url=instance_url,
                client_id=None,
                client_secret=None,
                access_token=None
            )
        self.instance_url = instance_url

    def get_public_timeline(self, limit=40, local=False):
        """Get posts from the public timeline."""
        try:
            if local:
                toots = self.api.timeline_local(limit=limit)
            else:
                toots = self.api.timeline_public(limit=limit)
            return [self._parse_toot(t) for t in toots]
        except Exception as e:
            print(f"Error: {e}")
            return []

    def get_hashtag_timeline(self, hashtag, limit=40):
        """Get posts for a specific hashtag."""
        try:
            toots = self.api.timeline_hashtag(hashtag, limit=limit)
            return [self._parse_toot(t) for t in toots]
        except Exception as e:
            print(f"Error: {e}")
            return []

    def get_account_statuses(self, account_id, limit=100):
        """Get posts from a specific account."""
        try:
            toots = self.api.account_statuses(account_id, limit=limit)
            return [self._parse_toot(t) for t in toots]
        except Exception as e:
            print(f"Error: {e}")
            return []

    def search_accounts(self, query, limit=10):
        """Search for user accounts."""
        try:
            results = self.api.account_search(query, limit=limit)
            return [{
                "id": a.id,
                "username": a.username,
                "display_name": a.display_name,
                "url": a.url,
                "followers": a.followers_count,
                "following": a.following_count,
                "posts": a.statuses_count,
                "note": a.note,
                "created_at": str(a.created_at),
            } for a in results]
        except Exception as e:
            print(f"Error: {e}")
            return []

    def get_trending_tags(self):
        """Get currently trending hashtags."""
        try:
            trends = self.api.trending_tags()
            return [{
                "name": t.name,
                "url": t.url,
                "uses_today": t.history[0]["uses"] if t.history else 0,
                "accounts_today": t.history[0]["accounts"] if t.history else 0,
            } for t in trends]
        except Exception as e:
            print(f"Error: {e}")
            return []

    def get_instance_info(self):
        """Get information about the Mastodon instance."""
        try:
            info = self.api.instance()
            return {
                "uri": info.uri,
                "title": info.title,
                "description": info.description,
                "version": info.version,
                "stats": {
                    "user_count": info.stats["user_count"],
                    "status_count": info.stats["status_count"],
                    "domain_count": info.stats["domain_count"],
                },
                "languages": info.languages,
                "registrations": info.registrations,
            }
        except Exception as e:
            print(f"Error: {e}")
            return None

    def _parse_toot(self, toot):
        """Parse a toot into a clean dictionary."""
        # Strip HTML from content
        plain_text = BeautifulSoup(toot.content, "lxml").get_text()
        return {
            "id": toot.id,
            "url": toot.url,
            "created_at": str(toot.created_at),
            "content_html": toot.content,
            "plain_text": plain_text,
            "author": {
                "id": toot.account.id,
                "username": toot.account.username,
                "display_name": toot.account.display_name,
                "url": toot.account.url,
            },
            "favourites": toot.favourites_count,
            "reblogs": toot.reblogs_count,
            "replies": toot.replies_count,
            "visibility": toot.visibility,
            "tags": [t.name for t in toot.tags],
            "media": [m.url for m in toot.media_attachments],
            "is_reblog": toot.reblog is not None,
            "language": toot.language,
            "sensitive": toot.sensitive,
            "spoiler_text": toot.spoiler_text,
        }

    def paginate_timeline(self, hashtag=None, max_toots=500):
        """Paginate through a timeline to collect many posts."""
        all_toots = []
        max_id = None
        while len(all_toots) < max_toots:
            try:
                if hashtag:
                    toots = self.api.timeline_hashtag(hashtag, limit=40, max_id=max_id)
                else:
                    toots = self.api.timeline_public(limit=40, max_id=max_id)
                if not toots:
                    break
                parsed = [self._parse_toot(t) for t in toots]
                all_toots.extend(parsed)
                max_id = toots[-1].id
                print(f"Collected {len(all_toots)} toots")
            except Exception as e:
                print(f"Error during pagination: {e}")
                break
        return all_toots[:max_toots]


# Usage
scraper = MastodonScraper(instance_url="https://mastodon.social")

# Get public timeline
timeline = scraper.get_public_timeline(limit=20)
print(f"Got {len(timeline)} posts from public timeline")

# Get trending tags
trends = scraper.get_trending_tags()
print(json.dumps(trends, indent=2))

# Search for accounts
accounts = scraper.search_accounts("data science", limit=5)
print(json.dumps(accounts, indent=2, default=str))

# Get hashtag posts
ai_posts = scraper.get_hashtag_timeline("artificialintelligence", limit=20)
print(f"Got {len(ai_posts)} AI-related posts")
```
Method 2: Direct API Access with Requests
For cross-instance scraping without the Mastodon.py dependency:
```python
import requests
import json
import time


class MastodonDirectScraper:
    def __init__(self, proxy_url=None):
        self.session = requests.Session()
        self.proxy_url = proxy_url

    def _get_proxies(self):
        if self.proxy_url:
            return {"http": self.proxy_url, "https": self.proxy_url}
        return None

    def get_public_timeline(self, instance, limit=40):
        """Get public timeline from any instance."""
        response = self.session.get(
            f"https://{instance}/api/v1/timelines/public",
            params={"limit": limit},
            proxies=self._get_proxies(),
            timeout=30
        )
        response.raise_for_status()
        return response.json()

    def search_instance(self, instance, query, result_type="statuses"):
        """Search across an instance."""
        response = self.session.get(
            f"https://{instance}/api/v2/search",
            params={"q": query, "type": result_type, "limit": 40},
            proxies=self._get_proxies(),
            timeout=30
        )
        response.raise_for_status()
        return response.json()

    def scrape_multiple_instances(self, instances, hashtag, limit_per=20):
        """Scrape the same hashtag across multiple instances."""
        all_toots = []
        for instance in instances:
            try:
                response = self.session.get(
                    f"https://{instance}/api/v1/timelines/tag/{hashtag}",
                    params={"limit": limit_per},
                    proxies=self._get_proxies(),
                    timeout=30
                )
                if response.ok:
                    toots = response.json()
                    for t in toots:
                        t["_instance"] = instance
                    all_toots.extend(toots)
                    print(f"{instance}: {len(toots)} posts")
            except Exception as e:
                print(f"{instance}: Error - {e}")
            time.sleep(1)
        return all_toots


# Usage
scraper = MastodonDirectScraper(proxy_url="http://user:pass@proxy:port")

# Scrape across multiple instances
instances = ["mastodon.social", "fosstodon.org", "hachyderm.io", "infosec.exchange"]
posts = scraper.scrape_multiple_instances(instances, "webscraping", limit_per=10)
print(f"Total posts across instances: {len(posts)}")Mastodon API Rate Limits
Mastodon API Rate Limits
| Endpoint | Rate Limit |
|---|---|
| Public timelines | 300 requests/5 minutes |
| Account endpoints | 300 requests/5 minutes |
| Search | 30 requests/5 minutes |
| Authenticated | Higher limits per instance |
Rate limits vary by instance. Large instances like mastodon.social may have stricter limits.
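Mastodon responses include rate-limit headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset) that you can read to throttle adaptively instead of guessing. A minimal sketch with requests; the fixed pause is a deliberately simple fallback, not an exact reset calculation:
```python
import time
import requests

def fetch_with_rate_limit(url, min_remaining=5, pause=60, **kwargs):
    """GET a Mastodon API URL and pause when the rate-limit budget runs low."""
    response = requests.get(url, timeout=30, **kwargs)
    remaining = response.headers.get("X-RateLimit-Remaining")
    if remaining is not None and int(remaining) <= min_remaining:
        # Budget nearly spent; back off before the next request
        print(f"Rate limit nearly exhausted ({remaining} left), sleeping {pause}s")
        time.sleep(pause)
    response.raise_for_status()
    return response.json()

# Usage
toots = fetch_with_rate_limit(
    "https://mastodon.social/api/v1/timelines/public", params={"limit": 40}
)
```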
Proxy Recommendations
| Proxy Type | Necessity | Best For |
|---|---|---|
| None | Most use cases | Single instance, low volume |
| Datacenter | Optional | Multi-instance scraping |
| Residential | Optional | High-volume operations |
Mastodon’s open API makes proxies optional for most use cases. For multi-instance scraping at scale, datacenter proxies help distribute requests.
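If you do scale up, one option is to bind each instance to a different proxy from a small pool so no single exit IP carries all the traffic. A sketch reusing the MastodonDirectScraper from Method 2 (the proxy URLs are placeholders):
```python
from itertools import cycle

# Placeholder datacenter proxies; substitute your own pool
PROXIES = cycle([
    "http://user:pass@dc-proxy-1:8080",
    "http://user:pass@dc-proxy-2:8080",
])

instances = ["mastodon.social", "fosstodon.org", "hachyderm.io"]
all_posts = []
for instance in instances:
    # New scraper per instance, each bound to the next proxy in the pool
    scraper = MastodonDirectScraper(proxy_url=next(PROXIES))
    all_posts.extend(
        scraper.scrape_multiple_instances([instance], "webscraping", limit_per=10)
    )
print(f"Collected {len(all_posts)} posts via rotating proxies")
```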
Legal Considerations
- Open Source: Mastodon is open-source software (AGPL-3.0), and its API is designed for programmatic access.
- Instance Rules: Each instance has its own rules and ToS. Respect per-instance policies.
- Privacy: Respect content warnings and post visibility settings. Only collect public posts.
- GDPR: European instances are subject to GDPR. Comply with data deletion requests.
See our web scraping compliance guide for details.
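To put the visibility and content-warning points into practice, you can filter toots before storing them. A minimal sketch over the dictionaries produced by _parse_toot in Method 1; the policy shown (public, non-sensitive, no content warning) is only an example:
```python
def is_storable(toot):
    """Keep only public, non-sensitive posts with no content warning."""
    return (
        toot.get("visibility") == "public"
        and not toot.get("sensitive")
        and not toot.get("spoiler_text")
    )

# Usage with the Mastodon.py scraper from Method 1
toots = MastodonScraper().get_hashtag_timeline("fediverse", limit=40)
public_only = [t for t in toots if is_storable(t)]
print(f"Keeping {len(public_only)} of {len(toots)} toots")
```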
Frequently Asked Questions
Is it legal to scrape Mastodon?
Mastodon’s open-source nature and public API make it one of the most legally accessible social platforms. However, respect each instance’s Terms of Service, only collect public data, and comply with privacy regulations.
Do I need authentication to access Mastodon data?
No. Public timelines, hashtag timelines, and basic account info are accessible without authentication. Authentication provides higher rate limits and access to additional endpoints.
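If you need the authenticated limits, a common route is to generate a personal access token in the instance web UI (typically under Preferences > Development) and pass it to Mastodon.py. A minimal sketch that verifies the token; the token string is a placeholder:
```python
from mastodon import Mastodon

api = Mastodon(
    access_token="YOUR_ACCESS_TOKEN",  # placeholder: token from the instance web UI
    api_base_url="https://mastodon.social",
)

# Confirm the token works and see which account it is bound to
me = api.account_verify_credentials()
print(f"Authenticated as @{me.username}")
```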
Can I scrape across multiple Mastodon instances?
Yes. Each Mastodon instance runs its own API. You can query multiple instances to get a broader view of Fediverse activity. Use the direct API approach to scrape across instances.
How does Mastodon data compare to Twitter/X data?
Mastodon data is more accessible (free API, no paywall) but more fragmented (distributed across instances). The content tends to be more niche and technically oriented compared to mainstream social platforms.
Advanced Techniques
Handling Pagination
Most websites paginate their results. Implement robust pagination handling:
```python
import random
import time

def scrape_all_pages(scraper, base_url, max_pages=20):
    """Walk page-numbered results until an empty page or max_pages is reached."""
    all_data = []
    for page in range(1, max_pages + 1):
        url = f"{base_url}?page={page}"
        results = scraper.search(url)  # scraper: any client exposing a search(url) method
        if not results:
            break
        all_data.extend(results)
        print(f"Page {page}: {len(results)} items (total: {len(all_data)})")
        time.sleep(random.uniform(2, 5))
    return all_data
```
Data Validation and Cleaning
Always validate scraped data before storage:
```python
import html
import re

def validate_data(item):
    """Reject records that are missing required fields."""
    required_fields = ["title", "url"]
    for field in required_fields:
        if not item.get(field):
            return False
    return True

def clean_text(text):
    if not text:
        return None
    # Collapse extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    # Decode HTML entities
    text = html.unescape(text)
    return text

# Apply to a list of scraped records
cleaned = [item for item in results if validate_data(item)]
for item in cleaned:
    item["title"] = clean_text(item.get("title"))
```
Monitoring and Alerting
Build monitoring into your scraping pipeline:
```python
import logging
from datetime import datetime

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

class ScrapingMonitor:
    def __init__(self):
        self.start_time = datetime.now()
        self.requests = 0
        self.errors = 0
        self.items = 0

    def log_request(self, success=True):
        self.requests += 1
        if not success:
            self.errors += 1
        if self.requests % 50 == 0:
            elapsed = (datetime.now() - self.start_time).seconds
            rate = self.requests / max(elapsed, 1) * 60
            logger.info(f"Requests: {self.requests}, Errors: {self.errors}, "
                        f"Items: {self.items}, Rate: {rate:.1f}/min")

    def log_item(self, count=1):
        self.items += count
```
Error Handling and Retry Logic
Implement robust error handling:
```python
import time
from requests.exceptions import RequestException

def retry_request(func, max_retries=3, base_delay=5):
    for attempt in range(max_retries):
        try:
            return func()
        except RequestException as e:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed: {e}. Retrying in {delay}s...")
            time.sleep(delay)
    return None
```
Data Storage Options
Choose the right storage for your scraping volume:
```python
import json
import csv
import sqlite3

class DataStorage:
    def __init__(self, db_path="scraped_data.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute('''CREATE TABLE IF NOT EXISTS items
            (id TEXT PRIMARY KEY, title TEXT, url TEXT, data JSON,
             scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP)''')

    def save(self, item):
        self.conn.execute(
            "INSERT OR REPLACE INTO items (id, title, url, data) VALUES (?, ?, ?, ?)",
            (item.get("id"), item.get("title"), item.get("url"), json.dumps(item))
        )
        self.conn.commit()

    def export_json(self, output_path):
        cursor = self.conn.execute("SELECT data FROM items")
        items = [json.loads(row[0]) for row in cursor.fetchall()]
        with open(output_path, "w") as f:
            json.dump(items, f, indent=2)

    def export_csv(self, output_path):
        cursor = self.conn.execute("SELECT * FROM items")
        rows = cursor.fetchall()
        with open(output_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["id", "title", "url", "data", "scraped_at"])
            writer.writerows(rows)
```
Frequently Asked Questions
How often should I scrape data?
The optimal frequency depends on how often the source data changes. For real-time data (stock prices, news), scrape every few minutes. For product listings, daily or weekly is usually sufficient. For reviews, weekly scraping captures new feedback without excessive load.
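Whatever frequency you choose, incremental collection keeps the load low: fetch only posts newer than the last one you saw. A minimal sketch using Mastodon.py's since_id parameter (the hashtag and 15-minute interval are just examples):
```python
import time
from mastodon import Mastodon

api = Mastodon(api_base_url="https://mastodon.social")
last_seen_id = None

# Poll a hashtag on a schedule, fetching only posts newer than the last run
while True:
    new_toots = api.timeline_hashtag("fediverse", since_id=last_seen_id, limit=40)
    if new_toots:
        last_seen_id = new_toots[0].id  # timelines are returned newest-first
        print(f"Fetched {len(new_toots)} new toots")
    time.sleep(15 * 60)
```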
What happens if my IP gets blocked?
If you receive 403 or 429 status codes, your IP is likely blocked. Switch to a different proxy, implement exponential backoff, and slow your request rate. Rotating residential proxies automatically switch IPs to prevent blocks.
Should I use headless browsers or HTTP requests?
Use HTTP requests (with BeautifulSoup or similar) whenever possible — they are faster and use less resources. Switch to headless browsers (Selenium, Playwright) only when JavaScript rendering is required for the data you need.
How do I handle CAPTCHAs?
CAPTCHAs indicate aggressive bot detection. To minimize them: use residential or mobile proxies, implement realistic delays, rotate user agents, and maintain consistent session behavior. For persistent CAPTCHAs, consider CAPTCHA-solving services as a last resort.
Can I scrape data commercially?
The legality of commercial scraping depends on the platform’s ToS, the type of data collected, and your jurisdiction. Public data is generally more permissible, but always consult legal counsel for commercial use cases. See our compliance guide.
Conclusion
Mastodon is one of the most accessible social platforms for data collection, thanks to its open-source design and well-documented API. The Mastodon.py library provides a clean Python interface, while direct API access enables cross-instance research. Proxies are typically unnecessary for standard use cases.
For more social media scraping strategies, explore our social media proxy guide and proxy provider comparisons.
Related Reading
- How to Scrape AliExpress Product Data
- How to Scrape Amazon Product Reviews in 2026
- aiohttp + BeautifulSoup: Async Python Scraping
- How Anti-Bot Systems Detect Scrapers (Cloudflare, Akamai, PerimeterX)
- API vs Web Scraping: When You Need Proxies (and When You Don’t)
- ASEAN Data Protection Laws: A Web Scraping Compliance Matrix