Threads crossed 300 million monthly active users in early 2026, and if you’re building social listening tools, competitive intelligence pipelines, or brand monitoring systems, you need to scrape it. Meta has made this harder than it should be — no public API with meaningful rate limits, aggressive bot detection, and a GraphQL layer that shifts regularly. Here’s what actually works in 2026.
## What Meta Exposes (and What It Doesn’t)
Threads launched a limited API in late 2023 under the Instagram Graph API umbrella. By 2026, the official API covers:
- Your own account’s posts and replies (requires user auth)
- Basic profile metadata for public accounts
- Post insights (impressions, likes, replies) for your own content
What it does not cover: search by keyword, hashtag timelines, follower graphs, or bulk profile enumeration. If your use case goes beyond reading your own content back, you’re working outside the official surface.
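If your use case does fit inside the official surface, reading your own posts is one authenticated GET. A sketch of the request construction — the `graph.threads.net` base and field names follow Meta's published Threads API, but verify both against current docs, and the token is a placeholder:

```python
from urllib.parse import urlencode


def own_posts_url(access_token: str, fields: tuple = ("id", "text", "timestamp", "permalink")) -> str:
    """Build the official endpoint URL for reading your own Threads posts."""
    base = "https://graph.threads.net/v1.0/me/threads"
    return f"{base}?{urlencode({'fields': ','.join(fields), 'access_token': access_token})}"
```

From there, `httpx.get(own_posts_url(token)).json()` returns a paginated `data` array, and you never touch the detection machinery described below.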
The unofficial path uses Threads’ internal GraphQL API, the same endpoints the mobile app hits. The base is `https://www.threads.net/api/graphql` with a fixed `x-ig-app-id` header (`238260118697367` as of mid-2026). These endpoints are unauthenticated for public content, but Meta rate-limits by IP aggressively — more on mitigation below.
## Fetching Public Profiles and Posts
For a single public profile, the simplest approach is a direct GraphQL query against the `threads_timeline_list_feed_query` operation. You need these headers at minimum:
```python
import re

import httpx

HEADERS = {
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 17_4 like Mac OS X) AppleWebKit/605.1.15",
    "x-ig-app-id": "238260118697367",
    "Accept-Language": "en-US,en;q=0.9",
    "Content-Type": "application/x-www-form-urlencoded",
}


def get_user_id(username: str) -> str | None:
    """Resolve a username to its numeric ID from inline JSON in the profile HTML."""
    url = f"https://www.threads.net/@{username}"
    r = httpx.get(url, headers=HEADERS, follow_redirects=True)
    match = re.search(r'"user_id":"(\d+)"', r.text)
    return match.group(1) if match else None


def fetch_threads(user_id: str, cursor: str | None = None) -> dict:
    """Fetch one page of a user's public timeline via the internal GraphQL endpoint."""
    payload = {
        "lsd": "AVqbxe3J_LA",  # rotate this from a fresh homepage fetch
        "variables": f'{{"userID":"{user_id}","after":"{cursor or ""}"}}',
        "doc_id": "7357086314335024",  # timeline query doc ID, verify periodically
    }
    r = httpx.post("https://www.threads.net/api/graphql", data=payload, headers=HEADERS)
    return r.json()
```

The `lsd` token and `doc_id` are the two values that break scrapers when Meta rotates them. Pull `lsd` fresh from the homepage HTML on each session start. `doc_id` changes every few weeks: pin a version, monitor for 400s, and update.
Pagination works through a `page_info.end_cursor` field in the response. Loop until `has_next_page` is false or you hit your target row count.
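A minimal pagination loop, sketched under the assumption that the timeline lives at `data.mediaData.threads` with its cursor in `data.mediaData.page_info`. It takes a `fetch_page` callable (e.g. `lambda c: fetch_threads(user_id, c)`) so the loop stays transport-agnostic:

```python
from typing import Callable, Optional


def fetch_all_threads(fetch_page: Callable[[Optional[str]], dict], max_posts: int = 500) -> list:
    """Page through a timeline; fetch_page(cursor) returns one GraphQL response dict."""
    posts: list = []
    cursor: Optional[str] = None
    while len(posts) < max_posts:
        data = fetch_page(cursor)
        media = data.get("data", {}).get("mediaData", {})
        posts.extend(media.get("threads", []))
        page_info = media.get("page_info", {})
        # Stop when the server signals the end or omits a usable cursor
        if not page_info.get("has_next_page") or not page_info.get("end_cursor"):
            break
        cursor = page_info["end_cursor"]
    return posts[:max_posts]
```

The `max_posts` cap matters in practice: an unbounded loop against a high-volume account will burn through your per-IP budget in one run.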
## Handling Rate Limits and Detection
Threads’ bot mitigation in 2026 is considerably tighter than what the platform launched with. You’ll hit 429s within 50-100 requests per IP per hour on the GraphQL endpoint without mitigation. The detection signals Meta uses:
| Signal | What triggers it | Mitigation |
|---|---|---|
| Request cadence | Uniform intervals (e.g. exactly 2s) | Jitter: `random.uniform(1.8, 4.5)` |
| IP reputation | Datacenter ASNs | Residential or mobile proxies |
| TLS fingerprint | Non-browser ClientHello | Use httpx with HTTP/2 or curl-impersonate |
| Cookie absence | No `csrftoken` / `ig_did` | Bootstrap cookies from homepage |
| User-Agent mismatch | Desktop UA + mobile endpoint | Consistent mobile UA stack |
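The cadence row translates directly into code. A sketch of a request wrapper combining jittered pacing with exponential backoff on 429s; `client` is anything exposing a `.post(url, data=...)` method (e.g. an `httpx.Client`), and the jitter bounds are the ones from the table:

```python
import random
import time


def polite_post(client, url: str, data: dict, max_retries: int = 5):
    """POST with jittered pacing; back off exponentially when rate-limited."""
    for attempt in range(max_retries):
        # Randomized delay defeats uniform-interval detection
        time.sleep(random.uniform(1.8, 4.5))
        resp = client.post(url, data=data)
        if resp.status_code != 429:
            return resp
        # Honor Retry-After if the server sends one, else 30s, 60s, 120s, ...
        wait = float(resp.headers.get("Retry-After", 30 * 2 ** attempt))
        time.sleep(wait)
    raise RuntimeError(f"still rate-limited after {max_retries} attempts")
```

Treating a 429 as a signal to slow down, rather than retrying immediately, also keeps you on the right side of the self-rate-limiting point in the legal section below.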
For proxy selection, residential IPs from US or EU pools work reliably. Mobile IPs (carrier-grade NAT ranges) are the most durable because they share address space with genuine app traffic. Avoid datacenter ranges — Meta has extensive ASN blocklists. If you’re building serious infrastructure around Instagram-adjacent properties, the approach in How to Scrape Instagram Profiles and Posts Without Getting Blocked covers the full detection surface in more depth, including cookie rotation patterns that apply equally to Threads.
## Parsing the Response
The GraphQL response is nested and inconsistent — fields appear at different depths depending on whether you’re hitting the timeline, a single post, or a reply thread. A stable parsing pattern:
- Navigate to `data.mediaData.threads` (for timeline) or `data.data.containing_thread.thread_items` (for single post)
- Each item has a `post` object with `pk` (unique post ID), `user.username`, `caption.text`, `like_count`, `taken_at` (Unix timestamp)
- Reply counts live under `text_post_app_info.direct_reply_count`
- Quoted posts are nested under `text_post_app_info.share_info.quoted_post`
Write a defensive parser that checks for key existence before accessing nested fields. The schema shifts without notice, and an unhandled `KeyError` will crash your pipeline mid-run, stranding a partial dataset.
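A defensive accessor avoids scattering try/except around every field. A sketch using the field paths listed above (the `dig` helper is ours, not part of any library):

```python
from typing import Any


def dig(obj: Any, *path: str, default=None):
    """Walk nested dicts, returning default on any missing or non-dict step."""
    for key in path:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj


def parse_item(item: dict) -> dict:
    """Flatten one timeline thread item into a stable record."""
    post = item.get("post", {})
    return {
        "post_id": post.get("pk"),
        "username": dig(post, "user", "username"),
        "text": dig(post, "caption", "text"),
        "likes": post.get("like_count"),
        "taken_at": post.get("taken_at"),
        "reply_count": dig(post, "text_post_app_info", "direct_reply_count"),
        "quoted_post_id": dig(post, "text_post_app_info", "share_info", "quoted_post", "pk"),
    }
```

Missing fields come back as `None` instead of raising, so a schema shift degrades one column rather than killing the run.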
For storing output, write to newline-delimited JSON (`.ndjson`) so partial runs are recoverable. If you’re running a multi-account or keyword-sweep job, a simple SQLite table with `(post_id TEXT PRIMARY KEY, fetched_at INTEGER, raw_json TEXT)` is enough to deduplicate without a full database stack.
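The dedup table described above is a few lines with the standard-library `sqlite3` module; the primary-key constraint does the deduplication for you:

```python
import json
import sqlite3
import time


def open_store(path: str = "threads.db") -> sqlite3.Connection:
    """Open (or create) the dedup store with the schema described above."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts "
        "(post_id TEXT PRIMARY KEY, fetched_at INTEGER, raw_json TEXT)"
    )
    return conn


def save_post(conn: sqlite3.Connection, post_id: str, raw: dict) -> bool:
    """Insert a post; returns False if the ID was already seen."""
    try:
        conn.execute(
            "INSERT INTO posts VALUES (?, ?, ?)",
            (post_id, int(time.time()), json.dumps(raw)),
        )
        conn.commit()
        return True
    except sqlite3.IntegrityError:  # duplicate primary key
        return False
```

Storing the raw JSON alongside the ID means you can re-parse historical rows after a schema change without re-fetching anything.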
## Threads vs Other Decentralized and Semi-Open Platforms
Threads is ActivityPub-compatible (it joined the fediverse in late 2024), which means public posts are theoretically accessible via ActivityPub federation endpoints. In practice, Meta’s federation implementation is partial and rate-limited at the protocol level too. Compare this to genuinely open alternatives:
| Platform | Official API | ActivityPub / Open | Scraping difficulty |
|---|---|---|---|
| Threads | Limited (own content only) | Partial | High |
| Mastodon | Full REST API | Yes (full) | Low |
| Bluesky | Full AT Protocol API | AT Protocol | Low-Medium |
| Discord | Bot API (no public search) | No | Medium |
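Threads’ partial ActivityPub support can be probed directly. Discovery starts with a WebFinger lookup (RFC 7033); whether a given handle resolves depends on whether that account opted into fediverse sharing, so treat a 404 here as "not federated", not "doesn’t exist":

```python
from urllib.parse import urlencode


def webfinger_url(handle: str, domain: str = "threads.net") -> str:
    """Build the WebFinger discovery URL for handle@domain (RFC 7033)."""
    query = urlencode({"resource": f"acct:{handle}@{domain}"})
    return f"https://{domain}/.well-known/webfinger?{query}"
```

A successful response contains a `links` array whose `self` entry points at the ActivityPub actor document for the account.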
If your research covers multiple social platforms, you can often get cleaner data from Mastodon’s ActivityPub layer, as covered in How to Scrape Mastodon Federation Data 2026: ActivityPub Patterns. For Bluesky specifically, the AT Protocol gives you structured firehose access that Threads doesn’t come close to matching — see How to Scrape Bluesky AT Protocol Posts in 2026 (Official + Workaround). Discord sits in a different category entirely since it has no public post concept, but How to Scrape Discord Public Server Data Ethically in 2026 walks through what’s accessible without violating ToS.
Threads is objectively the hardest of these four to extract data from at scale, and the only one where you’re working against active countermeasures rather than just working around missing APIs.
## Staying Inside Legal and Ethical Boundaries
Threads’ Terms of Service prohibit automated data collection. The legal picture in 2026 is still shaped by *hiQ v. LinkedIn* (Ninth Circuit): scraping publicly accessible data generally falls outside the CFAA, but ToS violations can still generate cease-and-desist letters, breach-of-contract claims, and account bans. Practical risk management:
- Never scrape private accounts or gated content
- Respect `robots.txt`: `threads.net/robots.txt` disallows most API paths for crawlers
- Don’t store personally identifiable information beyond what your analysis requires
- Rate-limit yourself below what would constitute a DoS burden on the platform
- If you’re building a commercial product on this data, get legal review
The ethical line is less ambiguous than the legal one: scraping public posts to analyze public discourse is defensible. Bulk-harvesting user profiles to build contact databases is not.
## Bottom Line
For small-scale research (under 10,000 posts/day), the unofficial GraphQL approach with residential proxies and proper jitter is viable today. For production pipelines, budget for proxy costs, build in `doc_id` monitoring, and expect to patch your scraper every 4-6 weeks when Meta rotates endpoints. DRT will keep this guide updated as the Threads API surface and detection stack evolve — check back before any major pipeline build.