How to Scrape Threads (Meta) Public Posts and Profiles (2026)

Threads crossed 300 million monthly active users in early 2026, and if you’re building social listening tools, competitive intelligence pipelines, or brand monitoring systems, you will likely need its data. Meta has made this harder than it should be: the public API covers little beyond your own content, bot detection is aggressive, and the GraphQL layer shifts regularly. Here’s what actually works in 2026.

What Meta Exposes (and What It Doesn’t)

Threads launched a limited API in late 2023 under the Instagram Graph API umbrella. By 2026, the official API covers:

  • Your own account’s posts and replies (requires user auth)
  • Basic profile metadata for public accounts
  • Post insights (impressions, likes, replies) for your own content

What it does not cover: search by keyword, hashtag timelines, follower graphs, or bulk profile enumeration. If your use case goes beyond reading your own content back, you’re working outside the official surface.

The unofficial path uses Threads’ internal GraphQL API, the same endpoints the mobile app hits. The base is https://www.threads.net/api/graphql with a fixed x-ig-app-id header (238260118697367 as of mid-2026). These endpoints are unauthenticated for public content, but Meta rate-limits by IP aggressively — more on mitigation below.

Fetching Public Profiles and Posts

For a single public profile, the simplest approach is a direct GraphQL query against the threads_timeline_list_feed_query operation. You need at least the four headers below:

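import json
import re
from typing import Optional

import httpx

HEADERS = {
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 17_4 like Mac OS X) AppleWebKit/605.1.15",
    "x-ig-app-id": "238260118697367",
    "Accept-Language": "en-US,en;q=0.9",
    "Content-Type": "application/x-www-form-urlencoded",
}

def get_user_id(username: str) -> Optional[str]:
    """Return the numeric user ID found in a public profile page, or None."""
    url = f"https://www.threads.net/@{username}"
    r = httpx.get(url, headers=HEADERS, follow_redirects=True, timeout=15.0)
    r.raise_for_status()
    # The ID is embedded in inline JSON inside the profile HTML
    match = re.search(r'"user_id":"(\d+)"', r.text)
    return match.group(1) if match else None

def fetch_threads(user_id: str, cursor: Optional[str] = None) -> dict:
    """Fetch one page of a user's timeline; pass the previous end_cursor to continue."""
    # json.dumps escapes the values safely instead of hand-building the JSON string
    variables = json.dumps({"userID": user_id, "after": cursor or ""}, separators=(",", ":"))
    payload = {
        "lsd": "AVqbxe3J_LA",  # rotate this from homepage fetch
        "variables": variables,
        "doc_id": "7357086314335024",  # timeline query doc ID, verify periodically
    }
    r = httpx.post(
        "https://www.threads.net/api/graphql",
        data=payload,
        headers=HEADERS,
        timeout=15.0,
    )
    r.raise_for_status()  # surfaces 400s (stale doc_id) and 429s (rate limit) early
    return r.json()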

The lsd token and doc_id are the two values that break scrapers when Meta rotates them. Pull lsd fresh from the homepage HTML on each session start. doc_id changes every few weeks — pin a version, monitor for 400s, and update.
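Refreshing lsd could look roughly like the sketch below. It assumes the token sits in the homepage HTML as a "LSD",[],{"token":"..."} fragment, which is a pattern seen on Meta web properties but is not confirmed here for Threads, so verify it against a live page. The helper name extract_lsd is hypothetical.

```python
import re
from typing import Optional

# Assumed shape of the inline token fragment; check a real homepage response.
LSD_PATTERN = re.compile(r'"LSD",\[\],\{"token":"([^"]+)"')

def extract_lsd(html: str) -> Optional[str]:
    """Return the lsd token found in the page HTML, or None if the pattern is absent."""
    match = LSD_PATTERN.search(html)
    return match.group(1) if match else None
```

Returning None rather than raising lets the caller decide whether a missing token means "retry the homepage fetch" or "the page layout changed and the pattern needs updating".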

Pagination works through a page_info.end_cursor field in the response. Loop until has_next_page is false or you hit your target row count.
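A minimal sketch of that loop follows. Because the exact depth of page_info in the response is not given here, the hypothetical helper find_page_info searches for it recursively; iter_pages is also a hypothetical name, and the page cap is an arbitrary safety limit.

```python
from typing import Any, Callable, Iterator, Optional

def find_page_info(node: Any) -> Optional[dict]:
    """Depth-first search for the first dict stored under a 'page_info' key."""
    if isinstance(node, dict):
        info = node.get("page_info")
        if isinstance(info, dict):
            return info
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_page_info(child)
        if found is not None:
            return found
    return None

def iter_pages(
    fetch_page: Callable[[Optional[str]], dict], max_pages: int = 50
) -> Iterator[dict]:
    """Yield pages until has_next_page is false, the cursor is missing, or max_pages is hit."""
    cursor: Optional[str] = None
    for _ in range(max_pages):
        page = fetch_page(cursor)
        yield page
        info = find_page_info(page) or {}
        cursor = info.get("end_cursor")
        if not info.get("has_next_page") or not cursor:
            break
```

With the fetch_threads function above, this could be driven as iter_pages(lambda c: fetch_threads(user_id, c)). Passing the fetcher in as a callable keeps the loop testable without network access.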

Handling Rate Limits and Detection

Threads’ bot mitigation in 2026 is considerably tighter than what the platform launched with. You’ll hit 429s within 50-100 requests per IP per hour on the GraphQL endpoint without mitigation. The detection signals Meta uses:

Signal | What triggers it | Mitigation
Request cadence | Uniform intervals (e.g. exactly 2s) | Jitter: random.uniform(1.8, 4.5)
IP reputation | Datacenter ASNs | Residential or mobile proxies
TLS fingerprint | Non-browser ClientHello | curl-impersonate (httpx with HTTP/2 alone still sends a Python TLS ClientHello)
Cookie absence | No csrftoken / ig_did | Bootstrap cookies from homepage
User-Agent mismatch | Desktop UA + mobile endpoint | Consistent mobile UA stack
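The request-cadence row can be sketched as two small helpers: one for the normal gap between requests, one for backing off after a 429. The jitter range comes from the table; the backoff base and cap are illustrative assumptions, not measured values, and both function names are hypothetical.

```python
import random

def jittered_delay(low: float = 1.8, high: float = 4.5) -> float:
    """Seconds to wait between ordinary requests, drawn uniformly to avoid a fixed cadence."""
    return random.uniform(low, high)

def backoff_delay(attempt: int, base: float = 30.0, cap: float = 900.0) -> float:
    """Seconds to wait after the nth consecutive 429 (attempt starts at 0).

    The ceiling doubles with each attempt up to `cap`; the actual wait is
    drawn from the upper half of that window so retries stay spread out.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(ceiling / 2, ceiling)
```

In a fetch loop you would call time.sleep(jittered_delay()) after each successful request and time.sleep(backoff_delay(n)) after the nth consecutive 429, resetting n once a request succeeds.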

For proxy selection, residential IPs from US or EU pools work reliably. Mobile IPs (carrier-grade NAT ranges) are the most durable because they share address space with genuine app traffic. Avoid datacenter ranges — Meta has extensive ASN blocklists. If you’re building serious infrastructure around Instagram-adjacent properties, the approach in How to Scrape Instagram Profiles and Posts Without Getting Blocked covers the full detection surface in more depth, including cookie rotation patterns that apply equally to Threads.

Parsing the Response

The GraphQL response is nested and inconsistent — fields appear at different depths depending on whether you’re hitting the timeline, a single post, or a reply thread. A stable parsing pattern:

  1. Navigate to data.mediaData.threads (for timeline) or data.data.containing_thread.thread_items (for single post)
  2. Each item has a post object with pk (unique post ID), user.username, caption.text, like_count, taken_at (Unix timestamp)
  3. Reply counts live under text_post_app_info.direct_reply_count
  4. Quoted posts are nested under text_post_app_info.share_info.quoted_post

Write a defensive parser that checks for key existence before accessing nested fields. The schema shifts without notice, and a single unhandled KeyError will kill your pipeline mid-run and leave you with partial output.
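One way to sketch such a parser is a small path-walking helper plus a per-item extractor. The field names are the ones listed above; placing text_post_app_info directly under post is an assumption, and dig and parse_post are hypothetical names.

```python
from typing import Any, Optional

def dig(node: Any, *path: str, default: Any = None) -> Any:
    """Walk nested dicts along `path`; return `default` if any step is missing or null."""
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node if node is not None else default

def parse_post(item: dict) -> Optional[dict]:
    """Flatten one thread item into a row, or return None if it has no usable post."""
    post = dig(item, "post")
    if not isinstance(post, dict) or "pk" not in post:
        return None
    return {
        "post_id": str(post["pk"]),
        "username": dig(post, "user", "username"),
        "text": dig(post, "caption", "text", default=""),
        "like_count": dig(post, "like_count", default=0),
        "taken_at": dig(post, "taken_at"),
        # Assumed location: text_post_app_info nested under the post object
        "reply_count": dig(post, "text_post_app_info", "direct_reply_count", default=0),
    }
```

Items that come back as None are worth logging with their raw payload, since a sudden spike in them is usually the first sign of a schema change.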

For storing output, write to newline-delimited JSON (.ndjson) so partial runs are recoverable. If you’re running a multi-account or keyword-sweep job, a simple SQLite table with (post_id TEXT PRIMARY KEY, fetched_at INTEGER, raw_json TEXT) is enough to deduplicate without a full database stack.
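A minimal sketch of that storage layer, using only the standard library. The table columns follow the schema above; the table name posts and the function names are assumptions made for illustration.

```python
import json
import sqlite3
import time

def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the dedup table described above."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "post_id TEXT PRIMARY KEY, fetched_at INTEGER, raw_json TEXT)"
    )
    return conn

def save_post(conn: sqlite3.Connection, post_id: str, raw: dict) -> bool:
    """Insert a post once; return True if it was new, False if already stored."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO posts (post_id, fetched_at, raw_json) VALUES (?, ?, ?)",
        (post_id, int(time.time()), json.dumps(raw)),
    )
    conn.commit()
    return cur.rowcount == 1

def append_ndjson(path: str, record: dict) -> None:
    """Append one record as a single JSON line so partial runs stay readable."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Calling append_ndjson only when save_post returns True keeps the .ndjson file free of duplicates across restarts.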

Threads vs Other Decentralized and Semi-Open Platforms

Threads is ActivityPub-compatible (it joined the fediverse in late 2024), which means public posts are theoretically accessible via ActivityPub federation endpoints. In practice, Meta’s federation implementation is partial and rate-limited at the protocol level too. Compare this to genuinely open alternatives:

Platform | Official API | ActivityPub / Open | Scraping difficulty
Threads | Limited (own content only) | Partial | High
Mastodon | Full REST API | Yes (full) | Low
Bluesky | Full AT Protocol API | AT Protocol | Low-Medium
Discord | Bot API (no public search) | No | Medium

If your research covers multiple social platforms, you can often get cleaner data from Mastodon’s ActivityPub layer, as covered in How to Scrape Mastodon Federation Data 2026: ActivityPub Patterns. For Bluesky specifically, the AT Protocol gives you structured firehose access that Threads doesn’t come close to matching — see How to Scrape Bluesky AT Protocol Posts in 2026 (Official + Workaround). Discord sits in a different category entirely since it has no public post concept, but How to Scrape Discord Public Server Data Ethically in 2026 walks through what’s accessible without violating ToS.

Threads is objectively the hardest of these four to extract data from at scale, and the only one where you’re working against active countermeasures rather than just working around missing APIs.

Staying Inside Legal and Ethical Boundaries

Threads’ Terms of Service prohibit automated data collection. The legal picture in 2026 is still shaped by hiQ v. LinkedIn (Ninth Circuit): scraping publicly accessible data likely does not violate the Computer Fraud and Abuse Act, but ToS violations can still lead to breach-of-contract claims, cease-and-desist letters, and account bans. Practical risk management:

  • Never scrape private accounts or gated content
  • Respect robots.txt: threads.net/robots.txt disallows most API paths for crawlers
  • Don’t store personally identifiable information beyond what your analysis requires
  • Rate-limit yourself below what would constitute a DoS burden on the platform
  • If you’re building a commercial product on this data, get legal review
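The robots.txt check from the list above can be automated with the standard library’s urllib.robotparser. The sketch below parses rules from a string so it runs offline; SAMPLE_RULES is a made-up example, not the actual contents of threads.net/robots.txt, and is_allowed is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; fetch the real file before relying on it.
SAMPLE_RULES = """User-agent: *
Disallow: /api/
"""

def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Running this check once per target path at job start, and logging the result, gives you a record of what the file said when you collected the data.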

The ethical line is less ambiguous than the legal one: scraping public posts to analyze public discourse is defensible. Bulk-harvesting user profiles to build contact databases is not.

Bottom Line

For small-scale research (under 10,000 posts/day), the unofficial GraphQL approach with residential proxies and proper jitter is viable today. For production pipelines, budget for proxy costs, build in doc_id monitoring, and expect to patch your scraper every 4-6 weeks when Meta rotates endpoints. DRT will keep this guide updated as the Threads API surface and detection stack evolve — check back before any major pipeline build.
