How to Scrape Medium Articles and Author Stats (2026)

Medium sits behind a partial JavaScript render and a soft login wall, making it one of the trickier publishing platforms to scrape at scale. the good news: Medium exposes a JSON API under the hood that most scrapers miss entirely, and with the right approach you can pull article metadata, clap counts, and author stats without touching a headless browser.

What Medium Actually Exposes (and What It Doesn’t)

Medium has no official public API, but its frontend fetches structured data from internal endpoints that return JSON when you append ?format=json to almost any URL. this works for publication pages, tag feeds, and individual post URLs.

example: https://medium.com/tag/python?format=json returns a fat JSON blob with story metadata, author handles, and clap counts. strip the leading ])}while(1);</x> security prefix before parsing.

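import json

import requests

def fetch_medium_json(url):
    resp = requests.get(
        url,
        params={"format": "json"},  # appended safely even if url already has a query string
        headers={"Accept": "application/json"},
        cookies={"uid": "", "sid": ""},  # empty auth may bypass some soft walls
        timeout=10,
    )
    resp.raise_for_status()
    raw = resp.text
    # strip the XSSI prefix, "])}while(1);</x>", by starting at the first "{"
    clean = raw[raw.index("{"):]
    return json.loads(clean)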

data = fetch_medium_json("https://medium.com/tag/machine-learning")
stories = data["payload"]["references"]["Post"]

what you can get without login:

  • post titles, subtitles, published timestamps
  • clap counts (rounded to nearest 100 above 1000)
  • author user ID and name
  • reading time estimate
  • tags and publication name

what requires a logged-in session cookie:

  • exact clap counts above 1000 (the public payload rounds these)
  • follower counts per author
  • response/comment counts on private stories
  • paywall content

Author Stats: The GraphQL Route

Medium’s app also uses a GraphQL endpoint at https://medium.com/_/graphql. it’s not officially documented but it’s stable enough that tools have relied on it since 2024. you can query author profile data including follower count, story count, and bio.

import requests

GRAPHQL_URL = "https://medium.com/_/graphql"
QUERY = """
query GetUserProfile($username: String!) {
  userResult(username: $username) {
    ... on User {
      name
      username
      followerCount
      postCount
      bio
      twitterScreenName
    }
  }
}
"""

resp = requests.post(
    GRAPHQL_URL,
    json={"query": QUERY, "variables": {"username": "target_author"}},
    headers={"Content-Type": "application/json", "Referer": "https://medium.com"},
    timeout=10,
)
resp.raise_for_status()  # fail loudly on 429s and other HTTP errors
print(resp.json())
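
a GraphQL server can answer HTTP 200 and still report failures in an "errors" list, so it helps to check the body before trusting it. the sketch below assumes the standard GraphQL response shape (top-level "data" and "errors" keys) and the userResult field from the query above. extract_user is a hypothetical helper name, and treating an empty userResult as "no such user" is an assumption, not documented Medium behavior.

```python
def extract_user(response_body):
    """Return the user dict from a GraphQL response body, or None if absent."""
    # GraphQL reports query failures in a top-level "errors" list
    errors = response_body.get("errors")
    if errors:
        messages = "; ".join(e.get("message", "unknown error") for e in errors)
        raise RuntimeError(f"GraphQL errors: {messages}")
    data = response_body.get("data") or {}
    # assumption: a missing or empty userResult means the username was not found
    return data.get("userResult") or None


sample = {"data": {"userResult": {"name": "Ada", "username": "ada", "followerCount": 1200}}}
user = extract_user(sample)
```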

rate limits kick in around 60 requests per minute per IP. for bulk author lookups, rotate IPs or add 1-2 second delays between requests. if you’re pulling author stats alongside article content at scale, the same proxy rotation setup you’d use for How to Scrape Pinterest Pin and Board Data at Scale (2026) applies cleanly here.
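
one way to stay under that ceiling is a small throttle that enforces a minimum gap between calls. this is a sketch, not a tuned client: Throttle and lookup_all are hypothetical names, and the default of 40 requests per minute (a 1.5 second gap) is an assumption chosen to sit inside the 1-2 second range above.

```python
import time


class Throttle:
    """Space out calls so they never exceed max_per_minute."""

    def __init__(self, max_per_minute=40, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def lookup_all(usernames, fetch, throttle):
    """Call fetch(username) for each name, pausing between calls."""
    results = {}
    for name in usernames:
        throttle.wait()
        results[name] = fetch(name)
    return results
```

pass your own fetch function (for example, one that posts the GraphQL query) as the second argument. the clock and sleep parameters exist only so the pacing can be tested without real waiting.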

Handling the Soft Login Wall and Paywalled Content

Medium shows a modal after 3-5 article views per session for non-logged-in users. it’s purely client-side, so simple HTTP requests to the JSON endpoints skip it entirely. the harder wall is the “Member-only story” paywall, which truncates content at around 500 words in the JSON payload.

options for paywall content:

  1. use a Medium member session cookie (your own account, terms permitting)
  2. target non-paywalled stories only (filter payload.value.virtuals.isPaywalled == false)
  3. use the RSS feed for publications that expose full text: https://medium.com/feed/publication-name
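
option 2 can be sketched as a small filter over post dicts. it assumes each post carries the virtuals.isPaywalled flag named above. free_posts is a hypothetical helper, and treating a missing flag as paywalled is a conservative assumption.

```python
def free_posts(posts):
    """Keep only posts whose virtuals.isPaywalled flag is explicitly False."""
    return [
        post for post in posts
        if (post.get("virtuals") or {}).get("isPaywalled") is False
    ]


sample_posts = [
    {"id": "a1", "virtuals": {"isPaywalled": False}},
    {"id": "b2", "virtuals": {"isPaywalled": True}},
    {"id": "c3"},  # flag missing: skipped to be safe
]
free_ids = [post["id"] for post in free_posts(sample_posts)]
```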

the RSS approach is underused. many Medium publications expose full article HTML in the feed’s content:encoded field, no login needed. parse it with feedparser + BeautifulSoup and you sidestep the JSON API complexity entirely for content extraction.
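
a dependency-free sketch of that extraction follows. note the swap: it uses the standard library (xml.etree.ElementTree and html.parser) in place of feedparser and BeautifulSoup so it runs anywhere. it assumes the feed carries article HTML in a content:encoded element under the usual RSS content namespace. the sample feed is made up, and parse_feed and html_to_text are hypothetical names.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

CONTENT_NS = "{http://purl.org/rss/1.0/modules/content/}"


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # join with spaces so adjacent paragraphs do not run together,
    # then collapse repeated whitespace
    return " ".join(" ".join(parser.parts).split())


def parse_feed(xml_text):
    """Return title, link, and plain text for each <item> in an RSS feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        # fall back to <description> when content:encoded is absent
        body = item.findtext(CONTENT_NS + "encoded") or item.findtext("description") or ""
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "text": html_to_text(body),
        })
    return items


SAMPLE_FEED = """<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <title>Hello</title>
      <link>https://medium.com/p/abc123</link>
      <content:encoded><![CDATA[<p>First <b>bold</b> paragraph.</p>]]></content:encoded>
    </item>
  </channel>
</rss>"""

articles = parse_feed(SAMPLE_FEED)
```

in a real run, xml_text would be the body of a request to https://medium.com/feed/publication-name.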

for JavaScript-heavy author pages where the JSON trick doesn’t work, Playwright with a persistent browser profile is the fallback. similar patterns apply when you’re pulling from developer-focused platforms like How to Scrape Dev.to Public Articles at Scale (2026) or How to Scrape Hashnode Tech Blog Posts (2026), where the rendering stack differs but the proxy and session management challenges are the same.

Proxy and Rate-Limit Strategy

Medium’s backend detects datacenter IPs aggressively. requests from AWS/GCP/Azure ranges get 429s or redirects to the login page within 10-20 requests. residential or mobile proxies are required for any volume above toy-scale.

Proxy Type           | Success Rate (Medium) | Cost/GB | Best For
Datacenter           | ~20%                  | $0.50-1 | Small tests only
Residential rotating | ~85%                  | $3-8    | Tag feeds, author lookup
Mobile rotating      | ~95%                  | $8-15   | Paywall bypass, high-volume
Static residential   | ~75%                  | $2-5    | Session-based scraping

key settings to reduce blocks:

  • set Accept-Language: en-US,en;q=0.9 and a real browser User-Agent
  • include Referer: https://medium.com on all requests
  • rotate session cookies alongside IPs, not just IPs alone
  • avoid scraping the same author page more than once per 10 minutes
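
the header and rotation settings above can be sketched as a pool of "identities", each pairing one proxy with its own headers and cookie jar so a session cookie is never replayed from a different IP. build_headers and identity_pool are hypothetical names, and the proxy URLs and User-Agent string are placeholders to replace with your own.

```python
from itertools import cycle


def build_headers(user_agent):
    """Headers from the checklist above: language, referer, browser UA."""
    return {
        "User-Agent": user_agent,
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": "https://medium.com",
    }


def identity_pool(proxies, user_agents):
    """Cycle forever over bundles that tie a proxy to its own cookies."""
    bundles = [
        {"proxy": proxy, "headers": build_headers(ua), "cookies": {}}
        for proxy, ua in zip(proxies, cycle(user_agents))
    ]
    return cycle(bundles)


# placeholder values, not working endpoints
PROXIES = ["http://proxy-a.example:8000", "http://proxy-b.example:8000"]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
]

pool = identity_pool(PROXIES, USER_AGENTS)
first = next(pool)
```

each bundle maps onto a requests call as proxies={"https": bundle["proxy"]}, headers=bundle["headers"], cookies=bundle["cookies"].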

Structuring the Output for Analysis

raw Medium JSON is deeply nested. flatten it before storing. useful fields to extract per article:

  • id (stable post ID, use as deduplication key)
  • title, subtitle
  • createdAt, updatedAt (Unix ms, divide by 1000 for seconds)
  • virtuals.totalClapCount
  • virtuals.responsesCreatedCount
  • virtuals.readingTime
  • creator.userId
  • primaryTag.name
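
a minimal flattening sketch for the fields listed above. it assumes each post dict is laid out exactly as the list implies (for example virtuals.totalClapCount and creator.userId). the output key names, flatten_post, and dedupe are hypothetical, and missing fields come back as None rather than raising.

```python
from datetime import datetime, timezone


def ms_to_datetime(ms):
    """Convert Unix milliseconds to a timezone-aware UTC datetime."""
    if ms is None:
        return None
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


def flatten_post(post):
    """Pull the listed fields out of one nested post dict into a flat row."""
    virtuals = post.get("virtuals") or {}
    creator = post.get("creator") or {}
    tag = post.get("primaryTag") or {}
    return {
        "post_id": post.get("id"),
        "title": post.get("title"),
        "subtitle": post.get("subtitle"),
        "created_at": ms_to_datetime(post.get("createdAt")),
        "updated_at": ms_to_datetime(post.get("updatedAt")),
        "clap_count": virtuals.get("totalClapCount"),
        "response_count": virtuals.get("responsesCreatedCount"),
        "reading_time_mins": virtuals.get("readingTime"),
        "author_id": creator.get("userId"),
        "primary_tag": tag.get("name"),
    }


def dedupe(rows):
    """Keep one row per post_id; a later row replaces an earlier one."""
    return list({row["post_id"]: row for row in rows}.values())


sample_post = {
    "id": "abc123",
    "title": "Hello",
    "createdAt": 1700000000000,
    "virtuals": {"totalClapCount": 420, "readingTime": 3.5},
    "creator": {"userId": "u1"},
    "primaryTag": {"name": "python"},
}
row = flatten_post(sample_post)
```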

for trend analysis across a tag or publication, store snapshots over time rather than a single pull. clap counts on Medium articles continue growing for weeks after publish. a weekly snapshot gives you velocity data that a one-shot pull misses entirely.
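
the velocity idea reduces to a difference between consecutive snapshots divided by the time between them. the sketch below assumes you keep (scraped_at, clap_count) pairs per post. clap_velocity is a hypothetical name and the numbers are invented for illustration.

```python
from datetime import datetime, timezone


def clap_velocity(snapshots):
    """Return (timestamp, claps gained per day) for each consecutive pair.

    snapshots: list of (scraped_at, clap_count) tuples for one post.
    """
    ordered = sorted(snapshots, key=lambda snap: snap[0])
    rates = []
    for (t0, c0), (t1, c1) in zip(ordered, ordered[1:]):
        days = (t1 - t0).total_seconds() / 86400
        if days > 0:  # skip duplicate timestamps
            rates.append((t1, (c1 - c0) / days))
    return rates


weekly = [
    (datetime(2026, 1, 1, tzinfo=timezone.utc), 100),
    (datetime(2026, 1, 8, tzinfo=timezone.utc), 450),
    (datetime(2026, 1, 15, tzinfo=timezone.utc), 520),
]
rates = clap_velocity(weekly)
# 50.0 claps/day in the first week, 10.0 in the second
```

a falling rate like this one is the signal a one-shot pull cannot show: the post is still gaining claps, but most of its growth happened in week one.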

if you’re aggregating across multiple publishing platforms for content intelligence, the approach DRT covers for How to Scrape Quora Questions and Answers Programmatically (2026) pairs well here: Quora gives you question demand signals, Medium gives you long-form supply and author authority data. and if you’re building a news aggregation layer on top, see How to Scrape Google News Articles in 2026 for pulling syndicated coverage that often surfaces Medium posts.

recommended storage schema (PostgreSQL):

CREATE TABLE medium_articles (
    post_id TEXT PRIMARY KEY,
    title TEXT,
    author_id TEXT,
    publication TEXT,
    primary_tag TEXT,
    clap_count INTEGER,
    response_count INTEGER,
    reading_time_mins FLOAT,
    is_paywalled BOOLEAN,
    published_at TIMESTAMPTZ,
    scraped_at TIMESTAMPTZ DEFAULT now()
);

Bottom Line

the fastest path to Medium data in 2026 is the ?format=json suffix trick plus the GraphQL author endpoint, with residential rotating proxies and a 1-2 second delay between requests. skip headless browsers unless you hit JavaScript-heavy author pages where the JSON trick fails. dataresearchtools.com covers the full stack for publisher and social platform scraping. if Medium is one node in a larger content intelligence pipeline, the proxy and normalization patterns here transfer directly to every other platform in the stack.
