How to Scrape SoundCloud Artist + Track Data (2026)

SoundCloud exposes more public data than most developers realize, and scraping SoundCloud artist and track metadata is one of the more straightforward audio platform tasks if you approach it correctly. The catch: SoundCloud’s unofficial API still works in 2026, but rate limits are aggressive and client IDs rotate without warning.

What Data Is Actually Available

SoundCloud’s public pages expose a clean set of structured data without authentication:

  • Artist profiles: follower count, track count, description, avatar URL, verified status
  • Track metadata: play count, like count, comment count, duration, genre, tags, waveform data
  • Playlists and sets: track lists, curator info, like counts
  • Comments: text, timestamps, user handles
  • Related tracks and recommendations

Private tracks return 401s. Monetized tracks behind a paywall may load metadata but block stream URLs. Everything else is fair game via the web app or the unofficial API endpoint.

The Unofficial API Approach

SoundCloud doesn’t publish a public API in 2026 (their developer program is effectively closed), but the web app talks to api-v2.soundcloud.com using a rotating client_id param. You can extract this from any web session.

import httpx
import re

def get_client_id(session: httpx.Client) -> str:
    r = session.get("https://soundcloud.com")
    r.raise_for_status()
    # The client_id lives in one of the JS bundles; check each in order
    bundles = re.findall(r'https://a-v2\.sndcdn\.com/assets/[^"]+\.js', r.text)
    if not bundles:
        raise ValueError("no JS bundles found on the landing page")
    for bundle in bundles:
        js = session.get(bundle).text
        match = re.search(r'client_id:"([a-zA-Z0-9]+)"', js)
        if match:
            return match.group(1)
    raise ValueError("client_id not found in any bundle")

def get_track(client_id: str, url: str, session: httpx.Client) -> dict:
    # /resolve maps any public SoundCloud URL to its full API record
    endpoint = "https://api-v2.soundcloud.com/resolve"
    r = session.get(endpoint, params={"url": url, "client_id": client_id})
    r.raise_for_status()
    return r.json()

The client_id extraction works reliably, but the value rotates every few days. Cache it with a 48-hour TTL and refresh whenever a request comes back 401. For artist pages, swap /resolve for /users/{user_id}/tracks with limit and linked_partitioning=1 for cursor-based pagination.
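A minimal caching wrapper along those lines. The class and its fetch parameter are illustrative names, not part of any SoundCloud library; fetch is any callable that extracts a fresh client_id, such as the get_client_id() helper above.

```python
import time

class ClientIdCache:
    """Cache an extracted client_id with a TTL, refreshing on expiry or 401."""

    def __init__(self, fetch, ttl_seconds=48 * 3600):
        self._fetch = fetch          # callable returning a fresh client_id
        self._ttl = ttl_seconds      # 48h, per the recommendation above
        self._value = None
        self._fetched_at = 0.0

    def get(self) -> str:
        # Re-extract when we have nothing cached or the TTL has lapsed
        if self._value is None or time.time() - self._fetched_at > self._ttl:
            self._value = self._fetch()
            self._fetched_at = time.time()
        return self._value

    def invalidate(self):
        """Call this on a 401 response: the ID has rotated early."""
        self._value = None
```

On a 401, call invalidate() and retry the request once with the freshly fetched ID before giving up.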

If you’re comparing approaches across audio platforms, the SoundCloud unofficial API is more stable than scraping Apple Music’s JS-rendered pages — see How to Scrape Apple Music Charts and Playlists (2026) for context on how brittle those can get.

Pagination and Rate Limits

SoundCloud’s API uses cursor pagination via next_href in responses. A full artist discography can run into thousands of tracks for major labels.

  1. Fetch /users/{id}/tracks?limit=200&linked_partitioning=1
  2. Pull next_href from the response body
  3. Append &client_id=... to it and repeat
  4. Stop when next_href is null

Rate limits are session-based rather than IP-based in most configurations, but heavy scraping triggers a temporary 429 at around 200-300 requests per minute. A realistic safe rate is 60-80 req/min per client ID, with jitter. For large-scale collection, rotate client IDs alongside proxies; residential proxies from Singapore or the EU avoid the geo-based throttle triggers you'll see with datacenter IPs.
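One simple way to hold that 60-80 req/min band is a jittered sleep between requests. The ±30% jitter range is my choice, not a SoundCloud requirement:

```python
import random
import time

def polite_sleep(target_rpm: int = 70):
    """Sleep long enough to average target_rpm, with +/-30% jitter so the
    request cadence doesn't look machine-regular. 70 rpm sits inside the
    60-80 req/min band suggested above."""
    base = 60.0 / target_rpm
    time.sleep(base * random.uniform(0.7, 1.3))
```

Call it once after every successful request; on a 429, back off for a minute or two before resuming instead of hammering through it.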

The same cursor pattern applies on Spotify’s public endpoints — How to Scrape Spotify Public Data (2026): Playlists, Artists, Charts covers the next cursor in detail if you’re building a cross-platform pipeline.

Tools and Library Comparison

| Tool | Best for | Renders JS | Handles pagination | Maintained |
| --- | --- | --- | --- | --- |
| httpx + manual | Custom pipelines, speed | No | Manual | Yes |
| soundcloud-v2 (PyPI) | Quick extraction | No | Yes | Partial (2024) |
| Playwright | JS-heavy fallback | Yes | Manual | Yes |
| Apify SoundCloud Actor | No-code / managed | Yes | Yes | Yes |
| Scrapy + middleware | Large-scale batch | No | Custom | Yes |

The soundcloud-v2 library on PyPI handles client ID rotation and pagination but its last meaningful commit was mid-2024. It works for small jobs but you’ll want to fork or replace it for production use. For anything over 50k tracks, Scrapy with a rotating proxy middleware and a Redis-backed scheduler is the right architecture.
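A settings sketch for that architecture, assuming the scrapy-redis package for the Redis-backed scheduler and the scrapy-rotating-proxies middleware for proxy rotation. The class paths follow those packages' documentation; the concurrency and delay numbers are starting points to tune against your proxy pool, not tested production values.

```python
# settings.py sketch for a large-batch SoundCloud crawl

# Redis-backed scheduler + dedupe (scrapy-redis package)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True                  # survive restarts mid-crawl
REDIS_URL = "redis://localhost:6379/0"

# Rotating proxy middleware (scrapy-rotating-proxies package)
DOWNLOADER_MIDDLEWARES = {
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}
ROTATING_PROXY_LIST_PATH = "proxies.txt"  # one proxy URL per line

# Stay near the safe rate discussed earlier
CONCURRENT_REQUESTS = 8
DOWNLOAD_DELAY = 0.8                      # ~75 req/min with some burstiness
RANDOMIZE_DOWNLOAD_DELAY = True
RETRY_HTTP_CODES = [429, 500, 502, 503]
```

With SCHEDULER_PERSIST enabled, a 50k-track crawl that dies halfway resumes from the Redis queue instead of starting over.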

If you’re also collecting listener behavior data, How to Scrape Last.fm Listening Data and Artist Metadata (2026) is worth reading alongside this — Last.fm has a real public API that gives scrobble counts SoundCloud doesn’t expose.

Anti-Bot Countermeasures and Workarounds

SoundCloud’s bot detection is lighter than Spotify’s or Apple’s, but it’s not absent:

  • TLS fingerprinting: Use httpx with HTTP/2 or curl_cffi with a Chrome impersonation profile. A plain requests session with only a spoofed User-Agent header is often flagged.
  • Cookie requirements: The API accepts requests without cookies in most cases, but some endpoints 403 without a valid sc_anonymous_id cookie. Set it to any UUID-format string on session init.
  • Waveform and stream URLs: These are signed CDN URLs that expire in 30 minutes. Don’t store them long-term — store the track ID and re-resolve.
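A session-setup sketch covering the first two points. The cookie domain and the lazy import are my choices; the impersonate argument follows curl_cffi's documented API, but verify the profile name against the version you install.

```python
import uuid

def sc_anonymous_id() -> str:
    """Any UUID-format string satisfies the cookie check described above."""
    return str(uuid.uuid4())

def make_session():
    # Imported lazily so the helper above works even without curl_cffi
    # installed; impersonate="chrome" applies a Chrome TLS fingerprint.
    from curl_cffi import requests as cffi_requests

    s = cffi_requests.Session(impersonate="chrome")
    s.cookies.set("sc_anonymous_id", sc_anonymous_id(), domain=".soundcloud.com")
    return s
```

Create one session per client ID and reuse it across requests so the TLS fingerprint and cookie stay consistent.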

For independent label and artist research use cases, pairing SoundCloud data with Bandcamp sales signals gives a more complete picture of an artist’s commercial traction. How to Scrape Bandcamp Artist Pages and Sales Data (2026) covers that side of the stack.

One broader pattern worth noting: the same client ID extraction and rate-limit handling logic used here applies to dozens of platforms that expose data through internal XHR endpoints rather than documented APIs. The techniques in How to Scrape ZoomInfo Without Account: Public Data Strategies (2026) are a good reference for the general approach to reverse-engineering internal API calls.

Storing and Structuring the Output

A minimal track record worth persisting:

{
  "id": 1234567890,
  "permalink_url": "https://soundcloud.com/artist/track-slug",
  "title": "Track Title",
  "user": {"id": 111, "username": "artist-handle"},
  "playback_count": 480200,
  "likes_count": 12300,
  "comment_count": 340,
  "duration": 213000,
  "genre": "Electronic",
  "tag_list": "techno berlin underground",
  "created_at": "2025-11-14T18:22:00Z",
  "reposts_count": 890
}

Store raw JSON first, normalize later. SoundCloud’s schema is relatively stable but playback_count goes null on some older tracks — handle it with a default of 0, not a missing key error. For time-series analysis, scrape the same artist every 7 days and diff the play counts rather than trying to pull historical data (the API doesn’t expose it).
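A normalization sketch showing the null playback_count handling. The output field names are illustrative choices for a downstream table, not SoundCloud's schema:

```python
def normalize_track(raw: dict) -> dict:
    """Flatten a raw api-v2 track record into a stable row shape."""
    return {
        "id": raw["id"],
        "title": raw.get("title", ""),
        "artist": raw.get("user", {}).get("username"),
        # playback_count comes back null on some older tracks:
        # default to 0 rather than raising on a missing/None value
        "plays": raw.get("playback_count") or 0,
        "likes": raw.get("likes_count") or 0,
        "tags": (raw.get("tag_list") or "").split(),
    }
```

Run this at read time over the stored raw JSON, so a schema tweak never requires re-scraping.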

Bottom line

SoundCloud is one of the more accessible audio platforms to scrape in 2026 — the unofficial API returns clean JSON, pagination is well-structured, and bot detection is manageable with curl_cffi and modest rate limiting. Prioritize client ID refresh logic from day one; everything else is standard data engineering. DRT covers the full audio and media data stack if you’re building cross-platform pipelines that need Spotify, Apple Music, Bandcamp, and Last.fm alongside SoundCloud.
