Scala sits in an odd spot for web scraping: powerful enough for production data pipelines, yet rarely the first tool engineers reach for when they need to pull structured data from the web. That gap is worth closing. With sttp 4.x and Jsoup 1.17+, you get a type-safe HTTP client, a battle-tested HTML parser, and the full JVM ecosystem under one roof — a legitimate choice for teams already running Scala microservices or Spark-based data infrastructure in 2026.
Why Scala for Scraping in 2026
The honest case for Scala scraping is not that it beats Python on ergonomics — it doesn’t. The case is integration: if your pipeline ends in Spark, Flink, or a Scala-based data lake, keeping the scraper in the same language eliminates a serialization boundary and a runtime dependency. You also get strong typing on your extracted models, which catches schema drift at compile time rather than at 2am when a site redesign breaks your production job.
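To make that last point concrete, here is a minimal sketch of a typed extraction model. `Product` and `parsePrice` are illustrative names, not from any library: the idea is that malformed input fails loudly at parse time instead of silently corrupting downstream data.

```scala
// Typed extraction model: a redesign that changes field shapes now fails
// at compile time or as a loggable parse failure, not as silent drift.
// Product and parsePrice are illustrative names, not part of any library.
case class Product(title: String, priceCents: Long, inStock: Boolean)

object Product {
  // Parse a raw price string like "$19.99" into cents; None on junk input.
  def parsePrice(raw: String): Option[Long] = {
    val cleaned = raw.trim.stripPrefix("$").replace(",", "")
    scala.util.Try(BigDecimal(cleaned)).toOption.map(d => (d * 100).toLong)
  }
}
```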
Compared to async Rust scrapers (covered in depth in Web Scraping with Reqwest + Tokio in Rust: Async Patterns (2026)), Scala trades raw throughput for a richer ecosystem and easier team onboarding. Compared to Go's Colly (Go Web Scraping with Colly v2: Production Patterns for 2026), it's more verbose but gives you cats-effect or ZIO for principled concurrency.
Setting Up sttp + Jsoup
Add these to your build.sbt:
```scala
libraryDependencies ++= Seq(
  "com.softwaremill.sttp.client4" %% "core"           % "4.0.0-RC1",
  "com.softwaremill.sttp.client4" %% "okhttp-backend" % "4.0.0-RC1",
  "org.jsoup"                      % "jsoup"          % "1.17.2"
)
```

sttp 4 ships synchronous backends (OkHttp, Apache HttpClient 5) and async backends for cats-effect, ZIO, and Pekko. For a simple scraper, the synchronous OkHttp backend is the least ceremony. For a crawler that hits hundreds of URLs concurrently, wire in the cats-effect backend with a semaphore-controlled rate limiter.
A minimal fetch-and-parse loop looks like this:
```scala
import sttp.client4.*
import sttp.client4.okhttp.OkHttpSyncBackend
import org.jsoup.Jsoup

val backend = OkHttpSyncBackend()

val response = basicRequest
  .get(uri"https://example.com/products")
  .header("User-Agent", "Mozilla/5.0 (compatible; DataBot/1.0)")
  .send(backend)

response.body match {
  case Right(html) =>
    val doc = Jsoup.parse(html)
    val items = doc.select(".product-title").eachText()
    items.forEach(println)
  case Left(err) =>
    System.err.println(s"request failed: $err")
}
```

Jsoup's CSS selector API (select, attr, text, eachText) covers 95% of extraction tasks. For sites that serialize data into JSON script blocks, use doc.select("script[type=application/json]").first().data() and hand the string to circe or uPickle.
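A sketch of that script-tag route, decoding with uPickle. The embedded HTML shape (a `products` array) and the `Item` model are invented for illustration; real pages will need their own model:

```scala
import org.jsoup.Jsoup
import upickle.default.*

case class Item(name: String, price: Double)
object Item { implicit val rw: ReadWriter[Item] = macroRW }

// Hypothetical page: the site embeds its product list as JSON in a script tag.
val html =
  """<html><body>
    |<script type="application/json">
    |  {"products": [{"name": "Widget", "price": 9.99}]}
    |</script>
    |</body></html>""".stripMargin

val doc  = Jsoup.parse(html)
// data() returns the raw script body without HTML-decoding it
val json = doc.select("script[type=application/json]").first().data()

// ujson for untyped navigation, read[Item] for the typed model
val items = ujson.read(json)("products").arr.toList.map(v => read[Item](v))
```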
Handling Anti-Bot Measures
Modern sites fingerprint JVM HTTP clients aggressively. The default OkHttp TLS fingerprint is detectable: rotate it by injecting a custom SSLSocketFactory that matches a real browser's cipher suite order, or route through a residential proxy with TLS termination on the proxy side.
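One way to take the proxy route with the OkHttp backend is to hand sttp a preconfigured client. The proxy address is a placeholder, and `usingClient` is the sttp factory for wrapping an existing OkHttpClient:

```scala
import java.net.{InetSocketAddress, Proxy}
import okhttp3.OkHttpClient
import sttp.client4.okhttp.OkHttpSyncBackend

// All requests egress through the proxy; with a residential proxy that
// terminates TLS, the target sees the proxy's fingerprint, not the JVM's.
val proxiedClient = new OkHttpClient.Builder()
  .proxy(new Proxy(Proxy.Type.HTTP, new InetSocketAddress("proxy.example.com", 8080)))
  .build()

val proxiedBackend = OkHttpSyncBackend.usingClient(proxiedClient)
```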
Key headers to set on every request:
- User-Agent: a current Chrome or Firefox string, rotated per session
- Accept-Language: match the target site's locale
- Sec-Fetch-* headers: Sec-Fetch-Dest: document, Sec-Fetch-Mode: navigate, Sec-Fetch-Site: none
- Referer: set for paginated requests to simulate navigation
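Translated into sttp, the header set above might look like this; the User-Agent string is one example of a rotating value, and `browserRequest` is an illustrative helper name:

```scala
import sttp.client4.*

// Build a browser-shaped GET; header values mirror the checklist above.
def browserRequest(url: String, referer: Option[String]) = {
  val base = basicRequest
    .get(uri"$url")
    .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36")
    .header("Accept-Language", "en-US,en;q=0.9")
    .header("Sec-Fetch-Dest", "document")
    .header("Sec-Fetch-Mode", "navigate")
    .header("Sec-Fetch-Site", "none")
  // only paginated follow-up requests carry a Referer
  referer.fold(base)(r => base.header("Referer", r))
}
```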
For JavaScript-heavy sites, sttp + Jsoup alone won't cut it; you'll need a headless browser. Before adding Playwright to your JVM project (the Playwright Java SDK is stable in 2026), check whether the site serializes its data into an API call you can hit directly. Browser automation costs 10-30x more per page than a plain HTTP request, and the Cypress vs Playwright comparison for scraping covers when that overhead is actually justified.
Concurrency Patterns
Synchronous with a thread pool
For up to ~50 concurrent requests, a bounded thread pool behind Scala's parallel collections (the scala-parallel-collections module on 2.13+) is simple and predictable:

```scala
import scala.collection.parallel.CollectionConverters.*

val urls: List[String] = loadUrls()

// each parallel task reuses the shared synchronous backend
val results = urls.par.map { url =>
  basicRequest.get(uri"$url").send(backend).body
}
```
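If you want the pool size explicit rather than inherited from the global pool, parallel collections accept a custom task support. A sketch, with `fetch` standing in for the request call:

```scala
import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.CollectionConverters.*
import scala.collection.parallel.ForkJoinTaskSupport

// Cap the scraper at 50 worker threads instead of the default global pool.
def fetchAll[A](urls: List[String])(fetch: String => A): List[A] = {
  val par = urls.par
  par.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(50))
  par.map(fetch).toList
}
```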
Async with cats-effect
For higher fan-out, switch to the cats-effect backend and cap parallelism with Semaphore:
```scala
import cats.effect.IO
import cats.effect.std.Semaphore
import cats.syntax.all.*

// sketch: assumes urls: List[String] and fetchAndParse: String => IO[...]
// parTraverseN caps fiber fan-out; the semaphore also guards shared state
Semaphore[IO](20).flatMap { sem =>
  urls.parTraverseN(20) { url =>
    sem.permit.use(_ => fetchAndParse(url))
  }
}
```
This approach composes cleanly with retry logic via cats-retry and timeout handling via IO.timeout. It also integrates naturally with fs2 streams if you're feeding a Kafka topic or writing Parquet partitions downstream.
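A sketch of how those pieces might compose, using cats-retry 3.x names (`RetryPolicies`, `retryingOnAllErrors`); `fetchAndParse` is assumed to exist elsewhere:

```scala
import scala.concurrent.duration.*
import cats.effect.IO
import retry.*

// Up to 3 retries with exponential backoff starting at 200ms
val policy: RetryPolicy[IO] =
  RetryPolicies.limitRetries[IO](3) join RetryPolicies.exponentialBackoff[IO](200.millis)

// Each attempt also carries a hard 10s timeout via IO.timeout
def resilientFetch(url: String): IO[List[String]] =
  retryingOnAllErrors(
    policy = policy,
    onError = (err: Throwable, details: RetryDetails) =>
      IO.println(s"retrying $url (${details.retriesSoFar} so far): ${err.getMessage}")
  )(fetchAndParse(url).timeout(10.seconds))
```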
Scala vs Other JVM and Non-JVM Options
| Language | HTTP client | Parser | Async model | Best for |
|---|---|---|---|---|
| Scala | sttp 4 | Jsoup | cats-effect / ZIO | Spark pipelines, typed models |
| Kotlin | Ktor | Jsoup | Coroutines | Android-adjacent stacks |
| Java | java.net.http (JDK 11+) | Jsoup | CompletableFuture | Legacy enterprise codebases |
| Python | httpx | BeautifulSoup | asyncio | Rapid prototyping |
| Node/Bun | fetch | cheerio | Event loop | Frontend-adjacent scrapers (see Bun vs Deno vs Node.js benchmarks) |
Jsoup is the clear winner on the JVM regardless of language: its CSS selector API matches Python's BeautifulSoup feature-for-feature, and it recovers from malformed HTML more reliably than most alternatives.
Legal and Operational Guardrails
Before running any scraper at scale, review the target site's robots.txt and terms of service. The Web Scraping Legal Guide 2026 covers CFAA exposure, GDPR obligations when scraping personal data, and the current post-hiQ case law landscape; read it before you send your first production crawl. On the operational side: respect Crawl-delay, implement exponential backoff on 429s, and store raw HTML before parsing so you can re-extract without re-crawling.
A production-ready numbered checklist before go-live:

1. Confirm robots.txt allows your target paths
2. Set a meaningful User-Agent with a contact email
3. Add jitter to request intervals (150-400ms baseline)
4. Log all 4xx/5xx responses with URL and timestamp
5. Store raw responses in object storage (S3, GCS) before parsing
6. Test your CSS selectors against at least three archived versions of the target page
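The jitter and backoff guidance above reduces to a few lines of pure logic. Names and defaults here are illustrative:

```scala
import scala.util.Random

// Jittered delay between requests: 150-400ms with the defaults below.
def jitteredDelayMs(baseMs: Int = 150, spreadMs: Int = 250, rng: Random = new Random): Int =
  baseMs + rng.nextInt(spreadMs + 1)

// Capped exponential backoff after a 429: 500ms, 1s, 2s, ... up to 60s.
// The attempt clamp prevents shift overflow on pathological retry counts.
def backoffMs(attempt: Int, baseMs: Long = 500, capMs: Long = 60000): Long =
  math.min(capMs, baseMs * (1L << math.min(attempt, 20)))
```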
Bottom line
Scala + sttp + Jsoup is a credible production stack for teams already in the JVM ecosystem, not a general-purpose recommendation. If your pipeline is Spark-first and you want type-safe extraction with no cross-language serialization, it earns its place. For everything else, Python or Go will ship faster. DRT covers the full range of scraping stacks (JVM, Rust, Go, and JavaScript runtimes) so you can benchmark the tradeoffs before committing to a language choice.
Related guides on dataresearchtools.com
- Cypress vs Playwright for Web Scraping: When to Pick Each (2026)
- Web Scraping with Reqwest + Tokio in Rust: Async Patterns (2026)
- Bun vs Deno vs Node.js for Web Scraping in 2026: Speed Benchmarks
- Go Web Scraping with Colly v2: Production Patterns for 2026
- Web Scraping Legal Guide 2026: GDPR, CFAA, hiQ vs LinkedIn, and More