Scrapling Web Scraping Python Tutorial: Adaptive Scraping
Scrapling is a Python web scraping library designed to solve one of the biggest pain points in scraping: broken selectors. When a website changes its HTML structure, traditional scrapers break. Scrapling uses intelligent, adaptive selectors that can still find the right elements even after a redesign.
This tutorial covers everything you need to start scraping with Scrapling, from basic usage to advanced techniques with proxy integration.
What Is Scrapling?
Scrapling is an open-source Python library that takes a different approach to element selection. Instead of relying solely on CSS selectors or XPath that break when HTML changes, Scrapling builds a fingerprint of each element based on multiple attributes. When the page structure changes, it uses fuzzy matching to locate the same element in the new layout.
Key features include:
- Adaptive selectors: elements are identified by multiple properties, not just their CSS path
- Auto-match: if an element moves in the DOM, Scrapling can still find it
- Built-in stealth: browser fingerprint randomization and anti-detection features
- Playwright integration: full JavaScript rendering support
- Clean API: a simple, intuitive interface inspired by BeautifulSoup
Installation
Install Scrapling with pip:
pip install scrapling
python -m scrapling install # installs browser dependencies
This installs the core library plus the Playwright browsers used for JavaScript rendering.
For a minimal installation without browser support:
pip install scrapling[core]
Basic Usage
Scrapling provides three main fetcher classes depending on your needs.
StaticFetcher: Simple HTTP Requests
Use this for pages that do not require JavaScript rendering:
from scrapling import StaticFetcher
fetcher = StaticFetcher()
# fetch a page
page = fetcher.get("https://quotes.toscrape.com/")
# find elements using CSS selectors
quotes = page.find_all("div.quote")
for quote in quotes:
text = quote.find("span.text").text
author = quote.find("small.author").text
print(f"{author}: {text}")
StealthFetcher: Anti-Detection Browsing
Use this when websites block standard requests:
from scrapling import StealthFetcher
fetcher = StealthFetcher()
# this uses a real browser with anti-detection measures
page = fetcher.get("https://example.com/protected-page")
# Scrapling handles fingerprint randomization automatically
products = page.find_all("div.product-card")
for product in products:
name = product.find("h2").text
price = product.find(".price").text
print(f"{name}: {price}")
PlayWrightFetcher: Full Browser Control
Use this when you need full control over the browser:
from scrapling import PlayWrightFetcher
fetcher = PlayWrightFetcher()
page = fetcher.get(
"https://example.com/dynamic-page",
wait_selector="div.results", # wait for this element
timeout=15000
)
results = page.find_all("div.result-item")
for item in results:
print(item.text)
Adaptive Selectors: The Core Feature
The most powerful feature of Scrapling is adaptive selection. Here is how it works.
Traditional Approach (Fragile)
# this breaks when the website changes its HTML structure
price = page.find("div.product-info > span.price-current > strong")
Scrapling Adaptive Approach (Resilient)
from scrapling import StealthFetcher

fetcher = StealthFetcher()
# first run: Scrapling learns the element
html = fetcher.get("https://example.com/product")
# use auto_match to find elements adaptively
price_element = html.find("span.price", auto_match=True)
# Scrapling creates a fingerprint based on:
# - text content patterns
# - surrounding elements
# - attribute values
# - position in the document
# - visual similarity
# on subsequent runs, even if the class name changes from
# "price" to "product-price" or the element moves,
# Scrapling can still locate it using the fingerprint
How Auto-Match Works
Scrapling’s auto_match builds an element profile using multiple signals:
from scrapling import Adaptor
# parse HTML content
page = Adaptor(html_content)
# find with auto_match enabled
# Scrapling uses these signals to identify elements:
# 1. text content and patterns (e.g., "$XX.XX" looks like a price)
# 2. element tag and attributes
# 3. sibling and parent context
# 4. position relative to other matched elements
product_name = page.find("h1.product-title", auto_match=True)
product_price = page.find("span.price", auto_match=True)
# save the learned selectors for future use
# this stores the element fingerprints
page.save("product_page_profile")
# on the next run, load the profile
page2 = Adaptor(new_html_content)
page2.load("product_page_profile")
# even if classes changed, auto_match finds the right elements
name = page2.find("h1.product-title", auto_match=True)
price = page2.find("span.price", auto_match=True)
Integrating Proxies with Scrapling
Proxy support is essential for serious scraping. Here is how to set it up with each fetcher.
Proxies with StaticFetcher
from scrapling import StaticFetcher
# single proxy
fetcher = StaticFetcher()
page = fetcher.get(
"https://example.com",
proxy="http://user:pass@proxy.example.com:8080"
)
# rotating proxies
import random
PROXY_POOL = [
"http://user:pass@us1.proxy.com:8080",
"http://user:pass@us2.proxy.com:8080",
"http://user:pass@eu1.proxy.com:8080",
]
fetcher = StaticFetcher()
page = fetcher.get(
"https://example.com",
proxy=random.choice(PROXY_POOL)
)
Proxies with StealthFetcher
from scrapling import StealthFetcher
fetcher = StealthFetcher()
# residential proxy for maximum success rate
page = fetcher.get(
"https://protected-site.com/data",
proxy="http://user:pass@residential.proxy.com:8080"
)
Proxies with PlayWrightFetcher
from scrapling import PlayWrightFetcher
fetcher = PlayWrightFetcher()
page = fetcher.get(
"https://example.com/products",
proxy={
"server": "http://proxy.example.com:8080",
"username": "user",
"password": "pass"
}
)
Building a Complete Scraping Pipeline
Let's build a practical scraper that extracts product data with proxy rotation and error handling.
import json
import time
import random
from scrapling import StealthFetcher, Adaptor
class ScraplingPipeline:
def __init__(self, proxies=None):
self.fetcher = StealthFetcher()
self.proxies = proxies or []
self.results = []
def _get_proxy(self):
if not self.proxies:
return None
return random.choice(self.proxies)
def scrape_product(self, url):
"""scrape a single product page."""
proxy = self._get_proxy()
page = self.fetcher.get(url, proxy=proxy)
# use adaptive selectors
name = page.find("h1", auto_match=True)
price = page.find("[class*='price']", auto_match=True)
description = page.find("[class*='description']", auto_match=True)
rating = page.find("[class*='rating']", auto_match=True)
return {
"url": url,
"name": name.text if name else None,
"price": price.text if price else None,
"description": description.text if description else None,
"rating": rating.text if rating else None,
}
def scrape_listing(self, url, product_selector="a[href*='product']"):
"""scrape a listing page and follow product links."""
proxy = self._get_proxy()
page = self.fetcher.get(url, proxy=proxy)
# find product links
links = page.find_all(product_selector)
product_urls = []
for link in links:
href = link.get("href")
if href:
if href.startswith("/"):
# convert relative to absolute
from urllib.parse import urljoin
href = urljoin(url, href)
product_urls.append(href)
return product_urls
def run(self, listing_url, max_products=50, delay=2.0):
"""full pipeline: listing -> product pages -> structured data."""
print(f"scraping listing: {listing_url}")
product_urls = self.scrape_listing(listing_url)
product_urls = product_urls[:max_products]
print(f"found {len(product_urls)} products")
for i, url in enumerate(product_urls):
print(f" [{i+1}/{len(product_urls)}] {url}")
try:
data = self.scrape_product(url)
self.results.append(data)
except Exception as e:
print(f" error: {e}")
self.results.append({"url": url, "error": str(e)})
time.sleep(delay + random.uniform(0, 1))
return self.results
def save(self, filename):
"""save results to JSON."""
with open(filename, "w") as f:
json.dump(self.results, f, indent=2)
print(f"saved {len(self.results)} results to {filename}")
# usage
pipeline = ScraplingPipeline(proxies=[
"http://user:pass@proxy1.example.com:8080",
"http://user:pass@proxy2.example.com:8080",
])
results = pipeline.run("https://example.com/products", max_products=20)
pipeline.save("products.json")
Handling Pagination
To scrape multiple pages of results:
import random
import time

from scrapling import StealthFetcher
def scrape_paginated(base_url, max_pages=10, proxies=None):
"""scrape through paginated results."""
fetcher = StealthFetcher()
all_items = []
for page_num in range(1, max_pages + 1):
url = f"{base_url}?page={page_num}"
proxy = random.choice(proxies) if proxies else None
page = fetcher.get(url, proxy=proxy)
items = page.find_all("div.item", auto_match=True)
if not items:
print(f"no items on page {page_num}, stopping")
break
for item in items:
all_items.append({
"title": item.find("h2").text if item.find("h2") else None,
"price": item.find(".price").text if item.find(".price") else None,
})
print(f"page {page_num}: {len(items)} items (total: {len(all_items)})")
time.sleep(2)
return all_items
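A quick usage sketch for the helper above; the listing URL and proxy are placeholders:
items = scrape_paginated(
    "https://example.com/search",  # placeholder listing URL
    max_pages=5,
    proxies=["http://user:pass@proxy1.example.com:8080"],
)
print(f"collected {len(items)} items")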
Data Extraction Patterns
Extracting Tables
from scrapling import Adaptor
def extract_table(page, table_selector="table"):
"""extract table data into a list of dictionaries."""
table = page.find(table_selector)
if not table:
return []
# get headers
headers = []
header_row = table.find("thead tr") or table.find("tr")
if header_row:
headers = [th.text.strip() for th in header_row.find_all(["th", "td"])]
# get rows
rows = []
    body = table.find("tbody") or table
    # when there is no <thead>, the first row holds the headers, so skip it
    data_rows = body.find_all("tr") if table.find("thead") else body.find_all("tr")[1:]
    for tr in data_rows:
cells = [td.text.strip() for td in tr.find_all("td")]
if cells and len(cells) == len(headers):
rows.append(dict(zip(headers, cells)))
return rows
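As a usage sketch, assuming a page that contains a standard HTML table (the URL below is a placeholder):
from scrapling import StaticFetcher

page = StaticFetcher().get("https://example.com/price-history")  # placeholder URL
for row in extract_table(page)[:5]:
    print(row)  # each row is a dict keyed by the table headers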
Extracting Structured Data (JSON-LD)
import json
def extract_jsonld(page):
"""extract JSON-LD structured data from a page."""
scripts = page.find_all("script[type='application/ld+json']")
structured_data = []
for script in scripts:
try:
data = json.loads(script.text)
structured_data.append(data)
except json.JSONDecodeError:
continue
return structured_data
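JSON-LD often carries the product name, price, and rating in a stable, machine-readable form, so it can replace fragile CSS selectors entirely when a site provides it. A usage sketch with a placeholder URL, assuming the page exposes a schema.org Product object:
from scrapling import StealthFetcher

page = StealthFetcher().get("https://example.com/product")  # placeholder URL
for block in extract_jsonld(page):
    if isinstance(block, dict) and block.get("@type") == "Product":
        offers = block.get("offers") or {}
        if isinstance(offers, list):
            # some sites publish a list of offers; take the first one
            offers = offers[0] if offers else {}
        print(block.get("name"), offers.get("price"))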
Scrapling vs Other Libraries
| Feature | Scrapling | BeautifulSoup | Scrapy | Playwright |
|---|---|---|---|---|
| adaptive selectors | yes | no | no | no |
| anti-detection | built-in | no | no | partial |
| JS rendering | yes | no | with plugin | yes |
| async support | yes | no | yes | yes |
| learning curve | low | low | medium | medium |
| speed | moderate | fast | fast | slow |
| proxy support | yes | manual | yes | yes |
Troubleshooting Common Issues
auto_match returns the wrong element:
This usually happens when multiple elements look similar. Provide more context in your selector, or combine the tag name with partial class matching.
# too vague
price = page.find("span", auto_match=True)
# better: more context
price = page.find("span[class*='price']", auto_match=True)
# best: combine with parent context
product_div = page.find("div.product-main")
price = product_div.find("span[class*='price']", auto_match=True)
StealthFetcher is slow:
The stealth fetcher launches a full browser. For pages that do not need JavaScript rendering, switch to StaticFetcher, which is much faster.
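A minimal sketch of that switch; the flag and URL are illustrative, not part of Scrapling:
from scrapling import StaticFetcher, StealthFetcher

NEEDS_JS = False  # flip to True only for pages that render their content client-side

fetcher = StealthFetcher() if NEEDS_JS else StaticFetcher()
page = fetcher.get("https://example.com/simple-page")  # placeholder URL
title = page.find("h1")
print(title.text if title else "no h1 found")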
proxy connection errors:
Verify your proxy credentials and test connectivity separately before integrating with Scrapling:
import httpx
proxy = "http://user:pass@proxy.example.com:8080"
try:
resp = httpx.get("https://httpbin.org/ip", proxy=proxy, timeout=10)
print(f"proxy IP: {resp.json()['origin']}")
except Exception as e:
print(f"proxy error: {e}")
Performance Tips
- Use StaticFetcher when possible: it is 5 to 10 times faster than browser-based fetchers
- Cache responses: save HTML locally during development to avoid re-fetching (see the sketch after this list)
- Batch processing: collect URLs first, then process them with controlled concurrency
- Respect rate limits: add delays between requests to avoid getting blocked
- Rotate proxies: distribute requests across multiple IPs for better success rates
- Profile your selectors: auto_match is powerful but slower than direct CSS selectors, so use it only where you need adaptability
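For example, a minimal development-time cache might look like the sketch below. The helper and file layout are illustrative, not part of Scrapling; it fetches raw HTML with httpx and feeds the cached copy back through Adaptor, which accepts raw HTML as shown earlier:
import hashlib
from pathlib import Path

import httpx
from scrapling import Adaptor

CACHE_DIR = Path(".scrape_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_get(url):
    """Fetch raw HTML once and reuse the saved copy on later runs."""
    cache_file = CACHE_DIR / (hashlib.sha1(url.encode()).hexdigest() + ".html")
    if cache_file.exists():
        html = cache_file.read_text()
    else:
        html = httpx.get(url, timeout=15).text
        cache_file.write_text(html)
    return Adaptor(html)

page = cached_get("https://example.com/products")  # placeholder URL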
Conclusion
Scrapling fills a real gap in the Python scraping ecosystem. Its adaptive selectors mean you spend less time maintaining broken scrapers and more time working with the data you extract. Combined with built-in stealth features and proxy support, it is a solid choice for projects where frequent site changes are a concern.
Start with StealthFetcher for protected sites, use StaticFetcher for speed on simple pages, and enable auto_match on elements that tend to change. This combination handles most real-world scraping scenarios without the constant maintenance that traditional approaches demand.