Scrapling Web Scraping Python Tutorial: Adaptive Scraping

Scrapling is a Python web scraping library designed to solve one of the biggest pain points in scraping: broken selectors. When a website changes its HTML structure, traditional scrapers break. Scrapling uses intelligent, adaptive selectors that can still find the right elements even after a website redesign.

This tutorial covers everything you need to start scraping with Scrapling, from basic usage to advanced techniques with proxy integration.

What Is Scrapling?

Scrapling is an open-source Python library that takes a different approach to element selection. Instead of relying solely on CSS selectors or XPath that break when HTML changes, Scrapling builds a fingerprint of each element based on multiple attributes. When the page structure changes, it uses fuzzy matching to locate the same element in the new layout.
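
To make the idea concrete, here is a toy sketch of fingerprint-style matching. This is an illustration of the concept only, not Scrapling's actual algorithm; the signals and weights are invented for demonstration.

# toy illustration of fingerprint matching -- NOT Scrapling's real algorithm
def similarity(fingerprint, candidate):
    """Score a candidate element against a stored fingerprint (0.0 to 1.0)."""
    score = 0.0
    if candidate["tag"] == fingerprint["tag"]:
        score += 0.3  # same tag name
    shared = set(fingerprint["attrs"]) & set(candidate["attrs"])
    score += 0.4 * len(shared) / max(len(fingerprint["attrs"]), 1)  # shared attribute names
    if fingerprint["text_pattern"] in candidate["text"]:
        score += 0.3  # text still matches the expected pattern
    return score

stored = {"tag": "span", "attrs": {"class": "price"}, "text_pattern": "$"}
candidates = [
    {"tag": "span", "attrs": {"class": "product-price"}, "text": "$19.99"},
    {"tag": "div", "attrs": {"class": "title"}, "text": "Blue Widget"},
]
# the renamed price element still scores highest
best = max(candidates, key=lambda c: similarity(stored, c))
print(best["text"])  # $19.99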

Key features include:

  • Adaptive selectors: elements are identified by multiple properties, not just their CSS path
  • Auto-match: if an element moves in the DOM, Scrapling can still find it
  • Built-in stealth: browser fingerprint randomization and anti-detection features
  • Playwright integration: full JavaScript rendering support
  • Clean API: a simple, intuitive interface inspired by BeautifulSoup

Installation

Install Scrapling with pip:

pip install scrapling
python -m scrapling install  # installs browser dependencies

This installs the core library plus Playwright browsers for JavaScript rendering.
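
To confirm the install worked, import the package and print its version (assuming it exposes a __version__ attribute, as most packages do):

python -c "import scrapling; print(scrapling.__version__)"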

For a minimal installation without browser support:

pip install scrapling[core]

Basic Usage

Scrapling provides three main fetcher classes depending on your needs.

StaticFetcher: Simple HTTP Requests

Use this for pages that do not require JavaScript rendering:

from scrapling import StaticFetcher

fetcher = StaticFetcher()

# fetch a page
page = fetcher.get("https://quotes.toscrape.com/")

# find elements using CSS selectors
quotes = page.find_all("div.quote")

for quote in quotes:
    text = quote.find("span.text").text
    author = quote.find("small.author").text
    print(f"{author}: {text}")

StealthFetcher: Anti-Detection Browsing

Use this when websites block standard requests:

from scrapling import StealthFetcher

fetcher = StealthFetcher()

# this uses a real browser with anti-detection measures
page = fetcher.get("https://example.com/protected-page")

# Scrapling handles fingerprint randomization automatically
products = page.find_all("div.product-card")

for product in products:
    name = product.find("h2").text
    price = product.find(".price").text
    print(f"{name}: {price}")

PlayWrightFetcher: Full Browser Control

Use this when you need full control over the browser:

from scrapling import PlayWrightFetcher

fetcher = PlayWrightFetcher()

page = fetcher.get(
    "https://example.com/dynamic-page",
    wait_selector="div.results",  # wait for this element
    timeout=15000
)

results = page.find_all("div.result-item")
for item in results:
    print(item.text)

Adaptive Selectors: The Core Feature

The most powerful feature of Scrapling is adaptive selection. Here is how it works.

Traditional Approach (Fragile)

# this breaks when the website changes its HTML structure
price = page.find("div.product-info > span.price-current > strong")

Scrapling Adaptive Approach (Resilient)

from scrapling import StaticFetcher

fetcher = StaticFetcher()

# first run: Scrapling learns the element
html = fetcher.get("https://example.com/product")

# use auto_match to find elements adaptively
price_element = html.find("span.price", auto_match=True)

# Scrapling creates a fingerprint based on:
# - text content patterns
# - surrounding elements
# - attribute values
# - position in the document
# - visual similarity

# on subsequent runs, even if the class name changes from
# "price" to "product-price" or the element moves,
# Scrapling can still locate it using the fingerprint

How Auto-Match Works

Scrapling’s auto_match builds an element profile using multiple signals:

from scrapling import Adaptor

# parse HTML content
page = Adaptor(html_content)

# find with auto_match enabled
# Scrapling uses these signals to identify elements:
# 1. text content and patterns (e.g., "$XX.XX" looks like a price)
# 2. element tag and attributes
# 3. sibling and parent context
# 4. position relative to other matched elements

product_name = page.find("h1.product-title", auto_match=True)
product_price = page.find("span.price", auto_match=True)

# save the learned selectors for future use
# this stores the element fingerprints
page.save("product_page_profile")

# on the next run, load the profile
page2 = Adaptor(new_html_content)
page2.load("product_page_profile")

# even if classes changed, auto_match finds the right elements
name = page2.find("h1.product-title", auto_match=True)
price = page2.find("span.price", auto_match=True)

Integrating Proxies with Scrapling

Proxy support is essential for serious scraping. Here is how to set it up with each fetcher.

Proxies with StaticFetcher

from scrapling import StaticFetcher

# single proxy
fetcher = StaticFetcher()
page = fetcher.get(
    "https://example.com",
    proxy="http://user:pass@proxy.example.com:8080"
)

# rotating proxies
import random

PROXY_POOL = [
    "http://user:pass@us1.proxy.com:8080",
    "http://user:pass@us2.proxy.com:8080",
    "http://user:pass@eu1.proxy.com:8080",
]

fetcher = StaticFetcher()
page = fetcher.get(
    "https://example.com",
    proxy=random.choice(PROXY_POOL)
)

Proxies with StealthFetcher

from scrapling import StealthFetcher

fetcher = StealthFetcher()

# residential proxy for maximum success rate
page = fetcher.get(
    "https://protected-site.com/data",
    proxy="http://user:pass@residential.proxy.com:8080"
)
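
If one proxy dies mid-run, a small retry helper that rotates through the pool keeps the job alive. A minimal sketch built on the StealthFetcher usage above (fetch_with_rotation is a helper written for this tutorial, not part of Scrapling):

import random
from scrapling import StealthFetcher

def fetch_with_rotation(url, proxies, max_attempts=3):
    """Try up to max_attempts proxies from the pool before giving up."""
    fetcher = StealthFetcher()
    last_error = None
    for proxy in random.sample(proxies, min(max_attempts, len(proxies))):
        try:
            return fetcher.get(url, proxy=proxy)
        except Exception as e:
            last_error = e  # connection failed: rotate to the next proxy
    raise RuntimeError(f"all proxies failed for {url}: {last_error}")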

Proxies with PlayWrightFetcher

from scrapling import PlayWrightFetcher

fetcher = PlayWrightFetcher()

page = fetcher.get(
    "https://example.com/products",
    proxy={
        "server": "http://proxy.example.com:8080",
        "username": "user",
        "password": "pass"
    }
)

Building a Complete Scraping Pipeline

Let's build a practical scraper that extracts product data with proxy rotation and error handling.

import json
import time
import random
from urllib.parse import urljoin
from scrapling import StealthFetcher

class ScraplingPipeline:
    def __init__(self, proxies=None):
        self.fetcher = StealthFetcher()
        self.proxies = proxies or []
        self.results = []

    def _get_proxy(self):
        if not self.proxies:
            return None
        return random.choice(self.proxies)

    def scrape_product(self, url):
        """scrape a single product page."""
        proxy = self._get_proxy()

        page = self.fetcher.get(url, proxy=proxy)

        # use adaptive selectors
        name = page.find("h1", auto_match=True)
        price = page.find("[class*='price']", auto_match=True)
        description = page.find("[class*='description']", auto_match=True)
        rating = page.find("[class*='rating']", auto_match=True)

        return {
            "url": url,
            "name": name.text if name else None,
            "price": price.text if price else None,
            "description": description.text if description else None,
            "rating": rating.text if rating else None,
        }

    def scrape_listing(self, url, product_selector="a[href*='product']"):
        """scrape a listing page and follow product links."""
        proxy = self._get_proxy()
        page = self.fetcher.get(url, proxy=proxy)

        # find product links
        links = page.find_all(product_selector)
        product_urls = []

        for link in links:
            href = link.get("href")
            if href:
                # resolve relative links against the listing URL
                product_urls.append(urljoin(url, href))

        return product_urls

    def run(self, listing_url, max_products=50, delay=2.0):
        """full pipeline: listing -> product pages -> structured data."""
        print(f"scraping listing: {listing_url}")
        product_urls = self.scrape_listing(listing_url)
        product_urls = product_urls[:max_products]

        print(f"found {len(product_urls)} products")

        for i, url in enumerate(product_urls):
            print(f"  [{i+1}/{len(product_urls)}] {url}")
            try:
                data = self.scrape_product(url)
                self.results.append(data)
            except Exception as e:
                print(f"    error: {e}")
                self.results.append({"url": url, "error": str(e)})

            time.sleep(delay + random.uniform(0, 1))

        return self.results

    def save(self, filename):
        """save results to JSON."""
        with open(filename, "w") as f:
            json.dump(self.results, f, indent=2)
        print(f"saved {len(self.results)} results to {filename}")

# usage
pipeline = ScraplingPipeline(proxies=[
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
])

results = pipeline.run("https://example.com/products", max_products=20)
pipeline.save("products.json")

Handling Pagination

Scraping multiple pages of results:

import random
import time
from scrapling import StealthFetcher

def scrape_paginated(base_url, max_pages=10, proxies=None):
    """scrape through paginated results."""
    fetcher = StealthFetcher()
    all_items = []

    for page_num in range(1, max_pages + 1):
        url = f"{base_url}?page={page_num}"
        proxy = random.choice(proxies) if proxies else None

        page = fetcher.get(url, proxy=proxy)
        items = page.find_all("div.item", auto_match=True)

        if not items:
            print(f"no items on page {page_num}, stopping")
            break

        for item in items:
            all_items.append({
                "title": item.find("h2").text if item.find("h2") else None,
                "price": item.find(".price").text if item.find(".price") else None,
            })

        print(f"page {page_num}: {len(items)} items (total: {len(all_items)})")
        time.sleep(2)

    return all_items
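
A hypothetical invocation against a listing URL of your own:

items = scrape_paginated("https://example.com/catalog", max_pages=5)
print(f"collected {len(items)} items")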

Data Extraction Patterns

Extracting Tables

from scrapling import Adaptor

def extract_table(page, table_selector="table"):
    """extract table data into a list of dictionaries."""
    table = page.find(table_selector)
    if not table:
        return []

    # get headers
    headers = []
    header_row = table.find("thead tr") or table.find("tr")
    if header_row:
        headers = [th.text.strip() for th in header_row.find_all(["th", "td"])]

    # get rows
    rows = []
    body = table.find("tbody") or table
    data_rows = body.find_all("tr")
    if not table.find("thead"):
        # without a <thead>, skip the first row (it was used as the header)
        data_rows = data_rows[1:]
    for tr in data_rows:
        cells = [td.text.strip() for td in tr.find_all("td")]
        if cells and len(cells) == len(headers):
            rows.append(dict(zip(headers, cells)))

    return rows

Extracting Structured Data (JSON-LD)

import json

def extract_jsonld(page):
    """extract JSON-LD structured data from a page."""
    scripts = page.find_all("script[type='application/ld+json']")

    structured_data = []
    for script in scripts:
        try:
            data = json.loads(script.text)
            structured_data.append(data)
        except json.JSONDecodeError:
            continue

    return structured_data
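
JSON-LD follows the schema.org vocabulary, so product pages usually carry an object with "@type": "Product" whose offers hold the price. Here is a small filter over the extracted data (find_products is a helper written for this tutorial):

def find_products(structured_data):
    """Filter schema.org Product entries out of extracted JSON-LD."""
    products = []
    for data in structured_data:
        # a single script tag may hold one object or a list of objects
        for entry in data if isinstance(data, list) else [data]:
            if isinstance(entry, dict) and entry.get("@type") == "Product":
                offers = entry.get("offers") or {}
                products.append({
                    "name": entry.get("name"),
                    "price": offers.get("price") if isinstance(offers, dict) else None,
                })
    return products

products = find_products(extract_jsonld(page))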

Scrapling vs Other Libraries

Feature | Scrapling | BeautifulSoup | Scrapy | Playwright
--- | --- | --- | --- | ---
Adaptive selectors | yes | no | no | no
Anti-detection | built-in | no | no | partial
JS rendering | yes | no | with plugin | yes
Async support | yes | no | yes | yes
Learning curve | low | low | medium | medium
Speed | moderate | fast | fast | slow
Proxy support | yes | manual | yes | yes

Troubleshooting Common Issues

auto_match returns the wrong element:
This usually happens when multiple elements look similar. Provide more context in your selector, or combine the tag name with partial class matching.

# too vague
price = page.find("span", auto_match=True)

# better: more context
price = page.find("span[class*='price']", auto_match=True)

# best: combine with parent context
product_div = page.find("div.product-main")
price = product_div.find("span[class*='price']", auto_match=True)

StealthFetcher is slow:
The stealth fetcher launches a full browser. For pages that do not need JS rendering, switch to StaticFetcher, which is much faster.
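
A common compromise is to try the fast static fetch first and fall back to the stealth browser only when the content you need is missing. A sketch using the tutorial's fetcher classes (marker_selector is whatever element proves the page rendered):

from scrapling import StaticFetcher, StealthFetcher

def fetch_fast_or_stealth(url, marker_selector):
    """Try a plain HTTP fetch first; fall back to a stealth browser."""
    page = StaticFetcher().get(url)
    if page.find(marker_selector):
        return page  # the content rendered fine without a browser
    # marker missing: the page is likely JS-rendered or blocking us
    return StealthFetcher().get(url)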

Proxy connection errors:
Verify your proxy credentials and test connectivity separately before integrating with Scrapling.

import httpx

proxy = "http://user:pass@proxy.example.com:8080"
try:
    resp = httpx.get("https://httpbin.org/ip", proxy=proxy, timeout=10)
    print(f"proxy IP: {resp.json()['origin']}")
except Exception as e:
    print(f"proxy error: {e}")

Performance Tips

  1. Use StaticFetcher when possible: it is 5 to 10 times faster than browser-based fetchers
  2. Cache responses: save HTML locally during development to avoid re-fetching (see the sketch after this list)
  3. Batch processing: collect URLs first, then process them with controlled concurrency
  4. Respect rate limits: add delays between requests to avoid getting blocked
  5. Rotate proxies: distribute requests across multiple IPs for better success rates
  6. Profile your selectors: auto_match is powerful but slower than direct CSS selectors, so use it only where you need adaptability
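
For tip 2, here is a minimal development-time cache keyed by a hash of the URL. The html_content attribute is an assumption about what the response object exposes for the raw HTML; substitute whatever your fetcher provides. Parsing the cached text with Adaptor mirrors the pattern from the auto-match section.

import hashlib
from pathlib import Path
from scrapling import Adaptor, StaticFetcher

CACHE_DIR = Path(".scrape_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_get(url):
    """Fetch a URL once, then reuse the local copy on later runs."""
    cache_file = CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".html")
    if cache_file.exists():
        return Adaptor(cache_file.read_text())
    page = StaticFetcher().get(url)
    # html_content is a placeholder attribute: use whatever your
    # fetcher exposes for the raw response body
    cache_file.write_text(page.html_content)
    return page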

Conclusion

Scrapling fills a real gap in the Python scraping ecosystem. Its adaptive selectors mean you spend less time maintaining broken scrapers and more time working with the data you extract. Combined with built-in stealth features and proxy support, it is a solid choice for projects where website stability is a concern.

Start with StealthFetcher for protected sites, use StaticFetcher for speed on simple pages, and enable auto_match on elements that tend to change. This combination handles most real-world scraping scenarios without the constant maintenance that traditional approaches demand.
