MLflow + Web Scraping: Track Experiments with Scraped Datasets

TL;DR
MLflow logs datasets, parameters, and metrics for each scraping experiment so you can compare runs, reproduce results, and track data drift. Add it to any Python scraping pipeline with five lines of code.

ML engineers who build models on scraped data face a reproducibility problem: the dataset from last Tuesday is different from today’s crawl. Prices changed, listings were removed, new content appeared. Without experiment tracking, you cannot tell which model version trained on which data snapshot.

MLflow solves this. It is an open-source platform that logs runs, parameters, metrics, and artifacts — including datasets. This guide shows how to wire it into a web scraping pipeline.

why track scraping experiments

Scraped datasets are not static files — they are snapshots of a living web. Tracking experiments lets you:

  • reproduce any historical dataset by rerunning the exact crawler config
  • compare model performance across dataset versions
  • detect data drift when source sites change structure
  • audit which training data a model used

See our overview of what web scraping is for background on scraping pipeline architecture.

setup

pip install mlflow requests beautifulsoup4 pandas

# start the mlflow ui (runs at http://localhost:5000)
mlflow ui

logging a scraping run

import mlflow
import requests
from bs4 import BeautifulSoup
import pandas as pd
import json
import time
from datetime import datetime

mlflow.set_experiment("books-price-scraper")

with mlflow.start_run(run_name=f"run-{datetime.now().strftime('%Y%m%d-%H%M')}"):
    # log parameters
    mlflow.log_param("target_url", "https://books.toscrape.com")
    mlflow.log_param("max_pages", 10)
    mlflow.log_param("proxy_type", "residential")
    mlflow.log_param("delay_seconds", 1.5)

    # scrape
    records = []
    for page in range(1, 11):
        url = f"https://books.toscrape.com/catalogue/page-{page}.html"
        r = requests.get(url, timeout=10)
        r.raise_for_status()  # fail the run loudly on blocked or missing pages
        soup = BeautifulSoup(r.text, "html.parser")
        for article in soup.select("article.product_pod"):
            title = article.select_one("h3 a")["title"]
            price = article.select_one(".price_color").text.strip()
            records.append({"title": title, "price": price})
        time.sleep(1.5)

    df = pd.DataFrame(records)

    # log metrics
    mlflow.log_metric("records_scraped", len(df))
    mlflow.log_metric("pages_crawled", 10)
    mlflow.log_metric("avg_price", df["price"].str.replace("[£$,]", "", regex=True).astype(float).mean())

    # log the dataset as an artifact
    df.to_csv("scraped_data.csv", index=False)
    mlflow.log_artifact("scraped_data.csv", artifact_path="datasets")

    print(f"logged {len(df)} records")

tagging proxy configuration

When rotating proxies, log which proxy pool was used. This matters when debugging success-rate differences between runs:

mlflow.set_tag("proxy_provider", "brightdata")
mlflow.set_tag("proxy_country", "US")
mlflow.set_tag("proxy_type", "residential")
mlflow.log_metric("request_success_rate", successful / total)
mlflow.log_metric("blocked_requests", blocked_count)
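The `successful`, `total`, and `blocked_count` values above have to come from your scraping loop. One way to derive them from collected HTTP status codes, sketched as a helper (the name and the choice of which codes count as "blocked" are assumptions):

```python
def request_metrics(status_codes: list) -> dict:
    """Summarize a run's HTTP status codes into loggable metrics.
    Treats 403 and 429 as blocks; anything else non-200 is a plain failure."""
    total = len(status_codes)
    successful = sum(1 for s in status_codes if s == 200)
    blocked = sum(1 for s in status_codes if s in (403, 429))
    return {
        "request_success_rate": successful / total if total else 0.0,
        "blocked_requests": blocked,
    }

metrics = request_metrics([200, 200, 403, 200, 429])
```

Collect each response's status code during the crawl, then pass the list here and log the resulting dict with `mlflow.log_metrics`.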

See our comparison of SOCKS5 vs HTTP proxies for the tradeoffs between proxy types in scraping pipelines.

dataset versioning with mlflow datasets

import mlflow.data
from mlflow.data.pandas_dataset import PandasDataset

dataset: PandasDataset = mlflow.data.from_pandas(
    df,
    source="https://books.toscrape.com",
    name="books-dataset",
    targets="price"
)
mlflow.log_input(dataset, context="training")
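Alongside `log_input`, it can help to log a compact profile of each snapshot, since row counts and schema hashes are easy to diff between runs. A sketch — the `dataset_profile` helper is hypothetical, not an MLflow API:

```python
import hashlib

import pandas as pd

def dataset_profile(df: pd.DataFrame) -> dict:
    """Small summary of a scraped dataset: row count plus a schema hash,
    so schema changes between crawls show up in run metadata."""
    schema = ",".join(f"{c}:{t}" for c, t in zip(df.columns, df.dtypes.astype(str)))
    return {
        "n_rows": len(df),
        "schema_hash": hashlib.sha256(schema.encode()).hexdigest()[:12],
    }

df = pd.DataFrame({"title": ["A Light in the Attic"], "price": ["£51.77"]})
profile = dataset_profile(df)
```

The row count goes to `log_metric` and the schema hash to `set_tag`; two runs with different hashes scraped structurally different data.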

comparing runs in the ui

Open http://localhost:5000 after a few runs. Select multiple runs and click “compare” to see a side-by-side view of parameters and metrics. This immediately shows whether a new proxy configuration improved success rates or whether a target site changed structure (records_scraped drops noticeably).

detecting data drift

import mlflow

client = mlflow.tracking.MlflowClient()
runs = client.search_runs(
    experiment_ids=["1"],
    order_by=["start_time DESC"],
    max_results=10
)

for run in runs:
    print(
        run.info.run_name,
        run.data.metrics.get("records_scraped"),
        run.data.metrics.get("avg_price")
    )

If records_scraped drops from 200 to 40 between runs, the target site likely changed structure. You now have an automatic audit trail to investigate why.
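That check can be automated on top of the loop above. A minimal sketch — the helper name and the 50% drop threshold are assumptions, and the input is the list of records_scraped values newest-first, as returned by `search_runs` with `start_time DESC`:

```python
def detect_drift(record_counts: list, threshold: float = 0.5) -> list:
    """Compare each run's records_scraped to the previous run
    (list ordered newest-first). Returns (newer, older) pairs where
    the count dropped by more than `threshold` of the older value."""
    alerts = []
    for newer, older in zip(record_counts, record_counts[1:]):
        if older and newer < older * (1 - threshold):
            alerts.append((newer, older))
    return alerts

alerts = detect_drift([40, 200, 210, 205])
```

Run this after pulling metrics with the client code above and page an engineer when `alerts` is non-empty.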

connecting to a remote mlflow server

import os
os.environ["MLFLOW_TRACKING_URI"] = "http://your-server:5000"

# note: an S3 bucket cannot serve as the tracking URI itself;
# cloud artifact storage is configured on the server side
# (e.g. via --default-artifact-root when launching mlflow server)

