MLflow logs datasets, parameters, and metrics for each scraping experiment so you can compare runs, reproduce results, and track data drift. Add it to any Python scraping pipeline with a few lines of code.

ML engineers who build models on scraped data face a reproducibility problem: the dataset from last Tuesday is different from today's crawl. Prices changed, listings were removed, new content appeared. Without experiment tracking, you cannot tell which model version trained on which data snapshot.

MLflow solves this. It is an open-source platform that logs runs, parameters, metrics, and artifacts, including datasets. This guide shows how to wire it into a web scraping pipeline.
## Why track scraping experiments
Scraped datasets are not static files; they are snapshots of a living web. Tracking experiments lets you:
- reproduce any historical dataset by rerunning the exact crawler config
- compare model performance across dataset versions
- detect data drift when source sites change structure
- audit which training data a model used
See our overview of what is web scraping for background on scraping pipeline architecture.
## Setup

```bash
pip install mlflow requests beautifulsoup4

# start the MLflow UI (runs at http://localhost:5000)
mlflow ui
```
## Logging a scraping run
```python
import time
from datetime import datetime

import mlflow
import pandas as pd
import requests
from bs4 import BeautifulSoup

mlflow.set_experiment("books-price-scraper")

with mlflow.start_run(run_name=f"run-{datetime.now().strftime('%Y%m%d-%H%M')}"):
    # log parameters
    mlflow.log_param("target_url", "https://books.toscrape.com")
    mlflow.log_param("max_pages", 10)
    mlflow.log_param("proxy_type", "residential")
    mlflow.log_param("delay_seconds", 1.5)

    # scrape
    records = []
    for page in range(1, 11):
        url = f"https://books.toscrape.com/catalogue/page-{page}.html"
        r = requests.get(url, timeout=10)
        soup = BeautifulSoup(r.text, "html.parser")
        for article in soup.select("article.product_pod"):
            title = article.select_one("h3 a")["title"]
            price = article.select_one(".price_color").text.strip()
            records.append({"title": title, "price": price})
        time.sleep(1.5)

    df = pd.DataFrame(records)

    # log metrics
    mlflow.log_metric("records_scraped", len(df))
    mlflow.log_metric("pages_crawled", 10)
    mlflow.log_metric("avg_price", df["price"].str.replace("[£$,]", "", regex=True).astype(float).mean())

    # log the dataset as an artifact
    df.to_csv("scraped_data.csv", index=False)
    mlflow.log_artifact("scraped_data.csv", artifact_path="datasets")

    print(f"logged {len(df)} records")
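The inline price cleanup behind `avg_price` is the kind of logic that fails silently when a site changes its currency format. One way to make it testable is to factor it into a small helper (a sketch; `parse_price` is not part of the pipeline above, and it mirrors the same `[£$,]` pattern used there):

```python
import re

def parse_price(raw: str) -> float:
    """Strip currency symbols and thousands separators, e.g. '£1,051.77' -> 1051.77."""
    return float(re.sub(r"[£$,]", "", raw.strip()))

parse_price("£51.77")
```

With this in place, the metric line becomes `df["price"].map(parse_price).mean()` and the parser can be unit-tested before anything is logged.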
## Tagging proxy configuration

When rotating proxies, log which proxy pool was used. This matters when debugging success rate differences between runs:
```python
mlflow.set_tag("proxy_provider", "brightdata")
mlflow.set_tag("proxy_country", "US")
mlflow.set_tag("proxy_type", "residential")

mlflow.log_metric("request_success_rate", successful / total)
mlflow.log_metric("blocked_requests", blocked_count)
```
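The `successful`, `total`, and `blocked_count` values are assumed to come from your request loop; one way to tally them is a small helper (a sketch, treating 403 and 429 responses as blocks, which is a common but not universal convention):

```python
def tally_requests(status_codes):
    """Summarize a run's HTTP status codes into the metrics logged above."""
    total = len(status_codes)
    successful = sum(1 for s in status_codes if s == 200)
    # 403 and 429 are the usual signs of proxy blocking or rate limiting
    blocked_count = sum(1 for s in status_codes if s in (403, 429))
    success_rate = successful / total if total else 0.0
    return successful, total, blocked_count, success_rate

# e.g. 8 OK responses, one rate-limit block, one server error
successful, total, blocked_count, rate = tally_requests([200] * 8 + [429, 500])
```

Collect `r.status_code` into a list inside the scraping loop, then log the resulting counts at the end of the run.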
See our comparison of SOCKS5 vs HTTP proxy for the tradeoffs between proxy types in scraping pipelines.
## Dataset versioning with MLflow Datasets
```python
import mlflow
import mlflow.data
from mlflow.data.pandas_dataset import PandasDataset

dataset: PandasDataset = mlflow.data.from_pandas(
    df,
    source="https://books.toscrape.com",
    name="books-dataset",
    targets="price",
)

mlflow.log_input(dataset, context="training")
```
## Comparing runs in the UI

Open http://localhost:5000 after a few runs. Select multiple runs and click "Compare" to see a side-by-side view of parameters and metrics. This immediately shows whether a new proxy configuration improved success rates or whether a target site changed structure (`records_scraped` drops noticeably).
## Detecting data drift
```python
import mlflow

client = mlflow.tracking.MlflowClient()
runs = client.search_runs(
    experiment_ids=["1"],
    order_by=["start_time DESC"],
    max_results=10,
)

for run in runs:
    print(
        run.info.run_name,
        run.data.metrics.get("records_scraped"),
        run.data.metrics.get("avg_price"),
    )
```
If `records_scraped` drops from 200 to 40 between runs, the target site changed structure. You now have an automatic audit trail to investigate why.
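This check can be automated on top of the loop above (a sketch; the `drift_alert` helper and the 50% drop threshold are illustrative, not part of MLflow):

```python
def drift_alert(previous, current, threshold=0.5):
    """Return True when a metric fell by more than `threshold` vs. the previous run."""
    if previous is None or current is None or previous == 0:
        return False
    return (previous - current) / previous > threshold

# Feed it consecutive records_scraped values from search_runs:
# a drop from 200 to 40 is an 80% decline, well past a 50% threshold.
```

Run it on each pair of consecutive runs returned by `search_runs` and page an operator, or fail the CI job, when it fires.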
## Connecting to a remote MLflow server

```python
import os

# point all mlflow.* calls in this process at the shared tracking server
os.environ["MLFLOW_TRACKING_URI"] = "http://your-server:5000"
```

The tracking URI must point at a tracking server (or a local path or database); cloud artifact storage such as S3 is configured on the server side instead, e.g. `mlflow server --default-artifact-root s3://your-bucket/mlflow`.