📖 Lesson ⏱️ 120 minutes

Experiment Tracking with MLflow

Log parameters, metrics, and artifacts — compare runs and reproduce results

The Notebook Graveyard

Open your Downloads folder. If you’ve done any ML work, you probably have files like this:

model_final.ipynb
model_final_v2.ipynb
model_final_REAL.ipynb
model_v3_USE_THIS_ONE.ipynb
model_v3_better.ipynb
best_model_DO_NOT_DELETE.ipynb

You ran experiments. Some worked. Some didn’t. But now, three months later, you cannot answer these basic questions:

  • Which notebook produced the 94% accuracy result?
  • What learning rate was used in that run?
  • Was the data preprocessed the same way in all runs?
  • What were the exact feature columns?

This is the experiment tracking problem, and it kills ML projects. You re-run experiments you already ran. You lose good results. You deploy a model without knowing what hyperparameters produced it.

MLflow solves this. Every training run logs its parameters, metrics, and artifacts to a central server. You get a UI to compare runs side-by-side, filter by metric, and reproduce any historical experiment.


MLflow Core Concepts

Before the code, understand the data model:

Experiment: A named group of related runs. churn-prediction might be one experiment; churn-prediction-xgboost another.

Run: A single execution of your training code. Each run has a unique ID and logs parameters, metrics, tags, and artifacts.

Parameters: Inputs to your training run. learning_rate=0.01, max_depth=6. Each key is normally logged once per run, and MLflow refuses to overwrite a logged parameter with a different value.

Metrics: Measured outputs. accuracy=0.91, f1=0.88. Can be logged multiple times (e.g., once per epoch) to produce time series charts.

Artifacts: Files produced by the run. The trained model, plots, confusion matrices, feature importance charts.

Model Registry: A catalog of versioned models with stages (Staging, Production, Archived). Recent MLflow releases deprecate stages in favor of model version aliases. Covered in a later lesson.


Installing and Starting MLflow

pip install mlflow scikit-learn pandas

Start the MLflow tracking server (locally for now):

mlflow ui

Open http://localhost:5000. You’ll see an empty UI. It fills up as you run experiments.

For production, you’d point MLflow at a database and shared blob storage:

mlflow server \
  --backend-store-uri postgresql://user:pass@host/mlflow \
  --default-artifact-root s3://my-bucket/mlflow-artifacts \
  --host 0.0.0.0 \
  --port 5000

Your First Tracked Experiment

Let’s build a complete example: a small grid search over learning_rate in [0.001, 0.01, 0.1] and n_estimators in [50, 100, 200] for a churn prediction model. We’ll log all nine runs to MLflow and compare them in the UI.

# experiments/search_learning_rate.py
import mlflow
import mlflow.sklearn
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Point to your MLflow server
mlflow.set_tracking_uri("http://localhost:5000")

# Name the experiment (created if it doesn't exist)
mlflow.set_experiment("churn-prediction")

# Generate synthetic churn data for this example.
# The labels are random, so every score will hover around 0.5.
np.random.seed(42)
n_samples = 5000
data = pd.DataFrame({
    "tenure_months": np.random.randint(1, 72, n_samples),
    "monthly_charges": np.random.uniform(20, 100, n_samples),
    "total_charges": np.random.uniform(100, 8000, n_samples),
    "num_products": np.random.randint(1, 5, n_samples),
    "has_support_calls": np.random.randint(0, 2, n_samples),
    "churn": np.random.randint(0, 2, n_samples),
})

X = data.drop("churn", axis=1)
y = data["churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hyperparameter grid (3 x 3 = 9 runs)
learning_rates = [0.001, 0.01, 0.1]
n_estimators_options = [50, 100, 200]

for lr in learning_rates:
    for n_est in n_estimators_options:
        # Start a new MLflow run
        with mlflow.start_run(run_name=f"lr={lr}_n_est={n_est}"):
            # Log parameters
            mlflow.log_param("learning_rate", lr)
            mlflow.log_param("n_estimators", n_est)
            mlflow.log_param("max_depth", 4)
            mlflow.log_param("train_size", len(X_train))

            # Train
            model = GradientBoostingClassifier(
                learning_rate=lr,
                n_estimators=n_est,
                max_depth=4,
                random_state=42,
            )
            model.fit(X_train, y_train)

            # Evaluate
            y_pred = model.predict(X_test)
            y_prob = model.predict_proba(X_test)[:, 1]

            acc = accuracy_score(y_test, y_pred)
            f1 = f1_score(y_test, y_pred)
            auc = roc_auc_score(y_test, y_prob)

            # Log metrics
            mlflow.log_metric("accuracy", acc)
            mlflow.log_metric("f1_score", f1)
            mlflow.log_metric("roc_auc", auc)

            # Log the model artifact
            mlflow.sklearn.log_model(
                model,
                artifact_path="model",
                # Registers a new model version in the Model Registry on every run
                registered_model_name="churn-predictor",
            )

            # Log a feature importance plot
            fig, ax = plt.subplots(figsize=(8, 5))
            importances = pd.Series(
                model.feature_importances_, index=X.columns
            ).sort_values(ascending=True)
            importances.plot(kind="barh", ax=ax)
            ax.set_title(f"Feature Importance (lr={lr}, n_est={n_est})")
            plt.tight_layout()
            mlflow.log_figure(fig, "feature_importance.png")
            plt.close(fig)  # free the figure so nine plots don't pile up in memory

            print(f"lr={lr}, n_est={n_est}: acc={acc:.4f}, auc={auc:.4f}")

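Run this script, then open the MLflow UI at http://localhost:5000. You’ll see 9 runs (3 learning rates × 3 n_estimators values) in the churn-prediction experiment. Because the synthetic labels are random, expect accuracy and AUC close to 0.5. The point here is the tracking workflow, not the scores.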


In the UI, you can:

Sort and filter: Click the roc_auc column header to sort all runs by AUC. Instantly find the best run.

Compare runs: Select multiple runs and click “Compare.” MLflow renders a side-by-side table of all parameters and metrics, plus parallel coordinates plots that reveal which hyperparameter combinations worked best.

View artifacts: Click into any run to see its logged artifacts — the feature importance PNG, the saved model, any other files.

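Reproduce a run: Every run stores its parameters, and when the script is launched from inside a Git repository, MLflow also records the commit hash and the source file name as run tags. To reproduce the run, check out that commit and rerun the script with those parameters.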


Logging During Training (Per-Epoch Metrics)

For deep learning or iterative algorithms, you can log metrics at each step to see learning curves:

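import mlflow
import mlflow.pytorch

# Assumes model, train_loader, and val_loader already exist, and that
# train_one_epoch() and validate() are your own helpers returning a float loss.
with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("epochs", 50)

    for epoch in range(50):
        # ... train one epoch ...
        train_loss = train_one_epoch(model, train_loader)
        val_loss = validate(model, val_loader)

        # Log with step number so each value lands at the right x position
        mlflow.log_metric("train_loss", train_loss, step=epoch)
        mlflow.log_metric("val_loss", val_loss, step=epoch)

    mlflow.pytorch.log_model(model, "model")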

MLflow renders each metric as a time series chart. When val_loss starts climbing while train_loss keeps falling, you have found the epoch where your model started overfitting.


Auto-Logging: Zero-Code Tracking

MLflow supports auto-logging for major frameworks. A single line before your training code logs everything automatically:

import mlflow
from sklearn.ensemble import RandomForestClassifier

# Enable auto-logging for sklearn
mlflow.sklearn.autolog()

# Train as usual (X_train and y_train come from the earlier example)
model = RandomForestClassifier(n_estimators=100, max_depth=6)
model.fit(X_train, y_train)
# Parameters, training metrics, and the model artifact are logged automatically

Auto-logging is available for scikit-learn, XGBoost, LightGBM, PyTorch (through PyTorch Lightning), TensorFlow/Keras, and Spark MLlib. It’s not always granular enough for production use, since you’ll usually want to log custom metrics as well, but it’s a great starting point.


Querying Runs Programmatically

You don’t have to use the UI. You can query MLflow runs in code — useful for CI pipelines that need to find the best model:

import mlflow
from mlflow.tracking import MlflowClient

# Point at the same server the training script logged to
mlflow.set_tracking_uri("http://localhost:5000")
client = MlflowClient()

# Get the single best run in the experiment, ranked by AUC descending
experiment = client.get_experiment_by_name("churn-prediction")
if experiment is None:
    raise SystemExit("Experiment 'churn-prediction' not found")

runs = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["metrics.roc_auc DESC"],
    max_results=1,
)
if not runs:
    raise SystemExit("No runs logged yet")

best_run = runs[0]
print(f"Best run ID: {best_run.info.run_id}")
print(f"Best AUC: {best_run.data.metrics['roc_auc']:.4f}")
print(f"Parameters: {best_run.data.params}")

# Load the model from that run
model = mlflow.sklearn.load_model(f"runs:/{best_run.info.run_id}/model")

This is how your CI/CD pipeline will find and promote the best model — more on this in the CI/CD lesson.


Organizing Experiments

As your project grows, you’ll want more structure than a single experiment. Here are patterns that work well:

By model type: churn-gbm, churn-rf, churn-nn — makes it easy to compare architectures.

By feature set: churn-v1-features, churn-v2-features — tracks the impact of feature engineering work.

By dataset version: Tag each run with data_version=a1b2c3d4 (for example, the commit hash that pins your DVC-tracked data) so you can filter runs that used the same training data.

with mlflow.start_run():
    # Tag with data version (DVC commit hash)
    mlflow.set_tag("data_version", "a1b2c3d4")
    mlflow.set_tag("feature_set", "v2")
    mlflow.set_tag("model_type", "gradient_boosting")
    # ... rest of training ...

The Parent-Child Run Pattern

When doing hyperparameter search, use nested runs to group all search runs under one parent:

with mlflow.start_run(run_name="hyperparameter_search") as parent_run:
    for lr in [0.001, 0.01, 0.1]:
        with mlflow.start_run(run_name=f"lr={lr}", nested=True):
            mlflow.log_param("learning_rate", lr)
            # ... train and log metrics ...

In the UI, you’ll see a single parent run that you can expand to see all child runs. Much cleaner than 30 flat runs.


Summary

MLflow solves the notebook graveyard problem by giving every training run a permanent, searchable record. After this lesson you can:

  • Start an MLflow server and create experiments
  • Log parameters, metrics, and artifacts in any training script
  • Use the UI to compare runs and find the best hyperparameters
  • Query runs programmatically to find and load the best model
  • Use auto-logging for zero-configuration tracking

The next lesson takes this further: once you have a tracked experiment, how do you automate the entire training process so it runs on every code change? That’s CI/CD for ML.