Experiment Tracking with MLflow

The Notebook Graveyard

Open your Downloads folder. If you’ve done any ML work, you probably have files like this:

model_final.ipynb
model_final_v2.ipynb
model_final_REAL.ipynb
model_v3_USE_THIS_ONE.ipynb
model_v3_better.ipynb
best_model_DO_NOT_DELETE.ipynb

You ran experiments. Some worked. Some didn’t. But now, three months later, you cannot answer these basic questions:

Which notebook produced the 94% accuracy result?
What learning rate was used in that run?
Was the data preprocessed the same way in all runs?
What were the exact feature columns?

This is the experiment tracking problem, and it kills ML projects. You re-run experiments you already ran. You lose good results. You deploy a model without knowing what hyperparameters produced it.

MLflow solves this. Every training run logs its parameters, metrics, and artifacts to a central server. You get a UI to compare runs side-by-side, filter by metric, and reproduce any historical experiment.

MLflow Core Concepts

Before the code, understand the data model:

Experiment: A named group of related runs. churn-prediction might be one experiment; churn-prediction-xgboost another.

Run: A single execution of your training code. Each run has a unique ID and logs parameters, metrics, tags, and artifacts.

Parameters: Inputs to your training run. learning_rate=0.01, max_depth=6. Each key is logged once per run, usually at the start, and its value cannot be changed afterward.

Metrics: Measured outputs. accuracy=0.91, f1=0.88. Can be logged multiple times (e.g., once per epoch) to produce time series charts.

Artifacts: Files produced by the run. The trained model, plots, confusion matrices, feature importance charts.

Model Registry: A catalog of versioned models with stages (Staging, Production, Archived). Recent MLflow releases deprecate stages in favor of model version aliases. Covered in a later lesson.

Installing and Starting MLflow

pip install mlflow scikit-learn pandas matplotlib

Start the MLflow tracking server (locally for now):

mlflow ui

Open http://localhost:5000. You’ll see an empty UI. It fills up as you run experiments.

For production, you’d point MLflow at a database and shared blob storage:

mlflow server \
  --backend-store-uri postgresql://user:pass@host/mlflow \
  --default-artifact-root s3://my-bucket/mlflow-artifacts \
  --host 0.0.0.0 \
  --port 5000

Your First Tracked Experiment

Let’s build a complete example: a grid search over learning_rate in [0.001, 0.01, 0.1] and n_estimators in [50, 100, 200] for a churn prediction model. We’ll log every run to MLflow and compare them in the UI.

# experiments/search_learning_rate.py
import mlflow
import mlflow.sklearn
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Point to your MLflow server
mlflow.set_tracking_uri("http://localhost:5000")

# Name the experiment (created if it doesn't exist)
mlflow.set_experiment("churn-prediction")

# Generate synthetic churn data for this example.
# The labels are random, so expect accuracy and AUC close to 0.5.
np.random.seed(42)
n_samples = 5000
data = pd.DataFrame({
    "tenure_months": np.random.randint(1, 72, n_samples),
    "monthly_charges": np.random.uniform(20, 100, n_samples),
    "total_charges": np.random.uniform(100, 8000, n_samples),
    "num_products": np.random.randint(1, 5, n_samples),
    "has_support_calls": np.random.randint(0, 2, n_samples),
    "churn": np.random.randint(0, 2, n_samples),
})

X = data.drop("churn", axis=1)
y = data["churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hyperparameter grid
learning_rates = [0.001, 0.01, 0.1]
n_estimators_options = [50, 100, 200]

for lr in learning_rates:
    for n_est in n_estimators_options:
        # Start a new MLflow run
        with mlflow.start_run(run_name=f"lr={lr}_n_est={n_est}"):
            # Log parameters
            mlflow.log_param("learning_rate", lr)
            mlflow.log_param("n_estimators", n_est)
            mlflow.log_param("max_depth", 4)
            mlflow.log_param("train_size", len(X_train))

            # Train
            model = GradientBoostingClassifier(
                learning_rate=lr,
                n_estimators=n_est,
                max_depth=4,
                random_state=42,
            )
            model.fit(X_train, y_train)

            # Evaluate
            y_pred = model.predict(X_test)
            y_prob = model.predict_proba(X_test)[:, 1]

            acc = accuracy_score(y_test, y_pred)
            f1 = f1_score(y_test, y_pred)
            auc = roc_auc_score(y_test, y_prob)

            # Log metrics
            mlflow.log_metric("accuracy", acc)
            mlflow.log_metric("f1_score", f1)
            mlflow.log_metric("roc_auc", auc)

            # Log the model artifact
            mlflow.sklearn.log_model(
                model,
                artifact_path="model",
                registered_model_name="churn-predictor",  # registers a new model version on every run
            )

            # Log a feature importance plot
            import matplotlib.pyplot as plt
            fig, ax = plt.subplots(figsize=(8, 5))
            importances = pd.Series(
                model.feature_importances_, index=X.columns
            ).sort_values(ascending=True)
            importances.plot(kind="barh", ax=ax)
            ax.set_title(f"Feature Importance (lr={lr}, n_est={n_est})")
            plt.tight_layout()
            mlflow.log_figure(fig, "feature_importance.png")
            plt.close(fig)  # release this figure so figures don't accumulate across runs

            print(f"lr={lr}, n_est={n_est}: acc={acc:.4f}, auc={auc:.4f}")

Run this script, then open the MLflow UI at http://localhost:5000. You’ll see 9 runs in the churn-prediction experiment (3 learning rates × 3 values of n_estimators).

Navigating the MLflow UI

In the UI, you can:

Sort and filter: Click the roc_auc column header to sort all runs by AUC. Instantly find the best run.

Compare runs: Select multiple runs and click “Compare.” MLflow renders a side-by-side table of all parameters and metrics, plus parallel coordinates plots that reveal which hyperparameter combinations worked best.

View artifacts: Click into any run to see its logged artifacts — the feature importance PNG, the saved model, any other files.

Reproduce a run: Every run records its parameters and source file, and when the script is launched from inside a Git repository, the commit hash as well. To reproduce a run, check out that commit and rerun with the same parameters and the same training data.

Logging During Training (Per-Epoch Metrics)

For deep learning or iterative algorithms, you can log metrics at each step to see learning curves:

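import mlflow
import mlflow.pytorch

# model, train_loader, val_loader, train_one_epoch, and validate are
# placeholders for your own PyTorch training code.
with mlflow.start_run():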
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("epochs", 50)

    for epoch in range(50):
        # ... train one epoch ...
        train_loss = train_one_epoch(model, train_loader)
        val_loss = validate(model, val_loader)

        # Log with step number
        mlflow.log_metric("train_loss", train_loss, step=epoch)
        mlflow.log_metric("val_loss", val_loss, step=epoch)

    mlflow.pytorch.log_model(model, "model")

MLflow renders these as a time series chart. You can see exactly when your model started overfitting.

Auto-Logging: Zero-Code Tracking

MLflow supports auto-logging for major frameworks. A single line before your training code logs everything automatically:

import mlflow

# Enable auto-logging for sklearn
mlflow.sklearn.autolog()

# Now just train normally — MLflow logs everything
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, max_depth=6)
model.fit(X_train, y_train)
# Parameters, training metrics, and the fitted model are logged automatically

Auto-logging works with sklearn, XGBoost, LightGBM, PyTorch (through PyTorch Lightning), TensorFlow, Keras, and Spark MLlib. It’s not always granular enough for production use (you’ll usually want to log custom metrics, such as scores on a held-out test set), but it’s a great starting point.

Querying Runs Programmatically

You don’t have to use the UI. You can query MLflow runs in code — useful for CI pipelines that need to find the best model:

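import mlflow
from mlflow.tracking import MlflowClient

# Point at the same server the training script logged to
mlflow.set_tracking_uri("http://localhost:5000")
client = MlflowClient()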

# Get all runs in the experiment, sorted by AUC descending
experiment = client.get_experiment_by_name("churn-prediction")
runs = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["metrics.roc_auc DESC"],
    max_results=1,
)

best_run = runs[0]
print(f"Best run ID: {best_run.info.run_id}")
print(f"Best AUC: {best_run.data.metrics['roc_auc']:.4f}")
print(f"Parameters: {best_run.data.params}")

# Load the model from that run
model = mlflow.sklearn.load_model(f"runs:/{best_run.info.run_id}/model")

This is how your CI/CD pipeline will find and promote the best model — more on this in the CI/CD lesson.

Organizing Experiments

As your project grows, you’ll want more structure than a single experiment. Here are patterns that work well:

By model type: churn-gbm, churn-rf, churn-nn — makes it easy to compare architectures.

By feature set: churn-v1-features, churn-v2-features — tracks the impact of feature engineering work.

By dataset version: Tag each run with data_version=a1b2c3d4 (for example, the Git commit hash that pins your DVC-tracked data) so you can filter runs that used the same training data.

with mlflow.start_run():
    # Tag with data version (DVC commit hash)
    mlflow.set_tag("data_version", "a1b2c3d4")
    mlflow.set_tag("feature_set", "v2")
    mlflow.set_tag("model_type", "gradient_boosting")
    # ... rest of training ...

The Parent-Child Run Pattern

When doing hyperparameter search, use nested runs to group all search runs under one parent:

with mlflow.start_run(run_name="hyperparameter_search") as parent_run:
    for lr in [0.001, 0.01, 0.1]:
        with mlflow.start_run(run_name=f"lr={lr}", nested=True):
            mlflow.log_param("learning_rate", lr)
            # ... train and log metrics ...

In the UI, you’ll see a single parent run that you can expand to see all child runs. Much cleaner than 30 flat runs.

Summary

MLflow solves the notebook graveyard problem by giving every training run a permanent, searchable record. After this lesson you can:

Start an MLflow server and create experiments
Log parameters, metrics, and artifacts in any training script
Use the UI to compare runs and find the best hyperparameters
Query runs programmatically to find and load the best model
Use auto-logging for zero-configuration tracking

The next lesson takes this further: once you have a tracked experiment, how do you automate the entire training process so it runs on every code change? That’s CI/CD for ML.
