Experiment Tracking with MLflow
Log parameters, metrics, and artifacts — compare runs and reproduce results
The Notebook Graveyard
Open your Downloads folder. If you’ve done any ML work, you probably have files like this:
model_final.ipynb
model_final_v2.ipynb
model_final_REAL.ipynb
model_v3_USE_THIS_ONE.ipynb
model_v3_better.ipynb
best_model_DO_NOT_DELETE.ipynb
You ran experiments. Some worked. Some didn’t. But now, three months later, you cannot answer these basic questions:
- Which notebook produced the 94% accuracy result?
- What learning rate was used in that run?
- Was the data preprocessed the same way in all runs?
- What were the exact feature columns?
This is the experiment tracking problem, and it kills ML projects. You re-run experiments you already ran. You lose good results. You deploy a model without knowing what hyperparameters produced it.
MLflow solves this. Every training run logs its parameters, metrics, and artifacts to a central server. You get a UI to compare runs side-by-side, filter by metric, and reproduce any historical experiment.
MLflow Core Concepts
Before the code, understand the data model:
Experiment: A named group of related runs. churn-prediction might be one experiment; churn-prediction-xgboost another.
Run: A single execution of your training code. Each run has a unique ID and logs parameters, metrics, tags, and artifacts.
Parameters: Inputs to your training run, such as learning_rate=0.01 or max_depth=6. Typically logged at the start of the run; once a parameter is logged, its value cannot be changed within that run.
Metrics: Measured outputs. accuracy=0.91, f1=0.88. Can be logged multiple times (e.g., once per epoch) to produce time series charts.
Artifacts: Files produced by the run. The trained model, plots, confusion matrices, feature importance charts.
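Model Registry: A catalog of versioned models with lifecycle stages (Staging, Production, Archived). Newer MLflow releases steer toward model aliases instead of stages, so check which one your version expects. Covered in a later lesson.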
Installing and Starting MLflow
pip install mlflow scikit-learn pandas
Start the MLflow tracking server (locally for now):
mlflow ui
Open http://localhost:5000. You’ll see an empty UI. It fills up as you run experiments.
For production, you’d point MLflow at a database and shared blob storage:
mlflow server \
  --backend-store-uri postgresql://user:pass@host/mlflow \
  --default-artifact-root s3://my-bucket/mlflow-artifacts \
  --host 0.0.0.0 \
  --port 5000
Your First Tracked Experiment
Let’s build a complete example: a hyperparameter search over learning_rate in [0.001, 0.01, 0.1] and n_estimators in [50, 100, 200] for a churn prediction model. We’ll log every run to MLflow and compare them in the UI.
# experiments/search_learning_rate.py
import matplotlib.pyplot as plt
import mlflow
import mlflow.sklearn
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Point to your MLflow server
mlflow.set_tracking_uri("http://localhost:5000")

# Name the experiment (created if it doesn't exist)
mlflow.set_experiment("churn-prediction")

# Generate synthetic churn data for this example.
# The labels are random, so expect metrics near 0.5; the point here is the tracking workflow.
np.random.seed(42)
n_samples = 5000
data = pd.DataFrame({
    "tenure_months": np.random.randint(1, 72, n_samples),
    "monthly_charges": np.random.uniform(20, 100, n_samples),
    "total_charges": np.random.uniform(100, 8000, n_samples),
    "num_products": np.random.randint(1, 5, n_samples),
    "has_support_calls": np.random.randint(0, 2, n_samples),
    "churn": np.random.randint(0, 2, n_samples),
})
X = data.drop("churn", axis=1)
y = data["churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hyperparameter grid
learning_rates = [0.001, 0.01, 0.1]
n_estimators_options = [50, 100, 200]

for lr in learning_rates:
    for n_est in n_estimators_options:
        # Start a new MLflow run
        with mlflow.start_run(run_name=f"lr={lr}_n_est={n_est}"):
            # Log parameters
            mlflow.log_param("learning_rate", lr)
            mlflow.log_param("n_estimators", n_est)
            mlflow.log_param("max_depth", 4)
            mlflow.log_param("train_size", len(X_train))

            # Train
            model = GradientBoostingClassifier(
                learning_rate=lr,
                n_estimators=n_est,
                max_depth=4,
                random_state=42,
            )
            model.fit(X_train, y_train)

            # Evaluate
            y_pred = model.predict(X_test)
            y_prob = model.predict_proba(X_test)[:, 1]
            acc = accuracy_score(y_test, y_pred)
            f1 = f1_score(y_test, y_pred)
            auc = roc_auc_score(y_test, y_prob)

            # Log metrics
            mlflow.log_metric("accuracy", acc)
            mlflow.log_metric("f1_score", f1)
            mlflow.log_metric("roc_auc", auc)

            # Log the model artifact
            mlflow.sklearn.log_model(
                model,
                artifact_path="model",
                # Registers a new version in the Model Registry on every run
                registered_model_name="churn-predictor",
            )

            # Log a feature importance plot
            fig, ax = plt.subplots(figsize=(8, 5))
            importances = pd.Series(
                model.feature_importances_, index=X.columns
            ).sort_values(ascending=True)
            importances.plot(kind="barh", ax=ax)
            ax.set_title(f"Feature Importance (lr={lr}, n_est={n_est})")
            plt.tight_layout()
            mlflow.log_figure(fig, "feature_importance.png")
            plt.close(fig)  # release the figure so nine runs don't pile up open figures

            print(f"lr={lr}, n_est={n_est}: acc={acc:.4f}, auc={auc:.4f}")
Run this script, then open the MLflow UI at http://localhost:5000. You’ll see 9 runs in the churn-prediction experiment, one for each combination of the 3 learning rates and 3 n_estimators values.
Navigating the MLflow UI
In the UI, you can:
Sort and filter: Click the roc_auc column header to sort all runs by AUC. Instantly find the best run.
Compare runs: Select multiple runs and click “Compare.” MLflow renders a side-by-side table of all parameters and metrics, plus parallel coordinates plots that reveal which hyperparameter combinations worked best.
View artifacts: Click into any run to see its logged artifacts — the feature importance PNG, the saved model, any other files.
Reproduce a run: When the training script runs from inside a Git repository, MLflow records the commit hash alongside the source file name and every logged parameter. To reproduce a run, check out that commit and rerun it with those parameters on the same data.
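The automatic commit record depends on MLflow being able to detect a Git repository for the running code, which may not hold for notebook sessions or code copied outside a repository. One way to cover that gap is to capture the hash yourself. The sketch below uses only the standard library; the helper name current_git_commit is a hypothetical choice, not part of MLflow.

```python
import subprocess
from typing import Optional


def current_git_commit(repo_dir: str = ".") -> Optional[str]:
    """Return the HEAD commit hash for repo_dir, or None if it is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, the directory does not exist,
        # or it is not a repository with at least one commit
        return None
    return result.stdout.strip()


commit = current_git_commit()
print(commit or "no Git commit available")
```

Inside an active run, a call such as mlflow.set_tag("git_commit", commit) would store the value next to the parameters; the tag name git_commit is likewise just an illustrative choice.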
Logging During Training (Per-Epoch Metrics)
For deep learning or iterative algorithms, you can log metrics at each step to see learning curves:
import mlflow
import mlflow.pytorch

# model, train_loader, val_loader, train_one_epoch, and validate are
# placeholders for your own PyTorch training code
with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("epochs", 50)
    for epoch in range(50):
        # ... train one epoch ...
        train_loss = train_one_epoch(model, train_loader)
        val_loss = validate(model, val_loader)
        # Log with step number
        mlflow.log_metric("train_loss", train_loss, step=epoch)
        mlflow.log_metric("val_loss", val_loss, step=epoch)
    mlflow.pytorch.log_model(model, "model")
MLflow renders these as a time series chart. The epoch where val_loss starts rising while train_loss keeps falling is where your model started overfitting.
Auto-Logging: Zero-Code Tracking
MLflow supports auto-logging for major frameworks. A single line before your training code logs everything automatically:
import mlflow
import mlflow.sklearn
# Enable auto-logging for sklearn
mlflow.sklearn.autolog()
# Now just train normally; X_train and y_train come from the earlier example
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, max_depth=6)
model.fit(X_train, y_train)
# Parameters, training metrics, and the model artifact are all logged automatically
Auto-logging works with sklearn, XGBoost, LightGBM, PyTorch, TensorFlow, Keras, and Spark MLlib. It’s not always granular enough for production use (you’ll usually want to log custom metrics), but it’s a great starting point.
Querying Runs Programmatically
You don’t have to use the UI. You can query MLflow runs in code — useful for CI pipelines that need to find the best model:
import mlflow
import mlflow.sklearn

# Point the client at the same server the training script logged to
mlflow.set_tracking_uri("http://localhost:5000")
client = mlflow.tracking.MlflowClient()

# Get the single best run in the experiment, sorted by AUC descending
experiment = client.get_experiment_by_name("churn-prediction")
runs = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["metrics.roc_auc DESC"],
    max_results=1,
)
best_run = runs[0]
print(f"Best run ID: {best_run.info.run_id}")
print(f"Best AUC: {best_run.data.metrics['roc_auc']:.4f}")
print(f"Parameters: {best_run.data.params}")

# Load the model from that run
model = mlflow.sklearn.load_model(f"runs:/{best_run.info.run_id}/model")
This is how your CI/CD pipeline will find and promote the best model; more on this in the CI/CD lesson.
Organizing Experiments
As your project grows, you’ll want more structure than a single experiment. Here are patterns that work well:
By model type: churn-gbm, churn-rf, churn-nn — makes it easy to compare architectures.
By feature set: churn-v1-features, churn-v2-features — tracks the impact of feature engineering work.
By dataset version: Tag each run with data_version=abc123 (the DVC commit hash) so you can filter runs that used the same training data.
with mlflow.start_run():
    # Tag with data version (DVC commit hash)
    mlflow.set_tag("data_version", "a1b2c3d4")
    mlflow.set_tag("feature_set", "v2")
    mlflow.set_tag("model_type", "gradient_boosting")
    # ... rest of training ...
The Parent-Child Run Pattern
When doing hyperparameter search, use nested runs to group all search runs under one parent:
with mlflow.start_run(run_name="hyperparameter_search") as parent_run:
    for lr in [0.001, 0.01, 0.1]:
        with mlflow.start_run(run_name=f"lr={lr}", nested=True):
            mlflow.log_param("learning_rate", lr)
            # ... train and log metrics ...
In the UI, you’ll see a single parent run that you can expand to see all child runs. Much cleaner than 30 flat runs.
Summary
MLflow solves the notebook graveyard problem by giving every training run a permanent, searchable record. After this lesson you can:
- Start an MLflow server and create experiments
- Log parameters, metrics, and artifacts in any training script
- Use the UI to compare runs and find the best hyperparameters
- Query runs programmatically to find and load the best model
- Use auto-logging for zero-configuration tracking
The next lesson takes this further: once you have a tracked experiment, how do you automate the entire training process so it runs on every code change? That’s CI/CD for ML.
