Model Registry and Versioning
Manage model versions, promotion workflows, and rollbacks
The Versioning Crisis
It’s Monday morning. Your model API started returning strange predictions at 6 AM. You check the server. Three pickle files exist in /models/:
churn_model.pkl (modified: Saturday 11pm)
churn_model_backup.pkl (modified: Friday 3pm)
churn_model_old.pkl (modified: last Tuesday)
Which one is in production? You check app.py: MODEL_PATH = "/models/churn_model.pkl". That’s the Saturday version. Who deployed it? What changed? Is the Friday backup better? You have no idea.
You swap to churn_model_backup.pkl and redeploy. Predictions normalize. You’ve rolled back. But you have no idea what you deployed, why it was wrong, or whether Friday’s model is actually better than the original.
This is a model versioning failure, and it happens constantly in organizations without a model registry.
What is a Model Registry?
A model registry is a centralized catalog of trained models. It tracks:
- Every model version (with its artifacts, metrics, and provenance)
- Which stage each version is in: Staging, Production, Archived
- Who promoted or demoted each version and when
- Which training run produced each version (links back to MLflow experiments)
The model registry is the bridge between experimentation and deployment. You don’t deploy a pickle file. You promote a registered model version to Production, and your serving code loads whatever version has that stage.
MLflow Model Registry: Core Concepts
Registered Model: A named model (churn-predictor). Has multiple versions.
Model Version: A specific version of a registered model (Version 1, Version 2, etc.). Immutable once created. Has a stage.
Stage: Where the version lives in the deployment lifecycle:
- None: Just registered, not yet evaluated for deployment.
- Staging: In testing / QA. Deployed to staging environment.
- Production: Live. Your serving code loads this version.
- Archived: Replaced or retired. Kept for audit trail.
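Alias: A mutable pointer to a version. More flexible than stages: you can create aliases like @champion and @challenger and move them without changing the version’s stage. MLflow 2.9 and later deprecate stages in favor of aliases, so the stage-based calls in this lesson emit deprecation warnings on recent releases.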
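To make these four concepts concrete, here is a minimal in-memory sketch. ToyRegistry and its methods are hypothetical names invented for illustration. They are not part of MLflow and say nothing about how MLflow stores its data. The sketch only shows how versions, stages, and aliases relate to each other.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ModelVersion:
    version: int
    run_id: str          # training run that produced the artifact
    stage: str = "None"  # None, Staging, Production, or Archived


@dataclass
class ToyRegistry:
    """Hypothetical stand-in for a registry; not how MLflow works internally."""
    versions: Dict[str, List[ModelVersion]] = field(default_factory=dict)
    aliases: Dict[str, Dict[str, int]] = field(default_factory=dict)

    def register(self, name: str, run_id: str) -> ModelVersion:
        # Version numbers only increase; existing versions are never overwritten.
        existing = self.versions.setdefault(name, [])
        mv = ModelVersion(version=len(existing) + 1, run_id=run_id)
        existing.append(mv)
        return mv

    def transition(self, name: str, version: int, stage: str) -> None:
        # The stage is metadata on the version; the artifact does not change.
        self._get(name, version).stage = stage

    def set_alias(self, name: str, alias: str, version: int) -> None:
        # An alias is a mutable pointer that moves without touching stages.
        self._get(name, version)  # raises KeyError if the version is missing
        self.aliases.setdefault(name, {})[alias] = version

    def resolve_alias(self, name: str, alias: str) -> ModelVersion:
        return self._get(name, self.aliases[name][alias])

    def _get(self, name: str, version: int) -> ModelVersion:
        for mv in self.versions.get(name, []):
            if mv.version == version:
                return mv
        raise KeyError(f"{name} has no version {version}")


registry = ToyRegistry()
v1 = registry.register("churn-predictor", run_id="run-a")
v2 = registry.register("churn-predictor", run_id="run-b")
registry.transition("churn-predictor", v1.version, "Production")
registry.set_alias("churn-predictor", "champion", v1.version)
registry.set_alias("churn-predictor", "challenger", v2.version)
```

Moving the champion alias to version 2 later would leave version 1’s stage untouched, which is the flexibility the alias definition describes.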
Registering a Model After Training
At the end of your training script, register the model:
# src/models/train.py (excerpt)
import os
import mlflow
import mlflow.sklearn
with mlflow.start_run() as run:
    # ... train model ...
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("roc_auc", 0.921)
    # The model must already be logged under the artifact path "model"
    # (e.g. mlflow.sklearn.log_model(model, "model")) for this URI to resolve.
    # Register the model to the registry
    model_uri = f"runs:/{run.info.run_id}/model"
    result = mlflow.register_model(
        model_uri=model_uri,
        name="churn-predictor",
        tags={
            "training_dataset": "v2023-Q4",
            "feature_set": "v2",
            "git_commit": os.getenv("GIT_SHA", "unknown"),
        },
    )
print(f"Registered model version: {result.version}")
print(f"Stage: {result.current_stage}")  # "None" initially
The first time you register, it creates the model churn-predictor with Version 1. Subsequent registrations create Version 2, 3, and so on.
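The tags passed at registration are what later let you trace a version back to its code and data, so it helps to build them in one place. The following is a rough sketch, not part of the lesson’s code. build_version_tags and current_git_sha are hypothetical helper names, and falling back to git rev-parse HEAD when GIT_SHA is unset is an assumption about your CI setup.

```python
import os
import subprocess
from typing import Dict


def current_git_sha(default: str = "unknown") -> str:
    """Prefer the GIT_SHA environment variable, then ask git, then fall back."""
    sha = os.getenv("GIT_SHA")
    if sha:
        return sha
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip() or default
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or we are not inside a repository
        return default


def build_version_tags(training_dataset: str, feature_set: str) -> Dict[str, str]:
    # Tag values are converted to strings explicitly.
    return {
        "training_dataset": str(training_dataset),
        "feature_set": str(feature_set),
        "git_commit": current_git_sha(),
    }


tags = build_version_tags(training_dataset="v2023-Q4", feature_set="v2")
```

The resulting dictionary can be passed as the tags argument of mlflow.register_model in place of the inline literal.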
Transitioning Stages
Once a model is registered, use the MlflowClient to promote it through stages:
# ci/register_model.py: run in CI after training passes quality gates
import os
from typing import Optional
import mlflow
from mlflow.tracking import MlflowClient
client = MlflowClient()
MODEL_NAME = "churn-predictor"
def get_latest_version(name: str, stage: Optional[str] = None) -> Optional[str]:
    """Get the version number of the latest model in a given stage."""
    versions = client.get_latest_versions(name, stages=[stage] if stage else None)
    if not versions:
        return None
    return versions[0].version
def promote_to_staging(run_id: str) -> str:
    """Register and promote the model from a run to Staging."""
    # Register
    model_uri = f"runs:/{run_id}/model"
    result = mlflow.register_model(model_uri, MODEL_NAME)
    version = result.version
    # Transition to Staging
    client.transition_model_version_stage(
        name=MODEL_NAME,
        version=version,
        stage="Staging",
        archive_existing_versions=False,  # Don't archive previous staging versions yet
    )
    print(f"Version {version} promoted to Staging")
    return version
def compare_with_production(staging_version: str) -> bool:
    """Return True if staging version beats production."""
    # Get production version metrics
    prod_version = get_latest_version(MODEL_NAME, stage="Production")
    if prod_version is None:
        print("No production model exists yet. Staging will become production.")
        return True
    staging_run = client.get_model_version(MODEL_NAME, staging_version)
    prod_run = client.get_model_version(MODEL_NAME, prod_version)
    staging_metrics = client.get_run(staging_run.run_id).data.metrics
    prod_metrics = client.get_run(prod_run.run_id).data.metrics
    staging_auc = staging_metrics.get("roc_auc", 0)
    prod_auc = prod_metrics.get("roc_auc", 0)
    print(f"Staging AUC: {staging_auc:.4f} | Production AUC: {prod_auc:.4f}")
    return staging_auc > prod_auc
def promote_to_production(version: str):
    """Promote a version to Production and archive the previous production model."""
    client.transition_model_version_stage(
        name=MODEL_NAME,
        version=version,
        stage="Production",
        archive_existing_versions=True,  # Archive old production model
    )
    print(f"Version {version} is now Production")
    # Also set an alias for programmatic access
    client.set_registered_model_alias(MODEL_NAME, "champion", version)
# Main CI flow
if __name__ == "__main__":
    import sys
    run_id = os.getenv("MLFLOW_RUN_ID")
    if not run_id:
        print("No MLFLOW_RUN_ID found. Set it from your training run.")
        sys.exit(1)
    # Step 1: Promote to staging
    staging_version = promote_to_staging(run_id)
    # Step 2: Compare with production
    is_better = compare_with_production(staging_version)
    if is_better:
        # Step 3: Promote to production if better
        promote_to_production(staging_version)
        print("Deployment complete: new model is now in production")
    else:
        print("New model did not beat production. Keeping current production model.")
        # Archive the staging version that didn't make it
        client.transition_model_version_stage(
            name=MODEL_NAME,
            version=staging_version,
            stage="Archived",
        )
        sys.exit(1)  # Fail the CI step
Loading the Production Model in Your Serving API
Instead of loading a pickle file by path, load the production model by its registry stage:
# src/api/app.py (model loading section)
import os
from contextlib import asynccontextmanager
import mlflow
import mlflow.sklearn
from fastapi import FastAPI
# logger and state are assumed to be defined elsewhere in app.py
MLFLOW_TRACKING_URI = os.getenv("MLFLOW_TRACKING_URI", "http://localhost:5000")
MODEL_NAME = "churn-predictor"
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the current production model from the registry
    # This always loads whatever version is currently staged as Production
    model_uri = f"models:/{MODEL_NAME}/Production"
    # Or load by alias (more flexible):
    # model_uri = f"models:/{MODEL_NAME}@champion"
    logger.info(f"Loading model from registry: {model_uri}")
    state.model = mlflow.sklearn.load_model(model_uri)
    # Capture the actual version that was loaded
    client = mlflow.tracking.MlflowClient()
    prod_versions = client.get_latest_versions(MODEL_NAME, stages=["Production"])
    state.model_version = prod_versions[0].version if prod_versions else "unknown"
    logger.info(f"Loaded model version {state.model_version}")
    yield
    state.model = None
Now when you promote a new version to Production in the registry, the next deployment of your API automatically picks it up. No file path changes, no reconfigurations.
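Because the model is loaded once at startup, a running API keeps serving the old version until it is redeployed. One possible way to notice the gap is to compare the version held in memory with the version the alias currently points at. This is a sketch under assumptions: needs_reload is a hypothetical helper, it assumes you load by the @champion alias, and FakeClient only stands in for MlflowClient so the example runs without a tracking server. With a real client the call used is MlflowClient.get_model_version_by_alias.

```python
from types import SimpleNamespace


def needs_reload(client, model_name: str, alias: str, loaded_version: str) -> bool:
    """Return True when the alias points at a different version than the one in memory."""
    current = client.get_model_version_by_alias(model_name, alias)
    return str(current.version) != str(loaded_version)


class FakeClient:
    """Hypothetical stand-in for MlflowClient, used only to exercise the helper."""

    def __init__(self, version: str):
        self._version = version

    def get_model_version_by_alias(self, name: str, alias: str):
        return SimpleNamespace(name=name, version=self._version)


stale = needs_reload(FakeClient("5"), "churn-predictor", "champion", loaded_version="4")
fresh = needs_reload(FakeClient("4"), "churn-predictor", "champion", loaded_version="4")
```

A health check or background task could call this periodically and log a warning or trigger a restart. Whether to reload in-process or redeploy is a design choice this lesson leaves to the deployment step.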
The Rollback Procedure
This is the scenario: you deployed Version 5 on Monday morning. By 10 AM, monitoring shows that prediction distributions have shifted. You need to roll back to Version 4.
Step 1: Check what’s in production
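from mlflow.tracking import MlflowClient
client = MlflowClient()
prod_versions = client.get_latest_versions("churn-predictor", stages=["Production"])
print(f"Current production: Version {prod_versions[0].version}")
# get_latest_versions returns only the newest version per stage,
# so search all versions to list every archived one
all_versions = client.search_model_versions("name='churn-predictor'")
archived_versions = [v for v in all_versions if v.current_stage == "Archived"]
for v in archived_versions:
    run = client.get_run(v.run_id)
    auc = run.data.metrics.get("roc_auc")
    auc_text = f"{auc:.4f}" if auc is not None else "N/A"  # avoid formatting a missing metric
    print(f"Archived Version {v.version}: AUC={auc_text}")
Step 2: Promote the previous version back to production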
def rollback_to_version(version: str):
    """Roll back production to a specific version."""
    # Archive the current production version
    current_prod = client.get_latest_versions("churn-predictor", stages=["Production"])
    for v in current_prod:
        client.transition_model_version_stage(
            name="churn-predictor",
            version=v.version,
            stage="Archived",
        )
        print(f"Archived Version {v.version}")
    # Promote the specified version back to production
    client.transition_model_version_stage(
        name="churn-predictor",
        version=version,
        stage="Production",
        archive_existing_versions=False,
    )
    # Update the champion alias
    client.set_registered_model_alias("churn-predictor", "champion", version)
    print(f"Rolled back to Version {version}")
# Roll back from Version 5 to Version 4
rollback_to_version("4")
Step 3: Redeploy the serving API (or restart it so it reloads from the registry)
# Kubernetes rolling restart
kubectl rollout restart deployment/churn-predictor-api
# Or for Docker Compose
docker-compose up -d --force-recreate api
Within minutes, the serving API is loading Version 4 from the registry. The entire rollback takes under 5 minutes and leaves a complete audit trail in MLflow.
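The procedure above assumes you already know which version to return to. If you want the script to choose, one simple rule is to take the highest archived version number below the current production version. This is a sketch of that rule only. pick_rollback_target is a hypothetical helper, and the rule is an assumption: the previous version is not guaranteed to be the best one, so check its metrics before promoting it.

```python
from typing import Iterable, Optional


def pick_rollback_target(archived_versions: Iterable[str], current: str) -> Optional[str]:
    """Return the highest archived version number below the current production version."""
    older = [int(v) for v in archived_versions if int(v) < int(current)]
    return str(max(older)) if older else None


target = pick_rollback_target(["2", "3", "4"], current="5")
```

The result can be passed straight to rollback_to_version. A return value of None means there is nothing older to fall back to.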
Model Lineage: The Audit Trail
Every version in the registry links to its source run. Provided your training script logs them, that run records:
- The exact training code (Git commit hash)
- The exact training data (DVC hash)
- All hyperparameters
- All metrics from training and evaluation
- The environment (Python version, package versions)
This lineage is essential for:
Debugging: “Why did Version 5 fail?” → Open the registry → Find the training run → See it was trained on a dataset that had a preprocessing bug.
Compliance: In regulated industries, you must be able to explain every prediction your model made. The registry gives you the model version used at any point in time, and the training run gives you the data and code.
Reproducibility: Need to reproduce Version 3? Check out the Git commit, pull the DVC data version, and run training with the logged hyperparameters.
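The walk from a registry version back to its training run can be scripted. The sketch below is illustrative. lineage_report is a hypothetical helper, and FakeClient stands in for MlflowClient so the example runs without a tracking server. It uses get_model_version and get_run, the same two client calls the promotion script relies on. The mlflow.source.git.commit tag is one MLflow may set automatically when a run is launched from a Git checkout. Anything else, such as a DVC hash, appears only if your training script logs it.

```python
from types import SimpleNamespace
from typing import Any, Dict


def lineage_report(client, model_name: str, version: str) -> Dict[str, Any]:
    """Collect what the source run recorded about a registered model version."""
    mv = client.get_model_version(model_name, version)
    run = client.get_run(mv.run_id)
    return {
        "model": model_name,
        "version": str(mv.version),
        "run_id": mv.run_id,
        "git_commit": run.data.tags.get("mlflow.source.git.commit", "unknown"),
        "params": dict(run.data.params),
        "metrics": dict(run.data.metrics),
    }


class FakeClient:
    """Hypothetical stand-in for MlflowClient with made-up values."""

    def get_model_version(self, name, version):
        return SimpleNamespace(name=name, version=version, run_id="run-123")

    def get_run(self, run_id):
        data = SimpleNamespace(
            tags={"mlflow.source.git.commit": "abc1234"},
            params={"learning_rate": "0.01"},
            metrics={"roc_auc": 0.921},
        )
        return SimpleNamespace(info=SimpleNamespace(run_id=run_id), data=data)


report = lineage_report(FakeClient(), "churn-predictor", "3")
```

With a real client, the commit hash and parameters in the report are the inputs for the reproduction steps described above.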
Summary
The model registry brings the same discipline to model versions that Git brings to code versions. Key takeaways:
- Never deploy a model by copying a pickle file — register it and load from the registry
- Use stages to manage the promotion workflow:
None → Staging → Production
- Use aliases (@champion, @challenger) for flexible A/B testing and gradual rollouts
- Always compare new models against the current production baseline before promoting
- Rollbacks take minutes when you use the registry — seconds to change a stage, minutes to redeploy
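The baseline comparison in these takeaways can be stricter than a plain greater-than check, since a tiny AUC difference is often noise. One possible gate is sketched below. beats_baseline and the 0.002 default margin are hypothetical, illustrative choices, not values from this lesson.

```python
from typing import Mapping, Optional


def beats_baseline(
    candidate: Mapping[str, float],
    baseline: Optional[Mapping[str, float]],
    metric: str = "roc_auc",
    min_gain: float = 0.002,
) -> bool:
    """Return True when the candidate improves on the baseline by at least min_gain."""
    if metric not in candidate:
        # A candidate that never logged the metric should not pass silently.
        raise KeyError(f"candidate is missing metric {metric!r}")
    if baseline is None or metric not in baseline:
        return True  # nothing to compare against yet
    return candidate[metric] - baseline[metric] >= min_gain


clear_win = beats_baseline({"roc_auc": 0.93}, {"roc_auc": 0.92})
within_noise = beats_baseline({"roc_auc": 0.9205}, {"roc_auc": 0.92})
```

Raising on a missing candidate metric is a deliberate choice here: defaulting it to 0 would hide a training run that forgot to log its evaluation.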
In the next lesson, you’ll learn how to deploy this whole stack on Kubernetes so it scales automatically under load.
