📖 Lesson ⏱️ 90 minutes

Model Registry and Versioning

Manage model versions, promotion workflows, and rollbacks

The Versioning Crisis

It’s Monday morning. Your model API started returning strange predictions at 6 AM. You check the server. Three pickle files exist in /models/:

churn_model.pkl          (modified: Saturday 11pm)
churn_model_backup.pkl   (modified: Friday 3pm)
churn_model_old.pkl      (modified: last Tuesday)

Which one is in production? You check app.py: MODEL_PATH = "/models/churn_model.pkl". That’s the Saturday version. Who deployed it? What changed? Is the Friday backup better? You have no idea.

You swap to churn_model_backup.pkl and redeploy. Predictions normalize. You’ve rolled back. But you have no idea what you deployed, why it was wrong, or whether Friday’s model is actually better than the original.

This is a model versioning failure, and it happens constantly in organizations without a model registry.


What is a Model Registry?

A model registry is a centralized catalog of trained models. It tracks:

  • Every model version (with its artifacts, metrics, and provenance)
  • Which stage each version is in: None, Staging, Production, or Archived
  • Who promoted or demoted each version and when
  • Which training run produced each version (links back to MLflow experiments)

The model registry is the bridge between experimentation and deployment. You don’t deploy a pickle file. You promote a registered model version to Production, and your serving code loads whatever version has that stage.
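To make the idea concrete, here is a toy in-memory sketch of the bookkeeping a registry does. It is not how MLflow is implemented; the class names ToyRegistry and VersionRecord, and the example URIs in the usage note, are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class VersionRecord:
    """One model artifact plus the metadata a registry keeps about it."""
    version: int
    artifact_uri: str   # where the serialized model lives
    run_id: str         # training run that produced it
    stage: str = "None"
    history: list = field(default_factory=list)  # (timestamp, user, old_stage, new_stage)


class ToyRegistry:
    """In-memory stand-in for a model registry, for illustration only."""

    def __init__(self):
        self._models = {}

    def register(self, name, artifact_uri, run_id):
        # Version numbers are assigned sequentially per model name
        versions = self._models.setdefault(name, [])
        record = VersionRecord(len(versions) + 1, artifact_uri, run_id)
        versions.append(record)
        return record

    def transition(self, name, version, stage, user):
        # Record who moved the version and when, then update its stage
        record = self._models[name][version - 1]
        record.history.append((datetime.now(timezone.utc), user, record.stage, stage))
        record.stage = stage
        return record

    def in_stage(self, name, stage):
        return [r for r in self._models.get(name, []) if r.stage == stage]
```

Registering twice under the same name yields versions 1 and 2, and calling transition("churn-predictor", 2, "Production", "alice") leaves an entry in that version's history saying who promoted it and from which stage.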


MLflow Model Registry: Core Concepts

Registered Model: A named model (churn-predictor). Has multiple versions.

Model Version: A specific version of a registered model (Version 1, Version 2, etc.). Its artifacts are immutable once created; its stage, tags, and description can still change.

Stage: Where the version lives in the deployment lifecycle:

  • None: Just registered, not yet evaluated for deployment
  • Staging: In testing / QA. Deployed to staging environment.
  • Production: Live. Your serving code loads this version.
  • Archived: Replaced or retired. Kept for audit trail.

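Alias: A mutable, named pointer to a version. Aliases are more flexible than stages: you can create aliases like @champion and @challenger and move them between versions without changing any version’s stage. MLflow 2.9 and later deprecate stages in favor of aliases, so prefer aliases in new projects.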
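As a minimal sketch of how aliases move, the helper below points @champion at whatever @challenger currently references. The function name swap_champion is hypothetical; client is assumed to be an mlflow.tracking.MlflowClient from an MLflow release that supports aliases (2.3 or later).

```python
def swap_champion(client, name: str) -> str:
    """Point @champion at the version @challenger references; return that version.

    `client` is expected to be an mlflow.tracking.MlflowClient.
    """
    # Resolve the alias to a concrete model version
    challenger = client.get_model_version_by_alias(name, "challenger")
    # Re-pointing an alias does not touch the version itself
    client.set_registered_model_alias(name, "champion", challenger.version)
    # The challenger slot is now free for the next candidate
    client.delete_registered_model_alias(name, "challenger")
    return challenger.version
```

With a real tracking server you would call swap_champion(MlflowClient(), "churn-predictor"). Serving code that loads models:/churn-predictor@champion then resolves to the new version the next time it loads.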


Registering a Model After Training

At the end of your training script, register the model:

# src/models/train.py (excerpt)
import os

import mlflow
import mlflow.sklearn

with mlflow.start_run() as run:
    # ... train model ...
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("roc_auc", 0.921)

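    # Register the model to the registry.
    # The URI below resolves only if the model was logged in this run under the
    # "model" artifact path, e.g. mlflow.sklearn.log_model(model, "model")
    model_uri = f"runs:/{run.info.run_id}/model"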
    result = mlflow.register_model(
        model_uri=model_uri,
        name="churn-predictor",
        tags={
            "training_dataset": "v2023-Q4",
            "feature_set": "v2",
            "git_commit": os.getenv("GIT_SHA", "unknown"),
        },
    )

    print(f"Registered model version: {result.version}")
    print(f"Stage: {result.current_stage}")  # "None" initially

The first time you register, it creates the model churn-predictor with Version 1. Subsequent registrations create Version 2, 3, and so on.
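To check which versions exist for a registered model, one option is to query the registry with MlflowClient.search_model_versions. The helper name list_versions is hypothetical, and client is assumed to be an mlflow.tracking.MlflowClient.

```python
def list_versions(client, name: str) -> list:
    """Return (version, run_id, stage) tuples for a registered model, oldest first."""
    # The filter string uses MLflow's search syntax for model versions
    versions = client.search_model_versions(f"name='{name}'")
    rows = [(int(v.version), v.run_id, v.current_stage) for v in versions]
    # Version numbers come back as strings, so sort on the integer value
    return sorted(rows, key=lambda row: row[0])
```

Calling list_versions(MlflowClient(), "churn-predictor") after two registrations should return two rows, numbered 1 and 2, each with the run that produced it.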


Transitioning Stages

Once a model is registered, use the MlflowClient to promote it through stages:

# ci/register_model.py — run in CI after training passes quality gates
import os
from typing import Optional

import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()
MODEL_NAME = "churn-predictor"

def get_latest_version(name: str, stage: Optional[str] = None) -> Optional[str]:
    """Get the version number of the latest model in a given stage."""
    versions = client.get_latest_versions(name, stages=[stage] if stage else None)
    if not versions:
        return None
    # Without a stage filter, one version per stage is returned; pick the newest
    return max(versions, key=lambda v: int(v.version)).version


def promote_to_staging(run_id: str) -> str:
    """Register and promote the model from a run to Staging."""
    # Register
    model_uri = f"runs:/{run_id}/model"
    result = mlflow.register_model(model_uri, MODEL_NAME)
    version = result.version

    # Transition to Staging
    client.transition_model_version_stage(
        name=MODEL_NAME,
        version=version,
        stage="Staging",
        archive_existing_versions=False,  # Don't archive previous staging versions yet
    )
    print(f"Version {version} promoted to Staging")
    return version


def compare_with_production(staging_version: str) -> bool:
    """Return True if staging version beats production."""
    # Get production version metrics
    prod_version = get_latest_version(MODEL_NAME, stage="Production")
    if prod_version is None:
        print("No production model exists yet. Staging will become production.")
        return True

    staging_run = client.get_model_version(MODEL_NAME, staging_version)
    prod_run = client.get_model_version(MODEL_NAME, prod_version)

    staging_metrics = client.get_run(staging_run.run_id).data.metrics
    prod_metrics = client.get_run(prod_run.run_id).data.metrics

    staging_auc = staging_metrics.get("roc_auc", 0)
    prod_auc = prod_metrics.get("roc_auc", 0)

    print(f"Staging AUC: {staging_auc:.4f}  |  Production AUC: {prod_auc:.4f}")
    return staging_auc > prod_auc


def promote_to_production(version: str):
    """Promote a version to Production and archive the previous production model."""
    client.transition_model_version_stage(
        name=MODEL_NAME,
        version=version,
        stage="Production",
        archive_existing_versions=True,  # Archive old production model
    )
    print(f"Version {version} is now Production")
    # Also set an alias for programmatic access
    client.set_registered_model_alias(MODEL_NAME, "champion", version)


# Main CI flow
if __name__ == "__main__":
    import sys

    run_id = os.getenv("MLFLOW_RUN_ID")
    if not run_id:
        print("No MLFLOW_RUN_ID found. Set it from your training run.")
        sys.exit(1)

    # Step 1: Promote to staging
    staging_version = promote_to_staging(run_id)

    # Step 2: Compare with production
    is_better = compare_with_production(staging_version)

    if is_better:
        # Step 3: Promote to production if better
        promote_to_production(staging_version)
        print("Deployment complete: new model is now in production")
    else:
        print("New model did not beat production. Keeping current production model.")
        # Archive the staging version that didn't make it
        client.transition_model_version_stage(
            name=MODEL_NAME,
            version=staging_version,
            stage="Archived",
        )
        sys.exit(1)  # Fail the CI step
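
The comparison above promotes on any improvement, however small, and treats a missing metric as 0. A stricter gate is sketched below: it requires a minimum margin and refuses to compare when either side lacks the metric. The function name beats_baseline and the 0.005 default margin are assumptions for illustration, not values from this lesson.

```python
def beats_baseline(candidate_metrics: dict, baseline_metrics: dict,
                   metric: str = "roc_auc", min_delta: float = 0.005) -> bool:
    """Return True only if the candidate improves on the baseline by at least min_delta."""
    for label, metrics in (("candidate", candidate_metrics), ("baseline", baseline_metrics)):
        if metric not in metrics:
            # A missing metric should stop the pipeline, not silently count as 0
            raise KeyError(f"{label} run has no '{metric}' metric logged")
    return candidate_metrics[metric] - baseline_metrics[metric] >= min_delta
```

In compare_with_production you could pass staging_metrics and prod_metrics to this function instead of comparing the two AUC values directly. The right margin depends on how noisy your evaluation set is.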

Loading the Production Model in Your Serving API

Instead of loading a pickle file by path, load the production model by its registry stage:

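# src/api/app.py (model loading section)
import logging
import os
from contextlib import asynccontextmanager
from types import SimpleNamespace

import mlflow
import mlflow.sklearn
from fastapi import FastAPI

# Assumed here so the excerpt runs on its own; the full app.py may define these differently
logger = logging.getLogger(__name__)
state = SimpleNamespace(model=None, model_version="unknown")

MLFLOW_TRACKING_URI = os.getenv("MLFLOW_TRACKING_URI", "http://localhost:5000")
MODEL_NAME = "churn-predictor"

mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)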

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the current production model from the registry
    # This always loads whatever version is currently staged as Production
    model_uri = f"models:/{MODEL_NAME}/Production"

    # Or load by alias (more flexible):
    # model_uri = f"models:/{MODEL_NAME}@champion"

    logger.info(f"Loading model from registry: {model_uri}")
    state.model = mlflow.sklearn.load_model(model_uri)

    # Capture the actual version that was loaded
    client = mlflow.tracking.MlflowClient()
    prod_versions = client.get_latest_versions(MODEL_NAME, stages=["Production"])
    state.model_version = prod_versions[0].version if prod_versions else "unknown"

    logger.info(f"Loaded model version {state.model_version}")
    yield
    state.model = None

Now when you promote a new version to Production in the registry, the next deployment or restart of your API picks it up. Instances that are already running keep the version they loaded at startup. No file path changes, no reconfiguration.
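
Because a running instance keeps the model it loaded at startup, it can help to detect when the registry has moved on. The check below is a minimal sketch, assuming the @champion alias is kept up to date as in promote_to_production. The function name needs_reload is hypothetical, and client is assumed to be an mlflow.tracking.MlflowClient.

```python
def needs_reload(client, name: str, loaded_version: str, alias: str = "champion") -> bool:
    """Return True if the alias now points at a different version than the one loaded."""
    current = client.get_model_version_by_alias(name, alias)
    # Compare for inequality rather than "newer than": a rollback moves the
    # alias to a lower version number and should trigger a reload as well
    return str(current.version) != str(loaded_version)
```

A health check or background task could call needs_reload(client, MODEL_NAME, state.model_version) and log a warning, or trigger a restart, when it returns True.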


The Rollback Procedure

This is the scenario: you deployed Version 5 on Monday morning. By 10 AM, monitoring shows that prediction distributions have shifted. You need to roll back to Version 4.

Step 1: Check what’s in production

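from mlflow.tracking import MlflowClient

client = MlflowClient()
prod_versions = client.get_latest_versions("churn-predictor", stages=["Production"])
print(f"Current production: Version {prod_versions[0].version}")

# get_latest_versions returns only the newest version per stage, so search
# all versions and filter to see every archived one
for v in client.search_model_versions("name='churn-predictor'"):
    if v.current_stage != "Archived":
        continue
    auc = client.get_run(v.run_id).data.metrics.get("roc_auc")
    auc_text = f"{auc:.4f}" if auc is not None else "N/A"  # a string cannot take :.4f
    print(f"Archived Version {v.version}: AUC={auc_text}")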

Step 2: Promote the previous version back to production

def rollback_to_version(version: str):
    """Roll back production to a specific version."""
    # Promote the target version. archive_existing_versions=True archives the
    # current Production version in the same call, rather than leaving a gap
    # with no Production version between two separate calls.
    client.transition_model_version_stage(
        name="churn-predictor",
        version=version,
        stage="Production",
        archive_existing_versions=True,
    )

    # Update the champion alias
    client.set_registered_model_alias("churn-predictor", "champion", version)
    print(f"Rolled back to Version {version}")

# Roll back from Version 5 to Version 4
rollback_to_version("4")

Step 3: Redeploy the serving API (or restart it so it reloads from the registry)

# Kubernetes rolling restart
kubectl rollout restart deployment/churn-predictor-api

# Or for Docker Compose
docker-compose up -d --force-recreate api

Within minutes, the serving API is loading Version 4 from the registry. The entire rollback takes under 5 minutes and leaves a complete audit trail in MLflow.
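
One gap in this procedure is knowing which version to roll back to. The CI flow earlier also archives candidates that lost the comparison, so "the newest archived version" is not necessarily the previous production model. A possible approach, sketched below, is to record the replaced version as a tag at promotion time. The tag key rollback_target and both function names are hypothetical; client is assumed to be an mlflow.tracking.MlflowClient.

```python
ROLLBACK_TAG = "rollback_target"  # hypothetical tag key, not an MLflow built-in


def record_rollback_target(client, name: str, new_version: str, previous_version: str) -> None:
    """Call at promotion time: remember which version `new_version` replaced."""
    client.set_model_version_tag(name, new_version, ROLLBACK_TAG, str(previous_version))


def find_rollback_target(client, name: str, current_version: str):
    """Return the version recorded at promotion time, or None if none was recorded."""
    tags = client.get_model_version(name, current_version).tags
    return tags.get(ROLLBACK_TAG)
```

If promotion of Version 5 had called record_rollback_target(client, "churn-predictor", "5", "4"), the on-call engineer could look up find_rollback_target(client, "churn-predictor", "5") and pass the result to rollback_to_version.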


Model Lineage: The Audit Trail

Every version in the registry links to its source run. Provided your training script logs them, that run records:

  • The exact training code (Git commit hash)
  • The exact training data (DVC hash)
  • All hyperparameters
  • All metrics from training and evaluation
  • The environment (Python version, package versions)

This lineage is essential for:

Debugging: “Why did Version 5 fail?” → Open the registry → Find the training run → See it was trained on a dataset that had a preprocessing bug.

Compliance: In regulated industries, you must be able to explain every prediction your model made. The registry gives you the model version used at any point in time, and the training run gives you the data and code.

Reproducibility: Need to reproduce Version 3? Check out the Git commit, pull the DVC data version, and run training with the logged hyperparameters.
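
Walking from a model version back to its training run can be sketched as follows. The function name describe_lineage is hypothetical, and client is assumed to be an mlflow.tracking.MlflowClient. The git_commit version tag matches the registration example earlier in this lesson, and mlflow.source.git.commit is the run tag MLflow sets when it can detect a Git repository, so either may be absent in your setup.

```python
def describe_lineage(client, name: str, version: str) -> dict:
    """Collect the provenance of a registered model version from its source run."""
    mv = client.get_model_version(name, version)
    run = client.get_run(mv.run_id)
    # Prefer the commit tagged on the version; fall back to the run's system tag
    git_commit = mv.tags.get("git_commit") or run.data.tags.get("mlflow.source.git.commit")
    return {
        "model": name,
        "version": str(mv.version),
        "run_id": mv.run_id,
        "git_commit": git_commit,
        "params": dict(run.data.params),
        "metrics": dict(run.data.metrics),
    }
```

Calling describe_lineage(MlflowClient(), "churn-predictor", "5") returns a dictionary you can attach to an incident report or compare against the same call for Version 4.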


Summary

The model registry brings the same discipline to model versions that Git brings to code versions. Key takeaways:

  • Never deploy a model by copying a pickle file — register it and load from the registry
  • Use stages to manage the promotion workflow: None → Staging → Production
  • Use aliases (@champion, @challenger) for flexible A/B testing and gradual rollouts
  • Always compare new models against the current production baseline before promoting
  • Rollbacks take minutes when you use the registry — seconds to change a stage, minutes to redeploy

In the next lesson, you’ll learn how to deploy this whole stack on Kubernetes so it scales automatically under load.