📖 Lesson ⏱️ 45 minutes

What is MLOps and Why It Matters

From notebook to production — the MLOps lifecycle and why most models never ship

The Problem: A Tale of Two Notebooks

Picture this. You spend three weeks building a customer churn prediction model. The accuracy is 91%. Your manager is impressed. You zip the notebook and email it to a colleague on the data engineering team. They spend two days trying to run it. It fails. It turns out you were using pandas 1.3.5 and they have 2.0. Your notebook reads a CSV from ~/Downloads/final_data_use_this_v3.csv, a path that exists only on your laptop. The model weights are stored locally and were never committed to Git.

Six months later, the model is somehow in production — someone manually uploaded it to a server. Nobody knows how. Predictions start drifting. Customers are complaining. You open your laptop, look at the original notebook, and realize you can’t even remember which preprocessing steps you ran, in which order, or on what version of the data.

This is the story of ML in most organizations. And it’s why an estimated 85% of ML projects never reach production (Gartner, 2022).

MLOps exists to solve exactly this.


What is MLOps?

MLOps (Machine Learning Operations) is the set of practices, tools, and culture that takes a machine learning model from an experiment on your laptop to a reliable, monitored, continuously improving system in production.

It borrows heavily from DevOps (the software engineering discipline for reliable software delivery) and applies those principles to the unique challenges of ML: data dependencies, model decay, non-deterministic behavior, and the separation between the people who build models and the people who operate systems.

Think of MLOps as the difference between:

  • Notebook ML: A data scientist trains a model, saves a pickle file, emails it around, and hopes for the best.
  • Production ML: A versioned dataset feeds an automated training pipeline. Every run is tracked. The model passes automated quality gates before deployment. Once live, it is continuously monitored and retrained when performance degrades.

Notebook ML vs Production ML

| Dimension | Notebook ML | Production ML |
|---|---|---|
| Data | Local CSV file | Versioned, tracked datasets |
| Experiments | 40 notebooks named final_v2_REAL | Tracked runs with logged params/metrics |
| Deployment | Manual upload | CI/CD pipeline |
| Environment | “Works on my machine” | Containerized, reproducible |
| Monitoring | None (you find out from users) | Automated drift and performance alerts |
| Retraining | Ad hoc, when someone remembers | Triggered by data drift or schedule |

The core insight is that software engineering has solved these problems already — version control, automated testing, continuous deployment, observability. MLOps applies those solutions to the ML workflow.


The MLOps Lifecycle

Here is the full lifecycle of a production ML system. Every arrow is where things break without MLOps.

Raw Data
    |
    v
[Data Versioning]  <-- DVC, Delta Lake, LakeFS
    |
    v
[Feature Engineering]  <-- Feature Store (Feast, Tecton)
    |
    v
[Experiment Tracking]  <-- MLflow, Weights & Biases
    |
    v
[Model Training]
    |
    v
[Automated Testing / CI]  <-- GitHub Actions, Jenkins
    |
    v
[Model Registry]  <-- MLflow Registry, SageMaker Model Registry
    |
    v
[Model Serving]  <-- FastAPI, Triton, BentoML, SageMaker
    |
    v
[Monitoring & Drift Detection]  <-- Evidently, Grafana, Prometheus
    |
    v
[Retraining Trigger]  <-- back to top

Let’s walk through each stage.

1. Data Versioning

Your model is only as good as the data it was trained on. If you cannot reproduce the exact dataset used for training, you cannot debug issues, compare experiments, or audit model behavior.

Tools like DVC (Data Version Control) solve this by treating datasets like code: you commit a small pointer file containing a content hash to Git, while the actual data lives in remote storage such as S3 or GCS. Anyone with access to both can check out the exact dataset used in any historical run.
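The pointer idea can be sketched with the standard library alone. This is not how DVC is implemented. The `write_pointer` and `verify` helpers, the `.ptr.json` suffix, and the pointer fields are all hypothetical, chosen only to show how a small hash file committed to Git can pin a dataset that lives elsewhere.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path: Path) -> Path:
    """Write a small pointer file next to the dataset (hypothetical format)."""
    pointer = {
        "path": data_path.name,
        "sha256": file_digest(data_path),
        "size": data_path.stat().st_size,
    }
    pointer_path = data_path.with_name(data_path.name + ".ptr.json")
    pointer_path.write_text(json.dumps(pointer, indent=2))
    return pointer_path


def verify(pointer_path: Path) -> bool:
    """Check that the dataset on disk still matches the recorded hash."""
    pointer = json.loads(pointer_path.read_text())
    data_path = pointer_path.with_name(pointer["path"])
    return file_digest(data_path) == pointer["sha256"]
```

Only the pointer file would go into Git. If anyone edits the dataset without updating the pointer, `verify` returns False, which is the signal that the data no longer matches what the commit claims.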

2. Experiment Tracking

Training a model involves dozens of decisions: learning rate, regularization, feature selection, architecture. Without tracking, you will lose track of which combinations you tried. You’ll end up re-running experiments you already ran, or deploying a model without knowing which hyperparameters produced it.

MLflow records each run’s parameters, metrics, artifacts, and the code version that produced them, either through explicit logging calls or through its autologging integrations. You get a UI to compare runs side by side.
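To make concrete what a tracker records, here is a minimal standard-library sketch. It does not use MLflow. The `RunTracker` class, the log file layout, and the record fields are hypothetical stand-ins for the parameters, metrics, and code version described above.

```python
import json
import time
import uuid
from pathlib import Path


class RunTracker:
    """Append one JSON record per training run to a log file (hypothetical)."""

    def __init__(self, log_path: Path):
        self.log_path = log_path

    def log_run(self, params: dict, metrics: dict, code_version: str) -> str:
        """Record a finished run and return its generated run id."""
        run_id = uuid.uuid4().hex
        record = {
            "run_id": run_id,
            "timestamp": time.time(),
            "code_version": code_version,  # for example, a Git commit hash
            "params": params,
            "metrics": metrics,
        }
        with self.log_path.open("a") as handle:
            handle.write(json.dumps(record) + "\n")
        return run_id

    def runs(self) -> list[dict]:
        """Return all recorded runs, oldest first."""
        if not self.log_path.exists():
            return []
        with self.log_path.open() as handle:
            return [json.loads(line) for line in handle if line.strip()]

    def best_run(self, metric: str) -> dict:
        """Return the run with the highest value of the given metric.

        Raises ValueError if no runs have been logged yet.
        """
        return max(self.runs(), key=lambda run: run["metrics"][metric])
```

Even this toy version answers the question the notebook workflow cannot: which hyperparameters and which commit produced the best model. A real tracker adds artifact storage and a comparison UI on top of the same idea.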

3. CI/CD for ML

In software engineering, every code change triggers an automated pipeline: build, test, deploy. ML needs the same thing. When a data scientist pushes new training code, the CI pipeline should:

  1. Validate the input data schema
  2. Run unit tests on preprocessing logic
  3. Train the model
  4. Evaluate it against a quality threshold
  5. Register it if it passes

If the model doesn’t hit the threshold (say, 85% accuracy), the pipeline fails and the PR cannot be merged. This catches regressions before they reach users.
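The schema check and the quality gate from the steps above can be sketched in plain Python. The column names in `REQUIRED_COLUMNS` and the function names are hypothetical, and training itself is left out. The sketch only shows how a pipeline turns an evaluation result into a pass or fail.

```python
# Hypothetical schema for a churn dataset: column name -> expected Python type.
REQUIRED_COLUMNS = {"tenure_months": float, "monthly_charges": float, "churned": int}
ACCURACY_THRESHOLD = 0.85  # the example threshold used in this lesson


def validate_schema(rows: list[dict]) -> None:
    """Raise ValueError if any row is missing a column or has the wrong type."""
    for index, row in enumerate(rows):
        for column, expected_type in REQUIRED_COLUMNS.items():
            if column not in row:
                raise ValueError(f"row {index}: missing column {column!r}")
            if not isinstance(row[column], expected_type):
                raise ValueError(
                    f"row {index}: {column!r} is not {expected_type.__name__}"
                )


def accuracy(predictions: list[int], labels: list[int]) -> float:
    """Fraction of predictions that match the labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def quality_gate(predictions, labels, threshold: float = ACCURACY_THRESHOLD) -> bool:
    """True when the model meets or exceeds the accuracy threshold."""
    return accuracy(predictions, labels) >= threshold


def main(predictions, labels) -> int:
    """Return a process exit code: 0 passes the CI job, 1 fails it."""
    if quality_gate(predictions, labels):
        print("gate passed: model may be registered")
        return 0
    print("gate failed: blocking the merge")
    return 1
```

In a CI job, a script like this would typically end with `sys.exit(main(predictions, labels))`, since CI systems treat a nonzero exit code as a failed step.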

4. Model Serving

Once a model passes CI, it needs to be exposed as an API. The serving layer handles incoming requests, applies the same preprocessing that was used during training, runs inference, and returns predictions.

FastAPI is a popular choice for simple cases. For high-throughput production use, teams use NVIDIA Triton, TorchServe, or managed services like AWS SageMaker.
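Whatever framework you choose, the serving path has the same shape. The sketch below is framework-free and entirely hypothetical: `preprocess`, `ThresholdModel`, `handle_request`, and the two feature names are invented for illustration. The point it makes is that training and serving should call one shared preprocessing function rather than two copies of the logic.

```python
import math


def preprocess(raw: dict) -> list[float]:
    """Turn one raw record into a feature vector (hypothetical features)."""
    return [
        math.log1p(float(raw["monthly_charges"])),
        float(raw["tenure_months"]) / 12.0,
    ]


class ThresholdModel:
    """Stand-in for a trained model: predicts churn when a score passes a cutoff."""

    def __init__(self, weights: list[float], cutoff: float):
        self.weights = weights
        self.cutoff = cutoff

    def predict(self, features: list[float]) -> int:
        score = sum(w * x for w, x in zip(self.weights, features))
        return int(score > self.cutoff)


def handle_request(model: ThresholdModel, payload: dict) -> dict:
    """Serving entry point: validate, preprocess, infer, respond."""
    missing = {"monthly_charges", "tenure_months"} - set(payload)
    if missing:
        return {"status": 400, "error": f"missing fields: {sorted(missing)}"}
    features = preprocess(payload)  # the same function a training job would call
    return {"status": 200, "churn": model.predict(features)}
```

In a web framework, `handle_request` would be the body of a route handler. Keeping it a plain function also makes it easy to unit test without starting a server.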

5. Monitoring and Drift Detection

This is where most teams cut corners — and where silent failures happen.

Data drift: The distribution of inputs changes over time. Your fraud model was trained when most transactions were under $500. Now the average is $1,200. The model has seen very few values in this range.

Concept drift: The relationship between inputs and the target changes. A recommendation model trained pre-COVID recommends travel content. Post-COVID, users want home-improvement content. The input features look the same; the correct outputs are completely different.

You need automated checks that compare the distribution of live traffic against your training data, and alert you when divergence exceeds a threshold.
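One common way to quantify that divergence for a single numeric feature is the Population Stability Index (PSI). The sketch below is a minimal standard-library version, assuming equal-width bins over the training range. The `drift_alert` helper and its default threshold of 0.2 are assumptions for illustration, not values taken from this lesson or from any particular monitoring tool.

```python
import math


def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population Stability Index between a reference sample and a live sample.

    Bin edges are equal-width over the reference range. Live values outside
    that range are counted in the first or last bin.
    """
    low, high = min(expected), max(expected)
    width = (high - low) / bins or 1.0  # guard against a constant reference

    def bin_shares(values: list[float]) -> list[float]:
        counts = [0] * bins
        for value in values:
            index = int((value - low) / width)
            counts[min(max(index, 0), bins - 1)] += 1
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(count / len(values), 1e-6) for count in counts]

    expected_shares = bin_shares(expected)
    actual_shares = bin_shares(actual)
    return sum(
        (a - e) * math.log(a / e)
        for e, a in zip(expected_shares, actual_shares)
    )


def drift_alert(expected, actual, threshold: float = 0.2) -> bool:
    """True when PSI exceeds the alert threshold (hypothetical default)."""
    return psi(expected, actual) > threshold
```

Identical samples give a PSI of zero, and the score grows as live traffic moves away from the training distribution. Note that a check like this catches data drift only. Concept drift leaves the input distribution unchanged, so detecting it requires comparing predictions against fresh labels.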

6. Automated Retraining

The final piece: when monitoring detects drift or performance degradation, the pipeline automatically kicks off a new training run on fresh data, evaluates the new model, and (if it passes quality gates) promotes it to production. The cycle restarts.
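The decision logic of that loop can be sketched as follows. Every name and default here is hypothetical: `MonitoringSnapshot`, the drift limit, the accuracy floor, and the 30-day schedule are placeholders. Requiring the candidate to be at least as good as the current production model is one possible quality gate, added here as an assumption.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class MonitoringSnapshot:
    drift_score: float        # a drift statistic computed on live inputs
    live_accuracy: float      # accuracy on recently labeled traffic
    days_since_training: int  # age of the production model


def should_retrain(
    snapshot: MonitoringSnapshot,
    drift_limit: float = 0.2,
    accuracy_floor: float = 0.85,
    max_age_days: int = 30,
) -> bool:
    """Trigger on drift, on degraded performance, or on a schedule."""
    return (
        snapshot.drift_score > drift_limit
        or snapshot.live_accuracy < accuracy_floor
        or snapshot.days_since_training >= max_age_days
    )


def promote(
    candidate_accuracy: float,
    production_accuracy: float,
    accuracy_floor: float = 0.85,
) -> bool:
    """Promote only if the candidate clears the floor and is no worse than production."""
    return (
        candidate_accuracy >= accuracy_floor
        and candidate_accuracy >= production_accuracy
    )


def retraining_cycle(
    snapshot: MonitoringSnapshot,
    train_fn: Callable[[], Any],
    evaluate_fn: Callable[[Any], float],
    production_accuracy: float,
) -> str:
    """Run one pass of the loop and report what happened."""
    if not should_retrain(snapshot):
        return "no_action"
    candidate = train_fn()                      # train on fresh data
    candidate_accuracy = evaluate_fn(candidate)
    if promote(candidate_accuracy, production_accuracy):
        return "promoted"
    return "rejected"
```

Passing `train_fn` and `evaluate_fn` in as arguments keeps the decision logic separate from the training code, so the trigger can be tested without training a real model.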


Why This Matters: The Real Cost of Ignoring MLOps

Consider what happens when you skip MLOps:

  • Reproducibility failures: “We can’t replicate the model we deployed six months ago.”
  • Shadow models: Nobody knows which model is in production. There are three different pickle files on the server.
  • Silent degradation: Your model’s accuracy has been dropping for two months. You find out when a major client churns.
  • Deployment bottlenecks: Every model deployment requires a two-week manual process involving three teams.
  • Regulatory risk: In finance and healthcare, you must be able to explain and reproduce any prediction your model made. Without data and experiment versioning, this is impossible.

The 85% statistic isn’t just about technical difficulty. It’s about the operational infrastructure that most teams never build. MLOps is that infrastructure.


Who Does What in an MLOps Team?

| Role | Responsibilities |
|---|---|
| Data Scientist | Experiments, feature engineering, model architecture |
| ML Engineer | Training pipelines, feature stores, experiment tracking |
| Platform Engineer | Kubernetes, CI/CD, infrastructure, monitoring |
| Data Engineer | Data pipelines, data quality, storage |

In smaller teams, one or two people wear all these hats. That’s fine — but you still need the practices, even if you’re the only person on the team.


Your MLOps Maturity Level

Teams typically move through these levels:

Level 0 — Manual: Notebooks, manual deployment, no tracking. Most data science projects start here.

Level 1 — ML Pipeline Automation: Training is automated and triggered. Experiments are tracked. Data is versioned.

Level 2 — CI/CD Pipeline Automation: Every code change triggers a full pipeline. Models are automatically evaluated and deployed if they pass gates. Monitoring is in place.

Getting from Level 0 to Level 1 is a weekend of work. Getting from Level 1 to Level 2 takes a few months. Both are worthwhile.


What You Will Build in This Course

Over the next ten lessons, you will build a complete MLOps pipeline from scratch. By the end, you will have:

  • A structured, reproducible ML project with DVC
  • Experiment tracking with MLflow
  • A GitHub Actions CI/CD pipeline that trains and evaluates automatically
  • A Dockerized model serving API with FastAPI
  • Drift monitoring with Evidently
  • A model registry with staging, production, and rollback workflows

Each lesson builds on the previous one. By the capstone, these pieces connect into a single end-to-end system.

The next lesson covers the foundation everything else depends on: how to structure your ML project and version your data.