Building a CI/CD Pipeline for ML

The Bug That Reached Production

Your colleague Marcus was in a hurry. He added a new feature column to the preprocessing script — days_since_last_purchase — but forgot to handle the case where a customer has never purchased. The column fills with NaN. The training script drops NaN rows silently. The test set still has NaN values. The model trains fine. Accuracy on the biased test set looks good.

Three weeks later, the model is in production. Predictions for new customers — who have never purchased — are all defaulting to “no churn.” You’re missing 40% of your at-risk customers.

The code change was reviewed. The tests passed. But nobody ran the full training pipeline to check that the model’s performance hadn’t degraded. There was no gate.

This is what CI/CD for ML prevents.

What CI/CD Means for ML

In software engineering, CI/CD (Continuous Integration / Continuous Deployment) means:

Continuous Integration: Every code push triggers automated tests. If tests fail, the code doesn’t merge.
Continuous Deployment: Passing code is automatically deployed to production.

ML adds a third requirement: every model change must pass model quality gates — minimum accuracy, F1 score, or AUC thresholds — before the model reaches production.

A complete ML CI/CD pipeline does this on every push to main:

1. Install dependencies
2. Pull the versioned dataset with DVC
3. Run data validation (schema checks, null value checks)
4. Run unit tests on preprocessing code
5. Train the model
6. Evaluate against minimum quality thresholds
7. Register the model if it passes
8. Alert or block the merge if it fails

If Marcus had this pipeline, step 3 would have caught the NaN issue immediately.
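
To make the failure mode concrete, here is a minimal stdlib-only sketch. The toy rows and the helper names (is_missing, drop_incomplete, null_counts) are illustrative assumptions, not code from Marcus’s script. It shows how silently dropping NaN rows removes every never-purchased customer, and how a simple null count would surface the problem before training.

```python
import math

# Toy rows: days_since_last_purchase is NaN for customers who never purchased.
rows = [
    {"customer_id": 1, "days_since_last_purchase": 12.0, "churn": 0},
    {"customer_id": 2, "days_since_last_purchase": float("nan"), "churn": 1},
    {"customer_id": 3, "days_since_last_purchase": 40.0, "churn": 0},
    {"customer_id": 4, "days_since_last_purchase": float("nan"), "churn": 1},
]


def is_missing(value) -> bool:
    """True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_incomplete(rows):
    """What a silent NaN drop does: keep only rows with no missing values."""
    return [r for r in rows if not any(is_missing(v) for v in r.values())]


def null_counts(rows):
    """Count missing values per column, which is what a validation gate reports."""
    counts = {}
    for r in rows:
        for col, value in r.items():
            counts[col] = counts.get(col, 0) + int(is_missing(value))
    return counts


kept = drop_incomplete(rows)
print(len(rows), "->", len(kept), "rows after dropping NaN")
print("churners left in training data:", sum(r["churn"] for r in kept))
print("null counts:", null_counts(rows))
```

In this toy data every churner is a never-purchased customer, so the drop leaves the model with no positive examples from that group at all. The null count flags the column in the same pass.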

When to Trigger Retraining

Before building the pipeline, understand the three triggers for ML retraining:

1. Code change: A data scientist pushes new training code or features. This should always trigger a full pipeline run to verify the change doesn’t degrade the model.

2. Data drift: The distribution of incoming data has shifted enough that the current model’s performance has degraded. The drift monitoring system (lesson 7) triggers a retraining job.

3. Scheduled retraining: Some models need fresh data regularly regardless of drift — a daily news recommender, a fraud model, a demand forecaster. A scheduled pipeline (e.g., every Sunday at 2am) retrains on the latest data.

This lesson focuses on trigger 1 — code-change-triggered CI. Drift-triggered retraining follows the same pattern but is invoked by your monitoring system.
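
The three triggers can be read as one decision function. The sketch below is only an illustration: PipelineState, retrain_reason, the drift score scale, the 0.2 threshold, and the seven-day maximum age are all assumptions made for the example, not part of the course pipeline.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class PipelineState:
    code_changed: bool         # a push touched training code or features
    drift_score: float         # reported by the monitoring system (hypothetical scale)
    last_trained_at: datetime  # when the current model was trained


def retrain_reason(
    state: PipelineState,
    now: datetime,
    drift_threshold: float = 0.2,            # assumed value
    max_age: timedelta = timedelta(days=7),  # assumed schedule
) -> Optional[str]:
    """Return the first trigger that fires, or None if no retraining is needed."""
    if state.code_changed:
        return "code_change"
    if state.drift_score > drift_threshold:
        return "data_drift"
    if now - state.last_trained_at >= max_age:
        return "scheduled"
    return None


now = datetime(2024, 6, 9, 2, 0, tzinfo=timezone.utc)
fresh = PipelineState(False, 0.05, now - timedelta(days=1))
drifted = PipelineState(False, 0.35, now - timedelta(days=1))
stale = PipelineState(False, 0.05, now - timedelta(days=8))

print(retrain_reason(fresh, now))    # None
print(retrain_reason(drifted, now))  # data_drift
print(retrain_reason(stale, now))    # scheduled
```

In practice the three triggers usually live in different systems (a push event, a monitoring alert, a cron schedule), but they all end up invoking the same training pipeline.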

GitHub Actions: The Basics

GitHub Actions is a CI/CD platform built into GitHub. You define workflows as YAML files in .github/workflows/. GitHub runs them on cloud machines (called runners) when specified events occur.

A workflow has:

Triggers (on:): push, pull_request, schedule, or manual dispatch
Jobs: groups of steps that run on a runner
Steps: individual commands or pre-built actions

Here is a minimal example:

# .github/workflows/hello.yml
name: Hello World

on:
  push:
    branches: [main]

jobs:
  say-hello:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: echo "Hello from CI"

Every push to main runs this. Not useful yet — but the structure is the template for everything that follows.

The Full ML Training Pipeline

Here is a complete train.yml workflow. Read through it once, then we’ll walk through each section.

# .github/workflows/train.yml
name: ML Training Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  workflow_dispatch:  # Allows manual trigger from GitHub UI

env:
  MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  MINIMUM_ACCURACY: "0.85"
  MINIMUM_AUC: "0.88"

jobs:
  train-and-evaluate:
    runs-on: ubuntu-latest
    timeout-minutes: 60

    steps:
      # Step 1: Check out the code
      - name: Checkout repository
        uses: actions/checkout@v4

      # Step 2: Set up Python
      - name: Set up Python 3.11
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: "pip"

      # Step 3: Install dependencies
      - name: Install dependencies
        run: |
          pip install --upgrade pip
          pip install -r requirements.txt

      # Step 4: Pull versioned data with DVC
      - name: Pull data with DVC
        run: |
          # --local writes credentials to .dvc/config.local, which Git ignores
          dvc remote modify --local myremote access_key_id "$AWS_ACCESS_KEY_ID"
          dvc remote modify --local myremote secret_access_key "$AWS_SECRET_ACCESS_KEY"
          dvc pull

      # Step 5: Validate data schema and quality
      - name: Run data validation
        run: python src/data/validate.py

      # Step 6: Run unit tests on preprocessing and model code
      - name: Run unit tests
        run: pytest tests/ -v --tb=short

      # Step 7: Train the model
      - name: Train model
        run: python src/models/train.py

      # Step 8: Evaluate and check quality gates
      - name: Evaluate model and check quality gates
        run: |
          python src/models/evaluate.py
          python ci/check_quality_gates.py \
            --metrics-file metrics/eval_metrics.json \
            --min-accuracy $MINIMUM_ACCURACY \
            --min-auc $MINIMUM_AUC

      # Step 9: Upload model artifact (only if on main branch)
      - name: Upload model artifact
        if: github.ref == 'refs/heads/main' && github.event_name == 'push'
        uses: actions/upload-artifact@v4
        with:
          name: trained-model
          path: models/
          retention-days: 30

      # Step 10: Register model in MLflow (only on main, only if gates passed)
      - name: Register model in MLflow
        if: github.ref == 'refs/heads/main' && github.event_name == 'push'
        run: python ci/register_model.py

      # Step 11: Notify on failure
      - name: Notify on failure
        if: failure()
        uses: slackapi/slack-github-action@v1.26.0
        with:
          payload: |
            {"text": "ML pipeline failed on ${{ github.ref }}. Check: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"}
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}

Walking Through Each Step

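Steps 1-3: Setup. Standard boilerplate. The cache: "pip" option restores pip’s download cache between runs, keyed on your requirements file, so unchanged dependencies are not downloaded again. Packages are still installed on every run; the saving is in download time, typically a couple of minutes for a heavy ML dependency set.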

Step 4: DVC pull. Fetches the exact version of the training data that the current code expects. If someone pushed a data update and a code update together, this ensures they’re in sync.

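Step 5: Data validation. This is the step that would have caught Marcus’s bug, provided the null check covers the new column. The script below checks a fixed list of critical columns, so a new feature such as days_since_last_purchase has to be added to that list when it is introduced. More on this below.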

Step 6: Unit tests. Fast tests on individual functions. If preprocess.py returns the wrong number of columns, this catches it instantly — before spending five minutes training.

Step 7: Train. The actual training run. This logs to MLflow and saves artifacts locally.

Step 8: Quality gates. The evaluator writes metrics to metrics/eval_metrics.json. The gate checker reads them and exits with code 1 if they don’t meet the thresholds. A non-zero exit code fails the step and the job. On a pull request, that failed check blocks the merge as long as branch protection on main requires the check to pass.

Steps 9-10: Only on main. On pull requests, we train and evaluate but don’t register. This gives reviewers the evaluation results without modifying production state.
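
As an illustration of the kind of fast unit test step 6 runs, here is a minimal sketch. The function days_since_last_purchase, its signature, and the NEVER_PURCHASED sentinel are assumptions made for the example; the course’s preprocess.py may handle missing purchases differently (an explicit indicator column is another common choice).

```python
from datetime import date
from typing import Optional

NEVER_PURCHASED = -1  # assumed sentinel for customers with no purchase history


def days_since_last_purchase(last_purchase: Optional[date], today: date) -> int:
    """Days since the last purchase, or NEVER_PURCHASED when there is none."""
    if last_purchase is None:
        return NEVER_PURCHASED
    return (today - last_purchase).days


# pytest collects functions named test_*; plain asserts are all it needs.
def test_recent_purchase():
    assert days_since_last_purchase(date(2024, 1, 1), date(2024, 1, 31)) == 30


def test_never_purchased_is_not_nan():
    result = days_since_last_purchase(None, date(2024, 1, 31))
    assert result == NEVER_PURCHASED
```

Saved under tests/, a file like this would be picked up by the pytest tests/ command in step 6. The second test is the one that pins down the never-purchased case explicitly, so the behaviour cannot change without someone noticing.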

The Data Validation Script

This is your first line of defense. It runs in seconds and catches the most common data issues.

# src/data/validate.py
import sys
import pandas as pd

def validate_dataset(path: str) -> list[str]:
    """Return a list of validation errors. Empty list = pass."""
    errors = []
    df = pd.read_csv(path)

    # Schema checks
    required_columns = [
        "customer_id", "tenure_months", "monthly_charges",
        "total_charges", "num_products", "has_support_calls", "churn"
    ]
    missing_cols = set(required_columns) - set(df.columns)
    if missing_cols:
        errors.append(f"Missing required columns: {missing_cols}")

    # Null checks on critical columns
    critical_cols = ["tenure_months", "monthly_charges", "churn"]
    for col in critical_cols:
        if col in df.columns:
            null_count = df[col].isnull().sum()
            if null_count > 0:
                errors.append(f"Column '{col}' has {null_count} null values")

    # Range checks
    if "tenure_months" in df.columns:
        if (df["tenure_months"] < 0).any():
            errors.append("tenure_months has negative values")
        if (df["tenure_months"] > 120).any():
            errors.append("tenure_months has values > 120 (suspicious)")

    if "churn" in df.columns:
        invalid_churn = ~df["churn"].isin([0, 1])
        if invalid_churn.any():
            errors.append("churn column has invalid values (not 0 or 1)")

    # Size check
    if len(df) < 1000:
        errors.append(f"Dataset too small: {len(df)} rows (minimum: 1000)")

    # Class balance check
    if "churn" in df.columns:
        churn_rate = df["churn"].mean()
        if churn_rate < 0.05 or churn_rate > 0.60:
            errors.append(
                f"Unusual class balance: {churn_rate:.1%} churn rate "
                f"(expected 5-60%)"
            )

    return errors

if __name__ == "__main__":
    errors = validate_dataset("data/processed/train_features.csv")
    if errors:
        print("DATA VALIDATION FAILED:")
        for error in errors:
            print(f"  - {error}")
        sys.exit(1)
    else:
        print("Data validation passed.")
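
The null checks above cover a fixed list of columns, so a newly added feature is only checked once someone adds it to that list. One possible complement, sketched here with the standard library only, is a check that computes the null rate of every column it finds. The function names null_rates and columns_over_limit and the max_null_rate parameter are hypothetical; the sketch assumes missing values appear in the CSV as empty cells or the text nan.

```python
import csv
import io


def null_rates(csv_text: str) -> dict[str, float]:
    """Fraction of empty or 'nan' cells per column, for every column present."""
    reader = csv.DictReader(io.StringIO(csv_text))
    totals: dict[str, int] = {}
    missing: dict[str, int] = {}
    for row in reader:
        for col, value in row.items():
            if col is None:
                continue  # extra cells beyond the header row
            totals[col] = totals.get(col, 0) + 1
            if value is None or value.strip().lower() in ("", "nan"):
                missing[col] = missing.get(col, 0) + 1
    return {col: missing.get(col, 0) / totals[col] for col in totals}


def columns_over_limit(rates: dict[str, float], max_null_rate: float = 0.0) -> list[str]:
    """Columns whose null rate exceeds the allowed maximum."""
    return sorted(col for col, rate in rates.items() if rate > max_null_rate)


sample = """customer_id,tenure_months,days_since_last_purchase,churn
1,12,30,0
2,3,,1
3,24,7,0
4,1,nan,1
"""

rates = null_rates(sample)
print(rates["days_since_last_purchase"])  # 0.5
print(columns_over_limit(rates))          # ['days_since_last_purchase']
```

Because the check iterates over whatever columns exist, a column like days_since_last_purchase is covered from the first commit that introduces it.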

The Quality Gate Script

# ci/check_quality_gates.py
import argparse
import json
import sys

def check_gates(metrics_file: str, min_accuracy: float, min_auc: float):
    with open(metrics_file) as f:
        metrics = json.load(f)

    failures = []

    accuracy = metrics.get("accuracy", 0)
    if accuracy < min_accuracy:
        failures.append(
            f"Accuracy {accuracy:.4f} below threshold {min_accuracy:.4f}"
        )

    auc = metrics.get("roc_auc", 0)
    if auc < min_auc:
        failures.append(
            f"ROC AUC {auc:.4f} below threshold {min_auc:.4f}"
        )

    if failures:
        print("QUALITY GATE FAILED:")
        for failure in failures:
            print(f"  - {failure}")
        sys.exit(1)
    else:
        print(f"Quality gates passed: accuracy={accuracy:.4f}, auc={auc:.4f}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--metrics-file", required=True)
    parser.add_argument("--min-accuracy", type=float, required=True)
    parser.add_argument("--min-auc", type=float, required=True)
    args = parser.parse_args()
    check_gates(args.metrics_file, args.min_accuracy, args.min_auc)

If the model doesn’t hit your thresholds, the pipeline fails with a clear error message. With the check marked as required in the branch protection rules for main, the PR cannot be merged until it passes.
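
The gate script expects metrics/eval_metrics.json to already exist, but the lesson does not show how src/models/evaluate.py produces it. The following is a minimal stdlib-only sketch of what such a file could contain, using the key names the gate script and the PR comment read (accuracy, f1_score, roc_auc). The helper write_metrics, the 0.5 threshold, and the toy labels are assumptions; a real evaluator would more likely call a library such as scikit-learn on the held-out test set.

```python
import json
import tempfile
from pathlib import Path


def accuracy(y_true, y_pred):
    """Share of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def roc_auc(y_true, scores):
    """Chance that a random positive scores above a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def write_metrics(y_true, scores, path: Path, threshold: float = 0.5) -> dict:
    """Compute the three metrics and write them as JSON for the gate script."""
    y_pred = [int(s >= threshold) for s in scores]
    metrics = {
        "accuracy": accuracy(y_true, y_pred),
        "f1_score": f1_score(y_true, y_pred),
        "roc_auc": roc_auc(y_true, scores),
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(metrics, indent=2))
    return metrics


# Toy labels and model scores, for illustration only.
y_true = [0, 0, 1, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2]

with tempfile.TemporaryDirectory() as tmp:
    metrics_path = Path(tmp) / "metrics" / "eval_metrics.json"
    metrics = write_metrics(y_true, scores, metrics_path)
    print(metrics_path.read_text())
```

Keeping evaluation and gating in separate scripts, as the pipeline does, means the thresholds can change without touching the evaluation code.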

Scheduled Retraining

Add a second workflow for scheduled retraining:

# .github/workflows/scheduled-retrain.yml
name: Scheduled Model Retraining

on:
  schedule:
    # Every Sunday at 2:00 AM UTC
    - cron: '0 2 * * 0'
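  workflow_dispatch:

env:
  # Same tracking server as train.yml, so training and registration can reach MLflow
  MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}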

jobs:
  retrain:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt

      - name: Fetch latest data
        run: python src/data/fetch_latest.py  # Downloads fresh data from your warehouse

      - name: Run pipeline
        run: dvc repro

      - name: Check quality gates
        run: |
          python ci/check_quality_gates.py \
            --metrics-file metrics/eval_metrics.json \
            --min-accuracy 0.85 \
            --min-auc 0.88

      - name: Register new model version if passing
        run: python ci/register_model.py
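
The cron expression '0 2 * * 0' reads as minute 0, hour 2, any day of the month, any month, day of week 0 (Sunday). As a quick way to check that reading, the sketch below computes the next matching time in UTC. The function name next_sunday_2am_utc is made up for the example, and it handles only this one expression, not cron syntax in general.

```python
from datetime import datetime, timedelta, timezone


def next_sunday_2am_utc(now: datetime) -> datetime:
    """Next time matching cron '0 2 * * 0' strictly after `now`, in UTC."""
    now = now.astimezone(timezone.utc)
    candidate = now.replace(hour=2, minute=0, second=0, microsecond=0)
    # Python counts Monday=0 ... Sunday=6; cron counts Sunday=0.
    days_ahead = (6 - candidate.weekday()) % 7
    candidate += timedelta(days=days_ahead)
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


wednesday = datetime(2024, 6, 5, 10, 0, tzinfo=timezone.utc)
sunday_after_run = datetime(2024, 6, 9, 3, 0, tzinfo=timezone.utc)

print(next_sunday_2am_utc(wednesday))         # 2024-06-09 02:00:00+00:00
print(next_sunday_2am_utc(sunday_after_run))  # 2024-06-16 02:00:00+00:00
```

GitHub evaluates schedules in UTC and may start scheduled runs late when its runners are busy, so treat the computed time as the earliest start rather than an exact one.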

Pull Request Checks

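For pull requests, you want the pipeline to run but not register anything. Add the evaluation results as a PR comment so reviewers can see model performance. Place this step in train.yml after the evaluation step; the workflow’s GITHUB_TOKEN needs write access to pull requests (for example, permissions: pull-requests: write) for the comment to be posted: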

      - name: Comment PR with metrics
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const metrics = JSON.parse(
              fs.readFileSync('metrics/eval_metrics.json', 'utf8')
            );
            const body = `## Model Evaluation Results
            | Metric | Value |
            |--------|-------|
            | Accuracy | ${metrics.accuracy.toFixed(4)} |
            | F1 Score | ${metrics.f1_score.toFixed(4)} |
            | ROC AUC | ${metrics.roc_auc.toFixed(4)} |
            `;
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });

Now every PR shows its model metrics in a comment. Reviewers see at a glance whether the change improves or degrades performance.
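
A comment with only the PR’s own metrics still leaves reviewers to look up the previous numbers. One possible extension, sketched here in Python, renders the same table with a baseline column and the change. The baseline values are an assumption: they might come from the last successful run on main, which this lesson’s pipeline does not store anywhere yet. The names metrics_table and METRIC_LABELS are hypothetical.

```python
METRIC_LABELS = {
    "accuracy": "Accuracy",
    "f1_score": "F1 Score",
    "roc_auc": "ROC AUC",
}


def metrics_table(current: dict, baseline: dict) -> str:
    """Render a Markdown table comparing this PR's metrics with a baseline."""
    lines = [
        "## Model Evaluation Results",
        "",
        "| Metric | Baseline | This PR | Change |",
        "|--------|----------|---------|--------|",
    ]
    for key, label in METRIC_LABELS.items():
        cur, base = current[key], baseline[key]
        lines.append(f"| {label} | {base:.4f} | {cur:.4f} | {cur - base:+.4f} |")
    return "\n".join(lines)


# Illustrative numbers only, not results from a real model.
baseline = {"accuracy": 0.87, "f1_score": 0.80, "roc_auc": 0.91}
current = {"accuracy": 0.88, "f1_score": 0.80, "roc_auc": 0.90}

print(metrics_table(current, baseline))
```

The signed Change column makes a regression visible even when the metric is still above the quality gate threshold.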

Summary

A CI/CD pipeline for ML is what separates a professional ML system from a notebook that happens to work today. The key ideas:

Every code push trains the model — you always know if a change broke something
Data validation runs before training — catches data issues before wasting compute
Quality gates enforce minimum standards — bad models cannot reach production
Model registration is automated — no more manual uploads and guessing which version is live

The pipeline you built in this lesson is the control plane for everything else in the course. The next lesson adds the packaging layer: Dockerizing your model so it runs identically everywhere.
