📖 Lesson ⏱️ 120 minutes

Kubernetes for ML: Basics

Deploy and scale ML services on Kubernetes — just the essentials

Why Kubernetes for ML?

Your FastAPI model server is running in a Docker container on a single machine. It handles 500 requests per minute comfortably. Then Monday morning arrives, 9 AM, all your sales reps open their CRM simultaneously. Traffic spikes to 5,000 requests per minute. Your single container is overwhelmed. Latency climbs from 50ms to 8 seconds. Customers see errors.

By the time you manually spin up more servers, the spike is over. Two hours later, you’re paying for ten idle servers.

This is the scaling problem, and Kubernetes (K8s) solves it automatically.

Kubernetes is a container orchestration platform. You describe what you want (“run 3 replicas of my model server, and if traffic exceeds X, add more”), and Kubernetes makes it happen — scaling up when load increases, scaling down when it drops, restarting crashed containers, and routing traffic across healthy instances.

You don’t need to be a Kubernetes expert to deploy ML models on it. You need to understand five concepts: Pod, Deployment, Service, ConfigMap, and HorizontalPodAutoscaler. That’s this lesson.


The Five Concepts You Need

1. Pod

A Pod is the smallest deployable unit in Kubernetes. It wraps one (or sometimes more) containers and provides them with a shared network namespace and storage.

Think of a Pod as “one running instance of your model server.” It has an IP address, runs your Docker container, and has a defined resource allocation (CPU and memory).

You almost never create Pods directly. You create Deployments, which manage Pods for you.

2. Deployment

A Deployment says: “I want N copies of this Pod running at all times.” If a Pod crashes, the Deployment controller notices and creates a new one. If you update the image, the Deployment does a rolling update — replacing old Pods with new ones gradually, without downtime.

3. Service

Pods are ephemeral and get new IP addresses when they restart. A Service provides a stable IP and DNS name that always routes to the current healthy Pods, even as they come and go.

Types:

  • ClusterIP (default): Internal to the cluster — other services can call it, external traffic cannot
  • LoadBalancer: Exposes a cloud load balancer. External traffic enters here.
  • NodePort: Exposes on a port on each node. Good for local development.
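
Inside the cluster, callers normally reach a Service through its DNS name rather than its IP. The sketch below builds that name in the usual service.namespace.svc.cluster.local form, the same pattern the MLflow URI in this lesson's manifest uses. The helper name service_url is hypothetical, and the code assumes your cluster keeps the default cluster.local domain.

```python
def service_url(service: str, namespace: str, port: int = 80,
                path: str = "/", cluster_domain: str = "cluster.local") -> str:
    """Build the in-cluster URL for a Service (hypothetical helper).

    Assumes the cluster uses the default DNS domain "cluster.local".
    """
    host = f"{service}.{namespace}.svc.{cluster_domain}"
    # Leave the port out when it is the HTTP default to keep the URL short.
    netloc = host if port == 80 else f"{host}:{port}"
    return f"http://{netloc}{path}"


if __name__ == "__main__":
    # Another Pod in the cluster could call the churn API at this address.
    print(service_url("churn-predictor-svc", "ml-serving", path="/health"))
```

Because the name resolves to the Service rather than to any single Pod, callers keep working while Pods are replaced behind it.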

4. ConfigMap and Secret

ConfigMaps store non-sensitive configuration (environment variables, config files). Secrets store sensitive data (API keys, database passwords). Both are injected into Pods as environment variables or mounted as files.
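
From the application's point of view, values injected this way are ordinary environment variables. Here is a minimal sketch of how the model server might read them at startup. The variable names match the manifest in this lesson, while the Settings class and load_settings function are hypothetical names chosen for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED_SECRETS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


@dataclass(frozen=True)
class Settings:
    model_path: str
    log_level: str
    aws_access_key_id: str
    aws_secret_access_key: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read values injected by the ConfigMap and the Secret."""
    # Secrets have no sensible default, so fail fast if one is missing.
    missing = [key for key in REQUIRED_SECRETS if not env.get(key)]
    if missing:
        raise RuntimeError(
            "missing required environment variables: " + ", ".join(missing)
        )
    return Settings(
        # Non-sensitive values fall back to defaults for local runs.
        model_path=env.get("MODEL_PATH", "/app/models/churn_model.pkl"),
        log_level=env.get("LOG_LEVEL", "INFO"),
        aws_access_key_id=env["AWS_ACCESS_KEY_ID"],
        aws_secret_access_key=env["AWS_SECRET_ACCESS_KEY"],
    )
```

Failing at startup when a secret is absent makes a misconfigured Pod crash immediately instead of serving requests that fail later.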

5. HorizontalPodAutoscaler (HPA)

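An HPA watches a metric (CPU utilization, request rate, custom metrics) and automatically adjusts the number of replicas to keep that metric near a target. With a 70% CPU target, it adds Pods when average utilization runs above 70% and removes Pods when utilization falls well below it. There is no separate scale-down threshold. CPU utilization here is measured as a percentage of each Pod’s CPU request, not of its limit.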
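
The scaling rule itself is simple arithmetic. The sketch below follows the formula given in the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric), with a default tolerance of 10% around the target. It is a simplified model only: the real controller also handles missing metrics, Pods that are not yet ready, and a scale-down stabilization window. The function name is hypothetical.

```python
import math


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float, min_replicas: int,
                     max_replicas: int, tolerance: float = 0.1) -> int:
    """Simplified model of the HPA replica calculation."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # The result is always clamped to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    print(desired_replicas(3, 85, 70, min_replicas=2, max_replicas=20))  # 4
    print(desired_replicas(3, 72, 70, min_replicas=2, max_replicas=20))  # 3
    print(desired_replicas(7, 15, 70, min_replicas=2, max_replicas=20))  # 2
```

The controller repeats this calculation on every sync, so while load keeps rising the replica count climbs in several steps rather than in one jump.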


A Complete Deployment for Your ML API

Here is a full Kubernetes manifest for the churn prediction API: a Namespace plus the resources described above, all in one file.

# kubernetes/churn-predictor.yaml

# ─── Namespace ────────────────────────────────────────────────────────────────
apiVersion: v1
kind: Namespace
metadata:
  name: ml-serving

---
# ─── ConfigMap: non-sensitive configuration ───────────────────────────────────
apiVersion: v1
kind: ConfigMap
metadata:
  name: churn-predictor-config
  namespace: ml-serving
data:
  MODEL_PATH: "/app/models/churn_model.pkl"
  LOG_LEVEL: "INFO"
  MLFLOW_TRACKING_URI: "http://mlflow-server.ml-serving.svc.cluster.local:5000"

---
# ─── Secret: sensitive configuration ─────────────────────────────────────────
# In production, use an external secrets manager (AWS Secrets Manager, Vault)
# or create the Secret from the command line instead of committing it:
# kubectl create secret generic churn-predictor-secrets \
#   --namespace ml-serving \
#   --from-literal=AWS_ACCESS_KEY_ID=xxx \
#   --from-literal=AWS_SECRET_ACCESS_KEY=yyy
apiVersion: v1
kind: Secret
metadata:
  name: churn-predictor-secrets
  namespace: ml-serving
type: Opaque
# stringData takes plain-text values; Kubernetes base64-encodes them on write.
# To supply pre-encoded values, use `data:` instead: echo -n "your-key" | base64
stringData:
  AWS_ACCESS_KEY_ID: "CHANGEME"
  AWS_SECRET_ACCESS_KEY: "CHANGEME"

---
# ─── Deployment: manages the Pods ─────────────────────────────────────────────
apiVersion: apps/v1
kind: Deployment
metadata:
  name: churn-predictor
  namespace: ml-serving
  labels:
    app: churn-predictor
    version: "1.0.0"
spec:
  replicas: 3                    # Initial count; the HPA adjusts it once running
  selector:
    matchLabels:
      app: churn-predictor
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1               # During updates, allow 1 extra Pod
      maxUnavailable: 0         # Never take a Pod down before a new one is ready
  template:
    metadata:
      labels:
        app: churn-predictor
    spec:
      containers:
        - name: api
          image: ghcr.io/your-org/churn-predictor:1.0.0
          ports:
            - containerPort: 8000
          
          # Resources: requests are reserved for scheduling; limits cap usage
          resources:
            requests:
              cpu: "500m"        # 0.5 CPU cores
              memory: "512Mi"    # 512 MiB
            limits:
              cpu: "2000m"       # 2 CPU cores max
              memory: "2Gi"      # 2 GiB max
          
          # Inject non-sensitive config
          envFrom:
            - configMapRef:
                name: churn-predictor-config
          
          # Inject secrets
          env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: churn-predictor-secrets
                  key: AWS_ACCESS_KEY_ID
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: churn-predictor-secrets
                  key: AWS_SECRET_ACCESS_KEY
          
          # Liveness probe: restart the container if this fails
          # (detects hung processes that aren't serving requests)
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30     # Wait 30s before first check (model loading)
            periodSeconds: 15
            failureThreshold: 3
          
          # Readiness probe: only send traffic when this passes
          # (prevents routing to a Pod that hasn't loaded the model yet)
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 20
            periodSeconds: 10
            failureThreshold: 3
      
      # Graceful shutdown: finish in-flight requests before stopping
      terminationGracePeriodSeconds: 30

---
# ─── Service: stable endpoint for the Deployment ─────────────────────────────
apiVersion: v1
kind: Service
metadata:
  name: churn-predictor-svc
  namespace: ml-serving
spec:
  selector:
    app: churn-predictor          # Routes to all Pods with this label
  ports:
    - protocol: TCP
      port: 80                    # Service port (external-facing)
      targetPort: 8000            # Container port
  type: LoadBalancer              # Provision a cloud load balancer

---
# ─── HorizontalPodAutoscaler: auto-scaling ────────────────────────────────────
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: churn-predictor-hpa
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: churn-predictor
  minReplicas: 2                  # Never go below 2 (for availability)
  maxReplicas: 20                 # Never go above 20 (cost control)
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # Target: average CPU at 70% of each Pod's request

Deploying and Operating

# Deploy everything
kubectl apply -f kubernetes/churn-predictor.yaml

# Check that Pods are running
kubectl get pods -n ml-serving
# NAME                               READY   STATUS    RESTARTS   AGE
# churn-predictor-7d9f8b-abc12       1/1     Running   0          2m
# churn-predictor-7d9f8b-def34       1/1     Running   0          2m
# churn-predictor-7d9f8b-ghi56       1/1     Running   0          2m

# Check the service (find the external IP)
kubectl get service churn-predictor-svc -n ml-serving
# NAME                   TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)
# churn-predictor-svc    LoadBalancer   10.96.0.100     34.102.140.5    80:32000/TCP

# Test the deployed API
curl http://34.102.140.5/health

# Watch the HPA in action
kubectl get hpa churn-predictor-hpa -n ml-serving -w
# NAME                    REFERENCE                TARGETS   MINPODS   MAXPODS   REPLICAS
# churn-predictor-hpa     Deployment/churn-pred    45%/70%   2         20        3

# Check logs from a specific Pod
kubectl logs -n ml-serving churn-predictor-7d9f8b-abc12

# Stream logs from all Pods with the label
kubectl logs -n ml-serving -l app=churn-predictor -f

# Manually scale (useful for testing; the HPA will readjust the count on its next sync)
kubectl scale deployment churn-predictor -n ml-serving --replicas=5

The 9 AM Traffic Spike Scenario

Your model processes 500 RPM (requests per minute) normally. At 9 AM Monday, traffic jumps to 5,000 RPM as teams start their workday.

Here’s what happens automatically (timings are illustrative and depend on image size, model load time, and spare node capacity):

  1. 9:00 AM: Traffic spike begins. Average CPU across 3 Pods rises from 30% to 85%.
  2. 9:01 AM: HPA detects CPU > 70% threshold. Schedules scale-up event.
  3. 9:02 AM: New Pods are scheduled. Kubernetes pulls the image (near-instant when it is already cached on the node) and starts the containers. readinessProbe runs.
  4. 9:03 AM: New Pods pass readiness check. Service begins routing traffic to them. Now 7 Pods are serving.
  5. 9:04 AM: CPU drops to 55% across 7 Pods. Traffic is handled without degradation.
  6. 11:30 AM: Traffic normalizes to 500 RPM. CPU drops to 15%. HPA scales back down to 2 Pods.

The entire scale-up took about 3 minutes. No human intervention. No alerts. No outage.


Rolling Updates: Zero-Downtime Deployments

When you train a new model and push a new Docker image:

# Update the image in the Deployment
kubectl set image deployment/churn-predictor \
  api=ghcr.io/your-org/churn-predictor:1.1.0 \
  -n ml-serving

# Watch the rollout
kubectl rollout status deployment/churn-predictor -n ml-serving
# Waiting for deployment "churn-predictor" rollout to finish: 1 out of 3 new replicas have been updated...
# Waiting for deployment "churn-predictor" rollout to finish: 2 out of 3 new replicas have been updated...
# deployment "churn-predictor" successfully rolled out

# Roll back if something goes wrong
kubectl rollout undo deployment/churn-predictor -n ml-serving

The RollingUpdate strategy (configured in the Deployment) ensures that:

  • Old Pods keep serving traffic while new ones start
  • New Pods must pass readiness checks before old ones are removed
  • At most 1 extra Pod exists at any time (controlled by maxSurge: 1)
  • The number of available Pods never drops below the desired replica count, 3 in this manifest (controlled by maxUnavailable: 0)

Users never see downtime.
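
To see how maxSurge and maxUnavailable bound a rollout, you can work out the minimum number of available Pods and the maximum total Pods for any setting. This is a minimal sketch with hypothetical function names. It follows the documented rounding rules for percentages (maxSurge rounds up, maxUnavailable rounds down), and it models only the bounds, not the rollout itself.

```python
import math
from typing import Tuple, Union

Count = Union[int, str]  # an absolute number or a percentage such as "25%"


def _resolve(value: Count, replicas: int, round_up: bool) -> int:
    """Convert an absolute count or percentage string into a Pod count."""
    if isinstance(value, str) and value.endswith("%"):
        exact = replicas * int(value[:-1]) / 100
        return math.ceil(exact) if round_up else math.floor(exact)
    return int(value)


def rollout_bounds(replicas: int, max_surge: Count,
                   max_unavailable: Count) -> Tuple[int, int]:
    """Return (minimum available Pods, maximum total Pods) during a rollout."""
    surge = _resolve(max_surge, replicas, round_up=True)
    unavailable = _resolve(max_unavailable, replicas, round_up=False)
    if surge == 0 and unavailable == 0:
        # Kubernetes rejects this combination: the rollout could never progress.
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    return replicas - unavailable, replicas + surge


if __name__ == "__main__":
    print(rollout_bounds(3, 1, 0))           # (3, 4): this lesson's settings
    print(rollout_bounds(10, "25%", "25%"))  # (8, 13)
```

With this lesson's settings the cluster needs room for one extra Pod during an update. That is the price of never dropping below full capacity.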


Resource Requests and Limits: Why They Matter

resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "2000m"
    memory: "2Gi"

Requests tell Kubernetes how much resource to reserve for this Pod. Kubernetes uses requests for scheduling — it will only place a Pod on a node that has enough available resources.

Limits are the cap. If a container tries to use more than its CPU limit (2 cores here), it is throttled. If it tries to use more than its memory limit (2Gi here), it is killed (OOM kill) and restarted.

For ML models:

  • Set memory requests based on your model’s size + overhead (a 500MB model might need 1.5GB with Python runtime overhead)
  • Set CPU requests based on steady-state usage; limits higher for burst
  • Test under load to find real numbers — guessing leads to OOM crashes in production
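
The quantity strings are easy to misread: "500m" means 500 millicores, and "Mi"/"Gi" are binary units (1 Mi = 2^20 bytes), so they are slightly larger than MB/GB. The sketch below parses the common suffixes and shows how a CPU reading turns into the utilization percentage the HPA compares against its target. It handles only a subset of the Kubernetes quantity syntax, and all function names are hypothetical.

```python
# Subset of Kubernetes memory suffixes: binary (Ki, Mi, ...) and decimal (k, M, ...).
_MEMORY_FACTORS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_cpu(quantity: str) -> float:
    """Convert a CPU quantity such as "500m" or "2" into cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as "512Mi" into bytes."""
    for suffix, factor in _MEMORY_FACTORS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix means plain bytes


def cpu_utilization_percent(usage: str, request: str) -> float:
    """CPU usage as a percentage of the request, the figure the HPA uses."""
    return 100 * parse_cpu(usage) / parse_cpu(request)


if __name__ == "__main__":
    print(parse_cpu("500m"))      # 0.5
    print(parse_memory("512Mi"))  # 536870912
    print(parse_memory("2Gi"))    # 2147483648
    print(round(cpu_utilization_percent("425m", "500m"), 1))  # 85.0
```

Because utilization is relative to the request, a Pod bursting toward its 2000m limit on a 500m request can report well over 100%.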

Summary

The Kubernetes concepts you need for ML serving:

Concept             What it does
Pod                 One running container instance
Deployment          Manages multiple Pod replicas, handles rolling updates
Service             Stable DNS and load balancing across Pods
HPA                 Automatically scales replica count based on metrics
ConfigMap/Secret    Injects config and credentials into Pods

The liveness and readiness probes on your /health endpoint are the connective tissue — they tell Kubernetes when a Pod is ready for traffic and when it needs to be restarted. The FastAPI health endpoint you built in lesson 6 serves this purpose exactly.
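
The probe settings also determine how quickly Kubernetes reacts. As a rough estimate that ignores probe timeouts and kubelet scheduling jitter, a probe has to fail failureThreshold times in a row, one periodSeconds apart, before Kubernetes acts. The sketch below turns that into numbers for the values in this lesson's manifest. The Probe class and its method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    initial_delay_seconds: int
    period_seconds: int
    failure_threshold: int

    def earliest_failure_seconds(self) -> int:
        """Approximate time after start at which a container that never
        becomes healthy is declared failed."""
        return (self.initial_delay_seconds
                + self.period_seconds * (self.failure_threshold - 1))

    def detection_window_seconds(self) -> int:
        """Approximate worst-case delay between a running container
        hanging and the probe being declared failed."""
        return self.period_seconds * self.failure_threshold


if __name__ == "__main__":
    liveness = Probe(initial_delay_seconds=30, period_seconds=15, failure_threshold=3)
    readiness = Probe(initial_delay_seconds=20, period_seconds=10, failure_threshold=3)
    print(liveness.earliest_failure_seconds())   # 60
    print(liveness.detection_window_seconds())   # 45
    print(readiness.detection_window_seconds())  # 30
```

If your model takes longer to load than the liveness estimate allows, the container can be restarted before it ever becomes healthy, so measure the real load time before settling on these values.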

The next and final lesson ties everything together: a complete end-to-end MLOps pipeline for a churn prediction model, using every component from this course.