Kubernetes for ML: Basics
Deploy and scale ML services on Kubernetes — just the essentials
Why Kubernetes for ML?
Your FastAPI model server is running in a Docker container on a single machine. It handles 500 requests per minute comfortably. Then Monday morning arrives, 9 AM, all your sales reps open their CRM simultaneously. Traffic spikes to 5,000 requests per minute. Your single container is overwhelmed. Latency climbs from 50ms to 8 seconds. Customers see errors.
By the time you manually spin up more servers, the spike is over. Two hours later, you’re paying for ten idle servers.
This is the scaling problem, and Kubernetes (K8s) solves it automatically.
Kubernetes is a container orchestration platform. You describe what you want (“run 3 replicas of my model server, and if traffic exceeds X, add more”), and Kubernetes makes it happen — scaling up when load increases, scaling down when it drops, restarting crashed containers, and routing traffic across healthy instances.
You don’t need to be a Kubernetes expert to deploy ML models on it. You need to understand five concepts: Pod, Deployment, Service, ConfigMap (together with its sibling, Secret), and HorizontalPodAutoscaler. That’s this lesson.
The Five Concepts You Need
1. Pod
A Pod is the smallest deployable unit in Kubernetes. It wraps one (or sometimes more) containers and provides them with a shared network namespace and storage.
Think of a Pod as “one running instance of your model server.” It has an IP address, runs your Docker container, and has a defined resource allocation (CPU and memory).
You almost never create Pods directly. You create Deployments, which manage Pods for you.
2. Deployment
A Deployment says: “I want N copies of this Pod running at all times.” If a Pod crashes, the Deployment controller notices and creates a new one. If you update the image, the Deployment does a rolling update — replacing old Pods with new ones gradually, without downtime.
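The "notice and replace" behaviour is a reconciliation loop: compare the desired state with the observed state and act on the difference. The toy sketch below illustrates only that idea. The function name `reconcile` and the Pod names are made up for illustration, and the real controllers (a Deployment delegates replica counting to a ReplicaSet) handle many more cases.

```python
def reconcile(desired_replicas: int, running_pods: list[str]) -> list[str]:
    """Return the actions needed to move from the observed state to the desired one."""
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        # Too few Pods (for example, one crashed): create the missing ones.
        return ["create pod"] * diff
    if diff < 0:
        # Too many Pods: delete the surplus, taken from the end of the list.
        return [f"delete {name}" for name in running_pods[diff:]]
    return []  # Already in the desired state, nothing to do.


# One of three Pods has crashed, so one replacement is requested.
print(reconcile(3, ["pod-a", "pod-b"]))
```

The controller runs this comparison continuously, which is why a crashed Pod is replaced without anyone asking for it.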
3. Service
Pods are ephemeral and get new IP addresses when they restart. A Service provides a stable IP and DNS name that always routes to the current healthy Pods, even as they come and go.
Types:
- ClusterIP (default): Internal to the cluster. Other services can call it; external traffic cannot.
- LoadBalancer: Provisions a cloud load balancer. External traffic enters here.
- NodePort: Exposes the Service on a port on each node. Good for local development.
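Inside the cluster, a Service is normally reachable by a DNS name of the form `<service>.<namespace>.svc.cluster.local`. A minimal sketch of building such a URL follows, assuming the default cluster domain `cluster.local` (clusters can be configured with a different one). The helper name `service_url` is hypothetical; the Service name, namespace, and port are the ones used in this lesson's manifest.

```python
def service_url(service: str, namespace: str, port: int, path: str = "/") -> str:
    """Build the in-cluster URL of a Service, assuming the default cluster domain."""
    host = f"{service}.{namespace}.svc.cluster.local"
    return f"http://{host}:{port}{path}"


print(service_url("churn-predictor-svc", "ml-serving", 80, "/health"))
# http://churn-predictor-svc.ml-serving.svc.cluster.local:80/health
```

Callers use this name rather than a Pod IP, so they keep working as Pods are replaced.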
4. ConfigMap and Secret
ConfigMaps store non-sensitive configuration (environment variables, config files). Secrets store sensitive data (API keys, database passwords). Both are injected into Pods as environment variables or mounted as files.
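From the application's point of view, injected ConfigMap and Secret values are ordinary environment variables. The sketch below shows one way a model server might read them. The `Settings` class and `load_settings` function are illustrative names, not part of the course code, and the variable names match the manifest in this lesson.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    model_path: str
    log_level: str
    aws_access_key_id: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration the way the container sees it: as environment variables."""
    return Settings(
        # Non-sensitive values (ConfigMap) can fall back to sensible defaults.
        model_path=env.get("MODEL_PATH", "/app/models/churn_model.pkl"),
        log_level=env.get("LOG_LEVEL", "INFO"),
        # Sensitive values (Secret) get no default: a missing key raises KeyError
        # at startup instead of failing later on the first AWS call.
        aws_access_key_id=env["AWS_ACCESS_KEY_ID"],
    )


# Passing a plain dict makes the function easy to exercise outside a cluster.
settings = load_settings({"AWS_ACCESS_KEY_ID": "example", "LOG_LEVEL": "DEBUG"})
print(settings.log_level, settings.model_path)
```

Keeping the lookup in one place means the same image runs unchanged in every environment; only the ConfigMap and Secret differ.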
5. HorizontalPodAutoscaler (HPA)
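An HPA watches a metric (CPU utilization, request rate, custom metrics) and automatically adjusts the number of replicas to keep that metric near a target value. “Keep average CPU around 70%: add Pods when it runs above that, remove some when it stays well below.” There is a single target, not separate scale-up and scale-down thresholds.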
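The core scaling rule described in the Kubernetes documentation is a simple ratio: desired replicas = ceil(current replicas × current metric / target metric), kept within the configured minimum and maximum. The sketch below implements only that ratio. It deliberately ignores the tolerance band and stabilization windows the real controller applies, and the function name is an illustrative choice.

```python
import math


def desired_replicas(current: int, current_metric: float, target_metric: float,
                     min_replicas: int, max_replicas: int) -> int:
    """ceil(current * current_metric / target_metric), clamped to [min, max]."""
    raw = math.ceil(current * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# 3 Pods averaging 85% CPU against a 70% target: scale up.
print(desired_replicas(3, 85, 70, min_replicas=2, max_replicas=20))  # 4
# 3 Pods averaging 15% CPU: the ratio asks for 1, but the floor of 2 applies.
print(desired_replicas(3, 15, 70, min_replicas=2, max_replicas=20))  # 2
```

Because the rule is re-evaluated on every sync, a large spike is typically absorbed in several successive steps rather than one jump.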
A Complete Deployment for Your ML API
Here is a full Kubernetes manifest for the churn prediction API. All five resources in one file.
# kubernetes/churn-predictor.yaml
# ─── Namespace ────────────────────────────────────────────────────────────────
apiVersion: v1
kind: Namespace
metadata:
  name: ml-serving
---
# ─── ConfigMap: non-sensitive configuration ───────────────────────────────────
apiVersion: v1
kind: ConfigMap
metadata:
  name: churn-predictor-config
  namespace: ml-serving
data:
  MODEL_PATH: "/app/models/churn_model.pkl"
  LOG_LEVEL: "INFO"
  MLFLOW_TRACKING_URI: "http://mlflow-server.ml-serving.svc.cluster.local:5000"
---
# ─── Secret: sensitive configuration ──────────────────────────────────────────
# In production, use an external secrets manager (AWS Secrets Manager, Vault),
# or create the Secret from the command line instead of committing it:
# kubectl create secret generic churn-predictor-secrets \
#   -n ml-serving \
#   --from-literal=AWS_ACCESS_KEY_ID=xxx \
#   --from-literal=AWS_SECRET_ACCESS_KEY=yyy
apiVersion: v1
kind: Secret
metadata:
  name: churn-predictor-secrets
  namespace: ml-serving
type: Opaque
# stringData takes plain-text values; Kubernetes base64-encodes them when it
# stores the Secret. Use `data:` instead only if you supply values that are
# already encoded: echo -n "your-key" | base64
stringData:
  AWS_ACCESS_KEY_ID: "CHANGEME"
  AWS_SECRET_ACCESS_KEY: "CHANGEME"
---
# ─── Deployment: manages the Pods ─────────────────────────────────────────────
apiVersion: apps/v1
kind: Deployment
metadata:
  name: churn-predictor
  namespace: ml-serving
  labels:
    app: churn-predictor
    version: "1.0.0"
spec:
  replicas: 3 # Start with 3 replicas
  selector:
    matchLabels:
      app: churn-predictor
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1 # During updates, allow 1 extra Pod
      maxUnavailable: 0 # Never take a Pod down before a new one is ready
  template:
    metadata:
      labels:
        app: churn-predictor
    spec:
      containers:
        - name: api
          image: ghcr.io/your-org/churn-predictor:1.0.0
          ports:
            - containerPort: 8000
          # Resources: requests reserve this much CPU/memory, limits cap usage
          resources:
            requests:
              cpu: "500m" # 0.5 CPU cores
              memory: "512Mi" # 512 MiB
            limits:
              cpu: "2000m" # 2 CPU cores max
              memory: "2Gi" # 2 GiB max
          # Inject non-sensitive config
          envFrom:
            - configMapRef:
                name: churn-predictor-config
          # Inject secrets
          env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: churn-predictor-secrets
                  key: AWS_ACCESS_KEY_ID
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: churn-predictor-secrets
                  key: AWS_SECRET_ACCESS_KEY
          # Liveness probe: restart the container if this fails
          # (detects hung processes that aren't serving requests)
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30 # Wait 30s before first check (model loading)
            periodSeconds: 15
            failureThreshold: 3
          # Readiness probe: only send traffic when this passes
          # (prevents routing to a Pod that hasn't loaded the model yet)
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 20
            periodSeconds: 10
            failureThreshold: 3
      # Graceful shutdown: finish in-flight requests before stopping
      # (a Pod-level field, so it sits beside `containers`, not inside it)
      terminationGracePeriodSeconds: 30
---
# ─── Service: stable endpoint for the Deployment ─────────────────────────────
apiVersion: v1
kind: Service
metadata:
  name: churn-predictor-svc
  namespace: ml-serving
spec:
  selector:
    app: churn-predictor # Routes to all Pods with this label
  ports:
    - protocol: TCP
      port: 80 # Service port (external-facing)
      targetPort: 8000 # Container port
  type: LoadBalancer # Provision a cloud load balancer
---
# ─── HorizontalPodAutoscaler: auto-scaling ────────────────────────────────────
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: churn-predictor-hpa
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: churn-predictor
  minReplicas: 2 # Never go below 2 (for availability)
  maxReplicas: 20 # Never go above 20 (cost control)
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70 # Scale up when average CPU > 70%
Deploying and Operating
# Deploy everything
kubectl apply -f kubernetes/churn-predictor.yaml
# Check that Pods are running
kubectl get pods -n ml-serving
# NAME                           READY   STATUS    RESTARTS   AGE
# churn-predictor-7d9f8b-abc12   1/1     Running   0          2m
# churn-predictor-7d9f8b-def34   1/1     Running   0          2m
# churn-predictor-7d9f8b-ghi56   1/1     Running   0          2m
# Check the service (find the external IP)
kubectl get service churn-predictor-svc -n ml-serving
# NAME                  TYPE           CLUSTER-IP    EXTERNAL-IP    PORT(S)
# churn-predictor-svc   LoadBalancer   10.96.0.100   34.102.140.5   80:32000/TCP
# Test the deployed API
curl http://34.102.140.5/health
# Watch the HPA in action
kubectl get hpa churn-predictor-hpa -n ml-serving -w
# NAME                  REFERENCE               TARGETS   MINPODS   MAXPODS   REPLICAS
# churn-predictor-hpa   Deployment/churn-pred   45%/70%   2         20        3
# Check logs from a specific Pod
kubectl logs -n ml-serving churn-predictor-7d9f8b-abc12
# Stream logs from all Pods with the label
kubectl logs -n ml-serving -l app=churn-predictor -f
# Manually scale (useful for testing; while the HPA is active it will readjust the count)
kubectl scale deployment churn-predictor -n ml-serving --replicas=5
The 9 AM Traffic Spike Scenario
Your model processes 500 RPM (requests per minute) normally. At 9 AM Monday, traffic jumps to 5,000 RPM as teams start their workday.
Here’s what happens automatically:
- 9:00 AM: Traffic spike begins. Average CPU across 3 Pods rises from 30% to 85%.
- 9:01 AM: The HPA sees average CPU above its 70% target and raises the desired replica count.
- 9:02 AM: New Pods are scheduled. Kubernetes pulls the image (near-instant when it is already cached on the node) and starts the containers. The readinessProbe begins checking them.
- 9:03 AM: New Pods pass readiness check. Service begins routing traffic to them. Now 7 Pods are serving.
- 9:04 AM: CPU drops to 55% across 7 Pods. Traffic is handled without degradation.
- 11:30 AM: Traffic normalizes to 500 RPM. CPU drops to 15%. HPA scales back down to 2 Pods.
The entire scale-up took about 3 minutes. No human intervention. No alerts. No outage.
Rolling Updates: Zero-Downtime Deployments
When you train a new model and push a new Docker image:
# Update the image in the Deployment
kubectl set image deployment/churn-predictor \
api=ghcr.io/your-org/churn-predictor:1.1.0 \
-n ml-serving
# Watch the rollout
kubectl rollout status deployment/churn-predictor -n ml-serving
# Waiting for deployment "churn-predictor" rollout to finish: 1 out of 3 new replicas have been updated...
# Waiting for deployment "churn-predictor" rollout to finish: 2 out of 3 new replicas have been updated...
# deployment "churn-predictor" successfully rolled out
# Roll back if something goes wrong
kubectl rollout undo deployment/churn-predictor -n ml-serving
The RollingUpdate strategy (configured in the Deployment) ensures that:
- Old Pods keep serving traffic while new ones start
- New Pods must pass readiness checks before old ones are removed
- At most 1 extra Pod exists at any time (controlled by maxSurge)
- At least 3 Pods are always available (controlled by maxUnavailable: 0)
Users never see downtime.
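To see how the two settings interact, here is a simplified simulation of a rollout. It is a sketch under strong assumptions: new Pods pass readiness immediately, and the function `rolling_update` is an illustrative model, not the actual Deployment controller logic.

```python
def rolling_update(replicas: int, max_surge: int, max_unavailable: int) -> list[tuple[int, int]]:
    """Return the (old_pods, new_pods) counts after each step of a simplified rollout."""
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    old, new = replicas, 0
    states = [(old, new)]
    while old > 0 or new < replicas:
        # Surge budget: total Pods may not exceed replicas + max_surge.
        start = min(replicas - new, replicas + max_surge - (old + new))
        if start > 0:
            new += start
            states.append((old, new))
        # Availability floor: Pods may not drop below replicas - max_unavailable.
        stop = min(old, (old + new) - (replicas - max_unavailable))
        if stop > 0:
            old -= stop
            states.append((old, new))
    return states


for old, new in rolling_update(replicas=3, max_surge=1, max_unavailable=0):
    print(f"old={old} new={new} total={old + new}")
```

With the settings from this lesson, the total alternates between 3 and 4 Pods and never falls below 3, which is the zero-downtime property described above.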
Resource Requests and Limits: Why They Matter
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "2000m"
    memory: "2Gi"
Requests tell Kubernetes how much resource to reserve for this Pod. Kubernetes uses requests for scheduling: it will only place a Pod on a node that has enough unreserved resources.
Limits are the cap. If a container tries to use more than 2 CPU cores, it is throttled. If it tries to use more than 2Gi of memory, it is killed (OOM kill) and then restarted.
For ML models:
- Set memory requests based on your model’s size + overhead (a 500MB model might need 1.5GB with Python runtime overhead)
- Set CPU requests based on steady-state usage; limits higher for burst
- Test under load to find real numbers — guessing leads to OOM crashes in production
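The quantity strings are easy to misread, so the sketch below converts them to plain numbers and estimates how many Pods fit on a node by requests alone. It handles only the suffixes used in this lesson (`m` for CPU; `Ki`, `Mi`, `Gi` for memory). The helper names and the node size in the example are hypothetical.

```python
def cpu_millicores(quantity: str) -> int:
    """'500m' -> 500, '2' -> 2000 (the 'm' suffix means thousandths of a core)."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def memory_bytes(quantity: str) -> int:
    """'512Mi' -> 536870912. Mi and Gi are powers of two, not of ten."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # No suffix: plain bytes.


def pods_per_node(req_cpu: str, req_mem: str, node_cpu: str, node_mem: str) -> int:
    """How many Pods fit when the scheduler counts requests only."""
    by_cpu = cpu_millicores(node_cpu) // cpu_millicores(req_cpu)
    by_mem = memory_bytes(node_mem) // memory_bytes(req_mem)
    return min(by_cpu, by_mem)


# Hypothetical node with 4 cores and 16Gi free: CPU is the binding constraint.
print(pods_per_node("500m", "512Mi", "4", "16Gi"))  # 8
```

Real nodes reserve some capacity for system components, so the usable number is somewhat lower than this estimate.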
Summary
The Kubernetes concepts you need for ML serving:
| Concept | What it does |
|---|---|
| Pod | One running container instance |
| Deployment | Manages multiple Pod replicas, handles rolling updates |
| Service | Stable DNS and load balancing across Pods |
| HPA | Automatically scales replica count based on metrics |
| ConfigMap/Secret | Injects config and credentials into Pods |
The liveness and readiness probes on your /health endpoint are the connective tissue — they tell Kubernetes when a Pod is ready for traffic and when it needs to be restarted. The FastAPI health endpoint you built in lesson 6 serves this purpose exactly.
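An HTTP probe counts as successful when the response status is at least 200 and below 400. The sketch below shows one way a health handler could use that rule to keep traffic away until the model is loaded. It is framework-agnostic, `ModelState` and `health_response` are hypothetical names, and it is not necessarily how the lesson 6 endpoint is written.

```python
import json


class ModelState:
    """Hypothetical holder for the loaded model; `model` stays None until loading finishes."""

    def __init__(self) -> None:
        self.model = None


def health_response(state: ModelState) -> tuple[int, str]:
    """Return (status_code, body) for a GET /health request."""
    if state.model is None:
        # 503 fails the readiness probe, so the Service sends no traffic yet.
        return 503, json.dumps({"status": "loading"})
    return 200, json.dumps({"status": "ok"})


def probe_passes(status_code: int) -> bool:
    """Kubernetes httpGet probes treat 200-399 as success."""
    return 200 <= status_code < 400


state = ModelState()
print(probe_passes(health_response(state)[0]))  # False while the model is loading
state.model = object()  # stand-in for a real loaded model
print(probe_passes(health_response(state)[0]))  # True once it is loaded
```

One design point to watch: because the manifest uses the same path for liveness, a model that takes longer to load than the liveness delay plus its failure budget would be restarted mid-load. A separate liveness path or a startupProbe are common ways to avoid that.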
The next and final lesson ties everything together: a complete end-to-end MLOps pipeline for a churn prediction model, using every component from this course.
