ML Project Structure and Git Workflows
Structuring ML repos for reproducibility, DVC for data versioning
The Reproducibility Problem
Imagine you onboard a new data scientist, Priya, onto your team. She needs to reproduce the churn model your team trained three months ago so she can extend it with new features. You point her to the Git repo. She clones it. Then begins the scavenger hunt.
The training script imports from ../utils/helpers.py — a file that doesn’t exist in the repo. The data/ folder is empty; it was in .gitignore. The config says model_path = /Users/alice/projects/churn/models/final.pkl. The README says “run train.py” but doesn’t mention which Python version. After a full day of archaeology, Priya gives up and starts from scratch.
This is avoidable. With a good project structure and DVC, Priya should be able to reproduce your exact training run in under five minutes.
The Standard ML Project Structure
Here is a battle-tested directory layout for ML projects. It is adapted from the widely used Cookiecutter Data Science template, with DVC files added.
my-ml-project/
├── .dvc/                      # DVC internal files (commit this)
├── .github/
│   └── workflows/             # CI/CD pipelines
├── configs/
│   ├── model_config.yaml      # Hyperparameters
│   └── data_config.yaml       # Paths, schema definitions
├── data/
│   ├── raw/                   # Original, immutable data
│   ├── processed/             # Cleaned, feature-engineered data
│   └── external/              # Third-party data sources
├── models/                    # Saved model artifacts
├── notebooks/                 # Exploration only (not for production code)
├── src/
│   ├── __init__.py
│   ├── data/
│   │   ├── make_dataset.py    # Download or generate data
│   │   └── preprocess.py      # Cleaning and feature engineering
│   ├── models/
│   │   ├── train.py           # Training logic
│   │   └── evaluate.py        # Evaluation logic
│   └── utils/
│       └── helpers.py         # Shared utilities
├── tests/
│   ├── test_preprocess.py
│   └── test_model.py
├── .dvcignore                 # Like .gitignore, but for DVC
├── .gitignore                 # Ignore __pycache__/ and virtualenvs (DVC manages ignores for tracked data and models)
├── dvc.yaml                   # DVC pipeline stages
├── params.yaml                # Experiment parameters (tracked by DVC)
├── requirements.txt
└── README.md
What Goes Where, and Why
data/raw/ — The original data, never modified. If you transform it in-place, you can never go back. Treat this directory as read-only. DVC tracks it.
data/processed/ — The output of your preprocessing pipeline. Also tracked by DVC. Reproducible from data/raw/ given the same code.
notebooks/ — For exploration and visualization only. Production-quality code should live in src/. Notebooks are hard to test, diff, and import. The rule: if code runs in production, it lives in src/.
src/ — Your actual Python package. Importable. Testable. Every preprocessing step, every feature transformation, lives here as a function.
configs/ — Hyperparameters and paths live in YAML files, not hardcoded in scripts. This makes experiments reproducible: change a value in params.yaml, re-run the pipeline, compare results.
tests/ — Unit tests for your data transformations and model logic. These run in CI before every merge.
dvc.yaml — Defines your ML pipeline as a series of stages (preprocess → train → evaluate). DVC can reproduce the entire pipeline from scratch.
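The /Users/alice/... path from the opening story is exactly what this layout is meant to prevent. One way to avoid it is to resolve every path relative to the repository root. The sketch below is a minimal illustration under one assumption: that the root can be recognized by the presence of dvc.yaml. The helper name find_project_root is hypothetical and not part of DVC or Cookiecutter Data Science.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "dvc.yaml") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    current = start.resolve()
    for candidate in (current, *current.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No {marker} found at or above {start}")
```

A script under src/ could then call find_project_root(Path(__file__).parent) and build paths such as root / "models" / "churn_model.pkl", which works on any machine that has cloned the repo.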
Why Git Alone Fails for ML
Git is designed for text files. It diffs and stores changes to code beautifully. But ML projects have two things Git was never designed for:
1. Large binary files. A training dataset might be 2GB. A trained PyTorch model might be 500MB. Git stores every version of every file in its history. After a few iterations, your repo is gigabytes and git clone takes twenty minutes.
2. Data versioning semantics. Git tracks code changes with commit hashes. You want to track data changes with similar precision — “this model was trained on dataset at commit abc123” — but Git can’t efficiently store the data itself.
Adding a large file such as data/train.csv, committing it, and pushing produces output along these lines:
Counting objects: 3, done.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 524.29 MiB | 1.24 MiB/s, done.
Your repo is now half a gigabyte heavier. Every collaborator must download 500MB just to clone it.
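One way to catch this before it happens is to scan the working tree for oversized files before committing. The following is a minimal sketch, not a Git feature: the function name find_large_files and the 50 MiB threshold are assumptions chosen for illustration.

```python
from pathlib import Path

# Hypothetical threshold; pick whatever your team considers too big for Git.
MAX_BYTES = 50 * 1024 * 1024


def find_large_files(root: Path, max_bytes: int = MAX_BYTES) -> list[tuple[Path, int]]:
    """Return (path, size) pairs for files under `root` larger than `max_bytes`."""
    large = []
    for path in root.rglob("*"):
        # Skip Git's own object store and anything that is not a regular file.
        if ".git" in path.relative_to(root).parts or not path.is_file():
            continue
        size = path.stat().st_size
        if size > max_bytes:
            large.append((path, size))
    # Largest files first, so the worst offenders are easy to spot.
    return sorted(large, key=lambda item: item[1], reverse=True)
```

Anything this reports is a candidate for DVC tracking rather than a plain git add.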
DVC: Git for Data
DVC (Data Version Control) solves this by storing large files in remote storage (S3, GCS, Azure Blob, or even a local drive) and committing only a tiny pointer file to Git.
When you run dvc add data/train.csv, DVC:
- Copies data/train.csv to its cache (.dvc/cache/)
- Creates a data/train.csv.dvc pointer file (a few lines of YAML with a hash)
- Adds data/train.csv to .gitignore
You commit the .dvc file to Git. The actual data stays in the cache and is pushed to remote storage separately. When a teammate runs dvc pull, they get the exact file. When you update the data and run dvc add again, the pointer file changes — and that change is a normal Git diff.
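The idea behind these steps is content-addressed storage: the file is stored under its own hash, and only the hash is versioned. The sketch below imitates that idea in plain Python. It is not DVC's implementation; the function names, the cache layout, and the .pointer suffix are simplified assumptions chosen so the example cannot be mistaken for real .dvc files.

```python
import hashlib
import shutil
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets never need to fit in memory."""
    digest = hashlib.md5(usedforsecurity=False)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def toy_add(data_file: Path, cache_dir: Path) -> Path:
    """Copy `data_file` into a hash-addressed cache and write a small pointer file."""
    md5 = file_md5(data_file)
    # Identical content always maps to the same cache entry.
    cached = cache_dir / md5[:2] / md5[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():
        shutil.copy2(data_file, cached)
    # The pointer holds hash, size, and file name; this is the part Git would track.
    pointer = data_file.with_name(data_file.name + ".pointer")
    pointer.write_text(
        f"md5: {md5}\nsize: {data_file.stat().st_size}\npath: {data_file.name}\n"
    )
    return pointer
```

Because the pointer is a few lines of text, changing the data changes one hash in a file that Git diffs normally.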
Setting Up DVC
# Install DVC (with S3 support)
pip install "dvc[s3]"
# Initialize in your project (creates .dvc/ directory)
dvc init
# Add a remote storage location
dvc remote add -d myremote s3://my-bucket/dvc-cache
# Or for a local remote (great for testing)
dvc remote add -d localremote /tmp/dvc-cache
# Commit the DVC config
git add .dvc/config
git commit -m "Configure DVC remote storage"
Tracking Data Files
# Track your training data
dvc add data/raw/train.csv
# DVC creates data/raw/train.csv.dvc and updates .gitignore
# Commit the pointer file, not the data
git add data/raw/train.csv.dvc data/raw/.gitignore
git commit -m "Add training dataset v1"
# Push data to remote storage
dvc push
The data/raw/train.csv.dvc file looks like this:
outs:
- md5: a1b2c3d4e5f6...
  size: 2147483648
  path: train.csv
This tiny file in Git tells DVC exactly which version of the data to download.
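The md5 and size fields are enough to decide whether a local file is the version the pointer describes. A minimal sketch of that check follows; the function name matches_pointer is hypothetical, and the expected values are passed in directly rather than parsed from the pointer file.

```python
import hashlib
from pathlib import Path


def matches_pointer(data_file: Path, expected_md5: str, expected_size: int) -> bool:
    """Check a local file against the hash and size recorded in a pointer file."""
    # Comparing sizes first is cheap and avoids hashing obviously different files.
    if not data_file.is_file() or data_file.stat().st_size != expected_size:
        return False
    digest = hashlib.md5(usedforsecurity=False)
    with data_file.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5
```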
Reproducing an Experiment
When Priya joins your team:
# Clone the code
git clone https://github.com/your-org/churn-model.git
cd churn-model
# Install Python dependencies
pip install -r requirements.txt
# Pull the exact data version the current branch uses
dvc pull
# Reproduce the full pipeline (preprocess -> train -> evaluate)
dvc repro
dvc repro runs every stage in dvc.yaml in order, skipping any stage whose inputs haven’t changed. If Priya already has the processed data cached, DVC skips the preprocessing step. In five minutes, she has the exact same trained model you had.
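The skipping behavior rests on a simple idea: record a fingerprint of each stage's inputs after it runs, and re-run only when the fingerprint changes. DVC keeps this record in dvc.lock. The sketch below is a simplified imitation that tracks file dependencies only; the function names and the JSON lock format are assumptions, and real DVC also tracks parameters and outputs.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(paths: list[Path]) -> dict[str, str]:
    """Map each dependency path to the MD5 of its current contents."""
    return {
        str(p): hashlib.md5(p.read_bytes(), usedforsecurity=False).hexdigest()
        for p in paths
    }


def stage_is_stale(deps: list[Path], lock_file: Path) -> bool:
    """A stage must re-run if it never ran or a dependency changed since the last run."""
    if not lock_file.exists():
        return True
    recorded = json.loads(lock_file.read_text())
    return recorded != fingerprint(deps)


def record_run(deps: list[Path], lock_file: Path) -> None:
    """Store the dependency fingerprints after a successful run."""
    lock_file.write_text(json.dumps(fingerprint(deps), indent=2))
```

A runner would call stage_is_stale before each stage, execute the command only when it returns True, and then call record_run.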
Defining a Pipeline with dvc.yaml
Instead of a bash script that runs python preprocess.py && python train.py && python evaluate.py, define your pipeline in dvc.yaml. DVC tracks inputs and outputs of each stage and only re-runs stages that need it.
# dvc.yaml
stages:
  preprocess:
    cmd: python src/data/preprocess.py
    deps:
      - src/data/preprocess.py
      - data/raw/train.csv
    params:
      - configs/data_config.yaml:
          - test_size
          - random_seed
    outs:
      - data/processed/train_features.csv
      - data/processed/test_features.csv
  train:
    cmd: python src/models/train.py
    deps:
      - src/models/train.py
      - data/processed/train_features.csv
    params:
      - params.yaml:
          - learning_rate
          - n_estimators
          - max_depth
    outs:
      - models/churn_model.pkl
    metrics:
      - metrics/train_metrics.json:
          cache: false
  evaluate:
    cmd: python src/models/evaluate.py
    deps:
      - src/models/evaluate.py
      - models/churn_model.pkl
      - data/processed/test_features.csv
    metrics:
      - metrics/eval_metrics.json:
          cache: false
And params.yaml holds the hyperparameters:
# params.yaml
learning_rate: 0.01
n_estimators: 100
max_depth: 6
Now try this experiment: change n_estimators to 200 in params.yaml, then run dvc repro. DVC detects that only the train stage reads that parameter, so it skips preprocessing entirely, re-runs train, and then re-runs evaluate because the model it depends on has changed. You can compare the results with dvc metrics diff.
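The reasoning behind that selective re-run can be sketched as a small graph walk: start from the stages that read the changed parameter, then follow output-to-input edges downstream. The STAGES dictionary below is a hand-written, hypothetical summary of the dvc.yaml above, and stages_to_rerun is an illustrative helper, not a DVC API.

```python
from collections import deque

# Hypothetical summary of the pipeline: the parameters each stage reads
# and the stages whose outputs it consumes.
STAGES = {
    "preprocess": {"params": {"test_size", "random_seed"}, "upstream": set()},
    "train": {
        "params": {"learning_rate", "n_estimators", "max_depth"},
        "upstream": {"preprocess"},
    },
    "evaluate": {"params": set(), "upstream": {"train", "preprocess"}},
}


def stages_to_rerun(changed_param: str, stages: dict) -> list[str]:
    """Return stages affected by a parameter change, directly or via upstream outputs."""
    affected = {name for name, spec in stages.items() if changed_param in spec["params"]}
    queue = deque(affected)
    while queue:
        current = queue.popleft()
        for name, spec in stages.items():
            if current in spec["upstream"] and name not in affected:
                affected.add(name)
                queue.append(name)
    # Report stages in the order they are declared in the pipeline.
    return [name for name in stages if name in affected]


print(stages_to_rerun("n_estimators", STAGES))  # prints ['train', 'evaluate']
```

This sketch is conservative: it marks every downstream stage as affected, whereas DVC re-runs a downstream stage only if the upstream outputs actually changed.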
# Run pipeline
dvc repro
# Compare metrics between current run and last commit
dvc metrics diff HEAD
Output:
Path                       Metric    HEAD   workspace  Change
metrics/eval_metrics.json  accuracy  0.891  0.903      0.012
metrics/eval_metrics.json  f1_score  0.876  0.889      0.013
The Training Script
Here is a minimal src/models/train.py that works with the DVC pipeline:
# src/models/train.py
import json
import pickle
from pathlib import Path

import pandas as pd
import yaml
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Load parameters from params.yaml
with open("params.yaml") as f:
    params = yaml.safe_load(f)

# Load processed training data
train_df = pd.read_csv("data/processed/train_features.csv")
X_train = train_df.drop("churn", axis=1)
y_train = train_df["churn"]

# Train model
model = GradientBoostingClassifier(
    learning_rate=params["learning_rate"],
    n_estimators=params["n_estimators"],
    max_depth=params["max_depth"],
    random_state=42,
)
model.fit(X_train, y_train)

# Save model artifact
Path("models").mkdir(exist_ok=True)
with open("models/churn_model.pkl", "wb") as f:
    pickle.dump(model, f)

# Save training metrics
train_preds = model.predict(X_train)
metrics = {"train_accuracy": accuracy_score(y_train, train_preds)}
Path("metrics").mkdir(exist_ok=True)
with open("metrics/train_metrics.json", "w") as f:
    json.dump(metrics, f)
print(f"Training accuracy: {metrics['train_accuracy']:.4f}")
Notice: no absolute, machine-specific paths and no hardcoded hyperparameters. File paths are relative to the repository root and match the dvc.yaml stage definition, and hyperparameters come from params.yaml. This makes the entire pipeline reproducible.
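The pipeline also references src/models/evaluate.py, which is not shown here. The metric-writing half of such a script might look like the sketch below, which produces the accuracy and f1_score keys seen in the metrics diff output. Everything in it is an assumption for illustration: the function names are hypothetical, and the metrics are computed in plain Python so the example runs without dependencies. A real script would load models/churn_model.pkl and data/processed/test_features.csv and would likely use sklearn.metrics.

```python
import json
from pathlib import Path


def accuracy(y_true: list[int], y_pred: list[int]) -> float:
    """Fraction of predictions that match the labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true: list[int], y_pred: list[int], positive: int = 1) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def write_eval_metrics(y_true: list[int], y_pred: list[int], out_path: Path) -> dict:
    """Write the metrics as flat JSON so they can be compared across runs."""
    metrics = {
        "accuracy": accuracy(y_true, y_pred),
        "f1_score": f1_score(y_true, y_pred),
    }
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(metrics, indent=2))
    return metrics
```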
Git Workflow for ML Teams
The branching strategy that works best for ML teams:
main                              <-- stable, production-ready experiments
 |
 |-- feature/add-age-feature      <-- data scientist branches
 |-- experiment/try-xgboost       <-- experimental runs
 |-- fix/preprocessing-bug
The rule: main should always be reproducible. Any experiment that goes to main must have its DVC artifacts pushed to the remote. Before merging, CI runs dvc repro and checks that metrics meet the minimum threshold.
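The threshold check in CI can be a few lines of Python run after dvc repro. The sketch below is one possible shape for it: the THRESHOLDS values and the function names are hypothetical, and the metrics path is the one used in the dvc.yaml above.

```python
import json
from pathlib import Path

# Hypothetical minimums; agree on real values with your team.
THRESHOLDS = {"accuracy": 0.85, "f1_score": 0.80}


def failing_metrics(metrics: dict, thresholds: dict) -> list[str]:
    """Return a message for every metric that is missing or below its minimum."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif value < minimum:
            failures.append(f"{name}: {value:.3f} < {minimum:.3f}")
    return failures


def check(metrics_path: str = "metrics/eval_metrics.json") -> int:
    """Return a process exit code: 0 if all thresholds are met, 1 otherwise."""
    metrics = json.loads(Path(metrics_path).read_text())
    failures = failing_metrics(metrics, THRESHOLDS)
    for line in failures:
        print(f"FAIL {line}")
    return 1 if failures else 0
```

A CI job would pass the result of check() to sys.exit, so a non-zero code blocks the merge.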
Commit messages that work well for ML:
feat: add customer age as input feature

- Added age_bucket feature in preprocess.py
- Updated data_config.yaml schema
- Accuracy improved from 0.891 to 0.903 on held-out test set
- DVC artifacts pushed to s3://my-bucket/dvc-cache
Summary
A well-structured ML project:
- Separates raw data from processed data
- Keeps production code in src/, exploration in notebooks/
- Stores hyperparameters in params.yaml, not hardcoded in scripts
- Uses DVC to track data files and model artifacts without bloating Git
- Defines the pipeline in dvc.yaml so anyone can reproduce it with dvc repro
When Priya joins your team next month, she clones the repo, installs the dependencies, runs dvc pull and dvc repro, and has the same results you had. That’s the goal.
The next lesson adds experiment tracking to this pipeline: how to use MLflow to log every training run so you can compare them, pick the winner, and never lose a good result again.
