📖 Lesson ⏱️ 90 minutes

ML Project Structure and Git Workflows

Structuring ML repos for reproducibility, DVC for data versioning

The Reproducibility Problem

Imagine you onboard a new data scientist, Priya, onto your team. She needs to reproduce the churn model your team trained three months ago so she can extend it with new features. You point her to the Git repo. She clones it. Then begins the scavenger hunt.

The training script imports from ../utils/helpers.py — a file that doesn’t exist in the repo. The data/ folder is empty; it was in .gitignore. The config says model_path = /Users/alice/projects/churn/models/final.pkl. The README says “run train.py” but doesn’t mention which Python version. After a full day of archaeology, Priya gives up and starts from scratch.

This is avoidable. With a good project structure and DVC, Priya should be able to reproduce your exact training run in under five minutes.


The Standard ML Project Structure

Here is a battle-tested directory layout for ML projects. It is adapted from the layout popularized by Cookiecutter Data Science, with DVC files added, and many teams use some variant of it.

my-ml-project/
├── .dvc/                    # DVC config and internals (commit it; .dvc/.gitignore excludes the cache)
├── .github/
│   └── workflows/           # CI/CD pipelines
├── configs/
│   ├── model_config.yaml    # Hyperparameters
│   └── data_config.yaml     # Paths, schema definitions
├── data/
│   ├── raw/                 # Original, immutable data
│   ├── processed/           # Cleaned, feature-engineered data
│   └── external/            # Third-party data sources
├── models/                  # Saved model artifacts
├── notebooks/               # Exploration only (not for production code)
├── src/
│   ├── __init__.py
│   ├── data/
│   │   ├── make_dataset.py  # Download or generate data
│   │   └── preprocess.py    # Cleaning and feature engineering
│   ├── models/
│   │   ├── train.py         # Training logic
│   │   └── evaluate.py      # Evaluation logic
│   └── utils/
│       └── helpers.py       # Shared utilities
├── tests/
│   ├── test_preprocess.py
│   └── test_model.py
├── .dvcignore               # Like .gitignore, but for DVC
├── .gitignore               # Ignore __pycache__/ and similar; DVC adds entries for tracked data and models
├── dvc.yaml                 # DVC pipeline stages
├── params.yaml              # Experiment parameters (tracked by DVC)
├── requirements.txt
└── README.md

What Goes Where, and Why

data/raw/ — The original data, never modified. If you transform it in-place, you can never go back. Treat this directory as read-only. DVC tracks it.

data/processed/ — The output of your preprocessing pipeline. Also tracked by DVC. Reproducible from data/raw/ given the same code.

notebooks/ — For exploration and visualization only. Production-quality code should live in src/. Notebooks are hard to test, diff, and import. The rule: if code runs in production, it lives in src/.

src/ — Your actual Python package. Importable. Testable. Every preprocessing step, every feature transformation, lives here as a function.

configs/ — Hyperparameters and paths live in YAML files, not hardcoded in scripts. This makes experiments reproducible: change a value in params.yaml, re-run the pipeline, compare results.

tests/ — Unit tests for your data transformations and model logic. These run in CI before every merge.

dvc.yaml — Defines your ML pipeline as a series of stages (preprocess → train → evaluate). DVC can reproduce the entire pipeline from scratch.
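As an illustration of the src/ and tests/ split described above, here is a minimal sketch of one feature transformation and its unit tests. The function name bucket_age and the bucket boundaries are hypothetical, chosen only for the example. Both files are shown in one block for brevity; in a real repo the tests would import the function from src.data.preprocess and would likely use pytest.raises.

```python
# src/data/preprocess.py (hypothetical excerpt)
def bucket_age(age: int) -> str:
    """Map an age in years to a coarse bucket label."""
    if age < 0:
        raise ValueError(f"age must be non-negative, got {age}")
    if age < 30:
        return "under_30"
    if age < 50:
        return "30_to_49"
    return "50_plus"


# tests/test_preprocess.py (hypothetical excerpt)
def test_bucket_age_boundaries():
    # Check both sides of every boundary, where off-by-one bugs hide.
    assert bucket_age(29) == "under_30"
    assert bucket_age(30) == "30_to_49"
    assert bucket_age(49) == "30_to_49"
    assert bucket_age(50) == "50_plus"


def test_bucket_age_rejects_negative():
    try:
        bucket_age(-1)
    except ValueError:
        return
    raise AssertionError("expected ValueError for a negative age")
```

Because the transformation is a plain function with no file I/O, the tests need no data on disk and can run in CI in milliseconds.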


Why Git Alone Fails for ML

Git is designed for text files. It diffs and stores changes to code beautifully. But ML projects have two things Git was never designed for:

1. Large binary files. A training dataset might be 2GB. A trained PyTorch model might be 500MB. Git stores every version of every file in its history. After a few iterations, your repo is gigabytes and git clone takes twenty minutes.

2. Data versioning semantics. Git tracks code changes with commit hashes. You want to track data changes with similar precision — “this model was trained on dataset at commit abc123” — but Git can’t efficiently store the data itself.

Committing a large data/train.csv and pushing it produces output like this:

Counting objects: 3, done.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 524.29 MiB | 1.24 MiB/s, done.

Your repo is now half a gigabyte heavier. Every collaborator must download 500MB just to clone it.


DVC: Git for Data

DVC (Data Version Control) solves this by storing large files in remote storage (S3, GCS, Azure Blob, or even a local drive) and committing only a tiny pointer file to Git.

When you run dvc add data/train.csv, DVC:

  1. Copies data/train.csv to its cache (.dvc/cache/)
  2. Creates a data/train.csv.dvc pointer file (a few lines of YAML with a hash)
  3. Adds train.csv to data/.gitignore (the .gitignore in the same directory as the file)

You commit the .dvc file to Git. The actual data stays in the cache and is pushed to remote storage separately. When a teammate runs dvc pull, they get the exact file. When you update the data and run dvc add again, the pointer file changes — and that change is a normal Git diff.
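The three steps above can be sketched in a few lines of Python. This is a toy illustration of content-addressed storage, not DVC's actual implementation: the flat cache layout, the function names file_md5 and track_file, and the hand-written pointer format are all assumptions made for the example.

```python
import hashlib
import shutil
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets never sit fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track_file(data_path: Path, cache_dir: Path) -> Path:
    """Copy data_path into cache_dir under its hash and write a pointer file."""
    md5 = file_md5(data_path)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / md5  # toy layout: one file per content hash
    if not cached.exists():   # identical content is stored only once
        shutil.copy2(data_path, cached)
    pointer = data_path.with_name(data_path.name + ".dvc")
    pointer.write_text(
        "outs:\n"
        f"- md5: {md5}\n"
        f"  size: {data_path.stat().st_size}\n"
        f"  path: {data_path.name}\n"
    )
    return pointer
```

The design point is that the pointer is tiny and text-based, so Git can diff it, while the cache key is derived from the content, so two identical files never take up space twice.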

Setting Up DVC

# Install DVC (with S3 support)
pip install "dvc[s3]"

# Initialize in your project (creates .dvc/ directory)
dvc init

# Add a remote storage location
dvc remote add -d myremote s3://my-bucket/dvc-cache
# Or for a local remote (great for testing)
dvc remote add -d localremote /tmp/dvc-cache

# Commit the DVC config
git add .dvc/config
git commit -m "Configure DVC remote storage"

Tracking Data Files

# Track your training data
dvc add data/raw/train.csv

# DVC creates data/raw/train.csv.dvc and updates data/raw/.gitignore
# Commit the pointer file, not the data
git add data/raw/train.csv.dvc data/raw/.gitignore
git commit -m "Add training dataset v1"

# Push data to remote storage
dvc push

The data/raw/train.csv.dvc file looks like this:

outs:
- md5: a1b2c3d4e5f6...
  size: 2147483648
  path: train.csv

This tiny file in Git tells DVC exactly which version of the data to download.
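To make the role of the pointer concrete, the sketch below checks whether a local file still matches the hash and size recorded in a pointer file. It is an illustration only: the line-based parser handles just the simple single-output layout shown above (a real tool would use a YAML parser), and the names read_pointer and matches_pointer are hypothetical.

```python
import hashlib
from pathlib import Path


def read_pointer(pointer_path: Path) -> dict:
    """Collect the `key: value` pairs from a single-output pointer file."""
    fields = {}
    for line in pointer_path.read_text().splitlines():
        line = line.strip().lstrip("- ").strip()  # drop the YAML list dash
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def matches_pointer(data_path: Path, pointer_path: Path) -> bool:
    """Return True if data_path has the size and MD5 the pointer records."""
    fields = read_pointer(pointer_path)
    if data_path.stat().st_size != int(fields["size"]):
        return False  # cheap size check before hashing the whole file
    digest = hashlib.md5()
    with open(data_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == fields["md5"]
```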

Reproducing an Experiment

When Priya joins your team:

# Clone the code
git clone https://github.com/your-org/churn-model.git
cd churn-model

# Install Python dependencies
pip install -r requirements.txt

# Pull the exact data version the current branch uses
dvc pull

# Reproduce the full pipeline (preprocess -> train -> evaluate)
dvc repro

dvc repro runs every stage in dvc.yaml in order, skipping any stage whose inputs haven’t changed. If Priya already has the processed data cached, DVC skips the preprocessing step. In five minutes, she has the exact same trained model you had.
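The skipping behavior can be approximated with a fingerprint of each stage's inputs. The sketch below is a simplified model, assuming a JSON lock file and the hypothetical helpers deps_fingerprint and run_if_changed; DVC records comparable information in dvc.lock and also considers parameters and outputs, which this example leaves out.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def deps_fingerprint(deps: list[Path]) -> str:
    """Combine the names and contents of all dependencies into one hash."""
    digest = hashlib.sha256()
    for dep in sorted(deps):
        digest.update(str(dep).encode())
        digest.update(dep.read_bytes())
    return digest.hexdigest()


def run_if_changed(
    name: str, deps: list[Path], action: Callable[[], None], lock_path: Path
) -> bool:
    """Run action only if the stage's inputs differ from the last recorded run."""
    lock = json.loads(lock_path.read_text()) if lock_path.exists() else {}
    fingerprint = deps_fingerprint(deps)
    if lock.get(name) == fingerprint:
        return False  # inputs unchanged: skip the stage
    action()
    lock[name] = fingerprint  # record the inputs only after a successful run
    lock_path.write_text(json.dumps(lock, indent=2))
    return True
```

Writing the fingerprint after the action finishes means a stage that crashes halfway is retried on the next run instead of being treated as up to date.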


Defining a Pipeline with dvc.yaml

Instead of a bash script that runs python preprocess.py && python train.py && python evaluate.py, define your pipeline in dvc.yaml. DVC tracks inputs and outputs of each stage and only re-runs stages that need it.

# dvc.yaml
stages:
  preprocess:
    cmd: python src/data/preprocess.py
    deps:
      - src/data/preprocess.py
      - data/raw/train.csv
    params:
      - configs/data_config.yaml:
          - test_size
          - random_seed
    outs:
      - data/processed/train_features.csv
      - data/processed/test_features.csv

  train:
    cmd: python src/models/train.py
    deps:
      - src/models/train.py
      - data/processed/train_features.csv
    params:
      - params.yaml:
          - learning_rate
          - n_estimators
          - max_depth
    outs:
      - models/churn_model.pkl
    metrics:
      - metrics/train_metrics.json:
          cache: false

  evaluate:
    cmd: python src/models/evaluate.py
    deps:
      - src/models/evaluate.py
      - models/churn_model.pkl
      - data/processed/test_features.csv
    metrics:
      - metrics/eval_metrics.json:
          cache: false

And params.yaml holds the hyperparameters:

# params.yaml
learning_rate: 0.01
n_estimators: 100
max_depth: 6
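One way to reason about which stages a parameter change touches is to treat the pipeline as a dependency graph. The sketch below copies the stages from the dvc.yaml above into a plain dictionary and walks it; the function stages_to_rerun is hypothetical, and it assumes that re-running a stage always changes its outputs, whereas DVC compares output hashes and can stop early when an output turns out identical.

```python
STAGES = {
    "preprocess": {
        "deps": ["src/data/preprocess.py", "data/raw/train.csv"],
        "params": ["test_size", "random_seed"],
        "outs": [
            "data/processed/train_features.csv",
            "data/processed/test_features.csv",
        ],
    },
    "train": {
        "deps": ["src/models/train.py", "data/processed/train_features.csv"],
        "params": ["learning_rate", "n_estimators", "max_depth"],
        "outs": ["models/churn_model.pkl"],
    },
    "evaluate": {
        "deps": [
            "src/models/evaluate.py",
            "models/churn_model.pkl",
            "data/processed/test_features.csv",
        ],
        "params": [],
        "outs": [],
    },
}


def stages_to_rerun(stages: dict, changed_param: str) -> list[str]:
    """List stages that use changed_param, plus everything downstream of them."""
    affected = [n for n, s in stages.items() if changed_param in s["params"]]
    changed_outs = {out for n in affected for out in stages[n]["outs"]}
    progress = True
    while progress:  # keep sweeping until no new stage is pulled in
        progress = False
        for name, stage in stages.items():
            if name not in affected and changed_outs & set(stage["deps"]):
                affected.append(name)
                changed_outs.update(stage["outs"])
                progress = True
    return affected
```

Calling stages_to_rerun(STAGES, "n_estimators") returns ["train", "evaluate"]: train lists the parameter directly, and evaluate is pulled in because it depends on models/churn_model.pkl.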

Now try this experiment: change n_estimators to 200 in params.yaml, then run dvc repro. DVC detects that only the train stage lists that parameter, and that evaluate depends on the model train produces, so it skips preprocessing entirely and re-runs only those two stages. You can compare the results with dvc metrics diff.

# Run pipeline
dvc repro

# Compare metrics between current run and last commit
dvc metrics diff HEAD

Output:

Path                       Metric    HEAD   workspace  Change
metrics/eval_metrics.json  accuracy  0.891  0.903      0.012
metrics/eval_metrics.json  f1_score  0.876  0.889      0.013
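The Change column is the workspace value minus the HEAD value. A minimal sketch of that calculation, assuming both versions of the metrics file have already been loaded into dictionaries (the function name metrics_diff is hypothetical and is not part of DVC):

```python
def metrics_diff(old: dict, new: dict) -> list[tuple[str, float, float, float]]:
    """Return (metric, old, new, change) rows for metrics present in both runs."""
    rows = []
    for name in sorted(old.keys() & new.keys()):
        # Round to hide floating-point noise such as 0.012000000000000011.
        change = round(new[name] - old[name], 6)
        rows.append((name, old[name], new[name], change))
    return rows


head = {"accuracy": 0.891, "f1_score": 0.876}
workspace = {"accuracy": 0.903, "f1_score": 0.889}
for name, old_value, new_value, change in metrics_diff(head, workspace):
    print(f"{name:<10} {old_value:.3f}  {new_value:.3f}  {change:+.3f}")
```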

The Training Script

Here is a minimal src/models/train.py that works with the DVC pipeline:

# src/models/train.py
import json
import pickle
from pathlib import Path

import pandas as pd
import yaml
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Load parameters from params.yaml
with open("params.yaml") as f:
    params = yaml.safe_load(f)

# Load processed training data
train_df = pd.read_csv("data/processed/train_features.csv")
X_train = train_df.drop("churn", axis=1)
y_train = train_df["churn"]

# Train model
model = GradientBoostingClassifier(
    learning_rate=params["learning_rate"],
    n_estimators=params["n_estimators"],
    max_depth=params["max_depth"],
    random_state=42,
)
model.fit(X_train, y_train)

# Save model artifact
Path("models").mkdir(exist_ok=True)
with open("models/churn_model.pkl", "wb") as f:
    pickle.dump(model, f)

# Save training metrics
train_preds = model.predict(X_train)
metrics = {"train_accuracy": accuracy_score(y_train, train_preds)}
Path("metrics").mkdir(exist_ok=True)
with open("metrics/train_metrics.json", "w") as f:
    json.dump(metrics, f)

print(f"Training accuracy: {metrics['train_accuracy']:.4f}")

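Notice: no absolute, machine-specific paths and no hardcoded hyperparameters. The relative paths match the deps and outs declared in dvc.yaml, and the hyperparameters come from params.yaml, so the stage behaves the same on any machine when it is launched from the repo root. The one constant left in the script is random_state=42, which keeps training deterministic; move it to params.yaml if you want to vary it between experiments.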
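The script's relative paths resolve correctly only when the working directory is the repo root, which is how dvc repro runs it. If you also want to run code from notebooks/ or another subdirectory, one option is to locate the root explicitly. This is a sketch under that assumption; find_project_root is a hypothetical helper, and using dvc.yaml as the marker file is a choice made for this example.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "dvc.yaml") -> Path:
    """Walk upward from start until a directory containing marker is found."""
    current = start.resolve()
    for candidate in [current, *current.parents]:
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"no {marker} found at or above {start}")


# Typical use inside a module under src/:
#   ROOT = find_project_root(Path(__file__))
#   params_path = ROOT / "params.yaml"
```

This keeps every path relative to the repository rather than to a particular user's home directory, which is exactly what the /Users/alice/... config from the opening story got wrong.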


Git Workflow for ML Teams

The branching strategy that works best for ML teams:

main          <-- stable, production-ready experiments
    |
    |-- feature/add-age-feature    <-- data scientist branches
    |-- experiment/try-xgboost     <-- experimental runs
    |-- fix/preprocessing-bug

The rule: main should always be reproducible. Any experiment that goes to main must have its DVC artifacts pushed to the remote. Before merging, CI runs dvc repro and checks that metrics meet the minimum threshold.
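The metrics check in CI can be a short script that runs after dvc repro. The following is a sketch, not a prescribed setup: the function names are hypothetical, and any threshold you pass in (for example a minimum accuracy of 0.85) is a value your team would have to choose.

```python
import json
from pathlib import Path


def failed_thresholds(metrics: dict, minimums: dict) -> list[str]:
    """Describe every metric that is missing or below its required minimum."""
    failures = []
    for name, minimum in minimums.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing from metrics file")
        elif value < minimum:
            failures.append(f"{name}: {value:.3f} is below the minimum {minimum:.3f}")
    return failures


def check_metrics_file(metrics_path: Path, minimums: dict) -> int:
    """Return a process exit code: 0 if all thresholds pass, 1 otherwise."""
    metrics = json.loads(metrics_path.read_text())
    failures = failed_thresholds(metrics, minimums)
    for failure in failures:
        print(f"FAIL {failure}")
    return 1 if failures else 0
```

In a CI job, the script would end with something like sys.exit(check_metrics_file(Path("metrics/eval_metrics.json"), {"accuracy": 0.85})), so a non-zero exit code blocks the merge. Treating a missing metric as a failure guards against a renamed key silently passing the gate.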

Commit messages that work well for ML:

feat: add customer age as input feature

- Added age_bucket feature in preprocess.py
- Updated data_config.yaml schema
- Accuracy improved from 0.891 to 0.903 on held-out test set
- DVC artifacts pushed to s3://my-bucket/dvc-cache

Summary

A well-structured ML project:

  • Separates raw data from processed data
  • Keeps production code in src/, exploration in notebooks/
  • Stores hyperparameters in params.yaml, not hardcoded in scripts
  • Uses DVC to track data files and model artifacts without bloating Git
  • Defines the pipeline in dvc.yaml so anyone can reproduce it with dvc repro

When Priya joins your team next month, she clones the repo, installs the requirements, and runs dvc pull and dvc repro. A few minutes later she has the same results you had. That’s the goal.

The next lesson adds experiment tracking to this pipeline: how to use MLflow to log every training run so you can compare them, pick the winner, and never lose a good result again.