📖 Lesson ⏱️ 90 minutes

The Hugging Face Ecosystem

Models, datasets, transformers, and PEFT — your fine-tuning toolkit

Your Fine-Tuning Toolkit

The Hugging Face ecosystem is the de facto standard infrastructure for LLM fine-tuning. It is not a single library but a set of tools that work together. Understanding what each piece does, and which ones you actually touch during a fine-tuning run, will save you hours of confusion when error messages reference libraries you did not know were involved.

There are five libraries you need to know:

  • transformers: loads pretrained models and tokenizers
  • datasets: loads, processes, and streams training data
  • peft: applies parameter-efficient fine-tuning methods (LoRA, etc.)
  • accelerate: handles multi-GPU, mixed-precision, and distributed training
  • evaluate: computes metrics like BLEU, ROUGE, and accuracy

Let us walk through each one with code.

transformers: Models and Tokenizers

The transformers library is where you interact with the model itself. Every major open-source LLM family (Mistral, LLaMA, Falcon, Phi, Gemma) has a model class in this library. You almost never write model architecture code directly. Instead, you call from_pretrained() with a model name from the Hugging Face Hub, and the library downloads the weights and configuration automatically.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "mistralai/Mistral-7B-v0.1"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Mistral has no pad token by default

# Load model in float16 to reduce memory usage
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",  # automatically distribute across available GPUs
)

print(f"Model loaded: {sum(p.numel() for p in model.parameters()) / 1e9:.1f}B parameters")
# Output: Model loaded: 7.2B parameters
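
To see why the dtype matters, a back-of-the-envelope estimate helps. The sketch below multiplies the parameter count by the bytes each parameter occupies. The helper name weight_memory_gb is made up for this illustration, and the estimate covers weights only; activations, gradients, and optimizer state come on top.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Weights only; activations, gradients, and optimizer state are NOT included.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gb(num_params, dtype):
    """Approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


num_params = 7_200_000_000  # rounded count from the print statement above

for dtype in ("float32", "float16"):
    print(f"{dtype}: ~{weight_memory_gb(num_params, dtype):.1f} GB")
# float32: ~28.8 GB
# float16: ~14.4 GB
```

Halving the bytes per parameter halves the memory needed just to hold the model, which is often the difference between fitting on one GPU and not fitting at all.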

What is a Tokenizer?

A tokenizer converts human-readable text into the integer sequences the model actually processes. This is more subtle than it looks. Modern LLMs use subword tokenization — words are split into pieces, and those pieces map to integers called token IDs.

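text = "Fine-tuning a language model"
# Skip special tokens so the output shows only the text's own pieces;
# by default the Mistral tokenizer also prepends a BOS token (<s>)
tokens = tokenizer(text, add_special_tokens=False)
print(tokens)
# Example output (exact IDs depend on the tokenizer's vocabulary):
# {'input_ids': [28765, 28733, 25136, 264, 3842, 2229],
#  'attention_mask': [1, 1, 1, 1, 1, 1]}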

# Decode back to text
print(tokenizer.decode(tokens['input_ids']))
# 'Fine-tuning a language model'

# See the actual token strings
print(tokenizer.convert_ids_to_tokens(tokens['input_ids']))
# ['▁Fine', '-', 'tuning', '▁a', '▁language', '▁model']

The ▁ character marks a space before the token. Notice that “Fine-tuning” was split into three tokens. This is why LLMs sometimes make surprising errors with character counting or unusual spellings: they never see individual characters, only subword pieces.
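
To make the idea of subword pieces concrete, here is a toy tokenizer in plain Python. It is only a sketch: the six-entry vocabulary and the greedy longest-match rule are assumptions chosen for illustration, whereas real tokenizers learn vocabularies of tens of thousands of pieces and apply learned merge rules or probabilities.

```python
# Toy vocabulary (hypothetical): maps subword pieces to integer IDs.
TOY_VOCAB = {"▁Fine": 0, "-": 1, "tuning": 2, "▁a": 3, "▁language": 4, "▁model": 5}


def toy_tokenize(text):
    """Greedy longest-match split; every space becomes the '▁' marker."""
    remaining = "▁" + text.replace(" ", "▁")
    pieces = []
    while remaining:
        # Try the longest candidate first, then shrink until a vocab entry matches.
        for end in range(len(remaining), 0, -1):
            if remaining[:end] in TOY_VOCAB:
                pieces.append(remaining[:end])
                remaining = remaining[end:]
                break
        else:
            raise ValueError(f"no vocabulary piece matches {remaining!r}")
    return pieces


pieces = toy_tokenize("Fine-tuning a language model")
ids = [TOY_VOCAB[piece] for piece in pieces]
print(pieces)  # ['▁Fine', '-', 'tuning', '▁a', '▁language', '▁model']
print(ids)     # [0, 1, 2, 3, 4, 5]
```

The model only ever receives the list of IDs. Anything that depends on individual letters has to be inferred from pieces like these.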

Causal LM vs Seq2Seq

This distinction matters for which model class you load and how you format your training data.

Causal Language Models (CLM) are trained to predict the next token given all previous tokens. GPT, LLaMA, Mistral, and Falcon are all causal LMs. They generate text left-to-right and are used for completion and chat tasks. Load with AutoModelForCausalLM.

Sequence-to-Sequence Models (Seq2Seq) have a separate encoder and decoder. T5, BART, and mT5 fall into this category. The encoder processes an input sequence and the decoder generates an output sequence. They are better suited to translation, summarization, and other tasks with a clear input→output structure. Load with AutoModelForSeq2SeqLM.

For fine-tuning modern chat models (Mistral, LLaMA 3, Phi-3), you will always use causal LMs.
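
The difference in data formatting can be sketched with plain lists of token IDs. The function names and ID values below are made up for illustration. Masking the prompt with -100 is one common choice rather than a requirement; that value is the label index PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def build_causal_example(prompt_ids, response_ids):
    """Causal LM: prompt and response share one sequence."""
    input_ids = prompt_ids + response_ids
    # Mask the prompt so that only response tokens contribute to the loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return {"input_ids": input_ids, "labels": labels}


def build_seq2seq_example(source_ids, target_ids):
    """Seq2Seq: the encoder reads the source; the decoder is trained on the target."""
    return {"input_ids": source_ids, "labels": target_ids}


prompt_ids, response_ids = [11, 12, 13], [21, 22]  # made-up token IDs
print(build_causal_example(prompt_ids, response_ids))
# {'input_ids': [11, 12, 13, 21, 22], 'labels': [-100, -100, -100, 21, 22]}
print(build_seq2seq_example(prompt_ids, response_ids))
# {'input_ids': [11, 12, 13], 'labels': [21, 22]}
```

In the causal case the labels line up one-to-one with the inputs; the causal LM classes in transformers shift them by one position internally when computing the loss.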

Running Inference Before Training

Always test your model pipeline before you start training. There is no point debugging a training loop if the model is not loading correctly in the first place.

from transformers import pipeline

# Quick inference test
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
)

response = generator("Explain what a transformer model is in simple terms:")
print(response[0]['generated_text'])

datasets: Loading and Processing Training Data

The datasets library handles data loading, streaming, preprocessing, and splitting. It wraps data in an Arrow-backed format that is memory-efficient even for very large datasets.

from datasets import load_dataset

# Load a public dataset from the Hub
dataset = load_dataset("tatsu-lab/alpaca")
print(dataset)
# DatasetDict({
#     train: Dataset({features: ['instruction', 'input', 'output', 'text'], num_rows: 52002})
# })

# Inspect an example
print(dataset['train'][0])
# {'instruction': 'Give three tips for staying healthy.',
#  'input': '',
#  'output': '1. Eat a balanced diet...',
#  'text': 'Below is an instruction...'}

# Apply a transformation to all examples
def format_example(example):
    prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
    return {"text": prompt}

formatted = dataset.map(format_example, remove_columns=dataset['train'].column_names)
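
Some Alpaca examples carry extra context in the input field, which the formatter above leaves out. A variant that keeps it might look like the sketch below. The function name format_with_input and the "### Input:" header are assumptions for illustration, not a fixed convention.

```python
def format_with_input(example):
    """Build a prompt that keeps the optional 'input' field when it is non-empty."""
    parts = [f"### Instruction:\n{example['instruction']}"]
    if example.get("input"):
        parts.append(f"### Input:\n{example['input']}")
    parts.append(f"### Response:\n{example['output']}")
    return {"text": "\n\n".join(parts)}


sample = {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}
print(format_with_input(sample)["text"])
```

Because it takes one example and returns a dict, it can be passed to dataset.map() in the same way as format_example. For examples with an empty input it produces the same text as the original formatter.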

For custom datasets, you can load from local files:

# Load from JSONL file (one JSON object per line)
dataset = load_dataset("json", data_files={"train": "train.jsonl", "test": "test.jsonl"})

# Load from CSV
dataset = load_dataset("csv", data_files="data.csv")
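
If you have not worked with JSONL before, the sketch below writes and re-reads a tiny file using only the standard library, so you can see the layout load_dataset("json", ...) expects. The field names and the two records are made up for illustration.

```python
import json
import os
import tempfile

# Hypothetical records; use whatever fields your task needs.
records = [
    {"instruction": "Say hello.", "output": "Hello!"},
    {"instruction": "Add 2 and 3.", "output": "5"},
]

# Write one JSON object per line, with no enclosing list and no trailing commas.
path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read it back line by line to confirm every line parses on its own.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded == records)  # True
```

Checking that every line parses independently is a cheap way to catch formatting mistakes before they surface as a loading error.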

peft: Parameter-Efficient Fine-Tuning

The peft library implements LoRA, prefix tuning, and other parameter-efficient fine-tuning methods; combined with quantized model loading, it also supports QLoRA. You will use it in every lesson from here on. The key workflow is: take a loaded model, apply a PEFT config, and get back a modified model with most weights frozen.

from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=8,                          # rank of the low-rank matrices
    lora_alpha=16,                # scaling factor
    target_modules=["q_proj", "v_proj"],  # which layers to apply LoRA to
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# Wrap the model
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# trainable params: 3,407,872 || all params: 7,245,139,968 || trainable%: 0.0470

Only about 0.05% of parameters are trainable. That is the power of PEFT, covered in depth in the LoRA lesson.
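
The trainable count can be checked by hand. LoRA adds two small matrices per targeted layer: A with shape (r × in_features) and B with shape (out_features × r). The sketch below works through two cases under assumed shapes: 32 layers and a hidden size of 4096 in both, with v_proj either square or narrowed to 1024 outputs as in grouped-query attention. Treat these shapes as assumptions and compare them against model.config for the model you actually load.

```python
def lora_param_count(in_features, out_features, r):
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r


hidden, num_layers, r = 4096, 32, 8  # assumed architecture values

# Case 1 (assumed): q_proj and v_proj are both square, 4096 -> 4096.
total_square = num_layers * (
    lora_param_count(hidden, hidden, r) + lora_param_count(hidden, hidden, r)
)

# Case 2 (assumed): grouped-query attention, so v_proj maps 4096 -> 1024.
kv_dim = 1024
total_gqa = num_layers * (
    lora_param_count(hidden, hidden, r) + lora_param_count(hidden, kv_dim, r)
)

print(f"{total_square:,}")  # 4,194,304
print(f"{total_gqa:,}")     # 3,407,872
```

Either way the adapter is a tiny fraction of a model with roughly seven billion parameters, and the count grows linearly with the rank r.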

accelerate: Scaling Training

The accelerate library handles the complexity of running training across multiple GPUs or with mixed precision (float16/bfloat16). You rarely interact with it directly during standard fine-tuning — the Trainer class from transformers uses it under the hood. But you need it installed.

pip install accelerate
accelerate config  # interactive setup for your hardware

For multi-GPU training, accelerate handles data parallelism, gradient synchronization, and device placement. A single configuration file (~/.cache/huggingface/accelerate/default_config.yaml) tells your training scripts how to distribute work.

evaluate: Metrics

The evaluate library gives you a common interface to standard metrics. Load a metric by name, then call compute() with your predictions and references.

import evaluate

# Load a metric
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

predictions = ["The cat sat on the mat"]
references = [["The cat sat on the mat", "There is a cat on the mat"]]

bleu_score = bleu.compute(predictions=predictions, references=references)
print(bleu_score)
# {'bleu': 1.0, 'precisions': [1.0, 1.0, 1.0, 1.0], ...}
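
The precisions list in that output holds one value per n-gram size, from 1 to 4. A minimal sketch of the clipped n-gram precision behind those numbers follows. It assumes plain whitespace tokenization and leaves out the brevity penalty and the geometric mean that full BLEU applies, so it illustrates the idea rather than replacing the library metric.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(prediction, references, n):
    """Share of predicted n-grams found in a reference, with counts clipped."""
    pred_counts = ngrams(prediction.split(), n)
    # For each n-gram, keep the highest count seen in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    matched = sum(min(count, max_ref_counts[gram]) for gram, count in pred_counts.items())
    total = sum(pred_counts.values())
    return matched / total if total else 0.0


prediction = "The cat sat on the mat"
references = ["The cat sat on the mat", "There is a cat on the mat"]
print([clipped_precision(prediction, references, n) for n in range(1, 5)])
# [1.0, 1.0, 1.0, 1.0]
```

The prediction matches the first reference exactly, so every n-gram is found and all four precisions are 1.0. Clipping stops a prediction from scoring well by repeating one matching word many times.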

The Hugging Face Hub

The Hub (huggingface.co) is a model registry, dataset registry, and social platform combined. Every model you load with from_pretrained() is downloaded from the Hub. When you finish fine-tuning, you push your model back to the Hub to share it or access it from other machines.

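# Log in first (for example with `huggingface-cli login`), then push
# the model and tokenizer. If the model is PEFT-wrapped, only the
# adapter weights are uploaded.
model.push_to_hub("your-username/mistral-7b-customer-support")
tokenizer.push_to_hub("your-username/mistral-7b-customer-support")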

Model cards on the Hub document what a model does, what data it was trained on, its limitations, and evaluation results. When you push a fine-tuned model, write a model card. Future-you will thank present-you.

Putting It Together: Complete Setup for Mistral-7B Fine-Tuning

Here is the complete boilerplate you will use in every subsequent lesson:

from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, TaskType
import torch

# 1. Configuration
model_name = "mistralai/Mistral-7B-v0.1"
dataset_name = "tatsu-lab/alpaca"

# 2. Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"  # right-padding is the usual choice for causal LM training

# 3. Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)
model.config.use_cache = False  # disable KV-cache during training

# 4. Apply LoRA
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

# 5. Load dataset
dataset = load_dataset(dataset_name, split="train[:5%]")  # 5% for quick testing

print("Setup complete. Ready to train.")

Install everything you need with:

pip install torch transformers datasets peft accelerate evaluate rouge_score bitsandbytes

The bitsandbytes package is needed for quantization, which we cover in the QLoRA lesson. Install it now so it is ready.

What to Know Before the Next Lesson

The patterns above (from_pretrained, load_dataset, get_peft_model) will appear in every fine-tuning script you write. The hyperparameter values will vary (rank, alpha, learning rate, batch size), but the structure stays the same: load the model, load the data, configure the adapter, train, evaluate, push.

One concept to keep in mind: the device_map="auto" argument lets the library decide how to distribute the model across available devices. On a machine with one GPU, everything goes on that GPU (or overflows to CPU if needed). On a multi-GPU machine, layers are split across devices. This is handled automatically — you do not need to call .to(device) on every tensor yourself.

The next lesson covers dataset preparation — arguably the most important and most underrated skill in the entire fine-tuning workflow.