The Hugging Face Ecosystem
Models, datasets, transformers, and PEFT — your fine-tuning toolkit
Your Fine-Tuning Toolkit
The Hugging Face ecosystem is the de facto standard infrastructure for LLM fine-tuning. It is not a single library but a constellation of tools that work together. Understanding what each piece does, and which ones you actually touch during a fine-tuning run, will save you hours of confusion reading error messages that reference libraries you did not know were involved.
There are five libraries you need to know:
- transformers: loads pretrained models and tokenizers
- datasets: loads, processes, and streams training data
- peft: applies parameter-efficient fine-tuning methods (LoRA, etc.)
- accelerate: handles multi-GPU, mixed-precision, and distributed training
- evaluate: computes metrics like BLEU, ROUGE, and accuracy
Let us walk through each one with code.
transformers: Models and Tokenizers
The transformers library is where you interact with the model itself. Every major open-source LLM (Mistral, LLaMA, Falcon, Phi, Gemma) has a model class in this library. You almost never write model architecture code directly. Instead, you call from_pretrained() with a model name from the Hugging Face Hub, and the library downloads the weights and configuration automatically.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "mistralai/Mistral-7B-v0.1"
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token # Mistral has no pad token by default
# Load model in float16 to reduce memory usage
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",  # automatically distribute across available GPUs
)
print(f"Model loaded: {sum(p.numel() for p in model.parameters()) / 1e9:.1f}B parameters")
# Output: Model loaded: 7.2B parameters
What is a Tokenizer?
A tokenizer converts human-readable text into the integer sequences the model actually processes. This is more subtle than it looks. Modern LLMs use subword tokenization — words are split into pieces, and those pieces map to integers called token IDs.
text = "Fine-tuning a language model"
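# add_special_tokens=False leaves out the BOS token (<s>) that Mistral's tokenizer prepends by default
tokens = tokenizer(text, add_special_tokens=False)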
print(tokens)
# {'input_ids': [28765, 28733, 25136, 264, 3842, 2229],
# 'attention_mask': [1, 1, 1, 1, 1, 1]}
# Decode back to text
print(tokenizer.decode(tokens['input_ids']))
# 'Fine-tuning a language model'
# See the actual token strings
print(tokenizer.convert_ids_to_tokens(tokens['input_ids']))
# ['▁Fine', '-', 'tuning', '▁a', '▁language', '▁model']
The ▁ character indicates a space before the token. Notice “Fine-tuning” was split into three tokens. This is why LLMs sometimes make surprising errors with character counting or unusual spellings: they never see individual characters, only subword pieces.
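To make the splitting behaviour concrete, here is a minimal sketch of a subword tokenizer. It is an illustration only: TOY_VOCAB is a hypothetical six-piece vocabulary, and the greedy longest-match rule is a simplification (real tokenizers such as Mistral's learn their pieces from data and apply merge rules rather than this loop).

```python
# Hypothetical toy vocabulary: piece -> token ID (not Mistral's real vocabulary).
TOY_VOCAB = {"▁Fine": 0, "-": 1, "tuning": 2, "▁a": 3, "▁language": 4, "▁model": 5}


def toy_tokenize(text, vocab):
    """Split text into the longest vocabulary pieces, scanning left to right."""
    # Mark word boundaries with "▁", as SentencePiece-style tokenizers do.
    marked = "▁" + text.replace(" ", "▁")
    pieces = []
    i = 0
    while i < len(marked):
        # Try the longest candidate first and shrink until one is in the vocabulary.
        for j in range(len(marked), i, -1):
            if marked[i:j] in vocab:
                pieces.append(marked[i:j])
                i = j
                break
        else:
            # Nothing matched: fall back to a single character.
            pieces.append(marked[i])
            i += 1
    return pieces


pieces = toy_tokenize("Fine-tuning a language model", TOY_VOCAB)
ids = [TOY_VOCAB.get(piece, -1) for piece in pieces]  # -1 marks an unknown piece
print(pieces)
print(ids)
```

The model only ever receives the integer list, which is why questions about individual letters are harder for it than they look.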
Causal LM vs Seq2Seq
This distinction matters for which model class you load and how you format your training data.
Causal Language Models (CLM) are trained to predict the next token given all previous tokens. GPT, LLaMA, Mistral, and Falcon are all causal LMs. They generate text left-to-right and are used for completion and chat tasks. Load with AutoModelForCausalLM.
Sequence-to-Sequence Models (Seq2Seq) have a separate encoder and decoder. T5, BART, and mT5 fall into this category. They process an input sequence (encoder) and generate an output sequence (decoder). Better suited for translation, summarization, and tasks with a clear input→output structure. Load with AutoModelForSeq2SeqLM.
For fine-tuning modern chat models (Mistral, LLaMA 3, Phi-3), you will always use causal LMs.
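The difference in training data shape can be sketched in a few lines. This is a simplified illustration with made-up token IDs, and the helper names are hypothetical. In practice, the causal LM classes in transformers accept labels equal to the input IDs and perform the one-position shift internally, so you rarely write the shift yourself.

```python
def causal_lm_example(token_ids):
    """Next-token prediction: each position is trained to predict the token after it."""
    return {"inputs": token_ids[:-1], "labels": token_ids[1:]}


def seq2seq_example(source_ids, target_ids):
    """Encoder-decoder: the encoder reads the source, the decoder learns the target."""
    return {"encoder_inputs": source_ids, "labels": target_ids}


# Made-up token IDs standing in for a prompt and its response.
prompt_ids = [11, 12, 13]
response_ids = [21, 22]

# Causal LM: prompt and response are concatenated into one sequence.
causal = causal_lm_example(prompt_ids + response_ids)
print(causal)

# Seq2Seq: prompt and response stay in separate sequences.
seq2seq = seq2seq_example(prompt_ids, response_ids)
print(seq2seq)
```

This is why causal LM fine-tuning data is formatted as a single text string containing both the prompt and the answer.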
Running Inference Before Training
Always test your model pipeline before you start training. There is no point debugging a training loop if the model is not loading correctly in the first place.
from transformers import pipeline
# Quick inference test
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
)
response = generator("Explain what a transformer model is in simple terms:")
print(response[0]['generated_text'])
datasets: Loading and Processing Training Data
The datasets library handles data loading, streaming, preprocessing, and splitting. It wraps data in an Arrow-backed format that is memory-efficient even for very large datasets.
from datasets import load_dataset
# Load a public dataset from the Hub
dataset = load_dataset("tatsu-lab/alpaca")
print(dataset)
# DatasetDict({
# train: Dataset({features: ['instruction', 'input', 'output', 'text'], num_rows: 52002})
# })
# Inspect an example
print(dataset['train'][0])
# {'instruction': 'Give three tips for staying healthy.',
# 'input': '',
# 'output': '1. Eat a balanced diet...',
# 'text': 'Below is an instruction...'}
# Apply a transformation to all examples
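def format_example(example):
    # Some Alpaca rows carry extra context in 'input'; keep it when present
    context = f"\n\n### Input:\n{example['input']}" if example['input'] else ""
    prompt = f"### Instruction:\n{example['instruction']}{context}\n\n### Response:\n{example['output']}"
    return {"text": prompt}
formatted = dataset.map(format_example, remove_columns=dataset['train'].column_names)
For custom datasets, you can load from local files: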
# Load from JSONL file (one JSON object per line)
dataset = load_dataset("json", data_files={"train": "train.jsonl", "test": "test.jsonl"})
# Load from CSV
dataset = load_dataset("csv", data_files="data.csv")
peft: Parameter-Efficient Fine-Tuning
The peft library implements LoRA, prefix tuning, and other efficient fine-tuning methods, and it is also what you use for QLoRA, which applies LoRA on top of a quantized base model. You will use it in every lesson from here on. The key workflow is: take a loaded model, apply a PEFT config, and get back a modified model with most weights frozen.
from peft import LoraConfig, get_peft_model, TaskType
lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank matrices
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # which layers to apply LoRA to
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
# Wrap the model
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# trainable params: 3,407,872 || all params: 7,245,139,968 || trainable%: 0.0470
Only about 0.05% of parameters are trainable. That is the power of PEFT, which is covered in depth in the LoRA lesson.
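The trainable-parameter count can be checked by hand. The sketch below assumes the Mistral-7B-v0.1 shapes (hidden size 4096, 32 layers, 8 key/value heads of dimension 128); these values are assumptions taken from the model's published configuration, so verify them against model.config before relying on the result. The helper lora_params is a hypothetical name, not part of peft.

```python
def lora_params(in_features, out_features, r):
    """LoRA adds two matrices per adapted layer: A (r x in) and B (out x r)."""
    return r * in_features + out_features * r


# Assumed Mistral-7B-v0.1 shapes; check model.config for the real values.
hidden_size = 4096
num_layers = 32
num_kv_heads = 8
head_dim = 128
r = 8

# q_proj maps hidden_size -> hidden_size.
q_proj = lora_params(hidden_size, hidden_size, r)
# v_proj is narrower because of grouped-query attention: hidden_size -> 8 * 128.
v_proj = lora_params(hidden_size, num_kv_heads * head_dim, r)

trainable = num_layers * (q_proj + v_proj)
print(f"q_proj adapter: {q_proj:,}")
print(f"v_proj adapter: {v_proj:,}")
print(f"trainable LoRA parameters: {trainable:,}")
```

Under these assumptions the total is 3,407,872. The exact figure depends on the shapes of the adapted projections: a model of the same depth whose v_proj is a full 4096 × 4096 matrix would give 4,194,304 instead.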
accelerate: Scaling Training
The accelerate library handles the complexity of running training across multiple GPUs or with mixed precision (float16/bfloat16). You rarely interact with it directly during standard fine-tuning — the Trainer class from transformers uses it under the hood. But you need it installed.
pip install accelerate
accelerate config  # interactive setup for your hardware
For multi-GPU training, accelerate handles data parallelism, gradient synchronization, and device placement. A single configuration file (~/.cache/huggingface/accelerate/default_config.yaml) tells your training scripts how to distribute work.
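To see what gradient synchronization amounts to, here is a small pure-Python simulation. It does not use the accelerate API: it is a sketch of the idea, with a hypothetical one-parameter model and two simulated workers that each see half of the batch.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the one-parameter model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# Single device: gradient over the whole batch.
full_grad = grad_mse(w, xs, ys)

# Data parallelism: each worker computes a gradient on its own shard.
num_workers = 2
shard = len(xs) // num_workers
local_grads = [
    grad_mse(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
    for i in range(num_workers)
]

# Synchronization: average the local gradients so every worker applies the same update.
synced_grad = sum(local_grads) / num_workers

print(full_grad, synced_grad)
```

With equal-sized shards, the averaged gradient matches the full-batch gradient, which is why every copy of the model stays identical after each step.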
evaluate: Metrics
The evaluate library gives you standard metrics such as BLEU, ROUGE, and accuracy behind a single load-and-compute interface.
import evaluate
# Load a metric
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
predictions = ["The cat sat on the mat"]
references = [["The cat sat on the mat", "There is a cat on the mat"]]
bleu_score = bleu.compute(predictions=predictions, references=references)
print(bleu_score)
# {'bleu': 1.0, 'precisions': [1.0, 1.0, 1.0, 1.0], ...}
The Hugging Face Hub
The Hub (huggingface.co) is a model registry, dataset registry, and social platform combined. Every model you load with from_pretrained() is downloaded from the Hub. When you finish fine-tuning, you push your model back to the Hub to share it or access it from other machines.
# Push model and tokenizer to the Hub
model.push_to_hub("your-username/mistral-7b-customer-support")
tokenizer.push_to_hub("your-username/mistral-7b-customer-support")
Model cards on the Hub document what a model does, what data it was trained on, its limitations, and evaluation results. When you push a fine-tuned model, write a model card. Future-you will thank present-you.
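A model card is a README.md file in the model repository: a YAML metadata block followed by Markdown. The sketch below builds a minimal one as a string. The function name build_model_card and the section headings are hypothetical choices for illustration, and the metadata shown is only a small subset of what the Hub accepts.

```python
def build_model_card(base_model, dataset, description, limitations):
    """Return README.md text: a YAML metadata block, then Markdown sections."""
    lines = [
        "---",
        f"base_model: {base_model}",
        "library_name: peft",
        "datasets:",
        f"- {dataset}",
        "---",
        "",
        "# Model description",
        description,
        "",
        "# Limitations",
        limitations,
        "",
    ]
    return "\n".join(lines)


card = build_model_card(
    base_model="mistralai/Mistral-7B-v0.1",
    dataset="tatsu-lab/alpaca",
    description="LoRA adapter fine-tuned for instruction following.",
    limitations="Trained on a small sample; not evaluated for production use.",
)
print(card)
```

Saving this text as README.md in the repository you push to is enough for the Hub to render it as the model's landing page.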
Putting It Together: Complete Setup for Mistral-7B Fine-Tuning
Here is the complete boilerplate you will use in every subsequent lesson:
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, TaskType
import torch
# 1. Configuration
model_name = "mistralai/Mistral-7B-v0.1"
dataset_name = "tatsu-lab/alpaca"
# 2. Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # pad on the right, the usual choice for causal LM training
# 3. Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)
model.config.use_cache = False # disable KV-cache during training
# 4. Apply LoRA
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# 5. Load dataset
dataset = load_dataset(dataset_name, split="train[:5%]") # 5% for quick testing
print("Setup complete. Ready to train.")
Install everything you need with:
pip install transformers datasets peft accelerate evaluate bitsandbytes
The bitsandbytes package is needed for quantization, which we cover in the QLoRA lesson. Install it now so it is ready.
What to Know Before the Next Lesson
The patterns above (from_pretrained, load_dataset, get_peft_model) will appear in every fine-tuning script you write. The hyperparameter values will vary (rank, alpha, learning rate, batch size), but the structure stays the same: load model, load data, configure adapter, train, evaluate, push.
One concept to keep in mind: the device_map="auto" argument lets the library decide how to distribute the model across available devices. On a machine with one GPU, everything goes on that GPU (or overflows to CPU if needed). On a multi-GPU machine, layers are split across devices. This is handled automatically — you do not need to call .to(device) on every tensor yourself.
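As a mental model for that placement, consider the simplified sketch below. It is not the library's actual algorithm, which also accounts for details such as modules that must not be split and may balance load across GPUs. The function place_layers, the layer sizes, and the memory budgets are all hypothetical.

```python
def place_layers(layer_sizes_gb, device_budgets_gb):
    """Assign layers in order to devices, moving on when a device is full.

    Layers that fit on no listed device overflow to the CPU.
    """
    devices = list(device_budgets_gb.items())
    placement = {}
    current = 0  # index of the device being filled
    used = 0.0   # memory already assigned on that device
    for idx, size in enumerate(layer_sizes_gb):
        # Advance while the current device cannot hold this layer.
        while current < len(devices) and used + size > devices[current][1]:
            current += 1
            used = 0.0
        name = devices[current][0] if current < len(devices) else "cpu"
        placement[f"layer_{idx}"] = name
        used += size
    return placement


# Four hypothetical 1 GB layers, one larger GPU and one smaller GPU.
placement = place_layers([1.0, 1.0, 1.0, 1.0], {"cuda:0": 2.5, "cuda:1": 1.0})
print(placement)
```

After loading a real model with device_map="auto", you can inspect where each module landed through the model's hf_device_map attribute.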
The next lesson covers dataset preparation — arguably the most important and most underrated skill in the entire fine-tuning workflow.
