Dataset Preparation for Fine-Tuning

Data is the Lever

If you polled experienced ML engineers who have run dozens of fine-tuning experiments and asked them where most projects succeed or fail, the overwhelming answer would be: data. Not model architecture. Not learning rate. Not batch size. Data.

This lesson is about dataset preparation for fine-tuning instruction-following language models. It is the most underrated skill in the fine-tuning workflow, because it is unsexy — nobody writes blog posts about spending three days cleaning CSV files. But it is the work that separates fine-tuned models that actually work from fine-tuned models that are expensive failures.

The Real-World Scenario

You work at a B2B SaaS company. You have 500 resolved customer support tickets, each with the customer’s message and a human agent’s response. You want to fine-tune a model that responds to new tickets the way your best agents would — professional, concise, and specific to your product’s terminology.

This is an excellent fine-tuning candidate. The responses require a specific style, knowledge of your product’s features, and consistency across many interactions. Prompt engineering alone cannot reliably reproduce this voice at scale. Let us build the dataset.

Data Format: Three Common Structures

1. The Alpaca Format

The Alpaca format was introduced by Stanford with their Alpaca dataset and is widely used for simple instruction-following tasks. It has three fields: instruction, input, and output.

example = {
    "instruction": "Summarize the following customer complaint in one sentence.",
    "input": "I ordered the premium plan three weeks ago and my account still shows free tier...",
    "output": "Customer's account was not upgraded to premium tier after payment three weeks ago."
}

Use the Alpaca format when your task has a clear instruction that stays constant across examples, with variable input content.
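
Before training, an Alpaca record has to be flattened into a single prompt string. The sketch below uses the prompt wording popularized by the Stanford Alpaca repository; treat the exact wording and the helper name render_alpaca as assumptions, and keep whatever template you choose identical between training and inference.

```python
# Prompt layout popularized by the Stanford Alpaca repository (assumed wording).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def render_alpaca(record):
    """Return (prompt, completion) strings for one Alpaca-style record."""
    if record.get("input", "").strip():
        prompt = PROMPT_WITH_INPUT.format(
            instruction=record["instruction"], input=record["input"]
        )
    else:
        # The input field is optional; fall back to the shorter template
        prompt = PROMPT_NO_INPUT.format(instruction=record["instruction"])
    return prompt, record["output"]


record = {
    "instruction": "Summarize the following customer complaint in one sentence.",
    "input": "I ordered the premium plan three weeks ago and my account still shows free tier...",
    "output": "Customer's account was not upgraded to premium tier after payment three weeks ago.",
}
prompt, completion = render_alpaca(record)
print(prompt + completion)
```

The model is trained to produce the completion given the prompt, so the text after "### Response:" is the only part that needs supervision.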

2. The Chat Template Format

Modern models are typically instruction-tuned using a multi-turn chat format. Different models use different chat templates, but the pattern is consistent: alternating user and assistant turns, sometimes with a system message at the start.

# Chat "messages" format (role/content turns; the model's chat template decides the final layout)
example = {
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful customer support agent for Acme SaaS. Be professional, concise, and empathetic."
        },
        {
            "role": "user", 
            "content": "My export button stopped working after your last update. I need this for a client presentation tomorrow."
        },
        {
            "role": "assistant",
            "content": "I'm sorry to hear the export feature is causing issues right before your presentation. We did push an update yesterday that affected CSV exports for accounts with custom field configurations. To restore functionality immediately: go to Settings > Data > Export Format and toggle 'Legacy Mode' to On. This will have you exporting within 60 seconds. Our team is working on a permanent fix, scheduled for tonight's deployment. Let me know if that resolves it."
        }
    ]
}

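The tokenizer’s apply_chat_template() method converts this structured format into the exact string the model expects during training (or directly into token IDs with tokenize=True). The snippets in this lesson assume tokenizer was loaded for your base model, for example with AutoTokenizer.from_pretrained():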

formatted = tokenizer.apply_chat_template(
    example["messages"],
    tokenize=False,
    add_generation_prompt=False
)
print(formatted)
# Output for a Llama-2-style template (each model family renders differently):
# <s>[INST] <<SYS>>\nYou are a helpful customer support agent...\n<</SYS>>\n\nMy export button... [/INST] I'm sorry to hear... </s>
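
To make the rendering step less of a black box, here is a hand-rolled, simplified version of a Llama-2-style template. It is an illustration only, not the library's implementation: real templates ship with each tokenizer, and the function name render_llama2_style is hypothetical.

```python
def render_llama2_style(messages):
    """Simplified rendering of a Llama-2-style chat template.

    Assumes an optional leading system message followed by
    alternating user/assistant turns.
    """
    system = ""
    if messages and messages[0]["role"] == "system":
        system = f"<<SYS>>\n{messages[0]['content']}\n<</SYS>>\n\n"
        messages = messages[1:]

    parts = []
    for i in range(0, len(messages), 2):
        user = messages[i]
        assert user["role"] == "user", "turns must alternate user/assistant"
        # The system block is folded into the first user turn only
        prefix = system if i == 0 else ""
        parts.append(f"<s>[INST] {prefix}{user['content']} [/INST]")
        if i + 1 < len(messages):
            assistant = messages[i + 1]
            assert assistant["role"] == "assistant"
            parts.append(f" {assistant['content']} </s>")
    return "".join(parts)


messages = [
    {"role": "system", "content": "Be concise."},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there."},
]
rendered = render_llama2_style(messages)
print(rendered)
```

In a real pipeline, prefer the tokenizer's own template: it is the one that matches how the base model was trained.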

3. The Raw Completion Format

For tasks that do not have a natural instruction-response structure — creative writing, code completion, domain-specific text generation — you can train on raw text completions. The model learns to continue text in the style of the training corpus.

example = {
    "text": "Patient presents with acute onset chest pain radiating to the left arm, diaphoresis noted. EKG shows ST elevation in leads II, III, aVF consistent with inferior STEMI. Initiated heparin protocol, cath lab activated."
}
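
Raw corpora are usually longer than one training sequence, so they get split into windows first. The sketch below chunks by whitespace-separated words purely for illustration; a real pipeline would more likely chunk by token count, and chunk_words is a hypothetical helper.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into overlapping word windows, one {"text": ...} record each."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    records = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        records.append({"text": " ".join(window)})
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return records


corpus = " ".join(f"w{i}" for i in range(10))
records = chunk_words(corpus, chunk_size=4, overlap=1)
for record in records:
    print(record["text"])
```

The overlap gives each chunk a little context from the previous one, at the cost of training on some words twice.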

Building the Customer Support Dataset

Let us walk through the complete pipeline for the 500-ticket scenario.

Step 1: Load and Inspect Raw Data

import pandas as pd
from datasets import Dataset

# Load raw support tickets (exported from your ticketing system)
df = pd.read_csv("support_tickets.csv")
print(df.shape)       # (500, 4)
print(df.columns.tolist())  
# ['ticket_id', 'customer_message', 'agent_response', 'resolved_at']

print(df.head(2))

Step 2: Quality Filtering

This is the most important step. Low-quality examples do not just fail to teach — they actively hurt your model. Be aggressive about filtering.

import re

def is_quality_example(row):
    """Return True if this row should be included in training."""
    
    # Filter 1: Remove very short responses (copy-paste noise, templated one-liners)
    if len(row['agent_response'].split()) < 20:
        return False
    
    # Filter 2: Remove very long responses (often copy-pasted documentation, not actual support)
    if len(row['agent_response'].split()) > 400:
        return False
    
    # Filter 3: Remove responses that are just links
    if row['agent_response'].strip().startswith("http"):
        return False
    
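    # Filter 4: Remove responses with PII patterns (emails, phone numbers, SSN mentions).
    # Personal names are not detected by these regexes; catching them needs a
    # separate pass such as NER or manual review.
    pii_patterns = [
        r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',  # email
        r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',  # phone
        r'\bSSN\b|\bsocial security\b',     # SSN references
    ]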
    for pattern in pii_patterns:
        if re.search(pattern, row['agent_response'], re.IGNORECASE):
            return False
    
    # Filter 5: Remove duplicate responses (boilerplate)
    # (handled separately below)
    
    return True

# Apply filters
df_clean = df[df.apply(is_quality_example, axis=1)].copy()
print(f"Kept {len(df_clean)} / {len(df)} examples after quality filtering")
# Kept 387 / 500 examples after quality filtering
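
When filtering aggressively, it helps to know which filter is removing the most examples, so a threshold that is too strict shows up early. The following is a minimal standard-library sketch of that idea; rejection_reason and its reason labels are hypothetical, and it mirrors only a subset of the filters above.

```python
import re
from collections import Counter

EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")


def rejection_reason(response):
    """Return the name of the first filter a response fails, or None if it passes."""
    n_words = len(response.split())
    if n_words < 20:
        return "too_short"
    if n_words > 400:
        return "too_long"
    if response.strip().startswith("http"):
        return "link_only"
    if EMAIL.search(response):
        return "contains_email"
    return None


responses = [
    "Thanks, closing this ticket.",                      # too short
    "word " * 25,                                        # passes
    "word " * 401,                                       # too long
    "Please reach out to jane@example.com " + "word " * 20,  # contains an email
]
reasons = Counter(rejection_reason(r) or "kept" for r in responses)
print(reasons)
```

If one reason dominates the tally, read a sample of the rejected rows before trusting the filter.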

Step 3: Remove Near-Duplicates

Duplicate or near-duplicate responses create an imbalanced dataset where the model over-learns templated phrases. Drop exact duplicates first, then normalize away ticket numbers and capitalized words and discard any response template that still recurs more than a few times.

from collections import Counter

# Remove exact duplicates
df_clean = df_clean.drop_duplicates(subset=['agent_response'])

# Detect templated responses (same structure, slightly different wording)
def normalize_response(text):
    """Strip ticket numbers, capitalized words, and extra whitespace for comparison."""
    text = re.sub(r'Ticket #\d+', 'TICKET', text)
    # Crude stand-in for names: this also replaces sentence-initial words,
    # which is acceptable because the result is only used for matching
    text = re.sub(r'\b[A-Z][a-z]+\b', 'NAME', text)
    return ' '.join(text.lower().split())

normalized = df_clean['agent_response'].apply(normalize_response)
response_counts = Counter(normalized)
# A normalized response seen more than 3 times is treated as boilerplate,
# and every copy of it is dropped (not just the extras)
duplicates = {k for k, v in response_counts.items() if v > 3}

df_clean = df_clean[~normalized.isin(duplicates)]
print(f"Final dataset size: {len(df_clean)} examples")
# Final dataset size: 341 examples
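
Normalization only catches responses that become identical after cleanup. A similarity threshold catches responses that differ by a few words. Below is a minimal sketch using difflib from the standard library; the 0.9 threshold and the helper name are assumptions to tune on your own data, and unlike the step above it keeps the first copy of each cluster.

```python
from difflib import SequenceMatcher


def near_duplicate_indices(texts, threshold=0.9):
    """Return indices of texts that closely match an earlier kept text.

    Pairwise comparison is O(n^2): fine for a few hundred tickets,
    too slow for very large datasets.
    """
    kept, dropped = [], []
    for i, text in enumerate(texts):
        normalized = " ".join(text.lower().split())
        if any(SequenceMatcher(None, normalized, k).ratio() >= threshold for k in kept):
            dropped.append(i)
        else:
            kept.append(normalized)
    return dropped


texts = [
    "Thanks for reaching out. Your refund has been processed and should arrive in 5 days.",
    "Thanks for reaching out. Your refund has been processed and should arrive in 7 days.",
    "To reset your password, open Settings and choose Security, then Reset Password.",
]
dropped = near_duplicate_indices(texts)
print(dropped)  # [1]
```

For larger datasets, approximate methods such as MinHash are the usual replacement for the pairwise loop.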

Step 4: Format into Chat Template

SYSTEM_PROMPT = """You are a customer support specialist for Acme SaaS. 
Your responses are professional, empathetic, and specific. 
You know our product deeply and provide actionable solutions.
When you don't know the answer, you escalate clearly."""

def format_as_chat(row):
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": row['customer_message']},
        {"role": "assistant", "content": row['agent_response']}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False
    )
    return {"text": text, "messages": messages}

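# Apply formatting and store the results as 'text' and 'messages' columns,
# which are the column names the later steps select
formatted = df_clean.apply(format_as_chat, axis=1).tolist()
df_clean['text'] = [f['text'] for f in formatted]
df_clean['messages'] = [f['messages'] for f in formatted]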

Step 5: Train/Validation/Test Split

The standard split for fine-tuning datasets is 80% training, 10% validation (used during training to monitor overfitting), and 10% test (held out completely for final evaluation).

from sklearn.model_selection import train_test_split

# First split: 80% train, 20% remainder
train_df, temp_df = train_test_split(df_clean, test_size=0.2, random_state=42)

# Second split: 50/50 from remainder = 10% val, 10% test
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)

print(f"Train: {len(train_df)}, Val: {len(val_df)}, Test: {len(test_df)}")
# Train: 272, Val: 34, Test: 35
# (train_test_split rounds the test share up: 69 of 341, then 35 of 69)
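
A test set is only "held out completely" if none of its inputs also appear in training. The pipeline above deduplicates agent responses, not customer messages, so a quick overlap check is worthwhile. This is a minimal sketch with a hypothetical find_leaks helper that compares messages after lowercasing and whitespace cleanup.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def find_leaks(train_messages, heldout_messages):
    """Return held-out messages whose normalized text also appears in train."""
    seen = {normalize(m) for m in train_messages}
    return [m for m in heldout_messages if normalize(m) in seen]


train = ["My export button stopped working.", "How do I add a user?"]
test = ["my export  button stopped working.", "Can I change my billing date?"]
leaks = find_leaks(train, test)
print(f"{len(leaks)} leaked example(s): {leaks}")
```

With the DataFrames above, the inputs would be the customer_message columns of train_df and test_df. Any leaked rows should be moved to one side of the split before training.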

Step 6: Convert to HuggingFace Dataset

from datasets import Dataset, DatasetDict

def rows_to_dataset(df, format_fn):
    """Format each row and build a Dataset holding only 'text' and 'messages'."""
    formatted = df.apply(format_fn, axis=1).tolist()
    return Dataset.from_list(formatted)

# Building from the formatted rows (rather than Dataset.from_pandas) keeps the
# shuffled pandas index out of the dataset's features
dataset = DatasetDict({
    "train": rows_to_dataset(train_df, format_as_chat),
    "validation": rows_to_dataset(val_df, format_as_chat),
    "test": rows_to_dataset(test_df, format_as_chat)
})

# Save locally
dataset.save_to_disk("./support_dataset")

# Or push to HuggingFace Hub (pass private=True; repos are not private unless you ask)
dataset.push_to_hub("your-username/acme-support-dataset", private=True)

print(dataset)
# DatasetDict({
#     train: Dataset({features: ['text', 'messages'], num_rows: 272})
#     validation: Dataset({features: ['text', 'messages'], num_rows: 34})
#     test: Dataset({features: ['text', 'messages'], num_rows: 35})
# })

Step 7: Tokenize for Training

The final step converts text into token IDs and creates the attention masks. During training, you want the model to only predict the assistant tokens — not the system prompt or user message tokens. This is done by masking (setting to -100) all tokens that are not part of the assistant response.

def tokenize_and_mask(example, tokenizer, max_length=1024):
    """Tokenize example and mask non-assistant tokens."""

    # Tokenize full conversation. The chat template already added special
    # tokens such as <s>, so the tokenizer must not add them a second time.
    tokenized = tokenizer(
        example["text"],
        truncation=True,
        max_length=max_length,
        add_special_tokens=False
    )
    input_ids = tokenized["input_ids"]

    # Find where the assistant response starts by rendering the prompt alone
    # (system + user turns). Searching for a single "[/INST]" token ID is
    # unreliable because many tokenizers split "[/INST]" into several tokens.
    prompt_text = tokenizer.apply_chat_template(
        example["messages"][:-1],
        tokenize=False,
        add_generation_prompt=True
    )
    prompt_len = len(tokenizer(prompt_text, add_special_tokens=False)["input_ids"])

    # Mask the prompt tokens; -100 means "ignore in loss computation".
    # This assumes the prompt tokenizes the same alone as it does as a prefix.
    labels = [
        -100 if i < prompt_len else token_id
        for i, token_id in enumerate(input_ids)
    ]

    # Padding is left to the data collator, which should pad labels with -100
    return {
        "input_ids": input_ids,
        "attention_mask": tokenized["attention_mask"],
        "labels": labels
    }

# Apply tokenization
tokenized_dataset = dataset.map(
    lambda x: tokenize_and_mask(x, tokenizer),
    remove_columns=dataset["train"].column_names
)
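
The masking idea is easier to see on a toy example without a tokenizer. The sketch below uses made-up token IDs and hypothetical helper names; the only real convention it relies on is that -100 is the label value PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_prompt(input_ids, prompt_len):
    """Copy input_ids into labels, replacing the first prompt_len entries with -100."""
    prompt_len = min(prompt_len, len(input_ids))
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


def count_supervised(labels):
    """Number of positions that actually contribute to the loss."""
    return sum(1 for label in labels if label != IGNORE_INDEX)


input_ids = [1, 523, 88, 901, 77, 14, 2]  # made-up IDs: 4 prompt tokens, 3 response tokens
labels = mask_prompt(input_ids, prompt_len=4)
print(labels)                    # [-100, -100, -100, -100, 77, 14, 2]
print(count_supervised(labels))  # 3
```

A useful sanity check on a real dataset is to confirm that every example has at least one supervised position; an example whose labels are all -100 contributes nothing to training.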

How Much Data Do You Actually Need?

A workable rule of thumb: aim for 100+ high-quality examples per distinct behavior you want to change, and treat the ranges below as starting points rather than guarantees.

If you want the model to:

Use your product’s terminology (one behavior): 100–200 examples
Adopt a specific tone (one behavior): 100–300 examples
Handle 5 different issue categories differently: 100–200 per category = 500–1000 total
Learn a new task format it has never seen: 500–1000 minimum

More data helps, but with diminishing returns, and quality matters more than quantity. A dataset of 200 perfectly formatted, carefully verified examples often outperforms a dataset of 2,000 noisy, inconsistent ones, because the model imitates the noise as faithfully as the signal. This runs against the intuition from classical supervised learning on tabular data, where more rows usually win.
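
Because cleaning discards a share of the raw data, the number of tickets to collect is larger than the number of clean examples you want. The arithmetic can be sketched as follows, using the retention seen in this lesson's walkthrough (341 of 500 kept); the function name and the assumption that retention stays constant are illustrative.

```python
import math


def raw_examples_needed(behaviors, per_behavior, retention_rate):
    """Estimate how many raw examples to collect so enough survive cleaning."""
    if not 0 < retention_rate <= 1:
        raise ValueError("retention_rate must be in (0, 1]")
    clean_target = behaviors * per_behavior
    return math.ceil(clean_target / retention_rate)


retention = 341 / 500  # share of tickets kept after filtering and deduplication
needed = raw_examples_needed(behaviors=5, per_behavior=100, retention_rate=retention)
print(needed)  # 734
```

So five issue categories at 100 clean examples each would call for roughly 730 to 740 raw tickets at this retention rate, before checking that each category is actually represented.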

Common Data Mistakes to Avoid

Inconsistent formatting: If some examples use ### Instruction: and others use [INST], the model learns to respond to a mixture of signals. Pick one template and apply it everywhere.

Including reasoning in the user turn: Do not add model-side reasoning or chain-of-thought to the user’s message. The user turn should contain only what a real user would actually write; anything the model is meant to produce belongs in the assistant turn.

Labeling errors on edge cases: For support tickets, have your best agent review a random sample of 50 examples and flag any responses that are incorrect or off-brand. Errors in training data become errors in model behavior.

Ignoring label distribution: If 80% of your examples are “billing issues” and only 5% are “technical bugs”, the model will be great at billing and mediocre at bugs. Check your distribution and oversample underrepresented categories.
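
Checking the distribution and oversampling can be sketched with the standard library. This assumes each example carries a category label, which the ticket export in this lesson does not include, so the "category" key and the oversample helper are hypothetical.

```python
import random
from collections import Counter


def oversample(examples, key="category", seed=42):
    """Duplicate examples from smaller categories until each matches the largest."""
    if not examples:
        return []
    rng = random.Random(seed)
    by_category = {}
    for example in examples:
        by_category.setdefault(example[key], []).append(example)
    target = max(len(group) for group in by_category.values())
    balanced = []
    for group in by_category.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the target size
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


examples = (
    [{"category": "billing", "id": i} for i in range(8)]
    + [{"category": "bug", "id": i} for i in range(2)]
)
print(Counter(e["category"] for e in examples))
balanced = oversample(examples)
counts = Counter(e["category"] for e in balanced)
print(counts)
```

Oversample only the training split, and only after splitting; otherwise copies of the same example can land in both training and evaluation data.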

Truncating without checking what gets cut: If your max sequence length cuts off the assistant response (the part you’re actually training on), the model either learns from a response that stops mid-thought or, when the whole response is cut, gets no training signal from that example at all. Log truncation rates, then increase max_length or filter out examples whose response is truncated.
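
Logging truncation rates can be as simple as bucketing each example by what survives the length limit. The sketch below counts whitespace-separated words as a stand-in for tokens; in practice you would pass a count_tokens function built on your tokenizer. The helper name and bucket labels are hypothetical.

```python
def truncation_report(examples, max_length, count_tokens=None):
    """Bucket examples by how truncation at max_length would affect them."""
    count_tokens = count_tokens or (lambda text: len(text.split()))
    report = {"fits": 0, "response_cut": 0, "response_lost": 0}
    for example in examples:
        prompt_len = count_tokens(example["prompt"])
        total_len = prompt_len + count_tokens(example["response"])
        if total_len <= max_length:
            report["fits"] += 1
        elif prompt_len >= max_length:
            report["response_lost"] += 1  # nothing of the response survives
        else:
            report["response_cut"] += 1   # response is cut partway through
    return report


examples = [
    {"prompt": "a " * 10, "response": "b " * 10},  # 20 units: fits
    {"prompt": "a " * 10, "response": "b " * 50},  # 60 units: response cut
    {"prompt": "a " * 40, "response": "b " * 10},  # prompt alone exceeds the limit
]
report = truncation_report(examples, max_length=32)
print(report)
```

Examples in the response_lost bucket are safe to drop outright; those in response_cut are a judgment call between raising max_length and filtering.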

Data preparation is not a one-time step. Expect to iterate: train a model, see what it gets wrong, trace those failures back to data gaps or inconsistencies, fix the data, and retrain. The data pipeline is a living artifact of your fine-tuning project.
