Why Fine-Tune? When and When Not To

The Cardiologist Analogy

Imagine you walk into a hospital with chest pain. The hospital has two kinds of doctors available: a general practitioner who has read every medical textbook ever written, and a cardiologist who spent three extra years doing nothing but studying the heart.

The general practitioner is remarkable — they know about infectious disease, orthopedics, pediatrics, and neurology. But on a difficult ECG interpretation with an unusual ST-segment pattern, you want the cardiologist. Not because the cardiologist is smarter overall. They almost certainly know less about pediatric dosing than the GP. The cardiologist is better at this specific task because they have been trained exclusively on cardiac cases. Their mental models, their instincts, their vocabulary — all shaped by thousands of hours of focused exposure to exactly this domain.

Base LLMs are the general practitioners. Fine-tuned models are the cardiologists.

A fine-tuned model is not smarter in any general sense. It has traded breadth for depth. It speaks the language of your domain naturally. It formats its outputs the way you need them. It knows the difference between “STEMI” and “NSTEMI” not because you explained it in the prompt, but because it has internalized the distinction through training. This distinction — between knowledge that lives in the prompt versus knowledge that lives in the weights — is the fundamental intuition you need to navigate the choice between prompt engineering, RAG, and fine-tuning.

Three Tools, Three Jobs

Before picking a technique, you need to understand what each one actually does to the model’s behavior.

Prompt engineering changes nothing about the model. You give the model instructions, examples, context, and constraints at inference time, and the model uses that input to produce better output. Think of steering a ship that is already moving: the prompt is the wheel. It is the fastest, cheapest, most flexible approach. There is no training cost; you pay only for the tokens in each call. The iteration cycle is seconds. You should always try this first.
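
As a rough illustration of what "instructions, examples, context, and constraints" look like in practice, here is a minimal sketch of assembling a few-shot prompt. The function name, the "Input:/Output:" layout, and the sentiment task are all assumptions made for this example; no particular model API is implied.

```python
def build_few_shot_prompt(
    instructions: str,
    examples: list[tuple[str, str]],
    query: str,
) -> str:
    """Assemble instructions, worked examples, and the new query into one prompt."""
    parts = [instructions.strip(), ""]
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}")
        parts.append(f"Output: {example_output}")
        parts.append("")
    # The final block is left open so the model completes the last "Output:".
    parts.append(f"Input: {query}")
    parts.append("Output:")
    return "\n".join(parts)


# Hypothetical sentiment-labeling task, used only for illustration.
prompt = build_few_shot_prompt(
    instructions="Classify the sentiment of each review as positive or negative.",
    examples=[
        ("The battery lasts all day.", "positive"),
        ("The screen cracked within a week.", "negative"),
    ],
    query="Setup took five minutes and everything worked.",
)
print(prompt)
```

Everything the model learns about the task here lives in the string itself, which is why iteration takes seconds: changing behavior means editing text, not retraining.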

Retrieval-Augmented Generation (RAG) also changes nothing about the model’s weights. Instead, it augments the model’s context with retrieved documents. When a user asks “What is our refund policy?”, a RAG system retrieves the relevant policy document and stuffs it into the prompt before calling the model. The model’s job is to read and synthesize, not to know from memory. RAG is the right tool when your bottleneck is knowledge — specifically, knowledge that changes over time, or knowledge that is too voluminous to fit in a prompt.
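
The retrieve-then-prompt flow can be sketched in a few lines. This is a toy version under stated assumptions: the document store is a hypothetical in-memory dictionary, and retrieval is plain word overlap standing in for the embedding search a real system would typically use.

```python
import re

# Hypothetical in-memory document store; the contents are made up for illustration.
DOCUMENTS = {
    "refund_policy": "Refunds are available within 30 days of purchase with a receipt.",
    "shipping_policy": "Standard shipping takes 5 to 7 business days.",
    "privacy_policy": "We never sell customer data to third parties.",
}


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: dict[str, str], top_k: int = 1) -> list[str]:
    """Rank documents by word overlap with the query (a stand-in for embedding search)."""
    query_tokens = tokenize(query)
    ranked = sorted(
        documents.items(),
        key=lambda item: len(query_tokens & tokenize(item[0] + " " + item[1])),
        reverse=True,
    )
    return [text for _, text in ranked[:top_k]]


def build_rag_prompt(query: str, documents: dict[str, str]) -> str:
    """Place retrieved text in the prompt; the model's weights are never touched."""
    context = "\n".join(retrieve(query, documents))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


prompt = build_rag_prompt("What is our refund policy?", DOCUMENTS)
print(prompt)
```

Updating the system's knowledge means editing `DOCUMENTS`, not retraining anything, which is the property that makes RAG a good fit for knowledge that changes over time.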

Fine-tuning changes the model’s weights. You show the model hundreds to thousands of input-output examples, run gradient descent, and the model’s parameters shift to reflect the patterns in your data. Behavior, style, tone, and domain jargon become internalized, so they do not need to be specified in every prompt. The cost is higher (compute plus time), but the payoff is a model that behaves the way you need more reliably, without extensive prompting.
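
To make "input-output examples" concrete, here is a small sketch of how such pairs are often stored: one JSON object per line (JSONL). The field names `input` and `output` and the brand-voice pairs are placeholders chosen for this example; each training framework defines its own schema, so check the one you use.

```python
import json

# Hypothetical brand-voice pairs. The field names "input" and "output" are
# placeholders: each training framework defines its own schema.
EXAMPLES = [
    {"input": "Where is my order?",
     "output": "Great question! Let's track that down together."},
    {"input": "I want a refund.",
     "output": "Of course! Let's get that sorted for you right away."},
]


def to_jsonl(records: list[dict[str, str]]) -> str:
    """Serialize one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


def from_jsonl(text: str) -> list[dict[str, str]]:
    """Parse JSONL back into records, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


jsonl_text = to_jsonl(EXAMPLES)
restored = from_jsonl(jsonl_text)
print(jsonl_text)
```

Note that neither pair contains instructions about tone. The voice is carried entirely by the outputs, and that is what training is meant to internalize.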

When Prompt Engineering Fails

Prompt engineering fails in three common situations.

First, when the task requires knowledge the model simply does not have. If you are building an assistant for a proprietary codebase, no amount of prompting will make the model know your internal APIs. You need either RAG (retrieve the relevant code) or fine-tuning (teach the patterns).

Second, when the output format or style is highly specific and hard to describe. “Write in the voice of our brand” is easy to say and hard to specify in a system prompt. You can include examples, but if you need every response to feel right without extensive in-context examples, fine-tuning is the answer. The model learns the voice directly from labeled examples.

Third, when you need to make thousands of calls per day and every call includes a 2,000-token few-shot prompt. At scale, fat prompts cost real money. Fine-tuning can let you strip the prompt down to just the user’s query, because the behaviors are baked in.
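
The arithmetic behind the third point is simple enough to sketch. The call volume, the lean prompt size, and the per-token price below are assumptions picked to keep the numbers round, not real pricing; only the 2,000-token few-shot prompt comes from the scenario above.

```python
def daily_input_cost(
    calls_per_day: int,
    prompt_tokens: int,
    usd_per_million_tokens: float,
) -> float:
    """Input-token cost per day in USD for a fixed prompt size."""
    return calls_per_day * prompt_tokens * usd_per_million_tokens / 1_000_000


# Assumed values for illustration only; substitute your own traffic and pricing.
CALLS_PER_DAY = 10_000
USD_PER_MILLION_TOKENS = 1.00

few_shot_cost = daily_input_cost(CALLS_PER_DAY, 2_000, USD_PER_MILLION_TOKENS)
lean_cost = daily_input_cost(CALLS_PER_DAY, 100, USD_PER_MILLION_TOKENS)
daily_savings = few_shot_cost - lean_cost

print(f"few-shot: ${few_shot_cost:.2f}/day, lean: ${lean_cost:.2f}/day, "
      f"savings: ${daily_savings:.2f}/day")
```

Under these assumptions the fat prompt costs $20 per day and the lean one $1. A fine-tuned model may be billed at a different per-token rate than the base model, so a real comparison should use both prices.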

When RAG Falls Short

RAG is excellent for factual grounding, but it has real weaknesses.

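RAG requires good retrieval. If the retrieval step fails to find the right document, the model has nothing to ground its answer in and is likely to hallucinate. For specialized tasks with very precise vocabulary, retrieval quality can be inconsistent, because a query and the relevant document may use different terms for the same thing.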

More importantly, RAG cannot change behavior. If you want the model to always respond in a structured JSON format, to use medical terminology correctly, or to adopt a specific persona — RAG cannot help you. Those are behavioral properties. You need fine-tuning.

RAG also struggles when the task requires synthesizing patterns across many documents rather than finding specific facts. “Summarize what our customers are most frustrated about” requires reading and reasoning across hundreds of tickets — fine-tuning a model on labeled summaries often works better.

The Decision Matrix

Use this table as a starting heuristic. Real decisions often involve multiple techniques in combination — RAG plus fine-tuning is common for production systems that need both knowledge grounding and behavioral consistency.

Use Case	Prompt Engineering	RAG	Fine-Tuning
Medical Q&A on public knowledge	Works well	Works well	Rarely needed
Medical Q&A on proprietary clinical notes	Prompt + few-shot	Best choice	Consider if style matters
Brand voice copywriting	Marginal	Not relevant	Best choice
SQL generation for standard tables	Works with few-shot	Not relevant	Best choice for complex schemas
Customer support (FAQ-style)	Works for simple cases	Best choice	Overkill unless tone matters
Customer support (brand voice + knowledge)	Insufficient alone	Needed for knowledge	Needed for voice
Code completion for internal library	Cannot know the API	Good option	Best for completion style
Document classification	Works for simple cases	Not relevant	Best for consistent performance
Named entity recognition	Brittle	Not relevant	Best choice
Summarization (style-specific)	Works with examples	Not relevant	Best for consistent style

Notice the pattern: fine-tuning wins when the task is about how the model behaves, while RAG wins when the task is about what the model knows. Prompt engineering handles everything else — and it handles it cheaply.

The Real Cost of Fine-Tuning

Fine-tuning is not free. Before you commit, be honest about what it requires:

Data: You need labeled examples — typically 100 to 10,000 pairs of (input, desired output). Collecting and cleaning this data is often the most expensive part of the entire project. Bad data produces bad models, and there are no shortcuts here.

Compute: Training a 7B parameter model for a few hours on an A100 costs roughly $10–50 on cloud providers. That is cheap enough for experimentation, but the cost of iteration (trying different datasets, hyperparameters, model sizes) adds up quickly.

Maintenance: A fine-tuned model is a snapshot of your data at a point in time. When your requirements change, you need to retrain. This is fundamentally different from RAG, where you can just update the document store.

Evaluation: You cannot tell if a fine-tuned model is better than the base model without a rigorous evaluation protocol. This takes time to set up and run.
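
One possible shape for such an evaluation protocol is sketched below: run the baseline and the fine-tuned model over the same held-out inputs and score both with the same check. The two "models" here are plain stand-in functions so the sketch runs without any API, and the metric (valid JSON with required keys) is just one assumed example of a success criterion.

```python
import json
from typing import Callable


def is_valid_record(output: str, required_keys: set[str]) -> bool:
    """True if the output parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def pass_rate(
    model: Callable[[str], str],
    inputs: list[str],
    required_keys: set[str],
) -> float:
    """Fraction of held-out inputs for which the model output passes the check."""
    if not inputs:
        raise ValueError("evaluation set is empty")
    passed = sum(is_valid_record(model(text), required_keys) for text in inputs)
    return passed / len(inputs)


# Stand-in "models": hypothetical functions, not real model calls.
def baseline_model(text: str) -> str:
    return f"Sure! The category is billing for: {text}"


def tuned_model(text: str) -> str:
    return json.dumps({"category": "billing", "text": text})


HELD_OUT = ["I was charged twice.", "My invoice is wrong."]
REQUIRED_KEYS = {"category", "text"}

baseline_score = pass_rate(baseline_model, HELD_OUT, REQUIRED_KEYS)
tuned_score = pass_rate(tuned_model, HELD_OUT, REQUIRED_KEYS)
print(f"baseline: {baseline_score:.0%}, fine-tuned: {tuned_score:.0%}")
```

The important design choice is that both models see identical inputs and an identical scoring function, and that the held-out inputs were never part of the training data.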

Starting Your Decision

A practical process for any new LLM task:

1. Start with prompt engineering. Spend a few hours trying different prompts and few-shot examples. If you get 80%+ of the behavior you need, you are done.
2. If quality is the bottleneck, ask why. Is the model missing factual knowledge? Try RAG. Is the model producing the wrong style, format, or tone? Consider fine-tuning. Is the model simply bad at the task even with good context? Fine-tuning may help, but first check if a better base model (e.g., GPT-4 vs GPT-3.5) solves it more cheaply.
3. If you decide to fine-tune, invest in data quality first. 200 high-quality, carefully curated examples will often outperform 2,000 noisy ones.
4. Measure before and after. Define your success metric before you start training. You need to know whether the fine-tuned model actually beats the baseline on real inputs from your use case.
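
The triage in this process can be written down as a small function. This is only a sketch of the heuristic as described here, with hypothetical names and three yes/no inputs; real decisions involve cost and data availability that a function like this does not capture.

```python
from enum import Enum


class Technique(Enum):
    PROMPT_ENGINEERING = "prompt engineering"
    RAG = "RAG"
    FINE_TUNING = "fine-tuning"
    BETTER_BASE_MODEL = "try a stronger base model first"


def recommend(
    prompting_is_sufficient: bool,
    missing_knowledge: bool,
    wrong_style_or_format: bool,
) -> list[Technique]:
    """Map the diagnosis questions from the process above to suggested techniques."""
    if prompting_is_sufficient:
        return [Technique.PROMPT_ENGINEERING]

    suggestions = []
    if missing_knowledge:
        suggestions.append(Technique.RAG)
    if wrong_style_or_format:
        suggestions.append(Technique.FINE_TUNING)
    if not suggestions:
        # Weak even with good context: check a stronger base model before fine-tuning.
        suggestions = [Technique.BETTER_BASE_MODEL, Technique.FINE_TUNING]
    return suggestions


# A support bot that lacks policy knowledge and also misses the brand voice.
plan = recommend(
    prompting_is_sufficient=False,
    missing_knowledge=True,
    wrong_style_or_format=True,
)
print([technique.value for technique in plan])
```

When both knowledge and style are the problem, the function returns both RAG and fine-tuning, matching the combined row in the decision matrix.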

The rest of this course covers the technical machinery of fine-tuning. But the decision about whether to fine-tune is always a business and data decision first. The best engineers in this space are the ones who reach for fine-tuning only when it is genuinely the right tool — not because it is impressive, but because it solves the problem cheaper and more reliably than the alternatives.
