Why RAG? The Problem with Fine-Tuning for Knowledge
When RAG beats fine-tuning, and when it doesn't — making the right choice
The Core Analogy
Imagine you hire someone to answer legal questions for your firm. You have two options:
Option A: Send them to law school for three years. After graduation, they know a lot — but only what was taught before graduation. Anything that happened after they left campus? They have no idea. They also can’t cite which textbook they learned from; the knowledge is just in their head.
Option B: Give them internet access (and a great search engine) before they answer any question. They can look things up in real time, cite exactly what they found, and always have the latest information.
RAG is Option B. Fine-tuning is Option A.
Neither is universally better. But for most enterprise knowledge tasks — company wikis, product documentation, legal case files, support tickets — Option B wins by a wide margin. This lesson explains exactly why, when to switch, and when to use both together.
The Three Failure Modes That RAG Solves
1. Knowledge Cutoff
Every LLM has a training cutoff date. Whichever model you use, whether a GPT-4 variant or Claude, its built-in knowledge stops at a fixed date that typically precedes its release by months. If your users ask about:
- A court ruling from last month
- Your product’s pricing update from Q1 2026
- A new regulation that passed six weeks ago
…the base model will either say “I don’t know” or, worse, confidently make something up. This is not a bug in the model. It is a fundamental property of how these models work: they are statistical snapshots of a corpus that ended at a specific date.
How RAG fixes it: Your RAG pipeline ingests documents continuously. When a new regulation is published, you load it into your vector store. The model retrieves and reads it at query time. For anything in your corpus, the knowledge cutoff stops mattering: not because the model was retrained, but because fresh information is injected at inference time.
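As a rough illustration of "ingest now, retrieve at query time", here is a minimal sketch. It assumes a toy in-memory store (`ToyStore`, a hypothetical name) and uses bag-of-words cosine similarity as a stand-in for a real embedding model and vector database; the document IDs and texts are made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyStore:
    """Hypothetical in-memory stand-in for a vector store."""

    def __init__(self) -> None:
        self.docs: dict[str, tuple[str, Counter]] = {}

    def ingest(self, doc_id: str, text: str) -> None:
        # Ingestion can happen at any time; no model retraining is involved.
        self.docs[doc_id] = (text, tokenize(text))

    def retrieve(self, query: str, k: int = 1) -> list[tuple[str, str]]:
        # Score every document against the query and keep the top k matches.
        q = tokenize(query)
        scored = [(cosine(q, vec), doc_id, text) for doc_id, (text, vec) in self.docs.items()]
        scored = [s for s in scored if s[0] > 0]
        scored.sort(key=lambda s: s[0], reverse=True)
        return [(doc_id, text) for _, doc_id, text in scored[:k]]


question = "What does the new data retention regulation require?"
store = ToyStore()
store.ingest("reg-2019", "The 2019 regulation requires annual privacy audits.")
before = store.retrieve(question)  # only the older document is available

# A new regulation is published: load it, and the very next query can use it.
store.ingest("reg-new", "The new data retention regulation requires deleting logs after 90 days.")
after = store.retrieve(question)
```

The same question returns a different top document before and after the second `ingest` call, with no change to the model itself. A production pipeline would swap the word-count vectors for embeddings and the dictionary for a real vector store.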
2. Hallucination
LLMs generate text by predicting what comes next, token by token, based on patterns learned during training. When asked a specific factual question — “What was the damages award in Smith v. Jones, 2025?” — the model has no mechanism to distinguish between:
- A fact it actually learned from training data
- A fact it is constructing from plausible-sounding patterns
The result is hallucination: confident, fluent, completely wrong answers. In a customer support or legal context, this is catastrophic.
How RAG fixes it: The system retrieves the actual document (the case file, the contract, the policy) and passes it to the model as context. The prompt says, effectively: “Here is the relevant document. Answer only from this.” You are no longer asking the model to recall facts from its parametric memory. You are asking it to read and summarize — a task models are genuinely excellent at.
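You can verify answers by pointing users to the source. If the model misreads the document, that’s an extractive error, which is much easier to detect because the source is right there to check. If retrieval finds no relevant document, the prompt can instruct the model to say “I don’t have information about that” rather than invent something. Grounding reduces hallucination; it does not eliminate it.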
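The "answer only from this" instruction can be sketched as a small prompt builder. This is an illustrative assumption about how such a prompt might be assembled, not a prescribed template; `build_grounded_prompt`, the passage fields, and the sample passage are all hypothetical.

```python
from typing import Optional

NO_ANSWER = "I don't have information about that."


def build_grounded_prompt(question: str, passages: list[dict]) -> Optional[str]:
    """Return a prompt restricted to the retrieved passages, or None if there are none."""
    if not passages:
        # Nothing retrieved: the caller should reply with NO_ANSWER
        # instead of asking the model to recall from parametric memory.
        return None
    sources = "\n\n".join(
        f"[{i}] ({p['source']}) {p['text']}" for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer using only the sources below. Cite sources by their bracketed number. "
        f"If the sources do not contain the answer, reply exactly: {NO_ANSWER}\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )


# Hypothetical retrieved passage with its source location attached.
passages = [
    {"source": "hr-policy.pdf, p. 3",
     "text": "International travel expenses are reimbursed within 30 days."}
]
prompt = build_grounded_prompt("How fast are travel expenses reimbursed?", passages)
empty = build_grounded_prompt("What was the damages award?", [])
```

Numbering the sources gives the model something concrete to cite, and returning `None` for an empty retrieval lets the application answer with the fallback message without calling the model at all.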
3. Enterprise Knowledge That Never Existed in Training Data
No public model has read your company’s Confluence pages, Slack history, internal pricing spreadsheets, or customer contracts, because that data isn’t public. You could fine-tune on a snapshot of it, but organizational knowledge changes continuously, so the fine-tuned model would be out of date almost as soon as training finished.
How RAG fixes it: Your internal documents become the retrieval corpus. The model answers questions grounded in your knowledge, not general internet knowledge. A new employee can ask “What is our policy on expense reimbursement for international travel?” and get an answer from the actual HR policy document, not a generic guess.
So When Do You Fine-Tune?
Fine-tuning changes the model’s weights: the billions of numerical parameters that encode its behavior. This is expensive (a full run can cost thousands of dollars, depending on model size and data volume), slow (hours to days), and produces a model that is static until you retrain.
Fine-tuning is the right choice when your problem is about how the model behaves, not what it knows:
| Problem | Solution |
|---|---|
| Model needs to respond in a specific JSON format every time | Fine-tuning |
| Model needs to match your brand’s tone exactly | Fine-tuning |
| Model should refuse certain topics even when pressured | Fine-tuning (RLHF/DPO) |
| Model needs to classify text into your custom taxonomy | Fine-tuning |
| Model needs to know about your Q4 2025 earnings release | RAG |
| Model needs to answer questions from your 10,000-page manual | RAG |
| Model needs to cite company policy documents | RAG |
The practical heuristic: if the knowledge lives in documents you can search, use RAG; if you need to modify the model’s behavior, use fine-tuning.
The Law Firm Scenario: Why RAG Wins
Consider a 200-attorney law firm that wants to build an internal research assistant. They have:
- 15 years of case files (PDFs, Word docs)
- Internal memos and legal research notes
- Client contracts and deal histories
- A constantly-updated database of court rulings
Why not fine-tune?
- Cost and staleness: Fine-tuning on 15 years of case files costs tens of thousands of dollars. Three months later, new cases have accumulated and the model is out of date. You can’t fine-tune continuously.
- No citation: A fine-tuned model that “learned” about a case has no way to say “I found this in file X, page 12.” It just knows it. For legal work, citations are not optional.
- Data size: Fine-tuning works best with thousands of high-quality examples, not millions of raw documents. Raw legal documents are not fine-tuning examples.
- Privacy risk: Sending all your client documents to an LLM API for fine-tuning raises serious confidentiality issues. A local RAG pipeline with a self-hosted model keeps those documents inside the firm’s own infrastructure.
Why RAG wins:
- New case files are ingested daily into the vector store. The system stays current.
- Every answer includes a citation: “Based on Smith v. Jones (2025), page 4…”
- Attorneys can verify every answer by reading the source document
- When the firm’s internal policy changes, update one document in the store. Done.
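The citation and "update one document" points above can be sketched with a tiny index that stores each chunk alongside its file name and page number. This is a hedged illustration: `ChunkIndex`, `Chunk`, and `cite` are hypothetical names, the keyword match stands in for embedding search, and the policy text is invented.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    file: str
    page: int
    text: str


class ChunkIndex:
    """Hypothetical in-memory index keyed by source file."""

    def __init__(self) -> None:
        self._by_file: dict[str, list[Chunk]] = {}

    def upsert(self, file: str, pages: list[str]) -> None:
        # Replace all chunks for this file, so stale versions never linger.
        self._by_file[file] = [Chunk(file, n, text) for n, text in enumerate(pages, start=1)]

    def search(self, keyword: str) -> list[Chunk]:
        # Naive keyword match; a real system would use embedding similarity.
        kw = keyword.lower()
        return [c for chunks in self._by_file.values() for c in chunks if kw in c.text.lower()]


def cite(chunk: Chunk) -> str:
    """Format the source location that accompanies an answer."""
    return f"{chunk.file}, page {chunk.page}"


index = ChunkIndex()
index.upsert("travel-policy.docx", ["Scope and definitions.", "Meals are capped at $50 per day."])
old_hits = [cite(c) for c in index.search("capped")]

# The policy changes: re-ingest the one file and the old text is gone.
index.upsert("travel-policy.docx", ["Scope and definitions.", "Meals are capped at $75 per day."])
new_hits = index.search("capped")
```

Because every chunk carries its file and page, the citation comes for free at answer time, and an update is a single `upsert` rather than a retraining run.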
The Spectrum: Prompt Engineering → RAG → Fine-Tuning → Both
It helps to think of these as a progression of investment and complexity:
1. Prompt Engineering (no engineering cost, try first) Just put the documents in the context window. Works if your total corpus fits in the model’s context window (128K tokens, for example). Fast to prototype, but doesn’t scale: you re-send the whole corpus with every API call, so per-query cost grows linearly with corpus size.
2. RAG (moderate engineering, high value) Index your documents once, retrieve only what’s relevant per query. Scales to millions of documents. Most enterprise use cases live here.
3. Fine-Tuning (high cost, specific use cases) Retrain the model on your data. Use when you need consistent output format, tone, or classification behavior that can’t be achieved with prompting.
4. RAG + Fine-Tuning (maximum quality) Use a fine-tuned model as the generator inside a RAG pipeline. The fine-tuning gives consistent behavior; RAG gives current, citable knowledge. A domain-specific model such as BloombergGPT, which was trained on financial text, covers only the model half of this pairing; adding retrieval on top is what supplies current, citable knowledge.
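To make the cost difference between the first two steps of this progression concrete, here is a back-of-the-envelope calculation. Every number is an assumption chosen for illustration (the price per million input tokens, the query volume, the corpus size, and the chunk sizes); substitute your own provider's pricing and your own traffic.

```python
def input_cost_usd(tokens_per_query: int, queries: int, usd_per_million_tokens: float) -> float:
    """Input-token cost only; output tokens and embedding costs are ignored."""
    return tokens_per_query * queries * usd_per_million_tokens / 1_000_000


PRICE = 3.0                 # assumed USD per million input tokens (placeholder)
QUERIES = 10_000            # assumed monthly query volume
CORPUS_TOKENS = 100_000     # context stuffing: whole corpus re-sent on every call
RETRIEVED_TOKENS = 5 * 400  # RAG: top 5 chunks of roughly 400 tokens each

stuffing = input_cost_usd(CORPUS_TOKENS, QUERIES, PRICE)
rag = input_cost_usd(RETRIEVED_TOKENS, QUERIES, PRICE)

print(f"Context stuffing: ${stuffing:,.2f} per month")
print(f"RAG (top-5 chunks): ${rag:,.2f} per month")
print(f"Ratio: {stuffing / rag:.0f}x")
```

Under these assumptions, stuffing costs $3,000 per month against $60 for retrieval, a 50x difference. The ratio is simply corpus tokens divided by retrieved tokens, so it widens as the corpus grows while the retrieval cost stays roughly flat.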
Decision Framework
Use this checklist when evaluating a new use case:
Does the task require recent information (post-training cutoff)?
Yes → RAG required
Does the task require citing specific source documents?
Yes → RAG required
Is the knowledge proprietary / not on the public internet?
Yes → RAG required
Do you need consistent output format or style across all responses?
Yes → Consider fine-tuning (can combine with RAG)
Do you have a labeled dataset of good input/output examples?
No → Don't fine-tune yet. Collect data first.
Is the task purely behavioral (classification, tone, formatting)?
Yes → Fine-tuning may be sufficient without RAG
If you checked any of the first three boxes, start with RAG. Add fine-tuning only if RAG alone isn’t producing the quality you need.
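The checklist can also be written as a small function, which is handy when triaging many use cases at once. This is one possible encoding, not an official rule set: the field names and the `recommend` function are hypothetical, and the fallback to prompt engineering is borrowed from the "try first" advice in the spectrum above.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    needs_recent_info: bool        # information after the training cutoff
    needs_citations: bool          # must cite specific source documents
    proprietary_knowledge: bool    # knowledge not on the public internet
    needs_consistent_format: bool  # consistent format or style in all responses
    has_labeled_examples: bool     # labeled input/output pairs are available


def recommend(uc: UseCase) -> list[str]:
    """Map checklist answers to an ordered list of next steps."""
    steps = []
    # Any of the first three boxes means RAG comes first.
    if uc.needs_recent_info or uc.needs_citations or uc.proprietary_knowledge:
        steps.append("Start with RAG")
    # Behavioral requirements point to fine-tuning, but only once data exists.
    if uc.needs_consistent_format:
        if uc.has_labeled_examples:
            steps.append("Consider fine-tuning")
        else:
            steps.append("Collect labeled examples before fine-tuning")
    if not steps:
        steps.append("Start with prompt engineering")
    return steps


legal_assistant = UseCase(True, True, True, False, False)
ticket_classifier = UseCase(False, False, False, True, True)
print(recommend(legal_assistant))
print(recommend(ticket_classifier))
```

The law firm assistant lands on RAG alone, while a purely behavioral task such as ticket classification lands on fine-tuning without RAG, matching the last item in the checklist.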
What Comes Next
The rest of this course builds a complete RAG system from scratch. You’ll learn:
- How the full pipeline is architected (ingestion, indexing, retrieval, generation)
- How to parse messy real-world documents — PDFs with tables, scanned pages, HTML
- Chunking strategies that dramatically affect retrieval quality
- Which embedding model to pick for your use case
- How to choose between ChromaDB, Pinecone, and pgvector
- Dense vs. sparse vs. hybrid retrieval
- Re-ranking to improve precision
- Evaluating your system with the RAGAS framework
- Advanced techniques: HyDE, query expansion, Self-RAG
By the end, you will have built a Q&A system over a real document corpus with a quantitative evaluation pipeline, following the same general architecture that production RAG systems use.
Let’s start by mapping out the architecture.
