Press ESC to exit fullscreen
📖 Lesson ⏱️ 60 minutes

RAG Architecture Overview

The full pipeline: ingest → chunk → embed → index → retrieve → augment → generate

The Scenario We Will Build

You work at a software company with 500 pages of product documentation. Users constantly ask support questions: “How do I configure SSO?” “What’s the rate limit on the API?” “Where do I find audit logs?” Your support team is overwhelmed. You want to build a chatbot that answers these questions automatically, accurately, and with citations.

By the end of this lesson, you will understand every component of the system needed to do this — and what breaks if you skip any of them.


The Two Phases of RAG

RAG systems have two distinct phases that run at different times:

Phase 1: Ingestion (offline, runs once — or periodically) Load documents → clean and split them → convert to embeddings → store in a vector database.

Phase 2: Retrieval and Generation (online, runs per query) User asks a question → embed the question → find similar chunks in the database → feed chunks + question to the LLM → return the answer.

Think of ingestion like building a library and creating a card catalog. You do that work once. Then when someone comes in with a question, you consult the catalog, pull the right books off the shelf, and read the relevant pages to answer them. You don’t rebuild the library for every question.


The Full Pipeline

Here is the complete architecture as a data flow:

INGESTION PHASE (offline)
─────────────────────────────────────────────────────────────
Raw Documents (PDF, HTML, DOCX, TXT)


[1] Document Loader
    Reads files, handles different formats


[2] Text Splitter / Chunker
    Breaks documents into chunks (e.g., 512 tokens each)


[3] Embedding Model
    Converts each chunk to a dense vector (e.g., 1536 floats)


[4] Vector Store
    Indexes vectors for fast similarity search

   (stored, waiting)


RETRIEVAL + GENERATION PHASE (online, per query)
─────────────────────────────────────────────────────────────
User Question: "How do I configure SSO?"


[5] Embedding Model (same model as ingestion)
    Converts question to a query vector


[6] Retriever
    Queries vector store, returns top-k similar chunks


[7] Prompt Template
    Assembles: system prompt + retrieved chunks + user question


[8] LLM
    Reads the assembled prompt, generates an answer


Answer + Source Citations

Seven components. Let’s examine each one.


Component 1: Document Loader

What it does: Reads raw files and converts them to a uniform text + metadata format. A good loader extracts not just text, but structured metadata: the filename, page number, section heading, and creation date. You will need this metadata later to cite sources.

What goes wrong if you skip it: You can’t skip it — you need to read files somehow. The failure mode here is bad loading: using a simple loader that strips all structure from a PDF, so tables come out as a jumbled string of numbers, and multi-column layouts get the columns mixed together.

The failure looks like: A chunk that says "Tier 1 100 200 Tier 2 500 1000 Tier 3 unlimited unlimited" when the original was a clean pricing table. No embedding model can make sense of that.

from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("product_docs.pdf")
documents = loader.load()

# Each document has .page_content and .metadata
print(documents[0].metadata)
# {'source': 'product_docs.pdf', 'page': 0, 'author': '...'}

Component 2: Text Splitter

What it does: LLMs have context limits. A 500-page PDF is ~250,000 tokens — you cannot fit that in a single prompt. The text splitter breaks documents into smaller, overlapping chunks that fit within the model’s context window while preserving as much coherent meaning as possible.

Why overlap matters: If you split with zero overlap, a concept that spans two chunks gets cut in half. Retrieval finds the first half, misses the second. Overlap (typically 10-20% of chunk size) ensures that boundary content is represented in at least one complete chunk.

What goes wrong if you skip it: You can’t skip it — you must split long documents. The failure mode is bad splitting: too-large chunks (LLM context fills up, can only retrieve 1-2 chunks) or too-small chunks (no context for the retrieved snippet to make sense).

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = splitter.split_documents(documents)
print(f"Split {len(documents)} documents into {len(chunks)} chunks")
# Split 42 documents into 847 chunks

Component 3: Embedding Model

What it does: Converts text to a vector of floating-point numbers (an “embedding”). The embedding captures semantic meaning: text that means similar things gets vectors that are close together in vector space. This is what makes semantic search possible.

Why the same model must be used for both ingestion and retrieval: Embeddings are relative to the model that produced them. If you embed your documents with OpenAI’s model and embed your query with a different model, the vectors live in different spaces and similarity search produces garbage.

What goes wrong if you skip it: You can’t skip it — embeddings are the foundation of the whole system. The failure mode is model mismatch (different models for indexing and querying) or model degradation (you switch models midway through; old embeddings are incompatible with new query embeddings).

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Embed a single text
vector = embeddings.embed_query("How do I configure SSO?")
print(f"Vector dimension: {len(vector)}")  # 1536

Component 4: Vector Store

What it does: Stores the chunk text plus its embedding vector, and provides a fast “find me the k vectors most similar to this query vector” operation. This is the core data structure that makes retrieval fast even with millions of chunks.

Under the hood: Most vector stores use an algorithm called HNSW (Hierarchical Navigable Small World) to organize vectors into a graph structure. This allows approximate nearest-neighbor search in milliseconds, even across millions of vectors — without scanning every vector one by one.

What goes wrong if you skip it: Some people try to store embeddings in a regular database and compute cosine similarity in application code. This works at small scale (under ~10,000 vectors) but becomes catastrophically slow at production scale. A proper vector store handles 10M vectors in 20ms; a naive loop takes minutes.

from langchain_community.vectorstores import Chroma

# Create the vector store and index all chunks
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# Later: load existing index
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings
)

Component 5 & 6: Retriever

What it does: Takes the user’s question, embeds it with the same model used during ingestion, and queries the vector store for the top-k most similar chunks. The retriever bridges the two phases — it produces the context that gets injected into the prompt.

How many chunks to retrieve (k)? The typical starting point is k=4 to k=6. Too few: you might miss a relevant chunk that’s slightly further away in embedding space. Too many: the LLM’s context fills up with less relevant text, and the model’s attention gets diluted — called the “lost in the middle” problem.

What goes wrong if you skip it: This component is the heart of RAG. If retrieval is poor, no amount of LLM quality saves you. The garbage-in-garbage-out principle applies fully here.

retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)

# Retrieve relevant chunks for a query
results = retriever.invoke("How do I configure SSO?")

for doc in results:
    print(f"[{doc.metadata['source']}, p.{doc.metadata['page']}]")
    print(doc.page_content[:200])
    print("---")

Component 7: Prompt Template

What it does: Assembles the final prompt that gets sent to the LLM. A well-designed RAG prompt has three parts:

  1. System instructions: “You are a helpful assistant. Answer ONLY from the provided context. If the answer is not in the context, say ‘I don’t have information about that.’”
  2. Retrieved context: The actual chunks from your documents
  3. User question: What the user asked

Why the “answer only from context” instruction matters: Without this constraint, the LLM will combine retrieved context with its parametric memory. This produces answers that sound right but may mix retrieved facts with hallucinated ones — and you can’t tell which is which.

from langchain_core.prompts import ChatPromptTemplate

template = """You are a helpful assistant for product documentation.
Answer the question based ONLY on the following context.
If the answer is not in the context, say "I don't have information about that in our documentation."

Context:
{context}

Question: {question}

Answer:"""

prompt = ChatPromptTemplate.from_template(template)

Component 8: LLM

What it does: Reads the assembled prompt (system instructions + retrieved context + question) and generates a natural-language answer. In a RAG system, the LLM’s job is primarily reading comprehension and synthesis, not knowledge recall. The knowledge comes from the retrieved documents; the model’s job is to articulate it clearly.

Model choice matters less here than in other LLM tasks: Because the LLM is reading from context rather than recalling from memory, a smaller, cheaper model (GPT-4o-mini, Claude Haiku) often performs nearly as well as a larger one. The retrieval quality is the dominant factor in answer quality.

from langchain_openai import ChatOpenAI
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def format_docs(docs):
    return "\n\n".join(
        f"[Source: {d.metadata.get('source', 'unknown')}, "
        f"Page {d.metadata.get('page', '?')}]\n{d.page_content}"
        for d in docs
    )

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

answer = chain.invoke("How do I configure SSO?")
print(answer)

Putting It All Together: A Minimal Working RAG System

Here is the complete minimal implementation — ingestion + retrieval + generation in ~50 lines:

import os
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# ── INGESTION PHASE ──────────────────────────────────────────
loader = PyMuPDFLoader("product_docs.pdf")
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_documents(documents)

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./db")

# ── RETRIEVAL + GENERATION PHASE ─────────────────────────────
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

prompt = ChatPromptTemplate.from_template("""
Answer based ONLY on the context below. 
If the answer isn't in the context, say so.

Context:
{context}

Question: {question}
""")

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def format_docs(docs):
    return "\n\n".join(
        f"[{d.metadata.get('source')}, p.{d.metadata.get('page')}]\n{d.page_content}"
        for d in docs
    )

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Ask a question
print(chain.invoke("How do I configure SSO?"))

This is a working RAG system. Everything you learn in the remaining lessons is about making each component in this pipeline better — better parsing, better chunking, better retrieval, better evaluation.


Common Failure Modes and How to Diagnose Them

SymptomLikely CauseFix
Answer is factually wrong despite relevant documents existingRetrieval failure — wrong chunks retrievedCheck what retriever.invoke(question) actually returns
Answer says “I don’t have information about that” for a known questionRetrieval failure — no similar chunks foundCheck chunk quality; may need to adjust chunk size or embedding model
Answer contains facts not in the retrieved chunksLLM is hallucinating / mixing memoryStrengthen system prompt, lower temperature
Answer is too generic / vagueChunks are too large, relevant sentence is dilutedReduce chunk size
Answer is cut off / incompleteRetrieved chunks cut across sentence boundariesAdd chunk overlap, use semantic chunking

What You’ve Learned

You now understand the complete RAG architecture. The remaining lessons zoom into each component and teach you how to optimize it. Next up: the component most developers underestimate — document ingestion and parsing.