RAG Architecture Overview

The Scenario We Will Build

You work at a software company with 500 pages of product documentation. Users constantly ask support questions: “How do I configure SSO?” “What’s the rate limit on the API?” “Where do I find audit logs?” Your support team is overwhelmed. You want to build a chatbot that answers these questions automatically, accurately, and with citations.

By the end of this lesson, you will understand every component of the system needed to do this — and what breaks if you skip any of them.

The Two Phases of RAG

RAG systems have two distinct phases that run at different times:

Phase 1: Ingestion (offline; runs once or periodically). Load documents → clean and split them → convert to embeddings → store in a vector database.

Phase 2: Retrieval and Generation (online; runs per query). User asks a question → embed the question → find similar chunks in the database → feed chunks + question to the LLM → return the answer.

Think of ingestion like building a library and creating a card catalog. You do that work once. Then when someone comes in with a question, you consult the catalog, pull the right books off the shelf, and read the relevant pages to answer them. You don’t rebuild the library for every question.
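
To make the split between the two phases concrete, here is a minimal sketch in plain Python. It is not how a real RAG system scores text: word overlap stands in for embeddings, and the function names `ingest` and `retrieve` and the three sample sentences are made up for illustration. The point is only that the index is built once and then reused for every question.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ingest(documents):
    """Phase 1 (offline): build the index once from all documents."""
    return [(doc, Counter(tokenize(doc))) for doc in documents]


def retrieve(index, question, k=2):
    """Phase 2 (online): score each indexed document against the question."""
    query_words = Counter(tokenize(question))
    scored = [
        (sum((query_words & words).values()), doc)  # count of shared words
        for doc, words in index
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]


# Made-up sample documents (assumption, not real product docs)
docs = [
    "To configure SSO, open Settings and choose Security.",
    "The API rate limit is 100 requests per minute.",
    "Audit logs are available under the Admin menu.",
]

index = ingest(docs)  # runs once
top = retrieve(index, "What is the API rate limit?", k=1)  # runs per query
print(top)
```

Calling `retrieve` again with a different question reuses the same `index`, which is the library-and-catalog idea in code.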

The Full Pipeline

Here is the complete architecture as a data flow:

INGESTION PHASE (offline)
─────────────────────────────────────────────────────────────
Raw Documents (PDF, HTML, DOCX, TXT)
        │
        ▼
[1] Document Loader
    Reads files, handles different formats
        │
        ▼
[2] Text Splitter / Chunker
    Breaks documents into chunks (e.g., 512 tokens each)
        │
        ▼
[3] Embedding Model
    Converts each chunk to a dense vector (e.g., 1536 floats)
        │
        ▼
[4] Vector Store
    Indexes vectors for fast similarity search
        │
   (stored, waiting)


RETRIEVAL + GENERATION PHASE (online, per query)
─────────────────────────────────────────────────────────────
User Question: "How do I configure SSO?"
        │
        ▼
[5] Embedding Model (same model as ingestion)
    Converts question to a query vector
        │
        ▼
[6] Retriever
    Queries vector store, returns top-k similar chunks
        │
        ▼
[7] Prompt Template
    Assembles: system prompt + retrieved chunks + user question
        │
        ▼
[8] LLM
    Reads the assembled prompt, generates an answer
        │
        ▼
Answer + Source Citations

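Eight numbered steps, covered below in seven sections because steps 5 and 6 (embedding the query and retrieving chunks) are treated together. Let’s examine each one.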

Component 1: Document Loader

What it does: Reads raw files and converts them to a uniform text + metadata format. A good loader extracts not just text, but structured metadata: the filename, page number, section heading, and creation date. You will need this metadata later to cite sources.

What goes wrong if you skip it: You can’t skip it — you need to read files somehow. The failure mode here is bad loading: using a simple loader that strips all structure from a PDF, so tables come out as a jumbled string of numbers, and multi-column layouts get the columns mixed together.

The failure looks like: A chunk that says "Tier 1 100 200 Tier 2 500 1000 Tier 3 unlimited unlimited" when the original was a clean pricing table. No embedding model can make sense of that.

from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("product_docs.pdf")
documents = loader.load()

# Each document has .page_content and .metadata
print(documents[0].metadata)
# {'source': 'product_docs.pdf', 'page': 0, 'author': '...'}
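
To show what a "uniform text + metadata" record can look like without any PDF library, here is a small sketch using only the standard library. It assumes a plain-text file whose pages are separated by form-feed characters; the `LoadedPage` class and `load_text_pages` function are hypothetical names, not part of LangChain.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class LoadedPage:
    """One page of text plus the metadata needed to cite it later."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_pages(path):
    """Read a text file whose pages are separated by form feeds."""
    path = Path(path)
    raw = path.read_text(encoding="utf-8")
    pages = []
    for number, text in enumerate(raw.split("\f")):
        cleaned = text.strip()
        if cleaned:  # skip empty pages
            pages.append(
                LoadedPage(cleaned, {"source": path.name, "page": number})
            )
    return pages


# Write a two-page sample file (made-up content) and load it back
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "guide.txt"
    sample.write_text("SSO setup steps\fAPI rate limits", encoding="utf-8")
    pages = load_text_pages(sample)

for page in pages:
    print(page.metadata, page.page_content)
```

Whatever loader you use, the shape to aim for is the same: each unit of text travels with enough metadata to point a reader back to its origin.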

Component 2: Text Splitter

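What it does: LLMs have context limits. A 500-page PDF is roughly 250,000 tokens (assuming about 500 tokens per page), which exceeds many models’ context windows and would be slow and expensive to send with every query even where it fits. The text splitter breaks documents into smaller, overlapping chunks that fit within the embedding model’s input limit and the LLM’s context window while preserving as much coherent meaning as possible.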

Why overlap matters: If you split with zero overlap, a concept that spans two chunks gets cut in half. Retrieval finds the first half, misses the second. Overlap (typically 10-20% of chunk size) ensures that boundary content is represented in at least one complete chunk.

What goes wrong if you skip it: You can’t skip it — you must split long documents. The failure mode is bad splitting: too-large chunks (LLM context fills up, can only retrieve 1-2 chunks) or too-small chunks (no context for the retrieved snippet to make sense).

from langchain_text_splitters import RecursiveCharacterTextSplitter

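splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,     # measured in characters by default, not tokens
    chunk_overlap=50,   # roughly 10% of chunk_size
    separators=["\n\n", "\n", ". ", " ", ""]  # tried in order, coarsest first
)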

chunks = splitter.split_documents(documents)
print(f"Split {len(documents)} documents into {len(chunks)} chunks")
# Split 42 documents into 847 chunks
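
The effect of overlap is easier to see on a tiny input. The sketch below is a simple sliding window over words, not the algorithm `RecursiveCharacterTextSplitter` uses; `split_with_overlap` is a hypothetical helper written for this illustration.

```python
def split_with_overlap(words, chunk_size, overlap):
    """Slide a window of chunk_size words, stepping by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the input
    return chunks


words = "w0 w1 w2 w3 w4 w5 w6 w7 w8 w9".split()

no_overlap = split_with_overlap(words, chunk_size=4, overlap=0)
with_overlap = split_with_overlap(words, chunk_size=4, overlap=2)

print(no_overlap)    # w3 and w4 land in different chunks
print(with_overlap)  # the chunk starting at w2 contains both w3 and w4
```

With zero overlap, the neighbours `w3` and `w4` are never in the same chunk. With an overlap of 2, one chunk holds both, at the cost of producing more chunks to embed and store.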

Component 3: Embedding Model

What it does: Converts text to a vector of floating-point numbers (an “embedding”). The embedding captures semantic meaning: text that means similar things gets vectors that are close together in vector space. This is what makes semantic search possible.
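
"Close together" is usually measured with cosine similarity. The sketch below computes it by hand; the three-dimensional vectors are invented for illustration (real embeddings have hundreds or thousands of dimensions and come from a model, not from a person).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made vectors (assumption): the first two point in a similar direction
sso_question = [0.9, 0.1, 0.0]
sso_chunk = [0.8, 0.2, 0.1]
billing_chunk = [0.1, 0.1, 0.9]

related = cosine_similarity(sso_question, sso_chunk)
unrelated = cosine_similarity(sso_question, billing_chunk)
print(f"related: {related:.3f}, unrelated: {unrelated:.3f}")
```

A retriever ranks chunks by a score like this one, so the chunk whose vector points in nearly the same direction as the question comes first.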

Why the same model must be used for both ingestion and retrieval: Embeddings are relative to the model that produced them. If you embed your documents with OpenAI’s model and embed your query with a different model, the vectors live in different spaces and similarity search produces garbage.

What goes wrong if you skip it: You can’t skip it; embeddings are the foundation of the whole system. The failure modes are model mismatch (different models for indexing and querying) and mid-project model switching (old document embeddings are incompatible with new query embeddings, so changing models means re-embedding every chunk).

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Embed a single text
vector = embeddings.embed_query("How do I configure SSO?")
print(f"Vector dimension: {len(vector)}")  # 1536
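
One way to guard against model mismatch is to record which model built the index and check it at query time. This is a sketch of that idea, not a feature of any particular vector store; `IndexInfo` and `check_query_compatible` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexInfo:
    """Facts about how an index was built, saved alongside the vectors."""
    embedding_model: str
    dimension: int


def check_query_compatible(index_info, query_model, query_vector):
    """Raise ValueError if the query was not embedded like the index."""
    if query_model != index_info.embedding_model:
        raise ValueError(
            f"index built with {index_info.embedding_model!r}, "
            f"query embedded with {query_model!r}"
        )
    if len(query_vector) != index_info.dimension:
        raise ValueError(
            f"expected {index_info.dimension} dimensions, "
            f"got {len(query_vector)}"
        )


index_info = IndexInfo(embedding_model="text-embedding-3-small", dimension=1536)

# Same model, same dimension: passes silently
check_query_compatible(index_info, "text-embedding-3-small", [0.0] * 1536)

# A different model name is rejected before any search happens
try:
    check_query_compatible(index_info, "some-other-model", [0.0] * 1536)
except ValueError as err:
    print(f"rejected: {err}")
```

Note that the dimension check alone is not enough: two different models can produce vectors of the same length that still live in unrelated spaces, which is why the model name is compared first.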

Component 4: Vector Store

What it does: Stores the chunk text plus its embedding vector, and provides a fast “find me the k vectors most similar to this query vector” operation. This is the core data structure that makes retrieval fast even with millions of chunks.

Under the hood: Most vector stores use an algorithm called HNSW (Hierarchical Navigable Small World) to organize vectors into a graph structure. This allows approximate nearest-neighbor search in milliseconds, even across millions of vectors — without scanning every vector one by one.

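What goes wrong if you skip it: Some people try to store embeddings in a regular database and compute cosine similarity in application code. This works at small scale (under ~10,000 vectors) but slows down linearly as the collection grows, because every query scans every vector. A proper vector store with an approximate index can typically search millions of vectors in milliseconds; a naive loop over the same data can take seconds to minutes per query, depending on hardware and implementation.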

from langchain_community.vectorstores import Chroma

# Create the vector store and index all chunks
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# Later: load existing index
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings
)
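
For comparison, this is roughly what the naive approach looks like: an exact, brute-force scan that scores the query against every stored vector. It is not HNSW and not what Chroma does internally; the chunk ids and three-dimensional vectors are made up for illustration.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_top_k(query, stored, k):
    """Exact search: one similarity computation per stored vector."""
    scored = (
        (cosine_similarity(query, vector), chunk_id)
        for chunk_id, vector in stored.items()
    )
    return heapq.nlargest(k, scored)  # highest scores first


# Made-up index of three chunks (assumption)
stored = {
    "sso-setup": [0.8, 0.2, 0.1],
    "billing": [0.1, 0.1, 0.9],
    "sso-troubleshooting": [0.7, 0.3, 0.0],
}

hits = brute_force_top_k([0.9, 0.1, 0.0], stored, k=2)
for score, chunk_id in hits:
    print(f"{chunk_id}: {score:.3f}")
```

The work grows in direct proportion to the number of stored vectors, which is fine for a prototype and is the cost that a graph index such as HNSW is designed to avoid.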

Components 5 & 6: Retriever

What it does: Takes the user’s question, embeds it with the same model used during ingestion, and queries the vector store for the top-k most similar chunks. The retriever bridges the two phases — it produces the context that gets injected into the prompt.

How many chunks to retrieve (k)? The typical starting point is k=4 to k=6. Too few: you might miss a relevant chunk that’s slightly further away in embedding space. Too many: the LLM’s context fills up with less relevant text, and models tend to use information placed in the middle of a long context less reliably than information at the start or end, an effect known as the “lost in the middle” problem.

What goes wrong if you skip it: This component is the heart of RAG. If retrieval is poor, no amount of LLM quality saves you. The garbage-in-garbage-out principle applies fully here.

retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)

# Retrieve relevant chunks for a query
results = retriever.invoke("How do I configure SSO?")

for doc in results:
    print(f"[{doc.metadata['source']}, p.{doc.metadata['page']}]")
    print(doc.page_content[:200])
    print("---")
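
The upper bound on k is a matter of arithmetic. The sketch below estimates how many chunks fit in a prompt; every number in it is a hypothetical example (an 8,192-token window, 300 tokens of instructions, 1,000 tokens reserved for the answer, 512-token chunks), and `max_chunks_that_fit` is a helper written for this illustration.

```python
def max_chunks_that_fit(context_window, prompt_overhead, answer_budget,
                        tokens_per_chunk):
    """Upper bound on k: whole chunks that fit in the remaining token budget."""
    available = context_window - prompt_overhead - answer_budget
    return max(available // tokens_per_chunk, 0)


# Hypothetical numbers, chosen only to make the arithmetic concrete
limit = max_chunks_that_fit(
    context_window=8192,
    prompt_overhead=300,
    answer_budget=1000,
    tokens_per_chunk=512,
)
print(limit)  # (8192 - 300 - 1000) // 512 = 13
```

This is only a ceiling. The useful k is usually well below it, for the relevance reasons described above.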

Component 7: Prompt Template

What it does: Assembles the final prompt that gets sent to the LLM. A well-designed RAG prompt has three parts:

System instructions: “You are a helpful assistant. Answer ONLY from the provided context. If the answer is not in the context, say ‘I don’t have information about that.’”
Retrieved context: The actual chunks from your documents
User question: What the user asked

Why the “answer only from context” instruction matters: Without this constraint, the LLM will combine retrieved context with its parametric memory. This produces answers that sound right but may mix retrieved facts with hallucinated ones — and you can’t tell which is which.

from langchain_core.prompts import ChatPromptTemplate

template = """You are a helpful assistant for product documentation.
Answer the question based ONLY on the following context.
If the answer is not in the context, say "I don't have information about that in our documentation."

Context:
{context}

Question: {question}

Answer:"""

prompt = ChatPromptTemplate.from_template(template)
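
To see what the LLM actually receives, it can help to fill in the template by hand. This sketch uses plain `str.format` rather than `ChatPromptTemplate`, with a shortened template; the `build_prompt` helper and the sample chunk text are assumptions made for illustration.

```python
TEMPLATE = (
    "Answer the question based ONLY on the following context.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n\n"
    "Answer:"
)


def build_prompt(chunks, question):
    """Assemble the prompt from (source, page, text) tuples and a question."""
    context = "\n\n".join(
        f"[Source: {source}, Page {page}]\n{text}"  # label each chunk for citation
        for source, page, text in chunks
    )
    return TEMPLATE.format(context=context, question=question)


# Made-up retrieved chunk (assumption)
chunks = [
    ("product_docs.pdf", 12, "SSO is configured under Settings > Security."),
]

prompt_text = build_prompt(chunks, "How do I configure SSO?")
print(prompt_text)
```

Printing the assembled prompt is a quick debugging step: if the answer is not visible in the context section, the LLM had nothing to work from.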

Component 8: LLM

What it does: Reads the assembled prompt (system instructions + retrieved context + question) and generates a natural-language answer. In a RAG system, the LLM’s job is primarily reading comprehension and synthesis, not knowledge recall. The knowledge comes from the retrieved documents; the model’s job is to articulate it clearly.

Model choice matters less here than in other LLM tasks: Because the LLM is reading from context rather than recalling from memory, a smaller, cheaper model (GPT-4o-mini, Claude Haiku) often performs nearly as well as a larger one. The retrieval quality is the dominant factor in answer quality.

from langchain_openai import ChatOpenAI
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def format_docs(docs):
    return "\n\n".join(
        f"[Source: {d.metadata.get('source', 'unknown')}, "
        f"Page {d.metadata.get('page', '?')}]\n{d.page_content}"
        for d in docs
    )

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

answer = chain.invoke("How do I configure SSO?")
print(answer)

Putting It All Together: A Minimal Working RAG System

Here is the complete minimal implementation — ingestion + retrieval + generation in ~50 lines:

from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# ── INGESTION PHASE ──────────────────────────────────────────
loader = PyMuPDFLoader("product_docs.pdf")
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_documents(documents)

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./db")

# ── RETRIEVAL + GENERATION PHASE ─────────────────────────────
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

prompt = ChatPromptTemplate.from_template("""
Answer based ONLY on the context below. 
If the answer isn't in the context, say so.

Context:
{context}

Question: {question}
""")

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def format_docs(docs):
    return "\n\n".join(
        f"[{d.metadata.get('source')}, p.{d.metadata.get('page')}]\n{d.page_content}"
        for d in docs
    )

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Ask a question
print(chain.invoke("How do I configure SSO?"))

This is a working RAG system. Everything you learn in the remaining lessons is about making each component in this pipeline better — better parsing, better chunking, better retrieval, better evaluation.

Common Failure Modes and How to Diagnose Them

Symptom	Likely Cause	Fix
Answer is factually wrong despite relevant documents existing	Retrieval failure — wrong chunks retrieved	Check what `retriever.invoke(question)` actually returns
Answer says “I don’t have information about that” for a known question	Retrieval failure — no similar chunks found	Check chunk quality; may need to adjust chunk size or embedding model
Answer contains facts not in the retrieved chunks	LLM is hallucinating / mixing memory	Strengthen system prompt, lower temperature
Answer is too generic / vague	Chunks are too large, relevant sentence is diluted	Reduce chunk size
Answer is cut off / incomplete	Retrieved chunks cut across sentence boundaries	Add chunk overlap, use semantic chunking
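
The first question in any diagnosis is whether the fact ever reached the prompt. The sketch below separates retrieval failures from generation failures with a crude substring check; `diagnose` is a hypothetical helper, the sample strings are made up, and a real evaluation would need something more tolerant than exact substring matching.

```python
def diagnose(retrieved_chunks, answer, expected_fact):
    """Classify a result by checking where the expected fact went missing."""
    fact = expected_fact.lower()
    in_context = any(fact in chunk.lower() for chunk in retrieved_chunks)
    in_answer = fact in answer.lower()

    if in_answer:
        return "ok"
    if not in_context:
        return "retrieval failure: the fact never reached the prompt"
    return "generation failure: the fact was retrieved but not used"


# Made-up examples (assumption)
good_chunks = ["The API rate limit is 100 requests per minute."]
bad_chunks = ["Audit logs are available under the Admin menu."]
fact = "100 requests per minute"

print(diagnose(good_chunks, "The limit is 100 requests per minute.", fact))
print(diagnose(bad_chunks, "I don't have information about that.", fact))
print(diagnose(good_chunks, "The limit is 500 requests per hour.", fact))
```

Running a check like this over a small set of known question and answer pairs shows which half of the pipeline to work on first.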

What You’ve Learned

You now understand the complete RAG architecture. The remaining lessons zoom into each component and teach you how to optimize it. Next up: the component most developers underestimate — document ingestion and parsing.
