RAG Tutorial: Step-by-Step Guide to Retrieval-Augmented Generation (2026)

What is RAG and Why Does it Matter?

Large language models are impressive — but they have a critical limitation: they only know what they were trained on. Ask a model about your internal documentation, last quarter's sales data, or a research paper published last week, and it will either hallucinate an answer or admit it doesn't know.

Retrieval-Augmented Generation (RAG) solves this. Instead of relying on the model's memory, RAG retrieves relevant documents from your own knowledge base at query time and passes them directly into the context window. The model generates an answer grounded in your data — not in what it was trained on years ago.

These properties explain why RAG has become such a common architecture for enterprise AI:

Accuracy — answers cite real documents, not model hallucinations
Updatability — add new documents without retraining or fine-tuning
Auditability — trace every answer to its source
Cost — far cheaper than fine-tuning for knowledge tasks

The RAG Pipeline — All Five Steps

A complete RAG system has five stages:

Load — ingest documents (PDFs, web pages, databases)
Chunk — split documents into retrieval-sized pieces
Embed — convert chunks to vector embeddings and store in a vector database
Retrieve — at query time, embed the question and find the most similar chunks
Generate — pass the retrieved chunks to an LLM as context, get a grounded answer

Let's build each step. We'll use langchain, chromadb, and the OpenAI API — all swappable for alternatives.

Step 1: Load and Parse Documents

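Install dependencies (the OpenAI calls in later steps expect an OPENAI_API_KEY environment variable):

pip install langchain langchain-community langchain-openai langchain-text-splitters chromadb pypdf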

Load a PDF document:

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("your-document.pdf")
documents = loader.load()

print(f"Loaded {len(documents)} pages")
# → Loaded 47 pages

LangChain supports dozens of loaders: WebBaseLoader for web pages, TextLoader for plain text, CSVLoader for spreadsheets, and more. The interface is the same — .load() returns a list of Document objects.
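
To illustrate what a shared loader interface buys you, here is a minimal, dependency-free sketch. It does not use LangChain: Doc, InMemoryTextLoader, InMemoryCSVLoader, and load_all are hypothetical stand-ins invented for this example. They only mimic the idea that every loader exposes .load() and returns objects with page_content and metadata.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Doc:
    """Stand-in for a LangChain Document: text plus free-form metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class Loader(Protocol):
    """Anything with a .load() method that returns a list of Doc objects."""
    def load(self) -> list[Doc]: ...


class InMemoryTextLoader:
    """Hypothetical loader: wraps one string as a single document."""
    def __init__(self, text: str, source: str) -> None:
        self.text = text
        self.source = source

    def load(self) -> list[Doc]:
        return [Doc(self.text, {"source": self.source})]


class InMemoryCSVLoader:
    """Hypothetical loader: produces one document per CSV row."""
    def __init__(self, csv_text: str, source: str) -> None:
        self.csv_text = csv_text
        self.source = source

    def load(self) -> list[Doc]:
        rows = csv.DictReader(io.StringIO(self.csv_text))
        return [
            Doc(
                "\n".join(f"{key}: {value}" for key, value in row.items()),
                {"source": self.source, "row": index},
            )
            for index, row in enumerate(rows)
        ]


def load_all(loaders: list[Loader]) -> list[Doc]:
    """Downstream code relies only on .load(), not on the loader type."""
    docs: list[Doc] = []
    for loader in loaders:
        docs.extend(loader.load())
    return docs


docs = load_all([
    InMemoryTextLoader("RAG retrieves context at query time.", "notes.txt"),
    InMemoryCSVLoader("name,role\nAda,engineer\nLin,analyst\n", "team.csv"),
])
print(len(docs))             # one text document plus two CSV rows
print(docs[1].page_content)
```

Because the chunking step only needs a list of documents, swapping one loader for another should not require changes further down the pipeline.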

Step 2: Chunk Your Documents

Chunking is the most underrated decision in RAG. Too large and your chunks contain irrelevant text that confuses retrieval. Too small and individual chunks lack context to answer questions.

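Start with recursive character splitting at a chunk size of 512 and an overlap of 50. By default RecursiveCharacterTextSplitter measures both in characters, not tokens (RecursiveCharacterTextSplitter.from_tiktoken_encoder gives token-based sizes instead). The character-based version looks like this: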

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],
)

chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")
# → Created 312 chunks

The recursive splitter tries to split on paragraph breaks first, then newlines, then sentences, preserving semantic boundaries as much as possible. The 50-character overlap reduces the chance that context is lost at chunk boundaries.
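
The two ideas in this step, falling back to finer separators and overlapping neighbouring chunks, can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: recursive_split and sliding_window are hypothetical helpers, sizes are measured in characters, and separators are dropped at chunk boundaries.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split text into chunks of at most chunk_size characters,
    trying coarse separators before finer ones."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)

    chunks: list[str] = []
    current = ""
    for part in text.split(sep):
        if not part.strip():
            continue
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging parts while they fit
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This part alone is too long: retry with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


def sliding_window(text: str, size: int, overlap: int) -> list[str]:
    """Fixed-size chunks where each chunk repeats the last `overlap`
    characters of the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if not text:
        return []
    step = size - overlap
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, last_start, step)]


sample = "Alpha beta gamma.\n\nDelta epsilon zeta eta theta iota kappa.\n\nLambda."
pieces = recursive_split(sample, chunk_size=30, separators=["\n\n", "\n", ". ", " "])
windows = sliding_window("abcdefghij", size=4, overlap=1)

print(pieces)
print(windows)
```

In the sample, the short paragraphs survive intact, while the 40-character middle paragraph is too long for a 30-character chunk and is split on spaces instead.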

Step 3: Embed and Index

Embedding converts text into a vector — a list of numbers that captures semantic meaning. Similar text produces similar vectors, which is what makes semantic search possible.

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
)

print("Indexed and persisted to disk")

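This creates a local ChromaDB vector database. For production, you can swap in Pinecone or pgvector. The embedding model and vector database are interchangeable behind the same interface, which is one of RAG's strengths. Keep in mind that changing the embedding model means re-embedding and re-indexing every chunk, because vectors produced by different models are not comparable.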
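
To make "similar text produces similar vectors" concrete, here is a toy sketch that needs no API key. It assumes a bag-of-words count vector as a stand-in for a real embedding model, which captures far more than word overlap. The names toy_embed and cosine_similarity are hypothetical helpers, not part of LangChain or OpenAI.

```python
import math
from collections import Counter


def toy_embed(text: str, vocabulary: list[str]) -> list[float]:
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


vocabulary = ["refund", "policy", "returns", "shipping", "delivery"]
query_vec = toy_embed("refund policy", vocabulary)
chunk_a = toy_embed("our refund policy covers returns", vocabulary)
chunk_b = toy_embed("shipping and delivery times", vocabulary)

sim_a = cosine_similarity(query_vec, chunk_a)
sim_b = cosine_similarity(query_vec, chunk_b)
print(round(sim_a, 3), sim_b)  # chunk_a shares words with the query, chunk_b shares none
```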

Step 4: Retrieve Relevant Chunks

At query time, embed the user's question and find the most semantically similar chunks:

retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4},  # return top 4 chunks
)

query = "What are the main findings of the study?"
relevant_chunks = retriever.invoke(query)

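# retriever.invoke() returns Documents without scores; to inspect distances,
# call vectorstore.similarity_search_with_score(query, k=4) instead.
for chunk in relevant_chunks:
    print(f"Page: {chunk.metadata.get('page', 'N/A')}")
    print(chunk.page_content[:200])
    print("---")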

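The retriever ranks chunks by how close their embeddings are to the query embedding and returns the closest matches. Chroma searches an approximate nearest-neighbor index rather than scanning every chunk, and its default distance is squared L2; for unit-length embeddings such as OpenAI's, that ordering matches cosine similarity. With k=4, you're passing the 4 most relevant chunks to the model, which is usually enough context without overwhelming the prompt.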
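
As a rough illustration of top-k retrieval, the sketch below does a brute-force scan over a tiny in-memory index. Real vector stores typically use approximate indexes instead. The vectors and chunk names are made up, and top_k_by_cosine and top_k_by_l2 are hypothetical helpers. It also shows that for unit-length vectors, ranking by cosine similarity and ranking by Euclidean distance give the same order.

```python
import heapq
import math


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k_by_cosine(query: list[float], index: dict[str, list[float]], k: int) -> list[str]:
    # For unit vectors the dot product equals the cosine similarity.
    return heapq.nlargest(k, index, key=lambda name: dot(query, index[name]))


def top_k_by_l2(query: list[float], index: dict[str, list[float]], k: int) -> list[str]:
    return heapq.nsmallest(k, index, key=lambda name: squared_l2(query, index[name]))


# Made-up three-dimensional "embeddings", normalized to unit length.
index = {
    "chunk-a": normalize([0.9, 0.1, 0.0]),
    "chunk-b": normalize([0.7, 0.7, 0.1]),
    "chunk-c": normalize([0.0, 0.2, 0.9]),
    "chunk-d": normalize([0.1, 0.9, 0.2]),
}
query_vector = normalize([1.0, 0.3, 0.0])

by_cosine = top_k_by_cosine(query_vector, index, k=2)
by_l2 = top_k_by_l2(query_vector, index, k=2)
print(by_cosine, by_l2)
```

The equivalence follows from the identity |a - b|^2 = 2 - 2 cos(a, b) for unit vectors, so a larger cosine always means a smaller distance.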

Step 5: Generate a Grounded Answer

Pass the retrieved chunks as context to an LLM to generate an answer:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt_template = ChatPromptTemplate.from_template("""
You are a helpful assistant. Answer the question based only on the provided context.
If the context doesn't contain the answer, say "I don't have enough information to answer that."

Context:
{context}

Question: {question}

Answer (cite the relevant section if possible):
""")

context_text = "\n\n---\n\n".join([c.page_content for c in relevant_chunks])

chain = prompt_template | llm

response = chain.invoke({
    "context": context_text,
    "question": query,
})

print(response.content)
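
The prompt asks the model to cite the relevant section, but a context string built from page_content alone carries no identifiers to cite. One possible approach, sketched here under assumptions, is to prefix each chunk with a numbered source label. Chunk is a stand-in for a retrieved Document, format_context is a hypothetical helper, the sample text is invented, and the "source" and "page" metadata keys are assumed to be present (worth checking on your own documents).

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Stand-in for a retrieved Document (page_content plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_context(chunks: list[Chunk]) -> str:
    """Prefix each chunk with a numbered source label the model can cite."""
    blocks = []
    for number, chunk in enumerate(chunks, start=1):
        source = chunk.metadata.get("source", "unknown")  # assumed metadata key
        page = chunk.metadata.get("page", "?")            # assumed metadata key
        blocks.append(f"[{number}] ({source}, page {page})\n{chunk.page_content}")
    return "\n\n---\n\n".join(blocks)


# Invented sample chunks, for illustration only.
retrieved = [
    Chunk("Revenue grew 12% year over year.", {"source": "report.pdf", "page": 3}),
    Chunk("Churn fell in the second half.", {"source": "report.pdf", "page": 7}),
]
labeled_context = format_context(retrieved)
print(labeled_context)
```

The resulting string could be passed as the context value in place of the plain join, so an answer can refer to "[1]" or "[2]" and be traced back to a page.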

Full Working Example

Here's the complete pipeline end-to-end, clean and runnable:

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate

# 1. Load
loader = PyPDFLoader("your-document.pdf")
documents = loader.load()

# 2. Chunk
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_documents(documents)

# 3. Embed & Index
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")

# 4. Retrieve
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

def rag_query(question: str) -> str:
    chunks = retriever.invoke(question)
    context = "\n\n---\n\n".join([c.page_content for c in chunks])

    prompt = ChatPromptTemplate.from_template("""
Answer the question using only the provided context. Cite the relevant section.
If the context does not answer the question, say "I don't have enough information."

Context: {context}
Question: {question}
Answer:""")

    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    chain = prompt | llm
    return chain.invoke({
        "context": context,
        "question": question,
    }).content

# Try it
print(rag_query("What are the main recommendations?"))

What's Next — Re-ranking and Evaluation

This tutorial covers the baseline RAG pipeline. To get it production-ready, you'll need two more components:

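Re-ranking: The initial retrieval compares the query and each chunk through independently computed embeddings, which is fast but coarse. A cross-encoder re-ranker scores the query and each candidate chunk together in a single model pass, which usually gives a more accurate relevance ordering at a higher cost per chunk. HuggingFaceCrossEncoder relies on the sentence-transformers package (pip install sentence-transformers). Add it like this: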

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

reranker = CrossEncoderReranker(
    model=HuggingFaceCrossEncoder(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"),
    top_n=2,
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=retriever,
)
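
The retrieve-then-rerank pattern itself is independent of any particular model. The sketch below shows only the control flow: overlap_score is a toy word-overlap scorer standing in for a real cross-encoder, rerank is a hypothetical helper, and the candidate strings are invented.

```python
def overlap_score(query: str, text: str) -> float:
    """Toy relevance score: fraction of query words that appear in the text.
    A real cross-encoder would score the (query, text) pair with a model."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & text_words) / len(query_words)


def rerank(query: str, candidates: list[str], top_n: int, score_fn=overlap_score) -> list[str]:
    """Re-score the first-stage candidates and keep the top_n best."""
    ranked = sorted(candidates, key=lambda text: score_fn(query, text), reverse=True)
    return ranked[:top_n]


# Pretend these four strings came back from the first-stage retriever (k=4).
first_stage = [
    "shipping times vary by region",
    "the refund window is 30 days",
    "refund requests need a receipt",
    "contact support for the refund window policy",
]
best = rerank("refund window", first_stage, top_n=2)
print(best)
```

In practice the first stage is often asked for more candidates than the final top_n, so the re-ranker has room to promote chunks that embedding similarity ranked too low.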

Evaluation: Use RAGAS to measure faithfulness (is every claim in the answer supported by the retrieved context?), answer relevancy (does the answer address the question?), and context precision (are the retrieved chunks actually relevant, and are the relevant ones ranked near the top?).
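
To give a feel for a rank-aware metric like context precision, here is a simplified, label-based sketch. It assumes you already have a 0/1 relevance label for each retrieved chunk. RAGAS derives such judgments with an LLM, and its exact definitions should be checked against its documentation. The function context_precision below is a hypothetical helper, not the RAGAS API.

```python
def context_precision(relevance: list[int]) -> float:
    """Rank-aware precision over retrieved chunks.

    relevance[i] is 1 if the chunk at rank i + 1 is relevant, else 0.
    The score averages precision@k over the ranks that hold a relevant chunk,
    so relevant chunks near the top score higher than the same chunks lower down.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


good_order = context_precision([1, 1, 0, 0])  # relevant chunks ranked first
bad_order = context_precision([0, 0, 1, 1])   # same chunks, ranked last
print(good_order, round(bad_order, 3))
```

Both lists contain two relevant chunks out of four, yet the scores differ, which is the point: the metric rewards putting relevant chunks where the model is most likely to use them.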
