RAG Tutorial: A Step-by-Step Guide to Retrieval-Augmented Generation
Build a complete RAG system from scratch — from loading your first document to returning cited answers grounded in your own data. Every code block runs as-is.
What is RAG and Why Does it Matter?
Large language models are impressive — but they have a critical limitation: they only know what they were trained on. Ask a model about your internal documentation, last quarter's sales data, or a research paper published last week, and it will either hallucinate an answer or admit it doesn't know.
Retrieval-Augmented Generation (RAG) solves this. Instead of relying on the model's memory, RAG retrieves relevant documents from your own knowledge base at query time and passes them directly into the context window. The model generates an answer grounded in your data — not in what it was trained on years ago.
This is why RAG has become a common architecture for enterprise AI:
- Accuracy — answers cite real documents, not model hallucinations
- Updatability — add new documents without retraining or fine-tuning
- Auditability — trace every answer to its source
- Cost — far cheaper than fine-tuning for knowledge tasks
The RAG Pipeline — All Five Steps
A complete RAG system has five stages:
- Load — ingest documents (PDFs, web pages, databases)
- Chunk — split documents into retrieval-sized pieces
- Embed — convert chunks to vector embeddings and store in a vector database
- Retrieve — at query time, embed the question and find the most similar chunks
- Generate — pass the retrieved chunks to an LLM as context, get a grounded answer
Let's build each step. We'll use langchain, chromadb, and the OpenAI API — all swappable for alternatives.
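Before bringing in those libraries, it can help to see the five stages in miniature. The sketch below uses only the Python standard library and is not how the real tools work internally: the SOURCES dictionary, the sentence-level chunker, and the word-overlap "embedding" are all hypothetical stand-ins, and the last stage only builds the prompt instead of calling a model.

```python
import re

# Load: hypothetical in-memory documents (source name -> text).
SOURCES = {
    "handbook.txt": "Employees accrue 20 vacation days per year. Unused days expire in March.",
    "policy.txt": "Refunds are issued within 14 days. Damaged items qualify for a full refund.",
}

def chunk(sources):
    """Chunk: one chunk per sentence, keeping the source name for citations."""
    chunks = []
    for name, text in sources.items():
        for sentence in re.split(r"(?<=\.)\s+", text.strip()):
            chunks.append({"source": name, "text": sentence})
    return chunks

def embed(text):
    """Embed: a set of lowercase words, a crude stand-in for a vector embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, index, k=2):
    """Retrieve: rank chunks by Jaccard word overlap with the question."""
    q = embed(question)

    def score(entry):
        words = entry["embedding"]
        return len(q & words) / len(q | words)

    return sorted(index, key=score, reverse=True)[:k]

def build_prompt(question, hits):
    """Generate (prompt only): label each chunk with its source so an answer can cite it."""
    context = "\n".join(f"[{h['source']}] {h['text']}" for h in hits)
    return f"Answer using only this context.\n\nContext:\n{context}\n\nQuestion: {question}"

index = [{**c, "embedding": embed(c["text"])} for c in chunk(SOURCES)]
question = "When do unused vacation days expire?"
hits = retrieve(question, index)
prompt = build_prompt(question, hits)
print(prompt)
```

Labeling each chunk with its source in the prompt is one simple way to let the model cite where an answer came from.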
Step 1: Load and Parse Documents
Install dependencies:
pip install langchain langchain-community langchain-openai chromadb pypdf
Load a PDF document:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("your-document.pdf")
documents = loader.load()
print(f"Loaded {len(documents)} pages")
# → Loaded 47 pages
LangChain supports dozens of loaders: WebBaseLoader for web pages, TextLoader for plain text, CSVLoader for spreadsheets, and more. The interface is the same: .load() returns a list of Document objects.
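To make the shared interface concrete, here is a minimal sketch of what a loader hands back. Doc and StringLoader are hypothetical stand-ins written for illustration, not LangChain classes; they mirror the two attributes the rest of this tutorial relies on, page_content and metadata. The "source" and "page" metadata keys are an assumption about what a PDF loader typically records.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

class StringLoader:
    """Hypothetical loader that wraps raw strings, one Doc per page."""

    def __init__(self, pages, source):
        self.pages = pages
        self.source = source

    def load(self):
        # Every loader exposes the same method and returns the same shape.
        return [
            Doc(page_content=text, metadata={"source": self.source, "page": i})
            for i, text in enumerate(self.pages)
        ]

docs = StringLoader(["First page text.", "Second page text."], source="demo.pdf").load()
for doc in docs:
    print(doc.metadata["page"], doc.page_content)
```

Because every later step only touches page_content and metadata, swapping one loader for another should not require changes downstream.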
Step 2: Chunk Your Documents
Chunking is the most underrated decision in RAG. Too large and your chunks contain irrelevant text that confuses retrieval. Too small and individual chunks lack context to answer questions.
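Start with recursive character splitting at a chunk size of 512 with an overlap of 50. Note that RecursiveCharacterTextSplitter measures both values in characters by default, not tokens; to size chunks in tokens, construct it with RecursiveCharacterTextSplitter.from_tiktoken_encoder instead:
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,     # characters by default (length_function=len)
    chunk_overlap=50,   # characters shared between neighboring chunks
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")
# → Created 312 chunks
The recursive splitter tries to split on paragraph breaks first, then newlines, then sentences, preserving semantic boundaries as much as possible. The 50-character overlap repeats the end of one chunk at the start of the next, so a sentence that straddles a boundary is less likely to lose its context.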
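The "try the coarsest separator first, then fall back" idea can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: recursive_split is a hypothetical helper, it drops separators at cut points, and it leaves out overlap entirely to keep the recursion easy to follow.

```python
def recursive_split(text, chunk_size, separators):
    """Split text into pieces of at most chunk_size characters.

    Tries the first separator; any piece that is still too long is split
    again with the remaining separators. Separators at cut points are dropped.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= chunk_size:
            current = candidate      # keep merging small pieces
            continue
        if current:
            chunks.append(current)   # flush what has been merged so far
        if len(part) > chunk_size:
            # This piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(part, chunk_size, rest))
            current = ""
        else:
            current = part
    if current:
        chunks.append(current)
    return chunks

text = (
    "Para one is short.\n\n"
    "Para two is a bit longer and has two sentences. Here is the second."
)
pieces = recursive_split(text, 40, ["\n\n", ". ", " ", ""])
for piece in pieces:
    print(repr(piece))
```

The short first paragraph survives intact, while the longer one is broken at a sentence boundary and then at word boundaries only where a sentence still exceeds the limit.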
Step 3: Embed and Index
Embedding converts text into a vector — a list of numbers that captures semantic meaning. Similar text produces similar vectors, which is what makes semantic search possible.
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
)
print("Indexed and persisted to disk")
This creates a local ChromaDB vector database. For production, swap in Pinecone or pgvector. The embedding model and vector database are interchangeable, which is one of RAG's strengths, though switching embedding models means re-embedding every chunk, because vectors from different models are not comparable.
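To see why "similar text produces similar vectors" enables search, here is a small sketch that scores text pairs with cosine similarity. The toy_embed function is a hypothetical stand-in that only counts words; a real embedding model captures meaning rather than word overlap, but the similarity arithmetic is the same.

```python
import math
from collections import Counter

def toy_embed(text):
    """Bag-of-words counts as a sparse vector (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

query = toy_embed("refund policy for damaged items")
close = toy_embed("our refund policy covers damaged items")
far = toy_embed("quarterly sales grew in europe")

print(round(cosine_similarity(query, close), 3))
print(round(cosine_similarity(query, far), 3))
```

The related sentence scores far higher than the unrelated one, which is the property a vector database exploits at query time.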
Step 4: Retrieve Relevant Chunks
At query time, embed the user's question and find the most semantically similar chunks:
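retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4},  # return top 4 chunks
)
query = "What are the main findings of the study?"
relevant_chunks = retriever.invoke(query)
for chunk in relevant_chunks:
    # The default retriever does not attach a score to metadata, so this prints N/A;
    # call vectorstore.similarity_search_with_score(query, k=4) to get distances.
    print(f"Score: {chunk.metadata.get('score', 'N/A')}")
    print(chunk.page_content[:200])
    print("---")
The retriever embeds the query and returns the nearest chunk embeddings from the index. Chroma searches an approximate nearest-neighbor index rather than scanning every chunk, and its default distance is squared L2 rather than cosine; for unit-length embeddings such as OpenAI's, both metrics produce the same ranking. With k=4, you're passing the 4 most relevant chunks to the model, enough context without overwhelming the prompt.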
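Top-k retrieval itself is just a sort over similarity scores. The sketch below uses hypothetical 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions) and an exhaustive scan rather than an approximate index. It also checks that, for unit-length vectors, ranking by cosine similarity and ranking by Euclidean distance agree.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical chunk embeddings, normalized to unit length.
index = {
    "chunk-a": normalize([1.0, 0.1, 0.0]),
    "chunk-b": normalize([0.7, 0.7, 0.0]),
    "chunk-c": normalize([0.0, 0.2, 1.0]),
    "chunk-d": normalize([0.9, 0.3, 0.1]),
}
query_vec = normalize([1.0, 0.0, 0.0])

def top_k_cosine(query_vec, index, k):
    """Highest cosine similarity first."""
    ranked = sorted(index, key=lambda name: dot(query_vec, index[name]), reverse=True)
    return ranked[:k]

def top_k_l2(query_vec, index, k):
    """Smallest Euclidean distance first."""
    ranked = sorted(index, key=lambda name: l2_distance(query_vec, index[name]))
    return ranked[:k]

print(top_k_cosine(query_vec, index, k=2))
print(top_k_l2(query_vec, index, k=2))
```

Both functions return the same chunks in the same order here, because for unit vectors the squared distance equals 2 minus twice the cosine.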
Step 5: Generate a Grounded Answer
Pass the retrieved chunks as context to an LLM to generate an answer:
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt_template = ChatPromptTemplate.from_template("""
You are a helpful assistant. Answer the question based only on the provided context.
If the context doesn't contain the answer, say "I don't have enough information to answer that."
Context:
{context}
Question: {question}
Answer (cite the relevant section if possible):
""")
context_text = "\n\n---\n\n".join([c.page_content for c in relevant_chunks])
chain = prompt_template | llm
response = chain.invoke({
    "context": context_text,
    "question": query,
})
print(response.content)
Full Working Example
Here's the complete pipeline end-to-end, clean and runnable:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.prompts import ChatPromptTemplate
# 1. Load
loader = PyPDFLoader("your-document.pdf")
documents = loader.load()
# 2. Chunk
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_documents(documents)
# 3. Embed & Index
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")
# 4. Retrieve
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
def rag_query(question: str) -> str:
    chunks = retriever.invoke(question)
    context = "\n\n---\n\n".join([c.page_content for c in chunks])
    prompt = ChatPromptTemplate.from_template("""
Answer the question using only the provided context. Cite the relevant section.
If the context does not answer the question, say "I don't have enough information."
Context: {context}
Question: {question}
Answer:""")
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    chain = prompt | llm
    return chain.invoke({
        "context": context,
        "question": question,
    }).content
# Try it
print(rag_query("What are the main recommendations?"))
What's Next: Re-ranking and Evaluation
This tutorial covers the baseline RAG pipeline. To get it production-ready, you'll need two more components:
Re-ranking: The initial retrieval uses embedding similarity, which is fast but imprecise. A cross-encoder re-ranker reads each chunk in full context with the query and produces a much more accurate relevance score. Add it like this:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
reranker = CrossEncoderReranker(
    model=HuggingFaceCrossEncoder(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"),
    top_n=2,  # keep the 2 best chunks after re-scoring
)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=retriever,
)
Evaluation: Use RAGAS to measure faithfulness (is every claim in the answer supported by the retrieved context?), answer relevancy (does the answer address the question?), and context precision (are the retrieved chunks actually relevant?).
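To give a feel for what a retrieval metric like context precision rewards, here is a small sketch that scores a ranked list of chunks from hand-supplied relevance labels. It is an illustration under stated assumptions, not the RAGAS API: context_precision is a hypothetical helper that computes rank-weighted (average) precision, whereas RAGAS obtains its relevance judgments from an LLM and its exact definition may differ between versions.

```python
def context_precision(relevance):
    """Rank-weighted precision for retrieved chunks.

    relevance: list of 0/1 labels in rank order (1 = chunk is relevant).
    Averages precision@rank over the ranks where a relevant chunk appears,
    so relevant chunks near the top score higher than the same chunks lower down.
    """
    hits = 0
    weighted = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / rank  # precision@rank at this relevant position
    return weighted / hits if hits else 0.0

good_ranking = [1, 1, 0, 0]  # relevant chunks retrieved first
poor_ranking = [0, 0, 1, 1]  # same chunks, but ranked last

print(context_precision(good_ranking))
print(round(context_precision(poor_ranking), 3))
```

Both rankings contain the same two relevant chunks, yet the first scores higher, which is the behavior a re-ranker is meant to improve.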
