RAG Evaluation: RAGAS and Beyond

Why You Can’t Just Read the Answers

Imagine you’ve built a RAG system for your company’s documentation. You ask it 10 questions. The answers look pretty good. You ship it.

Six weeks later, you’re getting complaints. Users say the system confidently gives wrong answers. It’s citing documents that don’t contain the information it claims. It’s ignoring relevant sections and answering from memory.

What happened? You never actually measured quality systematically. You relied on vibes.

RAGAS (Retrieval Augmented Generation Assessment) is a framework for measuring RAG quality systematically. It produces a separate score for each stage of the pipeline, so you can see where the system is failing and what to fix.

The Three Metrics That Matter

RAGAS focuses on three core measurements, each targeting a different failure mode:

Metric 1: Faithfulness (target: 0.90)

What it measures: Is every claim in the generated answer supported by the retrieved context?

What failure looks like: The context contains a paragraph about authentication. The answer says “The system supports LDAP, SAML, and Kerberos authentication.” The context only mentions LDAP and SAML. The model added Kerberos from its training data — a hallucination. Faithfulness score < 1.0.

How RAGAS computes it: The answer is decomposed into individual claims. Each claim is checked against the context using an LLM judge: “Is this claim supported by the context?” Faithfulness = (claims supported by context) / (total claims).
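
The ratio can be sketched in a few lines. This illustrates the arithmetic only, not RAGAS’s internal code: the `faithfulness_score` helper and the hand-written verdicts are assumptions standing in for the LLM judge’s per-claim decisions.

```python
def faithfulness_score(verdicts: list[bool]) -> float:
    """Fraction of answer claims the judge marked as supported by the context."""
    if not verdicts:
        raise ValueError("the answer produced no claims to check")
    return sum(verdicts) / len(verdicts)


# Hypothetical judge verdicts for the authentication example above:
# one claim per protocol named in the answer.
claims = {
    "The system supports LDAP authentication.": True,
    "The system supports SAML authentication.": True,
    "The system supports Kerberos authentication.": False,  # not in the context
}

score = faithfulness_score(list(claims.values()))
print(f"faithfulness = {score:.2f}")  # 2 of 3 claims supported
```

A real judge may split the answer into claims differently, so treat the count of three as part of the illustration.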

What a low score tells you: Your LLM is mixing retrieved context with parametric memory. Fix: strengthen your system prompt with “Answer ONLY from the provided context. Do not add information from your training.”

Metric 2: Answer Relevancy (target: 0.85)

What it measures: Does the answer actually address the user’s question?

What failure looks like: User asks “How do I export data to CSV?” The system retrieves a chunk about CSV format and a chunk about data formats generally. The answer talks about file formats but never explains the actual export process. The answer is adjacent to the question but doesn’t answer it.

How RAGAS computes it: The LLM judge generates several hypothetical questions that the given answer would address. RAGAS then embeds those questions and the original question and averages their cosine similarity. If the generated questions are close to the original, the answer is relevant. If they diverge, the answer is off-topic.
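
A minimal sketch of that comparison, assuming similarity is measured as the mean cosine similarity between question embeddings (which is how the RAGAS documentation describes it). The helper names and the toy 3-dimensional vectors are made up for illustration; real embeddings come from an embedding model and have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def answer_relevancy_score(original: list[float], generated: list[list[float]]) -> float:
    """Mean cosine similarity between the original question and the
    questions the judge reverse-engineered from the answer."""
    return sum(cosine_similarity(original, g) for g in generated) / len(generated)


# Toy "embeddings" (hypothetical values, not produced by any model).
original_question = [1.0, 0.0, 0.0]               # "How do I export data to CSV?"
on_topic = [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]]     # generated questions about exporting
off_topic = [[0.1, 1.0, 0.0], [0.0, 0.9, 0.3]]    # generated questions about file formats

relevant_score = answer_relevancy_score(original_question, on_topic)
tangent_score = answer_relevancy_score(original_question, off_topic)
print(f"on-topic answer:  {relevant_score:.2f}")
print(f"off-topic answer: {tangent_score:.2f}")
```

The answer that only discusses file formats scores far lower, because the questions it implies point in a different direction from the one the user asked.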

What a low score tells you: Either the retriever is pulling adjacent but not directly relevant chunks, or your prompt template is too permissive and the model is going on tangents. Fix: tighten retrieval (try re-ranking) or add “Stay focused on the specific question asked” to your prompt.

Metric 3: Context Precision (target: 0.75)

What it measures: Are the retrieved chunks actually relevant to the question? Are they ordered by relevance (most relevant first)?

What failure looks like: You retrieve 5 chunks. Chunks 1, 2, and 3 are tangentially related. Chunk 4 contains the actual answer. Chunk 5 is completely unrelated. Context precision is low because the most relevant content is buried at position 4.

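How RAGAS computes it: An LLM judge evaluates each retrieved chunk: “Is this chunk relevant to the question?” Context Precision then averages precision@k over the ranks k that hold a relevant chunk, so the score rises when relevant chunks appear near the top and falls when they are buried.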
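
To make the position weighting concrete, here is a small sketch of the formula as the RAGAS documentation describes it: precision@k is summed over the ranks k that hold a relevant chunk, then divided by the number of relevant chunks. The `context_precision_at_k` helper and the 0/1 relevance labels are illustrative assumptions (tangential chunks are treated as not relevant); in RAGAS the labels come from the LLM judge.

```python
def context_precision_at_k(relevance: list[int]) -> float:
    """relevance[i] is 1 if the chunk at rank i+1 was judged relevant, else 0."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    hits = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            score += hits / rank  # precision@rank, counted only at relevant ranks
    return score / total_relevant


buried = context_precision_at_k([0, 0, 0, 1, 0])  # answer chunk at position 4
on_top = context_precision_at_k([1, 0, 0, 0, 0])  # same chunk ranked first
print(buried, on_top)  # 0.25 1.0
```

The same five chunks score 0.25 or 1.0 depending only on where the relevant one sits, which is why re-ranking alone can move this metric.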

What a low score tells you: Retrieval quality problem. Fix: try hybrid search, try a better embedding model, try smaller chunk sizes, or add re-ranking.

Installation and Setup

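# The code in this lesson uses the ragas 0.1.x API (ragas.testset.generator); later releases reorganized it
pip install "ragas>=0.1,<0.2" datasets langchain-openai langchain-community langchain-text-splitters pymupdf chromadb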

import os
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall  # optional: needs reference answers
)
from datasets import Dataset

os.environ["OPENAI_API_KEY"] = "your-key"  # placeholder; prefer setting this in your shell rather than in code
# RAGAS uses the OpenAI API to judge answers, so every evaluation run incurs additional cost

Creating a Test Set

RAGAS evaluation expects a dataset with up to four fields per sample:

question: the user’s query
answer: what your RAG system generated
contexts: the list of retrieved chunks (as strings)
ground_truth: the reference answer. Optional for faithfulness and answer_relevancy, but required by context_recall and, in ragas 0.1.x, by context_precision

The most important principle here: build your test set from real or realistic user queries, including the hard ones, not from questions you already know the system handles well.

# Method 1: Manual test set (best quality, most effort)
test_data = {
    "question": [
        "What is the rate limit for the Standard tier?",
        "How do I configure SSO with Okta?",
        "What happens when I exceed the API rate limit?",
        "Can I export data to CSV?",
        "How do I revoke an API key?",
    ],
    "ground_truth": [  # optional but useful
        "The Standard tier allows 1000 requests per minute.",
        "Configure SSO by navigating to Settings > Security > SSO and following the Okta integration guide.",
        "When the rate limit is exceeded, the API returns HTTP 429 with a Retry-After header.",
        "Yes, you can export data by clicking Export > CSV in the Reports section.",
        "Revoke an API key in Settings > API Keys by clicking the Revoke button next to the key.",
    ]
}

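For a larger, more realistic test set (aim for 30+ questions), use RAGAS’s test set generator. The example below generates 20 to keep the first run cheap; raise test_size once the pipeline works: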

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load your documents
loader = PyMuPDFLoader("product_docs.pdf")
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(documents)

# Generate test questions automatically
generator = TestsetGenerator.from_langchain(
    generator_llm=ChatOpenAI(model="gpt-4o"),
    critic_llm=ChatOpenAI(model="gpt-4o"),
    embeddings=OpenAIEmbeddings(model="text-embedding-3-small")
)

testset = generator.generate_with_langchain_docs(
    documents=chunks,
    test_size=20,
    distributions={
        simple: 0.5,       # straightforward factual questions
        reasoning: 0.3,    # questions requiring synthesis
        multi_context: 0.2  # questions needing multiple chunks
    }
)

testset_df = testset.to_pandas()
print(testset_df[['question', 'ground_truth']].head())

Running Your RAG Pipeline on the Test Set

Now run your RAG system against every test question and collect the outputs:

from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Setup your RAG pipeline
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = ChatPromptTemplate.from_template("""
Answer based ONLY on the context below.
If the answer isn't in the context, say "I don't have information about that."

Context:
{context}

Question: {question}
""")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Run the pipeline on all test questions and collect outputs
def run_rag_on_testset(questions: list[str]) -> tuple[list[str], list[list[str]]]:
    answers = []
    contexts = []
    
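    # Generation-only chain: the answer is produced from the same chunks we
    # record, so RAGAS scores exactly the context the model saw
    answer_chain = prompt | llm | StrOutputParser()

    for question in questions:
        # Get retrieved chunks (retrieve once per question)
        retrieved_docs = retriever.invoke(question)
        context_texts = [doc.page_content for doc in retrieved_docs]
        
        # Generate the answer from those same chunks
        answer = answer_chain.invoke(
            {"context": format_docs(retrieved_docs), "question": question}
        )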
        
        answers.append(answer)
        contexts.append(context_texts)
        
        print(f"Q: {question[:60]}...")
        print(f"A: {answer[:100]}...")
        print()
    
    return answers, contexts

questions = testset_df['question'].tolist()
ground_truths = testset_df['ground_truth'].tolist()

answers, contexts = run_rag_on_testset(questions)
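
Before handing these lists to RAGAS, it can help to check that they line up, since a misaligned or empty column tends to surface as a confusing error partway through a paid evaluation run. The `check_eval_inputs` helper below is a hypothetical sketch, not part of RAGAS; it only verifies the field shapes described in “Creating a Test Set”.

```python
def check_eval_inputs(
    questions: list[str],
    answers: list[str],
    contexts: list[list[str]],
    ground_truths: list[str],
) -> list[str]:
    """Return human-readable problems; an empty list means the inputs look sane."""
    problems = []
    n = len(questions)
    columns = [("answers", answers), ("contexts", contexts), ("ground_truths", ground_truths)]
    for name, column in columns:
        if len(column) != n:
            problems.append(f"{name} has {len(column)} rows, expected {n}")
    for i, ctx in enumerate(contexts):
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            problems.append(f"contexts[{i}] must be a list of strings")
        elif not ctx:
            problems.append(f"contexts[{i}] is empty: the retriever returned nothing")
    for i, answer in enumerate(answers):
        if not answer.strip():
            problems.append(f"answers[{i}] is empty")
    return problems


# Tiny hand-written example: the second sample has no retrieved chunks.
demo_problems = check_eval_inputs(
    questions=["How do I revoke an API key?", "Can I export data to CSV?"],
    answers=["Use Settings > API Keys.", "Yes, via Export > CSV."],
    contexts=[["Revoke an API key in Settings > API Keys."], []],
    ground_truths=["Revoke it in Settings > API Keys.", "Yes, from the Reports section."],
)
print(demo_problems)
```

In the lesson’s flow you would call it as `check_eval_inputs(questions, answers, contexts, ground_truths)` and stop if the returned list is not empty.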

Running the RAGAS Evaluation

# Assemble the evaluation dataset
eval_dataset = Dataset.from_dict({
    "question": questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truth": ground_truths
})

# Run evaluation (this makes LLM judge API calls; see "The API Cost Reality" for a budget)
result = evaluate(
    dataset=eval_dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        # context_recall,  # uncomment if you have ground_truth
    ]
)

print(result)
# Example output (your scores will differ):
# {
#   'faithfulness': 0.71,
#   'answer_relevancy': 0.82,
#   'context_precision': 0.58
# }

Interpreting the Results

Scenario A: Faithfulness = 0.71, Context Precision = 0.80, Answer Relevancy = 0.85

Diagnosis: Hallucination problem. The answer is relevant and the retrieval is good, but the model is adding facts beyond what’s in the context.

Fix:

# Strengthen the system prompt
prompt = ChatPromptTemplate.from_template("""
You are a factual assistant. Answer ONLY using information explicitly stated in the context below.
Do NOT add information from your training data.
Do NOT extrapolate or make inferences beyond what the context states.
If the context does not contain the answer, respond exactly: "This information is not in the documentation."

Context:
{context}

Question: {question}

Answer (based strictly on the context):""")

# Also confirm temperature is 0 so answers stay deterministic (this pipeline already sets it)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

Scenario B: Faithfulness = 0.92, Context Precision = 0.45, Answer Relevancy = 0.78

Diagnosis: Retrieval quality problem. The model is faithful to what it reads, but what it reads is not very relevant.

Fix:

# Option 1: Try smaller chunks (more targeted retrieval)
splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=30)

# Option 2: Add re-ranking
from sentence_transformers import CrossEncoder
# (see previous lesson)

# Option 3: Try hybrid search
# (see previous lesson)

# Option 4: Try a better embedding model
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
# or
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
embeddings = HuggingFaceBgeEmbeddings(model_name="BAAI/bge-m3")

Scenario C: Faithfulness = 0.88, Context Precision = 0.75, Answer Relevancy = 0.55

Diagnosis: The answer is grounded and retrieval works, but the answer is going off-topic or not directly addressing the question.

Fix: Add focus instructions to the prompt and check if the model is being distracted by tangentially related context.
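
The three scenarios can be folded into a small lookup that turns a score dictionary into a list of suspected failure modes. This is a sketch under assumptions: the `diagnose` helper is hypothetical, and the cut-offs are the “Minimum Acceptable” values from the table in “What Good Looks Like”, not thresholds defined by RAGAS.

```python
# Minimum acceptable scores, copied from the table in "What Good Looks Like".
MINIMUMS = {"faithfulness": 0.80, "answer_relevancy": 0.75, "context_precision": 0.60}

# Failure mode and first fix to try for each metric, summarizing Scenarios A-C.
PLAYBOOK = {
    "faithfulness": "hallucination: tighten the prompt to answer only from the context",
    "context_precision": "retrieval quality: try smaller chunks, re-ranking, hybrid search, or a better embedding model",
    "answer_relevancy": "off-topic answers: add focus instructions and check for distracting context",
}


def diagnose(scores: dict[str, float]) -> list[str]:
    """List the failure modes whose metric falls below its minimum."""
    return [
        f"{metric} = {scores[metric]:.2f} -> {advice}"
        for metric, advice in PLAYBOOK.items()
        if scores[metric] < MINIMUMS[metric]
    ]


scenario_a = {"faithfulness": 0.71, "context_precision": 0.80, "answer_relevancy": 0.85}
scenario_b = {"faithfulness": 0.92, "context_precision": 0.45, "answer_relevancy": 0.78}

for finding in diagnose(scenario_a) + diagnose(scenario_b):
    print(finding)
```

When several metrics fail at once, fixing retrieval first is usually the sensible order, since a model given irrelevant context has little chance of producing a relevant answer.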

Building a Continuous Evaluation Loop

Don’t just run RAGAS once. Build it into your development workflow:

import json
from datetime import datetime

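def evaluate_and_save(pipeline_config: dict, testset: list) -> dict:
    """Run evaluation and save results with configuration for comparison.

    Note: pipeline_config is only recorded alongside the scores. The pipeline
    that actually runs is whatever run_rag_on_testset currently uses.
    """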
    questions = [item['question'] for item in testset]
    ground_truths = [item['ground_truth'] for item in testset]
    
    answers, contexts = run_rag_on_testset(questions)
    
    eval_dataset = Dataset.from_dict({
        "question": questions,
        "answer": answers,
        "contexts": contexts,
        "ground_truth": ground_truths
    })
    
    result = evaluate(
        dataset=eval_dataset,
        metrics=[faithfulness, answer_relevancy, context_precision]
    )
    
    run_record = {
        "timestamp": datetime.now().isoformat(),
        "config": pipeline_config,
        "metrics": dict(result),
        "sample_size": len(questions)
    }
    
    # Append to results log
    with open("rag_eval_log.jsonl", "a") as f:
        f.write(json.dumps(run_record) + "\n")
    
    return run_record

# Baseline evaluation
baseline = evaluate_and_save(
    pipeline_config={
        "embedding": "text-embedding-3-small",
        "chunk_size": 512,
        "retriever": "dense_only",
        "k": 5
    },
    testset=testset_df.to_dict('records')
)

print(f"Baseline: {baseline['metrics']}")

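# After adding re-ranking (rebuild the retriever and chain with the new setup
# before this call, otherwise both runs score the same pipeline)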
improved = evaluate_and_save(
    pipeline_config={
        "embedding": "text-embedding-3-small",
        "chunk_size": 512,
        "retriever": "hybrid_with_reranking",
        "k": 5,
        "reranker": "ms-marco-MiniLM-L-6-v2"
    },
    testset=testset_df.to_dict('records')
)

print(f"With improvements: {improved['metrics']}")
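
Once the log holds more than one run, the useful number is the change per metric, not the raw scores. The sketch below reads the JSONL format written by `evaluate_and_save` and computes those deltas; `load_runs` and `metric_deltas` are hypothetical helpers, and the two in-memory records use made-up scores so the block runs without a log file.

```python
import json


def load_runs(path: str) -> list[dict]:
    """Read one run record per line from the JSONL log."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


def metric_deltas(before: dict, after: dict) -> dict[str, float]:
    """Change in each metric shared by two run records (after minus before)."""
    return {
        name: round(after["metrics"][name] - before["metrics"][name], 4)
        for name in before["metrics"]
        if name in after["metrics"]
    }


# Two hand-written records in the shape evaluate_and_save writes (scores are made up).
baseline_run = {
    "config": {"retriever": "dense_only"},
    "metrics": {"faithfulness": 0.71, "answer_relevancy": 0.82, "context_precision": 0.58},
}
improved_run = {
    "config": {"retriever": "hybrid_with_reranking"},
    "metrics": {"faithfulness": 0.74, "answer_relevancy": 0.83, "context_precision": 0.77},
}

deltas = metric_deltas(baseline_run, improved_run)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
```

With a real log you would replace the hand-written records with `runs = load_runs("rag_eval_log.jsonl")` and compare `runs[0]` against `runs[-1]`. Small deltas on a 20-question test set can be noise from the LLM judge, so rerun before drawing conclusions from a change of a point or two.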

The API Cost Reality

RAGAS uses an LLM judge to evaluate each sample. Budget accordingly:

20 test questions: ~$0.20-0.50 per evaluation run
100 test questions: ~$1-2.50 per run
Run evaluations after each significant pipeline change

The cost is minimal compared to the value of knowing your system’s actual quality. Don’t skip evaluation to save $0.50.
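
For planning, the figures above work out to roughly $0.01 to $0.025 per test question per run. The helper below just scales that range; `estimate_run_cost` is a hypothetical name, and the per-question range is derived from this lesson’s estimates rather than from any price list, so actual cost will vary with the judge model, context length, and number of metrics.

```python
# Per-question cost range implied by the lesson's figures
# ($0.20-0.50 for 20 questions, $1-2.50 for 100).
COST_PER_QUESTION_USD = (0.01, 0.025)


def estimate_run_cost(num_questions: int, runs: int = 1) -> tuple[float, float]:
    """Return a (low, high) dollar estimate for the given number of evaluation runs."""
    low, high = COST_PER_QUESTION_USD
    total = num_questions * runs
    return (round(low * total, 2), round(high * total, 2))


single_run = estimate_run_cost(100)          # one run over 100 questions
iteration = estimate_run_cost(20, runs=10)   # ten pipeline changes, 20 questions each
print(single_run, iteration)
```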

What Good Looks Like

For a production RAG system serving real users:

Metric	Minimum Acceptable	Target	Excellent
Faithfulness	0.80	0.90	> 0.95
Answer Relevancy	0.75	0.85	> 0.90
Context Precision	0.60	0.75	> 0.85

If any metric is below the minimum threshold, do not ship. Fix the indicated failure mode first.
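
The table and the shipping rule translate directly into a check you could run at the end of an evaluation script. This is a sketch: `grade` and `ready_to_ship` are hypothetical helpers, the thresholds are copied from the table above, and treating a score exactly on a boundary as belonging to the higher band is an assumption.

```python
# (minimum acceptable, target, excellent) per metric, copied from the table above.
THRESHOLDS = {
    "faithfulness": (0.80, 0.90, 0.95),
    "answer_relevancy": (0.75, 0.85, 0.90),
    "context_precision": (0.60, 0.75, 0.85),
}


def grade(metric: str, score: float) -> str:
    """Place a score in one of the table's bands."""
    minimum, target, excellent = THRESHOLDS[metric]
    if score < minimum:
        return "below minimum"
    if score < target:
        return "acceptable"
    if score <= excellent:
        return "target"
    return "excellent"  # the table defines excellent as strictly above the last value


def ready_to_ship(scores: dict[str, float]) -> bool:
    """True only if no metric falls below its minimum acceptable score."""
    return all(grade(metric, score) != "below minimum" for metric, score in scores.items())


# The example scores printed in "Running the RAGAS Evaluation".
example = {"faithfulness": 0.71, "answer_relevancy": 0.82, "context_precision": 0.58}
for metric, score in example.items():
    print(f"{metric}: {score:.2f} ({grade(metric, score)})")
print("ship?", ready_to_ship(example))
```

In a CI job, the same check can fail the build by exiting with a non-zero status whenever `ready_to_ship` returns False.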

In the capstone project, you’ll see a baseline system and an optimized system side by side, with RAGAS scores showing the improvement quantitatively. The next lesson covers three advanced techniques that push scores above the target thresholds.
