Re-Ranking: Improve Precision After Retrieval
Cross-encoder re-ranking to surface the most relevant chunks
The Precision Problem
Your retriever returns the top 10 chunks. They’re all vaguely related to the query. But only 2 of them contain the actual answer — and they’re ranked at positions 4 and 7.
You set k=4 to limit how much context you send to the LLM, so you get chunks 1 through 4. That includes the relevant chunk at position 4, but you miss the one at position 7 entirely. The LLM reads one good chunk and three loosely relevant ones, and produces a mediocre answer.
This is the precision problem. Your retriever is good at recall (finding all the relevant chunks somewhere in the top 10), but poor at precision (putting the most relevant chunks at the top).
Re-ranking is the solution.
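To make the scenario above concrete, here is a minimal sketch that computes precision@k and recall@k for ten retrieved chunks where only positions 4 and 7 are relevant. The chunk IDs and helper names are illustrative, not part of any library.

```python
def precision_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top-k results."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


# Ten retrieved chunks, named by rank; the relevant ones sit at positions 4 and 7.
ranked = [f"chunk_{i}" for i in range(1, 11)]
relevant = {"chunk_4", "chunk_7"}

print(precision_at_k(ranked, relevant, k=4))  # 0.25
print(recall_at_k(ranked, relevant, k=4))     # 0.5
print(recall_at_k(ranked, relevant, k=10))    # 1.0
```

Recall at 10 is perfect, yet only one of the four chunks sent to the LLM is relevant. That gap is what re-ranking targets.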
The Scout and the Judge
Here’s an analogy for the architecture.
The Retriever (bi-encoder) is a fast scout. The scout runs ahead and quickly identifies 10 candidates from a crowd of 50,000. The scout uses a simple heuristic: does this person look like who we’re searching for? The scout processes each person in isolation, working from a summary prepared in advance rather than studying them alongside the query. It’s fast because every chunk was embedded once at indexing time, so a search only has to encode the query and compare vectors. But it’s approximate: the scout can’t do a detailed comparison.
The Re-ranker (cross-encoder) is a thorough judge. The judge sits down with the query and each candidate side by side. For each pair (query, chunk), the judge reads both together and produces a relevance score. (Some models emit a 0-1 probability, others an unbounded logit; for re-ranking, only the ordering matters.) The judge takes much longer per candidate but produces far more accurate scores. The judge can catch subtle distinctions the scout missed, because the judge sees the query and document simultaneously, in context.
The two-stage pipeline:
- Retrieve 20-50 candidates quickly (recall phase)
- Re-rank those candidates precisely (precision phase)
- Pass only the top k=4 to the LLM (context injection)
You get the recall of a large k with the precision of a much smaller k.
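The three stages can be sketched without any ML libraries. In the example below, `word_overlap` and `phrase_match` are hypothetical toy scorers standing in for the bi-encoder and the cross-encoder. They are not real models, only cheap and slightly-less-cheap functions that show the control flow.

```python
from typing import Callable


def two_stage_search(
    query: str,
    corpus: list[str],
    fast_score: Callable[[str, str], float],     # stand-in for the scout
    precise_score: Callable[[str, str], float],  # stand-in for the judge
    n_candidates: int = 20,
    top_k: int = 4,
) -> list[str]:
    # Stage 1: cheap scoring over the whole corpus, keep a wide candidate pool
    candidates = sorted(corpus, key=lambda c: fast_score(query, c), reverse=True)
    candidates = candidates[:n_candidates]
    # Stage 2: expensive scoring, applied to the candidates only
    reranked = sorted(candidates, key=lambda c: precise_score(query, c), reverse=True)
    # Stage 3: only the best top_k are passed on to the LLM
    return reranked[:top_k]


def word_overlap(query: str, chunk: str) -> float:
    """Toy scout: share of query words that appear anywhere in the chunk."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q)


def phrase_match(query: str, chunk: str) -> float:
    """Toy judge: rewards the exact phrase, falls back to half the overlap."""
    if query.lower() in chunk.lower():
        return 1.0
    return word_overlap(query, chunk) / 2


corpus = [
    "limits on the api rate are discussed elsewhere",
    "the api rate limits are 100 requests per minute",
    "rate your experience with our api",
    "billing and invoices",
]
top = two_stage_search(
    "api rate limits", corpus, word_overlap, phrase_match, n_candidates=3, top_k=1
)
print(top)  # ['the api rate limits are 100 requests per minute']
```

The toy scout gives the first two chunks the same score, so the less useful one stays on top. The toy judge reads query and chunk together and promotes the chunk that actually answers the question.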
Bi-Encoder vs. Cross-Encoder: The Technical Difference
Understanding why cross-encoders are more accurate requires understanding how they process input differently.
Bi-encoder (what your embedding model does):
Query → Encoder → Query vector
                                 } cosine_similarity() → score
Chunk → Encoder → Chunk vector

The query and chunk are encoded independently. Their interaction only happens when you compare the resulting vectors. The encoder cannot model the relationship between the query and the specific chunk.
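As a rough illustration of the bi-encoder flow, the sketch below scores precomputed chunk vectors against a query vector with cosine similarity. The 3-dimensional vectors are made up for readability; real embedding models produce hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical chunk embeddings, computed once at indexing time.
# The encoder never saw the query when it produced these.
chunk_vectors = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.6, 0.8, 0.0],
    "chunk_c": [0.0, 0.0, 1.0],
}

# At query time only the query is encoded; scoring is one vector comparison per chunk.
query_vector = [0.8, 0.6, 0.0]

ranked = sorted(
    chunk_vectors,
    key=lambda name: cosine_similarity(query_vector, chunk_vectors[name]),
    reverse=True,
)
print(ranked)  # ['chunk_b', 'chunk_a', 'chunk_c']
```

All of the query-chunk interaction is squeezed into that single similarity number, which is why the comparison is fast but coarse.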
Cross-encoder (what the re-ranker does):
[Query + Chunk] → Single Encoder → Relevance score

The query and chunk are concatenated and fed through a single encoder together. The attention mechanism in the transformer can now model every token in the query attending to every token in the chunk. This captures fine-grained relevance that bi-encoders miss.
The trade-off: cross-encoding requires running inference for every (query, chunk) pair. With 20 candidates, you run the model 20 times. You cannot pre-compute these scores at indexing time. This is why you only apply it to a small candidate set, not your full index.
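A back-of-the-envelope estimate shows why the cross-encoder is reserved for the candidate set. It assumes cost scales linearly with the number of pairs, and it borrows the ~50ms per 20 documents figure this lesson quotes for ms-marco-MiniLM-L-6-v2. Real timings depend on hardware, batch size, and chunk length.

```python
MS_PER_20_PAIRS = 50  # approximate figure quoted in this lesson; an assumption, not a benchmark
MS_PER_PAIR = MS_PER_20_PAIRS / 20


def rerank_cost_seconds(num_pairs: int) -> float:
    """Estimated cross-encoder time, assuming cost grows linearly with pair count."""
    return num_pairs * MS_PER_PAIR / 1000


print(rerank_cost_seconds(20))      # 0.05  -> re-ranking 20 candidates
print(rerank_cost_seconds(50_000))  # 125.0 -> scoring a 50,000-chunk index per query
```

Under these assumptions, scoring every chunk would take about two minutes per query, while re-ranking 20 candidates costs about 50ms.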
Implementing Re-Ranking
# pip install sentence-transformers
from sentence_transformers import CrossEncoder
from langchain_core.documents import Document


class CrossEncoderReranker:
    def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        """
        Initialize with a cross-encoder model.

        Good models for re-ranking:
        - cross-encoder/ms-marco-MiniLM-L-6-v2: Fast, good quality, ~22MB
        - cross-encoder/ms-marco-MiniLM-L-12-v2: Slower, better quality, ~33MB
        - cross-encoder/ms-marco-electra-base: Best quality, ~267MB, slower
        - BAAI/bge-reranker-v2-m3: Multilingual, competitive with commercial
        """
        self.model = CrossEncoder(model_name)
        print(f"Loaded re-ranker: {model_name}")

    def rerank(
        self,
        query: str,
        documents: list[Document],
        top_k: int = 4
    ) -> list[tuple[Document, float]]:
        """
        Re-rank documents by relevance to query.

        Returns list of (document, score) tuples, sorted by score descending.
        """
        if not documents:
            return []

        # Create (query, document) pairs for the cross-encoder
        pairs = [(query, doc.page_content) for doc in documents]

        # Score all pairs simultaneously (batched inference)
        scores = self.model.predict(pairs)

        # Combine documents with their scores
        scored_docs = list(zip(documents, scores))

        # Sort by score descending, return top k
        scored_docs.sort(key=lambda x: x[1], reverse=True)
        return scored_docs[:top_k]

Integrating Re-Ranking into Your RAG Pipeline
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Setup
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
reranker = CrossEncoderReranker()
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = ChatPromptTemplate.from_template("""
Answer based ONLY on the context. Cite the source document.
If the answer isn't in the context, say so.
Context:
{context}
Question: {question}
""")


def rag_with_reranking(query: str) -> str:
    # Stage 1: Retrieve many candidates (high recall)
    candidates = vectorstore.similarity_search(query, k=20)

    # Stage 2: Re-rank for precision
    reranked = reranker.rerank(query, candidates, top_k=4)

    # Stage 3: Format context with scores for transparency
    context_parts = []
    for doc, score in reranked:
        context_parts.append(
            f"[Relevance: {score:.2f}] [{doc.metadata.get('source')}]\n"
            f"{doc.page_content}"
        )
    context = "\n\n---\n\n".join(context_parts)

    # Stage 4: Generate
    chain = prompt | llm | StrOutputParser()
    return chain.invoke({"context": context, "question": query})


answer = rag_with_reranking("What are the rate limits for the API?")
print(answer)

The Benchmark: Before and After Re-Ranking
Let’s measure the impact concretely. Using 50 queries across a technical documentation corpus:
import time
from typing import Callable


def evaluate_retrieval_precision(
    queries_with_relevant_docs: list[tuple[str, set[str]]],
    retrieve_fn: Callable,
    k: int = 4
) -> dict:
    """
    Evaluate precision@k: what fraction of top-k results are relevant?

    queries_with_relevant_docs: list of (query, set_of_relevant_doc_ids)
    """
    precisions = []
    for query, relevant_ids in queries_with_relevant_docs:
        # perf_counter is a monotonic clock, better suited to timing than time.time()
        start = time.perf_counter()
        results = retrieve_fn(query, k=k)
        latency = (time.perf_counter() - start) * 1000

        retrieved_ids = {doc.metadata.get('chunk_id') for doc in results}
        hits = len(retrieved_ids & relevant_ids)
        precision = hits / k
        precisions.append({
            "precision": precision,
            "latency_ms": latency
        })

    avg_precision = sum(p["precision"] for p in precisions) / len(precisions)
    avg_latency = sum(p["latency_ms"] for p in precisions) / len(precisions)
    return {"precision_at_k": avg_precision, "avg_latency_ms": avg_latency}


# Compare without and with re-ranking
def retrieve_no_rerank(query: str, k: int) -> list[Document]:
    return vectorstore.similarity_search(query, k=k)


def retrieve_with_rerank(query: str, k: int) -> list[Document]:
    candidates = vectorstore.similarity_search(query, k=20)
    reranked = reranker.rerank(query, candidates, top_k=k)
    return [doc for doc, score in reranked]


# test_queries is your labeled evaluation set: a list of
# (query, set_of_relevant_chunk_ids) tuples, defined elsewhere.
baseline = evaluate_retrieval_precision(test_queries, retrieve_no_rerank, k=4)
reranked = evaluate_retrieval_precision(test_queries, retrieve_with_rerank, k=4)

print(f"Without re-ranking: Precision@4 = {baseline['precision_at_k']:.2f}, "
      f"Latency = {baseline['avg_latency_ms']:.0f}ms")
print(f"With re-ranking: Precision@4 = {reranked['precision_at_k']:.2f}, "
      f"Latency = {reranked['avg_latency_ms']:.0f}ms")

# Typical output:
# Without re-ranking: Precision@4 = 0.51, Latency = 8ms
# With re-ranking: Precision@4 = 0.74, Latency = 89ms

The pattern you’ll typically see in practice: re-ranking improves precision@4 by roughly 20-30 percentage points. The latency increase is roughly 50-200ms, depending on the re-ranker model and candidate pool size.
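Since the added latency depends on the candidate pool size, it can help to time a few pool sizes before settling on one. The sketch below is one way to do that. `fake_retrieve_and_rerank` is a hypothetical stand-in that sleeps 2ms per candidate; in a real run you would pass a function that retrieves `pool_size` candidates and re-ranks them.

```python
import time
from typing import Callable


def sweep_pool_sizes(
    retrieve_fn: Callable[[str, int], list],
    query: str,
    pool_sizes: list[int],
) -> dict[int, float]:
    """Time retrieve_fn(query, pool_size) once per pool size; returns latency in ms."""
    timings: dict[int, float] = {}
    for pool_size in pool_sizes:
        start = time.perf_counter()
        retrieve_fn(query, pool_size)
        timings[pool_size] = (time.perf_counter() - start) * 1000
    return timings


# Hypothetical stand-in for "retrieve pool_size candidates, then re-rank them".
# It sleeps 2 ms per candidate to mimic per-pair cross-encoder cost.
def fake_retrieve_and_rerank(query: str, pool_size: int) -> list:
    time.sleep(0.002 * pool_size)
    return []


timings = sweep_pool_sizes(fake_retrieve_and_rerank, "rate limits", [5, 20])
for pool_size, latency_ms in timings.items():
    print(f"pool={pool_size:>3}  latency={latency_ms:.0f}ms")
```

A single timing per pool size is noisy. Averaging over the full query set, as `evaluate_retrieval_precision` does, gives a steadier picture.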
Using LangChain’s Built-in Re-Ranker Integration
LangChain has a ContextualCompressionRetriever that wraps re-ranking:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker as LCReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Load the cross-encoder model
model = HuggingFaceCrossEncoder(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2")

# Create the re-ranker compressor
compressor = LCReranker(model=model, top_n=4)

# Base retriever: returns many candidates
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

# Compression retriever: base retrieval + re-ranking
reranking_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever
)

# Use it exactly like any other retriever
results = reranking_retriever.invoke("How do I configure SSO with Okta?")
for doc in results:
    print(doc.page_content[:200])

This integrates cleanly with LangChain chains because it follows the same Retriever interface.
When NOT to Use Re-Ranking
Re-ranking adds 50-200ms per query. This is acceptable for most applications. But there are scenarios where it’s not worth it:
Don’t re-rank when:
- Your retriever’s precision is already > 0.80 (measure before adding complexity)
- Latency < 100ms is a hard requirement (real-time autocomplete, voice assistants)
- Your k is already small (k=2 or k=3) and candidates are already high-quality
- Your corpus is small (<10K chunks) where vector search already works excellently
Do re-rank when:
- Your corpus is large and diverse (>50K chunks)
- Queries vary widely — both semantic and exact-term queries
- Wrong answers have high cost (legal, medical, customer-facing)
- Your baseline precision@k is below 0.60
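The two checklists above can be folded into a small decision helper. The thresholds come straight from the lists; the order in which the rules are applied is an assumption of this sketch, and the `RetrievalProfile` fields are illustrative names. The "k is already small" and "queries vary widely" criteria are left out because they are harder to reduce to a single number.

```python
from dataclasses import dataclass


@dataclass
class RetrievalProfile:
    baseline_precision_at_k: float  # measured without re-ranking
    latency_budget_ms: float        # hard latency requirement for the whole request
    corpus_chunks: int
    high_cost_of_error: bool = False


def should_rerank(p: RetrievalProfile) -> bool:
    # A hard latency requirement under 100ms rules re-ranking out
    if p.latency_budget_ms < 100:
        return False
    # Costly mistakes (legal, medical, customer-facing) justify the extra step
    if p.high_cost_of_error:
        return True
    # Already precise enough: skip the added complexity
    if p.baseline_precision_at_k > 0.80:
        return False
    # Weak baseline precision, or a large and diverse corpus: re-rank
    if p.baseline_precision_at_k < 0.60 or p.corpus_chunks > 50_000:
        return True
    # Middle ground: no strong signal either way, so default to not re-ranking
    return False


docs_search = RetrievalProfile(
    baseline_precision_at_k=0.51, latency_budget_ms=500, corpus_chunks=120_000
)
autocomplete = RetrievalProfile(
    baseline_precision_at_k=0.51, latency_budget_ms=80, corpus_chunks=120_000
)
print(should_rerank(docs_search))   # True
print(should_rerank(autocomplete))  # False
```

Treat the output as a starting point; the benchmark from the previous section is still the deciding evidence.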
Model Selection for Re-Ranking
| Model | Size | Speed | Quality | Use Case |
|---|---|---|---|---|
| ms-marco-MiniLM-L-6-v2 | 22MB | ~50ms/20docs | Good | Production default |
| ms-marco-MiniLM-L-12-v2 | 33MB | ~80ms/20docs | Better | Higher quality budget |
| ms-marco-electra-base | 267MB | ~200ms/20docs | Best | Max quality, slower |
| BAAI/bge-reranker-v2-m3 | 568MB | ~150ms/20docs | Excellent + multilingual | Non-English content |
For most production RAG systems, ms-marco-MiniLM-L-6-v2 is the right choice: small model, loads fast, runs on CPU, and delivers strong quality improvements at ~50ms per re-ranking pass.
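If the re-ranker is chosen from a latency budget, the table can be turned into a small lookup. The latency figures below are copied from the table and are approximate. The quality ordering (Best, then Better, then Good) and the `pick_reranker` helper are assumptions of this sketch, not part of any library.

```python
from typing import Optional

# (model name, approx. ms per 20 documents), ordered from highest to lowest quality
ENGLISH_RERANKERS = [
    ("cross-encoder/ms-marco-electra-base", 200),
    ("cross-encoder/ms-marco-MiniLM-L-12-v2", 80),
    ("cross-encoder/ms-marco-MiniLM-L-6-v2", 50),
]
MULTILINGUAL_RERANKER = ("BAAI/bge-reranker-v2-m3", 150)


def pick_reranker(budget_ms: float, multilingual: bool = False) -> Optional[str]:
    """Return the highest-quality model whose latency fits the budget, or None."""
    if multilingual:
        name, latency_ms = MULTILINGUAL_RERANKER
        return name if latency_ms <= budget_ms else None
    for name, latency_ms in ENGLISH_RERANKERS:
        if latency_ms <= budget_ms:
            return name
    return None


print(pick_reranker(60))    # cross-encoder/ms-marco-MiniLM-L-6-v2
print(pick_reranker(100))   # cross-encoder/ms-marco-MiniLM-L-12-v2
print(pick_reranker(30))    # None
```

A `None` result means no model in the table fits the budget, which usually points back to the "don’t re-rank" cases above.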
Summary
Re-ranking is a high-value, moderate-complexity addition to your RAG pipeline:
- Retrieve broadly (k=20) for high recall — you want the right chunks somewhere in the candidate set
- Re-rank precisely (top_k=4) using a cross-encoder that reads query + chunk together
- Generate with only the highest-quality, most precisely relevant chunks
The result is a system where the LLM far more often has what it needs in context, and less often has to work with mediocre or irrelevant chunks.
In the next lesson, we’ll learn how to measure all of this quantitatively using RAGAS, so you can prove to yourself (and your team) that each optimization actually improves quality.
