📖 Lesson ⏱️ 90 minutes

Re-Ranking: Improve Precision After Retrieval

Cross-encoder re-ranking to surface the most relevant chunks

The Precision Problem

Your retriever returns the top 10 chunks. They’re all vaguely related to the query. But only 2 of them contain the actual answer — and they’re ranked at positions 4 and 7.

You set k=4 to limit how much context you send to the LLM, so you get chunks 1 through 4. That includes the relevant chunk at position 4, but you miss the one at position 7 entirely. The LLM reads one strong chunk and three loosely relevant ones, and produces a mediocre answer.

This is the precision problem. Your retriever is good at recall (finding all the relevant chunks somewhere in the top 10), but poor at precision (putting the most relevant chunks at the top).
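
To put numbers on the scenario above, here is a minimal sketch that computes precision@k for a ranked list in which only positions 4 and 7 are relevant. The helper name `precision_at_k` and the chunk IDs are illustrative assumptions, not part of any library.

```python
def precision_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    top_k = ranked_ids[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant_ids)
    return hits / k


# Hypothetical retriever output: 10 chunks, only positions 4 and 7 hold the answer.
retrieved = [f"chunk_{i}" for i in range(1, 11)]
relevant = {"chunk_4", "chunk_7"}

before = precision_at_k(retrieved, relevant, k=4)

# An ideal re-ranker would move both relevant chunks to the top.
ideal_order = ["chunk_4", "chunk_7"] + [c for c in retrieved if c not in relevant]
after = precision_at_k(ideal_order, relevant, k=4)

print(f"precision@4 before: {before:.2f}, after: {after:.2f}")
# prints: precision@4 before: 0.25, after: 0.50
```

With only two relevant chunks in the whole candidate set, 0.50 is the best precision@4 any ordering can reach.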

Re-ranking is the solution.


The Scout and the Judge

Here’s an analogy for the two-stage architecture.

The Retriever (bi-encoder) is a fast scout. The scout runs ahead and quickly identifies 10 candidates from a crowd of 50,000. The scout uses a simple heuristic: does this person look like who we’re searching for? The scout sizes up each person in isolation, working from a quick sketch made ahead of time rather than examining the person side by side with the full description. That is why it’s fast: the sketches (chunk embeddings) are prepared in advance and compared cheaply. But it’s approximate, because the scout can’t do a detailed comparison.

The Re-ranker (cross-encoder) is a thorough judge. The judge sits down with the query and each candidate side by side. For each pair (query, chunk), the judge reads both together and assigns a single relevance score (some models emit a 0-1 probability, others a raw score where higher means more relevant). The judge takes much longer per candidate but produces much more accurate scores. The judge can catch subtle distinctions the scout missed, because the judge sees the query and document simultaneously, in context.

The two-stage pipeline:

  1. Retrieve 20-50 candidates quickly (recall phase)
  2. Re-rank those candidates precisely (precision phase)
  3. Pass only the top k=4 to the LLM (context injection)

You get the recall of a large k with the precision of a much smaller k.
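
The three steps above can be sketched without any model at all. The example below is a toy under stated assumptions: `cheap_score` (keyword counting) stands in for the bi-encoder, `precise_score` (keyword counting plus an exact-phrase bonus) stands in for the cross-encoder, and the corpus is made up. Only the control flow mirrors the real pipeline.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def two_stage_search(
    query: str,
    corpus: list[str],
    cheap_score: Scorer,
    precise_score: Scorer,
    n_candidates: int = 20,
    top_k: int = 4,
) -> list[str]:
    # Stage 1 (recall): rank the whole corpus with the cheap scorer, keep a wide pool.
    candidates = sorted(corpus, key=lambda c: cheap_score(query, c), reverse=True)
    candidates = candidates[:n_candidates]
    # Stage 2 (precision): apply the expensive scorer to the small pool only.
    reranked = sorted(candidates, key=lambda c: precise_score(query, c), reverse=True)
    # Stage 3: return only the top_k chunks for the LLM context.
    return reranked[:top_k]


def cheap_score(query: str, chunk: str) -> float:
    # Toy stand-in for a bi-encoder: counts occurrences of query words.
    words = chunk.lower().split()
    return float(sum(words.count(w) for w in set(query.lower().split())))


def precise_score(query: str, chunk: str) -> float:
    # Toy stand-in for a cross-encoder: rewards the exact query phrase.
    bonus = 10.0 if query.lower() in chunk.lower() else 0.0
    return cheap_score(query, chunk) + bonus


corpus = [
    "API keys, API tokens, and API clients for the billing API",
    "The API rate limit is 100 requests per minute",
    "How to rotate a key",
    "Rate limiting strategies",
]
query = "API rate limit"

stage_one_top = max(corpus, key=lambda c: cheap_score(query, c))
result = two_stage_search(query, corpus, cheap_score, precise_score,
                          n_candidates=3, top_k=1)
print("Cheap scorer's favorite:", stage_one_top)
print("After re-ranking:", result[0])
```

The cheap scorer prefers the keyword-stuffed chunk, while the second stage promotes the chunk that actually answers the query. It can only do so because stage 1 kept that chunk in the candidate pool.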


Bi-Encoder vs. Cross-Encoder: The Technical Difference

Understanding why cross-encoders are more accurate requires understanding how they process input differently.

Bi-encoder (what your embedding model does):

Query  → Encoder → Query vector
                              } cosine_similarity() → score
Chunk  → Encoder → Chunk vector

The query and chunk are encoded independently. Their interaction only happens when you compare the resulting vectors. The encoder cannot model the relationship between the query and the specific chunk.
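
As a minimal sketch of what "encoded independently" means in practice, the example below scores a query against chunk vectors that already exist. The 3-dimensional vectors are hand-written stand-ins for real embeddings, and the chunk names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Chunk vectors are computed once at indexing time, before any query exists.
# (Made-up values; a real embedding model would produce these.)
chunk_vectors = {
    "rate_limits": [0.9, 0.1, 0.0],
    "sso_setup": [0.0, 0.2, 0.9],
}

# At query time only the query is encoded; scoring is a cheap vector comparison.
query_vector = [1.0, 0.0, 0.0]

scores = {
    name: cosine_similarity(query_vector, vec)
    for name, vec in chunk_vectors.items()
}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Nothing in `chunk_vectors` depends on the query, which is what makes the comparison fast and also what limits its accuracy.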

Cross-encoder (what the re-ranker does):

[Query + Chunk] → Single Encoder → Relevance score

The query and chunk are concatenated and fed through a single encoder together. The attention mechanism in the transformer can now model every token in the query attending to every token in the chunk. This captures fine-grained relevance that bi-encoders miss.

The trade-off: cross-encoding requires running inference for every (query, chunk) pair. With 20 candidates, that is 20 forward passes (usually batched together). Because each score depends on the query, none of them can be pre-computed at indexing time. This is why you only apply it to a small candidate set, not your full index.
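
A rough back-of-the-envelope calculation makes the trade-off concrete. It assumes a per-pair cost derived from the ~50ms per 20 documents quoted for the smallest model in the model table later in this lesson, and it assumes cost scales linearly with the number of pairs; real numbers depend on hardware, batching, and chunk length.

```python
def rerank_cost_ms(num_pairs: int, ms_per_pair: float) -> float:
    """Estimated cross-encoder time, assuming cost grows linearly with pairs."""
    return num_pairs * ms_per_pair


# Assumption: ~50 ms for 20 pairs, i.e. 2.5 ms per (query, chunk) pair.
MS_PER_PAIR = 50 / 20

candidate_set_ms = rerank_cost_ms(20, MS_PER_PAIR)
full_index_ms = rerank_cost_ms(50_000, MS_PER_PAIR)

print(f"20 candidates: {candidate_set_ms:.0f} ms")
print(f"50,000 chunks: {full_index_ms / 1000:.0f} s")
# prints:
# 20 candidates: 50 ms
# 50,000 chunks: 125 s
```

Under these assumptions, scoring a 50,000-chunk index with the cross-encoder would take about two minutes per query, which is why it only ever sees the retriever's shortlist.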


Implementing Re-Ranking

# pip install sentence-transformers
from sentence_transformers import CrossEncoder
from langchain_core.documents import Document

class CrossEncoderReranker:
    def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        """
        Initialize with a cross-encoder model.
        
        Good models for re-ranking:
        - cross-encoder/ms-marco-MiniLM-L-6-v2: Fast, good quality, ~22MB
        - cross-encoder/ms-marco-MiniLM-L-12-v2: Slower, better quality, ~33MB
        - cross-encoder/ms-marco-electra-base: Best quality, ~267MB, slower
        - BAAI/bge-reranker-v2-m3: Multilingual, competitive with commercial
        """
        self.model = CrossEncoder(model_name)
        print(f"Loaded re-ranker: {model_name}")
    
    def rerank(
        self, 
        query: str, 
        documents: list[Document], 
        top_k: int = 4
    ) -> list[tuple[Document, float]]:
        """
        Re-rank documents by relevance to query.
        Returns list of (document, score) tuples, sorted by score descending.
        """
        if not documents:
            return []
        
        # Create (query, document) pairs for the cross-encoder
        pairs = [(query, doc.page_content) for doc in documents]
        
        # Score all pairs simultaneously (batched inference)
        scores = self.model.predict(pairs)
        
        # Combine documents with their scores
        scored_docs = list(zip(documents, scores))
        
        # Sort by score descending, return top k
        scored_docs.sort(key=lambda x: x[1], reverse=True)
        return scored_docs[:top_k]
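
Depending on the model you load, the scores returned by `predict` may be raw, unbounded values rather than numbers between 0 and 1; check this for your model. If you want to display or threshold scores on a 0-1 scale, one option is to pass them through a sigmoid, which leaves the ranking unchanged. `to_unit_interval` is a hypothetical helper, not part of sentence-transformers, and the input values are made up.

```python
import math


def to_unit_interval(raw_scores: list[float]) -> list[float]:
    """Map raw scores to the open interval (0, 1) with a sigmoid."""
    return [1.0 / (1.0 + math.exp(-s)) for s in raw_scores]


raw = [8.6, 0.0, -4.3]  # made-up raw scores, highest first
normalized = to_unit_interval(raw)

for r, n in zip(raw, normalized):
    print(f"raw={r:+.1f} -> normalized={n:.3f}")
```

Because the sigmoid is monotonic, sorting by the normalized scores gives the same order as sorting by the raw ones.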

Integrating Re-Ranking into Your RAG Pipeline

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Setup
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
reranker = CrossEncoderReranker()
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = ChatPromptTemplate.from_template("""
Answer based ONLY on the context. Cite the source document.
If the answer isn't in the context, say so.

Context:
{context}

Question: {question}
""")

def rag_with_reranking(query: str) -> str:
    # Stage 1: Retrieve many candidates (high recall)
    candidates = vectorstore.similarity_search(query, k=20)
    
    # Stage 2: Re-rank for precision
    reranked = reranker.rerank(query, candidates, top_k=4)
    
    # Stage 3: Format context with scores for transparency
    context_parts = []
    for doc, score in reranked:
        context_parts.append(
            f"[Relevance: {score:.2f}] [{doc.metadata.get('source')}]\n"
            f"{doc.page_content}"
        )
    context = "\n\n---\n\n".join(context_parts)
    
    # Stage 4: Generate
    chain = prompt | llm | StrOutputParser()
    return chain.invoke({"context": context, "question": query})

answer = rag_with_reranking("What are the rate limits for the API?")
print(answer)

The Benchmark: Before and After Re-Ranking

Let’s measure the impact concretely, using 50 queries across a technical documentation corpus:

import time
from typing import Callable

def evaluate_retrieval_precision(
    queries_with_relevant_docs: list[tuple[str, set[str]]],
    retrieve_fn: Callable,
    k: int = 4
) -> dict:
    """
    Evaluate precision@k: what fraction of top-k results are relevant?
    queries_with_relevant_docs: list of (query, set_of_relevant_doc_ids)
    """
    precisions = []
    
    for query, relevant_ids in queries_with_relevant_docs:
        start = time.perf_counter()  # monotonic, high-resolution timer
        results = retrieve_fn(query, k=k)
        latency = (time.perf_counter() - start) * 1000
        
        retrieved_ids = {doc.metadata.get('chunk_id') for doc in results}
        hits = len(retrieved_ids & relevant_ids)
        precision = hits / k
        
        precisions.append({
            "precision": precision,
            "latency_ms": latency
        })
    
    avg_precision = sum(p["precision"] for p in precisions) / len(precisions)
    avg_latency = sum(p["latency_ms"] for p in precisions) / len(precisions)
    
    return {"precision_at_k": avg_precision, "avg_latency_ms": avg_latency}

# Compare without and with re-ranking
def retrieve_no_rerank(query: str, k: int) -> list[Document]:
    return vectorstore.similarity_search(query, k=k)

def retrieve_with_rerank(query: str, k: int) -> list[Document]:
    candidates = vectorstore.similarity_search(query, k=20)
    reranked = reranker.rerank(query, candidates, top_k=k)
    return [doc for doc, score in reranked]

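# test_queries is your own labeled set: a list of (query, set_of_relevant_chunk_ids)
# pairs, where the IDs match the 'chunk_id' metadata stored on each chunk.
baseline = evaluate_retrieval_precision(test_queries, retrieve_no_rerank, k=4)
reranked = evaluate_retrieval_precision(test_queries, retrieve_with_rerank, k=4)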

print(f"Without re-ranking: Precision@4 = {baseline['precision_at_k']:.2f}, "
      f"Latency = {baseline['avg_latency_ms']:.0f}ms")
print(f"With re-ranking:    Precision@4 = {reranked['precision_at_k']:.2f}, "
      f"Latency = {reranked['avg_latency_ms']:.0f}ms")

# Typical output:
# Without re-ranking: Precision@4 = 0.51, Latency = 8ms
# With re-ranking:    Precision@4 = 0.74, Latency = 89ms

The pattern you’ll see in practice: re-ranking improves precision@4 by roughly 20-30 percentage points (0.51 to 0.74 in the output above). The added latency is roughly 50-200ms, depending on the re-ranker model, the hardware, and the candidate pool size.


Using LangChain’s Built-in Re-Ranker Integration

LangChain has a ContextualCompressionRetriever that wraps re-ranking (the import paths below may differ depending on your LangChain version):

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker as LCReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Load the cross-encoder model
model = HuggingFaceCrossEncoder(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2")

# Create the re-ranker compressor
compressor = LCReranker(model=model, top_n=4)

# Base retriever: returns many candidates
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

# Compression retriever: base retrieval + re-ranking
reranking_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever
)

# Use it exactly like any other retriever
results = reranking_retriever.invoke("How do I configure SSO with Okta?")
for doc in results:
    print(doc.page_content[:200])

This integrates cleanly with LangChain chains because it follows the same Retriever interface.


When NOT to Use Re-Ranking

Re-ranking adds 50-200ms per query. This is acceptable for most applications. But there are scenarios where it’s not worth it:

Don’t re-rank when:

  • Your retriever’s precision is already > 0.80 (measure first before adding complexity)
  • Latency < 100ms is a hard requirement (real-time autocomplete, voice assistants)
  • Your k is already small (k=2 or k=3) and candidates are already high-quality
  • Your corpus is small (<10K chunks) where vector search already works excellently

Do re-rank when:

  • Your corpus is large and diverse (>50K chunks)
  • Queries vary widely — both semantic and exact-term queries
  • Wrong answers have high cost (legal, medical, customer-facing)
  • Your baseline precision@k is below 0.60
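
The two checklists above can be folded into a small decision helper. This is a sketch, not a rule: the thresholds come from the lists, but the function name `should_rerank`, its parameters, and the order in which the checks are applied are assumptions made for illustration.

```python
def should_rerank(
    baseline_precision: float,
    latency_budget_ms: float,
    corpus_size: int,
    high_cost_errors: bool = False,
) -> tuple[bool, str]:
    """Return (decision, reason) based on the lesson's rules of thumb."""
    # Hard constraint first: a sub-100 ms budget leaves no room for a re-ranker.
    if latency_budget_ms < 100:
        return False, "latency budget under 100 ms"
    if baseline_precision > 0.80:
        return False, "retriever precision already above 0.80"
    if baseline_precision < 0.60:
        return True, "baseline precision below 0.60"
    if high_cost_errors:
        return True, "wrong answers are costly"
    if corpus_size > 50_000:
        return True, "large corpus (>50K chunks)"
    if corpus_size < 10_000:
        return False, "small corpus (<10K chunks)"
    return False, "no strong signal; measure before adding complexity"


decision, reason = should_rerank(
    baseline_precision=0.51, latency_budget_ms=500, corpus_size=80_000
)
print(decision, "-", reason)
# prints: True - baseline precision below 0.60
```

The inputs assume you have already measured baseline precision@k, which is the first item on the "don't re-rank" list.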

Model Selection for Re-Ranking

Model                      Size     Speed             Quality                     Use Case
ms-marco-MiniLM-L-6-v2     22MB     ~50ms/20 docs     Good                        Production default
ms-marco-MiniLM-L-12-v2    33MB     ~80ms/20 docs     Better                      Higher quality budget
ms-marco-electra-base      267MB    ~200ms/20 docs    Best                        Max quality, slower
BAAI/bge-reranker-v2-m3    568MB    ~150ms/20 docs    Excellent + multilingual    Non-English content

For most production RAG systems, ms-marco-MiniLM-L-6-v2 is the right choice: small model, loads fast, runs on CPU, and delivers strong quality improvements at ~50ms per re-ranking pass.


Summary

Re-ranking is a high-value, moderate-complexity addition to your RAG pipeline:

  1. Retrieve broadly (k=20) for high recall — you want the right chunks somewhere in the candidate set
  2. Re-rank precisely (top_k=4) using a cross-encoder that reads query + chunk together
  3. Generate with only the highest-quality, most precisely relevant chunks

The result is a system where the LLM almost always has what it needs in context — and rarely has to work with mediocre or irrelevant chunks.

In the next lesson, we’ll learn how to measure all of this quantitatively using RAGAS, so you can prove to yourself (and your team) that each optimization actually improves quality.